Research

Machine Learning Techniques

MDLText

MDLText is an efficient, lightweight, and scalable multinomial text classifier. It learns incrementally and is robust to overfitting, both desirable properties in real-world applications, large-scale problems, and online scenarios.

GMDL

GMDL is a lightweight, multiclass, online classifier. Despite its probabilistic nature, it can handle continuous features. Experiments on real-world datasets with different characteristics showed that it outperformed established online classification methods and was robust to overfitting, a desirable characteristic for large, dynamic, real-world classification problems.

ML-MDLText

The ML-MDLText is an efficient and lightweight multilabel text classifier with incremental learning. It is based on the minimum description length principle and can be applied to multilabel classification without requiring the transformation of the classification problem. It takes advantage of dependency information among labels and naturally supports online learning. The results were very competitive with existing state-of-the-art online learning methods and those that transform multilabel problems into several single-label ones.

P2C – Partitioning to Classify

P2C is a technique for achieving reasonable classification performance with linear prediction models, even on datasets that are not linearly separable. Inspired by the divide-and-conquer strategy, it first partitions the samples by class and applies a clustering method to each partition. The clusters from all partitions are then merged into a single partition in which each group can contain linearly separable samples. Finally, one or more linear classifiers are trained, depending on the number of groups.
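The steps described above can be illustrated with a minimal sketch. It is not the published P2C implementation: the choice of k-means for clustering, the perceptron as the linear model, the one-classifier-per-group scheme, and all function names (`kmeans`, `train_perceptron`, `fit_p2c`, `predict_p2c`) are assumptions made for illustration. The toy data is an XOR-like layout that no single linear classifier can separate.

```python
from collections import defaultdict

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def train_perceptron(X, targets, epochs=1000):
    """Linear classifier; targets are +1 / -1. Stops once an epoch has no mistakes."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def fit_p2c(X, y, k=2):
    # Step 1: partition the samples by class and cluster each partition.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    group_of = [None] * len(X)
    group_class = []  # group id -> original class label
    for label, idxs in by_class.items():
        assign = kmeans([X[i] for i in idxs], k)
        base = len(group_class)
        group_class.extend([label] * k)
        for i, a in zip(idxs, assign):
            group_of[i] = base + a  # Step 2: merge clusters into one set of groups
    # Step 3 (assumption): one linear classifier per group, trained one-vs-rest.
    models = [
        train_perceptron(X, [1 if g == gid else -1 for g in group_of])
        for gid in range(len(group_class))
    ]
    return models, group_class

def predict_p2c(model, x):
    models, group_class = model
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in models]
    return group_class[scores.index(max(scores))]

# XOR-like toy data: class 0 sits near (0,0) and (1,1), class 1 near (0,1) and (1,0).
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
     (0.0, 1.0), (0.1, 1.0), (0.0, 0.9), (1.0, 0.0), (0.9, 0.0), (1.0, 0.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

model = fit_p2c(X, y, k=2)
predictions = [predict_p2c(model, x) for x in X]
```

Each of the four resulting groups occupies one corner of the unit square, so each is linearly separable from the other three even though the two original classes are not.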

Question Domain Classifier

This approach classifies questions into domains. A corpus of English news and Wikipedia articles was collected and preprocessed, leaving only textual content. A subset of IPTC categories was chosen, and the documents belonging to those categories were used to generate questions automatically. These questions were used to train a Multinomial Naïve Bayes model, which was evaluated on human-generated questions from the same categories as the original documents.
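The train-on-generated, test-on-human setup can be sketched with a small Multinomial Naïve Bayes written from scratch. This is only an illustration: the sample questions, the two domain labels, the whitespace tokenizer, and the function names are hypothetical, and Laplace (add-one) smoothing is an assumption rather than a documented choice of the original system.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(samples):
    """Collect per-class document counts and term counts."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for tok in tokenize(text):
            term_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, term_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        denom = sum(term_counts[label].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical stand-ins for the computer-generated training questions.
generated = [
    ("who won the football match", "sport"),
    ("which team scored the final goal", "sport"),
    ("who won the election", "politics"),
    ("which party leads the parliament", "politics"),
]
# Hypothetical stand-ins for the human-written evaluation questions.
human = [
    ("which team won the match", "sport"),
    ("who leads the party", "politics"),
]

model = train_nb(generated)
accuracy = sum(predict_nb(model, q) == d for q, d in human) / len(human)
```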

TextExpansion

Short text messages (e.g., posts in blogs, forums, and social networks) are a challenging problem for traditional learning methods. Such messages are usually very short and rife with slang, idioms, symbols, and acronyms that make even tokenization difficult. For this scenario, we designed the TextExpansion tool, which normalizes and expands the original short, noisy messages to obtain better attributes and improve classification and clustering performance.
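As a rough illustration of the normalize-then-expand idea, the sketch below maps slang tokens to standard words and then appends related terms. It does not reproduce the TextExpansion tool: the two tiny dictionaries (`SLANG`, `RELATED`) and both function names are hypothetical, and a real system would presumably rely on much larger lexical resources.

```python
import re

# Hypothetical lookup tables, for illustration only.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks", "pls": "please"}
RELATED = {"great": ["excellent"], "thanks": ["gratitude"]}

def normalize(text):
    """Lowercase, tokenize, and replace known slang with standard words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [SLANG.get(t, t) for t in tokens]

def expand(tokens):
    """Append related terms after each token that has an entry in RELATED."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(RELATED.get(t, []))
    return out

normalized = normalize("Thx, u r gr8!")
expanded = expand(normalized)
```

The expanded token list gives a downstream classifier or clustering method more attributes to work with than the four original tokens.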

Datasets

SMS Spam Collection

A public set of labeled SMS messages collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.

Sent Collection

Seven public datasets containing real and non-encoded labeled tweets collected for sentiment analysis research. Each sample was labeled as positive or negative, according to the opinion expressed concerning the object of interest.

YouTube Spam Collection

A public set of YouTube comments that have been collected for spam research. It has five datasets comprising 1,956 real and non-encoded messages labeled as legitimate (ham) or spam.

Domain Categorized Questions Collections

A public set of domain-categorized questions: a computer-generated set derived from news and Wikipedia articles, and a human-generated set for the same domains. The training dataset comprises 3,517,858 computer-generated samples based on English news and Wikipedia articles. The test dataset consists of 500 human-generated questions. Both are available for download in CSV format.

E-Tongue Sugar Collections

Public sets of labeled sugar samples collected with an electronic tongue to assess sugar quality automatically. The collection comprises two datasets: one with 190 samples in their natural form and the other with 185 samples with controlled pH. Both are real and non-encoded, with each sample tagged as Organic, VHP (Very High Polarization), or VVHP (Very Very High Polarization).

Applications

TubeSpam

TubeSpam is an automatic system that filters spam comments on YouTube. It is based on established and sophisticated computing procedures that automatically learn, detect, and block undesired comments.

Labeling

This app is designed to help you label your machine-learning classification datasets. You can upload your dataset as a CSV file and add collaborators to label your samples.

SentMiner

A tool for opinion detection in English messages, powered by ensemble methods and state-of-the-art natural language processing techniques.

Twitter Search

An online search tool built on the Twitter API, providing a search interface with specific filter options.