Research

Machine Learning Techniques

MDLText

MDLText is an efficient, lightweight, scalable, and fast multinomial text classifier. It supports fast incremental learning and is robust to overfitting, both desirable features in real-world applications, large-scale problems, and online scenarios.

GMDL

GMDL is a lightweight, multiclass, online classifier. Despite its probabilistic nature, it can handle continuous features. Experiments on real-world datasets with different characteristics showed that it outperformed established online classification methods and is robust to overfitting, a desirable characteristic for large, dynamic, real-world classification problems.

ML-MDLText

ML-MDLText is an efficient and lightweight multilabel text classifier with incremental learning. It is based on the minimum description length principle and can be applied to multilabel classification without transforming the classification problem. It takes advantage of dependency information among labels and naturally supports online learning. The results obtained were very competitive with existing state-of-the-art online learning methods and with methods that transform multilabel problems into several single-label ones.

P2C – Partitioning to Classify

P2C is a new classification technique that achieves reasonable classification performance using linear prediction models, even on datasets that are not linearly separable. The proposed technique, inspired by the divide-and-conquer strategy, applies a clustering method to each partition made of samples of the same class. Subsequently, the clusters inside each partition are merged into a single partition, where each group can contain linearly separable samples. Then, one or more linear classifiers are trained, according to the number of groups.
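
The steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the clustering method (a tiny k-means), the number of clusters per class, and the linear model (a multiclass perceptron with one weight vector per group) are all assumptions chosen for brevity, as are the function names and the XOR-like toy data.

```python
from collections import defaultdict


def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point initialisation."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(d2(p, c) for c in centroids)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: d2(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def predict_group(W, x):
    """Return the index of the group whose linear score is highest."""
    xa = list(x) + [1.0]  # append a constant feature for the bias
    scores = [sum(wi * xi for wi, xi in zip(w, xa)) for w in W]
    return max(range(len(W)), key=scores.__getitem__)


def train_perceptron(X, groups, n_groups, max_epochs=1000):
    """Multiclass perceptron: one linear scoring function per group."""
    W = [[0.0] * (len(X[0]) + 1) for _ in range(n_groups)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, g in zip(X, groups):
            pred = predict_group(W, x)
            if pred != g:
                for i, xi in enumerate(list(x) + [1.0]):
                    W[g][i] += xi
                    W[pred][i] -= xi
                mistakes += 1
        if mistakes == 0:
            break
    return W


def p2c_fit(X, y, clusters_per_class=2):
    # 1) Partition the samples by class.
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    # 2) Cluster each partition, then merge all clusters into one set of groups.
    group_X, group_ids, group_to_class = [], [], []
    for label, pts in by_class.items():
        assign = kmeans(pts, clusters_per_class)
        for j in sorted(set(assign)):
            gid = len(group_to_class)
            group_to_class.append(label)
            for p, a in zip(pts, assign):
                if a == j:
                    group_X.append(p)
                    group_ids.append(gid)
    # 3) Train linear models that separate the groups.
    W = train_perceptron(group_X, group_ids, len(group_to_class))
    return W, group_to_class


def p2c_predict(model, x):
    W, group_to_class = model
    return group_to_class[predict_group(W, x)]  # map the group back to its class


# XOR-like toy data: no single line separates class "a" from class "b".
OFFSETS = [(0.0, 0.0), (-0.1, -0.1), (0.1, -0.1), (-0.1, 0.1), (0.1, 0.1)]


def blob(cx, cy):
    return [(cx + dx, cy + dy) for dx, dy in OFFSETS]


X = blob(0, 0) + blob(1, 1) + blob(0, 1) + blob(1, 0)
y = ["a"] * 10 + ["b"] * 10
model = p2c_fit(X, y)
predictions = [p2c_predict(model, x) for x in X]
print(predictions == y)
```

In this sketch each class is split into two groups, so the linear model only has to separate four compact groups rather than two interleaved classes; the predicted group is then mapped back to its original class.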

Question Domain Classifier

This approach was built to classify questions into domains. A corpus composed of English news and Wikipedia articles was collected and preprocessed, leaving only textual content. A subset of IPTC categories was chosen, and the documents belonging to those categories were used for automatic question generation. The generated questions were used to train a Multinomial Naïve Bayes model, which was evaluated on human-generated questions belonging to the same categories as the original documents.
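
The train-on-generated, test-on-human setup described above can be sketched as follows. This is only an illustration under stated assumptions: the tiny Multinomial Naïve Bayes class with Laplace smoothing, the whitespace tokenizer, the two domain names, and all example questions are hypothetical and are not taken from the actual corpus or code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization with "?" stripped.
    return text.lower().replace("?", " ").split()


class MultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen terms
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical stand-ins for the computer-generated training questions.
generated = [
    ("who won the football match", "sport"),
    ("which team scored the final goal", "sport"),
    ("what are the symptoms of the flu", "health"),
    ("which vaccine prevents the disease", "health"),
]
# Hypothetical stand-ins for the human-generated test questions.
human = [
    ("Which team won the match?", "sport"),
    ("What vaccine treats the flu?", "health"),
]

model = MultinomialNB().fit([q for q, _ in generated], [d for _, d in generated])
predictions = [model.predict(q) for q, _ in human]
accuracy = sum(p == d for p, (_, d) in zip(predictions, human)) / len(human)
print(predictions, accuracy)
```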

TextExpansion

Short text messages (e.g., posts in blogs, forums, and social networks) are a challenging problem for traditional learning methods. Such messages are usually very short and rife with slang, idioms, symbols, and acronyms that make even tokenization a difficult task. For this scenario, we designed the TextExpansion tool, which normalizes and expands the original short and messy text messages in order to obtain better attributes and improve classification and clustering performance.
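
As a rough illustration of the normalize-then-expand idea, the sketch below uses two small lookup tables. It is not the TextExpansion tool itself: the dictionaries, the tokenizer, and the function names are hypothetical, and a real system would presumably rely on much larger lexical resources.

```python
import re

# Hypothetical normalization lexicon: slang or abbreviation -> standard form.
NORMALIZATION = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks"}

# Hypothetical expansion lexicon: term -> related terms added as extra attributes.
RELATED_TERMS = {"great": ["good"], "thanks": ["gratitude"]}


def tokenize(text):
    # Keep lowercase alphanumeric runs; punctuation and symbols are dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


def normalize(tokens):
    # Replace each token by its standard form when the lexicon knows it.
    return [NORMALIZATION.get(tok, tok) for tok in tokens]


def expand(tokens):
    # Keep every token and append its related terms, if any.
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(RELATED_TERMS.get(tok, []))
    return expanded


def text_expansion(text):
    return expand(normalize(tokenize(text)))


result = text_expansion("U r gr8, thx!!")
print(result)  # ['you', 'are', 'great', 'good', 'thanks', 'gratitude']
```

The expanded token list can then be fed to any bag-of-words classifier or clustering method in place of the raw message.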

Datasets

SMS Spam Collection

A public set of labeled SMS messages collected for mobile phone spam research. It consists of one collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.

Sent Collection

Seven public datasets containing real, non-encoded labeled tweets collected for sentiment analysis research. Each sample was labeled as positive or negative, according to the opinion expressed with respect to the object of interest.

YouTube Spam Collection

A public set of YouTube comments collected for spam research. It has five datasets composed of 1,956 real, non-encoded messages that were labeled as legitimate (ham) or spam.

Domain Categorized Questions Collections

A public set of domain-categorized questions generated automatically from news and Wikipedia articles, together with a human-generated set for the same domains. The training dataset is composed of 3,517,858 computer-generated samples based on English news and Wikipedia articles. The test dataset is composed of 500 human-generated questions. Both are available for download in CSV format.

E-Tongue Sugar Collections

Public sets of labeled sugar samples collected with an electronic tongue for automatically assessing sugar quality. The collection is composed of two datasets: one with 190 samples in their natural form and the other with 185 sugar samples with controlled pH. Both are real and non-encoded, and each sample is tagged as Organic, VHP (Very High Polarization), or VVHP (Very Very High Polarization).

Applications

TubeSpam

TubeSpam is an automatic system for filtering spam comments on YouTube. The application is based on established and sophisticated computing procedures that automatically learn, detect, and block undesired comments.

Labeling

This app is designed to help you label your machine learning classification datasets. You can upload your dataset as a CSV file and add collaborators to label your samples.

SentMiner

SentMiner is a tool for opinion detection in English messages, powered by ensemble and state-of-the-art natural language processing techniques.

Twitter Search

An online search tool built on the Twitter API that provides a search interface and specific filter options.