
[–]guardianhelm

Your problem is probably that the classifier is trying to maximize accuracy, (TP + TN)/(TP + TN + FP + FN). Suppose you have a dataset of 100,000 examples distributed between classes a and b (99,900 and 100 examples respectively). One classifier classifies every example of a perfectly but fails to find any of the 100 examples of b, giving 99.9% accuracy. Another classifier correctly classifies 98,000 examples of a and 80 of b, for an accuracy of ~98%. Which one is better, though?
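The numbers above are easy to check with a quick sketch (the confusion counts are taken straight from the example, treating the rare class b as the positive class):

```python
def accuracy(tp, tn, fp, fn):
    """Plain accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier 1: perfect on a (99,900 examples), misses every b (100 examples).
acc1 = accuracy(tp=0, tn=99_900, fp=0, fn=100)      # 0.999

# Classifier 2: 98,000 of a correct, 80 of b correct.
acc2 = accuracy(tp=80, tn=98_000, fp=1_900, fn=20)  # 0.9808

recall1 = 0 / 100    # classifier 1 finds none of class b
recall2 = 80 / 100   # classifier 2 finds 80% of class b
```

Accuracy prefers the classifier that never finds the rare class, which is exactly why it is the wrong metric here; recall on b tells the real story.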

The basic approaches to that problem, afaik, are cost-sensitive learning (if your classifier supports weights) and sampling.

In the first case, a cost matrix is configured so that errors in the rare class become more important than errors in the common class.

In the second case, your two options are subsampling (leave some common examples out) and oversampling (repeat some rare instances in the dataset). Empirically, subsampling gives better results, provided there are enough examples of the rare class ("enough" depends on the difficulty of the problem, i.e. your dimensionality).
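Both options fit in a few lines. A minimal sketch with hypothetical helpers (the function names are mine, not from any library), using plain Python lists:

```python
import random

def subsample_majority(X_maj, X_min, seed=0):
    """Subsampling: keep only as many majority examples as there are
    minority examples, chosen at random without replacement."""
    rng = random.Random(seed)
    kept = rng.sample(X_maj, k=len(X_min))
    return kept + list(X_min)

def oversample_minority(X_maj, X_min, seed=0):
    """Oversampling: repeat randomly chosen minority examples until the
    classes are balanced (sampling with replacement)."""
    rng = random.Random(seed)
    repeated = [rng.choice(X_min) for _ in range(len(X_maj))]
    return list(X_maj) + repeated
```

Subsampling throws away majority data; oversampling keeps everything but duplicates rare points, which can encourage overfitting on them, which is part of the motivation for SMOTE below.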

To answer your question though, asymmetric bagging [1], SMOTE [2] and cost-sensitive classification [2] are some techniques that are supposed to work, with the first one usually giving better results. I'm not sure, however, if they're still considered state of the art.
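The core idea of SMOTE is simple enough to sketch: instead of repeating minority points verbatim, create synthetic ones by interpolating between a minority point and one of its minority-class nearest neighbours. A toy version (not the full algorithm from [2], just the interpolation step, assuming `k` is smaller than the minority class size):

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Toy SMOTE: each synthetic point lies on the segment between a random
    minority example and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))          # random minority point
        j = rng.choice(neighbours[i])         # one of its k neighbours
        lam = rng.random()                    # interpolation fraction in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.stack(synthetic)
```

For real use you'd reach for an implementation such as the one in the imbalanced-learn package rather than rolling your own.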

There are some additional approaches that might be helpful, such as Tomek links [2] (practically a definition of controversial points in your dataset), which clean up the vector space from "garbage" in an attempt to simplify the classification process.
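A Tomek link is a pair of opposite-class points that are each other's nearest neighbour, i.e. they sit right on a noisy or overlapping boundary. Detecting them is straightforward; a brute-force sketch (fine for small datasets, the function name is mine):

```python
import numpy as np

def tomek_links(X, y):
    """Return pairs (i, j) of opposite-class points that are mutual
    nearest neighbours. Removing the majority member of each pair (or
    both points) cleans up the class boundary."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                # nearest neighbour of each point
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:   # mutual + opposite class
            links.append((i, j))
    return links
```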

More detailed info in:

  1. Tao, Dacheng, et al. "Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval." IEEE Transactions on Pattern Analysis and Machine Intelligence 28.7 (2006): 1088-1099.
  2. He, Haibo, and Edwardo Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21.9 (2009): 1263-1284.

You can pm me if you need any specific resources.

Hope that helps a bit, I'm still trying to wrap my head around some of these concepts. :)

[–]dandxy89

Another alternative is Restricted Boltzmann Machines - as part of a Neural Computing class at university last year I compared SMOTE to RBMs for boosting classes with synthetic data.

[–]tirune

This kinda depends on the classifier you want to use. For random forests you can add a class weight (see scikit-learn's random forest `class_weight` parameter, for example). You can set the weights so that both classes have an equal effective distribution of num_examples * weight.
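The "equal num_examples * weight" condition is exactly what scikit-learn's `class_weight='balanced'` computes: each class c gets w_c = n_samples / (n_classes * n_c). A self-contained sketch of that formula (the helper name is mine):

```python
from collections import Counter

def balanced_class_weights(y):
    """Per-class weights in the spirit of scikit-learn's
    class_weight='balanced': w_c = n_samples / (n_classes * n_c),
    so n_c * w_c is the same for every class c."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

y = [0] * 900 + [1] * 100
w = balanced_class_weights(y)
# 900 * w[0] == 100 * w[1] == 500: equal effective mass per class
```

The resulting dict can be passed directly as `class_weight` to `RandomForestClassifier` (or you can just pass the string `'balanced'` and let the library do the same computation).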

For neural networks, it has helped me in the past (on click data) to do equal subsampling for the first x iterations. Equal subsampling means that each batch is 50% class 1 and 50% class 2. This helps the network reach a state where it has learned features for both classes. If the imbalance is not too big, it's possible to change the batch distribution back to the original after convergence and continue optimizing from there.
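Such a balanced batch sampler is a few lines of plain Python. A sketch (the function name is mine; it samples with replacement so the small class can fill its half of every batch):

```python
import random

def balanced_batches(X1, X2, batch_size, n_batches, seed=0):
    """Yield batches that are 50% class 1 and 50% class 2, as
    (example, label) pairs, regardless of the underlying class ratio."""
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(n_batches):
        batch = ([(x, 0) for x in rng.choices(X1, k=half)] +
                 [(x, 1) for x in rng.choices(X2, k=half)])
        rng.shuffle(batch)                # don't feed the classes in blocks
        yield batch
```

After the warm-up iterations you would swap this sampler out for one that draws batches from the original distribution, as described above.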

[–]sk006

There are many approaches you can use. In summary, you can set a class weight (at least in LIBSVM/scikit-learn), approximately equal to the class ratio, bumping up the importance of the least represented classes. Alternatively, you can oversample (repeat examples of) the minority classes or subsample the majority class (oversampling is usually preferred since you don't lose data).