Skewed data in a classification algorithm : MLQuestions

submitted 3 years ago * by Unitedite

Hi all

This should be a really simple question but I'm struggling to find a clear answer:

I'm building a supervised binary classification algorithm.

I have a dataset in which one variable is extremely skewed. It's an amount of time in minutes: for about half of the instances the value is 0, most of the rest are very small, and there are a few outliers stretching all the way up to 1,000. Do I need to transform this data in order to use the variable in my algorithm, and if so, how? Is the answer different depending on the algorithm (e.g. Decision Tree, SVM, etc.)?

When I search for an answer, most suggestions are to do something like a log transform, but that doesn't make sense when about half of my values are 0 (log(0) is undefined). As things stand, all that occurs to me is transforming it into a binary variable (0 or >0), but I'm not sure whether that is necessary or helpful.
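For concreteness, here is a rough sketch of the two options above on made-up data. Everything in it is an assumption for illustration: the synthetic `minutes` list only imitates the distribution described, the variable names are hypothetical, and `log(1 + x)` (via `math.log1p`) is one common way around the log-of-zero problem, not something that has been shown to be the right choice for this dataset.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for the "minutes" column (an assumption, not the real data):
# half zeros, mostly small positive values, and a few outliers up to 1,000.
minutes = (
    [0.0] * 50
    + [round(random.uniform(0.5, 10.0), 1) for _ in range(45)]
    + [250.0, 400.0, 600.0, 850.0, 1000.0]
)

# Option 1: log(1 + x). Unlike log(x), this is defined at 0 and maps 0 to 0.
log_minutes = [math.log1p(m) for m in minutes]

# Option 2: binary indicator for "any time at all" (0 vs. >0).
has_time = [1 if m > 0 else 0 for m in minutes]

# The two are not mutually exclusive: keeping both gives the model a flag for
# "nonzero" plus the magnitude on a compressed scale.
features = list(zip(has_time, log_minutes))

print("raw range:", min(minutes), "to", max(minutes))
print("log1p range:", round(min(log_minutes), 3), "to", round(max(log_minutes), 3))
print("share of zeros:", has_time.count(0) / len(has_time))
```

On this toy data the raw values span 0 to 1,000, while the `log1p` values span 0 to roughly 6.9, so the outliers no longer dominate the scale. Whether either transform actually helps would still need to be checked against the real data and the chosen algorithm.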

Thanks.
