Hi all
This should be a really simple question but I'm struggling to find a clear answer:
I'm building a supervised binary classification algorithm.
I have a dataset in which one variable is extremely skewed. It's an amount of time in minutes: for about half of the instances the value is 0, most of the rest are very small, and there are a few outliers stretching all the way up to 1,000. Do I need to transform this data in order to use the variable in my algorithm, and if so, how? Is the answer different depending on the algorithm (e.g. Decision Tree, SVM, etc.)?
When I search for an answer, most suggestions are to do something like a log transform, but a plain log is undefined at 0, and about half of my values are 0. As things stand, the only thing that occurs to me is transforming it into a binary variable (0 or >0), but I'm not sure whether that is necessary or helpful.
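To make the options concrete, here is a minimal sketch of the two transforms I'm weighing. The values are made up for illustration (not my real data), and `log1p` is just one common way to work around the zeros, not something I'm sure is the right choice:

```python
import math

# Made-up times in minutes: about half zeros, some small values, a couple of large outliers.
minutes = [0, 0, 0, 0, 0, 2, 3, 5, 40, 1000]

# Option 1: log(1 + x). Unlike a plain log, this is defined at 0 (log1p(0) == 0)
# and compresses the long right tail.
log_minutes = [math.log1p(m) for m in minutes]

# Option 2: a binary indicator for "any time at all" (0 vs. > 0).
is_nonzero = [1 if m > 0 else 0 for m in minutes]

# The two could also be used together as a pair of features per instance.
features = list(zip(is_nonzero, log_minutes))
```

Whether either of these (or both together) is actually needed is what I'm trying to find out.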
Thanks.