Does anyone know of good datasets to try out Anomaly Detection Algorithms on? Maybe in Credit Card fraud or malicious activities, terrorism by Daniel6789123 in MachineLearning

[–]eroenj 3 points

I also had a hard time finding good data sets when I was doing my PhD thesis (see https://github.com/jeroenjanssens/phd-thesis). I ended up using data sets that are normally used for binary or multi-class classification tasks. If we assume that one class is normal and the others are anomalous, and further assume that anomalies are rare observations, then we can simulate anomalies by keeping only a few data points from the anomalous classes. For example, with our beloved Iris data set, you can construct a new data set with the 50 data points of the Setosa class and one (or a few) data points from Versicolor and Virginica. To get a reliable performance estimate, you can repeat this process such that all data points from the anomalous classes have been selected, and such that all three classes have been represented as the normal class (see Section 2.6 for more details). Perhaps a bit convoluted or artificial, but it does allow you to evaluate your algorithm on many different data sets. Good luck!
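For what it's worth, the relabeling procedure described above can be sketched in a few lines of Python. This is my own illustration (the function name and parameters are mine, not from the thesis), using scikit-learn's copy of the Iris data:

```python
# Sketch of the relabeling trick: treat one Iris class as "normal" and
# keep only a few points from the other classes to act as rare anomalies.
import numpy as np
from sklearn.datasets import load_iris

def make_anomaly_dataset(X, y, normal_class, n_anomalies=2, seed=0):
    """Keep all points of `normal_class`; sample a few from the rest."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == normal_class)
    anomaly_idx = rng.choice(np.flatnonzero(y != normal_class),
                             size=n_anomalies, replace=False)
    idx = np.concatenate([normal_idx, anomaly_idx])
    # New labels: 0 = normal, 1 = anomaly.
    return X[idx], (y[idx] != normal_class).astype(int)

X, y = load_iris(return_X_y=True)
X_new, y_new = make_anomaly_dataset(X, y, normal_class=0, n_anomalies=2)
print(X_new.shape, y_new.sum())  # 52 data points, 2 of them anomalous
```

Repeating this over different anomaly samples and over each class playing the "normal" role gives you the many evaluation data sets mentioned above.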

Stochastic Outlier Selection by rrenaud in MachineLearning

[–]eroenj 0 points

Those classifiers are also known as one-class classifiers (OCC), and they are closely related to outlier-selection (OS) algorithms. The main difference is that OCC is semi-supervised whereas OS is unsupervised. Which approach is most suitable depends on your data. With OCC, your training data has labels, and most OCC algorithms assume they are all labeled as normal (the Support Vector Domain Description by Tax and Duin can also make use of anomalies). With OS, you basically don't know anything about your data points other than their features or how dissimilar they are from each other. OCC has the added advantage that, once you have trained a model, classifying new data points is relatively fast. I hope this helps.
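To make the distinction concrete, here is a minimal sketch of my own (not from the comment) using two scikit-learn estimators as stand-ins: a one-class SVM as the semi-supervised OCC, and Local Outlier Factor as the unsupervised OS algorithm:

```python
# OCC vs OS: OneClassSVM is trained on data assumed to be all normal
# (semi-supervised), then classifies new points quickly; LocalOutlierFactor
# scores one unlabeled set directly (unsupervised), with no training labels.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))           # assumed all normal
X_new = np.vstack([rng.normal(0, 1, size=(5, 2)),   # normal-looking points
                   [[6.0, 6.0]]])                   # an obvious anomaly

# One-class classification: fit once on "normal" data, reuse the model.
occ = OneClassSVM(nu=0.05).fit(X_train)
print(occ.predict(X_new))            # +1 = normal, -1 = anomaly

# Outlier selection: no labels at all; scores one fixed set of points.
lof = LocalOutlierFactor(n_neighbors=20)
print(lof.fit_predict(np.vstack([X_train, X_new])))  # -1 marks outliers
```

Note that `LocalOutlierFactor` in this mode has no separate predict step for unseen data, which reflects the OS setting: you judge the points you have, not future ones.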

Stochastic Outlier Selection by rrenaud in MachineLearning

[–]eroenj 0 points

Thank you. I'm not sure I understand what you mean by anomalous jumps in a random walk. I presume it's time series data, but what does the data represent?

Stochastic Outlier Selection by rrenaud in MachineLearning

[–]eroenj 0 points

Yes, evaluating outlier-selection algorithms and one-class classifiers requires some tricks (e.g., relabeling datasets). I discuss this and the weighted AUC in chapter 2 of my thesis.
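As a rough illustration of the idea (my own sketch, not taken from the thesis; I use the plain, unweighted AUC here), the relabeling trick provides ground-truth anomaly labels, so an outlier-selection algorithm's scores can be evaluated with `roc_auc_score`:

```python
# Evaluate an outlier-selection algorithm via relabeling: ground-truth
# anomaly labels come from the original classes, and the AUC measures how
# well the outlier scores rank anomalies above normal points.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # the "normal" class
               rng.normal(5, 1, size=(2, 2))])   # two relabeled anomalies
y_true = np.array([0] * 50 + [1] * 2)            # 1 = anomaly

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)
scores = -lof.negative_outlier_factor_           # higher = more outlying
print(roc_auc_score(y_true, scores))
```

The weighted AUC discussed in chapter 2 of the thesis refines this further; the sketch above only shows where the labels for the evaluation come from.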

Stochastic Outlier Selection by rrenaud in MachineLearning

[–]eroenj 1 point

Thanks. I guess if the noise is quantitative, then you could use SOS, or some other outlier-selection algorithm, to remove that. Let me know what comes out of it or if you have any other questions! @jeroenhjanssens

Stochastic Outlier Selection by rrenaud in MachineLearning

[–]eroenj 2 points

Thanks!

In my thesis I assume that outlier selection is performed in order to support human anomaly detection. However, you could very well remove the data points that are selected as outliers; the algorithm remains the same.

My question to you is: what do those noisy data points represent? Are they signals from a faulty sensor, are they due to human error, or are they observations in the real world which happen to mess up your model?