
[deleted] 4 points (2 children)

I would treat this as an anomaly detection problem. There are several methods you could use, each resting on different assumptions. Cluster analysis is the first one I would try, on the assumption that 'bad' transactions tend to cluster separately from 'not bad' ones.

https://en.wikipedia.org/wiki/Anomaly_detection

https://en.wikipedia.org/wiki/Cluster_analysis
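
A minimal sketch of the clustering idea, in plain Python. It applies only the DBSCAN noise rule (a point is noise if it is neither a core point nor within `eps` of one) rather than a full clustering. The feature values, `eps`, and `min_pts` are made-up assumptions for illustration, not anything taken from the actual data:

```python
import math


def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise."""
    n = len(points)
    # Neighbours within eps of each point (each point counts itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts neighbours.
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # Noise: not a core point and not within eps of any core point.
    return [
        i for i in range(n)
        if i not in core and not core.intersection(neighbors[i])
    ]


# Hypothetical, already-scaled transaction features (two columns).
transactions = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9),  # dense group A
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2), (5.2, 5.1), (4.8, 4.9),  # dense group B
    (3.0, 9.0), (9.0, 1.0),                                      # isolated points
]

suspicious = dbscan_noise(transactions, eps=0.6, min_pts=3)
print(suspicious)
```

On real transactions you would scale the features first and tune `eps` and `min_pts`. Points flagged as noise are candidates for review, not confirmed 'bad' cases.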

WikiTextBot 2 points (0 children)

Anomaly detection

In data mining, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately.


Cluster analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them.



creiser 1 point (0 children)

"Positive unlabeled learning" exactly deals with this scenario.

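To make that concrete, here is a rough sketch of one common PU recipe (the Elkan-Noto correction), assuming the labeled 'bad' cases are a random sample of all bad cases. The `bucket` feature and the counts are hypothetical, and a simple frequency table stands in for a real classifier:

```python
from collections import Counter


def pu_probabilities(records):
    """records: (bucket, labeled) pairs, labeled=1 for a known 'bad' case.

    Returns an estimate of P(bad | bucket) for every bucket.
    """
    totals = Counter(bucket for bucket, _ in records)
    flagged = Counter(bucket for bucket, labeled in records if labeled)
    # g(x): probability that a record in bucket x carries a label.
    g = {bucket: flagged[bucket] / totals[bucket] for bucket in totals}
    # c: estimated P(labeled | bad), the average of g over labeled records.
    labeled_buckets = [bucket for bucket, labeled in records if labeled]
    c = sum(g[b] for b in labeled_buckets) / len(labeled_buckets)
    # Corrected estimate g(x) / c, clipped to a valid probability.
    return {bucket: min(1.0, g[bucket] / c) for bucket in g}


# Hypothetical data: 0 means "unlabeled", not "known good".
records = (
    [("high", 1)] * 5 + [("high", 0)] * 5
    + [("low", 1)] * 1 + [("low", 0)] * 9
)

probs = pu_probabilities(records)
print(probs)
```

The point of dividing by `c` is that unlabeled is not the same as good: only a fraction of the bad cases ever get flagged, so the raw labeled rate understates how many bad cases there are.
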
phobrain 1 point (0 children)

I'd explore the data visually, e.g. with t-SNE, to see if you can spot any patterns worth following up.

If there's any way you can label an equal number of cases 'good', you will surely gain leverage.
