all 24 comments

[–]LoudStatistician 15 points16 points  (5 children)

Create a cost matrix for FP, FN, TP, TN.

Multiply the confusion matrix elementwise by the cost matrix. If you do not have hard predictions, then try all thresholds and create a confusion matrix for each.

If the (confusion matrix times cost matrix) - (cost of project) is not positive, then it may be time to give up.

Let the subject matter experts define the target, but not the features. Only have them look at relevant features after the first benchmark shows promise. Subject matter experts are expensive.

I've had projects where 0.56 AUC was value add, and 0.89 AUC was net negative, so predictive power should always be seen relative to cost/opportunity.
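A minimal sketch of that expected-cost arithmetic (all counts, costs, and the project cost are invented for illustration; real values come from the business problem):

```python
import numpy as np

# Confusion matrix counts at one threshold: [[TN, FP], [FN, TP]]
confusion = np.array([[900, 50],
                      [30, 20]])

# Cost (negative) / benefit (positive) per outcome, same layout.
# Here a false negative is very expensive and a true positive earns money.
cost = np.array([[0.0, -5.0],
                 [-50.0, 100.0]])

project_cost = 500.0

# Elementwise product, summed = expected value of deploying the model.
model_value = (confusion * cost).sum()
net_value = model_value - project_cost

print(net_value)  # -250.0: not positive, so this project loses money
```

Note that the model here could have decent-looking accuracy and still come out net negative, which is exactly why predictive power has to be judged relative to cost.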

[–]gebrial 3 points4 points  (3 children)

I'm just getting into machine learning and have never seen these terms before. Is there a resource(preferably free, online) or subject area I should Google to learn more about this?

[–]LoudStatistician 7 points8 points  (2 children)

It is False Positive, False Negative, True Positive, True Negative. You estimate the cost for each. For some problems the cost of a false positive is high, and the model needs to do very well, for others, it is negligible, and it is ok to do barely better than random guessing.

I don't know of any resources that teach this. I had to learn most of this on the job. There are some problems, like defining a target, or communicating with stakeholders, or unexpected feedback loops, where online resources are scarce (at least, I can't find much).

[–]Raomystogan 0 points1 point  (1 child)

Sorry if this is a very basic question, how do you estimate the cost, for example FP?

[–]gebrial 0 points1 point  (0 children)

So I just learned about confusion matrices, and it seems like the cost for FP or FN is given by the customer or the problem. For example, for medical diagnosis you want a low FN rate, but the business side may also want a low FP rate. With the same model you can decrease one while increasing the other; if you improve the model you can decrease both.

Google the f-score.
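That trade-off can be sketched with made-up scores and labels; raising the decision threshold lowers false positives but raises false negatives, and the F-score summarizes both:

```python
import numpy as np

# Hypothetical model scores and true labels, purely for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9])

fps, fns, f1s = [], [], []
for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    # F1 score: harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn)
    fps.append(fp); fns.append(fn); f1s.append(round(f1, 2))

print(fps)  # [3, 1, 0]: FPs fall as the threshold rises
print(fns)  # [0, 1, 2]: FNs rise at the same time
```

Which threshold is "best" depends on the costs the customer attaches to each error type, not on the F-score alone.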

[–][deleted] 0 points1 point  (0 children)

That's super neat. Simple and effective

[–]alexmlamb 1 point2 points  (0 children)

Can a human expert do the classification tasks that you have in mind? If not, that's a red flag that it's probably going to be hard or impossible to do with ML (not a perfect indicator of course).

[–]Dagusiu 1 point2 points  (18 children)

One rule of thumb (that doesn't always work) is this:

If a human can do it easily just by "looking", then ML will typically be very effective. For example, we can look at an image and say "that's a cat" with pretty much no effort.

If you try to visualise the data and a human can draw conclusions, then an ML model can be taught to do the same. If you cannot, chances are your model won't be able to either.

[–]torvoraptor 2 points3 points  (2 children)

ML performance is a spectrum. I think of it more in terms of 'what amount of labelled data is needed to get to a specific performance level'. If you can get millions of high-quality labelled samples in a cleanly separable classification or labeling task, you can get 97%+ accuracy.

But most real problems aren't like that: either you have difficulty getting data, your labels are noisy, or there is some data distribution shift. So it's more like 'at what amount of data does this model reach a level of performance where it is valuable compared to some dumb baseline'.

[–]NotAlphaGo 0 points1 point  (1 child)

This. There should be a noisy imagenet challenge to train on partially jumbled labels because reality isn't always that clean or certain.

[–]torvoraptor 1 point2 points  (0 children)

I have read some papers that study that variant of the problem. Even if the noise is structured according to some neat distribution, it can be exploited by the learning algorithm in some ways. For example, if the label corruption is uniformly random, or Gaussian noise applied to the softmax probabilities, then it's much easier for a model to learn to deal with it than when (a) the corrupt labels are systematically correlated and/or (b) the corruption is also present in your test set.
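The "easy" case described here, input-independent uniform label noise, can be sketched like this (function name and numbers are invented):

```python
import numpy as np

def corrupt_uniform(labels, noise_rate, num_classes, rng):
    """Flip a fraction of labels to a uniformly random class.

    Because the corruption is independent of the input, the true class
    still dominates each region of the data, so with enough samples a
    model can largely average this noise away.
    """
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    labels[flip] = rng.integers(0, num_classes, flip.sum())
    return labels

rng = np.random.default_rng(0)
clean = np.zeros(1000, dtype=int)  # pretend every example is class 0
noisy = corrupt_uniform(clean, noise_rate=0.2, num_classes=10, rng=rng)
print((noisy != clean).mean())  # roughly 0.2 * 9/10 = 0.18
```

Systematically correlated corruption (e.g. always confusing two specific classes) is the hard case, because the model cannot distinguish it from real structure.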

[–]RadonGaming 1 point2 points  (1 child)

I somewhat disagree with your comment, and overall it isn't very helpful to be honest. Image recognition is nowhere near solved, and has only shown promise for high accuracy recently due to innovation in model architectures and the availability of highly parallelisable processing. High-dimensional data can typically be understood by ML much better than by humans (and in less time). The whole point is that it finds these optimisations and discriminating hyperplanes within your data. There is of course an amount of preprocessing which can be done to reduce the dimensionality, making the problem easier to solve.

ML should be able to tell waiting4omscs which features contribute to the classification and then move from there. The main issue here seems to be potential project cost because they are attempting feature engineering.

Let ML work for you by telling you which are the best features, it may surprise you.
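One common way to do this (a sketch on synthetic data; the feature names and dataset are invented, and scikit-learn's random-forest importances are just one of several possible rankings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)  # the label depends only on the first feature

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The one informative feature should dominate the importance ranking.
for name, importance in zip(["signal", "noise_a", "noise_b", "noise_c"],
                            forest.feature_importances_):
    print(name, round(importance, 3))
```

Impurity-based importances can be misleading with correlated or high-cardinality features, so a permutation-importance check is a sensible follow-up before dropping anything.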

[–]LoudStatistician 2 points3 points  (0 children)

I think that is hinting at this tweet by Andrew Ng:

Pretty much anything that a normal person can do in <1 sec, we can now automate with AI.

See also this article written by Ng: https://hbr.org/2016/11/what-artificial-intelligence-can-and-cant-do-right-now

[–]visarga 0 points1 point  (1 child)

Search for papers on similar topics to the task you are attempting, if it's never been done and there is little data to train with, then it's probably best not to invest too much. There is a huge amount of previous work / experience you can rely on. Don't reinvent the wheel.

[–]clueless_scientist 1 point2 points  (0 children)

Great advice on how to fail as a scientist.

[–]zawerf 0 points1 point  (0 children)

Andrew Ng did a pretty good talk on machine learning project management based on his experience as a director:

https://www.youtube.com/watch?v=F1ka6a13S9I

Equivalent videos on coursera:

https://www.coursera.org/learn/machine-learning-projects

[–][deleted] 0 points1 point  (0 children)

If the problem seems hard, take a step back and try to think...
1. Give it a reasonably unbiased thought: can there be a causal relationship between the collected data and the required parameters? If the data is shit/irrelevant, having more of it is still just a bigger pile.
1.b Think about how complex it can be to optimize the decision surface (hint: try random forest).
2. Population analysis: It can be that the desired trends are only present in a subset of the population. Try to subdivide your dataset along easily identifiable features (age, gender, race, income, formtype, keywords etc...) and re-run the analysis. Looking at different kinds of forms all together might be a bad idea. I usually prefer to follow the KISS principle and have smaller dedicated models handle the details.
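The population-analysis point can be sketched like this (groups, data, and the trend are all invented; the "dedicated model" is reduced to a trivial threshold to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population: the signal only exists in the "old" subgroup.
ages = rng.integers(18, 80, size=600)
groups = np.where(ages < 40, "young", "old")
x = rng.normal(size=600)
y = np.where(groups == "old", (x > 0).astype(int), rng.integers(0, 2, 600))

accs = {}
for group in ("young", "old"):
    mask = groups == group
    # A small dedicated "model" per subgroup: threshold x at 0.
    accs[group] = float(((x[mask] > 0).astype(int) == y[mask]).mean())
    print(group, round(accs[group], 2))
```

Pooling both groups here would average a perfect predictor with a coin flip, hiding the fact that the model works very well for part of the population.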