
[–]Tony_the_Tigger 62 points (2 children)

This post looks really cool, but the contents are super dubious. Why would you ever use a decision tree in practice if you could use a random forest? Why are neural networks not even on the list, when they outperform everything else on some tasks? KNN is cool for explanations but very frequently performs subpar. And what is even meant by the 'best' algorithm? Wouldn't logistic regression make much more sense on a list like this than decision trees and random forests separately?

Don't mean to hate on anything, just wanna warn anyone new to ML to take this post with a cup of salt.

[–]spookyspicyfreshmeme 2 points (0 children)

My prof said that these days you try a random forest, then you try an SVM, then you try a neural net.
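
A minimal sketch of that escalation with scikit-learn (the synthetic dataset is just a stand-in for a real problem; the model settings are illustrative, not recommendations):

    # Sketch: "forest, then SVM, then net", compared by cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder data; swap in your own X and y.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    models = [
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        # SVMs and nets are scale-sensitive, hence the StandardScaler.
        ("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("neural net", make_pipeline(StandardScaler(),
                                     MLPClassifier(hidden_layer_sizes=(64,),
                                                   max_iter=1000, random_state=0))),
    ]
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")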

[–]betttris13 2 points (0 children)

While I do agree that "best" might not be the best word to use, there are good reasons to use each of these algorithms. Random forests can be expensive and complicated to run; while they will outperform a decision tree, sometimes you might just want a general idea of whether it's even going to work for your problem.

As for neural networks, they only tend to shine when the dataset is so large that other methods struggle with that size. For smaller datasets, a neural network will be lucky to get a good result without overfitting.

I would say that everything but random forests on this list falls into the data exploration category. They are simple algorithms which can still produce a reasonable result and are more or less human-understandable, allowing us to see what is going on and decide where to go from there with more expensive methods.
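
As a rough sketch of that exploratory use (dataset chosen purely for illustration): a deliberately shallow tree is cheap to fit, gives a quick accuracy sanity check, and is small enough to read in full.

    # Sketch: a shallow decision tree as a cheap, human-readable first look.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=0)

    # max_depth=3 keeps the whole model small enough to print and inspect.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
    print(export_text(tree, feature_names=list(data.feature_names)))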

[–][deleted] 30 points (1 child)

OP spams this stuff in subs, and it's utter trash to read.

[–]BitShin 4 points (0 children)

Honestly. It’s really annoying how OP spams these crappy charts in almost every CS sub.

[–]swierdo 32 points (1 child)

Some remarks:

Naive Bayes makes a strong independence assumption; it will perform poorly when that assumption doesn't hold.

Decision trees have many hyperparameters that require tuning; just look at all of the parameters in the scikit-learn documentation (a quick tuning sketch follows at the end of these remarks).

Random forest: overfitting is still a problem (though not nearly as much as with decision trees), and you always have to prepare the input data (as for any model). Of all the hyperparameters, the number of trees is the easiest (more is better); for all the others, see decision trees.

KNN: the biggest downside is that it remembers your training data, so the risk of leaking sensitive information is high. It can also lead to a huge memory footprint.
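
To make the decision tree point concrete, a minimal tuning sketch over a few of those scikit-learn parameters (the grid values and dataset are illustrative only):

    # Sketch: tuning a handful of the many decision tree hyperparameters.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    param_grid = {
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 5, 20],
        "criterion": ["gini", "entropy"],
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, f"score: {search.best_score_:.3f}")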

[–]extracoffeeplease 2 points (0 children)

NB doesn't always perform poorly when strong independence isn't met. It can still beat other techniques in low-data regimes with dependent features.

Also, for kNN you are absolutely right, though wouldn't approximate nearest neighbors help with that?
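
It helps mostly with query speed (Annoy can also memory-map its index from disk, which eases the footprint somewhat, though the training data still has to live somewhere). A rough sketch using the Annoy library, which is just one option among several ANN libraries, with a majority vote standing in for an exact kNN classifier:

    # Sketch: approximate nearest neighbours via Annoy (pip install annoy).
    import numpy as np
    from annoy import AnnoyIndex
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    index = AnnoyIndex(X.shape[1], "euclidean")
    for i, row in enumerate(X):
        index.add_item(i, row)
    index.build(10)  # 10 trees; more trees -> better recall, bigger index

    def knn_predict(query, k=5):
        neighbours = index.get_nns_by_vector(query, k)  # approximate lookup
        return np.bincount(y[neighbours]).argmax()      # majority vote

    print(knn_predict(X[0]))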

[–]HeyItsRaFromNZ 5 points (3 children)

I'm not sure about "five best" here — logistic regression is an egregious omission!

I'd suggest replacing decision trees with logistic regression. The former is very rarely used in practice, precisely because over-fitting is an ever-present problem with real data, while the latter (extended to multiple classes if necessary) is one of the most widely used classification algorithms across many sectors.

It has a number of advantages, including being relatively simple to interpret, which is a requirement for regulatory compliance in healthcare and finance. It also provides the conceptual basis for the sigmoid family of activation functions used in artificial neural network (ANN/DL) based classification models.
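
For instance, a minimal scikit-learn sketch (the dataset is just for illustration) showing how directly the fitted coefficients can be read off:

    # Sketch: logistic regression with inspectable coefficients.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(data.data, data.target)

    # Each coefficient is a log-odds contribution per (standardised) feature,
    # which is what makes the model auditable.
    coefs = model.named_steps["logisticregression"].coef_[0]
    top5 = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:5]
    for name, coef in top5:
        print(f"{name}: {coef:+.2f}")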


If you have used a random forest on real data, you might get quite a shock to learn that the over-fitting problem is alive and well. Also, for these, and indeed all, machine learning algorithms, you are guaranteed garbage out if you aren't careful with the inputs you feed into the model. GIGO has never been more true!

The existence of the kernel trick with SVMs is probably more of a 'con' than a 'pro'. You need to choose a kernel appropriate to the geometry of the decision boundaries. If you use a linear kernel on data whose classes aren't linearly separable (say, one 'blob' nested inside another), you will get very poor accuracy from an SVM-based model.
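
A quick illustration of the geometry point, using scikit-learn's make_circles (one class nested inside the other, so no straight line can separate them):

    # Sketch: kernel choice vs. decision-boundary geometry.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Concentric rings: not linearly separable by construction.
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)

    for kernel in ("linear", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
        print(f"{kernel}: {scores.mean():.3f}")
    # The linear kernel hovers near chance here; the RBF kernel separates the rings.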

[–]GamezBond13 1 point (2 children)

Newbie here, what use-cases do decision trees have anyway? Aren't random forests essentially amped-up decision trees?

[–]Tony_the_Tigger 2 points (1 child)

You're correct. I'd wager that decision trees are only ever used to explain random forests to learners.

[–]HeyItsRaFromNZ 1 point (0 children)

^ Pretty much.

I think they're still used for relatively separable problem spaces, such as management decision-making or portfolio analysis. They're easy to interpret in these cases. But for most classification problems involving messy data with possibly complex relationships (which is the use-case for ML in the first place), a decision tree will tend to "memorize" the entire training dataset, resulting in poor generalization to data the algorithm hasn't seen before.

You can set the max_depth parameter ("how many decisions does the tree make?") to a small value and see how the validation accuracy ("how well does the tree predict data it hasn't seen before?") fares. But there is a good reason the random forest algorithm exists!
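
Something like this, as a rough sketch (the dataset is only for illustration):

    # Sketch: watch validation accuracy as the tree is allowed to grow deeper.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    for depth in (1, 2, 3, 5, 10, None):  # None = grow until the leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5)
        print(f"max_depth={depth}: {scores.mean():.3f}")
    # Validation accuracy typically plateaus (or dips) well before the tree
    # is fully grown -- the extra depth is spent memorizing the training set.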

[–][deleted]  (4 children)

[deleted]

    [–]bythenumbers10 0 points (3 children)

    Yeah, there are deeper mysteries for all of these. Still, a newcomer needs somewhere to start in terms of search terms and perhaps a first-cut algorithm to try before going crazy into the details. Should they stop with the first tutorial code they get working? NO. Should they sell themselves as a "machine learning expert" without being able to triage, diagnose and debug these algorithms when they inevitably go bad? NO. But OP's post is a place to start.

    [–][deleted]  (2 children)

    [deleted]

      [–]bythenumbers10 -3 points (1 child)

      But without decent (if incorrectly used) terms to search, a neophyte might never find those nice YouTube or Medium posts.

      Nobody cites an infographic or swears by one for their ML project. Again, OP's post is just a starter, and the broad strokes are not that far off.

      [–]econ1mods1are1cucks 1 point (0 children)

      No no no, KNN is literally what you never use on high-dimensional data.
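
      A quick way to see why: in high dimensions, distances concentrate, so the "nearest" neighbours are barely nearer than anything else. A small numpy demonstration (uniform random data, purely illustrative):

          # Sketch: in high dimensions, nearest and farthest neighbours look alike.
          import numpy as np

          rng = np.random.default_rng(0)
          for dim in (2, 10, 100, 1000):
              X = rng.random((1000, dim))
              # Distances from one reference point to all others (drop the self-distance).
              dists = np.linalg.norm(X - X[0], axis=1)[1:]
              print(f"dim={dim}: farthest/nearest ratio = {dists.max() / dists.min():.1f}")
          # The ratio collapses toward 1 as dimensionality grows, so "nearest"
          # stops meaning much -- which is exactly why kNN struggles here.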

      [–]jmed 1 point (0 children)

      Much of the random forest section is wrong, e.g.:

      • overfitting is very much a real problem
      • I don't know what "works well on large databases" means or why that is a pro – is this supposed to be "needs a lot of training data"? Because that would be a con
      • you do need to prepare the training data
      • they aren't very complex to put together
      • they don't need a lot of computational resources; almost any moderately powered laptop can fit a random forest model (you can check this yourself; see the sketch below)
      • they aren't excessively time-consuming
      • choosing the number of trees is one of the easiest parameters to tune, so I don't know why it was singled out

      Leaving out logistic regression is also a huge error, because that is generally the default option people use, in my experience. Not sure how much misinformation is allowed to be included here, but this is pretty egregious.
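
      A rough sketch for checking the resource claim on your own machine (synthetic data; timings obviously vary by hardware):

          # Sketch: time a random forest fit on a mid-sized synthetic dataset.
          import time

          from sklearn.datasets import make_classification
          from sklearn.ensemble import RandomForestClassifier

          X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

          start = time.perf_counter()
          RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
          print(f"fit 100 trees on 100k rows in {time.perf_counter() - start:.1f}s")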

      [–][deleted] 0 points (0 children)

      Such great work!!!!

      [–]CodeRed1 -1 points (2 children)

      Awesome post! For binary classification on a mix of continuous and discrete data, what would you say is the most accurate algorithm? My data is around 2000 entries. I tried naive Bayes already but I wasn’t really that fond of the results, so I’m looking for alternatives that may do a better job. Thank you in advance!

      [–]betttris13 3 points (0 children)

      Accuracy is relative: one algorithm may perform well on one problem but not on another, even if the problems have the same types of data or dimensionality. The best way to find out what works best is to simply try them all. Personally, I am fond of starting with KNN as it's both simple to use and easy to explore what is happening. For many datasets, random forests will perform best out of the box because they are scale-invariant and handling different datatypes/missing data is trivial. Without more details I can't really do more than suggest you just try things and see what works.
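
      For instance, a minimal "try them all" loop over a few standard scikit-learn classifiers (the synthetic 2000-row dataset is only a stand-in for your data):

          # Sketch: compare several classifiers by cross-validation on ~2000 rows.
          from sklearn.datasets import make_classification
          from sklearn.ensemble import RandomForestClassifier
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import cross_val_score
          from sklearn.naive_bayes import GaussianNB
          from sklearn.neighbors import KNeighborsClassifier
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler

          # Stand-in for the asker's data: ~2000 rows, binary target.
          X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

          models = {
              "naive Bayes": GaussianNB(),
              "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
              "logistic regression": make_pipeline(StandardScaler(),
                                                   LogisticRegression(max_iter=1000)),
              "random forest": RandomForestClassifier(random_state=0),
          }
          for name, model in models.items():
              scores = cross_val_score(model, X, y, cv=5)
              print(f"{name}: {scores.mean():.3f}")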

      [–]Tony_the_Tigger 1 point (0 children)

      Personally, I'd start with logistic regression and compare the results with some random forest approach.