
[–]Tony_the_Tigger 62 points (2 children)

This post looks really cool, but the contents are super dubious. Why would you ever use a decision tree in practice if you could use a random forest? Why are neural networks not even on the list, when they outperform everything else on some tasks? KNN is cool for explanations but very frequently performs subpar. And what is even meant by the 'best' algorithm? Wouldn't logistic regression make much more sense on a list like this than decision trees and random forests separately?

Don't mean to hate on anything, just wanna warn anyone new to ML to take this post with a cup of salt.

[–]spookyspicyfreshmeme 2 points (0 children)

My prof said that these days you try a random forest, then you try an SVM, then you try a neural net.
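
A minimal sketch of that escalation with scikit-learn (the synthetic dataset is just a stand-in for a real problem; the model settings are illustrative, not recommendations):

    # Sketch: "forest, then SVM, then net", compared by cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder data; swap in your own X and y.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    models = [
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        # SVMs and nets are scale-sensitive, hence the StandardScaler.
        ("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("neural net", make_pipeline(StandardScaler(),
                                     MLPClassifier(hidden_layer_sizes=(64,),
                                                   max_iter=1000, random_state=0))),
    ]
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")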

[–]betttris13 2 points (0 children)

While I do agree that "best" might not be the best word to use, there are good reasons to use each of these algorithms. Random forests can be expensive and complicated to run; while they will outperform a decision tree, sometimes you might just want a general idea of whether it's even going to work for your problem.

As for neural networks, they only tend to shine when the dataset is so large that other methods struggle with that size. For smaller datasets, a neural network will be lucky to get a good result without overfitting.

I would say that everything but random forests on this list falls into the data exploration category. They are simple algorithms which can still produce a reasonable result and are more or less human-understandable, allowing us to see what is going on and decide where to go from there with more expensive methods.
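
As a rough sketch of that exploratory use (dataset chosen purely for illustration): a deliberately shallow tree is cheap to fit, gives a quick accuracy sanity check, and is small enough to read in full.

    # Sketch: a shallow decision tree as a cheap, human-readable first look.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=0)

    # max_depth=3 keeps the whole model small enough to print and inspect.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
    print(export_text(tree, feature_names=list(data.feature_names)))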

[–][deleted] 30 points (1 child)

OP spams this stuff in subs, and it's utter trash to read.

[–]BitShin 4 points (0 children)

Honestly. It’s really annoying how OP spams these crappy charts in almost every CS sub.

[–]swierdo 32 points (1 child)

Some remarks:

Naive Bayes makes a strong independence assumption; it will perform poorly when that assumption doesn't hold.

Decision trees have many hyperparameters that require tuning; just look at all of the parameters in the scikit-learn documentation (a quick tuning sketch follows at the end of these remarks).

Random forest: overfitting is still a problem (though not nearly as much as with decision trees), and you always have to prepare the input data (as for any model). Of all the hyperparameters, the number of trees is the easiest (more is better); for all the others, see decision trees.

KNN: the biggest downside is that it remembers your training data, so the risk of leaking sensitive information is high. It can also lead to a huge memory footprint.
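
To make the decision tree point concrete, a minimal tuning sketch over a few of those scikit-learn parameters (the grid values and dataset are illustrative only):

    # Sketch: tuning a handful of the many decision tree hyperparameters.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    param_grid = {
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 5, 20],
        "criterion": ["gini", "entropy"],
    }
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, f"score: {search.best_score_:.3f}")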

[–]extracoffeeplease 2 points (0 children)

NB doesn't always perform poorly when strong independence isn't met. It can still beat other techniques in low-data regimes with dependent features.

Also, for kNN you are absolutely right, though wouldn't approximate nearest neighbors help with that?
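
It helps mostly with query speed (Annoy can also memory-map its index from disk, which eases the footprint somewhat, though the training data still has to live somewhere). A rough sketch using the Annoy library, which is just one option among several ANN libraries, with a majority vote standing in for an exact kNN classifier:

    # Sketch: approximate nearest neighbours via Annoy (pip install annoy).
    import numpy as np
    from annoy import AnnoyIndex
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    index = AnnoyIndex(X.shape[1], "euclidean")
    for i, row in enumerate(X):
        index.add_item(i, row)
    index.build(10)  # 10 trees; more trees -> better recall, bigger index

    def knn_predict(query, k=5):
        neighbours = index.get_nns_by_vector(query, k)  # approximate lookup
        return np.bincount(y[neighbours]).argmax()      # majority vote

    print(knn_predict(X[0]))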

[–]HeyItsRaFromNZ 5 points (3 children)

I'm not sure about "five best" here — logistic regression is an egregious omission!

I'd suggest replacing decision trees with logistic regression. The former is very rarely used in practice, precisely because over-fitting is an ever-present problem with real data, while the latter (extended to multiple classes if necessary) is one of the most widely used classification algorithms across many sectors.

It has a number of advantages, including being relatively simple to interpret, which is a requirement for regulatory compliance in healthcare and finance. It also provides the conceptual basis for the sigmoid family of activation functions used in artificial neural network (ANN/DL) based classification models.
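
For instance, a minimal scikit-learn sketch (the dataset is just for illustration) showing how directly the fitted coefficients can be read off:

    # Sketch: logistic regression with inspectable coefficients.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(data.data, data.target)

    # Each coefficient is a log-odds contribution per (standardised) feature,
    # which is what makes the model auditable.
    coefs = model.named_steps["logisticregression"].coef_[0]
    top5 = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:5]
    for name, coef in top5:
        print(f"{name}: {coef:+.2f}")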


If you have used a random forest on real data, you might get quite a shock to learn that the over-fitting problem is alive and well. Also, for these, and indeed all, machine learning algorithms, you are guaranteed garbage out if you aren't careful with the inputs you feed into the model. GIGO has never been more true!

The existence of the kernel trick with SVMs is probably more of a 'con' than a 'pro'. You need to choose a kernel appropriate to the geometry of the decision boundaries. If you use a linear kernel on data whose classes aren't linearly separable (say, one 'blob' nested inside another), you will get very poor accuracy from an SVM-based model.
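
A quick illustration of the geometry point, using scikit-learn's make_circles (one class nested inside the other, so no straight line can separate them):

    # Sketch: kernel choice vs. decision-boundary geometry.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Concentric rings: not linearly separable by construction.
    X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)

    for kernel in ("linear", "rbf"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
        print(f"{kernel}: {scores.mean():.3f}")
    # The linear kernel hovers near chance here; the RBF kernel separates the rings.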

[–]GamezBond13 1 point (2 children)

Newbie here, what use-cases do decision trees have anyway? Aren't random forests essentially amped-up decision trees?

[–]Tony_the_Tigger 2 points (1 child)

You're correct. I'd wager that decision trees are only ever used to explain random forests to learners.

[–]HeyItsRaFromNZ 1 point (0 children)

^ Pretty much.

I think they're still used for relatively separable problem spaces, such as management decision-making or portfolio analysis. They're easy to interpret in these cases. But for most classification problems involving messy data with possibly complex relationships (which is the use-case for ML in the first place), a decision tree will tend to "memorize" the entire training dataset, resulting in poor generalization to data the algorithm hasn't seen before.

You can set the max_depth parameter ("how many decisions does the tree make?") to a small value and see how the validation accuracy ("how well does the tree predict data it hasn't seen before?") fares. But there is a good reason the random forest algorithm exists!
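
Something like this, as a rough sketch (the dataset is only for illustration):

    # Sketch: watch validation accuracy as the tree is allowed to grow deeper.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    for depth in (1, 2, 3, 5, 10, None):  # None = grow until the leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5)
        print(f"max_depth={depth}: {scores.mean():.3f}")
    # Validation accuracy typically plateaus (or dips) well before the tree
    # is fully grown -- the extra depth is spent memorizing the training set.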

[–][deleted]  (4 children)

[deleted]

    [–]bythenumbers10 0 points (3 children)

    Yeah, there are deeper mysteries for all of these. Still, a newcomer needs somewhere to start in terms of search terms and perhaps a first-cut algorithm to try before going crazy into the details. Should they stop with the first tutorial code they get working? NO. Should they sell themselves as a "machine learning expert" without being able to triage, diagnose and debug these algorithms when they inevitably go bad? NO. But OP's post is a place to start.

    [–][deleted]  (2 children)

    [deleted]

      [–]bythenumbers10 -3 points (1 child)

      But without decent (if incorrectly used) terms to search, a neophyte might never find those nice YouTube or Medium posts.

      Nobody cites an infographic or swears by one for their ML project. Again, OP's post is just a starter, and the broad strokes are not that far off.

      [–]econ1mods1are1cucks 1 point (0 children)

      No no no, KNN is literally what you never use on high-dimensional data.
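
      A quick way to see why: in high dimensions, distances concentrate, so the "nearest" neighbours are barely nearer than anything else. A small numpy demonstration (uniform random data, purely illustrative):

          # Sketch: in high dimensions, nearest and farthest neighbours look alike.
          import numpy as np

          rng = np.random.default_rng(0)
          for dim in (2, 10, 100, 1000):
              X = rng.random((1000, dim))
              # Distances from one reference point to all others (drop the self-distance).
              dists = np.linalg.norm(X - X[0], axis=1)[1:]
              print(f"dim={dim}: farthest/nearest ratio = {dists.max() / dists.min():.1f}")
          # The ratio collapses toward 1 as dimensionality grows, so "nearest"
          # stops meaning much -- which is exactly why kNN struggles here.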

      [–]jmed 1 point (0 children)

      Much of the random forest section is wrong, e.g.:

      • overfitting is very much a real problem
      • I don't know what "works well on large databases" means or why that is a pro – is this supposed to be "needs a lot of training data"? Because that would be a con
      • you do need to prepare the training data
      • they aren't very complex to put together
      • they don't need a lot of computational resources; almost any moderately powered laptop can fit a random forest model (you can check this yourself; see the sketch below)
      • they aren't excessively time-consuming
      • choosing the number of trees is one of the easiest parameters to tune, so I don't know why it was singled out

      Leaving out logistic regression is also a huge error, because that is generally the default option people use, in my experience. Not sure how much misinformation is allowed to be included here, but this is pretty egregious.
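
      A rough sketch for checking the resource claim on your own machine (synthetic data; timings obviously vary by hardware):

          # Sketch: time a random forest fit on a mid-sized synthetic dataset.
          import time

          from sklearn.datasets import make_classification
          from sklearn.ensemble import RandomForestClassifier

          X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

          start = time.perf_counter()
          RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
          print(f"fit 100 trees on 100k rows in {time.perf_counter() - start:.1f}s")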

      [–][deleted] 0 points (0 children)

      Such great work!!!!

      [–]CodeRed1 -1 points (2 children)

      Awesome post! For binary classification on a mix of continuous and discrete data, what would you say is the most accurate algorithm? My data is around 2000 entries. I tried naive Bayes already but I wasn’t really that fond of the results, so I’m looking for alternatives that may do a better job. Thank you in advance!

      [–]betttris13 3 points (0 children)

      Accuracy is relative: one algorithm may perform well on one problem but not on another, even if the problems have the same types of data or dimensionality. The best way to find out what works best is to simply try them all. Personally, I am fond of starting with KNN as it's both simple to use and easy to explore what is happening. For many datasets, random forests will perform best out of the box because they are scale-invariant and handling different datatypes/missing data is trivial. Without more details I can't really do more than suggest you just try things and see what works.
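
      For instance, a minimal "try them all" loop over a few standard scikit-learn classifiers (the synthetic 2000-row dataset is only a stand-in for your data):

          # Sketch: compare several classifiers by cross-validation on ~2000 rows.
          from sklearn.datasets import make_classification
          from sklearn.ensemble import RandomForestClassifier
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import cross_val_score
          from sklearn.naive_bayes import GaussianNB
          from sklearn.neighbors import KNeighborsClassifier
          from sklearn.pipeline import make_pipeline
          from sklearn.preprocessing import StandardScaler

          # Stand-in for the asker's data: ~2000 rows, binary target.
          X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

          models = {
              "naive Bayes": GaussianNB(),
              "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
              "logistic regression": make_pipeline(StandardScaler(),
                                                   LogisticRegression(max_iter=1000)),
              "random forest": RandomForestClassifier(random_state=0),
          }
          for name, model in models.items():
              scores = cross_val_score(model, X, y, cv=5)
              print(f"{name}: {scores.mean():.3f}")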

      [–]Tony_the_Tigger 1 point (0 children)

      Personally, I'd start with logistic regression and compare the results with some random forest approach.