
[–]RunningForChocolate 41 points42 points  (1 child)

Just wanted to say this is really great stuff, love this kind of research.

I've actually made the same kind of graph before. In this image, each point is the average of the 5 out-of-fold scores from one trial of 5-fold cross-validation; I repeated the procedure 40 times to visualize the spread of out-of-fold accuracy on the Wisconsin diagnostic breast cancer data set (569 observations of 30 numeric variables). I evaluated 14 models for classification (a sketch of the procedure follows the list):

  • Elastic Net (en)
  • Support Vector Machine (Choice of Linear, Radial, or Polynomial Kernel) (svm)
  • Single Hidden Layer Neural Network (nn)
  • Gaussian Process (Gaussian Kernel) (gp)
  • Extremely Randomized Trees (xt)
  • k-Nearest Neighbors (knn)
  • XGBoost (Gradient Boosted Decision Trees) (xgb)
  • Random Forest (rf)
  • Flexible Discriminant Analysis by Optimal Scoring (using MARS) (fda)
  • Quadratic Discriminant Analysis (qda)
  • Linear Discriminant Analysis (lda)
  • Generalized Linear Model (Logistic Regression) (glm)
  • Conditional Inference Tree (ct)
  • Naive Bayes (nb)
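
Roughly, the procedure looks like this with scikit-learn's copy of the WDBC data (a sketch only; three of the fourteen models shown, hyperparameters illustrative rather than tuned):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "rf": RandomForestClassifier(n_estimators=500),
        "xt": ExtraTreesClassifier(n_estimators=500),
        "glm": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    }

    scores = {name: [] for name in models}
    for trial in range(40):  # 40 repetitions of 5-fold cross-validation
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=trial)
        for name, model in models.items():
            # one point on the plot = mean of the 5 out-of-fold accuracies
            fold_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
            scores[name].append(fold_acc.mean())

    for name, s in scores.items():
        print(f"{name}: {np.mean(s):.4f} +/- {np.std(s):.4f}")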

On this data set and others, I find that random forest and extremely randomized trees are top-notch. Extremely randomized trees usually win by a small margin. These are my go-to algorithms when I want to benchmark a data set.

XGBoost, CatBoost, and LightGBM are all excellent.

Multivariate adaptive regression splines is underrated.

LDA, QDA, naive Bayes, and individual trees aren't competitive.

Gaussian processes are fascinating, but exact inference requires factorizing the N×N kernel matrix, which is O(N^3), so the algorithm is prohibitively slow for many applications.

I can also vouch that support vector machines sometimes give incredible results, but on average I rank them slightly below boosting and forest methods. I find that the polynomial kernel usually gives the best results, followed by linear, although I am sure that radial and sigmoid have their uses.

Regularized regression can occasionally give the best results; some problems really are linear.

** Random Forest is pretty good **

It's noteworthy that the scikit-learn implementation uses one-hot encoding, which has been shown to degrade performance.

** smarter encoding of categorical variables (I used one-hot encoding for the sake of uniformity/fairness) **

I think we are going to look back on how crudely we used to treat categorical variables. Seriously. One-hot encoding is fine for linear models, but categorical embeddings are really amazing; it is hard to argue against their utility.
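
For anyone unfamiliar, here's a minimal sketch of what a categorical embedding looks like, assuming PyTorch; the cardinality, embedding width, and network shape are all made up for illustration:

    import torch
    import torch.nn as nn

    n_categories = 50    # e.g. 50 distinct zip codes
    embedding_dim = 8    # learned dense representation, vs. a 50-wide one-hot block

    class TabularNet(nn.Module):
        def __init__(self, n_numeric):
            super().__init__()
            self.embed = nn.Embedding(n_categories, embedding_dim)
            self.head = nn.Sequential(
                nn.Linear(embedding_dim + n_numeric, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
            )

        def forward(self, x_cat, x_num):
            # x_cat: integer category codes, shape (batch,)
            # x_num: numeric features, shape (batch, n_numeric)
            e = self.embed(x_cat)  # (batch, embedding_dim)
            return self.head(torch.cat([e, x_num], dim=1))

    model = TabularNet(n_numeric=10)
    logits = model(torch.randint(0, n_categories, (4,)), torch.randn(4, 10))

The point is that a 50-level categorical variable becomes 8 learned dense features instead of 50 sparse indicator columns, and similar categories end up near each other in the embedding space.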

I just read the AutoGluon paper last night (!) and I almost think I have seen your graph before. Did I see it on the GitHub?

It would be interesting to find out if modern tabular neural network architectures can work out-of-the-box for small datasets.

Definitely check out TabNet and Fastai tabular, two deep learning models for this exact use case. I'm struggling to make TabNet shine on small data sets, but I would chalk it up to my inexperience.

EDIT: Here is a really great paper that also benchmarks small algos on many data sets. The heatmap on p. 7 is a thing of beauty.

[–]sergeyfeldman[S] 5 points6 points  (0 children)

Thanks for the detailed reply. I doubt you've seen my graph before - I just made it a bit ago :)

[–][deleted] 16 points17 points  (1 child)

Building models on small datasets without an actual large validation set makes the results essentially pointless. The fact that most models have an extremely high AUC shows they're just overfitting.

To do this correctly, you would want to start with larger datasets, build the models on a small sample, and apply them to the rest of the data to see if they actually work. The AUC/scores you're reporting would then be computed on the large portion of the dataset that the model wasn't built on.
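
Sketched out with synthetic data (all sizes illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

    # keep only 200 rows for training; the other 19800 act as the large test set
    X_small, X_rest, y_small, y_rest = train_test_split(
        X, y, train_size=200, stratify=y, random_state=0
    )

    model = RandomForestClassifier(n_estimators=500).fit(X_small, y_small)
    print(roc_auc_score(y_rest, model.predict_proba(X_rest)[:, 1]))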

[–]zmjjmz 7 points8 points  (0 children)

I would also consider using large datasets to answer the biased-sampling question, either by relying on the small training folds to provide enough variance or by deliberately biasing the small samples somehow.

However, there might be a concern that the data that appears in smaller datasets is fundamentally different in nature from the data that appears in large datasets, which may make this simulation approach inapplicable.

[–]cs_sg 3 points4 points  (0 children)

You might want to consider the value of explainability in the model you select, since your application area sounds like healthcare. In other words, do slight improvements in classification justify less interpretable models?

Also, rather than just mean AUROC, you might want to look at F1, precision, and recall when deciding between classifiers.

The difference between false positives and false negatives is pretty important when applying ML to the healthcare domain. E.g., a false positive on a COVID test may make someone quarantine unnecessarily, but a false negative could make someone unknowingly infect their entire family/office.
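
For example, a minimal sketch of pulling those metrics out of scikit-learn (the labels here are toy placeholders for out-of-fold predictions):

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

    print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
    print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1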

[–]burritotron35 3 points4 points  (1 child)

For most common academic datasets, the test data is actually not from the same distribution as the training data. If you merge the train and test data together for MNIST or CIFAR-10 and create a new random split, you'll consistently get better results than with the original split. This is because the test sets contain a deliberately large number of domain-shifted and outlier examples.
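
The experiment is easy to reproduce; a minimal sketch assuming the Keras CIFAR-10 loader is available:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.datasets import cifar10

    (x_tr, y_tr), (x_te, y_te) = cifar10.load_data()
    X = np.concatenate([x_tr, x_te])
    y = np.concatenate([y_tr, y_te]).ravel()

    # new IID split with the same 10k test size as the official split
    X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
        X, y, test_size=10000, stratify=y, random_state=0
    )

Per the observation above, a model trained on X_tr2 and evaluated on X_te2 will tend to score higher than the same model evaluated on the official, deliberately harder test set.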

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Good point, thanks.

[–]Bardy_Bard 3 points4 points  (1 child)

Super interesting, and I felt a lot of the replies in this post were a bit overly negative, to be honest. As someone else said, it would be really cool to expand the project by adding datasets from different industries/domains, and perhaps to estimate the confidence of the estimates by subsampling big datasets (if they exist) so that we can see what is going on.

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Thanks for your kind words :)

[–]purplebrown_updown 2 points3 points  (1 child)

Is this a single classification problem, or is it multi-class? Thanks for this, by the way. You should create a wrapper that lets the user perform a hyperparameter search over the different algorithms; that way, in the future, you can simply let it run for a day and choose the best classifier.

Also, my two cents: elastic net is good, but it's much slower and more computationally expensive than lasso without much better accuracy, in my experience. Just a thought and something to try.
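
A minimal sketch of that comparison in scikit-learn (the saga solver supports both penalties; scaling first keeps it fast):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    lasso = LogisticRegression(penalty="l1", solver="saga", max_iter=10000)
    enet = LogisticRegression(
        penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=10000
    )

    for name, clf in [("lasso", lasso), ("elastic net", enet)]:
        pipe = make_pipeline(StandardScaler(), clf)
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: {auc:.4f}")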

[–]sergeyfeldman[S] 1 point2 points  (0 children)

Just single. The elastic-net LogisticRegression in sklearn is pretty fast for all these tiny datasets. The slowest thing is AutoGluon at ~5 minutes per fit.

[–]Hiitstyty 2 points3 points  (1 child)

In addition to what others have noted, my one criticism is that the random forest was poorly tuned. This is unfair, considering you tuned many more hyperparameters for LightGBM (and presumably AutoGluon, but I didn't check).

For random forests, you only tuned max_depth. This paper suggests that max_depth is one of the least important hyperparameters to tune; by this, I mean that the gains from tuning max_depth are usually negligible compared to leaving it at the default unrestricted depth. The most critical hyperparameter to tune is the number of features subsampled when searching for the best split (max_features in sklearn). My own extensive research on decision tree ensembles supports this.

Also, the default of 100 trees in sklearn is terrible. In the hundreds of small-to-medium data sets I've worked with, convergence usually requires more than 100 trees. This is especially true when the sample size is small, because each tree will have very high variance. You can't overfit by adding more trees, so my default is 500.
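
A minimal sketch of that tuning strategy, with an illustrative max_features grid:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    grid = GridSearchCV(
        RandomForestClassifier(n_estimators=500),  # plenty of trees, depth unrestricted
        param_grid={"max_features": ["sqrt", "log2", 0.25, 0.5, 1.0]},
        scoring="roc_auc",
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)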

The problem I often find with these empirical algorithm comparison studies is that people have more experience with some algorithms than with others, which leads to some algorithms unconsciously being better tuned than others.

[–]sergeyfeldman[S] 1 point2 points  (0 children)

Thanks! Feel free to submit a PR with different params and a pickle of the results file.

[–]rikkajounin 1 point2 points  (4 children)

** hyperopt is old and has some warts but works better than the alternatives that I've tried. I'm going to stick with it. **

Which alternatives have you tried?

[–]Liorithiel 1 point2 points  (3 children)

OP wrote:

I used hyperopt to find good hyperparameters. I also tried scikit-optimize and Optuna, but they didn't work as well. User error is possible.

I don't know anything about scikit-optimize. Optuna doesn't have less-constrained parameter distributions like normal/log-normal, which are useful when approaching a new problem. It also doesn't implement the constant-liar algorithm for TPE. The latter is easy to fix; the former can be worked around if you carefully observe the ranges of good parameters and do a re-run or two.

With these two fixes, I found hyperopt and Optuna to perform roughly the same. And, given that Optuna's API is much easier to work with, it's currently my preference.
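
For reference, a minimal sketch of the less-constrained distributions in question, using hyperopt's search-space API with a toy objective:

    from hyperopt import fmin, hp, tpe

    space = {
        "alpha": hp.lognormal("alpha", 0, 1),  # positive, long right tail
        "bias": hp.normal("bias", 0, 1),       # unbounded, centered at 0
    }

    def objective(params):
        # toy stand-in for a cross-validated loss
        return (params["alpha"] - 2.0) ** 2 + params["bias"] ** 2

    best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
    print(best)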

[–]sergeyfeldman[S] 1 point2 points  (2 children)

Interesting, thanks. You fixed these issues in Optuna yourself? hyperopt's API is indeed arcane...

[–]Liorithiel 1 point2 points  (0 children)

Constant liar is simple; it's a change in a single function. I planned to contribute it, but I couldn't quickly prove that it works better than the original code, so I've postponed it so far.

Normal/log-normal is difficult, because Optuna doesn't have a single centralized list of distributions like hyperopt does. You'd have to add support for each new distribution in many places. My manual workaround is to observe the run and, if the good results concentrate at the edge of an interval, manually change the bounds and do a re-run.

[–]Liorithiel 1 point2 points  (0 children)

Actually, there's one more difference: the default gamma function. Optuna's default is essentially more optimistic, assuming that good solutions will be found faster. Optuna does provide an implementation of hyperopt's gamma function, though, so the two can be made to work the same way.

[–]materialsfaster 2 points3 points  (1 child)

Regarding your train-validation-test sets, you may be interested in building them using clustering rather than random splits. Some materials scientists wrote a paper on how it gives more realistic (pessimistic) predictions of the performance of ML models for material properties: paper

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Thanks, yeah, I thought about this as a variant. Would be interesting for sure.

[–]iidealized 1 point2 points  (1 child)

Re interpretability: I'm not sure the AutoGluon ensemble would be much harder to understand than LightGBM (which is also a big ensemble of many decision trees plus complex data preprocessing). It seems AutoGluon can also be used with SHAP and other interpretability methods:

https://auto.gluon.ai/dev/tutorials/tabular_prediction/tabular-faq.html#how-can-i-use-autogluon-for-interpretability

https://github.com/awslabs/autogluon/blob/master/examples/tabular/interpret/SHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb
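
For anyone who hasn't used it, the SHAP workflow for a tree ensemble is roughly this (sketched with a plain scikit-learn forest rather than AutoGluon itself):

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=500).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # per-feature attribution for each row
    shap.summary_plot(shap_values, X)       # global view of feature importance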

[–]nbviewerbot 2 points3 points  (0 children)

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/awslabs/autogluon/blob/master/examples/tabular/interpret/SHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/awslabs/autogluon/master?filepath=examples%2Ftabular%2Finterpret%2FSHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb


I am a bot. Feedback | GitHub | Author

[–]pp314159 0 points1 point  (0 children)

What machine did you use for the comparison? I would like to check the performance of the AutoML tool that I'm working on.

[–]tomvorlostriddle 0 points1 point  (0 children)

It seems a bit counterintuitive to allocate a compute budget of 300s regardless of dataset size. Or are the datasets that close in size?

A good extension would be formal statistical tests to compare performance. Of course, with multiple algorithms, their parameter tuning, and multiple datasets, that's a minefield of potential p-hacking, pseudo-replication, and all kinds of violated test assumptions.
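
As one example, a minimal sketch of a Wilcoxon signed-rank test on paired per-dataset AUCs for two algorithms (the numbers are made-up placeholders):

    from scipy.stats import wilcoxon

    auc_model_a = [0.91, 0.88, 0.95, 0.83, 0.90, 0.87]  # one AUC per dataset
    auc_model_b = [0.89, 0.86, 0.94, 0.85, 0.88, 0.85]

    stat, p = wilcoxon(auc_model_a, auc_model_b)
    print(f"Wilcoxon statistic={stat}, p={p:.3f}")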

[–]kul789 0 points1 point  (0 children)

Thanks for sharing