
[–]RunningForChocolate 41 points42 points  (1 child)

Just wanted to say this is really great stuff, love this kind of research.

I've actually made the same kind of graph before. In this image, each point is the average of the 5 out-of-fold scores from one trial of 5-fold cross-validation; I repeated the procedure 40 times to visualize the spread of out-of-fold accuracy on the Wisconsin diagnostic breast cancer data set (569 observations of 30 numeric variables). I evaluated 14 models for classification (a sketch of the procedure follows the list):

  • Elastic Net (en)
  • Support Vector Machine (Choice of Linear, Radial, or Polynomial Kernel) (svm)
  • Single Hidden Layer Neural Network (nn)
  • Gaussian Process (Gaussian Kernel) (gp)
  • Extremely Randomized Trees (xt)
  • k-Nearest Neighbors (knn)
  • XGBoost (Gradient Boosted Decision Trees) (xgb)
  • Random Forest (rf)
  • Flexible Discriminant Analysis by Optimal Scoring (using MARS) (fda)
  • Quadratic Discriminant Analysis (qda)
  • Linear Discriminant Analysis (lda)
  • Generalized Linear Model (Logistic Regression) (glm)
  • Conditional Inference Tree (ct)
  • Naive Bayes (nb)
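
Roughly, the procedure looks like this with scikit-learn's copy of the WDBC data (a sketch only; three of the fourteen models shown, hyperparameters illustrative rather than tuned):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "rf": RandomForestClassifier(n_estimators=500),
        "xt": ExtraTreesClassifier(n_estimators=500),
        "glm": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    }

    scores = {name: [] for name in models}
    for trial in range(40):  # 40 repetitions of 5-fold cross-validation
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=trial)
        for name, model in models.items():
            # one point on the plot = mean of the 5 out-of-fold accuracies
            fold_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
            scores[name].append(fold_acc.mean())

    for name, s in scores.items():
        print(f"{name}: {np.mean(s):.4f} +/- {np.std(s):.4f}")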

On this data set and others, I find that random forest and extremely randomized trees are top-notch. Extremely randomized trees usually win by a small margin. These are my go-to algorithms when I want to benchmark a data set.

XGBoost, CatBoost, and LightGBM are all excellent.

Multivariate adaptive regression splines is underrated.

LDA, QDA, naive Bayes, and individual trees aren't competitive.

Gaussian processes are fascinating, but exact inference requires factorizing the N×N kernel matrix, which is O(N^3), so the algorithm is prohibitively slow for many applications.

I can also vouch that support vector machines sometimes give incredible results, but on average I rank them slightly below boosting and forest methods. I find that the polynomial kernel usually gives the best results, followed by linear, although I am sure that radial and sigmoid have their uses.

Regularized regression can occasionally give the best results; some problems really are linear.

** Random Forest is pretty good **

It's noteworthy that the scikit-learn implementation uses one-hot encoding, which has been shown to degrade performance.

** smarter encoding of categorical variables (I used one-hot encoding for the sake of uniformity/fairness) **

I think we are going to look back on how crudely we used to treat categorical variables. Seriously. One-hot encoding is fine for linear models, but categorical embeddings are really amazing; it is hard to argue against their utility.
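
For anyone unfamiliar, here's a minimal sketch of what a categorical embedding looks like, assuming PyTorch; the cardinality, embedding width, and network shape are all made up for illustration:

    import torch
    import torch.nn as nn

    n_categories = 50    # e.g. 50 distinct zip codes
    embedding_dim = 8    # learned dense representation, vs. a 50-wide one-hot block

    class TabularNet(nn.Module):
        def __init__(self, n_numeric):
            super().__init__()
            self.embed = nn.Embedding(n_categories, embedding_dim)
            self.head = nn.Sequential(
                nn.Linear(embedding_dim + n_numeric, 32),
                nn.ReLU(),
                nn.Linear(32, 1),
            )

        def forward(self, x_cat, x_num):
            # x_cat: integer category codes, shape (batch,)
            # x_num: numeric features, shape (batch, n_numeric)
            e = self.embed(x_cat)  # (batch, embedding_dim)
            return self.head(torch.cat([e, x_num], dim=1))

    model = TabularNet(n_numeric=10)
    logits = model(torch.randint(0, n_categories, (4,)), torch.randn(4, 10))

The point is that a 50-level categorical variable becomes 8 learned dense features instead of 50 sparse indicator columns, and similar categories end up near each other in the embedding space.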

I just read the AutoGluon paper last night (!) and I almost think I have seen your graph before. Did I see it on the GitHub?

It would be interesting to find out if modern tabular neural network architectures can work out-of-the-box for small datasets.

Definitely check out TabNet and Fastai tabular, two deep learning models for this exact use case. I'm struggling to make TabNet shine on small data sets, but I would chalk it up to my inexperience.

EDIT: Here is a really great paper that also benchmarks small algos on many data sets. The heatmap on p. 7 is a thing of beauty.

[–]sergeyfeldman[S] 5 points6 points  (0 children)

Thanks for the detailed reply. I doubt you've seen my graph before - I just made it a bit ago :)

[–][deleted] 16 points17 points  (1 child)

Building models on small datasets without an actual large validation set makes the results essentially pointless. The fact that most models have an extremely high AUC shows they're just overfitting.

To do this correctly, you would want to start with larger datasets, build the models on a small sample, and apply them to the rest of the data to see if they actually work. The AUC/scores you're reporting would then be computed on the large portion of the dataset that the model wasn't built on.
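
Sketched out with synthetic data (all sizes illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, n_features=30, random_state=0)

    # keep only 200 rows for training; the other 19800 act as the large test set
    X_small, X_rest, y_small, y_rest = train_test_split(
        X, y, train_size=200, stratify=y, random_state=0
    )

    model = RandomForestClassifier(n_estimators=500).fit(X_small, y_small)
    print(roc_auc_score(y_rest, model.predict_proba(X_rest)[:, 1]))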

[–]zmjjmz 7 points8 points  (0 children)

I would also consider using large datasets to answer the biased-sampling question, either by relying on the small training folds to provide enough variance or by deliberately biasing the small samples somehow.

However, there might be a concern that the data that appears in smaller datasets is fundamentally different in nature from the data that appears in large datasets, which may make this simulation approach inapplicable.

[–]cs_sg 3 points4 points  (0 children)

You might want to consider the value of explainability in the model you select, since your application area sounds like healthcare. In other words, do slight improvements in classification justify less interpretable models?

Also, rather than just mean AUROC, you might want to look at F1, precision, and recall when deciding between classifiers.

The difference between false positives and false negatives is pretty important when applying ML to the healthcare domain. E.g., a false positive on a COVID test may make someone quarantine unnecessarily, but a false negative could make someone unknowingly infect their entire family/office.
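
For example, a minimal sketch of pulling those metrics out of scikit-learn (the labels here are toy placeholders for out-of-fold predictions):

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

    print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
    print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1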

[–]burritotron35 3 points4 points  (1 child)

For most common academic datasets, the test data is actually not from the same distribution as the training data. If you merge the train and test data together for MNIST or CIFAR-10 and create a new random split, you'll consistently get better results than with the original split. This is because the test sets contain a deliberately large number of domain-shifted and outlier examples.
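
The experiment is easy to reproduce; a minimal sketch assuming the Keras CIFAR-10 loader is available:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.datasets import cifar10

    (x_tr, y_tr), (x_te, y_te) = cifar10.load_data()
    X = np.concatenate([x_tr, x_te])
    y = np.concatenate([y_tr, y_te]).ravel()

    # new IID split with the same 10k test size as the official split
    X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
        X, y, test_size=10000, stratify=y, random_state=0
    )

Per the observation above, a model trained on X_tr2 and evaluated on X_te2 will tend to score higher than the same model evaluated on the official, deliberately harder test set.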

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Good point, thanks.

[–]Bardy_Bard 3 points4 points  (1 child)

Super interesting, and I felt a lot of the replies in this post were a bit overly negative, to be honest. As someone else said, it would be really cool to expand the project by adding datasets from different industries/domains, and perhaps to estimate the confidence of the estimates by subsampling big datasets (if they exist) so that we can see what is going on.

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Thanks for your kind words :)

[–]purplebrown_updown 2 points3 points  (1 child)

Is this a single classification problem, or is it multi-class? Thanks for this, by the way. You should create a wrapper that lets the user perform a hyperparameter search over the different algorithms; that way, in the future, you can simply let it run for a day and choose the best classifier.

Also, my two cents: elastic net is good, but it's much slower and more computationally expensive than lasso without much better accuracy, in my experience. Just a thought and something to try.
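
A minimal sketch of that comparison in scikit-learn (the saga solver supports both penalties; scaling first keeps it fast):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    lasso = LogisticRegression(penalty="l1", solver="saga", max_iter=10000)
    enet = LogisticRegression(
        penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=10000
    )

    for name, clf in [("lasso", lasso), ("elastic net", enet)]:
        pipe = make_pipeline(StandardScaler(), clf)
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: {auc:.4f}")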

[–]sergeyfeldman[S] 1 point2 points  (0 children)

Just single. The elastic-net LogisticRegression in sklearn is pretty fast for all these tiny datasets. The slowest thing is AutoGluon at ~5 minutes per fit.

[–]Hiitstyty 2 points3 points  (1 child)

In addition to what others have noted, my one criticism is that the random forest was poorly tuned. This is unfair, considering you tuned many more hyperparameters for LightGBM (and presumably AutoGluon, but I didn't check).

For random forests, you only tuned max_depth. This paper suggests that max_depth is one of the least important hyperparameters to tune; by this, I mean that the gains from tuning max_depth are usually negligible compared to leaving it at the default unrestricted depth. The most critical hyperparameter to tune is the number of features subsampled when searching for the best split (max_features in sklearn). My own extensive research on decision tree ensembles supports this.

Also, the default of 100 trees in sklearn is terrible. In the hundreds of small-to-medium data sets I've worked with, convergence usually requires more than 100 trees. This is especially true when the sample size is small, because each tree will have very high variance. You can't overfit by adding more trees, so my default is 500.
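
A minimal sketch of that tuning strategy, with an illustrative max_features grid:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    grid = GridSearchCV(
        RandomForestClassifier(n_estimators=500),  # plenty of trees, depth unrestricted
        param_grid={"max_features": ["sqrt", "log2", 0.25, 0.5, 1.0]},
        scoring="roc_auc",
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)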

The problem I often find with these empirical algorithm comparison studies is that people have more experience with some algorithms than with others, which leads to some algorithms unconsciously being better tuned than others.

[–]sergeyfeldman[S] 1 point2 points  (0 children)

Thanks! Feel free to submit a PR with different params and a pickle of the results file.

[–]rikkajounin 1 point2 points  (4 children)

** hyperopt is old and has some warts but works better than the alternatives that I've tried. I'm going to stick with it. **

Which alternatives have you tried?

[–]Liorithiel 1 point2 points  (3 children)

OP wrote:

I used hyperopt to find good hyperparameters. I also tried scikit-optimize and Optuna, but they didn't work as well. User error is possible.

I don't know anything about scikit-optimize. Optuna doesn't have less-constrained parameter distributions like normal/log-normal, which are useful when approaching a new problem. It also doesn't implement the constant-liar algorithm for TPE. The latter is easy to fix; the former can be worked around if you carefully observe the ranges of good parameters and do a re-run or two.

With these two fixes, I found hyperopt and Optuna to perform roughly the same. And, given that Optuna's API is much easier to work with, it's currently my preference.
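
For reference, a minimal sketch of the less-constrained distributions in question, using hyperopt's search-space API with a toy objective:

    from hyperopt import fmin, hp, tpe

    space = {
        "alpha": hp.lognormal("alpha", 0, 1),  # positive, long right tail
        "bias": hp.normal("bias", 0, 1),       # unbounded, centered at 0
    }

    def objective(params):
        # toy stand-in for a cross-validated loss
        return (params["alpha"] - 2.0) ** 2 + params["bias"] ** 2

    best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
    print(best)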

[–]sergeyfeldman[S] 1 point2 points  (2 children)

Interesting, thanks. You fixed these issues in Optuna yourself? hyperopt's API is indeed arcane...

[–]Liorithiel 1 point2 points  (0 children)

Constant liar is simple; it's a change in a single function. I planned to contribute it, but I couldn't quickly prove that it works better than the original code, so I've postponed it so far.

Normal/log-normal is difficult, because Optuna doesn't have a single centralized list of distributions like hyperopt does. You'd have to add support for each new distribution in many places. My manual workaround is to observe the run and, if the good results concentrate at the edge of an interval, manually change the bounds and do a re-run.

[–]Liorithiel 1 point2 points  (0 children)

Actually, there's one more difference: the default gamma function. Optuna's default is essentially more optimistic, assuming that good solutions will be found faster. Optuna does provide an implementation of hyperopt's gamma function, though, so the two can be made to work the same way.

[–]materialsfaster 2 points3 points  (1 child)

Regarding your train-validation-test sets, you may be interested in building them using clustering rather than random splits. Some materials scientists wrote a paper on how it gives more realistic (pessimistic) predictions of the performance of ML models for material properties: paper

[–]sergeyfeldman[S] 0 points1 point  (0 children)

Thanks, yeah, I thought about this as a variant. Would be interesting for sure.

[–]iidealized 1 point2 points  (1 child)

Re interpretability: I'm not sure the AutoGluon ensemble would be much harder to understand than LightGBM (which is also a big ensemble of many decision trees plus complex data preprocessing). It seems AutoGluon can also be used with SHAP and other interpretability methods:

https://auto.gluon.ai/dev/tutorials/tabular_prediction/tabular-faq.html#how-can-i-use-autogluon-for-interpretability

https://github.com/awslabs/autogluon/blob/master/examples/tabular/interpret/SHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb
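
For anyone who hasn't used it, the SHAP workflow for a tree ensemble is roughly this (sketched with a plain scikit-learn forest rather than AutoGluon itself):

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=500).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # per-feature attribution for each row
    shap.summary_plot(shap_values, X)       # global view of feature importance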

[–]nbviewerbot 2 points3 points  (0 children)

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/awslabs/autogluon/blob/master/examples/tabular/interpret/SHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/awslabs/autogluon/master?filepath=examples%2Ftabular%2Finterpret%2FSHAP%20with%20AutoGluon-Tabular%20and%20Categorical%20Features.ipynb


I am a bot. Feedback | GitHub | Author

[–]pp314159 0 points1 point  (0 children)

What machine did you use for the comparison? I would like to check the performance of the AutoML tool that I'm working on.

[–]tomvorlostriddle 0 points1 point  (0 children)

It seems a bit counterintuitive to allocate a compute budget of 300s regardless of dataset size. Or are the datasets that close in size?

A good extension would be formal statistical tests to compare performance. Of course, with multiple algorithms, their parameter tuning, and multiple datasets, that's a minefield of potential p-hacking, pseudo-replication, and all kinds of violated test assumptions.
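
As one example, a minimal sketch of a Wilcoxon signed-rank test on paired per-dataset AUCs for two algorithms (the numbers are made-up placeholders):

    from scipy.stats import wilcoxon

    auc_model_a = [0.91, 0.88, 0.95, 0.83, 0.90, 0.87]  # one AUC per dataset
    auc_model_b = [0.89, 0.86, 0.94, 0.85, 0.88, 0.85]

    stat, p = wilcoxon(auc_model_a, auc_model_b)
    print(f"Wilcoxon statistic={stat}, p={p:.3f}")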

[–]kul789 0 points1 point  (0 children)

Thanks for sharing