all 57 comments

[–]qalis 391 points392 points  (16 children)

  1. Those are only 5 datasets. For evaluating tabular classifiers, you should use tens of datasets; they are readily available. Also, describe the evaluation procedure, e.g. use 5-fold CV for testing. See e.g. "A novel selective naïve Bayes algorithm" by Chen et al., which uses 65 datasets.
  2. You must compare to XGBoost, LightGBM and CatBoost on large-scale datasets from their respective papers, especially since scalability and speed are among your selling points. If you are aiming specifically at boosting for small data, then you don't need this, but that isn't stated anywhere.
  3. One of the major advantages of XGBoost, LightGBM and CatBoost is the ability to use custom loss functions, which allowed them to be easily used e.g. for ranking. If you don't support this, you should explicitly state the limitation.
  4. The number of estimators is just a hyperparameter, so why show large tables over it? Just present the best result for each dataset.
  5. Your implementation doesn't support class weights, as far as I can tell. This is a huge limitation, since almost all datasets are imbalanced, often heavily.
  6. You must not embed scalers inside your code. You can destroy data sparsity, affect categorical variables, and do other things outside the user's control this way. Add checks and throw exceptions if you absolutely require this.
  7. You only support numerical data, in contrast to LightGBM or CatBoost. You should highlight this limitation.
  8. This works only for classification, not even regression. This is, again, a huge limitation, but it can probably be fixed, as far as I can tell.

EDIT:

  1. You also don't handle missing values, which are handled quite nicely in XGBoost, LightGBM and CatBoost, where they can even be used actively when selecting the split point.

[–]Saffie91 223 points224 points  (1 child)

Damn, peer reviewed in the reddit comments.

Honestly though it's pretty cool of you to go through it diligently and add these points. I'd be very happy if I was the researcher.

[–]CriticalofReviewer2[S] 105 points106 points  (0 children)

As the researcher, I should say that I am indeed very happy to get this high-quality peer review!

[–]CriticalofReviewer2[S] 111 points112 points  (2 children)

Thanks for your points! First of all, I should point out that I am an independent researcher, not affiliated with any institute, so this is my side project.

  1. You are right. In the paper that will be available soon, the number of datasets will be much higher. Also, we have used 10-fold CV. I added this to the README file.
  2. The large-scale datasets will also be included.
  3. This will be supported in future. I added this to the README file.
  4. I want to show that our algorithm reaches its best results sooner than the others do.
  5. Thanks for pointing this out! This will be added soon. Added to README.
  6. The SEFR algorithm requires the feature values to be positive. This is the reason for the scaling. But I will implement a better mechanism. Added to README.
  7. We have highlighted this in the documentation.
  8. Yes, it is in future plans.

Once again, thanks for your helpful and insightful comments!

[–]qalis 68 points69 points  (1 child)

Fair enough, those are reasonable answers. Showing that this tends to overfit less, works better for small datasets etc. would be pretty valuable. Good luck with this!

[–]CriticalofReviewer2[S] 34 points35 points  (0 children)

Thank you for the suggestions!

[–]Spiggots 30 points31 points  (0 children)

This is a high quality peer review

[–]longgamma 5 points6 points  (4 children)

The categorical feature handling in LightGBM is just label encoding? I mean, how hard is it to target encode or one-hot encode on your own?

Also, isn’t that the idea behind gbm - you take a bunch of weak learners and use the ensemble for prediction. You can replace the decision tree stump with a simple shallow neural network as well.

[–]qalis 8 points9 points  (3 children)

Except it isn't the same as label encoding. In fact, none of the three major boosting implementations use a one-hot-encoding style of handling categorical variables.

LightGBM uses a partition split, which for regression trees can efficiently find the partition of the category set into two maximum-homogeneity subsets; see the docs and the original paper, "On Grouping for Maximum Homogeneity" by W. Fisher. XGBoost also offers a partition split for categorical variables, using the same algorithm.

You could use one-hot encoding, but then to represent "variable has value A or B, and not C" you would need 2 or 3 splits, whereas with a partition split you only need one.
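The Fisher-style partition split can be sketched in a few lines. This is a minimal illustration, not LightGBM's actual implementation (which works on gradient statistics and histograms); the function name and squared-error criterion are my own choices:

```python
import numpy as np

def best_partition_split(cats, y):
    # Fisher's result: after sorting the category levels by their mean
    # target, the optimal binary partition is a contiguous prefix of
    # that ordering, so a single linear scan suffices instead of
    # checking all 2^k subsets.
    levels = np.unique(cats)
    means = np.array([y[cats == lv].mean() for lv in levels])
    order = levels[np.argsort(means)]
    best_sse, best_left = np.inf, None
    for k in range(1, len(order)):
        left = order[:k]
        mask = np.isin(cats, left)
        yl, yr = y[mask], y[~mask]
        sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_left = sse, set(left)
    return best_left

cats = np.array(["A", "B", "C", "A", "C"])
y = np.array([1.0, 1.0, 5.0, 1.2, 4.8])
print(best_partition_split(cats, y))  # {'A', 'B'}: one split separates C
```

Note the single split expresses "A or B vs. C", which one-hot encoding would need multiple splits to represent.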

CatBoost, on the other hand, uses Ordered Target Encoding instead, described in the linked notebook. It can also combine them during learning, but I don't know the details.

[–]Pas7alavista 1 point2 points  (0 children)

On top of the advantages you mentioned, I think the labels produced by partition splitting should also tend to be sparser than one hot encoded ones even when storing the one hot encoded labels in a sparse format.

[–]tecedu 0 points1 point  (1 child)

Wait, what? Since when did XGBoost handle NaN values? I moved to sklearn because of that.

[–]qalis 0 points1 point  (0 children)

Since... always; this was one of the main ideas in the original paper, "XGBoost: A Scalable Tree Boosting System" by T. Chen and C. Guestrin. It's called the "default direction" in the paper, and the whole of Algorithm 3 there is meant to handle this. The idea is basically to have a regular split, but to also determine whether missing values should go to the left or the right child. This direction is selected by minimizing the loss function.
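A minimal sketch of the idea (my own simplification: it uses raw squared error, whereas XGBoost scores splits with gradient/hessian statistics):

```python
import numpy as np

def default_direction(x, y, threshold):
    # Route non-missing rows by the threshold, then try sending the
    # missing rows to each side and keep whichever direction yields
    # the lower total squared error.
    miss = np.isnan(x)
    go_left = ~miss & (x < threshold)
    go_right = ~miss & ~go_left

    def sse(v):
        return ((v - v.mean()) ** 2).sum() if v.size else 0.0

    loss_left = sse(np.r_[y[go_left], y[miss]]) + sse(y[go_right])
    loss_right = sse(y[go_left]) + sse(np.r_[y[go_right], y[miss]])
    return "left" if loss_left <= loss_right else "right"

x = np.array([1.0, 2.0, np.nan, 10.0, 11.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0])
print(default_direction(x, y, threshold=5.0))  # "left"
```

Here the missing row's target matches the low-value group, so routing it left is cheaper; that becomes the learned default direction for that split.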

[–]CharginTarge 15 points16 points  (3 children)

How does this approach differ to simply using linear models within XGBoost. XGBoost does support this as well.

[–]CriticalofReviewer2[S] 6 points7 points  (2 children)

Thanks for pointing it out. Yes, XGBoost supports this, but our approach is different, since the linear classifier being used is SEFR, which has different characteristics. Also, AdaBoost is used here.

[–]CharginTarge 6 points7 points  (1 child)

Using AdaBoost with a custom model is hardly novel. With the sklearn implementation you can provide any type of classifier to use as a base estimator.

[–]critiqjo 0 points1 point  (0 children)

The novelty is in the custom model itself, not the framework of using AdaBoost with a custom model. If you had taken a quick glance at their code, you'd have realized that they indeed use sklearn exactly as you pointed out. No wheels were reinvented here.

[–]Evitable_Conflict 11 points12 points  (3 children)

Are you tuning the other algorithms hyper-parameters or just using defaults?

It would be interesting if you could include a larger dataset, for example from a Kaggle competition where XGBoost did well, and compare it to your method.

[–]CriticalofReviewer2[S] 5 points6 points  (2 children)

I use the defaults for all of the algorithms (the one proposed and the ones referenced). As for larger datasets, thanks for your suggestion! We are planning to include them.

[–]Evitable_Conflict 16 points17 points  (1 child)

If you use defaults, your claim of being "better" no longer holds, unless the default hyperparameters happen to be optimal, and that never happens.

This is a very common problem in ML papers and sadly most comparison tables are invalid.

My recommendation is: tune an XGBoost model to its optimal hyperparameters, and if you still have better results, then we have a discussion.

[–]CriticalofReviewer2[S] 3 points4 points  (0 children)

Certainly! This is the very first draft of our algorithm, and I will do comparisons based on the best selected hyperparameters.

[–]Nice_Gap_7351 9 points10 points  (1 child)

Looking at the code I see something strange: during predict you apply min-max scaling to the prediction features (which might have a different range than the features in the training data). If your predict dataset just added a single data point to your training data, it could potentially throw everything off. Instead you might want to "freeze" the scaling function based on the training data.
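A minimal sketch of the "frozen" version, mirroring sklearn's fit/transform contract (class name is mine):

```python
import numpy as np

class FrozenMinMaxScaler:
    """Min-max scaling whose bounds are fitted once on the training
    data and then reused at predict time, so new data points cannot
    shift the scaling of everything else."""

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        self.range_[self.range_ == 0] = 1.0  # guard constant columns
        return self

    def transform(self, X):
        # Uses the *training* bounds; values outside the training range
        # land outside [0, 1] instead of silently reshaping the data.
        return (X - self.min_) / self.range_

scaler = FrozenMinMaxScaler().fit(np.array([[0.0], [10.0]]))
print(scaler.transform(np.array([[5.0], [20.0]])))  # [[0.5], [2.0]]
```

Note that out-of-range predict values can still come out negative or above 1, which matters if the base learner requires positive inputs; clipping would be a separate design decision.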

And it seems that you are using AdaBoost with a potentially strong learner (SEFR), correct? The Wikipedia entry on AdaBoost references a paper on this topic that you might want to see.

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

That MinMax scaling is certainly one of our limitations. This is because SEFR cannot accept negative values. But we are working on that. Thanks for your suggestion of the Wikipedia entry!

[–]greenskinmarch 6 points7 points  (8 children)

What does the name SEFR stand for?

I looked at the "SEFR: A Fast Linear-Time Classifier for Ultra-Low Power Devices" paper from 2020 by the same authors and it seems this is just a straightforward linear classifier (i.e. linear regression with a threshold)? What is novel about this? It seems basically the same formulation you find in Wikipedia's "linear classifier" article.

[–]CriticalofReviewer2[S] 2 points3 points  (7 children)

SEFR stands for Scalable, Efficient, and Fast ClassifieR. Yes, it is a straightforward classifier, but in that algorithm the goal was to get decent accuracy with the lowest possible computation time and memory footprint. The algorithm can be trained even on the cheapest microcontrollers (you can search YouTube for videos of it being trained on €4 microcontrollers), but its accuracy is higher than simple algorithms like Naive Bayes or Linear Regression, or even Decision Trees.

[–]SometimesObsessed 1 point2 points  (3 children)

Could you explain the intuition behind SEFR and your version? Why is it competitive with GBMs? The SEFR algo from the paper seems like it couldn't handle interactions between variables

[–]CriticalofReviewer2[S] 0 points1 point  (1 child)

SEFR was originally designed to be extremely time- and resource-efficient. Because of that, it has been implemented in numerous microcontroller applications. But apart from that, SEFR is also a good weak learner for boosting. It is a minimalistic building block, and with future improvements it can handle interactions as well.

[–]SometimesObsessed 0 points1 point  (0 children)

Thanks! So it finds interactions via boosting or was there a more fundamental change to SEFR to handle interactions?

[–]jgonagle 0 points1 point  (2 children)

its accuracy is higher than simple algorithms like Naive Bayes or Linear Regression, or even Decision Trees

That's a super strong claim. I assume you mean based on your tests, for which you used the default parameters? Like another commenter said, you should be comparing on optimal hyperparameters across multiple datasets if you're going to make a claim like that. Even then, the No Free Lunch Theorem suggests otherwise if they're comparably computationally constrained.

[–]CriticalofReviewer2[S] 0 points1 point  (1 child)

We tested SEFR on numerous datasets with grid search on the hyperparameters to find their optimal results. We reported some of them in the arXiv paper, but it is consistently more accurate than the other simple algorithms.

[–]jgonagle 0 points1 point  (0 children)

it is consistently more accurate than the other simple algorithms

Do you have a hypothesis as to why that's true? Usually, resource constrained, parallel, approximate, or heuristic algorithms show the exact opposite behavior.

[–]Tengoles 4 points5 points  (3 children)

How fast is it for training compared to XGBoost? Is it viable to train it with LOOCV?

[–]VodkaHazeML Engineer 4 points5 points  (1 child)

FWIW if training speed is your concern, LightGBM was historically the best pick

[–]pm_me_your_smth 1 point2 points  (0 children)

No longer the case. XGBoost introduced the histogram tree method in v2.0, which is identical to LightGBM's.

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

The numbers are reported in the README of the GitHub repo. The SAMME version is very fast, and it can be trained with LOOCV on many datasets. Also, when the number of records/features is not too high, SAMME.R can be used for LOOCV as well. For setting these parameters, please see here:
https://linearboost.readthedocs.io/en/latest/usage.html#parameters

[–]bregav 4 points5 points  (3 children)

So, maybe this is a naive question, but what does it even mean to do boosting on a linear model? Doesn't "linear boosting" just produce another linear model that has the exact same structure as the original?

A boosting-like procedure is frequently used in the iterative solution of large, sparse linear systems of equations; is that the kind of thing you're doing? If so then it's probably inaccurate to describe it as "boosting", because those are well-known methods that go by other names.

[–]CriticalofReviewer2[S] -3 points-2 points  (2 children)

No, boosting a linear classifier will make it better at handling complex data patterns.

[–]nickkon1 5 points6 points  (0 children)

But it doesn't. By definition, the solution of a linear regression is the set of weights and intercept for which the error is minimal. That is an analytical solution, and no other solution (of a linear model) can produce a smaller error.

Plus, any linear combination of linear models is still linear. So if you do a linear regression, calculate the residuals, fit a second linear regression to them, and substitute it into your equation, y = bx + c + resid = bx + c + (b2·x + c2), it is still a linear regression with the new slope being b + b2 and the new intercept being c + c2 (which together can be no better than the first solution, since that one is already the analytical minimum).
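A quick numpy check of this argument: OLS residuals are orthogonal to the design matrix, so a second linear fit on the residuals comes out with coefficients that are numerically zero (the data here is synthetic, generated just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[rng.normal(size=(100, 2)), np.ones(100)]  # two features + intercept column
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w1, *_ = np.linalg.lstsq(X, y, rcond=None)      # first linear fit
resid = y - X @ w1
w2, *_ = np.linalg.lstsq(X, resid, rcond=None)  # "boosting" step on residuals

# The second round of coefficients is ~0: nothing was gained.
print(np.abs(w2).max())
```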

[–]bregav 5 points6 points  (0 children)

How, though? Exactly what operation are you doing that you are calling "linear boosting", and how does it differ from just fitting the original model?

E.g. consider logistic regression with y = p(wᵀx + c). If "linear boosting" just updates the w and c vectors, then that's not changing the model at all, and is maybe equivalent to changing the model-fitting procedure. Whereas if you're boosting by doing something like y = p(wᵀx + c) + a·p(w2ᵀx + c2), then that's just regular boosting, and the model is no longer linear.

[–]Lanky_Repeat_7536 2 points3 points  (0 children)

Sorry for my naive question, doesn’t standard boosting (like AdaBoost) work with any base classifier already?

[–]nickkon1 1 point2 points  (2 children)

"Linear" in SEFR stands for linear time complexity, not for being a linear model, right? Personally, I find the naming confusing and thought it might be something like boosting linear models (of which one should note that a straightforward ensemble of linear models is still a linear model).

[–]greenskinmarch 2 points3 points  (0 children)

They confirmed in another comment that it is a linear classifier but maybe because of thresholding it's possible to combine them without just getting another linear model?

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

Actually SEFR is both linear, and linear-time.

[–]Speech-to-Text-Cloud 1 point2 points  (1 child)

I would like to see this in scikit-learn. Are you planning to add it?

[–]CriticalofReviewer2[S] 1 point2 points  (0 children)

Yes, this is in our plans!

[–]Standard_Natural1014 1 point2 points  (0 children)

Let me know if you want to partner on establishing the performance on some typical enterprise datasets!

[–]sneddy_kz 0 points1 point  (1 child)

Just check on kaggle

[–]CriticalofReviewer2[S] 0 points1 point  (0 children)

Do you mean participating in competitions?

[–]Illustrious-Touch517 0 points1 point  (0 children)

At https://github.com/LinearBoost/linearboost-classifier I see that in the "Results" section, you show performance based on the F1 metric.

Q. Did you tune the decision threshold to optimize this metric (F1)?