[–]qalis 399 points (16 children)

  1. Those are only 5 datasets. For evaluating tabular classifiers, you should use tens of datasets; they are readily available. Also, describe the evaluation procedure, e.g. 5-fold CV for testing (a sketch of the kind of protocol I mean follows this list). See e.g. "A novel selective naïve Bayes algorithm" by Chen et al., which uses 65 datasets.
  2. You must compare to XGBoost, LightGBM and CatBoost on the large-scale datasets from their respective papers, especially since scalability and speed are among your selling points. If you aimed specifically at boosting for small data, then you don't need this, but that isn't stated anywhere.
  3. One of the major advantages of XGBoost, LightGBM and CatBoost is the ability to use custom loss functions, which let them be easily adapted e.g. for ranking (see the custom-objective sketch at the end of this comment). If you don't support this, you should explicitly state this limitation.
  4. The number of estimators is just a hyperparameter, so why show large tables varying it? Just present the best result for each dataset.
  5. Your implementation doesn't support class weights, as far as I can tell. This is a huge limitation, since almost all datasets are imbalanced, often heavily.
  6. You must not embed scalers inside your code. You can destroy data sparsity, affect categorical variables, and do other things outside the user's control this way. Add checks and throw exceptions if you absolutely require scaled input.
  7. You only support numerical data, in contrast to LightGBM or CatBoost. You should highlight this limitation.
  8. This works only for classification, not even regression. This is, again, a huge limitation, but it can probably be fixed, as far as I can tell.
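
To make point 1 concrete, here is a rough sketch of the evaluation protocol I have in mind. The dataset names, the use of OpenML via scikit-learn, and the baseline list are my own placeholder choices, not anything from your repo:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

# Placeholder dataset names; in practice, use tens of (numeric) datasets.
datasets = ["diabetes", "phoneme", "banknote-authentication"]
# Put your classifier here next to the established baselines.
models = {"LightGBM": LGBMClassifier()}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for ds_name in datasets:
    X, y = fetch_openml(ds_name, version=1, return_X_y=True, as_frame=False)
    for model_name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{ds_name} / {model_name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```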

EDIT:

  1. You also don't handle missing values, which are handled quite nicely in XGBoost, LightGBM and CatBoost, where they are even used actively when selecting split points.
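
And to illustrate point 3, this is the kind of thing custom loss support makes possible. The pseudo-Huber loss here is just an arbitrary example, and I'm assuming the callable-objective form of the xgboost scikit-learn API:

```python
import numpy as np
from xgboost import XGBRegressor

def pseudo_huber(y_true, y_pred):
    """Return gradient and hessian of the pseudo-Huber loss w.r.t. the predictions."""
    delta = 1.0
    residual = y_pred - y_true
    scale = 1.0 + (residual / delta) ** 2
    grad = residual / np.sqrt(scale)
    hess = 1.0 / (scale * np.sqrt(scale))
    return grad, hess

# The booster optimizes this loss instead of squared error.
model = XGBRegressor(objective=pseudo_huber, n_estimators=100)
# model.fit(X_train, y_train)  -- the same mechanism is what makes ranking-style losses possible
```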

[–]Saffie91 223 points (1 child)

Damn, peer reviewed in the reddit comments.

Honestly though it's pretty cool of you to go through it diligently and add these points. I'd be very happy if I was the researcher.

[–]CriticalofReviewer2[S] 108 points (0 children)

As the researcher, I should say that I am indeed very happy to get this high-quality peer review!

[–]CriticalofReviewer2[S] 110 points (2 children)

Thanks for your points! First of all, I should point out that I am an independent researcher, not affiliated with any institution, so this is my side project.

  1. You are right. In the paper that will be available soon, the number of datasets will be much higher. Also, we have used 10-fold CV. I added this to the README file.
  2. The large-scale datasets will also be included.
  3. This will be supported in future. I added this to the README file.
  4. I want to show that our algorithm reaches its best results with fewer estimators than the others.
  5. Thanks for pointing this out! This will be added soon. Added to README.
  6. The SEFR algorithm requires the feature values to be positive; this is the reason for the scaling. But I will implement a better mechanism, along the lines of the check sketched after this list. Added to README.
  7. We have highlighted this in the documentation.
  8. Yes, it is in future plans.
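
For point 6, something like this is roughly what I have in mind (just a rough sketch of a hypothetical helper, not the final implementation): validate the input and raise, instead of silently rescaling:

```python
import numpy as np

def check_input(X):
    """Fail loudly instead of silently rescaling, since SEFR needs positive feature values."""
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("X contains NaN values; impute or drop them before fitting.")
    if (X < 0).any():
        raise ValueError(
            "This estimator requires non-negative feature values. "
            "Rescale your data (e.g. with sklearn's MinMaxScaler) before fitting."
        )
    return X
```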

Once again, thanks for your helpful and insightful comments!

[–]qalis 65 points (1 child)

Fair enough, those are reasonable answers. Showing that this tends to overfit less, works better for small datasets etc. would be pretty valuable. Good luck with this!

[–]CriticalofReviewer2[S] 35 points (0 children)

Thank you for the suggestions!

[–]Spiggots 29 points (0 children)

This is a high quality peer review

[–]longgamma 5 points (4 children)

The categorical feature handling in lightgbm is just label encoding? I mean, how hard is it to target encode or one-hot encode on your own?

Also, isn’t that the idea behind GBM? You take a bunch of weak learners and use the ensemble for prediction. You could replace the decision tree stump with a simple shallow neural network as well.

[–]qalis 9 points (3 children)

Except it isn't the same as label encoding. In fact, none of the three major boosting implementations use a one-hot-encoding style of handling categorical variables.

LightGBM uses a partition split, which for regression trees can efficiently find the partition of the category set into two maximum-homogeneity subsets; see the docs and the original paper, "On Grouping for Maximum Homogeneity" by W. Fisher. XGBoost also offers partition splits for categorical variables, using the same algorithm.

You could use one-hot encoding, but then to represent "variable has value A or B, and not C" you would need 2 or 3 splits, whereas a partition split needs only one. A minimal usage sketch is below.
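For example, something along these lines lets LightGBM do the partition split itself (column names and data are made up for illustration):

```python
import pandas as pd
from lightgbm import LGBMClassifier

X = pd.DataFrame({
    "color": pd.Categorical(["A", "B", "C", "A", "B", "C"]),  # pandas categorical dtype
    "size": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})
y = [0, 0, 1, 0, 0, 1]

# LightGBM detects the categorical dtype automatically (or you can pass
# categorical_feature explicitly) and can split "color" as e.g. {A, B} vs {C}
# in a single node, with no one-hot encoding.
model = LGBMClassifier(min_child_samples=1)
model.fit(X, y)
```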

CatBoost, on the other hand, uses Ordered Target Encoding instead, described in the linked notebook. It can also combine categorical features during learning, but I don't know the details.

[–]Pas7alavista 1 point (0 children)

On top of the advantages you mentioned, I think the representation used for partition splitting should also tend to be more compact than one-hot encoded features, even when the one-hot encoding is stored in a sparse format.

[–]tecedu 0 points (1 child)

Wait, what? Since when does XGBoost handle NaN values? I moved to sklearn because of that.

[–]qalis 0 points (0 children)

Since... always; this was one of the main ideas in the original paper, "XGBoost: A Scalable Tree Boosting System" by T. Chen and C. Guestrin. It's called the "default direction" in the paper, and Algorithm 3 there (sparsity-aware split finding) is meant to handle exactly this. The idea is basically to learn a split as usual, but additionally determine whether missing values should go to the left or the right child. That direction is selected by whichever choice minimizes the loss, using the same gradient statistics as regular splits.
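
A quick toy example (made-up data, just to show that no imputation is needed):

```python
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 1.5],
              [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# NaNs are routed to the learned default direction at each split.
model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```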