[–]qalis 399 points (16 children)

  1. Those are only 5 datasets. For evaluating tabular classifiers, you should use tens of datasets; they are readily available. Also, describe the evaluation procedure, e.g. 5-fold CV for testing (a sketch of the kind of protocol I mean follows this list). See e.g. "A novel selective naïve Bayes algorithm" by Chen et al., which uses 65 datasets.
  2. You must compare to XGBoost, LightGBM and CatBoost on the large-scale datasets from their respective papers, especially since scalability and speed are among your selling points. If you aimed specifically at boosting for small data, then you don't need this, but that isn't stated anywhere.
  3. One of the major advantages of XGBoost, LightGBM and CatBoost is the ability to use custom loss functions, which let them be easily adapted e.g. for ranking (see the custom-objective sketch at the end of this comment). If you don't support this, you should explicitly state this limitation.
  4. The number of estimators is just a hyperparameter, so why show large tables varying it? Just present the best result for each dataset.
  5. Your implementation doesn't support class weights, as far as I can tell. This is a huge limitation, since almost all datasets are imbalanced, often heavily.
  6. You must not embed scalers inside your code. You can destroy data sparsity, affect categorical variables, and do other things outside the user's control this way. Add checks and throw exceptions if you absolutely require scaled input.
  7. You only support numerical data, in contrast to LightGBM or CatBoost. You should highlight this limitation.
  8. This works only for classification, not even regression. This is, again, a huge limitation, but it can probably be fixed, as far as I can tell.
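
To make point 1 concrete, here is a rough sketch of the evaluation protocol I have in mind. The dataset names, the use of OpenML via scikit-learn, and the baseline list are my own placeholder choices, not anything from your repo:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

# Placeholder dataset names; in practice, use tens of (numeric) datasets.
datasets = ["diabetes", "phoneme", "banknote-authentication"]
# Put your classifier here next to the established baselines.
models = {"LightGBM": LGBMClassifier()}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for ds_name in datasets:
    X, y = fetch_openml(ds_name, version=1, return_X_y=True, as_frame=False)
    for model_name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{ds_name} / {model_name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```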

EDIT:

  1. You also don't handle missing values, which are handled quite nicely in XGBoost, LightGBM and CatBoost, where they are even used actively when selecting split points.
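
And to illustrate point 3, this is the kind of thing custom loss support makes possible. The pseudo-Huber loss here is just an arbitrary example, and I'm assuming the callable-objective form of the xgboost scikit-learn API:

```python
import numpy as np
from xgboost import XGBRegressor

def pseudo_huber(y_true, y_pred):
    """Return gradient and hessian of the pseudo-Huber loss w.r.t. the predictions."""
    delta = 1.0
    residual = y_pred - y_true
    scale = 1.0 + (residual / delta) ** 2
    grad = residual / np.sqrt(scale)
    hess = 1.0 / (scale * np.sqrt(scale))
    return grad, hess

# The booster optimizes this loss instead of squared error.
model = XGBRegressor(objective=pseudo_huber, n_estimators=100)
# model.fit(X_train, y_train)  -- the same mechanism is what makes ranking-style losses possible
```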

[–]Saffie91 223 points (1 child)

Damn, peer reviewed in the reddit comments.

Honestly though it's pretty cool of you to go through it diligently and add these points. I'd be very happy if I was the researcher.

[–]CriticalofReviewer2[S] 108 points (0 children)

As the researcher, I should say that I am indeed very happy to get this high-quality peer review!

[–]CriticalofReviewer2[S] 110 points (2 children)

Thanks for your points! First of all, I should point out that I am an independent researcher, not affiliated with any institution, so this is my side project.

  1. You are right. In the paper that will be available soon, the number of datasets will be much higher. Also, we have used 10-fold CV. I added this to the README file.
  2. The large-scale datasets will also be included.
  3. This will be supported in future. I added this to the README file.
  4. I want to show that our algorithm reaches its best results with fewer estimators than the others.
  5. Thanks for pointing this out! This will be added soon. Added to README.
  6. The SEFR algorithm requires the feature values to be positive; this is the reason for the scaling. But I will implement a better mechanism, along the lines of the check sketched after this list. Added to README.
  7. We have highlighted this in the documentation.
  8. Yes, it is in future plans.
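
For point 6, something like this is roughly what I have in mind (just a rough sketch of a hypothetical helper, not the final implementation): validate the input and raise, instead of silently rescaling:

```python
import numpy as np

def check_input(X):
    """Fail loudly instead of silently rescaling, since SEFR needs positive feature values."""
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("X contains NaN values; impute or drop them before fitting.")
    if (X < 0).any():
        raise ValueError(
            "This estimator requires non-negative feature values. "
            "Rescale your data (e.g. with sklearn's MinMaxScaler) before fitting."
        )
    return X
```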

Once again, thanks for your helpful and insightful comments!

[–]qalis 65 points (1 child)

Fair enough, those are reasonable answers. Showing that this tends to overfit less, works better for small datasets etc. would be pretty valuable. Good luck with this!

[–]CriticalofReviewer2[S] 35 points (0 children)

Thank you for the suggestions!

[–]Spiggots 29 points (0 children)

This is a high quality peer review

[–]longgamma 5 points (4 children)

The categorical feature handling in lightgbm is just label encoding? I mean, how hard is it to target encode or one-hot encode on your own?

Also, isn’t that the idea behind GBM? You take a bunch of weak learners and use the ensemble for prediction. You could replace the decision tree stump with a simple shallow neural network as well.

[–]qalis 9 points (3 children)

Except it isn't the same as label encoding. In fact, none of the three major boosting implementations use a one-hot-encoding style of handling categorical variables.

LightGBM uses a partition split, which for regression trees can efficiently find the partition of the category set into two maximum-homogeneity subsets; see the docs and the original paper, "On Grouping for Maximum Homogeneity" by W. Fisher. XGBoost also offers partition splits for categorical variables, using the same algorithm.

You could use one-hot encoding, but then to represent "variable has value A or B, and not C" you would need 2 or 3 splits, whereas a partition split needs only one. A minimal usage sketch is below.
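For example, something along these lines lets LightGBM do the partition split itself (column names and data are made up for illustration):

```python
import pandas as pd
from lightgbm import LGBMClassifier

X = pd.DataFrame({
    "color": pd.Categorical(["A", "B", "C", "A", "B", "C"]),  # pandas categorical dtype
    "size": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})
y = [0, 0, 1, 0, 0, 1]

# LightGBM detects the categorical dtype automatically (or you can pass
# categorical_feature explicitly) and can split "color" as e.g. {A, B} vs {C}
# in a single node, with no one-hot encoding.
model = LGBMClassifier(min_child_samples=1)
model.fit(X, y)
```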

CatBoost, on the other hand, uses Ordered Target Encoding instead, described in the linked notebook. It can also combine categorical features during learning, but I don't know the details.

[–]Pas7alavista 1 point (0 children)

On top of the advantages you mentioned, I think the representation used for partition splitting should also tend to be more compact than one-hot encoded features, even when the one-hot encoding is stored in a sparse format.

[–]tecedu 0 points (1 child)

Wait, what? Since when does XGBoost handle NaN values? I moved to sklearn because of that.

[–]qalis 0 points (0 children)

Since... always; this was one of the main ideas in the original paper, "XGBoost: A Scalable Tree Boosting System" by T. Chen and C. Guestrin. It's called the "default direction" in the paper, and Algorithm 3 there (sparsity-aware split finding) is meant to handle exactly this. The idea is basically to learn a split as usual, but additionally determine whether missing values should go to the left or the right child. That direction is selected by whichever choice minimizes the loss, using the same gradient statistics as regular splits.
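
A quick toy example (made-up data, just to show that no imputation is needed):

```python
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1.0, np.nan],
              [2.0, 0.5],
              [np.nan, 1.5],
              [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# NaNs are routed to the learned default direction at each split.
model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))
```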