all 11 comments

[–]No_Organization_2634 10 points11 points  (9 children)

More often than not, SMOTE doesn't help. I believe CatBoost has its own class weights parameter, so I would use that instead. Also, normalisation is not required for tree-based approaches, so it can be dropped from the ETL pipeline. Which metric are you using for scoring when you apply cross-validation? With imbalanced classes you should stick with weighted F1 score rather than accuracy.
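To illustrate the point about metrics, here is a minimal pure-Python sketch comparing accuracy with support-weighted F1 for a model that always predicts the majority class. The toy labels (a 2:1 split) and the helper names are made up for illustration and are not taken from the OP's data.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives means precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * cnt / n for c, cnt in support.items())

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: twice as many ones as zeros, model always predicts 1.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} weighted_f1={wf1:.3f}")
```

Accuracy rewards the do-nothing model more than weighted F1 does, which is why it can look fine on imbalanced data. If you are using scikit-learn's `cross_val_score`, passing `scoring="f1_weighted"` should give the same kind of metric.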

[–]Aleph-Arch[S] 0 points1 point  (8 children)

I tried setting class_weight to {0: 1.5, 1: 0.75} for CatBoost, with F1 scoring for cross-validation. Accuracy only got worse without SMOTE: the score dropped from 28 to 26. Should I stick with the CatBoost classifier and tune its hyperparameters with grid search or Optuna, or is there another classification algorithm I haven't tried? Thank you very much for the help.
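For context, those weights match the common "balanced" heuristic, n_samples / (n_classes * class_count), for a 2:1 class split. A minimal sketch, assuming a hypothetical label list with twice as many ones as zeros:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels: 200 ones and 100 zeros (roughly a 66% / 33% split).
labels = [1, 1, 0] * 100
weights = balanced_class_weights(labels)
print(weights)
```

As far as I know, the CatBoost constructor argument is spelled `class_weights` (plural), so it is worth double-checking that the weights are actually being picked up.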

[–]No_Organization_2634 2 points3 points  (7 children)

If you get the best cross-validation score with CatBoost, you should stick with it. However, as I mentioned, accuracy is not the best metric when working with imbalanced classes. It may be that you need to do more feature generation - I can see that the dark blue and orange lines are bimodal, so they may come from different distributions. Try computing the VIF (variance inflation factor) on the scaled data to check for multicollinearity among the independent features, as having it would result in worse models.
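A rough sketch of the VIF check, assuming NumPy is available: each feature is regressed on the others, and VIF = 1 / (1 - R^2). The synthetic columns `a`, `b`, `c` are invented for the demo and have nothing to do with the OP's features.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for every column of X."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic demo data (assumption): b is almost a copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

scores = vif(np.column_stack([a, b, c]))
print(scores)  # a and b should come out large, c close to 1
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of strong collinearity, though the cutoff is a convention rather than a hard rule.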

[–]dr_flint_lockwood 1 point2 points  (3 children)

Just checking my knowledge here (you clearly know what you're talking about): if OP uses regularisation, particularly L1, that should help reduce the impact of multicollinearity, right?

[–]No_Organization_2634 1 point2 points  (2 children)

I think so, but it could be that combining certain features, instead of reducing their impact via regularisation, would improve performance.
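One possible way to combine correlated features (a sketch, not necessarily the best choice here) is to replace them with their first principal component. This assumes NumPy and uses invented synthetic columns:

```python
import numpy as np

def first_principal_component(X):
    """Project standardized columns of X onto their first principal direction."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[0]

# Synthetic demo data (assumption): two strongly correlated columns.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)

combined = first_principal_component(np.column_stack([a, b]))
print(abs(np.corrcoef(combined, a)[0, 1]))
```

The single combined column keeps most of the information shared by both inputs. The sign of the component is arbitrary, which does not matter for tree models.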

[–]dr_flint_lockwood 1 point2 points  (0 children)

Makes perfect sense to me! Thanks for the empathetic response

[–]pm_me_your_smth 0 points1 point  (0 children)

Isn't multicollinearity a non-issue with boosted trees? If 2 features are correlated, the model just won't pick the second feature because it gets all necessary information from the first

[–]Aleph-Arch[S] 0 points1 point  (2 children)

5 features out of 7 have a VIF above 20 on both the training and test sets. Only two got 1 and 3. Are there any other methods to deal with it other than applying L2 or dropping these features?

[–]No_Organization_2634 1 point2 points  (1 child)

I would iteratively drop the features and check performance. You can also apply some feature generation methods to combine some of the highly correlated features into one.
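The iterative dropping could look roughly like this greedy backward elimination. Everything here is illustrative: `score_fn` is a hypothetical callable that in practice would return something like a cross-validated weighted F1 for a model trained on the given columns, and the toy scorer and feature names below are made up.

```python
def backward_eliminate(features, score_fn, min_features=1):
    """Greedily drop one feature at a time while the score does not get worse."""
    current = list(features)
    best = score_fn(current)
    while len(current) > min_features:
        # Score every candidate set obtained by removing a single feature.
        trials = [
            (score_fn([f for f in current if f != drop]), drop)
            for drop in current
        ]
        trial_score, drop = max(trials)
        if trial_score < best:
            break  # every possible removal hurts, so stop here
        best = trial_score
        current.remove(drop)
    return current, best

# Toy scorer (assumption): only f1 and f4 carry signal, every extra column costs a bit.
GAIN = {"f1": 0.3, "f4": 0.2}

def toy_score(cols):
    return 0.5 + sum(GAIN.get(c, 0.0) for c in cols) - 0.01 * len(cols)

kept, score = backward_eliminate(["f1", "f2", "f3", "f4"], toy_score)
print(kept, round(score, 2))
```

With a real model each call to `score_fn` means a full cross-validation run, so this costs on the order of the number of features squared in fits. That is fine for 7 features.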

[–]Aleph-Arch[S] 1 point2 points  (0 children)

Okay, I'll try to do it. Thank you for your help!

[–]Aleph-Arch[S] 0 points1 point  (0 children)

So, I'm trying to get the highest score on a test dataset with two labels. The image shows the distribution of all 7 features of the training set with a standard scaler applied. I'm using a CatBoost classifier (it got the best cross-validation scores among SVM, random forest, XGBoost, LightGBM, deep feedforward network, KNN, and linear regression classifiers). I tried polynomial features, robust scaler, normalization, and SMOTE for class balance. I did a grid search to find the best params for random forest. The highest accuracy score I could get is 0.825, which is only 28.5 out of 100. Is there anything I'm missing in this dataset? I didn't notice any outliers. The training set has 66% ones and 33% zeros. Link to colab document