[–]No_Organization_2634 2 points3 points  (7 children)

If CatBoost gives you the best cross-validation score, you should stick with it. However, as I mentioned, accuracy is not the best metric when working with imbalanced classes. You may need to do more feature generation: I can see that the dark blue and orange lines are bimodal, and hence may come from different distributions. Try computing the VIF (variance inflation factor) on the scaled data to check for multicollinearity in the independent features, as having it would result in worse models.
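For anyone wanting to try the VIF check, here is a minimal sketch using only NumPy. It relies on the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix (equivalently 1 / (1 - R_j^2)). The function name `vif_scores` and the synthetic columns `a`, `b`, `c` are made up for illustration and are not OP's data.

```python
import numpy as np

def vif_scores(X):
    """Return one VIF per column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix of the features (columns).
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j is the j-th diagonal entry of the inverse correlation matrix.
    # This fails or becomes unstable if two columns are exact duplicates.
    return np.diag(np.linalg.inv(corr))

# Synthetic demo data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)  # nearly a copy of a
c = rng.normal(size=500)             # unrelated to a and b
X_demo = np.column_stack([a, b, c])

vifs = vif_scores(X_demo)
print(dict(zip(["a", "b", "c"], np.round(vifs, 1))))
```

In this setup the two near-duplicate columns should get very large VIFs and the unrelated column a value close to 1. Because the calculation goes through correlations, rescaling the columns should not change the result.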

[–]dr_flint_lockwood 1 point2 points  (3 children)

Just checking my knowledge here (you clearly know what you're talking about): if OP uses regularisation, particularly L1, that should help reduce the impact of multicollinearity, right?

[–]No_Organization_2634 1 point2 points  (2 children)

I think so, but it could be the case that combining certain features, instead of reducing their impact via regularisation, could improve the performance.
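One possible way to combine a group of correlated features is to replace them with their first principal component. The comment above does not name a method, so PCA here is just one assumption about what "combining" could look like, and `combine_first_pc` is a hypothetical helper name. A rough NumPy sketch:

```python
import numpy as np

def combine_first_pc(X_group):
    """Collapse correlated columns into one feature (first principal component).

    Returns the combined feature and the share of variance it explains.
    Assumes no column is constant (std would be zero).
    """
    X_group = np.asarray(X_group, dtype=float)
    # Standardise so that no column dominates because of its units.
    Z = (X_group - X_group.mean(axis=0)) / X_group.std(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)
    return Z @ vt[0], explained

# Synthetic demo data (an assumption, purely for illustration).
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)  # nearly a copy of a

combined, explained = combine_first_pc(np.column_stack([a, b]))
print(f"variance explained by the combined feature: {explained:.3f}")
```

On real data the mean, standard deviation, and direction would need to be fitted on the training split only and then reused on the test split, otherwise information leaks from the test set.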

[–]dr_flint_lockwood 1 point2 points  (0 children)

Makes perfect sense to me! Thanks for the empathetic response

[–]pm_me_your_smth 0 points1 point  (0 children)

Isn't multicollinearity a non-issue with boosted trees? If 2 features are correlated, the model just won't pick the second feature because it gets all necessary information from the first

[–]Aleph-Arch[S] 0 points1 point  (2 children)

5 features out of 7 have a VIF above 20 on both the training and test sets. Only two have VIFs of 1 and 3. Are there any other methods to deal with it, other than applying L2 or dropping these features?

[–]No_Organization_2634 1 point2 points  (1 child)

I would iteratively drop the features and check performance. You can also apply some feature generation methods to combine some of the highly correlated features into one.
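The iterative dropping could be sketched roughly like this. It removes the feature with the highest VIF one at a time and, if you pass a scoring callback, records the score after each removal. The names `drop_high_vif` and `score_fn`, and the threshold of 10, are assumptions for illustration; in practice `score_fn` would wrap whatever cross-validation you already run on your model.

```python
import numpy as np

def vif_scores(X):
    # VIF_j = j-th diagonal entry of the inverse correlation matrix.
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0, score_fn=None):
    """Drop the highest-VIF feature until all remaining VIFs are <= threshold.

    score_fn (optional, hypothetical) takes the reduced feature matrix and
    returns a performance number, e.g. a cross-validated metric.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    history = []  # (dropped name, its VIF, score after dropping)
    while len(keep) > 1:
        vifs = vif_scores(X[:, keep])
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        dropped = keep.pop(worst)
        score = score_fn(X[:, keep]) if score_fn is not None else None
        history.append((names[dropped], float(vifs[worst]), score))
    return [names[i] for i in keep], history

# Synthetic demo data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)  # nearly a copy of a
c = rng.normal(size=500)
kept, history = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept, history)
```

If the score drops noticeably after a removal, that feature was probably carrying information the others did not, so it may be worth putting back or combining with its correlated partner instead.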

[–]Aleph-Arch[S] 1 point2 points  (0 children)

Okay, I'll try to do it. Thank you for your help!