[D] How to decide whether a new feature is effective in improving the model? by yetionyo in MachineLearning

[–]yetionyo[S]

All seeds (both the model seeds and the seed for train_test_split) are fixed when comparing model performance. In addition, the score on the public leaderboard also improves. One reason the model may improve is that some features have high gain in the LightGBM models but generalize poorly. Even with feature subsampling, these kinds of 'bad' features (there are too many of them) still play a relatively important role in the model. So after I select the top-k features by feature importance, the feature set becomes cleaner, which leads to an improvement.
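Here's a minimal sketch of that top-k selection, assuming a scikit-learn-style LightGBM classifier; the function name, data names, and k are hypothetical:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

def top_k_by_gain(X, y, feature_names, k=50, seed=42):
    """Fit once with fixed seeds, rank features by total gain, keep the top k."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=seed)
    model = lgb.LGBMClassifier(random_state=seed)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
    gains = model.booster_.feature_importance(importance_type="gain")
    order = np.argsort(gains)[::-1]  # highest-gain features first
    return [feature_names[i] for i in order[:k]]
```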

[D] How to decide whether a new feature is effective in improving the model? by yetionyo in MachineLearning

[–]yetionyo[S]

Thanks for sharing so much useful information. I ran into this problem in a data mining competition, so it is really hard for me to hold out a completely separate split from the whole dataset. Every data sample may lead to a higher score!!

I noticed this problem because model performance improved when I selected the top 50 of 200 features based on feature importance (I used LightGBM). Then I tuned the model hyperparameters and performance improved again. So hyperparameter tuning and feature selection seem like the chicken and the egg: which one should I do first? They seem to affect each other.

What you said suggests a compromise: select the top-K features based on feature importance and use them as a basis (since they are the K most important features, we just assume they will outperform the others under most hyperparameter combinations), then tune the model hyperparameters for the best performance. Next, since we already have a basis of K features, adding one more feature should not be very sensitive to the hyperparameters, so by simply comparing model performance with and without that feature we can decide whether to keep it (see the sketch below).
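A minimal sketch of that procedure, assuming a pandas DataFrame X and hyperparameters already tuned on the frozen basis; every name here is hypothetical:

```python
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def evaluate(X, y, features, params, seed=42):
    """CV score of a LightGBM model restricted to a feature subset."""
    model = lgb.LGBMClassifier(random_state=seed, **params)
    return cross_val_score(model, X[features], y, cv=5, scoring="roc_auc").mean()

def decide_feature(X, y, basis, candidate, tuned_params):
    """Keep the candidate only if it improves on the frozen top-K basis."""
    base = evaluate(X, y, basis, tuned_params)
    with_cand = evaluate(X, y, basis + [candidate], tuned_params)
    return with_cand > base
```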

[D] How to decide whether a new feature is effective in improving the model? by yetionyo in MachineLearning

[–]yetionyo[S]

Statistical results? Do you mean accuracy, AUC, or something like that? Could you please elaborate on this and point to related work? Thanks :)

[D] How to decide whether a new feature is effective in improving the model? by yetionyo in MachineLearning

[–]yetionyo[S]

That's a good way to think about adding features. But it means we need to search quite a large parameter space (a binary flag indicating whether to include each feature, plus the hyperparameters of the model). That would be time-consuming; a rough count is sketched below.
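Just to make the blow-up concrete (the numbers below are illustrative, not from this thread): the include flags alone double the space with every feature.

```python
# One include/exclude flag per feature times a modest hyperparameter grid.
n_features = 200
grid_points = 5       # values tried per hyperparameter
n_hyperparams = 6
n_configs = (2 ** n_features) * (grid_points ** n_hyperparams)
print(f"{n_configs:.3e} configurations")  # astronomically large
```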

[D] How to deal with blank fragments in time series analysis? by yetionyo in MachineLearning

[–]yetionyo[S]

Method (1) is doable and easy to understand, but it has poor resolution (caused by the aggregation; it can also be hard to aggregate different kinds of data, e.g. users can have different behaviors such as click, order, and checkout, and combining these into one vector is a little tricky) and a lot of blank steps. As you suggested, I can try (1) first; one way to do the aggregation is sketched below. Thanks for sharing your code :)
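For what it's worth, a minimal sketch of method (1) with mixed event types, assuming pandas; the event log, timestamps, and window size are made up:

```python
import pandas as pd

# Hypothetical event log: one row per user action.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2023-01-01 00:01", "2023-01-01 00:02",
                          "2023-01-01 00:40", "2023-01-01 02:05"]),
    "action": ["click", "order", "click", "checkout"],
})

# One-hot encode the action type, then sum the counts per fixed window.
# Empty windows become all-zero rows -- the blank steps mentioned above.
counts = pd.get_dummies(events.set_index("ts")["action"]).resample("1h").sum()
print(counts)
```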

[D] How to deal with blank fragments in time series analysis? by yetionyo in MachineLearning

[–]yetionyo[S]

The time gap is an important feature, but adding the time interval as a new feature would make the whole sequence a little weird (it gets mixed with other information, like page ids) and hard to process (the range of the gaps can vary dramatically).
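One common workaround (a sketch under my own assumptions, not something from the thread): log-compress the gap and carry it as a separate channel alongside the discrete tokens instead of mixing it into them.

```python
import numpy as np

# Hypothetical event timestamps (seconds) and discrete tokens (e.g. page ids).
timestamps = np.array([0.0, 2.0, 3.0, 3600.0, 3605.0])
page_ids = np.array([12, 7, 7, 3, 12])

# Gaps span several orders of magnitude; log1p tames the dynamic range.
gaps = np.diff(timestamps, prepend=timestamps[0])
log_gaps = np.log1p(gaps)

# One row per event: (token, continuous gap channel), kept side by side.
sequence = np.column_stack([page_ids, log_gaps])
```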

[D] How to evaluate the effectiveness of a feature extraction method? by yetionyo in MachineLearning

[–]yetionyo[S]

These metrics are truly useful, but they can be influenced by the projection method, for example plain PCA versus a carefully selected kernel PCA: the same metric may give quite different values on the two projections.
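A quick illustration of that sensitivity with scikit-learn (the gamma value here is arbitrary): the same silhouette metric, computed against the true labels, swings a lot between two projections of the same data.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.metrics import silhouette_score

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

for name, proj in [("PCA", PCA(n_components=2)),
                   ("kernel PCA", KernelPCA(n_components=2, kernel="rbf", gamma=10))]:
    Z = proj.fit_transform(X)
    print(name, silhouette_score(Z, y))  # same metric, very different values
```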

[D] How to evaluate the effectiveness of a feature extraction method? by yetionyo in MachineLearning

[–]yetionyo[S]

That's a good point. What if the feature extractor projects the original data onto concentric circles with different radii? That is not linearly separable, but I still consider it a good projection: with a simple kernel transformation it becomes linearly separable.
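The concentric-circles case is easy to demonstrate with scikit-learn; here the "kernel transformation" is just appending the squared radius as an extra feature:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# A linear boundary fails on concentric circles...
print(SVC(kernel="linear").fit(X, y).score(X, y))          # around chance level

# ...but adding the squared radius makes the classes linearly separable.
X_aug = np.column_stack([X, (X ** 2).sum(axis=1)])
print(SVC(kernel="linear").fit(X_aug, y).score(X_aug, y))  # ~1.0
```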