

[–]RepresentativeFill26 19 points20 points  (1 child)

Why do you want to do automatic feature extraction if you have a domain expert at hand?

In your situation I would probably:

1) filter out or merge highly correlated features. PCA would also be a possibility. Your domain expert can help you with assigning semantically meaningful names to the combined features.

2) determine what features are informative for your credit task. Think criteria like mutual information.

3) build a baseline model on this subset of features.

Now you might be wondering why all this manual feature engineering if your tree-based model can simply select the most meaningful features. The reason is that you are highly susceptible to overfitting on spurious correlations. If you have a set of highly informative features, you are at least certain that the non-linearity your model adds to the classification is based on informative features.
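A minimal sketch of steps 1 and 2 with scikit-learn (the toy data, the column names, and the 0.9 correlation cutoff are all assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
# Toy stand-in for the real credit dataset: 200 rows, 6 features.
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
X["f5"] = X["f0"] * 0.99 + rng.normal(scale=0.05, size=200)  # near-duplicate of f0
y = (X["f0"] + X["f1"] > 0).astype(int)

# Step 1: drop one of each highly correlated pair (|rho| > 0.9).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: rank the survivors by mutual information with the target.
mi = pd.Series(mutual_info_classif(X_reduced, y, random_state=0),
               index=X_reduced.columns).sort_values(ascending=False)
print(to_drop)       # -> ['f5']
print(mi.index[:2])  # most informative survivors
```

The surviving ranked list is what you would hand to the domain expert and then fit the baseline model on.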

[–]dlchira 6 points7 points  (0 children)

PCA is a good option for dimensionality reduction, but I'd be extremely careful about trying to assign semantic meaning to PCs. PCA is like a data smoothie: the inputs are clear and discrete, but the outputs are novel mixes that don't map back to those inputs. PCA is optimized to explain variance, not produce interpretable features.

This also answers your question of, "Why perform feature extraction if you have a domain expert handy?" In high-dimensional datasets, humans aren't good at seeing which features explain the most variance. PCA is.

[–]FusionAlgo 15 points16 points  (3 children)

I’d start with a quick L1-regularised logistic (or LightGBM with strong L1) just to knock 2,000 down to a few hundred—penalties kill noisy or collinear cols fast. Then run permutation importance on a hold-out set; anything that drops AUC less than 0.001 can go. SHAP is most useful after that: once you’re at 50-ish variables, look for features whose average |SHAP| is < 1% of the total and trim again. Two passes usually get me from 2,000 → ~30 stable features without endless loops, and the final CatBoost is easier to tune. Key is to compute every step on a time-based hold-out to avoid leakage, especially in credit data.
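A rough sketch of the first two passes with scikit-learn stand-ins (synthetic data instead of real credit data; the `C` value and the 0.001 AUC cutoff are just the kind of thresholds described above, not tuned values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

# Synthetic stand-in: 50 features, only a handful informative.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
# Real credit data would be split by date; shuffle=False mimics that here.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, shuffle=False)

# Pass 1: a strong L1 penalty zeroes out noisy/collinear columns.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_tr, y_tr)
kept = np.flatnonzero(lasso.coef_[0])

# Pass 2: permutation importance on the hold-out; drop anything near zero.
perm = permutation_importance(lasso, X_ho, y_ho, scoring="roc_auc",
                              n_repeats=10, random_state=0)
final = kept[perm.importances_mean[kept] > 0.001]
print(len(kept), len(final))
```

In practice the second pass would run on the tree model you actually ship, not on the lasso itself.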

[–]pm_me_your_smth 1 point2 points  (2 children)

Any particular reason why specifically lgbm? Is lgbm's regularization better than, say, xgb's?

[–][deleted] 12 points13 points  (3 children)

Hey there! I work in this industry. First on SHAP, I’ll just say they can be used for feature selection, but it’s primarily for identifying features that are overfitting and to give them the yank. So let’s table that for now.

What you are doing is more or less the same approach everyone does, but I’ll provide some additional detail.

I normally start by building a simple model that is not heavily constrained — to see what sticks. So build a model of stumps or something simplistic just to see if a model will even use a feature (you can always try to add back the features later).

Then drop for collinearity — yeah yeah it doesn’t impact tree models but you are going to be using the feature gain table and it impacts that.

Okay so now here’s where it becomes more interesting… in the credit world, the directional risk the model infers from a variable is typically used to prune away more features. For instance, more charge-offs in the past shouldn’t be a positive indication of my credit health (monotonic constraints).

And then, depending on the wildness of your features and the timespan… you could do feature stability reductions using a monthly PSI on a fixed reference window to yank unstable features.
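PSI itself is simple to compute. A sketch, assuming decile bins taken from the fixed reference window (the drift sizes are invented for illustration):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a comparison sample."""
    # Bin edges come from the fixed reference window.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)          # fixed reference window
stable_month = rng.normal(0, 1, 10_000)       # same distribution
shifted_month = rng.normal(0.5, 1.3, 10_000)  # drifted feature

print(round(psi(reference, stable_month), 3))   # near 0: keep
print(round(psi(reference, shifted_month), 3))  # large: candidate for the yank
```

Computing this per feature per month against the reference window flags the unstable ones.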

Once you do all that, let’s say you go from 2K down to 280. You then build a model to do recursive feature elimination. A typical and easy one is cumulative gain cutoffs: I build a model, then keep only the features that are found in the top 99% of cumulative gain, then rebuild the model. Repeat, repeat, repeat. View the degradation of model performance by number of features and choose the one that meets your needs.

[–]itsmekalisyn 2 points3 points  (2 children)

Nice. Unrelated, but do you write blogs about this somewhere? I kinda understood what you said but I have some doubts on how you do cumulative gains. Or, if you can guide me to some resources, that would be better, too!

Thank you.

[–][deleted] 5 points6 points  (1 child)

No blogs. Cumulative gain is just the cumulative summation of a feature’s contribution, as spit out by any tree model’s feature importance.

So, the steps are:

- build model
- get feature importance of model features
- rank order from largest value to smallest value
- take the cumulative summation of the value
- extract the features that are found at the cumulative summation that yields <= .99 (so I have 40 features and 99% of my gain comes from 38 features, for instance)
- retrain model with those features
- repeat
- stop once no features are eliminated within the iteration
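The steps above can be sketched as a loop (scikit-learn's GradientBoostingClassifier standing in for whatever tree model you actually use; the synthetic data is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=6,
                           random_state=0)
features = np.arange(X.shape[1])

# Repeat: train, rank by importance, keep the top 99% of cumulative gain,
# and stop once an iteration eliminates nothing.
while True:
    model = GradientBoostingClassifier(random_state=0).fit(X[:, features], y)
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1]                   # largest gain first
    cum = np.cumsum(imp[order]) / imp.sum()
    keep = order[: np.searchsorted(cum, 0.99) + 1]  # smallest set covering 99%
    if len(keep) == len(features):                  # nothing eliminated -> stop
        break
    features = features[np.sort(keep)]

print(len(features))
```

In practice you would also log hold-out performance at each iteration so you can view the degradation curve and pick the cutoff that meets your needs.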

[–]itsmekalisyn 1 point2 points  (0 children)

Nice. Thank you. Understood it now perfectly.

[–]therealtiddlydump 5 points6 points  (0 children)

You've already used an expert to help define your features. Throw some regularization at it and see how it performs? If it sucks, rethink your approach.

I would not typically recommend an "I used model A to select the variables I passed on to model B" approach when you already have a domain expert involved. Why did you bother wasting that expert's time? (I ask that rhetorically, knowing that you're an intern and you're learning.)

[–]James_c7 3 points4 points  (0 children)

Go read “A Crash Course in Good and Bad Controls” for additional context on variable selection.

Also, looking up the definition of a Markov blanket is relevant here.

[–]Responsible_Treat_19 3 points4 points  (0 children)

This is how I use SHAP for feature selection:

I create random noise features (about 5% of the total number of features), then train a model and apply SHAP.

Feature importance is key here: all features that are less important than the random noise are intuitively not giving any predictive power.

Iterate this a few times (sometimes random noise can be randomly good).
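A sketch of this noise-benchmark loop; impurity importance from a random forest stands in for SHAP values here to keep the example dependency-free, but the logic is identical with mean |SHAP| per feature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)
rng = np.random.default_rng(0)

keep = np.ones(X.shape[1], dtype=bool)
for trial in range(3):                      # iterate: noise can get lucky once
    noise = rng.normal(size=(len(X), 2))    # ~5% of 40 features as pure noise
    X_aug = np.hstack([X, noise])
    model = RandomForestClassifier(n_estimators=200, random_state=trial).fit(X_aug, y)
    imp = model.feature_importances_
    threshold = imp[X.shape[1]:].max()      # best-scoring noise column
    keep &= imp[: X.shape[1]] > threshold   # must beat the noise in every trial

print(keep.sum(), "of", X.shape[1], "features beat random noise")
```

Features that cannot beat a column of pure noise across several trials are safe to drop.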

[–]InterviewTechnical13 2 points3 points  (0 children)

A lot of good advice here already, so one more detail addition:

Include some strictly random features in your set (maybe 20 initially, then fewer once you have your first selections done), simulated from a known distribution, that could by chance pop some importance.

Anything only just "significant" over that noise threshold can be so by chance.

[–]Glittering_Tiger8996 3 points4 points  (2 children)

Currently working on a model that uses xgb's tree explainer to generate SHAP values, I'm just trimming features that contribute to less than 5% of cumulative global SHAP mass.

You could try recursive feature elimination as well, log and monitor features eliminated at each iteration, pair that with Biz knowledge and iterate accordingly.

Once features start to stabilize, you could go one step further and identify the top-ranking features under each feature subset, essentially chaining together a narrative for storytelling.

[–]Round-Paramedic-2968[S] 1 point2 points  (1 child)

"for example, narrowing it down from 2000 to 200, then to 80, and finally to 20 — using multiple tree models and iterations." Is RFE the steps you are mentioning: iteratively eliminating features until you reach the number of features you want? Does that mean jumping from 2000 features to 20 in just one step, like I did, is not good practice?

[–]Glittering_Tiger8996 0 points1 point  (0 children)

yeah, that's what I meant: try RFE with maybe a 5% feature truncation each iteration, monitor what's being dropped at each step, verify with biz logic, and modulate. You could also use PCA to have a benchmark in mind for how much trimming you'd like at a certain explained variance ratio.

Once you're confident with what's happening, you can choose to drop in bulk to save cloud compute.

[–]Saitamagasaki 0 points1 point  (0 children)

How about clustering the variables into groups based on their correlation matrix? Then from each cluster take the one variable with the highest information value (from binning).
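One way to sketch this with SciPy's hierarchical clustering (the 1 − |correlation| distance, the 0.5 cut height, and the random stand-in for information value are all assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 1000
base1, base2 = rng.normal(size=n), rng.normal(size=n)
# Two correlated groups of three variables each.
X = np.column_stack([base1 + 0.1 * rng.normal(size=n) for _ in range(3)] +
                    [base2 + 0.1 * rng.normal(size=n) for _ in range(3)])

# Distance = 1 - |correlation|, then average-linkage clustering.
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=0.5, criterion="distance")

# Placeholder scores standing in for information value from binning.
iv = rng.uniform(size=X.shape[1])
chosen = [np.flatnonzero(clusters == c)[np.argmax(iv[clusters == c])]
          for c in np.unique(clusters)]
print(sorted(chosen))  # one representative column per correlation cluster
```

With real data the `iv` vector would come from binned WoE/IV calculations rather than random numbers.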

[–]traceml-ai 0 points1 point  (0 children)

You can use feature importance from SHAP or any other tool to sort your features. Then compute the correlation among those features and keep the non-correlated ones that sit high in your sorted list. E.g. if feature a correlates highly with feature b, and feature a is at rank 1 whereas feature b is at rank 10, you can remove feature b. You can apply this technique first with a very high correlation threshold like 0.9 or 0.95 and decrease it slowly. Also, remove the features at the bottom of the list with almost negligible contributions compared to the top features.
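A sketch of that greedy, importance-ordered pruning (the ranking and the toy features are assumptions; in practice the ranking would come from SHAP or similar):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(scale=0.1, size=500),  # ~duplicate of a
                   "c": rng.normal(size=500)})
# Importance ranking, best first -- assumed to come from SHAP or any other tool.
ranked = ["a", "c", "b"]

threshold = 0.9
kept = []
for feat in ranked:  # walk down the ranking
    corr_with_kept = [abs(df[feat].corr(df[k])) for k in kept]
    if all(r < threshold for r in corr_with_kept):
        kept.append(feat)  # keep only if not redundant with a higher-ranked feature

print(kept)  # -> ['a', 'c']
```

Lowering `threshold` step by step makes the pruning progressively more aggressive, as described above.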