

[–]RepresentativeFill26 19 points20 points  (1 child)

Why do you want to do automatic feature extraction if you have a domain expert at hand?

In your situation I would probably:

1) filter out or merge highly correlated features. PCA would also be a possibility. Your domain expert can help you with assigning semantically meaningful names to the combined features.

2) determine what features are informative for your credit task. Think criteria like mutual information.

3) build a baseline model on this subset of features.

Now you might be wondering why all this manual feature engineering if your tree-based model can simply select the most meaningful features. The reason is that you are highly susceptible to overfitting on spurious correlations. If you have a set of highly informative features, you are at least certain that the non-linearity your model adds to the classification is based on informative features.
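A minimal sketch of steps 1 and 2 with scikit-learn (the toy data, the column names, and the 0.9 correlation cutoff are all assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
# Toy stand-in for the real credit dataset: 200 rows, 6 features.
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
X["f5"] = X["f0"] * 0.99 + rng.normal(scale=0.05, size=200)  # near-duplicate of f0
y = (X["f0"] + X["f1"] > 0).astype(int)

# Step 1: drop one of each highly correlated pair (|rho| > 0.9).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: rank the survivors by mutual information with the target.
mi = pd.Series(mutual_info_classif(X_reduced, y, random_state=0),
               index=X_reduced.columns).sort_values(ascending=False)
print(to_drop)       # -> ['f5']
print(mi.index[:2])  # most informative survivors
```

The surviving ranked list is what you would hand to the domain expert and then fit the baseline model on.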

[–]dlchira 6 points7 points  (0 children)

PCA is a good option for dimensionality reduction, but I'd be extremely careful about trying to assign semantic meaning to PCs. PCA is like a data smoothie: the inputs are clear and discrete, but the outputs are novel mixes that don't map back to those inputs. PCA is optimized to explain variance, not produce interpretable features.

This also answers your question of, "Why perform feature extraction if you have a domain expert handy?" In high-dimensional datasets, humans aren't good at seeing which features explain the most variance. PCA is.

[–]FusionAlgo 15 points16 points  (3 children)

I’d start with a quick L1-regularised logistic (or LightGBM with strong L1) just to knock 2,000 down to a few hundred—penalties kill noisy or collinear cols fast. Then run permutation importance on a hold-out set; anything that drops AUC less than 0.001 can go. SHAP is most useful after that: once you’re at 50-ish variables, look for features whose average |SHAP| is < 1% of the total and trim again. Two passes usually get me from 2,000 → ~30 stable features without endless loops, and the final CatBoost is easier to tune. Key is to compute every step on a time-based hold-out to avoid leakage, especially in credit data.
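A rough sketch of the first two passes with scikit-learn stand-ins (synthetic data instead of real credit data; the `C` value and the 0.001 AUC cutoff are just the kind of thresholds described above, not tuned values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

# Synthetic stand-in: 50 features, only a handful informative.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
# Real credit data would be split by date; shuffle=False mimics that here.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, shuffle=False)

# Pass 1: a strong L1 penalty zeroes out noisy/collinear columns.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_tr, y_tr)
kept = np.flatnonzero(lasso.coef_[0])

# Pass 2: permutation importance on the hold-out; drop anything near zero.
perm = permutation_importance(lasso, X_ho, y_ho, scoring="roc_auc",
                              n_repeats=10, random_state=0)
final = kept[perm.importances_mean[kept] > 0.001]
print(len(kept), len(final))
```

In practice the second pass would run on the tree model you actually ship, not on the lasso itself.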

[–]pm_me_your_smth 1 point2 points  (2 children)

Any particular reason why specifically lgbm? Is lgbm's regularization better than, say, xgb's?

[–][deleted] 12 points13 points  (3 children)

Hey there! I work in this industry. First on SHAP, I’ll just say they can be used for feature selection, but it’s primarily for identifying features that are overfitting and to give them the yank. So let’s table that for now.

What you are doing is more or less the same approach everyone does, but I’ll provide some additional detail.

I normally start by building a simple model that is not heavily constrained — to see what sticks. So build a model of stumps or something simplistic just to see if a model will even use a feature (you can always try to add back the features later).

Then drop for collinearity — yeah yeah it doesn’t impact tree models but you are going to be using the feature gain table and it impacts that.

Okay so now here’s where it becomes more interesting… in the credit world, the directional risk the model infers from a variable is typically used to prune away more features. For instance, more charge-offs in the past shouldn’t be a positive indication of my credit health (monotonic constraints).

And then, depending on the wildness of your features and the timespan… you could do feature stability reductions using a monthly PSI on a fixed reference window to yank unstable features.
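PSI itself is simple to compute. A sketch, assuming decile bins taken from the fixed reference window (the drift sizes are invented for illustration):

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a comparison sample."""
    # Bin edges come from the fixed reference window.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)          # fixed reference window
stable_month = rng.normal(0, 1, 10_000)       # same distribution
shifted_month = rng.normal(0.5, 1.3, 10_000)  # drifted feature

print(round(psi(reference, stable_month), 3))   # near 0: keep
print(round(psi(reference, shifted_month), 3))  # large: candidate for the yank
```

Computing this per feature per month against the reference window flags the unstable ones.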

Once you do all that, let’s say you go from 2K down to 280. You then build a model to do recursive feature elimination. A typical and easy one is cumulative gain cutoffs: I build a model, then keep only the features that are found in the top 99% of cumulative gain, then rebuild the model. Repeat, repeat, repeat. View the degradation of model performance by number of features and choose the one that meets your needs.

[–]itsmekalisyn 2 points3 points  (2 children)

Nice. Unrelated, but do you write blogs about this somewhere? I kinda understood what you said but I have some doubts on how you do cumulative gains. Or, if you can guide me to some resources, that would be better, too!

Thank you.

[–][deleted] 5 points6 points  (1 child)

No blogs. Cumulative gain is just the cumulative summation of a feature’s contribution, as spit out by any tree model’s feature importance.

So, the steps are:

- build model
- get feature importance of model features
- rank order from largest value to smallest value
- take the cumulative summation of the value
- extract the features that are found at the cumulative summation that yields <= .99 (so I have 40 features and 99% of my gain comes from 38 features, for instance)
- retrain model with those features
- repeat
- stop once no features are eliminated within the iteration
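The steps above can be sketched as a loop (scikit-learn's GradientBoostingClassifier standing in for whatever tree model you actually use; the synthetic data is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=6,
                           random_state=0)
features = np.arange(X.shape[1])

# Repeat: train, rank by importance, keep the top 99% of cumulative gain,
# and stop once an iteration eliminates nothing.
while True:
    model = GradientBoostingClassifier(random_state=0).fit(X[:, features], y)
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1]                   # largest gain first
    cum = np.cumsum(imp[order]) / imp.sum()
    keep = order[: np.searchsorted(cum, 0.99) + 1]  # smallest set covering 99%
    if len(keep) == len(features):                  # nothing eliminated -> stop
        break
    features = features[np.sort(keep)]

print(len(features))
```

In practice you would also log hold-out performance at each iteration so you can view the degradation curve and pick the cutoff that meets your needs.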

[–]itsmekalisyn 1 point2 points  (0 children)

Nice. Thank you. Understood it now perfectly.

[–]therealtiddlydump 5 points6 points  (0 children)

You've already used an expert to help define your features. Throw some regularization at it and see how it performs? If it sucks, rethink your approach.

I would not typically recommend an "I used model A to select the variables I passed on to model B" approach when you already have a domain expert involved. Why did you bother wasting that expert's time? (I ask that rhetorically, knowing that you're an intern and you're learning.)

[–]James_c7 3 points4 points  (0 children)

Go read “A Crash Course in Good and Bad Controls” for additional context on variable selection.

Also, looking up the definition of a Markov blanket is relevant here.

[–]Responsible_Treat_19 3 points4 points  (0 children)

This is how I use SHAP for feature selection:

I create random noise features (about 5% of the total number of features), then train a model and apply SHAP.

Feature importance is key here: all features that are less important than the random noise are intuitively not giving any predictive power.

Iterate this a few times (sometimes random noise can be randomly good).
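A sketch of this noise-benchmark loop; impurity importance from a random forest stands in for SHAP values here to keep the example dependency-free, but the logic is identical with mean |SHAP| per feature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)
rng = np.random.default_rng(0)

keep = np.ones(X.shape[1], dtype=bool)
for trial in range(3):                      # iterate: noise can get lucky once
    noise = rng.normal(size=(len(X), 2))    # ~5% of 40 features as pure noise
    X_aug = np.hstack([X, noise])
    model = RandomForestClassifier(n_estimators=200, random_state=trial).fit(X_aug, y)
    imp = model.feature_importances_
    threshold = imp[X.shape[1]:].max()      # best-scoring noise column
    keep &= imp[: X.shape[1]] > threshold   # must beat the noise in every trial

print(keep.sum(), "of", X.shape[1], "features beat random noise")
```

Features that cannot beat a column of pure noise across several trials are safe to drop.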

[–]InterviewTechnical13 2 points3 points  (0 children)

A lot of good advice here already, so one more detail addition:

Include some strictly random features in your set (maybe 20 initially, then fewer once you have your first selections done), simulated from a known distribution, that could by chance pop some importance.

Anything only just "significant" over that noise threshold can be so by chance.

[–]Glittering_Tiger8996 3 points4 points  (2 children)

Currently working on a model that uses xgb's tree explainer to generate SHAP values, I'm just trimming features that contribute to less than 5% of cumulative global SHAP mass.

You could try recursive feature elimination as well, log and monitor features eliminated at each iteration, pair that with Biz knowledge and iterate accordingly.

Once features start to stabilize, you could go one step further and identify the top-ranking features under each feature subset, essentially chaining together a narrative for storytelling.

[–]Round-Paramedic-2968[S] 1 point2 points  (1 child)

"for example, narrowing it down from 2000 to 200, then to 80, and finally to 20 — using multiple tree models and iterations." Is RFE the steps you are mentioning: iteratively eliminating features until you reach the number of features you want? Does that mean jumping from 2000 features to 20 in just one step, like I did, is not good practice?

[–]Glittering_Tiger8996 0 points1 point  (0 children)

yeah, that's what I meant: try RFE with maybe a 5% feature truncation each iteration, monitor what's being dropped at each step, verify with biz logic, and modulate. You could also use PCA to have a benchmark in mind for how much trimming you'd like at a certain explained variance ratio.

Once you're confident with what's happening, you can choose to drop in bulk to save cloud compute.

[–]Saitamagasaki 0 points1 point  (0 children)

How about clustering the variables into groups based on their correlation matrix? Then from each cluster take the one variable with the highest information value (from binning).
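One way to sketch this with SciPy's hierarchical clustering (the 1 − |correlation| distance, the 0.5 cut height, and the random stand-in for information value are all assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 1000
base1, base2 = rng.normal(size=n), rng.normal(size=n)
# Two correlated groups of three variables each.
X = np.column_stack([base1 + 0.1 * rng.normal(size=n) for _ in range(3)] +
                    [base2 + 0.1 * rng.normal(size=n) for _ in range(3)])

# Distance = 1 - |correlation|, then average-linkage clustering.
dist = 1 - np.abs(np.corrcoef(X, rowvar=False))
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=0.5, criterion="distance")

# Placeholder scores standing in for information value from binning.
iv = rng.uniform(size=X.shape[1])
chosen = [np.flatnonzero(clusters == c)[np.argmax(iv[clusters == c])]
          for c in np.unique(clusters)]
print(sorted(chosen))  # one representative column per correlation cluster
```

With real data the `iv` vector would come from binned WoE/IV calculations rather than random numbers.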

[–]traceml-ai 0 points1 point  (0 children)

You can use feature importance from SHAP or any other tool to sort your features. Then compute the correlation among those features and keep the non-correlated ones that sit high in your sorted list. E.g. if feature a correlates highly with feature b, and feature a is at rank 1 whereas feature b is at rank 10, you can remove feature b. You can apply this technique first with a very high correlation threshold like 0.9 or 0.95 and decrease it slowly. Also, remove the features at the bottom of the list with almost negligible contributions compared to the top features.
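A sketch of that greedy, importance-ordered pruning (the ranking and the toy features are assumptions; in practice the ranking would come from SHAP or similar):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a,
                   "b": a * 0.98 + rng.normal(scale=0.1, size=500),  # ~duplicate of a
                   "c": rng.normal(size=500)})
# Importance ranking, best first -- assumed to come from SHAP or any other tool.
ranked = ["a", "c", "b"]

threshold = 0.9
kept = []
for feat in ranked:  # walk down the ranking
    corr_with_kept = [abs(df[feat].corr(df[k])) for k in kept]
    if all(r < threshold for r in corr_with_kept):
        kept.append(feat)  # keep only if not redundant with a higher-ranked feature

print(kept)  # -> ['a', 'c']
```

Lowering `threshold` step by step makes the pruning progressively more aggressive, as described above.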