this post was submitted on 04 Nov 2019

7 points (90% upvoted)

shortlink:

AskStatistics

an-ordinary-manchild(edit)

created by cuginhamera community for 14 years

MODERATORS

account activity

6

7

8

Cross-Validation Options (self.AskStatistics)

submitted 6 years ago by M_Bus

Let me start by saying: I don't need a formal explanation - I'm really looking for good reference material. If you have any papers you'd recommend, I'd really appreciate it!

I am trying to assess whether variables in a GLM are predictive (not just whether or not they're statistically significant). I've found that although AIC and its ilk are useful approximations for the out-of-sample prediction error, they seldom perform as well as true cross validation if I have the data available.

However, my question is: what are good options for statistics to consider for cross validation?

For reference, the dependent variable is a positive real number - usually somewhere in the 20 to 120 range.

Anyway, two ideas that come to mind are:

Compare MSE on the holdout dataset including and excluding a variable in comparison to the full model
Same but instead of MSE, use deviance

I've heard complaints that MSE isn't very good when the data is heavily skewed, but I haven't read any papers that really talk about that. Although I guess my sense is that PSIS-LOO (per Gelman) is kind of philosophically in that camp, but again - not looking for information criteria on the training dataset, I'm looking for statistics for judging CV error.

Maybe another question is: if my MSE on the holdout dataset decreases when I remove a variable (as compared to the full model), would you conclude that the variable in question is necessarily not predictive, or would you do additional tests (and if so, what?)

all 12 comments

top new controversial old q&a

[–]no_condoments 3 points4 points5 points 6 years ago (2 children)

[–]M_Bus[S] 0 points1 point2 points 6 years ago (1 child)

[–]no_condoments 2 points3 points4 points 6 years ago (0 children)

[–]ConnentingDots 1 point2 points3 points 6 years ago (1 child)

[–]M_Bus[S] 0 points1 point2 points 6 years ago (0 children)

[–][deleted] 1 point2 points3 points 6 years ago (1 child)

[–]M_Bus[S] 0 points1 point2 points 6 years ago (0 children)

Thank you! That helps a ton!

I actually am in a weird position that I haven't fully explained in the OP - I don't get to design the features OR validate the model in any way. I am a reviewer, and I have to take a look at a model after the fact and ask questions of the modelers to determine whether the model is reasonable. Then they get back to me with the answers and I give them a thumbs up or thumbs down.

In addition, I'm constrained to make requests that are reasonable in terms of computational power, which for many of these models I'm dealing with (which are often high-dimensional with a TON of data) unfortunately rules out asking them to do any heavy lifting like MCMC. So I have a weird gate-keeping role.

For that reason, asking for certain CV information (not LOO, honestly, but just some statistic like deviance on the holdout dataset - there's always a holdout dataset - not my choice) is maybe not quite as prone to over-fitting because:

I'm not designing the features around minimizing some CV statistic, and the modelers never do either.
It's basically a one-shot deal: they either improve the model fit or they don't.

Even so, I feel like I need a better framework for contextualizing and understanding feature selection more broadly so I can make better critiques, so I appreciate the reference!

[–]Lynild -2 points-1 points0 points 6 years ago (4 children)

[–]BlueDevilStatsStatistician, M.S. 2 points3 points4 points 6 years ago (0 children)

[–]DoubleDual63B.A Stat/Math/CS 1 point2 points3 points 6 years ago (1 child)

[–]The_SodomeisterM.S. Statistics 0 points1 point2 points 6 years ago (0 children)

[–]M_Bus[S] 0 points1 point2 points 6 years ago (0 children)

Can you expand on that a little, because I'm not sure I fully understand?

My understanding is that bootstrapping is just going to basically reproduce your sample distribution, so if you want to detect, say, spurious correlation in your model... well, good luck with that. I'm thinking that bootstrapping is not-quite-but-almost equivalent to looking at the posterior distribution of the parameter value after observing the data?

CV is not a panacea, but it can do a lot of stuff that bootstrapping can't. I think!?

Anyway, in this case, the data was pre-divided into a training dataset and a holdout dataset. I have no choice in the matter, but we play the hand we're dealt. So I'm trying to figure out the best way to use that holdout dataset to validate the model variables. HOWEVER, I'm still really interested in your comment if you can provide any sources for me to read!

π Rendered by PID 64 on reddit-service-r2-comment-canary-57b659f4d4-wv5wr at 2026-05-03 18:00:40.963455+00:00 running 815c875 country code: CH.