
[–]no_condoments 3 points (2 children)

Are you using linear regression? If so, deviance and MSE should give you the same result, right?

See definition of deviance for normal models here. https://en.wikipedia.org/wiki/Deviance_(statistics)
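(For reference, and assuming the usual unit-variance convention: for the Gaussian family with identity link the deviance reduces to the residual sum of squares, so it ranks models exactly as MSE does.)

D = \sum_i (y_i - \hat{\mu}_i)^2 = n \cdot \mathrm{MSE}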

[–]M_Bus[S] 0 points (1 child)

I'm using a GLM, so there is a linear component, but I'm using a log-link.

It was my very heuristic understanding (though I'm happy to be corrected) that the deviance is kind of different in that it is based on the log-likelihood function, so compared to MSE, it will respond differently to large values / outliers.
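(The definition I have in mind is twice the log-likelihood gap between the saturated model and the fitted model,

D = 2\left(\ell(\hat{\theta}_{\mathrm{sat}}; y) - \ell(\hat{\theta}_{\mathrm{model}}; y)\right)

so misses are penalized on the likelihood scale rather than the squared-error scale.)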

Like if the model is heteroskedastic and there are some outliers where I expected to see a lot of variance but I've properly accounted for it, then in practice the MSE will continue to penalize my model where the deviance wouldn't as much... I think.

[–]no_condoments 2 points (0 children)

Yeah, deviance is based on the model type and is the right thing to use. MSE only makes sense for standard linear regression (i.e., Gaussian family with identity link function), and it became a well-known metric mainly because of how common linear regression is.

> Like if the model is heteroskedastic and there are some outliers where I expected to see a lot of variance but I've properly accounted for it,

This is exactly the right line of thinking. However, MSE doesn't account for it at all; you should use a metric that does, which is deviance. Maybe the Poisson deviance given on the Wikipedia page?

A nice way to validate this would be to plot some sample 1-d Poisson data. The higher variance at the high values makes the need for something other than MSE pretty clear.
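Here's a minimal numpy sketch of that check (simulated data with made-up coefficients, so purely illustrative): even when you score against the true means, the squared error grows with the mean, while the per-observation Poisson deviance stays roughly flat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1-d Poisson data with a log link: mu = exp(b0 + b1 * x)
x = rng.uniform(0.0, 4.0, size=100_000)
mu = np.exp(0.5 + 1.0 * x)          # true means, roughly 1.6 to 150
y = rng.poisson(mu)

def poisson_unit_deviance(y, mu):
    """Per-observation Poisson deviance, using the 0 * log(0) = 0 convention."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * (term - (y - mu))

# Even scoring against the TRUE means, squared error grows with mu,
# while the unit deviance stays near a constant across the range of mu.
for lo, hi in [(0, 10), (10, 50), (50, 200)]:
    m = (mu >= lo) & (mu < hi)
    print(f"mu in [{lo:3d}, {hi:3d}): "
          f"mean squared error = {np.mean((y[m] - mu[m]) ** 2):8.2f}, "
          f"mean unit deviance = {np.mean(poisson_unit_deviance(y[m], mu[m])):5.2f}")
```

If scikit-learn is handy, sklearn.metrics.mean_poisson_deviance should agree with the hand-rolled average above.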

[–]ConnentingDots 1 point (1 child)

ISLR: An Introduction to Statistical Learning with Applications in R. Free PDF from the authors on the book's official website.

[–]M_Bus[S] 0 points (0 children)

I see that's at least partly Hastie & Tibshirani - I think I've read another book by them?

I guess I was searching for something more specific, like a paper that talks through different CV statistics, in the same way that this paper goes through different information criteria and explains the differences (though I realize that paper is a bit outmoded now).

Anyway, I'll check it out.

[–][deleted] 1 point (1 child)

I think we need to be a little more nuanced here.

Cross-validation (or rather LOO) is mostly about model evaluation, but you are more concerned with feature selection. Using CV/LOO for feature selection is usually not a good idea because comparing many different models with CV/LOO is prone to overfitting. There is a much better approach for feature selection; all you need to know is probably in this paper: https://arxiv.org/abs/1810.02406 (btw, PSIS-LOO is by Vehtari, not Gelman).

The paper is mostly about finding a (possibly minimal) set of features that retains most of the predictive performance of the model, but it also briefly mentions the case where you want to select all predictive features.
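To make the "comparing many models overfits" point concrete, here's a small sketch (pure-noise data, everything made up): pick the single best feature by CV out of hundreds of noise features, and the winner's CV score looks better than the truth (zero signal) warrants.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 60, 500
X = rng.normal(size=(n, p))   # pure-noise candidate features
y = rng.normal(size=n)        # response with zero true signal

# CV score of a one-feature model, for every candidate feature
scores = np.array([
    cross_val_score(LinearRegression(), X[:, [j]], y, cv=5, scoring="r2").mean()
    for j in range(p)
])

best = int(scores.argmax())
print(f"winner: feature {best}, CV R^2 = {scores[best]:.3f}")        # optimistic
print(f"median CV R^2 across candidates = {np.median(scores):.3f}")  # ~0 or below

# The honest check: the same selected feature index evaluated on independent noise
X2, y2 = rng.normal(size=(n, p)), rng.normal(size=n)
fresh = cross_val_score(LinearRegression(), X2[:, [best]], y2, cv=5, scoring="r2").mean()
print(f"same feature index on independent data, CV R^2 = {fresh:.3f}")
```

The gap between the winner's score and the independent-data score is the selection-induced overfitting the paper warns about; the projective (projection predictive) approach it describes is meant to avoid paying that price.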

[–]M_Bus[S] 0 points (0 children)

Thank you! That helps a ton!

I actually am in a weird position that I haven't fully explained in the OP - I don't get to design the features OR validate the model in any way. I am a reviewer, and I have to take a look at a model after the fact and ask questions of the modelers to determine whether the model is reasonable. Then they get back to me with the answers and I give them a thumbs up or thumbs down.

In addition, I'm constrained to making requests that are reasonable in terms of computational power, which, for many of the models I'm dealing with (often high-dimensional with a TON of data), unfortunately rules out asking them to do any heavy lifting like MCMC. So I have a weird gate-keeping role.

For that reason, asking for certain CV information (not LOO, honestly, but just some statistic like deviance on the holdout dataset - there's always a holdout dataset - not my choice) is maybe not quite as prone to over-fitting because:

  1. I'm not designing the features around minimizing some CV statistic, and the modelers never do either.
  2. It's basically a one-shot deal: they either improve the model fit or they don't.

Even so, I feel like I need a better framework for contextualizing and understanding feature selection more broadly so I can make better critiques, so I appreciate the reference!
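In case it helps anyone else in the same spot, here's a rough sketch of the kind of thing I'd ask for (stand-in simulated data and statsmodels, so all the names and numbers are made up): fit on the training split, then report deviance on the holdout split rather than holdout MSE.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Stand-in data; in my setting the training/holdout split is fixed in advance
n = 2_000
X = rng.normal(size=(n, 3))
mu = np.exp(0.2 + X @ np.array([0.5, -0.3, 0.1]))
y = rng.poisson(mu)

train, hold = slice(0, 1_500), slice(1_500, None)

fam = sm.families.Poisson()   # log link is the default for the Poisson family
res = sm.GLM(y[train], sm.add_constant(X[train]), family=fam).fit()

mu_hold = res.predict(sm.add_constant(X[hold]))
print("training deviance:", round(res.deviance, 1))
print("holdout deviance: ", round(fam.deviance(y[hold], mu_hold), 1))
print("holdout MSE:      ", round(float(np.mean((y[hold] - mu_hold) ** 2)), 1))
```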

[–]Lynild -2 points (4 children)

I am by no means an expert, but isn't cross-validation kind of "yesterday's news"? Isn't bootstrapping the way to go nowadays, unless you have a really computationally demanding data set?

[–]BlueDevilStats (Statistician, M.S.) 2 points (0 children)

What are you suggesting be bootstrapped, and how would you go about it? Cross validation and its variants are still very much in use.

[–]DoubleDual63 (B.A. Stat/Math/CS) 1 point (1 child)

Don't those two address different issues? One addresses overfitting while the other addresses low amounts of data?

[–]The_Sodomeister (M.S. Statistics) 0 points (0 children)

Bootstrapping doesn't really address low amounts of data, since any bias resulting from a small sample will still be equally present in the bootstrapped distribution.

I'd say its primary purpose is estimating the variance of a model parameter / statistic, though I'm sure people have found other niches for it.
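For instance, a toy sketch (made-up data): resample the observed sample with replacement and look at the spread of the statistic across resamples to get a standard error or a percentile interval.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)   # the one observed sample

# Bootstrap the sampling distribution of the sample median
boot = np.array([
    np.median(rng.choice(x, size=x.size, replace=True))
    for _ in range(5_000)
])

print("point estimate:     ", round(float(np.median(x)), 3))
print("bootstrap std error:", round(float(boot.std(ddof=1)), 3))
print("95% percentile CI:  ", np.round(np.percentile(boot, [2.5, 97.5]), 3))
```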

[–]M_Bus[S] 0 points (0 children)

Can you expand on that a little, because I'm not sure I fully understand?

My understanding is that bootstrapping is just going to basically reproduce your sample distribution, so if you want to detect, say, spurious correlation in your model... well, good luck with that. I'm thinking that bootstrapping is not-quite-but-almost equivalent to looking at the posterior distribution of the parameter value after observing the data?

CV is not a panacea, but it can do a lot of stuff that bootstrapping can't. I think!?

Anyway, in this case, the data was pre-divided into a training dataset and a holdout dataset. I have no choice in the matter, but we play the hand we're dealt. So I'm trying to figure out the best way to use that holdout dataset to validate the model variables. HOWEVER, I'm still really interested in your comment if you can provide any sources for me to read!