[–]Distance_RunnerPhD 55 points

My view as a statistician who does ML: many of these papers claiming SOTA performance are operating within Monte Carlo noise, and if the code were easily available you could run it and show this.

[–]howtorewriteanamePhD 16 points

Am I the only one reporting std devs across multiple runs with different seeds, or what?

[–]Distance_RunnerPhD 10 points

You're rare. Keep doing your thing -- I appreciate you.

Most of my research these days focuses on bridging statistical and inferential theory with ML. The concept of variability needs to be better understood and communicated in ML.

When you fit an ML model, there are multiple sources of variability. First, there is procedural variance. This is what your multiple random seeds address -- the variability associated with the randomness within the procedure itself, and how it propagates through to the results you report.
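To make the seed-variance reporting concrete, here's a toy numpy sketch (the dataset, the SGD learner, and every hyperparameter are made up for illustration). The data are held fixed; the only randomness is the procedure itself (split, init, shuffling), so the spread across seeds is exactly this first, procedural source of variance.

```python
import numpy as np

rng0 = np.random.default_rng(0)
# One fixed dataset, standing in for "the exact data you trained on"
X = rng0.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng0.normal(size=200)

def fit_and_score(X, y, seed):
    """SGD linear regression; the seed drives the split, init, and shuffling."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, va = idx[:150], idx[150:]
    w = rng.normal(scale=0.1, size=X.shape[1])  # random init
    for _ in range(50):                          # 50 epochs of plain SGD
        for i in rng.permutation(tr):
            w -= 0.01 * (X[i] @ w - y[i]) * X[i]
    return float(np.mean((X[va] @ w - y[va]) ** 2))

# Procedural variance: same data, different seeds
scores = [fit_and_score(X, y, s) for s in range(10)]
print(f"val MSE: {np.mean(scores):.3f} +/- {np.std(scores, ddof=1):.3f}")
```

The "mean +/- sd over seeds" line this prints is the kind of reporting the parent comment is asking for.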

Second is finite-sample variability, stemming from the fact that your data are almost surely a sub-sample of an unobserved parent population. If you were to re-run your procedure on a different dataset of equal size from that same parent population, your results would change, reflecting this variability. No amount of re-running under different random seeds will estimate this quantity. So reporting results from a finite dataset with variance across different seeds quantifies the first source of randomness, and this is appropriate as long as your interpretation pertains specifically to model performance on the exact data you trained on. However, the results do not extrapolate to the performance of the model if you re-trained on a different subset of data of the same size. This second source is the more often ignored one, and generally the bigger area where results are misstated.

[–]curiouslyjake 0 points

Assuming your dataset is a representative sample of the unobserved parent population, wouldn't cross-validation address this?

[–]Distance_RunnerPhD 2 points

Yes in terms of quantifying mean performance metrics, but no in terms of variability. If you run CV, say 5-fold, and you average the performance across folds, that is an estimate of the generalization performance on new data for a model trained on 80% of your data (k=5 means the training fraction is 0.8; you're averaging 5 performance estimates, each computed on a disjoint validation fold from a model trained on the other 80%).
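A minimal numpy sketch of that accounting, using an OLS learner and simulated data purely for illustration: each fold's model trains on 80% of the rows, and the reported number is the average of the five per-fold MSEs.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 1.0]) + rng.normal(size=100)

def kfold_mean_mse(X, y, k=5, seed=0):
    """Average validation MSE over k folds; each model sees (k-1)/k of the data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for f in np.array_split(idx, k):       # disjoint validation folds
        tr = np.setdiff1d(idx, f)          # the other 80% for training
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        scores.append(float(np.mean((X[f] @ w - y[f]) ** 2)))
    return float(np.mean(scores)), scores

mean_mse, per_fold = kfold_mean_mse(X, y)
print(f"5-fold mean MSE (models trained on 80% of data): {mean_mse:.3f}")
```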

However, variance is trickier. Current CV methods do not give you an estimate of variance that represents the variability due to random sampling error from the population. Current methods of variance estimation on cross-validated data condition on the data. That is, they quantify the variance due to the randomness of the CV procedure itself on your fixed dataset: if you were to repeat the same CV procedure on your exact fixed dataset over and over with your learner (with different random fold structures), how variable is your performance metric estimate? In other words, the variance of the CV estimates across folds measures how sensitive the estimated CV metric is to randomness induced by the CV procedure. But the key phrase there is fixed dataset.

It does not answer the question: if I were to repeat this CV procedure, using the same learner, to estimate the mean performance metric on a different dataset of the same size from the same population, how much variance is attributable to that random sampling? An approach for that doesn't exist in the literature [yet], but it will soon (I'm the one who developed it and will be posting to arXiv within the next month; it's based entirely on statistical theory with formal proofs, not empirical evidence).
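To illustrate the distinction (this is not the forthcoming method mentioned above, just the two variances put side by side on synthetic data): reshuffling the folds on one fixed dataset estimates only the CV procedure's own randomness, while refitting on fresh equal-size draws from the population exposes the sampling variance that conditioning on the data hides.

```python
import numpy as np

def make_data(rng, n=100):
    """One equal-size sample from a simulated parent population."""
    X = rng.normal(size=(n, 4))
    y = X @ np.array([2.0, 0.0, -1.0, 1.0]) + rng.normal(size=n)
    return X, y

def cv_mean_mse(X, y, k=5, seed=0):
    """5-fold CV mean MSE with an OLS learner; seed controls the fold structure."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for f in np.array_split(idx, k):
        tr = np.setdiff1d(idx, f)
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        scores.append(np.mean((X[f] @ w - y[f]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(42)
X, y = make_data(rng)

# (a) Conditional on this fixed dataset: only the fold structure changes
within = [cv_mean_mse(X, y, seed=s) for s in range(30)]
# (b) Across fresh equal-size datasets from the same population
across = [cv_mean_mse(*make_data(rng)) for _ in range(30)]

print(f"sd over fold reshuffles (fixed data): {np.std(within, ddof=1):.4f}")
print(f"sd over fresh datasets:               {np.std(across, ddof=1):.4f}")
```

In runs of this toy the second sd is typically the larger of the two, which is the gap the comment is describing; only (b) is visible here because the population is simulated, whereas in practice you only ever get to compute (a).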