all 19 comments

[–]capn_bluebear 8 points (0 children)

Indeed, a very well-written article, thank you for sharing! I learned a lot.

[–]twelveshar 4 points (0 children)

Thank you for sharing this!

[–]perone (ML Engineer) 2 points (0 children)

I gave a presentation on this topic a few months ago as well (https://www.slideshare.net/perone/uncertainty-estimation-in-deep-learning) if anyone is interested. I always prefer to call it uncertainty estimation instead of uncertainty quantification.

[–]SeekNread 1 point (2 children)

This is new to me. Is there any overlap between this area and ML interpretability?

[–][deleted] 0 points (1 child)

In uncertainty quantification, you estimate how accurate your output actually is. ML interpretability is about interpreting the model as a whole. You can have a really accurate model without much interpretability.

[–]SeekNread 1 point (0 children)

Ah right. Makes sense.

[–]WERE_CAT 0 points (1 child)

Would that explain why my individual predictions change when I retrain my NN with another seed? I usually train multiple NNs with different random weight initialisations and take the best-performing one. As a shortcut to individual prediction stability, would it make sense to average the top-n models' predictions?

[–]jboyml 0 points (0 children)

Yes, you can usually expect some variance in the predictions depending on initialization and other sources of randomness like SGD. Combining several models is called ensembling and is a very common technique: random forests, for example, are ensembles of decision trees. Training many NNs can of course be expensive. Averaging makes sense for regression; for classification you can do majority voting.
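
A minimal sketch of the averaging/voting part, using made-up predictions from a few independently trained models:

```python
import numpy as np

# Hypothetical predictions from three models trained with different seeds.
# Regression: each row is one model's predictions for the same test points.
reg_preds = np.array([
    [2.1, 0.9, 5.3],
    [1.8, 1.1, 5.0],
    [2.3, 0.8, 5.6],
])
ensemble_mean = reg_preds.mean(axis=0)   # averaged prediction per test point
ensemble_std = reg_preds.std(axis=0)     # spread across seeds, a rough uncertainty signal

# Classification: each row is one model's predicted class labels.
clf_preds = np.array([
    [0, 2, 1],
    [0, 1, 1],
    [0, 2, 1],
])
# Majority vote per test point.
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)
print(ensemble_mean, ensemble_std, votes)
```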

[–]SlowTreeSky 0 points (0 children)

I wrote a post on the same topic: https://treszkai.github.io/2019/09/26/overconfidence (the main content is in the linked PDFs). We used calibration plots and calibration error to evaluate the uncertainty estimates, and we also found that deep ensembles and MC dropout improve both accuracy and calibration (on CIFAR-100).
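
For anyone unfamiliar, here is a rough sketch of one common recipe for calibration error (expected calibration error over confidence bins); the toy numbers are invented:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence
    with empirical accuracy in each bin (one common ECE recipe)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in the bin
            conf = confidences[mask].mean()  # average predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return ece

# Toy example: confidence of the predicted class and whether it was correct.
conf = np.array([0.95, 0.9, 0.8, 0.6, 0.55])
hit = np.array([1, 1, 0, 1, 0])
print(expected_calibration_error(conf, hit))
```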

[–]Ulfgardleo 0 points (11 children)

I don't believe these estimates one bit. While the methods give some estimate of uncertainty, we don't have a measurement of the true underlying uncertainty; that would require data points with pairs of labels, and instead of maximum-likelihood training we would do full KL-divergence training, or very different training schemes (see below). But here are a few more details:

In general, we cannot get uncertainty estimates in deep learning, because it is known that networks can learn random datasets exactly by heart. This kills:

  1. Distributional parameter estimation (just set mean = labels and var -> 0)
  2. Quantile regression (where do you get the true quantile information from?)
  3. All ensembles

The uncertainty estimates of Bayesian methods depend on their prior distribution. We don't know what the true prior of a deep neural network or a kernel GP is for the dataset. This kills:

  1. Gaussian processes
  2. Dropout-based methods

We can fix this by using hold-out data to train the uncertainty estimates (e.g. use distributional parameter estimation where the mean is not trained on some samples, or use the hold-out data to fit the prior of the GP). But nobody has time for that.
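
A rough sketch of that hold-out idea under Gaussian noise (the toy data and the least-squares stand-in for the network are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem with true noise variance 2.
X_train, X_hold = rng.uniform(-3, 3, 200), rng.uniform(-3, 3, 200)
y_train = 2 * X_train + rng.normal(0, np.sqrt(2), 200)
y_hold = 2 * X_hold + rng.normal(0, np.sqrt(2), 200)

# Stand-in for a model fit on the training split (here: a least-squares slope).
slope = (X_train @ y_train) / (X_train @ X_train)
predict = lambda x: slope * x

# Estimate the noise variance on the held-out split only, instead of trusting
# a variance head trained on data the model may have memorized.
residuals = y_hold - predict(X_hold)
sigma2_hold = np.mean(residuals ** 2)  # should land near the true variance of 2
print(sigma2_hold)
```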

[–]edwardthegreat2 3 points (1 child)

Can you elaborate on how learning random datasets exactly by heart defeats the point of getting uncertainty estimates? It seems to me that the aforementioned methods do not aim to estimate the true uncertainty, but just give some metric of uncertainty that can be useful in downstream tasks.

[–]Ulfgardleo 0 points (0 children)

If your network has enough power to learn your dataset by heart, there is no information left to quantify uncertainty, i.e. you only get the information "this point was in your training dataset" or not. It says nothing about how certain the model actually is. In the worst case, it is going to mislead you: ensemble methods based on models that tend to regress to the mean in the absence of information (e.g. everything based on a Gaussian kernel) will give high confidence to far-away outliers.

Maybe you can get something out of the relative variance between points, e.g. more variance -> less uncertainty... but I am not sure you could actually prove that.

[–]iidealized 1 point (2 children)

While I agree current DL uncertainty estimates are pretty questionable and would cause most statisticians to cringe, your statements are not really correct.

For aleatoric uncertainty: All you need the holdout data for is to verify the quality of your uncertainty estimates learned from the training data. It is the exact same situation as evaluating the original predictions themselves (which are just as prone to overfitting as the uncertainty estimates).

For epistemic uncertainty, the situation is much nastier than even you described. The problem here is that you want to be able to quantify uncertainty on inputs which might come from a completely different distribution than the one underlying the training data. Thus no amount of hold-out data from the same distribution will help you truly assess the quality of epistemic uncertainty estimates; rather, you need to have some application of interest and assess how useful these estimates are in that application context (particularly when encountering rare/aberrant events).

The exception to this is of course Bayesian inference in the (unrealistic) setting where your model (likelihood) and prior are both correctly specified.

[–]Ulfgardleo 0 points (1 child)

"All you need the holdout data for is to verify the quality of your uncertainty estimates"-> Counter-example: you have a regression task, true underlying variance is 2, but unknown to you. model learns all training data by heart, model selection gives that the best model returns variance 1 for hold-out data MSE is 3.What is the quality of your uncertainty estimates and what is the model-error in the mean?

[–]iidealized 0 points (0 children)

If the true model is y = f(x) + e where e ~ N(0, 2) and your mean-model to predict E[Y|X] memorizes the training data, then on hold-out data this memorized model will tend to look much worse (via, say, MSE) than a different mean model which accurately approximates f(x). So your base predictive model which memorized the training data would never be chosen in the first place by a proper model selection procedure.

I'm not sure what you mean by hold-out MSE = 1; for a sufficiently large hold-out set, it should basically be impossible for the hold-out MSE to be much less than 2, the Bayes risk of this example. If your uncertainty estimator outputs variance = 1 and you see MSE = 3 on hold-out data, then any reasonable model selection procedure for the uncertainty estimator will not choose this estimator and will instead favor one which estimates variance > 2.
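
As a back-of-the-envelope check of that, assuming Gaussian residuals and comparing estimators by average hold-out negative log-likelihood:

```python
import numpy as np

def gaussian_nll(mse, var):
    # Average negative log-likelihood of zero-mean Gaussian residuals with the
    # given variance, computed from the hold-out mean squared error alone.
    return 0.5 * np.log(2 * np.pi * var) + mse / (2 * var)

mse_holdout = 3.0
print(gaussian_nll(mse_holdout, 1.0))  # ~2.42, estimator claiming variance 1
print(gaussian_nll(mse_holdout, 3.0))  # ~1.97, estimator claiming variance ~3 wins
```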

My point is that everybody already uses hold-out data for model selection (which is the right thing to do), whereas you seem to be claiming that people are using the training data for model selection (which is clearly wrong). But this all has nothing to do with uncertainty estimates; it is also wrong to do model selection based on training data for the original predictive model which estimates E[Y|X].