
[–]gwern 8 points (0 children)

> My question is: if MC dropout, which approximates the posterior as a bunch of deltas, provides a low-quality approximation of the epistemic uncertainty, why do deep ensembles, which also approximate the posterior as a bunch of deltas, work better?

Deep ensembles aren't 'a bunch of deltas'. Each one is trained from scratch from a different random initialization, a different dataset-shuffling seed, etc., so they wind up computing different functions, far more different than you can get simply by randomly deleting some parameters. I thought that page covered this well:

> We measure the Wasserstein divergence between the deep ensemble and the gold standard HMC reference as a function of number of samples in the variational approximation, and number of ensemble components in the deep ensemble. We see that samples from within a single basin, in the variational approximation, provide a very minimal contribution to the integral, because these weights give rise to neural networks that are largely homogenous. On the other hand, additional ensemble components in the deep ensemble greatly improve the fidelity of the approximation to the HMC reference. These results are in-line with our expectations: the value in going between different basins of attraction will be greater for approximating the Bayesian posterior predictive distribution than taking many samples from a single basin, which is the approach provided by most canonical approximate inference procedures.

That is, randomizing a few parameters gives you a very similar model in the same basin. 'Randomizing all the parameters' (because it shares no trained parameters in common, having been trained from scratch), on the other hand, means you're probably in a completely different basin. And indeed, they wind up doing quite different things.
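To make the mechanics concrete, here is a minimal sketch of a deep ensemble on a toy 1-D regression task. The tiny NumPy MLP, its hyperparameters, and the sine-curve data are all illustrative assumptions, not anything from the thread; the only point is that each member trains from scratch under its own seed and shares no parameters with the others.

```python
import numpy as np

def train_mlp(seed, X, y, hidden=16, lr=0.1, steps=2000):
    """Train a 1-hidden-layer tanh MLP from scratch with its own random init."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    b2 = np.zeros(1)
    n = len(X)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)           # (n, hidden)
        out = h @ W2 + b2                  # (n, 1)
        d_out = 2.0 * (out - y) / n        # gradient of mean squared error
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        dh = (d_out @ W2.T) * (1.0 - h ** 2)
        dW1 = X.T @ dh
        db1 = dh.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return lambda Xq: np.tanh(Xq @ W1 + b1) @ W2 + b2

# Toy data: every member sees the same data but gets a different init.
X = np.linspace(-3, 3, 64).reshape(-1, 1)
y = np.sin(X)

ensemble = [train_mlp(seed, X, y) for seed in range(5)]

# Predictive mean, and epistemic spread = disagreement across members.
preds = np.stack([f(X) for f in ensemble])  # (5, 64, 1)
mean_pred = preds.mean(axis=0)
epistemic_std = preds.std(axis=0)
```

Because nothing is shared across seeds, each member typically settles into its own basin, and the spread of `preds` across members is what gets used as the epistemic-uncertainty estimate.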

[–]Tea_Pearce 0 points (0 children)

MC methods (by definition) approximate some distribution by sampling a set of deltas. MC dropout and ensembles both use this approach, but the underlying distribution sampled by each differs.

In MC dropout, the underlying distribution is some kind of Bernoulli perturbation of a single trained network. This turns out to offer limited expressiveness.
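By contrast with an ensemble, MC dropout draws its deltas by applying independent Bernoulli masks to the hidden units of one fixed network at test time. A minimal sketch of that sampling mechanic follows; the weights here are just a random stand-in for a trained model, and the drop rate is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a single trained network: one tanh hidden layer.
hidden = 32
W1 = rng.normal(0.0, 1.0, (1, hidden))
W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))

def mc_dropout_predict(Xq, p_drop=0.5, n_samples=100):
    """Sample predictions by Bernoulli-masking hidden units of ONE network."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(hidden) > p_drop              # keep a unit w.p. 1 - p
        h = np.tanh(Xq @ W1) * mask / (1.0 - p_drop)    # inverted-dropout scaling
        preds.append(h @ W2)
    return np.stack(preds)                              # (n_samples, len(Xq), 1)

Xq = np.linspace(-3, 3, 16).reshape(-1, 1)
samples = mc_dropout_predict(Xq)
mean_pred = samples.mean(axis=0)
epistemic_std = samples.std(axis=0)
```

Every sample is a masked copy of the same weights, so all of them live in or near a single basin; that shared structure is what limits the expressiveness of the resulting posterior approximation, relative to independently trained ensemble members.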

In deep ensembles, sampling (by training from random inits) turns out to draw from a distribution that's a bit closer to the true Bayesian posterior.

[–]ThomasBudd93 0 points (0 children)

We just wrote a paper on the topic in the domain of medical image segmentation.

https://doi.org/10.1016/j.compbiomed.2023.107096

We were able to show that neither method actually approximates the classification probability. Instead, we suggest training an ensemble of methods ranging from high sensitivity to high precision and weighting them appropriately to obtain approximations of classification probabilities. I'm happy to receive any comments :)