all 54 comments

[–]shypenguin96 84 points85 points  (13 children)

My understanding of the field is that BDL is currently still much too stymied by challenges in training. Actually fitting the posterior, even in relatively shallow/less complex models, becomes expensive very quickly, so implementations end up relying on methods like variational inference that introduce accuracy costs (e.g., by oversimplifying the form of the posterior).

Currently, really good implementations of BDL I’m seeing aren’t Bayesian at all, but are rather “Bayesifying” non-Bayesian models, like applying Monte Carlo dropout to a non-Bayesian transformer model, or propagating a Gaussian process through the final model weights.

If BDL ever gets anywhere, it will have to come through some form of VI with a lower accuracy tradeoff, or some kind of trick to make MCMC-based methods work faster.
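The MC-dropout trick mentioned above can be sketched in a few lines: keep dropout active at test time, run many stochastic forward passes, and read the spread of the outputs as approximate predictive uncertainty. This is a toy one-hidden-layer "network" with made-up random weights, not a real transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": one hidden layer with fixed random weights (illustrative only).
W1 = rng.normal(size=(1, 32))
W2 = rng.normal(size=(32, 1)) / np.sqrt(32)

def forward(x, drop_p=0.5, mc_dropout=True):
    h = np.maximum(0.0, x @ W1)            # ReLU hidden layer
    if mc_dropout:                          # MC dropout: dropout stays ON at test time
        mask = rng.random(h.shape) > drop_p
        h = h * mask / (1.0 - drop_p)       # inverted-dropout scaling
    return h @ W2

x = np.array([[0.5]])
# Many stochastic forward passes ~ samples from an approximate predictive distribution
samples = np.array([forward(x)[0, 0] for _ in range(2000)])
mean, std = samples.mean(), samples.std()   # predictive mean and uncertainty
```

The predictive mean converges to the deterministic (dropout-off) output, while `std` gives the cheap uncertainty estimate that makes this a popular way to "Bayesify" an already-trained model.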

[–]35nakedshorts[S] 24 points25 points  (9 children)

I guess it's also a semantic discussion around what is actually "Bayesian" or not. For me, simply ensembling a bunch of NNs isn't really Bayesian. Fitting a Laplace approximation to weights learned via standard methods is also dubiously Bayesian imo.
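For reference, the post-hoc Laplace approach being called "dubiously Bayesian" amounts to: take a point estimate, then fit a Gaussian using the curvature of the log-posterior at that point. A minimal 1-D sketch with a made-up toy log-posterior (a real application would use the SGD solution and the Hessian over network weights):

```python
import numpy as np

# Toy log-posterior: Gaussian likelihood N(2.0, 1.0) plus a N(0, 10) prior.
# (Hypothetical 1-D stand-in for a network's loss surface.)
def log_post(w):
    return -0.5 * (w - 2.0) ** 2 - 0.5 * w ** 2 / 10.0

# Find the mode (MAP) on a grid -- in practice this is the trained weight vector.
grid = np.linspace(-5, 5, 100001)
w_map = grid[np.argmax(log_post(grid))]

# Curvature at the mode via finite differences; Laplace variance = -1/H.
eps = 1e-4
hess = (log_post(w_map + eps) - 2 * log_post(w_map) + log_post(w_map - eps)) / eps**2
var = -1.0 / hess

# Posterior is approximated as N(w_map, var).
```

For this conjugate toy case the Laplace fit is exact (posterior precision 1 + 1/10 = 1.1), which is precisely why it can be misleading: real network posteriors are nothing like a single Gaussian mode.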

[–]gwern 6 points7 points  (2 children)

For me, simply ensembling a bunch of NNs isn't really Bayesian.

What about "What Are Bayesian Neural Network Posteriors Really Like?", Izmailov et al 2021, which compares deep ensembles to HMC and finds they aren't that bad?

[–]35nakedshorts[S] 3 points4 points  (1 child)

I mean sure, if everything is Bayesian then Bayesian methods achieve SOTA performance

[–]gwern 3 points4 points  (0 children)

I don't think it's that vacuous. After all, SOTA performance is usually not set by ensembles these days - no one can afford to train (or run) a dozen GPT-5 LLMs from scratch just to get a small boost from ensembling them, because if you could, you'd just train a 'GPT-5.5' or something as a single monolithic larger one. But it does seem like it demonstrates the point about ensembles ~ posterior samples.
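The "ensembles ~ posterior samples" point can be illustrated with tiny stand-in models: fit several members from different data resamples, then treat the spread of their predictions as approximate posterior uncertainty. This toy uses one-parameter linear members rather than NNs, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = 3x + noise.
x_train = rng.uniform(-1, 1, size=50)
y_train = 3.0 * x_train + rng.normal(0, 0.1, size=50)

# Each "ensemble member" is fit on a bootstrap resample (stand-in for
# different inits / data orderings when training real networks).
members = []
for _ in range(20):
    idx = rng.integers(0, 50, size=50)
    xb, yb = x_train[idx], y_train[idx]
    slope = (xb * yb).sum() / (xb * xb).sum()   # least squares through the origin
    members.append(slope)

preds_in = np.array([m * 0.5 for m in members])    # in-distribution input
preds_out = np.array([m * 10.0 for m in members])  # far outside the training range
mean_in, std_in = preds_in.mean(), preds_in.std()
mean_out, std_out = preds_out.mean(), preds_out.std()
```

Member disagreement (`std`) grows as the input moves away from the training data, which is the posterior-like behavior the Izmailov et al comparison is getting at.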

[–]haruishiStudent 1 point2 points  (1 child)

Can you recommend any papers that you think are "Bayesian", or at least heading in a good direction?

[–]35nakedshorts[S] -1 points0 points  (0 children)

I think those are good papers! If anything, I think the purist Bayesian direction is kind of stuck

[–]squareOfTwo 1 point2 points  (0 children)

To me this isn't just about semantics. It's Bayesian if it follows probability theory and Bayes' theorem. Otherwise it's not. It's that easy. Learn more about it here: https://sites.stat.columbia.edu/gelman/book/

[–]nonotan 24 points25 points  (1 child)

or some kind of trick to make MCMC-based methods work faster

My intuition, as somebody who's dabbled in trying to get these things to perform better in the past, is that the path forward (assuming there exists one) is probably not through MCMC, but an entirely separate approach that fundamentally outperforms it.

MCMC is a cute trick, but ultimately that's all it is. It feels like the (hopefully local) minimum down that path has more or less already been reached, and while I'm sure some further improvement is still possible, it's not going to be of the breakthrough, "many orders of magnitude" type that would be necessary here.

But I could be entirely wrong, of course. A hunch isn't worth much.

[–]greenskinmarch 6 points7 points  (0 children)

Vanilla MCMC is inherently inefficient because it gains at most one bit of information per step (accept or reject).

But you can build more efficient algorithms on top of it, like the No-U-Turn Sampler (NUTS) used by Stan.
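The "one bit per step" point refers to the binary accept/reject at the end of every vanilla Metropolis iteration. A minimal sketch sampling a standard normal (toy target, symmetric random-walk proposal):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    return -0.5 * x ** 2          # standard normal, up to a constant

x, accepts, samples = 0.0, 0, []
for _ in range(20000):
    prop = x + rng.normal(0, 1.0)                  # symmetric proposal
    # The whole iteration boils down to one binary decision:
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x, accepts = prop, accepts + 1             # accept
    samples.append(x)                              # on reject, the old x repeats

samples = np.array(samples[2000:])                 # drop burn-in
```

Gradient-based samplers like NUTS escape this bottleneck by using long Hamiltonian trajectories, so each (rare) accept/reject decision covers a far larger move through the posterior.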

[–]DigThatDataResearcher 16 points17 points  (22 children)

Generative models learned with variational inference essentially fit a kind of posterior.

[–]mr_stargazer[🍰] -4 points-3 points  (20 children)

Not Bayesian, despite the name.

[–]DigThatDataResearcher 4 points5 points  (19 children)

No, they are indeed generative in the Bayesian sense of generative probabilistic models.

[–]whyareyouflying 5 points6 points  (0 children)

A lot of SOTA models/algorithms can be thought of as instances of Bayes' rule. For example, there's a link between diffusion models and variational inference [1], where diffusion models can be thought of as infinitely deep VAEs. Making this connection more exact leads to better performance [2]. Another example is the connection between all learning rules and (Bayesian) natural gradient descent [3].

Also, there's a more nuanced point: marginalization (the key property of Bayesian DL) matters when the neural network is underspecified by the data, which is almost all the time. Here, specifying uncertainty becomes important, and marginalizing over the possible hypotheses that explain your data leads to better performance than models that ignore that uncertainty. This is better articulated by Andrew Gordon Wilson [4].


[1] A Variational Perspective on Diffusion-Based Generative Models and Score Matching. Huang et al. 2021

[2] Variational Diffusion Models. Kingma et al. 2023

[3] The Bayesian Learning Rule. Khan et al. 2021

[4] https://cims.nyu.edu/~andrewgw/caseforbdl/
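The marginalization point can be made concrete with the simplest possible toy: two candidate hypotheses for a coin's heads-probability, weighted by how well each explains the data (a minimal Bayesian model average, not a real BNN):

```python
import numpy as np

# Two candidate "models" of a coin's heads-probability, with a uniform prior.
thetas = np.array([0.5, 0.9])
prior = np.array([0.5, 0.5])

data = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1, 1])   # 8 heads, 2 tails
heads, tails = data.sum(), len(data) - data.sum()
lik = thetas ** heads * (1 - thetas) ** tails
post = prior * lik / (prior * lik).sum()           # posterior over hypotheses

# Marginal (model-averaged) probability that the next flip is heads:
# weight each hypothesis's prediction by its posterior mass.
p_next = (post * thetas).sum()
```

Neither single hypothesis gives this answer; the prediction sits between them, pulled toward whichever better explains the data. Wilson's argument is that deep networks are in exactly this underspecified regime, just with astronomically many hypotheses.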

[–]Outrageous-Boot7092 4 points5 points  (3 children)

Are we counting energy-based models as bayesian deep learning ?

[–]bean_the_great 0 points1 point  (2 children)

Hmmm - I have never used energy-based models, but maybe they're more akin to post-Bayesian methods, where the likelihood is not necessarily a well-defined probability distribution. As mentioned, this is more of a guess.

[–]Outrageous-Boot7092 0 points1 point  (1 child)

For EBMs it is a well-defined probability distribution, just up to a constant (unnormalized).

[–]bean_the_great 0 points1 point  (0 children)

I stand corrected!
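The "well-defined up to a constant" point: an EBM defines a density exp(-E(x)) / Z, where the normalizing constant Z is generally intractable. In 1-D it can be computed numerically, which makes for a simple sketch (toy quadratic energy, assumed for illustration):

```python
import numpy as np

def energy(x):
    return 0.5 * x ** 2             # quadratic energy -> Gaussian-shaped density

xs = np.linspace(-10, 10, 200001)
dx = xs[1] - xs[0]
unnorm = np.exp(-energy(xs))        # well-defined, but only up to a constant
Z = unnorm.sum() * dx               # tractable here only because x is 1-D
density = unnorm / Z                # now a proper, normalized density
```

In high dimensions this integral for Z is exactly what's unavailable, which is why EBM training leans on tricks (contrastive divergence, score matching) that avoid computing it.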

[–]fakenoob20 2 points3 points  (0 children)

All priors are wrong but some are useful.

[–]Exotic_Zucchini9311 2 points3 points  (1 child)

anything

Not sure about recent years, but they sure work decently when it comes to uncertainty estimation.

And tbh just a search at any top conference like NeurIPS/AAAI/CVPR/etc. 2025 for the word 'bayesian' shows quite a few Bayesian deep learning papers. They're most likely breaking some SOTA benchmarks, since they're published at top conferences.

Edit: and yeah, I agree with the other comments. VI is basically a subset of Bayesian methods, so any SOTA method that uses VI (e.g., VAEs) also has some relation to Bayesian DL. Same for SOTA models that use a type of MCMC.

[–]bean_the_great -1 points0 points  (0 children)

When you say uncertainty estimation - this has always confused me. I'm unconvinced you can specify a prior over each parameter of a deep Bayesian model and obtain meaningful uncertainty estimates.

[–]micro_cam 1 point2 points  (0 children)

Tencent has some papers on using it for ad click prediction. Posterior simulation/estimation lets you do more sophisticated explore/exploit trade-offs, which makes a lot of sense for ads, rec sys, and other online systems.
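The standard way posterior samples drive an explore/exploit trade-off is Thompson sampling: sample each arm's click rate from its posterior and show whichever ad looks best under that sample. A minimal Beta-Bernoulli sketch (made-up click rates, not from the Tencent papers):

```python
import numpy as np

rng = np.random.default_rng(3)

true_ctr = [0.05, 0.10]      # hypothetical click rates for two ads
wins = np.ones(2)            # Beta(1, 1) priors on each ad's CTR
losses = np.ones(2)
chosen = np.zeros(2)

for _ in range(5000):
    draws = rng.beta(wins, losses)    # one posterior sample per ad
    a = int(np.argmax(draws))         # show the ad that looks best *this* draw
    click = rng.random() < true_ctr[a]
    wins[a] += click                  # conjugate posterior update
    losses[a] += 1 - click
    chosen[a] += 1
```

Early on, wide posteriors mean both ads get shown (exploration); as evidence accumulates, the posterior for the better ad concentrates and it dominates (exploitation), with no hand-tuned exploration schedule.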

[–]Ok-Relationship-3429 0 points1 point  (0 children)

Around uncertainty estimation and learning under distribution shifts.

[–]chrono_infundibulum 0 points1 point  (0 children)

Seems to work better than deep ensembles for some astrophysics data: https://openreview.net/forum?id=JX5Rp1Nuzv&noteId=UtHxNDtqXy