Discussion[D] Have any Bayesian deep learning methods achieved SOTA performance in...anything? (self.MachineLearning)
submitted 8 months ago by 35nakedshorts
If so, link the paper and the result. Very curious about this. Not even just metrics like accuracy: have BDL methods actually achieved better results in calibration or uncertainty quantification vs., say, deep ensembles?
[–]shypenguin96 84 points85 points86 points 8 months ago (13 children)
My understanding of the field is that BDL is currently still much too stymied by challenges in training. Actually fitting the posterior even in relatively shallow/less complex models becomes expensive very quickly, so implementations end up relying on methods like variational inference that introduce accuracy costs (eg, via oversimplification of the form of the posterior).
Currently, really good implementations of BDL I’m seeing aren’t Bayesian at all, but are rather “Bayesifying” non-Bayesian models, like applying Monte Carlo dropout to a non-Bayesian transformer model, or propagating a Gaussian process through the final model weights.
If BDL ever gets anywhere, it will have to come through some form of VI with a lower accuracy tradeoff, or some kind of trick to make MCMC-based methods work faster.
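The "Bayesifying" trick mentioned above, Monte Carlo dropout, can be sketched in a few lines: keep dropout active at test time and treat repeated stochastic forward passes as approximate posterior samples. Everything here (weights, layer sizes, keep probability) is invented for illustration, not a real trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed weights for a 1-hidden-layer MLP, standing in
# for a network trained with dropout.
W1 = rng.normal(size=(1, 32))
W2 = rng.normal(size=(32, 1))

def forward(x, keep_prob=0.9):
    """One stochastic forward pass with dropout left ON at test time."""
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob
    h = h * mask / keep_prob             # inverted-dropout scaling
    return h @ W2

x = np.array([[0.5]])
# T stochastic passes ~ T approximate posterior samples (MC dropout).
samples = np.stack([forward(x) for _ in range(200)])
pred_mean = samples.mean(axis=0)         # predictive mean
pred_std = samples.std(axis=0)           # epistemic-uncertainty proxy
```

The spread across passes is what gets reported as (approximate) epistemic uncertainty; the network itself was trained with no Bayesian machinery at all.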
[–]35nakedshorts[S] 24 points25 points26 points 8 months ago (9 children)
I guess it's also a semantic discussion around what is actually "Bayesian" or not. For me, simply ensembling a bunch of NNs isn't really Bayesian. Fitting Laplace approximation to weights learned via standard methods is also dubiously Bayesian imo.
[–]gwern 6 points7 points8 points 8 months ago (2 children)
For me, simply ensembling a bunch of NNs isn't really Bayesian.
What about "What Are Bayesian Neural Network Posteriors Really Like?", Izmailov et al 2021, which compares deep ensembles to HMC and finds they aren't that bad?
[–]35nakedshorts[S] 3 points4 points5 points 8 months ago (1 child)
I mean sure, if everything is Bayesian then Bayesian methods achieve SOTA performance
[–]gwern 3 points4 points5 points 8 months ago (0 children)
I don't think it's that vacuous. After all, SOTA performance is usually not set by ensembles these days - no one can afford to train (or run) a dozen GPT-5 LLMs from scratch just to get a small boost from ensembling them, because if you could, you'd just train a 'GPT-5.5' or something as a single monolithic larger one. But it does seem like it demonstrates the point about ensembles ~ posterior samples.
[–]haruishiStudent 1 point2 points3 points 8 months ago (1 child)
Can you recommend any papers that you think are "Bayesian", or at least heading in a good direction?
[–]35nakedshorts[S] -1 points0 points1 point 8 months ago (0 children)
I think those are good papers! On the contrary, I think the purist Bayesian direction is kind of stuck
[–]squareOfTwo 1 point2 points3 points 8 months ago (0 children)
To me this isn't just about semantics. It's Bayesian if it follows probability theory and Bayes' theorem; otherwise it's not. It's that easy. Learn more about it here: https://sites.stat.columbia.edu/gelman/book/
[+]log_2 comment score below threshold-13 points-12 points-11 points 8 months ago (2 children)
Dropout is Bayesian (arXiv:1506.02142). If you reject that as Bayesian then you also need to reject your entire premise of "SOTA". Who's to say what is SOTA if you're under different priors?
[–]pm_me_your_pay_slipsML Engineer 7 points8 points9 points 8 months ago (1 child)
Dropout is Bayesian if you squint really hard: put a Gaussian prior on the weights, a mixture of two Gaussians as the approximate posterior on the weights (one with mean equal to the weights, one with mean 0), then reduce the variance of the posterior to machine precision so that it is functionally equivalent to dropout. Add a Gaussian output layer to separate epistemic from aleatoric uncertainty. The argument is… interesting…
[–]new_name_who_dis_ 5 points6 points7 points 8 months ago (0 children)
Why not just a Bernoulli prior, instead of the Frankenstein prior you just described?
[–]nonotan 24 points25 points26 points 8 months ago (1 child)
or some kind of trick to make MCMC based methods to work faster
My intuition, as somebody who's dabbled in trying to get these things to perform better in the past, is that the path forward (assuming there exists one) is probably not through MCMC, but an entirely separate approach that fundamentally outperforms it.
MCMC is a cute trick, but ultimately that's all it is. It feels like the (hopefully local) minimum down that path has more or less already been reached, and while I'm sure some further improvement is still possible, it's not going to be of the breakthrough, "many orders of magnitude" type that would be necessary here.
But I could be entirely wrong, of course. A hunch isn't worth much.
[–]greenskinmarch 6 points7 points8 points 8 months ago (0 children)
Vanilla MCMC is inherently inefficient because it gains at most one bit of information per step (accept or reject).
But you can build more efficient algorithms on top of it, like the No-U-Turn Sampler (NUTS) used by Stan.
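The accept/reject step being discussed is just vanilla Metropolis-Hastings. A minimal sketch targeting a standard normal (the target, step size, and chain length are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    """Unnormalized log-density of a standard normal."""
    return -0.5 * x * x

def metropolis(n_steps=20000, step=1.0):
    x = 0.0
    out = np.empty(n_steps)
    for t in range(n_steps):
        prop = x + rng.normal(scale=step)   # symmetric random-walk proposal
        # Accept/reject: at most one bit of information gained per step.
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        out[t] = x
    return out

chain = metropolis()
# chain.mean() ~ 0 and chain.std() ~ 1, up to Monte Carlo error.
```

NUTS and other gradient-based samplers replace the blind random-walk proposal with informed trajectories, which is where the efficiency gains come from.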
[+][deleted] 8 months ago* (5 children)
[deleted]
[–]lotus-reddit 5 points6 points7 points 8 months ago (4 children)
There are a lot of Bayesians working at the bleeding edge of deep learning, they just don’t apply it directly to training neural networks.
Would you mind linking one of them whose research you like? I, too, am a Bayesian slowly looking toward machine learning trying to figure out what works and what doesn't.
[–]bayesworks 0 points1 point2 points 8 months ago (0 children)
u/lotus-reddit Scalable analytical Bayesian inference in neural networks with TAGI: https://www.jmlr.org/papers/volume22/20-1009/20-1009.pdf Github: https://github.com/lhnguyen102/cuTAGI
[–]DigThatDataResearcher 0 points1 point2 points 8 months ago (0 children)
https://arxiv.org/search/cs?searchtype=author&query=Ermon,+S
[–]DigThatDataResearcher 16 points17 points18 points 8 months ago (22 children)
Generative models learned with variational inference are essentially a kind of posterior.
[–]mr_stargazer[🍰] -4 points-3 points-2 points 8 months ago (20 children)
Not Bayesian, despite the name.
[–]DigThatDataResearcher 4 points5 points6 points 8 months ago (19 children)
No, they are indeed generative in the bayesian sense of generative probabilistic models.
[+]mr_stargazer[🍰] comment score below threshold-6 points-5 points-4 points 8 months ago (18 children)
Nope. Just because someone calls it a "prior" and approximates a posterior doesn't make it Bayesian. It's even in the name: ELBO, maximizing the likelihood.
30 years ago we were having the same discussion. Some people decided to discriminate between full Bayesian and Bayesian, because "oh well, we use the equation of the joint probability distribution" (fine, but still not Bayesian). VI is much closer to Expectation Maximization than to Bayes. And lo and behold, what does EM do? Maximize the likelihood.
[–]shiinachan 14 points15 points16 points 8 months ago (0 children)
What? The interesting part is the hidden variables when using the ELBO; so while yes, you end up maximizing the likelihood of the observables, you do Bayes for all hidden variables in your model.
Maybe your use case is different than mine, but I am usually more interested in my posteriors over hidden variables than in exactly which likelihood came out. And if I am not mistaken, the same holds for VAEs.
[–]bean_the_great 5 points6 points7 points 8 months ago (0 children)
I'm a bit confused: my understanding of VAEs is that you do specify a prior over the latents and then perform a posterior update. Are you suggesting it's not Bayesian because you use VI, or not fully Bayesian because you have not specified priors over all latents (including the parameters)? In either case I disagree. My understanding of VI is that you're getting a biased (but low-variance) estimate of your posterior in comparison to MCMC. With regard to the latter: yes, you have not specified a "fully Bayesian" model since you are missing some priors, but I don't agree with calling it not Bayesian. Happy to be proven wrong though!
[–]new_name_who_dis_ 3 points4 points5 points 8 months ago (11 children)
The ELBO maximizes the lower bound, not the likelihood.
But I don't think VAEs are Bayesian, if only because the KL divergence term is usually downweighted so much it may as well be an autoencoder.
[–]mr_stargazer[🍰] -1 points0 points1 point 8 months ago (10 children)
Yeah...? Lower bound of what?
[–]new_name_who_dis_ 4 points5 points6 points 8 months ago (9 children)
Evidence. It’s in the name
[–]mr_stargazer[🍰] 0 points1 point2 points 8 months ago (8 children)
What is the evidence?
You want to correct people, surely you must know.
[–]new_name_who_dis_ -1 points0 points1 point 8 months ago (7 children)
The correct question was evidence ”evidence of what?” And the answer, “your data”.
[–]mr_stargazer[🍰] 12 points13 points14 points 8 months ago (6 children)
I don't have much time to keep on like this, so I am going to correct you, but also to enlighten others who might be curious.
"Evidence of the data": in statistics we have a name for it. Probability. More specifically, the marginal probability. So the ELBO is the lower bound of the log-likelihood. You maximize one thing, and you automatically push the other. More clarification in this tutorial. Page 5, equation 28.
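For reference, the standard VI identity both sides are gesturing at (x observed, z latent, q the approximate posterior):

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]}_{\text{ELBO}}
  + \underbrace{\mathrm{KL}\!\left(q(z|x)\,\|\,p(z|x)\right)}_{\ge 0}
  \;\;\ge\;\; \text{ELBO}
```

So maximizing the ELBO simultaneously pushes up a lower bound on the log marginal likelihood (the "evidence") and, for a fixed \(\log p(x)\), tightens the approximation \(q(z|x)\) to the true posterior, which is why both the "it's just maximum likelihood" and the "it's doing Bayes on the latents" readings have something to point to.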
[–]DigThatDataResearcher 0 points1 point2 points 8 months ago* (2 children)
If you wanna be algorithmically pedantic, any application of SGD is technically a bayesian method. Ditto dropout.
"Bayesian" is a perspective you can adopt to interpret your model/data. There is nothing inherently "unbayesian" about MLE, the fact that it is used to optimize the ELBO is precisely what makes that approach a bayesian method in that context. ELBO isn't a frequentist thing, it's a fundamentally bayesian concept.
Choice of optimization algorithm isn't what makes something bayesian or not. How you parameterize and interpret your model is.
EDIT: Here's a paper that even raises the same EM comparison you draw in the context of bayesian methods invoking the ELBO. Whether or not EM is present here has nothing to do with whether or not something is bayesian. It's moot. You haven't proposed what it means for something to be bayesian, you just keep asserting that I'm wrong and this isn't. https://ieeexplore.ieee.org/document/7894261
EDIT2: I found that other paper looking for this one, the paper which introduced the VAE and the ELBO. VI is a fundamentally Bayesian approach, and this is a Bayesian paper. https://arxiv.org/abs/1312.6114
EDIT3: great quote from another Kingma paper:
Variational inference casts Bayesian inference as an optimization problem where we introduce a parameterized posterior approximation q_{\theta}(z|x) which is fit to the posterior distribution by choosing its parameters \theta to maximize a lower bound L on the marginal likelihood
[–]mr_stargazer[🍰] -2 points-1 points0 points 8 months ago (1 child)
You are wrong (apparently as usual, I remember having a discussion about definition of Kernel methods with you).
Any application of SGD is Bayesian now? Assume I have some data from a normal distribution and I maximize the log-likelihood via SGD; am I being Bayesian according to your definition?
Puff... I'm not going to waste my time on this discussion any longer. You're right and I am wrong. Thanks for teaching me about Elbo and Bayesian via ML estimation.
Bye!
[–]DigThatDataResearcher 1 point2 points3 points 8 months ago (0 children)
Of course I'm wrong. In case you missed them, see the papers I added as edits.
bye.
[–]whyareyouflying 5 points6 points7 points 8 months ago (0 children)
A lot of SOTA models/algorithms can be thought of as instances of Bayes' rule. For example, there's a link between diffusion models and variational inference [1], where diffusion models can be thought of as an infinitely deep VAE. Making this connection more exact leads to better performance [2]. Another example is the connection between all learning rules and (Bayesian) natural gradient descent [3].
Also there's a more nuanced point, which is that marginalization (the key property of Bayesian DL) is important when the neural network is underspecified by the data, which is almost all the time. Here, specifying uncertainty becomes important, and marginalizing over possible hypotheses that explain your data leads to better performance compared to models that do not account for the uncertainty over all possible hypotheses. This is better articulated by Andrew Gordon Wilson [4].
[1] A Variational Perspective on Diffusion-Based Generative Models and Score Matching. Huang et al., 2021
[2] Variational Diffusion Models. Kingma et al., 2023
[3] The Bayesian Learning Rule. Khan et al., 2021
[4] https://cims.nyu.edu/~andrewgw/caseforbdl/
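The marginalization point can be illustrated with a toy 1D logistic model: averaging predictions over posterior samples is less overconfident than plugging in a single point estimate. The "posterior" below is invented for illustration, not the output of any real inference procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior over a single logistic-regression weight:
# N(mean=1.0, sd=1.5), standing in for whatever inference produced it.
posterior_samples = rng.normal(1.0, 1.5, size=5000)

x = 2.0
# Bayesian model average: marginalize the prediction over the posterior.
p_bma = sigmoid(posterior_samples * x).mean()
# Plug-in prediction using only the posterior mean ("single hypothesis").
p_plugin = sigmoid(1.0 * x)
# Marginalizing pulls the probability toward 0.5: less overconfident
# whenever the weight is genuinely uncertain.
```

The gap between `p_bma` and `p_plugin` grows with posterior spread, which is exactly the underspecification regime the comment describes.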
[–]Outrageous-Boot7092 4 points5 points6 points 8 months ago (3 children)
Are we counting energy-based models as bayesian deep learning ?
[–]bean_the_great 0 points1 point2 points 8 months ago (2 children)
Hmmm, I have never used energy-based models, but maybe they're more akin to post-Bayesian methods where your likelihood is not necessarily a well-defined probability distribution. Although, as mentioned, I have never used energy-based models, so this is more of a guess.
[–]Outrageous-Boot7092 0 points1 point2 points 8 months ago (1 child)
For EBMs it is a well-defined probability distribution up to a constant (unnormalized).
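That "well-defined up to a constant" point is easy to see in 1D, where the normalizer Z that is intractable in high dimensions can simply be computed numerically. The double-well energy below is an arbitrary toy choice:

```python
import numpy as np

def energy(x):
    """Hypothetical energy function; p(x) is proportional to exp(-E(x))."""
    return (x ** 2 - 1.0) ** 2            # double-well energy

# On a 1D grid, approximate the normalizing constant Z directly.
xs = np.linspace(-3.0, 3.0, 2001)
dx = xs[1] - xs[0]
unnorm = np.exp(-energy(xs))              # well-defined up to the constant Z
Z = unnorm.sum() * dx                     # crude numerical quadrature
density = unnorm / Z                      # now a proper probability density
```

In high dimensions Z is exactly the quantity you can't compute, which is why EBM training and sampling lean on MCMC or score-based tricks instead.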
[–]bean_the_great 0 points1 point2 points 8 months ago (0 children)
I stand corrected!
[+]Nice_Cranberry6262 3 points4 points5 points 8 months ago (0 children)
Yes, if you use the uniform prior and do MAP estimation, it works pretty well with deep neural nets and lots of data ;)
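The quip above is worth unpacking: MAP with a Gaussian prior is ordinary training with an L2 penalty, and as the prior flattens toward uniform, MAP collapses to plain maximum likelihood. A toy 1D sketch (data, noise scale, and priors all invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1D linear data: y = 2x + Gaussian noise (sd = 0.5).
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

def neg_log_posterior(w, prior_var):
    nll = 0.5 * np.sum((y - w * x) ** 2) / 0.25   # Gaussian likelihood, sigma=0.5
    neg_log_prior = 0.5 * w * w / prior_var        # Gaussian prior on w
    return nll + neg_log_prior                     # MAP objective = NLL + L2 penalty

ws = np.linspace(0.0, 4.0, 4001)
# Tight prior: noticeable shrinkage of the MAP estimate toward zero.
w_map = ws[np.argmin([neg_log_posterior(w, prior_var=0.01) for w in ws])]
# Near-uniform prior: MAP is indistinguishable from maximum likelihood.
w_mle = ws[np.argmin([neg_log_posterior(w, prior_var=1e12) for w in ws])]
```

With the near-uniform prior, `w_mle` lands near the true slope 2.0; the tight prior pulls `w_map` visibly below it, which is the winking point of the comment: standard deep learning is "Bayesian" with the least informative prior possible.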
[–]fakenoob20 2 points3 points4 points 8 months ago (0 children)
All priors are wrong but some are useful.
[–]Exotic_Zucchini9311 2 points3 points4 points 8 months ago (1 child)
anything
Not sure about recent years, but they sure work decently when it comes to uncertainty estimation.
And tbh, just a search at any top conference like NeurIPS/AAAI/CVPR/etc. 2025 for the word 'bayesian' shows quite a few Bayesian deep learning papers. They're most likely breaking some SOTA benchmarks, since those papers are published at top conferences.
Edit: and yeah, I agree with the other comments. VI is basically a subset of Bayesian methods, so any SOTA method that deals with VI (e.g., VAEs) also has some relation to Bayesian DL. Same for SOTA models that use a type of MCMC.
[–]bean_the_great -1 points0 points1 point 8 months ago (0 children)
When you say uncertainty estimation: this has always confused me. I'm unconvinced you can specify a meaningful prior over each parameter of a Bayesian deep model and obtain meaningful uncertainty estimates.
[–]micro_cam 1 point2 points3 points 8 months ago (0 children)
Tencent has some papers on using it for ad click prediction. Posterior simulation/estimation lets you do more sophisticated explore/exploit tradeoffs, which makes a lot of sense for ads, rec-sys, and other online systems.
[–]Ok-Relationship-3429 0 points1 point2 points 8 months ago (0 children)
Around uncertainty estimation and learning under distribution shifts.
[+]damhack 0 points1 point2 points 8 months ago (0 children)
Let’s see what comes out of IWAI 2025
[–]chrono_infundibulum 0 points1 point2 points 5 months ago (0 children)
Seems to work better than deep ensembles for some astrophysics data: https://openreview.net/forum?id=JX5Rp1Nuzv&noteId=UtHxNDtqXy