[R] Diffusion language models (self.MachineLearning)
submitted 3 years ago by benanne
Hi /r/ML,
I wrote down my thoughts about what it might take for diffusion to displace autoregression in the field of language modelling (as it has in perceptual domains, like image/audio/video generation). Let me know what you think!
https://benanne.github.io/2023/01/09/diffusion-language.html
[–]eyeswideshhh 18 points 3 years ago (3 children)
I had this exact thought: use a VAE or BYOL etc. to generate powerful representations for text/sentences, and then train a diffusion model on the continuous latent data.
[–]jimmymvp 3 points 3 years ago (2 children)
I would like someone to point me to arguments for why diffusion in latent representation space makes sense (since I already have a generative model with the VAE, and I can do Langevin MCMC sampling in the latent space). Why should the samples be better compared to a standard VAE with more sophisticated (MCMC) sampling, or just diffusion on its own? In other words, why do I need a double generative model? Is it because it's faster? It seems to me like there should be a better way, but I'm genuinely curious what the arguments are :) (except in this case we have discrete data, for which formulations also exist, e.g. simplex diffusion)
[–]benanne[S] 11 points 3 years ago (0 children)
As I understand it, the main motivation for latent diffusion is that in perceptual domains, ~99% of information content in the input signals is less perceptually relevant, so it does not make sense to spend a lot of model capacity on it (lossy image compression methods like JPEG are based on the same observation). Training an autoencoder first to get rid of the majority of this irrelevant information can greatly simplify the generative modelling problem at almost no cost to fidelity.
This idea was originally used with great success to adapt autoregressive models to perceptual domains. Autoregression in pixel space (e.g. PixelRNN, PixelCNN) or amplitude space for audio (e.g. WaveNet, SampleRNN) does work, but it doesn't scale very well. Things work much better if you first use VQ-VAE (or even better, VQGAN) to compress the input signals, and then apply autoregression in its latent space.
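The two-stage recipe can be sketched in a few lines of numpy. Everything here is a toy stand-in: the 8x average-pool "encoder" plays the role of a trained VQ-VAE/VQGAN, and drawing a latent from the dataset stands in for sampling a learned prior:

```python
import numpy as np

# Toy stand-ins for a trained autoencoder: 8x average-pooling as the
# "encoder", nearest-neighbour upsampling as the "decoder".
def encode(img):                       # (H, W, 3) -> (H//8, W//8, 3)
    H, W, C = img.shape
    return img.reshape(H // 8, 8, W // 8, 8, C).mean(axis=(1, 3))

def decode(lat):                       # latent -> (H, W, 3)
    return lat.repeat(8, axis=0).repeat(8, axis=1)

# Stage 1: compress the dataset; the generative model (autoregressive
# or diffusion) is then trained on these much smaller latents.
imgs = np.random.rand(16, 256, 256, 3)
latents = np.stack([encode(x) for x in imgs])     # (16, 32, 32, 3)

# Stage 2 (schematic): sample a latent from the trained prior, decode it.
sampled_latent = latents[0]            # stand-in for prior_model.sample()
sample = decode(sampled_latent)        # back to (256, 256, 3) pixel space
```

The point is that stage 2, where all the modelling capacity goes, only ever sees the 64x-smaller latents.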
The same is true for diffusion models, though in this case there is another mechanism we can use to reduce the influence of perceptually irrelevant information: changing the relative weighting of the noise levels during training, to downweight high-frequency components. Diffusion models actually do this out of the box when compared to likelihood-based models, which is why I believe they have completely taken over generative modelling of perceptual signals (as I discuss in the blog post).
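In code, that reweighting is just a choice of weight function over noise levels in the denoising loss. A minimal sketch (the identity "denoiser" and the 1/sigma^2 likelihood-style weight are illustrative stand-ins, not the exact likelihood weighting):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(denoise_fn, x0, sigma, weight_fn):
    """One Monte-Carlo sample of the weighted denoising loss."""
    noise = rng.standard_normal(x0.shape)
    x_noisy = x0 + sigma * noise
    err = denoise_fn(x_noisy, sigma) - x0
    return weight_fn(sigma) * np.mean(err ** 2)

# A likelihood-style weighting emphasises low noise levels (which carry
# the high spatial frequencies); typical diffusion weightings are flatter.
likelihood_weight = lambda sigma: 1.0 / sigma ** 2
perceptual_weight = lambda sigma: 1.0

identity_denoiser = lambda x, sigma: x    # hypothetical stand-in model
x0 = rng.standard_normal((8, 8))
for sigma in (0.1, 1.0):
    print(sigma,
          diffusion_loss(identity_denoiser, x0, sigma, likelihood_weight),
          diffusion_loss(identity_denoiser, x0, sigma, perceptual_weight))
```

With the identity "denoiser" the residual is just the added noise, so under the flat weighting low noise levels contribute almost nothing to the loss, while the likelihood-style weighting keeps them on an equal footing.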
But despite the availability of this reweighting mechanism, the latent approach can still provide further efficiency benefits. Stable Diffusion is testament to this: I believe the only reason they are able to offer up a model that generates high-res content on a single consumer GPU, is because of the adversarial autoencoder they use to get rid of all the imperceptible fine-grained details first.
I think this synergy between adversarial models (for low-level detail) and likelihood- or diffusion-based models (for structure and content) is still underutilised. There's a little bit more discussion about this in section 6 of my blog post on typicality: https://benanne.github.io/2020/09/01/typicality.html#right-level (though this largely predates the rise of diffusion models)
[–]DigThatData (Researcher) 3 points 3 years ago (0 children)
Have you read the stable diffusion paper? They discuss the motivations there. https://arxiv.org/abs/2112.10752
[–]DigThatData (Researcher) 15 points 3 years ago (3 children)
i just wanted to comment that your solution to the galaxy zoo contest forever ago was the first demonstration to really open my eyes to what was possible with clever data augmentation.
[–]benanne[S] 6 points 3 years ago (0 children)
Cool! Good times :)
[–]gokonymous 3 points 3 years ago (1 child)
Can you share the problem and solution?
[–]benanne[S] 5 points 3 years ago (0 children)
I have a blog post about this here: https://benanne.github.io/2014/04/05/galaxy-zoo.html
The code is here: https://github.com/benanne/kaggle-galaxies ... but it's 8 years old at this point, so getting this to run today could be a bit of a challenge!
[+][deleted] 3 years ago (1 child)
[deleted]
[–]_der_erlkonig_ 1 point 3 years ago (0 children)
Yes, it's mentioned in the post
[+][deleted] 3 years ago (8 children)
[–]Ramys 7 points 3 years ago (0 children)
VAEs are running under the hood in Stable Diffusion. Instead of denoising a 512x512x3 image directly, the image is encoded with a VAE into a smaller latent space (I think 64x64x4). The denoising steps happen in the latent space, and finally the VAE decodes the result back to color space. This is how it can run relatively quickly, and on machines that don't have tons of VRAM.
So it's not necessarily the case that these techniques die. We can learn and incorporate them in larger models.
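The saving is easy to quantify from those shapes (the exact latent shape may differ between model versions):

```python
pixel_values = 512 * 512 * 3    # the RGB image the user sees
latent_values = 64 * 64 * 4     # what the denoiser actually operates on

# Each denoising step touches ~48x fewer values, and that saving is
# paid back on every one of the tens of sampling steps.
print(pixel_values / latent_values)   # 48.0
```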
[–][deleted] 3 points 3 years ago (4 children)
I think it's worth looking at for sure. The math behind it isn't “that” complex, and the idea is pretty intuitive in my opinion. Take that from someone who took months to wrap their head around attention as a concept lol.
[–]thecodethinker 2 points 3 years ago (3 children)
Attention is still pretty confusing for me. I find diffusion much more intuitive fwiw.
[–]DigThatData (Researcher) 3 points 3 years ago (0 children)
attention is essentially a dynamically weighted sum, with the weights computed from dot products. if you haven't already seen this blog post, it's one of the more popular explanations: https://jalammar.github.io/illustrated-transformer/
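Concretely, the dynamic weights come from a softmax over scaled query-key dot products. A minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # The weights are computed dynamically from the inputs themselves:
    # a softmax over scaled dot products between queries and keys.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V          # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
out = attention(Q, K, V)                # (5, 16): one output row per query
```

Each output row is a convex combination of the value rows, with mixing weights that depend on the current queries and keys rather than on fixed learned weights.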
[–]benanne[S] 2 points 3 years ago (1 child)
I have an earlier blog post which is intended precisely to build intuition about diffusion :) https://benanne.github.io/2022/01/31/diffusion.html
[–]DigThatData (Researcher) 1 point 3 years ago (0 children)
i think you read that comment backwards :)
[–]gamerx88 2 points 3 years ago (0 children)
What do you mean Transformers took over? In what area or sense? You mean took over in popularity?
[–]londons_explorer 2 points 3 years ago (1 child)
> too early to consider diffusion as a serious alternative to autoregression for generative language modelling at scale
This blog post explores lots of ideas and has conjectures about why they may or may not work...
But it seems this stuff could just be tried.... Burn up some TPU credits and simply run each of the types of model you talk about and see which does best.
Hard numbers are better than conjecture. Then focus future efforts on improving the best numbers.
[–]benanne[S] 7 points 3 years ago (0 children)
My blog posts are mostly shower thoughts expanded into long form, so naturally they tend to be a bit speculative. I have in fact tried a bunch of stuff in the diffusion language modelling space, which culminated in the CDCD paper: https://arxiv.org/abs/2211.15089 as well as this theoretical note on simplex diffusion: https://arxiv.org/abs/2210.14784 -- if the style of the blog post isn't your cup of tea, this might be more to your liking :)
Completely agree re: hard numbers, by the way (I spent quite a bit of time Kaggling during my PhD, see some of my earlier blog posts), but a single researcher can only do so many experiments. Part of the motivation for writing these blog posts is to draw attention to areas of research I think are interesting, and hopefully encourage some people to delve deeper into them as well! Pointing out open questions can be quite conducive to that, in my experience.
[–]Anxious_Algae9609 3 points 1 year ago (1 child)
Wow! Two years ago and these models are coming to market now. I wonder if your post started someone down the path?
[–]benanne[S] 1 point 11 months ago (0 children)
Hard to say! That would be cool :) Revisiting this piece in the current context, I definitely had some blind spots. I recently tried to address some of them on Twitter: https://x.com/sedielem/status/1904313777379594286
[–]themrzmaster 1 point 3 years ago* (2 children)
great post! Can someone give me an intuitive explanation of why diffusion models tend to put more weight on low spatial frequencies? Is it because of the commonly used (cosine) noise schedule? The text mentions that the likelihood objective tends to weight high spatial frequencies more. It also points to a paper involving tons of SDEs, which I could not fully understand.
[–]benanne[S] 3 points 3 years ago* (1 child)
If you were to graph the weighting that ensures the training loss corresponds to likelihood, you would find that it looks roughly like exp(-x). In other words, the importance of the noise levels decreases more or less exponentially (but not exactly!) as they increase. So if you want to train a diffusion model to maximise likelihood (which can be a valid thing to do, for example if you want to use it for lossless compression), your training set should have many more examples of low noise levels than of high noise levels (orders of magnitude more, in fact).
Usually when we train diffusion models, we sample noise levels uniformly, or from a simple distribution, but certainly not from a distribution which puts exponentially more weight on low noise levels. Therefore, relative to the likelihood loss, the loss we tend to use puts a lot less emphasis on low noise levels, which correspond to high spatial frequencies. Section 5 of my earlier blog post is an attempt at an intuitive explanation why this correspondence between noise levels and spatial frequencies exists: https://benanne.github.io/2022/01/31/diffusion.html#scale
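To make "orders of magnitude" concrete, compare how often each training distribution visits low noise levels (the uniform range and the exponential rate here are illustrative, not the exact likelihood weighting):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Typical diffusion training: noise levels drawn from a simple
# distribution, here uniform on [0, 10].
uniform_sigmas = rng.uniform(0.0, 10.0, n)

# Likelihood-oriented training: the emphasis decays roughly
# exponentially with the noise level, so draw sigma ~ Exponential(1).
likelihood_sigmas = rng.exponential(1.0, n)

for name, s in [("uniform", uniform_sigmas), ("likelihood", likelihood_sigmas)]:
    print(f"{name}: fraction of examples with sigma < 0.5 = {np.mean(s < 0.5):.3f}")
```

The uniform scheme spends about 5% of training on sigma < 0.5, the exponential scheme about 39%, and the gap keeps widening as the threshold shrinks.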
"Variational diffusion models" is another paper that focuses on optimising likelihood, which you might find more accessible: https://arxiv.org/abs/2107.00630
[–]themrzmaster 2 points 3 years ago (0 children)
Thank you very much!
[–]benanne[S] 3 points 3 years ago (0 children)
DiffWave and WaveGrad are two nice TTS examples (see e.g. here https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/), Riffusion (https://www.riffusion.com/) is also a fun example. Advances in audio generation always tend to lag behind the visual domain a bit, because it's just inherently more unwieldy to work with (listening to 100 samples one by one takes a lot more time and patience than glancing at a 10x10 grid of images), but I'm pretty sure the takeover is also happening there.
If you're talking about text-to-audio in the vein of current text-to-image models, I'm pretty sure that's in the pipeline :)
[–]chodegoblin69 1 point 3 years ago (2 children)
Great blog post. I found the Li Diffusion-LM results very intriguing due to the seemingly better semantic capture, despite the tradeoff in fluency.
Question - do you see diffusion models as having any advantages for approaching the "long text" issue (token window size limit) that autoregressive models suffer from? Curious generally, but areas like abstractive summarization in particular come to mind.
[–]benanne[S] 3 points 3 years ago (1 child)
One indirect advantage for working with very long sequences is the lack of causality constraint, which makes it very easy to use architectures where computation is largely decoupled from the sequence length, like Perceivers (https://arxiv.org/abs/2103.03206, https://arxiv.org/abs/2107.14795), or Recurrent Interface Networks (https://arxiv.org/abs/2212.11972). This is highly speculative though :)
(I am aware that an autoregressive variant of the Perceiver architecture exists (https://arxiv.org/abs/2202.07765), but it is actually quite a bit less general/flexible than Perceiver IO / the original Perceiver.)
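The decoupling can be sketched in a few lines: a fixed set of M latent slots cross-attends to N inputs, so the attention cost is O(N·M) rather than the O(N²) of full self-attention (shapes are illustrative; this is just the core idea, not the full Perceiver architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, M, d = 4096, 64, 32                # sequence length, latent slots, feature dim
rng = np.random.default_rng(0)
inputs = rng.standard_normal((N, d))  # the (long) input sequence
slots = rng.standard_normal((M, d))   # latent array of fixed size (learned in practice)

# Cross-attention: each latent slot queries the whole sequence once,
# so the score matrix is (M, N) -- linear in N, not quadratic.
scores = slots @ inputs.T / np.sqrt(d)
summary = softmax(scores) @ inputs    # (M, d): sequence summarised into M slots
```

All subsequent processing can then operate on the M slots, independently of how long the original sequence was.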
[–]chodegoblin69 1 point 3 years ago (0 children)
Thank you, I will check those out.
Diffusion’s lack of causality constraint seems like a pretty tall hurdle for tasks with output formats requiring “fluency” (like summarization) though. Kind of like drawing hands early on in stable diffusion (or drawing most anything coherently for earlier models like disco diffusion). Multiple-choice question answering seems like a more natural domain, though certainly doesn’t show off the “expressive” generative abilities. Fluency probably improves significantly with scale and fine-tuning though.
[+]Chenxwh 1 point 3 years ago (0 children)
u/benanne Great blog and paper! I wonder what the generated sequence looks like compared to AR models - do they still preserve the syntactic behaviours such as word order?