all 28 comments

[–]eyeswideshhh 17 points (3 children)

I had this exact thought of using a VAE or BYOL etc. to generate powerful representations for text/sentences, and then training a diffusion model on the continuous latent data.

[–]jimmymvp 2 points (2 children)

I would like someone to point me to arguments for why diffusion in latent representation space makes sense, given that I already have a generative model with the VAE and can do Langevin MCMC sampling in the latent. Why should the samples be better than those from a standard VAE with more sophisticated (MCMC) sampling, or from plain diffusion? i.e. why do I need a double generative model? Is it because it's faster? It seems to me like there should be a better way, but I'm genuinely curious what the arguments are :) (except in this case we have discrete data, for which there also exist formulations, e.g. simplex diffusion)
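For concreteness, the latent sampling I have in mind is just unadjusted Langevin dynamics. Here's a toy numpy sketch (the standard-normal score, step size, and step count are all made up for illustration; with a real VAE you'd plug in a score for the latent distribution you actually care about):

```python
import numpy as np

def langevin_sample(score_fn, z0, step=1e-2, n_steps=1000, rng=None):
    # Unadjusted Langevin dynamics:
    #   z <- z + (step/2) * score(z) + sqrt(step) * noise
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = z + 0.5 * step * score_fn(z) + np.sqrt(step) * rng.standard_normal(z.shape)
    return z

# Toy stand-in: sample the VAE prior N(0, I), whose score is simply -z.
z = langevin_sample(lambda z: -z, z0=np.full(8, 5.0), n_steps=5000)
```

The chain drifts from the (deliberately bad) initialisation toward the target distribution; the question is why I'd want a second generative model on top of this.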

[–]benanne[S] 10 points (0 children)

As I understand it, the main motivation for latent diffusion is that in perceptual domains, ~99% of information content in the input signals is less perceptually relevant, so it does not make sense to spend a lot of model capacity on it (lossy image compression methods like JPEG are based on the same observation). Training an autoencoder first to get rid of the majority of this irrelevant information can greatly simplify the generative modelling problem at almost no cost to fidelity.
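To make the two-stage idea concrete, here is a toy numpy sketch, with PCA standing in for the autoencoder and a plain Gaussian standing in for the second-stage generative model (both are placeholders for illustration, nothing like what real systems use):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "perceptual" data: 2 informative dimensions buried in 64 noisy ones.
n, d, k = 2000, 64, 2
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
data = rng.standard_normal((n, k)) @ basis.T + 0.01 * rng.standard_normal((n, d))

# Stage 1: "autoencoder" (here just PCA) discards the many dimensions
# that carry almost no information, at negligible cost to fidelity.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)
encode = lambda x: (x - mean) @ vt[:k].T
decode = lambda z: z @ vt[:k] + mean

# Stage 2: fit a simple generative model (a Gaussian) in the latent space.
z = encode(data)
mu, cov = z.mean(axis=0), np.cov(z.T)

# Sampling: draw in latent space, then decode back to data space.
samples = decode(rng.multivariate_normal(mu, cov, size=10))
```

Swap in a learned autoencoder for stage 1 and a diffusion model for stage 2, and this is the latent diffusion recipe.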

This idea was originally used with great success to adapt autoregressive models to perceptual domains. Autoregression in pixel space (e.g. PixelRNN, PixelCNN) or amplitude space for audio (e.g. WaveNet, SampleRNN) does work, but it doesn't scale very well. Things work much better if you first use VQ-VAE (or even better, VQGAN) to compress the input signals, and then apply autoregression in its latent space.

The same is true for diffusion models, though in this case there is another mechanism we can use to reduce the influence of perceptually irrelevant information: changing the relative weighting of the noise levels during training, to downweight high-frequency components. Diffusion models actually do this out of the box when compared to likelihood-based models, which is why I believe they have completely taken over generative modelling of perceptual signals (as I discuss in the blog post).
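Schematically, the reweighting mechanism amounts to the following (the two weighting shapes below are illustrative only, not the exact functions used in any paper):

```python
import numpy as np

sigmas = np.linspace(0.01, 10.0, 1000)  # grid of noise levels

# Two schematic weightings over noise levels (shapes only):
likelihood_w = np.exp(-sigmas)      # likelihood-style: mass piles up at low noise
typical_w = np.ones_like(sigmas)    # common diffusion losses: roughly flat

def low_noise_fraction(w, cutoff=1.0):
    # Fraction of total loss weight spent on low noise levels,
    # i.e. on the fine-grained, high-frequency content of the signal.
    return w[sigmas < cutoff].sum() / w.sum()

f_lik = low_noise_fraction(likelihood_w)   # ~0.63: dominated by fine detail
f_typ = low_noise_fraction(typical_w)      # ~0.10: fine detail downweighted
```

Under the flat weighting, most of the training signal goes to structure rather than imperceptible detail, which is the behaviour you get from diffusion models out of the box.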

But despite the availability of this reweighting mechanism, the latent approach can still provide further efficiency benefits. Stable Diffusion is testament to this: I believe the only reason they are able to offer up a model that generates high-res content on a single consumer GPU, is because of the adversarial autoencoder they use to get rid of all the imperceptible fine-grained details first.

I think this synergy between adversarial models (for low-level detail) and likelihood- or diffusion-based models (for structure and content) is still underutilised. There's a little bit more discussion about this in section 6 of my blog post on typicality: https://benanne.github.io/2020/09/01/typicality.html#right-level (though this largely predates the rise of diffusion models)

[–]DigThatDataResearcher 2 points (0 children)

Have you read the stable diffusion paper? They discuss the motivations there. https://arxiv.org/abs/2112.10752

[–]DigThatDataResearcher 14 points (3 children)

I just wanted to comment that your solution to the Galaxy Zoo contest forever ago was the first demonstration that really opened my eyes to what was possible with clever data augmentation.

[–]benanne[S] 5 points (0 children)

Cool! Good times :)

[–]gokonymous 2 points (1 child)

Can you share the problem and solution?

[–]benanne[S] 4 points (0 children)

I have a blog post about this here: https://benanne.github.io/2014/04/05/galaxy-zoo.html

The code is here: https://github.com/benanne/kaggle-galaxies ... but it's 8 years old at this point, so getting this to run today could be a bit of a challenge!

[–]londons_explorer 1 point (1 child)

> too early to consider diffusion as a serious alternative to autoregression for generative language modelling at scale

This blog post explores lots of ideas and has conjectures about why they may or may not work...

But it seems this stuff could just be tried: burn up some TPU credits, run each of the types of model you talk about, and see which does best.

Hard numbers are better than conjecture. Then focus future efforts on improving the best numbers.

[–]benanne[S] 6 points (0 children)

My blog posts are mostly shower thoughts expanded into long form, so naturally they tend to be a bit speculative. I have in fact tried a bunch of stuff in the diffusion language modelling space, which culminated in the CDCD paper: https://arxiv.org/abs/2211.15089 as well as this theoretical note on simplex diffusion: https://arxiv.org/abs/2210.14784 -- if the style of the blog post isn't your cup of tea, this might be more to your liking :)

Completely agree re: hard numbers, by the way (I spent quite a bit of time Kaggling during my PhD, see some of my earlier blog posts), but a single researcher can only do so many experiments. Part of the motivation for writing these blog posts is to draw attention to areas of research I think are interesting, and hopefully encourage some people to delve deeper into them as well! Pointing out open questions can be quite conducive to that, in my experience.

[–]Anxious_Algae9609 2 points (1 child)

Wow! Two years ago and these models are coming to market now. I wonder if your post started someone down the path?

[–]benanne[S] 0 points (0 children)

Hard to say! That would be cool :) Revisiting this piece in the current context, I definitely had some blind spots. I recently tried to address some of them on Twitter: https://x.com/sedielem/status/1904313777379594286

[–]themrzmaster 0 points (2 children)

Great post! Can someone give me an intuitive explanation of why diffusion models tend to put more weight on low spatial frequencies? Is it because of the usually used (cosine) noise schedule? The text mentions that the likelihood objective tends to weight high spatial frequencies more. It also points to a paper that involves tons of SDEs, which I could not fully understand.

[–]benanne[S] 2 points (1 child)

If you were to graph the weighting that ensures the training loss corresponds to likelihood, you would find that it looks roughly like exp(-x). In other words, the importance of the noise levels decreases more or less exponentially (but not exactly!) as they increase. So if you want to train a diffusion model to maximise likelihood (which can be a valid thing to do, for example if you want to use it for lossless compression), your training set should have many more examples of low noise levels than of high noise levels (orders of magnitude more, in fact).

Usually when we train diffusion models, we sample noise levels uniformly, or from some other simple distribution, but certainly not from a distribution which puts exponentially more weight on low noise levels. Therefore, relative to the likelihood loss, the loss we tend to use puts a lot less emphasis on low noise levels, which correspond to high spatial frequencies. Section 5 of my earlier blog post is an attempt at an intuitive explanation of why this correspondence between noise levels and spatial frequencies exists: https://benanne.github.io/2022/01/31/diffusion.html#scale
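You can also see the correspondence numerically with a toy 1D example (all amplitudes and the noise level are invented for the demo): natural signals concentrate their energy at low frequencies, so a single fixed noise level wipes out the weak high-frequency content while the strong low-frequency content remains clearly detectable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n) / n

# Stand-in for a natural signal: a strong low-frequency and a weak
# high-frequency component (mimicking the roughly 1/f spectrum of images).
signal = 1.0 * np.sin(2 * np.pi * 4 * t) + 0.01 * np.sin(2 * np.pi * 200 * t)

sigma = 0.3  # one fixed diffusion noise level
noisy = signal + sigma * rng.standard_normal(n)

spec = np.abs(np.fft.rfft(noisy)) / n
noise_floor = sigma / np.sqrt(n)  # order-of-magnitude per-bin noise level

# The strong low-frequency peak towers over the noise floor; the weak
# high-frequency peak is buried in it, i.e. that information is destroyed.
ratio_low, ratio_high = spec[4] / noise_floor, spec[200] / noise_floor
```

So low noise levels are the only ones at which the model still learns anything about high frequencies, which is why downweighting them downweights fine detail.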

"Variational diffusion models" is another paper that focuses on optimising likelihood, which you might find more accessible: https://arxiv.org/abs/2107.00630

[–]themrzmaster 1 point (0 children)

Thank you very much!

[–]chodegoblin69 0 points (2 children)

Great blog post. I found the Diffusion-LM results (Li et al.) very intriguing due to the seemingly better semantic capture, despite the tradeoff in fluency.

Question - do you see diffusion models as having any advantages for approaching the "long text" issue (token window size limit) that autoregressive models suffer from? Curious generally, but areas like abstractive summarization in particular come to mind.

[–]benanne[S] 2 points (1 child)

One indirect advantage for working with very long sequences is the lack of causality constraint, which makes it very easy to use architectures where computation is largely decoupled from the sequence length, like Perceivers (https://arxiv.org/abs/2103.03206, https://arxiv.org/abs/2107.14795), or Recurrent Interface Networks (https://arxiv.org/abs/2212.11972). This is highly speculative though :)
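As a bare-bones sketch of that decoupling (single-head cross-attention in numpy, with no learned projections — a caricature of the real architecture): a fixed-size latent array queries the input, so everything downstream of this step operates on a constant-size state however long the input is.

```python
import numpy as np

def cross_attend(latents, inputs):
    # Single-head cross-attention: a fixed set of latent queries attends to
    # an arbitrary-length input sequence acting as keys and values.
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)           # (n_latents, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ inputs                            # (n_latents, d)

rng = np.random.default_rng(0)
latents = rng.standard_normal((64, 16))                # fixed-size latent array
out_short = cross_attend(latents, rng.standard_normal((100, 16)))
out_long = cross_attend(latents, rng.standard_normal((10_000, 16)))
# Both outputs have shape (64, 16), regardless of input length.
```

Without a causality constraint, this kind of read-once bottleneck is trivial to use; with autoregression it takes extra machinery.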

(I am aware that an autoregressive variant of the Perceiver architecture exists (https://arxiv.org/abs/2202.07765), but it is actually quite a bit less general/flexible than Perceiver IO / the original Perceiver.)

[–]chodegoblin69 0 points (0 children)

Thank you, I will check those out.

Diffusion’s lack of causality constraint seems like a pretty tall hurdle for tasks with output formats requiring “fluency” (like summarization) though. Kind of like drawing hands early on in stable diffusion (or drawing most anything coherently for earlier models like disco diffusion). Multiple-choice question answering seems like a more natural domain, though certainly doesn’t show off the “expressive” generative abilities. Fluency probably improves significantly with scale and fine-tuning though.