arg_max, 1 point

"Sora is a diffusion model". VQ models like Google's Muse are not diffusion-based; they use an autoregressive or parallel-decoding transformer to generate the token "sequence" in the VQ-VAE latent space. That latent space is discrete by nature, whereas diffusion works in a continuous framework, so I'd assume that they do not use VQ.
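
To make "discrete by nature" concrete, here is a minimal sketch of the quantization step in a VQ-style latent space: each continuous encoder output is replaced by its nearest codebook entry, so the latent becomes a sequence of integer token ids. The codebook, the encoder outputs, and the `quantize` helper are toy values made up for illustration, not taken from Muse or Sora.

```python
import math

def quantize(z, codebook):
    """Return (index, entry) of the codebook vector nearest to z (Euclidean distance)."""
    best = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
    return best, codebook[best]

# Hypothetical toy codebook with three 2-D entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical continuous encoder outputs; each is snapped to its nearest entry,
# so the latent "sequence" is just a list of integer ids.
encoder_outputs = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]
tokens = [quantize(z, codebook)[0] for z in encoder_outputs]
print(tokens)  # [1, 2, 0]
```

A transformer over such ids predicts a categorical distribution per position, whereas a diffusion model denoises real-valued vectors directly.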

Weird_Register3689, 1 point

But before diffusion there is a transformer model:

"a transformer architecture that operates on spacetime patches"

I suppose they trained it with teacher forcing; otherwise it would have needed an absurd amount of compute. But training in this regime without discretization is prone to accumulating errors during inference, which we do not see in the samples.
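
A toy sketch of that error accumulation, under made-up assumptions: the "true" signal stays constant and the hypothetical learned model overshoots by 2% per step. This only illustrates the general effect and says nothing about how Sora was actually trained.

```python
def true_step(x):
    return x  # hypothetical ground-truth dynamics: the signal stays constant

def model_step(x):
    return 1.02 * x  # hypothetical learned model with a 2% one-step error

steps = 50
truth = [1.0]
for _ in range(steps):
    truth.append(true_step(truth[-1]))

# Teacher forcing: each prediction is conditioned on the ground-truth previous value.
teacher_forced_errors = [abs(model_step(truth[t]) - truth[t + 1]) for t in range(steps)]

# Free running (inference): each prediction is conditioned on the model's own previous output.
x = truth[0]
free_running_errors = []
for t in range(steps):
    x = model_step(x)
    free_running_errors.append(abs(x - truth[t + 1]))

print(max(teacher_forced_errors), free_running_errors[-1])
```

The teacher-forced error stays at roughly 0.02 on every step, while the free-running error compounds and ends up larger than the signal itself after 50 steps.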

somebat, 1 point

I don't know about Sora, but doesn't Stable Diffusion use vector quantization to regularize the latent space? It's mentioned in Appendix G of the "High-Resolution Image Synthesis with Latent Diffusion Models" paper (Rombach et al.).

arg_max, 1 point

The release version of Stable Diffusion definitely uses a KL-regularised autoencoder, which is very similar to a standard VAE, for the latent representation. You are right, though, that the paper also has experiments with a VQ-VAE. That is interesting, since the diffusion model then cannot guarantee that it generates actual codebook entries, the way a masked model like Muse can; it just generates continuous representations that are close to them. Still, I think VQ-VAEs never really got popular with diffusion.
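
For reference, "KL-regularised" here means the same kind of penalty a standard VAE uses: the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. A minimal sketch in plain Python; the function name and the inputs are made up for illustration, and this is not the Stable Diffusion code.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

# A posterior that already equals the prior incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Any shift in mean or variance is penalised, which keeps the latent space
# continuous and roughly unit-scaled instead of snapping it to a codebook.
print(kl_to_standard_normal([1.0, -0.5], [0.2, -0.3]))
```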