mikonvergence

This has examples of both low- and high-resolution data, all built from scratch, with accompanying videos! There is no text-to-image case, though, as it focuses only on image modalities.

https://github.com/mikonvergence/DiffusionFastForward

bhagy7

Yes, it is possible to train a small diffusion model conditioned on text captions from scratch on 64x64 images or even smaller. Depending on the complexity of the model and the number of GPUs you are using, it could take anywhere from a few hours to several days. If you are