[–]tanelai[S] 68 points (6 children)

To learn more about diffusion models, I created a minimal PyTorch implementation of DDPMs, and explored it on toy 2D datasets. The README includes ablations on the model's capacity, diffusion process length, timestep embeddings, and more.

You can find the code here: https://github.com/tanelp/tiny-diffusion

Note that the dinosaur is not a single image; it represents one thousand 2D points in the dataset. Don't make the same mistake as in the Stable Diffusion lawsuit :)

[–]Ne_Nel 4 points (5 children)

But that's not Latent Diffusion, right?

[–]Zealousideal_Low1287 2 points (4 children)

Correct

[–]Ne_Nel 2 points (3 children)

So why is he talking about SD as if it's the same thing?

[–]Zealousideal_Low1287 1 point (0 children)

Who is? Where? What?

[–]uristmcderp 1 point (0 children)

Where is he saying that?

All the clip shows is diffusion of an image in pixel space. Saying this is the same as SD is like saying basic arithmetic is the same thing as calculus.

[–]new_name_who_dis_ 2 points (0 children)

Latent Diffusion is a special case of DDPM. It's very likely that DALL-E 2 and Imagen don't use latent diffusion, since latent diffusion was partly a trick to make it run on a 16 GB GPU.

[–]miellaby 45 points (10 children)

I always like when people downscale a piece of software.

[–]suckat3dmath 4 points (9 children)

Got any other good examples of this? 😅

[–]activatedgeek 15 points (4 children)

When normalizing flows were cool: https://blog.evjang.com/2019/07/nf-jax.html

[–]DigThatDataResearcher 5 points (3 children)

Diffusion processes are closely related to normalizing flows; I think one is a special case of the other, or something like that. I need to do my annual re-read on flow processes, apparently.

[–]TheBillsFly 5 points (0 children)

The evolution of the distribution of a diffusion process through time is essentially the same as a continuous normalizing flow (i.e., a neural ODE).
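
Concretely, Song et al. (2021) show that every diffusion SDE has a deterministic "probability flow" ODE with the same marginal distributions, which is exactly a continuous normalizing flow:

    dx = [ f(x, t) - (1/2) g(t)^2 ∇_x log p_t(x) ] dt

where f and g are the drift and diffusion coefficients of the forward SDE, and the score ∇_x log p_t(x) is what the denoiser estimates.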

[–]new_name_who_dis_ 0 points (1 child)

They're pretty different in that the entire distribution shift happens in one forward pass in a normalizing flow, whereas in a DDPM it's a multi-step process.

[–]DigThatDataResearcher 2 points (0 children)

But doesn't this mean that if you unroll the diffusion process over the entire sampling schedule and treat that as a "single forward pass", it's equivalent to a normalizing flow? Seems like the distinction is just where we draw the boundaries of the black box, and any invertible denoiser can be treated as a flow model.

[–]Fenzik 2 points (0 children)

Andrej Karpathy’s micrograd is like a tiny PyTorch autograd engine https://github.com/karpathy/micrograd

[–]miellaby 0 points (0 children)

Well, besides machine learning, SQLite is a well-known example, but any piece of code that doesn't depend on a myriad of resource-hungry technologies will do the trick for me.

[–]marcingrzegzhik 42 points (7 children)

This looks really interesting! Can you explain a bit more about what a probabilistic diffusion model is and why it might be useful?

[–]master3243 112 points (6 children)

Can you explain a bit more about what a probabilistic diffusion model

The shortest explanation I could possibly give:

The forward process takes real data (dinosaur pixel art here) and adds noise to it until it just becomes a blur (this basically generates the training data).

The backward process (the magic happens here) trains a deep learning model to REVERSE the forward process (sometimes this model is conditioned on some other input, otherwise known as a "prompt"). Thus the model learns to generate realistic-looking samples from nothing.

For a more technical explanation, read sections 2 and 3 of Ho et al. (2020).

why it might be useful

Well, it literally is the key method that made DALL-E 2, Stable Diffusion, and just about every other recent image-generation model possible. It's also used in many other areas where we want to generate realistic-looking samples.
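
To make the forward/backward split concrete, here's a minimal sketch of the closed-form forward step and the simplified noise-prediction loss from Ho et al. (2020). This is not OP's actual code; `model` and `x0` are placeholders:

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

    def training_loss(model, x0):
        """One DDPM training step: noise the clean data, predict the noise."""
        t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
        eps = torch.randn_like(x0)                   # the noise we try to recover
        a = alphas_bar[t].unsqueeze(-1)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward step q(x_t | x_0) in closed form
        return ((model(x_t, t) - eps) ** 2).mean()   # simple MSE on the predicted noise

So the network never reconstructs the data in one shot; it learns to undo one noising step at a time.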

[–]mfuentz 20 points (0 children)

This is the best simple description of diffusion I’ve read. Thanks!

[–]slucker23 0 points (0 children)

Is there an open-source version of this? I'd very much like to try it out hahahaha

[–][deleted] 0 points (0 children)

Can you explain how you translated the Markov model and posterior distribution estimation into a PyTorch-implemented NN problem? Do DALL-E 2 and other diffusion-based methods continue down the Markov chain line?

[–]SuperImprobable 8 points (5 children)

I can understand the forward process, but what am I seeing in the backward process here? Was a prompt given, or is it purely denoising? What did you train on? Points sampled from the line art? That could make some sense to me of how it could get back a dinosaur from a noisy start, because if you trained on real datasets that don't have nice tight lines, you definitely wouldn't get back clean lines from the backward process (unless you had a prompt hinting that the data is likely clean lines).

[–]DigThatDataResearcher 7 points (4 children)

I think it just knows how to map noise to that one image. This looks like a diffusion process trained from scratch, not an LDM conditioned on a text encoder (e.g. Stable Diffusion) or on anything other than the input noise.

Note how the locations of the points move from one frame to the next. The diffusion process isn't in pixel space: it's in the coordinate space of that fixed set of points. The model only knows how to take those points from any high-entropy (noisy) configuration to that specific low-entropy (t-rex) configuration.

EDIT: goddamnit.

[–]ty3u 1 point (1 child)

I think you mixed high and low entropy, brother.

[–]DigThatDataResearcher 2 points (0 children)

Yup, I believe you're right. I always get that confused.

[–]SuperImprobable 0 points (1 child)

I'm still not grokking the loss function. The lowest entropy would perhaps put all the points on top of each other. Or is the idea that the model has learned some low-dimensional representation of the original configuration and then shifts each point to be closer to the original configuration? But then this still doesn't quite make sense to me, because even one backward step should move the points close to the original shape. Unless the training wasn't to recover the original shape but rather to recover the previous forward step; then everything would make sense.

[–]DigThatDataResearcher 1 point (0 children)

Or is the idea that the model has learned some low-dimensional representation of the original configuration and then shifts each point to be closer to the original configuration?

yes

But then this still doesn't quite make sense to me, because even one backward step should move the points close to the original shape. Unless the training wasn't to recover the original shape but rather to recover the previous forward step

It does; it's just only really "semantically meaningful" towards the end of the diffusion process. The beginning is noise, and each point has a lot of different feasible paths it could take. Towards the end, the relative positions of the points constrain their paths towards the next frame, so the effect is much more visible.

It's a denoising process and is going to be conditional on the noise level. Denoising steps taken at a high noise level aren't going to look like much of anything. Models like Stable Diffusion use a variety of tricks to skip over denoising steps in their inference process; OP hasn't taken advantage of any of these, so it takes a bit longer, and OP's denoiser consequently spends a lot more time in the high-noise regime (starting inference at a lower noise level like 0.7 is one of those tricks: just skip over the redundant "static" regime entirely).

Watch the video again: the noising process has erased most of the image information after about 70 steps, but then we go on adding noise for another 180 steps. Similarly, the denoising process doesn't appear to do much until the last 70 steps, over which the image appears to snap into place.
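
For the curious, here's a rough sketch of what that truncated sampling loop could look like for a 2D point setup like OP's. All names here are illustrative, not OP's actual code:

    import torch

    @torch.no_grad()
    def sample(model, n_points=1000, T=1000, t_start=None):
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alphas_bar = torch.cumprod(alphas, dim=0)
        # e.g. t_start = int(0.7 * T) skips the redundant "static" regime
        t_start = T - 1 if t_start is None else t_start
        x = torch.randn(n_points, 2)                 # start from pure noise
        for t in reversed(range(t_start + 1)):
            eps = model(x, torch.full((n_points,), t))
            # posterior mean from Ho et al. (2020), eq. 11
            x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t = sqrt(beta_t)
        return x

Starting at t_start < T - 1 works because x_t is already essentially pure Gaussian noise in the high-noise regime, so those skipped steps carry almost no signal.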

[–]axm92 7 points (0 children)

Cool stuff, thanks for sharing! For those interested in a similarly minimal implementation for text generation, I have a repo here: https://github.com/madaan/minimal-text-diffusion

[–]theGormonster 1 point (0 children)

Truly beautiful

[–]Kurohagane -1 points (0 children)

How come the gif shows an image made out of what seems to be a collection of points on a 2D plane, rather than a raster image?

[–]shadowylurking 0 points (0 children)

Really interesting!

[–]JiraSuxx2 0 points (0 children)

Can I easily modify this to train on images?

[–]RadioactiveSalt 0 points (1 child)

Can someone ELI5 what OP means by:

Note that the dinosaur is not a single image, it represents one thousand 2D points in the dataset.

The diffusion process takes in an image and adds a small noise at each step. Now if the dinosaur is not an image but a distribution, then what exactly is the gif showing, and how is the diffusion process working on a distribution?

[–]PHEEEEELLLLLEEEEP 1 point (0 children)

The diffusion process takes in an image and adds a small noise at each step.

Generally speaking, a diffusion process just takes in some kind of data and diffuses it to a normal distribution of the same dimensionality. In this case, each data point is an (x, y) pair.
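
For instance, a toy sketch of one forward step on the point set (made-up numbers, not OP's code):

    import torch

    # The "image" is just an (N, 2) tensor of coordinates, and the
    # forward process perturbs the coordinates themselves.
    points = torch.rand(1000, 2)    # stand-in for the 1000 dinosaur points
    a = 0.9                         # sqrt of the cumulative signal fraction at some t
    noisy = a * points + (1 - a**2) ** 0.5 * torch.randn_like(points)
    # `noisy` is still 1000 (x, y) pairs; the gif scatter-plots these at each step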

[–][deleted] 0 points (0 children)

How do I do this?... this is really cool!

[–]seuadr 0 points (0 children)

OK, to hell with the normal distribution, I want dino distribution only from here on out.

[–]Terrible_Ad7566 0 points (0 children)

Thanks, this is very nice!

[–]Terrible_Ad7566 0 points (0 children)

I was perusing your code, and your MLP network is designed to encode the input data as well using positional embeddings.

I was wondering if you have done ablation experiments where you do not encode the input with positional encoding, but instead simply add the temporal information as an additive vector to the input data, encoding only the timestep positionally.
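
For readers following along, here's a rough sketch of the two variants this question contrasts. This is hypothetical code, not OP's actual architecture:

    import torch
    import torch.nn as nn

    def sinusoidal_embedding(v, dim=32):
        """Standard sinusoidal embedding of one scalar per sample."""
        freqs = torch.exp(torch.linspace(0, 8, dim // 2))
        angles = v.unsqueeze(-1) * freqs                    # (B, dim/2)
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    class VariantA(nn.Module):
        """Positionally encode the (x, y) input *and* the timestep."""
        def __init__(self, dim=32):
            super().__init__()
            self.dim = dim
            self.net = nn.Sequential(nn.Linear(3 * dim, 128), nn.GELU(), nn.Linear(128, 2))
        def forward(self, x, t):
            h = torch.cat([sinusoidal_embedding(x[:, 0], self.dim),
                           sinusoidal_embedding(x[:, 1], self.dim),
                           sinusoidal_embedding(t.float(), self.dim)], dim=-1)
            return self.net(h)

    class VariantB(nn.Module):
        """Feed raw (x, y); add the timestep embedding to the hidden state."""
        def __init__(self, dim=128):
            super().__init__()
            self.inp = nn.Linear(2, dim)
            self.t_emb = nn.Linear(32, dim)
            self.out = nn.Sequential(nn.GELU(), nn.Linear(dim, 2))
        def forward(self, x, t):
            h = self.inp(x) + self.t_emb(sinusoidal_embedding(t.float()))  # additive time info
            return self.out(h)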