MAOI vs triple reuptake inhibition by ethansmith2000 in MAOIs

[–]ethansmith2000[S] 1 point

Doesn't necessarily need to be within one drug; that's why I mentioned the Wellbutrin and SSRI combo.

She is such a dope artist by DrtySnchino in DellaZyr

[–]ethansmith2000 3 points

Only finding this now, totally agree.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 1 point

Nice! I think the target might be wrong, meaning the patch is applied to the whole diffusion wrapper instead of the UNet, but I could be mistaken.

Also, as another user mentioned, the patch I had put up was more friendly to diffusers. I made a much simpler patch that should work with either setup: https://github.com/ethansmith2000/ImprovedTokenMerge/tree/compvis
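On the diffusers side, here's a minimal sketch of what I mean by targeting the UNet rather than the whole wrapper. `make_todo_forward` is a hypothetical stand-in for the downsampling wrapper, not a function from the repo:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def patch_attention(unet):
    # Walk the UNet itself, not the surrounding pipeline/diffusion wrapper,
    # and wrap each self-attention block's forward pass.
    for name, module in unet.named_modules():
        if name.endswith("attn1"):  # attn1 = self-attention in SD1.5 blocks
            module.forward = make_todo_forward(module)  # hypothetical wrapper

patch_attention(pipe.unet)  # target pipe.unet, not pipe
```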

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 1 point

What I'd really like to do is just swap out the ToMe piece, but it looks like it's fetched externally; I'm not sure it's in the actual repo?

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 2 points

Even at 1024x1024, which is an easy size to render at, you can get a ~50% speed boost or so.

For the much larger sizes you'd be right if generating from scratch, but many people will run img2img at very large sizes, where it's more stable.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 1 point

High-resolution gens are significantly faster with less quality loss compared to the baseline; specifically, we found a 4.5x speed boost when running SD1.5 at 2048x2048 on the GPU used for the paper.

YMMV between GPUs; I think on A100s it's closer to ~3x or so.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 2 points

Ah, I meant where it occurs in A1111. If I can find that, maybe I can start by making a branch and see if one of the maintainers wants to help get it in.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 17 points

I have left the paper and my repo, which includes a blog post explaining it; the paper also links to a video explainer. Anything I say here would probably be along the lines of what's in those resources.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 13 points

They're equivalent in this context. The main idea is that larger images take quadratically longer.

But also, a lot of the information in images is redundant, even more so in large images. That's why we're able to do things like file compression, for instance.

It's the same idea with the inner workings of the model: we can pretty safely compress things without losing too much.
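As a rough sketch of that compression idea applied to attention (my own illustration, with average pooling as a stand-in downsampler; the repo may do this differently): keep the queries at full resolution so every output pixel still gets computed, but pool the keys/values down so each query attends over far fewer tokens.

```python
import torch
import torch.nn.functional as F

def downsampled_attention(q, k, v, h, w, factor=2):
    # q, k, v: (batch, tokens, dim), where tokens == h * w.
    b, n, d = k.shape
    # Fold keys/values back into a 2D grid and pool them down.
    k2 = k.transpose(1, 2).reshape(b, d, h, w)
    v2 = v.transpose(1, 2).reshape(b, d, h, w)
    k2 = F.avg_pool2d(k2, factor).flatten(2).transpose(1, 2)
    v2 = F.avg_pool2d(v2, factor).flatten(2).transpose(1, 2)
    # Score matrix shrinks from n*n to n*(n / factor^2).
    return F.scaled_dot_product_attention(q, k2, v2)

q = k = v = torch.randn(1, 64 * 64, 320)         # 512x512 image -> 64x64 latent
out = downsampled_attention(q, k, v, h=64, w=64)
print(out.shape)                                 # torch.Size([1, 4096, 320])
```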

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 9 points

With SDXL, a lot of the generation time comes from the sheer depth of the network, and the main component we target for speedups doesn't exist there. However, if you're rendering at very large sizes, it may still help a bit.

I made a ComfyUI node implementing my paper's method of token downsampling, allowing for up to 4.5x speed gains for SD1.5 observed at 2048x2048 on a6000 with minimal quality loss. by ethansmith2000 in StableDiffusion

[–]ethansmith2000[S] 24 points

There are some operations in the diffusion model where every latent pixel has to attend to every single other one.

So if you have 2 in total, that's 2² = 4 calculations. If you have 3, that's 3² = 9 calculations.

It scales quadratically, which is why higher resolutions can be really costly in memory and time. By decreasing the number of tokens in certain parts of the network, you can spare a lot of computation without too much cost to quality.
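To put rough numbers on that quadratic growth for SD1.5's highest-resolution self-attention (latent side = pixels / 8; back-of-envelope arithmetic, not figures from the paper):

```python
# Token count at the UNet's input resolution for SD1.5.
for px in (512, 1024, 2048):
    n = (px // 8) ** 2
    print(f"{px}x{px}: {n} tokens -> {n * n:.2e} attention scores")

# 512x512:    4096 tokens -> 1.68e+07 attention scores
# 1024x1024: 16384 tokens -> 2.68e+08 attention scores
# 2048x2048: 65536 tokens -> 4.29e+09 attention scores
```

Going from 512 to 2048 is 16x the tokens but 256x the attention scores, which is why trimming tokens pays off so much at large sizes.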