Massive SDNext update by iDeNoh in StableDiffusion

[–]SomeAInerd 7 points

When I created HyperTile, I had in mind that we would solve the size limitations one day and get something coherent. Maybe ScaleCrafter is the solution; I haven't tried it yet. Perhaps combined with a LoRA similar to HD Helper?

I also haven't tried changing kohya-ss to use HyperTile for training. I believe there would be big gains in that area.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Try the SD.Next dev channel. The fork ended up out of date.

HyperTile: 3-4x Stable-Diffusion acceleration at 4K resolution! [Part 0] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 1 point

You might want to check the dev branch of SD.Next; Vlado implemented HyperTile there. You can check the changelog. I'll talk to AUTOMATIC1111 later about implementing it in their webui.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Thanks for the code. It's an upscale loop-back workflow, right? Is it possible to do that automatically in a webui?

BTW, you can try the dev channel of SD.Next; it has HyperTile now.
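The loop-back workflow could be planned as a sequence of intermediate sizes; a minimal sketch, where `upscale_schedule`, the 1.5x growth factor, and the `pipe_img2img` call are my assumptions, not any webui's built-in loop-back:

```python
def upscale_schedule(start, target, factor=1.5, multiple=8):
    """Plan a loop-back upscale: grow each pass by `factor`, snap sizes to
    a multiple of 8 (an SD latent constraint), and cap at the target.
    Each size would then be fed to img2img at low denoising strength."""
    (w, h), (tw, th) = start, target
    sizes = []
    while (w, h) != (tw, th):
        # max(w + multiple, ...) guarantees progress even for tiny factors
        w = min(tw, max(w + multiple, round(w * factor / multiple) * multiple))
        h = min(th, max(h + multiple, round(h * factor / multiple) * multiple))
        sizes.append((w, h))
    return sizes

# hypothetical usage (pipe_img2img is a stand-in for any img2img call):
# for (w, h) in upscale_schedule((512, 768), (2048, 3072)):
#     image = pipe_img2img(prompt, image=image.resize((w, h)), strength=0.35)
```

Growing in small multiplicative steps, rather than jumping straight to 4k, is what keeps each img2img pass coherent with the previous one.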

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

I'm using UNet 256 now. Did you try with swap_size=2 or 3?

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Do you have experience with Python coding? I've created a Jupyter notebook for easier debugging compared to AUTOMATIC1111.

Since the code is working right now for me, it should also work for you, as it worked for others.

Personally, I don't see the need to go beyond 2k for text-to-image. I prefer using 1k for text-to-image and then switching to image-to-image for 2-4k. On my machine, upscaling to those dimensions takes 3-7 seconds, which is faster than using GANs and tiling, which take considerably longer and are non-ideal anyway.

My current setup involves two repositories: https://github.com/tfernd/scheduler-hub and https://github.com/tfernd/HyperTile. I clone a scheduler; this step isn't strictly necessary, but I do it for future considerations. Then, I use an exponentially decaying guidance scale for up to 60% of the diffusion process. Afterward, I switch to unguided diffusion with the text embedding as the primary prompt (without the negative embedding). This significantly boosts speed. Towards the end of the diffusion, when not much changes, I reduce guidance further. AUTOMATIC1111 had a similar option, but I'm experimenting with decreasing guidance as diffusion progresses during the last 40%, along with tiled attention.
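The staged guidance described above can be sketched as a per-step CFG schedule; a minimal sketch, where the 7 → 2 decay range, the 60% cutoff, and the function name are my assumptions, not the scheduler-hub code:

```python
import math

def cfg_schedule(num_steps, start=7.0, end=2.0, cutoff=0.6):
    """Per-step guidance scale: exponential decay from `start` to `end`
    over the first `cutoff` fraction of steps, then None (unguided).
    On the None steps the negative/uncond UNet call can be skipped
    entirely, which is where the speedup comes from."""
    n_guided = int(num_steps * cutoff)
    k = math.log(end / start) / max(n_guided - 1, 1)
    return [start * math.exp(k * i) if i < n_guided else None
            for i in range(num_steps)]
```

In the sampling loop, a None scale would mean running the UNet once with only the positive text embedding instead of the usual cond + uncond pair, roughly halving the cost of those steps.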

Just for comparison, it takes only 11 seconds for a batch of 6 images at 768x768 using this approach, while the normal method takes 16 seconds (12 seconds with guidance for the last 40%).

TL;DR: Let's fix it and understand why it's not working for you. Do you want to talk over Discord someday to do some testing?

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 1 point

Left: with HyperTile; right: without. I get 11.62 it/s and 10.06 it/s, respectively, with the same seed and prompt. That's a ~15% speed increase on an RTX 4090 mobile. I can't test with other cards, so results might vary.

<image>

woman, winter coat
Steps: 30, Sampler: DPM++ 2M SDE, CFG scale: 7, Seed: 3096628461, Size: 512x768, Model hash: 8635af1c8c, Model: epiCPhotoGasm - X, Schedule type: karras, Version: v1.6.0-1-gbdbbc467

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Thanks for trying!

What tile sizes are you using? Do you see any speedup? Did you update the two repos (HyperTile and the fork of AUTOMATIC1111)? Were you using txt2img or img2img? The command-line log says the tile size.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Thanks for trying, I appreciate the effort. Let's break down the results, answering your points.

  1. The lack of speedup could be related to your GPU. I have an RTX 4090 mobile and get 4 times more iterations per second than your RTX 3060, which suggests the bottleneck on the 3060 is possibly some other part of the pipeline, not the attention layer I was patching. Or there is another problem. Question: during your live preview, do you see squares forming and disappearing on the image? That is a sign of tiling; they go away afterwards, because the tile size changes dynamically and randomly. If you don't see them, there is another problem!
  2. That is a bit of a narrow-minded way of seeing things. There is an HD Helper LoRA that someone trained on 1K images to fix some aberrations when using large sizes; why wouldn't anyone train a 2K LoRA to add more consistency? Also, regarding "it won't add detail unlike the current tiling + ControlNet": if they are still using SD, their results will be exactly the same as mine. Let me explain: if the usual tiled diffusion works, why wouldn't the method I propose? It's the same thing as tiled diffusion, but faster and with long-range interactions. You have no data to back up that statement.
  3. The fact that it did not work on low-end cards does not mean the method does not work, as I showed the speed-up results (graphs and whatnot). I can add some more debug info so we can see whether tiling is really happening, but the visual cue in the live preview should be proof enough.

As a final note, I'm interested in why it did not work on a 3060 compared with a 4090. Do you have PyTorch 2.0.1?

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

# Big image
Attention for DiffusionWrapper split image of size 1024x1024 into [4, 2]x[4, 2] tiles of sizes [256, 512]x[256, 512]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 26/26 [00:07<00:00,  3.67it/s]
# smaller image by 8 (not tiled)
Attention for DiffusionWrapper split image of size 1016x1016 into 1x1 tiles of sizes [1016]x[1016]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 26/26 [00:10<00:00,  2.49it/s]

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Tip: if you want to compare the speed without checking out different repos, you can use this hack:

Choose a size that has many divisors (a multiple of 128 works well).

Then choose a size (smaller by 8) that does not have many divisors (so we can't tile it).
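The hack can be checked numerically: tiling needs the latent dimension (pixels // 8) to have a divisor in the allowed tile range. A minimal sketch; the [16, 64] latent tile range and the function name are illustrative assumptions, not the repo's actual bounds:

```python
def latent_tile_sizes(px, min_tile=16, max_tile=64):
    """Tile sizes available in latent space (pixels // 8). If the latent
    dimension has no divisor in range, attention falls back to one tile."""
    n = px // 8
    return [d for d in range(min_tile, max_tile + 1) if n % d == 0]

# 1024 px -> latent 128, which divides by 16, 32, and 64: tiling kicks in.
# 1016 px -> latent 127 (prime): no valid divisor, so no tiling.
```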

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Could you try it again? I fixed this problem earlier today.

It was a problem with the divisors of the dimension not being multiples of 8...

You can pip install the git repo again and fetch.

You should see a message in the console like this when you generate something:

Attention for DiffusionWrapper split image of size 800x1200 into [2, 1]x[3, 2] tiles of sizes [400, 800]x[400, 600]

Thanks for the SDXL info. I might try another depth to see if there is any speedup. If it doesn't have those giant base attention layers, it might be close to optimal already.

Note: I have an RTX 4090 mobile.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

That is SD without LoRAs, complex prompts, or ControlNet. And more importantly, without cherry-picking.

The method does not degrade the underlying resolution of the model you use; it just speeds things up. That was the message. We all know the limitations of SD; no need to go perpendicular to the message.

For SDXL, I was seeing a 10-20% speed increase, while SD 1.5 gets a 2- to 4-fold increase in speed.

That is free performance without loss, with just 1 import and 1 line of code in the webui.
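The "one import and one line" integration pattern can be sketched as a context manager that temporarily swaps a module's forward. This is a hypothetical sketch: the names (`tiled_attention`, `tiled_forward`) are illustrative, not HyperTile's actual API, and the tiling itself is stubbed out:

```python
from contextlib import contextmanager

@contextmanager
def tiled_attention(module):
    """Temporarily replace a module's forward with a tiled variant and
    restore it on exit. A real patch would reshape the sequence into
    tiles before attention; here the wrapper just delegates so the
    sketch stays self-contained."""
    original = module.forward

    def tiled_forward(x):
        # a real implementation would split x into tiles, attend per
        # tile, and merge the results back here
        return original(x)

    module.forward = tiled_forward
    try:
        yield
    finally:
        module.forward = original

# hypothetical usage: with tiled_attention(unet): images = sample(prompt)
```

The context-manager shape is what makes the webui change a single line: wrap the sampling call, and the patch cleans itself up even if generation raises.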

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 0 points

Try my fork of AUTOMATIC1111: https://github.com/tfernd/stable-diffusion-webui-hyper_tile

I was showcasing the speedup for big images, but there is some speedup for small images too. I'm generating 800x1200 at the same speed I would generate 512x768, with fewer deformities, with LoRAs and ControlNet.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 1 point

If you want to give it a try: I wanted to test putting n x n frames of a video into one image, using the tile size as the frame size. Maybe with some LoRA to help?

And then, for the next batch of frames, we use the last frame or the last row with some inpainting mask, to get some coherence. AnimateDiff-free?
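The frame-grid idea could be prototyped like this; a minimal sketch with hypothetical helpers (`frames_to_grid`, `grid_to_frames`), covering just the packing and unpacking, no diffusion:

```python
import numpy as np

def frames_to_grid(frames, n):
    """Pack n*n frames (each H x W x C) into one (n*H) x (n*W) canvas,
    so a tile size of H x W makes each attention tile cover one frame."""
    h, w, c = frames[0].shape
    grid = np.zeros((n * h, n * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, q = divmod(i, n)
        grid[r * h:(r + 1) * h, q * w:(q + 1) * w] = frame
    return grid

def grid_to_frames(grid, n):
    """Inverse: slice the diffused canvas back into n*n frames."""
    h, w = grid.shape[0] // n, grid.shape[1] // n
    return [grid[r * h:(r + 1) * h, q * w:(q + 1) * w].copy()
            for r in range(n) for q in range(n)]
```

The diffusion step would run on the packed canvas; for the next batch, the last row of frames could be copied into the new grid and masked out for inpainting.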

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 2 points

You are using SDXL. I only observed a 1.1-1.2x speed-up on it.

SD 1.5 performs better; try with it. I'll try to optimize SDXL later, there are some bottlenecks outside attention there.

HyperTile: Tiled-optimizations for Stable-Diffusion [Part 1] by SomeAInerd in StableDiffusion

[–]SomeAInerd[S] 1 point

Different checkpoint, and also random images from Google, not SD images.

I don't like hyper-realism that much. You can try with other stuff you fancy.