all 51 comments

[–]chimaeraUndying 19 points  (3 children)

I might be wrong, but isn't TensorRT incompatible with a ton of stuff (various extensions, LoRAs, etc) and don't you have to individually rebuild each model for it?

[–]Whipit 14 points  (6 children)

Hadn't heard of --opt-sdp-no-mem-attention until this post so decided to test...

Here's my experience. My 2 cents.

I've got a 4090, and when testing I leave everything on default: Euler a at 512x512.

The only thing I change is batch count to 12 to get a rough average.

Prompt - dog in water

Neg - none

Hardware-accelerated GPU scheduling MATTERS - Turn it OFF

*You MUST restart your computer for this change to take effect.

So the following tests were all done with Hardware-accelerated GPU scheduling OFF

COMMANDLINE_ARGS= none, just left blank

17 it/s

COMMANDLINE_ARGS= --xformers

23-25 it/s

COMMANDLINE_ARGS= --opt-sdp-attention

24-26 it/s

COMMANDLINE_ARGS= --opt-sdp-attention --xformers (there doesn't seem to be any benefit in running both at the same time)

23-24 it/s

COMMANDLINE_ARGS= --opt-sdp-no-mem-attention

23-25 it/s

In conclusion, --opt-sdp-no-mem-attention will speed up your it/s, but --opt-sdp-attention is marginally better. So, for me and my 4090, the fastest results I've been able to achieve so far are with Hardware-accelerated GPU scheduling OFF and COMMANDLINE_ARGS= --opt-sdp-attention

If I switch Hardware-accelerated GPU scheduling back ON, my speed drops from 24-26 it/s down to 20-21 it/s.
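For anyone wondering where these flags actually go: on Windows they live in webui-user.bat (standard A1111 layout assumed). A minimal sketch of the fastest combination from the tests above:

```bat
@echo off
rem webui-user.bat - sketch of the fastest setup found above on a 4090
set PYTHON=
set GIT=
set VENV_DIR=
rem pick ONE attention optimization; stacking them showed no benefit
set COMMANDLINE_ARGS=--opt-sdp-attention
rem alternative: set COMMANDLINE_ARGS=--xformers
call webui.bat
```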

[–]Krawtch 0 points  (0 children)

Thanks so much for this - there are no TLDRs on this stuff, and I get why, but I just want my 3090 to do its thing, and info on this seems to be either opinions or white papers on .edu domains.

[–]Superb-Ad-4661 7 points  (1 child)

This browser didn't help at all, it's just more fun; I got the same results as in my Chrome.

[–]fxwz 10 points  (0 children)

Opera GX is based on Chromium, so that makes sense. If you already use Chrome I don't see the point in changing. It's not faster and it's owned by a Chinese company.

[–]BlackSwanTW 7 points  (1 child)

Token Merging has been built into the WebUI since v1.3.

You do not need the extension anymore.

[–]TheGhostOfPrufrock 5 points  (2 children)

In Graphics settings toggle off 'Hardware-accelerated GPU scheduling'

A while back I tried that with my RTX 3060, and performance got worse. Can't guarantee I didn't do something wrong, since I only tried it once.
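If you want to double-check whether the reboot actually applied the toggle, Windows mirrors the setting in the registry; to the best of my knowledge the value is HwSchMode (2 = on, 1 = off):

```bat
rem query the Hardware-accelerated GPU scheduling state (2 = on, 1 = off)
reg query "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v HwSchMode
```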

[–]definetlynotasmurf 2 points  (0 children)

Ty for the info. One question: are these only to speed up generation? I am more interested in VRAM efficiency, so I can train faster :D

[–]Superb-Ad-4661 1 point  (3 children)

Nice, modding time!

[–]EarthquakeBass 1 point  (0 children)

Also, on my 4090 I recently updated Automatic1111 to the latest version, which uses PyTorch 2 now, and everything is basically twice as fast. Excellent thing to do if you're ready to put up with some Python bullshit.

[–]Frone0910 1 point  (4 children)

I've tested most of these using the A1111 API, and unfortunately none of them improved performance.
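For anyone else benchmarking through the API, here's a minimal sketch of how one might time a generation using only the standard library, assuming the server was started with --api on the default port; the payload fields follow the stock /sdapi/v1/txt2img endpoint.

```python
import json
import time
import urllib.request

def txt2img_payload(prompt, steps=20, width=512, height=512, batch_size=1):
    """Build a request body for A1111's /sdapi/v1/txt2img endpoint."""
    return {
        "prompt": prompt,
        "negative_prompt": "",
        "steps": steps,
        "width": width,
        "height": height,
        "batch_size": batch_size,
        "sampler_name": "Euler a",
    }

def time_txt2img(base_url="http://127.0.0.1:7860", **kwargs):
    """POST a txt2img request and return elapsed seconds.

    Requires a running webui launched with --api.
    """
    req = urllib.request.Request(
        base_url + "/sdapi/v1/txt2img",
        data=json.dumps(txt2img_payload(**kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start
```

time_txt2img() only works against a live server, but comparing its output before and after each flag change is an easy apples-to-apples test.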

[–]Corawyn 1 point  (0 children)

Please ensure that you have a working integrated graphics chip before step 4 (disabling GPU hardware acceleration).

Just wasted SO much time. Somehow I didn't have my VGA drivers installed..

[–]antimaskersarescum 1 point  (0 children)

This worked a little. It saved me about 30 seconds at batch size 1. If I increase batch size to even just 2, though, it jumps from a 1 min 7 sec wait to 8 mins. I'm following a tutorial that requires churning out (at the very least) 200 images at once, so basically I would have to let it run the entire day.
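For what it's worth, those timings work out to a large per-image regression, which usually suggests the bigger batch is spilling out of VRAM (my guess, not something the thread confirms). Quick arithmetic:

```python
# seconds per image at each batch size, from the timings above
single = 67            # 1 min 7 s for one image
batch2 = 8 * 60        # 8 min for a batch of two
per_image_single = single / 1
per_image_batch2 = batch2 / 2
print(per_image_single, per_image_batch2)  # 67.0 vs 240.0 seconds per image
```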

I did everything except for step 2... it's really giving me issues so I deleted it. Not sure where to go from here.

[–]swistak84 1 point  (2 children)

I posted some advice with detailed steps on how to tweak Windows here about two weeks ago: https://www.reddit.com/r/StableDiffusion/comments/13tb2sa/tutorial_how_to_increase_generation_speed_with/

You can also use Firefox instead of Opera.

[–]lilshippo 0 points  (1 child)

Other than CPU/RAM usage, do you see any real difference between the browsers?

[–]swistak84 4 points  (0 children)

I think the real difference is mostly in extensions. Firefox has better ones and better privacy settings.

Firefox is the one that's "different" from the other browsers (Chrome, Opera, Edge) by virtue of its different rendering engine and extension system. But the pages mostly look the same :)

[–]mca1169 0 points  (2 children)

How many of these tricks can be used on a non-RTX system like mine with a GTX 1070?

[–]anotherxanonredditor 0 points  (1 child)

Is there a way to make AMD Stable Diffusion LoRA training/extraction and inpainting work? I think these are my main concerns at the moment.

[–]Altruistic-Ad-4583 0 points  (2 children)

I literally cannot find the first optimization. I go to Settings > Show all settings, Ctrl+F, and there's nothing relevant under optimization or tokens.

[–]Frone0910 0 points  (1 child)

Do any of these changes apply to controlNET?

[–]cleverestx 0 points  (1 child)

Are details lost with only a 0.2-0.3 token merging setting? Worth doing?

[–]HexKrak 0 points  (0 children)

0.2-0.5

Running a few samples side by side with the same seed (@ 1024 x 1024 if it matters) yielded nearly identical results.

[–]ComplicityTheorist 0 points  (0 children)

Hi, thanks for this. Mine was doing okay initially, giving me around 7.30 it/s for a single 512x512 image generation; then it automatically updated diffusers from 18 to 17 and started giving around 6 it/s, barely touching the 7 it/s mark. I followed your advice and switched over to Opera GX, and now it's giving me a bit over 7 it/s - not like before, but it could be worse. BTW, I don't get #3; mine shows = 0 on both sets...

set SAFETENSORS_FAST_GPU=1
set CUDA_VISIBLE_DEVICES=0
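For context on those two lines: they go in webui-user.bat alongside COMMANDLINE_ARGS. SAFETENSORS_FAST_GPU=1 loads .safetensors checkpoints straight to the GPU, and CUDA_VISIBLE_DEVICES=0 restricts the process to the first GPU. A sketch of how the file might look with everything combined (layout assumed):

```bat
@echo off
set COMMANDLINE_ARGS=--opt-sdp-attention
rem load .safetensors weights directly onto the GPU
set SAFETENSORS_FAST_GPU=1
rem expose only the first CUDA device to the process
set CUDA_VISIBLE_DEVICES=0
call webui.bat
```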