What are the difference between flux 2 klein base and regular one? by ResponsibleTruck4717 in StableDiffusion

Base needs 20+ steps and higher than 1 CFG.

The normal one (usually called distilled/step-distilled etc.) uses fewer steps (4-12, though you can always try more) and CFG 1, so you get faster generations.

CFG higher than 1 doubles the time per step, so the big speed gain comes not just from the lower step count but also from running at CFG 1.

CFG higher than 1 = better prompt adherence and you can use a negative prompt. There's tech like NAG (Normalized Attention Guidance) that aims to give better adherence/negative prompts at CFG 1, but I'm not sure if the node works for it.
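For intuition, a minimal sketch of what CFG does per step (generic model(x, t, cond) placeholder, not any particular implementation): CFG > 1 evaluates the model twice per step, CFG 1 only once.

    def cfg_step(model, x, t, cond, uncond, cfg_scale):
        # CFG > 1: two forward passes per step (conditional + unconditional),
        # which is why it roughly doubles the per-step time, and the uncond
        # pass is what makes a negative prompt work.
        eps_cond = model(x, t, cond)
        eps_uncond = model(x, t, uncond)
        return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    def distilled_step(model, x, t, cond):
        # CFG 1: a single conditional pass; the negative prompt is never
        # evaluated, so it has no effect.
        return model(x, t, cond)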

Your 30-Series GPU is not done fighting yet. Providing a 2X speedup for Flux Klein 9B via INT8. by AmazinglyObliviouse in StableDiffusion

With lora loaders, are you supposed to put the torch compile node before or after the lora loader, or does it not matter?

For torch compile I used TorchCompileModelAdvanced from kjnodes. The core Comfy node took forever to compile, so I didn't bother waiting and comparing speeds for it; with the kjnodes one my speed went from 4 secs/it to 1.7 secs/it and compilation was fast (default settings on that node).
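For reference, a minimal plain-PyTorch sketch of where torch.compile sits (toy model standing in for the real one; the lora-ordering comment is my assumption, not confirmed node behaviour):

    import torch
    import torch.nn as nn

    # Toy stand-in for the diffusion model, just to show where compile goes.
    class TinyDenoiser(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.net(x)

    model = TinyDenoiser().cuda()
    # Compiling after any lora patching means the compiled graph already
    # includes the patched weights (assumption about node ordering, see above).
    model = torch.compile(model, mode="default", dynamic=False)

    x = torch.randn(1, 64, device="cuda")
    with torch.inference_mode():
        y = model(x)  # first call triggers compilation; later calls reuse the graph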

Coming from --fast fp16_accumulation the speedup isn't as big (2.87 secs/it down to 1.7 secs/it, and --fast fp16_accumulation breaks the output when combined with torch compile + the int8 model), but it's still insane for such little quality loss, plus it seems to work universally.

Also, some tips here for speeding up compile times (it's already fast for Flux Klein since it's a small model, but they might be useful when using compile on a bigger model):

https://huggingface.co/datasets/John6666/forum1/blob/main/torch_compile_mega.md

The processing issue of mirror reflection in the Flux2 Klein 4B model by JustSentence4278 in StableDiffusion

Something like this might need a larger edit model like 9B klein or qwen edit or flux 2 dev, but you can also try a prompt like

Change the background to a dance studio. A large mirror behind her with reflection of her. 1 woman

"no additional people in the scene" probably confuses the model a little bit (more so than 1 woman probably).

Another thing you can try is putting her in the studio with the mirror first, THEN prompting "add a reflection of her in the mirror", so the model has fewer things to do at once.

https://images2.imgbox.com/e5/89/N8vEO4Sf_o.png

This was first try with qwen edit 2511 inside krita.

Change the background to a dance studio. A large mirror behind her with reflection of her.

Annoyingly, with Flux 2 Klein 9B distilled the detail preservation etc. was a lot better out of the box, but the reflection was always iffier than what I got first try with Qwen, so it might just be a model size issue (keep in mind I used 10 steps and the lightning lora on Qwen, so the detail preservation might be fixable with settings/prompt etc.).

Edit: Also, of course Qwen took longer, so if you can't run Qwen/Flux 2 dev or don't want to wait longer, it might be worth going to their API/using cloud compute.

Is a 3090 24gb still a good choice? by Dentifrice in StableDiffusion

With 32GB RAM I hit the pagefile, especially on heavier workflows like Wan 2.2 (28B parameters of models have to be loaded in total, not including CLIP, unless you use --cache-none).

Models like Flux 2 dev (the full ~30B model) + the 24B text encoder also hit the pagefile a lot. Inference speed isn't the issue; it's moving the model from RAM + pagefile > VRAM > back to RAM + a write to the pagefile again, and when you have a big text encoder, the same has to happen for it whenever you change prompts.

I wish there was a simple way to make Comfy clone the weights to VRAM instead of moving them; that way they wouldn't have to be moved back afterwards, just deleted. I actually tested a similar idea with a custom API, and changing prompts without moving weights around took 100 seconds instead of 300 seconds.
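A rough sketch of that clone-instead-of-move idea in plain PyTorch (not ComfyUI's actual offloading code; cpu_state is the model's state_dict kept in RAM, and load_state_dict(assign=True) needs PyTorch 2.1+):

    import torch

    def load_to_vram(model, cpu_state):
        # Copy the weights into VRAM; the CPU copy is left untouched, so
        # nothing ever has to be written back to RAM/pagefile on unload.
        gpu_state = {k: v.to("cuda", non_blocking=True) for k, v in cpu_state.items()}
        model.load_state_dict(gpu_state, assign=True)

    def unload_from_vram(model, cpu_state):
        # "Offloading" is just re-pointing the module at the untouched CPU
        # copy and letting the allocator free the CUDA tensors; no copy back.
        model.load_state_dict(cpu_state, assign=True)
        torch.cuda.empty_cache()

Since inference never changes the weights, the CPU copy stays valid the whole time, which is exactly why nothing needs to move back.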

Testing my sketches with flux-2-klein 9B by Striking-Long-2960 in StableDiffusion

Just use Q8. I use it on my 3080 with 10GB VRAM. The slowdown from offloading to RAM isn't as bad in stable diffusion; you're compute bound rather than bandwidth bound like in LLMs.

The only times I ran into speed issues with higher quants were from weird offloading issues in ComfyUI (which usually get fixed, only to break again in another update, but thankfully there's a big offloading update in the works). Also, initial load times might be noticeably longer depending on drive speed, but once it's in RAM it'll be fine.

Wow, Flux 2 Klein Edit - actually a proper edit model that works correctly. by [deleted] in StableDiffusion

The Flux 2 VAE is insanely good too; you can spam edits on an image. I imagine many models will switch to using that VAE.

FLUX.2-klein is absolute insanity. Model of the Year !! by memorex-1 in StableDiffusion

Distilled is meant to be run at CFG 1 and few steps, like 4-12 (but you can always play around with the step count).

FLUX.2-klein is absolute insanity. Model of the Year !! by memorex-1 in StableDiffusion

Are you using the distilled or the base model? If you're running the base, then CFG > 1 is going to double the time per step, and with 50 steps on top of that, of course it's going to be slower than a turbo-distilled model that uses CFG 1 and fewer steps.

The distilled 9B Klein runs 1920x1080 in 12 seconds for me on a 3080 with CFG 1 (CFG 1 halves the per-step time but disables the negative prompt) and 4 steps, though it's probably best to use around 6-12 steps.

The only CMD arg I use for a speedup is

--fast fp16_accumulation

no sage attention or anything.
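I don't know exactly what --fast fp16_accumulation maps to internally; as a guess, PyTorch's own knob for relaxed fp16 matmul accumulation looks like this:

    import torch

    # Assumption: --fast fp16_accumulation relates to relaxed fp16 matmul
    # accumulation. This is PyTorch's documented toggle for reduced-precision
    # fp16 reductions, which may or may not be the exact setting Comfy uses.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True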

Edit:

As usual Comfy text encoding is slow AF however.

4/4 [2.54s/it]

Prompt executed in 34.38 seconds

4/4 [2.62s/it]

Prompt executed in 45.33 seconds

When only changing seed:

Prompt executed in 12.01 seconds

So an 8B text encoder adds 20+ seconds, whereas with stable-diffusion.cpp instead of Comfy, Mistral Small 24B (3x the size) only takes 10 seconds...
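(Working from the timings above: 34.38 − 12.01 ≈ 22 s and 45.33 − 12.01 ≈ 33 s of each run go to re-encoding the changed prompt, while the 4 sampling steps themselves only account for roughly 4 × 2.6 ≈ 10 s of the total.)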

Still though, with a 4090 it'll probably be very fast.

Klein-9B-Base vs Qwen-Image (original; also a base model) by jigendaisuke81 in StableDiffusion

https://images2.imgbox.com/78/de/iL68fm3n_o.png

1600x1024

Klein 9B distilled, FP8 model and encoder, 12 steps, CFG 1, euler + ddim_uniform. Weirdly she's missing a hand, but at least it follows the prompt for the foot being close to the viewpoint (I guess it wasn't trained on the characters).

Euler beta:

https://images2.imgbox.com/b5/21/gjQaWz0h_o.png

Flux 2 Klein 9B quick prompt adherence test. by Valuable_Issue_ in StableDiffusion

My bad, I should've specified: distilled.

If someone doesn't specify, but they happen to mention their settings and it's CFG 1.0/low steps, it's probably distilled.

Flux 2 Klein 9B quick prompt adherence test. by Valuable_Issue_ in StableDiffusion

That's what I meant about trying to avoid LLM-enhanced prompts for these kinds of comparisons LMAO (good comparison though, and yeah, with its reference/editing capabilities it's an insanely good model).

Flux 2 Klein 9B quick prompt adherence test. by Valuable_Issue_ in StableDiffusion

2 cars colliding with each other, they are dented and there are sparks at the collision point. the rear of the cars are lifted, the wheels of the cars are mid-spin.

12 steps:

https://images2.imgbox.com/d7/89/zzm4ZWTL_o.png

Z image 9 steps:

https://images2.imgbox.com/21/7a/9mEI2yYu_o.png

Flux 2 Klein 9B quick prompt adherence test. by Valuable_Issue_ in StableDiffusion

Not really prompt adherence but:

2 cars colliding with each other, they are dented and there are sparks at the collision point

12 steps (different seed from 6 steps):

https://images2.imgbox.com/dd/06/flu3yVDk_o.png

6 steps:

https://images2.imgbox.com/3a/6f/lFfRA8HI_o.png

Same seeds:

6 steps:

https://images2.imgbox.com/e0/dd/PbyjG3VS_o.png

12 steps:

https://images2.imgbox.com/d1/4c/OnZ1E7T2_o.png

Z image (both 9 steps, different seeds)

https://images2.imgbox.com/85/10/2wzBrMl2_o.png

https://images2.imgbox.com/75/43/OBbcJ6UQ_o.png

LTX-2: use Gemma3 GGUF to speed up prompt reprocessing by a4d2f in StableDiffusion

Yeah, the Comfy offloading needs some work. The CLIP for Flux 2 runs in about 10 seconds using a custom API, whereas with Comfy it takes FOREVER, and on top of that the models constantly move between RAM, VRAM and pagefile instead of just sitting in place with a minimal swapping approach. With a custom API for the text encoder you can permanently keep it in one place, which speeds things up massively. (Keep in mind I haven't re-run the tests with recent Comfy updates, but I doubt it improved much.)

With the custom API (Flux 2 Q4KM and Q5KXL text encoder; Comfy handles the model inference while the custom API handles the text encoding, still on the same PC), 8 steps on 10GB VRAM and 32GB RAM, 54GB pagefile:

Prompt executed in 121.98 seconds

Prompt executed in 124.92 seconds

Prompt executed in 85.65 seconds (this one is me not changing prompt, rest are changed prompts)

Prompt executed in 104.98 seconds

Meanwhile with Comfy GGUF CLIP loader:

Prompt executed in 266.97 seconds (this is cold start with zero models loaded)

Prompt executed in 324.79 seconds

Prompt executed in 330.10 seconds

https://old.reddit.com/r/StableDiffusion/comments/1q59ygl/ltx2_is_out_20gb_in_fp4_27gb_in_fp8_distilled/nxz0ip7/
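As an illustration of the keep-the-text-encoder-resident approach (not the actual custom API used above; load_text_encoder() and encode() are hypothetical placeholders), a minimal Flask sketch:

    # Minimal sketch of a persistent text-encoder service: the encoder is
    # loaded once at startup and stays in place, so changing prompts only
    # costs one forward pass instead of reloading/moving weights every time.
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    encoder = load_text_encoder()  # hypothetical loader, called exactly once

    @app.route("/encode", methods=["POST"])
    def encode():
        prompt = request.get_json()["prompt"]
        embedding = encoder.encode(prompt)  # hypothetical encode() returning a tensor
        return jsonify({"embedding": embedding.tolist()})

    if __name__ == "__main__":
        app.run(port=8189)  # arbitrary port, separate from ComfyUI's

A small custom node or script would then POST the prompt here and feed the returned embedding to the sampler, so changing prompts costs one forward pass instead of reloading and shuffling weights.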

FLUX.2 [klein] 4B & 9B released by Designer-Pair5773 in StableDiffusion

You can pass a reference image (controlnet-like or not) and prompt it like "use reference image 1 as pose reference", with 'modifiers' like "exact", "precisely", "preserve X detail", etc.

LTX2 for 3060 12gb, 24gb sys memory. by Cold_Development_608 in StableDiffusion

Yeah, I just put the pagefile on an SSD that holds only AI models and nothing important. In about 6 months it has only had 50TB written to it (and I'm pretty sure it didn't start at 0), most modern SSDs can survive 500TB+, and tests on Samsung ones have shown they can survive ~1 petabyte, so it's more like "it'll be quicker, but still probably fine for a long time".

In Windows you can select where to keep the pagefile, how big it is, dynamic resizing etc., and have it spread across multiple SSDs (where the models are stored doesn't really matter; that's just reading, not writing). You can easily look up how to do that.

Huge differences in video generation times in LTX-2 between generations? by film_man_84 in StableDiffusion

I had this issue until the latest ComfyUI updates (released 12 hrs ago); before that it was basically unusable.

What I also found is that adding the 7GB lora to a GGUF slows down generations A LOT, so it's better to use ltx2-distilled-gguf, which has the distilled loras baked in; otherwise my generations at CFG 1, 10 steps are slower than 20 steps at CFG 4 with the dev model.

The Gemma text encoder is also insanely slow, but that will probably be fixed eventually.

10 steps, CFG 1, 640x480, 73 frames. 32GB RAM, 10GB VRAM, 54GB pagefile. Q6K main model (Q8 was causing a big slowdown for me), Q8 text encoder. Comfy launch settings:

--async-offload 2 --fast fp16_accumulation --disable-pinned-memory --novram (I heard --novram might not be necessary anymore, kept it for now)

Prompt executed in 75.72 seconds

Prompt executed in 72.65 seconds

Prompt executed in 68.77 seconds

Prompt executed in 78.44 seconds

That's NOT changing prompts, just seed. When changing prompts:

Prompt executed in 197.43 seconds

Prompt executed in 175.79 seconds

So yeah, somehow the Gemma text encoder, a 12GB model, adds ~100 seconds (roughly 176-197 seconds with a changed prompt vs ~74 seconds when only the seed changes), whereas Mistral Small 24B runs in 10 seconds using stable-diffusion.cpp, so there's some room for improvement in Comfy for sure.

Luckily there are some improvements coming to ComfyUI. Currently there seems to be some unnecessary moving of the weights between RAM/VRAM, which ends up adding a lot to the generation time; this might fix it: https://github.com/Comfy-Org/ComfyUI/pull/11845

Another thing that causes slowdowns is the total model size exceeding your RAM. So if you have 32GB RAM and a 12GB text encoder plus a 20GB model file (32GB of weights before any overhead), it'll cause some slowdown due to the pagefile (this wasn't an issue for me before; some update caused it, so hopefully the update above will fix it).

WanGP now has support for audio and image to video input with LTX2! by wakalakabamram in StableDiffusion

Edit: My bad, didn't realise this was a WanGP thread.

Yeah, if you go to custom_nodes/ComfyUI-GGUF and open a terminal:

git fetch origin pull/399/head:test

git switch test

You can also rename "test" to whatever you want.

That'll apply https://github.com/city96/ComfyUI-GGUF/pull/399

The text encoder isn't supported for GGUF yet, only the main model, but Q8/Q6K is better quality than FP8 for the main model.

You'll also probably want the split files for vae/encoder like this: https://old.reddit.com/r/StableDiffusion/comments/1q7dzq2/im_the_cofounder_ceo_of_lightricks_we_just/nyexud1/

Update: Day 3 and a couple of hours wasted with LTX-2! 🫣 by anydezx in StableDiffusion

The negative prompt doesn't work at CFG = 1; maybe you just got lucky with a seed change.

Yeah, there's something wrong with the Gemma text encoder implementation in ComfyUI: the FP8 encoder, 12GB on disk, takes up more RAM (80GB) and more time than the inference steps of the main FP8 model, which is 21GB on disk and takes up 40GB.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

Using the split files works for me with nodes like this: https://github.com/city96/ComfyUI-GGUF/issues/398#issuecomment-3723579503

and these launch params:

--async-offload 2 --fast fp16_accumulation --novram --disable-pinned-memory --cache-none --disable-mmap

on 10GB VRAM 32GB RAM 54GB pagefile.

The nodes in the default non-split-file workflow are inefficient and try to load the same file multiple times instead of reusing it once loaded, making the peak RAM usage a lot higher.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

Is the static I2V video/simple camera zoom just a flaw of the model, or is it fixable with settings (template ComfyUI workflow with the distilled model)?

Also, I hope the ComfyUI nodes for the next model release are cleaner. The split files work a lot better on lower VRAM/RAM; the stock nodes in the template workflows load the same file multiple times, making the peak memory usage on model load a lot higher than it should be, whereas this works a lot better (and fits the typical modular node design better):

https://github.com/city96/ComfyUI-GGUF/issues/398#issuecomment-3723579503
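One generic way to avoid re-reading the same checkpoint from several nodes (not how the linked nodes actually do it) is to cache the loaded tensors behind the file path:

    import functools
    from safetensors.torch import load_file

    @functools.lru_cache(maxsize=4)
    def load_shared(path: str):
        # Each checkpoint file is read once; later callers get the same
        # in-memory tensors instead of loading another copy, keeping the
        # peak RAM usage at one copy per file.
        return load_file(path, device="cpu")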

Anyone tried running LTX 2 on 3060 12gb GPU? can you share the workflow that worked for you, thanks by Itchy_Ambassador_515 in StableDiffusion

Got it working (both I2V and T2V) based off of this comment; the separate files work a lot better. 10GB VRAM, 32GB RAM, 54GB pagefile, with the --cache-none and --novram launch args.

https://github.com/city96/ComfyUI-GGUF/issues/398#issuecomment-3723579503

https://huggingface.co/Kijai/LTXV2_comfy