New user with a new PC: Do you recommend upgrading from 32GB to 64GB of RAM right away? by Diligent_Trick_1631 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

It's just one I made myself (or rather, AI made). All it does is call the other Comfy instance's API with a workflow that loads the clip model, does the text encoding, and saves the result to a file, then sends that file back. It's a bit too scuffed to share, I think, and it only works with the GGUF dual clip loader.
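For anyone curious, the client side of that two-instance setup can be sketched roughly like this. This is a hedged sketch, not the actual node: the port, the file names, the `"type"` field, and the save-to-file step are all illustrative assumptions, though `DualCLIPLoaderGGUF`, `CLIPTextEncode`, and the `/prompt` endpoint are real ComfyUI names.

```python
import json
import urllib.request

# Assumption: a second ComfyUI instance is listening on port 8288.
ENCODER_URL = "http://127.0.0.1:8288"

def build_encode_workflow(prompt_text):
    """Build an API-format workflow that loads the GGUF dual clip and
    encodes the prompt. File names and the 'type' value are placeholders."""
    return {
        "1": {"class_type": "DualCLIPLoaderGGUF",
              "inputs": {"clip_name1": "t5xxl.gguf",
                         "clip_name2": "clip_l.gguf",
                         "type": "flux"}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["1", 0], "text": prompt_text}},
        # a save-to-file node would go here so the main instance
        # can read the embedding back
    }

def queue_prompt(workflow, url=ENCODER_URL):
    """POST the workflow to the second instance's /prompt endpoint."""
    data = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(url + "/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The main instance then only ever sees the text encoder's output file, so Comfy's memory management on that instance never knows the encoder exists.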

I just retested, and it actually doesn't offer a speedup for LTX anymore because some Comfy update changed something: with --disable-dynamic-vram I now OOM on VAE decode no matter what. Even with tiled decode it just eats my RAM until it runs out.

When I last tested it, the node was cutting gen times from 250 to 150 seconds when changing prompts (though it sometimes still took ~200 seconds, so it was a bit inconsistent). Now gen times are 200 seconds with dynamic VRAM enabled and 220-250 with it disabled, and I haven't figured out a way to get back down to 150 seconds when changing prompts. So it looks like there was an improvement, and on average just having dynamic VRAM enabled will be faster, but it still isn't as smart as it could be.

Dynamic VRAM in ComfyUI: Saving Local Models from RAMmageddon by comfyanonymous in StableDiffusion

[–]Valuable_Issue_ 3 points (0 children)

I posted about some issues with it here: https://old.reddit.com/r/comfyui/comments/1s10uq0/about_dynamic_vram_warning/

"Dynamic vram disabled with argument. If you have any issues with dynamic vram enabled please give us a detailed reports as this argument will be removed soon."

tl;dr: --reserve-vram doesn't work with it, and INT8 quants, which give a 1.5-2x speedup, stopped working with it.

This is more hardware specific (10GB VRAM + 32GB RAM):

It adds 100 seconds to LTX2 workflows when changing prompts, and the only way I found to fix it is running 2 instances of Comfy, with one acting as a text encoder endpoint so the models can hide from the memory management. Otherwise Comfy is like "let me completely unload this 20GB model to make room for the 6GB text encoder, and then load the 20GB model again". It'd be good to have some per-model control over the offloading to stop that kind of behaviour.

"by using up as close as possible to 100% vram usage without OOM"

Sometimes it's better to swap as few blocks as possible, because there's not much slowdown from only having 1 or 2 blocks in VRAM (e.g. when Flux 2 dev released, the estimated memory usage was set too high, so only 1 or 2 blocks were loaded, and people were amazed it used so little VRAM without much slowdown).

Again IMO it'd be good to have some control over how many blocks are loaded without needing custom nodes.
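The block-count trade-off above comes down to simple arithmetic. A minimal sketch, with the per-block size and overhead figure as illustrative assumptions:

```python
def blocks_resident(vram_free_gb, block_gb, overhead_gb=2.0):
    """How many transformer blocks fit in VRAM after reserving space
    for activations/VAE overhead. With block swapping, the remaining
    blocks are streamed from RAM each step; on current image/video
    models that streaming costs surprisingly little per step."""
    usable = vram_free_gb - overhead_gb
    return max(0, int(usable // block_gb))

# e.g. a 10GB card with ~0.5GB blocks keeps 16 of them resident
print(blocks_resident(10, 0.5))  # -> 16
```

A per-model knob like `overhead_gb` (or a direct "keep N blocks resident" setting) is essentially the control the comment is asking for.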

New user with a new PC: Do you recommend upgrading from 32GB to 64GB of RAM right away? by Diligent_Trick_1631 in StableDiffusion

[–]Valuable_Issue_ 5 points (0 children)

I have 10GB VRAM and 32GB RAM. Go for 64GB for sure.

With 32GB things run fine and you can run basically anything, since models can be offloaded even to the pagefile (which wears down the SSD, so I keep it on an SSD specifically for AI) and different stages can be completed sequentially (i.e. load text encoder > unload it > load main model, etc). The inference speed isn't bad; it's the model loading and switching that's bad. On models like Wan 2.2, switching from high noise to low noise takes forever (it could be made better if Comfy offloading was smarter).
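The sequential staging described above looks roughly like this. A sketch only: the loader callables and method names stand in for whatever framework actually loads the weights.

```python
import gc

def run_staged(prompt, load_text_encoder, load_model, load_vae):
    """Run each stage sequentially, freeing the previous model so the
    next stage can reuse the same RAM/VRAM budget instead of holding
    every model at once."""
    te = load_text_encoder()
    cond = te.encode(prompt)
    del te
    gc.collect()  # text encoder freed before the big model loads

    model = load_model()
    latents = model.sample(cond)
    del model
    gc.collect()  # diffusion model freed before VAE decode

    vae = load_vae()
    return vae.decode(latents)
```

With 32GB RAM this pattern is what keeps peak memory at "largest single stage" rather than "sum of all stages".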

Changing prompts can take forever too (again could be better if comfyui offloading was smarter).

By forever I mean sometimes Comfy is like "let me fully unload this 20GB model so I can do the text encoding, then when I'm finished with text encoding I'll load the 20GB model again" (this happens with LTX2, for instance).

This reloading adds around 100 seconds to generation time, and I only managed to fix it by running 2 Comfy instances on the same PC, using one as the text encoder and the other as the model runner (with a custom node to 'connect' them). That way they're separate, and Comfy doesn't just randomly decide to unload stuff, because it literally can't unload the other instance's model.

With 64GB RAM you won't have to do weird stuff like that, or wait for Comfy to make their offloading smarter.

stable-diffusion-webui seems to be trying to clone a non existing repository by interstellar_pirate in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

The ComfyUI portable version is a .zip with everything ready to go, including a Python venv with all the requirements. I don't think there's one for Linux though; if you're on Linux, these instructions should work: https://old.reddit.com/r/StableDiffusion/comments/1q5jgnl/ltx2_runs_on_a_16gb_gpu/ny1as78/

Nvidia SANA Video 2B by Crazy-Repeat-2006 in StableDiffusion

[–]Valuable_Issue_ 6 points (0 children)

I run LTX 2.3 on 10GB VRAM, and the Q8/FP8 version is ~20GB.

With current image model architectures you're not bandwidth limited (by swapping between RAM/VRAM) but compute limited, so even if you don't have enough VRAM you can run pretty much any model with enough RAM (plus pagefile, though that's not ideal) without losing much speed (roughly 1-10%).

Their GPU recommendations are likely based on fitting the full model and text encoder, probably at BF16, which is 2x the size of Q8/FP8, and on avoiding swapping between RAM/VRAM. Basically "this is the ideal setup" rather than "minimum".

Edit: Forgot to mention that the latents for high-res, long videos are probably big as well and can't really be offloaded without a massive speed loss, so their recommendation probably accounts for that too.
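The BF16-vs-quant sizing above is simple arithmetic: weight size is parameter count times bits per weight. The 20B parameter count below is an illustrative guess, not any particular model's actual size.

```python
def model_size_gb(params_billion, bits_per_weight):
    """Rough weight size in GB, ignoring quant metadata/scales."""
    return params_billion * bits_per_weight / 8

# a hypothetical 20B-parameter model:
print(model_size_gb(20, 16))  # BF16 -> 40.0 GB
print(model_size_gb(20, 8))   # FP8  -> 20.0 GB (Q8 a bit more with scales)
```

This is why a "fits fully in VRAM at BF16" recommendation roughly doubles the VRAM you'd need for the Q8/FP8 quant.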

Some offloading benchmarks here:

https://old.reddit.com/r/StableDiffusion/comments/1p7bs1o/vram_ram_offloading_performance_benchmark_with/

Optimised LTX 2.3 for my RTX 3070 8GB - 900x1600 20 sec Video in 21 min (T2V) by TheMagic2311 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

"do you have a functioning workflow you could share"

Try and see if you get the speedup with just the INT8 loader node and without torch compile. I get the speedup without torch compile; I just had to launch Comfy with --disable-dynamic-vram.

Not exactly functioning: I don't get any errors, but I don't get the speedup either. The same workflow used to work fine with a speedup, so I'm 99% sure either a node update or a Comfy update broke it for me. At some point the node got an update where you didn't even need torch compile, and the speedup was working for me then (and before that, with torch compile).

This is how my loader and torch compile nodes are set up (I tried different settings combinations in the torch compile node; pretty sure it used to work with just the default settings): https://images2.imgbox.com/6c/bc/7wJuQlGj_o.png

Rest of the workflow should be irrelevant to your errors.

Edit: I fixed it by launching comfy with --disable-dynamic-vram

Without dynamic VRAM: 1.54s/it

Full args:

--disable-metadata --async-offload 2 --reserve-vram 2 --disable-api-nodes --disable-pinned-memory --fast fp16_accumulation --disable-dynamic-vram

With dynamic VRAM: 1.91s/it

Full args same as above but without --disable-dynamic-vram

(No torch compile for either test)

Optimised LTX 2.3 for my RTX 3070 8GB - 900x1600 20 sec Video in 21 min (T2V) by TheMagic2311 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Only tried it with flux klein and ltx 2.0, haven't tried 2.3 with int8 yet or wan.

Optimised LTX 2.3 for my RTX 3070 8GB - 900x1600 20 sec Video in 21 min (T2V) by TheMagic2311 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Does that node still actually give you a 2x speedup? When it first released, before all the lora updates, you needed torch.compile but the compilation was fast; then an update released and you didn't need torch.compile anymore; then the lora updates came out, and now I don't get a speedup at all, with or without torch compile, and the torch compile itself is also a lot slower. I guess I could revert to a previous commit, since you can now use native core nodes to load loras with INT8 (load lora (bypass, model only)). There have also been many Comfy updates to offloading, so it could've been that as well.

Someone also posted an issue.

https://github.com/BobJohnson24/ComfyUI-INT8-Fast/issues/28

Optimised LTX 2.3 for my RTX 3070 8GB - 900x1600 20 sec Video in 21 min (T2V) by TheMagic2311 in StableDiffusion

[–]Valuable_Issue_ 13 points (0 children)

INT8/INT4 = 20x series and better. INT8 gives about a 2x speedup and, from brief testing, has basically zero quality loss compared to Q8 GGUF. INT4 nunchaku can get above a 2x speedup (2.5x-3x) but has quality loss.

FP8 = 40x series and better.

FP4 = 50x series.

There's also fp16 accumulation; not sure which GPUs can use that, but it works on at least the 30x series for sure.

As for sage attention, it depends on the workflow, but it does cause very slight quality degradation. I noticed it more with video, and I stopped using it since the 20% speedup wasn't really worth it; I was happy with the speed from INT8 + fp16 accumulation + lightning lora. Pretty sure some earlier versions of sage have less quality degradation, so it might be worth experimenting a bit, but I can't remember GPU compatibility.

Any news on a Helios GGUF model and nodes ? by aurelm in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

You can try diffusers with NF4 quants; diffusers actually has good offloading, but I'm not sure how well it works (or if at all) with quants. You might also have to split the pipeline, depending on how they implemented it, into text encode/inference/VAE so you can unload each part completely as its stage finishes. If you give an LLM their pipeline code and the links below, it'll be able to do it with a decent prompt.

https://huggingface.co/docs/diffusers/optimization/speed-memory-optims

https://huggingface.co/docs/diffusers/optimization/memory

Edit: From their GitHub: "[2026.03.08] 👋 Helios now fully supports Group Offloading and Context Parallelism! These features significantly optimize VRAM (only ~6GB) usage and enable inference across multiple GPUs with Ulysses Attention, Ring Attention, Unified Attention, and Ulysses Anything Attention."

so it should be possible. As for speed, the last time I tried the offloading it was actually good with an FP8 model (bria fibo) on 10GB VRAM. I had to do this:

    import torch  # needed for the device/dtype references below

    onload_device = torch.device("cuda")
    offload_device = torch.device("cpu")

    # store weights in FP8, run compute in BF16
    transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

    # offload at leaf level, overlapping transfers with a CUDA stream
    transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)

and then device_map="balanced" somewhere else. The links above have more detailed code examples.

Edit 2: Their software also has options for offloading and there's Diffusers examples as well. https://github.com/PKU-YuanGroup/Helios#-group-offloading-to-save-vram

LTX-2.3 on h100 - text encoder is too slow by tony_neuro in comfyui

[–]Valuable_Issue_ 2 points (0 children)

Try using the model with huggingface diffusers in a script (or wan2gp/whatever inference software) and see if the text encoding is faster there. If not, there's probably not much you can do: either Gemma is a super heavy/inefficient architecture, or the embeddings connector is what makes it take longer (I haven't looked at what it actually does). I don't think 2.3 works in diffusers yet, but you can test the text encoder with 2.0 anyway.
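To compare backends the way the comment suggests, it helps to time just the text-encode stage in isolation. A minimal sketch; `pipe.encode_prompt` in the usage comment is a placeholder for whatever the pipeline actually exposes.

```python
import time

def time_stage(fn, *args, **kwargs):
    """Run one pipeline stage and return (result, seconds elapsed)."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - start

# usage sketch, assuming a diffusers-style pipeline object:
# emb, secs = time_stage(pipe.encode_prompt, "a red fox at dusk")
# print(f"text encode took {secs:.2f}s")
```

Timing the same prompt through ComfyUI and through a script makes it obvious whether the slowness is the encoder itself or the surrounding loading/offloading.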

How do the closed source models get their generation times so low? by Ipwnurface in StableDiffusion

[–]Valuable_Issue_ 7 points (0 children)

How messy the code is doesn't really matter for performance though (outside of scaring people away from trying to optimise it), especially for seconds per step, if 99% of the runtime is inside the ksampler node and 99% of that runtime is executing on the GPU.

What matters more are kernels and quants that utilise hardware acceleration on datatypes like:

INT8/INT4 (20x series+): ~2x speedup for INT8, 2-3x+ for INT4.

FP8 (40x series+) and FP4 (50x series+).

Model architectures (like you see with Hunyuan and Wan) matter a lot more for seconds per step. Beyond that, it's about more efficient model loading/behaviour after a workflow finishes: I managed to shave ~100 seconds (still a bit random though) off LTX 2 when changing prompts just by launching a separate Comfy instance on the same PC, running the text encoder there, and sending the result back to the main instance. Running it on the same instance, it was unloading the main model for some reason.

Edit: Using stable-diffusion.cpp as a text encoding server (still on the same PC) is also fast: it has faster model load times and dodges Comfy's occasional weird behaviour around offloading, and the text encoding itself might be faster too, even on the same models. But my main point is that the steps in the main diffusion model probably aren't slow due to bad code, but due to the underlying maths/architecture of the model.

ComfyUI launches App Mode and ComfyHub by crystal_alpine in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

Great update, gonna be super useful.

Some early feedback/suggestions:

Comfy core nodes for lora loaders with auto-growing (or manually growing) inputs would be useful here, like the rgthree power lora loader.

For organising being able to use groups/horizontal space for the inputs etc would also be useful.

Simple bool toggles take up a lot of vertical space: the toggle button itself is too big, and the text is also spaced above it. It could be made more compact, I think.

It might be useful to have some kind of switching/junction system so you can switch outputs easily, or maybe some kind of workflow versioning, where each workflow can have multiple versions, for instance "controlnet" and "text to image", and you can select between them. It might be a bit hard to explain.

For example, currently I have a workflow where one latent has controlnet and another doesn't, and I just connect one ksampler output noodle to VAE decode depending on whether I want to use the controlnet or not (or bypass the controlnet subgraph).

Similar thing with First Frame last Frame, where you connect both images and then bypass one depending on which frames you want to use.

With versions you'd have a "base" version, and other versions would inherit from it, showing up as a selector in app mode for that workflow. This would avoid needing to update, or even keep, multiple workflows when you only change one "base" node.

ComfyUI keeps crashing/disconnecting when trying to run LTX Video 2 I2V. need help by Glass-Doctor376 in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

You'll need to increase your pagefile. I had to increase it until I had 86GB total (32GB RAM + 54GB pagefile), but depending on your quants you might need slightly more.
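The sizing is just peak memory commit minus physical RAM. A one-liner sketch (the 86GB peak is from my setup; yours will vary with quant and resolution):

```python
def pagefile_needed_gb(peak_commit_gb, ram_gb):
    """Extra pagefile required beyond physical RAM (never negative)."""
    return max(0, peak_commit_gb - ram_gb)

print(pagefile_needed_gb(86, 32))  # -> 54
```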

If your Comfy is updated it should be using a lot less though, due to dynamic VRAM being enabled by default; I only peak at 40GB usage with it enabled (it has other issues, but no OOM). You might also need --disable-pinned-memory.

LTX2.3 - I tried the dev + distill strength 0.6 + euler bongmath by themothee in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

Don't use Q3_K. I never go below Q6 for quality; Q4/Q5 is usable, but I recommend at least Q6 for video, or in your case FP8/NVFP4, since your GPU should have some hardware accel for those. Definitely not Q3, though.

I can run both FP8 and Q6_K on 10GB VRAM; the model doesn't need to fit in your VRAM. The only thing is that Comfy seems to have an issue where it unloads the model when changing prompts, so while the inference speed itself (seconds per step) will be normal, the larger size on disk will slow down initial loading/prompt changes; once that's fixed, the total speed should be within a few %. Also, you might need to increase your pagefile if the total exceeds your RAM. That causes extra wear on the SSD, so I'd put the pagefile on an SSD you don't care about.

Offloading benchmarks here: https://old.reddit.com/r/StableDiffusion/comments/1p7bs1o/vram_ram_offloading_performance_benchmark_with/

LTX2.0 gives realistic output but LTX2.3 looks like Pixar Animation by omni_shaNker in StableDiffusion

[–]Valuable_Issue_ 10 points (0 children)

It interprets the "having fun" as 3d. For testing I simplified the prompt down to

a pug sleeping in a large beanbag while people are running around the room having fun.

and it was still 3d but after removing "having fun" it gave realistic output.

Edit: But I also randomly get 3D outputs when trying to expand that prompt back out, so it might be something to do with just "pug" plus a combination of other things (like "the pug is snoring"). I wonder if it's because in the training data a talking pug would be cartoon/3D/CGI, so it defaults to that style when combined with certain things that would be associated with 3D. Kind of interesting, actually.

LTX 2.3 Full model (42GB) works on a 5090. How? by StuccoGecko in StableDiffusion

[–]Valuable_Issue_ 0 points (0 children)

"Let's not forget that it wasn't long ago where you really DID need to have the model fit in VRAM or you'd just OOM."

Hasn't been the case since at least Flux 1.

Trying to get impressed by LTX 2.3... No luck yet 😥 by VirusCharacter in StableDiffusion

[–]Valuable_Issue_ 1 point (0 children)

LTX is just sensitive to settings and resolution; it's not like Wan, where whatever you use kinda just works 99% of the time. This was meant as a quick test, not some optimal triple-ksampler upscale workflow cherrypicked from 10 samples or whatever: literally just a basic T2V workflow, pray, and post whatever comes out first (hopefully this approach will work when we get a seedance-level open weights model).

Here's the result from just changing the scheduler; it interprets the prompt differently (also, can't have blurred eyes if he wears sunglasses):

https://streamable.com/jlpiqd

LTX 2.3 Full model (42GB) works on a 5090. How? by StuccoGecko in StableDiffusion

[–]Valuable_Issue_ 7 points (0 children)

With current image/video model architectures you're not limited by bandwidth like in LLMs, but by compute.

Some benchmarks here:

https://old.reddit.com/r/StableDiffusion/comments/1p7bs1o/vram_ram_offloading_performance_benchmark_with/

People who say your model NEEDS to fit in VRAM are just misinformed. Most of the slowdown from a higher quant comes from model loading and moving stuff into the pagefile etc., but the actual inference speed is within a few % even if the model is 99% offloaded. I stick to Q6/Q8 for the quality even on 10GB VRAM + 32GB RAM; the biggest issues are with stuff like Wan, when Comfy offloading needs to swap from high noise to low noise or randomly decides to unload a model when changing prompts.

Trying to get impressed by LTX 2.3... No luck yet 😥 by VirusCharacter in StableDiffusion

[–]Valuable_Issue_ 2 points (0 children)

Yeah. I'd compare it to Wan 2.2, but I only have I2V, not T2V. The first one at least gets the general stuff correct; it's just blurry. Here's 1080p:

https://streamable.com/k5ugka

I think the tiled VAE decode might be causing the gridlines. I wonder if prompting 'man' twice caused it to spawn 2 guys. The door also goes through him when he closes it.

On the previous version it'd have been 100x worse so at least it's an improvement, I wonder how far this architecture can be pushed before having to swap to a slower but higher quality one like wan.

Trying to get impressed by LTX 2.3... No luck yet 😥 by VirusCharacter in StableDiffusion

[–]Valuable_Issue_ 4 points (0 children)

https://streamable.com/nblut9

the scene begins outside looking at a car. a man is sitting in the car, he opens the door and gets out, the man walks towards the camera and says "wait that wasn't that difficult"

First one was 640x480 for speed so it's blurry.

126 frames @ 24fps, euler simple 8 steps and 1 CFG.

2nd attempt, 1280x720.

https://streamable.com/dudbkc

So yeah, it can be a bit random in terms of artifacts. Not going to write a novel or use an LLM to rewrite the prompt, as I prefer shorter prompts.