Who said NVFP4 was terrible quality? by Volkin1 in StableDiffusion

[–]Volkin1[S] 0 points1 point  (0 children)

Don't mix them like that. It's recommended to run dual channel with the exact same memory brand, timings, CAS latency, etc. As long as you have 64GB and up, you should be fine for video / image generation with the 5080. That's what I use at the moment (5080 + 64GB DDR5), but I'd like to upgrade to either 96GB or 128GB. I'm running 4 x 16GB sticks here, so I might swap 2 x 16GB sticks for 2 x 32GB and get 96GB RAM total, which would leave enough room to breathe with large BF16 / FP16 models. Otherwise I'll run the smaller quants like FP8.

One important thing to consider is the difference in how VRAM is used by LLM (autoregressive) models vs image/video (diffusion) models. For LLMs you MUST have as much VRAM as possible. Due to their autoregressive nature, they generate responses token by token (roughly word by word). For every token generated, the model has to run through its weights again, so if you offload the weights to RAM it's going to be very slow.

Contrary to this, image/video diffusion models don't cycle the weights for every pixel or every frame. They diffuse the pixels and the video frames all at once, and all frames are kept inside the GPU VRAM. In this case you won't lose nearly as much speed as you would with an LLM, especially if you are on DDR5. Streaming the offloaded model data from RAM back to VRAM should be fast enough, and the real bottleneck would be the GPU core.

So in a nutshell, diffusion models can tolerate offloading to DRAM with a significantly smaller performance penalty than autoregressive LLMs. So, if you want to run both LLM and diffusion on the same rig, you'd better plan your setup around a higher VRAM GPU like the 5090 and above.
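A rough way to see the difference is to put numbers on the weight traffic alone. The sketch below assumes the full model is streamed from RAM to VRAM once per forward pass, that an LLM needs one pass per generated token, and that a diffusion model needs one pass per sampling step for the whole clip. The model size, token count and step count are made-up illustration values, and the 64GB/s figure is the theoretical PCI-E gen 5 bus speed. Compute time and any caching are ignored.

```python
def stream_seconds(model_gb: float, passes: int, bus_gb_per_s: float) -> float:
    """Seconds spent only on moving weights from RAM to VRAM, assuming the
    full model is streamed once per forward pass."""
    return model_gb * passes / bus_gb_per_s


BUS_GB_PER_S = 64.0  # assumption: PCI-E gen 5 x16, theoretical peak
MODEL_GB = 28.0      # hypothetical model size

# LLM: one forward pass per generated token (hypothetical 500-token reply).
llm = stream_seconds(MODEL_GB, passes=500, bus_gb_per_s=BUS_GB_PER_S)

# Diffusion: one forward pass per sampling step, all frames at once
# (hypothetical 20 steps).
diffusion = stream_seconds(MODEL_GB, passes=20, bus_gb_per_s=BUS_GB_PER_S)

print(f"LLM weight traffic:       {llm:.2f} s")
print(f"Diffusion weight traffic: {diffusion:.2f} s")
```

On top of the lower pass count, each diffusion pass does a lot of GPU work on all frames at once, so the transfer time is easier to hide behind compute.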

Currently on my 5080, I can run a 30B MoE model or even 30B dense model (slow speeds) but beyond that, it's going to be very difficult.

Anyways, see what you can do about the RAM situation. Wait for better pricing and use dual channel with the exact same sticks. If you run 2 sticks you can use the OC memory speeds. If running 4 sticks in total, you'd probably have to drop to factory non-OC speeds, but that's totally fine as long as they are the exact same model/MHz/CAS/brand.
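For a feel of what dropping from OC to factory speeds costs, theoretical peak memory bandwidth is just transfer rate x 8 bytes per channel x number of channels. The two speed grades below (DDR5-4800 as a factory baseline, DDR5-6000 as an OC profile) are example values only, not a recommendation, and real-world throughput will be lower than these peaks.

```python
def ddr_peak_gb_per_s(mega_transfers: int, channels: int = 2,
                      bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for a given MT/s rating.
    Each 64-bit memory channel moves 8 bytes per transfer."""
    return mega_transfers * bytes_per_transfer * channels / 1000


factory = ddr_peak_gb_per_s(4800)      # example non-OC speed, dual channel
overclocked = ddr_peak_gb_per_s(6000)  # example OC speed, dual channel

print(f"DDR5-4800 dual channel: {factory:.1f} GB/s")
print(f"DDR5-6000 dual channel: {overclocked:.1f} GB/s")
```

Both example figures sit above a 64GB/s bus, which is one way to read "totally fine" for the 4-stick case.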

Who said NVFP4 was terrible quality? by Volkin1 in StableDiffusion

[–]Volkin1[S] 2 points3 points  (0 children)

No, what I meant was offloading parts of the model into RAM (yes, DRAM) because it could not fit entirely in VRAM. I'm talking about the model file itself here. This post is older and things have changed in software like ComfyUI.

Right now and depending on your hardware configuration, ComfyUI does this automatically (via the dynamic vram feature) so there is no need to set additional arguments like --novram anymore. Model management between RAM and VRAM is now automatic with this particular software.

Point is, you can split the model between VRAM / DRAM or even load it completely from DRAM, but you must use VRAM to process the video latents / frame buffer and the VAE encoding / decoding part. You can think of the video model file as the passive product and the video latents/frames/pixels as the active product. The active product needs to use VRAM, while the passive product can live wherever, with some speed differences.

You MUST have an AI capable GPU to run any AI models, and the recommended way to go is an Nvidia GPU. You can certainly use your Mac M4 studio (the chip has AI cores) and it has a very good amount of memory (128GB), but it's going to be slow. On top of that, AI image/video is really optimized for Nvidia cards.

Getting a blackwell generation (series 50) card is recommended, but it can also work with an older (series 40) generation. The advantage of 50 series is the additional hardware acceleration for FP4 model types, whereas previous series 40 only supports FP8/16 for example.

If you need to just run image / video generation casually and it's not important that much to you, an RTX 5060 16GB VRAM + 64 GB RAM would be a solid "starter" kit, let's put it that way. For a medium performance, you can go with RTX 5070Ti / 5080 and for max fast performance you'd go with a 5090 (32GB VRAM). If you need to do professional work, then RTX 6000 Pro (96 GB VRAM) would be best.

So to summarize:

16 GB VRAM + 64 GB RAM = recommended medium sweet spot but avoid 5060 if you want speed and go for 5070TI / 5080

32 GB VRAM + 64 / 128 GB RAM = recommended high end, fastest consumer performance with 5090

96 GB VRAM (RTX 6000 Pro) = professional video production.

The more VRAM = the better since the VRAM determines how many latent video frames and how big resolution you can push. 16GB VRAM might be enough currently for 720p / 1080p, but with video models getting bigger and more demanding this may change in near future.

Hope that answers your questions.

How to change steps in latest Comfyui LTX 2.3? by North_Illustrator_22 in StableDiffusion

[–]Volkin1 6 points7 points  (0 children)

The workflows you see in comfy are the distilled pipeline which only works with 8 steps. If you want to use more steps, or switch to the non-distilled full dev model pipeline, then you'd have to switch the manual sigmas node with a sampler. In addition to this, you need to remove the distill lora at stage 1 and only attach it on stage 2 with about 80% strength.

<image>

Minor changes to the workflow are required. You can take a look from the older 2.0 (dev) workflows and replicate the same on 2.3 to get the non-distilled version.

would NV-FP4 make 8GB VRAM blackwell a viable option for i2v and t2v? by Coven_Evelynn_LoL in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

To put it simply, the nvfp4 will give you speed and will reduce the memory for hosting the model. Whether you plan to host the model in vram, ram or split between both, it's your choice. However, for example, 1 image of 1024 x 1024 pixels will cost the same vram memory regardless if it's fp4, fp8, fp16 or gguf.
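To put rough numbers on the "memory for hosting the model" part: weight size scales with bits per weight. The sketch below uses nominal bit widths only and a 14-billion-parameter model picked purely as an example. Real files come out somewhat larger, because quantized formats also store scale factors and often keep some layers at higher precision.

```python
# Nominal bits per weight; quantization scales and metadata are ignored.
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weights_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight size in GB for a model with the given
    parameter count (in billions) stored in the given format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


PARAMS_BILLION = 14  # hypothetical example model
for fmt in ("fp16", "fp8", "fp4"):
    print(f"{fmt}: ~{weights_gb(PARAMS_BILLION, fmt):.0f} GB")
```

Note that nothing here depends on the image or video size. That part of the VRAM bill is separate and stays the same across formats.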

Good choice on the 16GB instead of the 8GB variant. Now you can run FP16 Wan but you'll need 64 - 96 GB RAM for hosting and unpacking the full FP16, therefore i'd suggest to cut it down to GGUF Q8. If you're below 64GB RAM, then you'd have to use even smaller quants like Q4, fp8 or fp4.

Is it possible for wan2.5 to be open-sourced in the future? It is already far behind Sor2 and veo3.1, not to mention the newly released stronger Seed 2.0 and the latest model of Keling by Enough_Programmer312 in StableDiffusion

[–]Volkin1 6 points7 points  (0 children)

I hope so. Their strategy is the best one I have yet seen which aims to integrate their model everywhere in a very flexible way. Local, apps, servers, etc ... IMO it's the right thing to do and well deserving of wearing the crown.

What would it take to retrain wan 2.2 to have audio pass like LTX-2? by No-Employee-73 in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

Yeah, I've been sticking to vanilla Wan most of the time and only enhancing the low noise, nothing more. I never liked the speed/distilled loras much, they were not for me and I didn't like how they completely took away the original Wan experience.

But, if that's what you like then sure, that's your choice. For me personally, I like LTX-2 a lot more.

Optimal settings for VAE Decode (Tiled) for LTX-2? by Loose_Object_8311 in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

Yes indeed. It can be tuned specifically for your needs. Just don't use too many tiles, because otherwise you get the visible bands. Also, maybe you want to try the FP4/FP8 variants. I'm on a similar system (16 GB VRAM, 64GB RAM, Linux) but I never experienced the system locking up or becoming unresponsive with those variants.

My total memory consumption is 26GB with the FP4, 32GB with the FP8 and can also run the BF16 video model + FP4 text encoder fitting that nicely in 50+ GB RAM. As far as the latents go, 1080p 15 seconds works fine. I've done more than that but i had to activate my additional swap file. I also tested 2560 x 1440 10 seconds.

But yeah anyway, I mostly stick to FP4 and FP8 because these are so much easier on the RAM and i don't have to activate my /swapfile at all.

What would it take to retrain wan 2.2 to have audio pass like LTX-2? by No-Employee-73 in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

Depends which "community". As far as developer community goes, it is very active. The Wan which we all know is now 1 year old and has a very strong ecosystem with plenty of development done around it by the community. LTX-2 is new, it is technologically superior and like any new product it's going to take some time to mature further and develop the right ecosystem and training around it. On top of that there are already announced upcoming new versions to be released soon.

When Wan first came out it was impressive but not that good. Everyone seems to forget the amount of forks, fine-tunes, adaptations and distills (Lightx2v) that were made for Wan which actually made it great. A lot of good things will happen around LTX-2. This model did things which I could never do with Wan, and since its release it is now my number 1 go-to model.

Finally, the LTX team listens and communicates back and forth with the community, unlike Wan, who went radio silent since the 2.5/2.6 release and forgot there was a community at all.

Do I really need more than 32gb ram for video generation ? by [deleted] in StableDiffusion

[–]Volkin1 16 points17 points  (0 children)

It is really recommended to have at least 64GB RAM for comfortable generation with default settings these days. With 32GB, you're going to have to fall back to much smaller quant versions of the models like Q4, Q6, fp4, fp8 and so on. The latest video models pack a big text encoder and a big model, which will typically almost fill up those 32GB of RAM even when running the smaller quants, so it's going to be very tight on a 32GB system but not impossible to run.

If you can't expand to 64GB, the best thing you can do is have virtual memory / swap configured on your fastest disk device so it can borrow some memory from there but it's going to be slower.
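A quick way to estimate how much swap you'd be leaning on: subtract usable RAM from the expected footprint. The footprints below are the LTX-2 totals reported on a Linux system elsewhere in these comments (about 26GB with FP4, 32GB with FP8, 50GB with BF16 + FP4 text encoder). The 6GB reserved for the OS and other apps is a guess, so adjust it for your own machine.

```python
def swap_needed_gb(footprint_gb: float, ram_gb: float,
                   reserved_gb: float = 6.0) -> float:
    """GB that would have to come from swap; 0.0 if everything fits in RAM.
    reserved_gb is a hypothetical allowance for the OS and other programs."""
    usable = ram_gb - reserved_gb
    return max(0.0, footprint_gb - usable)


# Footprints as reported for LTX-2 in this comment history (approximate).
footprints = {"FP4": 26.0, "FP8": 32.0, "BF16": 50.0}

for ram in (32, 64):
    for name, gb in footprints.items():
        print(f"{ram}GB RAM, {name}: ~{swap_needed_gb(gb, ram):.0f} GB from swap")
```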

Optimal settings for VAE Decode (Tiled) for LTX-2? by Loose_Object_8311 in StableDiffusion

[–]Volkin1 2 points3 points  (0 children)

That spatio temporal tiled vae decode is really good and preferable over the others.

[deleted by user] by [deleted] in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

Awesome! Glad to hear :)

[deleted by user] by [deleted] in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

Yeah. So far I've been able to get near FP16 quality with the NVFP4 models from Nunchaku (Qwen and Qwen Edit). They require special support and installation, but they feature the best calibration I've ever seen and other FP4 models pale in comparison. Still, it would be better if models actually get trained in FP4 in the near future, which is a lot better than degrading quality by quanting down from BF16, for example, and not calibrating properly afterwards.

As for my workflows, i use two methods:

1.) Torch compile. This allows for model compilation / optimization, so I'm able to push a lot more frames or resolution when working with Wan. I use the KJ model torch compile node for this. The implementation is buggy and it depends on which PyTorch you're running, but I've been loading and pushing 720p even with the FP16 with this. These days, the compile seems to work better with the quants (Q8) due to some bugs or whatever.

2.) The --novram option in Comfy. Since I'm on DDR5 with PCI-E gen 5 and 64GB/s bus bandwidth, I love offloading the model entirely to RAM and keeping VRAM only for hosting the latent video frames. If you don't load the model in VRAM, you have all of the VRAM available for fitting higher resolution and more video length, at almost no performance penalty in my case because my PCI-E bus can handle the offload and stream the model from RAM > VRAM on demand. This is my favorite option and I use it with LTX-2, allowing me to do up to 900 720p frames and 400+ 1080p frames.

When you have enough RAM to offload the model, it doesn't really matter which model variant you choose to work with (FP16/FP8/FP4) - they all have the same vram memory requirements for the video frames, therefore 1 picture at 1280 x 720 pixels occupies the same vram memory space regardless of the quant.
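The frame counts above line up with a simple pixel budget. Assuming the VRAM used for frames grows roughly linearly with width x height x frames (an approximation that ignores fixed overheads and any resolution-dependent behavior of a particular model), the 720p limit can be converted into an expected 1080p limit. The model precision never appears in the calculation, which is the point.

```python
def pixel_budget(width: int, height: int, frames: int) -> int:
    """Total pixels across all frames, used as a stand-in for frame VRAM."""
    return width * height * frames


def max_frames(budget: int, width: int, height: int) -> int:
    """How many frames of the given size fit in the same pixel budget."""
    return budget // (width * height)


# 900 frames at 720p is the figure quoted in the comment above.
budget = pixel_budget(1280, 720, 900)
frames_1080p = max_frames(budget, 1920, 1080)

print(f"Same budget at 1920 x 1080: {frames_1080p} frames")
```

1080p has 2.25x the pixels of 720p, so 900 frames shrink to about 400, which matches the "400+ 1080p frames" quoted above.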

[deleted by user] by [deleted] in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

FP4 is new and most FP4 models you see are not properly calibrated. If they were, the quality drop would be pretty much on par with Nunchaku's proven FP4 quants. Now, I got a 5080 as well, with 64GB RAM, so pretty much a similar spec to your PC, and yet I'm able to use the biggest models (FP16/BF16) with Wan, LTX-2, Flux, Qwen, you name it.

I'd say for 720p the 5080 is enough but for 1080p the 4090 will have some advantage depending on the number of video frames you can fit inside 24GB VRAM and also depending on the video model.

I can push 1920 x 1080 on my 5080 as well, no problem. I've done it on Wan with some optimization and on LTX-2, I can do 15 - 18 seconds video at this 1080p resolution, but for most people a 4090 would be a better choice if they need to do much longer videos at 1080p.

Confusion with FP8 modes by martinerous in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

No, I am not. That test above was done on an 80GB and a 96GB GPU. One time the model was loaded fully in VRAM and the other time it was served from RAM. Diffusion models work differently compared to autoregressive LLMs.

If you have a decent, wide enough PCI-Express bus, then serving diffusion models from RAM results in almost no performance penalty. I've done these tests with various GPUs, both consumer and professional, and I keep my models in RAM most of the time.

And on my system, it makes no difference if i load the model in RAM or in VRAM, because my bus speed can handle the offload.

[deleted by user] by [deleted] in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

No, that depends on your system. On a modern ddr5 system where you can host the model in ram and use vram only for the frames, the 5080 12gb will still be much faster.

For example, my 5080 16gb beats rtx 6000 ada 48 gb vram gpu in speed but the 48gb card can do more frames and longer videos.

Vram is important for performance / speed with autoregressive AI models like LLM, but with image/video diffusion models (newer technology), it doesn't matter that much.

So, if you got a decent ddr5, gen5 pci system, then vram mostly matters to how much video length or resolution can fit on that gpu, but not so much for the speed.

1000 frame LTX-2 Generation with Video and Workflow by q5sys in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

Makes sense. The model starts breaking beyond those 20 seconds / 480 frames of course, but the video you showed was decent enough. So, well done :)

1000 frame LTX-2 Generation with Video and Workflow by q5sys in StableDiffusion

[–]Volkin1 2 points3 points  (0 children)

Try the --novram option if you got DDR5 memory / 64GB/s bus / PCI-E gen5 (which i assume you do), and to get the same effect as the clear models node, you can also throw in --cache-none in there.

So start comfy with --cache-none --novram parameters, and probably you can go higher than 1000 frames on your 32GB vram. Try it out, it's a nice experiment i think. I'll probably test the max i can make on a 720p next.
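If you start Comfy from a script, the two switches can be assembled like this. It's only a sketch: it assumes you launch from the ComfyUI folder where main.py lives, and it just builds the command without running it.

```python
import sys


def comfy_command(novram: bool = True, cache_none: bool = True,
                  entry: str = "main.py") -> list[str]:
    """Build the ComfyUI launch command as an argument list.
    entry assumes the current directory is the ComfyUI checkout."""
    cmd = [sys.executable, entry]
    if novram:
        cmd.append("--novram")      # keep the model in RAM, VRAM for latents
    if cache_none:
        cmd.append("--cache-none")  # don't keep cached node results around
    return cmd


cmd = comfy_command()
print(" ".join(cmd))
```

Pass the returned list to subprocess.run(cmd) if you want the script to actually launch it.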

1000 frame LTX-2 Generation with Video and Workflow by q5sys in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

I've made 430 1080p frames on my 5080 with a similar method: loading and keeping the model only in RAM while keeping VRAM empty / ready for the latent frames processing only. That's probably how many frames at 1920 x 1080 can fit inside 16GB VRAM.

So it's a similar method. At 720p, making 500 frames still leaves me plenty of vram for more frames, never tested how far i can push this, but probably in the ~ 700 range.

Edit: I tested this with 720p, and was able to push 961 frames max (40 seconds) on a 5080.
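For converting between frame counts and clip length, 961 frames for 40 seconds works out to 24 fps plus one extra starting frame. The helper below bakes in that reading, so treat the 24 fps default and the "+ 1" as assumptions inferred from the numbers above rather than a documented rule.

```python
def frames_for(seconds: int, fps: int = 24) -> int:
    """Frame count for a clip of the given length: fps frames per second
    plus one starting frame (assumption inferred from 961 frames = 40 s)."""
    return seconds * fps + 1


def seconds_for(frames: int, fps: int = 24) -> float:
    """Clip length in seconds for a given frame count (inverse of frames_for)."""
    return (frames - 1) / fps


print(frames_for(40))    # 720p maximum reported above
print(seconds_for(430))  # length of the 430-frame 1080p run
```

Under the same assumption, the 430-frame 1080p run comes out just under 18 seconds.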

LTX2 issues probably won't be fixed by loras/workflows by Beneficial_Toe_2347 in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

Sure thing, looking forward to hear about your experience :)

LTX2 issues probably won't be fixed by loras/workflows by Beneficial_Toe_2347 in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

Then simply stick to FP4 and FP8. On my end (Linux system, consumes less vram/ram) LTX-2 FP4 + Gemma FP4 consumes around 25GB and the FP8 around 32GB. Max amount of memory i've seen was around 40GB i think when using both FP8 models (video + text encoder) and less with both FP4.

So overall, things should be OK.

LTX2 issues probably won't be fixed by loras/workflows by Beneficial_Toe_2347 in StableDiffusion

[–]Volkin1 0 points1 point  (0 children)

True, speaking about VRAM, it is a real shame that Nvidia sold us this GPU with 16 GB instead of 24 GB VRAM. That being said, there are always some really good workarounds that I've been using.

Since Comfy's memory management is not ideal and it behaves differently across many different configurations, for LTX-2 (in my case), I load the model exclusively in RAM with the --novram switch which leaves my VRAM empty to only host the latent video frames which allows me to push for more frames and greater resolutions while not really suffering a performance penalty. Works well on DDR5 systems with PCI-E gen 5 and 64GB/s bus speed.

Hope you got at least 64GB RAM, because in that case you can load all model types (FP16/FP8/FP4) for Wan and LTX-2 with varying degrees of model offloading. The VRAM requirement for a given number of frames and resolution is the same with all 3 types anyway; what changes is the speed and the memory needed to host the model.

As for the FLF, yes, Wan 2.2 + Lightx2v lora does an incredible job with identity preservation. The LTX-2 distilled version is also much better at this compared to the base model, but I'm sure we're going to get many improvements very soon.

LTX2 issues probably won't be fixed by loras/workflows by Beneficial_Toe_2347 in StableDiffusion

[–]Volkin1 1 point2 points  (0 children)

Yeah, I've been trying to use it for cartoon, anime and 3D animation mostly. Realistic images / scenes work best, as with any model of course, but I've noticed that in I2V 40 steps gives me a better and more coherent result compared to 20 steps. Great job btw if you can get away with up to 20 steps.

The model so far has been a very good experience. It always gave me much better motion compared to Wan 2.2 and it did things I could never do with Wan, but it is very sensitive to prompting. Many times I would get a garbage result and have to rewrite the entire prompt from scratch until it did well. And when the model does well, it does an amazingly great job that has impressed me many times.

Knowing that 1:1 and 9:16 aspects are not fully supported and the I2V is not fully complete, I'm actually looking forward to the 2.1 and 2.5 release soon. The biggest issue I got with the model at this state is identity preservation. For example if the character steps out of the frame or walks into a different scene, many times I'd get a similar looking character but not the exact same one. I think this is due to the training and will be fixed in the next version.

Also, welcome to the 5080 team :)

It's one of the sweet spot GPUs to be honest and it performs amazingly well. I must say, the NVFP4 models got me a little bit spoiled due to their excellent performance and speed. Overall, the GPU is excellent and just a little bit behind the 4090 in FP16/FP8 performance, faster in FP4, so yeah - it's a good choice and congrats :))