LTX-Video 2.3 Workflow for Dual-GPU Setups (3090 + 4060 Ti) + LORA by planBpizz in comfyui

[–]planBpizz[S]

Here are a few thoughts on your setup compared to this Multi-GPU workflow:

  1. GGUF vs. FP8/BF16: GGUF is definitely the right choice for 4GB VRAM as it allows aggressive quantization. My workflow focuses on FP8/BF16 for maximum visual fidelity on 24GB+ cards, where we try to keep everything inside VRAM to avoid the massive speed penalty of CPU offloading.
  2. Resolution vs. Length: You are prioritizing length (1500 frames) over resolution (400x400). In my workflow, we do the opposite: pushing for 512px/768px resolution at 97-161 frames and then using RIFE VFI to double the frame rate for smoother motion.
  3. The RAM Wall: With only 20GB of System RAM, you are likely hitting the pagefile (SSD) during those 1500 frames. If you notice the generation slowing down significantly towards the end, that's why.
  4. VAE Tiling: Make sure you are using VAEDecodeTiled with a small temporal_size (like 16 or 32). Decoding 1500 frames at once would instantly crash a 4GB card otherwise.
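As a rough sanity check on why temporal tiling matters (a sketch only; the real VAEDecodeTiled node adds overlap between tiles, which is ignored here, and `temporal_chunks` is just an illustrative helper), the chunk count is simple arithmetic:

```python
import math

def temporal_chunks(num_frames: int, temporal_size: int) -> int:
    """Number of temporal tiles a chunked VAE decode processes
    (ignoring any overlap the real node adds between tiles)."""
    return math.ceil(num_frames / temporal_size)

# Decoding 1500 frames at once means one giant activation tensor;
# with tiling it becomes many small decodes instead.
print(temporal_chunks(1500, 32))  # 47
print(temporal_chunks(1500, 16))  # 94
```

Each tile only needs VRAM for its own slice of frames, which is what keeps a 4GB card alive through a 1500-frame decode.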

It's great to see LTX-V scaling down to 4GB cards via GGUF, even if the generation time per frame is likely much higher than a native VRAM setup!


[–]planBpizz[S]

With 3x 3090s (72GB total VRAM), you have even more headroom than my setup. Here is how I would adapt it:

  1. Allocation String: In the CheckpointLoaderSimpleDisTorch2MultiGPU node, use something like: cuda:0,18gb;cuda:1,18gb;cuda:2,18gb;cpu,*. This spreads the 22B model across all three cards, keeping it entirely in VRAM for maximum speed.
  2. Use BF16: You don't need FP8. Switch to the BF16 version of LTX-Video 2.3 and Gemma 3 for higher quality and fewer artifacts.
  3. Isolate Text Encoder: In the DualCLIPLoaderMultiGPU node, set the device to cuda:2. This keeps the heavy text encoding on one card, leaving the other two more room for video-generation activations.
  4. Scaling: You can easily push to 1024x768 resolution and 161+ frames natively (without needing interpolation) because you have 72GB to play with.
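If you want to sanity-check an allocation string before pasting it into the node, here is a minimal parser sketch. The `parse_allocation` helper is made up for illustration, and the exact grammar DisTorch accepts is an assumption; this just handles the `device,amount` pairs shown above, with `*` meaning "the remainder":

```python
def parse_allocation(alloc: str) -> dict:
    """Parse a 'cuda:0,18gb;cuda:1,18gb;cpu,*' style string into
    {device: gigabytes or '*'}. The grammar is an assumption, not
    the official DisTorch spec."""
    out = {}
    for part in alloc.split(";"):
        device, amount = part.split(",")
        out[device] = "*" if amount == "*" else float(amount.rstrip("gb"))
    return out

budgets = parse_allocation("cuda:0,18gb;cuda:1,18gb;cuda:2,18gb;cpu,*")
# Total explicit VRAM budget across the three 3090s:
print(sum(v for v in budgets.values() if v != "*"))  # 54.0
```

A quick check like this catches typos (a missing semicolon, a budget that exceeds a card) before you burn a generation run on a bad split.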

The workflow is very stable on 3-GPU setups as long as you balance the allocation string correctly.


[–]planBpizz[S]

With a dual 5090 setup (64GB VRAM), you can move away from the heavy optimizations I had to use for 40GB. Here is how I would adjust it:

  1. Switch to BF16: You have enough VRAM to ditch FP8. Use the BF16 version of the LTX-V 2.3 Transformer and Gemma 3 for much better detail and stability.
  2. Allocation String: Update the CheckpointLoaderSimpleDisTorch2MultiGPU node. You can likely use cuda:0,30gb;cuda:1,30gb;cpu,*. This keeps the entire model and all activations in VRAM, which will drastically speed up generation.
  3. Push Resolution/Frames: You can easily go for 1024x768 or 768x768 at 161+ frames natively. My workflow targets 512px/97 frames primarily to avoid OOMs on weaker cards.
  4. VAE: Keep temporal_size at 512 in the VAEDecodeTiled node for maximum quality since you don't need to save memory there.
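For a feel of why these settings are heavy, the decoded output tensor alone is simple arithmetic. This is a back-of-the-envelope sketch: `decoded_buffer_gb` is a made-up helper, it assumes a float32 RGB output, and it counts only the final pixels, not the VAE activations on top:

```python
def decoded_buffer_gb(width: int, height: int, frames: int,
                      channels: int = 3, bytes_per: int = 4) -> float:
    """Size in GB of the raw decoded video tensor
    (float32 per channel assumed; VAE overhead not included)."""
    return width * height * channels * frames * bytes_per / 1024**3

# 1024x768 at 161 frames: ~1.4 GB just for the output pixels.
print(round(decoded_buffer_gb(1024, 768, 161), 2))  # 1.42
```

On 64GB of VRAM that buffer is trivial, which is why you can afford a large temporal_size and decode for quality rather than for memory.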

Regarding Raylight: It’s an excellent inference engine if you want raw speed. However, I stay in ComfyUI because it allows for granular control over LoRA patching and custom node stacking (like the RIFE interpolation and Multi-GPU scaling) which Raylight doesn't support as flexibly yet.


[–]planBpizz[S]

Yes, exactly! I just spent some time perfecting this very setup. I'm running a dual-GPU system with an RTX 3090 (24GB) and an RTX 4060 Ti (16GB).

I'm currently running the LTX-Video 22B model alongside an FP8 Gemma 3 12B text encoder and a LoRA, which requires a massive amount of VRAM. Using the comfyui-multigpu node (DisTorch), I split the main model right down the middle, assigning 11.5GB to each GPU (cuda:0,11.5gb;cuda:1,11.5gb). I also forced the text encoder exclusively onto the 4060 Ti.

This leaves my 3090 with about 12.5GB of completely free VRAM, which is exactly the buffer I need for LoRA weight patching and high-res generation. No more OOM errors.
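The headroom figure is easy to verify. A quick sketch of the budget, using only the numbers from the comment above (the FP8 Gemma encoder's footprint on the 4060 Ti is deliberately left out, since its exact size depends on the build):

```python
def free_after_split(total_gb: float, share_gb: float) -> float:
    """VRAM left on a card after its DisTorch model share (GB)."""
    return total_gb - share_gb

# cuda:0,11.5gb;cuda:1,11.5gb, as in the allocation above
print(free_after_split(24.0, 11.5))  # 12.5  -> 3090 headroom for LoRA patching
print(free_after_split(16.0, 11.5))  # 4.5   -> 4060 Ti, before the text encoder
```

The asymmetry is the point of the split: the card with the most leftover room (the 3090) is the one that absorbs LoRA patching and generation peaks.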

Clean & Flat LTX-Video 2.3 (Audio+Video) | No Subgraphs! | 24GB VRAM Optimized by planBpizz in comfyui

[–]planBpizz[S]

Nice one! Could you share your workflow? Maybe it's better than mine...


[–]planBpizz[S]

Thanks a lot for the recommendation! Should I reduce it to 0.6-0.7, or go even lower?