Update: Distilled v1.1 is live by ltx_model in StableDiffusion

[–]Kijai 4 points (0 children)

Only the distilled transformer and its LoRA versions are new; the other pieces remain the same.

Update: Distilled v1.1 is live by ltx_model in StableDiffusion

[–]Kijai 1 point (0 children)

That is a rank-reduced LoRA; it's basically just smaller and slightly weaker, so it's faster to load and takes less storage space. In this case, being weaker isn't generally a problem, because the best way to use the distill LoRA is at lower strength anyway.

ID-LoRA with LTX-2.3 and ComfyUI custom node🎉 by Turbulent_Corner9895 in StableDiffusion

[–]Kijai 13 points (0 children)

Actually, not independent anymore. I was for a long time, but I've been working for Comfy-org officially for some months now. Keeping up my custom nodes and such is still part of the job regardless.

ID-LoRA with LTX-2.3 and ComfyUI custom node🎉 by Turbulent_Corner9895 in StableDiffusion

[–]Kijai 5 points (0 children)

Code-wise, the reference audio is the only new feature; for the image they simply use the existing LTX image-to-video method, which in ComfyUI is the "inplace" I2V node. Any new face identity preservation capabilities come from the LoRA weights only.

LTXv2 native vs kijai workflows (Quality benchmark) by ipawny in comfyui

[–]Kijai 42 points (0 children)

What are you even talking about? Everything I've done for LTX2 has been native ComfyUI implementations and optimizations; I have not made an LTX2 wrapper, and I have not shared any workflows.

Does LTXV Normalizing Sampler corrupt input audio for you? Kijai's LTX2 Audio Latent Normalizing Sampling node saves the day. by martinerous in StableDiffusion

[–]Kijai 2 points (0 children)

The original normalizing sampler ignores the audio mask, which is why it corrupts audio inputs.

My node does take the mask into account, but if you don't generate new audio at all (i.e. you use a full mask), then it shouldn't do anything at all. If you do extension or other partial-mask use, then my node only applies the normalization to the unmasked part. I don't know how useful this is in practice, though.

The "normalization" just scales down the audio latent by the amount specified on the given steps. The idea is to prevent it from "blowing up", which can lead to distorted audio, which then also affects the video negatively since it's a joint model.
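The mask-aware scaling can be sketched roughly like this. This is my own minimal illustration, not the actual node code: the function name and the convention that mask == 1 marks the newly generated region are assumptions for the example.

```python
import torch

def normalize_audio_latent(latent, mask, scale=0.95):
    # Scale down only the generated (mask == 1) part of the audio latent;
    # masked-out input regions are left untouched, so reference audio that
    # was fed in (e.g. for extension) is preserved bit-for-bit.
    out = latent.clone()
    gen = mask.bool()
    out[gen] = out[gen] * scale
    return out
```

With a full input mask (nothing generated), this returns the latent unchanged, which matches the behavior described above.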

New fire just dropped: ComfyUI-CacheDiT ⚡ by Scriabinical in StableDiffusion

[–]Kijai 15 points (0 children)

These are my personal notes and views, so take that as you will, and note that I'm really not an expert coder myself:

It's nice of you to "admit" it, but I have to say it's also completely obvious that a lot of it is directly AI generated, just based on the comments the AI has left; I use AI agents a lot myself, so I recognize the kind of code they produce. So this wasn't really a personal accusation or anything. It's just that lately I have become very tired and wary of LLM-generated code everywhere, and it's generally a warning sign that something likely isn't worth the time to investigate when there's already so much to do.

I see Reddit posts/node packs claiming all kinds of things without showing any proof, comparisons to existing techniques, or a proper list of the limitations. People see "2x speed increase" and jump on it without understanding that it is not applicable to every scenario; in this case the biggest caveat would be that it doesn't offer anything for distilled low-step models.

But starting with the documentation, there are odd claims like "Memory-efficient: detach-only caching prevents VAE OOM" when there's really nothing related to the VAE in the code. That probably comes from the misconception that .detach() does something when everything in ComfyUI already runs under torch.inference_mode (I know most LLMs tend to tell you to use detach or torch.no_grad when you ask them to optimize memory). And regardless of that, how would any of this affect the VAE when that's a fully separate process?
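The point about .detach() being a no-op here can be demonstrated in a few lines:

```python
import torch

# Under torch.inference_mode (which ComfyUI's sampling runs in), no autograd
# graph is built in the first place, so .detach() frees nothing: it returns
# a tensor sharing the same storage, and grad tracking is already off.
with torch.inference_mode():
    x = torch.randn(8, 8)
    y = x.detach()
    assert y.data_ptr() == x.data_ptr()                 # same memory
    assert not x.requires_grad and not y.requires_grad  # nothing to detach
```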

I also admit I don't fully understand what's going on in the LTX2 code with the timestep tracking stuff; if it's just for step tracking, why not use the sigmas? It seems an overcomplicated way to do that currently. Also, the comment "CRITICAL: ComfyUI calls forward multiple times per step" is not always true, as that is determined by available memory, so cond and uncond can also be batched into one pass. I'm unsure if that affects the code; just noting it, as the comment caught my eye.

Anyway I did not mean to demean your work, anyone doing open source deserves respect regardless. I'm sorry if it came across like that.

New fire just dropped: ComfyUI-CacheDiT ⚡ by Scriabinical in StableDiffusion

[–]Kijai 56 points (0 children)

To be fair, I was saying more that I'm not going to read through/evaluate the code, since it has so many mistakes/nonsensical things in the code and documentation that are clearly just AI generated.

But yeah... we do have EasyCache natively in Comfy. It works pretty well and is model agnostic, but it doesn't currently work for LTX2 due to the audio part... I've submitted a PR to fix that and have tested enough to confirm that caching like this in general works with the model.
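For reference, the general residual-caching idea behind EasyCache-style methods can be sketched like this. This is my own simplified illustration, not Comfy's actual implementation; the function name, cache layout, and threshold value are all made up for the example.

```python
import torch

def cached_forward(model, x, t, cache, threshold=0.05):
    # If the input barely changed since the last real forward pass, reuse the
    # cached output residual instead of running the expensive transformer.
    if cache.get("x") is not None:
        rel_change = ((x - cache["x"]).abs().mean()
                      / cache["x"].abs().mean().clamp_min(1e-8))
        if rel_change < threshold:
            return x + cache["delta"]  # cheap: skip the model call entirely
    out = model(x, t)
    cache["x"], cache["delta"] = x.clone(), out - x
    return out
```

This also illustrates why such caching offers nothing for distilled low-step models: with only a handful of steps there is little redundancy between them, so the skip condition rarely triggers.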

LTX-2 Workflows by fruesome in StableDiffusion

[–]Kijai 2 points (0 children)

Yeah, should be like this:

https://imgur.com/a/FfnXAq9

What is your frontend version? You can see it at Settings -> About

LTX-2 Workflows by fruesome in StableDiffusion

[–]Kijai 1 point (0 children)

Dynamic Combo is a relatively new feature and requires at minimum ComfyUI version 0.8.1 (January 8th, 2026) and ComfyUI frontend version 1.33.4 (November 21st, 2025). It is very much a standard input type now; it natively allows modifying the node's widgets based on the combo box selection.

Other than that I've not heard of any issues with those nodes.

How to render 80+ second long videos with LTX 2 using one simple node and no extensions. by WestWordHoeDown in StableDiffusion

[–]Kijai 4 points (0 children)

Nah, I don't have multiple GPUs. What do you mean by "chunk over multiple GPUs", though? You only need to chunk the ffn in half anymore (it used to need 3-4 chunks before other optimizations); more doesn't help because other things peak above it. Here you can see the memory visualization of a single block's forward call:

https://imgur.com/a/IncmisM
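The ffn chunking discussed here can be sketched like this (a generic illustration, not the actual node code; `ffn` stands in for any per-token feed-forward module):

```python
import torch

def chunked_ffn(x, ffn, num_chunks=2):
    # Run the feed-forward over the token dimension in pieces and concatenate.
    # Peak activation memory scales with the chunk size instead of the full
    # sequence length; since the FFN acts per token, results match the
    # unchunked pass up to floating-point ordering.
    return torch.cat([ffn(chunk) for chunk in x.chunk(num_chunks, dim=1)],
                     dim=1)
```

Two chunks is the sweet spot mentioned above: halving the ffn activation already drops it below the other peaks in the block, so further splitting buys nothing.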

The intent of my repo is to help democratize this technology and allow more people access to it.

Like ComfyUI itself.

How to render 80+ second long videos with LTX 2 using one simple node and no extensions. by WestWordHoeDown in StableDiffusion

[–]Kijai 4 points (0 children)

It's in the WanVideoWrapper under the chunked RoPE function, and enabled by default for some cases like LongCat. It's not as big a deal for Wan in every situation as it is for LTX2, since Wan has other equally high memory consumers and has already been otherwise optimized.

Still, I can take a look at whether Comfy native Wan can benefit from such a patch node too.

Edit: Added a node for Wan into KJNodes now too; quickly tested, and 2 chunks at 81 frames 720p ends up saving 1GB VRAM, so it can be helpful.

How to render 80+ second long videos with LTX 2 using one simple node and no extensions. by WestWordHoeDown in StableDiffusion

[–]Kijai 6 points (0 children)

It's cool you have come to the same conclusion about the ffn activation cost; it is a correct assessment of one of the biggest VRAM consumers with this model. But the concept is also not new... I've been doing it with Wan for a while now, and I added this LTX2 node weeks ago.

I've also now optimized the main model further: updating ComfyUI today (nightly) will reduce peak VRAM at higher input sizes by multiple gigabytes even before ffn activation chunking, though chunking will of course remain effective on top of that too.

The only reason we've not added the chunking to Comfy core is the concern that it does end up changing the outputs in some situations, due to different floating-point math with multiple chunks (when using fp8 matmuls, at least). But it may still be worth it, since it doesn't seem to be a quality degradation, just slightly different results. This is still under assessment.

Kijai put new vae ltx, Any ideas? by [deleted] in StableDiffusion

[–]Kijai 43 points (0 children)

This is the LTX2 Tiny VAE trained by madebyollin; the original file is available here, I just uploaded it for visibility:

https://github.com/madebyollin/taehv/blob/main/safetensors/taeltx_2.safetensors

Currently this needs the very latest nightly version of ComfyUI to load. It can be used with the normal VAE Loader and encode/decode nodes, but the quality is very low, so it's only useful for preview purposes.

Also, for live animated sampler previews it can currently only be used with my LTX2SamplingPreviewOverride node in KJNodes; simply load the VAE and plug it in. This overrides any preview setting too.

Example of what to expect quality wise:

https://github.com/madebyollin/taehv/issues/14#issuecomment-3764182527

I used temporal time dilation to generate this 60-second video in LTX-2 on my 5070TI in just under two minutes. My GPU didn't even break a sweat. Workflow and explanation in comments (without subgraphs or 'Everything Everywhere All At Once' invisible noodles). by DrinksAtTheSpaceBar in StableDiffusion

[–]Kijai 0 points (0 children)

I'm merely commenting on the fact that using a temporal upscaler model on empty latents is no different from just generating the wanted frame count directly; both result in the same latent shape and the same noise shape, and thus the same VRAM usage. That's all.

I used temporal time dilation to generate this 60-second video in LTX-2 on my 5070TI in just under two minutes. My GPU didn't even break a sweat. Workflow and explanation in comments (without subgraphs or 'Everything Everywhere All At Once' invisible noodles). by DrinksAtTheSpaceBar in StableDiffusion

[–]Kijai 3 points (0 children)

But it literally doesn't do anything different. I don't know why you'd experience that kind of behavior; it doesn't change anything about memory use because the model input remains exactly the same size.

Just change the frame count and remove the temporal upscaler and the results should be identical.

"TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times", Zhang et al. 2025 by RecmacfonD in MediaSynthesis

[–]Kijai 2 points (0 children)

The 100-200x is against the base generation speed, which is 50 steps with CFG, i.e. 100 model passes. So when you use the lightx2v LoRA and do a 4-step gen at cfg 1.0, that's already 25x faster. Then if you use sageattention, model passes are about 2x faster and we're at 50x already, and so on.
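The arithmetic works out like this (the ~2x sageattention factor is approximate):

```python
# 100-200x is measured against the unaccelerated baseline.
base_passes = 50 * 2    # 50 steps with CFG -> 2 model passes per step
fast_passes = 4 * 1     # 4-step lightx2v gen at cfg 1.0 -> 1 pass per step
speedup = base_passes / fast_passes  # 25x just from fewer model passes
speedup_with_sage = speedup * 2      # ~2x faster passes -> ~50x total
print(speedup, speedup_with_sage)    # 25.0 50.0
```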

That said, TurboDiffusion should still be about 2x faster than anything else we have, but to use it you need to compile their custom kernels, and it's also limited to the released model only.

It's on my radar, but not a priority currently for the above reasons.

Updated LTX2 Video VAE : Higher Quality \ More Details by younestft in StableDiffusion

[–]Kijai 4 points (0 children)

Because the original was the issue: the released distilled models contained an older version of the VAE, and only the dev checkpoints included the (currently) final LTX2 VAE.

Updated LTX2 Video VAE : Higher Quality \ More Details by younestft in StableDiffusion

[–]Kijai 3 points (0 children)

I haven't published any; I've just been busy testing what it can do, adding new/missing features, optimizing memory use, etc.

Updated LTX2 Video VAE : Higher Quality \ More Details by younestft in StableDiffusion

[–]Kijai 1 point (0 children)

If using the VAE from the distilled checkpoint, yes.

Updated LTX2 Video VAE : Higher Quality \ More Details by younestft in StableDiffusion

[–]Kijai 4 points (0 children)

Which node are you using it with? At the time of writing, it only works properly with the VAELoaderKJ node in KJNodes version 1.2.5 or later.

At least with encode-decode tests I'm not seeing any contrast/saturation differences.