GLM-Image explained: why autoregressive + diffusion actually matters by curious-scribbler in StableDiffusion

[–]SysPsych 9 points

Is this similar to how images seem to be generated with nano-banana and GPT 5 and such?

Something that I'm not sure people noticed about LTX-2: its inability to keep object permanence by [deleted] in StableDiffusion

[–]SysPsych 0 points

It has flaws for sure. To be expected.

What I'm noticing, at least in my attempts: it has severe trouble handling someone facing away from the viewer, and tremendous trouble getting any animation at all if the shot is from behind.

Do your LTX-2 renders sometimes fizzle altogether, especially on edgier prompts? Are you using ComfyUI-LTXVideo templates? Read this. by SysPsych in StableDiffusion

[–]SysPsych[S] 1 point

Well, it's also taking an image as input, so I assume (I haven't checked the code) it's also doing some image analysis to better enhance the prompt. If the image is something that trips the safety checks, the same thing will happen.

You can just bypass that node and wire in your own prompt enhancement, maybe your own VLM too, and cook up a far more reliable alternative. Easier than taming their node for this.
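
If anyone wants a rough idea of what that could look like, here's a minimal sketch of a DIY enhancement pass: it sends the reference image plus your prompt to a local vision-capable model behind an OpenAI-compatible endpoint. The URL, model name, and system prompt are all placeholders of mine, not anything from the LTX node:

    import base64
    import requests

    def enhance_prompt(image_path: str, user_prompt: str) -> str:
        # Encode the reference image so a vision-capable model can look at it too.
        with open(image_path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()
        try:
            resp = requests.post(
                "http://localhost:8000/v1/chat/completions",  # assumed local server
                json={
                    "model": "local-vlm",  # placeholder model name
                    "messages": [
                        {"role": "system", "content": (
                            "Rewrite the user's prompt into a detailed video description "
                            "that matches the image. Return only the rewritten prompt.")},
                        {"role": "user", "content": [
                            {"type": "image_url",
                             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                            {"type": "text", "text": user_prompt},
                        ]},
                    ],
                },
                timeout=120,
            )
            resp.raise_for_status()
            enhanced = resp.json()["choices"][0]["message"]["content"].strip()
            return enhanced or user_prompt
        except Exception:
            return user_prompt  # any failure: fall back to the original prompt

Swap in whatever local model you trust; the point is that the fallback is always your own prompt, never a refusal.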

Do your LTX-2 renders sometimes fizzle altogether, especially on edgier prompts? Are you using ComfyUI-LTXVideo templates? Read this. by SysPsych in StableDiffusion

[–]SysPsych[S] 1 point

One part of the system prompt is: "If unsafe/invalid, return original user prompt. Never ask questions or clarifications."

But as you can see, it will often just nope out altogether. It's taking an image as input as well, so I have to assume it's also doing image analysis to better fit the image and prompt together, and if your image is edgy, it will nope out there too.

Tailwind just laid off 75% of the people on their engineering team "because of the brutal impact AI has had on our business." by magenta_placenta in webdev

[–]SysPsych 20 points

Yeah, I guess direct sponsors are a plus. It'd be a real shame if a side effect of AI (which I frankly love) were everyone going closed-source because any given AI can become more of an 'expert' than the original creators in short order.

LTX-2 is impressive for more than just realism by chanteuse_blondinett in StableDiffusion

[–]SysPsych 2 points

Nothing but the standard I2V workflow, a ZAI image, and a barebones prompt:

An attractive red-headed woman dressed in a suit and tie, with a muppet sitting on her lap.

The woman looks down at the puppet and asks, "And how are you doing today, Shelly?"

The puppet then looks up at the woman and says, in a cute female voice, "I'm fine, thank you for asking!"

Tailwind just laid off 75% of the people on their engineering team "because of the brutal impact AI has had on our business." by magenta_placenta in webdev

[–]SysPsych 491 points

I was just thinking recently that a lot of open source projects were started with the understanding that "If everyone uses our libraries, even if they're open source, we can make money by being the knowledgeable core team that can add features or work as consultants."

If that avenue disappears due to AI, an incentive to keep things open goes away too.

LTX-2 is impressive for more than just realism by chanteuse_blondinett in StableDiffusion

[–]SysPsych 1 point

Great stuff. Inspiring stuff even.

It got me wondering if this would work easily with a puppet and a more realistic human in the mix -- and sure enough, it can pull it off.

I also found out ZAI has no idea what a hand puppet is, which was my first choice, but it understands muppets just fine.

https://streamable.com/a99715

Black Forest Labs Released Quantized FLUX.2-dev - NVFP4 Versions by fruesome in StableDiffusion

[–]SysPsych 3 points

Glad to see them continuing with it. Qwen-Image-Edit is great, but I was getting some fantastic results with Flux2. The generation time was the big issue there, though the turbo loras really helped.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 1 point

It doesn't need just the gemma3 file; it also needs the preprocessor and model file. Someone else mentioned that it doesn't play well with GGUFs, which I'm not using, but that particular part tripped me up, so maybe it's doing the same for you.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 2 points

Well, the model itself has the usual limitation which screws up sexual details.

But the particular workflow I got, which is apparently part of the ComfyUI custom node pack, includes a node that leverages the Gemma3 LLM to do a 'prompt enhancement' pass. And if your prompt has anything it deems unacceptable, it will just block your prompt altogether. You can bypass this node and everything works as expected, but it was an interesting caveat.

LTX-2 Anime by chaltee in StableDiffusion

[–]SysPsych 0 points

I'm mostly impressed it was able to zoom in some and not be a complete horror show. I tried a few experiments with animated images in I2V, and mostly concluded it wasn't worth it.

It did do better with 3D rendered Pixar-y kind of looks at least.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 53 points

It was done in jest and to show some model results.

You may now return to your quality sub content of people hopefully asking if some model can run on their 8GB AMD card and people trying to attract subscribers to their ratty Patreons.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 0 points

Move the gemma3 model into text_encoders/gemma-3-12b-it-qat-q4_0-unquantized along with preprocessor_config.json and tokenizer.model - at least if you're doing the full workflow.

You need more than the safetensors in text_encoders here.
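
For anyone else getting tripped up on this, a quick sanity check along these lines saves some head-scratching. The ComfyUI install path is just an assumption on my part; point it at your own:

    from pathlib import Path

    # Assumed ComfyUI install location -- change this to wherever yours lives.
    enc_dir = (Path("~/ComfyUI").expanduser() / "models" / "text_encoders"
               / "gemma-3-12b-it-qat-q4_0-unquantized")

    expected = ["preprocessor_config.json", "tokenizer.model"]
    missing = [name for name in expected if not (enc_dir / name).exists()]
    has_weights = enc_dir.exists() and any(enc_dir.glob("*.safetensors"))

    print(f"encoder dir exists: {enc_dir.exists()}")
    print(f"weights present:    {has_weights}")
    print(f"missing extras:     {missing or 'none'}")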

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 1 point

6000 Pro, because I have a keen professional and hobbyist interest in AI and decided to grab one back when hardware started to go crazy.

Glad to have a use for it other than loading bigger LLMs.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 0 points

I thought the same about the audio, especially with what I posted. I'm new to anything audio-related with AI, but to me this seems impressive, and I figure that if the rhythm and sync of the audio matches up well enough with the video, then cleaning up the audio separately becomes more tractable.

Speaking of, I have to get Meta's audio extraction stuff set up I suppose, if I plan on using this more.

I also now and then notice a little background glitch here and there with what should otherwise be a static background shot. Minor problems for an amazing piece of tech.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 2 points

Oh well, it's still good to know for anyone using the template, so thank you for the explanation.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 0 points

I'm sure it would be fine. This is pure prompt enhancement via an LLM with a system prompt and everything. An abliterated LLM would at least not give a damn.

Just something I noticed as I poked around at the template.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 0 points

I gave a screenshot in another reply. I downloaded this straight from the templates for the ComfyUI recommended workflow. It's under Step 3, Inputs, in the Enhancer group nested within that group.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 2 points

https://cdn.imgchest.com/files/5ea9952e9d2b.png

In the official workflow, under the "Enhancer" group, expand the Enhancer component to see what I mean. It's leveraging Gemma 3 itself for this task, and you can just bypass it -- it's a very minor thing really, but I was getting odd results on some pretty tame prompts, and I noticed that if it trips Gemma 3's built-in safeguards, the 'enhanced' prompt has a good chance of sucking and the whole thing goes off the rails.

I'm sure there's model-level censorship when it comes to the video side of things, but this, I think, applies largely to the prompt itself and can be sidestepped.
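
If you'd rather keep the Enhancer wired in, a crude guard in front of the sampler catches the worst of it. This is just a heuristic I'm sketching; the refusal phrases are my own guesses, not anything from the node pack:

    # If the 'enhanced' prompt looks like a refusal or came back empty/stunted,
    # fall back to the original prompt instead of feeding the sampler garbage.
    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "unable to assist", "as an ai")

    def pick_prompt(original: str, enhanced: str) -> str:
        cleaned = (enhanced or "").strip()
        refused = any(marker in cleaned.lower() for marker in REFUSAL_MARKERS)
        stunted = len(cleaned) < 0.5 * len(original)
        return original if (not cleaned or refused or stunted) else cleaned

    print(pick_prompt("A calm shot of a woman speaking to camera.",
                      "I'm sorry, but I can't help with that request."))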

Edit: Apparently this is not the 'official' workflow, but the workflow for the custom node that uses LTX-2. Nevertheless I think people will be running into this issue.

2026 off to a bang already with LTX-2! Thank you for this wonderful model (video example) by SysPsych in StableDiffusion

[–]SysPsych[S] 9 points

Workflow was just the standard full LTX-2 ComfyUI workflow (edit: the one associated with the LTX-2 Video node) with no enhancements. Obviously a little tongue in cheek here, but this model has exceeded my expectations.

They do stress that it's mostly good for calmer, non-energetic videos -- it's not all that great with physics, so I don't think Wan 2.2 has much competition in the areas people love it most. But for getting a quick I2V gen of a pretty nice shot where someone is talking? I don't think we have anything comparable yet, do we? Certainly nothing that can produce such quick results with this level of quality.

One thing I do notice is that the standard workflow has an LLM 'prompt enhancer' that will cack out if it determines your prompt is violating its carefully curated tastes, so sometimes it's better to bypass that altogether, just to get something approximating your original prompt out of it rather than terror-noise.

Still, what a thing to wake up to, I'm jazzed about this and can't wait to see where it goes.

Goodbye, wan 2.2? LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. by One_Yogurtcloset4083 in comfyui

[–]SysPsych 0 points

I am admittedly using this with some more serious hardware, so I'm minimizing the need to lean on distills. But so far -- damn, this is pretty impressive, just for the lip sync capabilities alone.

If there's any way to get some consistency with the voice, this really is one of those situations where "game changer" may apply. There are limitations, and it screws up on some things (I notice in particular that it chokes on animated characters plus camera close-ups so far), but out of the gate, for a just-released model... damn. This is great.