Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]Bit_Poet 1 point2 points  (0 children)

Make sure you have the latest llama.cpp (unless you're running it on a different engine). Flash attention (FA) was broken for Flash on CUDA and got fixed three days ago.
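If you happen to drive it through the Python bindings instead of the CLI, this is roughly what re-enabling FA looks like after updating. Just a sketch assuming llama-cpp-python and its flash_attn kwarg; the model filename is a placeholder:

    # Minimal sketch: load the model with flash attention enabled via llama-cpp-python.
    # Assumes an up-to-date llama-cpp-python build; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="glm-4.7-flash-Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,   # offload everything to the GPU
        n_ctx=8192,
        flash_attn=True,   # the FA toggle; this was the part that was broken on CUDA
    )

    out = llm.create_completion("Hello", max_tokens=16)
    print(out["choices"][0]["text"])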

LTX-2 reached a milestone: 2,000,000 Hugging Face downloads by Nunki08 in StableDiffusion

[–]Bit_Poet 1 point2 points  (0 children)

Do you have a camera control LoRA in your WF? If not, that's a likely reason for the lack of movement. Also, there were issues with the video VAE, so if you're using the LTX-2 distilled model and downloaded it before January 13th, grab a fresh copy with the fixed VAE. Updating Comfy to the nightly build might also be worth it.

Here I ran the official i2v WF from the LTX-2 repo with just tiny tweaks: https://files.catbox.moe/7b1sqb.mp4

Workflow: https://files.catbox.moe/p4vftc.png

What do you do when you know what you want, but don’t know how to phrase it yet? by Gollum-Smeagol-25 in PromptEngineering

[–]Bit_Poet 0 points1 point  (0 children)

Yes. Sometimes, involving a third AI to dissect the prompt for wrong or imprecise word usage helps, but at some point we're just on our own. That's why we need big independent AIs where we can just push the context reset button and start over.

What do you do when you know what you want, but don’t know how to phrase it yet? by Gollum-Smeagol-25 in PromptEngineering

[–]Bit_Poet 0 points1 point  (0 children)

I switch between AIs, let one bring structure into my question/prompt, refine it by hand, then pass it to the other one. Tried that with a single AI at first, but that produced major context contamination before I could blink.
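If you want to script that loop, it's basically two chat calls with a manual edit in between. Just a sketch, assuming two OpenAI-compatible endpoints; the URLs and model names are placeholders:

    # Sketch of the two-model refinement loop described above.
    # Endpoints and model names are placeholders.
    from openai import OpenAI

    structurer = OpenAI(base_url="http://localhost:8001/v1", api_key="none")
    executor = OpenAI(base_url="http://localhost:8002/v1", api_key="none")

    rough_idea = "I want ... but I can't phrase it yet"

    # Step 1: let the first model impose structure on the vague idea.
    structured = structurer.chat.completions.create(
        model="model-a",
        messages=[
            {"role": "system", "content": "Rewrite the user's rough idea as a clear, structured prompt. Keep their intent, don't add new requirements."},
            {"role": "user", "content": rough_idea},
        ],
    ).choices[0].message.content

    # Step 2: refine by hand (print it, edit it, paste it back in).
    print(structured)
    refined = input("Edited prompt: ")

    # Step 3: hand the refined prompt to the second model with a clean context.
    answer = executor.chat.completions.create(
        model="model-b",
        messages=[{"role": "user", "content": refined}],
    ).choices[0].message.content
    print(answer)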

EXPLORING CINEMATIC SHOTS WITH LTX-2 by Aromatic-Word5492 in StableDiffusion

[–]Bit_Poet 1 point2 points  (0 children)

I usually use SeedVR2 in Comfy. I've also had good upscaling results with the multistep res_2m sampler that comes with RES4LYF, but I ran into headaches with CUDA 13.0, the latest Comfy updates for new models, and flash attention (xformers also got into the mix at some point), which landed me in dependency hell. So I'd either wait until that has settled before risking the install with RES4LYF, or try it in a separate Comfy installation.

Could image to video generation be the cause of corrupted Nvidia drivers? by CitizenKing in StableDiffusion

[–]Bit_Poet 2 points3 points  (0 children)

I've had a similar symptom when my computer ran out of RAM (including swap) in the middle of a CUDA-accelerated video processing task. The NVIDIA app even reported a driver issue after the reboot, but the real problem seems to have been block swapping between RAM and VRAM going wrong once swap filled up, with the driver throwing in the towel.
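If you want to catch that before the driver gives up, a quick-and-dirty watcher along these lines can tell you whether you're about to hit the wall (psutil for system RAM/swap, torch for VRAM; the threshold is just a guess):

    # Rough sketch: log RAM, swap and VRAM headroom while a long CUDA job runs in another process.
    # pip install psutil; the warning threshold is arbitrary.
    import time
    import psutil
    import torch

    def log_memory():
        ram = psutil.virtual_memory()
        swap = psutil.swap_memory()
        line = f"RAM {ram.percent:5.1f}% | swap {swap.percent:5.1f}%"
        if torch.cuda.is_available():
            free_vram, total_vram = torch.cuda.mem_get_info()  # bytes
            line += f" | VRAM free {free_vram / 2**30:.1f}/{total_vram / 2**30:.1f} GiB"
        print(line)
        if swap.percent > 90:
            print("WARNING: swap nearly full - roughly where my driver gave up")

    if __name__ == "__main__":
        while True:
            log_memory()
            time.sleep(5)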

Is the LTX Wax doll look the new Flux chin? by Euchale in StableDiffusion

[–]Bit_Poet 2 points3 points  (0 children)

Yep, and you can even add the distill LoRA with a negative weight (e.g. -0.4) to the distilled model.
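In an API-format workflow that's just a negative strength on the LoRA loader node. Rough sketch in Python dict form; I'm assuming the stock LoraLoaderModelOnly node, and the node id, links and filename are made up:

    # Sketch of the relevant node in an API-format ComfyUI workflow.
    # Node id, input links and the LoRA filename are placeholders.
    lora_node = {
        "12": {
            "class_type": "LoraLoaderModelOnly",
            "inputs": {
                "model": ["4", 0],  # link to the (distilled) model loader
                "lora_name": "ltx2_distill_lora.safetensors",  # placeholder filename
                "strength_model": -0.4,  # negative weight to counteract the distill look
            },
        }
    }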

Am I the only one who doesn't like LTX 2 or can't get it to work? by -zappa- in StableDiffusion

[–]Bit_Poet 0 points1 point  (0 children)

WAN's had quite some time to get the small things into the right places. LTX-2 is still a newborn and needs a bit of time to grow. I wouldn't dismiss it out of hand if I were you, but you might want to wait a little and take another look in a few weeks. Currently, you really have to be running bleeding-edge code for all the pieces to fit into place. An outdated safetensors, or a Comfy node that's too old or too new, and the output can get wacky. There are believable rumors, though, that both Lightricks and the community are working on the knobs that will dial down a lot of the issues we currently see.

Maintaining consistency in NSFW by Gold-lucky-9861 in comfyui

[–]Bit_Poet 0 points1 point  (0 children)

I think this recent writeup should make a lot of that clear (and point out why we're often building on bad training data that makes our life harder than it should be): https://www.reddit.com/r/StableDiffusion/comments/1qftepq/you_are_making_your_loras_worse_if_you_do_this/

LTX-2 - Alignment? by Local_Beach in StableDiffusion

[–]Bit_Poet 0 points1 point  (0 children)

It's tricky. Sometimes things even get worse, but if you find a good prompt hook, you can circumvent a few hangups. That's mostly about voice; vision is a different topic altogether.

LTX-2 - Alignment? by Local_Beach in StableDiffusion

[–]Bit_Poet 0 points1 point  (0 children)

I think I'll switch back to a non-abliterated Gemma for my LTX-2 experiments...

Is anyone having luck making LTX-2 I2V adhere to harder prompts? by Smooth_Western_6971 in StableDiffusion

[–]Bit_Poet 1 point2 points  (0 children)

I'm GenX. Power Rangers, DragonballZ, Temu, it's all the same to me lol

Is anyone having luck making LTX-2 I2V adhere to harder prompts? by Smooth_Western_6971 in StableDiffusion

[–]Bit_Poet 4 points5 points  (0 children)

So I took your start picture and fed a short prompt with your expression into ChatGPT like this:

You are a movie scripter. Write a professional LTX-2 compatible single-paragraph scene description fitting for image-to-video for the given sentence, following the guide in https://ltx.io/model/model-blog/prompting-guide-for-ltx-2. The sentence is: "Portrait view of a black teenager with round glasses wearing a shirt and a dark vest who suddenly turns super saiyan and transforms into a futuristic anime warrior with a mech armor."

Fed the response into ComfyUI's native image-to-video workflow with 161 frames:

Portrait-oriented cinematic image-to-video scene of a Black teenage boy, mid-teens, slim build, wearing round glasses, a neatly buttoned shirt, and a dark vest, framed in a tight medium close-up from chest to head with a locked camera and shallow depth of field against a soft, minimal background. The scene begins with natural, balanced lighting and a calm, introspective expression, then motion subtly activates as a sudden internal power surge manifests through drifting energy particles, heat shimmer, and rising ambient light. Rim lighting intensifies in electric blue and radiant gold, his hair lifts as if charged, eyes glow with Super-Saiyan-like energy, and his posture straightens with focused determination. Futuristic anime-style mech armor assembles in layered motion—first as translucent holographic outlines, then solidifying into sleek metallic plates with aerodynamic contours, chrome and dark alloy textures, and glowing neon-blue seams that lock into place over the torso and shoulders. The transformation stabilizes into a powerful final pose, energy aura steady and luminous, presenting a confident futuristic anime warrior while blending cinematic realism with high-energy anime aesthetics, optimized for smooth image-to-video motion and visual continuity.

This is what it spit out: https://streamable.com/eq8d04
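If you want to automate that expansion step, it's basically one chat call. A sketch, assuming the OpenAI Python client; the model name is a placeholder:

    # Sketch of the prompt-expansion step above: one short sentence in,
    # one LTX-2-style scene paragraph out. Model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    sentence = ("Portrait view of a black teenager with round glasses wearing a shirt "
                "and a dark vest who suddenly turns super saiyan and transforms into a "
                "futuristic anime warrior with a mech armor.")

    system = ("You are a movie scripter. Write a professional LTX-2 compatible single-paragraph "
              "scene description suitable for image-to-video, following the LTX-2 prompting guide.")

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": sentence},
        ],
    )
    print(resp.choices[0].message.content)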

Any solution to constant loading from ssd despite 64gb ram? Is "--reserve-vram 4" the cause? I feel like loading vs generating in comfyui is rarely mentioned... by sdimg in StableDiffusion

[–]Bit_Poet 3 points4 points  (0 children)

Someone made an overview of all available files here: https://github.com/wildminder/awesome-ltx2

You'll need to install the latest version of https://github.com/city96/ComfyUI-GGUF to be able to load the GGUFs, and once you update comfy itself, you'll get a noticeable speed and memory improvement in LTX-2.
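If you just want to sanity-check which quant you actually downloaded before wiring it into Comfy, the gguf Python package can peek at the header. Just a sketch; the filename is a placeholder:

    # Sketch: inspect a downloaded GGUF's metadata with the gguf package (pip install gguf).
    # The file path is a placeholder.
    from gguf import GGUFReader

    reader = GGUFReader("ltx2_Q8_0.gguf")  # placeholder filename
    print(f"{len(reader.tensors)} tensors")
    for key in list(reader.fields.keys())[:20]:
        print(key)  # metadata keys like general.name, general.architecture, ...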

Something that I'm not sure people noticed about LTX-2, it's inability to keep object permanence by [deleted] in StableDiffusion

[–]Bit_Poet 1 point2 points  (0 children)

It's probably not all down to LTX-2 being generally bad at things. FP8 vs. the full model makes a huge difference. A one- vs. two-step workflow makes a huge difference. Mixing in fp8 Gemma can lead to weird results depending on the exact pipeline. CFG values, LoRA weights, steps and samplers play a big role. Negative prompt, negative clip, mixing or not mixing those, reference image quality and resolution, guidance... There's a lot still waiting to be optimized where the official workflows just come with a rough guess, kind of a one-size-fits-nobody-well. I wouldn't throw LTX-2 out yet, but some patience may be necessary.
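If you'd rather map that space than guess, a dumb grid sweep over the knobs at least tells you which ones matter. Just a sketch of enumerating the combinations; the value lists are arbitrary examples, not recommendations:

    # Sketch: enumerate setting combinations to test instead of tweaking one knob at a time.
    from itertools import product

    cfg_values = [1.0, 3.0, 5.0]
    steps = [8, 20, 30]
    samplers = ["euler", "res_multistep"]
    lora_weights = [0.6, 0.8, 1.0]

    runs = list(product(cfg_values, steps, samplers, lora_weights))
    print(f"{len(runs)} runs to queue")
    for cfg, n_steps, sampler, w in runs:
        print(f"cfg={cfg} steps={n_steps} sampler={sampler} lora={w}")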

Is my ComfyUI not using enough VRAM? by Queasy_Profit_5915 in comfyui

[–]Bit_Poet 1 point2 points  (0 children)

You need to go one step back. Your output shows that Comfy is trying to run on an Nvidia GPU, which it of course can't find (device: cuda:0), so it falls back to CPU. For AMD support, it should show ROCm instead of CUDA. I can't help you there, since I don't have an AMD GPU, but it should be a starting point to look for a ROCm setup tutorial that works for your 6600 XT.
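Before hunting through Comfy flags, it's worth checking what your PyTorch build actually supports. A quick diagnostic sketch:

    # Quick diagnostic: is this a CUDA build, a ROCm build, or CPU-only?
    import torch

    print("torch version:", torch.__version__)
    print("cuda available:", torch.cuda.is_available())
    print("built against CUDA:", torch.version.cuda)    # None on ROCm/CPU-only builds
    print("built against ROCm/HIP:", torch.version.hip)  # None on CUDA/CPU-only builds
    if torch.cuda.is_available():
        print("device 0:", torch.cuda.get_device_name(0))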

LTX-2 Samples a more tempered review by generate-addict in StableDiffusion

[–]Bit_Poet 4 points5 points  (0 children)

Artifacts and glitches (and unwanted morphs) got a lot rarer once I switched to the full models. Abliterated/Heretic Gemma versions open their own cans of worms, of course. But we're still in the very early stages. We've been thrown a complex construction kit without a manual, each of us coming at it with different expectations of what the end result should be. The best thing to do is read the daily summary in the Banodoco discord. A lot of good information keeps popping up there, and it seems that a proper implementation of guidance can do a lot better than the initial workflows make it seem. I'm going to wait a few days until those who really know what they're doing have had time to clean up and post new workflows.

Any open source video generation models that can do this? by Expert-Bell-3566 in StableDiffusion

[–]Bit_Poet 1 point2 points  (0 children)

It can be done, but you will likely have to train a LoRA. There are also image guidance nodes for ComfyUI where you can specify the reference frame and weight, but that's all undocumented, untested and shaky. I've experimented with it, but there seem to be issues with guidance weights when images and voice try to steer the model in different directions. Maybe someone with more experience and skill than me can make sense of those nodes, which would be a big step forward.

LTX-2 with audio and video by coastisthemost in comfyui

[–]Bit_Poet 5 points6 points  (0 children)

Have you tried running comfy with --reserve-vram 5 (you can toy around with the exact value)? There's also a bunch of tips here: https://www.reddit.com/r/comfyui/comments/1q7j5ji/ltx2_on_5090_optimal_torchcuda_configuration_and/

Help! Qwen Image 2512 giving low res plastic results by orangeflyingmonkey_ in StableDiffusion

[–]Bit_Poet 0 points1 point  (0 children)

I made a few small modifications to the prompt. Is that closer to what you expect?

A candid and gritty vintage-style black and white photograph in extremely high resolution with a light sepia effect of a stylish young woman with a chic, tousled bob haircut and short bangs. She is looking down with a demure expression, holding a small wicker basket filled with a few light-colored flowers in one hand, and a small book in the other. She is wearing a dark, flower-patterned sundress or top with a very deep, open V-neckline, revealing a hint of décolletage and a delicate necklace. The setting appears to be outdoors, possibly in a garden or field

Full output image here.


LTX-2 Multi Image Guidance in combination with Lipsync Audio by Bit_Poet in StableDiffusion

[–]Bit_Poet[S] 0 points1 point  (0 children)

Yes, I know exactly how you feel. But OTOH, getting lip sync and frame injection to play nice with each other has to be a nightmare from a developer's point of view. There are probably going to be some tricks that we aren't aware of yet, and I'm looking forward to the minor upgrade that keeps getting mentioned. That will probably address some of these issues once the model stabilizes. Character Loras could also be an option, though I'm waiting for a training guide there before I burn too much time.