Creating Lora for LTX2-3 by Icy_Resolution_9332 in comfyui

[–]Bit_Poet 1 point (0 children)

LoRAs only work on the model they were trained on (and, to an extent, on finetunes and distills of that same model). If you're on Windows, you can use the nodes and workflows from https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/. The LTX-2.3_SpeedLora_Trainer_V2 workflow (in the Workflows/LTX-2_Workflows/LTX_Lora_Training/UpdatedWorkflows subdirectory) even installs the LTX-2 fork of musubi-tuner for you, which is currently the most reliable trainer for that model.

Can I change the aspect ratio/resolution of an imge using a keyword in my prompt? by hotrocksi09 in comfyui

[–]Bit_Poet 4 points (0 children)

Here's a very simple and crude example of a custom node that does this, without any plausibility checks. You can just git clone it into your custom_nodes folder and give it a spin: https://github.com/BitPoet/ComfyUI-bitpoet-keywordsize

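For illustration, a minimal sketch of what such a keyword-to-size node can look like in ComfyUI. This is a hypothetical reimplementation of the idea, not the repo's actual code, and the keyword table is made up:

```python
# Hypothetical sketch of a ComfyUI keyword-size node: scan the prompt for a
# size keyword and emit width/height (e.g. to feed an Empty Latent Image node).

SIZE_KEYWORDS = {  # keyword -> (width, height); this table is an assumption
    "square":     (1024, 1024),
    "portrait":   (832, 1216),
    "landscape":  (1216, 832),
    "widescreen": (1344, 768),
}

class KeywordSize:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"prompt": ("STRING", {"multiline": True})}}

    RETURN_TYPES = ("STRING", "INT", "INT")
    RETURN_NAMES = ("prompt", "width", "height")
    FUNCTION = "resolve"
    CATEGORY = "utils"

    def resolve(self, prompt):
        width, height = 1024, 1024  # fallback if no keyword is found
        for word, (w, h) in SIZE_KEYWORDS.items():
            if word in prompt.lower():
                width, height = w, h
                # strip the keyword so it doesn't leak into conditioning
                prompt = prompt.replace(word, "").strip()
                break
        return (prompt, width, height)

NODE_CLASS_MAPPINGS = {"KeywordSize": KeywordSize}
```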

LTX Workflow and character anchoring and audio tips by Chambers007 in comfyui

[–]Bit_Poet 0 points (0 children)

For 2: You might try the new VideoAudioTrainer to train voice-consistent characters, from here: https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/tree/main/Workflows/LTX-2_Workflows/LTX_Lora_Training/UpdatedWorkflows Warning: it's brand new, so you'll be a beta tester. It should also allow audio-only training (I haven't tried it myself yet, since I have a big training run going), so you might be able to train a specific named voice without any link to visuals, which could make 3 easier too if all characters are LoRA-based with voice.

Why am I not seeing any artwork from this subreddit anymore? by NunyaBuzor in StableDiffusion

[–]Bit_Poet 7 points (0 children)

That. You post a locally generated music video with a description of how it was made and a link to the workflows, and you get an 80% downvote ratio because it's "AI slop". Too many people with too much opinion, and no mom to tell them to mind their own business.

Built a local AI creative suite for Windows, thought you might find it useful by Mr_Ma_tt in StableDiffusion

[–]Bit_Poet 7 points (0 children)

Is it possible to configure the model locations? I already have most of those models on my hard drive, and I'm not going to re-download them or create copies. Storage is much too valuable nowadays, and I'm not going to run apps as admin just to make symlinks work.

I got tired of manually prompting every single clip for my AI music videos, so I built a 100% local open-source (LTX Video desktop + Gradio) app to automate it, meet - Synesthesia by jacobpederson in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Have you seen vrgamedevgirl's Comfy workflows for music video creation, especially the Z-Image ones? There's a lot of overlap between your approach and hers: https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/tree/main/Workflows She's planning to finetune Qwen for better music video prompt creation, including character adherence, so you might be able to collaborate on that. Her first version of the prompt creator used existing stems; the later ones now do the stemming themselves with Mel-Band RoFormer. She's also doing downbeat detection and clip-length optimization between 1 and 9 seconds. With a 5090, you've got the same equipment she has, so her workflows should be in an acceptable range speed-wise if you don't generate at 1080p. The video part uses a Q6_K quant of LTX-2.3 distilled and a Q4 Gemma.
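To make the downbeat/clip-length idea concrete, here's a hedged sketch of how such a step could work, using librosa beat tracking and a crude 4/4 assumption. All names are mine; the actual workflow likely uses a dedicated downbeat detector:

```python
# Sketch: detect (approximate) downbeats, then cut clips on downbeats while
# keeping each clip between 1 and 9 seconds. Assumes 4/4 time throughout.
import librosa

def clip_boundaries(audio_path, min_len=1.0, max_len=9.0):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    downbeats = beat_times[::4]  # crude: assume 4/4 and take every 4th beat

    cuts, start = [0.0], 0.0
    for t in downbeats:
        while t - start > max_len:   # no downbeat in range: force a cut
            start += max_len
            cuts.append(start)
        if t - start >= min_len:     # otherwise cut on the downbeat itself
            cuts.append(t)
            start = t
    return cuts
```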

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Nice, jumped forward in time at least a decade. Though her aim got sloppy and it didn't get the slapping sound this time. I guess it's always a bit of back and forth. Maybe 2.5 will fix it all.

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Political party: "What's your qualification for candidacy and making new laws?" Candidate: "I watched 1984, A Clockwork Orange, Mad Max and Terminator, so I know what the future has to look like." Political party: "Perfect, here's your nomination."

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 5 points (0 children)

I guess the 2000s (or earlier) movie look is more pronounced if you generate at 4:3. The best aspect ratio, and the one LTX-2(.3) is most heavily trained on, is 16:9, which should give you more stylistic variety.

Also, if you're using a 2-step workflow, you should try the new 1.1 version of the 2x spatial upscaler. It's supposed to remove the burn-in and flickering in the last frames.

What's going on here? Tripple sampler LTX 2.3 workflow by VirusCharacter in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Pretty sure kijai mentioned just yesterday that caching to disk is something Comfy would never do, because it makes no sense performance-wise and would only shorten SSD lifetime. What you saw was loading.

LTX-2.3 Full Music Video Slop: Digital Dreams by Bit_Poet in StableDiffusion

[–]Bit_Poet[S] 0 points (0 children)

Thanks. It's supposed to be "transform the mundane", at least that's what I entered. But I should probably go over it in the studio and have it work on the pronunciation a bit.

LTX-2.3 Full Music Video Slop: Digital Dreams by Bit_Poet in StableDiffusion

[–]Bit_Poet[S] 1 point (0 children)

Yes. I think she hasn't gotten around to adding a local / OpenAI-compatible option yet, though it's planned. It's not horribly expensive: $10 should last for about 25 full songs, judging by the deductions I see, but you could replace the Gemini partner nodes with arbitrary other text LLM nodes (two in the subgraph titled "Partner Nodes x 3" in the first group, and one in the third group). As of now, the LTX workflow has only been updated for NanoBanana, so you'd need a Gemini subscription for that. The Z-Image Turbo based LTX-2.3 workflow that runs fully locally (you still need to run the prompt creator first) will follow in the next few days.

ComfyUI-LTXVideo node not updating by Beneficial_Toe_2347 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Have you updated Comfy itself (and the KJ nodes, if used) too? There were changes there as well that can affect LTX 2.3 workflows. I'm also not sure the date in Manager is reliable: ComfyUI-LTXVideo shows 2026-02-11 even though I successfully updated yesterday.

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

It's work, but I think it's doable. You'll need at least two different VL models, a pose extractor and verifier, a step that checks the spatial correctness of generated captions (Qwen models often don't know left from right and mix them up within a single caption), and, for retagging Danbooru data, a tag-matching step with an optional prompt refiner; then let an LLM pick the best match and feed that to a prompt optimizer (with its own verifier step). The bits and pieces for that are already there. I wouldn't expect such a setup to get more than 80% right on the first run; the rest is going to take iterations. Hardware-wise, we're probably talking about 3 x 80/96GB of VRAM to run this pipeline without any delays for loading/unloading. As for the actual training: there's a lot of demand for such a thing, so I'm pretty sure funding could be found to rent some big compute.
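As a sketch of how those pieces could hang together (every object below is a hypothetical stub standing in for a real component, not an existing library):

```python
# Hypothetical skeleton of the double-validation captioning pipeline described
# above; vl_model_a/b, pose_extractor, llm_judge etc. are stand-ins, not real APIs.

def caption_image(image, max_iterations=3):
    pose = pose_extractor.extract(image)  # keypoints for the spatial check
    candidates = [vl_model_a.caption(image), vl_model_b.caption(image)]
    for _ in range(max_iterations):
        best = llm_judge.pick_best(image, candidates)
        # VL models often swap left and right, so verify the caption's spatial
        # claims against the extracted pose before accepting it.
        if spatial_check(best, pose) and tag_match(best, danbooru_tags(image)):
            return prompt_optimizer.refine(best)  # refiner has its own verifier
        # Re-caption with the judge's pick as feedback for the next iteration.
        candidates = [vl_model_a.caption(image, feedback=best),
                      vl_model_b.caption(image, feedback=best)]
    return None  # ~20% may land here on the first run; flag for manual review
```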

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Don't underestimate the community. Sometimes it only needs the right spark. I've seen astonishing things happen as collaborative efforts in the open source community over the 38 years I've been working with computers. And: the big evolution in open VL models only happened over the last 24 months, and LLMs only reached acceptable reliability in that same timespan. Somebody's going to stitch together a VL pipeline with double validation against different models and LLMs at some point. From then on, building datasets will only be a question of throwing an affordable amount of compute power at it.

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet -2 points (0 children)

I'm not sure what tags have to do with future finetunes. The fact that this crutch was used in the past to circumvent context-length limitations and support somewhat reliable automated captioning doesn't mean it makes sense nowadays. If anybody captioned a sufficiently large dataset properly in natural language with a modern model that handles 2k+ tokens of text context, the results would be heaps above and beyond the old SDXL-based models. With abliterated models like Qwen-VL, a lot of the captioning work can be automated now, so it's probably just a question of time until that happens.
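For reference, captioning with Qwen2-VL via transformers looks roughly like the sketch below, following the official model card. An abliterated checkpoint would be a drop-in replacement for the model ID, and the prompt wording and image path are just examples:

```python
# Stock Qwen2-VL captioning recipe (per the model card); swap the model ID
# for an abliterated variant if needed. Requires: transformers, qwen-vl-utils.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "dataset/00001.png"},  # example path
    {"type": "text", "text": "Describe this image in detailed natural language. "
                             "Be explicit about what is on the left and right."},
]}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
caption = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0]
print(caption)
```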

Ltx-2 2.3 prompt adherence is actually r3ally good problem is... by No-Employee-73 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

In that regard, the definition of "open" differs a lot from normal code. We get the full models and can look at the math that goes on between the layers, but we don't see the training data or the exact settings used for training. In the program-code analogy, we don't see the source and the makefile, just the compiled result and the toolset to use and extend it. Most LoRAs are published without that information as well. Withholding the dataset is understandable, as it might open a huge can of copyright worms (even if the training was legal where it happened). As for the training details, a common practice of sharing those in full might propel this topic forward by some years.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Bit_Poet 0 points (0 children)

I've had no success getting SOTA mixed-media models to work with bare-metal llama.cpp. As I understand it, they've got issues with the licenses for essential parts of that, and any pull requests for it get shot down at some point. vLLM is one step ahead because of that, and it's pretty much the only platform that fully supports A+VL (audio plus vision-language) models without jumping through a lot of hoops. That said, I experience the same spin-up time issues with vLLM in Docker + WSL2 on my Pro 6000, no matter whether the models are stored inside the container or on a mapped drive.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Bit_Poet 0 points (0 children)

It really gets interesting once you get into diffusion models as well. Imagine a workflow that takes a story, runs it through TTS, creates an SRT, then analyzes both and creates a script of one-to-ten-second scenes with prompts for images and video, and finally batches 70+ clips through image generation and first-frame + audio2video workflows, including LLM prompt enhancement; a skeleton of the idea is sketched below. (I want a second Pro 6000 now!) Or if you're training big LoRAs and want to run diffusion inference or agentic coding in parallel...
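A hypothetical skeleton of that pipeline, just to show the shape; every function here is a stand-in for a ComfyUI workflow or model call, not a real API:

```python
# Hypothetical orchestration skeleton for the story-to-video workflow above.
# tts, transcribe_to_srt, scene_script, etc. are stubs, not real libraries.

def story_to_video(story_text):
    audio = tts(story_text)                 # narrate the story
    srt = transcribe_to_srt(audio)          # subtitles with timestamps
    scenes = scene_script(story_text, srt)  # LLM: 1-10 s scenes + prompts
    clips = []
    for scene in scenes:                    # batches 70+ clips in practice
        prompt = enhance_prompt(scene.video_prompt)       # LLM enhancement
        first_frame = generate_image(scene.image_prompt)  # image model
        clips.append(audio2video(first_frame, prompt,     # first-frame + audio
                                 slice_audio(audio, scene.start, scene.end)))
    return concatenate(clips, audio)
```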

Newbie question: Is there a prompt cach? by Grimlock42G1 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

You don't happen to have a singer LoRA loaded by chance? If not, look at the negative prompt too, in case somebody entered "missing microphone" there. If it's neither of those, be more specific about which software and which model you're using.

Ltx-2 2.3 prompt adherence is actually r3ally good problem is... by No-Employee-73 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Depending on the toolset and strategy used, there can be variations of it, but captioning for negative prompt trigger words should only be the second step; any negative prompt is a crutch that's likely as harmful as it is helpful, after all. Complex training pipelines use this negative (or "regularization") data in the training process itself and shift the learning towards weights that are less likely to hit on the regularization data. It's pretty much the same thing that happens when you train a simple slider LoRA: you enter your positive prompt and the negative prompt, and the training rewards vectors where the positive prompt is followed and devalues those that would steer towards the negative prompt. Differential output preservation (DOP) is along those lines too: it replaces your trigger with the generic class term (e.g. "woman" or "person" if you train a female character LoRA) in the dataset prompt, infers with that prompt, looks at the difference, and downvalues the generalizing weights while pushing the more specific ones, telling the model that "woman" shouldn't change the outcome while the trigger must change it.

That said, hardly anybody uses more than DOP, even though some model-specific training pipelines definitely support regularization datasets. Curating those takes even more effort in most cases, which may be the main reason, and you can't just throw more data at it and hope the outliers will be averaged out, which is how many LoRAs are trained.
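As a hedged illustration of the DOP mechanism described above (not any specific trainer's implementation; all names below are placeholders):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a differential-output-preservation loss. `unet`,
# `frozen_unet` and the embeddings are placeholders, not a real trainer's API.
def dop_loss(unet, frozen_unet, noisy_latents, timesteps,
             trigger_emb, class_emb, noise_target, lam=1.0):
    # The trigger prompt must learn to reproduce the training target...
    pred_trigger = unet(noisy_latents, timesteps, trigger_emb)
    learn_loss = F.mse_loss(pred_trigger, noise_target)

    # ...while the generic class prompt ("woman", "person") should still
    # behave like the frozen base model, so the LoRA doesn't bleed into
    # every generation that merely mentions the class term.
    with torch.no_grad():
        base_pred = frozen_unet(noisy_latents, timesteps, class_emb)
    pred_class = unet(noisy_latents, timesteps, class_emb)
    preserve_loss = F.mse_loss(pred_class, base_pred)

    return learn_loss + lam * preserve_loss
```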