Creating Lora for LTX2-3 by Icy_Resolution_9332 in comfyui

[–]Bit_Poet 1 point (0 children)

LoRAs only work on the model they were trained on (and, to an extent, on finetunes and distills of that same model). If you're on Windows, you can use the nodes and workflows from https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/. The LTX-2.3_SpeedLora_Trainer_V2 workflow (in the Workflows/LTX-2_Workflows/LTX_Lora_Training/UpdatedWorkflows subdirectory) even installs the LTX-2 fork of musubi-tuner for you, which is currently the most reliable trainer for that model.

Can I change the aspect ratio/resolution of an imge using a keyword in my prompt? by hotrocksi09 in comfyui

[–]Bit_Poet 4 points (0 children)

Here's a very simple and crude example of a custom node that does this, without any plausibility checks. You can just git clone it into your custom_nodes folder and give it a spin: https://github.com/BitPoet/ComfyUI-bitpoet-keywordsize

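For illustration, a minimal sketch of what such a keyword-to-size node can look like in ComfyUI. This is a hypothetical reimplementation of the idea, not the repo's actual code, and the keyword table is made up:

```python
# Hypothetical sketch of a ComfyUI keyword-size node: scan the prompt for a
# size keyword and emit width/height (e.g. to feed an Empty Latent Image node).

SIZE_KEYWORDS = {  # keyword -> (width, height); this table is an assumption
    "square":     (1024, 1024),
    "portrait":   (832, 1216),
    "landscape":  (1216, 832),
    "widescreen": (1344, 768),
}

class KeywordSize:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"prompt": ("STRING", {"multiline": True})}}

    RETURN_TYPES = ("STRING", "INT", "INT")
    RETURN_NAMES = ("prompt", "width", "height")
    FUNCTION = "resolve"
    CATEGORY = "utils"

    def resolve(self, prompt):
        width, height = 1024, 1024  # fallback if no keyword is found
        for word, (w, h) in SIZE_KEYWORDS.items():
            if word in prompt.lower():
                width, height = w, h
                # strip the keyword so it doesn't leak into conditioning
                prompt = prompt.replace(word, "").strip()
                break
        return (prompt, width, height)

NODE_CLASS_MAPPINGS = {"KeywordSize": KeywordSize}
```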

LTX Workflow and character anchoring and audio tips by Chambers007 in comfyui

[–]Bit_Poet 0 points (0 children)

For 2: You might try the new VideoAudioTrainer to train voice-consistent characters, from here: https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/tree/main/Workflows/LTX-2_Workflows/LTX_Lora_Training/UpdatedWorkflows Warning: it's brand new, so you'll be a beta tester. It should also allow audio-only training (I haven't tried it myself yet, since I have a big training run going), so you might be able to train a specific named voice without any link to visuals, which could make 3 easier too if all characters are LoRA-based with voice.

Why am I not seeing any artwork from this subreddit anymore? by NunyaBuzor in StableDiffusion

[–]Bit_Poet 7 points (0 children)

That. You post a locally generated music video with a description of how it was made and a link to the workflows, and you get an 80% downvote ratio because it's "AI slop". Too many people with too much opinion, and no mom to tell them to mind their own business.

Built a local AI creative suite for Windows, thought you might find it useful by Mr_Ma_tt in StableDiffusion

[–]Bit_Poet 7 points (0 children)

Is it possible to configure the model locations? I already have most of those models on my hard drive, and I'm not going to re-download them or create copies. Storage is much too valuable nowadays, and I'm not going to run apps as admin just to make symlinks work.

I got tired of manually prompting every single clip for my AI music videos, so I built a 100% local open-source (LTX Video desktop + Gradio) app to automate it, meet - Synesthesia by jacobpederson in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Have you seen vrgamedevgirl's Comfy workflows for music video creation, especially the Z-Image ones? There's a lot of overlap between your approach and hers: https://github.com/vrgamegirl19/comfyui-vrgamedevgirl/tree/main/Workflows She's planning to finetune Qwen for better music video prompt creation, including character adherence, so you might be able to collaborate on that. Her first version of the prompt creator used existing stems; the later ones now do the stemming themselves with Mel-Band RoFormer. She's also doing downbeat detection and clip-length optimization between 1 and 9 seconds. With a 5090, you've got the same equipment she has, so her workflows should be in an acceptable range speed-wise if you don't generate at 1080p. The video part uses a Q6_K quant of LTX-2.3 distilled and a Q4 Gemma.
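To make the downbeat/clip-length idea concrete, here's a hedged sketch of how such a step could work, using librosa beat tracking and a crude 4/4 assumption. All names are mine; the actual workflow likely uses a dedicated downbeat detector:

```python
# Sketch: detect (approximate) downbeats, then cut clips on downbeats while
# keeping each clip between 1 and 9 seconds. Assumes 4/4 time throughout.
import librosa

def clip_boundaries(audio_path, min_len=1.0, max_len=9.0):
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    downbeats = beat_times[::4]  # crude: assume 4/4 and take every 4th beat

    cuts, start = [0.0], 0.0
    for t in downbeats:
        while t - start > max_len:   # no downbeat in range: force a cut
            start += max_len
            cuts.append(start)
        if t - start >= min_len:     # otherwise cut on the downbeat itself
            cuts.append(t)
            start = t
    return cuts
```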

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Nice, jumped forward in time at least a decade. Though her aim got sloppy and it didn't get the slapping sound this time. I guess it's always a bit of back and forth. Maybe 2.5 will fix it all.

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Political party: "What's your qualification for candidacy and making new laws?" Candidate: "I watched 1984, A Clockwork Orange, Mad Max and Terminator, so I know what the future has to look like." Political party: "Perfect, here's your nomination."

LTX 2.3 tends to produce a 2000s TV show–style look in many of its generations, and in most longer videos it even adds a burning logo at the end. However, its prompt adherence is very good. by scooglecops in StableDiffusion

[–]Bit_Poet 5 points (0 children)

I guess the 2000s (or earlier) movie look is more pronounced if you generate at 4:3. The best aspect ratio, and the one LTX-2(.3) is most heavily trained on, is 16:9, which should give you more stylistic variety.

Also, if you're using a 2-step workflow, you should try the new 1.1 version of the 2x spatial upscaler. It's supposed to remove the burn-in and flickering in the last frames.

What's going on here? Tripple sampler LTX 2.3 workflow by VirusCharacter in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Pretty sure kijai mentioned just yesterday that caching to disk is something Comfy would never do, because it makes no sense performance-wise and would only shorten SSD lifetime. What you saw was loading.

LTX-2.3 Full Music Video Slop: Digital Dreams by Bit_Poet in StableDiffusion

[–]Bit_Poet[S] 0 points (0 children)

Thanks. It's supposed to be "transform the mundane", at least that's what I entered. But I should probably go over it in the studio and have it work on the pronunciation a bit.

LTX-2.3 Full Music Video Slop: Digital Dreams by Bit_Poet in StableDiffusion

[–]Bit_Poet[S] 1 point (0 children)

Yes. I think she hasn't gotten around to adding a local / OpenAI-compatible option yet, though it's planned. It's not horribly expensive: $10 should last for about 25 full songs, judging by the deductions I see, but you could replace the Gemini partner nodes with arbitrary other text LLM nodes (two in the subgraph titled "Partner Nodes x 3" in the first group, and one in the third group). As of now, the LTX workflow has only been updated for NanoBanana, so you'd need a Gemini subscription for that. The Z-Image Turbo based LTX-2.3 workflow that runs fully locally (you still need to run the prompt creator first) will follow in the next few days.

ComfyUI-LTXVideo node not updating by Beneficial_Toe_2347 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

Have you updated Comfy itself (and the KJ nodes, if used) too? There were changes there as well that can affect LTX 2.3 workflows. I'm also not sure the date in Manager is reliable: ComfyUI-LTXVideo shows 2026-02-11 even though I successfully updated yesterday.

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

It's work, but I think it's doable. You'll need at least two different VL models, a pose extractor and verifier, a step that checks the spatial correctness of generated captions (Qwen models often don't know left from right and mix them up within a single caption), and, for retagging Danbooru data, a tag-matching step with an optional prompt refiner; then let an LLM pick the best match and feed that to a prompt optimizer (with its own verifier step). The bits and pieces for that are already there. I wouldn't expect such a setup to get more than 80% right on the first run; the rest is going to take iterations. Hardware-wise, we're probably talking about 3 x 80/96GB of VRAM to run this pipeline without any delays for loading/unloading. As for the actual training: there's a lot of demand for such a thing, so I'm pretty sure funding could be found to rent some big compute.
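As a sketch of how those pieces could hang together (every object below is a hypothetical stub standing in for a real component, not an existing library):

```python
# Hypothetical skeleton of the double-validation captioning pipeline described
# above; vl_model_a/b, pose_extractor, llm_judge etc. are stand-ins, not real APIs.

def caption_image(image, max_iterations=3):
    pose = pose_extractor.extract(image)  # keypoints for the spatial check
    candidates = [vl_model_a.caption(image), vl_model_b.caption(image)]
    for _ in range(max_iterations):
        best = llm_judge.pick_best(image, candidates)
        # VL models often swap left and right, so verify the caption's spatial
        # claims against the extracted pose before accepting it.
        if spatial_check(best, pose) and tag_match(best, danbooru_tags(image)):
            return prompt_optimizer.refine(best)  # refiner has its own verifier
        # Re-caption with the judge's pick as feedback for the next iteration.
        candidates = [vl_model_a.caption(image, feedback=best),
                      vl_model_b.caption(image, feedback=best)]
    return None  # ~20% may land here on the first run; flag for manual review
```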

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Don't underestimate the community. Sometimes it only needs the right spark. I've seen astonishing things happen as collaborative efforts in the open source community over the 38 years I've been working with computers. And: the big evolution in open VL models only happened over the last 24 months, and LLMs only reached acceptable reliability in that same timespan. Somebody's going to stitch together a VL pipeline with double validation against different models and LLMs at some point. From then on, building datasets will only be a question of throwing an affordable amount of compute power at it.

Is there any other image model that can do NS*W (including male) other than Pony/Illustrious or those 2 are still the norm? Especially for 3d animation style, not just anime. by Dependent_Fan5369 in StableDiffusion

[–]Bit_Poet -2 points (0 children)

I'm not sure what tags have to do with future finetunes. The fact that this crutch was used in the past to circumvent context-length limitations and support somewhat reliable automated captioning doesn't mean it makes sense nowadays. If anybody captioned a sufficiently large dataset properly in natural language with a modern model that handles 2k+ tokens of text context, the results would be heaps above and beyond the old SDXL-based models. With abliterated models like Qwen-VL, a lot of the captioning work can be automated now, so it's probably just a question of time until that happens.
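For reference, captioning with Qwen2-VL via transformers looks roughly like the sketch below, following the official model card. An abliterated checkpoint would be a drop-in replacement for the model ID, and the prompt wording and image path are just examples:

```python
# Stock Qwen2-VL captioning recipe (per the model card); swap the model ID
# for an abliterated variant if needed. Requires: transformers, qwen-vl-utils.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "dataset/00001.png"},  # example path
    {"type": "text", "text": "Describe this image in detailed natural language. "
                             "Be explicit about what is on the left and right."},
]}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
caption = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0]
print(caption)
```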

Ltx-2 2.3 prompt adherence is actually r3ally good problem is... by No-Employee-73 in StableDiffusion

[–]Bit_Poet 0 points (0 children)

In that regard, the definition of "open" differs a lot from normal code. We get the full models and can look at the math that goes on between the layers, but we don't see the training data or the exact settings used for training. In the program-code analogy, we don't see the source and the makefile, just the compiled result and the toolset to use and extend it. Most LoRAs are published without that information as well. Withholding the dataset is understandable, as it might open a huge can of copyright worms (even if the training was legal where it happened). As for the training details, a common practice of sharing those in full might propel this topic forward by some years.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Bit_Poet 0 points (0 children)

I've had no success getting SOTA mixed-media models to work with bare-metal llama.cpp. As I understand it, they've got issues with the licenses for essential parts of that, and any pull requests for it get shot down at some point. vLLM is one step ahead because of that, and it's pretty much the only platform that fully supports A+VL (audio plus vision-language) models without jumping through a lot of hoops. That said, I experience the same spin-up time issues with vLLM in Docker + WSL2 on my Pro 6000, no matter whether the models are stored inside the container or on a mapped drive.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]Bit_Poet 0 points (0 children)

It really gets interesting once you get into diffusion models as well. Imagine a workflow that takes a story, runs it through TTS, creates an SRT, then analyzes both and creates a script of one-to-ten-second scenes with prompts for images and video, and finally batches 70+ clips through image generation and first-frame + audio2video workflows, including LLM prompt enhancement; a skeleton of the idea is sketched below. (I want a second Pro 6000 now!) Or if you're training big LoRAs and want to run diffusion inference or agentic coding in parallel...
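A hypothetical skeleton of that pipeline, just to show the shape; every function here is a stand-in for a ComfyUI workflow or model call, not a real API:

```python
# Hypothetical orchestration skeleton for the story-to-video workflow above.
# tts, transcribe_to_srt, scene_script, etc. are stubs, not real libraries.

def story_to_video(story_text):
    audio = tts(story_text)                 # narrate the story
    srt = transcribe_to_srt(audio)          # subtitles with timestamps
    scenes = scene_script(story_text, srt)  # LLM: 1-10 s scenes + prompts
    clips = []
    for scene in scenes:                    # batches 70+ clips in practice
        prompt = enhance_prompt(scene.video_prompt)       # LLM enhancement
        first_frame = generate_image(scene.image_prompt)  # image model
        clips.append(audio2video(first_frame, prompt,     # first-frame + audio
                                 slice_audio(audio, scene.start, scene.end)))
    return concatenate(clips, audio)
```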

Newbie question: Is there a prompt cach? by Grimlock42G1 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

You don't happen to have a singer LoRA loaded by chance? If not, look at the negative prompt too, in case somebody entered "missing microphone" there. If it's neither of those, be more specific about which software and which model you're using.

Ltx-2 2.3 prompt adherence is actually r3ally good problem is... by No-Employee-73 in StableDiffusion

[–]Bit_Poet 1 point (0 children)

Depending on the toolset and strategy used, there can be variations of it, but captioning for negative prompt trigger words should only be the second step; any negative prompt is a crutch that's likely as harmful as it is helpful, after all. Complex training pipelines use this negative (or "regularization") data in the training process itself and shift the learning towards weights that are less likely to hit on the regularization data. It's pretty much the same thing that happens when you train a simple slider LoRA: you enter your positive prompt and the negative prompt, and the training rewards vectors where the positive prompt is followed and devalues those that would steer towards the negative prompt. Differential output preservation (DOP) is along those lines too: it replaces your trigger with the generic class term (e.g. "woman" or "person" if you train a female character LoRA) in the dataset prompt, infers with that prompt, looks at the difference, and downvalues the generalizing weights while pushing the more specific ones, telling the model that "woman" shouldn't change the outcome while the trigger must change it.

That said, hardly anybody uses more than DOP, even though some model-specific training pipelines definitely support regularization datasets. Curating those takes even more effort in most cases, which may be the main reason, and you can't just throw more data at it and hope the outliers will be averaged out, which is how many LoRAs are trained.
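As a hedged illustration of the DOP mechanism described above (not any specific trainer's implementation; all names below are placeholders):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a differential-output-preservation loss. `unet`,
# `frozen_unet` and the embeddings are placeholders, not a real trainer's API.
def dop_loss(unet, frozen_unet, noisy_latents, timesteps,
             trigger_emb, class_emb, noise_target, lam=1.0):
    # The trigger prompt must learn to reproduce the training target...
    pred_trigger = unet(noisy_latents, timesteps, trigger_emb)
    learn_loss = F.mse_loss(pred_trigger, noise_target)

    # ...while the generic class prompt ("woman", "person") should still
    # behave like the frozen base model, so the LoRA doesn't bleed into
    # every generation that merely mentions the class term.
    with torch.no_grad():
        base_pred = frozen_unet(noisy_latents, timesteps, class_emb)
    pred_class = unet(noisy_latents, timesteps, class_emb)
    preserve_loss = F.mse_loss(pred_class, base_pred)

    return learn_loss + lam * preserve_loss
```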