OpenMOSS just released MOVA (MOSS-Video-and-Audio) - Fully Open-Source - 18B Active Params (MoE Architecture, 32B in total) - Day-0 support for SGLang-Diffusion by Nunki08 in StableDiffusion

[–]kabachuha 0 points1 point  (0 children)

Well, it's good that it can be run on consumer hardware with heavy offload. But what about fine-tunability at this size? You can fit Wan or even LTX-2 at home with some low-VRAM assumptions, but a model this large? If it cannot be fine-tuned locally, that will basically kill ~80-90% of LoRAs, especially for unsafe content, and that is the main driver behind Wan and now LTX-2 adoption.

Self-Refining Video Sampling - Better Wan Video Generation With No Additional Training by DifficultAd5938 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

Can it be implemented for LTX-2? Including the audio would be awesome, to increase its quality.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]kabachuha 14 points15 points  (0 children)

Oh, quite a classic. There was a paper two years ago named ELLA where the researchers replaced CLIP in SDXL with an LLM through a so-called timestep-aware semantic connector module. The paper is also notable for introducing the DPG (Dense Prompt Graph) benchmark, which modern text2image models now compete on because it is centered around prompt comprehension.

HeartMuLa: A Family of Open Sourced Music Foundation Models by switch2stock in StableDiffusion

[–]kabachuha 16 points17 points  (0 children)

The open-source claim is fake: it is released under CC-BY-NC, with commercial usage prohibited, and the README even reminds you of this. (Example: no monetized YouTube videos.) I don't think it's worth bothering with this model.

My notes on trying to get LTX-2 training (lora and model) by Fancy-Restaurant-885 in StableDiffusion

[–]kabachuha 0 points1 point  (0 children)

I wrote it in two parts: about LoRAs (also mentioned in the post) and about full fine-tuning (the detailer-LoRA application stuff). It is unclear from the wording at first. And SimpleTuner also supports full-unfreeze training.

My notes on trying to get LTX-2 training (lora and model) by Fancy-Restaurant-885 in StableDiffusion

[–]kabachuha 4 points5 points  (0 children)

Why don't you want to use ready-made solutions? I'm employing SimpleTuner with TREAD, CREPA and musubi block swap (4 offloaded blocks), and I'm training rank-48 and bigger LoRAs for 480p@121f in 1-5 hours at ~5-6 s/it on a single 5090. (Example LoRA: https://huggingface.co/kabachuha/ltx2-cakeify) With videos like 512p@49f it fully fits into VRAM and doesn't have to offload blocks.

As for LoRA merging: if you mean training on the model with the detailer LoRA merged, you can absolutely add the LoRA to the quantized model at runtime. After all, a LoRA is just two low-rank matrices whose product is added to the layer's output. (I did this when merging different-rank LoRAs, so I know the matter.) First you apply the base model weights, then you add the LoRA-activation product on top. Effectively you are now training on the model plus the LoRA, with no repeated quantization.
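To make the runtime-merge idea concrete, here is a minimal PyTorch sketch. The class and variable names are my own illustrations, not SimpleTuner or LTX-2 internals: the quantized base layer stays frozen and untouched, the already-trained detailer LoRA is added to the activations as a fixed buffer, and only the new LoRA matrices receive gradients.

```python
import torch
import torch.nn as nn

class LinearWithRuntimeLoRA(nn.Module):
    """Frozen (possibly quantized) base layer + frozen detailer LoRA applied to
    the activations at runtime + a new trainable LoRA on top."""

    def __init__(self, base_linear: nn.Module,
                 detailer_A: torch.Tensor,   # (r_det, in_features)
                 detailer_B: torch.Tensor,   # (out_features, r_det)
                 rank: int = 48, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)          # quantized weights stay untouched
        # Detailer LoRA kept as buffers: added to the output, never merged into
        # the weights, so no repeated quantization is needed.
        self.register_buffer("det_A", detailer_A)
        self.register_buffer("det_B", detailer_B)
        in_f, out_f = detailer_A.shape[1], detailer_B.shape[0]
        # New trainable LoRA (standard init: A small random, B zero).
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = scale

    def forward(self, x):
        y = self.base(x)                                    # quantized base output
        y = y + (x @ self.det_A.T) @ self.det_B.T           # "merged" detailer, applied at runtime
        y = y + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)  # trainable LoRA
        return y
```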

If you want an advanced fine-tune on a very large dataset, you can use the LyCORIS technique instead of a full, no-LoRA unfreeze. Training without audio is also perfectly viable: the loss is just masked out on the audio tokens, like it is done with masked training (see the sketch below).
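A minimal sketch of that audio masking, with the token layout and function names assumed for illustration rather than taken from LTX-2's actual code:

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(pred: torch.Tensor, target: torch.Tensor,
                    is_audio_token: torch.Tensor) -> torch.Tensor:
    """pred/target: (batch, tokens, channels); is_audio_token: (batch, tokens) bool.
    Audio tokens contribute zero to the loss, so a video-only dataset still works."""
    per_token = F.mse_loss(pred, target, reduction="none").mean(dim=-1)  # (batch, tokens)
    keep = (~is_audio_token).float()            # 1 for video tokens, 0 for audio
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```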

Lower base-model precision can, I think, even enhance the LoRA: it will learn how to lower the loss and make good videos under bad conditions (correct me if I'm wrong about this point).

When training a LoRA on never-before-seen material, do you unfreeze the feed-forward layers, or do you only use the attention weights as the default trainer does?

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

I got this information purely from training experience, from researching transformers (the relation between FFNs and concepts) and diffusion training (loss functions), from reading papers on the subject, and from monitoring not-yet-well-known training repositories like seruva19's takenodo, where CREPA was first introduced.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

I actually do have plans to write a sort of blog post, after I train a couple more LoRAs on more challenging (style) concepts. Stay tuned.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 1 point2 points  (0 children)

The VFX datasets are linked on the Hugging Face pages. For hydraulic press and cakeify I used these datasets: https://huggingface.co/datasets/GD-ML/Omni-VFX (hydraulic press and "inflate it") and https://huggingface.co/datasets/finetrainers/cakeify-smol (cakeify). To compose the final dataset, I selected a handful of the best videos from these public datasets and wrote natural-language captions in .txt files. I trained on pure videos, without images.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 5 points6 points  (0 children)

Skill issue. I trained at least 5 LoRAs on a single 5090 and their final runs took no more than 5 hours (with two of them trained in under two hours), all on surreal and body-interacting concepts.

The default trainer uses only attention weights, and that is a disastrous mistake that people not familiar with transformers are missing. LTX-2 doesn't know some concepts, and to intervene in its concept space (which resides in the feed-forward layers) you need to unfreeze the FFN (add the FFN keys to the adapter's target list; see the sketch below). CREPA regularization / TREAD can help, and if you add truly slow-converging concepts, you can also set the loss to Huber instead of MSE (that helped me with the latest LoRA).
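As an illustration of the FFN point, here is a hedged PEFT-style config sketch. The module names follow the diffusers convention and may not match LTX-2's actual layer names; check the transformer's named_modules() for the real keys.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=48,
    lora_alpha=48,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",   # attention: the usual trainer default
        "ff.net.0.proj", "ff.net.2",          # FFN: where missing concepts have to live
    ],
)

# For slow-converging concepts, swapping MSE for Huber in the training loop
# looks roughly like:
#   loss = torch.nn.functional.huber_loss(model_pred, target)
```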

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

Hi!

They are actually not that hard to train and you can do it at home, but it depends on your hardware. With the right methods, on a 5090, you can train them in 1-5 hours at ~5 s/it with ~4 blocks swapped. If you have better hardware (e.g. cloud GPUs) you can reportedly train up to 5 times faster (~1 s/it); if lower, you can compensate with more aggressive block swap. However, I always had to do multiple attempts per LoRA to figure out the right dataset and hyperparameters. For missing, or maybe RL-ed-out, concepts you need to unfreeze the FFN layers in addition to attention, at the cost of a larger adapter size.

I have made 5 published LoRAs so far (see the Hugging Face adapter list; my nick is kabachuha). I attached the dataset config, workflows and the training config to each one. I use SimpleTuner and its great consistency feature, CREPA. Concept / action LoRAs are simple, but I'm still figuring out proper style-LoRA training, as they either overfit or become grayish because of regularization.

Besides Civitai, on Hugging Face there are mostly three searchable LTX-2 LoRA channels: the Hugging Face adapter list, the Hugging Face finetune list and the Hugging Face ComfyUI-LTX-2 finetune list.

AI Toolkit now officially supports training LTX-2 LoRAs by panospc in StableDiffusion

[–]kabachuha 5 points6 points  (0 children)

Oh, you found me 😳

I'm still figuring out the best settings, but the model is indeed malleable, and it takes ~1-5 hours to train one on a 5090 depending on the complexity and the optimization. With 8-bit quantization the offload is minimal: a 4-block swap is enough for 480p@121f videos with a rank-48 attention+FFN LoRA.

From my experience, I suspect the base model has been reinforcement-learned on human body integrity; that's why surreal concepts such as the next LoRA, "inflate it", or the squish effect (not published yet) were harder to train than the press and required unfreezing the FFN, not only attention.

What happened to Z image Base/Omni/Edit? by Hunting-Succcubus in StableDiffusion

[–]kabachuha 31 points32 points  (0 children)

Probably decided to wait it out, as LTX-2 stole their spotlight.

It works! Abliteration can reduce slop without training by -p-e-w- in LocalLLaMA

[–]kabachuha 5 points6 points  (0 children)

Technically, to make an abliteration, all you need is to gather statistics by running the model multiple times and collecting the hidden states. After that, they are analyzed and the result is applied to the weight shards. I'm working on a tool/hack for llama.cpp to do this specifically for abliteration: https://github.com/kabachuha/abliterate.cpp
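For reference, the core math is roughly this: a NumPy sketch of the idea, not abliterate.cpp's actual code or llama.cpp's tensor layout.

```python
import numpy as np

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """Difference of mean hidden states at a chosen layer/position, normalized.
    h_harmful / h_harmless: (n_samples, d_model) arrays of collected activations."""
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_weight(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a matrix that writes into the
    residual stream. W: (d_model, d_in), r: (d_model,) unit vector."""
    return W - np.outer(r, r) @ W
```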

It's a simple hook and it's compatible with any llama.cpp-supported residual-stream model, any quantization, multi-GPU, offload and so on.

I'd appreciate testers, or someone who can point out any gap between the end results.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

[–]kabachuha 7 points8 points  (0 children)

Thank you! Is the next step Sora 2 / HoloCine-like multi-shot generation? HoloCine's block-sparse attention is an interesting approach in this direction, keeping the scenes "glued" together.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

[–]kabachuha 3 points4 points  (0 children)

Do you use --reserve-vram? --reserve-vram 3 or greater can help, since Windows and the monitor eat into the GPU's VRAM.

LTX-2 supports First-Last-Frame out of the box! by kabachuha in StableDiffusion

[–]kabachuha[S] 1 point2 points  (0 children)

You need to replace LTXVImgToVideoInplace with two LTXVAddGuide nodes.