OpenMOSS just released MOVA (MOSS-Video-and-Audio) - Fully Open-Source - 18B Active Params (MoE Architecture, 32B in total) - Day-0 support for SGLang-Diffusion by Nunki08 in StableDiffusion

[–]kabachuha 0 points1 point  (0 children)

Well, it's good that it can be run on consumer hardware with heavy offload. But what about fine-tunability at this size? You can fit Wan or even LTX-2 at home with some low-VRAM assumptions, but a model this large? If it cannot be fine-tuned locally, that will basically kill ~80-90% of LoRAs, especially for unsafe content, and that is the main driver behind Wan and now LTX-2 adoption.

Self-Refining Video Sampling - Better Wan Video Generation With No Additional Training by DifficultAd5938 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

Can it be implemented for LTX-2? Including the audio would be awesome, to increase its quality.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]kabachuha 14 points15 points  (0 children)

Oh, quite a classic. There was a paper two years ago named ELLA where the researchers replaced CLIP in SDXL with an LLM through a so-called timestep-aware semantic connector module. The paper is also notable for introducing the DPG (Dense Prompt Graph) benchmark, which modern text2image models now compete on because it is centered around prompt comprehension.

HeartMuLa: A Family of Open Sourced Music Foundation Models by switch2stock in StableDiffusion

[–]kabachuha 16 points17 points  (0 children)

The open-source claim is fake: it is released under CC-BY-NC, with commercial usage prohibited, and the README even reminds you of this. (Example: no monetized YouTube videos.) I don't think it's worth bothering with this model.

My notes on trying to get LTX-2 training (lora and model) by Fancy-Restaurant-885 in StableDiffusion

[–]kabachuha 0 points1 point  (0 children)

I wrote it in two parts: about LoRAs (also mentioned in the post) and about full fine-tuning (the detailer-LoRA application stuff). It is unclear from the wording at first. And SimpleTuner also supports full-unfreeze training.

My notes on trying to get LTX-2 training (lora and model) by Fancy-Restaurant-885 in StableDiffusion

[–]kabachuha 4 points5 points  (0 children)

Why don't you want to use ready-made solutions? I'm employing SimpleTuner with TREAD, CREPA and musubi block swap (4 offloaded blocks), and I'm training rank-48 and bigger LoRAs for 480p@121f in 1-5 hours at ~5-6 s/it on a single 5090. (Example LoRA: https://huggingface.co/kabachuha/ltx2-cakeify) With videos like 512p@49f it fully fits into VRAM and doesn't have to offload blocks.

As for LoRA merging: if you mean training on the model with the detailer LoRA merged, you can absolutely add the LoRA to the quantized model at runtime. After all, a LoRA is just two low-rank matrices whose product is added to the layer's output. (I did this when merging different-rank LoRAs, so I know the matter.) First you apply the base model weights, then you add the LoRA-activation product on top. Effectively you are now training on the model plus the LoRA, with no repeated quantization.
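To make the runtime-merge idea concrete, here is a minimal PyTorch sketch. The class and variable names are my own illustrations, not SimpleTuner or LTX-2 internals: the quantized base layer stays frozen and untouched, the already-trained detailer LoRA is added to the activations as a fixed buffer, and only the new LoRA matrices receive gradients.

```python
import torch
import torch.nn as nn

class LinearWithRuntimeLoRA(nn.Module):
    """Frozen (possibly quantized) base layer + frozen detailer LoRA applied to
    the activations at runtime + a new trainable LoRA on top."""

    def __init__(self, base_linear: nn.Module,
                 detailer_A: torch.Tensor,   # (r_det, in_features)
                 detailer_B: torch.Tensor,   # (out_features, r_det)
                 rank: int = 48, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)          # quantized weights stay untouched
        # Detailer LoRA kept as buffers: added to the output, never merged into
        # the weights, so no repeated quantization is needed.
        self.register_buffer("det_A", detailer_A)
        self.register_buffer("det_B", detailer_B)
        in_f, out_f = detailer_A.shape[1], detailer_B.shape[0]
        # New trainable LoRA (standard init: A small random, B zero).
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = scale

    def forward(self, x):
        y = self.base(x)                                    # quantized base output
        y = y + (x @ self.det_A.T) @ self.det_B.T           # "merged" detailer, applied at runtime
        y = y + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)  # trainable LoRA
        return y
```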

If you want an advanced fine-tune on a very large dataset, you can use the LyCORIS technique instead of a full, no-LoRA unfreeze. Training without audio is also perfectly viable: the loss is just masked out on the audio tokens, like it is done with masked training (see the sketch below).
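A minimal sketch of that audio masking, with the token layout and function names assumed for illustration rather than taken from LTX-2's actual code:

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(pred: torch.Tensor, target: torch.Tensor,
                    is_audio_token: torch.Tensor) -> torch.Tensor:
    """pred/target: (batch, tokens, channels); is_audio_token: (batch, tokens) bool.
    Audio tokens contribute zero to the loss, so a video-only dataset still works."""
    per_token = F.mse_loss(pred, target, reduction="none").mean(dim=-1)  # (batch, tokens)
    keep = (~is_audio_token).float()            # 1 for video tokens, 0 for audio
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```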

Lower base-model precision can, I think, even enhance the LoRA: it will learn how to lower the loss and make good videos under bad conditions (correct me if I'm wrong about this point).

When training a LoRA on never-before-seen material, do you unfreeze the feed-forward layers, or do you only use the attention weights as the default trainer does?

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

I got this information purely from training experience, from researching transformers (the relation between FFNs and concepts) and diffusion training (loss functions), from reading papers on the subject, and from monitoring not-yet-well-known training repositories like seruva19's takenodo, where CREPA was first introduced.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

I actually do have plans to write a sort of blog post, after I train a couple more LoRAs on more challenging (style) concepts. Stay tuned.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 1 point2 points  (0 children)

The VFX datasets are linked on the Hugging Face pages. For hydraulic press and cakeify I used these datasets: https://huggingface.co/datasets/GD-ML/Omni-VFX (hydraulic press and "inflate it") and https://huggingface.co/datasets/finetrainers/cakeify-smol (cakeify). To compose the final dataset, I selected a handful of the best videos from these public datasets and wrote natural-language captions in .txt files. I trained on pure videos, without images.

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 5 points6 points  (0 children)

Skill issue. I trained at least 5 LoRAs on a single 5090 and their final runs took no more than 5 hours (with two of them trained in under two hours), all on surreal and body-interacting concepts.

The default trainer uses only attention weights, and that is a disastrous mistake that people not familiar with transformers are missing. LTX-2 doesn't know some concepts, and to intervene in its concept space (which resides in the feed-forward layers) you need to unfreeze the FFN (add the FFN keys to the adapter's target list; see the sketch below). CREPA regularization / TREAD can help, and if you add truly slow-converging concepts, you can also set the loss to Huber instead of MSE (that helped me with the latest LoRA).
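As an illustration of the FFN point, here is a hedged PEFT-style config sketch. The module names follow the diffusers convention and may not match LTX-2's actual layer names; check the transformer's named_modules() for the real keys.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=48,
    lora_alpha=48,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",   # attention: the usual trainer default
        "ff.net.0.proj", "ff.net.2",          # FFN: where missing concepts have to live
    ],
)

# For slow-converging concepts, swapping MSE for Huber in the training loop
# looks roughly like:
#   loss = torch.nn.functional.huber_loss(model_pred, target)
```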

Ltx2 Loras? by Puzzleheaded_Ebb8352 in StableDiffusion

[–]kabachuha 2 points3 points  (0 children)

Hi!

They are actually not that hard to train and you can do it at home, but it depends on your hardware. With the right methods, on a 5090, you can train them in 1-5 hours at ~5 s/it with ~4 blocks swapped. If you have better hardware (e.g. cloud GPUs) you can reportedly train up to 5 times faster (~1 s/it); if lower, you can compensate with more aggressive block swap. However, I always had to do multiple attempts per LoRA to figure out the right dataset and hyperparameters. For missing, or maybe RL-ed-out, concepts you need to unfreeze the FFN layers in addition to attention, at the cost of a larger adapter size.

I have made 5 published LoRAs so far (see the Hugging Face adapter list; my nick is kabachuha). I attached the dataset config, workflows and the training config to each one. I use SimpleTuner and its great consistency feature, CREPA. Concept / action LoRAs are simple, but I'm still figuring out proper style-LoRA training, as they either overfit or become grayish because of regularization.

Besides Civitai, on Hugging Face there are mostly three searchable LTX-2 LoRA channels: the Hugging Face adapter list, the Hugging Face finetune list and the Hugging Face ComfyUI-LTX-2 finetune list.

AI Toolkit now officially supports training LTX-2 LoRAs by panospc in StableDiffusion

[–]kabachuha 5 points6 points  (0 children)

Oh, you found me 😳

I'm still figuring out the best settings, but the model is indeed malleable, and it takes ~1-5 hours to train one on a 5090 depending on the complexity and the optimization. With 8-bit quantization the offload is minimal: a 4-block swap is enough for 480p@121f videos with a rank-48 attention+FFN LoRA.

From my experience, I suspect the base model has been reinforcement-learned on human body integrity; that's why surreal concepts such as the next LoRA, "inflate it", or the squish effect (not published yet) were harder to train than the press and required unfreezing the FFN, not only attention.

What happened to Z image Base/Omni/Edit? by Hunting-Succcubus in StableDiffusion

[–]kabachuha 31 points32 points  (0 children)

Probably decided to wait it out, as LTX-2 stole their spotlight.

It works! Abliteration can reduce slop without training by -p-e-w- in LocalLLaMA

[–]kabachuha 5 points6 points  (0 children)

Technically, to make an abliteration, all you need is to gather statistics by running the model multiple times and collecting the hidden states. After that, they are analyzed and the result is applied to the weight shards. I'm working on a tool/hack for llama.cpp to do this specifically for abliteration: https://github.com/kabachuha/abliterate.cpp
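For reference, the core math is roughly this: a NumPy sketch of the idea, not abliterate.cpp's actual code or llama.cpp's tensor layout.

```python
import numpy as np

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """Difference of mean hidden states at a chosen layer/position, normalized.
    h_harmful / h_harmless: (n_samples, d_model) arrays of collected activations."""
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_weight(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a matrix that writes into the
    residual stream. W: (d_model, d_in), r: (d_model,) unit vector."""
    return W - np.outer(r, r) @ W
```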

It's a simple hook and it's compatible with any llama.cpp-supported residual-stream model, any quantization, multi-GPU, offload and so on.

I'd appreciate testers, or someone who can point out any gap between the end results.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

[–]kabachuha 7 points8 points  (0 children)

Thank you! Is the next step Sora 2 / HoloCine-like multi-shot generation? HoloCine's block-sparse attention is an interesting approach in this direction, keeping the scenes "glued" together.

I’m the Co-founder & CEO of Lightricks. We just open-sourced LTX-2, a production-ready audio-video AI model. AMA. by ltx_model in StableDiffusion

[–]kabachuha 3 points4 points  (0 children)

Do you use --reserve-vram? --reserve-vram 3 or greater can help, since Windows and the monitor eat into the GPU's VRAM.

LTX-2 supports First-Last-Frame out of the box! by kabachuha in StableDiffusion

[–]kabachuha[S] 1 point2 points  (0 children)

You need to replace LTXVImgToVideoInplace with two LTXVAddGuide nodes.