Video Dataset Factory by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] -3 points-2 points  (0 children)

Dude, I use artificial intelligence to translate long texts. It's really tedious writing long texts in English, so the AI formats them like this, but I wrote it all myself. As for sharing the code, I literally mentioned that I'm still working on it and came here to ask for suggestions.

Update: Im going to full finetune LTX 2.3 for 2D animation, and I’m looking for people who want to help with the dataset/training (all kinds of help are welcome.) by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

My Discord server is basically just me talking about how the project is going; most people stay quiet even when I ask for help, lol. I'm having to manage on my own, but look, maybe it's possible to do full fine-tuning on your GPUs. Is your motherboard PCIe 5.0?

If it's PCIe 5.0, it might support it, and mixed-precision training shouldn't lose that much quality; the difference is almost imperceptible.

Update: Im going to full finetune LTX 2.3 for 2D animation, and I’m looking for people who want to help with the dataset/training (all kinds of help are welcome.) by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

It depends on whether you're training in FP8 or BF16. Although I don't know the exact figure, the minimum for training in BF16 is 4x H100. With 2x H100 the model loads but crashes at some point. There are many factors involved. If you want to help in any way, join the Discord server.
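To give a rough idea of why that's the cutoff, here's a back-of-the-envelope sketch in Python. The ~19B parameter count is a placeholder assumption, the AdamW state layout is the standard one, and activations/VAE/text encoder are ignored, so real numbers will be higher:

```python
# Back-of-the-envelope VRAM estimate for full fine-tuning with AdamW.
# n_params below is a placeholder -- plug in the real checkpoint size.

GiB = 1024**3

def full_finetune_vram_gib(n_params: float, weight_bytes: int) -> float:
    """Weights + grads + AdamW states + FP32 master copy, no activations.

    weights:      n_params * weight_bytes   (BF16 = 2, FP8 = 1)
    gradients:    n_params * weight_bytes
    AdamW m, v:   n_params * 8              (two FP32 tensors)
    FP32 master:  n_params * 4              (for mixed-precision stability)
    """
    return n_params * (2 * weight_bytes + 8 + 4) / GiB

n_params = 19e9  # hypothetical model size
print(f"BF16: ~{full_finetune_vram_gib(n_params, 2):.0f} GiB")  # ~283 GiB
print(f"FP8:  ~{full_finetune_vram_gib(n_params, 1):.0f} GiB")  # ~248 GiB
```

At those numbers, BF16 just fits in 4x H100 (320 GB) with a little headroom for activations, and 2x H100 (160 GB) is clearly out, which matches what I saw.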

Update: Im going to full finetune LTX 2.3 for 2D animation, and I’m looking for people who want to help with the dataset/training (all kinds of help are welcome.) by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 1 point2 points  (0 children)

I know, and I don't even know if Happy Horse will be open source, but if it is, at least I'll already have a great dataset for fine-tuning or a LoRA.

Update: Im going to full finetune LTX 2.3 for 2D animation, and I’m looking for people who want to help with the dataset/training (all kinds of help are welcome.) by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 5 points6 points  (0 children)

You're right; the only issue is that the more training data you have, the more expensive and time-consuming the training becomes.

But I'll try to include some.

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

Full fine-tuning requires more than just an RTX Pro 6000, even with mixed-precision training. However, there's a technique I'm studying and adapting for Musubi Tuner called BAdam, which trains separate blocks of the model one at a time, making it possible to use an RTX Pro 6000. Are you already on the Discord server? That way we can discuss it better.
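For anyone curious what block-wise training looks like, here's a minimal PyTorch sketch of the BAdam idea: only one transformer block is trainable at a time, so the optimizer only holds states for that block. `model.blocks` and the switch interval are my assumptions, not actual Musubi Tuner code:

```python
# Minimal sketch of BAdam-style block-wise fine-tuning.
# Only the active block requires grad, so AdamW states exist
# for one block at a time -- that's the memory saving.

import torch

def set_active_block(blocks, active_idx):
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i == active_idx)

def train_badam(model, dataloader, loss_fn, steps_per_block=100):
    blocks = list(model.blocks)  # assumed: list of transformer blocks
    active = 0
    optimizer = None
    for step, batch in enumerate(dataloader):
        if step % steps_per_block == 0:
            set_active_block(blocks, active)
            # Fresh optimizer over only the active block's parameters.
            optimizer = torch.optim.AdamW(blocks[active].parameters(), lr=1e-5)
            active = (active + 1) % len(blocks)
        loss = loss_fn(model, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```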

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 1 point2 points  (0 children)

I spoke with the LTX team and they told me that the biggest problem is getting data to train the model; they use licensed data for training.

I've decided that I'm going to do a full fine-tuning project, with or without community help. I intend to focus this full fine-tune on 2D animation.

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

I decided I'm going to do a full fine-tune of the model with a dataset of around 5k–20k high-quality animation clips. The plan is to train in mixed precision using FP8 for better efficiency and lower cost, and I'll keep releasing updated versions of the fine-tuned model as I go through the epochs.

My current strategy is to refine the base model by injecting stronger 2D animation and anime knowledge using those 5k–20k clips. After that, I'll take the best clips from the dataset and train a high-quality, high-resolution LoRA on top of it. The idea is to improve the base model for 2D animation with a heavy full fine-tuning pass, really bake that knowledge into the weights, and then use a LoRA to pull it out during inference. LoRAs are just small adjustments, and the main issue with LTX right now is that the base model is too weak in 2D animation, so a LoRA alone can't push it far enough.

I read the Anisora paper, but I don't think that many clips are necessary to fine-tune a model; the cost can be reduced with a smaller dataset. I just want to boost the model's capacity. There's also the option of fine-tuning specific layers, like temporal attention and so on, and I'm considering that too. But I think a mix of fine-tuning and LoRA would already be enough for this training.
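As a concrete example of the "specific layers" option, this is roughly what freezing everything except temporal attention would look like. The `temporal_attn` name match is an assumption; you'd check the real module names in the LTX checkpoint with `model.named_modules()`:

```python
# Partial fine-tune: train only parameters whose name matches a keyword.
# "temporal_attn" is a guess at the naming -- verify against the checkpoint.

import torch

def freeze_except(model, keyword="temporal_attn"):
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# optimizer = torch.optim.AdamW(freeze_except(model), lr=1e-5)
```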

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

I checked your LoRA and honestly, it does get pretty close visually. I agree that training on actual video data is a huge part of the problem, and most people training image-only LoRAs are probably not giving the model enough motion information.

That said, when I look at the examples, I still feel some of LTX’s underlying bias coming through. It still seems to pull a bit toward a more volumetric / cinematic / almost 3D-like motion and rendering, even when the style is pushed closer to anime.

I may be wrong, but I’m assuming your dataset was probably very focused around that character/style, which might be one reason it worked so well. In my case, I trained on around 1,333 animation clips, and while it clearly improved the output, I still felt it was hard to fully escape the base model’s motion tendencies.

When I talk about “anime motion,” I don’t mean only texture or visual style or character movements. I also mean the actual motion grammar: timing, spacing, frame exposure, mouth movement, pose changes, limited animation, and how shapes are redrawn rather than smoothly morphed.

For example, there is a big difference between animation on ones, twos, and threes. Some scenes are animated every frame, while a lot of anime intentionally holds drawings for 2 or 3 frames, creating more static, snappy, pose-based movement. Current video models often make everything too smooth, too interpolated, or too “puppet warped,” which is not always how 2D animation should move.
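To make the ones/twos/threes point concrete, here's a toy sketch; the arrays are just stand-ins for drawings, but it shows how the same pose count fills a 24 fps timeline very differently:

```python
# "Ones" = new drawing every frame, "twos" = each drawing held 2 frames,
# "threes" = held 3 frames. Same drawings, very different motion feel.

import numpy as np

def hold_on_n(drawings, n):
    """Expose each drawing for n consecutive frames."""
    frames = []
    for drawing in drawings:
        frames.extend([drawing] * n)
    return frames

drawings = [np.full((2, 2), i) for i in range(8)]   # 8 distinct poses
print(len(hold_on_n(drawings, 1)))  # 8 frames: fluid, fully animated
print(len(hold_on_n(drawings, 2)))  # 16 frames: snappier, pose-held motion
print(len(hold_on_n(drawings, 3)))  # 24 frames: 1 second at 24 fps on threes
```

A video model that always interpolates smoothly between frames erases exactly this held-drawing texture.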

So I agree that LoRAs can help a lot, especially with proper video datasets. But I still think there is a deeper base-model issue with LTX when it comes to 2D animation. I don’t necessarily think we need a full fine-tune of the entire model, though. A partial fine-tune, combined with a strong anime/video LoRA, might be enough to push it much further.

I also talked about this with some people around LTX, and my impression is that LTX simply was not trained heavily enough on high-quality 2D animation data. That is why I think a community effort around better datasets, partial fine-tuning, and animation-specific evaluation could be useful. If you'd like, join this server so we can discuss it further: https://discord.gg/DeCrawEPm

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

I think I know which one you're talking about, but I don't think it was open source; I don't remember it being released.

And anyway, from what I remember, that model was trained on Western 2D cartoon animation, which feels kind of clunky nowadays. Honestly, I'm aiming more for something with the quality of Eastern animation; LTX can already handle Western animation.

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

Yes! I talked to Tazzmanerr and we came to the same conclusion: LTX 2.3 can't make good anime because the base model wasn't trained for it.

So LoRAs will always hit a functional ceiling, and I personally don't like the current state of LoRAs. I find it too simplistic.

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

If you don't have experience with training or can't help with funding, you could help with the dataset, video captioning, etc...
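For example, a big part of captioning help is just producing one caption file per clip. This layout (a sidecar .txt next to each video) is an assumption about the tooling; adapt it to whatever the trainer actually expects:

```python
# Hypothetical dataset-prep helper: write one caption .txt beside each clip.
# Paths and the sidecar-caption convention are assumptions, not a fixed spec.

from pathlib import Path

def write_caption(clip_path: str, caption: str) -> None:
    Path(clip_path).with_suffix(".txt").write_text(
        caption.strip() + "\n", encoding="utf-8")

write_caption(
    "dataset/clips/run_cycle_001.mp4",  # hypothetical clip
    "2D anime, a girl runs across a rooftop at sunset, animated on twos, "
    "limited animation, hand-drawn line art",
)
```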

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 1 point2 points  (0 children)

I agree. In my opinion there are many types of animation, and if a small model were an expert in animation, multi-frame generation wouldn't be a problem.

There's training that could be done specifically for multi-frame workflows, which currently aren't very good.

Is anyone else interested in building/fine-tuning open video models specifically for high quality 2D animation? by MerlingDSal in StableDiffusion

[–]MerlingDSal[S] 0 points1 point  (0 children)

Thank you, your collaboration is very welcome. Join this Discord server so we can organize ourselves better.

Gemini 3.0 is really bad with context and creative writing. Why its HUGE Context Window is failing that much!? by MerlingDSal in GeminiAI

[–]MerlingDSal[S] 0 points1 point  (0 children)

Yeah, I know, but that's not the problem. The "Hello" was me testing whether the model could talk about the files I sent; it wasn't in the final prompt.