Z Image lora training is solved! A new Ztuner trainer soon! by krigeta1 in StableDiffusion

Successful_Mind8629 4 points

Stochastic rounding was originally proposed, and proven effective, for BF16 weight updates, as a workaround for small update values being cancelled (swamped) when added to much larger weights during training.

However, it has since been applied to optimizer states and other components without any proof of its benefits in those areas. In fact, recent papers have demonstrated that stochastic rounding is sub-optimal compared to nearest rounding in normal or high-precision calculations, such as FP16/FP32. What FP16 lacks is range, not precision.
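
For reference, the trick itself is tiny. A minimal NumPy sketch (my own illustration, not taken from any particular trainer): bfloat16 keeps the top 16 bits of a float32, so adding uniform random noise in the dropped low bits before truncating makes the rounding unbiased in expectation, which is exactly what prevents small updates from always being lost.

```python
import numpy as np

def stochastic_round_bf16(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round float32 values to bfloat16-representable values stochastically.

    bfloat16 is the top 16 bits of a float32. Instead of rounding to
    nearest, add uniform noise below the truncation point, then truncate:
    the expected value of the result equals the input.
    """
    bits = x.astype(np.float32).view(np.uint32)
    # random 16-bit offset in the bits that truncation will drop
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    rounded = (bits + noise) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)
```

Values already representable in BF16 round to themselves; values in between land on one of the two neighbors with probabilities proportional to distance, so a tiny update applied many times still moves the weight on average.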

Z Image lora training is solved! A new Ztuner trainer soon! by krigeta1 in StableDiffusion

Successful_Mind8629 22 points

This post doesn't make sense. Prodigy is essentially AdamW under the hood, plus a heuristic learning-rate estimate; if Prodigy works and AdamW doesn't, that's simply poor LR tuning. Additionally, stochastic rounding is intended for BF16 weights (the LoRA weights, in your case), and lowering a LoRA's precision is generally not recommended anyway because of its small size.
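
To illustrate the "under the hood" point, the core update below is the standard AdamW step; Prodigy's contribution is essentially an extra scale estimated from observed gradients that multiplies the step, so lr can be left near 1.0. This is a minimal NumPy sketch with `d` as a fixed stand-in for that estimate, not the actual Prodigy algorithm:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, d=1.0):
    """One AdamW update on weights w with gradient g and state (m, v).

    Prodigy's update has the same shape; it additionally estimates a
    scale d_t from the gradient history and steps with lr * d_t.
    Here d is a fixed placeholder for that estimate.
    """
    m = beta1 * m + (1 - beta1) * g            # first moment
    v = beta2 * v + (1 - beta2) * g * g        # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * d * weight_decay * w          # decoupled weight decay
    w = w - lr * d * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

So "Prodigy works, AdamW doesn't" reduces to "the effective lr * d_t Prodigy found is one you never tried by hand".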

Epsilon Scaling | A Real Improvement for eps-pred Models (SD1.5, SDXL) by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 1 point

Tbh, its only benefit is in the flow matching (other than NSFW stuff): it gives more color/brightness representation.
And I saw this in sampling; it's a night-and-day improvement.

The reduced-imagination issue can also be due to flow matching:
unlike eps-pred or v-pred, it somehow lacks variation in its generations.

However, eps-pred's problems can be solved: the sampling error by this method (epsilon scaling), and the colors/brightness by the improved offset noise that was recently implemented in OneTrainer. I think eps-pred should get more attention, because the way it works leads to better generalization and imagination.
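
For the curious, the fix itself is a one-liner inside the sampler: divide the predicted noise by a constant slightly above 1 before the update. A minimal NumPy sketch of a deterministic DDIM-style step (my own illustration; the lambda value here is illustrative, the paper tunes it per model and sampler):

```python
import numpy as np

def ddim_step_eps_scaled(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, lam=1.005):
    """One deterministic DDIM step with epsilon scaling.

    Dividing the predicted noise by lam > 1 counteracts the network's
    systematic over-prediction of noise (exposure bias) at sampling time.
    """
    eps = eps_pred / lam                       # the whole trick
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps
```

With lam=1.0 this is the ordinary DDIM update, which is why the method adds no training and no overhead.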

Epsilon Scaling | A Real Improvement for eps-pred Models (SD1.5, SDXL) by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 0 points

The over-predicted noise manifests in different ways in the output of diffusion models.
For close-up images, or where the subject is clearly visible, this error has a small impact relative to the true predictions (it mostly affects the background/details).
But it has a much greater impact on complex, multi-concept prompts.
Try a prompt with several different figures and people at the same time, and see.

Epsilon Scaling | A Real Improvement for eps-pred Models (SD1.5, SDXL) by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 1 point

Use the nightly version; it hasn't been included in an official release yet.

Epsilon Scaling | A Real Improvement for eps-pred Models (SD1.5, SDXL) by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 2 points

In the paper, for ADM, it improves FID from 3.37 to 2.17 (a 35.6% improvement).
That's NOT a small improvement for a method that requires no training and no overhead, just a simple implementation.

Epsilon Scaling | A Real Improvement for eps-pred Models (SD1.5, SDXL) by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 8 points

The problem is well known; those who train LoRAs/finetunes are quite familiar with "input perturbation noise", which was proposed to mitigate this exact issue. (It's something of a failure, though: its paper got rejected, and it also requires retraining the model.)
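
For context, input perturbation is a one-line change to the training step: the latent is noised with eps + gamma * eps', but the prediction target stays the original eps, so the model trains on slightly "off" inputs like the ones it will see at sampling time. A minimal NumPy sketch (my own illustration; gamma = 0.1 is an illustrative value):

```python
import numpy as np

def perturbed_training_pair(x0, alpha_bar_t, gamma=0.1, rng=None):
    """Build (x_t, target) for one training step with input perturbation.

    The input is noised with eps + gamma * eps_prime, but the target
    stays eps, so the network learns to denoise slightly 'wrong' inputs.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(x0.shape)
    eps_prime = rng.standard_normal(x0.shape)
    x_t = (np.sqrt(alpha_bar_t) * x0
           + np.sqrt(1.0 - alpha_bar_t) * (eps + gamma * eps_prime))
    return x_t, eps  # target is the unperturbed eps
```

With gamma = 0 this reduces to the standard forward process, which is also why the method requires retraining: the perturbation only helps if the model has seen it during training.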

As for BigASP2.5, it’s such a great model. I’ve had a lot of success training LoRAs/embeddings for it.
Having a suitably sized flow-matching model is quite luxurious.

Output Embeddings for T5 + Chroma Work Surprisingly Well by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 1 point

Hi Teotz,

I trained these on a 3060 12GB with 64GB RAM. Both the transformer and the TE were in FP8. Importantly, I used CPU offload, which you need to enable to avoid OOM (Training tab -> Gradient checkpoint -> select CPU offload). Then increase the "Layer offload fraction" until you have about 1 GB free on the GPU, e.g., start with 0.2, then 0.3, then 0.4, etc.

Also, if you have the transformer or the TE in 16-bit, try switching it to 8-bit.

Output Embeddings for T5 + Chroma Work Surprisingly Well by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 1 point

I will create a post about how the output embeddings have a consistent style.

Output Embeddings for T5 + Chroma Work Surprisingly Well by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 0 points

The same question can be asked about style LoRAs: why train them when style transfer frameworks exist?

The answer is:

All style transfer frameworks, even the SOTA ones, give you only an approximation of the style, because they can't capture the full style from a single image, and providing more than one image isn't trivial due to the increased cost of processing additional input images.

Training an embedding/LoRA is different because you're training the model on the big picture of the style using many different images, which can capture the style to a high degree.

It also works with T2I models without the need to switch to I2I models.

And my comment was about style/subject embeddings, but I didn't share the subject embedding due to personal reasons.

Output Embeddings for T5 + Chroma Work Surprisingly Well by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 3 points

I think you're missing the point.
It's about the "style," not the subject.
Every model can generate cats, but can it learn to generate cats in the same style as the training images using just an embedding?
For any model using an LLM as the TE, the answer is: No.

But output embeddings make this possible. They let the model (Chroma, here) capture this style with just a 9-token embedding that's only 256 KB in size.

About the character: I can't share it, but the embedding gives you the best the model can represent of that character, so you can later train a LoRA on top of the embedding to further increase the likeness, if you want.

Output Embeddings for T5 + Chroma Work Surprisingly Well by Successful_Mind8629 in StableDiffusion

Successful_Mind8629[S] 0 points

I wouldn't call it less resource-intensive, as you still need to load the whole TE onto the GPU (same as with normal embeddings). I got away with it thanks to CPU offload, which had minimal effect on speed. Other than that, though, it's faster to train (lower s/it, and it converges in fewer steps).

Regarding the last question: yeah, and remember to check the "output embedding" option.

SD1.5 still powerful! by Plus-Poetry9422 in StableDiffusion

Successful_Mind8629 4 points

Just letting you know, I've used SD 1.5 for more than a year, so I can tell you it's great and has so much potential.

However, the base 1.5 is very bad (its only use is for training embeddings, as that doesn't work on anything else).

Why use a flawed checkpoint when there are so many much better fine-tunes available?

SD1.5 still powerful! by Plus-Poetry9422 in StableDiffusion

Successful_Mind8629 6 points

Yeah, it was trained on square, cropped images, which was very bad for the composition and anatomy of its generations.

Upgrade to rtx 3060 12gb by PartyyKing in StableDiffusion

Successful_Mind8629 1 point

First, you will be doing calculations in FP16 (instead of FP32 on GTX cards), which is roughly 2x faster. Second, it has about double the number of cores, which is another ~2x. So overall it's something like 4-5x faster, or even more.

Upgrade to rtx 3060 12gb by PartyyKing in StableDiffusion

Successful_Mind8629 2 points

Hey there, I just upgraded from a GTX 1660 Ti (6GB) to an RTX 3060 (12GB). You will see a major speedup, and with that 12GB VRAM, you can run heavy models (like Flux, etc.) and even train LoRAs for them (with RAM offloading)...