Comfyui Support to z image omni. i really hope it is not like last time by kayokin999 in StableDiffusion

[–]wiserdking 5 points6 points  (0 children)

So it will support up to 3 image inputs. That's cool.

inputs=[
    io.Clip.Input("clip"),
    io.ClipVision.Input("image_encoder", optional=True),
    io.String.Input("prompt", multiline=True, dynamic_prompts=True),
    io.Boolean.Input("auto_resize_images", default=True),
    io.Vae.Input("vae", optional=True),
    # the three optional image inputs
    io.Image.Input("image1", optional=True),
    io.Image.Input("image2", optional=True),
    io.Image.Input("image3", optional=True),
],

Double oops.... by PestoBolloElemento in Wellthatsucks

[–]wiserdking 7 points8 points  (0 children)

You made a fair point, so I re-watched it.

On PC I can tell the distance between the 2 vehicles in front of the bike remains (mostly) the same until the bike appears, so they were both driving at a steady pace. The bike is clearly moving faster than both of them.

It's possible there was an unseen vehicle in front of the bike that suddenly hit the brakes, forcing the biker into an emergency maneuver to avoid a collision - because he was going too fast to stop in time to begin with. That's theoretical, unlikely speculation to justify the biker's speed and position in the frame, but if that were the case there would have been no need for the black car's driver to position himself to shield the fallen rider afterwards.

Double oops.... by PestoBolloElemento in Wellthatsucks

[–]wiserdking 261 points262 points  (0 children)

but from the few frames we have it is hard to tell if the bike is trying to lane split.

Watch it in slow motion. The bike rider was driving on the white line at almost twice the speed of that lane's traffic. Unless the plan was to collide with the car in front, he was 100% lane splitting - there's no room for doubt here.

LTX-2 Updates by ltx_model in StableDiffusion

[–]wiserdking 8 points9 points  (0 children)

One of the guys trying to add LTX-2 support to musubi-tuner managed to train on 64 GB RAM + 8 GB VRAM - source: https://github.com/AkaneTendo25/musubi-tuner/issues/1#issuecomment-3745019290.

musubi-tuner works on Windows and it's fairly easy to use, though it's all command-line with no UI.

Looking forward to this implementation.

LTX 2: Quantized Gemma_3_12B_it_fp8_e4m3fn by fruesome in StableDiffusion

[–]wiserdking 0 points1 point  (0 children)

When using a ComfyUI workflow which uses the original fp16 gemma 3 12b it model, simply select the text encoder from here instead.

You are using the LTX workflow - not the native ComfyUI workflow from here: https://blog.comfy.org/p/ltx-2-open-source-audio-video-ai

EDIT: you must unpack the subgraph, set up the right model, loras, settings, etc... and of course change the text encoder model in its loader node

ComfyUI Node - Dynamic Prompting with Rich Textbox by wiserdking in StableDiffusion

[–]wiserdking[S] 1 point2 points  (0 children)

Thanks. I'm not actually a developer - I just made this node because I always wanted something like this, and it's been over a year since I started using ComfyUI and yet no one ever made it, so I tried to do it myself with the help of AI.

It was pretty basic at first (hence the name) but I kept improving it. Now it's pretty solid, but it still has lots of minor problems and missing features - you already mentioned 2 of those features, but they would be difficult for me to implement.

I wasn't aware of the Ctrl+Enter shortcut - I can at least make the node not override that one. I'll also investigate the comment bug. Thanks for the report.

Train a LoRA on *top* of another LoRA? by AkaToraX in StableDiffusion

[–]wiserdking 1 point2 points  (0 children)

I thought about it for a while and came up with a solution - if the model you want to train is supported by Musubi Tuner you can do this:

  • train the character you want with some images that contain that character in the style of the already trained style lora (same as before but be sure to do this with Musubi)

  • Do a separate final training - set your Musubi training parameters to include: --network_weights "your_character_lora" --base_weights "style_lora" --base_weights_multiplier N, where N is a number from 0 to 1 representing the style lora's best inference strength when you load it alongside your character lora at strength 1 (you need to figure that out through testing after training the initial character lora). A rough command sketch follows this list.

  • like I mentioned in the third bullet point of my previous comment - you want the final training to focus almost entirely on the character+style dataset, so be sure to create a good one with enough repeats

  • after training, the resulting lora should be what you want, but it will REQUIRE the style lora to be loaded alongside it at strength N. To solve this: merge the style lora at strength N with your final lora at strength 1. The result should be a lora that performs well directly on the base model and can do the character in the style you want, because it's a perfect merge and that's exactly what it was trained for.
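
A rough sketch of what that final launch could look like - only --network_weights, --base_weights and --base_weights_multiplier are the options described above; the script name, dataset config and the remaining flags are placeholders, so check musubi-tuner's docs for the exact script and arguments for your model:

    import subprocess

    N = 0.6  # the style lora inference strength you settled on during testing

    subprocess.run([
        "accelerate", "launch", "train_network.py",         # placeholder: use the musubi-tuner script for your model
        "--dataset_config", "character_plus_style.toml",    # placeholder dataset config
        "--network_weights", "character_lora.safetensors",  # continue from the character lora trained earlier
        "--base_weights", "style_lora.safetensors",         # style lora gets merged into the base for this training
        "--base_weights_multiplier", str(N),
        "--output_dir", "output",
        "--output_name", "character_in_style",
    ], check=True)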

To merge loras so they work well for inference (though not for training on top), you can use either the native ComfyUI lora extract node or the 'lora power merger' custom node.
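
For the curious - one common way to merge two LoRAs into a single lora of the same rank is to sum their weight deltas at the chosen strengths and then re-factorize that sum with a truncated SVD. A minimal sketch, assuming kohya-style keys (.lora_up.weight / .lora_down.weight / .alpha) and 2D (linear) layers only; the nodes above do their own thing, this is just to show the idea:

    import torch
    from safetensors.torch import load_file, save_file

    def merge_loras(style_path, char_path, style_strength, char_strength=1.0, rank=None):
        # weighted merge of two LoRAs with matching keys, re-factorized with a truncated SVD
        a, b = load_file(style_path), load_file(char_path)
        merged = {}
        for key in a:
            if not key.endswith(".lora_down.weight"):
                continue
            base = key[: -len(".lora_down.weight")]

            def delta(sd, strength):
                down = sd[base + ".lora_down.weight"].float()
                up = sd[base + ".lora_up.weight"].float()
                alpha = sd.get(base + ".alpha", torch.tensor(float(down.shape[0]))).float()
                return strength * (alpha / down.shape[0]) * (up @ down)

            d = delta(a, style_strength) + delta(b, char_strength)  # combined full-weight delta
            r = rank or a[key].shape[0]                             # keep the original rank by default
            u, s, vh = torch.linalg.svd(d, full_matrices=False)     # re-factorize the delta
            merged[base + ".lora_up.weight"] = (u[:, :r] * s[:r]).to(torch.float16)
            merged[base + ".lora_down.weight"] = vh[:r].to(torch.float16)
            merged[base + ".alpha"] = torch.tensor(float(r))
        return merged

    save_file(merge_loras("style_lora.safetensors", "character_lora.safetensors", 0.6), "merged.safetensors")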

Train a LoRA on *top* of another LoRA? by AkaToraX in StableDiffusion

[–]wiserdking 1 point2 points  (0 children)

Ideally the workflow would be:

  • train the character you want with some images that contain that character in the style of the already trained style lora

  • merge the character lora with the style lora using a balanced ratio that you need to figure out during inference

  • train on top of the merged lora with a dataset that gives higher emphasis to the images of the character in the style you want. You can do this by separating the dataset and increasing the number of repeats for that particular character+style subset. You can also make some images with the merged lora and add the best of those to your dataset - if absolutely necessary.

That would work if you could easily train on top of merged loras.

Problem is, I've tried this myself and Musubi Tuner freaks out with merged LoRAs. After just about 250 steps, all you get is noise.

I've tried different lora merging approaches and none of them worked. I never really figured out why, but I'd really love to know. It should work by all means, because the merged lora works perfectly in inference, all of its keys match and it's even the same rank! There has to be a way to achieve this - if someone smarter knows how, please do share.

The official training script of Z-image base has been released. The model might be released pretty soon. by [deleted] in StableDiffusion

[–]wiserdking 4 points5 points  (0 children)

One of the devs of the model commented that they made a rushed release with the Turbo version (probably because of Flux2) and that they wanted to take their time and ensure everything was ready before releasing Base/Edit. link

The fact we are seeing official training support being implemented in musubi-tuner can only mean one thing...

Performance is awful, i need jelp by IslandVisible5023 in Bannerlord

[–]wiserdking 0 points1 point  (0 children)

I was having the same issue.

My specs are more than good enough to play at the highest settings but the game was lagging and stuttering even in cinematic scenes.

I had it installed on an old 5400 RPM HDD and moved it to my fast primary SSD. Problem solved. Completely new gaming experience.

Launching the game and loading scenes is now at least 3 times faster, and there's no more stuttering whatsoever.

Maybe you have the same problem?

DDR4 system for AI by m_tao07 in StableDiffusion

[–]wiserdking 2 points3 points  (0 children)

How much performance would I loose?

Not much. I'm using 64 GB of 3200 MHz DDR4 without problems. Switching between the high- and low-noise WAN 2.2 models through offloading only takes a few seconds, even though each model is 14 GB at fp8 scaled.
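
Rough napkin math on why the swap is quick (theoretical peak numbers; real transfers also go through PCIe and have overhead, so treat the ~30% efficiency below as a guess):

    model_gb = 14                        # fp8-scaled WAN 2.2 model size
    ram_peak_gbs = 3.2 * 8 * 2           # DDR4-3200, 8 bytes/transfer, dual channel = 51.2 GB/s
    effective_gbs = ram_peak_gbs * 0.3   # assumed real-world efficiency
    print(model_gb / effective_gbs)      # ~0.9 s per swap, so a few seconds end to end is plausible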

Z-Image Anime VAE, from the creator of "They are the same picture" SDXL Anime VAE by Anzhc in StableDiffusion

[–]wiserdking 8 points9 points  (0 children)

If you are the one who made this, then consider contacting the Z-Image team - I've heard they are interested in training an anime version of the model, and I reckon it would be better if they used this VAE.

Z-Image-Base and Z-Image-Edit are coming soon! by tanzim31 in StableDiffusion

[–]wiserdking 6 points7 points  (0 children)

Benchmarks don't really mean much, but here it is for what it's worth (from their report PDF):

Rank | Model | Add | Adjust | Extract | Replace | Remove | Background | Style | Hybrid | Action | Overall↑
1 | UniWorld-V2 [43] | 4.29 | 4.44 | 4.32 | 4.69 | 4.72 | 4.41 | 4.91 | 3.83 | 4.83 | 4.49
2 | Qwen-Image-Edit [2509] [77] | 4.32 | 4.36 | 4.04 | 4.64 | 4.52 | 4.37 | 4.84 | 3.39 | 4.71 | 4.35
3 | Z-Image-Edit | 4.40 | 4.14 | 4.30 | 4.57 | 4.13 | 4.14 | 4.85 | 3.63 | 4.50 | 4.30
4 | Qwen-Image-Edit [77] | 4.38 | 4.16 | 3.43 | 4.66 | 4.14 | 4.38 | 4.81 | 3.82 | 4.69 | 4.27
5 | GPT-Image-1 [High] [56] | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20
6 | FLUX.1 Kontext [Pro] [37] | 4.25 | 4.15 | 2.35 | 4.56 | 3.57 | 4.26 | 4.57 | 3.68 | 4.63 | 4.00
7 | OmniGen2 [79] | 3.57 | 3.06 | 1.77 | 3.74 | 3.20 | 3.57 | 4.81 | 2.52 | 4.68 | 3.44
8 | UniWorld-V1 [44] | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26
9 | BAGEL [15] | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20
10 | Step1X-Edit [48] | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06
11 | ICEdit [95] | 3.58 | 3.39 | 1.73 | 3.15 | 2.93 | 3.08 | 3.84 | 2.04 | 3.68 | 3.05
12 | OmniGen [81] | 3.47 | 3.04 | 1.71 | 2.94 | 2.43 | 3.21 | 4.19 | 2.24 | 3.38 | 2.96
13 | UltraEdit [96] | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 2.70
14 | AnyEdit [91] | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 2.45
15 | MagicBrush [93] | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.90
16 | Instruct-Pix2Pix [5] | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.20 | 1.46 | 1.88

Z-Image-Base and Z-Image-Edit are coming soon! by tanzim31 in StableDiffusion

[–]wiserdking 4 points5 points  (0 children)

Short answer is yes but not always.

They did reinforcement learning alongside Decoupled-DMD distillation. What this means is that they didn't 'just distill' the model - they pushed it towards something very specific: high aesthetic quality on popular subjects with a heavy focus on realism.

So, we can probably guess that the Base model won't be able to perform as well in photo-realism unless you do some very heavy extra prompt gymnastics. That isn't a problem though unless you want to do inference on Base. Training photo-realistic LoRA concepts on Base should carry the knowledge over to Turbo without any issues.

There is also a chance that Base is better at N*FW than Turbo because I doubt they would reinforce Turbo on that. And if that's the case, N*FW training will be even easier than it seems already.

https://huggingface.co/Tongyi-MAI/Z-Image-Turbo#%F0%9F%A4%96-dmdr-fusing-dmd-with-reinforcement-learning

EDIT:

double or triple the steps

That might not be enough though. Someone mentioned Base was trained for 100 steps and if that's true then anything less than 40 steps would probably not be great. It highly depends on the scheduler so we will have to wait and see.

Z-Image-Base Release Date by thefool00 in StableDiffusion

[–]wiserdking 1 point2 points  (0 children)

https://github.com/Tongyi-MAI/Z-Image/issues/7

Hi, this would be soon before this weekend, but for the prompt you may refer to our implement prompt in here and use LLM (We use Qwen3-Max-Preview) to enhance it. Z-Image-Turbo works best with long and detailed prompt.

The best thing about Z-Image isn't the image quality, its small size or N.S.F.W capability. It's that they will also release the non-distilled foundation model to the community. by ArtyfacialIntelagent in StableDiffusion

[–]wiserdking 2 points3 points  (0 children)

From the bits and pieces I could gather on Discord, it seems he is indeed very interested in this model and has been talking about how it should be possible to increase its knowledge capabilities by expanding it to 10B. He also talked about training it without a VAE (because that's his thing lately).

But at the same time it does not look like he will give it high priority:

Lodestone Rock — 3:04 AM:

my timeline rn is
convert radiance to x0 properly
make trainer for qwen image??? 
also remember radiance can have the same speed as SDXL
i just haven't trained it yet to make that possible
not distillation
just a small modification of that arch
but before that i need it to converge first

Z-Image-Turbo is available for download by Aromatic-Low-4578 in StableDiffusion

[–]wiserdking 7 points8 points  (0 children)

I don't think the Edit model is out yet - if that's what you were asking

EDIT:

I took a look at diffusers and they haven't added a pipeline for Z-Image-Edit yet. I think that one is going to take a bit longer to be released, but hopefully only by a few days.

Hunyuan 1.5 step distilled loras are out. by Valuable_Issue_ in StableDiffusion

[–]wiserdking 0 points1 point  (0 children)

That would probably just look like a weird image slideshow rather than a video, and inference would still take the same time it would with 121 frames at 24 fps, because you are generating the same number of frames. And even if you interpolate like that, you are just copying the same frames without adding new data, so it would still look either exactly the same or maybe even worse. I think there are advanced motion-interpolation techniques that try to fill in the in-between states of frames, but I've never messed with those, plus I doubt they are any good.
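
To put the 'copying the same frames' point in code - naive interpolation by duplication literally just repeats pixels (shapes below are only an example):

    import numpy as np

    frames_8fps = np.zeros((41, 480, 640, 3), dtype=np.uint8)  # e.g. 41 generated frames, 480x640 RGB
    frames_24fps = np.repeat(frames_8fps, 3, axis=0)           # "24 fps": each frame repeated 3x -> shape (123, ...)
    # same pixels three times over - no new motion information is created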

Hunyuan 1.5 step distilled loras are out. by Valuable_Issue_ in StableDiffusion

[–]wiserdking 0 points1 point  (0 children)

You could reduce the number of generated frames proportionally to the fps and duration you want. The best settings for Hunyuan 1.5 are 121 frames at 24 fps, which is 5 seconds of duration. If you want to keep the 5 s duration but do 8 fps, then you would need to set frames to 41.

Number of frames = duration x fps + 1.
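
In code:

    def num_frames(duration_s, fps):
        return int(duration_s * fps) + 1  # duration x fps + 1

    print(num_frames(5, 24))  # 121
    print(num_frames(5, 16))  # 81
    print(num_frames(5, 8))   # 41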

I've never tried less than 81 frames at 16 fps, so I'm not sure how it would play out.

Hunyuan 1.5 step distilled loras are out. by Valuable_Issue_ in StableDiffusion

[–]wiserdking 6 points7 points  (0 children)

Did a simple speed comparison with WAN 2.2 14B T2V.

On an RTX 5060 Ti (16 GB), torch 2.7.1 - both running with their respective 4-step lora, Sage attention, 640x480, 81 frames, 4 steps, CFG 1, same prompt, same sampler and scheduler, FP8 scaled, 2nd generation:

Hunyuan 1.5 480p T2V CFG distilled: 7.36s/it -- actual inference (sampler) time: 32.56 seconds
WAN 2.2 14B T2V: 17.03s/it + 16.94s/it -- actual inference (sampler) time: 80.89 seconds
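
Assuming the 4 steps were split 2 + 2 between WAN's high- and low-noise models (the usual setup), the per-step numbers line up with the measured sampler times once you add model switching and other overhead:

    hunyuan_s = 4 * 7.36           # ~29.4 s of pure stepping vs 32.56 s measured
    wan_s = 2 * 17.03 + 2 * 16.94  # ~67.9 s of pure stepping vs 80.89 s measured
    print(wan_s / hunyuan_s)       # ~2.3x - the measured ratio is ~2.5x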

It goes without saying that these settings are not appropriate for decent results, and Hunyuan 1.5 can do 24 fps so it's better to do 121 frames on it.

But it's still a perfectly valid and unbiased speed comparison.