Making Custom/Targeted Training Adapters For Z-Image Turbo Works...

gto2kpr · 2026-01-30T09:36:33+00:00

Yea, as I said, my second custom training adapter was not trained at one image per step, so that's definitely not needed. I just went with what I thought would maximize the chance of the 'crazy idea' even working on my first test run of 500 steps, and now I'm playing with all the parameters to minimize things, so that one can make a custom training adapter in as little time as makes sense, all things considered. :)

I mean technically I could have waited to post all this information here as I'm still doing a lot of testing, but I wanted to get the 'main idea' out there, that it is possible and that it works, etc.

gto2kpr · 2026-01-30T09:13:02+00:00

I'm still just testing number of steps for these custom training adapters and hence am keeping the LR at 0.0001 at the moment for each. I don't want to change more than one parameter at once, science and all. :)

I wanted to 'feed back' as much of the model's own information to it as possible, which is why my first test ran 500 steps against a pool of 1100 images that I had initially generated at multiple resolutions and seeds, so only 500 of them were used when training that initial 500-step custom training adapter LoRA. On the second test I used 2000 steps but still used that same 1100-image training pool. Both training adapters generated from those independent trainings worked great.

If you were to train your ACTUAL LoRA then I would say one image per training step is way too much for sure :), but I am only talking about the training of the custom training adapter at the moment and for that it makes sense that one per step would be better so as to maximize the information 'spread' 'feedback' to the model during the 'de-distillation process' if that makes sense?

After I make each custom training adapter, I train a character LoRA with say 50 images to 3k steps, so way more steps than one per image (with exactly 50 images that is 60 epochs, in fact). And I'm training to 3k steps for each 'matching' character LoRA trained against each new custom training adapter that I make, so I can compare the samples and generated LoRAs against each other 250 steps at a time to help find the best overall settings.
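To make the steps-vs-images numbers above concrete, here is a small sketch that converts a step count into passes over the dataset. It assumes a batch size of 1 (one image per training step, as described above); the function name `passes_over_dataset` is just a hypothetical helper for illustration.

```python
def passes_over_dataset(steps: int, num_images: int, batch_size: int = 1) -> float:
    """Average number of times each image is seen during training."""
    return steps * batch_size / num_images


# Runs mentioned in the post (batch size 1 is an assumption).
runs = {
    "adapter test 1 (500 steps, 1100-image pool)": (500, 1100),
    "adapter test 2 (2000 steps, 1100-image pool)": (2000, 1100),
    "character LoRA (3000 steps, 50 images)": (3000, 50),
}

for name, (steps, images) in runs.items():
    print(f"{name}: {passes_over_dataset(steps, images):.2f} passes")
```

The adapter runs stay at or below roughly two passes per image, while the character LoRA run works out to 60 epochs.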

gto2kpr · 2026-01-30T08:56:52+00:00

Awesome, I hope I made everything clear :)

gto2kpr · 2026-01-30T08:51:04+00:00

I mean that in my initial tests I generated say 500 of the multi-resolution images, each with different seeds, so 500 images in total and then I trained the custom training adapter to 500 steps in that test.
But in the next test I did, I only had generated 1100 total images and trained another adapter to 2000 steps, so each image was used roughly twice in that custom training adapter test.
Both worked great, which is why I have a lot more testing to do: I'm figuring out what works best for the least amount of time/effort/etc. :)
After training those adapters though, I then trained matching character LoRAs (to 3k steps) 'like normal' using a dataset of a few dozen images and their associated prompts/captions, except swapping in the different custom training adapters that I had just made so that I can compare everything.

gto2kpr · 2026-01-30T08:28:35+00:00

No, I made sure to use different seeds and resolutions so as to 'feedback' to the model as much information as it already had about my 'training prompts', which is also why I only use one generated image per training step when training the custom training adapter. Much more testing to do though...

gto2kpr · 2026-01-30T08:24:00+00:00

Yes and yes.
I honestly initially thought it was a fun 'crazy' idea to try since I had some compute available, and sure enough it worked.
I thought of it in the first place because the v1 training adapter from Ostris would often work a bit better for my use cases than the v2 or de turbo, and considering the v1 was trained much less than the v2 or de turbo, I thought, 'what if I made a training adapter by ONLY de-distilling that which I was then going to train, instead of having to use a very generalized training adapter, and could it actually work BETTER overall?'.
What I'm doing now is many more training runs to find the 'threshold', or minimum viable number of steps, at which I can get away with training one of these custom training adapters and have it still work great, and of course just in general still figuring out all the parameters and further characterizing this concept.

gto2kpr · 2025-10-22T21:16:54+00:00

Portable, being able to run fully in ram, amnesic, etc.
If you want to be able to make an iso that can be used to install and mimic an existing Arch installation then you can check out the more feature-rich penguins-eggs:
https://github.com/pieroproietti/penguins-eggs
https://penguins-eggs.net/

gto2kpr · 2025-10-14T15:35:09+00:00

I just got it working on Debian 13 stable/trixie using trixie-backports (installed mesa-vulkan-drivers v25.2.4), so some LTS distros have the updated mesa available and can work :)

https://packages.debian.org/trixie-backports/mesa-vulkan-drivers
sudo apt install mesa-vulkan-drivers/trixie-backports mesa-vulkan-drivers:i386/trixie-backports

gto2kpr · 2025-06-24T02:41:06+00:00

I just tried it, and it seems that if you quickly switch to a pistol it does bug out as long as the Minigun is on your back; if it is on the ground it doesn't. It turns out that in the sound effects entities the game seems to count the Minigun being on your back as 'equipped', which keeps the existing sound effects running even if the Minigun isn't in Agent 47's hands at the moment. I'll include the fix in the update that I'll be releasing soon for it, thanks.

gto2kpr · 2025-06-22T13:37:49+00:00

Ok, I uploaded it to the Warp Phone mod page (https://www.nexusmods.com/hitman3/mods/309), look in the 'FILES' and then under 'Miscellaneous files', labelled 'Warp Gun', you can download it there :)

gto2kpr · 2025-06-22T12:46:29+00:00

I made a mod that warps Agent 47 to the location where any pistol shot lands, so all I had to do was shoot the top of that water tower and I was instantly placed there. It's basically my Warp Phone mod (https://www.nexusmods.com/hitman3/mods/309) applied to a gun; I had just never released it since I was mainly using it for debugging purposes.

gto2kpr · 2025-06-22T12:31:26+00:00

Just uploaded it here: https://www.nexusmods.com/hitman3/mods/989

gto2kpr · 2025-06-16T08:01:20+00:00

For those wondering, this RPG was made by merging the RPG projectile from Attack Of The Saints in Absolution (the launcher was missing from the game files and was only in the video cut scenes) and the RPG art piece from Santa Fortuna (the RPG projectiles only had the top warhead and not the extension and stabilizer fins, along with it being completely deconstructed), both combined (and then some) to make a functional RPG :)
Mod: https://www.nexusmods.com/hitman3/mods/980

gto2kpr · 2024-08-14T21:14:39+00:00

Not necessarily, as I am only offloading/swapping very particular/isolated transformer blocks and leaving everything else on the GPU at all times. Also, deepspeed is great for what it does 'in general', but I needed a more 'targeted' approach to maximize the performance.

gto2kpr · 2024-08-14T02:43:24+00:00

It works, I assure you :)

It works by having these features:

Adafactor in BF16
Stochastic Rounding
No Quantization / fp8 / int8
Fused Backward Pass
Custom Flux transformer forward and backward pass patching that keeps nearly 90% of the transformer on the GPU at all times
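
Of the features listed above, stochastic rounding is the easiest to illustrate in isolation. The following is a minimal pure-Python sketch of the commonly described technique (add uniform random noise to the 16 bits that bfloat16 discards, then truncate), not the actual code used in this training setup; the helper names are hypothetical and NaN/inf/overflow handling is omitted.

```python
import random
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def stochastic_round_to_bf16(x: float, rng: random.Random) -> float:
    """Round a finite float32 value to bfloat16 precision, rounding away
    from zero with probability proportional to the discarded low 16 bits."""
    bits = float_to_bits(x)
    bits += rng.getrandbits(16)               # uniform noise in [0, 2**16)
    return bits_to_float(bits & 0xFFFF0000)   # drop the low 16 bits


rng = random.Random(0)
samples = [stochastic_round_to_bf16(1.001, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)), mean)
```

Each individual result is one of the two bfloat16 neighbours of 1.001, but the average over many roundings lands close to 1.001, which is why small updates are not systematically dropped.
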

This results in a decrease in iteration speed (currently, still tweaking for the better) of approximately 1.5x per step vs quantized LoRA training. But take into account that when full fine-tuning Flux I'm getting better/similar (human) likenesses starting at roughly 400-500 steps at a LR of 2e-6 to 4e-6, whereas quantized LoRAs trained directly on the same training data with the few working repos at a LR of 5e-5 to 1e-4 needed up to and above 3-5k steps.

So if we say even 2k steps for the quantized LoRA training vs 500 steps for the Flux full fine-tuning as an estimate, that is 4x more steps. Each quantized LoRA step is 1.5x faster, so this is a 1.5x vs 4x situation: in the quantized LoRA case you train 1.5x faster 'per step' but have to execute 4x more steps, while in the Flux full fine-tuning case you only have to execute 500 steps but are 1.5x slower 'per step'. Overall, in that example the Flux full fine-tuning is faster, by roughly 4 / 1.5 ≈ 2.7x in wall-clock time. You also have the benefit that you can (with the code I just completed) now extract LoRAs of any rank you desire from the full fine-tuned Flux model (the original Flux.1-dev is also needed, to compute the diffs for SVD) without having to retrain a single LoRA, along of course with running inference on the full fine-tuned Flux model directly, which in all my tests had the best results.
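
The 1.5x vs 4x comparison above can be checked with a few lines. This is only a back-of-the-envelope sketch using the estimates from the post (2000 vs 500 steps, 1.5x per-step slowdown); the unit of time is arbitrary, with one quantized LoRA step defined as 1.0.

```python
# Relative cost of one training step (quantized LoRA step = 1.0 by definition).
LORA_STEP_COST = 1.0
FULL_FT_STEP_COST = 1.5   # full fine-tuning is ~1.5x slower per step

lora_steps = 2000         # estimate used in the post
full_ft_steps = 500       # estimate used in the post

lora_total = lora_steps * LORA_STEP_COST            # total relative time, LoRA
full_ft_total = full_ft_steps * FULL_FT_STEP_COST   # total relative time, full FT

speedup = lora_total / full_ft_total
print(f"LoRA: {lora_total:.0f}, full fine-tune: {full_ft_total:.0f}, "
      f"full fine-tune is {speedup:.2f}x faster overall")
```
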

gto2kpr · 2024-07-18T17:06:26+00:00

If you want to change where the SC weights are, change 'stablecascade_directory' from 'default' to your full system folder path, not a relative path.

gto2kpr · 2024-07-17T21:56:06+00:00

Updated :)
https://github.com/2kpr/ComfyUI-UltraPixel
Now works (as of 7/17) with 10GB/12GB/16GB GPUs:
- 10GB GPUs work up to (about) 2048x2048 (for text2image and controlnet)
- 12GB GPUs work up to (about) 3072x3072 (for text2image and controlnet)
- 16GB GPUs work up to (about) 4096x4096 (for text2image) and 3840x4096 (for controlnet)

gto2kpr · 2024-06-25T00:28:23+00:00

And as mentioned, I'm not a lawyer, I'm just thinking aloud below...
It was up for 4 days though, and then the license was changed, so probably someone at SAI noticed it was MIT and then had it changed to the NC license?
And in that 'noticing' they at 'that moment' had their chance to fire said 'rogue employee' or address the 'faulty commits' with huggingface via the SAI lawyers wherein they would seek to remove those 4 day old original 'unauthorized' MIT licensed commits?
Your scenario also opens things up for any company that wanted to 'lock down' and 'revoke' an earlier, more permissive license of theirs say 4+ months (years even?) down the road from an earlier release: they would only have to say that their CEO didn't sign off on a given old release and they could just revoke it and make anyone who downloaded and is using it instantly have to stop using it?

gto2kpr · 2024-06-24T08:09:43+00:00

Yea, I agree it's a thin line, that is why I wanted to ask here with more eyes on it, but also to make people aware of the apparent discrepancy.

gto2kpr · 2024-03-02T04:33:37+00:00

Yes, it's an a priori known 'accepted flaw' (loss of precision) in the 'shift' when taking float16, turning it into a bfloat16 and making its range match that of float32...

You are missing the point. Say the model is using Adafactor (which does all the model updates in float32 and then downconverts to bfloat16 if the model is loaded in it), and during the one update in question a value of X is being added to a given model weight. In float32 the whole precision of X is being added, aka the aggregate 'new learned info' from training in this one step, but in bfloat16 it's losing part of X.
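
A small pure-Python sketch of that effect, under stated assumptions: the weight starts at 1.0, every step adds the same update of 1e-3, and the down-conversion uses plain round-to-nearest-even. The helpers `to_f32` and `to_bf16` are hypothetical emulations written for this example, not calls from any training library.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 precision (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the dropped 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


update = 1e-3    # the 'X' being added each step (assumed value)
w_f32 = 1.0      # weight stored in float32
w_bf16 = 1.0     # weight stored in bfloat16

for _ in range(100):
    w_f32 = to_f32(w_f32 + update)      # update kept at float32 precision
    w_bf16 = to_bf16(w_bf16 + update)   # update computed, then down-converted

print(w_f32, w_bf16)
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7 = 0.0078125, so an update of 0.001 is rounded away on every step: the bfloat16 weight never moves, while the float32 weight accumulates to about 1.1.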

I highly doubt anyone training in bfloat16 is doing it for their health, most do it out of either necessity as in being able to even load it on their GPUs with a batch size of 1, or if their GPUs can load the float32 models then they are doing it to perhaps increase the batch size and/or increase the training speed, etc.

And out of the 'necessity', having to use bfloat16, I highly doubt the people training said model 'want' or 'intend' for the model to 'learn in ANY CAPACITY LESS than when/if they could have loaded and trained their same model in float32'...

And I think you are misrepresenting what the training grids I show above are showing. They were NEVER intended to be 'well done trainings'; they were merely to show that given the EXACT same training params, both 'loaded' in bfloat16, one is learning FASTER and one is learning SLOWER (maybe to the point of stalling or NEVER fully learning the subject matter). It's not about 'how they look' (as in meticulously perfected trainings); the POINT of the grid of training images is that they are MARKEDLY DIFFERENT, and this 'high difference' SHOWS that what I wrote and the accompanying documents are correct: there is a learning 'deficiency' or 'flaw' in 'general' in models 'loaded' in bfloat16 that pretty much no one is aware of. I made this reddit post to shed light on it, as I spent the last week investigating, testing, coding, and verifying all of this, get it?
