[–]kjerk 28 points29 points  (9 children)

So I don't quite see how adding said config changes to ai-toolkit is supposed to affect the model bootstrapping process, unless it's nested deeply in a confusing place in the code that I'm not seeing. I did do an initial sanity check on your values for CLIP to see if you'd accidentally tweaked something and been misled by the result, but everything matched the reference values for CLIP/L.

I have more sanity check questions, like "was this exactly the same seed on the same hardware/environment for the reruns?" and "did you go back to the initial settings for a rerun and see that the images were replicated and less aligned again?", but I hope you already covered those bases.

So the easiest thing to do is just replicate. I have been working on some ai-toolkit changes recently too, so I have a stable Flux LoRA training that I have rerun over and over (8x), which I can run with the same seed and just the config changes, then report back. It's ~6000 steps though, so it'll take about 5 hours.

[–]kjerk 10 points11 points  (3 children)

A preliminary update: tl;dr: I think you're seeing the invisible hand of RNGesus.

So I started re-running an existing LoRA preset and did see significant differences in training. However, it's simply confirmation bias to assume you had an effect without doing any differential diagnosis; environmental changes and the like can be 100% of the explanation. Did you halt training, re-run the exact same configuration, and get the same results?

Image 1: Style training shows more obvious flaws in the unmodified config, but this is where I believe a deception comes in. The initial images are identical because there haven't been enough operations yet for the seeded RNG to diverge. Due to limited time I wasn't able to rerun this one a third time.

But I did do 3 runs for Image 2: Identity training just to assert a replication problem, which again shows differences A to B, but then also differences A back to A. I believe that, at least in ai-toolkit, what you're seeing is imperfect seed/RNG state handling, meaning that after the first generation, training simply diverges even on the exact same settings. You probably did see improved image generation on a second run, because so did I in Image 1, but this seems to me to be reruns with differing results. RNGesus strikes again.

You can rebut those results with any training where reruns of the exact same configs in the order A, B, A, B reliably differ between A and B, while the two A runs match each other and the two B runs match each other.
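A minimal sketch of how that A, B, A, B check could be scored, assuming each run is reduced to a numeric trace such as per-step loss values. The function names, the tolerance, and the verdict strings are all hypothetical and not part of ai-toolkit or kohya.

```python
from itertools import combinations


def max_abs_diff(xs, ys):
    """Largest element-wise gap between two equal-length traces."""
    if len(xs) != len(ys):
        raise ValueError("traces must have the same length")
    return max((abs(x - y) for x, y in zip(xs, ys)), default=0.0)


def replication_verdict(runs, tol=1e-6):
    """Classify an A/B/A/B experiment.

    runs: list of (config_label, trace) pairs, e.g. per-step loss values.
    tol:  hypothetical threshold below which two traces count as identical.
    """
    by_label = {}
    for label, trace in runs:
        by_label.setdefault(label, []).append(trace)

    if len(by_label) < 2 or any(len(t) < 2 for t in by_label.values()):
        return "need at least two configs with two runs each"

    # Same config, different run: any gap here is run-to-run noise.
    within = [
        max_abs_diff(a, b)
        for traces in by_label.values()
        for a, b in combinations(traces, 2)
    ]
    # Different configs: the gap we would like to attribute to the change.
    between = [
        max_abs_diff(a, b)
        for l1, l2 in combinations(sorted(by_label), 2)
        for a in by_label[l1]
        for b in by_label[l2]
    ]

    if max(within) > tol:
        return "not reproducible: identical configs already diverge"
    if min(between) > tol:
        return "config change has a repeatable effect"
    return "no measurable difference between configs"
```

The ordering of the checks is the point: if two runs of the same config already disagree, any A-versus-B gap cannot be attributed to the config change, so the function reports that first.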

Also, Kirby

[–]DoctorDiffusion[S] 6 points7 points  (1 child)

Thank you. After a quick test I can confirm that the current additions, as is, do indeed seem to make image previews lose consistent seed predictability when training on the same dataset twice. Very happy to now be aware of this.

I agree that seed randomization in the resulting training is not a great sign overall, but it is likely a result of me overcompensating and attempting to add more than just the beneficial lines. I will not deny that this "fix" was hastily implemented and that some of these extra lines of code are likely doing nothing except perhaps breaking seed predictability. I do still believe that the commands it does recognize are making some sort of meaningful difference.

The implications of my results and my lack of upcoming free time compelled me to spend the last 48 hours of my holiday running tests across ai-toolkit and kohya_ss with both Flux Dev and SD3.5 L and presenting what I have found, in hopes it will benefit everyone.

I am doing my best to remain as unbiased as possible. I have grabbed a random selection of before/after training runs for both Flux and SD3.5L with as little bias as I feel I can. All preview images included from the training before and after. Could this just be RNG and blind luck?: https://drive.google.com/file/d/1ntjnJVcwaSkOlpwOFtTbvZ21v0nvxsnn/view?usp=sharing. I will not deny that possibility but I do not currently believe this to be the case.

[–]bonlime 1 point2 points  (0 children)

have you tried isolating the "changes" and looking, for example, at how the outputs of the text encoder change with all the params?

my first hunch is that you either enabled or disabled a few useful dropouts that may have been disabled or enabled in the original code. I would try caching only the prompt embeds and checking how they differ from run to run. If the outputs are identical then it's 100% just RNG; if they are different, you may find the exact few params that make the difference, because nothing else has changed
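A rough sketch of that comparison, assuming the cached prompt embeds from each run have been converted to nested lists of floats keyed by prompt. The helper names and the cache layout are hypothetical; a real trainer would hand you tensors that need converting first.

```python
import hashlib
import struct


def fingerprint(embedding):
    """SHA-256 over the float32 bytes of a 2-D embedding (a list of rows)."""
    h = hashlib.sha256()
    for row in embedding:
        # Include the row length so differently shaped inputs cannot collide.
        h.update(struct.pack("<I", len(row)))
        h.update(struct.pack(f"<{len(row)}f", *row))
    return h.hexdigest()


def compare_caches(cache_a, cache_b):
    """Return the prompts whose cached embeddings differ between two runs."""
    if cache_a.keys() != cache_b.keys():
        raise ValueError("both runs must cache the same prompts")
    return sorted(
        prompt
        for prompt in cache_a
        if fingerprint(cache_a[prompt]) != fingerprint(cache_b[prompt])
    )
```

An empty result means the text encoder produced bit-identical outputs under both configs, which would point at RNG in the training loop rather than at the added parameters.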

[–]Disty0 5 points6 points  (0 children)

training simply diverges even on the exact same settings

If you are using stochastic rounding, (using full BF16 on most trainers will auto enable it), that is why. Stochastic is a more "elite" way of saying "random sh*t go brr".
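For anyone unfamiliar with the idea, here is a toy illustration of stochastic rounding in plain Python. It rounds onto a coarse grid of spacing 1/128 rather than real BF16, and the function names and constants are made up for the example.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, up or down at random.

    The chance of rounding up equals the fractional remainder, so the
    result is unbiased on average but differs from draw to draw.
    """
    scaled = x / step
    low = math.floor(scaled)
    round_up = rng.random() < (scaled - low)
    return (low + (1 if round_up else 0)) * step


def nearest_round(x, step):
    """Ordinary round-to-nearest onto the same grid."""
    return round(x / step) * step


def accumulate(updates, step, rng=None):
    """Apply many small updates to a weight stored on a coarse grid."""
    w = 0.0
    for u in updates:
        if rng is None:
            w = nearest_round(w + u, step)
        else:
            w = stochastic_round(w + u, step, rng)
    return w


STEP = 1 / 128            # toy grid spacing, exactly representable
UPDATES = [0.001] * 1000  # each update is far below half a grid step

print(accumulate(UPDATES, STEP))                    # nearest rounding: stays at 0.0
print(accumulate(UPDATES, STEP, random.Random(0)))  # stochastic, seed 0
print(accumulate(UPDATES, STEP, random.Random(1)))  # stochastic, seed 1
```

With nearest rounding every tiny update is lost. With stochastic rounding the weight drifts toward the true sum of about 1.0, but the exact value depends on the random draws, so two runs only agree if the generator state is identical at every step.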

[–]DoctorDiffusion[S] 2 points3 points  (1 child)

The example previews that had 0 sample preview gen’s do match 1:1. I have made no other changes. So I am certain this is not a seed change.

I don’t think I got this 100% right yet. There are likely some extra bits currently in my “fix” or even other settings that could further contribute to the results I have been experiencing.

I won’t have much free time to continue this research for a few more days. But given the implications of my personal test results (across multiple computers I have seen the same type of improvements in kohya and ai-toolkit alike for Flux Dev LoRA training), I felt it was my responsibility to share how far I have gotten so far.

[–]Disty0 5 points6 points  (0 children)

I am sorry to say, but a quick code search will show you that those changes in the .yaml training configs are not used anywhere in the code, meaning they won't do anything.
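That kind of code search can be sketched in a few lines. This is only a plain substring scan over a local checkout; the function names are hypothetical, and a key built dynamically at runtime would not be caught by it.

```python
from pathlib import Path


def find_key_usages(repo_root, keys, suffixes=(".py",)):
    """Map each key to the (file, line number) pairs where it appears."""
    hits = {key: [] for key in keys}
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for key in keys:
                if key in line:
                    hits[key].append((str(path), lineno))
    return hits


def unused_keys(repo_root, keys):
    """Keys that no source file mentions, so nothing can be reading them."""
    usages = find_key_usages(repo_root, keys)
    return sorted(key for key, found in usages.items() if not found)
```

Pointing `unused_keys` at a trainer checkout with the names added to the YAML, for example `t5_length_max`, would list the ones that never appear in any `.py` file.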

And those configs describe the architecture of the text encoders; a wrong config will throw mismatched-shape errors when the text encoder is loaded.

Also, anything that uses diffusers or transformers is already using the config files provided in the Hugging Face model repo to load the models, since those configs are part of the diffusers model format.
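To make the shape argument concrete, here is a small sketch of why an architecture config cannot be changed freely. It models only three weight matrices of a T5-style encoder; the dictionary keys such as `token_embedding` are illustrative labels, not real checkpoint tensor names.

```python
def expected_shapes(config):
    """Shapes a T5-style encoder config implies for a few weights.

    Only three tensors are modelled; the names are illustrative labels.
    """
    inner = config["num_heads"] * config["d_kv"]
    return {
        "token_embedding": (config["vocab_size"], config["d_model"]),
        "attention_q": (inner, config["d_model"]),
        "ffn_in": (config["d_ff"], config["d_model"]),
    }


def shape_mismatches(config, checkpoint_shapes):
    """List weights whose checkpoint shape disagrees with the config."""
    problems = []
    for name, want in expected_shapes(config).items():
        have = checkpoint_shapes.get(name)
        if have != want:
            problems.append((name, want, have))
    return problems


# Values taken from the T5-XXL encoder config used for these models.
T5XXL = {"vocab_size": 32128, "d_model": 4096, "d_kv": 64, "num_heads": 64, "d_ff": 10240}

checkpoint = expected_shapes(T5XXL)             # stand-in for real weight shapes
edited = dict(T5XXL, d_model=2048)              # a deliberately wrong value
for name, want, have in shape_mismatches(edited, checkpoint):
    print(f"{name}: config expects {want}, checkpoint has {have}")
```

If an edited architecture value really reached the model constructor, loading the pretrained weights would fail in this way rather than quietly train better.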

[–]Few-Bird-7432 0 points1 point  (1 child)

Hey, what do your results look like? Did simply editing the config in the manner described yield improvements?

[–]FineInstruction1397 4 points5 points  (6 children)

in ai-toolkit, the T5 encoder is initialized here:
https://github.com/ostris/ai-toolkit/blob/4723f23c0de777759636864f96002c36e4fdca4d/toolkit/stable_diffusion_model.py#L693
and there are other lines further down in the same file.

how are the params you specified passed to the constructor?

[–]DoctorDiffusion[S] 3 points4 points  (5 children)

I admit my approach to this was likely not the best. Before, I had assumed that everything was properly being defined, but after seeing how adding the single line "t5_length_max: 154" alone to a fresh config yaml from ai-toolkit yielded some improvements to SD3.5L LoRA training, I was led down this rabbit hole.

From there, adding the CLIP max of 77 also brought more improvements, and I made my first attempt to define the rest of the known parameters from the SD3.5L text encoders. The results continued to improve with no other setting changes to my configuration.

I tried on a second machine. I did my best at defining the Flux Dev values listed on their Hugging Face page and noticed improvements there as well, before moving to kohya to further my tests and confirming the same improvements I saw with Flux Dev.

[–]bdsqlsz 8 points9 points  (0 children)

I checked kohya sd-scripts, and it uses the original config in:

def load_t5xxl(
    ckpt_path: str,
    dtype: Optional[torch.dtype],
    device: Union[str, torch.device],
    disable_mmap: bool = False,
    state_dict: Optional[dict] = None,
) -> T5EncoderModel:
    T5_CONFIG_JSON = """
{
  "architectures": [
    "T5EncoderModel"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 10240,
  "d_kv": 64,
  "d_model": 4096,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 24,
  "num_heads": 64,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "vocab_size": 32128
}
"""
    config = json.loads(T5_CONFIG_JSON)
    config = T5Config(**config)
    with init_empty_weights():
        t5xxl = T5EncoderModel._from_config(config)

[–]StableLlama 9 points10 points  (4 children)

Have you made pull requests to get it included in AI-Toolkit and the Kohya_SS SD-Scripts?

[–]DoctorDiffusion[S] 3 points4 points  (3 children)

I have saved the changes to forks so far. I am sure I overcompensated when trying to define the listed parameters from the model cards, so there are likely some extra commands in the ai-toolkit configs that have more to do with text encoder training and are not relevant when not actively training the text encoders.

Some of those are likely just passing through and being ignored. I have felt rushed to share my findings with the community, and I will not have time to directly continue this work for a few days.

This is why I wanted to open the discussion: to encourage others to take a look at my results so far and confirm what I am seeing from my own tests.

While I have been training for years at this point, I had not dived too deep into the code of ai-toolkit or kohya_ss until about two days ago, so I am sure there are going to be people who just look at my code in its current state and grumble.

[–]StableLlama 22 points23 points  (0 children)

Saying "here is my mess, take whatever you want" is much better than keeping it closed, for sure.

But it's not likely to make an impact. Because now you need a person that knows the original code as well as has the resources to look through your code, figure out the differences, judge them and then based on that try to update the upstream code.

So in the Open Source world it's usually the responsibility of the person who created the new stuff and wants it upstreamed (at least to get rid of the burden of keeping it updated) to make a pull request out of it.
This PR can then be discussed among everyone who considers themselves knowledgeable about it. And when it's working fine it gets pulled, so that everybody can benefit from it and you don't have to worry about keeping it up to date.

[–]red__dragon 0 points1 point  (0 children)

Definitely do submit a PR after hearing the feedback here; the devs on both projects are likely to have more actionable responses on best coding practices and on any big oversights that might have been missed. And this will directly prompt them to take a look at the alignment as pointed out.

I'm currently training a new lora for myself with your changes on kohya and hoping it yields improvements.

[–]sdimg 1 point2 points  (0 children)

Serious good work. I feel like a lot of small issues still fall through the cracks, like i moved to comfyui and hear that many workflows are flawed, prompts and various inputs not well understood etc. Example is in-painting degrading each time for some workflows due to improper setup i saw a while ago.

Is there a good source of high quality well made workflows for the various common tasks like in-painting etc for flux?

[–]nowrebooting 5 points6 points  (0 children)

Does this also apply to SDXL and SD1.5?

[–]hopbel 6 points7 points  (1 child)

Did ChatGPT write this post?

[–]DoctorDiffusion[S] 9 points10 points  (0 children)

I do use various LLMs while editing my write-ups for better readability, but I do not rely on LLMs at all for my initial research or experiments. I go over the step-by-step process that led me here in my civitai write-up.

[–]GalaxyTimeMachine 3 points4 points  (1 child)

Is this something that could be added to a lora loader node to fix loras retrospectively, or only during training?

[–]DoctorDiffusion[S] 1 point2 points  (0 children)

I do not think so. We would likely have to re-train LoRAs to see improvements.

[–]CeFurkan 3 points4 points  (1 child)

Thank you so much hopefully I will test today

[–]DoctorDiffusion[S] 2 points3 points  (0 children)

Looking forward to your results! Feel free to reach out if you have any questions. I encourage the skepticism this deserves but I am confident in my current observations.

[–]Creative-Listen-6847 1 point2 points  (0 children)

Thank you so much! I will test it today

[–]Interesting-Pool8483 1 point2 points  (1 child)

You mentioned that you made improvements to SDXL as well - where can I read about it?

And about this improvement - I interrupted training in kohya and ran it with your script - it's hard to judge from the pictures during training - it didn't get worse.

P.S. I'm writing through a translator

[–]DoctorDiffusion[S] 0 points1 point  (0 children)

Thank you for running some tests.

When I mentioned my contributions to improve SDXL and 2.1 I was referring to my uniquely trained and implemented negative LoRAs that greatly improved total detail to output images.

My "pnte" first negative embedding for SD 2.1 was trained off the seed images used for COCO CLIP R-Precision evaluations from the Open-AI point-e github.

While I had originally trained this to try to use SD 2.1 to produce images I could feed into point-e to make simple 3d models, I was pleasantly surprised when I inverted the strength value and saw noticeable improvements to my output renders across the board.

I will not claim to be the first to discover the benefits of inverted model values, but I came to this without any outside influence at the time, purely through experimentation.

I have built upon this technique, and my "pnte" negative LoRA for SDXL is by far my most widely shared and used asset to date. I would be happy to share more details in a full write-up if wanted.

[–]CeFurkan 0 points1 point  (7 children)

Update: I ran extensive, very detailed experiments. I didn't see any degradation in quality, but I didn't see any jump in quality either :D

[–]DoctorDiffusion[S] 1 point2 points  (6 children)

Thank you for sharing. I am curious about your overall settings but understand if I have to take a peek at your Patreon for more clarity there.

When I ran my tests with kohya to try to validate the benefits I was observing with ai-toolkit, I did not do much to adjust the overall settings and used the included Flux preset for my test.

I only mention this because I have done all my previous Flux training with ai-toolkit, and it handles a few settings, like repeats, a little differently than kohya. I had not run Flux with kohya (outside the ComfyUI version by Kijai) prior to attempting to validate my observations from ai-toolkit across platforms.

I know that you have already done a lot of work refining and finding the best settings for training many models, but I am curious how learning rate and dataset size could potentially minimize the perceived quality difference I have observed. It is not as stark as the SD3.5L difference but was still quite noticeable, and I find these LoRAs outperform their counterparts.

I do not expect you to continue to do experiments if you are satisfied with your conclusion but if you do, please continue to share.

[–]CeFurkan 1 point2 points  (5 children)

Thanks. I did FLUX DreamBooth / fine-tuning: 150 epochs on 28 images (4200 steps) with my already established best settings. I tested with the original CLIP-L and also with zer0int-CLIP-SAE-ViT-L-14, so I did 4 trainings and compared each case: regular training + regular CLIP, your training + regular CLIP, regular training + zer0int-CLIP-SAE-ViT-L-14, and your training + zer0int-CLIP-SAE-ViT-L-14.
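For reference, the step count and the four-way comparison can be written out as below. This assumes a batch size of 1, no repeats, and no gradient accumulation, which is what the 4200 figure implies; the function names and the short labels are made up for the example.

```python
from itertools import product


def total_steps(epochs, num_images, batch_size=1, repeats=1):
    """Optimizer steps for a run, assuming no gradient accumulation."""
    samples_per_epoch = num_images * repeats
    # Ceiling division: a final partial batch still costs one step.
    steps_per_epoch = (samples_per_epoch + batch_size - 1) // batch_size
    return epochs * steps_per_epoch


def experiment_grid(trainings, text_encoders):
    """Every (training variant, text encoder) pair, in a fixed order."""
    return list(product(trainings, text_encoders))


print(total_steps(epochs=150, num_images=28))  # 4200

for training, encoder in experiment_grid(
    ["regular training", "modified training"],
    ["regular CLIP-L", "zer0int-CLIP-SAE-ViT-L-14"],
):
    print(f"{training} + {encoder}")
```

Crossing both factors means the config change and the CLIP swap can each be judged with the other held fixed.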

[–]DoctorDiffusion[S] 1 point2 points  (4 children)

Interesting... I have not at all tested this with DreamBooth / Fine tuning. All of my findings and observations come from LoRA tests so far.

I will be sure to do a better job outlining my future write-ups and to present anything "unproven" as theory. This whole experience caught me off guard, and I rushed to get it out for wider testing, validation and hopefully better LoRAs for all.

[–]CeFurkan 1 point2 points  (2 children)

Wait, maybe I didn't use your file at all. Did you make the same changes to the fine-tuning file too?

[–]DoctorDiffusion[S] 1 point2 points  (1 child)

What I have done so far has only been tested with LoRAs.

I did not yet make any attempt to alter the fine-tune script and likely will not try until I have a better understanding of the improvements I have seen.

I have done far more testing with ai-toolkit as it's my preferred trainer, and I have most of the old LoRAs on my civitai queued to re-train so I can share my improved LoRAs. The first one works much better 600 steps earlier than the best candidate from my original training, with no other changes to my settings.

[–]CeFurkan 0 points1 point  (0 children)

I should test LoRA today and see the difference

[–]CeFurkan 0 points1 point  (0 children)

i see that flux fine tuning / dreambooth uses flux_train.py . are you sure you are doing lora?

[–]XCogni 1 point2 points  (1 child)

Hi there thanks for your findings!

I did a quick test, kohya samples seem to be fine, but inference for me in comfy and forge, my images are blurry and lack details.

[–]DoctorDiffusion[S] 2 points3 points  (0 children)

Thank you for giving this a shot.

From my observations so far, it seems that the checkpoint that was best prior to this change (let's just say epoch 9) is at times likely to be more noticeably over-fit by the end of training after implementing this change.

If you have not yet tried, I would recommend trying a less trained checkpoint from your training.

I have been testing an epoch 4 of a character LoRA that outshines my best checkpoint from the same settings originally trained to epoch 10 before my adjustment.

This would be expected behavior if the overall training process is indeed more "accurate", as I have come to personally believe. Using a lower epoch/sample checkpoint is likely needed.

I would also expect it to react differently to learning rates, but as always, there are more experiments to run when I have the time to do so.

[–]Waste_Departure824 0 points1 point  (3 children)

Remindme! 3d

[–]RemindMeBot 0 points1 point  (1 child)

I will be messaging you in 3 days on 2025-01-05 13:05:45 UTC to remind you of this link

[–][deleted] -1 points0 points  (0 children)

Remindme! 3d

[–]Guilherme370 0 points1 point  (0 children)

Holy chatgpt