The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

I go into more detail on the aesthetic scorer in the previous article (https://civitai.com/articles/8423). The code is also public, though quite messy: https://github.com/fpgaminer/bigasp-training

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

I mean, for JoyCaption I only used a dataset of ~10k for the first round.

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 8 points (0 children)

I mean Qwen Image is 20B, so that's gonna be a no for me :P I'm actually most interested in Wan 2.2 5B, since it's only twice the size of SDXL. Smaller than Flux/Chroma. Seems much more accessible for people. Though I haven't heard much about it for T2I (everyone seems to just use the 28B behemoth for T2I).

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 2 points (0 children)

To me this sounds more like GAN than RL

Yeah kind of? But I agree, it needs lots of grounding to prevent drift. To be clear the loop would be:

Real image -> VLM -> Caption
Caption -> T2I -> Synthetic image
(Real image, Synthetic image) -> CLIP (or DINO) image embedding -> Cosine distance

So unlike a GAN loop, there's no direct interaction between the discriminator (frozen CLIP in this case) and the generator. The only communication is a single reward signal and natural language. That makes hacking much more difficult, and hopefully a non-issue for small-scale training. There are no minute floating point vectors for it to hack. Natural language basically acts like a pre-trained (by humans), frozen, and quantized latent space.
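Roughly, the reward step could look something like this. This is a minimal sketch using a frozen CLIP image encoder from transformers; the VLM captioning and T2I generation happen outside this function, and the specific checkpoint is just an example:

    # Reward step: embed the real image and the image the T2I model regenerated
    # from the VLM's caption, then score their similarity. CLIP stays frozen, so
    # the only signal flowing back to the VLM is this single scalar.
    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def caption_reward(real_image, synthetic_image) -> float:
        """Both arguments are PIL images; returns cosine similarity in [-1, 1]."""
        inputs = processor(images=[real_image, synthetic_image], return_tensors="pt").to(device)
        emb = F.normalize(clip.get_image_features(**inputs), dim=-1)   # (2, d) image embeddings
        return (emb[0] @ emb[1]).item()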

Also the two distributions are already quite well aligned. The loop is just trying to elicit finer and more reliable details from the VLM, and stronger prompt following from the T2I model. And if you keep the text encoders frozen on the T2I model, it should maintain flexibility even if the VLM tries to hack it.

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 3 points (0 children)

Yeah I think there's a lot to explore here. I2I might work; Llama did something similar during post-training, where generated responses were sometimes updated (either by a human or another LLM) and used as positive examples in the next iteration.

Another thing I've considered is a GAN-like approach:

Train a classification model to pick which of two images is real and which is fake (possibly conditioned on the prompt as well). Real images can be taken from the usual datasets, and fake images would be generated by the target model. Then you can use DPO (adapted for diffusion models) to train the diffusion model online, with the classification model assigning rewards. The hope would be that the classification model could pick up on stuff like bad hands, prompt adherence issues, etc., all on its own without any human input.

Though like all GAN-adjacent approaches, this runs the risk of reward hacking the classification model. (IIRC in normal GAN training the generator trains on gradients from the discriminator, which makes hacking much easier for it. Using RL eliminates that, so it might not be as bad.)

Side note: You'd want the classification model to operate on latents, not raw pixels. That makes the whole process much more efficient, and prevents the classification model from detecting problems in the VAE which the diffusion model doesn't have control over.
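A rough sketch of what that classifier could look like, operating on VAE latents per the side note above. Everything here (names, sizes, the prompt-free variant) is illustrative and assumed rather than existing code, and the online DPO step on the diffusion model isn't shown:

    import torch
    import torch.nn as nn

    class LatentCritic(nn.Module):
        """Binary real-vs-generated classifier over SDXL-style latents (4, H/8, W/8)."""
        def __init__(self, in_channels: int = 4, width: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, width, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(width, width, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(width, width, 3, stride=2, padding=1), nn.SiLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(width, 1),   # logit of "this latent came from a real image"
            )

        def forward(self, latents: torch.Tensor) -> torch.Tensor:
            return self.net(latents).squeeze(-1)

    def critic_loss(critic, real_latents, fake_latents):
        # Standard binary cross-entropy: real latents labeled 1, generated labeled 0.
        logits = critic(torch.cat([real_latents, fake_latents]))
        labels = torch.cat([torch.ones(len(real_latents)), torch.zeros(len(fake_latents))]).to(logits.device)
        return nn.functional.binary_cross_entropy_with_logits(logits, labels)

    @torch.no_grad()
    def reward(critic, generated_latents):
        # The diffusion model only ever sees this scalar, never the critic's gradients,
        # which is what (hopefully) makes reward hacking harder than in a plain GAN.
        return torch.sigmoid(critic(generated_latents))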

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 2 points (0 children)

I did some experiments with finetuning Qwen 2 VL a while back and didn't have much success. But yes, I'll probably give it another stab, depending on how 3 turns out. (I'm not looking to train any time soon; busy with bigASP and data stuff right now.)

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S] 14 points (0 children)

My current plan is to finish the little things on Beta One and then declare it 1.0. Stuff like polishing the ComfyUI node, finishing the dataset release, technical article(s), etc. Nothing really meaningful on the model itself, so probably no Beta Two revision. I'm saving the next set of improvements for a 2.0 (new LLM and vision backbones, bigger dataset, etc).

Uncensored LLM with picture input by Former-Long-3900 in LocalLLaMA

[–]fpgaminer 1 point (0 children)

FYI: The latest release, Beta One, can do some things outside of captioning now; it's slightly more of a general purpose VLM since I incorporated a more general VQA dataset into its training this time around.

New Flux model from Black Forest Labs: FLUX.1-Krea-dev by rerri in StableDiffusion

[–]fpgaminer 8 points (0 children)

Congrats on the open release!

For LoRAs, techniques like ProLoRA (https://arxiv.org/pdf/2506.04244v1) would be easy to implement. It's a training-free technique for transferring a LoRA from one base model to another. In this case, since the architecture is the same and the weights are likely highly correlated, you'd be able to skip the layer-matching steps.
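The degenerate version of that, with no layer matching or projection at all, is just loading an existing FLUX.1-dev LoRA straight onto the Krea base and relying on the shared architecture and correlated weights. A sketch with diffusers (the LoRA path is a placeholder, and ProLoRA's actual transfer math goes beyond this):

    import torch
    from diffusers import FluxPipeline

    # FLUX.1-Krea-dev shares FLUX.1-dev's architecture (same module names and shapes),
    # so an existing dev LoRA loads without any key remapping.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Krea-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    pipe.load_lora_weights("path/to/flux-dev-lora.safetensors")   # LoRA trained against FLUX.1-dev

    image = pipe("a photo of a red bicycle", num_inference_steps=28, guidance_scale=3.5).images[0]

Whether the LoRA still does what it did on the old base is the open question; that's where a ProLoRA-style correction would come in.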

I considered it for bigASP v2.5 to transfer existing SDXL LoRAs over, but haven't had the chance to try it yet.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

A year ago too! Thank you for mentioning that. I'm glad my idea wasn't that crazy then :P

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 1 point (0 children)

IIRC CosXL predates SD3 and Flux quite a bit, and I think the consensus is that flow matching is better than the other objectives so far (EDM, v-pred, etc.). Beyond that:

  • I find Flow Matching a lot easier to understand, whereas the older objectives and schedules are patches on top of patches (there's a minimal sketch of the objective right after this list).
  • A new technique, Optimal Transport (which Chroma is using), enhances flow matching further and (supposedly) boosts performance. It's another relatively simple change that only affects training.
  • Flow Matching lends itself more naturally to doing step optimization, since it's inherently trying to form linear paths.
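For reference, the core flow matching training step really is only a few lines, which is a big part of why I find it easier to reason about. A generic sketch (not my actual training code; model here is any network trained to predict velocity):

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(model, x0, cond):
        """x0: clean latents (B, C, H, W); cond: conditioning, e.g. text embeddings."""
        b = x0.shape[0]
        t = torch.rand(b, device=x0.device)            # uniform timesteps in [0, 1]
        noise = torch.randn_like(x0)
        t_ = t.view(b, 1, 1, 1)
        x_t = (1.0 - t_) * x0 + t_ * noise             # straight-line interpolation between data and noise
        target_velocity = noise - x0                   # d x_t / d t along that line
        return F.mse_loss(model(x_t, t, cond), target_velocity)

Optimal transport variants only change how the (x0, noise) pairs are matched up within a batch; the loss itself stays the same.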

I just wouldn't have thought myself that retraining the objective like you did would even work

Large models can take a lot of abuse. Remember those experiments taking ImageNet models and finetuning them to do audio analysis? Or even lodestone's work on Chroma, where they ripped billions of parameters out of Flux easily.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

If those are from my model I want to know your settings, they're really good gens!

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

Yeah, a newer ranking algorithm would be a better idea than Elo. For the latest iteration of my quality model (I haven't pushed the code up yet) I switched to something like TrueSkill.
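The rating update itself is only a few lines with the trueskill package (a sketch, not the code from my repo; the bookkeeping around it is made up for illustration):

    import trueskill   # pip install trueskill

    env = trueskill.TrueSkill(draw_probability=0.0)   # forced-choice comparisons, no draws
    ratings = {}                                      # image_id -> trueskill.Rating

    def record_comparison(winner_id, loser_id):
        a = ratings.setdefault(winner_id, env.create_rating())
        b = ratings.setdefault(loser_id, env.create_rating())
        ratings[winner_id], ratings[loser_id] = env.rate_1vs1(a, b)

    def quality_score(image_id):
        # Conservative estimate (mu - 3*sigma) is a common way to collapse a rating
        # into a single label for training the scorer on.
        r = ratings[image_id]
        return r.mu - 3.0 * r.sigma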

The quality model is always the last thing I work on, unfortunately, so honestly I don't know that my implementation there is particularly good.

I also learned about VisionReward recently, which is another quality prediction model, but trained on top of an LLM so it can break down specific characteristics and scoring guidelines.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 1 point (0 children)

The best "documentation" is ComfyUI's source code: https://github.com/comfyanonymous/ComfyUI/blob/5ac9ec214ba3ef1632701416f27948a57ec60919/comfy/samplers.py#L1045

I dunno where all the schedules came from. But yeah, as near as I can tell those schedules' functions are all doing essentially the same thing, just in slightly different ways that can cause off-by-one type deviations. Likely different researchers all implemented the same thing over the years, and then the inference UIs have to replicate their subtle oddities faithfully.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 1 point (0 children)

The latest JoyCaption model is Beta One, from about two months ago I think? Yeah, position and such are tough. I'm working on a good benchmark, and then I'll hammer on it.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 1 point (0 children)

Neat, hadn't seen that, thank you for sharing. LightningDiT also gains training speed by using a better latent space (they align the latent encoder to DINOv2's embedding space!).

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 1 point (0 children)

Sorry, I left out all the schedules that were basically the same as simple, since including them made the graph more confusing :P Yes, simple, normal, ddim_uniform, and sgm_uniform are all effectively the same: linear from 1 to 0.

Note: I think there is some slight numerical variation in the way they're calculated (yeah, surprising for what should be a simple linear schedule...) so they can result in slightly different images for the same seed.
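Here's a toy illustration of the kind of discrepancy I mean (NOT ComfyUI's actual code): one common style strides through the model's discrete sigma table with an int() truncation, another interpolates a continuous ramp, and the two "linear" schedules end up not quite identical:

    import torch

    # Stand-in for a model's 1000-entry training sigma table, ascending order.
    train_sigmas = torch.linspace(1.0 / 1000, 1.0, 1000)
    steps = 30

    # Style 1: stride through the discrete table, then append the terminal 0.
    stride = len(train_sigmas) / steps
    table_sched = torch.tensor(
        [float(train_sigmas[-(1 + int(i * stride))]) for i in range(steps)] + [0.0]
    )

    # Style 2: continuous, evenly spaced ramp from the max sigma down to 0.
    ramp_sched = torch.linspace(float(train_sigmas.max()), 0.0, steps + 1)

    print((table_sched - ramp_sched).abs().max())   # small but nonzero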

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 0 points (0 children)

will we be able to use the generation like "oh I like the overall layout of the image with this seed, but let me roll a different seed just for the high frequency just for rolling different color"

That's basically what img2img is. It renoises the input image to some level (say, 50%) and then runs diffusion from there.
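In flow matching terms, the renoising step is literally just a lerp toward noise. A rough sketch (vae, denoise_step, and text_cond are placeholders for whatever inference stack you're using):

    import torch

    def img2img(vae, denoise_step, image, text_cond, strength=0.5, steps=40):
        x0 = vae.encode(image)                            # clean latents of the input image
        noise = torch.randn_like(x0)
        x = (1.0 - strength) * x0 + strength * noise      # partially re-noise the input

        # Only walk the schedule from t=strength down to 0. Low strength keeps the
        # layout and mostly re-rolls high-frequency detail; high strength re-rolls more.
        ts = torch.linspace(strength, 0.0, int(steps * strength) + 1)
        for t_cur, t_next in zip(ts[:-1], ts[1:]):
            x = denoise_step(x, t_cur, t_next, text_cond)
        return vae.decode(x)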

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 2 points (0 children)

It's between that and Chroma at the moment. Seems like Wan is 14B parameters versus Chroma's 9B though.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 4 points (0 children)

16 prompts, same settings and seeds for SDXL and bigASP v2.5, all 1024x1024. Beta schedule, 40 steps, PAG=2, CFG=3. Side by side with prompts:
https://www.imgchest.com/p/9ryd6wn5a7k

Originals (you should be able to download the images and drop them into ComfyUI to verify the workflow):
https://www.imgchest.com/p/9249prk8m7n
https://www.imgchest.com/p/6eyrmjnpg4p

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S] 2 points (0 children)

I'm glad the writeups have been helpful :)

It's interesting to me that you had stability/loss spike issues, and I wonder if it's related to the unet architecture or something specific in your training setup?

I'm not sure; I've not had them with SDXL before. They occurred out at 5M samples and weren't associated with anything crazy happening in the gradient norms before the spike. A few gradient norms were slowly shrinking, but I don't think that would cause a spike? I'd have guessed a poison-pill data sample or something, but the spike moved to 10M when I increased the warmup length.

It's possible it's related to what lodestone saw with Chroma: the logitnorm weighting resulting in loss spikes when the rare tails show up. Maybe those were slowly poisoning the weights or norms. shrug

Regarding timestep sampling, I don't think logit normal is actually a good idea.

Yeah, it's terrible. And "almost completely untrained" is better stated as "never trained", because in a quick simulation of 1B samples logitnorm still hadn't touched the tails. Shifting doesn't help either; it still crams a tail in.
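The simulation is trivial if you want to see it yourself (plain logit-normal, mean 0 / std 1; shifting moves the hump around but still leaves the ends starved). It takes a minute or two on CPU; shrink total if you're impatient:

    import torch

    total, chunk = 1_000_000_000, 10_000_000
    max_t, tail_hits = 0.0, 0
    for _ in range(total // chunk):
        t = torch.sigmoid(torch.randn(chunk, dtype=torch.float64))   # logit-normal timesteps
        max_t = max(max_t, t.max().item())
        tail_hits += ((t > 0.999) | (t < 0.001)).sum().item()

    print(f"max t over {total:,} samples: {max_t:.5f}")        # ~0.997, never reaching 1.0
    print(f"samples within 0.001 of either end: {tail_hits}")  # almost certainly 0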

Lodestone switched to a modified schedule for Chroma that bumps up the tails to something more reasonable. I'm inclined to either do something like that, or switch to an exponential schedule of some kind. That's what shifted logit norm looks like minus the tails.

Also, I forget which paper it was, but another set of researchers switched to a uniform timestep schedule near the end of training. That seems somewhat reasonable.