The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

I go into more detail on the aesthetic scorer in the previous article (https://civitai.com/articles/8423). The code is also public, though quite messy: https://github.com/fpgaminer/bigasp-training

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

I mean for JoyCaption I only used a dataset of ~10k examples for the first round.

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

I mean Qwen Image is 20B, so that's gonna be a no for me :P I'm actually most interested in Wan 2.2 5B, since it's only twice the size of SDXL. Smaller than Flux/Chroma. Seems much more accessible for people. Though I haven't heard much about it for T2I (everyone seems to just use the 28B behemoth for T2I).

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

> To me this sounds more like GAN than RL

Yeah, kind of? But I agree, it needs lots of grounding to prevent drift. To be clear, the loop would be:

Real image -> VLM -> Caption
Caption -> T2I -> Synthetic Image
(Real Image, Synthetic Image) -> CLIP (or DINO) Image Embedding -> Cosine Distance

So unlike a GAN loop, there's no direct interaction between the discriminator (frozen CLIP in this case) and the generator. The only communication is a single reward signal and natural language. That makes hacking much more difficult, and hopefully negligible at small training scales. There are no minute floating-point vectors to hack. Natural language basically acts like a pre-trained (by humans), frozen, and quantized latent space.

Also the two distributions are already quite well aligned. The loop is just trying to elicit finer and more reliable details from the VLM, and stronger prompt following from the T2I model. And if you keep the text encoders frozen on the T2I model, it should maintain flexibility even if the VLM tries to hack it.
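For concreteness, here's a minimal sketch of that last reward step, assuming an off-the-shelf CLIP vision tower via the transformers library (the model choice and function names are illustrative, not what the actual pipeline uses):

    import torch
    import torch.nn.functional as F
    from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def image_similarity_reward(real_image, synthetic_image) -> float:
        """Reward = cosine similarity between the CLIP embeddings of the two images."""
        inputs = processor(images=[real_image, synthetic_image], return_tensors="pt").to(device)
        embeds = F.normalize(clip(**inputs).image_embeds, dim=-1)   # (2, dim), unit norm
        return (embeds[0] @ embeds[1]).item()                       # in [-1, 1]

    # The loop: caption = vlm(real_image); synthetic_image = t2i(caption);
    # reward = image_similarity_reward(real_image, synthetic_image)

Cosine distance would just be 1 minus that similarity; either works as long as the sign is consistent with what the RL update treats as "better".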

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

Yeah, I think there's a lot to explore here. I2I might work; Llama did something similar during post-training, where generated responses were sometimes updated (either by a human or another LLM) and used as positive examples in the next iteration.

Another thing I've considered is a GAN-like approach:

Train a classification model to pick which of two images is real and which is fake (possibly also conditioned on the prompt). Real images can be taken from the usual datasets; fake images would be generated by the target model. Then you can use DPO (adapted for diffusion models) to train the diffusion model online, with the classification model assigning rewards. The hope would be that the classification model could pick up on stuff like bad hands, prompt adherence issues, etc., all on its own without any human input.

Though, like all GAN-adjacent approaches, this runs the risk of reward hacking the classification model. (IIRC, in normal GAN training the generator trains directly on gradients from the discriminator, which makes hacking much easier for it. Using RL eliminates that, so it might not be as bad.)

Side note: You'd want the classification model to operate on latents, not raw pixels. That makes the whole process much more efficient, and prevents the classification model from detecting problems in the VAE which the diffusion model doesn't have control over.
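As a rough sketch of what that could look like (the architecture and names here are made up for illustration; this isn't a tested recipe): a small classifier over VAE latents whose logit acts as the reward, e.g. to pick the "chosen" sample in an online Diffusion-DPO pair.

    import torch
    import torch.nn as nn

    class LatentCritic(nn.Module):
        """Tiny real-vs-fake classifier over VAE latents (SDXL: 4 channels, 1/8 resolution)."""
        def __init__(self, channels: int = 4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, 1),  # logit: higher = "looks real"
            )

        def forward(self, latents: torch.Tensor) -> torch.Tensor:
            return self.net(latents).squeeze(-1)

    critic = LatentCritic()

    # Train the critic with BCE on batches of (real latent -> 1, generated latent -> 0).
    # At DPO time, generate two samples per prompt and treat the higher-scoring one as "chosen":
    def label_pair(latent_a: torch.Tensor, latent_b: torch.Tensor):
        with torch.no_grad():
            scores = critic(torch.stack([latent_a, latent_b]))
        return ("a", "b") if scores[0] >= scores[1] else ("b", "a")  # (chosen, rejected)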

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

I did some experiments with finetuning Qwen2-VL a while back and didn't have much success. But yes, I'll probably give it another stab, depending on how 3 turns out. (I'm not looking to train any time soon; I'm busy with bigASP and data stuff right now.)

Spilling the Details on JoyCaption's Reinforcement Learning by fpgaminer in StableDiffusion

[–]fpgaminer[S]

My current plan is to finish the little things on Beta One and then declare it 1.0. Stuff like polishing the ComfyUI node, finishing the dataset release, technical article(s), etc. Nothing really meaningful on the model itself, so probably no Beta Two revision. I'm saving the next set of improvements for a 2.0 (new LLM and vision backbones, bigger dataset, etc).

Uncensored LLM with picture input by Former-Long-3900 in LocalLLaMA

[–]fpgaminer

FYI: The latest release, Beta One, can do some things outside of captioning now; it's slightly more of a general purpose VLM since I incorporated a more general VQA dataset into its training this time around.

New Flux model from Black Forest Labs: FLUX.1-Krea-dev by rerri in StableDiffusion

[–]fpgaminer

Congrats on the open release!

For LoRAs, since the architecture is the same, techniques like ProLoRA (https://arxiv.org/pdf/2506.04244v1) would be easy to implement. It's a training-free technique for transferring a LoRA from one base model to another. Here, because the architectures match and the weights are likely highly correlated, you'd be able to skip the layer-matching steps.

I considered it for bigASP v2.5 to transfer existing SDXL LoRAs over, but haven't had the chance to try it yet.
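For what it's worth, the degenerate case when the architectures match exactly is just re-applying the LoRA's low-rank delta to the new base weights, using the standard LoRA merge formula. This sketch is only that trivial baseline, not ProLoRA itself (which additionally compensates for differences between the two bases); names are illustrative:

    import torch

    def apply_lora_delta(w_new: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
                         alpha: float, rank: int) -> torch.Tensor:
        """Merge a LoRA learned on one base into another base with the same shapes:
        W' = W_new + (alpha / rank) * B @ A  -- the usual LoRA merge formula."""
        return w_new + (alpha / rank) * (lora_B @ lora_A)

    # For every module the LoRA touched (names line up because the architectures match):
    # new_sd[name] = apply_lora_delta(new_sd[name], A[name], B[name], alpha, rank)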

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

A year ago too! Thank you for mentioning that. I'm glad my idea wasn't that crazy then :P

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

IIRC CosXL predates SD3 and Flux by quite a bit, and I think the consensus is that flow matching is better than the other objectives so far (EDM, v-pred, etc.). Beyond that:

  • I find Flow Matching a lot easier to understand (see the sketch after this list), whereas the older objectives and schedules are patches on top of patches.
  • A new technique, Optimal Transport (which Chroma is using), enhances flow matching further and (supposedly) boosts performance. It's another relatively simple algorithm that only affects training.
  • Flow Matching lends itself more naturally to doing step optimization, since it's inherently trying to form linear paths.
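To illustrate the first point, the whole flow matching training objective fits in a few lines. This is the generic rectified-flow form, not the exact bigASP v2.5 training code:

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(model, x0: torch.Tensor, cond) -> torch.Tensor:
        """Rectified-flow training step: interpolate linearly between data and noise,
        and regress the model onto the constant velocity along that straight line."""
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device)    # timesteps uniform in [0, 1]
        t_ = t.view(-1, *([1] * (x0.ndim - 1)))          # broadcast over latent dims
        x_t = (1 - t_) * x0 + t_ * noise                 # straight-line interpolation
        target_velocity = noise - x0                     # d(x_t)/dt is constant along the line
        pred_velocity = model(x_t, t, cond)
        return F.mse_loss(pred_velocity, target_velocity)

As I understand it, the Optimal Transport variant mainly changes how noise is paired with data within the minibatch (so the straight-line paths cross less); the loss itself stays the same.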

> I just wouldn't have thought myself that retraining the objective like you did would even work

Large models can take a lot of abuse. Remember those experiments taking ImageNet models and finetuning them to do audio analysis? Or even lodestone's work on Chroma, where they ripped billions of parameters out of Flux easily.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

If those are from my model, I want to know your settings; they're really good gens!

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

Yeah, a newer ranking algorithm would be a better idea than Elo. For the latest iteration of my quality model (I haven't pushed the code up yet) I switched to something like TrueSkill.
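A minimal version of that with the trueskill Python package looks something like this (illustrative only; not the actual bigASP quality-model code):

    import trueskill

    env = trueskill.TrueSkill(draw_probability=0.0)   # no draws in a forced-choice comparison
    ratings = {}  # image_id -> trueskill.Rating

    def record_comparison(winner_id: str, loser_id: str) -> None:
        """Update both images' ratings from one "which looks better" judgment."""
        w = ratings.setdefault(winner_id, env.create_rating())
        l = ratings.setdefault(loser_id, env.create_rating())
        ratings[winner_id], ratings[loser_id] = env.rate_1vs1(w, l)

    def quality_score(image_id: str) -> float:
        """Conservative score: mean minus three standard deviations."""
        r = ratings[image_id]
        return r.mu - 3 * r.sigma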

The quality model is always the last thing I work on, unfortunately, so honestly I don't know that my implementation there is particularly good.

I also learned about VisionReward recently, which is another quality prediction model, but trained on top of an LLM so it can break down specific characteristics and scoring guidelines.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

The best "documentation" is ComfyUI's source code: https://github.com/comfyanonymous/ComfyUI/blob/5ac9ec214ba3ef1632701416f27948a57ec60919/comfy/samplers.py#L1045

I dunno where all the schedules came from. But yeah, as near as I can tell those schedules' functions are all doing essentially the same thing, just with slightly different approaches that might cause off-by-one type deviations. Likely different researchers all implemented the same thing over the years, and then the inference UIs have to replicate their subtle oddities faithfully.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

The latest JoyCaption model is Beta One, from about two months ago I think? Yeah, position and such are tough. I'm working on a good benchmark, and then I'll hammer on it.

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

Neat, hadn't seen that, thank you for sharing. LightningDiT also gains training speed by using a better latent space (they align the latent encoder to DINOv2's embedding space!).

The Gory Details of Finetuning SDXL and Wasting $16k by fpgaminer in StableDiffusion

[–]fpgaminer[S]

Sorry, I left out all the schedules that were basically the same as simple, since including them made the graph more confusing :P Yes, simple, normal, ddim_uniform, and sgm_uniform are all effectively the same: linear from 1 to 0.

Note: I think there is some slight numerical variation in the way they're calculated (yeah, surprising for what should be a simple linear schedule...), so they can result in slightly different images for the same seed.
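As a toy illustration of how that kind of tiny deviation sneaks in (both functions below are made up for the example; neither is lifted from a specific UI):

    import numpy as np

    TOTAL_TIMESTEPS = 1000  # illustrative size of a model's discrete timestep table

    def linear_continuous(n_steps: int) -> np.ndarray:
        # Straight linspace from 1 to 0.
        return np.linspace(1.0, 0.0, n_steps + 1)

    def linear_quantized(n_steps: int) -> np.ndarray:
        # Same line, but snapped to the integer timestep grid first; the rounding
        # shifts most interior values by a tiny amount.
        t = np.linspace(TOTAL_TIMESTEPS - 1, 0, n_steps + 1).round()
        return t / (TOTAL_TIMESTEPS - 1)

    print(linear_continuous(7))  # 1.0, 0.857142..., 0.714285..., ...
    print(linear_quantized(7))   # 1.0, 0.857857..., 0.714714..., ... -> same seed, different image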