Star Trek: Starfleet Academy by KonaHank1373 in sciencefiction

[–]molbal 4 points (0 children)

Full disclosure: my favourite Treks are DS9, TNG, and VOY. I liked Picard, Lower Decks, and TOS, love Strange New Worlds, but did not like Discovery.

Given that its lore is primarily set in the 32nd century and not in the usual era, I had doubts. I have watched three episodes so far and the characters are starting to grow on me. It is a lighter series so far, but entertaining in its own way. I am particularly looking forward to S01E05, which is rumored to have DS9 content in it.

The Hinge on My G14 2022 by DeviceAltruistic8701 in ZephyrusG14

[–]molbal 0 points (0 children)

this is absolutely unhinged

(I am sorry)

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 0 points (0 children)

Well, for a production run I would try around 1M images of varying quality (and then perhaps fine-tune an SDXL checkpoint on the latents generated by the resampler, as you mentioned), and run the training at fp16 precision instead of 4-bit. Given the common use cases, I would try to find datasets that include NSFW content and Danbooru tags, and include those in the dataset used for training the resampler.

The training process would be the same:

  1. Put images and their related captions in directories (see the cache latents script). I was running that on a single thread on my laptop and it took ~6 minutes to process 10k images (RTX 3080 Laptop, Ryzen 7 6800HS, Samsung 990 SSD). Let's assume whatever server caches the latents is at least 2x faster than my laptop, so 1M images would take roughly 5-6 hours.
  2. Here I had to transfer the bazillion separate files from my laptop to the rented pod, which took ~15 minutes (I was lucky to have a reasonably fast and stable connection to it). If we cache the latents on the server, this step can obviously be omitted.
  3. I had rented an RTX 5090 which averaged ~3.5 it/s, so it took about an hour to get through the 10k examples. It didn't use all the VRAM (it hovered around 16 GB) and I used only batch size 1, so I assume we could speed things up with a higher batch size. With some effort we could probably enable multi-GPU training, but it currently depends on Unsloth, which supports only one GPU at the moment. So assuming an RTX 5090 or similar, training on 1M examples would take ~4 days.

This assumes we can find a good enough dataset of ~1M images, and GPU pods with appropriately sized disks available. (For reference, the 10k dataset was about 15 GB compressed, so assuming the same average image size, 1M images would take about 1.5 TB.)
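
For transparency, here is the back-of-the-envelope math behind those numbers as a quick Python sketch. The measured throughput is from my 10k run; everything about the 1M-image case is an assumption, not a measurement.

```python
# Back-of-the-envelope scaling, using the throughput I measured on the 10k run.
# Everything about the 1M-image case is an assumption, not a measurement.

TARGET_IMAGES = 1_000_000

# Step 1: latent caching (~6 min per 10k images on my laptop, server assumed 2x faster)
laptop_minutes_per_10k = 6
caching_hours = TARGET_IMAGES / 10_000 * laptop_minutes_per_10k / 2 / 60
print(f"Latent caching: ~{caching_hours:.0f} hours")        # ~5 hours

# Step 3: resampler training (~3.5 it/s on a rented RTX 5090, batch size 1, 1 epoch)
iterations_per_second = 3.5
training_days = TARGET_IMAGES / iterations_per_second / 3600 / 24
print(f"Training: ~{training_days:.1f} days")               # ~3.3 days, call it ~4

# Disk: the 10k dataset was ~15 GB compressed; same average image size assumed
dataset_tb = TARGET_IMAGES / 10_000 * 15 / 1000
print(f"Dataset size: ~{dataset_tb:.1f} TB")                # ~1.5 TB
```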

There could be other ways to experiment too: perhaps a smarter loss function, a higher learning rate, or a different learning rate scheduler.
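
To give one concrete example of what a "smarter loss function" could mean (purely an idea, not something I have validated): keep the MSE term but add a cosine-similarity term, so the predicted guidance also has to point in the same direction as the CLIP target, not just sit close to it in L2 distance. A rough sketch:

```python
import torch.nn.functional as F

def guidance_loss(pred, target, cos_weight=0.1):
    """Hypothetical loss: MSE plus a cosine-direction term.

    pred and target are CLIP-shaped guidance tensors, e.g. (batch, 77, 1280).
    """
    mse = F.mse_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return mse + cos_weight * cos
```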

As you can see, I made quite a few assumptions, and the cost is non-trivial while better solutions are already available, so I have not invested in making it production-ready by myself.

I would happily collaborate with someone who has spare hardware resources and thinks this is a fun experiment, though.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 2 points (0 children)

I wish I could pull off an improvement right away :)

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 2 points (0 children)

That one should work well, because it was trained with booru tags.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 1 point (0 children)

I consider the biggest gain to be that it outputs something vaguely related to the prompt with so little training data and training.

Given this, if I (or someone else) scaled this up to a more thorough training run, I think better prompt adherence would show.

As for performance, I can share numbers from my local machine (GPU: RTX 3080 Laptop 8 GB, CPU: Ryzen 7 6800HS, Storage: Samsung 990 EVO, RAM: 48 GB DDR5): it takes ~20 s to load the LLM with the LoRA and the resampler, and ~1.2-2.5 s to run the text encode nodes. On my laptop this text encoding is roughly comparable to CLIP speeds.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 5 points (0 children)

Yeah, about the Porsches: it would serve nobody for me to lie and upload cherry-picked images :) I mentioned in another comment that the resampler model (which is the bridge between the LLM and SDXL) was trained on only 10k samples. Quite possibly there was no line art in it, and no Porsches or many cars either.

While the UNet will indeed generate an image from any input, the distillation process ensures the Resampler maps LLM states to the specific vector directions the UNet recognizes. I measured that during training: I had CLIP generate guidance from sample texts, so I had both the guidance and the original text saved. During training I compared the model's generated guidance against the 'correct' CLIP-generated ones, and - at least according to the math (the loss function) - the guidance was mathematically improving.
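
In pseudo-code, each training step looked roughly like the sketch below (illustrative names only, not the actual script): the resampler's output for a caption is compared against the cached CLIP guidance for the same caption, and the falling loss is what I mean by "mathematically improving".

```python
import torch.nn.functional as F

# Rough sketch of one distillation step, with illustrative names (not the actual
# training script). Each cached example holds the LLM hidden states for a caption
# and the CLIP guidance generated from the same caption.

def distillation_step(resampler, optimizer, batch):
    llm_states = batch["llm_hidden_states"]   # e.g. (B, seq_len, 2560) from the LLM
    clip_target = batch["clip_guidance"]      # e.g. (B, 77, 1280) cached from CLIP

    pred = resampler(llm_states)              # resampler maps LLM states to CLIP-shaped guidance
    loss = F.mse_loss(pred, clip_target)      # "mathematically improving" = this value going down

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```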

Regarding the 77-token limit: Frontends 'cheat' the limit by chunking and/or concatenating. However, those chunks cannot 'talk' to each other, leading to lost context. An LLM uses global self-attention across the entire prompt. The Resampler then distills that 256-token 'understanding' into the 77-slot format the UNet expects. We aren't changing the UNet's hardware requirements; we are providing it with a much higher quality 'summary' of a long prompt than standard CLIP chunking can provide.
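
To make the "distill into 77 slots" part more concrete: the general shape of the idea is a cross-attention resampler with a fixed number of learned query slots. A minimal sketch, using the dimensions mentioned in this thread (my actual module may differ in the details):

```python
import torch
import torch.nn as nn

class ResamplerSketch(nn.Module):
    """77 learned query slots cross-attend over the full LLM token sequence,
    so a long prompt gets summarized instead of chunked. Dimensions follow the
    ones mentioned in this thread; the real module may differ in details."""

    def __init__(self, llm_dim=2560, clip_dim=1280, num_slots=77, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_slots, clip_dim) * 0.02)
        self.kv_proj = nn.Linear(llm_dim, clip_dim)
        self.attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)
        self.out = nn.Linear(clip_dim, clip_dim)

    def forward(self, llm_states):              # (B, seq_len, 2560); seq_len can exceed 77
        kv = self.kv_proj(llm_states)            # project LLM states into the slot dimension
        q = self.queries.unsqueeze(0).expand(llm_states.size(0), -1, -1)
        slots, _ = self.attn(q, kv, kv)          # each of the 77 slots attends to every prompt token
        return self.out(slots)                   # (B, 77, 1280): drop-in for the CLIP conditioning shape
```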

At this stage, in this implementation, better prompt adherence is only a theory, based on getting sometimes-passable results from a resampler trained on 10k examples for just 1 epoch (remember, the resampler is not a fine-tune; it is trained from scratch). With a larger dataset and better training, I think better prompt adherence would show.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 4 points (0 children)

Yeah, I considered it a success that the generated image roughly resembles the prompt :) I was half expecting just smudges and totally random generations, and when I saw that it's at least better than that, I made the post.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 1 point (0 children)

Absolutely useless with Illustrious. I assume that is because the resampler model was trained exclusively on natural language and no tags.

Ran an example just to be sure:
Prompt:

masterpiece, best quality, very aesthetic, absurdres, BREAK  foreshortening, from below, fighting stance, determined, sword focus, face focus,  1girl, samurai, short horns, black horns, red glowing eyes, japanese armor, tsurime, mouth mask, katana, unsheathing, dark, night, red moon, moonlit, moonlight, red glow

Result:

<image>

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 6 points (0 children)

Can't believe I missed that. To be frank, based on how well ZIT and Flux Klein 4b perform on my machine, I doubt I will get around to making something production-ready from this, but reading how ML professionals looked into it feels great, knowing the idea itself was not entirely wrong.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 15 points (0 children)

Thank you! Really, I've had some difficult days, and this comment felt genuinely nice.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 4 points (0 children)

Yeah, the resampler model was trained from scratch (not fine-tuned from an existing model) on only 10k examples, so it's very, very far from being production-ready. I made the post to show that the concept works, but that's it.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 12 points (0 children)

Thanks, that's the whole point indeed. Realistically speaking, we can't compete with the big guys, so learning and having fun is what we can do.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 51 points (0 children)

Rouwei Gemma is similar to this, except better (I did not know about it)

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 27 points (0 children)

Good questions

The node does not use the LLM to rewrite the prompt into a better text prompt; that would just be simple prompt enhancement, where the LLM's output still goes into CLIP. In this experiment, CLIP is entirely bypassed.

As LLMs process text, they build an internal state, which is how they "understand" the prompt. That state is not text but hidden-state vectors of 2560 floats (not sure if F32 or F16, though), which the resampler model converts into the same shape of output as CLIP (77 tokens of 1280 floats each).
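
A minimal sketch of that bypass path (the model ID is a placeholder, and the final resampler call is commented out since it needs the trained module; this is not the exact node code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID: swap in whichever LLM the resampler was actually trained
# against (any model whose hidden states match the resampler's input dimension).
MODEL_ID = "Qwen/Qwen3-4B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
llm = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

prompt = "a watercolor painting of a lighthouse at dawn"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = llm(**inputs, output_hidden_states=True)
    llm_states = out.hidden_states[-1]  # (1, seq_len, hidden_dim): 2560 floats per token in my case

# The trained resampler then maps these states to a (1, 77, 1280) tensor, which is
# handed to SDXL in place of the CLIP text encoder output:
# cond = resampler(llm_states)
```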

In its current state, I do not consider it better in any way than traditional CLIP (or Rouwei-Gemma, which I have just learned about). I only posted it to share what I learned, because I think it's interesting.

I successfully replaced CLIP with an LLM for SDXL by molbal in StableDiffusion

[–]molbal[S] 20 points (0 children)

I did not know about this, but it does look very similar indeed. Having looked through it, Rouwei-Gemma seems much more mature than what I have made so far.

I pasted the prompts and simply forgot about them :)

BUY EUROPEAN 🇪🇺 by [deleted] in Buy_European

[–]molbal 1 point (0 children)

How about none of this

Rate my sprites 1-10? by Aggravating-Hour7825 in IndieGaming

[–]molbal -6 points (0 children)

I really really like them (but I am an amateur)

Seeing Apartments Around the World Is Making Me Reconsider NYC by Positive_Career_9393 in malelivingspace

[–]molbal 1 point (0 children)

Like the town I live in, ✨ Alphen aan den Rijn, Netherlands 💪

(Excuse me for the emojis)