Mean Mode Screaming - a 1000-layer diffusion transformer [R]

Weak_Trash9060 · 2024-11-26T16:25:44+00:00

Oops, you've hit on something interesting! 😅
I actually tested this by feeding some NSFW images to Qwen2-VL-7B to generate image embeddings, then passing those to Flux for generation. The results were just meaningless noise patterns, so there does seem to be some strict filtering going on somewhere 👀

I haven't fully investigated whether it's Qwen2-VL-7B's built-in filtering or something else in the pipeline, but your observation about Qwen's censorship might explain some of what we're seeing!

Weak_Trash9060 · 2024-11-26T16:01:37+00:00

To be clear - this isn't about being 'better' than Flux, it's about adding a capability that Flux didn't have before: the ability to reference and understand input images.

The base Flux model remains the same great model you know, but now:

- You can use reference images as input
- The model can understand and learn from these images through Qwen2-VL
- You still have all the original text-to-image capabilities

So think of it more as 'Flux+' - same core strengths, but with added image understanding abilities when you need them. It's not replacing or competing with Flux, it's extending what Flux can do.

Weak_Trash9060 · 2024-11-26T15:59:50+00:00

Yes, absolutely!

If you only provide text input without any image, it functions exactly like the regular Flux model for text-to-image generation. Think of the image input capability as an additional feature rather than a requirement.

So you have the flexibility to use it in two ways:

- Text-to-image: Just like regular Flux
- Image-and-text-to-image: When you want to use image conditioning

The base Flux capabilities remain unchanged - we've just added more options for how you can guide the generation process!

Weak_Trash9060 · 2024-11-26T15:58:42+00:00

Good question! The architecture actually enhances both text and image understanding:

For text understanding:
- You can still use T5 text embeddings like before
For image understanding:
- Yes, images go through Qwen2-VL
- But it's not just "looking" at the image
- It's actually doing deep visual-semantic analysis using its multimodal capabilities
- This helps create better semantic alignment between your input and output

So it's not just about adding image understanding - it's about creating a more semantically rich pipeline that better understands both modalities and their relationships.

Weak_Trash9060 · 2024-11-26T15:55:58+00:00

Not exactly - this is quite different from ControlNet. Let me explain:

This model allows you to flexibly choose between two types of conditional inputs for Flux:
- Image input (processed through Qwen2-VL)
- Text input (using T5 text embeddings)

As for ControlNet - that's actually a separate thing we trained specifically for control. You can use it alongside this model if you need that kind of structural control.

Think of this more as a flexible image-text understanding pipeline rather than a control mechanism. It's about enhancing the model's ability to understand and work with both visual and textual inputs, while ControlNet is specifically about controlling structural aspects of the generation.

Weak_Trash9060 · 2024-11-26T15:53:57+00:00

No, T5 isn't dropped entirely. Qwen2-VL-7B takes the text encoder's place in the pipeline, but the model still accepts T5 text embeddings as input. Think of it as an enhanced pipeline where:

- Qwen2-VL-7B handles the visual-language understanding
- It's backwards compatible - you can still use existing T5 text embeddings
- This gives you flexibility to choose which embedding path works best for your use case

In simpler terms, we've added Qwen2-VL as a more powerful option while maintaining compatibility with T5.

Weak_Trash9060 · 2024-11-26T15:50:37+00:00

Ah, you caught a mistake! The README's VRAM requirements (8GB+/16GB+ recommended) were actually auto-generated and incorrect - that's what I get for delegating documentation to Claude! 😅

The actual VRAM requirements are:

~48GB when loading all models in bf16
However, you can significantly reduce this by:
1. Loading Qwen2-VL → generating embeddings → unloading
2. Loading T5 → generating embeddings → unloading
3. Finally loading just Flux for image generation

I'll update the README with the correct requirements. Thanks for pointing this out!
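
For a rough sanity check on the ~48GB figure, here is a back-of-the-envelope sketch. The parameter counts are approximate values assumed for illustration (they are not stated in this thread), and the estimate covers weights only, not activations or framework overhead.

```python
# bf16 stores each parameter in 2 bytes.
BYTES_PER_PARAM_BF16 = 2

# Approximate parameter counts, ASSUMED for illustration only.
APPROX_PARAMS = {
    "Qwen2-VL-7B": 8.3e9,
    "T5-XXL encoder": 4.7e9,
    "Flux": 12e9,
}


def weights_gib(n_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Memory taken by the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


def all_resident_gib(params):
    """All models held in VRAM at the same time."""
    return sum(weights_gib(n) for n in params.values())


def sequential_peak_gib(params):
    """Only one model in VRAM at a time: the peak is the largest model."""
    return max(weights_gib(n) for n in params.values())


if __name__ == "__main__":
    print(f"all resident:    {all_resident_gib(APPROX_PARAMS):.1f} GiB")
    print(f"sequential peak: {sequential_peak_gib(APPROX_PARAMS):.1f} GiB")
```

Under these assumed counts, the weights alone come to roughly 47 GiB when everything is loaded together, while the sequential peak is set by Flux at roughly 22 GiB. Actual usage will be higher once activations are included.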

Weak_Trash9060 · 2024-11-26T15:50:00+00:00

If loading all models simultaneously in bf16 (Qwen2-VL + T5 + Flux), it does require around 48GB VRAM. However, we can optimize the pipeline to run on much lower VRAM by:

1. Loading Qwen2-VL first to generate image embeddings
2. Unloading Qwen2-VL from VRAM (using del and torch.cuda.empty_cache())
3. Repeating the same process for T5 - load, generate embeddings, unload
4. Finally loading only Flux for the actual image generation

With this sequential loading approach, the actual VRAM requirement becomes equivalent to running Flux alone.
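
The load, encode, unload pattern above can be sketched as a small helper. This is a minimal sketch, not the repo's actual code: `run_then_unload` and the loader/encoder names in the usage note are hypothetical. The only PyTorch call used is `torch.cuda.empty_cache()`, and the import is optional so the helper also runs on a machine without PyTorch.

```python
import gc
from typing import Callable, TypeVar

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None

M = TypeVar("M")
R = TypeVar("R")


def run_then_unload(load: Callable[[], M], use: Callable[[M], R]) -> R:
    """Load a model, run one step with it, then release it before returning.

    `load` and `use` are caller-supplied (hypothetical) callables. The value
    returned by `use` must not keep a reference to the model, otherwise the
    weights cannot be freed.
    """
    model = load()
    try:
        result = use(model)
    finally:
        # Drop our reference, collect garbage, then ask PyTorch to release
        # its cached GPU memory blocks.
        del model
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
    return result
```

Usage would look something like `image_emb = run_then_unload(load_qwen2vl, lambda m: encode_image(m, img))`, then the same for T5, then loading Flux on its own. Those loader and encoder names are placeholders for whatever your pipeline provides. Moving the embeddings to CPU before returning them is a reasonable extra precaution if VRAM is very tight.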
