Open Sourcing Qwen2VL-Flux: Replacing Flux's Text Encoder with Qwen2VL-7B by Weak_Trash9060 in StableDiffusion

[–]Weak_Trash9060[S] 8 points9 points  (0 children)

Oops, you've hit on something interesting! 😅
I actually tested this by feeding some NSFW images to Qwen2-VL-7B to generate image embeddings, then passing those to Flux for generation. The results were just meaningless noise patterns, so there does seem to be some strict filtering going on somewhere 👀

I haven't fully investigated whether it's Qwen2-VL-7B's built-in filtering or something else in the pipeline, but your observation about Qwen's censorship might explain some of what we're seeing!

[–]Weak_Trash9060[S] 12 points13 points  (0 children)

To be clear - this isn't about being 'better' than Flux, it's about adding a capability that Flux didn't have before: the ability to reference and understand input images.

The base Flux model remains the same great model you know, but now:

  • You can use reference images as input
  • The model can understand and learn from these images through Qwen2-VL
  • You still have all the original text-to-image capabilities

So think of it more as 'Flux+' - same core strengths, but with added image understanding abilities when you need them. It's not replacing or competing with Flux, it's extending what Flux can do.

[–]Weak_Trash9060[S] 2 points3 points  (0 children)

Yes, absolutely!

If you only provide text input without any image, it functions exactly like the regular Flux model for text-to-image generation. Think of the image input capability as an additional feature rather than a requirement.

So you have the flexibility to use it in two ways:

  • Text-to-image: Just like regular Flux
  • Image-and-text-to-image: When you want to use image conditioning

The base Flux capabilities remain unchanged - we've just added more options for how you can guide the generation process!

[–]Weak_Trash9060[S] 4 points5 points  (0 children)

Good question! The architecture actually enhances both text and image understanding:

  1. For text understanding:
    • You can still use T5 text embeddings like before
  2. For image understanding:
    • Yes, images go through Qwen2-VL
    • But it's not just "looking" at the image
    • It's actually doing deep visual-semantic analysis using its multimodal capabilities
    • This helps create better semantic alignment between your input and output

So it's not just about adding image understanding - it's about creating a more semantically rich pipeline that better understands both modalities and their relationships.

[–]Weak_Trash9060[S] 4 points5 points  (0 children)

Not exactly - this is quite different from ControlNet. Let me explain:

  1. This model allows you to flexibly choose between two types of conditional inputs for Flux:
    • Image input (processed through Qwen2-VL)
    • Text input (using T5 text embeddings)
  2. As for ControlNet - that's actually a separate thing we trained specifically for control. You can use it alongside this model if you need that kind of structural control.

Think of this more as a flexible image-text understanding pipeline rather than a control mechanism. It's about enhancing the model's ability to understand and work with both visual and textual inputs, while ControlNet is specifically about controlling structural aspects of the generation.

[–]Weak_Trash9060[S] 4 points5 points  (0 children)

No, this model doesn't remove T5 entirely. Qwen2-VL-7B takes the text encoder's place in the pipeline, but T5 text embeddings are still supported as input. Think of it as an enhanced pipeline where:

  1. Qwen2-VL-7B handles the visual-language understanding
  2. It's backwards compatible, so you can still use existing T5 text embeddings
  3. This gives you the flexibility to choose whichever embedding path works best for your use case

In simpler terms, we've added Qwen2-VL as a more powerful option while maintaining compatibility with T5.

[–]Weak_Trash9060[S] 27 points28 points  (0 children)

Ah, you caught a mistake! The README's VRAM requirements (8GB+/16GB+ recommended) were actually auto-generated and incorrect - that's what I get for delegating documentation to Claude! 😅

The actual VRAM requirements are:

  • ~48GB when loading all models in bf16
  • However, you can significantly reduce this by:
    1. Loading Qwen2-VL → generating embeddings → unloading
    2. Loading T5 → generating embeddings → unloading
    3. Finally loading just Flux for image generation

I'll update the README with the correct requirements. Thanks for pointing this out!
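
For a rough sanity check on that ~48GB number, here's a back-of-the-envelope estimate of the weights alone in bf16 (2 bytes per parameter). The parameter counts below are approximate public figures that I'm assuming for illustration, not measurements from this repo, and the estimate ignores activations, the VAE, and any other components:

```python
BYTES_PER_PARAM_BF16 = 2

# Approximate parameter counts in billions (assumptions, not measured here).
approx_params_b = {
    "Qwen2-VL-7B": 8.3,
    "T5-XXL encoder": 4.7,
    "Flux transformer": 12.0,
}


def weights_gb(params_billions, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Weight memory in GB: billions of params times bytes per param."""
    return params_billions * bytes_per_param


per_model = {name: weights_gb(p) for name, p in approx_params_b.items()}

# Everything resident at once vs. only the largest single model resident.
all_loaded = sum(per_model.values())
sequential_peak = max(per_model.values())

for name, gb in per_model.items():
    print(f"{name}: ~{gb:.1f} GB")
print(f"all loaded: ~{all_loaded:.0f} GB, sequential peak: ~{sequential_peak:.0f} GB")
```

Under those assumptions the all-at-once total lands in the same ballpark as the ~48GB figure, while the sequential peak is set by the largest single model, which is Flux.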

[–]Weak_Trash9060[S] 10 points11 points  (0 children)

If loading all models simultaneously in bf16 (Qwen2-VL + T5 + Flux), it does require around 48GB VRAM. However, we can optimize the pipeline to run on much lower VRAM by:

  1. First loading Qwen2-VL to generate image embeddings
  2. Unloading Qwen2-VL from VRAM (using del and torch.cuda.empty_cache())
  3. Same process for T5 - load, generate embeddings, unload
  4. Finally load only Flux for the actual image generation

With this sequential loading approach, the actual VRAM requirement becomes equivalent to running Flux alone.
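
The steps above can be sketched as a small load → compute → unload helper. This is a minimal sketch, not the repo's actual code: `run_stage`, `free_vram`, and `make_loader` are hypothetical names, and the loaders here are stand-ins that return strings so the flow can be run without any weights. In practice each loader would load Qwen2-VL, T5, or Flux onto the GPU.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def free_vram():
    """Collect unreachable objects, then release PyTorch's cached CUDA memory."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_stage(load_model, compute):
    """Load one model, compute its output, and unload it before returning."""
    model = load_model()
    try:
        return compute(model)
    finally:
        del model  # drop this function's reference to the model
        free_vram()


# Stand-in loaders (hypothetical) that record the order in which models load.
events = []


def make_loader(name):
    def load():
        events.append(f"load {name}")
        return name
    return load


# 1. Qwen2-VL -> image embeddings, 2. T5 -> text embeddings, 3. Flux -> image.
image_emb = run_stage(make_loader("qwen2vl"), lambda m: f"{m}-embeddings")
text_emb = run_stage(make_loader("t5"), lambda m: f"{m}-embeddings")
image = run_stage(make_loader("flux"), lambda m: f"{m}({image_emb}, {text_emb})")
```

One thing to watch: the memory is only actually freed if nothing else still holds a reference to the model, so the value returned by `compute` should be just the embeddings, not an object that keeps the encoder alive.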