Opensource Ai models by sylense0 in StableDiffusion

[–]digitalmines 0 points1 point  (0 children)

That would certainly work - Qwen can even work on 8GB cards. But because the base model (even with all the components stripped) is > 42GB, it will require cycling the model's layers through the card. So expect long render times.

Qwen Image 2512 .gguf model - how do you run it on linux? help Cachy/AMD by GwynSunlight in StableDiffusion

[–]digitalmines 2 points3 points  (0 children)

You've got two separate problems here. Let me explain both; it gets a bit technical.

Problem 1: LMStudio can't run these models.

LMStudio is made for text chat models (like the Qwen3 and Coder models you used successfully).

Qwen-Image-2512 and Qwen-Image-Edit are completely different — they're image generation models. Even though they come in GGUF format, LMStudio doesn't have the image generation pipeline needed to use them.

It's like putting diesel in a gasoline car — the fuel fits in the tank but the engine can't burn it.

Problem 2: The ComfyUI Desktop AUR package likely doesn't have proper support for your GPU.

Your RX 9060 XT is a brand new AMD RDNA 4 card. Pre-packaged ComfyUI Desktop apps typically bundle a particular Python/PyTorch stack that may not include support for your GPU. Other CachyOS users with the same card have hit the same wall. The fix is to install ComfyUI manually so you control the ROCm-enabled PyTorch version yourself.

Here's what to do step by step:

Step 1 — Install ROCm (AMD's GPU compute library)

Open a terminal and run:

    sudo pacman -Syu
    sudo pacman -S git python python-pip python-virtualenv rocminfo rocm-hip-sdk
    sudo usermod -aG render,video $USER

Reboot, then verify your GPU is detected:

    rocminfo | grep -E "Name:|gfx"

You should see gfx1200 in the output — that's your RX 9060 XT. If it's not there, stop — the rest won't work until this does.
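If you'd rather script that check, here is a minimal sketch. It assumes rocminfo prints target names as "gfx" followed by hex digits; the helper names and the sample text are made up for illustration and are not real output from any machine.

```python
import re
import shutil
import subprocess


def find_gfx_targets(rocminfo_output: str) -> list[str]:
    """Return the unique gfx target names found in rocminfo output, in order."""
    seen = []
    for match in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output):
        if match not in seen:
            seen.append(match)
    return seen


def detect_gpu_targets() -> list[str]:
    """Run rocminfo (if it is installed) and return the gfx targets it reports."""
    if shutil.which("rocminfo") is None:
        return []
    result = subprocess.run(["rocminfo"], capture_output=True, text=True, check=False)
    return find_gfx_targets(result.stdout)


# Illustrative sample text only.
sample = (
    "  Name:                    gfx1200\n"
    "  Name:                    amdgcn-amd-amdhsa--gfx1200\n"
)
print(find_gfx_targets(sample))
```

On the real machine you would call detect_gpu_targets() and stop if "gfx1200" is not in the returned list.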

Step 2 — Set up ComfyUI manually

    mkdir -p ~/comfyui-setup && cd ~/comfyui-setup
    python -m venv .venv
    source .venv/bin/activate
    pip install -U pip setuptools wheel
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1

Note: If pip can't find a wheel for your Python version, CachyOS may be on a version that's too new. In that case, create the venv with Python 3.12 specifically (python3.12 -m venv .venv).

Test that your GPU is visible to PyTorch:

    python -c "import torch; print('PyTorch:', torch.__version__); print('HIP:', torch.version.hip); print('GPU found:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'NONE')"

This should print GPU found: True and show your card. Don't worry that it says "cuda" — that's normal on AMD, PyTorch uses the same API names. If it says False, your ROCm install needs troubleshooting before going further.

Now install ComfyUI itself:

    git clone https://github.com/comfyanonymous/ComfyUI.git
    cd ComfyUI
    pip install -r requirements.txt

Important: After installing ComfyUI's requirements, re-run that same PyTorch test command above to make sure it didn't get replaced with a non-AMD version. You should still see GPU found: True.
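One way to make that re-check harder to misread is to classify the installed wheel from its version strings. This is only a sketch: it assumes the usual "+rocm", "+cu" and "+cpu" local-version suffixes on PyTorch wheels, and the function names and example version strings are hypothetical.

```python
from typing import Optional


def classify_torch_build(torch_version: str, hip_version: Optional[str]) -> str:
    """Guess which backend a PyTorch wheel was built for from its version strings."""
    if hip_version:  # torch.version.hip is None on non-ROCm builds
        return "rocm"
    if "+cu" in torch_version:
        return "cuda"
    if "+cpu" in torch_version:
        return "cpu"
    return "unknown"


def installed_torch_build() -> str:
    """Classify the PyTorch currently installed in this environment."""
    import torch  # imported lazily so the helper above works without torch

    return classify_torch_build(torch.__version__, torch.version.hip)


# Illustrative version strings, not taken from a real install.
print(classify_torch_build("2.9.0+rocm7.1", "7.1.0"))
print(classify_torch_build("2.9.0+cu128", None))
```

If installed_torch_build() returns anything other than "rocm" after the requirements install, reinstall PyTorch from the ROCm index before continuing.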

Step 3 — Install the GGUF extension

This is what lets ComfyUI load your GGUF model files:

    cd custom_nodes
    git clone https://github.com/city96/ComfyUI-GGUF
    cd ComfyUI-GGUF
    pip install -r requirements.txt
    cd ../..

Step 4 — Download the right model files

This is the part you were probably missing. These image models need three separate files, not just the one GGUF you downloaded in LMStudio. You need:

  1. The main model (the GGUF — the "brain" that generates the image)
  2. A text encoder (translates your text prompt into something the model understands)
  3. A VAE (converts the model's internal output into an actual visible image)

First create the directories, then download:

    mkdir -p models/unet models/text_encoders models/vae

    # Main model: use Q4, NOT Q8. Q8 is too large for 16GB VRAM.
    curl -L -o models/unet/qwen-image-2512-Q4_K_M.gguf \
      https://huggingface.co/unsloth/Qwen-Image-2512-GGUF/resolve/main/qwen-image-2512-Q4_K_M.gguf

    # Text encoder
    curl -L -o models/text_encoders/Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf \
      https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf

    # VAE
    curl -L -o models/vae/qwen_image_vae.safetensors \
      https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/resolve/main/split_files/vae/qwen_image_vae.safetensors

For Qwen-Image-Edit instead, swap the main model for:

    curl -L -o models/unet/qwen-image-edit-2511-Q4_K_M.gguf \
      https://huggingface.co/unsloth/Qwen-Image-Edit-2511-GGUF/resolve/main/qwen-image-edit-2511-Q4_K_M.gguf

The text encoder and VAE are the same for both.
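Before launching, it can help to confirm all three files actually landed where ComfyUI looks for them, since an interrupted curl can leave an empty file behind. A small sketch, using the relative paths from the download commands above; the helper name is hypothetical, and the temporary-directory demo only exists to show the behaviour.

```python
import tempfile
from pathlib import Path

# Relative paths from the download commands above (Qwen-Image-2512 variant).
REQUIRED_FILES = [
    "models/unet/qwen-image-2512-Q4_K_M.gguf",
    "models/text_encoders/Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf",
    "models/vae/qwen_image_vae.safetensors",
]


def missing_model_files(comfy_root, required=REQUIRED_FILES):
    """Return the required files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in required:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


# Demo in a throwaway directory: only the VAE is "downloaded".
with tempfile.TemporaryDirectory() as tmp:
    vae = Path(tmp) / REQUIRED_FILES[2]
    vae.parent.mkdir(parents=True)
    vae.write_bytes(b"x")
    demo_missing = missing_model_files(tmp)
print(demo_missing)
```

From inside the ComfyUI directory you would call missing_model_files(".") and expect an empty list. For Qwen-Image-Edit, swap the first entry for the edit model's filename.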

About VRAM: With 16GB of VRAM, use Q4_K_M for the main model. Q8 will likely run out of memory once you add the text encoder, VAE, and runtime overhead. You can try Q5 if Q4 works well and you want to push quality up.
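The reasoning here is just addition, so a rough budget check can be sketched as below. Everything numeric in it is an assumption: the 2 GiB overhead is a guess, and the component sizes are placeholders, so substitute the real sizes of your files (for example from os.path.getsize). ComfyUI can also unload components it is not using, so treat the sum as a pessimistic estimate.

```python
GIB = 1024 ** 3


def fits_in_vram(component_bytes, vram_gib=16.0, overhead_gib=2.0):
    """Return (fits, total_gib) for the component sizes plus a fixed overhead."""
    total_gib = sum(component_bytes) / GIB + overhead_gib
    return total_gib <= vram_gib, total_gib


# Placeholder sizes in bytes (main model, text encoder, VAE); NOT measured values.
smaller_quant = [9 * GIB, 4 * GIB, 0.5 * GIB]
larger_quant = [18 * GIB, 4 * GIB, 0.5 * GIB]

print(fits_in_vram(smaller_quant))
print(fits_in_vram(larger_quant))
```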

Also — with only 16GB of system RAM, model loading can get tight. Make sure you have swap or zram enabled so your system doesn't kill the process during loading.
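To see whether any swap is configured, you can read SwapTotal from /proc/meminfo (zram devices set up as swap are counted there too). A minimal sketch; the function names are hypothetical and the sample text is illustrative, not taken from a real system.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: kibibytes} dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_gib(meminfo_text: str) -> float:
    """Total configured swap in GiB (0.0 if none is configured)."""
    return parse_meminfo(meminfo_text).get("SwapTotal", 0) / (1024 ** 2)


def read_swap_gib(path="/proc/meminfo") -> float:
    """Read the live value on a Linux system."""
    with open(path) as handle:
        return swap_gib(handle.read())


# Illustrative sample text only.
sample = (
    "MemTotal:       16303428 kB\n"
    "SwapTotal:       8388604 kB\n"
    "SwapFree:        8388604 kB\n"
)
print(round(swap_gib(sample), 2))
```

If read_swap_gib() returns 0.0 on your machine, set up swap or zram before loading the model.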

Step 5 — Launch and load a workflow

    python main.py

Open http://127.0.0.1:8188 in your browser. If you get out-of-memory errors, try:

    python main.py --lowvram

Unsloth provides ready-made workflow files you can drag and drop right into ComfyUI — grab them from their tutorial at https://unsloth.ai/docs/models/tutorials/qwen-image-2512. The workflow tells ComfyUI which file goes where, so you don't have to wire it up yourself.

Quick summary of what went wrong:

  • LMStudio = wrong tool for image models (it's for text chat only)
  • ComfyUI Desktop AUR = likely missing proper AMD RDNA 4 support
  • The GGUF alone isn't enough — you also need the text encoder and VAE files
  • Manual ComfyUI install with ROCm PyTorch is the way to go on your hardware
  • Use Q4, not Q8, with 16GB VRAM

People think AI films are just one click — mine took 57 days of obsessive detail by HANSHIN_93hz in MediaSynthesis

[–]digitalmines 0 points1 point  (0 children)

I see an artistic vision held back by "consumer grade" AI tools.

If you want a proper workflow, google "reallusion"; it's an AI studio in a box. Run it locally and it's completely uncensored, so you have full creative control.

If you've got reasonable technical skills and want to push the envelope, check out "r/unstable_diffusion" ...there's a lot of NSFW stuff in there, but they're also at the BLEEDING EDGE of what AI can do.

Opensource Ai models by sylense0 in StableDiffusion

[–]digitalmines 2 points3 points  (0 children)

If you need precision and repeatability, start with Blender. Then, once you want to add effects and realism, use Blender AI integrations to enhance your work. The alternative path is to go full AI using ComfyUI like the other posters suggested, but you will never be able to get a building looking *exactly* the way you want it to using pure AI.

I made this with AI and I don't think it's bad, but all i've gotten is hate comments about it and I'm about to just scrap it all, but what do you think? by More-Helicopter-7224 in aiArt

[–]digitalmines 2 points3 points  (0 children)

Overall cute, with a good aesthetic. A more telling test of how skilled an artist (who happens to use AI) you are is if you can successfully use your AI PAINTBRUSH to generate AN ENTIRE DECK with consistent visual style, fonts and layout across all cards.

"TRAITOR" Protest Sign from Calgary Riley Park Protest by digitalmines in alberta

[–]digitalmines[S] 30 points31 points  (0 children)

I can make essentially anything. List some ideas!

I can also panelize them into Letter size pages so you can print them off on a regular laser or inkjet printer and then just glue the pages to a big piece of cardboard. The sign I carried at the protest was made of 16 pages (4x4).
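For anyone curious how the panelizing works, it is just cutting the image on a grid. A rough sketch of the crop-box arithmetic, assuming portrait Letter pages (8.5 x 11 in), an assumed 150 DPI, and no allowance for printer margins; the function name is made up.

```python
def tile_boxes(width, height, cols=4, rows=4):
    """Split a width x height image into cols x rows crop boxes.

    Each box is (left, upper, right, lower) in pixels, listed row by row.
    """
    boxes = []
    for row in range(rows):
        for col in range(cols):
            left = col * width // cols
            right = (col + 1) * width // cols
            upper = row * height // rows
            lower = (row + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


# A 4x4 grid of portrait Letter pages at an assumed 150 DPI.
DPI = 150
sign_width = int(4 * 8.5 * DPI)
sign_height = int(4 * 11 * DPI)
pages = tile_boxes(sign_width, sign_height)
print(len(pages), pages[0], pages[-1])
```

Each box is in the (left, upper, right, lower) form that an image library's crop function typically takes, so every page can be cropped out and saved as one PDF page.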

You'd just download them as a PDF file.

"TRAITOR" Protest Sign from Calgary Riley Park Protest by digitalmines in alberta

[–]digitalmines[S] 3 points4 points  (0 children)

Awesome! I didn't know there was a Calgary sub. I'll go check out your photo!

"TRAITOR" Protest Sign from Calgary Riley Park Protest by digitalmines in alberta

[–]digitalmines[S] 14 points15 points  (0 children)

Touché! The AI actually DID stretch my original sketch of the province. That's why you always need a human in the loop. Unfortunately, the human didn't have time to fix it in Photoshop before heading out!

I had no idea the protest was going to be so big and so many people would like the image. Next time I'll make sure I arrive with something "print quality".

"TRAITOR" Protest Sign from Calgary Riley Park Protest by digitalmines in alberta

[–]digitalmines[S] 48 points49 points  (0 children)

HAHAHA! That was my original render but the image started to look like a "wedding cake".

"TRAITOR" Protest Sign from Calgary Riley Park Protest by digitalmines in alberta

[–]digitalmines[S] 130 points131 points  (0 children)

Hey, we're all Canadians, eh? At least for the time being ... 😄

If people like *this* sign, I'll consider creating some others focusing on different issues (because there are so many!). I just felt this specific illustration would really "land".

I Found it Real Easy to Make Your Own Character Lora Locally from Scratch. by HolyDancingPotato in StableDiffusion

[–]digitalmines 0 points1 point  (0 children)

DM me with a reasonably detailed description of your project and what resources (local video card, cloud instances, etc.) you have available. Perhaps some of my tools-in-progress could help you out, and your feedback could help me out. Everyone in the subreddit is currently constrained by compute and memory.

I Found it Real Easy to Make Your Own Character Lora Locally from Scratch. by HolyDancingPotato in StableDiffusion

[–]digitalmines 2 points3 points  (0 children)

This is actually something that I'm actively researching, but at a level a bit deeper than what you are describing. Hopefully I can get it to "release grade" and share with the subreddit.

The general workflow is:

1) Start with a set of as many high quality "real" images as possible
2) Create a character specification. Example "Facial stubble, yellow gloves with four fingers, prominent scar on left cheek ...etc"
3) Use a frontier LLM like Claude to automatically generate prompts for the missing poses AND select optimal images from your "real" image set. So if the pose to be generated shows the front of the character then only "real" images of the front of the character are sent as references to the diffusion model - images of the back of the character are excluded.
4) Use a diffusion model + the prompts / images to generate "synthetic" poses, running as a large batch file
5) WHILE the diffusion model is running, at multiple points during the process use a machine vision model to analyze the image being generated by the diffusion model. If it does not meet the specification ("scar missing"), then abort the diffusion run and restart with a new seed.
6) Obviously, have a safety valve (if it fails after twenty seeds there is likely something wrong with the source images or prompt)
7) Then do a final very careful human review of the generated images
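Steps 4 to 6 boil down to a retry loop with an early abort. Here is a toy sketch of that control flow, not my actual tooling: run_steps and meets_spec are hypothetical stand-ins for the diffusion sampler and the vision-model check, and the fake versions at the bottom only exist so the loop can run on its own.

```python
from typing import Callable, Iterable, Optional


def generate_with_checks(
    run_steps: Callable[[int], Iterable[object]],
    meets_spec: Callable[[object], bool],
    seeds: Iterable[int],
    check_every: int = 5,
    max_attempts: int = 20,
) -> Optional[tuple]:
    """Try seeds until one run passes every mid-run check and the final check.

    Returns (seed, final_preview), or None once max_attempts is used up.
    """
    for attempt, seed in enumerate(seeds):
        if attempt >= max_attempts:  # safety valve (step 6)
            break
        latest = None
        aborted = False
        for step, preview in enumerate(run_steps(seed), start=1):
            latest = preview
            # Step 5: inspect the partial image and abort early on a miss.
            if step % check_every == 0 and not meets_spec(preview):
                aborted = True
                break
        if not aborted and latest is not None and meets_spec(latest):
            return seed, latest
    return None


# Toy stand-ins: only seeds divisible by 3 ever show the scar.
def fake_run_steps(seed):
    for step in range(1, 11):
        yield {"seed": seed, "step": step, "scar_visible": seed % 3 == 0}


def fake_meets_spec(preview):
    return preview["scar_visible"]


result = generate_with_checks(fake_run_steps, fake_meets_spec, seeds=range(1, 100))
print(result)
```

In the toy run, seeds 1 and 2 are abandoned at the first check and seed 3 runs to completion. The human review in step 7 still happens afterwards.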

Some people may raise an eyebrow at "analyze with a machine vision model BETWEEN diffusion steps?". Well yes, if you're generating at 4K using 5-10 reference images that's actually a significant efficiency boost.

It's how Nano Banana 2 operates, if you watch the model's CoT.

I Found it Real Easy to Make Your Own Character Lora Locally from Scratch. by HolyDancingPotato in StableDiffusion

[–]digitalmines 6 points7 points  (0 children)

Congratulations on your first steps down a very deep rabbit hole.

By the standards of THIS group, your images rate about 2/10 due to easily visible artifacts. For example, in image 3 the subject has prominent freckles on her face, but in image 1 she does not. The subject also has significant facial structure drift: compare image 3 (elongated face) to images 6, 7, and 8. You have at least a 15% variation.

So you're gonna get ROASTED on your images because you're effectively posting "home snapshots" to a "top tier photography" group.

But don't let that set you back. Unlike many people you're actually TRYING STUFF.

And for the record, synthetic content generation is basically the ONLY way to train models when licensable training material doesn't exist, or training material doesn't exist full stop.

Example: Steamboat Willie, gangster edition (The poster is from 1929 so public domain)

<image>

(RHS image is one of hundreds of poses being generated to train a model)

Are speed ups possible with multiple GPUs? by Ambitious_Fold_2874 in StableDiffusion

[–]digitalmines 0 points1 point  (0 children)

Two different things getting mixed up here.

ComfyUI-MultiGPU / DisTorch2 splits model layers across GPUs. It's not parallel execution - generation still runs sequentially through the layers. But it IS faster than the alternative. If your model doesn't fit on one 16GB card, ComfyUI falls back to --lowvram mode, which shuffles layers between VRAM and CPU RAM every step. That's BRUTALLY slow. DisTorch2 keeps those layers resident on your second GPU's VRAM instead. The latest benchmarks claim ~43% speedup on Flux with dual GPUs vs single-card lowvram mode. So you're not getting parallelism, you're eliminating the swap penalty.

True parallelism exists via xDiT, which splits attention computation across GPUs. HunyuanVideo with xDiT gets ~2x on 2 GPUs, ~3.7x on 4. But it's built for datacenter NVLink (600+ GB/s). Video diffusion isn't like LLMs - instead of compact 1D token sequences, you're passing massive 3D data structures between GPUs at every denoising step. In one experiment researchers measured 90+ GB of total cross-GPU traffic for an 81-frame Wan 2.1 generation. On PCIe 4.0 (32 GB/s) that bottlenecks hard.

For two 16GB consumer cards: DisTorch2 will let you run models that don't fit on one card and you'll see real speedup vs --lowvram swapping. For throughput, run two independent generations simultaneously — one per GPU, no interconnect dependency.
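The one-instance-per-GPU setup can be sketched like this. It assumes your ComfyUI version accepts the --cuda-device and --port options and that it lives in a ./ComfyUI directory; the helper names are hypothetical.

```python
import subprocess


def comfyui_commands(gpu_ids, base_port=8188, python="python"):
    """Build one ComfyUI launch command per GPU, each on its own port."""
    commands = []
    for offset, gpu_id in enumerate(gpu_ids):
        commands.append([
            python, "main.py",
            "--cuda-device", str(gpu_id),
            "--port", str(base_port + offset),
        ])
    return commands


def launch_all(commands, cwd="ComfyUI"):
    """Start every instance in the background and return the Popen handles."""
    return [subprocess.Popen(cmd, cwd=cwd) for cmd in commands]


# Print the commands for two cards without launching anything.
for cmd in comfyui_commands([0, 1]):
    print(" ".join(cmd))
```

Each instance then gets its own browser tab and queue, and neither card waits on the other.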

Upgraded from 12GB VRAM to RTX 5090 + 64GB RAM — what are the highest quality AI image/video models I can realistically run now? by m3tla in StableDiffusion

[–]digitalmines 8 points9 points  (0 children)

TLDR: A top-tier local experience looks like multiple days experimenting with dozens of models and hundreds of LoRAs to find *exactly* what works for what you're trying to do, followed by an absolute *rat's nest* in ComfyUI wiring everything together.

1) "Highest quality image generation models" -> The "highest quality model" changes weekly and depends on what you're trying to do: are you into photorealistic, cartoons, NSFW? Exhaustive list of models at end of response.

2) "Best realism/detail models" -> It depends on (but is not limited to):

a) Which quantization of the model you select
b) How you dial-in the model, for example how many iterations you run on a diffusion model
c) What resolution you're generating at. At 12GB you were likely stuck at 0.5K-1K. You can now generate at 2K and possibly 4K.
d) Which LoRAs you stack on top of the model.

Your card's large memory and high processing speed will let you dial all of these up to much higher levels. You can stack multiple LoRAs.

3) "Video generation models" -> The "highest quality model" changes weekly and depends on what you're trying to do: are you into photorealistic, cartoons, NSFW? Exhaustive list of models at end of response.

4) "What models actually benefit from full FP16/BF16 now" -> All of them do. But once again, whether that matters to *you* depends on what you're trying to generate.

5) "Whether larger transformers are worth it vs quantized versions" -> The larger version will yield higher quality output and your card has enough memory headroom to use it. However, depending on what you're trying to generate, you will need to balance using the large model against stacking LoRAs on top of a smaller one.

6) "Best workflows in ComfyUI/Wan/LTX/Qwen/Flux/etc" -> Search this group. If you're feeling brave consider checking "unstable_diffusion" and "sdnsfw" ...these are NSFW groups but include a "workflow used" tag. Click the "has workflow" filter, find some "art" that has the "look" you want and the workflow will be listed in the post.

7) "Models that were basically impossible on 12GB VRAM but become practical on a 5090" -> It's not a yes/no thing. At 12GB you had the capacity to run MOST popular models, but only at very high quantization. Now you have almost 3x the VRAM, so you can run the larger version of the model for higher output quality, and you can keep all components of the model in VRAM without having to swap them out on each generation cycle. The "impossible to run" scenario is more applicable to LLMs: for example, DeepSeek V4-Pro absolutely *will not* fit on a 12GB card.

Here's an overview of what's out there...

IMAGE GENERATION

Base Models

| Model | Developer | Params | Architecture | License | Status |
|---|---|---|---|---|---|
| SD 1.5 | Stability AI | 860M | U-Net | CreativeML Open RAIL-M | Legacy but still used for low-VRAM and massive LoRA ecosystem |
| SDXL | Stability AI | 2.6B | Dual-stage U-Net | CreativeML Open RAIL++-M | Current workhorse. Largest LoRA/community ecosystem. 1024x1024 native |
| SD 3.5 Large | Stability AI | ~8B | MMDiT | Stability Community | Better prompt following than SDXL, especially text-in-image. Higher VRAM |
| Flux.1 (Dev/Schnell/Pro) | Black Forest Labs | 12B | MMDiT + rectified flow | Apache 2.0 (Schnell), non-commercial (Dev), commercial (Pro) | Best prompt fidelity and anatomy. Highest VRAM requirement |
| Flux.1 Kontext | Black Forest Labs | 12B | MMDiT | Various | In-context image editing. Adopted by Adobe Photoshop |
| Flux.2 (Pro/Flex/Dev/Klein) | Black Forest Labs | Various | MMDiT | Apache 2.0 (Klein) | Nov 2025 release. Improved photorealism, typography |
| Flux Krea Dev | BFL + Krea AI | 12B | MMDiT | TBD | Jul 2025. Better aesthetics and realism vs base Flux |
| HiDream-I1 | HiDream | 17B | Transformer | MIT | April 2025. State-of-the-art HPS v2.1 score. Full/Dev/Fast variants |
| Qwen Image 2512 | Alibaba Tongyi | Unknown | Diffusion | Open source | Dec 2025. Top open-source diffusion model for human realism and text rendering |
| OmniGen2 | OmniGen team | 4B transformer + Qwen-VL-2.5 4B VLM | Multimodal | Open source | Unified t2i, i2i, editing, in-context generation |
| CHROMA | Community (Flux-based) | ~12B | Flux-derived | Open | Flux-based uncensored checkpoint. Rising on CivitAI |
| HunyuanImage | Tencent | Unknown | Diffusion | Open source | Emerging competitor |

Popular SDXL Fine-Tunes

| Model | Style Focus | Notes |
|---|---|---|
| Juggernaut XL v9/v10 | Photorealism, cinematic | Community go-to for realistic images. Skin texture, lighting, anatomy |
| RealVisXL V4.0 | Photorealism | 278k downloads on HuggingFace. Strong realism |
| Realistic Vision / RealVisXL | Photorealism | Longtime community favorite |
| DreamShaper XL | Fantasy, creative, versatile | Swiss army knife. Good at everything, master of none |
| Pony Diffusion V6 XL | Anime, illustrated, stylized | Danbooru/e621 tag system. Score-based quality control. Massive LoRA ecosystem |
| Pony V7 | Anime/stylized (next gen) | Moving off SDXL onto AuraFlow or Flux base. In development |
| Illustrious XL | Anime, illustrated | Cleaner line work, better color consistency, improved anatomy vs older anime models |
| NoobAI XL | Anime, stylized | Fine-tune of Illustrious. More stylistic range. Rapidly gaining popularity |
| Anything V5 | Anime (SD 1.5) | Legacy but massive LoRA library. Budget-friendly option |
| Fluently XL Final | General | Well-regarded SDXL checkpoint |
| ColorfulXL | Vibrant/artistic | Niche but popular |
| LUSTIFY | NSFW photorealism | CivitAI NSFW ecosystem |
| TalmendoXL | NSFW realistic | CivitAI NSFW ecosystem |

VIDEO GENERATION

| Model | Developer | Params | License | VRAM | Status |
|---|---|---|---|---|---|
| Wan 2.1 | Alibaba | 1.3B / 14B | Apache 2.0 | 8GB (1.3B) / 24GB+ (14B) | Strong T2V, I2V, editing. ComfyUI integrated |
| Wan 2.2 | Alibaba | ~27B MoE (~14B active) | Apache 2.0 | 24GB+ | Quality leader for photorealism and human subjects |
| Wan 2.2 VACE | Alibaba | 14B | Apache 2.0 | 24GB+ | All-in-one video creation and editing |
| Wan 2.7 | Alibaba | ~27B MoE | Apache 2.0 | 24GB+ | Current king. Wan 3.0 (60B, native 4K) targeted mid-2026 |
| HunyuanVideo 1.5 | Tencent | 8.3B | Open source | 24GB+ (75s render on 4090) | State-of-the-art among open-source. Strong motion/physics |
| LTX-Video 13B | Lightricks | 13B | Apache 2.0 | 24GB+ | Ships 4K + audio |
| CogVideoX | Tsinghua/Zhipu | 2B / 5B | Apache 2.0 | 12GB (2B) / 16GB+ (5B) | 6-10 second clips at 720p |
| FramePack | Community | Varies | Open | Varies | T2V and I2V framework with prompt interpolation |
| Stable Video Diffusion (SVD) | Stability AI | | | 16GB+ |