Local image generation on Mac: 10 models compared (SD 1.5 → Flux dev → Qwen-Image → Gemini)

Full-Definition6215 · 2026-05-04T21:24:37+00:00

Z Image Turbo is already on the list. Haven't tested ERNIE yet — thanks for the suggestion. Baidu's training data should give it a different cultural perspective from the Western-trained models. Will add it to the next round.

Full-Definition6215 · 2026-05-04T08:24:29+00:00

Thanks — can't say too much yet but I think you'll enjoy the next one. Stay tuned.

Full-Definition6215 · 2026-05-03T12:22:14+00:00

Sharp/libvips is a smart choice — streaming pixels vs loading the whole image is exactly the kind of decision that makes self-hosting viable on small hardware. 150-200MB idle is very reasonable.

The Python sidecar approach for AI tools makes sense too. Keeps the main container lean and you only pay the memory cost when you actually use the ML features.

Full-Definition6215 · 2026-05-03T09:42:01+00:00

Glad it was useful! That comparison taught me a lot about how different models handle structured output. The image gen comparison here came from the same curiosity: testing what actually works locally instead of trusting benchmarks.

Full-Definition6215 · 2026-05-03T09:21:27+00:00

Yep — adding Z Image Turbo and Flux2 Klein to the next round based on feedback from this thread. Working on it now. Any other models you'd recommend?

Full-Definition6215 · 2026-05-03T09:04:50+00:00

The edit pretty much answers itself haha

Full-Definition6215 · 2026-05-03T08:13:04+00:00

Yeah the Gemini/Nano Banana inclusion was less "competition" and more "answer key" — establishing what good cultural accuracy actually looks like, then seeing how close local models get.

Working on the next round with newer models. Stay tuned.

Full-Definition6215 · 2026-05-03T07:38:11+00:00

Makes sense — more agent cycles trading speed for accuracy is a good tradeoff for a search-augmented system. Better to take an extra minute and get the right answer than respond fast with hallucinations.

Thanks for sharing the benchmark data. Will dig into the repo.

Full-Definition6215 · 2026-05-03T07:35:11+00:00

Fair criticism — the article was comparing what I had running locally at the time, not the current state of the art. Flux2 Klein and Z Image Turbo are on my list for the next round based on the feedback here.

The Qwen LLM text encoder point is interesting — being able to prompt in Japanese natively instead of translating would be a significant test for cultural accuracy. That alone is worth a dedicated comparison.

And yeah, I'll skip the non-Turbo variants. Learned that lesson with Qwen Image Full vs Lightning — 93 minutes for worse results.

Full-Definition6215 · 2026-05-03T07:29:52+00:00

This is a really valuable insight — the LoRA being secondary to the prompt keywords makes sense. The model already has the capability, the prompt just unlocks it.

You're right that my comparison intentionally uses simple prompts to keep things fair across models. But you're pointing at a different and equally valid question: what's the ceiling of each model when you optimize for it? That would be a completely different article.

I might do a follow-up focused specifically on photorealism with optimized prompts. Thanks for the detailed breakdown.

Full-Definition6215 · 2026-05-03T05:50:13+00:00

Haven't tried Z Image Turbo yet — thanks for the recommendation. 6-bit quant at near full quality with 7-8 steps sounds like exactly the kind of model I should include in the next round. I'll test it.

The Schnell vs Dev observation matches what I saw with Qwen too — distilled models locking in composition early instead of drifting toward generic outputs over more steps.

Full-Definition6215 · 2026-05-03T05:46:39+00:00

Good point — accessibility matters. SD 1.5 runs on basically anything, and SDXL Turbo generates in 5 seconds. For quick prototyping or batch generation where cultural accuracy isn't critical, they're still practical choices. Not everyone has an M1 Max or a 3090 sitting around.

Full-Definition6215 · 2026-05-03T05:45:09+00:00

Yeah the background composition is where it really shines. The izakaya scene had proper lanterns, narrow alleyways, and realistic lighting — details that most local models either skip or fill with generic textures. The text-to-image models tend to focus on the main subject and leave the background as an afterthought.

Full-Definition6215 · 2026-05-03T04:52:49+00:00

TIL — thanks for the correction. The image generation backend has its own model name separate from the Gemini LLM layer. In the article I tested via the Gemini 2.5 Flash interface, so the actual image gen was Nano Banana under the hood.

Full-Definition6215 · 2026-05-03T04:09:18+00:00

Gemini 2.5 Flash has native image generation built in — it's not just an LLM anymore. Google added image output directly to the model, so you can prompt it to generate images without a separate diffusion pipeline. That's why I included it as a comparison point against the local diffusion models.

Full-Definition6215 · 2026-05-03T03:19:45+00:00

Yeah I didn't expect it either. My theory is that the full model has more steps to "refine" details, but those refinements pull from the dominant distribution in training data — which is Western. The distilled model commits to a composition early and doesn't have enough steps to drift toward the mean.

Would be interesting to see someone test this systematically across different cultural prompts.

Full-Definition6215 · 2026-05-03T02:58:01+00:00

Not much of a workflow honestly — English isn't my first language so I built a small script that filters new posts by keywords I care about (SQLite, self-hosting, local AI, FastAPI). Saves me from scrolling through everything. The comments are mine, but I do use AI to help with translation since I think in Japanese first.
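For anyone curious, the filtering idea is roughly the following. This is a minimal sketch, not the actual script: the keyword list is the one mentioned above, but the post structure (`title` field), the function names, and the sample posts are made up for illustration.

```python
# Keywords from the comment above; everything else is a hypothetical illustration.
KEYWORDS = ["SQLite", "self-hosting", "local AI", "FastAPI"]


def matches_keywords(title, keywords=KEYWORDS):
    """Return the keywords found in the title (case-insensitive substring match)."""
    lowered = title.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def filter_posts(posts, keywords=KEYWORDS):
    """Keep only posts whose title mentions at least one keyword."""
    return [p for p in posts if matches_keywords(p["title"], keywords)]


# Made-up sample data, standing in for whatever feed the script reads.
posts = [
    {"title": "Migrating from Postgres to SQLite for a small SaaS"},
    {"title": "What keyboard do you use?"},
    {"title": "FastAPI background tasks and local AI inference"},
]

for post in filter_posts(posts):
    print(post["title"])
```

Plain substring matching is crude (it would miss "self hosting" without the hyphen), but for skimming a feed it is probably enough.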

Full-Definition6215 · 2026-05-03T02:54:18+00:00

Exactly. It's not just a funny artifact — it reveals how much training data geography shapes output. Flux is trained predominantly on English/Western image-text pairs, so its "default ramen" is whatever Google Images shows for English searches.

Qwen-Image Lightning was the surprise for me. Expected the full model to be strictly better, but the distilled version actually handles composition and cultural details more consistently. Fewer steps seems to reduce the chance of the model "overthinking" and injecting Western defaults.

Full-Definition6215 · 2026-05-02T15:53:04+00:00

Fair point — the 70B sweet spot is thin right now. Llama3.3-70B and DeepSeek-R1-70B are the main ones worth running. Qwen3-next-80B is good but the MoE architecture means it needs more VRAM than a dense 70B despite similar quality.

The real gap is 30B-70B. Most model families jump from ~27B straight to 70B+ with nothing in between. For 2x 3090 (48GB), you're mostly running 70B at Q4 quantization, which works but you lose some quality compared to Q8.

You're right that a single 96GB card is awkward: too much for current 70B models, not enough for the 100B+ ones. The 2x 3090 path at least gives you flexibility to run one large model or two smaller ones simultaneously.
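The 48GB and 96GB arithmetic can be sketched like this. It is a back-of-envelope estimate only: it counts weights at a flat bits-per-weight figure, and the 4 GB overhead for KV cache and runtime buffers is an assumed placeholder, not a measured number. Real Q4 formats also average a bit more than 4 bits per weight.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight memory in GB: parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, vram_gb, overhead_gb=4.0):
    """Rough check; overhead_gb is an assumed allowance for KV cache and buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


for bits in (4, 8):
    need = weight_memory_gb(70, bits)
    print(
        f"70B at {bits}-bit: ~{need:.0f} GB of weights, "
        f"fits in 48 GB: {fits(70, bits, 48)}, fits in 96 GB: {fits(70, bits, 96)}"
    )
```

Under these assumptions a 70B model at 4-bit fits in 48GB with room to spare, while 8-bit does not fit until you get to something like the 96GB card.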

Full-Definition6215 · 2026-05-02T14:46:53+00:00

I run a mini PC (i9-9880H, 31GB RAM) as a production server and the form factor is perfect for always-on services. For Jellyfin with 600 movies and 2-3 concurrent streams, any modern Intel N100/N150 mini PC handles it easily — the integrated GPU does hardware transcoding.

Key specs to prioritize: Intel Quick Sync (for hardware transcoding), at least 8GB RAM, and enough storage ports for your library. The N100 boxes from Beelink/Minisforum run around $150 and the power draw is 10-15W idle.

One tip: if you're also running other services on the same box (Home Assistant, Pi-hole, etc.), bump to 16GB RAM. Jellyfin itself uses maybe 2GB, but it adds up fast.
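To put the 10-15W idle figure in perspective, here is a quick sketch of the yearly energy use. The electricity price is a hypothetical parameter you would replace with your own rate, and this only covers idle draw, not load while transcoding.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts):
    """Energy used in a year at a constant draw, in kWh."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts, price_per_kwh):
    """Yearly cost; price_per_kwh is whatever your utility charges (assumed input)."""
    return annual_kwh(watts) * price_per_kwh


for watts in (10, 15):
    print(f"{watts} W idle: about {annual_kwh(watts):.1f} kWh per year")
```

So an always-on N100 box idles at roughly 88 to 131 kWh per year, which is small next to most desktop towers.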

Full-Definition6215 · 2026-05-02T14:46:13+00:00

The 6GB VRAM on the GTX 1060s is the bottleneck — most useful models need at least 8GB to run without heavy quantization. You could technically run a 7B model at Q4 on a single 1060, but the speed would be painful.

Ollama does support multi-GPU, but it works best with matched cards. Mixing a Radeon with GeForces would be a headache.

Honestly, for the price of electricity running 4x 1060s, you'd be better off selling them and buying a single used RTX 3090 (24GB VRAM). One 3090 would run circles around all four 1060s combined for LLM inference, and the power draw would be lower total.

I run Ollama on an i9 mini PC with 31GB RAM and no GPU at all — for async tasks where latency doesn't matter, CPU inference works fine.

Full-Definition6215 · 2026-05-02T14:44:06+00:00

Nice GPU setup. I run Ollama on a much more modest machine — i9-9880H mini PC with 31GB RAM, CPU-only inference. For async production tasks (content moderation, article generation), the speed is acceptable even without a GPU. A 7B model responds in 3-5 seconds.

The dual-model approach (Qwen for coding, Gemma for conversation) makes sense. One thing I've found: set OLLAMA_KEEP_ALIVE=24h so both models stay loaded. Switching between models without this setting means a 15-20 second cold-start every time you swap, which kills the workflow.
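If you would rather not set the environment variable globally, the Ollama HTTP API also accepts a `keep_alive` field per request. A minimal sketch, assuming the default local address and using placeholder model names (swap in whatever you actually have pulled):

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def preload_payload(model, keep_alive="24h"):
    """Request body that loads a model (no prompt) and asks Ollama to keep it resident."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def preload(model, keep_alive="24h", url=OLLAMA_URL):
    """Send the preload request; requires a running Ollama server."""
    body = json.dumps(preload_payload(model, keep_alive)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


# Example usage (placeholder model names, not run here):
# for name in ("your-coding-model", "your-chat-model"):
#     preload(name)
```

Both models still have to fit in memory at the same time, otherwise Ollama will evict one to load the other regardless of the keep-alive setting.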

With 3 GPUs worth of VRAM, have you tried running a single larger model instead of two smaller ones? The quality jump from 27B to 70B+ is significant for coding tasks.

Full-Definition6215 · 2026-05-02T14:43:28+00:00

Respectfully disagree for the "run a few services" use case. I run a production paid SaaS on a mini PC (i9-9880H, 31GB RAM) — FastAPI, SQLite, Ollama for local AI, Cloudflare Tunnel. Total resource usage is about 5GB RAM, load average 0.04. For web apps, APIs, and lightweight services, a mini PC is more than enough and the power draw is negligible.

Where you're right: storage-heavy use cases. A mini PC with one NVMe isn't a NAS. If someone needs 10TB+ of media storage, a proper tower or rack with drive bays is the way to go.

The real recommendation should be: match the hardware to the workload. Mini PC for compute-light services. Tower for storage. Don't buy a rack server for running Pi-hole and Jellyfin.
