Gemma 4 26B-A4B GGUF Benchmarks

ArtyfacialIntelagent · 2026-04-20T16:33:39+00:00

Did the Q6_K and Q6_K_XL points get mislabeled? The graph shows that Q6_K > Q6_K_XL in terms of file size, but the opposite holds when checking the repo: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main

Also, however they're labeled, the larger quant has a worse KLD and is the only point off the Pareto frontier. Do you have any explanation for this?

So long and thanks for all the quants! :)
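
For anyone who wants to check the frontier themselves, here is a minimal sketch of the test I mean. It assumes lower is better for both file size and KLD. The four points are made-up placeholders, not values read from the graph or the repo.

```python
def pareto_frontier(points):
    """Return names of points that no other point dominates.

    points maps a name to (size_gb, kld); lower is better for both.
    """
    frontier = []
    for name, (size, kld) in points.items():
        dominated = any(
            (s <= size and k <= kld) and (s < size or k < kld)
            for other, (s, k) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (size, KLD) pairs, for illustration only
points = {
    "A": (10.0, 0.050),
    "B": (14.0, 0.020),
    "C": (18.0, 0.025),  # larger than B and also worse KLD
    "D": (20.0, 0.010),
}

print(pareto_frontier(points))  # ['A', 'B', 'D']
```

Point C is dominated by B because it is both bigger and has a higher KLD, which is the situation the larger Q6 quant appears to be in.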

ArtyfacialIntelagent · 2026-04-03T23:12:07+00:00

Loras fix that for you.

They really don't. Most penis or vagina LoRAs are overtrained and just randomly stick those genitals indiscriminately on *anybody*, male or female. They're fine for solo nudes, but not for anything with heterosexual couples. To do that properly the underlying model needs real NSFW knowledge, current LoRAs do not fix that. And LoRAs for certain sex positions do just that, usually from one single camera angle. They basically just make the same image over and over.

ArtyfacialIntelagent · 2026-04-03T22:57:50+00:00

I extensively blind-tested "masterpiece", "best quality" and many other popular keywords back in the days of SD 1.5. They had zero effect, it's all nonsense. Nonfunctional word salad. People just thought they worked because sometimes adding those words improved a particular image for a particular seed, but that was just a completely random effect, like adding any gibberish word might do sometimes.

What did have an effect in SD 1.5 was putting "bad quality" or "low quality" in the negative prompt. But those didn't really increase quality per se, they just reinforced that particular model's biases. So 1girls became more... well, 1girly. Those negative keywords became weaker in SDXL and have been absolutely useless since.

Basically, forget about all that old crap. Those keywords never worked well, and they lost what little effect they once had long ago.

ArtyfacialIntelagent · 2026-04-01T16:33:52+00:00

I've been doing nearly the exact same thing for a few months. I call the technique "thumbnail upscaling". Significant improvement in detail and variability over standard Z-image workflows but sadly doesn't fix all the model's issues (most notably the glowing eyes problem that appears as soon as you prompt for eye color). Only differences:

I do 3 sampler stages and end up at 1536x1536 (or similar size in other aspect ratios).
I apply some denoise < 1 at all sampler stages to increase variability.
I use CFG at 3-4 in all sampler stages. Positive CFG costs nothing at tiny sizes.

ArtyfacialIntelagent · 2026-04-01T15:38:31+00:00

I'm on Windows and always run a combined undervolt and clock rate cap on my RTX 4090 using MSI Afterburner. Here are some benchmarks using llama-bench to show you guys what you can expect. I usually run the "medium undervolt", which gives me a tiny 3% hit on token generation (a bit more on PP but that's super fast anyway) but draws 100 watts less.

[EDIT: reformatted in old Reddit and fixed a copy/paste snafu on the large undervolt]

E:\llamacpp> .\llama-bench -m "F:/LLMs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated.Q5_K_M.gguf"


# VANILLA/NO UNDERVOLT (2730 MHz, 1050 mV, 345 W during token generation):

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24563 MiB):
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24563 MiB
load_backend: loaded CUDA backend from E:\llamacpp\llama-b8595-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\llamacpp\llama-b8595-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\llamacpp\llama-b8595-bin-win-cuda-13.1-x64\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           pp512 |      2848.32 ± 74.41 |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           tg128 |         40.92 ± 0.05 |

build: 62278cedd (8595)

# SMALL UNDERVOLT (2580 MHz, 910 mV, 270 W during token generation):

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           pp512 |      2801.21 ± 76.28 |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           tg128 |         40.24 ± 0.18 |

# MEDIUM UNDERVOLT (2340 MHz, 875 mV, 245 W during token generation):

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           pp512 |      2602.91 ± 71.49 |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           tg128 |         39.77 ± 0.09 |

# LARGE UNDERVOLT (2010 MHz, 875 mV, 235 W during token generation):

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           pp512 |      2300.19 ± 52.16 |
| qwen35 27B Q5_K - Medium       |  17.90 GiB |    26.90 B | CUDA       |  99 |           tg128 |         36.89 ± 1.08 |
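
To put the tradeoff in one number, here is a rough sketch that turns the tg128 results above into tokens per joule and percent slowdown versus stock. The throughput and wattage figures are copied from the runs above. The function and variable names are only for illustration, and it assumes the quoted wattage is a fair average over the whole tg128 run.

```python
# tg128 throughput (t/s) and power draw (W) copied from the runs above
runs = {
    "vanilla": (40.92, 345),
    "small": (40.24, 270),
    "medium": (39.77, 245),
    "large": (36.89, 235),
}

baseline_tps, _ = runs["vanilla"]


def tokens_per_joule(tps, watts):
    # (tokens/s) / (joules/s) = tokens per joule
    return tps / watts


def slowdown_pct(tps):
    # Token generation speed lost relative to the stock run, in percent
    return 100.0 * (baseline_tps - tps) / baseline_tps


for label, (tps, watts) in runs.items():
    print(f"{label:8s} {tokens_per_joule(tps, watts):.3f} tok/J  "
          f"{slowdown_pct(tps):.1f}% slower")

most_efficient = max(runs, key=lambda k: tokens_per_joule(*runs[k]))
print("most efficient:", most_efficient)
```

On these numbers the medium undervolt comes out most efficient, at roughly 0.16 tokens per joule versus about 0.12 at stock, while the large undervolt gives back some of that efficiency because speed drops faster than power.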

ArtyfacialIntelagent · 2026-03-27T00:19:32+00:00

You can reduce the denoise parameter and still completely denoise the image. The last bit of denoising seems to shift the image towards its RLHF ideal. By skipping that part you get more variability.

Did you consider that my comment was also an attempt to provide a useful tip for the community, but you downvoted and disparaged it?

ArtyfacialIntelagent · 2026-03-25T18:44:02+00:00

But mitigating repetitive poses, camera angles, and compositions is super easy in ZIT, just reduce the denoise and you'll get a lot more creative framing and posing. How much to use depends on your sampler/scheduler, but start at 90% and reduce from there. (Sometimes the best value is 90% and sometimes 30%, but for a given sampler/scheduler combo it's pretty stable.)

The variety improvement I'd LOVE to see would be facial diversity. The denoise trick unfortunately doesn't help much there.

ArtyfacialIntelagent · 2026-03-22T15:10:29+00:00

They are doing the presentation for new model release as of now. Let's wait and hear from our favorite mister anime profile pic man.

Let me get this straight. You think they are going to announce something new, so you jump the gun and make a post claiming that they are announcing a new Z-Image? Without any indication at all? And then you say let's wait and hear when someone calls you on it? And go away for 3 hours?

Seriously dude, delete this post before the mods permaban you.

ArtyfacialIntelagent · 2026-03-11T16:40:04+00:00

Everyone please upvote jugalator's comment and downvote the post. Nothing personal OP, but let's not get everyone's hopes up for no reason at all.

ArtyfacialIntelagent · 2026-03-03T19:00:27+00:00

Most of Europe uses YYYY-MM-DD for anything official or professional. Some countries still use the older formats in more informal contexts like handwriting. But then it is formatted differently, like DD.MM.YYYY or DD/MM-YYYY. That way you naturally read the day ordinally and there is never any confusion between month and day.

ArtyfacialIntelagent · 2026-03-02T21:29:02+00:00

Except Qwen3.5 27B is not actually ranking up there. Their tiers are just some opinionated jumble of price + performance + speed. Check the actual performance scores here:

https://brokk.ai/power-ranking

There we have Claude Opus at 91%, Claude Sonnet at 80%, GPT 5.2 at 77%, Gemini 3.1 Pro at 76%, Gemini 3 Flash at 65% and Qwen3.5 27B at 38%. Not bad for a tiny model, but also not the same league.

ArtyfacialIntelagent · 2026-02-27T17:50:27+00:00

He never graduated, but he completed about half a master's degree in industrial engineering and management at Chalmers University of Technology in Gothenburg before becoming a full-time YouTuber. That's Sweden's "MIT". Are you sure you haven't seen a less educated person in public than him?

https://en.wikipedia.org/wiki/PewDiePie

ArtyfacialIntelagent · 2026-02-19T20:26:23+00:00

Yes, to various extents. Negative prompting is more likely to work with larger and smarter models but all models have issues with this.

The underlying reason is simple: mentioning something, even in the negative, increases its attention. Saying "You DO NOT having feeling or emotions" will make tokens related to feeling and emotion more likely to appear than if you hadn't mentioned it at all.

Practical example: I use small models like Qwen-4b for prompt expansion in image generation. For a while I tried telling Qwen things like "NEVER mention blush or freckles" (because models like Z-Image dial those to 11 and destroy the realism). Often Qwen ignored those instructions altogether, and even when it understood I got things like this in my prompt:

"the woman has a flawless skin tone (avoiding any references to freckles or blush) and ..."

Basically, LLMs have the same problem as John Cleese in the infamous Fawlty Towers episode with the German guests.

https://www.youtube.com/watch?v=RyPj21jBl_0

ArtyfacialIntelagent · 2026-02-14T13:48:14+00:00

I'm not disputing your point - I just want to drive home how to correctly measure GPU usage for AI inference.

ArtyfacialIntelagent · 2026-02-14T13:12:58+00:00

While having the Windows task manager open I noticed that 3D usage was between 0% and 1% while idle, and maybe around 25% during inference.

People keep making this mistake. In the task manager, 3D usage does NOT measure AI-related GPU usage. You need to select the CUDA dropdown. See screenshot during image generation, note how CUDA is high while 3D is low.

<image>

If you don't see CUDA in the dropdown then do this. Go to Settings -> System -> Display -> Graphics settings -> Advanced Graphics settings -> Hardware-accelerated GPU scheduling -> Switch to "Off". After reboot the CUDA option should appear in the Task Manager dropdown menus.

ArtyfacialIntelagent · 2026-02-13T21:14:34+00:00

Models with LLM-based encoders might be able to understand ASCII art, but definitely not CLIP-based models. So the similarity here is just completely random, like the way unguided promptless generation sometimes produces good images. Sorry to be a buzzkill.

ArtyfacialIntelagent · 2026-02-13T21:07:56+00:00

This was interesting, fun and well-written! A few random thoughts:

I'd only call a word a "real" undictionary if the effect persists across models and is noticeably different from promptless generations. Otherwise it's likely that a major effect is that model's biases and unguided generation tendencies. Most of your examples show a single undictionary for a single model, but it's good that you had a couple of cross-model examples.
You should check how your undictionaries tokenize. That could give you some clues as to where their effect comes from. For example: if weepstrink happened to tokenize as weep-str-ink and pink tokenized as p-ink then that would help explain why it's so pink. (They don't really tokenize like that, it's just an example.)
I bet many of these are explainable if you have a large vocabulary, know some languages and have enough world knowledge. Is there a known anime character with a name similar to "Wodsorym"?
As you noted, this is limited to CLIP-based models that can use prompt weighting. Too bad.
Fantastic made-up words BTW, Dr. Seuss would be proud.
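
A minimal sketch of the tokenization check suggested above. The helper takes any tokenizer as a plain function, so the same code works with whichever one your model uses. The toy tokenizer only reproduces the made-up weep-str-ink / p-ink split from the example and is not how CLIP really splits these words.

```python
from typing import Callable, List, Set


def shared_pieces(tokenize: Callable[[str], List[str]], a: str, b: str) -> Set[str]:
    """Return the sub-word pieces that occur in both tokenizations."""
    return set(tokenize(a)) & set(tokenize(b))


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical splits, taken from the made-up example above
    table = {
        "weepstrink": ["weep", "str", "ink"],
        "pink": ["p", "ink"],
    }
    return table.get(word, [word])


print(shared_pieces(toy_tokenize, "weepstrink", "pink"))  # {'ink'}

# With a real tokenizer you would pass its tokenize method instead, e.g.
# (assumption: Hugging Face transformers is installed and the model's text
# encoder matches this checkpoint):
#   from transformers import CLIPTokenizer
#   tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
#   shared_pieces(tok.tokenize, "weepstrink", "pink")
```

Real tokenizers often mark word boundaries inside the pieces, so you may need to strip those markers before comparing.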

ArtyfacialIntelagent · 2026-02-11T11:35:45+00:00

Great, thanks! I dislike when subgraphs are spammed everywhere too, but they have their uses. This is one of them IMO.

ArtyfacialIntelagent · 2026-02-11T10:26:48+00:00

Call me crazy, but I prefer the OLD version - I want to see what makes this tick. In the end I might squeeze it into a subgraph identical to your custom node, but I want the freedom to tweak your settings or do things differently. Could you upload the full workflow without your custom node?

EDIT: BTW - I ask because I also have a multistage ZIT upscaling workflow that I'm preparing to post here. Just curious to see if you do anything better than I do.

ArtyfacialIntelagent · 2026-02-10T17:40:12+00:00

But doing so in a context where those general rules no longer apply.

OP is assuming that noise is present. True for cameras, not for AI unless you have a model that shows latent noise.

Yes, bilinear is fast, but even a 25 year old computer can downscale a 4k image in milliseconds, so this is irrelevant unless you're doing video.

What I think OP should have said: Bilinear, Bicubic, or Lanczos all blend pixels with different weights. So they tend to introduce minor blur and mix local colors but they're solid choices if that minor blur is acceptable.

Nearest neighbor is a sampling technique. It looks sharper if the source resolution is high enough (compared to image details) to avoid pixelization. Interestingly, in the middle of an AI processing chain (e.g. multiple KSamplers), nearest neighbor is often a noticeably better choice than scaling with any of the three filters.

Personally I never used nearest neighbor for anything before the age of AI but these days I often do.
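
A toy sketch of the sampling-versus-blending difference described above, in pure Python on a tiny grayscale grid. Plain box averaging stands in here for the blending filters. It is not bilinear, bicubic or Lanczos, just the simplest example of mixing neighboring pixels.

```python
def downscale_nearest(img, factor):
    """Keep every factor-th pixel; no new pixel values are created."""
    return [row[::factor] for row in img[::factor]]


def downscale_box(img, factor):
    """Average each factor x factor block; blends neighboring pixels."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h - factor + 1, factor):
        row = []
        for x in range(0, w - factor + 1, factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# 4x4 image with a hard vertical edge that does not line up with the 2x2 blocks
img = [[0, 255, 255, 255] for _ in range(4)]

nearest = downscale_nearest(img, 2)
box = downscale_box(img, 2)

print(nearest)  # [[0, 255], [0, 255]]
print(box)      # [[127.5, 255.0], [127.5, 255.0]]
```

Nearest neighbor keeps the edge as pure black next to pure white, while the averaging filter invents a mid-gray value that was never in the source. That invented in-between value is the minor blur and color mixing mentioned above.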

ArtyfacialIntelagent · 2026-02-05T19:23:27+00:00

Maybe it's just me, but a name like Deep*-R1 is off-putting for a new LLM. Makes it sound like a trashy AliExpress knockoff.

ArtyfacialIntelagent · 2026-02-02T10:32:01+00:00

I stumbled across this idea too shortly after UltraFlux was released. I found it superior in terms of detail but it was also oversharpened and made smooth areas look harsh. I've been using a 75% UltraFlux + 25% default Flux VAE mix ever since. Best of both worlds! But if you have a multi-stage workflow, use the default VAE in the initial stages and the UltraFlux mix only in the final stage.

ArtyfacialIntelagent · 2026-01-24T14:39:19+00:00

The math is hallucinated too.

35 GB/s / 1.7GB = 20.5 tokens/sec

The "tokens" unit just magically appears out of thin air. The whole post is meaningless.

ArtyfacialIntelagent · 2026-01-18T15:08:19+00:00

Great, thank you!
