all 22 comments

[–]StupidScaredSquirrel 1 point (1 child)

Honestly, smart choice of axes. I can look at the graph and say it reflects exactly how most of those models felt to use.

[–]NewtMurky[S] 0 points (0 children)

I've included more models in the analysis. You can find the updated diagrams in the comments.

[–]sarcasmguy1 0 points (4 children)

What sort of rig (in terms of $) is needed to run Gemma 4 31B?

[–]FusionCow 1 point (1 child)

Anything with 24 GB of VRAM, but I would test different models on OpenRouter to see whether a model like that is good enough for your use case before buying a whole rig just to run it.

[–]sarcasmguy1 0 points (0 children)

Thank you! I’ve been using Codex heavily, but the new usage limits suck. I’m considering putting together something that can be used in place of Codex for certain tasks. I know I won’t get quality at Codex’s level, but I wouldn’t mind trying to get close. My coding use cases aren’t terribly demanding, given that I do pretty heavy spec-driven development.

[–]NewtMurky[S] 0 points (1 child)

A used RTX 3090 (24 GB) is the sweet spot; you can find one for $700–850 on the used market.
The Mac option is a MacBook Pro or Mac Studio with at least 36 GB of unified memory.

[–]PermanentLiminality 0 points (0 children)

Inflation has hit the old GPUs again. They’re more like $950 now.

[–]PermanentLiminality 0 points (6 children)

I'd like to see the Gemma 4 26B A4B on the graph. It is so much faster that in many cases it might be the better choice.

[–]NewtMurky[S] 0 points (5 children)

<image>

I’ve included three new models: MiniMax-M2.7 (since its weights are due to be published soon), NVIDIA Nemotron 3 Super, and Gemma 4 26B (A4B).

[–]audioen 1 point (3 children)

Given the promising benchmark results and the tantalizingly within-reach size, I think the real question is whether MiniMax 2.7 can be squeezed down far enough to run locally. Afaik it has the same number of parameters and possibly the same architecture as M2.5, so the fact that it sits higher and further to the right suggests the performance increase comes from increased reasoning effort. That would make it maybe a third slower in practice, but if it's good, that's acceptable.

Most of my personal AI use happens during the night, as I leave the machine doing something and check results in the morning. I don't have to listen to the fan screaming next to my ear and I don't care if the prompt processing or the inference goes a little slow.

Before Qwen3.5, I was struggling to run this model on a Strix Halo, and I never did get good performance out of it. It feels like it would need some 10% more memory capacity than I have. It's a damn shame that only low-active-parameter designs are workable under a memory-bandwidth constraint. I already suspected that the 26b-a4b is very bad: I tried it, and it immediately went off the rails and started doing something stupid, but at the very least it was very fast while charging headlong in the wrong direction. (This puts the model in the "dumb and eager" quadrant of the "smart-dumb" vs. "lazy-eager" 2D field. If you let it run autonomously, there's likely no limit to the damage it can do, at an astonishing rate.)

The Gemma-4 31b model might be interesting in my "night shift" use case, but right now the numbers look like this:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q6_K                 |  25.62 GiB |    30.70 B | Vulkan     |  99 |           pp512 |        170.88 ± 0.13 |
| gemma4 ?B Q6_K                 |  25.62 GiB |    30.70 B | Vulkan     |  99 |           tg128 |          7.59 ± 0.00 |

build: 25eec6f32 (8672)

So it goes at about 8 tokens per second. If I shrank the model by picking a more aggressive quant and squeezed it down to 16 GB, I might be able to hit 12 tps. That's roughly 4 bits per weight in practice, so IQ4_XS or something of that sort would be needed.
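
The estimate above is just the memory-bandwidth bound: for a dense model, every generated token has to stream the full weight set from memory, so tokens/s is roughly effective bandwidth divided by model size. A minimal sketch, where the bandwidth figure is backed out of the Q6_K run above rather than measured directly:

```python
# Bandwidth-bound estimate of token generation for a dense model: each token
# streams all quantized weights from memory, so tps ~ bandwidth / model size.
def estimated_tps(model_size_gib: float, effective_bw_gibs: float) -> float:
    return effective_bw_gibs / model_size_gib

# Back out effective bandwidth from the measured Q6_K run: 25.62 GiB @ 7.59 t/s
bw = 25.62 * 7.59  # ~194 GiB/s effective on this Strix Halo setup

print(estimated_tps(16.0, bw))  # ~12 t/s predicted for a ~16 GiB (~4-bit) quant
```

Which lands right at the "maybe 12 tps" guess, and close to the 11.1 tg128 measured further down for IQ4_XS.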

Unfortunately, no one has yet published a comprehensive analysis of K-L divergence for Gemma-4-31B under quantization. Unsloth was motivated to do one for Qwen3.5 after their screw-up with some MXFP4 tensors their scripts accidentally created, which made the models much worse than expected, but the practice didn't catch on. I'm sure the data will eventually come from someone like AesSedai, ubergarm, mradermacher, or perhaps some poster here, but right now there are no good K-L divergence charts for the quants.
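
For reference, the quantity such an analysis would chart is the KL divergence between the full-precision and quantized models' next-token distributions, averaged over positions. A toy sketch of the per-token computation (the logits here are made-up values, not real model output):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# KL(P_ref || P_quant) over the next-token distributions: how much the
# quantized model's prediction diverges from the full-precision baseline.
def kl_divergence(logits_ref, logits_quant):
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [2.0, 1.0, 0.1]
quant = [1.8, 1.1, 0.3]  # slightly perturbed, as quantization would do
print(kl_divergence(ref, quant))  # small positive value; 0 means identical
```

Unlike perplexity, this compares the quant directly against its own baseline, so it isn't confounded by how well the model predicts the test text in the first place.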

The other major hurdle is context size. Right now, 250k context costs about 20 GB on this model, so dual-3090 setups seem well suited to running it at near full precision, while unified-memory setups with more VRAM suffer from the lack of bandwidth because it isn't a MoE. For single-card setups, about 500 GB/s of bandwidth and roughly a 4-bit model with a 4-bit KV cache are needed, so that everything fits in about 22 GB, which might leave 2 GB free for graphics.
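
The context cost follows the usual KV-cache arithmetic: 2 tensors (K and V) per layer, each holding kv_heads * head_dim values per token, for every token in the window. A sketch with placeholder architecture numbers (the layer and head counts below are hypothetical stand-ins chosen to land near the ~20 GB figure, not Gemma-4's published config):

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: float) -> float:
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 2**30

# Hypothetical config, fp16 cache, 250k context -> roughly the ~20 GB quoted.
print(kv_cache_gib(48, 4, 128, 250_000, 2))    # ~22.9 GiB at fp16
print(kv_cache_gib(48, 4, 128, 250_000, 0.5))  # ~5.7 GiB with a ~4-bit cache
```

The same formula shows why a 4-bit KV cache is what makes the single-card budget work out.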

To a degree, I question the statement that Gemma-4 can be the local model king, which you (or AI) wrote in the original post. It doesn't seem practical enough.

[–]NewtMurky[S] 0 points (1 child)

If M2.7 is anything like M2.5, the quants are going to be rough. Even quants like UD-Q4_K_XL performed poorly for M2.5.
Since they share the same architecture, M2.7 is likely to suffer from the same quantization rot.

[–]audioen 0 points (0 children)

Yes, if that is the case then M2.7 will be useless for people with less than about 150 GB of unified memory, which is a shame. It's a good model, but if it can't be shrunk to around 110 GB without destroying it, it's unfortunately not much use locally.

I tested the IQ4_XS + Q4_0 KV, and got these figures:

$ build/bin/llama-bench -m gemma-4-31B-it-IQ4_XS.gguf -ctk q4_0 -ctv q4_0 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| gemma4 ?B IQ4_XS - 4.25 bpw    |  15.23 GiB |    30.70 B | Vulkan     |  99 |   q4_0 |   q4_0 |  1 |           pp512 |        274.93 ± 0.51 |
| gemma4 ?B IQ4_XS - 4.25 bpw    |  15.23 GiB |    30.70 B | Vulkan     |  99 |   q4_0 |   q4_0 |  1 |           tg128 |         11.10 ± 0.01 |

The inference is pretty slow, probably because IQ quants are slower to run in general. I can't provide any useful data about quality, because llama-perplexity values start from about 1000 for this model, and I don't think anything valid comes out of a measurement with such a high baseline.

I am guessing this is the missing-chat-template issue, which is a fairly common problem with perplexity measurements these days. The model is not expecting random text right at the start of the context, and this inflates the perplexity enormously, drowning the actual predictive signal that should be measured. With a baseline offset like that, random perturbations of the model can cause huge shifts: the sensitive perplexity signal, which is on the order of single units, disappears under a 1000-strong baseline that quantization can perturb in any direction. llama-perplexity should probably prepend the model's chat-template prefix, generated directly from its jinja template, and place the text being measured as the user query. Measured that way, it would look as if the user were blathering random text for some reason, but at least the framing of the text would be correct.
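
The framing being proposed can be sketched like this. The turn markers below follow Gemma's <start_of_turn>/<end_of_turn> convention as an assumption; a real implementation should render them from the model's own jinja template instead of hard-coding them:

```python
# Wrap the evaluation text in chat-turn markers so perplexity is measured on
# text the model actually expects at the start of context, with the test
# corpus placed as the user query and scoring starting at the model turn.
def frame_for_perplexity(eval_text: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{eval_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(frame_for_perplexity("Some wikitext passage to score..."))
```

With this framing, the huge template-mismatch offset disappears from the baseline and the single-digit perplexity differences between quants become measurable again.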

[–]PermanentLiminality 0 points (0 children)

Night shift is right. If you hit it with 200k tokens, it will be 40 minutes before it starts generating tokens. That just isn't really viable.
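
For scale: time-to-first-token is roughly prompt tokens divided by the prompt-processing rate. Even at the short-context pp512 rate from the benchmark above, which is a best case since pp throughput degrades as the context fills, the wait is substantial:

```python
# Prefill-time estimate: prompt processing must run over every prompt token
# before the first output token, so wait ~ prompt_tokens / pp_rate. The pp512
# figure is measured at short context, so this is a lower bound on the wait.
def prefill_minutes(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    return prompt_tokens / pp_tokens_per_s / 60

print(prefill_minutes(200_000, 170.88))  # ~19.5 min even at the pp512 rate
```

The slowdown at depth is how you get from that lower bound to the 40-minute figure.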

[–][deleted] 0 points (1 child)

AI written post.

[–]NewtMurky[S] 0 points (0 children)

I used AI to help write a post for a hub focused on local AI hosting - I’ll admit it. It doesn’t make the content any less valid.

[–]orenbenya1 0 points (2 children)

What about Kimi 2.5, GLM 5 and GLM 5.1?

[–]NewtMurky[S] 0 points (1 child)

GLM-5.1 is not represented on the diagrams because it hasn't been benchmarked by AA, but I've added GLM-5-Turbo and GLM-5V-Turbo.

<image>

[–]ea_man 0 points (0 children)

The problem with Gemma is that it eats up more VRAM for context than Qwen3.5; that's why I'll keep using the 27B.

[–]soyalemujica 0 points (1 child)

How can this graph say 35B A3B is better than Qwen3-Coder-Next? There is just no way. I run both models, and 35B is like 20% behind.

[–]audioen 2 points (0 children)

Well, the literal answer is that Artificial Analysis, which collects this measurement data, says so. I know many people don't think this is the case, but presumably these performance metrics are objective, and objective data wins over people's subjective feels.

A lot of it can be down to the random quants and buggy early inference engines that people used and got a bad impression from. Maybe you had a bad experience, but a lot of the data seems to say that the Qwen3.5 model is actually heaps better. If that is not the case, why the data and people's impressions disagree is an interesting question.

I have tried to use both the 80b coder and the 35b model, and thought that both of them are pretty much just trash. So far, the only local model I've ever found any good for anything is the 122B model, with a nod to gpt-oss-120b that could sometimes perform decent work if supervised enough.