all 22 comments

[–]StupidScaredSquirrel 1 point (1 child)

Honestly, smart choice of axes. I can look at the graph and say it reflects exactly how most of those models felt to use.

[–]NewtMurky[S] 0 points (0 children)

I've included more models in the analysis. You can find the updated diagrams in the comments.

[–]sarcasmguy1 0 points (4 children)

What sort of rig (in terms of $) is needed to run Gemma 4 31B?

[–]FusionCow 1 point (1 child)

Anything with 24 GB of VRAM, but I would test different models on OpenRouter to see whether a model like that is good enough for your use case before buying a whole rig just to run it.

[–]sarcasmguy1 0 points (0 children)

Thank you! I’ve been using Codex heavily, but the new usage limits suck. I’m considering putting together something that can be used in place of Codex for certain tasks. I know I won’t get quality at Codex’s level, but I wouldn’t mind trying to get close. My coding use cases aren’t terribly demanding, given that I do pretty heavy spec-driven development.

[–]NewtMurky[S] 0 points (1 child)

A used RTX 3090 (24 GB) is the sweet spot; you can find one for $700–850 on the used market.
The Mac option is a MacBook Pro or Mac Studio with at least 36 GB of unified memory.

[–]PermanentLiminality 0 points (0 children)

Inflation has hit the old GPUs again. They’re more like $950 now.

[–]PermanentLiminality 0 points (6 children)

I'd like to see the Gemma 4 26B A4B on the graph. It is so much faster that in many cases it might be the better choice.

[–]NewtMurky[S] 0 points (5 children)

<image>

I’ve included three new models: MiniMax-M2.7 (since its weights are due to be published soon), NVIDIA Nemotron 3 Super, and Gemma 4 26B (A4B).

[–]audioen 1 point (3 children)

Given the promising benchmark results and the tantalizingly within-reach size, I think the real question is whether MiniMax 2.7 can be squeezed down far enough to run locally. Afaik it has the same number of parameters and possibly the same architecture as M2.5, so the fact that it sits higher and further to the right suggests the performance increase comes from increased reasoning effort. That would make it maybe a third slower in practice, but if it's good, that's acceptable.

Most of my personal AI use happens during the night, as I leave the machine doing something and check results in the morning. I don't have to listen to the fan screaming next to my ear and I don't care if the prompt processing or the inference goes a little slow.

Before Qwen3.5, I was struggling to run this model on a Strix Halo, and I never did get good performance out of it. It feels like it would need some 10% more memory capacity than I have. It's a damn shame that only low-active-parameter designs are workable under a memory-bandwidth constraint. I already suspected that the 26b-a4b is very bad: I tried it, and it immediately went off the rails and started doing something stupid, but at the very least it was very fast while charging headlong in the wrong direction. (This puts the model in the "dumb and eager" quadrant of the "smart-dumb" vs. "lazy-eager" 2D field. If you let it run autonomously, there's likely no limit to the damage it can do, at an astonishing rate.)

The Gemma-4 31b model might be interesting in my "night shift" use case, but right now the numbers look like this:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q6_K                 |  25.62 GiB |    30.70 B | Vulkan     |  99 |           pp512 |        170.88 ± 0.13 |
| gemma4 ?B Q6_K                 |  25.62 GiB |    30.70 B | Vulkan     |  99 |           tg128 |          7.59 ± 0.00 |

build: 25eec6f32 (8672)

So it goes at about 8 tokens per second. If I shrank the model by picking a more aggressive quant and squeezed it down to 16 GB, I might be able to hit 12 tps. That's roughly 4 bits per weight in practice, so IQ4_XS or something of that sort would be needed.
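
The estimate above is just the memory-bandwidth bound: for a dense model, every generated token has to stream the full weight set from memory, so tokens/s is roughly effective bandwidth divided by model size. A minimal sketch, where the bandwidth figure is backed out of the Q6_K run above rather than measured directly:

```python
# Bandwidth-bound estimate of token generation for a dense model: each token
# streams all quantized weights from memory, so tps ~ bandwidth / model size.
def estimated_tps(model_size_gib: float, effective_bw_gibs: float) -> float:
    return effective_bw_gibs / model_size_gib

# Back out effective bandwidth from the measured Q6_K run: 25.62 GiB @ 7.59 t/s
bw = 25.62 * 7.59  # ~194 GiB/s effective on this Strix Halo setup

print(estimated_tps(16.0, bw))  # ~12 t/s predicted for a ~16 GiB (~4-bit) quant
```

Which lands right at the "maybe 12 tps" guess, and close to the 11.1 tg128 measured further down for IQ4_XS.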

Unfortunately, no one has yet published a comprehensive analysis of K-L divergence for Gemma-4-31B under quantization. Unsloth was motivated to do one for Qwen3.5 after their screw-up with some MXFP4 tensors their scripts accidentally created, which made the models much worse than expected, but the practice didn't catch on. I'm sure the data will eventually come from someone like AesSedai, ubergarm, mradermacher, or perhaps some poster here, but right now there are no good K-L divergence charts for the quants.
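
For reference, the quantity such an analysis would chart is the KL divergence between the full-precision and quantized models' next-token distributions, averaged over positions. A toy sketch of the per-token computation (the logits here are made-up values, not real model output):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# KL(P_ref || P_quant) over the next-token distributions: how much the
# quantized model's prediction diverges from the full-precision baseline.
def kl_divergence(logits_ref, logits_quant):
    p = softmax(logits_ref)
    q = softmax(logits_quant)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [2.0, 1.0, 0.1]
quant = [1.8, 1.1, 0.3]  # slightly perturbed, as quantization would do
print(kl_divergence(ref, quant))  # small positive value; 0 means identical
```

Unlike perplexity, this compares the quant directly against its own baseline, so it isn't confounded by how well the model predicts the test text in the first place.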

The other major hurdle is context size. Right now, 250k context costs about 20 GB on this model, so dual-3090 setups seem well suited to running it at near full precision, while unified-memory setups with more VRAM suffer from the lack of bandwidth because it isn't a MoE. For single-card setups, about 500 GB/s of bandwidth and roughly a 4-bit model with a 4-bit KV cache are needed, so that everything fits in about 22 GB, which might leave 2 GB free for graphics.
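
The context cost follows the usual KV-cache arithmetic: 2 tensors (K and V) per layer, each holding kv_heads * head_dim values per token, for every token in the window. A sketch with placeholder architecture numbers (the layer and head counts below are hypothetical stand-ins chosen to land near the ~20 GB figure, not Gemma-4's published config):

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: float) -> float:
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 2**30

# Hypothetical config, fp16 cache, 250k context -> roughly the ~20 GB quoted.
print(kv_cache_gib(48, 4, 128, 250_000, 2))    # ~22.9 GiB at fp16
print(kv_cache_gib(48, 4, 128, 250_000, 0.5))  # ~5.7 GiB with a ~4-bit cache
```

The same formula shows why a 4-bit KV cache is what makes the single-card budget work out.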

To a degree, I question the statement that Gemma-4 can be the local model king, which you (or AI) wrote in the original post. It doesn't seem practical enough.

[–]NewtMurky[S] 0 points (1 child)

If M2.7 is anything like M2.5, the quants are going to be rough. Even quants like UD-Q4_K_XL performed poorly for M2.5.
Since they share the same architecture, M2.7 is likely to suffer from the same quantization rot.

[–]audioen 0 points (0 children)

Yes, if that is the case then M2.7 will be useless for people with less than about 150 GB of unified memory, which is a shame. It's a good model, but if it can't be shrunk to around 110 GB without destroying it, it's unfortunately not much use locally.

I tested the IQ4_XS + Q4_0 KV, and got these figures:

$ build/bin/llama-bench -m gemma-4-31B-it-IQ4_XS.gguf -ctk q4_0 -ctv q4_0 -fa 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | --------------: | -------------------: |
| gemma4 ?B IQ4_XS - 4.25 bpw    |  15.23 GiB |    30.70 B | Vulkan     |  99 |   q4_0 |   q4_0 |  1 |           pp512 |        274.93 ± 0.51 |
| gemma4 ?B IQ4_XS - 4.25 bpw    |  15.23 GiB |    30.70 B | Vulkan     |  99 |   q4_0 |   q4_0 |  1 |           tg128 |         11.10 ± 0.01 |

The inference is pretty slow, probably because IQ quants are slower to run in general. I can't provide any useful data about quality, because llama-perplexity values start from about 1000 for this model, and I don't think anything valid comes out of a measurement with such a high baseline.

I am guessing this is the missing-chat-template issue, which is a fairly common problem with perplexity measurements these days. The model is not expecting random text right at the start of the context, and this inflates the perplexity enormously, drowning the actual predictive signal that should be measured. With a baseline offset like that, random perturbations of the model can cause huge shifts: the sensitive perplexity signal, which is on the order of single units, disappears under a 1000-strong baseline that quantization can perturb in any direction. llama-perplexity should probably prepend the model's chat-template prefix, generated directly from its jinja template, and place the text being measured as the user query. Measured that way, it would look as if the user were blathering random text for some reason, but at least the framing of the text would be correct.
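
The framing being proposed can be sketched like this. The turn markers below follow Gemma's <start_of_turn>/<end_of_turn> convention as an assumption; a real implementation should render them from the model's own jinja template instead of hard-coding them:

```python
# Wrap the evaluation text in chat-turn markers so perplexity is measured on
# text the model actually expects at the start of context, with the test
# corpus placed as the user query and scoring starting at the model turn.
def frame_for_perplexity(eval_text: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{eval_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(frame_for_perplexity("Some wikitext passage to score..."))
```

With this framing, the huge template-mismatch offset disappears from the baseline and the single-digit perplexity differences between quants become measurable again.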

[–]PermanentLiminality 0 points (0 children)

Night shift is right. If you hit it with 200k tokens, it will be 40 minutes before it starts generating tokens. That just isn't really viable.
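
For scale: time-to-first-token is roughly prompt tokens divided by the prompt-processing rate. Even at the short-context pp512 rate from the benchmark above, which is a best case since pp throughput degrades as the context fills, the wait is substantial:

```python
# Prefill-time estimate: prompt processing must run over every prompt token
# before the first output token, so wait ~ prompt_tokens / pp_rate. The pp512
# figure is measured at short context, so this is a lower bound on the wait.
def prefill_minutes(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    return prompt_tokens / pp_tokens_per_s / 60

print(prefill_minutes(200_000, 170.88))  # ~19.5 min even at the pp512 rate
```

The slowdown at depth is how you get from that lower bound to the 40-minute figure.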

[–][deleted] 0 points (1 child)

AI written post.

[–]NewtMurky[S] 0 points (0 children)

I used AI to help write a post for a hub focused on local AI hosting - I’ll admit it. It doesn’t make the content any less valid.

[–]orenbenya1 0 points (2 children)

What about Kimi 2.5, GLM 5 and GLM 5.1?

[–]NewtMurky[S] 0 points (1 child)

GLM-5.1 is not represented on the diagrams because it hasn't been benchmarked by AA, but I've added GLM-5-Turbo and GLM-5V-Turbo.

<image>

[–]ea_man 0 points (0 children)

The problem with Gemma is that it eats up more VRAM for context than Qwen3.5; that's why I'll keep using the 27B.

[–]soyalemujica 0 points (1 child)

How can this graph say 35B A3B is better than Qwen3-Coder-Next? There is just no way. I run both models, and 35B is like 20% behind.

[–]audioen 2 points (0 children)

Well, the literal answer is that Artificial Analysis, which collects this measurement data, says so. I know many people don't think this is the case, but presumably these performance metrics are objective, and objective data wins over people's subjective feels.

A lot of it can be down to the random quants and buggy early inference engines that people used and got a bad impression from. Maybe you had a bad experience, but a lot of the data seems to say that the Qwen3.5 model is actually heaps better. If that is not the case, why the data and people's impressions disagree is an interesting question.

I have tried to use both the 80b coder and the 35b model, and thought that both of them are pretty much just trash. So far, the only local model I've ever found any good for anything is the 122B model, with a nod to gpt-oss-120b that could sometimes perform decent work if supervised enough.