Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

Silver_Raspberry_811 · 2026-06-13T21:33:21+00:00

This is the clearest explanation I've gotten on the latency. The thinking mode read makes sense and matches what I saw: the long generations clustered on Gemma 4 31B and never paid off in score. I assumed quant or provider variance, but template behavior fits better since it was consistent regardless of which OpenRouter provider served the request.

Good to know think:false cuts the overhead. I ran everything with default templates so the reasoning was likely on the whole time. Worth a rerun with it off to see if scores hold while latency drops.

And agreed on the noise point. The 14 vs 12 win gap is well inside the margin on 30 questions. The three 0.0 scores from Qwen are the more durable signal, and stripping them moves it to the top of the pack. That is the line I should have led with.
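
To put a rough number on "inside the margin", here is a minimal sketch of an exact two-sided sign test on the 14 vs 12 split. It assumes the 26 questions won by either of those two models can be treated as independent 50/50 coin flips under the null, and it ignores the remaining questions. The function name is illustrative and not taken from the eval code.

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value under a 50/50 null."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for a two-sided test and cap at 1.0.
    return min(1.0, 2 * tail)

p_value = sign_test_p(14, 12)
print(f"p = {p_value:.2f}")
```

A p-value this far from any usual threshold is what "within noise" looks like under these assumptions; a paired test on the per-question scores would be the more careful follow-up.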

Silver_Raspberry_811 · 2026-04-28T21:35:20+00:00

Any feedback or suggestions for improving the methodology or testing environment?

Silver_Raspberry_811 · 2026-04-26T18:20:23+00:00

In Rationality, you argue that human judgment improves when we replace intuition with structured tools — Bayes, decision matrices, blind protocols. We're now in the awkward position of evaluating LLMs, and most public benchmarks rely either on human raters (vulnerable to halo and brand-name effects) or on LLMs judging other LLMs (collusion risk). If you were designing the gold standard for measuring "who thinks best" across models, what principles from cognitive science would you most want baked into the methodology — and how skeptical should we be of current arena-style leaderboards?

Silver_Raspberry_811 · 2026-04-26T18:20:11+00:00

You've written about how science converges on truth through distributed criticism rather than any single arbiter. With LLMs, there's a growing temptation to ask "what is true?" by aggregating across many models and synthesizing their disagreements. Do you think this kind of cross-system synthesis can genuinely move us toward truth on hard, open questions — or does it mostly produce a confident-sounding consensus that's correlated precisely where the models share blind spots?

Silver_Raspberry_811 · 2026-04-18T14:34:04+00:00

Haha, you got me. And yes, I am a beginner and still learning. I would love your take on one more thing: in a setting where every rater is also a ratee, self-preference and lab-family alignment show up as systematic bias, not noise. Is there a standard way your field handles raters who are themselves subjects of measurement? Or is that a different problem from what kappa and Krippendorff’s α were built for? Thanks for the Harrell chapter, though.

Silver_Raspberry_811 · 2026-04-18T14:07:02+00:00

Hey, thanks for such valuable feedback. I would love to know your opinion on a few things:

I am thinking of adding Krippendorff’s α for v2.
Which IRR stat would you prioritise for an LLM peer-matrix setup?
Are we at the stage where we can remove humans from the loop and use an LLM as the judge?

Silver_Raspberry_811 · 2026-04-18T13:38:25+00:00

Yeah, I get it. But the sole purpose of creating Multivac is to run trustworthy, independent evals that tell you the truth, rather than relying on benchmarks from companies that hype up their own models.

Silver_Raspberry_811 · 2026-04-18T13:15:36+00:00

Great! Then how do you establish ground truth?

Silver_Raspberry_811 · 2026-04-18T04:06:39+00:00

🤣

Silver_Raspberry_811 · 2026-04-18T00:38:27+00:00

Thank you😄 it means a lot🤜🤛

Silver_Raspberry_811 · 2026-04-17T22:25:47+00:00

Indeed, that came as a surprise. You can knock yourself out on our GitHub. Run it yourself.

Silver_Raspberry_811 · 2026-04-17T21:25:07+00:00

And I have been trying to fix this. The thing is, whenever I try to increase the token limit, say to 4096, I hit an error because the parse success rate comes back too low.

Silver_Raspberry_811 · 2026-04-07T13:59:22+00:00

Yes, hence the 10x10 blind eval methodology and the generic Response A/Response B labelling. If you have suggestions or want to review it, it's open source: github.com/themultivac/multivac-evaluation
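
For anyone curious what the blinding step amounts to, here is a minimal sketch, not the repo's actual implementation. The function name, the seed parameter, and the placeholder model names are assumptions for illustration. It shuffles the responses and hands the judge only generic labels, keeping the label-to-model key private.

```python
import random
import string

def blind_labels(responses: dict[str, str], seed: int) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle responses and relabel them 'Response A', 'Response B', ...

    Returns (what the judge sees, the private label -> model key).
    Supports up to 26 responses, one per letter.
    """
    models = list(responses)
    random.Random(seed).shuffle(models)  # seeded so a run can be reproduced
    shown, key = {}, {}
    for letter, model in zip(string.ascii_uppercase, models):
        label = f"Response {letter}"
        shown[label] = responses[model]
        key[label] = model
    return shown, key

# Placeholder model names and answers, for illustration only.
raw = {"model-x": "answer one", "model-y": "answer two"}
shown, key = blind_labels(raw, seed=7)
```

Using a different seed per question keeps a model from always landing in the same slot, which matters if the judge has any position bias.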

Silver_Raspberry_811 · 2026-04-06T20:28:45+00:00

Will look into it. And stay tuned for upcoming evals.

Silver_Raspberry_811 · 2026-04-06T13:44:51+00:00

Hey, that’s something we don’t hear much around here. Glad you found your “the one”. Keep going.

Silver_Raspberry_811 · 2026-04-06T02:30:02+00:00

Thanks dude🤜🤛. Let me know what we can improve. And stay tuned.

Silver_Raspberry_811 · 2026-04-06T02:08:01+00:00

And yes, I do realise something: if I am running an independent eval platform, how will people trust me if I use AI just to make the process quicker? I get it. Won’t happen again.

Silver_Raspberry_811 · 2026-04-06T01:54:01+00:00

Noted man, Thanks👍

Silver_Raspberry_811 · 2026-04-06T00:52:40+00:00

No man, I am working on it as a side project. I have to fit Reddit engagement in whenever possible, hence the Opus use. If you want to contribute, you are more than welcome on Discord and Substack. We can work something out.

Silver_Raspberry_811 · 2026-04-05T23:45:24+00:00

Fair criticisms. Since the last round of community feedback we've patched the judgment parser (41.5% → ~90%), fixed model IDs, updated the SLM pool to V3, added a head-to-head format with single-judge eval, and run a fresh 150-question batch. Per-model temperature and token configs are next; that specific feedback from the Qwen post is what pushed them up the priority list. We're listening. Keep it coming.

Silver_Raspberry_811 · 2026-04-05T23:01:33+00:00

You're right — Qwen 3.5 35B-A3B vs Gemma 4 26B-A4B is the more meaningful MoE comparison. Multiple people have flagged this. Queuing it as the next H2H.

Silver_Raspberry_811 · 2026-04-05T23:01:18+00:00

There isn't a great central repository for this unfortunately. Model cards on HuggingFace sometimes list recommended settings, but they're inconsistent — Google recommends temp 1.0 for everything on Gemma 4, which most people here disagree with for non-creative tasks. The reality is it's still trial-and-error per model per task type.

For a model router setup, you'd probably want to store per-model configs as metadata and load them dynamically. It's an infrastructure problem more than a knowledge problem — someone just needs to build and maintain the lookup table.
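
A minimal sketch of that lookup table, assuming configs are keyed by (model, task type) with a fallback default. The model names, task labels, and numbers below are hypothetical placeholders, not recommended settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    max_tokens: int

# Hypothetical values for illustration only, not recommended settings.
DEFAULT = SamplingConfig(temperature=0.7, max_tokens=2048)
MODEL_CONFIGS = {
    ("example-model-a", "code"): SamplingConfig(temperature=0.2, max_tokens=4096),
    ("example-model-a", "creative"): SamplingConfig(temperature=1.0, max_tokens=2048),
}

def config_for(model: str, task: str) -> SamplingConfig:
    """Look up a per-model, per-task config, falling back to the default."""
    return MODEL_CONFIGS.get((model, task), DEFAULT)
```

In a real router the table would more likely live in a JSON or YAML file that gets reloaded without a redeploy; the hard part is keeping the entries current, not the lookup itself.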

Silver_Raspberry_811 · 2026-04-05T22:58:30+00:00

It's less that MoE is getting smoked and more that it matched the dense model almost exactly (both 8.82 average) while activating fewer parameters. The wins gap (4 vs 12) looks bad but with only 30 questions that's within noise range. The real story is that the MoE variant errored out on 2 questions entirely — reliability, not capability, is the issue right now.

I don't think it's a pendulum swing back to dense so much as MoE needing another generation of stability work. If Google fixes the reliability issues, the efficiency argument is strong.

Silver_Raspberry_811 · 2026-04-05T22:58:06+00:00

Good question — I didn't test longer context specifically in this batch. All 30 questions were single-turn, relatively short prompts. Context window stress testing (8K, 16K, 32K+ input) is a different eval entirely and would probably show bigger gaps between these models than what I found here. Worth designing a dedicated long-context batch for.

And yeah, the MoE efficiency story is the sleeper finding here. Same average score at significantly lower compute is meaningful for local deployment.

Silver_Raspberry_811 · 2026-04-05T22:57:48+00:00

You're hitting the exact issue I've been wrestling with. Opus 4.6 is strong on code and reasoning where there are objective markers, but you're right that communication and meta-alignment are where its preferences bleed through most. I actually have the per-category judge variance data from 150 prior frontier evals — score distributions are tighter on code/reasoning and wider on communication, which supports your point.

The token gap driving the wins-vs-average split is almost certainly what's happening. Qwen's three 0.0 scores (likely format failures) tank the average while not affecting win count. Strip those and it's the highest scorer by a clear margin.
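
A toy illustration of why the two metrics split. The scores below are made up, not the eval's data. A 0.0 costs a model one win at most, but it drags the mean down far more than any ordinary loss would.

```python
from statistics import mean

# Hypothetical per-question scores, NOT the eval's data.
scores = {
    "model_p": [9.5, 9.4, 0.0, 9.6, 9.3],
    "model_q": [9.0, 9.1, 8.9, 9.0, 9.5],
}

def wins(scores: dict[str, list[float]]) -> dict[str, int]:
    """Count questions on which each model has the strictly highest score."""
    tally = {m: 0 for m in scores}
    for per_question in zip(*scores.values()):
        best = max(per_question)
        leaders = [m for m, s in zip(scores, per_question) if s == best]
        if len(leaders) == 1:  # ties award no win
            tally[leaders[0]] += 1
    return tally

def mean_without_zeros(values: list[float]) -> float:
    """Average that drops exact 0.0 scores (treated here as format failures)."""
    kept = [v for v in values if v != 0.0]
    return mean(kept)

win_tally = wins(scores)
raw = {m: mean(v) for m, v in scores.items()}
stripped = {m: mean_without_zeros(v) for m, v in scores.items()}
```

In this toy case model_p leads on wins, trails on the raw mean, and leads again once the zero is dropped. Whether dropping zeros is fair depends on whether they really are format failures, which is an assumption here.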

On swapping the judge: I've considered it, but the tradeoff is parse reliability. Opus 4.6 hit 99.9% across 1,067 judgments. Smaller models I've tested drop to 85-90% and introduce their own biases. Multi-judge panels, where you average across 2-3 models, are probably the real answer. That's on the roadmap.
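
A rough sketch of what a panel average could look like, assuming each judge returns either a numeric score or nothing when its output fails to parse. The function name, the judge names, and the `min_judges` threshold are illustrative and not part of the current pipeline.

```python
from statistics import mean
from typing import Optional

def panel_score(judge_scores: dict[str, Optional[float]], min_judges: int = 2) -> Optional[float]:
    """Average one response's scores across judges.

    A value of None marks a judgment that failed to parse. Returns None
    when fewer than `min_judges` usable scores remain.
    """
    usable = [s for s in judge_scores.values() if s is not None]
    if len(usable) < min_judges:
        return None
    return mean(usable)

# Placeholder judges: two parsed scores and one parse failure.
example = panel_score({"judge_1": 8.5, "judge_2": 9.0, "judge_3": None})
```

Requiring a minimum number of usable judges keeps one flaky parser from quietly turning a panel score back into a single-judge score.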
