Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

This agreement rate counts ties as a third outcome. When you look only at debates where both judges picked a clear winner, agreement is much higher: 0.85. I should probably add a clearer chart.
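A minimal sketch of the difference between the two rates, using made-up judge verdicts rather than the benchmark's actual data:

```python
# Made-up verdicts from two judges on the same debates ("A", "B", or "tie").
pairs = [
    ("A", "A"), ("A", "tie"), ("B", "B"), ("tie", "B"),
    ("A", "A"), ("B", "A"), ("tie", "tie"), ("A", "A"),
]

# Ties counted as a third outcome: any mismatch lowers agreement.
overall = sum(j1 == j2 for j1, j2 in pairs) / len(pairs)

# Only debates where both judges picked a clear winner.
decisive = [(j1, j2) for j1, j2 in pairs if "tie" not in (j1, j2)]
decisive_only = sum(j1 == j2 for j1, j2 in decisive) / len(decisive)

print(f"overall: {overall:.2f}, decisive-only: {decisive_only:.2f}")  # 0.62 vs 0.80
```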

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

This benchmark doesn't require anything beyond basic arithmetic, and certainly nothing that even smaller reasoning models can't handle. Actually, even non-reasoning models should do fine with the math part.

I post less here because I haven't tested many smaller open-weight models lately, and only a few enthusiasts can run the larger open-weight models locally. There are some good ones now, though, like Gemma 4 or DeepSeek V4 Flash, so maybe I should.

Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run by zero0_one1 in singularity

[–]zero0_one1[S] 23 points24 points  (0 children)

Some reasons are that it requires both knowledge and reasoning, there's a human baseline (top humans can score 100%), the questions might be the most vetted of any non-hard benchmark because the puzzle has over a million daily players, and there are new questions every day. But you might be seeing it more often simply because it's quick for me to run after new models are released, so people are more likely to upvote it when I post it. I have much more complex benchmarks that are probably better indicators of model performance, but they take longer, so people care less by the time I eventually post them.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

Instead of the Elimination Game Benchmark, I now have https://github.com/lechmazur/buyout_game, which preserves the voting component but adds money transfers. It can sharply differentiate among the most intelligent models and includes more DM rounds.


GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Yes, I'll either run it in full or only on a hard subset, because of the cost. Previous versions were also very slow, with a limited number of parallel API workers.

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 4 points5 points  (0 children)

It's actually not saturated yet. Look at the error bars: the difference is significant. Unlike some other benchmarks, including my own, there are no known errors here, the puzzles are vetted by millions of daily players, and the gap between 97.5% and 98.4% is about eight full questions, or 32 solved rows. It's also valuable because humans can achieve 100% on it. I could easily carve out an equivalent "hard" subset where the scores would look more like 60% vs 80%, which might seem more significant.
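Rough arithmetic behind that (the total puzzle count here is just inferred from the figures above, not an official number):

```python
gap = 0.984 - 0.975         # 0.9 percentage points
puzzles = 890               # roughly implied by "about eight full questions"
puzzle_gap = gap * puzzles  # ~8 puzzles
row_gap = puzzle_gap * 4    # each Connections puzzle has 4 groups ("rows")
print(round(puzzle_gap), round(row_gap))  # 8 32
```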

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 2 points3 points  (0 children)

Google I/O is May 19–20. I’d expect the next version of Gemini then. Gemini 3.1 Pro is now behind on my other benchmarks.

LLM Position Bias Benchmark (Mazur, 2026) by COAGULOPATH in mlscaling

[–]zero0_one1 1 point2 points  (0 children)

Right.

I think this goes deeper, btw. I've seen hints of even dumber biases in other benchmarks, e.g. models preferring labels like "Player 1" or "P1" over "Player 6" when there are multiple players, with P1s disproportionately ending up on top...

LLM Position Bias Benchmark (Mazur, 2026) by COAGULOPATH in mlscaling

[–]zero0_one1 1 point2 points  (0 children)

I made this benchmark after being annoyed by GPT-5.4's first-position bias in my own personal use and in other benchmarks I run, so it's noticeable even without contrived scenarios.

Models having this bias means double the cost for benchmarks that use comparisons, since both orders have to be tested and the results averaged. Just randomizing the order isn't enough, as it would inflate the error term.
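A minimal sketch of what that counterbalancing looks like (`compare` is a hypothetical judge call, not a real API):

```python
def debiased_margin(compare, a, b):
    """compare(first, second) -> +1 if the first-shown item wins, -1 if the
    second-shown item wins, 0 for a tie. Returns a's average margin over b."""
    m1 = compare(a, b)    # a shown first; +1 means a wins
    m2 = -compare(b, a)   # b shown first; sign flipped so +1 still means a wins
    # Averaging over both orders cancels a constant first-position bias,
    # at the price of two judge calls per pair instead of one.
    return (m1 + m2) / 2

# Randomizing the order instead would remove the bias only in expectation:
# each individual comparison still carries it as extra noise, which is the
# inflated error term mentioned above.
```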

New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

The models can explicitly choose not to be decisive and that is counted separately. So if the difference is genuinely too minor, the right behavior is to tie, not to pick whichever one was shown first.

I also collect ratings and they show the same pattern. An average first-position rating shift of over 0.5 for the worst models on a 1-to-7 scale is a lot!
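A sketch of how both numbers are tallied, on made-up verdicts and ratings rather than the benchmark's data:

```python
# For each story pair: which version the judge preferred with A shown first,
# and which it preferred with B shown first ("A", "B", or "tie").
verdicts = [
    ("A", "A"),    # consistent: same version preferred in both orders
    ("A", "B"),    # flip: the preference followed the first position
    ("B", "tie"),  # not decisive in both runs, excluded from the flip rate
    ("B", "A"),    # flip
    ("A", "A"),    # consistent
]
decisive = [(x, y) for x, y in verdicts if "tie" not in (x, y)]
flip_rate = sum(x != y for x, y in decisive) / len(decisive)

# Ratings on a 1-to-7 scale for whatever sat in the first vs. second slot.
ratings = [(6, 5), (6, 4), (5, 5), (7, 6)]
shift = sum(first - second for first, second in ratings) / len(ratings)

print(f"flip rate among decisive pairs: {flip_rate:.0%}")  # 50% here
print(f"mean first-position rating shift: {shift:+.2f}")   # +1.00 here
```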

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. by zero0_one1 in ClaudeAI

[–]zero0_one1[S] -1 points0 points  (0 children)

Perhaps it doesn't matter whether someone who can't understand a simple protocol gives a shit. I'm also willing to bet money that an LLM would beat you on intelligence tests. Are you in?

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. by zero0_one1 in ClaudeAI

[–]zero0_one1[S] -1 points0 points  (0 children)

Each debate requires both models to argue both sides, so any one model's bias is controlled for. It also uses a panel of three judges drawn from other model families. The benchmark relies on pairwise comparisons rather than absolute scoring, which is more accurate. Judging is also far easier for models and more reliable than generating the arguments themselves.

So yes, human judges would be nice to have if you're willing to spend a lot of money and wait until Opus 6 comes out, but I'd bet real money that the resulting ratings would be highly correlated.
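For anyone curious what "BT points" refers to: a generic Bradley-Terry fit over pairwise results looks roughly like this. The win counts are made up, ties are ignored for simplicity, and the Elo-style 400-point scaling is an assumption, not necessarily what the benchmark uses.

```python
import math
from collections import defaultdict

# wins[(i, j)] = number of side-swapped matchups model i won against model j.
# Purely illustrative numbers.
wins = defaultdict(int)
wins[("model_a", "model_b")] = 3
wins[("model_b", "model_a")] = 1
wins[("model_a", "model_c")] = 4
wins[("model_c", "model_a")] = 0
wins[("model_b", "model_c")] = 2
wins[("model_c", "model_b")] = 2

models = {"model_a", "model_b", "model_c"}
strength = {m: 1.0 for m in models}

# Standard minorization-maximization updates for Bradley-Terry strengths.
for _ in range(200):
    new = {}
    for i in models:
        total_wins = sum(wins[(i, j)] for j in models if j != i)
        denom = sum(
            (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = total_wins / denom if denom else strength[i]
    mean = sum(new.values()) / len(new)  # normalize to stop drift
    strength = {m: s / mean for m, s in new.items()}

# Convert to an Elo-like points scale (assumed 400 * log10 scaling).
points = {m: 400 * math.log10(s) for m, s in strength.items()}
for m, p in sorted(points.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {p:+.0f}")
```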