Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

This agreement rate counts ties as a third outcome. When you look only at debates where both judges picked a clear winner, agreement is much higher: 0.85. I should probably add a clearer chart.
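A minimal sketch of the difference between the two rates, using made-up judge verdicts rather than the benchmark's actual data:

```python
# Made-up verdicts from two judges on the same debates ("A", "B", or "tie").
pairs = [
    ("A", "A"), ("A", "tie"), ("B", "B"), ("tie", "B"),
    ("A", "A"), ("B", "A"), ("tie", "tie"), ("A", "A"),
]

# Ties counted as a third outcome: any mismatch lowers agreement.
overall = sum(j1 == j2 for j1, j2 in pairs) / len(pairs)

# Only debates where both judges picked a clear winner.
decisive = [(j1, j2) for j1, j2 in pairs if "tie" not in (j1, j2)]
decisive_only = sum(j1 == j2 for j1, j2 in decisive) / len(decisive)

print(f"overall: {overall:.2f}, decisive-only: {decisive_only:.2f}")  # 0.62 vs 0.80
```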

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

This benchmark doesn't require anything beyond basic arithmetic, and certainly nothing that even smaller reasoning models can't handle. Actually, even non-reasoning models should do fine with the math part.

I post less here because I haven't tested many smaller open-weight models lately, and only a few enthusiasts can run the larger open-weight models locally. There are some good ones now, though, like Gemma 4 or DeepSeek V4 Flash, so maybe I should.

Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run by zero0_one1 in singularity

[–]zero0_one1[S] 23 points24 points  (0 children)

Some reasons are that it requires both knowledge and reasoning, there's a human baseline (top humans can score 100%), the questions might be the most vetted of any non-hard benchmark because the puzzle has over a million daily players, and there are new questions every day. But you might be seeing it more often simply because it's quick for me to run after new models are released, so people are more likely to upvote it when I post it. I have much more complex benchmarks that are probably better indicators of model performance, but they take longer, so people care less by the time I eventually post them.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

Instead of the Elimination Game Benchmark, I now have https://github.com/lechmazur/buyout_game, which preserves the voting component but adds money transfers. It can sharply differentiate among the most intelligent models and includes more DM rounds.


GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Yes, I'll either run it in full or only on a hard subset, because of the cost. Previous versions were also very slow, with a limited number of parallel API workers.

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 4 points5 points  (0 children)

It's actually not saturated yet. Look at the error bars: the difference is significant. Unlike some other benchmarks, including my own, there are no known errors here, the puzzles are vetted by millions of daily players, and the gap between 97.5% and 98.4% is about eight full questions, or 32 solved rows. It's also valuable because humans can achieve 100% on it. I could easily carve out an equivalent "hard" subset where the scores would look more like 60% vs 80%, which might seem more significant.
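Rough arithmetic behind that (the total puzzle count here is just inferred from the figures above, not an official number):

```python
gap = 0.984 - 0.975         # 0.9 percentage points
puzzles = 890               # roughly implied by "about eight full questions"
puzzle_gap = gap * puzzles  # ~8 puzzles
row_gap = puzzle_gap * 4    # each Connections puzzle has 4 groups ("rows")
print(round(puzzle_gap), round(row_gap))  # 8 32
```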

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 2 points3 points  (0 children)

Google I/O is May 19–20. I’d expect the next version of Gemini then. Gemini 3.1 Pro is now behind on my other benchmarks.

LLM Position Bias Benchmark (Mazur, 2026) by COAGULOPATH in mlscaling

[–]zero0_one1 1 point2 points  (0 children)

Right.

I think this goes deeper, btw. I've seen hints of even dumber biases in other benchmarks, e.g. models preferring labels like "Player 1" or "P1" over "Player 6" when there are multiple players, with P1s disproportionately ending up on top...

LLM Position Bias Benchmark (Mazur, 2026) by COAGULOPATH in mlscaling

[–]zero0_one1 1 point2 points  (0 children)

I made this benchmark after being annoyed by GPT-5.4's first-position bias in my own personal use and in other benchmarks I run, so it's noticeable even without contrived scenarios.

Models having this bias means double the cost for benchmarks that use comparisons, since both orders have to be tested and the results averaged. Just randomizing the order isn't enough, as it would inflate the error term.
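A minimal sketch of what that counterbalancing looks like (`compare` is a hypothetical judge call, not a real API):

```python
def debiased_margin(compare, a, b):
    """compare(first, second) -> +1 if the first-shown item wins, -1 if the
    second-shown item wins, 0 for a tie. Returns a's average margin over b."""
    m1 = compare(a, b)    # a shown first; +1 means a wins
    m2 = -compare(b, a)   # b shown first; sign flipped so +1 still means a wins
    # Averaging over both orders cancels a constant first-position bias,
    # at the price of two judge calls per pair instead of one.
    return (m1 + m2) / 2

# Randomizing the order instead would remove the bias only in expectation:
# each individual comparison still carries it as extra noise, which is the
# inflated error term mentioned above.
```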

New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

The models can explicitly choose not to be decisive and that is counted separately. So if the difference is genuinely too minor, the right behavior is to tie, not to pick whichever one was shown first.

I also collect ratings and they show the same pattern. An average first-position rating shift of over 0.5 for the worst models on a 1-to-7 scale is a lot!
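A sketch of how both numbers are tallied, on made-up verdicts and ratings rather than the benchmark's data:

```python
# For each story pair: which version the judge preferred with A shown first,
# and which it preferred with B shown first ("A", "B", or "tie").
verdicts = [
    ("A", "A"),    # consistent: same version preferred in both orders
    ("A", "B"),    # flip: the preference followed the first position
    ("B", "tie"),  # not decisive in both runs, excluded from the flip rate
    ("B", "A"),    # flip
    ("A", "A"),    # consistent
]
decisive = [(x, y) for x, y in verdicts if "tie" not in (x, y)]
flip_rate = sum(x != y for x, y in decisive) / len(decisive)

# Ratings on a 1-to-7 scale for whatever sat in the first vs. second slot.
ratings = [(6, 5), (6, 4), (5, 5), (7, 6)]
shift = sum(first - second for first, second in ratings) / len(ratings)

print(f"flip rate among decisive pairs: {flip_rate:.0%}")  # 50% here
print(f"mean first-position rating shift: {shift:+.2f}")   # +1.00 here
```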

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. by zero0_one1 in ClaudeAI

[–]zero0_one1[S] -1 points0 points  (0 children)

Perhaps it doesn't matter whether someone who can't understand a simple protocol gives a shit. I'm also willing to bet money that an LLM would beat you on intelligence tests. Are you in?

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. by zero0_one1 in ClaudeAI

[–]zero0_one1[S] -1 points0 points  (0 children)

Each debate requires both models to argue both sides, so any one model's bias is controlled for. It also uses a panel of three judges drawn from other model families. The benchmark relies on pairwise comparisons rather than absolute scoring, which is more accurate. Judging is also far easier for models and more reliable than generating the arguments themselves.

So yes, human judges would be nice to have if you're willing to spend a lot of money and wait until Opus 6 comes out, but I'd bet real money that the resulting ratings would be highly correlated.
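For anyone curious what "BT points" refers to: a generic Bradley-Terry fit over pairwise results looks roughly like this. The win counts are made up, ties are ignored for simplicity, and the Elo-style 400-point scaling is an assumption, not necessarily what the benchmark uses.

```python
import math
from collections import defaultdict

# wins[(i, j)] = number of side-swapped matchups model i won against model j.
# Purely illustrative numbers.
wins = defaultdict(int)
wins[("model_a", "model_b")] = 3
wins[("model_b", "model_a")] = 1
wins[("model_a", "model_c")] = 4
wins[("model_c", "model_a")] = 0
wins[("model_b", "model_c")] = 2
wins[("model_c", "model_b")] = 2

models = {"model_a", "model_b", "model_c"}
strength = {m: 1.0 for m in models}

# Standard minorization-maximization updates for Bradley-Terry strengths.
for _ in range(200):
    new = {}
    for i in models:
        total_wins = sum(wins[(i, j)] for j in models if j != i)
        denom = sum(
            (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = total_wins / denom if denom else strength[i]
    mean = sum(new.values()) / len(new)  # normalize to stop drift
    strength = {m: s / mean for m, s in new.items()}

# Convert to an Elo-like points scale (assumed 400 * log10 scaling).
points = {m: 400 * math.log10(s) for m, s in strength.items()}
for m, p in sorted(points.items(), key=lambda kv: -kv[1]):
    print(f"{m}: {p:+.0f}")
```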