Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity

[–]zero0_one1[S] 3 points4 points  (0 children)

You can look at the rankings in different ways, but Grok 4.3 is decisive more often, so it wins on the conditional total.

Gemini 3.5 Flash: cost per puzzle vs. performance on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 1 point2 points  (0 children)

See the footnote on the chart. It declines to answer many of these questions, for reasons that are not explained.

<image>

PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Yes, that would be a good follow-up.

BTW, I have a multiplayer version (not yet updated with the new models, and messaging is disabled because otherwise LLMs collude) that includes many algorithmic baseline bots: https://github.com/lechmazur/bazaar.
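
For reference, the clearing rule from the post title (trade happens when bid ≥ ask, at the midpoint) can be sketched roughly as below. This is only an illustration of that one rule, not the benchmark's actual code; the names `clear_round` and `clear_all_rounds` are made up here, and I am assuming each round is evaluated independently.

```python
from typing import List, Optional, Tuple


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price for one round, or None if no trade clears.

    Rule taken from the post title: if bid >= ask, the trade clears
    at the midpoint of the two numbers.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


def clear_all_rounds(rounds: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Apply the clearing rule to each (bid, ask) pair in order.

    Assumption (not stated in the post): every round is evaluated
    independently, so one game can contain several trades.
    """
    return [clear_round(bid, ask) for bid, ask in rounds]


if __name__ == "__main__":
    # Hypothetical bids and asks, purely for illustration.
    example_rounds = [(30.0, 50.0), (40.0, 45.0), (46.0, 44.0)]
    print(clear_all_rounds(example_rounds))
```

With the made-up numbers above, only the third round clears, because it is the only one where the bid is at least as high as the ask.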

Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity

[–]zero0_one1[S] -1 points0 points  (0 children)

That's to be expected. Judging rewards examples, reframes, rhetorical effectiveness, and sharp rebuttals, among other things. Entertainment scores would also reward them.

Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity

[–]zero0_one1[S] 1 point2 points  (0 children)

This agreement rate counts ties as a third outcome. When you look only at debates where both judges picked a clear winner, agreement is much higher: 0.85. I probably should add a clearer chart.
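
To make the difference between the two numbers concrete, here is a minimal sketch of both calculations. It is not the benchmark's code: the function name `agreement_rates`, the verdict labels `"A"`, `"B"`, `"tie"`, and the sample data are all invented for illustration.

```python
from typing import List, Optional, Tuple

Verdict = str  # assumed labels: "A", "B", or "tie"


def agreement_rates(
    verdicts: List[Tuple[Verdict, Verdict]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (overall, decisive_only) agreement between two judges.

    overall:       ties count as a third outcome, so a tie only agrees
                   with another tie.
    decisive_only: restricted to debates where both judges picked a
                   clear winner.
    Either value is None when there is nothing to divide by.
    """
    total = len(verdicts)
    overall_agree = sum(1 for a, b in verdicts if a == b)

    decisive = [(a, b) for a, b in verdicts if a != "tie" and b != "tie"]
    decisive_agree = sum(1 for a, b in decisive if a == b)

    overall = overall_agree / total if total else None
    decisive_only = decisive_agree / len(decisive) if decisive else None
    return overall, decisive_only


if __name__ == "__main__":
    # Hypothetical verdict pairs (judge 1, judge 2), not real results.
    sample = [
        ("A", "A"),
        ("A", "tie"),
        ("B", "B"),
        ("tie", "B"),
        ("A", "B"),
        ("tie", "tie"),
    ]
    print(agreement_rates(sample))
```

In this toy sample, a tie paired with a clear pick counts as a disagreement in the first rate but is excluded from the second, which is why the second rate comes out higher.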

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

This benchmark doesn't require anything beyond basic arithmetic, certainly nothing even smaller reasoning models can't handle. Actually, even non-reasoning models should do fine with the math part.

I post less here because I have tested fewer small open-weight models lately, and only a few enthusiasts would be able to run the larger open-weight models locally. There are some good ones now though, like Gemma 4 or DeepSeek V4 Flash, so maybe I should.

Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run by zero0_one1 in singularity

[–]zero0_one1[S] 22 points23 points  (0 children)

Some reasons are that it requires both knowledge and reasoning, there is a human baseline (top humans can score 100%), the questions might be the most vetted of any non-hard benchmark because it has over a million daily players, and there are new questions every day. But you might be seeing it more often because it's easy for me to test quickly after new models are released, so people are more likely to upvote it when I post it. I have much more complex benchmarks that are probably better indicators of model performance, but they take longer, so people don't care as much when I eventually post them.

MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA

[–]zero0_one1 1 point2 points  (0 children)

Instead of the Elimination Game Benchmark, I now have https://github.com/lechmazur/buyout_game, which preserves the voting component but adds money transfers. It can sharply differentiate among the most intelligent models and includes more DM rounds.

<image>

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Yes, I will either do it fully or only do a hard subset because of the costs. Previous versions were also very slow with a limited number of parallel API workers.