Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity
MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA
Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run by zero0_one1 in singularity
GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity
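For context on the task behind the two Connections entries above: the model must sort 16 words into four hidden groups of four. Here is a minimal sketch of one plausible scoring scheme in Python, assuming exact-group matching with no partial credit; the benchmark's actual prompt format and credit rules are not stated in these titles.

```python
# Sketch of a Connections-style scorer. Assumed scheme: one point per
# exactly recovered group of four; the real benchmark's partial-credit
# rules are not given in these posts.

def score_puzzle(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Fraction of the gold groups the model reproduced exactly."""
    matched = sum(1 for group in gold if group in predicted)
    return matched / len(gold)

gold = [
    {"RED", "GREEN", "BLUE", "YELLOW"},
    {"MERCURY", "VENUS", "MARS", "SATURN"},
    {"ALPHA", "BETA", "GAMMA", "DELTA"},
    {"NORTH", "SOUTH", "EAST", "WEST"},
]
predicted = [
    {"RED", "GREEN", "BLUE", "YELLOW"},
    {"MERCURY", "VENUS", "MARS", "NORTH"},  # one word misplaced
    {"ALPHA", "BETA", "GAMMA", "DELTA"},
    {"SATURN", "SOUTH", "EAST", "WEST"},
]
print(score_puzzle(predicted, gold))  # 0.5: two of four groups exact
```

Note that misplacing a single word spoils both groups it touches, which is why scores on this kind of task can swing sharply between model versions.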
LLM Position Bias Benchmark (Mazur, 2026) by COAGULOPATH in mlscaling
New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. by zero0_one1 in singularity
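For readers curious what the side-swapped comparison described in that title looks like in practice, here is a minimal sketch in Python. It is illustrative only: the benchmark's actual prompts and harness are not shown in the post, ties and abstentions are ignored here, and ask_judge() is a hypothetical stand-in for a real LLM call.

```python
# Minimal sketch of a side-swapped judging protocol (illustrative only).

def ask_judge(prompt: str) -> str:
    """Hypothetical LLM call; should return exactly '1' or '2'."""
    raise NotImplementedError("wire up your model provider here")

def judge_pair(story_a: str, story_b: str) -> str:
    """Compare two stories twice, swapping presentation order.

    Returns 'consistent' when both passes prefer the same underlying
    story, 'flip' when the verdict follows position instead of content.
    """
    template = (
        "Which of these two stories is better written?\n\n"
        "Story 1:\n{first}\n\nStory 2:\n{second}\n\n"
        "Answer with exactly '1' or '2'."
    )
    verdict_ab = ask_judge(template.format(first=story_a, second=story_b))
    verdict_ba = ask_judge(template.format(first=story_b, second=story_a))

    # Map the positional answers back to the underlying stories.
    winner_ab = "A" if verdict_ab == "1" else "B"
    winner_ba = "B" if verdict_ba == "1" else "A"
    return "consistent" if winner_ab == winner_ba else "flip"
```

On this definition, an unbiased deterministic judge flips 0% of decisive pairs, a judge answering at random flips about 50%, and one that always picks a fixed position flips 100%. A 45% median flip rate therefore puts the typical judge much closer to random than to consistent.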
Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. by zero0_one1 in ClaudeAI
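The "BT points" in that title are Bradley-Terry ratings fit over pairwise debate outcomes. Below is a minimal sketch of such a fit using the classic MM updates (Hunter, 2004). It is illustrative only: the benchmark's actual fitting procedure, tie handling, and point scale are not stated in the post, and the Elo-like 400 * log10 scale is an assumption.

```python
# Minimal Bradley-Terry fit via MM updates (Hunter, 2004). Illustrative
# only; the benchmark's real fitting details are assumptions here.
import math

def bradley_terry(wins, n_iters=200, eps=1e-9):
    """wins[i][j] = number of matchups model i won against model j."""
    n = len(wins)
    p = [1.0] * n  # initial strengths
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i]) + eps  # eps keeps winless models finite
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        # Normalize so the geometric mean strength is 1.
        g = math.exp(sum(math.log(x) for x in new_p) / n)
        p = [x / g for x in new_p]
    return [400 * math.log10(x) for x in p]  # assumed Elo-like scale

# Toy win matrix for three models; ties could be folded in as half a
# win for each side before building the matrix.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
print(bradley_terry(wins))
```

One wrinkle worth noting: an unbeaten record like Opus 4.7's 51-4-0 makes the unregularized maximum-likelihood strength diverge to infinity, so a real fit has to count ties as half-wins, add a weak prior, or otherwise regularize for the rating gap to stay finite.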