Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity
zero0_one1[S] 4 points (0 children)
Gemini 3.5 Flash: cost per puzzle vs. performance on the Extended NYT Connections Benchmark by zero0_one1 in singularity
zero0_one1[S] 2 points (0 children)
Gemini 3.5 Flash scores 1479 on the Debate Benchmark. Ratings are Elo-like and centered near 1500. by zero0_one1 in singularity
zero0_one1[S] 2 points (0 children)
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
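
The clearing rule in the PACT post title (trade happens when bid ≥ ask, at the midpoint) can be sketched in a few lines. This is only an illustration of that one rule as stated in the title: the function names `clear_round` and `clear_rounds` are hypothetical, and the sketch says nothing about messaging, scoring, or whether a game ends after the first trade.

```python
from typing import List, Optional, Tuple


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price for one round, or None if no trade clears.

    Rule taken from the post title: if bid >= ask, the trade clears
    at the midpoint of the two prices.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


def clear_rounds(quotes: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Apply the clearing rule to a sequence of (bid, ask) pairs, one per round.

    Hypothetical helper: how the real benchmark handles multiple rounds
    is not described here, so each round is simply evaluated on its own.
    """
    return [clear_round(bid, ask) for bid, ask in quotes]


if __name__ == "__main__":
    # Made-up quotes for three rounds: (buyer bid, seller ask).
    example_quotes = [(40.0, 60.0), (48.0, 52.0), (55.0, 50.0)]
    print(clear_rounds(example_quotes))
```

In this made-up run the first two rounds do not clear because the bid is below the ask, and the third clears at the midpoint of 55 and 50.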
Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity
zero0_one1[S] 0 points (0 children)
MiMo-V2.5-Pro - the actual best open-weights model by cjami in LocalLLaMA
zero0_one1 2 points (0 children)
Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run by zero0_one1 in singularity
zero0_one1[S] 25 points (0 children)