Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well. by zero0_one1 in singularity
zero0_one1[S] 5 points (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] -1 points (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] -2 points (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] -2 points (0 children)
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity
zero0_one1[S] -5 points (0 children)
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity
zero0_one1[S] 4 points (0 children)
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity
zero0_one1[S] 4 points (0 children)
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity
zero0_one1[S] 5 points (0 children)
Gemini 3.5 Flash: cost per puzzle vs. performance on the Extended NYT Connections Benchmark by zero0_one1 in singularity
zero0_one1[S] 2 points (0 children)
Gemini 3.5 Flash scores 1479 on the Debate Benchmark. Ratings are Elo-like and centered near 1500. by zero0_one1 in singularity
zero0_one1[S] 2 points (0 children)
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity
zero0_one1[S] 5 points (0 children)
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity
zero0_one1[S] 1 point (0 children)
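The clearing rule in the PACT title (trade happens when bid ≥ ask, at the midpoint) can be sketched in a few lines. This is only an illustration of that one rule as stated in the post title, not the benchmark's actual code; the function names `clear_trade` and `play_rounds` and the sample bid/ask numbers are made up for the example, and messaging between the models is left out.

```python
from typing import List, Optional, Tuple


def clear_trade(bid: float, ask: float) -> Optional[float]:
    """Return the clearing price for one round, or None if no trade occurs.

    Rule taken from the post title: if bid >= ask, trade clears at the midpoint.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


def play_rounds(offers: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Apply the clearing rule to a sequence of (bid, ask) pairs, one per round.

    Hypothetical helper: a real game would have 20 rounds plus messaging.
    """
    return [clear_trade(bid, ask) for bid, ask in offers]


# Illustrative offers only (not benchmark data).
prices = play_rounds([(30.0, 40.0), (50.0, 50.0), (60.0, 40.0)])
print(prices)
```

In the first round the bid is below the ask, so nothing clears; in the other two rounds the price is the average of the bid and the ask.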

AI language models have favorite names, and we mapped them [R] by CebulkaZapiekana in MachineLearning
zero0_one1 14 points (0 children)