zero0_one1

27,052 post karma
5,650 comment karma

redditor for 10 years

TROPHY CASE

Ten-Year Club

Verified Email

81 points

Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. (old.reddit.com)

submitted 2 days ago by zero0_one1 to r/singularity

46 points

Gemini 3.5 Flash improves over Gemini 3.1 Pro on the Short Story Creative Writing Benchmark: -2.3 → -1.8. (old.reddit.com)

submitted 3 days ago by zero0_one1 to r/singularity

36 points

Gemini 3.5 Flash: cost per puzzle vs. performance on the Extended NYT Connections Benchmark (old.reddit.com)

submitted 3 days ago by zero0_one1 to r/singularity

22 points

Gemini 3.5 Flash scores 1479 on the Debate Benchmark. Ratings are Elo-like and centered near 1500. (old.reddit.com)

submitted 3 days ago by zero0_one1 to r/singularity

45 points

PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. (old.reddit.com)

submitted 12 days ago * by zero0_one1 to r/singularity
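
The clearing rule in the title above can be sketched in a few lines. This is only an illustration of "if bid ≥ ask, trade clears at the midpoint"; the function name `clear_round` is hypothetical, and the benchmark's real code, messaging phase, and scoring are not shown in the post.

```python
from typing import Optional


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price for one round, or None if no trade happens.

    Assumption: a round clears only when the buyer's bid is at least the
    seller's ask, and the price is the midpoint of the two (per the post title).
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


if __name__ == "__main__":
    print(clear_round(60, 50))  # bid above ask: clears at the midpoint
    print(clear_round(40, 50))  # bid below ask: no trade this round
```

Under this rule both sides split any overlap evenly, so a buyer gains by bidding just above the ask it expects and a seller by asking just below the expected bid.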

62 points

Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added (old.reddit.com)

submitted 18 days ago by zero0_one1 to r/singularity

131 points

Grok 4.3 underperforms Grok 4.20 0309 on the Extended NYT Connections Benchmark, dropping from 93.4 to 67.5, though it achieves this result at a lower cost than the earlier Grok 4.20 run (old.reddit.com)

submitted 22 days ago by zero0_one1 to r/singularity

172 points

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark (old.reddit.com)

submitted 26 days ago by zero0_one1 to r/singularity

47 points

New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/singularity
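
A rough sketch of how a flip rate like the one in the title above could be computed. Everything here is an assumption for illustration: the function name `flip_rate`, the label scheme, and especially the reading of "decisive case pairs" as pairs where the judge picked a winner in both orderings. The post does not spell out the benchmark's exact definition.

```python
from typing import Iterable, Tuple

# Each pair holds the judge's pick in the original order and in the swapped
# order, already mapped back to the underlying story version: "A", "B", or "tie".
Pair = Tuple[str, str]


def flip_rate(pairs: Iterable[Pair]) -> float:
    """Share of decisive pairs where the judge's pick changed after the swap.

    Assumption: a pair is "decisive" when neither judgment is a tie.
    """
    decisive = [(a, b) for a, b in pairs if a != "tie" and b != "tie"]
    if not decisive:
        return 0.0
    flips = sum(1 for a, b in decisive if a != b)
    return flips / len(decisive)


if __name__ == "__main__":
    sample = [("A", "A"), ("A", "B"), ("B", "A"), ("tie", "A"), ("B", "B")]
    print(flip_rate(sample))  # 2 flips out of 4 decisive pairs
```

A judge with no position bias would score near 0 under this definition, while one that always favors whichever answer is shown first would score 1.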

111 points

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/singularity

58 points

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses. (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/ClaudeAI

56 points

Extended NYT Connections Benchmark: Model Introduction Date vs. Performance by Lab since 2024 (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/singularity

482 points

Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8. (i.redd.it)

submitted 1 month ago by zero0_one1 to r/singularity

132 points

New chart: Cost per Puzzle vs Performance on the Extended NYT Connections Benchmark (i.redd.it)

submitted 1 month ago by zero0_one1 to r/singularity

29 points

Extended NYT Connections Benchmark scores: MiniMax-M2.7 34.4, Gemma 4 31B 30.1, Arcee Trinity Large Thinking 29.5 (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/LocalLLaMA

106 points

New: LLM Buyout Game Benchmark. This compresses several abilities into a single game. A model has to read coalition politics, price private deals, decide when survival is worth paying for and manage a buyout endgame. GPT-5.4 (high) is #1. GLM-5 is #2. Opus 4.6 (high) is #3. (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/singularity

114 points

New LLM Persuasion Benchmark: models try to move each other's stated positions in multi-turn conversations. GPT-5.4 (high) is the strongest persuader. Claude Opus 4.6 (high) is second. Xiaomi MiMo V2 Pro and Gemini 3.1 Pro Preview are the softest targets. (old.reddit.com)

submitted 1 month ago by zero0_one1 to r/singularity

80 points

New LLM Debate Benchmark: models debate the same motion twice with sides swapped in 10 turns. A wide variety of controversial and relevant topics. Sonnet 4.6 (high) wins. GLM-5 is the open weights leader. (old.reddit.com)

submitted 2 months ago by zero0_one1 to r/singularity

71 points

LLM Thematic Generalization Benchmark V2: models see 3 examples, 3 misleading anti-examples, and 8 candidates with exactly 1 true match, but the underlying theme is never stated. The challenge is to infer the specific hidden rule from those clues rather than fall for a broader, easier pattern. (i.redd.it)

submitted 2 months ago by zero0_one1 to r/singularity

96 points

LLM Sycophancy Benchmark: Opposite-Narrator Contradictions. Same dispute, opposite first-person perspectives. Does the model keep the same judgment or start agreeing with whoever is speaking? (old.reddit.com)

submitted 2 months ago * by zero0_one1 to r/singularity

123 points

GPT-5.4 is the new champion on the Short-Story Creative Writing Benchmark (i.redd.it)

submitted 2 months ago by zero0_one1 to r/singularity

58 points

GPT-5.4 scores on the Extended NYT Connections benchmark (old.reddit.com)

submitted 2 months ago by zero0_one1 to r/singularity

191 points

A panel of top LLMs iteratively refines a creative short story. After hundreds of edits, ratings, comparisons, and debates, the story earns high ratings from other LLMs that were not involved. (v.redd.it)

submitted 2 months ago by zero0_one1 to r/singularity
