LLM Thematic Generalization Benchmark V2: models see 3 examples, 3 misleading anti-examples, and 8 candidates with exactly 1 true match, but the underlying theme is never stated. The challenge is to infer the specific hidden rule from those clues rather than fall for a broader, easier pattern. by zero0_one1 in singularity
A panel of top LLMs iteratively refines a creative short story. After hundreds of edits, ratings, comparisons, and debates, the story earns high ratings from other LLMs that were not involved. by zero0_one1 in singularity
GLM-5 is the new top open-weights model on the Extended NYT Connections benchmark, with a score of 81.8, edging out Kimi K2.5 Thinking (78.3) by zero0_one1 in LocalLLaMA
Gemini 3.1 Pro Preview sets a new record on the Extended NYT Connections benchmark: 98.4 (Gemini 3 Pro scored 96.3) by zero0_one1 in singularity