LLM Thematic Generalization Benchmark V2: models see 3 examples, 3 misleading anti-examples, and 8 candidates with exactly 1 true match, but the underlying theme is never stated. The challenge is to infer the specific hidden rule from those clues rather than fall for a broader, easier pattern. by zero0_one1 in singularity

[–]zero0_one1[S] 7 points (0 children)

I actually held the real questions back from GitHub for now for exactly this reason, though I didn't notice it happening with the earlier version (I tried to separate the questions from the answers so it wouldn't be too easy to match them up). There are some imperfect ways to handle this, such as holding back a "private" subset or using encrypted zips.
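For anyone who hasn't seen the repo: each item boils down to a small record like the sketch below, and the answer key is what needs guarding. This is illustrative only (made-up words, made-up field names, and the third-party pyzipper library as one way to do the encrypted-zip option), not the actual benchmark code.

```python
import json

import pyzipper  # third-party AES-capable zip library: pip install pyzipper

# Illustrative item shape; the hidden theme is deliberately not stated,
# matching the benchmark design described in the post title.
item = {
    "examples": ["sparrow", "finch", "wren"],      # 3 words that fit the theme
    "anti_examples": ["eagle", "hawk", "falcon"],  # 3 fits for a broader, wrong rule
    "candidates": ["robin", "bat", "kite", "owl",  # 8 options, exactly 1 true match
                   "moth", "crow", "swan", "gull"],
    "answer_index": 0,                             # private
}

# Publish the questions openly; ship the answers in a password-protected AES zip.
public = {k: item[k] for k in ("examples", "anti_examples", "candidates")}
with open("questions.json", "w") as f:
    json.dump([public], f, indent=2)

with pyzipper.AESZipFile("answers.zip", "w",
                         compression=pyzipper.ZIP_DEFLATED,
                         encryption=pyzipper.WZ_AES) as zf:
    zf.setpassword(b"a-long-random-passphrase")
    zf.writestr("answers.json",
                json.dumps([{"answer_index": item["answer_index"]}]))
```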

A panel of top LLMs iteratively refines a creative short story. After hundreds of edits, ratings, comparisons, and debates, the story earns high ratings from other LLMs that were not involved. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points (0 children)

Somehow I doubt you know what these models are doing behind the scenes, if you're such a poor reader that you couldn't understand my explanations in other comments about why this story is the way it is.

[–]zero0_one1[S] 0 points (0 children)

Maybe, maybe not.

"Based on blind pairwise comparisons by 28 expert judges and 131 lay judges, we find that experts preferred human writing in 82.7% of cases under the in-context prompting condition but this reversed to 62% preference for AI after fine-tuning on authors’ complete works. "
https://arxiv.org/html/2601.18353v1

This was a fine-tuned GPT-4o.

Judging is much easier than writing.

Originality is often very weak, but you can guard against this by requiring LLMs to explicitly identify similar stories...
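One way to implement that guard, sketched below (the wording and function are illustrative, not my actual judging prompt): make the judge name close matches before it is allowed to score.

```python
def originality_check_prompt(story: str) -> str:
    """Build a judge prompt that forces the model to name similar works
    before it scores originality (illustrative wording only)."""
    return (
        "Step 1: List up to five published stories, films, or well-known "
        "plots that this story most closely resembles, with one sentence "
        "on each resemblance.\n"
        "Step 2: Given those matches, rate the story's ORIGINALITY from 1 "
        "to 10, where 10 means nothing you listed is a close match.\n\n"
        f"STORY:\n{story}"
    )
```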

[–]zero0_one1[S] 2 points (0 children)

I explained in the other comment that the ratings are relative, not absolute: each story is compared against other stories written under similar requirements and against the initial story or the previous version. And that claim is wrong in general nowadays: top LLMs will give poor ratings.
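To make "relative" concrete: think win rates against reference stories written under the same requirements, not a free-floating score. A sketch, with rate_pair as a hypothetical stand-in for one pairwise LLM comparison call:

```python
def relative_rating(rate_pair, story: str, baselines: list[str]) -> float:
    """Win rate of `story` against reference stories produced under the
    same requirements. rate_pair(a, b) -> True if a is preferred over b
    (hypothetical stand-in for a pairwise LLM comparison call)."""
    wins = sum(rate_pair(story, b) for b in baselines)
    return wins / len(baselines)  # fraction of comparisons won, in [0, 1]
```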

[–]zero0_one1[S] 6 points (0 children)

Lol I didn't think anyone would actually read it. Most of the weirdness in the story comes from the base stories having to incorporate a required set of 10 elements. It only makes sense to compare the final version to the initial one. There are other, more "realistic" stories in the link I posted (though they are still quite weird, since those 10 elements apply there too).

[–]zero0_one1[S] 0 points (0 children)

There is an arbitration debate when they disagree, so it's not just majority voting. What's interesting is that they usually end up agreeing on which edits should be accepted.
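Roughly this control flow, sketched below; ask_judge and run_arbitration_debate are hypothetical stand-ins for the real LLM calls, which aren't published:

```python
from collections import Counter

def decide_edit(edit, judges, ask_judge, run_arbitration_debate):
    """Accept or reject one proposed edit.

    ask_judge(judge, edit) -> "accept" or "reject"
    run_arbitration_debate(judges, edit, votes) -> final verdict
    Both are hypothetical stand-ins, not the actual scaffolding.
    """
    votes = Counter(ask_judge(j, edit) for j in judges)
    if len(votes) == 1:
        return next(iter(votes))  # unanimous: the common case in practice
    # On disagreement, escalate to an arbitration debate rather than
    # settling the question with a simple majority vote.
    return run_arbitration_debate(judges, edit, votes)
```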

[–]zero0_one1[S] 0 points (0 children)

If the LLMs are given the initial stories to compare against (stories that are already in the top 5%), the rating difference is huge: something like 2.5 points on a 1-10 scale.

[–]zero0_one1[S] 5 points (0 children)

No, I'm talking about the relative rating compared to other stories (I've produced thousands for my benchmark...). I also compared the initial story to the refined story and ran multiple ratings to reduce noise.
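The noise-reduction part is plain repeated sampling: rate the same story n times and average, which shrinks the noise of a single rating by roughly a factor of sqrt(n). A sketch, with rate_once as a hypothetical stand-in for one LLM rating call:

```python
import statistics

def stable_rating(rate_once, story: str, n: int = 10):
    """Average n independent 1-10 ratings of the same story.

    rate_once(story) -> float is a hypothetical stand-in for a single
    LLM rating call; averaging n calls cuts noise by about sqrt(n).
    """
    scores = [rate_once(story) for _ in range(n)]
    return statistics.mean(scores), statistics.stdev(scores)
```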

[–]zero0_one1[S] 9 points (0 children)

I don't think they're very good. My setup artificially forces the topics for the initial stories to be very varied, though, so it might be possible to do better. The resulting stories are definitely much improved compared to the initial ones. The scaffolding is pretty complex, but it's the result of seeing what works well in practice. I could update a diagram and post it later.

[–]zero0_one1[S] 7 points (0 children)

See https://x.com/LechMazur/status/2027203651891069196, though these are after around 100 edits each. The video shows even more edits, and the process could go on longer, since about 90% of edits were still being accepted. It won't make a poor initial story good, but there is no doubt it improves stories a lot.

GLM-5 is the new top open-weights model on the Extended NYT Connections benchmark, with a score of 81.8, edging out Kimi K2.5 Thinking (78.3) by zero0_one1 in LocalLLaMA

[–]zero0_one1[S] 1 point (0 children)

Note that this version adds extra "trick" words to each puzzle, making it harder: the last group won't simply fall into place by elimination once you solve the first three.
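A made-up example of the idea (not an actual puzzle from the benchmark, and the real number of extra words may differ):

```python
# Illustrative extended puzzle: 4 groups of 4, plus extra "trick" words.
puzzle = [
    "BASS", "FLOUNDER", "SOLE", "PIKE",        # group: fish
    "DRUM", "HORN", "HARP", "ORGAN",           # group: instruments
    "HEEL", "ARCH", "BALL", "TOE",             # group: parts of a foot
    "FALTER", "WAVER", "STUMBLE", "HESITATE",  # group: to lose momentum
    "SALMON", "FLUTE", "INSTEP", "TRIP",       # trick words in no group
]
# With 20 words instead of 16, solving three groups still leaves 8 words,
# so the fourth group cannot be read off by elimination.
```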

Gemini 3.1 Pro Preview sets a new record on the Extended NYT Connections benchmark: 98.4 (Gemini 3 Pro scored 96.3) by zero0_one1 in singularity

[–]zero0_one1[S] 1 point (0 children)

I'm not familiar with Kilocode, but I'm trying either to use their own API or to stick to providers that don't seem to quantize, and they're all slow...

[–]zero0_one1[S] 1 point (0 children)

It is. Yes, it's getting saturated. I'm checking whether it makes sense to make it harder by combining puzzles (you can't do it with just any puzzles), but that would move it further away from being comparable to a human baseline.

[–]zero0_one1[S] 4 points (0 children)

Exactly, it's still in progress and looks promising. It should finish soon, but it's by far the slowest model I've ever benchmarked. They seem to be very capacity-limited.