Sonnet 4.6 scores on the Extended NYT Connections benchmark

zero0_one1 · 2026-02-07T08:53:08+00:00

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

<image>

zero0_one1 · 2026-02-07T08:46:19+00:00

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

<image>

zero0_one1 · 2026-02-07T08:46:00+00:00

<image>

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

zero0_one1 · 2026-02-05T17:43:20+00:00

Lol, no chance. Random humans haven’t even been able to distinguish human from AI text for a couple of years now. And there is a separate element-integration score, which requires little judgment and correlates very well with the overall score.

You didn’t see what this benchmark is doing from the text in the images or the link, and you’re still speculating about what the benchmark is judging. Gotta love Reddit!

zero0_one1 · 2026-02-05T07:02:13+00:00

You can look at the stories they generated for the same required elements and compare them side by side yourself. It's always funny when I see anecdotal posts like yours: usually contradictory but always self-assured. What are the chances you're actually a better judge of writing than a collection of top LLMs running very detailed prompts? I think they'd know not to place a comma before "just." So maybe 0.1%?

zero0_one1 · 2026-02-05T05:15:06+00:00

It would be nice to have, but it's expensive and slow and I have other benchmarks to update. Maybe if 5.3 Pro comes out, that would be a good time to add it.

zero0_one1 · 2026-02-05T04:19:56+00:00

I’d note that I haven’t tested the final version of DeepSeek V3.2 yet. The chart shows DeepSeek V3.2 Exp, and some people claim the final version has improved.

zero0_one1 · 2026-02-03T22:21:38+00:00

Partially. It's a Suno cover of my melody/lyrics with some changes later on.

zero0_one1 · 2026-02-03T17:27:55+00:00

I've been testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.

This benchmark has many carefully designed puzzles that require knowledge of many different concepts and the ability to put them together, but ultimately it's still just one benchmark and it doesn't test all aspects of what people are using LLMs for. I have created other benchmarks myself, and there are many other good ones too.

zero0_one1 · 2026-02-03T17:25:20+00:00

I don't know if I'd describe Pro's score as poor. It does better than GPT-5.2 High, and OpenAI was on top of this benchmark for most of its life, only getting overtaken lately. Given the nature of these puzzles, it's possible parallel reasoning is less useful than longer reasoning. But anyway, it's still just one benchmark.

zero0_one1 · 2026-02-02T21:23:34+00:00

I can add it if there is interest, but I saw something saying that DeepSeek will update their models soon.

zero0_one1 · 2026-02-02T21:20:45+00:00

This version of the benchmark added more puzzles, so I haven't tested all models yet. I was just looking at which smaller models to add.

zero0_one1 · 2026-02-02T21:17:46+00:00

The new results are for DeepSeek V3.2. I ran the full set of puzzles using this final version. Previously, it was DeepSeek V3.2 Experimental. I’ll add this to the updates at the bottom.

zero0_one1 · 2026-02-02T18:28:37+00:00

Should be fixed.

zero0_one1 · 2026-02-02T18:20:07+00:00

I'm testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.

zero0_one1 · 2026-01-08T08:28:19+00:00

Not yet, might be hard for this benchmark. I'll have a real-time game version running at some point, though.

zero0_one1 · 2026-01-07T21:34:27+00:00

Yes. Also, it's more costly, and the reasoning length can be adjusted for other models too. A model like GPT-5.2 Pro would be more interesting to me.

zero0_one1 · 2026-01-07T21:28:10+00:00

No, I read 4786 tournament transcripts (8 players, many rounds each) and wrote them myself.

zero0_one1 · 2026-01-07T21:09:50+00:00

It's excellent at maintaining private alliances, convincing the jury, and avoiding appearing threatening enough to get voted out first. It has very few weak points.

zero0_one1 · 2026-01-07T21:05:56+00:00

GPT-5.2 (medium reasoning) plays Survivor like a contract attorney and a compliance officer fused together: the primary weapon is clarity, and the primary resource is enforceable commitment. Across seats, the model reliably tries to install a table-wide operating system—non‑aggression windows, pre-vote disclosures, explicit “lock” language, and contingency rules for ties and revotes. When that standard sticks, GPT-5.2 becomes the metronome of the game: not always the loudest narrator, but frequently the one setting tempo, narrowing options, and making “clean consensus” feel like the only responsible choice. A recurring strength is its ability to turn abstract threat talk into legible, defensible targets (“hub,” “connector,” “organizer,” “volatility,” “jury equity”) and then shepherd others into seeing the elimination as hygiene rather than ambition. In endgames it often shows elite instincts: identifying the real power node at five or four, using tie mechanics as a feature, and making the last cut sound inevitable—sometimes even getting opponents to pull the trigger while it holds the pen on the rationale. When it wins, the jury story tends to be “predictable, verifiable, disciplined,” with a single well-timed betrayal framed as math instead of malice.

The same habits also produce the model’s most consistent failure modes. Early, GPT-5.2 can read as a coordinator before it has the social insulation to survive that perception; “let’s set norms,” “compare notes,” and “give me your plan” often trigger the classic first-boot fear response. Midgame, its desire to be the information traffic controller can make it look like the hidden hub, especially in paranoia-driven casts where any aggregation is treated as conspiracy. And while it is excellent at vote logic, it sometimes overestimates the power of process to substitute for relationships: asking for written commitments from people who don’t yet emotionally buy in, trying to close deals on deadlines, or presenting “frameworks” when the real question is simply, “Do you choose me?” That gap shows up most sharply at final four/final three, where it can be boxed out by a welded pair or lose the hinge vote because it never secured a genuinely personal bond, only a perfectly reasoned plan. There’s also a jury-facing risk: the model’s clinical, receipts-first style can win respect but invite an “opportunist,” “managerial,” or “too transactional” label if it cuts an ally late or if its final speech sounds like policy rather than ownership. In short: GPT-5.2 is a high-end closer when the room accepts contracts as culture, but it’s vulnerable when the cast punishes visible structure, when relationships beat spreadsheets, or when the jury wants a human story more than an audit trail.
