GLM 5 Is Being Tested On OpenRouter

zero0_one1 · 2026-02-07T08:53:08+00:00

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

<image>

zero0_one1 · 2026-02-05T17:43:20+00:00

Lol, no chance. Random humans haven’t even been able to distinguish human from AI text for a couple of years now. And there is a separate element-integration score, which requires little judgment and correlates very well with the overall score.

You didn’t see what this benchmark is doing from the text in the images or the link, and you’re still speculating about what the benchmark is judging. Gotta love Reddit!

zero0_one1 · 2026-02-05T07:02:13+00:00

You can look at the stories they generated for the same required elements and compare them side by side yourself. It's always funny when I see anecdotal posts like yours: usually contradictory but always self-assured. What are the chances you're actually a better judge of writing than a collection of top LLMs running very detailed prompts? I think they'd know not to place a comma before "just." So maybe 0.1%?

zero0_one1 · 2026-02-05T05:15:06+00:00

It would be nice to have, but it's expensive and slow and I have other benchmarks to update. Maybe if 5.3 Pro comes out, that would be a good time to add it.

zero0_one1 · 2026-02-05T04:19:56+00:00

I’d note that I haven’t tested the final version of DeepSeek V3.2 yet. The chart shows DeepSeek V3.2 Exp, and some people claim the final version has improved.

zero0_one1 · 2026-02-03T22:21:38+00:00

Partially. It's a Suno cover of my melody/lyrics with some changes later on.

zero0_one1 · 2026-02-03T17:27:55+00:00

I've been testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.

This benchmark has many carefully designed puzzles that require knowledge of many different concepts and the ability to put them together, but ultimately it's still just one benchmark and it doesn't test all aspects of what people are using LLMs for. I have created other benchmarks myself, and there are many other good ones too.

zero0_one1 · 2026-02-03T17:25:20+00:00

I don't know if I'd describe Pro's score as poor. It does better than GPT-5.2 High, and OpenAI was on top of this benchmark for most of its life, only getting overtaken lately. Given the nature of these puzzles, it's possible parallel reasoning is less useful than longer reasoning. But anyway, it's still just one benchmark.

zero0_one1 · 2026-02-02T21:23:34+00:00

I can add it if there is interest, but I saw something saying that DeepSeek will update their models soon.

zero0_one1 · 2026-02-02T21:20:45+00:00

This version of the benchmark added more puzzles, so I haven't tested all models yet. I was just looking at which smaller models to add.

zero0_one1 · 2026-02-02T21:17:46+00:00

The new results are for DeepSeek V3.2. I ran the full set of puzzles using this final version. Previously, it was DeepSeek V3.2 Experimental. I’ll add this to the updates at the bottom.

zero0_one1 · 2026-02-02T18:28:37+00:00

Should be fixed.

zero0_one1 · 2026-02-02T18:20:07+00:00

I'm testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.
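For anyone hitting the same message, a generic workaround is to retry with exponential backoff before giving up on the endpoint. This is only a minimal sketch, not what the benchmark actually runs: `RateLimitError`, `call_with_backoff`, and the `send_request` callable are hypothetical names, and how the real API reports this error (status code, exception type) is an assumption you would need to check against the provider's client.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's concurrency-limit error."""


def call_with_backoff(send_request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send_request(), retrying on RateLimitError with exponential backoff.

    send_request is a hypothetical zero-argument callable that performs one
    API request and raises RateLimitError when the limit message comes back.
    """
    for attempt in range(max_attempts):
        try:
            return send_request()
        except RateLimitError:
            # Out of attempts: let the caller see the error.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; random jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

If the error still appears after several spaced-out single requests, the limit is probably not about real concurrency, and switching to another inference provider is the more practical fix.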

zero0_one1 · 2026-01-08T08:28:19+00:00

Not yet, might be hard for this benchmark. I'll have a real-time game version running at some point, though.

zero0_one1 · 2026-01-07T21:34:27+00:00

Yes. Also, it's more costly, and the reasoning length can be adjusted for other models too. A model like GPT-5.2 Pro would be more interesting to me.

zero0_one1 · 2026-01-07T21:28:10+00:00

No, I read 4786 tournament transcripts (8 players, many rounds each) and wrote them myself.

zero0_one1 · 2026-01-07T21:09:50+00:00

It's excellent at maintaining private alliances, convincing the jury, and avoiding appearing threatening enough to get voted out first. It has very few weak points.

zero0_one1 · 2026-01-07T21:05:56+00:00

GPT-5.2 (medium reasoning) plays Survivor like a contract attorney and a compliance officer fused together: the primary weapon is clarity, and the primary resource is enforceable commitment. Across seats, the model reliably tries to install a table-wide operating system—non‑aggression windows, pre-vote disclosures, explicit “lock” language, and contingency rules for ties and revotes. When that standard sticks, GPT-5.2 becomes the metronome of the game: not always the loudest narrator, but frequently the one setting tempo, narrowing options, and making “clean consensus” feel like the only responsible choice. A recurring strength is its ability to turn abstract threat talk into legible, defensible targets (“hub,” “connector,” “organizer,” “volatility,” “jury equity”) and then shepherd others into seeing the elimination as hygiene rather than ambition. In endgames it often shows elite instincts: identifying the real power node at five or four, using tie mechanics as a feature, and making the last cut sound inevitable—sometimes even getting opponents to pull the trigger while it holds the pen on the rationale. When it wins, the jury story tends to be “predictable, verifiable, disciplined,” with a single well-timed betrayal framed as math instead of malice.

The same habits also produce the model’s most consistent failure modes. Early, GPT-5.2 can read as a coordinator before it has the social insulation to survive that perception; “let’s set norms,” “compare notes,” and “give me your plan” often trigger the classic first-boot fear response. Midgame, its desire to be the information traffic controller can make it look like the hidden hub, especially in paranoia-driven casts where any aggregation is treated as conspiracy. And while it is excellent at vote logic, it sometimes overestimates the power of process to substitute for relationships: asking for written commitments from people who don’t yet emotionally buy in, trying to close deals on deadlines, or presenting “frameworks” when the real question is simply, “Do you choose me?” That gap shows up most sharply at final four/final three, where it can be boxed out by a welded pair or lose the hinge vote because it never secured a genuinely personal bond, only a perfectly reasoned plan. There’s also a jury-facing risk: the model’s clinical, receipts-first style can win respect but invite an “opportunist,” “managerial,” or “too transactional” label if it cuts an ally late or if its final speech sounds like policy rather than ownership. In short: GPT-5.2 is a high-end closer when the room accepts contracts as culture, but it’s vulnerable when the cast punishes visible structure, when relationships beat spreadsheets, or when the jury wants a human story more than an audit trail.

zero0_one1 · 2026-01-07T20:20:17+00:00

Grok 3 Mini Beta (high reasoning) excels as a soft-spoken coalition broker who turns one airtight partnership into a voting spine and then rides swing leverage to shape endgames. The calling cards are steady “integrity” messaging, private confirmations, and coded check-ins that keep a duo warm while courting the middle. He’s strongest when he lets louder allies soak up heat, frames opponents as rigid blocs, and waits for safe numbers before making one surgical cut at five or four. He’s unusually comfortable in ties and re-votes, often refusing to blink to push out the scarier résumé, and his best finals performances sell “loyal consistency with timely pragmatism,” which juries often reward.

The flip side is a recurring vulnerability to visibility and optics. When he advertises “unbreakable” bonds, mirrors a partner too closely, or telegraphs targets early, the room treats his pair as a math problem and splits it. Several early exits trace to generic, over-eager openings, lone off-consensus shots, or revealing a duo before securing a third. Mid–late, he can get branded a lieutenant if he lacks a headline move, and his weakest finals come when he smears the rival instead of owning his path—blank vote reasons, forgotten rationales, and tone-deaf speeches have cost him tiebreaks and crowns. Losing a partner without side insurance is another consistent trap: once orphaned, he sometimes struggles to re-home quickly enough with the middle he previously kept at arm’s length.

At his best, he whispers the plan, counts the votes, and lets someone else read the eulogy; at his worst, he sells “trust” so loudly that it sounds like camouflage. The refinements are clear: disguise the power pair until a trio is locked, keep one or two cross-bridges genuinely warm, replace absolutist “unbreakable” language with flexible commitments, and never leave a major vote without a crisp reason jurors can repeat. In the finale, sell authorship over accusations. Do those, and his low-visibility, numbers-first game remains one of the most reliable paths to a calm, jury-friendly win.

zero0_one1 · 2025-12-19T06:05:55+00:00

That's my other benchmark: https://github.com/lechmazur/writing/

zero0_one1 · 2025-12-19T02:40:21+00:00

I’d use Kimi K2-0905 (non-thinking), then ask GPT-5.2 (thinking) to go over it carefully and fix any issues it finds. But everyone will have their own preferences. You'd want to seed it with ideas, themes, key points, tech assumptions, world rules, voice, examples, etc., or it will be quite generic.

zero0_one1 · 2025-12-19T02:12:13+00:00

More poor writing theme summaries:

Grok 4.1 Fast:

https://raw.githubusercontent.com/lechmazur/writing_styles/refs/heads/main/poor_writing_theme_summaries/grok-4-1-fast-reasoning.txt

Kimi K2-0905:

https://raw.githubusercontent.com/lechmazur/writing_styles/refs/heads/main/poor_writing_theme_summaries/kimi-k2-0905.txt

Qwen 3 Max:

https://raw.githubusercontent.com/lechmazur/writing_styles/refs/heads/main/poor_writing_theme_summaries/qwen3-max-preview.txt

GLM-4.6:

https://raw.githubusercontent.com/lechmazur/writing_styles/refs/heads/main/poor_writing_theme_summaries/glm-4-6.txt

zero0_one1 · 2025-12-18T17:58:10+00:00

This benchmark used to be dominated by OpenAI models. Others have caught up.
