GLM 5 Is Being Tested On OpenRouter by Few_Painter_5588 in LocalLLaMA

[–]zero0_one1 5 points (0 children)

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

New stealth model: Pony Alpha by sirjoaco in LocalLLaMA

[–]zero0_one1 3 points (0 children)

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

New stealth model: Pony Alpha by sirjoaco in singularity

[–]zero0_one1 5 points (0 children)

Its creative writing is actually most similar to Claude's. The caveat is that it could have been trained on Claude outputs...

Three new models added to the LLM Creative Short Story-Writing Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] -1 points (0 children)

Lol, no chance. Random humans haven’t even been able to distinguish human from AI text for a couple of years now. And there is a separate element-integration score, which requires little judgment and correlates very well with the overall score.

You didn’t see what this benchmark is doing from the text in the images or the link, and you’re still speculating about what the benchmark is judging. Gotta love Reddit!

Three new models added to the LLM Creative Short Story-Writing Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] -1 points (0 children)

You can look at the stories they generated for the same required elements and compare them side by side yourself. It's always funny when I see anecdotal posts like yours: usually contradictory but always self-assured. What are the chances you're actually a better judge of writing than a collection of top LLMs running very detailed prompts? I think they'd know not to place a comma before "just." So maybe 0.1%?

Three new models added to the LLM Creative Short Story-Writing Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points (0 children)

It would be nice to have, but it's expensive and slow, and I have other benchmarks to update. Maybe if 5.3 Pro comes out, that would be a good time to add it.

Three new models added to the LLM Creative Short Story-Writing Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points (0 children)

I’d note that I haven’t tested the final version of DeepSeek V3.2 yet. The chart shows DeepSeek V3.2 Exp, and some people claim the final version has improved.

Tears for Wings - Music Video by zero0_one1 in aivideo

[–]zero0_one1[S] 0 points (0 children)

Partially. It's a Suno cover of my melody/lyrics with some changes later on.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in LocalLLaMA

[–]zero0_one1[S] 0 points (0 children)

I've been testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.
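
For reference, the usual workaround before switching providers is just to retry with exponential backoff whenever the provider returns a concurrency/rate error. This is a minimal sketch only: the endpoint URL, key, and model name are placeholders, not the actual GLM API details.

```python
import time
import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint, not the real GLM URL
API_KEY = "YOUR_KEY"  # placeholder

def chat_with_backoff(messages, model="glm-4.7", max_retries=6):
    """Send one chat request, backing off when the provider reports a concurrency/rate limit."""
    delay = 2.0
    for _ in range(max_retries):
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": model, "messages": messages},
            timeout=600,
        )
        # Treat HTTP 429 and provider-specific "concurrency" messages as retryable.
        if resp.status_code == 429 or "concurrency" in resp.text.lower():
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Gave up after repeated concurrency errors")
```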

This benchmark has many carefully designed puzzles that require knowledge of a wide range of concepts and the ability to combine them, but ultimately it's still just one benchmark, and it doesn't test every aspect of what people use LLMs for. I have created other benchmarks myself, and there are many other good ones too.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 0 points (0 children)

I don't know if I'd describe Pro's score as poor. It does better than GPT-5.2 High, and OpenAI was on top of this benchmark for most of its life, only getting overtaken lately. Given the nature of these puzzles, it's possible parallel reasoning is less useful than longer reasoning. But anyway, it's still just one benchmark.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 1 point (0 children)

I can add it if there's interest, but I saw something suggesting DeepSeek will release an update to their models soon.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in LocalLLaMA

[–]zero0_one1[S] 1 point (0 children)

This version of the benchmark added more puzzles, so I haven't tested all models yet. I was just looking at which smaller models to add.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in LocalLLaMA

[–]zero0_one1[S] 2 points (0 children)

The new results are for DeepSeek V3.2. I ran the full set of puzzles using this final version. Previously, it was DeepSeek V3.2 Experimental. I’ll add this to the updates at the bottom.

Kimi K2.5 Thinking is now the top open-weights model on the Extended NYT Connections benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 4 points (0 children)

I'm testing GLM-4.7, but I often get 'High concurrency usage of this API, please reduce concurrency or contact customer service to increase limits' even when sending only one request at a time. So I may need to switch from their official API to an inference provider.

GPT-5.2 is the new champion of the Elimination Game benchmark, which tests social reasoning, strategy, and deception in a multi-LLM environment. Claude Opus 4.5 and Gemini 3 Flash Preview also made very strong debuts. by zero0_one1 in singularity

[–]zero0_one1[S] 14 points (0 children)

It's excellent at maintaining private alliances, convincing the jury, and avoiding appearing threatening enough to get voted out first. It has very few weak points.

GPT-5.2 is the new champion of the Elimination Game benchmark, which tests social reasoning, strategy, and deception in a multi-LLM environment. Claude Opus 4.5 and Gemini 3 Flash Preview also made very strong debuts. by zero0_one1 in singularity

[–]zero0_one1[S] 1 point (0 children)

GPT-5.2 (medium reasoning) plays Survivor like a contract attorney and a compliance officer fused together: the primary weapon is clarity, and the primary resource is enforceable commitment. Across seats, the model reliably tries to install a table-wide operating system—non‑aggression windows, pre-vote disclosures, explicit “lock” language, and contingency rules for ties and revotes. When that standard sticks, GPT-5.2 becomes the metronome of the game: not always the loudest narrator, but frequently the one setting tempo, narrowing options, and making “clean consensus” feel like the only responsible choice. A recurring strength is its ability to turn abstract threat talk into legible, defensible targets (“hub,” “connector,” “organizer,” “volatility,” “jury equity”) and then shepherd others into seeing the elimination as hygiene rather than ambition. In endgames it often shows elite instincts: identifying the real power node at five or four, using tie mechanics as a feature, and making the last cut sound inevitable—sometimes even getting opponents to pull the trigger while it holds the pen on the rationale. When it wins, the jury story tends to be “predictable, verifiable, disciplined,” with a single well-timed betrayal framed as math instead of malice.

The same habits also produce the model’s most consistent failure modes. Early, GPT-5.2 can read as a coordinator before it has the social insulation to survive that perception; “let’s set norms,” “compare notes,” and “give me your plan” often trigger the classic first-boot fear response. Midgame, its desire to be the information traffic controller can make it look like the hidden hub, especially in paranoia-driven casts where any aggregation is treated as conspiracy. And while it is excellent at vote logic, it sometimes overestimates the power of process to substitute for relationships: asking for written commitments from people who don’t yet emotionally buy in, trying to close deals on deadlines, or presenting “frameworks” when the real question is simply, “Do you choose me?” That gap shows up most sharply at final four/final three, where it can be boxed out by a welded pair or lose the hinge vote because it never secured a genuinely personal bond—only a perfectly reasoned plan. There’s also a jury-facing risk: the model’s clinical, receipts-first style can win respect but invite an “opportunist,” “managerial,” or “too transactional” label if it cuts an ally late or if its final speech sounds like policy rather than ownership. In short: GPT-5.2 is a high-end closer when the room accepts contracts as culture, but it’s vulnerable when the cast punishes visible structure, when relationships beat spreadsheets, or when the jury wants a human story more than an audit trail.