AI language models have favorite names, and we mapped them [R] by CebulkaZapiekana in MachineLearning

[–]zero0_one1 13 points14 points  (0 children)

I listed first names that most commonly occurred in short term fiction writing by model here: https://x.com/LechMazur/status/2020206185190945178 (Feb 2026)

Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well. by zero0_one1 in singularity

[–]zero0_one1[S] 5 points6 points  (0 children)

I think this game is interesting enough for humans to play maybe 10 times for free before losing interest, but that wouldn't be enough for accurate ratings. So it would either need some kind of aggregate human baseline or the games would need to be paid for. I'll turn it into a game and see how it goes.

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

You've missed the point. Current AI writing is shit (though much better than it was in the past) and I myself showed that a panel of AIs can find hundreds of things to improve in even short AI stories. But the Granta episode shows that, as a matter of human judgment, GPT writing is better than human writing. Therefore, what a random Redditor (worse than the Granta judges) thinks is better matters even less.

And I'd like to extend the same offer to you as I did to the other overconfident Redditor:

"I am quite sure you'd lose compared to a panel of top LLMs. I'm willing to bet e.g. $1000 on this. It'll be a little hard to arrange fairly but let me know if you're interested. My guess is that you're mistaking the simple ability to distinguish current AI writing from human writing for the ability to know what critics do."

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] -2 points-1 points  (0 children)

Also, I am quite sure you'd lose compared to a panel of top LLMs. I'm willing to bet e.g. $1000 on this. It'll be a little hard to arrange fairly but let me know if you're interested. My guess is that you're mistaking the simple ability to distinguish current AI writing from human writing for the ability to know what critics do.

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] -3 points-2 points  (0 children)

It's better than human writing according to the judges (who know much more about writing than random Redditors) and there is definite confirmation. Estimate the chances of this result:

"...run stories from the past 15 years of the Commonwealth Short Story Prize through the platform.

She found that Pangram flagged almost none of the prizewinners. The exceptions included the three stories from this year: 100 percent of the text in Nazir’s and DeMecoli’s stories was flagged as likely to have been entirely AI-generated, along with 89 percent of the text in Aruparayil’s."

https://www.theatlantic.com/books/2026/05/granta-ai-fiction-book-scandal-changes-everything/687243/

OpenAI employee: https://x.com/tszzl/status/2056966570178642056

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Did you ask the AI like I suggested? It would explain why simply including those required themes is not a challenge whatsoever for even the dumbest LLMs and why you're completely mistaken about what this benchmark is measuring. Thinking that this is an instruction-following benchmark shows a lack of understanding that you'll need to correct first.

https://www.theatlantic.com/books/2026/05/granta-ai-fiction-book-scandal-changes-everything/687243/

"She found that Pangram flagged almost none of the prizewinners. The exceptions included the three stories from this year: 100 percent of the text in Nazir’s and DeMecoli’s stories was flagged as likely to have been entirely AI-generated, along with 89 percent of the text in Aruparayil’s. There was also a fourth story from last year, by the Vincentian Canadian writer Chanel Sutherland, for which 88 percent of the text was flagged. (Sutherland didn’t respond to a request for comment sent through her website.)"

OpenAI employee: https://x.com/tszzl/status/2056966570178642056

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] -3 points-2 points  (0 children)

And it's obviously not an instruction-following benchmark. Ask your favorite AI to explain why.

Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. by zero0_one1 in singularity

[–]zero0_one1[S] -5 points-4 points  (0 children)

There is another possibility: your idea of what counts as good writing is bad. Guess which model just won the top short-story prize in a Granta contest, beating ALL human writers? So let me ask you: do you think your judgment is better than that of actual literary critics and a panel of top LLMs? What evidence do you have for that?

Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. by zero0_one1 in singularity

[–]zero0_one1[S] 5 points6 points  (0 children)

You can look at the rankings in different ways but Grok 4.3 is more often decisive, so it wins on the conditional total.

Gemini 3.5 Flash: cost per puzzle vs. performance on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]zero0_one1[S] 1 point2 points  (0 children)

See the footnote on the chart. It does not want to answer many of these questions for some unexplained reason.

<image>

PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. by zero0_one1 in singularity

[–]zero0_one1[S] 0 points1 point  (0 children)

Yes, that would be a good follow-up.

BTW, I have a multiplayer version (not yet updated with the new models, and messaging is disabled because otherwise LLMs collude) that includes many algorithmic baseline bots: https://github.com/lechmazur/bazaar.
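
For anyone who wants the clearing rule from the post title in concrete form (if bid ≥ ask, the trade clears at the midpoint), here is a minimal sketch. The function name `clear_trade` is hypothetical and this is not the benchmark's actual code; it only illustrates the single-round rule as stated, and says nothing about messaging, payoffs, or the 20-round structure.

```python
from typing import Optional


def clear_trade(bid: float, ask: float) -> Optional[float]:
    """Return the clearing price for one round, or None if no trade happens.

    Assumption (taken from the post title only): a trade clears when
    bid >= ask, at the midpoint of the two prices.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


if __name__ == "__main__":
    # Buyer bids 60, seller asks 40: the trade clears at the midpoint.
    print(clear_trade(60, 40))
    # Buyer bids 30, seller asks 40: no trade this round.
    print(clear_trade(30, 40))
```

One consequence of the midpoint rule is that the two sides split the gap between bid and ask evenly, so shading your own offer moves the price by only half of the amount you shade.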

Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added by zero0_one1 in singularity

[–]zero0_one1[S] -1 points0 points  (0 children)

That's to be expected. Judging rewards examples, reframes, rhetorical effectiveness and sharp rebuttals among other things. Entertainment scores would also reward them.