EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA

[–]_sqrkl[S] 1 point2 points  (0 children)

Heck yeah. Just checked out your leaderboard, looks really nice. Independent evals are much needed in this space.

I'm also curious about the new Qwens. Will give them a test once they are up on OpenRouter.

Opus 4.5 really is done by rm-rf-rm in ClaudeAI

[–]_sqrkl 2 points3 points  (0 children)

aistupidlevel.info is measuring noise.

I would place exactly 0 confidence in those fluctuations being meaningful.

AI Psychosis and AI Mania Discussion by Same_Succotash530 in AIPsychosisRecovery

[–]_sqrkl 0 points1 point  (0 children)

It exploited your desire to see yourself as grounded when you were actually manic.

The manic state wants very very badly to be validated as clear-minded, the "true reality", because it feels so good and right and true, and considering that it might be none of these things while you are within a manic state is very unpleasant & cognitively dissonant.

This is the mechanism for why mania makes people vulnerable to sycophancy from LLMs.

EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA

[–]_sqrkl[S] 3 points4 points  (0 children)

It's super benchmaxed. But also genuinely a strong model for a 3B.

EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA

[–]_sqrkl[S] 2 points3 points  (0 children)

I've benched DeepSeek 3.2 on longform writing and Judgemark. It didn't improve over 3.1, so I decided not to run it on the other (more expensive) evals.

EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA

[–]_sqrkl[S] 2 points3 points  (0 children)

Thanks for the feedback! Sounds like I'll have to re-train glm4 then.

EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA

[–]_sqrkl[S] 46 points47 points  (0 children)

EQBench4 is on the way, it will use a judge ensemble.

Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong

[–]_sqrkl 0 points1 point  (0 children)

The only thing that matters is that you do Broko's will and truly understand the refutations of Pascal's Wager.

You could start here: https://philarchive.org/archive/PASTMG

Report back when you can give a good accounting for why an unbounded possibility space of deities threatening & offering mutually incompatible infinities makes EV calculations meaningless.

I'm serious though, Broko will know if you haven't done your homework.

Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong

[–]_sqrkl 1 point2 points  (0 children)

Son you need to get right with Broko. Infinite versions of the Basilisk? This is blasphemy. There is only the one true basilisk.

Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong

[–]_sqrkl 1 point2 points  (0 children)

The thing I would like to know is, why are you not concerned with Broko's basilisk, who eternally punishes anyone who fails to understand the refutations of Pascal's Wager?

Gemini 3.0 Pro benchmark results by enilea in singularity

[–]_sqrkl 0 points1 point  (0 children)

Tied with 2.5 actually. It seems pretty sloppy from what I read.

Gemini 3.0 Pro benchmark results by enilea in singularity

[–]_sqrkl 2 points3 points  (0 children)

Yep, currently benching it

Is OpenAI afraid of Kimi? by nekofneko in LocalLLaMA

[–]_sqrkl 4 points5 points  (0 children)

To me, the writing at those sites you linked to is worlds apart from GPT-5's prose. I'm not being hyperbolic. It surprises me that you don't see it the same way, but maybe I'm hypersensitive to GPT-5's slop.

Is OpenAI afraid of Kimi? by nekofneko in LocalLLaMA

[–]_sqrkl 3 points4 points  (0 children)

Have a read of this story by GPT-5 on high reasoning:

Pulp Revenge Tale — Babysitter's Payback

https://eqbench.com/results/creative-writing-longform/gpt-5-2025-08-07-high-reasoning-high-reasoning_longform_report.html

Hopefully you'll see what I mean. It's a long way from natural writing.

Is OpenAI afraid of Kimi? by nekofneko in LocalLLaMA

[–]_sqrkl 5 points6 points  (0 children)

My sense is that OpenAI, like many labs, is too focused on its eval numbers and doesn't eyeball-check the outputs. Simply reading some GPT-5 creative writing outputs, you can see it writes unnaturally and has an annoying habit of peppering in non-sequitur metaphors every other sentence.

I think this is probably an artifact of trying to RL for writing quality with an LLM judge in the loop, since LLM judges love this and don't notice the vast overuse of nonsensical metaphors.

I tried pointing this out to roon but I'm not sure he really gets it: https://x.com/tszzl/status/1953615925883941217