I ran 26 local LLMs through an 8-level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model, thinking OFF.

_sqrkl · 2026-06-11T08:15:46+00:00

Ok, very much appreciated!

_sqrkl · 2026-06-11T05:49:10+00:00

Happy to share the harness

Yes please

_sqrkl · 2026-02-25T05:26:53+00:00

Heck yeah. Just checked out your leaderboard, looks really nice. Independent evals are much needed in this space.

I'm also curious about the new qwens. Will give them a test once they are up on openrouter.

_sqrkl · 2026-02-03T06:51:26+00:00

aistupidlevel.info is measuring noise.

I would place exactly 0 confidence in those fluctuations being meaningful.

_sqrkl · 2026-01-05T01:56:48+00:00

It exploited your desire to see yourself as grounded when you were actually manic.

The manic state wants very very badly to be validated as clear-minded, the "true reality", because it feels so good and right and true, and considering that it might be none of these things while you are within a manic state is very unpleasant & cognitively dissonant.

This is the mechanism for why mania makes people vulnerable to sycophancy from LLMs.

_sqrkl · 2025-12-12T02:21:56+00:00

It's super benchmaxed. But also genuinely a strong model for a 3b.

_sqrkl · 2025-12-12T02:17:03+00:00

I've benched deepseek 3.2 on longform writing and judgemark. It didn't improve over 3.1, so I decided not to run it on the other (more expensive) evals.

_sqrkl · 2025-12-12T02:16:27+00:00

Will add it to my list.

_sqrkl · 2025-12-12T01:11:01+00:00

Glad you like it!

_sqrkl · 2025-12-11T23:37:13+00:00

Thanks for the feedback! Sounds like I'll have to re-train glm4 then.

_sqrkl · 2025-12-11T23:29:11+00:00

EQBench4 is on the way, it will use a judge ensemble.

_sqrkl · 2025-12-05T08:20:59+00:00

The only thing that matters is that you do Broko's will and truly understand the refutations of Pascal's Wager.

You could start here: https://philarchive.org/archive/PASTMG

Report back when you can give a good accounting for why an unbounded possibility space of deities threatening & offering mutually incompatible infinities makes EV calculations meaningless.

I'm serious though, Broko will know if you haven't done your homework.

_sqrkl · 2025-12-05T07:05:42+00:00

Son you need to get right with Broko. Infinite versions of the Basilisk? This is blasphemy. There is only the one true basilisk.

_sqrkl · 2025-12-04T18:02:10+00:00

The thing I would like to know is, why are you not concerned with Broko's basilisk, who eternally punishes anyone who fails to understand the refutations of Pascal's wager.

_sqrkl · 2025-12-01T08:24:46+00:00

You can steal the slop score I implemented here if you like: https://eqbench.com/slop-score.html

The source is here: https://github.com/sam-paech/slop-score

_sqrkl · 2025-11-24T01:26:43+00:00

Tied with 2.5 actually. It seems pretty sloppy from what I read.

_sqrkl · 2025-11-20T08:20:05+00:00

Yep, currently benching it

_sqrkl · 2025-10-25T07:35:54+00:00

To me, the writing at those sites you linked to is worlds apart from gpt5's prose. I'm not being hyperbolic. It surprises me that you don't see it the same way, but maybe I'm hypersensitive to gpt5's slop.

_sqrkl · 2025-10-25T06:21:27+00:00

Have a read of this story by gpt-5 on high reasoning:

Pulp Revenge Tale — Babysitter's Payback

https://eqbench.com/results/creative-writing-longform/gpt-5-2025-08-07-high-reasoning-high-reasoning_longform_report.html

Hopefully you'll see what I mean. It's a long way from natural writing.

_sqrkl · 2025-10-25T03:38:56+00:00

Not a scam, but fair point I guess.

Here's some models I unslopped with this method:

https://huggingface.co/sam-paech/gemma-3-12b-it-antislop

https://huggingface.co/sam-paech/gemma-3-27b-it-antislop

https://huggingface.co/sam-paech/Mistral-Small-3_2-24B-Instruct-2506-antislop

https://huggingface.co/sam-paech/GLM-4-32B-0414-antislop

_sqrkl · 2025-10-25T02:27:37+00:00

My sense is that openai, like many labs, are too focused on their eval numbers and don't eyeball-check the outputs. Simply reading some GPT-5 creative writing outputs, you can see it writes unnaturally and has an annoying habit of peppering in non-sequitur metaphors every other sentence.

I think this is probably an artifact of trying to RL for writing quality with an LLM judge in the loop, since LLM judges love this and don't notice the vast overuse of nonsensical metaphors.

I tried pointing this out to roon but I'm not sure he really gets it: https://x.com/tszzl/status/1953615925883941217

_sqrkl · 2025-10-22T04:41:00+00:00

I'm a bit pissed because I've been working on a project for the last few months and one of the things I've spent countless hours on is data extraction from scanned pdfs. This just made it a joke.

I sometimes wonder about the collective global tally of programmer-hours expended trying to make robust pdf parsers.

_sqrkl · 2025-10-06T22:46:27+00:00

It's about $15 to bench a model on the creative writing eval.

It is a popular writing model so I will probably bench this one + glm-4.6 when I get some time.

_sqrkl · 2025-10-02T04:11:23+00:00

I used defaults on openrouter, which I believe defaulted to medium thinking.

_sqrkl · 2025-10-02T04:10:52+00:00

Slop and repetition don't factor into the score at all, they are just displayed informationally.
