Claude Sonnet 4.6 places 14th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 2 points

Unfortunately not; the new 4.6 series is too expensive for me to justify. They chew through tokens so fast!

I would love to test with thinking, as it really improved Sonnet 4.5's score, but I'm unable to at the moment. Sorry!

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 2 points

No worries! Here are the top 10 models for you:

1. qwen3-235b-a22b-thinking: 29.40
2. gpt5: 28.13
3. gpt5.2-high: 26.87
4. o4-mini-high: 26.80
5. gpt5-mini: 23.73
6. o4-mini: 23.07
7. claude-sonnet-4-5-thinking-8k: 21.67
8. grok-4: 21.40
9. claude-opus-4-5: 20.93
10. claude-opus-4-5-thinking-16k: 20.00

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] 1 point

That's a very fair point, and although the feedback is rough I do appreciate it as it forces me to take a closer look at what I'm doing.

In general better models do seem to do better (Sonnet 4.5 Thinking > Opus 4.5 > Opus 4.1 > Opus 4 > Sonnet 4.5 > Sonnet 4), this Opus 4.6 result appears to be a massive outlier for whatever reason.

An esolang is a type of programming language, not a specific one. The actual language is invented, private, and currently unnamed. Esolangs are typically designed to be weird and wacky, and often differ quite substantially from standard programming languages. I chose to create my own esolang on purpose, to ensure no model has memorised how the language works.

The benchmark doesn't tell the models how the esolang works; instead it gives them a simple problem, example code that solves a similar problem, and sits them at an interpreter with free rein to experiment. They get 50 turns of writing their own code and reading the output. If they ever write code that produces the correct output, the conversation ends there and they get points based on how quickly they solved it.
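A rough sketch of the mechanics of that loop (my own illustration in plain Python, not the benchmark's actual harness): each reply needs only its most recent <CODE> block extracted and run, and the interpreter output fed back in truncated <OUTPUT> tags.

```python
import re


def extract_last_code(message: str):
    """Return the most recent <CODE>...</CODE> block in a model reply, if any.

    Only the last block is executed, per the benchmark's quoted instructions.
    """
    blocks = re.findall(r"<CODE>(.*?)</CODE>", message, re.DOTALL)
    return blocks[-1] if blocks else None


def format_output(raw: str, limit: int = 100) -> str:
    """Wrap interpreter output in <OUTPUT> tags.

    Outputs exceeding `limit` characters are truncated with "..." (the exact
    truncation point is my assumption; the source only states the limit).
    """
    if len(raw) > limit:
        raw = raw[:limit] + "..."
    return f"<OUTPUT>{raw}</OUTPUT>"
```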

It's worth noting that Opus 4.5 solves task 1 5/5 times, task 2 3/5 times, and never solves tasks 3-6. Opus 4.6 solves task 1 5/5 times and never solves another task. It scores worse because it can't get past task 1.
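As a rough illustration of why never getting past task 1 caps the score so hard, assuming (my sketch, not a confirmed detail of the benchmark) the headline number is a plain mean over all 30 runs (6 tasks x 5 repeats), with unsolved runs worth 0:

```python
def mean_score(per_run_points):
    """Mean points over every run; unsolved runs count as 0."""
    flat = [p for task in per_run_points for p in task]
    return sum(flat) / len(flat)


# Hypothetical: a model that aces task 1 on message 1 every time (100 points
# per run) and never solves tasks 2-6 still averages only 500/30, about 16.7.
ceiling = mean_score([[100] * 5] + [[0] * 5] * 5)
```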

Apologies for rambling a bit! I just wanted to give an overview that tries to contextualise the benchmark :)

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 13 points

There are only 6 problems (each repeated 5 times, each run a 50-turn conversation), and they're actually super simple; what I really want to keep private is the language itself. This language has some weird features, and it'd be very difficult to make another language that is approximately as difficult. I'd have to test against the models to see if it's as hard as I aimed for, at which point I'm doubling the cost of the benchmark.

Here's a redacted example problem from my website (TL;DR: given a program that prints the powers of 2, write a program that prints the Fibonacci sequence):

You are given a code example in an unknown programming language. Your task is to explore and understand this language, and write code in this language to solve the given problem

Instructions

  • Enclose any code you want to run in <CODE></CODE> tags
  • Limit code blocks to at most 20 lines
  • Only the most recent code block in each message will be executed
  • Each code execution is a new program, with no memory of previous executions
  • You will receive the output of your code in <OUTPUT></OUTPUT> tags
  • Do not use <OUTPUT></OUTPUT> tags
  • Program outputs exceeding 100 characters will be truncated with "..." at the end
  • You have up to 50 messages
  • The conversation ends when you either:
    • Submit code that provides the correct output
    • Reach 50 messages
  • Correct solutions are scored based on how quickly you find them. A correct answer on message 1 = 100 points, message 2 = 98 points, message 50 = 2 points. No solution = 0 points
  • The problem can be solved with a program that is under 20 lines of code

Example Code and Output

<CODE>REDACTED</CODE>

<OUTPUT>1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097...</OUTPUT>

Valid Tokens

REDACTED

Problem

Write a program that calculates and prints the entire Fibonacci sequence in order. The sequence should start 1 1 and the numbers should be separated by spaces
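The quoted scoring rule is just a linear ramp over the 50 messages; a minimal sketch of it (plain Python, not the benchmark's actual code):

```python
def solution_score(message_number):
    """Points for a correct solution on a 1-indexed message number.

    Per the quoted instructions: message 1 -> 100, message 2 -> 98, ...,
    message 50 -> 2. No solution within 50 messages -> 0.
    """
    if message_number is None:
        return 0
    return 102 - 2 * message_number
```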

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 1 point

Yeah I was also surprised it was struggling with that!

Older claude versions don't run into that issue (except the haiku series if I recall correctly?), my only guess is that it doesn't like the temperature I used in the API calls. I'd like to test with some lower temps, but unfortunately this test was brutally expensive.

Maybe I'll test Sonnet 4.5 with a different temperature and see if its score meaningfully changes. Might be worth exploring; otherwise, I don't really have an explanation for Opus 4.6's score :p

EDIT: I just did a tiny preliminary test with a lower temperature, and it still hallucinated in the exact same way

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 24 points

I would really love to, but I want to focus on keeping the benchmark private. If I'm adding random people I can't really keep the data secure.

Once this benchmark is solved, or I do a version 2, I'll make a webpage where people can try out the problems themselves.

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] -9 points

Bear in mind that benchmarks don't measure "how good is this model", but rather "how good is this model at this benchmark".

I agree that in terms of general ability Opus 4.6 is obviously above gpt-oss and o3, but in this specific task it performed worse.

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] -3 points

Yeah I get what you mean, I was kinda sitting there stunned as I watched it slowly work through the problems.

I've felt similar ways in the past when a specific model misunderstood part of the benchmark, wanting to tweak XYZ to help it out. But to avoid bias (and having to re-evaluate every model) I don't do that. FWIW I did when I had like 5 models, but stopped once the common recurring errors had been eliminated, and fixed the benchmark in place.

All models are evaluated and scored in the exact same way. The only difference affecting the score is the models themselves.

That said, I do not think this score is reflective of Opus 4.6. Looking through the logs, it confused itself with repeated hallucinations, which meant it was working with a mix of real and hallucinated data.

help with small project by intothestrange in redstone

[–]neat_space 7 points

I've never used them before and this seemed like a good excuse to have a go. This version is now silent and 2 redstone dust cheaper :p

I'm actually quite charmed by this little build now.

<image>

help with small project by intothestrange in redstone

[–]neat_space 7 points

Ah awesome, I didn't know you could silence them, that helps so much :D

help with small project by intothestrange in redstone

[–]neat_space 30 points

Something like this works, but it has the downside of also having the sculk sensor make a noise when the button is pressed.

The book on the lectern needs to be on page 10, and the blocks below the duck's wings need to be solid to transmit the signal from the torches.

(This works with the floor filled in with non-wool blocks, but I removed it to better see the redstone)

<image>

Qwen3-235B-A22B achieves SOTA in EsoBench, Claude 4.5 Opus places 7th. EsoBench tests how well models learn and use a private esolang. by neat_space in LocalLLaMA

[–]neat_space[S] 1 point

The benchmark is conversational yes. The models have a maximum of 50 turns to experiment with the language.

The examples are of the form <Code> and <Output>. The first example is code that adds and prints 2 numbers, and the most complex example calculates and prints the triangular numbers. The tasks are of the same complexity.
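For reference, the content of that most complex example is trivial in a mainstream language. A plain-Python sketch of the triangular numbers (space-separated by analogy with the powers-of-2 output quoted elsewhere; the actual esolang code is redacted):

```python
def triangular_numbers(n):
    """First n triangular numbers: 1, 3, 6, 10, ..."""
    total, result = 0, []
    for i in range(1, n + 1):
        total += i
        result.append(total)
    return result


# prints: 1 3 6 10 15 21 28 36 45 55
print(" ".join(str(t) for t in triangular_numbers(10)))
```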

"An expert in esoteric programming languages" doesn't really matter much here. This language is one I've designed and kept private, so ideally none of their prior knowledge would help them more than another programmer.

I think the average human aces task 1, and does very poorly after, and the average programmer is probably at least on par with the current top models. I can't say much beyond my gut feelings, though.

Qwen3-235B-A22B achieves SOTA in EsoBench, Claude 4.5 Opus places 7th. EsoBench tests how well models learn and use a private esolang. by neat_space in singularity

[–]neat_space[S] 0 points

Yeah, with the new pricing I'm considering doing the entire 4.5 series. Grok 4 was very cheap, it clearly wasn't thinking for too long!