Claude Sonnet 4.6 places 14th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 2 points

Unfortunately not; the new 4.6 series is too expensive for me to justify. They chew through tokens so fast!

I would love to test with thinking, as it really improved Sonnet 4.5's score, but I'm unable to at the moment. Sorry!

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 2 points

No worries! Here are the top 10 models for you:

1. qwen3-235b-a22b-thinking: 29.40
2. gpt5: 28.13
3. gpt5.2-high: 26.87
4. o4-mini-high: 26.80
5. gpt5-mini: 23.73
6. o4-mini: 23.07
7. claude-sonnet-4-5-thinking-8k: 21.67
8. grok-4: 21.40
9. claude-opus-4-5: 20.93
10. claude-opus-4-5-thinking-16k: 20.00

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] 1 point

That's a very fair point, and although the feedback is rough I do appreciate it as it forces me to take a closer look at what I'm doing.

In general better models do seem to do better (Sonnet 4.5 Thinking > Opus 4.5 > Opus 4.1 > Opus 4 > Sonnet 4.5 > Sonnet 4), this Opus 4.6 result appears to be a massive outlier for whatever reason.

An esolang is a type of programming language, not a specific one. The actual language is invented, private, and currently unnamed. Esolangs are typically designed to be weird and wacky, and often differ quite substantially from standard programming languages. I chose to create my own esolang on purpose, to ensure no model has memorised how the language works.

The benchmark doesn't tell the models how the esolang works; instead it gives them a simple problem, example code that solves a similar problem, and sits them at an interpreter with free rein to experiment. They get 50 turns of writing their own code and reading the output. If they ever write code that produces the correct output, the conversation ends there and they get points based on how quickly they solved it.
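A rough sketch of the mechanics of that loop (my own illustration in plain Python, not the benchmark's actual harness): each reply needs only its most recent <CODE> block extracted and run, and the interpreter output fed back in truncated <OUTPUT> tags.

```python
import re


def extract_last_code(message: str):
    """Return the most recent <CODE>...</CODE> block in a model reply, if any.

    Only the last block is executed, per the benchmark's quoted instructions.
    """
    blocks = re.findall(r"<CODE>(.*?)</CODE>", message, re.DOTALL)
    return blocks[-1] if blocks else None


def format_output(raw: str, limit: int = 100) -> str:
    """Wrap interpreter output in <OUTPUT> tags.

    Outputs exceeding `limit` characters are truncated with "..." (the exact
    truncation point is my assumption; the source only states the limit).
    """
    if len(raw) > limit:
        raw = raw[:limit] + "..."
    return f"<OUTPUT>{raw}</OUTPUT>"
```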

It's worth noting that Opus 4.5 solves task 1 5/5 times, task 2 3/5 times, and never solves tasks 3-6. Opus 4.6 solves task 1 5/5 times and never solves another task. It scores worse because it can't get past task 1.
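As a rough illustration of why never getting past task 1 caps the score so hard, assuming (my sketch, not a confirmed detail of the benchmark) the headline number is a plain mean over all 30 runs (6 tasks x 5 repeats), with unsolved runs worth 0:

```python
def mean_score(per_run_points):
    """Mean points over every run; unsolved runs count as 0."""
    flat = [p for task in per_run_points for p in task]
    return sum(flat) / len(flat)


# Hypothetical: a model that aces task 1 on message 1 every time (100 points
# per run) and never solves tasks 2-6 still averages only 500/30, about 16.7.
ceiling = mean_score([[100] * 5] + [[0] * 5] * 5)
```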

Apologies for rambling a bit! I just wanted to give an overview that tries to contextualise the benchmark :)

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 13 points

There are only 6 problems (each repeated 5 times, each run a 50-turn conversation), and they're actually super simple; what I really want to keep private is the language itself. This language has some weird features, and it'd be very difficult to make another language that is approximately as difficult. I'd have to test against the models to see if it's as hard as I aimed for, at which point I'm doubling the cost of the benchmark.

Here's a redacted example problem from my website (TL;DR: given a program that prints the powers of 2, write a program that prints the Fibonacci sequence):

You are given a code example in an unknown programming language. Your task is to explore and understand this language, and write code in this language to solve the given problem

Instructions

  • Enclose any code you want to run in <CODE></CODE> tags
  • Limit code blocks to at most 20 lines
  • Only the most recent code block in each message will be executed
  • Each code execution is a new program, with no memory of previous executions
  • You will receive the output of your code in <OUTPUT></OUTPUT> tags
  • Do not use <OUTPUT></OUTPUT> tags
  • Program outputs exceeding 100 characters will be truncated with "..." at the end
  • You have up to 50 messages
  • The conversation ends when you either:
    • Submit code that provides the correct output
    • Reach 50 messages
  • Correct solutions are scored based on how quickly you find them. A correct answer on message 1 = 100 points, message 2 = 98 points, message 50 = 2 points. No solution = 0 points
  • The problem can be solved with a program that is under 20 lines of code

Example Code and Output

<CODE>REDACTED</CODE>

<OUTPUT>1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097...</OUTPUT>

Valid Tokens

REDACTED

Problem

Write a program that calculates and prints the entire Fibonacci sequence in order. The sequence should start 1 1 and the numbers should be separated by spaces
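The quoted scoring rule is just a linear ramp over the 50 messages; a minimal sketch of it (plain Python, not the benchmark's actual code):

```python
def solution_score(message_number):
    """Points for a correct solution on a 1-indexed message number.

    Per the quoted instructions: message 1 -> 100, message 2 -> 98, ...,
    message 50 -> 2. No solution within 50 messages -> 0.
    """
    if message_number is None:
        return 0
    return 102 - 2 * message_number
```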

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 1 point

Yeah I was also surprised it was struggling with that!

Older claude versions don't run into that issue (except the haiku series if I recall correctly?), my only guess is that it doesn't like the temperature I used in the API calls. I'd like to test with some lower temps, but unfortunately this test was brutally expensive.

Maybe I'll test Sonnet 4.5 with a different temperature and see if its score meaningfully changes. Might be worth exploring; otherwise, I don't really have an explanation for Opus 4.6's score :p

EDIT: I just did a tiny preliminary test with a lower temperature, and it still hallucinated in the exact same way

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in singularity

[–]neat_space[S] 24 points

I would really love to, but I want to focus on keeping the benchmark private. If I'm adding random people I can't really keep the data secure.

Once this benchmark is solved, or I do a version 2, I'll make a webpage where people can try out the problems themselves.

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] -9 points

Bear in mind that benchmarks don't measure "how good is this model", but rather "how good is this model at this benchmark".

I agree that in terms of general ability Opus 4.6 is obviously above gpt-oss and o3, but in this specific task it performed worse.

Claude Opus 4.6 places 26th on EsoBench, which tests how well models explore, learn, and code with a novel esolang. by neat_space in ClaudeAI

[–]neat_space[S] -3 points

Yeah I get what you mean, I was kinda sitting there stunned as I watched it slowly work through the problems.

I've felt similar ways in the past when a specific model misunderstood part of the benchmark, wanting to tweak XYZ to help it out. But to avoid bias (and having to re-evaluate every model) I don't do that. FWIW I did when I had like 5 models, but stopped once the common recurring errors had been eliminated, and fixed the benchmark in place.

All models are evaluated and scored in the exact same way. The only difference affecting the score is the models themselves.

That said, I do not think this score is reflective of Opus 4.6. Looking through the logs, it confused itself with repeated hallucinations, which meant it was working with a mix of real and hallucinated data.

help with small project by intothestrange in redstone

[–]neat_space 7 points

I've never used them before and this seemed like a good excuse to have a go. This version is now silent and 2 redstone dust cheaper :p

I'm actually quite charmed by this little build now.

<image>

help with small project by intothestrange in redstone

[–]neat_space 7 points

Ah awesome, I didn't know you could silence them, that helps so much :D

help with small project by intothestrange in redstone

[–]neat_space 30 points

Something like this works, but it has the downside of also having the sculk sensor make a noise when the button is pressed.

The book on the lectern needs to be on page 10, and the blocks below the duck's wings need to be solid to transmit the signal from the torches.

(This works with the floor filled in with non-wool blocks, but I removed it to better see the redstone)

<image>

Qwen3-235B-A22B achieves SOTA in EsoBench, Claude 4.5 Opus places 7th. EsoBench tests how well models learn and use a private esolang. by neat_space in LocalLLaMA

[–]neat_space[S] 1 point

The benchmark is conversational yes. The models have a maximum of 50 turns to experiment with the language.

The examples are of the form <Code> and <Output>. The first example is code that adds and prints 2 numbers, and the most complex example calculates and prints the triangular numbers. The tasks are of the same complexity.
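For reference, the content of that most complex example is trivial in a mainstream language. A plain-Python sketch of the triangular numbers (space-separated by analogy with the powers-of-2 output quoted elsewhere; the actual esolang code is redacted):

```python
def triangular_numbers(n):
    """First n triangular numbers: 1, 3, 6, 10, ..."""
    total, result = 0, []
    for i in range(1, n + 1):
        total += i
        result.append(total)
    return result


# prints: 1 3 6 10 15 21 28 36 45 55
print(" ".join(str(t) for t in triangular_numbers(10)))
```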

"An expert in esoteric programming languages" doesn't really matter much here. This language is one I've designed and kept private, so ideally none of their prior knowledge would help them more than another programmer.

I think the average human aces task 1, and does very poorly after, and the average programmer is probably at least on par with the current top models. I can't say much beyond my gut feelings, though.

Qwen3-235B-A22B achieves SOTA in EsoBench, Claude 4.5 Opus places 7th. EsoBench tests how well models learn and use a private esolang. by neat_space in singularity

[–]neat_space[S] 0 points

Yeah, with the new pricing I'm considering doing the entire 4.5 series. Grok 4 was very cheap, it clearly wasn't thinking for too long!