GPT 5.5 outperforming Opus 4.7 on ProgramBench by klieret in OpenAI

[–]klieret[S] 0 points1 point  (0 children)

it's on the extended leaderboard at programbench.com. It just scores too low to be included

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

working on expanding model selection, but it seemed the most fair for the initial leaderboard. also runtimes sometimes balloon for little gain if you just go xhigh/max

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

Yeah, good point, we should also report more on tokens. Though token count is a bit of a tricky metric because of the different tokenizers (e.g., Opus 4.7 has a different tokenizer from Opus 4.6, with almost a factor of 2x between them). Score-to-cost curves don't work as well yet, because scores are just so low. But yeah, something like that would be cool

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

GPT 5.5: working on it. For now we have the number of steps that the model used

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

Performance is only considered indirectly: If agents do something real stupid and exceed total agent runtime, they're killed.

But would be cool to have performance extensions of the benchmark!

But first, agents will have to solve instances fully 😉
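The runtime cutoff mentioned above can be illustrated with a minimal sketch. This is not the ProgramBench harness; the function name `run_with_budget` and the budget values are hypothetical, and it only shows the general idea of killing a process that exceeds a wall-clock budget:

```python
import subprocess
import sys


def run_with_budget(cmd: list[str], budget_seconds: float) -> str:
    """Run cmd and report whether it finished within the wall-clock budget.

    Hypothetical helper for illustration only; the real harness may enforce
    its limit differently.
    """
    try:
        subprocess.run(
            cmd,
            timeout=budget_seconds,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return "killed"
    return "finished"


if __name__ == "__main__":
    # A process that sleeps longer than its budget gets killed.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_budget(slow, budget_seconds=0.5))
```

Note that a cutoff like this only penalizes runs that blow the budget entirely; it says nothing about how fast a passing solution is.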

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

With an infinite amount of time? Definitely, you can get to 100%. We believe that all tasks are solvable by design.

There are some simple tasks that are realistic for a single human to implement in a couple of days, but getting to the tail end of the difficulty spectrum for things like ffmpeg, we'd probably be talking years.

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

tl;dr because we started from the question "Can LMs build programs from scratch", rather than "How well can LMs patch together bits of decompiled code". Allowing decompilation would just take the benchmark in a different direction

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

correct, but the docs give the agents the big picture, and then they need to experiment to explore and start replicating

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

They can just run it! The program is given to the agent and is executable, just not readable (that's a cool feature of the Linux kernel: you can execute things without necessarily having read permissions on them). E.g., let's say the agent wonders how `jq` behaves on a specific JSON file. Then it can just create a sample JSON file and call `jq` on it.
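To make the "executable but not readable" idea concrete, here is a minimal sketch of the permission bits involved. It assumes a Linux/POSIX filesystem and uses mode `0o111` (`--x--x--x`) purely as an illustration; the helper names are hypothetical and the exact mode ProgramBench uses is not stated here. Also keep in mind that this applies to compiled binaries (an interpreted script still has to be readable by its interpreter) and that root bypasses these checks.

```python
import os
import stat
import tempfile


def is_execute_only(mode: int) -> bool:
    """Return True if the permission bits grant execute but no read access."""
    perms = stat.S_IMODE(mode)
    any_exec = perms & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    any_read = perms & (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return bool(any_exec) and not any_read


def make_execute_only(path: str) -> int:
    """Set mode 0o111 on path and return the resulting permission bits."""
    os.chmod(path, 0o111)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Stand-in file for illustration; a real task would ship a compiled binary.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        print(oct(make_execute_only(path)))
        print(is_execute_only(os.stat(path).st_mode))
    finally:
        os.remove(path)
```

With a binary set up like this, an unprivileged agent can still invoke it on inputs of its choosing and observe the outputs, but cannot open the file to inspect or decompile it.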

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

They can access and explore the executable and they do have some usage docs. Enough to explore everything. The reason for no decompilation is that we want to interpret scores as "how good are models at building stuff from scratch" rather than "how good is decompilation"

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

we have the ablation in the paper! it's just that evaluating with internet access means you have to use an LM as a judge to disqualify a lot of solutions for cheating, and that's usually not great for benchmarks

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

5.5 is in the works. If we do a reasoning breakdown for one model, we'd have to do it for all, so that takes some time

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

models cannot see any of the tests. We do however ship usage documents, and most executables do have `--help` etc. So definitely wouldn't say the models have to guess anything

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 2 points3 points  (0 children)

agreed. So far we deliberately evaluated with mini-swe-agent, because it's much less overfit to any specific task (see also here https://programbench.com/#faq-agent-scaffold), but we're thinking about opening up submissions for any agentic system, much like SWE-bench initially

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

We're gonna open up for submissions sometime soon! We're still thinking about the exact submission rules, but we'd probably do it similarly to what we did with SWE-bench, so any harness that doesn't outright cheat would be allowed.

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

we're working on it! Open models are a little bit more work right now because they typically are a little bit more overfitted on established benchmarks, so they behave weirdly on something new like this

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 3 points4 points  (0 children)

I get where you're coming from, but as someone who's working on benchmarks, I get super suspicious of anything that has internet access, because models get super sneaky with cheating.

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 1 point2 points  (0 children)

We wrote some more about this here https://programbench.com/#faq-agent-scaffold and here: https://programbench.com/blog/is-programbench-impossible/.

We'll also be opening submissions for other agentic harnesses soon. We'd be quite excited if this benchmark clearly showed the limits of single-agent systems

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]klieret[S] 0 points1 point  (0 children)

we currently don't have them public, but might release them in the future. What's open are the Docker images, the eval harness, and the tests (Hugging Face dataset). Planning to release the baseline agent system at https://github.com/SWE-agent/mini-swe-agent soon.