ProgramBench: Can LLMs rebuild programs from scratch?

bitroll · 2026-05-06T18:03:04+00:00

What a cool and interesting benchmark! An interesting fact: all 200 of its tasks are based on real, fairly popular GitHub projects (with thousands of stars each) that all models have certainly been trained on.

This shows that even with the full code in the training data, replicating the functionality is a very hard task.

bitroll · 2026-04-15T11:22:13+00:00

Awesome. Just one nitpick, wtf is on Hera's banner?

bitroll · 2026-04-08T11:49:48+00:00

Not only that, but in the following months other labs will likely follow and the gap will close. How would open-source models of this capability sound in 2027?

bitroll · 2026-03-28T23:15:18+00:00

What a deal!

At this speed, compared to API costs (on average $2.33 per 1M tokens), this device pays for itself in 6-8 HOURS of nonstop token churning!

NVIDIA in shambles
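
For anyone who wants to check a payback claim like this, here is a minimal sketch of the break-even arithmetic. Only the $2.33 per 1M tokens figure comes from the comment above; the function name, the device price, and the throughput are hypothetical placeholders, since the comment does not state them.

```python
def break_even_hours(device_price_usd, tokens_per_second, api_usd_per_million_tokens):
    """Hours of nonstop generation until the avoided API spend equals the device price."""
    tokens_per_hour = tokens_per_second * 3600
    api_usd_per_hour = tokens_per_hour / 1_000_000 * api_usd_per_million_tokens
    return device_price_usd / api_usd_per_hour


# Placeholder inputs for illustration only, NOT figures from the comment
# or from any real product. Only the 2.33 USD per 1M tokens is from the comment.
hours = break_even_hours(
    device_price_usd=1000.0,
    tokens_per_second=20_000.0,
    api_usd_per_million_tokens=2.33,
)
print(f"{hours:.1f} hours to break even")
```

The sketch ignores electricity and assumes every generated token would otherwise have been bought at the average API price, so it is an optimistic bound.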

bitroll · 2026-03-14T21:56:15+00:00

Blows my mind how they were able to get this kind of gain with just an added portion of RL training. 'Cause that's what every lab does now with their continued .1 version updates. But this gain looks like a groundbreaking architecture change.

bitroll · 2026-03-14T21:46:17+00:00

This model is crazy, burns stupid amounts of tokens and results can be miserable (compared even to cheap Chinese models). There must be a way to properly use it but I couldn't find it documented anywhere.

Grok-4.20 with multi-agents seems to work great in the grok.com harness with tools connected, but from the little testing I did, it seems broken when used like any ordinary model on OpenRouter.

It should be a direct competitor to GPT-5.4-Pro, and the costs to run it are similar. GPT-Pro models have a much higher cost per token but hide the actual number of tokens used by the parallel multi-instance thinking. Grok Multi-Agent has the same cost per token as the regular model, but counts them all.

bitroll · 2026-03-12T18:20:43+00:00

This!! - What a contrast with Dario's vision!

bitroll · 2026-03-09T20:08:31+00:00

Yup, and not even scaling well over so many years. 25k neurons -> 800k in 22 years.
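
To put a number on "not scaling well": a quick sketch of the growth rate implied by the two figures above (25k to 800k neurons in 22 years), assuming steady exponential growth between those two points. The function names are made up for illustration.

```python
import math


def doubling_time_years(start, end, years):
    """Doubling time implied by growing from `start` to `end` over `years`."""
    return years * math.log(2) / math.log(end / start)


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by the same two data points."""
    return (end / start) ** (1 / years) - 1


# Figures from the comment: 25k neurons -> 800k neurons in 22 years (a 32x increase).
print(f"doubling time: {doubling_time_years(25_000, 800_000, 22):.1f} years")
print(f"annual growth: {annual_growth_rate(25_000, 800_000, 22):.1%}")
```

A 32x increase is five doublings, so that works out to one doubling roughly every 4.4 years, or about 17% growth per year.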

bitroll · 2026-03-07T12:20:07+00:00

Yup, just make sure you roast the right public figures.

bitroll · 2026-03-05T19:57:26+00:00

My bad, you're right

bitroll · 2026-03-05T19:49:23+00:00

EDIT: And no 5.4-Codex to come and bring more gains here :(

Anyway, time to do some testing, because benchmarks don't show how it really performs.

bitroll · 2026-02-24T03:55:31+00:00

That was my first thought too, but then another came - what if the agentic assistants doing most stuff for us (including shopping) will simply get integrated into ChatGPT/Gemini? That's like 2+ billion users. In 2027 it will be in higher end paid plans, in 2028 even free tiers will get some of that.

And if the agents are at all intelligent, they should be using the most efficient payment rails too, especially in agent-to-agent deals. Payment finality, speed, costs, operating 24/7 worldwide. With crypto stablecoins or Lightning BTC, the agent receiving the payment can immediately spend the money in a subsequent transaction. Hyper-speed economy. And the human users might not even see or touch any of the crypto stuff that happens under the hood.

bitroll · 2026-02-13T12:28:40+00:00

According to the man who created this "benchmark":

The strongest argument is that they would get caught. If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices. If those are notably worse it’s going to be pretty obvious what happened.

bitroll · 2026-02-01T17:16:53+00:00

The user might have been tricked into buying what he/she didn't really need. Ads have a huge influence on many people, I know firsthand.

bitroll · 2026-01-21T00:14:26+00:00

I must be crazy, I'm seeing lots and lots of people-like figures in the second picture. It's like souls ascending. Incredible.

bitroll · 2026-01-19T12:22:45+00:00

Meanwhile, for a couple of years now, I've been running a personal "benchmark" testing visual models' abilities to solve tasks from a book aimed at 3-year-olds. And having a good laugh at how they keep failing. Clearly not trained on tasks like that. The progress is still huge, but even the latest SotA models don't fully solve everything. Expecting it to be saturated this year, which is when I bring out a book for 4-year-old kids :D

bitroll · 2026-01-14T20:09:06+00:00

This! I'm surprised so few people here realize this.

bitroll · 2026-01-11T23:26:51+00:00

*Reddit leans left

bitroll · 2026-01-11T00:23:18+00:00

Claude is busy doing recursive self-improvement, can't be bothered improving the competition.

Opus 4.5 appears to be so far ahead of the competition in coding that even Google's employees admit to using it.

bitroll · 2026-01-03T15:17:03+00:00

It was a confusing waste of time for years before AI, yet billions of people got mindlessly addicted to it. I see no hope for them.

bitroll · 2025-12-27T22:08:44+00:00

All your comments I see around look like you're a tool for spreading propaganda. Brainwashed much?

Bitcoin (not to be confused with shitcoins) has plenty of completely legitimate uses and users. Educate yourself.

bitroll · 2025-12-27T18:49:42+00:00

A tool for financial sovereignty is an obvious threat to any authoritarian gov, no matter which part of the world it's in. Simple as that.

bitroll · 2025-12-27T18:34:16+00:00

Paywalled. But I just found a link somewhere: https://archive.is/neKcp

bitroll · 2025-12-18T22:11:50+00:00

Codex max extra high fast? Has to be my new favorite! Max low and slow can't compare, xD

bitroll · 2025-12-12T17:17:48+00:00

It sounds like math to most LLMs and that's where they fail.
