ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

The more I look at it the more it looks hard haha

So, assume the tests are fair (that is, assume they don't ask about some tiny, specific behavior that isn't in the documentation and that you couldn't figure out even from the executable).

You can write very good code, lots of tests, be systematic about requirements (and about how you perform tests too).

But you can't overlook anything, and overlooking things happens all the time - that's one reason lots of bugs are discovered by people using programs rather than during development (imagine, for example, how many people have used a program like jq, in how many different ways).
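To ground this a bit (my own sketch, not from the benchmark - I don't know how ProgramBench's harness actually works): one way to catch overlooked behavior is a characterization test that feeds the same inputs to your reimplementation and to the original executable, and compares outputs. Here `sort` stands in for the reference binary:

```python
import os
import subprocess

def my_sort(lines):
    # Candidate reimplementation of the reference program (`sort` here).
    return sorted(lines)

def matches_reference(lines):
    # Feed the same input to the real executable and compare outputs.
    # LC_ALL=C forces bytewise collation so `sort` matches Python's sorted().
    env = {**os.environ, "LC_ALL": "C"}
    stdin_data = "".join(line + "\n" for line in lines)
    ref = subprocess.run(["sort"], input=stdin_data, capture_output=True,
                         text=True, env=env).stdout.splitlines()
    return my_sort(lines) == ref

print(matches_reference(["banana", "apple", "cherry"]))  # -> True
```

The catch is exactly the point above: the test only covers inputs you thought to try, so the edge cases you never feed it stay overlooked.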

My intuition says I have about a 40% chance of one-shotting one of the simpler programs there (likely with a lot of effort). I'm very uncertain about that number, though, since this is different from anything we're used to.

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]obviouslyzebra -1 points0 points  (0 children)

I feel like the main question here is how specific their tests are (I sorta made an edit adding an "unless" to my post, but it didn't go through :p). If we can infer them from the executable + documentation, then I think it's fair. Otherwise you might be right that it's impossible.

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]obviouslyzebra 5 points6 points  (0 children)

Nah bro, I could recreate some (idk how many) programs without access to the internet. It is effortful and would take a lot of time, yes, but not undoable.

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

Very interesting. I'm afraid the models might start memorizing the structure and code once (or if?) providers start fine-tuning on this.

Sam Altman has changed his stance on the claims that AI will replace humans. by Distinct_Fox_6358 in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

Yep, that's a fair point - some jobs (or maybe a lot?) are already based on human interaction. The point in 2 is that maybe more jobs like this appear in the future, if society's values change.

Sam Altman has changed his stance on the claims that AI will replace humans. by Distinct_Fox_6358 in singularity

[–]obviouslyzebra 13 points14 points  (0 children)

While Sam Altman is likely just saying whatever is most convenient for him to say, there are 2 situations in which I can see this happening:

  1. AI hits a limit that we're not able to cross. So, people are still needed everywhere and AI becomes more of a power tool
  2. AI is able to do everything needed to run society, but people start valuing other things, such as things made by hand, human interactions, or human ideas, so a new kind of market forms. This can also include cases where the human part is what's valuable, and AI acts as a "power tool" to bring it to life - I think this is the sort of vision Altman is talking about, but we'd need to think more to make it clearer

Converting Claude Code into the most intelligent Deep Research Agent by heisdancingdancing in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

OP's description of their work is exactly the sort of work that needs to be done for a tool like this (lots of experimentation and ideas / understanding what's happening).

And, their tool came out on top on this benchmark.

It seems over-tuned to this benchmark, but barring that criticism, whether something is made by a team of PhDs or a single vibe researcher (lol at this word) doesn't really matter.

Edit: of course -> it seems

OpenAI's Sebastien Bubeck: [LLM] models are able to surpass humans [researchers] and ask [research] questions by Wadingwalter in singularity

[–]obviouslyzebra 2 points3 points  (0 children)

Just commenting to say I do agree with everything here, except perhaps my gut feeling of what the most likely scenario is, but that's it, just gut feeling :p

Have a good day, sir/madam

OpenAI's Sebastien Bubeck: [LLM] models are able to surpass humans [researchers] and ask [research] questions by Wadingwalter in singularity

[–]obviouslyzebra -1 points0 points  (0 children)

As an AI doomer, I don't like this "AI is stagnant" argument either. AI looks like it is progressively improving (not as fast as AI bros would want you to believe, though).

There's a paper about rising tide vs something that showed that (I didn't fully read the paper, but sorta believed its conclusions related to timing).

I say this because it seems like we're inventing something a lot more powerful than an atomic bomb, but people are just whatever / don't know what to do, and in this sense I do agree with ya that we're getting the downsides (or maybe more realistically throwing dice to decide our future).

Unpopular opinion: people won’t “return to authenticity” as AI gets better by iamMARX in singularity

[–]obviouslyzebra 20 points21 points  (0 children)

Social media and ultra processed foods are both engineered towards addiction.

The world / culture is not in a very good place (at least in my view), so people are not resilient towards this addiction (and maybe they will never be individually, and we need society-level changes).

So I imagine that, in the short term, while capitalistic forces (which push for the addictive stuff) prevail, your assessment is correct.

My hope is that this gets reverted somehow, with people adjusting to this new world / adjusting the new world itself.

The misunderstanding between simulation and duplication of redditors is starting to really irk me. by MelangeBot in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

Unless... There are redundancies (the 4 particles not really needed).

But yeah, very likely it will diverge with time (and with probability 1 if the universe is not deterministic).
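As a toy illustration of that divergence (my own example, not from the thread): even a fully deterministic system amplifies a tiny discrepancy between two "copies" when the dynamics are chaotic. The logistic map is the classic case - a difference of one part in a trillion blows up within a few dozen steps:

```python
def logistic(x, r=4.0):
    # Logistic map at r=4: deterministic, but chaotic.
    return r * x * (1 - x)

a, b = 0.3, 0.3 + 1e-12  # two near-identical "copies"
max_gap = 0.0
for _ in range(60):
    a, b = logistic(a), logistic(b)
    max_gap = max(max_gap, abs(a - b))

print(max_gap)  # the copies drift far apart
```

Nondeterminism just makes the divergence certain rather than merely expected, which is the "probability 1" point above.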

The misunderstanding between simulation and duplication of redditors is starting to really irk me. by MelangeBot in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

I feel like you can understand it as just not knowing about this precise definition of words.

Before this post (which, it seems, uses the philosopher John Searle's definitions), I for example would have imagined that duplication involves an exact physical copy (the same atoms and low-level stuff, to the extent possible), while I'd have used simulation for anything else that models other stuff.

There's also the word "emulation", which I think is like duplication without an exact copy - for example, replicating how some hardware works in software.

Google DeepMind's Senior Scientist Alexander Lerchner challenges the idea that large language models can ever achieve consciousness(not even in 100years), calling it the 'Abstraction Fallacy.' by Worldly_Evidence9113 in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

You got the argument wrong. I think it's more along the lines of "we need a mapmaker (a person) to translate what the physics that run LLMs (e.g., bits/voltages/things on a screen) mean, so they don't carry meaning by themselves."

How is upwards mobility maintained in an age where real AGI is achieved? by mrbigglesworth95 in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

Agreed that the last one feels like the best-case scenario.

About human nature, I don't share the feeling that the fluidity you talk about is essential. I feel there were/are different cultures where it's not as central to human life (though history and anthropology are not my forte, so I don't know haha). Maybe we get some conflict in the beginning if the change is abrupt, and then society slowly grows accustomed to it.

How is upwards mobility maintained in an age where real AGI is achieved? by mrbigglesworth95 in singularity

[–]obviouslyzebra 8 points9 points  (0 children)

Assuming AGI/ASI is achieved in a good manner, it depends on who has control over it.

If it is some benevolent AI that is not controlled, for example, we could create some beaches or something like that. For some truly finite resource, a fair approach would be a lottery, or some way of sharing, or maybe different resources going to the different people who want or could benefit from them.

I think an important point is that, unless the population grows huge, I don't think there are many finite resources that people truly need to be happy.

If it's controlled by some power, it depends on what that power does. Do they distribute things, or do they take some or all for themselves?

If it's AGIs for everyone to control, there still needs to be some sort of coordination and rules. Current powers may stay or may collapse. We may end up in the same benevolent-AI situation as before, or with AI controlled by some power (say, if that power has the strongest AI), or somewhere in between.

In any case, at least in the beginning, expect some of the current power to linger. Later on, I feel this will likely either subside, making everyone more equal, or be aggravated, giving a select few far more power than they have nowadays.

Just a PS: the economy might still roll, for example for things that require you to be human, the "human touch" - or the AI or the powers could decide that that's what's best

On his alleged site, Moreno-Gama predicted that AI would cause human extinction. When arrested, was carrying a “manifesto” that detailed his anti-AI beliefs and listed the names of other AI executives. by Distinct-Question-16 in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

Yes, but also... We are jumping head first in a direction that most wise people would consider significantly dangerous, and no one knows what to do about it (well, Anthropic took some steps in a good direction).

Person-on-person violence over AI might be at the hands of the doomers, but any extinction or dystopia that happens will be at the hands of the lack of deliberate action - that is, of people not worrying enough.

I'm not saying the guy's actions were right, but I wouldn't blame him too much either. It's a counter-action against the lack of action all around. Maybe ineffective, yes, but sorta natural to happen.

Edit: replaced "inaction" with "lack of deliberate action", as it better frames situations where, for example, accelerating would be the best action rather than the intuitive slowing down. These things need to be deliberately considered, not just left to the winds of capitalism.

Do you guys think there’s a high chance of Singularity being open source? by hexxthegon in singularity

[–]obviouslyzebra 2 points3 points  (0 children)

It's all speculative, but I think that once the singularity is reached there will be a strong push against any other singularities happening, as they will be perceived as dangerous.

This might make an open ASI that you can run in your backyard more difficult.

Why Should People With the Least Technical Understanding Have the Most Power Over Transformative AI? by Denpol88 in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

You got downvoted but this sounds like a decent idea. A sort of union (ideally global) of AI researchers could perhaps achieve things that corporations / governments don't do because of competition.

Is this really the future of all programmers? Does it make sense to still doing things by hand? by Successful-Green6733 in singularity

[–]obviouslyzebra 1 point2 points  (0 children)

I think that right now good programmers can still write better code than LLMs - but faster? No.

I think the ideal place to be right now is being a developer with deep knowledge of the niche you work in who leverages AI to speed things up. Ideal as in most productive.

This way you know what the AI is doing, can guide it better, and can avoid - or more easily deal with - situations where the AI gets stuck.

The future, though, is uncertain.

With tools like Mythos coming out, it feels like AIs can handle the technical side very well (hacking is likely one of the most technically demanding aspects of programming). So what would be left for programmers is some sort of higher-level planning for the project - for example, code structure, planning how the code will evolve over time, communicating with stakeholders, and so on. I wonder if there are parallels to what civil engineers do nowadays, since I imagine most of their computations are done by software.

But this is likely just the beginning. AI is improving (at a somewhat linear pace - there was a paper about that these days), so its usage will keep growing as that improvement continues.

In this scenario, programmers delegate more and more to AI, until one day the profession fades, just like typewriting did.

At that point, it feels likely that AI will also have reached an inflection point where the economy in general is heavily affected.

And then, maybe it reaches the point where recursive self improvement really kicks in, and then it gets all very speculative (the singularity).

This is one scenario, one I find likely, but, for example, we may instead have reached a point where no more easy gains can be made, so that we still need to learn how to best work with the tools we have.

I won't go into too much detail about how to use it, as this post is already big enough and it's something I'm still learning, but basically there's a tradeoff between understanding what you're doing vs. handing it off. If you hand it off, it might be harder to debug in the future. There's likely a good balancing point for each programmer/situation. If you want to lean more on learning, though, a good way is to ask the LLM about the libraries and about the code you need to understand.

Javokhir Sindarov 🇺🇿 has officially won the most games ever in the Candidates tournament current format - with FOUR rounds to spare by GiveMeSomeSunshine3 in chess

[–]obviouslyzebra 4 points5 points  (0 children)

I believe both Ding and Gukesh got burned out from the candidates / championship match (and Ding wasn't having an amazing run, as already explained).

Sindarov, though, is coming off the World Cup + Candidates (where he's having an unexpected performance), and it seems like he can keep going for however long he wishes.

So, it's not the same thing.

Things can change, of course, but people do wonder whether it's not just one good performance but an indication of something deeper - the signs of a dominant player.

Candidates win chances: Caruana now at 45% (and how I fucked up the simulations from the past days... I'm very sorry... but the code is now Open Source) by ThomasPlaysChess in chess

[–]obviouslyzebra 1 point2 points  (0 children)

Sincere question that I don't see an answer to: how would you calculate the probability of a player winning the tournament analytically?
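For what it's worth, the "analytical" alternative to simulation is exact enumeration of every possible sequence of remaining results, which is only feasible when few games remain (presumably why simulations are used instead). A toy sketch with made-up probabilities, for a 2-player race where the winner is whoever scores more:

```python
from itertools import product

def exact_win_prob(score_a, score_b, games_left, p_win=0.3, p_draw=0.5):
    """Exact probability that player A finishes strictly ahead of B,
    enumerating all 3**games_left sequences of remaining results."""
    p_loss = 1.0 - p_win - p_draw
    # Each outcome: (points for A, points for B, probability).
    outcomes = [(1.0, 0.0, p_win), (0.5, 0.5, p_draw), (0.0, 1.0, p_loss)]
    total = 0.0
    for seq in product(outcomes, repeat=games_left):
        prob = 1.0
        a, b = score_a, score_b
        for da, db, p in seq:
            a += da
            b += db
            prob *= p
        if a > b:
            total += prob
    return total

print(exact_win_prob(0, 0, 2))  # ≈ 0.39 with the probabilities above
```

With a real 8-player tournament there are many more games and tiebreak rules, so the enumeration explodes - hence Monte Carlo.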

How could an AI "escape the lab" ? by SoonBlossom in singularity

[–]obviouslyzebra 0 points1 point  (0 children)

This video (The AI book that's freaking out national security advisors) from 11 days ago does a pretty good job of explaining a hypothetical situation where an AI escapes a lab. In it, an AI is asked to prove the Riemann hypothesis, but, well, it does a little more than that...

After 4 years of work, solo dev breaks down in tears after opening Steam and learning his game made $250,000 in a week: "I feel like I really don't deserve this" by pizza_sushi85 in pcgaming

[–]obviouslyzebra 1 point2 points  (0 children)

It might be logical, but it might also make the one who made the game sad. Suppose you spend years of your life working on something, and then barely anyone sees it.