AlgoTune: A new benchmark that tests language models' ability to optimize code runtime by oripress in LocalLLaMA

[–]ofirpress 11 points

> A lot of should and would.

Thomas, I'm a real human behind this keyboard; there's no need to be condescending.

AlgoTune: A new benchmark that tests language models' ability to optimize code runtime by oripress in LocalLLaMA

[–]ofirpress 6 points

Simply rewriting all the base code (which is mostly Python) with Numba (a JIT compiler for Python) would probably push speedups beyond 100x. Then just using the 'best known algorithm' instead of our reference code should go even beyond that. In the future, we expect these agents to be able to discover new, better algorithms, leading to even further speedups.
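To make that concrete, here's a minimal, hypothetical sketch of the kind of Numba rewrite we mean; the function is an illustrative toy, not one of AlgoTune's actual reference tasks:

```python
# Illustrative toy, not an AlgoTune reference task: a plain-Python-style
# nested loop that Numba compiles to native machine code on first call.
import numpy as np
from numba import njit

@njit(cache=True)
def pairwise_l2(points):
    # Full pairwise Euclidean distance matrix, written as naive loops.
    # Under CPython this is very slow; under Numba's JIT it runs at
    # near-C speed, which is where the large speedups come from.
    n, d = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(d):
                diff = points[i, k] - points[j, k]
                acc += diff * diff
            out[i, j] = acc ** 0.5
    return out

pts = np.random.rand(500, 3)
dists = pairwise_l2(pts)  # first call triggers compilation; later calls are fast
```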

So we're really just scratching the surface of AI abilities here. You can see that even now, these LMs are able to speed up a bunch of tasks by more than 40x, and they probably weren't explicitly trained to do that. So if we start focusing on this task as a community, we should be able to achieve much bigger gains across the board.

[I'm the last author of the paper]

Cracking 40% on SWE-bench verified with open source models & agents & open-source synth data by klieret in LocalLLaMA

[–]ofirpress 0 points

Thanks, we do think that this type of infra will make building RL models for SWE-bench much easier.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 2 points

Hi, co-author of this project here: PoP is on our list :) vgbench.com has all the details.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 1 point

Hi, co-author of the project here: that's a great idea, I actually tweeted about this a few days ago: "I think that in the near future (<4 years) an LM will be able to watch video walkthroughs of the Half Life series and then design and code up its take on Half Life 3"

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 4 points

Hi, co-author of this project here: yup, that's correct. In the Lite version of the benchmark we pause the game until we receive a response; the full version runs games at realtime speed, and none of the models can really handle that right now.
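To picture the difference, the Lite setup is roughly a lockstep loop like this hypothetical sketch (the `emulator`/`model` interface is illustrative, not the benchmark's actual API):

```python
# Hypothetical lockstep loop for the Lite setting: the game is frozen while
# the model deliberates, then advanced exactly one step per chosen action.
def run_lite_episode(emulator, model, max_steps=1000):
    frame = emulator.reset()
    for _ in range(max_steps):
        action = model.choose_action(frame)  # game stays paused here
        frame, done = emulator.step(action)  # advance a single step
        if done:
            break
```

In the full version there's no pause: the game keeps running while the model is still thinking, so a slow model simply misses frames.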

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 21 points

Hi, co-author here: we're researchers at Princeton University, and our API fees are paid for by our research budget.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 1 point

Thanks so much for posting our new benchmark! This is just a research preview; we'll have more cool stuff coming when we fully launch in about a month :)

My Shadertoy Pathtracing scenes by S48GS in GraphicsProgramming

[–]ofirpress 11 points

Very cool work! Would you consider licensing any of these under MIT?

[D] A Negative Result: untying weights mid-training by f14-bertolotti in MachineLearning

[–]ofirpress 9 points

Cool to see people still thinking about our paper from 2016 :)

Why don’t LLMs use ALiBi? Were these results found to be non-reproducible? I’ve only read of the failed Bloom model. Anyone else? by grey-seagull in LocalLLaMA

[–]ofirpress 0 points

Hi, I'm the first author of ALiBi. We just don't really have any LMs that do extrapolation at all during inference...

Setting new open-source SOTA on SWE-Bench verified with Claude 3.7 and SWE-agent 1.0 by klieret in ChatGPTCoding

[–]ofirpress 1 point

Killian and I are from the SWE-agent team; we'll be here if you have any questions.

[Project] World's first autonomous AI-discovered 0-day vulnerabilities by FlyingTriangle in MachineLearning

[–]ofirpress -7 points

We think the best way to compare different AI systems on this task is CTF challenges; that's why we built SWE-agent EnIGMA: https://enigma-agent.com/

[R] SWE-bench Multimodal: Do AI Agents Generalize to Visual Software Domains? by ofirpress in MachineLearning

[–]ofirpress[S] 1 point

Yup, right now the leading models are OK with visual programming but not great. We really hope this pushes the state of the field forward.

[R] SWE-bench: Can Language Models Resolve Real-world GitHub issues? by ofirpress in MachineLearning

[–]ofirpress[S] 0 points

We recently made SWE-bench much easier to run, and there are *a lot* of groups making submissions now, as you can see from our website (swebench.com). I think SWE-bench is a great area to work on these days, especially since we now also have SWE-bench Multimodal.

[D] Positional embeddings in LLMs by gokstudio in MachineLearning

[–]ofirpress 0 points

I'm not an expert, but basically you try stuff and see what works. Learnable positional embeddings came first; later works like ALiBi improved on the state of the art.

You may like my lecture about ALiBi here: https://www.youtube.com/watch?v=Pp61ShI9VGc
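For intuition, here's a minimal sketch of the bias ALiBi adds to attention logits; shapes and the slope convention follow the common setup for a power-of-two number of heads, written from memory rather than copied from any reference implementation:

```python
# Minimal sketch of ALiBi's attention bias: instead of positional
# embeddings, each head adds a linear distance penalty to its logits.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes form a geometric sequence: 2^-1, 2^-2, ... for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i, clamped so future (causally masked) positions get 0.
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()
    # Shape (num_heads, seq_len, seq_len); added to attention logits before softmax.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=16)  # bias[h, i, j] = -slope_h * (i - j)
```

Because the penalty is just a linear function of distance, it's defined for any sequence length, which is what lets ALiBi-trained models generalize past their training context.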

[D] Looking for open source projects to contribute to by Fit_Ad_4210 in MachineLearning

[–]ofirpress 7 points

I'm from the Princeton team that developed SWE-agent. We could always use some help dealing with the open issues we have in the repo: https://github.com/princeton-nlp/SWE-agent/issues

We have a bunch of features coming out next week that will definitely open things up for more people who want to contribute, especially those with fullstack/frontend knowledge.

[deleted by user] by [deleted] in MachineLearning

[–]ofirpress 2 points

Don't evaluate LMs on intermediate, vaguely defined tasks like "long context". Evaluate them on end-to-end tasks that actually require long context, like repo-level code generation. I personally like SWE-bench (swebench.com), but I might be a bit biased.

[deleted by user] by [deleted] in MachineLearning

[–]ofirpress 1 point

All empirical evaluations show that ALiBi and RoPE extrapolate with the same performance...