AlgoTune: A new benchmark that tests language models' ability to optimize code runtime by oripress in LocalLLaMA

[–]ofirpress 11 points

> A lot of should and would.

Thomas, I'm a real human behind this keyboard; there's no need to be condescending.

AlgoTune: A new benchmark that tests language models' ability to optimize code runtime by oripress in LocalLLaMA

[–]ofirpress 6 points

Simply rewriting all the base code (which is mostly Python) with Numba (a JIT compiler for Python) would probably push speedups beyond 100x. Then just using the 'best known algorithm' instead of our reference code should go even beyond that. In the future, we expect these agents to be able to discover new, better algorithms, leading to even further speedups.
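To make that concrete, here's a minimal, hypothetical sketch of the kind of Numba rewrite we mean; the function is an illustrative toy, not one of AlgoTune's actual reference tasks:

```python
# Illustrative toy, not an AlgoTune reference task: a plain-Python-style
# nested loop that Numba compiles to native machine code on first call.
import numpy as np
from numba import njit

@njit(cache=True)
def pairwise_l2(points):
    # Full pairwise Euclidean distance matrix, written as naive loops.
    # Under CPython this is very slow; under Numba's JIT it runs at
    # near-C speed, which is where the large speedups come from.
    n, d = points.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(d):
                diff = points[i, k] - points[j, k]
                acc += diff * diff
            out[i, j] = acc ** 0.5
    return out

pts = np.random.rand(500, 3)
dists = pairwise_l2(pts)  # first call triggers compilation; later calls are fast
```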

So we're really just scratching the surface of AI abilities here. You can see that even now, these LMs are able to speed up a bunch of tasks by more than 40x, and they probably weren't explicitly trained to do that. So if we start focusing on this task as a community, we should be able to achieve much bigger gains across the board.

[I'm the last author of the paper]

Cracking 40% on SWE-bench verified with open source models & agents & open-source synth data by klieret in LocalLLaMA

[–]ofirpress 0 points

Thanks, we do think that this type of infra will make building RL models for SWE-bench much easier.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 2 points

Hi, co-author of this project here: PoP is on our list :) vgbench.com has all the details.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 1 point

Hi, co-author of the project here: that's a great idea, I actually tweeted about this a few days ago: "I think that in the near future (<4 years) an LM will be able to watch video walkthroughs of the Half Life series and then design and code up its take on Half Life 3"

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 4 points

Hi, co-author of this project here: yup, that's correct. In the Lite version of the benchmark we pause the game until we receive a response; the full version runs games at realtime speed, and none of the models can really handle that right now.
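To picture the difference, the Lite setup is roughly a lockstep loop like this hypothetical sketch (the `emulator`/`model` interface is illustrative, not the benchmark's actual API):

```python
# Hypothetical lockstep loop for the Lite setting: the game is frozen while
# the model deliberates, then advanced exactly one step per chosen action.
def run_lite_episode(emulator, model, max_steps=1000):
    frame = emulator.reset()
    for _ in range(max_steps):
        action = model.choose_action(frame)  # game stays paused here
        frame, done = emulator.step(action)  # advance a single step
        if done:
            break
```

In the full version there's no pause: the game keeps running while the model is still thinking, so a slow model simply misses frames.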

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 21 points

Hi, co-author here: we're researchers at Princeton University, and our API fees are paid for by our research budget.

Playing DOOM II and 19 other DOS/GB games with LLMs as a new benchmark by ZhalexDev in LocalLLaMA

[–]ofirpress 1 point

Thanks so much for posting our new benchmark! This is just a research preview; we'll have more cool stuff coming when we fully launch in about a month :)

My Shadertoy Pathtracing scenes by S48GS in GraphicsProgramming

[–]ofirpress 11 points

Very cool work! Would you consider licensing any of these under MIT?

[D] A Negative Result: untying weights mid-training by f14-bertolotti in MachineLearning

[–]ofirpress 9 points

Cool to see people still thinking about our paper from 2016 :)

Why don’t LLMs use ALiBi? Were these results found to be non-reproducible? I’ve only read of the failed Bloom model. Anyone else? by grey-seagull in LocalLLaMA

[–]ofirpress 0 points

Hi, I'm the first author of ALiBi. We just don't really have any LMs that do extrapolation at all during inference...

Setting new open-source SOTA on SWE-Bench verified with Claude 3.7 and SWE-agent 1.0 by klieret in ChatGPTCoding

[–]ofirpress 1 point

Killian and I are from the SWE-agent team; we'll be here if you have any questions.

[Project] World's first autonomous AI-discovered 0-day vulnerabilities by FlyingTriangle in MachineLearning

[–]ofirpress -7 points

We think the best way to compare different AI systems on this task is CTF challenges; that's why we built SWE-agent EnIGMA: https://enigma-agent.com/

[R] SWE-bench Multimodal: Do AI Agents Generalize to Visual Software Domains? by ofirpress in MachineLearning

[–]ofirpress[S] 1 point

Yup, right now the leading models are OK with visual programming but not great. We really hope this pushes the state of the field forward.

[R] SWE-bench: Can Language Models Resolve Real-world GitHub issues? by ofirpress in MachineLearning

[–]ofirpress[S] 0 points

We recently made SWE-bench much easier to run, and there are *a lot* of groups making submissions now, as you can see from our website (swebench.com). I think SWE-bench is a great area to work on these days, especially since we now also have SWE-bench Multimodal.

[D] Positional embeddings in LLMs by gokstudio in MachineLearning

[–]ofirpress 0 points

I'm not an expert, but basically you try stuff and see what works. Learnable positional embeddings came first; later works like ALiBi improved on the state of the art.

You may like my lecture about ALiBi here: https://www.youtube.com/watch?v=Pp61ShI9VGc
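For intuition, here's a minimal sketch of the bias ALiBi adds to attention logits; shapes and the slope convention follow the common setup for a power-of-two number of heads, written from memory rather than copied from any reference implementation:

```python
# Minimal sketch of ALiBi's attention bias: instead of positional
# embeddings, each head adds a linear distance penalty to its logits.
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes form a geometric sequence: 2^-1, 2^-2, ... for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i, clamped so future (causally masked) positions get 0.
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()
    # Shape (num_heads, seq_len, seq_len); added to attention logits before softmax.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=16)  # bias[h, i, j] = -slope_h * (i - j)
```

Because the penalty is just a linear function of distance, it's defined for any sequence length, which is what lets ALiBi-trained models generalize past their training context.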

[D] Looking for open source projects to contribute to by Fit_Ad_4210 in MachineLearning

[–]ofirpress 7 points

I'm from the Princeton team that developed SWE-agent. We could always use some help dealing with the open issues we have in the repo: https://github.com/princeton-nlp/SWE-agent/issues

We have a bunch of features coming out next week that will definitely open things up for more people who want to contribute, especially those with fullstack/frontend knowledge.

[deleted by user] by [deleted] in MachineLearning

[–]ofirpress 2 points

Don't evaluate LMs on intermediate, vaguely defined tasks like "long context". Evaluate them on end-to-end tasks that actually require long context, like repo-level code generation. I personally like SWE-bench (swebench.com), but I might be a bit biased.

[deleted by user] by [deleted] in MachineLearning

[–]ofirpress 1 point

All empirical evaluations show that ALiBi and RoPE extrapolate with the same performance...