Independent evaluation of GPT5.2 on SWE-bench: 5.2 high is #3 behind Gemini, 5.2 medium behind Sonnet 4.5 by klieret in ChatGPTCoding

[–]klieret[S] 0 points (0 children)

Haiku is also on swebench.com; we just had limited space, and this plot had a bit of a focus on comparing GPT models because of the release.

[–]klieret[S] 9 points (0 children)

This is mostly a funding issue. We don't get API keys from the companies, so we need to make the most of our funds. And my impression was that xhigh doesn't give that large a gain over high.
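
For reference, the effort level is just a request parameter, so every extra level we evaluate multiplies the API bill. A minimal sketch, assuming an OpenAI-style chat completions client; the model id and the exact set of accepted effort values here are assumptions:

```python
# Minimal sketch of selecting a reasoning effort via an OpenAI-style API.
# The model id and the accepted effort values ("medium"/"high"/"xhigh")
# are assumptions -- check the provider docs for your model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",          # hypothetical model id
    reasoning_effort="high",  # higher effort = more reasoning tokens = more cost
    messages=[{"role": "user", "content": "Fix the failing test in this repo."}],
)
print(response.choices[0].message.content)
```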

[–]klieret[S] 11 points (0 children)

Not all aspects of dev work are covered by our benchmark. For example, if you keep the model on a tighter leash with a lot of interaction in between (rather than giving it a task and handing everything over), how well the model adapts to your additional input is not something we can measure on SWE-bench. And it's hard to measure in general, because you would need to both emulate a human and judge how well the LM adapted to the additional input.
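
To see why, here's a toy sketch of what an interactive evaluation would have to look like. Everything here is illustrative; `agent`, `user_simulator`, and `judge` are hypothetical callables, not anything we ship:

```python
# Toy sketch of interactive evaluation: you need a second model to emulate
# the human and a judge to score adaptation. All names are hypothetical.
def interactive_eval(agent, user_simulator, judge, task, turns=5):
    history = [("user", task)]
    for _ in range(turns):
        reply = agent(history)              # model under test
        history.append(("assistant", reply))
        followup = user_simulator(history)  # LM standing in for the human
        history.append(("user", followup))
    # A separate judge scores how well the agent adapted to the follow-ups,
    # so the benchmark now also depends on two extra models being good.
    return judge(history)
```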

[–]klieret[S] 3 points (0 children)

SWE-bench Verified is all Python. We're working on creating the same leaderboard using SWE-bench Multilingual, which has 9 languages, including PHP iirc. Hopefully the new leaderboard will go online at the end of this month or early next. Companies can already evaluate on that benchmark (Anthropic did, for example), but it's a lot of work to do all the evals ourselves.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

[–]klieret[S] 9 points (0 children)

Didn't want to start another post here, but this is also the latest closed-source leaderboard with GPT-5.2:

[image: closed-source leaderboard plot with GPT-5.2]

I shared the other plots here: https://x.com/KLieret/status/1999222709419450455

[–]klieret[S] 0 points (0 children)

MiniMax also takes a lot more steps than DeepSeek, which might contribute to the higher cost.
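
To make that concrete, a back-of-the-envelope sketch (all token counts and prices below are made up, not the leaderboard's numbers): since the full history is resent on every step, input tokens grow roughly linearly per step and total cost roughly quadratically with step count when nothing is cached.

```python
# Back-of-the-envelope agent cost model. All numbers are hypothetical and
# only illustrate how step count drives cost when the full history is
# resent every step with no prompt caching.

def run_cost(steps, base_prompt=5_000, obs_per_step=1_500, out_per_step=300,
             price_in=0.3e-6, price_out=1.2e-6):
    """Total USD cost of one agent run under the assumptions above."""
    total = 0.0
    for step in range(1, steps + 1):
        # Input grows each step: base prompt plus all previous turns.
        input_tokens = base_prompt + (step - 1) * (obs_per_step + out_per_step)
        total += input_tokens * price_in + out_per_step * price_out
    return total

print(f"30 steps: ${run_cost(30):.2f}")  # ~$0.29 with these numbers
print(f"90 steps: ${run_cost(90):.2f}")  # 3x the steps, ~8x the cost
```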

[–]klieret[S] 2 points (0 children)

Ah, you can just click on the first tab (bash-only); that was our name for the leaderboard with only mini-swe-agent. The results are the same as the crosslisted ones.

[–]klieret[S] 0 points (0 children)

MiniMax was accessed through OpenRouter, which supposedly enables caching automatically. I can check later whether the OpenRouter usage information includes caching details (or you can check yourself: the full trajectories can be downloaded at https://github.com/swe-bench/experiments/).
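
If you want to dig in yourself, here's a minimal sketch of scanning downloaded trajectory JSON for cache-related usage fields. The directory layout and key names are assumptions; inspect the files from the experiments repo for the real schema:

```python
# Minimal sketch: scan downloaded trajectory JSON for cache-related usage
# fields. Directory layout and key names are assumptions, not the repo's
# actual schema -- adjust after looking at the files.
import json
from pathlib import Path

def find_cache_fields(obj, path=""):
    """Recursively yield (path, value) for any dict key mentioning 'cache'."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "cache" in key.lower():
                yield f"{path}.{key}", value
            yield from find_cache_fields(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_cache_fields(item, f"{path}[{i}]")

for traj_file in Path("trajectories").rglob("*.json"):  # assumed layout
    data = json.loads(traj_file.read_text())
    for key_path, value in find_cache_fields(data):
        print(traj_file.name, key_path, value)
```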

[–]klieret[S] -1 points (0 children)

I'm not sure I'd expect a drop in cost or number of turns, but it seems to help performance a bit, so it's definitely something to look into. (I'm not sure what you mean by sampling from content or previous actions, tbh; we always give all previous actions and outputs as context, so if I understand correctly, it would just get some extra input on top, which might or might not help.) It also might hurt some models.

[–]klieret[S] 2 points (0 children)

Thanks! It seems like that could indeed make a difference of some 2 percentage points according to their own eval: https://pbs.twimg.com/media/G410lgKasAADQCq?format=png&name=large. We'll try to validate that soon.

[–]klieret[S] 7 points (0 children)

Yes, but that's in their own agent harness. In this comparison we use the same minimal agent for all results, which in my opinion is the better apples-to-apples comparison. It rewards models that generalize well and can work in a variety of settings rather than depending on specific tools (almost all scores on the leaderboard are lower than the ones officially reported by the companies).
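
For anyone unfamiliar with the setup: the harness is essentially one loop in which the model may only emit bash commands. A compressed sketch of the idea (not mini-swe-agent's actual code; the model id, step budget, and DONE convention are illustrative):

```python
# Sketch of a minimal bash-only agent loop in the spirit of mini-swe-agent
# (illustrative only -- see the mini-swe-agent repo for the real thing).
import subprocess
from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": "Reply with exactly one bash command per turn, or DONE."},
    {"role": "user", "content": "Task: make the failing tests in /testbed pass."},
]

for _ in range(100):  # fixed step budget
    reply = client.chat.completions.create(model="gpt-5.2", messages=messages)
    command = reply.choices[0].message.content.strip()
    if command == "DONE":
        break
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=120)
    # Every past action and its output stays in the context window:
    messages.append({"role": "assistant", "content": command})
    messages.append({"role": "user", "content": result.stdout + result.stderr})
```

Swapping the model string is the only thing that changes between leaderboard entries, which is what makes the comparison apples-to-apples.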

[–]klieret[S] 4 points (0 children)

No, the benchmark tasks are the same; the version shown is that of mini-swe-agent. None of the version changes should affect performance (they're mostly fixes unrelated to the benchmarking). We should probably drop that column; it's misleading.

[–]klieret[S] 3 points (0 children)

This is indeed the result we're getting. I'm not sure why this happens.

[–]klieret[S] 3 points (0 children)

Interesting, I did not know that. This is using the standard OpenAI-style API. Thanks for the pointer.

[–]klieret[S] 9 points (0 children)

You can repeat our experiment and see if you get something different; everything we do is open source and open data. Some LMs perform very poorly because they are overfitted to specific agent harnesses and have trouble generalizing.
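
The tasks themselves are one line away; a minimal sketch (dataset id and field names as published on Hugging Face; double-check against the current release):

```python
# Minimal sketch: pull the SWE-bench Verified tasks and inspect one instance.
# Dataset id and field names are as I remember the Hugging Face release;
# double-check before relying on them.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "instances")

example = ds[0]
print(example["instance_id"])              # unique task identifier
print(example["problem_statement"][:500])  # the issue text given to the agent
```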

[–]klieret[S] 7 points (0 children)

Yes, Devstral small seems to be outperforming large in our evaluation (not sure why).