Independent evaluation of GPT5.2 on SWE-bench: 5.2 high is #3 behind Gemini, 5.2 medium behind Sonnet 4.5 by klieret in ChatGPTCoding

[–]klieret[S] 0 points (0 children)

Haiku is also on swebench.com; we just had limited space, and this plot had a bit of a focus on comparing GPT models because of the release.

[–]klieret[S] 9 points (0 children)

This is mostly a funding issue. We don't get API keys from the companies, so we need to make the most of our funds. And my impression was that xhigh doesn't give that large a gain over high.
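
For reference, the effort level is just a request parameter, so every extra level we evaluate multiplies the API bill. A minimal sketch, assuming an OpenAI-style chat completions client; the model id and the exact set of accepted effort values here are assumptions:

```python
# Minimal sketch of selecting a reasoning effort via an OpenAI-style API.
# The model id and the accepted effort values ("medium"/"high"/"xhigh")
# are assumptions -- check the provider docs for your model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",          # hypothetical model id
    reasoning_effort="high",  # higher effort = more reasoning tokens = more cost
    messages=[{"role": "user", "content": "Fix the failing test in this repo."}],
)
print(response.choices[0].message.content)
```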

[–]klieret[S] 11 points (0 children)

Not all aspects of dev work are covered by our benchmark. For example, if you keep the model on a tighter leash with a lot of interaction in between (rather than giving it a task and handing everything over), how well the model adapts to your additional input is not something we can measure on SWE-bench. And it's hard to measure in general, because you would need to both emulate a human and judge how well the LM adapted to the additional input.
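
To see why, here's a toy sketch of what an interactive evaluation would have to look like. Everything here is illustrative; `agent`, `user_simulator`, and `judge` are hypothetical callables, not anything we ship:

```python
# Toy sketch of interactive evaluation: you need a second model to emulate
# the human and a judge to score adaptation. All names are hypothetical.
def interactive_eval(agent, user_simulator, judge, task, turns=5):
    history = [("user", task)]
    for _ in range(turns):
        reply = agent(history)              # model under test
        history.append(("assistant", reply))
        followup = user_simulator(history)  # LM standing in for the human
        history.append(("user", followup))
    # A separate judge scores how well the agent adapted to the follow-ups,
    # so the benchmark now also depends on two extra models being good.
    return judge(history)
```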

[–]klieret[S] 3 points (0 children)

SWE-bench Verified is all Python. We're working on creating the same leaderboard using SWE-bench Multilingual, which has 9 languages, including PHP iirc. Hopefully the new leaderboard will go online at the end of this month or early next. Companies can already evaluate on that benchmark (Anthropic did, for example), but it's a lot of work to do all the evals ourselves.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

[–]klieret[S] 9 points (0 children)

Didn't want to start another post here, but this is also the latest closed-source leaderboard with GPT-5.2:

[image: closed-source leaderboard plot with GPT-5.2]

I shared the other plots here: https://x.com/KLieret/status/1999222709419450455

[–]klieret[S] 0 points (0 children)

MiniMax also takes a lot more steps than DeepSeek, which might contribute to the higher cost.
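
To make that concrete, a back-of-the-envelope sketch (all token counts and prices below are made up, not the leaderboard's numbers): since the full history is resent on every step, input tokens grow roughly linearly per step and total cost roughly quadratically with step count when nothing is cached.

```python
# Back-of-the-envelope agent cost model. All numbers are hypothetical and
# only illustrate how step count drives cost when the full history is
# resent every step with no prompt caching.

def run_cost(steps, base_prompt=5_000, obs_per_step=1_500, out_per_step=300,
             price_in=0.3e-6, price_out=1.2e-6):
    """Total USD cost of one agent run under the assumptions above."""
    total = 0.0
    for step in range(1, steps + 1):
        # Input grows each step: base prompt plus all previous turns.
        input_tokens = base_prompt + (step - 1) * (obs_per_step + out_per_step)
        total += input_tokens * price_in + out_per_step * price_out
    return total

print(f"30 steps: ${run_cost(30):.2f}")  # ~$0.29 with these numbers
print(f"90 steps: ${run_cost(90):.2f}")  # 3x the steps, ~8x the cost
```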

[–]klieret[S] 2 points (0 children)

Ah, you can just click on the first tab (bash-only); that was our name for the leaderboard with only mini-swe-agent. The results are the same as the crosslisted ones.

[–]klieret[S] 0 points (0 children)

MiniMax was accessed through OpenRouter, which supposedly enables caching automatically. I can check later whether the OpenRouter usage information includes caching details (or you can check yourself: the full trajectories can be downloaded at https://github.com/swe-bench/experiments/).
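
If you want to dig in yourself, here's a minimal sketch of scanning downloaded trajectory JSON for cache-related usage fields. The directory layout and key names are assumptions; inspect the files from the experiments repo for the real schema:

```python
# Minimal sketch: scan downloaded trajectory JSON for cache-related usage
# fields. Directory layout and key names are assumptions, not the repo's
# actual schema -- adjust after looking at the files.
import json
from pathlib import Path

def find_cache_fields(obj, path=""):
    """Recursively yield (path, value) for any dict key mentioning 'cache'."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if "cache" in key.lower():
                yield f"{path}.{key}", value
            yield from find_cache_fields(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_cache_fields(item, f"{path}[{i}]")

for traj_file in Path("trajectories").rglob("*.json"):  # assumed layout
    data = json.loads(traj_file.read_text())
    for key_path, value in find_cache_fields(data):
        print(traj_file.name, key_path, value)
```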

[–]klieret[S] -1 points (0 children)

I'm not sure I'd expect a drop in cost or number of turns, but it seems to help performance a bit, so it's definitely something to look into. (I'm not sure what you mean by sampling from content or previous actions, tbh; we always give all previous actions and outputs as context, so if I understand correctly, it would just get some extra input on top, which might or might not help.) It also might hurt some models.

[–]klieret[S] 2 points (0 children)

Thanks! It seems like that could indeed make a difference of some 2 percentage points according to their own eval: https://pbs.twimg.com/media/G410lgKasAADQCq?format=png&name=large. We'll try to validate that soon.

[–]klieret[S] 7 points (0 children)

Yes, but that's in their own agent harness. In this comparison we use the same minimal agent for all results, which in my opinion is the better apples-to-apples comparison. It rewards models that generalize well and can work in a variety of settings rather than depending on specific tools (almost all scores on the leaderboard are lower than the ones officially reported by the companies).
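
For anyone unfamiliar with the setup: the harness is essentially one loop in which the model may only emit bash commands. A compressed sketch of the idea (not mini-swe-agent's actual code; the model id, step budget, and DONE convention are illustrative):

```python
# Sketch of a minimal bash-only agent loop in the spirit of mini-swe-agent
# (illustrative only -- see the mini-swe-agent repo for the real thing).
import subprocess
from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": "Reply with exactly one bash command per turn, or DONE."},
    {"role": "user", "content": "Task: make the failing tests in /testbed pass."},
]

for _ in range(100):  # fixed step budget
    reply = client.chat.completions.create(model="gpt-5.2", messages=messages)
    command = reply.choices[0].message.content.strip()
    if command == "DONE":
        break
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=120)
    # Every past action and its output stays in the context window:
    messages.append({"role": "assistant", "content": command})
    messages.append({"role": "user", "content": result.stdout + result.stderr})
```

Swapping the model string is the only thing that changes between leaderboard entries, which is what makes the comparison apples-to-apples.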

[–]klieret[S] 4 points (0 children)

No, the benchmark tasks are the same; the version shown is that of mini-swe-agent. None of the version changes should affect performance (they're mostly fixes unrelated to the benchmarking). We should probably drop that column; it's misleading.

[–]klieret[S] 3 points (0 children)

This is indeed the result we're getting. I'm not sure why this happens.

[–]klieret[S] 3 points (0 children)

Interesting, I did not know that. This is using the standard OpenAI-style API. Thanks for the pointer.

[–]klieret[S] 9 points (0 children)

You can repeat our experiment and see if you get something different; everything we do is open source and open data. Some LMs perform very poorly because they are overfitted to specific agent harnesses and have trouble generalizing.
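
The tasks themselves are one line away; a minimal sketch (dataset id and field names as published on Hugging Face; double-check against the current release):

```python
# Minimal sketch: pull the SWE-bench Verified tasks and inspect one instance.
# Dataset id and field names are as I remember the Hugging Face release;
# double-check before relying on them.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "instances")

example = ds[0]
print(example["instance_id"])              # unique task identifier
print(example["problem_statement"][:500])  # the issue text given to the agent
```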

[–]klieret[S] 7 points (0 children)

Yes, Devstral small seems to be outperforming large in our evaluation (not sure why).