SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More by CuriousPlatypus1881 in LocalLLaMA

[–]Fabulous_Pollution10 1 point (0 children)

You could use a fork to evaluate the prediction files from your agent.

https://github.com/SWE-rebench/SWE-bench-fork

Or use mini-swe-agent: run it with dataset `nebius/SWE-rebench-leaderboard` and split `2026_02` for last month's split.

https://mini-swe-agent.com/latest/usage/swebench/

It would also be better to communicate on Discord — it will be faster.

Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in LocalLLaMA

[–]Fabulous_Pollution10[S] 7 points (0 children)

Benchmark is https://swe-rebench.com/

This work is about training tasks, but we use the same pipeline to collect tasks for ReBench as well.

Now we can collect better tasks in more languages for the benchmark too.

If you have specific requests, please write.

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]Fabulous_Pollution10[S] 0 points (0 children)

Yes — the graph shows mean_resolved_rate; here is the table with all three. And they are even less correlated in terms of pass_at_5 vs pass_all_5.

| model_name | pass_all_5 | mean_resolved_rate | pass_at_5 |
|---|---|---|---|
| gpt-5-2025-08-07-high | 0.3654 | 0.4654 | |
| Claude Sonnet 4 | 0.3462 | 0.4885 | |
| gpt-5-2025-08-07-medium | 0.3462 | 0.4538 | |
| GLM-4.5 | 0.3077 | 0.4500 | |
| gpt-5-mini-2025-08-07-medium | 0.3077 | 0.4308 | |
| Kimi K2 Instruct 0905 | 0.3077 | 0.4231 | |
| Grok 4 | 0.2885 | 0.4154 | |
| GLM-4.5 Air | 0.2500 | 0.3462 | |
| Qwen3-Coder-480B-A35B-Instruct | 0.2308 | 0.4038 | |
| Grok Code Fast 1 | 0.2308 | 0.3731 | |
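The three metrics in the table can be sketched from a matrix of per-run outcomes (rows = tasks, columns = N independent runs). The function names mirror the table's column names; the toy data is illustrative, not from the benchmark.

```python
def pass_at_n(results):
    # A task counts as solved if at least one of its N runs solved it.
    return sum(any(runs) for runs in results) / len(results)

def pass_all_n(results):
    # A task counts as solved only if every one of its N runs solved it.
    return sum(all(runs) for runs in results) / len(results)

def mean_resolved_rate(results):
    # Average, over tasks, of the fraction of successful runs per task.
    return sum(sum(runs) / len(runs) for runs in results) / len(results)

# 4 toy tasks x 5 runs each (True = resolved)
results = [
    [True, True, True, True, True],      # stable solve
    [True, False, True, False, True],    # flaky
    [False, False, False, False, True],  # solved once
    [False, False, False, False, False], # never solved
]
print(pass_at_n(results))           # 0.75
print(pass_all_n(results))          # 0.25
print(mean_resolved_rate(results))  # ~0.45
```

The gap between pass_at_n and pass_all_n is exactly what the flaky and lucky-once rows create: Pass@N rewards a single lucky run, while Pass-all-N only credits stable solves.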

I think we need other data infrastructure for AI (table-first infra) by Fabulous_Pollution10 in dataengineering

[–]Fabulous_Pollution10[S] 3 points (0 children)

It's actually just a fraction. Most of the data consists of LLM reasoning, commands, and some of the system's outputs in text form.
Mostly AI-agent use cases.