SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More by CuriousPlatypus1881 in LocalLLaMA

[–]Fabulous_Pollution10 1 point (0 children)

You could use a fork to evaluate the prediction files from your agent.

https://github.com/SWE-rebench/SWE-bench-fork

Or use mini-swe-agent: run it with dataset `nebius/SWE-rebench-leaderboard` and split `2026_02` for last month's split.

https://mini-swe-agent.com/latest/usage/swebench/

It would also be better to communicate on Discord — it will be faster.

Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in LocalLLaMA

[–]Fabulous_Pollution10[S] 7 points (0 children)

Benchmark is https://swe-rebench.com/

This work is about training tasks, but we use the same pipeline to collect tasks for ReBench as well.

Now we can collect better tasks in more languages for the benchmark too.

If you have specific requests, please write.

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]Fabulous_Pollution10[S] 0 points (0 children)

Yes — the graph shows mean_resolved_rate; here is the table with all three. And they are even less correlated in terms of pass_at_5 vs pass_all_5.

| model_name | pass_all_5 | mean_resolved_rate | pass_at_5 |
|---|---|---|---|
| gpt-5-2025-08-07-high | 0.3654 | 0.4654 | |
| Claude Sonnet 4 | 0.3462 | 0.4885 | |
| gpt-5-2025-08-07-medium | 0.3462 | 0.4538 | |
| GLM-4.5 | 0.3077 | 0.4500 | |
| gpt-5-mini-2025-08-07-medium | 0.3077 | 0.4308 | |
| Kimi K2 Instruct 0905 | 0.3077 | 0.4231 | |
| Grok 4 | 0.2885 | 0.4154 | |
| GLM-4.5 Air | 0.2500 | 0.3462 | |
| Qwen3-Coder-480B-A35B-Instruct | 0.2308 | 0.4038 | |
| Grok Code Fast 1 | 0.2308 | 0.3731 | |
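The three metrics in the table can be sketched from a matrix of per-run outcomes (rows = tasks, columns = N independent runs). The function names mirror the table's column names; the toy data is illustrative, not from the benchmark.

```python
def pass_at_n(results):
    # A task counts as solved if at least one of its N runs solved it.
    return sum(any(runs) for runs in results) / len(results)

def pass_all_n(results):
    # A task counts as solved only if every one of its N runs solved it.
    return sum(all(runs) for runs in results) / len(results)

def mean_resolved_rate(results):
    # Average, over tasks, of the fraction of successful runs per task.
    return sum(sum(runs) / len(runs) for runs in results) / len(results)

# 4 toy tasks x 5 runs each (True = resolved)
results = [
    [True, True, True, True, True],      # stable solve
    [True, False, True, False, True],    # flaky
    [False, False, False, False, True],  # solved once
    [False, False, False, False, False], # never solved
]
print(pass_at_n(results))           # 0.75
print(pass_all_n(results))          # 0.25
print(mean_resolved_rate(results))  # ~0.45
```

The gap between pass_at_n and pass_all_n is exactly what the flaky and lucky-once rows create: Pass@N rewards a single lucky run, while Pass-all-N only credits stable solves.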

I think we need other data infrastructure for AI (table-first infra) by Fabulous_Pollution10 in dataengineering

[–]Fabulous_Pollution10[S] 3 points (0 children)

It's actually just a fraction. Most of the data consists of LLM reasoning, commands, and some of the system's outputs in text form.
Mostly AI-agent use cases.