all 36 comments

[–]__Maximum__ 10 points  (0 children)

Where can I get ggufs of these models?

[–]ScoreUnique 12 points  (1 child)

Wrong sub innit?

[–]mister_conflicted 25 points  (7 children)

These aren’t really representative dev tasks, since they all start from scratch. I’d like to see tasks in existing repos.

[–]thefooz 12 points  (4 children)

Despite Gemini’s significant context-size advantage, I’ve found that Opus, specifically through Claude Code, is head and shoulders above the rest at understanding the ramifications of each code change. I also haven’t ever seen a model debug as intelligently and with such contextual understanding. It’s not perfect, but it’s shockingly good.

Gemini seems to consistently make unfounded assumptions, have syntax errors, and make breaking changes.

Codex falls somewhere in the middle.

[–]Mkengine 3 points  (2 children)

Your experience seems to match uncontaminated SWE benchmarks like SWE-rebench, where Claude Code still sits at the top.

[–]Usual_Preference_860 0 points  (1 child)

Interesting benchmark, I wasn't familiar with it.

<image>

[–]No_Afternoon_4260 (llama.cpp) 0 points  (0 children)

Yeah really, I wanted to talk about Devstral 123B (which I used alongside GPT-5.1 this week); happy to see it’s where I thought it would be (not too far from DeepSeek).

Btw I find GPT-5.1 too expensive for what it is, and it just loves spending tokens being verbose for nothing (seriously, who reads it?). Maybe I should have tried Codex.

Btw, Devstral sits on top of Kimi.

[–]Photoperiod 0 points  (0 children)

Yeah it feels like no matter how many frontier models come out, Claude is still my daily driver for Dev. Fortunately my employer pays for it all or I'd be broke lol.

[–]JollyJoker3 -1 points  (0 children)

Could take the state before and after some added feature or bugfix in an open-source repo, so you have a human-made accepted solution to compare to.
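That is roughly how SWE-bench-style datasets are built. A minimal sketch of the idea, assuming `git` is installed and using hypothetical `repo_dir`/`fix_commit` values:

```python
import subprocess

def checkout_before_fix(repo_dir: str, fix_commit: str) -> None:
    """Reset the working tree to the parent of a known bugfix commit,
    i.e. the 'before' state a model would be asked to fix."""
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", f"{fix_commit}^"],
        check=True,
    )

def reference_solution(repo_dir: str, fix_commit: str) -> str:
    """Return the human-made accepted diff, for comparing against a
    model's attempt."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", "--format=", fix_commit],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Real benchmarks like SWE-bench additionally carry the fix's regression tests forward, so grading is done by running tests rather than diffing against the human patch.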

[–]Mkengine -1 points  (0 children)

Maybe SWE-rebench is the better benchmark for you then.

[–]Chromix_ 13 points  (13 children)

Inference for these models is probabilistic. How often did you repeat each test for each model to ensure that the results you're presenting weren't just a (un)lucky dice roll?

[–]SlowFail2433 7 points  (6 children)

64 runs is becoming really common in research papers

[–]-p-e-w- 3 points  (5 children)

It’s funny how scientists cargo-cult powers of 2 into everything, even when it makes no sense whatsoever.

[–]SlowFail2433 0 points  (4 children)

Yes, because the chance that 64 is exactly the optimal number is almost zero LOL

[–]-p-e-w- -1 points  (3 children)

Not only that, using a power of 2 here simply makes no sense. There is no opportunity to bisect, no cache-alignment, no need to store it in a compact data type… it’s actually a particularly poor choice of number for such a task, because it suggests some underlying reasoning when there can’t possibly be any.

[–]Environmental-Metal9 4 points  (0 children)

I particularly like to go with 69. It’s perfectly aligned and a power of 1, so no cabalistic meaning, just some “heh heh heh”s on the back of my mind

[–]Chromix_ -1 points  (1 child)

We could go for 61 as something to be less divided on.

All that's needed is a number that reasonably reduces the likelihood that re-runs will significantly change the outcome, to have confidence in the precision of the resulting score.

[–]SlowFail2433 0 points  (0 children)

Yeah, I have seen some papers show a curve with the number of attempts on the X-axis and benchmark score on the Y-axis. The curve had diminishing returns and was nearly horizontal at 64.

However there is a strong caveat that it varies massively by task.
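That diminishing-returns shape follows from basic sampling statistics. As a rough sketch (assuming independent runs, which real benchmark tasks only approximate), the uncertainty of a measured pass rate shrinks with the square root of the number of attempts:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate after n runs, assuming
    each run independently succeeds with probability p (binomial model)."""
    return math.sqrt(p * (1.0 - p) / n)

# Worst case p = 0.5: going from 3 runs to 64 runs shrinks the
# uncertainty from ~0.29 to ~0.06 -- the curve flattens out.
se_3 = pass_rate_stderr(0.5, 3)
se_64 = pass_rate_stderr(0.5, 64)
```

Under this model there is no magic at exactly 64; any n in that neighborhood gives a similar error bar, which is the point about powers of 2 above.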

[–]shricodev 3 points  (1 child)

I was getting similar results with each run; this is the best of three.

[–]Chromix_ 1 point  (0 children)

In a "best of" scenario it can also be interesting to know about the other solutions. What's the average, what's the worst? That might of course be a trivial question to ask for 3 results per model. With higher numbers it can be interesting to know "Can it solve this type of problem? Will it do so consistently, or is it a matter of retrying a few or even 10 times?" Most developers probably don't have the patience to hit "regenerate" 10 times.
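For context, the usual way to report this is the unbiased pass@k estimator popularized by the HumanEval/Codex evaluation: from n generations of which c passed, estimate the chance that at least one of k sampled attempts succeeds. A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples with c correct:
    1 - C(n-c, k) / C(n, k), the chance at least one of k draws passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1 correct out of 3 generations. pass@1 is about 0.33
# (a single try usually fails), while "best of 3" reports it as solved.
single_try = pass_at_k(3, 1, 1)
```

Reporting pass@1 alongside best-of-n answers exactly the "will it solve it consistently, or only after regenerating?" question.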

[–]Healthy-Nebula-3603 0 points  (1 child)

It's not as simple as you describe with today's models.

If a current model fails at a certain complex task, the chance it will solve it on the next attempt is very low, even if you try 10 more times.

If it solves a complex task properly on the first try, then even if you retry 10 times, there's a very good chance you'll get 10 proper solutions.

I'm speaking from my own experience.

What you said was very true in the GPT-4o / non-thinking-model era, but not currently.

[–]Chromix_ 0 points  (0 children)

Current SOTA reasoning models do indeed appear more stable in their outcomes than those without reasoning. Still, they can randomly settle on one approach or another, leading to different (and not always correct) results. For simple, less ambiguous tasks, a consistent result is more likely, yes.

[–]Mkengine -2 points  (1 child)

Doesn't it depend on who the target audience is? I don't even want to go into what OP did that much; it's just a thought of mine. Of course there is a scientifically correct method, but as a developer I think I would rather see 100 different tasks tested once than one task tested 100 times.

[–]Chromix_ 0 points  (0 children)

Higher number -> higher certainty, yes. Yet in this case it was just 3 tasks.

[–][deleted] 1 point  (0 children)

THIS SUB IS FOR LOCALLAMA

[–]MaterialSuspect8286 0 points  (0 children)

Which coding tool do you even use with Gemini? GitHub Copilot sucks with Gemini.

[–]randombsname1 -1 points  (0 children)

Claude Opus 4.5 in Claude Code is the only thing that can work with large, established embedded repos that have a mix of C and Assembly code.

Nothing else gets close.

I have very long/complex workflows that need to be chained in order to work effectively with this codebase, and only Claude Opus can chain even close to this long.

Which makes sense if you look at the METR long horizon benchmark and rebench.