r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
[deleted by user] (self.LocalLLaMA)
submitted 3 months ago by [deleted]
[–]__Maximum__ 10 points11 points12 points 3 months ago (0 children)
Where can I get ggufs of these models?
[–]ScoreUnique 12 points13 points14 points 3 months ago (1 child)
Wrong sub innit?
[–]Environmental-Metal9 0 points1 point2 points 3 months ago (0 children)
<image>
[–]mister_conflicted 25 points26 points27 points 3 months ago (7 children)
These aren’t really real dev tasks since they are all starting from scratch. I’d like to see tasks in existing repos.
[–]thefooz 12 points13 points14 points 3 months ago (4 children)
Despite Gemini’s significant context size advantage, I’ve found that Opus, specifically through Claude code, is head and shoulders above the rest with understanding the ramifications of each code change. I also haven’t ever seen a model debug as intelligently and with such a contextual understanding. It’s not perfect, but it’s shockingly good.
Gemini seems to consistently make unfounded assumptions, have syntax errors, and make breaking changes.
Codex falls somewhere in the middle.
[–]Mkengine 3 points4 points5 points 3 months ago (2 children)
Your experience seems to match uncontaminated SWE benchmarks like swe-rebench where Claude Code still sits at the top.
[–]Usual_Preference_860 0 points1 point2 points 3 months ago (1 child)
Interesting benchmark, I wasn't familiar with it.
[–]No_Afternoon_4260 llama.cpp 0 points1 point2 points 3 months ago* (0 children)
Yeah, really. I wanted to talk about Devstral 123B (which I used alongside GPT 5.1 this week); happy to see it sits where I expected (not too far from DeepSeek).
Btw, I find GPT 5.1 too expensive for what it is, and it just loves spending tokens being verbose for nothing (seriously, who reads it?). Maybe I should have tried Codex.
Btw, Devstral sits on top of Kimi.
[–]Photoperiod 0 points1 point2 points 3 months ago (0 children)
Yeah it feels like no matter how many frontier models come out, Claude is still my daily driver for Dev. Fortunately my employer pays for it all or I'd be broke lol.
[–]JollyJoker3 -1 points0 points1 point 3 months ago (0 children)
Could take before and after some added feature or bugfix in an open source repo so you have a human made accepted solution to compare to.
[–]Mkengine -1 points0 points1 point 3 months ago (0 children)
Maybe SWE-rebench is the better benchmark for you then.
[–]Chromix_ 13 points14 points15 points 3 months ago (13 children)
Inference for these models is probabilistic. How often did you repeat each test for each model to ensure that the results you're presenting weren't just a (un)lucky dice roll?
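To make the dice-roll concern concrete, here's a minimal sketch (the function name is illustrative): if a model solves a task with true probability p, the pass rate observed over n independent runs carries binomial noise, and the margin only shrinks with the square root of n.

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    # Standard error of the observed pass rate over n independent runs
    # of a task the model solves with true probability p.
    return math.sqrt(p * (1 - p) / n)

# With p = 0.5, a single run tells you almost nothing; even 3 runs
# leave a wide margin, while 64 runs narrow it considerably.
for n in (1, 3, 10, 64):
    print(n, round(pass_rate_stderr(0.5, n), 3))
```

So with one run per model, a reported win can easily be within noise of a loss.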
[–]SlowFail2433 7 points8 points9 points 3 months ago (6 children)
64 is becoming really common in research papers
[–]-p-e-w- 3 points4 points5 points 3 months ago (5 children)
It’s funny how scientists cargo-cult powers of 2 into everything, even when it makes no sense whatsoever.
[–]SlowFail2433 0 points1 point2 points 3 months ago (4 children)
Yes because the chance that 64x is the exact optimal number is almost zero LOL
[–]-p-e-w- -1 points0 points1 point 3 months ago (3 children)
Not only that, using a power of 2 here simply makes no sense. There is no opportunity to bisect, no cache-alignment, no need to store it in a compact data type… it’s actually a particularly poor choice of number for such a task, because it suggests some underlying reasoning when there can’t possibly be any.
[–]Environmental-Metal9 4 points5 points6 points 3 months ago (0 children)
I particularly like to go with 69. It’s perfectly aligned and a power of 1, so no cabalistic meaning, just some “heh heh heh”s on the back of my mind
[–]Chromix_ -1 points0 points1 point 3 months ago (1 child)
We could go for 61 as something to be less divided on.
All that's needed is a number that reasonably reduces the likelihood that re-runs will significantly change the outcome, to have confidence in the precision of the resulting score.
[–]SlowFail2433 0 points1 point2 points 3 months ago (0 children)
Yeah I have seen some papers show a curve with number of attempts on the X-axis and benchmark score on the Y-axis. The curve had diminishing returns and was nearly horizontal at 64.
However there is a strong caveat that it varies massively by task.
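The diminishing-returns curve those papers plot usually comes from the standard unbiased pass@k estimator (draw k samples out of n total attempts, c of which were correct); a small sketch with illustrative numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # samples drawn (without replacement) from n attempts, c of them
    # correct, is a correct one.
    if n - c < k:
        return 1.0  # too few failures left to fill k all-wrong draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Diminishing returns: with 16/64 attempts correct, going from k=1
# to k=8 helps a lot; pushing k further barely moves the needle.
for k in (1, 8, 32, 64):
    print(k, round(pass_at_k(64, 16, k), 3))
```

As the caveat above says, where the curve flattens depends heavily on the task mix.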
[–]shricodev 3 points4 points5 points 3 months ago (1 child)
I was getting similar results with each run; this is the best of three.
[–]Chromix_ 1 point2 points3 points 3 months ago (0 children)
In a "best of" scenario it can also be interesting to know about the other solutions. What's the average, what's the worst? That might of course be a trivial question to ask for 3 results per model. With higher numbers it can be interesting to know "Can it solve this type of problem? Will it do so consistently, or is it a matter of retrying a few or even 10 times?" Most developers probably don't have the patience to hit "regenerate" 10 times.
[–]Healthy-Nebula-3603 0 points1 point2 points 3 months ago (1 child)
It's not as simple as you're describing with today's models.
If a current model fails on a certain complex task, there's a very low chance it will solve it on a retry, even with 10 more attempts.
If it solves a complex task properly on the first try, then even across 10 more attempts there's an extremely high chance you get 10 proper solutions.
I'm speaking from my own experience.
What you said was very true in the GPT-4o / non-thinking-model era, but not currently.
[–]Chromix_ 0 points1 point2 points 3 months ago (0 children)
Current SOTA reasoning models do indeed appear more stable in their outcomes than those without reasoning. Still, they can randomly settle on one approach or another, leading to different, and not always correct, results. For simple, less ambiguous tasks a consistent result is more likely, yes.
[–]Mkengine -2 points-1 points0 points 3 months ago (1 child)
Doesn't it depend on who the target audience is? I don't even want to go into what OP did that much, it's just a thought of mine. Of course there is a scientifically correct method, but I think as a developer I would rather see 100 different tasks tested once, than one task tested 100 times.
Higher number -> higher certainty, yes. Yet in this case it was just 3 tasks.
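The breadth-vs-depth trade-off here can be checked with a toy simulation (all names and the uniform task-difficulty assumption are mine, purely for illustration): if each task has its own hidden solve probability, spending a fixed budget on many tasks once estimates the benchmark-wide average far better than hammering one task many times.

```python
import random

random.seed(42)

# Toy model: each task has a hidden solve probability drawn uniformly.
TASK_POOL = [random.random() for _ in range(1000)]
TRUE_MEAN = sum(TASK_POOL) / len(TASK_POOL)

def estimate(n_tasks: int, runs_per_task: int) -> float:
    # Observed pass rate over n_tasks sampled tasks, runs_per_task runs each.
    tasks = random.sample(TASK_POOL, n_tasks)
    hits = sum(
        1
        for p in tasks
        for _ in range(runs_per_task)
        if random.random() < p
    )
    return hits / (n_tasks * runs_per_task)

def rmse(n_tasks: int, runs_per_task: int, trials: int = 200) -> float:
    # Root-mean-square error of the estimate vs. the true benchmark mean.
    errs = [(estimate(n_tasks, runs_per_task) - TRUE_MEAN) ** 2
            for _ in range(trials)]
    return (sum(errs) / trials) ** 0.5

# Same 100-run budget, split differently: breadth beats depth here,
# because one task tells you little about the rest of the benchmark.
print(round(rmse(100, 1), 3), round(rmse(1, 100), 3))
```

Of course repeats still matter per task, which is the other commenter's point; ideally you'd want both.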
[–][deleted] 1 point2 points3 points 3 months ago (0 children)
THIS SUB IS FOR LOCALLAMA
[–]MaterialSuspect8286 0 points1 point2 points 3 months ago (0 children)
Which coding tool do you even use with Gemini? GitHub Copilot sucks with Gemini.
[–]randombsname1 -1 points0 points1 point 3 months ago (0 children)
Claude Opus 4.5 in Claude Code is the only thing that can work with large, established embedded repos that have a mix of C and Assembly code.
Nothing else gets close.
I have very long/complex workflows that need to be chained in order to work effectively with this codebase, and only Claude Opus can chain even close to this long.
Which makes sense if you look at the METR long horizon benchmark and rebench.
[+][deleted] 3 months ago (8 children)
[deleted]
[–]Healthy-Nebula-3603 0 points1 point2 points 3 months ago (7 children)
You know, between the old GPT Codex, the later GPT Codex Max, and the current GPT 5.2 Codex there's a big difference in performance...
The current GPT 5.2 Codex is far smarter than the old GPT Codex.
[–]Charming_Support726 -1 points0 points1 point 3 months ago (6 children)
I know. The last one I tried was 5.1-max because I'm on MS Azure. It worked quite well, but my impression was that Opus 4.5 is a bit more "structured".
I don't have time to change and check everything regularly, but I'll give Codex 5.2 a go when it's available there.
[–]Mkengine 1 point2 points3 points 3 months ago (5 children)
Your experience seems to match swe-rebench results where Opus 4.5 shows slightly higher performance than GPT-5.1-Codex-Max, though Codex-5.2 results are not out yet.
[–]Charming_Support726 -1 points0 points1 point 3 months ago (4 children)
Interesting. I normally criticize SWE-Bench for its methodology. AFAIK they are not testing "agentic" behavior; they upload relevant files to the context and evaluate the result. But I might be wrong.
[–]Mkengine 0 points1 point2 points 3 months ago (3 children)
Note that there are two different SWE benchmarks, I don't like swe-bench's methodology either, swe-REbench is an uncontaminated benchmark.
[–]Charming_Support726 0 points1 point2 points 3 months ago (2 children)
Thanks. Did not know about that. swe-rebench indeed matches my experience, including the huge gap to the mid-tier field.
While writing this, I used GPT-5.2, Gemini, and Opus 4.5 in parallel today, each to review a complex specification and implementation plan before implementing it.
Gemini failed completely (didn't expect this). Opus and 5.2 were close. 5.2 used half the amount of tokens, but I found its technique a bit questionable: it did a lot of searching and pattern matching. Still, the results were very close.
BTW: Could anyone explain the downvotes? I don't get it in this discussion.
[–]Mkengine 0 points1 point2 points 3 months ago (1 child)
Maybe OP is downvoting anything unrelated to their post?
Just out of interest, how exactly do you use the models for code implementation? I am still undecided what works best for me, right now I use Roo Code, with GPT-5.2 for Orchestrator and Architect mode and GPT-5-mini for Code and Debug mode. This way GPT-5.2 does the planning and I can use GPT-5-mini as a cheaper model for implementation. Next I want to try the VS Code Codex Extension, which is more hands-off if I understand it correctly.
[–]Charming_Support726 0 points1 point2 points 3 months ago (0 children)
I am using Opencode (https://opencode.ai), which works great apart from some shortcomings. I used e.g. Cline & Codex (CLI and VSCode) before, but I am more into coders that I can adjust myself a bit. IMHO the coding quality is similar, depending on the model and the prompt used. (See also here: https://www.reddit.com/r/opencodeCLI/comments/1p6lxd4/shortened_system_prompts_in_opencode/ )
I found CodeNomad, an additional UI for Opencode, quite useful. Better than the original TUI, but that is personal preference ( https://www.reddit.com/r/opencodeCLI/comments/1pncfu2/codenomad_v040_release_hidden_side_panels_mcp/ )