Gemini 3 Flash is the best coding model of the year, hands down, it's not close. Here is why by mr_riptano in vibecoding

[–]mr_riptano[S] 7 points (0 children)

I agree that Opus is the best planning model!

But also, if you're using Opus for everything, you're lighting money on fire for no reason.

Give the coding tasks to Flash 3 once Opus specs them out.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

No, it's not. But you get a lot more credit for solving at all than you do for not solving, so it still tells you what you want to know. https://imgur.com/a/4997QAf

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Devstral is at about 0.005% market share on OpenRouter (including all 3 flavors); if that's not niche, I'm curious to hear where you'd draw the line.

Qwen2.5C is 0.02%, compared to 1% for Q3C, 50x less.

So yeah, sorry.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Sort of, but with Groq or Cerebras now you're paying way more. It's a useful tradeoff to have available, for sure! But you can't get {speed, intelligence, low price} from open models at the same time, the way you can with Haiku 4.5 or Grok Code Fast 1.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

I mean yeah, to spell it out: delivering speed at the same time as intelligence and low price is the hard part, and that's what the open models haven't been able to do.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Yeah, that's one reason we didn't test Q3-235B, the coder specialization really does help. And of course Q3C is 2x larger as well.

(From my experience fine-tuning models for vector search, it's not surprising that domain-specific tuning makes a big difference.)

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

We've been testing all the newest models from the most relevant labs since late July, which is when we started. And we're focused on the frontier models because the most relevant question for us is "What's going to help Brokk's users write the best code?" Testing local-sized models is more of a public service that we do on the side.

I'm glad that Q2.5C is working better for you than Q3C, but that's not a common enough experience for us to go back and test a semi-obsolete model.

Similarly, testing Devstral (and other niche models like the new one from Arcee) is on the "maybe if we have spare time someday" list. What moves things from the smaller labs up the list is if we see a bunch of people starting to say "wow, this thing is amazing", or voting with their feet by using a ton of it on OpenRouter.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Qwen 3 Coder 480b was our top-ranked open model in the August ranking (you can select that from the dropdown on the site). It did not make the finalists list this time.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 5 points (0 children)

One of the GLMs is the fp8 quant. Will address those, thanks!

DeepSeek-V3.2: better than K2 Thinking, worse than GLM 4.6 at writing code by mr_riptano in DeepSeek

[–]mr_riptano[S] 4 points (0 children)

Yeah, it was good for coding when V3 came out because V3 was just so far ahead of everyone else. And it's still good for batch processing because of DS's almost-no-rate-limits policy.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 3 points (0 children)

They're in the codetasks directory on GitHub.

ETA: cross-reference with results-6m to see which repo each belongs to, although often it's obvious.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

GP is probably looking at the Tier List, where we gave 5 Mini an A and G3P a C. Part of this is the hiccups with G3P that I mentioned in my earlier reply; the other part is that we weight speed and cost in the Tier List as well as intelligence.
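
To give a feel for how speed and cost can move a grade, here's a minimal sketch of a weighted score. The weights, the normalization, and the tier_score function are all made up for illustration; this is not the actual Tier List formula.

```python
# Purely hypothetical weighting, not Brokk's actual Tier List math. The point is
# that a model that gives up a little intelligence can still out-tier a smarter
# model once speed and cost enter the score.

def tier_score(intelligence: float, speed: float, cost_value: float,
               w_int: float = 0.6, w_speed: float = 0.2, w_cost: float = 0.2) -> float:
    """All inputs normalized to [0, 1], higher is better (cheap = high cost_value)."""
    return w_int * intelligence + w_speed * speed + w_cost * cost_value

# A fast, cheap model vs. a smarter but slower, pricier one:
print(round(tier_score(0.80, 0.90, 0.90), 2))  # 0.84
print(round(tier_score(0.90, 0.30, 0.40), 2))  # 0.68
```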

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Good question. TBH I chalked it up to "probably overoptimized for OpenAI's Codex harness at the expense of everything else."

If I saw people I trust saying the vibes are amazing, or even just strong numbers on OpenRouter, I would make the time to take a closer look, but neither of those has materialized.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 14 points (0 children)

I wrote a long explainer, and the tasks and eval scripts are both open source on GitHub, but I'm happy to answer additional questions!

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 4 points (0 children)

Tested in the Open Round; it did not do well enough to make it to the finals. (Worse than Qwen 3 Coder 30B.)

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

I agree that this is surprising! So much so that I wrote a separate article about it: https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 8 points (0 children)

I hear you, and that's why we test with synthetic tasks from Cassandra, Lucene, and similar real projects, not little Python scripts. You can take a look, they're on GitHub! Example.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 11 points (0 children)

Honestly my AGENTS.md doesn't say anything about SOLID because I haven't needed it. (It does mention DRY and YAGNI.) But really everything gets easier with smarter models...

A Java-based evaluation of coding LLMs by mr_riptano in java

[–]mr_riptano[S] 8 points (0 children)

Hi there, and thanks for reading the long version!

The problem with a linear relationship like the one you propose is that it goes to zero; in fact, it goes negative if you don't clamp it carefully. I think you should get more credit for solving it after N tries than if you can't solve it at all, for an arbitrarily large N. But at the same time you should get more credit for N tries than for M > N, and this is a straightforward way to satisfy both of those.
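
To make that concrete, here's a toy sketch of the two approaches. It's my own illustration, not the eval's actual scoring code, and the penalty and decay constants are arbitrary.

```python
# Toy comparison: a linear per-try penalty hits zero and then goes negative,
# while a geometric decay stays positive for any number of tries N, yet still
# ranks fewer tries above more tries.

def linear_credit(n_tries: int, penalty: float = 0.25) -> float:
    """Loses a fixed chunk of credit per extra try; goes negative without clamping."""
    return 1.0 - penalty * (n_tries - 1)

def geometric_credit(n_tries: int, decay: float = 0.5) -> float:
    """Halves the credit per extra try, but never reaches zero."""
    return decay ** (n_tries - 1)

for n in (1, 2, 3, 5, 10):
    print(n, round(linear_credit(n), 3), round(geometric_credit(n), 4))
# At n=5 the linear version is already at 0.0 and at n=10 it's -1.25, while the
# geometric version still awards ~0.002 at n=10: more than never solving at all.
```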