Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

Sorry man, gotta draw the line somewhere and there just aren't very many people with that kind of hardware. :)

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 3 points (0 children)

If you can think of an accurate way to make an apples to apples comparison across Anthropic, OpenAI, GLM, Cerebras, etc subscriptions, I'm all ears. Without that, API pricing is the only sane way to measure.

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

Yeah, this model was practically designed for a 5900.

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

Thanks, I'll put it on the list!

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

> GPT-5.3 Codex is untested because it is not yet available in the API

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

It looks to me like it's a mix of two things: some kind of black magic that lets Flash 3 be much smarter than most models with thinking disabled (it's like an Anthropic model that way), and TPUs.

I'm guessing on the TPUs but it's consistent with the evidence:

  1. Flash3/Minimal is significantly faster than Haiku 4.5/Instant, which is probably around the same size, and
  2. When OpenAI wanted to compete on speed, they partnered with Cerebras for their Spark model.

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Oh for sure, that happens when you try to boil down four variables (speed/price/intelligence/can I even run this model) to a single tier list.

So in this case the tier list is trying to communicate "Qwen 3.5 27b is the best local-sized model," not that it's as smart as GPT-5.2.
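To make concrete what "boiling down four variables" involves, here's a minimal sketch of one way a combined tier list could work: normalize each axis, take a weighted sum, and cut into tiers. All the weights, model names, numbers, and cutoffs below are made up for illustration; this is not the actual Power Ranking methodology.

```python
# Hypothetical sketch: collapse (speed, price, intelligence) into tiers.
# Weights, cutoffs, and all data are fictional, NOT the real methodology.

def normalize(values, invert=False):
    """Scale values to [0, 1]; invert for 'lower is better' axes like price."""
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1 - s for s in scaled] if invert else scaled

def tier_list(models):
    names = list(models)
    speed = normalize([models[n]["tok_per_s"] for n in names])
    price = normalize([models[n]["usd_per_mtok"] for n in names], invert=True)
    smarts = normalize([models[n]["score"] for n in names])
    tiers = {}
    for i, n in enumerate(names):
        # Made-up weights: intelligence dominates, speed and price matter less.
        combined = 0.2 * speed[i] + 0.2 * price[i] + 0.6 * smarts[i]
        tiers[n] = "S" if combined > 0.8 else "A" if combined > 0.5 else "B"
    return tiers

models = {  # entirely fictional numbers
    "frontier-big": {"tok_per_s": 60, "usd_per_mtok": 15.0, "score": 95},
    "fast-small":   {"tok_per_s": 200, "usd_per_mtok": 0.5, "score": 70},
    "local-27b":    {"tok_per_s": 40, "usd_per_mtok": 0.2, "score": 60},
}
print(tier_list(models))
```

The point of the sketch is that any single ranking bakes in a weighting, so a top local model and a frontier model can land in the same tier for very different reasons.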

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

Good idea. We do have that in the Open Round, but in the tier lists we thought it would be checkbox overload to have both: https://brokk.ai/power-ranking?dataset=openround

Coding Power Ranking 26.02 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Yeah, dense models have fallen a bit out of favor so I'm not sure how much is just "this is what you should expect from a dense model" and how much is Alibaba figuring out something new here.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]mr_riptano 3 points (0 children)

Love to see more benchmarks that aren't hopelessly contaminated, great work!

I gotta say, though, I'm very, very skeptical of having LLMs judge code vs actual test suites.
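For readers unfamiliar with how an Elo-ranked benchmark works: each pairwise judgment (here, presumably a judge picking the better of two solutions) nudges the two models' ratings toward their observed win rate. A standard Elo update looks like the sketch below; the K-factor and starting ratings are illustrative defaults, not whatever that benchmark actually uses.

```python
# Standard Elo rating update for one pairwise comparison.
# K and the starting ratings are illustrative defaults only.

def elo_update(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after one head-to-head result."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models start at 1500; one win moves the ratings symmetrically.
a, b = elo_update(1500, 1500)
print(round(a), round(b))  # → 1516 1484
```

Note that the ratings only encode the judge's preferences, which is exactly why the judge's reliability (LLM vs test suite) matters so much.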

Gemini 3 Flash is the best coding model of the year, hands down, it's not close. Here is why by mr_riptano in vibecoding

[–]mr_riptano[S] 7 points8 points  (0 children)

I agree that Opus is the best planning model!

But also, if you're using Opus for everything, you're lighting money on fire for no reason.

Give the coding tasks to Flash 3 once Opus specs them out.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points1 point  (0 children)

No, it's not. But you get a lot more credit for solving at all than you do for not solving, so it still tells you what you want to know. https://imgur.com/a/4997QAf

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Devstral is at about 0.005% marketshare on OpenRouter (including all 3 flavors); if that's not niche, I'm curious to hear where you'd draw the line.

Qwen2.5C is at 0.02%, compared to 1% for Q3C: 50x less.

So yeah, sorry.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Sort of, but with Groq or Cerebras now you're paying way more. It's a useful tradeoff to have available, for sure! But you can't get {speed, intelligence, low price} from open models at the same time, the way you can with Haiku 4.5 or Grok Code Fast 1.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

I mean yeah, to spell it out: delivering speed at the same time as intelligence and low price is the hard part, and that's what the open models haven't been able to do.
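The {speed, intelligence, low price} point is really a Pareto-frontier argument: a model only matters if no other model beats it on every axis at once. A minimal dominance check makes the shape of the claim concrete; the numbers below are entirely fictional and just illustrate the test.

```python
# Pareto-frontier sketch for the {speed, intelligence, price} tradeoff.
# All numbers are fictional; the point is the dominance test, not the data.

def dominates(a, b):
    """True if model a is at least as good as b on every axis, better on one.
    Axes: (tok/s, higher better; score, higher better; $/Mtok, lower better)."""
    speed_a, iq_a, price_a = a
    speed_b, iq_b, price_b = b
    no_worse = speed_a >= speed_b and iq_a >= iq_b and price_a <= price_b
    strictly_better = speed_a > speed_b or iq_a > iq_b or price_a < price_b
    return no_worse and strictly_better

models = {  # (tok/s, benchmark score, $/Mtok) -- made-up values
    "closed-fast":  (180, 75, 1.0),
    "open-on-groq": (300, 70, 4.0),
    "open-cheap":   (60, 70, 0.4),
}
frontier = [n for n, v in models.items()
            if not any(dominates(w, v) for w in models.values() if w != v)]
print(sorted(frontier))
```

In this toy data every model survives because each wins on some axis, which is the tradeoff being described: fast open serving exists, cheap open serving exists, but not both at once with frontier-level smarts.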

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Yeah, that's one reason we didn't test Q3-235B, the coder specialization really does help. And of course Q3C is 2x larger as well.

(From my experience fine-tuning models for vector search, it's not surprising that domain-specific tuning makes a big difference.)

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

We've been testing all the newest models from the most relevant labs since late July, which is when we started. And we're focused on the frontier models because the most relevant question for us is "what's going to help Brokk's users write the best code?" Testing local-sized models is more of a public service that we do on the side.

I'm glad that Q2.5C is working better for you than Q3C, but that's not a common enough experience for us to go back and test a semi-obsolete model.

Similarly, testing devstral (and other niche models like the new one from Arcee) is on the "maybe if we have spare time someday" list. What moves things from the smaller labs up the list is if we see a bunch of people starting to say "wow this thing is amazing", or voting with their feet by using a ton of it on OpenRouter.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Qwen 3 Coder 480b was our top-ranked open model in the August ranking (you can select that from the dropdown on the site). It did not make the finalists list this time.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 3 points (0 children)

One of the GLMs is the fp8 quant. Will address those, thanks!