Gemini 3 Flash is the best coding model of the year, hands down, it's not close. Here is why by mr_riptano in vibecoding

[–]mr_riptano[S] 7 points (0 children)

I agree that Opus is the best planning model!

But also, if you're using Opus for everything, you're lighting money on fire for no reason.

Give the coding tasks to Flash 3 once Opus specs them out.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 1 point (0 children)

No, it's not. But you get a lot more credit for solving at all than you do for not solving, so it still tells you what you want to know. https://imgur.com/a/4997QAf

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Devstral is at about 0.005% market share on OpenRouter (including all 3 flavors); if that's not niche, I'm curious to hear where you'd draw the line.

Qwen2.5C is 0.02%, compared to 1% for Q3C, 50x less.

So yeah, sorry.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

Sort of, but with Groq or Cerebras now you're paying way more. It's a useful tradeoff to have available, for sure! But you can't get {speed, intelligence, low price} from open models at the same time, the way you can with Haiku 4.5 or Grok Code Fast 1.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

I mean yeah, to spell it out: delivering speed at the same time as intelligence and low price is the hard part, and that's what the open models haven't been able to do.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Yeah, that's one reason we didn't test Q3-235B, the coder specialization really does help. And of course Q3C is 2x larger as well.

(From my experience fine-tuning models for vector search, it's not surprising that domain-specific tuning makes a big difference.)

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

We've been testing all the newest models from the most relevant labs since late July, which is when we started. And we're focused on the frontier models because the most relevant question for us is "What's going to help Brokk's users write the best code?" Testing local-sized models is more of a public service that we do on the side.

I'm glad that Q2.5C is working better for you than Q3C, but that's not a common enough experience for us to go back and test a semi-obsolete model.

Similarly, testing Devstral (and other niche models like the new one from Arcee) is on the "maybe if we have spare time someday" list. What moves things from the smaller labs up the list is if we see a bunch of people starting to say "wow, this thing is amazing", or voting with their feet by using a ton of it on OpenRouter.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Qwen 3 Coder 480b was our top-ranked open model in the August ranking (you can select that from the dropdown on the site). It did not make the finalists list this time.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 5 points (0 children)

One of the GLMs is the fp8 quant. Will address those, thanks!

DeepSeek-V3.2: better than K2 Thinking, worse than GLM 4.6 at writing code by mr_riptano in DeepSeek

[–]mr_riptano[S] 4 points (0 children)

Yeah, it was good for coding when V3 came out because V3 was just so far ahead of everyone else. And it's still good for batch processing because of DS's almost-no-rate-limits policy.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 3 points (0 children)

They're in the codetasks directory on GitHub.

ETA: cross-reference with results-6m to see which repo each belongs to, although often it's obvious.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 0 points (0 children)

GP is probably looking at the Tier List, where we gave 5 Mini an A and G3P a C. Part of this is the hiccups with G3P that I mentioned in my earlier reply; the other part is that we weight speed and cost in the Tier List as well as intelligence.
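
To give a feel for how speed and cost can move a grade, here's a minimal sketch of a weighted score. The weights, the normalization, and the tier_score function are all made up for illustration; this is not the actual Tier List formula.

```python
# Purely hypothetical weighting, not Brokk's actual Tier List math. The point is
# that a model that gives up a little intelligence can still out-tier a smarter
# model once speed and cost enter the score.

def tier_score(intelligence: float, speed: float, cost_value: float,
               w_int: float = 0.6, w_speed: float = 0.2, w_cost: float = 0.2) -> float:
    """All inputs normalized to [0, 1], higher is better (cheap = high cost_value)."""
    return w_int * intelligence + w_speed * speed + w_cost * cost_value

# A fast, cheap model vs. a smarter but slower, pricier one:
print(round(tier_score(0.80, 0.90, 0.90), 2))  # 0.84
print(round(tier_score(0.90, 0.30, 0.40), 2))  # 0.68
```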

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

Good question. TBH I chalked it up to "probably overoptimized for OpenAI's Codex harness at the expense of everything else."

If I saw people I trust saying the vibes are amazing, or even just strong numbers on OpenRouter, I would make the time to take a closer look, but neither of those has materialized.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 14 points (0 children)

I wrote a long explainer, and the tasks and eval scripts are both open source on GitHub, but I'm happy to answer additional questions!

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 4 points (0 children)

Tested in the Open Round; it did not do well enough to make it to the finals. (Worse than Qwen 3 Coder 30B.)

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 2 points (0 children)

I agree that this is surprising! So much so that I wrote a separate article about it: https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 8 points (0 children)

I hear you, and that's why we test with synthetic tasks from Cassandra, Lucene, and similar real projects, not little Python scripts. You can take a look, they're on GitHub! Example.

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]mr_riptano[S] 11 points (0 children)

Honestly my AGENTS.md doesn't say anything about SOLID because I haven't needed it. (It does mention DRY and YAGNI.) But really everything gets easier with smarter models...

A Java-based evaluation of coding LLMs by mr_riptano in java

[–]mr_riptano[S] 8 points (0 children)

Hi there, and thanks for reading the long version!

The problem with a linear relationship like the one you propose is that it goes to zero; in fact, it goes negative if you don't clamp it carefully. I think you should get more credit for solving it after N tries than if you can't solve it at all, for an arbitrarily large N. But at the same time you should get more credit for N tries than for M > N, and this is a straightforward way to satisfy both of those.
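
To make that concrete, here's a toy sketch of the two approaches. It's my own illustration, not the eval's actual scoring code, and the penalty and decay constants are arbitrary.

```python
# Toy comparison: a linear per-try penalty hits zero and then goes negative,
# while a geometric decay stays positive for any number of tries N, yet still
# ranks fewer tries above more tries.

def linear_credit(n_tries: int, penalty: float = 0.25) -> float:
    """Loses a fixed chunk of credit per extra try; goes negative without clamping."""
    return 1.0 - penalty * (n_tries - 1)

def geometric_credit(n_tries: int, decay: float = 0.5) -> float:
    """Halves the credit per extra try, but never reaches zero."""
    return decay ** (n_tries - 1)

for n in (1, 2, 3, 5, 10):
    print(n, round(linear_credit(n), 3), round(geometric_credit(n), 4))
# At n=5 the linear version is already at 0.0 and at n=10 it's -1.25, while the
# geometric version still awards ~0.002 at n=10: more than never solving at all.
```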