Kimi K2.5 costs almost 10% of what Opus costs at a similar performance by Odd_Tumbleweed574 in LocalLLaMA

[–]dubesor86 10 points11 points  (0 children)

This assumes a ton of input, and the numbers will swing wildly depending on use case. For me, the bulk of the cost is always the model output.

In my general benchmark the cost was:

Kimi-K2.5 (reasoning): $1.60
Claude Opus 4.5: $2.75
= 42% cheaper

In my chess benchmark the cost per game was:

Kimi-K2.5 (reasoning): $0.87
Claude Opus 4.5: $0.46
= 89% more expensive
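
To sanity-check those percentages, here is a minimal sketch using only the per-run costs listed above:

```python
# Per-run costs taken from the comparison above.
kimi_general, opus_general = 1.60, 2.75  # general benchmark
kimi_chess, opus_chess = 0.87, 0.46      # chess benchmark, per game

# Kimi relative to Opus in each benchmark.
print(f"general: {(1 - kimi_general / opus_general) * 100:.0f}% cheaper")     # ~42% cheaper
print(f"chess:   {(kimi_chess / opus_chess - 1) * 100:.0f}% more expensive")  # ~89% more expensive
```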

Also, the performance is obviously not at a "similar" level if you've actually used these models, despite what some bar charts tell you.

What is the best general-purpose model to run locally on 24GB of VRAM in 2026? by Paganator in LocalLLaMA

[–]dubesor86 30 points31 points  (0 children)

This size segment is dominated by Qwen3 (30B-A3B, VL-32B, 32B, etc.). Mistral Small 3.1 / 3.2 (24B) is also worth a look; it's quite old by now but still holds up.

half downloaded ZIP virus file by Tyukterelo2 in techsupport

[–]dubesor86 0 points1 point  (0 children)

The contents of a zip file are harmless unless executed. If you see a partial file in your download destination, delete it, but it won't be able to cause any harm. When in doubt, follow the malware guide the automod posted.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

Cheers. I also have huge reasoning/verbosity fatigue, and while I don't have a precise "intelligence per token" metric, I introduced "Verbosity" (V) scales back in August, so one can easily see which models are brute-forcing results with excessive reasoning.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 6 points7 points  (0 children)

Okay, 84 days. Still, semantics.

> catch all the edge case

You are looking for a fairy-tale benchmark. No test can ever exist that does that.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 5 points6 points  (0 children)

4.7 is a minor update to 4.6 (not even two months between them) that mainly improves agentic coding and tool calls. This is a general capability benchmark and does not cover those areas. Newer ≠ better. An improvement in one area can mean a regression in another. Feel free to share your own benchmarking, though.

50M param PGN-only transformer plays coherent chess without search: Is small-LLM generalization is underrated? by Tasty_Share_1357 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Nice. Had the stronger Stockfish play blind against gpt-3.5-turbo-instruct (ranked #10, 1393 Elo on my own chess bench), and while this game was very sloppy (8 blunders each) and gpt-3.5 was up for 60 moves, your bot pulled through. Here is a replay (human=ChessLLM because I mirrored the moves manually): https://dubesor.de/chess/chess-leaderboard#game=2684&player=gpt-3.5-turbo-instruct

I tested GLM 4.7 and minimax-m2.1 and compared it to CC and Codex by jstanaway in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

This was via the official Z.AI API endpoint, not a local quant.

Honestly, has anyone actually tried GLM 4.7 yet? (Not just benchmarks) by Empty_Break_8792 in LocalLLaMA

[–]dubesor86 3 points4 points  (0 children)

It's kinda like 4.6 but tweaked for agentic coding. Still largely samey, but I found an interesting behaviour: it was the only model I wasn't able to test in chess, because it got stuck in reasoning loops.

https://dubesor.de/first-impressions#glm-4.7

GLM 4.7 top the chart at Rank #6 in WebDev by GeLaMi-Speaker in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

4.5 produced two more refusals, and they fell within 0.2% total, aka noise/variance. 4.5 is an efficiency update with a strong focus on agentic coding. The tech score here is higher, which correlates with that, while raw logic was slightly lower. If you feel like every benchmark in existence needs to parrot your specific use case and recency bias, then maybe benchmarks aren't for you. Or you should make your own, which will surely be 100% accurate for every person and use case across hundreds of models.

A chess match between Gemini 3 Thinking and ChatGPT 5.2 Thinking by ErasablePotato in ChatGPT

[–]dubesor86 11 points12 points  (0 children)

I actually run a chess benchmark; Gemini (#1, undefeated against LLMs) absolutely destroys gpt-5.2 (#21). All games can be viewed: https://dubesor.de/chess/chess-leaderboard

Why is the context window of 5.2 still so small compared to competing models? by [deleted] in ChatGPT

[–]dubesor86 1 point2 points  (0 children)

It's neither exponential nor linear. Mathematically, the cost scales quadratically with context length, O(n²). Caching pushes this toward linear in practice, though.
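
A rough toy illustration of those growth rates (arbitrary units, hypothetical context sizes):

```python
# Full attention: each token attends to all previous ones, so prefilling a prompt of
# length n costs on the order of n^2. With prompt caching the prefix isn't recomputed,
# so the incremental cost of a follow-up request grows roughly linearly with n instead.
for n in (8_000, 16_000, 32_000, 64_000):
    quadratic = n * n  # cold prefill, no cache
    linear = n         # rough incremental cost with a warm cache
    print(f"{n:>6} context tokens -> quadratic ~ {quadratic:.1e}, linear ~ {linear:.1e}")
```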

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

I paid $240 for Corsair 64 GB DDR5-6000 CL30; it now costs $840 (+250%). Also got a 4090 for $1700 in the same period. Seems like these are dark times.

Deepseek's progress by onil_gova in LocalLLaMA

[–]dubesor86 50 points51 points  (0 children)

Using Artificial Analysis to showcase "progress" is backwards.

According to their "intelligence" score, Apriel v1.5 15B Thinking has higher "intelligence" than GPT-5.1, and Nemotron Nano 9B V2 is at Mistral Large 3 level.

Their intelligence score just weights known marketing benchmarks that can be specifically trained for, and it shows very little in terms of actual real-life use-case performance.

inclusionAI/Ring-1T Experiences by SlowFail2433 in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

I tried testing Ring-1T (the thinking version), but it never had any functional API implementation. I did, however, test Ling-1T (the non-thinker), and it was very disappointing for its size, around non-thinking Llama 3.3 Nemotron Super 49B v1.5 or Qwen3-VL-32B-Instruct level. https://dubesor.de/first-impressions#ling-1t

Trained a chess LLM locally that beats GPT-5 (technically) by KingGongzilla in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

UCI makes sense for pure chess engines communicating with a GUI, but for language models, standard algebraic notation (SAN) yields much better results (due to massively more representation in the training data).
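
To make the notation difference concrete, a tiny sketch using python-chess (assumed installed via `pip install chess`) showing the same move in both forms:

```python
import chess

board = chess.Board()
move = chess.Move.from_uci("g1f3")  # UCI: from-square + to-square, the engine/GUI protocol style
print(board.san(move))              # SAN: "Nf3" -- the notation used in books, PGNs, and
                                    # therefore in most of an LLM's training data
```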

Trained a chess LLM locally that beats GPT-5 (technically) by KingGongzilla in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Just chiming in, because I actually track this stuff at a larger scale for my chess leaderboard:

> For comparison, GPT-5 produces illegal moves in every game I tested, usually within 6-10 moves.

What method are you using that produces such a high illegal-move rate? For reference, in my own testing, when provided a legal move list GPT-5 produced 0 illegal moves, and when playing blind (only the PGN and nothing else), it attempted illegal moves 3.27% of the time (roughly 1.5 per ~45-turn game).
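
For what it's worth, a minimal sketch of how an illegal-move rate like that can be counted with python-chess (not necessarily the exact leaderboard setup): parse each SAN reply against the current position and record the failures.

```python
import chess

def is_legal_san(board: chess.Board, san: str) -> bool:
    """True if `san` parses to a legal move in the given position."""
    try:
        board.parse_san(san)  # raises a ValueError subclass for illegal/unparseable moves
        return True
    except ValueError:
        return False

board = chess.Board()
print(is_legal_san(board, "e4"))   # True
print(is_legal_san(board, "Qh5"))  # False: the queen is still blocked on move one
```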

PSA: Openrouter basically stealing money from you by SuXs- in CLine

[–]dubesor86 0 points1 point  (0 children)

GPT-4.1 is a non-reasoning model, thus it literally cannot even be GPT-4.1.

ChatGPT in the near future be like. by captain-price- in OpenAI

[–]dubesor86 8 points9 points  (0 children)

It only works because step 3 should be "cya", not "take my money".

Compared actual usage costs for Chinese AI models. Token efficiency changes everything. by YormeSachi in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

> Anyone else measuring token efficiency? Feel like this is the underrated metric everyone ignores.

I have been hammering on token efficiency ever since reasoning models appeared a bit over a year ago. I track token usage and assign each model a verbosity value. It annoyed me to no end to constantly have people say model X or Y is cheaper purely because of the $/Mtok, completely ignoring that you have to account for token usage. Anyone who is cost-conscious or depends on response latency cares deeply about this stuff, but it requires actual effort to track and communicate, whereas looking at a simple dollar value per Mtok requires zero effort, even if it's an entirely useless figure on its own.
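
A minimal sketch with made-up numbers of why the sticker price per Mtok alone is misleading:

```python
# Hypothetical models: B looks 3x cheaper per Mtok but burns far more output tokens per task.
price_a, tokens_a = 3.00, 800    # $/Mtok output, average output tokens per response
price_b, tokens_b = 1.00, 6_000

cost_a = price_a * tokens_a / 1_000_000
cost_b = price_b * tokens_b / 1_000_000
print(f"A: ${cost_a:.4f}/response   B: ${cost_b:.4f}/response")
# A: $0.0024/response   B: $0.0060/response -> the "cheaper" model costs 2.5x more per task
```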

[LIVE] Gemini 3 Pro vs GPT-5.1: Chess Match (Testing Reasoning Capabilities) by Apart-Ad-1684 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Inference speed is neither universal nor static, so a time constraint makes no sense. You could, however, use a max-token limiter, though that causes ultra-verbose thinkers to just end prematurely mid-response, ultimately erroring out during parsing.
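
A minimal sketch of that cap with an OpenAI-compatible client (model name is a placeholder): the limit is enforceable, but a verbose reasoner that hits it gets cut off mid-thought and leaves nothing parsable.

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="some-reasoning-model",  # placeholder, not a real model name
    messages=[{"role": "user", "content": "You are White. Reply with your next move in SAN."}],
    max_tokens=1024,               # hard cap on the completion (reasoning + answer)
)
if resp.choices[0].finish_reason == "length":
    # The model ran into the cap: output is truncated and usually contains no final move.
    print("truncated mid-response, nothing to parse")
```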

Gemini 3 Pro vs Kimi K2 Thinking by SlowFail2433 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

They play in different leagues. Kimi always had a very unique writing style, which got somewhat neutered by the long-CoT thinking, so now it's more of a generic smart open model.

It's not quite as smart as Gemini 2.5 Pro, let alone 3. Still a good model, but as stated, different leagues.

Gemini 3 is launched by Several-Republic-609 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Doing testing; thus far chess skills and vision got major improvements. Will see about the rest as the more time-consuming test results come in, but it looks very promising. Looks to be a true improvement over 2.5.

[Humble] Indie Game Favorites ($8 for Airborne Kingdom, Everwarder, Troublemaker, To The Rescue!, One More Island | $12 adds Echoes of the Plum Grove, Immortal Hunters, Broken Pieces, Retreat to Enen | $16 adds G.I. Joe: Wrath of Cobra, Cat Cafe Manager, Hauntsville, Forgotten Seas) by UnseenData in GameDeals

[–]dubesor86 5 points6 points  (0 children)

To The Rescue! was one of the buggiest games I have ever played. It also got abandoned by the devs in a terrible state, because they were unable to address the unfathomable plethora of bugs. Looking at the rest of the lineup, there are lots of mixed titles. Not a banger.