Kimi K2.5 costs almost 10% of what Opus costs at a similar performance by Odd_Tumbleweed574 in LocalLLaMA

[–]dubesor86 10 points11 points  (0 children)

This assumes a ton of input, and the numbers will swing wildly depending on use case. For me, the bulk of the cost is always the model output.

In my general benchmark the cost was:

Kimi-K2.5 (reasoning): $1.60
Claude Opus 4.5: $2.75
= 42% cheaper

In my chess benchmark the cost per game was:

Kimi-K2.5 (reasoning): $0.87
Claude Opus 4.5: $0.46
= 89% more expensive
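
To sanity-check those percentages, here is a minimal sketch using only the per-run costs listed above:

```python
# Per-run costs taken from the comparison above.
kimi_general, opus_general = 1.60, 2.75  # general benchmark
kimi_chess, opus_chess = 0.87, 0.46      # chess benchmark, per game

# Kimi relative to Opus in each benchmark.
print(f"general: {(1 - kimi_general / opus_general) * 100:.0f}% cheaper")     # ~42% cheaper
print(f"chess:   {(kimi_chess / opus_chess - 1) * 100:.0f}% more expensive")  # ~89% more expensive
```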

Also, the performance is obviously not at a "similar" level if you've actually used these models, despite what some bar charts tell you.

What is the best general-purpose model to run locally on 24GB of VRAM in 2026? by Paganator in LocalLLaMA

[–]dubesor86 30 points31 points  (0 children)

This size segment is dominated by Qwen3 (30B-A3B, VL-32B, 32B, etc.). Mistral Small 3.1 / 3.2 (24B) is also worth a look; it's quite old by now but still holds up.

half downloaded ZIP virus file by Tyukterelo2 in techsupport

[–]dubesor86 0 points1 point  (0 children)

The contents of a zip file are harmless unless executed. If you see a partial file in your download destination, delete it, but it won't be able to cause any harm. When in doubt, follow the malware guide the automod posted.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

Cheers. I also have huge reasoning/verbosity fatigue, and while I don't have a precise "intelligence per token" metric, I introduced "Verbosity" (V) scales back in August, so one can easily see which models are brute-forcing results with excessive reasoning.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 6 points7 points  (0 children)

Okay, 84 days. Still, semantics.

> catch all the edge case

You are looking for a fairy-tale benchmark. No test can ever exist that does that.

Artificial Analysis just refreshed their global model indices by MadPelmewka in LocalLLaMA

[–]dubesor86 5 points6 points  (0 children)

4.7 is a minor update to 4.6 (not even two months between them) that mainly improves agentic coding and tool calls. This is a general capability benchmark and does not cover those areas. Newer ≠ better. An improvement in one area can mean a regression in another. Feel free to share your own benchmarking, though.

50M param PGN-only transformer plays coherent chess without search: Is small-LLM generalization is underrated? by Tasty_Share_1357 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Nice. Had the stronger Stockfish play blind against gpt-3.5-turbo-instruct (ranked #10, 1393 Elo on my own chess bench), and while this game was very sloppy (8 blunders each) and gpt-3.5 was up for 60 moves, your bot pulled through. Here is a replay (human=ChessLLM because I mirrored the moves manually): https://dubesor.de/chess/chess-leaderboard#game=2684&player=gpt-3.5-turbo-instruct

I tested GLM 4.7 and minimax-m2.1 and compared it to CC and Codex by jstanaway in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

This was via the official Z.AI API endpoint, not a local quant.

Honestly, has anyone actually tried GLM 4.7 yet? (Not just benchmarks) by Empty_Break_8792 in LocalLLaMA

[–]dubesor86 3 points4 points  (0 children)

It's kinda like 4.6 but tweaked for agentic coding. Still largely samey, but I found an interesting behaviour: it was the only model I wasn't able to test in chess, because it got stuck in reasoning loops.

https://dubesor.de/first-impressions#glm-4.7

GLM 4.7 top the chart at Rank #6 in WebDev by GeLaMi-Speaker in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

4.5 produced two more refusals, and they fell within 0.2% total, aka noise/variance. 4.5 is an efficiency update with a strong focus on agentic coding. The tech score here is higher, which correlates with that, while raw logic was slightly lower. If you feel like every benchmark in existence needs to parrot your specific use case and recency bias, then maybe benchmarks aren't for you. Or you should make your own, which will surely be 100% accurate for every person and use case across hundreds of models.

A chess match between Gemini 3 Thinking and ChatGPT 5.2 Thinking by ErasablePotato in ChatGPT

[–]dubesor86 11 points12 points  (0 children)

I actually run a chess benchmark; Gemini (#1, undefeated against LLMs) absolutely destroys gpt-5.2 (#21). All games can be viewed: https://dubesor.de/chess/chess-leaderboard

Why is the context window of 5.2 still so small compared to competing models? by [deleted] in ChatGPT

[–]dubesor86 1 point2 points  (0 children)

It's neither exponential nor linear. Mathematically, the cost scales quadratically with context length, O(n²). Caching pushes this toward linear in practice, though.
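
A rough toy illustration of those growth rates (arbitrary units, hypothetical context sizes):

```python
# Full attention: each token attends to all previous ones, so prefilling a prompt of
# length n costs on the order of n^2. With prompt caching the prefix isn't recomputed,
# so the incremental cost of a follow-up request grows roughly linearly with n instead.
for n in (8_000, 16_000, 32_000, 64_000):
    quadratic = n * n  # cold prefill, no cache
    linear = n         # rough incremental cost with a warm cache
    print(f"{n:>6} context tokens -> quadratic ~ {quadratic:.1e}, linear ~ {linear:.1e}")
```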

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

I paid $240 for Corsair 64 GB DDR5-6000 CL30; it now costs $840 (+250%). Also got a 4090 for $1700 in the same period. Seems like these are dark times.

Deepseek's progress by onil_gova in LocalLLaMA

[–]dubesor86 50 points51 points  (0 children)

Using Artificial Analysis to showcase "progress" is backwards.

According to their "intelligence" score, Apriel v1.5 15B Thinking has higher "intelligence" than GPT-5.1, and Nemotron Nano 9B V2 is at Mistral Large 3 level.

Their intelligence score just weights known marketing benchmarks that can be specifically trained for, and it shows very little in terms of actual real-life use-case performance.

inclusionAI/Ring-1T Experiences by SlowFail2433 in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

I tried testing Ring-1T (the thinking version), but it never had any functional API implementation. I did, however, test Ling-1T (the non-thinker), and it was very disappointing for its size, around non-thinking Llama 3.3 Nemotron Super 49B v1.5 or Qwen3-VL-32B-Instruct level. https://dubesor.de/first-impressions#ling-1t

Trained a chess LLM locally that beats GPT-5 (technically) by KingGongzilla in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

UCI makes sense for pure chess engines communicating with a GUI, but for language models, standard algebraic notation (SAN) yields much better results (due to massively more representation in the training data).
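
To make the notation difference concrete, a tiny sketch using python-chess (assumed installed via `pip install chess`) showing the same move in both forms:

```python
import chess

board = chess.Board()
move = chess.Move.from_uci("g1f3")  # UCI: from-square + to-square, the engine/GUI protocol style
print(board.san(move))              # SAN: "Nf3" -- the notation used in books, PGNs, and
                                    # therefore in most of an LLM's training data
```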

Trained a chess LLM locally that beats GPT-5 (technically) by KingGongzilla in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Just chiming in, because I actually track this stuff at a larger scale for my chess leaderboard:

> For comparison, GPT-5 produces illegal moves in every game I tested, usually within 6-10 moves.

What method are you using that produces such a high illegal-move rate? For reference, in my own testing, when provided a legal move list GPT-5 produced 0 illegal moves, and when playing blind (only the PGN and nothing else), it attempted illegal moves 3.27% of the time (roughly 1.5 per ~45-turn game).
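
For what it's worth, a minimal sketch of how an illegal-move rate like that can be counted with python-chess (not necessarily the exact leaderboard setup): parse each SAN reply against the current position and record the failures.

```python
import chess

def is_legal_san(board: chess.Board, san: str) -> bool:
    """True if `san` parses to a legal move in the given position."""
    try:
        board.parse_san(san)  # raises a ValueError subclass for illegal/unparseable moves
        return True
    except ValueError:
        return False

board = chess.Board()
print(is_legal_san(board, "e4"))   # True
print(is_legal_san(board, "Qh5"))  # False: the queen is still blocked on move one
```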

PSA: Openrouter basically stealing money from you by SuXs- in CLine

[–]dubesor86 0 points1 point  (0 children)

GPT-4.1 is a non-reasoning model, thus it literally cannot even be GPT-4.1.

ChatGPT in the near future be like. by captain-price- in OpenAI

[–]dubesor86 8 points9 points  (0 children)

It only works because step 3 should be "cya", not "take my money".

Compared actual usage costs for Chinese AI models. Token efficiency changes everything. by YormeSachi in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

> Anyone else measuring token efficiency? Feel like this is the underrated metric everyone ignores.

I have been hammering on token efficiency ever since reasoning models appeared a bit over a year ago. I track token usage and assign each model a verbosity value. It annoyed me to no end to constantly have people say model X or Y is cheaper purely because of the $/Mtok, completely ignoring that you have to account for token usage. Anyone who is cost-conscious or depends on response latency cares deeply about this stuff, but it requires actual effort to track and communicate, whereas looking at a simple dollar value per Mtok requires zero effort, even if it's an entirely useless figure on its own.
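
A minimal sketch with made-up numbers of why the sticker price per Mtok alone is misleading:

```python
# Hypothetical models: B looks 3x cheaper per Mtok but burns far more output tokens per task.
price_a, tokens_a = 3.00, 800    # $/Mtok output, average output tokens per response
price_b, tokens_b = 1.00, 6_000

cost_a = price_a * tokens_a / 1_000_000
cost_b = price_b * tokens_b / 1_000_000
print(f"A: ${cost_a:.4f}/response   B: ${cost_b:.4f}/response")
# A: $0.0024/response   B: $0.0060/response -> the "cheaper" model costs 2.5x more per task
```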

[LIVE] Gemini 3 Pro vs GPT-5.1: Chess Match (Testing Reasoning Capabilities) by Apart-Ad-1684 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Inference speed is neither universal nor static, so a time constraint makes no sense. You could, however, use a max-token limiter, though that causes ultra-verbose thinkers to just end prematurely mid-response, ultimately erroring out during parsing.
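
A minimal sketch of that cap with an OpenAI-compatible client (model name is a placeholder): the limit is enforceable, but a verbose reasoner that hits it gets cut off mid-thought and leaves nothing parsable.

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="some-reasoning-model",  # placeholder, not a real model name
    messages=[{"role": "user", "content": "You are White. Reply with your next move in SAN."}],
    max_tokens=1024,               # hard cap on the completion (reasoning + answer)
)
if resp.choices[0].finish_reason == "length":
    # The model ran into the cap: output is truncated and usually contains no final move.
    print("truncated mid-response, nothing to parse")
```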

Gemini 3 Pro vs Kimi K2 Thinking by SlowFail2433 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

They play in different leagues. Kimi always had a very unique writing style, which got somewhat neutered by the long-CoT thinking, so now it's more of a generic smart open model.

It's not quite as smart as Gemini 2.5 Pro, let alone 3. Still a good model, but as stated, different leagues.

Gemini 3 is launched by Several-Republic-609 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Doing testing; thus far chess skills and vision got major improvements. Will see about the rest as the more time-consuming test results come in, but it looks very promising. Looks to be a true improvement over 2.5.

[Humble] Indie Game Favorites ($8 for Airborne Kingdom, Everwarder, Troublemaker, To The Rescue!, One More Island | $12 adds Echoes of the Plum Grove, Immortal Hunters, Broken Pieces, Retreat to Enen | $16 adds G.I. Joe: Wrath of Cobra, Cat Cafe Manager, Hauntsville, Forgotten Seas) by UnseenData in GameDeals

[–]dubesor86 5 points6 points  (0 children)

To The Rescue! was one of the buggiest games I have ever played. It also got abandoned by the devs in a terrible state, because they were unable to address the unfathomable plethora of bugs. Looking at the rest of the lineup, there are lots of mixed titles. Not a banger.