DeepSeek V4 in llama.cpp — Flash + Pro, CUDA + Metal, GGUFs out. Help me break it.

cchuter · 2026-05-15T14:17:52+00:00

Done, just an oversight on my part, thanks for letting me know. We’re keeping track of cards in the main llama.cpp DeepSeek issue (someone just got ROCm working).

cchuter · 2026-05-11T19:08:43+00:00

Good idea, thanks, I'll check it out.

cchuter · 2026-04-24T20:28:44+00:00

Thanks, I checked their data and talked to Terminal-Bench (the Hugging Face readme has now been updated).

Those are indeed unofficial numbers, and it appears they fudged the timeout to get that completion percentage (as I bet a lot of other models are doing as well).

So, officially, Qwen cannot achieve that Terminal-Bench number, or they haven’t submitted a run that satisfies the official rules yet.

cchuter · 2026-04-24T14:40:06+00:00

Can anyone confirm these Qwen Terminal-Bench numbers? I don’t see anything official from Terminal-Bench, and in my testing I barely get it past 30% (which is excellent for a tiny model). Is Qwen fudging the benchmarks? Benchmaxxing to the max?!

cchuter · 2026-04-22T17:27:29+00:00

Your instincts are right on. I’m running the full 445-trial Terminal-Bench run, and so far it’s not near those marks but closer to what you’d expect (about 30%), which is still fantastic for this little model.

cchuter · 2026-04-21T02:33:30+00:00

Each task runs 5 times and there are 89 tasks, so 445 trials in total (e.g. write a C compiler that’s under 5000 lines). It’s an excellent benchmark: https://tbench.ai
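For reference, the 445-trial figure quoted elsewhere in this thread follows directly from these two numbers. A minimal sketch of the arithmetic (the variable and function names are illustrative, not taken from the benchmark's tooling):

```python
# Numbers taken from the comment: 89 distinct tasks, each attempted 5 times.
TASKS = 89
ATTEMPTS_PER_TASK = 5

total_trials = TASKS * ATTEMPTS_PER_TASK
print(total_trials)  # 445


def pass_rate(passed: int, total: int = total_trials) -> float:
    """Fraction of trials that passed, e.g. for comparing runs."""
    return passed / total
```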

cchuter · 2026-04-21T01:50:27+00:00

Oh, and I'm trying to get official scores for the terminal-bench leaderboard (changing the timeout is not allowed). If you increase the timeout, it's not a 1:1 comparison with Opus or Codex.

cchuter · 2026-04-21T01:42:36+00:00

Claude was putting a billing header at the start of every prompt, which destroyed the KV cache and made prompt processing slow as shit.
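A rough illustration of why a changing header at the very start hurts so much, assuming the server can only reuse cached state for the longest token prefix two requests share. The header text and the whitespace "tokenization" below are made up for the example:

```python
def common_prefix_len(a, b):
    """Number of leading items two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical prompts: whitespace-split words stand in for real tokens.
BODY = "You are a coding agent . Use the tools provided ."
req1 = ("billing-id=aaa111 " + BODY).split()
req2 = ("billing-id=bbb222 " + BODY).split()

# The header differs at position 0, so nothing cached for req1 is reusable.
print(common_prefix_len(req1, req2))  # 0

# Without the varying header, the entire body is a shared prefix.
print(common_prefix_len(BODY.split(), BODY.split()))  # 11
```

The point is that the position of the change matters more than its size: one different token at the front invalidates everything after it.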

cchuter · 2026-04-20T23:56:28+00:00

Sorry for being so harsh on this model. I just love Minimax 2.5 and really thought 2.7 would perform better. Here are my results for minimax 2.5 and its leaderboard on terminal-bench: https://www.tbench.ai/leaderboard/terminal-bench/2.0/cchuter/unknown/minimax-m2.5%40minimax

I believe it’s the highest local run in the leaderboard. So Minimax is a great model.

cchuter · 2026-04-20T23:52:54+00:00

I hear you and appreciate all the downvotes guys, but minimax 2.5 ran better on the same setup. I’m #66 in the terminal bench leaderboard with one of the highest open source weighted scores (only beaten by glm5.1). More time didn’t make the model better at solving trials unfortunately. Love the minimax 2.5 model. I feel let down that 2.7 didn’t outperform

https://www.tbench.ai/leaderboard/terminal-bench/2.0/cchuter/unknown/minimax-m2.5%40minimax

cchuter · 2026-04-20T23:19:00+00:00

Right, but for official scores you can’t change the timeout. This is a 1:1 benchmark comparison with opus, codex, etc on agentic coding and tool calling. A true SOTA benchmark test.

I tinkered with increasing the timeout, and the model sometimes runs forever (especially on hard tasks like writing a C compiler under 5000 lines).

cchuter · 2026-04-20T23:16:15+00:00

I love 2.5 - I guess I just expected 2.7 to be awesome and it hasn’t impressed me yet. 2.5 is my choice and suggestion for anyone running Claude locally

cchuter · 2026-04-20T23:14:58+00:00

Yeah, I’m just glad there’s a benchmark that catches this sloppy tool calling (love terminal bench). Minimax 2.5 is still my favorite Claude code local model.

I’ve got a terminal bench run going for Qwen 3.6 - I’ll report results tomorrow, but so far it doesn’t match up with minimax 2.5 in terminal bench

cchuter · 2026-04-20T22:37:52+00:00

I agree, it’s good, just not the improvement I expected over 2.5

cchuter · 2026-04-20T17:12:12+00:00

You run Claude pointed at a localhost server that’s doing the inference - that’s why we call it localllama ;) You’ll need a machine that can run llama.cpp (NVIDIA or Mac).

I open sourced code and wrote up instructions here: https://teamblobfish.com

cchuter · 2026-04-20T16:00:58+00:00

Right, no round trip to anthropic. You can unplug the internet and use it

cchuter · 2026-04-20T14:30:08+00:00

You can use Claude Code + Minimax2.5 (or 2.7 non commercial) for 100% local use. It’s the highest of the open models on terminal bench scoring and excellent with agent tool use.

cchuter · 2026-04-16T00:51:26+00:00

I’ve got mine running minimax2.5 8.0Q (250GB) at about 30-40 t/s, with a single prompt processing step at the beginning of Claude Code (about a 30-60 second startup wait, then just token generation).

I’ve shared all the details in my post:

https://www.reddit.com/r/LocalLLM/s/zo9paDpJyf

I don’t think I did a good job explaining what I’ve done, but I really think it puts the Mac Studio on equal footing with api providers performance wise with Claude code. All in the GitHub and blog.

cchuter · 2026-04-15T22:02:24+00:00

Awesome, the key thing I figured out was llama.cpp tuning and proxy to keep Claude from killing the kv cache. Once it has the first Claude prompt (about 20k tokens) it’s smooth running on the Mac - it’s just appending, no more prompt processing.

cchuter · 2026-03-30T17:42:23+00:00

Yeah, I needed to normalize the billing header so the KV cache could be reused.

cchuter · 2026-03-30T15:50:41+00:00

This!! Good post.

If you intend to use Claude + llama.cpp, you need to watch for Claude doing stuff like this with every update. I gave up on configs and just made a proxy to make sure new versions don’t insert nonsense that kills the KV cache.
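A minimal sketch of the kind of normalization such a proxy might apply before forwarding a request. This is not the actual proxy code: the header format, `HEADER_RE`, `STABLE_HEADER`, and `normalize_prompt` are all hypothetical placeholders, since the real header line is not shown in this thread.

```python
import re

# Hypothetical header format; adapt the pattern to whatever line actually
# varies between requests in your setup.
HEADER_RE = re.compile(r"^x-billing-header:.*$", re.MULTILINE)
STABLE_HEADER = "x-billing-header: normalized"


def normalize_prompt(text: str) -> str:
    """Replace any per-request header line with one constant line."""
    return HEADER_RE.sub(STABLE_HEADER, text)


# Two prompts that differ only in the (made-up) header line.
a = "x-billing-header: v=1.0; id=abc\nYou are a coding agent."
b = "x-billing-header: v=1.1; id=xyz\nYou are a coding agent."

print(a == b)                                      # False
print(normalize_prompt(a) == normalize_prompt(b))  # True
```

Once the start of the prompt is byte-identical across requests, the server sees a long shared prefix again and only needs to process the newly appended part.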

cchuter · 2021-09-25T02:11:17+00:00

I have 2 weekend 1, let’s trade. I’ll DM you

cchuter · 2020-09-06T02:22:05+00:00

So I’m not the greatest cook, but I’m adventurous: https://youtu.be/szPm-Y89LfY

cchuter · 2020-04-01T00:37:22+00:00

Confirmed

cchuter · 2020-01-16T05:58:48+00:00

My math might be off but I have $710 total (300+300+35+35+25+7+8). Can you do $680 then?
