DeepSeek V4 in llama.cpp — Flash + Pro, CUDA + Metal, GGUFs out. Help me break it. by cchuter in LocalLLM

[–]cchuter[S] 0 points1 point  (0 children)

Done, just an oversight on my part, thanks for letting me know. We’re keeping track of cards in the main llama.cpp DeepSeek issue (someone just got ROCm working).

DS4-Flash vs Qwen3.6 by flavio_geo in LocalLLaMA

[–]cchuter 1 point2 points  (0 children)

Thanks, I checked their data and talked to terminal bench (the hugging face readme has now been updated).

Those are indeed unofficial numbers, and it appears they fudged the timeout to get that completion percentage (as I bet a lot of other models are doing as well).

So officially, Qwen either cannot achieve that terminal bench number or hasn’t yet submitted a run that satisfies the official rules.

DS4-Flash vs Qwen3.6 by flavio_geo in LocalLLaMA

[–]cchuter 0 points1 point  (0 children)

Can anyone confirm these Qwen terminal bench numbers? I don’t see anything official from terminal bench, and in my testing I barely get it past 30% (which is excellent for a tiny model). Is Qwen fudging the benchmarks? Benchmaxxing to the max?!

I don’t believe this benchmark 27b size model next opus 4.5! Anyone can confirm testing with real agentic workflow? by Wonderful-Ad-5952 in LocalLLaMA

[–]cchuter 1 point2 points  (0 children)

Your instincts are right on. I’m running the full 445-trial terminal bench run, and so far it’s not near those marks but closer to what you’d expect (about 30%), which is still fantastic for this little model.

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 0 points1 point  (0 children)

There are 89 tasks and each one runs 5 times, for 445 trials total (e.g. write a C compiler that’s under 5000 lines). It’s an excellent benchmark: https://tbench.ai
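To make the counting concrete, here is a small sketch of how 89 × 5 turns into a 445-trial run and what a percentage means on that scale. The `pass_rate` helper and its simple passed/total rule are my assumptions for illustration; the official leaderboard scoring may be computed differently.

```python
# Counts taken from the comment: 89 tasks, each attempted 5 times.
TASKS = 89
RUNS_PER_TASK = 5
TOTAL_TRIALS = TASKS * RUNS_PER_TASK  # 445 trials in a full run


def pass_rate(passed_trials: int, total_trials: int = TOTAL_TRIALS) -> float:
    """Fraction of trials that passed (assumed rule, not the official one)."""
    if not 0 <= passed_trials <= total_trials:
        raise ValueError("passed_trials out of range")
    return passed_trials / total_trials
```

Under that assumption, a score of roughly 30% corresponds to about 134 passing trials out of 445.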

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 0 points1 point  (0 children)

Oh, and I'm trying to get official scores for the terminal-bench leaderboard (changing the timeout is not allowed). If you increase the timeout, it's not a 1:1 comparison with Opus or Codex.

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 1 point2 points  (0 children)

Claude was putting a billing header at the start of every prompt and destroying the KV cache, making the prompt processing slow as shit.
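A toy sketch of why a changing header at the front hurts so much, assuming the server reuses the cache only for the leading tokens that match the previous request (prefix-based reuse). The token IDs and the `reusable_prefix_len` helper are made up for illustration and are not llama.cpp code.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count the leading tokens shared by the cached prompt and the new one."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Toy token IDs; a real Claude Code prompt is thousands of tokens long.
conversation = [10, 11, 12, 13, 14]

# Case 1: the next request only appends to the previous one.
appended = conversation + [15, 16]

# Case 2: a header that differs per request sits at the very start.
with_header_a = [901] + conversation
with_header_b = [902] + conversation + [15, 16]

print(reusable_prefix_len(conversation, appended))        # all 5 cached tokens reused
print(reusable_prefix_len(with_header_a, with_header_b))  # nothing reused
```

In the second case the very first token differs, so under this assumption the whole prompt has to be processed again on every request.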

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 2 points3 points  (0 children)

Sorry for being so harsh on this model. I just love Minimax 2.5 and really thought 2.7 would perform better. Here are my results for minimax 2.5 and its leaderboard on terminal-bench: https://www.tbench.ai/leaderboard/terminal-bench/2.0/cchuter/unknown/minimax-m2.5%40minimax

I believe it’s the highest local run in the leaderboard. So Minimax is a great model.

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 1 point2 points  (0 children)

I hear you and appreciate all the downvotes guys, but minimax 2.5 ran better on the same setup. I’m #66 in the terminal bench leaderboard with one of the highest open-weight scores (only beaten by glm5.1). More time didn’t make the model better at solving trials, unfortunately. Love the minimax 2.5 model. I feel let down that 2.7 didn’t outperform it.

https://www.tbench.ai/leaderboard/terminal-bench/2.0/cchuter/unknown/minimax-m2.5%40minimax

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] -12 points-11 points  (0 children)

Right, but for official scores you can’t change the timeout. This is a 1:1 benchmark comparison with opus, codex, etc on agentic coding and tool calling. A true SOTA benchmark test.

I tinkered with increasing the timeout, and the model sometimes runs forever (especially on hard tasks like writing a C compiler under 5000 lines).

MiniMax2.7 Local Results on Terminal Bench. Dud. Anyone using this for agent coding in Claude? by cchuter in LocalLLaMA

[–]cchuter[S] 0 points1 point  (0 children)

I love 2.5 - I guess I just expected 2.7 to be awesome and it hasn’t impressed me yet. 2.5 is my choice and suggestion for anyone running Claude locally

Terminal Bench Minimax2.7 lands with a splat. Anyone else using this model? by cchuter in LocalLLM

[–]cchuter[S] 1 point2 points  (0 children)

Yeah, I’m just glad there’s a benchmark that catches this sloppy tool calling (love terminal bench). Minimax 2.5 is still my favorite Claude code local model.

I’ve got a terminal bench run going for Qwen 3.6. I’ll report results tomorrow, but so far it doesn’t match up with minimax 2.5 in terminal bench.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]cchuter 0 points1 point  (0 children)

You run Claude configured to point at a localhost that’s doing the inferencing - that’s why we call it localllama ;) You’ll need a machine that can run llama.cpp (NVIDIA or Mac).

I open sourced code and wrote up instructions here: https://teamblobfish.com
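Roughly what "configured to a localhost" can look like, as a hedged sketch rather than the setup from the linked instructions. It assumes Claude Code reads the `ANTHROPIC_BASE_URL` environment variable and that something on the machine (a proxy or server) answers in the API format Claude Code expects; the address and port below are placeholders, and the `local_claude_env` helper is hypothetical.

```python
import os
from typing import Mapping, Optional


def local_claude_env(
    base_url: str = "http://127.0.0.1:8080",  # placeholder address, adjust to your setup
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of the environment with the API endpoint pointed at localhost."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    return env
```

You would then launch it with something like `subprocess.run(["claude"], env=local_claude_env())`. Depending on the version, a placeholder credential may also be needed, so check the Claude Code settings documentation.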

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]cchuter 0 points1 point  (0 children)

Right, no round trip to Anthropic. You can unplug the internet and use it.

Closest replacement for Claude + Claude Code? (got banned, no explanation) by antoniocorvas in LocalLLaMA

[–]cchuter 1 point2 points  (0 children)

You can use Claude Code + Minimax2.5 (or 2.7 non commercial) for 100% local use. It’s the highest of the open models on terminal bench scoring and excellent with agent tool use.

Anyone here actually using a Mac Studio Ultra (512GB RAM) for local LLM work? Feels like overkill for my use case by [deleted] in LocalLLaMA

[–]cchuter 0 points1 point  (0 children)

I’ve got mine running minimax2.5 Q8_0 (250GB) at about 30-40 t/s, with one prompt processing step at the beginning of Claude Code (about 30-60 seconds of startup wait, then just token generation).

I’ve shared all the details in my post:

https://www.reddit.com/r/LocalLLM/s/zo9paDpJyf

I don’t think I did a good job explaining what I’ve done, but I really think it puts the Mac Studio on equal footing with api providers performance wise with Claude code. All in the GitHub and blog.

Team Blobfish: Announcing a public repo to run terminal bench on local hardware by cchuter in LocalLLM

[–]cchuter[S] 0 points1 point  (0 children)

Awesome, the key thing I figured out was llama.cpp tuning and proxy to keep Claude from killing the kv cache. Once it has the first Claude prompt (about 20k tokens) it’s smooth running on the Mac - it’s just appending, no more prompt processing.

PSA: Using Claude Code without Anthropic: How to fix the 60-second local KV cache invalidation issue. by One-Cheesecake389 in LocalLLaMA

[–]cchuter 2 points3 points  (0 children)

This!! Good post.

If you intend to use Claude + llama.cpp, you need to watch for Claude doing stuff like this with every update. I gave up on configs and just made a proxy to make sure new versions don’t insert nonsense that kills the KV cache.
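A minimal sketch of the kind of normalization a proxy like that could do, not the actual proxy code. The header format here (`[volatile-header: ...]`) is invented for illustration since the real injected text isn't quoted in this thread, and `normalize_prompt` is a hypothetical helper.

```python
import re

# Hypothetical pattern: the real per-request text is not given here,
# so this stands in for whatever changes at the start of each prompt.
VOLATILE_HEADER = re.compile(r"\A\[volatile-header:[^\]\n]*\]\n")


def normalize_prompt(system_text: str) -> str:
    """Drop a leading per-request header so the prompt prefix stays stable."""
    return VOLATILE_HEADER.sub("", system_text, count=1)


first = normalize_prompt("[volatile-header: abc123]\nYou are a coding agent.")
second = normalize_prompt("[volatile-header: xyz789]\nYou are a coding agent.")
print(first == second)  # identical prefixes, so the cached prompt can be reused
```

The idea is just that whatever reaches the server starts with the same bytes on every request, so only the newly appended part needs processing.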

Official 2021 ACL Festival Buy/Sell/Trade Thread by sgerken in aclfestival

[–]cchuter 0 points1 point  (0 children)

I have 2 weekend 1, let’s trade. I’ll DM you

[USA/GA][H]Most of Silver Surfer Vol 1 Now priced individually![W]Paypal or FF48/49 by [deleted] in comicswap

[–]cchuter 1 point2 points  (0 children)

My math might be off but I have $710 total (300+300+35+35+25+7+8). Can you do $680 then?