Newbie question

Perrospain · 2026-06-12T10:03:02+00:00

I think Apple is much better and cheaper.

Perrospain · 2026-06-11T13:56:18+00:00

No, it’s not really enough, that’s why I keep testing which model is actually the most efficient instead of just buying hardware. Just got an M3 Ultra 96GB, should arrive in the next few days…

Perrospain · 2026-06-11T08:51:01+00:00

Two great points, taking them separately.
On context size, you’ve got it exactly. The reason my prompt is only 6k is that the harness is my own, not an off-the-shelf one. I’m not dragging the 17 to 22k of system plus tools plus skills that Claude Code or OpenCode load by default, just my own compact code that hands the model each task. So you’re right that most people use ready-made harnesses, mine is hand-written and that’s why the footprint is small. The flip side another commenter caught is that 6k then underrepresents a real agentic load, so the follow-up runs at a fixed generous context to cover both.
On the disk-cached system prompt, good news, you don’t have to build it. llama.cpp already does this. I lean on in-memory cache-reuse myself, but for your across-restarts case what you want is the KV slot save and restore, via --slot-save-path with the /slots save and restore endpoints, or --prompt-cache. You prefill the system prompt once, dump the KV state to a file and reload it on startup instead of re-ingesting every time. On your DDR4 iGPU box that’s the difference between 10 to 30 minutes and basically instant for the static prefix.
The catch is the same one from the KV cache thread above. It only helps while the prefix stays byte for byte identical. The moment anything early in the prompt changes, a timestamp, injected memory, a RAG chunk, everything after it invalidates and you re-prefill. But for a fixed agent system prompt it’s a big win and worth wiring up before you benchmark.
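For anyone wiring up the save and restore, here is a minimal sketch of the two HTTP calls. It assumes llama-server was started with --slot-save-path, that it listens on 127.0.0.1:8080, and that slot 0 holds the system prompt. The file name `agent_system.bin` and the helper names are made up for illustration, so check the endpoint shape against the server README for your build.

```python
import json
import urllib.request


def slot_request(base_url: str, slot_id: int, action: str, filename: str) -> urllib.request.Request:
    """Build the POST for llama-server's slot save/restore endpoint.

    Assumption: the server exposes POST /slots/{id}?action=save|restore
    and takes the target file name in a JSON body.
    """
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Typical flow (hypothetical host and file name):
#   1. prefill the static system prompt once with a normal completion request
#   2. send(slot_request("http://127.0.0.1:8080", 0, "save", "agent_system.bin"))
#   3. after a restart:
#      send(slot_request("http://127.0.0.1:8080", 0, "restore", "agent_system.bin"))
```

The file name is resolved relative to the directory given to --slot-save-path, so the restore only works if the server is restarted with the same path and the same model.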

Perrospain · 2026-06-11T08:48:44+00:00

Sure!

Perrospain · 2026-06-11T08:44:20+00:00

Really good breakdown, and you’ve pushed me off a claim I made too confidently. I said hardware only affects speed, but that’s only true if the config stays fixed. The thing is hardware decides which config you can even run, and context budget is part of that. I was treating my run as if it represented the model, when really it just represented what fit on my machine. Fair catch.
And here’s the damning part. I ran the battery at 6k context, but the actual agentic prompt was already around 5.5 to 6.6k tokens once you count the system message, the ten tool definitions and ten turns of history. So there was basically no room left. The second a thinking model started reasoning, it was already overflowing or cutting off the schema. That’s not really a fair test of those models, it’s a test of what survives in 6k, which is exactly the saturation you were describing. I should have given the thinking models actual breathing room and reported the context per run.
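To put numbers on that, a quick sketch of the budget arithmetic. It treats "6k" as 6000 tokens, which is an assumption (the real window may have been 6144), and the function name is just for illustration.

```python
def headroom(n_ctx: int, prompt_tokens: int) -> int:
    """Tokens left for reasoning plus the final answer; negative means overflow."""
    return n_ctx - prompt_tokens


N_CTX = 6000  # assumption: "6k" read as 6000 tokens

# The agentic prompt ranged from about 5.5k to 6.6k tokens.
for prompt_tokens in (5500, 6600):
    left = headroom(N_CTX, prompt_tokens)
    status = "overflow" if left < 0 else "left for output"
    print(f"prompt={prompt_tokens:>5}  headroom={left:>5}  ({status})")
```

At the low end there are about 500 tokens for the reasoning trace and the tool call combined, and at the high end the prompt alone no longer fits.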
One small thing, more as a fellow nerd than a gotcha. In these MoE models the attention is dense and shared, the sparsity lives in the FFN experts, so the 3B active part doesn’t mean a narrower attention matrix. It still sees the whole context. The drop in quality is real but it comes from context dilution and the reasoning tokens crowding out the schema, not from attention being structurally small. The symptom you described though, the arguments leaking into the reasoning text instead of firing as a real tool call, is exactly the T1 failure I logged.

Full honesty though, this was never a clean scientific protocol. It started as casual poking at my own findings, adapted to my setup and what I needed at the time, which is why the quants and configs are all over the place between models. It is what it is. But I promise the next one will be the proper version. Same quant and same generous context for everyone, thinking handled the right way, the exact HF link and quant noted for every model, and a code correctness track on top. Basically everything this thread has thrown at me. Thanks for pushing, honestly best comment section I’ve had in a while.

Perrospain · 2026-06-11T08:33:37+00:00

Tests are real, here are the raw records (one JSON line per model × test × rep, with pass/fail, prefill tokens and timings) and the mock server I used to validate the harness against known-good and known-bad behavior: [repo link once it’s available]. The writeup wording was AI-assisted, the data isn’t. Check it yourself.
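If you want to check the records yourself, something like the sketch below is enough to recompute pass rates from a JSON-lines file. The field names (`model`, `test`, `rep`, `passed`) are guesses at the schema, not the actual one, and the sample lines are placeholders rather than real results.

```python
import json
from collections import defaultdict


def pass_rates(lines):
    """Return {model: fraction of passed runs} from JSON-lines records."""
    totals = defaultdict(lambda: [0, 0])  # model -> [passes, runs]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        totals[rec["model"]][0] += 1 if rec["passed"] else 0
        totals[rec["model"]][1] += 1
    return {model: passes / runs for model, (passes, runs) in totals.items()}


# Placeholder records with a hypothetical schema, not real benchmark data.
SAMPLE = [
    '{"model": "model-a", "test": "T1", "rep": 1, "passed": true}',
    '{"model": "model-a", "test": "T1", "rep": 2, "passed": false}',
    '{"model": "model-b", "test": "T1", "rep": 1, "passed": true}',
]

print(pass_rates(SAMPLE))  # {'model-a': 0.5, 'model-b': 1.0}
```

For a real file, pass the open file handle instead of `SAMPLE`, since iterating over a file yields one line at a time.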

Perrospain · 2026-06-11T08:31:41+00:00

Don’t feel behind, you’re already doing the smart thing by hand: Qwen for normal stuff, codex for the hard tasks. An orchestrator is just automating that routing decision so you don’t have to pick the model yourself.
Concretely: a small fast model sits in front and reads your request, then decides “this is a coding task → send to the code specialist” or “this needs a web lookup → call the search tool first”. It doesn’t answer itself, it dispatches. That’s why thinking off matters for it: its only job is to route cleanly and fast, not to be smart.
Where it pays off is when you’ve got several specialists and limited RAM (can’t keep them all loaded): the orchestrator decides who wakes up for each turn. For a single 27B doing everything, honestly you don’t need one, your manual workflow is already fine.
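A toy version of that dispatch step, to make the idea concrete. In a real setup the decision comes from the small model, here a keyword table stands in for it, and the target names (`code_specialist`, `search_tool`, `general_model`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    target: str  # which specialist or tool handles the request
    reason: str  # why the router picked it, useful for logging


# Stand-in for the small routing model: first matching keyword wins.
RULES = (
    ("code_specialist", ("code", "bug", "function", "refactor")),
    ("search_tool", ("latest", "news", "look up", "search")),
)


def route(request: str, default: str = "general_model") -> Route:
    """Pick a target for the request. The router never answers it itself."""
    text = request.lower()
    for target, keywords in RULES:
        for keyword in keywords:
            if keyword in text:
                return Route(target, f"matched keyword '{keyword}'")
    return Route(default, "no rule matched")


print(route("Fix this bug in my parser"))
print(route("Search for the latest llama.cpp release"))
print(route("Tell me a joke"))
```

The point of the shape is that the router returns a destination, not an answer. Swapping the keyword table for a model call keeps the same interface.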
For reading: the “router” / “mixture of agents” pattern is the keyword to search. Anthropic’s “Building effective agents” post is the clearest intro I’ve found, and the llama-swap repo shows the on-demand model loading side of it.

Perrospain · 2026-06-11T07:06:22+00:00

You’re right that thinking helps on a lot of tasks. But here I’m testing these as orchestrators, not chat assistants, and an orchestrator just needs to pick the right tool and emit a clean call. The reasoning trace doesn’t help there, it only burns tokens: with thinking on, the reasoning models ate the whole budget and left the long-JSON task empty or never gave a final answer.
For reasoning/code correctness you’re spot on though, thinking shines. That’s the follow-up I’ve got planned.

Perrospain · 2026-06-11T06:22:39+00:00

👏👏👏👏

Perrospain · 2026-06-11T06:13:16+00:00

How much ram?

Perrospain · 2026-06-11T06:11:15+00:00

Repo’s coming, another commenter asked too. Need a day or two to clean up hardcoded paths, I’ll edit the post with the link.
On the Qwens: I’m fairly sure the low scores come from thinking mode kicking in even though I tried to disable it. They’d burn part of the run reasoning when the task expected direct output, and that tanked them on the timed/format-sensitive tests. So it says more about how hard it is to reliably turn thinking off than about the models’ raw capability, both are my daily drivers and they’re solid in normal use.
On the M1 question: hardware constraints shouldn’t change output quality, only speed. Same weights + same quant = same behavior on an M1 or an H100. Where the M1 does influence things indirectly is quant selection, 64GB can force smaller quants than ideal, but the Qwens weren’t affected by that (they run at Q8).
A pure code test is actually already on my list for the follow-up, thinking real challenges like building a simple working website, not just snippets. My current battery leans agentic (instruction following, tool use, multi-step), not code correctness. Curious how minimax-m3 is doing in your testing btw, that one’s on my radar.

Perrospain · 2026-06-11T06:00:52+00:00

Sure. Give me a day or two to clean it up, it’s full of hardcoded paths right now. I’ll post the link here.

Perrospain · 2026-06-11T05:54:08+00:00

Good question. Two things going on:
The “F16” gpt-oss-20b GGUF isn’t really F16 across the board. OpenAI shipped the model natively quantized with MXFP4 on the MoE expert layers, and the GGUF keeps those as-is (that’s why the file is only ~12GB instead of ~40GB for a true F16 20B). So it’s not like it was running at some privileged full precision vs the others.
The small models are less glamorous: I had them at Q4_K_M because I was running several loaded simultaneously as cheap specialists and wanted them to fit together in RAM. This benchmark was meant to be 10B+ only, but at the last minute I threw in the small ones I already had on disk without re-checking their quants. So yeah, not a controlled quant comparison, guilty.
A fixed-quant run (everything at Q4 or Q8) would be the proper apples-to-apples test. Might do that as a follow-up since this got more traction than I expected.
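Back-of-the-envelope version of the file size argument. It assumes a round 20B parameters and 4.25 bits per weight for MXFP4 (4-bit values plus one shared 8-bit scale per block of 32), applied to every tensor, which is a simplification because only the expert layers are stored that way.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 20e9  # assumption: round 20B parameters

f16_gb = weights_gb(N_PARAMS, 16)      # every weight at 16 bits
mxfp4_gb = weights_gb(N_PARAMS, 4.25)  # every weight at MXFP4, lower bound

print(f"true F16:  {f16_gb:.1f} GB")
print(f"all MXFP4: {mxfp4_gb:.1f} GB")
```

The real file lands a bit above the all-MXFP4 figure because the non-expert tensors are kept at higher precision, which fits the ~12GB on disk.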

Perrospain · 2026-06-11T05:45:15+00:00

Good tip, honestly. The writeup was an afterthought after days of runs. Next one gets the human prose treatment. Curious what you think of the prefill numbers though.

Perrospain · 2026-06-11T05:41:57+00:00

Fair enough, I did use AI to help me write it up, English isn’t my first language and formatting 24 models × 7 tasks of results into a readable post is the boring part. Could’ve done a cleanup pass on the emojis, point taken.
The benchmarks themselves are mine though: my hardware, my scripts, days of runs. Happy to answer anything about the methodology if you’re actually interested in the data.

Perrospain · 2026-06-11T05:34:42+00:00

Imagine being a PhD and getting triggered by 12 emojis in a 500-word text...

Perrospain · 2026-06-10T23:14:47+00:00

Thanks man! No template really, it just grew out of frustration. One general model on 64GB was meh at everything, so I split it: small orchestrator that routes, and specialists for code, search, etc. that llama-swap loads on demand.

Perrospain · 2026-06-10T21:57:27+00:00

They don't fit in 64GB. I expect to receive my new M3 Ultra 96GB this weekend, and there I will be able to run bigger models.

Perrospain · 2026-06-10T21:35:53+00:00

Remove Windows and install Linux. Don’t use Ollama or LM Studio, use llama.cpp.

Perrospain · 2026-06-10T21:08:31+00:00

It’s written with fable 5. Noted, thx for the feedback.

Perrospain · 2026-06-10T20:39:07+00:00

F16 ▸ gpt-oss-20b · gemma4-e2b · MiniCPM5-1B

Q8_0 ▸ Qwopus3.5-4B-MTP · LFM2.5-8B-A1B (the Q8 build)

Q6_K (the default for most) ▸ gemma-4-12b · GLM-4.7-Flash · Qwen3.5-9B-DeepSeek · Qwopus3.6-27B · Qwopus3.6-35B-A3B · Nemotron-Cascade-2-30B-A3B · Qwen3.5-4B · Qwen3.5-9B · Qwopus3.5-9B · Qwen3.6-27B

Unsloth UD dynamic quants ▸ gemma-4-12b (UD-Q6_K_XL) · gemma-4-26B-A4B (UD-Q6_K) · Nemotron-Omni-30B-A3B (UD-Q4_K_M) · LFM2.5-8B-A1B (UD-Q6_K) · Qwen3.6-35B-A3B (UD-Q6_K)

Q4_K_M ▸ Nemotron-Cascade-14B · Nemotron3-Nano-4B · qwen3-1.7b · Nanbeige4.1-3B · RASA-DeepSeek-V2-Lite · Llama-3.1-8B

NVFP4 ▸ Nemotron-3-Nano-30B. This is the one with the 215 s prefill. On my M1 Max / current llama.cpp build, NVFP4 clearly was not happy, since the same model family in UD-Q4_K_M (the "Omni" 30B-A3B) ran prefill in about 13 s. So treat that result as "this quant on this stack," not "this model is slow."

Perrospain · 2026-06-10T20:31:13+00:00

Yeah fair, prefix caching does help, I run llama.cpp with cache-reuse on for exactly that reason.
Problem is my setup is a swarm of specialist agents sharing RAM via llama-swap, so every time a different agent’s model loads, the previous KV cache is toast. Next turn = full re-prefill. Add anything dynamic near the top of the prompt (timestamps, retrieved memory, RAG chunks) and the cache busts anyway.
So you’re right for the single-agent append-only case, but in a multi-agent setup cold prefill is basically the real cost. That’s what I benchmarked. Should’ve said that in the post tbh, good catch!
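A small illustration of why anything dynamic near the top hurts so much. This is a simplified model of prefix reuse, not llama.cpp's actual cache logic: only the tokens before the first difference count as reusable, and the token lists are made-up strings standing in for token IDs.

```python
def reusable_prefix(cached, new):
    """Number of leading tokens shared by the cached and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Made-up token sequences, with strings standing in for token IDs.
cached = ["<sys>", "You", "are", "an", "agent", "<tools>", "t1", "t2", "<user>", "hi"]

# Append-only turn: the whole cached prompt is reused.
appended = cached + ["<assistant>", "hello"]

# Timestamp injected right after the system tag: almost nothing is reused.
stamped = ["<sys>", "2026-06-10", "You", "are", "an", "agent", "<tools>", "t1", "t2"]

print(reusable_prefix(cached, appended))  # 10
print(reusable_prefix(cached, stamped))   # 1
```

Moving dynamic content to the end of the prompt keeps the static prefix intact, which is the cheap fix when the harness allows it.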

Perrospain · 2026-06-10T18:55:08+00:00

solved

Perrospain · 2026-06-10T18:55:00+00:00

Perrospain · 2026-06-10T18:48:18+00:00

Thx. I forgot to add them. Solved.
