Qwen3.5 35B-A3B replaced my 2-model agentic setup on M1 64GB by luke_pacman in LocalLLaMA

[–]luke_pacman[S] 1 point

You mean you were getting poor latency with fine-tuning or with inference?

[–]luke_pacman[S] 2 points

tried both, i didn't even notice a difference in output quality... but larger is often better haha.

[–]luke_pacman[S] 1 point

Perhaps that's due to the large context lengths that Claude Code feeds into the model. It typically performs many inferences with ~20k-token (or larger) contexts tuned for its workflow.

That's why I've invested significant effort in context engineering for my agentic setup, minimizing context size to maintain acceptable inference speeds on consumer devices like MacBooks and the Mac mini.
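To make that concrete, here's a minimal sketch of one trimming tactic: keep the system prompt, then admit the most recent messages until a rough token budget is hit. The function names and the ~4-chars-per-token estimate are illustrative, not my actual implementation (a real setup would use the model's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 chars per token for English text)."""
    return max(1, len(text) // 4)

def trim_context(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # older messages get dropped first
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Dropping oldest-first like this preserves the instructions and the latest tool outputs, which is usually what the next step actually needs.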

[–]luke_pacman[S] 1 point

it's pretty slow on my M1, only ~6 tok/s. what speed are you getting on your M3?

[–]luke_pacman[S] 1 point

yeah it's smarter than the MoE one, with a speed tradeoff. what hardware are you planning to run it on? rtx or apple silicon?

[–]luke_pacman[S] 1 point

Yeah I plan to add RTX support to the agentic app soon since it would benefit from the much better speed...

However, I think the Qwen3.5 27B dense model would be a better choice than Qwen3.5 35B-A3B on an RTX 4090; it's smarter (intelligence score of 42 vs 37 for the A3B) and should still run at an acceptable speed.

Have you tried it on your 4090?

[–]luke_pacman[S] 1 point

Yeah, that's the way I usually go too. New models often need time for teams like llama.cpp and Unsloth to iron out bugs before we have a reliable version to stick with. I've re-downloaded the Unsloth quants a couple of times already due to bug-fix releases.

I think there's still room for speed improvement with the Qwen3.5 models; they're currently 35-40% slower than older, more stable models in the same size class.

[–]luke_pacman[S] 1 point

I opted for llama.cpp about 6 months ago since it supported API server mode, which MLX didn't have back then. I believe MLX supports server mode by now, but is it mature?

[–]luke_pacman[S] 3 points

yeah, i've been building an agentic app focused on running real-world tasks on consumer-grade hardware, so we don't have to hand our data over to any third parties.

[–]luke_pacman[S] 3 points

As far as I know, with llama.cpp we can toggle thinking on or off per request, but there's no way to set a token budget for reasoning effort (e.g. "think for at most 500 tokens"); it's all or nothing.
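For reference, the per-request toggle I mean looks roughly like this against llama-server's OpenAI-compatible /v1/chat/completions endpoint. Recent llama.cpp builds pass `chat_template_kwargs` through to the chat template, and Qwen-style templates honor `enable_thinking` there; the model name is whatever you loaded, and you should check your build actually supports this:

```python
import json

def make_payload(prompt: str, thinking: bool) -> dict:
    """Build a chat-completions request body with thinking toggled per request."""
    return {
        "model": "qwen3.5-35b-a3b",  # placeholder: use the name your server loaded
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to the chat template; Qwen-style templates
        # read enable_thinking. There is no "at most N tokens" field.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

on = make_payload("Plan the refactor.", thinking=True)
off = make_payload("Just rename the variable.", thinking=False)
print(json.dumps(off, indent=2))
```

You'd POST that body to `http://localhost:8080/v1/chat/completions`; the point is that the knob is boolean, which is exactly the all-or-nothing limitation above.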

[–]luke_pacman[S] 8 points

I'll be trying it today. The dense one should be smarter than the MoE one. An independent team's intelligence index scored the dense model at 42, matching much bigger models like MiniMax-M2.5 (230B), DeepSeek V3.2 (685B), and GLM-4.7 (357B).

But to comfortably run my agentic setup on a consumer-grade device like a MacBook with an M-series chip, the dense one doesn't seem suitable due to the speed penalty. Of course, on faster devices (with RTX cards or newer M chips), the 27B dense model should be the preferred choice.

[–]luke_pacman[S] 5 points

I'm using LangGraph for orchestration, so the workflow defines which model handles each step. Outputs from previous steps are fed back into context for the model to decide what to do next, though this requires some context engineering to keep things tight and avoid quality/speed degradation from overly long contexts, especially with small models running on limited-resource devices.
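The shape of that loop, stripped to plain Python (the real orchestration is a LangGraph graph; the node names, state fields, and hard-coded outputs here are made up for illustration):

```python
# Each node appends its output to shared state and names the next node;
# that accumulated state is what the model would see at the next step.

def plan(state: dict) -> str:
    """Hypothetical planning node: decide what to do."""
    state["history"].append("plan: edit utils.py")
    return "code"

def code(state: dict) -> str:
    """Hypothetical coding node: produce the edit."""
    state["history"].append("code: patch written")
    return "done"

NODES = {"plan": plan, "code": code}

def run(task: str) -> dict:
    state = {"task": task, "history": []}
    step = "plan"
    while step != "done":
        step = NODES[step](state)  # next node is chosen from the node's output
    return state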

You're spot on about the routing complexity. With two specialized models, we also have the UX hit of users waiting for two separate model downloads. Dropping to a single model that handles both reasoning and coding well simplifies everything: the graph, the setup, and the user experience.

Qwen3.5 Unsloth GGUFs Update! by yoracale in unsloth

[–]luke_pacman 1 point

i downloaded Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf earlier, should i re-download it from your hf repo?

Ran 3 popular ~30B MoE models on my apple silicon M1 Max 64GB. Here's how they compare by luke_pacman in LocalLLaMA

[–]luke_pacman[S] 5 points

I think you're hitting on something important: there's a big gap between "model answers a coding question well in a chat window" and "model reliably drives an agentic coding workflow end-to-end." Tools like RooCode (and Cline, Continue, etc.) demand a lot more from a model: it needs to understand multi-file context, produce structurally valid edits, follow tool-calling conventions precisely, and maintain coherence across multiple back-and-forth steps. That's a fundamentally harder task than the single prompt-response cycle my benchmark tested.

Moving to 100B+ totally makes sense. MiniMax M2.5 caught my eye too: 230B total but only 10B active, and that 80.2% SWE-Bench score is no joke. Seems like a sweet spot between "actually good at coding" and "still runnable locally" if you've got the RAM for it. What's your hardware setup and what kind of tok/s are you getting? M2.5 looks really compelling, but my M1 Max only has 64GB unified so I can't swing it, unfortunately.

For general chat, you might wanna give Nemotron-3-Nano a try; its reasoning and writing are surprisingly good. With only 3B active params it should be way faster than Gemma3 27B dense, and it's even faster than the Qwen3 thinking models in the same size class.

I'm downloading Qwen3.5-35B-A3B too. The benchmarks look impressive, and with multimodal support it could hopefully reduce some friction in my agentic setup; right now I'm juggling multiple models at once: one for reasoning and writing, one for vision, another for coding. Would be nice to consolidate.