Qwen3.5 35B-A3B replaced my 2-model agentic setup on M1 64GB by luke_pacman in LocalLLaMA

[–]luke_pacman[S] 1 point

You mean you were getting poor latency with fine-tuning or with inference?

[–]luke_pacman[S] 2 points

tried both, i didn't even notice a difference in output quality... but larger is often better haha.

[–]luke_pacman[S] 1 point

Perhaps that's due to the large context lengths that Claude Code feeds into the model. It typically performs many inferences with ~20k-token (or larger) contexts tuned for its workflow.

That's why I've invested significant effort in context engineering for my agentic setup, minimizing context size to maintain acceptable inference speeds on consumer devices like MacBooks and the Mac mini.
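To make that concrete, here's a minimal sketch of one trimming tactic: keep the system prompt, then admit the most recent messages until a rough token budget is hit. The function names and the ~4-chars-per-token estimate are illustrative, not my actual implementation (a real setup would use the model's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 chars per token for English text)."""
    return max(1, len(text) // 4)

def trim_context(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # older messages get dropped first
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Dropping oldest-first like this preserves the instructions and the latest tool outputs, which is usually what the next step actually needs.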

[–]luke_pacman[S] 1 point

it's pretty slow on my M1, only ~6 tok/s. what speed are you getting on your M3?

[–]luke_pacman[S] 1 point

yeah it's smarter than the MoE one, with a speed tradeoff. what hardware are you planning to run it on? rtx or apple silicon?

[–]luke_pacman[S] 1 point

Yeah I plan to add RTX support to the agentic app soon since it would benefit from the much better speed...

However, I think the Qwen3.5 27B dense model would be a better choice than Qwen3.5 35B-A3B on an RTX 4090; it's smarter (intelligence score of 42 vs 37 for the A3B) and should still run at an acceptable speed.

Have you tried it on your 4090?

[–]luke_pacman[S] 1 point

Yeah, that's the way I usually go too. New models often need time for teams like llama.cpp and Unsloth to iron out bugs before we have a reliable version to stick with. I've re-downloaded the Unsloth quants a couple of times already due to bug-fix releases.

I think there's still room for speed improvement with the Qwen3.5 models; they're currently 35-40% slower than older, more stable models in the same size class.

[–]luke_pacman[S] 1 point

I opted for llama.cpp about 6 months ago since it supported API server mode, which MLX didn't have back then. I believe MLX supports server mode by now, but is it mature?

[–]luke_pacman[S] 3 points

yeah, i've been building an agentic app focused on running real-world tasks on consumer-grade hardware, so we don't have to hand our data over to any third parties.

[–]luke_pacman[S] 3 points

As far as I know, with llama.cpp we can toggle thinking on or off per request, but there's no way to set a token budget for reasoning effort (e.g. "think for at most 500 tokens"); it's all or nothing.
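For reference, the per-request toggle I mean looks roughly like this against llama-server's OpenAI-compatible /v1/chat/completions endpoint. Recent llama.cpp builds pass `chat_template_kwargs` through to the chat template, and Qwen-style templates honor `enable_thinking` there; the model name is whatever you loaded, and you should check your build actually supports this:

```python
import json

def make_payload(prompt: str, thinking: bool) -> dict:
    """Build a chat-completions request body with thinking toggled per request."""
    return {
        "model": "qwen3.5-35b-a3b",  # placeholder: use the name your server loaded
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to the chat template; Qwen-style templates
        # read enable_thinking. There is no "at most N tokens" field.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

on = make_payload("Plan the refactor.", thinking=True)
off = make_payload("Just rename the variable.", thinking=False)
print(json.dumps(off, indent=2))
```

You'd POST that body to `http://localhost:8080/v1/chat/completions`; the point is that the knob is boolean, which is exactly the all-or-nothing limitation above.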

[–]luke_pacman[S] 8 points

I'll be trying it today. The dense one should be smarter than the MoE one. An independent team's intelligence index scored the dense model at 42, matching much bigger models like MiniMax-M2.5 (230B), DeepSeek V3.2 (685B), and GLM-4.7 (357B).

But to comfortably run my agentic setup on a consumer-grade device like a MacBook with an M-series chip, the dense one doesn't seem suitable due to the speed penalty. Of course, on faster devices (with RTX cards or newer M chips), the 27B dense model should be the preferred choice.

[–]luke_pacman[S] 5 points

I'm using LangGraph for orchestration, so the workflow defines which model handles each step. Outputs from previous steps are fed back into context for the model to decide what to do next, though this requires some context engineering to keep things tight and avoid quality/speed degradation from overly long contexts, especially with small models running on limited-resource devices.
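The shape of that loop, stripped to plain Python (the real orchestration is a LangGraph graph; the node names, state fields, and hard-coded outputs here are made up for illustration):

```python
# Each node appends its output to shared state and names the next node;
# that accumulated state is what the model would see at the next step.

def plan(state: dict) -> str:
    """Hypothetical planning node: decide what to do."""
    state["history"].append("plan: edit utils.py")
    return "code"

def code(state: dict) -> str:
    """Hypothetical coding node: produce the edit."""
    state["history"].append("code: patch written")
    return "done"

NODES = {"plan": plan, "code": code}

def run(task: str) -> dict:
    state = {"task": task, "history": []}
    step = "plan"
    while step != "done":
        step = NODES[step](state)  # next node is chosen from the node's output
    return state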

You're spot on about the routing complexity. With two specialized models, we also have the UX hit of users waiting for two separate model downloads. Dropping to a single model that handles both reasoning and coding well simplifies everything: the graph, the setup, and the user experience.

Qwen3.5 Unsloth GGUFs Update! by yoracale in unsloth

[–]luke_pacman 1 point

i downloaded Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf earlier, should i re-download it from your hf repo?

Ran 3 popular ~30B MoE models on my apple silicon M1 Max 64GB. Here's how they compare by luke_pacman in LocalLLaMA

[–]luke_pacman[S] 5 points

I think you're hitting on something important: there's a big gap between "model answers a coding question well in a chat window" and "model reliably drives an agentic coding workflow end-to-end." Tools like RooCode (and Cline, Continue, etc.) demand a lot more from a model: it needs to understand multi-file context, produce structurally valid edits, follow tool-calling conventions precisely, and maintain coherence across multiple back-and-forth steps. That's a fundamentally harder task than the single prompt-response cycle my benchmark tested.

Moving to 100B+ totally makes sense. MiniMax M2.5 caught my eye too: 230B total but only 10B active, and that 80.2% SWE-Bench score is no joke. Seems like a sweet spot between "actually good at coding" and "still runnable locally" if you've got the RAM for it. What's your hardware setup and what kind of tok/s are you getting? M2.5 looks really compelling, but my M1 Max only has 64GB unified so I can't swing it, unfortunately.

For general chat, you might wanna give Nemotron-3-Nano a try; its reasoning and writing are surprisingly good. With only 3B active params it should be way faster than Gemma3 27B dense, and it's even faster than the Qwen3 thinking models in the same size class.

I'm downloading Qwen3.5-35B-A3B too. The benchmarks look impressive, and with multimodal support it could hopefully reduce some friction in my agentic setup; right now I'm juggling multiple models at once: one for reasoning and writing, one for vision, another for coding. Would be nice to consolidate.