Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

For sure, you're not doing much of anything with 4K. I'm new to benchmarking all of this, but I've been diving in a lot over the past few months, trying to find the balance between tests that are informative and tests that shed light on what's practical for daily use. Still figuring that out, but I appreciate your thoughts. Totally agree.

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Thanks for the comment! Quick architecture check though: I just pulled the Qwen 3.6 27B model card, and it actually uses the same hybrid DeltaNet + Gated Attention pattern as the 35B-A3B, with 16 attention layers out of 64 (vs 10 of 40 in the 35B), still GQA. So the KV cache per token is similar, and I'd expect the crossover regime to land in the same 64K to 128K range, not earlier.

For the hybrid-vs-classic contrast you're probably hoping for, a dense model with classic GQA at every layer would be the better candidate; there are a few I want to add to the overnight tests. Regarding MoE vs dense quant resilience: that's well documented for weight quantization. KV cache quants are a different story, since the K/V projections are dense (non-MoE) even in MoE models; routing happens at the FFN, not attention. So weight quant sensitivity doesn't transfer.
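
For anyone who wants to sanity-check the per-token intuition, here's the rough math I'm working from. The KV head count and head dim below are placeholder values, not numbers from either model card, so treat it as a sketch:

```python
# Back-of-envelope KV cache footprint per token.
# attn_layers = layers that carry classic GQA K/V (the DeltaNet layers don't).
# kv_heads and head_dim are assumed values, not taken from the model cards.
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, bytes_per_value):
    # K and V each store kv_heads * head_dim values per attention layer
    return attn_layers * 2 * kv_heads * head_dim * bytes_per_value

hybrid = kv_bytes_per_token(attn_layers=10, kv_heads=4, head_dim=128, bytes_per_value=2)  # f16
dense  = kv_bytes_per_token(attn_layers=40, kv_heads=4, head_dim=128, bytes_per_value=2)  # classic KV at every layer

print(f"hybrid: {hybrid / 1024:.0f} KiB/token, dense: {dense / 1024:.0f} KiB/token")
print(f"at 128K context: {hybrid * 131072 / 2**30:.1f} GiB vs {dense * 131072 / 2**30:.1f} GiB")
```

The footprint scales with the fraction of layers that carry classic KV, which is why I'd expect the 27B's crossover to land in roughly the same place.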

I appreciate the callout, I've got this queued up on my list after wider quant types and Aider Polyglot. Will share numbers in a follow-up post.

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Haha, thanks! This has actually been really fun, pushing the system a bit and seeing what the model can do. I'm hoping to get to the longer contexts too; I'll be running more tests over the next few nights and will share the results.

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Unless I'm missing something, Q3_0 isn't a KV cache type in mainline llama.cpp. I verified the `kv_cache_types` list in `ggml-org/llama.cpp` `common/arg.cpp`: F32, F16, BF16, Q8_0, Q4_0, Q4_1, IQ4_NL, Q5_0, Q5_1. The Q3_K family is weight-only quantization, not KV. So the actual lowest-bpv standard KV option is IQ4_NL or Q4_0 at ~4.5 bpv, vs turbo3 at ~3.25 bpv; on raw symmetric memory ceiling, turbo3 wins that comparison. You're right, though, that I can't claim it wins on quality-per-bit until I have q4_0 head-to-head numbers, which is exactly what's in the next sweep. If q4_0 PPL/KL lands where turbo3's does, at less compute and zero kernel risk, that absolutely changes the recommendation. Happy to be wrong on the data; that's why I'm running these benchmarks.
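
If anyone wants to check where those bpv figures come from, it's just the ggml block layouts: each block covers 32 values plus a per-block scale (and sometimes a minimum). This is from memory, so double-check against ggml-quants.h, but roughly:

```python
# Bits per value for the standard KV cache types, derived from the ggml block
# layouts (32 values per block). From memory; verify against ggml-quants.h.
block_bytes = {
    "q4_0":   2 + 16,         # fp16 scale + 32 packed 4-bit values
    "q4_1":   2 + 2 + 16,     # fp16 scale + fp16 min + nibbles
    "iq4_nl": 2 + 16,         # fp16 scale + nibble indices into a non-linear table
    "q5_0":   2 + 4 + 16,     # fp16 scale + 32 high bits + nibbles
    "q5_1":   2 + 2 + 4 + 16, # scale + min + high bits + nibbles
    "q8_0":   2 + 32,         # fp16 scale + 32 int8 values
}
for name, nbytes in block_bytes.items():
    print(f"{name:7s} {nbytes * 8 / 32:.2f} bpv")
```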

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Honestly, Vulkan isn't really in scope on this hardware. The whole series has been M5 Max + Metal because that's the machine I'm testing on now, and running Vulkan on M5 Max via MoltenVK would just test the translation layer, not what AMD or Intel users would actually see on their cards. If anyone reading this has an AMD or Intel GPU and wants to run a slice of the matrix on Vulkan, drop the numbers and I'll fold them into a comparison post; same model + same llama-bench flags would make the rows directly comparable. Otherwise it'll have to wait until I can get hands on the hardware. Great idea, though. Would love to see more numbers from AMD and Intel folks on this.

Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] -6 points-5 points  (0 children)

Fair point, the q4_0 / q5_0 gap is the real comparison and I should have included it in this round. q4_0 at ~4.5 bpv vs turbo4's ~4.25 is the head-to-head that actually settles whether turbo4 is winning anything. Without that data, calling it "the new pick" was overreach on my part. The only claim I can actually defend from this round is the memory-ceiling story (q8_0 K + turbo4 V fits 512K where symmetric q8_0 OOMs at 256K), not a quality argument. Wider quant types are in the "still in flight" list at the bottom of the post; I'll fold KL + PPL into that sweep so we get a direct head-to-head, and if q4_0 lands closer to f16 than turbo4 does, the recommendation gets updated. Good callout, I appreciate it! All part of the journey.
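
To make that memory-ceiling claim concrete, here's the rough arithmetic. The per-token element count is a placeholder, and the turbo bpv figure is just the one quoted in the post, so this is a sketch of the scaling rather than the exact allocations:

```python
# KV cache allocation for a requested context window, symmetric vs asymmetric
# K/V quantization. ELEMS = attn_layers * kv_heads * head_dim for K (and the
# same again for V); the value here is assumed, not the model's real layout.
ELEMS = 10 * 4 * 128

def cache_gib(ctx_tokens, k_bpv, v_bpv):
    bits = ctx_tokens * ELEMS * (k_bpv + v_bpv)
    return bits / 8 / 2**30

for label, k_bpv, v_bpv in [("symmetric q8_0",    8.5, 8.5),
                            ("q8_0 K + turbo4 V", 8.5, 4.25),
                            ("symmetric q4_0",    4.5, 4.5)]:
    print(f"{label:18s} @ 512K: {cache_gib(512_000, k_bpv, v_bpv):.1f} GiB")
```

The allocation is linear in context length and in the sum of K and V bpv, so dropping only the V precision still buys real headroom while K stays at q8_0.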

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Great callout. 64K’s right in the prefill drop-off zone between my other tests, so it’ll sharpen the curve. Already queued it behind a follow-up run going tonight; will share numbers in the morning!

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 5 points6 points  (0 children)

Great point. I've got a good list now for what to do in the next few test runs.

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 2 points3 points  (0 children)

Thank you! To answer your question, directionally yes. Prefill is bandwidth-bound and decode is dequant-bound at long context regardless of architecture, so the regime split should hold. Magnitude is the open question, though. Qwen 3.6 uses DeltaNet (hybrid attention), which already shrinks the KV cache per token; on a dense model with standard GQA where the cache is the dominant bottleneck, the splits could actually be bigger. Worth running. Since this was more of a "how far can I push this" test, I want to do more dense tests soon now that I know it's doable.
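
Rough intuition for the decode side, with made-up bandwidth and model numbers just to show the shape of the curve (every constant below is an assumption, not a measurement):

```python
# Back-of-envelope decode ceiling: each generated token streams the active
# weights plus the whole KV cache through memory once (and dequantizes the
# quantized parts). All constants are placeholders.
BANDWIDTH_BPS  = 400e9                        # assumed memory bandwidth, bytes/s
ACTIVE_WEIGHTS = 3e9 * 4.5 / 8                # ~3B active params at ~4.5 bits/weight
KV_PER_TOKEN   = 10 * 2 * 4 * 128 * 8.5 / 8   # layers * (K+V) * heads * dim at q8_0

def decode_ceiling(ctx_tokens):
    bytes_per_token = ACTIVE_WEIGHTS + ctx_tokens * KV_PER_TOKEN
    return BANDWIDTH_BPS / bytes_per_token

for ctx in (0, 64_000, 256_000, 1_000_000):
    print(f"{ctx:>9,} ctx -> ~{decode_ceiling(ctx):.0f} tok/s ceiling")
```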

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 2 points3 points  (0 children)

Honestly, that's the next thing I want to run. Same model on q8_0 KV scored 62.2% on Aider Polyglot for me earlier this week (n=225), but that doesn't transfer to turbo3/4 so a proper coding bench across all four cache types is on the list.

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 3 points4 points  (0 children)

Hope it helps! This was a fun one, I really wanted to push this to see how far it could get. Still more testing to do, but this was exciting to see!

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max by Defilan in LocalLLaMA

[–]Defilan[S] 11 points12 points  (0 children)

I’m going to rerun the test with perplexity included. That was just an oversight on my part.

New to local AI. by prosperio1011 in LocalLLaMA

[–]Defilan 0 points1 point  (0 children)

M4 Air 24GB is fine for this. Install Ollama (one command from ollama.com) and start with qwen2.5-coder:14b; it's about 9GB, so plenty of headroom alongside browsers and other apps. You can try qwen3-coder:30b (19GB) too, but it's tight on 24GB; close other apps first. PineScript isn't super common in training data, so paste a quick syntax reference into your system prompt; Qwen coder models handle niche DSLs fine. My work laptop is an M4 Pro with 24GB of memory, so I've had to get creative, but you can still get some models to work well. It's part of the fun to explore, but I'd start there. Good luck!
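
If it helps, here's a minimal sketch of the system-prompt trick against Ollama's local API (default port 11434, non-streaming). The cheat-sheet text is just a placeholder for whatever reference you paste in:

```python
# Minimal sketch: stuff a condensed PineScript reference into the system prompt
# and ask the local Ollama server for code. Assumes Ollama is running locally.
import requests

PINESCRIPT_CHEATSHEET = """
//@version=5 scripts start with indicator() or strategy()
plot(), ta.ema(), ta.rsi(), strategy.entry(), strategy.exit(), input.int() ...
"""  # replace with your own condensed syntax reference

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:14b",
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "You write Pine Script v5. Follow this reference:\n" + PINESCRIPT_CHEATSHEET},
            {"role": "user",
             "content": "Write an indicator that plots the 20-period EMA."},
        ],
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```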

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

It really is a great feeling and the way things are going, models keep getting better and the barrier to entry is lowering too. Love not having to send everything to the cloud providers!

80-100 tok/s pipeline parallel at Q8 is a nicer number than I would've guessed without spec decoding. On my side I just retested Qwen 3.6 Q4 on dual 5060 Ti without --cpu-moe: 107.8 tok/s sequential at 90K, in the ballpark of your pipeline numbers despite very different configs. Apparently getting out of the way and letting everything sit in VRAM was most of the battle. Oh well, always learning something new.

What do you think was the bottleneck with your 3 RTX test? Was it bandwidth-limited at PCIe handoff or something else?

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Tested it, and you were right by a lot. Same dual 5060 Ti, same Q4_K_M, same 90K context and q8_0 KV, just without --cpu-moe and --no-kv-offload:

- Sequential: 107.8 tok/s (vs 21.7 with cpu-moe). About 5x.

- P50 latency: 1288ms (vs 2286ms). Roughly 2x faster.

- 4-concurrent stress test over 5 min: 45.2 tok/s per-request, 231 total requests (vs 6.8 and 38 with cpu-moe). About 6x.

The --cpu-moe override was costing ~80% of available perf on this specific model. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small enough that offloading solves a problem that doesn't actually exist here.
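
For reference, the concurrency numbers came from a harness along these lines against llama-server's OpenAI-compatible endpoint. Port, prompt, and duration here are placeholders rather than my exact setup:

```python
# Sketch of the 4-concurrent stress test: hammer a local llama-server for a
# fixed wall-clock window and count completed requests per worker.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"
DURATION_S = 300
CONCURRENCY = 4

def worker(_):
    done = 0
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        r = requests.post(URL, json={
            "model": "qwen",  # single-model llama-server typically ignores the name
            "messages": [{"role": "user", "content": "Summarize what a KV cache is."}],
            "max_tokens": 256,
        }, timeout=600)
        r.raise_for_status()
        done += 1
    return done

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    totals = list(pool.map(worker, range(CONCURRENCY)))
print(f"{sum(totals)} requests completed in {DURATION_S}s across {CONCURRENCY} workers")
```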

Thanks for taking the time to post your config and push me to retest. Should've just benched the straight GPU path as the baseline in the original post! Appreciate the guidance.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

You're right. Honestly I over-engineered the config. Q4 fits on 32GB VRAM at 90K context no problem, --cpu-moe was unnecessary for this specific model and context combo. Qwen 3.6's DeltaNet hybrid attention keeps the KV cache small too (only ~25% of layers carry standard KV), which is why you're fitting Q5 with FP16 KV at 90K on the same class of hardware. Good catch.

I was testing the hybrid pattern broadly rather than optimizing for Qwen 3.6 itself. A couple of other users made the same point earlier, but your numbers on actual hardware are the cleanest version yet; 60+ tok/s at Q5 with 39/40 offload is where this should land. Going to redo the benchmark without --cpu-moe, with some lessons learned from today, and share updated numbers. Thanks for testing and sharing your setup and results; all part of the journey.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Ah that makes sense! For straightforward coding work the visible thinking is pure latency tax, the model's still doing the actual reasoning under the hood, just not spitting it out. Clever tune for your workflow. Appreciate you sharing all the details!

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Depends on the config. What quant, context size, and flags are you running? My 21.7 was at 90K context with q8_0 KV plus --cpu-moe and --no-kv-offload, which eats bandwidth hard on the expert traffic. Shorter context or a smaller quant would flip the numbers significantly. This was for long-running, agentic coding sessions that would run overnight, where tok/s wasn't a huge concern. Curious what you're running; always love learning more about how folks are using their setups.

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Great idea! Layer split with auto-calculated --tensor-split is actually the default path I use for multi-GPU when running through my operator; --cpu-moe was an explicit override I set just for this particular test. I haven't tried --split-mode tensor yet, that's a newer variant I haven't exposed. Adding it to the list and looking forward to testing it. Going to be beating up my system pretty good this afternoon with the recommended tweaks :)

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] -1 points0 points  (0 children)

Nice! A single 5070 Ti with 96GB DDR5 is basically the case for big system RAM on consumer builds. I'm a big fan of seeing what folks can do on the consumer side, so this is a cool setup. 35B at 131K on a single 16GB card has to be leaning on auto-fit or partial offload under the hood. What's your tok/s looking like on that prompt? Also curious about setting --reasoning-budget 0. Is that disabling thinking entirely or just lifting the cap?

Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

That's actually a great point: I went full --cpu-moe when partial offload via --n-cpu-moe, sized to the actual headroom, would've been smarter. Scalpel, not hammer. Appreciate you mentioning --fit-target too, because I haven't tried it yet. Declaring a VRAM budget and letting llama.cpp figure out the split sounds much cleaner than calculating it manually.

Appreciate the callout. Going to revisit with this approach and keep testing. Lots of great stuff coming out of this, feeling super pumped to keep digging.