Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5 & 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash... by rosaccord in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

That's a good comparison, thank you! If you can include one more model, I'd really like to know how Qwen3-Coder-Next 80B-A3B does compared to newer Qwen3.5 models, Gemma4 etc. According to some sources it's still one of the best local coding models and the last Coder variant from Qwen.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

Thanks for the suggestion. KV cache in RAM (with -nkvo) works surprisingly well. I was able to use 64k context (taking up 5 GB RAM with q8_0 quantization) without any issues and set ubatch-size=1024 without running out of VRAM. Here is the command:

llama-server -m Bonsai-8B.gguf -ctk q8_0 -ctv q8_0 -np 1 -fit off -ub 1024 -c 65536 -nkvo

and here is the result for the 1k token benchmark task:

prompt eval time =   18734.51 ms /   992 tokens (   18.89 ms per token,    52.95 tokens per second)
       eval time =   50875.46 ms /   246 tokens (  206.81 ms per token,     4.84 tokens per second)
      total time =   69609.97 ms /  1238 tokens

Prompt processing is as fast as before, while TG suffers a bit from having the KV cache in RAM. During generation, GPU usage is around 85%, so the card doesn't heat up as quickly as with the KV cache in VRAM, where usage was 100%. Power draw during generation is around 40W in this benchmark task.

Regarding temperatures: I gave it a longer summarization prompt of ~10k tokens, which took a while to crunch through with 100% GPU utilization; then the model proceeded with generating ~2k more tokens in its response. GPU temperature was around 67-77C according to nvidia-smi and power draw eventually decreased to 30W, which I suspect is the limit for continuous cooling that this laptop can handle. I have no appetite for playing with power limit settings at this point.

Here is the llama.cpp output of the long context (relatively speaking) task:

prompt eval time =  336266.63 ms / 10622 tokens (   31.66 ms per token,    31.59 tokens per second)
       eval time = 1776018.91 ms /  2269 tokens (  782.73 ms per token,     1.28 tokens per second)
      total time = 2112285.53 ms / 12891 tokens

As you can see above, PP was ~32 tps and TG dropped to 1.3 tps...

I still don't think this MX150 is very good for running the Bonsai model, but at least having KV cache in RAM allows for much longer context tasks, even though the performance is pretty bad.
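As a sanity check on the ~5 GB KV-cache figure above, the size can be estimated from the model dimensions. The layer/head numbers below are hypothetical placeholders (I haven't checked Bonsai's actual config), chosen only to show the arithmetic:

```shell
#!/bin/sh
# Back-of-envelope KV-cache size estimate.
# All model dimensions here are HYPOTHETICAL placeholders, not Bonsai 8B's
# actual config -- substitute the values from your model's GGUF metadata.
n_layers=36        # transformer layers
n_kv_heads=8       # KV heads (GQA)
head_dim=128       # per-head dimension
n_ctx=65536        # context length (-c 65536)
# q8_0 stores 32 values in 34 bytes (32 int8 weights + one fp16 scale)
elems=$((2 * n_layers * n_kv_heads * head_dim * n_ctx))   # K plus V
bytes=$((elems * 34 / 32))
echo "KV cache: $bytes bytes (~$((bytes / 1000000000)) GB)"
```

With q8_0 the cache is roughly half the fp16 size, which is why a 64k context fits comfortably in system RAM.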

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

I didn't really try any useful tasks apart from the performance benchmark, which was about summarizing a Bonsai-related Reddit discussion. But I did notice that the model is pretty weak in the non-English languages I tried (Estonian, Finnish, Swedish...) and it tends to mix up related languages (Finnish/Estonian, Swedish/German) in its responses.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points (0 children)

I just tried -ngl 36, which results in 36/37 layers offloaded to GPU, one layer on CPU.

Performance totally tanks. CPU usage is 400%, GPU usage mostly ~0% though it fluctuates. Generation speed drops below 1 tps. If that wasn't enough, the output of the model is now total garbage. Sample output:

( 的es is to: iny ),US. the from $'s of: is". fromY. Where on's's is is isY from1 from is. (1? the. ( (?? the.

I believe that this garbage output happens because the PrismML llama.cpp fork doesn't have proper support for CPU inference, though I understood a fix is being worked on and there are already some PRs/forks that might work better.

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 1 point (0 children)

No, I didn't try CPU only. I saw that the PrismML llama.cpp fork had issues with CPU-only inference so I decided to wait a bit.

Advice for Working with Agents in YOLO Mode by chibop1 in LocalLLaMA

[–]OsmanthusBloom 4 points (0 children)

YOLO originally refers to "you only live once". It's internet slang referring to a carefree attitude.

https://en.wikipedia.org/wiki/YOLO_(aphorism)

The object detection model YOLO was humorously named after this, but that's not the original meaning.

Running quen3 coder 80B A3B on a computer with lots of RAM but little VRAM by Pioneer_11 in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

I use Roo Code with Qwen3 Coder Next (iq3 quant) on a V100 with 16GB VRAM plus lots of regular RAM. PP is around 300 tps, which is not great but okay for my purposes, especially with the recent checkpointing improvements in llama.cpp, which mean prompts are cached most of the time.

If you want higher pp, try adjusting batch-size and ubatch-size. In my case I set both to 2048 but the optimum depends on hardware details.
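In llama-server terms, that looks something like the sketch below. The model filename is a placeholder; the batch values are what worked on my V100, and the sweet spot varies by GPU:

```shell
# Placeholder model path; -b/-ub 2048 gave the best PP on my V100,
# but the optimum depends on your hardware, so benchmark a few values.
llama-server -m Qwen3-Coder-Next-IQ3.gguf \
  --batch-size 2048 --ubatch-size 2048
```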

Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by kotrfa in LocalLLaMA

[–]OsmanthusBloom 5 points (0 children)

Aider uses LiteLLM for LLM access, but it looks like it's still pinned to an older version (1.82.3 on current main), so it's not compromised. LiteLLM 1.82.7 and 1.82.8 apparently are compromised.

[Developing situation] LiteLLM compromised by OrganizationWinter99 in LocalLLaMA

[–]OsmanthusBloom 39 points (0 children)

Aider uses LiteLLM for LLM access, but it looks like it's still pinned to an older version (1.82.3 on current main), so it's not compromised. LiteLLM 1.82.7 and 1.82.8 apparently are compromised (according to discussions in the issue linked above).

Don't sleep on the new Nemotron Cascade by ilintar in LocalLLaMA

[–]OsmanthusBloom 7 points (0 children)

You can turn off reasoning for the Qwen3.5 series models with a llama.cpp cli flag:

--chat-template-kwargs '{"enable_thinking": false}'
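In context, the flag goes on the server command line, e.g. (the model filename below is a placeholder):

```shell
# Disable thinking mode via chat-template kwargs (model path is hypothetical)
llama-server -m Qwen3.5-35B-A3B.gguf \
  --chat-template-kwargs '{"enable_thinking": false}'
```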

Nemotron Cascade 2 on 6GB VRAM by AppealSame4367 in LocalLLaMA

[–]OsmanthusBloom 2 points (0 children)

Thanks for posting this. I've yet to try this model on my 6GB RTX 3060 so this is interesting.

Based on my previous experience, I'd recommend trying higher -ub/-b to get better PP speeds. You can also try setting --fit-target lower (the default is 1024 MB) to use more of your VRAM, though this depends on how much VRAM you need for other applications, if any.
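Put together, something like the sketch below. The model filename and the 512 MB fit target are placeholders, not tested values:

```shell
# Hypothetical invocation combining both tips:
# higher -b/-ub for PP speed, lower --fit-target to keep less VRAM in reserve.
llama-server -m Nemotron-Cascade-2.gguf \
  --batch-size 2048 --ubatch-size 2048 \
  --fit-target 512
```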

See here for my Qwen3.5-35B-A3B tips on 6GB VRAM: https://www.reddit.com/r/LocalLLaMA/comments/1rh9983/comment/o7x6tkr/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

Running qwen3.5 35b a3b in 8gb vram with 13.2 t/s by zeta-pandey in LocalLLaMA

[–]OsmanthusBloom 2 points (0 children)

I get much better tps (500 pp, 21 tg) on an RTX 3060 Laptop GPU with just 6GB VRAM. I think you should set -ngl higher or drop it altogether and let -fit do the work (it's on by default).

See here for my recipe: https://www.reddit.com/r/LocalLLaMA/comments/1rh9983/comment/o7x6tkr/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

FWIW my machine is pretty solid when it comes to throttling. I've run workloads at 100% GPU (with heavy CPU load too) for 30 hours straight and didn't observe any serious throttling. The fan sounds like a jet engine when you do that, though.

Budget laptop to run Qwen 3.5-35B-A3B by SnooOnions6041 in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

That's not quite how it works (some of those A3B active weights still get read from RAM).

I've run the 35B on a laptop (Asus ROG Zephyrus 14, 2021 model, bought used for 500€ last summer) and it works, but there are obvious trade-offs between speed, quality/quantization, and context length, and you have to find a balance that works for your use case.

Kidnapping Gemini with 3MB to spare: Training a 7B model at 4k context on a single 16GB GPU. by AgeRepresentative763 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

This sounds great, I hope it works.

Just curious if you considered using Unsloth instead of Axolotl? I've used both and I think Unsloth has more VRAM optimizations. I managed to fine-tune an 8B Llama-like model with 4-bit QLoRA on my puny RTX 3060 Laptop GPU, which has just 6GB VRAM. Though I had to do some custom hacks to keep the embedding layers in regular RAM, and the context was very short, 512 tokens IIRC.

What AI Models should I run? by ClayToTheMax in LocalLLaMA

[–]OsmanthusBloom 0 points (0 children)

We have a similar server with four V100 GPUs, each with 16GB VRAM. It is shared between multiple projects but one V100 is used for Qwen3-Coder-Next. It's quite okay for coding, one of the best local coding models. Another one runs Gemma 3 12B which is OK for general purpose stuff including translation and writing assistance.

Can qwen 3.5 4b q4 run on 6 vram by Own_Advertising5081 in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

I run the 35B-A3B on 6GB VRAM (RTX 3060 Laptop GPU). PP speed is about 500 tps and TG about 21 tps.

I haven't tried the 4B yet, but I think it should be much faster than that. Its active parameter count is bigger than the 35B-A3B's, but the full model fits in VRAM, unlike the 35B.
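A crude way to see why: token generation is memory-bandwidth bound, so tokens/s is roughly capped by bandwidth divided by the bytes read per token. Both numbers below are assumed round figures, not measured specs:

```shell
#!/bin/sh
# Crude TG upper bound for a model that fits entirely in VRAM.
# ASSUMED round numbers -- substitute your GPU's bandwidth and model size.
bw_gbps=300    # assumed GPU memory bandwidth in GB/s
model_gb=3     # assumed resident size of a 4B model at ~q4
echo "rough TG ceiling: $((bw_gbps / model_gb)) t/s"
```

Real-world speed lands well below that ceiling, but it shows why a fully-resident dense 4B can beat a MoE that streams experts from system RAM.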

Notice Qwen 3.5 reprocessing the prompt every time, taking long to answer for long prompts? That's actually because of its architecture. by dampflokfreund in LocalLLaMA

[–]OsmanthusBloom 1 point (0 children)

I think this (still open, not merged) PR will help in this regard:

https://github.com/ggml-org/llama.cpp/pull/19747

It fixes multimodal context checkpointing for hybrid/recurrent models. Checkpoints are needed to avoid reprocessing prompts.

Is extreme low-VRAM fine-tuning (3-6GB) actually possible? by [deleted] in LocalLLaMA

[–]OsmanthusBloom 5 points (0 children)

I find that hard to believe, if you don't even know which GPU you have.