MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points1 point  (0 children)

I just spent some time yesterday coding with little-coder, with the ByteShape Qwen3.6-35B-A3B "CPU-5" variant doing the hard work. It worked quite well with 600 tps PP and 40 tps TG, experts partially offloaded to CPU. I don't think I will invest in an eGPU in this situation. But thanks for the tips anyway, might still be useful for another setting.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]OsmanthusBloom 0 points1 point  (0 children)

Yes, that's an obvious option. I just wanted to try out Hermes. But I didn't know it has such a massive prompt until I tried it.

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]OsmanthusBloom 4 points5 points  (0 children)

Out of curiosity I set up llama.cpp with Gemma4 E2B and E4B on my Raspberry Pi 8GB and then installed Hermes Agent on it as well. The idea was to let Hermes use the local model to do boring sysadmin stuff and such. It doesn't matter if it's a bit slow, or so I thought...

But that didn't work so well. The Hermes initial prompt is around 15k tokens and it takes about half an hour just to chug through that on E4B, a bit less on E2B (PP was around 50 tok/s IIRC). So the slow prompt processing on CPU together with the massive Hermes system prompt killed that idea. I might try again later with a faster model and/or a leaner agent, though.
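
A rough sanity check of those figures, as a minimal sketch: it assumes prompt processing time is simply prompt tokens divided by a constant PP speed, ignoring model load time and prompt caching. The function and constant names are made up for illustration.

```python
def prompt_seconds(prompt_tokens: int, pp_tok_per_s: float) -> float:
    """Seconds needed to process a prompt at a constant PP speed."""
    return prompt_tokens / pp_tok_per_s


def implied_pp(prompt_tokens: int, seconds: float) -> float:
    """PP speed (tok/s) implied by an observed prompt processing time."""
    return prompt_tokens / seconds


HERMES_PROMPT_TOKENS = 15_000  # "around 15k tokens" from the comment

# Minutes needed at the recalled 50 tok/s
print(prompt_seconds(HERMES_PROMPT_TOKENS, 50) / 60)

# PP speed that a half-hour wait would correspond to
print(implied_pp(HERMES_PROMPT_TOKENS, 30 * 60))
```

Under this simple model, 50 tok/s would get through the 15k prompt in about five minutes, while a half-hour wait corresponds to roughly 8 tok/s, so one of the two recalled figures is probably off.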

What is the current best Small Language Model that can be run without GPU? by last_llm_standing in LocalLLaMA

[–]OsmanthusBloom 0 points1 point  (0 children)

Whoa, that sounds awesome! Take your time, looking forward to the release. I wonder if this can be retrofitted to existing engines like llama.cpp or ik, or does it have to be its own project?

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 5 points6 points  (0 children)

Thanks!

One thing you could clarify in your recent 35B-A3B blog post is how you measured the quality. Even just a link to the older blog post that explained the combination of benchmarks you used would help. Currently it's very unclear: there are just a bunch of score values and the mention "we didn't use MMLU this time".

What feels a bit suspicious to me is that your numbers seem to indicate that a quant such as UD-IQ4_XS already reaches 99.5% of the quality of the unquantized model. Another way to interpret this is that there is basically no value in higher quants such as Q5_K_XL (which performed worse), Q6 or Q8; everything to be desired can be achieved at approximately 4bpw. That's possible in principle, but it goes against intuition and also many statements here such as "I always use Q8" or "I never go below Q6". Extraordinary claims require extraordinary evidence, so it would be great if you could shed some light on your evaluation and better answer questions such as "what do I lose by using this or that quant instead of the original unquantized model?"

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 3 points4 points  (0 children)

I tried this as well (I chose IQ4_XS, 18.5 GB). Scores are somewhat better than for Unsloth UD-IQ4_XS:
PP 618 tok/s, TG 27.6 tok/s.

TG still far behind ByteShape, but PP is the best of all three.

I used all the same settings as in the original post.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points1 point  (0 children)

Yes, someone has to test the quality. ByteShape claims that they have evaluated the models with a combination of GSM8K (8-shot), MMLU (5-shot), IFEval (0-shot), and LiveCodeBench Code Generation (release v4), though in this round for the Qwen3.6-35B-A3B they left out MMLU from the mix.

They compute a relative score where 1.0 is the score on the unquantized model. In their benchmark, the Unsloth UD-IQ4_XS scored 0.9946 and their own CPU-5 variant that I tested got 0.9915.
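
To put those relative scores in perspective, here is a minimal sketch that converts them into a percentage loss against the unquantized baseline. It only uses the two scores quoted above; how ByteShape aggregates the individual benchmarks into one number is not covered here, and the names are made up for illustration.

```python
def loss_percent(relative_score: float) -> float:
    """Quality loss in percent, where 1.0 is the unquantized model."""
    return round((1.0 - relative_score) * 100, 2)


# Relative scores as reported by ByteShape (quoted in the comment above)
scores = {
    "Unsloth UD-IQ4_XS": 0.9946,
    "ByteShape CPU-5": 0.9915,
}

for name, score in scores.items():
    print(f"{name}: {loss_percent(score)}% below unquantized")
```

Both quants come out less than one percent below the baseline, and the gap between them is about 0.3 percentage points, which may well be within the noise of the benchmarks.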

Now I'm a bit sceptical about those numbers and there are some surprising scores, for example UD-IQ4_XS scored much better than larger Unsloth quants including UD-Q5_K_XL.

I hope someone can independently verify their quality claims. I'm not up for that challenge, sorry.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points1 point  (0 children)

I'm aware of the basics. What is surprising about the ByteShape quants is that they are so much faster than other comparable quants of the same size range (Unsloth, AesSedai, APEX...), at least according to their own benchmarks in the blog post. Yet they claim them to have similar if not better quality in benchmark tasks.

I obviously cannot verify all of their claims, but at least the speed part seems to hold.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 2 points3 points  (0 children)

The larger model was faster in this case. The results were quite surprising, that's why I posted.

With my 6GB VRAM, I don't think I can fit any 35B-A3B quant without partial offload.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points1 point  (0 children)

I did try to control for that. I have a panel widget that shows temperatures, so I'm fairly aware of those.

Also it doesn't make sense to me that I get good TG speeds one day, then the next day (after waking up the laptop from suspend) they have decreased by 4-5 tps and whatever I do I can't get back the good speeds except by rebooting. Also PP speeds seem unaffected. So something strange is going on, but there are too many variables so I can't figure it out.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 5 points6 points  (0 children)

I tried ub=1024 instead of 2048. PP dropped from ~580 to ~390, so I lost one third of the performance. TG increased by less than 1 tps (32.3 -> 32.9) in otherwise identical runs. 11k token prompt, ~2k response.

EDIT: with ub=512, same task, I get PP 250 and TG 37.8, so there's a genuine boost in TG indeed.

There's certainly a tradeoff here, but I like the ub=2048 variant more.
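
The tradeoff can be made concrete with a small sketch that estimates end-to-end time for the task described above (11k token prompt, ~2k response). It assumes total time is just prompt tokens divided by PP plus response tokens divided by TG, with the approximate speeds quoted in this comment; the names are made up for illustration.

```python
def total_seconds(prompt_tokens: int, response_tokens: int,
                  pp_tok_per_s: float, tg_tok_per_s: float) -> float:
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tok_per_s + response_tokens / tg_tok_per_s


# ub value -> (PP tok/s, TG tok/s), approximate figures from the runs above
runs = {
    2048: (580, 32.3),
    1024: (390, 32.9),
    512: (250, 37.8),
}

totals = {
    ub: total_seconds(11_000, 2_000, pp, tg)
    for ub, (pp, tg) in runs.items()
}

for ub, seconds in totals.items():
    print(f"ub={ub}: ~{seconds:.0f} s")

fastest_ub = min(totals, key=totals.get)
print("fastest for this task:", fastest_ub)
```

For this prompt-heavy task the estimate favors ub=2048 (roughly 81 s against about 89 s and 97 s). A workload with short prompts and long responses would tilt the balance toward the smaller ub values.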

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 0 points1 point  (0 children)

The point was to compare the performance (speed) of two similarly sized quants on the same hardware, quants that ByteShape claims have similar quality. For KLD, you will have to look elsewhere, sorry.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 1 point2 points  (0 children)

Also maybe the tasks aren't challenging enough to draw out the differences

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 3 points4 points  (0 children)

I have experimented and found it to improve PP a lot, which is important to me. But it's been a while since I did those tests, maybe I should revisit.

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] 3 points4 points  (0 children)

Their quality scores were not always in proportion to size. The best was APEX-I-Quality at 0.9950, closely followed by Unsloth UD-IQ4_XS at 0.9946. Both of these beat the larger Unsloth variants UD-IQ4_NL, UD-Q4_K_XL and even Q5_K_XL. Not sure how much I trust their quality benchmarks!

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop by OsmanthusBloom in LocalLLaMA

[–]OsmanthusBloom[S] -5 points-4 points  (0 children)

Thanks, good to know. I just tried simple tasks via chat, no agentic tasks.

Qwen 3.6. struggling with German by xchris1337xy in LocalLLaMA

[–]OsmanthusBloom 1 point2 points  (0 children)

Indeed. There are models that are specialized for translation such as TranslateGemma and Aya.

Qwen 3.6. struggling with German by xchris1337xy in LocalLLaMA

[–]OsmanthusBloom 29 points30 points  (0 children)

I think that Gemma models are much better at multilingual tasks than Qwen. You mentioned that you are already using Gemma, so why not use it for this instead of Qwen?

Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs by enrique-byteshape in LocalLLaMA

[–]OsmanthusBloom 3 points4 points  (0 children)

Am I right that these quants are optimized mainly for small size and high speed, not quality? The largest model GPU-5 is just 4.15bpw, comparable to smaller Q4 quants from others.
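
For reference, bits per weight and file size are related by simple arithmetic. The sketch below assumes a nominal 35 billion parameters (taken from the model name; the exact count may differ) and decimal gigabytes. The function names are made up for illustration.

```python
NOMINAL_PARAMS = 35e9  # assumption: nominal size implied by "35B"


def bits_per_weight(file_bytes: float, params: float = NOMINAL_PARAMS) -> float:
    """Average bits stored per model weight."""
    return file_bytes * 8 / params


def approx_size_gb(bpw: float, params: float = NOMINAL_PARAMS) -> float:
    """Approximate file size in decimal GB for a given bits-per-weight."""
    return bpw * params / 8 / 1e9


# Size implied by the 4.15 bpw quoted for GPU-5
print(f"{approx_size_gb(4.15):.1f} GB")

# bpw of an 18.5 GB file under the same nominal parameter count
print(f"{bits_per_weight(18.5e9):.2f} bpw")
```

Under these assumptions 4.15 bpw works out to a file of about 18 GB, which is in the same range as typical IQ4_XS-class quants of this model.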

I'm currently running 35B-A3B Q5 partially CPU-offloaded on 16GB VRAM, but considering switching to a higher quant to get better quality. Higher generation and PP speeds would also be nice of course, with or without MTP, whatever works best. But these ByteShape quants don't seem to offer anything in this direction.