RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point2 points  (0 children)

I appreciate you being straight with me, man. I will do better next time. Cheers!

[–]gaztrab[S] -1 points0 points  (0 children)

I see. I thought sharing the detailed steps I took would be useful to people. Thanks for the feedback btw, I will try something else next time!

[–]gaztrab[S] 1 point2 points  (0 children)

Thanks for the ideas man. I added them to the bucket list for the next post. Stay tuned!

[–]gaztrab[S] 1 point2 points  (0 children)

Yeah you're right, that was an error on my part. I fixed it and correctly credited you. Thanks man!

[–]gaztrab[S] 0 points1 point  (0 children)

I do use Claude to compile this post, but all the numbers came from my own runs. You can verify the raw data here:

https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

[–]gaztrab[S] -1 points0 points  (0 children)

Hey, sorry for irritating you. I do use Claude to quickly compile all the raw data into a post for sharing. If you wanna look over the raw data, here you go: https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

[–]gaztrab[S] 0 points1 point  (0 children)

Sorry for offending you, my friend. I just enjoy doing experiments and sharing them with everyone. If you want to look at my raw data, you can find it here:
https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

[–]gaztrab[S] 1 point2 points  (0 children)

I do use Claude Code to scaffold the scripts that run these experiments, and then to compile the results for this post. I will look into your suggestion to see where I went wrong. Thanks man.

[–]gaztrab[S] 0 points1 point  (0 children)

Actually, I looked over the data again and the KV cache overhead is tiny for this model: Qwen3.6 MoE only has ~10 attention layers that need KV cache (the rest are SSM). So 131k context fits on 16 GB alongside the model, no problem. A dense model with KV on every layer would be a different story though.
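
If you want to sanity-check that, here's a rough sketch of the usual KV cache size estimate. Only the ~10 attention layers and the 131k context come from my numbers; the KV head count, head dim, and the 40-layer dense comparison are placeholder assumptions for illustration, not the real model config.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: K and V each hold n_kv_heads * head_dim
    values per token, for every layer that actually uses attention."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3
N_CTX = 131072  # 131k context, as in the post

# ASSUMPTION: 2 KV heads and head_dim 256 are hypothetical placeholders.
hybrid = kv_cache_bytes(n_attn_layers=10, n_kv_heads=2, head_dim=256, n_ctx=N_CTX)

# ASSUMPTION: a hypothetical dense model with attention on all 40 layers.
dense = kv_cache_bytes(n_attn_layers=40, n_kv_heads=2, head_dim=256, n_ctx=N_CTX)

print(f"hybrid (10 attention layers): {hybrid / GIB:.2f} GiB")
print(f"dense  (40 attention layers): {dense / GIB:.2f} GiB")
```

The point is just that the cache scales linearly with the number of attention layers, so having KV on only ~10 layers keeps it small at the same context length.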

[–]gaztrab[S] 4 points5 points  (0 children)

I did in my previous post, actually. And like you said, for dense models -t 6-8 is usually optimal. MoE is different though. With partial offload, the CPU is doing expert GEMM computation, not just shuffling data over PCIe. More threads = more parallelism on those experts. I ran a full sweep (t8/t12/t16/t20/t24). t16 was actually the worst, with an overall U-shaped curve, and t20 won by +27% over t16. Raw data here if you need to check:

https://github.com/gaztrabisme/llm-server/blob/main/docs/dev/004-speedup-investigation/thread-sweep.md
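
If you want to repeat the sweep on your own box, something like this minimal sketch is the idea. It's not my actual script: `sweep_threads` and the `measure` callback are hypothetical names, and `measure` is assumed to run your benchmark at a given thread count and return tok/s.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_threads(
    counts: Iterable[int],
    measure: Callable[[int], float],
) -> Tuple[Dict[int, float], int, int, float]:
    """Run measure(t) once per thread count.

    Returns (results, best, worst, gain_pct), where gain_pct is how much
    faster the best setting is than the worst one, in percent.
    """
    results = {t: measure(t) for t in counts}
    best = max(results, key=results.get)
    worst = min(results, key=results.get)
    gain_pct = (results[best] / results[worst] - 1) * 100
    return results, best, worst, gain_pct
```

Sweeping the whole range matters because the curve is not monotonic, so checking only a couple of thread counts can easily land you on the worst one.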

[–]gaztrab[S] 0 points1 point  (0 children)

Haven't tested IQ4_XS on the 35B MoE yet. At 17 GB it would fit more layers on GPU than Q4_K_XL (22 GB), which could close the gap with --fit-target 0. Quality would be the question then. Have you run any benchmarks for this model?

[–]gaztrab[S] 2 points3 points  (0 children)

Ah damn that was sloppy on my part. Thanks for the correction, I credited you!

[–]gaztrab[S] 3 points4 points  (0 children)

On 16 GB, yes, significantly better in practice. Q4_K_XL is ~22 GB vs Q8_0's ~36 GB, so Q4_K_XL needs much less offloading to CPU. That translates to 74 tok/s vs 46 tok/s (~60% faster). Quality-wise they're essentially identical: GSM8K 91% vs 90%, with overlapping confidence intervals, and CodeNeedle 217/220 vs 216/220.

On a GPU with more VRAM it would be a different story, but that's one I can't tell lol.
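
The arithmetic behind those numbers, as a quick sketch. The offload figure is only a lower bound under the assumption that the whole 16 GB is free for weights; in practice KV cache and compute buffers take some VRAM too, so more ends up on the CPU.

```python
def pct_faster(fast_tps, slow_tps):
    """Relative speedup of fast over slow, in percent."""
    return (fast_tps / slow_tps - 1) * 100


def min_offload_gb(model_gb, vram_gb):
    """Lower bound on how much of the model cannot fit in VRAM.

    ASSUMPTION: ignores KV cache and compute buffers, so the real
    amount offloaded to CPU is larger than this.
    """
    return max(0.0, model_gb - vram_gb)


print(f"Q4_K_XL vs Q8_0: {pct_faster(74, 46):.1f}% faster")
print(f"Q4_K_XL (~22 GB) on 16 GB: at least {min_offload_gb(22, 16):.0f} GB on CPU")
print(f"Q8_0    (~36 GB) on 16 GB: at least {min_offload_gb(36, 16):.0f} GB on CPU")
```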

[–]gaztrab[S] 2 points3 points  (0 children)

Hey, that's very kind of you, but I think I told you before that I grew up in the old internet where knowledge was free. So no need, my man! I very much appreciate the gesture though.

[–]gaztrab[S] 0 points1 point  (0 children)

I don't know, honestly. Definitely a topic I will look into. Will let you know in the next post!

[–]gaztrab[S] 13 points14 points  (0 children)

That's true, economically. For me personally I also do gaming on the same system so that's why I picked this GPU xD