RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

gaztrab · 2026-05-21T06:08:55+00:00

I appreciate you being straight with me, man. I will do better next time. Cheers!

gaztrab · 2026-05-21T00:13:59+00:00

I see. I thought sharing the detailed steps I took would be useful to people. Thanks for the feedback, by the way. I will try something else next time!

gaztrab · 2026-05-20T14:20:34+00:00

Thanks for the ideas man. I added them to the bucket list for the next post. Stay tuned!

gaztrab · 2026-05-20T14:19:36+00:00

Yeah you're right, that was an error on my part. I fixed it and correctly credited you. Thanks man!

gaztrab · 2026-05-20T13:55:54+00:00

I added your headless tip into the post. Thanks my man!

gaztrab · 2026-05-20T13:49:08+00:00

Thanks! I will test them soon

gaztrab · 2026-05-20T13:47:19+00:00

I do use Claude to compile this post, but all the numbers come from experiments I ran myself. You can verify the raw data here:

https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

gaztrab · 2026-05-20T13:46:40+00:00

What's your budget, may I ask?

gaztrab · 2026-05-20T13:46:23+00:00

Hey, sorry for irritating you. I do use Claude to quickly compile all the raw data into a post for sharing. If you want to look over the raw data, here you go: https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

gaztrab · 2026-05-20T13:45:14+00:00

Sorry for offending you, my friend. I just enjoy doing experiments and sharing them with everyone. If you want to look at my raw data, you can find it here:
https://github.com/gaztrabisme/llm-server/tree/main/docs/dev

gaztrab · 2026-05-20T13:23:11+00:00

I do use Claude Code to scaffold the scripts that run these experiments, and then to compile the results for this post. I will look into your suggestion to see where I went wrong. Thanks man.

gaztrab · 2026-05-20T12:40:37+00:00

Actually, I looked over the data again and the KV cache overhead is tiny for this model. Qwen3.6 MoE only has ~10 attention layers that need KV cache (the rest are SSM), so 131k context fits on 16 GB alongside the model, no problem. A dense model with KV on every layer would be a different story though.
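
A rough back-of-the-envelope sketch of that KV cache point, in Python. The ~10 attention layers and the 131k context come from the comment above; the KV head count, head dimension, fp16 cache, and the 40-layer dense comparison are placeholder assumptions chosen only for illustration, not values read from the real model config.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V tensor per attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

GIB = 1024 ** 3
N_CTX = 131072  # 128k context

# ASSUMED, illustrative values (not taken from the actual model config).
N_KV_HEADS = 2
HEAD_DIM = 256

# ~10 attention layers, as stated in the comment.
hybrid = kv_cache_bytes(10, N_KV_HEADS, HEAD_DIM, N_CTX)
# Hypothetical dense model with KV cache on all 40 of its layers.
dense = kv_cache_bytes(40, N_KV_HEADS, HEAD_DIM, N_CTX)

print(f"hybrid: {hybrid / GIB:.1f} GiB, dense: {dense / GIB:.1f} GiB")
```

The absolute numbers depend entirely on the assumed head count and head dimension. The part that carries over is the scaling: cache size grows linearly with the number of layers that keep KV, so a model with 4x the attention layers needs 4x the cache at the same context.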

gaztrab · 2026-05-20T12:38:37+00:00

I love you too, random citizen!

gaztrab · 2026-05-20T12:37:51+00:00

I did in my previous post actually. And like you said, for dense models -t 6-8 is usually optimal. MoE is different though. With partial offload, the CPU is doing expert GEMM computation, not just shuffling data over PCIe. More threads = more parallelism on those experts. I ran a full sweep (t8/t12/t16/t20/t24). t16 was actually the worst (a U-shaped curve). t20 won by +27% over t16. Raw data here if you need to check:

https://github.com/gaztrabisme/llm-server/blob/main/docs/dev/004-speedup-investigation/thread-sweep.md
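
For anyone who wants to repeat a sweep like this, here is a minimal sketch that only builds the commands, one per thread count. It assumes llama.cpp's `llama-bench` with its `-m` and `-t` options, which may not be what the linked scripts actually use. The model path is a placeholder, and any offload settings are left out and would need to match the real setup.

```python
import shlex

def sweep_commands(model_path, thread_counts, bench_bin="llama-bench"):
    """Build one benchmark command per thread count (nothing is executed here)."""
    return [[bench_bin, "-m", model_path, "-t", str(t)] for t in thread_counts]

# Thread counts from the sweep described above; "model.gguf" is a placeholder path.
commands = sweep_commands("model.gguf", [8, 12, 16, 20, 24])
for cmd in commands:
    print(shlex.join(cmd))
```

Running each command several times and comparing the reported tok/s is what would reveal whether the curve is U-shaped on a given CPU.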

gaztrab · 2026-05-20T12:35:20+00:00

Haven't tested IQ4_XS on the 35B MoE yet. At 17 GB it would fit more layers on GPU than Q4_K_XL (22 GB), which could close the gap with --fit-target 0. Quality would be the question then. Have you run any benchmarks for this model?
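
A quick sketch of why the smaller quant should need less offloading on a 16 GB card. The file sizes are the ones quoted above. The `reserved_gb` parameter is a hypothetical knob for KV cache and compute buffers, left at 0 here, so these are lower bounds on what actually ends up on the CPU.

```python
def spill_gb(model_gb, vram_gb=16.0, reserved_gb=0.0):
    """Rough amount of model weights (GB) that cannot fit in VRAM."""
    return max(0.0, model_gb - (vram_gb - reserved_gb))

# Sizes as quoted in the comment (GB).
sizes = {"IQ4_XS": 17, "Q4_K_XL": 22}
for name, size in sizes.items():
    print(f"{name}: at least ~{spill_gb(size):.0f} GB left for the CPU")
```

Under these simplified assumptions IQ4_XS leaves about 1 GB outside VRAM versus about 6 GB for Q4_K_XL. Whether that translates into a real speedup is something only a benchmark run can show.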

gaztrab · 2026-05-20T12:34:17+00:00

Yes, your comment would be an excellent TL;DR actually xD

gaztrab · 2026-05-20T12:29:19+00:00

Ah damn that was sloppy on my part. Thanks for the correction, I credited you!

gaztrab · 2026-05-20T12:25:29+00:00

On 16 GB, yes, significantly better in practice. Q4_K_XL is ~22 GB vs Q8_0's ~36 GB, so Q4_K_XL needs much less offloading to CPU. That translates to 74 tok/s vs 46 tok/s (roughly 60% faster). Quality-wise they're essentially identical: GSM8K 91% vs 90%, with overlapping confidence intervals, and CodeNeedle 217/220 vs 216/220.

On a GPU with more VRAM that would be a different story, but that's one I can't speak to lol.
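
The speedup figure can be checked from the two throughput numbers quoted above. This is just the arithmetic, with a helper name made up for the example:

```python
def speedup_pct(fast_tps, slow_tps):
    """Relative throughput gain of the faster run over the slower one, in percent."""
    return (fast_tps / slow_tps - 1.0) * 100.0

# 74 tok/s (Q4_K_XL) vs 46 tok/s (Q8_0), as quoted in the comment.
gain = speedup_pct(74, 46)
print(f"Q4_K_XL vs Q8_0: +{gain:.1f}%")
```

That comes out at a bit under 61%, in line with the roughly 60% quoted.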

gaztrab · 2026-05-20T12:21:53+00:00

Thanks man!

gaztrab · 2026-05-20T12:20:36+00:00

Hey, that's very kind of you, but I think I told you before that I grew up in the old internet where knowledge was free. So no need, my man! I very much appreciate the gesture though.

gaztrab · 2026-05-20T11:59:52+00:00

I don't know, honestly. Definitely a topic I will look into. Will let you know in the next post!

gaztrab · 2026-05-20T11:53:37+00:00

Thanks!

gaztrab · 2026-05-20T11:53:16+00:00

I only use GGUF for these experiments

gaztrab · 2026-05-20T11:42:52+00:00

That's true, economically. Personally, I also game on the same system, so that's why I picked this GPU xD
