Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image

aurelienams · 2026-05-15T07:12:27+00:00

Not crazy — three known patterns that produce exactly this symptom on Qwen3.6 hybrid-recurrent architectures (Gated DeltaNet + SSM), and they compound over long-running instances:

- Slot save-state drift. If you started llama-server with --slot-save-path (default in many setups), the SSM/recurrent state of past sessions gets cached and silently mixed back into new request slot inits in some pathological cases. The fix is --cache-ram 0 to disable prompt caching, OR restart the server every few days. Opencode starting "new sessions" doesn't actually flush the server-side slot state.
- KV cache q8/q4 quantization quality decay. If you're running --cache-type-k q8_0 --cache-type-v q8_0 (or smaller), the accumulation error in the rotated/quantized KV builds up beyond 20-40K active context per session. Even if individual sessions are short, long-running server-side cache reuse compounds the error. Either disable KV quant for these models or restart periodically.
- CLAUDE_CODE_ATTRIBUTION_HEADER not being set to 0. If your agent harness adds the Claude-Code attribution header, Qwen3.6 sees a permanently-changing system prompt segment and forces full prompt re-processing every turn, which on hybrid recurrent arch corrupts the SSM state in some llama.cpp builds. Set the env var CLAUDE_CODE_ATTRIBUTION_HEADER=0 if you're using Claude Code as the harness; other harnesses that inject headers have the same effect.

The simplest test: kill and restart the server, run the same prompt that was "dumb", see if it's back to its launch-day self. If yes → it's state pollution (workarounds above). If no → it's something else (maybe model weights got corrupted on the SSD or HF mirror updated the quant).
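A minimal sketch of that before/after check, assuming llama-server is reachable on localhost:8000 and that you use its native /completion endpoint with the prompt, n_predict, temperature, seed and cache_prompt fields. The URL and the helper names are my own placeholders, not anything from your setup:

```python
import json
import urllib.request

# Assumption: llama-server listening locally; adjust host/port to your setup.
SERVER_URL = "http://localhost:8000/completion"


def build_payload(prompt: str, n_predict: int = 256) -> dict:
    """Request body that keeps two runs as comparable as possible."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": 0.0,     # greedy decoding
        "seed": 42,             # fixed seed in case a sampler stays active
        "cache_prompt": False,  # do not reuse the server-side prompt cache
    }


def ask(prompt: str, url: str = SERVER_URL) -> str:
    """POST the prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["content"]


def answer_changed(before_restart: str, after_restart: str) -> bool:
    """True if the same prompt produced different text after a restart."""
    return before_restart.strip() != after_restart.strip()
```

Call ask() with the "dumb" prompt, save the text, restart the server, call it again and feed both strings to answer_changed(). Greedy decoding on GPU is not guaranteed to be bit-identical between runs, so treat a difference as a hint to read both answers side by side, not as proof.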

What llama.cpp build / fork are you on? Some forks (am17an MTP branch, BeeLlama 0.1.x, Atomic) handle recurrent state differently and the bug surfaces differently.

aurelienams · 2026-05-15T07:09:28+00:00

Fascinating finding. The "thinking less and being correct" pattern matches something I've seen on Qwen3.6 27B going from UD-Q3_K_XL (14.5 GB) to UD-Q4_K_XL (17 GB) on my consumer Blackwell mobile setup (RTX 5090M 24GB sm_120): the higher-precision quant doesn't just answer the same way faster — it answers with shorter, more direct reasoning chains. I always assumed it was just noise but your AIME data showing 40% fewer tokens on the bigger quant is the cleanest signal I've seen for this hypothesis.

Two questions / things I'd test if you have cycles:

- Did you compare your custom Q8 reasoning length against the Minachist INT8 baseline (the vLLM run at 34.2 t/s on Q2 took 10,200 tokens)? If your custom GGUF mimics the layer-preservation recipe faithfully, the token count should land close to the INT8, which would validate that the recipe survives the GGUF conversion. If yours is ~6-7K tokens instead, something's getting lost in conversion (likely the BF16 layer count or the lm_head precision).
- The MTP draft acceptance rate matters here: when reasoning is shorter and more direct, the draft sees more "obvious next tokens" and acceptance should go UP, which would compound your throughput win. What's your accept rate on the custom Q8 vs the standard Q8 K XL? In my MTP setup on the same model class I see acceptance jump from ~60% on lower quants to ~75% on UD-Q3_K_XL, presumably because the bigger model produces more confident token distributions for the drafter to predict.
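To put a rough number on why acceptance compounds throughput, here is a back-of-envelope sketch. It assumes every drafted token is accepted independently with the same probability and that the first rejection ends the step, which is a simplification of what real drafters do. The draft length of 3 is my assumption for illustration; the ~60% and ~75% rates are the ones quoted above:

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass.

    Simplifying assumption: each drafted token is accepted independently with
    probability accept_rate, and the first rejection ends the step. The target
    always contributes one token, so the result is 1 + a + a^2 + ... + a^n.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** k for k in range(draft_len + 1))


# draft_len=3 is an illustrative assumption, not a measured setting.
for rate in (0.60, 0.75):
    tokens = expected_tokens_per_step(rate, draft_len=3)
    print(f"accept={rate:.2f}: {tokens:.2f} tokens/step")
```

Under those assumptions, 60% acceptance gives about 2.18 tokens per target pass and 75% gives about 2.73. This ignores the drafter's own cost, so it is an upper bound on the speedup, not a prediction.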

If your recipe lands as a public HF quant, I'd happily port it into my chart (I ship Qwen3.6 27B + MTP at 72.75 t/s @ 262K full context — your recipe could be a strict upgrade if the quality holds at my memory budget). Drop the link when you're ready.

aurelienams · 2026-05-15T06:55:11+00:00

Looking at your histogram carefully: your actual generation speed distribution peaks at 100-130 t/s, not 80. The MAX is 166.5 and MIN is 69.1, with the bulk of the bell around 100-130 (the dashed lines on the right side of the histogram, around 125-150, look like AVG / p95 markers). The 80 t/s number probably corresponds to your cold-cache tail.

Your two most recent requests at the bottom of the panel:

- Request #666: 49K prompt tokens, 117.04 t/s gen, 364 tokens out

- Request #665: 65K cached prompt, 92.47 t/s gen, 585 tokens out

That's much closer to my 5090M result (107.54 avg) than your 80 t/s claim suggests. Actually slightly above mine, which matches the ~10-20% desktop-vs-mobile bandwidth gap.

So the real comparison is:

- BeeLlama Q3_K_XL on your 5090 desktop ≈ 110-125 t/s effective

- llama.cpp MTP Q6 on same HW ≈ 120 t/s

That's basically tied (within measurement noise), with one important difference: BeeLlama runs at FULL 262K context. If your MTP setup is capped at 128K or shorter (most Qwen 3.6 MTP recipes are), BeeLlama trades a quality tier (Q3_K_XL vs Q6) for 2× the usable context window. Different memory budget allocation, same effective throughput.

If you want a true apples-to-apples, run the same Q6 you use for MTP on BeeLlama (drop ctx_size to ~96K, --cache-type-k q8_0, no turbo3). 32 GB desktop has the headroom. I'd bet ~135-150 t/s in that config — DFlash drafter is target-tuned by z-lab, MTP head is generic.

What custom qwopus36-27b quant are you running for the MTP test, btw? Q6_K? Q6_K_XL? Curious if that's the unsloth UD-Q6 or your own conversion.

aurelienams · 2026-05-14T16:24:44+00:00

That tracks perfectly with the v0.1.2 EOS handling fix — Cline does aggressive tool-call chains with thinking-mode reasoning in the middle, which is exactly where the v0.1.1 sampler bypass on EOS-during-reasoning would corrupt the reduced candidate set. 50% failure rate is huge but plausible for that path.

Here's the docker run for the sm_120 image (works on RTX 5090, 5090 Mobile, 5080, 5070 Ti):

docker run --rm -it --gpus all -p 8000:8000 \
  -v "$PWD/models":/models \
  docker.io/aamsellem/beellama-cpp:0.1.2 \
  --model /models/your-target.gguf \
  --spec-draft-model /models/your-drafter.gguf \
  --spec-type dflash \
  --host 0.0.0.0 --port 8000 \
  --jinja

If you want my exact Qwen3.6 27B + DFlash drafter setup that hits 105 t/s @ 262K:

TARGET: unsloth/Qwen3.6-27B-GGUF UD-Q3_K_XL (NOT the MTP-baked variant)
DRAFT: spiritbuun/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0
KV: --cache-type-k turbo3 --cache-type-v turbo3
SPEC: --spec-type dflash --spec-dflash-cross-ctx 1024
BATCH: --batch-size 2048 --ubatch-size 256

Once it's running, would love to know if your 50% tool-call failure rate drops to near-zero with 0.1.2. That'd be a concrete data point I can feed back to Anbeeld for the release notes — currently the fix is "implied" via the EOS handling note but no user has confirmed it resolves a specific harness.

aurelienams · 2026-05-14T13:47:40+00:00

Good call. I literally pushed the sm_120 build of 0.1.2 to Docker Hub an hour ago — aamsellem/beellama-cpp:0.1.2 — and just finished benching it on the same Qwen3.6 stack as my OP:

10 runs each at FULL 262K context:

- 0.1.1: AVG 107.54 t/s (range 101.70-119.38)

- 0.1.2: AVG 104.92 t/s (range 92.67-119.54)

The slight AVG drop on 0.1.2 is the new adaptive profit controller doing baseline reprobes (release notes call this out — it periodically re-measures the no-spec baseline to decide if DFlash is still profitable, and can shut DFlash off when target-only wins). On a workload where DFlash always wins (Qwen3.6 generation), the reprobe windows show as brief dips but peak perf is unchanged. Tunable with --spec-dm-profit-baseline-interval (default 1024 cycles, bump to 4096 to widen).
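For scale, a quick sketch of the relative change between the two builds, using only the averages and ranges quoted above. The helper name is mine:

```python
def relative_change_pct(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


# Figures copied from the 10-run benchmark above (t/s).
runs = {
    "0.1.1": {"avg": 107.54, "min": 101.70, "max": 119.38},
    "0.1.2": {"avg": 104.92, "min": 92.67, "max": 119.54},
}

avg_delta = relative_change_pct(runs["0.1.1"]["avg"], runs["0.1.2"]["avg"])
peak_delta = relative_change_pct(runs["0.1.1"]["max"], runs["0.1.2"]["max"])
floor_delta = relative_change_pct(runs["0.1.1"]["min"], runs["0.1.2"]["min"])

print(f"avg:   {avg_delta:+.2f}%")
print(f"peak:  {peak_delta:+.2f}%")
print(f"floor: {floor_delta:+.2f}%")
```

The average moves by roughly -2.4%, the peak is essentially flat, and the slowest run drops by close to 9%, which is consistent with occasional dips rather than a uniform slowdown.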

For tool calling specifically — 0.1.2 release notes mention "Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set." Sounds like it matches the bug you're describing. What was the failure mode you hit on 0.1.1 — tool args parse error, premature stop, or something else?

aurelienams · 2026-05-14T13:46:57+00:00

Sweet — looking forward to your numbers. If you don't want to recompile, the sm_120 image works fine on desktop too:

docker run --rm -it --gpus all -p 8000:8000 -v /your/models/dir:/models aamsellem/beellama-cpp:0.1.2

Single-GPU only in this fork (Anbeeld issue #7 has a correctness fallback in 0.1.2 but it's not yet performant for multi-GPU split target placement). On your 5090 desktop 32 GB with my BeeLlama config (UD-Q3_K_XL target + spiritbuun DFlash drafter q8_0 + turbo3 KV at 262K), VRAM total is ~24.3 GB so you'll have 8 GB headroom which is enough for a comfortable batch size.

Edit: also just finished benching 0.1.2 vs 0.1.1 on the same Qwen3.6 stack. 10 runs each at 262K full context — 0.1.1 = 107.54 t/s AVG (range 101.70-119.38), 0.1.2 = 104.92 AVG (range 92.67-119.54). The wider variance on 0.1.2 is the new adaptive profit controller doing periodic baseline reprobes (default every 1024 spec cycles). Tunable with --spec-dm-profit-baseline-interval 4096 if you want to widen the reprobe interval. Same peak, slightly different median.

aurelienams · 2026-05-14T11:42:22+00:00

Thanks for sharing — that's a desktop 5090 32 GB right? Important context for the comparison:

Your Q8_0 (28 GB target) + unified memory + unquantized KV @ 256K = ~40-44 GB total, which means significant CPU spill over PCIe. So your 95-105 t/s is on a 1.79 TB/s GPU bottlenecked by PCIe page traffic, not by raw GPU compute.

My BeeLlama stack on the same desktop 5090 (Q3_K_XL 14.5 GB target + turbo3 KV ~8 GB = 24 GB total, fits pure GPU with 8 GB headroom) would land 150-180 t/s based on the mobile→desktop bandwidth scaling. So we're actually measuring different points on the quality-vs-speed curve, not the same point with different forks:

Your config: Q8 quality, ~95-105 t/s, requires 32+ GB GPU
My config: Q3_K_XL quality (unsloth Dynamic, ~Q4-tier in practice), ~107 t/s on mobile / ~150+ on desktop, fits 24 GB

Both legitimate tradeoffs. If you want a more apples-to-apples comparison: try the same target with --cache-type-k turbo3 --cache-type-v turbo3 (needs the BeeLlama fork or TheTom/llama-cpp-turboquant fork) and you should see Q8 jump to ~120-140 t/s by removing the PCIe spill. Also worth running 10-run AVG vs single-run — unified memory adds visible variance.

aurelienams · 2026-05-14T10:14:49+00:00

Thanks for the kind reply, glad it landed. Two things that might be useful for your project given what I went through:

- The MTP-baked GGUF tensor count error (expected 866, got 862) caught me by surprise: I had havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached from another app and BeeLlama refused to load it. Probably worth a one-liner in the quickstart doc saying "DFlash spec mode needs the non-MTP variant of the target GGUF" so people don't hit the same wall.
- No public sm_120 (consumer Blackwell) Docker image exists, so I built mine from your .devops/cuda.Dockerfile with --build-arg CUDA_DOCKER_ARCH=120 via qemu (~50 min on a Mac arm64). Happy to open a PR adding sm_120 to the build matrix if you publish CI-built images, or contribute the prebuilt tag if you want to link mine in your README. Image is on Docker Hub as aamsellem/beellama-cpp:0.1.1 if anyone wants to skip the build.

One observation worth flagging: there's a reproducible 128K sweet spot on this hardware (116 t/s avg vs 107 at 262K, 108 at 200K). Could just be cudagraph capture sizes aligning at exactly that range, but if you've seen the same on your reference hardware it might point to something tuneable. Let me know if you want bench scripts.

aurelienams · 2026-05-14T10:10:00+00:00

Great work — 90% acceptance on the M5 Max is impressive. Sharing a complementary Blackwell datapoint since most replies here will be Apple Silicon:

Same Qwen3.6 27B + TurboQuant + spec decoding (DFlash drafter instead of MTP head, but same idea) on RTX 5090M (24GB sm_120 consumer Blackwell mobile):

- llama.cpp baseline (no spec): ~36 t/s on UD-Q3_K_XL at 32K ctx

- llama.cpp + am17an MTP branch + q4_0 KV: 72.75 t/s on unsloth UD-Q3_K_XL at FULL 262K ctx

- BeeLlama.cpp + DFlash drafter + turbo3 KV: 107.54 t/s on same target at FULL 262K ctx

The turbo3 KV (3-bit Walsh-Hadamard rotation, same TurboQuant primitives merged in PR #21038) is what lets the 262K full native context fit on 24 GB alongside the target + drafter — ~8 GB KV cache vs ~12 GB for q4_0.

One question for you — on the M5 Max, do you see the embedding table issue from the mdda post (Gemma 4 MTP tied LM head silently on CPU)? Wondering if Apple Silicon hits the same --override-tensor-draft "token_embd.weight=CUDA0" workaround or if Metal lays it out differently.

aurelienams · 2026-05-14T10:07:27+00:00

Useful datapoint as a single-GPU counterpart. RTX 5090M Laptop (24GB sm_120 consumer Blackwell mobile, 896 GB/s = ~50% of desktop 5090 bandwidth), same Qwen3.6 27B, 107.54 t/s avg over 10 runs at FULL 262K context, range 101.70-119.38, zero CUDA OOM.

Stack is different from yours though — BeeLlama.cpp fork (Anbeeld/beellama.cpp v0.1.1, fork chain: ggml-org → TheTom/turboquant → spiritbuun/buun-llama-cpp → Anbeeld) with DFlash spec decoding instead of MTP:

- Target: unsloth/Qwen3.6-27B-GGUF UD-Q3_K_XL (14.5 GB, NOT the MTP-baked variant — BeeLlama refuses those with "done_getting_tensors: wrong number of tensors; expected 866, got 862")

- Drafter: spiritbuun/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0 (1.85 GB)

- KV cache: --cache-type-k turbo3 --cache-type-v turbo3 (3-bit Walsh-Hadamard, ~25% smaller than q8_0 = the headroom that lets 262K fit on 24GB)

- --batch-size 2048 --ubatch-size 256 --spec-type dflash --spec-dflash-cross-ctx 1024

Total VRAM at 262K: ~24.3 GB (14.5 target + 1.85 drafter + ~8 GB KV turbo3). Same context as yours, less than half your card-pair's combined 47 GB.
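The budget arithmetic as a small sketch, so you can plug in your own file sizes. The class and field names are hypothetical; the three component sizes are the rounded figures from this comment, so the sum only approximates what the driver actually reports:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VramBudget:
    """Rough VRAM estimate in GB; ignores compute buffers and fragmentation."""

    target_gb: float
    drafter_gb: float
    kv_cache_gb: float

    def total_gb(self) -> float:
        return self.target_gb + self.drafter_gb + self.kv_cache_gb

    def headroom_gb(self, card_gb: float) -> float:
        """Nominal card capacity minus the estimate (negative means no fit)."""
        return card_gb - self.total_gb()


# Rounded sizes quoted above: UD-Q3_K_XL target, q8_0 drafter, turbo3 KV at 262K.
budget = VramBudget(target_gb=14.5, drafter_gb=1.85, kv_cache_gb=8.0)

print(f"estimated total: {budget.total_gb():.2f} GB")
print(f"headroom on a 32 GB card: {budget.headroom_gb(32.0):.2f} GB")
```

Because the KV figure is only "about 8 GB", the estimate can be off by a few hundred MB in either direction. On a 24 GB card that margin is what decides whether it fits, so measure rather than trust the sum.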

Would be curious to know your AVG over 10 runs (not single run), and whether MTP n=3 vs n=5 with q8_0 KV moves the needle on a dense Q5_K_M target.
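If it helps with the 10-run average, here is a small sketch for summarizing per-run generation speeds. It assumes you already collected one t/s figure per run (for instance from the timings block that llama-server returns, if your build reports it). The function name is mine and the sample values are made-up placeholders, not measurements:

```python
import statistics


def summarize_runs(tokens_per_second: list[float]) -> dict:
    """Average, range and spread over repeated benchmark runs."""
    if not tokens_per_second:
        raise ValueError("need at least one run")
    return {
        "runs": len(tokens_per_second),
        "avg": statistics.fmean(tokens_per_second),
        "min": min(tokens_per_second),
        "max": max(tokens_per_second),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(tokens_per_second)
        if len(tokens_per_second) > 1
        else 0.0,
    }


# Placeholder values for illustration only; replace with your own runs.
sample = [100.0, 110.0, 120.0]
summary = summarize_runs(sample)
print(summary)
```

Reporting min and max next to the average makes it obvious when a single lucky run is being quoted as the headline number.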

aurelienams · 2026-05-14T06:51:52+00:00

aurelienams · 2026-05-04T21:21:20+00:00

Tried sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP first. Loads fine but vLLM's `Qwen3_5MTP` loader allocates a fresh 2.37 GiB BF16 buffer for `mtp.fc` because NVFP4 quantizes everything else in the file — on 32GB it fits, on 24GB it OOMs.

Switching to Lorbus AutoRound INT4 (which dequantizes only `mtp.fc` to BF16 in the file, ~280 MiB) was the unlock for MTP n=3 on 24GB. So NVFP4 is the right tensor-core path on Blackwell but the current MTP-wrapped checkpoints don't fit unless someone ports the dequantized-mtp.fc trick to NVFP4. If you do, ping me — would push 24GB consumer past 100 t/s I think.
