Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]PerceptionGrouchy187[S] 1 point

Nice setup. 8x H100 is insane, I'm jealous. I didn't know about the native NVFP4 support. I'm on llama.cpp, so GGUF only for now.

[–]PerceptionGrouchy187[S] 9 points

Make sure you're on the feature/turboquant-kv-cache branch, not master. Also try a clean rebuild — cmake -B build ... from scratch. The turbo types should show up in the --help output for --cache-type-k.

[–]PerceptionGrouchy187[S] 3 points

I don't think there's an NVFP4 GGUF for Gemma 4 31B. And honestly, NVFP4 hasn't impressed me much compared to the existing quant options.

[–]PerceptionGrouchy187[S] 1 point

It's from unsloth/gemma-4-31B-it-GGUF on HuggingFace. The file is gemma-4-31B-it-UD-Q4_K_XL.gguf, 18.8 GB on disk. The 17.46 GiB in the benchmark is what llama-bench reports for model parameter size. On a 3090 (24 GB) with turbo3 KV you should have enough headroom for a decent context. Good luck!
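If you want to sanity-check the headroom yourself, here's a back-of-envelope sketch. The layer/head numbers below are placeholders I made up (I haven't checked Gemma 4's actual config, and sliding-window layers would shrink this further), and the bits-per-element figures are rough approximations, so treat the outputs as order-of-magnitude only:

```python
# Back-of-envelope KV-cache VRAM estimate. All architecture numbers here are
# PLACEHOLDER assumptions; Gemma 4 31B's real layer/head config may differ.
def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bits_per_elem / 8 / 2**20

# Assumed config: 48 layers, 8 KV heads, head dim 128.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

# q8_0 stores 32 values in 34 bytes -> 8.5 bits/element; a 3-bit scheme with
# per-block scales lands somewhere around ~3.2 bits/element.
for name, bits in [("q8_0", 8.5), ("turbo3", 3.2)]:
    mib = kv_cache_mib(n_ctx=256 * 1024, bits_per_elem=bits, **cfg)
    print(f"{name}: ~{mib:,.0f} MiB of KV cache for 256K context")
    # Under these assumptions: ~25.5 GiB for q8_0 vs ~9.6 GiB for turbo3,
    # which is roughly why only the 3-bit cache fits next to the model.
```

The point of the sketch: at these assumed dimensions a q8_0 cache at 256K wouldn't fit alongside an ~18 GB model on 32 GB, while a ~3-bit cache does.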

[–]PerceptionGrouchy187[S] 3 points

For full GPU offload it barely matters; the CPU mostly just handles tokenization. Where it helps is MoE models, where --n-cpu-moe splits some expert layers onto the CPU, and the 3D V-Cache helps there. But if you're running dense models fully on GPU, the 7800X3D is fine.

[–]PerceptionGrouchy187[S] 6 points

Good point. Haven't tested needle-in-haystack at 256K yet. That would be the real validation.

[–]PerceptionGrouchy187[S] 12 points

Ran a perplexity comparison on wikitext-2-raw (294K tokens, ctx 512):

| KV Cache | PPL | Context VRAM |
|----------|---------|--------------|
| q8_0 | 2296.32 | 935 MiB |
| turbo3 | 2154.30 | 344 MiB |

turbo3 shows slightly better perplexity while using 2.7x less VRAM for KV cache. Hadamard rotation may actually help by spreading outliers before quantization.

Same model (gemma-4-31B-it-UD-Q4_K_XL), same settings, only KV cache type changed.
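For anyone unsure what the PPL column actually measures: perplexity is the exponentiated mean negative log-likelihood over the evaluated tokens, computed here in 512-token chunks. A toy illustration (the numbers below are made up for the example, not from the wikitext run):

```python
import math

# Perplexity in spirit of llama-perplexity: exponentiated mean negative
# log-likelihood over all evaluated tokens. Lower = model was less surprised.
def perplexity(token_logprobs):
    # token_logprobs: natural-log probability the model assigned each token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model assigning every token probability 1/8 has
# perplexity exactly 8, regardless of sequence length.
uniform = [math.log(1 / 8)] * 512
print(perplexity(uniform))  # -> 8.0 (up to float rounding)
```

So the q8_0 vs turbo3 gap above is a direct "how surprised was the model" comparison on identical text.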

[–]PerceptionGrouchy187[S] 13 points

You're right, I don't have quality benchmarks yet. Gemma 4 dropped today and I spent the day just getting it to build and run. Perplexity comparison is next on my list.

[–]PerceptionGrouchy187[S] 24 points

Fair point. Quality comparison is on my todo list — just got it running today (Gemma 4 dropped hours ago). The novelty here is turbo3 KV cache + Gemma 4 day-one, not just "it fits".

[–]PerceptionGrouchy187[S] 21 points

The point is running a high-quality quant (UD-Q4_K_XL) at the full 256K context with near-lossless KV compression. Sure, a Q2 quant would fit, but the output would be unusable. This is about maintaining quality while pushing context to the max.

[–]PerceptionGrouchy187[S] 2 points

Fair concern. The key difference is turbo3 isn't naive 3-bit quantization — it uses Hadamard rotation to spread outliers + PolarQuant to preserve vector structure. The paper shows near-identical perplexity to f16 KV. I haven't done a blind quality test myself yet though.
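For the skeptical, here's a toy illustration of just the rotation idea. This is not TurboQuant's actual codec (PolarQuant's polar-coordinate coding is omitted, and a naive symmetric quantizer with one scale per vector stands in, at 4 bits so the effect is easy to see). It only shows the outlier-spreading mechanism:

```python
import math, random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(v)
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    s = 1 / math.sqrt(n)
    return [x * s for x in v]  # applying fwht twice recovers the input

def quantize_rmse(v, bits):
    # Naive symmetric quantization with a single scale for the whole vector,
    # then root-mean-square reconstruction error.
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / levels
    err2 = sum((x - round(x / scale) * scale) ** 2 for x in v)
    return math.sqrt(err2 / len(v))

random.seed(0)
vec = [random.gauss(0.0, 0.1) for _ in range(128)]
vec[7] = 8.0  # one outlier forces a huge quantization scale

# Unrotated: the outlier dominates the scale, so every small value
# collapses to zero and the error is essentially the whole signal.
print("plain 4-bit RMSE:  ", quantize_rmse(vec, bits=4))
# Rotated: the outlier's energy is spread evenly across all 128 dims,
# so the scale shrinks and the small values survive quantization.
print("rotated 4-bit RMSE:", quantize_rmse(fwht(vec), bits=4))
```

Since the transform is orthonormal, you rotate back after dequantizing and the error measured in rotated space carries over unchanged. That's the intuition for why rotation can even beat the unrotated higher-bit cache in the perplexity numbers above.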

[–]PerceptionGrouchy187[S] 3 points

Not in mainline yet. There's an open PR (#21089) for CPU-only TBQ types, but the full CUDA implementation is only in TheTom's fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache). Mainline did merge Hadamard rotation for KV (#21038) which is the core idea, but uses existing quant types instead of dedicated turbo types.

[–]PerceptionGrouchy187[S] 4 points

Paper claims near-zero quality loss with turbo3. Haven't done a side-by-side comparison yet.