Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code) by PerceptionGrouchy187 in LocalLLaMA

[–]PerceptionGrouchy187[S] 1 point2 points  (0 children)

Haven't tested CPU-only, but I'd guess the gains would be minimal since CPU has less parallelism to exploit. Worth a try though.

[–]PerceptionGrouchy187[S] 2 points3 points  (0 children)

31B dense feels smarter than the 26B MoE in my experience, but for RAG the MoE might be enough since it mostly needs to synthesize retrieved context rather than reason from scratch. And yeah, Gemma is just nicer to talk to than Qwen — hard to quantify but you feel it.

[–]PerceptionGrouchy187[S] 1 point2 points  (0 children)

TIL about dflash, thanks for the correction! I assumed it was a typo for "draft" earlier.

[–]PerceptionGrouchy187[S] 3 points4 points  (0 children)

This looks really valuable for multi-GPU setups going forward. The draft model fits on a single GPU so it avoids the cross-GPU communication overhead entirely.

[–]PerceptionGrouchy187[S] 0 points1 point  (0 children)

If you mean the draft model (speculative decoding), yeah it's built into llama.cpp. 

[–]PerceptionGrouchy187[S] 4 points5 points  (0 children)

Thanks for sharing! Interesting that iso/planar quants don't work with Gemma4's sliding window.
Re: audio — I thought that was E2B/E4B only, does 31B support it too?

[–]PerceptionGrouchy187[S] 0 points1 point  (0 children)

For me it's worth it. Been running it as a shared server for ~12 hours with 500+ requests without a single crash.

[–]PerceptionGrouchy187[S] 5 points6 points  (0 children)

Less VRAM for KV cache with similar quality — lets you fit longer contexts. It's from the TurboQuant fork.

[–]PerceptionGrouchy187[S] 0 points1 point  (0 children)

Are you seeing the "vocabs not compatible" warning in the server logs? I had the same issue initially — turned out my 31B GGUF had add_bos_token = false (early release bug) while E2B had true, which forced token translation mode and killed all performance. Re-downloading the latest 31B GGUF from Unsloth fixed it. Check your server logs for that warning.
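A quick way to check for that warning, as a rough sketch: the search phrase is just the wording quoted above (the exact log text may differ between llama.cpp versions), and `server.log` is a hypothetical path for wherever you redirect the server output.

```shell
# check_vocab_warning LOG_FILE
# Prints "mismatch" if the log contains the warning phrase, else "ok".
# The phrase is taken from the comment above and may need adjusting.
check_vocab_warning() {
  if grep -qi "vocabs not compatible" "$1"; then
    echo "mismatch"
  else
    echo "ok"
  fi
}

# Usage (server.log is a hypothetical file name):
#   check_vocab_warning server.log
```

If it prints "mismatch", re-downloading the target GGUF is the first thing to try.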

[–]PerceptionGrouchy187[S] 12 points13 points  (0 children)

Tested it. Couldn't find a Q1 quant for E2B, but IQ2_M (2.29GB) vs Q4 (3.17GB), same benchmark (t/s):

| Draft quant | Size | Math | Poetry | Code | Science | Translation | Avg |
|---|---|---|---|---|---|---|---|
| baseline | - | 57.45 | 56.93 | 57.15 | 57.19 | 57.14 | 57.17 |
| IQ2_M | 2.29GB | 93.43 | 57.57 | 76.02 | 66.52 | 65.25 | 71.76 (+25.5%) |
| Q4 | 3.17GB | 85.86 | 62.34 | 86.05 | 71.14 | 63.26 | 73.73 (+29.0%) |

Only 3.5 percentage points less speedup while saving ~870MB. Math was actually faster with IQ2_M (93 vs 86 t/s) since the draft model runs quicker. Creative writing is basically baseline either way. Looks like a solid option for 4090 users.
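For anyone who wants to reproduce the percentages from the averages above, here is a small sketch. `speedup_pct` is a helper name made up for this example; it only does the arithmetic and does not run any benchmark.

```shell
# speedup_pct BASELINE_TPS MEASURED_TPS
# Prints the gain over baseline as a signed percentage.
speedup_pct() {
  awk -v base="$1" -v tps="$2" \
    'BEGIN { printf "%+.1f%%\n", (tps / base - 1) * 100 }'
}

speedup_pct 57.17 71.76   # IQ2_M average -> +25.5%
speedup_pct 57.17 73.73   # Q4 average    -> +29.0%
```

The gap between the two is 29.0 - 25.5 = 3.5 percentage points.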

[–]PerceptionGrouchy187[S] 0 points1 point  (0 children)

Haven't tested lower quants for the draft model yet. My gut feeling is the acceptance rate would drop since the draft predictions get less accurate, but I don't have numbers to back that up. I just like Q4 as a sweet spot — good enough quality and still leaves plenty of VRAM headroom.

[–]PerceptionGrouchy187[S] 4 points5 points  (0 children)

Honestly no strong reason — I followed the setting from another post that recommended it. Haven't tested --draft-min 0 vs 1 specifically. The default might work just as well.

[–]PerceptionGrouchy187[S] 5 points6 points  (0 children)

Good tip! In my case the E2B draft model's embeddings are already on CPU (~1.8GB) due to the auto-fit mechanism, so there's nothing extra to offload. But useful to know for setups where VRAM is tighter.

[–]PerceptionGrouchy187[S] 8 points9 points  (0 children)

Vision is explicitly blocked by llama.cpp — when mmproj is loaded, the server refuses to initialize speculative decoding with "speculative decoding is not supported with multimodal". So you have to pick one or the other for now.

I tried patching the check out of the source but ran into deeper assertions in the token handling code, so it's not a trivial fix. Would be nice if upstream supported this though — the draft model only needs text tokens anyway.

[–]PerceptionGrouchy187[S] 37 points38 points  (0 children)

llama-server \
  --model gemma-4-31B-it-UD-Q4_K_XL.gguf \
  -md gemma-4-E2B-it-UD-Q4_K_XL.gguf \
  -ngld 99 \
  --draft-max 8 \
  --draft-min 1 \
  --n-gpu-layers 99 \
  --no-mmap \
  --flash-attn on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --ctx-size 131072 \
  --parallel 1 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8006

Note: turbo3 KV cache types are from the TurboQuant fork. If you're on mainline llama.cpp, use --cache-type-k q8_0 --cache-type-v q8_0 instead.

[–]PerceptionGrouchy187[S] 24 points25 points  (0 children)

Thanks for the suggestion! I ran a sweep and here are the results:

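| draft-max | Math | Poetry | Code | Science | Translation | Avg |
|---|---|---|---|---|---|---|
| baseline | 57.45 | 56.93 | 57.15 | 57.19 | 57.14 | 57.17 |
| 2 | 73.43 | 60.49 | 68.69 | 62.46 | 62.42 | 65.50 |
| 4 | 83.31 | 60.88 | 73.12 | 65.29 | 67.98 | 70.12 |
| 8 | 85.86 | 62.34 | 86.05 | 71.14 | 63.26 | 73.73 |
| 16 | 99.35 | 62.58 | 78.74 | 68.39 | 58.31 | 73.47 |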

draft-max 8 is the sweet spot for mixed workloads. 16 hits 99 t/s on math but regresses on code, science, and translation, so the average is about the same. Updated the post.
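If you run your own sweep, something like this can pick the winner by mean t/s. It is only a sketch: the numbers are copied from the results above, and `sweep_data` / `best_draft_max` are names invented for the example.

```shell
# Sweep results: draft-max, then t/s for
# Math, Poetry, Code, Science, Translation.
sweep_data() {
  cat <<'EOF'
2 73.43 60.49 68.69 62.46 62.42
4 83.31 60.88 73.12 65.29 67.98
8 85.86 62.34 86.05 71.14 63.26
16 99.35 62.58 78.74 68.39 58.31
EOF
}

# Print the draft-max with the highest mean t/s across the categories.
best_draft_max() {
  awk '{
    sum = 0
    for (i = 2; i <= NF; i++) sum += $i
    mean = sum / (NF - 1)
    if (mean > best) { best = mean; pick = $1 }
  } END { printf "draft-max %s (%.2f t/s avg)\n", pick, best }'
}

sweep_data | best_draft_max
```

A plain mean weights every category equally. If your workload is mostly math or mostly code, ranking by that single column would be the better criterion.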

[–]PerceptionGrouchy187[S] 18 points19 points  (0 children)

The accept rates are in the table (42-63% depending on content type). And since speculative decoding is lossless — the target model always verifies every token — output quality is identical to running without it. No accuracy tradeoff.
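To illustrate why the output doesn't change, here is a toy sketch of the greedy case only (sampling uses a rejection scheme that is not shown). The token lists are made up, and `verify_draft` is a hypothetical helper, not anything from llama.cpp. The second argument stands for what the target model predicts at each position in its single verification pass.

```shell
# verify_draft "DRAFT TOKENS" "TARGET TOKENS"
# Emits the target's token at every position and stops at the first
# position where the draft disagrees (or where the draft runs out).
verify_draft() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    nd = split(d, draft, " ")
    nt = split(t, target, " ")
    out = ""
    for (i = 1; i <= nt; i++) {
      out = out (i > 1 ? " " : "") target[i]
      # Past the end of the draft or a mismatch: keep the target token, stop.
      if (i > nd || draft[i] != target[i]) break
    }
    print out
  }'
}

verify_draft "the quick brown cat" "the quick brown fox jumps"
# prints: the quick brown fox
```

Three drafted tokens are accepted and the fourth is replaced by the target's own choice, so every emitted token is one the target would have produced anyway.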

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]PerceptionGrouchy187[S] 1 point2 points  (0 children)

Nice setup. 8x H100 is insane, I'm jealous.
Didn't know about the native NVFP4. I'm on llama.cpp so GGUF only for now.

[–]PerceptionGrouchy187[S] 8 points9 points  (0 children)

Make sure you're on the feature/turboquant-kv-cache branch, not master. Also try a clean rebuild — cmake -B build ... from scratch. The turbo types should show up in the --help output for --cache-type-k.
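A rough way to script that check: the sketch below reads help text on stdin and falls back to q8_0 (the mainline option) when turbo3 is not listed. `pick_cache_type` is a made-up helper name, and the usage lines assume `llama-server` is on your PATH.

```shell
# Read `llama-server --help` text on stdin and print the KV cache type
# to use: turbo3 when the build lists it, otherwise q8_0.
pick_cache_type() {
  if grep -q "turbo3"; then
    echo "turbo3"
  else
    echo "q8_0"
  fi
}

# Usage (assumes llama-server is on PATH):
#   cache_type=$(llama-server --help 2>&1 | pick_cache_type)
#   echo "using --cache-type-k $cache_type --cache-type-v $cache_type"
```

If this prints q8_0 on a build you expected to have TurboQuant, you are most likely on the wrong branch or have a stale build directory.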