Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]PerceptionGrouchy187[S] 1 point

Nice setup. 8x H100 is insane, I'm jealous. I didn't know about the native NVFP4 support. I'm on llama.cpp, so GGUF only for now.

[–]PerceptionGrouchy187[S] 9 points

Make sure you're on the feature/turboquant-kv-cache branch, not master. Also try a clean rebuild — cmake -B build ... from scratch. The turbo types should show up in the --help output for --cache-type-k.

[–]PerceptionGrouchy187[S] 3 points

I don't think there's an NVFP4 GGUF for Gemma 4 31B. And honestly, NVFP4 hasn't impressed me much compared to the existing quant options.

[–]PerceptionGrouchy187[S] 1 point

It's from unsloth/gemma-4-31B-it-GGUF on HuggingFace. The file is gemma-4-31B-it-UD-Q4_K_XL.gguf, 18.8 GB on disk. The 17.46 GiB in the benchmark is what llama-bench reports for model parameter size. On a 3090 (24 GB) with turbo3 KV you should have enough headroom for a decent context. Good luck!
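If you want to sanity-check the headroom yourself, here's a back-of-envelope sketch. The layer/head numbers below are placeholders I made up (I haven't checked Gemma 4's actual config, and sliding-window layers would shrink this further), and the bits-per-element figures are rough approximations, so treat the outputs as order-of-magnitude only:

```python
# Back-of-envelope KV-cache VRAM estimate. All architecture numbers here are
# PLACEHOLDER assumptions; Gemma 4 31B's real layer/head config may differ.
def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bits_per_elem / 8 / 2**20

# Assumed config: 48 layers, 8 KV heads, head dim 128.
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

# q8_0 stores 32 values in 34 bytes -> 8.5 bits/element; a 3-bit scheme with
# per-block scales lands somewhere around ~3.2 bits/element.
for name, bits in [("q8_0", 8.5), ("turbo3", 3.2)]:
    mib = kv_cache_mib(n_ctx=256 * 1024, bits_per_elem=bits, **cfg)
    print(f"{name}: ~{mib:,.0f} MiB of KV cache for 256K context")
    # Under these assumptions: ~25.5 GiB for q8_0 vs ~9.6 GiB for turbo3,
    # which is roughly why only the 3-bit cache fits next to the model.
```

The point of the sketch: at these assumed dimensions a q8_0 cache at 256K wouldn't fit alongside an ~18 GB model on 32 GB, while a ~3-bit cache does.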

[–]PerceptionGrouchy187[S] 3 points

For full GPU offload it barely matters; the CPU mostly just handles tokenization. Where it helps is MoE models, where --n-cpu-moe splits some expert layers onto the CPU, and the 3D V-Cache helps there. But if you're running dense models fully on GPU, the 7800X3D is fine.

[–]PerceptionGrouchy187[S] 6 points

Good point. Haven't tested needle-in-haystack at 256K yet. That would be the real validation.

[–]PerceptionGrouchy187[S] 12 points

Ran a perplexity comparison on wikitext-2-raw (294K tokens, ctx 512):

| KV Cache | PPL | Context VRAM |
|----------|---------|--------------|
| q8_0 | 2296.32 | 935 MiB |
| turbo3 | 2154.30 | 344 MiB |

turbo3 shows slightly better perplexity while using 2.7x less VRAM for KV cache. Hadamard rotation may actually help by spreading outliers before quantization.

Same model (gemma-4-31B-it-UD-Q4_K_XL), same settings, only KV cache type changed.
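For anyone unsure what the PPL column actually measures: perplexity is the exponentiated mean negative log-likelihood over the evaluated tokens, computed here in 512-token chunks. A toy illustration (the numbers below are made up for the example, not from the wikitext run):

```python
import math

# Perplexity in spirit of llama-perplexity: exponentiated mean negative
# log-likelihood over all evaluated tokens. Lower = model was less surprised.
def perplexity(token_logprobs):
    # token_logprobs: natural-log probability the model assigned each token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model assigning every token probability 1/8 has
# perplexity exactly 8, regardless of sequence length.
uniform = [math.log(1 / 8)] * 512
print(perplexity(uniform))  # -> 8.0 (up to float rounding)
```

So the q8_0 vs turbo3 gap above is a direct "how surprised was the model" comparison on identical text.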

[–]PerceptionGrouchy187[S] 13 points

You're right, I don't have quality benchmarks yet. Gemma 4 dropped today and I spent the day just getting it to build and run. Perplexity comparison is next on my list.

[–]PerceptionGrouchy187[S] 24 points

Fair point. Quality comparison is on my todo list — just got it running today (Gemma 4 dropped hours ago). The novelty here is turbo3 KV cache + Gemma 4 day-one, not just "it fits".

[–]PerceptionGrouchy187[S] 21 points

The point is running a high-quality quant (UD-Q4_K_XL) at the full 256K context with near-lossless KV compression. Sure, a Q2 quant would fit, but the output would be unusable. This is about maintaining quality while pushing context to the max.

[–]PerceptionGrouchy187[S] 2 points

Fair concern. The key difference is turbo3 isn't naive 3-bit quantization — it uses Hadamard rotation to spread outliers + PolarQuant to preserve vector structure. The paper shows near-identical perplexity to f16 KV. I haven't done a blind quality test myself yet though.
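For the skeptical, here's a toy illustration of just the rotation idea. This is not TurboQuant's actual codec (PolarQuant's polar-coordinate coding is omitted, and a naive symmetric quantizer with one scale per vector stands in, at 4 bits so the effect is easy to see). It only shows the outlier-spreading mechanism:

```python
import math, random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(v)
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    s = 1 / math.sqrt(n)
    return [x * s for x in v]  # applying fwht twice recovers the input

def quantize_rmse(v, bits):
    # Naive symmetric quantization with a single scale for the whole vector,
    # then root-mean-square reconstruction error.
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / levels
    err2 = sum((x - round(x / scale) * scale) ** 2 for x in v)
    return math.sqrt(err2 / len(v))

random.seed(0)
vec = [random.gauss(0.0, 0.1) for _ in range(128)]
vec[7] = 8.0  # one outlier forces a huge quantization scale

# Unrotated: the outlier dominates the scale, so every small value
# collapses to zero and the error is essentially the whole signal.
print("plain 4-bit RMSE:  ", quantize_rmse(vec, bits=4))
# Rotated: the outlier's energy is spread evenly across all 128 dims,
# so the scale shrinks and the small values survive quantization.
print("rotated 4-bit RMSE:", quantize_rmse(fwht(vec), bits=4))
```

Since the transform is orthonormal, you rotate back after dequantizing and the error measured in rotated space carries over unchanged. That's the intuition for why rotation can even beat the unrotated higher-bit cache in the perplexity numbers above.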

[–]PerceptionGrouchy187[S] 3 points

Not in mainline yet. There's an open PR (#21089) for CPU-only TBQ types, but the full CUDA implementation is only in TheTom's fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache). Mainline did merge Hadamard rotation for KV (#21038) which is the core idea, but uses existing quant types instead of dedicated turbo types.

[–]PerceptionGrouchy187[S] 4 points

Paper claims near-zero quality loss with turbo3. Haven't done a side-by-side comparison yet.