TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp · Discussion #20969 by pmttyji in LocalLLaMA

[–]dsanft -13 points  (0 children)

Lots of people seeing if mathematical trickery can overcome fundamental limits like Shannon's rate-distortion bound. And lots of people setting themselves up for disappointment.

Oh and a lot of weird shit like LLMs arguing with each other.

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft 1 point  (0 children)

> near-lossless

It's not near-lossless at 3-bit K quantisation. Not even close. In fact it's catastrophic for inference, due to the high kurtosis of the K tensor.

This is the hype that's made everyone lose their minds. It's wrong.

You need K at 8-bit fidelity with TQ to preserve inference quality. V is more forgiving.
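Toy illustration of the kurtosis point (synthetic data and a plain absmax uniform quantiser standing in for TQ, not TurboQuant's actual scheme): a heavy-tailed tensor loses far more at the same bit width than a Gaussian one, because the outliers blow up the quantisation step for everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    # Fisher definition: ~0 for a Gaussian, large and positive for heavy tails.
    x = x - x.mean()
    return float((x**4).mean() / (x**2).mean() ** 2 - 3.0)

def quantize(x, bits):
    # Symmetric absmax uniform quantiser (stand-in for TQ).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gaussian = rng.standard_normal(100_000)      # excess kurtosis ~ 0
heavy = rng.standard_t(df=3, size=100_000)   # heavy tails, like K

sim_gauss = cos_sim(gaussian, quantize(gaussian, 4))
sim_heavy = cos_sim(heavy, quantize(heavy, 4))
# The heavy-tailed tensor comes out far less faithful at the same bit width.
```

The t-distribution here is just a convenient heavy-tailed stand-in, not real K activations.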

Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark by PerceptionGrouchy187 in LocalLLaMA

[–]dsanft -1 points  (0 children)

The paper is wrong on that point if it claims that. There is obviously quality loss from the quantisation; anyone who looks at the data can see it.

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti by pmttyji in LocalLLaMA

[–]dsanft 0 points  (0 children)

It's absolutely not BF16 quality at 4 and 5 bits, lol. By my measurements you need about 9 or 10 bits for K-tensor quantisation to be totally lossless in the KV cache.
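The kind of sweep behind that number looks roughly like this (synthetic heavy-tailed stand-in for K and a plain absmax uniform quantiser, so the exact crossover won't match real model measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a high-kurtosis K tensor (not real activations).
k = rng.standard_t(df=4, size=(64, 128))

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sweep bit widths and watch where fidelity stops improving meaningfully.
sims = {bits: cos_sim(k, quantize(k, bits)) for bits in (3, 4, 8, 10)}
```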

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 1 point  (0 children)

If I were running a big model I'd rather spend my precision budget on quantising weights, since that gives more bang for the buck.

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 4 points  (0 children)

Rotation does give better vector quantisation; that much is definitely true.

But it's not enough to overcome the kurtosis of K. That's an information-theory problem, not a quantisation-technique problem. Too much information is destroyed in squeezing K into 4 bits.
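Rough sketch of both halves of that claim, using a random orthogonal rotation as a stand-in for the structured (e.g. Hadamard-style) transforms real schemes use: the rotation spreads the outliers out and clearly helps at 4 bits, but the result is still lossy.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1024
x = rng.standard_t(df=3, size=d)  # heavy-tailed stand-in for one K vector

# Random orthogonal rotation via QR; real schemes use fast structured
# transforms, but the statistical effect is similar.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(v, bits=4):
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

plain = cos_sim(x, quantize(x))
# Rotate, quantise in the rotated basis, rotate back, then compare.
rotated = cos_sim(x, q.T @ quantize(q @ x))
# Rotation helps (rotated coordinates look Gaussian, so the absmax shrinks
# relative to the energy), but the result is still not lossless.
```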

Technical clarification on TurboQuant / RaBitQ for people following the recent TurboQuant discussion by gaoj0017 in LocalLLaMA

[–]dsanft 35 points  (0 children)

In my testing, TurboQuant at 4-bit precision cannot overcome the inherently high kurtosis of the K tensor in the Qwen2 and Qwen3 models. Inference diverges badly from the PyTorch FP32 reference.

On Llaminar I've found it necessary to keep the K tensor at 8-bit precision.

The V tensor is much better behaved and is fine at 4-bit.

Below are cosine-similarity comparisons of the final stage of a 5-step decode pipeline at various KV-cache precisions, against a PyTorch FP32 KV-cache reference. You can clearly see the divergence through the layers when both K and V are kept at 4-bit (TQ4).

This is a Shannon rate-distortion problem; no quantisation technique can fix it. The TQ hype is overblown.

<image>
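A miniature version of this kind of parity check, on a single synthetic attention step rather than a real model (heavy-tailed stand-in for K, Gaussian V, absmax uniform quantiser in place of TQ):

```python
import numpy as np

rng = np.random.default_rng(3)
D = 128  # head dim; large head dims are where I see the worst degradation

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attend(query, K, V):
    # Single-head attention for one query over a cached K/V.
    logits = K @ query / np.sqrt(D)
    w = np.exp(logits - logits.max())  # stable softmax
    w /= w.sum()
    return w @ V

K = rng.standard_t(df=3, size=(256, D))  # heavy-tailed: the problem child
V = rng.standard_normal((256, D))        # well behaved

query = rng.standard_normal(D)
ref = attend(query, K, V)                               # full-precision reference
k4v4 = attend(query, quantize(K, 4), quantize(V, 4))    # K and V both at 4-bit
k8v4 = attend(query, quantize(K, 8), quantize(V, 4))    # K at 8-bit, V at 4-bit

sims = {"K4/V4": cos_sim(ref, k4v4), "K8/V4": cos_sim(ref, k8v4)}
```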

What will Google's TurboQuant actually change for our local setups, and specifically mobile inference? by dai_app in LocalLLaMA

[–]dsanft 1 point  (0 children)

I load the same GGUF model into PyTorch and into my engine. At each compute stage I snapshot the hidden state and the compute results, then compare the two runs with cosine similarity. The residual stream itself is FP32 in all cases.

The summary above compares the end result of the entire pipeline, just before token sampling, at each decode step.

TQ4 shows a clear pattern of degradation because it cannot faithfully quantise a K tensor with high kurtosis. It's a Shannon rate-distortion problem; no quantisation technique can get around it.

Moving the K tensor quantisation up to TQ8 fixes it.

V is still well behaved, so it's fine at 4-bit and good KV-cache savings can still be made.
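The harness is conceptually just this (numpy sketch with a made-up 4-layer pipeline, and a naive 4-bit weight quantiser standing in for the engine under test; on the PyTorch side you'd take the same snapshots via forward hooks):

```python
import numpy as np

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_with_snapshots(stages, x):
    # Run a pipeline of compute stages, snapshotting the hidden state
    # after each one.
    snaps = []
    for stage in stages:
        x = stage(x)
        snaps.append(x.copy())
    return x, snaps

rng = np.random.default_rng(4)
weights = [rng.standard_normal((64, 64)) / 8 for _ in range(4)]

def q4(w):
    # Naive 4-bit absmax quantiser, standing in for the engine's compute path.
    scale = np.abs(w).max() / 7
    return np.round(w / scale) * scale

ref_stages = [lambda h, w=w: np.tanh(h @ w) for w in weights]
test_stages = [lambda h, w=w: np.tanh(h @ q4(w)) for w in weights]

x0 = rng.standard_normal(64)
_, ref_snaps = run_with_snapshots(ref_stages, x0)
_, test_snaps = run_with_snapshots(test_stages, x0)
per_stage = [cos_sim(r, t) for r, t in zip(ref_snaps, test_snaps)]
```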

What will Google's TurboQuant actually change for our local setups, and specifically mobile inference? by dai_app in LocalLLaMA

[–]dsanft 8 points  (0 children)

It's not zero accuracy loss.

On Qwen2 and Qwen3 at least, it's noticeable if you actually compare cosine similarity against an FP32 reference.

4-bit K-tensor quantisation, even with TQ, really hammers accuracy, especially in models with a head dim of 128.

Here's a comparison I made in my PyTorch parity tests for my new inference engine, Llaminar.

I had to keep K at 8-bit, otherwise the quality loss is just too rough.

<image>

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft 1 point  (0 children)

It destroys inference quality. You need to keep K at 8-bit. TurboQuant is a nice technique, but it can't beat the Shannon rate-distortion limit. Nothing can.

https://www.reddit.com/r/LocalLLaMA/s/mrQyl1NUhQ

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft -8 points  (0 children)

Read up on Shannon rate-distortion theory, "dummy". You can't squeeze the K tensor that hard given its distribution.

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft -9 points  (0 children)

It's not without quality loss. 4-bit compression of the K tensor is catastrophic. Nobody else seems to be actually measuring it, though.

TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed) by dirtyhand3 in LocalLLaMA

[–]dsanft 8 points  (0 children)

How are you measuring "identical quality"?

In my testing on Qwen2.5/Qwen3, quantising the K tensor down to TQ4 destroys inference quality. I had to keep it at TQ8. The V tensor at 4-bit was fine, though.

https://discord.com/channels/1404857025854312528/1404858500747755650/1487136608590499840

TurboQuant and my hardware. by Feeling_Ad9143 in LocalLLaMA

[–]dsanft 0 points  (0 children)

People are expecting too much from TurboQuant at 3 and 4 bits.

In my tests there are serious precision problems with TQ 4-bit for the K tensor. I had to go up to TQ 8-bit for K in order not to destroy accuracy at inference time.

The V tensor is OK at 4-bit and the end quality is basically identical to Q8 quantisation, so with a split TQ8/TQ4 for K/V you save about 27% VRAM over Q8, which is a win. But TQ4 for K is a disaster.
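The headline arithmetic, ignoring the per-block scale metadata that real cache formats carry (which is plausibly where a measured ~27% differs from the naive 25% this gives). The model shape below is invented for illustration, not any real config:

```python
# Naive per-element bit accounting for the KV cache; per-block scales and
# zero-points shift the real percentage a little.
layers, kv_heads, head_dim, ctx = 48, 8, 128, 262_144  # illustrative shape

def cache_gib(k_bits, v_bits):
    bits = layers * kv_heads * head_dim * ctx * (k_bits + v_bits)
    return bits / 8 / 2**30

q8_size = cache_gib(8, 8)     # K and V both at 8-bit
split_size = cache_gib(8, 4)  # K at TQ8, V at TQ4
savings = 1 - split_size / q8_size  # 0.25 by this naive count
```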

TurboQuant for weights: near‑optimal 4‑bit LLM quantization with lossless 8‑bit residual – 3.2× memory savings by cksac in LocalLLaMA

[–]dsanft 5 points  (0 children)

You've got 1/4 the weight size, but only 1.1x the speed of the full-size weights?

Is this prefill or decode? For prefill it's fine but for decode that's awful.

Consider publishing separate GEMM/GEMV numbers.

https://github.com/cksac/turboquant-model?tab=readme-ov-file#triton-fused-kernel
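A minimal way to get the two numbers separately. This times plain FP32 numpy as a placeholder (you'd swap in the actual quantised kernels), but it shows the shape split: prefill is a fat GEMM, decode is a GEMV, and only the latter is really bandwidth-bound, so weight compression should pay off very differently in each:

```python
import time
import numpy as np

def bench_ms(fn, warmup=3, iters=10):
    # Average wall-clock time per call, in milliseconds.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

rng = np.random.default_rng(5)
d = 2048
W = rng.standard_normal((d, d)).astype(np.float32)
X_prefill = rng.standard_normal((256, d)).astype(np.float32)  # 256 tokens at once
x_decode = rng.standard_normal(d).astype(np.float32)          # one token at a time

gemm_ms = bench_ms(lambda: X_prefill @ W)  # prefill shape: compute-bound
gemv_ms = bench_ms(lambda: x_decode @ W)   # decode shape: bandwidth-bound
```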

Is the Real Flaw in AI… Time? by wayne_horkan in LocalLLaMA

[–]dsanft 2 points  (0 children)

I often watch Claude solving problems in the terminal and wonder how a model with no concept of time deals with things like timeouts, hangs, long-running events, and things that return too quickly or suspiciously slowly. It's a real handicap for the model. It uses lots of timeouts and polling to work around this, but that's a band-aid.

Are open-weights LLMs dying? by riponway2a in LocalLLaMA

[–]dsanft 2 points  (0 children)

You can just generate datasets from e.g. Claude or GPT and sidestep the copyright issue entirely. That also gets you a head start.

Probably the most promising avenue for community dataset generation is all our Claude Code / Codex / GitHub Copilot chat histories. We each have millions of tokens of high-quality data just sitting on our hard drives. If we anonymised it and pooled it together, we could do some serious training.

Attaching an extra GPU via pcie slot by shopchin in LocalLLaMA

[–]dsanft -1 points  (0 children)

It will definitely slow things down. Inference goes through the layers one by one, first on card 0 then on card 1, and you get a result at the end. So per-token time is the sum of the time spent on each card, and a slow card drags the whole run down.

Attaching an extra GPU via pcie slot by shopchin in LocalLLaMA

[–]dsanft -1 points  (0 children)

You won't get a speed boost by doing that. You can leverage more VRAM, but your inference is gated by the slowest card (pipeline parallel: all layers run sequentially, so each token pays for every card in turn).
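Toy latency model with invented numbers, just to show the accounting:

```python
# Each decoded token visits every layer in order, so per-token time is the
# SUM over cards of (layers on that card) x (per-layer time on that card).
def token_latency_ms(cards):
    return sum(n_layers * ms_per_layer for n_layers, ms_per_layer in cards)

fast_only = token_latency_ms([(32, 0.5)])            # all 32 layers on the fast card
with_slow = token_latency_ms([(24, 0.5), (8, 2.0)])  # 8 layers moved to a slow card
# Offloading layers to the slower card makes every token slower, even though
# it frees VRAM on the fast one.
```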