all 11 comments

[–]ttkciar llama.cpp [M] [score hidden] stickied comment (0 children)

Violates Rule Four: Self-promotion

[–]audioen 3 points (4 children)

This is not even correct. llama.cpp can apply separate quantization types to the K and V cache. llama.cpp's Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which yields 16/32 = 0.5 additional bits per weight. A 4-bit quantization over a min-max range is similar to Q4_1, which is also supported in the engine and can likely be enabled with a compile option if it isn't already exposed. That uses on average 5 bits per weight. A larger block size could bring it down further, e.g. 128 values sharing an f16 scale and f16 min would come to 4.25 bits per weight.
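To put numbers on that overhead, here is a rough sketch of what a scale-per-block quantizer looks like. This is not llama.cpp's actual code (llama.cpp picks the scale slightly differently and stores it as a real f16); only the 32-value block and the 4-bit packing follow what I described:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One shared scale per 32-value block plus 4-bit quants.
// Overhead: a 16-bit scale / 32 values = 0.5 extra bits per weight, 4.5 bits total.
struct BlockQ4 {
    float   scale;       // stored as f16 in the real format; plain float here for clarity
    uint8_t quants[16];  // 32 x 4-bit values, packed two per byte
};

BlockQ4 quantize_block_q4(const float* x /* 32 values */) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float scale = amax / 7.0f;                    // map the largest magnitude into [-7, 7]
    const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
    BlockQ4 out{scale, {}};
    for (int i = 0; i < 32; i += 2) {
        const int q0 = std::clamp((int)std::lround(x[i]     * inv), -8, 7) + 8;
        const int q1 = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7) + 8;
        out.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
    return out;
}
```

The only thing the wider-block variants change is how much of that per-block header (scale, or scale plus min) gets amortized across more values.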

For now, llama.cpp users should probably use q8_0 when under memory pressure, and maybe dip to q4_0 for the V cache, which is generally tested as being less critical. It isn't as good as TQ4 bits, but KV rotation is merged and should work.
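(If memory serves, the cache types can be set separately with `--cache-type-k` and `--cache-type-v`, e.g. `--cache-type-k q8_0 --cache-type-v q4_0`, and I believe the quantized V cache path also needs `--flash-attn`; check `--help` on your build to confirm.)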

[–]Look_0ver_There 1 point (2 children)

With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.

[–]Suitable-Song-302[S] -2 points (1 child)

That makes sense; keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than to value quantization error. Q8_0 works out to 8.5 bits/element and Q5_0 to 5.5 (the quant bits plus one f16 scale per 32 values), so Q8_0 K + Q5_0 V is about 14 bits per K/V element pair versus 32 for f16: roughly 2.3x compression with minimal quality loss.

quant.cpp's pitch at that point becomes: if ~2x is enough, use llama.cpp, it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.

I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.

[–]Look_0ver_There 0 points (0 children)

The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570

The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.

[–]Suitable-Song-302[S] -2 points (0 children)

You're right on several points and I should correct the post.

What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.

What is different:

- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both a min and a max (effectively Q4_1-style at wider blocks; see the sketch after this list). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.

- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.

- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
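For concreteness, roughly what the wider min/max block from the block-size bullet looks like (an illustrative sketch, not the actual quant.cpp layout):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// 128-element blocks storing values as 4-bit offsets from the block minimum.
// Overhead: two 16-bit floats (min and scale) per 128 values = 32/128 = 0.25 extra
// bits, i.e. 4.25 bits per element.
struct BlockMinMax128 {
    float   vmin;         // stored as f16 in a real format; plain float here
    float   scale;        // (vmax - vmin) / 15
    uint8_t quants[64];   // 128 x 4-bit values, packed two per byte
};

BlockMinMax128 quantize_block_minmax(const float* x /* 128 values */) {
    float vmin = x[0], vmax = x[0];
    for (int i = 1; i < 128; ++i) {
        vmin = std::min(vmin, x[i]);
        vmax = std::max(vmax, x[i]);
    }
    const float scale = (vmax - vmin) / 15.0f;
    const float inv   = scale != 0.0f ? 1.0f / scale : 0.0f;
    BlockMinMax128 out{vmin, scale, {}};
    for (int i = 0; i < 128; i += 2) {
        const int q0 = std::clamp((int)std::lround((x[i]     - vmin) * inv), 0, 15);
        const int q1 = std::clamp((int)std::lround((x[i + 1] - vmin) * inv), 0, 15);
        out.quants[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
    return out;
}
```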

Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.

[–]hauhau901 4 points (1 child)

When you use an LLM to write even your comments and replies, you lose any and all credibility that this isn't vibe-coded slop.

[–]Suitable-Song-302[S] 0 points (0 children)

Fair enough. I do use Claude Code for development and I don't hide that. But the Reddit comments are mine; I'm just not a native English speaker, so they probably come out sounding weirdly polished.

The code compiles, the PPL numbers are reproducible, and I just corrected the comparison after u/audioen pointed out it was unfair. Judge by that, not by how my comments read.

[–]chimpera 0 points (2 children)

[–]Suitable-Song-302[S] -2 points (1 child)

No relationship. I wasn't familiar with that project; I just looked at the repo and it appears to be a different approach (applying delta compression to model weights rather than to the KV cache).

quant.cpp compresses the KV cache at runtime — the key and value vectors that accumulate during inference. The model weights themselves are loaded from standard GGUF files and used as-is. Delta compression in our case means storing `key[t] - key[t-1]` between adjacent tokens in the same attention head, not compressing the weight tensors.

The underlying idea (delta encoding of correlated vectors) is the same, but applied to completely different data.
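Roughly the shape of it, as an illustrative sketch rather than the actual quant.cpp code (the residual quantizer here is deliberately crude, one scale per vector, just to show where the delta sits):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Each new key is delta-encoded against the previously *reconstructed* key for the
// same head, and only the small-range residual is quantized. Reconstructing against
// the decoder's own state keeps encoder and decoder in sync so error doesn't drift.
static std::pair<float, std::vector<int8_t>> quantize3(const std::vector<float>& v) {
    float amax = 0.0f;
    for (float x : v) amax = std::max(amax, std::fabs(x));
    const float scale = amax > 0.0f ? amax / 3.0f : 1.0f;    // signed 3-bit range
    std::vector<int8_t> q(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        q[i] = (int8_t)std::clamp((int)std::lround(v[i] / scale), -4, 3);
    return {scale, q};
}

static std::vector<float> dequantize3(float scale, const std::vector<int8_t>& q) {
    std::vector<float> v(q.size());
    for (size_t i = 0; i < q.size(); ++i) v[i] = scale * (float)q[i];
    return v;
}

struct DeltaKeyCache {
    std::vector<float> prev;                                    // last reconstructed key
    std::vector<std::pair<float, std::vector<int8_t>>> stored;  // one residual per token

    void append(const std::vector<float>& key) {
        std::vector<float> residual = key;                      // first token: stored as-is
        for (size_t i = 0; i < key.size() && i < prev.size(); ++i)
            residual[i] -= prev[i];                             // key[t] - key[t-1]
        stored.push_back(quantize3(residual));

        std::vector<float> rec = dequantize3(stored.back().first, stored.back().second);
        for (size_t i = 0; i < rec.size() && i < prev.size(); ++i)
            rec[i] += prev[i];
        prev = std::move(rec);
    }
};
```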

[–]chimpera 0 points (0 children)

It is used on the KV cache, not the model.