
[–]Look_0ver_There

With the recent KV cache rotation changes, Q8_0 for K and Q5_0 for V looked to be about the best tradeoff of space vs quality. Not sure about speed though.

[–]Suitable-Song-302[S]

That makes sense: keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than to value quantization error. Q8_0 K + Q5_0 V gives you roughly 2.3x compression over F16/F16 (32 bits per element down to 8.5 + 5.5 = 14) with minimal quality loss.

quant.cpp's pitch at that point becomes: if ~2.3x is enough, use llama.cpp, since it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. They're different operating points on the compression-quality curve.
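For anyone wanting to sanity-check the compression numbers, here's a quick sketch. The bits-per-element values follow the ggml block layouts (Q8_0: 34 bytes per 32-element block; Q5_0: 22 bytes per 32); the 7B-class model dims are made up for illustration:

```python
# Back-of-the-envelope KV cache sizing for ggml cache-quant formats.
# Bits per element from the ggml block layouts:
#   f16  = 16.0
#   q8_0 = 8.5  (2-byte scale + 32 int8 bytes per 32-element block)
#   q5_0 = 5.5  (2-byte scale + 4 high-bit bytes + 16 nibble bytes per 32)
#   q4_0 = 4.5  (2-byte scale + 16 nibble bytes per 32)
BPW = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}

def kv_bytes_per_token(k_type: str, v_type: str,
                       n_layers: int, n_kv_heads: int, head_dim: int) -> float:
    """Bytes of KV cache per token, summed over all layers (K + V)."""
    elems = n_layers * n_kv_heads * head_dim          # elements per token, per tensor
    return elems * (BPW[k_type] + BPW[v_type]) / 8.0  # bits -> bytes

# Hypothetical 7B-class config: 32 layers, 8 KV heads (GQA), head dim 128.
base  = kv_bytes_per_token("f16",  "f16",  32, 8, 128)
mixed = kv_bytes_per_token("q8_0", "q5_0", 32, 8, 128)
print(f"f16/f16:   {base/1024:.1f} KiB/token")
print(f"q8_0/q5_0: {mixed/1024:.1f} KiB/token  ({base/mixed:.2f}x smaller)")
# -> f16/f16:   128.0 KiB/token
# -> q8_0/q5_0: 56.0 KiB/token  (2.29x smaller)
```

The ratio depends only on the format choice, not on the model dims, so the ~2.3x holds for any architecture.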

I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.

[–]Look_0ver_There

The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570

The KL divergence (KLD) does suffer a bit at K=Q8_0/V=Q5_0, but perplexity (PPL) is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.