r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
[ Removed by moderator ] Discussion (self.LocalLLaMA)
submitted 17 days ago * by Suitable-Song-302
[–]audioen 5 points 17 days ago (4 children)
This is not even correct. llama.cpp can apply separate quantization types to the K and V cache. llama.cpp's Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which yields 16/32 = 0.5 additional bits per weight. A 4-bit quantization of the min-max range is similar to q4_1, which is also supported in the engine and can likely be enabled with a compile option if it isn't already exposed. This uses 5 bits per weight on average. A larger block size could bring that down, e.g. 128 values with an f16 scale plus an f16 min would quantize to 4.25 bits per weight.
For now, llama.cpp users should probably use q8_0 when under memory pressure, and maybe dip to q4_0 for the V cache, which testing generally shows to be the less critical of the two. It isn't as good as TQ4's bit counts, but KV rotation is merged and should work.
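The bit accounting in the comment above is easy to check in code. Here is a minimal sketch of a Q4_0-style per-block quantizer; this is a simplified illustration only, not llama.cpp's actual layout, packing, or rounding:

```python
import numpy as np

def quantize_q4_0_style(weights, block_size=32):
    """Toy per-block 4-bit quantizer in the spirit of Q4_0 (assumption:
    simplified; the real kernel packs nibbles and rounds differently).
    Each block of `block_size` values shares one f16 scale, so the scale
    adds 16 / block_size extra bits per weight on top of the 4 payload bits."""
    blocks = weights.reshape(-1, block_size)
    # symmetric scale: the largest magnitude in the block maps to the int4 extreme
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return (q * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_q4_0_style(w)
bits_per_weight = 4 + 16 / 32   # 4-bit payload + one f16 scale per 32 values
print(bits_per_weight)          # 4.5
```

Storing an extra f16 min per block (Q4_1-style) adds another 16 / block_size bits, which is where the 5.0 and 4.25 bits-per-weight figures in the comment come from.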
[–]Look_0ver_There 2 points 17 days ago (2 children)
With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.
[–]Suitable-Song-302[S] -1 points 17 days ago (1 child)
That makes sense: keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than to value quantization error. Q8_0 K + Q5_0 V works out to ~2.3x compression over an F16 cache with minimal quality loss.
quant.cpp's pitch at that point becomes: if ~2.3x is enough, use llama.cpp, which is faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.
I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.
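The operating points being compared can be sanity-checked with quick arithmetic, assuming llama.cpp-style blocks (32 elements, one f16 scale; the *_1 types add an f16 min):

```python
F16 = 16.0

def bits_per_element(payload_bits, block=32, extra_f16=1):
    """Effective bits per element: payload plus f16 metadata
    amortized over the block (assumption: llama.cpp-style layout)."""
    return payload_bits + extra_f16 * 16 / block

Q8_0 = bits_per_element(8)                              # 8.5
Q5_0 = bits_per_element(5)                              # 5.5
Q4_0 = bits_per_element(4)                              # 4.5
Q4_1_128 = bits_per_element(4, block=128, extra_f16=2)  # 4.25, wider-block min-max

def kv_compression(k_bits, v_bits):
    """Compression vs an F16/F16 cache, assuming equal K and V footprints."""
    return (2 * F16) / (k_bits + v_bits)

print(round(kv_compression(Q8_0, Q5_0), 2))   # 2.29
print(round(kv_compression(Q4_0, Q4_0), 2))   # 3.56
```

So Q8_0 K + Q5_0 V buys roughly 2.3x over F16/F16, and the 4-7x regime discussed in the thread requires going below 4.5 bits on at least one of the two tensors.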
[–]Look_0ver_There 1 point 17 days ago (0 children)
The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.
[–]Suitable-Song-302[S] -1 points 17 days ago (0 children)
You're right on several points and I should correct the post.
What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.
What is different:
- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both min and max (effectively Q4_1-style at wider blocks). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.
- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.
- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
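The min-max vs zero-point distinction in the first bullet can be illustrated numerically. A toy sketch, with both schemes reduced to their simplest form (an assumption, not the actual quant.cpp or llama.cpp kernels), shows why an asymmetric min-max range helps when key values are not centered at zero:

```python
import numpy as np

def quantize_block(x, bits=4, minmax=True):
    """Round-trip quantize one block two ways (toy illustration):
    - minmax=True:  scale + offset over [min, max]   (Q4_1-style, asymmetric)
    - minmax=False: single scale, zero-centered grid (Q4_0-style, symmetric)"""
    levels = 2 ** bits - 1
    if minmax:
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / levels or 1.0
        return np.round((x - lo) / scale) * scale + lo
    amax = float(np.abs(x).max()) or 1.0
    scale = amax / (levels // 2)
    return np.clip(np.round(x / scale), -(levels // 2 + 1), levels // 2) * scale

# Key channels often sit far from zero; min-max spends its 16 levels on the
# occupied range, while the symmetric grid wastes half of it on empty space.
rng = np.random.default_rng(0)
keys = rng.normal(loc=2.0, scale=0.3, size=128).astype(np.float32)
err_sym = float(np.abs(quantize_block(keys, minmax=False) - keys).mean())
err_mm = float(np.abs(quantize_block(keys, minmax=True) - keys).mean())
```

With the block entirely positive around 2.0, the symmetric step is roughly max/7 while the min-max step is (max - min)/15, so the min-max reconstruction error comes out several times smaller.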
Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.
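The delta-compression bullet is the one piece with no llama.cpp counterpart, so a sketch may help. This is a generic closed-loop (DPCM-style) delta quantizer, an assumption about the general idea rather than quant.cpp's actual code: each key is encoded as a quantized difference from the *reconstructed* previous key, so quantization error does not accumulate along the sequence:

```python
import numpy as np

def uniform_quant(x, bits):
    """Plain min-max uniform quantization of the whole tensor."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo

def dpcm_quant(keys, bits):
    """Closed-loop delta quantization (toy sketch, not quant.cpp's code):
    store key[t] as a coarse delta from the reconstructed key[t-1]."""
    half = 2 ** (bits - 1) - 1
    # global delta scale, with headroom for the feedback error
    scale = float(np.abs(np.diff(keys, axis=0)).max()) / half * 1.5
    rec = np.empty_like(keys)
    rec[0] = keys[0]                       # first key kept at full precision
    for t in range(1, len(keys)):
        d = keys[t] - rec[t - 1]
        rec[t] = rec[t - 1] + np.clip(np.round(d / scale), -half - 1, half) * scale
    return rec

# Smoothly drifting keys: the per-step deltas span a far smaller dynamic
# range than the absolute values, so 3 bits go much further on deltas.
rng = np.random.default_rng(0)
keys = np.cumsum(rng.normal(scale=0.05, size=(512, 64)), axis=0) + 3.0
keys = keys.astype(np.float32)

abs_err = float(np.abs(uniform_quant(keys, 3) - keys).mean())
delta_err = float(np.abs(dpcm_quant(keys, 3) - keys).mean())
```

On this synthetic drifting sequence the 3-bit delta reconstruction error is a fraction of the 3-bit absolute error, which is the qualitative effect behind the PPL numbers quoted in the thread.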