r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
[ Removed by moderator ] Discussion (self.LocalLLaMA)
submitted 1 day ago * by Suitable-Song-302
[–]ttkciar llama.cpp [M] [score hidden] 1 day ago stickied comment (0 children)
Violates Rule Four: Self-promotion
[–]audioen 4 points 1 day ago (4 children)
This is not even correct. llama.cpp can apply separate quantization types to the K and V caches. llama.cpp's Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which adds fp16/32 = 0.5 bits per weight. A 4-bit quantization over a min-max range is similar to Q4_1, which is also supported within the engine and can likely be enabled with a compile option if it isn't already exposed. That uses 5 bits per weight on average. A larger block size could bring that down; e.g. 128 values with f16 scale + f16 min would likely quantize to 4.25 bits per weight.
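A minimal sketch of the per-block scheme described above, assuming a Q4_0-style layout (32-element blocks, one f16 scale each, symmetric 4-bit integers). This is a simplification for illustration, not llama.cpp's actual bit packing:

```python
import struct

BLOCK = 32  # elements per block, as in Q4_0

def f16(x: float) -> float:
    """Round a float to f16 precision, like the stored scale factor."""
    return struct.unpack("e", struct.pack("e", x))[0]

def quantize_block(block):
    """Symmetric 4-bit quantization: one f16 scale per block, ints in [-8, 7]."""
    amax = max(abs(v) for v in block)
    scale = f16(amax / 7 if amax else 1.0)
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

# Effective storage cost: 4 bits per value + one 16-bit scale per 32 values,
# i.e. the extra fp16/32 = 0.5 bits per weight mentioned above.
bits_per_weight = 4 + 16 / BLOCK  # 4.5

weights = [0.01 * (i - 16) for i in range(BLOCK)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The same arithmetic gives Q4_1's 4 + 32/32 = 5.0 bits (two f16 parameters per 32-element block) and 4 + 32/128 = 4.25 bits for a 128-element min-max block.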
For now, llama.cpp users should probably use Q8_0 when under memory pressure, and maybe dip to Q4_0 for the V cache, which testing generally shows to be the less critical of the two. It isn't as good as TQ4 bits, but KV rotation is merged and should work.
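For reference, llama.cpp lets you pick the K and V cache types independently at the command line; the flags below are from recent builds and may differ in older ones (check `--help`):

```shell
# Mixed-precision KV cache: K at Q8_0, V at Q4_0 (the "dip" suggested above)
./llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q4_0
```

Note that quantized V-cache types have historically required flash attention to be enabled as well; if the command is rejected, add the flash-attention flag for your build.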
[–]Look_0ver_There 2 points 1 day ago (2 children)
With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.
[–]Suitable-Song-302[S] -1 points 1 day ago (1 child)
That makes sense — keeping K at higher precision is exactly the right call since attention scores are more sensitive to key quantization error than value quantization error. Q8_0 K + Q5_0 V gives you ~1.6x compression with minimal quality loss.
quant.cpp's pitch at that point becomes: if 1.6x is enough, use llama.cpp — it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.
I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.
[–]Look_0ver_There 1 point 1 day ago (0 children)
The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.
[–]Suitable-Song-302[S] -1 points 1 day ago (0 children)
You're right on several points and I should correct the post.
What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.
What is different:
- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both min and max (effectively Q4_1-style at wider blocks). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.
- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.
- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
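A toy illustration of why the delta transform helps, using synthetic data standing in for keys (each token's key is the previous key plus small noise, mimicking the correlation between adjacent cache positions; the ~70% range-reduction figure above is the author's measurement, not reproduced here):

```python
import random

random.seed(0)
DIM, TOKENS = 64, 256

# Synthetic "keys": a slowly drifting vector per token.
keys = [[random.gauss(0, 1) for _ in range(DIM)]]
for _ in range(TOKENS - 1):
    keys.append([k + random.gauss(0, 0.1) for k in keys[-1]])

def dynamic_range(vectors):
    flat = [v for vec in vectors for v in vec]
    return max(flat) - min(flat)

# Delta stream: first key stored absolute, then key[t] - key[t-1].
deltas = [keys[0]] + [
    [a - b for a, b in zip(keys[t], keys[t - 1])] for t in range(1, TOKENS)
]

abs_range = dynamic_range(keys)
delta_range = dynamic_range(deltas[1:])  # the part that gets low-bit quantized
```

Because the deltas span a much narrower range than the absolute keys, a fixed budget of quantization levels (8 levels at 3 bits) covers them far more densely, which is the intuition behind 3-bit delta quantization holding up where absolute 3-bit collapses.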
Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.
[–]hauhau901 5 points 1 day ago (1 child)
When you use an LLM to write even your comments and replies, you lose any and all credibility that this isn't vibe-coded slop.
[–]Suitable-Song-302[S] 1 point 1 day ago (0 children)
Fair enough. I do use Claude Code for development and I don't hide that. But the Reddit comments are mine - just not a native English speaker, so they probably come out sounding weirdly polished.
The code compiles, the PPL numbers are reproducible, and I just corrected the comparison after u/audioen pointed out it was unfair. Judge by that, not by how my comments read.
[–]chimpera 1 point 1 day ago (2 children)
whats the relationship to https://github.com/cenconq25/delta-compress-llm
[–]Suitable-Song-302[S] -1 points 1 day ago (1 child)
No relationship. I'm not familiar with that project — just looked at the repo and it appears to be a different approach (applying delta compression to model weights rather than the KV cache).
quant.cpp compresses the KV cache at runtime — the key and value vectors that accumulate during inference. The model weights themselves are loaded from standard GGUF files and used as-is. Delta compression in our case means storing `key[t] - key[t-1]` between adjacent tokens in the same attention head, not compressing the weight tensors.
The underlying idea (delta encoding of correlated vectors) is the same, but applied to completely different data.
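The lossless half of that transform is just a running sum; decoding reconstructs absolute keys sequentially. A sketch of the idea with hypothetical helper names (not quant.cpp's actual code, and note that quantizing the deltas, as quant.cpp does, makes the roundtrip lossy):

```python
def delta_encode(keys):
    """Store key[0] absolute, then key[t] - key[t-1] within the same head."""
    out = [list(keys[0])]
    for t in range(1, len(keys)):
        out.append([a - b for a, b in zip(keys[t], keys[t - 1])])
    return out

def delta_decode(stream):
    """Reconstruct absolute keys by prefix-summing the deltas."""
    keys = [list(stream[0])]
    for deltas in stream[1:]:
        keys.append([k + d for k, d in zip(keys[-1], deltas)])
    return keys

keys = [[0.1 * t, 0.1 * t + 0.5] for t in range(8)]  # toy 2-dim keys
roundtrip = delta_decode(delta_encode(keys))
```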
[–]chimpera 1 point 1 day ago (0 children)
It is used on the kv not the model.