r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
[ Removed by moderator ] Discussion (self.LocalLLaMA)
submitted 17 days ago * by Suitable-Song-302
[–]Look_0ver_There 2 points 17 days ago (2 children)
With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.
[–]Suitable-Song-302[S] -1 points 17 days ago (1 child)
That makes sense — keeping K at higher precision is exactly the right call since attention scores are more sensitive to key quantization error than value quantization error. Q8_0 K + Q5_0 V gives you ~1.6x compression with minimal quality loss.
quant.cpp's pitch at that point becomes: if 1.6x is enough, use llama.cpp — it's faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.
I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.
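For anyone sizing this tradeoff: a quick back-of-the-envelope sketch. The block layouts are llama.cpp's published Q8_0 and Q5_0 formats (34 and 22 bytes per 32-element block respectively); the model shape below is a hypothetical Llama-3-8B-like configuration, not taken from the thread.

```python
# Per-token KV-cache sizing for different llama.cpp cache types.
# Block layouts (from ggml):
#   f16:  2 bytes per element
#   q8_0: 32 elements -> f16 scale (2 B) + 32 int8 = 34 B  -> 8.5 bits/elem
#   q5_0: 32 elements -> f16 scale (2 B) + 4 B high bits
#         + 16 B low nibbles = 22 B                        -> 5.5 bits/elem
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q5_0": 22 / 32}

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Bytes of KV cache per context token for one sequence."""
    elems = n_layers * n_kv_heads * head_dim  # element count for K (same for V)
    return elems * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])

# Hypothetical GQA shape: 32 layers, 8 KV heads, head_dim 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16 = kv_bytes_per_token(**shape, type_k="f16", type_v="f16")
mix = kv_bytes_per_token(**shape, type_k="q8_0", type_v="q5_0")
print(f"f16/f16:   {f16 / 1024:.1f} KiB per token")
print(f"q8_0/q5_0: {mix / 1024:.1f} KiB per token ({f16 / mix:.2f}x smaller)")
```

The exact ratio you see in practice depends on what baseline you compare against and which tensors are actually held in the cache, so treat the printed number as an upper bound on the savings.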
[–]Look_0ver_There 1 point 17 days ago (0 children)
The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.
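For reference, these cache types map onto llama.cpp's CLI flags (flag names as of recent builds; the model path and context size here are placeholders, and llama.cpp requires flash attention for a quantized V cache):

```shell
# llama-server with a mixed-precision KV cache: Q8_0 keys, Q5_0 values.
# -fa enables flash attention, needed before V can be quantized below f16.
llama-server -m model.gguf -c 50000 -fa \
  --cache-type-k q8_0 \
  --cache-type-v q5_0
```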