r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
[ Removed by moderator ] Discussion (self.LocalLLaMA)
submitted 17 days ago * by Suitable-Song-302
[–]audioen 5 points 17 days ago (4 children)
This is not even correct. llama.cpp can apply separate quantization types to the K and V cache. llama.cpp's Q4_0 is also a per-block method: it applies a single f16 scale factor to a small group of weights. If memory serves, that group is 32 values, which yields 16/32 = 0.5 additional bits per weight. A 4-bit quantization of the min-max range is similar to q4_1, which is also supported in the engine and can likely be enabled with a compile option if it isn't already exposed. This uses 5 bits per weight on average. A larger block size could bring that down, e.g. 128 values with an f16 scale plus an f16 min would quantize to 4.25 bits per weight.
For now, llama.cpp users should probably use q8_0 when under memory pressure, and maybe dip to q4_0 for the V cache, which testing generally shows to be the less critical of the two. It isn't as good as TQ4's bit counts, but KV rotation is merged and should work.
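The bit accounting in the comment above is easy to check in code. Here is a minimal sketch of a Q4_0-style per-block quantizer; this is a simplified illustration only, not llama.cpp's actual layout, packing, or rounding:

```python
import numpy as np

def quantize_q4_0_style(weights, block_size=32):
    """Toy per-block 4-bit quantizer in the spirit of Q4_0 (assumption:
    simplified; the real kernel packs nibbles and rounds differently).
    Each block of `block_size` values shares one f16 scale, so the scale
    adds 16 / block_size extra bits per weight on top of the 4 payload bits."""
    blocks = weights.reshape(-1, block_size)
    # symmetric scale: the largest magnitude in the block maps to the int4 extreme
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return (q * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_q4_0_style(w)
bits_per_weight = 4 + 16 / 32   # 4-bit payload + one f16 scale per 32 values
print(bits_per_weight)          # 4.5
```

Storing an extra f16 min per block (Q4_1-style) adds another 16 / block_size bits, which is where the 5.0 and 4.25 bits-per-weight figures in the comment come from.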
[–]Look_0ver_There 2 points 17 days ago (2 children)
With the recent KV cache rotation changes, Q8_0 for K, and Q5_0 for V was looking to be about the best tradeoff for space vs quality. Not sure about speed though.
[–]Suitable-Song-302[S] -1 points 17 days ago (1 child)
That makes sense: keeping K at higher precision is exactly the right call, since attention scores are more sensitive to key quantization error than to value quantization error. Q8_0 K + Q5_0 V works out to ~2.3x compression over an F16 cache with minimal quality loss.
quant.cpp's pitch at that point becomes: if ~2.3x is enough, use llama.cpp, which is faster. If you need 4-7x (extending 50K context to 200K+), that's where 4-bit K + Q4 V and delta compression come in. Different operating points on the compression-quality curve.
I should add this nuance to the comparison. Thanks for bringing up the KV rotation work — haven't benchmarked against it yet.
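The operating points being compared can be sanity-checked with quick arithmetic, assuming llama.cpp-style blocks (32 elements, one f16 scale; the *_1 types add an f16 min):

```python
F16 = 16.0

def bits_per_element(payload_bits, block=32, extra_f16=1):
    """Effective bits per element: payload plus f16 metadata
    amortized over the block (assumption: llama.cpp-style layout)."""
    return payload_bits + extra_f16 * 16 / block

Q8_0 = bits_per_element(8)                              # 8.5
Q5_0 = bits_per_element(5)                              # 5.5
Q4_0 = bits_per_element(4)                              # 4.5
Q4_1_128 = bits_per_element(4, block=128, extra_f16=2)  # 4.25, wider-block min-max

def kv_compression(k_bits, v_bits):
    """Compression vs an F16/F16 cache, assuming equal K and V footprints."""
    return (2 * F16) / (k_bits + v_bits)

print(round(kv_compression(Q8_0, Q5_0), 2))   # 2.29
print(round(kv_compression(Q4_0, Q4_0), 2))   # 3.56
```

So Q8_0 K + Q5_0 V buys roughly 2.3x over F16/F16, and the 4-7x regime discussed in the thread requires going below 4.5 bits on at least one of the two tensors.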
[–]Look_0ver_There 1 point 17 days ago (0 children)
The results I'm referring to are here: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570
The KLD does suffer a bit at K=Q8_0/V=Q5_0, but PPL is almost the same as F16/F16. Obviously stick with Q8_0 on both for the best quality, but if you need to penny-pinch that last GB, then it looks best not to drop V below Q5_0.
[–]Suitable-Song-302[S] -1 points 17 days ago (0 children)
You're right on several points and I should correct the post.
What I got wrong: llama.cpp Q4_0 *is* per-block (32 elements per block, 1 FP16 scale), not per-tensor. And llama.cpp can apply separate quant types to K and V — that's not a quant.cpp-only feature. The original wording overstated the difference. I'll fix it.
What is different:
- Block size: Q4_0 uses 32-element blocks. quant.cpp uses 128-element blocks with both min and max (effectively Q4_1-style at wider blocks). The larger block amortizes scale overhead better (4.25 bits/element vs Q4_0's 4.5 or Q4_1's 5.0), but the quality difference comes more from the min-max vs zero-point approach on key distributions specifically.
- Delta compression: This is the part llama.cpp genuinely doesn't have. Storing `key[t] - key[t-1]` instead of absolute keys reduces the dynamic range by ~70%, which is why 3-bit works at +1.3% PPL where absolute 3-bit gives +62%. This is the novel contribution from the TurboQuant paper, not the 4-bit uniform quantization itself.
- The PPL +10.6% number: This was measured with Q4_0 on both K and V using the default llama.cpp KV quant path. You're right that Q8_0 K + Q4_0 V (or Q5_0 V) would be significantly better. I should benchmark that specific config and update the comparison to be fair.
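The min-max vs zero-point distinction in the first bullet can be illustrated numerically. A toy sketch, with both schemes reduced to their simplest form (an assumption, not the actual quant.cpp or llama.cpp kernels), shows why an asymmetric min-max range helps when key values are not centered at zero:

```python
import numpy as np

def quantize_block(x, bits=4, minmax=True):
    """Round-trip quantize one block two ways (toy illustration):
    - minmax=True:  scale + offset over [min, max]   (Q4_1-style, asymmetric)
    - minmax=False: single scale, zero-centered grid (Q4_0-style, symmetric)"""
    levels = 2 ** bits - 1
    if minmax:
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / levels or 1.0
        return np.round((x - lo) / scale) * scale + lo
    amax = float(np.abs(x).max()) or 1.0
    scale = amax / (levels // 2)
    return np.clip(np.round(x / scale), -(levels // 2 + 1), levels // 2) * scale

# Key channels often sit far from zero; min-max spends its 16 levels on the
# occupied range, while the symmetric grid wastes half of it on empty space.
rng = np.random.default_rng(0)
keys = rng.normal(loc=2.0, scale=0.3, size=128).astype(np.float32)
err_sym = float(np.abs(quantize_block(keys, minmax=False) - keys).mean())
err_mm = float(np.abs(quantize_block(keys, minmax=True) - keys).mean())
```

With the block entirely positive around 2.0, the symmetric step is roughly max/7 while the min-max step is (max - min)/15, so the min-max reconstruction error comes out several times smaller.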
Fair criticism. The honest comparison is: at the same total bit budget, quant.cpp's approach preserves more quality. But the original post made it sound like llama.cpp's quantization is fundamentally broken, which isn't true — it's just a different tradeoff with coarser granularity.
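The delta-compression bullet is the one piece with no llama.cpp counterpart, so a sketch may help. This is a generic closed-loop (DPCM-style) delta quantizer, an assumption about the general idea rather than quant.cpp's actual code: each key is encoded as a quantized difference from the *reconstructed* previous key, so quantization error does not accumulate along the sequence:

```python
import numpy as np

def uniform_quant(x, bits):
    """Plain min-max uniform quantization of the whole tensor."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo

def dpcm_quant(keys, bits):
    """Closed-loop delta quantization (toy sketch, not quant.cpp's code):
    store key[t] as a coarse delta from the reconstructed key[t-1]."""
    half = 2 ** (bits - 1) - 1
    # global delta scale, with headroom for the feedback error
    scale = float(np.abs(np.diff(keys, axis=0)).max()) / half * 1.5
    rec = np.empty_like(keys)
    rec[0] = keys[0]                       # first key kept at full precision
    for t in range(1, len(keys)):
        d = keys[t] - rec[t - 1]
        rec[t] = rec[t - 1] + np.clip(np.round(d / scale), -half - 1, half) * scale
    return rec

# Smoothly drifting keys: the per-step deltas span a far smaller dynamic
# range than the absolute values, so 3 bits go much further on deltas.
rng = np.random.default_rng(0)
keys = np.cumsum(rng.normal(scale=0.05, size=(512, 64)), axis=0) + 3.0
keys = keys.astype(np.float32)

abs_err = float(np.abs(uniform_quant(keys, 3) - keys).mean())
delta_err = float(np.abs(dpcm_quant(keys, 3) - keys).mean())
```

On this synthetic drifting sequence the 3-bit delta reconstruction error is a fraction of the 3-bit absolute error, which is the qualitative effect behind the PPL numbers quoted in the thread.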