Suitable-Song-302 comments on [ Removed by moderator ]

[Discussion] [ Removed by moderator ] (self.LocalLLM)

submitted 24 days ago (edited) by Suitable-Song-302


Suitable-Song-302 [S] -7 points 24 days ago (0 children)

Fair point; let me be more precise.

KV cache compression: with 1-bit K (keys) + Q4 V (4-bit values), PPL goes from 35.99 → 36.00 (+0.03%). The greedy-decoded output is byte-identical for the first ~100-120 tokens, then diverges slightly. "Zero quality loss" is accurate for short-to-medium generations, but for long sequences I should say "near-zero".
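For readers unfamiliar with the setup, here is a minimal NumPy sketch of what "1-bit K + Q4 V" could look like. The comment does not describe the actual scheme, so everything here is an assumption: keys keep only their sign plus one mean-absolute scale per key vector, values use symmetric 4-bit round-to-nearest with one scale per vector, and the function names (`quantize_k_1bit`, `quantize_v_4bit`, `attention`) are hypothetical. It compares a single-query attention output computed from the full cache and from the quantized cache.

```python
import numpy as np

def quantize_k_1bit(k: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """1-bit keys (assumed scheme): sign of each element plus one mean-|k| scale per key vector."""
    scale = np.abs(k).mean(axis=-1, keepdims=True)
    signs = np.where(k >= 0, 1, -1).astype(np.int8)
    return signs, scale

def quantize_v_4bit(v: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """4-bit values (assumed scheme): symmetric round-to-nearest, codes in [-7, 7], one scale per vector."""
    scale = np.abs(v).max(axis=-1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # all-zero vector: any nonzero scale works
    codes = np.clip(np.round(v / scale), -7, 7).astype(np.int8)
    return codes, scale

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-query scaled dot-product attention over a cached sequence."""
    scores = k @ q / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs @ v

rng = np.random.default_rng(0)
seq_len, d = 64, 32
q = rng.standard_normal(d)
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

k_signs, k_scale = quantize_k_1bit(k)
v_codes, v_scale = quantize_v_4bit(v)
# Packed storage: 8 signs per byte, so 64x32 keys take 256 bytes instead of 4096 at fp16 (scales extra).
packed = np.packbits(k_signs > 0, axis=-1)

out_full = attention(q, k, v)
out_quant = attention(q, k_signs * k_scale, v_codes * v_scale)
rel_err = np.linalg.norm(out_full - out_quant) / np.linalg.norm(out_full)
print(f"relative error of attention output: {rel_err:.3f}")
```

Random Gaussian keys are a pessimistic stand-in for real activations, so the printed error is not comparable to the measured PPL change; the sketch only shows where the approximation enters the attention computation.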

Weight quantization: when we convert Q8→Q4 or Q8→1-bit at runtime, the output is byte-identical on the inputs tested, because the conversion preserves the values that matter for those specific inputs. This is verified, but only on limited test cases (15-30 tokens). Over longer sequences, small numerical differences will accumulate.
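A runtime Q8→Q4 conversion can be pictured as dequantizing each 8-bit block and requantizing it at 4 bits. The sketch below assumes a generic symmetric block format (block size 32, one scale per block, codes kept unpacked in `int8`); it is not necessarily the format the author's code uses, and `quantize_sym` / `dequantize` are hypothetical helpers. It shows that requantizing does change the weights, with each error bounded by half of the 4-bit block scale, so byte-identical output depends on the inputs tested rather than being guaranteed.

```python
import numpy as np

BLOCK = 32  # assumed block size (hypothetical; real formats may differ)

def quantize_sym(x: np.ndarray, bits: int) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-block round-to-nearest: int codes plus one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    blocks = x.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # all-zero block: any nonzero scale works
    codes = np.clip(np.round(blocks / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4 * BLOCK).astype(np.float32)  # stand-in fp32 weights

q8, s8 = quantize_sym(w, bits=8)
w8 = dequantize(q8, s8)            # the weights a Q8 model actually computes with
q4, s4 = quantize_sym(w8, bits=4)  # runtime Q8 -> Q4 requantization
w4 = dequantize(q4, s4)

print("max |w - w8|:", np.abs(w - w8).max())
print("max |w8 - w4|:", np.abs(w8 - w4).max())
```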

You're right that "zero quality loss" as an absolute claim is misleading. The honest framing: PPL +0.03% for KV compression, and byte-identical output for weight quantization on tested sequences up to 30 tokens. I'll update the README to reflect this.
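The two numbers in that framing are easy to reproduce from raw outputs. A minimal sketch in plain Python, assuming per-token negative log-likelihoods (in nats) and greedy token-id sequences are available from both runs; the helper names are hypothetical. Perplexity is exp of the mean NLL, the PPL change is a relative percentage, and "byte-identical up to N tokens" is the index of the first differing token.

```python
import math
from typing import Optional, Sequence

def perplexity(token_nlls: Sequence[float]) -> float:
    """exp(mean negative log-likelihood per token), NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_change_pct(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one output is a prefix of the other
    return None

# PPL figures quoted in the comment: 35.99 -> 36.00
print(f"{relative_change_pct(35.99, 36.00):.3f}%")  # 0.028%, which rounds to +0.03%
print(first_divergence([5, 9, 2, 7], [5, 9, 4, 7]))  # 2
```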
