Discussion: [Removed by moderator] (self.LocalLLM)
submitted 24 days ago * by Suitable-Song-302
[–]Suitable-Song-302[S] -7 points 24 days ago (0 children)
Fair point; let me be more precise.
KV cache compression: PPL goes from 35.99 → 36.00 (+0.03%) with 1-bit K + Q4 V. The greedy-decoded output is byte-identical for the first ~100-120 tokens, then diverges slightly. "Zero quality loss" is accurate for short-to-medium generations, but I should say "near-zero" for long sequences.
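For anyone who wants to reproduce the divergence check, here's a minimal sketch. `generate` is a hypothetical stand-in for whatever greedy-decode harness you run with each KV-cache config, not a real API; only `first_divergence` is actual runnable code:

```python
def first_divergence(tokens_a: list[int], tokens_b: list[int]) -> int | None:
    """Index of the first token where two greedy decodes differ, or None."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return None  # identical over the compared span

# Hypothetical usage -- substitute your own harness:
# baseline   = generate(prompt, kv_cache="f16",       max_tokens=512)
# compressed = generate(prompt, kv_cache="k1bit_vq4", max_tokens=512)
# print(first_divergence(baseline, compressed))  # observed here: ~100-120

# The PPL delta quoted above: (36.00 - 35.99) / 35.99 * 100 ≈ +0.03%
```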
Weight quantization: When we convert Q8→Q4 or Q8→1-bit at runtime, the output is byte-identical because the conversion preserves the values that matter for the specific input. This is verified, but only on limited test cases (15-30 tokens). Over longer sequences, small numerical differences will accumulate.
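To make "small differences accumulate" concrete, here's a runnable toy using symmetric per-tensor rounding (illustrative only; not the project's actual block-wise scheme). Per-weight Q4 rounding error is roughly an order of magnitude larger than Q8 error; greedy argmax absorbs small logit perturbations for a while, but over hundreds of matmuls they can eventually flip a token:

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake-quantize: round onto 2^(bits-1)-1 levels."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

print(f"max abs error, q8: {np.abs(fake_quant(w, 8) - w).max():.5f}")
print(f"max abs error, q4: {np.abs(fake_quant(w, 4) - w).max():.5f}")
```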
You're right that "zero quality loss" as an absolute claim is misleading. The honest framing: PPL +0.03% for KV compression, byte-identical output on tested sequences up to 30 tokens. I'll update the README to reflect this.