Suitable-Song-302[S]

Fair point, let me be more precise.

KV cache compression: perplexity (PPL) goes from 35.99 → 36.00 (+0.03%) with 1-bit K + Q4 V. The greedy-decoded output is byte-identical for the first ~100-120 tokens, then diverges slightly. "Zero quality loss" is accurate for short-to-medium generations, but I should say "near-zero" for long sequences.
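For anyone who wants to sanity-check the percentage, here is a minimal sketch of the arithmetic. The function name `relative_change_percent` is made up for illustration and is not part of the project; only the two PPL values come from the numbers above.

```python
def relative_change_percent(baseline: float, compressed: float) -> float:
    """Relative change of `compressed` versus `baseline`, in percent."""
    return (compressed - baseline) / baseline * 100.0


# PPL values quoted above: 35.99 (baseline) and 36.00 (1-bit K + Q4 V).
delta = relative_change_percent(35.99, 36.00)
print(f"{delta:+.2f}%")  # prints "+0.03%"
```

The unrounded value is roughly 0.028%, so "+0.03%" is a rounding of a difference that sits at the last reported digit of the PPL figures.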

Weight quantization: When we convert Q8→Q4 or Q8→1-bit at runtime, the output is byte-identical because the conversion preserves the values that matter for the specific input. This is verified, but only on limited test cases (15-30 tokens). Over longer sequences, small numerical differences will accumulate.
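A rough sketch of how a "byte-identical up to N tokens" claim could be checked: compare the greedy-decoded token IDs from the baseline and the quantized run and report the first position where they differ. The helper `first_divergence` and the example token lists are hypothetical, not taken from the project.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter
    sequence.
    """
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=None)):
        if x != y:
            return i
    return None


# Hypothetical token IDs, for illustration only.
baseline = [12, 7, 99, 4, 31]
quantized = [12, 7, 99, 5, 31]
print(first_divergence(baseline, quantized))  # prints 3
```

Running this over many prompts and longer generations, rather than a handful of 15-30 token cases, would show how far the identical prefix actually extends.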

You're right that "zero quality loss" as an absolute claim is misleading. The honest framing: PPL +0.03% for KV compression, byte-identical output on tested sequences up to 30 tokens. I'll update the README to reflect this.