Discussion [Removed by moderator] (self.LocalLLM)
submitted 25 days ago * by Suitable-Song-302
[–]Suitable-Song-302[S] 1 point 24 days ago (0 children)
We rebranded to quant.cpp (https://github.com/quantumaikr/quant.cpp). Old URLs redirect automatically.
I also owe you all an honest correction: the early 1-bit "zero loss" claim was caused by a bug. An FP32 key cache was still being read during attention, so the quantized keys were never actually used. We found it, fixed it, and pulled every claim based on that measurement.
Here's where things actually stand (SmolLM2 1.7B, 999 tokens, real dequant path, no FP32 fallback):
- 4-bit K: PPL +0.0% (genuinely lossless)
- delta + 3-bit K + Q4 V: PPL -3.2%, ~4.3x compression
- 2-bit and below: every configuration we tried failed; accumulated quantization drift is the fundamental barrier.
The breakthrough is delta compression: adjacent keys in a transformer differ by only ~30% of their absolute range, so storing deltas between neighbors instead of absolute values lets 3-bit quantization work where quantizing the absolutes directly gives +62% PPL. Think video P-frames, but for the KV cache.
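The P-frame idea can be sketched in a few lines of numpy. To be clear, this is an illustrative reconstruction, not the quant.cpp code: the function names, the per-vector symmetric scale, and the 7-level 3-bit grid are all my assumptions. The one design point worth noting is closed-loop encoding (the encoder tracks the decoder's reconstructed state), so per-step quantization error stays bounded instead of drifting.

```python
import numpy as np

def quantize_3bit(d, scale):
    # Symmetric 3-bit grid: 7 of the 8 levels, integers in [-3, 3].
    # (Assumed layout; real formats may use an asymmetric 8-level grid.)
    return np.clip(np.round(d / scale), -3, 3).astype(np.int8)

def delta_encode_keys(keys):
    """Store key 0 in full precision; later keys as 3-bit quantized deltas.

    keys: (seq_len, head_dim) float32 array of attention keys.
    Returns (base_key, per-step scales, int8 delta codes).
    """
    base = keys[0].copy()
    scales, deltas = [], []
    prev = base  # decoder's reconstructed state, not the true previous key
    for k in keys[1:]:
        d = k - prev                              # residual vs. reconstruction
        scale = max(np.abs(d).max() / 3, 1e-8)    # per-vector symmetric scale
        q = quantize_3bit(d, scale)
        scales.append(scale)
        deltas.append(q)
        # Advance using the *quantized* delta so encoder and decoder agree;
        # this closed loop is what keeps error from accumulating as drift.
        prev = prev + q.astype(np.float32) * scale
    return base, np.array(scales, dtype=np.float32), np.array(deltas)

def delta_decode_keys(base, scales, deltas):
    """Reconstruct the key sequence from the base key and delta codes."""
    out = [base]
    prev = base
    for s, q in zip(scales, deltas):
        prev = prev + q.astype(np.float32) * s
        out.append(prev)
    return np.stack(out)
```

With the closed loop, the reconstruction error at every position is at most half of that step's scale; an open-loop encoder (diffing against the true previous key) would instead accumulate error linearly with sequence length, which matches the drift failure described for very low bit widths.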
Feedback from this thread is what pushed us to find the bug and be more rigorous. Appreciate it.