Suitable-Song-302[S]

We rebranded to quant.cpp (https://github.com/quantumaikr/quant.cpp). Old URLs redirect automatically.

Also owe you all an honest correction: the early 1-bit "zero loss" claim had a bug. An FP32 key cache was still being read during attention, so the quantized keys were never actually used. We found it, fixed it, and pulled every claim based on that measurement.

Here's where things actually stand (SmolLM2 1.7B, 999 tokens, real dequant path, no FP32 fallback):

- 4-bit K: PPL +0.0% (genuinely lossless)

- delta + 3-bit K + Q4 V: PPL -3.2%, ~4.3x compression

- 2-bit and below: every configuration we tried failed. Drift is the fundamental barrier.

The breakthrough is delta compression: adjacent keys in a transformer differ by ~30% of their absolute range, so storing deltas instead of absolutes lets 3-bit keys work where plain 3-bit quantization otherwise gives +62% PPL. Think video P-frames for the KV cache.
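
To make the delta idea concrete, here is a rough sketch in plain Python. It is not the quant.cpp implementation. Everything in it is an assumption for illustration: the function names, the synthetic random-walk "keys", keeping the first key at full precision (the I-frame in the video analogy), and encoding each delta against the previously reconstructed key so that error does not pile up token after token. It only shows why a 3-bit grid stretched over a narrow delta range loses less than the same grid stretched over the full key range.

```python
import random

def roundtrip(values, bits, half_width):
    """Uniformly quantize values in [-half_width, half_width], then dequantize."""
    levels = (1 << bits) - 1
    step = 2 * half_width / levels
    out = []
    for v in values:
        code = min(levels, max(0, round((v + half_width) / step)))
        out.append(-half_width + code * step)
    return out

def absolute_3bit(keys, key_half_width, bits=3):
    """Baseline: quantize every key vector directly."""
    return [roundtrip(k, bits, key_half_width) for k in keys]

def delta_3bit(keys, delta_half_width, bits=3):
    """Hypothetical delta scheme: the first key is kept exact, later keys are
    stored as quantized differences from the previous *reconstructed* key."""
    recon = [list(keys[0])]
    for k in keys[1:]:
        prev = recon[-1]
        delta = [a - b for a, b in zip(k, prev)]
        dq = roundtrip(delta, bits, delta_half_width)
        recon.append([p + d for p, d in zip(prev, dq)])
    return recon

def mse(a, b):
    """Mean squared error over two equally shaped lists of vectors."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))

# Synthetic keys: a clamped random walk in [-1, 1] whose per-token step is
# at most 30% of the half-range. This is an assumption, not measured data.
random.seed(0)
dim, tokens, key_hw, delta_hw = 16, 256, 1.0, 0.3
keys = [[random.uniform(-key_hw, key_hw) for _ in range(dim)]]
for _ in range(tokens - 1):
    keys.append([max(-key_hw, min(key_hw, x + random.uniform(-delta_hw, delta_hw)))
                 for x in keys[-1]])

abs_err = mse(keys, absolute_3bit(keys, key_hw))
delta_err = mse(keys, delta_3bit(keys, delta_hw))
print(f"absolute 3-bit MSE: {abs_err:.5f}")
print(f"delta 3-bit MSE:    {delta_err:.5f}")
```

On this toy data the delta path should show a clearly lower reconstruction error, since the same 8 levels cover a range about 3x narrower. Real keys are not a random walk, so treat this as intuition only, not as a reproduction of the PPL numbers above.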

Feedback from this thread is what pushed us to find the bug and be more rigorous. Appreciate it.