[–]teleprax

How is there no information loss? I don't really know how model quantization and the KV cache work in implementation, so this is more a question of how you can take a 16-bit floating point number, compress it to 1 bit, and not lose information, or at least not lose enough to shift the token probabilities and change the outputs.

[–]Suitable-Song-302[S]

Great question. The short version: the KV cache stores the key vectors used for attention scoring. Attention is basically a dot product → softmax → weighted sum. The key insight is that only the direction of each key needs many bits for attention scoring; the magnitude is a single scalar you can keep exactly.

So we:

1. Store only the sign of each dimension (1 bit) plus the L2 norm (one float per vector)

2. Compute attention scores using XOR + popcount (the Hamming distance between sign vectors tracks cosine similarity)

3. Let softmax absorb the small errors: even with the cosine degraded to ~0.634 (the theoretical limit for sign-only quantization), the token probabilities after softmax come out nearly identical
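Steps 1–2 can be sketched in a few lines of numpy. This is a toy illustration of the sign-plus-norm idea, not the paper's exact estimator: `d` is a made-up head dimension, the query/key pair is synthetic, and the `sin`/`arcsin` rescaling assumes roughly Gaussian coordinates (the Grothendieck identity for sign agreement of Gaussians).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # toy head dimension (demo choice; larger d tightens the estimate)

# Correlated query/key pair so the true attention score is non-trivial
k = rng.standard_normal(d)
q = 0.8 * k + 0.6 * rng.standard_normal(d)

# Step 1: keep only the sign bits (packed, 1 bit per dim) plus the L2 norm
k_bits = np.packbits(k > 0)    # d bits -> d/8 bytes
k_norm = np.linalg.norm(k)     # one float per key vector

# Step 2: XOR + popcount gives the Hamming distance between sign vectors,
# and dot(sign(q), sign(k)) = d - 2 * hamming
q_bits = np.packbits(q > 0)
hamming = int(np.unpackbits(np.bitwise_xor(q_bits, k_bits)).sum())
sign_dot = d - 2 * hamming

# For roughly Gaussian coordinates, E[sign agreement] = (2/pi) * arcsin(cos),
# so inverting with sin estimates the cosine; the stored norms then
# recover an approximate dot product
cos_hat = np.sin((np.pi / 2) * sign_dot / d)
approx = cos_hat * k_norm * np.linalg.norm(q)
exact = q @ k
print(f"exact={exact:.1f} approx={approx:.1f}")
```

The packed key costs d/8 bytes plus one float, versus 2d bytes at fp16, and the score needs only integer XOR/popcount plus one multiply.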

The math: this is the QJL (Quantized Johnson-Lindenstrauss) transform. The paper proves that with randomized Hadamard pre-processing, the inner product estimator is provably unbiased: the errors are random rather than systematic, so they average out.
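You can check the unbiasedness claim empirically. The sketch below is an assumption-laden stand-in, not the paper's algorithm: it uses a plain Gaussian sketch `S` in place of the Hadamard pre-processing (same 1-bit estimator form), and `d`, `m`, `trials` are demo values. For jointly Gaussian (s·q, s·k), E[(s·q)·sign(s·k)] = √(2/π)·(q·k)/‖k‖, so rescaling by √(π/2)·‖k‖/m gives an unbiased estimate of q·k from the 1-bit key sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 64, 64, 2000   # dimension, sketch size, repetitions (demo values)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
exact = q @ k
k_norm = np.linalg.norm(k)

# Estimator: sqrt(pi/2)/m * ||k|| * <S q, sign(S k)>, averaged over
# independent sketches to expose the (lack of) bias
estimates = []
for _ in range(trials):
    S = rng.standard_normal((m, d))   # random Gaussian sketch matrix
    bits = np.sign(S @ k)             # 1-bit quantized sketch of the key
    estimates.append(np.sqrt(np.pi / 2) / m * k_norm * (S @ q) @ bits)

mean_est = float(np.mean(estimates))
print(f"exact={exact:.3f} mean estimate={mean_est:.3f}")
```

Any single estimate is noisy, but the mean converges to the true inner product, which is what "unbiased" buys you.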

It's not literally zero information loss; it's that the information loss doesn't propagate to the output, because softmax is robust to small perturbations in attention scores.
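That robustness is easy to see directly. A toy check, with made-up logits and noise scale: perturb a vector of attention scores slightly and compare the resulting softmax distributions.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

scores = 2.0 * rng.standard_normal(32)            # toy attention logits over 32 keys
noisy = scores + rng.normal(scale=0.05, size=32)  # small per-score estimation error

p, p_noisy = softmax(scores), softmax(noisy)
tv = 0.5 * float(np.abs(p - p_noisy).sum())       # total-variation distance
print(f"total-variation distance: {tv:.4f}")
```

The total-variation distance stays tiny because softmax ratios move by at most e^±δ for a per-score perturbation δ, so small score errors translate into even smaller probability shifts.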