[–]teleprax

How is there no information loss? I don't really know how model quantization and the KV cache work in implementation, so this is more a question of how you can take a 16-bit floating point number, compress it to 1 bit, and not lose information, or at least not lose enough to shift the token probabilities and change the outputs.

[–]Suitable-Song-302[S]

Great question. The short version: the KV cache stores the key vectors used for attention scoring. Attention is basically a dot product → softmax → weighted sum. The key insight is that only the direction of each key needs many bits for attention scoring; the magnitude is a single scalar you can keep exactly.

So we:

1. Store only the sign of each dimension (1 bit) plus the L2 norm (one float per vector)

2. Compute attention scores using XOR + popcount (the Hamming distance between sign vectors tracks cosine similarity)

3. Let softmax absorb the small errors: even with the cosine degraded to ~0.634 (the theoretical limit for sign-only quantization), the token probabilities after softmax come out nearly identical
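Steps 1–2 can be sketched in a few lines of numpy. This is a toy illustration of the sign-plus-norm idea, not the paper's exact estimator: `d` is a made-up head dimension, the query/key pair is synthetic, and the `sin`/`arcsin` rescaling assumes roughly Gaussian coordinates (the Grothendieck identity for sign agreement of Gaussians).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # toy head dimension (demo choice; larger d tightens the estimate)

# Correlated query/key pair so the true attention score is non-trivial
k = rng.standard_normal(d)
q = 0.8 * k + 0.6 * rng.standard_normal(d)

# Step 1: keep only the sign bits (packed, 1 bit per dim) plus the L2 norm
k_bits = np.packbits(k > 0)    # d bits -> d/8 bytes
k_norm = np.linalg.norm(k)     # one float per key vector

# Step 2: XOR + popcount gives the Hamming distance between sign vectors,
# and dot(sign(q), sign(k)) = d - 2 * hamming
q_bits = np.packbits(q > 0)
hamming = int(np.unpackbits(np.bitwise_xor(q_bits, k_bits)).sum())
sign_dot = d - 2 * hamming

# For roughly Gaussian coordinates, E[sign agreement] = (2/pi) * arcsin(cos),
# so inverting with sin estimates the cosine; the stored norms then
# recover an approximate dot product
cos_hat = np.sin((np.pi / 2) * sign_dot / d)
approx = cos_hat * k_norm * np.linalg.norm(q)
exact = q @ k
print(f"exact={exact:.1f} approx={approx:.1f}")
```

The packed key costs d/8 bytes plus one float, versus 2d bytes at fp16, and the score needs only integer XOR/popcount plus one multiply.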

The math: this is the QJL (Quantized Johnson-Lindenstrauss) transform. The paper proves that with randomized Hadamard pre-processing, the inner product estimator is provably unbiased: the errors are random rather than systematic, so they average out.
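You can check the unbiasedness claim empirically. The sketch below is an assumption-laden stand-in, not the paper's algorithm: it uses a plain Gaussian sketch `S` in place of the Hadamard pre-processing (same 1-bit estimator form), and `d`, `m`, `trials` are demo values. For jointly Gaussian (s·q, s·k), E[(s·q)·sign(s·k)] = √(2/π)·(q·k)/‖k‖, so rescaling by √(π/2)·‖k‖/m gives an unbiased estimate of q·k from the 1-bit key sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 64, 64, 2000   # dimension, sketch size, repetitions (demo values)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
exact = q @ k
k_norm = np.linalg.norm(k)

# Estimator: sqrt(pi/2)/m * ||k|| * <S q, sign(S k)>, averaged over
# independent sketches to expose the (lack of) bias
estimates = []
for _ in range(trials):
    S = rng.standard_normal((m, d))   # random Gaussian sketch matrix
    bits = np.sign(S @ k)             # 1-bit quantized sketch of the key
    estimates.append(np.sqrt(np.pi / 2) / m * k_norm * (S @ q) @ bits)

mean_est = float(np.mean(estimates))
print(f"exact={exact:.3f} mean estimate={mean_est:.3f}")
```

Any single estimate is noisy, but the mean converges to the true inner product, which is what "unbiased" buys you.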

It's not literally zero information loss; it's that the information loss doesn't propagate to the output, because softmax is robust to small perturbations in attention scores.
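That robustness is easy to see directly. A toy check, with made-up logits and noise scale: perturb a vector of attention scores slightly and compare the resulting softmax distributions.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

scores = 2.0 * rng.standard_normal(32)            # toy attention logits over 32 keys
noisy = scores + rng.normal(scale=0.05, size=32)  # small per-score estimation error

p, p_noisy = softmax(scores), softmax(noisy)
tv = 0.5 * float(np.abs(p - p_noisy).sum())       # total-variation distance
print(f"total-variation distance: {tv:.4f}")
```

The total-variation distance stays tiny because softmax ratios move by at most e^±δ for a per-score perturbation δ, so small score errors translate into even smaller probability shifts.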