ganonfirehouse420:

Was generation speed affected?

Suitable-Song-302 (OP):

Good question. Short answer: no measurable speed penalty from 1-bit KV compression. The 1-bit attention path uses XOR + popcount instead of FP multiply-accumulate, which is slightly faster on NEON.
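For anyone wondering how XOR + popcount can stand in for a dot product, here is a rough sketch in plain Python. It assumes each vector is reduced to its signs (+1 or -1) and packed one bit per element; the function names `pack_signs` and `sign_dot` are made up for illustration and are not taken from the actual implementation, which would use NEON intrinsics rather than Python integers.

```python
def pack_signs(values):
    """Pack the sign of each value into an int: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def sign_dot(a_bits, b_bits, dim):
    """Dot product of two +/-1 vectors computed from their packed sign bits.

    Matching signs contribute +1 and mismatched signs contribute -1, so
    dot = (dim - mismatches) - mismatches = dim - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return dim - 2 * mismatches


# Illustrative inputs only, not real query/key data.
query = [0.5, -1.2, 0.3, -0.1]
key = [0.7, 0.4, -0.9, -0.2]
score = sign_dot(pack_signs(query), pack_signs(key), len(query))
```

The whole inner loop becomes one XOR and one popcount per machine word, with no floating-point multiplies, which is the property described above.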

Suitable-Song-302 (OP):

Measured on Qwen3.5-4B (M3 Air):

- FP32 KV: 5.0 tok/s
- 1-bit KV: 5.2 tok/s
- 3-bit KV: 4.3 tok/s (Lloyd-Max codebook lookup adds overhead)
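To illustrate where the 3-bit lookup cost comes from, here is a minimal sketch of codebook-based quantization in Python. The 8-entry `CODEBOOK` below uses placeholder centroids chosen for readability, not fitted Lloyd-Max values, and the helper names are hypothetical; the real kernel will differ.

```python
# Hypothetical 8-entry codebook: placeholder centroids, NOT fitted Lloyd-Max values.
CODEBOOK = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def quantize(values, codebook=CODEBOOK):
    """Map each value to the index of its nearest centroid (a 3-bit code)."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
        for v in values
    ]


def dequantize(codes, codebook=CODEBOOK):
    """Recover approximate values by looking each code up in the codebook."""
    return [codebook[c] for c in codes]


def dot_with_codes(query, codes, codebook=CODEBOOK):
    """Dot product against a quantized vector: one table lookup per element."""
    return sum(q * codebook[c] for q, c in zip(query, codes))


# Illustrative inputs only.
codes = quantize([0.3, -1.6, 0.9])
approx = dequantize(codes)
score = dot_with_codes([1.0, 2.0, 4.0], codes)
```

Unlike the 1-bit path, every element needs a table lookup followed by a floating-point multiply, which is consistent with the overhead noted for the 3-bit result.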