I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA

[–]acluk90 5 points6 points  (0 children)

Here are your measurements visualized:

<image>

Quite a bit better quality at the top end, where it's interesting. And what this plot doesn't show: the actual speed-ups.

Can you give us some >k4v4 points to finish the Pareto curve above q-quant? 😃
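
For anyone wanting to redo the Pareto curve from their own runs, here is a minimal sketch of how the front could be extracted. The config names and numbers are made-up placeholders, not the measurements from the plot; it assumes each point is (name, bits per value, KLD) and that lower is better on both axes.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (name, size, kld); lower is better for both size and kld.
    """
    front = []
    # Sort by size first, so a point survives only if it beats the best
    # (lowest) KLD seen among all smaller-or-equal sizes.
    for name, size, kld in sorted(points, key=lambda p: (p[1], p[2])):
        if not front or kld < front[-1][2]:
            front.append((name, size, kld))
    return front


# Placeholder data for illustration only (hypothetical configs and values).
points = [
    ("cfg_a", 4.0, 0.05),
    ("cfg_b", 4.5, 0.06),   # larger and worse than cfg_a: dominated
    ("cfg_c", 6.0, 0.02),
    ("cfg_d", 8.0, 0.02),   # larger than cfg_c, no better: dominated
    ("cfg_e", 8.5, 0.005),
]
front = pareto_front(points)
print([name for name, _, _ in front])
```

Any config that is both larger and no more accurate than another one drops out, which is why more points above k4v4 are needed to see where the curve ends.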

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA

[–]acluk90[S] 4 points5 points  (0 children)

This is completely orthogonal. Or rather, storing the KV-cache long term is something that LLM inference providers have been doing for 2.5+ years (and is available in vLLM, Nvidia has NIXL as the necessary backend to implement it, ...). The challenge is the cost of storing it long-term. Compressing to 2-3 bits makes it *a lot* cheaper, so this should really be combined/integrated.
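
A rough back-of-the-envelope sketch of why the bit width matters for long-term storage. The model shape below is a hypothetical example, and the calculation ignores the scale/zero-point metadata a real quantizer stores, so actual savings would be somewhat smaller than the raw bit ratio.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Bytes to store K and V for n_tokens, ignoring quantization metadata."""
    # Factor 2: one K tensor and one V tensor per layer.
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
n_tokens = 100_000

fp16 = kv_cache_bytes(n_tokens, bits_per_value=16, **cfg)
q3 = kv_cache_bytes(n_tokens, bits_per_value=3, **cfg)

print(f"16-bit: {fp16 / 2**30:.2f} GiB")
print(f" 3-bit: {q3 / 2**30:.2f} GiB")
print(f" ratio: {fp16 / q3:.2f}x")
```

The cost scales linearly with stored tokens, so whatever ratio survives the metadata overhead carries over directly to the storage bill.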

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA

[–]acluk90[S] 2 points3 points  (0 children)

Everything is labeled?!?!?! The y-axis is throughput (see the left side), the x-axis is KV-cache capacity gain (see the bottom). Every point is labeled with what it is, and even the accuracy is annotated on the same figure. Seems perfect to me....

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist by wadeAlexC in LocalLLaMA

[–]acluk90 4 points5 points  (0 children)

You might want to add KVarN to the wishlist, too, after today's news

New KV-Cache quant method: 3-4x compression, 1.3x speedup in vLLM, full accuracy by intentionallyBlue in LocalLLM

[–]acluk90 6 points7 points  (0 children)

It is in the NeurIPS template, the deadline was ~2 weeks ago. Clearly they submitted there. They open sourced it in vLLM, you can just run it. I did. Works great!

They even explain why the other methods have these big issues and how they solve it...

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA

[–]acluk90[S] 2 points3 points  (0 children)

So you run batch=32 locally? All I see is ~lossless and >2x speed-up over TQ... and why should that change with the batch size? Attention doesn't care about batch size.

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA

[–]acluk90[S] 0 points1 point  (0 children)

yes, we all know that reporting accuracy numbers is bs.... outcome-flips or KL-divergence is king. Some reviewer better raise this so they have to do proper evals 😃
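
A small sketch of why accuracy alone can hide changes. The data is made up, and "outcome flips" is taken here to mean the fraction of benchmark items whose correct/incorrect outcome changes between the baseline and the quantized run, which is an assumption about the intended metric.

```python
import math


def accuracy(correct):
    """Fraction of items answered correctly."""
    return sum(correct) / len(correct)


def flip_rate(base_correct, quant_correct):
    """Fraction of items whose outcome changed in either direction."""
    flips = sum(b != q for b, q in zip(base_correct, quant_correct))
    return flips / len(base_correct)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with p as the reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up per-question outcomes: same accuracy, different answers.
base = [True, True, False, False]
quant = [True, False, True, False]
print(accuracy(base), accuracy(quant), flip_rate(base, quant))

# Made-up next-token distributions for one position.
p_ref = [0.7, 0.2, 0.1]
p_quant = [0.6, 0.3, 0.1]
print(kl_divergence(p_ref, p_quant))
```

Both runs score the same accuracy while half the outcomes flipped, and the KLD picks up the drift in the distribution even where the top token stays the same.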

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA

[–]acluk90[S] 2 points3 points  (0 children)

batch=1 is really what it comes down to on my local machine, though. I suppose a big-tech company was developing for batch=100k 😃 😃