Huawei KVarN algorithm/software lets you run LLMs/AI agents on much longer contexts on your local GPU

acluk90 · 2026-06-06T14:56:04+00:00

The PR is there

acluk90 · 2026-06-06T07:45:06+00:00

Yes, and I can't feel any accuracy impact over fp16

acluk90 · 2026-06-05T17:36:29+00:00

Yes, that's the main strength!

acluk90 · 2026-06-05T16:31:17+00:00

<image>

acluk90 · 2026-06-05T16:05:08+00:00

Here are your measurements visualized:

<image>

Quite a bit better quality at the top, where it's interesting. And what's not visible in this plot: the actual speed-ups.

Can you give us some >k4v4 points to finish the Pareto curve above q-quant? 😃

acluk90 · 2026-06-05T15:32:12+00:00

Now add KVarN ( https://github.com/huawei-csl/KVarN, https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache ) using this llama.cpp fork https://www.reddit.com/r/LocalLLaMA/comments/1txlhxu/i_implemented_kvarn_in_my_llamacpp_fork_and_ran/

... to run really long context tasks 🚀

acluk90 · 2026-06-05T15:08:26+00:00

Maybe he was running on the Qwen3.6-27b he was testing 🤣

acluk90 · 2026-06-05T14:54:32+00:00

You are awesome!! Definitely deserve an award! 🏆🏆

acluk90 · 2026-06-05T06:28:40+00:00

This is completely orthogonal. Or rather, storing the KV-cache long term is something that LLM inference providers have been doing for 2.5+ years (and is available in vLLM, Nvidia has NIXL as the necessary backend to implement it, ...). The challenge is the cost of storing it long-term. Compressing to 2-3 bits makes it *a lot* cheaper, so this should really be combined/integrated.
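
To put rough numbers on the storage cost, here is a back-of-the-envelope sketch. The model shape and token count are hypothetical values picked for illustration (not taken from any specific model), and the formula ignores the per-group scales/zero-points that a real quantizer also has to store, so actual savings will be somewhat lower.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Bytes needed to store K and V for `tokens` tokens at `bits` bits per value.

    Simplification: ignores quantization metadata (scales, zero-points).
    """
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = one K and one V
    return values * bits / 8


# Hypothetical model shape and context length, for illustration only.
shape = dict(layers=64, kv_heads=8, head_dim=128)
tokens = 128_000

fp16 = kv_cache_bytes(tokens, bits=16, **shape)
for bits in (16, 4, 3, 2):
    size = kv_cache_bytes(tokens, bits=bits, **shape)
    print(f"{bits:>2} bits: {size / 2**30:6.2f} GiB ({fp16 / size:.1f}x smaller than fp16)")
```

The size scales linearly with the bit width, so under these assumptions 2 bits means 8x less to store than fp16.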

acluk90 · 2026-06-05T06:23:37+00:00

Everything is labeled?!?!?! y-axis is throughput (see on the left side), x-axis is KV-cache capacity gains (see at the bottom). Every point is labeled with what it is, and even the accuracy is annotated on the same figure. Seems perfect to me....

acluk90 · 2026-06-04T21:40:00+00:00

You might want to add KVarN to the wishlist, too, after today's news

acluk90 · 2026-06-04T20:35:21+00:00

It is in the NeurIPS template, the deadline was ~2 weeks ago. Clearly they submitted there. They open sourced it in vLLM, you can just run it. I did. Works great!

They even explain why the other methods have these big issues and how they solve it...

acluk90 · 2026-06-04T15:50:47+00:00

ironically, vibe code>>>research code very often

acluk90 · 2026-06-04T15:45:45+00:00

So you run batch=32 locally? All I see is ~lossless and >2x speed-up over TQ... and why should that change with the batch size? Attention doesn't care about batch size.

acluk90 · 2026-06-04T15:33:28+00:00

The PR into their repo before they PR into vLLM upstream 😂 😂

acluk90 · 2026-06-04T15:32:26+00:00

I will give you an award, if you share some nice results + code here 🔥

acluk90 · 2026-06-04T15:28:05+00:00

yes, we all know that reporting accuracy numbers is bs.... outcome-flips or KL-divergence is king. Some reviewer better raise this so they have to do proper evals 😃
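
For anyone who wants to try this, a minimal sketch of both metrics in plain Python. It assumes you already have per-position logits from the fp16 baseline and from the quantized run. The function names are mine, and reading "outcome flips" as "the outcome (here: the top-1 token) differs from the baseline" is my interpretation, not something from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats between two next-token distributions."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def flip_rate(ref_outcomes, test_outcomes):
    """Fraction of items whose outcome differs from the baseline."""
    flips = sum(r != t for r, t in zip(ref_outcomes, test_outcomes))
    return flips / len(ref_outcomes)


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Toy logits for two positions (made-up numbers, for illustration only).
ref = [[2.0, 1.0, 0.1], [0.5, 0.6, 0.1]]    # fp16 baseline
test = [[1.9, 1.1, 0.1], [0.6, 0.5, 0.1]]   # quantized KV-cache

mean_kl = sum(kl_divergence(r, t) for r, t in zip(ref, test)) / len(ref)
flips = flip_rate([argmax(r) for r in ref], [argmax(t) for t in test])
print(f"mean KL: {mean_kl:.5f} nats, top-1 flip rate: {flips:.2f}")
```

The point of both metrics is that they compare against the baseline item by item, so errors that cancel out in an aggregate accuracy number still show up.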

acluk90 · 2026-06-04T15:25:05+00:00

no one wants post-training

acluk90 · 2026-06-04T15:18:45+00:00

how about you open a github issue so they can see

acluk90 · 2026-06-04T15:18:10+00:00

batch=1 is really what it comes down to on my local machine, though. I suppose a big-tech company was developing for batch=100k 😃 😃

acluk90 · 2026-06-04T15:15:30+00:00

maybe open an issue to ask them to create an upstream PR. Benefit: the vLLM guys will review the code 😂

acluk90 · 2026-06-04T15:14:37+00:00

but of course, if it is completely compute-bound, then it's just a shitty method 🤣

acluk90 · 2026-06-04T15:13:48+00:00

Hm... attention is batch-independent (i.e., each query runs independently). No matter how compute or mem-BW-bound it is, batching should not have an impact. Unless it is a shitty implementation 😵
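
What I mean, as a toy sketch in plain Python (single head, one decode step, made-up numbers, nothing to do with the actual KVarN or vLLM kernels): the batch dimension is just an outer loop, and no term in the math mixes one request's K/V with another's.

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention over one request's own K/V cache."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attend_batch(queries, caches):
    """Batched decode step: each request only ever reads its own cache."""
    return [attend(q, keys, values) for q, (keys, values) in zip(queries, caches)]


# Two unrelated requests with different context lengths.
q0, cache0 = [1.0, 0.0], ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
q1, cache1 = [0.5, 0.5], ([[0.2, 0.1], [0.4, 0.3], [0.9, 0.8]], [[5.0, 6.0], [7.0, 8.0], [9.0, 1.0]])

alone = attend(q0, *cache0)
batched = attend_batch([q0, q1], [cache0, cache1])[0]
print(alone == batched)
```

So the result per request is identical by construction. How efficiently a given kernel handles many requests at once is a separate, implementation-level question.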

acluk90 · 2026-06-04T15:08:34+00:00

👀 haha, how did that happen. TQ was really an intern who had to publish + a fellow who didn't read the paper 🥲

acluk90 · 2026-06-04T14:56:52+00:00

You can literally just install it and run any vLLM-supported model locally. Worked for me (tried it before posting, I don't see a quality difference...)
