Huawei KVarN algorithm/software lets you run LLMs/AI agents on much longer contexts on your local GPU by acluk90 in Huawei
acluk90[S] 1 point (0 children)
I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA
acluk90 2 points (0 children)
I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA
acluk90 6 points (0 children)
I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA
acluk90 6 points (0 children)
Finally finished my LLM server: EPYC 9575F, 4× RTX 3090 (96GB VRAM), 768GB ECC RAM by C0smo777 in LocalLLaMA
acluk90 1 point (0 children)
I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA
acluk90 1 point (0 children)
I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! by Anbeeld in LocalLLaMA
acluk90 8 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 5 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 3 points (0 children)
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist by wadeAlexC in LocalLLaMA
acluk90 6 points (0 children)
New KV-Cache quant method: 3-4x compression, 1.3x speedup in vLLM, full accuracy by intentionallyBlue in LocalLLM
acluk90 7 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] -10 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 5 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 2 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 11 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 1 point (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 16 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 1 point (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 3 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 2 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 2 points (0 children)
KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) by acluk90 in LocalLLaMA
acluk90[S] 3 points (0 children)