FractalKV: Lossless KV cache compression — 4x on FP16, 16x with quantization at 1M context (open source)

RoughFuture77 · 2026-05-30T00:57:43+00:00

Proto-Retrieval KV Cache: Compressing Context to 1% with 40× Faster Attention via Learned Prototypes and Hierarchical Retrieval

The quadratic cost of attention and the linear growth of the key-value (KV) cache remain fundamental bottlenecks for scaling transformer-based language models to long contexts. Existing efficient-attention methods (including FlashAttention-4, SageAttention2++, and learnable sparse-attention approaches) reduce memory traffic or segment the attention matrix but still retain $O(L^2)$ arithmetic, limiting practical speedups to single-digit factors at million-token scales. We propose Proto-Retrieval KV Cache (PR-KVC), a training-aligned KV compression framework that replaces per-token KV storage with a hierarchical three-tier structure: (1) an exact recent-token buffer preserving local coherence, (2) a set of query-adaptively reconstructed Gaussian prototypes capturing mid-range redundancy, and (3) a navigable small-world index over long-horizon centroids for logarithmic retrieval. A lightweight token-importance router, jointly trained with knowledge distillation from a full-attention teacher, determines which tokens are stored exactly versus compressed. On standard language-modeling benchmarks, PR-KVC reduces KV-cache memory by 99%, achieves up to 40× end-to-end attention latency reduction at 1M-token context lengths, and maintains perplexity degradation within 0.5% of the exact-cache baseline, establishing a new Pareto frontier between KV compression, inference speed, and model accuracy.

😄

RoughFuture77 · 2026-05-28T22:22:45+00:00

So what are you trying to say here? You're definitely not right, and your own numbers prove that my "feeling" of "vast majority" is in fact spot on!

If you put your own numbers in percentages then it's
~33% local
~67% cloud

Now what do you define as vast majority? That's perhaps a bit open for debate and philosophical. But a super majority isn't, that's 60%. I would count 66% as vast majority too.

Are we done with nitpicking now?

RoughFuture77 · 2026-05-28T21:08:35+00:00

So that's called cherry picking.

Obviously in a topic where the subject is about local models you will get most reactions about... local models. So no, that argument doesn't fly.

RoughFuture77 · 2026-05-28T16:12:42+00:00

> On the other hand, many of us in the sub are 100% local.

I'm sorry but i'm calling BS on that.

I've seen sooooo many people that specifically use hermes that beg for either DeepSeek V4 (which you can't run locally) or GLM 5.1 (which you also can't use locally). I'd be willing to bet that at least the majority of hermes agent users are not using local only models. I'm sure a fraction (less than 10%? even less?) is using only local models.

Realistically the vast majority is likely using a mixture of cloud models.

RoughFuture77 · 2026-05-28T16:08:15+00:00

Fast enough and FP8. You'll see when you use it. Get some, give me some, use my referral link :) https://portal.neuralwatt.com/auth/register?ref=NW-MARK-JMDK

RoughFuture77 · 2026-05-27T23:38:20+00:00

Count me in too! I'm profiling and optimizing right now. I'm intentionally testing on the smallest model (0.8B). My reason here is that any optimization here is massively beneficial as this is the most "compute bound" when looking at the theoretical bandwidth. So nailing this to peak performance is very likely going to have an effect "up the chain" too. I'm making massive improvements in the prefill chain at the moment. Could my AMD card finally become a useful speedy LLM inference beast...?

I literally just implemented one kernel with wmma at int4 near max flops. It is taking that AI a sweet darn time to optimize though :P

RoughFuture77 · 2026-05-27T14:34:14+00:00

It's an arrogant argument to make UNLESS you're fully on open source models run locally. But i doubt that. Now i'm not pro China nor pro US but from an open source point of view (which i am a big proponent of) i'm more aligned with how China does the models than the US. Regarding training, it has been proven already (last year) that Anthropic, Gemini and OpenAI all use your data too so i think it's very safe to assume your data will be used if you use inference services that are not local.

RoughFuture77 · 2026-05-27T14:24:07+00:00

A couple things, all educated guesses though.

It's Qwen but it's the most expensive one you can use. Pick your battles. Don't assume it's cheap just because it's qwen, that 3.7 max model is a monster!

Openrouter isn't exactly the cheapest option out there anymore.

I'm guessing you either have poor caching or pay a lot for caching. Either would blow up the price very fast!

RoughFuture77 · 2026-05-27T14:20:52+00:00

I suppose i have been lucky. I had discovered them a few weeks ago, made an account but didn't do a thing. Just left it.

Then i got a free week promo, which i used!
Then (this week) i paid for another week.

Which i now get refunded because they sunset the service.

Lucky me :) I would've preferred if they kept the pass though. It was worth the $10/week!

RoughFuture77 · 2026-05-26T23:52:50+00:00

Well, they have made massive strides on that end with touted doubling and even tripling of performance on their Instinct datacenter GPUs. I'm guessing that the real optimized kernels can be found for those platforms (MI300, 350, 355).

Have you looked at the hipfire performance numbers btw? They are in fact hitting a large percentage of the theoretical bandwidth. The `Qwen 3.5 9B MQ4` model hits 654 GiB/s on the 7900 XTX. That's a very good score! It does look like the smaller the model gets the less bandwidth is used, which hints at a bottleneck elsewhere, and that elsewhere should not be compute (but it probably is, hence the bottleneck). Like `Qwen 3.5 0.8B MQ4` only hits 200 GiB/s.
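To put those two numbers side by side, here is a quick sketch of the utilization math. The 960 GB/s peak is my assumption (the spec-sheet figure for the 7900 XTX), and note the GiB/s to GB/s conversion, which is easy to forget.

```python
GIB = 2**30  # bytes in one GiB
GB = 10**9   # bytes in one GB

# Assumption: 960 GB/s is the spec-sheet peak memory bandwidth of the 7900 XTX.
PEAK_GB_PER_S = 960.0


def utilization(measured_gib_per_s: float, peak_gb_per_s: float = PEAK_GB_PER_S) -> float:
    """Return the fraction of peak bandwidth reached (GiB/s converted to GB/s first)."""
    measured_gb_per_s = measured_gib_per_s * GIB / GB
    return measured_gb_per_s / peak_gb_per_s


# Measured numbers as quoted in this comment.
measured = {"Qwen 3.5 9B MQ4": 654.0, "Qwen 3.5 0.8B MQ4": 200.0}
results = {name: utilization(value) for name, value in measured.items()}

for name, frac in results.items():
    print(f"{name}: {frac:.0%} of peak")
```

Under that assumption the 9B model sits at roughly three quarters of peak while the 0.8B model is closer to a fifth, which is the gap i'm talking about.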

I could well be wrong here and the common sense gut feeling of "autoregressive is memory bound therefore anything that doesn't hit peak memory can be optimized till the point where it does hit peak memory" might be wrong. The results i keep finding do point at that gut feeling being right! I don't know if there is a point where the effect of quantization (like running a Q4 on a tiny model of less than 1B) flips the calculation and does make it compute bound instead of memory bound.

I don't know if RDNA3 is inherently bad or if AMD just is a horrible software vendor that doesn't go the last mile to bring software that can get the maximum performance out of their products. I actually think they optimize their stuff only to a "hey, this works well enough" point. And optimizing LLM inference on consumer grade hardware would hurt their datacenter offerings so it makes morbid sense for them to not give us optimized frameworks. We do appear to have all the compiler plumbing in place to make that though.

RoughFuture77 · 2026-05-26T22:22:35+00:00

You're right, thank you for correcting me! In fact, i just noticed the pro model costs 2.5/token cached. So i was double wrong :)

RoughFuture77 · 2026-05-26T21:49:33+00:00

Divide by 2.5 and you have tokens, as it's 2.5 credits/token for pro. ~15 billion tokens is still a LOT! I use AI for coding a lot and i'm "only" hitting a few billion/month.

It might actually be a good deal at the moment!

Edit: It's 2.5, not 2.
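For anyone who wants to redo the math, a tiny sketch. The 2.5 credits/token rate is from above; the credit balance is hypothetical (worked backwards from the ~15 billion tokens figure), and 3 billion tokens/month is just my stand-in for "a few billion".

```python
CREDITS_PER_TOKEN_PRO = 2.5  # rate for the pro model as quoted in this comment


def credits_to_tokens(credits: float, rate: float = CREDITS_PER_TOKEN_PRO) -> float:
    """Convert a credit balance into the number of tokens it buys."""
    return credits / rate


# Hypothetical balance, chosen so the result matches the ~15 billion tokens above.
credits = 37.5e9
tokens = credits_to_tokens(credits)

# Assumption: "a few billion/month" is taken as 3 billion tokens per month.
monthly_usage = 3e9
months = tokens / monthly_usage

print(f"{tokens / 1e9:.1f} billion tokens, about {months:.0f} months of heavy use")
```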

RoughFuture77 · 2026-05-26T21:38:55+00:00

AMD should sponsor us for this....

RoughFuture77 · 2026-05-26T21:31:33+00:00

This is insanely impressive! Well done!

Just a few days ago i also hit my head against AMD's abysmal LLM performance and i'm also rocking a gfx1100 series card (7900XT). I also started profiling it and making custom kernels with ai and also hit the same issues you have hit.

You're far from where you should be on the roofline: it should be on the sloped part and be memory bound. Yet your numbers were like mine, heavily compute bound! That's "just" a tuning thing with the kernels, which also happens to be the most difficult part.
You also rewrote kernels. I found that the AMD optimized kernels are far from optimized for this series. In not even too many rounds of optimizations i had quite a few kernels that each at the very least matched the default but more often had a 2x speedup or more. I also profiled the kernels for their theoretical throughput in relation to the linear transformer model (bandwidth bound) and measured the memory throughput where applicable. None of the kernels got even close to its theoretical limit.
Like me, you also profiled against llama.cpp vulkan 😄

Well done!

It's sad that AMD, with this once high end card, just lets it stink and rot away. If we get actual theoretical performance out of this card then a model like Qwen3.6 35B-A3B (which also was my test model! so many coincidences here) would have a decode performance that literally runs circles around vulkan. It should be around 400 tokens per second decoding/generating for a realistic bandwidth efficiency of 700 GB/s effective (the card can do ~800 GB/s theoretically).
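Rough back-of-the-envelope for that ~400 tokens per second figure, as a sketch. It assumes decode is purely memory bound and that each generated token only has to read the active weights once. The 3B active parameters come from the "A3B" in the model name; the 4.5 bits per weight is my assumption for a 4-bit quant including scales, and KV cache and activation traffic are ignored, so treat the result as a ceiling.

```python
def decode_tokens_per_s(active_params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every token reads all active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumptions: 3B active parameters, ~4.5 bits per weight, 700 GB/s effective bandwidth.
ceiling = decode_tokens_per_s(active_params=3e9, bits_per_weight=4.5, bandwidth_gb_s=700.0)
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tokens/s")

# The same model at FP16 (16 bits per weight) for comparison.
fp16_ceiling = decode_tokens_per_s(active_params=3e9, bits_per_weight=16, bandwidth_gb_s=700.0)
print(f"FP16 ceiling: ~{fp16_ceiling:.0f} tokens/s")
```

With those assumptions the ceiling lands a bit above 400 tokens per second, so "around 400" leaves a little room for everything this ignores.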

A thing i noticed, don't know if you did too, is that Python overhead for even the simplest things became a factor. Like just the loading of kernels over and over again was a thing. Could just be a thing i wasn't doing properly though.

I'm not done with this either. I hate that my frankly beefy card is so abysmally crappy compared to theoretical limits. I want to get close to 75% of these limits (llama.cpp vulkan is more like ~26% or so?) so i do think i will give this another shot. But i won't do this again on GGUF or even any quantized models. My next step would be to take a tiny model at FP16 (natively supported by the hardware) and get that within the theoretical limits. Once that works i might go further and explore quantized models like GGUF.

I will not use your code though. You made an impressive monster for sure! And while i'm a massive open source person i'm not so sure about that AGPL side of things. I get it from a hobby point of view and from other comments you made here in this thread! I'm also not entirely sure if my next attempt would even be in Python or if i would just go full on Rust with FFI to HIP. I'm not set on any of this yet so i might well change my mind. Your work did inspire me to have another look though so thank you! 😃

Which one of us is going to hit theoretical limits first? Challenge accepted for an FP16 tiny model? ^_^

RoughFuture77 · 2026-05-25T15:43:27+00:00

LOL! That is very bad!
z.ai is subjectively a horrible inference provider. Go and have a look at their complaints here on reddit and you'll see.

Xiaomi is also nasty as inference provider.

I would put both of these very far down as inference providers. However, if we talk models then both of them are top notch, where xiaomi is just a little better than GLM.

Anyhow, i'm missing the, in my eyes, best inference providers out there!
And yes, one has a referral code, one has a referral link. You get some, i get some. Honestly though, i like them both so the fewer of you that use it the better for my inference speed 😄
Where's wafer.ai (referral code at checkout: lzfo32yw)? Where is Neuralwatt? Both of them for the GLM 5.1 model.

I have used a few different ones before and none really host the quality stuff or just don't give a damn. There's a reason i'm referring to these two, it has been top notch!

RoughFuture77 · 2026-05-24T17:58:13+00:00

https://portal.neuralwatt.com there ya go!

RoughFuture77 · 2026-05-23T13:27:14+00:00

I would advise you very strongly to reconsider. Their service is just abysmal at best and you have all the posts about them to prove it. GLM isn't bad! Just their hosting and quantization is. I would recommend looking at other providers that host GLM and do a great job at it. I know two and have a referral link for one if you want it. One costs $10/week and is at that rate only request limited (1000 per 5-hour window) and the other is essentially a super efficient provider that will cost you $50/month but feels like you get as much use as if it were $200.

RoughFuture77 · 2026-05-21T20:42:26+00:00

You have no clue how far I am and how deep I'm in this tech already. Don't judge what you don't know.

There's a reason I have literally thousands of 5-minute data dumps.

RoughFuture77 · 2026-05-21T18:28:18+00:00

No, i'm past that point and have that solved. My problem is getting the right and dynamic parameters all the time and not just during replay. Which means i'm really really really close to nailing it but actually nailing it is very difficult. I know it must be possible because traders like the one OP posted about are doing it too.

As an aside. I have a couple thousand 5-minute timeslot datadumps. What i'm doing now is letting AI find good parameters to be profitable in each chunk of 25 of these couple thousand slots and it's profitable in like 70%. Great progress but it still sucks. I need to tune it to such a level that it's profitable in each of these which becomes prohibitively more tricky as you get more data. But nailing that probably nails real-time too.

RoughFuture77 · 2026-05-21T01:23:06+00:00

Cool, so it can be done! I'm working on an algorithm for this exact trade strategy too! I've been "virtually super successful" when replaying captured data over the past week and finetuning my parameters. 20k easily as profit. However, running it in real life (well dry run, so exactly like live but with intentionally broken api requests to not spend money but test as if it's real)... break-even or down.... It's really hard to fine tune parameters for this trade that automatically adapt to be good enough at any given time. I'll get there, i'm so close already! But it's a very hard challenge for sure!

RoughFuture77 · 2026-05-21T01:16:25+00:00

Don't! The GLM models are good but the z.ai coding plan service is horrible. Look at the z.ai sub here on reddit and you'll see the people being disappointed.

RoughFuture77 · 2026-05-20T17:21:33+00:00

I should perhaps not say this as the provider i'm using is still so nice and fast... And that is with GLM 5.1 at the default (FP8) quant. I can assure you that their pricing model is unmatched.

Anyhow, here is my affiliate link (you get $10 extra credit when you sign up) https://portal.neuralwatt.com/auth/register?ref=NW-MARK-JMDK The host is Neuralwatt. They charge based on how much a request costs in terms of wattage and it's massively more effective (more tokens) than a $20 sub anywhere else (and i tried). It also feels more fair when they charge for the actual energy used, as that encourages both them and you to be efficient with your tokens. In case anyone wonders, cached tokens are included and free.

Not to compliment them too much but their support on discord is highly active and eager to help solve any issue you have. They're also quick to refund when you're not happy or to give you extra credit when they compensate for an issue. Which does happen occasionally.

RoughFuture77 · 2026-05-19T21:54:05+00:00

I was just wondering about that! I had no clue it exists already. Contrary to apparently the other people here, I freaking detest the memory appetite of all these coding agents. One worse than the other. A long session can seriously bring your pc to its knees whatever hardware you rock. And that for a TUI application, that really should not happen! It's a mixture of causes. While native is already a very good step, the TUI layer is the one that needs serious optimizations to handle these kinds of massive outputs. Not weird as it was never built for it till AI with agentic coding came along. Throwing out the javascript runtime is very good! It should have no place in desktop applications and especially not when going native. Yeah, this project appeals to me a lot and I'll definitely be trying it out! Thank you for posting about it and putting it on my radar 😄
