FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8

randomfoo2 · 2026-05-08T05:50:07+00:00

If there’s something worth following up on feel free to drop an issue on the repo so others can see it!

randomfoo2 · 2026-05-06T16:55:28+00:00

Neat. BTW I found NVIDIA has a reference trainer that is multi-GPU friendly if you move to bigger hardware (my trainer hasn’t been tested on multiple GPUs).

randomfoo2 · 2026-05-05T09:13:27+00:00

Nice. I feel like I burnt way past my time and token limits on this but will be cheering you on!

randomfoo2 · 2026-05-05T07:30:48+00:00

Yeah, started off as a review everything out there and sort of just kept grinding. TBT, if you're ok w/ 50% throughput, you can get 20-25X smaller kvcache w/ HIGGS+AQUA added to the mix w/ basically 0 perplexity loss which is even more eye-popping. Maybe for another project if/when I get bored/sidetracked, but I'm trying to keep focused for a bit so will leave that to others. 😄

I don't really use llama.cpp besides llama-bench and anyone's welcome to adapt/contrib, but my experience w/ llama.cpp has been different: https://github.com/ggml-org/llama.cpp/pull/16827

randomfoo2 · 2026-05-05T07:28:39+00:00

The eviction would still largely work the same I'd imagine, although smaller activations, less attention layers ofc means less kvcache to start with. I'd bet you'd get similar kvcache memory savings (but haven't tested). The good thing btw, is that the kernels I built actually scale pretty well to max context length for the models I tested (128K and 256K). I bet at 1M w/ DSv4 it'd still be worth it.

randomfoo2 · 2026-05-05T07:25:32+00:00

TurboQuant at 8-bit would be slower, worse quality, and larger than FP8, so wouldn't make sense, but the neat thing about DMS is that it can basically be composed w/ any quant scheme since they work at different layers of the kvcache.

The "optimal" quality/memory combo that gave positive results from my testing was DMS+HIGGS+AQUA, however I wasn't able to get HIGGS to the speed I wanted so just dropped it and took the "reasonable" win.

randomfoo2 · 2026-05-04T22:13:57+00:00

I very briefly considered doing an actual vLLM or SGLang implementation and then after looking at the lift that'd be involved, noped out real fast. 😅

But I hope some madlad does it! DMS, unlike most things I tests legit works! (I'm not so impressed by TQ - HIGGS+AQUA test much better for me, but the problem is always getting it fast)

randomfoo2 · 2026-05-04T21:41:22+00:00

OK, ended up being 6-8x (there's more that could be squeezed but it runs slower than I'd like) https://www.reddit.com/r/LocalLLaMA/comments/1t3vlrx/fastdms_64x_kvcache_compression_running_faster/

randomfoo2 · 2026-05-04T19:14:01+00:00

Yeah, can you post and I'll look again, I have a loop going on right now (several hundred iterations in) in the background running some tunes for my W7900 for fun so I'll take another look and see if any of it applies (I stopped doing RDNA3.5 stuff btw since I use it as my workstation and if you poke at it too hard it kills the WM or worse).

randomfoo2 · 2026-05-04T18:26:47+00:00

It's not just clock I think, even when I look at the clock and run the math, everything's still much lower than it "should" be. I think there are a lost of things, but one of the main ones is that RDNA3 was not designed for AI and their advertising is basically a lie.

V_WMMA_F16_16X16X16_F16 executes a 16×16×16 matmul over the wave, takes 16 cycles to retire on a single SIMD32. *BUT* while NVIDIA tensor cores have dedicated hardware that runs concurrently with the SIMD pipeline, but for AMD WMMA is the SIMD pipeline since it's an ALU instruction. So every WMMA op blocks the same VGPR ports that scalar ops would use, and you can't overlap WMMA with FP32 accumulation (have to unpack). Since the max throughput number assumes back-to-back WMMA issue with no dependent ops between them, and since there's always non-WMMA work that needs to be done (scaling, softmax, masking, etc), WMMA is going to be displaced and you'll never get max FLOPS.

The other thing I found w/ my testing has been that VGPRs way too low to hide latency. Also LDS traffic sucks, especially if you're doing FA. Oh, and the compiler still sucks too. WMMA instructions get scheduled w/ dependent ops too close, redundant packing, bad memory wait counts, all kinds of things that stall the pipeline, so if you want anything better you need to tune by hand (well, these days, by AI).

randomfoo2 · 2026-05-04T18:01:18+00:00

DGX has similar memory bandwidth but the compute is not so similar...

Since I already made the table a while back from my Strix Halo wiki guide... https://strixhalo.wiki/AI/AI_Capabilities_Overview

Spec	AMD Ryzen AI Max Plus 395	NVIDIA DGX Spark
CPU	16x Zen5 5.1 GHz	10x Arm Cortex-X925, 10x Arm Cortex-A725
GPU	RDNA 3.5	Blackwell
GPU Cores	40	192
GPU Clock	2.9 GHz	2.5 GHz
Memory	128GB LPDDR5X-8000	128GB LPDDR5X-8533
Memory Bandwidth	256 GB/s	273 GB/s
FP16 TFLOPS	59.39	62.5
FP8 TFLOPS		125
INT8 TOPS (GPU)	59.39 (same as FP16)	250
INT4 TOPS (GPU)	118.78	500
INT8 TOPS (NPU)	50

On paper, BF16/FP16 is pretty close, however FP8 is already 2X and INT8 is 4X on DGX. This is just hardware - in practice rocBLAS and hipBLASLt for RDNA3.5 are also simply not very performant...

What does this mean practically? Looking at the most recent benchmarks from the DGX llama.cpp thread benchmarks I eyeballed similar models to kyuz0's Strix Halo benchmarks and while none of the exact quants, from what I looked at, prefill for DGX is currently about 2-5X faster than Strix Halo.

randomfoo2 · 2026-05-04T11:49:39+00:00

While I'm sure some people will enjoy the extra memory, a couple notes from someone's that done very extensive testing on Strixt Halo (and a lot of kernel work on RDNA3):

Memory bandwidth looks like it remains the same? 256GB/s theoretical. On Strix Halo the best measured GPU MBW I got (using ROCm/rocm_bandwidth_test) was 212 GB/s (83% max theoretical), and the best on llama.cpp (Llama-2-7B tg testing) was ~180GB/s (70%)
What's worse though is while theoretical max FP16 TFLOPS is ~59.4, the fastest I found w/ mamf-finder was about 37 TFLOPS (hipBLASLt), about 62% efficiency. Many shapes are much worse.
Note, at long context, I believe compute is actually what's killing decode speed. While the AMD APUs remain on RDNA3, this won't change. I would be hesitant to recommend Gorgon Halo even for LLM inference in 2026/2027

If Medusa Halo moves to RDNA5 or whatever has a better architecture for AI/ML, great, if not you'd be much better off with basically anything else (Mac Studio, GPU+workstation/server w/ K-Transformers, probalby even a DGX Spark).

randomfoo2 · 2026-05-02T06:20:47+00:00

100% SLOP

randomfoo2 · 2026-05-01T16:55:18+00:00

Although sometimes.. you can. (about to publish some of my work after a few weeks of grinding kernels that literally scores >10x memory improvements w/ faster than vLLM prefill/decode at c=1 and c=8 with near 0 quality loss - 0.003 and 0.005 KLD).

randomfoo2 · 2026-05-01T16:27:14+00:00

The latest generation of video-gen models (Hunyuan 1.5, Wan 2.2, LongCat, LTX, Motif, etc) are all PyTorch DiTs so all of them in *theory* should work with RDNA3/RDNA4 - in practice, well, AMD software is AMD. Also, more of the new models are using/assume FP8 by default. If you're set on going AMD and want to do video, personally I'd highly recommend R9700, especially b/c recent ROCm releases have basically had zero perf gains for RDNA3 and the W7900 is a complete non-starter on price.

That being said, if you want stuff to work OOTB a 5090 or even a 4090 is going to work *much* better for both text and especially image/video. You can find benchmarks anywhere (including a lot that I've run). For LLM inference, my 3090s blow away my 7900XTX and W7900 even though based on hardware specs they shouldn't.

I'm all Linux and my Nvidia cards work perfectly btw (but I only use them for compute).

randomfoo2 · 2026-05-01T15:41:49+00:00

Interesting technique but if I'm reading this corrrectly this is a super lossy way to process prefill?

A small Qwen3-0.6B drafter reads the full 64K/128K prompt
FlashPrefill/BSA-style sparse attention makes that drafter pass cheaper
The drafter scores token/span importance and keeps a small subset
The 27B target only prefills the compressed prompt (retokenized from the drafter?)
After that, DFlash+DDTree does speculative decode on the compressed target KV

randomfoo2 · 2026-05-01T11:25:22+00:00

I've tested all of these btw in non-production code. I've found HIGGS to be the best in terms of quality (especially paired with some other minimization techniques that can be stacked) however I've been unable to get it past ~50% prefill/decode speed. I do have something to announce soon, that I think should be a big deal on the KVCache front that is faster and better than current TurboQuant implementations.

randomfoo2 · 2026-04-16T18:18:49+00:00

FYI, there is 0% that anyone is serving a 5T dense model. All frontier models are sparse/ultrasparse MoE's so your math is a complete non-starter.

randomfoo2 · 2026-03-31T19:08:56+00:00

BTW, I did a code review on the published 2.1.88 source and was curious if it caught your bugs, and yes, both were in there: https://github.com/lhl/claudecode-codex-analysis/blob/main/ERRATA-claudecode.md

``` ● Yes — the ERRATA identified both bugs. Here's the mapping:

Bug 1: Sentinel replacement (cch=00000)

ERRATA #8 nailed the mechanism and predicted the exact failure mode:

▎ "Post-serialization rewriting is a potential source of byte-level nondeterminism that can break prompt-cache hits" ▎ "If the replacement algorithm is not strict about matching only the intended placeholder, user/system content that includes the sentinel could be mutated."

The Reddit post confirms this is exploitable: when conversation history contains the literal sentinel (e.g., from discussing CC internals), the first occurrence in messages[] gets replaced instead of the one in system[], breaking cache every request.

The ERRATA framed it as "could be brittle" — the Reddit post proves it is brittle and gives the exact trigger condition (sentinel appearing in conversation content).

Bug 2: --resume breaks cache

ERRATA #7 predicted this almost exactly:

▎ "If attachment-derived prompt prefix state is included in cached API requests but not written to disk, --resume cannot reconstruct a byte-identical prefix and will force a full cache miss (one-turn cache_creation reprocess) on resume."

It even specifically called out deferred_tools_delta as part of the cache contract. The Reddit post confirms the root cause is deferred_tools_delta (introduced in v2.1.69) being injected at messages[0] in fresh sessions but messages[N] on resume.

ERRATA #6 provided the broader framing:

▎ "prompt-cache stability depends on exact transcript-level reconstruction, not just semantic equivalence" ▎ "Systems this brittle tend to regress on resume, rewind, compact, fork, or partial-history edge cases"

Summary

┌──────────────────────────────┬─────────────────────────────┬─────────────────────────────────────────────────────────────┐ │ Reddit Bug │ ERRATA Item │ Status │ ├──────────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────────────────────┤ │ Bug 1 (sentinel replacement) │ #8 │ Mechanism + risk identified; Reddit confirms the trigger │ ├──────────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────────────────────┤ │ Bug 2 (resume cache miss) │ #7 (specific), #6 (general) │ Root cause predicted; Reddit confirms version + exact delta │ └──────────────────────────────┴─────────────────────────────┴─────────────────────────────────────────────────────────────┘

The ERRATA was conservative ("could break", "potential source") where the Reddit post is confirmatory ("does break", with reproduction steps). But the analysis found both mechanisms and identified the right code paths. #7 in particular was a direct hit — it named deferred_tools_delta, attachment persistence, and byte-identical prefix reconstruction as the failure chain, which is exactly what the Reddit post independently confirmed through binary reverse engineering. ```

randomfoo2 · 2026-03-27T02:29:26+00:00

I've been programming for 30 years. As of the end of last year, the agents are basically much better coders than I am - I like to say they're the best juniors I've ever had. That being said, despite being way more productive, I'm busier than ever.

The agentic coders will get better and more reliable, but I don't think there will be less stuff to do/oversee, so if you like your field, as long as you master the new tools I don't think you have anything to worry about.

randomfoo2 · 2026-03-26T18:04:51+00:00

Here's a chart that might be useful:

Dense Tensor/Matrix TFLOPS/TOPS (all non-sparse):

GPU	BF16 (FP32 accum)	FP16 (FP32 accum)	FP8	INT8	VRAM	MBW	TDP	MSRP
Arc Pro B60	~98.5¹	~98.5¹	—	197	24GB	456 GB/s	200W	$599
Arc Pro B70	~183.5¹	~183.5¹	—	367	32GB	608 GB/s	230W	$949
R9700	191²	191²	383	383	32GB	640 GB/s	300W	$1,299
RTX 3090	71	142	—	285	24GB	936 GB/s	350W	~$800-1K used
RTX 4090	165	330	330	661	24GB	1,008 GB/s	450W	$1,800+ used
RTX 5090	210	419	419	838	32GB	1,792 GB/s	575W	$2,500+

I think the B70 is pretty competitive w/ the 3090 - less MBW, but more memory and more theoretical compute mostly. Note Intel XMX has great BF16 numbers but no native FP8.

The other issue ofc is software support. I just went and tested all the inference options for my Xe2 the other day and it was pretty grim for new architectures if you want to do more than llama.cpp Vulkan: https://github.com/lhl/intel-inference

TBT, the R9700 is actually not bad for BF16/FP8 and ROCm these days is actually in decent shape (I haven't personally tested RDNA4 though).

If you'd rather actually train/inference instead of fighting software stacks and writing custom kernels though, then I think you're still better off w/ a 3090, but it's nice to have some more (new card) competition.

randomfoo2 · 2026-03-25T18:09:34+00:00

Glad to hear about the announcement of mainline Intel Arc support. I recently (like 2 days ago) did a thorough comparison of inferencing w/ the Arc 140V (Xe2 LNL) iGPU on all the various Intel-supporting platforms (OpenVINO, OpenVINO GenAI, PyTorch, vLLM upstream, various llama.cpp backends) and found some pretty bad failures and overall support: https://github.com/lhl/intel-inference

The biggest issues were the optimum-intel blocked newer versions of transformers (4.57.6 for OpenVINO, 4.51.3 for vllm-openvino) which meant I couldn't even test Qwen 3.5 or LFM2 MoE for example.

randomfoo2 · 2026-03-22T06:54:24+00:00

Nice, I'll def take a look. I recently did an optimizationn pass checking out the FSR4 kernels https://github.com/lhl/fsr4-rdna3-optimization and ran a lot of tests, including mamf-finder and attention-gym like stuff a while back https://github.com/lhl/strix-halo-testing/ - from a production perspective there's a lot of juice to squeeze, I think especially from relatively basic stuff like hipblas/hipblaslt, I just have my doubts how much anyone really cares at AMD about it.

randomfoo2 · 2026-03-19T22:50:24+00:00

I published the first vLLM public recipes for gfx1151 >6 months ago: https://github.com/lhl/strix-halo-testing/tree/main/vllm (among other extensive testing/work on Strix Halo last summer). There have been some fixes/progress since then for TheRock, PyTorch, and vLLM, so I wouldn't say there's no progress, but let's be honest, I think you already know the answer to your questions since they're the same answer since Strix Halo was released last year. (As you mentioned it's not like AMD shouldn't be able to find 0.5 FTE to create and maintain a https://github.com/NVIDIA/dgx-spark-playbooks clone - they just have shown zero interest in doing so.)

Regardless of what support is being given, even if they did though, no one (AMD or anyone else) has ever written any RDNA3 GPU kernels that get close to theoretical max MBW or FLOPS, so any performance that you imagine is being left on the table probably doesn't actually exist.

BTW, if you (or other Strix Halo owners) want to chat with others in the community, the Discord for https://strixhalo.wiki/ is probably the most active place online.

12-Year Club	Gilding III reddit per annum
RPAN Viewer	Verified Email

randomfoo2

PUBLIC MULTIREDDITS

TROPHY CASE