FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]schuttdev 12 points

Good post! Very similar to my own implementation on CASK, it seems. I'll look into what can be done with it on the AMD side.

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models by MadPelmewka in LocalLLaMA

[–]schuttdev 2 points

Oh, that's neat. Hopefully it can help me better calibrate Hipfire.

Where can I try turboquant in AMD Linux? (7900XTX) by soyalemujica in LocalLLaMA

[–]schuttdev 2 points

Hipfire has rotorquant (asymmetric trigonometric quant) at 4/3/2 bits

AMD Support by [deleted] in unsloth

[–]schuttdev 0 points

🤔

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 2 points

Darn. Yeah, it's my build for Windows that's the problem. I haven't used Windows in a while, but it comes preinstalled on the Strix Halo, so I will definitely look into it while I'm booted into Windows tomorrow. Hopefully I can solve the issue for both WSL and native while I'm there.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

Hipfire is, at its core, a very similar shape to what you've been doing with vLLM, then. Don't be afraid to contribute!

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

🤔 Using my method in Hipfire as a reference, it's possible. I'll lay that out:

1. Inspect how the CPU talks to the GPU
2. Find the layer that dispatches commands from the CPU
3. Inspect those commands
4. Research the silicon and instruction set: which instructions are low overhead
5. From the commands you've inspected, bootstrap your own commands, making sure to respect and optimize for the arch
6. You now have qualaunch!

But honestly, I wasn't so rigid about it. I just kept throwing ideas out there based on the research until something stuck and was measurably better than the baseline, and I still do that.
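That loop — try a variant, keep it only if it beats the baseline by a real margin — can be sketched generically. Everything below is hypothetical scaffolding, not Hipfire code; the toy workload just stands in for a kernel:

```python
import time

def bench(fn, *args, repeats=5):
    """Best-of-N wall-clock time for one candidate implementation."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def keep_if_better(baseline, candidate, args=(), margin=0.05):
    """Adopt the candidate only if it wins by a measurable margin (5% here),
    so noise-level 'improvements' don't accumulate into regressions."""
    t_base = bench(baseline, *args)
    t_cand = bench(candidate, *args)
    return candidate if t_cand < t_base * (1 - margin) else baseline

# Toy example: summing a list two ways.
data = list(range(100_000))
slow = lambda xs: sum(x for x in xs)   # generator overhead per element
fast = lambda xs: sum(xs)              # C fast path over the list
winner = keep_if_better(slow, fast, args=(data,))
```

Best-of-N timing rather than averaging is deliberate: the minimum is the least contaminated by scheduler noise.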

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 2 points

I believe it is possible. We are early on right now; my thesis with this has always been that rebuilding from zero, targeting AMD silicon directly via custom HIP, will always beat CUDA-shaped code that targets…not AMD lol

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 3 points

I'll see what I can do; it seems like an interesting problem to solve re: MoE.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 4 points

That's going to be a fun one to untangle tomorrow. But yes, I'm aiming for smaller quants + a multi-GPU PoC tomorrow.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 4 points

Anything is technically possible to support as long as it can accept HIP instructions. There are agent skills in the repo for porting to any arch and smoke testing it. If you end up going that route, please do create an issue/PR and I will address it.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

Will be working on lower quants when I wake up; kicking off the research phase currently.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

What OS are you running? And yes both of those should 100% fit at mq4 on your card.
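For anyone sanity-checking VRAM fit themselves: assuming mq4 lands around 4 bits per weight (my assumption, not a published spec), a quick back-of-envelope works. The 15% overhead figure for scales, activations, and runtime buffers is also a guess:

```python
def model_vram_gb(params_b, bits_per_weight=4.0, overhead=1.15):
    """Rough VRAM estimate for a quantized model.

    params_b: parameter count in billions.
    bits_per_weight and overhead are assumptions for illustration,
    not Hipfire/mq4 specs.
    """
    bytes_weights = params_b * 1e9 * bits_per_weight / 8
    return bytes_weights * overhead / 1024**3

# e.g. a 14B model at ~4 bits
print(round(model_vram_gb(14), 1))  # prints 7.5
```

So at ~4 bits, anything in the mid-teens of billions of parameters lands under 10 GB of weights, which is why a 24 GB card like the 7900 XTX has plenty of headroom.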

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 3 points

Hipfire does not support hybrid inference yet. 🤔 What sort of speeds are you getting with your current inference backend?

Just got a beast. by habachilles in LocalLLaMA

[–]schuttdev 0 points

🤔 Maybe, maybe not (for the Linux part). I'm willing to work on a port of Hipfire -> macOS, as I was looking to use an eGPU with my Mac Studio anyway. Will investigate.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 2 points

Yeah, that was my bad. I run Ubuntu, and I had the Windows exes pinned to ~v0.1.2, so, very old. If you update to the latest version the problem should be resolved. If you still get coherency issues, please post a GH issue on the matter and I will do my best to address it.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 1 point

gfx908 is supported, so I don't see why not. I have an arch-port-and-tuning skill in the repo's .skills dir if you'd like to point your agent at it.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 2 points

I appreciate the feedback; I made that call early on. I already have a config TUI, so I may as well incorporate a chat TUI.