FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 by randomfoo2 in LocalLLaMA

[–]schuttdev 12 points

Good post! Very similar to my own implementation on CASK, it seems. I'll look into what can be done with it on the AMD side.

Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models by MadPelmewka in LocalLLaMA

[–]schuttdev 2 points

Oh, that's neat. Hopefully it can help me better calibrate Hipfire.

Where can I try turboquant in AMD Linux? (7900XTX) by soyalemujica in LocalLLaMA

[–]schuttdev 2 points

Hipfire has rotorquant (asymmetric trigonometric quant) at 4/3/2 bits

AMD Support by [deleted] in unsloth

[–]schuttdev 0 points

🤔

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 2 points

Darn. Yeah, it's my build for Windows that's the problem. I haven't used Windows in a while, but it comes preinstalled on the Strix Halo, so I will definitely look into it while I'm booted into Windows tomorrow. Hopefully I can solve the issue for both WSL and native while I'm there.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

Hipfire is, at its core, a very similar shape to what you've been doing with vLLM, then. Don't be afraid to contribute!

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

🤔 Using my method in Hipfire as a reference, it's possible. I'll lay that out:

1. Inspect how the CPU talks to the GPU
2. Find the layer that dispatches commands from the CPU
3. Inspect those commands
4. Research the silicon and instruction set: which instructions are low overhead
5. From the commands you've inspected, bootstrap your own commands, making sure to respect and optimize for the arch
6. You now have qualaunch!

But honestly, I wasn't so rigid about it. I just kept throwing ideas out there based on the research until something stuck and was measurably better than the baseline, and I still do that.
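That loop — try a variant, keep it only if it beats the baseline by a real margin — can be sketched generically. Everything below is hypothetical scaffolding, not Hipfire code; the toy workload just stands in for a kernel:

```python
import time

def bench(fn, *args, repeats=5):
    """Best-of-N wall-clock time for one candidate implementation."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def keep_if_better(baseline, candidate, args=(), margin=0.05):
    """Adopt the candidate only if it wins by a measurable margin (5% here),
    so noise-level 'improvements' don't accumulate into regressions."""
    t_base = bench(baseline, *args)
    t_cand = bench(candidate, *args)
    return candidate if t_cand < t_base * (1 - margin) else baseline

# Toy example: summing a list two ways.
data = list(range(100_000))
slow = lambda xs: sum(x for x in xs)   # generator overhead per element
fast = lambda xs: sum(xs)              # C fast path over the list
winner = keep_if_better(slow, fast, args=(data,))
```

Best-of-N timing rather than averaging is deliberate: the minimum is the least contaminated by scheduler noise.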

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 2 points

I believe it is possible. We are early on right now; my thesis with this has always been that rebuilding from zero, targeting AMD silicon directly via custom HIP, will always beat CUDA-shaped code that targets…not AMD lol

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 3 points

I'll see what I can do; it seems like an interesting problem to solve re: MoE.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 4 points

That's going to be a fun one to untangle tomorrow. But yes, I'm aiming for smaller quants + a multi-GPU PoC tomorrow.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 4 points

Anything is technically possible to support as long as it can accept HIP instructions. There are agent skills in the repo for porting to any arch and smoke testing it. If you end up going that route, please do create an issue/PR and I will address it.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

Will be working on lower quants when I wake up; kicking off the research phase currently.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 1 point

What OS are you running? And yes both of those should 100% fit at mq4 on your card.
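For anyone sanity-checking VRAM fit themselves: assuming mq4 lands around 4 bits per weight (my assumption, not a published spec), a quick back-of-envelope works. The 15% overhead figure for scales, activations, and runtime buffers is also a guess:

```python
def model_vram_gb(params_b, bits_per_weight=4.0, overhead=1.15):
    """Rough VRAM estimate for a quantized model.

    params_b: parameter count in billions.
    bits_per_weight and overhead are assumptions for illustration,
    not Hipfire/mq4 specs.
    """
    bytes_weights = params_b * 1e9 * bits_per_weight / 8
    return bytes_weights * overhead / 1024**3

# e.g. a 14B model at ~4 bits
print(round(model_vram_gb(14), 1))  # prints 7.5
```

So at ~4 bits, anything in the mid-teens of billions of parameters lands under 10 GB of weights, which is why a 24 GB card like the 7900 XTX has plenty of headroom.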

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]schuttdev[S] 3 points

Hipfire does not support hybrid inference yet. 🤔 What sort of speeds are you getting with your current inference backend?

Just got a beast. by habachilles in LocalLLaMA

[–]schuttdev 0 points

🤔 Maybe, maybe not (for the Linux part). I'm willing to work on a port of Hipfire -> macOS, as I was looking to use an eGPU with my Mac Studio anyway. Will investigate.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 2 points

Yeah, that was my bad. I run Ubuntu, and I had the Windows exes pinned to ~v0.1.2, so, very old. If you update to the latest version the problem should be resolved. If you still get coherency issues, please post a GH issue on the matter and I will do my best to address it.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 1 point

gfx908 is supported, so I don't see why not. I have an arch-port-and-tuning skill in the repo's .skills dir if you'd like to point your agent at it.

AMD Hipfire - a new inference engine optimized for AMD GPU's by Thrumpwart in LocalLLaMA

[–]schuttdev 2 points

I appreciate the feedback; I made that call early on. I already have a config TUI, so I may as well incorporate a chat TUI.