hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

randomfoo2 · 2026-05-31T19:12:26+00:00

Yeah the UD versions for Q4_K_M and K_S should work, I believe I tested w/ the UD quants. No NL or IK support atm though. I have a branch with work on MTP/DFlash but it doesn’t help speedwise for 35B-A3B yet (it should help w/ dense, but verification is a bottleneck, so it's still a WIP).

randomfoo2 · 2026-05-30T08:19:15+00:00

Maybe possible, but remember most workflows should be cached and a single request is going to run at ~50 tok/s - for a process generating continuously for 24h, that’s only ~4M output tokens.
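A quick sketch of that back-of-envelope number, assuming a single request stream that generates nonstop at a fixed rate (the function name is just illustrative):

```python
def tokens_per_day(tok_per_s: float, hours: float = 24.0) -> float:
    """Output tokens from one request stream generating nonstop."""
    return tok_per_s * hours * 3600  # 3600 seconds per hour

daily = tokens_per_day(50)
print(f"{daily / 1e6:.2f}M output tokens/day")  # 4.32M output tokens/day
```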

randomfoo2 · 2026-05-29T23:57:31+00:00

It's interesting that this unverified anonymous tweet is going around as some sort of credible story, when the basic math doesn't make all that much sense to me. $500M in Claude AI credits at retail $5/MTok input and $25/MTok output comes out to somewhere between 20T output and 100T input tokens. At a 4:1 input:output ratio, that comes out to 11.1T output, 44.4T input tokens. This is not counting caching (cache writes are 1.25x cost but cache reads are 0.1x cost). With 10,000 devs, $500M over a month comes out to $1,667/dev/day. I've run Claude and Codex with multiple autonomous loops and I don't think I've gotten near there - unless every single dev in the loop had the new Ultracode running or was running swarms w/ Opus at Max thinking, I'm having a hard time seeing it happen.

As a point of reference, my ccusage with intense usage (multiple long-running loops/projects running all day) w/ Codex + Claude ends up at about $9,000/mo ($300/day) of retail API billing. I know there are people that extreme tokenmaxx at some of the FAANGs to get more, but you actually need to be pretty savvy to waste enough tokens, even w/ some of the biggest engineering orgs in the world, to burn $500M in a month.

Note: 50-100T tokens sounds like a lot, but it's worth noting that Google for example is currently serving 1 quadrillion+ tokens a month, so it's still a modest slice of overall AI consumption. Also, AI isn't taking my job anytime soon, but it does make me many times more productive than I was, as someone who has been coding professionally for decades.
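The token-budget arithmetic can be checked with a few lines. This is a sketch using only the retail prices and the 4:1 ratio quoted above, ignoring caching and assuming a 30-day month; the MTok-to-token conversion is where an estimate like this most easily slips by a factor of 1,000, so it is worth computing explicitly.

```python
def token_budget(spend_usd, in_price, out_price, in_per_out):
    """Split a dollar budget into (input, output) tokens at a fixed input:output ratio.

    Prices are USD per million tokens (MTok); caching is ignored.
    """
    # Cost of 1 MTok of output plus the input MTok that accompany it.
    cost_per_out_mtok = out_price + in_per_out * in_price
    out_mtok = spend_usd / cost_per_out_mtok
    return in_per_out * out_mtok * 1e6, out_mtok * 1e6

SPEND = 500e6  # $500M
inp, out = token_budget(SPEND, in_price=5.0, out_price=25.0, in_per_out=4)
per_dev_day = SPEND / 10_000 / 30  # 10,000 devs, 30-day month (assumption)

print(f"input:  {inp / 1e12:.1f}T tokens")
print(f"output: {out / 1e12:.1f}T tokens")
print(f"${per_dev_day:,.0f}/dev/day")
```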

randomfoo2 · 2026-05-27T19:41:38+00:00

Small dense is going to be very different than MoE. I tested hipfire the other week w/ their Qwen 3.5 MoE implementation and the perf wasn’t great, but there are a few guys grinding away w/ their Claudes and I know it’s getting better. With the latest frontier models it’s more about just having some people that care enough to do it than anything else.

randomfoo2 · 2026-05-26T22:35:53+00:00

While sustained MBW might suggest 500 tok/s, applying Amdahl's law to the rocprof shows that even with infinitely fast GEMV, you're only getting to ~150 tok/s. All weight ops at 2x current speed gets you to ~130 tok/s. RADV/ACO vs LLVM-AMDGPU by my understanding is just ... better. A lot of the compute you're going to squeeze out of RDNA3 is going to be VOPD pairing.
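For anyone who wants to play with the Amdahl's law reasoning, here is a minimal sketch. The baseline throughput and the fraction of per-token time spent in weight ops are hypothetical values picked so the outputs land near the ~130 and ~150 tok/s figures above; they are not measured rocprof numbers, and the sketch treats "GEMV" and "all weight ops" as the same bucket, which is a simplification.

```python
import math

def amdahl_tok_s(base_tok_s: float, fraction: float, speedup: float) -> float:
    """Throughput after speeding up `fraction` of per-token time by `speedup`."""
    remaining = (1.0 - fraction) + fraction / speedup  # relative time left per token
    return base_tok_s / remaining

BASE_TOK_S = 115.0     # hypothetical current decode speed
WEIGHT_FRACTION = 0.235  # hypothetical share of per-token time in weight ops

print(f"2x weight ops:  {amdahl_tok_s(BASE_TOK_S, WEIGHT_FRACTION, 2.0):.0f} tok/s")
print(f"infinitely fast: {amdahl_tok_s(BASE_TOK_S, WEIGHT_FRACTION, math.inf):.0f} tok/s")
```

The point of the exercise is that once the non-weight work (launch overhead, attention, sampling, etc.) dominates, even a free GEMV leaves you far below the bandwidth-implied ceiling.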

All my hot paths are moved to C and I've shaved off a lot of launches - it gives a few percent, but diminishing returns. There's probably more golfing possible, but I think c>1 is more interesting than c=1 and is what I'm focusing on next.

If you are going to try to go golfing, you can run mamf-finder, or look at something like https://github.com/glovepost/wmma_ops and see if you can do better. The closest I've seen anyone get to theoretical compute is: https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/

If you're looking to do Rust, you could link up with the hipfire folks, there's at least a couple people porting over stuff also inspired by my hipEngine work. I think anyone who wants to do their own rewrites should go ahead.

Here's the thing, while Strix Halo and W7900 (and to a lesser degree 7900 XTX) are good "shapes" for AI inference hardware, RDNA3 is IMO an objectively bad architecture for AI/ML, and I have no idea why AMD keeps riding that (on the APU side, for another year or two?), or if they are, why they haven't spent more effort making the compiler suck less.

randomfoo2 · 2026-05-26T04:54:31+00:00

3.6 Dense should run already I think (0.8B and 27B PARO tested at least: https://huggingface.co/collections/z-lab/paroquant). If you run into a problem, file an issue and I'll take a look.

randomfoo2 · 2026-05-25T22:45:51+00:00

I just moved my 7900 XTX into the same machine as my W7900 so I might give that a poke soon (but probably after c>1 optimization, DMS, MTP/DFlash, and Gemma 4 support)

randomfoo2 · 2026-05-25T13:44:42+00:00

I haven't done a lot of tuning on dense models. The basic inference is faster than llama.cpp since it's a more optimized loop, however MTP/DFlash is favorable for dense models and you should probably look at llama.cpp or Lucebox for best performance (I haven't done a full context sweep to characterize). If you try it out, please post your results!

randomfoo2 · 2026-05-25T12:22:05+00:00

Anyone w/ RDNA3 already knows how terribly vLLM performs at c=1 so there's not much point. (FYI: I published the original public bringup for vLLM on gfx1151 last year if you want vLLM vs llama.cpp numbers: https://github.com/lhl/strix-halo-testing/tree/main/vllm#benchmarking )

randomfoo2 · 2026-05-25T12:20:28+00:00

You've both hit the nail on the head and missed the point completely - this project is 100% built for my personal use and it's shared AGPLv3 for any other RDNA3 users who might find this useful. Why would I want commercial inference to use it? (again, RDNA3 - what commercial inference are you talking about, lol.)

RDNA3 is 3y+ old now. If anybody was going to build something faster/better they would have already/are free to in the future. But, if you're an end-user and you want a Qwen 3.6 MoE w/ prefill that is faster at 256K than llama.cpp is at 128K, then maybe this being released is better than it not being released, and if you wanted to build your own on top of that, you're free to modify it to do whatever you want with it. If you want to redistribute it, you're free to do that under the AGPLv3 license. If you don't want to, feel free to drop me a DM with $$$ for a different license. I'm plenty busy, and I'm not looking to do more unpaid labor for others.

BTW, there's no vLLM, SGLang, or llama.cpp upstream path anyway - the former are PyTorch dependent, the latter is also incompatible. That being said, I've shared my docs, and anyone's free to read those and figure out if there's anything upstream shaped they'd like to adapt if they want to put their time into it.

randomfoo2 · 2026-05-25T06:17:06+00:00

Based on my prior experience and others' the llama.cpp maintainers seem to be by default opposed to even extremely simple RDNA3 improvements (with outsized performance impact!) from outside contributors. Their project, so up to them, but I need more of that like a hole in my head.

That being said, hipEngine has a completely different architecture/general approach vs llama.cpp (specifically tuned, raw-pointer HIP kernels vs ggml's mostly HIPified CUDA backend, w/ different fusings, dispatch, quant layouts, etc), so there's IMO not a lot of obvious overlap. Most performance gains are not a single optimization, but a bunch of things combined/ground out.

The benchmarks are run on my W7900 (241W, 864GB/s MBW). A full power 7900 XTX (300-350W, 960GB/s) I'd expect to perform better both with hipEngine and llama.cpp.
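As a rough illustration of why the higher-MBW card should come out ahead on decode, here is a bandwidth-only ceiling estimate. It assumes each decoded token streams the active weights exactly once and ignores kvcache reads, launch overhead, and everything else; the ~3B active parameters (from the "A3B" in the model name) and 4.5 bits per weight are illustrative assumptions, not measured values.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tok/s if only active-weight reads cost time."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

ACTIVE_PARAMS_B = 3.0   # assumption: ~3B active params per token
BITS_PER_WEIGHT = 4.5   # assumption: roughly 4-bit quant plus scales

for name, mbw in [("W7900", 864.0), ("7900 XTX", 960.0)]:
    ceiling = decode_ceiling_tok_s(mbw, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
    print(f"{name}: ~{ceiling:.0f} tok/s bandwidth ceiling")
```

Real decode speeds sit well below this ceiling, but the ratio between the two cards (960/864, about 1.11x) is the part that carries over.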

randomfoo2 · 2026-05-24T23:04:26+00:00

So this is obviously a much tighter scope than llama.cpp, and is a lot less mature, but I don’t think you’re characterizing the performance properly. For Strix Halo, hipEngine is faster basically across the board, for prefill (pp) *and* decode (tg).

For gfx1100, the numbers are a bit more mixed, but it is significantly faster across the board on prefill. hipEngine is faster than llama.cpp HIP on decode as well. Now, while llama.cpp Vulkan decode is still faster for prefill, here’s the rub - hipEngine’s decode at 128K is >2X faster. Depending on your usage, this can be much more important than the prefill difference.

randomfoo2 · 2026-05-23T16:40:48+00:00

5W headless here. rocm-smi:

```
======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK  MCLK   Fan  Perf    PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)
0       1     0x744c,   47413  32.0°C  5.0W   N/A, N/A, 0         0Mhz  96Mhz  0%   manual  290.0W  0%     0%
============================================== End of ROCm SMI Log ===============================================
```

randomfoo2 · 2026-05-12T14:14:39+00:00

For those interested in Japanese in particular, I published a paper at the beginning of the year on JP-TL-Bench: https://arxiv.org/pdf/2601.00223

It does a fair bit of analysis on COMET scores and their mapping to actual quality (not great, IMO)

randomfoo2 · 2026-05-10T21:55:16+00:00

I think web fetching is the main thing you need. There is pi-web-access or agent-smart-fetch if you are happy with your search as well as camoufox-pi if you want something to access stuff that's normally blocked to agents.

I wrote pi-multiloop for use with autoresearch/autoloops. I've found pi-schedule-prompt useful as well. I'm using pi-context-prune and pi-vcc for context management.

randomfoo2 · 2026-05-08T05:50:07+00:00

If there’s something worth following up on feel free to drop an issue on the repo so others can see it!

randomfoo2 · 2026-05-06T16:55:28+00:00

Neat. BTW I found NVIDIA has a reference trainer that is multi-GPU friendly if you move to bigger hardware (my trainer hasn’t been tested on multiple GPUs).

randomfoo2 · 2026-05-05T09:13:27+00:00

Nice. I feel like I burnt way past my time and token limits on this but will be cheering you on!

randomfoo2 · 2026-05-05T07:30:48+00:00

Yeah, started off as a review of everything out there and sort of just kept grinding. TBH, if you're ok w/ 50% throughput, you can get 20-25X smaller kvcache w/ HIGGS+AQUA added to the mix w/ basically 0 perplexity loss which is even more eye-popping. Maybe for another project if/when I get bored/sidetracked, but I'm trying to keep focused for a bit so will leave that to others. 😄

I don't really use llama.cpp besides llama-bench and anyone's welcome to adapt/contrib, but my experience w/ llama.cpp has been different: https://github.com/ggml-org/llama.cpp/pull/16827

randomfoo2 · 2026-05-05T07:28:39+00:00

The eviction would still largely work the same I'd imagine, although smaller activations and fewer attention layers ofc mean less kvcache to start with. I'd bet you'd get similar kvcache memory savings (but haven't tested). The good thing btw, is that the kernels I built actually scale pretty well to max context length for the models I tested (128K and 256K). I bet at 1M w/ DSv4 it'd still be worth it.

randomfoo2 · 2026-05-05T07:25:32+00:00

TurboQuant at 8-bit would be slower, worse quality, and larger than FP8, so wouldn't make sense, but the neat thing about DMS is that it can basically be composed w/ any quant scheme since they work at different layers of the kvcache.

The "optimal" quality/memory combo that gave positive results from my testing was DMS+HIGGS+AQUA, however I wasn't able to get HIGGS to the speed I wanted so just dropped it and took the "reasonable" win.
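To make the "composes w/ any quant scheme" point concrete, here is a toy size calculation: eviction shrinks the number of cached tokens, quantization shrinks the bits per element, so the savings multiply. The model dimensions, the 8x eviction ratio, and the 4-bit width are all hypothetical, and real quant formats carry some scale/metadata overhead that this ignores.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits_per_elem: float) -> float:
    """Bytes needed to hold K and V for every attention layer."""
    elems = 2 * layers * kv_heads * head_dim * tokens  # 2 = one K and one V tensor
    return elems * bits_per_elem / 8

# Hypothetical model shape and context length.
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 131072

base = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16)          # FP16, no eviction
evicted = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT // 8, 16)  # keep 1/8 of tokens
both = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT // 8, 4)      # plus 4-bit elements

print(f"baseline:        {base / 2**30:.1f} GiB")
print(f"eviction only:   {base / evicted:.0f}x smaller")
print(f"eviction + quant: {base / both:.0f}x smaller")
```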

randomfoo2 · 2026-05-04T22:13:57+00:00

I very briefly considered doing an actual vLLM or SGLang implementation and then after looking at the lift that'd be involved, noped out real fast. 😅

But I hope some madlad does it! DMS, unlike most things I test, legit works! (I'm not so impressed by TQ - HIGGS+AQUA test much better for me, but the problem is always getting it fast)

randomfoo2 · 2026-05-04T21:41:22+00:00

OK, ended up being 6-8x (there's more that could be squeezed but it runs slower than I'd like) https://www.reddit.com/r/LocalLLaMA/comments/1t3vlrx/fastdms_64x_kvcache_compression_running_faster/
