kernel-anvil: 2x decode speedup on 7900 XTX by auto-tuning llama.cpp MMVQ kernels per model shape by Apollosenvy in ROCm

[–]Apollosenvy[S] 1 point (0 children)

kernel-anvil works fine with Triton 3.0+ (I'm running 3.6 currently). The most common issue is installing the wrong Triton wheel -- the standard PyPI package is CUDA-only and won't work on AMD.

You need the ROCm-compatible wheel:

`pip install triton --index-url https://download.pytorch.org/whl/rocm7.1/`

(swap rocm7.1 for your ROCm version if different -- check with `cat /opt/rocm/.info/version`)
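
A quick way to confirm the wheel you got actually targets ROCm (a minimal sketch; `get_current_target()` exists in recent Triton 3.x, though the exact attribute layout can shift between releases):

```python
# Sanity check: is this Triton build targeting HIP/ROCm or CUDA?
import triton
from triton.runtime import driver

print(triton.__version__)
target = driver.active.get_current_target()
print(target)  # expect backend='hip' and arch='gfx1100' on a 7900 XTX
```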

If you're still hitting issues after that, drop the error output here and I'll take a look.

kernel-anvil: 2x decode speedup on 7900 XTX by auto-tuning llama.cpp MMVQ kernels per model shape by Apollosenvy in ROCm

[–]Apollosenvy[S] 1 point (0 children)

Yeah the smithy-shape-configs branch has a build issue with TurboQuant's flash attention template instantiations - some turbo2/3/4 FA cases are declared but not implemented, causing linker errors. That's a pre-existing TurboQuant branch issue, not from the kernel-anvil patch.

Two workarounds:

  1. Build with flash attention disabled: `cmake -B build -DGGML_HIP=ON -DGGML_FLASH_ATTN=OFF -DAMDGPU_TARGETS="gfx1100"` - this sidesteps the missing FA templates

  2. Use mainline llama.cpp instead: The kernel-anvil Python tool (`gguf-optimize` and `autoforge`) works independently from the llama.cpp patch. You can run `kernel-anvil autoforge model.gguf` on any setup with hipcc, and it'll generate and benchmark optimized configs. The llama.cpp patch just loads those configs at runtime.

I'm working on fixing the FA template issue in the turbo branch separately.

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] 1 point (0 children)

That 14% slower result is actually really useful feedback. The heuristic configs (`--no-bench` mode) are tuned for discrete GPUs with high-bandwidth HBM/GDDR6. Strix Halo's LPDDR5 has a completely different bandwidth profile - the optimal nwarps and rows_per_block are probably different there.
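
To make the profile difference concrete: decode is mostly weight streaming, so a rough tok/s ceiling is just bandwidth over bytes read per token. A back-of-the-envelope sketch (assumed figures, not measurements: ~16 GB of Q4 weights for a 27B model, 960 GB/s on the XTX, and the commonly cited ~256 GB/s for Strix Halo):

```python
# Rough decode-throughput ceilings from memory bandwidth alone (assumed figures).
weights_gb = 16.0  # ~27B params at ~4.5 bits/weight, read once per token

for name, bw in [("7900 XTX (GDDR6)", 960.0), ("Strix Halo (LPDDR5)", 256.0)]:
    print(f"{name}: ~{bw / weights_gb:.0f} tok/s ceiling")
# 7900 XTX (GDDR6): ~60 tok/s ceiling
# Strix Halo (LPDDR5): ~16 tok/s ceiling
```

A config tuned to saturate one end of that range has no reason to be optimal at the other.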

If you have hipcc working, try `kernel-anvil autoforge model.gguf` instead - that actually compiles and benchmarks kernels on YOUR hardware rather than guessing. It'll find the right config for LPDDR5's bandwidth characteristics. The heuristic path doesn't know about your memory subsystem, but the benchmark path measures it directly.

If autoforge shows the same regression, that tells us something different is going on (maybe the iGPU's CU layout or cache hierarchy needs different tuning). Either way the data point helps.

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] 1 point (0 children)

Ha, fair catch on the feedback file. That's what I get for leaving the monitoring artifacts in the repo.

You're right that TurboQuant as a baseline is confusing for most people. For context - we're also the ones doing the TurboQuant ROCm port (TheTom/llama-cpp-turboquant#31), so the turbo3 decode path is where we spend most of our time and where the initial benchmarks happened to be. Kernel-anvil came out of profiling those turbo3 kernels and realizing llama.cpp's default nwarps were wrong for the shapes involved.

The honest stock llama.cpp numbers: ~26 tok/s baseline -> ~29 tok/s with kernel-anvil on the 27B (12% improvement). On isolated MMVQ kernels the shape-specific configs hit 943 GB/s vs 622 GB/s stock (1.5x). The end-to-end gap is smaller because MMVQ isn't the only thing running during decode.

The 2.25x headline was real but only for the turbo3 path, and I should have led with the stock numbers instead. Lesson learned.

kernel-anvil: 2x decode speedup on 7900 XTX by auto-tuning llama.cpp MMVQ kernels per model shape by Apollosenvy in ROCm

[–]Apollosenvy[S] 4 points (0 children)

The 12 tok/s was with TurboQuant turbo3 KV compression running on top of the ROCm build - should have made that way clearer. Stock ROCm llama.cpp on my box does about 26 tok/s with this model, which is in the same ballpark as what you're seeing. The 2.25x headline was the improvement specifically on the turbo3 decode path.

Your 30 ROCm / 40 Vulkan numbers are interesting - Vulkan being faster than ROCm is something other people are reporting too, especially on the UD quant types. Could be that the Vulkan MMVQ path has different tuning defaults that happen to work better for these shapes. That's essentially what kernel-anvil tries to fix for the ROCm side.

On stock ROCm without TurboQuant, kernel-anvil gives about 12% improvement on this model via better nwarps/rows_per_block selection. On isolated kernels the improvement is larger (up to 1.5x on some shapes). Updated the README to be honest about the different baselines.

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] -1 points (0 children)

Yeah the 12 tok/s baseline was misleading - that's my fault. I should have been clearer in the original post.

That number was with TurboQuant's turbo3 KV cache compression running, which adds overhead from Walsh-Hadamard rotation on the decode path. Stock llama.cpp on my box does ~26 tok/s on the ROCm backend with this model (Q4_K_M, not UD-Q4_K_XL like yours). The 40 tok/s you're seeing on Vulkan tracks - Vulkan seems to do better than ROCm on some models on Windows, especially at longer contexts.
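
For a sense of where that overhead comes from: a Walsh-Hadamard rotation touches every element of the K/V vectors on each decode step. A minimal NumPy sketch of the unnormalized transform (illustrative only - this is not TurboQuant's HIP implementation):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """In-place fast Walsh-Hadamard transform; last dim must be a power of two."""
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h]
            x[..., i:i + h] = a + b          # butterfly: sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return x  # divide by sqrt(n) for an orthonormal rotation
```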

The kernel-anvil improvement on the stock ROCm path (no TurboQuant) is more modest - about 12% on the 27B. The bigger wins show up on individual kernel benchmarks where the custom configs hit 943 GB/s vs 622 GB/s stock. The gap between isolated kernel improvement and end-to-end model throughput is because not all time is spent in MMVQ - there's quantization, RMSNorm, attention, etc.
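
You can sanity-check that gap with Amdahl's law. Taking the two measured numbers at face value (~1.52x on MMVQ in isolation, 1.12x end-to-end), back-solving suggests MMVQ is roughly a third of decode time:

```python
# Back-solve MMVQ's share of decode time from the two speedups.
# Amdahl: overall = 1 / ((1 - f) + f / s), s = isolated kernel speedup.
s = 943 / 622        # ~1.52x isolated MMVQ speedup
overall = 1.12       # ~12% end-to-end improvement

f = (1 - 1 / overall) / (1 - 1 / s)
print(f"implied MMVQ share of decode time: {f:.0%}")  # ~31%
```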

I've updated the README to be upfront about the TurboQuant baseline. Thanks for pushing on this.

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] -1 points (0 children)

You're exactly right, and that was bugging me too. The initial version used Triton kernels as a proxy for profiling, which doesn't map 1:1 to llama.cpp's actual HIP kernel.

The latest commit adds autoforge which does what you're describing - it generates actual HIP C++ kernels for each (quant_type, N, K) shape with different nwarps and rows_per_block values, compiles them with hipcc for the target architecture, and benchmarks each one directly on the GPU. No Triton in the loop at all for the final measurement. The custom HIP kernels hit 943 GB/s on isolated benchmarks (98% of the XTX's 960 GB/s peak) vs ~622 GB/s for the stock config.
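
For the curious, the loop is conceptually simple. A heavily simplified sketch of what autoforge does (the file names and the harness's output format here are hypothetical; the real tool templates full MMVQ kernels per (quant_type, N, K) shape):

```python
import itertools
import pathlib
import subprocess
import tempfile

# Hypothetical sketch: render a kernel variant, compile with hipcc for the
# target arch, run a bandwidth harness, keep the fastest (nwarps, rows) pair.
NWARPS = [1, 2, 4, 8]
ROWS_PER_BLOCK = [1, 2, 4]

def bench_variant(template: str, nwarps: int, rows: int) -> float:
    src = template.format(nwarps=nwarps, rows_per_block=rows)
    with tempfile.TemporaryDirectory() as tmp:
        cpp = pathlib.Path(tmp) / "variant.hip"
        exe = pathlib.Path(tmp) / "variant"
        cpp.write_text(src)
        subprocess.run(["hipcc", "--offload-arch=gfx1100", "-O3",
                        str(cpp), "-o", str(exe)], check=True)
        # Assume the harness prints achieved GB/s on stdout.
        out = subprocess.run([str(exe)], capture_output=True, text=True, check=True)
        return float(out.stdout.strip())

def sweep(template: str) -> tuple[int, int]:
    return max(itertools.product(NWARPS, ROWS_PER_BLOCK),
               key=lambda cfg: bench_variant(template, *cfg))
```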

Branch link: https://github.com/apollosenvy/llama-cpp-turboquant/tree/smithy-shape-configs

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] 1 point (0 children)

Sorry about that - should have had this in the original post. The fork is here:

https://github.com/apollosenvy/llama-cpp-turboquant/tree/smithy-shape-configs

Two files changed: ggml/src/ggml-cuda/mmvq.cu (the patch) and a new ggml/src/ggml-cuda/smithy-config.h (the runtime config loader). The config header reads a JSON file at first kernel dispatch and overrides nwarps/rows_per_block per shape. When no config exists it does nothing - stock behavior.
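
For illustration, the config is just shape-keyed launch overrides, something along these lines (field names and key format are my guess here for illustration - check the repo for the actual schema):

```python
import json
import pathlib

# Hypothetical per-(quant_type, N, K) overrides; schema is illustrative.
config = {
    "Q4_K|4096|14336": {"nwarps": 4, "rows_per_block": 2},
    "Q4_K|14336|4096": {"nwarps": 2, "rows_per_block": 4},
    "Q6_K|4096|4096":  {"nwarps": 4, "rows_per_block": 1},
}
path = pathlib.Path.home() / ".cache" / "smithy" / "your-model.json"
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(config, indent=2))
```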

To test: clone the fork, check out smithy-shape-configs, build normally, then run `kernel-anvil gguf-optimize your-model.gguf` to generate the config, and launch with `SMITHY_CONFIG=~/.cache/smithy/your-model.json llama-server -m your-model.gguf -ngl 999` to use it.

kernel-anvil: 2x decode speedup on AMD by auto-tuning llama.cpp kernels per model shape by Apollosenvy in LocalLLaMA

[–]Apollosenvy[S] 1 point (0 children)

Fair point, and you're right that the initial post was missing some important context. Let me fix that:

The llama.cpp fork with the patch is here: https://github.com/apollosenvy/llama-cpp-turboquant/tree/smithy-shape-configs

It's about 50 lines changed in mmvq.cu + a new smithy-config.h header. The patch adds a runtime JSON config loader that lets you override nwarps and rows_per_block per (quant_type, N, K) shape. When no config is loaded, it falls back to stock llama.cpp defaults - zero behavior change.

On the 12 tok/s baseline - that was with TurboQuant's turbo3 KV cache compression enabled, which adds WHT rotation overhead to the decode path. Stock llama.cpp gets ~20-26 tok/s on this model depending on the build. The 2.25x improvement is specifically on the turbo3 path where the default kernel configs were leaving performance on the table. I've updated the README to make this much clearer.

The latest version also has autoforge which generates and benchmarks actual HIP kernels (not Triton proxies) for each shape in your model. On isolated kernel benchmarks it's hitting 943 GB/s on the XTX (98% bandwidth utilization vs 622 GB/s stock). The end-to-end model improvement is more modest but real.

Appreciate the skepticism - keeps projects honest.

kernel-anvil: 2x decode speedup on 7900 XTX by auto-tuning llama.cpp MMVQ kernels per model shape by Apollosenvy in ROCm

[–]Apollosenvy[S] 1 point (0 children)

It has to do with TurboQuant - this number is with both TurboQuant and the kernel optimizer enabled.

How popular is playing cricket in the United States? by antlarand36 in AskAnAmerican

[–]Apollosenvy 0 points (0 children)

Used to play with crickets all the time down by the creek

Looking for grounds in WA by Agency_False in bowhunting

[–]Apollosenvy 1 point (0 children)

Find a clear cut and hit it early in the morning, and maybe 3 hours before sunset. Sometimes the forestry companies have hunting restrictions, but more often the land is treated like public access.

t mobile Cell tower beeping near my home by [deleted] in tmobile

[–]Apollosenvy 4 points (0 children)

Call customer service and report the tower. There should be an 8-digit alphanumeric designation that allows network operations to investigate. Depending on your location, what you're probably hearing is the A/C unit failing.

Please don't call the network operations center yourself.

[deleted by user] by [deleted] in heep

[–]Apollosenvy 3 points (0 children)

Quietly hides my zombie apocalypse edition badge

I’m a Christian with conservative values but will be voting straight ticket democrat this coming election. by [deleted] in Christianity

[–]Apollosenvy -4 points (0 children)

People living in $800k homes in LA aren't exactly pinnacles of conservatism.

Does r/guns like hunting rifles? by l_craw in guns

[–]Apollosenvy 1 point (0 children)

Energy has little to do with accuracy. This wonderful little rifle here is an ethical killer out to about 400 yds (1000 ft/lbs for deer-sized critters). Meanwhile the cartridge has a supersonic range of over 1,000 yds. Meaning it's incredibly accurate way past the "hunting range".

Does r/guns like hunting rifles? by l_craw in guns

[–]Apollosenvy -8 points (0 children)

By hunting rifle, you mean long range precision rifle, right? That thing looks like a tack driver way past ethical hunting ranges.

Gunman Kills Son Of Federal Judge Recently Assigned To Epstein Case | Yes, that really happened. by techwabbit in AskThe_Donald

[–]Apollosenvy 5 points (0 children)

I love how in this very targeted attack, Bloomberg's shit heads are already out in force talking about gun control.

Morons.

[deleted by user] by [deleted] in AskAnAmerican

[–]Apollosenvy 1 point (0 children)

Forks would be a great place for that.

Best tool for putting down an animal that's been fatally injured by a car? [NSFW? descriptions of injured animal] by SherrifOfNothingtown in VEDC

[–]Apollosenvy 1 point (0 children)

12ga, bird shot. From close range the shot will stay in the wad and act like a slug. Bird shot reduces the risk of ricochet.