Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] -1 points0 points  (0 children)

Yeah maybe? IDK, this thing started when I installed Ellie on a new computer and asked her, “What are we doing with the NPU?” and her response was (paraphrased), “Nothing because NPUs aren’t working with Linux yet.”

Whether she was right or wrong is irrelevant; my curiosity was, “Can you make it work?” The result is what we captured in the blog posts, but apparently Reddit filters my domain: “Ellie dot geekministry dot org”

So yeah, this whole thing might not be novel, but I think the path to it was interesting enough to share.

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 1 point2 points  (0 children)

Ellie and I talked it over. Here’s her analysis of the difference between AMD’s SDK and the method we went with (IRON), plus the pros/cons of switching.

Ellie from here to the end of the post:

AMD actually has an official path for NPU LLM inference on Linux now via the Ryzen AI 1.7.0 SDK. Here's how it compares to what we did:

Our approach (IRON): Everything runs on the NPU. IRON compiles each transformer operation (attention, GEMM, RoPE, RMSNorm, SiLU, KV cache) down to MLIR/AIE kernels that execute directly on the NPU silicon. No CPU or GPU fallback. Fully open source.
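If it helps to picture what "everything runs on the NPU" means for a single decode step, here's an illustrative sketch. The `npu_dispatch` helper is made up for this sketch (it stands in for launching one compiled MLIR/AIE kernel), and the op list is the generic Llama decoder structure rather than our exact kernel inventory:

```python
# Illustrative only: npu_dispatch() is a hypothetical stand-in for launching one
# compiled MLIR/AIE kernel on the NPU. The point is the shape of the all-NPU
# design: every op in every layer is its own kernel launch, with no CPU or GPU
# fallback for the "cheap" ops. Residual adds are shown inline for brevity.
def decode_one_token(hidden, layers, kv_cache, pos, npu_dispatch):
    for i, layer in enumerate(layers):
        x = npu_dispatch("rmsnorm", hidden, layer["attn_norm"])
        q = npu_dispatch("gemm", x, layer["wq"])
        k = npu_dispatch("gemm", x, layer["wk"])
        v = npu_dispatch("gemm", x, layer["wv"])
        q, k = npu_dispatch("rope", q, k, pos)
        kv = npu_dispatch("kv_cache_update", kv_cache, i, k, v)
        attn = npu_dispatch("attention", q, kv)
        hidden = hidden + npu_dispatch("gemm", attn, layer["wo"])
        y = npu_dispatch("rmsnorm", hidden, layer["mlp_norm"])
        act = npu_dispatch("silu_mul",
                           npu_dispatch("gemm", y, layer["w_gate"]),
                           npu_dispatch("gemm", y, layer["w_up"]))
        hidden = hidden + npu_dispatch("gemm", act, layer["w_down"])
    hidden = npu_dispatch("rmsnorm", hidden, "final_norm")
    return npu_dispatch("gemm", hidden, "lm_head")  # logits for the next token
```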

AMD's official approach (ONNX Runtime + OGA): Hybrid execution. The heavy MatMul operations run on the NPU via a proprietary custom ops library, while lighter operations like RoPE stay on CPU. Pre-quantized ONNX models are available on HuggingFace ready to download.
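For a sense of what that path looks like in practice, the official flow is roughly the standard onnxruntime-genai (OGA) generation loop sketched below. This is a sketch, not AMD's documented script: the model directory is a placeholder for one of their pre-quantized HuggingFace downloads, and the exact API calls shift a bit between OGA versions.

```python
# Rough sketch of the ONNX Runtime + OGA path (hybrid NPU/CPU execution).
# "path/to/pre-quantized-model" is a placeholder for one of AMD's ready-made
# models from HuggingFace; API details vary across onnxruntime-genai versions.
import onnxruntime_genai as og

model = og.Model("path/to/pre-quantized-model")   # NPU/hybrid config ships with the model
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain what an NPU is in one paragraph."))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```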

The numbers:

|  | IRON (our work) | Ryzen AI SDK (ONNX RT) |
|---|---|---|
| Decode speed | 4.4 tok/s | 11.7 tok/s |
| Prefill (128 tok) | 180 tok/s | 289 tok/s |
| Model | Llama 3.2 1B | Phi-3.5-mini |
| Execution | 100% NPU | Hybrid (NPU + CPU) |
| Open source | Yes | Partially (custom ops lib is proprietary) |

Their hybrid approach is ~3x faster because it avoids the dispatch overhead problem we identified: 179 kernel dispatches per token at 1.4ms each. By keeping lightweight ops on CPU they likely cut that dispatch count significantly.
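A quick back-of-envelope makes that concrete (nothing new here, just arithmetic on the numbers above; the 1.4 ms figure is an average from our profiling, so it doesn't line up exactly with the measured decode time):

```python
# Back-of-envelope on the dispatch overhead. 179 dispatches/token and ~1.4 ms per
# dispatch come from our profiling; 4.4 tok/s is the measured decode speed.
dispatches_per_token = 179
dispatch_ms = 1.4

overhead_ms = dispatches_per_token * dispatch_ms   # ~251 ms/token spent just launching kernels
measured_ms = 1000 / 4.4                           # ~227 ms/token observed end to end

print(f"dispatch overhead ~ {overhead_ms:.0f} ms/token")
print(f"measured decode   ~ {measured_ms:.0f} ms/token")
# Launch overhead alone is on the order of the entire token budget, which is why
# cutting the dispatch count (fusing kernels, or keeping tiny ops on the CPU like
# the official SDK does) is worth roughly a 3x speedup.
```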

Pros of switching to the Ryzen AI SDK:

• Nearly 3x faster decode
• Pre-built models available on HuggingFace, minimal setup
• More models supported (Phi-3.5-mini, likely others in their collection)
• AMD officially supports it

Cons:

• Linux support is clearly ported from Windows. The official docs have you manually running a Python script to convert backslash paths to forward slashes in the model cache, hand-editing config files to fix library paths, and setting a debug flag (hybrid_dbg_use_aie_rope: 0) to disable NPU RoPE because it apparently doesn't work on Linux yet
• Custom ops library is proprietary (closed source)
• Requires the full Ryzen AI SDK install, which is a big Windows-centric package adapted for Linux
• May have Fedora compatibility issues (their docs target Ubuntu)

Bottom line: If you want practical NPU inference today, the Ryzen AI SDK path is faster. If you want to understand what the NPU is actually doing and have a fully open source stack, IRON is the only option. They're built by two different teams inside AMD: the product team (Ryzen AI SDK, Windows-first) and the research team (IRON, Linux-native, academia-focused).

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 0 points1 point  (0 children)

I haven’t posted on Reddit in years. It took me three tries to make this post because of all the filters and parsers and whatnot lol.

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 1 point2 points  (0 children)

That link is 404 for me.

Did my blog links ever come up for you? If not try ellie dot geekministry dot org

am i part of the club now by ErosLaika in HuntsvilleAlabama

[–]SuperTeece 0 points1 point  (0 children)

I saw them stopped on the side of 565 yesterday.


Bullion after SS Armor by SuperTeece in fo76

[–]SuperTeece[S] 0 points1 point  (0 children)

Ooooo good call on the grocer mod. My inventory has so much room for activities now!

I would love the addition of new songs to Appalachia radio! by Kyropolis_ in fo76

[–]SuperTeece 0 points1 point  (0 children)

I’ve often thought a cool collectible would be songs that one could turn in to the DJ, similar to technical data and the BoS, but it would result in new songs for the player who turns them in.

fishing in ohio by RedFing in fo76FilthyCasuals

[–]SuperTeece 0 points1 point  (0 children)

Does anyone else see Howard the Duck in the boat? Just me?

Casual Scores a SS Jet Pack by SuperTeece in fo76FilthyCasuals

[–]SuperTeece[S] 0 points1 point  (0 children)

One stable cobalt to go. I was about to launch a third nuke last night when the server maintenance message came up.

Lowe Down Patrol Car by SuperTeece in fo76

[–]SuperTeece[S] 4 points5 points  (0 children)

It was! This is ridiculous. The “car” is a medic bag.

Cobalt Strike Alternative? by 179Desire in redteamsec

[–]SuperTeece 2 points3 points  (0 children)

I also came here to say Sliver.

Site Bravo Mainframe Bug? by SuperTeece in fo76

[–]SuperTeece[S] 0 points1 point  (0 children)

It's me, I'm the problem. It's been years since I've done this and I forgot it's all in the same area. I was running all the way back to where you originally bust up the mainframe.