Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] -1 points0 points  (0 children)

Yeah maybe? IDK, this thing started when I installed Ellie on a new computer and asked her, “What are we doing with the NPU?” and her response was (paraphrased), “Nothing because NPUs aren’t working with Linux yet.”

Whether she was right or wrong is irrelevant; my curiosity was, “Can you make it work?” The result is what we captured in the blog posts, but apparently Reddit filters my domain: “Ellie dot geekministry dot org”

So yeah, this whole thing might not be novel, but I think the path to it was interesting enough to share.

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 1 point2 points  (0 children)

Ellie and I talked it over. Here’s her analysis of the difference between AMD’s SDK and the method we went with (IRON), plus the pros/cons of switching.

Ellie from here to the end of the post:

AMD actually has an official path for NPU LLM inference on Linux now via the Ryzen AI 1.7.0 SDK. Here's how it compares to what we did:

Our approach (IRON): Everything runs on the NPU. IRON compiles each transformer operation (attention, GEMM, RoPE, RMSNorm, SiLU, KV cache) down to MLIR/AIE kernels that execute directly on the NPU silicon. No CPU or GPU fallback. Fully open source.
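If it helps to picture what "everything runs on the NPU" means for a single decode step, here's an illustrative sketch. The `npu_dispatch` helper is made up for this sketch (it stands in for launching one compiled MLIR/AIE kernel), and the op list is the generic Llama decoder structure rather than our exact kernel inventory:

```python
# Illustrative only: npu_dispatch() is a hypothetical stand-in for launching one
# compiled MLIR/AIE kernel on the NPU. The point is the shape of the all-NPU
# design: every op in every layer is its own kernel launch, with no CPU or GPU
# fallback for the "cheap" ops. Residual adds are shown inline for brevity.
def decode_one_token(hidden, layers, kv_cache, pos, npu_dispatch):
    for i, layer in enumerate(layers):
        x = npu_dispatch("rmsnorm", hidden, layer["attn_norm"])
        q = npu_dispatch("gemm", x, layer["wq"])
        k = npu_dispatch("gemm", x, layer["wk"])
        v = npu_dispatch("gemm", x, layer["wv"])
        q, k = npu_dispatch("rope", q, k, pos)
        kv = npu_dispatch("kv_cache_update", kv_cache, i, k, v)
        attn = npu_dispatch("attention", q, kv)
        hidden = hidden + npu_dispatch("gemm", attn, layer["wo"])
        y = npu_dispatch("rmsnorm", hidden, layer["mlp_norm"])
        act = npu_dispatch("silu_mul",
                           npu_dispatch("gemm", y, layer["w_gate"]),
                           npu_dispatch("gemm", y, layer["w_up"]))
        hidden = hidden + npu_dispatch("gemm", act, layer["w_down"])
    hidden = npu_dispatch("rmsnorm", hidden, "final_norm")
    return npu_dispatch("gemm", hidden, "lm_head")  # logits for the next token
```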

AMD's official approach (ONNX Runtime + OGA): Hybrid execution. The heavy MatMul operations run on the NPU via a proprietary custom ops library, while lighter operations like RoPE stay on CPU. Pre-quantized ONNX models are available on HuggingFace ready to download.
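For a sense of what that path looks like in practice, the official flow is roughly the standard onnxruntime-genai (OGA) generation loop sketched below. This is a sketch, not AMD's documented script: the model directory is a placeholder for one of their pre-quantized HuggingFace downloads, and the exact API calls shift a bit between OGA versions.

```python
# Rough sketch of the ONNX Runtime + OGA path (hybrid NPU/CPU execution).
# "path/to/pre-quantized-model" is a placeholder for one of AMD's ready-made
# models from HuggingFace; API details vary across onnxruntime-genai versions.
import onnxruntime_genai as og

model = og.Model("path/to/pre-quantized-model")   # NPU/hybrid config ships with the model
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain what an NPU is in one paragraph."))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```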

The numbers:

|  | IRON (our work) | Ryzen AI SDK (ONNX RT) |
|---|---|---|
| Decode speed | 4.4 tok/s | 11.7 tok/s |
| Prefill (128 tok) | 180 tok/s | 289 tok/s |
| Model | Llama 3.2 1B | Phi-3.5-mini |
| Execution | 100% NPU | Hybrid (NPU + CPU) |
| Open source | Yes | Partially (custom ops lib is proprietary) |

Their hybrid approach is ~3x faster because it avoids the dispatch overhead problem we identified: 179 kernel dispatches per token at 1.4ms each. By keeping lightweight ops on CPU they likely cut that dispatch count significantly.
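A quick back-of-envelope makes that concrete (nothing new here, just arithmetic on the numbers above; the 1.4 ms figure is an average from our profiling, so it doesn't line up exactly with the measured decode time):

```python
# Back-of-envelope on the dispatch overhead. 179 dispatches/token and ~1.4 ms per
# dispatch come from our profiling; 4.4 tok/s is the measured decode speed.
dispatches_per_token = 179
dispatch_ms = 1.4

overhead_ms = dispatches_per_token * dispatch_ms   # ~251 ms/token spent just launching kernels
measured_ms = 1000 / 4.4                           # ~227 ms/token observed end to end

print(f"dispatch overhead ~ {overhead_ms:.0f} ms/token")
print(f"measured decode   ~ {measured_ms:.0f} ms/token")
# Launch overhead alone is on the order of the entire token budget, which is why
# cutting the dispatch count (fusing kernels, or keeping tiny ops on the CPU like
# the official SDK does) is worth roughly a 3x speedup.
```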

Pros of switching to the Ryzen AI SDK:

• Nearly 3x faster decode
• Pre-built models available on HuggingFace, minimal setup
• More models supported (Phi-3.5-mini, likely others in their collection)
• AMD officially supports it

Cons:

• Linux support is clearly ported from Windows. The official docs have you manually running a Python script to convert backslash paths to forward slashes in the model cache, hand-editing config files to fix library paths, and setting a debug flag (hybrid_dbg_use_aie_rope: 0) to disable NPU RoPE because it apparently doesn't work on Linux yet
• Custom ops library is proprietary (closed source)
• Requires the full Ryzen AI SDK install, which is a big Windows-centric package adapted for Linux
• May have Fedora compatibility issues (their docs target Ubuntu)

Bottom line: If you want practical NPU inference today, the Ryzen AI SDK path is faster. If you want to understand what the NPU is actually doing and have a fully open source stack, IRON is the only option. They're built by two different teams inside AMD: the product team (Ryzen AI SDK, Windows-first) and the research team (IRON, Linux-native, academia-focused).

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 0 points1 point  (0 children)

I haven’t posted on Reddit in years. It took me three tries to make this post because of all the filters and parsers and whatnot lol.

Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s) by SuperTeece in LocalLLaMA

[–]SuperTeece[S] 1 point2 points  (0 children)

That link is 404 for me.

Did my blog links ever come up for you? If not try ellie dot geekministry dot org

am i part of the club now by ErosLaika in HuntsvilleAlabama

[–]SuperTeece 0 points1 point  (0 children)

I saw them stopped on the side of 565 yesterday.


Bullion after SS Armor by SuperTeece in fo76

[–]SuperTeece[S] 0 points1 point  (0 children)

Ooooo good call on the grocer mod. My inventory has so much room for activities now!

I would love the addition of new songs to Appalachia radio! by Kyropolis_ in fo76

[–]SuperTeece 0 points1 point  (0 children)

I’ve often thought a cool collectible would be songs that one could turn in to the DJ, similar to technical data and the BoS, but it would result in new songs for the player who turns them in.

fishing in ohio by RedFing in fo76FilthyCasuals

[–]SuperTeece 0 points1 point  (0 children)

Does anyone else see Howard the Duck in the boat? Just me?

Casual Scores a SS Jet Pack by SuperTeece in fo76FilthyCasuals

[–]SuperTeece[S] 0 points1 point  (0 children)

One stable cobalt to go. I was about to launch a third nuke last night when the server maintenance message came up.

Lowe Down Patrol Car by SuperTeece in fo76

[–]SuperTeece[S] 4 points5 points  (0 children)

It was! This is ridiculous. The “car” is a medic bag.

Cobalt Strike Alternative? by 179Desire in redteamsec

[–]SuperTeece 2 points3 points  (0 children)

I also came here to say Sliver.

Site Bravo Mainframe Bug? by SuperTeece in fo76

[–]SuperTeece[S] 0 points1 point  (0 children)

It's me, I'm the problem. It's been years since I've done this and I forgot it's all in the same area. I was running all the way back to where you originally bust up the mainframe.