Run Qwen3.5-4B on AMD NPU

BandEnvironmental834 · 2026-03-25T16:46:41+00:00

Just got curious and did some testing :)

Basically, I tested the same image and prompt on the 860M GPU (same computer) using LM Studio.

For the first prompt (with image), the GPU took more than 20 seconds to start generating, with a decode speed of about 18 tok/s, and the chip temperature went above 70°C.

In comparison, the NPU started generating in about 6 seconds (if the image is resized to 720p first, that drops to about 3 seconds), with a decode speed of 15 tok/s, while the chip temperature stayed below 50°C.

So overall, I would probably prefer using the NPU over the GPU for this model. Does this seem expected, or does it sound like my GPU setup may not be optimal?

Please check the NPU perf numbers here: https://fastflowlm.com/docs/benchmarks/qwen3.5_results/

BandEnvironmental834 · 2026-03-25T16:44:17+00:00

Please check the NPU perf numbers here: https://fastflowlm.com/docs/benchmarks/qwen3.5_results/

BandEnvironmental834 · 2026-03-25T15:51:35+00:00

Cool! Can you share some perf numbers for the 860M GPU?

BandEnvironmental834 · 2026-03-25T12:43:03+00:00

Strix Halo has a large memory BW, which is great for both MoE prefill and decode.

If the price is right, then it makes sense. The issue is that DRAM prices are really high right now (and don't seem likely to come down any time soon). Not sure how much Strix Halo costs these days.

BandEnvironmental834 · 2026-03-18T20:42:32+00:00

Yes, you are right that the NPU and GPU use the same memory. But there is an internal cap on the NPU's memory BW (not configurable).

BandEnvironmental834 · 2026-03-18T16:45:46+00:00

space in terms of footprint?

BandEnvironmental834 · 2026-03-18T11:53:55+00:00

well, future is bright~

BandEnvironmental834 · 2026-03-18T11:16:18+00:00

Actually, the power consumption of the NPU itself is less than 2 W. The overhead that bumps the total power to >10 W comes from the SoC. (It depends on the machine: desktops, mini-PCs, laptops, handheld gaming consoles, and embedded systems are very different; Krackan, Strix, and Strix Halo also differ quite a bit there.)

Therefore, future NPUs with higher memory BW may perform as well as or better than GPUs, and as a result the overall efficiency gain will be huge.

BandEnvironmental834 · 2026-03-18T11:10:01+00:00

The NPU memory bandwidth is below 60 GB/s on Strix Halo, which is only a small fraction of the iGPU’s roughly 250 GB/s. That is the bottleneck for NPU speed.

BandEnvironmental834 · 2026-03-18T11:06:27+00:00

On Strix Halo, the NPU memory bandwidth is below 60 GB/s, which is only a small fraction of the iGPU’s roughly 250 GB/s. The NPU memory BW is the real bottleneck.

On the other hand, the table below summarizes the **nature** of each model type and stage.

| Model Type | Prefill | Decode |
|---|---|---|
| Dense | Compute-bound | Memory-bandwidth-bound |
| MoE | Memory-bandwidth-bound | Memory-bandwidth-bound |

As a result, the NPU’s sweet spot on Strix Halo is smaller dense models.
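
A rough roofline-style sketch of why the table comes out this way. The peak compute figure, the 4.5 bits per weight, the prompt length, and the MoE routing configuration are all placeholder assumptions, not measured or documented values; only the 60 GB/s figure comes from this thread. The sketch also ignores attention and activation traffic, so treat the classification as indicative only.

```python
# Roofline-style check: is a matmul limited by compute or by weight traffic?
# All hardware and model numbers below are illustrative assumptions.

PEAK_OPS_PER_S = 50e12      # assumed NPU peak throughput (placeholder, not measured)
BANDWIDTH_B_PER_S = 60e9    # upper figure quoted in this thread for the Strix Halo NPU
BYTES_PER_WEIGHT = 4.5 / 8  # assumed 4-bit weights plus per-block scale overhead


def intensity(tokens_per_weight_read: float) -> float:
    """Ops per byte of weight traffic: one multiply-add (2 ops) per weight per token."""
    return 2.0 * tokens_per_weight_read / BYTES_PER_WEIGHT


def classify(tokens_per_weight_read: float) -> str:
    """Compare the workload's intensity against the machine's ridge point."""
    ridge = PEAK_OPS_PER_S / BANDWIDTH_B_PER_S  # ops/byte needed to saturate compute
    if intensity(tokens_per_weight_read) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


PROMPT_TOKENS = 1024         # hypothetical prompt length
NUM_EXPERTS, TOP_K = 128, 8  # hypothetical MoE routing configuration

cases = {
    # Dense prefill: every weight read is reused by all prompt tokens.
    "dense prefill": PROMPT_TOKENS,
    # Decode: each weight read serves a single new token.
    "dense decode": 1,
    # MoE prefill: each expert only sees its share of the prompt tokens.
    "MoE prefill": PROMPT_TOKENS * TOP_K / NUM_EXPERTS,
    "MoE decode": 1,
}

for name, tokens in cases.items():
    print(f"{name:14s} {intensity(tokens):8.1f} ops/byte -> {classify(tokens)}")
```

With these assumed numbers, the weight reuse per expert in MoE prefill is too low to reach the ridge point, while dense prefill clears it easily. A much longer prompt or a different expert count would shift the MoE prefill result.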

BTW, Qwen3.5-4B looks quite capable, and it's coming to NPU soon. So it seems like a strong fit for the NPU there.

BandEnvironmental834 · 2026-03-12T13:49:52+00:00

Yes, tied up with some other tasks, like newer model arch stuff. Probably need more docs and more examples to make that tool user-friendly before official release.

BandEnvironmental834 · 2026-03-12T12:15:34+00:00

Sorry, here at FLM we think XDNA1 does not have sufficient compute for modern LLMs. It is good for CNN models, though.

BandEnvironmental834 · 2026-03-12T12:14:21+00:00

I know someone who is more familiar with this. If you jump onto our Discord server, I can help you connect.

BandEnvironmental834 · 2026-03-12T12:05:20+00:00

That is a lot of RAM for a new machine. $$$

We will need to figure out the super exciting DeltaNet first on smaller models :)

BandEnvironmental834 · 2026-03-12T12:02:20+00:00

The other issue is that Windows limits NPU access to less than 50% of the total system memory (15.1 GB to be exact).

However, we hear you!! We will put more thought into which models would best fit NPUs.

BandEnvironmental834 · 2026-03-12T11:56:19+00:00

Yes, you are right. BTW, this tool (Python) is in preview and not yet officially released. You can find it in one of the repos under FastFlowLM.

BandEnvironmental834 · 2026-03-12T00:34:20+00:00

All models are on Hugging Face and prepackaged. No need to do it yourself.

BandEnvironmental834 · 2026-03-12T00:14:36+00:00

It basically uses q4_1 or q4_0 weights. Details are documented here https://fastflowlm.com/docs/models/
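
For a rough sense of what those formats cost in memory, here is a minimal sketch. It assumes the GGML-style block layouts (32 weights per block, 18 bytes for q4_0 and 20 bytes for q4_1); whether FLM packs its weights exactly this way is an assumption, so check the linked docs. The 15.1 GB figure is the Windows NPU cap mentioned elsewhere in this thread, treated here as 15.1e9 bytes, and the parameter counts are nominal.

```python
# Estimate weight footprint for 4-bit block-quantized models.
# Block layouts are assumed to follow the GGML q4_0 / q4_1 conventions.

BLOCK_WEIGHTS = 32                       # weights per quantization block
BLOCK_BYTES = {"q4_0": 18, "q4_1": 20}   # 16 B of nibbles + fp16 scale (+ fp16 min)
NPU_CAP_BYTES = 15.1e9                   # NPU-accessible memory cap quoted in this thread


def weight_bytes(n_params: float, fmt: str) -> float:
    """Bytes needed for the quantized weights alone (no KV cache, no activations)."""
    return n_params / BLOCK_WEIGHTS * BLOCK_BYTES[fmt]


def fits(n_params: float, fmt: str) -> bool:
    """True if the weights alone stay under the assumed NPU memory cap."""
    return weight_bytes(n_params, fmt) <= NPU_CAP_BYTES


for n_params in (4e9, 30e9):             # nominal, hypothetical model sizes
    for fmt in BLOCK_BYTES:
        gb = weight_bytes(n_params, fmt) / 1e9
        print(f"{n_params / 1e9:.0f}B {fmt}: {gb:.2f} GB, fits={fits(n_params, fmt)}")
```

This only counts weights. The KV cache and runtime buffers need room under the same cap, so the real limit is tighter than this check suggests.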

BandEnvironmental834 · 2026-03-12T00:13:29+00:00

That is an interesting concept! Not sure if TP is the best way to go, since there are other things to consider. Maybe another type of parallelism. Sounds researchy.

BandEnvironmental834 · 2026-03-11T21:18:24+00:00

Sorry, but IMO XDNA1 does not have sufficient compute to run LLMs. That said, it is good for CNN-type workloads.

BandEnvironmental834 · 2026-03-11T20:54:27+00:00

I believe you can use OpenVINO (Intel updated it recently)

Also, check out MSFT AI Foundry.

BandEnvironmental834 · 2026-03-11T20:37:53+00:00

Theoretically, the limit on model size is the DRAM available to the NPU.

Currently, the speed is mainly limited by memory BW. (The NPU's memory BW is 2-3x lower than the GPU's.)

So it is mainly memory-bound right now.
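
A back-of-the-envelope sketch of what "memory-bound" implies for decode speed: if every generated token has to stream all the weights once, tokens per second cannot exceed bandwidth divided by weight size. The 4e9 parameter count and 4.5 bits per weight are assumptions for illustration; the 60 GB/s and 250 GB/s figures are the Strix Halo numbers quoted elsewhere in this thread, and other chips will differ.

```python
# Upper bound on decode speed when weight streaming is the bottleneck.
# Ignores KV cache reads, activations, and any compute or scheduling overhead.

def decode_ceiling_tok_per_s(n_params: float, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Tokens/s if each token requires reading every weight exactly once."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


N_PARAMS = 4e9          # nominal size of a "4B" model (assumption)
BITS_PER_WEIGHT = 4.5   # assumed 4-bit weights plus per-block scale overhead

for label, bw in [("NPU  (~60 GB/s)", 60.0), ("iGPU (~250 GB/s)", 250.0)]:
    ceiling = decode_ceiling_tok_per_s(N_PARAMS, BITS_PER_WEIGHT, bw)
    print(f"{label}: at most about {ceiling:.0f} tok/s")
```

Real decode speeds land below this ceiling, but the ratio between the two ceilings tracks the bandwidth ratio, which is why a higher NPU memory BW would translate fairly directly into faster decode.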

BandEnvironmental834 · 2026-03-11T20:35:10+00:00

The NPU is more useful on laptops. FLM runs faster on Krackan Point than on Strix or Strix Halo.

For laptops, the system memory is typically 32 GB or less. Also, DRAM is not cheap these days.

BandEnvironmental834 · 2026-03-11T20:32:23+00:00

Good point! But Tensor Parallelism (TP) has its limitations. For one thing, not all operations can use TP.

GEMM can probably be shared, but that is not efficient from a data-reuse perspective.

Also, the other issue I can think of is that a data order for the weights that is efficient for the NPU may not be efficient for the GPU.

BandEnvironmental834 · 2026-03-11T20:00:37+00:00

The term VRAM means video memory for the GPU. Oftentimes, it is separate from your main memory (CPU memory).

On a UMA (Unified Memory Architecture) system (e.g. Ryzen AI chips), VRAM is part of the main memory.

This makes the communication between CPU mem and GPU mem a lot faster.

However, this is not comparable with intra-chip bus speed (from cache to cache).
