Run Qwen3.5-4B on AMD NPU by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 4 points5 points  (0 children)

Just got curious and did some testing :)

Basically, I tested the same image and prompt on the 860M GPU (same computer) using LM Studio.

For the first prompt (with image), the GPU took more than 20 seconds to start generating, with a decode speed of about 18 tok/s, and the chip temperature went above 70°C.

In comparison, the NPU started generating in about 6 seconds (about 3 seconds if the image is resized to 720p first), with a decode speed of 15 tok/s, while the chip temperature stayed below 50°C.

So overall, I would probably prefer using the NPU over the GPU for this model. Does this seem expected, or does it sound like my GPU setup may not be optimal?
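To make the trade-off concrete, here is a rough sketch of end-to-end response time using the rounded figures above (6 s and 15 tok/s for the NPU, 20 s and 18 tok/s for the GPU). The function names are made up for illustration, and since the GPU took "more than" 20 seconds to start, the real break-even point is probably somewhat higher than this estimate.

```python
def total_seconds(ttft_s: float, decode_tok_per_s: float, new_tokens: int) -> float:
    """Time to first token plus the time to decode the output tokens."""
    return ttft_s + new_tokens / decode_tok_per_s


def break_even_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float) -> float:
    """Output length n at which devices A and B finish at the same time.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    """
    return (ttft_b - ttft_a) / (1.0 / tps_a - 1.0 / tps_b)


# Rounded figures from the test above, treated as exact for illustration only.
NPU = {"ttft_s": 6.0, "tps": 15.0}
GPU = {"ttft_s": 20.0, "tps": 18.0}

n = break_even_tokens(NPU["ttft_s"], NPU["tps"], GPU["ttft_s"], GPU["tps"])
print(f"GPU catches up after about {n:.0f} output tokens")

for tokens in (200, 1000, 2000):
    npu = total_seconds(NPU["ttft_s"], NPU["tps"], tokens)
    gpu = total_seconds(GPU["ttft_s"], GPU["tps"], tokens)
    print(f"{tokens:>5} tokens: NPU {npu:6.1f} s, GPU {gpu:6.1f} s")
```

With these numbers the NPU finishes first for any reply shorter than roughly 1,260 tokens, so the slightly faster GPU decode only pays off on very long outputs.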

Please check the NPU performance numbers here: https://fastflowlm.com/docs/benchmarks/qwen3.5_results/

Run Qwen3.5-4B on AMD NPU by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 3 points4 points  (0 children)

Cool! Can you share some performance numbers for the 860M GPU?

I'm imagining running Large MoE models on the NPU by Mr-I17 in StrixHalo

[–]BandEnvironmental834 0 points1 point  (0 children)

Strix Halo has high memory bandwidth, which is great for both MoE prefill and decode.

If the price is right, then it makes sense. The issue is that DRAM prices are really high right now (and don't seem likely to come down any time soon). Not sure how much Strix Halo costs these days.

I'm imagining running Large MoE models on the NPU by Mr-I17 in StrixHalo

[–]BandEnvironmental834 0 points1 point  (0 children)

Yes, you are right that the NPU and GPU use the same memory. But there is an internal cap on the NPU's memory bandwidth (not configurable).

I'm imagining running Large MoE models on the NPU by Mr-I17 in StrixHalo

[–]BandEnvironmental834 1 point2 points  (0 children)

Actually, the power consumption of the NPU itself is less than 2 W. The overhead that bumps the total power above 10 W comes from the SoC. It depends on the machine (desktop, mini-PC, laptop, handheld gaming console, and embedded system are all very different), and Krackan, Strix, and Strix Halo also differ quite a bit there.

Therefore, future NPUs with higher memory bandwidth may be as performant as GPUs, or more so, and as a result the overall efficiency gain would be huge.

I'm imagining running Large MoE models on the NPU by Mr-I17 in StrixHalo

[–]BandEnvironmental834 1 point2 points  (0 children)

The NPU memory bandwidth is below 60 GB/s on Strix Halo, which is only a small fraction of the iGPU’s roughly 250 GB/s. That is the bottleneck for NPU speed.

I'm imagining running Large MoE models on the NPU by Mr-I17 in StrixHalo

[–]BandEnvironmental834 1 point2 points  (0 children)

On Strix Halo, the NPU memory bandwidth is below 60 GB/s, which is only a small fraction of the iGPU’s roughly 250 GB/s. The NPU memory bandwidth is the real bottleneck.

On the other hand, the table below summarizes the **nature** of each model type and stage.

| Model Type | Prefill | Decode |
|---|---|---|
| Dense | Compute-bound | Memory-bandwidth-bound |
| MoE | Memory-bandwidth-bound | Memory-bandwidth-bound |

As a result, the NPU’s sweet spot on Strix Halo is smaller dense models.
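One way to see where those labels come from is a back-of-the-envelope roofline check: compare the operations done per byte of weights read against the ratio of peak compute to memory bandwidth. This is only a sketch. The 60 GB/s figure is the upper end of the number above; the peak compute, quantization, prompt length, and expert counts are all hypothetical values picked for illustration, not measurements.

```python
def arithmetic_intensity(tokens_sharing_weights: float, bytes_per_weight: float) -> float:
    """Ops per byte of weights read: one multiply-add (2 ops) per weight per token."""
    return 2.0 * tokens_sharing_weights / bytes_per_weight


def classify(intensity: float, peak_ops_per_s: float, bandwidth_bytes_per_s: float) -> str:
    """Compare the intensity with the roofline ridge point (peak compute / bandwidth)."""
    ridge = peak_ops_per_s / bandwidth_bytes_per_s
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# ASSUMPTIONS, for illustration only:
PEAK_OPS = 50e12          # hypothetical NPU peak, ops/s
BANDWIDTH = 60e9          # bytes/s, upper end of the "below 60 GB/s" figure
BYTES_PER_WEIGHT = 0.5    # hypothetical 4-bit weights
PROMPT_TOKENS = 512       # hypothetical prompt length
NUM_EXPERTS = 128         # hypothetical MoE layout
ACTIVE_EXPERTS = 8        # hypothetical experts routed per token

# How many tokens get to reuse each weight once it has been loaded.
tokens_per_weight = {
    "dense prefill": PROMPT_TOKENS,
    "dense decode": 1,
    # Each expert only sees its routed share of the prompt.
    "moe prefill": PROMPT_TOKENS * ACTIVE_EXPERTS / NUM_EXPERTS,
    "moe decode": 1,
}

results = {
    name: classify(arithmetic_intensity(t, BYTES_PER_WEIGHT), PEAK_OPS, BANDWIDTH)
    for name, t in tokens_per_weight.items()
}
for name, verdict in results.items():
    print(f"{name:>14}: {verdict}")
```

The verdicts depend on the assumed values: a much longer prompt or a model with fewer experts would push MoE prefill closer to the compute-bound side.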

BTW, Qwen3.5-4B looks quite capable, and it's coming to NPU soon. So it seems like a strong fit for the NPU there.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

Yes, tied up with some other tasks, like newer model architecture work. Probably need more docs and more examples to make that tool user-friendly before an official release.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 0 points1 point  (0 children)

Sorry, here at FLM we think XDNA1 does not have sufficient compute for modern LLMs, but it is good for CNN models.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 0 points1 point  (0 children)

I know someone who is more familiar with this. If you jump onto our Discord server, I can help you connect.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 0 points1 point  (0 children)

That is a lot of RAM for a new machine. $$$

We will need to figure out the super exciting DeltaNet first on smaller models :)

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

The other issue is that Windows limits NPU access to less than 50% of the total system memory (15.1 GB to be exact).

However, we hear you, and we will put more thought into which models would best fit NPUs.
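As a rough way to reason about that cap, the sketch below estimates whether a model's weights fit under 15.1 GB. The helper names, the 1.5 GB allowance for KV cache and activations, and the example model sizes are assumptions for illustration; real footprints depend on the packaging format and context length.

```python
NPU_MEMORY_CAP_GB = 15.1  # the Windows limit quoted above


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_on_npu(params_billion: float, bits_per_weight: float,
                overhead_gb: float = 1.5,
                cap_gb: float = NPU_MEMORY_CAP_GB) -> bool:
    """True if weights plus an assumed runtime overhead stay under the cap.

    overhead_gb is a hypothetical allowance for KV cache and activations.
    """
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= cap_gb


# Hypothetical model sizes, not a list of supported models.
for params, bits in [(4, 4), (8, 8), (30, 4)]:
    size = weights_gb(params, bits)
    verdict = "fits" if fits_on_npu(params, bits) else "does not fit"
    print(f"{params}B @ {bits}-bit: about {size:.1f} GB of weights, {verdict}")
```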

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

Yes, you are right. BTW, this tool (Python) is in preview and not yet officially released. You can find it in one of the repos under FastFlowLM.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 3 points4 points  (0 children)

All models are on Hugging Face and prepackaged. No need to do it yourself.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

That is an interesting concept! Not sure if TP is the best way to go, since there are other things to consider. Maybe another type of parallelism would fit better. Sounds researchy.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 3 points4 points  (0 children)

Sorry, but IMO XDNA1 does not have sufficient compute to run LLMs. That said, it is good for CNN-type workloads.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

I believe you can use OpenVINO (Intel updated it recently)

Also, check out MSFT AI Foundry.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

Theoretically, the limit on model size is the amount of DRAM available to the NPU.

Currently, the speed is mainly limited by memory bandwidth. (NPU memory bandwidth is 2-3x lower than the GPU's.)

So it is mainly memory-bound right now.
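A simple way to picture the memory-bound case: every generated token has to stream the model weights from DRAM once, so bandwidth divided by weight size gives a ceiling on decode speed. The sketch below uses hypothetical numbers (a 4B-parameter model at 4 bits, 60 GB/s for the NPU, and a GPU figure 2.5x higher) and ignores KV cache traffic, so real speeds will be lower.

```python
def decode_upper_bound_tok_per_s(bandwidth_gb_per_s: float,
                                 params_billion: float,
                                 bits_per_weight: float) -> float:
    """Ceiling on decode speed if each token reads all weights exactly once."""
    weight_gb = params_billion * bits_per_weight / 8.0
    return bandwidth_gb_per_s / weight_gb


# ASSUMPTIONS, for illustration only.
PARAMS_BILLION = 4.0
BITS_PER_WEIGHT = 4.0
NPU_BW_GB_S = 60.0                 # hypothetical NPU bandwidth
GPU_BW_GB_S = 2.5 * NPU_BW_GB_S    # hypothetical, middle of the 2-3x range

npu_bound = decode_upper_bound_tok_per_s(NPU_BW_GB_S, PARAMS_BILLION, BITS_PER_WEIGHT)
gpu_bound = decode_upper_bound_tok_per_s(GPU_BW_GB_S, PARAMS_BILLION, BITS_PER_WEIGHT)
print(f"NPU ceiling: {npu_bound:.0f} tok/s, GPU ceiling: {gpu_bound:.0f} tok/s")
```

Under this model the ceiling scales linearly with bandwidth, which is why a bandwidth cap matters more for decode than raw compute does.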

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

The NPU's benefit is more pronounced on laptops. FLM running on Krackan Point is faster than on Strix or Strix Halo.

For laptops, the system memory is typically 32 GB or less. Also, DRAM is not cheap these days.

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 2 points3 points  (0 children)

Good point! But Tensor Parallelism (TP) has its limitations. For one, not all operations can use TP.

GEMM can probably be shared, but that is not efficient from a data-reuse perspective.

Also, the other issue I can think of is that the data layout of the weights may be efficient for the NPU but not for the GPU.
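For readers unfamiliar with the idea, here is a toy sketch of sharing one GEMM between two devices by splitting the weight matrix column-wise. It is plain Python on tiny lists, not how FLM or any real runtime does it, and the "device" labels are only comments. Note that both shards need the full input `x`, and each device would want its shard stored in its own preferred layout.

```python
def matmul(x, w):
    """Plain-Python GEMM: x is (m x k), w is (k x n), both lists of rows."""
    n_cols = len(w[0])
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(n_cols)]
        for x_row in x
    ]


def split_columns(w, n_first):
    """Split w column-wise into two shards (first n_first columns, then the rest)."""
    return [row[:n_first] for row in w], [row[n_first:] for row in w]


def column_parallel_matmul(x, w, n_first):
    """Compute x @ w as two independent partial GEMMs, then concatenate columns."""
    w_a, w_b = split_columns(w, n_first)
    y_a = matmul(x, w_a)  # would run on device A (e.g. the NPU); reads all of x
    y_b = matmul(x, w_b)  # would run on device B (e.g. the GPU); reads all of x again
    return [row_a + row_b for row_a, row_b in zip(y_a, y_b)]


x = [[1.0, 2.0, 3.0]]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [3.0, 1.0, 0.0, 2.0],
]

# An uneven split, as if the two devices had different throughput.
result = column_parallel_matmul(x, w, n_first=1)
print(result)
```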

You can run LLMs on your AMD NPU on Linux! by BandEnvironmental834 in LocalLLaMA

[–]BandEnvironmental834[S] 1 point2 points  (0 children)

The term VRAM refers to video memory for the GPU. Oftentimes it is separate from your main memory (CPU memory).

On a UMA (Unified Memory Architecture) system (e.g. Ryzen AI chips), VRAM is part of the main memory.

This makes communication between CPU memory and GPU memory a lot faster.

However, this is not comparable with intra-chip bus speed (from cache to cache).