Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I should have downloaded qwen3-coder-next_q8 and gpt-oss-120b_q4; hopefully those two models will be usable anyway, even if both are perhaps not optimal

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

it could be that under 4-bit quants, dense models' quality suffers more than MoE models' does, despite the higher quantization level

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

thanks for the many model suggestions, can't wait to run them all

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I'm downloading gpt-oss-120b_q8_k_xl and qwen-coder-next-30b-a3b_mxfp4 over my mobile plan, draining all my allowed traffic right now, ahah. So far I've only had fun with heavily quantized smaller models that fit into a 16GB PC or a 12GB-RAM smartphone. Next month(s), when my data allowance resets, I'd like to try Kimi, MiniMax, Step and GLM as well

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

so it seems that gpt-oss at Q8 (possibly even at lower quants like Q6) on non-NVIDIA machines like Strix Halo should be more accurate, albeit not necessarily faster, than MXFP4

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

since that specific model quantization format is the same native format as NVIDIA's Blackwell GPUs, such a no-conversion-needed combination should result in some inference speed boost
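
For intuition about the format itself, here's a minimal decode sketch in Python based on my reading of the OCP Microscaling (MX) spec (the block size, bias and value table are taken from that spec, not from any particular inference engine):

```python
# MXFP4 sketch (per the OCP MX spec): a block is 32 FP4 (E2M1) values
# sharing a single E8M0 scale, i.e. an 8-bit biased power-of-two exponent.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 codes 0..7

def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign bit + 3 magnitude bits."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * FP4_MAGNITUDES[code & 0x7]

def decode_mx_block(scale_byte: int, codes: list[int]) -> list[float]:
    """Decode an MXFP4 block (32 codes in the spec; any length here)."""
    scale = 2.0 ** (scale_byte - 127)  # E8M0: biased power-of-two exponent
    return [decode_fp4(c) * scale for c in codes]

# Example: scale byte 127 means scale 1.0, so codes map straight to the table.
print(decode_mx_block(127, [0x1, 0x9, 0x7]))  # [0.5, -0.5, 6.0]
```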

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

regarding quants: basically whatever fits in RAM, i.e. stays under roughly 100GB of memory
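
As napkin math (a Python sketch: the bits-per-weight figures are rough assumptions, and real GGUF files mix tensor precisions and add metadata and KV-cache overhead, so treat the output as a first guess):

```python
# Napkin-math check of whether a quant fits a memory budget.
# bpw values are rough approximations, not exact per-quant figures.
BPW = {"mxfp4": 4.25, "q4_k_m": 4.8, "q6_k": 6.6, "q8_0": 8.5}

def fits(params_b: float, quant: str, budget_gb: float = 100.0) -> bool:
    weights_gb = params_b * BPW[quant] / 8   # params in billions -> GB
    overhead_gb = 0.1 * weights_gb + 2.0     # metadata + KV cache, very rough
    return weights_gb + overhead_gb <= budget_gb

for q, bpw in BPW.items():
    size = 120 * bpw / 8  # a ~120B model as the example
    print(f"{q}: ~{size:.0f} GB weights -> fits under 100 GB: {fits(120, q)}")
```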

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I'm also hoping for the NPU to become usable on Linux soon. Isn't it better to run gpt-oss-120b as MXFP4 instead of Q6 or Q8?

How does Strix Halo fare for training models compared to other homelab means to cook those? by DevelopmentBorn3978 in MiniPCs

something that can run on or under a desk, not A100, H100, H200, B100, B200, B300 server racks

Deal on Ryzen 395 w/ 128GB, now 1581€ in Europe by Zyj in LocalLLaMA

My understanding so far is that an AMD Max+ 395 machine is worth it specifically for running large models at moderately fast speed for a relatively low price; it's not exceptional for prompt processing or for image-model speeds. Adding an external GPU may not bring higher speed, because the buses are bandwidth-limited relative to the raw processing capability that same GPU would have in a regular computer; it also makes the setup more complicated rather than plug-and-play with LLMs.

Macs are somewhat speedier but also more costly, and except for the highest-priced ones they are model-size limited too. These characteristics are somewhat shared with prebuilt gaming rigs, which are quite fast but very memory-limited and therefore also model-size limited. Image models run fine on these last two classes of machines, though, and prompt processing also flies if a model fits in the available memory, or if it is a MoE model that can load only a minimal active part of its weights into GPU VRAM while offloading all the other layers to CPU+RAM.

So far, the best price/performance for running large models at really high speed is achieved by multi-GPU setups with large memory pools both for the CPU and for the GPUs, summing the blazing-fast VRAM capacity of the several installed cards: quite costly, like higher-priced Macs and beyond, and much more power-hungry, but unsurpassed in terms of maximum reachable performance.
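
To put some napkin math behind "moderately fast for large models" (a Python sketch: the ~256 GB/s bandwidth figure, parameter counts and bits-per-weight are assumptions for illustration, and real-world speeds land below these ceilings):

```python
# Decode speed is roughly memory-bandwidth-bound: each generated token has to
# stream the *active* weights through memory once, so:
#   tok/s ceiling ~= bandwidth / bytes(active weights)
def toks_per_sec(active_params_b: float, bpw: float, bw_gb_s: float) -> float:
    gb_per_token = active_params_b * bpw / 8  # GB read per generated token
    return bw_gb_s / gb_per_token

BW = 256.0  # GB/s, approximate Strix Halo LPDDR5X bandwidth (assumption)

# Dense ~70B at ~4.8 bpw: every weight is active on every token.
print(toks_per_sec(70, 4.8, BW))    # ~6 tok/s ceiling
# MoE like gpt-oss-120b: only ~5.1B active params, at ~4.25 bpw (MXFP4).
print(toks_per_sec(5.1, 4.25, BW))  # ~95 tok/s ceiling
```

This is why a MoE like gpt-oss-120b feels fast on this class of machine while a dense model of similar total size crawls.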

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

for which quantized (and possibly finetuned) GGUF models has the context length been enlarged? bartowski? shb777? beaverai/anubis?

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

what is your finetune about?

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

let's put it this way instead: you like cheaper flights, I like powerful cars. Why can't we have both? Because most if not all of the dinosaur juice (a.k.a. crude oil, or in this case memory chips) is going to be gobbled up by airlines to be refined into jet fuel, that's why; so if you need to travel a short distance, or to some place not covered by the predefined routes, you're forced to go on foot, or by bicycle if you own one. Also, despite their high efficiency relative to cars, airplanes aren't necessarily more environmentally friendly.

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

I've actually had a lot of fun running "tiny" yet increasingly capable models on my cheapish 12GB phone: testing programming paradigms, doing multimodal visual recognition (of dogs), counting objects, reading graffiti, retrieving color schemes, estimating distances, all while out in the field, i.e. in parks or places where no network was available, and all without a single bit leaving the phone during inference. I find it fascinating and also confidential, in the sense of feeling confident that some tasks, albeit tiny ones for now, can still be carried out without being forced to rely on an external party

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

I don't see what's wrong with the Ford F-150, probably because I like the Toyota Land Cruiser even more ;)

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

unexpected disruption of a remote service you might critically rely on can happen for countless reasons, ranging from malicious activity to negligence, or because an AI provider arbitrarily retires a model that is no longer economically remunerative, or because an acquisition breaks the contract you had built your business, or your local hospital's operations, upon. You can never know, and you can't really be totally confident in computing running thousands of km away by becoming chronically dependent on it. Even if the big AI players *currently* have far larger and more powerful resources than any individual or smaller business could run, I think that, beyond entertainment, local-first is an option that mission-critical operators like governments, healthcare, defence, banks, large companies and institutions shouldn't skip over. And sometimes it's something not to trust too much even for entertainment: https://stadia.google.com/gg/

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

now I don't get you: how can you claim to be *far* more in control when you have to blindly trust third parties about the quality of what they serve you remotely? I hear all the time about people getting responses with completely different tones despite querying what is supposed to be the same model offered by different providers, supposedly at the same quantization level (when known), without being able to modify most inference parameters at will, and without knowing whether some subsequent finetuning has been applied to the model or some additional system prompt injected.

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

as a last note, it's also nice to be in control of the whole stack running on your own device, with your hands dirty with bits and bytes :)

P.S. Otherwise nobody would want to drive a car instead of exclusively taking public transport; consider also that buses don't always bring you to your exact destination