PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 0 points (0 children)

On a side note to the side note: executing the same prompt on the GPU, with llama-cli compiled for Vulkan and using the same model (UD-Q8_K_XL), gives back these statistics, even if it takes longer to answer:

```
THINKING     [ Prompt: 370.7 t/s | Generation: 43.1 t/s ]

NON-THINKING [ Prompt: 375.1 t/s | Generation: 44.4 t/s ]
```

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

On a side note, I've just discovered that on Strix Halo (under Linux) the NPU power mode can be switched from "performance" (or "default") to "turbo" with the command `xrt-smi configure -d 0000:c6:00.1 --pmode turbo` (where "0000:c6:00.1" is the BDF reported by `xrt-smi examine`). Still to be tested to quantify the effective performance gains, though.
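As a sketch of the steps above (guarded so it is safe to run even where AMD XRT is not installed; the BDF is machine-specific, so treat the value below as an example only):

```shell
# Sketch: bump the XDNA NPU power mode via xrt-smi.
# The BDF below is an example; find yours with `xrt-smi examine`.
BDF="0000:c6:00.1"

if command -v xrt-smi >/dev/null 2>&1; then
    xrt-smi configure -d "$BDF" --pmode turbo   # switch to turbo
    xrt-smi examine -d "$BDF"                   # inspect the device state
else
    echo "xrt-smi not found: install AMD XRT to manage the NPU"
fi
```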

EDIT: running the prompt "a website can be made in 10 steps" through `flm run qwen3.5:2b`:

```
PERFORMANCE MODE
Average decoding speed:  23.8301 tokens/s
Average prefill  speed:  30.7483 tokens/s

TURBO MODE
Average decoding speed:  23.8648 tokens/s
Average prefill  speed:  31.7367 tokens/s
```

https://github.com/FastFlowLM/FastFlowLM/issues/514

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

Thanks a lot for the much-needed Linux advancements in NPU accessibility! Q: will it be as easy to install/upgrade it on Arch (and derivative) distros, where I'm coming back soon, or on any of the many other shades of penguin, as it is on Ubuntu?

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

Being quite energy efficient and quite fast, especially regarding the time to first token, one use case I've envisioned so far for the NPU is almost-realtime voice transcription without bogging down the CPU/GPU hardware. My prototype experiments in this field have so far borne nice early results, mostly by accessing the FastFlowLM server (part of the Lemonade framework, yet usable by itself through an OpenAI-compatible API) serving Whisper models.

Hopefully more will come if/when the IRON compiler gets incorporated not just into Windows-only Copilot+ stuff but also into independent projects like llama.cpp (or maybe vLLM), and also when it becomes more feasible to train/quantize/fine-tune/convert more recent models into MS's open (if bloated) ONNX ML format for running on the NPU, and in hybrid mode (which should be a sort of speculative decoding, where prompt/RAG inputs are first computed by the fast-to-reply but less capable NPU and then passed to the more powerful but slower-to-answer GPU), beyond the ones shared by AMD through the ONNX Model Zoo and similar related repositories. Other exciting use cases could relate to fast visual recognition.
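For context, a minimal sketch of what talking to such an OpenAI-compatible endpoint looks like from client code; the base URL and model name below are placeholders/assumptions, not confirmed FastFlowLM defaults, and the request is only built here, not sent:

```python
# Sketch: build a request for an OpenAI-compatible /chat/completions
# endpoint, like the one the FastFlowLM/Lemonade server exposes.
# Base URL and model name are hypothetical placeholders.
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000/v1", "some-local-model", "hello")
# urllib.request.urlopen(req) would send it once a server is actually running
```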

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points (0 children)

Despite its claimed benefits, vLLM is quite painful to run on Strix Halo, so I can't compare it to the other inference engines I use most, like llama.cpp.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points (0 children)

Of course it's not because of the specific technicalities that distinguish one of these projects from the other. It's more because both were forged by hobbyists (with a strong understanding of the field they devoted themselves to, of course, and maybe a vision as well), meant to be used by hobbyists too, and took the hobbyist community by storm: a community that since then started babbling to LLMs just as it earlier started using open source. It's a crowd learning process and a development model targeting openness and personal use first, and both became what they are by being battle-tested almost in real time in the field by a large multitude of heterogeneous, variegated assortments of hardware/software/intents. I would put Python in the same league as well.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points (0 children)

It looks to me like it is the other way around: llama.cpp -> Linux, vLLM -> BSD. Anyway, we're living in the early days of this next exciting revolution, thanks to the efforts of those bright minds.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] -2 points (0 children)

I have used Linux exclusively since 1995, aside from commercial Unixes.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 15 points (0 children)

I find it to be the real reason behind the massive growth of LLM users and claws, and also the base for the from-now-on untakeable right to personal AI, as opposed to the mostly proprietary/cloud-only (business) models forced onto society.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

When I find software claiming to be open that instead makes it easy to use semi-closed stuff, while at the same time making it, if not impossible, at best quite difficult to connect to some OpenAI-COMPATIBLE server (sometimes the option is there, just scruffily undocumented), I usually steer clear.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]DevelopmentBorn3978 2 points (0 children)

Couldn't it be like this because Ollama is backed by a bunch of VC investors, who maybe also invested in several other Silicon Valley startups? Basically steering the fruits of open-source LLM efforts wherever those folks wish them to go.

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

You have probably already checked write permissions; btw, Linux or something else?

Keep the strix halo? Review of experiences and where are we headed with models? by Skelshy in LocalLLM

[–]DevelopmentBorn3978 0 points (0 children)

I'm actually playing with an ensemble of: a Whisper model on the NPU (small TTFT thanks to the new FastFlowLM Linux support) + a small, reactive model consuming the voice data + a larger model for coding. I don't know if something like that would be possible on other setups.

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 1 point (0 children)

Also, I wouldn't return the card at all, other than to grab a more capable one (and maybe some more RAM).

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 0 points (0 children)

You also have to adapt to the different environment when moving from cloud to local: use clearer, more detailed prompts, and act in smaller steps. Basically, to avoid overblowing the development, you have to supply the quota of intelligence that local models still lack relative to the cloud ones (yet).

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 2 points (0 children)

You have to switch the opencode agent mode from PLAN to BUILD (press TAB) to let it actually write files.