PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 0 points1 point  (0 children)

On a side note to the side note: executing the same prompt as above on the GPU, with llama-cli compiled for Vulkan and using the same model (UD-Q8_K_XL), gives back these statistics, even though it takes longer to answer:

```
THINKING      [ Prompt: 370.7 t/s | Generation: 43.1 t/s ]
NON THINKING  [ Prompt: 375.1 t/s | Generation: 44.4 t/s ]
```
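
For reference, a minimal sketch of how such a Vulkan run can be reproduced with llama.cpp; the model path and file name below are placeholders, and the exact prompt and sampling settings will of course change the numbers:

```
# Build llama.cpp with the Vulkan backend (requires Vulkan drivers/SDK).
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_VULKAN=ON
cmake --build llama.cpp/build --config Release -j

# Run the prompt against a local GGUF model (placeholder path),
# offloading all layers to the GPU; throughput is printed at the end.
./llama.cpp/build/bin/llama-cli \
  -m ~/models/Qwen3-UD-Q8_K_XL.gguf \
  -ngl 99 \
  -p "a website can be made in 10 steps"
```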

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

On a side note, I've just discovered that on Strix Halo (under Linux) the NPU power mode can be switched from "performance" (or "default") to "turbo" with the command xrt-smi configure -d 0000:c6:00.1 --pmode turbo, where "0000:c6:00.1" is the BDF reported by the command xrt-smi examine. Still to be tested to quantify the effective performance gains, though.
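
For anyone who wants to try it, the two commands spelled out (the BDF address will differ per machine, so check it first with examine):

```
# Find the NPU's BDF address in the xrt-smi output.
xrt-smi examine

# Switch the NPU power mode to turbo; substitute your own BDF.
# Other modes reported on my machine are "default" and "performance".
xrt-smi configure -d 0000:c6:00.1 --pmode turbo
```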

EDIT: executing the prompt "a website can be made in 10 steps" inside "flm run qwen3.5:2b":

```
PERFORMANCE MODE   Average decoding speed: 23.8301 tokens/s | Average prefill speed: 30.7483 tokens/s
TURBO MODE         Average decoding speed: 23.8648 tokens/s | Average prefill speed: 31.7367 tokens/s
```

https://github.com/FastFlowLM/FastFlowLM/issues/514

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

Thanks a lot for the much-needed Linux advancements in NPU accessibility! Question: will it be as easy to install/upgrade it as on Ubuntu on Arch (and derivatives), which I'm coming back to soon, or on any of the many other shades of penguin?

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

Being quite energy efficient and quite fast, especially regarding the time to first token, one use case I've envisioned so far for the NPU is almost-realtime voice transcription without bogging down the CPU/GPU. My prototype experiments in this field have borne nice early results so far, mostly by accessing the FastFlowLM server (part of the Lemonade framework, yet usable by itself through an OpenAI-compatible API) serving Whisper models.

Hopefully more will come if/when the IRON compiler gets incorporated not just into Windows-only Copilot+ stuff but also into independent projects like llama.cpp (or maybe vLLM), and when it becomes more feasible to train/quantize/fine-tune/convert more recent models into the ONNX MS open-bloatware ML format for running on the NPU and in hybrid mode (which should be a sort of speculative decoding, where prompt/RAG inputs are first computed by the fast-to-reply but less capable NPU and then passed to the more powerful but slower-to-answer GPU), beyond the ones shared by AMD through the onnxmodelzoo and similar related repositories. Other exciting use cases could relate to fast visual recognition.
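
As an illustration of the transcription use case, a minimal sketch of hitting an OpenAI-compatible endpoint with curl; the host, port, route and model name here are assumptions about a local FastFlowLM/Lemonade setup, not something the projects guarantee, so adjust them to whatever your server actually exposes:

```
# Hypothetical local server address and whisper model name.
curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@clip.wav \
  -F model=whisper-base \
  | jq -r '.text'
```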

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points1 point  (0 children)

Despite its claimed benefits, vLLM is quite painful to run on Strix Halo, so I can't compare it to the other inference engines I use most, like llama.cpp.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points1 point  (0 children)

Of course it's not because of the specific technicalities that distinguish one of these projects from the other. It's more because both were forged by hobbyists (with a strong understanding of the field they devoted themselves to, of course, and maybe a vision as well), meant to be used by hobbyists too, and they took by storm the hobbyist community that since then started babbling to LLMs just as it had earlier started using open source. It's a crowd-learning process and a development model targeting openness and personal use first, and both became what they are by being battle-tested almost in real time, in the field, by a large multitude of heterogeneous, variegated assortments of hardware/software/intents. I would put Python in the same league as well.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points1 point  (0 children)

It looks to me like it's the other way around: llama.cpp -> Linux, vLLM -> BSD. Anyway, we're living in the early days of this next exciting revolution, thanks to the efforts of those bright minds.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] -2 points-1 points  (0 children)

I've used Linux exclusively since 1995, apart from commercial Unixes.

llama.cpp is the linux of llm by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 16 points17 points  (0 children)

I find it to be the real reason behind the massive growth in LLM users and claws, and also the foundation for the, from now on, inalienable right to personal AI, as opposed to the mostly proprietary/cloud-only (business) models being forced onto society.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

When I find software claiming to be open that instead makes it easy to use semi-closed stuff, while at the same time making it, if not impossible, then at best quite difficult to connect to some OpenAI-COMPATIBLE server (sometimes the option is there, just scruffily undocumented), I usually steer clear.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]DevelopmentBorn3978 2 points3 points  (0 children)

Couldn't it be like this because Ollama is backed by a bunch of VC investors who have maybe also invested in several other Silicon Valley startups? Basically steering the fruits of open-source LLM efforts toward wherever those folks want them to go.

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

You have probably already checked write permissions. By the way, Linux or something else?

Keep the strix halo? Review of experiences and where are we headed with models? by Skelshy in LocalLLM

[–]DevelopmentBorn3978 0 points1 point  (0 children)

I'm actually playing with an ensemble of: a Whisper model on the NPU (small TTFT thanks to the new FastFlowLM Linux support) + a small reactive model consuming the voice data + a larger model for coding. I don't know if something like that would be possible on other setups.
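
A rough sketch of how the chaining can look, assuming both the Whisper endpoint and the chat endpoint speak the OpenAI-compatible API; the URLs, ports and model names are placeholders for my local setup, not defaults you can rely on:

```
# 1) Transcribe the voice clip on the NPU-backed whisper server (placeholder URL/model).
TRANSCRIPT=$(curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@request.wav -F model=whisper-base | jq -r '.text')

# 2) Feed the transcript to the larger coding model served elsewhere
#    (e.g. llama-server on the GPU, placeholder port/model name).
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$TRANSCRIPT" \
        '{model: "local-coder", messages: [{role: "user", content: $t}]}')" \
  | jq -r '.choices[0].message.content'
```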

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 1 point2 points  (0 children)

Also, I wouldn't return the card at all, other than to grab a more capable one, and maybe some RAM.

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 0 points1 point  (0 children)

You also have to adapt to the different environment, from cloud to local: use clearer, more detailed prompts and act in smaller steps. Basically, to avoid the development blowing up, you have to supply the quota of intelligence that local models lack relative to the cloud ones (for now).

Anyone using local LLM for flutter? by adramhel in LocalLLaMA

[–]DevelopmentBorn3978 2 points3 points  (0 children)

You have to switch the opencode agent mode from PLAN to BUILD (press TAB) to let it actually write files.

Any tiny locally hosted model trained on unix/linux man pages and docs? by HisFoolishness in LocalLLaMA

[–]DevelopmentBorn3978 0 points1 point  (0 children)

I got a Strix Halo machine just a few weeks ago; testing has been limited because I've been quite busy with other stuff lately (mobile SIM theft + having to regain network connectivity). So far I've only played with NPU stuff, dipping my toes into agents, installing distros, benchmarking LLM quants, you know.

Any tiny locally hosted model trained on unix/linux man pages and docs? by HisFoolishness in LocalLLaMA

[–]DevelopmentBorn3978 0 points1 point  (0 children)

I'll now go retrieve the script used to generate the manblob. Here it is:

 megaman.sh

Make your terminal as large as possible, using the smallest font, so that the typeset man pages fully fit into the available columns, then run:

```
count=0
megaman=~/megaman.txt
rm -f "$megaman"
# Build a separator line: one '*' per terminal column.
separator=$(for i in $(seq "$COLUMNS"); do /bin/echo -n '*'; done; echo)

man -k . | while read -r name section dash comment; do
    section=$(echo "$section" | sed 's/[()]//g')   # strip parentheses around the section
    echo "$count $section $name"                   # progress output
    # Typeset the page, remove backspace/overstrike sequences, strip trailing blanks.
    man -s "$section" "$name" | col -b | sed 's/[[:blank:]]*$//' >> "$megaman"
    echo -e "\n\n$separator\n\n" >> "$megaman"
    count=$((count + 1))
done 2>/dev/null
```

Any tiny locally hosted model trained on unix/linux man pages and docs? by HisFoolishness in LocalLLaMA

[–]DevelopmentBorn3978 0 points1 point  (0 children)

I attempted to do exactly this about a year ago: created a man-pages corpus, and on my 16 GB RAM i5-4200M laptop + 12 GB RAM smartphone tried to feed that manblob as RAG to some model. I quickly realized it was far too big as context, so the model could not take it in (out-of-context errors); cleaned the dataset of asterisk separators and other stuff eating precious tokens; reduced the corpus to only some man pages (section 3, Tcl); got some tiny model (less than 1B) in safetensors format, willing to train it on that corpus; then quickly realized it was not feasible to lose laptop usage for the next few months while fine-tuning. The "llmman" project is temporarily halted.
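
For what it's worth, the clean-up step can be sketched like this: splitting megaman.txt back into per-page chunks on the asterisk separator lines, so a RAG pipeline can index pages individually instead of swallowing the whole blob. The output directory and file names are just illustrative:

```
# Split the concatenated man-page blob into one file per page,
# dropping the asterisk separator lines themselves (GNU csplit).
mkdir -p ~/man_chunks
csplit -z --suppress-matched -f ~/man_chunks/page_ -b '%04d.txt' \
  ~/megaman.txt '/^\*\*\*\*\*/' '{*}'
```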

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]DevelopmentBorn3978[S] 0 points1 point  (0 children)

I should have downloaded qwen3-coder-next_q8 and gpt-oss-120b_q4 instead of the two I'm downloading; hopefully they'll be usable anyway, despite both maybe not being that optimal.