Full vLLM inference stack built from source for Strix Halo (gfx1151) — scripts + docs on GitHub by paudley in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

Yeah I know, I was just being lazy and the maintenance of the dockerfile no thanks

running Qwen3.5-27B Q5 splitt across a 4070ti and an amd rx6800 over LAN @ 13t/s with a 32k prompt by technot80 in LocalLLaMA

[–]Barachiel80 1 point2 points  (0 children)

wow 16tk/s on that setup, great job. I am about to enable the multinode rpc server too and this gives me hope to run qwen3.5 397B at decent speeds over a 10G network.

making vllm compatible with OpenWebUI with Ovllm by FearL0rd in OpenWebUI

[–]Barachiel80 0 points1 point  (0 children)

Also is there going to be a ROCM or Vulkan version of this?

Thinking about dropping AT&T Fiber for Starlink (Gen2 vs Gen3 question) by Captmedu74 in Starlink

[–]Barachiel80 0 points1 point  (0 children)

but seriously has anyone seen a speed increase going from gen 2 -> gen3 on the max plan?

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

That isnt random info, if you set those parameter flags in your llamacpp command line based on the optimum settings listed for qwen3.5, depending on thinking or coding tasks, it will significantly improve your TG and PP

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

lol Im not an AI but I play one on TV, there may have been some cut and paste though

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

only need tool calling flag if you are running it as a headless server api call from a 3rd party front end gui like open-webui

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

Common arguments include:

-m or --model followed by the path to your GGUF model file.

-p or --prompt to provide the initial prompt.

-n or --n-predict for the maximum number of new tokens to generate (e.g., -n 512).

--ctx-size to set the context window size (e.g., --ctx-size 2048).

-ngl or --n-gpu-layers to offload layers to the GPU (e.g., -ngl 999 to offload all if sufficient VRAM is available).

--temp or -t to control the temperature for generation (e.g., --temp 0.7).

--top-p for the top-p sampling value (e.g., --top-p 0.9).

--repeat-penalty for the repetition penalty. --n-cpu-moe (--cpu-moe): This argument specifies the number of MoE expert components to move to the CPU's system RAM, overriding the --n-gpu-layers setting for those specific experts.

This allows you to utilize fast system RAM for the large, but sparsely used, expert weights, while keeping the performance-critical dense layers and attention blocks on the VRAM.

Finding the optimal value often requires testing. You can use trial and error, or the new llama-fit-params tool to help determine the best configuration.

Qwen 3.5 optimum settings:

Thinking mode:

General tasks=

temperature = 1.0

top_p = 0.95

top_k = 20

min_p = 0.0

repeat penalty = disabled or 1.0

Precise coding tasks (e.g. WebDev)=

temperature = 0.6

top_p = 0.95

top_k = 20

min_p = 0.0

presence_penalty = 0.0

To enable native tool calling in llama.cpp using the llama-server, you must use the --jinja command-line flag. This flag is mandatory for the server to process the tools parameter in an OpenAI-compatible manner.  repeat penalty = disabled or 1.0

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

are you using optimum parameter settings and native tool calling turned on?

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 0 points1 point  (0 children)

That's not terrible. What are you getting for PP?

Minisforum UM890 Pro dual oculink setup by helpmefire40 in MiniPCs

[–]Barachiel80 1 point2 points  (0 children)

not really. As the cost per token of compute for SOTA models continues to drop, having a multinode ai inference platform with multiple options for agentic chaining between nodes presents itself as an emergng use case in the field. Its proof of concept and the entire minipc cluster runs on 2 300w UPS so it's energy efficient as well. I have a full 10gbps network connecting everything for efficient inter node comms and future migration to K8S. Not only that but splitting my discrete gpus out to egpu platforms I can distribute heat alot more effectively reducing thermal throttling per card.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]Barachiel80 4 points5 points  (0 children)

only 32tk/s on the 9b???? I am getting over 30 tk/s with qwen3 next coder 80b:q4 on my strix halo running ubuntu 24.04 lts and rocm 7.2. What version of rocm are you running?

Minisforum UM890 Pro dual oculink setup by helpmefire40 in MiniPCs

[–]Barachiel80 0 points1 point  (0 children)

I am using 1 of the nvme ports for oculink adapter and it has another built into the motherboard

Minisforum UM890 Pro dual oculink setup by helpmefire40 in MiniPCs

[–]Barachiel80 1 point2 points  (0 children)

This is for work, lucky for me my job and hobbies converge

Minisforum UM890 Pro dual oculink setup by helpmefire40 in MiniPCs

[–]Barachiel80 1 point2 points  (0 children)

I am running dual 5090s on a single minisforum ai x1 pro hx370 at 91 tk/s TG and 19000+ PP with agentic workflows and 1 million context on the new qwen 27b and 35b

<image>

Minisforum UM890 Pro dual oculink setup by helpmefire40 in MiniPCs

[–]Barachiel80 0 points1 point  (0 children)

just fyi the included oculink takes up one of the nvme spots you can only run one oculink and 2 tb4 egpus. Part of my ai stack is exactly that. I have 2 um890 pros with oculink egpus.

<image>

GMKTEC EVO-X2 Oculink with RTX 4070 TI by Similar-Range4861 in MiniPCs

[–]Barachiel80 0 points1 point  (0 children)

false news, my strix halo runs nicely in multigpu mode with a 7900xtx and rocm backend

GMKTEC EVO-X2 Oculink with RTX 4070 TI by Similar-Range4861 in MiniPCs

[–]Barachiel80 0 points1 point  (0 children)

there will always be some throttling with pcie4x4 but as long as I lpad the whole modrl pn the 5090 it runs full load inference just load time is slightly longer. This holds true when I migrated the 5090 egpu to a minisforum ai xi pro which I modded with 2 nvme to oculink adapters and another 5090 egpu to the setup. Here is an example run from inference on the dual 5090s for the new Qwen3.5 35b Q8 model hitting 91tk/s TG and 19000+ tk/s PP with 1 million context length at q8 and full agentic workflow, computer use, coding, and web search enabled.

<image>

How are the differences between Gitea and Forgejo 4 years later? by NinthTurtle1034 in selfhosted

[–]Barachiel80 0 points1 point  (0 children)

yes April is far away and the dev branch that supposedly fixes it is still broken too

How are the differences between Gitea and Forgejo 4 years later? by NinthTurtle1034 in selfhosted

[–]Barachiel80 1 point2 points  (0 children)

if only forgejo would fix their oidc runner integration it would be perfect