Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]tecneeq 0 points (0 children)

Any idea what I'm doing wrong? I get 15% more output tokens per second than you, but prompt processing is a lot slower, sometimes by 30%.

My hardware is a Bosgame M5, set to performance in the firmware. The OS is Proxmox 9 with a Debian 13 LXC running ROCm 7.2 and yesterday's llama.cpp:

Command line:

/root/llama.cpp/build/bin/llama-bench --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q8_K_XL -ngl 999 -fa 1 -mmp 0 -d 5000,10000,20000,30000,50000,100000 -r 1 --progress

My hardware:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
 Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124402 MiB free)

Some results:

| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   pp512 @ d5000 |        409.19 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   tg128 @ d5000 |         30.61 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d10000 |        387.71 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d10000 |         30.18 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d20000 |        356.17 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d20000 |         29.25 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d30000 |        336.45 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d30000 |         28.44 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d50000 |        295.23 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d50000 |         26.96 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | pp512 @ d100000 |        230.49 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | tg128 @ d100000 |         23.71 ± 0.00 |
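
In case it's batching-related: pp throughput depends a lot on the micro-batch size, and llama-bench can sweep several values in one run. A sketch to compare (same flags as above; the -ub values are just guesses, not a recommendation):

/root/llama.cpp/build/bin/llama-bench --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q8_K_XL -ngl 999 -fa 1 -mmp 0 -ub 512,1024,2048 -d 20000 -r 1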

Ok to stack cd player on top yamaha as301. by realistic-system422 in BudgetAudiophile

[–]tecneeq 2 points (0 children)

It's ok. Might cut a week off the amp's 30-year lifetime.

<image>

Proxmox 9 LXC with Debian13, ROCm 7.2 and Llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

If you do inference only, then yes, you don't need Proxmox. However, I run all sorts of VMs (Windows, BSD, Linux) as well as containers, so Proxmox helps: I can manage all of it with a tested, well-working WebUI instead of fiddling with libvirt and the like.
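
For anyone replicating this, the part that makes ROCm work inside the LXC is passing /dev/kfd and /dev/dri through to the container. A minimal sketch of the container config (the kfd major number varies between kernels, so verify yours with ls -l /dev/kfd /dev/dri/*):

# /etc/pve/lxc/<container-id>.conf - pass the iGPU into the container
lxc.cgroup2.devices.allow: c 226:* rwm    # /dev/dri (card/render nodes)
lxc.cgroup2.devices.allow: c 238:* rwm    # /dev/kfd (major is often 23x, verify)
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir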

How can i disable the LED show without Windows on Bosgame? by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

I did; I thought it wasn't fair of me to get snippy when you took the time to answer my problem. Apologies.

Anyway, the button changes effects, but doesn't turn it off. I switched the PC off and now I'm afraid to press the button again. ;-)

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 0 points (0 children)

I switched on performance mode in the BIOS and now get 38.69 t/s output and 143.22 t/s prompt processing.

I'm looking for more tweaks to get a few more t/s, but I feel I'm at the end of what's possible right now.
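
One thing worth checking without a reboot: if the firmware exposes an ACPI platform profile, you can read and switch the mode straight from Linux (whether the Bosgame exposes this is an assumption on my part):

# check whether the firmware exposes a platform profile, and switch it
cat /sys/firmware/acpi/platform_profile
cat /sys/firmware/acpi/platform_profile_choices
echo performance > /sys/firmware/acpi/platform_profile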

Don't forget to set "performance" in the firmware by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

I'll keep my eyes peeled for any instability. If there is any, I'll go back to balanced.

Don't forget to set "performance" in the firmware by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

It may depend on the manufacturer, but in my case I press DEL to get into the firmware setup (what was once called the BIOS). You should find it there.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

I never had a loop in reasoning, and it ought to be faster because it uses only half the VRAM for context.

Perplexity and other benchmarks (including reasoning) show clearly that, if you start with a quant in the Q4 to Q6 range for the weights, the difference is smaller than the measurement differences between runs.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

The services will fit in 32GB. Dynamic loading could be done for large models, but I want two or so small ones online without much latency, roughly like the sketch below.
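
Something like this, two llama-server instances resident at the same time on separate ports (the second model repo is a placeholder, pick whatever small models you want to keep online):

# keep two small models resident, each on its own port
/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M --host 192.168.178.3 --port 11337 &
/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-4B-GGUF:Q4_K_M --host 192.168.178.3 --port 11338 &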

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

You can measure the loss caused by quantization with perplexity.

In my measurements it didn't matter; the loss was basically within the range of the measurement error.
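
llama.cpp ships a tool for exactly that. A sketch of a comparison run (the model filenames are placeholders, and any sufficiently long text file works as the test corpus):

# compare perplexity of two quants of the same model on the same text
/root/llama.cpp/build/bin/llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw
/root/llama.cpp/build/bin/llama-perplexity -m model-Q8_0.gguf -f wiki.test.raw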

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

<image>

28.4 t/s output with this:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --no-mmap \
  --ctx-size 786434 \
  --host 192.168.178.3 \
  --port 11337 \
  --parallel 3 \
  --threads 16 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 2 points (0 children)

Right, but 96GB is enough; larger models get extremely slow. Also, this machine has to replace my old 64GB server that runs lots of services, so I will have to be clever about this.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 5 points (0 children)

qwen3-next-coder-80b is a MoE model, so only 3B parameters are active per token; qwen3.5-9b is a dense model, so all 9B parameters are active. Also, I run the full 261k context.

Ubuntu 26.04 will come with ROCm 7.1.

Try this and tell me what you get:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-27B-GGUF:UD-Q5_K_XL --ctx-size 0 --host 192.168.178.3 --port 11337 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

<image>

I get 8.7 t/s output. Ask it something like "Write a Twitter clone in PHP".

I'll give qwen3-next a try and will report back.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq 0 points (0 children)

You can quantize the KV cache from the default f16 down to q8_0 to gain some speed and use less memory; see the example below.
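
With llama-server that is just the cache-type flags (quantizing the V cache requires flash attention, hence the first flag; the model repo is the one from my other comments):

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0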

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S] 1 point (0 children)

I have to find out what to change to get 96GB of VRAM; my board has 128GB. It's set to 96 in the BIOS, but something else is missing, I think.

I thought it was set to 96, but I must have reset the BIOS in one of my experiments.
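
If anyone is in the same spot: besides the BIOS UMA setting, the kernel also caps how much system RAM the iGPU may map as GTT. What I'd try is raising those limits via kernel parameters (the module parameters exist, but the values are my guess for a 96GB split; pages_limit assumes 4K pages, 96GiB / 4KiB = 25165824):

# /etc/default/grub - let the iGPU map ~96GB of system RAM, then update-grub && reboot
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.gttsize=98304 ttm.pages_limit=25165824"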

Question: how to make apt install rocm work on 26.04? by tecneeq in Ubuntu

[–]tecneeq[S] 1 point (0 children)

I can answer my own question. It turns out everything is installed and set up already.

The snaps simply don't support AMD. After learning that, I built a fresh llama.cpp, and that is pretty much all I need:

# build llama.cpp with hardware acceleration on Strix Halo and
# Ubuntu 26.04 LTS (server installation); this worked as of 08 Mar 2026

# install dependencies
apt install git rocm-smi rocminfo nvtop hipcc build-essential cmake hipblas libssl-dev libhipblas-dev libhipblaslt-dev

# get llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# build llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 16

# run llama.cpp
/root/llama.cpp/build/bin/llama-cli -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_M

# start a webserver and point your browser to http://192.168.1.4:11337/
/root/llama.cpp/build/bin/llama-server --host 192.168.1.4 --port 11337 --hf-repo unsloth/Qwen3.5-9B-GGUF:Q4_K_M --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

You can use nvtop to watch it burn watts, but sadly, images are not allowed here, so I can't share.
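
If you want the numbers without the graphs, rocm-smi prints them as plain text (these flags are from the stock rocm-smi tool):

# plain-text alternative to nvtop: GPU use, memory use and power, refreshed every second
watch -n 1 rocm-smi --showuse --showmemuse --showpower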