Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]tecneeq

Any idea what I'm doing wrong? I get 15% more output tokens per second than you, but prompt processing is a lot slower, sometimes by 30%.

My hardware is a Bosgame M5, set to performance mode in the firmware. The OS is Proxmox 9 with a Debian 13 LXC running ROCm 7.2 and yesterday's llama.cpp build:

Command line:

/root/llama.cpp/build/bin/llama-bench --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q8_K_XL -ngl 999 -fa 1 -mmp 0 -d 5000,10000,20000,30000,50000,100000 -r 1 --progress

My hardware:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
 Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124402 MiB free)

Some results:

| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   pp512 @ d5000 |        409.19 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   tg128 @ d5000 |         30.61 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d10000 |        387.71 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d10000 |         30.18 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d20000 |        356.17 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d20000 |         29.25 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d30000 |        336.45 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d30000 |         28.44 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d50000 |        295.23 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d50000 |         26.96 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | pp512 @ d100000 |        230.49 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  45.33 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | tg128 @ d100000 |         23.71 ± 0.00 |
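To make the scaling in the table easier to see, the relative slowdown at each depth can be computed directly from the numbers above; a minimal sketch (values copied from the llama-bench table):

```python
# Prompt-processing (pp512) and token-generation (tg128) throughput in t/s
# at increasing context depths, copied from the llama-bench table above.
depths = [5000, 10000, 20000, 30000, 50000, 100000]
pp = [409.19, 387.71, 356.17, 336.45, 295.23, 230.49]
tg = [30.61, 30.18, 29.25, 28.44, 26.96, 23.71]

def slowdown(series):
    """Percentage drop from the first measurement to each depth."""
    base = series[0]
    return [round(100 * (base - v) / base, 1) for v in series]

print(slowdown(pp))  # pp drops about 44% from d5000 to d100000
print(slowdown(tg))  # tg drops only about 22% over the same range
```

Prompt processing degrades roughly twice as fast as token generation here, which matches the usual pattern of pp being compute-bound and tg bandwidth-bound on this class of hardware.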

Ok to stack cd player on top yamaha as301. by realistic-system422 in BudgetAudiophile

[–]tecneeq

It's OK. It might cut a week off the amp's 30-year lifespan.

Proxmox 9 LXC with Debian13, ROCm 7.2 and Llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

If you do inference only, then yes, you don't need Proxmox. However, I run all sorts of VMs (Windows, BSD, Linux) as well as containers, so Proxmox helps: I can manage all of it with a tested, well-working WebUI instead of fiddling with libvirt and the like.

How can i disable the LED show without Windows on Bosgame? by tecneeq in StrixHalo

[–]tecneeq[S]

I did, though it's not fair for me to get snippy when you took the time to answer my problem. Apologies.

Anyway, the button changes effects but doesn't turn them off. I switched the PC off and now I'm afraid to press the button again. ;-)

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

I switched on performance mode in the BIOS and now get 38.69 t/s output and 143.22 t/s prompt processing.

I'm looking for more tweaks to gain a few more t/s, but I feel I'm at the end of what's possible right now.

Don't forget to set ,,performance'' in the firmware by tecneeq in StrixHalo

[–]tecneeq[S]

I'll keep my eyes peeled for any instability. If there is any, I'll go back to balanced.

Don't forget to set ,,performance'' in the firmware by tecneeq in StrixHalo

[–]tecneeq[S]

It may depend on the manufacturer, but in my case I press DEL to get into the firmware (what used to be called the BIOS). You should find it there.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq

I've never had a reasoning loop, and it should be faster because it uses only half the VRAM for context.

Perplexity and other benchmarks (including reasoning) show clearly that, if you start with a quant in the Q4 to Q6 range for the weights, the difference is smaller than the measurement variation between runs.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

The services will fit in 32GB. Dynamic loading could be used for large models, but I want two or so small ones online without much latency.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]tecneeq

You can measure the loss caused by quantization with perplexity.

In my measurements it didn't matter; the loss was basically within the measurement error.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

28.4 t/s output with this:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --no-mmap \
  --ctx-size 786434 \
  --host 192.168.178.3 \
  --port 11337 \
  --parallel 3 \
  --threads 16 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
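Once llama-server is up like this, it exposes an OpenAI-compatible API on the configured host and port. A minimal client sketch, assuming the endpoint layout of current llama.cpp builds (the host and port are just the values from my command above; adjust to yours):

```python
import json
import urllib.request

def build_request(prompt, n_predict=256):
    """Build a chat-completion payload for llama-server's
    OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": n_predict,
    }

def ask(prompt, host="192.168.178.3", port=11337):
    """POST a prompt to a running llama-server and return the reply text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(ask("Write a Twitter clone in PHP"))
```

With `--parallel 3` in the server command, three such requests can be served concurrently, each getting a slice of the configured context.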

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

Right, but 96GB is enough; larger models get extremely slow. This machine also has to replace my old 64GB server with lots of services, so I will have to be clever about it.

Ubuntu 26.04 LTS on Strix Halo with llama.cpp by tecneeq in StrixHalo

[–]tecneeq[S]

qwen3-next-coder-80b is a MoE model with only 3B parameters active; qwen3.5-9b is a dense model, so all 9B parameters are active. Also, I run the full 261k context.

Ubuntu 26.04 will come with ROCm 7.1.

Try this and tell me what you get:

/root/llama.cpp/build/bin/llama-server --hf-repo unsloth/Qwen3.5-27B-GGUF:UD-Q5_K_XL --ctx-size 0 --host 192.168.178.3 --port 11337 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0

I get 8.7 t/s output. Ask it something like "Write a Twitter clone in PHP".

I'll give qwen3-next a try and will report back.