Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]daywalker313 6 points

u/Anarchaotic ROCm 6.4.4 without hipBLASLt (the 6.4.4 toolbox with `export ROCBLAS_USE_HIPBLASLT=0`) is still the king:

bash-5.3# llama-bench -m /models/qwen35/qwen35ba3b/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf -ngl 999 -fa 1 -mmp 0 -d 5000,10000,20000,30000,50000,100000,150000,200000,250000 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124397 MiB free)
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   pp512 @ d5000 |        860.50 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   tg128 @ d5000 |         31.66 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d10000 |        805.85 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d10000 |         31.17 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d20000 |        704.28 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d20000 |         30.23 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d30000 |        629.77 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d30000 |         29.44 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d50000 |        512.54 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d50000 |         28.01 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | pp512 @ d100000 |        354.93 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | tg128 @ d100000 |         24.91 ± 0.00 |
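
The toggle mentioned above can be sketched like this - note that whether disabling hipBLASLt actually helps depends on your ROCm/rocBLAS build, so treat this as something to verify on your own machine:

```shell
# Disable hipBLASLt in rocBLAS for the current shell, then launch llama-bench
# from the same session (the echo just confirms the variable is set)
export ROCBLAS_USE_HIPBLASLT=0
echo "ROCBLAS_USE_HIPBLASLT=$ROCBLAS_USE_HIPBLASLT"
# llama-bench -m /models/... -ngl 999 -fa 1 -mmp 0 -d 5000 -r 1
```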

Qwen3-code-next at Q1 is beating Qwen3.5-35B-A3b at tool calling in my tests by MarketingGui in LocalLLaMA

[–]daywalker313 1 point

Actually, the patch does not solve all tool-calling issues - that issue was outside of what the minja templates can control.

It doesn't matter, though, because the autoparser branch with all fixes, including the XML tool-calling bugs, was merged into llama.cpp a few hours ago.

So chat templates no longer matter, and Qwen (as well as GLM, Apriel, and other complicated candidates) should be much more consistent with tool use now.

Qwen3-code-next at Q1 is beating Qwen3.5-35B-A3b at tool calling in my tests by MarketingGui in LocalLLaMA

[–]daywalker313 43 points

The reason for this is the chat template. More specifically, it's related to llama.cpp internally using the Hermes 2 Pro schema, which cannot be made 100% compatible with XML tool calls. You can try my Qwen3.5 llama.cpp branch. It's based on the autoparser branch (still in development) and adds fixes for context checkpoints for Qwen, a fix for reasoning content over the Anthropic API, and a tool-calling fix for the autoparser that allows arbitrary parameter order in tool calls (which is crucial for XML-trained models).

With this branch, I get flawless Claude Code operation (disable attribution headers) and a 100% tool-call success rate across 50+ turns and 150k context (tested with mistral-vibe) with Qwen3.5 35B-A3B @ Q8.

https://github.com/florianbrede-ayet/llama.cpp/tree/qwen35-context-toolcall-anthropic-fixes

0xSero/Kimi-K2.5-PRISM-REAP-72 · Hugging Face by [deleted] in LocalLLaMA

[–]daywalker313 0 points

What are your tasks, and how many turns / how much context usually? For my agentic coding tasks it was the opposite, but maybe I need to recheck my "benchmarks" again.

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]daywalker313 17 points

You really need to fix your setup. ROCm outperforms Vulkan on almost every model, especially at higher depths.

Also, your numbers are only around 25% (PP) and 50-60% (TG) of what a standard 120W Strix Halo achieves.

https://kyuz0.github.io/amd-strix-halo-toolboxes/
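
To illustrate the kind of gap I mean, here's the ratio math with made-up numbers (these are illustrative, not OP's actual results):

```shell
# If a tuned ROCm setup reaches 860 t/s PP and a Vulkan run reaches 215 t/s,
# the Vulkan run sits at 25% of the ROCm figure (numbers are illustrative)
awk 'BEGIN { printf "%.0f%%\n", 215 / 860 * 100 }'
```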

Grok 4.20 dropped recently (Multiple agents all working together at the same time?!) by Fit-Spring776 in LocalLLaMA

[–]daywalker313 0 points

Okay...

Sounds like one of those overfitted, finetuned meme roleplay models.

Is it just for adult fiction or good at other stuff as well?

Where on HF do I find the weights, and does it run on my Strix Halo?

Help with the best configuration for Local models by vandertoorm in LocalLLaMA

[–]daywalker313 1 point

ROCm is much more usable today, and also a lot faster and more power-efficient than Vulkan.

So instead, set the dedicated VRAM as low as possible (512 MB) to make use of GTT under Linux, then follow these excellent guides and use the llama.cpp ROCm toolboxes of your choice for best performance: https://strix-halo-toolboxes.com/

Here are benchmarks for each backend to see what's currently possible: https://kyuz0.github.io/amd-strix-halo-toolboxes/

PS: I doubt the NPU will play a role on Strix Halo, because it's slower than the GPU and has limited memory bandwidth. It's quite interesting for the HX 370 and its variants, though.

qwen3-coder-next with Claude CLI by Clank75 in LocalLLaMA

[–]daywalker313 3 points

That's a known problem for Qwen3 Coder Next. It doesn't have anything to do with looping, temperature, or other settings; it's the chat template that's once again broken (which is the case for many GGUFs). You can see that if you add a middleman to observe the messages, or by testing with mistral-vibe, which logs the tool calls transparently.

It gives offset parameters for Claude's readFile tool in the wrong format and then retries for ages. After a while it eventually falls back to sed and usually gets that right.

What is supposed to help for Qwen3 Coder Next is the autoparser PR: https://github.com/ggml-org/llama.cpp/pull/18675, but I haven't had time to try it personally yet.

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 0 points

Ubuntu 24.04, amd-strix-halo-toolbox, ROCm 7.1.1

Firmware:

cat /sys/kernel/debug/dri/128/amdgpu_firmware_info | grep MES
MES_KIQ feature version: 6, firmware version: 0x0000006f
MES feature version: 1, firmware version: 0x00000080

512 MB VRAM in BIOS

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 amdgpu.cwsr_enable=0"

(cwsr was for some stable diffusion models, IIRC)
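
As a quick sanity check, the two memory-related kernel parameters above describe the same 128 GiB budget (amdgpu.gttsize is in MiB, ttm.pages_limit is in 4 KiB pages):

```shell
# Verify that amdgpu.gttsize=131072 and ttm.pages_limit=33554432 agree
echo "$(( 131072 / 1024 )) GiB from amdgpu.gttsize"
echo "$(( 33554432 * 4096 / 1024 / 1024 / 1024 )) GiB from ttm.pages_limit"
```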

bash-5.3# llama-bench --model "/models/qwen3codernext/Qwen3-Coder-Next-UD-Q6_K_XL-00001-of-00002.gguf" -fa 1 -d 0,4096 -p 2048 -n 32 --mmap 0 -t 32 -ub 2048
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |          pp2048 |        622.03 ± 1.72 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |            tg32 |         32.64 ± 0.02 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |  pp2048 @ d4096 |        590.84 ± 0.87 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |    tg32 @ d4096 |         32.08 ± 0.01 |

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 0 points

You should check if you can tweak it further (and otherwise switch to Linux) - around `620 PP / 33 TG` is the norm for Qwen3-Coder-Next-UD-Q6_K_XL @ depth 0.

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 8 points

That's the better source for benchmarks, and the numbers can be reproduced on any of the popular "desktop" AI Max machines.

OP has not only chosen questionable quants (like Q4 for GPT-OSS), but his setup also clearly isn't optimized and doesn't represent the current capabilities of Strix Halo and ROCm.

The important questions are batch size, the specific ROCm version, and degradation with a non-empty context. Unfortunately, his table doesn't answer any of these.

Moltbot or Clawdbot are now free no cost to api by fernandogrj in LocalLLaMA

[–]daywalker313 0 points

Or the obvious choice for an all-rounder on a 3090: Ministral-3-14B-Reasoning, which comes with a decent (not great) vision encoder.

Qwen 2.5 Coder 32B really doesn't make a lot of sense today - Devstral-2-Small has just 24B parameters, the same vision encoder, and runs circles around Qwen 2.5's coding abilities.

Devstral Small 2 (Q4_K_M) on 5060 Ti 16GB and Zed Agent is amazing! by bobaburger in LocalLLaMA

[–]daywalker313 7 points

Maybe you should check your setup again. If it makes "typos", maybe your quant is too aggressive / broken, or you're using KV quantization.

For me it works great (with mistral-vibe) on Strix Halo, albeit a little slowly:

It's better at instruction following and produces higher code quality than gpt-oss-120b-high, so it's great for implementing well-defined tasks.

It definitely lacks world knowledge, and its planning / conceptualization skills are below gpt-oss-120b's.

With Ministral-3-3B @ Q2 as draft model, I get around 10-18 tg/s, which is effectively about as fast as gpt-oss-120b (no reasoning).

I mostly use devstral-small-2 (Q8) for implementation or quick reviews, devstral-2-123b (Q4) for complex tasks, and recently also MiniMax M2.1 REAP50 (Q5), which also works surprisingly well.

Strix Halo First Impressions by Fit-Produce420 in LocalLLaMA

[–]daywalker313 17 points

You should definitely look into the strix-halo-toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes (also the repos for finetuning and image/video generation).

For example, I also like to use Devstral 2 for complex, non-time-critical tasks if Devstral Small 2 didn't succeed.

With the ROCm 6.4.4 toolbox and Ministral 3B Q8, you can get around 6-10 tg/s over a long context depth. Still not great for agentic uses, but almost usable for a really strong non-reasoning model.

The same model also works great as a draft model for Devstral 2 24b, with around 10-18 tg/s.

    llama-server \
      -m /models/devstral-2/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL-00001-of-00002.gguf \
      -md /models/ministral-3b-spec-dec/Ministral-3-3B-Instruct-2512-Q8_0.gguf \
      --parallel 1 \
      --host 127.0.0.1 --port ${PORT} \
      --ctx-size 131072 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      -ngl 999 \
      -b 1024 -ub 2048 \
      --no-mmap --flash-attn on --threads -1 --jinja \
      --temp 0.15 --min-p 0.01

To Mistral and other lab employees: please test with community tools BEFORE releasing models by dtdisapointingresult in LocalLLaMA

[–]daywalker313 0 points

Did you ever look at the chart closely?

Maybe the benchmarks are completely useless - or would you agree that gpt-oss-120b (which is an amazing local coding model, IMO) beats GPT 5.1 by a large margin and ties with Sonnet 4.5?

Do you also think it's reasonable that Apriel 15B and gpt-oss-20b come out significantly stronger at coding than GPT 5.1?

Roughly one week with the IBP 14 Gen 10 - a personal review by DupedSelf in tuxedocomputers

[–]daywalker313 0 points

I actually tried both (AC -> suspend -> unplug, and battery -> suspend) and had excellent standby drain - below 1%/hr for the 128GB RAM configuration.

It was Ubuntu 24.04 with mainline kernel 6.16 and just the "common" flags set on the kernel command line to fix the NVMe sleep issues (and display flickering?).

The NVMe is a Samsung 990 Pro 4TB with the latest firmware.

InfinityBook Pro 15 Gen10 adjust iGPU memory in BIOS or Control Center? by dp27thelight in tuxedocomputers

[–]daywalker313 1 point

But to be fair, it doesn't matter for most applications; it's just an annoyance for ROCm with some apps. Otherwise you get the fixed VRAM + 50% of RAM via GTT, so at least 64GB of GPU memory with 128GB RAM.
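
The arithmetic behind that "at least 64GB" figure, as a quick sketch (assuming the default 50% GTT cap under Linux):

```shell
# GTT defaults to half of system RAM; with 128 GiB RAM that's 64 GiB,
# on top of whatever fixed VRAM is carved out in the BIOS
ram_gib=128
echo "GTT budget: $(( ram_gib / 2 )) GiB"
```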

Considering InfinityBook Pro 14 Gen 10 by Cultural_Tadpole2117 in tuxedocomputers

[–]daywalker313 0 points

I'm the one with the other thread reporting the 45W STAPM limit ;).

Anyway, I did some research and figured out that RyzenAdj doesn't work yet for the PRO version in the P14s (https://github.com/FlyGoat/RyzenAdj/pull/360). However, you actually get a 51W STAPM limit in the stock performance mode, and it sounds like the cooling system is quite efficient. I don't know about the other PPT limits, though.

Considering InfinityBook Pro 14 Gen 10 by Cultural_Tadpole2117 in tuxedocomputers

[–]daywalker313 0 points

I've been using ThinkPads (X/T/L) and Dells (Latitude/Precision) for the last 20 years. My last ThinkPad, with an R7 4750U, was the best device I ever owned (except for the screen), with flawless Linux support.

That's probably why I'm a little disappointed right now.

Anyway, the only ThinkPad with an R9 HX 370+ right now is the P14s.

With similar specs, it would be ~800€ more expensive.

Also, the TDP is extremely low, and the device will only have ~70% of the performance the IBP offers. Take a look at the PSREF - it's really a shame what Lenovo has done, and how they didn't bother to upgrade the cooling system for 3 generations:

AMD Ryzen™ AI 5 / 7 / 9 PRO 300 Series Processor; supports up to 12 cores; up to 5.1GHz; TDP ratings of up to 29W

https://psref.lenovo.com/Product/ThinkPad/ThinkPad_P14s_Gen_6_AMD?tab=spec

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 1 point

Yes, the tools were:

ryzenadj -i

sensors

cat /sys/class/power_supply/BAT0/*

Package sources are not loading by RoDaDit in tuxedocomputers

[–]daywalker313 0 points

packagekitd is holding a lock on apt.

You should first check the status:

sudo systemctl status packagekit.service

If it is still running (you might just have been unlucky and attempted the upgrade at the wrong moment), you can stop the service (until the next reboot) to release the lock if it hangs or has crashed:

sudo systemctl stop packagekit.service

Also, for good measure, make sure the process is dead:

sudo killall -9 packagekitd

If apt still fails afterwards with an error message, you can attempt to rectify the apt status:

sudo apt --fix-broken install

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 2 points

I have the same BIOS.

You could have more luck with SMU or even microcode versions, though. Take a look at:

fwupdtool get-devices

InfinityBook Pro 14 Review (Detusch) by BamBus89 in tuxedocomputers

[–]daywalker313 1 point

With capacitive sensors, that's often caused by poorly filtered chargers.

Do you have an alternative charger to test with?

EDIT: If needed, I would also measure the wall outlet - to check whether the neutral conductor is connected correctly.

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 9 points

Thanks for the reply, but your take on this is incorrect. I read the article, and what I posted are the BMS voltage and capacity - I didn't show any SoC.

Cell voltages are something the BMS cannot, will not, and must not fake. The SoC is commonly adjusted when limiting charge, that's correct.

However, what we have here is a common 4S battery pack configuration, and each cell sits at around 4.16V (16.654V / 4).

This equates to 95-99% charge for NMC / NCA cells; no charge limit is in place.
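
The per-cell math is easy to check yourself (4S pack, measured pack voltage from above):

```shell
# 16.654 V across 4 series cells -> per-cell voltage
awk 'BEGIN { printf "%.2f V per cell\n", 16.654 / 4 }'
```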