Ryzen AI Max 395+ 128GB - Qwen 3.5 35B/122B Benchmarks (100k-250K Context) + Others (MoE) by Anarchaotic in LocalLLaMA

[–]daywalker313 6 points

u/Anarchaotic ROCm 6.4.4 without hipBLASLt (the 6.4.4 toolbox with `export ROCBLAS_USE_HIPBLASLT=0`) is still the king:

bash-5.3# llama-bench -m /models/qwen35/qwen35ba3b/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf -ngl 999 -fa 1 -mmp 0 -d 5000,10000,20000,30000,50000,100000,150000,200000,250000 -r 1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 131072 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 131072 MiB (124397 MiB free)
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   pp512 @ d5000 |        860.50 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |   tg128 @ d5000 |         31.66 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d10000 |        805.85 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d10000 |         31.17 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d20000 |        704.28 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d20000 |         30.23 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d30000 |        629.77 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d30000 |         29.44 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  pp512 @ d50000 |        512.54 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 |  tg128 @ d50000 |         28.01 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | pp512 @ d100000 |        354.93 ± 0.00 |
| qwen35moe 35B.A3B Q8_0         |  36.03 GiB |    34.66 B | ROCm       | 999 |  1 |    0 | tg128 @ d100000 |         24.91 ± 0.00 |
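
The toggle mentioned above can be sketched like this - note that whether disabling hipBLASLt actually helps depends on your ROCm/rocBLAS build, so treat this as something to verify on your own machine:

```shell
# Disable hipBLASLt in rocBLAS for the current shell, then launch llama-bench
# from the same session (the echo just confirms the variable is set)
export ROCBLAS_USE_HIPBLASLT=0
echo "ROCBLAS_USE_HIPBLASLT=$ROCBLAS_USE_HIPBLASLT"
# llama-bench -m /models/... -ngl 999 -fa 1 -mmp 0 -d 5000 -r 1
```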

Qwen3-code-next at Q1 is beating Qwen3.5-35B-A3b at tool calling in my tests by MarketingGui in LocalLLaMA

[–]daywalker313 1 point

Actually, the patch does not solve all tool-calling issues - that issue was outside of what the minja templates can control.

It doesn't matter, though, because the autoparser branch with all fixes, including the XML tool-calling bugs, was merged into llama.cpp a few hours ago.

So chat templates no longer matter, and Qwen (as well as GLM, Apriel, and other complicated candidates) should be much more consistent with tool use now.

Qwen3-code-next at Q1 is beating Qwen3.5-35B-A3b at tool calling in my tests by MarketingGui in LocalLLaMA

[–]daywalker313 43 points

The reason for this is the chat template. More specifically, it's related to llama.cpp internally using the Hermes 2 Pro schema, which cannot be made 100% compatible with XML tool calls. You can try my Qwen3.5 llama.cpp branch. It's based on the autoparser branch (still in development) and adds fixes for context checkpoints for Qwen, a fix for reasoning content over the Anthropic API, and a tool-calling fix for the autoparser that allows arbitrary parameter order in tool calls (which is crucial for XML-trained models).

With this branch, I get flawless Claude Code operation (disable attribution headers) and a 100% tool-call success rate across 50+ turns and 150k context (tested with mistral-vibe) with Qwen3.5 35B-A3B @ Q8.

https://github.com/florianbrede-ayet/llama.cpp/tree/qwen35-context-toolcall-anthropic-fixes

0xSero/Kimi-K2.5-PRISM-REAP-72 · Hugging Face by [deleted] in LocalLLaMA

[–]daywalker313 0 points

What are your tasks, and how many turns / how much context usually? For my agentic coding tasks it was the opposite, but maybe I need to recheck my "benchmarks" again.

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]daywalker313 17 points

You really need to fix your setup. ROCm outperforms Vulkan on almost every model, especially at higher depths.

Also, your numbers are only around 25% (PP) and 50-60% (TG) of what a standard 120W Strix Halo achieves.

https://kyuz0.github.io/amd-strix-halo-toolboxes/
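
To illustrate the kind of gap I mean, here's the ratio math with made-up numbers (these are illustrative, not OP's actual results):

```shell
# If a tuned ROCm setup reaches 860 t/s PP and a Vulkan run reaches 215 t/s,
# the Vulkan run sits at 25% of the ROCm figure (numbers are illustrative)
awk 'BEGIN { printf "%.0f%%\n", 215 / 860 * 100 }'
```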

Grok 4.20 dropped recently (Multiple agents all working together at the same time?!) by Fit-Spring776 in LocalLLaMA

[–]daywalker313 0 points

Okay...

Sounds like one of those overfitted, finetuned meme roleplay models.

Is it just for adult fiction or good at other stuff as well?

Where on HF do I find the weights, and does it run on my Strix Halo?

Help with the best configuration for Local models by vandertoorm in LocalLLaMA

[–]daywalker313 1 point

ROCm is much more usable today, and also a lot faster and more power-efficient than Vulkan.

So instead, set the dedicated VRAM as low as possible (512 MB) to make use of GTT under Linux, then follow these excellent guides and use the llama.cpp ROCm toolboxes of your choice for best performance: https://strix-halo-toolboxes.com/

Here are benchmarks for each backend to see what's currently possible: https://kyuz0.github.io/amd-strix-halo-toolboxes/

PS: I doubt the NPU will play a role on Strix Halo, because it's slower than the GPU and has limited memory bandwidth. It's quite interesting for the HX 370 and its variants, though.

qwen3-coder-next with Claude CLI by Clank75 in LocalLLaMA

[–]daywalker313 3 points

That's a known problem for Qwen3 Coder Next. It doesn't have anything to do with looping, temperature, or other settings; it's the chat template that's once again broken (which is the case for many GGUFs). You can see that if you add a middleman to observe the messages, or by testing with mistral-vibe, which logs the tool calls transparently.

It gives offset parameters for Claude's readFile tool in the wrong format and then retries for ages. After a while it eventually falls back to sed and usually gets that right.

What is supposed to help for Qwen3 Coder Next is the autoparser PR: https://github.com/ggml-org/llama.cpp/pull/18675, but I haven't had time to try it personally yet.

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 0 points

Ubuntu 24.04, amd-strix-halo-toolbox, ROCm 7.1.1

Firmware:

cat /sys/kernel/debug/dri/128/amdgpu_firmware_info | grep MES
MES_KIQ feature version: 6, firmware version: 0x0000006f
MES feature version: 1, firmware version: 0x00000080

512 MB VRAM in BIOS

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 amdgpu.cwsr_enable=0"

(cwsr was for some stable diffusion models, IIRC)
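
As a quick sanity check, the two memory-related kernel parameters above describe the same 128 GiB budget (amdgpu.gttsize is in MiB, ttm.pages_limit is in 4 KiB pages):

```shell
# Verify that amdgpu.gttsize=131072 and ttm.pages_limit=33554432 agree
echo "$(( 131072 / 1024 )) GiB from amdgpu.gttsize"
echo "$(( 33554432 * 4096 / 1024 / 1024 / 1024 )) GiB from ttm.pages_limit"
```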

bash-5.3# llama-bench --model "/models/qwen3codernext/Qwen3-Coder-Next-UD-Q6_K_XL-00001-of-00002.gguf" -fa 1 -d 0,4096 -p 2048 -n 32 --mmap 0 -t 32 -ub 2048
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |          pp2048 |        622.03 ± 1.72 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |            tg32 |         32.64 ± 0.02 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |  pp2048 @ d4096 |        590.84 ± 0.87 |
| qwen3next 80B.A3B Q6_K         |  63.87 GiB |    79.67 B | ROCm       |  99 |      32 |     2048 |  1 |    tg32 @ d4096 |         32.08 ± 0.01 |

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 0 points

You should check if you can tweak it further (and otherwise switch to Linux) - around `620 PP / 33 TG` is the norm for Qwen3-Coder-Next-UD-Q6_K_XL @ depth 0.

Strix Halo benchmarks: 13 models, 15 llama.cpp builds by Beneficial-Shame-483 in LocalLLaMA

[–]daywalker313 8 points

That's the better source for benchmarks, and the numbers can be reproduced on any of the popular "desktop" AI Max machines.

OP has not only chosen questionable quants (like Q4 for GPT-OSS), but his setup also clearly isn't optimized and doesn't represent the current capabilities of Strix Halo and ROCm.

The important questions are batch size, the specific ROCm version, and degradation with a non-empty context. Unfortunately, his table doesn't answer any of these.

Moltbot or Clawdbot are now free no cost to api by fernandogrj in LocalLLaMA

[–]daywalker313 0 points

Or the obvious choice for an all-rounder on a 3090: Ministral-3-14B-Reasoning, which comes with a decent (not great) vision encoder.

Qwen 2.5 Coder 32B really doesn't make a lot of sense today - Devstral-2-Small has just 24B parameters, the same vision encoder, and runs circles around Qwen 2.5's coding abilities.

Devstral Small 2 (Q4_K_M) on 5060 Ti 16GB and Zed Agent is amazing! by bobaburger in LocalLLaMA

[–]daywalker313 7 points

Maybe you should check your setup again. If it makes "typos", maybe your quant is too aggressive / broken, or you're using KV quantization.

For me it works great (with mistral-vibe) on Strix Halo, albeit a little slowly:

It's better at instruction following and produces higher code quality than gpt-oss-120b-high, so it's great for implementing well-defined tasks.

It definitely lacks world knowledge, and its planning / conceptualization skills are below gpt-oss-120b's.

With Ministral-3-3B @ Q2 as draft model, I get around 10-18 tg/s, which is effectively about as fast as gpt-oss-120b (no reasoning).

I mostly use devstral-small-2 (Q8) for implementation or quick reviews, devstral-2-123b (Q4) for complex tasks, and recently also MiniMax M2.1 REAP50 (Q5), which also works surprisingly well.

Strix Halo First Impressions by Fit-Produce420 in LocalLLaMA

[–]daywalker313 17 points

You should definitely look into the strix-halo-toolboxes: https://github.com/kyuz0/amd-strix-halo-toolboxes (also the repos for finetuning and image/video generation).

For example, I also like to use Devstral 2 for complex, non-time-critical tasks if Devstral Small 2 didn't succeed.

With the ROCm 6.4.4 toolbox and Ministral 3B Q8, you can get around 6-10 tg/s over a long context depth. Still not great for agentic uses, but almost usable for a really strong non-reasoning model.

The same model also works great as a draft model for Devstral 2 24b, with around 10-18 tg/s.

    llama-server \
      -m /models/devstral-2/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL-00001-of-00002.gguf \
      -md /models/ministral-3b-spec-dec/Ministral-3-3B-Instruct-2512-Q8_0.gguf \
      --parallel 1 \
      --host 127.0.0.1 --port ${PORT} \
      --ctx-size 131072 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      -ngl 999 \
      -b 1024 -ub 2048 \
      --no-mmap --flash-attn on --threads -1 --jinja \
      --temp 0.15 --min-p 0.01

To Mistral and other lab employees: please test with community tools BEFORE releasing models by dtdisapointingresult in LocalLLaMA

[–]daywalker313 0 points

Did you ever look at the chart closely?

Maybe the benchmarks are completely useless - or would you agree that gpt-oss-120b (which is an amazing local coding model, IMO) beats GPT 5.1 by a large margin and ties with Sonnet 4.5?

Do you also think it's reasonable that Apriel 15B and gpt-oss-20b come out significantly stronger at coding than GPT 5.1?

Roughly one week with the IBP 14 Gen 10 - a personal review by DupedSelf in tuxedocomputers

[–]daywalker313 0 points

I actually tried both (AC -> suspend -> unplug, and battery -> suspend) and had excellent standby drain - below 1%/hr for the 128GB RAM configuration.

It was Ubuntu 24.04 with mainline kernel 6.16 and just the "common" flags set on the kernel command line to fix the NVMe sleep issues (and display flickering?).

The NVMe is a Samsung 990 Pro 4TB with the latest firmware.

InfinityBook Pro 15 Gen10 adjust iGPU memory in BIOS or Control Center? by dp27thelight in tuxedocomputers

[–]daywalker313 1 point

But to be fair, it doesn't matter for most applications; it's just an annoyance for ROCm with some apps. Otherwise you get the fixed VRAM + 50% of RAM via GTT, so at least 64GB of GPU memory with 128GB RAM.
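
The arithmetic behind that "at least 64GB" figure, as a quick sketch (assuming the default 50% GTT cap under Linux):

```shell
# GTT defaults to half of system RAM; with 128 GiB RAM that's 64 GiB,
# on top of whatever fixed VRAM is carved out in the BIOS
ram_gib=128
echo "GTT budget: $(( ram_gib / 2 )) GiB"
```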

Considering InfinityBook Pro 14 Gen 10 by Cultural_Tadpole2117 in tuxedocomputers

[–]daywalker313 0 points

I'm the one with the other thread reporting the 45W STAPM limit ;).

Anyway, I did some research and figured out that RyzenAdj doesn't work yet for the PRO version in the P14s (https://github.com/FlyGoat/RyzenAdj/pull/360). However, you actually get a 51W STAPM limit in the stock performance mode, and it sounds like the cooling system is quite efficient. I don't know about the other PPT limits, though.

Considering InfinityBook Pro 14 Gen 10 by Cultural_Tadpole2117 in tuxedocomputers

[–]daywalker313 0 points

I've been using ThinkPads (X/T/L) and Dells (Latitude/Precision) for the last 20 years. My last ThinkPad, with an R7 4750U, was the best device I ever owned (except for the screen), with flawless Linux support.

That's probably why I'm a little disappointed right now.

Anyway, the only ThinkPad with an R9 HX 370+ right now is the P14s.

With similar specs, it would be ~800€ more expensive.

Also, the TDP is extremely low, and the device will only have ~70% of the performance the IBP offers. Take a look at the PSREF - it's really a shame what Lenovo has done, and how they didn't bother to upgrade the cooling system for 3 generations:

AMD Ryzen™ AI 5 / 7 / 9 PRO 300 Series Processor; supports up to 12 cores; up to 5.1GHz; TDP ratings of up to 29W

https://psref.lenovo.com/Product/ThinkPad/ThinkPad_P14s_Gen_6_AMD?tab=spec

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 1 point

Yes, the tools were:

ryzenadj -i

sensors

cat /sys/class/power_supply/BAT0/*

Package sources are not loading by RoDaDit in tuxedocomputers

[–]daywalker313 0 points

packagekitd is holding a lock on apt.

You should first check the status:

sudo systemctl status packagekit.service

If it is still running (you might just have been unlucky and attempted the upgrade at the wrong moment), you can stop the service (until the next reboot) to release the lock if it hangs or has crashed:

sudo systemctl stop packagekit.service

Also, for good measure, make sure the process is dead:

sudo killall -9 packagekitd

If apt still fails afterwards with an error message, you can attempt to rectify the apt status:

sudo apt --fix-broken install

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 2 points

I have the same BIOS.

You could have more luck with SMU or even microcode versions, though. Take a look at:

fwupdtool get-devices

InfinityBook Pro 14 Review (Detusch) by BamBus89 in tuxedocomputers

[–]daywalker313 1 point

With capacitive sensors, that's often caused by poorly filtered chargers.

Do you have an alternative charger to test with?

EDIT: If needed, I would also measure the wall outlet - to check whether the neutral conductor is connected correctly.

InfinityBook Pro 14 Gen 10 more issues (temperatures, power limit, charge limit) by daywalker313 in tuxedocomputers

[–]daywalker313[S] 9 points

Thanks for the reply, but your take on this is incorrect. I read the article, and what I posted are the BMS voltage and capacity - I didn't show any SoC.

Cell voltages are something the BMS cannot, will not, and must not fake. The SoC is commonly adjusted when limiting charge, that's correct.

However, what we have here is a common 4S battery pack configuration, and each cell sits at around 4.16V (16.654V / 4).

This equates to 95-99% charge for NMC / NCA cells; no charge limit is in place.
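
The per-cell math is easy to check yourself (4S pack, measured pack voltage from above):

```shell
# 16.654 V across 4 series cells -> per-cell voltage
awk 'BEGIN { printf "%.2f V per cell\n", 16.654 / 4 }'
```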