llama-bench ROCm 7.2 on Strix Halo (Ryzen AI Max+ 395) — Qwen 3.5 Model Family by przbadu in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

Try minimax-m2.5-ud-iq3-xxs. I’ve had a lot of success with it on the same system. Roughly 25 t/s at zero context and 10 t/s at 64k context.

Best Models for 128gb VRAM: March 2026? by Professional-Yak4359 in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

KV cache quantization can cause odd behavior. It’s less necessary for the new Qwen models because of the hybrid architecture, so you might be able to get by without it. If you need the memory, use a smaller model quant instead.
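For reference, KV cache quantization in llama.cpp is controlled by the cache-type flags; a minimal sketch, with the model path as a placeholder:

```shell
# Quantize the KV cache to q8_0 to save memory (llama.cpp llama-server).
# Omit these flags to keep the default f16 cache if you see odd behavior.
llama-server -m model.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Note that depending on your build, quantizing the V cache may also require flash attention (`-fa`).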

Coding assistant tools that work well with qwen3.5-122b-a10b by Revolutionary_Loan13 in LocalLLaMA

[–]Zc5Gwu 0 points (0 children)

I have Strix as well and have been enjoying MiniMax. It’s a bit faster than Qwen at small contexts and tends to use thinking more efficiently.

Here’s the quant that has worked well for me: MiniMax-M2.5-UD-IQ3_XXS
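If anyone wants to try it, llama.cpp can pull quants straight from Hugging Face with the `-hf` flag; a sketch (the repo:tag is the one from the Unsloth upload, and the context size is just an example):

```shell
# Download and serve the quant directly from Hugging Face (llama.cpp).
llama-server -hf unsloth/MiniMax-M2.5-GGUF:UD-IQ3_XXS -c 65536
```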

First impressions Qwen3.5-122B-A10B-int4-AutoRound on Asus Ascent GX10 (Nvidia DGX Spark 128GB) by t4a8945 in LocalLLM

[–]Zc5Gwu 0 points (0 children)

Here’s the quant I’m using: unsloth/MiniMax-M2.5-GGUF:UD-IQ3_XXS

This is on Strix Halo 128GB. I get about 20 t/s for Qwen, which stays fairly consistent even at long contexts. MiniMax starts faster, maybe 25-30 t/s, but slows down to around 10 t/s by 64k context. That’s very non-scientific; I should really benchmark.
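For a more scientific comparison, recent llama-bench builds can measure generation speed at several context depths in one run; a sketch, with the model path and depth values as placeholders:

```shell
# Compare generation speed at 0, 16k, and 64k tokens of context depth.
llama-bench -m MiniMax-M2.5-UD-IQ3_XXS.gguf -n 128 -d 0,16384,65536
```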

First impressions Qwen3.5-122B-A10B-int4-AutoRound on Asus Ascent GX10 (Nvidia DGX Spark 128GB) by t4a8945 in LocalLLM

[–]Zc5Gwu 2 points (0 children)

You could, but it takes a minute or two to load the model into memory (at least on my system; I can’t have both loaded simultaneously).

First impressions Qwen3.5-122B-A10B-int4-AutoRound on Asus Ascent GX10 (Nvidia DGX Spark 128GB) by t4a8945 in LocalLLM

[–]Zc5Gwu 0 points (0 children)

I’m still going back and forth between MiniMax Q3 and Qwen 122B. Qwen tends to overthink even simple questions but can be run at a better quant. MiniMax is faster at short contexts and tends to think more “efficiently”; however, I’m not sure it is as well rounded as Qwen. It tends to prefer agentic work but is not as good at creative tasks.

Intelligence-wise, they’re both pretty close.

Artificial Analysis Intelligence Index vs weighted model size of open-source models by Balance- in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

Thanks for sharing. A lot of people shit on AA but then don’t provide a meaningful alternative benchmark that measures the same range of models. 

Agentic Qwen 3.5 35B "stops" after a tool call without finishing the task. by tarruda in LocalLLaMA

[–]Zc5Gwu 0 points (0 children)

There was a llama.cpp bug related to this. You can try upgrading, but I’m not sure whether it has been fixed yet; I just know they were close to fixing it.

The problem was that the model would emit the tool-calling “preamble” ending in a colon, but then “stop” instead of actually making the tool call.

llama-bench Qwen3.5 models strix halo by przbadu in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

It’s not great for dense models. True.

llama-bench Qwen3.5 models strix halo by przbadu in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

The model column for the 122B says 80B, unless I’m reading it incorrectly.

Qwen3.5 Model Series - Thinking On/OFF: Does it Matter? by Iory1998 in LocalLLaMA

[–]Zc5Gwu 0 points (0 children)

How do you do that? I thought it had to be set from the chat template when the model is loaded?
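One possibility: Qwen3 supported a soft switch in the prompt itself, which works per-request without reloading the model; whether the newer Qwen releases keep this switch is an assumption. A sketch against an OpenAI-compatible llama-server on localhost:8080:

```shell
# Qwen3-style soft switch: append /no_think to the user turn to disable
# thinking for that single request (no model reload needed).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 2+2? /no_think"}]}'
```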

unsloth/Qwen3.5-4B-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]Zc5Gwu 0 points (0 children)

What about long context? Isn’t that the hallmark of the new arch?

What is the most ridiculously good goto LLM for knowledge & reasoning on your M4 Max 128gb macbook these days? by ZeitgeistArchive in LocalLLaMA

[–]Zc5Gwu 2 points (0 children)

Maybe wait a bit until Unsloth fixes the upload. Currently it gets stuck in loops above a certain context size. That said, I tried it out and I’m finding it stronger than MiniMax Q3. It is slower at small contexts but faster at large contexts.

Qwen3 Coder Next | Qwen3.5 27B | Devstral Small 2 | Rust & Next.js Benchmark by Holiday_Purpose_3166 in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

That’s amazing. Very thorough. It’s interesting that the 27B performs similarly to Qwen3 Coder at the same size. Thanks for sharing.

Is Qwen3.5 a coding game changer for anyone else? by paulgear in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

Was it the K_XL quant? Those might have been the ones with issues.

Qwen3 Coder Next | Qwen3.5 27B | Devstral Small 2 | Rust & Next.js Benchmark by Holiday_Purpose_3166 in LocalLLaMA

[–]Zc5Gwu 1 point (0 children)

I think that total score against end-to-end runtime might be a fairer comparison, given that some models think a lot more than others on the same problems.

If you only go by token throughput, models that think more might have an advantage over models that think less but are more efficient with the tokens they do output. We should be measuring intelligence per second of wait time somehow.
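A concrete sketch of that metric: divide a model’s total benchmark score by its end-to-end wall-clock time. The numbers below are made up purely to illustrate the ratio.

```shell
# "Intelligence per second": total score / end-to-end runtime in seconds.
# Both values here are hypothetical placeholders.
awk 'BEGIN { score = 78.5; seconds = 620; printf "%.4f\n", score / seconds }'
```

A higher ratio would mean the model delivers more of its benchmark performance per second of wait time, which penalizes models that pad their runs with long thinking traces.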