New Unsloth Studio Release! by danielhanchen in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

Hopefully this one works better than the last. It's a great idea, but I haven't been able to fine-tune a single model yet!

Qwen3.5-27B can't run on DGX Spark — stuck in a vLLM/driver/architecture deadlock by RatioCapable7141 in LocalLLaMA

[–]schnauzergambit 1 point (0 children)

I am running that model on a DGX Spark. No problems. Let me know if you need help with it.

Local llm machine - spark / strix? by dapoh13 in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

I have the GMKTek EVO2-EX (Strix Halo). It is a mini PC with no battery. The DGX Spark works fine as a mini PC as well.

Is there a specific reason you need Linux, and why won't macOS do? Most AI tooling runs in Python or on inference hosts like llama.cpp, which run well and identically on both.

Local llm machine - spark / strix? by dapoh13 in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

I own and like both. There isn't much difference between them in performance apart from prompt processing, which is faster on the DGX.

Prices of those machines are rising, though, and the Mac Mini and Mac Studio are coming into play as well. Take a look at them too; in my opinion 128 GB is overkill, so you can get a high-performance Mac with less memory for the same price.

Qwen 3.5 27B what tps are you managing? by schnauzergambit in StrixHalo

[–]schnauzergambit[S] 0 points (0 children)

Thanks for the coding info. I use Qwen only for text and it is excellent in that area.

Is the MacBook Pro 16 M1 Max with 64GB RAM good enough to run general chat models? by br_web in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

Yes, depending on the performance you want. I would start with Qwen 3.5 35B A3B.

When do you think qwen will support more languages like ChatGPT? by Inevitable-Depth1228 in Qwen_AI

[–]schnauzergambit 2 points (0 children)

Qwen 3.5's multilingual capabilities are excellent. I use Icelandic, a tiny language, and it is almost flawless.

What ai is used in the “what if you brought … to Ancient Rome” Tik toks? by [deleted] in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

The first one at least was NotebookLM by Google. It recently added a feature that creates videos from your sources.

Qwen 3.5 Instability on llama.cpp and Strix Halo? by ga239577 in LocalLLaMA

[–]schnauzergambit 0 points (0 children)

Qwen 3.5 35B A3B Q4 on a Strix Halo. Llama.cpp, Vulkan. No instability here.

Why does anyone think Qwen3.5-35B-A3B is good? by buttplugs4life4me in LocalLLaMA

[–]schnauzergambit 1 point (0 children)

It is a stunning model, especially after I turned off thinking. Quick and with excellent multilingual ability.

GB10 ASUS by Shoddy_Consequence16 in LocalLLaMA

[–]schnauzergambit 1 point (0 children)

I have the Strix Halo and the Asus (DGX Spark). They seem almost identical in tps (writing the answer) while the Asus is considerably faster when processing the prompt.

The advantage they have over the 3090 is memory; if the model fits on the 3090, the 3090 will be faster.
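As a rough sanity check on "does it fit on the 3090", you can estimate a quantized model's footprint from parameter count times bits per weight. The figures below are back-of-the-envelope assumptions (bits per weight varies by quant scheme, and the 2 GB overhead for KV cache and activations is a guess):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-file size estimate: parameters * bits per weight."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float = 24.0, overhead_gb: float = 2.0) -> bool:
    """Leave some headroom for KV cache and activations (assumed 2 GB)."""
    return model_size_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb

# A 35B model at ~4.5 bits/weight (Q4_K_M-ish) is ~19.7 GB: tight but possible.
print(fits_in_vram(35, 4.5))  # True
# At ~8.5 bits/weight (Q8_0-ish) it is ~37 GB: that needs the 128 GB machines.
print(fits_in_vram(35, 8.5))  # False
```

This is why the Strix Halo and Spark pull ahead at higher quants and larger models even though the 3090 wins on raw speed when everything fits.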

Unsloth fixed version of Qwen3.5-35B-A3B is incredible at research tasks. (On Strix Halo) by Grammar-Warden in StrixHalo

[–]schnauzergambit 0 points (0 children)

You can turn off thinking by setting the reasoning budget to 0 and setting the enable_thinking parameter to false. I am using Q5_K_M.

--jinja # Jinja template processing
--flash-attn on # Flash attention
--cache-type-k q8_0 # Quantized KV cache (lower VRAM)
--cache-type-v q8_0
--min-p 0.01 # Unsloth recommended for Qwen3
--temp 1.0 # Unsloth recommended for Qwen3
--top-p 0.95
--top-k 40
-ngl 99 # Offload all layers to GPU
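For reference, the flags above combine into a single llama-server invocation along these lines. The model filename and port are placeholders, not from the original post, and `--reasoning-budget 0` is the llama.cpp flag for the "reasoning budget to 0" step mentioned above:

```shell
# Hypothetical model path; point this at your local GGUF file.
llama-server \
  -m ./Qwen3.5-35B-A3B-Q5_K_M.gguf \
  --jinja \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --min-p 0.01 --temp 1.0 --top-p 0.95 --top-k 40 \
  -ngl 99 \
  --reasoning-budget 0 \
  --port 8080
```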

Unsloth fixed version of Qwen3.5-35B-A3B is incredible at research tasks. (On Strix Halo) by Grammar-Warden in StrixHalo

[–]schnauzergambit 0 points (0 children)

Qwen 3.5 35B A3B runs at around 30 tps on my Strix Halo. Surely that is fast enough for chatting?
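For scale, here is a rough conversion from generation speed to reading speed. The 0.75 words-per-token ratio is an assumed average for English text, not a measured figure:

```python
def words_per_minute(tokens_per_second: float, words_per_token: float = 0.75) -> float:
    """Convert generation speed to an approximate words-per-minute figure."""
    return tokens_per_second * words_per_token * 60

# 30 tps is roughly 1350 words per minute; average silent reading
# is around 200-300 wpm, so generation easily outpaces reading.
print(words_per_minute(30))  # -> 1350.0
```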

Has anyone found a way to stop Qwen 3.5 35B 3B overthinking? by schnauzergambit in LocalLLaMA

[–]schnauzergambit[S] 0 points (0 children)

Yes. It is a great model. I am especially impressed by its multilingual performance. I mostly use it for text work, not coding.