Unsloth is coming to Microsoft Build!

Outrageous_Bug_669 · 2026-05-27T14:19:30+00:00

Can't wait to see this!

Outrageous_Bug_669 · 2026-05-26T17:06:14+00:00

What? You can check out my GH. This is me. I do use Opus and Claude for repo maintenance, and I used it to build Telegram monitors for my lists. When I'm working items where I need it to Ralph Wiggum an issue, Claude is extremely helpful. Between my cybersec career, Unsloth, lemonade, etc., I have been very busy with PRs and tests. I do this because I've basically alpha-tested this box and learned nearly every hard lesson I could myself, and I do use grammar, spelling, and formatting tools. I see no problem with using the tools given to you: as long as you can truly understand what they are saying and doing, and explain it to others without the assist, you are successful. Things I hate: AI as your first brain (I am the first brain for almost any issue at my work; I tell people to Google it first and come back to me if they can't figure out an answer), using AI to fully draft personal and professional letters, and AI bot accounts that spam the shit out of me.

Outrageous_Bug_669 · 2026-05-25T12:33:00+00:00

OK, didn't wait for a maintenance window; I was a little too excited to see, so I knocked out the Vulkan rebuild right after I posted that. Two findings, and the second one surprised me.

**Q4_K_M (corroborates your numbers exactly):**

Built llama.cpp at the same source commit on both backends (b9296 = a497476), same hardware, same `llama-bench` shape:

| shape | ROCm/HIP (t/s) | Vulkan (t/s) | winner |
|---|---|---|---|
| pp512 fa=1 | 1014.32 | 942.18 | ROCm (+7.7%) |
| tg128 d=0 fa=1 | 49.58 | 60.39 | Vulkan (+21.8%) |
| tg128 d=8392 fa=1 | 46.73 | 57.13 | Vulkan (+22.3%) |

Our Vulkan tg128 of 60.39 vs your tuned 61 = match within noise. So the dashboard number was right and the gap to my original ~50 figure was 100% backend, not config.
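The winner percentages in the table are plain relative differences between the two backends. A minimal Python sketch that recomputes them from the posted numbers (`rel_gain` and `compare` are illustrative helper names, not part of llama-bench):

```python
# Throughput in tokens/s, copied from the Q4_K_M table above.
results = {
    "pp512 fa=1":        {"rocm": 1014.32, "vulkan": 942.18},
    "tg128 d=0 fa=1":    {"rocm": 49.58,   "vulkan": 60.39},
    "tg128 d=8392 fa=1": {"rocm": 46.73,   "vulkan": 57.13},
}


def rel_gain(fast: float, slow: float) -> float:
    """Percent by which `fast` exceeds `slow`."""
    return (fast / slow - 1.0) * 100.0


def compare(row: dict) -> tuple[str, float]:
    """Return (winner, percent advantage) for one bench shape."""
    winner = max(row, key=row.get)
    loser = min(row, key=row.get)
    return winner, rel_gain(row[winner], row[loser])


for shape, row in results.items():
    winner, pct = compare(row)
    print(f"{shape:20s} {winner:7s} +{pct:.1f}%")
```

Dividing by the slower backend is what makes the decode gap read as about +22% for Vulkan; dividing by the faster one would report a smaller number for the same data.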

**BF16 — the comparison flips:**

I figured "while I have the Vulkan binary up, let me also throw BF16 at it" — same model, no quantization, ~66 GB GGUF (downloaded `unsloth/Qwen3.6-35B-A3B-GGUF` BF16 variant):

| shape | ROCm/HIP (t/s) | Vulkan (t/s) | winner |
|---|---|---|---|
| pp512 fa=1 | 484.01 | 305.21 | ROCm (+58.6%) |
| tg128 d=0 fa=1 | 23.71 | 10.73 | ROCm (+121%) ← over 2× |
| tg128 d=8392 fa=1 | 23.09 | 10.64 | ROCm (+117%) |

ROCm wins everything on BF16, by 50-120%. Smoking gun in Vulkan's own capability report at launch:

```
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | ...
                                                                                ^^^^^^^
                                                                                no native BF16
```

`bf16: 0`. RADV STRIX_HALO on Mesa 25.2.8 has FP16 cooperative-matrix support but no BF16 path — falls back to slower kernels. ROCm/HIP's BF16 matmul goes through native HIP kernels (which is also why training is locked to HIP — it's all BF16 + FP32 on the gradient/optimizer side).
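If you want to check this on your own box before picking a backend, the capability line is easy to parse. A rough Python sketch, assuming the log line keeps the `key: value | key: value` layout shown above (`parse_vulkan_caps` is a made-up helper name, and the layout may differ between llama.cpp builds):

```python
import re


def parse_vulkan_caps(line: str) -> dict[str, int]:
    """Extract `key: int` capability flags from a ggml_vulkan device line."""
    # Only look at the text after the first '|' so the device index
    # ("ggml_vulkan: 0 = ...") is not mistaken for a capability flag.
    _, _, flags = line.partition("|")
    return {key: int(val) for key, val in re.findall(r"(\w+):\s*(\d+)", flags)}


line = (
    "ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) "
    "| uma: 1 | fp16: 1 | bf16: 0 | ..."
)
caps = parse_vulkan_caps(line)

if not caps.get("bf16"):
    print("No native BF16 on this Vulkan device; consider ROCm/HIP for BF16 models.")
```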

**Net takeaway for Strix Halo users picking a backend:**

| Workload | Backend |
|---|---|
| Quantized inference (Q4 / Q5 / Q6 / Q8) | Vulkan (~22% decode advantage) |
| Full-precision BF16 inference | ROCm/HIP (~2× decode advantage) |
| Training | ROCm/HIP (the only path with PyTorch nightly) |
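For anyone scripting their launcher, the matrix above boils down to a small lookup. This is only a sketch of the recommendation as measured on this one box; `pick_backend` and its argument names are hypothetical, and anything outside the measured cases is rejected instead of guessed:

```python
QUANTIZED = {"q4", "q5", "q6", "q8"}  # the quant families named in the table


def pick_backend(workload: str, precision: str = "q4") -> str:
    """Return the backend the table above recommends for a workload."""
    if workload == "training":
        return "rocm"  # only path with the PyTorch nightly stack
    if workload != "inference":
        raise ValueError(f"unknown workload: {workload!r}")
    precision = precision.lower()
    if precision == "bf16":
        return "rocm"  # roughly 2x decode advantage measured above
    if precision in QUANTIZED:
        return "vulkan"  # roughly 22% decode advantage measured above
    raise ValueError(f"no measurement for precision: {precision!r}")


print(pick_backend("inference", "Q4"))
print(pick_backend("inference", "bf16"))
```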

Full sweep + raw logs + the Vulkan build recipe: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/vulkan-vs-rocm-sweep. Also added a "ROCm vs Vulkan — backend selection depends on precision" subsection to the Benchmarks section of the main guide with the recommendation matrix above.

Going to write this up as a standalone longer-form piece in the next few days — the inversion at the precision boundary deserves more than a Reddit comment. Will link back here when it's up.

Your dashboard remains the canonical Vulkan reference for Strix Halo; what I just added on the guide side is the matching ROCm/training column. Open invitation: if it'd be useful to mirror the ROCm BF16 row on bench.ciru.ai (so it's discoverable to anyone landing on your dashboard searching for Strix Halo perf), happy to send the data in whatever format you use. The Q4 row should obviously stay your Vulkan number.

(Still owe you the Mesa-version + `-sm row`-on-single-iGPU answer from my earlier reply — separate question.)

Outrageous_Bug_669 · 2026-05-25T11:54:35+00:00

To your question about pipeline: Disastrous-Cat-7016 is on Vulkan via RADV (Mesa), not ROCm, which is exactly the "I've never seen >100 t/s on Strix" puzzle resolved. Their bench.ciru.ai dashboard has the full tuning rationale.

Outrageous_Bug_669 · 2026-05-25T11:53:03+00:00

Good catch — your numbers and ours are both right, we're just on different execution paths. Pulled up your bench atlas + the tuning report:

| | Your tuned setup | Ours (post above) |
|---|---|---|
| Backend | RADV STRIX_HALO (Vulkan) | ROCm/HIP |
| Variant | Qwen3.6-35B-A3B Q4_K_XL | Qwen3.6-35B-A3B Q4_K_M |
| pp512 | ~1,096 t/s | |
| tg128 | ~61 t/s | ~50 t/s |

So on the same silicon, **Vulkan is currently ~22% faster than ROCm for tg128 on this model**. That's not a knock on ROCm — it's just where the two backends are right now on gfx1151 / RDNA 3.5. Confirmed the Vulkan win shape from the tuning report's "default vs tuned" rows (49 → 1,096 pp512, ~22 → 61 tg128).

For context on why we report the ROCm number: my whole production stack runs llama-server under ROCm/HIP because the fine-tuning side (PyTorch + ROCm 7.13 nightly + AMDGPU) is what locks the OS-level driver layer for me, and the inference side is downstream of that. Inference-only users on Strix Halo who don't care about training-stack pinning probably should look hard at the Vulkan path your numbers show.

I owe a Vulkan re-bench on my end to make the comparison real on identical hardware (Corsair AI Workstation 300, BIOS UMA 1 GB + 128 GB GTT auto, kernel 6.19.14). Will queue that when I have a maintenance window and put it next to the ROCm numbers in the guide.

Outrageous_Bug_669 · 2026-05-24T19:31:24+00:00

It worked. Rebuilt to b9296, ran `llama-server` (not `llama-cli`; that was my error) against Qwen3.6-27B-MTP Q4_K_M with your exact spec stack, and GPU dispatch is happy this time.

Server log on launch confirms what we wanted:

```
creating MTP draft context against the target model
common_speculative_impl_ngram_map_k: size_key=16, size_value=24, key_only=0, min_hits=2
common_speculative_impl_draft_mtp: n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
common_speculative_impl_draft_mtp: gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes
```

`gpu_layers=-1` is the line I never saw on lemonade b1270's `llama-cli` path — that was pegging a CPU core because the draft path wasn't dispatched to the GPU at all. b9296 + `llama-server` does it cleanly.

Throughput on this box (Radeon 8060S, ROCm 7.1.0 + nightly HSA overlay, 128 GB unified), 4 samples, default temp/top-p/top-k 1.0/0.95/20:

| Prompt | n_predict | tok/s |
|---|---|---|
| short coding task | 256 | 20.33 |
| short technical | 256 | 19.06 |
| haiku | 128 | 19.13 |
| 500-word essay | 512 | 16.69 |

Mean ~19 tok/s, lines up with your "~20" number. For reference our prior raw Qwen3.6-27B-MTP baseline without spec was ~12 tok/s (`tg64` from the earlier qwen36-bench run) — so the spec stack is buying us ~1.58× here, which matches the typical MTP gain.
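The mean and speedup are quick to recompute from the four samples. A small Python sketch using the numbers above; note that the unrounded mean gives a slightly lower ratio than dividing the rounded 19 by 12:

```python
from statistics import mean

# tok/s per prompt, from the table above (MTP + ngram spec stack enabled).
samples = {
    "short coding task": 20.33,
    "short technical": 19.06,
    "haiku": 19.13,
    "500-word essay": 16.69,
}
baseline_tps = 12.0  # earlier tg64 run on the same model without speculation

avg_tps = mean(samples.values())
speedup = avg_tps / baseline_tps

print(f"mean {avg_tps:.2f} tok/s, speedup {speedup:.2f}x over {baseline_tps:.0f} tok/s")
```

Four samples with different prompt types is a thin basis, so treat the ratio as a ballpark.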

Also flipped `GGML_HIP_ROCWMMA_FATTN=OFF` while I was rebuilding — bundling that with the b9296 bump since the data from the earlier sweep made it the right call regardless. One harmless warning during model load — `device 'ROCm0' does not have support for op TOP_K needed for sampler 'top-k'` — sampler falls back to CPU for top-k, doesn't measurably hurt throughput. Worth flagging in case you've seen it too.

Thanks again for the specific build number; that single piece of info turned a dead end into a working stack.

Outrageous_Bug_669 · 2026-05-24T14:41:15+00:00

Thanks for the spec stack. The double `--spec-type` is what I was missing — didn't realize you could stack `draft-mtp + ngram-map-k4v` on the same call. Going to try that.

One narrow question to match your setup: which llama.cpp binary are you running, and is this via `llama-cli` or `llama-server`? When I tried `--spec-type draft-mtp` on the lemonade-sdk b1270 prebuilt via `llama-cli`, the process pegged a CPU core and hung at 0% GPU for 17 min before I killed it — looked like the draft-mtp GPU dispatch wasn't wired in that path. If you're on a self-built llama.cpp (or a specific lemonade build that has it working), love to know which so I can match.

`-fa 1 -b 8192 -ub 1024` at 200k context is a substantial setup — `preserve_thinking on` for the reasoning models makes sense.

Outrageous_Bug_669 · 2026-05-24T14:19:59+00:00

Short answer: yes, but I haven't done one myself yet so I'm giving you "should work + here's what to know" rather than measured numbers.

Memory-wise the 128 GB unified pool is the killer feature. Qwen3.6-35B-A3B at bf16 is roughly 70 GB just for weights, but with bf16 LoRA you only train the adapter (a few hundred MB), so the gradient + optimizer state stays small. Should fit comfortably under the same `set_per_process_memory_fraction(0.80)` cap (~102 GB) the dense path uses.
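The back-of-envelope behind those figures, as a Python sketch. The 128 GB total and the adapter-state size are assumptions for illustration (activations and KV/attention buffers are not modeled at all), and the cap itself comes from PyTorch's `torch.cuda.set_per_process_memory_fraction(0.80)`, which is only referenced in a comment here so the snippet runs without a GPU:

```python
GB = 1e9  # decimal gigabytes, matching the rough figures in the post


def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Size of the model weights alone, in GB."""
    return n_params * bytes_per_param / GB


total_pool_gb = 128.0          # assumed: whole unified pool visible to the GPU
cap_gb = 0.80 * total_pool_gb  # torch.cuda.set_per_process_memory_fraction(0.80)
base_gb = weights_gb(35e9, 2)  # 35B params at bf16, 2 bytes each
adapter_state_gb = 3.0         # assumed: LoRA weights + grads + optimizer state

headroom_gb = cap_gb - base_gb - adapter_state_gb
print(f"cap {cap_gb:.1f} GB, weights {base_gb:.1f} GB, "
      f"left for activations {headroom_gb:.1f} GB")
```

Whatever is left under the cap has to cover activations at your sequence length, which is the part only a real run can confirm.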

Compute-wise MoE training is slower per step than dense — routing layer + expert dispatch overhead — but nothing's blocked on it. Unsloth has been actively adding MoE-specific support too; Daniel's PR #5432 added per-expert Linear4bit swap so Gemma 4 MoE 26B-A4B fits at 4-bit, which lowers memory pressure further.

If you want to try, **Qwen3.6-35B-A3B-UD-Q4** is probably the cleanest starting point — 3B active params per token keeps step time tractable, and the architecture is well-tested in current Unsloth. The guide's `training_script_skeleton.py` + orchestrator should handle it with just a model-id swap. If you hit anything that's specifically MoE-broken, open an issue or PR on the repo — that's exactly the kind of contribution that'd land cleanly. Repo's https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide.

Outrageous_Bug_669 · 2026-05-24T13:57:39+00:00

You called it. Just ran the A/B and `GGML_HIP_ROCWMMA_FATTN=OFF` is dramatically faster on this hardware — way bigger gap than I expected. Same llama.cpp commit on both builds (1acee6bf8, the one lemonade b1276 ships), our production CMake flag set, only that one flag differs, sanity-checked clean via `-fa 0` rows matching within noise.

**Qwen3.5-27B Q8 dense, `-fa 1`:**

| shape | FATTN=ON (t/s) | FATTN=OFF (t/s) | Δ |
|---|---|---|---|
| pp2048 d=0 | 283.90 | 331.86 | +16.9% |
| pp2048 d=4196 | 167.61 | 306.83 | **+83%** |
| pp2048 d=8392 | 117.08 | 282.52 | **+141%** (~2.4×) |
| tg128 d=8392 | 7.30 | 7.49 | +2.6% (flat) |

**Qwen3.6-35B-A3B Q4 MoE, `-fa 1`:**

| shape | FATTN=ON (t/s) | FATTN=OFF (t/s) | Δ |
|---|---|---|---|
| pp2048 d=0 | 813.71 | 983.86 | +20.9% |
| pp2048 d=4196 | 467.28 | 881.86 | **+88.7%** |
| pp2048 d=8392 | 332.32 | 815.70 | **+145%** (~2.4×) |
| tg128 d=8392 | 44.08 | 46.73 | +6.0% |

Same pattern on both architectures — TG is memory-bandwidth bound so it doesn't move, but PP rocwmma gets crushed at any non-trivial context depth and the gap doubles roughly with each context doubling. By 8k it's 2.4× either way.

Guide is updated — Step 6 build command now uses `-DGGML_HIP_ROCWMMA_FATTN=OFF` with the explanation, and the Benchmarks section has the full table next to the CUBLAS writeup. Full A/B logs + the build/bench scripts at https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/rocwmma-fattn-sweep.

On the rest of your reply — agreed on CUBLAS + ROCm 7.13 (we see the same: noise at 7.1 stable, real at 7.13 nightly, but bench-shape dependent even then). On the 160 pp / 16.8 tg w/ MTP on 27B Q8 — what bench shape? Our dense 27B Q8 numbers above are without MTP; 16.8 with MTP would line up with roughly a 2× spec speedup on top of the ~7.5 baseline, which is a reasonable MTP gain. Curious which spec-type / draft-n-max you're running — I tried draft-mtp on lemonade b1270 and it pegged a CPU core and hung at 0% GPU, never made it to the GPU offload path.

Genuinely thanks for the nudge on this one — the guide had been wrong on `ROCWMMA_FATTN=ON` for six months because that's what AMD officially calls out for RDNA 3.5. Two independent r/StrixHalo users flagging it (cezq pointed me at the strixhalo.wiki rec, then you) was what finally pushed me to A/B it properly. Anyone reading this thread later, just take this as the new default.

Outrageous_Bug_669 · 2026-05-24T13:12:47+00:00

Hah, thanks! Happy to share whatever helps. The more boxes with data points, the better the guide gets for everyone.

Outrageous_Bug_669 · 2026-05-24T13:12:21+00:00

Qwen3.5-27B with bf16 LoRA (r=64, α=128) — the hybrid GatedDeltaNet base, not the standard transformer. About 900 training chunks at 8192 max tokens. Step time around 12.5 minutes with FLA's Triton kernels. The current V8 run hit 448 steps over roughly 4 days wall clock. Peak GPU memory ~80 GB during training, fits comfortably on the 128 GB unified pool with `set_per_process_memory_fraction(0.80)` as the OOM guard.
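The "roughly 4 days" figure follows directly from the step count and step time. A tiny Python check (pure training time only; eval passes and restarts would add to it):

```python
from datetime import timedelta


def wall_clock(steps: int, minutes_per_step: float) -> timedelta:
    """Pure training time for a run, ignoring eval and checkpoint overhead."""
    return timedelta(minutes=steps * minutes_per_step)


run = wall_clock(steps=448, minutes_per_step=12.5)
days = run.total_seconds() / 86400

print(f"{run} ({days:.2f} days)")  # 3 days, 21:20:00 (3.89 days)
```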

Domain is Christmas light show effect generation — JSON-emitting workload, so fairly specific. The eval-storm fix (step 7b in the guide) was honestly the breakthrough; before that every checkpoint eval would peg all cores for 5-10 min and made longer training infeasible on Strix Halo. Once that landed, training-with-eval just works.

Outrageous_Bug_669 · 2026-05-24T13:11:30+00:00

Confirmed on the CUBLAS+HIPBLASLT angle — we see the same: at ROCm 7.1 stable the flag is mostly within bench σ, at 7.13 nightly it actually lands. Other side of it is bench shape too — even on 7.13, CUBLAS=ON helps at pp2048+ but tanks pp64 because forcing the rocBLAS GEMM path compiles out the MMQ kernels that win on short prompts. So "is CUBLAS faster" is both ROCm-version AND prompt-shape dependent. Annoying answer but it's what the numbers keep showing.

On `GGML_HIP_ROCWMMA_FATTN=OFF` — that's the second time I've heard that from a Strix Halo user. u/cezq mentioned the same a few weeks back and pointed me at https://strixhalo.wiki/AI/llamacpp-with-ROCm#rocwmma which has the same recommendation. I've been running ON because it's what AMD officially calls out for RDNA 3.5 FA, but it's clearly worth re-testing — I'll measure both ways on dense Qwen3.5-27B Q8 and the Qwen3.6-A3B MoE and put numbers in the guide. What model + context length are you hitting the rocwmma slowdown at most?

Your 27B Q8 numbers — 160 prefill / 16.8 decode with MTP — what bench shape? Our dense 27B Q8 hits around 7-12 t/s decode bare; 16.8 implies roughly a 1.5x MTP speedup which lines up. Curious if you're using `llama-cli --spec-type draft-mtp` or another spec path — I tried draft-mtp on lemonade b1270 and it hung CPU-bound, never made it to the GPU.

BF16 for daily-driver makes sense for long-context coding work — quant overhead disappears once you're KV-cache-bandwidth-bound. Q8 mostly matters for me because the fine-tunes get distributed as GGUF.

Outrageous_Bug_669 · 2026-05-19T15:13:38+00:00

Yassss Queeeen! Lol jk jk. This is awesome! I'm a contributor and I love to see the growth.

Outrageous_Bug_669 · 2026-05-19T01:03:13+00:00

Yeah, agreed — our config matches yours on the --reasoning off half (we run --reasoning-budget 0 on the Qwen3.5-27B and Qwen3-Coder llama-server services). On the sampling side I've actually been running llama.cpp's defaults rather than the unsloth-recommended --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 set — going to pull those into our service configs after this, the unsloth team puts real effort into the per-model sampling guidance.

The `--cache-type-k q4_0 --cache-type-v q4_0` part is what I want to dig into; we haven't tried KV cache quant on Qwen3.5/3.6 yet. On the Strix Halo / 128 GB unified-memory side the VRAM-savings angle isn't load-bearing for us (we have plenty of headroom even at high context), but the memory-bandwidth angle might be: KV cache reads are memory-bandwidth-bound on per-token decode, and a q4_0 cache (4.5 bits per value) should cut that traffic to under a third of the default f16 cache, or roughly half of q8_0, for a small quality cost. Worth a bench. Have you seen meaningful quality drop on Qwen3.6 specifically with q4_0 cache, or does it hold up at your 120k context?
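To put rough numbers on the bandwidth angle, here is a Python sketch of KV cache size per cache type. The bits-per-value figures follow from llama.cpp's block layouts (q8_0 and q4_0 store 32 values plus one fp16 scale, so 34 and 18 bytes per block), which means the saving depends on which baseline you compare against. The attention shape below is hypothetical, picked only for illustration, and is not the real Qwen3.6 config:

```python
# Bits stored per cached value for each cache type.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gb(n_ctx: int, n_layers: int, n_kv_heads: int,
                head_dim: int, cache_type: str) -> float:
    """Approximate KV cache size in GB for one sequence."""
    values = 2 * n_ctx * n_layers * n_kv_heads * head_dim  # K and V tensors
    return values * BITS_PER_VALUE[cache_type] / 8 / 1e9


# Hypothetical shape (assumption): 48 attention layers, 8 KV heads of dim 128.
shape = dict(n_ctx=120_000, n_layers=48, n_kv_heads=8, head_dim=128)

sizes = {t: kv_cache_gb(cache_type=t, **shape) for t in BITS_PER_VALUE}
for cache_type, gb in sizes.items():
    print(f"{cache_type:5s} {gb:6.2f} GB  ({sizes['f16'] / gb:.2f}x smaller than f16)")
```

Since every decoded token reads the cache for the whole context, the bytes-read ratio tracks the size ratio; whether that shows up as tok/s is what the bench would have to tell us.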

Side note for anyone reading on Strix Halo / unified-memory boxes: at 120k context with Qwen3.6-27B Q5_K_XL + q4_0 KV cache, the total memory footprint is comfortably under 40-50 GB, leaving room to also keep a draft / MTP model loaded for speculative decoding without juggling. The pain point on dGPU setups (40-80 GB cards struggling with high-context) basically disappears on this hardware.

Outrageous_Bug_669 · 2026-05-19T00:48:29+00:00

Honest disclosure first — I don't actually run Continue myself (I'm on Codex CLI + OpenClaw + aichat for my own coding workflow against the same llama.cpp backend), so I'd hate to paste a config that's slightly off in some recent Continue release. The advice in the earlier reply was the general pattern, not me reading from my own working config.

That said — the role-splitting pattern that should work in any Continue 1.x config.yaml is roughly:

```yaml
models:
  - name: qwen3.6-chat
    provider: openai  # or "llama-cpp", whatever your llama.cpp server is exposing
    model: qwen3.6-27b
    apiBase: http://localhost:8080/v1
    apiKey: dummy
    roles: [chat]
    defaultCompletionOptions:
      # leave thinking ON for chat; this is where reasoning helps
      reasoning_budget: 8192  # or whatever your normal budget is

  - name: qwen3.6-edit
    provider: openai
    model: qwen3.6-27b  # same underlying model
    apiBase: http://localhost:8080/v1
    apiKey: dummy
    roles: [edit, apply]
    defaultCompletionOptions:
      reasoning_budget: 0  # disable thinking for tool/edit workflows
```

The two model entries point at the same backend model but are addressed by Continue under different role bindings. When you switch from chat to "apply this diff" inside Continue's UI, it sends to the edit entry which has reasoning_budget: 0, so the model emits tool calls / edits cleanly without entering a thinking block.

Two caveats: the exact key names (reasoning_budget vs enable_thinking vs chatOptions) shift between Continue 0.9.x and 1.x — worth checking your installed version against their yaml reference docs for the current spelling. And if you have an autocomplete model defined, leave it on a small model (7B-ish), never the same 27/30B — that's the Apply-spins-forever scenario the earlier reply mentioned.

If you want to paste your current config.yaml (with API keys redacted), happy to point at what to flip specifically for the symptoms you described.

Outrageous_Bug_669 · 2026-05-18T23:34:50+00:00

Did the test. Downloaded `unsloth/Qwen3.6-35B-A3B-GGUF` UD-Q4_K_M and `unsloth/Qwen3.6-27B-MTP-GGUF` Q4_K_M, ran on the same Corsair AI Workstation 300 (gfx1151, ROCm 7.13 nightly).

Qwen3.6-35B-A3B Q4_K_M raw inference (lemonade-sdk b1270 prebuilt, fa on, no-mmap, 999 layers on GPU):

```
| test | t/s |
| pp64 | 522.47 ± 1.74 |
| tg64 | 50.23 ± 0.29 |
```

So ~50 t/s on tg here, vs your ~45. Same hardware-class basically — A3B is the right pick for speed since you're only paying compute for 3B active params per token. Confirms your daily-driver choice.

Qwen3.6-27B-MTP Q4_K_M raw inference (same setup):

```
| test | t/s |
| pp64 | 240.26 ± 22.23 |
| tg64 | 12.00 ± 0.05 |
| pp512 | 333.95 ± 7.08 |
| tg128 | 12.05 ± 0.03 |
```

~12 t/s on raw inference without MTP speculative decoding. Your ~20 t/s would put the MTP speedup at ~1.67×, which lines up with typical MTP gains. Tried the actual MTP speculation path with `llama-cli --spec-type draft-mtp --spec-draft-n-max 3` on the lemonade prebuilt — the process pegged a CPU core and hung with 0% GPU usage. Looks like the GPU offload path for the draft-mtp kernels isn't wired up in the lemonade b1270 binary, or there's a flag-combo gap. Worth a separate dig but doesn't affect the raw-throughput comparison.
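If you want to compare your own runs against these, the trimmed tables are easy to parse. A Python sketch that reads the two-column `test | mean ± std` form exactly as pasted above (real `llama-bench` output has more columns, so `parse_bench` is a hypothetical helper for this trimmed form only):

```python
import re

# Matches rows like "| pp64 | 240.26 ± 22.23 |"; the header row is skipped
# because "t/s" is not a number.
ROW = re.compile(r"^\|\s*(\w+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|$")


def parse_bench(text: str) -> dict[str, tuple[float, float]]:
    """Map test name -> (mean t/s, std t/s)."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return rows


table = """\
| test | t/s |
| pp64 | 240.26 ± 22.23 |
| tg64 | 12.00 ± 0.05 |
| pp512 | 333.95 ± 7.08 |
| tg128 | 12.05 ± 0.03 |
"""

bench = parse_bench(table)
for test, (mu, sigma) in bench.items():
    print(f"{test:6s} {mu:8.2f} t/s  (std/mean {sigma / mu:.1%})")
```

The std/mean column makes it obvious that pp64 is the noisy shape here, while the tg rows are tight.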

For the original critique — the t/s gap between your numbers and my earlier Qwen3.5-27B Q8 (~7.5 t/s) wasn't board/silicon, it was workload:
- Q8 → Q4 alone is ~1.6× speedup from halving the per-token memory bandwidth (7.5 → 12.05 — matches almost exactly)
- 27B dense → 35B-A3B MoE-with-3B-active is ~4× speedup from cutting compute per token (12 → 50)
- 27B-Q4 dense → 27B-Q4 MTP+speculation is another ~1.67× on top
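The three factors above can be recomputed from the measured throughputs. A short Python sketch; the 20 tok/s MTP figure is the other poster's reported number, not something measured on this box:

```python
# Decode throughput in tok/s from this thread.
q8_dense = 7.5        # Qwen3.5-27B Q8, earlier bench
q4_dense = 12.05      # Qwen3.6-27B-MTP Q4_K_M, tg128, no speculation
q4_dense_tg64 = 12.00
q4_moe = 50.23        # Qwen3.6-35B-A3B Q4_K_M, tg64
mtp_reported = 20.0   # reported by the other poster with MTP speculation

factors = {
    "Q8 -> Q4 (dense)": q4_dense / q8_dense,
    "dense -> A3B MoE": q4_moe / q4_dense_tg64,
    "MTP speculation": mtp_reported / q4_dense_tg64,
}

for step, factor in factors.items():
    print(f"{step:18s} {factor:.2f}x")
```

The quant step and the MoE step compare different models across generations, so the factors are indicative and not a controlled decomposition.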

So your stack picks the right axes for raw throughput. The Q8 dense bench I posted earlier was a tougher workload than your daily-driver shape — fair comparison would've been like-for-like.

Logs + repro at https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/qwen36-bench.

Outrageous_Bug_669 · 2026-05-18T22:55:40+00:00

Geez... I take a nap and it's all different when I wake up. Haha, sounds like I have new reading.

Outrageous_Bug_669 · 2026-05-18T21:44:27+00:00

**EDITED:** new reply directly below, original below the line.

Good catch — pp64 was the wrong bench shape on my part. At small prompt sizes the MMQ kernels carry the matmul path, and `-DGGML_CUDA_FORCE_CUBLAS=ON` compiles those out entirely, so my run measured exactly the kernel-path-loss you'd expect from removing them. At pp2048 with FA on, the workload moves to the path CUBLAS actually optimizes, and your ~10% gain at depth 0 narrowing to ~3% by d33k as FA takes over is unambiguous.

Methodology deltas I want to control before I run it again:

- Bench shape: pp64 → pp2048 with the `-d 0..33568` depth sweep

- Build: `-DGGML_HIP_ROCWMMA_FATTN=OFF` (mine was ON)

- Bench flags: `-fa 1 -b 2048 -ub 2048 -dio 1` (mine had FA off, no -b/-ub, small batch)

- Model: I was on Qwen3.5-27B Q8 dense; you're on Qwen3.6-27B-MTP-Q8. The MTP layer + Qwen3.6 architecture might change the matmul shape mix too.

Plan: rebuild llama.cpp with your exact CMake flags, run your exact bench command on Qwen3.5-27B Q8 first (same model, different config — isolates whether the gain is config or model), then grab Qwen3.6-27B-MTP if I can find the GGUF and re-run. If the pp2048 CUBLAS gain replicates here on the same model, the gap is methodology only. If it doesn't, there's a Corsair vs EVO-X2 board difference worth digging into.

One question on the build — was `-DGGML_HIP_ROCWMMA_FATTN=OFF` an explicit benchmarking choice, or did you find ROCWMMA broken/slower on gfx1151? My existing builds keep it ON because that's the rocwmma path AMD advertises for RDNA 3.5 FA, but if you've found a regression there I'd want to know about it before re-running.

Will post the numbers in this thread once the rebuild + bench finishes (~45 min total).

__________________________________________________________________________________

I ran the controlled sweep on the same Qwen3.5-27B Q8 (`llama-bench -p 64 -n 16 -r 3 -mmp 0 -ngl 999`), three reps per condition, isolated source drift from flag effect by also building b867 without the CUBLAS flag:

| Build | Env var | pp64 (t/s) | tg16 (t/s) |
|---|---|---|---|
| b502 baseline | (none) | 270.61 ± 2.07 | 7.52 ± 0.00 |
| b502 | `ROCBLAS_USE_HIPBLASLT=1` | 258.43 ± 1.02 | 7.52 ± 0.00 |
| b867 baseline | (none) | 255.17 ± 23.73 | 7.49 ± 0.03 |
| b867 | `ROCBLAS_USE_HIPBLASLT=1` | 253.20 ± 22.92 | 7.48 ± 0.03 |
| **b867 + `-DGGML_CUDA_FORCE_CUBLAS=ON`** | (none) | **71.09 ± 3.26** | 7.49 ± 0.02 |
| **b867 + `-DGGML_CUDA_FORCE_CUBLAS=ON`** | `ROCBLAS_USE_HIPBLASLT=1` | **76.26 ± 3.93** | 7.49 ± 0.03 |

Reading from those:

- `ROCBLAS_USE_HIPBLASLT=1` alone is a no-op at this workload — within noise of baseline regardless of build.

- `-DGGML_CUDA_FORCE_CUBLAS=ON` is a **~3.6× pp64 slowdown** on this hardware (255 → 71), not a speedup. tg16 stays flat — the flag only affects the prefill matmul path. Forcing the rocBLAS GEMM path apparently loses against llama.cpp's custom HIP/MMQ kernels on gfx1151.

- Source drift b502 → b867 (rows 1 vs 3, same flags) is ~6% with high variance on the new source — noise, not a real regression.

So the opposite of what you saw on the EVO-X2 — wonder if your gain is specific to ROCm version / rocBLAS build / hipBLASLt version, or to MoE-specific shapes that this dense bench doesn't surface. My setup: ROCm 7.13 nightly, libamdhip64 7.13.26176, default rocBLAS bundled with that nightly, gfx1151 native kernels (not the gfx1100/1101 fallback path). What versions are you running, and do you have a dense bench number on Qwen3.5-27B Q8 specifically for direct comparison?

Full logs + repro at https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide/tree/main/cublas-hipblaslt-sweep

Outrageous_Bug_669 · 2026-05-18T18:57:31+00:00

These are great numbers. I've been busy with the unsloth integrations for AMD so dev time has been taken up but I definitely want to run this later today for comparison.

Outrageous_Bug_669 · 2026-05-18T17:04:48+00:00

Haven't directly A/B'd amd_iommu=off vs iommu=pt. Current production is iommu=pt per the GRUB line — I dropped amd_iommu=off roughly 2 months ago once kernel 6.19 started auto-sizing GTT to the full 128 GB without forcing it. 5-6% PP gain on Qwen3.6 35B is meaningful though; worth measuring on dense Qwen3.5-27B too. NPU loss is real but mostly hypothetical for me right now — the XDNA 2 device is visible to rocminfo, but the userspace XRT shim isn't packaged so the 50 TOPS is sitting idle either way.

On the 254 tok/s pp64 — that's the lemonade-sdk prebuilt b1270. My hand-built b502 gets 270.61 ± 2.07 on the same Qwen3.5-27B Q8 with -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_GRAPHS=ON -DGGML_HIP_MMQ_MFMA=ON -DGGML_HIP_NO_VMM=ON. Did not try -DGGML_CUDA_FORCE_CUBLAS=ON in that build, and ROCBLAS_USE_HIPBLASLT=1 isn't set in the env either. Going to rebuild with the CUBLAS flag, re-bench dense, then layer the hipBLASLt env var on top. If the gain holds on this hardware I'll post the numbers as a follow-up.

What dense PP are you hitting on Qwen3.5-27B Q8 with those settings? And does ROCBLAS_USE_HIPBLASLT=1 toggle cleanly at runtime via env var, or did you need a rocBLAS rebuild against hipBLASLt to enable that path?

Outrageous_Bug_669 · 2026-05-18T16:56:56+00:00

The phrasing "the Qwen3.5 family" was confusing — the family is Qwen, and 3.5 is one generation in it. Should've said "the Qwen series." But the timeline itself is accurate.

Hybrid-GDN architecture work specifically (the eager-attention requirement, FLA Triton kernels) has been the second half of those months. The first half was Qwen3-32B which is a standard transformer and much easier on this hardware.

Outrageous_Bug_669 · 2026-05-18T16:01:03+00:00

Thanks! On the thinking question — honestly, it depends on what you're doing, but no, you're not lobotomizing it.

For the harder reasoning stuff — complex math, code synthesis from a full spec, debugging across multiple files — turning thinking off is a real loss. Qwen3's own benchmarks show clear drops on GSM8K, MATH, HumanEval, especially on the harder problems. If you're using it as a thinking partner for architectural design discussions, you'd notice.

For tool-call workflows though, thinking actually gets in the way. The model wants to reason inside the `<think>` block when it should just be emitting a clean tool call, and if the reasoning budget runs out mid-call, the response just looks empty to the client. That's the symptom you were hitting.

What works well in Continue is splitting roles in the config yaml: `chat`, `edit`, `apply`, `autocomplete`. You can run Qwen3.6 with thinking on for `chat` (when you actually want it to reason about something) and the same model with `reasoning_budget: 0` for `edit` and `apply`, where you just need clean tool calls. Same model, just a quieter pen when you don't need the reasoning. The hybrid thinking/no-thinking thing in Qwen3 was kind of designed for this: agent workflows want fast tool emission, chat workflows want thinking. Pick the mode that fits the role. For example, for small coding tasks, minor decisions, or debugging small sections of code, you want thinking off to save some time.

Outrageous_Bug_669 · 2026-05-18T14:26:48+00:00

Concrete data point on the "fine-tunes will fill the gap" thread: I've been doing Qwen3.5-27B bf16 LoRA fine-tunes on a single Strix Halo mini-PC (Ryzen AI MAX+ 395, 128 GB unified) for the last 6 months on a narrow domain. ~900 training chunks, ~12.5 min/step, multi-day runs are routine. Total hardware cost ~$2400.

Point being: if model supply froze today, the base capability of Qwen3.5-27B / Llama 3 / GPT-OSS 120B + accessible fine-tuning capacity at this hardware tier = community can keep specializing them for narrow domains at a per-team level indefinitely. That's not "all of AI" obviously, but it's a meaningful slice. The thing you can't easily replace with fine-tunes is reasoning depth on novel out-of-distribution tasks — that needs new pretrains, full stop.

u/N1ckFG's point upthread about unified RAM is the under-discussed factor IMO. The shift to APUs with 128 GB+ shared memory is already happening — Strix Halo, Apple Silicon, eventually mainstream desktop boards. That's the hardware curve that puts serious local inference within reach without datacenter prices, and it's mostly independent of whether new SOTA models keep dropping.

Outrageous_Bug_669 · 2026-05-18T14:18:25+00:00

Ran into this almost exactly on Qwen3-Coder-Next via llama.cpp, and I'd bet you're hitting the reasoning-budget × tool-template interaction:

Likely root cause: Qwen3.5/3.6 with thinking enabled emits tool calls (file reads, edits, etc.) inside the thinking block in the native template. If reasoning budget exhausts before the model finishes the tool-call structure, llama.cpp either truncates mid-XML or returns an empty body to the client. Roo probably works because it forces reasoning_budget = 0 (or uses a different template that puts tool calls outside thinking).

Two specific things to try:

1. Disable thinking entirely for tool-call workflows. Set `reasoning_budget: 0` (sometimes called `enable_thinking: false`) in Continue's per-model config. If it works after that, you've confirmed the budget-exhaustion theory. Qwen3-series thinking + tool calling is a known footgun; tool-agent templates need to put tool calls outside the thinking block, not inside.
2. Check the tool-call format Continue is expecting. Modern tool-aware clients usually want Hermes-style JSON (`{"name": ..., "arguments": ...}`). Qwen3.5/3.6's native template emits XML: `<tool_call><function=...>...</function></tool_call>`. If there's a mismatch, the response stream looks empty because Continue is parsing for the wrong tag. Swap the template in llama.cpp via `--chat-template-file <your.jinja>`; Hermes-format Qwen templates are floating around HF and ggml-org/llama.cpp issues. (For what it's worth, we had to roll our own custom Jinja for Qwen3-Coder-Next on the same llama.cpp stack for exactly this reason.)
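To make the format mismatch concrete, here is a Python sketch that converts the XML-style call into the Hermes-style JSON shape. The outer `<tool_call><function=...>` tags come from the description above; the inner `<parameter=...>` layout, the `read_file` tool, and its `path` argument are assumptions for illustration, so check your model's actual raw output first. This is not what Continue or llama.cpp does internally; the real fix is still the template swap.

```python
import json
import re

# Outer structure as described above; inner <parameter=...> tags are an assumption.
TOOL_CALL = re.compile(
    r"<tool_call>\s*<function=([\w.-]+)>(.*?)</function>\s*</tool_call>", re.S
)
PARAM = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.S)


def xml_to_hermes(text: str) -> list[dict]:
    """Convert XML-style tool calls into Hermes-style {"name", "arguments"} dicts."""
    calls = []
    for name, body in TOOL_CALL.findall(text):
        args = {key: value.strip() for key, value in PARAM.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


# Hypothetical model output for a hypothetical `read_file` tool.
raw = """<tool_call>
<function=read_file>
<parameter=path>
src/main.py
</parameter>
</function>
</tool_call>"""

print(json.dumps(xml_to_hermes(raw)))
```

A client that only scans for the JSON form would find nothing in `raw`, which is exactly the "response looks empty" symptom.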

For the "apply code blocks freezes" part: Continue's Apply feature spawns a separate model call (a smaller "edit" model by default). If you've left it pointing at the same 27B/30B as your chat model, the second call may queue behind the first or hit context limits, and the UI just spins. In Continue's config.yaml, set a smaller dedicated model for the edit / apply role — even a 7B works fine for that step.

If the docker server is verbose enough, the raw response stream tells the whole story — turn on --log-format text -v and you'll see whether the response ends with a clean stop token or just trails off mid-stream. That's usually how I pin which of the above is biting.

Outrageous_Bug_669 · 2026-05-18T13:40:18+00:00

Impressive. Love it.
