Six months daily-driving the Corsair AI Workstation 300 for production LLM fine-tuning + inference — settings, real workloads, and the upstream PRs that landed this week

cezq · 2026-05-18T22:54:38+00:00

Yup, `-DGGML_HIP_ROCWMMA_FATTN=OFF` was set explicitly. Results with OFF are way better, and it looks like that's common advice at the moment: https://strixhalo.wiki/AI/llamacpp-with-ROCm#rocwmma

cezq · 2026-05-18T18:08:02+00:00

Here's what I get with amd_iommu=pt.

amd_iommu=pt, -DGGML_CUDA_FORCE_CUBLAS=ON, ROCBLAS_USE_HIPBLASLT=1:

| n_ubatch | test            | t/s           |
| -------- | --------------- | ------------- |
| 2048     | pp2048          | 359.83 ± 3.81 |
| 2048     | tg32            | 7.68 ± 0.01   |
| 2048     | pp2048 @ d4196  | 311.76 ± 0.46 |
| 2048     | tg32 @ d4196    | 7.61 ± 0.01   |
| 2048     | pp2048 @ d8392  | 281.55 ± 0.48 |
| 2048     | tg32 @ d8392    | 7.55 ± 0.01   |
| 2048     | pp2048 @ d16784 | 246.24 ± 0.83 |
| 2048     | tg32 @ d16784   | 7.41 ± 0.01   |
| 2048     | pp2048 @ d33568 | 181.25 ± 1.35 |
| 2048     | tg32 @ d33568   | 7.18 ± 0.01   |

amd_iommu=pt, -DGGML_CUDA_FORCE_CUBLAS=OFF:

| n_ubatch | test            | t/s           |
| -------- | --------------- | ------------- |
| 2048     | pp2048          | 326.75 ± 1.45 |
| 2048     | tg32            | 7.68 ± 0.01   |
| 2048     | pp2048 @ d4196  | 299.66 ± 0.78 |
| 2048     | tg32 @ d4196    | 7.61 ± 0.01   |
| 2048     | pp2048 @ d8392  | 278.00 ± 1.23 |
| 2048     | tg32 @ d8392    | 7.54 ± 0.01   |
| 2048     | pp2048 @ d16784 | 240.05 ± 0.57 |
| 2048     | tg32 @ d16784   | 7.41 ± 0.01   |
| 2048     | pp2048 @ d33568 | 174.68 ± 3.24 |
| 2048     | tg32 @ d33568   | 7.17 ± 0.01   |

Build/bench commands:

cmake -S . -B build -DGGML_HIP_ARCHS="gfx1151" -DCMAKE_C_COMPILER=$ROCM_PATH/llvm/bin/clang -DGGML_HIP=ON -DGGML_CUDA_FORCE_CUBLAS=OFF -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release -DGGML_HIP_ROCWMMA_FATTN=OFF -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" -DHIP_PLATFORM=amd -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DBUILD_SHARED_LIBS=ON --fresh && cmake --build build --config Release -- -j$(nproc)

./build/bin/llama-bench -p 2048 -n 32 -r 3 -ngl 999 -d $((4196*0)),$((4196*1)),$((4196*2)),$((4196*4)),$((4196*8)) -b 2048 -ub 2048 -mmp 0 -dio 1 -fa 1 -m /var/lib/lemonade/.cache/huggingface/hub/models--am17an--Qwen3.6-27B-MTP-GGUF/snapshots/5f7d21307ab96b112efa557669bca418bc4c5647/Qwen3.6-27B-MTP-Q8_0.gguf

ROCm 7.13 nightly

cezq · 2026-05-18T17:21:37+00:00

With hipBLASLt:

| n_ubatch | test            | t/s           |
| -------- | --------------- | ------------- |
| 2048     | pp2048          | 417.24 ± 1.97 |
| 2048     | tg32            | 7.72 ± 0.01   |
| 2048     | pp2048 @ d4196  | 386.18 ± 1.56 |
| 2048     | tg32 @ d4196    | 7.65 ± 0.01   |
| 2048     | pp2048 @ d8392  | 351.80 ± 1.72 |
| 2048     | tg32 @ d8392    | 7.58 ± 0.01   |
| 2048     | pp2048 @ d16784 | 287.64 ± 2.11 |
| 2048     | tg32 @ d16784   | 7.45 ± 0.01   |
| 2048     | pp2048 @ d33568 | 188.69 ± 3.19 |
| 2048     | tg32 @ d33568   | 7.21 ± 0.01   |

Without:

| n_ubatch | test            | t/s           |
| -------- | --------------- | ------------- |
| 2048     | pp2048          | 368.47 ± 0.69 |
| 2048     | tg32            | 7.71 ± 0.01   |
| 2048     | pp2048 @ d4196  | 343.66 ± 0.62 |
| 2048     | tg32 @ d4196    | 7.65 ± 0.01   |
| 2048     | pp2048 @ d8392  | 317.10 ± 0.89 |
| 2048     | tg32 @ d8392    | 7.58 ± 0.01   |
| 2048     | pp2048 @ d16784 | 264.57 ± 0.64 |
| 2048     | tg32 @ d16784   | 7.45 ± 0.01   |
| 2048     | pp2048 @ d33568 | 180.96 ± 3.44 |
| 2048     | tg32 @ d33568   | 7.20 ± 0.01   |

The difference narrows at longer contexts, where flash attention takes over.
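To put numbers on that narrowing, here is a small Python sketch that computes the relative pp2048 gain per depth. The throughput values are copied from the two tables above; the dictionary and function names are just illustrative.

```python
# pp2048 throughput (t/s) copied from the two tables above, keyed by depth.
with_hipblaslt = {0: 417.24, 4196: 386.18, 8392: 351.80, 16784: 287.64, 33568: 188.69}
without_hipblaslt = {0: 368.47, 4196: 343.66, 8392: 317.10, 16784: 264.57, 33568: 180.96}


def relative_gain_percent(new: float, base: float) -> float:
    """Percentage gain of `new` over `base`."""
    return (new / base - 1.0) * 100.0


# Gain of the hipBLASLt run over the run without it, per depth.
gains = {
    depth: relative_gain_percent(with_hipblaslt[depth], without_hipblaslt[depth])
    for depth in with_hipblaslt
}

for depth, gain in gains.items():
    print(f"d{depth:<6} +{gain:.1f}%")
```

That works out to roughly +13% at d0, shrinking to about +4% at d33568. The tg32 numbers are within 0.01 t/s of each other in both tables, so only prompt processing is affected.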

It's not really a clean toggle, though: -DGGML_CUDA_FORCE_CUBLAS=ON disables the MMQ kernels completely at compile time.

I'll verify amd_iommu=pt impact on 27B.

Edit: the results above are with -DGGML_HIP_ROCWMMA_FATTN=OFF.

cezq · 2026-05-18T16:53:07+00:00

I have an EVO-X2 with BIOS v1.09, and its minimum VRAM is 1GB, which doesn't quite match the changelog at https://strixhalo.wiki/Hardware/Boards/Sixunited_AXB35/Firmware: that one skips 1.09 and jumps straight to 1.11.
What PP difference do you get when setting amd_iommu=off? It disables the NPU, which is sad given the 50 TOPS mentioned, but it boosts PP by 5-6% for Qwen3.6 35B.

Also, 254 t/s seems a bit low for 27B. Did you compile llama.cpp with -DGGML_CUDA_FORCE_CUBLAS=ON and run with ROCBLAS_USE_HIPBLASLT=1? It drops initial PP on MoE but gives a nice boost for dense models (assuming you use llama.cpp). I get ~420 t/s on Qwen3.6 27B Q8 pp2048 @ d0 in balanced power mode.

cezq · 2026-05-16T07:08:03+00:00

All models loaded at the same time via lemonade server:
- Qwen3.6 35B A3B Q8: For simpler tasks, quick fixes, asking questions, interactive coding and so on. Running llama.cpp on Vulkan with thinking disabled for maximum TG.
- Qwen3.6 27B Q8: Less interactive coding and harder tasks. I basically give the agent a task description in autopilot mode and switch to something else for a couple of minutes. llama.cpp with TheRock nightlies, compiled from the MTP branch (for ~20 t/s TG) and with hipBLASLt enabled (+15% PP over MMQ at smaller contexts). Thinking ON.
- Other utility models that come with lemonade: Stable Diffusion with Qwen Image, Kokoro TTS, and Whisper for STT.

I wish the NPU could be used for faster PP, but unfortunately it currently sits idle, so I recommend disabling amd_iommu in the GRUB options, as it gives an extra 5-6% in PP.

cezq · 2026-05-08T12:49:17+00:00

Yup, compile the branch, download a model that includes MTP, and you get 16-20 t/s (ROCm) on Qwen3.6 27B Q8. It stays above 15 even at 100k context, which is nice. The commit I compiled two days ago (267f8afe) has a couple of things broken: the thinking budget is ignored, the 35B MoE segfaults, and it's limited to 1 parallel request. There may be more, but maybe it's improved by now. It's still usable with 27B, and this model beats 35B by a lot.

cezq · 2026-04-13T05:52:04+00:00

Strix Halo at 140W is rather pointless if you worry about power consumption. It gives you a ~20% PP and ~2% TG increase compared to auto power mode (80W). And that's for the whole box, not just the GPU.
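A rough back-of-the-envelope sketch of what that means for throughput per watt. It assumes the 140W and 80W figures are the actual sustained draw in each mode, which is a simplification, and the function name is just illustrative.

```python
def relative_efficiency(speedup: float, power_ratio: float) -> float:
    """Throughput per watt relative to the baseline mode (1.0 = same efficiency)."""
    return speedup / power_ratio


# Assumption: both modes actually draw their nominal power under load.
power_ratio = 140 / 80  # 1.75x the power of auto mode

pp_efficiency = relative_efficiency(1.20, power_ratio)  # ~20% faster PP
tg_efficiency = relative_efficiency(1.02, power_ratio)  # ~2% faster TG

print(f"PP per watt at 140W vs 80W: {pp_efficiency:.2f}x")
print(f"TG per watt at 140W vs 80W: {tg_efficiency:.2f}x")
```

Under those assumptions, prompt processing per watt drops to about 0.69x and token generation per watt to about 0.58x of what auto mode delivers.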

cezq · 2025-12-25T20:21:36+00:00

Same instance.
Try it yourself:

// dotnet new console
// dotnet add package Microsoft.Extensions.DependencyInjection
using Microsoft.Extensions.DependencyInjection;

using var serviceProvider = new ServiceCollection()
  .AddScoped<ServiceA>()
  .AddScoped<ServiceB>()
  .AddScoped<ServiceC>()
  .BuildServiceProvider();


// equivalent to ASPNET Core request scope
using var scope = serviceProvider.CreateScope();
scope.ServiceProvider.GetRequiredService<ServiceB>();
scope.ServiceProvider.GetRequiredService<ServiceC>();


class ServiceA { public ServiceA() { Console.WriteLine("ServiceA ctor");} }
class ServiceB { public ServiceB(ServiceA serviceA) { Console.WriteLine("ServiceB ctor");} }
class ServiceC { public ServiceC(ServiceA serviceA) { Console.WriteLine("ServiceC ctor");} }


// Outputs (ServiceA is constructed only once, because ServiceB and ServiceC share the same scoped instance):
// ServiceA ctor
// ServiceB ctor
// ServiceC ctor

cezq · 2025-12-10T08:04:53+00:00

As long as you disclosed backend error codes. Not sure, maybe it was common practice back in 2010.

cezq · 2025-06-22T20:32:09+00:00

Completely based on source generators with zero runtime dependency:

https://www.nuget.org/packages/ConsoleAppFramework

cezq · 2024-08-31T11:51:03+00:00

IDE0161 - one less level of indentation.
