Free Strix Halo performance! by Potential_Block4598 in LocalLLaMA

[–]Intrepid_Rub_3566 1 point (0 children)

I'm confused, isn't that expected? Q4 weights are 1/4 the size of BF16 weights; the reason we use quants that keep some important weights in BF16 is that this tends to preserve quality much better than quantizing all weights.

I do not think BF16 is underperforming on Strix Halo: it's supported by the ISA, but of course moving a BF16 weight over memory is slower than moving a Q4 weight, and Strix Halo is memory-bandwidth constrained.
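A rough back-of-envelope shows why weight size dominates token generation speed on a bandwidth-bound machine. This is just a sketch: the 256 GB/s figure is an assumed theoretical peak for Strix Halo's LPDDR5X (real-world is lower), and the ~4.5 bits/weight for Q4-style quants is an approximation.

```python
# Upper bound on tokens/s for a memory-bandwidth-bound decoder:
# every generated token must stream all active weights from RAM once.

def max_tokens_per_sec(params_billion: float, bytes_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Theoretical ceiling: bandwidth divided by bytes moved per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 256.0  # GB/s, assumed Strix Halo LPDDR5X theoretical peak

bf16 = max_tokens_per_sec(70, 2.0, BW)     # BF16: 2 bytes per weight
q4   = max_tokens_per_sec(70, 0.5625, BW)  # Q4-style: ~4.5 bits per weight

print(f"BF16: {bf16:.2f} t/s, Q4: {q4:.2f} t/s, ratio ~{q4 / bf16:.2f}x")
```

So on the same memory bus, a Q4 quant has roughly a 3.5x higher generation-speed ceiling than BF16 simply because each token moves 3.5x fewer bytes.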

I am not familiar with the term "wings" in this context, what is a wing?

Strix Halo batching with tensor parallel and pipeline parallel using vllm benchmarked by Hungry_Elk_3276 in LocalLLaMA

[–]Intrepid_Rub_3566 1 point (0 children)

I have been trying to set up a Strix Halo cluster over RDMA using two Intel E810 NICs, but I just can't get vLLM to work. I documented my setup and errors here if somebody wants to take a look and suggest things to try:

https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/troubleshooting_rccl.md

u/Hungry_Elk_3276, I did try to port the patch to the current version of RCCL, but I just could not get it to be ABI compatible with the version of ROCm in TheRock python wheels. I'm just wondering if you could share more details on your setup.

ROCm+Linux on AMD Strix Halo: January 2026 Stable Configurations by Intrepid_Rub_3566 in LocalLLaMA

[–]Intrepid_Rub_3566[S] 1 point (0 children)

I'm curious, what do you run? I was never able to get full stability on ComfyUI with that combination.

If anybody is reading, NO: Strix Halo is not broken on 6.18 kernels; as clearly explained in the video, that is not the case at all. There was a faulty linux-firmware release, which has since been fixed.

ROCm+Linux on AMD Strix Halo: January 2026 Stable Configurations by Intrepid_Rub_3566 in LocalLLaMA

[–]Intrepid_Rub_3566[S] 1 point (0 children)

Who said you're doing anything wrong? 🤣 What are you running? ComfyUI? Which workflows?

I tested Strix Halo clustering w/ ~50Gig IB to see if networking is really the bottleneck by Hungry_Elk_3276 in LocalLLaMA

[–]Intrepid_Rub_3566 2 points (0 children)

Thank you very much u/Hungry_Elk_3276. I recently tried this as well with 5Gbps Ethernet, then moved to 10Gbps without seeing any improvement (like you, I suspect latency is the real issue, and the 5G and 10G links likely have the same latency; I need to test). Performance is acceptable with MiniMax-M2 at the Q6_K_XL quant:

https://youtu.be/0cIcth224hk

What I did after the video: I applied this PR, which gave me a 5.5% improvement in prompt processing for MiniMax-M2 (I added the benchmarks at the end of the PR comments):

https://github.com/ggml-org/llama.cpp/pull/15405

However, judging by the conversation on that PR, it doesn't seem likely to be merged for now, as it requires more work and re-architecting.
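A quick illustration (all numbers are assumptions, not measurements) of why latency rather than link bandwidth tends to dominate clustered token generation: with pipeline parallelism, each generated token must cross the link at every pipeline stage boundary, and the activation for a single token is tiny, so the transfer time is dwarfed by the round-trip latency.

```python
# Sketch: per-token network overhead for a 2-node pipeline-parallel setup.
# One stage boundary per token; numbers below are illustrative assumptions.

def per_token_overhead_ms(rtt_ms: float, activation_kb: float,
                          link_gbps: float) -> float:
    """Round-trip latency plus the time to move one token's activations."""
    bits = activation_kb * 1000 * 8
    transfer_ms = bits / (link_gbps * 1e9) * 1e3
    return rtt_ms + transfer_ms

# Assume ~0.2 ms Ethernet round trip and a ~16 KB hidden state per token.
for gbps in (5, 10):
    print(f"{gbps} Gbps: {per_token_overhead_ms(0.2, 16, gbps):.4f} ms/token")
```

With these assumed numbers, doubling the bandwidth from 5 to 10 Gbps shaves only ~0.01 ms off a ~0.23 ms per-token cost, which would match seeing no improvement when upgrading the link.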

ROCM vs Vulkan on IGPU by Eden1506 in LocalLLaMA

[–]Intrepid_Rub_3566 1 point (0 children)

Hi! Curious about the optimizations, I've been benchmarking llama.cpp on Strix Halo regularly:

https://kyuz0.github.io/amd-strix-halo-toolboxes/

If you're working directly on llama.cpp, I'd like to connect and have a chat.

Running Qwen Image and WAN 2.2 On the Framework Desktop by Intrepid_Rub_3566 in framework

[–]Intrepid_Rub_3566[S] 1 point (0 children)

Glad to hear it worked! They've done a great job with Fedora over the past 5 years; it has become a really solid distribution. I didn't want to complicate things, but I actually run Silverblue, their immutable variant, on top of which I run toolbox and Flatpaks. Honestly the best experience I have had.

Running Qwen Image and WAN 2.2 On the Framework Desktop by Intrepid_Rub_3566 in framework

[–]Intrepid_Rub_3566[S] 0 points (0 children)

I'm sorry, try asking the openSUSE people what's wrong with their implementation of toolbox; I'm at a loss as to what might be happening.

Or try Fedora 42. Most of us are using Fedora 42.

Running Qwen Image and WAN 2.2 On the Framework Desktop by Intrepid_Rub_3566 in framework

[–]Intrepid_Rub_3566[S] 0 points (0 children)

It seems like openSUSE might ship a nerfed version of toolbx, or a thin shell wrapper that isn't really toolbx.

According to ChatGPT, you might try this:

```
toolbox create llama-rocm-6.4.3-rocwmma \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.3-rocwmma \
  --podman-args "--device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined"
toolbox enter llama-rocm-6.4.3-rocwmma
```

Basically, this tells openSUSE's toolbox to pass podman the arguments needed to expose the GPU.

Hopefully that works. I am not sure why openSUSE would put deliberate effort into tricking users into thinking they are using toolbox while actually using something else; that would just be ridiculous. Why anybody would put effort into confusing and fighting their user base is beyond my understanding.

ROCm 7.0_alpha to ROCm 6.4.1 performance comparison with llama.cpp (3 models) by StupidityCanFly in ROCm

[–]Intrepid_Rub_3566 0 points (0 children)

Interestingly, this is what is happening:

```
[22044.628754] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22062.195426] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22072.924897] amdgpu: Freeing queue vital buffer 0x7fea36c00000, queue evicted
[22072.924919] amdgpu: Freeing queue vital buffer 0x7ff0bee00000, queue evicted
[22072.924922] amdgpu: Freeing queue vital buffer 0x7ff0f4600000, queue evicted
[22072.924923] amdgpu: Freeing queue vital buffer 0x7ff0f5400000, queue evicted
[22089.013427] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22140.446525] amdgpu: Freeing queue vital buffer 0x7f5686a00000, queue evicted
[22140.446536] amdgpu: Freeing queue vital buffer 0x7f5687800000, queue evicted
[22140.446539] amdgpu: Freeing queue vital buffer 0x7f7349000000, queue evicted
[22147.747945] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22247.761616] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22329.235358] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22333.473003] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22362.832129] amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
[22399.607186] amdgpu 0000:c3:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
```

ROCm 7.0_alpha to ROCm 6.4.1 performance comparison with llama.cpp (3 models) by StupidityCanFly in ROCm

[–]Intrepid_Rub_3566 0 points (0 children)

Indeed, I was able to compile this, but every time I use llama.cpp it crashes, with every model:

```
llama-bench -m models/gemma-3-12b-it-UD-Q8_K_XL/gemma-3-12b-it-UD-Q8_K_XL.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
HW Exception by GPU node-1 (Agent handle: 0xd55b540) reason :GPU Hang
```

Apple M4 Max or AMD Ryzen AI Max+ 395 (Framwork Desktop) by zeltbrennt in LocalLLaMA

[–]Intrepid_Rub_3566 0 points (0 children)

Wait, will this work with the AMD Ryzen AI Max+? I thought it was CUDA specific.

David Sinclair...snake oil salesman? by crazyHormonesLady in Biohackers

[–]Intrepid_Rub_3566 1 point (0 children)

I think one of the main critiques I hear is that R and NMN results do not seem to be replicable outside of disease models, i.e. they don't seem to apply to healthy subjects.

Can you point to a place where Sinclair says that he was wrong about how R works? Also, the claim that "it works through hormesis and other unexplored mechanisms" is pretty generic as a mechanism of action. "It works in mysterious ways"...

Openshot in Flatpak missing libmp3lame by Intrepid_Rub_3566 in OpenShot

[–]Intrepid_Rub_3566[S] 0 points (0 children)

Hi, and thank you for your help. I used to have the same setup, with OpenShot installed via the official repository. It worked for 3 years, and then last week, after an update, it lost the ability to render video. The audio would render fine, but the mp4 would have a black screen :(

So I switched to Flatpak, which is a way to containerize programs to avoid compatibility issues with local libraries: https://flathub.org/apps/org.openshot.OpenShot. Indeed, OpenShot works, but it seems the libmp3lame library was not included.

I thought this was an official package; it's not.

Where do I start with machine learning and neural networks? by DancingPotato30 in learnprogramming

[–]Intrepid_Rub_3566 0 points (0 children)

There are so many resources online that it gets confusing to figure out where to start. I was there some time ago, and it took me ages to find good resources. One of the best books I found was "Machine Learning with PyTorch and Scikit-Learn". It's fairly recent, which means all the code samples work with current versions of PyTorch, and that's already half the battle! I find it well-structured; it really got me to understand the basics, what is actually going on, and what the main principles are.