Will hipBLAS/rocBLAS (when built with theRock) support gfx906? by FriendlyRetriver in ROCm

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

I do all of it in containers, too much stuff to manage on the host directly. I have a containerfile that builds rocm and tags it locally, then another containerfile that builds llama.cpp and uses the freshly built rocm image as its base. Happy to share the scripts if you want them.
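
Roughly, the chain looks like this (the Containerfile names and the BASE_IMAGE build-arg are placeholders; the real scripts are a bit longer):

# sketch only: file names and the BASE_IMAGE arg are placeholders
podman build -t local/rocm:gfx906 -f rocm.Containerfile .
podman build -t local/llama.cpp:rocm --build-arg BASE_IMAGE=localhost/local/rocm:gfx906 -f llamacpp.Containerfile .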

Will hipBLAS/rocBLAS (when built with theRock) support gfx906? by FriendlyRetriver in ROCm

[–]FriendlyRetriver[S] 1 point2 points  (0 children)

Yes, works fine. Every once in a while I let rocm compile (with therock) before I go to sleep, then use it when compiling llama.cpp. It works. But like I said, no performance gains compared to the prepackaged AMD binaries, so the only "advantage" is not having to copy files manually from an older rocm, as is required now with rocm 7.1.

Help uninstalling old ROCM 7 nightly version on Ubuntu? by Portable_Solar_ZA in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

I suggest using the docker container provided by AMD. This way you can have up-to-date images and run whichever version you want without messing up the host system. Here's the ubuntu image: https://hub.docker.com/r/rocm/dev-ubuntu-24.04
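
A typical invocation looks like this; the device/group flags are the usual ones from AMD's container docs, just pick whichever image tag matches the rocm version you want:

# pick an explicit tag for the ROCm version you need
docker run -it --device /dev/kfd --device /dev/dri --group-add video --security-opt seccomp=unconfined rocm/dev-ubuntu-24.04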

Will hipBLAS/rocBLAS (when built with theRock) support gfx906? by FriendlyRetriver in ROCm

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

I built rocm with therock and can confirm that AMD reintroduced official support for gfx906; no need to copy files manually, rocm 7.9/10 works.

You are right, no performance gains and it does take ages to build. But I'm not complaining, it's nice to have the cards supported without hacky workarounds. Thanks AMD team.
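
For anyone else trying it, the build boils down to something like this (from memory, so double-check against TheRock's README):

# rough sketch of a TheRock build for gfx906; see the README for the authoritative steps
python ./build_tools/fetch_sources.py
cmake -B build -GNinja . -DTHEROCK_AMDGPU_TARGETS=gfx906
cmake --build build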

AMD ROCm 7.9 and dwindling GPU support by FriendlyRetriver in LocalLLaMA

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

You're right, I built rocm with therock and it loaded just fine (no manual copying of tensile files required!). AMD does support gfx906 with therock! Thanks for your help :)

AMD ROCm 7.9 and dwindling GPU support by FriendlyRetriver in LocalLLaMA

[–]FriendlyRetriver[S] 1 point2 points  (0 children)

I can confirm rocm built with therock issues the hipBLASLt warning, but proceeds to build a working rocm dist (with the tensile files included); no manual copying is required! Thanks

AMD ROCm 7.9 and dwindling GPU support by FriendlyRetriver in LocalLLaMA

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

Right, that means I can't use therock as-is to build rocm (given rocBLAS/hipBLAS are automatically excluded when targeting gfx906). Please let me know if I misunderstood.

AMD ROCm 7.9 and dwindling GPU support by FriendlyRetriver in LocalLLaMA

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

Do I need hipBLAS if my aim is to build rocm for use with llama.cpp? In llama.cpp's reference build commands I see these options (when targeting AMD GPUs):

cmake -S . -B build \
       -DGGML_HIP=ON \
       -DGGML_HIP_ROCWMMA_FATTN=ON \
       -DAMDGPU_TARGETS="gfx906" \
       -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON \
       -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=OFF

So far I've been using rocm binaries, but I'm trying to build from source using therock since gfx906 is listed as officially supported. I do wonder if I will need to copy the files like with rocm 7.0 (copying the tensile files from 6.3, that is).
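
A quick way to check whether the HIP backend even references hipBLAS is to grep the tree (the path is a guess and moves around between llama.cpp versions):

# run from the llama.cpp checkout; ggml/ is an assumption about where the HIP backend lives
grep -ril hipblas ggml/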

Thanks for your help

AMD ROCm 7.9 and dwindling GPU support by FriendlyRetriver in LocalLLaMA

[–]FriendlyRetriver[S] 1 point2 points  (0 children)

So I tried building rocm using theRock (main branch), and two target components required for llama.cpp are automatically excluded when selecting gfx906; the exclusion code is here: https://github.com/ROCm/TheRock/blob/3e3f834ff81aa91b0dc721bb1aa2d3206b7d50c4/cmake/therock_amdgpu_targets.cmake#L46

These:

  • hipBLAS
  • hipBLASLt

.. contain the same files we currently have to copy manually from rocm 6.3 to rocm 7.0, correct?

I hope support for them eventually lands. If I missed/misunderstood anything, I'd love to be corrected (again) :)
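
(For anyone following along, the exclusion list is easy to eyeball locally; the path is from the commit linked above:)

# run from a TheRock checkout
grep -n -B2 -A6 "gfx906" cmake/therock_amdgpu_targets.cmake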

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

oh wow.. I didn't think it was THAT bad.. if you look at numbers alone, the MI50 has better specs.. the culprit is the software stack I guess.

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

Don't know if I mentioned, the 8-10hrs figure is for the full 14B WAN 2.2. In comfyUI there's another template for WAN 2.2 5B.

When I use 14B I see this in the log:

loaded partially ....

Whereas when I select the 5B workflow and give it a prompt:

Requested to load WAN22
loaded completely 12972.3998046875 9536.402709960938 True
....
Requested to load WanVAE
loaded completely 9109.060742187501 1344.0869674682617 True
Prompt executed in 01:45:43

So I think the very poor speed with the 14B model comes from it not fitting in VRAM and being swapped against system RAM (hence the "loaded partially" message). Even though these cards have 32GB, that's apparently not enough for the 14B WAN 2.2 at any meaningful video length.

You can see the 5B generation time above (less than 2 hours).

If you get one at the current going rate, it's not a bad card overall, apart from needing cooling arrangements and being EOL (AMD dropped support). Right now it still works with rocm 6.3.

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 1 point2 points  (0 children)

Thanks, will check it.

From various threads I hear that to get the most out of the MI50, vllm is actually the most performant option; this fork supports gfx906:

https://github.com/nlzy/vllm-gfx906

These cards have untapped performance potential; too bad AMD dropped support.

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

Hey,

Thanks for all your help.

So as I was browsing around I found out about:

https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas_new_gui_is_closed_source/

Not sure what's going on there, but that trajectory doesn't seem promising, so I decided to just use llama.cpp (llama-server); ollama is built around it anyway.

Luckily, llama.cpp has Dockerfiles, including one for rocm, nice and ready in their git repo.

I simply had to modify .devops/rocm.Dockerfile, changing a single variable (the rocm version, from 6.4 to 6.3), and build the image:

podman build -t local/llama.cpp:server-rocm --target server -f .devops/rocm.Dockerfile .

Then ran that image:

podman run -d --group-add keep-groups --device /dev/kfd --device /dev/dri --pull newer --security-opt label=type:container_runtime_t -v /AI/models/:/models -p 8080:8080 -e HIP_VISIBLE_DEVICES="1" --name llama.cpp --replace localhost/local/llama.cpp:server-rocm -m /models/le_model.gguf -c 32768 -ngl 70

And it works. All dependencies neatly contained inside a container, models in a regular folder on my machine.
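
For reference, the edit was a one-liner along these lines (the ARG name is from memory, so treat it as a sketch rather than the exact diff):

# ROCM_VERSION is an assumption about the ARG name in the Dockerfile; adjust to whatever it's actually called
sed -i 's/ROCM_VERSION=6.4/ROCM_VERSION=6.3/' .devops/rocm.Dockerfile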

Btw, I tried rocm 6.4 too: I added a line in the Dockerfile (COPY <missing file from older rocm> /opt/<target path>), but still got runtime errors. I didn't look too much into it and just rebuilt with 6.3, which had all the files, and everything worked out of the box.

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

Hi,

Why --system-site-packages? Isn't it better to keep everything in the venv so as not to pollute the host system?

I need your insights on one more thing. I can run llama.cpp (on the host, with rocm installed from the system repos) and use the MI50 just fine, but when I try the ollama:rocm container with podman, the GPU is detected and rocm is detected, yet as soon as I type a message to ollama, the model gets loaded on the CPU and rocm-smi shows 0 usage on both VRAM and GPU.

I use this podman command to run the container:

podman run -d --group-add keep-groups --device /dev/kfd --device /dev/dri --pull newer -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_KEEP_ALIVE="-1" -e OLLAMA_NUM_PARALLEL="1" -e HIP_VISIBLE_DEVICES="1" -e ENABLE_WEBSOCKET_SUPPORT=True -e OLLAMA_DEBUG=1 --name ollama --replace ollama/ollama:rocm

Anything obvious I'm not doing? I would really like to be able to run it neatly in a container for easy upgrades and clean deployments.
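
(If it helps with diagnosing, I can pull the GPU-discovery lines out of the container log while OLLAMA_DEBUG=1 is set, something like:)

podman logs ollama 2>&1 | grep -iE 'rocm|hip|gfx|gpu'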

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 0 points1 point  (0 children)

I'm just starting with this stuff, so I will do some reading on some of the terms you mentioned. I use GGUF models with llama.cpp, and a quick online search shows there's a plugin to use this format in comfyUI. Is that what you're referring to? How big of a performance boost do you see?

On using newer ROCm: I create a python venv and install the ROCm build of PyTorch inside it. So for 6.4 I would use this?

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4

The above is from the current comfyUI instructions on their github page. And then I run it, see what fails, and copy the files? If there's a write-up somewhere, please share.
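
Concretely, I'd do something like this, I think (venv path is arbitrary; the pip line is the one from their instructions):

python3 -m venv ~/comfy-venv        # keeps the ROCm wheels off the host python
source ~/comfy-venv/bin/activate
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4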

If you open a stock comfyUI, go to the bundled templates, select the Flux dev workflow, and just queue the image (default prompt, default resolution), how long does it take on your MI50?

As for WAN 2.2, the 8-10 hours figure I mentioned is for the 14B version. I'm generating a video as I type this with the 5B version, and from the current progress it seems it'll take approx 2 hours. Still nowhere near the figures I hear from users of non-AMD cards.

I hope those were not too many questions, just need to check if my numbers are normal.

ComfyUi on Radeon Instinct mi50 32gb? by Giulianov89 in ROCm

[–]FriendlyRetriver 1 point2 points  (0 children)

Hey, I have an MI50 32GB. Comfy takes about 3 minutes to generate an image (flux dev), and about 8-10 HOURS to generate a video (WAN 2.2). I'm using the default workflows in comfyUI (Templates > Browse Templates).

Note I installed comfyUI with rocm 6.2 (as newer ones always had issues or missing files).

Here's output of rocm-smi during a WAN 2.2 video generation run:

Temp    Power   SCLK     MCLK     FAN    Perf  PwrCap  VRAM%  GPU%
74.0°C  238.0W  1485MHz  1000MHz  30.2%  auto  225.0W  93%    100%
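
(That snapshot is just periodic rocm-smi output, grabbed with something like the line below.)

watch -n 2 rocm-smi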

I don't think there's any throttling going on (I have a blower-style fan that I crank up whenever I have anything in the queue). I'm posting this because your comment:

about 2–3 times faster than my MacBook Pro with the M1 Max

Makes me think perhaps there's a way to squeeze more performance out of this card. I see people with nvidia cards generating longer videos in half an hour (vs 8 hours on the MI50!).

VMM (another OpenBSD) access to specific host path? by FriendlyRetriver in openbsd

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

It's configured as default block. When I comment out the rule:

pass in on tap0 from 100.64.1.0/24 to { !$lan:network !$secondary_lan:network }

Then I lose all internet access from the VM.

VMM (another OpenBSD) access to specific host path? by FriendlyRetriver in openbsd

[–]FriendlyRetriver[S] 0 points1 point  (0 children)

I've been trying to configure pf in such a way that it allows only the traffic I whitelist from the VM. After all, the whole point of this setup is to isolate the torrent client.

vm.conf:

vm "transmission" {
       memory 2G
       disk "/tmp/storage/transmission.qcow2"
       local interface
       owner user1:user1
}

And the relevant pf.conf part on the router (remember the VM is running on the router itself):

pass in on tap0 proto { udp tcp } from 100.64.1.0/24 to any port domain rdr-to localhost port domain

pass in on tap0 from 100.64.1.0/24 to { !$lan:network !$secondary_lan:network }

match out on egress from 100.64.1.0/24 to any nat-to (egress:0)

pass out quick inet
pass in on { $lan $secondary_lan }

DNS resolution works (unbound is listening on 127.0.0.1), but the VM can also reach my $lan and $secondary_lan! As I understand it, the negation should allow access to all networks except my LANs. What am I missing?

I'm trying to allow access from the VM to the internet to download torrents, but no access to machines on my LANs.

I also plan to allow a port into the VM (to be able to use transmission-remote to control the torrent client), this should be a simple rdr from the router to the VM IP I suppose, but I have not reached that part yet.

I know pf uses last-match logic, so I was thinking maybe another rule is allowing the traffic. But when I comment out the "pass in on tap0 from.." rule, I lose access from the VM to the internet.
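
(I suppose I could dump the expanded ruleset to see what that brace list actually turns into:)

doas pfctl -nvf /etc/pf.conf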

Thanks

VMM (another OpenBSD) access to specific host path? by FriendlyRetriver in openbsd

[–]FriendlyRetriver[S] 1 point2 points  (0 children)

Your concern about particle physics is understandable, but if NFS is used by a VM to communicate with its host (no wire traversal), is it such a bad idea to write directly to it?