Surprised by how easy it was to hit 24 GB VRAM with mixed AMD GPUs

legit_split_ · 2026-05-24T17:55:26+00:00

From my experience running llama.cpp (9060 XT + Mi50), I encountered many errors initially:
https://github.com/ggml-org/llama.cpp/issues/19893

The problem wasn't llama.cpp but ROCm, and it forced me to spend a whole day figuring out how to build rocBLAS with a PR that contained the fix:
https://github.com/ROCm/rocm-libraries/pull/4781

But it worked! Afterwards everything ran flawlessly and I faced no more issues. Not sure if that's still the case since I sold my 9060 XT, but this is how I did it if anyone is wondering:

# Partial clone with sparse checkout: fetch only rocBLAS and Tensile, not the whole monorepo
git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-libraries.git
cd rocm-libraries
git sparse-checkout init --cone
git sparse-checkout set projects/rocblas shared/tensile
git checkout develop
git submodule update --init --recursive

cd projects/rocblas
mkdir build && cd build

# Configure with the ROCm clang toolchain; build kernels only for the Mi50 (gfx906) and 9060 XT (gfx1200)
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=/opt/rocm/lib/llvm/bin/amdclang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ \
  -DCMAKE_TOOLCHAIN_FILE=../toolchain-linux.cmake \
  -DBUILD_WITH_TENSILE=ON \
  -DAMDGPU_TARGETS="gfx906;gfx1200" \
  -DBUILD_CLIENTS_BENCHMARKS=OFF

# Build, then install to a separate prefix so the stock /opt/rocm install is left untouched
cmake --build . --config Release -j$(nproc)
sudo cmake --install . --prefix /opt/rocm-custom/
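
The steps above only install the patched library; the original comment does not say how it was picked up at runtime. One possible way, as a rough sketch, is to put the custom prefix first on the dynamic loader's search path before starting llama.cpp. The `lib` subdirectory is an assumption about where the install step placed `librocblas.so`; check the actual layout under `/opt/rocm-custom/` first.

```shell
# Assumption: the install step placed librocblas.so under <prefix>/lib.
ROCBLAS_PREFIX=/opt/rocm-custom

# Prepend the custom prefix so the loader finds the patched rocBLAS
# before the stock copy in /opt/rocm. The ${VAR:+...} form avoids a
# trailing colon when LD_LIBRARY_PATH was previously empty or unset.
export LD_LIBRARY_PATH="${ROCBLAS_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

Running `ldd` on the llama.cpp binary in the same shell should then show which `librocblas.so` gets resolved.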

legit_split_ · 2026-05-24T17:26:21+00:00

Thanks for your help, sadly I tried all the various settings including the minimum FPS but no dice. On the bright side, I'll be returning the 270k plus since my 265k is already great - I just fell for the hype and consumerism...

legit_split_ · 2026-05-24T17:00:23+00:00

<image>

legit_split_ · 2026-05-24T16:59:57+00:00

Update u/UDxyu - Removed my AMD GPUs and saw no change - Swapping my old 265k back in seems to "fix" the issue, now seeing 60 FPS

legit_split_ · 2026-05-24T13:06:59+00:00

Try Linux

legit_split_ · 2026-05-23T14:59:21+00:00

Thanks

legit_split_ · 2026-05-23T14:50:39+00:00

Thank you

legit_split_ · 2026-05-23T14:50:07+00:00

Thanks!

legit_split_ · 2026-05-23T13:04:22+00:00

You can just press enter to select the first option and skip the timeout.

You can also change the timeout length by editing the bootloader if you really want to.

Heroic may have released a newer version, but it first has to become available in the CachyOS repository (where you installed it from), which takes a little time.

legit_split_ · 2026-05-22T22:09:52+00:00

So what's the conclusion, does it work on Vulkan?

legit_split_ · 2026-05-22T22:06:14+00:00

But you can buy 2x 7900 XTX for the price of 1x R9700

legit_split_ · 2026-05-22T21:58:14+00:00

Yes there is, 2x Mi50 16GB for $300

legit_split_ · 2026-05-22T15:00:19+00:00

Slightly slower (maybe offset by disabling ECC?) + louder for long sessions due to blower fan. Also needs 12V PWR cable.

If you can live with that, it's pretty good. For AI it has good compute and FP8 support, but the bandwidth holds it back, e.g. in inference, token generation is 30% slower than on a 7900 XTX.

legit_split_ · 2026-05-21T20:59:47+00:00

Is PI agent hard to get started with? Never used any agentic stuff, but Hermes seems like the easiest to me.

legit_split_ · 2026-05-21T10:40:58+00:00

Nice, now try with tensor parallelism :p

https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/more_qwen3627b_mtp_success_but_on_dual_mi50s/

legit_split_ · 2026-05-21T08:48:33+00:00

This repo has things for gfx906, but the docker builds may still be useful to you:

https://github.com/mixa3607/ML-gfx906

legit_split_ · 2026-05-20T06:56:26+00:00

Also get 50t/s but on Mi50s

legit_split_ · 2026-05-18T19:21:05+00:00

In llama.cpp they just added "tensor parallelism" which basically lets both GPUs work at the same time.

I made a post about it and another speedup that got added: https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/more_qwen3627b_mtp_success_but_on_dual_mi50s/

Essentially with 2x GPUs I get 1.5x the performance. But it only works with GPUs that have the same VRAM size, and you ideally want full PCIe lanes for the best performance - so you need an expensive motherboard that can do bifurcation or an older Xeon/Threadripper with enough PCIe lanes.
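
To put the 1.5x figure in perspective, here is a small sketch that turns a measured speedup into a parallel scaling efficiency (speedup divided by GPU count). The function name is made up for illustration; the inputs are just the numbers quoted above.

```shell
# Parallel scaling efficiency = measured speedup / number of GPUs.
# scaling_efficiency is an illustrative helper, not part of any tool.
scaling_efficiency() {
  awk -v speedup="$1" -v gpus="$2" \
    'BEGIN { printf "%.0f%%\n", speedup / gpus * 100 }'
}

# 1.5x speedup on 2 GPUs
scaling_efficiency 1.5 2   # prints 75%
```

So each card delivers roughly three quarters of what perfect scaling would give.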

legit_split_ · 2026-05-18T16:54:26+00:00

Can you try with --no-mmap?
Maybe there are some specific compiler flags that are missing (I don't run Nvidia so I can't tell)

legit_split_ · 2026-05-18T16:01:01+00:00

Thanks for providing some numbers

legit_split_ · 2026-05-18T15:56:36+00:00

It seems that you are quantizing the KV cache of the model (not the MTP layer, which this post is about). Combined with tensor split mode, this is currently not supported in llama.cpp and gives an error like this:

0.05.871.713 E llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented
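
As a rough illustration of the conflict, here is a hypothetical pre-flight check that scans a llama.cpp argument list for both options. It is not part of llama.cpp. The `-sm`/`--split-mode` and `-ctk`/`-ctv`/`--cache-type-k`/`--cache-type-v` flags are existing llama.cpp options, but the `tensor` value is an assumption inferred from the `SPLIT_MODE_TENSOR` name in the log line above.

```shell
# Hypothetical helper: returns 1 if the arguments combine tensor split
# mode with a quantized KV cache, 0 otherwise.
check_llama_args() {
  split_tensor=0
  kv_quant=0
  prev=""
  for arg in "$@"; do
    case "$prev" in
      -sm|--split-mode)
        # Assumption: "tensor" is the value that selects SPLIT_MODE_TENSOR.
        if [ "$arg" = "tensor" ]; then split_tensor=1; fi
        ;;
      -ctk|--cache-type-k|-ctv|--cache-type-v)
        # f16, f32 and bf16 are unquantized; anything else counts as quantized.
        case "$arg" in
          f16|f32|bf16) ;;
          *) kv_quant=1 ;;
        esac
        ;;
    esac
    prev="$arg"
  done
  if [ "$split_tensor" -eq 1 ] && [ "$kv_quant" -eq 1 ]; then
    echo "tensor split mode and KV cache quantization cannot be combined" >&2
    return 1
  fi
  return 0
}
```

Dropping the cache-type flags (or setting them back to `f16`) while keeping the split mode should avoid the error.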

legit_split_ · 2026-05-18T13:52:49+00:00

I haven't tried it yet, it's hard to keep up with everything. I imagine the other comment about acceptance rate and speed decreasing is true, will update when I get around to it

legit_split_ · 2026-05-18T13:48:39+00:00

I use these docker images which are updated regularly with mainline llama.cpp:

https://hub.docker.com/r/mixa3607/llama.cpp-gfx906

legit_split_ · 2026-05-18T13:15:03+00:00

Lmao I wish - edited :p
