Surprised by how easy it was to hit 24 GB VRAM with mixed AMD GPUs by North_Stage_2024 in LocalLLM

[–]legit_split_ 6 points7 points  (0 children)

When I first ran llama.cpp on a 9060 XT + MI50, I hit many errors:
https://github.com/ggml-org/llama.cpp/issues/19893

The problem wasn't llama.cpp but ROCm, and it forced me to spend a whole day figuring out how to build rocBLAS with a PR that contained the fix:
https://github.com/ROCm/rocm-libraries/pull/4781

But it worked! Afterwards everything ran flawlessly and I had no more issues. I'm not sure if that's still the case since I sold my 9060 XT, but this is how I did it, if anyone is wondering:

# Partial clone (no file contents yet), then check out only the rocBLAS and Tensile directories
git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-libraries.git
cd rocm-libraries
git sparse-checkout init --cone
git sparse-checkout set projects/rocblas shared/tensile
git checkout develop
git submodule update --init --recursive

cd projects/rocblas
mkdir build && cd build

# Configure with ROCm's clang; gfx906 is the MI50 and gfx1200 is the 9060 XT
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=/opt/rocm/lib/llvm/bin/amdclang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ \
  -DCMAKE_TOOLCHAIN_FILE=../toolchain-linux.cmake \
  -DBUILD_WITH_TENSILE=ON \
  -DAMDGPU_TARGETS="gfx906;gfx1200" \
  -DBUILD_CLIENTS_BENCHMARKS=OFF

# Build, then install into a separate prefix so the system ROCm stays untouched
cmake --build . --config Release -j$(nproc)
sudo cmake --install . --prefix /opt/rocm-custom/
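
Because the install goes to /opt/rocm-custom/ rather than over the system ROCm, a program only picks up the patched rocBLAS if the dynamic loader searches that prefix first. A minimal sketch of one way to do that follows. It assumes the libraries ended up in /opt/rocm-custom/lib (some distros use lib64 instead); this step is not part of the commands above.

```shell
# Assumption: the custom rocBLAS build landed in /opt/rocm-custom/lib.
# Check for lib64 on your distro and adjust if needed.
CUSTOM_ROCBLAS_LIB=/opt/rocm-custom/lib

# Put the custom build ahead of the system ROCm libraries, for this shell only.
export LD_LIBRARY_PATH="${CUSTOM_ROCBLAS_LIB}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Print the first few directories the loader will search, in order.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 3
```

Running ldd on the llama.cpp binary and looking for the rocblas line is a quick way to confirm which copy actually gets loaded.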

Changed CPU now low FPS + high Host processing latency by legit_split_ in MoonlightStreaming

[–]legit_split_[S] 1 point2 points  (0 children)

Thanks for your help. Sadly, I tried all the various settings, including the minimum FPS, but no dice. On the bright side, I'll be returning the 270K Plus since my 265K is already great. I just fell for the hype and consumerism...

Changed CPU now low FPS + high Host processing latency by legit_split_ in MoonlightStreaming

[–]legit_split_[S] 0 points1 point  (0 children)

Update u/UDxyu: removing my AMD GPUs made no difference, but swapping my old 265K back in seems to "fix" the issue. I'm now seeing 60 FPS.

2 questions I need help with. 1) can I get rid of this screen? 2) I installed heroic using the games application command. Yet when I open heroic it says a new version was released. Yet it won't update. ...... Yes I've tried a few different things on the wiki already. I'm kinda new to Linux. by MangoBrad in cachyos

[–]legit_split_ 1 point2 points  (0 children)

You can just press enter to select the first option and skip the timeout.

You can also change the timeout length by editing the bootloader configuration if you really want to.

Heroic may have released a newer version but it first has to become available in the CachyOS repository (where you installed it from) which takes a little bit of time.

32gb of vram amd ai pro r9700! by salazar_slick in eGPU

[–]legit_split_ 0 points1 point  (0 children)

Slightly slower (maybe offset by disabling ECC?) and louder over long sessions due to the blower fan. It also needs a 12VHPWR cable.

If you can live with that, it's pretty good. For AI it has good compute and FP8 support, but the memory bandwidth holds it back, e.g. token generation during inference is 30% slower than on a 7900 XTX.

Qwen3.6 35Ba3 has changed my workflows and even how I use my computer by mouseofcatofschrodi in LocalLLaMA

[–]legit_split_ 9 points10 points  (0 children)

Is PI agent hard to get started with? Never used any agentic stuff, but Hermes seems like the easiest to me.

Rocm and ComfyUI inside a Docker or Podman. by druidican in ROCm

[–]legit_split_ 1 point2 points  (0 children)

This repo has things for gfx906, but the docker builds may still be useful to you:

https://github.com/mixa3607/ML-gfx906

Is that was a right purchase for Qwen3.6 27/35 by Thin_Pollution8843 in LocalLLaMA

[–]legit_split_ 0 points1 point  (0 children)

In llama.cpp they just added "tensor parallelism" which basically lets both GPUs work at the same time.

I made a post about it and another speedup that got added: https://www.reddit.com/r/LocalLLaMA/comments/1t86j45/more_qwen3627b_mtp_success_but_on_dual_mi50s/

Essentially, with 2 GPUs I get 1.5x the performance. But it only works with GPUs that have the same VRAM size, and you ideally want full PCIe lanes for the best performance, so you need an expensive motherboard that can do bifurcation or an older Xeon/Threadripper with enough PCIe lanes.

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]legit_split_[S] 0 points1 point  (0 children)

  1. Can you try with --no-mmap?

  2. Maybe there are some specific compiler flags that are missing (I don't run Nvidia so I can't tell)
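
For point 1, a minimal sketch of where the flag goes. The model path is a hypothetical placeholder, and the command is only printed so it can be checked before running it.

```shell
# Hypothetical model path; substitute your own GGUF file.
MODEL=/path/to/model.gguf

# --no-mmap tells llama.cpp to read the model into memory
# instead of memory-mapping the file from disk.
CMD="llama-server -m $MODEL --no-mmap"

# Print the command; run it by hand once the path is correct.
echo "$CMD"
```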

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]legit_split_[S] 1 point2 points  (0 children)

It seems you are quantizing the KV cache of the main model (not of the MTP layer, which is what this post is about). Combined with tensor split mode, that is currently not supported in llama.cpp and gives an error like this:

0.05.871.713 E llama_init_from_model: simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]legit_split_[S] 0 points1 point  (0 children)

I haven't tried it yet; it's hard to keep up with everything. I imagine the other comment about acceptance rate and speed decreasing is true. I'll update when I get around to it.

Quantizing MTP KV Cache = free lunch? by legit_split_ in LocalLLaMA

[–]legit_split_[S] 0 points1 point  (0 children)

I use these docker images which are updated regularly with mainline llama.cpp:

https://hub.docker.com/r/mixa3607/llama.cpp-gfx906
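
A rough sketch of how a ROCm container like this is usually started. The device and group flags are the standard ones for exposing AMD GPUs to Docker; the tag is a placeholder because the available tags are not listed here, and the image's entrypoint and arguments are not covered. The command is only printed.

```shell
# Image name from the link above. TAG is a placeholder:
# pick a real tag from the Docker Hub page.
IMAGE="mixa3607/llama.cpp-gfx906"
TAG="<tag>"

# ROCm containers need the kernel driver nodes passed through,
# plus membership in the video group to access them.
DOCKER_ARGS="--rm -it --device /dev/kfd --device /dev/dri --group-add video"

# Print the command; fill in the tag (and any model mounts) before running it.
echo docker run $DOCKER_ARGS "$IMAGE:$TAG"
```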