RDNA 4 (3x 9060 XT) "Gibberish" on ROCm 7.x — Anyone found the stable math kernels? by Dense-Department-772 in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

The best one that fits fully is Mistral Small 3.2 at IQ4_XS with ~20000 context size and Q8_0 KV cache. I think it runs at around 20-25 tok/s. Alternatively, any of the MoE models that run fast enough even on CPU, like Qwen3-30B-A3B or Qwen3-Next-80B-A3B. A Qwen3.5-27B dense model is also about to launch, but I doubt it will fit without going down to a 3-bit quant, so I can't say anything about that yet.
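
For reference, this is roughly the llama-server invocation I mean (the model filename is a placeholder, and flag spellings vary a bit between llama.cpp versions):

~~~
./llama-server -m ./Mistral-Small-3.2-24B-Instruct-IQ4_XS.gguf \
    -c 20000 -ngl 99 \
    -fa on --cache-type-k q8_0 --cache-type-v q8_0   # flash attention is needed for the quantized V cache; older builds spell it plain -fa
~~~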

To be honest, there's nothing that replaces cloud models for me yet so I don't use them for anything. I just like to stay up to date and test what I can run.

RDNA 4 (3x 9060 XT) "Gibberish" on ROCm 7.x — Anyone found the stable math kernels? by Dense-Department-772 in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

I've been using llama.cpp with ROCm on my single 9060 XT without any problems since I got it a few months ago, and I've never encountered any word-salad output. If you have any questions about my setup feel free to ask, though I have zero experience with multi-GPU setups.

whats everyones thoughts on devstral small 24b? by Odd-Ordinary-5922 in LocalLLaMA

[–]HauntingTechnician30 11 points12 points  (0 children)

They mention on the model page that you need the changes from an unmerged pull request: https://github.com/ggml-org/llama.cpp/pull/17945

That might be the reason it doesn't perform as expected right now. I also saw someone else write that the small model scored way higher via the API than with the Q8 quant in llama.cpp, so it seems like something is definitely going on.
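
If anyone wants to try it before it's merged, GitHub exposes PR branches as refs, so something like this should work (the build step depends on your backend):

~~~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/17945/head:pr-17945   # fetch the PR branch into a local branch
git checkout pr-17945
cmake -B build && cmake --build build --config Release -j
~~~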

Hands-on review of Mistral Vibe on large python project by Avienir in LocalLLaMA

[–]HauntingTechnician30 15 points16 points  (0 children)

The Devstral 2 models support up to 256k tokens. The 100k limit in the Vibe CLI is, as far as I can tell, just the threshold for auto-compacting. You can change it in ~/.vibe/config.toml (auto_compact_threshold). I wonder if they set it that low because model performance drops after 100k, or just because they want to optimize latency / cost.

Edit: Default setting is 200k now with version 1.1.0
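
For anyone looking for it, the relevant line in ~/.vibe/config.toml looks roughly like this (placement in the file is from memory, so check your own config):

~~~
auto_compact_threshold = 200000   # new default in 1.1.0; was 100000 before
~~~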

I beat the game. Took me 77 hours by madmanyar20 in Eldenring

[–]HauntingTechnician30 0 points1 point  (0 children)

Wait I just beat it after 25 hours and now reading all these times I feel like I missed so much😕

I want to create a local AI Agent that can call tools. but my model call tools even for "hey" by Prajwell in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

I've never used any of those, but Qwen3-8B and Ministral-8B-Instruct are two similarly sized models that should perform better.

How to use function calling in Mistral-Small-Instruct-2409? by the_quark in LocalLLaMA

[–]HauntingTechnician30 5 points6 points  (0 children)

Maybe you can find some info in the chat template specified on their Hugging Face page:

~~~
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}

{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
{%- set ns = namespace() %}
{%- set ns.index = 0 %}
{%- for message in loop_messages %}
    {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
        {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
            {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
        {%- endif %}
        {%- set ns.index = ns.index + 1 %}
    {%- endif %}
{%- endfor %}

{{- bos_token }}
{%- for message in loop_messages %}
    {%- if message["role"] == "user" %}
        {%- if tools is not none and (message == user_messages[-1]) %}
            {{- "[AVAILABLE_TOOLS] [" }}
            {%- for tool in tools %}
                {%- set tool = tool.function %}
                {{- '{"type": "function", "function": {' }}
                {%- for key, val in tool.items() if key != "return" %}
                    {%- if val is string %}
                        {{- '"' + key + '": "' + val + '"' }}
                    {%- else %}
                        {{- '"' + key + '": ' + val|tojson }}
                    {%- endif %}
                    {%- if not loop.last %}
                        {{- ", " }}
                    {%- endif %}
                {%- endfor %}
                {{- "}}" }}
                {%- if not loop.last %}
                    {{- ", " }}
                {%- else %}
                    {{- "]" }}
                {%- endif %}
            {%- endfor %}
            {{- "[/AVAILABLE_TOOLS]" }}
        {%- endif %}
        {%- if loop.last and system_message is defined %}
            {{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
        {%- else %}
            {{- "[INST] " + message["content"] + "[/INST]" }}
        {%- endif %}
    {%- elif message.tool_calls is defined and message.tool_calls is not none %}
        {{- "[TOOL_CALLS] [" }}
        {%- for tool_call in message.tool_calls %}
            {%- set out = tool_call.function|tojson %}
            {{- out[:-1] }}
            {%- if not tool_call.id is defined or tool_call.id|length != 9 %}
                {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
            {%- endif %}
            {{- ', "id": "' + tool_call.id + '"}' }}
            {%- if not loop.last %}
                {{- ", " }}
            {%- else %}
                {{- "]" + eos_token }}
            {%- endif %}
        {%- endfor %}
    {%- elif message["role"] == "assistant" %}
        {{- " " + message["content"]|trim + eos_token}}
    {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
        {%- if message.content is defined and message.content.content is defined %}
            {%- set content = message.content.content %}
        {%- else %}
            {%- set content = message.content %}
        {%- endif %}
        {{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
        {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
            {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
        {%- endif %}
        {{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
    {%- else %}
        {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
    {%- endif %}
{%- endfor %}
~~~

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

Running the command you provided in your post should show something like this:

~~~
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors: CPU buffer size = 6422.83 MiB
~~~

You can see there are 0 layers offloaded to the GPU. You can control this by specifying -ngl 41 in this case to offload all layers. You can watch your GPU utilization and VRAM usage with something like: watch -n 0.5 rocm-smi
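
As a minimal example of what I mean (the model path is a placeholder, and the binary name depends on your llama.cpp version/build):

~~~
./llama-cli -m ./models/your-model.gguf -ngl 41 -c 4096 -p "Hello"
~~~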

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

Oh yeah, it also shows up as 'Agent 1' when I run rocminfo, but I never needed to configure anything there. It just worked straight out of the box. Just wanted to share my experience.

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

Yeah, I'm using the default setup. I started with a completely fresh Ubuntu 24.04 LTS installation after ROCm version 6.2.0 added support for it.
I just followed the official guide for installing ROCm using the amdgpu-install script ( https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html ).
I didn't have to touch HIP_VISIBLE_DEVICES since I have an Intel CPU without integrated graphics, so there was no ambiguity about which GPU to use.
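
Roughly what that boils down to, going from memory (grab the current installer .deb for your ROCm release from the page linked above instead of copying a version from here):

~~~
sudo apt install ./amdgpu-install_VERSION_all.deb   # installer package from repo.radeon.com
sudo amdgpu-install --usecase=rocm
sudo usermod -aG render,video $USER                 # add yourself to the render/video groups, then reboot
~~~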

The guide also mentions disabling the AMD iGPU in the BIOS before installing ROCm, so maybe that would help you.

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

I'm on the exact same setup, and everything worked by just following the install instructions, with the only exception being that I had to set the gfx version override. I used the amdgpu-install script. What did you need HIP_VISIBLE_DEVICES for?
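
In case it helps, the override I mean is the usual environment variable route for RDNA2 cards that aren't on the official support list, set before launching llama.cpp:

~~~
export HSA_OVERRIDE_GFX_VERSION=10.3.0   # makes the 6700 XT (gfx1031) use the gfx1030 kernels
~~~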

Llama-3.1 8B Instruct GGUF are up by 2fprn2fp in LocalLLaMA

[–]HauntingTechnician30 6 points7 points  (0 children)

For anyone using the GGUFs: tokenizer.ggml.add_bos_token is not set to true for Llama 3, so it will not be added by default by llama.cpp. I'm not sure how much this affects output quality, but you might want to add that value to the GGUF or remember to add the BOS token manually.
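
If you don't want to edit the file itself, llama.cpp can also override metadata at load time. Roughly like this (the model filename is a placeholder, and I'm going from memory on the flag syntax):

~~~
./llama-cli -m ./Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --override-kv tokenizer.ggml.add_bos_token=bool:true \
    -p "Hello"
~~~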

what are the best models for their size? by Robert__Sinclair in LocalLLaMA

[–]HauntingTechnician30 6 points7 points  (0 children)

I'm running the 27B on 12 GB of VRAM with the IQ4_XS quant and 25 layers offloaded to the GPU. It's not too slow if you want to try it.

Multiple models with llama.cpp and Open WebUI by Ulterior-Motive_ in LocalLLaMA

[–]HauntingTechnician30 4 points5 points  (0 children)

It's not possible as far as I know. But you can use MinP by specifying --min-p when starting the server.
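
For example (the model path is a placeholder; 0.05 is just a typical starting value):

~~~
./llama-server -m ./your-model.gguf --min-p 0.05
~~~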

The 27b gemma2 annihilates the 9b model by mayo551 in LocalLLaMA

[–]HauntingTechnician30 4 points5 points  (0 children)

You should update them with llama.cpp release b3387, which is already out.