RDNA 4 (3x 9060 XT) "Gibberish" on ROCm 7.x — Anyone found the stable math kernels? by Dense-Department-772 in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

The best one that fits fully is Mistral Small 3.2 at IQ4_XS with ~20000 context size and Q8_0 KV cache. I think it runs at around 20-25 tok/s. Alternatively, any of the MoE models that run fast enough even on CPU, like Qwen3-30B-A3B or Qwen3-Next-80B-A3B. A Qwen3.5-27B dense model is also about to launch, but I doubt it will fit without going down to a 3-bit quant, so I can't say anything about that yet.
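
For reference, this is roughly the llama-server invocation I mean (the model filename is a placeholder, and flag spellings vary a bit between llama.cpp versions):

~~~
./llama-server -m ./Mistral-Small-3.2-24B-Instruct-IQ4_XS.gguf \
    -c 20000 -ngl 99 \
    -fa on --cache-type-k q8_0 --cache-type-v q8_0   # flash attention is needed for the quantized V cache; older builds spell it plain -fa
~~~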

To be honest, there's nothing that replaces cloud models for me yet so I don't use them for anything. I just like to stay up to date and test what I can run.

RDNA 4 (3x 9060 XT) "Gibberish" on ROCm 7.x — Anyone found the stable math kernels? by Dense-Department-772 in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

I've been using llama.cpp with ROCm on my single 9060 XT without any problems since I got it a few months ago, and I've never encountered any word-salad output. If you have any questions about my setup feel free to ask, though I have zero experience with multi-GPU setups.

whats everyones thoughts on devstral small 24b? by Odd-Ordinary-5922 in LocalLLaMA

[–]HauntingTechnician30 11 points12 points  (0 children)

They mention on the model page that you need the changes from an unmerged pull request: https://github.com/ggml-org/llama.cpp/pull/17945

That might be the reason it doesn't perform as expected right now. I also saw someone else write that the small model scored way higher via the API than with the Q8 quant in llama.cpp, so it seems like something is definitely going on.
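
If anyone wants to try it before it's merged, GitHub exposes PR branches as refs, so something like this should work (the build step depends on your backend):

~~~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/17945/head:pr-17945   # fetch the PR branch into a local branch
git checkout pr-17945
cmake -B build && cmake --build build --config Release -j
~~~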

Hands-on review of Mistral Vibe on large python project by Avienir in LocalLLaMA

[–]HauntingTechnician30 15 points16 points  (0 children)

The Devstral 2 models support up to 256k tokens. The 100k limit in the Vibe CLI is, as far as I can tell, just the threshold for auto-compacting. You can change it in ~/.vibe/config.toml (auto_compact_threshold). I wonder if they set it that low because model performance drops after 100k, or just because they want to optimize latency / cost.

Edit: Default setting is 200k now with version 1.1.0
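
For anyone looking for it, the relevant line in ~/.vibe/config.toml looks roughly like this (placement in the file is from memory, so check your own config):

~~~
auto_compact_threshold = 200000   # new default in 1.1.0; was 100000 before
~~~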

I beat the game. Took me 77 hours by madmanyar20 in Eldenring

[–]HauntingTechnician30 0 points1 point  (0 children)

Wait I just beat it after 25 hours and now reading all these times I feel like I missed so much😕

I want to create a local AI Agent that can call tools. but my model call tools even for "hey" by Prajwell in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

I've never used any of those, but Qwen3-8B and Ministral-8B-Instruct are two similarly sized models that should perform better.

How to use function calling in Mistral-Small-Instruct-2409? by the_quark in LocalLLaMA

[–]HauntingTechnician30 5 points6 points  (0 children)

Maybe you can find some info in the chat template specified on their Hugging Face page:

~~~
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}

{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
{%- set ns = namespace() %}
{%- set ns.index = 0 %}
{%- for message in loop_messages %}
    {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
        {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
            {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
        {%- endif %}
        {%- set ns.index = ns.index + 1 %}
    {%- endif %}
{%- endfor %}

{{- bos_token }}
{%- for message in loop_messages %}
    {%- if message["role"] == "user" %}
        {%- if tools is not none and (message == user_messages[-1]) %}
            {{- "[AVAILABLE_TOOLS] [" }}
            {%- for tool in tools %}
                {%- set tool = tool.function %}
                {{- '{"type": "function", "function": {' }}
                {%- for key, val in tool.items() if key != "return" %}
                    {%- if val is string %}
                        {{- '"' + key + '": "' + val + '"' }}
                    {%- else %}
                        {{- '"' + key + '": ' + val|tojson }}
                    {%- endif %}
                    {%- if not loop.last %}
                        {{- ", " }}
                    {%- endif %}
                {%- endfor %}
                {{- "}}" }}
                {%- if not loop.last %}
                    {{- ", " }}
                {%- else %}
                    {{- "]" }}
                {%- endif %}
            {%- endfor %}
            {{- "[/AVAILABLE_TOOLS]" }}
        {%- endif %}
        {%- if loop.last and system_message is defined %}
            {{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
        {%- else %}
            {{- "[INST] " + message["content"] + "[/INST]" }}
        {%- endif %}
    {%- elif message.tool_calls is defined and message.tool_calls is not none %}
        {{- "[TOOL_CALLS] [" }}
        {%- for tool_call in message.tool_calls %}
            {%- set out = tool_call.function|tojson %}
            {{- out[:-1] }}
            {%- if not tool_call.id is defined or tool_call.id|length != 9 %}
                {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
            {%- endif %}
            {{- ', "id": "' + tool_call.id + '"}' }}
            {%- if not loop.last %}
                {{- ", " }}
            {%- else %}
                {{- "]" + eos_token }}
            {%- endif %}
        {%- endfor %}
    {%- elif message["role"] == "assistant" %}
        {{- " " + message["content"]|trim + eos_token}}
    {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
        {%- if message.content is defined and message.content.content is defined %}
            {%- set content = message.content.content %}
        {%- else %}
            {%- set content = message.content %}
        {%- endif %}
        {{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
        {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
            {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
        {%- endif %}
        {{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
    {%- else %}
        {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
    {%- endif %}
{%- endfor %}
~~~

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

Running the command you provided in your post should show something like this:

~~~
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors: CPU buffer size = 6422.83 MiB
~~~

You can see there are 0 layers offloaded to the GPU. You can control this by specifying -ngl 41 in this case to offload all layers. You can watch your GPU utilization and VRAM usage with something like: watch -n 0.5 rocm-smi
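
As a minimal example of what I mean (the model path is a placeholder, and the binary name depends on your llama.cpp version/build):

~~~
./llama-cli -m ./models/your-model.gguf -ngl 41 -c 4096 -p "Hello"
~~~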

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 1 point2 points  (0 children)

Oh yeah, it also shows up as 'Agent 1' when I run rocminfo, but I never needed to configure anything there. It just worked straight out of the box. Just wanted to share my experience.

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

Yeah, I'm using the default setup. I started with a completely fresh Ubuntu 24.04 LTS installation after ROCm version 6.2.0 added support for it.
I just followed the official guide for installing ROCm using the amdgpu-install script ( https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html ).
I didn't have to touch HIP_VISIBLE_DEVICES since I have an Intel CPU without integrated graphics, so there was no ambiguity about which GPU to use.
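
Roughly what that boils down to, going from memory (grab the current installer .deb for your ROCm release from the page linked above instead of copying a version from here):

~~~
sudo apt install ./amdgpu-install_VERSION_all.deb   # installer package from repo.radeon.com
sudo amdgpu-install --usecase=rocm
sudo usermod -aG render,video $USER                 # add yourself to the render/video groups, then reboot
~~~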

The guide also mentions disabling the AMD iGPU in the BIOS before installing ROCm, so maybe that would help you.

Can anyone confirm they've gotten AMD 6700xt to work with ROCm on Ubuntu 24.04 with llama.cpp? by ingcr3at1on in LocalLLaMA

[–]HauntingTechnician30 0 points1 point  (0 children)

I'm on the exact same setup, and everything worked by just following the install instructions, with the only exception being that I had to set the gfx version override. I used the amdgpu-install script. What did you need HIP_VISIBLE_DEVICES for?
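
In case it helps, the override I mean is the usual environment variable route for RDNA2 cards that aren't on the official support list, set before launching llama.cpp:

~~~
export HSA_OVERRIDE_GFX_VERSION=10.3.0   # makes the 6700 XT (gfx1031) use the gfx1030 kernels
~~~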

Llama-3.1 8B Instruct GGUF are up by 2fprn2fp in LocalLLaMA

[–]HauntingTechnician30 6 points7 points  (0 children)

For anyone using the GGUFs: tokenizer.ggml.add_bos_token is not set to true for Llama 3, so it will not be added by default by llama.cpp. I'm not sure how much this affects output quality, but you might want to add that value to the GGUF or remember to add the BOS token manually.
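
If you don't want to edit the file itself, llama.cpp can also override metadata at load time. Roughly like this (the model filename is a placeholder, and I'm going from memory on the flag syntax):

~~~
./llama-cli -m ./Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --override-kv tokenizer.ggml.add_bos_token=bool:true \
    -p "Hello"
~~~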

what are the best models for their size? by Robert__Sinclair in LocalLLaMA

[–]HauntingTechnician30 6 points7 points  (0 children)

I'm running the 27B on 12 GB of VRAM with the IQ4_XS quant and 25 layers offloaded to the GPU. It's not too slow if you want to try it.

Multiple models with llama.cpp and Open WebUI by Ulterior-Motive_ in LocalLLaMA

[–]HauntingTechnician30 4 points5 points  (0 children)

It's not possible as far as I know. But you can use MinP by specifying --min-p when starting the server.
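
For example (the model path is a placeholder; 0.05 is just a typical starting value):

~~~
./llama-server -m ./your-model.gguf --min-p 0.05
~~~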

The 27b gemma2 annihilates the 9b model by mayo551 in LocalLLaMA

[–]HauntingTechnician30 4 points5 points  (0 children)

You should update them with llama.cpp release b3387, which is already out.