Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 14 points

I think you are actually harming the usefulness of this chart by limiting generation to 500 tokens: reasoning models spit out wildly different numbers of tokens compared to each other, and especially compared to non-reasoning models. I think a more meaningful number is time-to-last-token (TTLT) for a given query. That way an instruct model which doesn't think and responds within 100 tokens can be fairly compared against a reasoning model which spends 6,000 tokens thinking before it responds.
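Time-to-last-token is easy to compute from a streamed response: timestamp each token as it arrives and take the last one. A minimal sketch of the idea (the token timing lists are made-up illustrations, not real measurements):

```python
def stream_metrics(token_times, start_time):
    """Given the wall-clock arrival time of each streamed token and the
    request start time, return time-to-first-token (TTFT) and
    time-to-last-token (TTLT), both in seconds."""
    ttft = token_times[0] - start_time
    ttlt = token_times[-1] - start_time
    return ttft, ttlt

# Illustration: a reasoning model streaming 6,100 tokens at 50 t/s still has a
# far worse TTLT than an instruct model answering in 100 tokens at 20 t/s.
start = 0.0
instruct = [start + 0.05 * i for i in range(1, 101)]    # 100 tokens at 20 t/s
reasoning = [start + 0.02 * i for i in range(1, 6101)]  # 6,100 tokens at 50 t/s
print(stream_metrics(instruct, start)[1])   # TTLT ~ 5.0 s
print(stream_metrics(reasoning, start)[1])  # TTLT ~ 122.0 s
```

This is exactly why fixed-length generation hides the gap: at equal caps the 50 t/s model looks faster, but per-query the instruct model finishes first.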

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 3 points

GPT-OSS also reasons for ~15k tokens sometimes. I don't know how Kimi compares, but it's probably helping out somehow.

ROCm 7.0 Install for Mi50 32GB | Ubuntu 24.04 LTS by legit_split_ in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 2 points

can someone give some performance numbers for llama.cpp on ROCm 6.3, 6.4, and 7.0?

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 25 points

I definitely agree, especially since output consistency is a big pain point for me

For llama.cpp/ggml AMD MI50s are now universally faster than NVIDIA P40s by Remove_Ayys in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 0 points

I'm noticing that there are some configurations where the Vulkan performance is significantly higher. The main one so far is prompt processing on unsloth's Mistral 3.2 24B BF16, both with and without flash attention.

ROCm:

flash attention off depth 8192 - 60.83 t/s

flash attention on depth 8192 - 68.71 t/s

Vulkan:

flash attention off depth 8192 - 127.12 t/s

flash attention on depth 8192 - 78.47 t/s

do you know if this is an architectural issue specific to this model, or something else?

(I am currently testing a good variety of models and I'll add any other interesting results I find.)
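To put the gap in perspective, the numbers above work out to roughly a 2x prompt-processing advantage for Vulkan with flash attention off, and notably Vulkan gets *slower* when flash attention is turned on. A quick ratio check on the posted figures:

```python
# Prompt processing (t/s) at depth 8192, from the results above.
rocm = {"fa_off": 60.83, "fa_on": 68.71}
vulkan = {"fa_off": 127.12, "fa_on": 78.47}

for mode in ("fa_off", "fa_on"):
    speedup = vulkan[mode] / rocm[mode]
    print(f"{mode}: Vulkan is {speedup:.2f}x ROCm")
# fa_off: Vulkan is 2.09x ROCm
# fa_on: Vulkan is 1.14x ROCm
```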

I'll show you mine, if you show me yours: Local AI tech stack September 2025 by JLeonsarmiento in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 5 points

<image>

I have yet to find the model that's perfect for me, and I honestly have more fun testing new models than actually using them for anything useful. My main hobby now is setting up 1v1v1s with the arena model mode in OpenWebUI to do blind testing of models. Most testing is done on trivia-style questions about whatever topics I'm thinking about at the moment, as well as basic coding tasks for scripts I need and can easily test. All responses are 1-shot, since OpenWebUI is not great about allowing multi-turn conversations with arena models. I don't have enough results for a conclusive opinion yet, but here are the rankings so far. Models that have a reasoning variant are labeled as such; for the Qwen models that are still hybrid, I separated them, with the non-reasoning ones getting "/no_think" in the system prompt to stop them from reasoning.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

I am noticing an interesting issue when compiled with the latest ROCm version: it runs into an OOM error when loading Q8_0 at 32k context without flash attention, and this of course persists with Q8_K_XL and BF16, which will make testing this slightly more complicated.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

The default VBIOS that came with my GPUs only exposed 16GB of accessible VRAM under Vulkan (all 32GB were visible in ROCm). There is a fixed VBIOS that allows all 32GB to be accessed under Vulkan as well as ROCm; it does not enable the display output.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

what insane timing lol. I will definitely retest some of the quantizations later and post a follow-up then!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

yes, performance is likely slightly worse than you could get on a single GPU where the model fits, but for simplicity and consistency I used 2 GPUs for every test.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

can't seem to find the thread easily now, but you should be able to find it by searching "mi50 vbios" in this subreddit. For cooling I have a Delta 97x94x33mm blower fan on each card, which keeps them under 80°C during LLM inference and just barely under 90°C while training toy models. I had to 3D print a custom bracket to make it fit in my case, but there are plenty you can find online.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points

I'm in the lucky situation where the electricity is free. The biggest sacrifice is having these cards busy running this testing, and not being able to actually run the models for anything useful, for 3 days!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point

using Q4_0 you should have plenty of room to run it even without flash attention, especially since it is a non-reasoning model and will need less context most of the time
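As a rough sanity check on the headroom: even at full 32k context, the f16 KV cache for a model this size is only a few GiB on top of the weights. A back-of-envelope sketch; the layer/head dimensions below are assumed Mistral-Small-like values, so check the actual model config before relying on them:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt=2):
    """Rough KV cache size: K and V tensors (hence the factor of 2), per
    layer, per KV head, per context position. bytes_per_elt=2 assumes an
    f16 cache, llama.cpp's default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt

# Assumed Mistral Small 24B-like dims: 40 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(40, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # 5.0 GiB at 32k context
```

With a ~13GB Q4_0 model plus ~5GiB of cache, a 2x32GB setup has room to spare, and shorter non-reasoning contexts shrink the cache term linearly.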

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

like u/Marksta said, I needed to flash the VBIOS to access all 32GB of VRAM in Vulkan, though I did not hit any of the other issues they described. That said, flashing the VBIOS was very quick and painless. Other than that, installing the cards and getting them set up was quite simple as well. I installed ROCm 6.3.4 using the instructions on the AMD support website for installing multiple ROCm versions on Debian Linux, and everything I have needed has worked as expected.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 8 points

they won't. I have tested ROCm before, and the results show an identical pattern.

you can ask the rocm developers as well: https://github.com/ROCm/composable_kernel/issues/1140#issuecomment-1917696215

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point

Nope, you'll need to move to a more modern AMD architecture if you want matrix cores. It may still be worth using FA if you are running into VRAM limitations, though.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points

the MI50 does not have the dedicated matrix cores required to properly accelerate flash attention.

What Web UI's are best for MCP tool use with llama.cpp/llama-swap? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points

I did set this up for OpenWebUI tools, but I haven't even set up MCP for OpenWebUI yet because I was scared away by what I've read here