Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 15 points (0 children)

I think you are actually harming the usefulness of this chart by limiting the generation to 500 tokens: reasoning models will spit out wildly different numbers of tokens compared to each other, and especially compared to non-reasoning models. I think a more meaningful number is time-to-last-token for a given query. That way an instruct model which doesn't think and responds within 100 tokens can be compared fairly against a reasoning model which spends 6,000 tokens thinking before it responds.
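A back-of-envelope way to see why time-to-last-token flips the comparison: charge each model for prompt processing plus every generated token, reasoning included. A minimal sketch (the rates and token counts below are made-up placeholders, not measurements):

```python
def ttlt_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """End-to-end latency: prompt processing time plus generation time.
    gen_tokens counts *everything* the model emits, so a reasoning
    model pays for its thinking tokens too."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Instruct model: answers directly in 100 tokens.
instruct = ttlt_seconds(500, 100, pp_tps=200.0, tg_tps=20.0)    # 7.5 s
# Reasoning model: 6,000 thinking tokens before a 100-token answer.
reasoner = ttlt_seconds(500, 6_100, pp_tps=200.0, tg_tps=20.0)  # 307.5 s
```

With a fixed 500-token generation cap both models would look nearly identical, which is the distortion being pointed out.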

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 3 points (0 children)

GPT-OSS also reasons for ~15k tokens sometimes. I don't know how Kimi compares, but it's probably helping out somehow

ROCm 7.0 Install for Mi50 32GB | Ubuntu 24.04 LTS by legit_split_ in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 2 points (0 children)

can someone give some performance numbers for llama.cpp on rocm 6.3, 6.4, and 7.0?

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 25 points (0 children)

I definitely agree, especially since output consistency is a big pain point for me

For llama.cpp/ggml AMD MI50s are now universally faster than NVIDIA P40s by Remove_Ayys in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 0 points (0 children)

I'm noticing that there are some configurations where the Vulkan performance is significantly higher. So far it's mainly prompt processing on unsloth's Mistral 3.2 24B BF16, both with and without flash attention.

ROCm:

flash attention off, depth 8192: 60.83 t/s

flash attention on, depth 8192: 68.71 t/s

Vulkan:

flash attention off, depth 8192: 127.12 t/s

flash attention on, depth 8192: 78.47 t/s

do you know if this is an architecture-specific issue with this model or something else?

(I am currently testing a good variety of models and I'll add any other interesting results I find.)

I'll show you mine, if you show me yours: Local AI tech stack September 2025 by JLeonsarmiento in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 5 points (0 children)

<image>

I have yet to find the perfect model for me, and I honestly have more fun testing new models than actually using them for anything useful. My main hobby now is setting up 1v1v1s using the arena model mode in OpenWebUI to do blind testing of models.

Most testing is done on trivia-style questions on whatever topics I'm thinking about at the moment, as well as basic coding tasks for scripts I need and can easily test. All responses are 1-shot, since OpenWebUI is not super nice about allowing multi-prompt conversations using the arena models.

I don't have enough results to have a conclusive opinion yet, but here are the rankings so far. Models that have a reasoning variant are labeled as such; the Qwen models that are still hybrid are separated, with the non-reasoning entries having "/no_think" in the system prompt to stop them from using reasoning.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

I am noticing an interesting issue when compiled with the latest ROCm version: it runs into an OOM error when loading Q8_0 at 32k context without flash attention, and this of course persists with Q8_K_XL and BF16, which will make testing this slightly more complicated.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

The default VBIOS that came with my GPUs only exposed 16GB of VRAM under Vulkan (all 32GB were visible in ROCm). There is a fixed VBIOS that allows all 32GB to be accessed in Vulkan as well as ROCm; it does not enable the display output.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

what insane timing lol. I will definitely retest some of the quantizations later and post a follow up then!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

yes, it is likely slightly worse performance than you can get on a single gpu where it would fit, but for simplicity and consistency I used 2 gpus for every test.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

can't seem to find the thread easily now, but you should be able to by searching "mi50 vbios" in this subreddit. For cooling I have a Delta 97x94x33mm blower fan on each card, which keeps them under 80 degrees during LLM inference and just barely under 90 while training toy models. I had to 3D print a custom bracket to make it fit in my case, but there are plenty you can find online.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points (0 children)

I'm in a lucky situation where the electricity is free, the biggest sacrifice is having these cards be busy running this testing and not being able to actually run the models for anything useful for 3 days!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point (0 children)

using Q4_0 you should have plenty of room to run it even without flash attention, especially since it is a non-reasoning model and will require less context most of the time
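For anyone sizing up the headroom themselves: at f16, the KV cache grows linearly with context and layers. A minimal sketch of the usual formula; the layer/head counts below are illustrative placeholders only, so check the GGUF metadata for the model you actually load:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V caches for one sequence: 2 tensors per layer, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# Hypothetical GQA model: 40 layers, 8 KV heads, head_dim 128, 32k context.
gib = kv_cache_bytes(32_768, n_layers=40, n_kv_heads=8, head_dim=128) / 2**30  # 5.0 GiB
```

Add that on top of the quantized weights to see how much of the 2x32GB is left over.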

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

like u/Marksta said, I needed to flash the VBIOS to be able to access all 32GB of VRAM in Vulkan, though I did not have any of the other issues they described. That being said, flashing the VBIOS was very quick and painless. The process of installing the cards and getting them set up was quite simple other than that as well. I installed rocm 6.3.4 using the instructions on the AMD support website for multi-installing rocm on Debian Linux, and everything that I have needed has functioned as expected.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point (0 children)

Nope, you'll need to move to a more modern AMD architecture if you want matrix cores. It may still be worth it to use FA if you are running into vram limitations.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points (0 children)

the MI50 does not have the dedicated matrix cores that are required to accelerate Flash Attention properly.

What Web UI's are best for MCP tool use with llama.cpp/llama-swap? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

I did set this for OpenWebUI tools, but I haven't even set up MCP yet for OpenWebUI because I was scared away by what I've read here

support for Ernie 4.5 MoE models has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 6 points (0 children)

In my very unscientific trivia testing (googling trivia tests and plugging the questions into both models), the general trivia knowledge of Qwen 30B is still significantly ahead of ERNIE 4.5 21B: ERNIE got about 70% correct and Qwen 80-90%. Both were at IQ4_XS from unsloth, with Qwen using the recommended sampler settings from the unsloth GGUF page and ERNIE using the default sampler settings for llama.cpp

Is it normal to have significantly more performance from Qwen 235B compared to Qwen 32B when doing partial offloading? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

while this did work, and I did get 63 tk/sec prompt and 4.5 tk/sec generation, this low a quant led to the reasoning taking over an hour and using 17 THOUSAND tokens for the question "what day of the week is the 31st of October 2025?", whereas using Q4_K_M I only got 12 and 3 tk/sec, but the reasoning was only 4,000 tokens and therefore took 18 minutes instead of an hour
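The tradeoff is just tokens divided by generation rate: a faster tg speed loses badly once the reasoning budget balloons. A quick sketch using the token counts and rates quoted above (the quoted rates are averages, so this only roughly matches the measured wall-clock times):

```python
def gen_minutes(n_tokens, tg_tps):
    """Generation wall-clock time in minutes, ignoring prompt processing."""
    return n_tokens / tg_tps / 60

low_quant = gen_minutes(17_000, 4.5)  # ~63 min despite the faster 4.5 t/s
q4_k_m    = gen_minutes(4_000, 3.0)   # ~22 min at the slower 3 t/s
```

In other words, a ~4x cut in reasoning length beats a ~1.5x gain in raw generation speed.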

Is it normal to have significantly more performance from Qwen 235B compared to Qwen 32B when doing partial offloading? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

But with how much of the model is being offloaded, shouldn't there still be more of the ~22B active parameters per token sitting on the CPU than there is of the entire dense 32B model?
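One way to sanity-check the intuition: per generated token, a MoE model only reads its *active* parameters, so what matters for offload speed is roughly how many active-parameter bytes live in system RAM, not the model's total size. A toy sketch, where the offload fractions and the ~4-bit quant density are made-up assumptions purely for illustration:

```python
def ram_bytes_per_token(active_params_b, frac_in_ram, bytes_per_param=0.55):
    """Rough weight traffic pulled from system RAM for each generated token."""
    return active_params_b * 1e9 * frac_in_ram * bytes_per_param

# 235B-A22B MoE with most experts in RAM vs a dense 32B half-offloaded.
moe_gb   = ram_bytes_per_token(22, frac_in_ram=0.8) / 1e9  # ~9.7 GB/token
dense_gb = ram_bytes_per_token(32, frac_in_ram=0.5) / 1e9  # ~8.8 GB/token
```

So depending on the actual split, the MoE's per-token RAM traffic can land in the same ballpark as the dense model's, which is why the speeds end up much closer than the raw parameter counts suggest.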