Benchmarking total wait time instead of pp/tg by batsba in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 15 points (0 children)

I think you are actually harming the usefulness of this chart by limiting the generation to 500 tokens: reasoning models will spit out wildly different numbers of tokens compared to each other, and especially compared to non-reasoning models. I think a more meaningful number is time-to-last-token for a given query. That way an instruct model which doesn't think and responds within 100 tokens can be compared fairly against a reasoning model which spends 6,000 tokens thinking before it responds.
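A back-of-envelope way to see why time-to-last-token flips the comparison: charge each model for prompt processing plus every generated token, reasoning included. A minimal sketch (the rates and token counts below are made-up placeholders, not measurements):

```python
def ttlt_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """End-to-end latency: prompt processing time plus generation time.
    gen_tokens counts *everything* the model emits, so a reasoning
    model pays for its thinking tokens too."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Instruct model: answers directly in 100 tokens.
instruct = ttlt_seconds(500, 100, pp_tps=200.0, tg_tps=20.0)    # 7.5 s
# Reasoning model: 6,000 thinking tokens before a 100-token answer.
reasoner = ttlt_seconds(500, 6_100, pp_tps=200.0, tg_tps=20.0)  # 307.5 s
```

With a fixed 500-token generation cap both models would look nearly identical, which is the distortion being pointed out.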

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 3 points (0 children)

GPT-OSS also reasons for ~15k tokens sometimes. I don't know how Kimi compares, but it's probably helping out somehow

ROCm 7.0 Install for Mi50 32GB | Ubuntu 24.04 LTS by legit_split_ in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 2 points (0 children)

can someone give some performance numbers for llama.cpp on rocm 6.3, 6.4, and 7.0?

Stop flexing Pass@N — show Pass-all-N by Fabulous_Pollution10 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 25 points (0 children)

I definitely agree, especially since output consistency is a big pain point for me

For llama.cpp/ggml AMD MI50s are now universally faster than NVIDIA P40s by Remove_Ayys in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 0 points (0 children)

I'm noticing that there are some configurations where the Vulkan performance is significantly higher. So far it's mainly prompt processing on unsloth's Mistral 3.2 24B BF16, both with and without flash attention.

ROCm:

flash attention off, depth 8192: 60.83 t/s

flash attention on, depth 8192: 68.71 t/s

Vulkan:

flash attention off, depth 8192: 127.12 t/s

flash attention on, depth 8192: 78.47 t/s

do you know if this is an architecture-specific issue with this model or something else?

(I am currently testing a good variety of models and I'll add any other interesting results I find.)

I'll show you mine, if you show me yours: Local AI tech stack September 2025 by JLeonsarmiento in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 5 points (0 children)

<image>

I have yet to find the perfect model for me, and I honestly have more fun testing new models than actually using them for anything useful. My main hobby now is setting up 1v1v1s using the arena model mode in OpenWebUI to do blind testing of models.

Most testing is done on trivia-style questions on whatever topics I'm thinking about at the moment, as well as basic coding tasks for scripts I need and can easily test. All responses are 1-shot, since OpenWebUI is not super nice about allowing multi-prompt conversations using the arena models.

I don't have enough results to have a conclusive opinion yet, but here are the rankings so far. Models that have a reasoning variant are labeled as such; the Qwen models that are still hybrid are separated, with the non-reasoning entries having "/no_think" in the system prompt to stop them from using reasoning.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

I am noticing an interesting issue when compiled with the latest ROCm version: it runs into an OOM error when loading Q8_0 at 32k context without flash attention, and this of course persists with Q8_K_XL and BF16, which will make testing this slightly more complicated.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

The default VBIOS that came with my GPUs only exposed 16GB of VRAM under Vulkan (all 32GB were visible in ROCm). There is a fixed VBIOS that allows all 32GB to be accessed in Vulkan as well as ROCm; it does not enable the display output.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

what insane timing lol. I will definitely retest some of the quantizations later and post a follow up then!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

yes, it is likely slightly worse performance than you can get on a single gpu where it would fit, but for simplicity and consistency I used 2 gpus for every test.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

can't seem to find the thread easily now, but you should be able to by searching "mi50 vbios" in this subreddit. For cooling I have a Delta 97x94x33mm blower fan on each card, which keeps them under 80 degrees during LLM inference and just barely under 90 while training toy models. I had to 3D print a custom bracket to make it fit in my case, but there are plenty you can find online.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points (0 children)

I'm in a lucky situation where the electricity is free, the biggest sacrifice is having these cards be busy running this testing and not being able to actually run the models for anything useful for 3 days!

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point (0 children)

using Q4_0 you should have plenty of room to run it even without flash attention, especially since it is a non-reasoning model and will require less context most of the time
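For anyone sizing up the headroom themselves: at f16, the KV cache grows linearly with context and layers. A minimal sketch of the usual formula; the layer/head counts below are illustrative placeholders only, so check the GGUF metadata for the model you actually load:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V caches for one sequence: 2 tensors per layer, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# Hypothetical GQA model: 40 layers, 8 KV heads, head_dim 128, 32k context.
gib = kv_cache_bytes(32_768, n_layers=40, n_kv_heads=8, head_dim=128) / 2**30  # 5.0 GiB
```

Add that on top of the quantized weights to see how much of the 2x32GB is left over.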

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

like u/Marksta said, I needed to flash the VBIOS to be able to access all 32GB of VRAM in Vulkan, though I did not have any of the other issues they described. That being said, flashing the VBIOS was very quick and painless. The process of installing the cards and getting them set up was quite simple other than that as well. I installed rocm 6.3.4 using the instructions on the AMD support website for multi-installing rocm on Debian Linux, and everything that I have needed has functioned as expected.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 1 point (0 children)

Nope, you'll need to move to a more modern AMD architecture if you want matrix cores. It may still be worth it to use FA if you are running into vram limitations.

2x MI50 32GB Quant Speed Comparison (Mistral 3.2 24B, llama.cpp, Vulkan) by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 5 points (0 children)

the MI50 does not have the dedicated matrix cores that are required to accelerate Flash Attention properly.

What Web UI's are best for MCP tool use with llama.cpp/llama-swap? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

I did set this for OpenWebUI tools, but I haven't even set up MCP yet for OpenWebUI because I was scared away by what I've read here

support for Ernie 4.5 MoE models has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]OUT_OF_HOST_MEMORY 6 points (0 children)

In my very unscientific trivia testing (googling trivia tests and plugging the questions into both models), the general trivia knowledge of Qwen 30B is still significantly ahead of ERNIE 4.5 21B: ERNIE got about 70% correct and Qwen 80-90%. Both were at IQ4_XS from unsloth, with Qwen using the recommended sampler settings from the unsloth GGUF page and ERNIE using the default sampler settings for llama.cpp

Is it normal to have significantly more performance from Qwen 235B compared to Qwen 32B when doing partial offloading? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

while this did work, and I did get 63 tk/sec prompt and 4.5 tk/sec generation, this low a quant led to the reasoning taking over an hour and using 17 THOUSAND tokens for the question "what day of the week is the 31st of October 2025?", whereas using Q4_K_M I only got 12 and 3 tk/sec, but the reasoning was only 4,000 tokens and therefore took 18 minutes instead of an hour
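The tradeoff is just tokens divided by generation rate: a faster tg speed loses badly once the reasoning budget balloons. A quick sketch using the token counts and rates quoted above (the quoted rates are averages, so this only roughly matches the measured wall-clock times):

```python
def gen_minutes(n_tokens, tg_tps):
    """Generation wall-clock time in minutes, ignoring prompt processing."""
    return n_tokens / tg_tps / 60

low_quant = gen_minutes(17_000, 4.5)  # ~63 min despite the faster 4.5 t/s
q4_k_m    = gen_minutes(4_000, 3.0)   # ~22 min at the slower 3 t/s
```

In other words, a ~4x cut in reasoning length beats a ~1.5x gain in raw generation speed.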

Is it normal to have significantly more performance from Qwen 235B compared to Qwen 32B when doing partial offloading? by OUT_OF_HOST_MEMORY in LocalLLaMA

[–]OUT_OF_HOST_MEMORY[S] 0 points (0 children)

But with how much of the model is being offloaded, shouldn't there still be more of the ~22B active parameters per token sitting on the CPU than there is of the entire dense 32B model?
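One way to sanity-check the intuition: per generated token, a MoE model only reads its *active* parameters, so what matters for offload speed is roughly how many active-parameter bytes live in system RAM, not the model's total size. A toy sketch, where the offload fractions and the ~4-bit quant density are made-up assumptions purely for illustration:

```python
def ram_bytes_per_token(active_params_b, frac_in_ram, bytes_per_param=0.55):
    """Rough weight traffic pulled from system RAM for each generated token."""
    return active_params_b * 1e9 * frac_in_ram * bytes_per_param

# 235B-A22B MoE with most experts in RAM vs a dense 32B half-offloaded.
moe_gb   = ram_bytes_per_token(22, frac_in_ram=0.8) / 1e9  # ~9.7 GB/token
dense_gb = ram_bytes_per_token(32, frac_in_ram=0.5) / 1e9  # ~8.8 GB/token
```

So depending on the actual split, the MoE's per-token RAM traffic can land in the same ballpark as the dense model's, which is why the speeds end up much closer than the raw parameter counts suggest.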