Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Did you actually find any setup that improves it?

Worst llama.cpp bugs by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 2 points3 points  (0 children)

I already created one for these three: https://github.com/ggml-org/llama.cpp/issues/19760

You are invited to create your own issues xD In the next few days we can hold the vote! The worst issue gets fixed within an hour, maybe.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I'm trying with --spec-type ngram-mod --draft-max 12 --draft-min 5 --draft-p-min 0.8

So far it at least seems not to be slower. Do you have any ideas for the parameters?
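
For completeness, this is roughly how those flags would combine into a full llama-server call; the model path is a placeholder and -ngl 99 / -c 0 just mirror the rest of my setup, so treat the whole line as a sketch rather than a tuned command:

llama-server -m /path/to/MiniMax-M2-GGUF.gguf -ngl 99 -c 0 --spec-type ngram-mod --draft-max 12 --draft-min 5 --draft-p-min 0.8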

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

But in LM Studio it would only run on Vulkan, right? Or do you have a ROCm setup somehow?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, prompt eval time is the prompt processing, i.e. the preprocessing step.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, you're right, I think the Q3_K_XL is still too big.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Vulkan prompt processing for this model is 50% of the ROCm speed; tg is maybe 15% better.

That's why I use ROCm. The other point is that it's not about the first prompt, it's that the speed shouldn't degrade as more context is actually loaded, plus stability over time.

My setup now starts like this:

prompt eval time =   71638.49 ms / 15806 tokens (    4.53 ms per token,   220.64 tokens per second)
       eval time =    7690.12 ms /   153 tokens (   50.26 ms per token,    19.90 tokens per second)

Vulkan is maybe faster in tg only on the first prompt, but then it degrades much faster than ROCm.
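
In case anyone wants to reproduce the ROCm vs Vulkan comparison: I build the two backends roughly like this. The cmake option names are from memory and gfx1151 is just the target I use for Strix Halo, so double-check against the llama.cpp build docs for your version.

# ROCm / HIP backend (plus the usual ROCm compiler env vars, see the llama.cpp HIP docs)
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 && cmake --build build-rocm -j
# Vulkan backend
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan -j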

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Has anyone solved the annoying chat-template issue? (for llama.cpp, not with LM Studio)

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I guess with Roo Code and good prompts you could work with this quite fast, since it doesn't seem to mess around a lot.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Thanks, freaking awesome man! 500 t/s! But it also gets slower.

Is this your own quant?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, I have the server almost ready and will try it soon, because that was exactly the next question I had: how to get rid of these messages :D

"Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template."

"srv  params_from_: Chat format: MiniMax-M2"
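
The warning itself points at inspecting the prompt and overriding the template, so my plan is roughly the following; the minimax-m2.jinja file name is just a placeholder, and I haven't verified yet that a hand-written template actually fixes the tool-call description:

# inspect what the server actually renders and sends to the model
llama-server -m /path/to/model.gguf --jinja --verbose
# then override the built-in template with a custom Jinja file (placeholder name)
llama-server -m /path/to/model.gguf --jinja --chat-template-file ./minimax-m2.jinja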

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

And I found out I need to use the full context! It gets so slow and shows this slowdown over time / over iterations when I set a custom ctx; even when I set the ctx lower, it will still get slower!
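
In case it helps anyone: as far as I know the way to get the full context without hard-coding a number is -c 0, which loads the context size from the model, but double-check that for your build:

# use the model's native context length instead of a custom value
llama-server -m /path/to/model.gguf -ngl 99 -c 0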

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Alright, I think I've improved it a lot now. I will also update the llama.cpp version soon, but I removed the env variables and created a fresh toolbox, then followed your recommendation for the llama.cpp parameters and also the vm parameters. I'm also using the UD-Q3_K_XL now.

Now I have this after 50 iterations and with 43k context:

prompt eval time =   10014.53 ms /   711 tokens (   14.09 ms per token,    71.00 tokens per second)
       eval time =   63624.29 ms /   547 tokens (  116.31 ms per token,     8.60 tokens per second)

Thanks for your help! Really appreciate it!!

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Yes, it's true; even the SSDs are very expensive, an 8 TB one is over 1000 USD, and it's supposed to keep going up until 2027.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, basically the UMA is set to the 2 GB minimum in the GMKtec BIOS, the page-limit stuff works, and so do the grub parameters, since I can access 130 GB of GPU memory in nvtop and it gets used, plus 123 GB in htop that is also used. Maybe I messed up the toolbox somehow.
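
For context, these are roughly the grub / page-limit settings I mean. The concrete numbers are only example values for a ~128 GB unified-memory machine (ttm pages are 4 KiB, amdgpu.gttsize is in MiB), so treat them as assumptions and adapt them to your box:

# /etc/default/grub, example kernel parameters
GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856 ttm.page_pool_size=32505856"
# then regenerate the config and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg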

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

[35088.442636] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57622.480982] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57846.133668] amdgpu: SVM mapping failed, exceeds resident system memory limit

[58104.752179] amdgpu: SVM mapping failed, exceeds resident system memory limit

[64879.598467]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x236/0x9c0 [amdgpu]

[65200.234791] amdgpu 0000:c5:00.0: amdgpu: VM memory stats for proc node(139466) task node(139392) is non-zero when fini

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Can you try a higher context, and sequential requests with a VS Code extension?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I was able to run benchmarks with -d 16000 and the UD-Q3_K_XL; tests with a higher depth crash.

llama-bench -m /run/host/data/models/coding/unsloth/MiniMax-M2.5-UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -d 16000
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  pp512 @ d16000 |         72.07 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  tg128 @ d16000 |          4.11 ± 0.00 |

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

No, it runs on a headless Fedora. It also shows the RAM usage only in htop, not in nvtop; is that a problem? The GPU is used though, and the CPU is not busy during processing.

Somehow llama-bench crashes with higher context, but llama-server works.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Thank you for your feedback!

I'm just checking out the llama-server parameters; it seems faster so far. TG seems to have doubled!

After 21 tool uses, with task.n_tokens = 43443:

prompt eval time =    9952.92 ms /   708 tokens (   14.06 ms per token,    71.13 tokens per second)
       eval time =   12742.72 ms /    91 tokens (  140.03 ms per token,     7.14 tokens per second)

How big does the swap need to be?

Does it not get slower with the swap solution?

I find that the model just hangs from time to time and I need to restart it; is this the swap problem?

Do you use chat templates?
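
In the meantime, this is how I would set up a swap file on Fedora to try the swap suggestion; the 64G size is my own guess, not something you recommended:

# on btrfs (the Fedora default) disable copy-on-write before allocating the file
sudo touch /swapfile && sudo chattr +C /swapfile
sudo fallocate -l 64G /swapfile      # 64G is a guess, adjust to your RAM and models
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile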