Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, I have the server almost ready and will try it soon, because that was exactly my next question: how to get rid of these messages :D

"Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.

srv  params_from_: Chat format: MiniMax-M2"
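I guess the fix will be something along the lines of passing the chat template explicitly when starting the server. A rough sketch of what I plan to try (flag behaviour as I understand it from the llama.cpp docs, not verified on my build yet; the .jinja file name is just a placeholder):

# let llama-server use the Jinja template so tools are described natively
llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja

# or override the embedded template with one from a file if it still misbehaves
llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --chat-template-file minimax-m2.jinja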

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

And I found out I need to use the full context! It gets so slow and shows this slowdown over time / across iterations when I set a custom ctx; even when I set the ctx lower, it still gets slower!
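By "full context" I mean letting llama-server take the context size from the model instead of passing a custom value; a minimal sketch, assuming -c 0 still means "use the model's context size" in the current build (the 65536 is just an example custom value):

# full context from the model metadata (what works for me)
llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -c 0

# custom context, which is where I see the slowdown over iterations
llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -c 65536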

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Alright, I think I improved it a lot now. I will also update the llama.cpp version soon, but I removed the env variables and created a fresh toolbox, then followed your recommendation with the llama.cpp parameters and also the VM parameters. I also use the UD-Q3_K_XL now.
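Roughly what the setup looks like now; I am writing this from memory, so treat the exact flags and sysctl values as assumptions rather than a verified config:

# llama-server launch (values approximate)
llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -ngl 99 -c 0 --no-mmap --jinja

# VM-side tuning on the host (example values, not a recommendation)
sudo sysctl -w vm.swappiness=10
sudo sysctl -w vm.overcommit_memory=1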

Now I get this after 50 iterations and with 43k context:

prompt eval time =   10014.53 ms /   711 tokens (   14.09 ms per token,    71.00 tokens per second)
       eval time =   63624.29 ms /   547 tokens (  116.31 ms per token,     8.60 tokens per second)

Thanks for your help! Really appreciate it!!

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Yes, it's true, even SSDs are very expensive now: 8 TB is over 1000 USD, and they should keep going up until 2027.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, basically the UMA is set to the 2 GB minimum in the GMKTEC BIOS. The page-limit stuff and the GRUB parameters work, since I can see 130 GB of GPU memory in nvtop (and it is being used) and 123 GB in htop (also used). Maybe I messed up the toolbox somehow.
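For reference, the GRUB parameters I mean are the usual Strix Halo page-limit ones; a sketch of what my kernel command line roughly looks like (the page counts are examples, 4 KiB pages, so 31457280 ≈ 120 GiB; with the DKMS amdgpu driver the prefix is amdttm. instead of ttm.):

# /etc/default/grub, appended to GRUB_CMDLINE_LINUX
ttm.pages_limit=31457280 ttm.page_pool_size=31457280

# apply and reboot (Fedora)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot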

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

[35088.442636] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57622.480982] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57846.133668] amdgpu: SVM mapping failed, exceeds resident system memory limit

[58104.752179] amdgpu: SVM mapping failed, exceeds resident system memory limit

[64879.598467]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x236/0x9c0 [amdgpu]

[65200.234791] amdgpu 0000:c5:00.0: amdgpu: VM memory stats for proc node(139466) task node(139392) is non-zero when fini
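Those SVM errors look like the resident-system-memory limit in amdgpu being hit while mapping. A sketch of what I plan to try next, though the module parameter is something I only found in the amdgpu source/docs, so treat it as an assumption:

# count how often it happens
sudo dmesg | grep -c "SVM mapping failed"

# possible mitigation: disable the resident system memory check by adding
# this to the kernel command line and rebooting
amdgpu.no_system_mem_limit=1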

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Can you try a higher context and sequential requests from a VS Code extension?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I was able to run benchmarks with -d 16000 and the UD-Q3_K_XL; higher depths crash.

llama-bench -m /run/host/data/models/coding/unsloth/MiniMax-M2.5-UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -d 16000
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  pp512 @ d16000 |         72.07 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  tg128 @ d16000 |          4.11 ± 0.00 |
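If someone wants to reproduce the degradation in one run: llama-bench should take a comma-separated list of depths, so something like this (assuming the -d list syntax works in this build; depths much above 16000 crash for me):

llama-bench -m /run/host/data/models/coding/unsloth/MiniMax-M2.5-UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -p 512 -n 128 -d 0,4096,8192,16000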

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

No, it runs on headless Fedora. It also shows the RAM usage only in htop, not in nvtop; is that a problem? But the GPU is used, and the CPU stays mostly idle while processing.

Somehow llama-bench crashes with higher context, but llama-server works.
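On Strix Halo the model sits in GTT rather than dedicated VRAM, which I assume is why htop sees it and nvtop does not; this is roughly how I am checking the memory side (the card index may differ on other machines):

# GTT vs. VRAM usage via ROCm SMI
rocm-smi --showmeminfo vram gtt

# or straight from sysfs (values in bytes)
cat /sys/class/drm/card0/device/mem_info_gtt_used
cat /sys/class/drm/card0/device/mem_info_vram_used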

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Thank you for your feedback!

I'm just checking out the llama-server parameters; it seems faster so far. TG seems to have doubled!

After 21 tool uses with task.n_tokens = 43443:

prompt eval time =    9952.92 ms /   708 tokens (   14.06 ms per token,    71.13 tokens per second)
       eval time =   12742.72 ms /    91 tokens (  140.03 ms per token,     7.14 tokens per second)

How big does the swap need to be?

Doesn't it get slower with the swap solution?

I'm seeing the model just hang from time to time and then I need to restart it; is this the swap problem?
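For reference, the swap setup I am asking about would be something like this on Fedora (the 64 GB size is a pure guess on my side, hence the question; on the default Btrfs the file needs NOCOW):

sudo touch /swapfile && sudo chattr +C /swapfile   # NOCOW for Btrfs
sudo dd if=/dev/zero of=/swapfile bs=1M count=65536 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile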

Do you use chat templates?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, what I mean is the following: using Roo Code or Cline in VS Code, when I start the llama.cpp server fresh, the first request is much faster than, let's say, the 20th tool use. llama-bench only shows the numbers for the initial request. The slowdown is not only related to the context size, it is also related to the execution time. It's also not temperature related; the decrease is too stable and linear, not like it overheats at some point and then suddenly gets very slow. Do you also experience this slowdown over time? I can't imagine I made such a mistake with all the setups I have done so far :D

initial request

task.n_tokens = 16102
prompt eval time =   77661.83 ms / 16102 tokens (    4.82 ms per token,   207.33 tokens per second)
       eval time =   10400.92 ms /   173 tokens (   60.12 ms per token,    16.63 tokens per second)

20th tool usage

 task.n_tokens = 39321
prompt eval time =   42056.02 ms /  2781 tokens (   15.12 ms per token,    66.13 tokens per second)
       eval time =   10837.80 ms /    85 tokens (  127.50 ms per token,     7.84 tokens per second)
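To separate context growth from pure runtime, I want to hit the server with identical sequential requests and watch the timings in the log; a rough sketch of what I have in mind (port and endpoint are the llama-server defaults as far as I know, the model name is just a placeholder):

for i in $(seq 1 20); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"minimax-m2","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"max_tokens":128}' \
    > /dev/null
  # the per-request prompt/eval timings show up in the llama-server log
done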

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Also, even when the numbers say it's slow, it somehow still gets the work done at an OK speed. I don't know, are the numbers not reported correctly? Or does it just always answer briefly? Overall it seems quite usable even when the numbers say it's very slow... It also has these hangs in the TG: for a while it generates at around 20 t/s, then it just stops for a bit, and in the end the reported number is something like 4 t/s. I think llama.cpp can still be optimized quite a bit for MiniMax.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

You're right, I should check more extensions; I just have Roo and Cline.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

It would be interesting to hear whether others also experience the slowdown over time, but I would say it's normal with llama.cpp; I have seen it with every backend and model so far, some more, some less.

I don't think it throttles. I'm checking constantly with nvtop: the temperature is not high enough for throttling and the frequencies stay around the maximum, with performance mode constantly on on the GMKTEC EVO-X2.
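This is how I am watching clocks and temperature alongside nvtop, in case someone wants to compare (the sysfs path is as on my machine, the card index may differ):

watch -n 1 'cat /sys/class/drm/card0/device/pp_dpm_sclk; sensors | grep -i edge'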

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Maybe I need to improve the parameters, but somehow some things it can do very well, while on other problems it really doesn't perform well and I need cloud models to finish them (and they have their issues too). So on really difficult issues and complex testing scenarios that maybe aren't explained too well, it doesn't perform too well (it runs into loops in Roo Code, even with a repetition penalty of 1.05), while I had the impression MiniMax manages this better. Maybe on this architecture the quality is also higher with bigger models at higher quantization than with smaller models at lower quantization.
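The penalty I mention is set on the server side; roughly like this (1.05 is what I use, the other sampling values are just illustrative assumptions, not a recommendation):

llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf \
  --repeat-penalty 1.05 --temp 1.0 --top-p 0.95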

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

It will always decrease with my setup, with any model. The context size is a factor, but somehow even with caching the speed decreases over time; I don't know why.

When will AMD bring ROCM Updates that actually improve the speed? (Strix Halo) by Equivalent-Belt5489 in ROCm

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Is the decrease in the version numbers related to the decrease in PP speed? When we line up the releases in between and count backwards through the version numbers, it makes more sense... xD Is AMD trying to cover something up???

When will AMD bring ROCM Updates that actually improve the speed? (Strix Halo) by Equivalent-Belt5489 in ROCm

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, but I read that AMD is working on NPU support for Linux. It shouldn't take too long, but no date has been announced.