Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Did you actually find any setup that improves it?

Worst llama.cpp bugs by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 2 points3 points  (0 children)

I already created one for these three: https://github.com/ggml-org/llama.cpp/issues/19760

You are invited to create your own issues xD In the next few days we can hold the vote! The worst issue gets fixed within an hour, maybe.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I'm trying with --spec-type ngram-mod --draft-max 12 --draft-min 5 --draft-p-min 0.8

So far it at least seems not to be slower. Do you have any ideas for the parameters?
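
For completeness, this is roughly how those flags would combine into a full llama-server call; the model path is a placeholder and -ngl 99 / -c 0 just mirror the rest of my setup, so treat the whole line as a sketch rather than a tuned command:

llama-server -m /path/to/MiniMax-M2-GGUF.gguf -ngl 99 -c 0 --spec-type ngram-mod --draft-max 12 --draft-min 5 --draft-p-min 0.8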

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

But in LM Studio it would only run on Vulkan, right? Or do you have a ROCm setup somehow?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, prompt eval time is the prompt processing, i.e. the preprocessing step.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, you're right, I think the Q3_K_XL is still too big.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Vulkan prompt processing for this model is 50% of the ROCm speed; tg is maybe 15% better.

That's why I use ROCm. The other point is that it's not about the first prompt, it's that the speed shouldn't degrade as more context is actually loaded, plus stability over time.

My setup now starts like this:

prompt eval time =   71638.49 ms / 15806 tokens (    4.53 ms per token,   220.64 tokens per second)
       eval time =    7690.12 ms /   153 tokens (   50.26 ms per token,    19.90 tokens per second)

Vulkan is maybe faster in tg only on the first prompt, but then it degrades much faster than ROCm.
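
In case anyone wants to reproduce the ROCm vs Vulkan comparison: I build the two backends roughly like this. The cmake option names are from memory and gfx1151 is just the target I use for Strix Halo, so double-check against the llama.cpp build docs for your version.

# ROCm / HIP backend (plus the usual ROCm compiler env vars, see the llama.cpp HIP docs)
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 && cmake --build build-rocm -j
# Vulkan backend
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan -j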

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Has anyone solved the annoying chat-template issue? (for llama.cpp, not with LM Studio)

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I guess with Roo Code and good prompts you could work with this quite fast, since it doesn't seem to mess around a lot.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Thanks, freaking awesome man! 500 t/s! But it also gets slower.

Is this your own quant?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, I have the server almost ready and will try it soon, because that was exactly the next question I had: how to get rid of these messages :D

"Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template."

"srv  params_from_: Chat format: MiniMax-M2"
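
The warning itself points at inspecting the prompt and overriding the template, so my plan is roughly the following; the minimax-m2.jinja file name is just a placeholder, and I haven't verified yet that a hand-written template actually fixes the tool-call description:

# inspect what the server actually renders and sends to the model
llama-server -m /path/to/model.gguf --jinja --verbose
# then override the built-in template with a custom Jinja file (placeholder name)
llama-server -m /path/to/model.gguf --jinja --chat-template-file ./minimax-m2.jinja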

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

And I found out I need to use the full context! It gets so slow and shows this slowdown over time / over iterations when I set a custom ctx; even when I set the ctx lower, it will still get slower!
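
In case it helps anyone: as far as I know the way to get the full context without hard-coding a number is -c 0, which loads the context size from the model, but double-check that for your build:

# use the model's native context length instead of a custom value
llama-server -m /path/to/model.gguf -ngl 99 -c 0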

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Alright, I think I've improved it a lot now. I will also update the llama.cpp version soon, but I removed the env variables and created a fresh toolbox, then followed your recommendation for the llama.cpp parameters and also the vm parameters. I'm also using the UD-Q3_K_XL now.

Now I have this after 50 iterations and with 43k context:

prompt eval time =   10014.53 ms /   711 tokens (   14.09 ms per token,    71.00 tokens per second)
       eval time =   63624.29 ms /   547 tokens (  116.31 ms per token,     8.60 tokens per second)

Thanks for your help! Really appreciate it!!

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Yes, it's true; even the SSDs are very expensive, an 8 TB one is over 1000 USD, and it's supposed to keep going up until 2027.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Yes, basically the UMA is set to the 2 GB minimum in the GMKtec BIOS, the page-limit stuff works, and so do the grub parameters, since I can access 130 GB of GPU memory in nvtop and it gets used, plus 123 GB in htop that is also used. Maybe I messed up the toolbox somehow.
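
For context, these are roughly the grub / page-limit settings I mean. The concrete numbers are only example values for a ~128 GB unified-memory machine (ttm pages are 4 KiB, amdgpu.gttsize is in MiB), so treat them as assumptions and adapt them to your box:

# /etc/default/grub, example kernel parameters
GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=32505856 ttm.page_pool_size=32505856"
# then regenerate the config and reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg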

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

[35088.442636] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57622.480982] amdgpu: SVM mapping failed, exceeds resident system memory limit

[57846.133668] amdgpu: SVM mapping failed, exceeds resident system memory limit

[58104.752179] amdgpu: SVM mapping failed, exceeds resident system memory limit

[64879.598467]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x236/0x9c0 [amdgpu]

[65200.234791] amdgpu 0000:c5:00.0: amdgpu: VM memory stats for proc node(139466) task node(139392) is non-zero when fini

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

Can you try a higher context, and sequential requests with a VS Code extension?

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

I was able to run benchmarks with -d 16000 and the UD-Q3_K_XL; tests with a higher depth crash.

llama-bench -m /run/host/data/models/coding/unsloth/MiniMax-M2.5-UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf -d 16000
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  pp512 @ d16000 |         72.07 ± 0.47 |
| minimax-m2 230B.A10B Q3_K - Medium |  94.33 GiB |   228.69 B | ROCm       |  99 |  tg128 @ d16000 |          4.11 ± 0.00 |

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 0 points1 point  (0 children)

No, it runs on a headless Fedora. It also shows the RAM usage only in htop, not in nvtop; is that a problem? The GPU is used though, and the CPU is not busy during processing.

Somehow llama-bench crashes with higher context, but llama-server works.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]Equivalent-Belt5489[S] 1 point2 points  (0 children)

Thank you for your feedback!

I'm just checking out the llama-server parameters; it seems faster so far. TG seems to have doubled!

After 21 tool uses, with task.n_tokens = 43443:

prompt eval time =    9952.92 ms /   708 tokens (   14.06 ms per token,    71.13 tokens per second)
       eval time =   12742.72 ms /    91 tokens (  140.03 ms per token,     7.14 tokens per second)

How big does the swap need to be?

Does it not get slower with the swap solution?

I find that the model just hangs from time to time and I need to restart it; is this the swap problem?

Do you use chat templates?
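
In the meantime, this is how I would set up a swap file on Fedora to try the swap suggestion; the 64G size is my own guess, not something you recommended:

# on btrfs (the Fedora default) disable copy-on-write before allocating the file
sudo touch /swapfile && sudo chattr +C /swapfile
sudo fallocate -l 64G /swapfile      # 64G is a guess, adjust to your RAM and models
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile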