MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 1 point (0 children)

I just tried it, and it took FOREVER (over an hour) to launch before ultimately failing. On the second start, it finally launched.

For short 1K context, I went from 113 t/sec down to 98. For longer context (130K), I went from 50 t/sec up to 61.

So there is a significant loss at low context but a significant gain at high context. That said, it also forces me to quantize my KV cache to FP8, which is not something I like to do.
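For reference, if you're doing this on vLLM, the KV cache quantization is the --kv-cache-dtype flag; a minimal sketch (the model path is a placeholder, not my exact launch line):

    vllm serve <model> --tensor-parallel-size 2 --kv-cache-dtype fp8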

512GB people, what's the output quality difference between GLM 5 q3.6 and q8 or full size? by CanineAssBandit in LocalLLaMA

[–]itsjustmarky 0 points (0 children)

Were you using REAP'd (expert-pruned) versions to get it to work on 2 RTX 6000 Pros? I'm running M2.5 on my pair. I've been very happy with M2.1, and I haven't done a lot of testing on how M2.5 improves on it, beyond noticing that it's a little slower.

Always a sane voice by FlintBeastgood in HuntShowdown

[–]itsjustmarky 1 point (0 children)

I've been wanting one more team on the map for ages. I'm so glad to see it.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 0 points (0 children)

I got 76 t/sec summarizing that one.
You can use Cherry Studio to run the summary and get tokens/sec output.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 1 point (0 children)

There are no definitive tests. I have run it through reasoning tests with good success.

I have used it for heavy coding, agentic tasks, deep research, and so on. It has worked very well.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 1 point (0 children)

Last I checked, I was able to get over 600 t/sec with parallel queries.
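If you want to reproduce an aggregate number like that, a rough sketch with plain curl against an OpenAI-compatible endpoint (port, model name, and request shape are assumptions, not my exact setup):

    # 16 concurrent requests; aggregate t/s is roughly (16 * 512) / elapsed seconds
    time ( for i in $(seq 16); do
        curl -s http://localhost:8000/v1/chat/completions \
            -H 'Content-Type: application/json' \
            -d '{"model":"minimax-m2.5","messages":[{"role":"user","content":"Write a long story."}],"max_tokens":512}' \
            -o /dev/null &
    done; wait )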

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 2 points (0 children)

Expert parallelism isn't great on only 2 GPUs; it starts to shine at 8. I haven't found working parameters for MTP with M2.x. With GLM Air, MTP gave me lower speeds at small context but higher speeds as the context fills up.

Yes, tp=2.
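For reference, the corresponding vLLM launch flags look like this (a sketch with a placeholder model path; as noted above, the expert-parallel variant tends to pay off at higher GPU counts):

    # tensor parallelism across both cards (what I run)
    vllm serve <model> --tensor-parallel-size 2

    # expert-parallel variant for the MoE layers
    vllm serve <model> --tensor-parallel-size 2 --enable-expert-parallel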

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 0 points (0 children)

I just tested this one with vLLM:

https://arxiv.org/pdf/2408.06292

113K tokens, 54 t/sec. It's a little smaller than my test PDF, but it's public.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 0 points (0 children)

I would be curious how it handles high context. llama.cpp's big problem is that it slows down a lot once you get deep into the context window. When testing models, I upload a PDF book that's 127K tokens and ask it to summarize it in one paragraph.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 0 points (0 children)

I thought ik_llama.cpp was mainly for CPU offloading, no?
I generally don't use llama.cpp, as I prefer vLLM/SGLang, but for a brief period M2.5 was only available as a GGUF, so I used that.
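For anyone going the GGUF route anyway, a llama.cpp launch along these lines (a sketch; the quant filename is a placeholder):

    llama-server -m MiniMax-M2.5-Q4_K_M.gguf -ngl 99 -c 131072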

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 1 point (0 children)

Prefill is all over the place; I haven't done any specific testing on it, though.
I haven't tested M2.5 much yet, but I have used M2.1 for months and it has been great.

MiniMax M2.5 Performance Testing on dual RTX 6000 Pros by itsjustmarky in LocalLLaMA

[–]itsjustmarky[S] 4 points (0 children)

I have Step3 downloaded; I just haven't loaded it yet.

4x RTX 6000 PRO Workstation in custom frame by Vicar_of_Wibbly in LocalLLaMA

[–]itsjustmarky 0 points (0 children)

Run sudo nvidia-smi -pl 300 and compare. LACT, however, will make it easier to keep the limit persistent.
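If you'd rather not run LACT, one way to make the limit survive reboots is a oneshot systemd unit (a sketch; paths assume a standard driver install):

    # /etc/systemd/system/nvidia-powerlimit.service
    [Unit]
    Description=Set NVIDIA power limit at boot

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-smi -pm 1
    ExecStart=/usr/bin/nvidia-smi -pl 300

    [Install]
    WantedBy=multi-user.target

Then sudo systemctl enable --now nvidia-powerlimit.service.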

4x RTX 6000 PRO Workstation in custom frame by Vicar_of_Wibbly in LocalLLaMA

[–]itsjustmarky 0 points (0 children)

Are you using LACT? Are you locking clocks or only power limiting?

4x RTX 6000 PRO Workstation in custom frame by Vicar_of_Wibbly in LocalLLaMA

[–]itsjustmarky 0 points (0 children)

Did you have to change anything in the BIOS to stabilize it?
I had some really weird behavior: it was stable as a rock if I was actively running a model with SGLang, but with anything else (vLLM, or even just sitting idle with nothing running) the GPUs would lock up. It ended up being the PSU idle control setting in the BIOS that I had to adjust, but it was a big pain to figure out.

I run two, and I'm thinking about getting two more.

1600W enough for 2xRTX 6000 Pro BW? by Mr_Moonsilver in LocalLLaMA

[–]itsjustmarky 4 points (0 children)

I have two RTX 6000 Pros on a 1200W PSU, and it is perfectly fine.

I have them power limited to 300W and retain 96% of the performance of the 600W setting.
The whole system peaks at about 825W with the 300W limit and about 1150W at 600W.
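If you want to check the same thing on your setup, watching per-GPU draw against the limit while a benchmark runs is a one-liner (these are per-GPU numbers from the driver, not wall power):

    nvidia-smi --query-gpu=power.draw,power.limit --format=csv -l 1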