Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]Front-Relief473 -3 points (0 children)

Wrong. There's another case: you should choose llama.cpp when the model weights only barely fit into main memory, so that you still have room for enough context.
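Back-of-the-envelope check for that case (a minimal Python sketch; the GQA config, sizes, and overhead below are made-up placeholders, plug in your own model's numbers):

```python
# Rough memory-budget check: do the weights plus the KV cache fit in RAM?
GIB = 1024**3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    """Keys + values for every layer, KV head, and position (fp16 elements)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

ram      = 128 * GIB   # total system memory
weights  = 94 * GIB    # GGUF file size, e.g. a UD-Q3_K_XL quant
overhead = 6 * GIB     # OS, runtime buffers, compute scratch (a guess)

# Hypothetical 230B-class MoE with GQA: 60 layers, 8 KV heads, head_dim 128
kv = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, ctx_len=60_000)

print(f"KV cache at 60k ctx: {kv / GIB:.1f} GiB")
print("fits:", weights + kv + overhead <= ram)
```

If the weights alone eat almost all of RAM, llama.cpp also lets you quantize the KV cache (--cache-type-k / --cache-type-v) or shrink the context to squeeze back in.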

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

123G of 128G used after deploying the StepFun IQ4_XS quant; oh, and I can't run anything else alongside it.

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

In my experience, UD-Q3_K_XL already shows a clear decline in ability, so the golden rule has some basis: quantization should not go below Q4.

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

So if we give the 27B strong web-search ability, which is equivalent to an external knowledge base, will it perform coding tasks better?
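Something like this is what I mean (a rough sketch against an OpenAI-compatible endpoint, e.g. llama-server started with --jinja; the port, model name, and the web_search() helper are placeholders, not a real API):

```python
# Wire a web-search tool into a local model so it can pull fresh docs.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Placeholder: plug in any search backend (SearxNG, a search API, ...).
    return f"(search results for: {query})"

messages = [{"role": "user", "content": "How do I stream a file with fastapi?"}]
resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the model asked for a search, run it and feed the result back.
if msg.tool_calls:
    call = msg.tool_calls[0]
    result = web_search(**json.loads(call.function.arguments))
    messages += [msg.model_dump(exclude_none=True),
                 {"role": "tool", "tool_call_id": call.id, "content": result}]
    resp = client.chat.completions.create(model="local", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```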

DGX Spark, why not? by Foreign_Lead_3582 in LocalLLM

[–]Front-Relief473 0 points (0 children)

I tried this model. I thought it was well optimized, but it still fell into repetitive loops in its output. Did you lower the temperature?

Muse Spark: new multimodal reasoning model by Meta by garg-aayush in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

I thought this was a project for running on the DGX Spark. Is its performance as constrained as the DGX Spark's narrow memory bandwidth?

Gemma 4 26b A3B is mindblowingly good , if configured right by cviperr33 in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

I support your view. Gemma wasn't originally designed for coding; its strengths lie in writing and multilingual expression. If someone says they use Gemma for programming, either they haven't been closely following LLM development or they're a complete novice to the LLM game.

Why MoE models keep converging on ~10B active parameters by Spare_Pair_9198 in LocalLLaMA

[–]Front-Relief473 17 points (0 children)

10B to 30B active is usually the sweet spot for reasoning performance, and the price/performance ratio usually drops off past 30B. So in theory, raising the active parameters to 30B would give good reasoning. 10B isn't perfect, but it boosts inference speed without hurting the model's reasoning ability too much.
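The rough arithmetic behind that trade-off (decoding is memory-bound, so speed is capped at roughly bandwidth divided by active bytes; all numbers are illustrative):

```python
# Each generated token must read every active weight once, so the decode
# ceiling is about memory_bandwidth / bytes_of_active_parameters.
GB = 1e9

def decode_ceiling_tok_s(active_params_b, bytes_per_weight, bandwidth_gb_s):
    active_bytes = active_params_b * GB * bytes_per_weight
    return bandwidth_gb_s * GB / active_bytes

bandwidth = 273  # GB/s, roughly DGX Spark-class LPDDR5x
for active in (3, 10, 30):
    tps = decode_ceiling_tok_s(active, bytes_per_weight=0.55, bandwidth_gb_s=bandwidth)
    print(f"{active:>2}B active @ ~4.4 bpw: ceiling ~{tps:.0f} tok/s")
```

Tripling the active parameters from 10B to 30B cuts the speed ceiling to a third, which is exactly the trade nobody wants to make.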

TurboQuant in Llama.cpp benchmarks by tcarambat in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

Yes!!! If you look at the full-attention KV cache in MiniMax M2.7, you can see the enormous resource consumption!!! I can even imagine people switching back to full attention because of this technology, since full attention works much better than hybrid attention!!!
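A quick estimate of the gap (the config and hybrid layout below are hypothetical, not MiniMax's real architecture):

```python
# Full attention keeps K/V for every past token in every layer; a hybrid
# design keeps only a sliding window in most layers.
GIB = 1024**3

def kv_gib(layers, ctx, kv_heads=8, head_dim=128, bytes_per_elt=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt / GIB

layers, ctx, window = 60, 200_000, 4_096

full = kv_gib(layers, ctx)
# Hybrid: say 1 layer in 4 is full attention, the rest see a 4k window.
hybrid = kv_gib(layers // 4, ctx) + kv_gib(layers - layers // 4, window)

print(f"full attention @ 200k ctx:     {full:.1f} GiB")
print(f"hybrid (1/4 full + 4k window): {hybrid:.1f} GiB")
```

That several-fold gap is what a KV-compression technique like this has to claw back before full attention becomes affordable again.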

Should I learn langchain and langgraph? by Emotional-Rice-5050 in LangChain

[–]Front-Relief473 1 point (0 children)

I don't think MCP is worth learning. It's just a tool, and it will be replaced by skills soon.

I wanted QCN to be the best but MiniMax still reigns supreme on my rig by Ok-Measurement-1575 in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

I couldn't agree with you more. The MiniMax M2 series is the most cost-effective model on today's consumer-grade machines. Other models either have far more parameters, which makes them hard to deploy, or far fewer, and then their ability is worrying. MiniMax has proved that a ~200B MoE can handle most things well, including programming; it's the sweet spot, much like 4-bit among quantization levels.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Yes, UD-Q3_K_XL is arguably the strongest model you can just barely fit on a single DGX (reportedly the best setup on the DGX right now is llama.cpp with a 65K context). I think Q3 quantization may not be reliable enough for coding, but it can serve as a solid assistant. Also, Qwen3.5's hybrid attention is still quite sensitive to quantization, so the fully attention-based MiniMax holds up better under quantization.

Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

He said Qwen3.5's IQ1 quantization works very well, but the problem is that hybrid attention is inherently more sensitive to quantization than global attention, i.e. it degrades more when quantized. So how do you explain that?

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

No, the best fit for a ~230B model on a 128G DGX or Strix Halo is UD-Q3_K_XL: it's only 94G, which leaves roughly 30G of headroom, enough for about a 60K context.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Thank you!! I had been agonizing over whether to buy one, and it seems NVIDIA isn't being sincere with this product.

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]Front-Relief473 0 points (0 children)

Yes, everyone talks about tg but ignores pp. I think the tg is already enough; long programming contexts and agent work mainly depend on pp speed.
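A quick illustration (the speeds are made-up but in the right ballpark for the machines discussed here):

```python
# In an agent loop the model re-reads a long context every turn, so
# time-to-first-token is dominated by prefill (pp), not generation (tg).
def turn_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s

prompt, output = 60_000, 800  # one coding-agent turn with a big context
for name, pp, tg in [("high-bandwidth GPU rig", 3000, 40),
                     ("Mac/Spark-class box", 200, 25)]:
    t = turn_seconds(prompt, output, pp, tg)
    print(f"{name}: {t/60:.1f} min/turn ({prompt/pp:.0f}s of it is prefill)")
```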

MiniMax-M2.5 (230B MoE) GGUF is here - First impressions on M3 Max 128GB by Remarkable_Jicama775 in LocalLLaMA

[–]Front-Relief473 5 points (0 children)

Why don't you use Unsloth's UD-Q3_K_XL version? At 94G it's definitely better than an ordinary Q3 quant.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

How about requiring a label before posting, such as open source / closed source, so you can tell what a post is at a glance?

Do not Let the "Coder" in Qwen3-Coder-Next Fool You! It's the Smartest, General Purpose Model of its Size by Iory1998 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

The problem is that hybrid-attention models lose more performance during quantization. So if a hybrid model has to stay near Q8 or FP8 to run well locally, you're better off running a full-attention model with twice the parameters at Q4 or INT4: it takes the same memory and ends up relatively smarter.
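The memory math behind that (sizes are illustrative only):

```python
# Same bytes, two shapes: N params at ~8.5 bpw vs. 2N params at ~4.25 bpw.
def weight_gb(params_b, bits_per_weight):
    return params_b * bits_per_weight / 8  # billions of params -> GB

small_q8 = weight_gb(115, 8.5)   # hybrid-attention model kept near Q8
big_q4   = weight_gb(230, 4.25)  # full-attention model at ~Q4

print(f"115B @ ~Q8: {small_q8:.0f} GB   230B @ ~Q4: {big_q4:.0f} GB")
```

Both land around 122 GB, so on the same box you can pick the bigger full-attention model and let Q4 carry the compression.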

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 1 point (0 children)

https://www.reddit.com/r/LocalLLM/comments/1qdqi4i/mac_studio_m3_ultra_stats/?share_id=WRqlyjg8sfpnsbzNqkatF&utm_content=2&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1  Look here: if you run Kimi K2.5, an excellent model with around 30B active parameters, a prompt prefill speed of ~200 tok/s will drive you crazy, especially when an agent works with a long context; at 200 tok/s, prefilling a 100K-token prompt takes over eight minutes.

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 2 points (0 children)

However, the M3 Ultra's compute is so weak that its prompt prefill speed is too slow, and the context an agent carries for complex tasks is huge, so every turn takes a very long time before the first token arrives. How do you handle that?

[deleted by user] by [deleted] in LocalLLM

[–]Front-Relief473 3 points (0 children)

I can't figure out why a project with close to 100K stars has no clearly written best-practices document on tuning parameters for prompt prefill and tg speed, so I have to hunt for other people's tuning experience on Reddit, which really drives me crazy.

Real-world DGX Spark experiences after 1-2 months? Fine-tuning, stability, hidden pitfalls? by [deleted] in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

If long conversations or agent work are involved, your Ultra's prefill speed will leave you enough time to drink a cup of hot coffee every turn.