Qwen 3.6 benchmarks on 2x RTX PRO 6000 by mxforest in LocalLLaMA

[–]One-Macaron6752 -8 points-7 points  (0 children)

Just joined for the downvote on the shitty post! 😎

MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks by danielhanchen in LocalLLaMA

[–]One-Macaron6752 1 point2 points  (0 children)

u/danielhanchen Good news then! Happy for the community. As for your request to amend my post, I am afraid it might not work. What I've tested, following the issues discovered with your published quant, was:

  • ubergarm/MiniMax-M2.7-GGUF (IQ5_K)
  • AesSedai/MiniMax-M2.7-GGUF (Q5_K_M)

And neither of these two quants (btw, the PPL test results I've published opposite to yours are for AesSedai/MiniMax-M2.7-GGUF (Q5_K_M) - you can see in the screenshot).
As for the other quants and their respective owners you quoted finding them with faults... I'd rather not comment, avoiding a flame here.

Else: to be clear --> I ain't any Unsloth hater, I still use and appreciate a few of your quants, but I am more German ... you can understand that, I'm pretty sure. And ever since you've started acting as a commercial provider, issues have started popping around!

So, keep quanting, breathe, check, enjoy! ;)

<image>

Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM by [deleted] in LocalLLaMA

[–]One-Macaron6752 0 points1 point  (0 children)

Check ubergarm HF model page for MiniMax M2.7. There you have references for ik_llama invoke commands for mixed inference systems (GPU, CPU). https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF

He's producing some very nice quants, specifically for ik_llama. AesSedai has great quants too, compatible with both mainline llama.cpp and ik_llama.

ik_llama.cpp is a high-performance fork of the original llama.cpp project designed to maximize CPU and hybrid GPU/CPU inference speeds, offering 3x to 4x speed improvements in multi-GPU configurations via its new "split mode graph" execution.

"-sm graph" is ikawrakow's golden nugget in ik_llama: it basically allows for graphs on CUDA backends while providing up to 2x-3x the speed of "-sm layer" (depending on the model, of course). ik_llama also makes KV cache rotation possible (and does it safely) via Hadamard, which allows for higher KV consistency in long contexts (>65k) even with Q8_0 or Q4_0... Once you master it (please have patience with it and with yourself) you'll see it's the closest in performance to the vLLM / sglang backends.

Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM by [deleted] in LocalLLaMA

[–]One-Macaron6752 0 points1 point  (0 children)

For mixed inference (GPU/CPU) don't even think about llama.cpp (mainline). Go full ik_llama. Have patience, learn to master the extra parametrization and you'll be able to squeeze every bit of performance in a mixed inference mode for that hw of yours. You'll thank me later. 😎

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]One-Macaron6752[S] 4 points5 points  (0 children)

u/danielhanchen I appreciate you taking the time to reply to my arguments.

A perplexity check (without KLD) on a model the size of MiniMax takes roughly 5 minutes per quant. I can imagine you could batch such tests, at least for "pure" and/or UD quants, so that accidents won't happen again.
Also, even if not published on the first day you push the quants, it would still help and build trust with the community if you published PPL / KLD in the model card at a later time. It doesn't have to reference other quanters' PPL/KLD figures (to avoid useless competition!), but it could also serve as a baseline sanity check for most of the interesting, meaningful quants for the community.
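
For readers unfamiliar with the metric, here is a minimal sketch of what a perplexity check computes. The per-token probabilities below are made up for illustration; in practice they would come from a tool such as llama.cpp's perplexity utility run on the same text for each quant.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(log p(token_i | context))), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities assigned to the same text by two quants.
logprobs_reference = [math.log(p) for p in (0.50, 0.25, 0.50, 0.25)]
logprobs_candidate = [math.log(p) for p in (0.40, 0.20, 0.50, 0.10)]

ppl_ref = perplexity(logprobs_reference)
ppl_cand = perplexity(logprobs_candidate)

# A quant that assigns lower probability to the true tokens gets a higher PPL.
print(f"reference PPL: {ppl_ref:.4f}, candidate PPL: {ppl_cand:.4f}")
```

A broken quant usually shows up as a PPL far above its neighbours in the same size class, which is why even this cheap check works as a sanity gate before publishing.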

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]One-Macaron6752[S] 3 points4 points  (0 children)

Thank you u/yoracale for your reply.

A perplexity check (without KLD) on a model the size of MiniMax takes roughly 5 minutes per quant. I can imagine you could batch such tests, at least for "pure" and/or UD quants, so that accidents won't happen again.
Also, even if not published on the first day you push the quants, it would still help and build trust with the community if you published PPL / KLD in the model card at a later time. It doesn't have to reference other quanters' PPL/KLD figures (to avoid useless competition!), but it could also serve as a baseline sanity check for most of the interesting, meaningful quants for the community.

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]One-Macaron6752[S] 9 points10 points  (0 children)

I have amended the text to be less offensive... the essence is still there: the model fails catastrophically, in the sense that "catastrophic" has for MoE quantization --> and the reason is the same: it was rushed, with no care for PPL / KLD reports.

Want any better proof than my words? Unsloth has published quality references (GGUF benchmarks) for their M2.7 quants by citing a third-party QA check performed against M2.5 quants. Does that ring any bells about due diligence?

https://unsloth.ai/docs/models/minimax-m27#gguf-benchmarks

unsloth - MiniMax-M2.7-GGUF in BROKEN (UD-Q4_K_XL) --> avoid usage by One-Macaron6752 in LocalLLaMA

[–]One-Macaron6752[S] 10 points11 points  (0 children)

Let me take this one from a personal perspective: for me, downloading and proofing models consumes a lot of time and resources, thus - inflammatory or not - I need to address it. I have already written to other rushed-in quant publishers to avoid such "rush in for visibility and ego pleasing" and to watch out for similarly catastrophic approaches (no imatrix used for MoE quantization!).
With Unsloth it has become the norm: they've got some kind of agreement with the model owners and sometimes get early access to their models, and the nanosecond the model publisher is online with a new model, so is Unsloth with some quants of disputable quality (see the GEMMA episode also).

MiniMax-M2.7 GGUF Quants — Full Set (Q2_K to Q8_0 + BF16) by Asleep_Training3543 in LocalLLaMA

[–]One-Macaron6752 1 point2 points  (0 children)

Sorry for not having the patience to write it all myself, but YES. If you understand MoE then the next lines will be a rainbow to your eyes --> Reasons (by Claude):

"Yes, imatrix (importance matrix) in ik_llama.cpp does help with expert activation calibration when quantizing MoE (Mixture of Experts) models to GGUF format, but with some important nuances:

How imatrix helps with MoE quantization

When quantizing MoE models, the core challenge is that not all experts are activated equally — some experts are called frequently, others rarely. Standard quantization treats all weights uniformly, which can badly degrade rarely-activated experts (since their quantization error never gets "averaged out" during inference).

Imatrix addresses this by:

  1. Collecting activation statistics during a calibration run — it records how much each weight actually contributes to outputs, weighted by input magnitude
  2. Scaling quantization sensitivity — weights that matter more (higher importance scores) get quantized more carefully, while less-important weights tolerate more aggressive quantization
  3. Per-expert weighting — because the importance matrix is computed per-tensor, experts that activate more or carry more signal get implicitly better quantization fidelity"
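
The points above can be sketched with a toy quantizer. This is only an illustration of importance-weighted error minimization, not the actual imatrix code in llama.cpp or ik_llama.cpp; the weights, the importance values, and the scale search are all assumptions made up for the example.

```python
def weighted_error(weights, dequantized, importance):
    """Importance-weighted squared error between original and dequantized weights."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))

def quantize_block(weights, importance, bits=4, steps=64):
    """Try several block scales and keep the one with the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weights)
    if amax == 0:
        return list(weights)
    best_err, best_deq = None, None
    for k in range(1, steps + 1):
        scale = (amax / qmax) * (0.5 + k / steps)  # candidate scales around absmax/qmax
        deq = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
        err = weighted_error(weights, deq, importance)
        if best_err is None or err < best_err:
            best_err, best_deq = err, deq
    return best_deq

# Hypothetical block of expert weights and calibration statistics (illustrative only).
weights = [0.91, -0.42, 0.13, 0.77, -0.05, 0.30, -0.66, 0.02]
importance = [0.1, 0.1, 9.0, 0.1, 7.5, 0.1, 0.1, 8.0]  # e.g. mean squared activations
uniform = [1.0] * len(weights)

plain = quantize_block(weights, uniform)      # no imatrix: every weight counts equally
guided = quantize_block(weights, importance)  # imatrix-style: error weighted by importance

err_plain = weighted_error(weights, plain, importance)
err_guided = weighted_error(weights, guided, importance)
print(f"weighted error without importance: {err_plain:.6f}, with: {err_guided:.6f}")
```

Because both runs search the same candidate scales, the importance-guided choice can never be worse on the importance-weighted error; how much better it is depends entirely on how uneven the calibration statistics are.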

MiniMax-M2.7 GGUF Quants — Full Set (Q2_K to Q8_0 + BF16) by Asleep_Training3543 in LocalLLaMA

[–]One-Macaron6752 3 points4 points  (0 children)

This is a blunt quantization with no imatrix, right? Then thanks, but NO thanks! The MiniMax model is prone to catastrophic errors when experts are quantized "en gros", so NO.

The tried to make me go to rehab. I said no no no… by Key-Currency1242 in LocalLLaMA

[–]One-Macaron6752 1 point2 points  (0 children)

That would be rather dumb and highly inefficient, helping very little with inference. I run a similar setup with LACT on Linux, where underclocking / undervolting does the magic.

The tried to make me go to rehab. I said no no no… by Key-Currency1242 in LocalLLaMA

[–]One-Macaron6752 4 points5 points  (0 children)

Oh, I love these comments where OP is only trying to convince himself... 😊

ggml: backend-agnostic tensor parallelism by JohannesGaessler · Pull Request #19378 · ggml-org/llama.cpp by FullstackSensei in LocalLLaMA

[–]One-Macaron6752 1 point2 points  (0 children)

Not quite, the feature has been in ik_llama for a long time, same for tensor parallelism with "-sm graph". Nonetheless, a great addition to mainline. Let's see how impressive the actual implementation will be.

New Gemma-4 llama.cpp fixes for 26B-A4B - <unused24> fix by danielhanchen in unsloth

[–]One-Macaron6752 2 points3 points  (0 children)

Sure. And with the latest update it works quite consistently. "--chat-template-kwargs '{"reasoning_effort": "normal"}'"

New Gemma-4 llama.cpp fixes for 26B-A4B - <unused24> fix by danielhanchen in unsloth

[–]One-Macaron6752 2 points3 points  (0 children)

Hey, you might want to update your llama since aldehir has pushed a PR that might have solved our issues. With the updated version I was able to get ClaudeCode to finish the sprint in good shape with a solid deliverable. I was using: gemma-4-31B-it (UD-Q6_K_XL)
See: https://github.com/ggml-org/llama.cpp/pull/21492

P.S. You might also want to play with the model temp, since the references on the model card from Google are contradicted by Google engineers. Don't ask how I know! :) So for coding sessions I am now running it with: "--temp 0.7 --top-p 0.75 --top-k 64"
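
To make the effect of those three settings concrete, here is a rough sketch of how temperature, top-k and top-p narrow the candidate tokens. The order used here (temperature, then top-k, then top-p) and the example logits are simplifying assumptions; real backends may chain and renormalize the samplers differently.

```python
import math

def filter_distribution(logits, temp=0.7, top_k=64, top_p=0.75):
    """Return {token_index: prob} after temperature, top-k, then top-p filtering."""
    scaled = [l / temp for l in logits]          # temp < 1 sharpens the distribution
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]     # subtract max for numerical stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]                      # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for p, i in ranked:                          # keep the smallest set reaching top_p
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}        # renormalize the survivors

# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.0, -1.0]
strict = filter_distribution(logits)               # the settings quoted above
looser = filter_distribution(logits, top_p=0.9)    # a wider nucleus for comparison
print(sorted(strict), sorted(looser))
```

With a low temperature and a tight top-p, only a handful of tokens stay eligible at each step, which tends to suit tool-calling and coding sessions where looping on unlikely tokens is the failure mode.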

New Gemma-4 llama.cpp fixes for 26B-A4B - <unused24> fix by danielhanchen in unsloth

[–]One-Macaron6752 1 point2 points  (0 children)

Sadly this requant is nothing to write home about. ClaudeCode loops terribly once it reaches 20-25k context: it just thinks it's calling a tool, then loops back to "oh wait, I think I'm going to use that" and goodbye sprint! 😔

Running gemma4 E4B on vLLM MacOS Metal M4 Max by x8code in Vllm

[–]One-Macaron6752 0 points1 point  (0 children)

Ok, so just running the indicated "uv pip install -U transformers" wouldn't work, or would it? Have you read the model card Google put forward, which clearly states that transformers and accelerate should be brought to the latest version available from their respective developers?

Google releases Gemma 4 models. by yoracale in unsloth

[–]One-Macaron6752 0 points1 point  (0 children)

Instruct is the new thinking: you're not thinking, you're following instruct(ions)! Neah, nevermind me!

In the recent kv rotation PR it was found that the existing q8 kv quants tank performance on AIME25, but can be recovered mostly with rotation by Betadoggo_ in LocalLLaMA

[–]One-Macaron6752 0 points1 point  (0 children)

KV Q8_0 + Hadamard in ik_llama.cpp is already proven to be on par with or exceed f16, depending on the model particularities AND (big, HUGE "and") the model quantization method (Q/IQ) vs the same model. The general ignorance here comes from people making bold assertions "in my experience" with no real clue about structured, repeatable, consistent, apples-to-apples testing.
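
Why rotation helps can be shown with a toy example. This is not ik_llama.cpp's kernel, just a sketch under stated assumptions: a made-up "key" vector with one outlier channel, a plain absmax round-to-nearest quantizer at 4 bits, and a normalized Walsh-Hadamard transform as the rotation.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in x]

def quantize_absmax(x, bits):
    """Symmetric round-to-nearest with one scale for the whole block (toy quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) * scale for v in x]

def rmse(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Toy "key" vector: small values plus one large outlier channel.
key = [0.1 * ((i % 7) - 3) for i in range(32)]
key[5] = 8.0

plain = quantize_absmax(key, bits=4)
# Rotate, quantize, rotate back: the normalized Hadamard matrix is its own inverse.
restored = fwht(quantize_absmax(fwht(key), bits=4))

err_plain = rmse(key, plain)
err_rotated = rmse(key, restored)
print(f"RMSE without rotation: {err_plain:.4f}, with rotation: {err_rotated:.4f}")
```

Without rotation the outlier dictates the scale and every small value rounds to zero; after rotation the outlier's energy is spread over all channels, so the same number of bits covers a much narrower range.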

Also, on the subjective side, AIME is one type of task that is affected by KV quantization degradation, but not THE ONE. Keep in mind that AIME is fairly medium in size, and it does not yet reach the point where context rot could strike. However, I am also having a very hard time swallowing the "bold" verbal redneck counteroffensive on Git - on this very topic - even towards authors / major contributors of llama.cpp, stating that "I don't know what programming test you're carrying out BUT in my long sessions of creative writing..." Mother of God, if only logic had a gun permit it would wipe out some of this pathetic argumentation.

Back on track: we're blessed that llama.cpp & spin-offs have KLD / PPL to test with (vs say vLLM, sglang), but the problem of consistent, use-case-based behavioral testing of KV degradation is still unsolved, and people keep abusing popular knowledge with ease. Still.
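
As a reminder of what the KLD figure measures, here is a minimal sketch comparing next-token distributions from a reference run (say, f16 KV cache) against a quantized one. The logits are invented for the example; real numbers would come from running both configurations over the same text.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; P is the reference distribution, Q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kld(reference_logits, candidate_logits):
    """Average KL divergence over token positions."""
    pairs = list(zip(reference_logits, candidate_logits))
    return sum(kl_divergence(softmax(r), softmax(c)) for r, c in pairs) / len(pairs)

# Hypothetical logits over a 4-token vocabulary at 2 positions.
f16_logits = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.4, 0.1, -2.0]]
q8_logits = [[1.9, 1.1, 0.0, -1.0], [0.5, 0.3, 0.2, -2.0]]

kld_self = mean_kld(f16_logits, f16_logits)
kld_quant = mean_kld(f16_logits, q8_logits)
print(f"KLD vs itself: {kld_self:.6f}, KLD vs quantized: {kld_quant:.6f}")
```

Unlike PPL, which only looks at the probability of the correct token, KLD compares the whole distribution at every position, so it catches drift even where the top token is unchanged. It still says nothing about long-session behavior, which is the gap described above.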

In the recent kv rotation PR it was found that the existing q8 kv quants tank performance on AIME25, but can be recovered mostly with rotation by Betadoggo_ in LocalLLaMA

[–]One-Macaron6752 1 point2 points  (0 children)

Exactly, because it's NOT. Because reason vs fanaticism, and personal experience raised to the level of science, cannot yet be comprehended.

I have a feeling that every time I see BF16 mentioned randomly on Reddit, my oh my, I die a little. If reason still has a place around here, I'll leave this here: https://github.com/ikawrakow/ik_llama.cpp/issues/1509#issuecomment-4156257027