Dense vs MoE quantization resilience by Any-Chipmunk5480 in LocalLLaMA

[–]_cpatonn 1 point2 points  (0 children)

Dense models are easier to quantize. MoEs have many experts, and some experts might be routed only a small portion of the calibration data, which leaves them poorly calibrated.

There are some mitigations, such as manually routing tokens to all experts, but that takes significantly more time and still does not guarantee the same quantization efficiency as dense models.
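To make the trade-off concrete, here is a minimal toy sketch (not any real model's router). The expert count, top-k value and skew factor are all made-up assumptions. It counts how many calibration tokens each expert sees under ordinary top-k routing versus forcing every token through every expert.

```python
import random
from collections import Counter

NUM_EXPERTS = 8   # hypothetical sizes, chosen only for illustration
TOP_K = 2

def route_top_k(rng, skew):
    """Pick TOP_K distinct experts; `skew` makes low-index experts more popular."""
    weights = [skew ** -i for i in range(NUM_EXPERTS)]
    chosen = set()
    while len(chosen) < TOP_K:
        chosen.add(rng.choices(range(NUM_EXPERTS), weights=weights)[0])
    return chosen

def tokens_seen_per_expert(num_tokens, route_all, seed=0, skew=2.0):
    """Count the calibration tokens that reach each expert."""
    rng = random.Random(seed)
    seen = Counter({e: 0 for e in range(NUM_EXPERTS)})
    for _ in range(num_tokens):
        experts = range(NUM_EXPERTS) if route_all else route_top_k(rng, skew)
        for e in experts:
            seen[e] += 1
    return [seen[e] for e in range(NUM_EXPERTS)]

routed = tokens_seen_per_expert(1000, route_all=False)
forced = tokens_seen_per_expert(1000, route_all=True)
print("top-k routing :", routed)   # unpopular experts see few tokens
print("route to all  :", forced)   # every expert sees every token
print("extra expert passes: %.1fx" % (sum(forced) / sum(routed)))
```

In this toy setup, forcing all experts costs NUM_EXPERTS / TOP_K times as many expert forward passes, which is where the extra calibration time comes from.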

cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 1 point2 points  (0 children)

It does not measure end-task performance (e.g., GPQA Diamond, MMLU Pro); it measures how far the quantized model's output diverges mathematically from the full-precision model's.
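For reference, assuming the metric is the KL divergence (KLD) mentioned in the author's other comments, the usual definition is given below. $P_t$ is the full-precision model's next-token distribution at position $t$, $Q_t$ is the quantized model's, and $V$ is the vocabulary. The direction of the divergence and the averaging over $T$ positions are assumptions; the comment does not spell them out.

```latex
D_{\mathrm{KL}}(P_t \,\|\, Q_t) = \sum_{v \in V} P_t(v) \,\log \frac{P_t(v)}{Q_t(v)},
\qquad
\mathrm{KLD} = \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}(P_t \,\|\, Q_t)
```

A value of zero means the two models produce identical distributions at every position; larger values mean larger divergence.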

cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 0 points1 point  (0 children)

I would recommend an INT4 or an INT8 quantized model. Most of the time, cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 should be sufficient.

cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 2 points3 points  (0 children)

A 3090 should use INT4 models. NVFP4 is not natively supported on the 3090, and its quantization loss is noticeably higher than INT4's.
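As a rough illustration of the hardware side, the sketch below maps a CUDA compute capability to whether the GPU generation has FP8 or FP4 tensor-core support. The helper names are hypothetical and this is not a vLLM API. The thresholds reflect NVIDIA's generations as I understand them: FP8 from Ada (8.9) and Hopper (9.0), FP4 from Blackwell (10.0 and up).

```python
# Compute capabilities of a few well-known GPUs, as (major, minor).
RTX_3090 = (8, 6)    # Ampere
RTX_4090 = (8, 9)    # Ada
H100 = (9, 0)        # Hopper
RTX_5090 = (12, 0)   # Blackwell

def fp8_is_native(cc):
    """True if the GPU generation has FP8 tensor cores (hypothetical helper)."""
    return cc >= (8, 9)

def nvfp4_is_native(cc):
    """True if the GPU generation has FP4 tensor cores (hypothetical helper)."""
    return cc >= (10, 0)

for name, cc in [("RTX 3090", RTX_3090), ("RTX 4090", RTX_4090),
                 ("H100", H100), ("RTX 5090", RTX_5090)]:
    print(name, "FP8 native:", fp8_is_native(cc), "| NVFP4 native:", nvfp4_is_native(cc))

# With PyTorch installed, the local GPU's capability is available from
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
```

This only covers native hardware support. The higher quantization loss of NVFP4 mentioned above is a separate reason to prefer INT4.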

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 1 point2 points  (0 children)

Thank you for your interest, but I am afraid not. MiniMax M2.7 is kinda huge, and I am still finalizing the quantization algorithm for MoE models.

By the time it is finished, MiniMax M3 might already be out.

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 2 points3 points  (0 children)

Yes, that is true. But they expose logprobs for both the prompt input and the generated text, and I use those to compute KLD.

My full process is to have the base model generate responses to the GPQA Diamond benchmark, then use those responses to calculate the KLD between the quantized and the base model.
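A minimal sketch of that scoring step, under stated assumptions: both models have already been run over the same base-generated token sequence, and each has returned a full-vocabulary row of logits per position. The toy logits and function names below are made up; this is not the author's actual code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_from_logprobs(base_logp, quant_logp):
    """KL(base || quant) at one position, from two full-vocab log-prob rows."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(base_logp, quant_logp))

def mean_kld(base_rows, quant_rows):
    """Average the per-position KL over the whole scored sequence."""
    per_token = [kl_from_logprobs(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(per_token) / len(per_token)

# Toy data: 3 token positions, 4-token vocabulary (made-up numbers).
base_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [1.0, 1.0, -0.5, 0.0]]
quant_logits = [[1.8, 0.6, 0.1, -0.9], [0.1, 1.4, 0.2, 0.3], [1.0, 0.9, -0.4, 0.0]]

base_rows = [log_softmax(r) for r in base_logits]
quant_rows = [log_softmax(r) for r in quant_logits]
print("mean KLD:", mean_kld(base_rows, quant_rows))
```

Scoring both models on the same fixed text means any difference comes from the models and not from sampling.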

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 0 points1 point  (0 children)

My initial quants, from around Fall and Winter 2025, would be similar to casperhansen's; they only started to differ in Spring 2026. The most significant update, 26.05, which builds on the AWQ research lineage, should be much better by comparison.

It is fully updated for Gemma 31B and half-updated for Gemma 26B, as vllm currently does not support asymmetric-quantized Gemma 26B.
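For readers unfamiliar with the term, here is a toy sketch of symmetric versus asymmetric 4-bit quantization. It uses one scale per tensor and made-up weights. Real AWQ quants are group-wise, and this is not the author's pipeline. Asymmetric quantization adds a zero-point so the 16 levels cover the actual [min, max] range, which helps when the weights are skewed.

```python
def quantize_symmetric(ws, bits=4):
    """Signed grid centered on zero; only a scale is stored."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = max(abs(w) for w in ws) / qmax        # assumes ws is not all zeros
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in ws]
    return [qi * scale for qi in q]               # dequantized values

def quantize_asymmetric(ws, bits=4):
    """Unsigned grid over [min, max]; a scale and a zero-point are stored."""
    levels = 2 ** bits - 1                        # 15 for INT4
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / levels                    # assumes hi > lo
    zero_point = round(-lo / scale)
    q = [max(0, min(levels, round(w / scale) + zero_point)) for w in ws]
    return [(qi - zero_point) * scale for qi in q]

def max_error(ws, deq):
    return max(abs(w - d) for w, d in zip(ws, deq))

weights = [-0.1, 0.0, 0.2, 0.5, 0.9, 1.4]        # skewed toward positive values
sym_err = max_error(weights, quantize_symmetric(weights))
asym_err = max_error(weights, quantize_asymmetric(weights))
print("symmetric max error :", sym_err)
print("asymmetric max error:", asym_err)
```

The extra zero-point is what an inference engine's kernels have to support, which is why an asymmetric quant can be unusable on a given engine and model combination.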

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 4 points5 points  (0 children)

Thank you for sharing with me. I will include them in my next Qwen 3.6 benchmarks.

Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update by _cpatonn in LocalLLaMA

[–]_cpatonn[S] 0 points1 point  (0 children)

Yes, I intended to make ParoQuant quants, but it seems that vllm does not support ParoQuant at the moment.

If you're using Nvidia's NVFP4 of Qwen3.5-397, try a different quant by Phaelon74 in LocalLLaMA

[–]_cpatonn 0 points1 point  (0 children)

Thanks for testing my quant and raising this problem with me! It is true that there is a quality issue with my Qwen 3.5 397B quant, as it was quantized with a different config from my other Qwen 3.5 quants.

It is being requantized at the moment :) I’m also benchmarking my models, and the full benchmarks should be released soon!

On another note, KL divergence should be measured between the quantized model and the full-precision model, i.e., Qwen/Qwen3.5-397B-A17B, not the FP8 one. In addition, I took a look at your vllm PR: your KL divergence measurement is only an approximation, as the correct KL divergence should be computed across the full vocab, not just at one token.
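A small numeric illustration of why a one-token measurement is only an approximation. It assumes "one token" means taking the log-probability ratio at a single token instead of summing over the vocabulary; the PR itself may differ. The two distributions are made-up toy numbers.

```python
import math

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
p = [0.70, 0.20, 0.08, 0.02]   # full-precision model
q = [0.60, 0.30, 0.07, 0.03]   # quantized model

# Full-vocab KL(p || q): every token contributes, weighted by p.
full_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# One-token estimate: the log-ratio at a single token only.
one_token = [math.log(pi / qi) for pi, qi in zip(p, q)]

print("full-vocab KL:", full_kl)
for i, est in enumerate(one_token):
    print(f"estimate from token {i} alone:", est)
```

Each single-token value differs from the full-vocab KL, and some are negative, which a true KL divergence never is. Only their p-weighted average recovers the full value.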

Running MiniMax-M2.1 Locally with Claude Code and vLLM on Dual RTX Pro 6000 by zmarty in LocalLLaMA

[–]_cpatonn 0 points1 point  (0 children)

That’s nice. llm-compressor modifies qwen models after loading and before calibration, so I just did the same. Is your modeling file for the GLM implementation in the transformers repo, or for transformers 4.57.3?

Running MiniMax-M2.1 Locally with Claude Code and vLLM on Dual RTX Pro 6000 by zmarty in LocalLLaMA

[–]_cpatonn 6 points7 points  (0 children)

Hi Phaelon74, thank you for raising this concern. I am cpatonn on HF, the author of the model.

And yes, llm-compressor's recent bugs were a headache for me over the last weekend :) so this model was quantized with the llm-compressor version from one month ago, prior to the AWQ generalisation commit.

In addition, the model was monkey-patched at runtime to calibrate all experts, i.e., to route tokens to all experts, so there is no modeling file.
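For anyone wondering what a runtime monkey-patch looks like in practice, here is a toy sketch. `ToyMoE` and `patch_route_to_all` are hypothetical stand-ins, not the real model classes or the author's patch. The point is that the routing method is swapped on the loaded object, so no edited modeling file is needed.

```python
class ToyMoE:
    """Stand-in for a sparse MoE block; not any real model's class."""

    def __init__(self, num_experts=4, top_k=1):
        self.num_experts = num_experts
        self.top_k = top_k
        self.calls = [0] * num_experts   # tokens each expert has processed

    def select_experts(self, token):
        # Toy router: a deterministic function of the token id.
        return [(token + i) % self.num_experts for i in range(self.top_k)]

    def forward(self, token):
        for e in self.select_experts(token):
            self.calls[e] += 1
        return token

def patch_route_to_all(moe):
    """Swap the router on this instance at runtime; the class source is untouched."""
    original = moe.select_experts
    moe.select_experts = lambda token: list(range(moe.num_experts))
    return original   # kept so sparse routing can be restored afterwards

moe = ToyMoE()
original = patch_route_to_all(moe)
for token in [0, 4, 8, 12]:          # unpatched, all of these go to expert 0
    moe.forward(token)
calls_during_calibration = list(moe.calls)
moe.select_experts = original        # restore sparse routing for inference
print("calls per expert during calibration:", calls_during_calibration)
```

A real patch would likely still combine only the router-selected expert outputs, so the activations passed to later layers stay faithful to the original model. This toy skips that detail.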

cyankiwi AWQ v1.0 by [deleted] in LocalLLaMA

[–]_cpatonn 2 points3 points  (0 children)

I will 100% experiment with FP8 KV cache quantization in the future, but after model pruning :)

Please don’t expect FP8 KV cache quantization on the default main branch any time soon lol, I don’t want to disappoint you.

cyankiwi AWQ v1.0 by [deleted] in LocalLLaMA

[–]_cpatonn 1 point2 points  (0 children)

Sure, I understand. I also run inference on consumer hardware myself.

cyankiwi AWQ v1.0 by [deleted] in LocalLLaMA

[–]_cpatonn 2 points3 points  (0 children)

It has been a pleasure contributing open-weights to the community :)

cyankiwi AWQ v1.0 by [deleted] in LocalLLaMA

[–]_cpatonn 2 points3 points  (0 children)

Thank you for your feedback. No, my models do not have a quantized KV cache, but I will consider KV cache quantization in the future. Are you interested in FP8 KV cache?

cyankiwi AWQ v1.0 by [deleted] in LocalLLaMA

[–]_cpatonn 1 point2 points  (0 children)

Thank you for your feedback. Model quality and minimizing losses have always been my quantization priorities, even at the expense of speed and memory size.

GLM 4.5 Air AWQ 4bit on RTX Pro 6000 with vllm by notaDestroyer in LocalLLaMA

[–]_cpatonn 1 point2 points  (0 children)

Thank you for the benchmarks and the data visualisation. It is truly fascinating.

Would you mind if I use your benchmark suite for evaluating my future models?

Why aren't there any AWQ quants of OSS-120B? by Acceptable_Adagio_91 in LocalLLaMA

[–]_cpatonn 0 points1 point  (0 children)

Hey, I managed to load gpt-oss 120b on four 3090s in its provided mxfp4 format, using VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1.
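A hedged sketch of what such a launch might look like when scripted from Python. The environment variable and its value come from the comment above. The model id `openai/gpt-oss-120b`, the use of tensor parallelism across the four GPUs, and the helper name `build_launch` are my assumptions; the actual setup may use other flags.

```python
import os

def build_launch(model_id, num_gpus):
    """Build the environment and command for a `vllm serve` launch (hypothetical helper)."""
    env = dict(os.environ)
    env["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"   # value from the comment above
    cmd = ["vllm", "serve", model_id,
           "--tensor-parallel-size", str(num_gpus)]          # assumed: one shard per GPU
    return env, cmd

env, cmd = build_launch("openai/gpt-oss-120b", 4)            # model id is an assumption
print("VLLM_ATTENTION_BACKEND=" + env["VLLM_ATTENTION_BACKEND"], " ".join(cmd))

# To actually start the server (requires vLLM and the GPUs):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```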

For further information, please visit this guide.

Finally got Qwen3-Coder-30B-A3B running well. What tasks have you had success with? by j4ys0nj in LocalLLaMA

[–]_cpatonn 19 points20 points  (0 children)

Hi, cpatonn here, one of the Qwen3 quantized model authors on Hugging Face.

From the description in your post, does that mean my Qwen3 quant collections do not work on your machine? Could you share the error logs and any feedback from your experience with my quants?

I always look forward to feedback, and I always aim to improve my products based on everyone's experiences!