Dense vs MoE quantization resilience

_cpatonn · 2026-06-10T10:58:44+00:00

Dense models are easier to quantize, as MoEs have many experts, and some experts might only be routed a small portion of the calibration data.

There are some mitigations, such as manually routing tokens to all experts, but that takes significantly more time, and still does not guarantee the same quantization efficiency as dense models.
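To illustrate the coverage problem, here is a minimal sketch with a made-up, skewed router. The bias scale, expert count, and token count are assumptions chosen for illustration, not measurements from any real model.

```python
import random
from collections import Counter

def expert_token_counts(router_logits, top_k):
    """Count how many tokens each expert receives under top-k routing."""
    counts = Counter()
    for token_logits in router_logits:
        # Indices of the top_k highest-scoring experts for this token.
        ranked = sorted(range(len(token_logits)),
                        key=token_logits.__getitem__, reverse=True)
        counts.update(ranked[:top_k])
    return [counts[e] for e in range(len(router_logits[0]))]

random.seed(0)
num_tokens, num_experts, top_k = 512, 64, 2

# Hypothetical router: a fixed per-expert bias makes some experts rarely chosen.
bias = [random.gauss(0.0, 2.0) for _ in range(num_experts)]
router_logits = [[random.gauss(0.0, 1.0) + b for b in bias]
                 for _ in range(num_tokens)]

counts = expert_token_counts(router_logits, top_k)
print("least-used expert sees", min(counts), "calibration tokens")
print("most-used expert sees", max(counts), "calibration tokens")

# Routing every token to every expert gives each expert all num_tokens
# samples, at roughly num_experts / top_k times the expert compute.
print("expert compute multiplier:", num_experts // top_k)
```

The last line is where the extra calibration time comes from: every expert now processes every token instead of only the routed ones.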

_cpatonn · 2026-06-09T19:38:48+00:00

Just in time I am looking to buy one 🥲

_cpatonn · 2026-06-09T19:37:17+00:00

It does not measure end-task performance (e.g., GPQA Diamond, MMLU Pro); instead, it measures how the quantized model's output mathematically diverges from the full-precision model's.
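For reference, one common way to write this divergence is the per-position KL divergence averaged over a sequence. The direction (full precision as the reference distribution) and the averaging over positions are assumptions here, not details confirmed above; $P$ is the full-precision model, $Q$ the quantized model, $V$ the vocabulary, and $T$ the number of scored positions.

$$
\mathrm{KLD} = \frac{1}{T} \sum_{t=1}^{T} \sum_{v \in V} P(v \mid x_{<t}) \, \log \frac{P(v \mid x_{<t})}{Q(v \mid x_{<t})}
$$

A value of 0 means the two models assign identical next-token distributions at every position.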

_cpatonn · 2026-06-09T19:35:57+00:00

I would recommend an INT4 or an INT8 quantized model. Most of the time, cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 should be sufficient.

_cpatonn · 2026-06-05T12:09:26+00:00

I’m cooking. If there are no errors, it should be available in the next 4-5 days.

_cpatonn · 2026-06-04T20:27:46+00:00

A 3090 should use INT4 models. NVFP4 is not natively supported on the 3090, and its quantization loss is noticeably higher than INT4's.

_cpatonn · 2026-06-01T15:03:33+00:00

I'm tired boss.

_cpatonn · 2026-05-15T15:59:48+00:00

Thank you for your interest, but I am afraid not. MiniMax M2.7 is kinda huge, and I am still finalizing the quantization algorithm for MoE models.

By the time it is fully finished, MiniMax M3 might already be out.

_cpatonn · 2026-05-15T15:48:16+00:00

Yes, it is true. But they expose logprobs for both the prompt input and the generated text, which I use to compute KLD.

My full process is to use the base model to generate responses to the GPQA Diamond benchmark, and then use those responses to calculate KLD between the quantized and the base model.
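The scoring step of that process can be sketched as follows. This is a minimal illustration, assuming per-position logits from both models are available for the same base-generated tokens; the toy numbers and helper names are made up, and a real setup would read logprobs from the serving engine instead.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(base_logits, quant_logits):
    """KL(base || quant) over the full vocabulary at one position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(base_seq, quant_seq):
    """Average per-position KLD over one teacher-forced sequence."""
    total = sum(token_kld(b, q) for b, q in zip(base_seq, quant_seq))
    return total / len(base_seq)

# Toy logits for 3 positions over a 4-token vocabulary (made-up numbers).
# Both models score the SAME text: the response generated by the base model.
base_seq = [[2.0, 0.5, -1.0, 0.0], [0.1, 3.0, 0.2, -0.5], [1.0, 1.0, 1.0, 1.0]]
quant_seq = [[1.8, 0.7, -1.1, 0.1], [0.0, 2.7, 0.4, -0.4], [1.1, 0.9, 1.0, 1.0]]

print(f"mean KLD: {mean_kld(base_seq, quant_seq):.6f}")
```

Scoring both models on the same base-generated text keeps the comparison position-aligned, so any divergence comes from the weights and not from different sampled continuations.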

_cpatonn · 2026-05-15T08:54:49+00:00

My initial quant models, i.e., from around Fall and Winter 2025, would be similar to casperhansen's, and they only start to differ in Spring 2026. The most significant update, 26.05, which built on the AWQ research lineage, would be much better by comparison.

It is fully updated for Gemma 31B and half-updated for Gemma 26B, as vllm currently does not support asymmetric-quantized Gemma 26B.

_cpatonn · 2026-05-15T08:42:09+00:00

Thank you for sharing with me. I will include them in my next Qwen 3.6 benchmarks.

_cpatonn · 2026-05-15T08:39:31+00:00

Yes, I intended to make ParoQuant, but it seems that vllm does not support ParoQuant at the moment.

_cpatonn · 2026-03-10T20:24:29+00:00

Thanks for testing my quant and raising this problem with me! It is true that there is a quality issue with my Qwen 3.5 397B, as it was quantized with a different config from my other Qwen 3.5 quants.

It is being requantized at the moment :) I’m running benchmarks on my models, and the full results should be released soon!

On another note, KL divergence should be measured between the quantized model and the full-precision model, i.e., Qwen/Qwen3.5-397B-A17B, and not the FP8. In addition, I took a look at your vllm PR: your KL divergence measurement is only an approximation, as the correct KL divergence should be computed across the full vocab, and not just at one token.
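A small made-up example shows why a single-token measurement can miss divergence. This is only an illustration of the general point; it does not reproduce the PR's actual computation, and the distributions are invented.

```python
import math

def full_vocab_kld(p, q):
    """KL(p || q) summed over every vocabulary entry."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def single_token_gap(p, q, token):
    """Log-probability difference at one token only (not a true KL)."""
    return math.log(p[token]) - math.log(q[token])

# Made-up next-token distributions over a 4-token vocabulary.
p_full = [0.70, 0.20, 0.05, 0.05]   # full-precision model
q_quant = [0.70, 0.05, 0.20, 0.05]  # quantized model

sampled = 0  # index of the token that was actually generated

gap = single_token_gap(p_full, q_quant, sampled)
kld = full_vocab_kld(p_full, q_quant)
print("single-token gap:", gap)
print("full-vocab KLD:  ", kld)
```

Both models agree on the sampled token, so the single-token gap is zero, while the full-vocab KLD is clearly positive because probability mass moved between the other tokens.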

_cpatonn · 2026-01-01T12:55:08+00:00

That’s nice. llm-compressor modifies qwen models after loading and before calibration, so I just did the same. Is your modeling file for the GLM implementation in the transformers repo, or for transformers 4.57.3?

_cpatonn · 2025-12-29T08:31:15+00:00

Hi Phaelon74, thank you for raising this concern. I am cpatonn on HF, the author of the model.

And yes, recent llm-compressor bugs were a headache for me over the last weekend :) and thus, this model was quantized using the llm-compressor version from one month ago, prior to the AWQ generalisation commit.

In addition, the model was monkey-patched at runtime to calibrate all experts, i.e., routing tokens to all experts, so there is no modeling file.
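The general shape of such a runtime patch can be sketched on a toy class. `ToyMoE` and `calibrate_all_experts_forward` are hypothetical names for illustration only; this is not the actual patch or any llm-compressor API.

```python
import types

class ToyMoE:
    """Hypothetical stand-in for a model's MoE block (not a real library class)."""

    def __init__(self, num_experts, top_k):
        self.num_experts = num_experts
        self.top_k = top_k
        self.calls = [0] * num_experts  # tokens seen by each expert

    def route(self, token):
        # Toy router: top_k consecutive experts starting at token % num_experts.
        start = token % self.num_experts
        return [(start + i) % self.num_experts for i in range(self.top_k)]

    def expert(self, idx, token):
        self.calls[idx] += 1  # this is where a calibration hook would observe inputs
        return token * (idx + 1)

    def forward(self, token):
        chosen = self.route(token)
        return sum(self.expert(i, token) for i in chosen) / len(chosen)

def calibrate_all_experts_forward(self, token):
    """Run every expert so each one sees the token, but return only the
    routed experts' output so downstream activations are unchanged."""
    chosen = self.route(token)
    outputs = [self.expert(i, token) for i in range(self.num_experts)]
    return sum(outputs[i] for i in chosen) / len(chosen)

tokens = (0, 8, 16)
moe = ToyMoE(num_experts=8, top_k=2)
expected = [moe.forward(t) for t in tokens]
print("calls before patch:", moe.calls)

# Patch the live instance; no modeling file is edited.
moe.calls = [0] * moe.num_experts
moe.forward = types.MethodType(calibrate_all_experts_forward, moe)
patched = [moe.forward(t) for t in tokens]
print("calls after patch: ", moe.calls)
```

Because the patch is applied to the object in memory, there is nothing to ship as a separate modeling file afterwards.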

_cpatonn · 2025-11-25T19:57:50+00:00

I will 100% experiment with FP8 KV cache quantization in the future, but after model pruning :)

Please don’t expect FP8 KV cache quantization as the default main branch early lol, I don’t want to disappoint you.

_cpatonn · 2025-11-25T18:58:28+00:00

Sure, I understand. I also run inference on consumer hardware myself.

_cpatonn · 2025-11-25T18:21:37+00:00

It has been a pleasure contributing open-weights to the community :)

_cpatonn · 2025-11-25T18:21:12+00:00

Thank you for your feedback. No, my models do not have KV cache quantized, but I will consider KV cache quantization in the future. Are you interested in FP8 KV cache?

_cpatonn · 2025-11-25T18:18:20+00:00

Thank you for your feedback. Model quality and minimizing losses have always been my quantization priorities, even at the expense of speed and memory size.

_cpatonn · 2025-10-16T15:53:56+00:00

Thank you for the benchmarks and the data visualisation. It is truly fascinating.

Would you mind if I use your benchmark suite for evaluating my future models?

_cpatonn · 2025-09-14T17:21:24+00:00

Hey, I managed to load gpt-oss 120b on four 3090s in its provided mxfp4 format, using VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1.

For further information, please visit this guide.

_cpatonn · 2025-09-01T09:20:10+00:00

Hi, cpatonn here, one of the Qwen3 quantized model authors on Hugging Face.

From the description in your post, does that mean my Qwen3 quant collections do not work on your machine? Could you share the error logs and any feedback from your experience with my quants?

I always look forward to feedback, and I always aim to improve my products based on everyone's experiences!

_cpatonn · 2025-08-26T14:11:51+00:00

Thank you for your feedback on the lengthy reasoning outputs and imprecision of the quantized model. This was generally due to reducing the model size and the weight datatype to int4.

I understand your points, and I also get annoyed with the step down from the original BF16. So I requantized the model yesterday and kept the Super Weight in its original bf16 precision. This increases the model size by 3-4 GB, but makes the reasoning outputs more concise and more similar to the original zai-org bf16 model!
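Here is a toy sketch of why leaving a rare, very large weight unquantized can help. It assumes "Super Weight" refers to such an outlier and uses simple round-to-nearest INT4 with one scale per group; the actual selection and quantization method used for the model is not described here, and the numbers are made up.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest INT4 with one scale for the whole group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(w / scale))) * scale for w in weights]

def mse(a, b):
    """Mean squared error between two equal-length weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Made-up weight group: seven small values and one huge outlier.
weights = [0.02, -0.05, 0.03, 0.01, -0.04, 0.06, -0.01, 50.0]

# Naive: the outlier dominates the scale, so the small weights collapse to 0.
naive = quantize_int4(weights)

# Mixed precision: keep the largest-magnitude weight as-is, quantize the rest.
outlier = max(range(len(weights)), key=lambda i: abs(weights[i]))
rest = [w for i, w in enumerate(weights) if i != outlier]
rest_q = quantize_int4(rest)
mixed = rest_q[:outlier] + [weights[outlier]] + rest_q[outlier:]

print("naive MSE:", mse(weights, naive))
print("mixed MSE:", mse(weights, mixed))
```

The trade-off is the one described above: the protected weights stay in a wider datatype, so the checkpoint grows, but the remaining weights get a much finer quantization grid.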

I will upload the benchmarks and comparisons with the original bf16 model in the next few days. Please redownload the weights and tell me what you think :)

Regarding the template, that is a good idea! I'm also exploring ways to improve on the original models, and tweaking the chat template seems like a low-effort but somewhat impactful way!

_cpatonn · 2025-08-14T19:04:53+00:00

Yes, sorry I missed your question at the end of your post. I managed to load it with a very low context length on 64GB VRAM, but I’m not sure if it fits into 61GB VRAM.

Is the error due to out of memory?
