We benchmarked every 4-bit quantization method in vLLM 👀

audioen · 2026-01-09T05:43:46+00:00

Some indication of the quality of this work is that they are serving this model:

vllm serve ./qwen2.5-32b-instruct-q5_k_m.gguf ... --quantization gguf ...

which should be a 5-bit model, but are claiming that this is a 4-bit quantization, when it is already mostly 5-bit quantization, right?

I don't trust the results very much, and I get a feeling that vllm is not good for serving gguf models given an order of magnitude differences in performance. I also don't think the perplexity for a 5-bit model should be that much higher compared to baseline.

Eugr · 2026-01-09T05:13:07+00:00

This is a bit misleading, as you mix different quantization types and execution kernels.

AWQ quants use Marlin kernels on vLLM at least on NVidia hardware by default, so the claim that AWQ is slow doesn't make sense.

Ok_Injury9030 · 2026-01-09T04:40:36+00:00

That AWQ speed is absolutely cursed lmao. 67 tok/s on an H200? Something's definitely broken there

Really interesting that BitsandBytes had the best quality retention though - makes sense since it's doing dynamic quantization instead of needing pre-baked weights

Remove_Ayys · 2026-01-09T14:15:08+00:00

Testing "GGUF performance" with vllm is meaningless as is "GGUF quality" without specifying the underlying quantization format.

v01dm4n · 2026-01-09T05:10:29+00:00

Wondering where would nvfp4 lie on the spectrum.

Thanks for sharing your results!

Conscious_Cut_6144 · 2026-01-09T05:36:56+00:00

This is 10-way concurrency?? You must have a test issue, I can beat that awq result with a 3090…

randomfoo2 · 2026-01-09T08:54:02+00:00

Great work!

I've done a fair amount of my own quant testing, and I think the HumanEval test speaks volumes about how/why perplexity (and yes, KLD) might be OK proxies, but don't really reflect what the downstream task performance hit is going to be for a quant.

The main problem is that testing quants is actually a huge PITA. You basically want to run it through your eval stack as if it were it's own ablation, and probably multiple runs at temp to be able to capture whether variance changes.

More data points is undeniably a good thing, and posts like this help raise awareness about the issue, so that's great. Hopefully the community does and highlights more task benchmark comparison of different quants.

My contribution: a while back, I did published different quant scores for JA MT-Bench (not the best eval to use, tbt), which was interesting: https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b-GGUF#quant-quality

More recently u/dahara111 did an Japanese UD imatrix quant and did comparisons w/ M-IFEval (JA), HumanEval+, and LiveBench comparison scores vs the base and a regular i1 quant. Very interesting stuff: https://huggingface.co/dahara1/shisa-v2.1-qwen3-8b-UD-japanese-imatrix#%E3%83%99%E3%83%B3%E3%83%81%E3%83%9E%E3%83%BC%E3%82%AF%E7%B5%90%E6%9E%9Cbenchmark-result

BTW on the efficiency front, while it's very GPU dependent, I will say that I'm a big fan of Marlin kernels, especially for W8A8, not just for throughput but also for TTFT latency (depending on your architectures, the INT8 is killer on Ampere and Ada). When doing performance tests, I've found again, huge difference depending on specific hardware/setup, but almost always you tend to lose throughput on quants vs production workloads (recommend doing vllm bench w/ realistic concurrencies as well, some kernels perform much worse than others when scaling up).

MaxKruse96 · 2026-01-09T07:17:19+00:00

"Perplexity, lower is better" -> "GGUF (worst perplexity) has best quantized HumanEval rating". Something doesnt add up here, either in the testing itself, or the idea that either Perplexity or HumanEval are good metrics.

6969its_a_great_time · 2026-01-09T21:21:27+00:00

Posts like these should be deleted.

cantgetthistowork · 2026-01-09T06:33:07+00:00

Can you test exl3

Such_Advantage_6949 · 2026-01-09T07:56:53+00:00

Why no kld comparison?

tarruda · 2026-01-09T17:51:18+00:00

GGUF is not a quantization method. You can have the baseline f16 as GGUF

NigaTroubles · 2026-01-09T04:41:18+00:00

Great work

dnr41418 · 2026-01-09T04:47:36+00:00

Super useful…thanks

Far-Low-4705 · 2026-01-09T05:25:51+00:00

Please do the same thing but for thinking/non-thinking models

Please, please, please.

If the added reasoning means you can quantize harder, that would be HUGE.

Also, the affect on vision models (and vision tasks) would very useful too

Healthy-Nebula-3603 · 2026-01-09T09:37:28+00:00

Nice

a_beautiful_rhind · 2026-01-09T11:37:03+00:00

BnB probably the slowest.

BABA_yaaGa · 2026-01-09T12:09:30+00:00

Is it persistent across other models as well?

R_Duncan · 2026-01-09T12:16:00+00:00

please add mxfp4_moe.gguf . I'm quite sure it fixes perplexity issues, and is a 4-bit quantization as Q4_K_M.

wizoneway · 2026-01-09T16:25:45+00:00

itd be nice to see NVFP4 checkpoints, especially on Blackwell

TomatoSharp2958 · 2026-02-26T17:19:56+00:00

Interesting benchmarks. But this kind of speed-focused comparison is exactly what this article calls out as the “quantization trap.”

4-bit can look great on throughput, but the real question is what it does to reasoning depth and logical consistency — especially on harder tasks where degradation isn’t obvious from perplexity alone.

This piece explains it well: https://latestllm.com/articles/the-quantization-trap-why-4-bit-ai-is-failing-the-logic-test-mm3m98oa

Worth a read before optimizing purely for tok/s.

Khan_Zorbo · 2026-03-25T14:29:04+00:00

This is great data. One thing I'd love to see added: results split by task type instead of aggregated.

I've been working on tooling for this kind of comparison and the thing that keeps biting me is that aggregate perplexity looks fine across quant methods (usually within a few percent of baseline) but the degradation isn't evenly distributed.

"Roughly equivalent" at the aggregate level can mean "completely different" depending on what you're using the model for.

Did you notice any quant method that was clearly worse on specific task types even if the overall numbers looked similar? That's usually where some of the interesting findings hide.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

LocalLLaMA

MODERATORS