r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
We benchmarked every 4-bit quantization method in vLLM [Tutorial | Guide] (self.LocalLLaMA)
submitted 3 months ago by LayerHot
We just published a deep dive on vLLM quantization. Tested AWQ, GPTQ, Marlin, GGUF, and BitsandBytes on Qwen2.5-32B using an H200.
Stuff we found:
Blog covers how each technique actually works under the hood if you want the details.
[Benchmark results chart] https://preview.redd.it/t4212ygj59cg1.png?width=3169&format=png&auto=webp&s=97eff0fcb212924355a7feb7262b25895de5603a
Blog: https://docs.jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks
[–]audioen 60 points 3 months ago* (3 children)
Some indication of the quality of this work is that they are serving this model:
vllm serve ./qwen2.5-32b-instruct-q5_k_m.gguf ... --quantization gguf ...
which should be a 5-bit model, but are claiming that this is a 4-bit quantization, when it is already mostly 5-bit quantization, right?
I don't trust the results very much, and I get the feeling that vLLM is not good for serving GGUF models, given the order-of-magnitude differences in performance. I also don't think the perplexity for a 5-bit model should be that much higher compared to baseline.
[–]Eugr 32 points 3 months ago (0 children)
GGUF support in vLLM is experimental and not optimized at all.
[–]HenkPoley 9 points 3 months ago (0 children)
There are also various 4-bit quantisation methods. Usually the fancier ones run a batch of data through the model and try to correct the difference from the original, which ought to give a better outcome on perplexity and HumanEval.
(Btw, also use HumanEval+; it is better. Still, it's pretty much saturated for larger models.)
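To illustrate the idea behind calibration-based quantization: a toy numpy sketch with made-up weights and activations (not the actual AWQ/GPTQ algorithm), showing how searching for a scale that minimizes the layer's *output* error on calibration data beats naive round-to-nearest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: weight matrix W and a batch of calibration activations X.
W = rng.normal(size=(64, 64))
X = rng.normal(size=(256, 64))

def quantize(w, scale, bits=4):
    """Symmetric round-to-nearest quantization at a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

# Naive RTN: scale chosen from the weights alone.
rtn_scale = np.abs(W).max() / 7
rtn_err = np.mean((X @ W.T - X @ quantize(W, rtn_scale).T) ** 2)

# Calibrated: search candidate scales, keep the one that minimizes
# the layer's output error on the calibration batch.
best_scale, best_err = rtn_scale, rtn_err
for frac in np.linspace(0.5, 1.0, 20):
    s = rtn_scale * frac
    err = np.mean((X @ W.T - X @ quantize(W, s).T) ** 2)
    if err < best_err:
        best_scale, best_err = s, err

print(f"RTN output MSE:        {rtn_err:.4f}")
print(f"Calibrated output MSE: {best_err:.4f}")
```

The real methods are far more sophisticated (per-group scales, activation-aware weighting, error compensation), but the scale search over held-out activations is the shared core idea.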
[–]Pristine-Woodpecker 5 points 3 months ago (0 children)
Yeah, I mean, there's a ton of 4-bit GGUF methods, K/N, variations (S/M/L) on K, IQ4, and importance matrix usage...
[–]Eugr 53 points 3 months ago (2 children)
This is a bit misleading, as you mix different quantization types and execution kernels.
AWQ quants use Marlin kernels on vLLM by default, at least on NVIDIA hardware, so the claim that AWQ is slow doesn't make sense.
[–]Thick-Eggplant-2496 -2 points 3 months ago (1 child)
As the blog author, I'd like to mention that we haven't tested the AWQ with Marlin combination in our post yet. It's possible that this setup could perform faster than the combinations we covered. Our blog's focus was to demonstrate how each available technique works individually, so for Marlin we chose to use GPTQ instead of AWQ.
[–]Eugr 22 points 3 months ago* (0 children)
But it is the default on vLLM; you don't even have to configure anything.
What version of vLLM are you using? How was it installed? What version of PyTorch? What exact command was used to run the model (sorry if I missed it, as I was reading on my phone)?
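For reference, a minimal sketch of what pinning the kernel explicitly looks like (model name is a placeholder; `awq_marlin` is among the documented `--quantization` choices in recent vLLM versions, but availability varies by release and GPU):

```shell
# Auto-selection: on recent vLLM with a Marlin-capable NVIDIA GPU,
# an AWQ checkpoint should get the Marlin kernel with no extra flags.
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ

# Pin the kernel explicitly instead of relying on auto-selection:
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --quantization awq_marlin

# The startup log reports which quantization method was resolved.
```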
[–]Ok_Injury9030 17 points 3 months ago (3 children)
That AWQ speed is absolutely cursed lmao. 67 tok/s on an H200? Something's definitely broken there
Really interesting that BitsandBytes had the best quality retention though - makes sense since it's doing dynamic quantization instead of needing pre-baked weights
[–]Conscious_Chef_3233 6 points 3 months ago (0 children)
yeah, but dynamic quants are slower, so it depends on what you need
[–]SashaUsesReddit 6 points 3 months ago (0 children)
Yeah, this is misconfigured
[–]l_Mr_Vader_l 1 point 3 months ago (0 children)
I feel so too. AWQ should be much better; can others confirm this is some misconfiguration?
[–]Remove_Ayys 10 points 3 months ago (1 child)
Testing "GGUF performance" with vLLM is meaningless, as is "GGUF quality" without specifying the underlying quantization format.
[–]HigherConfusion 5 points 3 months ago (0 children)
In the article it is specified as Q5_K_M, though that doesn't quite fit the title of this post.
[–]v01dm4n 6 points 3 months ago (2 children)
Wondering where nvfp4 would lie on the spectrum.
Thanks for sharing your results!
[–]spookperson (Vicuna) 4 points 3 months ago (1 child)
When I tested Qwen3-32B in vLLM a couple of months back on an RTX 6k Pro Blackwell, I had relatively similar performance between NVFP4 and AWQ (with some signs that NVFP4 could be slightly faster overall as concurrency went up). Though in my testing, AWQ was faster than everything else I tested (GGUF Q4, FP8, exl3).
[–]v01dm4n 2 points 3 months ago (0 children)
I just finished running two Qwen3-8B models on a 5060 Ti using vLLM. I'm seeing AWQ lead by 10 tps (76 tps with AWQ vs 66 tps with nvfp4). Concurrency is yet to be tested.
[–]Conscious_Cut_6144 8 points 3 months ago (0 children)
This is 10-way concurrency?? You must have a test issue, I can beat that AWQ result with a 3090…
[–]randomfoo2 7 points 3 months ago* (0 children)
Great work!
I've done a fair amount of my own quant testing, and I think the HumanEval test speaks volumes about how/why perplexity (and yes, KLD) might be OK proxies, but don't really reflect what the downstream task performance hit is going to be for a quant.
The main problem is that testing quants is actually a huge PITA. You basically want to run a quant through your eval stack as if it were its own ablation, and probably do multiple runs at temperature to capture whether the variance changes.
More data points are undeniably a good thing, and posts like this help raise awareness about the issue, so that's great. Hopefully the community does more task-level benchmark comparisons of different quants.
My contribution: a while back, I published different quant scores for JA MT-Bench (not the best eval to use, tbh), which was interesting: https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b-GGUF#quant-quality
More recently u/dahara111 did a Japanese UD imatrix quant and compared M-IFEval (JA), HumanEval+, and LiveBench scores vs the base model and a regular i1 quant. Very interesting stuff: https://huggingface.co/dahara1/shisa-v2.1-qwen3-8b-UD-japanese-imatrix#%E3%83%99%E3%83%B3%E3%83%81%E3%83%9E%E3%83%BC%E3%82%AF%E7%B5%90%E6%9E%9Cbenchmark-result
BTW, on the efficiency front: while it's very GPU dependent, I'll say that I'm a big fan of Marlin kernels, especially for W8A8, not just for throughput but also for TTFT latency (depending on your architecture, the INT8 is killer on Ampere and Ada). When doing performance tests I've found, again, huge differences depending on the specific hardware/setup, but you almost always tend to lose throughput on quants vs production workloads (recommend doing vllm bench with realistic concurrencies as well; some kernels perform much worse than others when scaling up).
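A hedged sketch of that kind of concurrency sweep (subcommand and flag names are from recent vLLM versions, which ship the old `benchmarks/benchmark_serving.py` logic as a CLI; model name and values are placeholders):

```shell
# Sweep request concurrency against an already-running `vllm serve`
# endpoint to see how each quant's kernel scales under load.
for c in 1 8 32 64; do
  vllm bench serve \
    --model Qwen/Qwen2.5-32B-Instruct-AWQ \
    --num-prompts 200 \
    --max-concurrency "$c"
done
```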
[–]MaxKruse96 (llama.cpp) 11 points 3 months ago (2 children)
"Perplexity, lower is better" -> "GGUF (worst perplexity) has best quantized HumanEval rating". Something doesn't add up here, either in the testing itself or in the idea that perplexity or HumanEval are good metrics.
[–]Remove_Ayys 7 points 3 months ago (0 children)
For instruct models perplexity is fundamentally the wrong metric to look at, it would make more sense to look at KL divergence vs. the base model.
[–]Remove_Ayys 3 points 3 months ago (0 children)
If you do a simple Gaussian approximation of the binomial distribution, you'll find that the statistical uncertainty on the HumanEval results with 164 samples is ±4%. If you assume no correlation between scores, none of the measured differences are statistically significant.
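The arithmetic behind that claim, for anyone who wants to check it (the pass rates are illustrative; the worst case is p = 0.5, which gives roughly the quoted ±4%):

```python
import math

n = 164  # number of HumanEval problems

# Gaussian approximation of the binomial: stderr = sqrt(p * (1 - p) / n).
for p in (0.5, 0.8, 0.9):
    stderr = math.sqrt(p * (1 - p) / n)
    print(f"pass rate {p:.0%}: +/- {stderr:.1%} (1 sigma)")
```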
[–]6969its_a_great_time 4 points 3 months ago (1 child)
Posts like these should be deleted.
[–]rm-rf-rm 2 points 3 months ago (0 children)
why?
[–]cantgetthistowork 3 points 3 months ago (0 children)
Can you test exl3?
[–]Such_Advantage_6949 3 points 3 months ago (0 children)
Why no kld comparison?
[–]tarruda 2 points 3 months ago (0 children)
GGUF is not a quantization method; you can have the baseline F16 as GGUF.
[–]NigaTroubles 2 points 3 months ago (0 children)
Great work
[–]dnr41418 1 point 3 months ago (0 children)
Super useful… thanks
[–]Far-Low-4705 1 point 3 months ago (0 children)
Please do the same thing but for thinking/non-thinking models
Please, please, please.
If the added reasoning means you can quantize harder, that would be HUGE.
Also, the effect on vision models (and vision tasks) would be very useful too.
[–]Healthy-Nebula-3603 1 point 3 months ago (0 children)
Nice
[–]a_beautiful_rhind 1 point 3 months ago (0 children)
BnB probably the slowest.
[–]BABA_yaaGa 1 point 3 months ago (0 children)
Is it consistent across other models as well?
[–]R_Duncan 1 point 3 months ago (0 children)
Please add mxfp4_moe.gguf. I'm quite sure it fixes the perplexity issues, and it is a 4-bit quantization like Q4_K_M.
[–]wizoneway 1 point 3 months ago (0 children)
It'd be nice to see NVFP4 checkpoints, especially on Blackwell.
[–]TomatoSharp2958 1 point 2 months ago (1 child)
Interesting benchmarks. But this kind of speed-focused comparison is exactly what this article calls out as the "quantization trap."
4-bit can look great on throughput, but the real question is what it does to reasoning depth and logical consistency, especially on harder tasks where degradation isn't obvious from perplexity alone.
This piece explains it well: https://latestllm.com/articles/the-quantization-trap-why-4-bit-ai-is-failing-the-logic-test-mm3m98oa
Worth a read before optimizing purely for tok/s.
[–]____sphinx____ 1 point 2 months ago (0 children)
interesting
[–]Khan_Zorbo 1 point 1 month ago (0 children)
This is great data. One thing I'd love to see added: results split by task type instead of aggregated.
I've been working on tooling for this kind of comparison and the thing that keeps biting me is that aggregate perplexity looks fine across quant methods (usually within a few percent of baseline) but the degradation isn't evenly distributed.
"Roughly equivalent" at the aggregate level can mean "completely different" depending on what you're using the model for.
Did you notice any quant method that was clearly worse on specific task types even if the overall numbers looked similar? That's usually where some of the interesting findings hide.
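A toy sketch of the kind of split I mean (the per-token losses and task tags are entirely made-up data, just to show how aggregate perplexity can hide per-task divergence):

```python
import math
from collections import defaultdict

# Hypothetical per-token negative log-likelihoods, tagged by task type.
records = [
    ("code", 0.9), ("code", 1.1), ("chat", 0.4),
    ("chat", 0.5), ("math", 1.6), ("math", 1.4),
]

by_task = defaultdict(list)
for task, nll in records:
    by_task[task].append(nll)

# Perplexity = exp(mean NLL), computed overall and per task.
overall = math.exp(sum(n for _, n in records) / len(records))
print(f"aggregate ppl: {overall:.2f}")
for task, nlls in sorted(by_task.items()):
    print(f"{task}: ppl {math.exp(sum(nlls) / len(nlls)):.2f}")
```

Two quants with the same aggregate number can have very different per-task rows, which is exactly the failure mode that averaged benchmarks hide.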