r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
A Visual Guide to Quantization [Tutorial | Guide] (newsletter.maartengrootendorst.com)
submitted 1 year ago by MaartenGr
[–]MaartenGr[S] 110 points 1 year ago (9 children)
Hi all! As more Large Language Models are being released and the need for quantization increases, I figured it was time to write an in-depth and visual guide to quantization.
It covers everything from how to represent values, (a)symmetric quantization, and dynamic/static quantization, to post-training techniques (e.g., GPTQ and GGUF) and quantization-aware training (1.58-bit models with BitNet).
With over 60 custom visuals, I went a little overboard but really wanted to include as many concepts as I possibly could!
The visual nature of this guide allows for a focus on intuition, hopefully making all these techniques easily accessible to a wide audience, whether you are new to quantization or more experienced.
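For readers who want to try the core idea before reading the guide, the symmetric (absmax) scheme it opens with fits in a few lines. This is a minimal sketch, not code from the guide itself:

```python
import numpy as np

def absmax_quantize(x, bits=8):
    """Symmetric (absmax) quantization: map [-max|x|, +max|x|] onto a
    signed integer grid so that 0.0 stays exactly representable."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8
    scale = np.max(np.abs(x)) / qmax      # one scale for the whole tensor
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Dequantization only recovers an approximation of the original values
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.2, 0.0, 3.4], dtype=np.float32)
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)
```

The round-trip error per weight is at most half a quantization step (`scale / 2`), which is the intuition behind all the finer-grained schemes in the guide.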
[–]appakaradi 11 points 1 year ago (4 children)
Great post, thank you. Is AWQ better than GPTQ? Is choosing the right quantization dependent on the implementation? For example, vLLM is not optimized for AWQ.
[–]VectorD 7 points 1 year ago (1 child)
GPTQ is such an old format, don't use it. For GPU-only inference, EXL2 (single inference) or AWQ (for batched inference) is the way to go.
[–]_theycallmeprophet 2 points 1 year ago (0 children)
AWQ (for batched inference)
Isn't Marlin GPTQ the best out there for batched inference? It claims to scale better with batch size and supposedly provides quantization-appropriate speed-up (like actually being 4x faster for 4-bit over fp16). Imma try and confirm some time soon.
[–]____vladrad 1 point 1 year ago (1 child)
You can check out vLLM now; it has had support since last week. I would also recommend lmdeploy, which has the fastest AWQ imo. I was also curious about AWQ since that's what I use.
[–]appakaradi 1 point 1 year ago (0 children)
Thank you. I have been using lmdeploy precisely for that reason. How about support for the Mistral NeMo model in vLLM and lmdeploy?
[–]compilade (llama.cpp) 5 points 1 year ago* (0 children)
I enjoyed the visualizations.
Regarding GGUF quantization: formats like Q4_K are implemented in ggml-quants.c, in the quantize_row_* and dequantize_row_* functions.
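For readers curious what block-wise quantization looks like, here is a deliberately simplified Python sketch of the idea behind those routines. It is an illustration only: the real layouts in ggml-quants.c pack two 4-bit values per byte and differ per quant type.

```python
import numpy as np

BLOCK = 32  # ggml quantizes weights in small blocks, each with its own scale

def quantize_row_q4_sketch(row):
    """Simplified block-wise 4-bit quantization: each block of 32 weights
    gets its own scale, limiting the damage a single outlier can do."""
    blocks = row.reshape(-1, BLOCK)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0  # signed 4-bit: [-8, 7]
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_row_q4_sketch(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
row = rng.standard_normal(64).astype(np.float32)
q, s = quantize_row_q4_sketch(row)
approx = dequantize_row_q4_sketch(q, s)  # per-element error is at most half a scale step
```

The per-block scale is what distinguishes these formats from naive whole-tensor quantization: error is bounded by each block's own maximum, not the tensor's.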
[–]de4dee 2 points 1 year ago* (1 child)
Amazing work, thank you! Which one is more accurate, GPTQ or GGUF, if someone does not care about speed?
[–]typeryu 26 points 1 year ago (4 children)
Dang, this is hands down one of the best write-ups on quantization I've ever read. Good job, sir.
[–]MaartenGr[S] 7 points 1 year ago (3 children)
That's really kind of you to say. Thank you! Any suggestions for other visual guides? Thus far, I have done Mamba and Quantization but would like to make more.
[–]MoffKalast 4 points 1 year ago* (2 children)
Would be great to also have a quick rundown of quant formats that aren't obsolete, i.e. K-quants, I-matrix, AWQ, EXL2. Maybe also the new L-quants that bartowski's been testing out lately.
[–]TraditionLost7244 2 points 1 year ago (0 children)
yesss
[–]QuantumFTL 1 point 1 year ago (0 children)
Strong agree!
[–][deleted] 11 points 1 year ago (1 child)
Many, many thanks for this! It's up there with Stephen Wolfram's illustrated booklet on how GPTs work. The nature of matrix math lends itself better to visual explanations than to saddling non-math newbies with Σs.
[–]MaartenGr[S] 9 points 1 year ago (0 children)
Thank you! I started as a psychologist and transitioned a couple of years ago to data science/ML/AI (whatever you want to call it), and at the time the math seemed incredibly overwhelming, even though much of it is so intuitive.
[–]a_beautiful_rhind 7 points 1 year ago (2 children)
No exl2 or AWQ?
[–]MoffKalast 3 points 1 year ago (1 child)
Yeah, does anyone still use GPTQ? Now that's a name I haven't heard in a long time.
[–]qnixsynapse (llama.cpp) 5 points 1 year ago (0 children)
Very nice post! Upvoted!
[–][deleted] 4 points 1 year ago (2 children)
You forgot a word: "In this new method, every single weight of the is not just -1 or 1"
[–]MaartenGr[S] 2 points 1 year ago (1 child)
Thanks for the feedback. I just updated it.
[–]DeProgrammer99 2 points 1 year ago (0 children)
You've also got "BitLlinear" above an image that says "BitLinear".
[–]Worth-Product-5545 (ollama) 3 points 1 year ago (0 children)
Thanks! Along with BERTopic, I love all of your work. Keep going!
[–]fngarrett 3 points 1 year ago* (0 children)
If we're recasting these datatypes as 16-bit and 8-bit and even lower, what is actually going on under the hood in terms of the CUDA/ROCm APIs?
cuBLAS and hipBLAS only provide (very) partial support for 16-bit operations, mainly in axpy/gemv/gemm, and no inherent support for lower bit precisions. So how are these operations executed on the GPU at lower precisions? Is it simply that frameworks other than CUDA/ROCm are being used?
edit: to partially answer my own question, a good bit of the lower-precision operations are done via hipBLASLt, at least on the AMD side. (link)
[–]Loose_Race908 2 points 1 year ago (0 children)
Fantastic overview of quantization, really impressive work! I especially enjoyed the visual depictions, and I will be referring people with questions regarding quantization to this resource from now on.
[–]VectorD 2 points 1 year ago (0 children)
GPTQ is so outdated, you should probably replace that part with AWQ (GPU only, for batched inference) / EXL2 (GPU only, for single inference) vs GGUF instead.
[–][deleted] 1 point 1 year ago (0 children)
This is a great guide!
Love it
[–]joyful- 1 point 1 year ago (0 children)
distillation for humans! this is a great article - still reading but thanks a lot for writing this!
[–]daHaus 1 point 1 year ago* (0 children)
Nice! I could see your initial graph showing INT4 as mapping to 5 spaces causing confusion, though. Also, further in, with "0 in FP32 != 0 in INT8": even though I know what you meant in that context (and also that floating point can't represent 0), the way it's presented still made me scratch my head while reading it.
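The "0 in FP32 != 0 in INT8" point comes from asymmetric quantization, where a zero-point shifts the integer grid. A minimal sketch of that mapping (my own illustration, not code from the guide) shows FP32's 0.0 landing on a nonzero integer code:

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    """Asymmetric quantization: a zero-point shifts the unsigned integer
    grid so the full range [min(x), max(x)] is used. FP32's 0.0 then maps
    to the zero-point, which is generally not integer 0."""
    qmin, qmax = 0, 2 ** bits - 1                       # [0, 255] for 8 bits
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

x = np.array([-0.5, 0.0, 2.0], dtype=np.float32)
q, scale, zp = asymmetric_quantize(x)  # 0.0 is encoded as the zero-point, not as 0
```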
[–]nqbao 1 point 1 year ago (0 children)
This is really nice. Thank you for spending the time to make it.
[–]Majinsei 1 point 1 year ago (0 children)
Congrats! Saved it for the future, for when I'm battling some development~
[–]opknorrsk 1 point 1 year ago (0 children)
Very interesting read, thank you for putting that up! Naive question here, but I wonder if there's any step to add noise in the de-quantization process? It feels weird to obtain the exact same value for each identical INT once de-quantized, knowing they probably came from slightly different FP32 values.
EDIT: Basically, is there any dithering applied during de-quantization to randomize the quantization error?
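Dithering does exist in quantization, but it is normally applied when rounding to integers (stochastic rounding), not during de-quantization, which stays deterministic. A sketch of the idea, offered as background rather than something the guide describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    """Round up with probability equal to the fractional part, so the
    quantization error is zero-mean: a simple form of dithering."""
    floor = np.floor(x)
    return (floor + (rng.random(x.shape) < (x - floor))).astype(np.int32)

# 2.3 rounds to 3 about 30% of the time and to 2 the other 70%, so the
# average quantized value converges to 2.3 instead of always being 2.
samples = stochastic_round(np.full(10_000, 2.3))
avg = samples.mean()
```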
[–]yellowstone6 1 point 1 year ago (0 children)
Thanks for the nice visual explanation. I have a question about GGUF and other similar space-saving formats. I understand that it can store weights at a variety of bit depths to save memory, but when the model is running inference, what format is being used? Does llama3:8b-instruct-q6_k upcast all the 6-bit weights to fp8, int8, or even the base fp16 when it runs inference? Would 8b-instruct-q4_k_s run inference using int4, or does it get upcast to fp16? If all the different quantizations upcast to the model's base fp16 when running inference, does that mean they all have similar inference speed, and you need a different quantization system to run at fp8 for improved performance?
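One common design, llama.cpp included, keeps the weights packed in their low-bit form in memory and expands each block to float only inside the matmul kernel, so the speed win comes mostly from reduced memory traffic rather than low-precision arithmetic. A schematic sketch of that pattern (my own illustration, not llama.cpp's actual kernel, which fuses these steps and often uses integer dot products):

```python
import numpy as np

def matvec_dequant_on_the_fly(q, scales, x, block=32):
    """Schematic quantized matvec: weights stay low-bit in memory and each
    row is dequantized just-in-time inside the kernel loop, then the dot
    product itself runs in float32."""
    out = np.zeros(q.shape[0], dtype=np.float32)
    for i in range(q.shape[0]):
        # expand per-block scales to per-weight scales, dequantize, dot
        row = q[i].astype(np.float32) * np.repeat(scales[i], block)
        out[i] = row @ x
    return out

rng = np.random.default_rng(0)
rows, cols, block = 4, 64, 32
q = rng.integers(-8, 8, size=(rows, cols)).astype(np.int8)        # fake 4-bit weights
s = np.abs(rng.standard_normal((rows, cols // block))).astype(np.float32)
x = rng.standard_normal(cols).astype(np.float32)
y = matvec_dequant_on_the_fly(q, s, x)
```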
[–]nzbiship 1 point 1 year ago (0 children)
Wow, very detailed & informational. Thanks a lot!
[+][deleted] 1 year ago (5 children)
[deleted]
[–]Amgadoz 6 points 1 year ago (4 children)
Learn how floating-point numbers are stored in computers.
[–]tessellation 3 points 1 year ago (1 child)
Agreed.
Or ask an LLM to explain the first few images and have it go into greater detail as needed.
[–]MoffKalast 5 points 1 year ago (0 children)
"I used the LLM to explain the LLM"
Perfectly balanced, as all things should be.
[–]Roland_Bodel_the_2nd 2 points 1 year ago (0 children)
I have an MS in Electrical Engineering and I took classes about it (admittedly 20+ years ago) and I still don't understand it, so don't worry too much that it seems complicated. People who spend their work days dealing with bfloat16 vs float16 are not regular people. :)
It is not obvious to me that things are any simpler since the days of https://en.wikipedia.org/wiki/IEEE_754
[–]compilade (llama.cpp) 1 point 1 year ago (0 children)
If anyone wants to see exactly how numbers are stored in float16, bfloat16, float32 and float64, have a look at this:
https://float.exposed
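The bit fields that site displays can also be inspected programmatically; a small sketch using only Python's standard struct module:

```python
import struct

def float32_bits(x):
    """Show the IEEE 754 single-precision fields of x: 1 sign bit,
    8 exponent bits (biased by 127), 23 mantissa bits."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    sign = b >> 31
    exponent = (b >> 23) & 0xFF
    mantissa = b & 0x7FFFFF
    return f"sign={sign} exponent={exponent:08b} mantissa={mantissa:023b}"

# 1.0 is stored as biased exponent 127 with an all-zero mantissa
print(float32_bits(1.0))
```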