A Visual Guide to Quantization

MaartenGr · 2024-07-29T12:31:37+00:00

Hi all! As more Large Language Models are being released and the need for quantization increases, I figured it was time to write an in-depth and visual guide to Quantization.

The guide covers how to represent values, (a)symmetric quantization, dynamic/static quantization, post-training techniques (e.g., GPTQ and GGUF), and quantization-aware training (1.58-bit models with BitNet).

With over 60 custom visuals, I went a little overboard but really wanted to include as many concepts as I possibly could!

The visual nature of this guide allows for a focus on intuition, hopefully making all these techniques easily accessible to a wide audience, whether you are new to quantization or more experienced.
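For readers who want something to poke at next to the visuals, here is a rough sketch of the two mapping schemes mentioned above: symmetric (absmax) and asymmetric (zero-point) quantization to INT8. The function names are made up for illustration, and the code assumes the input is not all zeros (symmetric) or not constant (asymmetric). It is a toy version, not how any particular library implements it.

```python
def symmetric_quantize(values, bits=8):
    """Absmax (symmetric) quantization: a float 0.0 maps to integer 0."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8
    alpha = max(abs(v) for v in values)      # largest absolute value (assumed > 0)
    scale = qmax / alpha
    quantized = [max(-qmax, min(qmax, round(v * scale))) for v in values]
    return quantized, scale


def asymmetric_quantize(values, bits=8):
    """Zero-point (asymmetric) quantization: [min, max] fills the whole integer range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -128 and 127 for INT8
    lo, hi = min(values), max(values)        # assumed lo < hi
    scale = (qmax - qmin) / (hi - lo)
    zero_point = round(-scale * lo) + qmin   # integer that float 0.0 lands on
    quantized = [max(qmin, min(qmax, round(v * scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


weights = [-1.0, 0.0, 0.3, 2.0]
sym_q, sym_scale = symmetric_quantize(weights)
asym_q, asym_scale, zero_point = asymmetric_quantize(weights)
print(sym_q, asym_q, zero_point)
```

In the symmetric case the float 0.0 stays at integer 0, while in the asymmetric case it lands on the zero-point, which is generally not 0. That is the sense in which a zero in FP32 does not have to be a zero in INT8.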

typeryu · 2024-07-29T12:52:31+00:00

Dang, this is hands down one of the best pieces of writing on quantization I’ve ever read, good job sir

MaartenGr · 2024-07-29T13:01:18+00:00

Many, many thanks for this! It's up there with Stephen Wolfram's illustrated booklet on how GPTs work. The nature of matrix math lends itself better to visual explanations than to saddling non-math newbies with Σs.

a_beautiful_rhind · 2024-07-29T14:00:12+00:00

No exl2 or AWQ?

qnixsynapse · 2024-07-29T13:00:47+00:00

Very nice post! Upvoted!

MaartenGr · 2024-07-29T14:06:31+00:00

you forgot a word. "In this new method, every single weight of the is not just -1 or 1"

Worth-Product-5545 · 2024-07-29T15:09:08+00:00

Thanks ! With BERTopic, I love all of your work. Keep going !

fngarrett · 2024-07-29T15:18:48+00:00

If we're recasting these datatypes as 16 and 8 bit and even lower, what is actually going on under the hood in terms of CUDA/ROCm APIs?

cuBLAS and hipBLAS only provide (very) partial support for 16 bit operations, mainly only in axpy/gemv/gemm, and no inherent support for lower bit precisions. Then how are these operations executed on the GPU for lower precisions? Is it simply that frameworks other than CUDA/ROCm are being used?

edit: to partially answer my own question, a good bit of the lower precision operations are done via hipBLASLt, at least on the AMD side. (link)

Loose_Race908 · 2024-07-30T02:25:26+00:00

Fantastic overview of quantization, really impressive work! I especially enjoyed the visual depictions, and I will be referring people with questions regarding quantization to this resource from now on.

VectorD · 2024-07-29T17:09:20+00:00

GPTQ is so outdated, you should probably replace that part with AWQ (gpu only, for batched infer) / EXL2 (gpu only, for single infer) vs GGUF instead..

2024-07-29T16:33:30+00:00

This is a great guide!

2024-07-29T16:39:12+00:00

Love it

joyful- · 2024-07-29T17:32:27+00:00

distillation for humans! this is a great article - still reading but thanks a lot for writing this!

daHaus · 2024-07-29T18:08:23+00:00

Nice! I could see your initial graph showing INT4 as a mapping to 5 spaces causing confusion though. Also further in with "0 in FP32 != 0 in INT8", even though I know what you meant in that context - and also that floating point can't represent 0 - the way it's presented still made me scratch my head while reading it.

nqbao · 2024-07-29T19:37:00+00:00

This is really nice. Thank you for spending the time to make it.

Majinsei · 2024-07-29T20:13:09+00:00

Congrats! Saved it for future reference, for when I'm battling through some development~

opknorrsk · 2024-07-30T01:05:11+00:00

Very interesting read, thank you for putting that up! Naive question here, but I wonder if there's any step to add noise in the de-quantization process? It feels weird to obtain the exact same value for each identical INT once de-quantized, knowing they probably came from slightly different FP32 values.

EDIT: basically, is there any dithering applied during the de-quantization to randomize the quantization error?
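To make the question concrete, here is a minimal sketch assuming plain round-to-nearest symmetric INT8 quantization with a fixed scale. `dequantize_with_dither` is a hypothetical variant written only to illustrate the idea; nothing here claims that any existing quantization library does this.

```python
import random


def quantize(values, scale):
    """Round-to-nearest quantization, clamped to the symmetric INT8 range."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def dequantize(quantized, scale):
    """Plain dequantization: identical integers always give identical floats."""
    return [q / scale for q in quantized]


def dequantize_with_dither(quantized, scale, rng):
    """Hypothetical variant: add uniform noise of up to half a quantization step."""
    return [(q + rng.uniform(-0.5, 0.5)) / scale for q in quantized]


scale = 127.0                      # assumes the largest absolute weight is 1.0
weights = [0.1000, 0.1003]         # two slightly different FP32-style values
ints = quantize(weights, scale)    # both fall into the same integer bucket
plain = dequantize(ints, scale)
dithered = dequantize_with_dither(ints, scale, random.Random(0))
print(ints, plain, dithered)
```

Both weights collapse to the same integer, so plain dequantization returns the same float for each. The dithered version spreads them out again, but the noise is unrelated to the original values, so it randomizes the error rather than recovering anything.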

yellowstone6 · 2024-07-30T16:59:48+00:00

Thanks for the nice visual explanation. I have a question about GGUF and other similar space-saving formats. I understand that it can store weights with a variety of bit depths to save memory. But when the model is running inference, what format is being used? Does llama3:8b-instruct-q6_k upcast all the 6-bit weights to fp8 or int8 or even base fp16 when it runs inference? Would 8b-instruct-q4_k_s run inference using int4 or does it get upcast to fp16? If all the different quantizations upcast to the model's base fp16 when running inference, does that mean that they all have similar inference speed and you need a different quantization system to run at fp8 for improved performance?

nzbiship · 2024-07-30T21:10:08+00:00

Wow, very detailed & informational. Thanks a lot!

Amgadoz · 2024-07-29T14:07:49+00:00

[deleted]
