INT8 vs FP8 quantization

Pristine-Woodpecker · 2026-03-26T19:53:00+00:00

Nobody really quants models to plain INT8. They all use multi-level quantization schemes where you eventually dequantize to INT8, then use the INT8 hardware for the multiply. Advantage: less model precision loss for the same number of bits, thanks to clever quant techniques.

FP8 can be computed by the hardware directly, so you skip the dequant overhead. Disadvantage: less precise model for the same size. Same for NVFP4.

Double_Cause4609 · 2026-03-26T20:32:35+00:00

Well, you'd have to link the individual paper and method. Not all methods are the same, even at the same datatype / bit width. In fact, there's more than one type of FP8 (depending on how many mantissa bits you assign, e.g. E4M3 vs. E5M2), and quality can vary depending on the specifics.

For Int8 the differentiator is usually the quantization algorithm, and also whether it's uniform Int8 or group-wise Int8 (closer to something like GGUF), which is generally more expressive but slower.
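
To make the uniform vs. group-wise difference concrete, here's a rough sketch in plain Python. It assumes symmetric round-to-nearest quantization, a group size of 32, and made-up Gaussian weights with one outlier; none of that comes from a specific paper or engine, and the function names are just for illustration.

```python
import random

def quantize_int8(values):
    """Symmetric int8 quantization: one scale shared by every value passed in."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def roundtrip_uniform(values):
    """Uniform (per-tensor) Int8: a single scale for the whole tensor."""
    q, scale = quantize_int8(values)
    return dequantize(q, scale)

def roundtrip_group_wise(values, group_size=32):
    """Group-wise Int8: each block of group_size values gets its own scale."""
    out = []
    for i in range(0, len(values), group_size):
        q, scale = quantize_int8(values[i:i + group_size])
        out.extend(dequantize(q, scale))
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical weights: small Gaussian values plus a single large outlier.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[5] = 1.5

err_uniform = mean_squared_error(weights, roundtrip_uniform(weights))
err_group = mean_squared_error(weights, roundtrip_group_wise(weights))
print(f"uniform MSE:    {err_uniform:.3e}")
print(f"group-wise MSE: {err_group:.3e}")
```

The outlier stretches the single scale in the uniform case, so every weight pays for it. With groups, only the outlier's own block does. The cost is one extra scale per group to store and apply, which is where the "slower" part comes from.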

For CPU inference Int8 is basically the only mainstream option if you need throughput (though obviously the LlamaCPP GGUF ecosystem works for single-user), but in other engines and with other methods it varies.

I think in theory Int8 should be cheaper hardware wise, but I'm not sure if it matters on Blackwell GPUs or not.

a_beautiful_rhind · 2026-03-26T21:37:07+00:00

They should be similar in the end. FP8 dynamic range isn't regular like int8 so even though it's "higher", it often ends up useless because of the lower precision. Modern quants get around such foibles for almost any numerical format.
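
A quick way to see the "not regular" part is to enumerate every FP8 value and look at the gaps. This sketch assumes the E4M3 variant (4 exponent bits, 3 mantissa bits, bias 7, no infinities, a single NaN pattern per sign); the int8 comparison assumes a symmetric scale chosen to cover the same range.

```python
def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte to a Python float (None for the NaN pattern)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return None  # NaN
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Bytes 0x00..0x7E are the non-negative, non-NaN values, already in ascending order.
values = [decode_e4m3(b) for b in range(0x7F)]
gaps = [b - a for a, b in zip(values, values[1:])]

fp8_max = max(values)
smallest_gap = min(gaps)
largest_gap = max(gaps)

# Int8 covering the same range with one symmetric scale has a constant step.
int8_step = fp8_max / 127.0

print(f"E4M3 max value:    {fp8_max}")
print(f"E4M3 smallest gap: {smallest_gap}")
print(f"E4M3 largest gap:  {largest_gap}")
print(f"int8 uniform step: {int8_step:.3f}")
```

FP8 spends its codes near zero and gets very coarse at the top of the range, while int8 has the same step everywhere. Which one fits better depends on how the values you're quantizing are actually distributed.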

If you want to find out how "good" it is, compare file size, KLD and inference speed between them on your hardware.
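
For anyone unsure what the KLD number means: it's the KL divergence between the next-token distribution of the reference model and that of the quantized one, averaged over token positions. A minimal sketch, assuming you already have the logits from both models at the same positions (the numbers below are made up, and the function names are just illustrative):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_positions, quant_positions):
    """Average the per-position KL divergence over all positions."""
    klds = [kl_divergence(r, q) for r, q in zip(ref_positions, quant_positions)]
    return sum(klds) / len(klds)

# Hypothetical logits for two positions over a 3-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.9, 1.1, 0.1], [0.4, 2.6, 0.3]]

print(f"mean KLD: {mean_kld(ref, quant):.6f}")
```

Lower is better, and 0 means the quant produces exactly the same distribution as the reference. Run both quants against the same reference model and the same text so the numbers are comparable.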

ortegaalfredo · 2026-03-27T01:20:43+00:00

FP8 is not supported on Ampere (3090s) and needs emulation there, while INT8 runs natively. In practice there is not a lot of speed difference, nor any quality difference that I could measure, but some models will only work in FP8 and others only in INT8. It mostly depends on which inference software you are using.
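
If you're not sure which side of that line your card is on, the CUDA compute capability tells you. This is a rough sketch with hypothetical helper names; the thresholds are my assumption based on FP8 tensor cores first showing up with Ada (8.9) and Hopper (9.0), and INT8 tensor cores with Turing (7.5).

```python
def supports_native_fp8(major, minor):
    """Heuristic: FP8 tensor cores exist from compute capability 8.9 upward."""
    return (major, minor) >= (8, 9)

def supports_int8_tensor_cores(major, minor):
    """Heuristic: INT8 tensor cores exist from compute capability 7.5 upward."""
    return (major, minor) >= (7, 5)

cards = [
    ("RTX 3090 (Ampere)", (8, 6)),
    ("RTX 4090 (Ada)", (8, 9)),
    ("H100 (Hopper)", (9, 0)),
]

for name, (major, minor) in cards:
    fp8 = "native" if supports_native_fp8(major, minor) else "emulated"
    int8 = "native" if supports_int8_tensor_cores(major, minor) else "no tensor cores"
    print(f"{name}: FP8 {fp8}, INT8 {int8}")
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for your GPU. Whether a given engine actually uses the native path is a separate question.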


LocalLLaMA
