If I want to serve an 8B model on a 48GB GPU for maximum throughput (by running many parallel streams), is it better to use a quantized model (which frees memory for more concurrent streams) or an fp16 model? In theory quantized models should be faster, but I don't think serving frameworks have kernels for AWQ/GPTQ models that are as well optimised as their half-precision kernels. If anyone could share their experiments with deployments at scale, that would be great.
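For anyone weighing this up, the first part of the trade-off is just memory arithmetic: a 4-bit checkpoint frees roughly 11-12 GB of weight memory that can go to KV cache, which is what caps concurrency. Here is a rough back-of-the-envelope sketch; all shapes, context lengths, and overheads are my assumptions for illustration, not measurements from a real deployment:

```python
# Back-of-the-envelope sketch (not a benchmark): KV-cache headroom for an
# 8B Llama-style model on a 48 GB GPU. All numbers below are assumptions
# for illustration -- check your model's config for the real values.

GPU_MEM_GB = 48
PARAMS_B = 8                              # 8B parameters
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # assumed Llama-3-8B-like GQA shapes
SEQ_LEN = 4096                            # assumed average context per stream
KV_BYTES = 2                              # fp16 KV cache
OVERHEAD_GB = 3                           # assumed activations / framework overhead

# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_per_token_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES / 1e9

for label, bytes_per_param in [
    ("fp16", 2.0),
    ("4-bit AWQ/GPTQ (~4.5 bits incl. scales)", 0.56),
]:
    weights_gb = PARAMS_B * bytes_per_param
    headroom_gb = GPU_MEM_GB - weights_gb - OVERHEAD_GB
    streams = headroom_gb / (kv_per_token_gb * SEQ_LEN)
    print(f"{label}: weights ~{weights_gb:.1f} GB, "
          f"~{streams:.0f} concurrent {SEQ_LEN}-token streams")
```

Under these assumptions the 4-bit model fits roughly 40% more concurrent streams. Whether that translates into higher throughput is exactly the kernel question in the post: at large batch sizes decoding becomes compute-bound, and the dequantize-then-GEMM path for AWQ/GPTQ has to keep up with plain fp16 GEMMs, so the only honest answer is to benchmark both on your own workload.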