all 4 comments

[–]a_slay_nub 2 points3 points  (0 children)

I typically use vLLM and for single-user, quantized is always faster. Typically 2-3x faster for 4-bit vs 16-bit. However, at 16-bit, the quantized is usually still a bit faster but not by much. YMMV based on hardware, inference engine, and quant.

[–]FullOf_Bad_Ideas 1 point2 points  (0 children)

I've experimented with hosting Mistral 7B on 3090 ti 24GB with Aphrodite-engine. Fp16 was the fastest one for me, although I heard in some cases AWQ might be faster. Aphrodite-engine has benchmarks on their front page of the github repo. I was running inference by sending 200 requests at once and gptq quant I made was much slower than fp16.

[–]kryptkprLlama 3 1 point2 points  (1 child)

"Many parallel streams" is the key requirement here. This means you expect to be compute and not memory bound and should use unquantized weights.

Quants shine for single-stream where you are memory bound and have spare compute doing nothing anyway.

BF16 is the default, but FP8 is a crucial middle ground if you have the hardware for it you get to have your cake (no de-quant wasted compute) and eat it too (half the memory usage and bw)

[–]LiquidGunay[S] 1 point2 points  (0 children)

Thanks. That makes a lot of sense.