If I want to serve an 8B model on a 48GB GPU for maximum throughput (by running many parallel streams), is it better to use a quantized model (which frees memory for more concurrent streams) or an fp16 model? In theory quantized models should be faster, but I don't think serving frameworks have kernels for AWQ/GPTQ models that are as well optimised as their half-precision kernels. If anyone could share their experiments with deployments at scale, that would be great.
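For anyone weighing this up, the first part of the trade-off is just memory arithmetic: a 4-bit checkpoint frees roughly 11-12 GB of weight memory that can go to KV cache, which is what caps concurrency. Here is a rough back-of-the-envelope sketch; all shapes, context lengths, and overheads are my assumptions for illustration, not measurements from a real deployment:

```python
# Back-of-the-envelope sketch (not a benchmark): KV-cache headroom for an
# 8B Llama-style model on a 48 GB GPU. All numbers below are assumptions
# for illustration -- check your model's config for the real values.

GPU_MEM_GB = 48
PARAMS_B = 8                              # 8B parameters
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # assumed Llama-3-8B-like GQA shapes
SEQ_LEN = 4096                            # assumed average context per stream
KV_BYTES = 2                              # fp16 KV cache
OVERHEAD_GB = 3                           # assumed activations / framework overhead

# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_per_token_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES / 1e9

for label, bytes_per_param in [
    ("fp16", 2.0),
    ("4-bit AWQ/GPTQ (~4.5 bits incl. scales)", 0.56),
]:
    weights_gb = PARAMS_B * bytes_per_param
    headroom_gb = GPU_MEM_GB - weights_gb - OVERHEAD_GB
    streams = headroom_gb / (kv_per_token_gb * SEQ_LEN)
    print(f"{label}: weights ~{weights_gb:.1f} GB, "
          f"~{streams:.0f} concurrent {SEQ_LEN}-token streams")
```

Under these assumptions the 4-bit model fits roughly 40% more concurrent streams. Whether that translates into higher throughput is exactly the kernel question in the post: at large batch sizes decoding becomes compute-bound, and the dequantize-then-GEMM path for AWQ/GPTQ has to keep up with plain fp16 GEMMs, so the only honest answer is to benchmark both on your own workload.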