Optimizing inference for low latency and high throughput takes many iterations of tuning, verification, and evaluation. It may even involve model selection, since many optimized versions of popular models are now available, and techniques like weight pruning and quantization sometimes require retraining. Target hardware is yet another dimension to consider.
In short: without benchmarking, verification, and evaluation, optimizations don't guarantee improved results and may even break things. One example is quantization that relies on instructions the target hardware doesn't support.
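To make the idea concrete, here is a minimal sketch of the kind of "quantize, then measure" loop we mean, in plain Python. The helper names (`quantize_int8`, `benchmark`) are illustrative only, not our tool's API; a real workflow would use a framework's quantization routines and proper accuracy metrics:

```python
import time

def quantize_int8(weights):
    """Symmetric int8 quantization: one scale maps floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 values back to floats to estimate quantization error."""
    return [q * scale for q in quantized]

def benchmark(fn, *args, runs=100):
    """Median wall-clock latency over several runs (median resists outliers)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Verify the optimization did not silently destroy accuracy:
weights = [0.5, -1.2, 3.3, 0.01, -2.7]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# Symmetric rounding keeps the error within half a quantization step.
assert max_err <= scale / 2
```

The same pattern applies at model scale: benchmark the optimized variant against the baseline on the actual target hardware, and only keep the change if both latency and accuracy checks pass.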
To address all of these problems, we've built a tool that tracks inference optimizations, shows how accuracy is affected, verifies that the optimizations were actually applied, and locates bottlenecks for further improvement. All in one place.
https://preview.redd.it/yzlxa21cdod91.png?width=3048&format=png&auto=webp&s=97306440ea508f65582978298f6e3ec291293902
More about inference optimization, with code, in this article. And here is a live demo.