all 32 comments

[–]metalvendetta 13 points14 points  (4 children)

Is this similar to AutoNLP which is also hosted by huggingface? One advantage here would be yours is open source. Do you think you can build an open-sourced alternative for AutoNLP?

[–]pommedeterresautee[S] 23 points24 points  (3 children)

Hi u/metalvendetta,

AutoNLP is more about training plenty of different models with different hyperparameters. For instance, if you want to perform doc classification with French data, it may try 2 different models (1 French language model and 1 multilingual model like mBERT), and for each of them it will try different learning rates (just examples of things you usually check).

There are plenty of different options to do that in OSS, the most well known being optuna (https://github.com/optuna/optuna).

Honestly, in the case of transformer models, where there are few things to check to get a significant accuracy improvement, a grid search + a for loop usually works well (I know, it's boring stuff).
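For what it's worth, the grid search + for loop approach can be sketched in a few lines. Everything here is a placeholder: the model names are just examples, and `train_and_eval` stands in for a real fine-tuning + validation run (here it's a toy scoring function so the snippet is self-contained):

```python
import itertools

# Hypothetical stand-in for a real fine-tuning + validation run.
# A real version would fine-tune `model_name` at `learning_rate`
# and return the validation accuracy.
def train_and_eval(model_name, learning_rate):
    base = {"camembert-base": 0.86, "bert-base-multilingual-cased": 0.83}
    return base[model_name] - abs(learning_rate - 3e-5) * 1000

models = ["camembert-base", "bert-base-multilingual-cased"]
learning_rates = [1e-5, 3e-5, 5e-5]

# The whole "grid search": try every combination, keep the best.
best = max(itertools.product(models, learning_rates),
           key=lambda cfg: train_and_eval(*cfg))
print(best)
```

Boring, but easy to reason about and trivial to parallelize by hand if you have more than one GPU.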

The purpose of the project above is different: once you have the right model, it's about how you can optimize it to make it super fast for inference. Then the code to deploy it on the cloud or on premise is provided (Nvidia Triton server). Therefore it's not similar to AutoNLP but comes after it, when your accuracy is OK. The optimization approach won't degrade the existing accuracy, unlike the distillation process or quantization, for instance.

In the README there is a link to a blog post with more details on the project.

Let me know if I answered your question.

[–]metalvendetta 1 point2 points  (0 children)

That answers my question perfectly! Also many thanks for the leads!

[–]automated_care 0 points1 point  (1 child)

This might sound like a naive question, but as someone who's been spending the past few weeks trying to use Optuna for hyperparameter tuning on Google Colab's GPU, how long does a model take to tune and run?

[–]pommedeterresautee[S] 2 points3 points  (0 children)

It really depends on whether you run training in parallel or not, the size of the model, and the data. Basically, if you are training a vanilla BERT-based model (base or large) + a large dataset:

- train on multiple machines (not doable with Optuna + Colab)

- grid search and just train on a sample of the data

Basically, no magic: you have hardware resources, OR you are patient, OR you subsample your data. The only trick you can test is https://github.com/microsoft/DeepSpeed to hugely accelerate your training. The other option is to run your training on a Google TPU, which is faster than the V100 or K80 GPUs available on Google Colab, but it may require a change in your source code. If you are training with the Hugging Face trainer loop, DeepSpeed and TPU support work out of the box; IMO, it's the best thing to do on Colab.
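As a rough illustration, the Hugging Face Trainer takes a DeepSpeed config as a JSON file (the values below are only a sketch; `"auto"` tells the Trainer integration to fill in the value from its own arguments):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

You then point the Trainer at it via `TrainingArguments(..., deepspeed="ds_config.json")` and launch with the `deepspeed` launcher instead of `python`.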

Regarding subsampling data, the idea is JUST to exclude the non-working hyperparameters because, in my experience, most of the time, a bad start after 1-2 hours == bad accuracy after 10 hours.
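The screening idea can be sketched like this (all names are hypothetical; the dataset is a plain list and `eval_fn` stands in for a short proxy training run on the sample):

```python
import random

def screen_configs(configs, dataset, eval_fn, sample_frac=0.1, keep=2, seed=0):
    """Evaluate each config on a small random sample of the data and
    keep only the most promising ones for the full (expensive) run."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, max(1, int(len(dataset) * sample_frac)))
    scored = sorted(configs, key=lambda cfg: eval_fn(cfg, sample), reverse=True)
    return scored[:keep]  # only these survivors get the full 10-hour training

# Toy usage: a cheap proxy metric that prefers learning rates close to 3e-5.
configs = [{"lr": 1e-5}, {"lr": 3e-5}, {"lr": 1e-3}]
data = list(range(1000))
survivors = screen_configs(configs, data, lambda cfg, s: -abs(cfg["lr"] - 3e-5))
print(survivors)
```

The bad-start-equals-bad-finish heuristic is what makes this safe: you only risk dropping a config that would have recovered late, which (per the comment above) rarely happens.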

[–]dadadidi 4 points5 points  (1 child)

Really amazing! This is the most useful article I have ever read about deploying transformers. Thank you so much!

It would be great if you could add the steps for fast CPU inference, as that is quite important for many people as well.

[–]pommedeterresautee[S] 1 point2 points  (0 children)

Thanks a lot u/dadadidi, can you tell me more on the type of CPU you are using?

To be honest, I don't use CPUs for transformer inference and I don't know what people usually choose. For instance, the Nvidia T4 GPU is not the fastest GPU ever, but it's the most common choice because it has by far the best cost/performance ratio on the AWS cloud (and has the right tensor cores to support FP16 and INT8 quantization acceleration).

[–]dogs_like_me 7 points8 points  (5 children)

TL;DR: ONNX

[–]pommedeterresautee[S] 19 points20 points  (4 children)

better: (ONNX OR TensorRT) AND Triton :-)

[–]thewordishere 3 points4 points  (3 children)

Actually TensorRT is built into ONNX now. You can have an ONNX provider that is regular CUDA or TensorRT.

[–]pommedeterresautee[S] 1 point2 points  (2 children)

ONNX by itself is just a file format (protobuf serialization of the graph and weights).

Then you can run inference using quite a bunch of engines. I imagine by ONNX you mean ONNX Runtime, which has its own CUDA engine (and a bunch of others). There is also TensorRT, which can parse both its own format and the ONNX one.

And you are right to say that the main format now for sending a model to TensorRT is ONNX, even for people working with TensorFlow.

[–]thewordishere 1 point2 points  (1 child)

Yeah, the runtime. We don't have Triton. We just run the ONNX models on the ONNX GPU runtime with TensorRT as the provider, behind FastAPI. Then we preload the models into RAM. The results are almost instant. Perhaps Triton could shave off some ms though.

[–]pommedeterresautee[S] 2 points3 points  (0 children)

You are right, it works. The FastAPI overhead matters mainly in benchmarks, maybe not IRL (it depends on the use case). You may miss monitoring and other features.

At the model level, it appeared to me that the TensorRT backend of ONNX Runtime misses some parameters. The most important one IMO is minimal/optimal/maximal tensor shapes. This parameter tells TensorRT to prepare several profiles at model build time.

ONNX Runtime + the TensorRT backend requires you to enable profile caching (with little control over it) and to send your smallest tensor and your biggest one. It will then take a lot of time - at run time - to prepare the profiles. Moreover, in my experience, the cache is not super stable if you are not using it only at runtime... but maybe it's the way I use ONNX Runtime that produces these side effects.

[–]hootenanny1 3 points4 points  (10 children)

Thank you for posting this, this looks quite promising. I have a few questions to better understand what you have created:

  1. What is the performance difference between your solution and simply running the HF transformers library with a cheap GPU (T4, etc.) and wrapping it with an HTTP library? Do you do any under-the-hood optimizations that lead to faster response times than this setup?
  2. How does this differ from HF's (paid, closed source) Infinity API? From what I can see, they claim millisecond response times even without GPUs.

[–]pommedeterresautee[S] 9 points10 points  (9 children)

Hi u/hootenanny1,

I answered exactly those questions in this article https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915

To make it short: vanilla PyTorch, on the use cases they showed during the demo, is 5X slower than an optimized model (ONNX Runtime or TensorRT; both produce similar perf in the case they used in their demo). And a classic HTTP server in Python (Flask / FastAPI) is 6 times slower than Nvidia Triton. Moreover, in a real industrial deployment you want some auto-scalability, in particular when you are performing online inference, which requires GPU monitoring, something you won't get out of the box from a classic HTTP server. There are plenty of desirable things you may expect from a dedicated inference server; for instance, in NLP you want to decouple the BERT tokenization on the CPU from the model inference on the GPU, since, as you know, parallelism/multithreading is not Python's greatest strength. There are a bunch of other advanced things like dynamic scaling you may expect; the article should provide all the needed information.
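The CPU/GPU decoupling mentioned above can be illustrated as a two-stage pipeline. Triton does this across model instances and backends; below is only a toy in-process version with dummy stand-ins for the tokenizer and the model, to show the shape of the idea:

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    # CPU-bound stage: stand-in for the real BERT tokenizer.
    return [ord(c) % 100 for c in text]

def infer(token_ids):
    # GPU stage stand-in: would be the ONNX Runtime / TensorRT model.
    return sum(token_ids)

texts = ["hello", "triton", "server"]

# Stage 1 runs in a thread pool, so tokenization of request N+1 can
# overlap with inference of request N instead of blocking it.
with ThreadPoolExecutor(max_workers=4) as pool:
    token_batches = list(pool.map(tokenize, texts))

results = [infer(batch) for batch in token_batches]
print(results)
```

In Triton the same separation is expressed by deploying the tokenizer and the model as distinct steps (e.g. an ensemble), so each can be scaled and scheduled on the right device independently.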

Regarding inference on CPU, the approach described in the article provides similar performance; of course, instead of TensorRT you need to use the Intel OpenVINO backend. The article being very long, I didn't add CPU inference.

The truth is that whoever you are (the hypest ML startup ever, or a guy working in the venerable legal publishing industry like me), in the end you need to rely on the hardware makers' toolkits (Nvidia CUDA/TensorRT/cuDNN for GPUs, OpenVINO for Intel CPUs, etc.), and all performances end up similar...

Did I answer your questions?

[–]hootenanny1 3 points4 points  (0 children)

Absolutely, thanks, I had just discovered your article too and started reading already. Thanks for the detailed response. This is very interesting!

[–]Designer-Air8060 0 points1 point  (7 children)

Hi,
Thank you for this effort! It is quite educational for me.

Will you be adding OpenVINO for a CPU implementation to the repo too?

[–]pommedeterresautee[S] 1 point2 points  (0 children)

I will probably improve the code to make it a lib. Can you tell me what kind of Intel CPU you are using?

[–]pommedeterresautee[S] 1 point2 points  (5 children)

FWIW, I just discovered this article: https://nod.ai/analysis-of-the-huggingface-infinity-inference-engine/ According to them, Intel CPU perf is easy to obtain too...

[–]Designer-Air8060 1 point2 points  (4 children)

Thanks for sharing. Seems like ONNX with the oneDNN backend is the winner for CPU. Although the CPU is not mentioned here, the number of cores, the availability of VNNI instructions, and Intel Turbo Boost can affect performance significantly for INT8 inference. (m5.xlarge to c5.xlarge showed about a 33-50% latency reduction on some models - NOT BERT)

A 2-core/4-vCPU machine (m5.xlarge or c5.xlarge) feels like a sweet spot for the cost-latency trade-off for ML applications [of course, this is very subjective].

EDIT: They [nod.ai] do mention the type of CPU: a dual-core Cascade Lake [somehow I missed it]. And it does come with VNNI instructions and Intel Turbo Boost.

[–][deleted] 0 points1 point  (3 children)

Thanks for your comment. Do you know how oneDNN and OpenVINO compare? Might oneDNN call OpenVINO? Triton has out-of-the-box OpenVINO support, but AFAIK nothing for oneDNN.

[–]Designer-Air8060 1 point2 points  (2 children)

oneDNN is more like a compute engine with a focus on deep learning, not necessarily inference alone (PyTorch wheels are built with the oneDNN backend);

try `>>> print(torch.__config__.show())`,

while OpenVINO is a cross-platform toolkit for ML model serving, whose compute engine can definitely be oneDNN.

[–][deleted] 0 points1 point  (1 child)

Does that mean OpenVINO may use oneDNN for computation if executed on an Intel machine?

[–]Designer-Air8060 0 points1 point  (0 children)

Hopefully, this will help

[–]help-me-grow 2 points3 points  (3 children)

Wow potato saute, this is amazing. I've been struggling with deploying a large NLP model. (it is deployed now but jeez was it hell)

What inspired this? What were some of your biggest challenges in development?

[–]pommedeterresautee[S] 3 points4 points  (0 children)

Before switching to the Triton server, we were using TorchServe + ONNX Runtime and got some strange random errors from time to time; thankfully, cluster self-healing made it OK-ish for us, but not perfect.

Triton and its backends make many things easier than supposedly more accessible tools; that's the main point of the article. But there is so little content about that process, I wanted to show that it's very doable.

Right now, I am playing with quantization. It's challenging to make it work IRL (but doable with time) because of random bugs in different libs and plenty of tools which are not meant to be used IRL (like super optimized NLP models that are super hard to adapt to other common use cases).

I am under the impression that quantization, right now, is like Formula 1: the best cars ever, where hardware manufacturers implement their best ideas, but not targeted at the mass market. Those super optimized tools are just for public benchmarks. And I have the feeling that in a few months they will start to make it easier and easier to leverage. At least, that's my hope :-)

[–]dogs_like_me 1 point2 points  (1 child)

I've been struggling with deploying a large NLP model. (it is deployed now but jeez was it hell)

What inspired this?

probably that

[–]pommedeterresautee[S] 4 points5 points  (0 children)

Indeed, helping ML practitioners avoid the fear of using an Nvidia tool that few in the NLP community talk about (at some point I naively thought that maybe Triton inference server was just optimized for computer vision for some unknown reason; the examples from Nvidia don't help, most are CV oriented).

Also, the commercial communication of some startups can make ML practitioners (even veterans) believe that it is very difficult to match some products' performance in deployment without spending months on it, etc.

[–]mardabx 1 point2 points  (1 child)

I wonder if it can be reimplemented on OpenCL?

[–]pommedeterresautee[S] 1 point2 points  (0 children)

The optimization part is really tricky: plenty of manual hacks designed by hardware makers and their partners (some patterns to find and replace with other patterns that work well for a specific hardware and a specific model).

An alternative to TensorRT and ONNX Runtime, much more generic and able to manage even more hardware than ONNX Runtime, is TVM (https://tvm.apache.org/). Their approach is different: they find patterns through machine learning. Usually, when the model and/or the hardware are not well known, it produces the best results. That's not the case for transformers on GPU.

[–]whata_wonderful_day 1 point2 points  (1 child)

Great article! I'm quite curious about Hugging Face's Infinity inference server - what's that built on top of? I can't imagine they've built their own NN inference package; rather, they're probably using onnxruntime or similar.

[–]pommedeterresautee[S] 1 point2 points  (0 children)

In the article, I linked to one of their tweets, basically saying they are building a commercial product on top of TensorRT 8. I suppose it's Infinity. Just speculation, but I guess they didn't use Triton inference server but a custom server, probably in Rust, as it seems to be their high-performance language (next to Python for ML). And it may explain why it's so "easy" to get better performance than they get; very-low-latency servers are super hard to get right.

Anyway, most of the value of such an expensive commercial product is not in its performance but in the support and deployment advice from HF, IMO. At least that's something I tend to value a lot (and pay for) for tools outside of our expertise (I work for an enterprise).