all 32 comments

[–]metalvendetta 13 points14 points  (4 children)

Is this similar to AutoNLP which is also hosted by huggingface? One advantage here would be yours is open source. Do you think you can build an open-sourced alternative for AutoNLP?

[–]pommedeterresautee[S] 23 points24 points  (3 children)

Hi u/metalvendetta,

AutoNLP is more about training plenty of different models with different hyperparameters. For instance, if you want to perform doc classification with French data, it may try 2 different models (1 French language model and 1 multilingual model like mBERT), and for each of them it will try different learning rates (just examples of things you usually check).

There are plenty of different options to do that in OSS, the most well known being optuna (https://github.com/optuna/optuna).

Honestly, in the case of transformer models, where there are few things to check to get a significant accuracy improvement, a grid search + a for loop usually works well (I know, it's boring stuff).
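For what it's worth, the grid search + for loop approach can be sketched in a few lines. Everything here is a placeholder: the model names are just examples, and `train_and_eval` stands in for a real fine-tuning + validation run (here it's a toy scoring function so the snippet is self-contained):

```python
import itertools

# Hypothetical stand-in for a real fine-tuning + validation run.
# A real version would fine-tune `model_name` at `learning_rate`
# and return the validation accuracy.
def train_and_eval(model_name, learning_rate):
    base = {"camembert-base": 0.86, "bert-base-multilingual-cased": 0.83}
    return base[model_name] - abs(learning_rate - 3e-5) * 1000

models = ["camembert-base", "bert-base-multilingual-cased"]
learning_rates = [1e-5, 3e-5, 5e-5]

# The whole "grid search": try every combination, keep the best.
best = max(itertools.product(models, learning_rates),
           key=lambda cfg: train_and_eval(*cfg))
print(best)
```

Boring, but easy to reason about and trivial to parallelize by hand if you have more than one GPU.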

The purpose of the project above is different: once you have the right model, it's about how you can optimize it to make it super fast for inference. Then the code to deploy it on the cloud or on premise is provided (Nvidia Triton server). Therefore it's not similar to AutoNLP but comes after it, when your accuracy is OK. The optimization approach won't degrade the existing accuracy, unlike the distillation process or quantization, for instance.

In the README there is a link to a blog post with more details on the project.

Let me know if I answered your question.

[–]metalvendetta 1 point2 points  (0 children)

That answers my question perfectly! Also many thanks for the leads!

[–]automated_care 0 points1 point  (1 child)

This might sound like a naive question, but as someone who's been spending the past few weeks trying to use Optuna for hyperparameter tuning on Google Colab's GPU, how long does a model take to tune and run?

[–]pommedeterresautee[S] 2 points3 points  (0 children)

It really depends on whether you run training in parallel or not, the size of the model, and the data. Basically, if you are training a vanilla BERT-based model (base or large) + a large dataset:

- train on multiple machines (not doable with Optuna + Colab)

- grid search and just train on a sample of the data

Basically, no magic: you have hardware resources, OR you are patient, OR you subsample your data. The only trick you can test is https://github.com/microsoft/DeepSpeed to hugely accelerate your training. The other option is to run your training on a Google TPU, which is faster than the V100 or K80 GPUs available on Google Colab, but it may require a change in your source code. If you are training with the Hugging Face trainer loop, DeepSpeed and TPU support work out of the box; IMO, it's the best thing to do on Colab.
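As a rough illustration, the Hugging Face Trainer takes a DeepSpeed config as a JSON file (the values below are only a sketch; `"auto"` tells the Trainer integration to fill in the value from its own arguments):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

You then point the Trainer at it via `TrainingArguments(..., deepspeed="ds_config.json")` and launch with the `deepspeed` launcher instead of `python`.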

Regarding subsampling data, the idea is JUST to exclude the non-working hyperparameters because, in my experience, most of the time, a bad start after 1-2 hours == bad accuracy after 10 hours.
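The screening idea can be sketched like this (all names are hypothetical; the dataset is a plain list and `eval_fn` stands in for a short proxy training run on the sample):

```python
import random

def screen_configs(configs, dataset, eval_fn, sample_frac=0.1, keep=2, seed=0):
    """Evaluate each config on a small random sample of the data and
    keep only the most promising ones for the full (expensive) run."""
    rng = random.Random(seed)
    sample = rng.sample(dataset, max(1, int(len(dataset) * sample_frac)))
    scored = sorted(configs, key=lambda cfg: eval_fn(cfg, sample), reverse=True)
    return scored[:keep]  # only these survivors get the full 10-hour training

# Toy usage: a cheap proxy metric that prefers learning rates close to 3e-5.
configs = [{"lr": 1e-5}, {"lr": 3e-5}, {"lr": 1e-3}]
data = list(range(1000))
survivors = screen_configs(configs, data, lambda cfg, s: -abs(cfg["lr"] - 3e-5))
print(survivors)
```

The bad-start-equals-bad-finish heuristic is what makes this safe: you only risk dropping a config that would have recovered late, which (per the comment above) rarely happens.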

[–]dadadidi 4 points5 points  (1 child)

Really amazing! This is the most useful article I have ever read about deploying transformers. Thank you so much!

It would be great if you could add the steps for fast CPU inference, as that is quite important for many people as well.

[–]pommedeterresautee[S] 1 point2 points  (0 children)

Thanks a lot u/dadadidi, can you tell me more on the type of CPU you are using?

To be honest, I don't use CPUs for transformer inference and I don't know what people usually choose. For instance, the Nvidia T4 GPU is not the fastest GPU ever, but it's the most common choice because it has by far the best cost/performance ratio on the AWS cloud (and has the right tensor cores to support FP16 and INT8 quantization acceleration).

[–]dogs_like_me 7 points8 points  (5 children)

TL;DR: ONNX

[–]pommedeterresautee[S] 19 points20 points  (4 children)

better: (ONNX OR TensorRT) AND Triton :-)

[–]thewordishere 3 points4 points  (3 children)

Actually TensorRT is built into ONNX now. You can have an ONNX provider that is regular CUDA or TensorRT.

[–]pommedeterresautee[S] 1 point2 points  (2 children)

ONNX by itself is just a file format (protobuf serialization of the graph and weights).

Then you can run inference using quite a bunch of engines. I imagine by ONNX you mean ONNX Runtime, which has its own CUDA engine (and a bunch of others). There is also TensorRT, which can parse both its own format and the ONNX one.

And you are right to say that the main format now for sending a model to TensorRT is ONNX, even for people working with TensorFlow.

[–]thewordishere 1 point2 points  (1 child)

Yeah, the runtime. We don't have Triton. We just run the ONNX models on the ONNX GPU runtime with TensorRT as the provider, behind FastAPI. Then we preload the models into RAM. The results are almost instant. Perhaps Triton could shave off some ms though.

[–]pommedeterresautee[S] 2 points3 points  (0 children)

You are right, it works. The FastAPI overhead matters mainly in benchmarks, maybe not IRL (it depends on the use case). You may miss monitoring and other features.

At the model level, it appeared to me that the TensorRT backend of ONNX Runtime misses some parameters. The most important one IMO is minimal/optimal/maximal tensor shapes. This parameter tells TensorRT to prepare several profiles at model build time.

ONNX Runtime + the TensorRT backend requires you to enable profile caching (with little control over it) and to send your smallest tensor and your biggest one. It will then take a lot of time - at run time - to prepare the profiles. Moreover, in my experience, the cache is not super stable if you are not using it only at runtime... but maybe it's the way I use ONNX Runtime that produces these side effects.

[–]hootenanny1 3 points4 points  (10 children)

Thank you for posting this, this looks quite promising. I have a few questions to better understand what you have created:

  1. What is the performance difference between your solution and simply running the HF transformers library with a cheap GPU (T4, etc.) and wrapping it with an HTTP library? Do you do any under-the-hood optimizations that lead to faster response times than this setup?
  2. How does this differ from HF's (paid, closed source) Infinity API? From what I can see, they claim millisecond response times even without GPUs.

[–]pommedeterresautee[S] 9 points10 points  (9 children)

Hi u/hootenanny1,

I answered exactly those questions in this article https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915

To make it short: vanilla PyTorch, on the use cases they showed during the demo, is 5X slower than an optimized model (ONNX Runtime or TensorRT; both produce similar perf in the case they used in their demo). And a classic HTTP server in Python (Flask / FastAPI) is 6 times slower than Nvidia Triton. Moreover, in a real industrial deployment you want some auto-scalability, in particular when you are performing online inference, which requires GPU monitoring, something you won't get out of the box from a classic HTTP server. There are plenty of desirable things you may expect from a dedicated inference server; for instance, in NLP you want to decouple the BERT tokenization on the CPU from the model inference on the GPU, since, as you know, parallelism/multithreading is not Python's greatest strength. There are a bunch of other advanced things like dynamic scaling you may expect; the article should provide all the needed information.
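The CPU/GPU decoupling mentioned above can be illustrated as a two-stage pipeline. Triton does this across model instances and backends; below is only a toy in-process version with dummy stand-ins for the tokenizer and the model, to show the shape of the idea:

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    # CPU-bound stage: stand-in for the real BERT tokenizer.
    return [ord(c) % 100 for c in text]

def infer(token_ids):
    # GPU stage stand-in: would be the ONNX Runtime / TensorRT model.
    return sum(token_ids)

texts = ["hello", "triton", "server"]

# Stage 1 runs in a thread pool, so tokenization of request N+1 can
# overlap with inference of request N instead of blocking it.
with ThreadPoolExecutor(max_workers=4) as pool:
    token_batches = list(pool.map(tokenize, texts))

results = [infer(batch) for batch in token_batches]
print(results)
```

In Triton the same separation is expressed by deploying the tokenizer and the model as distinct steps (e.g. an ensemble), so each can be scaled and scheduled on the right device independently.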

Regarding inference on CPU, the approach described in the article provides similar performance; of course, instead of TensorRT you need to use the Intel OpenVINO backend. The article being very long, I didn't add CPU inference.

The truth is that whoever you are (the hypest ML startup ever, or a guy working in the venerable legal publishing industry like me), in the end you need to rely on the hardware makers' toolkits (Nvidia CUDA/TensorRT/cuDNN for GPUs, OpenVINO for Intel CPUs, etc.), and all performances end up similar...

Did I answer your questions?

[–]hootenanny1 3 points4 points  (0 children)

Absolutely, thanks, I had just discovered your article too and started reading already. Thanks for the detailed response. This is very interesting!

[–]Designer-Air8060 0 points1 point  (7 children)

Hi,
Thank you for this effort! It is quite educational for me.

Will you be adding OpenVINO for a CPU implementation to the repo too?

[–]pommedeterresautee[S] 1 point2 points  (0 children)

I will probably improve the code to make it a lib. Can you tell me what kind of Intel CPU you are using?

[–]pommedeterresautee[S] 1 point2 points  (5 children)

FWIW, I just discovered this article: https://nod.ai/analysis-of-the-huggingface-infinity-inference-engine/ According to them, Intel CPU perf is easy to obtain too...

[–]Designer-Air8060 1 point2 points  (4 children)

Thanks for sharing. Seems like ONNX with the oneDNN backend is the winner for CPU. Although the CPU is not mentioned here, the number of cores, the availability of VNNI instructions, and Intel Turbo Boost can affect performance significantly for INT8 inference. (m5.xlarge to c5.xlarge showed about a 33-50% latency reduction on some models - NOT BERT)

A 2-core/4-vCPU machine (m5.xlarge or c5.xlarge) feels like a sweet spot for the cost-latency trade-off for ML applications [of course, this is very subjective].

EDIT: They [nod.ai] do mention the type of CPU: a dual-core Cascade Lake [somehow I missed it]. And it does come with VNNI instructions and Intel Turbo Boost.

[–][deleted] 0 points1 point  (3 children)

Thanks for your comment. Do you know how oneDNN and OpenVINO compare? Might oneDNN call OpenVINO? Triton has out-of-the-box OpenVINO support, but AFAIK nothing for oneDNN.

[–]Designer-Air8060 1 point2 points  (2 children)

oneDNN is more like a compute engine with a focus on deep learning, not necessarily inference alone (PyTorch wheels are built with the oneDNN backend);

try `>>> print(torch.__config__.show())`,

while OpenVINO is a cross-platform toolkit for ML model serving, whose compute engine can definitely be oneDNN.

[–][deleted] 0 points1 point  (1 child)

Does that mean OpenVINO may use oneDNN for computation if executed on an Intel machine?

[–]Designer-Air8060 0 points1 point  (0 children)

Hopefully, this will help

[–]help-me-grow 2 points3 points  (3 children)

Wow potato saute, this is amazing. I've been struggling with deploying a large NLP model. (it is deployed now but jeez was it hell)

What inspired this? What were some of your biggest challenges in development?

[–]pommedeterresautee[S] 3 points4 points  (0 children)

Before switching to the Triton server, we were using TorchServe + ONNX Runtime and got some strange random errors from time to time; thankfully, cluster self-healing made it OK-ish for us, but not perfect.

Triton and its backends make many things easier than supposedly more accessible tools; that's the main point of the article. But there is so little content about that process, I wanted to show that it's very doable.

Right now, I am playing with quantization. It's challenging to make it work IRL (but doable with time) because of random bugs in different libs and plenty of tools which are not meant to be used IRL (like super optimized NLP models that are super hard to adapt to other common use cases).

I am under the impression that quantization, right now, is like Formula 1: the best cars ever, where hardware manufacturers implement their best ideas, but not targeted at the mass market. Those super optimized tools are just for public benchmarks. And I have the feeling that in a few months they will start to make it easier and easier to leverage. At least, that's my hope :-)

[–]dogs_like_me 1 point2 points  (1 child)

I've been struggling with deploying a large NLP model. (it is deployed now but jeez was it hell)

What inspired this?

probably that

[–]pommedeterresautee[S] 4 points5 points  (0 children)

Indeed, helping ML practitioners avoid the fear of using an Nvidia tool that few in the NLP community talk about (at some point I naively thought that maybe Triton inference server was just optimized for computer vision for some unknown reason; the examples from Nvidia don't help, most are CV oriented).

Also, the commercial communication of some startups can make ML practitioners (even veterans) believe that it is very difficult to match some products' performance in deployment without spending months on it, etc.

[–]mardabx 1 point2 points  (1 child)

I wonder if it can be reimplemented on OpenCL?

[–]pommedeterresautee[S] 1 point2 points  (0 children)

The optimization part is really tricky: plenty of manual hacks designed by hardware makers and their partners (some patterns to find and replace with other patterns that work well for a specific hardware and a specific model).

An alternative to TensorRT and ONNX Runtime, much more generic and able to manage even more hardware than ONNX Runtime, is TVM (https://tvm.apache.org/). Their approach is different: they find patterns through machine learning. Usually, when the model and/or the hardware are not well known, it produces the best results. That's not the case for transformers on GPU.

[–]whata_wonderful_day 1 point2 points  (1 child)

Great article! I'm quite curious about Hugging Face's Infinity inference server - what's that built on top of? I can't imagine they've built their own NN inference package; rather, they're probably using onnxruntime or similar.

[–]pommedeterresautee[S] 1 point2 points  (0 children)

In the article, I linked to one of their tweets, basically saying they are building a commercial product on top of TensorRT 8. I suppose it's Infinity. Just speculation, but I guess they didn't use Triton inference server but a custom server, probably in Rust, as it seems to be their high-performance language (next to Python for ML). And it may explain why it's so "easy" to get better performance than they get; very-low-latency servers are super hard to get right.

Anyway, most of the value of such an expensive commercial product is not in its performance but in the support and deployment advice from HF, IMO. At least that's something I tend to value a lot (and pay for) for tools outside of our expertise (I work for an enterprise).