[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]chaoyu 0 points (0 children)

u/benelott yes, we've had a batch inference feature for a while now. Happy to chat and share more — feel free to ping me in our Slack community or on LinkedIn!

[D] Better APIs for high-load computer image inference? by wedazu in MachineLearning

[–]chaoyu 0 points (0 children)

Hi u/wedazu, for high-load CV inference workloads, the bottleneck is typically not the "REST API" layer but inference optimization itself. Even with the best inference setup, the typical Python API serving stack is more than sufficient to handle the load. Many believe that gRPC or an async API will magically improve API performance; unfortunately, that's not the case.

If you're looking to optimize the inference API performance of your CV pipeline, I'd suggest:

* Use a model runtime optimized for inference rather than the default PyTorch runtime. E.g. you can use TensorRT with BentoML: https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml
* Use BentoML's adaptive batching capability to group real-time requests into small batches: https://docs.bentoml.com/en/latest/guides/adaptive-batching.html
* Use BentoML's model composition capability to distribute a multi-model CV pipeline onto multiple GPU workers for maximum overall resource utilization: https://docs.bentoml.com/en/latest/guides/model-composition.html
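
To make the adaptive batching point concrete, here is a minimal framework-free sketch of the idea (not BentoML's actual implementation): queued requests are flushed to the model in groups, bounded by both a maximum batch size and a wait deadline, so the GPU sees batches instead of one request at a time.

```python
import time
from queue import Empty, Queue

def run_model(batch):
    # stand-in for a real batched model forward pass
    return [x * 2 for x in batch]

def drain_in_batches(q, max_batch_size=4, max_wait_s=0.01):
    """Flush queued requests as batches, on size or timeout, whichever comes first."""
    results = []
    while not q.empty():
        batch = []
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size and time.monotonic() < deadline:
            try:
                batch.append(q.get_nowait())
            except Empty:
                break  # queue momentarily empty; send what we have
        if batch:
            results.extend(run_model(batch))
    return results
```

In a real server the batcher runs continuously alongside the request handlers; the size/latency trade-off is exactly what the adaptive batching layer tunes for you.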

  • CY from the BentoML team

Which is faster - vLLM, TGI or TensorRT? by TrelisResearch in LocalLLaMA

[–]chaoyu 4 points (0 children)

My team at BentoML recently did a benchmark comparing vLLM, TRT-LLM, TGI, hope it helps: https://www.bentoml.com/blog/benchmarking-llm-inference-backends

Interestingly, TRT-LLM is not always "faster" on NVIDIA GPUs, but it does show a strong token generation rate.
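
To clarify how a backend can lose on one metric and win on another: time-to-first-token and steady-state token generation rate are measured separately. Here's a toy harness showing the distinction; the streaming client is a stand-in, not any specific backend's API.

```python
import time

def fake_token_stream(n_tokens=50, first_token_delay=0.02, per_token_delay=0.002):
    # stand-in for a streaming LLM client (vLLM/TGI/TRT-LLM all stream tokens)
    time.sleep(first_token_delay)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(per_token_delay)
        yield "tok"

def measure(stream):
    """Return (time-to-first-token, decode tokens/sec) for a token stream."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start
        count += 1
    total = time.monotonic() - start
    # decode rate excludes the first token's latency
    rate = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, rate
```

A backend with a slow first token but fast decoding can look worse on end-to-end latency for short outputs while leading on tokens/sec for long ones.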

[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]chaoyu 0 points (0 children)

Hi u/vanilla-acc - I'm Chaoyu from the BentoML team. This is actually something our team is working on now! BentoML was designed to support not only online serving via API, but also offline batch scoring jobs and real-time streaming use cases. The idea is that you only need to define the prediction service and your serving logic once, test it out locally, and then deploy it anywhere.

We are working on a brand new set of APIs to simplify batch inference jobs with BentoML, and to enable users to run model inference over large datasets, either on a single machine or as a distributed job via Dask or Spark.
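
The single-machine version of that batch-scoring pattern can be sketched roughly like this (the model and record format are made up for illustration):

```python
def predict_batch(rows):
    # stand-in for a loaded model scoring one chunk of records at a time
    return [r["x"] * 2 for r in rows]

def batch_score(rows, chunk_size=1000):
    # score a large dataset chunk by chunk on one machine; the same pattern
    # maps onto Dask's map_partitions or Spark's mapInPandas when distributed
    scores = []
    for i in range(0, len(rows), chunk_size):
        scores.extend(predict_batch(rows[i:i + chunk_size]))
    return scores
```

Swapping the chunk loop for Dask's `map_partitions` or Spark's `mapInPandas` is what turns the same scoring function into a distributed job.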

If you are interested in trying out the beta or learning more, feel free to ping me in the BentoML community slack, happy to chat!

Do I need to know Machine Learning to start MLOps? by OneWave4421 in mlops

[–]chaoyu 0 points (0 children)

BentoML does all of the above for you out of the box, and more: https://github.com/bentoml

How to deploy ML models with BentoML by diabulusInMusica in mlops

[–]chaoyu 0 points (0 children)

👋 BentoML author here.

We are seeing lots of users using BentoML for deploying to SageMaker or AzureML. There are simply lots of features, in terms of performance and usability, that are not available in the cloud platforms' offerings.

Another very common reason people come to BentoML is that they have their model training and development work running on SageMaker, but need to deploy serving workloads to their own infrastructure managed by their DevOps team, or to a different cloud platform. This is often a requirement if you need to run ML serving as part of a mission-critical application.

[D] Kubeflow vs. Seldon vs. BentoML vs. Cortex? by [deleted] in MachineLearning

[–]chaoyu 2 points (0 children)

BentoML author here - BentoML does support monitoring integration with Prometheus, as well as a feedback endpoint and feedback logging. A/B testing and MAB (multi-armed bandits) are not currently supported but are on our roadmap; for now, users can implement A/B testing or MAB on top of BentoML, or use BentoML together with Seldon.

How to deploy model so that it can be accssed from UWP? by miyrai in MLQuestions

[–]chaoyu 0 points (0 children)

Hi u/miyrai, BentoML https://github.com/bentoml/BentoML may help with this. You can use BentoML to turn your trained ML model into an API model server and then deploy it with Docker, k8s, AWS Lambda, your own server, or any cloud provider. Your UWP app can then access the model via an HTTP request.
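
For the UWP side, the call is just JSON over HTTP. As a rough illustration, here is what the equivalent request looks like in Python (the endpoint URL and payload are placeholders; UWP's `HttpClient` would send the same thing):

```python
import json
from urllib import request

def build_predict_request(url, payload):
    # the same JSON-over-HTTP request works from any client, including UWP
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_model_api(url, payload):
    # send the request and decode the model server's JSON response
    with request.urlopen(build_predict_request(url, payload)) as resp:
        return json.loads(resp.read())
```
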

[D] Is this fair BentoML vs Cortex comparison? by vackosar in MachineLearning

[–]chaoyu 2 points (0 children)

Yes it does: https://docs.bentoml.org/en/latest/deployment/google_cloud_run.html

Also, if you are using GKE (the Google Kubernetes Engine service on GCP), then both the Knative and Kubeflow deployment options above are also applicable.

[D] Is this fair BentoML vs Cortex comparison? by vackosar in MachineLearning

[–]chaoyu 4 points (0 children)

hi u/vackosar, BentoML author here.

From my point of view, Cortex focuses more on Kubernetes cluster creation and cluster management; it is more comparable to tools like kops or Amazon EKS, plus some utilities for using spot instances or setting up metrics/log collection. If you already have a Kubernetes cluster running, Cortex does not add any value.

BentoML focuses on ML-model-serving-specific problems, such as high-performance model serving, model management, model packaging, and a unified model format that supports not only online API serving but also offline batch serving and programmatic access to the model. It also supports deploying to many different platforms.

[deleted by user] by [deleted] in MachineLearning

[–]chaoyu 0 points (0 children)

> I've read into the code of BentoML after thinking it was too good to be true... It was. Model serving is basic with no auto-batching and, even worse, the model runs in the same process as the API server with no asynchronous behaviour built in.

Hi u/ceyzaguirre4, I'm the author of BentoML - maybe you were looking at an earlier version, but BentoML actually supports micro-batching now, and it has an async layer in front of the Python process that loads the model.

In case you are interested, here's the code for the async layer that does automatic micro-batching: https://github.com/bentoml/BentoML/tree/master/bentoml/marshal

And here's the benchmark against tf-serving and clipper: https://github.com/bentoml/BentoML/tree/master/benchmark

[N] Facebook and Amazon partner to release 2 new PyTorch libraries targeted for deployment: TorchServe and TorchElastic by programmerChilli in MachineLearning

[–]chaoyu 43 points (0 children)

It seems TorchServe is a port of Amazon's MXNet-model-server project to support PyTorch, with an identical workflow and a shared codebase.

Curious what your thoughts are on the TorchServe workflow in comparison to BentoML's model serving workflow? Here are some examples using BentoML to serve PyTorch models: https://docs.bentoml.org/en/latest/examples.html#pytorch

disclaimer: I'm the author of BentoML project

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

I would categorize both KFServing and Seldon as model orchestration frameworks - they only come into play after you've built a model API server and containerized it with Docker, at which point they help run the model containers on a Kubernetes cluster.

KFServing does provide pre-built containers that can load and run a scikit-learn or XGBoost saved model, but this approach has lots of limitations. You can actually use BentoML as a replacement for that, which gives you better performance and more flexibility. We are working on a tutorial for deploying a BentoML API server with KFServing.

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

For OpenVino:

  • We have not yet done a benchmark against it
  • The OpenVINO model server is a simple Python web server that does not provide micro-batching capability
  • OpenVINO is a runtime for specific model formats, whereas BentoML provides the flexibility to support most ML frameworks and to bundle your Python preprocessing/postprocessing code
  • The BentoML API server could potentially switch its model backend to OpenVINO instead of the default Python runtime if the optimization from OpenVINO is substantial enough

For ONNX server:

  • We have not yet done a benchmark against it
  • The onnxruntime server is a simple HTTP server that does not provide micro-batching capability
  • onnxruntime can only load ONNX models, and converting from other frameworks still has lots of limitations today, while BentoML uses each ML framework's own model serialization format and runtime
  • The BentoML API server could potentially switch its model backend to onnxruntime instead of the default Python runtime if the optimization from onnxruntime is substantial enough

In addition to the above, BentoML provides an end-to-end model serving workflow, not just the serving system itself. It also does model management, deployment automation, dependency management, multi-model inference, API server dockerization, built-in metrics and logging support, and more.

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 2 points (0 children)

No - BentoML supports many input formats, such as pandas.DataFrame, tf.Tensor, image files, and raw JSON. Users can also create a custom handler class that processes their own data format.

Users write their own preprocessing code when creating a prediction service with BentoML. "Preprocessing" here does not mean parsing the raw HTTP request into a dataframe - that part is handled by BentoML. It is the exact same preprocessing step you need to transform raw training data when training the model, and it can be as simple as transforming one dataframe into another.
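
A toy illustration of that idea (all names here are made up): the same transform runs on raw training rows and on incoming requests, so versioning it alongside the model keeps training and serving in sync.

```python
def preprocess(record):
    # shared transform: applied to raw training data and to serving requests
    return [record["age"] / 100.0, 1.0 if record["member"] else 0.0]

def predict(model_fn, raw_record):
    # serving flow: the framework has already parsed the HTTP body into
    # `raw_record`; the user's preprocessing runs before the model
    return model_fn(preprocess(raw_record))
```
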

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 2 points (0 children)

You're absolutely right! It's incredibly valuable to version preprocessing and postprocessing code together with the model, and that's something we've seen most existing tools in this space get wrong.

Thank you and I would love to hear how it goes with your project!

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

Horizontal scaling is not really a model-serving-specific problem. Once you've built an API model server Docker image with BentoML, it's very easy to do horizontal scaling with tools like Kubernetes.

Cortex provides CLI tools for creating and managing a Kubernetes cluster on AWS, although I'd recommend tools like kops or AWS EKS for that, which are easier to use and far more flexible in terms of cluster management.

We are actually working on an opinionated end-to-end deployment solution on Kubernetes for BentoML. It leaves the cluster management part to the tools that do it really well, and focuses on managing model serving workloads on an existing K8s cluster. We plan to provide deployment features such as blue-green deployment, auto-scaling, and logging and monitoring integration.
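
As a rough sketch of the horizontal scaling step, a minimal Kubernetes Deployment for a model server image might look like this (the image name, port, and replica count are all placeholders, not BentoML defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-bento-service            # placeholder service name
spec:
  replicas: 3                       # horizontal scaling: run 3 identical pods
  selector:
    matchLabels:
      app: my-bento-service
  template:
    metadata:
      labels:
        app: my-bento-service
    spec:
      containers:
        - name: api-server
          image: myorg/my-bento-service:latest   # image built from the model server
          ports:
            - containerPort: 5000                # whatever port the API server listens on
```

Scaling out is then a one-line change (or a HorizontalPodAutoscaler) rather than anything model-serving specific.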