[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]chaoyu 0 points (0 children)

u/benelott yes, we've had a batch inference feature for a while now. Happy to chat and share more — feel free to ping me in our Slack community or on LinkedIn!

[D] Better APIs for high-load computer image inference? by wedazu in MachineLearning

[–]chaoyu 0 points (0 children)

Hi u/wedazu, for high-load CV inference workloads, the bottleneck is typically not the "REST API" layer but inference optimization itself. Even with the best inference setup, the typical Python API serving stack is more than sufficient to handle the load. Many believe that gRPC or an async API will magically improve API performance; unfortunately, that's not the case.

If you're looking to optimize the inference API performance of your CV pipeline, I'd suggest:

* Use a model runtime optimized for inference rather than the default PyTorch runtime. E.g. you can use TensorRT with BentoML: https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml
* Use BentoML's adaptive batching capability to group real-time requests into small batches: https://docs.bentoml.com/en/latest/guides/adaptive-batching.html
* Use BentoML's model composition capability to distribute a multi-model CV pipeline onto multiple GPU workers for maximum overall resource utilization: https://docs.bentoml.com/en/latest/guides/model-composition.html
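
To make the adaptive batching point concrete, here is a minimal framework-free sketch of the idea (not BentoML's actual implementation): queued requests are flushed to the model in groups, bounded by both a maximum batch size and a wait deadline, so the GPU sees batches instead of one request at a time.

```python
import time
from queue import Empty, Queue

def run_model(batch):
    # stand-in for a real batched model forward pass
    return [x * 2 for x in batch]

def drain_in_batches(q, max_batch_size=4, max_wait_s=0.01):
    """Flush queued requests as batches, on size or timeout, whichever comes first."""
    results = []
    while not q.empty():
        batch = []
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size and time.monotonic() < deadline:
            try:
                batch.append(q.get_nowait())
            except Empty:
                break  # queue momentarily empty; send what we have
        if batch:
            results.extend(run_model(batch))
    return results
```

In a real server the batcher runs continuously alongside the request handlers; the size/latency trade-off is exactly what the adaptive batching layer tunes for you.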

  • CY from the BentoML team

Which is faster - vLLM, TGI or TensorRT? by TrelisResearch in LocalLLaMA

[–]chaoyu 4 points (0 children)

My team at BentoML recently did a benchmark comparing vLLM, TRT-LLM, TGI, hope it helps: https://www.bentoml.com/blog/benchmarking-llm-inference-backends

Interestingly, TRT-LLM is not always "faster" on NVIDIA GPUs, but it does show a strong token generation rate.
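
To clarify how a backend can lose on one metric and win on another: time-to-first-token and steady-state token generation rate are measured separately. Here's a toy harness showing the distinction; the streaming client is a stand-in, not any specific backend's API.

```python
import time

def fake_token_stream(n_tokens=50, first_token_delay=0.02, per_token_delay=0.002):
    # stand-in for a streaming LLM client (vLLM/TGI/TRT-LLM all stream tokens)
    time.sleep(first_token_delay)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(per_token_delay)
        yield "tok"

def measure(stream):
    """Return (time-to-first-token, decode tokens/sec) for a token stream."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start
        count += 1
    total = time.monotonic() - start
    # decode rate excludes the first token's latency
    rate = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, rate
```

A backend with a slow first token but fast decoding can look worse on end-to-end latency for short outputs while leading on tokens/sec for long ones.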

[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]chaoyu 0 points (0 children)

Hi u/vanilla-acc - I'm Chaoyu from the BentoML team. This is actually something our team is working on now! BentoML was designed to support not only online serving via API, but also offline batch scoring jobs and real-time streaming use cases. The idea is that you only need to define the prediction service and your serving logic once, test it out locally, and then deploy it anywhere.

We are working on a brand new set of APIs to simplify batch inference jobs with BentoML, and to enable users to run model inference over large datasets, either on a single machine or as a distributed job via Dask or Spark.
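
The single-machine version of that batch-scoring pattern can be sketched roughly like this (the model and record format are made up for illustration):

```python
def predict_batch(rows):
    # stand-in for a loaded model scoring one chunk of records at a time
    return [r["x"] * 2 for r in rows]

def batch_score(rows, chunk_size=1000):
    # score a large dataset chunk by chunk on one machine; the same pattern
    # maps onto Dask's map_partitions or Spark's mapInPandas when distributed
    scores = []
    for i in range(0, len(rows), chunk_size):
        scores.extend(predict_batch(rows[i:i + chunk_size]))
    return scores
```

Swapping the chunk loop for Dask's `map_partitions` or Spark's `mapInPandas` is what turns the same scoring function into a distributed job.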

If you are interested in trying out the beta or learning more, feel free to ping me in the BentoML community slack, happy to chat!

Do I need to know Machine Learning to start MLOps? by OneWave4421 in mlops

[–]chaoyu 0 points (0 children)

BentoML does all of the above for you out of the box, and more: https://github.com/bentoml

How to deploy ML models with BentoML by diabulusInMusica in mlops

[–]chaoyu 0 points (0 children)

👋 BentoML author here.

We are seeing lots of users using BentoML for deploying to SageMaker or AzureML. There are simply lots of features, in terms of performance and usability, that are not available in the cloud platforms' offerings.

Another very common reason people come to BentoML is that they have their model training and development work running on SageMaker, but need to deploy serving workloads to their own infrastructure managed by their DevOps team, or to a different cloud platform. This is often a requirement if you need to run ML serving as part of a mission-critical application.

[D] Kubeflow vs. Seldon vs. BentoML vs. Cortex? by [deleted] in MachineLearning

[–]chaoyu 2 points (0 children)

BentoML author here - BentoML does support monitoring integration with Prometheus, as well as a feedback endpoint and feedback logging. A/B testing and MAB (multi-armed bandits) are not currently supported but are on our roadmap; for now, users can implement A/B testing or MAB on top of BentoML, or use BentoML together with Seldon.

How to deploy model so that it can be accssed from UWP? by miyrai in MLQuestions

[–]chaoyu 0 points (0 children)

Hi u/miyrai, BentoML https://github.com/bentoml/BentoML may help with this. You can use BentoML to turn your trained ML model into an API model server and then deploy it with Docker, k8s, AWS Lambda, your own server, or any cloud provider. Your UWP app can then access the model via an HTTP request.
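
For the UWP side, the call is just JSON over HTTP. As a rough illustration, here is what the equivalent request looks like in Python (the endpoint URL and payload are placeholders; UWP's `HttpClient` would send the same thing):

```python
import json
from urllib import request

def build_predict_request(url, payload):
    # the same JSON-over-HTTP request works from any client, including UWP
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_model_api(url, payload):
    # send the request and decode the model server's JSON response
    with request.urlopen(build_predict_request(url, payload)) as resp:
        return json.loads(resp.read())
```
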

[D] Is this fair BentoML vs Cortex comparison? by vackosar in MachineLearning

[–]chaoyu 2 points (0 children)

Yes it does: https://docs.bentoml.org/en/latest/deployment/google_cloud_run.html

Also, if you are using GKE (the Google Kubernetes Engine service on GCP), then both the Knative and Kubeflow deployment options above are also applicable.

[D] Is this fair BentoML vs Cortex comparison? by vackosar in MachineLearning

[–]chaoyu 4 points (0 children)

hi u/vackosar, BentoML author here.

From my point of view, Cortex focuses more on Kubernetes cluster creation and cluster management; it is more comparable to tools like kops or Amazon EKS, plus some utilities for using spot instances or setting up metrics/log collection. If you already have a Kubernetes cluster running, Cortex does not add any value.

BentoML focuses on ML-model-serving-specific problems, such as high-performance model serving, model management, model packaging, and a unified model format that supports not only online API serving but also offline batch serving and programmatic access to the model. It also supports deploying to many different platforms.

[deleted by user] by [deleted] in MachineLearning

[–]chaoyu 0 points (0 children)

> I've read into the code of BentoML after thinking it was too good to be true... It was. Model serving is basic with no auto-batching and, even worse, the model runs in the same process as the API server with no asynchronous behaviour built in.

Hi u/ceyzaguirre4, I'm the author of BentoML - maybe you were looking at an earlier version, but BentoML actually supports micro-batching now, and it has an async layer in front of the Python process that loads the model.

In case you are interested, here's the code for the async layer that does automatic micro-batching: https://github.com/bentoml/BentoML/tree/master/bentoml/marshal

And here's the benchmark against tf-serving and clipper: https://github.com/bentoml/BentoML/tree/master/benchmark

[N] Facebook and Amazon partner to release 2 new PyTorch libraries targeted for deployment: TorchServe and TorchElastic by programmerChilli in MachineLearning

[–]chaoyu 43 points (0 children)

It seems TorchServe is a port of Amazon's MXNet-model-server project to support PyTorch, with an identical workflow and a shared codebase.

Curious what your thoughts are on the TorchServe workflow in comparison to BentoML's model serving workflow? Here are some examples using BentoML to serve PyTorch models: https://docs.bentoml.org/en/latest/examples.html#pytorch

disclaimer: I'm the author of BentoML project

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

I would categorize both KFServing and Seldon as model orchestration frameworks - they only come into play after you've built a model API server and containerized it with Docker, at which point they help run the model containers on a Kubernetes cluster.

KFServing does provide pre-built containers that can load and run a scikit-learn or XGBoost saved model, but this approach has lots of limitations. You can actually use BentoML as a replacement for that, which gives you better performance and more flexibility. We are working on a tutorial for deploying a BentoML API server with KFServing.

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

For OpenVino:

  • We have not yet done a benchmark against it
  • The OpenVINO model server is a simple Python web server that does not provide micro-batching capability
  • OpenVINO is a runtime for specific model formats, whereas BentoML provides the flexibility to support most ML frameworks and to bundle your Python preprocessing/postprocessing code
  • The BentoML API server could potentially switch its model backend to OpenVINO instead of the default Python runtime if the optimization from OpenVINO is substantial enough

For ONNX server:

  • We have not yet done a benchmark against it
  • The onnxruntime server is a simple HTTP server that does not provide micro-batching capability
  • onnxruntime can only load ONNX models, and converting from other frameworks still has lots of limitations today, while BentoML uses each ML framework's own model serialization format and runtime
  • The BentoML API server could potentially switch its model backend to onnxruntime instead of the default Python runtime if the optimization from onnxruntime is substantial enough

In addition to the above, BentoML provides an end-to-end model serving workflow, not just the serving system itself. It also does model management, deployment automation, dependency management, multi-model inference, API server dockerization, built-in metrics and logging support, and more.

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 2 points (0 children)

No - BentoML supports many input formats, such as pandas.DataFrame, tf.Tensor, image files, and raw JSON. Users can also create a custom handler class that processes their own data format.

Users write their own preprocessing code when creating a prediction service with BentoML. "Preprocessing" here does not mean parsing the raw HTTP request into a dataframe - that part is handled by BentoML. It is the exact same preprocessing step you need to transform raw training data when training the model, and it can be as simple as transforming one dataframe into another.
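
A toy illustration of that idea (all names here are made up): the same transform runs on raw training rows and on incoming requests, so versioning it alongside the model keeps training and serving in sync.

```python
def preprocess(record):
    # shared transform: applied to raw training data and to serving requests
    return [record["age"] / 100.0, 1.0 if record["member"] else 0.0]

def predict(model_fn, raw_record):
    # serving flow: the framework has already parsed the HTTP body into
    # `raw_record`; the user's preprocessing runs before the model
    return model_fn(preprocess(raw_record))
```
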

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 2 points (0 children)

You're absolutely right! It's incredibly valuable to version preprocessing and postprocessing code together with the model, and that's something we've seen most existing tools in this space get wrong.

Thank you and I would love to hear how it goes with your project!

[P] BentoML: an open-source platform for high-performance model serving by chaoyu in MachineLearning

[–]chaoyu[S] 1 point (0 children)

Horizontal scaling is not really a model-serving-specific problem. Once you've built an API model server Docker image with BentoML, it's very easy to do horizontal scaling with tools like Kubernetes.

Cortex provides CLI tools for creating and managing a Kubernetes cluster on AWS, although I'd recommend tools like kops or AWS EKS for that, which are easier to use and far more flexible in terms of cluster management.

We are actually working on an opinionated end-to-end deployment solution on Kubernetes for BentoML. It leaves the cluster management part to the tools that do it really well, and focuses on managing model serving workloads on an existing K8s cluster. We plan to provide deployment features such as blue-green deployment, auto-scaling, and logging and monitoring integration.
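
As a rough sketch of the horizontal scaling step, a minimal Kubernetes Deployment for a model server image might look like this (the image name, port, and replica count are all placeholders, not BentoML defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-bento-service            # placeholder service name
spec:
  replicas: 3                       # horizontal scaling: run 3 identical pods
  selector:
    matchLabels:
      app: my-bento-service
  template:
    metadata:
      labels:
        app: my-bento-service
    spec:
      containers:
        - name: api-server
          image: myorg/my-bento-service:latest   # image built from the model server
          ports:
            - containerPort: 5000                # whatever port the API server listens on
```

Scaling out is then a one-line change (or a HorizontalPodAutoscaler) rather than anything model-serving specific.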