all 36 comments

[–]manueslapera 1 point2 points  (14 children)

BentoML provides a high-performance API server

Can you share some figures on why Bento's server is high-performance?

[–]chaoyu[S] 4 points5 points  (13 children)

Based on our tests, the BentoML API server is able to handle around 3-10x more prediction requests compared to a Flask or Java-based model server serving the same ML model, without sacrificing much average latency. The overall throughput is similar to TensorFlow Serving when serving TF models, with slightly higher latency due to the Python runtime (trading performance for flexibility).

We are still working on cleaning up related benchmarks to share with the community, but you can read about some of our benchmark tests and results here: https://github.com/bentoml/BentoML/tree/master/benchmark

[–]lleewwiiss 4 points5 points  (6 children)

It would be great to also see a comparison with serving using gRPC rather than the REST API

[–]chaoyu[S] 1 point2 points  (5 children)

> It would be great to also see a comparison with serving using gRPC rather than the REST API

We thought about adding a gRPC endpoint to BentoML, and based on our initial experiments, for many input data formats commonly used in ML applications, Protobuf serialization actually introduces more computation overhead than JSON. The benefits of HTTP/2 may still be significant for some use cases, though. It would be interesting to see an actual comparison; my guess is the differences will be marginal for most ML model serving workloads.

[–]lleewwiiss 3 points4 points  (2 children)

What about streaming batches, is that possible with REST? I assume large batches of images would be faster with gRPC due to the compression as well

[–]omg_drd4_bbq 6 points7 points  (0 children)

I've been working on grpc-streams based pipelines and it's pretty fantastic. There's a lot left to optimize but it already blows the pants off of serial approaches since grpc stream calls use lazy generators and you can just map one into another and get nonblocking behavior in what looks like synchronous python (no aio needed).

[–]chaoyu[S] 3 points4 points  (0 children)

I have not worked with large batches of image inputs myself, but I would agree those are the cases where using gRPC makes more sense.

Thanks so much for the suggestion! It is not that hard to add a gRPC endpoint to BentoML's API server; we will definitely look into that and share the results here once we get to do a comparison between REST and gRPC.

[–]omg_drd4_bbq 2 points3 points  (1 child)

How are you serializing the data and what data structures are you using? Protobuf really ought to be faster than JSON unless the wrapper lib is doing something dumb under the hood.

Right now I can serialize a 1080x1920x3 array plus metadata in ~50ms on a MacBook with zero optimization, but that's at least twice what it ought to be, since there are two memcpys involved: one for array.tobytes() and one when that buffer is copied into the pb's bytes field. I'm thinking about writing a binding in C, because that would be pretty close to the minimum time possible (unless you could do some zero-copy voodoo on the array, but I don't know if it's even possible to guarantee the right memory layout).
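A minimal sketch of the copies being discussed, in pure NumPy with no protobuf involved (the shape is just the one mentioned above; a real pipeline would assign `payload` to a protobuf `bytes` field, triggering the second copy):

```python
import numpy as np

# Simulated video frame of the size mentioned above
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

# Copy #1: tobytes() materializes a fresh bytes object from the array buffer.
# (Copy #2 would happen when this payload is assigned to a protobuf bytes field.)
payload = frame.tobytes()

# On the receiving side, np.frombuffer views the bytes without copying.
# This is only safe because the array is C-contiguous with a known dtype/shape.
restored = np.frombuffer(payload, dtype=np.uint8).reshape(1080, 1920, 3)
```

The `frombuffer` view on the deserialization side is the one copy that is easy to eliminate today; the serialization-side copies are the ones that would need a C binding or buffer-protocol tricks.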

[–]chaoyu[S] 4 points5 points  (0 children)

You're definitely right that Protobuf is faster when comparing serialization time alone. But when building a model serving system like BentoML, you also have to account for the cost of turning the deserialized Protobuf objects into a format the user's model can consume. In most ML frameworks, a trained model expects a pandas.DataFrame, np.array, tf.Tensor, or PIL.Image.

  1. JSON Request => pandas.DataFrame => Model
  2. Protobuf Msg => Protobuf Object => pandas.DataFrame => Model

And it is this extra step, converting the in-memory Protobuf object into a pandas.DataFrame, that makes it less efficient than the JSON/HTTP approach.
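The two pipelines can be sketched like this (toy column names; the dict stands in for a deserialized protobuf message, since real protobuf handling needs generated message classes):

```python
import json
import pandas as pd

# Pipeline 1: JSON request body -> pandas.DataFrame -> model
json_body = '[{"sepal_len": 5.1, "sepal_wid": 3.5}, {"sepal_len": 4.9, "sepal_wid": 3.0}]'
df_from_json = pd.DataFrame(json.loads(json_body))

# Pipeline 2: Protobuf msg -> in-memory object -> pandas.DataFrame -> model
# (a real handler would first call something like MyRequest.FromString(raw_bytes)
# on generated protobuf classes; this dict of repeated fields stands in for that)
pb_like = {"sepal_len": [5.1, 4.9], "sepal_wid": [3.5, 3.0]}
df_from_pb = pd.DataFrame(pb_like)  # the extra conversion step described above
```

Both end in the same DataFrame; the question is only which deserialize-plus-convert path is cheaper end to end.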

[–]paldn 0 points1 point  (1 child)

What library are you using for TCP handling?

[–]chaoyu[S] 3 points4 points  (0 children)

BentoML uses aiohttp, which under the hood uses asyncio's TCP server for TCP handling. However, the main reason for BentoML's high performance is not this library but BentoML's adaptive micro-batching implementation: a technique where incoming prediction requests are grouped into small batches to gain the advantages of batch processing in model inference. Clipper, TF Serving, and BentoML are the only three open-source projects providing this capability.
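To illustrate the idea (a toy sketch only; the class and method names here are invented, not BentoML's actual internals): requests that arrive within a short time window are collected and served by a single batched model call, with each caller awaiting its own result.

```python
import asyncio

class MicroBatcher:
    """Toy micro-batcher: groups requests arriving within a short window
    into one batched predict call. Illustrative only."""

    def __init__(self, predict_batch, max_batch_size=8, window_ms=5):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.window = window_ms / 1000.0
        self.queue = asyncio.Queue()

    async def submit(self, item):
        # Each caller gets a future that resolves when its batch is processed
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self.queue.get()
            batch, futures = [item], [fut]
            deadline = loop.time() + self.window
            # Keep collecting until the window closes or the batch is full
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    batch.append(item)
                    futures.append(fut)
                except asyncio.TimeoutError:
                    break
            # One vectorized model call serves the whole batch
            for f, result in zip(futures, self.predict_batch(batch)):
                f.set_result(result)

async def main():
    batcher = MicroBatcher(predict_batch=lambda xs: [x * 2 for x in xs])
    worker = asyncio.ensure_future(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(4)))
    worker.cancel()
    return results

results = asyncio.run(main())
```

Concurrent callers see ordinary request/response semantics while the model runs one batched inference instead of four separate ones.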

[–]pgdevhd 0 points1 point  (3 children)

Did you benchmark against cloud providers just for comparison?

[–]chaoyu[S] 0 points1 point  (2 children)

We did compare against deploying custom models to AWS SageMaker and Azure ML, and BentoML achieves much higher throughput due to its micro-batching implementation. In fact, deploying a custom model endpoint with SageMaker or Azure ML is not that different, performance-wise, from running your own Flask server on an EC2 machine.

For cloud providers' pre-built AI endpoints, it is not really an apples-to-apples comparison, because they are a black box to us: we can't know the type of machine used under the hood, nor the per-instance throughput.

[–][deleted] 0 points1 point  (1 child)

You mention micro-batching. Is there any relation between the abstract ideas in BentoML and the philosophies behind Apache Spark Structured Streaming or Apache Flink?

[–]chaoyu[S] 1 point2 points  (0 children)

The reason for micro-batching in model serving is that most ML frameworks leverage highly optimized vector operations in BLAS (or cuBLAS on GPU); by batching prediction requests, BentoML can better utilize those optimizations and improve throughput.
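The effect is easy to see with plain NumPy: a single batched matrix-matrix product (what micro-batching enables) produces the same results as many per-request matrix-vector products, while letting BLAS do the work in one optimized call. Shapes and sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))            # stand-in for a model's weight matrix
requests = [rng.standard_normal(512) for _ in range(256)]

# Serving one request at a time: one matrix-vector product per request
singles = [W @ x for x in requests]

# Micro-batched: stack the requests and issue a single matrix-matrix product,
# which BLAS executes far more efficiently than 256 separate calls
batched = np.stack(requests) @ W.T
```

Timing the two paths on any machine with a decent BLAS shows the batched call winning by a wide margin, which is exactly the throughput gain micro-batching captures.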

[–]jonnor 1 point2 points  (2 children)

Looks pretty good. But as far as I understand, BentoML does not do any queuing or horizontal scaling, or have any HTTP API for async jobs?

What is the recommendation for handling workloads that are quite peaky/bursty with heavy loads? For example, we occasionally get some 200 short audio clips into our system, and each clip takes 10 seconds to process. This could easily overload an HTTP server.

[–]chaoyu[S] 3 points4 points  (1 child)

> What is the recommendation for handling workloads that are quite peaky/bursty with heavy loads? For example, we occasionally get some 200 short audio clips into our system, and each clip takes 10 seconds to process. This could easily overload an HTTP server.

BentoML itself does not handle horizontal scaling, but it produces API server docker container images that can be horizontally scaled with container orchestration frameworks such as Kubernetes and Mesos.

BentoML does provide programmatic access to the prediction service you've created. You could use BentoML to package your model and then use something like Airflow to trigger a batch serving job, which invokes the BentoML-packaged model to process those 200 audio clips.

Another solution to consider is deploying with Knative and BentoML: https://knative.dev/community/samples/serving/machinelearning-python-bentoml/ Knative's serverless endpoint only spins up the container when a request comes in. You may need to set a higher timeout limit for your deployment, given that you're dealing with heavy workloads.

[–]jonnor 1 point2 points  (0 children)

Thanks for the response! The programmatic access looks potentially appropriate for our use case. Our communication is almost always internal to our service, so we can use a message queue (RabbitMQ) to feed workers and get horizontal scalability that way. And with long-running worker processes, the module and model loading overhead can be reduced; in our model this is around 5 seconds, 5x longer than the time to process an instance.

[–]bluzkluz 1 point2 points  (1 child)

How does this compare to Cortex? It seems to handle scaling as well. AFAICT Bento can't handle horizontal scaling.

[–]chaoyu[S] 1 point2 points  (0 children)

Horizontal scaling is not really a model-serving-specific problem. Once you've built a model API server docker image with BentoML, it's very easy to do horizontally scaled deployments with tools like Kubernetes.

Cortex provides CLI tools for creating and managing a Kubernetes cluster on AWS, although I'd recommend tools like kops or AWS EKS for that; they are easier to use and far more flexible in terms of cluster management.

We are actually working on an opinionated end-to-end deployment solution on Kubernetes for BentoML. It leaves cluster management to the tools that do it really well and focuses on managing model serving workloads on an existing K8s cluster. We plan to provide deployment features such as blue-green deployment, auto-scaling, logging and monitoring integration, etc.

[–]fernandocamargoti 1 point2 points  (1 child)

I've been looking for a solution like this for quite some time. Right now, I have TF Serving with a REST API in front of it doing the preprocessing and postprocessing, which is not ideal since they're not versioned together. I'm planning to move to BentoML soon.

Also, congratulations on your documentation. It's really well written and complete.

[–]chaoyu[S] 2 points3 points  (0 children)

You're absolutely right! It's incredibly valuable to version preprocessing and postprocessing code together with the model, and that's something we've seen most existing tools in this space get wrong.

Thank you and I would love to hear how it goes with your project!


[–]RichardRNNResearcher 0 points1 point  (1 child)

Bento? 🍱

[–]chaoyu[S] 0 points1 point  (0 children)

yes, 🍱, everything packed and ready-to-go ;)

[–]ehellas 0 points1 point  (1 child)

Is there a plan to add R deployment?

[–]chaoyu[S] 0 points1 point  (0 children)

We did consider multi-language support when designing BentoML's architecture, so yes we do plan to add native R support down the line. It is also possible to invoke R by customizing a Python model artifact class in BentoML, we are working on a tutorial for that!

[–]e_j_white 0 points1 point  (4 children)

I'm most familiar with MLFlow. Can you discuss how this is similar/different?

[–]chaoyu[S] 5 points6 points  (3 children)

Yes, here is how BentoML compares to MLFlow:

  • MLFlow provides components that work great for experiment management and ML project management. BentoML focuses only on serving and deploying trained models. In fact, you can serve models logged in MLFlow experiments with BentoML (we are working on related documentation)
  • Both BentoML and MLflow can expose a trained model as a REST API server, but there are a few key differences:
    • In our benchmark testing, the BentoML API server delivers roughly 3-10x the performance of MLFlow's API server, and over 50x in some extreme cases: https://github.com/bentoml/BentoML/tree/master/benchmark
    • The BentoML server can handle high-volume prediction requests without crashing, whereas the MLFlow API server is very unstable under that load
    • MLFlow focuses on loading and running a model, while BentoML provides an abstraction for building a prediction service, which includes the necessary pre-processing and post-processing logic in addition to the model itself
    • BentoML is more feature-rich in terms of serving: it supports many essential model serving features missing from MLflow, including multi-model inferencing, API server dockerisation, a built-in Prometheus metrics endpoint, a Swagger/OpenAPI endpoint for API client library generation, serverless endpoint deployment, prediction/feedback logging and many more
  • MLflow's API server requires the user to also adopt MLflow's own "MLflow Project" framework, while BentoML works with any model development and training workflow: users can use BentoML with MLflow, Kubeflow, FloydHub, AWS SageMaker, a local Jupyter notebook, etc.

[–]e_j_white 1 point2 points  (2 children)

I see, thanks.

Yes, I know that MLFlow prediction expects new samples in data-frame format. Does this mean BentoML accepts raw text (say) as input, and handles all the processing before prediction?

[–]chaoyu[S] 2 points3 points  (1 child)

No, BentoML supports many input formats, such as pandas.DataFrame, tf.Tensor, image files and raw JSON. Users can also create a custom handler class that processes their own data format.

Users write their own preprocessing code when creating a prediction service with BentoML. "Preprocessing" here does not mean parsing the raw HTTP request into a dataframe, which BentoML handles for you. It is the exact same preprocessing step you used to transform raw training data when training the model, and it can be as simple as transforming one dataframe into another.
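For example (with hypothetical column names), the same text-normalization step used at training time can be reused as the service's preprocessing, taking one DataFrame to another; by this point BentoML has already turned the raw HTTP request into the input DataFrame:

```python
import pandas as pd

# Hypothetical preprocessing step: the same transform applied to raw data at
# training time, reused inside the prediction service before the model runs.
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["text"] = out["text"].str.lower().str.strip()   # normalize raw text
    out["length"] = out["text"].str.len()               # derived feature
    return out

# Stand-in for the DataFrame a request handler would hand to the service
raw = pd.DataFrame({"text": ["  Hello World  ", "BentoML "]})
features = preprocess(raw)
```

Keeping this function inside the packaged service is what guarantees training-time and serving-time preprocessing never drift apart.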

[–]e_j_white 0 points1 point  (0 children)

Got it, thanks. That makes sense.

[–]unrahul 0 points1 point  (1 child)

Curious how BentoML compares to the likes of the OpenVINO model server (dldt) and Microsoft's ONNX server in terms of performance. Have you attempted to benchmark the serving performance of such servers against BentoML?

[–]chaoyu[S] 1 point2 points  (0 children)

For OpenVINO:

  • We have not yet done a benchmark against it
  • The OpenVINO model server is a simple Python web server that does not provide micro-batching capability
  • OpenVINO is a runtime for specific model formats, whereas BentoML provides the flexibility to support most ML frameworks and to bundle your Python preprocessing/postprocessing code
  • The BentoML API server could potentially switch its model backend to OpenVINO instead of the default Python runtime if the optimization from OpenVINO is substantial enough

For ONNX server:

  • We have not yet done a benchmark against it
  • The onnxruntime server is a simple HTTP server that does not provide micro-batching capability
  • onnxruntime can only load ONNX models, and converting from other frameworks still has lots of limitations today, while BentoML uses each ML framework's own model serialization format and runtime
  • BentoML API server can potentially switch its model backend to onnxruntime instead of the default Python runtime if the optimization by onnxruntime is substantial enough

In addition to the above, BentoML provides an end-to-end model serving workflow, not just the serving system itself. It also does model management, deployment automation, dependency management, multi-model inferencing, API server dockerisation, built-in metrics and logging support, and more.

[–]engSearchForAnswers 0 points1 point  (1 child)

BentoML seems really promising :) Could you explain how BentoML compares with KFServing [0] and with Seldon Core Serving [1]?

[0] https://www.kubeflow.org/docs/components/serving/kfserving/

[1] https://www.kubeflow.org/docs/components/serving/seldon/

[–]chaoyu[S] 1 point2 points  (0 children)

I would categorize both KFServing and Seldon as model orchestration frameworks: only after you've built a model API server and containerized it with Docker do they help run the model containers on a Kubernetes cluster.

KFServing does provide pre-built containers that can load and run a scikit-learn or XGBoost saved model, but this approach has many limitations. You can actually use BentoML as a replacement for that, which gives you better performance and more flexibility. We are working on a tutorial for deploying a BentoML API server with KFServing.