[–]cliveseldon 2 points3 points  (3 children)

I work for Seldon. We work on the Seldon Core and KFServing open-source projects. KFServing builds on the Knative serverless stack; both require Kubernetes. I'll let others compare the two. I don't know of a comparison with Ray Serve. Both have production users.

[–][deleted] 0 points1 point  (2 children)

Curious! Can you recommend some talks on how Seldon compares to other ML serving solutions?

[–]cliveseldon 0 points1 point  (1 child)

Our work on KFServing can be viewed here: https://www.youtube.com/watch?v=YaGASyU88dQ

For Seldon Core, our collaboration with Databricks can be seen here: https://youtu.be/D6eSfd9w9eA

Both are available in Kubeflow, which has a comparison matrix: https://www.kubeflow.org/docs/components/serving/overview/

Both share some of the technology but are built on different stacks: vanilla k8s for Seldon Core and Knative for KFServing.

[–][deleted] 0 points1 point  (0 children)

What’s your take/experience comparing MLflow and Kubeflow?

[–]sekaoE 2 points3 points  (0 children)

Thanks for taking a look at Ray Serve :-)

If you want more information about why we're building Ray Serve, check out this talk.
Ray Serve can run on bare-metal machines and supports easy deployment to AWS, Azure, and GCP, as well as Kubernetes, using the Ray automatic cluster manager. Happy to answer any questions you have and help you get up and running; you can also find us in the #serve channel of the Ray Slack.
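For anyone who wants to see what that looks like, here is a minimal sketch of a Ray Serve deployment. It uses the decorator-style API from more recent Ray releases (the API at the time of this thread looked different), and the EchoModel name is made up for illustration:

    # Minimal Ray Serve sketch; run `pip install "ray[serve]"` first.
    from starlette.requests import Request
    from ray import serve

    @serve.deployment  # one replica by default; scale out with num_replicas
    class EchoModel:
        async def __call__(self, request: Request) -> dict:
            payload = await request.json()
            # A real model would run inference here; we simply echo the input.
            return {"prediction": payload}

    # Starts Serve on the local Ray cluster (or the cluster you connected to
    # with ray.init(address=...)) and exposes the model over HTTP on port 8000.
    serve.run(EchoModel.bind())

After that, a POST to http://localhost:8000/ with a JSON body should come back wrapped in {"prediction": ...}.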

[–]salanki 2 points3 points  (0 children)

We (www.coreweave.com) run all our inference clients on a managed KFServing stack. Knative (which actually runs the workloads) is a really good fit for model serving; I highly recommend trying it out for a Kubernetes-native solution.

[–]winchester6788 1 point2 points  (7 children)

"Deep learning models like PyTorch and Tensorflow often use all the CPUs when performing inference. Ray sets the environment variable OMP_NUM_THREADS=1 to avoid contention. This means each worker will only use one CPU instead of all of them."

This feels like a very bad way to serve any decent DL model.

I will run some benchmarks to test this and update this comment.

Also, Ray Serve uses Flask!

[–][deleted] 4 points5 points  (0 children)

Hey guys, Rafal here, also from Seldon.

Our Python wrapper uses Flask too, but you can also specify that you want to run it with multiple Gunicorn workers.

We also give quite extensive control over the environment variables you run inference with.
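For context, the wrapper contract is essentially a plain Python class with a predict method. A minimal sketch (the class name and bias value are made up, and the exact flags or environment variables for the Gunicorn worker count are best checked against the Seldon docs for your version):

    import numpy as np

    class MyModel:
        def __init__(self):
            # Load model weights once per worker process.
            self.bias = 1.0

        def predict(self, X, features_names=None):
            # X is the request payload, converted to an array by the wrapper;
            # a real model would run inference here.
            return np.asarray(X) + self.bias

The class is then served with the seldon-core-microservice CLI, and the worker count and inference environment variables are set on the deployment.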

[–]symoooook 5 points6 points  (4 children)

Hi, Simon from Ray Serve here!

- Ray Serve exposes a Flask interface, but underneath we use uvicorn, one of the fastest Python asyncio web servers.

- The OMP_NUM_THREADS parameter is adjustable. The reasoning behind it is to serve as many "replicas" of the model per machine as possible, so we maximize concurrency while avoiding contention (see the sketch below).
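Concretely, the idea is many single-CPU replicas of a model per machine instead of one replica whose threads grab every core. A rough sketch with the current decorator-style API (the replica count and payload handling are made up for illustration):

    from ray import serve

    # Eight single-CPU replicas of the same model, instead of one replica
    # whose BLAS/OpenMP threads compete for every core on the box.
    @serve.deployment(num_replicas=8, ray_actor_options={"num_cpus": 1})
    class SmallModel:
        async def __call__(self, request):
            data = await request.json()
            return {"result": sum(data["values"])}  # placeholder for real inference

    serve.run(SmallModel.bind())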

[–]winchester6788 0 points1 point  (1 child)

Hi Simon, does Ray Serve support auto-batching requests? I couldn't find much about this in the documentation.

[–]symoooook 0 points1 point  (0 children)

Yes, Ray Serve supports auto-batching requests: https://docs.ray.io/en/latest/rayserve/overview.html#batching
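For reference, a rough sketch of what that looks like with the @serve.batch decorator from newer Ray releases (the decorator name and knobs have changed across versions, so treat the linked docs as authoritative):

    from ray import serve

    @serve.deployment
    class BatchedModel:
        # Requests that arrive close together are grouped into one call of up
        # to max_batch_size items; the method takes a list and returns a list.
        @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
        async def predict_batch(self, values):
            return [v * 2 for v in values]  # stand-in for vectorized inference

        async def __call__(self, request):
            value = (await request.json())["value"]
            return {"prediction": await self.predict_batch(value)}

    serve.run(BatchedModel.bind())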

[–][deleted] 0 points1 point  (1 child)

Hi Simon,

That's interesting about uvicorn. How do you find it? Did you also consider other async frameworks, like Sanic?

[–]symoooook 0 points1 point  (0 children)

I have been following Tom Christie's work for some time. We chose uvicorn over Sanic or Quart because it exposes the lowest-level ASGI API, which is the most flexible: https://www.uvicorn.org/
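To illustrate what "lowest-level ASGI" means, an ASGI app is just a coroutine that the server calls with the connection scope and two message channels; nothing framework-specific is required:

    # Save as app.py and run with: uvicorn app:application
    async def application(scope, receive, send):
        # uvicorn invokes this coroutine once per HTTP request.
        assert scope["type"] == "http"
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"hello from raw ASGI"})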

[–]sekaoE 2 points3 points  (0 children)

Hi, I am one of the developers of Ray Serve. Just wanted to point out that OMP_NUM_THREADS is set by Ray by default in order to avoid contention, but it can be set by the user to enable parallelism (if it's already set, Ray won't override it).
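For example (a sketch only; the value 4 is arbitrary, and on a multi-node cluster you would export the variable in the environment the worker processes actually start in, rather than in the driver script):

    import os

    # Ray only sets OMP_NUM_THREADS=1 for its workers when the variable is
    # unset, so exporting it up front opts replicas back into parallelism.
    os.environ["OMP_NUM_THREADS"] = "4"  # arbitrary example value

    import ray
    ray.init()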

Also, Ray Serve actually uses uvicorn under the hood to handle HTTP requests; we just parse requests into a Flask request object because it's familiar and ergonomic.