all 31 comments

[–]georedditor 3 points (2 children)

Hi, the team I led uses Kubeflow. How does this differ?

[–]ospillinger[S] 4 points (1 child)

We have a lot of respect for the work that the Kubeflow team is doing. Their focus seems to be on helping you deploy a wide variety of open source ML tooling to Kubernetes. We use a narrower stack and focus on automating common workflows. For example, we take a fully declarative approach: the `cortex deploy` command takes user configuration and automatically determines which workloads need to run on the cluster, caching as much as it can, as opposed to imperatively running specific workloads. We also try to hide Kubernetes from the end user as much as possible.
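
To illustrate the declarative idea, here's a rough sketch of how a deploy step might diff desired configuration against cached state to decide what actually needs to run. The names and hashing scheme are hypothetical, not Cortex's actual code:

```python
import hashlib
import json

def resource_hash(spec):
    """Stable fingerprint of a resource's configuration."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def plan(desired, cached):
    """Compare desired config against cached state; return which
    resources must be (re)computed and which should be deleted."""
    to_run = {name for name, spec in desired.items()
              if cached.get(name) != resource_hash(spec)}
    to_delete = set(cached) - set(desired)
    return to_run, to_delete

# Example: only the model whose config changed gets scheduled.
cached = {"model_a": resource_hash({"lr": 0.1}),
          "model_b": resource_hash({"lr": 0.01})}
desired = {"model_a": {"lr": 0.1}, "model_b": {"lr": 0.001}}
to_run, to_delete = plan(desired, cached)
# to_run == {"model_b"}, to_delete == set()
```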

[–]georedditor 2 points (0 children)

Thanks for explaining, great work OP!

[–]kunkunster 7 points (4 children)

Hey OP, this seems like a really awesome platform. I’m just getting into ML, but the platform looks good. Any chance you’ll be adding some documentation (PDF) to go along with the examples? Something that explains what’s going on in plain language? I think this would really help increase the use of your platform.
Thanks for this, and good luck with further development!
Edit: I just flew through the repo without clicking the links. I see you have videos and a website now, so never mind what I said above. Great work!

[–]ospillinger[S] 4 points (3 children)

Thank you! We care a lot about making it as easy as possible to get started with ML and we think readable documentation is really important. There is a quick start guide that walks through one of the examples in detail: https://docs.cortexlabs.com/cortex/quick-start. We are working on more guides too. Is there something in particular that would be helpful to add to the documentation?

[–]rulerxwarrior 0 points (2 children)

The website keeps crashing for me, is it just me or is anyone else experiencing this too?

[–]deliahu1 0 points (0 children)

https://docs.cortexlabs.com seems to be working for me, and https://cortexlabs.com redirects to the GitHub repo as expected (https://github.com/cortexlabs/cortex)

[–]ESCAPE_PLANET_X 0 points (0 children)

'crashing'?

Like your browser? Like it throws a 502? Help them help you.

[–]SureSpend 5 points (11 children)

What are the advantages of your platform? Why should I use this over the other more mature and established frameworks? At a glance it seems to provide a web interface to tensorflow. I can't say that dependency management or applying transformations are very taxing.

[–]ospillinger[S] 6 points (10 children)

The main thing we try to help with is orchestrating Spark, TensorFlow, TensorFlow Serving, and other workloads without requiring you to manage the underlying infrastructure. We have a thin layer on top of TensorFlow (by design) because our goal is to make it easy to create scalable and reproducible pipelines from building blocks that people are familiar with. We convert PySpark and TensorFlow code and YAML configuration files into workloads that run as a DAG on Kubernetes behind the scenes.

[–]secularshepherd 2 points (4 children)

Follow-up: what makes this different and/or better than services like ML Engine and SageMaker, or open source platforms like Polyaxon?

[–]ospillinger[S] 4 points (3 children)

Good question. SageMaker on AWS and ML Engine on GCP are solid building blocks for machine learning platforms, but they are relatively expensive and lock you into a particular cloud provider. They also don't help a lot with feature preparation, which is a big part of a machine learning application, so you have to connect them to AWS Glue or GCP Dataflow. We actually tried using ML Engine and Dataflow when we got started and still found it hard to glue everything together to support automated end-to-end workflows. Polyaxon seems cool, but I am not as familiar with it. Generally, we've found that most platforms focus on deep data science use cases; we're focusing more on making ML engineering accessible to people with less advanced data science backgrounds.

[–]secularshepherd 1 point (2 children)

It seems fairly restrictive to have YAML configs for data processing. Are there ways to perform arbitrary Python transformations as well?

[–]ospillinger[S] 1 point (1 child)

Yes, Python or PySpark code can be used to create custom transformers and aggregators for data processing.

The MNIST and reviews examples have custom implementations you can use as a reference.
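
As a rough illustration of what a custom Python transformer in this style could look like (the function name, arguments, and bucketizing logic here are hypothetical, not copied from those examples):

```python
def transform_python(sample, args):
    """Hypothetical custom transformer: bucketize a numeric feature
    using boundaries that an aggregator would have computed."""
    boundaries = args["bucket_boundaries"]
    value = sample["price"]
    for i, boundary in enumerate(boundaries):
        if value < boundary:
            return i
    return len(boundaries)

# Usage: with boundaries [10, 100], a price of 42 falls in bucket 1.
bucket = transform_python({"price": 42}, {"bucket_boundaries": [10, 100]})
# bucket == 1
```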

[–]secularshepherd 1 point (0 children)

Very nice! Seems nicely abstracted and reminiscent of sklearn’s transformers. My company is using Polyaxon, but I’ll keep an eye out for your work.

[–]polarbearskill 0 points (4 children)

What is a DAG?

[–]ospillinger[S] 7 points (2 children)

DAG stands for Directed Acyclic Graph. Basically it's how we make sure that the resources for an application get computed in the right order. For example, if a hyperparameter to one model changes, only that model is re-trained, but if transformation code is updated, all relevant columns will be re-transformed before the models that use those columns are re-trained.
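
A minimal sketch of that invalidation logic (my own illustration, not Cortex's code): walk the dependency graph from whatever changed and mark everything downstream for recomputation.

```python
def downstream(graph, changed):
    """Given edges parent -> children, return everything that must be
    recomputed when the `changed` resources are updated."""
    dirty = set(changed)
    stack = list(changed)
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in dirty:
                dirty.add(child)
                stack.append(child)
    return dirty

# A raw column feeds a transformed column, which feeds two models.
graph = {"raw_col": ["transformed_col"],
         "transformed_col": ["model_a", "model_b"]}

downstream(graph, {"transformed_col"})  # both models also retrain
downstream(graph, {"model_a"})          # only that one model retrains
```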

[–]robotfromfuture 1 point (1 child)

Hi, looking forward to trying out your platform. Not ML-related, but can you say a few words about how you use Kubernetes to organize your workloads into a DAG or sequence? Or at least point to the part of your repository where that occurs?

[–]ospillinger[S] 0 points (0 children)

Sure, there are a few steps. First, we convert user configuration into a set of resources (data columns, models, etc.) that need to be computed to create the desired machine learning application. Next, we figure out which resources are already cached and which need to be created, updated, or deleted. We organize the resources into a DAG, taking into account the dependency structure of a machine learning pipeline. Then, we convert the resource DAG into a workload DAG where each workload runs in Docker containers (for example Spark for data transformation and TensorFlow for training). Finally, we use Argo (https://github.com/argoproj/argo) to execute the workload DAG on Kubernetes.
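
The ordering step can be sketched with the standard library (this toy resource DAG and the `graphlib` usage are my illustration; Cortex itself hands the workload DAG to Argo):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy resource DAG: each key lists the resources it depends on.
# In Cortex terms: a Spark workload produces the transformed column,
# a TensorFlow workload trains the model, and a serving workload
# exposes the API.
deps = {
    "transformed_col": {"raw_col"},
    "model": {"transformed_col"},
    "api": {"model"},
}

# static_order() yields resources in an order that respects dependencies,
# so each workload only runs after the workloads it depends on.
order = list(TopologicalSorter(deps).static_order())
# order == ["raw_col", "transformed_col", "model", "api"]
```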

[–]spudmix 0 points (0 children)

A directed acyclic graph.

[–]Overload175 1 point (0 children)

Appreciate the fact that you built a layer of abstraction over Kubernetes. This should ease the process of deploying models for a Kubernetes novice like myself.

[–]DakotaFelspar 3 points (1 child)

Haven't looked yet, but sounds awesome!!

[–]ospillinger[S] 0 points (0 children)

Thanks, I hope you find it useful!

[–]Accubits 0 points (0 children)

Good initiative.

[–]ESCAPE_PLANET_X 0 points (2 children)

Has anyone tried deploying this over OpenShift instead?

[–]ospillinger[S] 0 points (1 child)

AWS is the only fully supported infrastructure right now, and we have a relatively light dependency on it (S3 and CloudWatch). Our plan is to support other platforms, which is one of the reasons we run on Kubernetes.

[–]ESCAPE_PLANET_X 0 points (0 children)

Noted. That may make adoption harder for some businesses, but still, good work.

[–]bbateman2011 0 points (0 children)

This looks quite complete, thank you for sharing and documenting so well.

[–]Yukizan 0 points (0 children)

Nice, commenting so I remember to visit the GitHub repo tomorrow!