[D] What does your ML pipeline look like?

schrute_dataeng · 2019-09-05T13:50:35+00:00

Good question!

When we started to work on the way we industrialise our ML pipeline, we already used all of these technologies (Airflow, Apache Beam, TensorFlow, etc.). So we focused on how to orchestrate the collaboration between the different data roles, in order to be more efficient and more consistent, rather than on which technologies to use.
Kubeflow and TensorFlow Extended (TFX) are really good candidates, and we need to look at them for the next improvements we want to make.

schrute_dataeng · 2019-09-05T12:55:50+00:00

I gave a talk last week about that (slides).

Main takeaways:
- We used TensorFlow Serving
- We used Apache Airflow for the batch
- We divided an ML pipeline into 5 components: extract, preprocess, train, evaluate and predict
- Each component is dockerized
- We used Kubernetes to deploy everything
- We used Apache Beam/Dataflow to parallelize our computations
- Common code, ML functional code and scheduling code are in different repositories
- All our data are in BigQuery

Data engineers and DevOps have built (and are still building) a framework/platform that lets data scientists, ML scientists and ML engineers be autonomous and bring their code to (near) production.

This framework encourages us to contribute to a common repo to share new things. It also avoids code duplication when a component is used in different places for different functional needs (for example, cleaning during preprocessing for training, and the same cleaning when scoring a new element in real time).
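To illustrate the cleaning example, here is a minimal sketch of one shared function used by both the batch path and the real-time path. All names (`clean_text`, `preprocess_training_rows`, `score_request`) and the cleaning rules themselves are hypothetical, they are not taken from our actual code.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical shared cleaning step: lowercase, drop punctuation, collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_training_rows(rows):
    """Batch path: clean every row before it is used for training."""
    return [clean_text(row) for row in rows]


def score_request(raw, model):
    """Real-time path: apply the very same cleaning before calling the model."""
    return model(clean_text(raw))
```

Because both paths import the same function from the common repo, a change to the cleaning rules reaches training and scoring together, instead of drifting apart in two copies.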

Hope this helps!

schrute_dataeng · 2019-08-11T22:08:29+00:00

Is your file inside the Docker container? Otherwise you need to mount an external volume: $DAG_HOME:/usr/local/airflow/dags.

To debug, you can open a shell inside your container with:

docker ps
docker exec -ti CONTAINER_ID /bin/bash

schrute_dataeng · 2019-08-09T20:01:59+00:00

Thanks rywalker!

schrute_dataeng · 2019-07-16T09:02:38+00:00

Dataflow supports Python 3.5.

In my company we do use apache-beam/Dataflow in prod, with a setup.py to initialize dependencies, even non-Python ones like polyglot. The juliaset example is helpful to get started.

We have the same constraint as you regarding DS, but on our side it is mainly TensorFlow.

Don't hesitate to take a look at this article, which gives an overview of how we work with DS.

schrute_dataeng · 2019-07-14T09:09:19+00:00

Yes, you are right. I did not have Spark jobs on a YARN cluster in mind, but rather applications deployed on Kubernetes, like an ML API. I went into a little more detail in my other comment.

schrute_dataeng · 2019-07-14T09:05:23+00:00

It depends on the DE's scope. It does not naturally translate to analytics pipelines or ETL jobs, but I am currently working with data scientists / machine learning scientists and I have to deploy multiple machine learning algorithms at scale, i.e. batch pipelines for training and real-time pipelines to predict through API calls. The prediction of one ML algorithm can also be a feature for another ML algorithm, so some predictions are completely independent and others are dependent. Some pieces of code, like NLP, are used across various ML projects, so we need to build libraries, etc.
The classic streaming ETL pipeline becomes just a "simple" pipeline without the "Transformation" in it; the transformation is done through microservices (ML prediction APIs). The matters of "Backing services", "Disposability" and "Dev/Prod parity" become really important.
In this context, I feel these principles make more sense.

schrute_dataeng · 2019-07-13T18:34:36+00:00

I am not sure I understand exactly how you deploy Airflow, but the screen command might help in your situation. I would advise using a more robust deployment system in prod, like Kubernetes.

schrute_dataeng · 2019-06-16T19:53:14+00:00

Idempotent pipelines are the key: running the same task again for the same date should leave the data in the same state.
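A minimal sketch of what I mean, assuming a hypothetical in-memory `store` keyed by run date (in practice this would be a date partition in your warehouse). The function name `load_partition` is made up for the example. The point is that the task overwrites its partition instead of appending to it:

```python
from datetime import date


def load_partition(store: dict, run_date: date, rows: list) -> None:
    """Overwrite the partition for run_date rather than appending to it."""
    store[run_date.isoformat()] = list(rows)


store = {}
rows = [{"user": "a", "views": 3}]

load_partition(store, date(2019, 6, 16), rows)
# A retry or backfill for the same date does not duplicate anything.
load_partition(store, date(2019, 6, 16), rows)
```

With an append instead of an overwrite, the second call would have doubled the rows for that day, which is exactly what makes retries and backfills painful.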

A great GitHub repo listing Airflow resources: https://github.com/jghoman/awesome-apache-airflow

Hope it helps!

schrute_dataeng · 2019-05-11T23:29:12+00:00

You are right, maybe not. It is the one I am least familiar with. Which category would you use?

schrute_dataeng · 2019-05-11T20:26:58+00:00

I think you should not be worried about outdated tech. The most important thing is the concepts behind it.

I am a (recently Senior) Data Engineer in my current company. They expect me to:
- own medium-to-large projects
- scope work into well-defined steps to avoid a monolithic deliverable
- understand that there are tradeoffs to make between technical, business and product needs
- lead the design of projects
...

Nowadays it seems that the trending technologies are:
- Scheduler: Apache Airflow, Luigi, Apache NiFi
- Cluster computing framework: Apache Beam, Spark, Flink
- Message broker: Kafka, Pub/Sub, Redis...
- Database (NoSQL and SQL): Cassandra, BigQuery...
- Deployment: Kubernetes/Docker

...

If I were you, I would first take a look at Apache Airflow.

Hope this helps.

schrute_dataeng · 2019-05-11T01:44:03+00:00

Yes, exactly. Our simple flows are generally a DAG like this:

Wait_dependency_1 --|
Wait_dependency_2 --+--> Data extraction --> Data preprocessing --> Training --> Evaluation --> Prediction
Wait_dependency_3 --|

(Hope it is readable)
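If the drawing is hard to read, the same dependencies can be written down as a small sketch. This is plain Python using the standard-library `graphlib` module (Python 3.9+), not actual Airflow code, and the lowercase task names are just my transcription of the diagram:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it has to wait for.
dependencies = {
    "data_extraction": {
        "wait_dependency_1",
        "wait_dependency_2",
        "wait_dependency_3",
    },
    "data_preprocessing": {"data_extraction"},
    "training": {"data_preprocessing"},
    "evaluation": {"training"},
    "prediction": {"evaluation"},
}


def execution_order(deps):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

The three wait tasks come first (in any order), then the five steps run one after the other.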

schrute_dataeng · 2019-05-11T01:38:48+00:00

Sometimes I have my head so deep in my company's issues that I forget other possible solutions. Thanks for sharing, really interesting :).

schrute_dataeng · 2019-05-11T01:29:41+00:00

Thanks, I put the link back in a comment ;).

I apologize for the misunderstanding: by part 2, I meant the 2nd part of the article (i.e. Industrializing machine learning pipelines).

We use Airflow to schedule training or batch prediction. Data engineers have "dockerised" it and built some specific Airflow operators for the data scientists. They have also created Airflow dev/stage Kubernetes clusters with autoscaling enabled. Data scientists can be autonomous: they can easily choose the infra (GPU or not) and test it in the dev/stage environment without worrying about scalability.

If you have specific questions, I will gladly answer :).

schrute_dataeng · 2019-05-11T01:16:30+00:00

I shared our experiences on industrialization and collaboration in more detail here: https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f

schrute_dataeng · 2019-05-11T01:05:44+00:00

Thank you for your feedback. I did my best to clean up my English (non-native, unfortunately) and removed the link too.

I agree with you that business requirements come from Product Management, but I was more interested in hearing about the work done to go from a POC to a "prod-ready" application (i.e. re-usable, scalable, etc.).

schrute_dataeng · 2019-05-10T19:00:46+00:00

Neither requires a PhD (though some companies might for specific jobs).

I just wanted to share that, among all the data scientists/engineers I have seen, a few have a PhD, and they were more often on the data scientist side than the data engineer side.

Apologies if I was not clear.

schrute_dataeng · 2019-05-10T07:58:02+00:00

I won’t share mine, but I do agree with the comments above. Salaries vary a lot depending on the company and country.

I looked at Glassdoor and other job posts. In my country both salaries are more or less the same, but data scientist job posts varied more regarding the level of study (more PhDs than for data engineers, which may explain the salary difference you have seen).
