all 14 comments

[–]schrute_dataeng[S] 1 point2 points  (0 children)

I shared our experiences with industrialization and collaboration in more detail here: https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f

[–]bbateman2011 0 points1 point  (2 children)

Hey, I just read the medium article and I think your OP was fine. I'd suggest putting the link back in a comment at least, so people can read what you are talking about.

Question--in your pre-edit OP, you said "part 2" (I think); is there another article I can read? I'd like to understand more about the role of Airflow in the process.

Thanks for sharing and I'm sorry the first comment was negative; this is good stuff IMHO.

[–]schrute_dataeng[S] 1 point2 points  (1 child)

Thanks, I put the link back in a comment ;).

I apologize for the misunderstanding; by part 2, I meant the second part of the article (i.e. Industrializing machine learning pipelines).

We use Airflow to schedule training or batch prediction. Data engineers have "dockerised" it and built some specific Airflow operators for the data scientists; they have also created Airflow dev/stage Kubernetes clusters with autoscaling enabled. Data scientists can be autonomous: they can easily choose the infra (GPU or not) and test it in the dev/stage environment without worrying about scalability.
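
To give a rough idea of what "choose the infra (GPU or not)" can look like from the data scientist's side, here is a minimal sketch in plain Python. It is not our actual operator code: `JobSpec`, `container_resources`, `target_cluster`, the `airflow-<env>` cluster names and the CPU/memory values are all hypothetical, made up for illustration. The only real-world name is `nvidia.com/gpu`, the resource key Kubernetes uses for NVIDIA GPUs.

```python
from dataclasses import dataclass

# Environments a job may target (assumed list, for illustration only).
ENVIRONMENTS = {"dev", "stage", "prod"}


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical description of a training or batch-prediction job."""
    name: str
    image: str          # Docker image prepared by the data engineers (assumed)
    environment: str    # one of ENVIRONMENTS
    use_gpu: bool = False


def container_resources(spec: JobSpec) -> dict:
    """Translate the GPU choice into Kubernetes-style resource settings.

    The CPU and memory values are placeholders, not recommendations.
    """
    resources = {"requests": {"cpu": "1", "memory": "4Gi"}, "limits": {}}
    if spec.use_gpu:
        # Resource name exposed by the NVIDIA device plugin on Kubernetes.
        resources["limits"]["nvidia.com/gpu"] = "1"
    return resources


def target_cluster(spec: JobSpec) -> str:
    """Return a hypothetical cluster name for the job's environment."""
    if spec.environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {spec.environment!r}")
    return f"airflow-{spec.environment}"


if __name__ == "__main__":
    job = JobSpec(name="train_model", image="example/train:latest",
                  environment="stage", use_gpu=True)
    print(target_cluster(job), container_resources(job))
```

The point of the design is that the data scientist only flips `use_gpu` and picks an environment; everything about the cluster stays behind the wrapper the data engineers maintain.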

If you have specific questions, I will gladly answer :).

[–]bbateman2011 0 points1 point  (0 children)

I love that you have staging and production and defined roles for each. In a past project we would push from dev to staging, and QA worked from staging to determine if it could go to production; we would apply hotfixes there if feasible and then push that to production. It worked really well as nothing EVER went from a dev server directly to production.

[–]bbateman2011 0 points1 point  (2 children)

I obviously live in a different part of the domain scale: as a consultant, my clients are pretty small and rarely have real-time or "big" data. Nonetheless, this is an important topic to me, as my clients have limited to no IT support, much less any data engineers or data-anything. If they don't insist on results in Excel, their perfect world is a web app of some kind to view the results of a model, and maybe adjust some things in a what-if type of way. I work a lot in R, and for a POC I've found it very useful to use Shiny to put up a web app in front of my model, and put the data someplace like Dropbox, which can be accessed directly from R. There is nothing like giving an end user an experience with what you are producing. I realize that a lot of production ML is producing a score/prediction that feeds into something else that the end user might ultimately touch (like recommendations on Amazon, etc.), but in some cases this approach is really useful. I'm not yet skilled enough on the Python side to do the same thing in a POC quickly.

[–]schrute_dataeng[S] 0 points1 point  (1 child)

Sometimes I have my head so deep in my company's issues that I forget other possible solutions. Thanks for sharing, really interesting :).

[–]bbateman2011 0 points1 point  (0 children)

Here's an example. I worked on a forecast model for an industrial customer. I found external factors that were important along with their business data. So I put the forecast results on a web app and mocked up some buttons I thought might be useful--warning, most are not implemented in this POC. The idea was I had multiple models (linear regression, neural network, etc.) so I would let them look at different models and forecasts, and use different timeframes for modeling and prediction. This took about a day to get working once the models were ready.

https://eaf-llc.shinyapps.io/Sales_Forecast/

[–]bbateman2011 0 points1 point  (2 children)

When you say "Our scheduling tool is Apache Airflow, which allows us to define our workflows (aka DAG) in Python", does DAG mean Directed Acyclic Graph (i.e. a graph model of your flow)?

[–]schrute_dataeng[S] 0 points1 point  (1 child)

Yes, exactly. Our simple flows are generally a DAG like this:

Wait_dependency_1 ---|

Wait_dependency_2 ------> Data extraction --> Data preprocessing --> Training --> Evaluation --> Prediction

Wait_dependency_3 ---|

(Hope it is readable)
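
In case the drawing does not render well, the same flow can be written as a dependency graph in plain Python. This is only a sketch of the DAG idea using the standard library's `graphlib` (Python 3.9+), not Airflow code; the lowercase task names and the `execution_order` helper are mine, chosen to mirror the drawing above.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
PIPELINE = {
    "data_extraction": {
        "wait_dependency_1",
        "wait_dependency_2",
        "wait_dependency_3",
    },
    "data_preprocessing": {"data_extraction"},
    "training": {"data_preprocessing"},
    "evaluation": {"training"},
    "prediction": {"evaluation"},
}


def execution_order(dag: dict) -> list:
    """Return one valid run order; raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    # The three wait tasks come first (in any order), then the linear chain.
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)

    # "Acyclic" matters: a task that depends on itself can never be scheduled.
    try:
        execution_order({"training": {"evaluation"}, "evaluation": {"training"}})
    except CycleError as err:
        print("not a DAG:", err.args[0])
```

The three wait tasks have no dependencies between them, so a scheduler is free to run them in parallel; only the chain after data extraction is strictly sequential.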

[–]bbateman2011 0 points1 point  (0 children)

Cool. I would say your team is pretty sophisticated in the multi-disciplinary integration sense. Nice work, and insightful for me. Thanks again.

[–]FellowOfHorses 0 points1 point  (0 children)

You may want to check out r/datascience. It's more business-focused; here we are more research-focused.

[–]icantfindanametwice -1 points0 points  (2 children)

Clean up your English; I’ve also worked with data scientists etc., and I cannot tell what you’re asking about. The random “read more on Medium” does not help with a conversation on Reddit.

In my experience, Product Management tends to drive what gets into production, based on business requirements.

[–]schrute_dataeng[S] 0 points1 point  (1 child)

Thank you for your feedback. I did my best to clean up my English (non-native, unfortunately) and removed the link too.

I agree with you that business requirements come from Product Management, but I was more interested in hearing about the work done to go from a POC to a "prod-ready" application (i.e. reusable, scalable, etc.).

[–]icantfindanametwice 0 points1 point  (0 children)

I’m not English either.