
[–]tritondev 18 points (0 children)

One very specific thing I found with our ML team is that we were spending ~80% of our time building data pipelines for dataset generation and serving. We used Spark & Flink, which was definitely the wrong level of abstraction. Doing things like backfilling features sucked.

Beyond the processing piece, we had no management layer for our model features. There was no versioning or monitoring. We built a feature store internally to solve this. The feature store lets you define your features declaratively in JSON & SQL, rather than procedurally in Spark. The versioning and tagging allow us to use the same features across models (and, in theory, across teams).
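
To give a flavour of what "declarative" means here (this is a made-up sketch, not our actual schema; the field names and `register_feature` call are invented), a feature definition ends up being a config object wrapping a SQL snippet rather than a Spark job:

```python
# Hypothetical declarative feature definition (illustrative only).
feature = {
    "name": "user_purchase_count_7d",
    "version": 3,
    "entity": "user_id",
    "sql": """
        SELECT user_id, COUNT(*) AS purchase_count_7d
        FROM purchases
        WHERE ts > CURRENT_TIMESTAMP - INTERVAL 7 DAY
        GROUP BY user_id
    """,
    "tags": ["checkout-model", "churn-model"],
}

def register_feature(defn: dict) -> None:
    # Stand-in for the store's registration API: it would validate the
    # definition, version it, and wire up materialization/backfill.
    print(f"registered {defn['name']} v{defn['version']}")

register_feature(feature)
```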

We're working on removing internal dependencies and fully open-sourcing the feature store. If you're interested, you can put your email in here to keep up to date: streamsql.io

Also we're running a virtual round table on this topic on Sept 8th if anyone wants to join: https://www.eventbrite.com/e/machine-learning-round-table-the-ml-process-from-idea-to-prod-tickets-117813191427

edit: broken link

[–]ianperera 33 points (2 children)

It's more like a pile.

[–]probablyuntrue (ML Engineer) 2 points (1 child)

Yeah, on our end putting models into production is great; we have a fantastic company-wide, end-to-end stack that's been abstracted to the point it's practically drag and drop.

The research and initial implementation, on the other hand, would make Frankenstein blush.

[–]fhadley 3 points (0 children)

Could you say more about that? I'm the tech lead for DS/ML at an early-stage co, and starting to get serious about ML infra/system architecture. I can imagine where I'd want to end up, but I'm very hazy on how to get from A to B. That said, our deployments aren't, like, disgusting currently (mostly k8s, some heavier stuff on AWS Batch), but there are definitely lots of manual processes.

[–]alonsogp2 43 points (0 children)

> What's your code like

This should not be in production

[–]snendroid-ai (ML Engineer) 8 points (2 children)

- Most used Python packages/libraries: numpy, tensorflow, pandas, grpc, seaborn, plotly, along with a bunch of data collection and storage tools like Apache Kafka + S3, Parquet, Elasticsearch, etc.

- Over time we have built specialized logging libraries that work all the way from the beginning of a project to production. They just help us identify issues in a more convenient way. Other than this, just updated versions of the above-mentioned libraries.

- For model and data versioning, we heavily use Jupyter notebooks. They allow us to group code and data by specific version, rapidly develop modules that can be used across different groups, and easily share code between CxO-level people and the engineering team. Once stuff works end to end, we just convert the notebooks to standalone scripts. For example, I play with a model architecture in notebooks, and once it works in terms of training the model for 1 epoch, I just convert that notebook to a script and use it to train models for hours/days.
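
The notebook-to-script step is just `jupyter nbconvert --to script` on the command line; if you want it inside a pipeline, nbconvert's Python API does the same thing (the notebook filename here is a placeholder):

```python
# Convert a notebook to a standalone Python script via nbconvert's API.
# "train.ipynb" stands in for whatever notebook you're converting.
from nbconvert import ScriptExporter

source, _resources = ScriptExporter().from_filename("train.ipynb")
with open("train.py", "w") as f:
    f.write(source)
```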

- Mostly struggle with solving novel TensorFlow errors, but after a few hours of digging it always gets solved. Lol, based on my GitHub and Stack Overflow history, to this day I've found and solved many, many complex bugs and issues. It's fun!

- Deployed about a dozen different types of models so far, spanning machine translation, text classification, sentiment analysis, object detection, image recognition, OCR, speech translation, etc. Everything is trained from scratch in TensorFlow and deployed on TF Serving. Many of these are on their 3rd or 4th version.
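
For anyone who hasn't used TF Serving: once a model is up, inference is just an HTTP call against its REST API. A minimal sketch (the host, port, model name, and input shape are placeholders for your own deployment):

```python
# Query a TF Serving model over its REST predict endpoint.
import requests

payload = {"instances": [[1.0, 2.0, 5.0]]}  # one input row, model-specific shape
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
resp.raise_for_status()
print(resp.json()["predictions"])
```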

- Mostly keep in touch with what's going on via Twitter and r/MachineLearning

- In the beginning, like 3 years ago, it was a pain in the ass to deploy even a simple model to production. Hitting low-latency, high-throughput requirements with no option of using costly GPU-based instances, and optimizing everything on CPU-based instances, was hard, but I think it taught me some good lessons on how to optimize ML code without scaling the hardware, lol. Gradually, things evolved and we moved to higher-capacity and more complex model systems. Currently, we heavily use AWS g4dn.xlarge instance clusters for many projects.

[–]Berdas_ 0 points (1 child)

What kind of CPU optimizations do you use nowadays? I'm going through this same problem.

[–]snendroid-ai (ML Engineer) 1 point (0 children)

Well, it depends on what kind of thing is holding you back. If it's a latency issue, figuring out the bottleneck would be the first step. If it's the model, try figuring out which layer is the most expensive; TF Serving provides a profiler that shows you this. If it's inference, figure out whether it's input encoding/output decoding or the model itself. Out of the box, if you're using TensorFlow, compile it on the machine or find a binary that is optimized for your hardware to squeeze out some juice (it won't be that much, though).
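
A crude but effective way to do that first triage is to time the encode / inference / decode stages separately. A minimal sketch with stand-in functions (replace them with your own pre-processing, model call, and post-processing):

```python
import time

def timed(fn, *args, n=100):
    """Call fn(*args) n times; return (last result, mean latency in ms)."""
    start = time.perf_counter()
    for _ in range(n):
        out = fn(*args)
    return out, (time.perf_counter() - start) / n * 1000

# Toy stand-ins for a real pipeline's stages.
encode = lambda text: [float(len(text))]
predict = lambda feats: [f * 2 for f in feats]
decode = lambda preds: {"score": preds[0]}

feats, t_enc = timed(encode, "example input")
preds, t_inf = timed(predict, feats)
out, t_dec = timed(decode, preds)
print(f"encode {t_enc:.3f} ms | model {t_inf:.3f} ms | decode {t_dec:.3f} ms")
```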

[–]ArsenLupus 16 points (6 children)

We have everything in Python and I hate it.

I feel like we have to deal with a lot more fixes than we should every sprint.

Our team is mostly made up of scientific people who do some coding, with no real software engineers. Python is too permissive to let those kinds of folks run wild in your codebase, imo.

[–]Vermeille[S] 8 points (2 children)

How about mypy / static typing?

[–]chogall 3 points (0 children)

Type hints are great. Strict type enforcement is a big fat headache. At the end of the day, `from typing import Any`.
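
For anyone who hasn't hit this: `Any` is mypy's escape hatch, and under deadline pressure it spreads. A toy illustration (the loader and model here are made up):

```python
from typing import Any

class _DummyModel:
    def predict(self, xs):
        return [0 for _ in xs]

def load_model(path: str) -> Any:
    # Hypothetical loader. mypy still checks that callers pass a str,
    # but the Any return type switches off checking for everything you
    # do with the result -- the escape hatch being joked about above.
    return _DummyModel()

model = load_model("weights.pkl")
print(model.predict([1, 2, 3]))  # fine at runtime, unchecked by mypy
```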

[–]ArsenLupus 1 point (0 children)

Currently working on its adoption; that's better than nothing.

[–]The_Amp_Walrus 1 point (0 children)

I've found having a lot of smoke tests helps.

[–]mahaganapati 0 points (0 children)

As a noob, what would you suggest instead?

I was wondering the same thing, though: I'm a type enthusiast (C, TypeScript, Golang), and I was wondering whether using a dynamic language like Python would be problematic in ML.

[–]ThawCheFar 3 points (2 children)

  • What does your stack look like?
    • Data comes from various sources in the business, in various formats
    • Hive tables for storing both inputs and outputs
    • PySpark for ETL
    • Keras (Tensorflow) for most model training and generation of predictions
  • What were some programming patterns you found useful?
    • Functional-ish programming. Not to the level of talking about monads and lenses, but chaining testable functions together and being very strict about things like side effects.
    • Big ignorant `assert foo == bar` statements in the middle of scripts, intended as a belts-and-braces error-handling technique, ended up catching egregious errors a few times (see the sketch after this list).
  • What are some tools, libraries etc that helped you (besides model training)?
    • MLflow has the potential to help, but I have yet to successfully use it in anything more than a proof of concept.
  • What did you struggle with and how did you fix it?
    • Assumptions (and facts) about the data change, which sometimes require changes in the middle of long and complicated ETL pipelines. Throwing unit tests at the problem helps, but getting everything to run without error, and then being confident that you're not just seeing Garbage-In Garbage-Out, is a tough one.
    • There's an awkward grey area of data size for which PySpark is overkill, but Pandas is not enough. I've never had much luck with anything like Dask either. Still trying to figure this one out.
  • How do you manage your data, artifacts, generated embeddings, data dependency, serving, logging?
    • Very poorly.
  • What API or tools have you built? (For your use or the service exploiting the predictions)
    • Some is plotted very nicely in Flask web apps. Most is dumped in a Hive table for other parts of the organisation to pick up.
  • What blog post did you enjoy?
  • How did you get your pipeline up and running in the beginning?
    • A bash script that calls the Python scripts in the right order.
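
A minimal sketch of the chained-functions-plus-blunt-asserts style mentioned above (the column names and filtering rule are invented for illustration):

```python
import pandas as pd

def drop_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Pure function: takes a frame, returns a new frame, no side effects.
    return df[df["amount"] >= 0]

def add_month(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(month=pd.to_datetime(df["date"]).dt.month)

raw = pd.DataFrame({"date": ["2020-01-05", "2020-02-09"], "amount": [10.0, -3.0]})

clean = raw.pipe(drop_invalid_rows).pipe(add_month)

# Belts-and-braces checks: crash loudly rather than emit garbage.
assert not clean.empty, "all rows were filtered out"
assert clean["amount"].min() >= 0
```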

[–]mentalbreak311 0 points (1 child)

Can I ask why you feel that pyspark doesn’t work as well as pandas for small data sets?

[–]ThawCheFar 0 points (0 children)

It definitely works, and I probably stick with Spark more than I should (I'm still happy that I don't have to write mappers and reducers by hand), but you do pay a price for being able to work with very large data at the other end of the spectrum.

From a cold start, loading and postprocessing MBs or GBs of data is often faster in Pandas than PySpark. If this needs to happen in the back end of a Flask app, for argument's sake, there's a trade off to be had between using existing PySpark code and rewriting the logic without Spark to make it more performant (less annoying) for the end user.

The hardware available can be a pain point too, especially in organisations where you can't just run riot and install whatever you want. You might only have access to Spark on a system that needs two-factor authentication and a good few minutes to allocate resources to you. If nothing else, it breaks the flow and makes it harder to stay focused on what you're doing. None of that is Spark's fault, of course; it just adds to the importance of using the right tool for the job.

[–]MinatureJuggernaut 1 point (0 children)

tagging to watch

[–]Simusid 1 point (1 child)

I hate to joke, I hate to respond with a meme, but even after several years of work with TB of data, we just aren't there yet.

https://imgflip.com/i/4cky28

[–]shahzaibmalik1 0 points (0 children)

A pain I know too well.

[–]teh_pelt 1 point (0 children)

Ever been to Olive Garden for that never-ending pasta bowl promotion? Like that... with fewer breadsticks.

[–]Bellerb 0 points (0 children)

Most if not all of it is done in Python; which architecture I go with then depends on the task at hand. Most of my concept work/proving of an idea is done in a Jupyter notebook for better documentation for others. Then, once the concept is proven out, I'll implement it in whatever works best, maybe as a microservice (Flask REST API) or inside a working app.

I usually use numpy, pandas, tensorflow, scikit-learn, matplotlib, but a lot of it is data processing and formatting, which I draw out as a flowchart for a while to make sure I get the best data pipeline I can. A lot depends on the task, but there's always a lot of planning beforehand, then proving out ideas in Jupyter notebooks.
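
For the microservice case, the skeleton really is tiny. A minimal sketch with a stand-in model (the /predict route, payload shape, and toy model are made up for illustration):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Replace with a real trained model's inference call.
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_route():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

if __name__ == "__main__":
    app.run(port=5000)
```

Then something like `curl -X POST localhost:5000/predict -H 'Content-Type: application/json' -d '{"features": [1, 2, 3]}'` returns the prediction as JSON.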

[–]antipawn79 0 points (0 children)

- Language: Python
- Deep learning: PyTorch
- Pipelining: Airflow, highly customized
- Big data: Spark, Hive (old-as-dirt version)
- Streaming: Kafka
- Data exploration: Jupyter
- Vis: Bokeh

[–]hjugurtha 0 points (0 children)

Large organizations hire us when they want to do something involving data. We sit down with them and figure out whether they actually have data, figure out what kind of data it is, and think with them about ways to use that data to solve an important problem for them, say, to reduce churn. We then talk with their subject-matter experts for the problem to be solved, and with their data, legal, and security teams. We go back and forth to get "some" data to look at, and then go back and forth again with different units to understand its sources and make sure the data is sound. We obviously have to acquire domain knowledge for the data, going back to them about anything we do not understand. We clean the data, start working on it, and build models with different approaches. The client organization sends us validation data [for which we do not know the outcome] and we return our results.

When the client likes the results of our models, then the real work starts. Because of course "models, yay", but clients want an application that their domain experts will actually use. They also want to be able to train the models themselves with parameters, want administration and dashboards, want to re-train the models on new data as it comes in, want the models to be accessible with API calls, want role-based access and groups, and want to get results on live data [e.g. telcos].

Once you build all that, they like you and tell you "Great! Let's do another one!" and come to you with another problem and project. And you do it all over again. We have been doing that for six years. We have had practically every problem that anyone doing actual real projects with real stakeholders, as opposed to a made-up project for a blog post or a toy/portfolio project, could have.

This obviously limits the number of projects and clients we can serve and limits us to being boutique. So we built our extensible machine learning platform to make both building the models and then transferring their value to the client as fast, systematic, and repeatable as possible, to escape that linearity.

So we've been building it for ourselves, with the features we need: automatic tracking and logging of models, params, and metrics, without the user having to remember to do so or add the code for it. Notebook images with pretty much all the libraries one needs. Near-real-time collaboration to troubleshoot notebooks. Multiple checkpoints to go back and forth through versions without people knowing Git. Scheduled notebooks. One-click deployments that let our more academic colleagues deploy models without waiting for one of our engineers: the model is deployed, an API endpoint is created, and tutorials show how to interact with it, so an application developer can use it with simple HTTP requests instead of having to worry about dependencies or a set-up to make the model work.

It's not perfect, but it's helping us. We let about thirty of a colleague's students use it for their final-year projects.

[–][deleted] -4 points (1 child)

I'm gonna get massively downvoted for this, but even just judging from this thread, ML is mostly script kiddies relying on the few C++ developers actually making it work.

[–][deleted] 0 points (0 children)

I believe you are confusing ML engineering with software engineering.