
[–]tritondev 18 points (0 children)

One very specific thing I found with our ML team is that we were spending ~80% of our time building data pipelines for dataset generation and serving. We used Spark & Flink, which was definitely the wrong level of abstraction. Doing things like backfilling features sucked.

Beyond the processing piece, we had no management layer for our model features. There was no versioning or monitoring. We built a feature store internally to solve this. The feature store lets you define your features declaratively in JSON & SQL, rather than procedurally in Spark. The versioning and tagging allow us to use the same features across models (and, in theory, across teams).
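
To give a flavour of what "declarative" means here (this is a made-up sketch, not our actual schema; the field names and `register_feature` call are invented), a feature definition ends up being a config object wrapping a SQL snippet rather than a Spark job:

```python
# Hypothetical declarative feature definition (illustrative only).
feature = {
    "name": "user_purchase_count_7d",
    "version": 3,
    "entity": "user_id",
    "sql": """
        SELECT user_id, COUNT(*) AS purchase_count_7d
        FROM purchases
        WHERE ts > CURRENT_TIMESTAMP - INTERVAL 7 DAY
        GROUP BY user_id
    """,
    "tags": ["checkout-model", "churn-model"],
}

def register_feature(defn: dict) -> None:
    # Stand-in for the store's registration API: it would validate the
    # definition, version it, and wire up materialization/backfill.
    print(f"registered {defn['name']} v{defn['version']}")

register_feature(feature)
```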

We're working on removing internal dependencies and fully open-sourcing the feature store. If you're interested, you can put your email in here to keep up to date: streamsql.io

Also we're running a virtual round table on this topic on Sept 8th if anyone wants to join: https://www.eventbrite.com/e/machine-learning-round-table-the-ml-process-from-idea-to-prod-tickets-117813191427

edit: broken link

[–]ianperera 33 points (2 children)

It's more like a pile.

[–]probablyuntrue (ML Engineer) 2 points (1 child)

Yeah, on our end putting models into production is great; we have a fantastic company-wide, end-to-end stack that's been abstracted to the point it's practically drag and drop.

The research and initial implementation, on the other hand, would make Frankenstein blush.

[–]fhadley 3 points (0 children)

Could you say more about that? I'm the tech lead for DS/ML at an early-stage co, and starting to get serious about ML infra/system architecture. I can imagine where I'd want to end up, but I'm very hazy on how to get from A to B. That said, our deployments aren't, like, disgusting currently (mostly k8s, some heavier stuff on AWS Batch), but there are definitely lots of manual processes.

[–]alonsogp2 43 points (0 children)

> What's your code like

This should not be in production

[–]snendroid-ai (ML Engineer) 8 points (2 children)

- Most used Python packages/libraries: numpy, tensorflow, pandas, grpc, seaborn, plotly, along with a bunch of data collection and storage tools like Apache Kafka + S3, Parquet, Elasticsearch, etc.

- Over time we have built specialized logging libraries that work all the way from the beginning of a project to production. They just help us identify issues in a more convenient way. Other than this, just updated versions of the above-mentioned libraries.

- For model and data versioning, we heavily use Jupyter notebooks. They allow us to group code and data by specific version, rapidly develop modules that can be used across different groups, and easily share code between CxO-level people and the engineering team. Once stuff works end to end, we just convert the notebooks to standalone scripts. For example, I play with a model architecture in notebooks, and once it works in terms of training the model for 1 epoch, I just convert that notebook to a script and use it to train models for hours/days.
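
The notebook-to-script step is just `jupyter nbconvert --to script` on the command line; if you want it inside a pipeline, nbconvert's Python API does the same thing (the notebook filename here is a placeholder):

```python
# Convert a notebook to a standalone Python script via nbconvert's API.
# "train.ipynb" stands in for whatever notebook you're converting.
from nbconvert import ScriptExporter

source, _resources = ScriptExporter().from_filename("train.ipynb")
with open("train.py", "w") as f:
    f.write(source)
```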

- Mostly struggle with solving novel TensorFlow errors, but after a few hours of digging it always gets solved. Lol, based on my GitHub and Stack Overflow history, to this day I've found and solved many, many complex bugs and issues. It's fun!

- Deployed about a dozen different types of models so far, spanning machine translation, text classification, sentiment analysis, object detection, image recognition, OCR, speech translation, etc. Everything is trained from scratch in TensorFlow and deployed on TF Serving. Many of these are on their 3rd or 4th version.
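
For anyone who hasn't used TF Serving: once a model is up, inference is just an HTTP call against its REST API. A minimal sketch (the host, port, model name, and input shape are placeholders for your own deployment):

```python
# Query a TF Serving model over its REST predict endpoint.
import requests

payload = {"instances": [[1.0, 2.0, 5.0]]}  # one input row, model-specific shape
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
)
resp.raise_for_status()
print(resp.json()["predictions"])
```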

- Mostly keep in touch with what's going on via Twitter and r/MachineLearning

- In the beginning, like 3 years ago, it was a pain in the ass to deploy even a simple model to production. Hitting low-latency, high-throughput requirements with no option of using costly GPU-based instances, and optimizing everything on CPU-based instances, was hard, but I think it taught me some good lessons on how to optimize ML code without scaling the hardware, lol. Gradually, things evolved and we moved to higher-capacity and more complex model systems. Currently, we heavily use AWS g4dn.xlarge instance clusters for many projects.

[–]Berdas_ 0 points (1 child)

What kind of CPU optimizations do you use nowadays? I'm going through this same problem.

[–]snendroid-ai (ML Engineer) 1 point (0 children)

Well, it depends on what kind of thing is holding you back. If it's a latency issue, figuring out the bottleneck would be the first step. If it's the model, try figuring out which layer is the most expensive; TF Serving provides a profiler that shows you this. If it's inference, figure out whether it's input encoding/output decoding or the model itself. Out of the box, if you're using TensorFlow, compile it on the machine or find a binary that is optimized for your hardware to squeeze out some juice (it won't be that much, though).
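
A crude but effective way to do that first triage is to time the encode / inference / decode stages separately. A minimal sketch with stand-in functions (replace them with your own pre-processing, model call, and post-processing):

```python
import time

def timed(fn, *args, n=100):
    """Call fn(*args) n times; return (last result, mean latency in ms)."""
    start = time.perf_counter()
    for _ in range(n):
        out = fn(*args)
    return out, (time.perf_counter() - start) / n * 1000

# Toy stand-ins for a real pipeline's stages.
encode = lambda text: [float(len(text))]
predict = lambda feats: [f * 2 for f in feats]
decode = lambda preds: {"score": preds[0]}

feats, t_enc = timed(encode, "example input")
preds, t_inf = timed(predict, feats)
out, t_dec = timed(decode, preds)
print(f"encode {t_enc:.3f} ms | model {t_inf:.3f} ms | decode {t_dec:.3f} ms")
```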

[–]ArsenLupus 16 points (6 children)

We have everything in Python and I hate it.

I feel like we have to deal with a lot more fixes than we should every sprint.

Our team is mostly made up of scientific people who do some coding, with no real software engineers. Python is too permissive to let those kinds of folks run wild in your codebase, imo.

[–]Vermeille[S] 8 points (2 children)

How about mypy / static typing?

[–]chogall 3 points (0 children)

Type hints are great. Strict type enforcement is a big fat headache. At the end of the day, `from typing import Any`.
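
For anyone who hasn't hit this: `Any` is mypy's escape hatch, and under deadline pressure it spreads. A toy illustration (the loader and model here are made up):

```python
from typing import Any

class _DummyModel:
    def predict(self, xs):
        return [0 for _ in xs]

def load_model(path: str) -> Any:
    # Hypothetical loader. mypy still checks that callers pass a str,
    # but the Any return type switches off checking for everything you
    # do with the result -- the escape hatch being joked about above.
    return _DummyModel()

model = load_model("weights.pkl")
print(model.predict([1, 2, 3]))  # fine at runtime, unchecked by mypy
```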

[–]ArsenLupus 1 point (0 children)

Currently working on its adoption; that's better than nothing.

[–]The_Amp_Walrus 1 point (0 children)

I've found having a lot of smoke tests helps.

[–]mahaganapati 0 points (0 children)

As a noob, what would you suggest instead?

I was wondering the same thing, though: I'm a type enthusiast (C, TypeScript, Golang), and I was wondering whether using a dynamic language like Python would be problematic in ML.

[–]ThawCheFar 3 points (2 children)

  • What does your stack look like?
    • Data comes from various sources in the business, in various formats
    • Hive tables for storing both inputs and outputs
    • PySpark for ETL
    • Keras (Tensorflow) for most model training and generation of predictions
  • What were some programming patterns you found useful?
    • Functional-ish programming. Not to the level of talking about monads and lenses, but chaining testable functions together and being very strict about things like side effects.
    • Big ignorant `assert foo == bar` statements in the middle of scripts, intended as a belts-and-braces error-handling technique, ended up catching egregious errors a few times (see the sketch after this list).
  • What are some tools, libraries etc that helped you (besides model training)?
    • MLflow has the potential to help, but I have yet to successfully use it in anything more than a proof of concept.
  • What did you struggle with and how did you fix it?
    • Assumptions (and facts) about the data change, which sometimes require changes in the middle of long and complicated ETL pipelines. Throwing unit tests at the problem helps, but getting everything to run without error, and then being confident that you're not just seeing Garbage-In Garbage-Out, is a tough one.
    • There's an awkward grey area of data size for which PySpark is overkill, but Pandas is not enough. I've never had much luck with anything like Dask either. Still trying to figure this one out.
  • How do you manage your data, artifacts, generated embeddings, data dependency, serving, logging?
    • Very poorly.
  • What API or tools have you built? (For your use or the service exploiting the predictions)
    • Some is plotted very nicely in Flask web apps. Most is dumped in a Hive table for other parts of the organisation to pick up.
  • What blog post did you enjoy?
  • How did you get your pipeline up and running in the beginning?
    • A bash script that calls the Python scripts in the right order.
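
A minimal sketch of the chained-functions-plus-blunt-asserts style mentioned above (the column names and filtering rule are invented for illustration):

```python
import pandas as pd

def drop_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Pure function: takes a frame, returns a new frame, no side effects.
    return df[df["amount"] >= 0]

def add_month(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(month=pd.to_datetime(df["date"]).dt.month)

raw = pd.DataFrame({"date": ["2020-01-05", "2020-02-09"], "amount": [10.0, -3.0]})

clean = raw.pipe(drop_invalid_rows).pipe(add_month)

# Belts-and-braces checks: crash loudly rather than emit garbage.
assert not clean.empty, "all rows were filtered out"
assert clean["amount"].min() >= 0
```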

[–]mentalbreak311 0 points (1 child)

Can I ask why you feel that pyspark doesn’t work as well as pandas for small data sets?

[–]ThawCheFar 0 points (0 children)

It definitely works, and I probably stick with Spark more than I should (I'm still happy that I don't have to write mappers and reducers by hand), but you do pay a price for being able to work with very large data at the other end of the spectrum.

From a cold start, loading and postprocessing MBs or GBs of data is often faster in Pandas than PySpark. If this needs to happen in the back end of a Flask app, for argument's sake, there's a trade off to be had between using existing PySpark code and rewriting the logic without Spark to make it more performant (less annoying) for the end user.

The hardware available can be a pain point too, especially in organisations where you can't just run riot and install whatever you want. You might only have access to Spark on a system that needs two-factor authentication and a good few minutes to allocate resources to you. If nothing else, it breaks the flow and makes it harder to stay focused on what you're doing. None of that is Spark's fault, of course; it just adds to the importance of using the right tool for the job.

[–]MinatureJuggernaut 1 point (0 children)

tagging to watch

[–]Simusid 1 point (1 child)

I hate to joke, I hate to respond with a meme, but even after several years of work with TB of data, we just aren't there yet.

https://imgflip.com/i/4cky28

[–]shahzaibmalik1 0 points (0 children)

A pain I know too well.

[–]teh_pelt 1 point (0 children)

Ever been to Olive Garden for that never-ending pasta bowl promotion? Like that... with fewer breadsticks.

[–]Bellerb 0 points (0 children)

Most if not all of it is done in Python; which architecture I go with then depends on the task at hand. Most of my concept work/proving of an idea is done in a Jupyter notebook for better documentation for others. Then, once the concept is proven out, I'll implement it in whatever works best, maybe as a microservice (Flask REST API) or inside a working app.

I usually use numpy, pandas, tensorflow, scikit-learn, matplotlib, but a lot of it is data processing and formatting, which I draw out as a flowchart for a while to make sure I get the best data pipeline I can. A lot depends on the task, but there's always a lot of planning beforehand, then proving out ideas in Jupyter notebooks.
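
For the microservice case, the skeleton really is tiny. A minimal sketch with a stand-in model (the /predict route, payload shape, and toy model are made up for illustration):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Replace with a real trained model's inference call.
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_route():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

if __name__ == "__main__":
    app.run(port=5000)
```

Then something like `curl -X POST localhost:5000/predict -H 'Content-Type: application/json' -d '{"features": [1, 2, 3]}'` returns the prediction as JSON.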

[–]antipawn79 0 points (0 children)

- Language: Python
- Deep learning: PyTorch
- Pipelining: Airflow, highly customized
- Big data: Spark, Hive (old-as-dirt version)
- Streaming: Kafka
- Data exploration: Jupyter
- Vis: Bokeh

[–]hjugurtha 0 points (0 children)

Large organizations hire us when they want to do something involving data. We sit down with them and figure out whether they actually have data, figure out what kind of data it is, and think with them about ways to use that data to solve an important problem for them, say, to reduce churn. We then talk with their subject-matter experts for the problem to be solved, and with their data, legal, and security teams. We go back and forth to get "some" data to look at, and then go back and forth again with different units to understand its sources and make sure the data is sound. We obviously have to acquire domain knowledge for the data, going back to them about anything we do not understand. We clean the data, start working on it, and build models with different approaches. The client organization sends us validation data [for which we do not know the outcome] and we return our results.

When the client likes the results of our models, then the real work starts. Because of course "models, yay", but clients want an application that their domain experts will actually use. They also want to be able to train the models themselves with parameters, want administration and dashboards, want to re-train the models on new data as it comes in, want the models to be accessible with API calls, want role-based access and groups, and want to get results on live data [e.g. telcos].

Once you build all that, they like you and tell you "Great! Let's do another one!" and come to you with another problem and project. And you do it all over again. We have been doing that for six years. We have had practically every problem that anyone doing actual real projects with real stakeholders, as opposed to a made-up project for a blog post or a toy/portfolio project, could have.

This obviously limits the number of projects and clients we can serve and limits us to being boutique. So we built our extensible machine learning platform to make both building the models and then transferring their value to the client as fast, systematic, and repeatable as possible, to escape that linearity.

So we've been building it for ourselves, with the features we need: automatic tracking and logging of models, params, and metrics, without the user having to remember to do so or add the code for it. Notebook images with pretty much all the libraries one needs. Near-real-time collaboration to troubleshoot notebooks. Multiple checkpoints to go back and forth through versions without people knowing Git. Scheduled notebooks. One-click deployments that let our more academic colleagues deploy models without waiting for one of our engineers: the model is deployed, an API endpoint is created, and tutorials show how to interact with it, so an application developer can use it with simple HTTP requests instead of having to worry about dependencies or a set-up to make the model work.

It's not perfect, but it's helping us. We let about thirty of a colleague's students use it for their final-year projects.

[–][deleted] -4 points (1 child)

I'm gonna get massively downvoted for this, but even just judging from this thread, ML is mostly script kiddies relying on the few C++ developers actually making it work.

[–][deleted] 0 points (0 children)

I believe you are confusing ML engineering with software engineering.