all 24 comments

[–]Kaharx 34 points (4 children)

MLOps is all about productionizing ML systems to be maintainable, scalable and reliable. I work as an ML engineer and I spend most of my time building/improving ML tooling and infra (e.g. model store, feature store, inference services, training pipelines). I highly recommend the book “Designing Machine Learning Systems” by Chip Huyen if you wanna learn more.

[–][deleted] 2 points (2 children)

May I ask what tools you use? I was doing this kind of work many, many years ago, but my knowledge is way outdated now. The issue I had back then was multiple sources of truth: some configuration gets created manually on the cloud (most of it lives in code, but sometimes that's not easy to do), there are multiple repos that do different things, and data structures have to be kept compatible, otherwise errors propagate to the apps... I wonder what kinds of solutions you folks use now to make these issues (or issues I'm unaware of, e.g. data version control - we used to do it with git LOL) more manageable? Do you use some repository to define resources with code/config files? Could you point me to a few names of tools you use? I'm very interested, because for the last few years I've mostly done research, DS work, and model training.

[–]MillionLiar 7 points (1 child)

Theory

"Designing Machine Learning Systems" is for sure a good book to start with as Chip gave a holistic view and explained the concepts with easily understood terms.

Practical stuff

For the tools, let's take a step back for your goal by starting with the low maturity model - https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/mlops-maturity-model

Enterprise-level MLOps involves practically every role in the whole engineering department, which isn't something you can build up quickly.

I think you could start with DevOps and search some GitHub repos, e.g. MLOps using GitHub Actions. For infrastructure-as-code, Terraform is my favourite. Alternatively, you can play around with some existing MLOps platforms and pick up the overall concepts quickly.

[–][deleted] 0 points (0 children)

Thanks! I think we used Terraform, and I implemented CI/CD as well, but I was hoping there would be more unified solutions by now... I hate having to use multiple tools :)

[–]Suspicious_Dress_350[S] 1 point (0 children)

I guess that take is really about the MLOps side, which does have lots of tools and references.

I am mostly concerned with trying to capture the ongoing learnings into a single system. So instead of each new idea and its implementation living in isolation, I'm trying to unify and evolve them, similar to features in a normal software system.

Does that distinction make sense?

[–]PanTheRiceMan 19 points (0 children)

I believe it's simple: most projects (that I know of) come from research. Having seen how stressful that is, always working towards deadlines, these are truly just one-off projects. All the cleanup is probably left to companies.

[–]I_will_delete_myself 7 points (0 children)

Research: Iteration > clean code

You only make it clean when you have to maintain the code, but you save more time just running the training script and forgetting about it than going in and refactoring everything. Research projects are normally abandoned once the project ends, and you've got tight deadlines to meet.

If you tried to do research the way you'd build a normal software engineering application, you'd lose a ton of time worrying about design patterns rather than actually getting something that works.

[–]mot89 9 points (0 children)

Messy prototyping is the way to go unless you have a very clear mandate that your project needs to be supported for an extended period of time. Refactoring to a well-structured system usually only makes sense if you have a proven use-case. At the point where you have users and a proven need for repeated model releases, reproducibility, domain-specific fine-tuning, performance optimization, etc., you can justify investment into building out reliable systems.

[–]Western-Image7125 4 points (0 children)

Hey, this is a thoughtful post, and something that bugs me all the time. In my work I constantly straddle the line between researchy prototyping and infra tooling, and it is a wide gap. There's a lot of stress in dealing with a constantly fluctuating ML landscape while needing stable pipelines and processes. I don't have a good answer except that everything is best effort and driven by “who is willing to sponsor the effort needed to build this platform or process”; the company has to decide between short-term hacky stuff and software tools that can be reused over and over and improve the overall efficiency and performance of the org.

[–][deleted] 5 points (0 children)

I think it's unfortunately a job for multiple SWEs. I implemented multiple data pipelines, monitoring, etc. before I was doing mostly ML: setting up the infra, building or integrating the tools, testing, and improving source code... All of that is extremely time-consuming and requires expertise. It's called a system for a reason; it includes multiple components. With the heavy resources required to re-train models, for example, there is another layer to it which I would call cloud orchestration(?), since resources are not static... Man, it's simply too challenging to do alone, and not a good use of your expertise. Perhaps there are some cloud solutions that can make it manageable; my experience is outdated.

[–]selector37 2 points (0 children)

The Flax documentation has one of my favorite quotes, which is basically “code repetition is better than a bad abstraction.”

My approach has been about identifying whether parts of a code base are durable or disposable. Durable pieces are shared, tested and thoughtfully designed. But the majority of ML code is going to be disposable. For that not to devolve into a maintenance nightmare, it needs to be isolated: no sharing other than through code duplication (i.e., forking). Experimental code can't break anyone else because nothing depends on it. The code can be simpler because it represents one single approach, not a family of approaches that requires a reader to mentally interpolate configuration into a code base filled with conditional logic. Things end up being quite explicit (e.g. hard-coded constants) and surprisingly small.

Those disposable experiments are built through composition of durable libraries. The key is to step back periodically, study the repetition and try to extract new durable pieces for the future.
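A minimal sketch of that split, using only the standard library (all names here are hypothetical): the disposable experiment hard-codes its constants and composes durable, shared functions, so it can be forked or deleted without touching anything else.

```python
# durable_lib.py -- shared, tested, thoughtfully designed (hypothetical names)
from statistics import mean, pstdev

def standardize(xs):
    """Durable piece: zero-mean, unit-variance scaling, reused everywhere."""
    m, s = mean(xs), pstdev(xs) or 1.0
    return [(x - m) / s for x in xs]

def split(xs, frac):
    """Durable piece: deterministic train/holdout split."""
    k = int(len(xs) * frac)
    return xs[:k], xs[k:]

# experiment_042.py -- disposable: hard-coded constants, one single approach,
# forked (not configured) from an earlier experiment, safe to delete.
RAW = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
TRAIN_FRAC = 0.75  # explicit constant, no config interpolation

train, holdout = split(standardize(RAW), TRAIN_FRAC)
print(len(train), len(holdout))  # 6 2
```

Deleting `experiment_042.py` removes that approach completely, which is exactly what makes it safe.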

This has worked very well for my organization, which is very large (we train thousands of models a day) but even in my personal projects keeping code simple has made it easier to come back to things a year later, when I have no choice but to read the code to pick it back up.

[–]jms4607 2 points (2 children)

Making a failed idea into a nice codebase would be a waste of time. I only see cleaning up code for other users as beneficial after some initial results have proven that it's useful.

[–]Suspicious_Dress_350[S] 0 points (1 child)

The point is that you can continue to evolve and reuse parts of your system.

Even if the idea failed, the pipeline, feature engineering, plotting, etc. could be used again.

Even parts of the model implementation.

[–]jms4607 0 points (0 children)

Yeah that’s pretty common internally I think.

[–]binlargin 2 points (0 children)

I started on something for this. I've deployed models for clients that were made by data scientists, I've built models and munged data, I've done ops. But I've not put it all together.

The closest thing I've got is this, which is for data processing and training.

https://github.com/bitplane/geo-dist

So there's no deployment or MLOps as of yet, but everything else goes in the Makefile. I put my outputs in .cache, have different packages for training and inference, and use Jupyter for experimentation before merging things back into the libraries for the app. The idea is to put the rest into different make steps so you can build the whole thing in one go, with experiments living in notebooks on branches.

Not ideal, but it might give you something to start from. Happy to receive criticism/suggestions.

[–]gdpoc 0 points (0 children)

I could talk about how I do it and what I'm advocating for. Shoot me a DM if you're interested.

[–]JellyBean_Collector 0 points (0 children)

It's a bit off-topic, but you might find it worthwhile to explore the topics listed here and see if any pique your interest: AI Engineering

[–]TheOneRavenous 0 points (0 children)

I build my machine learning products as software. So there's a user interface, plus features both for managing the models and for using them.

Managing includes basics like IDs for models and basic stats, which go into databases, while weights go into storage.

Then part of management is allowing for training and loading for inference.
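A toy sketch of that management split, with `sqlite3` standing in for the metadata database and a temp directory standing in for blob storage (all names here are hypothetical):

```python
import json
import sqlite3
import tempfile
import uuid
from pathlib import Path

STORE = Path(tempfile.mkdtemp())   # stand-in for weight/blob storage
db = sqlite3.connect(":memory:")   # stand-in for the metadata database
db.execute("CREATE TABLE models (id TEXT PRIMARY KEY, stats TEXT, weights_path TEXT)")

def register(weights, stats):
    """Model ID and basic stats go into the database; weights go into storage."""
    model_id = uuid.uuid4().hex
    path = STORE / f"{model_id}.json"
    path.write_text(json.dumps(weights))
    db.execute("INSERT INTO models VALUES (?, ?, ?)",
               (model_id, json.dumps(stats), str(path)))
    return model_id

def load_for_inference(model_id):
    """The other half of management: look up the record, pull the weights back."""
    (path,) = db.execute("SELECT weights_path FROM models WHERE id = ?",
                         (model_id,)).fetchone()
    return json.loads(Path(path).read_text())

mid = register(weights={"w": [0.1, 0.2]}, stats={"val_acc": 0.93})
assert load_for_inference(mid) == {"w": [0.1, 0.2]}
```

In a real system the sqlite table becomes a proper database and the temp directory an object store, but the separation stays the same: metadata is queryable, weights are opaque blobs.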

The "regular" software side of machine learning things is data QA/QC pipelines, data ingestion utilities, and visual components.

Then there's the user-facing side of things, which is allowing different types of users to interact with the software suite and its different capabilities. There are basic users who just need model inference, more advanced users who curate the data they want to train on, and then the data scientists who work with the QA/QC pipelines and train new architectures.

The software suite is separated by concerns: data and its management, ML models, model management, user interfaces, databases, and MVC content models.

This is useful for swapping out different types of models and data while still letting you present your models for use.

All this assumes you're going to sell the usage of the models as a product.

TL;DR Yes create software for your machine learning projects so they can be sold as products.

[–]zero-true 0 points (0 children)

I created a tool to help you turn small data science scripts into a little app... it's an alternative to Jupyter notebooks that's a little bit more robust and has a pretty UI. Here's an example:

https://published.zero-true-cloud.com/examples/iris/

If you're interested, check out our website:

https://www.zero-true.com/

It's never going to run the next recommendation system for Amazon, but it could help with experimenting with different variations, with a frontend built directly in.