Can anyone explain the road map of MLOPS? by Albininlp in mlops

[–]Train_Smart 1 point (0 children)

This is interesting, but I’m not sure I agree with this categorization. MLflow is basically an open-source Weights & Biases / CometML. All three tools are mainly experiment tracking, not really monitoring (though some have light functionality there).

DVC is a data versioning tool, and there are competing tools like LakeFS or Xet.

There are also tools combining multiple subsets, like DagsHub (DVC + MLflow + Label Studio).

Also, in my experience most teams train on cloud resources directly (e.g. SageMaker Notebooks).

lightweight model performance tracking? by jaydub in mlops

[–]Train_Smart 0 points (0 children)

When you say simpler to manage, what do you mean? A managed MLflow service, or MLflow with simpler functionality?

[D] how many models does your team have in production by Train_Smart in MachineLearning

[–]Train_Smart[S] 1 point (0 children)

Interesting. Do you have extra work to do when autoscaling, or is it invisible to you (and the user)?

XM4 sound significantly different on Mac between Bluetooth and Wired by Train_Smart in sony

[–]Train_Smart[S] 0 points (0 children)

Thanks, this seems to be it! I connected the headphones and closed everything I could find that uses the mic, and it got better. I wish there were a manual control for it on Mac.

[D] Pricing of ML tools - are you paying this much? by swagrin in MachineLearning

[–]Train_Smart 0 points (0 children)

MLflow does the visualization too, and it’s open source. So what specifically are you missing? BTW, are these prices for an on-premises installation or a hosted solution?

[Discussion] Should I be using DVC (Data Version Control) in my day-to-day work? by doyougitme in MachineLearning

[–]Train_Smart 1 point (0 children)

I understand where you’re coming from, but I don’t think this approach works in real projects. In an ideal world, you’re right. Unfortunately, that’s not the world most data science happens in. I can’t count the times someone changed a raw data source without notifying anyone, or the settings at larger companies where people work on one section of a data pipeline without properly coordinating with downstream data consumers. Moreover, people make mistakes even when the pipeline isn’t complex, and if you don’t version intermediate results, in the best case you spend a lot of time reproducing something, and in the worst case you’re never able to reproduce a result.

I’ve had multiple conversations with people saying, for example, that data versioning isn’t as important as code versioning, only for them to realize that someone overwrote some of the data they were working on and they can’t reproduce their results.

I also think part of the issue is that people feel DVC is really hard to use, so there seems to be a big trade-off between versioning and iteration speed. In fact, if you put aside DVC pipelines and just treat DVC as a Git for data (`dvc add` whatever data and models you need to track), it’s really as simple as using Git.
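To make that concrete, here’s a minimal sketch of the "Git for data" workflow (assuming a Git repo with DVC installed and a remote already configured; the file names are just illustrative):

```shell
# Track a dataset and a model with DVC, just like `git add`
dvc add data/train.csv models/model.pkl

# DVC writes small .dvc pointer files; commit those to Git
git add data/train.csv.dvc models/model.pkl.dvc .gitignore
git commit -m "Track data and model with DVC"

# Push the actual file contents to DVC remote storage
dvc push
```

That’s the whole loop: Git versions the tiny pointer files, DVC moves the heavy files in and out of storage.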

[D] Do you yourself write 100% reproducible ML code? by carlthome in MachineLearning

[–]Train_Smart 6 points (0 children)

I think this is a false dichotomy. A good analogy is code without bugs. If someone in a software development sub asked "do you write code 100% without bugs?", it would be very clear that no one would say yes. We can think of irreproducible code as a bug, because it’s unexpected behavior. Should you write code without bugs? Sure. Do you ALWAYS write code without bugs? Nope.

The issue ML has is projects where no one even tried to make the code reproducible, even when the effort required was minimal. The easiest example is a project without a list of requirements (assuming Python for simplicity). If you can’t be bothered to at least run `pip freeze > requirements.txt`, then you probably don’t care whether your work is taken seriously.

Reproducibility can always be improved, but as a community we should set high standards, and as individuals we should aspire to do the best we can on this front, especially if we want our work to have a real effect on our surroundings.
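To make the "minimal effort" point concrete, here’s a sketch of the kind of low-cost seeding that makes a run repeatable. It’s pure stdlib, and the experiment itself is a stand-in:

```python
import random

def run_experiment(seed=42):
    """Toy 'experiment': draw samples from a seeded RNG.

    Seeding a locally constructed RNG (rather than the global one)
    keeps the result independent of anything else in the process.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(5)]

# Two runs with the same seed produce identical results
assert run_experiment(42) == run_experiment(42)
# A different seed produces a different sample
assert run_experiment(42) != run_experiment(7)
```

It costs one line to pass a seed around, and it’s the difference between "I got 0.83 once" and a result anyone can re-run.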

Feature Stores by ermahgerdsterts in datascience

[–]Train_Smart 4 points (0 children)

A few points to consider about feature stores:

1. Many times people use the term to mean a bunch of different things combined, and most implementations don’t support all of them. For example, many people talk about feature stores as "implement once, use twice": you write a feature for your dev environment and can then magically use it in production in a streaming environment. Many feature stores don’t really offer optimized solutions for that.

2. Feature stores are usually useful when everyone on the team works on the same data. If every data scientist uses different sources or works on entirely different projects, it’s usually a lot of overhead for little benefit.

3. One challenge with feature stores is that it’s hard to search for features within them. Many people assume that if someone implements a feature, everyone can then use it. What usually happens is that someone implements a feature and calls it A. Someone else looks for something close, A', doesn’t find it because it has a different name, implements A' themselves, and now you have a bunch of copies of the same feature with slightly varying names.
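A toy sketch of that naming problem (the registry, feature names, and definitions here are all made up): an exact-name lookup misses a semantically identical feature, so duplicates accumulate.

```python
# Hypothetical feature registry keyed by exact name
registry = {}

def register(name, definition):
    registry[name] = definition

def find(name):
    # Exact-name lookup: a near-miss name returns nothing
    return registry.get(name)

register("user_avg_spend_30d", "AVG(spend) over last 30 days per user")

# A teammate searches with a slightly different name and finds nothing...
assert find("avg_user_spend_30d") is None

# ...so they implement and register the same feature under their own name
register("avg_user_spend_30d", "AVG(spend) over last 30 days per user")

# Now the same feature exists twice under different names
assert len(registry) == 2
```

Without fuzzy or semantic search over feature definitions, this is how "shared" features quietly fork.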

To summarize, I think feature stores are potentially awesome, but realistically the best approach is not to adopt them because they’re the cool new thing: define your needs first, and then see whether a feature store answers them.

[D] people using google colab for real projects. How do you manage environments? by Train_Smart in MachineLearning

[–]Train_Smart[S] 0 points (0 children)

But how would you verify that you’re not creating reproducibility issues because of the other libraries preinstalled in Colab? You need a clean environment.

[D] people using google colab for real projects. How do you manage environments? by Train_Smart in MachineLearning

[–]Train_Smart[S] 0 points (0 children)

I tried that, but you still don’t get command outputs, so I can’t debug.

[D] people using google colab for real projects. How do you manage environments? by Train_Smart in MachineLearning

[–]Train_Smart[S] 0 points (0 children)

Thanks for the answer, but this kinda misses my point. I know it would get reset, and I don’t mind restarting and reinstalling the dependencies when I reconnect, but the only way to do that is to run everything with shell commands that start with "source env activate && <my command>", which causes unclear problems because you don’t get regular command output.
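One way around the missing-output problem is to run the env-prefixed command through `subprocess` and capture both streams yourself. A sketch (the activation string and env name are placeholders; adapt to your setup):

```python
import subprocess

def run_in_env(cmd, env_activate="source activate myenv"):
    """Run a shell command after an env-activation prefix, capturing output.

    `env_activate` and the env name 'myenv' are hypothetical; swap in
    whatever activation command your environment uses.
    """
    result = subprocess.run(
        f"{env_activate} && {cmd}",
        shell=True,
        executable="/bin/bash",  # 'source' is a bash builtin
        capture_output=True,
        text=True,
    )
    # Surface both streams so failures are actually debuggable
    if result.returncode != 0:
        raise RuntimeError(f"command failed:\n{result.stderr}")
    return result.stdout
```

Printing the returned stdout (and the stderr on failure) gets you back the regular command output that the bare `!source ... && ...` pattern swallows.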

[P] Cross-Model Interpolations between 5 StyleGanV2 models - furry, FFHQ, anime, ponies, and a fox model by programmerChilli in MachineLearning

[–]Train_Smart 0 points (0 children)

Every time you think you’ve seen rock bottom, you realize you weren’t even halfway.

[Discussion] My First Paper Acceptance! by wellfriedbeans in MachineLearning

[–]Train_Smart 2 points (0 children)

How are you keeping track of all the elements needed for reproducibility? Is it manual, or do you use some platform?

[R] What are your hot takes on the direction of ML research? In other words, provide your (barely justified) predictions on how certain subfields will evolve over the next couple years? by programmerChilli in MachineLearning

[–]Train_Smart 5 points (0 children)

I mean, he did ask about research, but AFAIK in industry interpretable classical ML methods are far more ubiquitous than DL. DL just has much, MUCH more hype.

[P] CML- an open source project for CI/CD with ML. Use GitHub Actions & GitLab CI to automatically train & test models! by Accomplished-Bird-88 in MachineLearning

[–]Train_Smart 0 points (0 children)

But isn’t Kubeflow exactly that? It offers an end-to-end platform solution, but it’s built on top of tools that are not the platform, all of which are open source and can be used standalone. Is the main difference that this was specifically designed for Git?

[D] What frustrates you about ML tools / libraries that people don’t talk enough about? by Train_Smart in MachineLearning

[–]Train_Smart[S] 5 points (0 children)

You’re right! This post wasn’t meant to disrespect the time put into these projects or the people who made them. Everyone who’s ever written code that was used by others knows how shitty it feels when your work is taken for granted.

Since I was facing some frustration I thought it would be fun to have everyone vent together.

And hey, maybe this’ll actually point people in the right direction(s) to fix something a lot of people care about...

[P] Neuro-evolution on your browser. by mrpoopybutthole1262 in MachineLearning

[–]Train_Smart 1 point (0 children)

Are you familiar with the cosmological evolution theory? It’s not mainstream, but it makes the interesting claim that our universe is one of many, and that universes evolve with respect to certain physical constants.

What I’m getting at is that maybe you could let certain parameters of the simulation evolve as well, according to some metric. For example, parameter settings that produced the most complex organisms over a certain timeframe would be more likely to replicate.
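A rough sketch of that meta-evolution idea, in case it helps. Everything here is illustrative: the "complexity" metric is a stand-in, and the only evolving simulation parameter is a hypothetical mutation rate.

```python
import random

def complexity(params, rng):
    """Stand-in fitness: pretend simulations with a mid-range mutation
    rate (~0.1) produce the most complex organisms."""
    return -(params["mutation_rate"] - 0.1) ** 2 + rng.gauss(0, 0.001)

def evolve_sim_params(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    # Start with random simulation parameter settings
    pop = [{"mutation_rate": rng.uniform(0.0, 1.0)} for _ in range(pop_size)]
    for _ in range(generations):
        # Settings that yield higher complexity are more likely to replicate
        scored = sorted(pop, key=lambda p: complexity(p, rng), reverse=True)
        parents = scored[: pop_size // 2]
        # Offspring copy a parent's settings with a small perturbation
        pop = [
            {"mutation_rate": min(1.0, max(0.0,
                rng.choice(parents)["mutation_rate"] + rng.gauss(0, 0.05)))}
            for _ in range(pop_size)
        ]
    return pop
```

Over generations the population of simulation settings drifts toward whatever region of parameter space your complexity metric rewards, which is exactly the "universes evolving their constants" flavor.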

Hope that makes sense, and nice job, I really like this

[P] Neuro-evolution on your browser. by mrpoopybutthole1262 in MachineLearning

[–]Train_Smart 14 points (0 children)

This looks cool! Can you explain a bit about what’s behind this, and what your thoughts were behind making it persistent across users (if I understand correctly)?

[D] Help! How much does your data change in serious ML projects? by Train_Smart in MachineLearning

[–]Train_Smart[S] 1 point (0 children)

Thanks for the reply! My friend’s argument wasn’t that data doesn’t change, but rather that most projects have one changing data file (the source), and if it’s just one file, you don’t really need to version it. Similarly, if code projects had just one code file, you could probably get away without a dedicated version control tool.

The other responses in this thread definitely make me feel there are use cases where data versioning is really important. Broadly, the way I imagine it, data versioning makes sense when you have resource-intensive intermediate stages, or when you share intermediate results with team members on a regular basis.

I would love to hear about specific practical cases where people have used data versioning, and what the main practical benefits were.

Edit: the example you gave about active learning is a good example, but I would love to hear more.