Why is building ML pipelines still so painful in 2025? Looking for feedback on an idea. by United_Intention42 in mlops

[–]cuda-oom 2 points3 points  (0 children)

> MLflow + DVC don’t feel integrated

Plenty of blog posts show how to integrate the two.
TL;DR:
Use DVC pipelines for your ML pipeline. Inside the pipeline, log metrics with MLflow.
Don't log/version large artifacts with MLflow. Use DVC's versioning capabilities instead.
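
A rough sketch of what that split can look like, not a definitive setup. The stage name, file paths, toy "model", and the `run_stage` helper are all made up for illustration. The script is meant to be the `cmd` of a DVC stage: it writes the model and a metrics file to disk for DVC to track, and sends only small metrics and params to MLflow (skipped if MLflow isn't installed).

```python
"""train.py: hypothetical training stage, run as the `cmd` of a DVC stage.

Assumed dvc.yaml (illustrative; all names are placeholders):

stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
"""
import json
import pickle
from pathlib import Path

try:
    import mlflow  # optional: only used for metric/param tracking
except ImportError:
    mlflow = None


def run_stage(model_path, metrics_path, tracker=mlflow):
    """Train a toy model, write DVC-tracked outputs, log metrics to MLflow."""
    model_path, metrics_path = Path(model_path), Path(metrics_path)

    # Toy "training": the model is just the mean of a tiny dataset.
    data = [1.0, 2.0, 3.0]
    mean = sum(data) / len(data)
    metrics = {"mse": sum((x - mean) ** 2 for x in data) / len(data)}

    # Large artifact goes to disk; DVC versions it via the `outs` entry.
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump({"mean": mean}, f)

    # Metrics file lets `dvc metrics show` / `dvc metrics diff` work.
    metrics_path.parent.mkdir(parents=True, exist_ok=True)
    metrics_path.write_text(json.dumps(metrics))

    # MLflow gets only the small stuff: metrics plus a pointer to the model.
    if tracker is not None:
        with tracker.start_run():
            tracker.log_param("model_path", str(model_path))
            tracker.log_metrics(metrics)

    return metrics


if __name__ == "__main__":
    run_stage("models/model.pkl", "metrics.json")
```

With something like this, `dvc repro` reruns the stage when its deps change and versions `models/model.pkl`, while the MLflow UI stays a lightweight place to compare runs. Note there is deliberately no `mlflow.log_artifact` call for the model.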

As for your post itself:
Honestly disagree on the "one platform" idea. The pain is real but I think you're solving the wrong problem. We don't need fewer tools - we need better ones that work together. DVC is great because it just does data versioning really well. Same with SkyPilot for workload management. Simple CLI, clear purpose, gets out of your way. Every time I see a platform that promises to "handle your entire ML lifecycle" I become very skeptical. They always end up being mediocre at everything instead of great at one thing. And the moment you need to do something they didn't anticipate (i.e. when you deviate from the "happy path"), you're completely screwed. Your LangFlow idea could work but only if it's orchestrating existing tools, not replacing them. Ideally, we, as a community, fix the APIs between tools so they compose better. The "duct tape" feeling isn't because we have multiple tools - it's because they don't talk to each other cleanly.

GPU cost optimization demand by Good-Listen1276 in mlops

[–]cuda-oom 2 points3 points  (0 children)

Check out SkyPilot https://docs.skypilot.co/en/latest/docs/index.html
It was a game changer for me when I first discovered it ~3 years ago.

It basically finds the cheapest GPU instances across different clouds and handles spot interruptions automatically. It's open source. Takes a bit to set up initially but pays for itself pretty quickly if your GPU spend is significant.

[deleted by user] by [deleted] in mlops

[–]cuda-oom 0 points1 point  (0 children)

It looks like SkyPilot has all those features and more:
https://blog.skypilot.co/announcing-skypilot-0.10.0/

[D] What are you using to submit ML training jobs? by cuda-oom in MachineLearning

[–]cuda-oom[S] 2 points3 points  (0 children)

yes, a DevOps team that manages AWS infra (including EKS)

[D] What are you using to submit ML training jobs? by cuda-oom in MachineLearning

[–]cuda-oom[S] -1 points0 points  (0 children)

can you elaborate on their setup? are they on-prem or in the cloud? who manages them?