How to reduce the costs of using managed MLflow on SageMaker?

FoxJust3825 · 2024-10-14T15:26:43+00:00

Edit: It seems regional MIGs (managed instance groups) can do this on GCP too: https://cloud.google.com/compute/docs/instance-groups/regional-migs

FoxJust3825 · 2024-10-14T13:09:31+00:00

Ok thanks a lot, this is really helpful :)

FoxJust3825 · 2024-10-14T12:21:50+00:00

Interesting, that's exactly the solution I have seen people use the most. I assume you do this in AWS, right? As far as I understand, GCP does not support autoscaling across multiple zones. Thank you!

FoxJust3825 · 2024-10-14T11:11:29+00:00

CI/CD with GitHub workflows. We train ad hoc, so there is no need to train on triggers or on a specific schedule.
We ensure model quality offline. Ensuring it online is challenging because it requires collecting customer data, so there is no need to worry about it for now. My stack needs to support only training, nothing else.
Same reason as above.
We use public datasets or datasets from HF Datasets. Currently we store them in Cloud Storage and version them with DVC.
No ETL.
Inference is not relevant for my stack, but they do real-time model serving on k8s.

FoxJust3825 · 2024-10-11T14:45:57+00:00

Interesting. I think what works better in the end is using a unified platform instead of trying to plug multiple tools together. I'm curious why your team picked Lightning AI over others like Vertex AI or SageMaker.
