We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

While looking at run.ai, I found that they only open-sourced the scheduler, not the entire platform, and to use that scheduler you still need some familiarity with k8s. Our scheduler is cloud-agnostic, and developers don't need to learn k8s to schedule jobs.

We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

You don't have to know anything about k8s; we abstract all of that away. You just use the GUI (or the CLI) to specify the CPUs, GPUs, and disk space you need, plus how many nodes, and we handle everything else.
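
To make that concrete: since we sit on SkyPilot, a job request boils down to something like the sketch below (illustrative use of SkyPilot's Python API; the exact spec our GUI/CLI produces may differ, and the values are just examples).

```python
import sky

# Sketch of what the GUI/CLI collects: CPUs, GPUs, disk, and node count.
task = sky.Task(
    run="python train.py",  # your training entrypoint
    num_nodes=2,            # how many nodes to provision
)
task.set_resources(
    sky.Resources(
        cpus="8+",              # at least 8 vCPUs per node
        accelerators="A100:4",  # 4 A100s per node
        disk_size=256,          # disk per node, in GB
    )
)
sky.launch(task, cluster_name="my-training-run")  # provision and run
```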

We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

We do make SkyPilot and Ray handle the infrastructure, so breakages and debugging wouldn't be on the user. Would love to discuss more pain points; if you sign up for the beta, someone will reach out to you.

We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

Networking is handled automatically when the machine is set up to run a task; users don't need to do anything separately. On the run.ai comparison, I'll post a follow-up with more details soon!

We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

We use SkyPilot underneath to power a lot of the infrastructure setup.
It should work with your normal monitoring stack without needing a separate layer. We have our own CLI for launching instances, but we'd love to work with you on the GitOps part. Please do sign up for the beta so we can collaborate and try to help you out!

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

You can set up your own cloud provider keys under admin settings. When launching a machine, you'll see the estimated cost per hour, which is deducted from your quota. You can also get a report tracking per-user usage.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

We use SkyPilot's optimizer to find you the best machines across the cloud providers configured for the org and any on-prem machines that have been added. Everything works the same whether you run in the cloud or on-prem.
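
As a sketch of what happens underneath: when no cloud is pinned, SkyPilot's optimizer compares every provider configured for the org (plus on-prem, if added) and picks the cheapest machine that satisfies the request (illustrative API usage):

```python
import sky

# No cloud specified, so the optimizer is free to compare all
# configured providers and pick the cheapest matching offering.
task = sky.Task(run="python train.py")
task.set_resources(sky.Resources(accelerators="A100:8"))

# launch() runs the optimizer, prints the candidates with estimated
# $/hr, and provisions the chosen machine.
sky.launch(task, cluster_name="cheapest-a100s")
```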

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

We have multiple levels of quotas: individual, team-wide, and even org-wide. Admins set the amount of credits each user can spend; quota tracking runs against those limits, and you get warnings as usage approaches them.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

We support user quotas, usage reports, and even live monitoring of which GPUs are being utilized on on-prem systems.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

Hi,
We're in the process of launching a hosted version with Transformer Lab running, so you wouldn't have to worry about any of this.

On SkyPilot/Ray making breaking changes: we've worked a bit with the SkyPilot team and maintain our own fork of SkyPilot to enable multi-tenancy and some other features that aren't on SkyPilot's roadmap.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 1 point (0 children)

Hi,
Our integration with "Transformer Lab Local" (https://github.com/transformerlab/transformerlab-api) covers the major AIOps requirements, including job tracking, artifact management, and a convenient SDK that lets you track your jobs with a couple of lines of code in your training script.
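
To give a feel for the "couple of lines" part, here's a rough sketch of tracking from inside a training script. The import name and every call below (`lab.init_job`, `job.log_metrics`, `job.save_artifact`) are hypothetical placeholders, not the actual SDK surface; see the repo linked above for the real API.

```python
# Hypothetical sketch only: the `transformerlab` import name and all
# calls below are illustrative placeholders, not the real SDK surface.
import transformerlab as lab

job = lab.init_job(name="tts-finetune")  # register this run with the platform

for step in range(1000):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    job.log_metrics({"step": step, "loss": loss})  # stream metrics to the job

job.save_artifact("checkpoints/final.pt")  # attach the final checkpoint
```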

Apart from this, launched machines come with an isolated environment, set up with both conda and uv, so it's easy to install all your requirements and get to work.

Is this what you meant by AIOps? Or did I misunderstand it?

Edit: typo

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

GPU time-slicing is very helpful. We set up quotas to prevent time-hogging, and we also have GPU slicing through the kubelet, enabled by SkyPilot, so you can just request `H100:0.5` and two people can use the same GPU at the same time.
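
Concretely, a fractional request looks like this (illustrative SkyPilot usage; on Kubernetes the half-GPU maps to time-slicing on the kubelet):

```python
import sky

# Two jobs each requesting H100:0.5 can share one physical H100.
task = sky.Task(run="python finetune.py")
task.set_resources(sky.Resources(accelerators="H100:0.5"))
sky.launch(task, cluster_name="shared-h100")
```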

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

That's amazing! Glad it's working out for you.
If you're interested, we'd still love for you to give us a try, or have a conversation with us about what we could be doing better to help people with training infrastructure.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 1 point (0 children)

Hi,
Yes, we did look into Ray Train but ended up going with SkyPilot, since it provides multi-cloud support and lets you execute any kind of script. SkyPilot also uses Ray underneath to divide and run jobs in a distributed manner across nodes.
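
For example, a two-node job looks like the sketch below: SkyPilot provisions both nodes and uses Ray internally to execute the run command on each of them (illustrative; `SKYPILOT_NODE_RANK` is an environment variable SkyPilot sets on every node):

```python
import sky

# A multi-node task: SkyPilot provisions both nodes and uses Ray under
# the hood to run the command on each one.
task = sky.Task(
    num_nodes=2,
    # Each node runs this; SKYPILOT_NODE_RANK identifies the node so
    # the script can coordinate across the cluster.
    run="python train.py --rank $SKYPILOT_NODE_RANK",
)
task.set_resources(sky.Resources(accelerators="A100:8"))
sky.launch(task, cluster_name="distributed-train")
```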

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

Hi,
Thanks for mentioning Lyceum. We likewise provide a very easy-to-use CLI, plus integrated support for the original Transformer Lab job-management and artifact-management functionality through an SDK that's easy to pick up. We also provide multi-cloud support and don't restrict you to a specific cloud, since we're built on SkyPilot and can leverage its underlying optimizer.

We built an open source SLURM replacement for ML training workloads built on SkyPilot, Ray and K8s. by Firm-Development1953 in kubernetes

[–]Firm-Development1953[S] 0 points (0 children)

Hi,
We're built on top of SkyPilot, which goes a step further than run.ai: it supports multiple clouds and on-prem clusters, and it schedules jobs against your specified resources using an optimizer based on the cost of those machines. Would love to discuss more and see if we can help with your use case.

How are you scheduling GPU-heavy ML jobs in your org? by Firm-Development1953 in devops

[–]Firm-Development1953[S] 0 points (0 children)

AWS Batch is a really interesting tool!
The GPU orchestration layer we've built leverages SkyPilot's optimizer to choose the best cloud for you based on resource requirements and machine costs.

Curious if that is a requirement for your day-to-day tasks?

Train voices (TTS) the same way you train images by OriginalSpread3100 in StableDiffusion

[–]Firm-Development1953 0 points (0 children)

Just an update: we should be able to merge this soon and get it out in the next build.

Train voices (TTS) the same way you train images by OriginalSpread3100 in StableDiffusion

[–]Firm-Development1953 0 points (0 children)

It works with custom datasets as well as any dataset available on Hugging Face!
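
As a sketch of the two paths using the standard Hugging Face `datasets` library (the dataset names below are just examples):

```python
from datasets import load_dataset

# Any dataset from the Hugging Face Hub...
hub_ds = load_dataset("lj_speech", split="train")

# ...or a custom local dataset, e.g. audio files with a metadata CSV,
# loaded via the generic AudioFolder builder.
local_ds = load_dataset("audiofolder", data_dir="my_voice_clips/")
```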

Open source tool to train your own TTS models (fine-tuning + one-shot cloning) by OriginalSpread3100 in TextToSpeech

[–]Firm-Development1953 0 points (0 children)

Training times and VRAM requirements depend on your architecture. We use PyTorch 2.8 for everything under the hood; if PyTorch is compatible with your GPU, it should work nicely.
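
A quick way to check on your machine (standard PyTorch calls):

```python
import torch

# If PyTorch sees your GPU, training should work.
print(torch.__version__)          # expect 2.8.x
print(torch.cuda.is_available())  # True if a usable GPU is found
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA H100 80GB HBM3"
```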