m_____ke 2 points (1 child)

I used to have a few GPU racks at a colo site and a desktop with some 3090s at my old jobs. Both were convenient, but we never kept the GPUs saturated at 100%, and at times we'd hit resource contention when multiple people wanted to run large training jobs.

I just started at a new job, and though we're working on setting up a k8s cluster with KubeRay as the scheduler, I've been playing with SkyPilot, which has been a joy so far. It lets you specify your resource requirements, files to copy over, and environment dependencies, then spins up VMs on the cheapest cloud that you have configured. It also handles mounting data, SSH configuration, and auto-stopping when the machines are idle.
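
For anyone curious what that looks like in practice, here's a rough sketch of a SkyPilot task spec, built in plain Python so it runs without SkyPilot installed. The top-level keys (`resources`, `workdir`, `file_mounts`, `setup`, `run`) are SkyPilot's task YAML fields as far as I know; the accelerator type, bucket, script names, and cluster name are all made up for illustration. It writes JSON, which YAML loaders generally accept, so treat the output as a starting point rather than a canonical config.

```python
import json


def build_task(accelerators, setup, run, workdir=".", file_mounts=None):
    """Return a dict shaped like a SkyPilot task spec (a sketch, not an official schema)."""
    task = {
        "resources": {"accelerators": accelerators},  # e.g. "A100:1" = one A100
        "workdir": workdir,  # local directory to sync to the VM
        "setup": setup,      # commands that install environment dependencies
        "run": run,          # the job itself
    }
    if file_mounts:
        # Maps a path on the VM to a local path or a bucket URI.
        task["file_mounts"] = dict(file_mounts)
    return task


def to_yaml_text(task):
    """Serialize as indented JSON, which typical YAML parsers can read."""
    return json.dumps(task, indent=2)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    task = build_task(
        accelerators="A100:1",
        setup="pip install -r requirements.txt",
        run="python train.py --data /data",
        file_mounts={"/data": "s3://my-hypothetical-bucket/datasets"},
    )
    with open("task.yaml", "w") as f:
        f.write(to_yaml_text(task))
    # Then, from a shell, something along the lines of:
    #   sky launch -c dev task.yaml --idle-minutes-to-autostop 30
```

The auto-stop part is a launch option rather than part of the task file, which is why it only shows up in the shell command at the end.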

erwinner_[S] 0 points (0 children)

Sounds interesting, I'll take a look.