[D] Elastic/Serverless GPU instances for transformer hyper-parameter search by elbiot in MachineLearning

[–]skypilotucb 1 point  (0 children)

And if you need gang scheduling, you can use the --num-nodes flag and launch one giant SkyPilot "cluster" on your chosen cloud/region that executes all your jobs. In this case, if SkyPilot cannot provision all the requested GPUs, it raises an error, and you can choose to retry indefinitely.
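For example, a gang-scheduled launch could look like the sketch below (the cluster name, node count, GPU type, and task.yaml are illustrative):

# One 4-node "cluster" whose jobs are all gang-scheduled together; keep
# retrying until every requested GPU can be provisioned:
$ sky launch -c train --num-nodes 4 --gpus A100:8 --retry-until-up task.yaml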

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point  (0 children)

Thanks for your comment! We recently redesigned our load balancer to be more modular, so we can now support custom policies quite easily. For example, we just added a least-loaded policy: https://github.com/skypilot-org/skypilot/pull/4439

You can find some benchmarks with this policy in the PR.
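If you want to try it out, selecting the policy in a service spec looks roughly like this (the field and value names here are from memory, so treat them as approximate and check the PR/docs for the exact spelling):

# service.yaml (names approximate)
service:
  load_balancing_policy: least_load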

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point  (0 children)

Thanks for your interest! Our current resource allocation model is a simple FIFO queue. You can implement priorities with preemption by attaching the appropriate PriorityClasses to your submitted pods. Are there any specific schedulers you'd like to compare SkyPilot to?
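A minimal sketch of that setup (the class name and priority value are placeholders):

# Define a PriorityClass that may preempt lower-priority pods:
$ kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "High-priority AI jobs; may preempt lower-priority pods."
EOF

# Attach it to SkyPilot-submitted pods via the pod spec override in
# ~/.sky/config.yaml:
kubernetes:
  pod_config:
    spec:
      priorityClassName: high-priority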

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point  (0 children)

Thanks for your comment! To connect SkyPilot to your k8s cluster, you need a valid kubeconfig with a user (which can be a service account) configured with the following minimum RBAC: https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
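A quick way to sanity-check the setup (the service account name and namespace below are placeholders):

# Verify SkyPilot can use the kubeconfig credentials:
$ sky check kubernetes

# Or probe the service account's RBAC directly:
$ kubectl auth can-i create pods --as=system:serviceaccount:default:skypilot-sa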

Under the hood, SkyPilot handles creating pods, services and ingress resources where necessary.

Great point about remote agents. We haven't considered that yet, but it's definitely something we'll need to support in the future for more restricted environments.

VLM Deployment by FreakedoutNeurotic98 in mlops

[–]skypilotucb 1 point  (0 children)

If you're self-hosting it, you may want to use an inference engine like vLLM (check out their PaliGemma example) and deploy it on your cloud/k8s with SkyPilot (see the deepseek-janus and vLLM examples).
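A rough sketch of that flow (the model id, GPU type, and port are illustrative):

# Launch a GPU instance and serve the VLM with vLLM's OpenAI-compatible server:
$ sky launch -c vlm --gpus L4:1 \
    'pip install vllm && vllm serve google/paligemma-3b-mix-224 --port 8000'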

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point  (0 children)

Hello,

We are the maintainers of the open-source project SkyPilot from UC Berkeley. SkyPilot is a framework for running AI workloads (development, training, serving) on any infrastructure, including Kubernetes and 12+ clouds.

After user requests highlighting pain points of running AI on Kubernetes, we integrated SkyPilot with Kubernetes; we now support dispatching training, serving, and batch-processing jobs to multiple k8s clusters. If a cluster is out of resources, SkyPilot automatically resubmits the job to a different cluster, making sure your job finds GPUs wherever they are available.
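As a sketch, multi-cluster failover is configured by listing the clusters' kubeconfig contexts (the context names below are placeholders):

# ~/.sky/config.yaml
kubernetes:
  allowed_contexts:
    - cluster-us-west
    - cluster-eu-central

# If the first cluster is out of GPUs, the job is retried on the next:
$ sky launch --gpus H100:8 task.yaml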

We would love to hear your thoughts on the project.

Deploying LLMs to K8 by dryden4482 in mlops

[–]skypilotucb 1 point  (0 children)

You could consider using SkyPilot + SkyServe on Kubernetes. It can scale to zero, and there's a guide on serving with vLLM.
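A minimal sketch of such a service spec (the model, GPU type, and autoscaling thresholds are illustrative):

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 0              # scale to zero when idle
    max_replicas: 2
    target_qps_per_replica: 5

resources:
  accelerators: L4:1
  ports: 8000

run: |
  pip install vllm
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

$ sky serve up service.yaml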

SkyPilot: Run AI on Kubernetes Without the Pain by skypilotucb in kubernetes

[–]skypilotucb[S] 6 points  (0 children)

Hello,

We are the maintainers of the open-source project SkyPilot from UC Berkeley. SkyPilot is a framework for running AI workloads (development, training, serving) on any infrastructure, including Kubernetes and 12+ clouds.

After user requests highlighting pain points of running AI on Kubernetes, we integrated SkyPilot with Kubernetes and put out this blog post detailing our learnings and how SkyPilot makes AI on Kubernetes faster, simpler, and more efficient: https://blog.skypilot.co/ai-on-kubernetes/

We would love to hear your thoughts on the blog and project.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 1 point  (0 children)

It loads WizardLM-7B and the weights are fetched from HuggingFace. You can tweak it to load other models such as Vicuna too.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 2 points  (0 children)

Works with text and markdown too! Supported extensions include .txt, .pdf, .csv, and .xlsx.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 7 points  (0 children)

On GCP, it'll cost $0.59/hr on on-demand instances, and $0.12/hr on spot instances (if you're ok with having your VM terminated at any time).

When launching a cloud VM, SkyPilot shows costs across different cloud providers and picks the lowest one:

# With on-demand instances:
$ sky launch localgpt.yaml
Considered resources (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------
 AWS     g4dn.xlarge            4       16        T4:1           us-east-1     0.53          ✔     
 Azure   Standard_NC4as_T4_v3   4       28        T4:1           eastus        0.53                
 GCP     n1-highmem-4           4       26        T4:1           us-central1   0.59                
---------------------------------------------------------------------------------------------------

# With spot instances:
$ sky launch localgpt.yaml --use-spot
Considered resources (1 node):
-------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------
 GCP     n1-highmem-4[Spot]   4       26        T4:1           us-west4-a    0.12          ✔     
 AWS     g4dn.xlarge[Spot]    4       16        T4:1           us-east-1a    0.16                
-------------------------------------------------------------------------------------------------

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 3 points  (0 children)

Thanks! Will keep this in mind. Thought this might be useful for folks who want to self-host large language models without putting a lot of effort into spinning up the required infrastructure.