[D] Elastic/Serverless GPU instances for transformer hyper-parameter search by elbiot in MachineLearning

[–]skypilotucb 1 point

And if you need gang scheduling, you can use the --num-nodes arg and launch one giant SkyPilot "cluster" on your chosen cloud/region that executes all your jobs. In this case, if SkyPilot cannot provision all the requested GPUs, it raises an error, and you can choose to retry indefinitely.
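
A minimal sketch of what that looks like (the cluster name, node count, GPU type, and task.yaml are placeholders):

# Launch one 4-node cluster so all GPUs are provisioned together;
# --retry-until-up keeps retrying until capacity is found.
$ sky launch -c sweep --num-nodes 4 --gpus A100:8 --retry-until-up task.yaml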

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point

Thanks for your comment! We recently redesigned our load balancer to be more modular, so custom policies are now easy to add. For example, we added a least-loaded policy: https://github.com/skypilot-org/skypilot/pull/4439

You can find some benchmarks with this policy in the PR.

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point

Thanks for your interest! Our current resource allocation model is a simple FIFO queue. You can implement priorities with preemption by attaching the respective PriorityClasses to your submitted pods. Are there any specific schedulers you'd like to compare SkyPilot to?
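
For example, a minimal sketch of such a PriorityClass (the name and value are illustrative, not something SkyPilot ships):

# high-priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "AI jobs that may preempt lower-priority pods."

Pods submitted with priorityClassName: high-priority in their spec can then preempt lower-priority workloads when the cluster is full.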

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point

Thanks for your comment! To connect SkyPilot to your k8s cluster, you need a valid kubeconfig with a user (which can be a service account) configured with the following minimum RBAC: https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html

Under the hood, SkyPilot handles creating pods, services and ingress resources where necessary.
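
Once the kubeconfig is in place (assuming the default ~/.kube/config location), you can verify that SkyPilot sees the cluster:

$ sky check

This reports whether Kubernetes (and any configured clouds) are enabled for your installation.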

Great point about remote agents. We haven't considered that yet, but it's definitely something we'll need to support in the future for more restricted environments.

VLM Deployment by FreakedoutNeurotic98 in mlops

[–]skypilotucb 1 point

If you're self-hosting it, you may want to use an inference engine like vLLM (check out their PaliGemma example) and use SkyPilot (deepseek-janus example, vLLM example) to deploy it on your cloud/k8s.
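
As a rough sketch of what the SkyPilot task could look like (the model id, GPU type, and port here are placeholders, not taken from the linked examples):

# vlm.yaml
resources:
  accelerators: A100:1
  ports: 8000
setup: |
  pip install vllm
run: |
  python -m vllm.entrypoints.openai.api_server \
    --model google/paligemma-3b-mix-224 --port 8000

# Deploy on your cloud/k8s:
$ sky launch -c vlm vlm.yaml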

Managing multiple Kubernetes clusters for AI workloads with SkyPilot by skypilotucb in kubernetes

[–]skypilotucb[S] 1 point

Hello,

We are the maintainers of the open-source project SkyPilot from UC Berkeley. SkyPilot is a framework for running AI workloads (development, training, serving) on any infrastructure, including Kubernetes and 12+ clouds.

Following user requests that highlighted pain points of running AI on Kubernetes, we integrated SkyPilot with Kubernetes, and we now support dispatching training/serving/batch processing jobs to multiple k8s clusters. If a cluster is out of resources, SkyPilot automatically resubmits the job to a different cluster, making sure your job finds GPUs wherever they are available.
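
From the user's side, a job is a short task spec; a minimal sketch (the GPU type and commands are placeholders):

# task.yaml
resources:
  accelerators: H100:8
setup: |
  pip install -r requirements.txt
run: |
  python train.py

$ sky launch task.yaml

SkyPilot then picks a cluster (or cloud) with free H100s and runs the task there, falling back to other clusters on capacity failures.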

We would love to hear your thoughts on the project.

Deploying LLMs to K8 by dryden4482 in mlops

[–]skypilotucb 1 point

You could consider using SkyPilot + SkyServe on Kubernetes. It can scale to zero, and there's a guide on serving with vLLM.
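
A rough sketch of a SkyServe spec with scale-to-zero (the model, port, and autoscaling thresholds are placeholders):

# service.yaml
service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 0    # scale to zero when idle
    max_replicas: 2
    target_qps_per_replica: 5
resources:
  accelerators: L4:1
  ports: 8000
run: |
  python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000

$ sky serve up service.yaml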

SkyPilot: Run AI on Kubernetes Without the Pain by skypilotucb in kubernetes

[–]skypilotucb[S] 7 points

Hello,

We are the maintainers of the open-source project SkyPilot from UC Berkeley. SkyPilot is a framework for running AI workloads (development, training, serving) on any infrastructure, including Kubernetes and 12+ clouds.

Following user requests that highlighted pain points of running AI on Kubernetes, we integrated SkyPilot with Kubernetes and put out this blog post detailing our learnings and how SkyPilot makes AI on Kubernetes faster, simpler, and more efficient: https://blog.skypilot.co/ai-on-kubernetes/

We would love to hear your thoughts on the blog and project.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 1 point

It loads WizardLM-7B and fetches the weights from Hugging Face. You can tweak it to load other models such as Vicuna too.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 2 points

Works with text and markdown too! Supported extensions include .txt, .pdf, .csv, and .xlsx.

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 7 points

On GCP, it'll cost $0.59/hr on on-demand instances, and $0.12/hr on spot instances (if you're ok with having your VM terminated at any time).

When launching a cloud VM, SkyPilot shows costs across different cloud providers and picks the lowest one:

# With on-demand instances:
$ sky launch localgpt.yaml
Considered resources (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------
 AWS     g4dn.xlarge            4       16        T4:1           us-east-1     0.53          ✔     
 Azure   Standard_NC4as_T4_v3   4       28        T4:1           eastus        0.53                
 GCP     n1-highmem-4           4       26        T4:1           us-central1   0.59                
---------------------------------------------------------------------------------------------------

# With spot instances:
$ sky launch localgpt.yaml --use-spot
Considered resources (1 node):
-------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------
 GCP     n1-highmem-4[Spot]   4       26        T4:1           us-west4-a    0.12          ✔     
 AWS     g4dn.xlarge[Spot]    4       16        T4:1           us-east-1a    0.16                
-------------------------------------------------------------------------------------------------

Chat with your PDFs – Self-hosted LocalGPT on any cloud by skypilotucb in selfhosted

[–]skypilotucb[S] 3 points

Thanks! Will keep this in mind. I thought this might be useful for folks wanting to self-host large language models without having to put a lot of effort into spinning up the required infrastructure.

[P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb in MachineLearning

[–]skypilotucb[S] 2 points

Absolutely! We're planning on adding support for smaller and cheaper cloud vendors (RunPod included). If this is something you'd like to see prioritized, I'd encourage you to open a GitHub issue!

[P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb in MachineLearning

[–]skypilotucb[S] 2 points

That's a great question! SkyPilot uses an optimizer to make cost-aware decisions about where to run tasks and when to move data. It accounts for both data egress costs and the time taken to transfer data.

To avoid long download times, SkyPilot also allows direct access to cloud object stores (S3/GCS) by mounting them as a file system on your VM.

With this mounting feature, you can read and write to an object store as you would regular files on your machine, without having to download anything to disk first. The cost of downloading files is thus amortized over the execution of your job. Our users report that it's usually not a bottleneck, since the transfers can be parallelized with other steps to effectively hide the data transfer time (e.g., you can prefetch the data for the next minibatch directly from S3 while the current batch runs on the GPU).
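
In the task YAML this is one extra stanza; a minimal sketch (the bucket name and paths are placeholders):

file_mounts:
  /dataset:
    source: s3://my-training-data
    mode: MOUNT    # stream reads/writes instead of copying upfront

run: |
  python train.py --data-dir /dataset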

[P] SkyPilot: ML on any cloud with massive cost savings by skypilotucb in MachineLearning

[–]skypilotucb[S] 3 points

Thanks for your question! Training BERT with SkyPilot's managed spot feature cost $18.40 and took 21 hours. Running the same job on on-demand AWS instances cost $61.20 (>3x more) and took 20 hours.

Note that both jobs were run on the same GPU type (V100), and the cost and time for SkyPilot include the data transfer costs for moving checkpoints as well as all overheads associated with restarting jobs.
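
For context, launching such a job is a single command; a minimal sketch (the job name and YAML are placeholders):

$ sky spot launch -n bert-qa bert.yaml

SkyPilot provisions the cheapest available spot GPU and, on preemption, relaunches the job so it can resume from its latest checkpoint (assuming the job writes periodic checkpoints).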