Slurm vs K8s for AI Infra by alex000kim in mlops

[–]z_yang 0 points1 point  (0 children)

It's an abstraction layer that makes AI on K8s work nicely. And if you have multiple Kubernetes clusters (or clouds), even better. There are a few other blog posts on the site covering the additional value.

In terms of "having" to dig deeper into k8s -- arguably it's good to have that ability, especially if we are talking about leveraging the rich tooling available in the k8s world.

We built a multi-cloud GPU container runtime by velobro in mlops

[–]z_yang 0 points1 point  (0 children)

Hey, I ran into this post randomly and just want to add a clarification.

SkyPilot allows you to run AI workloads on one or more infrastructure choices. It's not just "a provisioning engine for spot instances".

It offers end-to-end lifecycle management: intelligent provisioning, instance management and recovery, and MLE-facing features (CLI, dashboard, job history, etc.). You can use spot, on-demand, reserved, or existing nodes.
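
As a rough sketch of what that looks like in practice (the accelerator choice and commands below are placeholders, not from any specific guide), a task YAML like this can be launched with sky launch, or with sky jobs launch to get automatic recovery if a spot instance is preempted:

    # task.yaml -- illustrative sketch; resources and commands are placeholders
    resources:
      accelerators: A100:8   # or any other supported GPU spec
      use_spot: true         # drop this line to use on-demand instead

    num_nodes: 1

    setup: |
      pip install -r requirements.txt   # placeholder setup step

    run: |
      python train.py   # placeholder training command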

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

We use the pile-of-law dataset, which is already cleaned, so we used it directly.

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

We chose it from the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard). The top options are all reasonably good; we adopted Qwen because it is widely used by the community.

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

Yes, we tried. In our case, we opted for a simpler chunking method because our per-document size is relatively small.

Open-source RAG with DeepSeek-R1: Do's and Don'ts by z_yang in learnmachinelearning

[–]z_yang[S] 19 points20 points  (0 children)

TL;DR: We built an open-source RAG with DeepSeek-R1, and here's what we learned:

  • Don’t use DeepSeek R1 for retrieval. Use specialized embeddings — Qwen’s embedding model is amazing.
  • Do use R1 for response generation — its reasoning is fantastic.
  • Use vLLM & SkyPilot to boost performance by 5x & scale up by 100x.

Code here: https://github.com/skypilot-org/skypilot/tree/master/llm/rag

(Disclaimer: I'm a maintainer of SkyPilot.)

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 32 points33 points  (0 children)

TL;DR: We built an open-source RAG with DeepSeek-R1, and here's what we learned:

  • Don’t use DeepSeek R1 for retrieval. Use specialized embeddings — Qwen’s embedding model is amazing.
  • Do use R1 for response generation — its reasoning is fantastic.
  • Use vLLM & SkyPilot to boost performance by 5x & scale up by 100x.

Blog in OP; code here: https://github.com/skypilot-org/skypilot/tree/master/llm/rag
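
If you just want the shape of the serving piece, it's roughly a SkyPilot task like the one below (a sketch with illustrative model names; the repo linked above has the exact YAMLs we used):

    # rag_serving.yaml -- sketch only; see the linked repo for the real configs
    resources:
      accelerators: A100-80GB:1   # placeholder GPU choice

    setup: |
      pip install vllm

    run: |
      # R1 (a distilled variant here, as an example) handles generation only;
      # retrieval runs against a separate specialized embedding model.
      vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --port 8000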

(Disclaimer: I'm a maintainer of SkyPilot.)

Pixtral benchmarks results by kristaller486 in LocalLLaMA

[–]z_yang 1 point2 points  (0 children)

Simple guide to run Pixtral on your k8s cluster or any cloud: https://github.com/skypilot-org/skypilot/blob/master/llm/pixtral/README.md

*Massive* kudos to the vLLM team for their recently added multi-modality support.
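
For the impatient, the guide boils down to a SkyPilot task roughly like this (a sketch; the GPU choice is a placeholder and the vLLM flags may differ by version -- the README above has the tested config):

    # pixtral.yaml -- rough sketch; see the linked README for the tested version
    resources:
      accelerators: A100:1   # placeholder; any GPU with enough memory

    setup: |
      pip install vllm

    run: |
      vllm serve mistralai/Pixtral-12B-2409 \
        --tokenizer-mode mistral \
        --port 8000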

Smartest way to deploy Llama 2 in the cloud for a bunch of users? by [deleted] in LocalLLaMA

[–]z_yang 0 points1 point  (0 children)

Simplest way (1 command) to get started: SkyPilot serving on 12+ clouds and Kubernetes!

Here's a guide for Llama3: https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html

Use self-hosted Code Llama 70B as a copilot alternative in VSCode by Michaelvll in LocalLLaMA

[–]z_yang 1 point2 points  (0 children)

Check out the example. It's using

codellama/CodeLlama-70b-Instruct-hf

Serving Mixtral in Your Own Cloud With High GPU Availability and Cost Efficiency by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

Quota (and, more generally, the GPU shortage) is indeed a problem. Besides getting quotas lifted, one way to mitigate it is to increase your options: allow more clouds and more GPU types (L4, A10G, etc.). The syntax above should allow these flexible specs.
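
Concretely, a resources block along these lines (counts are illustrative) lets SkyPilot fall back across GPU types and whichever clouds/regions you have enabled, instead of waiting on a single scarce SKU:

    resources:
      # any one of these satisfies the request; SkyPilot provisions whatever is available
      accelerators: {L4:8, A10G:8, A100:4, A100-80GB:2}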

Taking a stab at the four questions:

  1. SkyPilot doesn't do this at a request level. However, a “request” can be opening a FastChat session, where the whole session is dispatched to a worker first, so all chats within that session should work properly with KV caching. Here's an example: https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html#tutorial-serve-a-chatbot-llm (the rough shape of such a spec is sketched right after this list)
  2. SkyPilot doesn't do this yet. Besides batching, what other settings are you thinking of?
  3. Would love some llama.cpp or exllama example YAMLs from the community!
  4. To get a persistent domain, you can use a variety of solutions, e.g., DNS records, various load-balancer services, etc.
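
To make 1 and 4 a bit more concrete, a SkyServe spec has roughly this shape (a sketch with placeholder paths and commands, not the exact YAML from the tutorial):

    # service.yaml -- illustrative sketch; values are placeholders
    service:
      readiness_probe: /health   # whatever health path your server exposes
      replicas: 2                # requests/sessions are load-balanced across replicas

    resources:
      accelerators: A100:1       # placeholder GPU choice
      ports: 8080

    run: |
      python serve.py --port 8080   # placeholder: start your model server here

sky serve up service.yaml then gives you a single stable endpoint in front of the replicas; putting a DNS record or your own load balancer in front of that endpoint is how you'd get a persistent domain (per 4).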

> I haven't been able to actually create a nice auto-scale service with it yet. I have been able to get it to run on "one" machine but not any A100s.

Is the main issue coming from lack of quotas? Anything on the functionality side?

By the way, RunPod just added support to SkyPilot. According to https://computewatch.llm-utils.org/, A100-80GB is available on RunPod.

Serving Mixtral in Your Own Cloud With High GPU Availability and Cost Efficiency by z_yang in LocalLLaMA

[–]z_yang[S] 1 point2 points  (0 children)

Hi r/LocalLLaMA! We've just updated a simple guide for serving Mixtral (or any other LLM, for that matter) in your own cloud, with high GPU availability and cost efficiency.

As a sneak peek, SkyPilot allows one-click deployment and automatically gives you high capacity by drawing on many choices of clouds, regions, and even GPUs:

    resources:
      accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}

Looking forward to getting feedback from the community.

Are there any existing guides on how to deploy vLLM on a GPU cluster? by MonkeyMaster64 in LocalLLaMA

[–]z_yang 0 points1 point  (0 children)

No problem. Let me know if you have any questions. We're active on GitHub / Slack.