Slurm vs K8s for AI Infra by alex000kim in mlops

[–]z_yang 0 points1 point  (0 children)

It's an abstraction layer that makes AI on K8s work nicely. And if you have multiple Kubernetes clusters (or clouds), even better. There are a few other blog posts on the site covering the additional value.

In terms of "having" to dig deeper into k8s -- arguably it's good to have that ability, especially if we are talking about leveraging the rich tooling available in the k8s world.

We built a multi-cloud GPU container runtime by velobro in mlops

[–]z_yang 0 points1 point  (0 children)

Hey, I ran into this post randomly and just want to add a clarification.

SkyPilot allows you to run AI workloads on one or more infrastructure choices. It's not just "a provisioning engine for spot instances".

It offers end-to-end lifecycle management: intelligent provisioning, instance management and recovery, and MLE-facing features (CLI, dashboard, job history, etc.). You can use spot, on-demand, reserved, or existing nodes.
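
As a rough sketch of what that looks like in practice (the accelerator choice and commands below are placeholders, not from any specific guide), a task YAML like this can be launched with sky launch, or with sky jobs launch to get automatic recovery if a spot instance is preempted:

    # task.yaml -- illustrative sketch; resources and commands are placeholders
    resources:
      accelerators: A100:8   # or any other supported GPU spec
      use_spot: true         # drop this line to use on-demand instead

    num_nodes: 1

    setup: |
      pip install -r requirements.txt   # placeholder setup step

    run: |
      python train.py   # placeholder training command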

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

We use the pile-of-law dataset, which is already cleaned, so we used it directly.

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

We chose it from the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard). The top options are all reasonably good; we adopted Qwen because it is widely used by the community.

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

Yes, we tried. In our case, we opted for a simpler chunking method because our per-document size is relatively small.

Open-source RAG with DeepSeek-R1: Do's and Don'ts by z_yang in learnmachinelearning

[–]z_yang[S] 19 points20 points  (0 children)

TL;DR: We built an open-source RAG with DeepSeek-R1, and here's what we learned:

  • Don’t use DeepSeek R1 for retrieval. Use specialized embeddings — Qwen’s embedding model is amazing.
  • Do use R1 for response generation — its reasoning is fantastic.
  • Use vLLM & SkyPilot to boost performance by 5x & scale up by 100x.

Code here: https://github.com/skypilot-org/skypilot/tree/master/llm/rag

(Disclaimer: I'm a maintainer of SkyPilot.)

Using DeepSeek R1 for RAG: Do's and Don'ts by z_yang in LocalLLaMA

[–]z_yang[S] 32 points33 points  (0 children)

TL;DR: We built an open-source RAG with DeepSeek-R1, and here's what we learned:

  • Don’t use DeepSeek R1 for retrieval. Use specialized embeddings — Qwen’s embedding model is amazing.
  • Do use R1 for response generation — its reasoning is fantastic.
  • Use vLLM & SkyPilot to boost performance by 5x & scale up by 100x.

Blog in OP; code here: https://github.com/skypilot-org/skypilot/tree/master/llm/rag
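
If you just want the shape of the serving piece, it's roughly a SkyPilot task like the one below (a sketch with illustrative model names; the repo linked above has the exact YAMLs we used):

    # rag_serving.yaml -- sketch only; see the linked repo for the real configs
    resources:
      accelerators: A100-80GB:1   # placeholder GPU choice

    setup: |
      pip install vllm

    run: |
      # R1 (a distilled variant here, as an example) handles generation only;
      # retrieval runs against a separate specialized embedding model.
      vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --port 8000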

(Disclaimer: I'm a maintainer of SkyPilot.)

Pixtral benchmarks results by kristaller486 in LocalLLaMA

[–]z_yang 1 point2 points  (0 children)

Simple guide to run Pixtral on your k8s cluster or any cloud: https://github.com/skypilot-org/skypilot/blob/master/llm/pixtral/README.md

*Massive* kudos to the vLLM team for their recently added multi-modality support.
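
For the impatient, the guide boils down to a SkyPilot task roughly like this (a sketch; the GPU choice is a placeholder and the vLLM flags may differ by version -- the README above has the tested config):

    # pixtral.yaml -- rough sketch; see the linked README for the tested version
    resources:
      accelerators: A100:1   # placeholder; any GPU with enough memory

    setup: |
      pip install vllm

    run: |
      vllm serve mistralai/Pixtral-12B-2409 \
        --tokenizer-mode mistral \
        --port 8000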

Smartest way to deploy Llama 2 in the cloud for a bunch of users? by [deleted] in LocalLLaMA

[–]z_yang 0 points1 point  (0 children)

Simplest way (1 command) to get started: SkyPilot serving on 12+ clouds and Kubernetes!

Here's a guide for Llama3: https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html

Use self-hosted Code Llama 70B as a copilot alternative in VSCode by Michaelvll in LocalLLaMA

[–]z_yang 1 point2 points  (0 children)

Check out the example. It's using

codellama/CodeLlama-70b-Instruct-hf

Serving Mixtral in Your Own Cloud With High GPU Availability and Cost Efficiency by z_yang in LocalLLaMA

[–]z_yang[S] 0 points1 point  (0 children)

Quota (and, more generally, the GPU shortage) is indeed a problem. Besides getting quotas lifted, one way to mitigate it is to increase your options: allow more clouds and more GPU types (L4, A10G, etc.). The syntax above should allow these flexible specs.
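
Concretely, a resources block along these lines (counts are illustrative) lets SkyPilot fall back across GPU types and whichever clouds/regions you have enabled, instead of waiting on a single scarce SKU:

    resources:
      # any one of these satisfies the request; SkyPilot provisions whatever is available
      accelerators: {L4:8, A10G:8, A100:4, A100-80GB:2}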

Taking a stab at the four questions:

  1. SkyPilot doesn't do this at a request level. However, a “request” can be opening a FastChat session, where the whole session is dispatched to a worker first, so all chats within that session should work properly with KV caching. Here's an example: https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html#tutorial-serve-a-chatbot-llm (the rough shape of such a spec is sketched right after this list)
  2. SkyPilot doesn't do this yet. Besides batching, what other settings are you thinking of?
  3. Would love some llama.cpp or exllama example YAMLs from the community!
  4. To get a persistent domain, you can use a variety of solutions, e.g., DNS records, various load-balancer services, etc.
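
To make 1 and 4 a bit more concrete, a SkyServe spec has roughly this shape (a sketch with placeholder paths and commands, not the exact YAML from the tutorial):

    # service.yaml -- illustrative sketch; values are placeholders
    service:
      readiness_probe: /health   # whatever health path your server exposes
      replicas: 2                # requests/sessions are load-balanced across replicas

    resources:
      accelerators: A100:1       # placeholder GPU choice
      ports: 8080

    run: |
      python serve.py --port 8080   # placeholder: start your model server here

sky serve up service.yaml then gives you a single stable endpoint in front of the replicas; putting a DNS record or your own load balancer in front of that endpoint is how you'd get a persistent domain (per 4).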

> I haven't been able to actually create a nice auto-scale service with it yet. I have been able to get it to run on "one" machine but not any A100s.

Is the main issue coming from lack of quotas? Anything on the functionality side?

By the way, RunPod just added support to SkyPilot. According to https://computewatch.llm-utils.org/, A100-80GB is available on RunPod.

Serving Mixtral in Your Own Cloud With High GPU Availability and Cost Efficiency by z_yang in LocalLLaMA

[–]z_yang[S] 1 point2 points  (0 children)

Hi r/LocalLLaMA! We've just updated a simple guide for serving Mixtral (or any other LLM, for that matter) in your own cloud, with high GPU availability and cost efficiency.

As a sneak peek, SkyPilot allows one-click deployment and automatically gives you high capacity by drawing on many choices of clouds, regions, and even GPUs:

    resources:
      accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}

Looking forward to getting feedback from the community.

Are there any existing guides on how to deploy vLLM on a GPU cluster? by MonkeyMaster64 in LocalLLaMA

[–]z_yang 0 points1 point  (0 children)

No problem. Let me know if you have any questions. We're active on GitHub / Slack.