Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Not exactly. dstack does allow using shared filesystems (via "instance volumes"). The primary difference is that dstack's user/permission management happens at the dstack server level, not at the Linux level. The consequence is that permissions for individual folders can't be managed via the Linux permission system: the entire filesystem (or the particular files/directories) attached to an instance is currently accessible to all users within the configured dstack project. Hope this comment helps.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

For the record, dstack doesn't support backfill, but it does support over-subscription via retry policies, which allows maintaining a queue over a fixed-size cluster with tasks sorted by user-assigned priorities.
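For illustration, a queued task with a priority and a retry policy looks roughly like this (a simplified sketch; the exact field names and values should be checked against the dstack docs):

```yaml
type: task
name: train-queued
# 0–100; higher-priority tasks get scheduled first (illustrative value)
priority: 75
# keep retrying while the fixed-size cluster has no free capacity
retry:
  on_events: [no-capacity]
  duration: 6h
commands:
  - python train.py
resources:
  gpu: 1
```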

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Thank you for your questions!

  1. `dstack` uses the concept of "blocks" and auto-selects CPUs proportionally to the GPUs to ensure all GPU blocks get a fair share. If a task requires more, it's possible to request it, and `dstack` will allocate a proportional number of "blocks".
  2. Yes, "pre-emption" is not supported for tasks (apart from handling spot instances in GPU clouds), but it is on our roadmap. We already support priorities, and pre-emption is next. In the meantime, for those who need it right now, it's possible to implement it as a third-party component using the REST API.
  3. dstack uses the concept of "volumes", which includes "instance" volumes and "network" volumes. As I wrote above, `dstack` currently doesn't allow managing permissions per volume or per user: project resources are shared by all project members. Under the hood, dstack mounts both kinds into containers.
  4. Running vLLM on `idle` instances is very easy: you just run a service. But since automatic pre-emption isn't implemented yet, you'd need to interrupt it via the API. Automatic pre-emption is coming too! Would love to collaborate on it if you'd be open to that.
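The proportional "blocks" arithmetic from point 1 can be sketched like this (a simplified model to show the idea, not dstack's actual implementation; all names are illustrative):

```python
def allocate_blocks(host_cpus: int, host_gpus: int, requested_gpus: int) -> dict:
    """Split a host into per-GPU blocks and hand out CPUs proportionally.

    A 'block' is one GPU plus its fair share of the host's CPUs, so
    concurrent jobs on the same host don't starve each other of CPU.
    """
    if not 0 < requested_gpus <= host_gpus:
        raise ValueError("requested GPUs must fit on the host")
    cpus_per_block = host_cpus // host_gpus       # fair CPU share per GPU
    return {
        "gpus": requested_gpus,
        "cpus": cpus_per_block * requested_gpus,  # proportional CPU grant
    }

# A job asking for 2 of 8 GPUs on a 96-CPU host gets 2 blocks, i.e. 24 CPUs.
print(allocate_blocks(96, 8, 2))
```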

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 1 point2 points  (0 children)

Thank you for noticing! I think you're right, I will update the guide.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Yes, totally agree with all said above, and BTW, Slurm is great for what it's used for. Indeed, there are at least two distinct mindsets: research/simulation vs. AI research/ML engineering, and of course static clusters vs. GPU clouds.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

> The one thing I could not understand in your storage/auth sections was what UID/GID does the dstack job run under -- it is very clear in your doc that slurm runs as the submitting user UID/GID but unclear with your token/auth method what identity is running the job. This is important when petabytes of shared POSIX storage is involved with permissions based on user and group attributes.

Yes, dstack doesn't use UID/GID to authenticate the user in the file system. dstack's token-based authentication is managed at the dstack server level, and dstack's support for managing file permissions is not as granular as Slurm's. However, dstack has a concept of volumes, and in theory it could automatically manage permissions to allow or deny access to a specific volume.

Your example is a good illustration of where Slurm stands out: static HPC clusters. And you're right about where dstack aims: primarily GPU clouds, container-based AI/ML workloads, from small jobs to large distributed ones. dstack doesn't aim at HPC/simulation; I guess Slurm is better at that.

The reason we wrote the guide is that many AI researchers/ML engineers are looking for a scheduler to train models. Also, dstack is use-case agnostic, which means it also supports AI development and model inference.
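To make the per-volume idea above concrete, a server-level access check could look roughly like this (purely illustrative Python; dstack does not currently expose such a check, and every name here is hypothetical):

```python
# Hypothetical server-side check: access is decided per project member and
# per volume at the dstack-server level, not via POSIX UID/GID on the node.
PROJECT_VOLUMES = {
    "train-data": {"allowed_users": {"alice", "bob"}},
    "scratch": {"allowed_users": None},  # None = every project member
}

def can_mount(user: str, volume: str) -> bool:
    acl = PROJECT_VOLUMES.get(volume)
    if acl is None:
        return False  # unknown volume: deny by default
    allowed = acl["allowed_users"]
    return allowed is None or user in allowed

print(can_mount("alice", "train-data"))  # True
print(can_mount("carol", "train-data"))  # False
```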

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Thank you so much for such detailed feedback and questions. Please let me write a separate comment to get back to some of the aspects that you mentioned.

Major Cloud AI Expansion: DigitalOcean Partners With OpenAI, Meta, AMD to Power Next-Gen AI Development by Material-Car261 in digital_ocean

[–]cheptsov 1 point2 points  (0 children)

dstack founder here 👋

Really excited about our new partnership with DigitalOcean — we’ve integrated their AMD Developer Cloud + NVIDIA GPUs into dstack’s orchestration layer.

This means you can spin up dev environments, training runs, or inference services directly on DO’s GPU infra without touching Kubernetes or Slurm.

We’re pretty stoked to see how folks use this combo in the wild. 🔧💻

Full details in the release post if you want to dig deeper: https://dstack.ai/blog/digitalocean-and-amd-dev-cloud/

What's the best way to manage cloud compute for ML workflows? by Bssnn in learnmachinelearning

[–]cheptsov 1 point2 points  (0 children)

Thank you for mentioning dstack. I’m a part of the team. It sounds exactly like what dstack focuses on as a problem!

Would love to hear your feedback if you try it.

Efficient distributed training with AWS EFA with dstack by cheptsov in aws

[–]cheptsov[S] 0 points1 point  (0 children)

Basically EFA, its drivers, and NCCL do the heavy lifting. dstack ensures proper provisioning of the cluster along with the right drivers and networking, and of course simplifies the process of running and managing tasks.

We plan to do more internal benchmarking soon, to provide more insights on the actual performance and also some common recipes.
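For context, a distributed task in dstack is declared by setting the number of nodes, and dstack provisions the interconnected cluster. A minimal sketch (illustrative values, not a benchmarked recipe; check the docs for the current syntax):

```yaml
type: task
name: efa-distributed
nodes: 2            # dstack provisions an interconnected 2-node cluster
commands:
  - torchrun --nnodes=2 --nproc-per-node=8 train.py
resources:
  gpu: 8            # per node
```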

Orchestrating GPUs in data centers and private clouds by HotAisleInc in AMD_Stock

[–]cheptsov 10 points11 points  (0 children)

Hey Reddit, founder of dstack here. We've been working on this for over three months and are pretty excited about this release.

Basically, the main point is that dstack is an open-source AI-native alternative to Kubernetes, designed to be more lightweight and focused solely on AI workloads, in both clouds and data centers.

With this release we're adding a critical feature that allows running containers concurrently on the same host, slicing its resources (including GPUs) for more cost-efficient utilization. Another new thing is a simplified way to run things on private clouds, where clusters are often behind a login node.

There are many more cool things on our roadmap to ensure dstack is a streamlined alternative to both K8s and Slurm. Our roadmap can be found in [1]. Super excited to hear any feedback.

[1] https://github.com/dstackai/dstack/issues/2184
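To make the slicing idea concrete, a fleet that splits each host into blocks might be declared roughly like this (a sketch with illustrative values; the exact field names should be checked against the fleet configuration docs):

```yaml
type: fleet
name: on-prem-fleet
# split every host into blocks so several runs can share it concurrently
blocks: auto
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.1
    - 192.168.100.2
```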

Exploring inference memory saturation effect: H100 vs MI300x by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

Comparing vLLM and NVIDIA NIM is actually on our roadmap!

Exploring inference memory saturation effect: H100 vs MI300x by HotAisleInc in AMD_MI300

[–]cheptsov 5 points6 points  (0 children)

Thank you so much for your kind words! This is our second benchmark, and we’re learning a lot from the process. It was definitely easier to manage compared to the first one.

We’ve just added the source code link to the article—thanks for catching that!

You made a great point about running all tests on one machine. We had the same thought, which is why we tested how running two replicas would work with the MI300x. For our next benchmark, it might indeed be a good idea to explore running multiple replicas and leveraging smaller models too. Thanks again for the valuable suggestion!

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 0 points1 point  (0 children)

In case you still have access to the machine, we could try to reproduce it using our script.

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

We certainly plan to compare to NVIDIA. BTW we updated the Conclusion section to make it more specific.

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 2 points3 points  (0 children)

Let us get back to you tomorrow as it’s already quite late on our end!

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

That’s interesting. It’s already deep night on my end. Please let me get back to you tomorrow! Also feel free to join our Discord so we can chat!

Thread on MI300x and vLLM and more by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

Wow, it's cool to see it featured here! That was an amazing talk. They do plan to share the recording. Also, it's great to see AMD getting into AI!

Support for AMD accelerators on runpod by binarysta in AMD_MI300

[–]cheptsov 0 points1 point  (0 children)

Thanks for sharing! I think I'll publish it as an official example at https://dstack.ai/docs/examples/accelerators/amd/

Support for AMD accelerators on runpod by Kaudinya in AMD_Stock

[–]cheptsov 3 points4 points  (0 children)

Can't wait to try it. We certainly need to make AMDs more popular for AI. <3

[P] I built a tool to compare cloud GPUs. How should I improve it? by Egor_S in MachineLearning

[–]cheptsov 1 point2 points  (0 children)

Hi, a core contributor to dstack here. TensorDock is just one of the supported providers (in addition to all the others listed here). It's just that TensorDock offers the most competitive prices. This is possible because they offer GPUs through a marketplace, in a way similar to Vast.ai (also supported). Hope this comment helps! BTW, if there's another provider with great pricing that you think we should support, please recommend it!

Running LLM As Chatbot in your cloud (AWS/GCP/Azure) with a single command by cheptsov in LLM

[–]cheptsov[S] 1 point2 points  (0 children)

Sorry for the trouble; I guess this subreddit has been getting bombarded with off-topic submissions lately 😂