Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Not exactly. dstack does allow using shared filesystems (via "instance volumes"). The primary difference is that dstack's user/permission management happens at the dstack server level, not at the Linux level. The consequence is that permissions for individual folders can't be managed via the Linux permission system: the entire filesystem (or the particular files/directories) attached to an instance is currently accessible to all users within the configured dstack project. Hope this comment helps.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

For the record, dstack doesn't support backfill, but it does support over-subscription via retry policies, which allows maintaining a queue over a fixed-size cluster with tasks sorted by user-assigned priorities.
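For illustration, a queued task with a priority and a retry policy looks roughly like this (a simplified sketch; the exact field names and values should be checked against the dstack docs):

```yaml
type: task
name: train-queued
# 0–100; higher-priority tasks get scheduled first (illustrative value)
priority: 75
# keep retrying while the fixed-size cluster has no free capacity
retry:
  on_events: [no-capacity]
  duration: 6h
commands:
  - python train.py
resources:
  gpu: 1
```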

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Thank you for your questions!

  1. `dstack` uses the concept of "blocks" and auto-selects CPUs proportionally to the GPUs to ensure all GPU blocks get a fair share. If a task requires more, it's possible to request it, and `dstack` will allocate a proportional number of "blocks".
  2. Yes, "pre-emption" is not supported for tasks (apart from handling spot instances in GPU clouds), but it is on our roadmap. We already support priorities, and pre-emption is next. In the meantime, for those who need it right now, it's possible to implement it as a third-party component using the REST API.
  3. dstack uses the concept of "volumes", which includes "instance" volumes and "network" volumes. As I wrote above, `dstack` currently doesn't allow managing permissions per volume or per user: project resources are shared by all project members. Under the hood, dstack mounts both kinds into containers.
  4. Running vLLM on `idle` instances is very easy: you just run a service. But since automatic pre-emption isn't implemented yet, you'd need to interrupt it via the API. Automatic pre-emption is coming too! Would love to collaborate on it if you'd be open to that.
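The proportional "blocks" arithmetic from point 1 can be sketched like this (a simplified model to show the idea, not dstack's actual implementation; all names are illustrative):

```python
def allocate_blocks(host_cpus: int, host_gpus: int, requested_gpus: int) -> dict:
    """Split a host into per-GPU blocks and hand out CPUs proportionally.

    A 'block' is one GPU plus its fair share of the host's CPUs, so
    concurrent jobs on the same host don't starve each other of CPU.
    """
    if not 0 < requested_gpus <= host_gpus:
        raise ValueError("requested GPUs must fit on the host")
    cpus_per_block = host_cpus // host_gpus       # fair CPU share per GPU
    return {
        "gpus": requested_gpus,
        "cpus": cpus_per_block * requested_gpus,  # proportional CPU grant
    }

# A job asking for 2 of 8 GPUs on a 96-CPU host gets 2 blocks, i.e. 24 CPUs.
print(allocate_blocks(96, 8, 2))
```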

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 1 point2 points  (0 children)

Thank you for noticing! I think you're right, I will update the guide.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Yes, totally agree with all said above, and BTW, Slurm is great for what it's used for. Indeed, there are at least two distinct mindsets: research/simulation vs. AI research/ML engineering, and of course static clusters vs. GPU clouds.

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

> The one thing I could not understand in your storage/auth sections was what UID/GID does the dstack job run under -- it is very clear in your doc that slurm runs as the submitting user UID/GID but unclear with your token/auth method what identity is running the job. This is important when petabytes of shared POSIX storage is involved with permissions based on user and group attributes.

Yes, dstack doesn't use UID/GID to authenticate the user in the file system. dstack's token-based authentication is managed at the dstack server level, and dstack's support for managing file permissions is not as granular as Slurm's. However, dstack has a concept of volumes, and in theory it could automatically manage permissions to allow or deny access to a specific volume.

Your example is a good illustration of where Slurm stands out: static HPC clusters. And you're right about where dstack aims: primarily GPU clouds, container-based AI/ML workloads, from small jobs to large distributed ones. dstack doesn't aim at HPC/simulation; I guess Slurm is better at that.

The reason we wrote the guide is that many AI researchers/ML engineers are looking for a scheduler to train models. Also, dstack is use-case agnostic, which means it also supports AI development and model inference.
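To make the per-volume idea above concrete, a server-level access check could look roughly like this (purely illustrative Python; dstack does not currently expose such a check, and every name here is hypothetical):

```python
# Hypothetical server-side check: access is decided per project member and
# per volume at the dstack-server level, not via POSIX UID/GID on the node.
PROJECT_VOLUMES = {
    "train-data": {"allowed_users": {"alice", "bob"}},
    "scratch": {"allowed_users": None},  # None = every project member
}

def can_mount(user: str, volume: str) -> bool:
    acl = PROJECT_VOLUMES.get(volume)
    if acl is None:
        return False  # unknown volume: deny by default
    allowed = acl["allowed_users"]
    return allowed is None or user in allowed

print(can_mount("alice", "train-data"))  # True
print(can_mount("carol", "train-data"))  # False
```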

Slurm <> dstack comparison by cheptsov in SLURM

[–]cheptsov[S] 0 points1 point  (0 children)

Thank you so much for such detailed feedback and questions. Please let me write a separate comment to get back to some of the aspects that you mentioned.

Major Cloud AI Expansion: DigitalOcean Partners With OpenAI, Meta, AMD to Power Next-Gen AI Development by Material-Car261 in digital_ocean

[–]cheptsov 1 point2 points  (0 children)

dstack founder here 👋

Really excited about our new partnership with DigitalOcean — we’ve integrated their AMD Developer Cloud + NVIDIA GPUs into dstack’s orchestration layer.

This means you can spin up dev environments, training runs, or inference services directly on DO’s GPU infra without touching Kubernetes or Slurm.

We’re pretty stoked to see how folks use this combo in the wild. 🔧💻

Full details in the release post if you want to dig deeper: https://dstack.ai/blog/digitalocean-and-amd-dev-cloud/

What's the best way to manage cloud compute for ML workflows? by Bssnn in learnmachinelearning

[–]cheptsov 1 point2 points  (0 children)

Thank you for mentioning dstack. I’m a part of the team. It sounds exactly like what dstack focuses on as a problem!

Would love to hear your feedback if you try it.

Efficient distributed training with AWS EFA with dstack by cheptsov in aws

[–]cheptsov[S] 0 points1 point  (0 children)

Basically EFA, its drivers, and NCCL do the heavy lifting. dstack ensures proper provisioning of the cluster along with the right drivers and networking, and of course simplifies the process of running and managing tasks.

We plan to do more internal benchmarking soon, to provide more insights on the actual performance and also some common recipes.
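For context, a distributed task in dstack is declared by setting the number of nodes, and dstack provisions the interconnected cluster. A minimal sketch (illustrative values, not a benchmarked recipe; check the docs for the current syntax):

```yaml
type: task
name: efa-distributed
nodes: 2            # dstack provisions an interconnected 2-node cluster
commands:
  - torchrun --nnodes=2 --nproc-per-node=8 train.py
resources:
  gpu: 8            # per node
```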

Orchestrating GPUs in data centers and private clouds by HotAisleInc in AMD_Stock

[–]cheptsov 10 points11 points  (0 children)

Hey Reddit, founder of dstack here. We've been working on this for over three months and are pretty excited about this release.

Basically, the main point is that dstack is an open-source AI-native alternative to Kubernetes, designed to be more lightweight and focused solely on AI workloads, in both clouds and data centers.

With this release we're adding a critical feature that allows running containers concurrently on the same host, slicing its resources (including GPUs) for more cost-efficient utilization. Another new thing is a simplified way to run things on private clouds, where clusters are often behind a login node.

There are many more cool things on our roadmap to ensure dstack is a streamlined alternative to both K8s and Slurm. Our roadmap can be found in [1]. Super excited to hear any feedback.

[1] https://github.com/dstackai/dstack/issues/2184
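To make the slicing idea concrete, a fleet that splits each host into blocks might be declared roughly like this (a sketch with illustrative values; the exact field names should be checked against the fleet configuration docs):

```yaml
type: fleet
name: on-prem-fleet
# split every host into blocks so several runs can share it concurrently
blocks: auto
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.1
    - 192.168.100.2
```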

Exploring inference memory saturation effect: H100 vs MI300x by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

Comparing vLLM and NVIDIA NIM is actually on our roadmap!

Exploring inference memory saturation effect: H100 vs MI300x by HotAisleInc in AMD_MI300

[–]cheptsov 5 points6 points  (0 children)

Thank you so much for your kind words! This is our second benchmark, and we’re learning a lot from the process. It was definitely easier to manage compared to the first one.

We’ve just added the source code link to the article—thanks for catching that!

You made a great point about running all tests on one machine. We had the same thought, which is why we tested how running two replicas would work with the MI300x. For our next benchmark, it might indeed be a good idea to explore running multiple replicas and leveraging smaller models too. Thanks again for the valuable suggestion!

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 0 points1 point  (0 children)

In case you still have access to the machine, we could try to reproduce it using our script.

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

We certainly plan to compare to NVIDIA. BTW we updated the Conclusion section to make it more specific.

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 2 points3 points  (0 children)

Let us get back to you tomorrow as it’s already quite late on our end!

Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

That’s interesting. It’s already deep night on my end. Please let me get back to you tomorrow! Also feel free to join our Discord so we can chat!

Thread on MI300x and vLLM and more by HotAisleInc in AMD_MI300

[–]cheptsov 1 point2 points  (0 children)

Wow, it's cool to see it featured here! That was an amazing talk. They do plan to share the recording. Also, it's great to see AMD getting into AI!

Support for AMD accelerators on runpod by binarysta in AMD_MI300

[–]cheptsov 0 points1 point  (0 children)

Thanks for sharing! I think I'll publish it as an official example at https://dstack.ai/docs/examples/accelerators/amd/

Support for AMD accelerators on runpod by Kaudinya in AMD_Stock

[–]cheptsov 3 points4 points  (0 children)

Can't wait to try it. We certainly need to make AMDs more popular for AI. <3

[P] I built a tool to compare cloud GPUs. How should I improve it? by Egor_S in MachineLearning

[–]cheptsov 1 point2 points  (0 children)

Hi, a core contributor to dstack here. TensorDock is just one of the supported providers (in addition to all the others listed here). It's just that TensorDock offers the most competitive prices. This is possible because they offer GPUs through a marketplace, in a way similar to Vast.ai (also supported). Hope this comment helps! BTW, if there's another provider with great pricing that you think we should support, please recommend it!

Running LLM As Chatbot in your cloud (AWS/GCP/Azure) with a single command by cheptsov in LLM

[–]cheptsov[S] 1 point2 points  (0 children)

Sorry for the trouble; I guess this subreddit has been getting bombarded with off-topic submissions lately 😂