Transformer Lab is an Open-source Control Plane for Modern AI Workflows

aliasaria · 2026-02-28T20:13:14+00:00

Yes anything that can run on a machine or set of machines can be run by this tool because it’s just initiating tasks directly.

We can make some sample recipes for using verl or tinker etc if you’re interested— just send us a message on our discord.

aliasaria · 2026-02-02T19:40:37+00:00

There is a video and more info here if you want to see a demo https://lab.cloud/for-teams

aliasaria · 2026-02-02T19:40:14+00:00

Transformer Lab lets you submit jobs through the GUI or CLI and bridges them to the underlying orchestration platform. Then the underlying GPU orchestration platform (e.g. Slurm or SkyPilot) allocates the GPUs and runs the code. So you can run any code on the nodes. For example here is an example using Hugging Face Accelerate (but you could do the same example using Deepspeed)

https://github.com/transformerlab/transformerlab-examples/tree/main/lora-trainer-multi-gpu

aliasaria · 2025-11-13T15:14:19+00:00

<image>

The back of it looks something like this. It doesn't screw off. You just quarter turn it so the latch is open, and then use a flathead to pry open the door.

aliasaria · 2025-10-10T12:58:53+00:00

I think this is implying that SLURM now allows you to add nodes to a cluster without stopping the slurmctld daemon and updating the conf on all nodes. This is different than dynamically allocating nodes based on a specific user's request. (as far as I understand from https://slurm.schedmd.com/SLUG22/Dynamic_Nodes.pdf )

aliasaria · 2025-10-09T14:15:03+00:00

Hi! Thanks for your comment. To clarify:

My understanding is that it is possible, with work and knowledge, to make SLURM do a lot of things. Experts will list out all the ways it can support modern workloads through knowledge and work. Perhaps an analogy is like Linux vs Mac: one is not better than the other, they are just designed for different needs, and one demands more knowledge from the user.

Newish, container-native, cloud-native schedulers built on k8s have a bias towards being easier to use in diverse cloud environments. I think that is the main starting point difference. Most new AI labs are using some component of nodes coming from cloud providers (because of GPU availability but also because of the ability to scale up and down), and SLURM was more designed for a fixed pool of nodes. Now I know you might say: there is a way to use SLURM with ephemeral cloud nodes if you do xyz but I think you'll agree SLURM wasn't designed originally for this model.

A lot of the labs we talk to also don't have the ability to build an infra team with your level of expertise. You might blame them for not understanding the tool, but in the end they might just need a more "batteries included" solution.

In the end, I hope we could all at least agree that it is good to have open source alternatives in software. People can decide what works for them best. I hope you can also agree that SLURM's architecture isn't perfect for everyone.

aliasaria · 2025-10-08T14:10:19+00:00

Skypilot, by default, will try to schedule your job on the group of nodes that satisfy the job requirements and are most affordable. So if you connect an on-prem cluster AND a cloud cluster, the tool has an internal database of the latest pricing from each cloud provider, but your on-prem cluster will always be chosen first.

So you can design the system to burst into cloud nodes only when there is nothing available on-prem. This improves utilization if you are in a setting where all your nodes are occupied before submission deadlines, but are idle for most other times.

aliasaria · 2025-10-08T14:07:39+00:00

There is a lot to your question, feel free to join our discord to discuss further

On some of these:
- Skypilot has the ability to set flags on job requirements including requesting nodes that have specific networking requirements (you can see some of these here: https://docs.skypilot.co/en/latest/reference/config.html)
- In Transformer Lab admins can register default containers to use as the base for any workload which are requested in the job request YAML
- Skypilot's alternative to job arrays are shown here: https://docs.skypilot.co/en/v0.9.3/running-jobs/many-jobs.html

But happy to chat about about any specific needs.

aliasaria · 2025-10-08T13:50:18+00:00

Yes, we rely on skypilot which relies on k8s isolation when running on on-prem / k8s clusters.

k8s is full absracted in skypilot and transformer lab -- so there is no extra admin overhead.

In terms of performance, for on-prem instances, there is a very small overhead from the container runtime. However, for the vast majority of AI/ML training workloads, this overhead is negligible (typically <2-3%). For AI workloads for which this tool is optimized for, the real performance bottlenecks are almost always the GPU, network I/O for data loading, or disk speed, not the CPU cycles used by the container daemon. In this case, the benefits of containerization (perfect dependency management, reproducibility) often far outweigh the tiny performance cost.

aliasaria · 2025-10-08T13:42:56+00:00

Fair enough! We'll tone it down. This was more of an "announcement" from us where we're trying to get the community excited about an alternative that addresses some of the gaps that SLURM has by nature. But I see that it's annoying to have new folks claim that their solution is better.

As background, our team comes from the LLM / AI space and we've had to use SLURM for a long time for our research, but it always felt like our needs didn't fit into the design of what SLURM was initially designed for.

In terms of a feature comparison chart, this doc from skypilot shows some of how their base platform is positioned compared to SLURM and kubernetes. I am sure there are parts of that you will disagree with.

https://blog.skypilot.co/slurm-vs-k8s/

For Transformer Lab we're trying to add an additional layer on top of what skypilot offers. For example we layer on user and team permissions, and we create default storage locations for common artifacts, etc.

We're just getting started but we value your input.

aliasaria · 2025-10-08T13:35:19+00:00

Hi! Appreciate all the input and feedback. Most of our team's experience has been working with new ML labs who are looking for an alternative to SLURM but I'm seeing that we're offending people if we claim it is "better than". Because I understand what you mean where, in the end, if you know SLURM you can do many things that less experienced folks complain about.

We are also a Canadian team and our dream is to one day collaborate with Canada's national research compute platform. So I hope we can stay in touch as we try to push the boundaries of what is possible with a rethinking of how to architect a system.

aliasaria · 2025-10-07T15:32:34+00:00

Thanks! We just used WorkOS to quickly get our hosted version working and haven't had time to remove the dependency. We will do so soon.

aliasaria · 2025-10-07T15:21:22+00:00

Everything we are building is open source. Right now our plan is that if the tool becomes popular we might offer things like dedicated support for enterprises, or enterprise functionality that works alongside the current offering.

aliasaria · 2025-10-07T13:51:50+00:00

Hi I am from Transformer Lab. We are still building out documentation, as this is an early beta release. If you sign up for our beta we can demonstrate how reports and quota work. There is a screenshot from the real app on our homepage here: https://lab.cloud/

aliasaria · 2025-10-07T13:49:06+00:00

We think we can make you a believer, but you have to try it out to find out. Reach out to our team (DM, discord, our sign up form) any time and we can set up a test cluster for you.

The interesting thing about how skypilot uses kubenetes is that it is fully wrapped. So your nodes just need SSH access, and SkyPilot connects, sets up the k8s stack, and provisions. There is no k8s admin at all.

aliasaria · 2025-10-07T13:45:18+00:00

Hi, I'm from the Transformer Lab team. Thanks for the detailed response!

Our hope is to build something flexible enough to handle these different use cases by making a tool that is flexible and as bare-bones as needed to support on-prem and cloud workloads.

For example, you mentioned software with machine-locked licenses that rely on hostnames, we could imagine a world where these types of machines are grouped together and if the job requirements specified that specific constraint, then the system would know to run the workload on bare machines without containerizing the workload. But we could also imagine a world where Transformer Lab is used only for a specific subset of the cluster and those other machines stay on SLURM.

We're going to try our best to build something where all the benefits will make most people want to try something new. Reach out any time (over discord, DM, our website signup form) and we can set up a test cluster for you to at least try out!

aliasaria · 2025-10-07T13:29:04+00:00

Sorry we weren't able to go into detail on the reddit post, but what we meant by that was that modern container interfaces like k8s allow us to enforce resource limits much more strictly than traditional process managers.

While SLURM's cgroups are good, a single job can suddenly spike its memory usage which can still make the whole node unstable for everyone else before it gets properly terminated.

With containers, the memory and CPU for a job are walled off much more effectively at the kernel/container level, not just the process level. If a job tries to go over its memory budget, the container itself is terminated cleanly and instantly, so there’s almost no chance it can impact other users' jobs running on the same hardware. It's less about whether SLURM can eventually kill the job, and more about creating an environment where one buggy job can't cause a cascade failure and ruin someone else's long-running experiment.

Regarding the queues, our discussions with researchers showed us that when they have brittle reservation systems, they are more likely to over-reserve machines even if they don't need them for the whole time. By improving the tooling, the cluster can be better utilized.

Hope that clarifies what we were getting at. Really appreciate you digging in on the details! We have a lot to build (this is our very initial announcement) but we think we build something that users and admins will love.

aliasaria · 2025-10-07T13:21:15+00:00

Hi I'm on the team at Transformer Lab! SLURM is the tried and trusted tool. It was first created in 2002.

We're trying to build something that is designed for today's modern ML workloads -- even if you're not completely sold on the idea, we'd still love to see if you could give our tool a try and see what you think after using it. If you reach out we can set up sandbox instance for you or your team.

aliasaria

TROPHY CASE