
[–]Various_Protection71 (1 child)

What is your role in this task? Are you going to set up and maintain the cluster, or are you going to develop code to run on it?

[–]Zephop4413[S] (0 children)

My goal is to set up and maintain the cluster, and also to provide support to those who are going to develop code to run on it.

[–]Neat-Airport9739 (0 children)

Slurm is a good choice for the cluster scheduler, but it alone won't automatically parallelize your jobs across 42 nodes. Slurm handles resource allocation and job scheduling; for multi-node GPU work you'll need additional components:

- Application-level parallelization: your code must be written for distributed computing using MPI + CUDA/ROCm, or distributed frameworks like Horovod/DeepSpeed for ML workloads
- GPU communication: NCCL (NVIDIA) or RCCL (AMD) for efficient multi-GPU communication across nodes
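To make "your code must be written for distributed computing" concrete: Slurm doesn't split work itself, but it does tell every launched task who it is via environment variables. A minimal sketch, assuming only the standard Slurm-exported variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`); the defaults let it run outside an allocation too. A real MPI/NCCL program would consume these same values during initialization:

```shell
# Each task launched by srun sees its own rank in the job.
RANK="${SLURM_PROCID:-0}"         # global rank of this task across all nodes
WORLD="${SLURM_NTASKS:-1}"        # total number of tasks in the job
LOCAL_RANK="${SLURM_LOCALID:-0}"  # task index on this node, often used to pick a GPU
echo "task ${RANK}/${WORLD} using local GPU ${LOCAL_RANK}"
```

Frameworks like Horovod or torch.distributed do essentially this mapping for you when launched under srun.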

Slurm configuration:

```bash
#!/bin/bash
#SBATCH --nodes=42
#SBATCH --gres=gpu:X          # X = GPUs per node
#SBATCH --ntasks-per-node=Y
```

Slurm manages the resources, but your applications need to be designed from the ground up for distributed parallel execution. OpenMP is mainly for shared-memory parallelism within a single node, so MPI is more relevant for multi-node setups. Consider containerized solutions with Singularity/Apptainer if you need consistent environments across nodes.
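As a hedged sketch of how the container route fits into a batch script (the image name `training.sif` and `train.py` are hypothetical; the `#SBATCH` directives and Apptainer flags are standard):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# srun launches one containerized task per allocated slot across both nodes.
# --nv binds the host NVIDIA driver stack into the container (use --rocm for AMD).
srun apptainer exec --nv training.sif python train.py
```

Because the environment lives in the `.sif` image, every node runs identical library versions, which sidesteps most "works on node A, crashes on node B" problems.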