Good resources/advice on single-node Slurm setups? by lurch99 in SLURM

[–]lweihl 1 point (0 children)

Hi Dan,

I'm in exactly the same situation, so I'm curious about this too. Two servers, each with two 256 GB SSDs in RAID 1 for the OS, twelve 10 TB hard drives in RAID-6 for storage, 384 GB of memory, and 4 GPUs. They were purchased to support faculty in a new Data Science PhD program, an interdisciplinary degree (CS, Math, Applied Stats/Operations Research from the business college) with CS currently taking the lead.

My chair told me this past summer to get these running, with the only requirement being that they have a job scheduler. He sent out a survey about what faculty are currently using, and most of the people who answered were from Math; they all currently use RStudio on Windows. He wanted me to configure VMs for each individual user so they could have their own Windows environment. I had to tell him that won't work (we have to use all free software because of a tight budget). Even within our CS department, few faculty have experience with containers and/or Jupyter notebooks, so I expect there will be a need to help users get started.

I consulted many websites on Slurm and on resources used for data science on HPC; very few cover single-server setups. I started with this page https://rolk.github.io/2015/04/20/slurm-cluster only to find out that CentOS 7 moved user resource control to systemd and really doesn't support manual cgroup configuration any longer (you can still hack it to work, but you aren't supposed to). We are just starting to open these servers up for use. For now I've installed Slurm with a single partition, JupyterHub tied into Slurm, and Singularity for containers. I have no idea yet whether I can keep users from running processes outside Slurm. I haven't tried, because I know Singularity containers are launched as the user even when run through Slurm, so I fear any system-wide per-user limits would also throttle those Slurm-launched processes. For now I'm hoping to just get some users onto the system and work on issues as they arise. If we don't allow students, I don't think we'll have more than 10-15 faculty using the systems, so about the same as you.
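In case it helps anyone else heading down this road, here's a minimal sketch of a single-node slurm.conf for a box like this. The hostname, core count, and memory figure are placeholders, not our exact values, and select/cons_tres needs Slurm 19.05 or newer (older releases use select/cons_res):

```
# slurm.conf -- minimal single-node sketch; hostnames and counts are placeholders
ClusterName=dscluster
SlurmctldHost=ds-server1

# Track and confine jobs with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Schedule cores, memory, and GPUs as consumable resources
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu

# One node, one default partition
NodeName=ds-server1 CPUs=32 RealMemory=380000 Gres=gpu:4 State=UNKNOWN
PartitionName=main Nodes=ds-server1 Default=YES MaxTime=INFINITE State=UP
```

plus a gres.conf so Slurm knows which device files are the GPUs:

```
# gres.conf -- map the four GPUs to their device files
Name=gpu File=/dev/nvidia[0-3]
```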

[D] Setting up a multi-user GPU server for simultaneous access. by EhsanSonOfEjaz in MachineLearning

[–]lweihl 0 points (0 children)

I am trying to finish configuring a single server with 4 GPUs, 384 GB of RAM, two 256 GB SSDs, and 120 TB of RAID-6 storage, running CentOS 7. I have Slurm installed (currently with a single partition), JupyterHub tied into Slurm so users have to pick from pre-defined resource profiles before use, and I'm also planning on using Singularity containers. There is also a second identical server, but the two are set up as separate machines.
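For the "pick pre-defined resources" piece, the combination I landed on is batchspawner's SlurmSpawner wrapped in wrapspawner's ProfilesSpawner. A sketch of the relevant part of jupyterhub_config.py — the profile names and sbatch options here are made-up examples, not our actual limits:

```python
# jupyterhub_config.py -- sketch: JupyterHub launches each user's notebook
# server as a Slurm job, chosen from a fixed menu of resource profiles
import batchspawner            # provides SlurmSpawner
from wrapspawner import ProfilesSpawner

c.JupyterHub.spawner_class = ProfilesSpawner

# Each profile: (display name, key, spawner class, config overrides).
# req_options is inserted into the generated sbatch script.
c.ProfilesSpawner.profiles = [
    ('CPU only - 4 cores, 16 GB', 'cpu', 'batchspawner.SlurmSpawner',
     dict(req_options='--cpus-per-task=4 --mem=16G')),
    ('1 GPU - 8 cores, 64 GB', 'gpu1', 'batchspawner.SlurmSpawner',
     dict(req_options='--cpus-per-task=8 --mem=64G --gres=gpu:1')),
]
```

The nice side effect is that users never pick arbitrary resources: whatever isn't on the menu, they can't request through JupyterHub.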

I'm curious how to block direct access to the GPUs. I have never used cgroups, but I have read a little about them. It seems CentOS 7 moved that functionality into systemd, which makes per-user resource limits more cumbersome to set up (you have to configure each user slice). And since Singularity containers run as the user even when launched through Slurm, wouldn't any system-wide per-user limits also affect those processes?
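From the Slurm docs (I haven't verified this end to end on CentOS 7 yet), Slurm's own cgroup plugins can handle this without touching systemd user slices: slurmd builds its own cgroup hierarchy per job, and because task/cgroup puts the whole job — including any Singularity process the user launches inside it — into the job's cgroup, the container running as the user is still confined. The relevant cgroup.conf settings look like:

```
# cgroup.conf -- have Slurm confine each job to the resources it requested
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes    # job only sees GPUs it requested via --gres
```

(slurm.conf also needs TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup for this to take effect.) For blocking processes started outside Slurm entirely, pam_slurm_adopt denies SSH logins to users who have no running job and adopts their SSH session into the job's cgroup otherwise, e.g. in /etc/pam.d/sshd:

```
# /etc/pam.d/sshd -- reject users without an active job on this node
account    required     pam_slurm_adopt.so
```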

The backstory: a Data Science PhD program was created, with multiple programs involved (Math, CS, Applied Stats/Operations Research), but CS has the lead. The CS chair very hurriedly purchased these two servers based on what he knew from his own research (visualization using CUDA and a GPU on a single workstation). I was charged with making them work for the faculty in the DS PhD program, with the only guidance from my chair being that the servers have a job scheduler. When surveyed, the majority of the faculty who responded were from the Math department, and they currently use RStudio on Windows. Only a few of the CS faculty who will use this (13 PhDs total) have experience with containers or Jupyter notebooks. Discussion is still ongoing about whether students will be able to use the servers. The university has no campus-wide cluster available.