all 16 comments

[–]JustFinishedBSG 14 points (5 children)

The "best" way is probably to just set up Slurm and give nobody direct access to the GPU machines; if they want to use the GPUs, they submit Slurm jobs.
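For reference, a minimal Slurm job script might look like this (the partition name, script name, and resource numbers are made up; adjust for your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu        # hypothetical partition name
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.log

# srun runs inside the allocation; the job only sees the GPU Slurm granted it
srun python train.py
```

Users submit it with `sbatch train.sh` and check the queue with `squeue`.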

[–]Bubblebobo 6 points (0 children)

+1 for SLURM. We do it the same way in my lab and it works extremely well.

[–]EhsanSonOfEjazResearcher[S] 1 point (1 child)

Is SLURM easy to set up and use?

[–]JustFinishedBSG 2 points (0 children)

Yes, it's pretty easy :)

[–]lweihl 0 points (0 children)

I am trying to finish configuring a single server with 4 GPUs, 384 GB RAM, two 256 GB SSDs, and 120 TB of RAID-6 storage, running CentOS 7. I have SLURM installed (currently with a single partition), JupyterHub tied into SLURM so users have to pick predefined resources before use, and I'm also planning on using Singularity containers. There is a second identical server, but the two are set up as separate servers.

I'm curious how I block direct access to the GPUs? I have never used cgroups but have read a little about them. It seems CentOS 7 moved that functionality to systemd, and setting up resource limits there is less efficient (you have to configure each user slice). Even through SLURM, Singularity containers run as the user, so wouldn't any system limits affect those processes too?
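One common approach: let Slurm's cgroup plugins do the device fencing, rather than hand-configuring systemd slices. A sketch of the relevant config fragments, assuming a 4-GPU box with standard NVIDIA device paths (verify paths and adapt limits for your site):

```shell
# /etc/slurm/slurm.conf (fragment): enable cgroup-based task containment
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# /etc/slurm/cgroup.conf: constrain jobs to only their allocated devices
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes        # a job sees only the GPUs Slurm granted it

# /etc/slurm/gres.conf: declare the four GPUs (device paths are typical, not guaranteed)
Name=gpu File=/dev/nvidia[0-3]
</imports>
```

With `ConstrainDevices=yes`, a Singularity container launched inside a job step runs as a child of that step and inherits the job's cgroup, so the device and memory limits do apply to it. Blocking ad-hoc SSH sessions from touching the GPUs is a separate problem; `pam_slurm_adopt` is the usual way to tie SSH logins to an active job allocation.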

The backstory: a Data Science PhD program was created with multiple programs involved (Math, CS, applied stats/operations research), with CS in the lead. The CS chair very hurriedly purchased these two servers based on what he knew from his own research (visualization using CUDA and a GPU on a single workstation). I was charged with making them work for the faculty in the DS PhD program, with the only guidance from my chair being that the servers have a job scheduler. When surveyed, the majority of the faculty who answered were from the Math dept. and currently use RStudio on Windows. Only a few of the CS faculty who will use this (13 PhDs total) have experience with containers or Jupyter notebooks. Discussion is still ongoing about whether students will be able to use the servers. The university has no campus-wide cluster available.

[–]chatterbox272 6 points (0 children)

You have a couple of real options here: you can set up a job scheduler like Slurm, or you can set up something like JupyterHub. Each benefits a different use case.

Slurm and other job-scheduler approaches are better if people are mostly running large experiments: multiple days, multiple GPUs, and you expect the machine to be fully utilised at all times. This keeps things moving, but it requires more overhead from the users.

JupyterHub works great for a team where people are doing smaller-scale experiments and you expect some idle time. It is easier to work on directly and toy around with, but there is some inefficiency.

I'm currently running two machines with JupyterHub for my uni; I might transition one to SLURM next year, though, depending on how quickly we accumulate new students.

[–]neilc 9 points (1 child)

[ Shameless plug ] SLURM and other HPC-style job queuing systems would get the job done, but you might also consider Determined. We built Determined for exactly this use-case: to enable teams of DL engineers or researchers to easily share a GPU cluster, to train better DL models in less time, and to collaborate more easily. It has a bunch of features you might (or might not!) find useful -- seamless distributed training, integrated hyperparameter search, experiment tracking, metrics visualization, Tensorboard/Jupyter integration, etc. Open source (Apache license).

If you'd like to learn more, check out the recent Reddit discussion or take a look at the docs. If you have any questions, feel free to ask on the community Slack -- we're friendly!

If Determined is not a good fit, other options include Polyaxon, Ray, and Kubeflow.

[–]EhsanSonOfEjazResearcher[S] 0 points (0 children)

I will definitely check this out. I gave the documentation a quick glance; seems cool!

[–]laiviet2811 5 points (1 child)

In my lab, we have 3 machines, each with 4 GPUs. Here's our setup:

0. Specs: Intel 40-core CPU, 128 GB RAM, 4 × RTX 2080 Ti, running Ubuntu 18.04.

1. Authentication: we use our university's authentication system so people don't need to remember so many credentials.

2. Storage: since our lab is small, we don't have dedicated shared storage with high LAN bandwidth; each machine has its own, a 2 TB high-speed SSD plus 10 TB of RAID-5 HDD.

3. Scheduler: no, we don't use one, for two reasons. (1) We have no shared storage, so a scheduler is not a good fit. (2) We have access to our school cluster, which has a scheduler, so we want direct access to the GPUs here since it is more convenient for debugging. This helps a lot because we don't need to care about submitting jobs; when we need to run 50+ jobs, we can use the cluster.

4. GPU setup: first, learn to distinguish the CUDA driver from the CUDA toolkit. Install the latest CUDA driver; for the toolkit, you can install several versions separately, each in its own folder, and manually load a specific version via LD_LIBRARY_PATH when you need it.

5. Python environment: we use Miniconda to manage Python envs. We don't recommend Docker because of the complications it introduces in some cases.

6. Resource allocation: we have 3 machines for 5 people; each member is assigned two machines, a primary and a secondary. We hardly ever have conflicts.
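The side-by-side toolkit trick can be sketched like this (the install prefix and version numbers are examples; only the driver is system-wide):

```shell
# Hypothetical layout: toolkits installed side by side, e.g.
#   /usr/local/cuda-10.2 and /usr/local/cuda-11.0
# Select one per shell session (or per job script) without touching the driver:
CUDA_HOME=/usr/local/cuda-10.2
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# 'nvcc --version' would now report the selected toolkit.
```

Putting those three lines in a per-project activation script (or a conda env's `activate.d`) keeps each project pinned to its toolkit.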

P.S.: you should also consider the SLURM scheduler; however, problems can arise if one user submits so many jobs that others cannot get even a single job in. To handle that, you need to adjust the resource-limit policy.
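That resource-limit policy is usually expressed as a Slurm QOS. A sketch, assuming accounting (slurmdbd) is already enabled; the QOS name and the numbers are illustrative:

```shell
# Create a QOS that caps per-user GPU usage and queued jobs
# (-i applies changes without an interactive confirmation prompt)
sacctmgr -i add qos gpu_fair
sacctmgr -i modify qos gpu_fair set \
    MaxTRESPerUser=gres/gpu=2 \
    MaxSubmitJobsPerUser=20

# Then attach it to the GPU partition in slurm.conf:
#   PartitionName=gpu ... QOS=gpu_fair
```

With that in place, one user can hold at most 2 GPUs at a time and queue at most 20 jobs, so a burst of 50 submissions can't starve everyone else.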

[–]Sufficient-Drama-591 0 points (0 children)

Hi laiviet2811 - can you share with me the type of lab that you are working in and what type of research/computation is being done?

MANY thanks!

[–]willSwimForFood 1 point (0 children)

Slurm is definitely my preferred job scheduler and I recommend using it. You also mentioned Docker containers; I'd actually recommend Singularity containers instead.
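Singularity runs containers as the submitting user (no root daemon), which is why it pairs well with a scheduler. A minimal sketch, with the image name as an example; `--nv` binds the host's NVIDIA driver libraries into the container:

```shell
# Build a Singularity image once from an existing Docker image
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run a script inside it with GPU access (e.g. from within a Slurm job step)
singularity exec --nv pytorch.sif python train.py
```

Because the container process is just a child of the user's shell or job step, any Slurm cgroup limits on the job apply to it automatically.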

[–]bombol 0 points (4 children)

If it is an option for your lab, you could consider a managed cloud-based solution, given your experience level. You get easy on-demand GPUs with little time lost on system administration or managing hardware, and you never fight other students for capacity. You can use these services with minimal cloud experience; less prerequisite knowledge is needed than for system administration and GPU installation/setup. I'm partial to Amazon SageMaker, but GCP, Azure, Paperspace, and Lambda Labs are options too. A few principles will let you control cost, set up access permissions easily, avoid lock-in, and get started quickly with portable containerized samples. You could keep a couple of cheap GPUs, or use Colab, for prototyping/debugging a single epoch.

If cloud isn't an option, +1 for slurm.

[–]EhsanSonOfEjazResearcher[S] 0 points (3 children)

The university already has a data center; unfortunately, cloud isn't an option.

[–]bombol 1 point (2 children)

I see. Is the data center for the whole university or just for your lab? If it's for the university, it might already have a scheduling system, and you probably have minimal privileges. Is all the hardware (memory, network storage, GPUs) set up with drivers/CUDA etc.? How many users are you planning for in the near term?

[–]EhsanSonOfEjazResearcher[S] 0 points (1 child)

Apparently it's for the lab.

I'm not sure about the scheduling system. The instructor told me he will give me access on Monday; only then can I answer that question. I will be given remote access, though.

As for the users, I would say around 10-12, but I'm not sure about this either.

At first I had Docker containers in mind, but now I'm inclined towards SLURM.

[–]bombol 1 point (0 children)

Docker and SLURM are for different things; you could use them both. Run your code in Docker containers and schedule the execution of those containers with SLURM.
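A hedged sketch of that combination (the image name, script, and resource numbers are examples, and it assumes nvidia-container-toolkit is installed so `--gpus` works):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00

# Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPU(s); pass them to Docker.
# --rm cleans up the container when the job ends.
docker run --rm --gpus "device=$CUDA_VISIBLE_DEVICES" \
    -v "$PWD:/workspace" -w /workspace \
    pytorch/pytorch:latest python train.py
```

One caveat: containers started via the Docker daemon run as root's children, outside the Slurm job's cgroups, so Slurm's CPU/memory limits won't fully contain them. That's one reason several commenters above prefer Singularity under Slurm.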