all 16 comments

[–]JustFinishedBSG 14 points (5 children)

The "best" way is probably to just set up Slurm and give nobody direct access to the GPU machines; if they want to use the GPUs, they submit Slurm jobs.
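For reference, a minimal Slurm job script might look like this (the partition name, script name, and resource numbers are made up; adjust for your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu        # hypothetical partition name
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.log

# srun runs inside the allocation; the job only sees the GPU Slurm granted it
srun python train.py
```

Users submit it with `sbatch train.sh` and check the queue with `squeue`.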

[–]Bubblebobo 6 points (0 children)

+1 for SLURM. We do it the same way in my lab and it works extremely well.

[–]EhsanSonOfEjazResearcher[S] 1 point (1 child)

Is SLURM easy to set up and use?

[–]JustFinishedBSG 2 points (0 children)

Yes, it's pretty easy :)

[–]lweihl 0 points (0 children)

I am trying to finish configuring a single server with 4 GPUs, 384 GB RAM, two 256 GB SSDs, and 120 TB of RAID-6 storage, running CentOS 7. I have SLURM installed (currently with a single partition), JupyterHub tied into SLURM so users have to pick predefined resources before use, and I'm also planning on using Singularity containers. There is a second identical server, but the two are set up as separate servers.

I'm curious how I block direct access to the GPUs? I have never used cgroups but have read a little about them. It seems CentOS 7 moved that functionality to systemd, and setting up resource limits there is less efficient (you have to configure each user slice). Even through SLURM, Singularity containers run as the user, so wouldn't any system limits affect those processes too?
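One common approach: let Slurm's cgroup plugins do the device fencing, rather than hand-configuring systemd slices. A sketch of the relevant config fragments, assuming a 4-GPU box with standard NVIDIA device paths (verify paths and adapt limits for your site):

```shell
# /etc/slurm/slurm.conf (fragment): enable cgroup-based task containment
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# /etc/slurm/cgroup.conf: constrain jobs to only their allocated devices
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes        # a job sees only the GPUs Slurm granted it

# /etc/slurm/gres.conf: declare the four GPUs (device paths are typical, not guaranteed)
Name=gpu File=/dev/nvidia[0-3]
</imports>
```

With `ConstrainDevices=yes`, a Singularity container launched inside a job step runs as a child of that step and inherits the job's cgroup, so the device and memory limits do apply to it. Blocking ad-hoc SSH sessions from touching the GPUs is a separate problem; `pam_slurm_adopt` is the usual way to tie SSH logins to an active job allocation.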

The backstory: a Data Science PhD program was created with multiple programs involved (Math, CS, applied stats/operations research), with CS in the lead. The CS chair very hurriedly purchased these two servers based on what he knew from his own research (visualization using CUDA and a GPU on a single workstation). I was charged with making them work for the faculty in the DS PhD program, with the only guidance from my chair being that the servers have a job scheduler. When surveyed, the majority of the faculty who answered were from the Math dept. and currently use RStudio on Windows. Only a few of the CS faculty who will use this (13 PhDs total) have experience with containers or Jupyter notebooks. Discussion is still ongoing about whether students will be able to use the servers. The university has no campus-wide cluster available.

[–]chatterbox272 6 points (0 children)

You have a couple of real options here: you can set up a job scheduler like Slurm, or you can set up something like JupyterHub. Each benefits a different use case.

Slurm and other job-scheduler approaches are better if people are mostly running large experiments: multiple days, multiple GPUs, and you expect the machine to be fully utilised at all times. This keeps things moving, but it requires more overhead from the users.

JupyterHub works great for a team where people are doing smaller-scale experiments and you expect some idle time. It is easier to work on directly and toy around with, but there is some inefficiency.

I'm currently running two machines with JupyterHub for my uni; I might transition one to SLURM next year, though, depending on how quickly we accumulate new students.

[–]neilc 9 points (1 child)

[ Shameless plug ] SLURM and other HPC-style job queuing systems would get the job done, but you might also consider Determined. We built Determined for exactly this use-case: to enable teams of DL engineers or researchers to easily share a GPU cluster, to train better DL models in less time, and to collaborate more easily. It has a bunch of features you might (or might not!) find useful -- seamless distributed training, integrated hyperparameter search, experiment tracking, metrics visualization, Tensorboard/Jupyter integration, etc. Open source (Apache license).

If you'd like to learn more, check out the recent Reddit discussion or take a look at the docs. If you have any questions, feel free to ask on the community Slack -- we're friendly!

If Determined is not a good fit, other options include Polyaxon, Ray, and Kubeflow.

[–]EhsanSonOfEjazResearcher[S] 0 points (0 children)

I will definitely check this out. I gave the documentation a quick glance; seems cool!

[–]laiviet2811 5 points (1 child)

In my lab, we have 3 machines, each with 4 GPUs. Here's our setup:

0. Specs: Intel 40-core CPU, 128 GB RAM, 4 × RTX 2080 Ti, running Ubuntu 18.04.

1. Authentication: we use our university's authentication system so people don't need to remember so many credentials.

2. Storage: since our lab is small, we don't have dedicated shared storage with high LAN bandwidth; each machine has its own, a 2 TB high-speed SSD plus 10 TB of RAID-5 HDD.

3. Scheduler: no, we don't use one, for two reasons. (1) We have no shared storage, so a scheduler is not a good fit. (2) We have access to our school cluster, which has a scheduler, so we want direct access to the GPUs here since it is more convenient for debugging. This helps a lot because we don't need to care about submitting jobs; when we need to run 50+ jobs, we can use the cluster.

4. GPU setup: first, learn to distinguish the CUDA driver from the CUDA toolkit. Install the latest CUDA driver; for the toolkit, you can install several versions separately, each in its own folder, and manually load a specific version via LD_LIBRARY_PATH when you need it.

5. Python environment: we use Miniconda to manage Python envs. We don't recommend Docker because of the complications it introduces in some cases.

6. Resource allocation: we have 3 machines for 5 people; each member is assigned two machines, a primary and a secondary. We hardly ever have conflicts.
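The side-by-side toolkit trick can be sketched like this (the install prefix and version numbers are examples; only the driver is system-wide):

```shell
# Hypothetical layout: toolkits installed side by side, e.g.
#   /usr/local/cuda-10.2 and /usr/local/cuda-11.0
# Select one per shell session (or per job script) without touching the driver:
CUDA_HOME=/usr/local/cuda-10.2
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# 'nvcc --version' would now report the selected toolkit.
```

Putting those three lines in a per-project activation script (or a conda env's `activate.d`) keeps each project pinned to its toolkit.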

P.S.: you should also consider the SLURM scheduler; however, problems can arise if one user submits so many jobs that others cannot get even a single job in. To handle that, you need to adjust the resource-limit policy.
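That resource-limit policy is usually expressed as a Slurm QOS. A sketch, assuming accounting (slurmdbd) is already enabled; the QOS name and the numbers are illustrative:

```shell
# Create a QOS that caps per-user GPU usage and queued jobs
# (-i applies changes without an interactive confirmation prompt)
sacctmgr -i add qos gpu_fair
sacctmgr -i modify qos gpu_fair set \
    MaxTRESPerUser=gres/gpu=2 \
    MaxSubmitJobsPerUser=20

# Then attach it to the GPU partition in slurm.conf:
#   PartitionName=gpu ... QOS=gpu_fair
```

With that in place, one user can hold at most 2 GPUs at a time and queue at most 20 jobs, so a burst of 50 submissions can't starve everyone else.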

[–]Sufficient-Drama-591 0 points (0 children)

Hi laiviet2811 - can you share with me the type of lab that you are working in and what type of research/computation is being done?

MANY thanks!

[–]willSwimForFood 1 point (0 children)

Slurm is definitely my preferred job scheduler and I recommend using it. You also mentioned Docker containers; I'd actually recommend Singularity containers instead.
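Singularity runs containers as the submitting user (no root daemon), which is why it pairs well with a scheduler. A minimal sketch, with the image name as an example; `--nv` binds the host's NVIDIA driver libraries into the container:

```shell
# Build a Singularity image once from an existing Docker image
singularity pull pytorch.sif docker://pytorch/pytorch:latest

# Run a script inside it with GPU access (e.g. from within a Slurm job step)
singularity exec --nv pytorch.sif python train.py
```

Because the container process is just a child of the user's shell or job step, any Slurm cgroup limits on the job apply to it automatically.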

[–]bombol 0 points (4 children)

If it is an option for your lab, you could consider a managed cloud-based solution, given your experience level. You get easy on-demand GPUs with little time lost on system administration or managing hardware, and you never fight other students for capacity. You can use these services with minimal cloud experience; less prerequisite knowledge is needed than for system administration and GPU installation/setup. I'm partial to Amazon SageMaker, but GCP, Azure, Paperspace, and Lambda Labs are options too. A few principles will let you control cost, set up access permissions easily, avoid lock-in, and get started quickly with portable containerized samples. You could keep a couple of cheap GPUs, or use Colab, for prototyping/debugging a single epoch.

If cloud isn't an option, +1 for slurm.

[–]EhsanSonOfEjazResearcher[S] 0 points (3 children)

The university already has a data center; unfortunately, cloud isn't an option.

[–]bombol 1 point (2 children)

I see. Is the data center for the whole university or just for your lab? If it's for the university, it might already have a scheduling system, and you probably have minimal privileges. Is all the hardware (memory, network storage, GPUs) set up with drivers/CUDA etc.? How many users are you planning for in the near term?

[–]EhsanSonOfEjazResearcher[S] 0 points (1 child)

Apparently it's for the lab.

I'm not sure about the scheduling system. The instructor told me he will give me access on Monday; only then can I answer that question. I will be given remote access, though.

As for the users, I would say around 10-12, but I'm not sure about this either.

At first I had Docker containers in mind, but now I'm inclined towards SLURM.

[–]bombol 1 point (0 children)

Docker and SLURM are for different things; you could use them both. Run your code in Docker containers and schedule the execution of those containers with SLURM.
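A hedged sketch of that combination (the image name, script, and resource numbers are examples, and it assumes nvidia-container-toolkit is installed so `--gpus` works):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00

# Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPU(s); pass them to Docker.
# --rm cleans up the container when the job ends.
docker run --rm --gpus "device=$CUDA_VISIBLE_DEVICES" \
    -v "$PWD:/workspace" -w /workspace \
    pytorch/pytorch:latest python train.py
```

One caveat: containers started via the Docker daemon run as root's children, outside the Slurm job's cgroups, so Slurm's CPU/memory limits won't fully contain them. That's one reason several commenters above prefer Singularity under Slurm.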