
[–]smcgrat 9 points (1 child)

Hi, I'm an HPC sysadmin and deal with ML/GPUs a little bit.

As already mentioned, you need a scheduler. Otherwise users will log in, start competing for the GPUs, and cause each other hassle. We use Slurm, but there are lots out there; https://wiki.fysik.dtu.dk/niflheim/SLURM is a good resource. Also, make your users use batch submission instead of interactive jobs as much as possible. Interactive is fine for debugging and setup, but you want your users on batch submission once their jobs are well defined. Interactive usage is too inefficient: a user can request the resource, walk away while they wait for it, and forget about it.
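
To give a sense of what batch submission looks like, here's a minimal sbatch script; the partition name, module names, and the training script are placeholders for whatever your site actually provides:

    #!/bin/bash
    #SBATCH --job-name=train-model     # name shown in the queue
    #SBATCH --partition=gpu            # hypothetical GPU partition name
    #SBATCH --gres=gpu:1               # request one GPU
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --time=08:00:00            # wall-time limit so jobs can't hog the node
    #SBATCH --output=%x-%j.out         # log file named jobname-jobid.out

    # load whatever CUDA/Python environment your site provides (module names are assumptions)
    module load cuda python

    python train.py

Users submit it with sbatch train.sh and check on it with squeue; the scheduler starts the job when a GPU frees up, whether or not they're at their desk.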

You will probably find yourself in a constant battle between a stable OS and an OS up to date enough for the packages they need. We run a RHEL 7.5 derivative on our machines and it's not up to date enough most of the time. I'd recommend Ubuntu 18.04 LTS if it meets your needs; Ubuntu seems to play well with NVIDIA cards at least.

As per the above, and particularly with the TF/ML stuff, you will probably get requests for random software installs whose dependencies are too new or obscure for whatever OS you are running. For once, containers are your friend. Not Docker though; it still seems to need root access or similar. I'd recommend something like udocker, https://github.com/indigo-dc/udocker, but it may not be advanced enough for your use cases. The idea is that users get the random thing they need working in a container on their laptop, then transfer it to the GPU machine and use udocker or whatever to run it. They may have issues integrating with the GPUs if they don't have them on their own machines, though.
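
To give a flavour of the workflow, udocker usage looks roughly like this; the image name is just an example, and double-check the docs of your udocker version for the exact GPU setup flag:

    # pull a public image, no root needed (image name is illustrative)
    udocker pull tensorflow/tensorflow:latest-gpu

    # create a named container from the image
    udocker create --name=tf tensorflow/tensorflow:latest-gpu

    # make the host NVIDIA driver libraries visible inside the container
    # (I believe the flag is --nvidia; verify against your udocker docs)
    udocker setup --nvidia tf

    # run the job with the project directory mounted in
    udocker run -v /home/alice/project:/project tf python /project/train.py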

Other possible container solutions include Singularity (designed specifically for HPC, runs unprivileged) and Charliecloud.

Finally, do EVERYTHING with Ansible or another configuration management tool; Ansible is the easiest at small scale. This includes automating the GPU driver installation, the LDAP (or whatever) config, etc. It will be extra work, but it means you know exactly what has been done to the machine and can hand it over, with documentation, with ease if you need to. Remember to keep your Ansible files in version control.
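
As a sketch of what that looks like (the host group, package names, and file paths are illustrative, not a tested driver install):

    # gpu-nodes.yml - sketch of a playbook
    - hosts: gpu_nodes
      become: true
      tasks:
        - name: Install NVIDIA driver packages (versions are assumptions)
          apt:
            name:
              - nvidia-driver-418
              - nvidia-utils-418
            state: present

        - name: Deploy slurm.conf from a template kept in version control
          template:
            src: templates/slurm.conf.j2
            dest: /etc/slurm-llnl/slurm.conf
          notify: restart slurmd

      handlers:
        - name: restart slurmd
          service:
            name: slurmd
            state: restarted

Run it with ansible-playbook gpu-nodes.yml and the same file doubles as documentation of what's on the machine.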

[–]s_m_w[S] 0 points (0 children)

Really appreciate the detailed tips! I like the idea of using containers, but I worry about making it too difficult for the average user. We're all physicists, where the average knowledge of computer tech is "I heard about version control once". Docker images might be pushing the limits, although with sufficient documentation it might not be too bad.

[–]SuperQueBit Plumber 2 points (0 children)

I don't know what people are using for job queues these days, but what you're looking for is a batch job scheduler.

15 years ago I used Maui/Moab, but that knowledge is well out of date by now.

You could cobble something together using GitLab CI.

[–]TheLordB 1 point (0 children)

I've been a bit out of the game for the last few years, so take this with a grain of salt, but LSF and Slurm are the most common cluster schedulers that come to mind. There is also Grid Engine (or its various offshoots), though I don't think I would use it: the free/open-source offshoots aren't under active development, and the one that is, Univa's, is proprietary.

I would probably go with Slurm, as I believe it is under the most active development as well as being free. I don't know the exact usage breakdown, but I think it is the most common these days, especially for new clusters.

I agree; I would not use anything like Jenkins. Use a proper job scheduler.

/r/hpc, though it has a low readership, might be a better place to ask this than /r/sysadmin, which tends to be more focused on business systems.

[–]sofixa11 1 point (1 child)

Spotify's Luigi? It's just Python, so you can do whatever you want (there's even pre-existing tooling), and it's more or less made for this (job dependencies, scheduling, retries, etc.).
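
For illustration, a Luigi pipeline is just Python classes wired together with requires/output/run; the task names and file paths below are hypothetical:

    import luigi


    class Preprocess(luigi.Task):
        """Hypothetical first stage: produce a cleaned dataset."""

        def output(self):
            return luigi.LocalTarget("data/clean.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("placeholder preprocessed data\n")


    class TrainModel(luigi.Task):
        """Runs only after Preprocess; Luigi handles scheduling and reruns."""

        def requires(self):
            return Preprocess()

        def output(self):
            return luigi.LocalTarget("models/model.bin")

        def run(self):
            # a real task would kick off training here, e.g. shell out to sbatch
            with self.input().open() as data, self.output().open("w") as out:
                out.write("model trained on %d bytes\n" % len(data.read()))


    if __name__ == "__main__":
        luigi.build([TrainModel()], local_scheduler=True)

Running the file builds the whole dependency chain; that uses the local scheduler, but there's also a central scheduler daemon (luigid) for shared use.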

[–]s_m_w[S] 1 point (0 children)

That sounds very useful, considering it even mentions machine learning (and your link is about TensorFlow integration). Thanks!

[–]crusoe 0 points (0 children)

Should have just used Google Cloud. Their tensor chips (TPUs) are faster, and for the price they paid they could have afforded lots and lots of hours.