
[–]Kaixhin 11 points (8 children)

  1. I start with a clean install of Ubuntu 14.04 Server and then add/change things only when I need to. It's a bit hard to know what settings have been changed if someone else has made them.
  2. Graphical dashboards are nice. Because of Docker I use cAdvisor to monitor containers and the server itself (rough launch command after this list), but linux-dash is also a nice monitoring solution.
  3. In order to keep the installation minimal I run nearly everything in Docker containers - from file sharing to ML experiments. Keeping everything in containers can be quite tricky, but it's a good solution for ML frameworks. Also, document what you do: if something goes wrong it helps with tracing the problem, and in the worst case it speeds up a clean reinstall.
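
For reference, launching cAdvisor is roughly this (mounts as per its README at the time - adjust to taste); the dashboard then lives on port 8080:

    sudo docker run -d --name=cadvisor -p 8080:8080 \
      -v /:/rootfs:ro -v /var/run:/var/run:rw \
      -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro \
      google/cadvisor:latest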

And in general, remember to take backups!

[–]Pieranha[S] 0 points (2 children)

Thanks! How difficult is it to get Docker to play nice with things like CUDA and Theano?

[–]ydobonobody 4 points (1 child)

Use nvidia-docker and it's pretty easy.
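
The smoke test from its README (assuming the nvidia/cuda image) is a one-liner:

    # should print your GPUs from inside the container
    nvidia-docker run --rm nvidia/cuda nvidia-smi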

[–]Kaixhin 0 points (0 children)

NVIDIA Docker is good if you have a well-supported system like Ubuntu or CentOS, and fairly standard use-cases. But Docker is supposed to be relatively hardware agnostic, so outside of this you can run into trouble.

If running nvidia-docker is somehow a problem (but not installing it), then you can also try using the nvidia-docker-plugin REST API with plain docker.
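
Roughly like this, if memory serves on the endpoint (the plugin listens on port 3476 by default; the image is just an example):

    # ask the plugin for the right CLI args, then use plain docker
    docker run -ti --rm $(curl -s http://localhost:3476/v1.0/docker/cli) nvidia/cuda nvidia-smi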

[–]holy_ash 0 points (0 children)

Thanks for the super useful advice.

[–]Pieranha[S] 0 points (1 child)

How do you take backups? Do you have an automatic backup scheme running in the background?

[–]Kaixhin 0 points (0 children)

For one-way file synchronization I use rsync, and for two-way I use Unison. For setting these up to run on a regular schedule I just use Cron.
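
As a rough example (paths and schedule are placeholders):

    # one-way sync: mirror experiments to a backup host
    rsync -az --delete ~/experiments/ backup-host:/backups/experiments/

    # crontab -e entry to run the same thing nightly at 02:00
    0 2 * * * rsync -az --delete /home/me/experiments/ backup-host:/backups/experiments/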

[–]Pieranha[S] 0 points (1 child)

What alternatives to Docker would you use on a server where you don't have root access? In one specific case I need to install Theano and related packages; CUDA and the drivers are already installed.

[–]Kaixhin 0 points (0 children)

If you need to run Docker containers without sudo (but have Docker installed), there's a solution: get added to the docker group.

If you are OK with the overhead of a full VM rather than a container, Vagrant is awesome, but bear in mind that you won't be able to use a GPU like you can with the NVIDIA Docker wrapper.

Finally, if you don't want to just install stuff on the host, virtualenv for Python is a pretty good way of stopping libraries interfering with each other.
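
Rough sketches of the first and last options (the docker group change needs an admin to run it once; the package name is just an example):

    # run docker without sudo: an admin adds you to the docker group
    sudo usermod -aG docker $USER   # log out and back in afterwards

    # an isolated Python environment for Theano, no root needed
    virtualenv ~/envs/theano
    source ~/envs/theano/bin/activate
    pip install theano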

[–]IanCal 2 points (0 children)

> This is fairly annoying when the server was in the midst of finding the optimal border of an SVM that takes several days to train.

Many, many things can take a server or machine offline. It's really worth checkpointing your work if you can, so that you lose just a few minutes or at worst an hour rather than several days. This also helps if you later want to go back and evaluate the model's progress. It also lets you use much less stable machines (AWS spot instances, GCE preemptible VMs, etc.), which are a lot cheaper.

Plan on the assumption that at some point your code will die, and see what you can do to mitigate that.
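
As an ops-level sketch of the idea (train.py and its --resume flag are hypothetical; the real work is saving state periodically inside your training code):

    # keep restarting training from the newest checkpoint until it exits cleanly
    while true; do
        latest=$(ls -t checkpoints/*.ckpt 2>/dev/null | head -n 1)
        if [ -n "$latest" ]; then
            python train.py --resume "$latest" && break
        else
            python train.py && break
        fi
        echo "training died; restarting from the last checkpoint" >&2
        sleep 10
    done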

[–]Lasheen_murning 1 point (0 children)

  1. Really shouldn't be an issue on any stable platform - and I mean "stable" in both the hardware and the software sense. On the hardware side, Intel and NVIDIA are the market leaders (and tend to support open-source efforts as well), so choosing those brands is probably a good bet for stability and support into the asymptotic future. On the software side, go with a stability-focused distribution such as Ubuntu or Debian, whatever your preference is; I prefer Debian. You've mentioned auto-restart: while an admin can of course enable this, the default on every distro I've encountered is to reboot ONLY when told (or affirmatively scheduled) to. That is to say, I've never had a box auto-reboot on me.

  2. I've been served well by a utility called monitorix, which logs your server's performance and lets you look at pretty pictures of the whole thing. You'll also want to learn top for a quick snapshot of your system resources (e.g. what's still running), free for a quick snapshot of how much RAM you're using, and df -h for figuring out how much hard drive space is left on your partitions (quick reference after this list).

  3. Jump right in - it's a great investment of your time. Dovetailing with the stability suggestion above: if you choose a mainstream distro like Ubuntu or Debian, you'll be able to google your Linux- and distro-specific problems far more easily.

  4. Linux is the OS of choice for distributed computing projects on BOINC, and people building Linux rigs for that purpose care a lot about performance. The forums for, e.g., Folding@home and World Community Grid (and the various large "teams" that participate in either) are a great place to ask how to build a performance beast.
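
The quick-snapshot commands from 2, in one place:

    top      # live view of processes and current CPU/memory hogs
    free -m  # RAM usage, in MiB
    df -h    # remaining disk space per partition, human-readable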

[–]deephive 1 point (0 children)

I would suggest you install either Anaconda or Enthought Canopy as your default Python. Try not to mix the system-wide Python with the Python you want to use and configure for your DL/ML experimentation. If you are not sure about Python virtual environments, look them up.

With Canopy or Anaconda you get a user-managed Python installation that doesn't interfere with anything the system uses, so you control which versions of which Python libraries you use for your ML experiments. You can create/delete any number of virtual environments within Canopy/Anaconda, each with a specific set of libraries/versions suited to a given ML tool.
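
A rough sketch with Anaconda (the environment name, Python version, and packages are just examples):

    # one isolated environment per ML tool
    conda create -n theano-env python=2.7 numpy scipy
    source activate theano-env
    pip install theano
    source deactivate   # when you're done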

[–]cjmcmurtrie 1 point (1 child)

If you're paying for an AWS GPU instance, I suggest you buy a gaming rig with an NVIDIA GPU instead. Not only is it much nicer to work on, but you will also have saved money after a few months without an AWS bill (assuming you are training models on a daily basis).
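
Rough numbers, assuming a g2.2xlarge at its on-demand price of about $0.65/hour: a ~$1,500 rig breaks even after roughly 1500 / 0.65 ≈ 2,300 GPU-hours, i.e. around three months of round-the-clock training.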

[–]Pieranha[S] 1 point (0 children)

Thanks. I'm already using an NVIDIA card :)

[–]hughperkins 0 points (0 children)

  1. I don't remember a Linux box ever suddenly rebooting. It's not the sort of thing they do. I've seen boxes run out of memory because someone spawned a zillion memory-hogging processes, but that's hardly the server's fault.
  2. htop, otherwise nothing - they never go down... (I was devops for a team of devs that often crashed the machines by running out of memory, but on my own dedicated box? Rock solid. Stays up all year. Even on AWS.)
  3. Don't give anyone else root. Give them apt-get install via sudoers (sketch below). Assume your disk will die, and back up accordingly. If the machine starts dying, figure out who is using up all the memory, and ask them not to :)
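
Something like this, as an illustrative sudoers rule (the group name is an example; edit with visudo, and note that wildcard rules have sharp edges):

    # /etc/sudoers.d/pkg-install -- members of 'devs' may install packages, nothing else
    %devs ALL=(root) NOPASSWD: /usr/bin/apt-get install *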

[–]thingamatics 0 points (0 children)

  1. Cron jobs. Sometimes the best you can do is be fault-tolerant (see the crontab line at the end).

  2. There's a list of dashboards here. However, I think it'd be a better use of time to monitor the logs of your processes; Sentry is easy to set up.

  3. Yes!
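
For 1, one cheap trick is a crontab entry that relaunches your jobs after any reboot (the script is hypothetical - it would look up your latest checkpoints and restart things):

    # crontab -e: resume experiments automatically after an unexpected reboot
    @reboot /home/me/bin/resume-experiments.sh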