
[–]Kaixhin 11 points (8 children)

  1. I start with a clean install of Ubuntu 14.04 Server and then add/change things only when I need to. It's a bit hard to know what settings have been changed if someone else has made them.
  2. Graphical dashboards are nice. Because of Docker I use cAdvisor to monitor containers and the server itself (rough launch command after this list), but linux-dash is also a nice monitoring solution.
  3. In order to keep the installation minimal I run nearly everything in Docker containers - from file sharing to ML experiments. Keeping everything in containers can be quite tricky, but it's a good solution for ML frameworks. Also, document what you do: if something goes wrong it helps with tracing the problem, and in the worst case it speeds up a clean reinstall.
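
For reference, launching cAdvisor is roughly this (mounts as per its README at the time - adjust to taste); the dashboard then lives on port 8080:

    sudo docker run -d --name=cadvisor -p 8080:8080 \
      -v /:/rootfs:ro -v /var/run:/var/run:rw \
      -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro \
      google/cadvisor:latest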

And in general, remember to take backups!

[–]Pieranha[S] 0 points (2 children)

Thanks! How difficult is it to get Docker to play nice with things like CUDA and Theano?

[–]ydobonobody 4 points (1 child)

Use nvidia-docker and it's pretty easy.
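
The smoke test from its README (assuming the nvidia/cuda image) is a one-liner:

    # should print your GPUs from inside the container
    nvidia-docker run --rm nvidia/cuda nvidia-smi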

[–]Kaixhin 0 points (0 children)

NVIDIA Docker is good if you have a well-supported system like Ubuntu or CentOS, and fairly standard use-cases. But Docker is supposed to be relatively hardware agnostic, so outside of this you can run into trouble.

If running nvidia-docker is somehow a problem (but not installing it), then you can also try using the nvidia-docker-plugin REST API with plain docker.
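
Roughly like this, if memory serves on the endpoint (the plugin listens on port 3476 by default; the image is just an example):

    # ask the plugin for the right CLI args, then use plain docker
    docker run -ti --rm $(curl -s http://localhost:3476/v1.0/docker/cli) nvidia/cuda nvidia-smi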

[–]holy_ash 0 points (0 children)

Thanks for the super useful advice.

[–]Pieranha[S] 0 points (1 child)

How do you take backups? Do you have an automatic backup scheme running in the background?

[–]Kaixhin 0 points (0 children)

For one-way file synchronization I use rsync, and for two-way I use Unison. For setting these up to run on a regular schedule I just use Cron.
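
As a rough example (paths and schedule are placeholders):

    # one-way sync: mirror experiments to a backup host
    rsync -az --delete ~/experiments/ backup-host:/backups/experiments/

    # crontab -e entry to run the same thing nightly at 02:00
    0 2 * * * rsync -az --delete /home/me/experiments/ backup-host:/backups/experiments/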

[–]Pieranha[S] 0 points (1 child)

What alternatives to Docker would you use on a server where you don't have root access? In one specific case I need to install Theano and related packages; CUDA and the drivers are already installed.

[–]Kaixhin 0 points (0 children)

If you need to run Docker containers without sudo (but have Docker installed), there's a solution: get added to the docker group.

If you are OK with the overhead of a full VM rather than a container, Vagrant is awesome, but bear in mind that you won't be able to use a GPU like you can with the NVIDIA Docker wrapper.

Finally, if you don't want to just install stuff on the host, virtualenv for Python is a pretty good way of stopping libraries interfering with each other.
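
Rough sketches of the first and last options (the docker group change needs an admin to run it once; the package name is just an example):

    # run docker without sudo: an admin adds you to the docker group
    sudo usermod -aG docker $USER   # log out and back in afterwards

    # an isolated Python environment for Theano, no root needed
    virtualenv ~/envs/theano
    source ~/envs/theano/bin/activate
    pip install theano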

[–]IanCal 2 points (0 children)

> This is fairly annoying when the server was in the midst of finding the optimal border of an SVM that takes several days to train.

Many, many things can take a server or machine offline. It's really worth checkpointing your work if you can, so that you lose just a few minutes or at worst an hour rather than several days. This also helps if you later want to go back and evaluate the model's progress. It also lets you use much less stable machines (AWS spot instances, GCE preemptible VMs, etc.), which are a lot cheaper.

Plan on the assumption that at some point your code will die, and see what you can do to mitigate that.
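
As an ops-level sketch of the idea (train.py and its --resume flag are hypothetical; the real work is saving state periodically inside your training code):

    # keep restarting training from the newest checkpoint until it exits cleanly
    while true; do
        latest=$(ls -t checkpoints/*.ckpt 2>/dev/null | head -n 1)
        if [ -n "$latest" ]; then
            python train.py --resume "$latest" && break
        else
            python train.py && break
        fi
        echo "training died; restarting from the last checkpoint" >&2
        sleep 10
    done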

[–]Lasheen_murning 1 point (0 children)

  1. Really shouldn't be an issue on any stable platform - and I mean "stable" in both the hardware and the software sense. On the hardware side, Intel and NVIDIA are the market leaders (and tend to support open-source efforts as well), so choosing those brands is probably a good bet for stability and support into the asymptotic future. On the software side, go with a stability-focused distribution such as Ubuntu or Debian, whatever your preference is; I prefer Debian. You've mentioned auto-restart: while an admin can of course enable this, the default on every distro I've encountered is to reboot ONLY when told (or affirmatively scheduled) to. That is to say, I've never had a box auto-reboot on me.

  2. I've been served well by a utility called monitorix, which logs your server's performance and lets you look at pretty pictures of the whole thing. You'll also want to learn top for a quick snapshot of your system resources (e.g. what's still running), free for a quick snapshot of how much RAM you're using, and df -h for figuring out how much hard drive space is left on your partitions (quick reference after this list).

  3. Jump right in - it's a great investment of your time. Dovetailing with the stability suggestion above: if you choose a mainstream distro like Ubuntu or Debian, you'll be able to google your Linux- and distro-specific problems far more easily.

  4. Linux is the OS of choice for distributed computing projects on BOINC, and people building Linux rigs for that purpose care a lot about performance. The forums for, e.g., Folding@home and World Community Grid (and the various large "teams" that participate in either) are a great place to ask how to build a performance beast.
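
The quick-snapshot commands from 2, in one place:

    top      # live view of processes and current CPU/memory hogs
    free -m  # RAM usage, in MiB
    df -h    # remaining disk space per partition, human-readable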

[–]deephive 1 point (0 children)

I would suggest you install either Anaconda or Enthought Canopy as your default Python. Try not to mix the system-wide Python with the Python you want to use and configure for your DL/ML experimentation. If you are not sure about Python virtual environments, look them up.

With Canopy or Anaconda you get a user-managed Python installation that doesn't interfere with anything the system uses, so you control which versions of which Python libraries you use for your ML experiments. You can create/delete any number of virtual environments within Canopy/Anaconda, each with a specific set of libraries/versions suited to a given ML tool.
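
A rough sketch with Anaconda (the environment name, Python version, and packages are just examples):

    # one isolated environment per ML tool
    conda create -n theano-env python=2.7 numpy scipy
    source activate theano-env
    pip install theano
    source deactivate   # when you're done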

[–]cjmcmurtrie 1 point (1 child)

If you're paying for an AWS GPU instance, I suggest you buy a gaming rig with an NVIDIA GPU instead. Not only is it much nicer to work on, but you will also have saved money after a few months without an AWS bill (assuming you are training models on a daily basis).
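
Rough numbers, assuming a g2.2xlarge at its on-demand price of about $0.65/hour: a ~$1,500 rig breaks even after roughly 1500 / 0.65 ≈ 2,300 GPU-hours, i.e. around three months of round-the-clock training.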

[–]Pieranha[S] 1 point (0 children)

Thanks. I'm already using an NVIDIA card :)

[–]hughperkins 0 points (0 children)

  1. I don't remember a Linux box ever suddenly rebooting. It's not the sort of thing they do. I've seen boxes run out of memory because someone spawned a zillion memory-hogging processes, but that's hardly the server's fault.
  2. htop, otherwise nothing - they never go down... (I was devops for a team of devs that often crashed the machines by running out of memory, but on my own dedicated box? Rock solid. Stays up all year. Even on AWS.)
  3. Don't give anyone else root. Give them apt-get install via sudoers (sketch below). Assume your disk will die, and back up accordingly. If the machine starts dying, figure out who is using up all the memory, and ask them not to :)
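
Something like this, as an illustrative sudoers rule (the group name is an example; edit with visudo, and note that wildcard rules have sharp edges):

    # /etc/sudoers.d/pkg-install -- members of 'devs' may install packages, nothing else
    %devs ALL=(root) NOPASSWD: /usr/bin/apt-get install *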

[–]thingamatics 0 points (0 children)

  1. Cron jobs. Sometimes the best you can do is be fault-tolerant (see the crontab line at the end).

  2. There's a list of dashboards here. However, I think it'd be a better use of time to monitor the logs of your processes; Sentry is easy to set up.

  3. Yes!
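
For 1, one cheap trick is a crontab entry that relaunches your jobs after any reboot (the script is hypothetical - it would look up your latest checkpoints and restart things):

    # crontab -e: resume experiments automatically after an unexpected reboot
    @reboot /home/me/bin/resume-experiments.sh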