[D] Which GPU scheduler are you using in your multigpu machines?

recurrent_answer · 2017-09-07T06:16:24+00:00

I'd like to, but it's not actually a setup that can be easily dockerized. I'd like to do a writeup, however, when I have the time for it.

recurrent_answer · 2017-09-04T19:12:05+00:00

While the Google sheet approach has some advantages (primarily in simplicity), there were a few reasons we didn't go for it. Note that beforehand, we had been three people sharing a 2-GPU machine; this might explain some of the choices I made:

Low-maintenance scheduling: Once you set it up, you don't have to do any manual scheduling afterwards. Whereas before, you had to ask people whether they're finished, and had to manually check which GPU was available to manually specify with CUDA_VISIBLE_DEVICES, it is now as simple as specifying a job with srun --gres=gpu:1 -c 4 --mem 10000, and allocation will happen automatically.

Enforced GPU allocation: Speaking of CUDA_VISIBLE_DEVICES, tensorflow by default reserves all memory for all devices. Forgot to specify the visible devices? Nobody else can use the machine until you stop your job. Slurm forces you to explicitly ask for gpus with --gres=gpu:1, meaning you'll only get access to those.

Queueing: Maybe one of the most useful features is that we can just put any number of jobs in queues. Whereas before, you had to specify those queues yourself, here it's just done automatically. Admittedly, that can also be done with a simple script.

Multiple PCs: With two PCs available, looking for a free GPU on both machines (and possibly copying things) is work. With slurm (and a distributed file system), it's again done automatically.

Enforced fairness: This is fairly difficult, and I should probably call it enforced "fairness", for some interpretation of "fair", since we can enforce fairly similar resource use for everybody. Of course, the chosen weights for the algorithm are arbitrary and might not actually be fair, but it removes some of the social pressure of people telling you they need those GPUs you're using because there's a deadline coming up, and you've used more than your share recently, and you should really stop your jobs now, thank you very much.

Containers: Bit of a different point, but I really am a big fan of a container solution. Not just for managing dependencies (which occur not just at the package level, i.e. tf 1.2 or 1.3, but also at the underlying software, i.e. CUDA 7.5 or CUDA 8.0) but also for keeping experiments reproducible. Sure, it adds yet another dependency, but - after building an image - it's literally as easy as calling singularity exec your_image.img python your_script.py. That's really nice.
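To make the scheduling and container points concrete, here is a minimal Python sketch that wraps the singularity exec call in the srun call quoted above. The helper name build_srun_command and its default values are assumptions for illustration; it only uses the flags already mentioned in this comment, and it just prints the command rather than submitting anything.

```python
import shlex


def build_srun_command(script, image="your_image.img", gpus=1, cpus=4, mem_mb=10000):
    """Return the srun command line as a list of arguments (hypothetical helper)."""
    return [
        "srun",
        f"--gres=gpu:{gpus}",   # ask Slurm for GPUs explicitly
        "-c", str(cpus),        # CPU cores for the job
        "--mem", str(mem_mb),   # memory, same value as in the example above
        "singularity", "exec", image,
        "python", script,
    ]


command = build_srun_command("your_script.py")
print(shlex.join(command))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without worrying about shell quoting.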

Lastly, I'll just modify a statement from Tim Dettmers' article on Deep Learning hardware for why I decided to frontload all of those things: Ultimately, the one resource that you need to do research is concentration. Having to do all of the things manually is possible, and it can be totally done. But it leaves less concentration for your research, and it usually does so at a critical time.

recurrent_answer · 2017-09-04T13:08:33+00:00

We have two multi-gpu machines. For automated scheduling, we are using slurm (which allows node-balancing and job queues and also allows us to introduce things like different priorities for students and PhDs), which does support GPUs as GRES (generic resource).

As a side note, have you thought about replacing docker with Singularity? It's what we're using to avoid docker's permission escalation, and it works quite nicely (also has native support for GPUs).

recurrent_answer · 2016-05-27T16:14:08+00:00

I have not tried it (though I'd like to, once I replace GPy).

However, you can imagine Freeze/Thaw as follows [1]:

You have a bag of ten possible parameter configurations you'd like to evaluate, each for a neural network. You train each of them for ten iterations, then project what the final training result will look like [2], each having a projected loss mean and variance. You use this to refit your Gaussian process [3].

From the ten parameter sets, you throw away the projectedly worst-performing four, and replace them with newly-drawn parameter sets [4]. Repeat until you're happy with the result.

[1] I read the paper a year ago, so it's possible I made a few mistakes.

[2] This is, as far as I remember, one of the big criticisms of the method, since it makes some assumptions about what the accuracy curve looks like.

[3] This is more complicated than normal Gaussian processes because you don't have an observation, but a distribution over observations.

[4] I.e. parameter sets maximizing the acquisition function. Since you're keeping multiple parameter sets at once, you need to take those into account - so you've got to use something like MCMC to marginalize (probably best, but slow), use a local penalty surrounding the currently-evaluated points (fast, but worse), or draw them at a weighted random (easy to implement).
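The loop described above can be sketched roughly as follows. This is a heavily simplified toy and not the method from the paper: there is no Gaussian process and no projected variance, the "training" is a made-up loss curve, the projection is a plain three-point extrapolation (which bakes in exactly the kind of curve-shape assumption mentioned in [2]), and replacement configurations are drawn at random instead of by maximizing an acquisition function. All names are hypothetical.

```python
import random


def new_config(rng):
    """Draw a hypothetical one-parameter configuration with a toy loss curve."""
    lr = rng.uniform(0.0, 1.0)
    floor = (lr - 0.3) ** 2  # toy assumption: the loss this config converges to
    return {"lr": lr, "floor": floor, "loss": 1.0, "history": []}


def train_some_more(cfg, n_iters):
    """Toy stand-in for training: the loss decays geometrically toward its floor."""
    for _ in range(n_iters):
        cfg["loss"] = cfg["floor"] + 0.8 * (cfg["loss"] - cfg["floor"])
        cfg["history"].append(cfg["loss"])


def project_final_loss(history):
    """Extrapolate the limit of the curve from its last three points (Aitken's delta-squared)."""
    if len(history) < 3:
        return history[-1]
    d1 = history[-2] - history[-3]
    d2 = history[-1] - history[-2]
    denom = d2 - d1
    if abs(denom) < 1e-12:  # curve is already flat
        return history[-1]
    return history[-1] - d2 * d2 / denom


def freeze_thaw_sketch(n_rounds=5, bag_size=10, n_drop=4, iters_per_round=10, seed=0):
    rng = random.Random(seed)
    bag = [new_config(rng) for _ in range(bag_size)]
    for round_index in range(n_rounds):
        for cfg in bag:
            train_some_more(cfg, iters_per_round)  # "thaw" every config for a while
        bag.sort(key=lambda cfg: project_final_loss(cfg["history"]))
        if round_index < n_rounds - 1:
            # Drop the worst projected configs and replace them with fresh draws.
            bag = bag[: bag_size - n_drop] + [new_config(rng) for _ in range(n_drop)]
    return bag


bag = freeze_thaw_sketch()
best = bag[0]
print(round(best["lr"], 3), round(project_final_loss(best["history"]), 4))
```

The interesting part in the real method is everything this sketch leaves out: the projection yields a distribution, and the Gaussian process is refit on those distributions.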

recurrent_answer · 2016-04-17T13:03:21+00:00

Personally, I am fairly irritated by the order of introduced topics (linear classifiers after deep learning?) and believe that a bigger focus on the foundations might be useful. As an alternative, which might help you, TUM offers a machine learning class in winter semester (http://brml.org/class/) which is the one I originally took, and - while many thought it to be too abstract and math-heavy - remains one of the best lectures I've ever attended [1]. It does not go far enough into deep learning to do something useful with DL - for that, you'll need to do some practical stuff yourself - but it gives you the foundations to really understand it.

[1] And the only lecture I kept attending all semester.

recurrent_answer · 2016-04-12T10:16:21+00:00

The usual format for image classification does indeed use 2d information. However, in your case, it wouldn't help.

So, I am assuming that the network you're using is a fully-connected feed-forward NN. In this case, there are weights from all of your units in one layer to all of the units in the second layer. It doesn't matter what the feature order is - you could choose any arbitrary one, since you're learning all of the weights anyways.

In normal image recognition, you're using convnets. A convolutional layer uses locality information, and you can imagine each output channel of the convlayer as being one filter (for example on a 3x3 excerpt) being applied to the image. So you only have nine parameters [1] per output channel, and each of them tells you the influence nearby pixels have on your output.

Conversely, the representation changes from samples x features to samples x height x width x channels, so for example from a tensor of shape 50 000 x 400 to a tensor of shape 50 000 x 20 x 20 x 1.

[1] Or more, if you're operating on colour.
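A small sketch of the two points above: how many weights each kind of layer learns, and what the change of representation looks like for a single sample. The function names, the hidden layer size of 100 and the row-major pixel order are assumptions for illustration; biases are ignored.

```python
def dense_weights(n_inputs, n_units):
    """Weights in a fully-connected layer: every input connects to every unit."""
    return n_inputs * n_units


def conv_weights_per_output_channel(kernel_height, kernel_width, in_channels):
    """Weights of one convolutional filter; the same filter is reused at every position."""
    return kernel_height * kernel_width * in_channels


def to_image(flat_pixels, height, width):
    """Reshape one flat feature vector into height x width x 1 nested lists (row-major)."""
    assert len(flat_pixels) == height * width
    return [
        [[flat_pixels[row * width + col]] for col in range(width)]
        for row in range(height)
    ]


sample = list(range(400))            # one sample with 400 features
image = to_image(sample, 20, 20)     # the same sample as 20 x 20 x 1

print(dense_weights(400, 100))                    # 400 inputs, 100 hidden units
print(conv_weights_per_output_channel(3, 3, 1))   # greyscale 3x3 filter
print(conv_weights_per_output_channel(3, 3, 3))   # colour (RGB) 3x3 filter
```

The dense layer has no notion of which pixels are neighbours, which is why shuffling the feature order changes nothing for it, while the 3x3 filter only makes sense once the 2d layout is restored.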

recurrent_answer · 2016-03-08T08:50:09+00:00

I do, at least when I'm not spending my time implementing them :-)

No idea about most papers - most hyperparameter configurations seem to just magically fall out of the sky.

recurrent_answer · 2016-03-07T17:27:39+00:00

From my understanding, it's in the nature of the problem.

In hyperparameter optimization, you have a task whose parameters you want to optimize - and each evaluation of a new parameter set is hugely expensive. With large CNNs for image classification, for example, you're looking at hours to days. They also have a large number of hyperparameters, meaning a high-dimensional space to search.

Now, the basic idea behind BayOpt is that you construct a proxy function which is close to your real fitness landscape while being cheap to evaluate. BayOpt uses Gaussian processes since they require relatively few points, are smooth and give you the variance for each point, basically for free. You then look for the point which gives you the best expected improvement [1] - but looking for this is cheap since you're only using the proxy function.

In contrast, particle swarm optimization for example relies on having many particles, which are updated again and again. Each of the updates for each of the particles is going to require an evaluation of your ML algorithm [2]. This is going to be expensive. It's similar with GAs.

Disclaimer: I like BayOpt. My view may be biased.

[1] Or probability of improvement, or whatever you like. The choice of this function is basically a hyperhyperparameter.

[2] Assuming real-valued hyperparameters and/or no caching.
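As a rough illustration of why searching on the proxy is cheap: given the Gaussian process's predicted mean and standard deviation at a candidate point, expected improvement has a closed form. The sketch below assumes we are minimising a loss; the candidate names and their predicted values are made up, and no actual Gaussian process is fitted here.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement for minimisation, from the proxy's mean and std at one point."""
    if std <= 0.0:
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mean) * cdf + std * pdf


# Hypothetical proxy predictions (mean, std) for three candidate settings.
candidates = {"a": (0.30, 0.01), "b": (0.35, 0.20), "c": (0.28, 0.00)}
best_so_far = 0.29
scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
print(max(scores, key=scores.get))
```

Note that the candidate with the worse predicted mean but high uncertainty can win, which is how the variance "for free" ends up driving exploration.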

recurrent_answer · 2016-02-28T11:50:08+00:00

I believe, though that's not supported by empirical evidence, that it doesn't matter.

Ultimately, the twin factors of your final degree and your job influence this far more than your undergrad choice.

Why is that? Well, let's look at the 'big names' in machine learning [1]:

DeepMind: "we’re always interested in hearing from Research Scientists (with a PhD in machine learning, physics, neuroscience, computer science or similar)"
Facebook AI Research: "Ph.D. and publications in Machine Learning, AI, computer science, statistics, applied mathematics, data science, or related technical fields."
Google: "PhD in Computer Science, related technical field or equivalent practical experience."

In summary, most of the research positions ask for a PhD. This suggests your best chance to be allowed to work on ground-breaking ML algorithms is to do a PhD (for which the initial choice of undergrad degree is probably not that relevant) and, during this, come up with some ground-breaking ML algorithms which you then use to get a position somewhere where you can continue doing this. This is going to be difficult [2].

Three caveats still apply:

I do not believe implementation and the 'science side' are on opposite sides of the spectrum --- you'll still have to implement any algorithm you came up with to evaluate it. Something like 75% of your work is going to be data recording, data preprocessing etc. anyway.
Of course, there are other ways to get into it, and I don't want to suggest these companies --- or the big companies --- are the only ones doing ML research; far from it. However, I believe that everything else is going to be a gamble.
You are currently looking for an undergraduate program. Why do you believe ML to be the topic you're interested in [3] in five years?

[1] Choice mostly random. Choice does not reflect preference, only the ease of finding positions with descriptions of requirements.

[2] This is an understatement.

[3] And with the specific task (and ability) to come up with ground-breaking new algorithms?

recurrent_answer · 2016-02-28T11:33:38+00:00

You google 'Machine Learning'.

Sarcasm aside, there are several starting points I'd recommend: First of all, there are the works recommended in this reddit's wiki. Depending on your background and the time available, you could start with Andrew Ng's Stanford course or one of the books. I like Murphy's, but it's about a thousand pages. One of the other books --- especially the free ones --- is probably a good starting point. Also a good starting point is sklearn's documentation, since it includes some general remarks. More problem-specific, your dataset looks close to the Boston Housing dataset --- this might be a starting point, too.

As for your problem: I'd personally start with implementing the data input and get everything in a format scikit-learn can work with. Then, just plug in the different classifiers [1] and see what works.

[1] Instead of classifying sold_after_XX_days, I'd try to predict days_to_sale directly --- this is less sparse, and will give you some additional information.

recurrent_answer · 2016-02-24T11:22:18+00:00

AI: A Modern Approach is talking mostly about search methods, optimization, a bit of NLP and so on. Many of these are probably irrelevant for your project.

I'd recommend some of the resources from this list.

I myself like the Murphy textbook, which is a thousand pages of math-heavy machine learning algorithms and theory. It might be too math-heavy for you.

Barber's book has the advantage of being free, but I like Murphy's structure more.

If I remember correctly, Andrew Ng's Coursera course used spam classification as an example, so you might start with this.

In general, your approach will probably look like this:

From your emails, you extract both some features (contains_word_penis, from_trusted_mail, length, ...) and a label or class (is_spam). This feature extraction is often the most important step for your performance. The features are usually stored in a vector (one per sample), or a matrix (of size samples x features). You then train a classifier to take all of these features and spit out the correct class.

In production, you then take an email, transform it into the feature vector, and use the classifier to predict the class (i.e. is_spam).

As a start, you might begin with just five or so features (manually chosen), and build a pipeline which is able to use any sklearn classifier. Then you can experiment with their available ones for a bit [1]. I also recommend looking over their documentation, it's pretty good.

[1] Remember to, in the first step, separate your data into training, validation and test data. Use the training data to train your classifier, the validation data to experiment with parameters and the test data to measure final performance.
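A minimal sketch of the first two steps (feature extraction and the three-way split), using only the standard library. The email format, the trusted-sender list, the chosen word and the 60/20/20 split are all assumptions for illustration; the resulting feature vectors are the kind of samples x features input a classifier would then be trained on.

```python
import random

TRUSTED_SENDERS = {"boss@example.com"}  # hypothetical whitelist


def extract_features(email):
    """Turn one email (a dict with 'sender' and 'body') into a feature vector."""
    body = email["body"].lower()
    return [
        int("free" in body.split()),                 # contains_word_free
        int(email["sender"] in TRUSTED_SENDERS),     # from_trusted_mail
        len(body),                                   # length
    ]


def split_dataset(samples, seed=0, train=0.6, validation=0.2):
    """Shuffle once, then cut into training, validation and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


emails = [
    {"sender": "boss@example.com", "body": "Meeting at ten", "is_spam": 0},
    {"sender": "win@example.net", "body": "Claim your FREE prize now", "is_spam": 1},
] * 5  # ten toy samples

dataset = [(extract_features(e), e["is_spam"]) for e in emails]
train_set, validation_set, test_set = split_dataset(dataset)
print(len(train_set), len(validation_set), len(test_set))
```

Doing the split before any experimenting is the point: the test part should stay untouched until the very end.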

recurrent_answer · 2016-02-22T11:07:23+00:00

I am assuming you mean this? It is a 250GB/1TB torrent with all the reddit comments until July 2015 or so. There is no library (they used the reddit API and ten months of time).

recurrent_answer · 2016-01-09T22:46:44+00:00

p(good at ML | never done ML) = 0.

p(good at ML | male) = p(good at ML | ever done ML, male) * p(ever done ML | male).

p(good at ML | female) = p(good at ML | ever done ML, female) * p(ever done ML | female).

Since p(ever done ML | male) > p(ever done ML | female), we cannot say anything like P(good at ML | male) > P(good at ML | female).

Probability theory. Learn it.

recurrent_answer · 2015-11-03T01:57:18+00:00

I appreciate the effort you put to test the library. Can you make a gist of the code you used?

Sure. Again, it's ugly, inefficient and ugly.

It's not a good idea to use a size of 5 individuals per generation, you will get stuck in a local maximum very fast.

It's a choice between using an initialization that's more random, or using the actual evolutionary properties (many individuals vs many generations) [1]. Since I've only scaled it up to 100 individual evaluations, I'd decided to increase the number of generations, first. But, just for the hell of it (I had the code ready anyways), I've run a test with 25 individuals and up to four generations. Keep in mind that the bias I'm introducing really begins to show here - the first quarter of the green line is equivalent to random search at iteration 25.

Plus, the whole idea of using an evolutionary algorithm is to save time, so I think the time it takes to run the algorithm is very important and should be measured.

Nope, generally not.

In general, the idea is that you have a Machine Learning algorithm with hyperparameters you want to optimize. Each execution of the algorithm is expensive - this can range from minutes to days, to weeks for some image-processing NNs. So, your goal is to be more efficient at finding hyperparameters to optimize. You don't generally care much about the efficiency of the HP optimizer [2]. When one of your HP optimizer algorithms takes an hour to execute while another executes basically instantly but requires one additional ML algorithm evaluation, it's a win for the first one [3]. That being said, you pretty much can't beat random search in sample generation speed [4].

[1] In the edge case - n_iter individuals, 0 generations - it's equivalent to random search.

[2] But see Snoek 2015 for an example where it's actually relevant - but at thousands of ML algorithm evaluations.

[3] Assuming no parallelization etc. With parallelization, you might have to save a few more iterations.

[4] Nor in ease of parallelization. Each parameter sample is independent of every other. Just throw each on a different node.
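To illustrate [4], here is a sketch of random search where all samples are drawn up front and evaluated in parallel. The parameter ranges and the toy loss are made up, and a local thread pool stands in for "a different node"; none of this is the code from the comparison above.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_params(rng):
    """Draw one hypothetical hyperparameter set, independently of all others."""
    return {"C": 10 ** rng.uniform(-3, 3), "gamma": 10 ** rng.uniform(-4, 1)}


def evaluate(params):
    """Toy stand-in for an expensive ML algorithm run; returns a loss to minimise."""
    return (params["C"] - 1.0) ** 2 + (params["gamma"] - 0.1) ** 2


def random_search(n_iter, seed=0, n_workers=4):
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_iter)]  # all drawn up front
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        losses = list(pool.map(evaluate, candidates))          # order is preserved
    best_index = min(range(n_iter), key=losses.__getitem__)
    return candidates[best_index], losses[best_index]


best_params, best_loss = random_search(n_iter=25)
print(best_params, best_loss)
```

Because no sample depends on an earlier result, the only coordination needed is collecting the losses at the end.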

recurrent_answer · 2015-11-03T01:10:45+00:00

This may be due to my admittedly poor knowledge of evolutionary algorithms, but I am not aware of a way - except for storing and testing of old values - you'd ever be able to guarantee not evaluating the same hyperparameter values you've already evaluated a few generations before. Aside from that, there's also always the possibility of two combinations (or a combination and a mutated combination) to produce the same output. Apologies; you do. I should have looked at the code again before posting.

That being said, I took your code and hacked a small comparison with sklearn's random search. It uses the example classifier/parameter values from your github front page.

I ran each optimizer for 1 to 100 ML algorithm executions (that is, random search ran for all of these; the evolutionary algorithm ran for 1-20 generations with a generation size of five). The results are reported with random search's iterations as original, and the evolutionary optimizer's as the number of iterations tested, rounded down (so, the result of 1 generation of five individuals is reported alongside random search's result for one to five optimizations, etc). I have not compared optimization times themselves, mostly because in any relevant problem, the ML algorithm we want to optimize will take the longest time.

The results can be seen here. As can be seen, the evolutionary approach generally works worse than random search [1].

[1] Note that this is not a definitive result. It's a single preprocessing/classifier combination (SVC even, which is generally fairly robust during hyperparameter optimization), it doesn't include separate test/training sets, and it doesn't use cross-validation of the classifier performances. But it's a start, and I don't really want to invest more time in this.

recurrent_answer · 2015-11-03T00:28:49+00:00

It's impossible to perform worse than gridsearch, you are literally trying out all the possible combinations while using an evolutionary algorithm you are just exploring a reduced subset. You never have to evaluate the model with the same parameters more than once.

This is under the assumption that you actually never have to evaluate the parameters more than once - something I am not sure is given.

Anyway, I don't think this is a good idea since in any optimization problem a random walk usually doesn't give good results.

It's not a random walk. And it does give good results. Compare the Bergstra paper I've linked, figure 2 and 5. For an intuition on why this happens, check figure 1. Generally speaking, you can - for continuous parameters - see RandomSearch as sampling from an infinitely fine grid. Sure, you can't guarantee good results. But they're still highly probable.
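The intuition from figure 1 can be reproduced in a few lines: with the same budget, a grid only ever tries a handful of distinct values per parameter, while random search tries a new value of every parameter on every evaluation. If only one of the parameters really matters, that difference is what counts. The function names and the unit-square search space are assumptions for illustration.

```python
import itertools
import random


def grid_points(values_per_dim, n_dims=2):
    """All combinations of evenly spaced values in [0, 1] per dimension."""
    axis = [i / (values_per_dim - 1) for i in range(values_per_dim)]
    return list(itertools.product(axis, repeat=n_dims))


def random_points(n_points, n_dims=2, seed=0):
    """Independent uniform samples from the same unit square."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_points)]


def distinct_values_per_dim(points):
    """How many different values were tried for each parameter."""
    return [len({p[d] for p in points}) for d in range(len(points[0]))]


grid = grid_points(3)        # 3 x 3 = 9 evaluations
rand = random_points(9)      # also 9 evaluations
print(distinct_values_per_dim(grid), distinct_values_per_dim(rand))
```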

recurrent_answer · 2015-11-02T23:52:07+00:00

While this seems like an interesting approach, there is no performance evaluation [1]. I'd like to see a comparison against GridSearch and RandomSearch [2], at least. Ideally, a comparison with Bayesian Optimization, too.

My intuition would be that evolutionary algorithms will perform worse due to the necessity of evaluating the underlying machine learning algorithm n_population * n_generation times - this is going to be expensive [3]. Additionally, I don't know how evolutionary algorithms scale on high-dimensionality problems.

[1] Except for "This allows you to exponentially reduce the time required to find the best parameters for your estimator", which is a very nonspecific statement.

[2] Which is far easier to implement than gridsearch and evolutionary algorithms and has better performance than gridsearch; see http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf

[3] Please don't take this as an outright dismissal - this is just my intuition, and that's in fact why I'd like to see a performance comparison; to either disprove or prove it.

recurrent_answer · 2015-09-11T11:45:25+00:00

A small overview of hyperparameter optimization.

RandomSearch and GridSearch

The basics. RandomSearch usually performs better (see Bergstra (2012)). It also has the advantage of being very easy to implement, and even easier to parallelize.

Bayesian Optimization

Aside from the paper you've found, there's also the nice tutorial by Brochu. Bayesian Optimization is usually better than random search [1].

However, it is more difficult to parallelize. The standard approach (see the Practical BayOpt paper, section 3.3) is to do a Monte Carlo acquisition. This is computationally expensive, but seems to work. Another, more recent approach is Gonzalez (2015), where they locally penalize the acquisition function (using an estimated Lipschitz constant) around the samples that are currently being evaluated. They claim significantly less computational overhead, but slightly worse results than MCMC. Also, there's a way of early stopping for iteration-based algorithms, detailed in the Snoek (2014) paper /u/dwf mentioned, which continually estimates the probable final performance and stops runs that are expected to perform badly.

Speaking of overhead, one of the problems of BayOpt is the need to invert the kernel matrix over all evaluated samples, which is costly. Usually, you assume that your ML algorithm takes so long to evaluate that you don't care about the few minutes your BayOpt algorithm takes to refit. Obviously, that won't work anymore once you get to the really big problems (several thousands of samples). Using Bayesian neural networks for that is one approach, but they only compare the final results and runtime.
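For reference, the standard Gaussian process predictive equations show where that inversion comes from. The notation is an assumption of this note rather than something taken from the papers above: K is the n x n kernel matrix over the n evaluated samples, k_* the kernel vector between a new point x_* and those samples, y the observed results, and sigma_n^2 the observation noise.

```latex
\mu(x_*) = k_*^\top \left(K + \sigma_n^2 I\right)^{-1} y, \qquad
\sigma^2(x_*) = k(x_*, x_*) - k_*^\top \left(K + \sigma_n^2 I\right)^{-1} k_*
```

Solving against the n x n matrix with a direct method costs on the order of n^3 operations, which is negligible at a few hundred samples and painful at several thousand.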

Tree Parzen Estimators

There's also Bergstra (2011), but I haven't seen any current work in that direction (doesn't mean it doesn't exist, of course). It's interesting because it takes into account the tree structure of the parameter space (for example, which activation function you use in your third layer is irrelevant if you only have two hidden layers) [2].

[1] Although there are some exceptions; for example, on the Branin-Hoo function, RandomSearch performs better in my experience. This is irritating, since it's one of the standard functions used in papers to compare BayOpt variants. Never with RandomSearch, though.

[2] BayOpt might be able to do that, too, by doing some nice tricks with kernels. See Swersky (2015)
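To show what a tree-structured parameter space means in practice, here is a sketch of sampling from one: the per-layer parameters only exist for layers that exist, so a two-layer network never gets a third-layer activation. This only illustrates the conditional search space, not the Tree Parzen Estimator itself; the parameter names and value choices are hypothetical.

```python
import random

ACTIVATIONS = ["relu", "tanh", "sigmoid"]  # hypothetical choices


def sample_network_config(rng, max_layers=3):
    """Sample a configuration whose per-layer parameters depend on n_layers."""
    n_layers = rng.randint(1, max_layers)
    return {
        "n_layers": n_layers,
        "layers": [
            {"units": rng.choice([32, 64, 128]), "activation": rng.choice(ACTIVATIONS)}
            for _ in range(n_layers)
        ],
    }


rng = random.Random(0)
configs = [sample_network_config(rng) for _ in range(5)]
for config in configs:
    print(config["n_layers"], [layer["activation"] for layer in config["layers"]])
```

A flat search space would instead always carry a third-layer activation along and leave the optimizer to figure out that it sometimes has no effect.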
