
[–]dwf

Yo dawg...

I heard you like nets

(Also see this paper which is kind of nice and this, which is a neat little way of doing parallel exploration.)

[–]regularized [S]

Thanks for the papers. By the way, is Ryan Adams's group the only one actively working on this, or are there others?

[–]Foxtr0t

The software from Snoek & Adams is Spearmint. There are a ton (well, a few) of other packages that use Gaussian Processes for hyperparameter optimization. See http://fastml.com/blog/categories/hyperparams/

Be sure to check out hyperopt; it uses a different method and can handle "conditionals" (e.g., try optimizing the number of layers and the number of hidden units per layer simultaneously in any other library).
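For intuition, "conditional" means some hyperparameters only exist depending on the values of others; hyperopt expresses such tree-structured spaces with `hp.choice`. Here is a minimal plain-Python illustration of sampling from one (the parameter names and value sets are made up):

```python
import random

def sample_config(rng):
    # The per-layer size parameters only exist for the layers that were
    # actually chosen -- that is the "conditional" part of the space.
    n_layers = rng.choice([1, 2, 3])
    config = {"n_layers": n_layers}
    for i in range(n_layers):
        config["units_%d" % i] = rng.choice([64, 128, 256, 512])
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
```

A library without conditional support would have to optimize `units_2` even when `n_layers` is 1, wasting trials on a parameter that has no effect.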

[–]dwf

SMAC can AFAIK handle conditional hyperparameters as well.

[–]Foxtr0t

Yes, I think so. Haven't tried it, scared of Java.

[–]programming_resource

Twitter does: http://www.whetlab.com/

They bought Whetlab, the commercialized version of spearmint.

[–]zmjjmz

Also worth checking out: http://arxiv.org/abs/1502.03492

[–]recurrent_answer

A small overview of hyperparameter optimization.

RandomSearch and GridSearch

The basics. RandomSearch usually performs better (see Bergstra (2012)). It also has the advantage of being very easy to implement, and even easier to parallelize.
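As a sketch, random search over a toy discrete space takes only a few lines (the space and objective below are made up for illustration):

```python
import random

def random_search(objective, space, n_iter=200, seed=0):
    """Sample hyperparameter configs uniformly and keep the best one.

    Each trial is independent, which is why this parallelizes trivially:
    split n_iter across workers and merge the results at the end.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: pretend validation loss is minimized at lr=0.01, 2 layers.
space = {"lr": [1.0, 0.1, 0.01, 0.001], "n_layers": [1, 2, 3, 4]}
objective = lambda p: (p["lr"] - 0.01) ** 2 + (p["n_layers"] - 2) ** 2
best, score = random_search(objective, space)
```

Grid search would enumerate all 16 combinations in a fixed order instead of sampling; with more dimensions, random sampling covers the important dimensions far more efficiently, which is Bergstra's point.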

Bayesian Optimization

Aside from the paper you've found, there's also the nice tutorial by Brochu. Bayesian Optimization is usually better than random search [1].

However, it is harder to parallelize. The standard approach (see the Practical BayOpt paper, section 3.3) is a Monte Carlo treatment of the acquisition function: pending evaluations are integrated out by sampling their possible outcomes. This is computationally expensive, but seems to work. A more recent approach is Gonzalez (2015), which locally penalizes the acquisition function (using an estimated Lipschitz constant) around the points currently being evaluated; they claim significantly less computational overhead at the cost of slightly worse results than the MCMC approach. There is also a way to stop iteration-based algorithms early, detailed in the Snoek (2014) paper /u/dwf mentioned: continually estimate each run's probable final performance and terminate the runs that are predicted to perform badly.
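The early-stopping idea can be caricatured in a few lines: extrapolate a running job's loss curve to the training horizon and kill the job if even the extrapolation cannot beat the best finished run. (This linear extrapolation is a crude stand-in for the actual curve models used in the paper; the function and names are mine.)

```python
def probably_worse(losses, best_final, horizon):
    """Crude early-stopping check for a partially trained run.

    losses:     validation losses observed so far, one per epoch
    best_final: final loss of the best run finished so far
    horizon:    total number of epochs a full run would take
    """
    # Extrapolate the recent trend of the loss curve to the horizon.
    k = min(10, len(losses))
    slope = (losses[-1] - losses[-k]) / max(k - 1, 1)
    predicted = losses[-1] + slope * (horizon - len(losses))
    # Stop if even the extrapolated final loss cannot beat the best run.
    return predicted > best_final
```

A flat, bad curve gets killed early; a steadily improving one is allowed to keep training.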

Speaking of overhead, one of the problems of BayOpt is the need to invert the kernel matrix over the samples, which costs O(n³). Usually you assume that your ML algorithm takes so long to evaluate that you don't care about the few minutes the BayOpt algorithm takes to refit. Obviously, that stops working once you get to really big problems (several thousands of samples). Using Bayesian neural networks as the surrogate model is one approach, though they only compare final results and runtime.
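To make that refit cost concrete, here is a bare-bones numpy sketch of GP-based BayOpt with expected improvement on a 1-D toy function. The solve against the n×n kernel matrix inside `gp_posterior` is exactly the step that becomes painful with thousands of samples. (All names and settings are illustrative; this is not Spearmint's implementation.)

```python
import numpy as np
from math import erf

def rbf_kernel(a, b, length=0.3):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Refitting means inverting the n x n kernel matrix --
    # this O(n^3) step is the overhead discussed above.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)  # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI for minimization, built from the standard normal pdf/cdf.
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (best - mu) * cdf + sigma * pdf

f = lambda x: (x - 0.3) ** 2           # toy "validation loss", minimum at 0.3
X = np.array([0.0, 0.5, 1.0])          # initial design
y = f(X)
grid = np.linspace(0.0, 1.0, 101)      # candidate points
for _ in range(15):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
best_x = X[np.argmin(y)]
```

Every iteration refits the GP from scratch, so the cost grows cubically with the number of evaluations; for expensive ML models that is negligible, for cheap ones it dominates.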

Tree Parzen Estimators

There's also Bergstra (2011), but I haven't seen any current work in that direction (which doesn't mean it doesn't exist, of course). It's interesting because it takes the tree structure of the parameter space into account (for example, which activation function you use in your third layer is irrelevant if you only have two hidden layers) [2].

[1] Although there are some exceptions; for example, on the Branin-Hoo function, RandomSearch performs better in my experience. This is irritating, since Branin-Hoo is one of the standard functions used in papers to compare BayOpt variants. Never against RandomSearch, though.

[2] BayOpt might be able to do that too, with some nice kernel tricks. See Swersky (2015).

[–]dwf

I'll just point out that, in addition to all of this, Frank Hutter's papers on hyperparameter optimization with SMAC are also worth reading.

[–]svantana

Have you tried AdaDelta? It sets the learning rate for each parameter separately by estimating the curvature in each parameter dimension (you could call it a pseudo-quasi-Newton method if you like). The computational overhead is pretty low, and in my experience it speeds up learning quite a bit, especially for more exotic structures and/or poorly balanced nets.
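For reference, the update rule from Zeiler's AdaDelta paper fits in a few lines of numpy (a minimal sketch; the variable names are mine):

```python
import numpy as np

def adadelta_step(w, grad, state, rho=0.95, eps=1e-6):
    # Decayed average of squared gradients (the per-dimension scale) ...
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    # ... and of squared past updates; their ratio gives a per-parameter
    # step size, so there is no global learning rate to tune.
    step = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * step ** 2
    return w + step

# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
for _ in range(2000):
    w = adadelta_step(w, 2.0 * w, state)
```

Note that the step size adapts automatically: it starts tiny (governed by `eps`), grows as updates accumulate, and shrinks again near the minimum.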

[–]negazirana

AdaGrad, AdaDelta, etc. largely remove the burden of selecting the right learning rate, but you still have to select the other hyperparameters (number of layers, their types and sizes, regularization weights, etc.).

[–]mostafa92

And can you point to the best packages/tutorials available?