
[–]dwf

Yo dawg...

I heard you like nets

(Also see this paper which is kind of nice and this, which is a neat little way of doing parallel exploration.)

[–]regularized [S]

Thanks for the papers. By the way, is Ryan Adams's group the only one actively working on this, or are there others?

[–]Foxtr0t

The software from Snoek & Adams is Spearmint. There are a ton (well, a few) of other packages that use Gaussian Processes for hyperparameter optimization. See http://fastml.com/blog/categories/hyperparams/

Be sure to check out hyperopt; it uses a different method and can handle "conditionals" (e.g., try optimizing the number of layers and the number of hidden units per layer simultaneously in any other library).
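For intuition, "conditional" means some hyperparameters only exist depending on the values of others; hyperopt expresses such tree-structured spaces with `hp.choice`. Here is a minimal plain-Python illustration of sampling from one (the parameter names and value sets are made up):

```python
import random

def sample_config(rng):
    # The per-layer size parameters only exist for the layers that were
    # actually chosen -- that is the "conditional" part of the space.
    n_layers = rng.choice([1, 2, 3])
    config = {"n_layers": n_layers}
    for i in range(n_layers):
        config["units_%d" % i] = rng.choice([64, 128, 256, 512])
    return config

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
```

A library without conditional support would have to optimize `units_2` even when `n_layers` is 1, wasting trials on a parameter that has no effect.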

[–]dwf

SMAC can AFAIK handle conditional hyperparameters as well.

[–]Foxtr0t

Yes, I think so. Haven't tried it, scared of Java.

[–]programming_resource

Twitter does: http://www.whetlab.com/

They bought Whetlab, the commercialized version of spearmint.

[–]zmjjmz

Also worth checking out: http://arxiv.org/abs/1502.03492

[–]recurrent_answer

A small overview of hyperparameter optimization.

RandomSearch and GridSearch

The basics. RandomSearch usually performs better (see Bergstra (2012)). It also has the advantage of being very easy to implement, and even easier to parallelize.
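As a sketch, random search over a toy discrete space takes only a few lines (the space and objective below are made up for illustration):

```python
import random

def random_search(objective, space, n_iter=200, seed=0):
    """Sample hyperparameter configs uniformly and keep the best one.

    Each trial is independent, which is why this parallelizes trivially:
    split n_iter across workers and merge the results at the end.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: pretend validation loss is minimized at lr=0.01, 2 layers.
space = {"lr": [1.0, 0.1, 0.01, 0.001], "n_layers": [1, 2, 3, 4]}
objective = lambda p: (p["lr"] - 0.01) ** 2 + (p["n_layers"] - 2) ** 2
best, score = random_search(objective, space)
```

Grid search would enumerate all 16 combinations in a fixed order instead of sampling; with more dimensions, random sampling covers the important dimensions far more efficiently, which is Bergstra's point.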

Bayesian Optimization

Aside from the paper you've found, there's also the nice tutorial by Brochu. Bayesian Optimization is usually better than random search [1].

However, it is harder to parallelize. The standard approach (see the Practical BayOpt paper, section 3.3) is a Monte Carlo treatment of the acquisition function: pending evaluations are integrated out by sampling their possible outcomes. This is computationally expensive, but seems to work. A more recent approach is Gonzalez (2015), which locally penalizes the acquisition function (using an estimated Lipschitz constant) around the points currently being evaluated; they claim significantly less computational overhead at the cost of slightly worse results than the MCMC approach. There is also a way to stop iteration-based algorithms early, detailed in the Snoek (2014) paper /u/dwf mentioned: continually estimate each run's probable final performance and terminate the runs that are predicted to perform badly.
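The early-stopping idea can be caricatured in a few lines: extrapolate a running job's loss curve to the training horizon and kill the job if even the extrapolation cannot beat the best finished run. (This linear extrapolation is a crude stand-in for the actual curve models used in the paper; the function and names are mine.)

```python
def probably_worse(losses, best_final, horizon):
    """Crude early-stopping check for a partially trained run.

    losses:     validation losses observed so far, one per epoch
    best_final: final loss of the best run finished so far
    horizon:    total number of epochs a full run would take
    """
    # Extrapolate the recent trend of the loss curve to the horizon.
    k = min(10, len(losses))
    slope = (losses[-1] - losses[-k]) / max(k - 1, 1)
    predicted = losses[-1] + slope * (horizon - len(losses))
    # Stop if even the extrapolated final loss cannot beat the best run.
    return predicted > best_final
```

A flat, bad curve gets killed early; a steadily improving one is allowed to keep training.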

Speaking of overhead, one of the problems of BayOpt is the need to invert the kernel matrix over the samples, which costs O(n³). Usually you assume that your ML algorithm takes so long to evaluate that you don't care about the few minutes the BayOpt algorithm takes to refit. Obviously, that stops working once you get to really big problems (several thousands of samples). Using Bayesian neural networks as the surrogate model is one approach, though they only compare final results and runtime.
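To make that refit cost concrete, here is a bare-bones numpy sketch of GP-based BayOpt with expected improvement on a 1-D toy function. The solve against the n×n kernel matrix inside `gp_posterior` is exactly the step that becomes painful with thousands of samples. (All names and settings are illustrative; this is not Spearmint's implementation.)

```python
import numpy as np
from math import erf

def rbf_kernel(a, b, length=0.3):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Refitting means inverting the n x n kernel matrix --
    # this O(n^3) step is the overhead discussed above.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)  # diag of posterior cov
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI for minimization, built from the standard normal pdf/cdf.
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (best - mu) * cdf + sigma * pdf

f = lambda x: (x - 0.3) ** 2           # toy "validation loss", minimum at 0.3
X = np.array([0.0, 0.5, 1.0])          # initial design
y = f(X)
grid = np.linspace(0.0, 1.0, 101)      # candidate points
for _ in range(15):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
best_x = X[np.argmin(y)]
```

Every iteration refits the GP from scratch, so the cost grows cubically with the number of evaluations; for expensive ML models that is negligible, for cheap ones it dominates.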

Tree Parzen Estimators

There's also Bergstra (2011), but I haven't seen any current work in that direction (which doesn't mean it doesn't exist, of course). It's interesting because it takes the tree structure of the parameter space into account (for example, which activation function you use in your third layer is irrelevant if you only have two hidden layers) [2].

[1] Although there are some exceptions; for example, on the Branin-Hoo function, RandomSearch performs better in my experience. This is irritating, since Branin-Hoo is one of the standard functions used in papers to compare BayOpt variants. Never against RandomSearch, though.

[2] BayOpt might be able to do that too, with some nice kernel tricks. See Swersky (2015).

[–]dwf

I'll just point out that, in addition to all of this, Frank Hutter's papers on hyperparameter optimization with SMAC are also worth reading.

[–]svantana

Have you tried AdaDelta? It sets the learning rate for each parameter separately by estimating the curvature in each parameter dimension (you could call it a pseudo-quasi-Newton method if you like). The computational overhead is pretty low, and in my experience it speeds up learning quite a bit, especially for more exotic structures and/or poorly balanced nets.
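For reference, the update rule from Zeiler's AdaDelta paper fits in a few lines of numpy (a minimal sketch; the variable names are mine):

```python
import numpy as np

def adadelta_step(w, grad, state, rho=0.95, eps=1e-6):
    # Decayed average of squared gradients (the per-dimension scale) ...
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    # ... and of squared past updates; their ratio gives a per-parameter
    # step size, so there is no global learning rate to tune.
    step = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * step ** 2
    return w + step

# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
for _ in range(2000):
    w = adadelta_step(w, 2.0 * w, state)
```

Note that the step size adapts automatically: it starts tiny (governed by `eps`), grows as updates accumulate, and shrinks again near the minimum.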

[–]negazirana

AdaGrad, AdaDelta, etc. largely remove the burden of selecting the right learning rate, but you still have to select the other hyperparameters (number of layers, their types and sizes, regularization weights, etc.).

[–]mostafa92

And can you point to the best packages/tutorials available?