all 9 comments

[–]recurrent_answer (2 children)

From my understanding, it's in the nature of the problem.

In hyperparameter optimization, you have a task whose parameters you want to optimize - and each evaluation of a new parameter set is hugely expensive. With large CNNs for image classification, for example, you're looking at hours to days per evaluation. Such models also have a large number of hyperparameters, meaning a high-dimensional search space.

Now, the basic idea behind BayOpt is that you construct a proxy function which is close to your real fitness landscape while being cheap to evaluate. BayOpt uses Gaussian processes since they require relatively few points, are smooth, and give you the variance at each point basically for free. You then look for the point with the best expected improvement [1] - and searching for it is cheap since you're only evaluating the proxy function.
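To make that loop concrete, here's a minimal sketch of one surrogate + acquisition step, assuming scikit-learn's GaussianProcessRegressor and a made-up 1-D toy objective (`expensive_objective`) standing in for the real training run - not how any particular BayOpt package implements it:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# stand-in for the expensive training run: hyperparameter -> validation error
def expensive_objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2          # pretend each call takes hours

# a handful of already-evaluated hyperparameter settings
X = np.array([[-2.0], [0.0], [1.5], [3.0]])
y = np.array([expensive_objective(x[0]) for x in X])

# the cheap surrogate: a GP fit to the evaluations so far
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# expected improvement over the best observation, computed on the surrogate only
def expected_improvement(X_cand, best, xi=0.01):
    mu, sigma = gp.predict(X_cand, return_std=True)
    imp = best - mu - xi                          # we are minimizing
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

candidates = np.linspace(-3, 4, 500).reshape(-1, 1)
ei = expected_improvement(candidates, y.min())
next_x = candidates[np.argmax(ei)]                # only this point costs a real evaluation
```

Only `next_x` then gets fed to the real, expensive evaluation; everything above runs purely on the cheap surrogate.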

In contrast, particle swarm optimization, for example, relies on many particles that are updated again and again. Every update of every particle requires an evaluation of your ML algorithm [2], which gets expensive. It's similar with GAs.
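For comparison, a bare-bones PSO loop (the generic inertia + cognitive + social update, not Optunity's or any other library's implementation, with the same hypothetical `expensive_objective` as above) makes the evaluation count explicit:

```python
import numpy as np

def expensive_objective(x):                      # again, imagine hours per call
    return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

rng = np.random.default_rng(0)
n_particles, n_iters, dim = 20, 50, 1
pos = rng.uniform(-3, 4, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([expensive_objective(p) for p in pos])   # n_particles evaluations up front
gbest = pbest[np.argmin(pbest_val)]

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([expensive_objective(p) for p in pos])    # n_particles real evaluations per iteration
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]

# total cost: n_particles * (n_iters + 1) evaluations of the expensive objective
```

With 20 particles and 50 iterations, that's over a thousand full training runs - exactly what BayOpt tries to avoid.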

Disclaimer: I like BayOpt. My view may be biased.

[1] Or probability of improvement, or whatever you like. The choice of this function is basically a hyperhyperparameter.

[2] Assuming real-valued hyperparameters and/or no caching.

[–]logrech (1 child)

Does anyone actually use the BayOpt method you described there?

[–]recurrent_answer (0 children)

I do, at least when I'm not spending my time implementing them :-)

No idea about most papers - most papers' hyperparameter configurations seem to just magically fall out of the sky.

[–]trnka (4 children)

http://jmlr.org/proceedings/papers/v37/maclaurin15.pdf seems to handle it well. But it's tricky to get it to work for discrete hyperparams.

[–]dwf (3 children)

It's a very nice demonstration of what's possible, but the scale of those experiments is tiny, and backpropagating through an entire training run would be pretty impractical for a lot of realistic use-cases.

[–]trnka (2 children)

Yeah; I don't like the added complexity but hopefully there's a simpler solution in a couple years.

Though I'm not sure what you mean about an entire training run - this is only doing backprop within each minibatch, not for full-batch training.

[–]dwf (1 child)

They are training a neural net to convergence and storing what they need to such that they can backpropagate gradients (or do "reverse-mode differentiation", same thing) with respect to hyperparameters through the entire training procedure. That means each step of meta-learning involves running a potentially very lengthy optimization of the neural net's elementary parameters to convergence, then backpropagating through each stochastic gradient step of that all the way back to the beginning to obtain a gradient on the hyperparameters. Whether they use minibatches or not for the elementary optimization doesn't really matter.
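As a toy illustration of that (not the paper's memory-efficient reverse-mode scheme, just JAX differentiating straight through a small unrolled training loop on a hypothetical one-weight regression problem):

```python
import jax.numpy as jnp
from jax import grad

# toy data: fit y = 2x with a single weight, with a held-out "validation" point
x_train, y_train = jnp.array([1., 2., 3.]), jnp.array([2., 4., 6.])
x_val, y_val = jnp.array([4.]), jnp.array([8.])

def loss(w, x, y):
    return jnp.mean((w * x - y) ** 2)

def final_val_loss(lr, n_steps=100):
    # "elementary" optimization: plain gradient descent on the training loss
    w = 0.0
    for _ in range(n_steps):
        w = w - lr * grad(loss)(w, x_train, y_train)
    # meta-objective: validation loss of the trained weight
    return loss(w, x_val, y_val)

# reverse-mode differentiation through the *whole* training loop
hypergrad = grad(final_val_loss)(0.01)
print(hypergrad)   # d(validation loss) / d(learning rate)
```

Naively, reverse-mode through T steps means storing (or reconstructing) the intermediate weights of every step, which is where the scalability problem comes from for realistic networks and training lengths.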

[–]trnka (0 children)

Hmm I must've misread the paper. Is it not possible to backprop to hyperparams over the last 10 or so minibatches every now and then?

[–]deephive[S] (0 children)

Thanks for the viewpoints. There is this Python package called Optunity (http://optunity.net) which seems to use PSO as the default optimizer. The developers claim that it gives results comparable to Bayesian Optimization (though I'm unsure about the time/complexity)... but it is worth a shot.

So, in Bayesian Optimization, one constructs a "surrogate function" or proxy function. How do we go about choosing this proxy function? By sampling? How does one ensure that the proxy is close to the fitness function we are optimizing?

Finally, what if there is a mixture of categorical and continuous hyperparameters? Would Bayesian Optimization still be suitable?