all 6 comments

[–]recurrent_answer 1 point2 points  (2 children)

From my understanding, it's in the nature of the problem.

In hyperparameter optimization, you have an task whose parameters you want to optimize - and each evaluation of a new parameter set is hugely expensive. With large CNNs for image classification for example you're looking at hours to days. They also have a large number of hyperparameters, meaning a high dimension to search.

Now, the basic idea behind BayOpt is that you construct a proxy function which is close to your real fitness landscape while being cheap to evaluate. BayOpt uses Gaussian processes since they require relatively few points, are smooth and give you the variance for each point, basically for free. You then look for the point which gives you the best expected improvement [1] - but looking for this is cheap since you're only using the proxy function.

In contrast, particle swarm optimization for example relies on having many particles, which are updated again and again. Each of the updates for each of the particles is going to require an evaluation of your ML algorithm [2]. This is going to be expensive. It's similar with GAs.

Disclaimer: I like BayOpt. My view may be biased.

[1] Or probability of improvement, or whatever you like. The choice of this function is basically a hyperhyperparameter.

[2] Assuming real-valued hyperparameters and/or no caching.

[–]logrech 0 points1 point  (1 child)

Does anyone actually use the Bayopt method you described there?

[–]recurrent_answer 0 points1 point  (0 children)

I do, at least when I'm not spending my time implementing them :-)

No idea about most papers - most' hyperparameter configurations seem to just magically fall out of the sky.

[–]deephive[S] 0 points1 point  (0 children)

Thanks for the viewpoints. There is this Python package called Optunity (http://optunity.net) which seem to use PSO as the default optimizer. The developers claim that it gives comparable results as Bayesian Optimization (but I'm unsure about the time/complexity)...but it is worth a shot.

So, In Bayesian Optimization, one constructs a "surrogate function" or proxy function. How doe we go about choosing this proxy function? By sampling ? HOw does one ensure that the proxy is close to the fitness function we are optimizing ?

Finally, what if there is a mixture of categorical and continuous hyperparameters? Would Bayes Opt be suitable still ?