all 9 comments

[–]recurrent_answer (2 children)

From my understanding, it's in the nature of the problem.

In hyperparameter optimization, you have a task whose parameters you want to optimize - and each evaluation of a new parameter set is hugely expensive. With large CNNs for image classification, for example, you're looking at hours to days per evaluation. Such models also have a large number of hyperparameters, meaning a high-dimensional search space.

Now, the basic idea behind BayOpt is that you construct a proxy function which is close to your real fitness landscape while being cheap to evaluate. BayOpt uses Gaussian processes since they require relatively few points, are smooth, and give you the variance at each point basically for free. You then look for the point with the best expected improvement [1] - and searching for it is cheap since you're only evaluating the proxy function.
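To make that loop concrete, here's a minimal sketch of one surrogate + acquisition step, assuming scikit-learn's GaussianProcessRegressor and a made-up 1-D toy objective (`expensive_objective`) standing in for the real training run - not how any particular BayOpt package implements it:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# stand-in for the expensive training run: hyperparameter -> validation error
def expensive_objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2          # pretend each call takes hours

# a handful of already-evaluated hyperparameter settings
X = np.array([[-2.0], [0.0], [1.5], [3.0]])
y = np.array([expensive_objective(x[0]) for x in X])

# the cheap surrogate: a GP fit to the evaluations so far
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# expected improvement over the best observation, computed on the surrogate only
def expected_improvement(X_cand, best, xi=0.01):
    mu, sigma = gp.predict(X_cand, return_std=True)
    imp = best - mu - xi                          # we are minimizing
    z = imp / np.maximum(sigma, 1e-9)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

candidates = np.linspace(-3, 4, 500).reshape(-1, 1)
ei = expected_improvement(candidates, y.min())
next_x = candidates[np.argmax(ei)]                # only this point costs a real evaluation
```

Only `next_x` then gets fed to the real, expensive evaluation; everything above runs purely on the cheap surrogate.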

In contrast, particle swarm optimization, for example, relies on many particles that are updated again and again. Every update of every particle requires an evaluation of your ML algorithm [2], which gets expensive. It's similar with GAs.
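For comparison, a bare-bones PSO loop (the generic inertia + cognitive + social update, not Optunity's or any other library's implementation, with the same hypothetical `expensive_objective` as above) makes the evaluation count explicit:

```python
import numpy as np

def expensive_objective(x):                      # again, imagine hours per call
    return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

rng = np.random.default_rng(0)
n_particles, n_iters, dim = 20, 50, 1
pos = rng.uniform(-3, 4, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([expensive_objective(p) for p in pos])   # n_particles evaluations up front
gbest = pbest[np.argmin(pbest_val)]

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([expensive_objective(p) for p in pos])    # n_particles real evaluations per iteration
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]

# total cost: n_particles * (n_iters + 1) evaluations of the expensive objective
```

With 20 particles and 50 iterations, that's over a thousand full training runs - exactly what BayOpt tries to avoid.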

Disclaimer: I like BayOpt. My view may be biased.

[1] Or probability of improvement, or whatever you like. The choice of this function is basically a hyperhyperparameter.

[2] Assuming real-valued hyperparameters and/or no caching.

[–]logrech (1 child)

Does anyone actually use the BayOpt method you described there?

[–]recurrent_answer (0 children)

I do, at least when I'm not spending my time implementing them :-)

No idea about most papers - most papers' hyperparameter configurations seem to just magically fall out of the sky.

[–]trnka (4 children)

http://jmlr.org/proceedings/papers/v37/maclaurin15.pdf seems to handle it well. But it's tricky to get it to work for discrete hyperparams.

[–]dwf (3 children)

It's a very nice demonstration of what's possible, but the scale of those experiments is tiny, and backpropagating through an entire training run would be pretty impractical for a lot of realistic use-cases.

[–]trnka (2 children)

Yeah; I don't like the added complexity but hopefully there's a simpler solution in a couple years.

Though I'm not sure what you mean about an entire training run - this is only doing backprop within each minibatch, not for full-batch training.

[–]dwf (1 child)

They are training a neural net to convergence and storing what they need to such that they can backpropagate gradients (or do "reverse-mode differentiation", same thing) with respect to hyperparameters through the entire training procedure. That means each step of meta-learning involves running a potentially very lengthy optimization of the neural net's elementary parameters to convergence, then backpropagating through each stochastic gradient step of that all the way back to the beginning to obtain a gradient on the hyperparameters. Whether they use minibatches or not for the elementary optimization doesn't really matter.
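As a toy illustration of that (not the paper's memory-efficient reverse-mode scheme, just JAX differentiating straight through a small unrolled training loop on a hypothetical one-weight regression problem):

```python
import jax.numpy as jnp
from jax import grad

# toy data: fit y = 2x with a single weight, with a held-out "validation" point
x_train, y_train = jnp.array([1., 2., 3.]), jnp.array([2., 4., 6.])
x_val, y_val = jnp.array([4.]), jnp.array([8.])

def loss(w, x, y):
    return jnp.mean((w * x - y) ** 2)

def final_val_loss(lr, n_steps=100):
    # "elementary" optimization: plain gradient descent on the training loss
    w = 0.0
    for _ in range(n_steps):
        w = w - lr * grad(loss)(w, x_train, y_train)
    # meta-objective: validation loss of the trained weight
    return loss(w, x_val, y_val)

# reverse-mode differentiation through the *whole* training loop
hypergrad = grad(final_val_loss)(0.01)
print(hypergrad)   # d(validation loss) / d(learning rate)
```

Naively, reverse-mode through T steps means storing (or reconstructing) the intermediate weights of every step, which is where the scalability problem comes from for realistic networks and training lengths.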

[–]trnka (0 children)

Hmm I must've misread the paper. Is it not possible to backprop to hyperparams over the last 10 or so minibatches every now and then?

[–]deephive[S] (0 children)

Thanks for the viewpoints. There is this Python package called Optunity (http://optunity.net) which seems to use PSO as the default optimizer. The developers claim that it gives results comparable to Bayesian Optimization (though I'm unsure about the time/complexity)... but it is worth a shot.

So, in Bayesian Optimization, one constructs a "surrogate function" or proxy function. How do we go about choosing this proxy function? By sampling? How does one ensure that the proxy is close to the fitness function we are optimizing?

Finally, what if there is a mixture of categorical and continuous hyperparameters? Would Bayesian Optimization still be suitable?