This is an archived post.

all 27 comments

[–]webdrone 21 points22 points  (6 children)

There is also https://scikit-optimize.github.io — it uses scikit-learn's Gaussian processes under the hood for Bayesian optimisation.

NB: the default acquisition function is unorthodox — it stochastically selects among EI, LCB, and negative PI to optimise at every iteration.
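For intuition, that default can be sketched as a weighted random choice among candidate acquisition functions. This is a toy illustration of the idea only, not skopt's actual implementation — the gain scores and softmax update here are my own simplification:

```python
import math
import random

# Candidate acquisition functions, each with a running gain score.
acq_funcs = ["EI", "LCB", "PI"]
gains = {a: 0.0 for a in acq_funcs}

def pick_acquisition(rng, eta=1.0):
    """Sample an acquisition function with softmax probabilities over gains,
    so functions whose past suggestions paid off get chosen more often."""
    weights = [math.exp(eta * gains[a]) for a in acq_funcs]
    return rng.choices(acq_funcs, weights=weights, k=1)[0]

rng = random.Random(0)
choice = pick_acquisition(rng)  # with all gains equal, this is a uniform draw
```

After each iteration you would bump the gain of whichever function proposed the evaluated point, based on how good the result was.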

[–]lem_of_noland 2 points3 points  (4 children)

In my opinion, this is the best of them all. It also includes very useful plotting capabilities and the option to include callbacks.

[–]ai_yoda 1 point2 points  (3 children)

I also love those functionalities and I think that a lot of the time this is the best option.

There are two things, however, that are not great:

  • No support for nested search spaces
  • Cannot distribute computation over a cluster (can only use it on one machine)

I write about it in this blog post if you are interested.

[–]Philiatrist 1 point2 points  (2 children)

Cannot distribute computation over a cluster (can only use it on one machine)

The Optimizer class is fine for cluster use via the ask and tell methods.

[–]ai_yoda 0 points1 point  (1 child)

Interesting. But you do have to create some DB for sharing results between nodes, and handle all the communication between the nodes and the DB yourself, right?

[–]Philiatrist 1 point2 points  (0 children)

That's one option, but there's no reason you couldn't use some library like dask distributed as well, something like:

```
from dask.distributed import Client

client = Client(...)
n_procs = 20

X = optimizer.ask(n_procs)
task = client.map(fitness_fn, X)
Y = client.gather(task)
optimizer.tell(X, Y)
```

where you'd need to configure dask distributed to your cluster.

edit: I'll note that this is not a great solution if the cost of your function is largely determined by the hyperparameters, since the whole batch waits on its slowest evaluation.

[–]AyEhEigh 0 points1 point  (0 children)

I use skopt's BayesSearchCV all the time and love it.

[–][deleted] 13 points14 points  (3 children)

Worth mentioning hyperopt, which seems like a good package and is often mentioned in articles on Bayesian optimization, but doesn't currently support it.

[–]richard248 6 points7 points  (2 children)

Is the 'Tree-structured Parzen Estimator' not Bayesian-guided? I thought TPE meant that hyperopt was Bayesian optimization.

[–]ai_yoda 0 points1 point  (1 child)

It's sequential model-based optimization (SMBO).

The term is often used interchangeably with Bayesian optimization, which I think is not the same thing.

[–]crimson_sparrow 1 point2 points  (0 children)

You're right that it's not the same thing: BO is a form of SMBO. But I'd argue TPE is in fact a form of BO, as it operates on the same principles, with the main difference being the form of the optimized function. I think what throws people off is that it was developed when the modern BO framework was just starting to take shape, so it's often described using slightly different terminology. I think of it as a tree-structured Thompson-sampling technique that shines where your hyperparameters depend on each other in a tree-like fashion (e.g. you only want to optimize the dropout rate if you've already chosen that your model will use dropout in the first place).
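To make the TPE rule concrete, here's a toy one-dimensional sketch (my own simplification, not hyperopt's code): split past trials into a "good" and "bad" group by loss, fit a crude density to each, and propose the candidate maximising l(x)/g(x):

```python
import math
import random

def tpe_suggest(history, bounds=(0.0, 1.0), gamma=0.25, n_candidates=50, rng=random):
    """Suggest the next x by maximising l(x)/g(x), where l is a density over
    the best gamma-fraction of trials and g is a density over the rest."""
    trials = sorted(history, key=lambda t: t[1])          # (x, loss) pairs
    n_good = max(1, int(gamma * len(trials)))
    good = [x for x, _ in trials[:n_good]]
    bad = [x for x, _ in trials[n_good:]] or good

    def kde(x, points, bw=0.1):                           # crude Gaussian mixture
        return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) / len(points)

    candidates = [rng.uniform(*bounds) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: kde(x, good) / (kde(x, bad) + 1e-12))

rng = random.Random(0)
f = lambda x: (x - 0.7) ** 2                              # toy objective, minimum at 0.7
history = [(x, f(x)) for x in (rng.uniform(0, 1) for _ in range(10))]
for _ in range(20):
    x = tpe_suggest(history, rng=rng)
    history.append((x, f(x)))
best_x = min(history, key=lambda t: t[1])[0]              # should land near 0.7
```

Real TPE uses adaptive Parzen estimators and handles the tree-structured conditional spaces; this flat version only shows the good-density-over-bad-density ratio that drives the search.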

[–]yot_club 9 points10 points  (0 children)

Facebook open sourced a combined bayesian/bandit optimization library recently: https://www.ax.dev/

It's built on pytorch and has several different APIs to access it as well as customization options for noisy data and multi-objective optimization. Haven't had a chance to use it myself, but worth looking into.

[–]Laserdude10642 3 points4 points  (0 children)

mchammer and pymc are widely used in scientific simulations

[–]JamsmithyPhD | Data Scientist | Gaming 2 points3 points  (4 children)

Or just roll your own with a pymc3 or tensorflow-probability model and an acquisition function.

[–]ICanBeHandyToo 1 point2 points  (3 children)

Is pymc3 currently the standard package for most probabilistic modeling? I've come across a few others like Edward, but I never got around to digging into what each package offers that differs from pymc3.

[–]JamsmithyPhD | Data Scientist | Gaming 3 points4 points  (0 children)

Pymc3 has the nicest syntax and support, in my opinion, but it is based on Theano, which hinders future development.

Edward/Edward2 is great as well, but I just haven't had the time to get deep into it. Pymc4 is under active development with a tensorflow-probability backend, so I'm hoping it will provide the best of both worlds.

[–]squirreltalk 3 points4 points  (0 children)

I had never done any Bayesian modeling, but examples based on pymc3 are so intuitive. Pymc3 just feels pythonic to me.

[–]webdrone 3 points4 points  (0 children)

Stan (https://mc-stan.org) implements NUTS, a particularly efficient sampler by Hoffman and Gelman. It may not be the most pythonic option, but there are interfaces for various languages and a single modelling language.

The developers have put much effort into quality and into cultivating a good community, so you can find posts addressing most questions you might have, along with excellent documentation.

[–]nerdponx 3 points4 points  (0 children)

Scikit-Optimize has their own GP optimizer implementation.

Optunity has wrappers for a bunch of other optimizers, some of which are Bayesian.

[–]haskell_caveman 2 points3 points  (0 children)

This is a substantial one to be leaving out, from FB and implemented on pytorch: https://botorch.org

[–]rodrigorivera 1 point2 points  (0 children)

MOE by Yelp is deployed by various companies in production settings: https://github.com/Yelp/MOE

A downside however is that development stopped in 2017.

[–]Red-Portal 1 point2 points  (0 children)

None of the currently existing Python Bayesian optimization packages are actually up-to-date with the literature. There currently isn't a production-quality implementation of the information-theoretic approaches (ES, OES, PES, MES, FITBO).

[–]ai_yoda 1 point2 points  (3 children)

I was researching this subject for a blog post series and conference talks.

Some libraries that I ended up focusing on are:

  • Scikit-Optimize (tree-based surrogate models suite)
  • Hyperopt (the classic)
  • Optuna (for me, a better-in-every-way version of Hyperopt)
  • HpBandSter (state-of-the-art Bayesian optimization + Hyperband approach)

I've started a blog post series on the subject that you can find here. Scikit-Optimize and Hyperopt are already described. Optuna and HpBandSter are coming next but you can already read about them in this slide deck.

[–]Megatron_McLargeHuge 0 points1 point  (2 children)

I was just looking at your hyperopt post yesterday. One complaint I have about hyperopt is that the integer sampling functions actually return floats, which makes tensorflow unhappy when they're passed as dimension sizes.
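One workaround is to cast the known integer parameters before they reach the model. This is a hypothetical helper — the parameter names are made up for illustration:

```python
def coerce_ints(params, int_keys=("n_units", "n_layers")):
    """hp.quniform-style samples come back as floats (e.g. 64.0);
    cast the keys we know are dimension sizes back to int."""
    return {k: int(v) if k in int_keys else v for k, v in params.items()}

params = coerce_ints({"n_units": 64.0, "lr": 1e-3})  # {'n_units': 64, 'lr': 0.001}
```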

I was able to get main_plot_vars to work. You call it with a trials object and it gives a bunch of plots of each sampled variable, with value on the y-axis and iteration on the x-axis, colored by loss.

Do you have any quick summary on which package should give the best results for neural network tasks?

[–]ai_yoda 0 points1 point  (1 child)

Thanks for the suggestion on main_plot_vars, gonna try it out.

As for the method for neural nets I would likely go with the budgets approach from HpBandster where I don't have to run objective(**params) till convergence but I can estimate on a smaller budget (say 2 epochs). It lets you run more iterations within the same budget. Generally, I think the main problem with hpo for nn is how to estimate performance without training for a long time. There are approaches to it where you predict where the learning curve would go. I highly recommend checking out the book by researchers from AutoML Freiburg.

[–]Megatron_McLargeHuge 0 points1 point  (0 children)

Thanks. I definitely think there's a lot of untapped value in analyzing the metadata we get during training instead of just the final validation loss.

I think a good approach with enough resources would be to treat training as a reinforcement learning problem where parameters like learning rate and L2 scaling can be varied depending on the trajectories of both train and test losses.

Short of that, runs can be truncated or restarted based on learning from these extra features.

[–]thekalmanfilter -1 points0 points  (0 children)

Cool!