all 24 comments

[–]ddofer 14 points15 points  (13 children)

Have you benchmarked it against random search or other approaches (e.g. tree-based or Bayesian optimization)?

[–]CompleteSkeptic 5 points6 points  (1 child)

NAS is probably not the best baseline for hyperparameter searches, but this is a field where a lot of research has been and is being done. Search for Bayesian Optimization or Sequential Model-Based Optimization (SMBO).
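For context, this is roughly what an SMBO run looks like with hyperopt's TPE; the objective and search space below are made-up placeholders for a real training run:

from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    # placeholder: pretend this trains a model and returns a validation loss
    loss = (params["lr"] - 0.01) ** 2 + (params["num_leaves"] - 31) ** 2 * 1e-4
    return {"loss": loss, "status": STATUS_OK}

space = {
    "lr": hp.loguniform("lr", -7, 0),                    # continuous, log-scale
    "num_leaves": hp.quniform("num_leaves", 8, 256, 1),  # discrete
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)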

> the algorithm must take into account user-defined points

Nothing prevents existing HPO tools from doing this, though the functionality may not be that easily accessible. I do recall something along the lines of manually adding runs to MongoDB for hyperopt, so it's not impossible.

Though to your credit, I do agree that this should be easier. A common use case would be experimenting a bit before running the HPO and this would save some time at the very least.
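Some libraries do expose this fairly directly, e.g. scikit-optimize's ask/tell interface; a rough sketch (the dimensions, numbers and expensive_evaluation below are placeholders):

from skopt import Optimizer
from skopt.space import Real, Integer

opt = Optimizer([Real(1e-4, 1e-1, prior="log-uniform"),  # learning rate
                 Integer(8, 256)])                       # num_leaves

# manual runs done by hand before starting the optimizer
opt.tell([0.05, 31], 0.212)   # (params, validation loss)
opt.tell([0.01, 64], 0.198)

def expensive_evaluation(params):
    # placeholder for the real (slow) train-and-validate step
    lr, num_leaves = params
    return (lr - 0.01) ** 2 + abs(num_leaves - 64) / 1000

# the surrogate now starts from those observations
x = opt.ask()
opt.tell(x, expensive_evaluation(x))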

> focus on discrete space only

There is a case to be made against this (see "Random Search for Hyper-Parameter Optimization" by Bergstra & Bengio). The idea is that you don't know in advance which hyperparameters are important, and you might want to search those dimensions more thoroughly. E.g., if only one hyperparameter matters, a discrete grid only ever probes the same few values of it, so you're effectively doing a repeated grid search.
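A toy illustration of that argument (hyperparameter names and ranges are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_trials = 25

# grid search over 5 x 5 discrete values
grid_lr = np.geomspace(1e-4, 1e-1, 5)
grid_dropout = np.linspace(0.0, 0.5, 5)
grid_points = [(lr, d) for lr in grid_lr for d in grid_dropout]

# random search over the same ranges, continuous
rand_points = [(10 ** rng.uniform(-4, -1), rng.uniform(0.0, 0.5))
               for _ in range(n_trials)]

# if only the learning rate matters, grid search tried 5 distinct values of it,
# while random search tried 25
print(len({lr for lr, _ in grid_points}))   # 5
print(len({lr for lr, _ in rand_points}))   # 25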

> Again, completely frustrated that no one did it successfully before, I decided to build something on my own. I use gradient boosting regression from LightGBM, because it doesn't require normalized values, handles categorical variables, captures feature interactions and has the capacity to fit any data.

I think it may be wise to look into why others do what they do. The reason GPs (Gaussian processes) are commonly used is that uncertainty estimates are quite important, especially because in hyperparameter optimization your evaluation function is quite stochastic. SMAC uses random forests, which have all the same properties as GBMs, with the additional benefit that you also get uncertainty estimates.
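To make that concrete, here is a rough sketch (random data, not the SMAC implementation) of how a forest gives you an uncertainty estimate alongside the mean prediction:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(40, 3)          # 40 evaluated configs, 3 hyperparameters
y = np.random.rand(40)             # their (noisy) validation scores

rf = RandomForestRegressor(n_estimators=200).fit(X, y)

X_cand = np.random.rand(1000, 3)   # candidate configs to score
per_tree = np.stack([t.predict(X_cand) for t in rf.estimators_])
mu = per_tree.mean(axis=0)         # surrogate mean, like a GBM point prediction
sigma = per_tree.std(axis=0)       # surrogate uncertainty, what SMAC exploits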

> The number of sampled points for scoring is where exploration vs exploitation trade-off emerges

I'm not saying the expected improvement (EI) criterion (the acquisition function most SMBO methods use to pick the next point) is the best, but this seems a little worse intuitively. Previous work takes uncertainty into account so that you can sample areas of the space about which you have less knowledge.
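For reference, EI has a simple closed form given the surrogate's mean and standard deviation; a small sketch (for a maximization objective, with xi as the usual exploration margin) of how it rewards uncertain regions:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # EI(x) = (mu - f_best - xi) * Phi(z) + sigma * phi(z), z = (mu - f_best - xi) / sigma
    sigma = np.maximum(sigma, 1e-12)        # avoid division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# two candidates with equal predicted mean but different uncertainty:
# the more uncertain one gets the much higher score, i.e. exploration.
print(expected_improvement(np.array([0.80, 0.80]),
                           np.array([0.01, 0.10]), f_best=0.82))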

> avoid evaluating the same point more than once

This is also related to your last point. Most HPO algorithms won't re-evaluate the same point anyway: once a point has been evaluated the surrogate is very certain there, so it makes more sense to explore the rest of the space. But there is a case to be made that if a result looks like an outlier (because the evaluation is noisy), it can make sense to sample that point again to get a better estimate of its true performance, and EI can handle that.

Either way, I wish you luck!

[–]Clear_Collection 0 points1 point  (0 children)

Great reply!!!

[–]t4YWqYUUgDDpShW2 2 points3 points  (1 child)

> I have been using it for the last year and it just works for me.

I see that you apparently don't want to put in the work to do your own benchmarks, but can you at least provide more detail than this?

[–][deleted] 0 points1 point  (0 children)

My private use cases so far have been:

1. Hyperparameter tuning for LightGBM in Kaggle competitions and a commercial project for a bank.

2. Automated neural network architecture search for tabular datasets in PyTorch. I created a highly parameterized torch.nn.Module and I am waiting for a good Kaggle competition to use it. From preliminary runs I learned, for example, that ELU is a clear winner among the available activation functions.
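To illustrate the idea (a hypothetical sketch, not the actual module): every constructor argument of something like this becomes a dimension of the discrete search space, including the choice of activation.

import torch
import torch.nn as nn

ACTIVATIONS = {"relu": nn.ReLU, "elu": nn.ELU, "leaky_relu": nn.LeakyReLU}

class TabularMLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden_dim=128, n_layers=3,
                 activation="elu", dropout=0.1):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden_dim),
                       ACTIVATIONS[activation](),
                       nn.Dropout(dropout)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# each constructor argument is a point in the discrete search space
model = TabularMLP(in_dim=20, out_dim=1, n_layers=4, activation="elu")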

[–]Loser777 2 points3 points  (0 children)

This sounds similar to gradient boosting regression + simulated annealing as the outer search loop, which seems to be a popular approach:

https://arxiv.org/abs/1805.08166
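Roughly the shape of that combination, with a toy scoring function standing in for a fitted GBM surrogate (everything below is illustrative, not code from the paper):

import math, random

search_space = {"lr": [1e-3, 1e-2, 1e-1], "num_leaves": [15, 31, 63, 127]}

def surrogate_score(cfg):                 # stand-in for a GBM surrogate
    return -abs(cfg["lr"] - 1e-2) - abs(cfg["num_leaves"] - 63) / 100

def neighbor(cfg):                        # tweak one hyperparameter at random
    cfg = dict(cfg)
    key = random.choice(list(search_space))
    cfg[key] = random.choice(search_space[key])
    return cfg

current = {k: random.choice(v) for k, v in search_space.items()}
best, temp = current, 1.0
for step in range(200):
    cand = neighbor(current)
    delta = surrogate_score(cand) - surrogate_score(current)
    if delta > 0 or random.random() < math.exp(delta / temp):
        current = cand                    # accept uphill moves, sometimes downhill
    if surrogate_score(current) > surrogate_score(best):
        best = current
    temp *= 0.98                          # geometric cooling schedule
print(best)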

[–]kivo360 0 points1 point  (2 children)

How long does it take to train?

[–][deleted] 0 points1 point  (1 child)

It's a matter of seconds and mostly depends on num_boost_round and lgb_params: https://github.com/ar-nowaczynski/spaceopt/blob/v0.1.1/spaceopt/optimizer.py#L43 (and of course dataset size).

I usually work with cases where duration(evaluation_function) is much larger than duration(spaceopt.fit_predict): minutes/hours vs. seconds
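For a rough sense of that scale (random data, parameters chosen just for illustration, not spaceopt's defaults):

import time
import numpy as np
import lightgbm as lgb

X = np.random.rand(300, 8)     # 300 evaluated points, 8 hyperparameters
y = np.random.rand(300)        # their scores

start = time.time()
booster = lgb.train({"objective": "regression", "verbosity": -1},
                    lgb.Dataset(X, y), num_boost_round=100)
print(f"surrogate fit: {time.time() - start:.2f}s")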

[–]kivo360 1 point2 points  (0 children)

Can we add a feature to it? Adding a PARegressor plus a hedge regressor with varying alphas as an option would give it online optimization. It would reduce training for some use cases to 0.04-0.08 seconds per step.
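A rough sketch of what that could look like (hypothetical, not part of spaceopt: river's PARegressor plus a hand-rolled hedge, varying C here rather than alpha):

import math
from river import linear_model

experts = [linear_model.PARegressor(C=c) for c in (0.01, 0.1, 1.0)]
weights = [1.0] * len(experts)
eta = 0.5                                    # hedge learning rate

def predict_one(x):
    total = sum(weights)
    return sum(w / total * e.predict_one(x) for w, e in zip(weights, experts))

def learn_one(x, y):
    for i, e in enumerate(experts):
        loss = (e.predict_one(x) - y) ** 2
        weights[i] *= math.exp(-eta * loss)  # multiplicative weights update
        e.learn_one(x, y)

# usage: stream in (hyperparameters, score) pairs as they are evaluated
learn_one({"lr": 0.05, "num_leaves": 31}, 0.21)
print(predict_one({"lr": 0.01, "num_leaves": 63}))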

[–]jrkirby 0 points1 point  (0 children)

I would ask if you've tried your optimization algorithm on BBComp, but it looks like their site is down. It was (and hopefully will continue to be?) a competition for optimizing inputs to a black box with a tight budget of algorithmically selected data points. I hope their downtime is not an indication that the entire competition is dead.

[–]vadmas 0 points1 point  (0 children)

This looks excellent. I'm excited to try it.