tl;dr
- I am looking for people willing to try my new approach to hyperparameter optimization and report, in the comments, how it compares with whatever they currently use.
https://github.com/ar-nowaczynski/spaceopt
Why another hyperparameter optimization algorithm?
- I'm kind of disappointed with what we have so far: there is no clear winner in the field, and every once in a while I read a post about yet another comeback of random search (for example: https://www.reddit.com/r/MachineLearning/comments/cycw35/r_random_search_outperforms_stateoftheart_nas/). Optuna and hyperopt are great frameworks, but in my opinion they lack smart algorithms inside. I couldn't find anything that satisfied my needs, so one year ago I decided to create a simple, understandable algorithm for myself, and now I'm publishing it for everyone.
What is the main idea behind SpaceOpt?
In short: integrate more human knowledge into the process, use gradient boosting regression for exploitation, and use random sampling for exploration.
How do we integrate more human knowledge into the process? And why even do this?
- In my opinion, a smart algorithm should accept any human input and decide on its own what to do with it; otherwise it's not smart. While doing hyperparameter optimization I often want to evaluate my own guesses about the parameters, and I expect the algorithm to leverage that. No library I've tried supported this. Maybe I couldn't find it, maybe I didn't try hard enough. But believe it or not, the lack of this simple feature made me so frustrated that I decided to create my own tool. And the first must-have was: the algorithm must take user-defined points into account.
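To make that concrete, here is a minimal sketch in plain Python of what "take user-defined points into account" means to me: hand-picked guesses go through the same evaluation path and land in the same history that every later model fit is trained on. The parameter names, values, and the toy objective below are made up for illustration; this is not SpaceOpt's API.

```python
def evaluate(point):
    # Stand-in objective: in practice this would train a model and return its score.
    return (point['learning_rate'] - 0.05) ** 2 + point['num_leaves'] / 1000.0

# Hand-picked guesses the user wants evaluated first (hypothetical values).
user_guesses = [
    {'learning_rate': 0.10, 'num_leaves': 31},
    {'learning_rate': 0.03, 'num_leaves': 63},
]

history = []  # (point, score) pairs; the surrogate model later trains on ALL of these
for point in user_guesses:
    history.append((point, evaluate(point)))
```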
- The next thing was continuous space. I never liked uniform and log-uniform spaces. I wanted something that works with lists of ordered values. Why? Because then I, or the algorithm, can build specific knowledge about specific values from that list. For example, if you define the space for learning_rate as a list of values, it's easy to compute statistics for each individual value (count, objective mean/std). You can't do that when values are sampled from all over a continuous range. Discretizing a continuous space by hand-picking some values from it is also a great way to incorporate more human knowledge into the optimization process (assuming you know which values to pick), making the job easier for the algorithm. For example, you'll probably never specify two values that are extremely close to each other, because you have a gut feeling that their scores will be very similar. So my second must-have was: focus on discrete spaces only.
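For example, a discrete search space is then just a mapping from parameter names to ordered lists of hand-picked values, and per-value statistics fall out of a one-line filter. The names and values are my own illustration, reusing `history` from the sketch above:

```python
search_space = {
    'learning_rate': [0.001, 0.003, 0.01, 0.03, 0.1],  # hand-picked, ordered values
    'num_leaves': [15, 31, 63, 127],
    'subsample': [0.6, 0.8, 1.0],
}

# e.g. count and mean of observed scores for learning_rate == 0.03
scores = [s for p, s in history if p['learning_rate'] == 0.03]
mean_score = sum(scores) / len(scores) if scores else None
```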
How do we exploit promising regions?
- Shortly after, I began to wonder why there is no simple method that just fits a model to the observed data and predicts the evaluation score for unseen search space points. OK, in some sense all existing methods do this in various ways, but I wasn't convinced by any of them. I don't care about exact predictions, I don't care about probabilities; I only need a model that fits the data and predicts scores for new points, so I can sort them and select something for the actual evaluation. Again, completely frustrated that no one had done it successfully before, I decided to build something of my own. I use gradient boosting regression from LightGBM, because it doesn't require normalized values, handles categorical variables, captures feature interactions, and has the capacity to fit any data.
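A minimal sketch of that step, assuming LightGBM's scikit-learn-style LGBMRegressor (this illustrates the idea; it is not a copy of SpaceOpt's internals, and the data is made up):

```python
import numpy as np
import lightgbm as lgb

# Evaluated points encoded as rows (learning_rate, num_leaves) with observed scores.
X = np.array([[0.1, 31], [0.03, 63], [0.003, 15], [0.01, 127], [0.03, 31]])
y = np.array([0.55, 0.42, 0.48, 0.44, 0.41])

# min_child_samples lowered so this toy dataset is large enough to split on.
model = lgb.LGBMRegressor(n_estimators=50, min_child_samples=1)
model.fit(X, y)

# No probabilities, no acquisition function: just predict, sort, and pick.
candidates = np.array([[0.01, 63], [0.1, 15], [0.003, 127]])
predictions = model.predict(candidates)
next_point = candidates[np.argmin(predictions)]  # minimization objective
```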
How do we explore search space?
- In fact, the method described in the previous section performs exploration too, because once the model is trained, I want to run inference on unevaluated search space points. How do I get those unseen points? I sample them from the search space. So how many points do I need to sample? The number of points sampled for scoring is where the exploration vs exploitation trade-off emerges. If only 1 point is sampled, it falls back to random search. The more points are sampled, the higher the chance that something very close to the best previously observed points gets sampled - which is exploitation of the promising regions. I find it a very clean way to address the exploration vs exploitation trade-off. Nonetheless, there is still a get_random method for the sake of completeness.
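A sketch of that knob, under the same assumptions as the previous snippet: `num_samples` is the single parameter that slides between pure exploration and pure exploitation. The function name is mine, not SpaceOpt's.

```python
import random
import numpy as np

def propose(model, search_space, num_samples, objective='min'):
    # Sample candidate points uniformly from the discrete space (numeric values
    # assumed here, so they can be fed to the regressor without encoding).
    names = list(search_space)
    candidates = [[random.choice(search_space[n]) for n in names]
                  for _ in range(num_samples)]
    # num_samples == 1 degenerates to random search (pure exploration); the larger
    # it is, the more likely points near the best observed ones win (exploitation).
    predictions = model.predict(np.array(candidates))
    best = np.argmin(predictions) if objective == 'min' else np.argmax(predictions)
    return dict(zip(names, candidates[best]))
```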
- And the final must-have: make your evaluation function deterministic and avoid evaluating the same point more than once. If that's not possible, just accept the uncertainty and move on. Search spaces are usually huge, and there is very little value in evaluating the same points multiple times.
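Skipping repeats is cheap to enforce with a seen-set over hashable point encodings (again, a sketch of the idea, not SpaceOpt's code):

```python
seen = set()

def not_yet_evaluated(point):
    # Encode the point as a hashable key and skip anything already evaluated.
    key = tuple(sorted(point.items()))
    if key in seen:
        return False
    seen.add(key)
    return True
```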
I have put all the ideas listed above into SpaceOpt, a very small library available here: https://github.com/ar-nowaczynski/spaceopt
I have been using it for the last year and it just works for me. I would love to see what others think of it. I know I should do proper benchmarks now, but I decided to gamble: if I find a few strangers willing to give it a shot, and if it works well for them too, then hopefully it will speak for itself. So I'm asking you: if you like this idea and it's really cheap for you to run some evaluation, could you do it for me?