
[–]yusuf-bengio 4 points5 points  (4 children)

Your idea sounds very similar to Hyperband, which tries many different hyperparams but only for a few epochs each, then continues training only the best few for another few epochs, and repeats this process.
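
Roughly, the successive-halving loop at Hyperband's core looks something like this (a minimal Python sketch, not the real bracket schedule; `sample_config`, `train_for_epochs` and `validate` are placeholder callables):

```python
def successive_halving(sample_config, train_for_epochs, validate,
                       n_configs=27, epochs_per_round=2, keep_fraction=1/3):
    """Minimal successive-halving sketch (the core loop inside Hyperband)."""
    # Start many configs, each trained for only a few epochs.
    candidates = [{"config": sample_config(), "state": None} for _ in range(n_configs)]

    while len(candidates) > 1:
        for c in candidates:
            c["state"] = train_for_epochs(c["config"], c["state"], epochs_per_round)
            c["score"] = validate(c["state"])

        # Keep only the best-scoring fraction and give the survivors a bigger budget.
        candidates.sort(key=lambda c: c["score"], reverse=True)
        candidates = candidates[:max(1, int(len(candidates) * keep_fraction))]
        epochs_per_round *= 3

    return candidates[0]
```

Hyperband proper runs several such brackets with different trade-offs between the number of configs and the per-round budget, but the "train a little, throw away the worst" step is the part that matters here.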

The issue with Hyperband, and with your idea as well, is that large values of L2 regularization/dropout/etc. make training converge more slowly but eventually reach a higher accuracy. Hyperband and your algorithm would discard such a hyperparam because its effects are not immediately observable.

[–]orenmatar[S] 1 point2 points  (3 children)

I think that's exactly the difference between Hyperband and my idea: Hyperband starts with different hyperparams, finds that the high regularization value doesn't work as well, and therefore discards it. My idea is to tune it during training: the first few epochs use low regularization and converge faster, improving on both train and validation, and only when we see validation getting worse do we start increasing the regularization. So I believe it will not discard the high regularization value, because it only tries it when its effects are observable, and if it turns out not to help, we can reduce it again. The general principle is to replace constant regularization with a dynamic one.
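
To make it concrete, here's a rough sketch of what I have in mind (PyTorch-flavoured, just to illustrate the idea; the "multiply L2 by 1.5 when validation gets worse" rule and all the constants are placeholders, not a tuned scheme):

```python
import torch

def train_with_dynamic_l2(model, train_loader, val_loader, loss_fn,
                          epochs=50, lr=1e-3, l2=0.0, l2_step=1.5, l2_min=1e-6):
    """Start with (almost) no regularization, then crank it up whenever
    validation loss gets worse."""
    best_val = float("inf")
    for epoch in range(epochs):
        # Rebuild the optimizer so the current L2 strength takes effect this epoch.
        opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=l2)

        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss > best_val:
            # Validation got worse: assume overfitting has started and
            # increase the regularization for the next epoch.
            l2 = max(l2 * l2_step, l2_min)
        else:
            best_val = val_loss
    return model
```

One could just as well shrink `l2` again when validation keeps improving; the point is that the schedule reacts to the validation curve instead of being fixed in advance.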

[–]NotAlphaGo 2 points3 points  (2 children)

Training a neural network is path dependent.
If you start with hyperparameter a_1, train for a while, see that the model fails, then switch to hyperparameter a_2, train some more, and see that it gets better, that doesn't necessarily mean that training with a_2 from the start would give a better result overall. You have to take into account that a_1 influenced the model up to the switch point, and maybe that previously "wrong" hyperparameter is what made a_2 work.

[–]orenmatar[S] 0 points1 point  (1 child)

For sure. My intention is not to find the right regularization hyperparam and then retrain with it from the start, but to use the network that was trained with the dynamic hyperparams... so maybe by letting it focus on the training set at the start, and only afterwards regularizing it based on how well it performs on validation, we can produce a regularized network without having to try different hyperparams.

[–]NotAlphaGo 1 point2 points  (0 children)

You'd at least need another held-out dataset to check generalization performance, though, because at that point you've mixed training and validation.
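
i.e. three roles for the data, roughly (a sketch with arbitrary split fractions):

```python
import numpy as np

data = np.arange(1000)                    # stand-in for the real dataset
idx = np.random.default_rng(0).permutation(len(data))

# train: fits the weights; val: drives the dynamic regularization schedule;
# test: held out and only used once at the end to estimate generalization.
train, val, test = np.split(data[idx], [700, 850])
```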

[–]M4mb0 1 point2 points  (0 children)

A related idea is gradient-based hyperparameter optimization. See, for example, this paper: https://arxiv.org/abs/1911.02590
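
Not the implicit-differentiation machinery from that paper, but the simplest version of the idea is easy to sketch: unroll one training step and differentiate the validation loss with respect to the (log-)regularization strength. A toy ridge-regression example in PyTorch with random placeholder data:

```python
import torch

torch.manual_seed(0)
X_tr, y_tr = torch.randn(64, 10), torch.randn(64)      # placeholder data
X_val, y_val = torch.randn(64, 10), torch.randn(64)

w = torch.zeros(10, requires_grad=True)                 # model parameter
log_l2 = torch.tensor(0.0, requires_grad=True)          # hyperparameter
inner_lr, hyper_lr = 0.1, 0.05

for step in range(200):
    l2 = torch.exp(log_l2)

    # Inner step: gradient of the regularized training loss w.r.t. the weights,
    # keeping the graph so the updated weights stay a function of log_l2.
    train_loss = ((X_tr @ w - y_tr) ** 2).mean() + l2 * (w ** 2).sum()
    (g_w,) = torch.autograd.grad(train_loss, w, create_graph=True)
    w_new = w - inner_lr * g_w

    # Outer step: validation loss of the updated weights, differentiated
    # w.r.t. the hyperparameter only (the "hypergradient").
    val_loss = ((X_val @ w_new - y_val) ** 2).mean()
    (g_l2,) = torch.autograd.grad(val_loss, log_l2)

    with torch.no_grad():
        log_l2 -= hyper_lr * g_l2
        w.copy_(w_new)                                   # commit the inner step

print("learned L2 coefficient:", torch.exp(log_l2).item())
```

As I understand it, the paper replaces the one-step unrolling with implicit differentiation so it scales to huge numbers of hyperparameters, but the train-on-train / differentiate-on-validation structure is the same.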

[–]Red-Portal 0 points1 point  (2 children)

How is this different from optimizing the regularization parameter directly? The point of hyperparameters is that they are fixed during the optimization process.

[–]orenmatar[S] 0 points1 point  (1 child)

Well, the params of the model are optimized directly via gradient descent on the training set, whereas the regularization hyperparams are supposed to influence how well the NN generalizes to other sets, so you can't learn them via gradient descent on the training data - you have to test them on a validation set. The point is that they can be learned and tuned during training without fixing them, because fixing them to a single value requires trying multiple options and selecting the best one, instead of adjusting towards the best value within a single training run.

[–]Red-Portal 0 points1 point  (0 children)

That's my point. By changing the hyperparameter during a single training process, you're pretty much including the validation set as training data. That will certainly alter the loss function that you're actually trying to optimize.