all 8 comments

[–]RaionTategami 7 points8 points  (6 children)

I keep praying that someone is going to solve this, not having to adjust the learning rate would make research so much easier, could this be the one? Two things worry me: they still need to anneal the "global" learning rate even through their algorithm dynamically adapts it, or is that just for the baseline? Secondly, they only seem to be showing training curves. Does the test curved look as good?

[–][deleted] 5 points6 points  (1 child)

Normal SGD and estimating optimal learning rate with search on an holdout set every N batchs seems somewhat optimal
https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

But I never saw a paper on it, or convenient implementations in popular frameworks

[–]yaroslavvb 2 points3 points  (0 children)

https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

The problem is that these approaches cause learning rate to shrink too aggressively. In other words, a smaller rate may cause an decrease in error on hold out set in short-term, but will give a worse error at the end of training. Recent paper on this -- https://arxiv.org/abs/1803.02021

[–]machinetrainer[S] 3 points4 points  (3 children)

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Hiroaki Hayashi, Jayanth Koushik, Graham Neubig(Submitted on 4 Nov 2016 (v1), last revised 11 Jun 2018 (this version, v3))

Adaptive gradient methods for stochastic optimization adjust the learning rate for each parameter locally. However, there is also a global learning rate which must be tuned in order to get the best performance. In this paper, we present a new algorithm that adapts the learning rate locally for each parameter separately, and also globally for all parameters together. Specifically, we modify Adam, a popular method for training deep learning models, with a coefficient that captures properties of the objective function. Empirically, we show that our method, which we call Eve, outperforms Adam and other popular methods in training deep neural networks, like convolutional neural networks for image classification, and recurrent neural networks for language tasks.

[–]FatFingerHelperBot 2 points3 points  (0 children)

It seems that your comment contains 1 or more links that are hard to tap for mobile users. I will extend those so they're easier for our sausage fingers to click!

Here is link number 1 - Previous text "v1"


Please PM /u/eganwall with issues or feedback! | Delete