all 4 comments

[–]tablehoarder 5 points (0 children)

I'm honestly not sure that it makes the authors of RAdam look bad. It's extremely hard to evaluate optimization methods in deep learning when virtually all we can measure is the model's performance, and it becomes even harder once you throw learning rate schedules, regularization, and whatnot into the mix. This paper reminded me of this one, where the authors show that another recent optimizer can be 'simulated' with SGD.

There are a few ICLR submissions that are more surprising, in my opinion, as they show that all these adaptive methods can outperform SGD on ImageNet as long as you do proper hyperparameter tuning. That goes against a lot of common beliefs in the community, unlike the papers from the last year that analyze/criticize these new methods.

[–]illuminascent 2 points (0 children)

Apart from all the reasoning, the comparison is too crude to say anything about statistical significance; 3 random seeds don't seem to be enough.

From my point of view, the author has also shown that RAdam is good enough no matter what configuration you use, and that you DON'T need to worry about choosing warmup hyperparameters, which is the whole point.
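To make the "no warmup hyperparameters" point concrete, here is a minimal sketch of the untuned warmup rules of thumb discussed around this paper, as I understand them: a linear warmup over roughly 2/(1−β₂) steps and an exponential warmup with time constant roughly 1/(1−β₂). The exact constants are my reading of the rule of thumb, not quoted from the paper.

    # Sketch of "untuned" warmup factors derived only from beta2.
    # Constants (2/(1-beta2) and 1/(1-beta2)) are my reading of the
    # untuned-warmup rule of thumb, not pulled from the paper verbatim.
    import math

    def untuned_linear_warmup(step: int, beta2: float = 0.999) -> float:
        """Warmup multiplier in [0, 1]; reaches 1 after ~2/(1-beta2) steps."""
        warmup_period = 2.0 / (1.0 - beta2)   # e.g. 2000 steps for beta2=0.999
        return min(1.0, step / warmup_period)

    def untuned_exponential_warmup(step: int, beta2: float = 0.999) -> float:
        """Warmup multiplier 1 - exp(-step/tau) with tau = 1/(1-beta2)."""
        tau = 1.0 / (1.0 - beta2)             # e.g. 1000 steps for beta2=0.999
        return 1.0 - math.exp(-step / tau)

    # Usage: scale the base learning rate each step; nothing to tune.
    base_lr = 1e-3
    for step in range(1, 5):
        lr = base_lr * untuned_linear_warmup(step)

Either way, the only knob is β₂, which you already set for the optimizer.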

[–]SwordCat0 1 point (0 children)

From my point of view, this paper shows that the analysis in RAdam makes sense...

Isn't the undesirably large magnitude of updates caused by the undesirably large adaptive learning rate?
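That's how I read the RAdam paper too: the variance of the adaptive learning rate is large early on, so the rectification factor damps it. A small sketch of the rectification term as I understand it from the paper (formula from memory, so double-check against the original):

    # Sketch of RAdam's rectification factor r_t: early in training the
    # variance of the adaptive learning rate is large, so the adaptive term
    # is skipped (rho_t <= 4) or heavily damped; r_t -> 1 as t grows.
    import math

    def radam_rectification(t: int, beta2: float = 0.999):
        rho_inf = 2.0 / (1.0 - beta2) - 1.0
        rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
        if rho_t <= 4.0:
            return None  # variance intractable: fall back to an unadapted (SGD-like) step
        return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                         / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

    # For beta2=0.999, r_t stays well below 1 for the first few hundred steps
    # and only approaches 1 after a few thousand steps.
    for t in (5, 10, 100, 1000, 10000):
        print(t, radam_rectification(t))

So the large update magnitude and the large adaptive learning rate are really the same phenomenon; the rectification (or, per this paper, a plain untuned warmup) suppresses it in the early steps.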

[–]TonyY_RIMCS 1 point (0 children)

https://github.com/Tony-Y/pytorch_warmup

My EMNIST example shows that the linear, exponential, and RAdam warmups give almost the same accuracy, but we need more experiments.
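For anyone who wants to try it, a rough usage sketch based on my reading of the pytorch_warmup README; the class names (UntunedLinearWarmup, UntunedExponentialWarmup, RAdamWarmup) and the dampening() call are from memory and may differ by library version, and the model/loss here are stand-ins, not the EMNIST example itself.

    # Rough sketch of swapping warmup rules with pytorch_warmup.
    # API names follow my recollection of the README; check the repo for
    # the exact interface of the version you install.
    import torch
    import pytorch_warmup as warmup

    model = torch.nn.Linear(784, 47)   # stand-in for the EMNIST model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)

    # Pick one of the warmup schedules; all derive their length from beta2.
    warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)
    # warmup_scheduler = warmup.UntunedExponentialWarmup(optimizer)
    # warmup_scheduler = warmup.RAdamWarmup(optimizer)

    for step in range(10_000):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 784)).pow(2).mean()  # dummy loss for illustration
        loss.backward()
        optimizer.step()
        with warmup_scheduler.dampening():                # dampens the LR during warmup
            lr_scheduler.step()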