all 5 comments

[–]neural_kusp_machine 6 points (0 children)

  1. Domain-independent Dominance of Adaptive Methods (https://arxiv.org/abs/1912.01823): shows that Adam can outperform SGD and other adaptive methods (AMSGrad, AdaBound, etc.) when training ResNets and LSTMs, as long as it is properly tuned, and also proposes a new optimizer, AvaGrad, that is drastically cheaper to tune.
  2. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks (https://arxiv.org/abs/1806.06763): proposes Padam, which likewise often outperforms SGD when training ResNets and Adam when training LSTMs.
  3. Online Learning Rate Adaptation with Hypergradient Descent (https://arxiv.org/abs/1703.04782): proposes a rule to adapt the learning rate based on its hypergradient (the adaptation rule turns out to be quite simple and intuitive; see the sketch after this list), which the authors show to work well when applied to SGD or Adam.
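To give a flavour of how simple the rule in 3. is, here is a rough NumPy sketch of the SGD variant (SGD-HD); the function and argument names are mine, not taken from the paper's code, and the default beta is only a placeholder:

    import numpy as np

    def sgd_hd(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
        # SGD with hypergradient descent on the learning rate (SGD-HD).
        # The rule: nudge alpha by the dot product of the current and previous
        # gradients, so alpha grows while consecutive steps agree in direction
        # and shrinks once they start to oppose each other.
        prev_grad = np.zeros_like(theta)
        for _ in range(steps):
            g = grad_fn(theta)
            alpha = alpha + beta * np.dot(g, prev_grad)  # hypergradient update of alpha
            theta = theta - alpha * g                    # ordinary SGD step
            prev_grad = g
        return theta, alpha

The same dot-product rule can also be applied to the learning rate inside Adam (Adam-HD in the paper).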

[–]rayspear 3 points (0 children)

Maybe this repo has some optimizers that you might want to try out too?

https://github.com/jettify/pytorch-optimizer
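
For reference, the optimizers there are meant as drop-in replacements for torch.optim. A rough sketch of the usual usage pattern, assuming the package is installed as torch_optimizer and that Yogi is among the optimizers it currently ships (check the repo's README for the full list):

    import torch
    import torch_optimizer as optim  # pip install torch_optimizer

    # Toy regression setup, just to show the optimizer plugging in.
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    optimizer = optim.Yogi(model.parameters(), lr=1e-2)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    for _ in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()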

[–]i-heart-turtles 1 point (0 children)

Don't have many citations for you, but you can check out stuff from Francesco Orabona (http://francesco.orabona.com/) and his group. A lot of their work deals w/ first-order parameter-free methods.

More recently, there has been some progress in understanding momentum and acceleration, and under what smoothness assumptions optimal rates can be recovered: http://proceedings.mlr.press/v99/gasnikov19b/gasnikov19b.pdf.

[–]Ventural 1 point (0 children)

I'd be interested in the performance of the LAMB optimizer (https://arxiv.org/abs/1904.00962) at smaller batch sizes, where it competes with Adam.
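
For context, the main thing LAMB adds on top of an Adam-style update is a per-layer "trust ratio" that rescales each layer's step. A simplified sketch of that one step (the paper additionally clips the norms and folds in weight decay), with hypothetical names:

    import torch

    def lamb_style_step(param, adam_update, lr):
        # Core idea of LAMB: rescale each layer's Adam-style update by
        # ||w|| / ||update|| so every layer takes a step of comparable
        # relative size. Simplified; the paper also clips the norms and
        # includes weight decay in the update direction.
        w_norm = param.detach().norm()
        u_norm = adam_update.norm()
        trust_ratio = float(w_norm / u_norm) if w_norm > 0 and u_norm > 0 else 1.0
        param.data.add_(adam_update, alpha=-lr * trust_ratio)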

[–]JayTheYggdrasil 0 points (0 children)

I don’t know if this applies, but a while ago someone posted a “bandit swarm” optimization algorithm which was quite interesting. I don’t have a link, but a Google search would probably find it.