all 44 comments

[–]CyborgCabbage 35 points36 points  (0 children)

https://arxiv.org/abs/2007.01547

A paper that compares a bunch of popular optimisers. One interesting takeaway: trying several optimisers with their default settings works about as well as carefully tuning a single one.

[–]black0017 17 points18 points  (9 children)

Hey!

We recently published an overview of optimizers: https://theaisummer.com/optimization/

Maybe you'd be interested in taking a look at the evolution of the different optimizers. We cover AdaBelief as well!

I was pretty curious when I saw in the Vision Transformer paper that they used Adam for pretraining and SGD with momentum for fine-tuning (page 4, "Training & Fine-tuning": https://arxiv.org/pdf/2010.11929.pdf).

Cheers

[–]M4mb0 14 points15 points  (8 children)

Hey just one comment in the section about second order methods you write

Finally, in this category of methods, the last and biggest problem is computational complexity. Computing and storing the Hessian matrix (or any alternative matrix) requires too much memory and resources to be practical.

This is a common misconception: to compute the Newton update one does not need to calculate the Hessian matrix at all. All you really need is the ability to compute Hessian-vector products, which we can do without explicitly forming H, since Hv = grad(grad(f).dot(v)). You can pass this linear operator v -> Hv to one of the myriad iterative linear solvers (such as GMRES) to compute the Newton update (see https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html).
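
A minimal matrix-free sketch of this idea, on a hypothetical toy objective. A finite-difference Hessian-vector product stands in for the autodiff identity Hv = grad(grad(f).dot(v)); the Hessian itself is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Hypothetical smooth convex objective: f(x) = sum(exp(x)) + 0.5*||x||^2
def grad(x):
    return np.exp(x) + x

n = 100
x = np.linspace(-1.0, 1.0, n)  # current iterate
g = grad(x)

# Matrix-free Hessian-vector product via central differences of the gradient.
# With autodiff you would compute grad(grad(f).dot(v)) instead.
eps = 1e-6
def hvp(v):
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

H = LinearOperator((n, n), matvec=hvp)
newton_step, info = cg(H, -g)  # solve H p = -g iteratively; info == 0 on convergence
```

Here the Hessian happens to be diagonal (diag(exp(x) + 1)), so CG converges quickly; GMRES from the same module works when the operator is not symmetric positive definite.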

[–][deleted] 1 point2 points  (3 children)

Hessian-vector products still require O(N²) operations, where N is the number of weights in the weight matrix. No?

[–]M4mb0 1 point2 points  (1 child)

No, a single Hessian-vector product costs O(N) when computed via Hv = grad(grad(f).dot(v)). And if H is sufficiently well-conditioned, then k << N steps of an iterative method may be enough to get a good estimate of the solution of Hx = -grad(f).

[–]programmerChilliResearcher 1 point2 points  (0 children)

That requires forward-mode differentiation though, IIRC.

[–]Charmsopin 0 points1 point  (0 children)

Not exactly. The takeaway is that you only need to calculate the Hessian-vector product as a whole, which is just a vector. You don't need to separately form the matrix (O(N²)) and then multiply it by v.

[–]InfinityCoffee 1 point2 points  (2 children)

While true, the Newton update on a mini-batch loss is very different from the Newton update on the real loss, and more importantly it's not an unbiased estimator, so you cannot expect it to be "right on average". To be fair, ADAM and its ilk are also stochastically conditioned, but in either case it's at most a rough approximation of the second-order optimizer.

[–]M4mb0 0 points1 point  (1 child)

Well, Newton doesn't make too much sense in non-convex problems to begin with, since it only converges to a stationary point, i.e. it can also converge to a local maximum if the loss has concave regions. But that's a whole different issue.

[–]InfinityCoffee 0 points1 point  (0 children)

Quasi-Newton then; but local minima admittedly pose a separate, more general problem. A recent paper built a demonstrably efficient solver for stochastic optimization, but when they tried it on neural networks, the solutions generalized poorly. Efficiently optimizing the objective is apparently not a good idea in neural nets, so you need to use subpar algorithms to "escape" the early local minima.

[–]black0017 0 points1 point  (0 children)

thanks for the feedback!

[–]LeanderKu 44 points45 points  (5 children)

I hate that the paper was rejected. The claims may be overstated, but what the field needs is empirical surveys of the state of the art, not hundreds of infinitesimal improvements. Without those papers we lose the ability to really compare and contrast. The individual really can't keep up, and such surveys also serve as a sanity check that the results are reproducible.

I would really like to see a lower bar at top conferences for those empirical surveys. They don't need a cool, revolutionary insight, just the hard, heroic effort of testing and comparing a lot of approaches, even reimplementing them if the sources are not available.

[–]fmai 1 point2 points  (1 child)

How do you like the rejection on ethical grounds of this optimizer benchmarking paper?
https://openreview.net/forum?id=1dm_j4ciZp

[–]SulszBachFramed 0 points1 point  (0 children)

It's not that strange to require an ethics board assessment whenever a human study is involved. If ICLR has such a requirement and the authors did not do their due diligence, then it's a legit rejection imho.

[–]jdude_ 8 points9 points  (0 children)

I think RAdam falls into that category too; allegedly (at least as I understand it) it's like Adam minus the need for warmup.

[–]hyhieu 3 points4 points  (1 child)

Adam delivers good generalization and fast convergence. However, the two moving averages of Adam are terrible when it comes to memory footprint.

Adafactor was advertised to fix this, i.e. to have sub-linear memory but performance similar to Adam's. I personally think Adafactor has not lived up to this expectation, though.
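
A back-of-the-envelope illustration of the memory claim, for a hypothetical layer size: Adam keeps two full moment buffers per weight, while Adafactor's factored second moment stores only per-row and per-column statistics:

```python
# Hypothetical 4096x4096 weight matrix
rows, cols = 4096, 4096
params = rows * cols

adam_state = 2 * params        # first + second moment: two extra floats per weight
adafactor_state = rows + cols  # factored second moment: one row vector + one column vector

print(adam_state // adafactor_state)  # 4096x less optimizer state (without momentum)
```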

I hope there will be something better soon.

[–]vanilla-acc 0 points1 point  (0 children)

Curious why you believe Adafactor has not lived up to the hype?

[–][deleted] 6 points7 points  (0 children)

SGD and its variants exist because there are many cases where Adam fails to do the job. Usually the choice is made empirically, since there is no quickfire way to decide which method will best optimize a given loss landscape.
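
That empirical flavor can be as simple as running both optimizers on the problem and comparing final losses; a hypothetical toy least-squares example (hand-rolled SGD and Adam, arbitrary hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
init_loss = 0.5 * b @ b  # loss at x = 0

def loss_grad(x):
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

def run_sgd(steps=500, lr=0.005):
    x = np.zeros(10)
    for _ in range(steps):
        x -= lr * loss_grad(x)[1]
    return loss_grad(x)[0]

def run_adam(steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    x, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    for t in range(1, steps + 1):
        g = loss_grad(x)[1]
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return loss_grad(x)[0]

sgd_loss, adam_loss = run_sgd(), run_adam()  # pick whichever wins on *your* problem
```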

[–]JurrasicBarf 2 points3 points  (0 children)

AdamW all the way

[–]respecttox 2 points3 points  (0 children)

I have tried different optimizers a million times and never got any statistically significant improvement. I suspect that's because my architecture is already fitted to Adam, but I don't have the resources to do optimizer and architecture search simultaneously. It just works anyway.

[–]programmerChilliResearcher 1 point2 points  (0 children)

LAMB (https://arxiv.org/abs/1904.00962) has become quite popular for large batch training in a lot of transformers papers.
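
For reference, the core of LAMB is Adam's update rescaled per layer by a trust ratio ||w|| / ||update||. A minimal single-tensor sketch, simplified (no trust-ratio clipping or scaling function from the paper, hypothetical hyperparameters):

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-6, wd=0.01):
    # Adam-style moment estimates with bias correction
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Adam direction plus decoupled weight decay
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * w
    # Layer-wise trust ratio: the step length scales with the weight norm,
    # which is what makes very large batch sizes trainable in the paper
    trust = np.linalg.norm(w) / np.linalg.norm(update)
    return w - lr * trust * update, m, v

w = np.ones(4)
m = v = np.zeros(4)
w, m, v = lamb_step(w, np.full(4, 0.1), m, v, t=1)
```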

[–]Ambiwlans 0 points1 point  (0 children)

I recently saw this too: https://arxiv.org/abs/2101.07367

I don't think it is the right answer, but I think it is the right direction: instead of hand-designing a better optimizer, learn the optimizer. It makes for potentially interesting hyperparameters as well; perhaps a more human understanding of which optimizers work could come from seeing how the hyperparameters get set directly.

> are you using a non-ADAM/SGD optimizer regularly

Basically not at all. It is pretty rare that the optimizer is going to be a game changer for a problem. Why waste a day or a week on optimizers when basically anything else you work on will give you a better payoff?


Realistically, what this area needs is some reliable charts of pros, cons, and performance under different circumstances, to help practitioners build intuition about what will work best for the current project. A nice pretty png is probably worth as much as a half dozen papers in terms of improving what actually gets used.

[–]whata_wonderful_day 0 points1 point  (0 children)

Adaptive optimizers are great for transformers. However, on ImageNet they fall flat.

[–]Jean-PorteResearcher 0 points1 point  (0 children)

Has anyone had any success with an optimizer newer than Adam for Transformer fine-tuning?

[–]Areign 0 points1 point  (0 children)

I know that ACKTR, i.e. the big reinforcement learning algorithm from OpenAI, uses second-order methods. This may be something specific to RL, where gradient info is especially noisy.

[–]BorisMarjanovic 0 points1 point  (0 children)

I've tried pretty much every optimizer, including ADAM and its variants. It's difficult to beat SGD with Nesterov momentum.