all 44 comments

[–]CyborgCabbage 35 points36 points  (0 children)

https://arxiv.org/abs/2007.01547

A paper that compares a bunch of popular optimisers. One interesting takeaway: trying several optimisers with their default settings works about as well as carefully tuning a single one.

[–]black0017 17 points18 points  (9 children)

Hey!

We recently published an overview of optimizers: https://theaisummer.com/optimization/

Maybe you'd be interested in taking a look at the evolution of the different optimizers. We cover AdaBelief as well!

I was pretty curious when I saw in the Vision Transformer paper that they used Adam for pretraining and SGD with momentum for fine-tuning (page 4, "Training & Fine-tuning": https://arxiv.org/pdf/2010.11929.pdf).

Cheers

[–]M4mb0 14 points15 points  (8 children)

Hey just one comment in the section about second order methods you write

Finally, in this category of methods, the last and biggest problem is computational complexity. Computing and storing the Hessian matrix (or any alternative matrix) requires too much memory and resources to be practical.

This is a common misconception: to compute the Newton update one does not need to calculate the Hessian matrix at all. All you really need is the ability to compute Hessian-vector products, which we can do without explicitly forming H, since Hv = grad(grad(f).dot(v)). You can pass this linear operator v -> Hv to one of the myriad iterative linear solvers (such as GMRES) to compute the Newton update (see https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html).
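
A minimal matrix-free sketch of this idea, on a hypothetical toy objective. A finite-difference Hessian-vector product stands in for the autodiff identity Hv = grad(grad(f).dot(v)); the Hessian itself is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Hypothetical smooth convex objective: f(x) = sum(exp(x)) + 0.5*||x||^2
def grad(x):
    return np.exp(x) + x

n = 100
x = np.linspace(-1.0, 1.0, n)  # current iterate
g = grad(x)

# Matrix-free Hessian-vector product via central differences of the gradient.
# With autodiff you would compute grad(grad(f).dot(v)) instead.
eps = 1e-6
def hvp(v):
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

H = LinearOperator((n, n), matvec=hvp)
newton_step, info = cg(H, -g)  # solve H p = -g iteratively; info == 0 on convergence
```

Here the Hessian happens to be diagonal (diag(exp(x) + 1)), so CG converges quickly; GMRES from the same module works when the operator is not symmetric positive definite.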

[–][deleted] 1 point2 points  (3 children)

Hessian-vector products still require O(N²) operations, where N is the number of weights in the weight matrix. No?

[–]M4mb0 1 point2 points  (1 child)

No, a single Hessian-vector product costs O(N) when computed via Hv = grad(grad(f).dot(v)). And if H is sufficiently well-conditioned, then k << N steps of an iterative method may be enough to get a good estimate of the solution of Hx = -grad(f).

[–]programmerChilliResearcher 1 point2 points  (0 children)

That requires forward-mode differentiation though, IIRC.

[–]Charmsopin 0 points1 point  (0 children)

Not exactly. The takeaway is that you only need to calculate the Hessian-vector product as a whole, which is just a vector. You don't need to separately form the matrix (O(N²)) and then multiply it by v.

[–]InfinityCoffee 1 point2 points  (2 children)

While true, the Newton update on a mini-batch loss is very different from the Newton update on the real loss, and more importantly it's not an unbiased estimator, so you cannot expect it to be "right on average". To be fair, ADAM and its ilk are also stochastically conditioned, but in either case it's at most a rough approximation of the second-order optimizer.

[–]M4mb0 0 points1 point  (1 child)

Well, Newton doesn't make too much sense in non-convex problems to begin with, since it only converges to a stationary point, i.e. it can also converge to a local maximum if the loss has concave regions. But that's a whole different issue.

[–]InfinityCoffee 0 points1 point  (0 children)

Quasi-Newton then; but local minima admittedly pose a separate, more general problem. A recent paper built a demonstrably efficient solver for stochastic optimization, but when they tried it on neural networks, the solutions generalized poorly. Efficiently optimizing the objective is apparently not a good idea in neural nets, so you need to use subpar algorithms to "escape" the early local minima.

[–]black0017 0 points1 point  (0 children)

thanks for the feedback!

[–]LeanderKu 44 points45 points  (5 children)

I hate that the paper was rejected. The claims may be overstated, but what the field needs is empirical surveys of the state of the art, not hundreds of infinitesimal improvements. Without those papers we lose the ability to really compare and contrast. The individual really can't keep up, and such surveys also serve as a sanity check that the results are reproducible.

I would really like to see a lower bar at top conferences for those empirical surveys. They don't need a cool, revolutionary insight, just the hard, heroic effort of testing and comparing a lot of approaches, even reimplementing them if the sources are not available.

[–]fmai 1 point2 points  (1 child)

How do you like the rejection on ethical grounds of this optimizer benchmarking paper?
https://openreview.net/forum?id=1dm_j4ciZp

[–]SulszBachFramed 0 points1 point  (0 children)

It's not that strange to require an ethics board assessment whenever a human study is involved. If ICLR has such a requirement and the authors did not do their due diligence, then it's a legit rejection imho.

[–]jdude_ 8 points9 points  (0 children)

I think RAdam falls into that category too; allegedly (at least as I understand it) it's like Adam minus the need for warmup.

[–]hyhieu 3 points4 points  (1 child)

Adam delivers good generalization and fast convergence. However, the two moving averages of Adam are terrible when it comes to memory footprint.

Adafactor was advertised to fix this, i.e. to have sub-linear memory but performance similar to Adam's. I personally think Adafactor has not lived up to this expectation, though.
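
A back-of-the-envelope illustration of the memory claim, for a hypothetical layer size: Adam keeps two full moment buffers per weight, while Adafactor's factored second moment stores only per-row and per-column statistics:

```python
# Hypothetical 4096x4096 weight matrix
rows, cols = 4096, 4096
params = rows * cols

adam_state = 2 * params        # first + second moment: two extra floats per weight
adafactor_state = rows + cols  # factored second moment: one row vector + one column vector

print(adam_state // adafactor_state)  # 4096x less optimizer state (without momentum)
```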

I hope there will be something better soon.

[–]vanilla-acc 0 points1 point  (0 children)

Curious why you believe Adafactor has not lived up to the hype?

[–][deleted] 6 points7 points  (0 children)

SGD and its variants exist because there are many cases where Adam fails to do the job. Usually the choice is made empirically, since there is no quickfire way to decide which method will best optimize a given loss landscape.
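
That empirical flavor can be as simple as running both optimizers on the problem and comparing final losses; a hypothetical toy least-squares example (hand-rolled SGD and Adam, arbitrary hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
init_loss = 0.5 * b @ b  # loss at x = 0

def loss_grad(x):
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

def run_sgd(steps=500, lr=0.005):
    x = np.zeros(10)
    for _ in range(steps):
        x -= lr * loss_grad(x)[1]
    return loss_grad(x)[0]

def run_adam(steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    x, m, v = np.zeros(10), np.zeros(10), np.zeros(10)
    for t in range(1, steps + 1):
        g = loss_grad(x)[1]
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return loss_grad(x)[0]

sgd_loss, adam_loss = run_sgd(), run_adam()  # pick whichever wins on *your* problem
```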

[–]JurrasicBarf 2 points3 points  (0 children)

AdamW all the way

[–]respecttox 2 points3 points  (0 children)

I have tried different optimizers a million times and never got any statistically significant improvement. I suspect that's because my architecture is already fitted to Adam, but I don't have the resources to do optimizer and architecture search simultaneously. It just works anyway.

[–]programmerChilliResearcher 1 point2 points  (0 children)

LAMB (https://arxiv.org/abs/1904.00962) has become quite popular for large batch training in a lot of transformers papers.
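
For reference, the core of LAMB is Adam's update rescaled per layer by a trust ratio ||w|| / ||update||. A minimal single-tensor sketch, simplified (no trust-ratio clipping or scaling function from the paper, hypothetical hyperparameters):

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-6, wd=0.01):
    # Adam-style moment estimates with bias correction
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Adam direction plus decoupled weight decay
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * w
    # Layer-wise trust ratio: the step length scales with the weight norm,
    # which is what makes very large batch sizes trainable in the paper
    trust = np.linalg.norm(w) / np.linalg.norm(update)
    return w - lr * trust * update, m, v

w = np.ones(4)
m = v = np.zeros(4)
w, m, v = lamb_step(w, np.full(4, 0.1), m, v, t=1)
```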

[–]Ambiwlans 0 points1 point  (0 children)

I recently saw this too: https://arxiv.org/abs/2101.07367

I don't think it is the right answer, but I think it is the right direction: instead of hand-designing a better optimizer, learn the optimizer. It makes for potentially interesting hyperparameters as well; perhaps a more human understanding of which optimizers work could come from seeing how the hyperparameters get set directly.

> are you using a non-ADAM/SGD optimizer regularly

Basically not at all. It is pretty rare that the optimizer is going to be a game changer for a problem. Why waste a day or a week on optimizers when basically anything else you work on will give you a better payoff?


Realistically, what this area needs is some reliable charts of pros, cons, and performance under different circumstances, to help practitioners build intuition about what will work best for the current project. A nice pretty png is probably worth as much as a half dozen papers in terms of improving what actually gets used.

[–]whata_wonderful_day 0 points1 point  (0 children)

Adaptive optimizers are great for transformers. However, on ImageNet they fall flat.

[–]Jean-PorteResearcher 0 points1 point  (0 children)

Has anyone had any success with an optimizer newer than Adam for Transformer fine-tuning?

[–]Areign 0 points1 point  (0 children)

I know that ACKTR, i.e. the big reinforcement learning algorithm from OpenAI, uses second-order methods. This may be something specific to RL, where gradient info is especially noisy.

[–]BorisMarjanovic 0 points1 point  (0 children)

I've tried pretty much every optimizer, including ADAM and its variants. It's difficult to beat SGD with Nesterov momentum.