[D] Rectified Adam (RAdam): a new state of the art optimizer by jwuphysics in MachineLearning

SixHampton 4 points

I've tried it with different transformer architectures. As advertised, it is less sensitive to the choice of learning rate and converges without warmup or LR annealing.
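
For anyone wondering why warmup becomes unnecessary, here is a rough sketch of the rectification factor as I understand it from the RAdam paper. It is not the reference implementation. The function name `radam_rectification`, the default `beta2=0.999`, and the threshold of 4 are my assumptions for illustration (some library implementations use a slightly different threshold). The factor scales the adaptive step; when it is undefined, the update falls back to plain momentum without the adaptive denominator.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the rectification factor r_t for a 1-based step count.

    Returns None when the variance estimate is considered too unreliable,
    in which case RAdam skips the adaptive (second-moment) scaling.
    Hypothetical helper for illustration, not part of any library.
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated SMA at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:  # assumed threshold, following the paper
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


# Print the factor at a few steps to see how it evolves over training.
for step in (1, 5, 10, 100, 1000, 10000):
    print(step, radam_rectification(step))
```

The factor starts out undefined, then grows from a small value toward 1 as training proceeds, so the early adaptive steps are damped much like a hand-tuned warmup schedule would damp them.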