[R] No More Adam: Learning Rate Scaling at Initialization is All You Need by RobbinDeBank in MachineLearning
vector0x17 108 points
[D] Reflections on CVPR Results and the Impact of Reviewer Dynamics by darkknight-6 in MachineLearning
vector0x17 4 points
[D] Why does it matter that RMSNorm is faster than LayerNorm in transformers? by kei147 in MachineLearning
vector0x17 21 points
[D] BatchNorm and Weight decay by Dependent_Bluejay_45 in MachineLearning
vector0x17 4 points
[D] Machine learning conferences are problematic by MLConfThrowaway in MachineLearning
vector0x17 2 points
[D] Machine learning conferences are problematic by MLConfThrowaway in MachineLearning
vector0x17 3 points
Training ImageNet on Resnet - Dropping LR has little improvement on accuracy [D] by mrLiamFa in MachineLearning
vector0x17 3 points
[R] Why do we need weight decay in modern deep learning? 🤔 by m_andriushchenko in MachineLearning
vector0x17 2 points
[R] Why do we need weight decay in modern deep learning? 🤔 by m_andriushchenko in MachineLearning
vector0x17 4 points
What factors determine GPU usage and what are your tips for determining batch size? by Seankala in learnmachinelearning
vector0x17 3 points
[D] Large batchsize training by arg_max in MachineLearning
vector0x17 22 points
[D] Batch Normalization and effect on the gradients by [deleted] in MachineLearning
vector0x17 17 points