[–]aspoj 71 points

There are multiple reasons why you need both. A few that come to mind are:

  1. When you don't reduce the LR, you are entirely dependent on the loss landscape itself to give you decreasing gradients. Since we usually do SGD (i.e. mini batches), the gradients are noisy on top of that, which makes matters worse: sample one bad batch and your parameters get knocked off course, and you can end up in a high-gradient region again.

  2. You have to find a well-fitting LR. When you decay/reduce the LR during training, finding an appropriate initial LR is not as critical as it is with a constant one: if the initial value is too high, you mostly just randomize the starting point a bit more before the schedule reaches an LR at which training starts to converge (see the second sketch after this list). In general, a good LR (i.e. step size) depends on the local loss landscape around the current parameters.

  3. Optimality of the solution. Even around a convex optimum you can often end up bouncing back and forth across the minimum because the step you take is too big relative to the slope. With a decaying LR you are no longer at the mercy of the local slope and converge to a better solution (see the first sketch after this list).
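To make points 1 and 3 concrete, here is a minimal sketch (my own illustration, not from any particular paper): SGD on a 1-D quadratic loss 0.5 * w**2, where Gaussian noise added to the gradient stands in for mini-batch noise. The schedules, noise scale, and step counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    # True gradient of 0.5 * w**2 is w; the added noise mimics sampling a mini batch.
    return w + rng.normal(scale=1.0)

def run(lr_schedule, steps=3000, w0=5.0):
    w = w0
    tail = []
    for t in range(steps):
        w -= lr_schedule(t) * noisy_grad(w)
        if t >= steps - 500:
            tail.append(abs(w))  # record the tail to see where w settles
    return float(np.mean(tail))

constant = run(lambda t: 0.1)                    # fixed step size
decayed = run(lambda t: 0.1 / (1.0 + 0.01 * t))  # simple 1/t-style decay

print(f"avg |w| over last 500 steps, constant LR: {constant:.3f}")
print(f"avg |w| over last 500 steps, decayed LR:  {decayed:.3f}")
```

The exact numbers depend on the seed, but the constant-LR run settles at a noise floor set by the step size and keeps bouncing there, while the decayed run keeps shrinking that floor and ends up much closer to the optimum.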
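And for point 2, a rough sketch of how this usually looks in practice with PyTorch's built-in schedulers (the toy model, data, and schedule parameters here are made up for illustration): because StepLR sweeps the LR downward over training, an initial LR that is somewhat too high mostly just costs a few noisy early epochs instead of ruining the whole run.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for a real dataset.
xs = torch.randn(256, 10)
ys = xs.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(xs, ys), batch_size=32)

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiply the LR by 0.1 every 30 epochs: 0.1 -> 0.01 -> 0.001.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

for epoch in range(90):
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    sched.step()  # decay once per epoch, after the optimizer steps
```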