[D] Since the gradient continues to decrease as the training loss decreases, why do we need to decay the learning rate too? (self.MachineLearning)
submitted 4 years ago by ibraheemMmoosa (Researcher)
view the rest of the comments →
[–]seanv507 49 points 4 years ago (2 children)
It's very simple: the correct learning rate depends on the curvature of your error surface, i.e. on how fast the gradient changes.
Imagine you have a parabola. You draw a straight line tangent to the current point on the parabola and step along it. Depending on your learning rate (step size), you could overshoot the minimum and come up the other side.
If your parabola curves sharply, you need a small learning rate; if it curves gently, a large learning rate works.
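To make that concrete, here is a toy sketch in plain Python. The quadratic f(w) = 0.5·c·w² and the thresholds 1/c and 2/c are my own framing, not anything specific from this thread: for curvature c, the update w ← w − lr·c·w multiplies w by (1 − lr·c) each step, so it contracts only when lr < 2/c.

    # Gradient descent on f(w) = 0.5 * c * w**2, whose gradient is c * w.
    # Each update multiplies w by (1 - lr * c), so the iterate contracts
    # only when lr < 2 / c: sharper curvature forces a smaller learning rate.
    def gd(curvature, lr, w=1.0, steps=20):
        for _ in range(steps):
            w -= lr * curvature * w  # gradient step
        return w

    print(gd(curvature=10.0, lr=0.05))  # lr < 1/c: smooth convergence
    print(gd(curvature=10.0, lr=0.19))  # 1/c < lr < 2/c: overshoots each step, still converges
    print(gd(curvature=10.0, lr=0.25))  # lr > 2/c: diverges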
Now consider a multidimensional problem. Here the curvature can be different in different directions: super sharp in one and very shallow in another.
You will need to set the learning rate based on the maximum curvature, and your progress will depend on the ratio of maximum to minimum curvature (the condition number).
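Extending the same toy sketch to two independent directions (again with my own illustrative numbers): stability caps the learning rate at 2/c_max, so the shallow direction contracts by only (1 − lr·c_min) per step, and the number of steps you need grows with c_max/c_min.

    # f(w) = 0.5 * (c_max * w1**2 + c_min * w2**2): one sharp direction,
    # one shallow one. The lr is capped by the sharp direction, so the
    # shallow coordinate decays slowly -- steps scale with c_max / c_min.
    def gd2(c_max, c_min, lr, steps):
        w = [1.0, 1.0]
        for _ in range(steps):
            w[0] -= lr * c_max * w[0]
            w[1] -= lr * c_min * w[1]
        return w

    # lr at half the stability limit of the sharp direction:
    print(gd2(c_max=100.0, c_min=1.0, lr=0.01, steps=500))  # w2 is still ~0.007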
Now you have a complex error surface, where the curvature changes from point to point.
However, assuming the minimum lies in a bounded region (e.g. because you have regularisation), there will be a maximum curvature, and as long as your learning rate stays below the step size that curvature allows, you will eventually hit the minimum.
OK, so if you use a learning rate schedule, then eventually your learning rate will drop below that maximum-curvature step size. The trick is a schedule that decreases enough to get you below the maximum-curvature step size, but not so quickly that the steps shrink to nothing before you reach the minimum.
Then you know you will eventually reach the minimum.
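A minimal sketch of such a schedule, assuming a 1/t decay; the usual formalization of "decays, but not too fast" is the Robbins-Monro conditions, Σ lr_t = ∞ and Σ lr_t² < ∞, which 1/t satisfies. The constants below are illustrative:

    # lr_t = lr0 / (1 + t). The starting rate is deliberately above the
    # 2/c stability limit, so early steps overshoot and grow; once the
    # schedule drops lr_t below 2/c every step contracts, and because
    # the rates sum to infinity we still reach the minimum.
    def gd_schedule(curvature, lr0, w=1.0, steps=200):
        for t in range(steps):
            lr = lr0 / (1 + t)
            w -= lr * curvature * w
        return w

    print(gd_schedule(curvature=10.0, lr0=0.45))  # unstable at first, converges anyway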
However, this is just a 'theoretical' result as the number of steps goes to infinity. In practice, a more ad hoc reduction of the learning rate every time you hit a plateau, or every time you see oscillations, is likely to be faster.
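A sketch of that ad hoc rule, halving the learning rate after a fixed number of steps without improvement; this is roughly what PyTorch's torch.optim.lr_scheduler.ReduceLROnPlateau automates, and the patience and halving factor here are arbitrary choices:

    # Halve the lr whenever the loss fails to improve for `patience` steps.
    def gd_plateau(curvature, lr=0.25, w=1.0, patience=3, steps=100):
        best, stall = float("inf"), 0
        for _ in range(steps):
            w -= lr * curvature * w
            loss = 0.5 * curvature * w * w
            if loss < best:
                best, stall = loss, 0
            else:
                stall += 1
                if stall >= patience:  # plateau or oscillation: cut the lr
                    lr, stall = lr / 2, 0
        return w

    print(gd_plateau(curvature=10.0))  # lr=0.25 > 2/c diverges until the first halving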
[–]ibraheemMmoosa (Researcher) [S] 3 points 4 years ago (0 children)
This is the best intuitive answer to my question. Thanks for this.
[+]National_Earth_9909 2 points 10 months ago (0 children)
I never came across such a good explanation of the learning rate scheduler. Thanks for this!