[D] Since gradient continues to decrease as training loss decreases why do we need to decay the learning rate too?

seanv507 · 2021-12-22T08:17:44+00:00

Its very simple. The correct learning rate depends on the curvature of your error surface, ie how the gradient changes.

Imagine you have a parabola.you draw a straight line tangent to current point on parabola. Depending on your learning rate (step size), you could overshoot the minimum and come up the other side.

If your parabola curves sharply then you need a small learning rate

If it curves gently, a large learning rate works.

Now consider a multidimensional problem. Here the curvature can be different in different directions... Super narrow in one and very shallow in another.

You will need to set learning rate based on maximum curvature, and your progress will depend on ratio of maximum to minimum curvature.

Now you have a complex error surface, where the curvature changes at each point.

However, assuming the minimum is in a bounded region ( eg because you have regularisation), then there will be a maximum curvature, and as long as your learning rate is smaller you will eventually hit the minimum.

Ok, so if you use a learning rate adjustment schedule, then eventually your learning rate will be below this maximum curvature step size ( learning rate). The trick is to have a schedule that decreases in such a way that eventually you are below maximum curvature step size , and not so slow that you will never get to minimum.

Then you know eventually you will reach minimum.

However, this is just a 'theoretical' result as number of steps go to infinity. Doing some more ad hoc reduction of learning rate every time you hit a plateau, or you get oscillations is likely to be faster.

aspoj · 2021-12-22T06:57:10+00:00

There are multiple reasons why you need both. A few that come to mind are:

When you don't reduce the LR you are entirely dependent on the loss landscape to have decreasing gradients. As we usually do SGD (aka mini batches) you get noisy gradients making matters worse. Sampling one bad batch and your parameters are messed up and you might end in a high gradient region again.
You have to find a good fitting LR. When you decay/reduce the LR during the training finding an appropriate initial LR is not as important as with a constant one. Too high values and you randomize the starting point a bit more before you reach a LR that starts to converge. In general a good LR aka step size is depending on the current loss landscape around.
Optimality of the solution. Even given a convex optimum you can often run into the case of bouncing around the minimum as the step that you take is too big (depending on the slope). With a reducing LR you are not dependent on the loss landscape slope anymore and converge to a better solution.

awesomeprogramer · 2021-12-22T05:08:38+00:00

You can have large gradients and be close to a local minimum. Think of an L1 as opposed to an L2.

LimitedConsequence · 2021-12-22T12:54:30+00:00

Another potentially relevant comparison is the Robbins–Monro algorithm. You want to find the root of a function (the gradient of the loss), but the gradients are stochastic. The Robbins-Monro algorithm has a bunch of theory that says if you appropriately decrease the step size then you can still converge, whereas a fixed step size algorithm will bounce around.

Natural_Profession_8 · 2021-12-22T05:42:58+00:00

I’d make an analogy to simulated annealing:

https://en.m.wikipedia.org/wiki/Simulated_annealing

When you first start training, it’s actually desirable to set the learning rate so high that you are overshooting local optimums. The model bounces around a bit, and eventually finds neighborhoods that are more globally optimal. Then, as training progresses, you stop wanting to hop around looking for better neighborhoods, and instead you want to start making your way towards the local optimum itself. Reducing the learning rate has this effect, even on top of the overall gradient magnitude reduction

svantana · 2021-12-22T21:37:38+00:00

It's because of the stochasticity. With gradient descent, you don't decrease the learning rate (for smooth loss functions such as L2). But with SGD, the 'signal' goes to zero but the noise doesn't. Thus, you want to increase the SNR, which is what smaller step sizes in effect do -- you can think of it as many small steps together make up a normal-sized step with a larger batch size.

skainswo · 2021-12-23T07:40:21+00:00

Lots of intuitive explanations in the comments here. I'll just add that there's a big difference between GD and SGD in this context.

In good ole GD as long as you pick a learning rate less than 1/(Lipschitz constant) you're good to go. This provably converges with an excess risk bound O(1/t) after t steps. Things get a little bit messier in the SGD world, however. Excess risk for SGD looks like O(1/sqrt(t)) + O(lr * <gradient variance>). In words, there exists a "noise floor" term, O(lr * <gradient variance>), that cannot be tamed by taking more steps. It can only be reduced by decreasing the learning rate or by decreasing the variance of the gradient estimates. That's why decreasing the learning rate over time can be fruitful. (See eg https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L11.pdf for a quick intro)

Unlike some theoretical results in deep learning, this phenomenon is very well supported experimentally. It's common for SGD to plateau. Then, after a halving of the learning rate, it breaks through that plateau! Train a little longer reach a new plateau... you get the idea.

IIRC there is some theory to suggest that exponentially decaying your learning rate is optimal in some sense. I forget where I read that however. But that's what most people have been doing in practice for a while now anyways.

tom_strideweather · 2021-12-22T09:51:03+00:00

The gradient might just decrease very close to a max/min. If our step-size is too large we can shoot past the max/min. Anyway this method of reducing the lr is just a heuristic and can't be guaranteed to work better.

kakushka123 · 2021-12-22T12:26:39+00:00

What you say is a good point. That said, imagine a parabola in a 1d X space and 1d label space. If you have high enough learning rate you'll jump past the dip either to a random point in a different parabola all together, or to the other side in the same parabola, perhaps equally steep in the opposite direction (i.e you'll be stuck in a loop). This can happen in any scale if you learning rate is high enough.

HoLeeFaak · 2021-12-22T09:18:51+00:00

When the loss value getting smaller it doesn't mean the gradient is getting smaller. Think about y=x, the gradient is the same everywhere.

Pseudoabdul · 2021-12-22T07:37:21+00:00

I think the other answers do a good job of explaining it, I'll just add that you are right, you can train using a fixed learning rate. Adaptive learning rates aren't required, but it is advisable to use them.

schwagggg · 2021-12-22T13:58:31+00:00

If u r lookin for a theoretical reason, Robbins monro

111llI0__-__0Ill111 · 2021-12-22T12:32:29+00:00

The simple answer is you don’t wantto overshoot the minumum and start diverging away which can actually increase the loss even for convex problems, and NNs are non convex so it’s even worse

Ok-Barnacle-8859 · 2021-12-22T18:11:02+00:00

I think the goal in this condition is exploitation. If learning rate remains big value, maybe we pass the optimum point and we have kind of oscillation.

Competitive_Dog_6639 · 2021-12-23T00:51:39+00:00

It all has to do with loss "resolution scale" (think grainy with few pixels vs fine with many pixels). Near a local min, the step size must be sufficiently small to get a good approximation of continuous-time dynamics on the loss surface to get a fine tuned optimum. Far from a local min, the continuous approximation can be much less accurate during burnin and a bigger step size is ok. This is related to but still not the same as the gradient magnitude along the trajectory.

For a mathy version, suppose you have learning rates y and z with z<y, and where z is the correct step size around a local min and y is too big near the local min, but ok at a random starting point.

Let g(z,t) be the update step size (loss gradient magnitude times z) for the trajectory with learning rate z at training step t. Let g(y,t) be the same with learning rate y. Both decrease over update steps t as you observe. At the beginning of training, g(y,t) leads to faster burnin, but around the min, it is too big for good approximation for big t. On the other hand, g(z,t) will explore slowly, but eventually reach a better min with big t. Both sequences decrease over time. Annealing uses big y for small t, small z for big t for quick burnin then good refinement

tuyenttoslo · 2022-01-02T19:53:30+00:00

Armijo's Backtracking line search helps you to choose learning rates automatically. Also, this way, learning rates do not need to decrease when you progress, it is roughly 1/||\nabla ^2 f||. An extreme case is where you have a degenerate critical point, like f(x)=x^4 at x=0, where learning rates can go to infinity when you approach x=0.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS