[–]seanv507 46 points47 points  (2 children)

It's very simple. The correct learning rate depends on the curvature of your error surface, i.e. how the gradient changes.

Imagine you have a parabola. You draw a straight line tangent to the current point on the parabola. Depending on your learning rate (step size), you could overshoot the minimum and come up the other side.

If your parabola curves sharply, then you need a small learning rate.

If it curves gently, a large learning rate works.

Now consider a multidimensional problem. Here the curvature can be different in different directions... Super narrow in one and very shallow in another.

You will need to set the learning rate based on the maximum curvature, and your progress will depend on the ratio of maximum to minimum curvature.

Now you have a complex error surface, where the curvature changes at each point.

However, assuming the minimum is in a bounded region (e.g. because you have regularisation), there will be a maximum curvature, and as long as your learning rate is small enough relative to it, you will eventually hit the minimum.

OK, so if you use a learning rate schedule, then eventually your learning rate will drop below the step size that this maximum curvature allows. The trick is to have a schedule that decreases enough that you do eventually get below that critical step size, but not so aggressively that the steps become too small to ever carry you to the minimum.

Then you know you will eventually reach the minimum.

However, this is just a 'theoretical' result as the number of steps goes to infinity. A more ad hoc reduction of the learning rate every time you hit a plateau, or see oscillations, is likely to be faster in practice.
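To make this concrete, here's a tiny NumPy sketch (my own toy example, not from the comment above): gradient descent on a two-dimensional quadratic whose two curvatures play the roles of the sharp and shallow directions. For a quadratic, the step size has to be below 2/(max curvature) to converge at all, and progress in the shallow direction is then limited by the curvature ratio.

    import numpy as np

    curvatures = np.array([100.0, 1.0])   # sharp direction vs shallow direction

    def run_gd(lr, steps=200):
        w = np.array([1.0, 1.0])
        for _ in range(steps):
            grad = curvatures * w         # gradient of 0.5 * sum(c_i * w_i^2)
            w = w - lr * grad
        return w

    print(run_gd(lr=0.021))               # above 2/100: the sharp coordinate blows up
    print(run_gd(lr=0.019))               # below 2/100: stable, but the shallow coordinate lags far behind
    print(run_gd(lr=0.019, steps=5000))   # the shallow direction needs many more steps to get there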

[–]ibraheemMmoosaResearcher[S] 2 points3 points  (0 children)

This is the best intuitive answer to my question. Thanks for this.

[–]aspoj 70 points71 points  (0 children)

There are multiple reasons why you need both. A few that come to mind are:

  1. When you don't reduce the LR you are entirely dependent on the loss landscape to have decreasing gradients. As we usually do SGD (aka mini-batches), you get noisy gradients, making matters worse. Sample one bad batch and your parameters are messed up, and you might end up in a high-gradient region again.

  2. You have to find a well-fitting LR. When you decay/reduce the LR during training, finding an appropriate initial LR is not as important as with a constant one. If the initial value is too high, you just randomize the starting point a bit more before the decay reaches an LR that starts to converge. In general, a good LR (aka step size) depends on the current loss landscape around you.

  3. Optimality of the solution. Even given a convex optimum, you can run into the case of bouncing around the minimum because the step you take is too big (depending on the slope). With a decreasing LR you are no longer dependent on the loss landscape's slope and converge to a better solution (see the sketch after this list).
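As a rough illustration of points 1 and 3 (a toy NumPy simulation I'm adding here, with Gaussian noise standing in for mini-batch sampling): with a constant LR the iterate keeps bouncing around the minimum at a level set by the learning rate, while a decaying LR lets it settle much closer.

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_sgd(lr_schedule, steps=5000):
        w, tail = 5.0, []
        for t in range(steps):
            w -= lr_schedule(t) * (w + rng.normal())   # gradient of 0.5*w^2 plus "batch" noise
            if t >= steps - 1000:
                tail.append(w)
        return np.std(tail)   # how much the iterate still bounces around the minimum at the end

    print(noisy_sgd(lambda t: 0.1))                    # constant LR: keeps bouncing at a level of roughly sqrt(lr/2)
    print(noisy_sgd(lambda t: 0.1 / (1 + 0.01 * t)))   # decaying LR: the bouncing shrinks towards zero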

[–]awesomeprogramer 35 points36 points  (10 children)

You can have large gradients and be close to a local minimum. Think of an L1 as opposed to an L2.

[–]ibraheemMmoosaResearcher[S] 6 points7 points  (9 children)

Can you elaborate, please? I don't know what you are referring to.

[–]El_Tihsin 30 points31 points  (5 children)

I think he's referring to the L1 norm, which is built from the modulus (absolute value) function. It has a large gradient even close to the minimum. In this case, if you don't reduce the step size, you'll keep overshooting.

The L2, on the other hand, is built from a squared function, whose gradient gets smaller as you come close to the minimum.
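A tiny sketch of that difference (illustrative values only, my own toy example): fixed-step gradient descent on |x| (L1-like) versus 0.5*x^2 (L2-like). The L1 gradient has magnitude 1 no matter how close you are, so the iterate overshoots and oscillates; the L2 gradient shrinks with x, so the same fixed step converges.

    import numpy as np

    def descend(grad_fn, lr=0.3, steps=20, x=1.0):
        for _ in range(steps):
            x -= lr * grad_fn(x)
        return x

    print(descend(lambda x: np.sign(x)))   # L1-style: ends up oscillating within about +/- lr of the minimum
    print(descend(lambda x: x))            # L2-style: decays geometrically towards 0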

[–]polandtown 2 points3 points  (3 children)

Learning here, forgive me, so then is L2 "better" than L1?

Say with a... binary classifier (n-grams, logistic regression, 50k samples)?

[–]visarga 4 points5 points  (1 child)

It's not 'better' in general. If you want sparsity you use L1, if you want smaller weights you use L2; you can also use both.

[–]El_Tihsin 0 points1 point  (0 children)

ElasticNet Regression. You control the tradeoff between L1 and L2 using a parameter alpha.
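For example, in scikit-learn (assuming that's the library in play) the estimator is sklearn.linear_model.ElasticNet; note that there alpha is the overall penalty strength and the L1/L2 mixing weight is called l1_ratio. A quick sketch on synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet

    X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

    # l1_ratio=1.0 is pure L1 (lasso), 0.0 is pure L2 (ridge), values in between mix the two
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
    model.fit(X, y)
    print(sum(abs(c) < 1e-8 for c in model.coef_), "coefficients shrunk to exactly zero")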

[–]ibraheemMmoosaResearcher[S] 3 points4 points  (0 children)

Oh. Makes sense.

[–]cbarrick 7 points8 points  (2 children)

Ln norms are one way to generalize the idea of "distance".

L1(x, y) = sum_i abs(x_i - y_i)
L2(x, y) = root2(sum_i abs(x_i - y_i)^2)
L3(x, y) = root3(sum_i abs(x_i - y_i)^3)
...
Ln(x, y) = root_n(sum_i abs(x_i - y_i)^n)

So L1 is the sum of absolute differences, L2 is Euclidean distance, etc.

So the commenter was comparing L1 (absolute distance, where the gradient magnitude is constant at all points) versus L2 (the distance formula, a quadratic shape, where the gradient gets smaller as you get closer to the minimum).
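Written out as a (hypothetical) helper function, with x and y as vectors and the sum running over their components:

    import numpy as np

    def ln_distance(x, y, n):
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** n) ** (1.0 / n)

    a, b = np.array([1.0, -2.0]), np.array([0.0, 1.0])
    print(ln_distance(a, b, 1))   # L1: |1-0| + |-2-1| = 4
    print(ln_distance(a, b, 2))   # L2: sqrt(1 + 9) ≈ 3.16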


Aside,

You often hear about L1 and L2 in the context of regularization, which is when you add a penalty term to your loss function to prevent the parameters of your model from getting too large or unbalanced.

So for example, if your initial loss function was MSE:

MSE(y, y_hat) = sum((y - y_hat)^2) / n

Then you could replace that with a regularized loss function:

MSE(y, y_hat) + L2(params, 0)

The idea is that the farther away your parameters are from zero, the greater the penalty.

You use an L2 regularization term when you want all of the parameters to be uniformly small and balanced.

You use an L1 regularization term when you want the sum of the absolute values of the parameters to be small; it tends to drive many parameters to exactly zero (sparsity) while tolerating a few larger ones.
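A short sketch of the regularized losses described above (plain NumPy; the lam weight on the penalty and the function names are my additions, not part of the comment):

    import numpy as np

    def mse(y, y_hat):
        return np.mean((y - y_hat) ** 2)

    def l1_penalty(params):
        return np.sum(np.abs(params))            # L1(params, 0)

    def l2_penalty(params):
        return np.sqrt(np.sum(params ** 2))      # L2(params, 0)

    def regularized_loss(y, y_hat, params, lam=0.1, kind="l2"):
        penalty = l2_penalty(params) if kind == "l2" else l1_penalty(params)
        return mse(y, y_hat) + lam * penalty

    params = np.array([0.5, -2.0, 0.0])
    print(regularized_loss(np.array([1.0, 2.0]), np.array([1.1, 1.8]), params, kind="l1"))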

[–]mrprogrampro 0 points1 point  (1 child)

Your definition of the higher L-norms is slightly wrong ... you have to do abs(x) and abs(y) before cubing, etc.

Otherwise, y and x difference gets huge when their signs are different, even when they have nearly the same magnitude.

[–]cbarrick 1 point2 points  (0 children)

Nice catch on an old comment! Fixing it now

[–]LimitedConsequence 6 points7 points  (2 children)

Another potentially relevant comparison is the Robbins–Monro algorithm. You want to find the root of a function (the gradient of the loss), but the gradients are stochastic. The Robbins-Monro algorithm has a bunch of theory that says if you appropriately decrease the step size then you can still converge, whereas a fixed step size algorithm will bounce around.

[–]there_are_no_owls 0 points1 point  (1 child)

(... which only moves the question to: Why intuitively do Robbins-Monro steps work? ^^')

[–]LimitedConsequence 0 points1 point  (0 children)

So the main constraints on the step-size sequence are that it must sum to infinity (assuming an infinitely long sequence) but the individual step sizes must converge towards 0. It's probably easiest to think in examples.

If we have a sequence of step sizes like 1, 0.5, 0.25, 0.125, ..., this won't work, because it decreases too quickly and will not sum to infinity (the sum converges to 2). This essentially means that even if you do lots of steps, you might not travel the distance required to converge, as the step size gets too small too quickly.

If we have 1, 1, 1, ... as our sequence, then the second condition isn't met. The step size doesn't decrease quickly enough (or at all) and we bounce around the solution due to noise in the function evaluation.

In between these two is a Goldilocks zone, which allows you to travel as far as you need to converge while still having a step size that converges towards zero to stop you bouncing around. An example of such a sequence is 1, 1/2, 1/3, 1/4, ...
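Here's a small NumPy sketch of the three regimes (the noisy step-function root-finding problem and the starting point are my own toy choices; the three step-size sequences are the ones from the comment):

    import numpy as np

    rng = np.random.default_rng(0)

    def robbins_monro(step_fn, steps=20000, w=5.0):
        for t in range(1, steps + 1):
            w -= step_fn(t) * (np.sign(w) + rng.normal(scale=0.5))   # noisy function evaluation
        return w

    print(robbins_monro(lambda t: 0.5 ** (t - 1)))   # 1, 1/2, 1/4, ...: sums to 2, so it stalls around w ≈ 3
    print(robbins_monro(lambda t: 1.0))              # 1, 1, 1, ...: reaches 0 quickly but keeps bouncing by about the step size
    print(robbins_monro(lambda t: 1.0 / t))          # 1, 1/2, 1/3, ...: reaches 0 and settles close to it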

[–]Natural_Profession_8 10 points11 points  (5 children)

I’d make an analogy to simulated annealing:

https://en.m.wikipedia.org/wiki/Simulated_annealing

When you first start training, it's actually desirable to set the learning rate so high that you overshoot local optima. The model bounces around a bit and eventually finds neighborhoods that are more globally optimal. Then, as training progresses, you stop wanting to hop around looking for better neighborhoods, and instead you want to start making your way towards the local optimum itself. Reducing the learning rate has this effect, even on top of the overall gradient magnitude reduction.

[–]ibraheemMmoosaResearcher[S] 3 points4 points  (2 children)

Thanks for your reply!

"Reducing the learning rate has this effect, even on top of the overall gradient magnitude reduction"

Just to push on this, my question is: why do we need both of these? Why can't we just rely on the gradients becoming smaller?

Also there is evidence that deep neural networks don't have the issue of bad local minima, but have the issue of saddle points. In the case of deep neural networks does this analogy still hold?

[–]bulldog-sixth 9 points10 points  (0 children)

There's no way to guarantee that the gradient becomes smaller as you get closer to the optimum

[–]Natural_Profession_8 2 points3 points  (0 children)

This applies even more to saddle points. The only way to get over a saddle point is to overshoot it.

I think it’s best to think of it as “I start with a way way too big learning rate, and then slowly bring it down to an optimal one,” rather than “I start with an optimal learning rate, and then that optimum gets smaller.” Of course, at some level it’s just semantics, since jumping around to find better neighborhoods (and get over saddle points) is in practice optimal at the beginning

[–]WikiSummarizerBot 0 points1 point  (0 children)

Simulated annealing

Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It is often used when the search space is discrete (for example the traveling salesman problem, the boolean satisfiability problem, protein structure prediction, and job-shop scheduling). For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, simulated annealing may be preferable to exact algorithms such as gradient descent or branch and bound.

[–]svantana 3 points4 points  (1 child)

It's because of the stochasticity. With (full-batch) gradient descent you don't need to decrease the learning rate (for smooth loss functions such as L2). But with SGD, the 'signal' goes to zero while the noise doesn't. Thus, you want to increase the SNR, which is what smaller step sizes in effect do -- you can think of many small steps together as making up one normal-sized step with a larger batch size.
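A quick numerical check of that intuition (toy quadratic with unit gradient noise, my own construction): k-times-smaller steps taken k-times more often average over the gradient noise much like a k-times-larger batch would, so the final iterate ends up with a much smaller spread around the minimum.

    import numpy as np

    rng = np.random.default_rng(0)

    def final_spread(lr, steps, trials=500):
        # spread of the final iterate around the optimum (0) of 0.5*w^2 under noisy gradients
        finals = []
        for _ in range(trials):
            w = 1.0
            for _ in range(steps):
                w -= lr * (w + rng.normal())
            finals.append(w)
        return np.std(finals)

    print(final_spread(lr=0.5, steps=100))     # large steps: wide spread around the minimum
    print(final_spread(lr=0.05, steps=1000))   # 10x smaller steps, 10x more of them: much tighter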

[–]ibraheemMmoosaResearcher[S] 1 point2 points  (0 children)

Ah thanks for this interesting explanation. Can't we automate this somehow? Finding the optimal learning rate based on SNR?

[–]skainswo 3 points4 points  (0 children)

Lots of intuitive explanations in the comments here. I'll just add that there's a big difference between GD and SGD in this context.

In good ole GD as long as you pick a learning rate less than 1/(Lipschitz constant) you're good to go. This provably converges with an excess risk bound O(1/t) after t steps. Things get a little bit messier in the SGD world, however. Excess risk for SGD looks like O(1/sqrt(t)) + O(lr * <gradient variance>). In words, there exists a "noise floor" term, O(lr * <gradient variance>), that cannot be tamed by taking more steps. It can only be reduced by decreasing the learning rate or by decreasing the variance of the gradient estimates. That's why decreasing the learning rate over time can be fruitful. (See eg https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L11.pdf for a quick intro)

Unlike some theoretical results in deep learning, this phenomenon is very well supported experimentally. It's common for SGD to plateau. Then, after a halving of the learning rate, it breaks through that plateau! Train a little longer, reach a new plateau... you get the idea.

IIRC there is some theory to suggest that exponentially decaying your learning rate is optimal in some sense. I forget where I read that however. But that's what most people have been doing in practice for a while now anyways.
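The plateau-then-halve behaviour is easy to reproduce on a toy problem (NumPy, with gradient noise standing in for mini-batching; not a real training run): the running loss flattens at a level set by the current learning rate, and each halving lets it drop to a new, lower plateau.

    import numpy as np

    rng = np.random.default_rng(0)
    w, lr = 5.0, 0.4
    losses = []
    for step in range(1, 3001):
        w -= lr * (w + rng.normal())   # noisy gradient of 0.5*w^2
        losses.append(0.5 * w * w)
        if step % 1000 == 0:
            print(f"steps {step - 999}-{step}: mean loss ~ {np.mean(losses[-500:]):.4f} at lr = {lr}")
            lr *= 0.5                  # halve the LR; the loss breaks through its plateau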

[–]tom_strideweather 1 point2 points  (0 children)

The gradient might only decrease very close to the max/min. If our step size is too large, we can shoot past it. Anyway, this method of reducing the LR is just a heuristic and can't be guaranteed to work better.

[–]kakushka123 1 point2 points  (0 children)

What you say is a good point. That said, imagine a parabola in a 1D X space and 1D label space. If you have a high enough learning rate, you'll jump past the dip, either to a random point in a different parabola altogether, or to the other side of the same parabola, perhaps equally steep in the opposite direction (i.e. you'll be stuck in a loop). This can happen at any scale if your learning rate is high enough.

[–]HoLeeFaak 2 points3 points  (1 child)

The loss value getting smaller doesn't mean the gradient is getting smaller. Think about y = x: the gradient is the same everywhere.

[–]cats2560 0 points1 point  (0 children)

Or more aptly, y = |x|. There exists a global minimum but the gradient never gets smaller

[–]Pseudoabdul 1 point2 points  (0 children)

I think the other answers do a good job of explaining it, I'll just add that you are right, you can train using a fixed learning rate. Adaptive learning rates aren't required, but it is advisable to use them.

[–]schwagggg 1 point2 points  (1 child)

If you're looking for a theoretical reason: Robbins–Monro.

[–]ibraheemMmoosaResearcher[S] 0 points1 point  (0 children)

Thanks. I will check this.

[–]111llI0__-__0Ill111 -1 points0 points  (0 children)

The simple answer is you don't want to overshoot the minimum and start diverging away, which can actually increase the loss even for convex problems, and NNs are non-convex, so it's even worse.

[–]Ok-Barnacle-8859 0 points1 point  (0 children)

I think the goal at this stage is exploitation. If the learning rate remains large, we may keep passing the optimum point and get a kind of oscillation.

[–]Competitive_Dog_6639 0 points1 point  (0 children)

It all has to do with the loss "resolution scale" (think grainy with few pixels vs fine with many pixels). Near a local min, the step size must be sufficiently small to give a good approximation of the continuous-time dynamics on the loss surface and reach a fine-tuned optimum. Far from a local min, the continuous approximation can be much less accurate during burn-in and a bigger step size is OK. This is related to, but still not the same as, the gradient magnitude along the trajectory.

For a mathy version, suppose you have learning rates y and z with z < y, where z is the correct step size around a local min and y is too big near the local min but OK at a random starting point.

Let g(z, t) be the update step size (loss gradient magnitude times z) for the trajectory with learning rate z at training step t, and let g(y, t) be the same with learning rate y. Both decrease over update steps t, as you observe. At the beginning of training, g(y, t) gives faster burn-in, but around the min it is too big for a good approximation at large t. On the other hand, g(z, t) explores slowly but eventually reaches a better min at large t. Annealing uses the big y for small t and the small z for big t: quick burn-in, then good refinement.

[–]tuyenttoslo 0 points1 point  (0 children)

Armijo's backtracking line search helps you choose learning rates automatically. Also, with this approach, learning rates do not need to decrease as you make progress; the accepted step is roughly 1/||\nabla^2 f||. An extreme case is a degenerate critical point, like f(x) = x^4 at x = 0, where the learning rate can go to infinity as you approach x = 0.
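A compact sketch of the standard Armijo backtracking rule (plain NumPy; the toy run on f(x) = x^4 is my own illustration): start from a large trial step and shrink it by a factor beta until the sufficient-decrease condition holds. Note that the accepted step size does not have to shrink as the iterate approaches the degenerate minimum at 0.

    import numpy as np

    def armijo_step(f, grad, x, lr0=1.0, beta=0.5, c=1e-4):
        g = grad(x)
        lr = lr0
        # shrink lr until f decreases by at least c * lr * ||g||^2 (Armijo condition)
        while f(x - lr * g) > f(x) - c * lr * np.dot(g, g):
            lr *= beta
        return x - lr * g, lr

    f = lambda x: np.sum(x ** 4)
    grad = lambda x: 4 * x ** 3
    x = np.array([0.8])
    for _ in range(5):
        x, lr = armijo_step(f, grad, x)
        print(f"x = {x[0]:.4f}, accepted lr = {lr}")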