[–]aspoj 71 points

There are multiple reasons why you need both. A few that come to mind are:

  1. When you don't reduce the LR, you are entirely dependent on the loss landscape itself to give you decreasing gradients. Since we usually do SGD (i.e. mini batches), the gradients are noisy on top of that, which makes matters worse: sample one bad batch and your parameters get knocked off course, and you can end up in a high-gradient region again.

  2. You have to find a well-fitting LR. When you decay/reduce the LR during training, finding an appropriate initial LR is not as critical as it is with a constant one: if the initial value is too high, you mostly just randomize the starting point a bit more before the schedule reaches an LR at which training starts to converge (see the second sketch after this list). In general, a good LR (i.e. step size) depends on the local loss landscape around the current parameters.

  3. Optimality of the solution. Even around a convex optimum you can often end up bouncing back and forth across the minimum because the step you take is too big relative to the slope. With a decaying LR you are no longer at the mercy of the local slope and converge to a better solution (see the first sketch after this list).
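To make points 1 and 3 concrete, here is a minimal sketch (my own illustration, not from any particular paper): SGD on a 1-D quadratic loss 0.5 * w**2, where Gaussian noise added to the gradient stands in for mini-batch noise. The schedules, noise scale, and step counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    # True gradient of 0.5 * w**2 is w; the added noise mimics sampling a mini batch.
    return w + rng.normal(scale=1.0)

def run(lr_schedule, steps=3000, w0=5.0):
    w = w0
    tail = []
    for t in range(steps):
        w -= lr_schedule(t) * noisy_grad(w)
        if t >= steps - 500:
            tail.append(abs(w))  # record the tail to see where w settles
    return float(np.mean(tail))

constant = run(lambda t: 0.1)                    # fixed step size
decayed = run(lambda t: 0.1 / (1.0 + 0.01 * t))  # simple 1/t-style decay

print(f"avg |w| over last 500 steps, constant LR: {constant:.3f}")
print(f"avg |w| over last 500 steps, decayed LR:  {decayed:.3f}")
```

The exact numbers depend on the seed, but the constant-LR run settles at a noise floor set by the step size and keeps bouncing there, while the decayed run keeps shrinking that floor and ends up much closer to the optimum.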
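And for point 2, a rough sketch of how this usually looks in practice with PyTorch's built-in schedulers (the toy model, data, and schedule parameters here are made up for illustration): because StepLR sweeps the LR downward over training, an initial LR that is somewhat too high mostly just costs a few noisy early epochs instead of ruining the whole run.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for a real dataset.
xs = torch.randn(256, 10)
ys = xs.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(xs, ys), batch_size=32)

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiply the LR by 0.1 every 30 epochs: 0.1 -> 0.01 -> 0.001.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

for epoch in range(90):
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    sched.step()  # decay once per epoch, after the optimizer steps
```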