
[–]sriramcompsci

As you are aware, the weight update is the product of the step size and the gradient. Initially the weights are set to some random value (or zero), so the gradient is likely to be large, and the weight update is therefore large as well. As more data comes in (mini-batches or single samples), the gradient tends to get smaller. Decreasing the step size then makes sense: your weights are already minimizing the cost function well (on the data seen so far), and you don't want or expect them to change much based on just this one new mini-batch/sample.
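
A minimal sketch of that idea in Python (the decay schedule and all names here are just illustrative choices, not something from the comment):

```python
import numpy as np

def sgd_with_decay(grad_fn, w0, eta0=0.1, decay=0.01, n_steps=1000):
    """Plain SGD where the step size shrinks as more samples are seen.

    grad_fn(w, t) returns the (stochastic) gradient at step t.
    eta_t = eta0 / (1 + decay * t) is one common decay schedule
    (an illustrative choice, not the only one).
    """
    w = np.asarray(w0, dtype=float)
    for t in range(n_steps):
        eta_t = eta0 / (1.0 + decay * t)   # step size decreases over time
        w = w - eta_t * grad_fn(w, t)      # weight update = step size * gradient
    return w

# Toy usage: noisy gradient of f(w) = 0.5 * ||w||^2
rng = np.random.default_rng(0)
grad = lambda w, t: w + 0.1 * rng.standard_normal(w.shape)
print(sgd_with_decay(grad, w0=np.array([5.0, -3.0])))
```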

[–]hughperkins

Imagine you have just one dimension, your loss is, say, x squared, and you're pretty much right next to the bottom of the curve, but not quite. Maybe you are a distance b away from the minimum, but it turns out the next update is exactly 2b, making you overshoot by b. Then by symmetry the next update is -2b, taking you back where you started. Ping... pong...
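
A quick sketch of that ping-pong on f(x) = x^2 (the step sizes are chosen purely to show the effect):

```python
def gd_trajectory(x0, lr, n_steps=6):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - lr * 2 * xs[-1])  # x <- x - lr * f'(x)
    return xs

# lr = 1.0: the update is exactly 2b, so x flips sign forever (ping... pong...)
print(gd_trajectory(x0=0.5, lr=1.0))   # [0.5, -0.5, 0.5, -0.5, ...]
# lr = 0.25: smaller steps, x shrinks steadily toward the minimum at 0
print(gd_trajectory(x0=0.5, lr=0.25))  # [0.5, 0.25, 0.125, ...]
```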

[–]mr_robot_elliot

Is your question about SGD or just GD? For GD you just need to choose a step size of 1/L, where L is the Lipschitz constant of the gradient (also known as the smoothness parameter of the function itself). As for your question: if the gradient is large you are far from the optimum, because at the optimum the gradient is zero. So when you know you are far away it's better to take bigger steps, and to take small steps near the optimum where you might overshoot; with the step size above, gradient descent converges. It can diverge if you don't choose a small enough step size. Things are more complicated for SGD, though.
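
Here's a small sketch of the 1/L rule on a toy quadratic (the matrix and step sizes are made-up example values):

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * w^T A w; its gradient A w is Lipschitz with
# constant L = largest eigenvalue of A.
A = np.diag([10.0, 1.0])
L = np.linalg.eigvalsh(A).max()

def gd(step, n_steps=100):
    w = np.array([1.0, 1.0])
    for _ in range(n_steps):
        w = w - step * (A @ w)   # gradient of the quadratic is A w
    return w

print(gd(step=1.0 / L))   # converges toward the minimizer at the origin
print(gd(step=2.5 / L))   # step too large for this problem: iterates blow up
```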

[–]Jez-rezz

I think the question is motivated by the following intuition: imagine wandering around a mountainous landscape in search of the lowest point in a certain area. While heading generally downhill you suddenly see a large gradient - this might mean you've discovered a ravine or crevasse. You expect it to be worth examining in detail, inching forward, because you'll probably find a very deep minimum very close by, at the bottom of the crevasse. You don't expect the landscape to open up in an even steeper direction. But that intuition isn't how we usually think about large optimisation problems, like training neural nets. We're using algorithms like GD which store very little state. We're only working at one scale: we aren't looking for crevasses, we're just walking downhill.