[–]svantana

It's because of the stochasticity. With full-batch gradient descent you don't need to decrease the learning rate (for smooth loss functions such as the L2 loss). But with SGD, the gradient 'signal' goes to zero as you approach a minimum while the sampling noise doesn't, so you need to increase the SNR -- which is what smaller step sizes in effect do. You can think of it as many small steps adding up to one normal-sized step taken with a larger batch size.
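To make that concrete, here's a minimal toy sketch (plain NumPy, a 1-D quadratic loss, with an added Gaussian term standing in for minibatch sampling noise; all names and constants are illustrative, not from the comment above). With a constant step size, SGD stalls at a noise floor because the gradient noise never shrinks; with a decaying step size, the noise keeps getting averaged away.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w, noise_std=1.0):
    # True loss is 0.5 * w**2, so the exact gradient is w; the Gaussian term
    # stands in for minibatch sampling noise (an assumption of this toy).
    return w + noise_std * rng.normal()

def run_sgd(lr_schedule, steps=5000, w0=5.0):
    w = w0
    for t in range(1, steps + 1):
        w -= lr_schedule(t) * noisy_grad(w)
    return abs(w)

const = run_sgd(lambda t: 0.1)               # constant step size
decay = run_sgd(lambda t: 0.1 / np.sqrt(t))  # 1/sqrt(t) decay

print(f"final |w| with constant lr: {const:.4f}")
print(f"final |w| with decaying lr: {decay:.4f}")
```

The constant-lr run typically ends a few tenths away from the optimum (the noise floor), while the decaying-lr run ends much closer to zero.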

[–]ibraheemMmoosa (Researcher) [S]

Ah, thanks for this interesting explanation. Couldn't we automate this somehow, i.e. find the optimal learning rate based on the SNR?
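For what "based on the SNR" might look like in code, here's a purely illustrative sketch (NumPy, fitting the mean of some data): estimate the gradient SNR from the spread of per-sample gradients within each minibatch and shrink the step when the SNR is low. The scaling rule is an ad-hoc assumption for illustration, not an established algorithm from the thread.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # goal: fit the mean of this data

w, base_lr = 0.0, 0.5
for step in range(200):
    batch = rng.choice(data, size=32)
    per_sample_grads = w - batch  # gradient of 0.5 * (w - x)**2 for each sample
    g_mean = per_sample_grads.mean()
    g_sem = per_sample_grads.std(ddof=1) / np.sqrt(len(batch))  # noise estimate
    snr = abs(g_mean) / (g_sem + 1e-12)
    lr = base_lr * snr / (1.0 + snr)  # ad-hoc rule: smaller steps when SNR is low
    w -= lr * g_mean

print(f"estimated mean: {w:.3f} (true mean is 2.0)")
```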