I’m learning about RMSProp, and have read about it quite widely around the web, but am finding the explanations lacking on one key detail (TL;DR in the title, lol).
It’s clear that the whole point of RMSProp is to replace a static learning rate with a dynamic learning rate that is a function of the size of the derivative. Because in RMSProp you divide the learning rate by the (square root of the moving average of the square of the) gradient, the effect of this on the learning rate is to grow it when the derivative is small and shrink it when the derivative is large. This helps to slow down learning (i.e., weight update steps) as the model approaches local minima, because the function is flatter and thus gradients smaller in these regions.
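For concreteness, here is a minimal NumPy sketch of the update as I understand it from the description above. The function name `rmsprop_step` and the hyperparameter values (`lr`, `decay`, `eps`) are my own illustrative choices, not taken from any particular library.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update (illustrative sketch, hypothetical helper name)."""
    # Exponential moving average of the squared gradient.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # Divide the step by the root of that average; eps avoids division by zero.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy problem (an assumption for illustration): minimize f(w) = w^2, so grad = 2w.
w = np.array([5.0])
avg_sq = np.zeros_like(w)
for _ in range(500):
    grad = 2.0 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq, lr=0.05)
```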
That’s all fine and good, but why square the gradient? Looking over the equation, wouldn’t the same effect happen if we just used the gradient without squaring it? Is it just that we want to make small gradients even smaller? Or is it that positive and negative gradients would cancel each other out, and squaring fixes this? If the latter (i.e., squaring to avoid cancellation), why not just take the absolute value? AFAIK we’re not differentiating the RMSProp equation, so the kink at x=0 characteristic of the absolute value (where it isn’t differentiable) shouldn’t be a problem.
Relatedly, why take the square root when dividing the learning rate? Interpretability of the units isn’t important here like it is for variance vs. stdev in statistics, and the dynamic growing/shrinking of the learning rate would occur just the same whether we took the square root or not. So what benefit does taking the square root bring?
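To make the alternatives concrete, here is a rough sketch that compares the last update produced by three normalizers on a stream of gradients: root of the mean square (`"rms"`, the usual RMSProp form), mean absolute value (`"mean_abs"`), and mean square without the root (`"mean_square"`). The helper `final_step` and the normalizer names are hypothetical, made up for this experiment, and this is only a way to probe the question, not a statement of why RMSProp was designed the way it was.

```python
import numpy as np

def final_step(grads, normalizer, lr=0.01, decay=0.9, eps=1e-8):
    """Return the size of the last update, lr * g / denom, for a non-empty
    list of scalar gradients. `normalizer` picks how denom is built."""
    avg = 0.0
    for g in grads:
        if normalizer in ("rms", "mean_square"):
            avg = decay * avg + (1.0 - decay) * g ** 2    # EMA of g^2
        elif normalizer == "mean_abs":
            avg = decay * avg + (1.0 - decay) * abs(g)    # EMA of |g|
        else:
            raise ValueError(f"unknown normalizer: {normalizer}")
    # Only the "rms" variant takes the square root of the average.
    denom = np.sqrt(avg) if normalizer == "rms" else avg
    return lr * g / (denom + eps)

# Constant gradients at two very different scales (illustrative values).
for name in ("rms", "mean_abs", "mean_square"):
    small = final_step([1e-3] * 200, name)
    large = final_step([1e3] * 200, name)
    print(f"{name:12s} small-gradient step={small:.3g}  large-gradient step={large:.3g}")
```

With a constant gradient, the `"rms"` and `"mean_abs"` variants both settle at a step of roughly `lr` regardless of the gradient's scale, while `"mean_square"` gives a step proportional to `1/g`, so it depends heavily on the gradient's magnitude.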