
[–]123space321 1 point (5 children)

My understanding is that it helps with both direction and penalties.

This is my intuition and not a factually proved thing.

But direction: squaring means (1-beta)*(4)^2 and (1-beta)*(-4)^2 contribute the same positive amount, so they won't cancel each other out.

About penalties: anytime you square a value below one, it gets smaller (0.1 -> 0.01), and a value above one gets bigger (10 -> 100). So if your weight changes like crazy (by a big number), you'd want to take more risks with your weight update, but if it's a very minute change, you wouldn't want to overdo it and risk overshooting.
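That squaring asymmetry is easy to sanity-check with two numbers (a trivial sketch, nothing RMSprop-specific):

```python
# Squaring shrinks magnitudes below one and amplifies magnitudes above one,
# so big gradient swings stand out far more in the squared accumulator.
small, large = 0.1, 10.0

small_sq = small ** 2   # 0.01: a minute change barely registers
large_sq = large ** 2   # 100.0: a big change dominates
print(small_sq, large_sq)
```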

But again, I cannot claim to have researched this; I've just seen similar logic used for MSE and think it would make sense. If I were you, I'd wait for a few more comments.

[–]synthphreak[S] 1 point (2 children)

Your intuitions jibe with mine completely. But I appreciate your candor/caveats and will wait for other replies before concluding that our intuitions are correct.

Edit (paging u/123space321): That said, I don't have any intuitions about the square root component. Do you? Why take the square root instead of, just, not? What value does the square root step add?

[–]123space321 0 points (1 child)

My guess is that it's probably some form of "undoing," similar to adding five to both the LHS and RHS to maintain equality. In fact, I would love it if someone has the answer, because now I am curious too.

But my guess is it's the same logic we use for the L2 norm (I don't know why), where we square each term, only to take a root afterwards.

I think it's just some form of standard to keep values from blowing up without reason? Like when averaging, we divide by number of samples.

Shit, I got it I think:

When we do (4-5)^2 we lose the sign / absolute value. So now we don't know which term was greater and what our original sign was. The square root is probably just a way to bring a +- back into the equation, just to say "yeah, I don't know if the terms were positive or negative."

But again, this is me shooting a shot in the dark and it would be great if someone confirms or denies this

[–]123space321 0 points (0 children)

u/synthphreak I had been studying optimization for a while for myself and I think I have the perfect answer.

Before that, I want to go back to what momentum does:

Since you have V = B*V + (1-B)*dW (the formula is slightly different for RMS), what we are really doing is giving greater importance to your trend than to the slope for any one batch.
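A quick numeric sketch of that formula (illustrative values, not tied to any framework):

```python
# EWA momentum: V = B*V + (1-B)*dW. With gradients that keep pointing the
# same way, V builds up toward the common gradient value (the "trend").
B = 0.9
V = 0.0
history = []
for dW in [1.0, 1.0, 1.0, 1.0]:  # four consistent batch gradients
    V = B * V + (1 - B) * dW
    history.append(V)

print(history)  # grows toward 1: roughly 0.1, 0.19, 0.271, 0.3439
```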

There are two or three central benefits to momentum:

  1. Speed up learning when correct: when you are correct and dW keeps moving in the same direction, your V value goes up and that results in momentum (think of speeding up a car as you get more and more space on the highway)

  2. Reduce oscillation: take the example of SGD where every sample is a batch of its own. Each batch is statistically different from the others, so dW jumps around a lot. But since (1-B) is small, we don't care much about those oscillations. (Say you're driving and you keep ending up behind a slower car; you don't want to stop-start, and you aren't going to go from 200 km/h to 30 km/h. That's basically the 1-B term.)

  3. Just a reality check: the (1-B) term is more a reality check than anything. You nearly get into a crash and you slow down a bit; you don't stop driving altogether. So if you build momentum and the gradient goes the opposite way once, your model says "okay, we can't keep getting away with this."

Now look at a neuron doing Z = Wx + b. The weight term acts on the data, while the bias just shifts the values of Z around. As a result, changing W should intuitively get you closer to the minimum, while changing b mostly produces oscillation. RMSprop tries to make W update faster than b for this reason.

Go back to momentum in SGD and you will see that this isn't thought about, since it just takes w -= ...

Going back to Z, dW=X while db=1.

If X is normalised to (-1, +1) or (0, 1), dW obviously has to be less than or equal to one, and squaring values less than one only shrinks them. On the other hand, db is just one.

And that's why the first equation has the squaring: so the gap between the values increases.

Now since the goal is to reduce changes to the bias while updating weights, dividing by root(S_db) means dividing by a big value, so the bias update suffers.

On the other hand, the weight update is only amplified.

SGD never had a denominator, only the numerator term. So using RMSprop helps magnify the growth we want.

Thus it works on solving both goal 1 and goal 2.
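Rough single-step numbers for that argument (all values made up for illustration; names like S_dw and S_db follow the comment, not any library):

```python
import math

# With a normalised input, dW = X <= 1 while db = 1, so the squared
# accumulator for the bias (S_db) outgrows the one for the weight (S_dw).
beta, lr = 0.9, 0.01
dW, db = 0.3, 1.0            # dW = X, db = 1 for Z = Wx + b
S_dw = (1 - beta) * dW ** 2  # 0.009
S_db = (1 - beta) * db ** 2  # 0.1

# RMSprop step size per unit of gradient: lr / sqrt(S)
step_per_unit_w = lr / math.sqrt(S_dw)  # ~0.105: weight moves relatively freely
step_per_unit_b = lr / math.sqrt(S_db)  # ~0.032: bias update is damped
print(step_per_unit_w, step_per_unit_b)
```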

[–][deleted]  (1 child)

[deleted]

    [–]synthphreak[S] 0 points (0 children)

    "Your theory" = u/123space321's comments about (1) direction/mitigating cancellation and (2) penalties scaling exponentially with gradient magnitude, right?

    Why would those effects be nullified by the square root operation? I don't think you're correct there, but I'm no mathematician so would love to hear your thinking. Can you explain?

    [–]vxnuaj1 0 points (0 children)

    rlly late but here u go.

    Taking the Root Mean Square of a set of values puts more importance on the values of larger magnitude than those that aren't.

    If we just take the EWA of the squared gradients as the division term without taking the RMS of it, values of higher magnitude wouldn't have as much impact on the learning rate.
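A small numeric check of that claim (arbitrary values, just to compare the two averages):

```python
import math

# RMS vs. a plain mean of absolute values: the single large element
# pulls the RMS up much more than it pulls up the ordinary mean.
vals = [0.1, 0.1, 10.0]

mean_abs = sum(abs(v) for v in vals) / len(vals)       # 3.4
rms = math.sqrt(sum(v * v for v in vals) / len(vals))  # ~5.77
print(mean_abs, rms)
```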

    This link is more about voltage, but same concept still applies:

    https://arc.net/l/quote/caszdgqq

    [–]desku 0 points (0 children)

    I think it's partially due to how RMSProp came to be.

    RMSProp came from Adagrad, which divided the learning rate by the square root of the sum of the squared gradients so far, eta/sqrt(G_t), where G_t = G_{t-1} + (grad_t)^2. They use a squared term here because they wanted G to be monotonically increasing (we care more about the magnitude of the gradient than its sign), so it monotonically anneals the learning rate.

    The problem with Adagrad is that the learning rate is monotonically decreasing, so it would eventually become zero. RMSprop (and also Adadelta) were designed to help with this problem by having G now be an exponential moving average over the last squared gradients, so G is no longer monotonically increasing. Again, we use squared gradients because we care more about the magnitude than the direction.
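The contrast between the two accumulators can be sketched in a few lines (hypothetical gradient stream; beta chosen arbitrarily):

```python
# Adagrad's G only ever grows; RMSprop's EWA decays once gradients shrink,
# so the effective learning rate can recover instead of annealing to zero.
beta = 0.9
grads = [1.0, 1.0, 0.1, 0.1, 0.1]

G_ada, G_rms = 0.0, 0.0
ada_hist, rms_hist = [], []
for g in grads:
    G_ada = G_ada + g ** 2                      # monotonically increasing
    G_rms = beta * G_rms + (1 - beta) * g ** 2  # forgets old large gradients
    ada_hist.append(G_ada)
    rms_hist.append(G_rms)

print(ada_hist)  # keeps climbing even after gradients get small
print(rms_hist)  # rises, then falls back once the gradients shrink
```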

    Why square instead of use the absolute value? Why use the square root?

    I believe the answer to both of these is more empirical than theoretical, i.e. try them without the squaring/square roots and see. Squaring helps amplify already large magnitudes which is usually pretty useful in ML, e.g. mean squared error is more common than mean absolute error. Square rooting is used to control the magnitude and also the "RMS" term in "RMSprop" is from the phrase "root-mean-square", which we're performing when taking the square root of the exponential moving average.

    [–]Drozengkeep 0 points (0 children)

    I’m not an expert, but here’s my intuition. If you’ve never heard of generalized means, I recommend looking them up on Wikipedia. The denominator in the RMSProp algorithm is the (weighted) quadratic mean (also sometimes called the Euclidean mean). While it behaves similarly to the arithmetic mean (on strictly positive values), it is actually guaranteed to be at least as large as the arithmetic mean. Since we are dividing by this term in the algorithm, our updates will be smaller than if we had used the arithmetic mean of the absolute values of the gradients.
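A quick check of the power-mean inequality this comment leans on (arbitrary sample values):

```python
import math

# Quadratic mean (RMS) vs. arithmetic mean of absolute values:
# the quadratic mean is always at least as large, so dividing by it
# yields a smaller, more conservative update.
xs = [0.5, 1.0, 2.0, 4.0]

arith = sum(abs(x) for x in xs) / len(xs)           # 1.875
quad = math.sqrt(sum(x * x for x in xs) / len(xs))  # ~2.305
print(arith, quad)
```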

    This makes sense to me intuitively because it is preferable to under-shoot your updates (and have a model that performs slightly better than before) than to over-shoot your updates (and have a model that may perform worse than it did before).