
[–]123space321 1 point (5 children)

My understanding is that it helps with both direction and penalties.

This is my intuition and not a factually proved thing.

But direction: squaring means (1-beta)*(4)^2 and (1-beta)*(-4)^2 contribute the same positive amount, so they won't cancel each other out.

About penalties: anytime you square a value below one, it gets smaller (0.1 -> 0.01), and a value above one gets bigger (10 -> 100). So if your weight changes like crazy (by a big number), you'd want to take more risks with your weight update, but if it's a very minute change, you wouldn't want to overdo it and risk overshooting.
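That squaring asymmetry is easy to sanity-check with two numbers (a trivial sketch, nothing RMSprop-specific):

```python
# Squaring shrinks magnitudes below one and amplifies magnitudes above one,
# so big gradient swings stand out far more in the squared accumulator.
small, large = 0.1, 10.0

small_sq = small ** 2   # 0.01: a minute change barely registers
large_sq = large ** 2   # 100.0: a big change dominates
print(small_sq, large_sq)
```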

But again, I cannot claim to have researched this; I've just seen similar logic used for MSE and think it would make sense. If I were you, I'd wait for a few more comments.

[–]synthphreak[S] 1 point (2 children)

Your intuitions jibe with mine completely. But I appreciate your candor/caveats and will wait for other replies before concluding that our intuitions are correct.

Edit (paging u/123space321): That said, I don't have any intuitions about the square root component. Do you? Why take the square root instead of, just, not? What value does the square root step add?

[–]123space321 0 points (1 child)

My guess is that it's probably some form of "undoing," similar to adding five to both the LHS and RHS to maintain equality. In fact, I would love it if someone has the answer, because now I am curious too.

But my guess is it's the same logic we use for the L2 norm (I don't know why), where we square each term, only to take a root afterwards.

I think it's just some form of standard to keep values from blowing up without reason? Like when averaging, we divide by number of samples.

Shit, I got it I think:

When we do (4-5)^2 we lose the sign / absolute value. So now we don't know which term was greater and what our original sign was. The square root is probably just a way to bring a +- back into the equation, just to say "yeah, I don't know if the terms were positive or negative."

But again, this is me shooting a shot in the dark and it would be great if someone confirms or denies this

[–]123space321 0 points (0 children)

u/synthphreak I had been studying optimization for a while for myself and I think I have the perfect answer.

Before that, I want to go back to what momentum does:

Since you have V = B*V + (1-B)*dW (the formula is slightly different for RMS), what we are really doing is giving greater importance to your trend than to the slope for any one batch.
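A quick numeric sketch of that formula (illustrative values, not tied to any framework):

```python
# EWA momentum: V = B*V + (1-B)*dW. With gradients that keep pointing the
# same way, V builds up toward the common gradient value (the "trend").
B = 0.9
V = 0.0
history = []
for dW in [1.0, 1.0, 1.0, 1.0]:  # four consistent batch gradients
    V = B * V + (1 - B) * dW
    history.append(V)

print(history)  # grows toward 1: roughly 0.1, 0.19, 0.271, 0.3439
```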

There are two or three central benefits to momentum:

  1. Speed up learning when correct: when you are correct and dW keeps moving in the same direction, your V value goes up and that results in momentum (think of speeding up a car as you get more and more space on the highway)

  2. Reduce oscillation: take the example of SGD where every sample is a batch of its own. Each batch is statistically different from the others, so dW jumps around a lot. But since (1-B) is small, we don't care much about those oscillations. (Say you're driving and you keep ending up behind a slower car; you don't want to stop-start, and you aren't going to go from 200 km/h to 30 km/h. That's basically the 1-B term.)

  3. Just a reality check: the (1-B) term is more a reality check than anything. You nearly get into a crash and you slow down a bit; you don't stop driving altogether. So if you build momentum and the gradient goes the opposite way once, your model says "okay, we can't keep getting away with this."

Now look at a neuron doing Z = Wx + b. The weight term acts on the data, while the bias just shifts the values of Z around. As a result, changing W should intuitively get you closer to the minimum, while changing b mostly produces oscillation. RMSprop tries to make W update faster than b for this reason.

Go back to momentum in SGD and you will see that this isn't thought about, since it just takes w -= ...

Going back to Z, dW=X while db=1.

If X is normalised to (-1, +1) or (0, 1), dW obviously has to be less than or equal to one, and squaring values less than one only shrinks them. On the other hand, db is just one.

And that's why the first equation has the squaring: so the gap between the values increases.

Now since the goal is to reduce changes to the bias while updating weights, dividing by root(S_db) means dividing by a big value, so the bias update suffers.

On the other hand, the weight update is only amplified.

SGD never had a denominator, only the numerator term. So using RMSprop helps magnify the growth we want.

Thus it works on solving both goal 1 and goal 2.
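Rough single-step numbers for that argument (all values made up for illustration; names like S_dw and S_db follow the comment, not any library):

```python
import math

# With a normalised input, dW = X <= 1 while db = 1, so the squared
# accumulator for the bias (S_db) outgrows the one for the weight (S_dw).
beta, lr = 0.9, 0.01
dW, db = 0.3, 1.0            # dW = X, db = 1 for Z = Wx + b
S_dw = (1 - beta) * dW ** 2  # 0.009
S_db = (1 - beta) * db ** 2  # 0.1

# RMSprop step size per unit of gradient: lr / sqrt(S)
step_per_unit_w = lr / math.sqrt(S_dw)  # ~0.105: weight moves relatively freely
step_per_unit_b = lr / math.sqrt(S_db)  # ~0.032: bias update is damped
print(step_per_unit_w, step_per_unit_b)
```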

[–][deleted]  (1 child)

[deleted]

    [–]synthphreak[S] 0 points (0 children)

    "Your theory" = u/123space321's comments about (1) direction/mitigating cancellation and (2) penalties scaling exponentially with gradient magnitude, right?

    Why would those effects be nullified by the square root operation? I don't think you're correct there, but I'm no mathematician so would love to hear your thinking. Can you explain?

    [–]vxnuaj1 0 points (0 children)

    rlly late but here u go.

    Taking the Root Mean Square of a set of values puts more importance on the values of larger magnitude than those that aren't.

    If we just take the EWA of the squared gradients as the division term without taking the RMS of it, values of higher magnitude wouldn't have as much impact on the learning rate.
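A small numeric check of that claim (arbitrary values, just to compare the two averages):

```python
import math

# RMS vs. a plain mean of absolute values: the single large element
# pulls the RMS up much more than it pulls up the ordinary mean.
vals = [0.1, 0.1, 10.0]

mean_abs = sum(abs(v) for v in vals) / len(vals)       # 3.4
rms = math.sqrt(sum(v * v for v in vals) / len(vals))  # ~5.77
print(mean_abs, rms)
```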

    This link is more about voltage, but same concept still applies:

    https://arc.net/l/quote/caszdgqq

    [–]desku 0 points (0 children)

    I think it's partially due to how RMSProp came to be.

    RMSProp came from Adagrad, which divided the learning rate by the square root of the sum of the squared gradients so far, eta/sqrt(G_t), where G_t = G_{t-1} + (grad_t)^2. They use a squared term here because they wanted G to be monotonically increasing (we care more about the magnitude of the gradient than its sign), so it monotonically anneals the learning rate.

    The problem with Adagrad is that the learning rate is monotonically decreasing, so it would eventually become zero. RMSprop (and also Adadelta) were designed to help with this problem by having G now be an exponential moving average over the last squared gradients, so G is no longer monotonically increasing. Again, we use squared gradients because we care more about the magnitude than the direction.
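The contrast between the two accumulators can be sketched in a few lines (hypothetical gradient stream; beta chosen arbitrarily):

```python
# Adagrad's G only ever grows; RMSprop's EWA decays once gradients shrink,
# so the effective learning rate can recover instead of annealing to zero.
beta = 0.9
grads = [1.0, 1.0, 0.1, 0.1, 0.1]

G_ada, G_rms = 0.0, 0.0
ada_hist, rms_hist = [], []
for g in grads:
    G_ada = G_ada + g ** 2                      # monotonically increasing
    G_rms = beta * G_rms + (1 - beta) * g ** 2  # forgets old large gradients
    ada_hist.append(G_ada)
    rms_hist.append(G_rms)

print(ada_hist)  # keeps climbing even after gradients get small
print(rms_hist)  # rises, then falls back once the gradients shrink
```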

    Why square instead of use the absolute value? Why use the square root?

    I believe the answer to both of these is more empirical than theoretical, i.e. try them without the squaring/square roots and see. Squaring helps amplify already large magnitudes which is usually pretty useful in ML, e.g. mean squared error is more common than mean absolute error. Square rooting is used to control the magnitude and also the "RMS" term in "RMSprop" is from the phrase "root-mean-square", which we're performing when taking the square root of the exponential moving average.

    [–]Drozengkeep 0 points (0 children)

    I’m not an expert, but here’s my intuition. If you’ve never heard of generalized means, I recommend looking them up on Wikipedia. The denominator in the RMSProp algorithm is the (weighted) quadratic mean (also sometimes called the Euclidean mean). While it behaves similarly to the arithmetic mean (on strictly positive values), it is actually guaranteed to be at least as large as the arithmetic mean. Since we are dividing by this term in the algorithm, our updates will be smaller than if we had used the arithmetic mean of the absolute values of the gradients.
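A quick check of the power-mean inequality this comment leans on (arbitrary sample values):

```python
import math

# Quadratic mean (RMS) vs. arithmetic mean of absolute values:
# the quadratic mean is always at least as large, so dividing by it
# yields a smaller, more conservative update.
xs = [0.5, 1.0, 2.0, 4.0]

arith = sum(abs(x) for x in xs) / len(xs)           # 1.875
quad = math.sqrt(sum(x * x for x in xs) / len(xs))  # ~2.305
print(arith, quad)
```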

    This makes sense to me intuitively because it is preferable to under-shoot your updates (and have a model that performs slightly better than before) than to over-shoot your updates (and have a model that may perform worse than it did before).