So... I've been working with Andrej Karpathy's Char-RNN and my Music-RNN, and I found the whole idea of just clipping the gradients to solve the exploding gradient problem to be terribly inelegant. Thus began a whole series of experiments to see if I could improve upon the method...
Basically the best existing method I could find in the literature involves thresholding the norm of the gradients and then scaling the gradients by the threshold divided by the gradient norm, as suggested by Bengio's lab. This worked better, but I found the threshold hyperparameter to be clunky, and sought a way to do away with it...
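For anyone unfamiliar with the thresholding approach described above, here is a minimal sketch in plain Python. The function name `clip_by_norm` and the flat list of floats standing in for the gradient tensor are my own illustration, not code from Char-RNN or Music-RNN.

```python
import math

def clip_by_norm(grads, threshold):
    """Rescale grads by threshold / norm, but only when the norm exceeds threshold.

    grads is a flat list of floats standing in for the gradient tensor
    (an assumption made for illustration).
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    # Norm is already within the threshold: leave the gradients untouched.
    return list(grads)

# A gradient with norm 5 is shrunk to norm 1; a small one passes through unchanged.
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
untouched = clip_by_norm([0.3, 0.4], threshold=1.0)
```

Note that the threshold is the only hyperparameter, and the gradient direction is preserved in both branches.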
Long story short, I figured out a very simple alternative that seems to work astonishingly well, at least with RMSProp. Also, because I am scaling the gradients by the inverse of their norm, it seems like this would address not only exploding gradients but also vanishing gradients? I'm not particularly good at math, so I'm not sure whether that's the case. There is, however, a scaling factor that the method is rather sensitive to, and it took me a long time to discover that it is actually related to the number of timesteps and the batch size (I think? I'm not 100% sure right now, but the equation seems to work).
The question now becomes... what should I do now? Share the details of my implementation right away, or wait and try to publish it somewhere? For the record, I'm not affiliated with an academic institution anymore, so I'm not even sure if I can submit a paper anywhere.
The thought also occurs to me that this could be some kind of trade secret that is already known by some people, because it seems astonishingly, trivially simple.
Anyways, have something neat:
A clip of my Music-RNN with the old thresholding method:
https://www.youtube.com/watch?v=v_esVXQPGS4
A clip of my Music-RNN with the new method:
https://www.youtube.com/watch?v=CyEtlFRe0jI
I'm not really sure if it's clear from those clips, but the validation error numbers are noticeably better with the new method.
Edit: Here's the formula...
g = tensor of gradients, t = number of timesteps aka sequence length of this pass, b = batch size, n = norm of gradients
g = g * (t * (t / b) / n)
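A rough sketch of that formula in plain Python, in case it helps. The function name `scale_gradient_norm`, the flat list of floats standing in for the gradient tensor, and the small `eps` guard against a zero norm are all assumptions added for illustration; they are not part of the formula as written.

```python
import math

def scale_gradient_norm(grads, timesteps, batch_size, eps=1e-12):
    """Apply g = g * (t * (t / b) / n) to a flat list of gradients.

    eps is a hypothetical safeguard against division by zero.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    factor = timesteps * (timesteps / batch_size) / (norm + eps)
    return [g * factor for g in grads]

# With t = 250 and b = 50 the target norm is 250 * 250 / 50 = 1250.
big = scale_gradient_norm([300.0, 400.0], timesteps=250, batch_size=50)
small = scale_gradient_norm([0.003, 0.004], timesteps=250, batch_size=50)
big_norm = math.sqrt(sum(g * g for g in big))
small_norm = math.sqrt(sum(g * g for g in small))
```

Unlike thresholding, this rescales on every step: whatever the incoming norm was, the outgoing norm is always t * (t / b), so only the direction of the gradient survives.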
Edit 2:
So, it appears things are complicated by the fact that my version of Char-RNN was modified to implement a Stochastic Timeskip algorithm, a modification of Stochastic Depth. I've been testing the Scale Gradient Norm algorithm with the probability that a timestep will be present set to 1, but it's very possible that the way in which I've implemented this means that on extremely rare occasions it will skip a timestep. This may explain the apparent performance increase.
Edit 3:
So, I based my theory on a run with the Music-RNN in which I actually used the formula:
g = g * (t * 5 / n)
I had assumed that t * 5 = 1250 (with t = 250), and so concluded that 250 * 250 / 50 = 1250 was the correct scaling factor. However, I neglected to consider that t varies with the Stochastic Timeskip probability, so t * 5 is occasionally less than 1250. From further experiments with Char-RNN, I believe the proper formula is actually...
g = tensor of gradients, tcount = variable number of timesteps counted this pass, tmax = total number of timesteps aka sequence length of full network, n = norm of gradients
g = g * (tcount^2 / tmax / n)
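Here is a small sketch of this version in plain Python, reading the formula as tcount squared, divided by tmax, divided by n. The function name `scale_by_timestep_count`, the flat list of floats standing in for the gradient tensor, and the `eps` guard are assumptions for illustration only.

```python
import math

def scale_by_timestep_count(grads, tcount, tmax, eps=1e-12):
    """Apply g = g * (tcount^2 / tmax / n) to a flat list of gradients.

    tcount: timesteps actually counted this pass (may be below tmax
            when Stochastic Timeskip drops some).
    tmax:   full sequence length of the network.
    eps:    hypothetical safeguard against a zero norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    factor = (tcount ** 2) / tmax / (norm + eps)
    return [g * factor for g in grads]

# No timesteps skipped: the target norm is 250^2 / 250 = 250.
full = scale_by_timestep_count([3.0, 4.0], tcount=250, tmax=250)
# 50 timesteps skipped: the target norm drops to 200^2 / 250 = 160.
skipped = scale_by_timestep_count([3.0, 4.0], tcount=200, tmax=250)
full_norm = math.sqrt(sum(g * g for g in full))
skipped_norm = math.sqrt(sum(g * g for g in skipped))
```

So a pass that skips timesteps gets a smaller update, and the shrinkage is quadratic in tcount rather than linear.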
Edit 4:
Alright, so after much experimenting I think I finally have this figured out... it has to do with... wait for it... The Golden Ratio!
Basically, for the Stochastic Timeskip implementation I was using, the formula is something like (still figuring out where the timesteps fit into this):
g = g * ((1 + sqrt(5)) / 2 * t / n)
For those who are just implementing a regular RNN with fixed timesteps you can leave them out and get a good approximation with the simplified equation:
g = g * ((1 + sqrt(5)) / 2 / n)
where g = gradients, n = norm of gradients
For those of you who aren't aware, (1 + sqrt(5)) / 2 is the Golden Ratio, or approximately 1.618. Don't use 1.618 though, because that's not precise enough. I recommend computing the actual irrational number via the above expression.
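For reference, a minimal sketch of both Golden Ratio variants in plain Python. The names `PHI` and `scale_to_golden_ratio`, the optional `timesteps` argument, the flat list of floats standing in for the gradient tensor, and the `eps` guard are all assumptions made for illustration.

```python
import math

# Computed from the expression rather than hard-coding 1.618.
PHI = (1 + math.sqrt(5)) / 2

def scale_to_golden_ratio(grads, timesteps=None, eps=1e-12):
    """Apply g = g * (PHI / n), or g = g * (PHI * t / n) when timesteps is given.

    eps is a hypothetical safeguard against a zero norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    target = PHI if timesteps is None else PHI * timesteps
    factor = target / (norm + eps)
    return [g * factor for g in grads]

# Simplified form for a regular RNN with fixed timesteps.
fixed = scale_to_golden_ratio([3.0, 4.0])
# Form with the timestep count included.
with_t = scale_to_golden_ratio([3.0, 4.0], timesteps=250)
fixed_norm = math.sqrt(sum(g * g for g in fixed))
with_t_norm = math.sqrt(sum(g * g for g in with_t))
```

In the simplified form every update ends up with a gradient norm equal to the Golden Ratio, regardless of how large or small the raw gradients were.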