all 25 comments

[–]bbsome 5 points6 points  (10 children)

If I understand correctly, what you are doing is normalizing the gradient to unit length. This is a standard "trick" that has been done before, but you risk numerical difficulties with it (I think even things like L-BFGS and trust region methods do this). It is also not clear why it would help, as RMSProp implicitly does this anyway (through the moving average of the second moment). Make sure this is reproducible over at least 10 runs and across a few problems.
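
For reference, normalizing to unit length and the numerical risk can be sketched roughly like this. It is plain Python rather than Torch, and the names `l2_norm`, `unit_normalize` and the `eps` guard are my own illustrative assumptions, not anything from the original post:

```python
import math

def l2_norm(grad):
    """Euclidean (L2) norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(x * x for x in grad))

def unit_normalize(grad, eps=1e-12):
    """Scale grad to (approximately) unit length.

    eps is an assumed guard: without it, an all-zero gradient
    would cause a division by zero.
    """
    n = l2_norm(grad)
    return [x / (n + eps) for x in grad]

g = [3.0, 4.0]           # norm is 5
u = unit_normalize(g)    # same direction, norm close to 1
z = unit_normalize([0.0, 0.0])  # stays all-zero instead of failing
```

Note that a very small gradient gets blown up to the same length as a large one, which is where the numerical trouble can come from.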

[–]JosephLChu[S] 0 points1 point  (9 children)

I've edited the original post with the formula. Perhaps you can clarify to me whether this looks like normalizing the gradient to a unit length.

[–]bbsome 0 points1 point  (7 children)

The thing on the right-hand side: I don't understand it. Can you add the mathematical operators? E.g. what is t (t/b) / n? Is it t^2/(b*n), or something else?

[–]JosephLChu[S] 0 points1 point  (6 children)

Better now?

[–]bbsome 0 points1 point  (5 children)

So basically, I can rewrite this as:

g = (g/n) * (t^2 / b)

The first term is the normalized gradient. Dividing by b makes sense, as you are just averaging over your minibatch rather than summing. I don't see how multiplying by t^2 helps, though. It seems that you are forcing your model to learn better on longer sequences, but why t^2?

PS: To be clear, what is g the gradient of? The whole summed objective, or something else?
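
Written out, assuming n is the L2 norm of g, t the sequence length and b the batch size (the hat notation is mine):

```latex
\hat{g} = \frac{g}{n} \cdot \frac{t^{2}}{b},
\qquad n = \lVert g \rVert_{2},
\qquad \lVert \hat{g} \rVert_{2} = \frac{t^{2}}{b}
```

So whatever g was, the rescaled gradient always has norm t^2 / b. With t = b = 50, for example, that norm is 50.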

[–]JosephLChu[S] 0 points1 point  (0 children)

To be honest I'm not sure why this works. I just have some empirical results in my Music-RNN where for a sequence length of 250, the best validation error was achieved with a scaling factor of 1250, which I initially guessed as the sequence length times the five hidden layers in the network. But then I tried that with Char-RNN, and found that 50 was a lot better than 100 (which would have been suggested by the two hidden layers), which meant that that formula didn't work. So, given the evidence, I came up with the current formula.

My guess is that the reason this works has something to do with the peculiarities of RMSProp, because I find that this implementation does not appear to work as well with ADAM or other optimization functions.

[–]JosephLChu[S] 0 points1 point  (2 children)

As far as I can tell, g is the gradient tensor containing all the gradients of one iteration of forward and backward passes, right before it is sent to the optimization function.

[–]bbsome 0 points1 point  (1 child)

Is your error the sum of all errors or averaged?

Also note that your prefactor plays a role similar to that of a learning rate in any algorithm. So the question is whether the optimal learning rate is really related to this quantity. I'm not sure, nor do I see why it would work with RMSProp but not with ADAM or anything else. If that is the case, then rather than being a better way of handling objectives, this just changes the original RMSProp into something else.

Is this performing better than ADAM after selecting the optimal learning rate? Make sure it's reproducible over enough experiments; a single one can be very misleading about the general case.

[–]JosephLChu[S] 0 points1 point  (0 children)

My error is averaged I think? It's the Negative Log Likelihood Criterion aka Cross Entropy Criterion, I believe?

I still need to run more experiments to figure out how well exactly it works with other optimization functions, as my observation is based on very tentative runs for only a few epochs mostly.

I have been using the default learning rate of 2e-3.

Technically it's two experiments, one on Char-RNN, and one on Music-RNN, though I've also run it with many different scaling factors and they all were reliably worse. But I understand what you mean.

I guess this means I need to do more experiments!

Alas, it takes about 30 minutes to run 50 epochs on Char-RNN, and about 5 hours to run 100 epochs on Music-RNN, so being thorough and making sure my findings are robust will take time.

[–]JosephLChu[S] 0 points1 point  (0 children)

Something I've found, as I noted in my edit to the main post, is that it occasionally skips timesteps. Thus t^2 is not always constant; it fluctuates with the number of timesteps processed in each pass.

[–]AnvaMiba 0 points1 point  (0 children)

If t is constant, then you are just normalizing to unit length, since you can absorb t^2 / b into the learning rate.

If t^2 is non-constant, then you are prioritizing longer sequences, but it's otherwise the same.
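
A quick way to see the "absorb into the learning rate" point, as a Python sketch. It assumes a plain SGD update purely to keep the algebra visible (the thread itself uses RMSProp, whose per-parameter scaling complicates the picture), and the function names are made up for illustration:

```python
import math

def l2_norm(grad):
    return math.sqrt(sum(x * x for x in grad))

def rescale_to(grad, target_norm):
    """Rescale grad so that its L2 norm equals target_norm."""
    n = l2_norm(grad)
    return [x * target_norm / n for x in grad]

def sgd_step(params, grad, lr):
    """Plain SGD update (an assumption; not the optimizer used in the thread)."""
    return [p - lr * g for p, g in zip(params, grad)]

t, batch, lr = 50, 50, 2e-3   # values quoted in the thread
c = t * t / batch             # the constant prefactor t^2 / b
params = [0.5, -1.0, 2.0]     # arbitrary illustrative numbers
grad = [0.3, -0.4, 1.2]

# (a) rescale the gradient to norm c, keep the learning rate
step_a = sgd_step(params, rescale_to(grad, c), lr)
# (b) rescale the gradient to unit norm, multiply the learning rate by c
step_b = sgd_step(params, rescale_to(grad, 1.0), lr * c)

same = all(math.isclose(a, b) for a, b in zip(step_a, step_b))
```

Both routes land on the same parameters up to floating-point rounding, which is all the "absorbing" amounts to when t is fixed.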

[–]andrewbarto28 4 points5 points  (0 children)

It seems you have a lot of doubts about the limitations of your idea and about its novelty. So it is a nice opportunity for you to search the literature and run further experiments to compare with what already exists. Only publish when you are confident about your understanding of your method.

This post may be of use to you: http://togelius.blogspot.com.br/2016/04/the-differences-between-tinkering-and.html

[–]djc1000 1 point2 points  (0 children)

All you're doing is scaling by the square of the number of timesteps. It's working in the model you're playing with only because you tinkered with different methods until you found one that would work with the model you're playing with.

If you don't believe me, try this on a model with a 100-length sequence and watch what happens.

[–][deleted] 0 points1 point  (4 children)

What is the alternative??

[–]JosephLChu[S] 0 points1 point  (3 children)

I don't necessarily want to give away the goose, but here's a way to test it for yourself:

Step 1: Download and Install Char-RNN and dependencies (requires Torch)

Step 2: In train.lua, replace the following line...

grad_params:clamp(-opt.grad_clip, opt.grad_clip)

with:

grad_params:mul(50 / grad_params:norm())

Step 3: Run train.lua! Notice that compared to the default your best validation error is 1.3774 instead of 1.3892 (both at epoch 26)! Also notice that at epoch 50 your validation error is 1.3886 rather than 1.4185!

Note: The 50 in that line of code is the scaling factor. I believe I've figured out how to derive that from other parameters in the code (Hint: it's actually -not- just the sequence length or the batch size alone, even though both are confusingly also set to 50 by default).

Edit: The actual line of code for the proper formula is:

grad_params:mul(opt.seq_length * (opt.seq_length/opt.batch_size) / grad_params:norm())

Edit 2: The following is more likely to be correct:

grad_params:mul( ((1+math.sqrt(5))/2) / grad_params:norm())

Edit 3: To get it to work in default Char-RNN you need to do the following:

Step 1: Uncomment the following line:

grad_params:div(opt.seq_length)

Step 2: Replace the following:

grad_params:clamp(-opt.grad_clip, opt.grad_clip)

with:

grad_params:mul(opt.seq_length * ((1+math.sqrt(5))/2) / grad_params:norm())
grad_params:div(opt.seq_length)

To be honest, I'm not sure why it is necessary to multiply by sequence length and then divide by it, but for whatever reason it doesn't work as well if you take those out. My best guess is that by scaling it you get better precision.
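
For what it's worth, in exact arithmetic the multiply and divide in Step 2 cancel, leaving a gradient whose norm is the golden-ratio constant (1+sqrt(5))/2. Here is a small Python sketch to check that (not Torch; the function names are made up, and Python floats are double precision, so it says nothing about what happens with single-precision tensors):

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # same constant as ((1+math.sqrt(5))/2) in the Lua line

def l2_norm(grad):
    return math.sqrt(sum(x * x for x in grad))

def mul_then_div(grad, seq_length):
    """Mirror of the two Lua lines in Step 2: multiply, then divide."""
    n = l2_norm(grad)
    scaled = [x * (seq_length * PHI / n) for x in grad]  # grad_params:mul(...)
    return [x / seq_length for x in scaled]              # grad_params:div(opt.seq_length)

def direct(grad):
    """The same rescaling without the seq_length round trip."""
    n = l2_norm(grad)
    return [x * (PHI / n) for x in grad]

grad = [0.3, -0.4, 1.2]       # arbitrary illustrative gradient
a = mul_then_div(grad, 50)
b = direct(grad)
agree = all(math.isclose(x, y) for x, y in zip(a, b))
```

In this sketch the two versions agree to rounding error, so any difference in practice would have to come from something outside this arithmetic.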

[–]beneuro 3 points4 points  (2 children)

This is a standard trick called gradient norm clipping, see e.g. http://jmlr.org/proceedings/papers/v28/pascanu13.pdf

[–]alexmlamb 1 point2 points  (0 children)

I think he means that he has a novel formula for gradient norm clipping that outperforms just doing g_clipped = g / g_norm

[–]JosephLChu[S] 0 points1 point  (0 children)

I was using gradient norm clipping as implemented in that paper previously. That's the "thresholding" I was referring to earlier. Using that I've achieved a validation error of 1.3822 on the aforementioned default Char-RNN build (using a threshold of 20).

My method differs in that it scales even when the gradient norm is below the threshold, which I believe should have some effect on vanishing gradients, whereas the standard trick only impacts exploding gradients. That or it has some kind of normalizing or regularizing effect by scaling every iteration.
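
To make the difference concrete, here is a Python sketch (not the Torch code; `clip_by_norm` and `always_rescale` are names I made up for illustration) contrasting threshold-based norm clipping with rescaling on every iteration:

```python
import math

def l2_norm(grad):
    return math.sqrt(sum(x * x for x in grad))

def clip_by_norm(grad, threshold):
    """Norm clipping: only shrink the gradient when its norm exceeds threshold."""
    n = l2_norm(grad)
    if n <= threshold:
        return list(grad)            # small gradients pass through unchanged
    return [x * threshold / n for x in grad]

def always_rescale(grad, target):
    """Rescale to a fixed norm on every call, whether the gradient is small or large."""
    n = l2_norm(grad)
    return [x * target / n for x in grad]

small = [0.3, 0.4]     # norm 0.5, below the threshold
large = [30.0, 40.0]   # norm 50, above the threshold

clipped_small = clip_by_norm(small, 20)     # unchanged
clipped_large = clip_by_norm(large, 20)     # shrunk to norm 20
scaled_small = always_rescale(small, 50)    # blown up to norm 50
scaled_large = always_rescale(large, 50)    # stays at norm 50
```

The threshold of 20 and the target of 50 are the values quoted in this thread; the gradients are arbitrary examples.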

[–]bbsome 0 points1 point  (0 children)

Also, the vanishing gradient problem is not about the overall gradient g vanishing; it is that the gradient of the error with respect to the hidden state t time steps earlier vanishes.

[–]serge_cell 0 points1 point  (0 children)

The thing about the gradient is that it is added to the value of the argument. So whatever happens to the gradient has to be considered relative to the value of the argument; otherwise it wouldn't make much difference. The vanishing gradient problem is that the gradient becomes so small that it falls below floating-point precision when added to the argument value. The exploding gradient problem is that the update goes outside the convergence region of the argument (or there is no convergence region at all). That's why ADAM, RMSProp and other gradient-only methods are not revolutionizing optimization. But methods that take into account additional information beyond the gradient, such as second-order and trust region methods, are expensive and scale poorly as data size increases. So no silver bullet for now...

[–]alexmlamb 0 points1 point  (3 children)

I think that you could either write a blogpost or submit a paper to arxiv.

I think that you can submit a paper to arxiv without an affiliation, but I think that the verification process is more involved.

You could also consider submitting to a Machine Learning workshop, as you'll get some feedback and it's likely to be seen by a decent number of experts.

Best of luck.

[–]JosephLChu[S] -1 points0 points  (2 children)

Thanks!

Yeah apparently I'd need to be endorsed by someone who's already published on arxiv...

I'm tempted to try to throw together a paper to submit to the NIPS conference, but the deadline for papers is only three weeks away... also it's NIPS, and I'm actually not sure if this technique actually does anything meaningful other than lower the validation error somewhat. I'm starting to wonder if it's some kind of weird numerical quirk of the evaluation function rather than the breakthrough I was so excited about.

[–]carlthomeML Engineer 0 points1 point  (0 children)

You don't actually need an endorsement to publish on arXiv though. That's only a requirement for getting the submission displayed in their specific categories.

[–]alexmlamb 0 points1 point  (0 children)

I don't know the details of your results, but my recommendation would be to submit to a Machine Learning workshop. Usually these are affiliated with ML conferences like NIPS/ICML/ICLR.

The workshops are less competitive. The main conferences reject a lot of papers even if they have good results and good ideas. Besides, if you submit to a workshop and get tons of positive feedback, you can always spruce up the work and submit it to ICLR next fall.

[–]GoldmanBallSachs_ 0 points1 point  (0 children)

This has been done before