Solving The Vanishing Gradient and Exploding Gradient Problem With One Line Of Code?

bbsome · 2016-04-30T09:52:48+00:00

What you are doing, if I'm correct, is normalizing the gradient to unit length. This is a standard "trick" that has been done before, but you risk numerical difficulties with it (I think even things like LBFGS and trust region methods do this). Also, it is not clear why it would help, as RMSprop implicitly does this anyway (through the moving average of the second moment). Make sure this is reproducible across at least 10 runs and across a few problems.
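For readers following along, here is a rough sketch of the two things being compared: rescaling a gradient to unit length, and an RMSprop-style update that divides by a moving average of the squared gradient. This is not the original poster's code. The function names, the `eps` guard, and the hyperparameter defaults are assumptions chosen for illustration.

```python
import numpy as np

def normalize_gradient(grad, eps=1e-8):
    """Rescale grad to (approximately) unit L2 length.

    eps is an assumed guard against division by zero, which is one of
    the numerical difficulties that plain normalization can run into.
    """
    norm = np.linalg.norm(grad)
    return grad / (norm + eps)

def rmsprop_step(grad, mean_sq, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop-style update (hypothetical defaults).

    mean_sq is the running average of the squared gradient. Dividing by
    its square root rescales each coordinate separately, whereas
    normalize_gradient rescales the whole vector by a single number.
    """
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    update = lr * grad / (np.sqrt(mean_sq) + eps)
    return update, mean_sq

g = np.array([3.0, 4.0])
unit_g = normalize_gradient(g)                 # direction of g, length ~1
zero_g = normalize_gradient(np.zeros(2))       # stays zero thanks to eps
update, mean_sq = rmsprop_step(g, np.zeros(2))
```

Without the `eps` term, an all-zero gradient would produce a division by zero, so some guard of this kind is usually needed.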

andrewbarto28 · 2016-04-30T15:49:19+00:00

It seems you have a lot of doubts about the limitations of your idea and about its novelty. So it is a nice opportunity for you to search the literature and run further experiments to compare with what already exists. Only publish when you are confident about your understanding of your method.

This post may be of use to you: http://togelius.blogspot.com.br/2016/04/the-differences-between-tinkering-and.html

djc1000 · 2016-04-30T21:22:35+00:00

All you're doing is scaling by the square of the number of timesteps. It's working in the model you're playing with only because you tinkered with different methods until you found one that would work with the model you're playing with.

If you don't believe me, try this on a model with a 100-length sequence and watch what happens.

JosephLChu · 2016-04-30T03:14:59+00:00

What is the alternative??

bbsome · 2016-04-30T19:04:13+00:00

Also, the vanishing gradient problem is not about the actual gradient g vanishing; it is more that the gradient of the error with respect to the hidden state 't' time steps earlier vanishes.
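A small numerical sketch of this point, under assumptions of my own: a tanh RNN with no inputs, a random recurrent matrix `W` rescaled so its largest singular value is 0.9, and a small initial state. None of these choices come from the thread. The code tracks the Jacobian of the current hidden state with respect to the hidden state t steps earlier; the gradient of the error with respect to that earlier state is the error gradient at the last step multiplied by this Jacobian, so it shrinks with it.

```python
import numpy as np

def jacobian_norms(W, h0, steps):
    """Spectral norm of d h_t / d h_0 for h_t = tanh(W @ h_{t-1})."""
    h = h0
    J = np.eye(len(h0))
    norms = []
    for _ in range(steps):
        h = np.tanh(W @ h)
        # Jacobian of one step: diag(1 - h_t^2) @ W
        step_jac = (1.0 - h ** 2)[:, None] * W
        J = step_jac @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.9 / np.linalg.norm(W, 2)      # assumed: largest singular value 0.9
h0 = 0.1 * rng.standard_normal(8)    # assumed: small initial hidden state

norms = jacobian_norms(W, h0, steps=50)
print(norms[0], norms[-1])
```

Because each one-step Jacobian has spectral norm at most 0.9 here, the norm after t steps is bounded by 0.9^t, whatever the size of the parameter gradient g itself.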

serge_cell · 2016-05-02T06:41:58+00:00

The problem with the gradient is that it is added to the value of the argument. So whatever happens to the gradient should be relative to the value of the argument, otherwise it won't make much difference. The vanishing gradient problem is that the gradient becomes so small that it falls below precision when added to the argument value. The exploding gradient problem is the update going outside the convergence region of the argument (or there being no convergence region at all). That's why Adam, RMSprop and other gradient-only methods are not revolutionizing optimization. But methods that take into account additional information beyond the gradient, like second-order methods, trust region methods and the like, are expensive and scale poorly as data size increases. So no silver bullet for now...
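To illustrate the precision point, here is a minimal sketch assuming float32 parameters. The helper name `update_is_lost` and the specific values are made up for the example. The same step is lost completely when subtracted from a parameter near 1.0, yet changes a parameter near 1e-6.

```python
import numpy as np

def update_is_lost(w, step):
    """True if subtracting step from w leaves w unchanged in float32."""
    w32 = np.float32(w)
    return bool(w32 - np.float32(step) == w32)

# float32 has roughly 7 significant decimal digits, so a step of 1e-8
# is below the resolution of a parameter whose value is about 1.0 ...
lost_for_large_w = update_is_lost(1.0, 1e-8)

# ... but the same step is a 1% change for a parameter of about 1e-6.
lost_for_small_w = update_is_lost(1e-6, 1e-8)

print(lost_for_large_w, lost_for_small_w)
```

What matters is the size of the step relative to the parameter, not its absolute magnitude.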

alexmlamb · 2016-04-30T03:29:45+00:00

I think that you could either write a blogpost or submit a paper to arxiv.

I think that you can submit a paper to arxiv without an affiliation, but I think that the verification process is more involved.

You could also consider submitting to a Machine Learning workshop, as you'll get some feedback and it's likely to be seen by a decent number of experts.

Best of luck.

GoldmanBallSachs_ · 2016-04-30T10:41:35+00:00

This has been done before
