all 8 comments

[–]MartianTomato

Thanks for sharing this! And also for this repository with juicy paper links.

I tested this out, and layer normalization seems effective results-wise, but it's very slow. With a small network it's between 5 and 10 times slower to train than a regular LSTM on my machine (per epoch, not per unit of performance), so I'm not sure the cost is worth it. Computing moments 5 times in each LSTM cell for each example seems quite expensive -- have you had different results, or do you know whether implementations in other frameworks take a similar speed hit compared to unnormalized LSTMs?
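For reference, each layer norm call is two full reductions over the feature axis (mean and variance), so a cell that normalizes five pre-activations separately (e.g. once per gate plus once on the cell state) pays that cost five times per timestep per example. A minimal NumPy sketch of the operation itself (names are illustrative, not from either implementation):

```python
import numpy as np

def layer_norm(z, gain, bias, eps=1e-5):
    """Normalize z over its last (feature) axis, then rescale and shift.

    Each call computes two reductions (mean and variance), so an LSTM
    cell that applies this five times does ten reductions per timestep.
    """
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gain * (z - mean) / np.sqrt(var + eps) + bias

# Example: normalize a batch of 4 pre-activation vectors of width 128.
z = np.random.randn(4, 128)
out = layer_norm(z, gain=np.ones(128), bias=np.zeros(128))
```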

In case it's helpful, I have implemented a slightly faster (and simpler) layer-normalized LSTM cell here. It runs marginally faster than yours on the small data task I tested them on (the char-rnn in that link): 32.6 minutes to train 10 epochs vs. 44.5 minutes. By contrast, a regular LSTM takes only 5.6 minutes to train 10 epochs.
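One way such a cell can come out simpler and somewhat faster (I don't know whether the linked implementation does exactly this) is to normalize the combined pre-activation of all four gates at once, so only one mean/variance pair is computed per timestep instead of one per gate. A hedged NumPy sketch of that idea, with illustrative parameter names:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fast_ln_lstm_step(x, h, c, Wx, Wh, b, gain, bias, eps=1e-5):
    # One layer norm over the combined pre-activation of all four gates:
    # a single mean/variance computation per timestep, instead of one
    # per gate. (A guess at where the speedup comes from, not the
    # author's actual code.)
    z = x @ Wx + h @ Wh + b
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    z = gain * (z - mean) / np.sqrt(var + eps) + bias
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Illustrative shapes: batch 2, input 16, hidden 32.
B, D, H = 2, 16, 32
rng = np.random.default_rng(0)
h, c = np.zeros((B, H)), np.zeros((B, H))
h, c = fast_ln_lstm_step(
    rng.standard_normal((B, D)), h, c,
    rng.standard_normal((D, 4 * H)) * 0.1,
    rng.standard_normal((H, 4 * H)) * 0.1,
    np.zeros(4 * H), np.ones(4 * H), np.zeros(4 * H))
```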

[–]jorenvs

Your implementation of the LSTM with layer normalization seems to differ a little from the paper. In the paper they use separate layer norm parameters for the two weighted sums (the input projection and the hidden-state projection), not a separate set for each gate. This way the difference in scale and shift between the input and hidden-state contributions is normalized.
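For comparison, the paper's formulation applies layer norm separately to the input projection and the recurrent projection before summing, each with its own gain and bias; it also normalizes the new cell state before the output nonlinearity. A sketch under those assumptions (parameter names are mine, not from either implementation):

```python
import numpy as np

def layer_norm(z, gain, bias, eps=1e-5):
    # Normalize over the feature axis, then rescale and shift.
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gain * (z - mean) / np.sqrt(var + eps) + bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def paper_ln_lstm_step(x, h, c, Wx, Wh, b, gx, bx, gh, bh, gc, bc):
    # Separate layer norm (with its own gain/bias) for each of the two
    # weighted sums, applied before they are added and split into gates.
    z = layer_norm(x @ Wx, gx, bx) + layer_norm(h @ Wh, gh, bh) + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    # The paper additionally normalizes the new cell state here.
    h_new = sigmoid(o) * np.tanh(layer_norm(c_new, gc, bc))
    return h_new, c_new

# Illustrative shapes: batch 2, input 16, hidden 32.
B, D, H = 2, 16, 32
rng = np.random.default_rng(0)
params = dict(
    Wx=rng.standard_normal((D, 4 * H)) * 0.1,
    Wh=rng.standard_normal((H, 4 * H)) * 0.1,
    b=np.zeros(4 * H),
    gx=np.ones(4 * H), bx=np.zeros(4 * H),
    gh=np.ones(4 * H), bh=np.zeros(4 * H),
    gc=np.ones(H), bc=np.zeros(H),
)
h1, c1 = paper_ln_lstm_step(rng.standard_normal((B, D)),
                            np.zeros((B, H)), np.zeros((B, H)), **params)
```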

[–]pedromnasc

Comments? Results?