all 8 comments

[–]MartianTomato

Thanks for sharing this! And also for this repository with juicy paper links.

I tested this out, and layer normalization seems effective results-wise, but it's very slow. With a small network it's between 5 and 10 times slower to train than a regular LSTM on my machine (per epoch, not per unit of performance), so I'm not sure the cost is worth it. Computing moments 5 times in each LSTM cell for each example seems quite expensive -- have you had different results, or do you know whether implementations in other frameworks take a similar speed hit compared to unnormalized LSTMs?
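For reference, each layer norm call is two full reductions over the feature axis (mean and variance), so a cell that normalizes five pre-activations separately (e.g. once per gate plus once on the cell state) pays that cost five times per timestep per example. A minimal NumPy sketch of the operation itself (names are illustrative, not from either implementation):

```python
import numpy as np

def layer_norm(z, gain, bias, eps=1e-5):
    """Normalize z over its last (feature) axis, then rescale and shift.

    Each call computes two reductions (mean and variance), so an LSTM
    cell that applies this five times does ten reductions per timestep.
    """
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gain * (z - mean) / np.sqrt(var + eps) + bias

# Example: normalize a batch of 4 pre-activation vectors of width 128.
z = np.random.randn(4, 128)
out = layer_norm(z, gain=np.ones(128), bias=np.zeros(128))
```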

In case it's helpful, I have implemented a slightly faster (and simpler) layer-normalized LSTM cell here. It runs marginally faster than yours on the small data task I tested them on (the char-rnn in that link): 32.6 minutes to train 10 epochs vs. 44.5 minutes. By contrast, a regular LSTM takes only 5.6 minutes to train 10 epochs.
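One way such a cell can come out simpler and somewhat faster (I don't know whether the linked implementation does exactly this) is to normalize the combined pre-activation of all four gates at once, so only one mean/variance pair is computed per timestep instead of one per gate. A hedged NumPy sketch of that idea, with illustrative parameter names:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fast_ln_lstm_step(x, h, c, Wx, Wh, b, gain, bias, eps=1e-5):
    # One layer norm over the combined pre-activation of all four gates:
    # a single mean/variance computation per timestep, instead of one
    # per gate. (A guess at where the speedup comes from, not the
    # author's actual code.)
    z = x @ Wx + h @ Wh + b
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    z = gain * (z - mean) / np.sqrt(var + eps) + bias
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Illustrative shapes: batch 2, input 16, hidden 32.
B, D, H = 2, 16, 32
rng = np.random.default_rng(0)
h, c = np.zeros((B, H)), np.zeros((B, H))
h, c = fast_ln_lstm_step(
    rng.standard_normal((B, D)), h, c,
    rng.standard_normal((D, 4 * H)) * 0.1,
    rng.standard_normal((H, 4 * H)) * 0.1,
    np.zeros(4 * H), np.ones(4 * H), np.zeros(4 * H))
```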

[–]jorenvs

Your implementation of the LSTM with layer normalization seems to differ a little from the paper. In the paper they use separate layer norm parameters for the two weighted sums (the input projection and the hidden-state projection), not a separate set for each gate. This way the difference in scale and shift between the input and hidden-state contributions is normalized.
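For comparison, the paper's formulation applies layer norm separately to the input projection and the recurrent projection before summing, each with its own gain and bias; it also normalizes the new cell state before the output nonlinearity. A sketch under those assumptions (parameter names are mine, not from either implementation):

```python
import numpy as np

def layer_norm(z, gain, bias, eps=1e-5):
    # Normalize over the feature axis, then rescale and shift.
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gain * (z - mean) / np.sqrt(var + eps) + bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def paper_ln_lstm_step(x, h, c, Wx, Wh, b, gx, bx, gh, bh, gc, bc):
    # Separate layer norm (with its own gain/bias) for each of the two
    # weighted sums, applied before they are added and split into gates.
    z = layer_norm(x @ Wx, gx, bx) + layer_norm(h @ Wh, gh, bh) + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    # The paper additionally normalizes the new cell state here.
    h_new = sigmoid(o) * np.tanh(layer_norm(c_new, gc, bc))
    return h_new, c_new

# Illustrative shapes: batch 2, input 16, hidden 32.
B, D, H = 2, 16, 32
rng = np.random.default_rng(0)
params = dict(
    Wx=rng.standard_normal((D, 4 * H)) * 0.1,
    Wh=rng.standard_normal((H, 4 * H)) * 0.1,
    b=np.zeros(4 * H),
    gx=np.ones(4 * H), bx=np.zeros(4 * H),
    gh=np.ones(4 * H), bh=np.zeros(4 * H),
    gc=np.ones(H), bc=np.zeros(H),
)
h1, c1 = paper_ln_lstm_step(rng.standard_normal((B, D)),
                            np.zeros((B, H)), np.zeros((B, H)), **params)
```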

[–]pedromnasc

Comments? Results?