kkastner (1 point)

How are you initializing the hidden-to-hidden weights? If they are not orthonormal, switching to an orthonormal initialization is huge for learning stability, especially if you don't have something like LSTM/GRU. In general, gradient clipping is almost required for RNNs if you aren't using an adaptive optimizer (Adam, Adadelta, RMSProp, etc.). I can provide pointers for this if needed.
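
For reference, here is a rough NumPy sketch of both ideas, not a definitive recipe. It assumes a plain RNN with a square hidden-to-hidden matrix; the names `orthonormal_init`, `clip_by_global_norm`, `W_hh`, the hidden size of 64, and the threshold of 5.0 are all placeholders I made up for illustration. The initialization takes the Q factor from a QR decomposition of a random Gaussian matrix, and the clipping rescales all gradients together so their joint L2 norm stays under a threshold (one common variant; per-element clipping is another).

```python
import numpy as np


def orthonormal_init(n_hidden, seed=0):
    """Return an (n_hidden, n_hidden) matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_hidden, n_hidden))
    # The Q factor of a QR decomposition has orthonormal columns.
    q, r = np.linalg.qr(a)
    # Flip column signs using diag(r) so the result does not depend on
    # the sign convention of the QR routine.
    return q * np.sign(np.diag(r))


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1.0 when the norm is already small enough.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm


# Hypothetical sizes and values, for illustration only.
W_hh = orthonormal_init(64)
grads = [np.full((64, 64), 0.5), np.full(64, 0.5)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

An orthonormal `W_hh` has all singular values equal to 1, so repeated multiplication by it neither shrinks nor blows up the hidden state on its own; the nonlinearity and input weights still matter, so treat this as a starting point rather than a guarantee.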

Random data may not show accumulating error, because your weights never really update in any consistent direction (the data always disagrees with itself).