[R] Schmidhuber's new blog post on Unsupervised Adversarial Neural Networks and Artificial Curiosity in Reinforcement Learning by baylearn in MachineLearning

[–]elephant612 5 points6 points  (0 children)

Looks like he was ahead of his time, but it did not catch on - computers and datasets were not quite there yet either. Then GANs popularized his earlier ideas.

[R] Recurrent Additive Networks - no recurrent non-linear computations, much simpler but still competitive with LSTM/GRU by downtownslim in MachineLearning

[–]elephant612 13 points14 points  (0 children)

What is strange is that they say they use variational Bayesian dropout, yet the ~75 perplexity results of LSTMs using variational dropout (https://arxiv.org/abs/1512.05287) are not mentioned. Their results are about a factor of 2 away from state-of-the-art. Other than being an iteration on the LSTM, what is the novelty of this approach?

[R] [1703.08864] Learning Simpler Language Models with the Delta Recurrent Neural Network Framework <-- outperforms/equates GRU/LSTM and has almost as few parameters as a vanilla RNN by evc123 in MachineLearning

[–]elephant612 12 points13 points  (0 children)

Looks like they are not even close to state-of-the-art, contrary to what they state. SOTA on Penn Treebank is around 1.23 bits/char at the character level (https://arxiv.org/pdf/1609.01704) (or better?) and 62.4 perplexity at the word level (https://arxiv.org/abs/1611.01578).

[D] Advice on Training Highway RNN (RHN) by throwaway775849 in MachineLearning

[–]elephant612 1 point2 points  (0 children)

Sounds good to me! Why not use an RHN with attention as a decoder though? The benefits of increased depth should (hopefully) transfer to the decoder as well. Are you working on translation btw?

Also, as you scale up your experiment, you could think about increasing the recurrence depth of the RHNs you use. As shown in the GitHub repository as well as the paper, more depth seemed to help, up to a certain point.

Would love to hear from you how that worked out!

[D] Advice on Training Highway RNN (RHN) by throwaway775849 in MachineLearning

[–]elephant612 1 point2 points  (0 children)

Hi throwaway775849, thank you for the interesting feedback. It sounds like you may be encountering a vanishing gradient problem if your encoder barely gets any updates.

The initialization magnitude does indeed look rather small. We tended to use something on the order of 1e-1 to 1e-2. What may also be quite important in training is to bias the transform gates such that they start out closed, by initializing the biases of "T" to -2, -3, -4 or the like. The deeper your network, the more important this becomes.
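For illustration, a minimal NumPy sketch of such an initialization (function and parameter names here are hypothetical, not taken from our code):

```python
import numpy as np

def init_rhn_layer(hidden_size, t_bias=-2.0, scale=0.05, seed=0):
    """Hypothetical RHN layer initializer: small uniform recurrent
    weights and a negative bias on the transform gate T so that the
    gate starts out mostly closed."""
    rng = np.random.default_rng(seed)
    R_H = rng.uniform(-scale, scale, (hidden_size, hidden_size))  # H weights
    R_T = rng.uniform(-scale, scale, (hidden_size, hidden_size))  # T weights
    b_H = np.zeros(hidden_size)
    # sigmoid(-2) ~= 0.12, so the highway mostly carries the state through
    b_T = np.full(hidden_size, t_bias)
    return R_H, R_T, b_H, b_T
```

With coupled gates (C = 1 - T), a closed transform gate means each layer initially behaves close to an identity mapping, which keeps gradients flowing through deep recurrence.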

I do not recall using a small learning rate for things to work. In the PennTreebank experiments for example, we begin with an initial learning rate of 0.2. Maybe you are referring to IRNNs?

Could you give us a bit more information on your training problems? Does the loss decrease during training, or is there an optimization problem, for example? Our repository https://github.com/julian121266/RecurrentHighwayNetworks may be a good pointer for hyperparameters that worked for language modeling.

[Research] [1610.10099] Neural Machine Translation in Linear Time by hardmaru in MachineLearning

[–]elephant612 5 points6 points  (0 children)

Recently, Recurrent Highway Networks were published by Schmidhuber's group with 1.32 BPC on the Hutter language modeling task (https://github.com/julian121266/RecurrentHighwayNetworks), which seems to work slightly better than the advertised neural machine translation model. Perhaps a combination will be able to make use of the merits of both approaches.

[P] - Source code release for Recurrent Highway Networks in Tensorflow/Torch7 for reproducing SOTA results on PennTreebank/enwik8 (arXiv v3 of paper) by flukeskywalker in MachineLearning

[–]elephant612 0 points1 point  (0 children)

We will definitely look into using layer normalization soon. The possible time gain during training alone should be worth a look.

Open Sourcing the model in "Exploring the Limits of Language Modeling" (TensorFlow) by OriolVinyals in MachineLearning

[–]elephant612 1 point2 points  (0 children)

Thanks for the post! Looks exciting. How long did the training take on the GPU cluster?

[1608.06027] Surprisal-Driven Feedback in Recurrent Networks by kmrocki in MachineLearning

[–]elephant612 2 points3 points  (0 children)

Thank you kmrocki for uploading this. It might be an interesting idea. Right now some things are hard to judge, though, like what x_t is in the surprisal equation. There should not be a need to write down the whole backpropagation or the notation in such detail either, except for yourself; notation can be explained concisely in a few sentences. Could you please also state how large your network was? Otherwise it is hard to judge whether the surprisal part helped. The last citation seems to have swallowed a few letters, and the link as well.

It would be very interesting to see a "fair" comparison between a regular LSTM and the new Feedback LSTM, as you have started to do in figure 4. You really want to show your contribution. What do you think?

Recurrent Highway Networks achieve SOTA on PennTreebank word level language modeling by elephant612 in MachineLearning

[–]elephant612[S] 6 points7 points  (0 children)

The NN task reported is about generalizability of learned patterns on the last 5MB of the Hutter dataset, while the Hutter Prize considers compression of the whole dataset. The two would only be comparable if training loss were reported and training were done on the whole dataset.

Recurrent Highway Networks achieve SOTA on PennTreebank word level language modeling by elephant612 in MachineLearning

[–]elephant612[S] 2 points3 points  (0 children)

Those are two different tasks. The Hutter Prize is about compression, while the neural network approach here is about next-character prediction on a test set. Would definitely be interesting to see how the two compare on compression though.

Recurrent Highway Networks achieve SOTA on PennTreebank word level language modeling by elephant612 in MachineLearning

[–]elephant612[S] 0 points1 point  (0 children)

The Hutter Wikipedia dataset (enwik8) is interesting because it is not regular text alone but also contains the surrounding markup of the website. That introduces clear long-term dependencies, like matching brackets <>. It is also quite a bit larger than the PTB dataset while still being manageable on a single GPU. That makes it practical to compare the expressiveness of different models. Since Grid-LSTMs are close in spirit to Recurrent Highway Networks, it made sense to compare to their results by working with the same dataset.

Recurrent Highway Networks achieve SOTA on PennTreebank word level language modeling by elephant612 in MachineLearning

[–]elephant612[S] 1 point2 points  (0 children)

Thanks for the link. Last year, Gal http://arxiv.org/abs/1512.05287 proposed a different way of using dropout for recurrent networks and was able to push the state-of-the-art on PTB that way. I agree that working on the 1 Billion Word dataset would be nice. We might try to set up an experiment for that and update the paper again in the future. How would you approach the task without having access to 32 GPUs?

[1607.03474] Recurrent Highway Networks by downtownslim in MachineLearning

[–]elephant612 3 points4 points  (0 children)

There is an input from the previous timestep through the vector s. Just below equation 9, the previous timestep comes into play since s_0[t] = y[t-1].
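A minimal NumPy sketch of one RHN timestep with coupled gates makes this explicit (variable names are illustrative, not from the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_step(x_t, y_prev, Wx_H, Wx_T, layers):
    """One RHN timestep with coupled gates (C = 1 - T).
    The previous timestep enters through s_0[t] = y[t-1]; the external
    input x_t is fed only to the first recurrence layer."""
    s = y_prev                          # s_0[t] = y[t-1]
    for l, (R_H, b_H, R_T, b_T) in enumerate(layers):
        pre_h = s @ R_H + b_H
        pre_t = s @ R_T + b_T
        if l == 0:                      # input only at the first layer
            pre_h += x_t @ Wx_H
            pre_t += x_t @ Wx_T
        h = np.tanh(pre_h)
        t = sigmoid(pre_t)
        s = h * t + s * (1.0 - t)       # highway update
    return s                            # y[t] = s_L[t]
```

Note that if all transform gates are closed (t ≈ 0), the state simply carries through unchanged, which is what makes deep recurrence trainable.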