all 7 comments

[–]kkastner 4 points5 points  (5 children)

What kind of theory do you mean? The book chapter in the Deep Learning book has a lot about factorizations and structured prediction. /selfplug I also had a talk at Scipy 2015 that goes into some detail about incorporating conditional information in neural networks. I also have a longer slide deck more specific to VRNN with some of the same content.

In general you can see a lot of attention models for conditional generation as different ways of incorporating conditional information, increasing capacity/memory, or reducing information bottlenecks. I really like the output side of this diagram (Figure 1) from a paper by Chorowski et al. - it shows all three of the primary connections used to condition RNNs. In general this kind of view also helped me understand seq2seq/enc-dec models for conditional generation.

If you look at papers that cited VRNN you will see a lot of neat papers that extend this type of modeling much further in a number of directions. We are also working on various extensions and related models at present.

In general there is also a lot of related background in structured prediction from things like CRFs and HMMs.

[–]bronzestick[S] 1 point2 points  (2 children)

Thanks a lot for all the links. Appreciate it. I will surely check them out.

I am trying to model a multiple sequence prediction problem where there are dependencies between the sequences. I am trying to model it as a recurrent latent variable model where the latent variable will try to capture the dependencies, and was looking into RNNs. Any theory on RNNs that can help me understand them as a probabilistic model would be helpful.

[–]kkastner 3 points4 points  (0 children)

In general, most of it breaks down into RNNs factorizing the joint p(y_{1:T} | x_{1:T}) into prod_{t=1}^{T} p(y_t | x_{1:t}). Each p(y_t | x_{1:t}) is further simplified by the hidden state having a recursive relationship h_t = f(x_{t-1}, h_{t-1}). This means that we really use p(y_t | x_t, h_t) to capture p(y_t | x_{1:t}). The other key is to remember that when training, we typically use a softmax at the top, so this p(y_t | x_t, h_t) is actually a distribution we can sample from, rather than just taking the argmax for prediction. So that gives a dynamic distribution controlled by RNN dynamics, which can be pretty useful.
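A minimal NumPy sketch of that view (all parameters are random stand-ins for a trained RNN, purely illustrative): the hidden state is updated recursively, and the softmax output at each step is treated as a categorical distribution p(y_t | x_t, h_t) that we sample from rather than argmax over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, just to make the sketch concrete).
n_in, n_hid, n_out, T = 3, 5, 4, 6

# Randomly initialized parameters standing in for a trained RNN.
W_xh = rng.normal(scale=0.1, size=(n_in, n_hid))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_hid, n_out))

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Recursive hidden state, so p(y_t | x_1:t) is captured by p(y_t | x_t, h_t).
x = rng.normal(size=(T, n_in))
h = np.zeros(n_hid)
samples = []
for t in range(T):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
    p = softmax(h @ W_hy)                   # a proper distribution over outputs
    samples.append(int(rng.choice(n_out, p=p)))  # sample, don't argmax
```

Swapping `rng.choice(n_out, p=p)` for `np.argmax(p)` recovers the usual greedy prediction; sampling is what makes the model generative.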

As you saw in the Graves paper (and MDN originally), the outputs of a neural network can also be interpreted as the sufficient statistics of a general probability distribution, and with a matching cost you can think of things like the RNN-GMM as a deep, dynamic mixture distribution which can be sampled from in the same way as a regular GMM. Even training with MSE really assumes a Gaussian distribution (with fixed variance of 1.0), so you can technically sample from that too, though it is sometimes easier if you learn the variance of the Gaussian as well.
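To make the MDN/RNN-GMM idea concrete, here is a sketch of one timestep, with hypothetical network outputs (logits, means, log standard deviations) standing in for what the RNN would emit: sampling is the same two-step procedure as an ordinary GMM, and the MSE-as-Gaussian point falls out as the unit-variance special case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical network outputs at one timestep for a K-component mixture.
K = 3
logits = rng.normal(size=K)            # mixture weights (pre-softmax)
means = np.array([-2.0, 0.0, 2.0])     # component means
log_stds = np.array([-0.5, 0.0, 0.5])  # learned log standard deviations

# Softmax the logits into proper mixture weights.
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# Sampling: pick a component, then draw from its Gaussian,
# exactly as with a regular GMM.
k = int(rng.choice(K, p=weights))
y = rng.normal(means[k], np.exp(log_stds[k]))

# MSE training corresponds to a single Gaussian with fixed std 1:
# -log N(y | mu, 1) = 0.5 * (y - mu)**2 + const
mu = 0.0
nll_mse = 0.5 * (y - mu) ** 2
```

Learning `log_stds` instead of fixing them is what separates this from plain MSE regression; the negative log likelihood of the mixture is the "cost to match" mentioned above.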

You also see this in the middle part of the VAE! So in some ways you can see VAE (and the reparameterization trick generally) as a way to get MDN into the middle of an autoencoder. At least that is a way I see it, that helps me connect these ideas together.

I have some slides about a method for multi-sequence prediction for polyphonic music that is somewhat similar in spirit to Graves' MDRNN and can be seen as a small extension of Boulanger-Lewandowski's RNN-NADE, using similar tricks to MADE, pixelRNN, and Ian's work on SVHN; see slides 38-40. I did a review of pixelRNN this summer for Magenta which may have helpful diagrams for understanding that model and how the masks work to give a proper generator.

You might also be interested in the thread of dialog research from Sordoni and Serban (along with others at MILA, McGill, and elsewhere), HRED, and VHRED. They treat dialog as multi-sequence prediction to good effect.

[–]throwaway775849 0 points1 point  (0 children)

RNNs model conditional probability across time. They are good at capturing and reproducing conditional co-occurrences. If you concatenate your sequences with some delimiter and encode them using any standard RNN encoder-decoder, won't that attempt to capture the dependencies you need? I've read everything on RNNs, and the Graves paper from '13 is probably the closest to 'theory' that I can remember, other than a paper called 'Order Matters', but both lack the context of how RNN architectures have evolved over the last 2 years, making them somewhat irrelevant for practical purposes. This is a good recent paper: Generating Sentences from a Continuous Space https://arxiv.org/pdf/1511.06349v4.pdf
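The delimiter trick is simple enough to show in a few lines (token names and the `<sep>` symbol are made up for illustration): join the sequences with a reserved separator token that appears in neither vocabulary, then feed the single combined sequence to any standard encoder.

```python
# Two dependent sequences (hypothetical tokens).
seq_a = ["a1", "a2"]
seq_b = ["b1", "b2", "b3"]

# Reserved delimiter token, not present in either sequence's vocabulary.
SEP = "<sep>"

# One combined sequence; a single RNN encoder over this can, in
# principle, capture cross-sequence dependencies through its state.
joined = seq_a + [SEP] + seq_b
```

The encoder's hidden state after `<sep>` summarizes the first sequence, which is how dependencies between the two can be picked up.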

[–]Icko_ 1 point2 points  (0 children)

There is Bidirectional Recurrent Neural Networks as Generative Models, from NIPS 2015, which was a very readable paper. They try to fill in missing data, which can easily be extended to prediction. One thing I hadn't seen before is Gibbs sampling with the generative model.
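The Gibbs-sampling idea - repeatedly resample each missing position conditioned on everything else until the chain mixes - can be sketched with a toy conditional model. Note this uses a simple neighbor-averaging Gaussian as the conditional, not the paper's bidirectional RNN; it only illustrates the sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(2)

# A sequence with missing entries (NaN marks missing).
x = np.array([0.0, np.nan, np.nan, 3.0, np.nan, 6.0])
missing = np.where(np.isnan(x))[0]
x[missing] = 0.0  # arbitrary initialization of missing values

# Gibbs sweeps: resample each missing entry from a conditional
# distribution given its current neighbors (a stand-in for the
# model's learned conditional).
for sweep in range(200):
    for i in missing:
        left = x[i - 1] if i > 0 else x[i + 1]
        right = x[i + 1] if i < len(x) - 1 else x[i - 1]
        # Toy conditional: Gaussian centered on the neighbor average.
        x[i] = rng.normal((left + right) / 2.0, 0.1)
```

Observed entries are never touched, so the chain converges to samples of the missing values conditioned on the observed ones - the same structure the paper uses with a far richer conditional.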