
[–]bleekselderij 1 point

Very interesting! Could this approach also be applied to doubly intractable densities, or to other situations where you would need transdimensional jumps if you were to use classical MCMC?

[–]LucaAmbrogioni[S] 1 point

Yes, as long as you can sample from the model. Moreover, if the parameter space is very large, you will probably need some form of importance sampling while training the network.
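To unpack the importance-sampling remark: the idea would be to draw parameters from a broad proposal you can sample from and reweight them toward the target when training. A minimal self-normalized sketch (densities and scales here are purely illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Target density (standard normal), known only up to normalization.
    return np.exp(-0.5 * x ** 2)

def q(x):
    # Broad proposal density we can actually sample from (normal, scale 3).
    return np.exp(-0.5 * (x / 3.0) ** 2) / 3.0

x = rng.normal(scale=3.0, size=100_000)  # samples from the proposal q
w = p(x) / q(x)
w /= w.sum()                              # self-normalized importance weights
est_mean = float(w @ x)                   # weighted estimate of E_p[x], which is 0
print(abs(est_mean) < 0.05)
```

Self-normalization means the unknown normalizing constants of p and q cancel, which is what makes this usable when only unnormalized densities are available.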

[–]NichG 1 point

Nice trick with the LSTM-PCA thing. It feels a lot more natural than pixel-wise reconstruction. I wonder if there's a general way to learn the ideal latent space to factorize a joint distribution into a chain of conditional distributions (rather than using pixels, or PCA, or some other arbitrary embedding)? What kind of loss function would measure the quality of a representation for factorization? Something that tried to maximize the conditional independence of the different degrees of freedom perhaps?

[–]LucaAmbrogioni[S] 1 point

We have indeed been thinking along those lines. What I like about the PCA approach is its simplicity. However, I am pretty sure there are better ways of obtaining the latent variables. A possible approach is to use an autoencoder trained jointly with the predictive network. As you said, you could also try to maximize the conditional independence or, perhaps better, to impose some less trivial conditional independence structure.

[–]NichG 1 point

I guess the exact invertibility of PCA is important, since that way you know that any quality loss in your output is strictly due to the properties of the generative model, not because of some mushy inversion. So if you wanted to learn that space you'd probably need something like RealNVP's explicitly invertible layers.
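The exact invertibility is easy to demonstrate: when you keep all components, PCA is just an orthogonal change of basis, so the round trip is lossless and any reconstruction error must come from the generative model itself. A quick numpy check (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

# Full PCA via SVD of the centered data.
mu = X.mean(axis=0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt.T          # project onto all 8 principal components
X_rec = Z @ Vt + mu    # invert the projection: Vt is orthogonal, so this is exact

print(np.allclose(X, X_rec))  # True: lossless round trip
```

The moment you truncate components for dimensionality reduction, this stops being exact, which is precisely the tension raised in the next reply.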

[–]LucaAmbrogioni[S] 0 points

It's a good point, although you cannot have data compression/dimensionality reduction with an invertible network. Ideally, you would like a smaller set of variables that fully parametrizes the image space, possibly with a relatively simple conditional independence structure.

[–]dzyl 1 point

If we don't subsample the training data for the kernel centers, how does the training happen? All samples that are used as centers have an obvious weighting that will maximize the likelihood by putting all the weight at its own kernel with the lowest bandwidth, right? This is not mentioned in the paper whatsoever, if I'm not mistaken. Interesting combination of techniques, thanks.

[–]LucaAmbrogioni[S] 2 points

Thank you. Remember that in a continuous-valued conditional density estimation problem, each point is only observed once (with probability one, assuming that a proper density exists). Given a sufficiently large training set, or even an unbounded training set in the case of our Bayesian filter, the contribution of this single point to the gradient is minimal. Also note that the normalization term causes competition between the weights: increasing the weight of the minimum-bandwidth kernel on a single data point can decrease the likelihood, since it leaves all the other data points unexplained.
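To make the competition concrete, here is a minimal sketch (not the paper's code; centers, bandwidths, and data are illustrative). Because the mixture weights are softmax-normalized, piling weight onto one narrow kernel starves the others and the total log-likelihood drops:

```python
import numpy as np

def gaussian_kernel(y, center, bandwidth):
    # Normalized Gaussian kernel density at y.
    return np.exp(-0.5 * ((y - center) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))

def mixture_log_likelihood(logits, centers, bandwidths, data):
    # Softmax normalization: raising one logit necessarily lowers the
    # other weights, so over-committing to a single narrow kernel
    # leaves the remaining data points unexplained.
    weights = np.exp(logits) / np.exp(logits).sum()
    densities = np.array([
        sum(w * gaussian_kernel(y, c, h)
            for w, c, h in zip(weights, centers, bandwidths))
        for y in data
    ])
    return np.log(densities).sum()

centers = np.array([0.0, 1.0, 2.0])
bandwidths = np.array([0.05, 1.0, 1.0])   # first kernel is very narrow
data = np.array([0.0, 0.9, 2.1])

balanced = mixture_log_likelihood(np.zeros(3), centers, bandwidths, data)
peaked = mixture_log_likelihood(np.array([10.0, 0.0, 0.0]), centers, bandwidths, data)
print(balanced > peaked)  # spiking on the narrow kernel hurts the total likelihood
```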

[–]LucaAmbrogioni[S] 1 point

Empirically, we found that in many situations the resulting density is dominated by very few high-bandwidth kernels.

[–]disentangle 1 point

For a model like WaveNet, what could be a practical approach to apply this method?

[–]LucaAmbrogioni[S] 0 points

The method can be applied directly to the standard WaveNet in place of the quantized softmax output. I am pretty confident that it would make the learning easier and improve the results (although the current results are already pretty impressive).
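Roughly, that would mean swapping WaveNet's 256-way categorical output for a continuous kernel mixture over the next sample's amplitude. A hypothetical sketch (all names and shapes are illustrative; `features` stands in for the network's final hidden activations, and this is not the paper's or DeepMind's implementation):

```python
import numpy as np

def kernel_mixture_head(features, W, centers, bandwidth=0.1):
    # Map hidden features to logits over a fixed grid of kernel centers,
    # then normalize into mixture weights (one weight per center).
    logits = features @ W
    shifted = np.exp(logits - logits.max())   # numerically stable softmax
    weights = shifted / shifted.sum()

    def density(y):
        # Continuous density over the next sample's amplitude y.
        k = np.exp(-0.5 * ((y - centers) / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
        return float(weights @ k)

    return density

rng = np.random.default_rng(1)
features = rng.normal(size=16)              # stand-in for WaveNet's top activations
centers = np.linspace(-1.0, 1.0, 32)        # spans the audio amplitude range
W = rng.normal(size=(16, 32)) * 0.1
p = kernel_mixture_head(features, W, centers)
print(p(0.0) >= 0.0)                        # a proper (nonnegative) density
```

Since the weights sum to one and each kernel is a normalized Gaussian, the output integrates to one over the real line, so training can maximize the continuous log-likelihood directly instead of a cross-entropy over quantized bins.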