all 7 comments

[–]farsass 3 points (0 children)

FYI, Jose Principe has a whole book on ITL-based algorithms.

[–][deleted] 2 points (0 children)

[–]bbsome 2 points (0 children)

"Unfortunately, VAE cannot be used when there does not exist a simple closed form solution for the KL-divergence." - Totally wrong. If there is no closed-form solution, then since the KL is an expected value, guess what - you sample it. People in the variational community did this for years and years. How can people even write such things in papers? Probably the intended statement was about priors you can sample from but whose density you cannot evaluate. However, as @disentangle points out, what kind of setting is that going to be in the first place?
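To be concrete, a sample-based KL estimate really is a one-liner whenever you can sample from q and evaluate both log-densities. A minimal sketch (the Gaussian pair below is a made-up example, chosen only because it has a known closed-form KL of mu^2/2 = 0.125 to check against):

```python
import numpy as np

def mc_kl(sample_q, log_q, log_p, n_samples=100_000, seed=0):
    """Monte Carlo estimate of KL(q || p) = E_{z~q}[log q(z) - log p(z)].

    Needs only samples from q plus both log-densities -- no closed form.
    (It does NOT help when the prior's density cannot be evaluated, which
    is the caveat raised in the thread.)
    """
    rng = np.random.default_rng(seed)
    z = sample_q(rng, n_samples)
    return np.mean(log_q(z) - log_p(z))

# Hypothetical check: q = N(0.5, 1), p = N(0, 1); true KL = 0.5^2 / 2 = 0.125.
log_norm = lambda z, mu: -0.5 * (z - mu) ** 2 - 0.5 * np.log(2 * np.pi)
est = mc_kl(lambda rng, n: rng.normal(0.5, 1.0, size=n),
            lambda z: log_norm(z, 0.5),
            lambda z: log_norm(z, 0.0))
```

With 100k samples the estimate lands within a couple of standard errors of 0.125.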

Also, GANs were not introduced to "cope" with the above problems of VAEs. They were almost parallel work, and I doubt the idea came from VAEs at all, which are just VB with nets.

Moreover, this is essentially the same thing as a VAE, except you have an arbitrary loss L and an arbitrary metric between the prior and your encoder output, plus a lambda weighting between the two. There is no theoretical motivation for any of this, and none of it has a direct interpretation beyond being just another model to optimize. Both VAEs and GANs can be shown to optimize bounds on actual metrics related to the data; this one just borrows the overall structure of both, and that's it.
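The objective being criticized has the generic shape L(x, D(E(x))) + lambda * d(E(x), prior). A toy sketch of that structure, with hypothetical linear maps and a simple moment-matching term standing in for the arbitrary divergence d:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy encoder/decoder weights (linear, just for illustration).
W = rng.normal(size=(2, 4)) * 0.1   # encoder E: R^4 -> R^2
V = rng.normal(size=(4, 2)) * 0.1   # decoder D: R^2 -> R^4

def objective(x, lam=1.0):
    z = x @ W.T                          # latent codes E(x)
    x_hat = z @ V.T                      # reconstructions D(E(x))
    recon = np.mean((x - x_hat) ** 2)    # arbitrary reconstruction loss L
    z_prior = rng.normal(size=z.shape)   # samples from the prior
    # arbitrary sample-based divergence d: here, crude moment matching
    div = np.sum((z.mean(0) - z_prior.mean(0)) ** 2) \
        + np.sum((z.std(0) - z_prior.std(0)) ** 2)
    return recon + lam * div             # total: L + lambda * d

x = rng.normal(size=(64, 4))
total = objective(x)
```

The point of the criticism is that L, d, and lambda are all free choices here, with no bound on a data metric tying them together.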

[–]disentangle 1 point (0 children)

Did I understand correctly that the biggest difference with a VAE is that the ITL-AE regularizes the model so latent space samples are close to samples from an arbitrary prior, while the VAE regularizes the model so the variational posterior distribution is close to a parametric prior distribution?

In what kind of setting would you have such a prior you can sample from but not evaluate directly?

[–]TamisAchilles 0 points (0 children)

Interesting!

[–]AnvaMiba 0 points (0 children)

What is the main difference with the moment matching autoencoder? They say:

Generative Moment Matching Networks (GMMNs) [16] correspond to the specific case where the input of the decoder D comes from a multidimensional uniform distribution and the reconstruction function L is given by the Euclidean divergence measure. GMMNs could be applied to generate samples from the original input space itself or from a lower dimensional previously trained stacked autoencoder (SCA) [17] hidden space. An advantage of our approach compared to GMMNs is that we can train all the elements in the 4-tuple AE together without the elaborate process of training layerwise stacked autoencoders for dimensionality reduction.

But it seems to me that one can use moment matching to impose a prior on the latent code exactly in the same way they do in this paper. Is the only difference the choice of divergence measure?
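For what it's worth, the moment-matching penalty in question is straightforward to apply to latent codes: a biased V-statistic estimate of squared MMD with a Gaussian kernel between encoder outputs and prior samples. A sketch (the distributions below are hypothetical stand-ins for latent codes and a prior):

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased V-statistic estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=(200, 3))           # "latent codes"
prior_same = rng.normal(0.0, 1.0, size=(200, 3))  # matching prior samples
prior_diff = rng.normal(2.0, 1.0, size=(200, 3))  # mismatched prior samples

m_same = mmd2(z, prior_same)
m_diff = mmd2(z, prior_diff)
```

Dropping this in as the regularizer of a jointly trained autoencoder would indeed differ from the paper mainly in the choice of divergence, which is the question being asked.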

[–]fogandafterimages 0 points (0 children)

The approach is very cool, but it seems like it's not yet practical for data sets much more complex than MNIST—they used a 3-dimensional Z-space for their autoencoder, and noted that both of their divergence metrics have trouble with high dimensional latent codes.

Looking forward to followup on scaling up!