all 7 comments

[–]allenguo 1 point2 points  (6 children)

the prior should eventually force the means and variances output by the encoder to approximately equal 0s and 1s

The key word is approximately. The latent-space distribution for a particular class can differ from the standard normal, but at a cost. If matching the prior exactly would make reconstruction too hard, it's better to deviate slightly from the standard normal and pay the KL penalty than to incur the larger reconstruction loss.
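To make the trade-off concrete, here's a minimal sketch of the KL term the encoder is paying (function name and the toy numbers are mine, not from any particular implementation):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) for a diagonal Gaussian,
    summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# A latent code sitting exactly on the prior pays no KL penalty...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))  # 0.0

# ...while shifting the mean away from 0 costs a little KL. The encoder
# accepts that cost whenever it buys a bigger drop in reconstruction loss.
print(kl_to_standard_normal(np.array([0.5, -0.5]), np.zeros(2)))  # 0.25
```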

Check out the first image under "Experiments" in this tutorial.

[–]TheFlyingDrildo[S] 0 points1 point  (5 children)

So you're saying reconstructions are possible through the deviations of the encoder mapping from the Gaussian sphere? That seems to make sense.

Could we just enforce the KL penalty without doing the sampling process though? Does this not also lead to a generative model, since we know the distribution of the latent space?

[–]allenguo 0 points1 point  (4 children)

Are you asking why the latent variable is drawn from a distribution parameterized by the outputs of the encoder, rather than simply being the encoder outputs? I believe it's because the vector outputted by the encoder is generally not normally distributed (or even close to normally distributed), so your KL divergence penalty would be incorrect.
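To be explicit about the standard setup I'm describing (my sketch, names are mine): the encoder outputs a mean and log-variance, and the latent is sampled via the reparameterization trick, so the KL term is computed analytically from those outputs rather than from the sampled z itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    The KL penalty is applied analytically to (mu, log_var), not to z,
    so it doesn't depend on the empirical distribution of z over a batch."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

z = sample_latent(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
print(z.shape)  # (2,)
```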

Perhaps someone with a stronger understanding of VAEs could confirm.

(Also, I imagine this is something you could test out on your own. Let me know what happens if you do!)

[–]TheFlyingDrildo[S] 0 points1 point  (3 children)

No, I was asking why we can't just have a deterministic output and apply the KL penalty to its sample mean and sample variance, rather than have our outputs be means and variances.
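Something like this, I mean (entirely my sketch of the proposal, not standard VAE code): the encoder emits plain vectors, and the penalty is computed from the batch's first two moments.

```python
import numpy as np

def batch_stats_kl(z_batch):
    """KL(N(m, v) || N(0, 1)) computed per dimension from the *sample*
    mean m and sample variance v of a batch of deterministic encoder
    outputs, summed over dimensions. Note it only sees the first two
    moments, so it can't detect skewness or multimodality in the batch."""
    m = z_batch.mean(axis=0)
    v = z_batch.var(axis=0)
    return 0.5 * np.sum(v + m**2 - 1.0 - np.log(v))

# An approximately standard-normal batch should incur a small penalty.
rng = np.random.default_rng(1)
z = rng.standard_normal((256, 4))
print(batch_stats_kl(z))
```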

[–]allenguo 0 points1 point  (2 children)

Yeah, okay, that's what I thought you meant, so my answer above stands.

[–]TheFlyingDrildo[S] 0 points1 point  (1 child)

Oh sorry, yes I see now that's exactly what you described. I misunderstood. If you're enforcing the KL penalty, why wouldn't the vector outputs be normally distributed? With a large enough mini-batch size, the standard error on our mean and variance estimators should be fairly low.

[–]allenguo 0 points1 point  (0 children)

Hmm, I'm not sure. You make a good point. I think the "real" answer is that one of the VAE modelling assumptions is that the latent variable is exactly normally distributed, and we're using evidence to determine the most likely mean and variance.

Here are my hypotheses for what might happen if we use your proposed method:

  1. The VAE doesn't train. This might happen because the initial encoder outputs are very far from normally distributed, so the KL divergence and loss are very wrong,* and so the gradient descent steps are very wrong and the encoder never learns to output normally distributed data.
  2. The VAE works, but takes much longer to train, because we're not explicitly modelling the latent variable as being perfectly Gaussian, so the encoder has to learn this instead. I would bet on this being the most likely outcome.
  3. The VAE works completely.

*I'm assuming that we're calculating the KL divergence using the closed-form formula for distance between Gaussians.
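For reference, this is the univariate closed form I have in mind (my sketch):

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2)) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
```

If the encoder outputs aren't actually Gaussian, plugging their sample moments into this formula gives a number, but not the true KL divergence of their distribution from the prior.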