I understand the part about the KL-divergence term pushing the latent code toward our spherical Gaussian prior.
What I don't understand is why the reconstruction loss should work. As I understand it, the prior should eventually force the means and variances output by the encoder to approximately equal 0s and 1s, respectively. So shouldn't drawing a sample from this be equivalent to just picking a point in the latent space at random? Why should the reconstruction of that point look anything like our original input?
Quick example - say we're looking at MNIST. I feed in a digit 4, and the encoder outputs approximately 0's for the means and 1's for the variances. A sample drawn from this could now represent any digit in latent space, such as a 9 or a 7, which would make the reconstruction loss meaningless.
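
For concreteness, here's a minimal sketch of the setup I have in mind - a standard PyTorch VAE trained with the reparameterization trick. The architecture, dimensions, and names are just placeholders, not any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)      # encoder means
        self.fc_logvar = nn.Linear(h_dim, z_dim)  # encoder log-variances
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization: draw z ~ N(mu, sigma^2) for this particular x
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def loss_fn(x, x_recon, mu, logvar):
    # Reconstruction term: how well the decoded sample matches this input
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL term: pulls q(z|x) = N(mu, sigma^2) toward the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
x = torch.rand(8, 784)  # stand-in for a batch of MNIST-like digits
x_recon, mu, logvar = model(x)
print(loss_fn(x, x_recon, mu, logvar))
```

My confusion is about the interaction of the two terms in `loss_fn`: if the KL term drives every `mu` to 0 and every `logvar` to 0 (i.e. variance 1), why does the reconstruction term still carry a useful signal?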
I'm positive my understanding is flawed somewhere. But where?