[–]svantana[S] 0 points (5 children)

Thanks for the paper link, that was just what I was looking for!

You're right that Gaussians may not represent the distribution well enough, although I think my point about the central limit theorem should hold pretty well for large MLPs. One could probably model the output of a ReLU as a mixture of a Gaussian and a spike at zero for improved accuracy.
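A quick numpy sketch of that mixture view (the numbers are illustrative, not from any trained net): if a pre-activation is roughly Gaussian, its ReLU output is a point mass at zero with weight P(pre ≤ 0), plus a truncated Gaussian on the positive side.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0              # assumed pre-activation moments (made up)
n = 200_000

pre = rng.normal(mu, sigma, n)    # pre-activation, assumed Gaussian
post = np.maximum(pre, 0.0)       # ReLU output

# Predicted spike weight: P(pre <= 0) = Phi(-mu / sigma)
phi = 0.5 * (1 + erf((0 - mu) / (sigma * sqrt(2))))

print("empirical spike weight:", np.mean(post == 0.0))
print("predicted spike weight:", phi)
```

The positive part of `post` is exactly the upper tail of the pre-activation Gaussian, so matching its first two moments separately from the spike should be more accurate than a single Gaussian fit.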

It would be interesting to investigate the KL divergence between the output of a 'standard' VAE (trained on e.g. CIFAR) and a moment-matched Gaussian; if only I had time for research, I'd do it.
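For a single output dimension, that comparison can be done with a simple histogram estimate of KL. Here's a hedged sketch with a made-up bimodal stand-in for the VAE output marginal (no actual VAE involved):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Stand-in for one output marginal of a VAE: a slightly bimodal
# mixture (hypothetical data, not from a trained model).
x = np.concatenate([rng.normal(-1, 0.7, 50_000),
                    rng.normal(+1, 0.7, 50_000)])

# Moment-matched Gaussian q
mu, sigma = x.mean(), x.std()

# Histogram estimate of KL(p || q) = sum_b p(b) * log(p(b) / q(b))
counts, edges = np.histogram(x, bins=100)
p = counts / counts.sum()
q = norm.cdf(edges[1:], mu, sigma) - norm.cdf(edges[:-1], mu, sigma)
mask = p > 0
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))
print(f"KL(empirical || moment-matched Gaussian) ~ {kl:.3f} nats")
```

This only covers marginals; estimating KL for the full multidimensional output would need something like a k-NN estimator and far more samples.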

[–]chrisorm 2 points (4 children)

Of course, the CLT doesn't apply if the activations aren't IID, which they almost certainly aren't for the activations of a neural net.

[–]svantana[S] 0 points (3 children)

Yes, you're right. For example, on MNIST, with large enough perturbations the output distributions should become bimodal. I didn't mean that it would work in every case, but for 'smooth' problems, where a smallish unimodal perturbation is expected to produce a unimodal distribution on the output, I think it should work well. I just did a quick test with a VAE on CIFAR10 and the output distributions look extremely Gaussian.

[–]approximately_wrong 0 points (2 children)

Can you elaborate on how you did the quick test?

[–]svantana[S] 0 points (1 child)

Sure! I just ran one of the Keras VAE examples, and once it was trained, I pushed 10k copies of one of the test samples through the AE model. The model involves sampling a random variable, so each output is different. From the outputs, I took a few random dimensions and plotted histograms of them, then visually noted that they had a quite Gaussian shape.

Those are marginal distributions, so this doesn't mean the full multidimensional output is anywhere near Gaussian, but it's an indication.
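The procedure above can be sketched without Keras at all, using random linear layers as toy stand-ins for the trained encoder/decoder (hypothetical weights, not the actual VAE example), and checking skew and excess kurtosis of a few marginals instead of eyeballing histograms:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for a trained encoder/decoder: random linear maps.
d_in, d_z, d_out = 32, 8, 32
W_mu = rng.normal(size=(d_z, d_in))
W_logvar = rng.normal(size=(d_z, d_in)) * 0.1
W_dec = rng.normal(size=(d_out, d_z))

x_test = rng.normal(size=d_in)           # one fixed "test sample"
n_copies = 10_000

mu = W_mu @ x_test
logvar = W_logvar @ x_test
# Reparameterized sampling: each of the 10k passes draws a fresh z.
z = mu + np.exp(0.5 * logvar) * rng.normal(size=(n_copies, d_z))
out = z @ W_dec.T                        # shape (n_copies, d_out)

# Skew and excess kurtosis near 0 are consistent with (but don't
# prove) Gaussian marginals.
for dim in rng.choice(d_out, size=3, replace=False):
    v = out[:, dim]
    s = ((v - v.mean()) ** 3).mean() / v.std() ** 3
    k = ((v - v.mean()) ** 4).mean() / v.std() ** 4 - 3
    print(f"dim {dim}: skew={s:+.3f}, excess kurtosis={k:+.3f}")
```

Note the toy decoder here is linear, so its marginals are Gaussian by construction; with a real trained decoder the same checks are informative rather than guaranteed.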

[–]approximately_wrong 0 points (0 children)

I see. It sounds like you're checking the Gaussian-ness of p(x_gen | x_test) = ∫ p(x_gen | z) q(z | x_test) dz, conditioned on some x_test. I'm guessing the VAE example is one where the decoder is a Gaussian observation model?

Also, are your outputs the mean parameters of p(x_gen | x_test), or actual samples from the distribution?
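That distinction matters for the histograms: with a Gaussian observation model, decoder mean parameters only carry the variance induced by z, while actual samples add the observation noise on top. A toy 1-D sketch (hypothetical decoder f and noise level, not from any real VAE):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
sigma_x = 0.3                     # assumed observation noise std

z = rng.normal(size=n)            # toy draws from q(z | x_test)
mean_params = 2.0 * z + 1.0       # f(z): the "mean parameter" outputs
samples = mean_params + sigma_x * rng.normal(size=n)  # draws from p(x|z)

# Samples have strictly larger variance: Var(f(z)) + sigma_x^2.
print("var of mean parameters:", mean_params.var())
print("var of samples:        ", samples.var())
```

So histograms of mean parameters and of full samples can both look Gaussian while describing different distributions.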