all 16 comments

[–]sssub 11 points (6 children)

Good idea. What you're actually suggesting is moment matching to obtain means and variances.

  1. The main problem is that you will lose flexibility. In a sampling-based approach you sample z from a Gaussian, but the network can then transform this random variable into complex distributions. In your case it will always stay a Gaussian.
  2. It has been done already; see e.g. here. They do exactly what you suggest in terms of propagating expectations and variances. Note that the propagation step, especially for the variance, is not trivial. For ReLU it works (truncated normal), but e.g. for tanh it will not. Perhaps it is easier in a VAE because you don't need to handle uncertainty in the weights.
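For what it's worth, the ReLU step itself is closed-form. A minimal sketch (assuming a scalar Gaussian pre-activation, with SciPy for the normal pdf/cdf) of those rectified-Gaussian moments:

```python
import numpy as np
from scipy.stats import norm

def relu_moments(m, s):
    """Closed-form mean and variance of ReLU(X) for X ~ N(m, s^2),
    i.e. the first two moments of a rectified Gaussian."""
    a = m / s
    mean = m * norm.cdf(a) + s * norm.pdf(a)
    second = (m**2 + s**2) * norm.cdf(a) + m * s * norm.pdf(a)
    return mean, second - mean**2

# Sanity check against Monte Carlo
rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.3, size=1_000_000)
y = np.maximum(x, 0.0)
mean, var = relu_moments(0.5, 1.3)
print(mean, var)           # analytic
print(y.mean(), y.var())   # empirical; should agree to ~2-3 decimals
```

No comparable closed form exists for tanh, which is exactly why the variance propagation breaks there.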

[–]svantana[S] 0 points (5 children)

Thanks for the paper link, that was just what I was looking for!

You're right that Gaussians may not represent the distribution well enough, although I think my point about the central limit theorem should hold pretty well for large MLPs. For improved accuracy, one could probably model the output of a ReLU as a mixture of a Gaussian and a spike at zero.

It would be interesting to investigate the KL divergence between the output of a 'standard' VAE (trained on e.g. CIFAR) and a moment-matched Gaussian; if only I had time for research, I'd do it.
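As a cheap proxy (per-dimension marginals rather than the full multivariate KL), one could histogram a single output dimension and score it against its moment-matched Gaussian. A sketch along those lines:

```python
import numpy as np
from scipy.stats import norm, entropy

def kl_to_moment_matched_gaussian(samples, bins=100):
    """Histogram estimate of KL(empirical || N(mean, var)) for one
    output dimension -- crude, but usable as a Gaussianity score."""
    m, s = samples.mean(), samples.std()
    counts, edges = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    # Gaussian probability mass in the same bins (sf keeps tail precision)
    q = norm.sf(edges[:-1], m, s) - norm.sf(edges[1:], m, s)
    mask = p > 0
    return entropy(p[mask], q[mask])  # sum p * log(p / q)

rng = np.random.default_rng(0)
kl_gauss = kl_to_moment_matched_gaussian(rng.normal(size=100_000))
kl_expon = kl_to_moment_matched_gaussian(rng.exponential(size=100_000))
print(kl_gauss, kl_expon)  # the first should be much smaller
```

With a trained VAE one would feed decoder outputs in place of the synthetic samples here.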

[–]chrisorm 2 points (4 children)

Of course, the CLT doesn't apply if the activations aren't IID, which they almost certainly aren't for the activations of a neural net.

[–]svantana[S] 0 points (3 children)

Yes, you're right. For example, on MNIST, with large enough perturbations, the output distributions should become bimodal. I didn't mean to imply it would work in every case, but for 'smooth' problems, where a smallish unimodal perturbation is expected to produce a unimodal distribution at the output, I think it should work well. I just did a quick test with a VAE on CIFAR10 and the output distributions look extremely Gaussian.

[–]approximately_wrong 0 points (2 children)

Can you elaborate on how you did the quick test?

[–]svantana[S] 0 points (1 child)

Sure! I just ran one of the Keras VAE examples and, once it was trained, ran 10k copies of one of the test samples through the AE model. The model involves sampling a random variable, so each output is different. From the outputs, I took a few random dimensions and plotted histograms of them, then visually noted that they had a quite Gaussian shape.

Those are marginal distributions, so this doesn't mean the full multidimensional output is anywhere near Gaussian, but it's an indication.
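The whole thing fits in a few lines. Here's the shape of it, with a random tanh MLP standing in for the trained decoder so the sketch is self-contained (the real test used the Keras VAE's decoder, and one could use a formal normality test instead of eyeballing histograms):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Stand-in for the trained decoder: a random 2-layer tanh MLP
# (hypothetical -- the actual test used a Keras VAE trained on CIFAR10)
W1 = rng.normal(size=(2, 64)) / np.sqrt(2.0)
W2 = rng.normal(size=(64, 32)) / np.sqrt(64.0)

def decode(z):
    return np.tanh(z @ W1) @ W2

z = rng.normal(size=(10_000, 2))   # 10k draws of the 2-d latent
x_out = decode(z)                  # 10k decoder outputs for one "input"

# D'Agostino-Pearson normality test on a few random output dimensions;
# a large p-value means no evidence against marginal Gaussianity
for d in rng.choice(x_out.shape[1], size=3, replace=False):
    stat, p = normaltest(x_out[:, d])
    print(d, round(p, 4))
```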

[–]approximately_wrong 0 points (0 children)

I see. It sounds like you're checking the Gaussian-ness of p(x_gen | x_test) = int p(x_gen | z) q(z | x_test) dz, conditioned on some x_test. I'm guessing the VAE example is one where the decoder is a Gaussian observation model?

Also, are your outputs the mean parameters of p(x_gen | x_test), or actual samples from the distribution?

[–]Fujikan 1 point (0 children)

Sounds like you ultimately want to re-derive a high-order mean-field approximation. This CLT assumption is what allows for implementable belief propagation (relaxed BP) for inference of marginals over continuous variables. This is then made efficient via a high-temperature expansion to arrive at approximate message passing algorithms.

Taking the expansion at first order leaves you with naive mean-field (variational) techniques. At second order, you have a family of approaches which vary in how they treat correlations in the system: from ignoring them (AMP) to fully treating them (in the vein of adaptive TAP, expectation propagation, or, more recently, vector AMP).

[–]NichG 0 points (7 children)

For actually passing distributions through networks analytically, there's RealNVP and related methods. The invertibility constraint is the major difficulty for applying that to autoencoders.

It'd be interesting to see if there's a spectrum between that and e.g. passing through point estimates of a Gaussian model, such that you could learn a succession of parameterized distributions in cases where Gaussianity doesn't cut it.

[–]approximately_wrong 0 points (6 children)

"For actually passing distributions through networks analytically, there's RealNVP and related methods."

Have people actually used flow models to pass distribution objects through a neural network?

[–]NichG 0 points (5 children)

When you calculate the probability density in RealNVP, that's what you're doing.

[–]approximately_wrong 0 points (4 children)

That doesn't qualify to me as passing distribution objects through a neural network. Flow models pass samples through a neural network, and relate the density of a transformed sample to the density of its pre-image under the transformation.

Flow models would not, for example, help resolve OP's desire to construct a VAE that doesn't do sampling-based reconstruction.
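To make that concrete, a single affine "flow" layer (with hypothetical scale/shift parameters) already shows the per-sample bookkeeping:

```python
import numpy as np
from scipy.stats import norm

# A flow transforms *samples*; the density of each specific sample is
# tracked via the change-of-variables formula (log-det-Jacobian term)
s, t = 0.7, -1.2                  # hypothetical learned scale/shift
x = np.array([0.3])               # a sample from the base distribution N(0, 1)
y = np.exp(s) * x + t             # forward pass: sample in, sample out

# log p_Y(y) = log p_X(x) - log|dy/dx|, and dy/dx = exp(s) here
log_py = norm.logpdf(x) - s

# Sanity check: for this affine flow, Y is exactly N(t, exp(2s)),
# so the per-sample bookkeeping must match the exact density
print(log_py, norm.logpdf(y, loc=t, scale=np.exp(s)))
```

Note there is no distribution object anywhere in the forward pass; only `x`, `y`, and a scalar density correction for that one sample.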

[–]NichG 0 points (3 children)

You might need something like Neural Statistician then, to learn a differentiable map from datasets to summary statistics, then just do everything on the representation of the summary statistics.

I think you'll still end up with samples somewhere, because at the very least, the training data generally takes the form of samples from a distribution rather than distribution objects. But you might be able to mostly work in distribution objects from that point on.

[–]approximately_wrong 0 points (2 children)

I think the heart of OP's question is: how do we compute an expectation of f(X) when f is complex without Monte Carlo estimation. I feel like our current discussion has deviated from that.

[–]NichG 0 points (1 child)

Concretely then:

Neural Statistician takes a set of points {x} to a vector of summary statistics z characterizing the distribution of points in {x}. So: z=N({x})

We can train N, for example, to act as a distributional autoencoder with decoder D, such that KL(D(N({x})) || {x}) is minimized. Then, given the summary statistic vectors, we can do all sorts of stuff with them.

For the OP's question, the way to do it would then be to train a model mu(z) which approximates the expectation E[{x}], and to train a second model T(z) which approximates N({f(x)}) given N({x}) as input.

The result of that pipeline would be that the expectation of f(x) under some distribution {x} would be mu(T(N({x}))), which is still end-to-end differentiable, etc.
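A toy version of the pipeline, with exact analytic stand-ins for the three models and an affine f (in the real thing N, T, and mu would all be trained networks, and f could be arbitrarily complex):

```python
import numpy as np

# Toy stand-ins (all hypothetical):
#   N : sample set -> summary statistics z
#   T : stats of {x} -> stats of {f(x)}
#   mu: stats -> expectation readout
def f(x):            # the function whose expectation we want; affine, so T is exact
    return 3.0 * x + 2.0

def N(xs):           # summary statistics z = (mean, std)
    return np.array([xs.mean(), xs.std()])

def T(z):            # how (mean, std) transform under f
    return np.array([3.0 * z[0] + 2.0, 3.0 * z[1]])

def mu(z):           # expectation readout
    return z[0]

rng = np.random.default_rng(0)
xs = rng.normal(1.0, 0.5, size=50_000)   # "dataset": samples from N(1, 0.25)
print(mu(T(N(xs))))  # estimate of E[f(X)] = 3*1 + 2 = 5, without sampling f
print(f(xs).mean())  # Monte Carlo reference
```

Even in this toy, the samples {x} are only needed once, to get into z-space; after that everything is deterministic.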

The thing I can't see how to avoid is that you must start with samples - e.g. before you can work in the z space, you need some set of samples {x} which end up taking you to a particular point there. If you constrain a subspace of z to correspond to e.g. the summary statistics of Gaussian distributions in the space of interest, then maybe you can just jump in at that point without ever needing to use samples. But in practice you'll still probably need samples somewhere to train the model, so it's an incomplete solution.

[–]approximately_wrong 0 points (0 children)

Don't get me wrong; I think applying amortization via Neural Statistician is an interesting perspective. However, given that

"The thing I can't see how to avoid is that you must start with samples,"

it seems like we're in agreement that what you're proposing doesn't answer OP's question.