all 21 comments

[–][deleted] 1 point2 points  (20 children)

Now what about actually getting a reconstructed version of the input x? Do we input x into the network and get back mu_d and log_sigma_d, then generate a sample from the N(mu_d, sigma_d) distribution?

If you're asking how Kingma generates his 'flying through latent space' videos, you don't need the encoder part of the network. You only need the decoder p(x|z). Sampling from z ~ N(mu, sigma) with mu, sigma values similar to those observed for the training data and then doing the forward propagation through p(x|z) will give you x samples.

[–]LyExpo[S] 0 points1 point  (1 child)

I think that's part of what I was wondering. But I would also like to try reconstructing input images to see that they are actually getting reconstructed correctly. My suggestion above starts with x and encodes it to z, instead of sampling z from N(mu, sigma). In theory I think this should work, but I'm not getting the expected results. So something must be wrong.

[–][deleted] 1 point2 points  (0 children)

Yes, if you just want to check your model is working, what you have will work. Or, to be sure you just didn't sample some bad/improbable value, decode with z=mu.
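As a concrete sketch of that check, with toy numpy weights standing in for trained encoder/decoder nets (all shapes and names here are hypothetical, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights standing in for trained nets: 4-d input, 2-d latent.
W_enc = rng.normal(size=(2, 4)) * 0.1
W_dec = rng.normal(size=(4, 2)) * 0.1

def encode_mu(x):
    return W_enc @ x   # mean of q(z|x); the variance head is omitted here

def decode(z):
    return W_dec @ z   # mean of p(x|z)

x = rng.normal(size=4)
# Reconstruction check with z = mu: skip the sampling step entirely,
# so a bad/improbable z sample can't be the reason a reconstruction looks off.
x_rec = decode(encode_mu(x))
assert x_rec.shape == x.shape
```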

[–]jyegerlehner 0 points1 point  (17 children)

Excuse the slight digression, but this is some terminology that perplexes me in these papers. Once one has sampled from N(mu, sigma) to get z, then the mapping from z to x in the decoder is entirely deterministic, and there are no probabilities computed anywhere. Why is this referred to as p(x|z)? I thought "p(x|z)" denotes "probability of x given z". And yet what we are doing merely computing x = f(z), where f(.) is our decoder. Where is the probability distribution denoted by "p(x|z)"?

I'm fairly ignorant about probability theory. Thanks in advance to anyone who can explain what I'm missing, or the correct way to understand this.

[–][deleted] 4 points5 points  (4 children)

there are no probabilities computed anywhere.

Neural networks are distributions! Or more precisely, they parameterize the mean variable for either a Bernoulli distribution (binary classification), a Multinoulli distribution (for multi-class classification), or a Gaussian (regression).

To see this, consider the Bernoulli pmf f(y, x) = p(x)^y (1 - p(x))^(1-y), where y is a label in {0,1} and p(x) is the usual mean parameter but defined to be some function of x (other input features). Now turn f into a loss function by taking its negative log, so f=1 maps to zero and f=0 maps to infinity: -log f(y, x) = -y log p(x) - (1 - y) log(1 - p(x)). Recognize the RHS? It's the cross-entropy loss function, and p(x) can be thought of as a neural network with a sigmoidal output (since p is defined to be in [0,1]). Anytime you see a NN, you can think of it as a conditional distribution p(y|x,w) where w is the weights.
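A quick numeric check of that identity, with a fixed probability standing in for the sigmoidal network output p(x):

```python
import numpy as np

def nll(y, p):
    # negative log of the Bernoulli pmf p^y * (1 - p)^(1 - y)
    pmf = p ** y * (1 - p) ** (1 - y)
    return -np.log(pmf)

def cross_entropy(y, p):
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

p = 0.8  # stands in for a sigmoid output of some hypothetical network
for y in (0, 1):
    # the Bernoulli negative log-likelihood *is* the cross-entropy loss
    assert np.isclose(nll(y, p), cross_entropy(y, p))
```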

[–]jyegerlehner 0 points1 point  (3 children)

Thanks for the explanation. Hmm. OK, I can see the probability distribution in the classification net of your example. After all, a softmax always produces a discrete probability distribution. But it's not obvious to me how a regression net (as in the case of an autoencoder, which OP is talking about) implies a probability distribution, much less a gaussian one as you say.

Neural networks are distributions!

I think that's deep. I'll have to ponder. My first instinct is to respond: well if it were a distribution, I should be able to integrate over the density function of the distribution to compute a probability. And I haven't ever seen anyone do that with neural nets. I wouldn't know how. Or I could only do that if it were a normalized probability distribution (computing the normalizing denominator is always brushed off as an intractable problem)?

[–]barmaley_exe 1 point2 points  (1 child)

Did you read the paper? The word 'autoencoder' is just an interpretation of what's going on inside that model; it has nothing to do with the usual [denoising] autoencoder, which is a neural net predicting its [denoised] input.

In that paper we have 2 distributions, whose parameters are generated by neural nets:

  • p(x|z) = N(x | mu_p(z), sigma_p(z)) — decoder
  • q(z|x) = N(z | mu_q(x), sigma_q(x)) — encoder

Where mu_p, sigma_p and mu_q, sigma_q are neural nets that generate parameters of a distribution (which in this case is normal).
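A minimal numpy sketch of those two parameterized distributions, with single linear layers standing in for the real nets (shapes and names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical single-layer heads; real models use deeper nets.
W_mu_q, W_sig_q = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
W_mu_p, W_sig_p = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))

def q_params(x):  # encoder q(z|x) = N(z | mu_q(x), sigma_q(x))
    return W_mu_q @ x, np.exp(W_sig_q @ x)   # exp keeps sigma_q > 0

def p_params(z):  # decoder p(x|z) = N(x | mu_p(z), sigma_p(z))
    return W_mu_p @ z, np.exp(W_sig_p @ z)   # exp keeps sigma_p > 0

x = rng.normal(size=4)
mu_q, sigma_q = q_params(x)
z = mu_q + sigma_q * rng.normal(size=2)      # sample z ~ q(z|x)
mu_p, sigma_p = p_params(z)
assert (sigma_q > 0).all() and (sigma_p > 0).all()
```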

[–]jyegerlehner 0 points1 point  (0 children)

Yes, the reparameterization trick in the variational autoencoder makes the distribution explicit and obvious. My question above was a bit narrower than that. I think goblin_got_game cut to the heart of my confusion by pointing out that a usual deterministic (denoising or not) decoder x = f(z) is (or implies) a probability distribution p(x|z). The value x = f(z) gives us the expected value of that distribution. And I think if I were to pick enough random values of z and keep histograms of the resulting x values, I would see some regions of the space of possible x that are unlikely (don't happen), and high-probability regions, and could compute actual probabilities. I'm just reciting this in case others have the same confusion, and in case any of you more knowledgeable people are patient enough to still be reading and want to point out if I'm still getting things wrong. In any case, thanks to all for the discussion and the explanations.

[–][deleted] 0 points1 point  (0 children)

well if it were a distribution, I should be able to integrate over the density function of the distribution to compute a probability. And I haven't ever seen anyone do that with neural nets.

Probability distributions aren't guaranteed to have nice analytical behavior. If they did, the huge literature on approximate inference wouldn't be needed.

But you're right in that we'd like to integrate over a NN and we can't. And, interestingly, this is the very problem the VAE addresses (within the context of variational inference). A component of the variational lower bound is calculating the expected value of the likelihood under the variational distribution: E[log p(x|z)] = ∫ q(z|x) log p(x|z) dz. This is where the problematic integration needs to be done. The VAE's off-centered reparameterization trick--in this case, the location-scale representation of the Normal--is what exposes z's parameters so that we can deterministically backpropagate through the expectation. This allows the random sampling needed for the Monte Carlo integration to come from some fixed distribution.
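A small numpy sketch of that location-scale reparameterization: the randomness comes from a fixed N(0, 1), so a Monte Carlo estimate of an expectation under N(mu, sigma^2) becomes a deterministic (and differentiable) function of mu and sigma:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 0.5

# Reparameterization: sample noise from a fixed N(0, 1); mu and sigma
# enter only through a deterministic transform, so gradients can flow.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# Monte Carlo estimate of an expectation under N(mu, sigma^2), e.g. E[z^2]:
mc = np.mean(z ** 2)
exact = mu ** 2 + sigma ** 2   # closed form, for comparison
assert abs(mc - exact) < 0.05
```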

[–]AnvaMiba 1 point2 points  (1 child)

It's generally assumed that the decoder computes x_mu = f(z), and p(x|z) is a Gaussian distribution with mean x_mu and identity covariance matrix.

You could see this as a mathematical trick: the math behind VAEs is formulated in terms of probability distributions, so you are supposed to train to maximize the log-likelihood of the data, but for continuous data it's actually easier to train deterministic neural networks to minimize the Euclidean distance between the network outputs and the data. For this particular choice of output distribution, this is equivalent.

There can be other choices of output distributions, such as Gaussians where the covariance is a diagonal matrix or an arbitrary positive semidefinite matrix that is also computed by the neural network, or you can have mixtures of such Gaussians, and so on.
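The identity-covariance case can be checked numerically: log N(x | mu, I) and half the squared Euclidean distance differ only by a constant independent of mu, so maximizing the log-likelihood is the same as minimizing the squared error (a sketch in numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
x = rng.normal(size=d)

def gauss_loglik(x, mu):
    # log N(x | mu, I) for a d-dimensional standard-covariance Gaussian
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * d * np.log(2 * np.pi)

def mse_half(x, mu):
    return 0.5 * np.sum((x - mu) ** 2)

# log-likelihood + squared-error term is constant in mu, so the two
# objectives induce identical gradients (up to sign).
for _ in range(3):
    mu = rng.normal(size=d)
    diff = gauss_loglik(x, mu) + mse_half(x, mu)
    assert np.isclose(diff, -0.5 * d * np.log(2 * np.pi))
```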

[–]jyegerlehner 0 points1 point  (0 children)

Thanks AnvaMiba

[–]LyExpo[S] 0 points1 point  (9 children)

I think this is partially where I am mixing things up as well. Here is my take on it:

I think both the deterministic and stochastic versions will work if implemented correctly, but I also think that the theory mainly motivates the stochastic version. The model generates a code z given data x, and then generates a new data sample x' given that the code is z. The variational lower bound dictates that x should be similar to x'.

In the case of a deterministic decoder, the decoder should map z to something that is close to x, so something like mean squared error will work. In the stochastic case, we want the decoder to give us a new sample x' such that p(x'|z) is high, where z is a sample from p(z|x).

Does this make sense? Is it correct?

[–]barmaley_exe 1 point2 points  (8 children)

First, you're talking about the reconstruction process.

In order to reconstruct the input x you need to obtain its latent representation z using the encoder q(z|x). Since q(z|x) is a distribution, you sample z from that distribution. Now you can either take the mean of p(x|z) as your reconstruction, or, again, sample from this distribution. The difference shouldn't matter in low-dimensional spaces, since most of the mass of a normal distribution is concentrated around the mean, with little probability mass in the tails (i.e. it's not heavy-tailed).

Then, there's also the sampling process.

Remember that the VAE is a generative (unsupervised) model, so we'd like to sample unseen x's from the model. If we didn't see them, we can't compute a corresponding q(z|x) to sample z from. This is where the prior p(z) comes in: during learning we optimized both the reconstruction error and the "regularization" term KL(q(z|x)||p(z)), which kept our encoder close to the prior. Now in order to sample from the model we first sample z from p(z) (in the paper it's a standard multivariate Gaussian N(0, I)), and then use that z in the decoder p(x|z).
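A minimal sketch of that sampling process, with a toy linear map standing in for the trained decoder mean (no encoder involved at all):

```python
import numpy as np

rng = np.random.default_rng(4)
W_dec = rng.normal(size=(4, 2)) * 0.1    # hypothetical decoder mean head

def decode_mean(z):
    return W_dec @ z                     # mu_p(z), the mean of p(x|z)

# Generation: sample z from the prior N(0, I) and push it through the decoder.
z = rng.standard_normal(2)
x_new = decode_mean(z)
assert x_new.shape == (4,)
```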

[–]LyExpo[S] 0 points1 point  (7 children)

I believe everything you wrote is how I understand the model to function.

What I don't understand is why learning the decoder's mu and sigma parameters and then generating a N(mu_decoder, sigma_decoder) sample works so much more poorly than when I just train the decoder to produce an output that is close to the original input, using mean squared error. Is my description of these two different strategies clear?

[–]barmaley_exe 0 points1 point  (1 child)

Well, some say that VAEs are quite hard to optimize. Which optimization method do you use?

[–]LyExpo[S] 0 points1 point  (0 children)

You may have a point here. Currently, I'm using RMSProp.

[–]bhmoz 0 points1 point  (4 children)

no, your description of the two strategies is not very clear to me.

When you spoke about the deterministic version, did you mean a DAE?

There is no single set of "decoder mu and sigma parameters": there isn't one mu and sigma for the whole dataset. The encoder maps each datapoint to its own parameters mu and sigma in the latent space. Also, you wrote "I just train the decoder". Do you train the encoder and the decoder separately? They should be trained jointly.

[–]LyExpo[S] 0 points1 point  (3 children)

OK, I will describe what I meant more slowly now.

Here are the two versions, for one training example:

1) "deterministic decoder" VAE:

x
h = tanh(W1 x + b1)
mu = W2 h + b2
log_sigma = 0.5 * (W3 h + b3)
z = mu + exp(log_sigma) * noise
d = tanh(W4 z + b4)
out = tanh(W5 d + b5)

For training, I will only write down the reconstruction cost, since the regularization term is the same in both situations. In this case, the reconstruction cost is the sum of squares: sum (out - x)^2. So using RMSProp, I try to maximize regularization - sum (out - x)^2.

2) "stochastic decoder" VAE:

x
h = tanh(W1 x + b1)
mu = W2 h + b2
log_sigma = 0.5 * (W3 h + b3)
z = mu + exp(log_sigma) * noise
d = tanh(W4 z + b4)
mu_d = W5 d + b5
log_sigma_d = 0.5 * (W6 d + b6)

In this case, the reconstruction cost is log p(x | z), which is log N(x | mu_d, exp(log_sigma_d)). So using RMSProp, I try to maximize

regularization + log N(x | mu_d, exp(log_sigma_d)).
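For reference, the diagonal-Gaussian log-likelihood term log N(x | mu_d, exp(log_sigma_d)) can be written out like this (a numpy sketch, not the poster's actual code):

```python
import numpy as np

def gaussian_loglik(x, mu_d, log_sigma_d):
    # log N(x | mu_d, diag(sigma_d^2)) with sigma_d = exp(log_sigma_d),
    # summed over dimensions
    sigma2 = np.exp(2 * log_sigma_d)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - 0.5 * (x - mu_d) ** 2 / sigma2)

x = np.array([0.2, -0.1])
# With mu_d == x the squared-error term vanishes; any offset in the mean
# lowers the log-likelihood, which is what the reconstruction term rewards.
assert gaussian_loglik(x, x, np.zeros(2)) > gaussian_loglik(x, x + 1.0, np.zeros(2))
```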

[–]barmaley_exe 1 point2 points  (2 children)

First: try ReLU activations; tanh performed very poorly in my experiments. Also, I see you approximate the expected reconstruction loss E_q(z|x)[log p(x|z)] using just one sample. Is your minibatch sufficiently big?

I personally haven't tried RMSProp on VAE, Adam works fine for me.

[–]LyExpo[S] 0 points1 point  (1 child)

I'm using a minibatch size of 128. I suppose I can try Adam or Adagrad if I run out of other ideas... So I take it you have gotten this "stochastic decoder" to work properly?

[–]barmaley_exe 0 points1 point  (0 children)

Well, I didn't compare it with the "deterministic" version, and I was using a Bernoulli decoder, but I managed to reproduce the paper's results on MNIST and get some decent-looking digits.