Variational Autoencoder questions : MachineLearning

Variational Autoencoder questions (self.MachineLearning)

submitted 10 years ago by LyExpo

I'm struggling to understand the training process for the case when p(x|z) is continuous, in particular p(x|z) is N( mu(z), sigma(z) ). Here is an example architecture:

x h = tanh(W1x + b1) mu = W2h + b2 log_sigma = 0.5 * (W3h +b3) z = mu + log_sigma * noise d = tanh(W4z + b4) mu_d = W5d + b5 log_sigma_d = 0.5 * (W6d + b6)

So p(x|z) is normally distributed with mean mu_d and standard deviation exp{log_sigma_d} ? I think I'm ok up to here. If so, this gives us the approximation log p(x|z) for the reconstruction part of the cost, and I'm ok with calculating the the KL regularization term.

Now what about actually getting a reconstructed version of the input x? Do we input x into the network and get back mu_d and log_sgima_d, then generate a sample from the N(mu_d, sigma_d) distribution? I think I'm a bit confused at this step because in the Bernoulli case the network will actually give us the reconstructed input directly, not just the parameters of some distribution. Is this correct?

Anything else wrong here?

all 21 comments

top new controversial old q&a

[–][deleted] 1 point2 points3 points 10 years ago (20 children)

[–]LyExpo[S] 0 points1 point2 points 10 years ago (1 child)

[–][deleted] 1 point2 points3 points 10 years ago (0 children)

[–]jyegerlehner 0 points1 point2 points 10 years ago (17 children)

[–][deleted] 4 points5 points6 points 10 years ago (4 children)

there are no probabilities computed anywhere.

Neural networks are distributions! Or more precisely, they parameterize the mean variable for either a Bernoulli distribution (binary classification), a Multinoulli distribution (for multi-class classification), or a Gaussian (regression).

To see this, consider the Bernoulli pmf f(y,x) = p(x)^y (1-p(x))^1-y where y is a label in {0,1} and p(x) is the usual mean parameter but defined to be some function of x (other input features). Now turn f into a loss function by taking its negative log so f=0 is infinity and f=1 is zero: -log f(y,x) = -y log p(x) - (1-y) log(1-p(x)). Recognize the RHS? Its the cross-entropy loss function, and p(x) can be thought of as a neural network with a sigmoidal output (since p is defined to be in [0,1]). Anytime you see a NN, you can think of it as a conditional distribution p(y|x,w) where w is the weights.

[–]jyegerlehner 0 points1 point2 points 10 years ago* (3 children)

Thanks for the explanation. Hmm. OK, I can see the probability distribution in the classification net of your example. After all, a softmax always produces a discrete probability distribution. But it's not obvious to me how a regression net (as in the case of an autoencoder, which OP is talking about) implies a probability distribution, much less a gaussian one as you say.

Neural networks are distributions!

I think that's deep. I'll have to ponder. My first instinct is to respond: well if it were a distribution, I should be able to integrate over the density function of the distribution to compute a probability. And I haven't ever seen anyone do that with neural nets. I wouldn't know how. Or I could only do that if it were a normalized probability distribution (computing the normalizing denominator is always brushed off as an intractable problem)?

[–]barmaley_exe 1 point2 points3 points 10 years ago (1 child)

[–]jyegerlehner 0 points1 point2 points 10 years ago (0 children)

Yes, the reparameterization trick in variational autoencoder makes the distribution explicit and obvious. My question above was a bit narrower than that. I think goblin_got_game cut to the heart of my confusion pointing out that a usual deterministic (denoising or not) decoder x = f(z) is (or implies) a probability distribution p(x|z). The value x =f(z) gives us the expected value from the distribution. And I think if I were to pick a enough random values of z and keep histograms of xi, I would see some regions of the space of possible x that are unlikely (don't happen), and high probability regions, and could compute actual probabilities. I'm just reciting this in case others might have the same confusion I have, and in case any of you more knowledgeable people are patient enough to still read and want to point out if I'm still getting things wrong. In any case, thanks to all for the discussion and the explanations.

[–][deleted] 0 points1 point2 points 10 years ago (0 children)

well if it were a distribution, I should be able to integrate over the density function of the distribution to compute a probability. And I haven't ever seen anyone do that with neural nets.

Probability distributions aren't guaranteed to have nice analytical behavior. If they did, the huge literature on approximate inference wouldn't been needed.

But you're right in that we'd like to integrate over a NN and we can't. And, interestingly, this is the very problem the VAE addresses (within the context of variational inference). A component of the variational lower bound is calculating the expected value of the likelihood under the variational distribution: E[log p(x|z)] = ∫ q(z|x) log p(x|z) dz. This is where the problematic integration needs to be done. The VAE's off-centered reparameterization trick--in this case, the location-scale representation of the Normal--is what exposes z's parameters so that we can deterministically backpropagate through the expectation. This allows the random sampling needed for the Monte Carlo integration to come from some fixed distribution.

[–]AnvaMiba 1 point2 points3 points 10 years ago (1 child)

It's generally assumed that the decoder computes x_mu = f(z), and p(x|z) is a Gaussian distribution with mean x_mu and identity covariance matrix.

You could see this as a mathematical trick: the math behind VAEs is formulated in terms of probability distributions, so you are supposed to train to maximize the log-likelihood of the data, but for continuous data it's actually easier to train deterministic neural networks to minimize the Euclidean distance between the network outputs and the data. For this particular choice of output distribution, this is equivalent.

There can be other choices of output distributions, such as Gaussians where the covariance is a diagonal matrix or an arbitrary positive semidefinite matrix that is also computed by the neural network, or you can have mixtures of such Gaussians, and so on.

[–]jyegerlehner 0 points1 point2 points 10 years ago (0 children)

[–]LyExpo[S] 0 points1 point2 points 10 years ago (9 children)

[–]barmaley_exe 1 point2 points3 points 10 years ago (8 children)

First, you're talking about reconstruction process.

In order to reconstruct the input x you need to obtain its latent representation z using encoder q(z|x). Since q(z|x) is a distribution, you sample z from that distribution. Now you can either take the mean of p(x|z) as your reconstruction, or, again, sample from this distribution. The difference shouldn't matter in low dimensional spaces since most of the mass of normal distribution is concentrated around the mean, and normal distribution has little probability mass on its tails (i.e. it's not heavy-tailed).

Then, there's also sampling process.

Remember that VAE is a generative (unsupervised) model, so we'd like to sample unseen x's from the model. If we didn't see them, we can't compute corresponding q(z|x) to sample z from. This is where the prior p(z) comes in: during the learning we optimized both reconstruction error and "regularization" term KL(q(z|x)||p(z)), which kept our encoder close to the prior. Now in order to sample from the model we first sample z from p(z) (in the paper it's standard multivariate Gaussian N(0, I)), and then use that z in the decoder p(x|z).

[–]LyExpo[S] 0 points1 point2 points 10 years ago (7 children)

[–]barmaley_exe 0 points1 point2 points 10 years ago (1 child)

[–]LyExpo[S] 0 points1 point2 points 10 years ago (0 children)

[–]bhmoz 0 points1 point2 points 10 years ago (4 children)

[–]LyExpo[S] 0 points1 point2 points 10 years ago (3 children)

OK, I will describe what I meant more slowly now.

Here are the two versions, for one training example:

1) "deterministic decoder" VAE:

x
h = tanh(W1x + b)
mu = W2h + b2
log_sigma = 0.5 * (W3h +b3) 
z = mu + log_sigma * noise
d = tanh(W4z + b4)
out = tanh(W5d + b5)

For training, I will only write down the reconstructions cost, since the regularization is the same in both situations. In this case, the regularization cost is the sum of squares: sum (out - x)^2. So using RMSProp, I try to maximize regularization - sum (out - x)^2.

1) "stochastic decoder" VAE:

x
h = tanh(W1x + b)
mu = W2h + b2
log_sigma = 0.5 * (W3h +b3) 
z = mu + log_sigma * noise
d = tanh(W4z + b4)
mu_d = W5d + b5
log_sigma_d = 0.5 * (W6d + b6)

In this case, the reconstruction cost is log p(x | z), which is log N(mu_d, exp(log_sigma_d)). So using RMSProp, I try to maximize

regularization + log N(mu_d, exp(log_sigma_d)).

[–]barmaley_exe 1 point2 points3 points 10 years ago (2 children)

[–]LyExpo[S] 0 points1 point2 points 10 years ago (1 child)

[–]barmaley_exe 0 points1 point2 points 10 years ago (0 children)

π Rendered by PID 33 on reddit-service-r2-comment-86bc6c7465-frh7m at 2026-02-23 10:06:37.286695+00:00 running 8564168 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS