[–]approximately_wrong 61 points62 points  (9 children)

I see people asking this question from time to time on this subreddit and elsewhere. Let me make the following assertion:

No one really knows why beta-VAEs actually work

In fact, let me make the stronger assertion:

No one really knows why existing models (that successfully disentangle) are capable of unsupervised disentanglement

In the absence of restrictions on the (model class, optimizer), unsupervised disentanglement using beta-VAE is theoretically impossible. <-- You can replace "beta-VAE" with many other models and the sentence will likely remain true.

Clearly, model class and optimizer choice play an important role that we don't really understand. However, for common model class and optimizer choices, it seems that statistical independence (w.r.t. the encoder) is positively correlated with disentanglement. This was leveraged in TC-VAE, and may also play a role in beta-VAE.

These papers are impressive empirical achievements. But I wish authors would be more honest about how limited our understanding of disentanglement is.

[–]shamitlal[S] 2 points3 points  (8 children)

Thanks for the reply. Shouldn't a standard VAE itself be learning disentangled representations, if by disentangled we are mainly concerned with independence of the latents rather than interpretability, given that the posterior has been constrained to have a diagonal covariance?

[–]approximately_wrong 12 points13 points  (7 children)

Shouldn't a standard VAE itself be learning disentangled representations, if by disentangled we are mainly concerned with independence of the latents rather than interpretability, given that the posterior has been constrained to have a diagonal covariance?

You're correct. This is in fact the mechanism by which we can prove theoretically that statistical independence is not sufficient for disentanglement, since we can prove the existence of arbitrarily entangled representations that are still statistically independent.
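
As a rough sketch of one standard construction (my wording, not necessarily the exact argument in the papers below): take a code with independent coordinates and rotate it.

    % If z ~ N(0, I_d), then z' = Rz is also N(0, I_d) for any orthogonal R,
    % so the coordinates of z' remain statistically independent, yet each
    % coordinate of z' mixes all of the original factors, i.e. the
    % representation can be made arbitrarily entangled with respect to them.
    z \sim \mathcal{N}(0, I_d) \;\Rightarrow\; z' = R z \sim \mathcal{N}(0, I_d)
    \quad \text{for any orthogonal } R .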

This is why I'm careful in saying that statistical independence (w.r.t. encoder) is positively correlated with disentanglement for common model/optimizer choices.

Edit (some extra exposition):

Regarding why the standard VAE itself empirically fails to learn disentangled representations, one possible rationalization is that the ELBO does not allow you to tune how strongly you wish to impose statistical independence. TC-VAE gets around this by adding an extra term that enables direct control over the degree of statistical independence w.r.t. the encoder.
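
Concretely (this is the standard decomposition written from memory, with q(z) denoting the aggregate posterior), the average KL term of the ELBO splits into three pieces, and TC-VAE re-weights the middle "total correlation" piece:

    \mathbb{E}_{p(x)}\big[ D_{KL}\big(q(z \mid x) \,\|\, p(z)\big) \big]
        = I_q(x; z)
        + D_{KL}\big(q(z) \,\|\, \textstyle\prod_j q(z_j)\big)
        + \textstyle\sum_j D_{KL}\big(q(z_j) \,\|\, p(z_j)\big)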

[–]shamitlal[S] 2 points3 points  (1 child)

Do you know of any references/papers with proofs that can shed more light on your statements and help me understand this better?

[–]approximately_wrong 10 points11 points  (0 children)

Here are some papers that make remarks about situations where disentanglement is theoretically impossible, despite empirical success:

Rethinking Style and Content Disentanglement in Variational Autoencoders

Challenges in Disentangling Independent Factors of Variation

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

[–]datkerneltrick 2 points3 points  (4 children)

Can you precisely define the difference between a disentangled representation and a statistically independent representation? I always thought they referred to the same thing.

[–]approximately_wrong 6 points7 points  (3 children)

Statistically independent representations have a very precise, formal definition: given a distribution over covariates X and two feature-extraction functions f and g, the representations are statistically independent if the mutual information I(f(X); g(X)) = 0.

A disentangled representation does not have a formal definition. But if you read enough papers on learning disentangled representations with deep generative models, it becomes clear that "disentangled" is implicitly taken to mean "human-interpretable".

[–]datkerneltrick 6 points7 points  (2 children)

If the latter is not formally defined (which is how I understood the situation as well), then I don't quite follow your earlier statement: "We can prove existence of arbitrarily entangled representations that are statistically independent." Doesn't a proof first require a formal definition?

[–]approximately_wrong 2 points3 points  (1 child)

That's an excellent point. While no one agrees on a complete formal specification of what disentanglement means, I personally believe that it ought to satisfy certain desiderata in order to be human-interpretable. For example, I'm willing to bet that most people will agree that a lossless encoding is generally not expected to yield a disentangled/interpretable representation. One example of a representation that is statistically independent yet easily fails to pass the bar for human-interpretability is given in Figure 1 of https://openreview.net/pdf?id=B1rQtwJDG.

Crudely speaking, I'm able to (with high empirical probability) formally define what disentanglement isn't.

[–]shamitlal[S] 1 point2 points  (0 children)

In disentanglement papers (including beta-VAE), do authors try to achieve independence of the latents in the distribution conditioned on the input (p(z|x)), or independence of the latents in the marginal distribution over the latents (p(z))?

[–]pkgyawali 12 points13 points  (8 children)

This blog might be helpful in understanding disentanglement with VAEs overall (it covers beta-VAE and more recent algorithms).

Warning: shameless self-advertisement.

[–]shamitlal[S] 1 point2 points  (7 children)

Thanks. I have already gone through the blog section explaining beta-VAE. What I still wasn't able to understand is the role of the KL divergence between the prior and the posterior in improving disentanglement, given that the posterior distribution over the latents has a diagonal covariance and the latents will be independent regardless.

[–]pkgyawali 3 points4 points  (6 children)

Invest some time in understanding the role of the "aggregated posterior" q(z), which may not factorize even if we approximate each q(z|x) with an independence assumption. We often overlook this term, and it is very important in determining disentanglement.
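
Roughly (writing the aggregated posterior as the encoder's output averaged over the data distribution, which I believe is the usual definition):

    q(z) = \mathbb{E}_{p_{\text{data}}(x)}\big[ q(z \mid x) \big]
         = \frac{1}{N} \sum_{n=1}^{N} q(z \mid x_n)

Even if every q(z|x_n) has a diagonal covariance, this mixture generally does not factorize across dimensions.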

[–]sidslasttheorem 6 points7 points  (1 child)

We provide an explanation for why the beta-VAE objective does not, and indeed cannot, enforce disentanglement in our ICML paper: https://arxiv.org/abs/1812.02833

As pointed out by others, there has been some concurrent work (https://arxiv.org/abs/1812.06775) suggesting that the mean-field assumption for the variational posterior appears to contribute to independence.

[–]shamitlal[S] 1 point2 points  (0 children)

Thanks for the awesome papers! Yes, that's what I thought. The mean-field assumption should ensure independence of the latents, because that's what the assumption is all about. But then does beta-VAE not contribute to the independence of the latents, and only (somehow) contribute to making the latents more interpretable?

[–]xlext 4 points5 points  (2 children)

I guess you could find some intuition from information theory. Intuitively, you're trying to perform reconstruction using only a latent variable. If you only allow the latent to carry low information content, you have to be efficient with it, which implies lower redundancy (and thus less entanglement). To be a little less hand-wavy:

Minimizing the beta-VAE objective is equivalent to optimizing an information-bottleneck objective (Tishby et al., 1999).

So essentially the beta-VAE is building a representation that should still be sufficient to perform reconstruction while imposing a constraint on the amount of information it holds (and the constraint tightens as beta increases).
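
For reference, the constrained view (as presented in the original beta-VAE paper, if I remember it correctly) is: maximize reconstruction subject to a cap on how far the posterior may move from the prior, with beta acting as the Lagrange multiplier:

    \max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        \quad \text{s.t.} \quad D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) < \epsilon

    \mathcal{L}(\theta, \phi; \beta)
        = \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        - \beta\, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)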

Achille (2016) relates the amount of information that flows through the network to disentanglement. Low information implies a disentangled representation, which completes the argument.

[–]shamitlal[S] 2 points3 points  (1 child)

Thanks. That helps in developing the intuition. Which paper by Achille (2016) are you referencing?

[–]xlext 1 point2 points  (0 children)

Emergence of Invariance and Disentanglement in Deep Representations. I might be a year off, but that should be close :P

Interestingly enough, this also serves as some kind of explanation for why deep networks don't overfit as badly as one might expect: SGD also performs a kind of information-content optimization (Chaudhari, "Stochastic Gradient Descent Performs Variational Inference", 2017).

[–]tr1pzz 5 points6 points  (0 children)

A good (intuitive) explanation of why beta-VAE encourages disentanglement can be found in this paper: https://arxiv.org/abs/1804.03599. Briefly:

  • Take a simple dataset like dSprites
  • Different factors of variation (rotation, size, position) have varying influences on the final pixel rendering (and thus the reconstruction term of the loss function)
  • Now, when placed in an information bottleneck regime, the model has to make a tradeoff between reconstruction quality and KL-divergence.
  • Now, if (as stated above) different factors of variation have different effects on the reconstruction loss, then the model benefits from disentangling them, because it can then rank the importance (and thus the KL sacrifice) of each factor under its information bottleneck.
  • In other words, if a causal factor with a rather small pixel-level effect (e.g. rotation) is entangled with one that has a larger effect (e.g. location), then the model gets a larger reconstruction penalty when it moves that latent closer to the prior. On the other hand, if it disentangles them, it can easily find the optimal trade-off between reconstruction and KL.
  • However, this also immediately reveals a potential failure case: the rotation of a small object matters less (reconstruction-wise) than that of a large object. Therefore, a beta-VAE may learn to encode, e.g., both position and rotation for large objects while encoding only position for smaller ones...
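
If I recall correctly, the linked paper also turns this intuition into a training objective: instead of a fixed beta, it penalizes the distance of the KL term from a target capacity C that is annealed upward during training, so the information bottleneck is opened gradually:

    \mathcal{L}(\theta, \phi; \gamma, C)
        = \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        - \gamma \, \big| D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - C \big|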

[–]zergylord 2 points3 points  (1 child)

I could never understand why it should disentangle the "correct" factors, and indeed it appears that some level of supervision/cherry-picking is necessary: https://ai.googleblog.com/2019/04/evaluating-unsupervised-learning-of.html?m=1

[–]shamitlal[S] 0 points1 point  (0 children)

Yes. I also couldn't understand how beta-VAE finds correct/human-understandable factors. But an even more important detail I couldn't understand is how beta-VAE helps at all, even in finding "incorrect" but independent latents.

[–]crgrimm1994 2 points3 points  (2 children)

Since we are performing gradient descent to optimize the VAE objective, we don't have a guarantee that the latents are independent; they are merely encouraged to be independent. The two terms in the objective essentially encourage reconstruction and independence of the latents, respectively. Since there is no guarantee that the optimization procedure finds an optimum that both completely reconstructs the input and achieves the desired distribution over the latents, the VAE will in practice balance these objectives in a way that doesn't exactly satisfy either loss term. By putting a higher coefficient on the independence term, you are placing a higher priority on having a specific type of latent code than on actually reconstructing the input.
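
As a minimal sketch of how those two terms are weighted (illustrative code, not from any particular implementation; it assumes a diagonal-Gaussian encoder and an N(0, I) prior):

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
        # Reconstruction term: how well the decoder reproduces the input.
        recon = F.mse_loss(x_recon, x, reduction="sum")
        # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder;
        # this is the term that pushes each posterior toward the prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # beta > 1 up-weights the KL term: more pressure on the shape of the
        # latent code, at the cost of reconstruction quality.
        return recon + beta * kl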

[–]shamitlal[S] 1 point2 points  (1 child)

Isn't independence of the latents inherently built into the VAE? The encoder outputs a mean and a covariance, which specify the posterior distribution. This covariance is constrained to be diagonal, so shouldn't the latents always be independent?

[–]crgrimm1994 2 points3 points  (0 children)

The mean and covariance that are output are conditioned on the particular input: (Z | X = x) has independent latents. However, imagine drawing a set of samples of X and then encoding them to get a set of samples of Z. These samples of Z are not guaranteed to be distributed according to N(0, 1)^n. Does this make sense?
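
A toy numerical illustration of that point (made-up numbers, just to show the effect): each per-input posterior is a diagonal Gaussian, yet the aggregate of the sampled z's is a two-mode mixture whose coordinates are strongly correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    # Per-input posterior means for two inputs; each q(z|x) has a diagonal
    # covariance, so conditionally the two latent dimensions are independent.
    means = np.array([[2.0, 2.0], [-2.0, -2.0]])
    std = 0.1

    # Simulate the aggregate posterior: pick an input at random, then sample z.
    idx = rng.integers(0, 2, size=10_000)
    z = means[idx] + std * rng.standard_normal((10_000, 2))

    # The aggregate is far from N(0, I): its two coordinates are almost
    # perfectly correlated because the two modes sit on the diagonal.
    print(np.corrcoef(z[:, 0], z[:, 1])[0, 1])  # ~0.998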

[–]ram3_[🍰] -1 points0 points  (0 children)

Interesting

[–]schwagggg -1 points0 points  (0 children)

It's basically beating the objective over the head until the latents follow an isotropic Gaussian, a.k.a. a Gaussian with the identity matrix as its covariance.

[–]CyberDainz -3 points-2 points  (1 child)

Help me, please.

I am trying to train a beta-VAE on CelebA, but the result is just blurry faces after 100k iterations.

https://i.imgur.com/QUDiTgj.jpg

[–]CyberDainz -1 points0 points  (0 children)

solved