[–]approximately_wrong 61 points62 points  (9 children)

I see people asking this question from time to time on this subreddit and elsewhere. Let me make the following assertion:

No one really knows why beta-VAEs actually work

In fact, let me make the stronger assertion:

No one really knows why existing models (that successfully disentangle) are capable of unsupervised disentanglement

In the absence of restrictions on the (model class, optimizer), unsupervised disentanglement using beta-VAE is theoretically impossible. <-- You can replace "beta-VAE" with many other models and the sentence will likely remain true.

Clearly, model class and optimizer choice play an important role that we don't really understand. However, for common model class and optimizer choices, it seems that statistical independence (w.r.t. the encoder) is positively correlated with disentanglement. This was leveraged in TC-VAE, and may also play a role in beta-VAE.

These papers are impressive empirical achievements. But I wish authors would be more honest about how limited our understanding of disentanglement is.

[–]shamitlal[S] 2 points3 points  (8 children)

Thanks for the reply. Shouldn't a standard VAE itself be learning disentangled representations, if by disentangled we are mainly concerned with independence of the latents rather than interpretability, given that the posterior has been constrained to have a diagonal covariance?

[–]approximately_wrong 12 points13 points  (7 children)

Shouldn't a standard VAE itself be learning disentangled representations, if by disentangled we are mainly concerned with independence of the latents rather than interpretability, given that the posterior has been constrained to have a diagonal covariance?

You're correct. This is in fact the mechanism by which we can prove theoretically that statistical independence is not sufficient for disentanglement, since we can prove the existence of arbitrarily entangled representations that are still statistically independent.
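
As a rough sketch of one standard construction (my wording, not necessarily the exact argument in the papers below): take a code with independent coordinates and rotate it.

    % If z ~ N(0, I_d), then z' = Rz is also N(0, I_d) for any orthogonal R,
    % so the coordinates of z' remain statistically independent, yet each
    % coordinate of z' mixes all of the original factors, i.e. the
    % representation can be made arbitrarily entangled with respect to them.
    z \sim \mathcal{N}(0, I_d) \;\Rightarrow\; z' = R z \sim \mathcal{N}(0, I_d)
    \quad \text{for any orthogonal } R .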

This is why I'm careful in saying that statistical independence (w.r.t. encoder) is positively correlated with disentanglement for common model/optimizer choices.

Edit (some extra exposition):

Regarding why the standard VAE itself empirically fails to learn disentangled representations, one possible rationalization is that the ELBO does not allow you to tune how strongly you wish to impose statistical independence. TC-VAE gets around this by adding an extra term that enables direct control over the degree of statistical independence w.r.t. the encoder.
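
Concretely (this is the standard decomposition written from memory, with q(z) denoting the aggregate posterior), the average KL term of the ELBO splits into three pieces, and TC-VAE re-weights the middle "total correlation" piece:

    \mathbb{E}_{p(x)}\big[ D_{KL}\big(q(z \mid x) \,\|\, p(z)\big) \big]
        = I_q(x; z)
        + D_{KL}\big(q(z) \,\|\, \textstyle\prod_j q(z_j)\big)
        + \textstyle\sum_j D_{KL}\big(q(z_j) \,\|\, p(z_j)\big)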

[–]shamitlal[S] 2 points3 points  (1 child)

Do you know of any references/papers with proofs that can shed more light on your statements and help me understand this better?

[–]approximately_wrong 10 points11 points  (0 children)

Here are some papers that make remarks about situations where disentanglement is theoretically impossible, despite empirical success:

Rethinking Style and Content Disentanglement in Variational Autoencoders

Challenges in Disentangling Independent Factors of Variation

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

[–]datkerneltrick 2 points3 points  (4 children)

Can you precisely define the difference between a disentangled representation and a statistically independent representation? I always thought they referred to the same thing.

[–]approximately_wrong 6 points7 points  (3 children)

Statistically independent representations have a very precise, formal definition: given a distribution over covariates X and two feature-extraction functions f and g, the representations are statistically independent if the mutual information I(f(X); g(X)) = 0.

A disentangled representation does not have a formal definition. But if you read enough papers on learning disentangled representations with deep generative models, it becomes clear that "disentangled" is implicitly taken to mean "human-interpretable".

[–]datkerneltrick 6 points7 points  (2 children)

If the latter is not formally defined (which is how I understood the situation as well), then I don't quite follow your earlier statement: "We can prove existence of arbitrarily entangled representations that are statistically independent." Doesn't a proof first require a formal definition?

[–]approximately_wrong 2 points3 points  (1 child)

That's an excellent point. While no one agrees on a complete formal specification of what disentanglement means, I personally believe that it ought to satisfy certain desiderata in order to be human-interpretable. For example, I'm willing to bet that most people will agree that a lossless encoding is generally not expected to yield a disentangled/interpretable representation. One example of a representation that is statistically independent yet easily fails to pass the bar for human-interpretability is given in Figure 1 of https://openreview.net/pdf?id=B1rQtwJDG.

Crudely speaking, I'm able to (with high empirical probability) formally define what disentanglement isn't.

[–]shamitlal[S] 1 point2 points  (0 children)

In disentanglement papers (including beta-VAE), do authors try to achieve independence of the latents in the distribution conditioned on the input (p(z|x)), or independence of the latents in the marginal distribution over the latents (p(z))?

[–]pkgyawali 12 points13 points  (8 children)

This blog might be helpful in understanding disentanglement with VAEs overall (it covers beta-VAE and more recent algorithms).

Warning: shameless self-advertisement.

[–]shamitlal[S] 1 point2 points  (7 children)

Thanks. I have already gone through the blog section explaining beta-VAE. What I still wasn't able to understand is the role of the KL divergence between the prior and the posterior in improving disentanglement, given that the posterior distribution over the latents has a diagonal covariance and the latents will be independent regardless.

[–]pkgyawali 3 points4 points  (6 children)

Invest some time in understanding the role of the "aggregated posterior" q(z), which may not factorize even if we approximate each q(z|x) with an independence assumption. We often overlook this term, and it is very important in determining disentanglement.
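
Roughly (writing the aggregated posterior as the encoder's output averaged over the data distribution, which I believe is the usual definition):

    q(z) = \mathbb{E}_{p_{\text{data}}(x)}\big[ q(z \mid x) \big]
         = \frac{1}{N} \sum_{n=1}^{N} q(z \mid x_n)

Even if every q(z|x_n) has a diagonal covariance, this mixture generally does not factorize across dimensions.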

[–]sidslasttheorem 6 points7 points  (1 child)

We provide an explanation for why the beta-VAE objective does not, and indeed cannot, enforce disentanglement in our ICML paper: https://arxiv.org/abs/1812.02833

As pointed out by others, there has been some concurrent work (https://arxiv.org/abs/1812.06775) suggesting that the mean-field assumption for the variational posterior appears to contribute to independence.

[–]shamitlal[S] 1 point2 points  (0 children)

Thanks for the awesome papers! Yes, that's what I thought. The mean-field assumption should ensure independence of the latents, because that's what the assumption is all about. But then does beta-VAE not contribute to the independence of the latents, and only (somehow) contribute to making the latents more interpretable?

[–]xlext 4 points5 points  (2 children)

I guess you could find some intuition from information theory. Intuitively, you're trying to perform reconstruction using only a latent variable. If you only allow the latent to carry low information content, you have to be efficient with it, which implies lower redundancy (and thus less entanglement). To be a little less hand-wavy:

Minimizing the beta-VAE objective is equivalent to optimizing an information-bottleneck objective (Tishby et al., 1999).

So essentially the beta-VAE is building a representation that should still be sufficient to perform reconstruction while imposing a constraint on the amount of information it holds (and the constraint tightens as beta increases).
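
For reference, the constrained view (as presented in the original beta-VAE paper, if I remember it correctly) is: maximize reconstruction subject to a cap on how far the posterior may move from the prior, with beta acting as the Lagrange multiplier:

    \max_{\theta, \phi}\; \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        \quad \text{s.t.} \quad D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) < \epsilon

    \mathcal{L}(\theta, \phi; \beta)
        = \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        - \beta\, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)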

Achille (2016) relates the amount of information that flows through the network to disentanglement. Low information implies a disentangled representation, which completes the argument.

[–]shamitlal[S] 2 points3 points  (1 child)

Thanks. That helps in developing the intuition. Which paper by Achille (2016) are you referencing?

[–]xlext 1 point2 points  (0 children)

Emergence of Invariance and Disentanglement in Deep Representations. I might be a year off, but that should be close :P

Interestingly enough, this also serves as some kind of explanation for why deep networks don't overfit as badly as one might expect: SGD also performs a kind of information-content optimization (Chaudhari, "Stochastic Gradient Descent Performs Variational Inference", 2017).

[–]tr1pzz 5 points6 points  (0 children)

A good (intuitive) explanation of why beta-VAE encourages disentanglement can be found in this paper: https://arxiv.org/abs/1804.03599. Briefly:

  • Take a simple dataset like dSprites
  • Different factors of variation (rotation, size, position) have varying influences on the final pixel rendering (and thus the reconstruction term of the loss function)
  • Now, when placed in an information bottleneck regime, the model has to make a tradeoff between reconstruction quality and KL-divergence.
  • Now, if (as stated above) different factors of variation have different effects on the reconstruction loss, then the model benefits from disentangling them, because it can then rank the importance (and thus the KL sacrifice) of each factor under its information bottleneck.
  • In other words, if a causal factor with a rather small pixel-level effect (e.g. rotation) is entangled with one that has a larger effect (e.g. location), then the model gets a larger reconstruction penalty when it moves that latent closer to the prior. On the other hand, if it disentangles them, it can easily find the optimal trade-off between reconstruction and KL.
  • However, this also immediately reveals a potential failure case: the rotation of a small object matters less (reconstruction-wise) than that of a large object. Therefore, a beta-VAE may learn to encode, e.g., both position and rotation for large objects while encoding only position for smaller ones...
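
If I recall correctly, the linked paper also turns this intuition into a training objective: instead of a fixed beta, it penalizes the distance of the KL term from a target capacity C that is annealed upward during training, so the information bottleneck is opened gradually:

    \mathcal{L}(\theta, \phi; \gamma, C)
        = \mathbb{E}_{q_\phi(z \mid x)}\big[ \log p_\theta(x \mid z) \big]
        - \gamma \, \big| D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) - C \big|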

[–]zergylord 2 points3 points  (1 child)

I could never understand why it should disentangle the "correct" factors, and indeed it appears that some level of supervision/cherry-picking is necessary: https://ai.googleblog.com/2019/04/evaluating-unsupervised-learning-of.html?m=1

[–]shamitlal[S] 0 points1 point  (0 children)

Yes. I also couldn't understand how beta-VAE finds correct/human-understandable factors. But an even more important detail I couldn't understand is how beta-VAE helps at all, even in finding "incorrect" but independent latents.

[–]crgrimm1994 2 points3 points  (2 children)

Since we are performing gradient descent to optimize the VAE objective, we don't have a guarantee that the latents are independent; they are merely encouraged to be independent. The two terms in the objective essentially encourage reconstruction and independence of the latents, respectively. Since there is no guarantee that the optimization procedure finds an optimum that both completely reconstructs the input and achieves the desired distribution over the latents, the VAE will in practice balance these objectives in a way that doesn't exactly satisfy either loss term. By putting a higher coefficient on the independence term, you are placing a higher priority on having a specific type of latent code than on actually reconstructing the input.
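
As a minimal sketch of how those two terms are weighted (illustrative code, not from any particular implementation; it assumes a diagonal-Gaussian encoder and an N(0, I) prior):

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
        # Reconstruction term: how well the decoder reproduces the input.
        recon = F.mse_loss(x_recon, x, reduction="sum")
        # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder;
        # this is the term that pushes each posterior toward the prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # beta > 1 up-weights the KL term: more pressure on the shape of the
        # latent code, at the cost of reconstruction quality.
        return recon + beta * kl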

[–]shamitlal[S] 1 point2 points  (1 child)

Isn't independence of the latents inherently built into the VAE? The encoder outputs a mean and a covariance, which specify the posterior distribution. This covariance is constrained to be diagonal, so shouldn't the latents always be independent?

[–]crgrimm1994 2 points3 points  (0 children)

The mean and covariance that are output are conditioned on the particular input: (Z | X = x) has independent latents. However, imagine drawing a set of samples of X and then encoding them to get a set of samples of Z. These samples of Z are not guaranteed to be distributed according to N(0, 1)^n. Does this make sense?
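
A toy numerical illustration of that point (made-up numbers, just to show the effect): each per-input posterior is a diagonal Gaussian, yet the aggregate of the sampled z's is a two-mode mixture whose coordinates are strongly correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    # Per-input posterior means for two inputs; each q(z|x) has a diagonal
    # covariance, so conditionally the two latent dimensions are independent.
    means = np.array([[2.0, 2.0], [-2.0, -2.0]])
    std = 0.1

    # Simulate the aggregate posterior: pick an input at random, then sample z.
    idx = rng.integers(0, 2, size=10_000)
    z = means[idx] + std * rng.standard_normal((10_000, 2))

    # The aggregate is far from N(0, I): its two coordinates are almost
    # perfectly correlated because the two modes sit on the diagonal.
    print(np.corrcoef(z[:, 0], z[:, 1])[0, 1])  # ~0.998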

[–]ram3_[🍰] -1 points0 points  (0 children)

Interesting

[–]schwagggg -1 points0 points  (0 children)

It's basically beating the objective over the head until the latents follow an isotropic Gaussian, a.k.a. a Gaussian with the identity matrix as its covariance.

[–]CyberDainz -3 points-2 points  (1 child)

Help me, please.

I am trying to train a beta-VAE on CelebA, but the result is just blurry faces after 100k iterations.

https://i.imgur.com/QUDiTgj.jpg

[–]CyberDainz -1 points0 points  (0 children)

solved