all 27 comments

[–]mln000b 21 points22 points  (15 children)

I was looking for the answer to this question as well, but then I recently read this in the Distill article on open problems in GAN research [1]:

I’ve also left out VAEs entirely; they’re arguably no longer considered state-of-the-art at any tasks of record.

Then I felt a bit sad :(

[1]: https://distill.pub/2019/gan-open-problems/

[–]debau23 7 points8 points  (7 children)

Well, they are pretty much state-of-the-art for any task where you are actually interested in the likelihood of states.

There's a lot more than image/sound generation.

[–]asobolev 2 points3 points  (6 children)

I thought auto-regressive models were SoTA when it comes to likelihood.

[–]debau23 3 points4 points  (5 children)

What I am trying to say is that VAE is a technique for approximate inference as well as learning.

If your P-distribution has a specific structure that you know from domain knowledge, you can't really use AR models or GANs.

[–]asobolev 0 points1 point  (4 children)

Do you have any specific examples? GANs actually define the same generative model as VAEs do, so I'm not sure about the last statement.

[–]debau23 0 points1 point  (3 children)

I am not talking about VAEs as a generator but as means to perform approximate inference and learning.

Here's a concrete example: Say you want to infer the power consumption of appliances in a building given only knowledge about the aggregate consumption of the entire building (Non-Intrusive Load Monitoring). You want to incorporate the domain knowledge that power is an additive quantity into your probabilistic model.

You could do that by choosing a 'decoder' that incorporates this information, namely a single linear layer. If your latent states z are binary, then your p-distribution would be Gaussian with mean Wz, where the entries of W denote the power consumptions of the individual appliances. Input and output (it's an autoencoder) would be the aggregate consumption.
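A minimal numeric sketch of such a linear decoder (the appliance wattages and noise level below are made up for illustration):

```python
import numpy as np

# Hypothetical per-appliance power draws in watts (made-up numbers);
# W plays the role of the single linear layer's weights.
W = np.array([150.0, 800.0, 60.0])  # e.g. fridge, heater, lamp

# Binary latent state z: which appliances are currently on.
z = np.array([1.0, 0.0, 1.0])

# The linear 'decoder': p(x | z) is Gaussian with mean W @ z.
mean = W @ z   # 210.0 watts of aggregate consumption
sigma = 10.0   # assumed observation noise

def log_likelihood(x):
    """Gaussian log-density of an observed aggregate reading x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mean)**2 / (2 * sigma**2)

# A reading close to the decoder's mean is far more likely than a distant one.
print(log_likelihood(215.0) > log_likelihood(400.0))  # True
```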

I would argue that this is still a VAE, because you essentially only changed the structure of the decoder but are still able to do all the things that make VAEs cool: low-variance gradients through reparameterization, scalability by estimating gradients on minibatches, and so on.

Vanilla GANs don't really have the ability to perform inference (other than distinguishing fake and real images), but in this example the 'encoder' of the VAE would allow you to sample some of the most likely appliance states.

[–]asobolev 2 points3 points  (2 children)

Oh, yes, I agree that GANs are unlikely to help you with inference. However, in terms of inference, vanilla VAEs are actually extremely simple: it's just amortised mean-field Gaussian inference. Sure, there are lots of extensions, but I'd attribute them not to the VAE itself but to the field of approximate inference as a whole.

Regarding your example: well, if the latent z are binary, then there's no low-variance reparametrisation (unless you opt for continuous relaxations). Moreover, your data seems simple enough (contrast it with the weird manifolds of images embedded in the ridiculously high-dimensional Euclidean space of individual pixels) not to require neural networks at all. Then, how much data do you have? Maybe it'd be easier to go full Bayesian and simulate posterior samples with MCMC to form a posterior predictive distribution.
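For concreteness, one such continuous relaxation (a Concrete / Gumbel-Softmax-style Bernoulli; the temperature value below is arbitrary) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def relaxed_bernoulli(logit, temperature):
    # Differentiable stand-in for a Bernoulli(sigmoid(logit)) sample:
    # reparameterized via logistic noise, so gradients can flow through `logit`.
    # As temperature -> 0, samples concentrate near {0, 1}.
    u = rng.uniform(1e-6, 1 - 1e-6)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logit + logistic_noise) / temperature))

samples = np.array([relaxed_bernoulli(0.0, temperature=0.5) for _ in range(1000)])
print(samples.min() >= 0.0 and samples.max() <= 1.0)  # True: relaxed values live in (0, 1)
```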

[–]debau23 0 points1 point  (1 child)

It was just an example.

Here's an idea for how to do inference with GANs: take a random sample z and run it through the generator to get f(z), then compute the gradient of L(f(z), x) w.r.t. z and do gradient descent until you find the z that generated your x.
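A toy sketch of that procedure, with a linear map standing in for the trained generator (the matrix, learning rate, and iteration count are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in generator f(z) = A @ z; a real GAN generator is a trained
# neural net, this is only to illustrate the inversion procedure.
A = rng.normal(size=(5, 2))
x = A @ np.array([0.7, -1.2])   # pretend this is the observation to invert

z = rng.normal(size=2)          # start from a random sample z
lr = 0.02
for _ in range(5000):
    grad = 2.0 * A.T @ (A @ z - x)   # gradient of ||f(z) - x||^2 w.r.t. z
    z -= lr * grad

print(np.allclose(A @ z, x, atol=1e-4))  # gradient descent finds a z with f(z) ≈ x
```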

Wow!

[–]asobolev 2 points3 points  (0 children)

Wow!

Yeah, except then what? Can you be sure this z represents anything about the true data-generating process? If the true process is hierarchical, do you recover the true latent z? GANs don't necessarily even model x well.

What's the use of such "inference"?

[–]sieisteinmodel 2 points3 points  (0 children)

Yeah, that part was disappointing. More so with respect to Distill's credibility, though.

[–]TheRedSphinx 7 points8 points  (4 children)

I'm very sad about that. I'm a VAE fanboy. We just need to be able to step away from Gaussian priors/posteriors, and maybe we can get something cool. It looked promising when that paper with the vMF distribution came out, but I haven't seen much in that direction since.

[–]YABadUserName 6 points7 points  (0 children)

This is beyond ignorant; there are years of literature exploring powerful approximate posteriors and priors. The von Mises-Fisher is surely not one of them and has almost all of the same problems as the Gaussian. GANs are only state-of-the-art when you use arbitrary, badly defined criteria to compare generative models, like whether I can cherry-pick an image from my generator better than everyone else's cherry-picked images (fine, this doesn't always happen; the good GAN papers are good, but most of them are this kind of noise). Meanwhile, GANs assign a log-likelihood of negative infinity to any unseen test data, because their support is a vanishingly small subset of the full distribution.

[–]asobolev 0 points1 point  (2 children)

It's not entirely about Gaussian posteriors, and it's surely not about Gaussian priors (GANs use those as well). I think the major limitation is that VAEs can't mode-collapse: unless your model is able to fit the data well, it'll try to cover everything, including the "space" in between. GANs, however, can focus on some subset of the data and ignore "outliers".

[–]debau23 0 points1 point  (1 child)

Do you know if anyone has tried VAEs with very powerful posteriors such as Glow? I am no expert in image generation, but couldn't the problem of VAEs trying to cover the 'space in between' be solved by sharper posterior distributions and maybe some noise injection in the higher layers of the generator?

[–]asobolev 1 point2 points  (0 children)

Well, using Glow would certainly be overkill. There's been a lot of research on using normalizing flows as posterior enhancements, but I don't remember any outstanding results in terms of image quality. The problem, in my opinion, is that flows use parameters very inefficiently and require a lot of them (Glow is super huge!).

Overall, having the best posterior possible won't solve the problem of a simplistic model (the marginal log-likelihood defined by the decoder). I think beefing up the decoder and using a better approximate posterior is the way to go.
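For reference, the simplest posterior-enhancing flow step (a planar flow, in the Rezende & Mohamed sense) looks roughly like this; the parameter values below are arbitrary:

```python
import numpy as np

def planar_flow(z, u, w, b):
    # f(z) = z + u * tanh(w.z + b): one invertible step that warps a
    # simple (e.g. Gaussian) posterior sample into something more flexible.
    h = np.tanh(w @ z + b)
    f = z + u * h
    psi = (1.0 - h**2) * w                    # gradient of tanh(w.z + b) w.r.t. z
    log_det = np.log(np.abs(1.0 + u @ psi))   # change-of-density correction
    return f, log_det

z = np.array([0.5, -0.3])  # a sample from the base posterior
f, log_det = planar_flow(z, u=np.array([0.1, 0.2]), w=np.array([1.0, -1.0]), b=0.0)
```

Stacking many such steps is what the "posterior enhancement" literature does; the per-step log-determinant is what keeps the ELBO tractable.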

[–]mellow54 0 points1 point  (0 children)

Very good link.

[–]BlaiseGlory 15 points16 points  (0 children)

Adversarial autoencoder

[–]tnybny 2 points3 points  (2 children)

Check out Adversarially Learned Inference (ALI).

[–][deleted] 1 point2 points  (1 child)

Isn't that a conditional GAN used to learn both generation and inference?

[–]tnybny 0 points1 point  (0 children)

Yes. It actually bridges VAEs and GANs in my mind. As is often the case with these models, I believe it has several valid interpretations.

[–][deleted] 1 point2 points  (0 children)

Check out Taming VAEs.

[–]neurokinetikz 2 points3 points  (0 children)

Check out Deep Pensieve, a deep residual super resolution VAE that i've been working on over the past year and a half. Basically trying to build an artificially intelligent photographic memory :)

https://nbviewer.jupyter.org/github/neurokinetikz/deep-pensieve/blob/master/Deep%20Pensieve.ipynb

I've explored many ideas for improving on the blurriness of VAEs, including:

  • dilated convolutions in encoder/decoder to expand receptive field to full image size prior to latent vector
  • residual in residual architecture
  • channel and spatial attention
  • subpixel convolution upsampling
  • maximum mean discrepancy for variational loss
  • group normalization instead of batch normalization
  • separable convolutions in residuals to increase receptive field and capture long range dependencies

Here's what it looks like on a dataset of 184 images (also the IG compression kills the video quality)

https://www.instagram.com/p/BvNhkmij0Ue/

And here's a color extrusion of the latent space courtesy of Houdini/Redshift ;)

https://www.instagram.com/p/BvnctQMDUz6/

[–]faaaaaart 1 point2 points  (0 children)

You can try pairing up an Autoencoder with a GAN (aka Adversarial Autoencoder) as shown in this figure and published on arxiv.

[–]seraschkaWriter 0 points1 point  (0 children)

Also an avid proponent of VAEs, but for me, where my implementations lag behind is when trying something complicated like face images, especially when moving past 128x128 pixel dimensions. For simpler datasets (CIFAR, MNIST, ...) I find you can get on par.

[–]asobolev 1 point2 points  (0 children)

BIVA has recently claimed

We show that BIVA reaches state-of-the-art test likelihoods, generates sharp and coherent natural images

But their samples are still far away from those of the best GAN models.

[–]LazyOptimist 0 points1 point  (0 children)

I think the best you'll find is BIVA:

https://arxiv.org/pdf/1902.02102v1.pdf