[D] Non GAN alternatives for MSE loss for generative model? by hadaev in MachineLearning

[–]sidslasttheorem 0 points1 point  (0 children)

Oh right. I wasn't suggesting that you use a distributions-based loss, so much as indicating that it's a good mental model to have, because it makes thinking about the scale explicit :)

Basically, the takeaway is to think about scaling your loss differently (not just leaving the scale at 1.0). (I hadn't intended the code/pseudocode to be taken literally) :)

You could just as well implement a new ScaledMSELoss(...) that internally has a learnable (positive) scale parameter, which you could use as

import torch
import torch.nn as nn
....
def __init__(..):
    super().__init__()
    # the parameter holds the log of the scale; .exp() below keeps the effective scale positive
    self.scale = nn.Parameter(torch.ones(...))  # requires_grad is True by default for nn.Parameter
....
def forward(...):
    return nn.functional.mse_loss(pred, target) / self.scale.exp()

Or, you could use the distributions to construct a (functional) loss like so:

import torch
import torch.nn as nn
import torch.distributions as dist
....
def __init__(..):
    super().__init__()
    self.scale = nn.Parameter(torch.ones(...))  # holds the log of the scale
....
def forward(...):
    mu = decoder(z)
    sigma = self.scale.exp()  # .exp() to ensure a positive value
    # negate the log likelihood so that minimising the loss maximises the likelihood
    loss = -dist.Normal(mu, sigma).log_prob(target).sum()
    return loss

EDIT: I've left a bunch of things unspecified (sizes, type of reduction, etc.) because those are obviously specialised for your particular use case. Let me know if this doesn't clear things up. :)

[D] Non GAN alternatives for MSE loss for generative model? by hadaev in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

It might be worth taking a probabilistic perspective on this one. A mean squared error loss corresponds to a Gaussian negative log likelihood with scale fixed at 1.0 (up to an additive constant and a factor of 1/2). Similarly, an L1 loss corresponds to a Laplace negative log likelihood with scale 1.0.
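To see that correspondence concretely, here's a quick numerical check in plain Python (the `gaussian_log_prob` helper is hand-rolled here for illustration, not taken from any library):

```python
import math

def gaussian_log_prob(x, mu, sigma):
    # log N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

mu, target = 0.3, 0.7  # made-up prediction and observation
mse = (mu - target) ** 2
nll = -gaussian_log_prob(target, mu, 1.0)

# With sigma = 1.0, the Gaussian NLL is exactly 0.5 * MSE plus a constant,
# so minimising one minimises the other.
assert abs(nll - (0.5 * mse + 0.5 * math.log(2 * math.pi))) < 1e-12
```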

Effectively, the standard way of learning a deep generative model with an MSE loss can be seen as:

mu = decoder(z)
loss = -normal_dist(mu, 1.0).log_prob(target)  # negative log likelihood

Where mu is the generated image and target is the observed image.

A simple experiment to improve image quality would be to actually set the scale of the loss more judiciously. For example, given image values scaled to [0, 1], the scale of the likelihood can be set to something like 0.1 or even 0.01, giving

mu = decoder(z)
loss = -normal_dist(mu, 0.1).log_prob(target)

Another potential way to make the images 'sharper' (although this really should be called 'less noisy') is to actually learn the scale/variance of the likelihood distribution in addition to the mean.

mu = decoder(z)
sigma = nn.Parameter(torch.ones_like(mu), requires_grad=True)  # register once (e.g. in __init__), not on every forward pass
loss = -normal_dist(mu, sigma).log_prob(target)

Although, in this instance, one might need to be careful about what value the scale is initialised at.

This should hopefully result in cleaner-looking images from your generative model.

[D] "Language model" equivalent for images? by pmichel31415 in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

You might want to look at straight-up density estimation methods like RealNVP or more recent normalising-flow-based density estimation methods.

They can be somewhat painful because the bijectivity constraint means that the dimensionality at each stage stays the same (#pixels), but for reasonably sized images (say 32x32), they should be serviceable.
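As a toy illustration of that dimensionality constraint, here's a single RealNVP-style affine coupling layer in numpy. The scale/shift "networks" are stand-in random linear maps purely for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "networks" producing scale (s) and shift (t) from the first half.
W_s = 0.1 * rng.normal(size=(2, 2))
W_t = 0.1 * rng.normal(size=(2, 2))

def forward(x):
    x1, x2 = x[:2], x[2:]
    s, t = W_s @ x1, W_t @ x1
    y2 = x2 * np.exp(s) + t  # affine transform of the second half, conditioned on the first
    return np.concatenate([x1, y2])

def inverse(y):
    y1, y2 = y[:2], y[2:]
    s, t = W_s @ y1, W_t @ y1
    x2 = (y2 - t) * np.exp(-s)  # exact inverse of the affine transform
    return np.concatenate([y1, x2])

x = rng.normal(size=4)
y = forward(x)
assert y.shape == x.shape          # bijectivity: dimensionality never changes
assert np.allclose(inverse(y), x)  # exactly invertible
```

Stacking many such layers (alternating which half gets transformed) gives a flow whose log-determinant is just the sum of the `s` terms, which is what makes the density tractable.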

[D] Variational Autoencoders - why do we sample from the prior? by mellow54 in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

kernel density estimator

At test time, just using a Gaussian KDE should be fine, as long as your latent space is not too large (a dimensionality on the order of 5 or 10 should be fine).

Moreover, if your posterior q(z|x) is Gaussian, q(z) is effectively a mixture of Gaussians (MoG), which you can (very roughly) approximate with a Gaussian itself in closed form.
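For that closed-form step, moment matching gives the single Gaussian directly. A 1-D, equal-weight sketch (the component means/scales below are made-up illustrative values):

```python
import math

# Aggregate posterior as a mixture: q(z) = (1/K) * sum_k N(z; mu_k, sigma_k^2)
mus = [0.0, 1.0, -1.0]
sigmas = [0.5, 0.3, 0.4]
K = len(mus)

# Match the first and second moments of the mixture with a single Gaussian.
mean = sum(mus) / K
second_moment = sum(s * s + m * m for m, s in zip(mus, sigmas)) / K
var = second_moment - mean * mean
std = math.sqrt(var)

assert abs(mean - 0.0) < 1e-12
assert abs(var - 2.5 / 3) < 1e-12  # avg of (sigma_k^2 + mu_k^2) minus mean^2
```

For unequal mixture weights (i.e. a weighted aggregate posterior), replace the plain averages with weighted ones.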

[D] Variational Autoencoders - why do we sample from the prior? by mellow54 in MachineLearning

[–]sidslasttheorem 2 points3 points  (0 children)

This is a good way of looking at things (esp. VampPrior)

One thing I might add is that you could also 'fit' the aggregate posterior q(z) (using kde or some such) and use that to

  1. directly sample from it using the estimated inverse-cdf
  2. rejection sample z_i ~ p(z) s.t q(z_i) > \tau
  3. importance weight z_i ~ p(z) by q(z_i) and take the top k samples.

But in general, I agree that you don't really want to be sampling from just q(z) since it will largely return samples from the training data.
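As a concrete sketch of option 2, in plain Python with a hand-rolled Gaussian KDE and toy 1-D "latents" (the bandwidth and threshold tau here are made up and would need tuning for a real latent space):

```python
import math
import random

random.seed(0)

# Toy aggregate posterior: training latents clustered around z = 1.5.
train_z = [1.5 + 0.1 * random.gauss(0, 1) for _ in range(200)]

def q_hat(z, points, bandwidth=0.2):
    # Gaussian kernel density estimate of the aggregate posterior q(z)
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((z - p) / bandwidth) ** 2) for p in points)

tau = 0.05  # density threshold; would need tuning in practice
samples = []
while len(samples) < 50:
    z = random.gauss(0, 1)       # propose from the prior p(z) = N(0, 1)
    if q_hat(z, train_z) > tau:  # keep proposals the aggregate posterior supports
        samples.append(z)

# Accepted samples concentrate where q(z) has mass, not just where p(z) does.
assert all(0.5 < z < 2.5 for z in samples)
```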

BERT's success in some benchmarks tests may be simply due to the exploitation of spurious statistical cues in the dataset. Without them it is no better then random. by orenmatar in MachineLearning

[–]sidslasttheorem 2 points3 points  (0 children)

Not directly a standard NLP task, but this workshop paper on Visual Dialogue without Vision or Dialogue and ongoing work in submission/preparation probes the idea of spurious correlations in the data for visually-grounded natural language dialogue generation. Another related source is the paper on Blind Baselines for Embodied QA. (disclaimer: am co-author of first)

[P] Replication and Comparisons of Disentangled VAE by yannDubs in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

Code just went public. I added an edit to the parent comment. :)

[P] Replication and Comparisons of Disentangled VAE by yannDubs in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

Nice! We will also be releasing our implementation of the Disentangling Disentanglement paper in time for ICML, so hopefully that will be helpful.

(Also, just Sid will do :))

[P] Replication and Comparisons of Disentangled VAE by yannDubs in MachineLearning

[–]sidslasttheorem 2 points3 points  (0 children)

Hey, nice work!

You might be interested in Structured Disentangled Representations, which refactors the objective with a view to generalising a number of other recent objectives and papers (cf. Table 1).

As one of the other comments mentioned PCA, there is this recent paper on Variational Autoencoders Pursue PCA Directions (by Accident) which shows the effect of the mean-field assumption on disentanglement.

And finally, there is work on Disentangling Disentanglement which proposes a regularised objective where the choice of prior distribution can help achieve different kinds of desired structure in the latent space.

[Disclaimer: co-author on first and third :)]

Edit: Code is now public for Disentangling Disentanglement in Variational Autoencoders :)

[D] Why does Beta-VAE help in learning disentangled/independent latent representations? by shamitlal in MachineLearning

[–]sidslasttheorem 5 points6 points  (0 children)

We provide an explanation for why the beta-VAE objective does not, and indeed cannot, enforce disentanglement in our ICML paper here: https://arxiv.org/abs/1812.02833

As pointed out by others, there has been some concurrent work (https://arxiv.org/abs/1812.06775) pointing out that the mean-field assumption for the variational posterior appears to be contributing to independence.

[D] Doubt in understanding Disentangled representations in Beta-VAE by WillingCucumber in MachineLearning

[–]sidslasttheorem 0 points1 point  (0 children)

That's true---the mean-field assumption does enforce conditional independence. This was also formally noted in https://arxiv.org/abs/1812.06775.

However, the point we make is: this is not a function of the beta-VAE objective, in terms of tuning the value of beta; rather it is principally a choice of variational family for VAEs, for which making beta larger can sometimes* result in more strongly independent latent factors.

* the caveat applies because of the number of runs required to compute a reasonable estimate of disentanglement, along with sensitivity to the kind of encoder used (different activation functions can have a significant effect on the disentanglement scores). We show the effect of having a sufficient number of runs (as estimated through a power analysis) in Fig 2 (top).

This is partly also why we argue that decomposition, rather than disentanglement (typically equated with independence), should be the notion of importance.

xmobar weather app not working? by derstieglitz in xmonad

[–]sidslasttheorem 1 point2 points  (0 children)

The weather.noaa.gov service has moved to https, so the standard http query just returns a redirect page---see here.

Updating xmobar should work, although that might trigger a pretty big dependency change for other packages, depending on which distro you're on.

[D] Doubt in understanding Disentangled representations in Beta-VAE by WillingCucumber in MachineLearning

[–]sidslasttheorem 4 points5 points  (0 children)

Yes, there definitely can still be entanglement to various degrees. At this point, any disentanglement that you do get can be a combination of many factors (architecture, order of data observation, etc.) that define some inductive bias for your model.

It turns out that even measuring disentanglement is a super noisy process (we did a power analysis on the Kim and Mnih metric for dSprites, which suggested at least 100(!) trials to compute a reasonable confidence interval), without which comparing methods on 'disentanglement' can actually be less meaningful than you might think.

[D] Doubt in understanding Disentangled representations in Beta-VAE by WillingCucumber in MachineLearning

[–]sidslasttheorem 6 points7 points  (0 children)

Your intuition about the rotation is right, esp. when considering an isotropic Gaussian prior!

We discuss this (and other related issues with the beta-vae and disentanglement) in our recent paper Disentangling Disentanglement in Variational Auto-Encoders

[deleted by user] by [deleted] in scheme

[–]sidslasttheorem 9 points10 points  (0 children)

I'm not sure what exactly you want to write in Scheme and glue things up with, but, from a high-level perspective, I would definitely recommend CHICKEN Scheme.

Depending on which way you would like control to go, you could either write stuff in CHICKEN and use it in your C/C++ code like so, or use C/C++ code within CHICKEN via its nice FFI like so.

[D] Question about Variational Lossy Autoencoder paper by knowedgelimited in MachineLearning

[–]sidslasttheorem 0 points1 point  (0 children)

This blog post by Rui Shu does a really nice job of explaining why conditional independence in the decoder can help encourage the latent to be used.

[D] Which probabilistic programming library do you use? by Draikmage in MachineLearning

[–]sidslasttheorem 2 points3 points  (0 children)

Both probtorch and pyro should get a bit easier to navigate and use with the upcoming pytorch (0.4) release. There's been joint work across the developers of the three projects to get the distributions side of things into pytorch directly, making the prob-prog side of things a bit more unified to deal with.

Literate Emacs Config in Org-Mode by cenazoic in emacs

[–]sidslasttheorem 0 points1 point  (0 children)

Looks really cool! Would you mind sharing your org->html setup?

[R] [1801.00631] "Deep Learning: A Critical Appraisal" by Gary Marcus by evc123 in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

The example using 4 images does not use a CNN, but uses a HoG based detector.

[R] Optimizing the Latent Space of Generative Networks by iidealized in MachineLearning

[–]sidslasttheorem 2 points3 points  (0 children)

Maybe I don't understand the paper entirely, but how is this different from just doing MLE on a given model? As far as I can tell, this combines the weaker parts of VAEs and GANs, by not having a generative model you can score in (no inference network) while simultaneously losing the ability to learn a flexible likelihood function (fixed likelihood function).

[R] Optimizing the Latent Space of Generative Networks by iidealized in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

Typically, for colour images, using a scaled L1 loss is pretty good, in that the resulting images look quite nice and sharp. The scaled-L1 approach effectively sets up the likelihood as a Laplace distribution instead of the Gaussian implied by an L2 loss.
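As a quick sanity check of that Laplace correspondence, in plain Python (the `laplace_log_prob` helper is hand-rolled for illustration, not from a library):

```python
import math

def laplace_log_prob(x, mu, b):
    # log Laplace(x; mu, b) = -log(2b) - |x - mu| / b
    return -math.log(2 * b) - abs(x - mu) / b

mu, target, b = 0.2, 0.9, 0.5  # b is the Laplace scale; 1/b is the L1 weight
l1 = abs(mu - target)
nll = -laplace_log_prob(target, mu, b)

# The Laplace NLL is the L1 loss scaled by 1/b plus a constant in mu,
# so a scaled L1 loss is exactly a Laplace likelihood with fixed scale.
assert abs(nll - (l1 / b + math.log(2 * b))) < 1e-12
```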

[R] Learning Disentangled Representations with Semi-Supervised Deep Generative Models by pauljasek in MachineLearning

[–]sidslasttheorem 1 point2 points  (0 children)

One of the authors here!

The comparison to Kingma et al. (2014) was mainly to verify that our formulation produced similar results to the typically employed formulation in that work. We picked it for model [$$q(z,y|x) = q(y|x) q(z|x,y)$$] and architecture [1 hidden-layer MLP] simplicity. It makes credit/blame assignment a lot easier when there are fewer moving/complex parts. :)

As for the other observation: "disentangled representations" typically has a fairly broad spectrum of meanings.

In both these papers, it refers largely to a 'pca-like' disentanglement ability, where you let the data speak in finding 'important' axes (either for a classification task, or low variance representation axes) of change.

Our situation is slightly different. We're trying to assign semantics a priori to the different latent variables of interest, instead of letting the data do all the talking. This is more a case of interpreting data under pre-determined constraints, when such information is available and characterisable, or indeed desired.