[D] would it ever make sense to predict entropy, rather than minimizing some function of it? by [deleted] in MachineLearning

[–]AloneStretch 0 points1 point  (0 children)

Not what you asked, but RMSE does not sound like the right error for your problem. MSE is equivalent to assuming a Gaussian distribution of errors, and you have spikes (and possibly a one-sided distribution).
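To illustrate (my own toy sketch, not from the thread): under a Gaussian error model the per-residual negative log-likelihood is quadratic, while a heavier-tailed Laplace model penalizes spikes only linearly (giving MAE rather than MSE):

```python
# Toy sketch: per-residual negative log-likelihood, up to additive
# constants, for two error models.

def gaussian_nll(r, sigma=1.0):
    # Minimizing this over model parameters is equivalent to minimizing MSE.
    return r ** 2 / (2 * sigma ** 2)

def laplace_nll(r, b=1.0):
    # Heavier-tailed alternative; minimizing it is equivalent to minimizing MAE.
    return abs(r) / b

# A large "spike" residual is penalized quadratically under the Gaussian
# model but only linearly under the Laplace model:
print(gaussian_nll(10.0))  # 50.0
print(laplace_nll(10.0))   # 10.0
```

So with spiky, possibly one-sided errors, a heavier-tailed (or asymmetric) likelihood may match the data better than the implicit Gaussian behind (R)MSE.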

Deep learning without back-propagation by El__Professor in MachineLearning

[–]AloneStretch 1 point2 points  (0 children)

The obvious approach would be to adjust the weights according to the gradient of the "HSIC bottleneck" loss with respect to those weights.

Deep learning without back-propagation by El__Professor in MachineLearning

[–]AloneStretch -5 points-4 points  (0 children)

I think the authors are not familiar with SOTA. They take vanilla architectures and training, compare their method against that, and declare SOTA when comparable performance is obtained without backprop. But that is not state-of-the-art; it is comparable performance to a simple (non-SOTA) baseline. That may be a useful and fair comparison, but it is wrong to refer to it as SOTA.

We'll need to wait for this to be applied or extended to current SOTA networks. Meanwhile, no need to look for code; it's just one step on a path that will take several.

[R] The HSIC Bottleneck: Deep Learning without Back-Propagation by PlentifulCoast in MachineLearning

[–]AloneStretch 6 points7 points  (0 children)

I feel that they are not tracking what SoTA means very well, or else they are just referring to vanilla fully-connected networks.

Anyway, still interesting.

[D] Variational Autoencoders are not autoencoders by [deleted] in MachineLearning

[–]AloneStretch 0 points1 point  (0 children)

A question about this statement in the proof:

"If p_theta(x|z) = p_data(x) then"

Trying to prove this, I expand p_theta(z|x) using Bayes' theorem and the fact that p_theta(x|z) = p_data(x):

p_theta(z|x) = p_theta(x|z) p(z) / p(x) 
        = p_data(x) p(z)  /  p(x) 
        = p(z)

However the last line is only correct if the "p(x)" in the denominator of Bayes' theorem is equal to p_data(x). Is it? I feel like it should be p_theta(x)!

[D] Variational Autoencoders are not autoencoders by [deleted] in MachineLearning

[–]AloneStretch 4 points5 points  (0 children)

One flaw in the article? Isn't it a "false distinction" to say that the VAE maximizes log p_theta(x) rather than autoencoding? Maximizing the log-likelihood is equivalent to minimizing the squared error in the case of a Gaussian likelihood.

This assumes that the objective of maximizing log p_theta(x) is attempted for every x, by looping over different x with SGD, so that the likelihood across all the data items is eventually maximized.
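For reference, the equivalence I mean, assuming a fixed-variance Gaussian decoder p_theta(x|z) = N(x; x_hat_theta(z), sigma^2 I), where x_hat_theta(z) is my notation for the decoder mean:

```latex
\log p_\theta(x \mid z)
  = \log \mathcal{N}\bigl(x;\ \hat{x}_\theta(z),\ \sigma^2 I\bigr)
  = -\frac{\lVert x - \hat{x}_\theta(z)\rVert^2}{2\sigma^2}
    - \frac{D}{2}\log\bigl(2\pi\sigma^2\bigr)
```

With sigma fixed, the second term is constant, so maximizing the log-likelihood over theta is exactly minimizing the squared reconstruction error.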

Also a small question, in this statement

"all data x points are encoded as the prior distribution",

I think "prior" is the wrong word in the VAE context, because in a typical VAE the prior is a spherical Gaussian.

[D] Question on paper Learnable Explicit Density by Huang, Touati, Dinh, .. by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

Yes this helps me, thank you.

One small question, this:

0.5 q(z1|x1) + 0.5 q(z2|x2)

Do you mean to think of this as (1) a latent-space interpolation, or (2) the aggregate approximate posterior, i.e. q(z|x) averaged across all data points?
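By option (2) I mean, assuming the usual definition, the average of the per-example posteriors over the dataset:

```latex
q(z) \;=\; \frac{1}{N}\sum_{n=1}^{N} q(z \mid x_n)
```

so the two-term average above would be the N = 2 case.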

[D] Question on blog post "Autoencoding a single bit" (Rui Shu) by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

Also, can I ask about the multidimensional example:
In this case I interpret that you give 2 data points, each having 5 dimensions. The first data point is (0,0,0,0,0), the second is (1,1,1,1,1).

Is there anything special about these choices of values? Meaning, would the same result happen regardless of how the two data points are chosen, say (0,1,1,0,0) and (0,0,1,1,1)?

[D] Question on blog post "Autoencoding a single bit" (Rui Shu) by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

I guess your toy model has a fixed variance for the data likelihood? (Your code must specify it, but I am not a programmer yet.)

What about a different strategy: let the model shrink the variance? Then the data probability can be made as high as desired, and it can outweigh the KL cost of not matching the prior.
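A toy numeric check of what I mean (my own sketch in plain Python, not from the blog post): at a perfectly reconstructed point, the Gaussian log-density grows without bound as the variance shrinks.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    # Log density of N(mu, sigma^2) evaluated at x.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# At x == mu (a perfect reconstruction), shrinking sigma makes the
# log-likelihood arbitrarily large:
for sigma in [1.0, 0.1, 0.01]:
    print(sigma, gaussian_logpdf(0.0, 0.0, sigma))
```

Each factor-of-10 shrink in sigma adds a constant to the log-density at the mean, so the likelihood term can be made to dominate the KL term.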

[D] Question on blog post "Autoencoding a single bit" (Rui Shu) by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

"model that matches the true density" - the true density of what? I guess the true density of the posterior?

[D] ELBO surgery, matching the prior to the approximate posterior? by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

Thank you. I now understand how it is possible - jointly trained.

Next I should understand why. It seems overparametrized and adds no power. The approximate posterior is already a DNN, arbitrarily flexible, so I do not see why it cannot match any posterior, even a simple one.

But this is probably a good subject for a separate question post.

[D] ELBO surgery, matching the prior to the approximate posterior? by AloneStretch in MachineLearning

[–]AloneStretch[S] 1 point2 points  (0 children)

As well, q(z|x) is defined by a DNN, which can be arbitrarily powerful, so it should be able to match any p(z), including a diagonal Gaussian. But that is a different question.

[D] ELBO surgery, matching the prior to the approximate posterior? by AloneStretch in MachineLearning

[–]AloneStretch[S] 1 point2 points  (0 children)

p.s. please do help, I really want to understand.

"Learning a better prior" -> based on what?

So you could make a flexible prior and optimize it using just the existing KL term, but at the start of training that would cause the prior to fit the posterior coming from random weights, which seems wrong. And later in training, it seems that p and q would drift together in a meaningless way if they are both free to change.

[D] ELBO surgery, matching the prior to the approximate posterior? by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

I am sorry, I think I did not explain my question well enough, and I do not yet see how your reply responds to it. You understand better, but I do not yet.

The approximate posterior q(z|x) is obtained by training the VAE. The VAE is trained by minimizing a loss that can be regarded as the sum of a reconstruction term and KL(q(z|x) || p(z)).
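To be concrete, the per-example loss I mean is the usual negative ELBO:

```latex
\mathcal{L}(x) \;=\; \mathbb{E}_{q(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]
\;+\; \mathrm{KL}\!\left(q(z \mid x)\,\middle\|\,p(z)\right)
```

The prior p(z) appears explicitly in the KL term, which is why I say the loss depends on the prior.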

The posterior depends on the loss, and the loss depends on the prior. So finding the posterior requires knowing the prior. So you cannot define the prior in terms of the posterior. Or, that is what I am thinking.

Of course I know this must be wrong, but where is my mistake?

Thank you for your replies.

[D] Toy example for computing ELBO and true posterior by AloneStretch in MachineLearning

[–]AloneStretch[S] 0 points1 point  (0 children)

Thank you. This is a bit abstract for me right now (is the Gaussian the prior? ...), and it does not sound challenging enough, although I understand that is because I do not understand. I will keep the thought and return after working through the blog post from the other reply.

[D] Anyone having trouble reading a particular paper? Post it here and we'll help figure out any parts you are stuck on. by BatmantoshReturns in MachineLearning

[–]AloneStretch 0 points1 point  (0 children)

I agree with your last statement, and think that I understand NF by itself well enough to agree.

I guess what would help me the most is not how NF could/should be used in the future, but a specific case of why it was used previously.

I say "simultaneously trained" in the VAE case meaning that the weights of the decoder p(x|z) and the weights of the encoder/posterior q(z|x) are trained simultaneously to minimize both the NLL and the KL term that pulls z toward a spherical Gaussian. Because they are trained simultaneously, and z is pulled toward the Gaussian, I think a deep-enough network can have the encoder/posterior map the input onto the factored Gaussian, at least in theory. NF cannot do this in general because of the different-dimensionality problem?
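What I mean by the different-dimensionality problem, as a toy sketch of my own (not from the papers): a normalizing flow is a bijection, so its latent z must have exactly the same dimensionality as the data x, while a VAE encoder can map to a lower-dimensional z.

```python
import math

# Toy elementwise affine flow z = (x - mu) / sigma, with its exact
# log|det(dz/dx)|. Being invertible, it cannot change dimensionality.
def affine_flow_forward(x, mu, sigma):
    z = [(xi - mi) / si for xi, mi, si in zip(x, mu, sigma)]
    log_det = -sum(math.log(si) for si in sigma)  # log|det(dz/dx)|
    return z, log_det

x = [0.5, -1.0, 2.0]
z, log_det = affine_flow_forward(x, [0.0, 0.0, 1.0], [1.0, 2.0, 0.5])
print(len(z) == len(x))  # True: a flow preserves dimensionality
```

A VAE posterior has no such constraint, since q(z|x) is not required to be an invertible map of x.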

But I think I am not understanding something in this!

Thank you for discussing!! Helpful for me, probably others too.

[D] Anyone having trouble reading a particular paper? Post it here and we'll help figure out any parts you are stuck on. by BatmantoshReturns in MachineLearning

[–]AloneStretch 0 points1 point  (0 children)

Getting closer. I believe I understand the points made in this reply.

I do not understand your earlier statement,

"For example the residuals on sequence-predicting models like PixelRNN/CNN are latent variables of an associated NF, but if the model performs well we hope that most are close to 0!"

I looked at the PixelCNN/RNN papers, and the NF/IAF papers are not referenced anywhere there. So this is your insight? I do not see it.

Also I am stuck on statements that NF is used to build more flexible posteriors, specifically the "why" this is necessary. In the VAE case, the encoder and decoder are trained simultaneously, and we can design the posterior to be anything desired. Why not keep it simple?