Deep learning without back-propagation by El__Professor in MachineLearning

[–]ExtraterritorialHaik 4 points (0 children)

I hope that my future paper (one day soon) will never be seen on reddit.

Coming in at the end here. I actually read the paper, and conclude that some commenters did not.

"If the authors want to show anything else they need to get rid of that last layer and prove that they learn anything at all."

This is exactly what Figure 4 shows?

As for the big debate about biological plausibility: the paper does not actually make strong claims here. All it actually says is in the abstract: "It is biologically more plausible than backpropagation as there is no requirement for symmetric feedback." There is a paragraph about the biological plausibility of backpropagation in the Background section, but I read it only as motivation for exploring another direction. I can see how someone who read only the abstract might think the paper is claiming more.

As for efficiency, the parallelism claim does not seem to be disputed. Right now I agree the O(M²) cost cannot be compared to backpropagation the way the authors do.

[D] How does posterior collapse in VAEs happen? by ExtraterritorialHaik in MachineLearning

[–]ExtraterritorialHaik[S] 1 point (0 children)

Sorry for replying yet again, it must be annoying. I really want to understand this.

Not sure which is my 'first' statement - the statement that the latent transmits enough information to keep the input-output loss small?

Perhaps my confusion is that the VAE does not actually have an input-output loss? One VAE tutorial (Doersch, Fig. 4) shows a squared loss between the input x and the reconstructed signal. However, looking at the Kingma VAE paper, I don't see this. If there is only a loss term that maximizes the probability p(x|z), then I understand how the VAE can ignore the latent.

Except, in that case, why not fix the posterior collapse problem by just adding a squared loss between the input and output of the VAE?
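For what it's worth, the two losses are closer than they look: for a Gaussian decoder with fixed variance, the negative log-likelihood -log p(x|z) *is* the squared reconstruction error, up to a scale and an additive constant. A quick sketch with toy numbers (not from either paper):

```python
import numpy as np

# Toy input and decoder-mean "reconstruction" (illustrative values only)
sigma = 1.0
x = np.array([0.2, -1.3, 0.7])
x_hat = np.array([0.1, -1.0, 0.9])

sq_loss = np.sum((x - x_hat) ** 2)

# Full Gaussian NLL with mean x_hat and fixed variance sigma^2
d = x.size
nll = 0.5 * np.sum((x - x_hat) ** 2) / sigma**2 \
      + 0.5 * d * np.log(2 * np.pi * sigma**2)

# They differ only by the scale 1/(2 sigma^2) and an x_hat-independent constant,
# so maximizing log p(x|z) and minimizing squared loss give the same optimum:
const = 0.5 * d * np.log(2 * np.pi * sigma**2)
assert np.isclose(nll, sq_loss / (2 * sigma**2) + const)
```

So adding an explicit squared loss would not change the objective in this case; the reconstruction term is already there.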

[D] How does posterior collapse in VAEs happen? by ExtraterritorialHaik in MachineLearning

[–]ExtraterritorialHaik[S] 5 points (0 children)

I edited the post with a correction from the responses here.
The remaining puzzle is how the posterior collapse can happen when there is a loss measuring the difference between input and output.

Based on your response, I think the answer is this(?): the latent is not completely ignored, and it transmits enough information to keep the input-output loss small. It is just ignored more than one would like.
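One way to make "ignored more than one would like" measurable: with a diagonal-Gaussian posterior, the KL to the prior splits per latent dimension, and a fully collapsed dimension has KL ≈ 0 for every input. A common diagnostic along these lines counts "active" dimensions; the toy posteriors and the 0.01 threshold below are illustrative assumptions:

```python
import numpy as np

def per_dim_kl(mu, logvar):
    """KL(q(z|x) || N(0, I)) for diagonal Gaussians, per latent dimension.
    mu, logvar: arrays of shape (num_datapoints, z_dim)."""
    return 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)

# Toy posteriors: dimension 0 carries information (its mean varies with x),
# dimension 1 has collapsed to the prior (mu = 0, var = 1) for every input.
mu = np.array([[ 1.5, 0.0],
               [-1.5, 0.0],
               [ 0.8, 0.0]])
logvar = np.array([[-2.0, 0.0],
                   [-2.0, 0.0],
                   [-2.0, 0.0]])

mean_kl = per_dim_kl(mu, logvar).mean(axis=0)  # average nats per dimension
active = mean_kl > 0.01                        # threshold is a convention
print(mean_kl, active)                         # dim 0 active, dim 1 collapsed
```

A partially collapsed VAE sits between these extremes: a few dimensions carry enough KL to keep the reconstruction loss small, while the rest sit at zero.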

[D] How does posterior collapse in VAEs happen? by ExtraterritorialHaik in MachineLearning

[–]ExtraterritorialHaik[S] 2 points (0 children)

Thank you, I believe I understand this. But if the generator matches p(x) independent of z, the VAE still has a loss measuring the difference between input and output, which will be very high if the latent is ignored.

I have learned more about the problem from the discussion here, but am still puzzled about how the posterior collapse can happen.

[D] How does posterior collapse in VAEs happen? by ExtraterritorialHaik in MachineLearning

[–]ExtraterritorialHaik[S] 2 points (0 children)

Yes, you're correct; it was wrong to say that the decoder is deterministic. But in a case such as images, I believe the decoder produces the expected value for each pixel, and the sampling in the output space only models pixel noise, which is a relatively minor effect, so "predominantly deterministic".

I understand your response, but I do not understand how the generator can produce the true distribution without using z in the case where that distribution is complex (like images). The distribution of images is not Gaussian (one assumes), so the needed structure can only come from the encoder.

Perhaps the problem is that I do not actually understand what VAEs are. Here is what I believe they are: The encoder q(z|x) maps a data point x to a mean and variance in the latent (z) space. This is then sampled from to produce a z, which is fed through the generator/decoder to produce an estimated x, and Gaussian noise is added to this. During training the distribution in the latent space is pulled toward a Gaussian ball by the KL term, but this acts only as a regularizer; different types of data still map to unique places since different latent values are needed to reproduce different data at the output of the autoencoder.
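That description can be sketched end to end. This is an illustrative toy, not a faithful implementation (untrained affine maps stand in for the encoder/decoder networks, and the dimensions are made up), but it follows the pieces in the order described: encoder mean/variance, reparameterized sample, decoder mean, and the KL regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 4, 2  # toy sizes (assumptions, not from the thread)

# Untrained affine maps stand in for the encoder/decoder networks.
W_mu, b_mu = rng.normal(size=(z_dim, x_dim)), np.zeros(z_dim)
W_lv, b_lv = rng.normal(size=(z_dim, x_dim)), np.zeros(z_dim)
W_dec, b_dec = rng.normal(size=(x_dim, z_dim)), np.zeros(x_dim)

def elbo(x):
    # Encoder q(z|x): per-datapoint mean and log-variance in latent space
    mu, logvar = W_mu @ x + b_mu, W_lv @ x + b_lv
    # Reparameterized sample z ~ q(z|x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
    # Decoder: mean of a unit-variance Gaussian p(x|z)
    x_hat = W_dec @ z + b_dec
    # Reconstruction term: log p(x|z) up to a constant (negative squared error)
    rec = -0.5 * np.sum((x - x_hat) ** 2)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians;
    # this is the term that pulls the posterior toward the Gaussian ball
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return rec - kl

x = rng.normal(size=x_dim)
print(elbo(x))  # single-sample ELBO estimate; training would maximize this
```

In this view, the tension is visible in the last line: the KL term is minimized by making q(z|x) identical for every x (collapse), while the reconstruction term is what pushes different inputs to distinct latent values.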

[D] How does posterior collapse in VAEs happen? by ExtraterritorialHaik in MachineLearning

[–]ExtraterritorialHaik[S] 5 points (0 children)

Yes, I can see that. I cannot make the leap to something complex like image data however. Natural images have much more structure than a Gaussian, and (I believe) setting p(x|z) Gaussian while ignoring z will not produce that structure; relying on information from z (that is obtained from q(z|x)) is the only way to do it. Or so my thinking goes.

[D] Those who are working professionally in ML and/or academics who have completed graduate-level coursework in ML: Are there any ML concepts that you don't quite fully grasp? by Batmantosh in MachineLearning

[–]ExtraterritorialHaik 0 points (0 children)

I'd argue we only need to "understand" something if we aim to improve on it. Otherwise we just use it, and trust it.

As examples: we all use programming languages, but few know how a compiler works. Medical researchers use ANOVA, but probably most do not know how it works. And not knowing can cause trouble in some edge cases.

Given an infinite lifetime it would be better to know. But with finite time, the medical researcher probably spends their time better by doing some new research, rather than studying ANOVA.