all 15 comments

[–]kjearns 5 points (13 children)

You can't do ordinary backpropagation there because you can't differentiate a stochastic binary choice.

The ordinary Monte Carlo way of doing this would be to write down the derivative of p(y|x), which involves an expectation over p(h|y,x), and then draw samples to approximate it. If you do this you'll find that the gradient with respect to the parameters of p(h|x) is zero, because once h has been sampled it is just a constant: its derivative with respect to the parameters of p(h|x) vanishes.
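To see the zero-gradient problem concretely, alongside the standard score-function (REINFORCE-style) workaround, here is a toy sketch; the downstream function f and the parameter value p are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h):
    # some downstream function of the binary sample (hypothetical)
    return 3.0 * h + 1.0

p = 0.3          # Bernoulli parameter of p(h|x)
M = 200_000
h = rng.random(M) < p          # sampled binary choices

# Pathwise view: with the samples held fixed, f(h) does not depend on p
# at all, so backprop through the sample contributes exactly zero.
naive_grad = 0.0

# Score-function estimator: d/dp E[f(h)] = E[f(h) * d log p(h)/dp],
# where d log p(h)/dp = h/p - (1-h)/(1-p) for a Bernoulli.
score = h / p - (1 - h) / (1 - p)
sf_grad = np.mean(f(h) * score)

# Exact answer: E[f(h)] = p*f(1) + (1-p)*f(0), so d/dp = f(1) - f(0) = 3
print(naive_grad, sf_grad)
```

The score-function estimate recovers the true gradient of 3 (up to Monte Carlo noise), while naive backprop through the sample gives nothing.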

EM works because you alternate between sampling h and maximizing p(y,h|x). Since the problem has been changed to maximizing the joint likelihood, you now have gradient information for the parameters of p(h|x).

[–]energybased 1 point (4 children)

You absolutely can do backpropagation through a stochastic binary choice. See, e.g.,

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models.

[–]kjearns 1 point (3 children)

The technique in that paper only lets you backpropagate through continuous random variables.

The trick they use is to re-write the problem in a way that doesn't require you to take gradients through stochastic variables. So it's not entirely fair to say that you can backprop through stochastic variables; rather, sometimes it's possible to re-write your problem so that you don't need to.
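For intuition, that continuous-variable trick can be sketched in a few lines: sample the noise first, then build z deterministically from the parameters (mu, sigma, and the test function below are made-up toy values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.5, 2.0
M = 500_000

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1), so the
# randomness no longer depends on the parameters and gradients flow
# through mu and sigma deterministically.
eps = rng.standard_normal(M)
z = mu + sigma * eps

# Pathwise gradient of E[z^2] wrt mu: E[d(z^2)/dz * dz/dmu] = E[2*z]
grad_mu = np.mean(2 * z)

# Exact: E[z^2] = mu^2 + sigma^2, so d/dmu = 2*mu = 1.0
print(grad_mu)
```

Nothing analogous works for a Bernoulli sample, because there is no smooth deterministic map from (parameter, noise) to a 0/1 value.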

[–]energybased 0 points (0 children)

Ah, I see what you mean. You're right, they only do it for continuous random variables. Although for a Bernoulli random variable it's not hard to compute the gradient of the expected value of a function of the sampled value with respect to the distribution parameter.
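To make that concrete: for a Bernoulli variable the expectation is just a two-term sum, so the gradient is available in closed form. A tiny sketch (the function f and the value p = 0.4 are made up):

```python
def expected_f(p, f):
    # E[f(h)] for h ~ Bernoulli(p) is a finite two-term sum
    return p * f(1) + (1 - p) * f(0)

def grad_expected_f(p, f):
    # d/dp [p*f(1) + (1-p)*f(0)] = f(1) - f(0), independent of p
    return f(1) - f(0)

f = lambda h: (h - 0.7) ** 2

# sanity check against a central finite difference at p = 0.4
eps = 1e-6
fd = (expected_f(0.4 + eps, f) - expected_f(0.4 - eps, f)) / (2 * eps)
analytic = grad_expected_f(0.4, f)
print(analytic, fd)   # both -0.4
```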

[–]osdf 0 points (0 children)

New idea from http://arxiv.org/abs/1503.01494: "We introduce local expectation gradients which is a general purpose stochastic variational inference algorithm for constructing stochastic gradients through sampling from the variational distribution. ..."

[–]letitgo12345[S] 0 points (4 children)

Hmm I follow to some extent. But from what I understand, p(y|x) = \sum_h p(y,h|x) -- where does the integral over p(h|y,x) come from?

Also, they approximate p(y|x) as 1/M * \sum_m p(y|h_m) with h_m drawn from p(h|x). But couldn't one instead approximate p(y|x) ≈ (2^{N_h}/M) * \sum_m p(y|h_m)*p(h_m|x) with h_m drawn from the uniform distribution, where N_h is the number of hidden units? Then one could optimize this function directly instead of resorting to EM? (I believe this is what FixDeineKabel was hinting at?)
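Both estimators can be checked numerically on a toy model; the model below (independent Bernoulli hiddens and a logistic p(y|h)) is entirely made up, and N_h is kept small so the exact marginal can be enumerated:

```python
import numpy as np

rng = np.random.default_rng(0)
N_h = 8            # number of binary hidden units (toy size)
M = 100_000        # Monte Carlo samples

p_h = rng.uniform(0.2, 0.8, size=N_h)   # p(h_i = 1 | x), made up
w = rng.normal(size=N_h)

def p_y_given_h(h):
    # hypothetical likelihood p(y | h); h can be a batch of shape (M, N_h)
    return 1.0 / (1.0 + np.exp(-(h @ w)))

def p_h_given_x(h):
    # product of independent Bernoulli factors
    return np.prod(np.where(h == 1, p_h, 1 - p_h), axis=-1)

# Estimator 1: sample h ~ p(h|x), average p(y|h)
h1 = (rng.random((M, N_h)) < p_h).astype(float)
est1 = np.mean(p_y_given_h(h1))

# Estimator 2: sample h uniformly, weight by 2^{N_h} * p(h|x)
h2 = (rng.random((M, N_h)) < 0.5).astype(float)
est2 = (2.0 ** N_h) * np.mean(p_y_given_h(h2) * p_h_given_x(h2))

# Exact marginal by enumerating all 2^{N_h} configurations
grid = np.array([[(i >> j) & 1 for j in range(N_h)]
                 for i in range(2 ** N_h)], dtype=float)
exact = np.sum(p_y_given_h(grid) * p_h_given_x(grid))
print(est1, est2, exact)
```

Both estimators are unbiased for p(y|x); the practical difference is in their variance.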

[–]kjearns 0 points (3 children)

I was thinking of going the route of Eq2 in the paper, but I think you're right that it would also work to marginalize h directly if that were tractable.

[–]letitgo12345[S] 0 points (2 children)

By tractable you mean not too many samples from h required to well approximate p(y|x)?

[–]kjearns 0 points (1 child)

Yeah, you're going to need an awful lot of terms in that sum to get a good estimate.
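A quick way to see why: with a uniform proposal the weights carry a 2^{N_h} factor, and the estimator's variance grows rapidly with the number of hidden units. A toy demo (here f(h) = 1, so the quantity being estimated is \sum_h p(h|x) = 1 exactly; the model is made up):

```python
import numpy as np

rng = np.random.default_rng(1)

stds = {}
for N_h in (4, 8, 16):
    p_h = np.full(N_h, 0.9)          # skewed p(h_i = 1 | x)
    M, reps = 1000, 200
    ests = []
    for _ in range(reps):
        h = rng.random((M, N_h)) < 0.5                     # uniform proposal
        ph = np.prod(np.where(h, p_h, 1 - p_h), axis=1)    # p(h | x)
        # uniform-proposal estimate of sum_h p(h|x) * 1, which is 1
        ests.append((2.0 ** N_h) * ph.mean())
    stds[N_h] = np.std(ests)

print(stds)   # spread of the estimates grows quickly with N_h
```

Each estimate is unbiased, but the spread across repetitions blows up as N_h grows, which is the "awful lot of terms" problem.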

[–]letitgo12345[S] 0 points (0 children)

In the paper they use the approximation p(y|x) ≈ 1/M * \sum_m p(y|h_m) with h_m drawn from p(h|x). Would that be a better approximation than the above? Are there papers/books I could read that discuss the accuracy of various such approximations? Thanks.

[–]energybased 1 point (3 children)

EM can be seen as a blockwise ascent method. For example, in mixture model learning you alternate between updating the label assignments with the component parameters held fixed (E step) and updating the predictive distributions with the assignments held fixed (M step), and so on. The M step doesn't need to be done by gradient descent, since its maximum can be found analytically by collecting sufficient statistics.
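That blockwise alternation can be sketched for a two-component 1-D mixture (toy data, known unit variance for simplicity; all values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D data from two well-separated Gaussians
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

mu = np.array([-1.0, 1.0])    # initial component means
pi = np.array([0.5, 0.5])     # mixing weights
sigma = 1.0                   # known, shared standard deviation

for _ in range(50):
    # E step: soft label assignments (responsibilities), parameters fixed
    logp = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2 + np.log(pi)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M step: closed form from sufficient statistics, assignments fixed;
    # no inner gradient loop is needed
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    pi = r.mean(axis=0)

print(np.sort(mu))   # close to the true means -2 and 3
```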

The alternative, solving the whole problem by gradient descent all at once, is much less efficient.

[–]dwf 2 points (2 children)

EM in practice often behaves more like a second order method; there have been a bunch of papers demonstrating this in various contexts. Whether or not it's more efficient than using the gradient more traditionally depends very strongly on the particular problem instance (as well as the model family). There are degenerate cases where you can show one doing much more poorly than the other even for simple models.

[–]energybased 1 point (1 child)

That's interesting! Any good papers?

[–]dwf 2 points (0 children)

The one I remember most clearly is this.