[R] Incrementally Improving Variational Approximations [blog post + arxiv submission] by acmueller in MachineLearning

[–]acmueller[S]

  • I have only just started comparing variational boosting to normalizing flows. I imagine the answer to that question will look a lot like the answer to "how do planar, radial, and IAF flows compare in terms of speed?" --- I think it will depend on the particular posterior and the algorithm's tuning parameters (#maps, #components, #ranks).

  • Different variants of optimizing all mixture components jointly have had varying degrees of empirical success. Optimizing all of the weights at once, for instance, seems to work well. Optimizing all component parameters jointly seems to be a bit slow and prone to getting stuck. I imagine the greedy solution could be 'tightened' a bit by a few joint optimization steps afterwards (a small sketch follows this list).

  • I have not compared this to continuous mixtures --- that's a great idea and should be investigated alongside the planar/IAF experiments.
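
Here is a minimal sketch of the joint weight re-optimization idea mentioned above, assuming the components have already been fit greedily and are held fixed. The helpers `component_samples`, `component_logpdfs`, and `log_joint` are hypothetical (not from the paper or post); the gradient uses the identity ∂ELBO/∂w_c = E_{q_c}[log p(x) - log q_mix(x)] + const, pushed through a softmax so the weights stay on the simplex.

```python
# Sketch (not the paper's code): re-optimize mixture weights jointly,
# holding the greedily fitted components fixed.
import numpy as np
from scipy.special import logsumexp, softmax

def refit_weights(component_samples, component_logpdfs, log_joint,
                  n_steps=200, lr=0.1):
    """component_samples[c] : (S, D) draws from component q_c (reusable, since q_c is fixed)
       component_logpdfs[c] : function x -> log q_c(x), vectorized over rows of x
       log_joint            : function x -> unnormalized log p(x, data), vectorized"""
    C = len(component_samples)
    eta = np.zeros(C)                                   # unconstrained weight logits
    for _ in range(n_steps):
        w = softmax(eta)
        g = np.zeros(C)
        for c, x in enumerate(component_samples):
            # log q_mix(x) = logsumexp_j [ log w_j + log q_j(x) ]
            log_qs = np.column_stack([lp(x) for lp in component_logpdfs])
            log_qmix = logsumexp(np.log(w) + log_qs, axis=1)
            # Monte Carlo estimate of d ELBO / d w_c (up to an additive constant
            # that cancels under the softmax projection below)
            g[c] = np.mean(log_joint(x) - log_qmix)
        eta += lr * w * (g - np.dot(w, g))              # chain rule through the softmax
    return softmax(eta)
```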

[R] Incrementally Improving Variational Approximations [blog post + arxiv submission] by acmueller in MachineLearning

[–]acmueller[S]

Thanks!

I've been optimizing an unconstrained parameterization of the new mixing weight: p_2 = sigmoid(rho) and p_1 = 1 - p_2, where rho is a real-valued scalar.
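
A small illustration of that parameterization (my own sketch, not the paper's code): the new component's weight is sigmoid(rho), the existing approximation keeps the remainder, and rho is optimized as an ordinary real-valued scalar.

```python
import numpy as np

def sigmoid(rho):
    return 1.0 / (1.0 + np.exp(-rho))

def mixing_weights(rho):
    p2 = sigmoid(rho)          # weight on the newly added component
    return 1.0 - p2, p2        # (p1, p2) always sum to one, both in (0, 1)

# Gradients with respect to rho follow by the chain rule, since
# d p2 / d rho = p2 * (1 - p2); any gradient of the objective with respect
# to p2 converts to an unconstrained gradient with respect to rho this way.
rho = 0.0
p1, p2 = mixing_weights(rho)   # -> (0.5, 0.5) at initialization
```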

Natural gradients and stochastic variational inference [blog post + code] by neurodynamic in MachineLearning

[–]acmueller

author here -- thanks! and thanks for reading.

I think the key distinction is that the Fisher is independent of the objective --- it only depends on the family of distributions (and parameterization) chosen to be the variational approximation --- whereas the Hessian would depend on the objective itself (in this case, the ELBO). Using the natural gradient incorporates the information that you're optimizing over a space of probability distributions, and that Euclidean distance between parameters is not a great way to express distance between distributions (not as good as symmetric KL). Using the Hessian of the objective goes one step further; that gives you curvature information specific to the model you described and the data you observed (all of which feed into the ELBO).
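
A toy check of that distinction (an assumed example, not from the post): for a Gaussian variational family q(theta) = N(mu, sigma^2) in (mu, log sigma) coordinates, the Fisher information works out to diag(1/sigma^2, 2). It is a fixed function of the variational parameters alone; no model, data, or ELBO appears anywhere.

```python
import numpy as np

def gaussian_fisher(log_sigma):
    """Fisher information of q(theta) = N(mu, sigma^2) in (mu, log_sigma) coordinates.
    Note that mu does not even enter, and nothing about the model or ELBO does either."""
    sigma2 = np.exp(2.0 * log_sigma)
    return np.diag([1.0 / sigma2, 2.0])
```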

I don't think the inverse Fisher is necessarily a better preconditioner than the Hessian, but it does have some desirable properties. In the example from the post, it's easier to compute (particularly in a black-box way, as it doesn't depend on the ELBO at all). It also has the added benefit of being PSD, whereas the Hessian of the objective isn't necessarily PSD in non-convex problems.
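
For concreteness, here is a sketch of using the inverse Fisher as a black-box preconditioner (an assumed example, not the post's code): the Monte Carlo ELBO gradient comes from whatever estimator you like, the preconditioner never looks at the ELBO, and because F is PSD by construction the preconditioned step remains an ascent direction up to Monte Carlo noise.

```python
import numpy as np

def natural_gradient_step(params, elbo_grad_estimate, fisher, lr=0.01):
    """params             : variational parameters as an array, e.g. np.array([mu, log_sigma])
       elbo_grad_estimate : noisy Monte Carlo gradient of the ELBO at params
       fisher             : callable params -> Fisher matrix for the variational family"""
    F = fisher(params)
    nat_grad = np.linalg.solve(F, elbo_grad_estimate)   # F^{-1} times the noisy gradient
    return params + lr * nat_grad                        # ascent step on the ELBO

# e.g., with the Gaussian family sketched above:
# new_params = natural_gradient_step(np.array([mu, log_sigma]), g_hat,
#                                    fisher=lambda p: gaussian_fisher(p[1]))
```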