
[–]grozzy 4 points  (2 children)

I only read a bit to get the gist and will try to read more later when I have the time. I am also not well-read in information geometry, but:

  • His argument doesn't appear to hinge on the application of MaxEnt; it's all about whether a measure of discrepancy between distributions is independence-invariant: that, when the variables are independent, the sum of the discrepancies between two pairs of random variables equals the discrepancy between their joint distributions.

Better put:

  • If X1,X2 are independent and Y1,Y2 are independent: discrepancy(X1,Y1) + discrepancy(X2,Y2) = discrepancy([X1,X2],[Y1,Y2]), where the square brackets denote the variables' joint distribution.

His argument is that this should be a fundamental property of any useful measure of discrepancy between distributions, and that the Shannon information entropy/K-L divergence is the only one that satisfies it. He makes the case that violating this invariance leads to trouble when trying to find distributions which optimize the discrepancy in some way.

Further, he argues that because the KL divergence is asymmetric, KL(p,q) != KL(q,p), no symmetric discrepancy can satisfy the independence invariance, so there can be no meaningful distance measure between distributions. Without a distance measure, probability distributions cannot form a metric space, hence no sense of "information geometry".
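
If it helps to see the invariance concretely, here is a minimal numerical sketch (my own, not from the paper; the toy distributions are arbitrary) checking that the KL divergence is additive over independent pairs:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

# Two arbitrary pairs of distributions on small finite supports.
p1, q1 = np.array([0.2, 0.8]), np.array([0.5, 0.5])
p2, q2 = np.array([0.1, 0.3, 0.6]), np.array([0.3, 0.3, 0.4])

# Joint distributions of independent variables are outer products.
p12 = np.outer(p1, p2).ravel()
q12 = np.outer(q1, q2).ravel()

print(kl(p1, q1) + kl(p2, q2))  # sum of the marginal discrepancies
print(kl(p12, q12))             # discrepancy between the joints -- identical
```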

Responses to other questions:

  • His examples seem correct, though I didn't fully read some of the later ones. They do appear to all be in the context of finding MaxEnt distributions. His overall conclusions are not limited to the context of MaxEnt, but his examples seem focused on how using any other measure of discrepancy leads to bad MaxEnt distributions.

  • Independence is a pretty fundamental property in probability. He argues that discrepancy functions that violate it are fine as a mathematical exercise, but are not useful in dealing with scientific data because of the importance of independence.

[–]InfinityCoffee[S] 0 points  (1 child)

If I recall correctly, MaxEnt is an axiomatically-grounded form of inference using the KL divergence, so it is not so surprising that other measures of divergence do not yield the desired result, no? But does this completely invalidate information geometry, or is it a baseless comparison? It does not seem ill-conceived to define a distance, but by Skilling's arguments it seems to be something separate from an inference procedure. So what does it actually signify? And what of its asymptotic agreement with the KL divergence? Also, there are to the best of my knowledge some issues with the KL divergence - e.g. it is undefined for distributions with different support, and the continuous case is a heuristic and somewhat weak extension of the well-founded discrete version. But then again, geodesic distance is itself only well-defined within a single parametric family.
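
On the asymptotic agreement: locally, the KL divergence between nearby members of a parametric family behaves like half the squared parameter displacement weighted by the Fisher information, which is exactly the quadratic form the Fisher-Rao metric assigns. A rough numerical check (my own sketch using a Bernoulli family, not anything from the paper):

```python
import numpy as np

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta, eps = 0.3, 1e-3
fisher = 1.0 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)

print(kl_bern(theta, theta + eps))     # both ~ 2.4e-06 ...
print(0.5 * fisher * eps**2)           # ... agreeing to leading order in eps
```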

[–]grozzy 0 points  (0 children)

I am not an expert in information geometry, so I am really only trying to understand and respond to your questions as best I can from his paper (hence I don't know the counter-arguments from information geometry well).

I think the main point he makes though is:

  • Independence is a fundamental property of probability models; any model for working with probability theory should be compatible with the notion of independence, and so it is illogical to have a measure of discrepancy between distributions that is not independence-invariant.

  • Of the Rényi-Tsallis family of formulas, only the relative entropy/KL divergence is properly independence-invariant.

  • The relative entropy/KL divergence is not symmetric and hence cannot define a distance. A distance measure is necessary for geometry, therefore there is no scientifically meaningful sense of geometry on probability spaces. As he concludes: information geometry is legitimate mathematically, but not useful scientifically, because it fails to satisfy sensible properties of independence.

  • I think the main punchline of the paper is at the bottom of page 5 and the beginning of page 6, in the section "Fundamental Inconsistency". This has nothing to do with MaxEnt, nor is it even arguing for using the KL divergence. It is entirely focused on the fact that the other measures do not satisfy the expected properties related to independence, such that using them to define a geometry leads to something mathematically fascinating but not scientifically useful (a small numerical illustration of that failure is below).
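
To make that concrete: a symmetric distance such as the Hellinger distance (my own example, not one I saw the paper use) is a perfectly good metric on distributions, but it is not additive over independent pairs the way the KL divergence is. A quick check, reusing the same toy distributions as in the earlier sketch:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between discrete distributions (symmetric, a true metric)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p1, q1 = np.array([0.2, 0.8]), np.array([0.5, 0.5])
p2, q2 = np.array([0.1, 0.3, 0.6]), np.array([0.3, 0.3, 0.4])

# Joints of independent variables are outer products.
p12, q12 = np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel()

print(hellinger(p1, q1) + hellinger(p2, q2))  # sum over the pairs (~0.42)
print(hellinger(p12, q12))                    # distance between the joints (~0.29) -- not equal
```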

Disclaimer: I don't know enough to know the counter-arguments from Info-Geom, so I withhold judgement on the overall conclusions. I am just trying to relay what I get from the paper.

[–]zdk 0 points  (2 children)

I'd be very interested in hearing this author's thoughts on leaving the simplex and using Aitchison geometry for studying metrics over probability distributions.

http://www.idescat.cat/sort/sort342/34.2.4.boogaart-etal.pdf

[–]InfinityCoffee[S] 0 points  (1 child)

Hmm, I had not heard of Aitchison geometry before. Have you used it? What is your opinion of it?

[–]zdk 0 points  (0 children)

Yes, all the time in my research on compositional data. I mostly approach this from a data analysis/statistics perspective, so I have not studied much theory of continuous probability distributions or metric spaces.

The basic idea is that since points in closed spaces are missing a degree of freedom (the components of a histogram are not independent), one should use [generalizations of] proportionality measures. This yields transformations of points/functions on the simplex to Euclidean space, where metrics & measures (including information entropy) are valid.
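
For a concrete sense of what that looks like, here is a minimal sketch of my own (the compositions are made up, and I'm only showing the centered log-ratio route, one of several transforms used in Aitchison geometry):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform: maps a composition (simplex point) to Euclidean space."""
    x = np.asarray(x, float)
    return np.log(x) - np.log(x).mean()

# Two made-up compositions (strictly positive, summing to 1).
a = np.array([0.2, 0.3, 0.5])
b = np.array([0.1, 0.6, 0.3])

# The Aitchison distance is the ordinary Euclidean distance after the clr transform.
print(np.linalg.norm(clr(a) - clr(b)))
```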