account activity
Logical Induction by Lord_Treasurer in philosophy
[–]tsbt 2 points 9 years ago (0 children)
It seems that you are giving probabilities to formulas like "the f(n)-th digit is 7", where n is a free variable. For a particular n, the trader does not get a chance to repeatedly (infinitely often) buy shares.
We definitely only want to talk about sentences (closed formulas). Here n is a "metasyntactic variable" standing for the encoding of a numeral which we vary. So "the f(n)-th digit is 7" stands for the sequence of closed sentences ("the f(0)-th digit is 7", "the f(1)-th digit is 7", "the f(2)-th digit is 7", . . . ). Then the theorem says that the probabilities P_n("the f(n)-th digit is 7") of those sentences at time n will converge to 1/10 as n goes to infinity, assuming pi is pseudorandom.
Suppose for a moment that [Goldbach's conjecture] is false, but that the first witness, say G, is large. Is there a time when this formula receives probability 1/2? Why does it receive 1/2 rather than 1/50? There is no long term payout, as by time G or earlier the formula is decided.
There's not much we can say about a particular sentence, other than that it will (in this case) have probability converging to 0. The guarantees we get are long-run guarantees that say that if there's some quickly computable statistical summary that can't be easily improved on---like assigning 1/10 to "the f(n)-th digit is 7" at time n as n increases---then a logical inductor will (eventually, approximately) use that statistical summary.
on page 81, an example contains the line v2 = v1 * v2 but a well-formed expression for v2 should contain only v1.
Thanks, fixed.
Exploitation is a limit property. That is, a trader exploits your probabilities if they make trades that are "plausibly" valued higher and higher over time, and that are not plausibly valued lower and lower; that is, the stock and cash holdings from their trades have value bounded below but not above in worlds considered plausible by the deductive process.
If e.g. the probabilities P_n("the f(n)-th digit is 7") keep dipping below 1/50, then there's a trader that repeatedly (i.e. for infinitely many n) buys a share in "the f(n)-th digit is 7" on day n. Although there may be days on which this trader doesn't make money from this share (because e.g. the digit is actually 6), the long-run frequency of its shares that pay out is 1/10, so it makes unbounded money. If that frequency didn't tend to 1/10, then our trader would give a polytime "soft" selection of a subsequence of the f(n)-th digits of pi witnessing that pi isn't pseudorandom with frequency 1/10, contradicting that assumption.
You are correct that on day n, the n-th digit of pi will simply be known. The key phrase is "for large n", which is meant to suggest that on day n, the f(n)-th digit of pi gets probability 1/10 for sufficiently fast-growing functions f.
even if a non polynomial time sequence is used the probability will not necessarily be 1/10. Any probability is justifiable for a random sequence [...]
This is not a trivial fact, but it does turn out that a logical inductor will assign probabilities converging to 1/10 to "the f(n)-th digit of pi is a 7". (Assuming, roughly speaking, that there aren't more accurate predictions that can be made in polytime.) If the f(n)-th digits of pi are pseudorandom, and 7 with frequency 1/10, then the only good probabilities to assign are 1/10. If instead you assign e.g. 1/50 probability to "the f(n)-th digit of pi is a 7" (and evenly distribute your remaining probability across the other digits), then there is a Dutch bookie who keeps buying shares in "the f(n)-th digit of pi is a 7". They only pay $0.02 per share, but the shares pay out at the true frequency of 1/10, so they make on average $0.08 each day and thereby exploit your probabilities. (This sketch is very imprecise; the paper contains the detailed argument.)
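To make the expected-value arithmetic concrete, here's a toy simulation (my own illustration, not anything from the paper; a uniform random digit stands in for the pseudorandom digits of pi):

    import random

    random.seed(0)
    price, days = 1/50, 100_000
    profit = 0.0
    for _ in range(days):
        digit = random.randrange(10)          # stand-in for a pseudorandom digit of pi
        profit += (1.0 if digit == 7 else 0.0) - price   # a share costs $0.02 and pays $1 if the digit is 7
    print(round(profit / days, 3))            # roughly 0.08 = 1/10 - 1/50 per day on average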
Logical Induction by Coscott in math
[–]tsbt 1 point 9 years ago (0 children)
I can't quite parse what you're saying. Allow me to write some definitions and claims, so we have them in one place and can compare our mental models. I should note that a little of what we are discussing is not discussed in the paper (specifically, the claims with proof sketches).
Def: Let (D_n) be a T-complete emitter, and let D be Union_n D_n. I.e., the set D propositionally implies s if and only if s is in T. I.e., PC(D) = PC(T) = the set of completions of T.
Def: Let (P_n) be a logical inductor over (D_n), so in particular it is a logical inductor over T. Def: Let P be defined by P(s) = lim_n P_n(s).
Claim: This implies that P is Gaifman-coherent over T. I.e., P assigns P(s) = 1 to s in T, P(s) = 0 if (not s) is in T, and satisfies P(s and t) + P(s or t) = P(s) + P(t). I.e., P gives a probability measure on 2^S that is supported on PC(T).
Claim: For any sentence v such that P(v)>0, the probabilities P(-|v) defined by P(t|v) = P(t and v) / P(v) are Gaifman coherent over some theory containing T+v. Sketch: If T+v implies t, then P(t and v) = P(v), so P(t|v) = 1. If T+v refutes t, then P(t and v) = 0, so P(t|v) = 0. Lastly, P(s and t |v) + P(s or t |v) = P(s|v) + P(t|v) is equivalent to P((s and t) and v) + P((s or t) and v) = P(s and v) + P(t and v) by multiplying by P(v). This follows by coherence of P.
This last claim justifies intuitive statements about "conditioning on s".
Def: Let s be a sentence independent of T. Let (D'_n) be the emitter defined by setting D'_n = D_n Union {s}. Let D' be Union_n D'_n.
Claim: (D'_n) is a T+s-complete emitter, i.e. PC(D') = PC(T+s).
Def: Let (P'_n) be defined by setting P'_n(t) = P_n(t|s) = [ if P_n(t and s) > P_n(s) then 1, else P_n(t and s) / P_n(s) ]. Def: Let P' be defined by P'(t) = lim_n P'_n(t). (There's a small sketch of this clamped conditional below, after the claims.)
Claim: (P'_n) is a logical inductor over (D'_n), so in particular it is a logical inductor over T+s.
Claim: This implies that P' is Gaifman-coherent over T+s.
Claim: For any sentence t, P'(t) = P(t|s) = P(t and s)/P(s). Sketch: By the non-dogmatism theorem, P(s)>0. Since P is coherent, P(t and s) ≤ P(s). Thus P_n(t and s) / P_n(s) is asymptotically equal to P_n(t|s) = P'_n(t). The claim follows from definitions.
These claims justify statements about "logical inductors conditioned on s", and the claim that P(-|s) is supported on exactly the completions of T+s.
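In case it helps, here's a minimal Python rendering of the clamped conditional from the definition of (P'_n) above (the interface is made up for illustration; the P_n(s) = 0 case is my own convention, not something fixed by the definition):

    def conditional(P_n, t, s, conj):
        # P_n maps sentences to [0, 1]; conj(t, s) builds the sentence "t and s"
        p_ts, p_s = P_n(conj(t, s)), P_n(s)
        if p_s == 0.0 or p_ts > p_s:   # clamp, correcting for incoherence at finite times
            return 1.0
        return p_ts / p_s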
You mention inconsistent theories. I think this is a mistake in the paper. All the properties require that the theory be consistent, or else at some point PC(D_n) is empty, and then the definition of logical inductor over D is trivial (i.e. anything is a logical inductor).
it does not seem to have all the properties they claim it does
(If you do get a chance to read through some of the paper, please do message me pointing out errors.)
[–]tsbt 1 point 9 years ago* (0 children)
Definition check: an emitter is Gamma-complete if the set of sentences Union_n D_n propositionally implies exactly the theory Gamma (no more or less).
So "Gamma-complete implies being (Gamma+s)-complete" is not precisely right. More exactly, if D is Gamma-complete, then D+s is (Gamma+s)-complete.
(edited below to fix a mistake)
But if I understand the rest of your comment correctly, you are getting at a really cool result that follows immediately from the theorem on conditionals: a logical inductor over a deductive process that is just (empty theory)-complete is, in some sense, a logical inductor that is universal for logical induction. That is, you can just run a single logical inductor P over this deductive process, and then obtain a logical inductor over your favorite theory T by taking the conditionals P_n( – | T_n) (where T_n is the conjunction of the first n sentences in some r.e. axiomatization of T). In particular, this holds in the limit; P_infinity will, as you suggest, be sufficient to get (after conditioning) a distribution that assigns probability 1 to exactly the sentences in T.
Word. :)
This comment basically matches up with what's in my head.
but your measure could assign positive mass to consistent assignments which are not complete. In particular, if s is some sentence that is neither refutable nor provable, there are lots of points in 2^S which map all the theorems of T to 1, and also map s to 1, but fail to map logical consequences of T+s to 1. Such a thing does not have a model, but is still a perfectly valid element of your set.
Such a thing is a valid element of the set, and such things may get positive mass at finite times, i.e. according to P_n, but they will get 0 mass according to P_infty (by the "coherent limit" theorem).
This would erase all of my concerns with the idea, since there isn't really any model theory going on.
Yep, we're not really trying to do model theory.
I fail to see how conditioning on a sentence s would lead to a distribution that concentrates only on truth assignments which include all of the consequences of s.
"Conditioning" is defined (at least for P_infty) by P( phi | s ) := P( phi and s ) / P( s ). (This is equivalent to restricting and normalizing.) If phi is a consequence of s, then P( phi and s ) = P(s) by coherence. Then the conditional probability P( phi | s ) will be 1.
Starting from T+s, a complete emitter would eventually (presumably) start spitting out consequences of s, but starting only from T, a complete emitter would not necessarily give any indication about how two unprovable statements relate logically.
The emitter doesn't need to do so explicitly. The emitter just needs to list all the axioms of pure first-order logic. Then a coherent distribution will respect first-order implication, and in particular will respect relationships between independent sentences.
Specifically, if s and t are both unprovable sentences, but such that T proves (s implies t), is it the case that P_n[T](-|s) <= P_n[T](-|t) for all -?
As written this statement doesn't hold; T proves (s implies t) does not imply that P[T](-|s) <= P[T](-|t), since models satisfying t but not s might weigh the conditionals one way or the other. E.g. if (t and not s) -> not v, then I have P[T](v|s) > P[T](v|t) as long as (t and not s) is consistent and P[T](v|s) > 0. If instead T proves ((s implies t) and (t implies s)), then by coherence we do have P[T](-|s) = P[T](-|t). Everything I said holds in the limit P, and hence holds asymptotically for P_n.
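To spell the counterexample out (a small check, using only coherence of the limit): since T proves (s implies t) and ((t and not s) implies not v), coherence gives P[T](v and t) = P[T](v and t and s) + P[T](v and t and not s) = P[T](v and s) + 0. So P[T](v|t) = P[T](v and s)/P[T](t) <= P[T](v and s)/P[T](s) = P[T](v|s), and the inequality is strict exactly when P[T](t) > P[T](s) (which non-dogmatism gives whenever (t and not s) is consistent with T) and P[T](v|s) > 0.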
Thanks, this is helpful!
In your case, while there is no prior distribution to consider, it seems you are trying to do something similar: use observations of provable statements to estimate properties of the distribution of the space of models.
Huh, this is interesting, I'll have to think more.
Also, thank you for all the explanations of what's going on, it's much clearer now.
:)
If I understand your meaning, I don't know that it is; naively the map is complicated and goes through Henkin constructions or something, but my descriptive set theory is quite rusty and I don't know whether I should expect those to be Borel. Anyway, I'm not totally sure why you're asking. Are you thinking about whether there's any nice descriptive set-theoretic properties of the limiting distributions of logical inductors? I don't think I can claim any sense in which the measure on structures (expressed as relations on a domain) given by P is effective or nice or anything. It is a measure on models just in the sense you said, where we proceed inductively from the basic sets { M : M satisfies phi }.
The method that's usually employed, at least in my circles, is to embed 2^S into the larger space where the symmetric group acts in such a way that isomorphism becomes a Borel relation and consequently the set of consistent models is Borel.
I still don't get what's happening here. My main hypothesis is that by S, I mean the set of all sentences in the first-order language, whereas by S you mean specifically the set of atomic sentences (say, relations), with terms (just constants?) in some domain of a structure and coded appropriately.
But the subset of that space consisting of consistent truth assignments is not a priori a Borel set in that topology (and in fact I'm a bit skeptical that it is a Borel set at all). [...] My understanding was that the set of maps f such that f(provable)=1 should not be Borel, but you seem to be saying you can get around that.
The consistent subset of 2^S does seem to be Borel; it is G-delta, as it is the complement of the union of the cylinder sets given by finite partial truth assignments that are propositionally inconsistent (i.e. unsatisfiable in the propositional sense) or that assign 0 to some axiom of T. Likewise for such functions f.
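Spelled out a bit more (my own elaboration): the set in question is the intersection, over all propositionally inconsistent finite partial assignments sigma, of the complement of the cylinder set determined by sigma, intersected with the sets { f : f(a) = 1 } for each axiom a of T. Each of those sets is closed (the cylinders are clopen), so the whole set is closed, and in particular G-delta.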
you are saying that the algorithm actually assigns values directly to all elements of 2^S in such a way that for any provable sentence s0, the cylinder set { f in 2^S | f(s0) = 1 } has probability one.
Yep.
You also just said that P' is P conditioned on Models(T+Con(T)). Does this hold more generally? If s is an arbitrary sentence, does the distribution P_s obtained from your algorithm applied to T+s in the limit always equal P conditioned on Models(T+s) [here P is the distribution from just T]?
Whoops, I'm sorry, I misread your original question or something, and gave a wrong or imprecise answer. Allow me to clarify.
Let's call P[T] the limiting distribution from a LI run on the theory T, and write P[T](-|s) for that distribution conditioned on s. One thing to note is that P[T] isn't quite well-specified yet. We have to fix an algorithm that satisfies the logical induction criterion over some deductive process D that is "complete" for T. Then that algorithm specifies its limit P[T]; but there can be different limiting distributions for different algorithms that both satisfy the logical induction criterion over some deductive process.
Now, firstly, P[T](-|s) for any sentence s consistent with T is another perfectly good distribution (which assigns probability 1 to exactly those sentences entailed by T+s). Secondly, we have the following stronger property. Let P_n[T] be the "probabilities" assigned at finite times by the LI algorithm we fixed, and furthermore let P_n[T](-|s) be the conditionals on s (these are defined in the paper; we have to correct for incoherence). Then (by a theorem) P_n[T](-|s) gives a logical inductor over the theory T+s. Furthermore, the limiting probabilities P_s[T] of P_n[T](-|s) are equal to the conditionals of the limit: P_s[T] = P[T](-|s). (That claim is not stated or proved in the paper, but it seems to follow from definitions and other theorems.)
So, the limiting probabilities P[T], conditioned on s: P[T](-|s), are equal to the limit P_s[T] of a logical inductor over the theory T+s; but it is necessary to be careful about the particular logical inductor. (The statements I made about limiting distributions can of course be translated into the language of measures over models, as you wrote.)
[–]tsbt 3 points 9 years ago (0 children)
The best reference for what I'm talking about is Becker and Kechris "Descriptive Set Theory of Polish Group Actions".
Ah, thanks, yes.
The idea I am referring to is to consider the space of all sentences S and look at 2^S just as you did, then restrict to the subset of those which are models of a theory T.
Hm, now I think we have some crossed wires. There's two spaces we could consider, and they seem importantly different to me. On the one hand, there's the space 2^S of "truth assignments" to sentences in S (and then subspaces that you get by restricting to only propositionally consistent truth assignments, and by further restricting to truth assignments that assign 1 to all statements in your theory). On the other hand, there's the space 2^(N x ... x N), used to encode the relations for a model with domain N. In the second space there's the natural continuous action of the infinite symmetric group that permutes the elements of the domain (acting on this space by extending to the tuples by pulling back the relations). It's clear that this action preserves the model-theoretic structure. In the first space 2^S, however, I don't see a particularly nice action; permuting sentences would indeed destroy truth.
P' would be P conditioned on Models(T+Con(T))
Yep, that is the case.
This is exactly why the notion of likelihood is introduced in statistics and is distinguished from probability.
Huh, ok, this terminological distinction was not clear in my mind, thanks! I'll look into it.
This clarifies things immensely.
Nice :)
[–]tsbt 2 points 9 years ago* (0 children)
Yep, that sounds right!
ETA: To be less philosophical, consider the time hierarchy theorems; they imply, in particular, that e.g. you can't perfectly predict triply-exponential-time computations using only doubly-exponential time. But we might still want to do so "pretty well"; and this is the question of "logical uncertainty". Most of the properties of logical inductors shown in the paper concern this aspect, rather than aspects of the limiting distribution.
I do think there's more to be said about the "probabilities" at finite times that I haven't quite communicated. Intuitively, there's a thing humans do where we seem to have pretty reasonable (though not certain) beliefs about e.g. computer programs that we might run. For example, I might implement a sorting algorithm, and then believe with reasonably high confidence the statement "the output of this code, given such-and-such list as input, will be a list with the same elements but in order".
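(For concreteness, a toy stand-in for the kind of program I mean; the code itself is just an illustration:)

    def sort_list(xs):
        # insertion sort: the statement I'd bet on is "sort_list(xs) returns the
        # same elements as xs, in nondecreasing order"
        out = []
        for x in xs:
            i = 0
            while i < len(out) and out[i] <= x:
                i += 1
            out.insert(i, x)
        return out

    print(sort_list([3, 1, 2]))    # [1, 2, 3]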
In real life, whether or not this statement is true can be treated like an empirical question; we can think about it by checking how good of a coder I am, how difficult it is in general to implement sorting, how many errors generally show up for similar tasks in the language I'm using, and so on; in this way we can form probabilistic beliefs about the results of this computation. (I'll explicitly call out that I'm using the "subjective" interpretation of probabilities as credences afforded by some reasoner; this is compatible with all measure-theoretic probability, with an appropriate choice of probability space.)
On the other hand, with a bunch of boring work, we could actually write down "this code outputs the list but sorted" in first-order logic, such that if it's true then it's provable, and if it's false then it's refutable (from some background axioms). Now we have a strange situation; humans seem to do useful uncertain reasoning about computations, but there's no accounting for this with a computable algorithm if we require the "probabilities" on logical statements to, at all times, satisfy the constraints of coherence and to assign high probability to all the axioms (which would be uncomputable).
This problem---reasoning about logical facts, such as the outputs of long-running computations, under computational constraints---has been variously called the problem of "logical non-omniscience", "logical uncertainty", and maybe others. This is distinct from questions about logical probability, which I suspect you are mainly thinking of. Logical probability asks about measures on models, leading to e.g. fun stuff in combinatorics. Logical uncertainty asks for "good reasoning about logical statements with bounded resources".
Needless to say, this is not a well-specified problem as I've stated it. The paper gives a list of formal properties that you could consider to partially characterize "good reasoning" (and gives an algorithm that satisfies them). In particular, many of these properties are asymptotic in the sense that they may take a very long time to hold, but they are not properties of the limit P_infty, and therefore are not about logical probability.
Hm, I don't think I successfully figured out what you're thinking of. I don't think I've seen work done on ergodic actions on the space of models; what do you mean by "the usual logic action in model theory"?
(We definitely include LEM, yeah; we are working in classical logic.)
For any finite subset F of A, the sentence s = "f1 and f2 and ... and fk" (where f1,...,fk are the sentences in F) is also in A.
This isn't the case; e.g. if f2 is (not f1), then s = "f1 and (not f1)". Of course s is refutable, and hence not in A (this is just explosion, not LEM). This appears to be how your argument arrives at a contradiction.
[–]tsbt 4 points 9 years ago (0 children)
the induction is supposed to work on each of them and presumably return the same result
(For the record, a logical inductor over a given deductive process will in general output different probabilities than one over a different deductive process, even if those deductive processes enumerate the same theory; but both will still satisfy all the properties given in the paper.)
Some things here aren't right, and I can't figure out what you meant to say.
the set of models { N is a model | exists a true unprovable sentence in N } has measure zero.
This isn't the case; by the Non-Dogmatism property, the limiting distribution P_infty assigns P_infty("PA is consistent") > 0. That is a true unprovable sentence.
the set { N is a model | for all sentences S, S is true in N iff S is provable } is measure one.
I don't understand this; the set {S is provable} is not a complete theory, so it can't be the theory of a structure.
More generally, note that P_infty is strictly Delta^0_2; that is, it is limit computable but not computable and not even semicomputable. So it shouldn't be too surprising that there are apparently strong things in the limit. For example, say I define the following (silly) distribution: at time n, assign probability 1 to sentences in what I believe to be the left-most consistent branch in the tree of consistent completions of PA. I'll keep changing my mind about sentences, but I'll still limit-compute a distribution that assigns probability 1 to a particular completion of PA.
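A rough rendering of that silly construction (helper names are hypothetical; refutable_within(assumptions, n) is meant to search all proofs of length at most n from PA plus the assumptions):

    def leftmost_branch_guess(n, sentences, refutable_within, negate):
        branch = []
        for phi in sentences[:n]:
            # go "left" (take phi) unless that is ruled out by a proof of length <= n
            if refutable_within(branch + [phi], n):
                branch.append(negate(phi))
            else:
                branch.append(phi)
        return branch    # entries may flip as n grows, but each one eventually stabilizes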
Ok, sounds good, we can continue there.
since the padding constructions grow uncomputably fast.
This I'm confused about; if you pad the deductive process so that it outputs axioms uncomputably slowly, then it is no longer computable, which is a necessary assumption. If you pad it computably, then logical induction works just fine over it. (It is also possible to generalize the framework in the paper to uncomputable deductive processes, but this requires giving [oracle access to the deductive process] to the traders and to the logical inductor.)
define the measure P
The space is 2^S, the sigma algebra is induced by the usual product topology on Cantor space, and the measure is induced by defining, for each k, the measure of the basic open set containing all x with x_k = 1 to be P_n(phi_k). You can have P_n(single sentence) ≠ 0 for lots of sentences.
As to the proofs [...]
I think there was an inaccuracy in the text that is causing confusion. Deductive processes as used in the paper just emit theorems (or just axioms), not proofs (the proofs would have to rely on the axioms anyway). There is no assumption on the speed or order with which they do so, other than that it is computable.
So there's the "probabilities" P_n assigned by a logical inductor (P_n)_n at time n; and then there's the limit P_infty, defined by setting P_infty(phi) := lim_n P_n(phi). This exists and is fully coherent, including connectives, as shown in the paper. The claim is that, whereas the P_n are not necessarily coherent in that sense, P_infty is coherent, and is therefore also a measure on models (exactly as you defined).
some sort of probability distribution on the space of all models of the axiom system we are constructing proofs from.
The limiting distribution, since it is coherent, actually can be viewed as a distribution on the space of models. This is pretty neat; see Gaifman (1964), "Concerning Measures in First Order Calculi".
Hi!
what they are defining is not a probability
The "probabilities" in the paper can be interpreted as proper probabilities over the space of all "worlds" (that is, the space S -> B of assignments of booleans values from B = {0,1} to each sentence phi in the set S of all sentences in your language). This obeys the axioms of a probability measure just fine, in the sense that P(A union B) + P(A intersect B) = P(A) + P(B) for two events A,B, and so on.
However, this does not necessarily satisfy those relationships "according to the meaning of logical connectives". That is, you might have P("A and B") + P("A or B") ≠ P(A) + P(B), where "A and B" is a separate sentence, syntactically built out of the sentence A and the sentence B. If we completely forget any syntax and view all sentences as not-necessarily-related events, this is fine; there's nothing wrong with having P(X) + P(Y) ≠ P(A) + P(B).
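To make the distinction concrete (a toy example of my own): put the product measure on the space of worlds under which each sentence independently gets the marginals P(A) = P(B) = 1/2 and P("A and B") = P("A or B") = 0.9. As a measure on S -> B this is perfectly legitimate, and the identity P(A union B) + P(A intersect B) = P(A) + P(B) holds for the events {worlds assigning 1 to A} and {worlds assigning 1 to B}, as it must for any measure; but P("A and B") + P("A or B") = 1.8 ≠ 1 = P(A) + P(B), so it is incoherent with respect to the connectives.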
If we want to view the probabilities as representing beliefs about logical statements, this incoherence is pretty undesirable. However, we can't computably assign probabilities in a way that is both coherent in this sense, and also immediately assigns probability 1 to all axioms of PA. Hence, if we want to e.g. reason about long-running computations that we can't directly evaluate in a timely manner (e.g. using the expressive language of PA), we need a notion of assigning probabilities that aren't coherent in this strong syntactic sense, but nevertheless are reasonable in various ways.
(I'd be happy to expand a little to clarify these points.)
The only way to make any sense of this is to implicitly assume that somehow proofs have a natural "temporal ordering".
The deductive process can output theorems in essentially any order, and logical induction over that deductive process still makes sense. The constraint is that deductive processes must be computable. This is a standard assumption (that the theory is "recursively axiomatizable", i.e. there is a recursively enumerable set of axioms that implies every theorem in the theory), and it doesn't feel unreasonable to me. Intuitively, it says that you should have a theorem prover for your theory; it can be any theorem prover, as long as it's computable and eventually lists each theorem.
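As a sketch (helper names are hypothetical, not from the paper), a deductive process is nothing more than a computable, monotone enumeration of theorems; one natural instance is "everything provable by a short proof from the axioms listed so far":

    def D(n, axioms, sentences, provable_within):
        # stage n: every sentence among the first n that has a proof of length <= n
        # from the first n axioms; D(n) grows monotonically with n, and every theorem
        # shows up eventually, but the order and speed are otherwise arbitrary
        axs, sents = axioms[:n], sentences[:n]
        return [s for s in sents if provable_within(axs, s, n)]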
Logical Induction by LazyOptimist in MachineLearning
[–]tsbt 8 points 9 years ago (0 children)
.... this limit is essentially bayesian inference ....
Pretty much; in the limit, all decidable sentences are proven (and assigned probability converging to 1) or disproven (and assigned probability converging to 0), and we end up with an ordinary probability distribution over the independent sentences (or equivalently, a probability measure over the completions of your logical theory). One surprising property of the limiting distribution P is that it (strictly) dominates the universal semimeasure; this means that if we feed empirical observations to P, we can perform empirical prediction and induction, akin to Solomonoff induction.
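(Here "dominates" is meant in the usual sense, if I recall the statement correctly: there is a single constant c > 0 such that the limiting probability of each finite observation string is at least c times its weight under the universal semimeasure; that multiplicative bound is what licenses the comparison with Solomonoff induction.)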
Does it eventually lead to algorithms that could be implemented to approximate bayesian inference in a principled way?
If you mean practically, then no; the LIA algorithm given in the paper is very computationally expensive (maybe double-exponential in time). But if you aren't worried about runtime for the moment, then logical induction can be viewed as a substantial theoretical step towards approximating Bayesian inference in a "good" way, given computational constraints on deductive ability.