The connection between program-size complexity and algorithmic probability: H(x) = −log2 P(x) + O(1). Occam's razor: there are few minimum-size programs

Introduction

The first half of the main theorem of this chapter is trivial:

P(x) ≥ 2H(x),

therefore

−log2 P(x) ≤ H(x).

So now we only need to show that

H(x) ≤ −log2 P(x) + c.

In this chapter we'll actually prove that

H(x) ≤ −log2 PC(x) + c

where c depends only on the computer C, not on the S-expression x. Then the special case C = U will complete our proof that

H(x) = −log2 P(x) + O(1).

Why is this result important? Because:

  1. We need it for a crucial lemma in the next chapter, which will enable us to understand relative complexity. So we need the main theorem of this chapter in order to be able to lay bare what relative complexity is all about. In turn, we need next chapter's result on relative complexity in order to use it at the beginning of Part III to understand what a random bit string is.

  2. This chapter's result shows that AIT is what you get when you take logarithms of probabilities. In other words,

    AIT   ≈   −log2 Probability Theory !!!

    So statements in probability theory dealing with products of probabilities correspond to results in AIT dealing with sums of complexities. And divisions turn into subtractions... This gives you a feeling for what AIT is like, in a purely formal way, if you're familiar with the formulas in probability theory. Of course the meaning of these formulas is completely different! Normal probability theory says nothing about computer programs.

  3. Occam's razor, a principle in medieval philosophy, states that ``entities should not be multiplied'', that the best theory is a simple theory. We'll see that minimum-size programs are essentially unique, that there can't be too many of them, and that you can easily get one from another. Let's do that right now!

An application: Occam's razor---There are few minimum-size programs

Here is a very interesting immediate corollary of the main theorem of this chapter. Later in this chapter we'll show that

H(x) ≤ −log2 P(x) + c.

This will imply that the number of minimum-size programs p for the S-expression x is always ≤ 2c:

# { p : U(p) = x   &   |p| = H(x) }   ≤   2c.

Why is this? Well, simply because if there were exactly 2c minimum-size programs for x and no other programs for x, then we would have

P(x) = 2c 2H(x) = 1/2H(x)−c

and then it would be the case that

H(x) = −log2 P(x) + c.

So more than 2c minimum-size programs for x would certainly push −log2 P(x) + c over the edge and make it less than H(x), which we'll see is impossible.

So we'll get the nice corollary from our main theorem that there cannot be too many minimum-size programs for a given result x. In fact, even more than that is true: the minimum-size program for x is essentially unique, because if we're given any minimum-size program x* for x then we only need be told O(1) bits more in order to get any other minimum-size program p' for x. How so?

Well, given x*, we know x and H(x), and if in addition we're told precisely how many minimum-size programs p for x there are, then by systematically running all H(x) bit programs p on U for more and more time until we find the requisite number of p with U(p) = x, we can determine all the minimum-size programs p for x. Then to select any particular p, we just need to know which one to pick. To do this we only need to know x* and two numbers ≤ 2c, that is, two c + 1 bit numbers. [Actually, two c-bit numbers will do. Since neither of these numbers can be zero, we can just let their maximum possible values of 2c wrap around to zero.]

Indeed, we can restate this much more forcefully if we borrow the notion of mutual information content from the next chapter, where it will be defined. We've just given the heart of the argument that any two minimum-size programs p and p' for x have mutual information content H(p:p') very close to the maximum possible, H(x). That is, essentially all their information is in common.

Exercise for the reader: show that a similar argument works if we fix a k-bit neighborhood around the minimum-size program for x, i.e., consider all programs p for x with |p| ≤ H(x) + k, k fixed.

Proof that H(x) ≤ −log2 PC(x) + c

Here is the programming for proving the main theorem of this chapter.

The approximate idea is as follows. We are given a computer C, and from it we'll construct a computer C' with the property that if

PC(x) ≥ 2k

(and we can realize this in a finite amount of time) then there is precisely one k + 1 bit program for C' to produce x. In other words, our approximate goal is that for each j there is either one or no j-bit program for C' to produce x depending on whether or not

PC(x) ≥ 1/2j−1.

But we won't produce all of these programs, which would be infinitely many for each output. We'll actually only produce finitely many for each output.

More precisely, the process is this. We systematically run more and more programs on C for more and more time. At stage k = 0,1,2,... we run each k-bit program on C for time k. And if we find ≥ 2m and < 2m+1 k-bit programs that produce x, [These programs may have trailing bits that were never read by C.] this means that

PC(x) ≥ 2m/2k = 1/2km,

and we emit the following (output, size-of-program) requirement for C': (x, km+1). This is done only if this is a new requirement that we have not emitted previously. So the total probability that C' assigns to x is

≤    PC(x)   ∑j = 1,2,3,... 2j   =   PC(x),

and the Kraft inequality condition for constructing C' is satisfied simply because it was satisfied by C; C' inherits Kraft from C.

Thus C' ``concentrates'' the algorithmic probability that C assigns to each S-expression x into a set of programs for x all of which are of different size.

Then we can simulate C' using U, with the result that

H(x) ≤ −log2 PC(x) + c,

which completes our proof.

What is the computer C that I use to illustrate how our algorithm works? Well, I've chosen a C that ignores all the odd bits in its program p, and counts the number i of bits in even positions up to and including the first 1 bit. The value that C returns, C(p), will be ten raised to this power, that is, 10i.

Since odd bits of p are ignored, for each i ≥ 1 we have

PC(10i) = 2i
HC(10i) = 2i.

Thus we'll produce the requirement (10i, i+1) for constructing C', which requests that there should be a program for C' to produce 10i that is i+1 bits long. That's how C' concentrates C's algorithmic probability into smaller programs, and that's how we show that

H(x) ≤ −log2 PC(x) + c.

Below we define an unending expression all-together for enumerating the entire infinite set of requirements for C', and then we test it with the C described above.

[The case that we're actually interested in in this chapter is C = U, which shows that

H(x) ≤ −log2 P(x) + c,

but this would be too slow to run, and it would be harder to interpret the output.]

After we define all-together, we try it out. We let it run for a while to show that it works with the C that we've chosen. Within the time limit of 60 units that I give it, our algorithm manages to discover the first three requirements for C', namely: (101, 2), (102, 3), (103, 4). So

PC'(101) = 2−2
PC'(102) = 2−3
PC'(103) = 2−4 ...

and for this C and C' we have

ΩC' = 1/2 < ΩC = 1.

That's it, we're finished!

So our LISP program works, at least in this special case it does. Can you try it out on a C for which our algorithm will produce more than one requirement for each output?

[[[[[
   Occam's razor---Concentration process.
   From computer C construct C' such that
   if P_C(x) >= 1/2^k, then
   then C' has a k+1 bit program for x.
   Hence H(x) <= -log_2 P_C(x) + c
   where c depends only on C, not on x.
]]]]]
 
define all-together
 
[this is used to avoid duplicate requirements for C']
let previous-requirements nil
 
[test case special-purpose computer C:]
[ignores odd bits, multiplies by ten until hits a 1]
[this C has many programs that do the same job!]
[[to put U here instead, let C be 'eval read-exp]]
let C '
   let (loop n)
      let ignore-it [be] read-bit [skip bit]
      if = 1 read-bit [then return] n
         [else] (loop * 10 n)
   (loop 10)
 
[stage n = 0, 1, 2, ... of overall concentration process]
[look at all n-bit programs for C, run them for time n]
[merge (output,multiplicity) pairs, emit requirements for C']
let (stage n)
   let previous-requirements
      (make-requirements debug (how-many? nil debug n))
   (stage + n 1)
 
[produce (output,multiplicity) pairs]
[by running all n-bit programs on C for time n]
let (how-many? p n)
   if = n length p
 
      [run program p for time n]
      let v try n [U => 'eval read-exp] C p
      if = success car v
 
          [program ran to completion]
          [indicate that it produces]
          [its output with multiplicity one]
          cons cons cadr v cons 1 nil
               nil
 
          [otherwise program failed]
          nil
          [empty list of (output,multiplicity) pairs]
 
      [otherwise use recursion to combine multiplicities]
      (merge (how-many? cons 0 p n)
             (how-many? cons 1 p n)
      )
 
[add one (output,multiplicity) pair to a list of such pairs]
let (merge1 pair list)
   if  atom list     cons pair nil
   let first-in-list car  list
   let rest-of-list  cdr  list
   let output        car  pair
   let multiplicity  cadr pair
   let output2       car  first-in-list
   let multiplicity2 cadr first-in-list
   if = output output2
      [= -> combine multiplicities]
      cons cons output cons + multiplicity multiplicity2 nil
           rest-of-list
      [!= -> don't combine multiplicities]
      cons first-in-list
           (merge1 pair rest-of-list)
 
[combine two lists of (output,multiplicity) pairs]
let (merge list list2)
   if atom list list2
   (merge1 car list (merge cdr list list2))
 
[exponent in highest power of 2 <= n, n != 0]
let (log2 n)
   let (loop power exponent)
      let new-power    + power power [double it]
      let new-exponent + 1 exponent  [add 1 to it]
      if > new-power n [then return] exponent
         [else] (loop new-power new-exponent)
   (loop [initial power of two] 1 [initial exponent of 2] 0)
 
let (make-requirements list-of-pairs)
   if  atom list-of-pairs  previous-requirements
   let first-pair     car  list-of-pairs
   let list-of-pairs  cdr  list-of-pairs
   let output         car  first-pair
   let multiplicity   cadr first-pair
   let kraft-requirement
       cons output cons - + n 1 (log2 multiplicity) nil
   let previous-requirements (make-requirements list-of-pairs)
   [keep only first appearance of requirement]
   if (is-in? kraft-requirement previous-requirements)
      previous-requirements
   cons debug display kraft-requirement previous-requirements
 
let (is-in? x list) [is x in list?]
   if atom list     false
   if = x car list  true
   (is-in? x cdr list)
 
[HERE GOES!]
(stage 0)
 
try 60 all-together nil

Software for this chapter

The source code and runs of the software for this chapter can be found on the web at these URL's:
   http://www.cs.umaine.edu/~chaitin/ait/occam.l
   http://www.cs.umaine.edu/~chaitin/ait/occam.r
 
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/occam.l
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/occam.r

Exercises

  1. In this chapter we've shown that

    H(x) = −log2 P(x) + O(1).

    Prove that

    H(x,y) = −log2 P(x,y) + O(1)
    H(y|x) = −log2 P(y|x) + O(1).

  2. Show that for any given constant c, the number of n-bit strings β with

    H(β) ≤ H(n) + c

    is bounded by a c' that depends only on c, not on n.

  3. In this chapter we've shown that

    |H(x) + log2 P(x)| ≤ c.

    Determine an explicit numerical value for c.

  4. How many different H(x)-bit programs p for U to compute x can there be? Within one bit of the minimum? Two bits? Three bits? Give explicit numerical bounds.

  5. We've shown that it's easy to go from one minimum-size program for x to another minimum-size program for the same S-expression x. Write out the program that does this.