The basic result on relative complexity: H(y|x) = H(x,y) − H(x) + O(1)

Introduction

This entire chapter will be devoted to the proof of one major theorem:

H(y|x) = H(x,y) − H(x) + O(1).

This is our main theorem on relative complexity, and is the ex post facto justification of the definition that we chose for relative complexity. [The definition of a good definition is that it leads to beautiful theorems!] Before plunging into the details of the proof, I'd like to discuss at the beginning of this chapter some applications of this, our main theorem on relative complexity. Other important applications of this theorem, for defining randomness, will be discussed in the next chapter, the first chapter of Part III.

An application: mutual information

The mutual information content H(x:y) of two S-expressions x and y is defined to be the extent to which computing them together is cheaper than computing them separately:

H(x:y) = H(x) + H(y) − H(x,y).

And two S-expressions x and y are algorithmically independent if H(x:y) is small compared with H(x) and H(y).

The main result of this chapter on relative complexity will be that

H(x,y) = H(x) + H(y|x) + O(1).

In fact, since joint information is symmetrical, [This is because you can easily convert a program for the pair (x y) into one for the pair (y x) by adding an appropriate prefix.] that is, since

H(x,y) = H(y,x) + O(1),

it will follow that

H(x,y)     =     H(x) + H(y|x) + O(1)     =     H(y) + H(x|y) + O(1).

Substituting these two expressions for H(x,y) in the expression for H(x:y), we will know at the end of this chapter that

H(x:y)     =     H(y) − H(y|x) + O(1)     =     H(x) − H(x|y) + O(1).

In other words, the mutual information content of x and y is also the extent to which knowing x helps us to know y and the extent to which knowing y helps us to know x.

These elegant and deep results go a long way toward justifying our somewhat unintuitive definition of relative complexity; this will become even more apparent in the next chapter.

As long as we're on the subject of mutual information, here are some easy results that don't require any heavy machinery and are left as an exercise for the reader:

Applications to theoretical biology?

Many years ago I used mutual information in an attempt to formulate a general mathematical definition of life, to distinguish organized living matter from ordinary matter. [See my papers on theoretical biology: ``To a mathematical definition of `life','' ACM SICACT News, No. 4 (January 1970), pp. 12-18; ``Toward a mathematical definition of `life','' in R. Levine and M. Tribus, The Maximum Entropy Formalism, MIT Press, 1979, pp. 477-498; ``Algorithmic information and evolution,'' in O. Solbrig and G. Nicolis, Perspectives on Biological Complexity, IUBS Press, 1991, pp. 51-60.] The idea was that the parts of a living organism are highly correlated, highly interdependent, and have high mutual information. After all, the primordial quality of living beings is their tremendous complexity and interdependentness. I never got very far with this theory. Can you do better?

Another application: does the relative complexity H(y|x) depend on which minimum-size program for x one is given?

Here is another application of

H(x,y) = H(x) + H(y|x) + O(1),

the main result of this chapter on relative complexity.

I swept under the rug a problem with my definition of the relative complexity H(y|x)! I never said which minimum-size program for x one is given for free to use in computing y! What if there are several minimum-size programs for x? What then?

Well, fortunately it turns out that the choice of minimum-size program for x doesn't make too much difference. More precisely, it can make at most c bits of difference, where c doesn't depend on x or y. How come? Well, this is a corollary of the main result of this chapter:

H(x,y) = H(x) + H(y|x) + O(1)!

How so? Well, our proof of this formula doesn't depend at all on which minimum-size program for x we're given to help us compute y. In each case we will have

H(y|x) = H(x,y) − H(x) + O(1).

And since the right-hand side can't vary too much, neither can the left-hand side! The only part of the right-hand side that can vary is the O(1); obviously H(x,y) − H(x) will not vary at all. So there can be at most a bounded number of bits of difference, given by the O(1) bound on the difference of both sides that we will establish when we prove the main theorem of this chapter!

So this is a painless way to notice after the fact, that there was never really a problem. We just went ahead without paying any attention, until the fact that there was no need to just dropped out for free! [The underlying reason that the choice of minimum-size program for x doesn't affect the relative complexity H(y|x) too much, is of course Occam's razor that there is essentially only one minimum-size program for x, or, more precisely, that there can't be too many of them and that we can easily go from one to another. But my point here is that you do not have to make this argument, because the fact that the choice of minimum-size program for x doesn't matter too much drops out for free as a corollary of the main theorem of this chapter, whose proof doesn't depend at all on which minimum-size program for x we're given.]

Overview of proof that H(y|x) = H(x,y) − H(x) + O(1)

Okay, I've shown you some of the things that our main theorem is good for, now let me sketch the proof. I'll fly overhead at ten thousand meters and give you an overview, then we'll buckle down and really get to work.

We must show that

H(y|x) ≥ H(x,y) − H(x) − c
H(y|x) ≤ H(x,y) − H(x) + c.

The first half is easy. It says that

H(x,y) ≤ H(x) + H(y|x) + c,

which we already knew at the end of Part I.

So what this chapter is really about is a proof that

H(y|x) ≤ H(x,y) − H(x) + c.

The general idea is given x to take each program for (x y) and convert it into a program for y, simultaneously shortening it by H(x)−c bits, and that c can be picked in such a way that the Kraft inequality is satisfied and the whole thing works.

Now let's get down to business!

We shall actually show the more general result that

H(y|x) ≤ HC(x,y) − H(x) + c

for any self-delimiting binary computer C, where the constant c depends only on C. We only need the special case C = U.

In fact, from C and x* we shall construct another computer Cx for which

HCx(y) = HC(x,y) − H(x) + c

and therefore

H(y|x) ≤ HCx(y) + c'
                        ≤ HC(x,y) − H(x) + c''.

How do we do this? Well, we know C, and we are given a minimum-size program x* for x, and therefore we also know H(x) and x. So we start running more and more programs p on C searching for those that calculate a pair of the form (x y). Each time that we find a p such that C(p) = (x y), we emit a requirement for Cx assigning to the output y a program of size |p|−H(x)+c. Here we will chose c depending on C and not on x in such a way that these requirements are consistent and satisfy the Kraft inequality, i.e., in such a way that the sum of the probabilities of all the requested programs is ≤ 1.

So the sum of the probabilities of the requested programs is

ΩCx = 2H(x)−cy PC(x,y).

If this is ≤ 1, Kraft is satisfied and we can build Cx.

In other words, given x* we shrink all programs for C to compute pairs (x y) into programs for Cx to compute y which are H(x)−c bits shorter. The c is chosen, following the lemma that is immediately below, in such a way as to guarantee that all the requirements for constructing Cx are consistent.

Now let's do the lemma!

Why are these requirements consistent? How do we know that we can pick a c for which these requirements satisfy the Kraft inequality?

Well, take C and make it into C' as follows:

if C(p) = (x y), then C'(p) = x.

So

PC'(x) = ∑y PC(x,y).

And, by the previous chapter, there is a c depending on C and not on x such that

H(x) ≤ −log2 PC'(x) + c
                ≤ −log2y PC(x,y) + c

or equivalently

y PC(x,y) ≤ 2H(x)+c

for all x.

Let's use precisely this c as our c in building Cx. So with this choice of c, then we have

ΩCx = 2H(x)−cy PC(x,y)
                ≤   2H(x)−c   2H(x)+c   =   1

which confirms that this is just the c that we need so that the Kraft inequality

ΩCx ≤ 1

is satisfied and we can build Cx using the same c for all x.

Details of proof of lemma

Here is the programming for proving the lemma in the case C = U. We construct a computer C' = U' such that U'(p) = x if U(p) = (x y). Hence, by the main result of the previous chapter,

H(x) ≤ −log2y P(x,y) + c.

And this gives us the value of c to carry over to our next algorithm, the one that proves our main theorem. There we'll suppose that this gave us c = 100.

define U-prime
 
let (is-pair? x)
   if atom x         false
   if atom cdr x     false
   if atom cdr cdr x true
                     false
 
[run original program for U]
 
let v eval read-exp
 
[and if is a pair, return first element]
 
if (is-pair? v) car v
 
[otherwise loop forever]
 
   let (loop) [be] (loop) [in]
      (loop)
To test U', we consider four programs. On U, they produce xyz, (a), (a b), (a b c), respectively. On U', the third program produces a, and the first, second, and fourth programs run forever, as they should.
run-utm-on bits' xyz
run-utm-on bits' cons a nil
run-utm-on bits' cons a cons b nil
run-utm-on bits' cons a cons b cons c nil
 
cadr try 99 U-prime bits' xyz
cadr try 99 U-prime bits' cons a nil
cadr try 99 U-prime bits' cons a cons b nil
cadr try 99 U-prime bits' cons a cons b cons c nil

Details of proof of main theorem

Given a supposedly minimum-size program x* for x, the algorithm below looks for programs p for the computer C such that C(p) = (x y). For each such p, it emits the requirement that there should be a program for Cx that produces y whose size is |p|−H(x)+c bits.

The test computer C that we're using counts the number of bits x up to and including the first 1 bit in its program, then C counts the number of bits y up to and including the second 1 bit in its program, and then C produces the output (x xy).

[[[[[
   FUNDAMENTAL DECOMPOSITION
   We prove here that
      H(y|x) <= H_C(x,y) - H(x) + c
]]]]]
 
define (all-together x*)
 
let c debug 100 [constant to satisfy Kraft (see lemma)]
 
let x debug run-utm-on debug x*
 
let H-of-x debug length x*
 
[programs we've discovered that calculate pairs
 starting with x]
let programs nil
 
let (stage n)
    [generate requirements for all new programs we've
     discovered that produce (x y) pairs]
    let programs
        (add-to-set debug (halts? nil debug n) programs)
    (stage + n 1)
 
[at stage n = 0, 1, 2, 3, ...]
[look at all programs with <=n bits that halt within time n]
[returns list of all of them that produce pairs (x y)]
let (halts? p bits-left)
   let v try n C p [C is eval read-exp if C = U]
   if = success car v (look-at cadr v)
   if = 0 bits-left nil
   append (halts? append p cons 0 nil - bits-left 1)
          (halts? append p cons 1 nil - bits-left 1)
 
[returns (p) if C(p) = (x y), otherwise ()]
let (look-at v)
   if (and (is-pair v)
            = x car v ) cons p nil
      nil
 
[logical "and"]
let (and p q)
   if p q false
 
[is x a pair?]
let (is-pair? x)
   if atom x         false
   if atom cdr x     false
   if atom cdr cdr x true
                     false
 
[is an element in a set?]
let (is-in-set? element set)
   if atom set          false
   if = element car set true
   (is-in-set? element cdr set)
 
[forms set union avoiding duplicates,
 and makes requirement for each new find]
let (add-to-set new old)
   if  atom new  old
   let first-new car new
   let rest-new  cdr new
   if (is-in-set? first-new old) (add-to-set rest-new old)
   (do (make-requirement first-new)
       cons first-new (add-to-set rest-new old)
   )
 
[first argument discarded, done for side-effect only!]
let (do x y) y
 
[given new p such that C(p) = (x y),
 we produce the requirement for C_x
 that there be a program for y that is |p|-H(x)+c bits long]
let (make-requirement p)
   display cons cadr cadr try no-time-limit C p
           cons - + c length p H-of-x
                nil
 
let C ' [here eval read-exp gives U]
[test case special-purpose computer C here in place of U:]
[C(00100001) with x-1 and y-1 0's gives pair (x xy)]
[loop function gives number of bits up to next 1 bit]
   let (loop n)
      if = 1 read-bit n
      (loop + n 1)
   let x (loop 1)
   let y (loop 1)
   cons x cons * x y nil
 
[HERE GOES!]
(stage 0)
Now let's run this algorithm. First let's give it a 16-bit program for U to produce x = 3, and then let's give it a 16-bit program for U to produce x = 4. So H(x) = 16, and c = 100, and we'll discover the following programs: C(0011)=(3 3), C(00101)=(3 6), C(001001)=(3 9), ... C(00011)=(4 4), C(000101)=(4 8), C(0001001)=(4 12), ... So the requirements (output, size-of-program) that we'll generate for Cx will be (3 100+4−16), (6 100+5−16), (9 100+6−16), ... (4 100+5−16), (8 100+6−16), (12 100+7−16), ... Here's how you do a time-limited run of this infinite-time algorithm:
define x* 3
length bits x*
 
[give all-together x*]
try 60 cons cons "'
            cons all-together
            nil
       cons cons "'
            cons bits x*
            nil
       nil
    nil
 
define x* 4
length bits x*
 
[give all-together x*]
try 60 cons cons "'
            cons all-together
            nil
       cons cons "'
            cons bits x*
            nil
       nil
    nil
That wasn't so bad was it?! The main idea was simple, it just took a while to work through all the details!

Software for this chapter

The source code and runs of the software for this chapter can be found on the web at these URL's:
   http://www.cs.umaine.edu/~chaitin/ait/decomp.l
   http://www.cs.umaine.edu/~chaitin/ait/decomp.r
 
   http://www.cs.umaine.edu/~chaitin/ait/lemma.l
   http://www.cs.umaine.edu/~chaitin/ait/lemma.r
 
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/decomp.l
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/decomp.r
 
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/lemma.l
   http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/lemma.r

Exercises

  1. Prove that

    |H(x:y) − H(y:x)| ≤ c

    and determine the constant c.

  2. Prove that

    |H(x:x) − H(x)| ≤ c

    and determine the constant c.

  3. Prove that

    H(nil:x) ≤ c

    and determine the constant c.

  4. Prove that

    0 ≤ H(x:y) + c

    and determine the constant c.

  5. Prove that

    H(x:y) ≤ H(x) + c

    and determine the constant c.

  6. Prove that

    H(x:y) ≤ H(y) + c

    and determine the constant c.

  7. Relative complexity expressed in terms of algorithmic probabilities. Show that

    H(y|x)   =   −log2 P(x,y)/P(x) + O(1).

  8. Joint complexity expressed in terms of algorithmic probabilities. Show that

    H(x,y)   =   −log2 P(x,y) + O(1)
      =   −log2 P(x)P(y|x) + O(1)
      =   −log2 P(y)P(x|y) + O(1).

  9. Mutual information expressed in terms of algorithmic probabilities. Show that

    H(x:y)   =   −log2 P(x)P(y)/P(x,y) + O(1)
      =   −log2 P(y)/P(y|x) + O(1)
      =   −log2 P(x)/P(x|y) + O(1).

  10. Distance metric. Show that

    D(x,y) = H(x|y) + H(y|x)

    is a distance metric to within O(1). I.e., show that

    D(x,x) = O(1)
    D(x,y) = D(y,x) + O(1)
    D(x,z) ≤ D(x,y) + D(y,z) + O(1).

    Does this metric D(x,y) capture the intuition that two S-expressions are close if their mutual information is high? Hint: Note that

    D(x,y) = 2H(x,y) − H(x) − H(y) + O(1)
    = H(x) + H(y) − 2H(x:y) + O(1).

  11. In this chapter we proved that

    H(y|x) ≤ H(x,y) − H(x) + c.

    Determine the constant c.