This is our main theorem on relative complexity, and is the ex post facto justification of the definition that we chose for relative complexity. [The definition of a good definition is that it leads to beautiful theorems!] Before plunging into the details of the proof, I'd like to discuss at the beginning of this chapter some applications of this, our main theorem on relative complexity. Other important applications of this theorem, for defining randomness, will be discussed in the next chapter, the first chapter of Part III.
And two S-expressions x and y are algorithmically independent if H(x:y) is small compared with H(x) and H(y).
The main result of this chapter on relative complexity will be that
In fact, since joint information is symmetrical, [This is because you can easily convert a program for the pair (x y) into one for the pair (y x) by adding an appropriate prefix.] that is, since
it will follow that
Substituting these two expressions for H(x,y) in the expression for H(x:y), we will know at the end of this chapter that
In other words, the mutual information content of x and y is also the extent to which knowing x helps us to know y and the extent to which knowing y helps us to know x.
These elegant and deep results go a long way toward justifying our somewhat unintuitive definition of relative complexity; this will become even more apparent in the next chapter.
As long as we're on the subject of mutual information, here are some easy results that don't require any heavy machinery and are left as an exercise for the reader:
the main result of this chapter on relative complexity.
I swept under the rug a problem with my definition of the relative complexity H(y|x)! I never said which minimum-size program for x one is given for free to use in computing y! What if there are several minimum-size programs for x? What then?
Well, fortunately it turns out that the choice of minimum-size program for x doesn't make too much difference. More precisely, it can make at most c bits of difference, where c doesn't depend on x or y. How come? Well, this is a corollary of the main result of this chapter:
How so? Well, our proof of this formula doesn't depend at all on which minimum-size program for x we're given to help us compute y. In each case we will have
And since the right-hand side can't vary too much, neither can the left-hand side! The only part of the right-hand side that can vary is the O(1); obviously H(x,y) − H(x) will not vary at all. So there can be at most a bounded number of bits of difference, given by the O(1) bound on the difference of both sides that we will establish when we prove the main theorem of this chapter!
So this is a painless way to notice after the fact, that there was never really a problem. We just went ahead without paying any attention, until the fact that there was no need to just dropped out for free! [The underlying reason that the choice of minimum-size program for x doesn't affect the relative complexity H(y|x) too much, is of course Occam's razor that there is essentially only one minimum-size program for x, or, more precisely, that there can't be too many of them and that we can easily go from one to another. But my point here is that you do not have to make this argument, because the fact that the choice of minimum-size program for x doesn't matter too much drops out for free as a corollary of the main theorem of this chapter, whose proof doesn't depend at all on which minimum-size program for x we're given.]
We must show that
The first half is easy. It says that
which we already knew at the end of Part I.
So what this chapter is really about is a proof that
The general idea is given x to take each program for (x y) and convert it into a program for y, simultaneously shortening it by H(x)−c bits, and that c can be picked in such a way that the Kraft inequality is satisfied and the whole thing works.
Now let's get down to business!
We shall actually show the more general result that
for any self-delimiting binary computer C, where the constant c depends only on C. We only need the special case C = U.
In fact, from C and x* we shall construct another computer Cx for which
and therefore
How do we do this? Well, we know C, and we are given a minimum-size program x* for x, and therefore we also know H(x) and x. So we start running more and more programs p on C searching for those that calculate a pair of the form (x y). Each time that we find a p such that C(p) = (x y), we emit a requirement for Cx assigning to the output y a program of size |p|−H(x)+c. Here we will chose c depending on C and not on x in such a way that these requirements are consistent and satisfy the Kraft inequality, i.e., in such a way that the sum of the probabilities of all the requested programs is ≤ 1.
So the sum of the probabilities of the requested programs is
If this is ≤ 1, Kraft is satisfied and we can build Cx.
In other words, given x* we shrink all programs for C to compute pairs (x y) into programs for Cx to compute y which are H(x)−c bits shorter. The c is chosen, following the lemma that is immediately below, in such a way as to guarantee that all the requirements for constructing Cx are consistent.
Now let's do the lemma!
Why are these requirements consistent? How do we know that we can pick a c for which these requirements satisfy the Kraft inequality?
Well, take C and make it into C' as follows:
So
And, by the previous chapter, there is a c depending on C and not on x such that
or equivalently
for all x.
Let's use precisely this c as our c in building Cx. So with this choice of c, then we have
which confirms that this is just the c that we need so that the Kraft inequality
is satisfied and we can build Cx using the same c for all x.
And this gives us the value of c to carry over to our next algorithm, the one that proves our main theorem. There we'll suppose that this gave us c = 100.
define U-prime
let (is-pair? x)
if atom x false
if atom cdr x false
if atom cdr cdr x true
false
[run original program for U]
let v eval read-exp
[and if is a pair, return first element]
if (is-pair? v) car v
[otherwise loop forever]
let (loop) [be] (loop) [in]
(loop)
To test U',
we consider four programs.
On U, they produce xyz, (a),
(a b), (a b c), respectively.
On U', the third program produces a,
and the first, second, and fourth programs run forever,
as they should.
run-utm-on bits' xyz run-utm-on bits' cons a nil run-utm-on bits' cons a cons b nil run-utm-on bits' cons a cons b cons c nil cadr try 99 U-prime bits' xyz cadr try 99 U-prime bits' cons a nil cadr try 99 U-prime bits' cons a cons b nil cadr try 99 U-prime bits' cons a cons b cons c nil
The test computer C that we're using counts the number of bits x up to and including the first 1 bit in its program, then C counts the number of bits y up to and including the second 1 bit in its program, and then C produces the output (x xy).
[[[[[
FUNDAMENTAL DECOMPOSITION
We prove here that
H(y|x) <= H_C(x,y) - H(x) + c
]]]]]
define (all-together x*)
let c debug 100 [constant to satisfy Kraft (see lemma)]
let x debug run-utm-on debug x*
let H-of-x debug length x*
[programs we've discovered that calculate pairs
starting with x]
let programs nil
let (stage n)
[generate requirements for all new programs we've
discovered that produce (x y) pairs]
let programs
(add-to-set debug (halts? nil debug n) programs)
(stage + n 1)
[at stage n = 0, 1, 2, 3, ...]
[look at all programs with <=n bits that halt within time n]
[returns list of all of them that produce pairs (x y)]
let (halts? p bits-left)
let v try n C p [C is eval read-exp if C = U]
if = success car v (look-at cadr v)
if = 0 bits-left nil
append (halts? append p cons 0 nil - bits-left 1)
(halts? append p cons 1 nil - bits-left 1)
[returns (p) if C(p) = (x y), otherwise ()]
let (look-at v)
if (and (is-pair v)
= x car v ) cons p nil
nil
[logical "and"]
let (and p q)
if p q false
[is x a pair?]
let (is-pair? x)
if atom x false
if atom cdr x false
if atom cdr cdr x true
false
[is an element in a set?]
let (is-in-set? element set)
if atom set false
if = element car set true
(is-in-set? element cdr set)
[forms set union avoiding duplicates,
and makes requirement for each new find]
let (add-to-set new old)
if atom new old
let first-new car new
let rest-new cdr new
if (is-in-set? first-new old) (add-to-set rest-new old)
(do (make-requirement first-new)
cons first-new (add-to-set rest-new old)
)
[first argument discarded, done for side-effect only!]
let (do x y) y
[given new p such that C(p) = (x y),
we produce the requirement for C_x
that there be a program for y that is |p|-H(x)+c bits long]
let (make-requirement p)
display cons cadr cadr try no-time-limit C p
cons - + c length p H-of-x
nil
let C ' [here eval read-exp gives U]
[test case special-purpose computer C here in place of U:]
[C(00100001) with x-1 and y-1 0's gives pair (x xy)]
[loop function gives number of bits up to next 1 bit]
let (loop n)
if = 1 read-bit n
(loop + n 1)
let x (loop 1)
let y (loop 1)
cons x cons * x y nil
[HERE GOES!]
(stage 0)
Now let's run this algorithm.
First let's give it a 16-bit program for U to produce x = 3, and
then let's give it a 16-bit program for U to produce x = 4.
So H(x) = 16, and c = 100, and we'll discover the following programs:
C(0011)=(3 3), C(00101)=(3 6),
C(001001)=(3 9), ...
C(00011)=(4 4), C(000101)=(4 8),
C(0001001)=(4 12), ...
So the requirements (output, size-of-program)
that we'll generate for Cx will be
(3 100+4−16), (6 100+5−16),
(9 100+6−16), ...
(4 100+5−16), (8 100+6−16),
(12 100+7−16), ...
Here's how you do a time-limited run of this infinite-time algorithm:
define x* 3
length bits x*
[give all-together x*]
try 60 cons cons "'
cons all-together
nil
cons cons "'
cons bits x*
nil
nil
nil
define x* 4
length bits x*
[give all-together x*]
try 60 cons cons "'
cons all-together
nil
cons cons "'
cons bits x*
nil
nil
nil
That wasn't so bad was it?!
The main idea was simple, it just took a while to work through all the details!
http://www.cs.umaine.edu/~chaitin/ait/decomp.l http://www.cs.umaine.edu/~chaitin/ait/decomp.r http://www.cs.umaine.edu/~chaitin/ait/lemma.l http://www.cs.umaine.edu/~chaitin/ait/lemma.r http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/decomp.l http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/decomp.r http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/lemma.l http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait/lemma.r
and determine the constant c.
and determine the constant c.
and determine the constant c.
and determine the constant c.
and determine the constant c.
and determine the constant c.
is a distance metric to within O(1). I.e., show that
Does this metric D(x,y) capture the intuition that two S-expressions are close if their mutual information is high? Hint: Note that
Determine the constant c.