Theoretical interlude---What is randomness? My definitions

Finally, THE definition of randomness!! Warning to the reader!

There's only one definition of randomness (divided into the finite and the infinite case for technical reasons): something is random if it is algorithmically incompressible or irreducible. More precisely, a member of a set of objects is random if it has the highest complexity that is possible within this set. In other words, the random objects in a set are those that have the highest complexity. Applied to the set of all n-bit strings this gives one of our definitions, applied to infinite binary sequences this gives our second definition.

This is the basic, new fundamental idea in AIT, and it's the source of the randomness that I discovered in the heart of pure mathematics, the randomness that Gödel and Turing encountered without realizing it! This one idea is the raison d'être of AIT. I want to understand randomness, I want to formulate the theory that defines it as well as I can.

But in Part III I'll temporarily introduce a number of variant definitions just in order to show that they're equivalent!

Don't get confused!

What is randomness? Overview of six variant definitions

This chapter is a theoretical interlude; we'll stop programming for a while and discuss definitions of randomness = incompressibility = irreducibility. One of the key contributions of AIT is the recognition that this program-size kind of randomness also implies that something is typical and undistinguished, that is, the more traditional statistical kind of randomness.

In this part we discuss five, really six, definitions of randomness: my two for bit strings (using absolute and relative complexity, respectively), and four for infinite binary sequences or, equivalently, for base-two real numbers. Two of the definitions for random real are statistical (Martin-Löf's and Solovay's), and two involve program size (my two: Chaitin and strong Chaitin randomness).

It turns out that the two definitions of a random bit string are equivalent, and that the four definitions for random reals are equivalent. That's a very good sign; it confirms that we've captured the correct concept!

In this chapter we'll show that my two definitions of random bit string are equivalent, but it'll take us all of Part III to prove that the four definitions of a random real number are equivalent.

The common thread of my definitions is, as I said, that incompressible implies typical, and that a ``random'' string is one that has ``close to'' the highest possible complexity. So it turns out that a random n-bit string β is one for which H(β) is close to n + H(n), which is the greatest possible and also the typical complexity of an n-bit string.

n-bit β is random   ⇔   H(β) ≈ n + H(n)

As the complexity of H(β) drops below this β becomes less and less random; it is a matter of degree. [If you must pick a cutoff, it should be at around H(β) = n.]

For random infinite binary sequences x, we consider the complexity H(xn) of the first n bits of x as a function of n, and we'll demand that xn always be random, as random as possible. Due, for example, to long runs of consecutive 1's and 0's, with probability one the complexity of some initial segments of x will have to dip well below n + H(n), so we cannot be as demanding as we were in the case of finite randomness.

It turns out that the correct cutoff is to allow dips of about H(n), so that we'll define x to be Chaitin random if and only if there is a c such that H(xn) > nc for all n.

infinite binary sequence x is Chaitin random   ⇔   ∃cn [ H(xn) > nc ]

This definition of randomness for infinite binary sequences x is of an all-or-nothing character, and it is satisfied by ``most'' infinite binary sequences x, that is, with probability one. In particular, it's easy to see that the halting probability Ω is Chaitin random; that was one of the reasons that I picked this particular definition!

So the notion of program-size complexity provides a unified view of randomness that works both for bit strings and for infinite binary sequences. I believe Chaitin randomness to be a more fundamental concept than statistical randomness, e.g. because program-size irreducibility yields statistical randomness as a corollary in the infinite case, and it also applies to bit strings, where Martin-Löf's and Solovay's statistical definitions do not apply.

When is an n-bit string β random?

Consider an n-bit string β. It's obvious that

H(β) = H(n, β) + O(1)

because we can easily convert a program for (n  β) into one for β and vice versa by adding an appropriate prefix. And according to our fundamental theorem on relative complexity proved in the previous chapter,

H(n, β) = H(n) + H(β|n) + O(1).

Therefore,

H(β) = H(n) + H(β|n) + O(1)

And because the complexity of an n-bit string β given n is obviously bounded by n + c, it follows that

H(β) ≤ H(n) + n + c,

which we already knew in Part I.

So we have a beautifully simple relationship between the absolute complexity H(β) of an n-bit string β and its relative complexity H(β|n) given n. [It's another confirmation that we chose the right definition of relative complexity! It's the basic fact governing the randomness of n-bit strings. It opens the door to all our deeper understanding of randomness. With its help everything will become clear.]

It follows that the most complex or ``random'' n-bit strings β are those for which the complexity of β is very close to n + H(n),

H(β) ≈ n + H(n),

or, equivalently, those for which the complexity of β given n is very close to n,

H(β|n) ≈ n.

And most n-bit strings are random. In fact, a fraction of at most 1/2kc of the n-bit strings β satisfy

H(β) ≤ n + H(n) − k,

or, equivalently,

H(β|n) ≤ nk.

Why? Simply because there are 2n n-bit strings β but there are fewer than 2nk programs for β given n with size < nk bits [since ∑j < nk 2j = 2nk−1].

In other words,

# { n-bit β : H(β) ≤ n + H(n) − k } ≤ 2nk + c.

For bit strings, randomness is a matter of degree: the most random ones have complexity very close to the maximum, and as their complexity drops, they become less and less random. But for infinite binary sequences, as we shall see, it's all or nothing!

The basic property of finite randomness is this: if an n-bit string β is unusual, if it stands out from the crowd in any way, if it has any special distinguishing property P, then it isn't random. Why is this?

Because if β is a member of a small set (the set of all strings having some unusual, atypical property P), then a simpler way to specify β is by indicating its unusual property P (via a program for enumerating all strings that have that property), and then give β's position or index κ (a small number, much less than 2n) in that enumeration. If the number of bits needed to specify the distinguishing property P is not greater than the number of bits saved because the property is unusual, i.e., if the complexity H(P) of the distinguishing property is substantially less than n − log2 κ, then we win; we've found a way to specify β using much less than n bits!

In other words, we're given n for free, and a program that given n enumerates all n-bit strings β with property P, and an integer κ that tells us at what position in this enumeration the particular β that we're interested in appears.

Let's restate this.

Given n, it normally takes n bits to specify an n-bit binary string β. But if β has an unusual property P, and there are only about 2m < 2n n-bit strings with this unusual property P, and if it takes much less than nm bits to specify P, then we may be able to compress β by specifying its unusual property P and which n-bit string with this property β is (about m bits). The result is a program for β given n with about m + H(P) bits, which is much less than m + nm = n bits.

Mutual information again

Let's put this new concept, random bit string, to work. Here's an application to mutual complexity.

Let's look at the mutual complexity of n-bit strings. We consider H(α:β), where α and β are both n-bit strings. How high and how low can we make the mutual information H(α:β)?

Well, to get the lowest possible mutual information for n-bit strings, let α = 0n be the string of n 0's, and let β = 1n be the string of n 1's. Then the mutual information is H(n) + O(1), the lowest possible.

[Why must two n-bit strings α, β always have mutual information H(α:β) ≥ H(n) − c? Well,

H(α:β) = H(α) + H(β) − H(α,β)
= (H(n) + H(α|n)) + (H(n) + H(β|n))(H(n) + H(α,β|n)) + O(1)
= H(n) + H(α|n) + H(β|n) − H(α,β|n) + O(1).

But obviously

H(α,β|n) ≤ H(α|n) + H(β|n) + c'

(relativised subadditivity). So

H(α:β) ≥ H(n) − c'',

which was to be proved.]

But here is a more interesting example, [Taken from my paper ``Toward a mathematical definition of `life','' in R. Levine and M. Tribus, The Maximum Entropy Formalism, MIT Press, 1979, pp. 477-498.] in which we minimize the mutual information H(α:β) while maximizing the individual information contents H(α) and H(β) of two n-bit strings α and β. Consider a highly random 2n-bit string γ, and let α and β be, respectively, the first n bits of γ and the last n bits of γ. So, by hypothesis,

H(γ) = H(α,β) + O(1),

and H(γ) is very close, within a few bits of, 2n + H(2n).

Now on the one hand we have

H(n) = H(2n) + O(1).

[Why? Because we can easily convert a program for n into one for 2n, and vice versa, just by adding a suitable prefix.] On the other hand H(α) must be within a few bits of n + H(n), and H(β) must also be within a few bits of n + H(n). Why? Because

H(γ) = 2n + H(n) + O(1)
H(n) + H(α|n) + H(β|n) + c.

That is, start by computing n, then compute α from n and β from n. These n-relative programs for α and β must both have size very close to n; otherwise we get a program for γ that is too small, that is, that is much smaller than 2n + H(2n) = 2n + H(n) + O(1) bits in size.

It follows that the two halves α and β of a 2n-bit string γ that has close to the maximum possible complexity, simultaneously have H(α) very close to n + H(n), H(β) very close to n + H(n), and H(α,β) very close to 2n + H(n). Thus

H(α:β) = H(α) + H(β) − H(α,β)
= n + H(n) + n + H(n) − 2nH(n) + O(1)
= H(n) + O(1),

which was to be proved.

And to maximize the mutual information of two n-bit strings, just let α be an n-bit string with very close to the maximum possible complexity n + H(n), and let β be α or the bit by bit complement of α. Then

H(α:β) = n + H(n) + O(1).

The number of n-bit strings with maximum complexity is random

Now let's have some fun! Let me illustrate these ideas by using them to show that there have to be many n-bit strings with a certain property, because the number of such strings is a random n-bit number! [See G. Chaitin, ``On the number of n-bit strings with maximum complexity,'' Applied Mathematics and Computation 59 (1993), pp. 97-100.]

Consider the set of all n-bit strings β. Let hn be the maximum complexity of an n-bit string:

hn = max|β| = n H(β).

And let κn be the n-bit base-two numeral for the number of n-bit strings β with exactly the maximum possible complexity hn:

κn = # { |β| = n : H(β) = hn }.

There must be at least one string β that achieves this maximum. If there should happen to be exactly 2n such β, then we will let κn ``wrap around'' and be the n-bit string consisting entirely of 0's.

We know that

hn = n + H(n) + O(1)

and let's define Δn to be their difference

Δn = | n + H(n) − hn | ,

and σn to be the sign of this difference, either a plus or a minus.

Now consider the following n-relative computation. We're given for free n*, a minimum-size program for n, and a program p consisting of a suitable prefix followed by Δn and σn and a minimum-size program to get κn from n. From this we first get n = U(n*) and H(n) = |n*|. Then we add n and H(n), and we are within O(1) of hn. Next we use Δn and σn, which are O(1) bits of information, to ``correct'' n + H(n) and get the exact value of hn, the maximum possible complexity of an n-bit string. Next we calculate κn from n*; the program that we are given for doing this is exactly Hn|n) bits long.

At this point we've read in all of p; it's Hn|n) + O(1) bits long.

Then by systematically running more and more programs on U for more and more time, we start determining better and better upper bounds on the complexity H(β) of each n-bit string β, until we are sure that we have all 2n − κn n-bit strings β with H(β) < hn. At that point the remaining κn n-bit strings β must all have precisely the maximum possible complexity H(β) = hn. Pick one, let's call it βmax,

Hmax) = hn,

and return βmax as the value of this computation.

So what we have here is an

Hn|n) + O(1)

bit program to calculate from n* an n-bit string βmax with

Hmax) =   n + H(n) + O(1)
Hmax|n) =   n + O(1).

It follows that

nc   ≤   Hmax|n)   ≤   Hn|n) + c

and so

Hn|n) =   n + O(1)
Hn) =   n + H(n) + O(1).

Therefore κn is itself an n-bit string with almost exactly the maximum possible complexity! It's full of valuable information!

From this it is easy to see that

κn ≥ 2nc.

For if there were an unbounded number of initial 0 bits at the beginning of the bit string κn, then the complexity of κn would drop substantially, which is impossible. Indeed, j initial 0's would reduce κn's complexity by about j − log2 j bits.

So we have shown that κn, the number of n-bit strings β with exactly the maximum possible complexity H(β) = hn, is large because it is a random n-bit number! κn has to be at least a fixed fraction of all the n-bit strings, it has to be greater than 2nc.

In order to prove this assertion, which involves only absolute complexity, we had to make a foray into the world of relative complexity. I hope you enjoyed reading it as much as I enjoyed discovering it! I think it's a cute example. And it is good preparation for our future discussion of the fact that the halting probability Ω is a random real number. In fact, the κn are like finite versions of Ω. Ω sort of combines all of the κn into a single real number.

By the way, it also follows that the number of 0's and 1's in the bit string κn is nearly the same, otherwise κn could be compressed. In fact, if the relative frequency of 0's is p and the relative frequency of 1's is q and n is large, then κn can be compressed from n bits to about (−p log2 pq log2 q) × n bits. [Actually it's better to fix an ε > 0 and show that for large n, any n-bit string with relative frequencies of 0's and 1's that differ from 1/2 by more than ε can be compressed into a program whose size is at most about n × (−p log2 pq log2 q) bits, where p = 1/2 − ε and q = 1/2 + ε. I.e., the n-bit string is compressed into a program a fixed percentage smaller (depending only on ε).] Can you figure this out for yourself? Look at base-two logarithms of binomial coefficients and use Stirling's approximation:

log2 n!   ~   n log2 n.

Using AIT to prove that there are infinitely many prime numbers

Here's another example of how we can play with these ideas. Let's use AIT to show that there cannot be only a finite number of prime numbers. A prime number is an integer greater than 1 that is divisible only by 1 and itself. So now we're doing elementary number theory!

Consider a positive integer n, and suppose on the contrary that there are only finitely many prime numbers, in fact, exactly k of them: p0,...,pk−1. Then n can always be expressed via a prime factorization of the form

n   =   ∏i < k piei.

And if we are given the exponents of the k primes in this factorization, then we can calculate n. So, using the subadditivity of program-size complexity, we have

H(n)   ≤   ∑i < k H(ei) + c.

But on the one hand, each exponent ei is ≤ log2 n and thus

H(ei)   =   O(log log n).

On the other hand, for most n, for random n, [Definition: An integer is random iff its base-two numeral is a random bit string.]

H(n)   ≈   log2 n + O(log log n).

So we conclude that for random n,

log2 n + O(log log n)   =   H(n)   ≤   k  O(log log n),

which is impossible, since the logarithm grows much faster than the iterated logarithm. Therefore our assumption that there were exactly k primes must be incorrect, no matter what the value of k is. Therefore the number of primes must be infinite! [Please note that in this proof we did not use the fact that the prime factorization is unique. That is why I started with

H(n)   ≤   ∑i < k H(ei) + c

instead of with

H(n)   =   H(e0,...,ek−1) + O(1) .

]

This amusing proof of Euclid's result that there are infinitely many primes certainly feels like it belongs to the new millennium! [It was first published as an appendix in my paper ``Toward a mathematical definition of `life','' in R. Levine and M. Tribus, The Maximum Entropy Formalism, MIT Press, 1979, pp. 477-498. It was more widely seen in my paper ``Gödel's theorem and information,'' International Journal of Theoretical Physics 22 (1982), pp. 941-954.] It's very different from the proof known to the ancient Greeks, which is that if there were only k primes pi, then the expression

1 + ∏i < k pi

would also have to be prime, which is impossible.

When is an infinite binary sequence random? Random reals---My definition

Let's consider a real number in the unit interval, [The set of real numbers between zero and one.] or equivalently, an infinite binary sequence. The only difference is that the reals

.xxx100000...

and

.xxx011111...

aren't different, but the corresponding infinite binary sequences certainly are! But this will not be a problem.

We saw at the beginning of this chapter that for an n-bit string β

Prob{ H(β) ≤ n + H(n) − k }   ≤   1/2kc.

In other words, there's a c such that (the probability that the complexity of an n-bit string is less than or equal to n + H(n) − k) is less than or equal to 2k+c.

Substituting H(n) + c for k, and substituting xn, the first n bits of an infinite binary sequence x, for β, gives us

Prob{ H(xn) ≤ nc }   ≤   1/2H(n).

Let's sum this over all positive integers n. That gives us

n Prob{ H(xn) ≤ nc }   ≤   ∑n 1/2H(n)   ≤   Ω   ≤   1.

It follows that with probability one,

H(xn) ≤ nc

is only the case finitely often.

[Why? Well, this is a special case of what Feller [W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley, 1950.] calls the Borel-Cantelli lemma, which states that if

n Prob{An}

converges, then with probability one only finitely many of the events An can occur. How come? Well, just look at the tails of this convergent infinite series. Since this series converges, we can get the tail to be arbitrarily small by going out far enough. So for any real ε > 0, there will be an integer N such that

nN Prob{An}   <   ε.

Hence the probability that any of the events An occur for nN is less than ε, and can be made arbitrarily small, and the probability that infinitely many of the An occur is less than ε for every ε > 0 and is therefore zero.]

It follows that with probability one there must be a c' depending on x such that for all n,

H(xn) > nc'.

I'll take this to be my program-size irreducibility definition of a random real, more precisely, of what I'll call a Chaitin-random real.

Actually, repeating this reasoning again and again, each time with a different complexity cutoff, also shows that with probability one it must be the case that

limn → ∞ H(xn) − n   =   ∞.

That is, with probability one, H(xn) − n → ∞, H(xn) − n goes to infinity. I'll call this strong Chaitin randomness. But we'll postpone considering strong Chaitin randomness until the last chapter of Part III, where we'll show that there's actually a complexity gap and that if

cn [ H(xn) > nc ]

then

H(xn) − n → ∞.

So Chaitin and strong Chaitin randomness are actually equivalent; but that's not at all obvious.

The halting probability Ω is Chaitin random

Now how about a concrete example of a random real? We know that with probability one a real in the unit interval is random, but how can we exhibit a specific, particular random real? Well, it turns out that we've had it ever since the beginning of Part II!

Let's consider the first N bits of the binary expansion of Ω, the halting probability of our universal self-delimiting binary computer U, a real number between 0 and 1. And for the purposes of this discussion, let me say that if I have to choose between

Ω = .xxx011111...

and

Ω = .xxx100000...

I prefer the binary expansion for Ω with infinitely many 1's.

Let's suppose we know the halting probability to within one part in 2N, in other words, that we've been told its numerical value with an accuracy of 2N. More precisely, we truncate Ω, we don't round it off. So I'll define ΩN to be the first N bits of the binary expansion of Ω with infinitely many 1's, which ensures that Ω > ΩN, that Ω is strictly greater than ΩN, which is important for everything to work.

The crucial fact is that ΩN settles the halting problem for U(p) for all programs p up to N bits in size. In fact, there is a prefix π such that

U(p) = ΩN     ⇒     Up) = { x : H(x) ≤ N }.

In other words, by slapping on a prefix we can make any program to calculate the first N bits of the halting probability into a program for computing a list of all the S-expressions with complexity ≤ N. But this list itself cannot have complexity ≤ N, because it cannot be contained in itself!

Therefore taking p to be a minimum-size program for ΩN, we have

N   <   H( { x : H(x) ≤ N } )   ≤   | π | + HN).

So

HN) > N − | π | = Nc,

and Ω is Chaitin random. The proof is a bit like the one that Hn|n) = n + O(1); that was a warm-up exercise for this.

You systematically run more and more programs on U for more and more time, and get better and better lower bounds on the halting probability for U. As soon as the first N bits of the halting probability are correct (that is, the same as ΩN), you know all ≤ N bit programs for U that halt!

Yes indeed, you can compute Ω in the limit from below. But---unless you're given ΩN!---you never know how close you are, there is no ``computable regulator of convergence.''

Can you work through the details on your own and write the prefix π in LISP? If so, what value do you get for c? If, God forbid, you give up, or, what's more commendable, if you want to compare your answer with mine to see if you did a better job, see The Limits of Mathematics for the detailed proof that

HN) > N − 8000.

In the next three chapters we'll deduce from this lower bound on HN), which states that Ω is Chaitin random, that Ω is also Martin-Löf and Solovay random (statistically random) and finally is strong Chaitin random, so that in fact HN) − N   →   ∞. This will keep us busy for the rest of Part III, and to do it we'll have to figure out how to do a little bit of measure theory on the computer.

Should you stop reading here?

This book should stop here, at the end of this chapter, and I should never present Martin-Löf nor Solovay randomness. Why? (a) In my crucial, breakthrough paper on this, my 1975 J.ACM paper, I didn't! Why not? (b) Because these are old, backward looking infertile concepts belonging to a dying field! (c) Don't need to! Not for my incompleteness results in The Limits of Mathematics. And I can show that Ω is absolutely Borel normal without using Martin-Löf or Solovay randomness, simply with a straight-forward program-size compression argument. (d) As von Neumann and Feynman have pointed out discussing e.g., minimum principles versus differential equations, mathematically equivalent physical laws can nevertheless be psychologically inequivalent and suggest completely different new laws! [J. von Neumann, ``Method in the physical sciences'' (1955), in F. Bródy and T. Vámos (Eds.), The Neumann Compendium, World Scientific, 1995, pp. 627-634. R. Feynman, The Character of Physical Law, MIT Press, 1965. See p. 168, near the end of Chapter 7, ``Seeking new laws''.]

The reason I go on with three more (misleading) chapters in Part III after this one, is (a) a new concept feels slightly more comfortable if it keeps company, if it connects with old (even obsolete!) concepts! (b) I need the detour through Martin-Löf and Solovay to prove that strong Chaitin randomness is equivalent to my main definition, Chaitin randomness. But I don't use strong Chaitin randomness anywhere, I just like the information that it gives me about Ω. I challenge the reader to find a simple, direct proof, and eliminate any need for ever introducing strong Chaitin randomness. ``Strong'' Chaitin randomness is really an expression of weakness, of the fact that I don't (yet) have an immediate proof that this is equivalent to my main randomness definition. I'm counting on you to rectify this!

Similarly, (c) I do need Solovay randomness to show that very sparse infinite subsets of the bits of Ω must have 0's and 1's with equal limiting relative frequency. A straight-forward program-size compression argument will not do. [If a fixed percentage of the bits of Ω are selected, then a straight-forward program-size compression argument does work.] But that may just be my fault! Can you fix this by giving a direct proof? Then AIT will be free from any dependence on old, dying ideas!

Do the research that will obviate any need for me to temporarily introduce variant definitions of randomness! These direct proofs avoiding Martin-Löf, Solovay, and strong Chaitin randomness, will no doubt be obtained by unraveling the equivalence proofs linking these concepts with Chaitin randomness that I present in the remaining chapters of Part III, and extracting the central idea from each proof. To work!

Exercises

  1. Prove that the complexity of a minimum-size program p is almost the same as its size. I.e., if U(p) = x and |p| = H(x), then

    H(p) = |p| + O(1).

    Thus the complexity of a minimum-size program p drops down below the maximum possible for strings of size |p| just enough for p to be self-delimiting. [This was one of the clues that led me to chose the randomness complexity cutoff that I did: minimum-size programs should be border-line random. [In a previous version of AIT, the 1960's version, in which programs were not self-delimiting, minimum-size programs were maximally random. I was sorry to lose this in my new version of AIT, the 1970's version, but it was worth it! In exchange, I got the halting probability Ω! At any rate, this degree of randomness is still enough for us to be able to show that large minimum-size programs must have nearly the same relative frequency of 0's and 1's, and many other statistical randomness properties. In fact, Question: are there any statistical randomness properties that Chaitin-random infinite binary sequences have but which large minimum-size programs do not have? If you find the answer, let me know!] ]

  2. Prove that if p and p' are both minimum-size programs for the same S-expression x, then

    H(p) = H(x) + O(1)
    H(p') = H(x) + O(1)
    H(p:p') = H(x) + O(1) .

  3. Prove the result of R.M. Solovay [In his unpublished book-length manuscript on AIT dated May 1975.] that

    log2 # { x : H(x) < n } = nH(n) +O(1).

    In other words, show that there are 2nH(n) + O(1) different S-expressions x of complexity less than n.

  4. Show that

    Ω# = ∑n = 1,2,3,... 2H(n)

    is also a Chaitin-random real number. Can you find other examples?

  5. Find other examples like κn, other situations in which we can get information about a number by showing that it's random. Are there any other interesting applications of this technique?

  6. Prove that Ω is irrational, i.e., not the ratio of two integers.

  7. Prove that Ω is transcendental, i.e., not algebraic, not the solution of an equation

    axn + bxn−1 + ... + px + q = 0

    with integer coefficients. [Cf. Cantor's and Liouville's proofs of the existence of transcendentals in R. Courant and H. Robbins, What is Mathematics?, Oxford University Press, 1941. This problem has a colorful intellectual history. See T. Dantzig, Number, Macmillan, 1954.]