In these lectures I discuss philosophical applications of AIT,
not practical applications.
Indeed, I believe AIT has no practical applications.
The most interesting thing about AIT is that you can almost never determine the
complexity of anything. This makes the theory useless for practical
applications, but fascinating from a philosophical point of view, because it
shows that there are limits to knowledge.
Most work on computational complexity is concerned with time. However this
course will try to show that program-size complexity, which measures
algorithmic information, is of much greater philosophical significance.
I'll discuss how one can use this complexity measure to study what
can and cannot be achieved by formal axiomatic mathematical theories.
In
particular, I'll show (a) that there are natural information-theoretic
constraints on formal axiomatic theories, and that program-size complexity
provides an alternative path to incompleteness from the one originally used by
Kurt Gödel. Furthermore, I'll show (b) that in pure mathematics there are
mathematical facts that are true for no reason, that are true by accident.
These have to do with determining the successive binary digits of the precise
numerical value of the halting probability Ω for a ``self-delimiting'' universal
Turing machine.
I believe that these meta-theorems (a,b) showing (a) that the
complexity of axiomatic theories can be characterized information-theoretically
and (b) that God plays dice in pure mathematics, both strongly suggest a
quasi-empirical view of mathematics. I.e., math is different from physics,
but perhaps not as different as people usually think.
I'll also discuss the
convergence of theoretical computer science with theoretical physics, Leibniz's
ideas on complexity, Stephen Wolfram's book A New Kind of Science, and
how to attempt to use information theory to define what a living being is.
In this book I've tried to preserve the informal style of presentation in my lectures
that stressed the key ideas and methods and avoided getting bogged down in
technical details. There are no proofs here, but there are plenty of proof sketches.
I hope that you enjoy reading this book just as much as I enjoyed presenting
this material at EWSCS '03!
—Gregory Chaitin
This is a good example of the situation discussed in
Gödel,
What is Cantor's Continuum Problem?, 1947,
where he argues that maybe math is a little like physics and
that new axioms that are not self-evident might be justified
because of their usefulness.
If so, there is ample pragmatic justification for adding
P ≠ NP
as a new axiom.
(In his 1947 article Gödel was concerned with set theory,
not computer science.)
Let's see!!!
1Hermann Weyl,
The Open World, Yale University Press, 1932.
Reprinted by Ox Bow Press, 1989.
See also Weyl's Philosophy of Mathematics and Natural Science, Princeton University Press,
1949.
What is a law of nature?
According to Leibniz,
a theory must be simpler
than the data it explains!
Because if a physical law can be as
complicated as the experimental data
that it explains, then
there is always a law, and the
notion of ``law'' becomes meaningless!
Understanding is compression!
A theory as complicated as the data it explains is NO theory!
All of this is stated very clearly (in French) in 1686
by the mathematical genius and philosopher Leibniz!
During his lifetime he only transmitted summaries
of these ideas to friends and colleagues in letters!
The complete text of his Discourse on Metaphysics
was found among his voluminous personal papers after his death.2
2For
the original French text, see Leibniz,
Discours de métaphysique, Gallimard, 1995.
There is an English translation in G. W. Leibniz,
Philosophical Essays,
edited and translated by Roger Ariew and Daniel Garber, Hackett,
1989.
program → Computer → output
axioms → Deduction → theorems
Ideas → Mind of God → The World
Following Leibniz, in each of these diagrams
the input on the left
must be much smaller
than the output on the right.
Leibniz's key insight is not
that this is ``the best of all possible worlds''.
This was anti-Leibniz propaganda by Voltaire, who ridiculed Leibniz and
did not understand how subtle and profound Leibniz was.
(According to Borges, the word ``optimism'' was invented by Voltaire to mock Leibniz!)
Leibniz's key insight
is that God has used few ideas to create all
the diversity, richness and apparent complexity of the natural world.3
Leibniz is actually affirming his belief that the universe is rationally comprehensible.
(This belief in a rational universe goes back at least to the ancient Greeks,
particularly Pythagoras and Plato, but Leibniz's formulation is much sharper
and profound because he analyzes in mathematical terms exactly what this belief means.)
In modern language, Leibniz was stating his belief in the possibility of science.
3Here is an argument in favor of diversity from Leibniz via Borges.
Consider two libraries with exactly the same number of books. Recall that Borges was
a librarian! One of the libraries only has copies of Virgil's Aeneid (presumably
the perfect book), while the other library has one copy of the Aeneid and
many other different (and presumably inferior!) books. Nevertheless, the second
library is clearly a more interesting place to visit!
Pythagoras and Plato believed that the universe can be comprehended using mathematics.
Leibniz went beyond them by clarifying mathematically what exactly does it mean
to assert that the universe can be comprehended using mathematics.
AIT continues this train of thought and goes beyond Leibniz by positing that explanations
are computer programs and also by defining
precisely what complexity is and what exactly
does it mean to satisfy Leibniz's requirement that
an explanation has to be simpler than
the phenomena that it explains.
The setup is as follows.
The (static, formal) mathematical theories that we consider
are thought of, somewhat abstractly, as a computer
program for running through all possible proofs and generating
all the theorems in the theory, which are precisely all the
logical consequences of the axioms in the theory.
[This assumes that the theory uses a formal language, symbolic logic,
requires complete proofs, and provides a proof-checking algorithm
that always enables you to decide mechanically whether or not a proof
is correct, whether or not it followed all the rules in deriving
a logical consequence of the axioms.]
And we shall concentrate our attention on the size in bits of this computer
program for generating all the theorems. That's our measure of the complexity
of this theory, that's our measure of its information content.
So when we speak of an N-bit theory, we mean one with an N-bit program
for generating all the theorems. We don't care that this process is very
slow and never terminates. AIT is a theoretical theory, not a practical theory!
Okay, that's half the setup! These are the theories we'll consider.
Here's the other half of the setup.
Here are the theorems that we want these theories
to be able to establish.
We want to use these theories to prove that
particular computer programs are elegant.
What is an elegant program? It's one with the property
that no smaller program written in the same language
produces exactly the same output.4
4Throughout this discussion we assume a fixed choice
of programming language.
There are lots of elegant programs!
For any particular computational task, for any particular
output that we desire to achieve, there has to be at
least one elegant program, and there may even be several.
But what if we want to be able to prove that a particular
program is elegant?
These are
Irreducible Mathematical Truths!!!
No rational justification is possible!
Such mathematical facts can have no
rational explanation, because rational
explanation consists of reducing something
to simpler (maybe even self-evident)
principles. A theory for something
derives it from simpler hypotheses.
If this kind of reduction is impossible,
then we are confronted with irrational
mathematical facts, mathematical facts
that cannot be encompassed by any static theory!
Therefore (corollary)...
Why?
Well, no N-bit theory for any finite N can enable
you to prove all true assertions of the form
(What about the physical world?
Does it have finite or infinite complexity?
We'll look at that later.
See Sections 1.10 through 1.12.)
Let c be the size in bits of the fixed-size routine
that does all this: It is given the N-bit theory as data and
then it runs through all possible proofs in
the theory until it finds a provably elegant program that
is sufficiently large (> N + c bits), then it runs that program and returns
that large elegant program's output as its own output.
So the N + c bit program displayed above produces
the same output as a provably elegant program that is larger
than N + c bits. But that is impossible!
So our precise result is that an N-bit theory
cannot enable us to prove that any program
that is larger than N + c bits is elegant.
Here c is the size in bits of the fixed-size
routine that when added to the N-bit formal theory
as above yields the paradox that proves our theorem.
Note: We are making the tacit assumption that
if our theory proves that a program is elegant, then that
program is actually elegant. I.e., we assume that only
true theorems are provable in our formal theory. If this is
NOT the case, then the theory is of absolutely no interest.
Gödel, who had the conventional Platonist view of math,
was only forced into backing new
math axioms that are only justified pragmatically
just as in physics because of his famous 1931 incompleteness
theorem. And I believe that the ideas that
I've just presented applying program-size
complexity to incompleteness, in particular
my result that it takes an N-bit theory
to prove that an N-bit program is elegant,
and the results on Ω that we'll see Day II,
provide even more support for
Gödel's heretical views on new axioms.
The way AIT measures the complexity (information
content) of mathematical theories makes Gödel
incompleteness seem much more natural, much more
pervasive, much more inevitable, and much more
dangerous. Adding new axioms—adding
more mathematical information—seems to be
the only way out, the only way to go forward!
We've discussed adding new axioms in math just
as in physics, pragmatically.
A related question is
Even when there are NO proofs?
For an extreme example of this, see
Wolfram, A New Kind of
Science, 2002,
who provides a tremendous amount of
computational evidence, but almost no proofs,
in support of his theory.
See also the journal Experimental Mathematics.
Obviously, AIT makes me sympathetic to experimental mathematics,
even though I don't do experimental math myself. Experimental math is fueled
by the power of the computer, not by Gödel's nor by my
meta-theorems, but it fits nicely into a quasi-empirical view
of mathematics. Practical necessity as well as philosophical
analysis are both simultaneously pushing in the direction
of experimental math!
The conventional view on this held by high-energy physicists
is that there is a TOE, a theory of everything, a finite set
of laws of nature that we may someday know, which has
only finite complexity.
So that part is optimistic!
But unfortunately in quantum mechanics there's
randomness, God plays dice, and to know the results of all
God's coin tosses, infinitely many coin tosses,
necessitates a theory of infinite complexity, which
simply records the result of each toss!
So the conventional view currently held by physicists
is that because of randomness in quantum mechanics the world has infinite complexity.
Could the conventional view be wrong?
Might it nevertheless be the case that the universe only has finite complexity?
Some extremely interesting thoughts on this can
be found in
According to Wolfram,
there is no real randomness.
There is only pseudo-randomness,
like the randomness produced by random-number
generators, which are actually deterministic
sequences of numbers governed by mathematical
laws, since computers use algorithms to generate
pseudo-random numbers, they don't use quantum mechanics!
According to Wolfram,
our universe is actually a deterministic universe
that's governed by deterministic physical laws!
So the physical universe, the world, has finite complexity.
According to Wolfram,
everything happens for a reason,
just as Leibniz thought!
These two supremely intelligent men are rationalists.
They want to understand everything!
They don't believe in ultimate mysteries!
They don't think anything is incomprehensible!
They believe in the power of the human mind to
comprehend everything!
On the other hand, we have seen that
because it has infinite complexity,
the universe of mathematical ideas
CANNOT be comprehended in its entirety.
It looks complicated,5
but it is actually
very simple!
5Especially if you're given successive digits
of π that come from far inside the decimal
expansion without being told that they're digits of π—they look random.
According to Wolfram
all the randomness we see
in the physical world is
actually pseudo-randomness.
He believes that the physical world
is actually deterministic,
we just don't know the law.
He sees this as a philosophical possibility.
Whether our physical universe is actually
that way or not is another matter,
to be decided by scientists, not philosophers!
6For more on Wolfram, Fredkin, Lloyd, Toffoli, Landauer, Zuse... see
O. Postel-Vinay, ``L'Univers est-il un calculateur?'' [Is the universe a calculator?],
La Recherche, no. 360, January 2003, pp. 33-44.
In a nutshell, digital philosophy posits that
the world is a giant computer,
a giant digital information processor,
and that, fundamentally,
everything is discrete 0/1 bits!
This algorithmic view of everything works much better if there are
actually no real numbers, no continuous quantities, and the physical
universe is really, at some bottom level, discrete.7
7Algorithms played a decisive role in Sumerian mathematics
more than a millennium before Pythagoras, a tremendously long intellectual trajectory!
The Sumerians used base
60 numerals, and divided the circle into 360 degrees.
Recall too
Zeno's refutation of continuous time and motion, which led
Hume to insist that space and time are discrete.—Françoise
Chaitin-Chatelin, private communication.
Wolfram's work, AIT, and Fredkin's digital philosophy are all examples of
the convergence of mathematics,
theoretical physics, and theoretical computer science!
This is an accelerating trend, of which the field of quantum computing
is also an example.
Of course, traditional mathematical physics is based
on continuous math,
on ordinary and partial differential equations,
and does not fit in too well with a digital philosophy.
Maybe digital philosophy is a terrible mistake.
Maybe we are taking the digital computer much too seriously!
Maybe we shouldn't make it the basis of a new philosophy, of a new world view, of
a new système du monde?
We will see!
This is the second expanded edition of a valuable collection
of essays by philosophers, mathematicians, and computer
scientists (including two of my own essays) that its editor
Tymoczko unfortunately did not live to see in print.
Highly recommended!
There is also a forthcoming two-volume set on experimental math,
that should be extremely interesting.
What is a living being?
How can we partition the world into parts?
Can we do this
in spite of Parmenides and mystics who insist
that the world must be apprehended as a whole
and is an organic unity, a single substance,
and cannot be separated into independent parts?
I think the key is
algorithmic independence.
X and Y are said to be
algorithmically independent if
the program-size complexity of X and Y
≈ the sum of the individual program-size complexities
of X and Y.
I.e., if the number of bits in the simplest theory that explains both simultaneously
is approximately equal to the sum of the number of bits in the simplest theories that explain
each of them separately.
Independent parts of the world, of which living beings are the most interesting example,
have the property that their program-size complexity
decomposes additively.
Conversely, the parts of a living being are certainly not independent and
have high mutual information. [Mutual information will be discussed
in Sections 3.6 and 3.7, Day III, and Section 4.3, Day IV.]
This needs much more work!
There is a proof-checking algorithm!
1I.e., using notation that we haven't introduced yet,
the positive integer N is uninteresting iff
H(N) ≡ HU(N)
≥ |N| ≡
size in bits of N.
For the definition of HU, see Section 2.13.
2It turns out that if you work through all the details,
you can't prove that a positive integer is uninteresting
if its size in bits is greater than (the size
of the program that generates all the theorems in
the formal theory FT) plus a constant c.
3In English in Paolo Mancosu,
From Brouwer to Hilbert, Oxford University Press, 1998.
4Vladimir Tasic,
Mathematics and the Roots of Postmodern Thought,
Oxford University Press, 2001.
have a solution
in positive integers with N ≥ 3?
(Here n ranges over positive integers and p ranges over the primes.)
N cases of the halting problem is
only log2 N bits of information,
not N bits!
They are never independent mathematical facts!
Why not?
Because
we could answer all N cases of the halting problem
if we knew exactly how many of the N programs halt.
Just run them all in parallel until exactly the right number halt.
The others will never halt.
Now we use this observation to compress all the redundancy
out of the Turing number T and get an algorithmically
irreducible number Ω.
Halting Probability:
|p| is the size in bits of the program p.
I.e., each k-bit program that halts
when run
on our standard universal Turing machine U
contributes 1/2k to Ω.
[We need to make U self-delimiting (Section 2.12)
to ensure that Ω ≤ 1. Otherwise the sum
for Ω diverges to ∞.
By using self-delimiting
programs, we've constructed one number, Ω,
that extends the trick of Section 2.8's crucial observation
so that it works for
an infinite number of
computer programs, all of them, in fact.]
Now there is absolutely NO redundancy!
The first N bits of Ω
answers the halting problem for all
programs up to N bits in size!
(Can you see why? Hint: Ω can be computed in
the limit from below.5)
5But very, very slowly, and you can never be sure how close you are.
The bits of Ω are algorithmically
irreducible,
algorithmically independent,
and
algorithmically random!
Ω is a real number with maximal information
content.
Each bit of Ω is a complete surprise!
It's not at all a tame real number like π = 3.1415926...
which only has a finite amount of algorithmic information.
Ω is a dangerous, scary real number!
Not good to meet in a dark alley!
Ω is maximally unknowable!
Maximum entropy, maximum disorder!
This makes Ω sound bad, very bad!
On the other hand, Ω is distilled, crystalized mathematical
information.
If T is coal, then Ω is a diamond!
In fact, initial segments of Ω are ideal new mathematical axioms.
Knowing a large initial segment of Ω would settle all halting problems
for programs up to that size,
which would, to use Charles Bennett's terminology,
settle all finitely refutable math conjectures up to that complexity.
Ω is concentrated essence of mathematical creativity
and mathematical inspiration!
One could measure the progress of mathematics by how many bits of Ω
we can currently determine!6
(Of course this has nothing to do with moral or scientific progress.)
6Somewhere Leibniz proposes measuring the
intellectual progress of mankind via a function Φ(t)
with the property that all
interesting theorems with proofs of size ≤ Φ(t) are known at time t.
Yet another possible
measure of human progress is a function Λ(t) such that all halting problems for
programs with ≤ Λ(t) bits have been settled by time t.
Yet perhaps these measures of intellectual progress are all beside the point,
the point being the emergence of new concepts?
The progress measured by Φ is, in principle, merely hard work, and could be
achieved mechanically, by employing vast amounts of computation, assuming that we
are exploring a fixed, static formal theory. But I think that Λ—and counting provable
bits in Ω—measures the emergence of new concepts indirectly via
their effects, for surely new concepts would be needed to advance in these areas.
So Ω can also be regarded as a good friend, instead of an enemy!
Here mathematical truth
isn't Black or White.
It's Grey!
The best way to think about Ω
is that each bit of Ω is 0 or 1
with probability 1/2!
Here God plays dice with
mathematical truth!
Knowing all the even bits of Ω
wouldn't help us to get any of the odd bits of Ω!
Knowing the first million bits of Ω
wouldn't help us to get the million and first bit of Ω!
The bits of Ω are
just like independent tosses of a fair coin!
This is the case even though Ω is
a specific real number!
And even though each bit
of Ω
is fully determined mathematically.
(Even though we can't compute it
nor
prove what its value is!)
First of all, U must read one bit
of its program p at a time and decide by itself
when to stop reading p before encountering
a blank at the end of p.
In other words, each p for U must be
self-delimiting.
Also, U must be universal.
This means that for any special-purpose
self-delimiting computer C there is
a prefix πC such that
concatenating it in front of a program for C
gives you a program for U
that computes the same output:
This prefix depends only on the choice of C
and not on p.
In other words, U can simulate each C.
HU (x) ≡
minU(p)=x |p|.
H(x) is the size in bits of the smallest
program
for computing x on each machine.
Then we have
In other words, programs for U are not too large.
For U to simulate C adds only a fixed number of
bits to each program for C.
Then
ΩU
is defined as follows:
Any such universal U will do.
I invented a special version of LISP for
writing the
prefix π.
Access to the data β is strictly controlled.
U reads one bit at a time of p
and
CANNOT run off the end of p = πβ.
The binary data β must be
self-delimiting,
just like the LISP prefix π,
so that
U knows when to stop reading it.
I.e., the alphabet for p is binary, not trinary!
There is no blank at end of p!
That would be a wasted character, one that
isn't being used nearly enough!
There are more details about the LISP implementation in
the last lecture (Day IV).
Does the diophantine equation
D(x) = 0
have an unsigned integer solution?
So we get randomness in arithmetic!
Ω's algorithmic randomness also infects
elementary number theory!
The disease is spreading!
Self-delimiting computer C
Next we construct a universal computer U, for example, as follows:
In other words, U can simulate C if to each program for C we add
a prefix consisting of a long run of 0's followed by a 1.
This prefix is self-delimiting, and the number of 0's in it
indicates which C is to be simulated.
Then we have
where
I.e., U's programs are, to within an additive constant, minimal in size.
We take this minimality property as our definition for a universal computer U.
Now somehow pick a particular natural U to use as the basis for AIT.
In the lecture for Day IV, I will indicate how to use LISP to implement the
particular U that I've picked to use as the basis for my theory.
1``Art is a lie that helps us to see the truth,'' Pablo Picasso.
Define an abstract complexity measure H via these two properties:
I.e., { 〈x, k〉 :
H(x) ≤ k } is r.e.
I.e., H(x) =
limt → ∞
Ht(x).
This abstract approach works!
However I prefer
the more concrete approach in which
you
think of H(x) as the size of the smallest program for x.
This gives more insight into what is going on.
Beware of premature axiomatization!
And beware of excessive abstraction concealing meaning!
Size of the smallest program that computes both objects.
This was the original reason that I made programs self-delimiting,
so that information content would be additive!2
This means we can combine programs for x and y and
add c bits and get a program for the pair 〈x, y〉.
For a LISP program that does this and shows that c = 432 is
possible, see Section 4.10.
2Then I discovered Ω, which cannot be defined
unless programs are self-delimiting.
I.e., the joint complexity is approximately equal to the sum of the
individual complexities.
This means that the two objects have nothing in common.
This is the extent to which it is better
to compute them together rather than separately.
This is the second main technical idea of AIT.
First is to use self-delimiting programs.
Second is to define relative complexity in this more subtle manner
in which we are given y* for free, not y,
and we have to compute x.
This is an identity in Shannon information theory, with no O(1) term.
In these two different versions of information theory,
the formulas look similar but of course the meaning of H is completely different.
We get these two corollaries:
H(x : y) =
H(y) −
H(y | x) + O(1).
So the mutual information is also the extent to which knowing
one of the two objects helps you to know the other.
It was not at all
obvious that this would be symmetric!
Again, these two corollaries are identities
in Shannon information
theory.
They justify our whole approach.
When I got these results I knew
that the definitions I had picked
for AIT were finally correct!
In other words, you have to know how many bits there are
plus what each bit is.
And most N-bit strings have H close
to the maximum possible.
These are the algorithmically random N-bit strings.
This notion has many philosophical resonances. First, a random
string is one for which there is no theory that obeys Leibniz's dictum that
a theory must be smaller than what it explains.
Leibniz clearly anticipated this definition of randomness.
Second,
a random string is an unexplainable, irrational string, one that cannot be comprehended,
except as ``a thing in itself'' (Ding an sich), to use Kantian
terminology.
H(1N)
= H(N) + O(1)
≈ log2 N.
For such bit strings,
you only need to know how many bits there are,
not what each bit is.
These bit strings are not at all random.
These bit strings are borderline random.
Randomness is a matter of degree,
and this is a good place to put the cut-off
between random
and non-random strings,
if you have to pick a cut-off.
Here xN is the first N bits of x.
If x is generated by independent tosses of a fair coin,
then with probability one, x is algorithmically random.
But how about specific examples?
Ω' ≡ ∑N 2−H(N)
These are algorithmically random real numbers.
The initial segments of their binary expansion
always have high complexity.
From this it follows that Ω and Ω' aren't
contained in any constructively definable set of measure zero.
In other words, they are statistically (Martin-Löf) random.
Therefore
Ω and Ω'
necessarily have
any property held by infinite binary
sequences x with probability one.
E.g., Ω and Ω' are both Borel normal,
for sure,
since in general this is the case with probability one.
Borel normal means that in any base, in the limit all blocks of digits of the
same size provably have equal relative frequency.
That Ω and Ω' are both Borel normal
real numbers
can also be proved directly using a simple program-size argument.
This can be done
because, as Shannon information theory already shows,
reals that
are not Borel normal are highly redundant and compressible.
And the main difference between Shannon information theory and AIT
is that Shannon's theory is concerned with average compressions
while I am concerned with compressing individual sequences.
But the more subtle proofs have to descend one level and use probabilistic arguments.
AIT is an iceberg! Above the water we see program size.
But the bulk of the AIT iceberg is submerged and is the algorithmic probability P(x)!
PC (x) is defined to be
the probability that a program
generated by coin-tossing
makes C produce x:
P(x) ≡
PU (x) ≡
∑U(p)=x
2−|p|.
This takes into account all the programs that calculate x,
not just the smallest one.
because one way to compute x is using an elegant program for x.
In fact, much, much more is true:
This crucial result shows that AIT,
at least in so far as the appearance of its formulas is concerned,
is sort of just a version of probability theory
in which we take logarithms and convert probabilities into complexities.
In particular, the important and subtle theorem that
is seen from this perspective as merely
an alternative version of the definition of relative probability:
How do we establish this crucial relationship between H and P?
By using an extended version of something called the Kraft inequality,
which AIT inherited from Shannon information theory!
[See Section 3.13.]
connecting H(x) and P(x).
This shows that there cannot be too many
small programs for x.
In particular, it follows that
the number of elegant programs for x is bounded.
Also,
the number of programs for x
whose size is within c bits of H(x)
is bounded by a function that depends only on c and not on x.
This function of c is approximately 2c.
In fact,
any program for x whose size is close to H(x)
can
easily be obtained from any other such program.
So, what I call Occam's razor, elegant or nearly elegant programs
for x are essentially unique!
And this justifies our definition of relative complexity H(x | y)
and shows that it does not depend too much on the choice
of the elegant program for y.
There are at most O(1) bits difference in
H(x | y)
depending on the choice of the y* that we are given for free.
is an extended version of the Kraft inequality condition for the existence
of a prefix-free set of words.
In AIT, the Kraft inequality gives us a necessary and sufficient condition not for the
existence of a prefix-free set,
as it did in Shannon information theory,
but for the existence of a self-delimiting computer C.
Once we have constructed a special-purpose computer C using the Kraft
inequality, we can then make statements about U by using the fact that
Here's how we construct C!
Imagine that we have an algorithm for generating
an infinite list of
requirements:
As we generate each requirement 〈s, o〉,
we pick the first available s-bit program p
and we assign the output o to C(p).
Available means not an extension or prefix of
any previously assigned program for C.
This process will work and will produce a self-delimiting
computer C satisfying all the
requirements iff
Note that if there are duplicate requirements in the list,
then several p will yield the same
output o.
I.e., we will have several p with the same value for C(p).
This way of constructing C may be thought of as a first-fit storage
allocation
algorithm for one infinitely-divisible unit of storage.
It is important to be able to simultaneously keep each of these three images in ones mind,
and to translate from one image to the next depending on which gives the most
insight at any given moment.
In other words, C may be thought of as a kind of constructive probability
distribution, and probability one corresponds to the entire unit interval
of programs, with longer and longer programs corresponding to smaller
and smaller subintervals of the unit interval.
Then the crucial fact that no extension of a valid program is a valid program
simply says that the intervals corresponding to valid programs
do not overlap.
In the last lecture, Day IV, I'll drop this abstract viewpoint and make everything
very, very concrete by indicating how to program C in LISP and actually run
programs on C...
First of all, I never stated the formal version of
the incompleteness results about Ω that
I explained informally in Day II.
How can we measure the complexity of a formal
mathematical theory in bits of information?
H(formal math theory) ≡
such that
Note that this is something new: U is now performing
an unending computation and produces an infinite
amount of output!
The rather abstract point of view taken here is in
the spirit of Emil Post's 1944 paper R.e. sets of
positive integers and their decision problems.
For our purposes the internal details of the formal
theory are irrelevant. We are flying high in the sky
looking down at the formal theory. We're
so high up that we don't care about the axioms
and rules of inference used in the theory.
We can't see all of that detail. Nevertheless,
using these very general methods we can make some
rather strong statements about the power of the formal
theory in question.
Here are some hints about the proof.
We make the tacit hypothesis, of course,
that
all the theorems proved in theory T are true.
Otherwise T is of no interest.
The proof is in two parts.
This follows from the fact that
knowing ΩK,
the first K bits of Ω,
enables you to answer the halting
problem for all programs for U
up to K bits in size.
This, our major incompleteness
result about Ω,
actually requires
very little of the machinery presented
in Day III.
In fact, the proof is a
straight-forward program-size argument
that makes no use of P nor of the
the Kraft inequality.
This is in fact the self-contained
elementary
proof that I give
in my 1998 Springer volume
The Limits of Mathematics,
which is the reference volume for today's lecture.
Day III we defined the following complexity measures:
Well, you could try to make mutual information into the size of
a program.
E.g., mutual information could be defined to be the size of
the largest (not the smallest!) possible common subroutine
shared by elegant programs for x and y.1
1For other possibilites, see the section on common information
in my 1979 paper
Toward a mathematical definition of ``life.''
Unfortunately this alternative definition, which is much more intuitive
than the one I actually use, doesn't seem to fit into AIT too well.
Perhaps someone can find a way to make it work! Can you?
This is an interesting question to answer because of an important
possible application: In Section 1.14, Day I, it was suggested that
mutual information be used to try to partition the world into
separate living organisms.
Let's look at some examples of how this can be done.
How can we make an N-bit string self-delimiting?
Well, 2N + 2 bits will certainly do.
There's a special-purpose
self-delimiting computer C that accomplishes this:
So we've just seen four different examples of how a self-delimiting
program can tell us its
size as well as its contents.
Another version of the principle that a program tells
us its size as well as its contents is the fact that
an elegant program tells us its size as well as its output, so
That's all the unfinished business from Day III!
Now let's start Day IV proper by explaining LISP!
LISP is
like a computerized version of set theory!
The LISP S-expression
denotes the following nested tuples:
Programs and data in LISP are always S-expressions.
Everything is an S-expression!
The S in S-expression stands for Symbolic.
S-expressions that are not lists are called atoms.
A, bc, 39 and x-y-z
are the atoms in the above S-expression.
The empty list () = 〈 〉 is also an atom, usually
denoted by nil.
LISP is a functional programming language,
not an imperative programming language.
LISP is an expression language. There is no notion of time,
there are no GOTO's, and there are no assignment statements!2
2Well, they're actually present, but only indirectly, only subliminally!
LISP programs are expressions that you evaluate,
not run, and they yield a value, with NO side-effect!
(This is called ``Pure'' LISP!)
Functional Notation in LISP:
So
becomes
in LISP.
Prefix notation, not infix notation! Full parenthesization!
Fixed number of arguments for each primitive function!
In my LISP most parentheses are understood, are implicit,
and are then supplied by the LISP interpreter.
Complete LISP Example.
Factorial programmed in my LISP:
With all the parentheses supplied, factorial becomes:
In words, f is defined to be a function of one argument, n.
And if n = 0, then f(n) is 1.
If not, f(n) is
n × f(n−1).
Then (f 4) yields value 24, as expected, since
4 × 3 × 2 × 1 = 24.
Dissecting and Reassembling Lists:
Define an elegant LISP expression to be one with
the property that no smaller LISP expression
yields the same value.
In order to measure size properly,
LISP expressions are written in a canonical standard
notation with no embedded comments and with exactly one blank
separating successive elements of a list.
Then the LISP complexity of an S-expression is defined
to be the size in characters of an elegant expression
with that particular value.
Next represent a formal mathematical theory in LISP
as a LISP S-expression whose evaluation never terminates
and which uses the primitive function display to
output each theorem.
display is an identity function with
the side-effect of displaying its operand and is normally
used for debugging large LISP S-expressions
that give the wrong value.
Theorem—
This is a very concrete, down-to-earth incompleteness theorem!
In that direction, it's the best I can do!
The reductio ad absurdum
proof consists of N + 410 characters of LISP that define a function,
and that apply this function to a quoted form of the N-character LISP expression
that generates all the theorems of the formal theory. The first
step is to measure the size N of the LISP expression that generates
all the theorems. Then use TRY (Section 4.7) to run the theory for longer
and longer, until a provably elegant expression is found whose size
is greater than N + 410. When this happens, evaluate the provably
elegant expression and return its value as our value.
So we've described an N + 410 character LISP expression whose value is the
same as the value of an elegant expression that is larger than it is!
Contradiction!
The first argument is either an integer or
no-time-limit.
The second argument is an arbitrary LISP S-expression.
The third argument is a bit string represented in LISP
as a list (...) of 0's and 1's separated by blanks.
TRY then yields this triple:
Also, within the expression that is being tried,
you can use
read-bit
or
read-exp
to get access
to the binary data in a self-delimiting manner,
either one bit or one LISP S-expression at a time.
TRY is like a large suitcase: I am using it to do
several different things at the same time. In fact,
TRY is all I really need to be able to program AIT in
LISP. All of the other changes in my version of LISP
were made just for the fun of it! (Actually, to
simplify LISP as much as possible without
ruining its power, so that I could prove theorems
about it and at the same time enjoy programming in it!)
Example: If you're trying to run an unending
computation that is a formal theory written in LISP
as described in Section 4.6, then TRY always returns
This function U of p returns the second element of the list returned by TRY,
a TRY with no time limit that tries to evaluate a LISP expression at the
beginning of p, while making the rest of p available as raw binary data.
Here the program p for U is a bit string
that is
represented
in LISP as a list of 0's and 1's separated by blanks.
E.g., p = (0 1 1 0 1...)
Let me explain this!
The binary program p consists of a LISP S-expression prefix (converted into
a bit string), that is followed by a special delimiter character, the ASCII newline
control character (also converted to bits), and that is then followed by raw binary data.
The newline character adds 8 bits to the program p, but guarantees
that the prefix be self-delimiting. Each character in the LISP prefix is
given in p as the corresponding 8 ASCII bits.
Within the prefix, read-bit and
read-exp enable you to get access to
the raw binary data, either one bit or one LISP S-expression at a time.
But you must not run off the end of the binary data!
That aborts everything! If so, U of p returns out-of-data.
So the program for the universal U consists of a high-level language algorithmic
part followed by data, raw binary data. The algorithmic part indicates
which special-purpose self-delimiting computer C should be simulated,
and the raw binary data gives the binary program for C!
I believe that I've picked a very natural U.
And here's a program with one bit of binary data.
The prefix is read-bit and the data is 0.
Let's change the data bit.
Now the prefix is read-bit and the data is 1.
run-utm-on is a macro that expands to the definition of U,
which is cadr try no-time-limit 'eval read-exp.
Also note that when bits converts a LISP S-expression
into a bit string (a list of 0's and 1's) it automatically
adds the newline character required by read-exp.
This is a 432-bit prefix π for U
with the property that
Here x* is an elegant program for x and
y* is an elegant program for y.
Therefore
For years and years I used this inequality
without having the faintest idea what the value of
the constant might be! It's nice to know that
there is a natural choice for U that can be easily
programmed in LISP and for which c = 432! That's
a little bit like dreaming of a woman for years
and then actually meeting her! That's also how I feel
about now having a version of U that I can actually
run code on!
My book The Limits of Mathematics programs in LISP
all the proofs of my incompleteness results, in particular,
all the key results about Ω. And my book
Exploring Randomness programs in LISP all the proofs
of the results presented Day III. That's a lot of
LISP programming, but I enjoyed it a lot!
Some of
these programs are fun to read, and it's educational to do so,
but I probably went overboard! In general, I would say that
you are better off programming something yourself, rather
than reading someone else's code, if you really want
to understand what's going on! There is no substitute for
doing it yourself, your own way, if you really want to
understand something. After all, that's how I came up
with AIT and my alternative approach to Gödel incompleteness
in the first place, because I was dissatisfied with the usual
approach!
For example, the relationship
between the algorithmic complexity H(X) and
the algorithmic probability
P(X) is not completely known.
When dealing with infinite computations in AIT,
in some cases it is not even clear whether the definitions
that were chosen are the right ones!
[For more on this, see
the last chapter
of my Exploring Randomness.]
3In this connection, see footnote 6 on page 24.
Preface
This little book contains the course that I had the pleasure of giving
at the Eighth Estonian Winter School in Computer Science (EWSCS '03)
held at the
beautiful
Park Hotel Palmse in Lahemaa National Park, Estonia, from March 2nd through 7th, 2003.
There I gave
four 90-minute lectures
on algorithmic information theory (AIT), which is
the theory of program-size complexity.
Each of these lectures is one chapter of this book.
http://www.cs.umaine.edu/~chaitin
http://www.cs.auckland.ac.nz/CDMTCS/chaitin
Contents
Day I—Philosophical Necessity of AIT
1.1 Is P ≠ NP a New Axiom?
Can we add
P ≠ NP
as a new axiom?!
1.2 Day I Summary
Discourse on Metaphysics (in French!)
what is ``understanding.''
1.3 Summary of Leibniz, 1686
(Brought to my attention by Hermann
Weyl, 1932.1)
1.4 From Leibniz to AIT
AIT goes beyond Leibniz by positing
that a theory is a computer program,
and that the size in bits of this computer
program is the complexity of the theory.
AIT makes Leibniz more precise!
AIT emphasizes the common features in these four diagrams:
1.5 Setup for Discussing the Limits of Mathematical Reasoning
Now let's apply these ideas!
Let's see what they have to say about the limits of mathematical reasoning.
Let's see what mathematical theories can accomplish.
No smaller program gives exactly the same output.
1.6 You need an N-bit theory
to prove that an N-bit program
is elegant!
[For the proof, see Section 1.8; for a corollary, see Section 1.7.]
1.7 Complexity of the Universe of Mathematical Ideas
The world of mathematical ideas
has INFINITE complexity!
1.8 Why do you need an N-bit theory to prove
that an N-bit program is elegant?
Here is the proof!
We'll assume the opposite and derive a contradiction.
Consider this computer program:
↓
Computer
↓
the output of
the first provably elegant program
whose size is greater than
(complexity of theory +
size of fixed-size routine)
1.9 Is Mathematics Quasi-Empirical?
Now let's go back to the question of adding
P ≠ NP
as a new axiom,
and to Gödel's thoughts that maybe
physics and math are not so different,
which, following Lakatos and Tymoczko (1998),
is now referred to as a quasi-empirical
view of mathematics.
1.10 Complexity of the Physical World
Now let's turn to the physical universe!
Does it have finite or infinite complexity?
1.11 Summary of Wolfram, 2002
In summary, according to
Wolfram, 2002
the world is like
π = 3.1415926...
1.12 Digital Philosophy
Wolfram's book as well as my
own work on AIT are both examples of what
Edward Fredkin refers to as
digital philosophy, a viewpoint that Fredkin also helped to pioneer.6
1.13 Three Interesting Books
I should mention that besides my own works,
the book on the quasi-empirical view of math is
New Directions in the
Philosophy of Mathematics,
Princeton University Press,
1998.
Mathematics by Experiment,
Experimentation in Mathematics,
A. K. Peters, 2003?,
1.14 Decomposing the World
Finally, here is a new and completely different philosophical application of AIT.
Day II—Main Application of AIT: Incompleteness
2.1 Day II Summary
Limits of Formal
Mathematical Reasoning:
Both I and II are
irreducible mathematical truths!
2.2 Hilbert-Style Formal Theories
2.3 Summary of Gödel, 1931
2.4 My Approach to Incompleteness
2.5 Borel's Unreal Number B, 1927
Let's start on the path to
the halting probability Ω, which is a
real number.
B was brought to my attention by Vladimir Tasic, 2001.4
(Nth text in French
is not a valid yes/no question,
or cannot be answered.)
2.6 More-Real Turing Number T
Here is the next step on our path to Ω:
bN tells us whether the Nth program pN
ever halts.
2.7 Some Interesting Cases of the Halting Problem
These are bits of T:
In each case there is a program that systematically searches
for a counter-example and halts iff it finds it.
Does
About the location of the complex zeroes of the zeta function
Tells us a lot about the distribution of prime numbers.
Is every even number the sum of two primes?
2.8 Crucial Observation
Suppose we are given N programs
and want to know
which ones halt and
which ones don't.
2.9 The Halting Probability Ω
Finally Ω = Omega!
2.10 Why is Ω Interesting?
The base-two bits of Ω are
irreducible mathematical facts!
They can't be derived from anything simpler!
2.11 In What Sense is Ω Random?
The bits of Ω
are mathematical facts
that are true
for No Reason,
they're true
by Accident!
2.12 The Self-Delimiting Universal Computer U
Actually, the value of Ω depends
on the choice of universal computer U.
There are many possible choices!
2.13 U is Optimal and Defines ΩU
Then we define the complexity
H
with respect to C and U
as follows:
2.14 My U is Programmed in LISP
The particular U that I picked to define Ω
uses LISP as follows:
2.15 Programming U in LISP
The LISP expression
π is converted to binary,
eight bits per character,
and
concatenated with the raw binary data β
to produce the binary program for U, p = πβ.
2.16 Ω and Hilbert's 10th Problem
We end today's lecture with a very important application:
Hilbert's 10th problem.
For more details, see Ord and Kieu,
On the existence of a new family of Diophantine equations for Ω,
http://arxiv.org/abs/math.NT/0301274.
[Matijasevic]
iff the kth bit of Ω is a 1!
[Chaitin]
iff the kth bit of Ω is 0/1!
[Ord and Kieu]
Day III—Technical Survey of AIT: Definitions & Theorems
Ω is the jewel of AIT. But it isn't a diamond solitaire.
It's in a beautiful setting, which we'll outline today.
We'll review the key definitions, theorems,
methods and ideas of AIT, but there will be NO proofs!
Only proof sketches.
For the proofs, see my book
Exploring Randomness.
3.1 Definition of a Self-Delimiting Computer C
AIT starts with a
Abstract version of self-delimiting feature:
No extension of a valid program is a valid program.
If C(p) is defined then C
is not defined on any extension of p.
I.e., the domain of the function C is a so-called prefix-free set.
That's a set of words in which no word is a prefix of another.
3.2 What do we do with C? Define U!
First define the program-size complexity
HC (x) to be
the size in bits of the smallest program for C to compute x:
3.3 Does the fact that H depends on U destroy AIT?!
3.4 A More Abstract Definition of the Complexity H
I prefer a very down-to-earth concrete approach.
But here is a more abstract ``axiomatic'' approach to defining H.
Then pick out an optimal minimal H,
one with the property that for any other
H' there is a constant c such that
3.5 Back to U—What do we do with U? Definitions!
3.6 AIT Definitions (Continued)
3.7 Important Properties of These Complexity Measures
3.8 Complexity of Bit Strings. Examples.
How about infinite bit strings?
Consider the N-bit strings consisting of N 0's or N 1's.
Consider an elegant program p.
I.e., no smaller program makes U produce the same output.
3.9 Maximum Complexity Infinite Binary Sequences
An infinite binary sequence x
is defined to be algorithmically
random iff
Because with probability one,
infinitely often
there have to be long runs
of consecutive zeroes or ones,
and this makes the complexity dip far below
the maximum
possible [N + H(N)] infinitely often!
3.10 The Algorithmic Probability of x, P(x)
Straight-forward theorems in AIT use program-size arguments.
3.11 What is the Relationship Between H and P?
First of all, it's obvious that
3.12 Occam's Razor! There are few elegant programs!
But first I want to discuss an important corollary of the crucial
theorem
3.13 The Extended Kraft Inequality Condition for Constructing C
The crucial tool used to show that
3.14 Three-Level Proofs: Program Size, Probabilities, Geometry
The proof of this crucial version of the Kraft inequality
involves simultaneously thinking of C as a computer, as an assignment
of probabilities to outputs, and as an assignment of outputs to
segments of the unit interval, segments which are halves, or halves of halves,
or halves of halves of halves...
Day IV—LISP Implementation of AIT
The LISP interpreter for Day IV,
which is a Java applet that will run in your web browser,
can be found at these two URL's:
http://www.cs.umaine.edu/~chaitin/unknowable/lisp.html
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/unknowable/lisp.html
Note that there is a concise LISP ``reference manual'' on these web pages,
just below the windows for input to and output from the interpreter.
4.1 Formal Definition of the Complexity of a Formal Theory
Before starting with LISP, let me finish some things
I started in the last lecture, and also answer some
questions that were raised about that lecture.
4.2 Formal Statement of the Incompleteness Theorem for Ω
A formal math theory T with complexity N
cannot enable you to determine more
than N + c bits of the numerical value of
Ω (base-two).
4.3 Mutual Information is Program Size?
I was asked a question about mutual information.
The fourth complexity measure in this list, mutual information,
is actually rather different from the other three.
It isn't the size of a computer program!
Instead it's defined to be
4.4 How do you actually make things self-delimiting?!
Is
4.5 What is LISP?
LISP = LISt Processing. Lists are arbitrarily long nested tuples,
and may have repeated elements.
define (f n)
if = n 0 1
* n (f - n 1)
(define (f n)
(if (= n 0) 1
(* n (f (- n 1)))
))
Finally I should mention that car gives you the first element of a list,
cdr gives you the rest of the list, and cons inverts car
and cdr.
4.6 LISP Program-Size Complexity
You can get a toy version of AIT by using LISP program size!
A formal mathematical theory with LISP complexity N cannot
enable you to prove that a LISP expression is elegant
if the elegant expression's size is greater than N + 410 characters!
4.7 How Does TRY work?
TRY plays the fundamental role in my LISP that
eval
plays in traditional, normal LISP. It enables you
to try to evaluate a given expression for a limited
or unlimited amount of time, and giving it raw binary
data on the side, which it must access in a self-delimiting manner.
Also,
display's
from within the expression are captured,
thus giving a mechanism for an S-expression to produce
an infinite amount of output, not just a final value.
try time-limit/no-time-limit expression binary-data
TRY has the three arguments shown above.
(success/failure
value-of-expression/out-of-time/out-of-data
(list of captured displays from within expression...))
(failure out-of-time (list of theorems...))
The size of the list of theorems depends on how much
time the TRY was given to run the formal theory.
4.8 LISP Program for Our Standard Universal Computer U
define (U p)
cadr try no-time-limit
'eval read-exp
p
4.9 Running U on Simple Examples
Here's our first example of a program for U.
This program is all LISP prefix with no binary data.
The prefix is '(a b c).
run-utm-on
bits ' '(a b c)
This yields
(a b c).
run-utm-on
append bits 'read-bit
'(0)
This yields 0.
run-utm-on
append bits 'read-bit
'(1)
This yields 1.
4.10 Subadditivity Revisited
cons eval read-exp
cons eval read-exp
nil
4.11 Some Open Problems for Future Research
I've greatly enjoyed this Winter School and my visit to Estonia!
Thank you very much for inviting me!
Additional Reading
The six papers listed here were the handouts that accompanied my lectures at EWSCS '03.
The two books listed below provide the reference material for the course,
including complete proofs for all the theorems that I stated here,
and a detailed explanation of my version of LISP.
The papers listed here are also available at the author's websites:
http://arxiv.org/abs/math.HO/0302333
Presented at the XIX. Deutscher Kongreß für
Philosophie, Grenzen und Grenzüberschreitungen
[German Philosophy Congress on Limits and Transcending Limits], 23.-27. September 2002 in Bonn.
http://arxiv.org/abs/math.HO/0210035
G. Chaitin, The Limits of Mathematics, Springer-Verlag, Singapore, 1998.
http://www.cs.umaine.edu/~chaitin
http://www.cs.auckland.ac.nz/CDMTCS/chaitin
LISP code for my two books may be found at
http://www.cs.umaine.edu/~chaitin/ait
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/ait
http://www.cs.umaine.edu/~chaitin/lm.html
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/lm.html
The LISP interpreter is at
http://www.cs.umaine.edu/~chaitin/unknowable/lisp.html
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/unknowable/lisp.html