Inductive Reasoning and Kolmogorov Complexity

Ming Li*
University of Waterloo, Department of Computer Science
Waterloo, Ontario N2L 3G1, Canada

Paul M.B. Vitányi
Centrum voor Wiskunde en Informatica, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
and Universiteit van Amsterdam, Faculteit Wiskunde en Informatica

* The work of the first author was supported in part by National Science Foundation Grant DCR-8606366, Office of Naval Research Grant N00014-85-k-0445, Army Research Office Grant DAAL03-86-K-0171, and NSERC Operating Grant OGP0036747. Part of the work was performed while he was at the Department of Computer Science, York University, North York, Ontario, Canada. A preliminary form of this paper appeared in Proc. 4th IEEE Structure in Complexity Theory Conference, 1989.

ABSTRACT

This is a sloppy first draft of [J. Comp. System Sciences, 44:2(1992), 343-384]. Also, there are some problems with the pictures and references due to the obsolete troff processing.

Reasoning to obtain the 'truth' about reality from external data is an important, controversial, and complicated issue in man's effort to understand nature. (Yet, today, we try to make machines do this.) There have been old useful principles, new exciting models, and intricate theories scattered over vastly different areas, including philosophy of science, statistics, computer science, and psychology. We focus on inductive reasoning in correspondence with ideas of R.J. Solomonoff. While his proposals result in perfect procedures, they involve the noncomputable notion of Kolmogorov complexity. In this paper we develop the thesis that Solomonoff's method is fundamental in the sense that many other induction principles can be viewed as particular ways to obtain computable approximations to it. We demonstrate this explicitly in the cases of Gold's paradigm for inductive inference, Rissanen's minimum description length (MDL) principle, Fisher's maximum likelihood principle, and Jaynes' maximum entropy principle. We present several new theorems and derivations to this effect. We also delimit what can be learned and what cannot be learned in terms of Kolmogorov complexity, and we describe an experiment in machine learning of handwritten characters. We also give an application of Kolmogorov complexity in Valiant-style learning, where we want to learn a concept probably approximately correct, in feasible time and from feasibly many examples.

    The eye of the understanding is like the eye of the sense; for as you may see great objects through small crannies or levels, so you may see great axioms of nature through small and contemptible instances. [Francis Bacon, Sylva Sylvarum 337, 1627]

1. A Historical View of Inductive Reasoning

The Oxford English Dictionary gives as the meaning of induction: the process of inferring a general law or principle from the observations of particular instances. This defines precisely what we would like to call inductive inference. On the other hand, we regard inductive reasoning as a more general concept than inductive inference, namely as a process of re-assigning a probability (or credibility) to a law or proposition from the observation of particular instances. In other words, in the way we use the notions, inductive inference draws conclusions that consist in accepting or rejecting a proposition, while inductive reasoning only changes the degree of our belief in a proposition. The former is a special case of the latter. In this paper we discuss inductive reasoning in correspondence with R.J. Solomonoff's ideas as expressed in e.g. [Solomonoff 1964]. However, Solomonoff's procedure is not effective, since it involves the noncomputable Kolmogorov complexity of objects. We shall show, however, that there is considerable structure in many different approaches proposed for induction, since they can be variously derived as computable approximations to Solomonoff's method.
The history of inductive inference, which is as old as empirical science itself, dates back at least to the Greek philosopher of science Epicurus (342? - 270? B.C.). To reason by induction is nothing but to learn from experience. As the sun rises day by day, our belief that the sun will rise tomorrow increases, and we eventually infer the truth that the sun will rise every morning. As human history evolves, man tries to understand and explain the events that happen around him: this takes the form of different induction methods to formulate scientific theories from positive and negative, fortunate and unfortunate, lucky and unlucky, happy and miserable experiences. Two metaphysical principles stand out and prevail today: Epicurus' principle of multiple explanations (or indifference) and Occam's principle of simplest explanation (Occam's razor).

The Principle of Multiple Explanations: If more than one theory is consistent with the data, keep them all.

The source of the following material is [Epicurus]. Epicurus, in his Letter to Pythocles, explains that: There are cases, especially of events in the heavens such as the risings and settings of heavenly bodies and eclipses, where it is sufficient for our happiness that several explanations be discovered. In these cases, the events 'have multiple causes of coming into being and a multiple predication of what exists, in agreement with the perceptions.' Epicurus maintains that, if several explanations are in agreement with the (heavenly) phenomena, then we must keep all of them for two reasons. Firstly, the degree of precision achieved by multiple explanations is sufficient for human happiness. Secondly, it would be unscientific to prefer one explanation to another when both are equally in agreement with the phenomena. This, he claims, would be to 'abandon physical inquiry and resort to myth.' His follower Lucretius (95 - 55 B.C.) illustrates the inevitability of the use of the multiple explanation principle by the following example: ''There are also some things for which it is not enough to state a single cause, but several, of which one, however, is the case. Just as if you were to see the lifeless corpse of a man lying far away, it would be fitting to state all the causes of death in order that the single cause of this death may be stated. For you would not be able to establish conclusively that he died by the sword or of cold or of illness or perhaps by poison, but we know that there is something of this kind that happened to him.''

Based on the same intuition, in the calculus of probabilities it has been customary to postulate the 'principle of indifference' or the 'principle of insufficient reason'. When there is no other evidence, because of the absolute lack of knowledge concerning the conditions under which a die falls, we have no reason to assume that a certain face has higher probability of turning up. Hence we assume that each side of the die has the probability 1/6.
The principle of indifference considers events to be equally probable if we have not the slightest knowledge of the conditions under which each of them is going to occur. For the case of a die, this actually coincides with the so-called 'maximum entropy principle', which we will discuss later, and which states that we should choose the probabilities pi of face i being the outcome of a trial, i = 1, 2, ..., 6, such that

−Σ_{i=1}^{6} pi ln pi

is maximized under the only constraint Σ_{i=1}^{6} pi = 1. We obtain precisely pi = 1/6 for i = 1, 2, ..., 6.

The second and more sophisticated principle is the celebrated Occam's razor principle, commonly attributed to William of Ockham (1290? - 1349?). It enters the scene about 1500 years after Epicurus. In sharp contrast to the principle of multiple explanations, it states:

Occam's Razor Principle: Entities should not be multiplied beyond necessity.

This is generally interpreted as: Among the several theories that are all consistent with the observed phenomena, one should pick the simplest theory. (According to Bertrand Russell, the actual phrase used by Ockham was: 'It is vain to do with more what can be done with fewer.') Surely Occam's razor principle is easily understood from a 'utilitarian' point of view: if both theories explain the same set of facts, why not use the simpler one?! However, things become more intricate when we want to know whether a simpler theory is really better than the more complicated one. This also raises another question, which has been a bone of contention in philosophy ever since the razor's inception: what is the proper measure of simplicity? Is x^100 + 1 more complicated than ax^17 + bx^2 + cx + d? E.g., the distinguished contemporary philosopher K. Popper pronounced that on such grounds the razor is without sense. However, it is interesting to notice that the principle can be given objective contents, and it has recently been applied very successfully, in many different forms, in computational learning theory. To explain this, let us consider an over-simplified example of inferring a finite automaton with one-letter input using Occam's razor principle.

Accepted inputs: 0, 000, 00000, 000000000;
Rejected inputs: ε, 00, 000000.

For these data there exist many consistent finite automata. Figure 1 shows the trivial automaton and Figure 2 shows the smallest automaton, where S denotes the starting state and darker states are accepting states.

[Figure 1: A trivial automaton: a chain of states starting at S that simply encodes the data.]

[Figure 2: The smallest automaton: two states, accepting exactly the strings with an odd number of 0's.]

Intuitively, the automaton in Figure 1 just encodes the data plainly; we therefore do not expect that machine to anticipate future data. On the other hand, the second machine makes a plausible inference that the language accepted consists of the strings with an odd number of 0's. The latter appeals to our intuition as a reasonable inference. However, a too simplistic application of Occam's razor principle may also lead to nonsense, as the following story illustrates.

Once upon a time, there was a little girl named Emma. Emma had never eaten a banana, nor had she been on a train. One day she went for a journey from New York to Pittsburgh by train. To relieve Emma's anxiety, her mother gave her a large bag of bananas. At Emma's first bite of a banana, the train plunged into a tunnel. At the second bite, the train broke into daylight again. At the third bite, Lo! into a tunnel; the fourth bite, La! into daylight again. And so on all the way to Pittsburgh and to the bottom of her bag of bananas.
Our bright little Emma (applying Occam's razor principle?) told her grandpa: 'Every odd bite of a banana makes you blind; every even bite puts things right again.' [After N.R. Hanson, 'Perception and Discovery', Freeman, Cooper & Co, 1969, p.359.]

Let us consider how the idea of 'simplicity' affects a scientist's thinking. We refer to a beautiful study of simplicity by Kemeny [Kemeny]. Initially, there were no new facts that failed to be explained by the Special Theory of Relativity. The incentive of Albert Einstein to invent the General Theory of Relativity was his conviction that the Special Theory was not the simplest theory that could explain all the observed facts. Reducing the number of independent variables obviously simplifies a theory. By the requirement of general covariance, Einstein succeeded in replacing the previously independent 'gravitational mass' and 'inertial mass' by a single concept.

In spite of the apparently universal acceptance of Occam's razor, consciously or unconsciously, the concept of simplicity remains highly controversial. Generally speaking, it has remained a crude, non-precise idea. Things are subtler than they appear. Is the following formulation precise?

Occam's Razor Rule: Select a hypothesis which is as well in agreement with the observed value as possible; if there is any choice left, choose the simplest possible hypothesis.

Example. Consider the problem of fitting n points by a polynomial. The above rule tells us to choose the polynomial of lowest degree passing through all n points. But due to measurement precision and possible noise, whatever the degree of the polynomial from which the points originated, we will end up with a polynomial of degree n − 1 which fits the data precisely. But this polynomial most likely does not help us to predict future points.

Example. Consider another example given by Kemeny. Let there be an unknown number of white balls and black balls in a sealed urn. Through an opening you randomly pick one ball at a time, note its color, replace it, and shake the urn thoroughly. After n draws you must decide what fraction of the balls in the urn is white. The possible hypotheses state that some rational fraction r of balls in the urn is white, where 0 ≤ r ≤ 1. By the above rule, if in n draws m white balls are selected, then we should formulate the hypothesis r = m/n. Let there be 1/3 white and 2/3 black balls. Then the probability of getting the true hypothesis r = 1/3, i.e., m = n/3, is zero if n is not divisible by 3, and it tends to zero even under the assumption that n is always divisible by 3. On the other hand, the probability of obtaining a hypothesis r with 1/3 − ε ≤ r ≤ 1/3 + ε, for any fixed ε > 0, tends to 1 exponentially fast, by the so-called Chernoff formula. (For Chernoff's formula see e.g. [Angluin and Valiant].) A small simulation illustrating both effects follows below. Even when the process converges, n may be too large for practical use.

Kemeny's Rule: Select the simplest hypothesis compatible with the observed values. Here 'compatible' is defined as follows: the hypothesis Hi is compatible with data D if, assuming the truth of Hi, there was at most one percent chance of getting a deviation as great as m(Hi, D), for some measure function m. This is related to Valiant's learning theory, to be discussed later. But how does one define simplicity? Is 1/4 simpler than 1/10? Is 1/3 simpler than 2/3? Saying that an urn contains a 1/3rd part white balls comes down to the same thing as saying that it contains a 2/3rd part black balls.
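The convergence behaviour in the urn example is easy to check empirically. The following small Python simulation (our own illustration; the argument above does not depend on it) draws with replacement from an urn in which 1/3 of the balls are white, and estimates both the probability that the sample fraction is exactly 1/3 and the probability that it lies within ε = 0.05 of 1/3.

    import random

    def white_draws(n, p=1/3):
        # Number of white balls seen in n draws with replacement.
        return sum(random.random() < p for _ in range(n))

    def estimate(n, eps=0.05, trials=2000, p=1/3):
        exact = within = 0
        for _ in range(trials):
            m = white_draws(n, p)
            exact += (3 * m == n)                # sample fraction is exactly 1/3
            within += (abs(m / n - p) <= eps)    # sample fraction within eps of 1/3
        return exact / trials, within / trials

    for n in (30, 300, 3000):
        ex, wi = estimate(n)
        print(f"n={n:5d}: P[m/n = 1/3] ~ {ex:.3f}   P[|m/n - 1/3| <= 0.05] ~ {wi:.3f}")

As n grows, the first estimate tends to zero while the second tends to one, in accordance with Chernoff's formula.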
Kemeny warned: 'do not use more precision in your theories than is necessary.' But what is necessary and what is not? All these issues are very subjective. Does a simple theory generate a hypothesis which is good for predicting future outcomes? How do we achieve fast convergence? How does one trade between 'simplicity' and 'truth' ('compatibility')? Kemeny actually asked for 'a criterion combining an optimum of simplicity and compatibility' [crediting Nelson Goodman for this suggestion].

1.1. Combining Epicurus, Ockham, and Bayes

The study of inductive reasoning predates artificial intelligence and computer science by more than 2000 years. There is a tremendous amount of literature in many different fields, under diverging terminologies. Our goal is to extract a common core of simple ideas underlying all these approaches, in the spirit of Occam's razor principle. We will start with Bayesian inference theory. To apply Bayesian-type reasoning one has to assign a priori probabilities (prior probabilities) to each possible hypothesis. Since the posthumous publication in 1763 of Th. Bayes' (?? - 1761) famous memoir 'An essay towards solving a problem in the doctrine of chances' by his friend Richard Price [Bayes, Essay], there has been continuous bitter debate about the controversial prior probability in the Bayesian formula. Kolmogorov complexity was invented by its first inventor, R. Solomonoff, as an auxiliary notion to resolve this particular problem. Namely, using Kolmogorov complexity he found a single 'universal' prior distribution which can be substituted for any particular actually valid distribution (as long as it is computable) in Bayes' formula, with approximately as good results as if the actually valid distribution had been used. It sounds like magic, but Solomonoff's approach does give a more or less satisfactory solution to this seemingly unattainable objective.

The elegant idea of a universal prior is a combination of Occam's razor and modern computability theory. However, the universal prior is uncomputable, since it involves Kolmogorov complexity. In this paper we develop the thesis that many theories, models, and principles for inductive reasoning, formulated both before and after Solomonoff's inception, can be rigorously derived as particular computable approximations to it. We first describe the basics of Bayesian theory and how to apply Kolmogorov complexity to obtain the universal prior probability distribution. We then derive the Gold paradigm and its principles. We derive a form of Rissanen's Minimum Description Length (MDL) principle. From the MDL principle Rissanen derives Fisher's maximum likelihood principle and Jaynes' maximum entropy principle. This paper contains a review of all these theories and principles. It has been our experience that some experts say the connections as claimed are obvious, while some other experts deny those connections exist. Thus, since the proof of the pudding is in the eating, we explicitly establish the derivations together with the appropriate related theorems. We also describe an experiment we have performed in machine learning of recognition of handwritten characters using the MDL principle. Combining Gold-style inference with ideas from computational complexity theory leads to Valiant's model of learning. We give an application of the universal prior distribution to obtain a theory of learning simple concepts under simple distributions.
A more extensive treatment of this material will be given in our forthcoming textbook [Li and Vitányi, An Introduction to Kolmogorov Complexity and Its Applications].

2. The Universal Prior Distribution

2.1. Bayesian Inference

In the following discussion of probability we assume the usual so-called Kolmogorov Axioms, see e.g. [Feller, An Introduction to Probability Theory and Its Applications]. For our purpose we need the following. We have a hypothesis space, H = {H1, H2, ...}, which consists of a countable set of hypotheses that are mutually exclusive (in the sense that at most one of them is right) and exhaustive (in the sense that at least one of them is right). With each hypothesis Hi we associate a probability P(Hi) such that Σ_i P(Hi) = 1. The learner is supplied with some data D, providing information about which hypothesis is correct. From the definition of conditional probability, i.e., P(A | B) = P(A ∩ B)/P(B), it is easy to derive Bayes' formula (rewrite P(A ∩ B) in the two possible different ways, equate the two expressions, and set A = Hi and B = D):

P(Hi | D) = P(D | Hi) P(Hi) / P(D),   (1)

where P(D) = Σ_i P(D | Hi) P(Hi). We interpret the different variables in the formula as follows. The Hi's represent the possible alternative hypotheses concerning the phenomenon we wish to discover. The term D represents the empirically or otherwise known data concerning this phenomenon. The term P(D), the probability of data D, can be considered as a normalizing factor so that Σ_i P(Hi | D) = 1. The term P(Hi) is called the prior probability or initial probability of hypothesis Hi, i.e., the probability that Hi is true before we have seen any evidence. The term P(Hi | D) is called the final, a posteriori, or inferred probability, which reflects the probability of Hi modified from the prior probability P(Hi) after seeing the data D. The term P(D | Hi), the conditional probability of seeing D when Hi is true, is assumed to be computable from D and Hi.

In many learning situations, data can only be consistent with a hypothesis Hi in the sense of being forced by it, so that P(D | Hi) = 1. If the data are inconsistent with hypothesis Hi then P(D | Hi) = 0. In such a situation, the data either are determined by a hypothesis, or disqualify it. (We assume there is no noise that distorts the data.) For example, this is the case when the hypotheses name a language L and the data are assertions of the form x ∈ L or x ∉ L.

The most interesting term is the prior probability P(Hi). In the context of machine learning, P(Hi) is often considered as the learner's initial degree of belief in hypothesis Hi. In essence, Bayes' rule is a mapping from the a priori probability P(Hi) to the a posteriori probability P(Hi | D), where the mapping is determined by data D. In general, the problem is not so much that in the limit the inferred probability would not 'condense' on the 'true' hypothesis, but that the inferred probability should give as much information as possible about the possible hypotheses from only a limited number of data; cf. the example below. In fact, the continuing bitter debate between the Bayesian and non-Bayesian opinions has centered on the prior probability. The controversy is caused by the fact that Bayesian theory does not say how to initially derive the prior probabilities for the hypotheses; Bayes' rule only says how they are to be updated. However, in each actual case the prior probabilities may be unknown, uncomputable, or conceivably nonexistent. (What is the prior probability of use of words in written English? There are many different sources of different social backgrounds living in different ages.)
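Since formula (1) is purely mechanical once the prior is fixed, it is easy to trace on a toy hypothesis space. The following Python sketch is our own illustration: the hypotheses (coins of different biases) and the uniform prior are assumptions chosen only to show how the data D reshape the prior P(Hi) into the posterior P(Hi | D).

    # Hypotheses Hi: a coin has bias p_i. Data D: an observed sequence of flips.
    biases = [0.1, 0.3, 0.5, 0.7, 0.9]        # hypothesis space (assumed)
    prior = [1 / len(biases)] * len(biases)   # P(Hi): uniform prior (assumed)

    def likelihood(p, data):
        # P(D | Hi) for independent flips; data is a string such as '11010'.
        result = 1.0
        for bit in data:
            result *= p if bit == '1' else 1 - p
        return result

    def posterior(prior, data):
        # Bayes' formula (1): P(Hi | D) = P(D | Hi) P(Hi) / P(D).
        joint = [likelihood(p, data) * w for p, w in zip(biases, prior)]
        p_d = sum(joint)                      # P(D), the normalizing factor
        return [j / p_d for j in joint]

    for data in ('1', '1101', '110111011101'):
        print(data, ['%.3f' % q for q in posterior(prior, data)])

With more data, the posterior condenses on the bias closest to the observed frequency of 1's.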
This problem is solved if we can find a single probability distribution to use as the prior distribution in each different case, with approximately the same result as if we had used the real distribution. Surprisingly, this turns out to be possible up to some mild restrictions.

Example. We use an example of von Mises [von Mises, Probability, Statistics and Truth]. Let an urn contain many dice, each with a different attribute. A die with attribute p has probability p of showing 6 in a random throw. For convenience, assume the attribute set A is finite, and the difference between each pair of attributes is greater than 2ε. We randomly draw a die from the urn; our task is to determine its attribute. We do this by experimenting: throw the die n times independently. If 6 shows up m times, we choose the attribute that is nearest to m/n. Let Hp be the event of drawing a die with attribute p from the urn. Let Dq be the experimental data such that m successes (6's) were observed out of n throws and |q − m/n| < ε, for q ∈ A. So, by formula (1)*,

P(Hp | Dq) = P(Dq | Hp) P(Hp) / P(Dq),

where P(Dq) = Σ_p P(Dq | Hp) P(Hp). According to Chernoff's formula (see [Angluin and Valiant]), for 0 < δ < 1 we have:

P(m − np > δnp | Hp) < e^{−δ²np/3},
P(np − m > δnp | Hp) < e^{−δ²np/2}.

Hence, if p is the true attribute of the die we have drawn then, choosing δ = ε/2p (so that |p − m/n| ≥ ε implies |m − np| > δnp), P(Dq | Hp) goes to 0 exponentially fast as the number of experiments increases, for q ≠ p, and P(Dp | Hp) goes to 1 at the same rate. Hence P(Hp | Dp) goes to 1. Thus we derive the correct attribute of the die with high probability, using a polynomial number of throws. The interesting point is that if the number of trials is small, then the inferred probability P(Hp | Dq) may depend heavily on the prior probability P(Hp). However, if n is large, then irrespective of the prior distribution, the inferred probability condenses more and more around m/n.

Example. We explain a simplified version of Solomonoff's theory of inductive inference. The simplification lies in that we, for the moment, consider only a discrete sample space like {0, 1}*, the set of all finite binary sequences, rather than {0, 1}^∞, the set of all one-way infinite binary sequences. We view theory formation in science as the process of obtaining a compact description of the past observations together with predictions of future ones. The investigator observes increasingly larger initial segments of an infinite binary sequence as the outcome of an infinite sequence of experiments on some aspect X of nature. To describe the underlying regularity of this sequence, the investigator tries to formulate a theory that governs X, on the basis of the outcome of past experiments.

* Properly speaking, formula (1) is not due to Bayes, but to P.S. Laplace (1749 - 1827), who stated the formula and attached Bayes' name to it [Laplace]. Actually, Bayes in his original paper [Bayes, Essay] assumed the uniform distribution for the a priori probability, hence he essentially derived P(Hi | D) = P(D | Hi)/Σ_i P(D | Hi). Although this formula can be derived from (1) by simply assuming that all P(Hi) are the same, Bayes did not state his result in the general form (1), nor did he derive his result through a formula similar to (1). Despite the fact that Bayes' rule is just a rewriting of the definition of conditional probability and nothing more, it is its interpretation and applications that are most profound and have caused much bitter controversy during the past two centuries.
Candidate theories (hypotheses) are identified with computer programs that compute binary sequences starting with the observed initial segment.

First assume the existence of a prior probability distribution, described by probability function P, over a discrete sample space Ω = {0, 1}*. Define a function µ over Ω by µ(x) = Σ{P(xy) : y ∈ Ω}. Thus, µ(x) is the probability of a sequence starting with x. Given a previously observed data string S, the inference problem is to predict the next symbol in the output sequence, i.e., to extrapolate the sequence S. In terms of the variables in Formula (1), Hi is the hypothesis that the sequence under consideration starts with initial segment Sa. The data D consist in the assertion that the sequence in fact starts with initial segment S. Thus, for P(Hi) and P(D) in Formula (1) we substitute µ(Sa) and µ(S), respectively, and obtain, for a = 0 or a = 1,

P(Sa | S) = P(S | Sa) µ(Sa) / µ(S).

Here µ(S) = P(S | S0) µ(S0) + P(S | S1) µ(S1) + P(S). Obviously, P(S | Sa) = 1 for any a, hence

P(Sa | S) = µ(Sa) / µ(S).   (2)

In terms of inductive inference or machine learning, the final probability P(Sa | S) is the probability of the next symbol being a, given the initial sequence S. Obviously we now only need the prior probability to evaluate P(Sa | S). The goal of inductive inference in general is to be able either (i) to predict (extrapolate) the next element of S, or (ii) to infer an underlying effective process (in the most general case, a Turing machine, according to the Church-Turing thesis) that generated S, and hence to be able to predict the next symbol.

In order to solve the problem for unknown prior probability, Solomonoff proposed what he called a universal prior distribution. We now carefully define the universal prior distribution and prove several fundamental theorems due to Solomonoff and L.A. Levin, and afterwards continue this example. The definitions and theorems are so fundamental that our approach totally rests upon them. These results are in some form hidden in [Solomonoff 1964], [Solomonoff 1978], [Zvonkin and Levin], [Gács, Lecture Notes, 1987]. For various reasons they are difficult to access, and almost unknown except to a few people doing research in this area. It seems useful to recapitulate them. First we need the basic definitions of Kolmogorov complexity.

2.2. Kolmogorov Complexity

Inductive reasoning was the midwife that stood at the cradle of Kolmogorov complexity. Nowadays, Kolmogorov complexity has been applied in many areas of computer science and mathematics (see the survey in the Handbook of Theoretical Computer Science [Li and Vitányi]), and few realize that Kolmogorov complexity was at first invented for the purpose of inductive inference. In this essay, we go back to this origin. We are interested in defining the complexity of a concrete individual finite string of zeros and ones. Unless otherwise specified, all strings will be binary and of finite length. All logarithms in this paper are base 2, unless it is explicitly noted they are not. If x is a string, then l(x) denotes the length (number of zeros and ones) of x. We identify throughout the xth finite binary string with the natural number x, according to the correspondence

(ε, 0), (0, 1), (1, 2), (00, 3), (01, 4), (10, 5), ...,

where ε denotes the empty string.
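For concreteness, this identification of strings with natural numbers can be programmed directly; the two helper functions below (our own sketch) implement the standard bijection just listed.

    def nat_to_string(n):
        # Map 0, 1, 2, 3, 4, ... to '', '0', '1', '00', '01', ... .
        bits = bin(n + 1)[2:]   # binary expansion of n+1 always starts with '1'
        return bits[1:]         # dropping that leading '1' gives the n-th string

    def string_to_nat(s):
        # Inverse map: prepend '1' and read as a binary number, minus one.
        return int('1' + s, 2) - 1

    for n in range(7):
        s = nat_to_string(n)
        assert string_to_nat(s) == n
        print(n, repr(s))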
Intuitively, we want to call a string simple if it can be described in a few words, like 'the string of a million ones'; a string is considered complex if it cannot be so easily described, like a 'random' string that does not follow any rule, and which we therefore do not know how to describe apart from giving it literally. A description of a string may depend on two things: the decoding method (the machine which interprets the description) and the outside information available (the input to the machine). We are interested in descriptions which are effective, and restrict the decoders to Turing machines. Without loss of generality, our Turing machines use binary input strings, which we call programs. More formally, fixing a Turing machine T, we would like to say that p is a description of x if, on input p, T outputs x. It is also convenient to allow T some extra information y to help to generate x. We write T(p, y) = x to mean that Turing machine T with inputs p and y terminates with output x.

Definition 1. The descriptional complexity C_T of x, relative to Turing machine T and binary string y, is defined by

C_T(x | y) = min{l(p) : p ∈ {0,1}*, T(p, y) = x},

or ∞ if no such p exists.

The complexity measure defined above is useful and makes sense only if the complexity of a string does not depend on the choice of T. Therefore the following simple theorem is vital. This Invariance Theorem was given by Solomonoff [Solomonoff 1964], Kolmogorov [Kolmogorov, Three Approaches], and Chaitin [Chaitin].

Theorem 1. There exists a universal Turing machine U such that for any other Turing machine T there is a constant c_T such that for all strings x, y,

C_U(x | y) ≤ C_T(x | y) + c_T.

Proof. Fix some standard enumeration of Turing machines T1, T2, .... Let U be the universal Turing machine such that, when started on input 0^n 1p, p ∈ {0,1}*, U simulates the nth Turing machine Tn on input p. For convenience in the proof, we choose U such that if Tn halts, then U first erases everything apart from the halting contents of Tn's tape, and also halts. By construction, for each p ∈ {0,1}*, Tn started on p eventually halts iff U started on 0^n 1p eventually halts. Choosing c_T = n + 1 for T = Tn finishes the proof.

Clearly, the universal Turing machine U that satisfies the Invariance Theorem is optimal in the sense that C_U minorizes each C_T up to a fixed additive constant (depending on U and T). Moreover, for each pair of universal Turing machines U and U′ satisfying the Invariance Theorem, the complexities coincide up to an additive constant (depending only on U and U′): for all strings x, y,

|C_U(x | y) − C_{U′}(x | y)| ≤ c_{U,U′}.

Therefore, we set the canonical conditional Kolmogorov complexity C(x | y) of x under condition of y equal to C_U(x | y), for some fixed optimal U. We call U the reference Turing machine. Hence the Kolmogorov complexity of a string does not depend, up to a fixed additive constant, on the choice of decoding method, and is well-defined. Define the unconditional Kolmogorov complexity of x as C(x) = C(x | ε), where ε denotes the empty string (l(ε) = 0).

Definition 2. In the sequel we need the prefix complexity variant, or self-delimiting complexity, rather than C(x) from Definition 1. A prefix machine is a Turing machine with three tapes: a one-way input tape, a one-way output tape, and a two-way work tape. Initially, the input tape contains an indefinitely long sequence of bits.
If the machine halts, then the initial segment of the input tape it has read up till that time is considered the input or program, and the contents of the output tape is the output. Clearly, the set of programs of each such machine is a prefix-code. (Recall that if p and q are two code words of a prefix-code, then p is not a proper prefix of q.) We can give an effective enumeration of all prefix machines in the standard way. Then the prefix descriptional complexity of x ∈ {0,1}*, with respect to prefix machine T and binary string y, is defined as

K_T(x | y) = min{l(p) : p ∈ {0,1}*, T(p, y) = x},

or ∞ if no such p exists. One can prove an Invariance Theorem for prefix complexity, and define the conditional and unconditional prefix Kolmogorov complexity by fixing some reference optimal prefix machine, in exactly the same way as before, so we do not repeat the construction.

Remark. The prefix Kolmogorov complexity of string x is the length of the shortest prefix program that outputs x. In this exposition, we will use K(x) to denote the prefix Kolmogorov complexity of x. C(x) and K(x) differ by at most an additive term of 2 log K(x). In some applications this does not make any difference. But in some other applications, for example inductive inference, it is vital. In particular, we need the property that the series Σ_x 2^{−K(x)} converges; cf. below.

Definition 3. A binary string x is incompressible if K(x) ≥ l(x).

Remark. Since Martin-Löf [Martin-Löf 1966] has shown that incompressible strings pass all effective statistical tests for randomness, we will also call incompressible strings random strings. A simple counting argument shows that most strings are random. The theory of computability shows that the function K(x) is noncomputable, but it can be approximated from above by a computable process. However, at no point in this approximation process can we know the error. Cf. also the surveys [Zvonkin and Levin], [Li and Vitányi, Two Decades of Applied Kolmogorov Complexity].

2.3. Semicomputable Functions and Measures

We consider recursive functions with values consisting of pairs of natural numbers. If ⟨p, q⟩
is such a value, then we interpret it as the rational number p/q, and say that the recursive function is rational-valued.
Definition. A real function f is semicomputable from below iff there exists a recursive function g(x, k) with rational values (or, equivalently, a computable real function g(x, k)), nondecreasing in k, with f(x) = lim_{k→∞} g(x, k). A function f is semicomputable from above if −f is semicomputable from below. (An equivalent definition: f is semicomputable from below if the set {(x, r) : r ≤ f(x), r rational} is recursively enumerable.) A real function f is computable iff there is a recursive function g(x, k) with rational values such that |f(x) − g(x, k)| < 1/k.
Obviously, all recursive functions are computable, and all computable functions are semicomputable.
However, not all semicomputable functions are computable, and not all computable
functions are recursive. Nontrivial examples of functions that are semicomputable from
above but not computable are C (x), C (x | y), K (x), and K (x | y).
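As an aside, computable upper bounds on description length are easy to come by even though K(x) itself is noncomputable. The sketch below uses a real-world compressor (zlib) purely as an illustration; it is of course not the reference universal machine, and the bound it produces comes with no computable guarantee of closeness to K(x).

    import os
    import zlib

    def dl_upper_bound(x: bytes) -> int:
        # A computable upper bound, in bits, on the description length of x.
        # This is NOT K(x): better compressors give smaller bounds, but at no
        # stage can we certify how far the current bound is from K(x).
        return 8 * len(zlib.compress(x, 9))

    million_ones = b'1' * 10**6         # intuitively simple: short description
    random_bytes = os.urandom(10**6)    # intuitively complex: incompressible

    print(dl_upper_bound(million_ones))   # small
    print(dl_upper_bound(random_bytes))   # close to 8 * 10**6 bits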
The following analysis is a simplified version, over the discrete space N (or the set of finite binary strings), of Zvonkin and Levin [Zvonkin and Levin]. We follow to some extent [Gács, Lecture Notes]. Functions µ: N → [0, 1] that satisfy the usual properties of probability distributions except that

Σ_x µ(x) ≤ 1
we shall call measures. We say that a measure µ (multiplicatively) dominates a measure µ′ if there exists a constant c such that, for all x in N, we have µ′(x) ≤ c µ(x). It is known from calculus that no measure µ dominates all measures: for each measure µ there is a measure µ′ such that lim_{x→∞} µ′(x)/µ(x) = ∞. However, if we restrict ourselves to the class of semicomputable measures, then it turns out that this class contains an 'absorbing' element, a measure that dominates all measures in the class. We call a measure that dominates all other measures in a given class a universal measure for that class. This important observation, that such a measure exists, was first made by Levin [Zvonkin and Levin].
Theorem 2. The class of measures that are semicomputable from below contains a universal measure.

Proof. First we consider the standard enumeration of all partial recursive functions φ1, φ2, .... Each φ = φi in this list is a function on the positive integers. Let ⟨·⟩ denote a standard effective invertible pairing function over N, associating a unique natural number ⟨x, k⟩ with each pair (x, k) of positive integers. With each φ we associate the function s with approximations s_k(x) = φ(⟨x, k⟩), k = 1, 2, ..., the values interpreted as rationals. The resulting s-enumeration contains all semicomputable functions. Next we use each semicomputable function s to compute a measure µ from below. Initially, set µ(x) = 0 for all x. If s(1) is undefined then µ will not change any more and it is trivially a measure. Otherwise, for k = 1, 2, ..., if s_k(1) + s_k(2) + ... + s_k(k) ≤ 1 then set µ(i) := s_k(i) for i = 1, 2, ..., k, else the computation of µ is finished.
There are three mutually exclusive ways the computation of µ can go, exhausting all possibilities. Firstly, s is already a measure and µ := s. Secondly, for some x and k with x ≤ k the value s_k(x) is undefined. Then the values of µ do not change any more from µ(i) = s_{k−1}(i) for i = 1, 2, ..., k − 1, and µ(i) = 0 for i ≥ k, even though the computation of µ goes on forever. Thirdly, there is a first k such that s_k(1) + s_k(2) + ... + s_k(k) > 1, that is, the new approximation of µ violates the condition of measure. Then the approximation of µ is finished as in the second case. But in this case the algorithm terminates, and µ is even computable.
Thus, the above procedure yields an effective enumeration µ1, µ2, ... of all semicomputable measures. Define the function µ0 as:

µ0(x) = Σ_n 2^{−n} µn(x).

It follows that µ0 is a measure, since

Σ_x µ0(x) = Σ_n 2^{−n} Σ_x µn(x) ≤ Σ_n 2^{−n} = 1.

The function µ0 is also semicomputable from below, since µn(x) is semicomputable from below in n and x. (Use the universal partial recursive function φ0 and the construction above.) Finally, µ0 dominates each µn, since µ0(x) > 2^{−n} µn(x). Therefore, µ0 is a universal semicomputable measure.
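The mixing construction at the heart of the proof can be illustrated numerically. In the following sketch (our own toy example; the three hand-picked measures stand in for the enumeration µ1, µ2, ...) the mixture µ0(x) = Σ_n 2^{−n} µn(x) demonstrably dominates each component with constant 2^n.

    # Three toy measures on {1, 2, 3, ...}, each of total mass <= 1 (assumed).
    def mu1(x): return 2.0 ** -x                # mass 1
    def mu2(x): return 1.0 / (x * (x + 1))      # mass 1
    def mu3(x): return 0.5 if x == 1 else 0.0   # mass 1/2: a deficient measure

    measures = [mu1, mu2, mu3]

    def mu0(x):
        # The mixture mu0(x) = sum_n 2^-n mu_n(x); total mass <= sum_n 2^-n <= 1.
        return sum(2.0 ** -n * mu(x) for n, mu in enumerate(measures, start=1))

    # Domination: mu0(x) >= 2^-n mu_n(x) for every n and x, by construction.
    for n, mu in enumerate(measures, start=1):
        assert all(mu0(x) >= 2.0 ** -n * mu(x) for x in range(1, 200))
    print("mu0 dominates each mu_n with constant 2^n")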
Obviously, there are countably infinitely many universal semicomputable measures. We now fix a reference universal semicomputable measure µ0(x) and denote it by m(x). It will turn out that the function m(x) adequately captures Solomonoff's envisioned universal a priori probability.
If a semicomputable measure is also a probability distribution, then it is computable. Namely, if we compute an approximation µk of the function µ from below for which Σ_x µk(x) > 1 − ε, then we have |µ(x) − µk(x)| < ε for all x.
Any positive function w(x) such that Σ_x w(x) ≤ 1 must converge to zero. Hence m(x) converges to zero as well. However, it converges to zero more slowly than any positive computable function that converges to zero. That is, m(x) is not computable, and therefore it is not a proper probability distribution: Σ_x m(x) < 1. There is no result analogous to Theorem 2 for computable measures: amongst all computable measures there is no universal one. This fact is one of the reasons for introducing the notion of semicomputable measures.
2.4. The Solomonoff-Levin Distribution
The original incentive to develop a theory of algorithmic information content of individual objects was Ray Solomonoff's invention of a universal a priori probability that can be used instead of the actual (but unknown) a priori probability in applying Bayes' Rule. His original suggestion was to set the a priori probability P(x) of a finite binary string x to Σ 2^{−l(p)}, the sum taken over all programs p with U(p) = x, where U is the reference Turing machine of Theorem 1 for the C-complexity. However, using plain Turing machines this is improper, since not only does Σ_x P(x) diverge, but for some x even P(x) itself diverges. To counteract this defect, Solomonoff in 1960 and 1964 used normalizing terms, but the overall result was unconvincing. Levin [Zvonkin and Levin] succeeded in 1970 in finding a proper mathematical expression of the a priori probability, of which we present the simpler version over the discrete domain N. This was elaborated by Levin in 1973 and 1974 [Levin, On the notion of a random sequence], [Levin, non-growth theorems], and by Levin and Gács in 1974 [Gács, symmetry of information], and independently by Chaitin in 1975 [Chaitin, A theory of program size formally identical to information theory].
Definition. The Solomonoff-Levin distribution (actually a measure) on the positive integers is defined by P_U(x) = Σ 2^{−l(p)}, where the sum is taken over all programs p for which the reference prefix machine U of Theorem 1 outputs x. This is a measure because of the following.

Kraft's Inequality. If l1, l2, ... is a sequence of positive integers such that Σ_n 2^{−l_n} ≤ 1, then there is a prefix-code c: N → {0,1}* (i.e., if n ≠ m are positive integers, then c(n) is not a prefix of c(m)) with l(c(n)) = l_n. Conversely, if c: N → {0,1}* is a prefix-code, then the sequence l1, l2, ... with l_n = l(c(n)), n = 1, 2, ..., satisfies the inequality above. See e.g. [Gallager].
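The constructive direction of Kraft's Inequality is short enough to spell out. The following sketch (ours) assigns prefix-free code words to any finite sequence of lengths satisfying the inequality, by the standard device of handing out the leftmost unused word of each length.

    def kraft_code(lengths):
        # Given lengths with sum(2**-l) <= 1, return prefix-free code words
        # (sort by length, hand out the leftmost unused word of each length).
        assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        words, nxt, prev = [None] * len(lengths), 0, 0
        for i in order:
            nxt <<= lengths[i] - prev          # extend to the new length
            words[i] = format(nxt, '0%db' % lengths[i])
            nxt += 1
            prev = lengths[i]
        return words

    print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']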
Hence, by Kraft's Inequality, for the prefix-code formed by the programs p of U we have Σ_p 2^{−l(p)} ≤ 1. Therefore the combined probability Σ_x P_U(x), summed over all x, is less than one, no matter how we choose the reference U, because for some programs q there is no output at all.
Another way to conceive of P_U(x) is as follows. We think of the input to the reference prefix machine U as being provided by indefinitely long sequences of fair coin flips. Thus, the probability of generating a program p for U is P(p) = 2^{−l(p)}, where P is the standard 'coin-flip' uniform measure. (Namely, presented with any infinitely long sequence starting with p, the machine U, being a prefix machine, will read exactly p and no further.) Due to the halting problem, for some q the reference U does not halt. Therefore, the halting probability Ω satisfies

Ω = Σ_x P_U(x) < 1.
Now we are ready to state the remarkable and powerful fact that Levin's universal semicomputable measure m(x), the Solomonoff-Levin universal a priori probability P_U(x), and the simpler expression 2^{−K(x)} all coincide up to a fixed independent multiplicative constant. It is a consequence of universally accepted views in mathematical logic (Church's Thesis) that the widest possible effective notion of simplicity of description of an object x is quantified by K(x). The Solomonoff-Levin distribution can be interpreted as a recursively invariant notion that is the formal representation of ''Occam's Razor'': the statement that one object is simpler than another is equivalent to saying that the former object has higher probability than the latter.

Theorem 3. There is a constant c such that for each x, up to additive constant c, we have

−log m(x) = −log P_U(x) = K(x).
Proof. Since 2^{−K(x)} represents the contribution to P_U(x) by a shortest program for x, we have 2^{−K(x)} ≤ P_U(x) for all x. Since P_U(x) is semicomputable from below by enumerating all programs for x, we have by the universality of m(x) that there is a fixed constant c such that for all x we have P_U(x) ≤ c m(x).

It remains to show that m(x) = O(2^{−K(x)}). This is equivalent to showing that K(x) ≤ −log m(x) + c for some constant c. It suffices to exhibit a prefix-code such that for some other fixed constant c′, for each x there is a code word p with l(p) ≤ −log m(x) + c′, together with a prefix machine T such that T(p) = x. Then K_T(x) ≤ l(p), and hence by the Invariance Theorem 1 also K(x) ≤ l(p) up to a fixed additive constant. First we recall a construction for the Shannon-Fano code.

Claim. If µ is a measure on the integers with Σ_x µ(x) ≤ 1, then there is a binary prefix-code r: N → {0,1}*, with the code words r(1), r(2), ... in lexicographical order, such that l(r(x)) ≤ −log µ(x) + 2. This is the Shannon-Fano code.

Proof. Let [0, 1) be the half-open unit real interval. The half-open interval [0.x, 0.x + 2^{−l(x)}) corresponding to the set (cylinder) of reals Γ_x = {0.y : y = xz} (x a finite and y, z infinite binary strings) is called a binary interval. We cut off disjoint, consecutive, adjacent (not necessarily binary) intervals I_n of length µ(n) from the left end of [0, 1), n = 1, 2, .... Let i_n be the length of the longest binary interval contained in I_n. Set r(n) equal to the binary word corresponding to the first such interval. It is easy to see that I_n is covered by at most four binary intervals of length i_n, from which the claim follows.

Since m(x) is semicomputable from below, there is a partial recursive function φ(t, x) such that φ(t, x) ≤ m(x) for all t, and lim_{t→∞} φ(t, x) = m(x). Let ψ(t, x) = 2^{−k}, with k a positive integer, be the greatest partial recursive lower bound of this form on φ(t, x). We can assume that ψ enumerates its range without repetition. Then

Σ_{x,t} ψ(t, x) = Σ_x Σ_t ψ(t, x) ≤ Σ_x 2 m(x) ≤ 2.

(The series Σ_t ψ(t, x) can only converge to precisely 2m(x) in case there is a positive integer k such that m(x) = 2^{−k}.)

Similarly to before, we chop off consecutive, adjacent, disjoint half-open intervals I_{t,x} of length ψ(t, x)/2, in order of computation of ψ(t, x), starting from the left end of [0, 1). This is possible by the last displayed equation. It is easy to see that we can construct a prefix machine T as follows. If Γ_p is the largest binary interval of I_{t,x}, then T(p) = x. Otherwise, T(p) is undefined (e.g., T doesn't halt).

By construction of ψ, for each x there is a ψ(t, x) > m(x)/2. By the construction in the Claim, each interval I_{t,x} has length ψ(t, x)/2. Each I-interval contains a binary interval Γ_p of length at least one quarter of that of I. Therefore, there is a p with T(p) = x such that 2^{−l(p)} ≥ m(x)/16. This implies K_T(x) ≤ −log m(x) + 4. The proof of the theorem is finished.
Theorem 3 demonstrates a particularly important instance of the two conceptually different, but equivalent, definitions of the semicomputable measures. We analyse this equivalence in some detail. Let P1, P2, ... be the effective enumeration of all semicomputable probability distributions constructed in Theorem 2. Let T1, T2, ... be the standard enumeration of prefix machines. For each prefix machine T, define

Q_T(x) = Σ_{T(p) = x} 2^{−l(p)}.

In other words, Q_T(x) is the probability that T computes output x if its input p is generated by successive tosses of a fair coin. In other words, the inputs p are uniformly distributed, with the probability of p occurring equal to 2^{−l(p)}. It is easy to see that each Q_T satisfies

Σ_x Q_T(x) ≤ 1.

Equality holds iff T halts for all inputs (proper programs). Let Q1, Q2, ... (where we do not require equality to hold) be the probability distributions associated with T1, T2, ....

Claim. There are recursive functions σ, π such that Q_n = Θ(P_{σ(n)}) and P_n = Θ(Q_{π(n)}), for n = 1, 2, ....

Proof. Omitted.
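The coin-flip reading of Q_T(x) lends itself to direct simulation. Below, a toy prefix machine of our own devising (its programs are 0^n 1, a prefix-code, and its output on program 0^n 1 is n) is fed fair coin flips; the empirical frequencies approach Q_T(n) = 2^{−(n+1)}. The trial count and the cutoff on the number of flips are arbitrary choices.

    import random

    def toy_prefix_machine(bits):
        # Toy prefix machine T: its programs are 0^n 1 (a prefix-code); on
        # program 0^n 1 it halts with output n. Returns None if no program
        # is completed within the supplied bits.
        for i, b in enumerate(bits):
            if b == 1:
                return i
        return None

    def estimate_QT(trials=50_000, cutoff=64):
        counts = {}
        for _ in range(trials):
            bits = [random.randint(0, 1) for _ in range(cutoff)]  # coin flips
            x = toy_prefix_machine(bits)
            if x is not None:
                counts[x] = counts.get(x, 0) + 1
        return {x: c / trials for x, c in sorted(counts.items())}

    for x, q in list(estimate_QT().items())[:5]:
        print(f"Q_T({x}) ~ {q:.4f}   (exact 2^-{x+1} = {2 ** -(x + 1):.4f})")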
Remark. The Coding Theorem tells us that there is a constant c > 0 such that |−log P_U(x) − K(x)| ≤ c. We recall from the definition of the Solomonoff-Levin distribution that −log P_U(x) = −log Σ_{U(p) = x} 2^{−l(p)}, and K(x) = min{l(p) : U(p) = x}. A priori, an outcome x may have high probability because it has many long descriptions. But these relations show that in that case it must have a short description too. In other words, the a priori probability of x is governed by the shortest program for x.

Remark. Let P be any probability distribution (not necessarily computable). The P-expected value of m(x)/P(x) is

Σ_x P(x) (m(x)/P(x)) < 1.

We find by Chebychev's First Inequality1) that

Σ{P(x) : m(x) ≤ k P(x)} ≥ 1 − 1/k.   (3)

Since m(x) dominates all semicomputable measures multiplicatively, for all x we have

P(x) ≤ c_P m(x),   (4)

for a fixed positive constant c_P independent of x (but depending on the index of P in the effective enumeration µ1, µ2, ... of semicomputable measures).
Inequalities (3) and (4) have the following consequences:2)

(i) If x is a random sample from a simple computable distribution P(x), then m(x) is a good estimate of P(x).

(ii) If we know or believe that x is random with respect to P, and we know P(x), then we can use P(x) as an estimate of m(x).
1) Recall that Chebychev's First Inequality says the following. Let P be any probability distribution and f any nonnegative function with expected value E_P(f) = Σ_x P(x) f(x) < ∞. For λ > 0 we have Σ{P(x) : f(x) > λ} < λ^{−1} E_P(f). Here we use it with k E_P(f) substituted for λ.

2) We remark, without further explanation, that in both cases the degree of approximation depends on the index of P, and on the randomness of x with respect to P, as measured by the randomness deficiency δ0(x | P) = log(m(x)/P(x)). If δ0(x | P) = O(1) then x is random, otherwise x is not random. For example, for the Uniform Distribution, δ0(x | P) = n − K(x | n) + O(1), where n = l(x). Such a (universal Martin-Löf) test is needed, since otherwise we cannot distinguish, for instance, between randomness and nonrandomness of samples from the uniform distribution. (Clearly, the word Constantinople is not a random 14-letter word. The probability of seeing it somewhere written is decidedly greater than 128^{−14}, say, for a randomly selected fourteen-letter ASCII word.)
2.4.1. Solomonoff’s Inference Procedure and Its Mathematical Justification
We continue Solomonoff's approach to inductive inference, as in [Zvonkin and Levin]. In general, one cannot prove that an inference procedure in statistics is good. This accounts for the many different approaches which are advocated in statistics. In contrast, about Solomonoff's procedure we can rigorously prove that it is good. First, we put the previously developed theory in a continuous setting. Let the sample space S = {0, 1}* ∪ {0, 1}^∞, the set of all finite and one-way infinite binary sequences. Let a cylinder Γ_x = {xy : y ∈ S} be the set of all elements of S that start with x. A function µ from cylinders to the real interval [0, 1] is called a semimeasure if

(a) µ(S) ≤ 1; and
(b) µ(Γ_x) ≥ µ(Γ_{x0}) + µ(Γ_{x1}).

A semimeasure is called a measure if equality holds in (a) and (b). A semimeasure µ is (semi)computable if f(x) = µ(Γ_x) is (semi)computable. Note that f needs to satisfy (a) and (b). It is more convenient, and customary in this area, to simply write µ(x) instead of µ(Γ_x). The problem was that the proper a priori probabilities µ in formula (2) are not known.
We modify the Turing machines in the standard enumeration so that they correspond to the semicomputable semimeasures. A monotonic machine M is a three-tape machine similar to the prefix machine we defined before, but now for all finite (binary) inputs p and q, if p is a prefix of q, then M(q) = M(p)r for some r in S. (For convenience we define M(p) as the contents of the output tape when M reads the next symbol after p. If M doesn't halt then M(p) can be finite or one-way infinite.) Let U be the universal monotonic machine, obtained in the same way as the universal Turing machines and universal prefix machines we have already met.

The universal semicomputable semimeasure is defined as

M(x) = Σ_{U(p) ∈ Γ_x} 2^{−l(p)},

i.e., M(x) is the a priori probability that the output of the reference monotonic machine U starts with x. Just as in the discrete case, one can show that for each semicomputable semimeasure µ,
there exists a constant c such that for all x ∈ {0, 1}*, we have

M(x) ≥ c µ(x).
An alternative approach to defining a priori probability was taken by Cover [Cover, gambling schemes], who defined

Mc(x) = Σ{m(xy) : y ∈ {0, 1}*}.

This function has properties related to those of M.
Solomonoff's Predictor. Instead of using formula (2), we estimate the conditional probability P(xy | x) that the next segment after x is y by the expression

M(xy)/M(x).   (5)

Now let µ in Formula (2) be an arbitrary computable measure. This case includes all computable sequences, as well as many Bernoulli sequences.

Justification. Solomonoff [Solomonoff 1978] showed that the error made by the estimator converges to zero very fast, in the following sense. If µ is the actual prior probability (measure) over the sample space {0, 1}^∞, then we obviously cannot do better in predicting a '0' or '1' after an initial segment x than using the inferred probability

µ(xa)/µ(x), a = 0, 1.

To estimate how much worse it is to use M instead of µ, we consider the difference in inferred probabilities. Let Sn denote the µ-expectation of the squared difference between the µ-inferred probability and the M-inferred probability of '0' occurring as the (n+1)th symbol:

Sn = Σ_{l(x) = n} µ(x) (M(x0)/M(x) − µ(x0)/µ(x))².

Then Σ_n Sn ≤ K(µ)/2. Here, K(µ) is the Kolmogorov complexity of the index i, where Ti is a Turing machine computing µ. Therefore, Sn converges to zero faster than 1/n. In other words, it has been rigorously proved that for the above estimator the expected error at the nth prediction converges to zero faster than 1/n!
This was improved by Gács [P. Gács, personal communication] as follows. If the length of y is fixed, and the length of x grows to infinity, then

(M(xy)/M(x)) / (µ(xy)/µ(x)) → 1,

with µ-probability one. In other words, the conditional a priori probability is almost always asymptotically equal to the conditional probability.

With respect to the discrete sample space approach taken before, one can show that:

−log M(x) = −log Mc(x) = K(x) + O(log K(x)).   (6)
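Although M itself is not computable, the predictor (5) survives any computable restriction of the model class, because it is just a Bayesian mixture. The sketch below is our own computable stand-in: a finite mixture of Bernoulli measures with weights decaying in the index (playing the role of 2^{−K(µ)}), used to predict the next bit via M(x1)/M(x).

    import random

    # A computable stand-in for M: a finite mixture of Bernoulli measures,
    # with weights decaying in the index (playing the role of 2^-K(mu)).
    models = [i / 10 for i in range(1, 10)]              # Bernoulli parameters
    weights = [2.0 ** -(i + 1) for i in range(len(models))]

    def predict_one(x):
        # Predictor (5): estimate P(next bit is 1 | x) by M(x1)/M(x).
        num = den = 0.0
        for p, w in zip(models, weights):
            mu_x = w
            for b in x:                  # w * mu(x) for this Bernoulli model
                mu_x *= p if b == 1 else 1 - p
            den += mu_x                  # accumulates M(x)
            num += mu_x * p              # mu(x1) = mu(x) * p, so this is M(x1)
        return num / den

    random.seed(1)
    truth = 0.3                          # the actual source, one of the models
    x = []
    for n in (10, 100, 1000):
        while len(x) < n:
            x.append(int(random.random() < truth))
        print(f"after {n:4d} bits: predicted P(1) = {predict_one(x):.3f}")

The predictions converge quickly to the true parameter 0.3, in the spirit of the convergence results above.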
2.4.2. Conclusions
On the positive side, we have achieved the following. Bayes' rule using the universal prior distribution gives an objective interpretation to Occam's razor principle. Namely, if several programs could generate S0, then the shortest one is used (for the prior probability), and further, if S0 has a shorter program than S1, then S0 is preferred (i.e., predict 0 with higher probability than 1 after seeing S). Bayes' rule via the universal prior distribution also satisfies Epicurus' multiple explanations dictum, since we do not select a single hypothesis after considering the evidence, but maintain all hypotheses consistent with the evidence and merely transform the probability distribution on the hypotheses according to the evidence. Finally, there is a mathematical proof that Solomonoff's inference procedure using the universal prior probability performs almost as well as the one using the actual (computable) prior probability.

On the negative side, we know that Solomonoff's inference is not practicable in its pure form. The universal prior distributions m(x) for discrete sample spaces and M(x) for continuous sample spaces are not computable, essentially because the Kolmogorov complexity is not computable. However, we can compute approximations to K(x), m(x), and M(x). It turns out that using Solomonoff's inference principles with such computable approximations yields many other known inference models or principles. In the next few sections, we derive or establish connections with various well-known machine learning models and inductive inference paradigms or principles. Thus we provide an alternative view of these models and principles from the lofty perspective of Kolmogorov complexity.
3. Gold’s Inductive Inference Paradigm
There are many different ways of formulating concrete inductive inference problems in the
real world. We will try to simplify matters as much as possible short of losing significance.
(i) The class of rules we consider can be various classes of languages or functions, where we restrict ourselves to classes of recursive sets, context-free languages, regular sets and sets of finite automata, and sets of Boolean formulae. We treat a language L as a function f using its characteristic function, i.e., f(x) = χ_L(x) = 1 if x ∈ L, and 0 otherwise.

(ii) The hypothesis space or rule space, denoted by R, specifies syntactically how each rule in (i) should be represented. We fix a standard enumeration of the representations for a class of rules, R = {R1, R2, ...}, and assume that each rule f has at least one description in the corresponding hypothesis space. For example, the hypothesis space can be standard encodings of context-free grammars, or standard encodings of finite automata. In any case, it is assumed that the hypothesis space is effectively enumerable (so it cannot be the set of all halting Turing machine codes). For convenience, this enumeration of hypotheses R1, R2, ... consists of codes for algorithms to compute recursive functions f1, f2, ... (languages are represented by their characteristic functions).
(iii) The presentation of examples is vital to the inference process. We choose the simplest, and yet most general, form of data presentation. For a function f to be inferred, there is a fixed infinite sequence of examples (s1, f(s1)), (s2, f(s2)), .... When f = χ_L, we have χ_L(s) = 1 if s ∈ L (s is a positive example of L) and χ_L(s) = 0 otherwise (s is a negative example of L).

A rule (or function) f is said to be consistent with the initial segment of examples

S = (s1, a1), ..., (sn, an),   (7)

if f(si) = ai, for i = 1, ..., n. We require that every string eventually appears as the first component of some pair in S. The last assumption is strong, but it is essential to the Gold paradigm.
How to infer a rule. By (ii), there is an effective enumeration f1, f2,... of partial recursive
functions corresponding to the enumeration of hypotheses. The a priori probability of fk is
m(fk) = m(k). (Actually, m(fk) = c m(k), for some constant c depending on the effective enumeration
involved, but not depending on n. To assume that c = 1 makes no difference in the following
discussion.) We are given an infinite sequence of examples representing the rule or function f to
be learned. According to Bayes’ rule (1), for k = 1, 2,..., the inferred probability of fk after the
sequence of examples (7) is given by:
P(fk = f | f(si) = ai, i = 1,...,n)
    = P(f(si) = ai, i = 1,...,n | fk = f) m(k) / Σ { m(j) : fj(si) = ai, i = 1,...,n }.   (8)
Cf. also [Cover 1974] [Cover, Impact]. In the numerator of the right-hand term, the first
factor is zero or one depending on whether fk is consistent with S, and the second factor is the a
priori probability of fk. The denominator is a normalizing term giving the combined a priori probability
of all rules consistent with S. With increasing n, the denominator term is monotonically
nonincreasing. Since all examples eventually appear in S, the denominator converges to a limit,
say d ≤ 1. For each k, the inferred probability of fk is monotonically nondecreasing with increasing
n, until fk is inconsistent with a new example, in which case it falls to zero and stays there henceforth.
In the limit, only the fk’s that are consistent with the sequence of presented examples
have positive inferred probability m(k)/d. By Theorem 3, since m(k) = Θ(2^{-K(k)}), the highest
inferred probability is carried by the rule fk with least Kolmogorov complexity among the
remaining ones. Similar statements hold after each initial segment of n examples, n = 1,2,....
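For illustration only, the following sketch performs the update of Formula (8) with the noncomputable m(k) replaced by a computable, summable stand-in (weight 2^{-2 l(k)} for index k, echoing m(k) = Θ(2^{-K(k)})); the toy enumeration f_k and this prior are our own choices, not the paper's:

    from fractions import Fraction

    # Toy effective enumeration f_1, f_2, ... of rules (f_k(x) = 1 iff k
    # divides x), standing in for an enumeration of recursive functions.
    def f(k):
        return lambda x: 1 if x % k == 0 else 0

    def prior(k):
        # Computable, summable stand-in for m(k): weight 2^{-2 l(k)},
        # where l(k) is the binary length of k.
        return Fraction(1, 4 ** k.bit_length())

    def posterior(S, kmax=64):
        # Formula (8): P(f_k | S) is prior(k), normalized over the
        # hypotheses consistent with S (the first factor is 0 or 1).
        consistent = [k for k in range(1, kmax + 1)
                      if all(f(k)(s) == a for s, a in S)]
        d = sum(prior(j) for j in consistent)
        return {k: prior(k) / d for k in consistent}

    print(posterior([(6, 1), (4, 0)]))          # {3: Fraction(4, 5), 6: Fraction(1, 5)}
    print(posterior([(6, 1), (4, 0), (9, 0)]))  # {6: Fraction(1, 1)}; k = 3 is eliminated

The smaller (lower-complexity proxy) index k = 3 carries the larger posterior until an example refutes it, after which its probability falls to zero and stays there, as described above.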
Reasoning inductively, we transform the a priori probability according to Formula (8),
inferring a new posterior probability from the evidence of each initial segment of examples. At each
step we predict (select) the rule with the highest inferred probability, and in the limit we have selected
the proper rule. Reformulating, if we want to infer a language L using this procedure, then:
(a) The Bayesian a posteriori probability of the correct answer converges to c 2^{-l(p)}/d,
where p is the shortest program by which the reference machine simulates M0, the smallest TM
that accepts L. This correct answer will have the highest probability in the limit. That is, the
inferred probability distribution over the underlying machines converges, in the limit, to one in
which M0 carries the highest probability. In other words, after n steps, for some n, all machines
smaller than M0 violate some data pair in S, and M0 is the choice forever after step n.
(b) It is interesting to notice that the a posteriori probability of the current most probable
hypothesis decreases monotonically until it converges to c 2^{-l(p)}/d, for p the program with
which U simulates M0: smaller machines are chosen first and then canceled because they violate
some data.
Prediction by extrapolation. If we want to infer f(s), rather than f, given the sequence of
examples S, then using Formulas (2) and (5), the inferred probability that f(s) = a is (denoting a
string s1 s2 ... sn as s1:n):

P(f(s) = a | f(si) = ai, i = 1,...,n)
    = Σ { m(j) : fj(si) = ai, i = 1,...,n, fj(s) = a } / Σ { m(j) : fj(si) = ai, i = 1,...,n }.   (9)
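Continuing the same sketch (and reusing f and prior from it), Formula (9) is a ratio of a priori mass:

    def predict(S, s, a, kmax=64):
        # Formula (9): the inferred probability that f(s) = a is the fraction
        # of the a priori mass of the S-consistent hypotheses that map s to a.
        consistent = [k for k in range(1, kmax + 1)
                      if all(f(k)(x) == y for x, y in S)]
        num = sum(prior(j) for j in consistent if f(j)(s) == a)
        return num / sum(prior(j) for j in consistent)

    print(predict([(6, 1), (4, 0)], 9, 1))   # 4/5: exactly the posterior weight of k = 3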
The Gold paradigm of inductive inference, in the sense originally studied by Gold in
[Gold 1967] [Gold 1978], can be viewed simply as a computable approximation
to Equation (8). The fundamental idea of the Gold paradigm is
identification in the limit, and a universal method of implementing identification in the limit is
called ‘identification by enumeration’. These are contained in facts (a) and (b), as a computable
analogue of Solomonoff’s approach. We now investigate the correspondence between these two
basic ideas in some detail.
Identification in the Limit views inductive inference as an infinite process. Formally, let M
be an inductive inference method that tries to derive some unknown rule R. If M receives a larger
and larger set of examples (a bigger and bigger initial segment S), a sequence of
M’s conjectures is generated, say f1, f2, f3, .... If there is some integer m such that fm is a
correct description of R and fn = fm for all n > m, then M identifies R in the limit. Two facts
deserve mention: M cannot determine whether it has converged, and therefore it cannot stop and
announce a correct hypothesis; and M may be viewed as learning more and more information
about the unknown rule R, monotonically improving its approximation
to R until the correct identification. Gold gave the best motivation for his definition:
I wish to construct a precise model for the intuitive notion "able to speak a language" in order to be
able to investigate theoretically how it can be achieved artificially. Since we cannot write down the rules of
English which we require one to know before we say he can "speak English", an artificial intelligence
which is designed to speak English will have to learn its rules from implicit information. That is, its information
will consist of examples of the use of English and/or of an informant who can state whether a given
usage satisfies certain rules of English, but cannot state these rules explicitly.
. . . A person does not know when he is speaking a language correctly; there is always the possibility
that he will find that his grammar contains an error. But we can guarantee that a child will eventually
learn a natural language, even if it will not know when it is correct.
Identification by enumeration is a method to implement identification in the limit. It
refers to the following guessing rule: Enumerate the class of rules in rule space. At step t, guess
the unknown rule to be the first rule of the enumeration which agrees with data received so far.
Formally speaking, in our setting: if we have received an initial segment S then, given s, we predict
(s, f(s)) as the next example, where f is the first rule in the enumeration that is consistent with S.
If this can be done effectively, identification in the limit will be achieved.
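A minimal sketch of this guessing rule, under the assumption of a finite prefix of the enumeration and rules given as predicates (both our own simplifications):

    def identify_by_enumeration(examples, rules):
        # Gold's guessing rule: after each new example, conjecture the first
        # rule in the (fixed) enumeration consistent with all data so far.
        S = []
        for s, a in examples:
            S.append((s, a))
            yield next(i for i, r in enumerate(rules)
                       if all(r(x) == y for x, y in S))

    # Toy enumeration: rules[i](x) = 1 iff x < i; the data below come from
    # rules[3].  The conjectures are 1, 1, 3, 3: identification in the limit,
    # with earlier (smaller) rules tried and discarded first.
    rules = [lambda x, i=i: 1 if x < i else 0 for i in range(10)]
    print(list(identify_by_enumeration([(0, 1), (5, 0), (2, 1), (3, 0)], rules)))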
We say an induction method identifies
a rule correctly in k steps if it never produces a wrong hypothesis from step
k onward. Let G and G′ be two guessing methods. G is said to be uniformly faster than G′ if the
following two conditions hold: (1) given any R from the rule space, G identifies R correctly at
least as soon as G′, measured in the number of examples needed, for all sequences of examples;
and (2) for some R and some sequence of examples, G identifies R sooner than G′. Say a guessing
method G is optimal if for every other guessing method G′ there is a constant c such that: if R
appears as the ith rule in the enumeration of G′, then it appears no later than position ci in the
enumeration of G. It is easy to prove that the identification-by-enumeration method will identify a
hypothesis in the limit if this hypothesis can be identified in the limit at all. Further, if G0 is an
identification-by-enumeration guessing rule, then there is no guessing rule uniformly faster than
G0. But only the Solomonoff procedure is optimal. Indeed,
Theorem 4. (a) Identification-by-enumeration is a computable approximation to inductive
inference (Solomonoff’s inference) associated with Formula (8). (b) Neither method is uniformly
faster than the other. (c) The Solomonoff procedure is optimal, while identification-by-enumeration is
not.
Proof. (a) An effective enumeration for the identification-by-enumeration method can be
viewed as a computable approximation to Solomonoff’s procedure according to Formula (8), as
follows. Let the effective enumeration of the rule space be R1, R2, R3, .... Convert this to an
effective prefix-free description of each rule Ri in the rule space. For instance, if x = x1 ... xn is a
binary string, then x̄ = x1 0 x2 0 ... 0 xn 1 is a prefix-code word for x. Similarly, x′, consisting of the
bar-code of l(x) followed by x itself, is a prefix-code word; note that l(x′) = l(x) + 2 log l(x). We
encode each rule Ri (a binary string) as p i′, where p is a (prefix) program that enumerates the rule
space. The resulting code for the Ri’s is an effective prefix-code. Denoting the length of the
description of Ri by |Ri|, we have:
(i) if i < j, then |Ri| ≤ |Rj|.
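A small sketch illustrating these prefix codes (the helper names bar, prime, and code_rule, and the binary-string representation, are ours):

    def bar(x):
        # x1 0 x2 0 ... 0 xn 1: each bit of x followed by 0, the last by 1;
        # a prefix-code word of length 2 l(x).
        return ''.join(b + '0' for b in x[:-1]) + x[-1] + '1'

    def prime(x):
        # x' = bar(l(x)) x: the bar-coded binary length of x, then x itself;
        # l(x') = l(x) + 2 log l(x), up to O(1).
        return bar(bin(len(x))[2:]) + x

    def code_rule(p, i):
        # Code word p i' for rule R_i: the (prefix) enumerating program p,
        # assumed itself prefix-free, followed by the prefix code of index i.
        return p + prime(bin(i)[2:])

    x = '10110'
    assert len(prime(x)) == len(x) + 2 * len(bin(len(x))[2:])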
[...]

0. Set DNF g := ∅. (g is the DNF we will eventually output as an approximation of f.)
1. Pick a positive example a = (a1, ..., an). Form the monotone term m that includes xi iff
ai = 1.
2. For each positive example a = (a1, ..., an) do: if ai = 0 and deleting xi from m causes no
negative example to satisfy m, then delete xi from m.
3. Remove from the sample all positive examples which are implied by m. Set g := g + m. If
there are still positive examples left, then go to step 1, else halt and return g.
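Here is a runnable sketch of steps 0-3, with our own representation choices (examples as 0/1 tuples, a term as the set of indices of its variables); it assumes, as the correctness argument below does, that the sample is rich enough:

    def learn_monotone_dnf(positives, negatives, n):
        # Greedy covering of steps 0-3.  An example e satisfies a term t iff
        # e[i] = 1 for every variable index i in t.
        def satisfies(e, t):
            return all(e[i] == 1 for i in t)

        g = []                                  # step 0: g := empty DNF
        remaining = list(positives)
        while remaining:
            a = remaining[0]                    # step 1: pick a positive example
            m = {i for i in range(n) if a[i] == 1}
            for b in remaining:                 # step 2: generalize m against
                for i in sorted(m):             # the other positive examples
                    if b[i] == 0 and not any(
                            satisfies(e, m - {i}) for e in negatives):
                        m = m - {i}
            g.append(frozenset(m))              # step 3: g := g + m, and drop
            remaining = [e for e in remaining   # the positives m implies
                         if not satisfies(e, m)]
        return g

    # Target f = x1 + x2 x3 (term {0} and term {1, 2}) over n = 3 variables:
    pos = [(1, 0, 0), (0, 1, 1), (1, 1, 0)]
    neg = [(0, 1, 0), (0, 0, 1), (0, 0, 0)]
    print(learn_monotone_dnf(pos, neg, 3))      # [frozenset({0}), frozenset({1, 2})]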
We show that the algorithm is correct. Let us write mi ⊆ m, for two monotone monomials, if
all the variables that appear in mi also appear in m. At step 1, the monomial m implies
no negative example, since for some monomial mi of f we must have mi ⊆ m (the positive example
a satisfies some term mi of f, and m contains every variable that is 1 in a). Step 2 of the algorithm
keeps deleting variables from m. If at any time mi ⊆ m were to hold for no monomial mi of f, then
there would exist a negative example, containing at most k 0’s, that satisfies m but no mi of f.
Such a negative example has Kolmogorov complexity at most k log n, hence by the Chernoff formulae
(Section 2.1) it is contained in the sample with high probability. Hence at step 2, with
high probability, there always remains an mi such that mi ⊆ m, and we eventually find some mi of f
precisely, with high probability. Then at step 3, we remove the positive examples implied by
this mi and continue to find another term of f. By standard calculations, the algorithm eventually
outputs g = f with high probability.
Remark. Notice that this is not an approximation algorithm like the one in [Li 1989] for
learning log n-DNF: this algorithm outputs the precise monotone formula, with high probability.
7.4. Continuous Sample Space
Second, we deal with continuous sample spaces. For example, the uniform distribution is now
defined by λ(Γx) = 2^{-l(x)}, where Γx denotes the set of all one-way infinite binary strings starting
with x. This is the Lebesgue measure on the interval [0,1]. While for discrete sample spaces all concept
classes are Valiant learnable (although not all are polynomially learnable), this is not the
case for continuous sample spaces. We can define the notion of a ‘simple’ semimeasure, and that of a
universal semicomputable semimeasure, over a continuous sample space, and show that all concept
classes are learnable under each simple semimeasure D iff they are learnable under the
universal semimeasure. In contrast with the discrete case w.r.t. polynomial learning, here we do
not need to require that the learning algorithm samples according to the universal measure. For
details, see [Li 1989].
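As a tiny illustration of this uniform measure (a sketch; the function name is ours), the cylinder Γx has measure 2^{-l(x)}, and the measures of its two one-bit extensions split it evenly:

    from fractions import Fraction

    def cylinder_measure(x):
        # lambda(Gamma_x) = 2^{-l(x)}: the uniform (Lebesgue) measure of the
        # set Gamma_x of one-way infinite binary sequences starting with x.
        return Fraction(1, 2 ** len(x))

    x = '101'
    assert cylinder_measure(x) == cylinder_measure(x + '0') + cylinder_measure(x + '1')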
8. Acknowledgement
We are grateful to Ray Solomonoff, Leonid Levin, and Peter Gács for conversations about algorithmic complexity
and inductive reasoning. We thank Mati Wax for useful discussions on the MDL principle; Leslie
Valiant for suggesting handwritten character recognition as a viable application; and Qiong Gao for participating
in the realization of it. Fahiem Bacchus, Danny Krizanc and John Tromp read and commented on
the manuscript.