Inductive Reasoning and Kolmogorov Complexity

Ming Li*
University of Waterloo, Department of Computer Science
Waterloo, Ontario N2L 3G1, Canada

Paul M.B. Vitányi
Centrum voor Wiskunde en Informatica, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands
and Universiteit van Amsterdam, Faculteit Wiskunde en Informatica

* The work of the first author was supported in part by National Science Foundation Grant DCR-8606366, Office of Naval Research Grant N00014-85-k-0445, Army Research Office Grant DAAL03-86-K-0171, and NSERC Operating Grant OGP0036747. Part of the work was performed while he was at the Department of Computer Science, York University, North York, Ontario, Canada. A preliminary form of this paper appeared in Proc. 4th IEEE Structure in Complexity Theory Conference, 1989.

ABSTRACT

This is a sloppy first draft of [J. Comp. System Sciences, 44:2(1992), 343-384]. Also, there are some problems with the pictures and references due to the obsolete troff processing.

Reasoning to obtain the 'truth' about reality from external data is an important, controversial, and complicated issue in man's effort to understand nature. (Yet, today, we try to make machines do this.) There have been old useful principles, new exciting models, and intricate theories scattered over vastly different areas, including philosophy of science, statistics, computer science, and psychology. We focus on inductive reasoning in correspondence with ideas of R.J. Solomonoff. While his proposals result in perfect procedures, they involve the noncomputable notion of Kolmogorov complexity. In this paper we develop the thesis that Solomonoff's method is fundamental in the sense that many other induction principles can be viewed as particular ways to obtain computable approximations to it. We demonstrate this explicitly in the cases of Gold's paradigm for inductive inference, Rissanen's minimum description length (MDL) principle, Fisher's maximum likelihood principle, and Jaynes' maximum entropy principle. We present several new theorems and derivations to this effect. We also delimit what can be learned and what cannot be learned in terms of Kolmogorov complexity, and we describe an experiment in machine learning of handwritten characters. We also give an application of Kolmogorov complexity in Valiant-style learning, where we want to learn a concept probably approximately correct, in feasible time and from feasibly many examples.

    The eye of the understanding is like the eye of the sense; for as you may see great objects through small crannies or levels, so you may see great axioms of nature through small and contemptible instances. [Francis Bacon, Sylva Sylvarum 337, 1627]

1. A Historical View of Inductive Reasoning

The Oxford English Dictionary gives as the meaning of induction: the process of inferring a general law or principle from the observations of particular instances. This defines precisely what we would like to call inductive inference. On the other hand, we regard inductive reasoning as a more general concept than inductive inference, namely as a process of re-assigning a probability (or credibility) to a law or proposition from the observation of particular instances. In other words, in the way we use the notions, inductive inference draws conclusions that consist in accepting or rejecting a proposition, while inductive reasoning only changes the degree of our belief in a proposition. The former is a special case of the latter. In this paper we discuss inductive reasoning in correspondence with R.J. Solomonoff's ideas as expressed in e.g. [Solomonoff 1964]. However, Solomonoff's procedure is not effective, since it involves the noncomputable Kolmogorov complexity of objects. We shall show, however, that there is considerable structure in many different approaches proposed for induction, since they can be variously derived as computable approximations to Solomonoff's method.
The history of inductive inference, which is as old as empirical science itself, dates back at least to the Greek philosopher of science Epicurus (342? - 270? B.C.). To reason by induction is nothing but to learn from experience. As the sun rises day by day, our belief that the sun will rise tomorrow increases, and we eventually infer the truth that the sun will rise every morning. As human history evolves, man tries to understand and explain the events that happen around him: this takes the form of different induction methods to formulate scientific theories from positive and negative, fortunate and unfortunate, lucky and unlucky, happy and miserable experiences. Two metaphysical principles stand out and prevail today: Epicurus' principle of multiple explanations (or indifference) and Occam's principle of simplest explanation (Occam's razor).

The Principle of Multiple Explanations: If more than one theory is consistent with the data, keep them all.

The source of the following material is [Epicurus]. Epicurus, in his Letter to Pythocles, explains that: There are cases, especially of events in the heavens such as the risings and settings of heavenly bodies and eclipses, where it is sufficient for our happiness that several explanations be discovered. In these cases, the events 'have multiple causes of coming into being and a multiple predication of what exists, in agreement with the perceptions.' Epicurus maintains that, if several explanations are in agreement with the (heavenly) phenomena, then we must keep all of them for two reasons. Firstly, the degree of precision achieved by multiple explanations is sufficient for human happiness. Secondly, it would be unscientific to prefer one explanation to another when both are equally in agreement with the phenomena. This, he claims, would be to 'abandon physical inquiry and resort to myth.' His follower Lucretius (95 - 55 B.C.) illustrates the inevitability of the use of the multiple explanation principle by the following example: ''There are also some things for which it is not enough to state a single cause, but several, of which one, however, is the case. Just as if you were to see the lifeless corpse of a man lying far away, it would be fitting to state all the causes of death in order that the single cause of this death may be stated. For you would not be able to establish conclusively that he died by the sword or of cold or of illness or perhaps by poison, but we know that there is something of this kind that happened to him.''

Based on the same intuition, in the calculus of probabilities it has been customary to postulate the 'principle of indifference' or the 'principle of insufficient reason'. When there is no other evidence, because of the absolute lack of knowledge concerning the conditions under which a die falls, we have no reason to assume that a certain face has higher probability of turning up. Hence we assume that each side of the die has the probability 1/6.
The principle of indifference considers events to be equally probable if we have not the slightest knowledge of the conditions under which each of them is going to occur. For the case of a die, this actually coincides with the so-called 'maximum entropy principle', which we will discuss later, and which states that we should choose the probabilities pi of face i being the outcome of a trial, i = 1, 2, ..., 6, such that

−Σ_{i=1}^{6} pi ln pi

is maximized under the only constraint Σ_{i=1}^{6} pi = 1. We obtain precisely pi = 1/6 for i = 1, 2, ..., 6.

The second and more sophisticated principle is the celebrated Occam's razor principle, commonly attributed to William of Ockham (1290? - 1349?). It enters the scene about 1500 years after Epicurus. In sharp contrast to the principle of multiple explanations, it states:

Occam's Razor Principle: Entities should not be multiplied beyond necessity.

This is generally interpreted as: Among the several theories that are all consistent with the observed phenomena, one should pick the simplest theory. (According to Bertrand Russell, the actual phrase used by Ockham was: 'It is vain to do with more what can be done with fewer.') Surely Occam's razor principle is easily understood from a 'utilitarian' point of view: if both theories explain the same set of facts, why not use the simpler one?! However, things become more intricate when we want to know whether a simpler theory is really better than the more complicated one. This also raises another question, which has been a bone of contention in philosophy ever since the razor's inception: what is the proper measure of simplicity? Is x^100 + 1 more complicated than ax^17 + bx^2 + cx + d? E.g., the distinguished contemporary philosopher K. Popper pronounced that on such grounds the razor is without sense. However, it is interesting to notice that the principle can be given objective contents, and it has recently been applied very successfully, in many different forms, in computational learning theory. To explain this, let us consider an over-simplified example of inferring a finite automaton with one-letter input using Occam's razor principle.

Accepted inputs: 0, 000, 00000, 000000000;
Rejected inputs: ε, 00, 000000.

For these data there exist many consistent finite automata. Figure 1 shows the trivial automaton and Figure 2 shows the smallest automaton, where S denotes the starting state and darker states are accepting states.

[Figure 1: A trivial automaton: a chain of states starting at S that simply encodes the data.]

[Figure 2: The smallest automaton: two states, accepting exactly the strings with an odd number of 0's.]

Intuitively, the automaton in Figure 1 just encodes the data plainly; we therefore do not expect that machine to anticipate future data. On the other hand, the second machine makes a plausible inference that the language accepted consists of the strings with an odd number of 0's. The latter appeals to our intuition as a reasonable inference. However, a too simplistic application of Occam's razor principle may also lead to nonsense, as the following story illustrates.

Once upon a time, there was a little girl named Emma. Emma had never eaten a banana, nor had she been on a train. One day she went for a journey from New York to Pittsburgh by train. To relieve Emma's anxiety, her mother gave her a large bag of bananas. At Emma's first bite of a banana, the train plunged into a tunnel. At the second bite, the train broke into daylight again. At the third bite, Lo! into a tunnel; the fourth bite, La! into daylight again. And so on all the way to Pittsburgh and to the bottom of her bag of bananas.
Our bright little Emma (applying Occam's razor principle?) told her grandpa: 'Every odd bite of a banana makes you blind; every even bite puts things right again.' [After N.R. Hanson, 'Perception and Discovery', Freeman, Cooper & Co, 1969, p.359.]

Let us consider how the idea of 'simplicity' affects a scientist's thinking. We refer to a beautiful study of simplicity by Kemeny [Kemeny]. Initially, there were no new facts that failed to be explained by the Special Theory of Relativity. The incentive of Albert Einstein to invent the General Theory of Relativity was his conviction that the Special Theory was not the simplest theory that could explain all the observed facts. Reducing the number of independent variables obviously simplifies a theory. By the requirement of general covariance, Einstein succeeded in replacing the previously independent 'gravitational mass' and 'inertial mass' by a single concept.

In spite of the apparently universal acceptance of Occam's razor, consciously or unconsciously, the concept of simplicity remains highly controversial. Generally speaking, it has remained a crude, non-precise idea. Things are subtler than they appear. Is the following formulation precise?

Occam's Razor Rule: Select a hypothesis which is as well in agreement with the observed value as possible; if there is any choice left, choose the simplest possible hypothesis.

Example. Consider the problem of fitting n points by a polynomial. The above rule tells us to choose the polynomial of lowest degree passing through all n points. But due to measurement precision and possible noise, whatever the degree of the polynomial from which the points originated, we will end up with a polynomial of degree n − 1 which fits the data precisely. But this polynomial most likely does not help us to predict future points.

Example. Consider another example given by Kemeny. Let there be an unknown number of white balls and black balls in a sealed urn. Through an opening you randomly pick one ball at a time, note its color, replace it, and shake the urn thoroughly. After n draws you must decide what fraction of the balls in the urn is white. The possible hypotheses state that some rational fraction r of balls in the urn is white, where 0 ≤ r ≤ 1. By the above rule, if in n draws m white balls are selected, then we should formulate the hypothesis r = m/n. Let there be 1/3 white and 2/3 black balls. Then the probability of getting the true hypothesis r = 1/3, i.e., m = n/3, is zero if n is not divisible by 3, and it tends to zero even under the assumption that n is always divisible by 3. On the other hand, the probability of obtaining a hypothesis r with 1/3 − ε ≤ r ≤ 1/3 + ε, for any fixed ε > 0, tends to 1 exponentially fast, by the so-called Chernoff formula. (For Chernoff's formula see e.g. [Angluin and Valiant].) A small simulation illustrating both effects follows below. Even when the process converges, n may be too large for practical use.

Kemeny's Rule: Select the simplest hypothesis compatible with the observed values. Here 'compatible' is defined as follows: the hypothesis Hi is compatible with data D if, assuming the truth of Hi, there was at most one percent chance of getting a deviation as great as m(Hi, D), for some measure function m. This is related to Valiant's learning theory, to be discussed later. But how does one define simplicity? Is 1/4 simpler than 1/10? Is 1/3 simpler than 2/3? Saying that an urn contains a 1/3rd part white balls comes down to the same thing as saying that it contains a 2/3rd part black balls.
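The convergence behaviour in the urn example is easy to check empirically. The following small Python simulation (our own illustration; the argument above does not depend on it) draws with replacement from an urn in which 1/3 of the balls are white, and estimates both the probability that the sample fraction is exactly 1/3 and the probability that it lies within ε = 0.05 of 1/3.

    import random

    def white_draws(n, p=1/3):
        # Number of white balls seen in n draws with replacement.
        return sum(random.random() < p for _ in range(n))

    def estimate(n, eps=0.05, trials=2000, p=1/3):
        exact = within = 0
        for _ in range(trials):
            m = white_draws(n, p)
            exact += (3 * m == n)                # sample fraction is exactly 1/3
            within += (abs(m / n - p) <= eps)    # sample fraction within eps of 1/3
        return exact / trials, within / trials

    for n in (30, 300, 3000):
        ex, wi = estimate(n)
        print(f"n={n:5d}: P[m/n = 1/3] ~ {ex:.3f}   P[|m/n - 1/3| <= 0.05] ~ {wi:.3f}")

As n grows, the first estimate tends to zero while the second tends to one, in accordance with Chernoff's formula.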
Kemeny warned: 'do not use more precision in your theories than is necessary.' But what is necessary and what is not? All these issues are very subjective. Does a simple theory generate a hypothesis which is good for predicting future outcomes? How do we achieve fast convergence? How does one trade between 'simplicity' and 'truth' ('compatibility')? Kemeny actually asked for 'a criterion combining an optimum of simplicity and compatibility' [crediting Nelson Goodman for this suggestion].

1.1. Combining Epicurus, Ockham, and Bayes

The study of inductive reasoning predates artificial intelligence and computer science by more than 2000 years. There is a tremendous amount of literature in many different fields, under diverging terminologies. Our goal is to extract a common core of simple ideas underlying all these approaches, in the spirit of Occam's razor principle. We will start with Bayesian inference theory. To apply Bayesian-type reasoning one has to assign a priori probabilities (prior probabilities) to each possible hypothesis. Since the posthumous publication in 1763 of Th. Bayes' (?? - 1761) famous memoir 'An essay towards solving a problem in the doctrine of chances' by his friend Richard Price [Bayes, Essay], there has been continuous bitter debate about the controversial prior probability in the Bayesian formula. Kolmogorov complexity was invented by its first inventor, R. Solomonoff, as an auxiliary notion to resolve this particular problem. Namely, using Kolmogorov complexity he found a single 'universal' prior distribution which can be substituted for any particular actually valid distribution (as long as it is computable) in Bayes' formula, with approximately as good results as if the actually valid distribution had been used. It sounds like magic, but Solomonoff's approach does give a more or less satisfactory solution to this seemingly unattainable objective.

The elegant idea of a universal prior is a combination of Occam's razor and modern computability theory. However, the universal prior is uncomputable, since it involves Kolmogorov complexity. In this paper we develop the thesis that many theories, models, and principles for inductive reasoning, formulated both before and after Solomonoff's inception, can be rigorously derived as particular computable approximations to it. We first describe the basics of Bayesian theory and how to apply Kolmogorov complexity to obtain the universal prior probability distribution. We then derive the Gold paradigm and its principles. We derive a form of Rissanen's Minimum Description Length (MDL) principle. From the MDL principle Rissanen derives Fisher's maximum likelihood principle and Jaynes' maximum entropy principle. This paper contains a review of all these theories and principles. It has been our experience that some experts say the connections as claimed are obvious, while some other experts deny those connections exist. Thus, since the proof of the pudding is in the eating, we explicitly establish the derivations together with the appropriate related theorems. We also describe an experiment we have performed in machine learning of recognition of handwritten characters using the MDL principle. Combining Gold-style inference with ideas from computational complexity theory leads to Valiant's model of learning. We give an application of the universal prior distribution to obtain a theory of learning simple concepts under simple distributions.
A more extensive treatment of this material will be given in our forthcoming textbook [Li and Vitányi, An Introduction to Kolmogorov Complexity and Its Applications].

2. The Universal Prior Distribution

2.1. Bayesian Inference

In the following discussion of probability we assume the usual so-called Kolmogorov Axioms, see e.g. [Feller, An Introduction to Probability Theory and Its Applications]. For our purpose we need the following. We have a hypothesis space, H = {H1, H2, ...}, which consists of a countable set of hypotheses that are mutually exclusive (in the sense that at most one of them is right) and exhaustive (in the sense that at least one of them is right). With each hypothesis Hi we associate a probability P(Hi) such that Σ_i P(Hi) = 1. The learner is supplied with some data D, providing information about which hypothesis is correct. From the definition of conditional probability, i.e., P(A | B) = P(A ∩ B)/P(B), it is easy to derive Bayes' formula (rewrite P(A ∩ B) in the two possible different ways, equate the two expressions, and set A = Hi and B = D):

P(Hi | D) = P(D | Hi) P(Hi) / P(D),   (1)

where P(D) = Σ_i P(D | Hi) P(Hi). We interpret the different variables in the formula as follows. The Hi's represent the possible alternative hypotheses concerning the phenomenon we wish to discover. The term D represents the empirically or otherwise known data concerning this phenomenon. The term P(D), the probability of data D, can be considered as a normalizing factor so that Σ_i P(Hi | D) = 1. The term P(Hi) is called the prior probability or initial probability of hypothesis Hi, i.e., the probability that Hi is true before we have seen any evidence. The term P(Hi | D) is called the final, a posteriori, or inferred probability, which reflects the probability of Hi modified from the prior probability P(Hi) after seeing the data D. The term P(D | Hi), the conditional probability of seeing D when Hi is true, is assumed to be computable from D and Hi.

In many learning situations, data can only be consistent with a hypothesis Hi in the sense of being forced by it, so that P(D | Hi) = 1. If the data are inconsistent with hypothesis Hi then P(D | Hi) = 0. In such a situation, the data either are determined by a hypothesis, or disqualify it. (We assume there is no noise that distorts the data.) For example, this is the case when the hypotheses name a language L and the data are assertions of the form x ∈ L or x ∉ L.

The most interesting term is the prior probability P(Hi). In the context of machine learning, P(Hi) is often considered as the learner's initial degree of belief in hypothesis Hi. In essence, Bayes' rule is a mapping from the a priori probability P(Hi) to the a posteriori probability P(Hi | D), where the mapping is determined by data D. In general, the problem is not so much that in the limit the inferred probability would not 'condense' on the 'true' hypothesis, but that the inferred probability should give as much information as possible about the possible hypotheses from only a limited number of data; cf. the example below. In fact, the continuing bitter debate between the Bayesian and non-Bayesian opinions has centered on the prior probability. The controversy is caused by the fact that Bayesian theory does not say how to initially derive the prior probabilities for the hypotheses; Bayes' rule only says how they are to be updated. However, in each actual case the prior probabilities may be unknown, uncomputable, or conceivably nonexistent. (What is the prior probability of use of words in written English? There are many different sources of different social backgrounds living in different ages.)
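Since formula (1) is purely mechanical once the prior is fixed, it is easy to trace on a toy hypothesis space. The following Python sketch is our own illustration: the hypotheses (coins of different biases) and the uniform prior are assumptions chosen only to show how the data D reshape the prior P(Hi) into the posterior P(Hi | D).

    # Hypotheses Hi: a coin has bias p_i. Data D: an observed sequence of flips.
    biases = [0.1, 0.3, 0.5, 0.7, 0.9]        # hypothesis space (assumed)
    prior = [1 / len(biases)] * len(biases)   # P(Hi): uniform prior (assumed)

    def likelihood(p, data):
        # P(D | Hi) for independent flips; data is a string such as '11010'.
        result = 1.0
        for bit in data:
            result *= p if bit == '1' else 1 - p
        return result

    def posterior(prior, data):
        # Bayes' formula (1): P(Hi | D) = P(D | Hi) P(Hi) / P(D).
        joint = [likelihood(p, data) * w for p, w in zip(biases, prior)]
        p_d = sum(joint)                      # P(D), the normalizing factor
        return [j / p_d for j in joint]

    for data in ('1', '1101', '110111011101'):
        print(data, ['%.3f' % q for q in posterior(prior, data)])

With more data, the posterior condenses on the bias closest to the observed frequency of 1's.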
This problem is solved if we can find a single probability distribution to use as the prior distribution in each different case, with approximately the same result as if we had used the real distribution. Surprisingly, this turns out to be possible up to some mild restrictions.

Example. We use an example of von Mises [von Mises, Probability, Statistics and Truth]. Let an urn contain many dice, each with a different attribute. A die with attribute p has probability p of showing 6 in a random throw. For convenience, assume the attribute set A is finite, and the difference between each pair of attributes is greater than 2ε. We randomly draw a die from the urn; our task is to determine its attribute. We do this by experimenting: throw the die n times independently. If 6 shows up m times, we choose the attribute that is nearest to m/n. Let Hp be the event of drawing a die with attribute p from the urn. Let Dq be the experimental data such that m successes (6's) were observed out of n throws and |q − m/n| < ε, for q ∈ A. So, by formula (1)*,

P(Hp | Dq) = P(Dq | Hp) P(Hp) / P(Dq),

where P(Dq) = Σ_p P(Dq | Hp) P(Hp). According to Chernoff's formula (see [Angluin and Valiant]), for 0 < δ < 1 we have:

P(m − np > δnp | Hp) < e^{−δ²np/3},
P(np − m > δnp | Hp) < e^{−δ²np/2}.

Hence, if p is the true attribute of the die we have drawn then, choosing δ = ε/2p (so that |p − m/n| ≥ ε implies |m − np| > δnp), P(Dq | Hp) goes to 0 exponentially fast as the number of experiments increases, for q ≠ p, and P(Dp | Hp) goes to 1 at the same rate. Hence P(Hp | Dp) goes to 1. Thus we derive the correct attribute of the die with high probability, using a polynomial number of throws. The interesting point is that if the number of trials is small, then the inferred probability P(Hp | Dq) may depend heavily on the prior probability P(Hp). However, if n is large, then irrespective of the prior distribution, the inferred probability condenses more and more around m/n.

Example. We explain a simplified version of Solomonoff's theory of inductive inference. The simplification lies in that we, for the moment, consider only a discrete sample space like {0, 1}*, the set of all finite binary sequences, rather than {0, 1}^∞, the set of all one-way infinite binary sequences. We view theory formation in science as the process of obtaining a compact description of the past observations together with predictions of future ones. The investigator observes increasingly larger initial segments of an infinite binary sequence as the outcome of an infinite sequence of experiments on some aspect X of nature. To describe the underlying regularity of this sequence, the investigator tries to formulate a theory that governs X, on the basis of the outcome of past experiments.

* Properly speaking, formula (1) is not due to Bayes, but to P.S. Laplace (1749 - 1827), who stated the formula and attached Bayes' name to it [Laplace]. Actually, Bayes in his original paper [Bayes, Essay] assumed the uniform distribution for the a priori probability, hence he essentially derived P(Hi | D) = P(D | Hi)/Σ_i P(D | Hi). Although this formula can be derived from (1) by simply assuming that all P(Hi) are the same, Bayes did not state his result in the general form (1), nor did he derive his result through a formula similar to (1). Despite the fact that Bayes' rule is just a rewriting of the definition of conditional probability and nothing more, it is its interpretation and applications that are most profound and have caused much bitter controversy during the past two centuries.
Candidate theories (hypotheses) are identified with computer programs that compute binary sequences starting with the observed initial segment.

First assume the existence of a prior probability distribution, described by probability function P, over a discrete sample space Ω = {0, 1}*. Define a function µ over Ω by µ(x) = Σ{P(xy) : y ∈ Ω}. Thus, µ(x) is the probability of a sequence starting with x. Given a previously observed data string S, the inference problem is to predict the next symbol in the output sequence, i.e., to extrapolate the sequence S. In terms of the variables in Formula (1), Hi is the hypothesis that the sequence under consideration starts with initial segment Sa. The data D consist in the assertion that the sequence in fact starts with initial segment S. Thus, for P(Hi) and P(D) in Formula (1) we substitute µ(Sa) and µ(S), respectively, and obtain, for a = 0 or a = 1,

P(Sa | S) = P(S | Sa) µ(Sa) / µ(S).

Here µ(S) = P(S | S0) µ(S0) + P(S | S1) µ(S1) + P(S). Obviously, P(S | Sa) = 1 for any a, hence

P(Sa | S) = µ(Sa) / µ(S).   (2)

In terms of inductive inference or machine learning, the final probability P(Sa | S) is the probability of the next symbol being a, given the initial sequence S. Obviously we now only need the prior probability to evaluate P(Sa | S). The goal of inductive inference in general is to be able either (i) to predict (extrapolate) the next element of S, or (ii) to infer an underlying effective process (in the most general case, a Turing machine, according to the Church-Turing thesis) that generated S, and hence to be able to predict the next symbol.

In order to solve the problem for unknown prior probability, Solomonoff proposed what he called a universal prior distribution. We now carefully define the universal prior distribution and prove several fundamental theorems due to Solomonoff and L.A. Levin, and afterwards continue this example. The definitions and theorems are so fundamental that our approach totally rests upon them. These results are in some form hidden in [Solomonoff 1964], [Solomonoff 1978], [Zvonkin and Levin], [Gács, Lecture Notes, 1987]. For various reasons they are difficult to access, and almost unknown except to a few people doing research in this area. It seems useful to recapitulate them. First we need the basic definitions of Kolmogorov complexity.

2.2. Kolmogorov Complexity

Inductive reasoning was the midwife that stood at the cradle of Kolmogorov complexity. Nowadays, Kolmogorov complexity has been applied in many areas of computer science and mathematics (see the survey in the Handbook of Theoretical Computer Science [Li and Vitányi]), and few realize that Kolmogorov complexity was at first invented for the purpose of inductive inference. In this essay, we go back to this origin. We are interested in defining the complexity of a concrete individual finite string of zeros and ones. Unless otherwise specified, all strings will be binary and of finite length. All logarithms in this paper are base 2, unless it is explicitly noted they are not. If x is a string, then l(x) denotes the length (number of zeros and ones) of x. We identify throughout the xth finite binary string with the natural number x, according to the correspondence

(ε, 0), (0, 1), (1, 2), (00, 3), (01, 4), (10, 5), ...,

where ε denotes the empty string.
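For concreteness, this identification of strings with natural numbers can be programmed directly; the two helper functions below (our own sketch) implement the standard bijection just listed.

    def nat_to_string(n):
        # Map 0, 1, 2, 3, 4, ... to '', '0', '1', '00', '01', ... .
        bits = bin(n + 1)[2:]   # binary expansion of n+1 always starts with '1'
        return bits[1:]         # dropping that leading '1' gives the n-th string

    def string_to_nat(s):
        # Inverse map: prepend '1' and read as a binary number, minus one.
        return int('1' + s, 2) - 1

    for n in range(7):
        s = nat_to_string(n)
        assert string_to_nat(s) == n
        print(n, repr(s))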
Intuitively, we want to call a string simple if it can be described in a few words, like 'the string of a million ones'; a string is considered complex if it cannot be so easily described, like a 'random' string that does not follow any rule, and which we therefore do not know how to describe apart from giving it literally. A description of a string may depend on two things: the decoding method (the machine which interprets the description) and the outside information available (the input to the machine). We are interested in descriptions which are effective, and restrict the decoders to Turing machines. Without loss of generality, our Turing machines use binary input strings, which we call programs. More formally, fixing a Turing machine T, we would like to say that p is a description of x if, on input p, T outputs x. It is also convenient to allow T some extra information y to help to generate x. We write T(p, y) = x to mean that Turing machine T with inputs p and y terminates with output x.

Definition 1. The descriptional complexity C_T of x, relative to Turing machine T and binary string y, is defined by

C_T(x | y) = min{l(p) : p ∈ {0,1}*, T(p, y) = x},

or ∞ if no such p exists.

The complexity measure defined above is useful and makes sense only if the complexity of a string does not depend on the choice of T. Therefore the following simple theorem is vital. This Invariance Theorem was given by Solomonoff [Solomonoff 1964], Kolmogorov [Kolmogorov, Three Approaches], and Chaitin [Chaitin].

Theorem 1. There exists a universal Turing machine U such that for any other Turing machine T there is a constant c_T such that for all strings x, y,

C_U(x | y) ≤ C_T(x | y) + c_T.

Proof. Fix some standard enumeration of Turing machines T1, T2, .... Let U be the universal Turing machine such that, when started on input 0^n 1p, p ∈ {0,1}*, U simulates the nth Turing machine Tn on input p. For convenience in the proof, we choose U such that if Tn halts, then U first erases everything apart from the halting contents of Tn's tape, and also halts. By construction, for each p ∈ {0,1}*, Tn started on p eventually halts iff U started on 0^n 1p eventually halts. Choosing c_T = n + 1 for T = Tn finishes the proof.

Clearly, the universal Turing machine U that satisfies the Invariance Theorem is optimal in the sense that C_U minorizes each C_T up to a fixed additive constant (depending on U and T). Moreover, for each pair of universal Turing machines U and U′ satisfying the Invariance Theorem, the complexities coincide up to an additive constant (depending only on U and U′): for all strings x, y,

|C_U(x | y) − C_{U′}(x | y)| ≤ c_{U,U′}.

Therefore, we set the canonical conditional Kolmogorov complexity C(x | y) of x under condition of y equal to C_U(x | y), for some fixed optimal U. We call U the reference Turing machine. Hence the Kolmogorov complexity of a string does not depend, up to a fixed additive constant, on the choice of decoding method, and is well-defined. Define the unconditional Kolmogorov complexity of x as C(x) = C(x | ε), where ε denotes the empty string (l(ε) = 0).

Definition 2. In the sequel we need the prefix complexity variant, or self-delimiting complexity, rather than C(x) from Definition 1. A prefix machine is a Turing machine with three tapes: a one-way input tape, a one-way output tape, and a two-way work tape. Initially, the input tape contains an indefinitely long sequence of bits.
If the machine halts, then the initial segment of the input tape it has read up till that time is considered the input or program, and the contents of the output tape is the output. Clearly, the set of programs of each such machine is a prefix-code. (Recall that if p and q are two code words of a prefix-code, then p is not a proper prefix of q.) We can give an effective enumeration of all prefix machines in the standard way. Then the prefix descriptional complexity of x ∈ {0,1}*, with respect to prefix machine T and binary string y, is defined as

K_T(x | y) = min{l(p) : p ∈ {0,1}*, T(p, y) = x},

or ∞ if no such p exists. One can prove an Invariance Theorem for prefix complexity, and define the conditional and unconditional prefix Kolmogorov complexity by fixing some reference optimal prefix machine, in exactly the same way as before, so we do not repeat the construction.

Remark. The prefix Kolmogorov complexity of string x is the length of the shortest prefix program that outputs x. In this exposition, we will use K(x) to denote the prefix Kolmogorov complexity of x. C(x) and K(x) differ by at most an additive term of 2 log K(x). In some applications this does not make any difference. But in some other applications, for example inductive inference, it is vital. In particular, we need the property that the series Σ_x 2^{−K(x)} converges; cf. below.

Definition 3. A binary string x is incompressible if K(x) ≥ l(x).

Remark. Since Martin-Löf [Martin-Löf 1966] has shown that incompressible strings pass all effective statistical tests for randomness, we will also call incompressible strings random strings. A simple counting argument shows that most strings are random. The theory of computability shows that the function K(x) is noncomputable, but it can be approximated from above by a computable process. However, at no point in this approximation process can we know the error. Cf. also the surveys [Zvonkin and Levin], [Li and Vitányi, Two Decades of Applied Kolmogorov Complexity].

2.3. Semicomputable Functions and Measures

We consider recursive functions with values consisting of pairs of natural numbers. If ⟨p, q⟩
is such a value, then we interpret it as the rational number p/q, and say that the recursive function is rational-valued.
Definition. A real function f is semicomputable from below iff there exists a recursive function g(x, k) with rational values (or, equivalently, a computable real function g(x, k)), nondecreasing in k, with f(x) = lim_{k→∞} g(x, k). A function f is semicomputable from above if −f is semicomputable from below. (An equivalent definition: f is semicomputable from below if the set {(x, r) : r ≤ f(x), r rational} is recursively enumerable.) A real function f is computable iff there is a recursive function g(x, k) with rational values such that |f(x) − g(x, k)| < 1/k.
Obviously, all recursive functions are computable, and all computable functions are semicomputable.
However, not all semicomputable functions are computable, and not all computable
functions are recursive. Nontrivial examples of functions that are semicomputable from
above but not computable are C (x), C (x | y), K (x), and K (x | y).
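As an aside, computable upper bounds on description length are easy to come by even though K(x) itself is noncomputable. The sketch below uses a real-world compressor (zlib) purely as an illustration; it is of course not the reference universal machine, and the bound it produces comes with no computable guarantee of closeness to K(x).

    import os
    import zlib

    def dl_upper_bound(x: bytes) -> int:
        # A computable upper bound, in bits, on the description length of x.
        # This is NOT K(x): better compressors give smaller bounds, but at no
        # stage can we certify how far the current bound is from K(x).
        return 8 * len(zlib.compress(x, 9))

    million_ones = b'1' * 10**6         # intuitively simple: short description
    random_bytes = os.urandom(10**6)    # intuitively complex: incompressible

    print(dl_upper_bound(million_ones))   # small
    print(dl_upper_bound(random_bytes))   # close to 8 * 10**6 bits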
The following analysis is a simplified version, over the discrete space N (or the set of finite binary strings), of Zvonkin and Levin [Zvonkin and Levin]. We follow to some extent [Gács, Lecture Notes]. Functions µ: N → [0, 1] that satisfy the usual properties of probability distributions except that

Σ_x µ(x) ≤ 1
we shall call measures. We say that a measure µ (multiplicatively) dominates a measure µ′ if there exists a constant c such that, for all x in N, we have µ′(x) ≤ c µ(x). It is known from calculus that no measure µ dominates all measures: for each measure µ there is a measure µ′ such that lim_{x→∞} µ′(x)/µ(x) = ∞. However, if we restrict ourselves to the class of semicomputable measures, then it turns out that this class contains an 'absorbing' element, a measure that dominates all measures in the class. We call a measure that dominates all other measures in a given class a universal measure for that class. This important observation, that such a measure exists, was first made by Levin [Zvonkin and Levin].
Theorem 2. The class of measures that are semicomputable from below contains a universal measure.

Proof. First we consider the standard enumeration of all partial recursive functions φ1, φ2, .... Each φ = φi in this list is a function on the positive integers. Let ⟨·⟩ denote a standard effective invertible pairing function over N, associating a unique natural number ⟨x, k⟩ with each pair (x, k) of positive integers. With each φ we associate the function s with approximations s_k(x) = φ(⟨x, k⟩), k = 1, 2, ..., the values interpreted as rationals. The resulting s-enumeration contains all semicomputable functions. Next we use each semicomputable function s to compute a measure µ from below. Initially, set µ(x) = 0 for all x. If s(1) is undefined then µ will not change any more and it is trivially a measure. Otherwise, for k = 1, 2, ..., if s_k(1) + s_k(2) + ... + s_k(k) ≤ 1 then set µ(i) := s_k(i) for i = 1, 2, ..., k, else the computation of µ is finished.
There are three mutually exclusive ways the computation of µ can go, exhausting all possibilities. Firstly, s is already a measure and µ := s. Secondly, for some x and k with x ≤ k the value s_k(x) is undefined. Then the values of µ do not change any more from µ(i) = s_{k−1}(i) for i = 1, 2, ..., k − 1, and µ(i) = 0 for i ≥ k, even though the computation of µ goes on forever. Thirdly, there is a first k such that s_k(1) + s_k(2) + ... + s_k(k) > 1, that is, the new approximation of µ violates the condition of measure. Then the approximation of µ is finished as in the second case. But in this case the algorithm terminates, and µ is even computable.
Thus, the above procedure yields an effective enumeration µ1, µ2, ... of all semicomputable measures. Define the function µ0 as:

µ0(x) = Σ_n 2^{−n} µn(x).

It follows that µ0 is a measure, since

Σ_x µ0(x) = Σ_n 2^{−n} Σ_x µn(x) ≤ Σ_n 2^{−n} = 1.

The function µ0 is also semicomputable from below, since µn(x) is semicomputable from below in n and x. (Use the universal partial recursive function φ0 and the construction above.) Finally, µ0 dominates each µn, since µ0(x) > 2^{−n} µn(x). Therefore, µ0 is a universal semicomputable measure.
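The mixing construction at the heart of the proof can be illustrated numerically. In the following sketch (our own toy example; the three hand-picked measures stand in for the enumeration µ1, µ2, ...) the mixture µ0(x) = Σ_n 2^{−n} µn(x) demonstrably dominates each component with constant 2^n.

    # Three toy measures on {1, 2, 3, ...}, each of total mass <= 1 (assumed).
    def mu1(x): return 2.0 ** -x                # mass 1
    def mu2(x): return 1.0 / (x * (x + 1))      # mass 1
    def mu3(x): return 0.5 if x == 1 else 0.0   # mass 1/2: a deficient measure

    measures = [mu1, mu2, mu3]

    def mu0(x):
        # The mixture mu0(x) = sum_n 2^-n mu_n(x); total mass <= sum_n 2^-n <= 1.
        return sum(2.0 ** -n * mu(x) for n, mu in enumerate(measures, start=1))

    # Domination: mu0(x) >= 2^-n mu_n(x) for every n and x, by construction.
    for n, mu in enumerate(measures, start=1):
        assert all(mu0(x) >= 2.0 ** -n * mu(x) for x in range(1, 200))
    print("mu0 dominates each mu_n with constant 2^n")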
Obviously, there are countably infinitely many universal semicomputable measures. We now fix a reference universal semicomputable measure µ0(x) and denote it by m(x). It will turn out that the function m(x) adequately captures Solomonoff's envisioned universal a priori probability.
If a semicomputable measure is also a probability distribution, then it is computable. Namely, if we compute an approximation µk of the function µ from below for which Σ_x µk(x) > 1 − ε, then we have |µ(x) − µk(x)| < ε for all x.
Any positive function w(x) such that Σ_x w(x) ≤ 1 must converge to zero. Hence m(x) converges to zero as well. However, it converges to zero more slowly than any positive computable function that converges to zero. That is, m(x) is not computable, and therefore it is not a proper probability distribution: Σ_x m(x) < 1. There is no result analogous to Theorem 2 for computable measures: amongst all computable measures there is no universal one. This fact is one of the reasons for introducing the notion of semicomputable measures.
2.4. The Solomonoff-Levin Distribution
The original incentive to develop a theory of algorithmic information content of individual objects was Ray Solomonoff's invention of a universal a priori probability that can be used instead of the actual (but unknown) a priori probability in applying Bayes' Rule. His original suggestion was to set the a priori probability P(x) of a finite binary string x to Σ 2^{−l(p)}, the sum taken over all programs p with U(p) = x, where U is the reference Turing machine of Theorem 1 for the C-complexity. However, using plain Turing machines this is improper, since not only does Σ_x P(x) diverge, but for some x even P(x) itself diverges. To counteract this defect, Solomonoff in 1960 and 1964 used normalizing terms, but the overall result was unconvincing. Levin [Zvonkin and Levin] succeeded in 1970 in finding a proper mathematical expression of the a priori probability, of which we present the simpler version over the discrete domain N. This was elaborated by Levin in 1973 and 1974 [Levin, On the notion of a random sequence], [Levin, non-growth theorems], and by Levin and Gács in 1974 [Gács, symmetry of information], and independently by Chaitin in 1975 [Chaitin, A theory of program size formally identical to information theory].
Definition. The Solomonoff-Levin distribution (actually a measure) on the positive integers is defined by P_U(x) = Σ 2^{−l(p)}, where the sum is taken over all programs p for which the reference prefix machine U of Theorem 1 outputs x. This is a measure because of the following.

Kraft's Inequality. If l1, l2, ... is a sequence of positive integers such that Σ_n 2^{−l_n} ≤ 1, then there is a prefix-code c: N → {0,1}* (i.e., if n ≠ m are positive integers, then c(n) is not a prefix of c(m)) with l(c(n)) = l_n. Conversely, if c: N → {0,1}* is a prefix-code, then the sequence l1, l2, ... with l_n = l(c(n)), n = 1, 2, ..., satisfies the inequality above. See e.g. [Gallager].
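The constructive direction of Kraft's Inequality is short enough to spell out. The following sketch (ours) assigns prefix-free code words to any finite sequence of lengths satisfying the inequality, by the standard device of handing out the leftmost unused word of each length.

    def kraft_code(lengths):
        # Given lengths with sum(2**-l) <= 1, return prefix-free code words
        # (sort by length, hand out the leftmost unused word of each length).
        assert sum(2.0 ** -l for l in lengths) <= 1.0, "Kraft inequality violated"
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        words, nxt, prev = [None] * len(lengths), 0, 0
        for i in order:
            nxt <<= lengths[i] - prev          # extend to the new length
            words[i] = format(nxt, '0%db' % lengths[i])
            nxt += 1
            prev = lengths[i]
        return words

    print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']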
Hence, by Kraft's Inequality, for the prefix-code formed by the programs p of U we have Σ_p 2^{−l(p)} ≤ 1. Therefore the combined probability Σ_x P_U(x), summed over all x, is less than one, no matter how we choose the reference U, because for some programs q there is no output at all.
Another way to conceive of P_U(x) is as follows. We think of the input to the reference prefix machine U as being provided by indefinitely long sequences of fair coin flips. Thus, the probability of generating a program p for U is P(p) = 2^{−l(p)}, where P is the standard 'coin-flip' uniform measure. (Namely, presented with any infinitely long sequence starting with p, the machine U, being a prefix machine, will read exactly p and no further.) Due to the halting problem, for some q the reference U does not halt. Therefore, the halting probability Ω satisfies

Ω = Σ_x P_U(x) < 1.
Now we are ready to state the remarkable and powerful fact that Levin's universal semicomputable measure m(x), the Solomonoff-Levin universal a priori probability P_U(x), and the simpler expression 2^{−K(x)} all coincide up to a fixed independent multiplicative constant. It is a consequence of universally accepted views in mathematical logic (Church's Thesis) that the widest possible effective notion of simplicity of description of an object x is quantified by K(x). The Solomonoff-Levin distribution can be interpreted as a recursively invariant notion that is the formal representation of ''Occam's Razor'': the statement that one object is simpler than another is equivalent to saying that the former object has higher probability than the latter.

Theorem 3. There is a constant c such that for each x, up to additive constant c, we have

−log m(x) = −log P_U(x) = K(x).
Proof. Since 2^{−K(x)} represents the contribution to P_U(x) by a shortest program for x, we have 2^{−K(x)} ≤ P_U(x) for all x. Since P_U(x) is semicomputable from below by enumerating all programs for x, we have by the universality of m(x) that there is a fixed constant c such that for all x we have P_U(x) ≤ c m(x).

It remains to show that m(x) = O(2^{−K(x)}). This is equivalent to showing that K(x) ≤ −log m(x) + c for some constant c. It suffices to exhibit a prefix-code such that for some other fixed constant c′, for each x there is a code word p with l(p) ≤ −log m(x) + c′, together with a prefix machine T such that T(p) = x. Then K_T(x) ≤ l(p), and hence by the Invariance Theorem 1 also K(x) ≤ l(p) up to a fixed additive constant. First we recall a construction for the Shannon-Fano code.

Claim. If µ is a measure on the integers with Σ_x µ(x) ≤ 1, then there is a binary prefix-code r: N → {0,1}*, with the code words r(1), r(2), ... in lexicographical order, such that l(r(x)) ≤ −log µ(x) + 2. This is the Shannon-Fano code.

Proof. Let [0, 1) be the half-open unit real interval. The half-open interval [0.x, 0.x + 2^{−l(x)}) corresponding to the set (cylinder) of reals Γ_x = {0.y : y = xz} (x a finite and y, z infinite binary strings) is called a binary interval. We cut off disjoint, consecutive, adjacent (not necessarily binary) intervals I_n of length µ(n) from the left end of [0, 1), n = 1, 2, .... Let i_n be the length of the longest binary interval contained in I_n. Set r(n) equal to the binary word corresponding to the first such interval. It is easy to see that I_n is covered by at most four binary intervals of length i_n, from which the claim follows.

Since m(x) is semicomputable from below, there is a partial recursive function φ(t, x) such that φ(t, x) ≤ m(x) for all t, and lim_{t→∞} φ(t, x) = m(x). Let ψ(t, x) = 2^{−k}, with k a positive integer, be the greatest partial recursive lower bound of this form on φ(t, x). We can assume that ψ enumerates its range without repetition. Then

Σ_{x,t} ψ(t, x) = Σ_x Σ_t ψ(t, x) ≤ Σ_x 2 m(x) ≤ 2.

(The series Σ_t ψ(t, x) can only converge to precisely 2m(x) in case there is a positive integer k such that m(x) = 2^{−k}.)

Similarly to before, we chop off consecutive, adjacent, disjoint half-open intervals I_{t,x} of length ψ(t, x)/2, in order of computation of ψ(t, x), starting from the left end of [0, 1). This is possible by the last displayed equation. It is easy to see that we can construct a prefix machine T as follows. If Γ_p is the largest binary interval of I_{t,x}, then T(p) = x. Otherwise, T(p) is undefined (e.g., T doesn't halt).

By construction of ψ, for each x there is a ψ(t, x) > m(x)/2. By the construction in the Claim, each interval I_{t,x} has length ψ(t, x)/2. Each I-interval contains a binary interval Γ_p of length at least one quarter of that of I. Therefore, there is a p with T(p) = x such that 2^{−l(p)} ≥ m(x)/16. This implies K_T(x) ≤ −log m(x) + 4. The proof of the theorem is finished.
Theorem 3 demonstrates a particularly important instance of the two conceptually different, but equivalent, definitions of the semicomputable measures. We analyse this equivalence in some detail. Let P1, P2, ... be the effective enumeration of all semicomputable probability distributions constructed in Theorem 2. Let T1, T2, ... be the standard enumeration of prefix machines. For each prefix machine T, define

Q_T(x) = Σ_{T(p) = x} 2^{−l(p)}.

In other words, Q_T(x) is the probability that T computes output x if its input p is generated by successive tosses of a fair coin. In other words, the inputs p are uniformly distributed, with the probability of p occurring equal to 2^{−l(p)}. It is easy to see that each Q_T satisfies

Σ_x Q_T(x) ≤ 1.

Equality holds iff T halts for all inputs (proper programs). Let Q1, Q2, ... (where we do not require equality to hold) be the probability distributions associated with T1, T2, ....

Claim. There are recursive functions σ, π such that Q_n = Θ(P_{σ(n)}) and P_n = Θ(Q_{π(n)}), for n = 1, 2, ....

Proof. Omitted.
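The coin-flip reading of Q_T(x) lends itself to direct simulation. Below, a toy prefix machine of our own devising (its programs are 0^n 1, a prefix-code, and its output on program 0^n 1 is n) is fed fair coin flips; the empirical frequencies approach Q_T(n) = 2^{−(n+1)}. The trial count and the cutoff on the number of flips are arbitrary choices.

    import random

    def toy_prefix_machine(bits):
        # Toy prefix machine T: its programs are 0^n 1 (a prefix-code); on
        # program 0^n 1 it halts with output n. Returns None if no program
        # is completed within the supplied bits.
        for i, b in enumerate(bits):
            if b == 1:
                return i
        return None

    def estimate_QT(trials=50_000, cutoff=64):
        counts = {}
        for _ in range(trials):
            bits = [random.randint(0, 1) for _ in range(cutoff)]  # coin flips
            x = toy_prefix_machine(bits)
            if x is not None:
                counts[x] = counts.get(x, 0) + 1
        return {x: c / trials for x, c in sorted(counts.items())}

    for x, q in list(estimate_QT().items())[:5]:
        print(f"Q_T({x}) ~ {q:.4f}   (exact 2^-{x+1} = {2 ** -(x + 1):.4f})")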
Remark. The Coding Theorem tells us that there is a constant c > 0 such that |−log P_U(x) − K(x)| ≤ c. We recall from the definition of the Solomonoff-Levin distribution that −log P_U(x) = −log Σ_{U(p) = x} 2^{−l(p)}, and K(x) = min{l(p) : U(p) = x}. A priori, an outcome x may have high probability because it has many long descriptions. But these relations show that in that case it must have a short description too. In other words, the a priori probability of x is governed by the shortest program for x.

Remark. Let P be any probability distribution (not necessarily computable). The P-expected value of m(x)/P(x) is

Σ_x P(x) (m(x)/P(x)) < 1.

We find by Chebychev's First Inequality1) that

Σ{P(x) : m(x) ≤ k P(x)} ≥ 1 − 1/k.   (3)

Since m(x) dominates all semicomputable measures multiplicatively, for all x we have

P(x) ≤ c_P m(x),   (4)

for a fixed positive constant c_P independent of x (but depending on the index of P in the effective enumeration µ1, µ2, ... of semicomputable measures).
Inequalities (3) and (4) have the following consequences:2)

(i) If x is a random sample from a simple computable distribution P(x), then m(x) is a good estimate of P(x).

(ii) If we know or believe that x is random with respect to P, and we know P(x), then we can use P(x) as an estimate of m(x).
1) Recall that Chebychev's First Inequality says the following. Let P be any probability distribution and f any nonnegative function with expected value E_P(f) = Σ_x P(x) f(x) < ∞. For λ > 0 we have Σ{P(x) : f(x) > λ} < λ^{−1} E_P(f). Here we use it with k E_P(f) substituted for λ.

2) We remark, without further explanation, that in both cases the degree of approximation depends on the index of P, and on the randomness of x with respect to P, as measured by the randomness deficiency δ0(x | P) = log(m(x)/P(x)). If δ0(x | P) = O(1) then x is random, otherwise x is not random. For example, for the Uniform Distribution, δ0(x | P) = n − K(x | n) + O(1), where n = l(x). Such a (universal Martin-Löf) test is needed, since otherwise we cannot distinguish, for instance, between randomness and nonrandomness of samples from the uniform distribution. (Clearly, the word Constantinople is not a random 14-letter word. The probability of seeing it somewhere written is decidedly greater than 128^{−14}, say, for a randomly selected fourteen-letter ASCII word.)
2.4.1. Solomonoff’s Inference Procedure and Its Mathematical Justification
We continue Solomonoff's approach to inductive inference, as in [Zvonkin and Levin]. In general, one cannot prove that an inference procedure in statistics is good. This accounts for the many different approaches which are advocated in statistics. In contrast, about Solomonoff's procedure we can rigorously prove that it is good. First, we put the previously developed theory in a continuous setting. Let the sample space S = {0, 1}* ∪ {0, 1}^∞, the set of all finite and one-way infinite binary sequences. Let a cylinder Γ_x = {xy : y ∈ S} be the set of all elements of S that start with x. A function µ from cylinders to the real interval [0, 1] is called a semimeasure if

(a) µ(S) ≤ 1; and
(b) µ(Γ_x) ≥ µ(Γ_{x0}) + µ(Γ_{x1}).

A semimeasure is called a measure if equality holds in (a) and (b). A semimeasure µ is (semi)computable if f(x) = µ(Γ_x) is (semi)computable. Note that f needs to satisfy (a) and (b). It is more convenient, and customary in this area, to simply write µ(x) instead of µ(Γ_x). The problem was that the proper a priori probabilities µ in formula (2) are not known.
We modify the Turing machines in the standard enumeration so that they correspond to the semicomputable semimeasures. A monotonic machine M is a three-tape machine similar to the prefix machine we defined before, but now for all finite (binary) inputs p and q, if p is a prefix of q, then M(q) = M(p)r for some r in S. (For convenience we define M(p) as the contents of the output tape when M reads the next symbol after p. If M doesn't halt then M(p) can be finite or one-way infinite.) Let U be the universal monotonic machine, obtained in the same way as the universal Turing machines and universal prefix machines we have already met.

The universal semicomputable semimeasure is defined as

M(x) = Σ_{U(p) ∈ Γ_x} 2^{−l(p)},

i.e., M(x) is the a priori probability that the output of the reference monotonic machine U starts with x. Just as in the discrete case, one can show that for each semicomputable semimeasure µ,
there exists a constant c such that for all x ∈ {0, 1}*, we have

M(x) ≥ c µ(x).
An alternative approach to defining a priori probability was taken by Cover [Cover, gambling schemes], who defined

Mc(x) = Σ{m(xy) : y ∈ {0, 1}*}.

This function has properties related to those of M.
Solomonoff's Predictor. Instead of using formula (2), we estimate the conditional probability P(xy | x) that the next segment after x is y by the expression

M(xy)/M(x).   (5)

Now let µ in Formula (2) be an arbitrary computable measure. This case includes all computable sequences, as well as many Bernoulli sequences.

Justification. Solomonoff [Solomonoff 1978] showed that the error made by the estimator converges to zero very fast, in the following sense. If µ is the actual prior probability (measure) over the sample space {0, 1}^∞, then we obviously cannot do better in predicting a '0' or '1' after an initial segment x than using the inferred probability

µ(xa)/µ(x), a = 0, 1.

To estimate how much worse it is to use M instead of µ, we consider the difference in inferred probabilities. Let Sn denote the µ-expectation of the squared difference between the µ-inferred probability and the M-inferred probability of '0' occurring as the (n+1)th symbol:

Sn = Σ_{l(x) = n} µ(x) (M(x0)/M(x) − µ(x0)/µ(x))².

Then Σ_n Sn ≤ K(µ)/2. Here, K(µ) is the Kolmogorov complexity of the index i, where Ti is a Turing machine computing µ. Therefore, Sn converges to zero faster than 1/n. In other words, it has been rigorously proved that for the above estimator the expected error at the nth prediction converges to zero faster than 1/n!
This was improved by Gács [P. Gács, personal communication] as follows. If the length of y is fixed, and the length of x grows to infinity, then

(M(xy)/M(x)) / (µ(xy)/µ(x)) → 1,

with µ-probability one. In other words, the conditional a priori probability is almost always asymptotically equal to the conditional probability.

With respect to the discrete sample space approach taken before, one can show that:

−log M(x) = −log Mc(x) = K(x) + O(log K(x)).   (6)
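Although M itself is not computable, the predictor (5) survives any computable restriction of the model class, because it is just a Bayesian mixture. The sketch below is our own computable stand-in: a finite mixture of Bernoulli measures with weights decaying in the index (playing the role of 2^{−K(µ)}), used to predict the next bit via M(x1)/M(x).

    import random

    # A computable stand-in for M: a finite mixture of Bernoulli measures,
    # with weights decaying in the index (playing the role of 2^-K(mu)).
    models = [i / 10 for i in range(1, 10)]              # Bernoulli parameters
    weights = [2.0 ** -(i + 1) for i in range(len(models))]

    def predict_one(x):
        # Predictor (5): estimate P(next bit is 1 | x) by M(x1)/M(x).
        num = den = 0.0
        for p, w in zip(models, weights):
            mu_x = w
            for b in x:                  # w * mu(x) for this Bernoulli model
                mu_x *= p if b == 1 else 1 - p
            den += mu_x                  # accumulates M(x)
            num += mu_x * p              # mu(x1) = mu(x) * p, so this is M(x1)
        return num / den

    random.seed(1)
    truth = 0.3                          # the actual source, one of the models
    x = []
    for n in (10, 100, 1000):
        while len(x) < n:
            x.append(int(random.random() < truth))
        print(f"after {n:4d} bits: predicted P(1) = {predict_one(x):.3f}")

The predictions converge quickly to the true parameter 0.3, in the spirit of the convergence results above.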
2.4.2. Conclusions
On the positive side, we have achieved the following. Bayes' rule using the universal prior distribution gives an objective interpretation to Occam's razor principle. Namely, if several programs could generate S0, then the shortest one is used (for the prior probability), and further, if S0 has a shorter program than S1, then S0 is preferred (i.e., predict 0 with higher probability than 1 after seeing S). Bayes' rule via the universal prior distribution also satisfies Epicurus' multiple explanations dictum, since we do not select a single hypothesis after considering the evidence, but maintain all hypotheses consistent with the evidence and merely transform the probability distribution on the hypotheses according to the evidence. Finally, there is a mathematical proof that Solomonoff's inference procedure using the universal prior probability performs almost as well as the one using the actual (computable) prior probability.

On the negative side, we know that Solomonoff's inference is not practicable in its pure form. The universal prior distributions m(x) for discrete sample spaces and M(x) for continuous sample spaces are not computable, essentially because the Kolmogorov complexity is not computable. However, we can compute approximations to K(x), m(x), and M(x). It turns out that using Solomonoff's inference principles with such computable approximations yields many other known inference models or principles. In the next few sections, we derive or establish connections with various well-known machine learning models and inductive inference paradigms or principles. Thus we provide an alternative view of these models and principles from the lofty perspective of Kolmogorov complexity.
3. Gold’s Inductive Inference Paradigm
There are many different ways of formulating concrete inductive inference problems in the
real world. We will try to simplify matters as much as possible short of losing significance.
(i) The class of rules we consider can be various classes of languages or functions, where we restrict ourselves to classes of recursive sets, context-free languages, regular sets and sets of finite automata, and sets of Boolean formulae. We treat a language L as a function f using its characteristic function, i.e., f(x) = χ_L(x) = 1 if x ∈ L, and 0 otherwise.

(ii) The hypothesis space or rule space, denoted by R, specifies syntactically how each rule in (i) should be represented. We fix a standard enumeration of the representations for a class of rules, R = {R1, R2, ...}, and assume that each rule f has at least one description in the corresponding hypothesis space. For example, the hypothesis space can be standard encodings of context-free grammars, or standard encodings of finite automata. In any case, it is assumed that the hypothesis space is effectively enumerable (so it cannot be the set of all halting Turing machine codes). For convenience, this enumeration of hypotheses R1, R2, ... consists of codes for algorithms to compute recursive functions f1, f2, ... (languages are represented by their characteristic functions).
(iii) The presentation of examples is vital to the inference process. We choose the simplest, and yet most general, form of data presentation. For a function f to be inferred, there is a fixed infinite sequence of examples (s1, f(s1)), (s2, f(s2)), .... When f = χ_L, we have χ_L(s) = 1 if s ∈ L (s is a positive example of L) and χ_L(s) = 0 otherwise (s is a negative example of L).

A rule (or function) f is said to be consistent with the initial segment of examples

S = (s1, a1), ..., (sn, an),   (7)

if f(si) = ai, for i = 1, ..., n. We require that every string eventually appears as the first component of some pair in S. The last assumption is strong, but it is essential to the Gold paradigm.
How to infer a rule. By (ii), there is an effective enumeration f1, f2,... of partial recursive
functions corresponding to the enumeration of hypotheses. The a priori probability of fk is
m(fk) = m(k). (Actually, m(fk) = c m(k), for some constant c depending on the effective enumeration
involved, but not depending on n. To assume that c = 1 makes no difference in the following
discussion.) We are given an infinite sequence of examples representing the rule or function f to
be learned. According to Bayes’ rule (1), for k = 1, 2,..., the inferred probability of fk after the
sequence of examples (7) is given by:
P(fk = f | f(si) = ai, i = 1,...,n)
    = P(f(si) = ai, i = 1,...,n | fk = f) m(k) / Σ { m(j) : fj(si) = ai, i = 1,...,n }.   (8)
Cf. also [Cover 1974] [Cover, Impact]. In the numerator of the right-hand term, the first
factor is zero or one depending on whether fk is consistent with S, and the second factor is the a
priori probability of fk. The denominator is a normalizing term giving the combined a priori probability
of all rules consistent with S. With increasing n, the denominator term is monotonically
nonincreasing. Since all examples eventually appear in S, the denominator converges to a limit,
say d ≤ 1. For each k, the inferred probability of fk is monotonically nondecreasing with increasing
n, until fk is inconsistent with a new example, in which case it falls to zero and stays there henceforth.
In the limit, only the fk’s that are consistent with the sequence of presented examples
have positive inferred probability m(k)/d. By Theorem 3, since m(k) = Θ(2^{-K(k)}), the highest
inferred probability is carried by the rule fk with least Kolmogorov complexity among the
remaining ones. Similar statements hold after each initial segment of n examples, n = 1,2,....
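For illustration only, the following sketch performs the update of Formula (8) with the noncomputable m(k) replaced by a computable, summable stand-in (weight 2^{-2 l(k)} for index k, echoing m(k) = Θ(2^{-K(k)})); the toy enumeration f_k and this prior are our own choices, not the paper's:

    from fractions import Fraction

    # Toy effective enumeration f_1, f_2, ... of rules (f_k(x) = 1 iff k
    # divides x), standing in for an enumeration of recursive functions.
    def f(k):
        return lambda x: 1 if x % k == 0 else 0

    def prior(k):
        # Computable, summable stand-in for m(k): weight 2^{-2 l(k)},
        # where l(k) is the binary length of k.
        return Fraction(1, 4 ** k.bit_length())

    def posterior(S, kmax=64):
        # Formula (8): P(f_k | S) is prior(k), normalized over the
        # hypotheses consistent with S (the first factor is 0 or 1).
        consistent = [k for k in range(1, kmax + 1)
                      if all(f(k)(s) == a for s, a in S)]
        d = sum(prior(j) for j in consistent)
        return {k: prior(k) / d for k in consistent}

    print(posterior([(6, 1), (4, 0)]))          # {3: Fraction(4, 5), 6: Fraction(1, 5)}
    print(posterior([(6, 1), (4, 0), (9, 0)]))  # {6: Fraction(1, 1)}; k = 3 is eliminated

The smaller (lower-complexity proxy) index k = 3 carries the larger posterior until an example refutes it, after which its probability falls to zero and stays there, as described above.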
Reasoning inductively, we transform the a priori probability according to Formula (8),
inferring a new posterior probability from the evidence of each initial segment of examples. At each
step we predict (select) the rule with the highest inferred probability, and in the limit we have selected
the proper rule. Reformulating, if we want to infer a language L using this procedure, then:
(a) The Bayesian a posteriori probability of the correct answer converges to c 2^{-l(p)}/d,
where p is the shortest program by which the reference machine simulates M0, the smallest TM
that accepts L. This correct answer will have the highest probability in the limit. That is, the
inferred probability distribution over the underlying machines converges, in the limit, to one in
which M0 carries the highest probability. In other words, after n steps, for some n, all machines
smaller than M0 violate some data pair in S, and M0 is the choice forever after step n.
(b) It is interesting to notice that the a posteriori probability of the current most probable
hypothesis decreases monotonically until it converges to c 2^{-l(p)}/d, for p the program with
which U simulates M0: smaller machines are chosen first and then canceled because they violate
some data.
Prediction by extrapolation. If we want to infer f(s), rather than f, given the sequence of
examples S, then using Formulas (2) and (5), the inferred probability that f(s) = a is (denoting a
string s1 s2 ... sn as s1:n):

P(f(s) = a | f(si) = ai, i = 1,...,n)
    = Σ { m(j) : fj(si) = ai, i = 1,...,n, fj(s) = a } / Σ { m(j) : fj(si) = ai, i = 1,...,n }.   (9)
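Continuing the same sketch (and reusing f and prior from it), Formula (9) is a ratio of a priori mass:

    def predict(S, s, a, kmax=64):
        # Formula (9): the inferred probability that f(s) = a is the fraction
        # of the a priori mass of the S-consistent hypotheses that map s to a.
        consistent = [k for k in range(1, kmax + 1)
                      if all(f(k)(x) == y for x, y in S)]
        num = sum(prior(j) for j in consistent if f(j)(s) == a)
        return num / sum(prior(j) for j in consistent)

    print(predict([(6, 1), (4, 0)], 9, 1))   # 4/5: exactly the posterior weight of k = 3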
The Gold paradigm of inductive inference, in the sense originally studied by Gold in
[Gold 1967] [Gold 1978], can be viewed simply as a computable approximation
to Equation (8). The fundamental idea of the Gold paradigm is
identification in the limit, and a universal method of implementing identification in the limit is
called ‘identification by enumeration’. These are contained in facts (a) and (b), as a computable
analogue of Solomonoff’s approach. We now investigate the correspondence between these two
basic ideas in some detail.
Identification in the Limit views inductive inference as an infinite process. Formally, let M
be an inductive inference method that tries to derive some unknown rule R. If M receives a larger
and larger set of examples (a bigger and bigger initial segment S), a sequence of
M’s conjectures is generated, say f1, f2, f3, .... If there is some integer m such that fm is a
correct description of R and fn = fm for all n > m, then M identifies R in the limit. Two facts
deserve mention: M cannot determine whether it has converged, and therefore it cannot stop and
announce a correct hypothesis; and M may be viewed as learning more and more information
about the unknown rule R, monotonically improving its approximation
to R until the correct identification. Gold gave the best motivation for his definition:
I wish to construct a precise model for the intuitive notion "able to speak a language" in order to be
able to investigate theoretically how it can be achieved artificially. Since we cannot write down the rules of
English which we require one to know before we say he can "speak English", an artificial intelligence
which is designed to speak English will have to learn its rules from implicit information. That is, its information
will consist of examples of the use of English and/or of an informant who can state whether a given
usage satisfies certain rules of English, but cannot state these rules explicitly.
. . . A person does not know when he is speaking a language correctly; there is always the possibility
that he will find that his grammar contains an error. But we can guarantee that a child will eventually
learn a natural language, even if it will not know when it is correct.
Identification by enumeration is a method to implement identification in the limit. It
refers to the following guessing rule: Enumerate the class of rules in rule space. At step t, guess
the unknown rule to be the first rule of the enumeration which agrees with data received so far.
Formally speaking, in our setting: if we have received an initial segment S then, given s, we predict
(s, f(s)) as the next example, where f is the first rule in the enumeration that is consistent with S.
If this can be done effectively, identification in the limit will be achieved.
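A minimal sketch of this guessing rule, under the assumption of a finite prefix of the enumeration and rules given as predicates (both our own simplifications):

    def identify_by_enumeration(examples, rules):
        # Gold's guessing rule: after each new example, conjecture the first
        # rule in the (fixed) enumeration consistent with all data so far.
        S = []
        for s, a in examples:
            S.append((s, a))
            yield next(i for i, r in enumerate(rules)
                       if all(r(x) == y for x, y in S))

    # Toy enumeration: rules[i](x) = 1 iff x < i; the data below come from
    # rules[3].  The conjectures are 1, 1, 3, 3: identification in the limit,
    # with earlier (smaller) rules tried and discarded first.
    rules = [lambda x, i=i: 1 if x < i else 0 for i in range(10)]
    print(list(identify_by_enumeration([(0, 1), (5, 0), (2, 1), (3, 0)], rules)))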
We say an induction method identifies
a rule correctly in k steps if it never produces a wrong hypothesis from step
k onward. Let G and G′ be two guessing methods. G is said to be uniformly faster than G′ if the
following two conditions hold: (1) given any R from the rule space, G identifies R correctly at
least as soon as G′, measured in the number of examples needed, for all sequences of examples;
and (2) for some R and some sequence of examples, G identifies R sooner than G′. Say a guessing
method G is optimal if for every other guessing method G′ there is a constant c such that: if R
appears as the ith rule in the enumeration of G′, then it appears no later than position ci in the
enumeration of G. It is easy to prove that the identification-by-enumeration method will identify a
hypothesis in the limit if this hypothesis can be identified in the limit at all. Further, if G0 is an
identification-by-enumeration guessing rule, then there is no guessing rule uniformly faster than
G0. But only the Solomonoff procedure is optimal. Indeed,
Theorem 4. (a) Identification-by-enumeration is a computable approximation to inductive
inference (Solomonoff’s inference) associated with Formula (8). (b) Neither method is uniformly
faster than the other. (c) The Solomonoff procedure is optimal, while identification-by-enumeration is
not.
Proof. (a) An effective enumeration for the identification-by-enumeration method can be
viewed as a computable approximation to Solomonoff’s procedure according to Formula (8), as
follows. Let the effective enumeration of the rule space be R1, R2, R3, .... Convert this to an
effective prefix-free description of each rule Ri in the rule space. For instance, if x = x1 ... xn is a
binary string, then x̄ = x1 0 x2 0 ... 0 xn 1 is a prefix-code word for x. Similarly, x′, consisting of the
bar-code of l(x) followed by x itself, is a prefix-code word; note that l(x′) = l(x) + 2 log l(x). We
encode each rule Ri (a binary string) as p i′, where p is a (prefix) program that enumerates the rule
space. The resulting code for the Ri’s is an effective prefix-code. Denoting the length of the
description of Ri by |Ri|, we have:
(i) if i < j, then |Ri| ≤ |Rj|.
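A small sketch illustrating these prefix codes (the helper names bar, prime, and code_rule, and the binary-string representation, are ours):

    def bar(x):
        # x1 0 x2 0 ... 0 xn 1: each bit of x followed by 0, the last by 1;
        # a prefix-code word of length 2 l(x).
        return ''.join(b + '0' for b in x[:-1]) + x[-1] + '1'

    def prime(x):
        # x' = bar(l(x)) x: the bar-coded binary length of x, then x itself;
        # l(x') = l(x) + 2 log l(x), up to O(1).
        return bar(bin(len(x))[2:]) + x

    def code_rule(p, i):
        # Code word p i' for rule R_i: the (prefix) enumerating program p,
        # assumed itself prefix-free, followed by the prefix code of index i.
        return p + prime(bin(i)[2:])

    x = '10110'
    assert len(prime(x)) == len(x) + 2 * len(bin(len(x))[2:])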
[...]

0. Set DNF g := ∅. (g is the DNF we will eventually output as an approximation of f.)
1. Pick a positive example a = (a1, ..., an). Form the monotone term m that includes xi iff
ai = 1.
2. For each positive example a = (a1, ..., an) do: if ai = 0 and deleting xi from m causes no
negative example to satisfy m, then delete xi from m.
3. Remove from the sample all positive examples which are implied by m. Set g := g + m. If
there are still positive examples left, then go to step 1, else halt and return g.
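Here is a runnable sketch of steps 0-3, with our own representation choices (examples as 0/1 tuples, a term as the set of indices of its variables); it assumes, as the correctness argument below does, that the sample is rich enough:

    def learn_monotone_dnf(positives, negatives, n):
        # Greedy covering of steps 0-3.  An example e satisfies a term t iff
        # e[i] = 1 for every variable index i in t.
        def satisfies(e, t):
            return all(e[i] == 1 for i in t)

        g = []                                  # step 0: g := empty DNF
        remaining = list(positives)
        while remaining:
            a = remaining[0]                    # step 1: pick a positive example
            m = {i for i in range(n) if a[i] == 1}
            for b in remaining:                 # step 2: generalize m against
                for i in sorted(m):             # the other positive examples
                    if b[i] == 0 and not any(
                            satisfies(e, m - {i}) for e in negatives):
                        m = m - {i}
            g.append(frozenset(m))              # step 3: g := g + m, and drop
            remaining = [e for e in remaining   # the positives m implies
                         if not satisfies(e, m)]
        return g

    # Target f = x1 + x2 x3 (term {0} and term {1, 2}) over n = 3 variables:
    pos = [(1, 0, 0), (0, 1, 1), (1, 1, 0)]
    neg = [(0, 1, 0), (0, 0, 1), (0, 0, 0)]
    print(learn_monotone_dnf(pos, neg, 3))      # [frozenset({0}), frozenset({1, 2})]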
We show that the algorithm is correct. Let us write mi ⊆ m, for two monotone monomials, if
all the variables that appear in mi also appear in m. At step 1, the monomial m implies
no negative example, since for some monomial mi of f we must have mi ⊆ m (the positive example
a satisfies some term mi of f, and m contains every variable that is 1 in a). Step 2 of the algorithm
keeps deleting variables from m. If at any time mi ⊆ m were to hold for no monomial mi of f, then
there would exist a negative example, containing at most k 0’s, that satisfies m but no mi of f.
Such a negative example has Kolmogorov complexity at most k log n, hence by the Chernoff formulae
(Section 2.1) it is contained in the sample with high probability. Hence at step 2, with
high probability, there always remains an mi such that mi ⊆ m, and we eventually find some mi of f
precisely, with high probability. Then at step 3, we remove the positive examples implied by
this mi and continue to find another term of f. By standard calculations, the algorithm eventually
outputs g = f with high probability.
Remark. Notice that this is not an approximation algorithm like the one in [Li 1989] for
learning log n-DNF: this algorithm outputs the precise monotone formula, with high probability.
7.4. Continuous Sample Space
Second, we deal with continuous sample spaces. For example, the uniform distribution is now
defined by λ(Γx) = 2^{-l(x)}, where Γx denotes the set of all one-way infinite binary strings starting
with x. This is the Lebesgue measure on the interval [0,1]. While for discrete sample spaces all concept
classes are Valiant learnable (although not all are polynomially learnable), this is not the
case for continuous sample spaces. We can define the notion of a ‘simple’ semimeasure, and that of a
universal semicomputable semimeasure, over a continuous sample space, and show that all concept
classes are learnable under each simple semimeasure D iff they are learnable under the
universal semimeasure. In contrast with the discrete case w.r.t. polynomial learning, here we do
not need to require that the learning algorithm samples according to the universal measure. For
details, see [Li 1989].
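As a tiny illustration of this uniform measure (a sketch; the function name is ours), the cylinder Γx has measure 2^{-l(x)}, and the measures of its two one-bit extensions split it evenly:

    from fractions import Fraction

    def cylinder_measure(x):
        # lambda(Gamma_x) = 2^{-l(x)}: the uniform (Lebesgue) measure of the
        # set Gamma_x of one-way infinite binary sequences starting with x.
        return Fraction(1, 2 ** len(x))

    x = '101'
    assert cylinder_measure(x) == cylinder_measure(x + '0') + cylinder_measure(x + '1')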
8. Acknowledgement
We are grateful to Ray Solomonoff, Leonid Levin, and Peter Gács for conversations about algorithmic complexity
and inductive reasoning. We thank Mati Wax for useful discussions on the MDL principle; Leslie
Valiant for suggesting handwritten character recognition as a viable application; and Qiong Gao for participating
in the realization of it. Fahiem Bacchus, Danny Krizanc and John Tromp read and commented on
the manuscript.