Learning from positive data

Stephen Muggleton
Oxford University Computing Laboratory, Parks Road, Oxford, OX1 3QD, United Kingdom.

Abstract

Gold showed in 1967 that not even regular grammars can be exactly identified from positive examples alone. Since it is known that children learn natural grammars almost exclusively from positive examples, Gold's result has been used as a theoretical support for Chomsky's theory of innate human linguistic abilities. In this paper new results are presented which show that within a Bayesian framework not only grammars, but also logic programs are learnable with arbitrarily low expected error from positive examples only. In addition, we show that the upper bound for expected error of a learner which maximises the Bayes' posterior probability when learning from positive examples is within a small additive term of one which does the same from a mixture of positive and negative examples. An Inductive Logic Programming implementation is described which avoids the pitfalls of greedy search by global optimisation of this function during the local construction of individual clauses of the hypothesis. Results of testing this implementation on artificially generated data-sets are reported. These results are in agreement with the theoretical predictions.

1 Introduction

Gold's [5] seminal paper not only formed the foundations of learnability theory but also provided an important negative result for the learnability of grammars. It was shown that not even the class of all regular languages could be identified in the limit from an arbitrary finite sequence of positive examples of the target language. In the same paper Gold pointed out the implications for theories of language acquisition in children. He notes that psycholinguistic studies by McNeill and others had shown that

  ... children are rarely informed when they make grammatical errors and those that are informed take little heed.

Gold's negative results have been taken by [14] as theoretical support for Chomsky's theory [4] of innate human linguistic abilities. In this paper Gold's requirements for exact identification of a language are replaced by a need to converge with arbitrarily low error. In a previous paper [10] the author derived a function for learning logic programs from positive examples only. In the present paper the Bayes' function for maximising posterior probability is derived. The solution is representation independent, and therefore equally applicable to grammar learning, scientific theory formation or even automatic programming. The expected error of an algorithm which maximises this function over a high prior probability segment of the hypothesis space is analysed and shown to be within a small additive term of that obtained from a mixture of positive and negative examples. An implementation of this approach within the Inductive Logic Programming (ILP) system Progol4.2 is described. A novel aspect of this implementation is the use of global optimisation during local construction of individual clauses of the hypothesis. The technique avoids the local optimisation pitfalls of cover-set algorithms. Experiments on three separate domains (animal taxonomy, KRK-illegal and grammar learning) are shown to be in accordance with the theoretical predictions.

This paper is organised as follows. In Section 2 a Bayes' framework is described which is compatible with the U-learnability framework [9, 13].
The Bayes' function for the posterior probability of hypotheses given positive examples only is derived in Section 3. The expected error of an algorithm which maximises the Bayes' function over a high prior probability segment of the hypothesis space is given in Section 4. In Section 5 the ILP system Progol4.2, which implements this function, is described. An experiment is presented in Section 6, in which Progol4.2 is tested on varying amounts of randomly generated data for three target concepts. The results of these experiments are discussed in Section 7. The paper is concluded in Section 8 by a comparison to related work and a discussion of directions for future research.

2 Bayes' positive example framework

The following is a simplified version of the U-learnability framework presented in [9, 13]. X is taken to be a countable class of instances and ℋ ⊆ 2^X to be a countable class of concepts. D_X and D_ℋ are probability distributions over X and ℋ respectively. For H ∈ ℋ, D_X(H) = Σ_{x∈H} D_X(x) and the conditional distribution of D_X associated with H is as follows.

  D_{X|H}(x) = D_X(x|H) = 0 if x ∉ H, and D_X(x)/D_X(H) otherwise.

The teacher randomly chooses a target theory T from D_ℋ and randomly and independently chooses a series of examples E = ⟨x_1, .., x_m⟩ from T according to D_{X|T}. Given E, D_ℋ and D_X a learner L chooses an hypothesis H ∈ ℋ for which all elements of E are in H. The teacher then assesses Error(H) as D_X(H \ T) + D_X(T \ H). Unlike the setting in U-learnability it is assumed in the present paper that L is given D_ℋ and D_X.

3 Bayes' posterior estimation

Gold's negative result for identification of the regular languages over the symbol set Σ is based on the fact that for any sequence of positive examples E there will always be at least two possible candidate hypotheses: 1) Σ*, the language containing all possible sentences, and 2) the language corresponding to the elements of E alone. It is clear that 1) is the most general hypothesis, and has a compact finite automaton description, while 2) is the least general hypothesis and has a complex finite state automaton description. Since these two extreme hypotheses will maximise errors of commission and omission respectively it would seem desirable to find a compromise between the size of the hypothesis description and its generality. Size and generality of an hypothesis can be defined within the Bayes' framework of the previous section as follows.

  sz(H) = -ln D_ℋ(H)
  g(H) = D_X(H)

Bayes' theorem allows us to derive a tradeoff between sz(H) and g(H). In its familiar form, Bayes' theorem is as follows.

  p(H|E) = p(H) p(E|H) / p(E)

With respect to the Bayes' framework of the previous section p(H|E) is interpreted from the learner's perspective as the probability that H = T given that the example sequence is E. Similarly, p(H) is defined as the probability that H = T, which is p(H) = D_ℋ(H). Meanwhile p(E|H) is the probability that the example sequence is E given that H = T. Since examples are chosen randomly and independently from D_{X|H}, for any consistent hypothesis this is as follows.

  p(E|H) = ∏_{i=1}^m D_{X|H}(x_i) = (1/D_X(H))^m ∏_{i=1}^m D_X(x_i)

The prior p(E) is the probability that the example sequence is E irrespective of T. This is as follows.

  p(E) = Σ_{T∈ℋ} D_ℋ(T) ∏_{j=1}^m D_{X|T}(x_j)

The Bayes' equation can now be rewritten as follows.

  p(H|E) = D_ℋ(H) (1/D_X(H))^m (∏_{i=1}^m D_X(x_i)) / p(E)

Since (∏_{i=1}^m D_X(x_i)) / p(E) is common to all consistent hypotheses, it will be treated as a normalising constant c_m in the following.

  p(H|E) = c_m D_ℋ(H) (1/g(H))^m
  ln p(H|E) = m ln(1/g(H)) - sz(H) + d_m

In the above d_m = ln c_m.
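To make this trade-off concrete, the following Python sketch scores each consistent hypothesis by f_m(H) = m ln(1/g(H)) - sz(H), that is by ln p(H|E) up to the additive constant d_m, and returns the maximiser. The toy hypothesis space, the priors and all function names are illustrative assumptions rather than any part of Progol; under a uniform D_X over ten instances, g(H) is simply |H|/10.

  import math

  # Toy hypothesis space over the instances {0,...,9}: each entry pairs a set of
  # covered instances with an assumed prior D_H(H).  More compressible (general)
  # hypotheses are given larger priors; the numbers are purely illustrative.
  HYPOTHESES = {
      "all":   (frozenset(range(10)), 0.90),
      "even":  (frozenset({0, 2, 4, 6, 8}), 0.09),
      "small": (frozenset({0, 2, 4}), 0.01),
  }

  def g(h):
      # Generality g(H) = D_X(H) under a uniform instance distribution on {0,...,9}.
      return len(h) / 10.0

  def f_m(h, prior, m):
      # ln p(H|E) up to d_m:  m ln(1/g(H)) - sz(H), where sz(H) = -ln D_H(H).
      return m * math.log(1.0 / g(h)) + math.log(prior)

  def best_hypothesis(examples):
      # Among hypotheses covering all positive examples, return the f_m maximiser.
      m = len(examples)
      consistent = {name: hp for name, hp in HYPOTHESES.items()
                    if all(x in hp[0] for x in examples)}
      return max(consistent, key=lambda name: f_m(*consistent[name], m))

  if __name__ == "__main__":
      # Positive examples drawn from the set {0,2,4}: the most general hypothesis
      # wins for small m, and progressively tighter hypotheses win as m grows.
      for m in (1, 3, 4, 10):
          print(m, best_hypothesis(([0, 2, 4] * 4)[:m]))

With only a few positives the prior term dominates; as m grows the m ln(1/g(H)) term increasingly penalises over-general hypotheses.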
The tradeoff between size and generality of an hypothesis can be seen in the equation for ln p(H|E) above. The function ln p(H|E) decreases with increases in sz(H) and g(H). Additionally, as m grows, the requirements on generality of an hypothesis become stricter. A function f_m with similar properties was defined in [10] and it was shown there that for every hypothesis H except T there is a value of m such that for all m' > m it is the case that f_{m'}(H) < f_{m'}(T). This result indicates a form of convergence, somewhat different from Gold's identification in the limit.

4 Analysis of expected error

Haussler et al. [7] argue the advantages of analysing expected error over VC-dimension analysis. Analysis of expected error is the approach taken below. It is assumed that class membership of instances is decidable for all hypotheses. Also the hypotheses in ℋ are assumed to be ordered according to decreasing prior probability as H_1, H_2, .... For the purposes of analysis the distribution D_ℋ(H_i) = a/i² is assumed, where a is a normalising constant. This is similar to the prior distribution assumptions used in Progol4.1 [10] and is a smoothed version of a distribution which assigns equal probability to the 2^b hypotheses describable in b bits, where the sum of the probabilities of such hypotheses is 2^{-b}. Within this distribution i has infinite mean and variance. It is also assumed that the hypothesis space contains only targets T for which D_X(T) ≤ 1/2. This assumption, which holds for most target concepts used in Inductive Logic Programming, is not a particularly strong restriction on the hypothesis space since if D_X(T) > 1/2 then clearly the complement X \ T has D_X(X \ T) < 1/2.

The following theorem gives an upper bound on the expected error of an algorithm which learns from positive examples only by maximising the Bayes' posterior probability function over the initial am hypotheses within the space.

Theorem 1 Expected error for positive examples only. Let X be a countable instance space and D_X be a probability distribution over X. Let ℋ ⊆ 2^X be a countable hypothesis space containing at least all finite subsets of X, and for which all H ∈ ℋ have D_X(H) ≤ 1/2. Let D_ℋ be a probability distribution over ℋ. Assume that ℋ has an ordering H_1, H_2, ... such that D_ℋ(H_i) ≥ D_ℋ(H_j) for all j > i. Let D_ℋ(H_i) = a/i² where 1/a = Σ_{i=1}^∞ 1/i² ≈ 1.645. Let ℋ_n = {H_i : H_i ∈ ℋ and i ≤ n}. T is chosen randomly from D_ℋ and E = ⟨x_1, .., x_m⟩ is chosen randomly and independently from D_{X|T}. H_E = {x : x in E}. Let f_m(H) = m ln(1/g(H)) - sz(H) and let n = am. L is the following learning algorithm. If there are no hypotheses H ∈ ℋ_n consistent with E (i.e. with x_i ∈ H for each x_i in E) then L(E) = H_E. Otherwise L(E) = ℋ_n(E) = H only if H ∈ ℋ_n, H is consistent with E and for all H' ∈ ℋ_n consistent with E it is the case that f_m(H) ≥ f_m(H'). The error of an hypothesis H is defined as Error(H,T) = D_X(T \ H) + D_X(H \ T). The expected error of L after m examples, EE(m), is at most (2.33 + 2 ln m)/m.

Proof. Given in Appendix A. □

Note that this result is independent of the choice of D_X and that L considers only O(m) hypotheses to achieve an expected error of O((ln m)/m). For comparison a similar algorithm which learns from a mixture of positive and negative examples is analysed for the same choice of D_ℋ.

Theorem 2 Expected error for positive and negative examples. Let X be a countable instance space and ℋ ⊆ 2^X be a countable hypothesis space containing at least all finite subsets of X. Let D_ℋ, D_X be probability distributions over ℋ and X. Assume that ℋ has an ordering H_1, H_2, ... such that D_ℋ(H_i) ≥ D_ℋ(H_j) for all j > i. Let D_ℋ(H_i) = a/i² where 1/a = Σ_{i=1}^∞ 1/i². Let ℋ_n = {H_i : H_i ∈ ℋ and i ≤ n}. T is chosen randomly from D_ℋ. Let ex(x,H) = ⟨x,v⟩ where v = True if x ∈ H and v = False otherwise. Let E = ⟨ex(x_1,T), .., ex(x_m,T)⟩ where each x_i is chosen randomly and independently from D_X. H_E = {x : ⟨x,True⟩ in E}.
Hypothesis H is said to be consistent with E if and only if x_i ∈ H for each ⟨x_i,True⟩ in E and x_j ∉ H for each ⟨x_j,False⟩ in E. Let f(H) = D_ℋ(H) and let n = am. L is the following learning algorithm. If there are no hypotheses H ∈ ℋ_n consistent with E then L(E) = H_E. Otherwise L(E) = ℋ_n(E) = H only if H ∈ ℋ_n, H is consistent with E and for all H' ∈ ℋ_n consistent with E it is the case that f(H) ≥ f(H'). The error of an hypothesis H is defined as Error(H,T) = D_X(T \ H) + D_X(H \ T). The expected error of L after m examples, EE(m), is at most (1.51 + 2 ln m)/m.

Proof. Given in Appendix A. □

Note that this is within a small additive term of the bound for learning from positive examples only. Again the result is independent of the choice of D_X and again L considers only O(m) hypotheses to achieve an expected error of O((ln m)/m).

5 Implementation

The Bayes' function f_m has been implemented to guide the search of the ILP system Progol [10] when learning from positive examples only. The new version, Progol4.2, is available by anonymous ftp from ftp.comlab.ox.ac.uk in directory pub/Packages/ILP/progol4.2. The earlier version, Progol4.1, uses a cover-set algorithm to construct the set of clauses, but for each clause does a pruned admissible search to maximise compression. Progol4.2 has a similar overall search algorithm, but when constructing each clause carries out an admissible search which optimises a global estimate of f_m for the complete theory containing the clause under construction. The basis for this global estimate is as follows. Suppose a clause C_i has been constructed as the ith clause of an overall theory (set of clauses) H_n = {C_1, .., C_n}. It is found that H_i = {C_1, .., C_i} implies p more of the m positive examples than H_{i-1}. Figure 1 (generality of partial theories) shows this situation with respect to the sample distribution D_X. According to the Law of Large Numbers, when m is large

  (g(H_i) - g(H_{i-1})) / g(H_n) ≈ p/m

and therefore

  g(H_n) ≈ (m/p) (g(H_i) - g(H_{i-1})).

The surprising conclusion is that for large m it is possible to estimate the generality of H_n from p, m, g(H_i) and g(H_{i-1}). By assuming that the size of an hypothesis can be measured in bits for any hypothesis and that the number of examples covered per bit of an hypothesis is approximately uniform the following should also hold.

  sz(H_n) ≈ (m/p) sz(C_i)

In Progol4.2 the value of sz(C_i) is measured crudely as the number of atoms in the clause. Since it is possible to estimate both sz(H_n) and g(H_n) during the local construction of each individual clause, it is possible to maximise an estimate of f_m(H_n) during the construction of each of the clauses. The polynomial time-complexity bounds on the search carried out by Progol4.1 [10] are unaltered for Progol4.2.

5.1 Estimation of g(H_i)

The function g(H_i) in the above is estimated in Progol4.2 using Stochastic Logic Programs (SLPs) [11]. An SLP is a range-restricted logic program P with numeric labels associated with each of the clauses. An SLP can be used to randomly derive elements of the Herbrand Base of P using SLD derivation with a stochastic selection rule. In order to estimate g(H_i) an SLP, representing D_X, is used to randomly generate a sample of size s from the domain of the predicate p/n being learned. If s' of these instances are entailed by p/n then the Laplace corrected estimate of g(H_i) is (s'+1)/(s+2). In order to construct the SLP for the domain of p/n, Progol4.2 uses the modeh declaration of p/n (see [10]).
For instance, suppose in a chess domain the mode declaration is

  modeh(1,move(+piece,pos(+file,+rank),pos(+file,+rank))).

Then Progol4.2 will construct the following generating clause for the domain.

  '*move'(A,pos(B,C),pos(D,E)) :-
      piece(A), file(B), rank(C), file(D), rank(E).

The clauses of the SLP consist of the above clause and the definitions of piece/1, file/1 and rank/1. The labels for the SLP are built up by recording the total number of times each clause is visited in the derivations of the positive examples of move/3. In this way it is possible to estimate the distribution D_X from the examples themselves. For instance, in the example set we might find that half the examples involve the queen, a quarter involve rooks and the other quarter involve bishops. When randomly generating examples from the conditioned SLP these proportions are maintained.

6 Experiment

6.1 Experimental hypotheses

There remains a question as to how appropriate the assumptions made in Theorems 1 and 2 are in practice. In this section the predictive accuracy of three reasonably typical ILP target theories is compared against the theoretical predictions. If the distributional assumptions made are correct then we would expect that the expected error functions of Theorems 1 and 2 should bound the mean error of a sample of real-world domains. The experiments described in this section will test the following two hypotheses.

1. Upper bound. In every domain EE(m) ≤ (2.33 + 2 ln m)/m.

2. Positive versus positive and negative data. In every domain error is of a similar order when learning from positive examples only compared to learning from a mixture of positive and negative examples.

6.2 Materials

The experimental hypotheses will be tested using Progol4.2 on the following target theories.

Animal taxonomy. Figure 2 shows the target and form of examples for the animal taxonomy domain.

KRK illegality. Figure 3 shows the target and form of examples for the KRK illegality domain (originally described in [12]).

Natural language grammar. Figure 4 shows the target and form of examples for the natural language grammar domain.

Example sets and background knowledge for the domains above are available from the ftp site described in Section 5.

Examples.
  class(dog,mammal).      class(dolphin,mammal).
  class(trout,fish).      class(eel,fish).
  class(lizard,reptile).  class(snake,reptile).
  class(eagle,bird).      class(penguin,bird).
Target.
  class(A,mammal) :- has_milk(A).
  class(A,fish) :- has_gills(A).
  class(A,bird) :- has_covering(A,feathers).
  class(A,reptile) :- has_covering(A,scales), not has_gills(A).

Figure 2: Animal taxonomy

Examples.
  illegal(3,5,6,7,6,2).  illegal(5,1,2,1,2,1).
  illegal(3,6,7,6,7,4).  illegal(4,3,1,1,4,2).
Target.
  illegal(A,B,A,B,_,_).
  illegal(_,_,A,B,A,B).
  illegal(A,B,_,_,C,D) :- adj(A,C), adj(B,D).
  illegal(A,_,B,_,B,_) :- not A=B.
  illegal(_,A,_,B,_,B) :- not A=B.
  illegal(_,A,B,C,B,D) :- A<C, A<D.
  illegal(_,A,B,C,B,D) :- A>C, A>D.
  illegal(A,_,B,C,D,C) :- A<B, A<D.
  illegal(A,_,B,C,D,C) :- A>B, A>D.

Figure 3: KRK illegality

Examples.
  s([every,nice,dog,barks],[]).
  s([the,man,hits,the,ball,at,the,house],[]).
  s([the,dog,walks,to,the,man],[]).
Target.
  s(A,B) :- np(A,C), iverb(C,B).
  s(A,B) :- np(A,C), vp(C,D), np(D,B).
  s(A,B) :- np(A,C), tverb(C,D), np(D,E), prep(E,F), np(F,B).

Figure 4: Natural language grammar

6.3 Method

For the first two domains instances were generated randomly using the appropriate SLP (see Section 5.1) with uniform values of labels on all clauses.
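As an illustration of this style of sampling, and of the Laplace-corrected estimate of g(H_i) from Section 5.1, the following Python fragment is a schematic stand-in for the SLP mechanism (Progol4.2 itself performs stochastic SLD derivations over the labelled logic program). The clause labels follow the queen/rook/bishop proportions of the example above, while the hypothesis rook_file_move and all function names are hypothetical.

  import random

  # Schematic stand-in for an SLP over the move/3 domain.  The piece "labels"
  # play the role of clause visit counts (half queens, a quarter rooks, a
  # quarter bishops); files and ranks are sampled uniformly.
  PIECE_COUNTS = {"queen": 2, "rook": 1, "bishop": 1}
  FILES = "abcdefgh"
  RANKS = range(1, 9)

  def weighted_choice(counts, rng):
      # Pick a key with probability proportional to its label (visit count).
      return rng.choices(list(counts), weights=list(counts.values()), k=1)[0]

  def sample_instance(rng):
      # Randomly derive one element of the domain of move/3, SLP-style.
      return (weighted_choice(PIECE_COUNTS, rng),
              (rng.choice(FILES), rng.choice(RANKS)),
              (rng.choice(FILES), rng.choice(RANKS)))

  def estimate_g(entails, s, rng):
      # Laplace-corrected estimate (s'+1)/(s+2) of g(H) from s sampled instances.
      s_prime = sum(entails(sample_instance(rng)) for _ in range(s))
      return (s_prime + 1) / (s + 2)

  if __name__ == "__main__":
      rng = random.Random(0)

      def rook_file_move(inst):
          # Hypothetical partial hypothesis: rook moves that stay on one file.
          piece, (f1, _), (f2, _) = inst
          return piece == "rook" and f1 == f2

      print(estimate_g(rook_file_move, s=10000, rng=rng))

The larger the sample s, the closer the estimate approaches the true generality of the hypothesis under the conditioned distribution.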
In the grammar domain it was found that only around 4 in 10,000 randomly generated sentences were positive examples of the target grammar T. Thus the distribution D_X was skewed so that D_X(T) = D_X(X \ T) = 0.5. In all domains instances were classified according to the target theory in order to construct training and test sets. In the case of learning from positive examples only, training sets had all negative examples removed. For each domain Progol4.2 was tested on 1) learning from positive examples only and 2) learning from a mixture of positive and negative examples. In both cases m was varied according to the series m = 5, 10, 20, 40, 80, 160, 320, 640, 1280. For each size of sample the predictive accuracy of the hypothesis returned by Progol4.2 was estimated on a test set of size 10,000. For each m the estimate of predictive accuracy was averaged over 10 resamplings of the same sized training set. The series was discontinued for a particular domain if the estimated error was 0 for several successive values of m.

7 Results

7.1 Predictive accuracy versus bound

The results of testing the first experimental hypothesis (expected error upper bound) are graphed in Figures 5, 6 and 7. Labels on these graphs have the following meanings.

P. The predictive accuracy of learning from positive examples only, shown as the mean and standard deviation (error bars) of the 10 retrials for each value of m.

L(P). The theoretical lower bound on positive examples only accuracy from Theorem 1 (Accuracy = 100(1 - EE(m))).

M. Majority class for the domain (100 D_X(T)).

Since each data point in each of the three domains lies above the bound, the first experimental hypothesis of Section 6 is confirmed.¹

¹ The non-monotonic behaviour of P in Figure 5 was found to be caused by large fluctuations in errors of commission. This is due to the gradual allowance of larger theories by the Bayes' function as m grows, together with the fact that generality does not vary monotonically with increasing size of a clausal theory.

Figure 5: Predictive accuracy versus bound for animal taxonomy (predictive accuracy against training set size)

Figure 6: Predictive accuracy versus bound for KRK illegal (predictive accuracy against training set size)

Figure 7: Predictive accuracy versus bound for natural language grammar (predictive accuracy against training set size)

7.2 Positive versus positive and negative

The results of testing the second experimental hypothesis (similar expected error for positive versus positive and negative) are graphed in Figures 8, 9 and 10. Labels on these graphs have the following meanings.

P. The predictive accuracy of learning from positive examples only, shown as the mean of the 10 retrials for each value of m.

P+N. The predictive accuracy of learning from a mixture of positive and negative examples, shown as the mean of the 10 retrials for each value of m.

L(P+N). The theoretical lower bound on accuracy when learning from a mixture of positive and negative examples, from Theorem 2 (Accuracy = 100(1 - EE(m))).

M. Majority class for the domain (100 D_X(T)).

In the taxonomy and grammar domains (Figures 8 and 10) learning from positive examples only requires consistently fewer examples for any given error level than learning from a mixture of positive and negative examples. In the KRK-illegality domain the converse is true. In every domain accuracy for all values of m is comparable when learning from positive examples only compared to learning from a mixture of positive and negative examples.
This confirms the experimental hypothesis.

Figure 8: Positives versus positives and negatives for animal taxonomy (predictive accuracy against training set size)

Figure 9: Positives versus positives and negatives for KRK illegal (predictive accuracy against training set size)

Figure 10: Positives versus positives and negatives for natural language grammar (predictive accuracy against training set size)

8 Conclusion

In 1967 Gold demonstrated negative results for learnability in the limit of various classes of formal languages. This has provided a strong impetus for the investigation of constrained hypothesis languages, within which learning from positive examples is possible. For instance, Plotkin [15] demonstrated the existence of unique least general generalisations of positive examples represented as first-order clauses. Biermann and Feldman [2] and later Angluin [1] demonstrated that certain parameterised subsets of the regular languages could be identified in the limit from positive examples only. Within the framework of PAC-learning Valiant demonstrated [19] that k-CNF propositional logic formulae are learnable from positive examples. More recently Shinohara [18] demonstrated that certain size-bounded classes of elementary formal systems are identifiable in the limit from positive examples.

Unlike the approaches above, the techniques used in this paper for learning from positive examples are representation independent. That is to say, the representation of hypotheses does not play a part either in the development of the Bayes' function (Section 3) or the analysis of expected error (Section 4). It might legitimately be claimed that two strong assumptions are made in Section 2: 1) that the learner knows D_ℋ and 2) that the learner can estimate D_X by conditioning a Stochastic Logic Program. The second assumption seems less pernicious since it only requires a logic program which defines the Herbrand base. Further theoretical research is required to analyse the effect of discrepancy between the learner's prior p(H) and the teacher's distribution D_ℋ.

Theorems 1 and 2 define bounds on expected error for randomly chosen targets and example sets. To machine learning practitioners it might seem surprising that these bounds are a function of only m, the number of examples. However, the bounds are achieved by considering expectations over randomly chosen targets from a given distribution, D_ℋ(H_i) = a/i². Experiments could be devised in which targets were randomly chosen from such a distribution, though this would tell us nothing about expected error in real domains. The experiments in Section 6 were devised to check whether the assumptions within Theorems 1 and 2 were appropriate for typical domains to which the ILP system Progol has been applied. Owing to limitations of time it was only feasible to test three such domains, for which the bounds held in all three cases. Further testing of domains is necessary to determine whether the bounds hold on average for existing Progol datasets.

Various researchers including [3, 6] have advocated and demonstrated the use of Bayesian analysis in machine learning. The success of the Bayesian solution to learning from positive examples reinforces this trend.
Several techniques [16, 17, 8] for learning from positive examples only have been investigated within Inductive Logic Programming. However, all these approaches differ from this paper in assuming some form of completeness within the example set.

In the light of the results in this paper it would seem worth reconsidering the degree of support that Gold's learnability results provide for Chomskian linguistics. Clearly, Chomsky's theory of innate linguistic ability is consistent with the results in this paper. However, the results in this paper show that weaker assumptions concerning the innate properties of natural language can be made than those that Gold's results might seem to suggest.

Acknowledgements

The author would like to thank David Haussler for influential discussions on the topic of learning from positive examples. The author's investigations with David Haussler of PAC upper-bound results for learning from positive examples will be reported elsewhere. Many thanks are due to my wife, Thirza Castello-Cortes, who has shown me great support during the writing of this paper. The author is grateful to Nick Chater of the Experimental Psychology Department in Oxford for pointing out the relevant literature on language learning in children. Thanks also for useful discussions on the topics in this paper with Donald Michie, John McCarthy, Tony Hoare, David Page, Ashwin Srinivasan and James Cussens. This work was supported partly by the Esprit Long Term Research Action ILP II (project 20237), EPSRC grant GR/J46623 on Experimental Application and Development of ILP, EPSRC grant GR/K57985 on Experiments with Distribution-based Machine Learning and an EPSRC Advanced Research Fellowship held by the author. The author is also supported by a Research Fellowship at Wolfson College Oxford.

References

[1] D. Angluin. Inference of reversible languages. Journal of the ACM, 29:741-765, 1982.
[2] A.W. Biermann and J.A. Feldman. On the synthesis of finite-state machines from samples of their behaviour. IEEE Transactions on Computers, C-21:592-597, 1972.
[3] W. Buntine. A Theory of Learning Classification Rules. PhD thesis, School of Computing Science, University of Technology, Sydney, 1990.
[4] N. Chomsky. Knowledge of language: its nature, origin and use. Praeger, New York, 1986. First published 1965.
[5] E.M. Gold. Language identification in the limit. Information and Control, 10:447-474, 1967.
[6] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In COLT-91: Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 61-74, San Mateo, CA, 1991. Morgan Kaufmann.
[7] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning Journal, 14(1):83-113, 1994.
[8] R.J. Mooney and M.E. Califf. Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3:1-24, 1995.
[9] S. Muggleton. Bayesian inductive logic programming. In M. Warmuth, editor, Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 3-11, New York, 1994. ACM Press.
[10] S. Muggleton. Inverse entailment and Progol. New Generation Computing, 13:245-286, 1995.
[11] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in Inductive Logic Programming. IOS Press/Ohmsha, 1996.
[12] S. Muggleton, M.E. Bain, J.
Hayes-Michie, and D. Michie. An experimental comparison of human and machine learning formalisms. In Proceedings of the Sixth International Workshop on Machine Learning, Los Altos, CA, 1989. Kaufmann.
[13] S. Muggleton and C.D. Page. A learnability model for universal representations. Technical Report PRG-TR-3-94, Oxford University Computing Laboratory, Oxford, 1994.
[14] S. Pinker. Language learnability and language development. Harvard University Press, Cambridge, Mass., 1984.
[15] G.D. Plotkin. A note on inductive generalisation. In B. Meltzer and D. Michie, editors, Machine Intelligence 5, pages 153-163. Edinburgh University Press, Edinburgh, 1969.
[16] J.R. Quinlan and R.M. Cameron. Induction of logic programs: FOIL and related systems. New Generation Computing, 13:287-312, 1995.
[17] L. De Raedt and M. Bruynooghe. A theory of clausal discovery. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1993.
[18] T. Shinohara. Inductive inference of monotonic formal systems from positive data. In Proceedings of the First International Workshop on Algorithmic Learning Theory, Tokyo, 1990. Ohmsha.
[19] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984.

A Proof of Theorems 1 and 2

Theorem 1 Expected error for positive examples only. Let X be a countable instance space and D_X be a probability distribution over X. Let ℋ ⊆ 2^X be a countable hypothesis space containing at least all finite subsets of X, and for which all H ∈ ℋ have D_X(H) ≤ 1/2. Let D_ℋ be a probability distribution over ℋ. Assume that ℋ has an ordering H_1, H_2, ... such that D_ℋ(H_i) ≥ D_ℋ(H_j) for all j > i. Let D_ℋ(H_i) = a/i² where 1/a = Σ_{i=1}^∞ 1/i² ≈ 1.645. Let ℋ_n = {H_i : H_i ∈ ℋ and i ≤ n}. T is chosen randomly from D_ℋ and E = ⟨x_1, .., x_m⟩ is chosen randomly and independently from D_{X|T}. H_E = {x : x in E}. Let f_m(H) = m ln(1/g(H)) - sz(H) and let n = am. L is the following learning algorithm. If there are no hypotheses H ∈ ℋ_n consistent with E (i.e. with x_i ∈ H for each x_i in E) then L(E) = H_E. Otherwise L(E) = ℋ_n(E) = H only if H ∈ ℋ_n, H is consistent with E and for all H' ∈ ℋ_n consistent with E it is the case that f_m(H) ≥ f_m(H'). The error of an hypothesis H is defined as Error(H,T) = D_X(T \ H) + D_X(H \ T). The expected error of L after m examples, EE(m), is at most (2.33 + 2 ln m)/m.

Proof. Consider the case in which T ∈ ℋ_n (case 1). Then by definition L(E) = ℋ_n(E) = H. Since H = ℋ_n(E) has the maximum value of f_m in ℋ_n it follows that

  (1/D_X(H))^m D_ℋ(H) ≥ (1/D_X(T))^m D_ℋ(T).   (1)

Also D_X(T ∩ H) = D_X(T) - D_X(T \ H) = D_X(H) - D_X(H \ T) and therefore

  D_X(H) - D_X(T) = D_X(H \ T) - D_X(T \ H).   (2)

Furthermore consider the case in which D_X(H) ≥ D_X(T) (case 1a). Rearranging (1) and taking logs we get

  m(ln D_X(H) - ln D_X(T)) ≤ ln D_ℋ(H) - ln D_ℋ(T).   (3)

Let r = (ln D_X(H) - ln D_X(T)) / (D_X(H) - D_X(T)). We now show that for 0 < D_X(T) ≤ D_X(H) ≤ 1/2 it is the case that r ≥ 2. First note that r decreases monotonically in D_X(T) since

  dr/dD_X(T) = (1/(D_X(H) - D_X(T))²) (ln(D_X(H)/D_X(T)) - D_X(H)/D_X(T) + 1) ≤ 0

for D_X(H) ≥ D_X(T). But as D_X(T) approaches 1/2 it also approaches D_X(H), and

  lim_{D_X(T)→D_X(H)} r = d(ln D_X(T))/dD_X(T) = 1/D_X(H) ≥ 2.

Thus r ≥ 2 for 0 < D_X(T) ≤ D_X(H) ≤ 1/2, from which it follows that ln D_X(H) - ln D_X(T) ≥ 2(D_X(H) - D_X(T)). Combining this with (2) and (3) gives 2m(D_X(H \ T) - D_X(T \ H)) ≤ ln D_ℋ(H) - ln D_ℋ(T) and therefore

  D_X(H \ T) ≤ (-ln D_ℋ(T))/(2m) + D_X(T \ H).   (4)

Now consider the case in which D_X(H) < D_X(T) (case 1b). From (2) it follows that D_X(H \ T) < D_X(T \ H). Thus (4) holds in both case 1a and case 1b, and therefore from the definition of Error in the theorem for all of case 1 we get

  Error(ℋ_n(E), T) ≤ (-ln D_ℋ(T))/(2m) + 2 D_X(T \ H).   (5)

Lastly consider the case in which T ∉ ℋ_n (case 2).
In this case we have the trivial bound Error(H,T) ≤ 1. Combining the two cases, the expected error decomposes as

  EE(m) = Σ_{T∈ℋ} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(L(E),T)
        ≤ Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(ℋ_n(E),T) + Σ_{T∈ℋ\ℋ_n} D_ℋ(T) · 1
        ≤ a/n + Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(ℋ_n(E),T).

By (5) the remaining sum has two parts. For the first, Σ_{T∈ℋ} D_ℋ(T)(-ln D_ℋ(T)) = -ln a + 2a Σ_{i=1}^∞ (ln i)/i² ≤ 1.64, so its contribution is at most 0.82/m. For the second, since D_X(T) ≤ 1/2 and D_X(T \ H) > ε D_X(T) implies that the probability of all of x_1, .., x_m falling in T ∩ H is at most (1-ε)^m,

  E[2 D_X(T \ H)] ≤ 2ε D_X(T) + Pr(∃H ∈ ℋ_n . D_X(T \ H) > ε D_X(T) and x_1, .., x_m ∈ (T ∩ H))
                 ≤ ε + n(1-ε)^m
                 ≤ ε + n e^{-εm}.

Thus

  EE(m) ≤ a/n + 0.82/m + ε + n e^{-εm}.

Optimal values of n and ε are found by successively setting to zero the partial derivatives in n and ε and solving. This gives ε = (ln nm)/m and n = am. Substituting gives

  EE(m) ≤ (1 + 0.82 + 2 ln m + ln a + 1)/m ≤ (2.33 + 2 ln m)/m. □

Theorem 2 Expected error for positive and negative examples. Let X be a countable instance space and ℋ ⊆ 2^X be a countable hypothesis space containing at least all finite subsets of X. Let D_ℋ, D_X be probability distributions over ℋ and X. Assume that ℋ has an ordering H_1, H_2, ... such that D_ℋ(H_i) ≥ D_ℋ(H_j) for all j > i. Let D_ℋ(H_i) = a/i² where 1/a = Σ_{i=1}^∞ 1/i². Let ℋ_n = {H_i : H_i ∈ ℋ and i ≤ n}. T is chosen randomly from D_ℋ. Let ex(x,H) = ⟨x,v⟩ where v = True if x ∈ H and v = False otherwise. Let E = ⟨ex(x_1,T), .., ex(x_m,T)⟩ where each x_i is chosen randomly and independently from D_X. H_E = {x : ⟨x,True⟩ in E}. Hypothesis H is said to be consistent with E if and only if x_i ∈ H for each ⟨x_i,True⟩ in E and x_j ∉ H for each ⟨x_j,False⟩ in E. Let f(H) = D_ℋ(H) and let n = am. L is the following learning algorithm. If there are no hypotheses H ∈ ℋ_n consistent with E then L(E) = H_E. Otherwise L(E) = ℋ_n(E) = H only if H ∈ ℋ_n, H is consistent with E and for all H' ∈ ℋ_n consistent with E it is the case that f(H) ≥ f(H'). The error of an hypothesis H is defined as Error(H,T) = D_X(T \ H) + D_X(H \ T). For n = am the expected error of L after m examples, EE(m), is at most (1.51 + 2 ln m)/m.

Proof. Let T^m = {⟨ex(x_1,T), .., ex(x_m,T)⟩ : x_i ∈ X}. The expected error can be bounded in a similar way to that used in the proof of Theorem 1.

  EE(m) = Σ_{T∈ℋ} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(L(E),T)
        ≤ Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(ℋ_n(E),T) + Σ_{T∈ℋ\ℋ_n} D_ℋ(T) · 1
        ≤ a/n + Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(ℋ_n(E),T)

Letting T^m_n(ε) = {E' : E' ∈ T^m and Error(ℋ_n(E'),T) ≤ ε} gives

  Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m} D_X(E|T) Error(ℋ_n(E),T)
    = Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m_n(ε)} D_X(E|T) Error(ℋ_n(E),T)
      + Σ_{T∈ℋ_n} D_ℋ(T) Σ_{E∈T^m\T^m_n(ε)} D_X(E|T) Error(ℋ_n(E),T)
    ≤ ε + Pr(∃H ∈ ℋ_n . Error(H,T) > ε and H consistent with E)
    ≤ ε + n(1-ε)^m
    ≤ ε + n e^{-εm}

Thus

  EE(m) ≤ a/n + ε + n e^{-εm}.

Again optimal values of n and ε are found by successively setting to zero the partial derivatives in n and ε and solving. This gives ε = (ln nm)/m and n = am. Substituting gives

  EE(m) ≤ (1 + 2 ln m + ln a + 1)/m ≤ (1.51 + 2 ln m)/m. □
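As a numerical check of the optimisation step used in both proofs (a sketch only; the code and its names are not part of the original text), the Python fragment below evaluates the Theorem 1 bound a/n + 0.82/m + ε + n e^{-εm} at ε = (ln nm)/m and n = am and compares it with the closed form (2.33 + 2 ln m)/m.

  import math

  # a is the normalising constant of the prior D_H(H_i) = a/i^2, so
  # 1/a = sum_{i>=1} 1/i^2 = pi^2/6, approximately 1.645.
  A = 6.0 / math.pi ** 2

  def theorem1_bound(m, n, eps):
      # Pre-optimisation bound from the proof of Theorem 1.
      return A / n + 0.82 / m + eps + n * math.exp(-eps * m)

  def closed_form(m):
      # Bound quoted in the statement of Theorem 1.
      return (2.33 + 2.0 * math.log(m)) / m

  if __name__ == "__main__":
      for m in (5, 20, 80, 320, 1280):
          n = A * m                      # n = am
          eps = math.log(n * m) / m      # eps = ln(nm)/m
          print(f"m={m:5d}  bound={theorem1_bound(m, n, eps):.4f}"
                f"  (2.33 + 2 ln m)/m={closed_form(m):.4f}")

At these values a/n and n e^{-εm} each contribute 1/m, so the total is (2 + 0.82 + ln a + 2 ln m)/m, which falls slightly below (2.33 + 2 ln m)/m since ln a is approximately -0.50.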