1 Theory of Machine Learning
Peter Flach, Machine Learning, pp. 124-126; Tom Mitchell, Machine Learning, Chapter 7
We seek a theory that relates:
•  Probability of successful learning
•  Number of training examples
•  Complexity of the hypothesis space
•  Accuracy to which the target concept is approximated
•  Manner in which the training examples are presented

2 Two roles for Bayesian methods
Tom Mitchell, Machine Learning, Chapter 6
•  Provide practical learning algorithms
•  Provide a conceptual framework:
–  a gold standard for evaluating other learning algorithms
–  insight into Occam's razor

3 Brute-Force MAP Hypothesis Learner
H … hypothesis space, D … training data, hMAP … maximum a posteriori hypothesis:
hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)

4 Bias–variance dilemma
•  A low-complexity model suffers less from variability due to random variations in the training data, but may introduce a systematic bias that even large amounts of training data cannot resolve.
–  Example: a linear model fit to data with a non-linear target.
•  On the other hand, a high-complexity model eliminates such bias but can suffer non-systematic errors due to variance.
–  Example: an unpruned decision tree grown to fit every training example.

5 What Machine Learning Is

6 Ensembles
Based on Ray Mooney, CS 391L, University of Texas at Austin

7 Learning Ensembles
•  Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
•  Combine the decisions of the multiple definitions, e.g. using weighted voting.
[Diagram: Training Data → Data1, Data2, …, Data_m → Learner1, Learner2, …, Learner_m → Model1, Model2, …, Model_m → Model Combiner → Final Model]

8 Value of Ensembles
•  When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
•  Human ensembles are demonstrably better:
–  How many jelly beans in the jar? Individual estimates vs. group average.
–  Who Wants to Be a Millionaire: expert friend vs. audience vote.

9 Homogeneous Ensembles
•  Use a single, arbitrary learning algorithm but manipulate the training data so that it learns multiple models.
–  Data1 ≠ Data2 ≠ … ≠ Data_m
–  Learner1 = Learner2 = … = Learner_m
•  Different methods for changing the training data:
–  Bagging: resample the training data
–  Boosting: reweight the training data

10 Bagging
•  Create ensembles by repeatedly, randomly resampling the training data (Breiman, 1996).
•  Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement.
–  Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates.
•  Combine the m resulting models using a simple majority vote.
•  Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.

11 Bagging: Algorithms
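A minimal sketch of the bagging procedure just described (our illustration, not the slides' algorithm), assuming scikit-learn and NumPy are available; the synthetic dataset and decision trees as the unstable base learner are assumed choices:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=200, random_state=0)  # assumed toy data
    n, m = len(X), 25                   # n training examples, m bootstrap models

    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)  # draw n examples with replacement; each
                                          # sample holds ~63.2% of unique examples
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    votes = np.stack([t.predict(X) for t in models])    # (m, n) label predictions
    majority = (votes.mean(axis=0) > 0.5).astype(int)   # simple majority vote
    print("bagged ensemble training accuracy:", (majority == y).mean())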
12 Boosting
•  Originally developed by computational learning theorists to guarantee performance improvements on fitting the training data for a weak learner that only needs to generate a hypothesis with training accuracy greater than 0.5 (Schapire, 1990; Gödel Prize).

13 Boosting
•  Revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).
•  Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on the examples that the most recently learned classifier got wrong.

14 Boosting: Basic Algorithm
•  General loop:
   Set all examples to have equal uniform weights.
   For t from 1 to T:
      Learn a hypothesis h_t from the weighted examples.
      Decrease the weights of the examples h_t classifies correctly.
•  The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.
•  During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.
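To make this loop concrete, here is a hedged AdaBoost-style sketch (ours, not the slides' pseudocode), following Freund & Schapire's reweighting and vote-weight formulas; decision stumps are an assumed choice of weak learner:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=200, random_state=1)  # assumed toy data
    y = 2 * y01 - 1                          # relabel classes to {-1, +1}
    n, T = len(X), 20
    w = np.full(n, 1.0 / n)                  # equal uniform weights to start

    hyps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()             # weighted training error of h_t
        if err >= 0.5:                       # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w = w * np.exp(-alpha * y * pred)    # shrink weights h_t got right,
        w = w / w.sum()                      # grow the ones it got wrong
        hyps.append(stump)
        alphas.append(alpha)

    # each hypothesis votes with weight alpha_t, derived from its training error
    F = sum(a * h.predict(X) for a, h in zip(alphas, hyps))
    print("boosted ensemble training accuracy:", (np.sign(F) == y).mean())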
15 Note on ensemble construction
•  Ensemble construction can itself be framed as a learning problem:
•  given the predictions of some base classifiers as features, learn a meta-model that best combines their predictions.
•  E.g. in bagging: which classifiers to use and with what weights (weighted voting).
•  In boosting we could learn the vote weights rather than deriving them from each base model's error rate.

16 Random Forests
●  An ensemble of classification or regression random trees.
●  Each random tree is constructed from
–  a different bootstrap sample of the original data,
–  using only a subset of the features.
●  About 1/3 of the examples are left out of each bootstrap sample (a consequence of sampling with replacement); this OOB (out-of-bag) data is used for classification error estimation.
●  Prediction by majority voting; a variant of bagging.

17 Ensembles and the bias–variance dilemma
•  Bagging decreases variance: ideally, for independent ensemble members, variance -> variance / num_of_ensemble_members.
•  Boosting decreases bias (as the hypothesis complexity of the ensemble increases).

18 Rule learning
Based partially on J. Fürnkranz's ML course, U. Darmstadt

19 Example
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
…

20 From trees to rules

21 C4.5rules

Introduction to Inductive Logic Programming
In collaboration with Olga Štěpánková

Example: Can we recognize robots after a short experience?
[Figure: example robots, grouped as friendly vs. unfriendly]

Example: Robots and an attribute-value description

head       smile  neck       body       in hand  friendly
Circle     no     Tie        Rectangle  Sword    no
Rectangle  yes    Butterfly  Rectangle  Nothing  yes
Circle     no     Butterfly  Circle     Sword    yes
Triangle   no     Tie        Rectangle  Ball     no
Circle     yes    Nothing    Triangle   Flower   no
Triangle   no     Nothing    Triangle   Ball     yes
Triangle   yes    Tie        Circle     Nothing  no
Circle     yes    Tie        Circle     Nothing  yes

Example: hypothesis and testing
Example: hypothesis and testing (cont.)

When is an attribute-value representation insufficient?
•  Examples do not have a uniform description (e.g. they are of different lengths)
•  The structure of the examples is important
•  Domain knowledge is (multi-)relational

Inductive logic programming: basic task (Muggleton, 1994)
Given: a set of positive examples E+ and negative examples E-, and domain knowledge B (a logic program).
Goal: find a logic program P that, together with B, covers (almost) all positive examples and (almost) no negative examples.
+: much more flexible; data of any structure can be processed
-: some effort needed; more time-consuming (even though << neural networks)

Example: find a path in a directed graph
path(X,Y) :- edge(X,Y).
path(X,Y) :- path(X,U), edge(U,Y).
edge(1,2). edge(1,3). edge(2,3). edge(2,4). … = domain knowledge

Specialization and generalization
A formula G is a specialization of a formula F iff F is a logical consequence of G, G |= F (any model of G is also a model of F).

Specialization operator (refinement operator)
Assigns to a clause the set of all its specializations. Most ILP systems use two basic specialization operations:
•  binding two variables: spec(path(X,Y)) = path(X,X)
•  adding a goal to the clause body: spec(path(X,Y)) = (path(X,Y) :- edge(U,V))
and also:
•  substituting a constant for a variable: spec(number(X)) = number(0)
•  substituting a most general term for a variable: spec(number(X)) = number(s(Y)).

Example: path(From,To) in a graph
Learning set
positive examples: path(1,2). path(1,3). path(1,4). path(2,3).
negative examples: path(2,1). path(2,5).
Domain knowledge
edge(1,2). edge(1,3). edge(2,3). edge(2,4).

Specialization (refinement) tree
path(X,Y).
   path(X,X).
   path(X,Y) :- edge(Z,U).
   path(X,Y) :- path(Z,U).
      path(X,Y) :- edge(X,U).
      path(X,Y) :- path(X,U).
         path(X,Y) :- edge(X,Y).
         …
         path(X,Y) :- path(X,U), edge(V,W).
            path(X,Y) :- path(X,U), edge(X,W).
            …
            path(X,Y) :- path(X,U), edge(U,W).
               …
               path(X,Y) :- path(X,U), edge(U,Y).
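To make the refinement search concrete, here is a minimal Python sketch (ours, not from the slides) that scores a few candidate clauses from the tree against E+ and E-, then checks the final two-clause program. Encoding each clause by hand as a coverage test is an illustrative assumption, not how an ILP system represents clauses:

    # The path/2 learning task from the slides.
    edges = {(1, 2), (1, 3), (2, 3), (2, 4)}          # domain knowledge B
    pos = {(1, 2), (1, 3), (1, 4), (2, 3)}            # E+
    neg = {(2, 1), (2, 5)}                            # E-

    # A few candidate clauses from the refinement tree, each encoded as a
    # function mapping an example (x, y) to True iff the clause covers it.
    candidates = {
        "path(X,X).":              lambda x, y: x == y,
        "path(X,Y) :- edge(X,U).": lambda x, y: any(a == x for (a, b) in edges),
        "path(X,Y) :- edge(X,Y).": lambda x, y: (x, y) in edges,
    }
    for clause, covers in candidates.items():
        p = sum(covers(x, y) for (x, y) in pos)
        n = sum(covers(x, y) for (x, y) in neg)
        print(f"{clause:28s} covers {p}/{len(pos)} pos, {n}/{len(neg)} neg")
    # path(X,X) is too specific (0 positives); edge(X,U) is too general (it
    # covers both negatives); edge(X,Y) covers 3/4 positives and no negatives,
    # so the search keeps it and adds the recursive clause for path(1,4).

    def path(x, y):
        # Final program: path(X,Y) :- edge(X,Y).
        #                path(X,Y) :- path(X,U), edge(U,Y).
        # (The recursion terminates here because the example graph is acyclic.)
        return (x, y) in edges or any(
            path(x, u) for (u, v) in edges if v == y)

    assert all(path(x, y) for (x, y) in pos)      # covers every positive example
    assert not any(path(x, y) for (x, y) in neg)  # and no negative example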