1 What Machine Learning Is

2 Ensembles
Based on Ray Mooney, CS 391L, University of Texas at Austin

3 Bias–Variance Dilemma
•  The bias–variance dilemma: a low-complexity model suffers less from variability due to random variations in the training data, but
•  may introduce a systematic bias that even large amounts of training data cannot resolve.
   –  Example: a linear model fit to a nonlinear target underfits no matter how much data it sees.
•  On the other hand,
•  a high-complexity model eliminates such bias but can suffer non-systematic errors due to variance.
   –  Example: a deep, unpruned decision tree that memorizes noise in the training set.

4 Learning Ensembles
•  Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
•  Combine the decisions of the multiple definitions, e.g. using weighted voting.
[Diagram: Training Data → Data1, Data2, …, Data_m → Learner1, Learner2, …, Learner_m → Model1, Model2, …, Model_m → Model Combiner → Final Model]

5 Value of Ensembles
•  When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
•  Human ensembles are demonstrably better:
   –  How many jelly beans in the jar? Individual estimates vs. the group average.
   –  Who Wants to Be a Millionaire: expert friend vs. audience vote.

6 Stacking
•  Considers heterogeneous weak learners,
•  learns them in parallel, and
•  combines them by training a meta-model to output a prediction based on the different weak models' predictions.
•  The meta-learner can be any learning algorithm (see the sketch after the bagging algorithm below).
•  In this sense, stacking is a general framework for ensemble learning.

7 Homogeneous Ensembles
•  Use a single, arbitrary learning algorithm but manipulate the training data to make it learn multiple models.
   –  Data1 ≠ Data2 ≠ … ≠ Data_m
   –  Learner1 = Learner2 = … = Learner_m
•  Different methods for changing the training data:
   –  Bagging: resample the training data.
   –  Boosting: reweight the training data.

8 Bagging
•  Create ensembles by repeatedly, randomly resampling the training data (Breiman, 1996).
•  Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement.
   –  Each bootstrap sample will on average contain 63.2% of the unique training examples, the rest being replicates: the probability that a given example is never drawn in n draws is (1 − 1/n)^n ≈ 1/e ≈ 0.368.
•  Combine the m resulting models using a simple majority vote.
•  Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.

9 Bagging: Algorithms
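A minimal sketch of the bagging procedure described on slide 8 (base_learn is a placeholder for any learner that takes examples and labels and returns a callable model; the helper names are illustrative, not from the slides):

```python
import random
from collections import Counter

def bagging_train(X, y, base_learn, m):
    """Train m models, each on a bootstrap sample of size n."""
    n = len(X)
    models = []
    for _ in range(m):
        # draw n indices with replacement (~63.2% unique examples on average)
        idx = [random.randrange(n) for _ in range(n)]
        models.append(base_learn([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(x, models):
    # combine the m models using a simple majority vote
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```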
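And a minimal sketch of stacking from slide 6, under the same conventions. The 50/50 holdout split used to train the meta-model is an illustrative choice (k-fold out-of-fold predictions are the more common practice):

```python
def stacking_train(X, y, base_learners, meta_learn):
    """Level-0 models train on one half of the data; the meta-model trains
    on their predictions over the held-out half, to avoid label leakage."""
    k = len(X) // 2
    models = [learn(X[:k], y[:k]) for learn in base_learners]
    # meta-features: one prediction per base model for each held-out example
    meta_X = [[m(x) for m in models] for x in X[k:]]
    meta_model = meta_learn(meta_X, y[k:])
    return models, meta_model

def stacking_predict(x, models, meta_model):
    return meta_model([m(x) for m in models])
```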
10 Boosting
•  Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990; Gödel Prize).

11 Boosting
•  Revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).
•  Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on the examples that the most recently learned classifier got wrong.

12 Boosting: Basic Algorithm
•  General loop:
   Set all examples to have equal, uniform weights.
   For t from 1 to T:
      Learn a hypothesis, h_t, from the weighted examples.
      Decrease the weights of the examples h_t classifies correctly.
•  The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.
•  During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

13 AdaBoost Pseudocode

```python
import math
from collections import defaultdict

def train_adaboost(X, y, base_learn, T):
    n = len(X)
    w = [1.0 / n] * n                       # each example starts with weight 1/|D|
    H = []                                  # hypotheses stored with their beta_t
    for _ in range(T):
        h = base_learn(X, y, w)             # learn from the weighted examples
        preds = [h(xi) for xi in X]
        # error eps_t = total weight of the examples h classifies incorrectly
        eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if eps == 0 or eps > 0.5:           # exit loop (eps == 0 would make beta 0)
            break
        beta = eps / (1 - eps)
        # multiply the weights of the correctly classified examples by beta
        w = [wi * beta if p == yi else wi for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]        # rescale so the total weight stays 1
        H.append((h, beta))
    return H

def test_adaboost(x, H):
    votes = defaultdict(float)
    for h, beta in H:
        votes[h(x)] += math.log(1 / beta)   # each h_t votes with weight log(1/beta_t)
    return max(votes, key=votes.get)        # class with the highest weighted vote
```

14 Note on Ensemble Construction
•  Ensemble construction can itself be defined as a learning problem:
•  given the predictions of some base classifiers as features, learn a meta-model that best combines their predictions.
•  E.g. in bagging: which classifiers to use, and with what weights (weighted voting).
•  In boosting, we could learn the vote weights rather than deriving them from each base model's error rate.

15 Random Forests
•  An ensemble of classification or regression random trees.
•  Each random tree is constructed from a
   –  different bootstrap sample of the original data,
   –  with a random subset of the features.
•  About 1/3 of the samples are left out of each tree's bootstrap sample (a consequence of sampling with replacement); this OOB (out-of-bag) data is used for classification-error estimation.
•  Predictions are combined by majority voting: a variant of bagging (a sketch appears after slide 20 below).

16 Random Forest

17 Learning with Weighted Examples
•  The generic approach is to replicate examples in the training set in proportion to their weights (e.g. 10 replicates of an example with a weight of 0.01 and 100 for one with a weight of 0.1).
•  Most algorithms can be enhanced to efficiently incorporate weights directly in the learning algorithm so that the effect is the same (e.g. implement the WeightedInstancesHandler interface in WEKA).
•  For decision trees, when calculating information gain, on counting example i simply increment the corresponding count by w_i rather than by 1 (a sketch appears after slide 20 below).
•  For kNN and other learners?

18 Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)
•  Ensembles have been used to improve generalization accuracy on a wide variety of problems.
•  On average, boosting provides a larger increase in accuracy than bagging.
•  Boosting can, on rare occasions, degrade accuracy.
•  Bagging more consistently provides a modest improvement.
•  Boosting is particularly subject to over-fitting when there is significant noise in the training data.

19 Issues in Ensembles
•  Parallelism in ensembles: bagging is easily parallelized; boosting is not.
•  Variants of boosting that handle noisy data.
•  How "weak" should a base learner for boosting be?
•  What is the theoretical explanation of boosting's ability to improve generalization?
•  Exactly how does the diversity of ensembles affect their generalization performance?
•  Combining boosting and bagging.

20 Ensembles and the Bias–Variance Dilemma
•  Bagging decreases variance: ideally, variance → variance / (number of ensemble members), as derived below.
•  Boosting decreases bias (as the hypothesis complexity of the ensemble increases).
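A short justification of the variance claim on slide 20, under the idealized assumption that the m ensemble members are independent with equal variance σ² (bootstrap models are in fact correlated, so the real reduction is smaller):

```latex
\operatorname{Var}\!\left(\frac{1}{m}\sum_{t=1}^{m} h_t(x)\right)
  = \frac{1}{m^2}\sum_{t=1}^{m}\operatorname{Var}\bigl(h_t(x)\bigr)
  = \frac{\sigma^2}{m}
```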
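A minimal sketch of the random forest construction from slide 15. Here tree_learn is a placeholder for any decision tree learner, and for brevity each tree uses one fixed random feature subset, whereas Breiman's random forests re-draw a feature subset at every split:

```python
import random
from collections import Counter

def random_forest_train(X, y, tree_learn, m, k):
    """Grow m trees, each from a bootstrap sample restricted to k random
    features, recording each tree's out-of-bag (OOB) indices for error
    estimation."""
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(m):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        oob = sorted(set(range(n)) - set(idx))         # ~1/3 of examples left out
        feats = sorted(random.sample(range(d), k))     # random feature subset
        tree = tree_learn([[X[i][f] for f in feats] for i in idx],
                          [y[i] for i in idx])
        forest.append((tree, feats, oob))
    return forest

def random_forest_predict(x, forest):
    # majority vote over the trees, each seeing only its own features
    votes = Counter(tree([x[f] for f in feats]) for tree, feats, _ in forest)
    return votes.most_common(1)[0][0]
```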
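And a sketch of the weighted-counting idea from slide 17: when computing information gain for a decision tree, each example contributes its weight w_i instead of a count of 1 (function names are illustrative):

```python
import math
from collections import defaultdict

def weighted_entropy(labels, weights):
    """Entropy where each example contributes its weight rather than 1."""
    counts = defaultdict(float)
    for yi, wi in zip(labels, weights):
        counts[yi] += wi                    # increment by w_i, not by 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_info_gain(feature_values, labels, weights):
    """Information gain of a discrete feature under example weights."""
    groups = defaultdict(lambda: ([], []))
    for v, yi, wi in zip(feature_values, labels, weights):
        groups[v][0].append(yi)
        groups[v][1].append(wi)
    total = sum(weights)
    remainder = sum((sum(ws) / total) * weighted_entropy(ys, ws)
                    for ys, ws in groups.values())
    return weighted_entropy(labels, weights) - remainder
```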
21 Ensembles and Active Learning
•  Ensembles can be used to actively select good new training examples:
•  select the unlabeled example that causes the most disagreement among the members of the ensemble.
•  Applicable to any ensemble method:
   –  QueryByBagging
   –  QueryByBoosting
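A minimal QueryByBagging-style sketch: the committee is any set of models (e.g. the output of bagging_train above), and disagreement is measured here by vote entropy, one common choice that the slides do not fix:

```python
import math
from collections import Counter

def vote_entropy(x, committee):
    """Disagreement of the committee on example x, measured by vote entropy."""
    votes = Counter(model(x) for model in committee)
    total = sum(votes.values())
    return -sum((c / total) * math.log2(c / total) for c in votes.values())

def query_by_committee(unlabeled, committee):
    """Select the unlabeled example the committee disagrees on most."""
    return max(unlabeled, key=lambda x: vote_entropy(x, committee))
```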