Machine learning essentials
Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno, 3.12.2021 and 7.1.2022

Outline
- Bias-variance trade-off
- Penalized regression: LASSO and ridge regression
- Regression tree
- Random forest
- Machine learning and causality

Bias vs. variance trade-off
Which would you choose? As always: it depends.

Y = f(X) + ε

Expected prediction error:
EPE(Y, f̂(X)) = E[(Y − f̂(X))²] = E[(Y − f(X))²] (unsystematic error) + E[(f(X) − f̂(X))²] (systematic error)

At a particular point X = x, the mean squared error splits into squared bias and variance:
MSE(f(x), f̂(x)) = E[(f(x) − f̂(x))²] = (f(x) − E[f̂(x)])² (bias²) + E[(f̂(x) − E[f̂(x)])²] (variance)

Train/test data
Prediction is easy to evaluate: we can measure on the test data how well the model estimated on the training data works.

LASSO
Least Absolute Shrinkage and Selection Operator

Motivation 1 - model selection + estimation
How to choose regressors?
- Hypothesis testing (backward elimination, ...)
- Information criteria (AIC, BIC, FIC, ...)
- Model averaging
These are 2-step procedures: [MODEL SELECTION] + [ESTIMATION]. LASSO does both in one step.

Motivation 2 - many regressors
We have MANY regressors relative to the number of observations: p >> N. The parameters of y = Xβ + ε are then not uniquely defined. LASSO can handle such situations.

Motivation 3 - bias-variance trade-off
LASSO takes this trade-off into account. It aims to predict well.

y = Xβ + ε
β̂_OLS = argmin_β (y − Xβ)ᵀ(y − Xβ)
Under correct specification and Gaussian homoskedastic errors (ε ~ N(0, σ²I)), OLS has the lowest variance among linear unbiased estimators and coincides with the maximum likelihood estimator (asymptotically efficient). So why should we care about anything else?

Should we care about unbiasedness so much?
- Interpretation: the OLS estimates are all non-zero.
- Sparsity principle (ESL): "Use a procedure that does well in sparse problems, since no method does well in dense problems."
- Highly correlated regressors lead to a high variance of the estimators.
- But is the truth sparse?

Penalized regression:
(y − Xβ)ᵀ(y − Xβ) + PENALTY
(y − Xβ)ᵀ(y − Xβ) + λ‖β‖_1,   where ‖β‖_1 = ∑_{i=1}^p |β_i| and λ is the "price" of β.
We shrink β towards zero!

And the credit goes to... Robert Tibshirani.

Cross-validation
How to choose λ? Choose it so that the model predicts well: minimize the cross-validation error.

Example 1
215 genes out of 4718 have non-zero coefficients in a multinomial lasso classifier. (source: SLS)

The lasso estimator
For a linear model, the LASSO estimator solves
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)²   subject to   ∑_{j=1}^p |β_j| ≤ t.
t → 0 implies β̂ → 0; t → ∞ implies β̂ → β̂_OLS.
In a more compact form:
min_β (1/2N) ‖y − Xβ‖_2²   subject to   ‖β‖_1 ≤ t,
which is equivalent to
min_β (1/2N) ‖y − Xβ‖_2² + λ‖β‖_1.

A few remarks
- If the predictors are not on the same scale, we rescale and recenter them; the intercept then disappears.
- There is a relationship between λ and t, and this relationship is data-dependent.
- The factor 1/2N makes it easier to compare λ across different sample sizes.

Figure: Ridge regression vs. LASSO. (source: Faraway (2014))
Figure: LASSO in 3D, (β_1, β_2, β_3). (source: SLS)
Figure: Choice of λ by cross-validation; relative bound = ‖β̂‖_1 / ‖β̂_OLS‖_1. (source: SLS)
Figure: Example - crime rate in U.S. cities. (source: SLS)

(*) Consistency
We may be interested in MSE consistency:
‖X(β̂ − β*)‖_2² / N ≤ C ‖β*‖_1 √(log(p)/N).
What does it mean? If ‖β*‖_1 = o(√(N/log(p))) (read: the true coefficients do not grow too fast, or alternatively: the truth is "sparse"), then the lasso is consistent for prediction.
Or we may be interested in sparsistency, that is, recoverability of the non-zero support set of the true regression parameters. This is more difficult to prove and requires higher-level assumptions.
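To make the lasso discussion above concrete, here is a minimal sketch in R. The lecture itself contains no code, so the use of the glmnet package, the simulated data and all variable names are my own illustration: a sparse truth, p close to N, and λ chosen by 10-fold cross-validation.

```r
# Lasso on simulated sparse data; lambda chosen by cross-validation.
library(glmnet)

set.seed(1)
n <- 100; p <- 80
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))   # sparse truth: only 3 non-zero coefficients
y <- drop(X %*% beta_true) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1)        # alpha = 1: lasso penalty (alpha = 0 would be ridge)
plot(cv_fit)                                # cross-validation error as a function of log(lambda)

b_hat <- coef(cv_fit, s = "lambda.min")     # coefficients at the CV-minimizing lambda
sum(b_hat[-1] != 0)                         # number of selected predictors (excluding the intercept)
```

Note that glmnet standardizes the predictors internally by default, in line with the rescale-and-recenter remark above, and that choosing alpha strictly between 0 and 1 gives the elastic net discussed below.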
Uniqueness
Suppose we have x_1 and x_2 with corresponding lasso estimates β̂_1 and β̂_2. Now we add x_3, which is identical to x_2. Then (β̂_1, γβ̂_2, (1 − γ)β̂_2) produces the same fit for any γ ∈ [0, 1], and its ℓ_1 norm is the same. So we have an infinite number of solutions.

Different penalties
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ ∑_{j=1}^p |β_j|^q
Figure: Constraint regions for different values of q. (source: SLS)

Elastic net
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ [ (1 − α)·½‖β‖_2² + α‖β‖_1 ]   (a more sophisticated penalty)
It takes the best of both worlds:
- the ability to set some parameters exactly to zero, as LASSO does (α = 1),
- the ability to handle highly correlated data, as ridge regression does (α = 0).
We pay a price for that: the choice of α is needed.
Figure: Left: LASSO; right: elastic net, for highly correlated data. (source: SLS)
Figure: Left: elastic net ball for α = 0.7; right: LASSO ball. (source: SLS)

Regression tree
A medical doctor asks a patient the following yes/no questions:
- Are you more than 30 years old?
- Is your diastolic pressure higher than 100?
- Is there anyone in your family with a heart condition?
- Do you do sport more than 60 minutes per week?
Based on these yes/no questions, the doctor can make a prognosis or suggest a suitable intervention. Yes/no questions are easy; they can be standardized and made into guidelines.
Morgan and Sonquist (1963), Breiman, Friedman, Olshen and Stone (1984)

How to build such a tree?
What is a good yes/no question? Can we measure how good a question is?
Outcome Y, covariates X_1, X_2, ..., X_p.
Measure of variability: RSS(group) = ∑_{i ∈ group} (Y_i − Ȳ_group)².
Choose the partitioning group = part_1 ∪ part_2 so that the overall measure of variability is minimized:
RSS(group) → RSS(part_1) + RSS(part_2).
(Make the two groups very different.)

If all p variables are numerical, there are p·(n − 1) different ways to partition the data.
How big should the tree be? Stop if the RSS improvement is small. (What is "small"?)
Pruning: grow a large tree and cut back the "leaves". Choose the tree that minimizes
CC(tree) = ∑_{leaf j} RSS(part_j) + λ·(#leaves),
where λ·(#leaves) is a penalty for complexity.
How to choose the penalty λ? So that we predict well, e.g. by cross-validation. (See the sketch after this section.)

It is a regression:
Y = β_1 I(age > 30)·I(gender = ♀)
  + β_2 I(age > 30)·I(gender = ♂)
  + β_3 I(age ≤ 30)·I(dia > 100)
  + β_4 I(age ≤ 30)·I(dia ≤ 100)
  + β_5 I(age ≤ 30)·I(dia ≤ 100)·I(♥)·I(sport > 60)
  + β_6 I(age ≤ 30)·I(dia ≤ 100)·I(♥)·I(sport ≤ 60)
  + β_7 I(age ≤ 30)·I(dia ≤ 100)·I(no ♥)
  + ε
The prediction in a leaf is a simple average.

PROS
- Trees are visually appealing and easy to understand.
- No parametric structure is imposed.
- They can capture complex interactions.
CONS
- Sensitive if observations have values around the cutting points.
- The variance is high: a small change in the data can result in a large change of the estimated model.
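As an illustration of growing and pruning such a tree, here is a short sketch in R with the rpart package. It is not code from the lecture: the data, variable names and cutoffs are invented to mimic the medical example above, and rpart's cp parameter plays the role of the complexity penalty λ.

```r
# Grow a deliberately large regression tree, then prune it back using the
# cross-validated complexity parameter.
library(rpart)

set.seed(1)
n <- 500
d <- data.frame(age   = runif(n, 20, 70),
                dia   = runif(n, 70, 120),    # diastolic pressure
                sport = runif(n, 0, 180))     # minutes of sport per week
d$risk <- (d$age > 30) + 0.5 * (d$dia > 100) - 0.3 * (d$sport > 60) + rnorm(n, sd = 0.3)

big_tree <- rpart(risk ~ age + dia + sport, data = d,
                  control = rpart.control(cp = 0, minsplit = 10))   # grow a large tree
printcp(big_tree)                                  # cross-validated error for each subtree

best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)           # cut back the leaves that do not pay off
pruned                                             # the prediction in each leaf is its average
```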
Random forest
"Pozdravujem vás, lesy, hory, z tej duše pozdravujem vás!"
"I greet you, forests, mountains, from the depths of my soul I greet you!"
P. O. Hviezdoslav
Figure: P. O. Hviezdoslav in a forest.

Random forest
High variability of a single tree? We use the bootstrap to grow many trees and then average the predictions across them (bagging). We may also give a higher weight to the trees that predict better.
Figure: Random forest - many similar trees. Not all that random, right?

Random forest
Trees grown on different bootstrap samples are only somewhat different. We can explicitly make them even more different: at each split we consider only a random subset of the regressors, say √p of them instead of all p.
How many trees? Enough so that the prediction error no longer changes.
Figure: Random forest - many different trees.

Effect of a variable
How to quantify the effect of a particular variable on the outcome in such a complicated object as a random forest?
- Partial dependence: set the variable to a particular value for all observations and look at the mean prediction.
- Partial effects: fix all the other variables at their mean values and look at the mean predictions.
- Importance: use the out-of-bag data to assess the change in MSE after randomly permuting a predictor.
- Minimal depth: how early or late is the variable used for a split? Average the depth of its first split.
- SHAP values (we will not cover these).

Classification using trees and random forests
So far we looked at a continuous outcome variable. The presented ideas are often used for classification (a categorical outcome). The measure of group difference is then not the RSS (e.g. deviance, entropy, Gini index), and the prediction in a leaf is a proportion.

Why random forest?
The random forest is considered one of the best off-the-shelf predictors/classifiers, especially if you do not know much about the problem. It is relatively easy to compute. Its interpretability is not the nicest.
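A small random-forest sketch in R with the randomForest package, again my own illustration on simulated data rather than code from the lecture. It shows the bootstrap-plus-random-subset idea, the out-of-bag error, permutation importance and a partial dependence plot.

```r
# Random forest: bagged trees with a random subset of regressors at each split.
library(randomForest)

set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- sin(d$x1) + 0.5 * d$x2^2 + rnorm(n, sd = 0.3)   # x3, x4 are pure noise

rf <- randomForest(y ~ ., data = d,
                   ntree = 500,        # enough trees for the OOB error to stabilize
                   mtry  = 2,          # roughly sqrt(p) regressors tried at each split
                   importance = TRUE)  # permutation importance from out-of-bag data

plot(rf)                               # OOB error as a function of the number of trees
importance(rf)                         # %IncMSE: rise in OOB MSE after permuting a predictor
partialPlot(rf, pred.data = d, x.var = "x1")   # partial dependence of the prediction on x1
```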
(*) Machine learning and causality
Prediction is nice, but economists often care more about the underlying mechanism. While ML gives us many great prediction tools, we are typically interested in the effect of a particular variable of interest. Having a lot of information means we need to cope with high-dimensional covariates. A lot of information is great: it can make our selection-on-observables assumption more credible!
A job-seeker went through a training course. Did it help? We know a lot about these job-seekers (say 300 variables). Plain regression will not cope well with this kind of problem, especially if the sample size is small. We may try LASSO, but it will give us biased estimates.
Can we use the great predictive capabilities of ML algorithms to improve the estimation of parameters of interest? This is an area of active research. Here we discuss one important paper on DOUBLE MACHINE LEARNING:
Chernozhukov, Victor, et al. "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal 21.1 (2018): C1-C68.

Double Machine Learning framework
Consider the following partially linear model, where θ is the parameter of interest:
Y = θD + g(X) + U,   E[U|D,X] = 0,
D = m(X) + V,   E[V|X] = 0.

A naive approach:
- Split the data into two parts.
- Use the first part to get ĝ by some ML algorithm (LASSO, RF).
- Use the second part to get θ̂ by regressing Y − ĝ(X) on D:
θ̂_1 = ( (1/n) ∑_i D_i² )⁻¹ (1/n) ∑_i D_i (Y_i − ĝ(X_i)).
θ̂_1 is based on E[ψ_1] = 0, where ψ_1 = D(Y − g(X) − θD).

Double Machine Learning framework
How does this naive estimator behave?
√n(θ̂_1 − θ) = ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i U_i   [nicely behaved, approximately Gaussian]
             + ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i (g(X_i) − ĝ(X_i))   [in general divergent].
Why? Writing D_i = m(X_i) + V_i,
( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i (g(X_i) − ĝ(X_i)) = (E[D_i²])⁻¹ (1/√n) ∑_i m(X_i)(g(X_i) − ĝ(X_i)) + o_P(1),
where E[m(X_i)(g(X_i) − ĝ(X_i))] ≠ 0 in general and ĝ converges to g more slowly than at the rate 1/√n, so this term is not centred at zero and, scaled by √n, typically diverges. So it leads to a regularization bias.

Double Machine Learning framework
Now we do something else. Instead of ψ_1 = D(Y − g(X) − θD) we base the estimation on different moment conditions:
ψ_2 = V(Y − g(X) − θD) = (D − m(X))·(Y − g(X) − θD),
ψ_3 = V(Y − g(X) − θV) = (D − m(X))·(Y − g(X) − θ(D − m(X))).
These moment conditions are somewhat more "clever": the problematic regularization-bias part converges to zero under mild conditions.

θ̂_2 based on ψ_2:
- Split the data into two parts.
- Use the first part to get ĝ and m̂ by some ML algorithm (LASSO, RF).
- Use the second part to get V̂ = D − m̂(X) and use this to get θ̂_2.
...
√n(θ̂_2 − θ) = a* [nicely behaved, approximately Gaussian] + b* [regularization bias] + c* [overfitting bias].
Regularization bias: b* = ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i (m(X_i) − m̂(X_i))·(g(X_i) − ĝ(X_i)).
Overfitting bias: sample splitting takes care of this.
Because b* involves the product of the two estimation errors, ĝ and m̂ no longer need to converge at the rate n^(−1/2). It is sufficient if they both converge at the rate n^(−1/4), and this is much, much easier for ML algorithms.

θ̂_3 based on ψ_3:
- Split the data into two parts.
- Use the first part to get ĝ and m̂ by some ML algorithm (LASSO, RF).
- Use the second part to get V̂ = D − m̂(X) and Ŵ = Y − ĝ(X), and obtain θ̂_3 by regressing Ŵ on V̂.
This is, in fact, orthogonalization: we project both D and Y onto the space spanned by X and, by the logic of the Frisch-Waugh-Lovell theorem, recover the coefficient on D from the residuals. A similar decomposition can be shown; the regularization bias again involves the cross product (m(X_i) − m̂(X_i))·(g(X_i) − ĝ(X_i)).

What makes ψ_2 and ψ_3 different from ψ_1? Their regularization bias vanishes under mild conditions. In other words, ψ_2 and ψ_3 are both locally insensitive to mild perturbations of m̂, ĝ around m, g.

Neyman orthogonality
This local insensitivity has a name: Neyman orthogonality.
- ψ is a moment condition,
- θ is the parameter of interest (target parameter), θ_0 is the true one,
- η is the nuisance parameter, η_0 is the true one,
- W denotes the data.
∂/∂r E[ψ(W; θ_0, η_0 + r(η − η_0))] at r = 0 equals 0.

Neyman orthogonality of ψ_2
We verify that ψ_2 satisfies the Neyman-orthogonality condition, while ψ_1 does not.
Notation: η = (m, g) is the vector of nuisance parameters, η_0 = (m_0, g_0) is the true one, and η_r = η_0 + r(η − η_0).

ψ_2(W; θ_0, η_r) = (D − m_0(X) − r(m(X) − m_0(X)))·(Y − g_0(X) − r(g(X) − g_0(X)) − Dθ_0)
= (D − m_0(X))·(Y − g_0(X) − Dθ_0)
  − r(D − m_0(X))·(g(X) − g_0(X))
  − r(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)
  + r²(m(X) − m_0(X))·(g(X) − g_0(X)).
Hence
∂/∂r E[ψ_2(W; θ_0, η_r)] = −E[(D − m_0(X))·(g(X) − g_0(X))] − E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] + 2r·E[(m(X) − m_0(X))·(g(X) − g_0(X))],
and at r = 0:
∂/∂r E[ψ_2(W; θ_0, η_r)] = −E[(D − m_0(X))·(g(X) − g_0(X))] − E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] = 0,
because
E[(D − m_0(X))·(g(X) − g_0(X))] = E[(g(X) − g_0(X))·E[D − m_0(X)|X]] = 0   (since E[V|X] = 0),
E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] = E[(m(X) − m_0(X))·E[Y − g_0(X) − Dθ_0|X]] = 0   (since E[U|X,D] = 0),
and hence ψ_2 is indeed Neyman-orthogonal.

Neyman orthogonality of ψ_3
Follows similarly as for ψ_2, but the derivation is a little longer.

Neyman orthogonality of ψ_1?
ψ_1(W; θ_0, η_r) = D·(Y − g_0(X) − r(g(X) − g_0(X)) − Dθ_0),
∂/∂r E[ψ_1(W; θ_0, η_r)] = −E[D·(g(X) − g_0(X))],
and at r = 0 this equals −E[D·(g(X) − g_0(X))] ≠ 0 in general.
There is nothing we can do with E[U|X,D] = 0 and E[V|X] = 0 to make this term equal to zero, so ψ_1 is not Neyman-orthogonal.
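The orthogonality condition can also be checked numerically. The sketch below is my own illustration: it simulates the partially linear model with made-up nuisance functions m_0, g_0 and one particular perturbation direction m_1, g_1, and approximates ∂/∂r E[ψ(W; θ_0, η_0 + r(η − η_0))] at r = 0 by a finite difference. The derivative is essentially zero for ψ_2 but clearly non-zero for ψ_1.

```r
# Numerical check of Neyman orthogonality in a simulated partially linear model.
set.seed(1)
n <- 1e5; theta0 <- 0.5
X <- rnorm(n)
m0 <- function(x) x                       # true nuisance functions
g0 <- function(x) x^2
m1 <- function(x) x + 0.3 * sin(x)        # perturbed alternatives (direction of the derivative)
g1 <- function(x) x^2 + 0.3 * cos(x)
D <- m0(X) + rnorm(n)                     # D = m0(X) + V
Y <- theta0 * D + g0(X) + rnorm(n)        # Y = theta0*D + g0(X) + U

# Empirical moments along the path eta_r = eta_0 + r*(eta - eta_0):
psi1_bar <- function(r) mean(D * (Y - (g0(X) + r * (g1(X) - g0(X))) - theta0 * D))
psi2_bar <- function(r) {
  mr <- m0(X) + r * (m1(X) - m0(X))
  gr <- g0(X) + r * (g1(X) - g0(X))
  mean((D - mr) * (Y - gr - theta0 * D))
}

eps <- 1e-3                               # two-sided numerical derivative at r = 0
c(dpsi1 = (psi1_bar(eps) - psi1_bar(-eps)) / (2 * eps),   # clearly non-zero
  dpsi2 = (psi2_bar(eps) - psi2_bar(-eps)) / (2 * eps))   # ~ 0 up to simulation noise
```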
Overfitting bias
Recall √n(θ̂_2 − θ) = a* [nicely behaved, approximately Gaussian] + b* [regularization bias] + c* [overfitting bias].
The overfitting bias c* may arise from the fact that the same data are used both for estimating the nuisance functions and for estimating the target parameter. We can split the data: randomly split it into two parts, use one for the nuisance parameters and the other for the target parameter. But then we lose many observations. How to fix this? Swap the roles of the two parts and then average across them!

Sample splitting for dealing with the overfitting bias (cross-fitting):
Step 1: Split the sample W into K folds W_1, ..., W_K.
Step 2: For each fold k, estimate m̂(X) and ĝ(X) on the complement W_k^C.
Step 3: Evaluate the moment function ψ̂_i^k on the held-out observations in W_k.
Step 4: Aggregate, ψ̂ = (1/n) ∑_{k=1}^K ∑_{i=1}^{n_k} ψ̂_i^k, and solve for θ̂.

DML wrap-up (1)
There are different ways to estimate θ. We saw three: θ̂_1, θ̂_2 and θ̂_3, based on three different moment functions ψ_1, ψ_2 and ψ_3. While ψ_1 is locally sensitive to small changes in η, ψ_2 and ψ_3 are not; this allows us to get rid of the regularization bias. Sample splitting removes the overfitting bias.

DML wrap-up (2)
Take an estimator θ̂ based on a Neyman-orthogonal moment function ψ, apply sample splitting, and use nuisance parameter estimators that are "good enough" (e.g. converge at a rate of at least n^(−1/4), so that the regularization bias vanishes). We then get (Theorem 1 in Chernozhukov et al. 2018)
√n(θ̂ − θ) →_d N(0, σ²),
i.e. an asymptotically normally distributed estimator that is √n-consistent.

DML wrap-up (3)
DML provides a framework for developing estimators that
- can handle high-dimensional data,
- make use of the predictive power of ML,
- are well behaved under mild conditions.
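To tie the pieces together, here is a compact cross-fitted DML sketch in R following Steps 1-4 above, with random forests for the nuisance fits. It is my own illustration on simulated data, not code from the paper, and it uses the partialling-out (Frisch-Waugh-Lovell) form of the orthogonal score: the forest fitted to Y estimates E[Y|X] rather than g itself, which is fine because that score is also Neyman-orthogonal. The DoubleML package cited in the references implements this machinery in a packaged form.

```r
# Cross-fitted, orthogonalized estimation of theta in Y = theta*D + g(X) + U.
library(randomForest)

set.seed(1)
n <- 1000; p <- 10; theta <- 0.5
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
D <- X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)             # D = m(X) + V
Y <- theta * D + X[, 1]^2 + sin(X[, 2]) + rnorm(n)  # Y = theta*D + g(X) + U

K <- 2
fold <- sample(rep(1:K, length.out = n))            # Step 1: split the sample into K folds
V_hat <- W_hat <- numeric(n)
for (k in 1:K) {
  train <- fold != k                                # Step 2: nuisances fitted on the complement
  m_hat <- randomForest(x = X[train, ], y = D[train])
  l_hat <- randomForest(x = X[train, ], y = Y[train])
  V_hat[!train] <- D[!train] - predict(m_hat, X[!train, ])   # Step 3: residuals on held-out fold
  W_hat[!train] <- Y[!train] - predict(l_hat, X[!train, ])
}
theta_hat <- sum(V_hat * W_hat) / sum(V_hat^2)      # Step 4: pool folds, regress W_hat on V_hat
se_hat <- sqrt(sum(V_hat^2 * (W_hat - theta_hat * V_hat)^2)) / sum(V_hat^2)
c(estimate = theta_hat, std.error = se_hat)         # compare with the true theta = 0.5
```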
Thank you for your attention!

References
- Why ML is becoming interesting in economics: Varian, Hal R. "Big Data: New Tricks for Econometrics." Journal of Economic Perspectives 28.2 (2014): 3-28.
- A more recent JEP article on ML: Mullainathan, Sendhil, and Jann Spiess. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives 31.2 (2017): 87-106.
- THE most standard and by far the best book: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer. Made free by the authors: https://www.statlearning.com
- Free online course based on this book: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
- More comprehensive, but somewhat less accessible book: [ESL] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.
- Beautiful exposition of the essentials of the bias-variance trade-off: https://daviddalpiaz.github.io/r4sl/biasvariance-tradeoff.html
- LASSO: [SLS] Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2019. https://web.stanford.edu/~hastie/StatLearnSparsity/
- LASSO, if you only have one hour, read this: Tibshirani, Ryan (with Larry Wasserman). "Sparsity and the Lasso." Statistical Machine Learning, Spring 2015. http://www.stat.cmu.edu/~larry/=sml/sparsity.pdf
- The "tree" idea: Morgan, James N., and John A. Sonquist. "Problems in the Analysis of Survey Data, and a Proposal." Journal of the American Statistical Association 58.302 (1963): 415-434.
- Standard reference book on trees: Breiman, Leo, et al. Classification and Regression Trees. Routledge, 2017.
- SHAP values, original paper: Lundberg, Scott M., and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- SHAP values, digestible intro: https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
- LIME (an explanation tool similar to SHAP), digestible intro: https://medium.com/dataman-in-ai/explain-your-model-with-lime-5a1a5867b423
- Double machine learning framework: Chernozhukov, Victor, et al. "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal 21.1 (2018): C1-C68.
- Somewhat accessible intro to DML: https://towardsdatascience.com/double-machine-learning-for-causal-inference-78e0c6111f9d
- DML video by one of the authors of DML: https://www.youtube.com/watch?v=eHOjmyoPCFU
- DoubleML package in R: https://cran.r-project.org/web/packages/DoubleML/DoubleML.pdf
- Bach, Philipp, et al. "DoubleML - An Object-Oriented Implementation of Double Machine Learning in R." arXiv preprint arXiv:2103.09603 (2021).