Machine learning essentials
Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno, 3.12.2021 and 7.1.2022

Outline
- Bias-variance trade-off
- Penalized regression: LASSO and ridge regression
- Regression tree
- Random forest
- Machine learning and causality

Bias vs. variance trade-off
Which would you choose? As always: it depends.

Y = f(X) + ε

Expected prediction error:
EPE(Y, f̂(X)) = E[(Y − f̂(X))²] = E[(Y − f(X))²] (unsystematic error) + E[(f(X) − f̂(X))²] (systematic error)

At a particular point X = x, the mean squared error splits into squared bias and variance:
MSE(f(x), f̂(x)) = E[(f(x) − f̂(x))²] = (f(x) − E[f̂(x)])² (bias²) + E[(f̂(x) − E[f̂(x)])²] (variance)

Train/test data
Prediction is easy to evaluate: we can measure on the test data how well the model estimated on the training data works.

LASSO
Least Absolute Shrinkage and Selection Operator

Motivation 1 - model selection + estimation
How to choose regressors?
- Hypothesis testing (backward elimination, ...)
- Information criteria (AIC, BIC, FIC, ...)
- Model averaging
These are 2-step procedures: [MODEL SELECTION] + [ESTIMATION]. LASSO does both in one step.

Motivation 2 - many regressors
We have MANY regressors relative to the number of observations: p >> N. The parameters of y = Xβ + ε are then not uniquely defined. LASSO can handle such situations.

Motivation 3 - bias-variance trade-off
LASSO takes this trade-off into account. It aims to predict well.

y = Xβ + ε
β̂_OLS = argmin_β (y − Xβ)ᵀ(y − Xβ)
Under correct specification and Gaussian homoskedastic errors (ε ~ N(0, σ²I)), OLS has the lowest variance among linear unbiased estimators and coincides with the maximum likelihood estimator (asymptotically efficient). So why should we care about anything else?

Should we care about unbiasedness so much?
- Interpretation: the OLS estimates are all non-zero.
- Sparsity principle (ESL): "Use a procedure that does well in sparse problems, since no method does well in dense problems."
- Highly correlated regressors lead to a high variance of the estimators.
- But is the truth sparse?

Penalized regression:
(y − Xβ)ᵀ(y − Xβ) + PENALTY
(y − Xβ)ᵀ(y − Xβ) + λ‖β‖_1,   where ‖β‖_1 = ∑_{i=1}^p |β_i| and λ is the "price" of β.
We shrink β towards zero!

And the credit goes to... Robert Tibshirani.

Cross-validation
How to choose λ? Choose it so that the model predicts well: minimize the cross-validation error.

Example 1
215 genes out of 4718 have non-zero coefficients in a multinomial lasso classifier. (source: SLS)

The lasso estimator
For a linear model, the LASSO estimator solves
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)²   subject to   ∑_{j=1}^p |β_j| ≤ t.
t → 0 implies β̂ → 0; t → ∞ implies β̂ → β̂_OLS.
In a more compact form:
min_β (1/2N) ‖y − Xβ‖_2²   subject to   ‖β‖_1 ≤ t,
which is equivalent to
min_β (1/2N) ‖y − Xβ‖_2² + λ‖β‖_1.

A few remarks
- If the predictors are not on the same scale, we rescale and recenter them; the intercept then disappears.
- There is a relationship between λ and t, and this relationship is data-dependent.
- The factor 1/2N makes it easier to compare λ across different sample sizes.

Figure: Ridge regression vs. LASSO. (source: Faraway (2014))
Figure: LASSO in 3D, (β_1, β_2, β_3). (source: SLS)
Figure: Choice of λ by cross-validation; relative bound = ‖β̂‖_1 / ‖β̂_OLS‖_1. (source: SLS)
Figure: Example - crime rate in U.S. cities. (source: SLS)

(*) Consistency
We may be interested in MSE consistency:
‖X(β̂ − β*)‖_2² / N ≤ C ‖β*‖_1 √(log(p)/N).
What does it mean? If ‖β*‖_1 = o(√(N/log(p))) (read: the true coefficients do not grow too fast, or alternatively: the truth is "sparse"), then the lasso is consistent for prediction.
Or we may be interested in sparsistency, that is, recoverability of the non-zero support set of the true regression parameters. This is more difficult to prove and requires higher-level assumptions.
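To make the lasso discussion above concrete, here is a minimal sketch in R. The lecture itself contains no code, so the use of the glmnet package, the simulated data and all variable names are my own illustration: a sparse truth, p close to N, and λ chosen by 10-fold cross-validation.

```r
# Lasso on simulated sparse data; lambda chosen by cross-validation.
library(glmnet)

set.seed(1)
n <- 100; p <- 80
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))   # sparse truth: only 3 non-zero coefficients
y <- drop(X %*% beta_true) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1)        # alpha = 1: lasso penalty (alpha = 0 would be ridge)
plot(cv_fit)                                # cross-validation error as a function of log(lambda)

b_hat <- coef(cv_fit, s = "lambda.min")     # coefficients at the CV-minimizing lambda
sum(b_hat[-1] != 0)                         # number of selected predictors (excluding the intercept)
```

Note that glmnet standardizes the predictors internally by default, in line with the rescale-and-recenter remark above, and that choosing alpha strictly between 0 and 1 gives the elastic net discussed below.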
Uniqueness
Suppose we have x_1 and x_2 with corresponding lasso estimates β̂_1 and β̂_2. Now we add x_3, which is identical to x_2. Then (β̂_1, γβ̂_2, (1 − γ)β̂_2) produces the same fit for any γ ∈ [0, 1], and its ℓ_1 norm is the same. So we have an infinite number of solutions.

Different penalties
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ ∑_{j=1}^p |β_j|^q
Figure: Constraint regions for different values of q. (source: SLS)

Elastic net
min_{β_0, β} (1/2N) ∑_{i=1}^N (y_i − β_0 − ∑_{j=1}^p x_ij β_j)² + λ [ (1 − α)·½‖β‖_2² + α‖β‖_1 ]   (a more sophisticated penalty)
It takes the best of both worlds:
- the ability to set some parameters exactly to zero, as LASSO does (α = 1),
- the ability to handle highly correlated data, as ridge regression does (α = 0).
We pay a price for that: the choice of α is needed.
Figure: Left: LASSO; right: elastic net, for highly correlated data. (source: SLS)
Figure: Left: elastic net ball for α = 0.7; right: LASSO ball. (source: SLS)

Regression tree
A medical doctor asks a patient the following yes/no questions:
- Are you more than 30 years old?
- Is your diastolic pressure higher than 100?
- Is there anyone in your family with a heart condition?
- Do you do sport more than 60 minutes per week?
Based on these yes/no questions, the doctor can make a prognosis or suggest a suitable intervention. Yes/no questions are easy; they can be standardized and made into guidelines.
Morgan and Sonquist (1963), Breiman, Friedman, Olshen and Stone (1984)

How to build such a tree?
What is a good yes/no question? Can we measure how good a question is?
Outcome Y, covariates X_1, X_2, ..., X_p.
Measure of variability: RSS(group) = ∑_{i ∈ group} (Y_i − Ȳ_group)².
Choose the partitioning group = part_1 ∪ part_2 so that the overall measure of variability is minimized:
RSS(group) → RSS(part_1) + RSS(part_2).
(Make the two groups very different.)

If all p variables are numerical, there are p·(n − 1) different ways to partition the data.
How big should the tree be? Stop if the RSS improvement is small. (What is "small"?)
Pruning: grow a large tree and cut back the "leaves". Choose the tree that minimizes
CC(tree) = ∑_{leaf j} RSS(part_j) + λ·(#leaves),
where λ·(#leaves) is a penalty for complexity.
How to choose the penalty λ? So that we predict well, e.g. by cross-validation. (See the sketch after this section.)

It is a regression:
Y = β_1 I(age > 30)·I(gender = ♀)
  + β_2 I(age > 30)·I(gender = ♂)
  + β_3 I(age ≤ 30)·I(dia > 100)
  + β_4 I(age ≤ 30)·I(dia ≤ 100)
  + β_5 I(age ≤ 30)·I(dia ≤ 100)·I(♥)·I(sport > 60)
  + β_6 I(age ≤ 30)·I(dia ≤ 100)·I(♥)·I(sport ≤ 60)
  + β_7 I(age ≤ 30)·I(dia ≤ 100)·I(no ♥)
  + ε
The prediction in a leaf is a simple average.

PROS
- Trees are visually appealing and easy to understand.
- No parametric structure is imposed.
- They can capture complex interactions.
CONS
- Sensitive if observations have values around the cutting points.
- The variance is high: a small change in the data can result in a large change of the estimated model.
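As an illustration of growing and pruning such a tree, here is a short sketch in R with the rpart package. It is not code from the lecture: the data, variable names and cutoffs are invented to mimic the medical example above, and rpart's cp parameter plays the role of the complexity penalty λ.

```r
# Grow a deliberately large regression tree, then prune it back using the
# cross-validated complexity parameter.
library(rpart)

set.seed(1)
n <- 500
d <- data.frame(age   = runif(n, 20, 70),
                dia   = runif(n, 70, 120),    # diastolic pressure
                sport = runif(n, 0, 180))     # minutes of sport per week
d$risk <- (d$age > 30) + 0.5 * (d$dia > 100) - 0.3 * (d$sport > 60) + rnorm(n, sd = 0.3)

big_tree <- rpart(risk ~ age + dia + sport, data = d,
                  control = rpart.control(cp = 0, minsplit = 10))   # grow a large tree
printcp(big_tree)                                  # cross-validated error for each subtree

best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)           # cut back the leaves that do not pay off
pruned                                             # the prediction in each leaf is its average
```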
Random forest
"Pozdravujem vás, lesy, hory, z tej duše pozdravujem vás!"
"I greet you, forests, mountains, from the depths of my soul I greet you!"
P. O. Hviezdoslav
Figure: P. O. Hviezdoslav in a forest.

Random forest
High variability of a single tree? We use the bootstrap to grow many trees and then average the predictions across them (bagging). We may also give a higher weight to the trees that predict better.
Figure: Random forest - many similar trees. Not all that random, right?

Random forest
Trees grown on different bootstrap samples are only somewhat different. We can explicitly make them even more different: at each split we consider only a random subset of the regressors, say √p of them instead of all p.
How many trees? Enough so that the prediction error no longer changes.
Figure: Random forest - many different trees.

Effect of a variable
How to quantify the effect of a particular variable on the outcome in such a complicated object as a random forest?
- Partial dependence: set the variable to a particular value for all observations and look at the mean prediction.
- Partial effects: fix all the other variables at their mean values and look at the mean predictions.
- Importance: use the out-of-bag data to assess the change in MSE after randomly permuting a predictor.
- Minimal depth: how early or late is the variable used for a split? Average the depth of its first split.
- SHAP values (we will not cover these).

Classification using trees and random forests
So far we looked at a continuous outcome variable. The presented ideas are often used for classification (a categorical outcome). The measure of group difference is then not the RSS (e.g. deviance, entropy, Gini index), and the prediction in a leaf is a proportion.

Why random forest?
The random forest is considered one of the best off-the-shelf predictors/classifiers, especially if you do not know much about the problem. It is relatively easy to compute. Its interpretability is not the nicest.
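A small random-forest sketch in R with the randomForest package, again my own illustration on simulated data rather than code from the lecture. It shows the bootstrap-plus-random-subset idea, the out-of-bag error, permutation importance and a partial dependence plot.

```r
# Random forest: bagged trees with a random subset of regressors at each split.
library(randomForest)

set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- sin(d$x1) + 0.5 * d$x2^2 + rnorm(n, sd = 0.3)   # x3, x4 are pure noise

rf <- randomForest(y ~ ., data = d,
                   ntree = 500,        # enough trees for the OOB error to stabilize
                   mtry  = 2,          # roughly sqrt(p) regressors tried at each split
                   importance = TRUE)  # permutation importance from out-of-bag data

plot(rf)                               # OOB error as a function of the number of trees
importance(rf)                         # %IncMSE: rise in OOB MSE after permuting a predictor
partialPlot(rf, pred.data = d, x.var = "x1")   # partial dependence of the prediction on x1
```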
(*) Machine learning and causality
Prediction is nice, but economists often care more about the underlying mechanism. While ML gives us many great prediction tools, we are typically interested in the effect of a particular variable of interest. Having a lot of information means we need to cope with high-dimensional covariates. A lot of information is great: it can make our selection-on-observables assumption more credible!
A job-seeker went through a training course. Did it help? We know a lot about these job-seekers (say 300 variables). Plain regression will not cope well with this kind of problem, especially if the sample size is small. We may try LASSO, but it will give us biased estimates.
Can we use the great predictive capabilities of ML algorithms to improve the estimation of parameters of interest? This is an area of active research. Here we discuss one important paper on DOUBLE MACHINE LEARNING:
Chernozhukov, Victor, et al. "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal 21.1 (2018): C1-C68.

Double Machine Learning framework
Consider the following partially linear model, where θ is the parameter of interest:
Y = θD + g(X) + U,   E[U|D,X] = 0,
D = m(X) + V,   E[V|X] = 0.

A naive approach:
- Split the data into two parts.
- Use the first part to get ĝ by some ML algorithm (LASSO, RF).
- Use the second part to get θ̂ by regressing Y − ĝ(X) on D:
θ̂_1 = ( (1/n) ∑_i D_i² )⁻¹ (1/n) ∑_i D_i (Y_i − ĝ(X_i)).
θ̂_1 is based on E[ψ_1] = 0, where ψ_1 = D(Y − g(X) − θD).

Double Machine Learning framework
How does this naive estimator behave?
√n(θ̂_1 − θ) = ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i U_i   [nicely behaved, approximately Gaussian]
             + ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i (g(X_i) − ĝ(X_i))   [in general divergent].
Why? Writing D_i = m(X_i) + V_i,
( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i D_i (g(X_i) − ĝ(X_i)) = (E[D_i²])⁻¹ (1/√n) ∑_i m(X_i)(g(X_i) − ĝ(X_i)) + o_P(1),
where E[m(X_i)(g(X_i) − ĝ(X_i))] ≠ 0 in general and ĝ converges to g more slowly than at the rate 1/√n, so this term is not centred at zero and, scaled by √n, typically diverges. So it leads to a regularization bias.

Double Machine Learning framework
Now we do something else. Instead of ψ_1 = D(Y − g(X) − θD) we base the estimation on different moment conditions:
ψ_2 = V(Y − g(X) − θD) = (D − m(X))·(Y − g(X) − θD),
ψ_3 = V(Y − g(X) − θV) = (D − m(X))·(Y − g(X) − θ(D − m(X))).
These moment conditions are somewhat more "clever": the problematic regularization-bias part converges to zero under mild conditions.

θ̂_2 based on ψ_2:
- Split the data into two parts.
- Use the first part to get ĝ and m̂ by some ML algorithm (LASSO, RF).
- Use the second part to get V̂ = D − m̂(X) and use this to get θ̂_2.
...
√n(θ̂_2 − θ) = a* [nicely behaved, approximately Gaussian] + b* [regularization bias] + c* [overfitting bias].
Regularization bias: b* = ( (1/n) ∑_i D_i² )⁻¹ (1/√n) ∑_i (m(X_i) − m̂(X_i))·(g(X_i) − ĝ(X_i)).
Overfitting bias: sample splitting takes care of this.
Because b* involves the product of the two estimation errors, ĝ and m̂ no longer need to converge at the rate n^(−1/2). It is sufficient if they both converge at the rate n^(−1/4), and this is much, much easier for ML algorithms.

θ̂_3 based on ψ_3:
- Split the data into two parts.
- Use the first part to get ĝ and m̂ by some ML algorithm (LASSO, RF).
- Use the second part to get V̂ = D − m̂(X) and Ŵ = Y − ĝ(X), and obtain θ̂_3 by regressing Ŵ on V̂.
This is, in fact, orthogonalization: we project both D and Y onto the space spanned by X and, by the logic of the Frisch-Waugh-Lovell theorem, recover the coefficient on D from the residuals. A similar decomposition can be shown; the regularization bias again involves the cross product (m(X_i) − m̂(X_i))·(g(X_i) − ĝ(X_i)).

What makes ψ_2 and ψ_3 different from ψ_1? Their regularization bias vanishes under mild conditions. In other words, ψ_2 and ψ_3 are both locally insensitive to mild perturbations of m̂, ĝ around m, g.

Neyman orthogonality
This local insensitivity has a name: Neyman orthogonality.
- ψ is a moment condition,
- θ is the parameter of interest (target parameter), θ_0 is the true one,
- η is the nuisance parameter, η_0 is the true one,
- W denotes the data.
∂/∂r E[ψ(W; θ_0, η_0 + r(η − η_0))] at r = 0 equals 0.

Neyman orthogonality of ψ_2
We verify that ψ_2 satisfies the Neyman-orthogonality condition, while ψ_1 does not.
Notation: η = (m, g) is the vector of nuisance parameters, η_0 = (m_0, g_0) is the true one, and η_r = η_0 + r(η − η_0).

ψ_2(W; θ_0, η_r) = (D − m_0(X) − r(m(X) − m_0(X)))·(Y − g_0(X) − r(g(X) − g_0(X)) − Dθ_0)
= (D − m_0(X))·(Y − g_0(X) − Dθ_0)
  − r(D − m_0(X))·(g(X) − g_0(X))
  − r(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)
  + r²(m(X) − m_0(X))·(g(X) − g_0(X)).
Hence
∂/∂r E[ψ_2(W; θ_0, η_r)] = −E[(D − m_0(X))·(g(X) − g_0(X))] − E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] + 2r·E[(m(X) − m_0(X))·(g(X) − g_0(X))],
and at r = 0:
∂/∂r E[ψ_2(W; θ_0, η_r)] = −E[(D − m_0(X))·(g(X) − g_0(X))] − E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] = 0,
because
E[(D − m_0(X))·(g(X) − g_0(X))] = E[(g(X) − g_0(X))·E[D − m_0(X)|X]] = 0   (since E[V|X] = 0),
E[(m(X) − m_0(X))·(Y − g_0(X) − Dθ_0)] = E[(m(X) − m_0(X))·E[Y − g_0(X) − Dθ_0|X]] = 0   (since E[U|X,D] = 0),
and hence ψ_2 is indeed Neyman-orthogonal.

Neyman orthogonality of ψ_3
Follows similarly as for ψ_2, but the derivation is a little longer.

Neyman orthogonality of ψ_1?
ψ_1(W; θ_0, η_r) = D·(Y − g_0(X) − r(g(X) − g_0(X)) − Dθ_0),
∂/∂r E[ψ_1(W; θ_0, η_r)] = −E[D·(g(X) − g_0(X))],
and at r = 0 this equals −E[D·(g(X) − g_0(X))] ≠ 0 in general.
There is nothing we can do with E[U|X,D] = 0 and E[V|X] = 0 to make this term equal to zero, so ψ_1 is not Neyman-orthogonal.
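The orthogonality condition can also be checked numerically. The sketch below is my own illustration: it simulates the partially linear model with made-up nuisance functions m_0, g_0 and one particular perturbation direction m_1, g_1, and approximates ∂/∂r E[ψ(W; θ_0, η_0 + r(η − η_0))] at r = 0 by a finite difference. The derivative is essentially zero for ψ_2 but clearly non-zero for ψ_1.

```r
# Numerical check of Neyman orthogonality in a simulated partially linear model.
set.seed(1)
n <- 1e5; theta0 <- 0.5
X <- rnorm(n)
m0 <- function(x) x                       # true nuisance functions
g0 <- function(x) x^2
m1 <- function(x) x + 0.3 * sin(x)        # perturbed alternatives (direction of the derivative)
g1 <- function(x) x^2 + 0.3 * cos(x)
D <- m0(X) + rnorm(n)                     # D = m0(X) + V
Y <- theta0 * D + g0(X) + rnorm(n)        # Y = theta0*D + g0(X) + U

# Empirical moments along the path eta_r = eta_0 + r*(eta - eta_0):
psi1_bar <- function(r) mean(D * (Y - (g0(X) + r * (g1(X) - g0(X))) - theta0 * D))
psi2_bar <- function(r) {
  mr <- m0(X) + r * (m1(X) - m0(X))
  gr <- g0(X) + r * (g1(X) - g0(X))
  mean((D - mr) * (Y - gr - theta0 * D))
}

eps <- 1e-3                               # two-sided numerical derivative at r = 0
c(dpsi1 = (psi1_bar(eps) - psi1_bar(-eps)) / (2 * eps),   # clearly non-zero
  dpsi2 = (psi2_bar(eps) - psi2_bar(-eps)) / (2 * eps))   # ~ 0 up to simulation noise
```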
Overfitting bias
Recall √n(θ̂_2 − θ) = a* [nicely behaved, approximately Gaussian] + b* [regularization bias] + c* [overfitting bias].
The overfitting bias c* may arise from the fact that the same data are used both for estimating the nuisance functions and for estimating the target parameter. We can split the data: randomly split it into two parts, use one for the nuisance parameters and the other for the target parameter. But then we lose many observations. How to fix this? Swap the roles of the two parts and then average across them!

Sample splitting for dealing with the overfitting bias (cross-fitting):
Step 1: Split the sample W into K folds W_1, ..., W_K.
Step 2: For each fold k, estimate m̂(X) and ĝ(X) on the complement W_k^C.
Step 3: Evaluate the moment function ψ̂_i^k on the held-out observations in W_k.
Step 4: Aggregate, ψ̂ = (1/n) ∑_{k=1}^K ∑_{i=1}^{n_k} ψ̂_i^k, and solve for θ̂.

DML wrap-up (1)
There are different ways to estimate θ. We saw three: θ̂_1, θ̂_2 and θ̂_3, based on three different moment functions ψ_1, ψ_2 and ψ_3. While ψ_1 is locally sensitive to small changes in η, ψ_2 and ψ_3 are not; this allows us to get rid of the regularization bias. Sample splitting removes the overfitting bias.

DML wrap-up (2)
Take an estimator θ̂ based on a Neyman-orthogonal moment function ψ, apply sample splitting, and use nuisance parameter estimators that are "good enough" (e.g. converge at a rate of at least n^(−1/4), so that the regularization bias vanishes). We then get (Theorem 1 in Chernozhukov et al. 2018)
√n(θ̂ − θ) →_d N(0, σ²),
i.e. an asymptotically normally distributed estimator that is √n-consistent.

DML wrap-up (3)
DML provides a framework for developing estimators that
- can handle high-dimensional data,
- make use of the predictive power of ML,
- are well behaved under mild conditions.
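To tie the pieces together, here is a compact cross-fitted DML sketch in R following Steps 1-4 above, with random forests for the nuisance fits. It is my own illustration on simulated data, not code from the paper, and it uses the partialling-out (Frisch-Waugh-Lovell) form of the orthogonal score: the forest fitted to Y estimates E[Y|X] rather than g itself, which is fine because that score is also Neyman-orthogonal. The DoubleML package cited in the references implements this machinery in a packaged form.

```r
# Cross-fitted, orthogonalized estimation of theta in Y = theta*D + g(X) + U.
library(randomForest)

set.seed(1)
n <- 1000; p <- 10; theta <- 0.5
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
D <- X[, 1] + 0.5 * X[, 2]^2 + rnorm(n)             # D = m(X) + V
Y <- theta * D + X[, 1]^2 + sin(X[, 2]) + rnorm(n)  # Y = theta*D + g(X) + U

K <- 2
fold <- sample(rep(1:K, length.out = n))            # Step 1: split the sample into K folds
V_hat <- W_hat <- numeric(n)
for (k in 1:K) {
  train <- fold != k                                # Step 2: nuisances fitted on the complement
  m_hat <- randomForest(x = X[train, ], y = D[train])
  l_hat <- randomForest(x = X[train, ], y = Y[train])
  V_hat[!train] <- D[!train] - predict(m_hat, X[!train, ])   # Step 3: residuals on held-out fold
  W_hat[!train] <- Y[!train] - predict(l_hat, X[!train, ])
}
theta_hat <- sum(V_hat * W_hat) / sum(V_hat^2)      # Step 4: pool folds, regress W_hat on V_hat
se_hat <- sqrt(sum(V_hat^2 * (W_hat - theta_hat * V_hat)^2)) / sum(V_hat^2)
c(estimate = theta_hat, std.error = se_hat)         # compare with the true theta = 0.5
```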
Thank you for your attention!

References
- Why ML is becoming interesting in economics: Varian, Hal R. "Big Data: New Tricks for Econometrics." Journal of Economic Perspectives 28.2 (2014): 3-28.
- A more recent JEP article on ML: Mullainathan, Sendhil, and Jann Spiess. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives 31.2 (2017): 87-106.
- THE most standard and by far the best book: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer. Made free by the authors: https://www.statlearning.com
- Free online course based on this book: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
- More comprehensive, but somewhat less accessible book: [ESL] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.
- Beautiful exposition of the essentials of the bias-variance trade-off: https://daviddalpiaz.github.io/r4sl/biasvariance-tradeoff.html
- LASSO: [SLS] Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2019. https://web.stanford.edu/~hastie/StatLearnSparsity/
- LASSO, if you only have one hour, read this: Tibshirani, Ryan (with Larry Wasserman). "Sparsity and the Lasso." Statistical Machine Learning, Spring 2015. http://www.stat.cmu.edu/~larry/=sml/sparsity.pdf
- The "tree" idea: Morgan, James N., and John A. Sonquist. "Problems in the Analysis of Survey Data, and a Proposal." Journal of the American Statistical Association 58.302 (1963): 415-434.
- Standard reference book on trees: Breiman, Leo, et al. Classification and Regression Trees. Routledge, 2017.
- SHAP values, original paper: Lundberg, Scott M., and Su-In Lee. "A Unified Approach to Interpreting Model Predictions." Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- SHAP values, digestible intro: https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
- LIME (an explanation tool similar to SHAP), digestible intro: https://medium.com/dataman-in-ai/explain-your-model-with-lime-5a1a5867b423
- Double machine learning framework: Chernozhukov, Victor, et al. "Double/Debiased Machine Learning for Treatment and Structural Parameters." The Econometrics Journal 21.1 (2018): C1-C68.
- Somewhat accessible intro to DML: https://towardsdatascience.com/double-machine-learning-for-causal-inference-78e0c6111f9d
- DML video by one of the authors of DML: https://www.youtube.com/watch?v=eHOjmyoPCFU
- DoubleML package in R: https://cran.r-project.org/web/packages/DoubleML/DoubleML.pdf
- Bach, Philipp, et al. "DoubleML - An Object-Oriented Implementation of Double Machine Learning in R." arXiv preprint arXiv:2103.09603 (2021).