Introduction to Machine Learning: Ridge regression, Lasso regression, Elastic-net regression
Session 2
Oleg Deev & Štefan Lyócsa
Masaryk University
FINTECH RISK MANAGEMENT, www.fintech-ho2020.eu

The principle

Assume that we have the following observations available:

[Figure: scatter plot of the observations, with x ranging from 1.0 to 3.0.]

Assume that we know the true values (not contaminated by noise) are at the red line:

[Figure: the same scatter plot with the true values drawn as a red line.]

You want to estimate the relationship between $x$ and $y$. Using the estimated model, you would like to make predictions into the future. A common strategy is to first split the sample into two parts:
• Testing sample - allows the model to learn.
• Validation sample - tests the out-of-sample performance.
Different splitting strategies are possible; this is a basic one.

Both samples visualized:

[Figure: scatter plot with the testing and validation samples distinguished.]

Using data from the testing sample, let's fit a linear model:
$$Y_{i,test} = \beta_0 + \beta_1 X_{i,test} + u_{i,test}$$
The estimated coefficients are:
$$Y_{i,test} = 1.37 + 0.65 X_{i,test} + \hat{u}_{i,test}$$
We know that the model is ill-specified; there is no way a line is going to fit these data very well. But for prediction purposes, it might be a good-enough approximation to reality.

This is how the line looks:

[Figure: testing sample with the fitted regression line.]

Using data from the testing sample, let's fit a polynomial model:
$$Y_{i,test} = \beta_0 + \sum_{p=1}^{5} \beta_p X^p_{i,test} + u_{i,test} = \beta_0 + \beta_1 X_{i,test} + \beta_2 X^2_{i,test} + \beta_3 X^3_{i,test} + \beta_4 X^4_{i,test} + \beta_5 X^5_{i,test} + u_{i,test}$$
The estimated coefficients are:
$$Y_{i,test} = -24.33 + 75.59 X_{i,test} - 85.93 X^2_{i,test} + 48.28 X^3_{i,test} - 13.28 X^4_{i,test} - 1.41 X^5_{i,test} + \hat{u}_{i,test}$$
This polynomial is going to fit the data much better.

This is how the curve looks:

[Figure: testing sample with the fitted polynomial curve.]

Model comparison

We could compare which model fits the data better with, e.g., $R^2$. Instead, we calculate a related measure, the mean squared error, for the first model:
$$MSE_1 = N_{test}^{-1} \sum_i (Y_{i,test} - \hat{Y}_{i,1})^2 = 0.06676$$
The smaller the value, the better the fit. Now for the second model:
$$MSE_2 = N_{test}^{-1} \sum_i (Y_{i,test} - \hat{Y}_{i,2})^2 = 0.05401$$
The second model has the better fit, by approx. 19%!

Model comparison

The first model fits the data poorly: it is linear, while the data are curved. It is a biased model. The second model fits the data better. The higher the order of the polynomial, the better the fit and the lower the bias (in-sample). Is the model with the better fit on the testing sample also going to be better in the validation sample?
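Before answering, the comparison is easy to replicate in code. A minimal R sketch on simulated data (the data-generating curve, sample sizes and seed are assumptions, not the slide's data; whether the in-sample winner also wins out-of-sample depends on the draw):

# Simulate a curved relationship contaminated by noise
set.seed(1)
x <- runif(60, 1, 3)
y <- sin(2 * x) + 0.5 * x + rnorm(60, sd = 0.25)
tst <- data.frame(x = x[1:40],  y = y[1:40])   # testing sample
val <- data.frame(x = x[41:60], y = y[41:60])  # validation sample

m1 <- lm(y ~ x, data = tst)                       # linear model
m2 <- lm(y ~ poly(x, 5, raw = TRUE), data = tst)  # 5th-order polynomial

# In-sample fit on the testing sample
mse1 <- mean((tst$y - fitted(m1))^2)
mse2 <- mean((tst$y - fitted(m2))^2)

# Out-of-sample accuracy on the validation sample
msfe1 <- mean((val$y - predict(m1, newdata = val))^2)
msfe2 <- mean((val$y - predict(m2, newdata = val))^2)
round(c(mse1, mse2, msfe1, msfe2), 4)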
Model comparison

Using the coefficients from model 1 and model 2, and given new $x$ from the validation sample, we can predict $y$. Next, we compare which model forecasts better using the MSE, but now computed from the predicted values. This is called the mean squared forecast error (MSFE):
$$MSFE_1 = N_{validation}^{-1} \sum_i (Y_{i,validation} - \hat{Y}_{i,1})^2 = 0.0802$$
Now for the second model:
$$MSFE_2 = N_{validation}^{-1} \sum_i (Y_{i,validation} - \hat{Y}_{i,2})^2 = 0.2357$$

• The forecasts are less accurate on the validation sample.
• The linear model (although biased) performs much better.

Why? The polynomial model is over-fitting the data, i.e. it fits too well at the expense of extra parameters. Parameters are not estimated with certainty; they suffer from variance. This leads to an increase in the variance of the predictions.

[Figure: validation sample showing the true values, the fitted line and the fitted curve.]

The goal of machine learning is to find an optimum between model bias and the variance of predictions. There are many strategies; two standard ones are:
• Regularization (Ridge regression, LASSO, Elastic net).
• Boosting (regression trees, random forests, ...).
One strategy is to allow a small bias (e.g. fewer parameters in the model) while lowering the variance. The accuracy of predictions might improve.

The common theme is to sacrifice in-sample fit in the hope of a better out-of-sample prediction. Recall the multiple linear regression model:
$$Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \dots + \beta_p X_{i,p} + u_i$$
Using OLS, the parameters of interest are estimated by minimizing the sum of squared residuals:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i,1} - \dots - \beta_p X_{i,p})^2$$
In short:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

OLS approach:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Ridge regression:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
where:
• $\lambda > 0$,
• $X$ are standardized (0 mean, 1 variance),
• $Y$ is centered around 0.

The higher the $\lambda$, the lower the $\beta$ coefficients, i.e. the stronger the penalty. Why might Ridge regression actually work? The higher the $\lambda$, the less sensitive the dependent variable $Y$ is to changes in the explanatory variable(s) $X_j$. The Ridge regression model is more 'robust' to changes in the explanatory variables. How to find $\lambda$? The standard approach is the 10-fold cross-validation technique; see the next Case study.

Case study: What factors drive the rate of return on a loan? We use the same model as in Case study 3. Now, instead of OLS, we estimate it via the penalized 'Ridge' estimator.
• Can Ridge out-perform (out-of-sample) the OLS model?
$$RR2_i = \beta_0 + \beta_1 new_i + \beta_2 ver3_i + \dots + \beta_p nrodep_i + u_i$$
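The glmnet code on the following slides assumes a numeric predictor matrix indep, a response vector dep (testing sample) and a matrix pred (validation sample). A minimal sketch of one way to build them, assuming the tst and val data frames created in the sample split on the next slide (this construction is an assumption; the course codes hold the actual version):

library(glmnet)

# Same regressors as in the OLS model m7 (see the OLS slide below)
f <- RR2 ~ new+ver3+ver4+lfi+lee+luk+lrs+lsk+age+undG+female+lamt+int+
  durm+educprim+educbasic+educvocat+educsec+msmar+msco+mssi+msdi+
  nrodep+espem+esfue+essem+esent+esret+dures+exper+linctot+noliab+
  lliatot+norl+lamteprl+nopearlyrep

indep <- model.matrix(f, data = tst)[, -1]  # drop the intercept column
dep   <- tst[, "RR2"]                       # response on the testing sample
pred  <- model.matrix(f, data = val)[, -1]  # validation-sample predictors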
The workflow:
1. Split the sample into two; leave the last 100 observations for out-of-sample validation.
2. Estimate OLS and calculate the MSFE using the out-of-sample data.
3. Perform k-fold cross-validation to estimate $\lambda$ for the Ridge regression models.
4. Calculate the MSFE using the out-of-sample data.

Sample split

• NF = 100
• N = dim(DT)[1]
• tst = DT[1:(N-NF),]
• val = DT[((N-NF)+1):N,]

k-fold cross-validation

We need to prepare the data for the glmnet functions (see the course codes; one possible sketch is given above).
• CV = cv.glmnet(x=indep, y=dep, nfolds=30, alpha=0)
• plot(CV)
• CV$lambda.min
• CV$lambda.1se
• round(cbind(coefficients(m7), coef(CV, s="lambda.min"), coef(CV, s="lambda.1se")), 4)

OLS estimation and prediction

• m7 = lm(RR2 ~ new+ver3+ver4+lfi+lee+luk+lrs+lsk+age+undG+female+lamt+int+durm+educprim+educbasic+educvocat+educsec+msmar+msco+mssi+msdi+nrodep+espem+esfue+essem+esent+esret+dures+exper+linctot+noliab+lliatot+norl+lamteprl+nopearlyrep, data=tst)
• yOLS = predict(m7, newdata=val)
• ytrue = val[,"RR2"]
• MSEOLS = mean((yOLS-ytrue)^2)

How $\lambda$ (actually $\log(\lambda)$) and the MSE are related: increasing the penalization is very expensive, as it increases the MSE considerably.

[Figure: cross-validated MSE against log(Lambda) for the Ridge model; the counts along the top of the plot (39 throughout) give the number of nonzero coefficients, which Ridge never shrinks to zero.]

Prediction

We need to prepare the data for the glmnet functions (see the course codes).
• yRIDGEmin = predict(CV, newx=pred, s=CV$lambda.min)
• MSER1 = mean((ytrue-yRIDGEmin)^2)
• yRIDGE1se = predict(CV, newx=pred, s=CV$lambda.1se)
• MSER2 = mean((ytrue-yRIDGE1se)^2)
• cbind(MSEOLS, MSER1, MSER2)

OLS approach:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Ridge regression:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
Least Absolute Shrinkage and Selection Operator (LASSO):
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
As before:
• $\lambda > 0$,
• $X$ are standardized (0 mean, 1 variance),
• $Y$ is centered around 0.

As with Ridge, the higher the $\lambda$, the lower the $\beta$ coefficients, i.e. the stronger the penalty. With LASSO, coefficients might be reduced exactly to 0. This is useful, as LASSO reduces model complexity, which in turn is known to be helpful for forecasting purposes.

Which to use, LASSO or Ridge?
• Ridge is useful when many variables are supposed to be useful (and they might be highly correlated as well).
• LASSO is useful when only a few variables are useful.
Why not select only the useful variables and run OLS? A sketch of LASSO's selection in action follows.
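LASSO in fact performs that selection itself. A minimal sketch of inspecting which coefficients are shrunk exactly to zero, assuming the indep and dep objects from the Ridge case study:

library(glmnet)

# alpha = 1 selects the pure L1 (LASSO) penalty
CVlasso <- cv.glmnet(x = indep, y = dep, nfolds = 30, alpha = 1)

# Coefficients at the 1-standard-error lambda; zeros are dropped variables
b <- as.matrix(coef(CVlasso, s = "lambda.1se"))
kept <- rownames(b)[b != 0]
setdiff(kept, "(Intercept)")  # variables LASSO keeps in the model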
Case study: What factors drive the rate of return on a loan? We use the same model as in Case studies 3 and 4. Now, instead of OLS and Ridge, we estimate it via the penalized 'LASSO' estimator.
• Can LASSO out-perform (out-of-sample) the OLS and Ridge models?
$$RR2_i = \beta_0 + \beta_1 new_i + \beta_2 ver3_i + \dots + \beta_p nrodep_i + u_i$$

The workflow:
1. Split the sample into two; leave the last 100 observations for out-of-sample validation.
2. Estimate OLS and calculate the MSFE using the out-of-sample data.
3. Perform k-fold cross-validation to estimate $\lambda$ for the Ridge regression models.
4. Calculate the MSFE using the out-of-sample data.
5. Perform k-fold cross-validation to estimate $\lambda$ for the LASSO regression models.
6. Calculate the MSFE using the out-of-sample data.

k-fold cross-validation

We need to prepare the data for the glmnet functions (see the course codes).
• CV = cv.glmnet(x=indep, y=dep, nfolds=30, alpha=1)
• plot(CV)
• CV$lambda.min
• CV$lambda.1se
• round(cbind(coefficients(m7), coef(CV, s="lambda.min"), coef(CV, s="lambda.1se")), 4)

How $\lambda$ (actually $\log(\lambda)$) and the MSE are related:

[Figure: cross-validated MSE against log(Lambda) for the LASSO model; the counts along the top (38, 37, 36, 34, 31, 29, 26, 21, 15, 6, 4, 2, 2, 2) show the number of nonzero coefficients falling as the penalty grows.]

Prediction

We need to prepare the data for the glmnet functions. See the code in Case study 3. Next, we can run the predictions:
• yLASSOmin = predict(CV, newx=pred, s=CV$lambda.min)
• MSEL1 = mean((ytrue-yLASSOmin)^2)
• yLASSO1se = predict(CV, newx=pred, s=CV$lambda.1se)
• MSEL2 = mean((ytrue-yLASSO1se)^2)
• cbind(MSEOLS, MSER1, MSER2, MSEL1, MSEL2)

OLS approach:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Ridge regression:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
Least Absolute Shrinkage and Selection Operator (LASSO):
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
Elastic net:
$$\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \left( (1-\alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right)$$

The Elastic net gives a combined penalization of Ridge and LASSO. The new parameter $\alpha$ shows which of the two penalization forms gets the higher weight:
• If $\alpha = 1$, it is a LASSO model.
• If $\alpha = 0$, it is a Ridge model.
• With $0 < \alpha < 1$, we have the Elastic net.
As before, the optimal $\alpha$ and $\lambda$ are determined via cross-validation; a sketch follows after this slide.

Case study: What factors drive the rate of return on a loan? We use the same model as in Case studies 3, 4 and 5. Now, instead of OLS, Ridge and LASSO, we estimate it via the 'Elastic net' estimator.
• Can Elastic net out-perform (out-of-sample) the OLS, Ridge and LASSO models?
$$RR2_i = \beta_0 + \beta_1 new_i + \beta_2 ver3_i + \dots + \beta_p nrodep_i + u_i$$
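The next slides fix $\alpha$ at 0.25, 0.50 and 0.75 and cross-validate only $\lambda$. More generally, one could search a small grid of $\alpha$ values; a minimal sketch, assuming the indep and dep objects from before (the grid and seed are assumptions):

library(glmnet)

# Use a common fold assignment so every alpha is judged on the same folds
set.seed(1)
fid <- sample(rep(1:30, length.out = nrow(indep)))

alphas <- c(0, 0.25, 0.50, 0.75, 1)
cvs <- lapply(alphas, function(a)
  cv.glmnet(x = indep, y = dep, foldid = fid, alpha = a))

# Lowest cross-validated MSE achieved by each alpha
cvmse <- sapply(cvs, function(cv) min(cv$cvm))
best  <- which.min(cvmse)
alphas[best]            # chosen mixing weight
cvs[[best]]$lambda.min  # chosen penalty strength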
Elastic net with $\alpha = 0.25$

• CV = cv.glmnet(x=indep, y=dep, nfolds=30, alpha=0.25)
• yNET025min = predict(CV, newx=pred, s=CV$lambda.min)
• MSEEN1.1 = mean((ytrue-yNET025min)^2)
• yNET0251se = predict(CV, newx=pred, s=CV$lambda.1se)
• MSEEN1.2 = mean((ytrue-yNET0251se)^2)

Elastic net with $\alpha = 0.50$

• CV = cv.glmnet(x=indep, y=dep, nfolds=30, alpha=0.50)
• yNET050min = predict(CV, newx=pred, s=CV$lambda.min)
• MSEEN2.1 = mean((ytrue-yNET050min)^2)
• yNET0501se = predict(CV, newx=pred, s=CV$lambda.1se)
• MSEEN2.2 = mean((ytrue-yNET0501se)^2)

Elastic net with $\alpha = 0.75$

• CV = cv.glmnet(x=indep, y=dep, nfolds=30, alpha=0.75)
• yNET075min = predict(CV, newx=pred, s=CV$lambda.min)
• MSEEN3.1 = mean((ytrue-yNET075min)^2)
• yNET0751se = predict(CV, newx=pred, s=CV$lambda.1se)
• MSEEN3.2 = mean((ytrue-yNET0751se)^2)

We can compare the results. Out-of-sample MSEs, best first (suffix _1: lambda.1se; suffix _M: lambda.min):

Model     MSE
EN75_1    868.86
LASSO_1   871.52
EN25_1    874.82
EN50_1    876.54
Ridge_1   929.69
Ridge_M   968.78
EN75_M    976.86
EN25_M    976.97
EN50_M    977.07
LASSO_M   977.57
OLS       995.62
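A sketch of how such a ranking could be assembled from the MSE objects computed on the previous slides (the object names follow those slides; the collection step itself is an assumption):

# Gather the out-of-sample MSEs and sort from best to worst
mses <- c(OLS     = MSEOLS,
          Ridge_M = MSER1,    Ridge_1 = MSER2,
          LASSO_M = MSEL1,    LASSO_1 = MSEL2,
          EN25_M  = MSEEN1.1, EN25_1  = MSEEN1.2,
          EN50_M  = MSEEN2.1, EN50_1  = MSEEN2.2,
          EN75_M  = MSEEN3.1, EN75_1  = MSEEN3.2)
round(sort(mses), 2)  # smallest MSFE first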