High-Dimensional Statistics and Machine Learning with Applications to Insurance
November 2022, Masaryk University, Brno
Ivana Milović, MAS PhD

Introducing Myself
Ivana Milović, MAS PhD – Non-Life Pricing Actuary (SME), ivana.milovic@allianz.at
Prior experience
• Uniqa Insurance Group – Non-Life Pricing Actuary (Motor)
• Lecturer – University of Vienna
• Pre-doc and post-doc researcher – Department of Statistics, University of Vienna
Education
• PhD in Statistics (Univ. of Vienna, 2016)
• Master of Advanced Studies in Mathematics (Univ. of Cambridge, 2011)
• BSc in Mathematics and Computer Science (Univ. of Belgrade, 2010)

What is pricing?
“Pricing is the way that a company decides prices for its products or services, or the prices decided” – Cambridge Dictionary

Why do we need statistics and mathematical modelling for pricing in insurance?
Classical industry example: selling paperclips
• Known operating costs (rent, maintenance, salaries, marketing, etc.)
• Known production costs (materials, etc.)
• Known profit margin
→ Known price of a paperclip. Fully deterministic!

Classical insurance example: selling a policy
• Known operating costs (rent, maintenance, salaries, marketing, etc.)
• Unknown claim costs (claim occurrence and severity are random events)
• Known profit margin
→ Unknown price of a policy. Not deterministic!

If the cost of a policy is random, how do we estimate it? There are two ways:
• Based on historical data and expert judgement (simplistic approach)
• Fitting statistical models to historical data → technical pricing

What are the goals of technical pricing?
• To provide the best estimate for the expected cost of an insurance policy → fair price
• To help us predict future losses and better assess portfolio and segment performance
• To know which segments are technically profitable and unprofitable → identify business opportunities

How to perform technical pricing?
[Diagram: the pricing workflow runs from data (collection, preparation, formats, variables, validation) through the choice of a family of models (scope, analysis, complexity, validation) to the final price, implemented with IT. Our focus in this class is the modelling part.]

Content
o Model assessment and selection
o Cross-validation, AIC, BIC
o Linear models
o PCR, regularization methods
o Generalized linear models
o Pricing process
o Machine learning in insurance

Let's get started

Introduction
▪ Let Y be a quantitative response and X = (X_1, …, X_p) a set of regressors, and suppose
Y = f(X) + ε,
for some fixed (but unknown) function f.
▪ ε has mean 0 and is independent of X. Often, we assume normality.
▪ Note: X can be fixed or random.
Example: Y is the number of claims and X are the characteristics of a driver and his car.

Statistical learning is a set of approaches for estimating f by f̂ from the data. Estimation goals can be:
➢ Prediction
➢ Inference

Prediction: Ŷ = f̂(X), for some estimate f̂. If prediction is our only goal and we have no interest in the form of f, then many modern techniques give good results: random forests, gradient-boosted trees, etc.
Example: predicting prices on the stock exchange. Here the interpretation is not important, as long as the results are good.
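To make this setting concrete, here is a minimal simulated sketch in R (my own illustration, not from the slides): data are generated from a known f, an estimate f̂ is fitted, and predictions Ŷ = f̂(X_0) are computed for new points. The choice of f and of a smoothing-spline fit are purely illustrative assumptions.

```r
set.seed(1)

# True (but in practice unknown) regression function f
f <- function(x) 2 * sin(x)

# Simulate training data: Y = f(X) + eps, with eps ~ N(0, 0.5^2)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- f(x) + rnorm(n, sd = 0.5)

# Estimate f by a flexible method (here a smoothing spline)
fit <- smooth.spline(x, y)

# Predict on new, unseen inputs: Y_hat = f_hat(X0)
x0 <- seq(0, 2 * pi, length.out = 5)
y_hat <- predict(fit, x0)$y

# The gap between prediction and truth is the reducible part;
# the noise eps can never be predicted (irreducible part)
cbind(x0, truth = f(x0), prediction = y_hat)
```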
▪ The accuracy of Ŷ depends on two quantities:
➢ reducible error – coming from approximating f by f̂
➢ irreducible error – the error coming from ε
▪ We measure the accuracy by the expected prediction error
E(Y − Ŷ)² = E(f(X) − f̂(X))² + Var(ε),
where the first term is the reducible error and the second the irreducible error.
▪ Goal: to find a method that has a small reducible error.

▪ Note that
E(Y − Ŷ)² = E(f(X) + ε − f̂(X))² = E(f(X) − f̂(X))² + 2E[(f(X) − f̂(X)) ε] + E(ε²) = E(f(X) − f̂(X))² + Var(ε),
because the cross term vanishes: ε has mean 0 and is independent of X (and f̂ is treated as fixed here).

Inference: we also want to understand the form of f, i.e. the relationship between Y and X = (X_1, …, X_p).
• Is f linear or more complex?
• Which regressors are associated with Y?
• What is their relationship?

Choice of Model
We may choose our model based on what we are more interested in: prediction or inference.
Example:
➢ Parametric models like linear models and GLMs: simple and interpretable, but not always very accurate
➢ Non-parametric models like splines, GBM, random forests: better predictions but much less interpretable
Factors like sample size, computational power, etc. also play a significant role in the decision making.

Example: linear regression vs. splines
[Figure: a linear fit and a spline fit to the same data – the spline follows the data more closely, but what about interpretability?]

Controversies

Machine learning controversy
Many machine learning techniques offer fully automated routines for calculating prices, insurance premiums, etc., or for clustering data into segments (for example: brands or regions). But if interpretability is missing, many problems can occur.

▪ Certain companies have sparked controversy when ethnic, gender or otherwise 'unethical' variables slipped into their models, often because data bias was not corrected.

In 2019, Facebook was found to be in contravention of the U.S. constitution by allowing its advertisers to deliberately target adverts according to gender, race and religion, all of which are protected classes under the country's legal system. Job adverts for roles in nursing or secretarial work were suggested primarily to women, whereas job ads for janitors and taxi drivers had been shown to a higher number of men, in particular men from minority backgrounds. The algorithm learned that ads for real estate were likely to attain better engagement statistics when shown to white people, resulting in them no longer being shown to other minority groups.

What about the insurance industry?
• Current standard: GLM models
• Can machine learning replace them? More on that later!

Assessing model accuracy

Assessing Model Accuracy
• No model dominates all other models over all possible data sets. We need to decide which model is most suitable for the data set at hand.
• The prediction error E(Y − f̂(X))² can be estimated by the mean squared error (MSE)
MSE = (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))²,
given a sample (X_i, Y_i), i = 1, …, n.
• Here X_i denotes the p-vector of regressors for the i-th data point.

• But we do not want to judge the model's accuracy only on the data we have already observed!
• (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))² is, in fact, the in-sample (training) MSE.
• We want our model to perform well on future data.
• For a new (unseen) observation (X_0, Y_0), it should hold that f̂(X_0) ≈ Y_0.
• In general, when considering all new data points,
Average_{(X_0, Y_0)} (Y_0 − f̂(X_0))²
should be small. This is the out-of-sample (testing) MSE.
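As a minimal illustration of the difference between the two errors, the following R sketch (my own, not from the slides) fits polynomials of increasing degree on a training set and evaluates the MSE both in-sample and on a held-out test set; the simulated data and the range of degrees are illustrative assumptions.

```r
set.seed(2)

# Simulated data from a nonlinear truth plus noise
n <- 300
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

# Random 80/20 split into training and testing data
train_id <- sample(n, size = 0.8 * n)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

mse <- function(y, y_hat) mean((y - y_hat)^2)

# Training vs. testing MSE for polynomial degrees 1..10:
# the training MSE keeps falling, the testing MSE eventually rises
for (d in 1:10) {
  fit <- lm(y ~ poly(x, d), data = train)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              d, mse(train$y, fitted(fit)),
              mse(test$y, predict(fit, newdata = test))))
}
```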
There is no guarantee that a model with a small training MSE will also have a small testing MSE. This leads to the concepts of underfitting and overfitting.

As the model complexity increases, the training error keeps getting smaller, but beyond some point the testing error starts to increase again.
Underfitting: the model is too simple and performs badly on the training data, and consequently also on the testing data.
Overfitting: the training data are modelled too well, because non-existing patterns (coming from the noise) are found in the data. Therefore, the performance on future data is poor.

Bias-variance trade-off
▪ Let X_0 be fixed. Note that the test MSE can be written as
E(Y_0 − f̂(X_0))² = Bias(f̂(X_0))² + Var(f̂(X_0)) + Var(ε),
where the first two terms form the reducible error and Var(ε) is the irreducible error.
▪ Bias: error introduced by approximating f by f̂.
▪ Variance: how much f̂ changes if we use different data sets for training.

Bias-variance trade-off (additional)

It is easy to find a method with low bias and high variance: just use a curve that connects all the points.
It is easy to find a method with low variance and high bias: just take a flat line through the data.
But we want a method that simultaneously has low bias and low variance.

[Figure: example of the bias-variance trade-off.]

Test MSE Estimation
In real-life situations it is not possible to compute the test MSE E(Y_0 − f̂(X_0))² directly, because f and the distribution of new data points are unknown, so we need to estimate it.
The estimation can be done in the following ways:
➢ Cross-validation: directly estimating the test MSE by resampling
➢ Indirect estimation: adjust the training error by a penalty term which takes the model dimension into account, i.e. test MSE ≈ training MSE + penalty term

Cross-validation

Cross-Validation
• Used to estimate the test MSE for a given statistical model
• It tells us how our model performs on unseen data
• When comparing several competing models, the one with the smallest cross-validation error (CV) is preferred
• It can also be used for selecting tuning parameters of a chosen model (Ridge, Lasso, etc.)

There are three ways in which CV can be done:
1. Validation set approach: divide the data randomly into two sets, training and testing; usually an 80–20% split is used. The model is then fitted on the training set and the prediction error
(1/m) Σ_{i=1}^m (Y_i − Ŷ_i)²
is calculated on the m testing observations.

Example:
▪ The model trained on 80% of the data gives the prediction rule Ŷ = 2X.
▪ The test data are:
Y   X
5   2
9   5
10  4
▪ The CV error equals 1/3 · [(5 − 4)² + (9 − 10)² + (10 − 8)²] = 6/3 = 2.

Drawbacks:
➢ The CV error can be extremely variable, depending on how the data were split
➢ Only a subset of the data is used for training; this introduces bias, so we might overestimate the testing error
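The validation-set calculation above is easy to reproduce; this short R sketch (illustrative, mirroring the example on the slide) returns the same error of 2.

```r
# Test set from the example: response Y and regressor X
test <- data.frame(Y = c(5, 9, 10), X = c(2, 5, 4))

# Prediction rule obtained on the 80% training part: Y_hat = 2 * X
y_hat <- 2 * test$X

# Validation-set estimate of the test MSE
cv_error <- mean((test$Y - y_hat)^2)
cv_error  # 2
```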
2. Leave-one-out cross-validation (LOOCV): a data set with n sample points is split into a training set of n − 1 points, on which the model is fitted, and a single remaining point on which it is tested. This is repeated n times, so that each point is used exactly once as the validation set. The n prediction errors are then averaged.

• Now there is no randomness in the data splits, and there is much less bias compared to the previous method, because n − 1 points are used for training.
• Problem: we have to fit the model n times. Computationally expensive.

3. K-fold cross-validation: randomly divide the data set into k parts of (approximately) equal size. Then train the model on k − 1 parts and test on the remaining part. Repeat k times and average the testing errors.

• How big should k be?
• Experience shows that k = 5 or k = 10 gives the best results.
• We fit the model only k times.
• The bias remains small, because we fit on almost all of the data, and the variability of the CV estimate is smaller than for LOOCV, because the outputs of the individual fits are less correlated.
• This method corrects the disadvantages of the previous two.

Cross-validation example
• Response variable mpg – miles per gallon
• Polynomial regression is performed with the regressor horsepower. But which degree to take?
• Cross-validation can give us an answer.

[Figure: estimated test MSE for the different polynomial degrees – validation set approach.]
[Figure: cross-validation results for the different polynomial degrees.]

Adjust the training error: AIC, BIC, etc.
The other way of estimating the test error is to adjust the training MSE.

• AIC (Akaike Information Criterion) is an estimator of the out-of-sample prediction error and thereby of the relative quality of a statistical model for a given set of data.
• Given a collection of models, AIC estimates the quality of each model. Thus, AIC provides a means for model selection.
• Akaike extends maximum likelihood estimation to the case where the number of parameters p is also unknown. A penalty depending on p is introduced, so a parameter is added to the model only if it leads to a significant improvement in the fit.

• Let f(y | θ) be a candidate model for Y, with θ ∈ R^p. For example, f(y | θ) can be the density of N(Xθ, I).
• Let θ̂ = θ̂(Y) be the maximum likelihood estimator, given the data Y ∈ R^n.
• Then AIC = −2 log f(Y | θ̂) + 2p is used as an estimate of the (relative) out-of-sample prediction error.
• The model with the smallest AIC is chosen.

BIC (Bayesian Information Criterion) is a similar method to AIC.
➢ The model with the smallest BIC = −2 log f(Y | θ̂) + p log(n) is chosen.
➢ Since the penalty term here is larger, sparser models are selected than with AIC.
➢ In the linear regression model with normal errors, AIC and BIC take the following forms (up to additive constants):
AIC = n log(MSE) + 2p and BIC = n log(MSE) + p log(n).
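To make the degree selection concrete, here is a hedged R sketch along the lines of the mpg example above; it assumes the Auto data set from the ISLR2 package and uses cv.glm from the boot package for 10-fold cross-validation, with AIC and BIC shown as the penalty-based alternatives.

```r
# install.packages(c("ISLR2", "boot"))  # if not yet installed
library(ISLR2)  # provides the Auto data (mpg, horsepower, ...)
library(boot)   # provides cv.glm for k-fold cross-validation

set.seed(3)
for (d in 1:5) {
  # glm with the default gaussian family is an ordinary linear model,
  # but it works directly with cv.glm
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv_err <- cv.glm(Auto, fit, K = 10)$delta[1]  # 10-fold CV estimate of test MSE
  cat(sprintf("degree %d: 10-fold CV = %.2f, AIC = %.1f, BIC = %.1f\n",
              d, cv_err, AIC(fit), BIC(fit)))
}
```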
Types of Models

Linear Models

Model selection and regularization
• Linear models (and generalized linear models, GLMs), though simple, turn out to be surprisingly competitive in real-world problems compared to more complex models.
• The reason for that lies in their simplicity and interpretability.
• GLMs are the standard in the insurance business, and most of the results for linear models generalize naturally to them.
• But what is their prediction accuracy, and what happens when the number of parameters p is large compared to the sample size n? More tomorrow!

▪ Let us focus on linear models, for demonstration.
▪ Assume that Y = Xβ + ε for some β ∈ R^p, with E(ε) = 0 and Var(ε) = σI. Also, Y ∈ R^n and X ∈ R^{n×p}.
▪ The OLS estimator β̂ = (X'X)⁻¹ X'Y is well-defined for n ≥ p (assuming X has full column rank) and it is unbiased. Therefore, the predictions Ŷ = Xβ̂ are unbiased.
▪ For p > n, OLS is not even defined. Therefore, we have to come up with some other estimators.

But what about the variance of these estimates?
• If n ≫ p, the variance is usually small and our estimates are accurate.
• But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because det(X'X) is almost 0 and the matrix inversion becomes very unstable.

Examples of (potentially) highly correlated variables:
• Motor insurance: vehicle age and contract age; population density and regional segmentation variables
• SME insurance: turnover and number of employees

• Also, if n is not much larger than p, the estimates can become very unstable.
• Example: if all regressors are i.i.d. N(0,1), the variance of the predictions equals σ · p / (n − p − 1).
• This is problematic when p is large compared to n.

Alternatives to OLS in linear regression:
➢ Subset selection (best subset and stepwise)
➢ Dimension reduction (PCA, for example)
➢ Shrinkage methods (Ridge, Lasso, etc.)

Subset Selection
1. Best subset selection: for a linear model with p predictors do
➢ Let M_0 be the null model with zero regressors, i.e. the sample mean of Y is used as the predictor.
➢ For k = 1, 2, …, p:
1. Fit all (p choose k) models that contain exactly k predictors.
2. Pick the best among these (p choose k) models, i.e. the one with the largest R², and call it M_k.
➢ Select the best model among M_0, M_1, …, M_p using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.
https://en.wikipedia.org/wiki/Coefficient_of_determination

• This method is conceptually very simple to understand.
• Problem? Too many models to fit!
• How many? 2^p models.
• For example: for p = 30, there are 1 073 741 824 models to fit!
• So, we need another solution.
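For moderate p, best subset selection can still be run directly. The sketch below (an illustration on simulated data, not from the slides) uses regsubsets from the R package leaps, which the prostate example later also relies on; the same function supports method = "forward" and method = "backward" for the stepwise variants described next.

```r
# install.packages("leaps")
library(leaps)

set.seed(4)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
# Only x1 and x2 truly matter in this simulated example
y <- 1 + 2 * X[, 1] - 3 * X[, 2] + rnorm(n)
dat <- data.frame(y, X)

# Exhaustive search: best model (largest R^2) of each size 1..p
best <- regsubsets(y ~ ., data = dat, nvmax = p)
s <- summary(best)

# Choose the final model size with BIC
# (R^2 alone would always pick the full model)
size <- which.min(s$bic)
coef(best, size)  # coefficients of the selected model
```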
Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward

Forward stepwise selection
• A computationally efficient alternative to best subset selection
• Here we begin with the null model and add predictors one at a time until we reach the full model (or some stopping rule is applied)
• Then we choose among these models using cross-validation, AIC, BIC, etc.

More formally, forward stepwise selection: for a linear model with p predictors do
➢ Let M_0 be the null model with zero regressors, i.e. the sample mean of Y is used as the predictor.
➢ For k = 0, 1, …, p − 1:
1. Consider all p − k models that add one additional predictor to the model M_k.
2. Pick the best among these p − k models, i.e. the one with the largest R², and call it M_{k+1}.
➢ Select the best model among M_0, M_1, …, M_p using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.

• Here we fit only 1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models.
• For example: for p = 30, there are 466 models to fit. Much better than before.
• This procedure works well in practice, but there is no guarantee that we will select the best model overall.

Backward stepwise selection: similar, but here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
• The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
• It is a data frame with 97 rows and 9 columns.

The R package leaps is used to select the best model (based on R²) of each size.

• Then AIC and BIC are calculated for each of these models, based on the formulas for linear regression with normal errors.

Summary for today

Summary
• We assess the model quality by its prediction error (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))², given a sample (X_i, Y_i), i = 1, …, n.
• But this is only one part of the picture – the training (in-sample) error.
• It is necessary to estimate this error for new (unseen) data – the testing (out-of-sample) error.

A model (and its complexity) should be chosen based on these two prediction errors.

• The training error we can estimate from the sample directly.
• There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Linear models: simple, but widely used because of their simplicity and interpretability.
OLS is well-defined for n ≥ p.
But they perform badly if
➢ p is large compared to n
➢ some of the regressors are highly correlated

Some methods to reduce the number of parameters:
1. Best subset selection: all submodels are considered, but this is computationally infeasible.
2. Stepwise regression: regressors are added one at a time. Once a regressor is chosen, it stays.

Preview
We are still to see:
• Some other methods that do model selection for linear models
• How to deal with correlations
• How to deal with the p > n case

Thank you!