High-Dimensional Statistics and Applications in Insurance
November 2019, Masaryk University, Brno
Ivana Milović, MAS PhD

Introducing myself
Ivana Milović, MAS PhD
Non-Life Pricing Actuary - Group P&C Pricing
Lecturer - University of Vienna
Prior experience
➢ Prae- and Post-Doc Researcher - Department of Statistics, University of Vienna
Education
➢ PhD in Statistics (Univ. of Vienna, 2016)
➢ Master of Advanced Studies in Mathematics (Univ. of Cambridge, 2011)
➢ BSc in Mathematics and Computer Science (Univ. of Belgrade, 2010)

Introduction to UNIQA
UNIQA at a glance
Key financials (EURm):

                                    2013    2014    2015    2016(c)   2017
  Gross written premiums(a)        5,886   6,064   6,325    5,048    5,293
  Premiums earned (retained)(a)    5,641   5,839   6,102    4,443    6,628
  Earnings before taxes              308     378     423      225      242
  Consolidated net profit            285     290     331      148      161
  Combined ratio (net, P&C)        99.8%   99.6%   97.8%    98.1%    97.5%
  Return on equity                 11.9%    9.9%   10.9%     4.7%     5.1%

Diversification by regions and products (GWP(a)(b) FY17): UNIQA Austria 69%, UNIQA International 31%; by products 50% / 20% / 30% across Life, P&C and Health. [Map: UNIQA's geographical footprint]
(a) Including savings portion of premiums from unit- and index-linked life insurance. (b) Excluding consolidation and UNIQA Reinsurance. (c) UNIQA signed a contract to sell its Italian operations on Dec 2; FY16 IFRS figures therefore exclude Italy.

4WARD
What are Shared Services?
"A central service unit is an entity within a multi-unit organization responsible for supplying the business units, respective divisions and departments with specific operational tasks & processes (e.g. accounting, payroll, IT, compliance or, as in UNIQA's case, actuarial and risk)."

UNIQA 4WARD (U4W)
Local UNIQA Business Units (BUs) and UNIQA Group are customers of U4W and will outsource specific processes to U4W. U4W performs the processes for the customers according to commonly defined Service Levels. [Map: UNIQA operating countries and HQ]

Benefits through UNIQA 4WARD
▪ Standardization
▪ Specialization
▪ Speed

Benefits with UNIQA 4WARD - development of personal and professional skills with UNIQA
What can UNIQA 4WARD offer you?
1. General onboarding training with a focus on UNIQA tools and standards as well as intercultural awareness
2. Function-specific training in the relevant Group department in Vienna - partially spending time in Vienna and in Bratislava, with a strong "applied learning" (learning-by-doing) approach
3. Mentoring program and on-the-job knowledge transfer
4. International working culture and positive working atmosphere
5. Start-up environment with the stability of an international insurance company in the background
Also: actuarial education and continuous professional & soft-skill training, various employee benefits, flexible working times & home office, 25 vacation days, language courses, bonus payments.

Mathematics Challenge
https://www.uniqa4ward.com/en/challenge.html#Challenge

Introduction
▪ Topics
➢ Model assessment and selection
➢ Cross-validation, AIC, BIC
➢ Linear models
➢ PCR, regularization methods
➢ Generalized linear models
➢ Pricing process
➢ Machine learning in insurance

Introduction
▪ Let $Y$ be a quantitative response and $X = (X_1, \dots, X_p)$ a set of regressors, and suppose
$Y = f(X) + \epsilon$,
for some fixed (but unknown) function $f$.
▪ $\epsilon$ has mean 0 and is independent of $X$. Often we assume normality. A minimal simulation of this setup is sketched below.
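The following is a minimal R sketch of the model above; the particular choice of $f$, the sample size and the distributions are illustrative assumptions, not part of the lecture material.

```r
# Sketch of the model Y = f(X) + eps (f, n and the distributions are illustrative)
set.seed(1)
n   <- 200
X   <- runif(n, 0, 10)               # regressor (here treated as random)
f   <- function(x) 5 + 2 * sin(x)    # the "true", in practice unknown, function f
eps <- rnorm(n)                      # noise: mean 0, independent of X, here normal
Y   <- f(X) + eps                    # observed response
```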
▪ Note: $X$ can be fixed or random.
▪ Example: $Y$ is the number of claims and $X$ are the characteristics of a driver and his car.

Introduction
▪ Statistical learning is a set of approaches for estimating $f$ by $\hat f$ from the data.
▪ Estimation goals can be:
➢ Prediction
➢ Inference

Introduction
▪ Prediction: $\hat Y = \hat f(X)$, for some estimate $\hat f$.
▪ If prediction is our only goal and we have no interest in the form of $f$, then many modern techniques give good results: random forests, gradient boosting trees, etc.
▪ Example: predicting prices on the stock exchange. Here the interpretation is not important, as long as the results are good.

Introduction
▪ The accuracy of $\hat Y$ depends on two quantities:
➢ reducible error - coming from approximating $f$ by $\hat f$
➢ irreducible error - the error coming from $\epsilon$
▪ We measure the accuracy by the expected prediction error
$E(Y - \hat Y)^2 = \underbrace{E\big(f(X) - \hat f(X)\big)^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$
▪ Goal: to find a method that has a small reducible error.

Introduction
▪ Inference: we also want to understand the form of $f$, i.e. the relationship between $Y$ and $X = (X_1, \dots, X_p)$.
▪ Is $f$ linear or more complex?
▪ Which regressors are associated with $Y$?
▪ What is their relationship?

Choice of Model
▪ We may choose our model based on what we are more interested in: prediction or inference.
▪ Example:
➢ Parametric models like linear models and GLMs: simple and interpretable, but not always very accurate
➢ Non-parametric models like splines, GBM, random forests: better predictions, but much less interpretable
▪ Factors like sample size, computational power, etc. also play a significant role in making a decision.

Choice of Model
Example: linear regression vs. splines. [Figure]

Machine learning controversy
▪ Many machine learning techniques offer fully automated routines for calculating prices, insurance premiums, etc., or for clustering data into different segments (for example: brands or regions).
▪ But if interpretability is missing, many problems can occur.
▪ Certain companies have sparked controversy as ethnic, gender or otherwise 'unethical' variables slipped into their models, often because data bias was not corrected.

Machine learning controversy
▪ What about the insurance industry?
▪ Current standard: GLM models.
▪ Can machine learning replace them? More on that later!

Assessing Model Accuracy
▪ No model dominates all other models over all possible data sets. We need to decide which model is most suitable for the data set at hand.
▪ The prediction error $E(Y - \hat f(X))^2$ can be estimated by the mean squared error (MSE)
$\frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat f(X_i)\big)^2$,
given a sample $(X_i, Y_i)_{i=1}^{n}$.
▪ Here $X_i$ denotes the $p$-vector of regressors for the $i$-th data point.

Assessing Model Accuracy
▪ But we do not want to assess the model accuracy on the data we have already observed: $\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat f(X_i))^2$ is actually the in-sample (training) MSE.
▪ We want our model to perform well on future data.
▪ For a new (unseen) observation $(X_0, Y_0)$, it should hold that $\hat f(X_0) \approx Y_0$.
▪ In general, when considering all new data points,
$\mathrm{Average}_{(X_0, Y_0)} \big(Y_0 - \hat f(X_0)\big)^2$
should be small. This is the out-of-sample (testing) MSE.

Assessing Model Accuracy
▪ There is no guarantee that a model with a small training MSE will also have a small testing MSE, as the sketch below illustrates.
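A small R sketch of this gap, on simulated data (the data-generating function and the polynomial degrees are illustrative assumptions): as the degree grows, the training MSE keeps falling while the test MSE eventually deteriorates.

```r
# Training MSE vs. test MSE for polynomial fits of increasing degree
set.seed(1)
n <- 100
x <- runif(n, 0, 10);      y      <- 5 + 2 * sin(x) + rnorm(n)        # training data
x_test <- runif(n, 0, 10); y_test <- 5 + 2 * sin(x_test) + rnorm(n)   # unseen data

for (d in c(1, 3, 10, 20)) {                       # increasing model complexity
  fit <- lm(y ~ poly(x, d))
  mse_train <- mean((y - fitted(fit))^2)
  mse_test  <- mean((y_test - predict(fit, newdata = data.frame(x = x_test)))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d, mse_train, mse_test))
}
```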
▪ This leads to the concepts of underfitting and overfitting.

Assessing Model Accuracy
▪ As the model complexity increases, the training error gets smaller, but the testing error eventually increases.
▪ Underfitting: the model is too simple and performs badly on the training data, and consequently on the testing data.
▪ Overfitting: the training data is modelled too well, because non-existing patterns (coming from the noise) are found in the data. Therefore the performance on future data is poor.
[Figure: training and testing MSE as a function of model complexity]

Bias-variance trade-off
▪ Let $X_0$ be fixed. Note that the test MSE can be written as
$E\big(Y_0 - \hat f(X_0)\big)^2 = \underbrace{\big[\mathrm{Bias}(\hat f(X_0))\big]^2 + \mathrm{Var}\big(\hat f(X_0)\big)}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$
▪ Bias: the error introduced by approximating $f$ by $\hat f$.
▪ Variance: how much $\hat f$ changes if we use different data sets for training.

Bias-variance trade-off
▪ It is easy to find a method with low bias and high variance: just use a curve that connects all the points.
▪ It is easy to find a method with low variance and high bias: just take a flat line through the data.
▪ But we want a method that simultaneously has low bias and low variance.
▪ Example: [figure]

Test MSE Estimation
▪ In real-life situations it is not possible to compute the test MSE, because $f$ is unknown, so we need to estimate it.
▪ This can be done in the following ways:
➢ Cross-validation: directly estimating the test MSE by resampling
➢ Indirect estimation of the test error: adjust the training error by a penalty term that takes the model dimension into account

Cross-Validation
▪ Used to estimate the test MSE for a given statistical model.
▪ It tells us how our model performs on an unseen data set.
▪ When comparing several competing models, the one with the smallest cross-validation (CV) error is preferred.
▪ It can also be used for selecting tuning parameters for a chosen model (Ridge, Lasso, etc.).

Cross-Validation
There are three ways in which CV can be done:
1. Validation set approach: divide the data randomly into two data sets, training and testing. Usually an 80-20% split is used. The model is fitted on the training set and the prediction error
$\frac{1}{m}\sum_{i=1}^{m} \big(Y_i - \hat Y_i\big)^2$
is calculated on the testing data.

Cross-Validation
Example:
▪ The model trained on 80% of the data gives the prediction rule $\hat Y = 2X$.
▪ The test data is:

   Y    X
   5    2
   9    5
  10    4

▪ The CV error equals $\frac{1}{3}\big[(5-4)^2 + (9-10)^2 + (10-8)^2\big] = \frac{6}{3} = 2$.

Cross-Validation
▪ Drawbacks:
➢ The CV error can be extremely variable, depending on how the data was split.
➢ Only a subset of the data was used for training; this introduces a lot of bias, so we might overestimate the testing error.
2. Leave-one-out cross-validation (LOOCV): a data set with $n$ sample points is split into $n-1$ data points, on which the model is trained, and testing is done on the remaining data point. This is repeated $n$ times, so that each point gets to be in both the training and the validation set. The prediction errors are then averaged.

Cross-Validation
▪ Now there is no randomness in the data splits, and there is much less bias compared to the previous method, because $n-1$ points are used for training.
▪ Problem: we have to fit the model $n$ times. Computationally expensive. A minimal sketch is given below.
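A minimal hand-rolled LOOCV loop in R (the data and the assumption that the response column is named y are illustrative; packages such as boot offer this readily):

```r
# Leave-one-out CV for a linear model: fit n times, hold out one point each time
loocv_mse <- function(formula, data) {
  n    <- nrow(data)
  errs <- numeric(n)
  for (i in 1:n) {
    fit     <- lm(formula, data = data[-i, ])               # train on n - 1 points
    pred    <- predict(fit, newdata = data[i, , drop = FALSE])
    errs[i] <- (data$y[i] - pred)^2                         # error on the held-out point
  }
  mean(errs)                                                # LOOCV estimate of test MSE
}

d <- data.frame(x = runif(50), y = rnorm(50))               # illustrative data
loocv_mse(y ~ x, d)
```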
3. K-fold cross-validation: randomly divide the data set into $k$ parts of (approximately) equal size. Then train the model on $k-1$ parts and test on the remaining part. Repeat $k$ times and average the testing errors.

Cross-Validation
▪ How big should $k$ be? Experience shows that $k = 5$ or $k = 10$ give the best results.
▪ We fit the model only $k$ times.
▪ The bias remains small, because we fit on almost all the data, and the variability of the CV estimate gets smaller compared to LOOCV, because the outputs of the individual fits are less correlated.
▪ This method corrects the disadvantages of the previous two.

Example
▪ Response variable: mpg (miles per gallon).
▪ Polynomial regression is performed with the regressor horsepower. But which degree to take?
▪ Cross-validation can give us an answer.
[Figures: validation set approach and cross-validation errors for different polynomial degrees]

AIC, BIC, etc.
Another way of estimating the test MSE is to adjust the training MSE.
▪ AIC (Akaike Information Criterion) is an estimator of the out-of-sample prediction error and thereby of the relative quality of a statistical model for a given set of data.
▪ Given a collection of models, AIC estimates the quality of each model. Thus, AIC provides a means for model selection.

AIC, BIC, etc.
▪ Akaike extends the concept of maximum likelihood estimation to the case where the number of parameters $p$ is also unknown. A penalty depending on $p$ is introduced, so a parameter is added to the model only if it leads to a significant improvement in the fit.

AIC, BIC, etc.
▪ Let $f(y \mid \theta)$ be a candidate model for estimating $Y$, for $\theta \in \mathbb{R}^p$. For example: $f(y \mid \theta)$ is the density of $N(X\theta, I)$.
▪ Let $\hat\theta = \hat\theta(Y)$ be the MLE, given the data $Y \in \mathbb{R}^n$.
▪ Then $AIC = -2 \log f(Y \mid \hat\theta) + 2p$ is used as an estimate of the (relative) out-of-sample prediction error.
▪ The model with the smallest AIC is chosen.

AIC, BIC, etc.
▪ BIC (Bayesian Information Criterion) is a method similar to AIC.
➢ The model with the smallest $BIC = -2 \log f(Y \mid \hat\theta) + p \log(n)$ is chosen.
➢ Since the penalty term here is larger, sparser models are selected than with AIC.
➢ In the linear regression model with normal errors, AIC and BIC take the forms
$AIC = n \log(MSE) + 2p$ and $BIC = n \log(MSE) + p \log(n)$.

Linear Models
Model selection and regularization
▪ Linear models (and generalized linear models, GLMs), though simple, turn out to be surprisingly competitive in real-world problems, compared to more complex models.
▪ The reason lies in their simplicity and interpretability.
▪ GLMs are the standard in the insurance business, and most of the results for linear models generalize naturally.
▪ But what is their prediction accuracy, and what happens when the number of parameters $p$ is large compared to the sample size $n$?

Model selection and regularization
▪ Let us focus on linear models, for demonstration.
▪ Assume that $Y = X\beta + \epsilon$, for some $\beta \in \mathbb{R}^p$,
▪ with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$.
▪ Also, $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$.

Model selection and regularization
▪ The OLS estimator $\hat\beta = (X'X)^{-1} X'Y$ is well-defined for $n \ge p$ and is unbiased. Therefore the predictions $\hat Y = X\hat\beta$ are unbiased.
▪ For $p > n$, OLS is not even defined, so we have to come up with other estimators. A small sketch of both situations follows.
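To make this concrete, a minimal R sketch of the closed-form OLS estimator and of its breakdown when $p > n$ (dimensions and data are illustrative assumptions):

```r
# OLS via (X'X)^{-1} X'Y, well-defined for n >= p
set.seed(1)
n <- 50; p <- 10
X    <- matrix(rnorm(n * p), n, p)
beta <- rep(1, p)
Y    <- X %*% beta + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # the OLS estimate

# Now p = 60 > n = 50: X'X has rank at most n, so it is singular
X_wide <- matrix(rnorm(n * 60), n, 60)
# solve(t(X_wide) %*% X_wide) would fail here: OLS is not defined for p > n
```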
Model selection and regularization
But what about the variance of these estimates?
▪ If $n \gg p$, the variance is usually small and our estimates are accurate.
▪ But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because $\det(X'X)$ is almost 0 and the matrix inversion becomes very unstable.

Model selection and regularization
▪ Examples of (potentially) highly correlated variables in motor insurance:
➢ entry user age and current user age
➢ vehicle age and contract age
➢ population density and regional segmentation variables

Model selection and regularization
▪ Also, if $n$ is not much larger than $p$, the estimates can get very unstable.
▪ Example: if all regressors are i.i.d. $N(0,1)$, the variance of the predictions equals $\sigma^2 \frac{p}{n-p-1}$.
▪ This is problematic for $p$ large compared to $n$.

Model selection and regularization
▪ Alternatives to OLS in linear regression:
➢ Subset selection (best subset and stepwise)
➢ Dimension reduction (PCA, for example)
➢ Shrinkage methods (Ridge, Lasso, etc.)

Subset Selection
1. Best subset selection: for a linear model with $p$ predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of $Y$ is used as the predictor.
➢ For $k = 1, 2, \dots, p$:
1. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
2. Pick the best among these $\binom{p}{k}$ models and call it $M_k$, i.e. choose the model with the largest $R^2$.
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: in this last step you cannot use $R^2$, because then the largest model would always be chosen.

Subset Selection
▪ This method is conceptually very simple to understand.
▪ The problem? Too many models to fit! How many? $2^p$ models.
▪ For example: for $p = 30$, there are $2^{30} = 1\,073\,741\,824$ models to fit!
▪ So we need another solution.

Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward
Forward stepwise selection
▪ A computationally efficient alternative to best subset selection.
▪ Here we begin with the null model and add predictors one at a time until we reach the full model (or some stopping rule applies).
▪ Then we choose among these models using cross-validation, AIC, BIC, etc.

Subset Selection
More formally:
Forward stepwise selection: for a linear model with $p$ predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of $Y$ is used as the predictor.
➢ For $k = 0, 1, \dots, p-1$:
1. Consider all $p-k$ models that add one additional predictor to the model $M_k$.
2. Pick the best among these $p-k$ models and call it $M_{k+1}$, i.e. choose the model with the largest $R^2$.
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: in this last step you cannot use $R^2$, because then the largest model would always be chosen.

Subset Selection
▪ Here we fit only $1 + \sum_{k=0}^{p-1}(p-k) = 1 + \frac{p(p+1)}{2}$ models.
▪ For example: for $p = 30$, there are 466 models to fit. Much better than before.
▪ This procedure works well in practice, but there is no guarantee that it selects the best model overall.
Backward stepwise selection:
▪ Similar: here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
▪ The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
▪ It is a data frame with 97 rows and 9 columns. A best subset search on these data is sketched below.
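A sketch of the subset search in R; it assumes the prostate data is available as a data frame, e.g. from the ElemStatLearn package (archived on CRAN), with response lpsa and an indicator column train:

```r
# Best subset selection with the leaps package on the prostate data
library(leaps)
data(prostate, package = "ElemStatLearn")   # assumes ElemStatLearn is installed

# Best model (by RSS / R^2) of each size, up to all 8 clinical regressors
best <- regsubsets(lpsa ~ . - train, data = prostate, nvmax = 8)
summary(best)$which                          # which regressors enter at each size
```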
Example: Prostate cancer
▪ The R package leaps is used to select the best model (based on $R^2$) of each size.
▪ Then AIC and BIC are calculated for each of these models, based on the formulas for linear regression with normal errors.
[Figures: AIC and BIC for the best model of each size]

Summary of Day 1
▪ We assess the model quality by its prediction error $\frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat f(X_i)\big)^2$, given a sample $(X_i, Y_i)_{i=1}^{n}$.
▪ But this is only one part of it - the training (in-sample) error.
▪ It is necessary to estimate this error for new (unseen) data - the testing (out-of-sample) error.

Summary
▪ A model (and its complexity) should be chosen based on these two prediction errors.
▪ The training error we can estimate from the sample directly.
▪ There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Summary
▪ Linear models: simple, but widely used because of their simplicity and interpretability.
▪ OLS is well-defined for $n \ge p$.
▪ But it performs badly if
➢ $p$ is large compared to $n$
➢ some of the regressors are highly correlated

Summary
▪ Some methods to reduce the number of parameters:
1. Best subset selection: all submodels are considered, but this is computationally infeasible.
2. Stepwise regression: regressors are added one at a time. Once a regressor is chosen, it stays.

Other Methods - Preview
▪ We are still to see:
➢ Some other methods that do model selection for linear models
➢ How to deal with correlations
➢ How to deal with the $p > n$ case

Principal Component Regression
▪ PCA uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
▪ This transformation is defined in such a way that the first principal component has the largest variance, the second principal component the second largest, etc.
▪ This way a dimension reduction can be performed, and OLS can then be fitted using the newly obtained regressors.
▪ One can show that this reduces the variance of the OLS estimator.

Principal Component Regression
▪ The only issue with this procedure is that the new regressors lose their interpretability, because they are linear combinations of the original regressors.
▪ But if prediction is the only goal, then this procedure is more than suitable. A minimal sketch follows.
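A small R sketch of principal component regression using base R (the data, the correlation structure and the number of retained components are illustrative assumptions):

```r
# PCR: PCA on the regressors, then OLS on the first k principal components
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.01 * rnorm(n)       # two almost perfectly correlated regressors
y <- X[, 1] + rnorm(n)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
k   <- 3                                  # number of components kept (tune by CV)
Z   <- pca$x[, 1:k]                       # uncorrelated new regressors
fit_pcr <- lm(y ~ Z)                      # OLS on the principal components
summary(fit_pcr)
```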
Shrinkage Methods
▪ We have already mentioned that if $p$ is relatively large compared to $n$, or if some regressors are highly correlated, then the OLS estimates can be very variable and therefore unstable.
▪ Also, we cannot do OLS for $p > n$.
▪ In order to tackle these problems, shrinking the regression coefficients is helpful.

Shrinkage Methods
▪ We know that if OLS is defined, then (Gauss-Markov)
➢ it is unbiased
➢ it has the smallest variance among all unbiased linear estimators.
▪ So, if we want to stay in the class of unbiased linear estimators, we cannot further reduce the variance.
▪ Idea: introduce a little bit of bias to decrease the variance significantly.

Ridge estimator
▪ Let $\lambda \ge 0$ be fixed. Then the Ridge estimator is defined as
$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right) = \arg\min_{\|\beta\|_2^2 \le c} \|Y - X\beta\|_2^2$
for some $c$ that depends on $\lambda$.
▪ For $\lambda = 0$ we obtain OLS. Otherwise we obtain a biased estimator with smaller variance than OLS.

Lasso estimator
▪ Let $\lambda \ge 0$ be fixed. Then the Lasso estimator is defined as
$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right) = \arg\min_{\|\beta\|_1 \le c} \|Y - X\beta\|_2^2$
for some $c$ that depends on $\lambda$.
▪ For $\lambda = 0$ we obtain OLS. Otherwise we obtain a biased estimator that in most cases outperforms OLS.
[Figures: shrinkage of the coefficients and the geometrical interpretation of the L1 and L2 constraints]

Shrinkage Methods
▪ For both estimators, the estimated $\beta$ coefficients are now bounded, which means that the variance of the estimates stays controlled.
▪ How to choose the right $\lambda$? Cross-validation!

Model selection
▪ The Ridge estimator will almost surely not set any estimated coefficients to zero, because of its L2 geometry.
▪ On the other hand, that is exactly what happens with Lasso estimates, because of the L1 norm.
▪ The larger the $\lambda$, the more coefficients are set to 0.
▪ So the Lasso performs model selection and estimation at the same time.

Example - Prostate data
▪ The more you increase $\lambda$, the smaller the estimated coefficients are.
▪ Ridge estimated coefficients: [figure]
▪ Lasso estimated coefficients: here they are set to 0 for large $\lambda$. [figure]
A sketch of both estimators in R follows.
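A minimal R sketch of Ridge and Lasso with the glmnet package, with $\lambda$ chosen by cross-validation (the simulated data and the sparse true coefficient vector are illustrative assumptions):

```r
# Ridge (alpha = 0) and Lasso (alpha = 1) with lambda chosen by cross-validation
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, 2, 1)) + rnorm(n)   # only 3 regressors truly matter

ridge <- cv.glmnet(X, y, alpha = 0)   # L2 penalty: shrinks, never exactly zero
lasso <- cv.glmnet(X, y, alpha = 1)   # L1 penalty: sets coefficients exactly to zero

coef(ridge, s = "lambda.min")         # all 20 coefficients shrunk towards 0
coef(lasso, s = "lambda.min")         # most coefficients exactly 0: model selection
```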
Generalized linear models (GLM)
▪ Generalized linear models (GLMs) are a natural extension of linear models.
▪ The mean of the response variable is now a function of a linear combination of the regressors.
▪ The response variable does not have to be normally distributed anymore; it can take one of the distributions from the exponential family: Bernoulli, Binomial, Poisson, Gamma, Exponential.
▪ GLMs are widely used in the insurance industry and are ideally suited for the analysis of the non-normal data that are commonly encountered in insurance.

GLM
▪ More formally: $Y_i \in \mathbb{R}$ is the response variable, $X_i \in \mathbb{R}^p$ the regressors.
▪ Linear regression: $E(Y_i \mid X_i) = \beta' X_i$ and $\hat Y_i = \hat\beta' X_i$.

GLM
▪ But what if $Y_i$ is a count variable, like the number of claims?
▪ Assume a Poisson distribution for each $Y_i$, but with a (potentially) different parameter $\lambda_i > 0$: each customer has a different claim frequency.
▪ We want to model $Y_i$ in terms of $X_i$.

GLM - Poisson regression
▪ We know that $P(Y_i = y \mid \lambda_i) = e^{-\lambda_i} \lambda_i^{y} / y!$ for each $y \in \{0, 1, 2, \dots\}$.
▪ Also, $E(Y_i \mid X_i) = \lambda_i$. We want to model $\lambda_i$ in terms of $\beta' X_i$. Since $\lambda_i > 0$, the following parametrization makes sense:
$E(Y_i \mid X_i) = \lambda_i = e^{\beta' X_i}$
▪ Estimator: $\hat Y_i = e^{\hat\beta' X_i} = e^{\hat\beta_1 X_{i1}} \cdots e^{\hat\beta_p X_{ip}}$ - a multiplicative structure.
▪ GLM: a generalization of this, allowing also for other distributions in the exponential family: Normal, Exponential, Gamma, Bernoulli, Binomial, etc.

GLM - Model choice
[Figure: overview of typical distribution choices per model. Source: Willis Towers Watson]

GLM
▪ Generalized linear models serve as the industry standard for non-life insurance pricing:
➢ The multiplicative output remains understandable also for non-actuaries.
➢ There is a range of professional insurance software dedicated to GLM.
➢ GLM is also possible in, for example, R.
▪ Burning costs are defined as Frequency × Severity.
▪ One can model the average frequency of claims, the average claim amounts (severity) or (directly) the average burning costs.
▪ Burning costs are then the basis for the (net) risk premium.

Criteria for GLM
▪ Portfolio size
➢ 150,000 exposure rows is seen as a minimum
➢ A significant number of claims is required as well
▪ Homogeneity of the risks in the portfolio
▪ Possibility to segment the risks
➢ Available risk factors
▪ Alternative methods
➢ Other pricing techniques
➢ Flat premium or premium influenced by one risk factor
➢ Individual underwriting
Often these criteria are met for a part of, but not for the full, portfolio.

Pricing Process
Risk Modelling Process
Data extraction (core system) → Data preparation → Initial analysis → Is a GLM possible? If yes: GLM analysis; if no: simplified pricing method → Net risk premium.

You could end up with a multitude of models
Example: MTPL
▪ Frequency models: material damage, bodily injury attritional, bodily injury large.
▪ Severity models: material damage, bodily injury attritional.
▪ In this example the severity of large BI claims is not modelled, but taken as a fixed amount per claim.
▪ Separate models are possible for private persons, fleets, leasing, etc.
▪ And this is just for passenger cars!

Validation of a GLM model
▪ Split the data set in two
➢ Usually an 80-20% split, or out-of-time
➢ Check how the model performs on unseen data
➢ Avoid overfitting
▪ Significance tests
➢ Significance of a parameter in the model
➢ Significance of the levels of a parameter against each other (how granular should a variable be)
▪ Temporal stability
➢ To be significant, an effect must be stable over the years
▪ Residual analysis
➢ To test the distribution
➢ On real data no distribution works perfectly

Combining Risk Models
▪ In the end we need to deliver a final risk premium.
➢ We should combine all the models we made.
➢ This is necessary to understand the total effect.
▪ Result: the net risk premium!

From net risk premium to gross risk premium
▪ A whole range of effects has to be added to the net risk premium.
▪ Most loadings will be added through an increase in the intercept, but there are other possibilities.
▪ A loading for discounts is to be added.
▪ Result: the gross risk premium.
As a concrete illustration of the frequency modelling step above, a minimal sketch follows.
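The sketch below fits a Poisson frequency GLM in R; the portfolio data, the variable names and the effect sizes are simulated, illustrative assumptions, not a real tariff:

```r
# Claim-frequency GLM: Poisson response, log link, exposure as an offset
set.seed(1)
n <- 5000
policies <- data.frame(
  driver_age  = sample(18:80, n, replace = TRUE),
  vehicle_age = sample(0:20, n, replace = TRUE),
  exposure    = runif(n, 0.1, 1)                 # years at risk per policy
)
lambda <- policies$exposure * exp(-2 + 0.01 * (60 - policies$driver_age))
policies$nclaims <- rpois(n, lambda)             # simulated claim counts

freq_model <- glm(nclaims ~ driver_age + vehicle_age,
                  family = poisson(link = "log"),
                  offset = log(exposure), data = policies)
exp(coef(freq_model))   # multiplicative factors: the structure used in pricing
```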
Additional Topics - Interactions
Interactions and GLM
▪ An interaction effect exists when the effect of an independent variable on a dependent variable changes depending on the value(s) of one or more other independent variables.
▪ In that case, interaction term(s) have to be added to the model.
▪ Example: gene A and gene B may each contribute to developing a certain disease, but in combination they are fatal.

Interactions and GLM
▪ The problem? GLM models do not detect interactions automatically.
▪ They can be added to the model, but this has to be done 'manually'.
▪ Example taken from:

Interactions and GLM
▪ In this example there is an interaction of age and engine power:
▪ $Age \ge 60$ and $Engine\ Power \ge 50$.
▪ But if this effect is not noticed and included in the model, the GLM fit is poor.
[Figures: GLM fit with and without the interaction term]

Machine Learning
Boosting
▪ But many machine learning algorithms can capture these effects automatically.
▪ Let us take gradient boosting trees as an example.
▪ How does this algorithm work? Let us present some basics.

Tree-based methods
▪ Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (typically a constant) in each region.
▪ Consider a regression problem with continuous response $Y$ and continuous regressors $X_1, X_2 \in (0,1)$.
▪ For example, the partition shown is simple but cannot be obtained by recursive binary splitting, i.e. represented by a tree. [figure]

Tree-based methods
▪ So let us restrict our attention to recursive binary partitions, like this one: [figure]
▪ First split the space into two regions and model the response by the mean of $Y$ in each region. Choose the split variable and split point to achieve the optimal split.
▪ Then one or both regions are further split in the same fashion, iteratively, until some stopping rule applies.

Tree-based methods
▪ The corresponding regression model predicts $Y$ with a constant $c_m$ if the inputs $X$ are in region $R_m$, i.e.
$\hat f(X) = \sum_{m=1}^{M} c_m \, \mathbf{1}(X \in R_m)$.
▪ These trees can now further be used for boosting. What is boosting?

Boosting
▪ Gradient boosting is one of the most powerful techniques for building predictive models. It has proven successful in many areas and is one of the leading methods for winning Kaggle competitions.

Boosting
▪ In general, models can be fitted to data individually or combined in an ensemble - a combination of simple individual models (usually trees) that together create a more powerful model.
▪ Boosting is a method that builds the model in a stage-wise fashion.
▪ It starts by fitting an initial model.
▪ The second model then focuses on accurately predicting the cases where the first model performed badly.
▪ The third model focuses on correcting the faults of the previous stage, etc.

Boosting
▪ Here we do not fit one big decision tree to the data, because this can easily lead to overfitting.
▪ Instead, the boosting algorithm learns slowly.
▪ At each step we fit a decision tree to the residuals from the previous model.
▪ The new tree is then added to the model.
[Figures: example data to be fitted, and the boosted fit after successive iterations]
A toy version of this residual-fitting loop is sketched below.
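A toy boosting loop in R with small rpart trees; the data (which contains an interaction between the two regressors), the learning rate and the tree depth are illustrative assumptions, and production work would use a dedicated package such as gbm or xgboost:

```r
# Toy gradient boosting for squared-error loss: fit a small tree to the
# current residuals, add a shrunken version of it to the model, repeat.
library(rpart)
set.seed(1)
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- ifelse(d$x1 > 0.5 & d$x2 > 0.5, 3, 0) + rnorm(200, sd = 0.3)  # interaction

pred <- rep(0, nrow(d))     # start from the zero model
nu   <- 0.1                 # learning rate ("slow learning")
for (b in 1:200) {          # number of trees
  d$r  <- d$y - pred                                        # current residuals
  tree <- rpart(r ~ x1 + x2, data = d,
                control = rpart.control(maxdepth = 2))      # small tree, depth 2
  pred <- pred + nu * predict(tree, d)                      # add the shrunken tree
}
mean((d$y - pred)^2)        # training MSE of the boosted ensemble
```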
Boosting
▪ Usually the trees are rather small, but they should be deep enough to capture interactions. Two splits per tree are already enough to catch first-order interactions.
▪ There are several parameters that need to be chosen: the number of trees, the number of splits in each tree and the learning rate of the algorithm (usually 0.1 or 0.01).
▪ For the number of trees, cross-validation is used.

Example
▪ Back to our example.
▪ Remember that the GLM could not 'recognize' the interaction between age and engine power.
▪ But GBMs do, provided that the tuning parameters have been carefully selected. [figure]

Method comparison
GLM vs Machine Learning
▪ The problem with these kinds of algorithms is that the interpretation is almost completely lost.
▪ It is very unlikely that such models will be approved by regulators, at least in the majority of countries.
▪ And even if they are, the insurance company runs the risk of reputational loss, in case some of the ethical problems discussed before emerge.

GLM vs Machine Learning
▪ Also, actuaries want to understand their models and not use black-box alternatives.
▪ So, GLMs will probably not be replaced by machine learning algorithms in the near future.
▪ But machine learning can assist actuaries in spotting interactions, assessing variable significance or performing clustering tasks.

GLM vs Machine Learning
▪ Examples of clustering tasks are brand or region clustering. Here the black-box nature of the models is not so important, because the model results can usually be validated easily.

Literature
▪ An Introduction to Statistical Learning with Applications in R - Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
▪ https://www.reacfin.com/wp-content/uploads/2016/12/20170914-Machine-Learning-applications-for-non-life-pricing.pdf
▪ https://www.stat.cmu.edu/~ryantibs/datamining/lectures/17-modr2.pdf