High-dimensional Statistics and Machine Learning with Applications to Insurance
November 2022
Masaryk University, Brno
Ivana Milović, MAS, PhD

Introducing Myself
Ivana Milović, MAS, PhD
Non-Life Pricing Actuary (SME)
ivana.milovic@allianz.at
Prior experience
• Uniqa Insurance Group – Non-Life Pricing Actuary (Motor)
• Lecturer – University of Vienna
• Prae-Doc and Post-Doc Researcher – Department of Statistics, University of Vienna
Education
• PhD in Statistics (University of Vienna, 2016)
• Master of Advanced Studies in Mathematics (University of Cambridge, 2011)
• BSc in Mathematics and Computer Science (University of Belgrade, 2010)

Our Company

Allianz Group at a glance
With around 155,000 employees worldwide, the Allianz Group serves over 126 million customers¹ in more than 70 countries. In fiscal year 2021 the Allianz Group achieved total revenues of approximately 148.5 billion euros. Allianz is one of the world's largest asset managers, with third-party assets of nearly 2.0 trillion euros at year end.
1 Including non-consolidated entities with Allianz customers. Data as of March 4, 2022 (release of the Annual Report 2021).

We are much more than just an insurer
… Allianz is the leading specialist in space insurance and celebrated its 100th birthday as an aviation insurer in 2015
… the Allianz Center for Technology has conducted thousands of crash tests since 1980 to improve road safety
… Allianz insures major Hollywood and Bollywood movies, including all 24 James Bond productions
… Allianz offers financial solutions to more than 49 million emerging consumers in Africa, Asia and Latin America
… Allianz was one of the insurers of the Titanic
… Allianz supports sustainable motor sports as a partner of the fully electric Formula E racing championship
… Allianz insured the last three buildings to hold the title of "world's tallest": Petronas Towers, Taipei 101 & Burj Khalifa
… Allianz is the Worldwide Insurance Partner of the Olympic & Paralympic Movements from 2021 to 2028 – and one of the few global brands that can communicate with the Olympic IP/rings

We are a global player
• Leading Property and Casualty insurer globally
• Among the top 5 in Life/Health business globally
• Among the top 5 asset managers globally
• Global leader in credit insurance
• Worldwide leader in assistance services
• One of the leading corporate insurers globally

Revenues by segments and regions
Allianz Group revenues 2021: EUR 148.5 bn
By segments: Property/Casualty Insurance 42%, Life/Health Insurance 53%, Asset Management 6%
By regions¹,²: Germany 26%, Western & Southern Europe 30%, USA 12%, Growth Markets³ 10%, Anglo Markets⁴ 6%, Specialty Insurance⁵ 15%
1 Excl. Corporate & Other and consolidations. 2 Incl. banking revenues. 3 Central and Eastern Europe, Asia-Pacific, Latin America, Middle East and Africa, Turkey; Austria and Allianz Direct allocated to Western and Southern Europe. 4 UK, Ireland, Australia. 5 Allianz Global Corporate & Specialty, Euler Hermes, Allianz Partners, Allianz Re.
Data as of March 29, 2022 (release of the Allianz Fact Sheet; further information on allianz.com)

Interested in joining one of our companies?
❑ Bachelor/Master thesis supervision
❑ Summer internship
❑ Full- or part-time job
Contact us: ivana.milovic@allianz.at

What is pricing?
"Pricing is the way that a company decides prices for its products or services, or the prices decided" – Cambridge Dictionary

Why do we need statistics and mathematical modelling for pricing in insurance?
Because the cost of a policy is random. How do we estimate it? There are two ways:
• Based on historical data / expert judgement (simplistic approach)
• Fitting statistical models to historical data -> technical pricing

Summary from yesterday

Summary
• We assess the model quality by its prediction error $\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat{f}(X_i)\big)^2$, given a sample $(X_i, Y_i)_{i=1}^{n}$.
• But this is only one part of it – the training (in-sample) error.
• It is also necessary to estimate this error for new (unseen) data – the testing (out-of-sample) error.

Summary
A model (and its complexity) should be chosen based on these two prediction errors.
[Figure: training and testing error as functions of model complexity]

Summary
• The training error can be estimated from the sample directly.
• There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Content
Topics
o Model assessment and selection
o Cross-validation, AIC, BIC
o Linear models
o PCR, regularization methods
o Generalized linear models
o Pricing process
o Machine learning in insurance

Types of Models

Linear Models

Model selection and regularization
• Linear models (and generalized linear models: GLMs), though simple, turn out to be surprisingly competitive in real-world problems compared to more complex models (GLMs are the standard in the insurance business).
• The reason for this lies in their simplicity and interpretability.
• But what is their prediction accuracy, and what happens when the number of parameters p is large compared to the sample size n?

Model selection and regularization
▪ Let us focus on linear models, for demonstration.
▪ Assume that $Y = X\beta + \epsilon$ for some $\beta \in \mathbb{R}^p$, with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$. Also, $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$.
▪ If $n \geq p$ (and X has full column rank), then the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ is well-defined and unbiased.
▪ Therefore, the estimates $\hat{Y} = X\hat{\beta}$ are unbiased.
▪ But for $p > n$, OLS is not even defined. Therefore, we have to come up with some other estimators.

Model selection and regularization
But what about the variance of these estimates?
• If $n \gg p$, the variance is usually small and our estimates are accurate.
• But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because $\det(X'X)$ is almost 0, so the matrix inversion becomes numerically unstable.

Model selection and regularization
Examples of (potentially) highly correlated variables in motor insurance:
• Vehicle age and contract age
• Engine power and engine volume
• Population density and regional segmentation variables
Example of (potentially) highly correlated variables in SME insurance:
• Turnover and number of employees

Model selection and regularization
• Also, if n is not much larger than p, the estimates can get very unstable.
• Example: if all regressors are i.i.d. N(0,1), the variance of the predictions equals $\sigma^2 \frac{p}{n-p-1}$.
• This is problematic for p large compared to n. The simulation below illustrates this.
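A small simulation sketch of this effect (a minimal illustration, assuming the setup above: i.i.d. N(0,1) regressors, true β = 0 and σ = 1; the sample sizes and dimensions are arbitrary):

```r
# Empirical variance of OLS predictions at a new point, compared with the
# theoretical value sigma^2 * p / (n - p - 1), as p approaches n.
set.seed(1)
n <- 50
for (p in c(2, 10, 25, 40, 45)) {
  preds <- replicate(500, {
    X  <- matrix(rnorm(n * p), n, p)  # i.i.d. N(0,1) regressors
    y  <- rnorm(n)                    # pure noise: true beta = 0, sigma = 1
    x0 <- rnorm(p)                    # a new (unseen) test point
    sum(x0 * coef(lm(y ~ X - 1)))     # OLS prediction at x0
  })
  cat(sprintf("p = %2d: empirical var %6.2f, theoretical %6.2f\n",
              p, var(preds), p / (n - p - 1)))
}
```

For p = 45 and n = 50 the prediction variance is already more than 250 times larger than for p = 2.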
Model selection and regularization
• To summarize, we need to find methods to
❑ reduce the number of parameters (dimensionality), and/or
❑ reduce the variance of the estimators.
• Otherwise, the models are unreliable!

Model selection and regularization
Alternatives to OLS in linear regression:
• Subset selection (best subset and stepwise)
o The goal is to choose a subset of all regressors and use that model as an approximation
o Cross-validation, AIC or BIC then help us choose the best submodel
• Dimension reduction (PCA, for example)
o We transform the original regressors so that the new ones are uncorrelated and sorted in order of importance, so we can also reduce the dimension
• Shrinkage methods (Ridge, Lasso, etc.)
o Modifications of the OLS (ordinary least squares) estimator, in which the estimated coefficients are shrunk towards zero, to reduce the variability of the estimator and make it more stable
o They also work for p larger than n

Model selection and regularization
For more details, see the Appendix chapter.

GLM – industry standard

GLM
• Generalized linear models (GLMs) are a natural extension of linear models.
• The response variable is now a function of a linear combination of the regressors.
• The response variable no longer has to be normally distributed; it can follow one of the distributions from the exponential family: Bernoulli, Binomial, Poisson, Gamma, Exponential.
• GLMs are widely used in the insurance industry and are ideally suited for the analysis of the non-normal data that are commonly encountered in insurance.

GLM – Model choice
[Figure: choice of distribution and link function for a GLM. Source: Willis Towers Watson]

GLM
Generalized linear models serve as the industry standard for non-life insurance pricing:
• Their multiplicative output remains understandable even for non-actuaries
• There is a range of professional insurance software dedicated to GLMs
• GLMs are also available in R, Python and other open-source software

Risk Premium
• In order to come up with a price for an insurance policy, we need to start by modelling the base risk premium.
• The (net) risk premium is defined as Frequency (how often) × Severity (how high).
• One can model
❑ the frequency of claims -> Poisson
❑ the claim amount (severity) -> Gamma
❑ (directly) the risk premium -> Gamma, Tweedie

Risk Premium
• This procedure has to be done for every insurance coverage.
• Example in SME insurance:
[Flowchart: Data extraction (core system) -> Data preparation -> Initial analysis -> GLM possible? If yes: GLM analysis; if no: simplified pricing method -> Net risk premium]

Risk Premium
• In the end, we need to deliver a final risk premium.
• We should combine all the models we made.
• It is necessary to understand the total effect.
• Result: total net risk premium.

Risk Premium: from net to gross
A whole range of effects is to be added to the net risk premium to arrive at the gross risk premium.

GLM – space for improvement

Interactions and GLM
• An interaction effect exists when the effect of an independent variable on a dependent variable changes depending on the value(s) of one or more other independent variables.
• In that case, interaction term(s) have to be added to the model.
• Example: gene A and gene B may each contribute to developing a certain disease, but in combination they are fatal.

Interactions and GLM
• The problem? GLMs do not detect interactions automatically.
• They can be added to the model, but this has to be done 'manually', as the following sketch illustrates.
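A minimal sketch of this manual step in R, on simulated data (the data frame motor and all column names are hypothetical; the built-in joint effect anticipates the motor example below):

```r
# Simulated motor portfolio with a joint effect of age and engine power.
set.seed(7)
n <- 5000
motor <- data.frame(age      = sample(18:80, n, replace = TRUE),
                    power    = sample(30:120, n, replace = TRUE),
                    exposure = runif(n, 0.5, 1))
lambda <- with(motor, exposure * 0.1 * exp(0.8 * (age >= 60 & power >= 50)))
motor$nclaims <- rpois(n, lambda)

# A main-effects-only Poisson frequency GLM misses the joint effect:
m_main <- glm(nclaims ~ age + power + offset(log(exposure)),
              family = poisson(link = "log"), data = motor)

# The interaction must be specified by hand, e.g. via a constructed flag:
motor$senior_highpower <- as.integer(motor$age >= 60 & motor$power >= 50)
m_int <- glm(nclaims ~ age + power + senior_highpower + offset(log(exposure)),
             family = poisson(link = "log"), data = motor)

anova(m_main, m_int, test = "Chisq")  # the added term improves the fit significantly
```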
Interactions and GLM
• The following example is taken from the Reacfin paper listed under Literature.
[Figure: simulated motor claim data with an interaction between driver age and engine power]

Interactions and GLM
• In this example there is an interaction of age and engine power: Age ≥ 60 and Engine Power ≥ 50.
• But if this effect is not noticed and included in the model, the GLM fit is poor.

Interactions and GLM
[Figure: GLM fit without the interaction term – poor fit in the affected segment]

Tree-based methods
• But many machine learning algorithms can capture these effects automatically.
• Let us take gradient boosting trees, for example.
• How does this algorithm work? Let us present some basics.

Tree-based methods
• Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (typically a constant) in each region.
• Consider a regression problem with continuous response Y and continuous regressors X1, X2 ∈ (0, 1).
• For example, the following partition is simple, but cannot be obtained by recursive binary splitting, i.e. cannot be represented by a tree.
[Figure: a rectangular partition of the feature space that no recursive binary splitting can produce]

Tree-based methods
▪ So, let us restrict our attention to recursive binary partitions, like this one:
[Figure: a recursive binary partition of the feature space and the corresponding tree]
▪ First split the space into two regions and model the response by the mean of Y in each region. Choose the split variable and split point to achieve the optimal split.
▪ Then one or both regions are split further in the same fashion, iteratively, until some stopping rule applies.

Tree-based methods
▪ The corresponding regression model predicts Y with a constant $c_m$ if the inputs X are in region $R_m$, i.e. $\hat{f}(X) = \sum_{m=1}^{M} c_m \, \mathbf{1}\{X \in R_m\}$.

Tree-based methods
These trees can now be used further for boosting. What is boosting?

Boosting
• Gradient boosting is one of the most powerful techniques for building predictive models. It has proven successful in many areas and is one of the leading methods for winning Kaggle competitions (https://kaggle.com/).

Boosting
• In general, models can be fitted to data individually or combined in an ensemble – a combination of simple individual models (usually trees) that together create a more powerful model.
• Boosting is a method that builds the model in a stage-wise fashion.
• It starts by fitting an initial model.
• The second model then focuses on accurately predicting the cases where the first model performed badly.
• The third model focuses on correcting the faults of the previous stage, and so on.

Boosting
• Here we do not fit one big decision tree to the data, because this can easily lead to overfitting.
• Instead, the boosting algorithm learns slowly.
• At each step we fit a decision tree to the residuals of the previous model.
• This new tree is then added to the model.

Boosting
[Figures: a toy data set and the boosting fit improving over successive iterations]

Boosting
Usually the trees are rather small, but they should be deep enough to capture interactions.
• Number of splits = 2 is already enough to catch first-order interactions.
• There are several parameters that need to be chosen: the number of trees, the number of splits in each tree, and the learning rate of the algorithm (usually 0.1 or 0.01).
• For the number of trees, cross-validation is used, as the sketch below illustrates.
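A minimal sketch with the R package gbm, on simulated data with a built-in interaction (all parameter values are illustrative, not recommendations):

```r
library(gbm)
set.seed(42)
# Toy data: the signal is a pure interaction of x1 and x2.
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * (d$x1 > 0.6 & d$x2 > 0.4) + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = d,
           distribution      = "gaussian",
           n.trees           = 2000,  # upper bound; CV picks the actual number below
           interaction.depth = 2,     # number of splits: 2 catches first-order interactions
           shrinkage         = 0.01,  # learning rate
           cv.folds          = 5)

best <- gbm.perf(fit, method = "cv")  # CV-optimal number of trees
summary(fit, n.trees = best)          # relative variable importance
```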
Example
• Back to our example: remember that the GLM could not 'recognize' the interaction between age and engine power.
• But GBMs do, provided that the tuning parameters have been carefully selected.

Example
[Figure: GBM fit capturing the age/engine power interaction – SUCCESS!]

GLM vs Machine Learning
• The problem with these kinds of algorithms is that the interpretation is almost completely lost.
• It is very unlikely that such models will be approved by regulators, at least in the majority of countries.
• And even if they are, the insurance company runs the risk of reputational loss, in case some of the ethical problems discussed before emerge.

GLM vs Machine Learning
• Also, actuaries want to understand their models and not use black-box alternatives.
• So, GLMs will probably not be replaced by machine learning algorithms in the near future.
• But machine learning can assist actuaries in spotting interactions, as well as in determining variable importance or performing clustering tasks.

Variable importance
[Figure: relative variable importance as reported by a boosted tree model]

Clustering
• Examples of clustering are brand, region or business-activity clustering.
• Here the black-box nature of the models is not so important, because the model results can usually be validated easily.

Literature
▪ An Introduction to Statistical Learning with Applications in R – Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
▪ https://www.reacfin.com/wp-content/uploads/2016/12/20170914-Machine-Learning-applications-for-non-life-pricing.pdf
▪ https://www.stat.cmu.edu/~ryantibs/datamining/lectures/17-modr2.pdf

Appendix (the white slides)

Subset Selection
1. Best subset selection: for a linear model with p predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of Y is used as a predictor.
➢ For k = 1, 2, …, p:
1. Fit all $\binom{p}{k}$ models that contain exactly k predictors.
2. Pick the best among these $\binom{p}{k}$ models and call it $M_k$, i.e. choose the model with the largest R².
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen. https://en.wikipedia.org/wiki/Coefficient_of_determination

Subset Selection
• This method is conceptually very simple to understand.
• The problem? Too many models to fit!
• How many? $2^p$ models.
• For example: for p = 30, there are 1 073 741 824 models to fit!
• So, we need another solution.

Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward
Forward stepwise selection
• A computationally efficient alternative to best subset selection.
• Here we begin with the null model and add predictors one at a time until we get the full model (or some stopping rule applies).
• Then we choose among these models using cross-validation, AIC, BIC, etc.

Subset Selection
More formally:
Forward stepwise selection: for a linear model with p predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of Y is used as a predictor.
➢ For k = 0, 1, …, p − 1:
1. Consider all p − k models that add one additional predictor to the model $M_k$.
2. Pick the best among these p − k models and call it $M_{k+1}$, i.e. choose the model with the largest R².
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.
A small example with the leaps package is sketched below.
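A minimal sketch with the R package leaps, on simulated data (all names illustrative): regsubsets performs the search, and BIC then picks among the resulting models $M_0, M_1, \dots, M_p$:

```r
library(leaps)
set.seed(1)
# Simulated data: only 3 of the 10 regressors actually matter.
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- X[, 1] - 2 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
d <- data.frame(y, X)

# method = "exhaustive" would give best subset; "forward" is the cheap alternative.
fwd <- regsubsets(y ~ ., data = d, nvmax = p, method = "forward")
s   <- summary(fwd)
k   <- which.min(s$bic)  # model size chosen by BIC
coef(fwd, k)             # coefficients of the selected model M_k
```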
Subset Selection
• Here we fit only $1 + \sum_{k=0}^{p-1}(p-k) = 1 + \frac{p(p+1)}{2}$ models.
• For example: for p = 30, there are 466 models to fit. Much better than before.
• This procedure works well in practice, but now there is no guarantee that we will select the best model overall.
Backward stepwise selection
• Similar: here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
• The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
• It is a data frame with 97 rows and 9 columns.

Example: Prostate cancer
[Figure: overview of the prostate data set]

Example: Prostate cancer
The R package leaps is used to select the best model (based on R²) of each size.

Example: Prostate cancer
• Then AIC and BIC are calculated for each of these models, based on the formula for linear regression with normal errors.
[Figure: AIC and BIC as functions of model size for the prostate data]

Preview
We have still to see:
• Some other methods that do model selection for linear models
• How to deal with correlations
• How to deal with the p > n case

PCA

Principal Component Regression
• PCA uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
• This transformation is defined in such a way that the first principal component has the largest variance, the second principal component the second largest, etc.

Principal Component Regression
▪ This way a dimension reduction can be performed, and consequently OLS can be fitted using the newly obtained regressors.
▪ One can show that this reduces the variance of the OLS estimator.
▪ But there are interpretability issues!

Shrinkage Methods

Shrinkage Methods
▪ We have already mentioned that if p is relatively large compared to n, or if some regressors are highly correlated, then the OLS estimates can be very variable and therefore unstable.
▪ Also, we cannot do OLS for p > n.
▪ In order to tackle these problems, shrinking the regression coefficients is helpful.

Shrinkage Methods
▪ Idea: introduce some bias, but decrease the variance significantly.
▪ This is done by adjusting our minimization problem:
▪ OLS: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2$
▪ Ridge: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
▪ Lasso: $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Shrinkage Methods
• So, Ridge and Lasso are actually classes of estimators, since they depend on λ.
• How to choose the right λ? Cross-validation!

Shrinkage Methods – geometrical interpretation
[Figure: the classical picture of the L1 (diamond) and L2 (disc) constraint regions with the elliptical OLS contours]

Model selection
• The Ridge estimator will almost surely not set any estimated coefficients to zero, because of its L2 geometry.
• On the other hand, that is exactly what happens with Lasso estimates, because of the L1 norm.
• The larger the λ, the more coefficients are set to 0.
• So, Lasso performs model selection and estimation at the same time. A small example with glmnet follows.
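A minimal sketch with the R package glmnet, on simulated data (alpha = 0 gives Ridge, alpha = 1 gives Lasso; cv.glmnet chooses λ by cross-validation):

```r
library(glmnet)
set.seed(1)
# Sparse setting with p > n: only the first 5 coefficients are nonzero.
n <- 80; p <- 120
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:5] %*% rep(1, 5) + rnorm(n)

cv_lasso <- cv.glmnet(X, y, alpha = 1)        # lasso; alpha = 0 would be ridge
beta_hat <- coef(cv_lasso, s = "lambda.min")  # coefficients at the CV-chosen lambda
sum(beta_hat != 0)  # only a handful of nonzero coefficients survive
```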
Example – Prostate data
The more you increase λ, the smaller the estimated coefficients become.
[Figure: Ridge estimated coefficients for the prostate data, plotted against λ]

Example – Prostate data
The more you increase λ, the smaller the estimated coefficients become.
[Figure: Lasso estimated coefficients for the prostate data – here they are set to 0 for large λ]
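Coefficient paths like the two prostate plots above can be reproduced for any data set with glmnet's built-in plot method; a short sketch on simulated data (all values illustrative):

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(80 * 20), 80, 20)
y <- X[, 1:5] %*% rep(1, 5) + rnorm(80)

par(mfrow = c(1, 2))
plot(glmnet(X, y, alpha = 0), xvar = "lambda")  # Ridge: smooth shrinkage, no exact zeros
plot(glmnet(X, y, alpha = 1), xvar = "lambda")  # Lasso: coefficients hit exactly zero
```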