High-Dimensional Statistics and Applications in Insurance
November 2019, Masaryk University, Brno
Ivana Milović, MAS PhD

Introducing myself
Ivana Milović, MAS PhD
Non-Life Pricing Actuary - Group P&C Pricing
Lecturer - University of Vienna
Prior experience
➢ Prae- and Post-Doc Researcher - Department of Statistics, University of Vienna
Education
➢ PhD in Statistics (Univ. of Vienna, 2016)
➢ Master of Advanced Studies in Mathematics (Univ. of Cambridge, 2011)
➢ BSc in Mathematics and Computer Science (Univ. of Belgrade, 2010)

Introduction to UNIQA
UNIQA at a glance
Key financials (EURm):

                                    2013    2014    2015    2016(c)   2017
  Gross written premiums(a)        5,886   6,064   6,325    5,048    5,293
  Premiums earned (retained)(a)    5,641   5,839   6,102    4,443    6,628
  Earnings before taxes              308     378     423      225      242
  Consolidated net profit            285     290     331      148      161
  Combined ratio (net, P&C)        99.8%   99.6%   97.8%    98.1%    97.5%
  Return on equity                 11.9%    9.9%   10.9%     4.7%     5.1%

Diversification by regions and products (GWP(a)(b) FY17): UNIQA Austria 69%, UNIQA International 31%; by products 50% / 20% / 30% across Life, P&C and Health. [Map: UNIQA's geographical footprint]
(a) Including savings portion of premiums from unit- and index-linked life insurance. (b) Excluding consolidation and UNIQA Reinsurance. (c) UNIQA signed a contract to sell its Italian operations on Dec 2; FY16 IFRS figures therefore exclude Italy.

4WARD
What are Shared Services?
"A central service unit is an entity within a multi-unit organization responsible for supplying the business units, respective divisions and departments with specific operational tasks & processes (e.g. accounting, payroll, IT, compliance or, as in UNIQA's case, actuarial and risk)."

UNIQA 4WARD (U4W)
Local UNIQA Business Units (BUs) and UNIQA Group are customers of U4W and will outsource specific processes to U4W. U4W performs the processes for the customers according to commonly defined Service Levels. [Map: UNIQA operating countries and HQ]

Benefits through UNIQA 4WARD
▪ Standardization
▪ Specialization
▪ Speed

Benefits with UNIQA 4WARD - development of personal and professional skills with UNIQA
What can UNIQA 4WARD offer you?
1. General onboarding training with a focus on UNIQA tools and standards as well as intercultural awareness
2. Function-specific training in the relevant Group department in Vienna - partially spending time in Vienna and in Bratislava, with a strong "applied learning" (learning-by-doing) approach
3. Mentoring program and on-the-job knowledge transfer
4. International working culture and positive working atmosphere
5. Start-up environment with the stability of an international insurance company in the background
Also: actuarial education and continuous professional & soft-skill training, various employee benefits, flexible working times & home office, 25 vacation days, language courses, bonus payments.

Mathematics Challenge
https://www.uniqa4ward.com/en/challenge.html#Challenge

Introduction
▪ Topics
➢ Model assessment and selection
➢ Cross-validation, AIC, BIC
➢ Linear models
➢ PCR, regularization methods
➢ Generalized linear models
➢ Pricing process
➢ Machine learning in insurance

Introduction
▪ Let $Y$ be a quantitative response and $X = (X_1, \dots, X_p)$ a set of regressors, and suppose
$Y = f(X) + \epsilon$,
for some fixed (but unknown) function $f$.
▪ $\epsilon$ has mean 0 and is independent of $X$. Often we assume normality. A minimal simulation of this setup is sketched below.
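The following is a minimal R sketch of the model above; the particular choice of $f$, the sample size and the distributions are illustrative assumptions, not part of the lecture material.

```r
# Sketch of the model Y = f(X) + eps (f, n and the distributions are illustrative)
set.seed(1)
n   <- 200
X   <- runif(n, 0, 10)               # regressor (here treated as random)
f   <- function(x) 5 + 2 * sin(x)    # the "true", in practice unknown, function f
eps <- rnorm(n)                      # noise: mean 0, independent of X, here normal
Y   <- f(X) + eps                    # observed response
```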
▪ Note: $X$ can be fixed or random.
▪ Example: $Y$ is the number of claims and $X$ are the characteristics of a driver and his car.

Introduction
▪ Statistical learning is a set of approaches for estimating $f$ by $\hat f$ from the data.
▪ Estimation goals can be:
➢ Prediction
➢ Inference

Introduction
▪ Prediction: $\hat Y = \hat f(X)$, for some estimate $\hat f$.
▪ If prediction is our only goal and we have no interest in the form of $f$, then many modern techniques give good results: random forests, gradient boosting trees, etc.
▪ Example: predicting prices on the stock exchange. Here the interpretation is not important, as long as the results are good.

Introduction
▪ The accuracy of $\hat Y$ depends on two quantities:
➢ reducible error - coming from approximating $f$ by $\hat f$
➢ irreducible error - the error coming from $\epsilon$
▪ We measure the accuracy by the expected prediction error
$E(Y - \hat Y)^2 = \underbrace{E\big(f(X) - \hat f(X)\big)^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$
▪ Goal: to find a method that has a small reducible error.

Introduction
▪ Inference: we also want to understand the form of $f$, i.e. the relationship between $Y$ and $X = (X_1, \dots, X_p)$.
▪ Is $f$ linear or more complex?
▪ Which regressors are associated with $Y$?
▪ What is their relationship?

Choice of Model
▪ We may choose our model based on what we are more interested in: prediction or inference.
▪ Example:
➢ Parametric models like linear models and GLMs: simple and interpretable, but not always very accurate
➢ Non-parametric models like splines, GBM, random forests: better predictions, but much less interpretable
▪ Factors like sample size, computational power, etc. also play a significant role in making a decision.

Choice of Model
Example: linear regression vs. splines. [Figure]

Machine learning controversy
▪ Many machine learning techniques offer fully automated routines for calculating prices, insurance premiums, etc., or for clustering data into different segments (for example: brands or regions).
▪ But if interpretability is missing, many problems can occur.
▪ Certain companies have sparked controversy as ethnic, gender or otherwise 'unethical' variables slipped into their models, often because data bias was not corrected.

Machine learning controversy
▪ What about the insurance industry?
▪ Current standard: GLM models.
▪ Can machine learning replace them? More on that later!

Assessing Model Accuracy
▪ No model dominates all other models over all possible data sets. We need to decide which model is most suitable for the data set at hand.
▪ The prediction error $E(Y - \hat f(X))^2$ can be estimated by the mean squared error (MSE)
$\frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat f(X_i)\big)^2$,
given a sample $(X_i, Y_i)_{i=1}^{n}$.
▪ Here $X_i$ denotes the $p$-vector of regressors for the $i$-th data point.

Assessing Model Accuracy
▪ But we do not want to assess the model accuracy on the data we have already observed: $\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat f(X_i))^2$ is actually the in-sample (training) MSE.
▪ We want our model to perform well on future data.
▪ For a new (unseen) observation $(X_0, Y_0)$, it should hold that $\hat f(X_0) \approx Y_0$.
▪ In general, when considering all new data points,
$\mathrm{Average}_{(X_0, Y_0)} \big(Y_0 - \hat f(X_0)\big)^2$
should be small. This is the out-of-sample (testing) MSE.

Assessing Model Accuracy
▪ There is no guarantee that a model with a small training MSE will also have a small testing MSE, as the sketch below illustrates.
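A small R sketch of this gap, on simulated data (the data-generating function and the polynomial degrees are illustrative assumptions): as the degree grows, the training MSE keeps falling while the test MSE eventually deteriorates.

```r
# Training MSE vs. test MSE for polynomial fits of increasing degree
set.seed(1)
n <- 100
x <- runif(n, 0, 10);      y      <- 5 + 2 * sin(x) + rnorm(n)        # training data
x_test <- runif(n, 0, 10); y_test <- 5 + 2 * sin(x_test) + rnorm(n)   # unseen data

for (d in c(1, 3, 10, 20)) {                       # increasing model complexity
  fit <- lm(y ~ poly(x, d))
  mse_train <- mean((y - fitted(fit))^2)
  mse_test  <- mean((y_test - predict(fit, newdata = data.frame(x = x_test)))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n", d, mse_train, mse_test))
}
```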
▪ This leads to the concepts of underfitting and overfitting.

Assessing Model Accuracy
▪ As the model complexity increases, the training error gets smaller, but the testing error eventually increases.
▪ Underfitting: the model is too simple and performs badly on the training data, and consequently on the testing data.
▪ Overfitting: the training data is modelled too well, because non-existing patterns (coming from the noise) are found in the data. Therefore the performance on future data is poor.
[Figure: training and testing MSE as a function of model complexity]

Bias-variance trade-off
▪ Let $X_0$ be fixed. Note that the test MSE can be written as
$E\big(Y_0 - \hat f(X_0)\big)^2 = \underbrace{\big[\mathrm{Bias}(\hat f(X_0))\big]^2 + \mathrm{Var}\big(\hat f(X_0)\big)}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$
▪ Bias: the error introduced by approximating $f$ by $\hat f$.
▪ Variance: how much $\hat f$ changes if we use different data sets for training.

Bias-variance trade-off
▪ It is easy to find a method with low bias and high variance: just use a curve that connects all the points.
▪ It is easy to find a method with low variance and high bias: just take a flat line through the data.
▪ But we want a method that simultaneously has low bias and low variance.
▪ Example: [figure]

Test MSE Estimation
▪ In real-life situations it is not possible to compute the test MSE, because $f$ is unknown, so we need to estimate it.
▪ This can be done in the following ways:
➢ Cross-validation: directly estimating the test MSE by resampling
➢ Indirect estimation of the test error: adjust the training error by a penalty term that takes the model dimension into account

Cross-Validation
▪ Used to estimate the test MSE for a given statistical model.
▪ It tells us how our model performs on an unseen data set.
▪ When comparing several competing models, the one with the smallest cross-validation (CV) error is preferred.
▪ It can also be used for selecting tuning parameters for a chosen model (Ridge, Lasso, etc.).

Cross-Validation
There are three ways in which CV can be done:
1. Validation set approach: divide the data randomly into two data sets, training and testing. Usually an 80-20% split is used. The model is fitted on the training set and the prediction error
$\frac{1}{m}\sum_{i=1}^{m} \big(Y_i - \hat Y_i\big)^2$
is calculated on the testing data.

Cross-Validation
Example:
▪ The model trained on 80% of the data gives the prediction rule $\hat Y = 2X$.
▪ The test data is:

   Y    X
   5    2
   9    5
  10    4

▪ The CV error equals $\frac{1}{3}\big[(5-4)^2 + (9-10)^2 + (10-8)^2\big] = \frac{6}{3} = 2$.

Cross-Validation
▪ Drawbacks:
➢ The CV error can be extremely variable, depending on how the data was split.
➢ Only a subset of the data was used for training; this introduces a lot of bias, so we might overestimate the testing error.
2. Leave-one-out cross-validation (LOOCV): a data set with $n$ sample points is split into $n-1$ data points, on which the model is trained, and testing is done on the remaining data point. This is repeated $n$ times, so that each point gets to be in both the training and the validation set. The prediction errors are then averaged.

Cross-Validation
▪ Now there is no randomness in the data splits, and there is much less bias compared to the previous method, because $n-1$ points are used for training.
▪ Problem: we have to fit the model $n$ times. Computationally expensive. A minimal sketch is given below.
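A minimal hand-rolled LOOCV loop in R (the data and the assumption that the response column is named y are illustrative; packages such as boot offer this readily):

```r
# Leave-one-out CV for a linear model: fit n times, hold out one point each time
loocv_mse <- function(formula, data) {
  n    <- nrow(data)
  errs <- numeric(n)
  for (i in 1:n) {
    fit     <- lm(formula, data = data[-i, ])               # train on n - 1 points
    pred    <- predict(fit, newdata = data[i, , drop = FALSE])
    errs[i] <- (data$y[i] - pred)^2                         # error on the held-out point
  }
  mean(errs)                                                # LOOCV estimate of test MSE
}

d <- data.frame(x = runif(50), y = rnorm(50))               # illustrative data
loocv_mse(y ~ x, d)
```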
3. K-fold cross-validation: randomly divide the data set into $k$ parts of (approximately) equal size. Then train the model on $k-1$ parts and test on the remaining part. Repeat $k$ times and average the testing errors.

Cross-Validation
▪ How big should $k$ be? Experience shows that $k = 5$ or $k = 10$ give the best results.
▪ We fit the model only $k$ times.
▪ The bias remains small, because we fit on almost all the data, and the variability of the CV estimate gets smaller compared to LOOCV, because the outputs of the individual fits are less correlated.
▪ This method corrects the disadvantages of the previous two.

Example
▪ Response variable: mpg (miles per gallon).
▪ Polynomial regression is performed with the regressor horsepower. But which degree to take?
▪ Cross-validation can give us an answer.
[Figures: validation set approach and cross-validation errors for different polynomial degrees]

AIC, BIC, etc.
Another way of estimating the test MSE is to adjust the training MSE.
▪ AIC (Akaike Information Criterion) is an estimator of the out-of-sample prediction error and thereby of the relative quality of a statistical model for a given set of data.
▪ Given a collection of models, AIC estimates the quality of each model. Thus, AIC provides a means for model selection.

AIC, BIC, etc.
▪ Akaike extends the concept of maximum likelihood estimation to the case where the number of parameters $p$ is also unknown. A penalty depending on $p$ is introduced, so a parameter is added to the model only if it leads to a significant improvement in the fit.

AIC, BIC, etc.
▪ Let $f(y \mid \theta)$ be a candidate model for estimating $Y$, for $\theta \in \mathbb{R}^p$. For example: $f(y \mid \theta)$ is the density of $N(X\theta, I)$.
▪ Let $\hat\theta = \hat\theta(Y)$ be the MLE, given the data $Y \in \mathbb{R}^n$.
▪ Then $AIC = -2 \log f(Y \mid \hat\theta) + 2p$ is used as an estimate of the (relative) out-of-sample prediction error.
▪ The model with the smallest AIC is chosen.

AIC, BIC, etc.
▪ BIC (Bayesian Information Criterion) is a method similar to AIC.
➢ The model with the smallest $BIC = -2 \log f(Y \mid \hat\theta) + p \log(n)$ is chosen.
➢ Since the penalty term here is larger, sparser models are selected than with AIC.
➢ In the linear regression model with normal errors, AIC and BIC take the forms
$AIC = n \log(MSE) + 2p$ and $BIC = n \log(MSE) + p \log(n)$.

Linear Models
Model selection and regularization
▪ Linear models (and generalized linear models, GLMs), though simple, turn out to be surprisingly competitive in real-world problems, compared to more complex models.
▪ The reason lies in their simplicity and interpretability.
▪ GLMs are the standard in the insurance business, and most of the results for linear models generalize naturally.
▪ But what is their prediction accuracy, and what happens when the number of parameters $p$ is large compared to the sample size $n$?

Model selection and regularization
▪ Let us focus on linear models, for demonstration.
▪ Assume that $Y = X\beta + \epsilon$, for some $\beta \in \mathbb{R}^p$,
▪ with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$.
▪ Also, $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$.

Model selection and regularization
▪ The OLS estimator $\hat\beta = (X'X)^{-1} X'Y$ is well-defined for $n \ge p$ and is unbiased. Therefore the predictions $\hat Y = X\hat\beta$ are unbiased.
▪ For $p > n$, OLS is not even defined, so we have to come up with other estimators. A small sketch of both situations follows.
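To make this concrete, a minimal R sketch of the closed-form OLS estimator and of its breakdown when $p > n$ (dimensions and data are illustrative assumptions):

```r
# OLS via (X'X)^{-1} X'Y, well-defined for n >= p
set.seed(1)
n <- 50; p <- 10
X    <- matrix(rnorm(n * p), n, p)
beta <- rep(1, p)
Y    <- X %*% beta + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # the OLS estimate

# Now p = 60 > n = 50: X'X has rank at most n, so it is singular
X_wide <- matrix(rnorm(n * 60), n, 60)
# solve(t(X_wide) %*% X_wide) would fail here: OLS is not defined for p > n
```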
Model selection and regularization
But what about the variance of these estimates?
▪ If $n \gg p$, the variance is usually small and our estimates are accurate.
▪ But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because $\det(X'X)$ is almost 0 and the matrix inversion becomes very unstable.

Model selection and regularization
▪ Examples of (potentially) highly correlated variables in motor insurance:
➢ entry user age and current user age
➢ vehicle age and contract age
➢ population density and regional segmentation variables

Model selection and regularization
▪ Also, if $n$ is not much larger than $p$, the estimates can get very unstable.
▪ Example: if all regressors are i.i.d. $N(0,1)$, the variance of the predictions equals $\sigma^2 \frac{p}{n-p-1}$.
▪ This is problematic for $p$ large compared to $n$.

Model selection and regularization
▪ Alternatives to OLS in linear regression:
➢ Subset selection (best subset and stepwise)
➢ Dimension reduction (PCA, for example)
➢ Shrinkage methods (Ridge, Lasso, etc.)

Subset Selection
1. Best subset selection: for a linear model with $p$ predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of $Y$ is used as the predictor.
➢ For $k = 1, 2, \dots, p$:
1. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
2. Pick the best among these $\binom{p}{k}$ models and call it $M_k$, i.e. choose the model with the largest $R^2$.
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: in this last step you cannot use $R^2$, because then the largest model would always be chosen.

Subset Selection
▪ This method is conceptually very simple to understand.
▪ The problem? Too many models to fit! How many? $2^p$ models.
▪ For example: for $p = 30$, there are $2^{30} = 1\,073\,741\,824$ models to fit!
▪ So we need another solution.

Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward
Forward stepwise selection
▪ A computationally efficient alternative to best subset selection.
▪ Here we begin with the null model and add predictors one at a time until we reach the full model (or some stopping rule applies).
▪ Then we choose among these models using cross-validation, AIC, BIC, etc.

Subset Selection
More formally:
Forward stepwise selection: for a linear model with $p$ predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of $Y$ is used as the predictor.
➢ For $k = 0, 1, \dots, p-1$:
1. Consider all $p-k$ models that add one additional predictor to the model $M_k$.
2. Pick the best among these $p-k$ models and call it $M_{k+1}$, i.e. choose the model with the largest $R^2$.
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: in this last step you cannot use $R^2$, because then the largest model would always be chosen.

Subset Selection
▪ Here we fit only $1 + \sum_{k=0}^{p-1}(p-k) = 1 + \frac{p(p+1)}{2}$ models.
▪ For example: for $p = 30$, there are 466 models to fit. Much better than before.
▪ This procedure works well in practice, but there is no guarantee that it selects the best model overall.
Backward stepwise selection:
▪ Similar: here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
▪ The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
▪ It is a data frame with 97 rows and 9 columns. A best subset search on these data is sketched below.
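A sketch of the subset search in R; it assumes the prostate data is available as a data frame, e.g. from the ElemStatLearn package (archived on CRAN), with response lpsa and an indicator column train:

```r
# Best subset selection with the leaps package on the prostate data
library(leaps)
data(prostate, package = "ElemStatLearn")   # assumes ElemStatLearn is installed

# Best model (by RSS / R^2) of each size, up to all 8 clinical regressors
best <- regsubsets(lpsa ~ . - train, data = prostate, nvmax = 8)
summary(best)$which                          # which regressors enter at each size
```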
Example: Prostate cancer
▪ The R package leaps is used to select the best model (based on $R^2$) of each size.
▪ Then AIC and BIC are calculated for each of these models, based on the formulas for linear regression with normal errors.
[Figures: AIC and BIC for the best model of each size]

Summary of Day 1
▪ We assess the model quality by its prediction error $\frac{1}{n}\sum_{i=1}^{n} \big(Y_i - \hat f(X_i)\big)^2$, given a sample $(X_i, Y_i)_{i=1}^{n}$.
▪ But this is only one part of it - the training (in-sample) error.
▪ It is necessary to estimate this error for new (unseen) data - the testing (out-of-sample) error.

Summary
▪ A model (and its complexity) should be chosen based on these two prediction errors.
▪ The training error we can estimate from the sample directly.
▪ There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Summary
▪ Linear models: simple, but widely used because of their simplicity and interpretability.
▪ OLS is well-defined for $n \ge p$.
▪ But it performs badly if
➢ $p$ is large compared to $n$
➢ some of the regressors are highly correlated

Summary
▪ Some methods to reduce the number of parameters:
1. Best subset selection: all submodels are considered, but this is computationally infeasible.
2. Stepwise regression: regressors are added one at a time. Once a regressor is chosen, it stays.

Other Methods - Preview
▪ We are still to see:
➢ Some other methods that do model selection for linear models
➢ How to deal with correlations
➢ How to deal with the $p > n$ case

Principal Component Regression
▪ PCA uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
▪ This transformation is defined in such a way that the first principal component has the largest variance, the second principal component the second largest, etc.
▪ This way a dimension reduction can be performed, and OLS can then be fitted using the newly obtained regressors.
▪ One can show that this reduces the variance of the OLS estimator.

Principal Component Regression
▪ The only issue with this procedure is that the new regressors lose their interpretability, because they are linear combinations of the original regressors.
▪ But if prediction is the only goal, then this procedure is more than suitable. A minimal sketch follows.
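A small R sketch of principal component regression using base R (the data, the correlation structure and the number of retained components are illustrative assumptions):

```r
# PCR: PCA on the regressors, then OLS on the first k principal components
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.01 * rnorm(n)       # two almost perfectly correlated regressors
y <- X[, 1] + rnorm(n)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
k   <- 3                                  # number of components kept (tune by CV)
Z   <- pca$x[, 1:k]                       # uncorrelated new regressors
fit_pcr <- lm(y ~ Z)                      # OLS on the principal components
summary(fit_pcr)
```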
Shrinkage Methods
▪ We have already mentioned that if $p$ is relatively large compared to $n$, or if some regressors are highly correlated, then the OLS estimates can be very variable and therefore unstable.
▪ Also, we cannot do OLS for $p > n$.
▪ In order to tackle these problems, shrinking the regression coefficients is helpful.

Shrinkage Methods
▪ We know that if OLS is defined, then (Gauss-Markov)
➢ it is unbiased
➢ it has the smallest variance among all unbiased linear estimators.
▪ So, if we want to stay in the class of unbiased linear estimators, we cannot further reduce the variance.
▪ Idea: introduce a little bit of bias to decrease the variance significantly.

Ridge estimator
▪ Let $\lambda \ge 0$ be fixed. Then the Ridge estimator is defined as
$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right) = \arg\min_{\|\beta\|_2^2 \le c} \|Y - X\beta\|_2^2$
for some $c$ that depends on $\lambda$.
▪ For $\lambda = 0$ we obtain OLS. Otherwise we obtain a biased estimator with smaller variance than OLS.

Lasso estimator
▪ Let $\lambda \ge 0$ be fixed. Then the Lasso estimator is defined as
$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right) = \arg\min_{\|\beta\|_1 \le c} \|Y - X\beta\|_2^2$
for some $c$ that depends on $\lambda$.
▪ For $\lambda = 0$ we obtain OLS. Otherwise we obtain a biased estimator that in most cases outperforms OLS.
[Figures: shrinkage of the coefficients and the geometrical interpretation of the L1 and L2 constraints]

Shrinkage Methods
▪ For both estimators, the estimated $\beta$ coefficients are now bounded, which means that the variance of the estimates stays controlled.
▪ How to choose the right $\lambda$? Cross-validation!

Model selection
▪ The Ridge estimator will almost surely not set any estimated coefficients to zero, because of its L2 geometry.
▪ On the other hand, that is exactly what happens with Lasso estimates, because of the L1 norm.
▪ The larger the $\lambda$, the more coefficients are set to 0.
▪ So the Lasso performs model selection and estimation at the same time.

Example - Prostate data
▪ The more you increase $\lambda$, the smaller the estimated coefficients are.
▪ Ridge estimated coefficients: [figure]
▪ Lasso estimated coefficients: here they are set to 0 for large $\lambda$. [figure]
A sketch of both estimators in R follows.
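A minimal R sketch of Ridge and Lasso with the glmnet package, with $\lambda$ chosen by cross-validation (the simulated data and the sparse true coefficient vector are illustrative assumptions):

```r
# Ridge (alpha = 0) and Lasso (alpha = 1) with lambda chosen by cross-validation
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, 2, 1)) + rnorm(n)   # only 3 regressors truly matter

ridge <- cv.glmnet(X, y, alpha = 0)   # L2 penalty: shrinks, never exactly zero
lasso <- cv.glmnet(X, y, alpha = 1)   # L1 penalty: sets coefficients exactly to zero

coef(ridge, s = "lambda.min")         # all 20 coefficients shrunk towards 0
coef(lasso, s = "lambda.min")         # most coefficients exactly 0: model selection
```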
Generalized linear models (GLM)
▪ Generalized linear models (GLMs) are a natural extension of linear models.
▪ The mean of the response variable is now a function of a linear combination of the regressors.
▪ The response variable does not have to be normally distributed anymore; it can take one of the distributions from the exponential family: Bernoulli, Binomial, Poisson, Gamma, Exponential.
▪ GLMs are widely used in the insurance industry and are ideally suited for the analysis of the non-normal data that are commonly encountered in insurance.

GLM
▪ More formally: $Y_i \in \mathbb{R}$ is the response variable, $X_i \in \mathbb{R}^p$ the regressors.
▪ Linear regression: $E(Y_i \mid X_i) = \beta' X_i$ and $\hat Y_i = \hat\beta' X_i$.

GLM
▪ But what if $Y_i$ is a count variable, like the number of claims?
▪ Assume a Poisson distribution for each $Y_i$, but with a (potentially) different parameter $\lambda_i > 0$: each customer has a different claim frequency.
▪ We want to model $Y_i$ in terms of $X_i$.

GLM - Poisson regression
▪ We know that $P(Y_i = y \mid \lambda_i) = e^{-\lambda_i} \lambda_i^{y} / y!$ for each $y \in \{0, 1, 2, \dots\}$.
▪ Also, $E(Y_i \mid X_i) = \lambda_i$. We want to model $\lambda_i$ in terms of $\beta' X_i$. Since $\lambda_i > 0$, the following parametrization makes sense:
$E(Y_i \mid X_i) = \lambda_i = e^{\beta' X_i}$
▪ Estimator: $\hat Y_i = e^{\hat\beta' X_i} = e^{\hat\beta_1 X_{i1}} \cdots e^{\hat\beta_p X_{ip}}$ - a multiplicative structure.
▪ GLM: a generalization of this, allowing also for other distributions in the exponential family: Normal, Exponential, Gamma, Bernoulli, Binomial, etc.

GLM - Model choice
[Figure: overview of typical distribution choices per model. Source: Willis Towers Watson]

GLM
▪ Generalized linear models serve as the industry standard for non-life insurance pricing:
➢ The multiplicative output remains understandable also for non-actuaries.
➢ There is a range of professional insurance software dedicated to GLM.
➢ GLM is also possible in, for example, R.
▪ Burning costs are defined as Frequency × Severity.
▪ One can model the average frequency of claims, the average claim amounts (severity) or (directly) the average burning costs.
▪ Burning costs are then the basis for the (net) risk premium.

Criteria for GLM
▪ Portfolio size
➢ 150,000 exposure rows is seen as a minimum
➢ A significant number of claims is required as well
▪ Homogeneity of the risks in the portfolio
▪ Possibility to segment the risks
➢ Available risk factors
▪ Alternative methods
➢ Other pricing techniques
➢ Flat premium or premium influenced by one risk factor
➢ Individual underwriting
Often these criteria are met for a part of, but not for the full, portfolio.

Pricing Process
Risk Modelling Process
Data extraction (core system) → Data preparation → Initial analysis → Is a GLM possible? If yes: GLM analysis; if no: simplified pricing method → Net risk premium.

You could end up with a multitude of models
Example: MTPL
▪ Frequency models: material damage, bodily injury attritional, bodily injury large.
▪ Severity models: material damage, bodily injury attritional.
▪ In this example the severity of large BI claims is not modelled, but taken as a fixed amount per claim.
▪ Separate models are possible for private persons, fleets, leasing, etc.
▪ And this is just for passenger cars!

Validation of a GLM model
▪ Split the data set in two
➢ Usually an 80-20% split, or out-of-time
➢ Check how the model performs on unseen data
➢ Avoid overfitting
▪ Significance tests
➢ Significance of a parameter in the model
➢ Significance of the levels of a parameter against each other (how granular should a variable be)
▪ Temporal stability
➢ To be significant, an effect must be stable over the years
▪ Residual analysis
➢ To test the distribution
➢ On real data no distribution works perfectly

Combining Risk Models
▪ In the end we need to deliver a final risk premium.
➢ We should combine all the models we made.
➢ This is necessary to understand the total effect.
▪ Result: the net risk premium!

From net risk premium to gross risk premium
▪ A whole range of effects has to be added to the net risk premium.
▪ Most loadings will be added through an increase in the intercept, but there are other possibilities.
▪ A loading for discounts is to be added.
▪ Result: the gross risk premium.
As a concrete illustration of the frequency modelling step above, a minimal sketch follows.
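The sketch below fits a Poisson frequency GLM in R; the portfolio data, the variable names and the effect sizes are simulated, illustrative assumptions, not a real tariff:

```r
# Claim-frequency GLM: Poisson response, log link, exposure as an offset
set.seed(1)
n <- 5000
policies <- data.frame(
  driver_age  = sample(18:80, n, replace = TRUE),
  vehicle_age = sample(0:20, n, replace = TRUE),
  exposure    = runif(n, 0.1, 1)                 # years at risk per policy
)
lambda <- policies$exposure * exp(-2 + 0.01 * (60 - policies$driver_age))
policies$nclaims <- rpois(n, lambda)             # simulated claim counts

freq_model <- glm(nclaims ~ driver_age + vehicle_age,
                  family = poisson(link = "log"),
                  offset = log(exposure), data = policies)
exp(coef(freq_model))   # multiplicative factors: the structure used in pricing
```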
Additional Topics - Interactions
Interactions and GLM
▪ An interaction effect exists when the effect of an independent variable on a dependent variable changes depending on the value(s) of one or more other independent variables.
▪ In that case, interaction term(s) have to be added to the model.
▪ Example: gene A and gene B may each contribute to developing a certain disease, but in combination they are fatal.

Interactions and GLM
▪ The problem? GLM models do not detect interactions automatically.
▪ They can be added to the model, but this has to be done 'manually'.
▪ Example taken from:

Interactions and GLM
▪ In this example there is an interaction of age and engine power:
▪ $Age \ge 60$ and $Engine\ Power \ge 50$.
▪ But if this effect is not noticed and included in the model, the GLM fit is poor.
[Figures: GLM fit with and without the interaction term]

Machine Learning
Boosting
▪ But many machine learning algorithms can capture these effects automatically.
▪ Let us take gradient boosting trees as an example.
▪ How does this algorithm work? Let us present some basics.

Tree-based methods
▪ Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (typically a constant) in each region.
▪ Consider a regression problem with continuous response $Y$ and continuous regressors $X_1, X_2 \in (0,1)$.
▪ For example, the partition shown is simple but cannot be obtained by recursive binary splitting, i.e. represented by a tree. [figure]

Tree-based methods
▪ So let us restrict our attention to recursive binary partitions, like this one: [figure]
▪ First split the space into two regions and model the response by the mean of $Y$ in each region. Choose the split variable and split point to achieve the optimal split.
▪ Then one or both regions are further split in the same fashion, iteratively, until some stopping rule applies.

Tree-based methods
▪ The corresponding regression model predicts $Y$ with a constant $c_m$ if the inputs $X$ are in region $R_m$, i.e.
$\hat f(X) = \sum_{m=1}^{M} c_m \, \mathbf{1}(X \in R_m)$.
▪ These trees can now further be used for boosting. What is boosting?

Boosting
▪ Gradient boosting is one of the most powerful techniques for building predictive models. It has proven successful in many areas and is one of the leading methods for winning Kaggle competitions.

Boosting
▪ In general, models can be fitted to data individually or combined in an ensemble - a combination of simple individual models (usually trees) that together create a more powerful model.
▪ Boosting is a method that builds the model in a stage-wise fashion.
▪ It starts by fitting an initial model.
▪ The second model then focuses on accurately predicting the cases where the first model performed badly.
▪ The third model focuses on correcting the faults of the previous stage, etc.

Boosting
▪ Here we do not fit one big decision tree to the data, because this can easily lead to overfitting.
▪ Instead, the boosting algorithm learns slowly.
▪ At each step we fit a decision tree to the residuals from the previous model.
▪ The new tree is then added to the model.
[Figures: example data to be fitted, and the boosted fit after successive iterations]
A toy version of this residual-fitting loop is sketched below.
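A toy boosting loop in R with small rpart trees; the data (which contains an interaction between the two regressors), the learning rate and the tree depth are illustrative assumptions, and production work would use a dedicated package such as gbm or xgboost:

```r
# Toy gradient boosting for squared-error loss: fit a small tree to the
# current residuals, add a shrunken version of it to the model, repeat.
library(rpart)
set.seed(1)
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- ifelse(d$x1 > 0.5 & d$x2 > 0.5, 3, 0) + rnorm(200, sd = 0.3)  # interaction

pred <- rep(0, nrow(d))     # start from the zero model
nu   <- 0.1                 # learning rate ("slow learning")
for (b in 1:200) {          # number of trees
  d$r  <- d$y - pred                                        # current residuals
  tree <- rpart(r ~ x1 + x2, data = d,
                control = rpart.control(maxdepth = 2))      # small tree, depth 2
  pred <- pred + nu * predict(tree, d)                      # add the shrunken tree
}
mean((d$y - pred)^2)        # training MSE of the boosted ensemble
```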
Boosting
▪ Usually the trees are rather small, but they should be deep enough to capture interactions. Two splits per tree are already enough to catch first-order interactions.
▪ There are several parameters that need to be chosen: the number of trees, the number of splits in each tree and the learning rate of the algorithm (usually 0.1 or 0.01).
▪ For the number of trees, cross-validation is used.

Example
▪ Back to our example.
▪ Remember that the GLM could not 'recognize' the interaction between age and engine power.
▪ But GBMs do, provided that the tuning parameters have been carefully selected. [figure]

Method comparison
GLM vs Machine Learning
▪ The problem with these kinds of algorithms is that the interpretation is almost completely lost.
▪ It is very unlikely that such models will be approved by regulators, at least in the majority of countries.
▪ And even if they are, the insurance company runs the risk of reputational loss, in case some of the ethical problems discussed before emerge.

GLM vs Machine Learning
▪ Also, actuaries want to understand their models and not use black-box alternatives.
▪ So, GLMs will probably not be replaced by machine learning algorithms in the near future.
▪ But machine learning can assist actuaries in spotting interactions, assessing variable significance or performing clustering tasks.

GLM vs Machine Learning
▪ Examples of clustering tasks are brand or region clustering. Here the black-box nature of the models is not so important, because the model results can usually be validated easily.

Literature
▪ An Introduction to Statistical Learning with Applications in R - Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
▪ https://www.reacfin.com/wp-content/uploads/2016/12/20170914-Machine-Learning-applications-for-non-life-pricing.pdf
▪ https://www.stat.cmu.edu/~ryantibs/datamining/lectures/17-modr2.pdf