High-Dimensional Statistics and Machine Learning with Applications to Insurance
November 2022, Masaryk University, Brno
Ivana Milović, MAS PhD

Introducing Myself
Ivana Milović, MAS PhD – Non-Life Pricing Actuary (SME), ivana.milovic@allianz.at
Prior experience
• Uniqa Insurance Group – Non-Life Pricing Actuary (Motor)
• Lecturer – University of Vienna
• Pre-doc and post-doc researcher – Department of Statistics, University of Vienna
Education
• PhD in Statistics (Univ. of Vienna, 2016)
• Master of Advanced Studies in Mathematics (Univ. of Cambridge, 2011)
• BSc in Mathematics and Computer Science (Univ. of Belgrade, 2010)

What is pricing?
“Pricing is the way that a company decides prices for its products or services, or the prices decided” – Cambridge Dictionary

Why do we need statistics and mathematical modelling for pricing in insurance?
Classical industry example: selling paperclips
• Known operating costs (rent, maintenance, salaries, marketing, etc.)
• Known production costs (materials, etc.)
• Known profit margin
→ Known price of a paperclip. Fully deterministic!

Classical insurance example: selling a policy
• Known operating costs (rent, maintenance, salaries, marketing, etc.)
• Unknown claim costs (claim occurrence and severity are random events)
• Known profit margin
→ Unknown price of a policy. Not deterministic!

If the cost of a policy is random, how do we estimate it? There are two ways:
• Based on historical data and expert judgement (simplistic approach)
• Fitting statistical models to historical data → technical pricing

What are the goals of technical pricing?
• To provide the best estimate for the expected cost of an insurance policy → fair price
• To help us predict future losses and better assess portfolio and segment performance
• To know which segments are technically profitable and unprofitable → identify business opportunities

How to perform technical pricing?
[Diagram: the pricing workflow runs from data (collection, preparation, formats, variables, validation) through the choice of a family of models (scope, analysis, complexity, validation) to the final price, implemented with IT. Our focus in this class is the modelling part.]

Content
o Model assessment and selection
o Cross-validation, AIC, BIC
o Linear models
o PCR, regularization methods
o Generalized linear models
o Pricing process
o Machine learning in insurance

Let's get started

Introduction
▪ Let Y be a quantitative response and X = (X_1, …, X_p) a set of regressors, and suppose
Y = f(X) + ε,
for some fixed (but unknown) function f.
▪ ε has mean 0 and is independent of X. Often, we assume normality.
▪ Note: X can be fixed or random.
Example: Y is the number of claims and X are the characteristics of a driver and his car.

Statistical learning is a set of approaches for estimating f by f̂ from the data. Estimation goals can be:
➢ Prediction
➢ Inference

Prediction: Ŷ = f̂(X), for some estimate f̂. If prediction is our only goal and we have no interest in the form of f, then many modern techniques give good results: random forests, gradient-boosted trees, etc.
Example: predicting prices on the stock exchange. Here the interpretation is not important, as long as the results are good.
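To make this setting concrete, here is a minimal simulated sketch in R (my own illustration, not from the slides): data are generated from a known f, an estimate f̂ is fitted, and predictions Ŷ = f̂(X_0) are computed for new points. The choice of f and of a smoothing-spline fit are purely illustrative assumptions.

```r
set.seed(1)

# True (but in practice unknown) regression function f
f <- function(x) 2 * sin(x)

# Simulate training data: Y = f(X) + eps, with eps ~ N(0, 0.5^2)
n <- 200
x <- runif(n, 0, 2 * pi)
y <- f(x) + rnorm(n, sd = 0.5)

# Estimate f by a flexible method (here a smoothing spline)
fit <- smooth.spline(x, y)

# Predict on new, unseen inputs: Y_hat = f_hat(X0)
x0 <- seq(0, 2 * pi, length.out = 5)
y_hat <- predict(fit, x0)$y

# The gap between prediction and truth is the reducible part;
# the noise eps can never be predicted (irreducible part)
cbind(x0, truth = f(x0), prediction = y_hat)
```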
▪ The accuracy of Ŷ depends on two quantities:
➢ reducible error – coming from approximating f by f̂
➢ irreducible error – the error coming from ε
▪ We measure the accuracy by the expected prediction error
E(Y − Ŷ)² = E(f(X) − f̂(X))² + Var(ε),
where the first term is the reducible error and the second the irreducible error.
▪ Goal: to find a method that has a small reducible error.

▪ Note that
E(Y − Ŷ)² = E(f(X) + ε − f̂(X))² = E(f(X) − f̂(X))² + 2E[(f(X) − f̂(X)) ε] + E(ε²) = E(f(X) − f̂(X))² + Var(ε),
because the cross term vanishes: ε has mean 0 and is independent of X (and f̂ is treated as fixed here).

Inference: we also want to understand the form of f, i.e. the relationship between Y and X = (X_1, …, X_p).
• Is f linear or more complex?
• Which regressors are associated with Y?
• What is their relationship?

Choice of Model
We may choose our model based on what we are more interested in: prediction or inference.
Example:
➢ Parametric models like linear models and GLMs: simple and interpretable, but not always very accurate
➢ Non-parametric models like splines, GBM, random forests: better predictions but much less interpretable
Factors like sample size, computational power, etc. also play a significant role in the decision making.

Example: linear regression vs. splines
[Figure: a linear fit and a spline fit to the same data – the spline follows the data more closely, but what about interpretability?]

Controversies

Machine learning controversy
Many machine learning techniques offer fully automated routines for calculating prices, insurance premiums, etc., or for clustering data into segments (for example: brands or regions). But if interpretability is missing, many problems can occur.

▪ Certain companies have sparked controversy when ethnic, gender or otherwise 'unethical' variables slipped into their models, often because data bias was not corrected.

In 2019, Facebook was found to be in contravention of the U.S. constitution by allowing its advertisers to deliberately target adverts according to gender, race and religion, all of which are protected classes under the country's legal system. Job adverts for roles in nursing or secretarial work were suggested primarily to women, whereas job ads for janitors and taxi drivers had been shown to a higher number of men, in particular men from minority backgrounds. The algorithm learned that ads for real estate were likely to attain better engagement statistics when shown to white people, resulting in them no longer being shown to other minority groups.

What about the insurance industry?
• Current standard: GLM models
• Can machine learning replace them? More on that later!

Assessing model accuracy

Assessing Model Accuracy
• No model dominates all other models over all possible data sets. We need to decide which model is most suitable for the data set at hand.
• The prediction error E(Y − f̂(X))² can be estimated by the mean squared error (MSE)
MSE = (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))²,
given a sample (X_i, Y_i), i = 1, …, n.
• Here X_i denotes the p-vector of regressors for the i-th data point.

• But we do not want to judge the model's accuracy only on the data we have already observed!
• (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))² is, in fact, the in-sample (training) MSE.
• We want our model to perform well on future data.
• For a new (unseen) observation (X_0, Y_0), it should hold that f̂(X_0) ≈ Y_0.
• In general, when considering all new data points,
Average_{(X_0, Y_0)} (Y_0 − f̂(X_0))²
should be small. This is the out-of-sample (testing) MSE.
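As a minimal illustration of the difference between the two errors, the following R sketch (my own, not from the slides) fits polynomials of increasing degree on a training set and evaluates the MSE both in-sample and on a held-out test set; the simulated data and the range of degrees are illustrative assumptions.

```r
set.seed(2)

# Simulated data from a nonlinear truth plus noise
n <- 300
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

# Random 80/20 split into training and testing data
train_id <- sample(n, size = 0.8 * n)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

mse <- function(y, y_hat) mean((y - y_hat)^2)

# Training vs. testing MSE for polynomial degrees 1..10:
# the training MSE keeps falling, the testing MSE eventually rises
for (d in 1:10) {
  fit <- lm(y ~ poly(x, d), data = train)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              d, mse(train$y, fitted(fit)),
              mse(test$y, predict(fit, newdata = test))))
}
```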
There is no guarantee that a model with a small training MSE will also have a small testing MSE. This leads to the concepts of underfitting and overfitting.

As the model complexity increases, the training error keeps getting smaller, but beyond some point the testing error starts to increase again.
Underfitting: the model is too simple and performs badly on the training data, and consequently also on the testing data.
Overfitting: the training data are modelled too well, because non-existing patterns (coming from the noise) are found in the data. Therefore, the performance on future data is poor.

Bias-variance trade-off
▪ Let X_0 be fixed. Note that the test MSE can be written as
E(Y_0 − f̂(X_0))² = Bias(f̂(X_0))² + Var(f̂(X_0)) + Var(ε),
where the first two terms form the reducible error and Var(ε) is the irreducible error.
▪ Bias: error introduced by approximating f by f̂.
▪ Variance: how much f̂ changes if we use different data sets for training.

Bias-variance trade-off (additional)

It is easy to find a method with low bias and high variance: just use a curve that connects all the points.
It is easy to find a method with low variance and high bias: just take a flat line through the data.
But we want a method that simultaneously has low bias and low variance.

[Figure: example of the bias-variance trade-off.]

Test MSE Estimation
In real-life situations it is not possible to compute the test MSE E(Y_0 − f̂(X_0))² directly, because f and the distribution of new data points are unknown, so we need to estimate it.
The estimation can be done in the following ways:
➢ Cross-validation: directly estimating the test MSE by resampling
➢ Indirect estimation: adjust the training error by a penalty term which takes the model dimension into account, i.e. test MSE ≈ training MSE + penalty term

Cross-validation

Cross-Validation
• Used to estimate the test MSE for a given statistical model
• It tells us how our model performs on unseen data
• When comparing several competing models, the one with the smallest cross-validation error (CV) is preferred
• It can also be used for selecting tuning parameters of a chosen model (Ridge, Lasso, etc.)

There are three ways in which CV can be done:
1. Validation set approach: divide the data randomly into two sets, training and testing; usually an 80–20% split is used. The model is then fitted on the training set and the prediction error
(1/m) Σ_{i=1}^m (Y_i − Ŷ_i)²
is calculated on the m testing observations.

Example:
▪ The model trained on 80% of the data gives the prediction rule Ŷ = 2X.
▪ The test data are:
Y   X
5   2
9   5
10  4
▪ The CV error equals 1/3 · [(5 − 4)² + (9 − 10)² + (10 − 8)²] = 6/3 = 2.

Drawbacks:
➢ The CV error can be extremely variable, depending on how the data were split
➢ Only a subset of the data is used for training; this introduces bias, so we might overestimate the testing error
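The validation-set calculation above is easy to reproduce; this short R sketch (illustrative, mirroring the example on the slide) returns the same error of 2.

```r
# Test set from the example: response Y and regressor X
test <- data.frame(Y = c(5, 9, 10), X = c(2, 5, 4))

# Prediction rule obtained on the 80% training part: Y_hat = 2 * X
y_hat <- 2 * test$X

# Validation-set estimate of the test MSE
cv_error <- mean((test$Y - y_hat)^2)
cv_error  # 2
```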
2. Leave-one-out cross-validation (LOOCV): a data set with n sample points is split into a training set of n − 1 points, on which the model is fitted, and a single remaining point on which it is tested. This is repeated n times, so that each point is used exactly once as the validation set. The n prediction errors are then averaged.

• Now there is no randomness in the data splits, and there is much less bias compared to the previous method, because n − 1 points are used for training.
• Problem: we have to fit the model n times. Computationally expensive.

3. K-fold cross-validation: randomly divide the data set into k parts of (approximately) equal size. Then train the model on k − 1 parts and test on the remaining part. Repeat k times and average the testing errors.

• How big should k be?
• Experience shows that k = 5 or k = 10 gives the best results.
• We fit the model only k times.
• The bias remains small, because we fit on almost all of the data, and the variability of the CV estimate is smaller than for LOOCV, because the outputs of the individual fits are less correlated.
• This method corrects the disadvantages of the previous two.

Cross-validation example
• Response variable mpg – miles per gallon
• Polynomial regression is performed with the regressor horsepower. But which degree to take?
• Cross-validation can give us an answer.

[Figure: estimated test MSE for the different polynomial degrees – validation set approach.]
[Figure: cross-validation results for the different polynomial degrees.]

Adjust the training error: AIC, BIC, etc.
The other way of estimating the test error is to adjust the training MSE.

• AIC (Akaike Information Criterion) is an estimator of the out-of-sample prediction error and thereby of the relative quality of a statistical model for a given set of data.
• Given a collection of models, AIC estimates the quality of each model. Thus, AIC provides a means for model selection.
• Akaike extends maximum likelihood estimation to the case where the number of parameters p is also unknown. A penalty depending on p is introduced, so a parameter is added to the model only if it leads to a significant improvement in the fit.

• Let f(y | θ) be a candidate model for Y, with θ ∈ R^p. For example, f(y | θ) can be the density of N(Xθ, I).
• Let θ̂ = θ̂(Y) be the maximum likelihood estimator, given the data Y ∈ R^n.
• Then AIC = −2 log f(Y | θ̂) + 2p is used as an estimate of the (relative) out-of-sample prediction error.
• The model with the smallest AIC is chosen.

BIC (Bayesian Information Criterion) is a similar method to AIC.
➢ The model with the smallest BIC = −2 log f(Y | θ̂) + p log(n) is chosen.
➢ Since the penalty term here is larger, sparser models are selected than with AIC.
➢ In the linear regression model with normal errors, AIC and BIC take the following forms (up to additive constants):
AIC = n log(MSE) + 2p and BIC = n log(MSE) + p log(n).
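To make the degree selection concrete, here is a hedged R sketch along the lines of the mpg example above; it assumes the Auto data set from the ISLR2 package and uses cv.glm from the boot package for 10-fold cross-validation, with AIC and BIC shown as the penalty-based alternatives.

```r
# install.packages(c("ISLR2", "boot"))  # if not yet installed
library(ISLR2)  # provides the Auto data (mpg, horsepower, ...)
library(boot)   # provides cv.glm for k-fold cross-validation

set.seed(3)
for (d in 1:5) {
  # glm with the default gaussian family is an ordinary linear model,
  # but it works directly with cv.glm
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv_err <- cv.glm(Auto, fit, K = 10)$delta[1]  # 10-fold CV estimate of test MSE
  cat(sprintf("degree %d: 10-fold CV = %.2f, AIC = %.1f, BIC = %.1f\n",
              d, cv_err, AIC(fit), BIC(fit)))
}
```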
Types of Models

Linear Models

Model selection and regularization
• Linear models (and generalized linear models, GLMs), though simple, turn out to be surprisingly competitive in real-world problems compared to more complex models.
• The reason for that lies in their simplicity and interpretability.
• GLMs are the standard in the insurance business, and most of the results for linear models generalize naturally to them.
• But what is their prediction accuracy, and what happens when the number of parameters p is large compared to the sample size n? More tomorrow!

▪ Let us focus on linear models, for demonstration.
▪ Assume that Y = Xβ + ε for some β ∈ R^p, with E(ε) = 0 and Var(ε) = σI. Also, Y ∈ R^n and X ∈ R^{n×p}.
▪ The OLS estimator β̂ = (X'X)⁻¹ X'Y is well-defined for n ≥ p (assuming X has full column rank) and it is unbiased. Therefore, the predictions Ŷ = Xβ̂ are unbiased.
▪ For p > n, OLS is not even defined. Therefore, we have to come up with some other estimators.

But what about the variance of these estimates?
• If n ≫ p, the variance is usually small and our estimates are accurate.
• But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because det(X'X) is almost 0 and the matrix inversion becomes very unstable.

Examples of (potentially) highly correlated variables:
• Motor insurance: vehicle age and contract age; population density and regional segmentation variables
• SME insurance: turnover and number of employees

• Also, if n is not much larger than p, the estimates can become very unstable.
• Example: if all regressors are i.i.d. N(0,1), the variance of the predictions equals σ · p / (n − p − 1).
• This is problematic when p is large compared to n.

Alternatives to OLS in linear regression:
➢ Subset selection (best subset and stepwise)
➢ Dimension reduction (PCA, for example)
➢ Shrinkage methods (Ridge, Lasso, etc.)

Subset Selection
1. Best subset selection: for a linear model with p predictors do
➢ Let M_0 be the null model with zero regressors, i.e. the sample mean of Y is used as the predictor.
➢ For k = 1, 2, …, p:
1. Fit all (p choose k) models that contain exactly k predictors.
2. Pick the best among these (p choose k) models, i.e. the one with the largest R², and call it M_k.
➢ Select the best model among M_0, M_1, …, M_p using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.
https://en.wikipedia.org/wiki/Coefficient_of_determination

• This method is conceptually very simple to understand.
• Problem? Too many models to fit!
• How many? 2^p models.
• For example: for p = 30, there are 1 073 741 824 models to fit!
• So, we need another solution.
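For moderate p, best subset selection can still be run directly. The sketch below (an illustration on simulated data, not from the slides) uses regsubsets from the R package leaps, which the prostate example later also relies on; the same function supports method = "forward" and method = "backward" for the stepwise variants described next.

```r
# install.packages("leaps")
library(leaps)

set.seed(4)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
# Only x1 and x2 truly matter in this simulated example
y <- 1 + 2 * X[, 1] - 3 * X[, 2] + rnorm(n)
dat <- data.frame(y, X)

# Exhaustive search: best model (largest R^2) of each size 1..p
best <- regsubsets(y ~ ., data = dat, nvmax = p)
s <- summary(best)

# Choose the final model size with BIC
# (R^2 alone would always pick the full model)
size <- which.min(s$bic)
coef(best, size)  # coefficients of the selected model
```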
Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward

Forward stepwise selection
• A computationally efficient alternative to best subset selection
• Here we begin with the null model and add predictors one at a time until we reach the full model (or some stopping rule is applied)
• Then we choose among these models using cross-validation, AIC, BIC, etc.

More formally, forward stepwise selection: for a linear model with p predictors do
➢ Let M_0 be the null model with zero regressors, i.e. the sample mean of Y is used as the predictor.
➢ For k = 0, 1, …, p − 1:
1. Consider all p − k models that add one additional predictor to the model M_k.
2. Pick the best among these p − k models, i.e. the one with the largest R², and call it M_{k+1}.
➢ Select the best model among M_0, M_1, …, M_p using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.

• Here we fit only 1 + Σ_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models.
• For example: for p = 30, there are 466 models to fit. Much better than before.
• This procedure works well in practice, but there is no guarantee that we will select the best model overall.

Backward stepwise selection: similar, but here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
• The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
• It is a data frame with 97 rows and 9 columns.

The R package leaps is used to select the best model (based on R²) of each size.

• Then AIC and BIC are calculated for each of these models, based on the formulas for linear regression with normal errors.

Summary for today

Summary
• We assess the model quality by its prediction error (1/n) Σ_{i=1}^n (Y_i − f̂(X_i))², given a sample (X_i, Y_i), i = 1, …, n.
• But this is only one part of the picture – the training (in-sample) error.
• It is necessary to estimate this error for new (unseen) data – the testing (out-of-sample) error.

A model (and its complexity) should be chosen based on these two prediction errors.

• The training error we can estimate from the sample directly.
• There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Linear models: simple, but widely used because of their simplicity and interpretability.
OLS is well-defined for n ≥ p.
But they perform badly if
➢ p is large compared to n
➢ some of the regressors are highly correlated

Some methods to reduce the number of parameters:
1. Best subset selection: all submodels are considered, but this is computationally infeasible.
2. Stepwise regression: regressors are added one at a time. Once a regressor is chosen, it stays.

Preview
We are still to see:
• Some other methods that do model selection for linear models
• How to deal with correlations
• How to deal with the p > n case

Thank you!