High-dimensional Statistics and Machine Learning with Applications to Insurance
November 2022
Masaryk University, Brno
Ivana Milović, MAS, PhD

Introducing Myself
Ivana Milović, MAS, PhD
Non-Life Pricing Actuary (SME)
ivana.milovic@allianz.at
Prior experience
• Uniqa Insurance Group – Non-Life Pricing Actuary (Motor)
• Lecturer – University of Vienna
• Prae-Doc and Post-Doc Researcher – Department of Statistics, University of Vienna
Education
• PhD in Statistics (University of Vienna, 2016)
• Master of Advanced Studies in Mathematics (University of Cambridge, 2011)
• BSc in Mathematics and Computer Science (University of Belgrade, 2010)

Our Company

Allianz Group at a glance
With around 155,000 employees worldwide, the Allianz Group serves over 126 million customers¹ in more than 70 countries. In fiscal year 2021 the Allianz Group achieved total revenues of approximately 148.5 billion euros. Allianz is one of the world's largest asset managers, with third-party assets of nearly 2.0 trillion euros at year end.
1 Including non-consolidated entities with Allianz customers. Data as of March 4, 2022 (release of the Annual Report 2021).

We are much more than just an insurer
… Allianz is the leading specialist in space insurance and celebrated its 100th birthday as an aviation insurer in 2015
… the Allianz Center for Technology has conducted thousands of crash tests since 1980 to improve road safety
… Allianz insures major Hollywood and Bollywood movies, including all 24 James Bond productions
… Allianz offers financial solutions to more than 49 million emerging consumers in Africa, Asia and Latin America
… Allianz was one of the insurers of the Titanic
… Allianz supports sustainable motor sports as a partner of the fully electric Formula E racing championship
… Allianz insured the last three buildings to hold the title of "world's tallest": Petronas Towers, Taipei 101 & Burj Khalifa
… Allianz is the Worldwide Insurance Partner of the Olympic & Paralympic Movements from 2021 to 2028 – and one of the few global brands that can communicate with the Olympic IP/rings

We are a global player
• Leading Property and Casualty insurer globally
• Among the top 5 in Life/Health business globally
• Among the top 5 asset managers globally
• Global leader in credit insurance
• Worldwide leader in assistance services
• One of the leading corporate insurers globally

Revenues by segments and regions
Allianz Group revenues 2021: EUR 148.5 bn
By segments: Property/Casualty Insurance 42%, Life/Health Insurance 53%, Asset Management 6%
By regions¹,²: Germany 26%, Western & Southern Europe 30%, USA 12%, Growth Markets³ 10%, Anglo Markets⁴ 6%, Specialty Insurance⁵ 15%
1 Excl. Corporate & Other and consolidations. 2 Incl. banking revenues. 3 Central and Eastern Europe, Asia-Pacific, Latin America, Middle East and Africa, Turkey; Austria and Allianz Direct allocated to Western and Southern Europe. 4 UK, Ireland, Australia. 5 Allianz Global Corporate & Specialty, Euler Hermes, Allianz Partners, Allianz Re.
Data as of March 29, 2022 (release of the Allianz Fact Sheet; further information on allianz.com)

Interested in joining one of our companies?
❑ Bachelor/Master thesis supervision
❑ Summer internship
❑ Full- or part-time job
Contact us: ivana.milovic@allianz.at

What is pricing?
"Pricing is the way that a company decides prices for its products or services, or the prices decided" – Cambridge Dictionary

Why do we need statistics and mathematical modelling for pricing in insurance?
Because the cost of a policy is random. How do we estimate it? There are two ways:
• Based on historical data / expert judgement (simplistic approach)
• Fitting statistical models to historical data -> technical pricing

Summary from yesterday

Summary
• We assess the model quality by its prediction error $\frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat{f}(X_i)\big)^2$, given a sample $(X_i, Y_i)_{i=1}^{n}$.
• But this is only one part of it – the training (in-sample) error.
• It is also necessary to estimate this error for new (unseen) data – the testing (out-of-sample) error.

Summary
A model (and its complexity) should be chosen based on these two prediction errors.
[Figure: training and testing error as functions of model complexity]

Summary
• The training error can be estimated from the sample directly.
• There are two types of methods for estimating the testing error:
1. Cross-validation: based on resampling
2. AIC, BIC, etc.: based on testing error ≈ training error + dimension penalty

Content
Topics
o Model assessment and selection
o Cross-validation, AIC, BIC
o Linear models
o PCR, regularization methods
o Generalized linear models
o Pricing process
o Machine learning in insurance

Types of Models

Linear Models

Model selection and regularization
• Linear models (and generalized linear models: GLMs), though simple, turn out to be surprisingly competitive in real-world problems compared to more complex models (GLMs are the standard in the insurance business).
• The reason for this lies in their simplicity and interpretability.
• But what is their prediction accuracy, and what happens when the number of parameters p is large compared to the sample size n?

Model selection and regularization
▪ Let us focus on linear models, for demonstration.
▪ Assume that $Y = X\beta + \epsilon$ for some $\beta \in \mathbb{R}^p$, with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I$. Also, $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$.
▪ If $n \geq p$ (and X has full column rank), then the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ is well-defined and unbiased.
▪ Therefore, the estimates $\hat{Y} = X\hat{\beta}$ are unbiased.
▪ But for $p > n$, OLS is not even defined. Therefore, we have to come up with some other estimators.

Model selection and regularization
But what about the variance of these estimates?
• If $n \gg p$, the variance is usually small and our estimates are accurate.
• But if two or more variables are highly correlated, this can lead to high variance and therefore unstable estimates. This happens because $\det(X'X)$ is almost 0, so the matrix inversion becomes numerically unstable.

Model selection and regularization
Examples of (potentially) highly correlated variables in motor insurance:
• Vehicle age and contract age
• Engine power and engine volume
• Population density and regional segmentation variables
Example of (potentially) highly correlated variables in SME insurance:
• Turnover and number of employees

Model selection and regularization
• Also, if n is not much larger than p, the estimates can get very unstable.
• Example: if all regressors are i.i.d. N(0,1), the variance of the predictions equals $\sigma^2 \frac{p}{n-p-1}$.
• This is problematic for p large compared to n. The simulation below illustrates this.
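A small simulation sketch of this effect (a minimal illustration, assuming the setup above: i.i.d. N(0,1) regressors, true β = 0 and σ = 1; the sample sizes and dimensions are arbitrary):

```r
# Empirical variance of OLS predictions at a new point, compared with the
# theoretical value sigma^2 * p / (n - p - 1), as p approaches n.
set.seed(1)
n <- 50
for (p in c(2, 10, 25, 40, 45)) {
  preds <- replicate(500, {
    X  <- matrix(rnorm(n * p), n, p)  # i.i.d. N(0,1) regressors
    y  <- rnorm(n)                    # pure noise: true beta = 0, sigma = 1
    x0 <- rnorm(p)                    # a new (unseen) test point
    sum(x0 * coef(lm(y ~ X - 1)))     # OLS prediction at x0
  })
  cat(sprintf("p = %2d: empirical var %6.2f, theoretical %6.2f\n",
              p, var(preds), p / (n - p - 1)))
}
```

For p = 45 and n = 50 the prediction variance is already more than 250 times larger than for p = 2.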
Model selection and regularization
• To summarize, we need to find methods to
❑ reduce the number of parameters (dimensionality), and/or
❑ reduce the variance of the estimators.
• Otherwise, the models are unreliable!

Model selection and regularization
Alternatives to OLS in linear regression:
• Subset selection (best subset and stepwise)
o The goal is to choose a subset of all regressors and use that model as an approximation
o Cross-validation, AIC or BIC then help us choose the best submodel
• Dimension reduction (PCA, for example)
o We transform the original regressors so that the new ones are uncorrelated and sorted in order of importance, so we can also reduce the dimension
• Shrinkage methods (Ridge, Lasso, etc.)
o Modifications of the OLS (ordinary least squares) estimator, in which the estimated coefficients are shrunk towards zero, to reduce the variability of the estimator and make it more stable
o They also work for p larger than n

Model selection and regularization
For more details, see the Appendix chapter.

GLM – industry standard

GLM
• Generalized linear models (GLMs) are a natural extension of linear models.
• The response variable is now a function of a linear combination of the regressors.
• The response variable no longer has to be normally distributed; it can follow one of the distributions from the exponential family: Bernoulli, Binomial, Poisson, Gamma, Exponential.
• GLMs are widely used in the insurance industry and are ideally suited for the analysis of the non-normal data that are commonly encountered in insurance.

GLM – Model choice
[Figure: choice of distribution and link function for a GLM. Source: Willis Towers Watson]

GLM
Generalized linear models serve as the industry standard for non-life insurance pricing:
• Their multiplicative output remains understandable even for non-actuaries
• There is a range of professional insurance software dedicated to GLMs
• GLMs are also available in R, Python and other open-source software

Risk Premium
• In order to come up with a price for an insurance policy, we need to start by modelling the base risk premium.
• The (net) risk premium is defined as Frequency (how often) × Severity (how high).
• One can model
❑ the frequency of claims -> Poisson
❑ the claim amount (severity) -> Gamma
❑ (directly) the risk premium -> Gamma, Tweedie

Risk Premium
• This procedure has to be done for every insurance coverage.
• Example in SME insurance:
[Flowchart: Data extraction (core system) -> Data preparation -> Initial analysis -> GLM possible? If yes: GLM analysis; if no: simplified pricing method -> Net risk premium]

Risk Premium
• In the end, we need to deliver a final risk premium.
• We should combine all the models we made.
• It is necessary to understand the total effect.
• Result: total net risk premium.

Risk Premium: from net to gross
A whole range of effects is to be added to the net risk premium to arrive at the gross risk premium.

GLM – space for improvement

Interactions and GLM
• An interaction effect exists when the effect of an independent variable on a dependent variable changes depending on the value(s) of one or more other independent variables.
• In that case, interaction term(s) have to be added to the model.
• Example: gene A and gene B may each contribute to developing a certain disease, but in combination they are fatal.

Interactions and GLM
• The problem? GLMs do not detect interactions automatically.
• They can be added to the model, but this has to be done 'manually', as the following sketch illustrates.
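A minimal sketch of this manual step in R, on simulated data (the data frame motor and all column names are hypothetical; the built-in joint effect anticipates the motor example below):

```r
# Simulated motor portfolio with a joint effect of age and engine power.
set.seed(7)
n <- 5000
motor <- data.frame(age      = sample(18:80, n, replace = TRUE),
                    power    = sample(30:120, n, replace = TRUE),
                    exposure = runif(n, 0.5, 1))
lambda <- with(motor, exposure * 0.1 * exp(0.8 * (age >= 60 & power >= 50)))
motor$nclaims <- rpois(n, lambda)

# A main-effects-only Poisson frequency GLM misses the joint effect:
m_main <- glm(nclaims ~ age + power + offset(log(exposure)),
              family = poisson(link = "log"), data = motor)

# The interaction must be specified by hand, e.g. via a constructed flag:
motor$senior_highpower <- as.integer(motor$age >= 60 & motor$power >= 50)
m_int <- glm(nclaims ~ age + power + senior_highpower + offset(log(exposure)),
             family = poisson(link = "log"), data = motor)

anova(m_main, m_int, test = "Chisq")  # the added term improves the fit significantly
```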
Interactions and GLM
• The following example is taken from the Reacfin paper listed under Literature.
[Figure: simulated motor claim data with an interaction between driver age and engine power]

Interactions and GLM
• In this example there is an interaction of age and engine power: Age ≥ 60 and Engine Power ≥ 50.
• But if this effect is not noticed and included in the model, the GLM fit is poor.

Interactions and GLM
[Figure: GLM fit without the interaction term – poor fit in the affected segment]

Tree-based methods
• But many machine learning algorithms can capture these effects automatically.
• Let us take gradient boosting trees, for example.
• How does this algorithm work? Let us present some basics.

Tree-based methods
• Tree-based methods partition the feature space into a set of rectangles and then fit a simple model (typically a constant) in each region.
• Consider a regression problem with continuous response Y and continuous regressors X1, X2 ∈ (0, 1).
• For example, the following partition is simple, but cannot be obtained by recursive binary splitting, i.e. cannot be represented by a tree.
[Figure: a rectangular partition of the feature space that no recursive binary splitting can produce]

Tree-based methods
▪ So, let us restrict our attention to recursive binary partitions, like this one:
[Figure: a recursive binary partition of the feature space and the corresponding tree]
▪ First split the space into two regions and model the response by the mean of Y in each region. Choose the split variable and split point to achieve the optimal split.
▪ Then one or both regions are split further in the same fashion, iteratively, until some stopping rule applies.

Tree-based methods
▪ The corresponding regression model predicts Y with a constant $c_m$ if the inputs X are in region $R_m$, i.e. $\hat{f}(X) = \sum_{m=1}^{M} c_m \, \mathbf{1}\{X \in R_m\}$.

Tree-based methods
These trees can now be used further for boosting. What is boosting?

Boosting
• Gradient boosting is one of the most powerful techniques for building predictive models. It has proven successful in many areas and is one of the leading methods for winning Kaggle competitions (https://kaggle.com/).

Boosting
• In general, models can be fitted to data individually or combined in an ensemble – a combination of simple individual models (usually trees) that together create a more powerful model.
• Boosting is a method that builds the model in a stage-wise fashion.
• It starts by fitting an initial model.
• The second model then focuses on accurately predicting the cases where the first model performed badly.
• The third model focuses on correcting the faults of the previous stage, and so on.

Boosting
• Here we do not fit one big decision tree to the data, because this can easily lead to overfitting.
• Instead, the boosting algorithm learns slowly.
• At each step we fit a decision tree to the residuals of the previous model.
• This new tree is then added to the model.

Boosting
[Figures: a toy data set and the boosting fit improving over successive iterations]

Boosting
Usually the trees are rather small, but they should be deep enough to capture interactions.
• Number of splits = 2 is already enough to catch first-order interactions.
• There are several parameters that need to be chosen: the number of trees, the number of splits in each tree, and the learning rate of the algorithm (usually 0.1 or 0.01).
• For the number of trees, cross-validation is used, as the sketch below illustrates.
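A minimal sketch with the R package gbm, on simulated data with a built-in interaction (all parameter values are illustrative, not recommendations):

```r
library(gbm)
set.seed(42)
# Toy data: the signal is a pure interaction of x1 and x2.
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * (d$x1 > 0.6 & d$x2 > 0.4) + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = d,
           distribution      = "gaussian",
           n.trees           = 2000,  # upper bound; CV picks the actual number below
           interaction.depth = 2,     # number of splits: 2 catches first-order interactions
           shrinkage         = 0.01,  # learning rate
           cv.folds          = 5)

best <- gbm.perf(fit, method = "cv")  # CV-optimal number of trees
summary(fit, n.trees = best)          # relative variable importance
```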
Example
• Back to our example: remember that the GLM could not 'recognize' the interaction between age and engine power.
• But GBMs do, provided that the tuning parameters have been carefully selected.

Example
[Figure: GBM fit capturing the age/engine power interaction – SUCCESS!]

GLM vs Machine Learning
• The problem with these kinds of algorithms is that the interpretation is almost completely lost.
• It is very unlikely that such models will be approved by regulators, at least in the majority of countries.
• And even if they are, the insurance company runs the risk of reputational loss, in case some of the ethical problems discussed before emerge.

GLM vs Machine Learning
• Also, actuaries want to understand their models and not use black-box alternatives.
• So, GLMs will probably not be replaced by machine learning algorithms in the near future.
• But machine learning can assist actuaries in spotting interactions, as well as in determining variable importance or performing clustering tasks.

Variable importance
[Figure: relative variable importance as reported by a boosted tree model]

Clustering
• Examples of clustering are brand, region or business-activity clustering.
• Here the black-box nature of the models is not so important, because the model results can usually be validated easily.

Literature
▪ An Introduction to Statistical Learning with Applications in R – Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
▪ https://www.reacfin.com/wp-content/uploads/2016/12/20170914-Machine-Learning-applications-for-non-life-pricing.pdf
▪ https://www.stat.cmu.edu/~ryantibs/datamining/lectures/17-modr2.pdf

Appendix (the white slides)

Subset Selection
1. Best subset selection: for a linear model with p predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of Y is used as a predictor.
➢ For k = 1, 2, …, p:
1. Fit all $\binom{p}{k}$ models that contain exactly k predictors.
2. Pick the best among these $\binom{p}{k}$ models and call it $M_k$, i.e. choose the model with the largest R².
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen. https://en.wikipedia.org/wiki/Coefficient_of_determination

Subset Selection
• This method is conceptually very simple to understand.
• The problem? Too many models to fit!
• How many? $2^p$ models.
• For example: for p = 30, there are 1 073 741 824 models to fit!
• So, we need another solution.

Subset Selection
2. Stepwise selection
➢ Forward
➢ Backward
Forward stepwise selection
• A computationally efficient alternative to best subset selection.
• Here we begin with the null model and add predictors one at a time until we get the full model (or some stopping rule applies).
• Then we choose among these models using cross-validation, AIC, BIC, etc.

Subset Selection
More formally:
Forward stepwise selection: for a linear model with p predictors do
➢ Let $M_0$ be the null model with zero regressors, i.e. the sample mean of Y is used as a predictor.
➢ For k = 0, 1, …, p − 1:
1. Consider all p − k models that add one additional predictor to the model $M_k$.
2. Pick the best among these p − k models and call it $M_{k+1}$, i.e. choose the model with the largest R².
➢ Select the best model from $M_0, M_1, \dots, M_p$ using cross-validation, AIC, BIC, etc.
➢ Note: here you cannot use R², because then the largest model would always be chosen.
A small example with the leaps package is sketched below.
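A minimal sketch with the R package leaps, on simulated data (all names illustrative): regsubsets performs the search, and BIC then picks among the resulting models $M_0, M_1, \dots, M_p$:

```r
library(leaps)
set.seed(1)
# Simulated data: only 3 of the 10 regressors actually matter.
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- X[, 1] - 2 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
d <- data.frame(y, X)

# method = "exhaustive" would give best subset; "forward" is the cheap alternative.
fwd <- regsubsets(y ~ ., data = d, nvmax = p, method = "forward")
s   <- summary(fwd)
k   <- which.min(s$bic)  # model size chosen by BIC
coef(fwd, k)             # coefficients of the selected model M_k
```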
Subset Selection
• Here we fit only $1 + \sum_{k=0}^{p-1}(p-k) = 1 + \frac{p(p+1)}{2}$ models.
• For example: for p = 30, there are 466 models to fit. Much better than before.
• This procedure works well in practice, but now there is no guarantee that we will select the best model overall.
Backward stepwise selection
• Similar: here you start with the full model and delete regressors one at a time.

Example: Prostate cancer
• The data come from a study that examined the correlation between the level of prostate-specific antigen (response variable) and a number of clinical measures (regressors) in men who were about to receive a radical prostatectomy.
• It is a data frame with 97 rows and 9 columns.

Example: Prostate cancer
[Figure: overview of the prostate data set]

Example: Prostate cancer
The R package leaps is used to select the best model (based on R²) of each size.

Example: Prostate cancer
• Then AIC and BIC are calculated for each of these models, based on the formula for linear regression with normal errors.
[Figure: AIC and BIC as functions of model size for the prostate data]

Preview
We have still to see:
• Some other methods that do model selection for linear models
• How to deal with correlations
• How to deal with the p > n case

PCA

Principal Component Regression
• PCA uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
• This transformation is defined in such a way that the first principal component has the largest variance, the second principal component the second largest, etc.

Principal Component Regression
▪ This way a dimension reduction can be performed, and consequently OLS can be fitted using the newly obtained regressors.
▪ One can show that this reduces the variance of the OLS estimator.
▪ But there are interpretability issues!

Shrinkage Methods

Shrinkage Methods
▪ We have already mentioned that if p is relatively large compared to n, or if some regressors are highly correlated, then the OLS estimates can be very variable and therefore unstable.
▪ Also, we cannot do OLS for p > n.
▪ In order to tackle these problems, shrinking the regression coefficients is helpful.

Shrinkage Methods
▪ Idea: introduce some bias, but decrease the variance significantly.
▪ This is done by adjusting our minimization problem:
▪ OLS: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2$
▪ Ridge: $\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
▪ Lasso: $\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i'\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

Shrinkage Methods
• So, Ridge and Lasso are actually classes of estimators, since they depend on λ.
• How to choose the right λ? Cross-validation!

Shrinkage Methods – geometrical interpretation
[Figure: the classical picture of the L1 (diamond) and L2 (disc) constraint regions with the elliptical OLS contours]

Model selection
• The Ridge estimator will almost surely not set any estimated coefficients to zero, because of its L2 geometry.
• On the other hand, that is exactly what happens with Lasso estimates, because of the L1 norm.
• The larger the λ, the more coefficients are set to 0.
• So, Lasso performs model selection and estimation at the same time. A small example with glmnet follows.
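A minimal sketch with the R package glmnet, on simulated data (alpha = 0 gives Ridge, alpha = 1 gives Lasso; cv.glmnet chooses λ by cross-validation):

```r
library(glmnet)
set.seed(1)
# Sparse setting with p > n: only the first 5 coefficients are nonzero.
n <- 80; p <- 120
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:5] %*% rep(1, 5) + rnorm(n)

cv_lasso <- cv.glmnet(X, y, alpha = 1)        # lasso; alpha = 0 would be ridge
beta_hat <- coef(cv_lasso, s = "lambda.min")  # coefficients at the CV-chosen lambda
sum(beta_hat != 0)  # only a handful of nonzero coefficients survive
```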
Example – Prostate data
The more you increase λ, the smaller the estimated coefficients become.
[Figure: Ridge estimated coefficients for the prostate data, plotted against λ]

Example – Prostate data
The more you increase λ, the smaller the estimated coefficients become.
[Figure: Lasso estimated coefficients for the prostate data – here they are set to 0 for large λ]
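Coefficient paths like the two prostate plots above can be reproduced for any data set with glmnet's built-in plot method; a short sketch on simulated data (all values illustrative):

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(80 * 20), 80, 20)
y <- X[, 1:5] %*% rep(1, 5) + rnorm(80)

par(mfrow = c(1, 2))
plot(glmnet(X, y, alpha = 0), xvar = "lambda")  # Ridge: smooth shrinkage, no exact zeros
plot(glmnet(X, y, alpha = 1), xvar = "lambda")  # Lasso: coefficients hit exactly zero
```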