Data Analytics for Non-Life Insurance Pricing
- Lecture Notes -
Mario V. Wuthrich RiskLab Switzerland Department of Mathematics ETH Zurich
Christoph Buser AXA
versicherungen
Version October 27, 2021
Electronic copy available at: https://ssrn.com/abstract=2870308
2
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Preface and Terms of Use
Lecture notes. These notes are the basis of our lecture on Data Analytics for Non-Life Insurance Pricing at ETH Zurich. They aim at giving insight and education to practitioners, academics and students in actuarial science on how to apply machine learning methods for statistical learning. There is some overlap in these notes with our book Statistical Foundations of Actuarial Learning and its Applications which treats the topic of deep learning in much more depth, see [141]
Prerequisites. The prerequisites for these lecture notes are a solid education in mathematics, in particular, in probability theory and statistics. Moreover, knowledge in statistical software such as R is required.
Terms of Use. These lecture notes are an ongoing project which is continuously revised and updated. Of course, there may be errors in the notes and there is always room for improvement. Therefore, we appreciate any comment and/or correction that readers may have. However, we would like you to respect the following rules:
• These notes are provided solely for educational, personal and non-commercial use. Any commercial use or reproduction is forbidden.
• All rights remain with the authors. They may update the manuscript or withdraw the manuscript at any time. There is no right of availability of any (old) version of these notes. The authors may also change these terms of use at any time.
• The authors disclaims all warranties, including but not limited to the use or the contents of these notes. On using these notes, you fully agree to this.
• Citation: please use the SSRN URL https://ssrn. com/abstract=2870308
• All included figures were produced by the authors with the open source software R.
Versions of the 1st edition of these notes (before 2019):
November 15, 2016; January 27, 2017; March 28, 2017; October 25, 2017; June 11, 2018
Versions of the 2nd edition of these notes (after 2019):
February 5, 2019; June 4, 2019; September 10, 2020
3
Electronic copy available at: https://ssrn.com/abstract=2870308
4
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Acknowledgment
Firstly, we thank RiskLab and the Department of Mathematics of ETH Zurich, who have always strongly supported this project. A special thank you goes to Peter Reinhard (AXA Insurance, Winterthur) who has initiated this project by his very supportive and forward-looking anticipation.
Secondly, we kindly thank Patrick Zochbauer who has written his MSc ETH Mathematics thesis under our supervision at AXA-Winterthur on "Data Science in Non-Life Insurance Pricing: Predicting Claims Frequencies using Tree-Based Models". This MSc thesis has been a great stimulus for these lecture notes and it has also supported us to get more easily into this topic. This thesis is awarded the Walter Saxer Insurance Price 2016.
Next we thank the Institute of Statistical Mathematics (ISM), Tokyo, and, in particular, Prof. Tomoko Matsui (ISM) and Prof. Gareth Peters (University College London, UCL, and Heriot-Watt University, Edinburgh) for their support. Part of these notes were written while MVW was visiting ISM.
We kindly thank Philippe Deprez, John Ery and Andrea Gabrielli who have been carefully reading preliminary versions of these notes. This has helped us to substantially improve the outline.
Finally, we thank many colleagues and students for very fruitful discussions, providing data and calculating examples. We mention in particular: Michel Baes, Peter Blum, Hans Buhlmann, Peter Buhlmann, Patrick Cheridito, Philippe Deprez, Paul Embrechts, Andrea Ferrario, Andrea Gabrielli, Guojun Gan, Guangyuan Gao, Donatien Hainaut, Christian Lorentzen, Nicolai Meinshausen, Michael Merz, Alexander Noll, Gareth Peters, Ronald Richman, Ulrich Riegel, Robert Salzmann, Jiirg Schelldorfer, Pavel Shevchenko, Olivier Steiger, Qikun Xiang, Xian Xu.
Zurich, October 27, 2021 Mario V. Wiithrich & Christoph Buser
5
Electronic copy available at: https://ssrn.com/abstract=2870308
6
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Contents
1 Introduction to Non-Life Insurance Pricing 11
1.1 Introduction................................... 11
1.2 The compound Poisson model......................... 13
1.2.1 Model assumptions and first properties ............... 13
1.2.2 Maximum likelihood estimation: homogeneous case......... 16
1.2.3 Poisson deviance loss.......................... 17
1.3 Prediction uncertainty............................. 19
1.3.1 Generalization loss........................... 19
1.3.2 Cross-validation on test samples ................... 23
1.3.3 Leave-one-out cross-validation..................... 24
1.3.4 if-fold cross-validation......................... 24
1.3.5 Stratified if-fold cross-validation................... 25
1.4 Example: homogeneous Poisson model.................... 25
2 Generalized Linear Models 29
2.1 Heterogeneous Poisson claims frequency model............... 29
2.2 Multiplicative regression model........................ 31
2.3 Deviance residuals and parameter reduction................. 34
2.4 Example in car insurance pricing....................... 36
2.4.1 Pre-processing features: categorical feature components...... 36
2.4.2 Pre-processing features: continuous feature components...... 39
2.4.3 Data compression ........................... 45
2.4.4 Issue about low frequencies...................... 47
2.4.5 Models GLM3+ considering all feature components......... 49
2.4.6 Generalized linear models: summary................. 53
2.5 Classification problem............................. 54
2.5.1 Classification of random binary outcomes.............. 54
2.5.2 Logistic regression classification.................... 55
2.6 Maximum likelihood estimation........................ 59
3 Generalized Additive Models 61
3.1   Generalized additive models for Poisson regressions............. 61
3.1.1 Natural cubic splines.......................... 62
3.1.2 Example in motor insurance pricing, revisited............ 66
3.1.3 Generalized additive models: summary................ 74
7
Electronic copy available at: https://ssrn.com/abstract=2870308
8 Contents
3.2   Multivariate adaptive regression splines ................... 74
4 Credibility Theory 77
4.1 The Poisson-gamma model for claims counts ................ 77
4.1.1 Credibility formula........................... 77
4.1.2 Maximum a posteriori estimator................... 80
4.1.3 Example in motor insurance pricing................. 80
4.2 The binomial-beta model for classification.................. 82
4.2.1 Credibility formula........................... 82
4.2.2 Maximum a posteriori estimator................... 83
4.3 Regularization and Bayesian MAP estimators................ 83
4.3.1 Bayesian posterior parameter estimator............... 83
4.3.2 Ridge and LASSO regularization................... 84
4.4 Markov chain Monte Carlo method...................... 87
4.4.1 Metropolis-Hastings algorithm.................... 88
4.4.2 Gibbs sampling............................. 90
4.4.3 Hybrid Monte Carlo algorithm.................... 90
4.4.4 Metropolis-adjusted Langevin algorithm............... 92
4.4.5 Example in Markov chain Monte Carlo simulation......... 93
4.4.6 Markov chain Monte Carlo methods: summary........... 96
4.5 Proofs of Section 4.4.............................. 99
5 Neural Networks 101
5.1 Feed-forward neural networks......................... 101
5.1.1 Generic feed-forward neural network construction.......... 101
5.1.2 Shallow feed-forward neural networks ................ 105
5.1.3 Deep feed-forward neural networks.................. 115
5.1.4 Combined actuarial neural network approach............ 127
5.1.5 The balance property in neural networks............... 132
5.1.6 Network ensemble........................... 133
5.2 Gaussian random fields............................. 137
5.2.1    Gaussian Bayesian neural network.................. 137
■5.2.2   Infinite Gaussian Bayesian neural network.............. 138
5.2.3 Bayesian inference for Gaussian random field priors ........ 139
5.2.4 Predictive distribution for Gaussian random field priors...... 140
5.2.5 Step function activation........................ 141
6 Classification and Regression Trees 145
6.1   Binary Poisson regression trees........................ 145
6.1.1 Binary trees and binary indexes.................... 146
6.1.2 Pre-processing features: standardized binary splits......... 147
6.1.3 Goodness of split............................ 147
6.1.4 Standardized binary split tree growing algorithm.......... 151
6.1.5 Example in motor insurance pricing, revisited............ 154
6.1.6 Choice of categorical classes...................... 155
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Contents 9
6.2 Tree pruning ............ ...................... 157
6.2.1 Binary trees and pruning ........ ............... 157
6.2.2 Minimal cost-complexity pruning................... 158
6.2.3 Choice of the best pruned tree .................... 164
6.2.4 Example in motor insurance pricing, revisited............ 166
6.3 Binary tree classification............................ 169
6.3.1 Empirical probabilities......................... 169
6.3.2 Standardized binary split tree growing algorithm for classification 170
6.4 Proofs of pruning results............................ 175
7 Ensemble Learning Methods 177
7.1 Bootstrap simulation.............................. 177
7.1.1 Non-parametric bootstrap....................... 177
7.1.2 Parametric bootstrap ......................... 179
7.2 Bagging..................................... 179
7.2.1 Aggregating............................... 180
7.2.2 Bagging for Poisson regression trees................. 181
7.3 Random forests................................. 183
7.4 Boosting machines............................... 187
7.4.1 Generic gradient boosting machine.................. 187
7.4.2 Poisson regression tree boosting machine............... 190
7.4.3 Example in motor insurance pricing, revisited............ 195
7.4.4 AdaBoost algorithm.......................... 198
8 Telematics Car Driving Data 201
8.1 Description of telematics car driving data.................. 202
8.1.1 Simple empirical statistics....................... 202
8.1.2 The velocity-acceleration heatmap.................. 206
8.2 Cluster analysis................................. 208
8.2.1 Dissimilarity function......................... 208
8.2.2 Classifier and clustering........................ 210
8.2.3 K-means clustering algorithm..................... 212
8.2.4 Example................................. 213
8.3 Principal component analysis......................... 215
A Motor Insurance Data 221
A.l  Synthetic data generation........................... 221
A.2  Descriptive analysis .............................. 225
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
10
Contents
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1
Introduction to Non-Life Insurance Pricing
1.1 Introduction
The non-life insurance section ASTIN (Actuarial STudies In Non-life insurance) of the International Actuarial Association (IAA) has launched a working party on big data and data analytics in 2014. The resulting document [4] states in its introduction:
"The internet has changed and continues to change the way we engage with the tangible world. The resulting change in communication platforms, commercial trade and social interactions make the world smaller and the data bigger. Whilst the fundamentals of analyzing data have not changed, our approach to collating and understanding data, creating accessible and useful information, developing skill sets and ultimately transforming huge and ever-growing repositories of data into actionable insights for our employers, shareholders and our communities more generally has entered a new paradigm.
As a corollary, this has made the fundamentals of data processing, modeling techniques and aligning business structures accordingly more important - no longer can existing approaches suffice - we now face the pressures of a faster moving world, blurring business lines which make customer-centricity surpass a siloed product/service focus and challenges to the actuary being central to predictive modeling.
We too need to evolve, working intelligently with a wide cross-function of skill sets such as data experts, data scientists, economists, statisticians, mathematicians, computer scientists and, so as to improve the transformation of data to information to customer insights, behavioral experts. This is a necessary evolution for actuaries and the profession to remain relevant in a high-tech business world."
The goal of these lecture notes is to face this paradigm. We aim at giving a broad toolbox to the actuarial profession so that they can cope with these challenges, and so that they remain highly competitive in their field of competence. In these lecture notes we start from the classical actuarial world of generalized linear models, generalized additive
11
Electronic copy available at: https://ssrn.com/abstract=2870308
12
Chapter 1.  Introduction to Non-Life Insurance Pricing
models and credibility theory. Lhese models form the basis of the deeper statistical understanding. We then present several machine learning techniques such as neural networks, regression trees, bagging techniques, random forests and boosting machines. One can view these machine learning techniques as non-parametric and semi-parametric statistics approaches, with the recurrent goal of optimizing a given objective function to receive optimal predictive power. In a common understanding we would like to see these machine learning methods as an extension of classical actuarial models, and we are going to illustrate how classical actuarial methods can be embedded into these non-parametric machine learning approaches, benefiting from both the actuarial world and the machine learning world. Lhese lecture notes have also served as a preliminary version to our book [141] which treats the topic of statistical modeling and deep learning in much more depth, and the reader will notice that there is some (unavoidable) overlap between these lecture notes and our book [141].
A second family of methods that we are going to meet in these lecture notes are so-called unsupervised learning methods (clustering methods). Lhese methods aim at finding common structure in data to cluster these data. We provide an example of unsupervised learning by analyzing telematics car driving data which poses the challenge of selecting feature information from high frequency data. Lor a broader consideration of unsupervised learning methods we refer to our tutorial [107].
We close this short introductory section by briefly reviewing major developments identified in the China InsurLech Development White Paper [27]. Lhe notion of Insurance Lechnology (InsurLech) is closely related to Linancial Lechnology (LinLech), and it is a commonly used term for technology and innovation in the insurance industry. Among other things it compromises the following key points:
• artificial intelligence, machine learning and statistical learning, which may learn and accumulate useful knowledge through data;
• big data analytics, which deals with fact that data may be massive;
• cloud computing, which may be the art of performing real-time operations;
• block chain technology, that may be useful for a more efficient and anonymous exchange of data;
• internet of things, which involves the integration and interaction of physical devices (e.g. wearables, sensors) through computer systems to reduce and manage risk.
Lhe actuarial profession has started several initiatives in data analytics and machine learning to cope with these challenges. We mention the working party "Data Science" of the Swiss Association of Actuaries (SAV) that aims at building a toolbox for the actuarial profession:
https : //www. actuarialdatascien.ee. org/ Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
13
The SAV working party has prepared several tutorials on actuarial developments in machine learning. These tutorials are supported by computer code that can be downloaded from GitHub.1 In the present lecture notes we work with synthetic data for which we exactly know the true data generating mechanism. This has the advantage that we can explicitly back-test the quality of all our models presented. In typical real world applications this is not the case and one needs to fully rely on estimated models. This is demonstrated in the tutorials of the SAV working party which describe several statistical modeling techniques on real insurance data, see [43, 89, 101, 119, 120]. We highly recommend readers to perform similar analysis on their own real data, because one crucial pillar of statistical modeling is to derive the right intuition and understanding for the data and the models used.
1.2   The compound Poisson model
In this section we introduce the basic statistical approach for modeling non-life insurance claims. This approach splits the total claim amount into a compound sum which accounts for the number of claims and determines the individual claim sizes.
1.2.1   Model assumptions and first properties
The classical actuarial approach for non-life insurance portfolio modeling uses a compound random variable with N describing the number of claims that occur within a fixed time period and Z±,..., Zn describing the individual claim sizes. The total claim amount in that fixed time period is then given by
N
S = Z\ + ... + Zn = ^2 Zk.
k=l
The main task in non-life insurance modeling is to understand the structure of such total claim amount models. The standard approach uses a compound distribution for S.
Throughout, we assume to work on a sufficiently rich probability space (Q, J-, P).
Model Assumptions 1.1 (compound distribution). The total claim amount S is given by the following compound random variable on (Q,J-,¥)
N
S = Z\ + ... + Zn = ^2 Zk,
fc=i
with the three standard assumptions:
(1) N is a discrete random variable which takes values in No;
(2) Zi, Z2, ■ ■ ■ are independent and identically distributed (i.i.d.);
(3) N and (Z±, Z%,. . .) are independent.
1https://github.com/JSchelldorfer/ActuarialDataScience
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
14
Chapter 1.  Introduction to Non-Life Insurance Pricing
Remarks.
• If S satisfies the three standard assumptions (l)-(3) of Model Assumptions 1.1 we say that S has a compound distribution. N is called claims count and Z^, k > 1, are the individual claim sizes or claim severities.
• The compound distribution has been studied extensively in the actuarial literature. The aim here is to give more structure to the problem so that it can be used for answering complex pricing questions on heterogeneous insurance portfolios.
In the sequel we assume that the claims count random variable N can be modeled by a Poisson distribution. We therefore assume that there exist an expected (claims) frequency A > 0 and a volume v > 0.
We say that N has a Poisson distribution, write N ~ Poi(Av), if
(Xv)k
¥[N = k] = e~Xv for all k e N0.
k\
The volume v > 0 often measures the time exposure in yearly units. Therefore, throughout these notes, the volume v is called years at risk. If a risk is insured part of the year, say, 3 months we set v = 1/4.
For a random variable Z we denote its coefficient of variation by Vco(Z) = Var(Z)1/2/E[Z] (subject to existence). We have the following lemma for the Poisson distribution.
Lemma 1.2. Assume N ~ Poi(Aw) for fixed X,v > 0. Then
E[N] =Xv = Var(iV)        and       Vco(iV) = '  = \Jl/^v       0 as v ^ oo.
Proof. See Proposition 2.8 in Wiithrich [135]. □
Lemma 1.3. Assume that Ni, i = 1,.. ., n, are independent and Poisson distributed with means XiVi > 0. We have
n / n \
N = yJNl - Poi  X>«u« • i=i \i=i )
Proof. This easily follows by considering the corresponding moment generating functions and using the independence assumption, for details see Wiithrich [135], Chapter 2. □
Definition 1.4 (compound Poisson model). The total claim amount S has a compound Poisson distribution, write
S - CompPoi(Ai;,G),
if S has a compound distribution with N ~ Poi(Aw) for given X,v > 0 and individual claim size distribution Z\ ~ G.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
15
Proposition 1.5. Assume S ~ CompPoi(Av, G). We have, whenever they exist,
E[S] = Xv E[Zi],      Var(S) = Xv E[Z?1,      Vco(S) = a/Vco(Zi)2 + 1.
V Xv v
Proof. See Proposition 2.11 in Wuthrich [135]. □ Remarks.
• If S has a compound Poisson distribution with fixed expected frequency A > 0 and fixed claim size distribution G having finite second moment, then the coefficient of variation converges to zero at speed v-1/2 as the years at risk v increase to infinity. In industry, this property is often called diversification property.
• The compound Poisson distribution has the so-called aggregation property and the disjoint decomposition property. These are two extremely beautiful and useful properties which explain part of the popularity of the compound Poisson model, we refer to Theorems 2.12 and 2.14 in Wuthrich [135] for more details. The aggregation property for the Poisson distribution has already been stated in Lemma 1.3. It tells us that we can aggregate independent Poisson distributed random variables and we stay within the family of Poisson distributions.
• The years at risk v > 0 may have different interpretation in the compound Poisson context: either (i) we may consider a single risk which is insured over v accounting years, or (ii) we have a portfolio of independent compound Poisson risks and then v measures the volume of the aggregated portfolio (in years at risk). The latter uses the aggregation property which says that also the aggregated portfolio has a compound Poisson distribution (with volume weighted expected frequency), see Theorem 2.12 in Wuthrich [135].
One crucial property of compound distributions is that we have the following decomposition of their expected values
E[S] = E[N] E[Zi] = Xv E[Zi],
where for the second identity we have used the Poisson model assumption, see Proposition 1.5, and where we assume S £ L1(J7,Jr, P). This implies that for the modeling of the pure risk premium E[S] we can treat the (expected) number of claims E[N] and the (expected) individual claim sizes E[Zi] separately in a compound model. We make the following restriction here:
In the present notes, for simplicity, we mainly focus on the modeling of the claims count N, and we only give broad indication about the modeling of Z^. We do this to not overload these notes.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
16
Chapter 1.  Introduction to Non-Life Insurance Pricing
In many situations it is more appropriate to consider volume scaled quantities claims count N we may consider the claims frequency defined by
Y=™.
V
In the Poisson model, the claims frequency Y has the following two properties
E[Y] = A       and       Var(F) = X/v.
From this we see that confidence bounds for frequencies based on standard deviations Var(y)1/2 = \[Xjv get more narrow with increasing years at risk v > 0. This is the reason why the equal balance (diversification) concept on a portfolio level works. This is going to be crucial in the subsequent chapters where we aim at detecting structural differences between different risks in terms of expected frequencies.
Anticipatory, one often rewrites (1.1) as follows
Y = \ + e,
where s is centered with variance X/v. In this form, A describes the structural behavior of Y and s is understood as the noise term that describes the random deviation/fluctuation around the structural term when running the experiment. In all what follows, we try to separate the structural behavior from the random noise term, so that we are able to quantify the price level (pure risk premium) of individual insurance policies.
1.2.2   Maximum likelihood estimation: homogeneous case
Assume that N has a Poisson distribution with volume v > 0 and expected frequency A > 0, that is, N ~ Poi(Aw). For predicting this random variable N, we would typically like to use its expected value E[iV] = Xv as predictor because this minimizes the mean square error of prediction (MSEP); we highlight this in more detail in Section 1.3, below. However, this predictor is only useful if the parameters are known. Typically, we know the volume v > 0 but, in general, we do not assume knowledge of the true expected frequency A. Hence, we need to estimate this latter parameter.
This estimation task is solved by assuming that one has a family of independent random variables N±,..., Nn with Ni ~ Poi(Awi) for all i = 1,..., n. Note that for the time being we assume a homogeneous portfolio in the sense that all random variables iVj are assumed to have the same expected frequency A > 0; this is going to be relaxed in the subsequent chapters. Based on this model assumption one aims at estimating the common frequency parameter A from given observations AT" = (N±,.. . ,Nn)'. The joint log-likelihood function of these observations is given by
n
A      tN{X) = yj-^l + Nl\og{Xvl)-\og{Nl\).
i=i
The parameter estimation problem is then commonly solved by calculating the maximum likelihood estimator (MLE), which is the parameter value A that maximizes the log-likelihood function, i.e., from the class of homogeneous Poisson models the one is selected
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
17
that assigns the highest probabibility to the actual observation AT". Thus, the MLE A for A is found by the solution of
|^v(A) = 0. (1-2)
This solution is given by
-     V™ N-
A = ^^>0. (1.3)
Note that -§^tN{\) < 0 for any A > 0.2 The MLE is unbiased for A, i.e. E[A] = A, and its variance is given by
Var(A) A
which converges to zero as the denominator goes to infinity. This allows us to quantify the parameter estimation uncertainty by the corresponding standard deviation. This is going to be important in the analysis of heterogeneous portfolios, below.
1.2.3   Poisson deviance loss
Instead of maximizing the log-likelihood function we could also try to minimize an appropriate objective function. The canonical objective function under our model assumptions is the Poisson deviance loss. We define the maximal log-likelihood (which is received from the saturated model)
n
£N(N) = y,~Nl + NilogNi - log(AM). (1.4)
i=i
The saturated model is obtained by letting each observation Ni have its own parameter Aj = ¥,[Nj\/vi. These individual parameters are estimated by their corresponding individual MLEs Aj = Ni/vi, that is, each policy i receives its own individual MLE parameter. We set the i-th term on the right-hand side of (1.4) equal to zero if Ni = 0 (we use this terminology throughout these notes).
The (scaled) Poisson deviance loss for expected frequency A > 0 is defined by D*(N, A)   = 2(£N(N)-£N(X))
=   2       ~Ni + Ni lo§ Ni + A^ " Ni lo§ (^i)
i=l
£2 iV,
i=i
A^ , . (\vl --1 — log —
> 0, (1.5)
where the i-th term in (1.5) is set equal to 2Xvi for Ni = 0. Remarks.
• Each term under the summation in (1.5) is non-negative because the saturated model maximizes each of these terms individually (by choosing the individual MLEs
2If A = Vi = 0' which happens with positive probability in the Poisson model, we get a
degenerate model.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
18
Chapter 1.  Introduction to Non-Life Insurance Pricing
Aj = Ni/vi). Lherefore, D*(N,X) > 0 for any A > 0. Note that we even receive non-negativity in (1.5) if we choose arbitrary policy-dependent expected frequencies Aj > 0 instead of the homogeneous expected frequency parameter A.3
• Maximizing the log-likelihood function £]\[(X) in A is equivalent to minimizing the deviance loss function D*(N,X) in A.
• Lhe deviance loss can be generalized to the exponential dispersion family which contains many interesting distributions such as the Gaussian, the Poisson and the gamma models. We refer to Section 4.1.2 in [141].
We close this section by analyzing the expected deviance loss
E[D*(N,X)} = 2E[Xv - N - Nlog(Xv/N)} = 2E [N \og(N/Xv)] , of a Poisson distributed random variable N with expected value E[iV] = Xv > 0.
Ligure 1.1: Expected deviance loss E[D*(N, A)] = 2E[iVlog(iV/Av)] of a claims count N ~ Poi(Aw) as a function of its expected value E[iV] = Xv on two different scales; these plots are obtained by Monte Carlo simulation, the blue dot shows mean E[iV] = 5.16%,
the vertical cyan ties show means E[iV] = 5.16% ± 4.58%.
Ligure 1.1 illustrates this expected deviance loss on two different scales for the expected number of claims E[iV] = Xv. Lhe example which we are going to study in these notes has an average expected number of claims of Xv = 5.16%. Lhis gives an expected deviance loss of 30.9775 ■ 10-2 (blue dot in Ligure 1.1). Lhis is going to be important for the understanding of the remainder of these notes. Note that this value is bigger than 27.7278 ■ 10-2 given in (A.5) in the appendix. Lhis difference is explained by the fact that the latter value has been obtained on a heterogeneous portfolio. In this heterogeneous portfolio the individual policies have a standard deviation in expected numbers of claims of 4.58%, see Ligure A.l in the appendix. Lhe cyan vertical ties in Ligure 1.1
3Policy dependent expected frequency parameters Aj are going to be crucial in the subsequent chapters because the general goal of these notes is to price heterogeneous insurance portfolios.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
19
(rhs) provide two policies with expected numbers of claims of 5.16% ±4.58%. If we average their expected deviance losses we receive 26.3367-10-2 which corresponds to the cyan dot in Figure 1.1 (rhs).
1.3   Prediction uncertainty
In this section we quantify prediction uncertainty. This is mainly done in terms of the Poisson model and the Poisson deviance loss. For the general case within the exponential dispersion family and more insight we refer to Chapter 4 of [141], in particular, we mention that deviance losses are strictly consistent scoring functions for mean estimation which is important in forecast evaluation, see Gneiting [55].
1.3.1    Generalization loss
Assume we have n observations, called cases, given by
V={(N1,v1),...,(Nn,vn)}. (1.6)
T> stands for data. Typically, we do not know the true data generating mechanism of T>. Therefore, we make a model assumption: assume that all n cases (iVj, Vi) are independent and that Ni are Poisson distributed with expected frequency A > 0. We infer the expected frequency parameter A from this data T>. We denote the resulting estimator by A, for the moment this can by any (sensible) Independent estimator for A. We would like to analyze how well this estimator performs on cases which have not been seen during the estimation procedure of A. In machine learning this is called generalization of the estimated model to unseen data, and the resulting error is called generalization error, out-of-sample error or prediction error.
To analyze this generalization error we choose a new random variable N ~ Poi(Aw), which is independent from the data T> and which has the same expected frequency A. The frequency Y = N/v of this new random variable is predicted by
Y = E[Y] = X. (1.7)
Note that we deliberately write E because it is estimated from the data T>.
Proposition 1.6 (MSEP generalization loss). Assume that all cases in T> are independent and Poisson distributed having the same expected frequency A > 0. Moreover, let case (N,v) be independent ofT> and Poisson distributed with the same expected frequency A. We predict Y = N/v byY = X, where the D-based estimator A is assumed to be square integrable. This prediction has MSEP
= (E [Y] - E [f])2 + Var(F) + Var(F).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
E
Y-Y
20
Chapter 1.  Introduction to Non-Life Insurance Pricing
Proof of Proposition 1.6. The first step is done to bring the terms into the same order as in the statement of the proposition, in the second last step we use independence between T> and (N,v),
E
Y - Y
= E
E
Y -Y
= E
= E
[Y - E[Y]
(E [Y] - E |Yj + E |Yj - Y =    ( E [Y] - E \9] ) 2 + Var(Y) + Var(Y) This proves the proposition
Y -E[Y] +E[Y] - Yj + E [(E[Y] - Y)2] + 2E^Y- E[Y]) (E[Y] - Y)j + Var(Y)
Remarks 1.7 (generalization loss, part I).
• The first term on the right-hand side of the statement of Proposition 1.6 denotes the squared bias, the second term the estimation variance and the last term the pure randomness (process variance) involved in the prediction problem. In general, we try to minimize simultaneously the bias and the estimation variance in order to get accurate predictions. Usually, these two terms compete in the sense that a decrease in one of them typically leads to an increase in the other one. This phenomenon is known as the bias-variance trade-off for which one needs to find a good balance (typically by controlling the complexity of the model). This is crucial for heterogeneous portfolios and it is going to be the main topic of these notes. We also refer to Section 7.3 in Hastie et al. [62] for the bias-variance trade-off.
• If we are in the atypical situation of having a homogeneous (in A) Poisson portfolio and if we use the D-based MLE A, the situation becomes much more simple. In Section 1.2.2 we have seen that the MLE A is unbiased and we have determined its estimation variance. Henceforth, we receive in this special (simple) case for the MSEP generalization loss
E
Y — Y
E[Y]-E A     + Var(A) + Var(Y) = 0 +
A
A
E"=i Vi
We emphasize that this is an atypical situation because usually we do not assume to have a homogeneous portfolio and, in general, the MLE is not unbiased.
Proposition 1.6 considers the MSEP which implicitly implies that the weighted square loss function is the objective function of interest. However, in Section 1.2.2 we have been considering the Poisson deviance loss as objective function (to obtain the MLE) and, therefore, it is canonical to measure the generalization loss in terms of the out-of-sample Poisson deviance loss. Under the assumptions of Proposition 1.6 this means that we aim at studying
E[D*(N, A)] = 2E \\v - N - Nlog(\v/N)} .
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
21
Proposition 1.8 (Poisson deviance generalization loss). Assume that all cases in T> are independent and Poisson distributed having the same expected frequency A > 0. Moreover, let case (N, v) be independent of T> and Poisson distributed with the same expected frequency X. We predict N by Xv, where the D-based estimator X is assumed to be integrable and X > e, P-a.s., for some e > 0. This prediction has Poisson deviance generalization loss
E[D*(N, a)] = E[D*(N, a)] + S(X, a), with estimation loss defined by
S(X, A) = 2v (E [a] - A - AE [log (a/a)])   > 0.
Proof of Proposition 1.8. The assumptions on A imply loge < E[logA] < logE[A] < oo. We have E[D*(N,\)} =2E \^Xv - Xv + Xv - N - N\og(Xv/N) - N\og(X/X)^ .
The first claims follows by the independence between N and A. There remains the proof of the positivity of the estimation loss. Observe
A - A - A log (x/xj = X(x — 1 — log x) = Xg{x),
where we have denned x = A/A > 0, and the last identity defines the function g(x) = x — 1 — logrr. The claim now follows from log a; < x — 1 with that latter inequality being strict for i/l. □
Remarks 1.9 (generalization loss, part II).
• The estimation loss £(a,a) simultaneously quantifies the bias and the estimation volatility. Again, we try to make this term small by controlling the bias-variance trade-off.
• If we want to use the D-based MLE A of Section 1.2.2 for estimating A, we need to insure positivity, P-a.s., i.e. we need to exclude degenerate models. We set Xe = X+e for some e > 0. We receive Poisson deviance generalization loss for the MLE-based estimator Xe
E[D*(N,Xe)} = K[D*(N, a)] + 2ve - 2vXE [log (a£/a)] ,
with
£(Xe,X)   =   2ve - 2E[Xv\og(Xev/(Xv))} > 0, E[D*(N,X)}   =   -2E[Anog(Ai;/A0]  > 0.
Note that we may also use Xe = max{a, e} for guaranteeing positivity, P-a.s., but in this case the bias is more difficult to determine.
• In view of the Poisson deviance generalization loss we can determine the optimal estimator for a w.r.t. optimization criterion
a* = argmin E[D*(N,fi)] = argmin 2E[fiv - N - N\og(fiv/N)].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
22
Chapter 1.  Introduction to Non-Life Insurance Pricing
This minimizer is given by A* = E[iV]/v which implies that the Poisson deviance loss is strictly consistent for the expected value according to Gneiting [55]; strict consistency is a property needed to perform back-testing, we refer to Section 4.1.3 in [141].
We close this section by comparing the square loss function, the Poisson deviance loss function and the absolute value loss function.
Example 1.10. In this example we illustrate the three different loss functions:
L(Y,X,v = l)   =   (Y - A)2 ,
L(Y,X,v = l)   = 2[A-F-Flog(A/F)],
L(Y,X,v = l)   = \Y-X\,
the first one is the square loss function, the second one the Poisson deviance loss function and the third one the absolute value loss function.
Figure 1.2: Loss functions Y >—> L(Y,\,v = 1) with A = 5.16% for different scales on the a;-axis (and y-axis).
These three loss functions are plotted in Figure 1.2 for an expected frequency of A = 5.16%. From this plot we see that the pure randomness is (by far) the dominant term: the random variable Y = N (for v = 1) lives on the integer values No and, thus, every positive outcome Y > 1 substantially contributes to the loss L(-), see Figure 1.2 (middle). A misspecification of A only marginally contributes (for small A); this is exactly the class imbalance problem that makes model calibration of low frequency examples so difficult. Moreover, we see that the Poisson deviance loss reacts more sensitively than the square loss function for claims N G {1,..., 8} for our expected frequency choice of A = 5.16%. Formula (6.4) and Figure 6.2 in McCullagh-Nelder [93] propose the following approximation, we also refer to Figure 5.5 in [141],
L(Y, A, v = 1) « 9F1/3 (Y1/3 - A1/3)2 . Thus, Poisson deviance losses have a rather different behavior from square losses. ■ Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
23
1.3.2   Cross-validation on test samples
Assessment of the quality of an estimated model is done by analyzing the generalization loss, see Propositions 1.6 and 1.8. That is, one back-tests the estimated model on cases that have not been seen during the estimation procedure. However, the true generalization loss cannot be determined because typically the true data generating mechanism is not known. Therefore, one tries to assess the generalization loss empirically.
Denote by A i—> L(Y, A, v) a (non-negative) loss function, for examples see Example 1.10. The generalization loss for this loss function is defined by (subject to existence)
CL = E[L(Y,X,v)] ■
If the amount of data T> in (1.6) is very large, we are tempted to partition the data T> into a learning sample VB and a test sample VBC with B C {1,. .. ,n} labeling the cases VB = {(Ni,Vi); i G B} C T> considered for learning the model, and with Bc = {1,..., n} \ B labeling the remaining test cases.
Based on the learning sample VB we construct the predictor A8 for Y = N/v, see (1.7), and we use the test sample VBC to estimate the generalization loss empirically by
CT = E \l(yxv)] =^ E L(Y,XB,Vl). (1.8)
The upper index in C°^s indicates that we do an out-of-sample analysis (estimation) because the data for estimation and back-testing has been partitioned into a learning sample and a disjoint back-testing sample.
The out-of-sample Poisson deviance loss is analogously estimated by, assume
> 0. (1.9)
C°^ = nD*(N,X)] = j^r y,2n*
\BC,
XBv% (XBvl 1 — lot
In many situations it is too optimistic to assume that we can partition data T> into a learning sample and a test sample because often the volume of the data is not sufficiently big. A naive way to solve this problem is to use the whole data T> for learning the model and then back-testing this model on the same data. The model estimated from the whole data T> is denoted by A (assume A > 0).
The in-sample Poisson deviance loss is defined by
1 1 n
£% = -D*(N,X) = -y,2n-
i=i
Xv%
1 - loe
Xvi
(1.10)
i.e. this is exactly the empirical Poisson deviance loss of the estimated model.
This in-sample loss Dp is prone to over-fitting because it prefers more complex models that can follow observations more closely.   However, this smaller in-sample loss does
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
24
Chapter 1.  Introduction to Non-Life Insurance Pricing
not necessarily imply that a more detailed model generalizes better to unseen cases. Therefore, we need other evaluation methods. These are discussed next.
1.3.3 Leave-one-out cross-validation
Choose i G {1,... ,n} and consider partition Bi = {1,... ,n} \ {i} and B\ = {i}. This provides on every Bi a predictor for Y = N/v given by
def. ^
The leave-one-out cross-validation loss for loss function l(-) is defined by
i n
4r = - El^a^U).
n ^—f    V /
i=i
Leave-one-out cross-validation uses all data T> for learning: the data T> is split into a training set for (partial) learning and a validation set Z>W for an out-of-sample validation iteratively for all i G {1,... , n}. Often the leave-one-out cross-validation is computationally too expensive as it requires fitting n times the model which is for large insurance portfolios too exhaustive.
1.3.4 K-fold cross-validation
For K-fo\d cross-validation we choose an integer K > 2 and partition {1,..., n} randomly into K disjoint subsets B±,... ,Bk of roughly the same size. This provides for every k = 1,..., K a training set
= {(Nl,vl); i^Bk}cV, and the corresponding estimator for the expected frequency a
The K-fo\d cross-validation loss for loss function l(-) is defined by
n k=iieBk ^ k=i |Dfe| ieBk
Note that we use the whole data T> for learning. As for leave-one-out cross-validation we split this learning data into a training set X>(-8fe) for (partial) learning and a validation set V®k for out-of-sample validation. This is done for all k = 1,... ,K, and the out-of-sample (generalization) loss is then estimated by the resulting average cross-validation loss. K-iold cross-validation is the method typically used, and in many applications one chooses K = 10. We do not further elaborate on this choice here, but we refer to the related literature.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
25
1.3.5   Stratified K-io\d cross-validation
K-fo\d cross-validation partitions {1,... , n} into K disjoint random subsets B±,... ,Bk of approximately the same size. If there are outliers then these outliers may fall into the same subset Bk, and this may disturb K-fo\d cross-validation results Stratified K-fo\d cross-validation distributes outliers more equally across the partition. This is achieved by ordering the observations Yi = Ni/vi, i = 1,... ,n, that is, Y^ > Y(2) — • • • — Y(n), with a deterministic rule if there is more than one observation of the same size. Then we build urns Uj of size K for j = 1,..., \n/K~\ (the last urn U^n/x] may be smaller depending on the cardinality of n and K)
Uj = {Y(i); {j-l)K+l<i<jK}
Urn U\ receives the K largest observations, urn Hi contains the next K largest observations, etc. Then we define the partition (T>k)k=i,...,K °f    f°r k = 1,..., K by
T^k = j choose randomly from each urn tii,..., U\r^K\ one case (without replacement) |,
where "choose randomly (without replacement)" is meant in the sense that all urns are randomly distributed resulting in the partitioned data T>k, k = 1,. . ., K. K-fo\d cross-validation is now applied to the partition T>i,... ,T>k- Note that this does not necessarily provide the same result as the original K-fo\d cross-validation because in the stratified version it is impossible that the two largest outliers fall into the same set of the partition (supposed that they are bigger than the remaining observations).
Summary. To estimate the generalization loss we typically choose the (stratified) K-fo\d Poisson deviance cross-validation loss given by
71
k=ll£Bk
Xt-^Vi-Ni-Nilog
X^Vl Ni
> 0. (1.11)
Remark. Evaluation of the K-fo\d Poisson deviance cross-validation loss (1.11) requires that the (random) partition B\,..., Bk of V is chosen such that X^~Bk^ > 0 for all k = 1,..., K. This is guaranteed in stratified K-to\d cross-validation as soon as we have K observations with Ni > 0. For non-stratified K-fo\d cross-validation the situation is more complicated because the positivity constraint may fail with positive probability (if we have too many claims with Ni = 0).
1.4   Example: homogeneous Poisson model
Throughout these notes we consider a synthetic motor-third party liability (MTPL) car insurance portfolio. This portfolio consists of n = 500'000 car insurance policies having claims Ni information and years at risk information Vi G (0,1], for i = 1,. . . ,n. The simulation of the data is described in Appendix A, in particular, we refer to Listing A.2 which sketches this data V.4 In a first (homogeneous) statistical model we assume
Ni        Poi(Awi)       for i = 1,..., n, 4The data is available from https://people.math.ethz.ch/~wmario/Lecture/MTPL_data.csv
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
26
Chapter 1.  Introduction to Non-Life Insurance Pricing
and with given expected frequency A > 0. The MLE for A of this data T> is given by, see (1.3),
A
10.2691%.
1 "I
This should be compared to the true average frequency A* = 10.1991% given in (A.4). Thus, we have a small positive bias in our estimate. The in-sample Poisson deviance loss is given by C% = 29.1065 ■ 10"2.
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	/•cv	/•CV j~d	£(X,X*)	/•is Wd	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
Table 1.1: Poisson deviance losses of If-fold cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters.
Next, we calculate the 10-fold cross-validation losses (1.11). The results are presented in columns (non-stratified and stratified versions) of Table 1.1. We observe that this 10-fold cross-validation losses of 29.1066 ■ 10-2 and 29.1065 ■ 10-2, respectively, match the in-sample loss C1^ = 29.1065 ■ 10-2, i.e. we do not have any sign of over-fitting, here. The column 'run time' shows the total run time needed,5 and '# param.' gives the number of estimated model parameters.
Finally, we determine the estimation loss £(A, A*) w.r.t. the true model A*, see Appendix A and Proposition 1.8. We emphasize that we are in the special situation here of knowing the true model A* because we work with synthetic data. This is an atypical situation in practice and therefore we highlight all values in green color which can only be calculated because we know the true model. Using in-sample loss (1.10) we derive
C
D
E2^
i=i
1
D*(N,X*
\*(xi)vj _ \k(xi)vl
+ i(x,x*
Auj
1 - log
log
A
1 n
- ^2(A*(xlK-iVl)log
A
(1.12
i=l vA^) where the first term is the Poisson deviance loss w.r.t. the true model A*, see (A.5)
1
n
i     r /
-D*(N, A*) = - 2 Xk(xi)vi -N.-N, log -1 n i=i   L V
The second term in (1.12) is defined by
27.7278 ■ 10"2.
1 n
£(A,A*) = - j22u
i=i
A-A*(a:i)-A*(a:i)log
1.3439 ■ 10"2 > 0. (1.13)
5A11 run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99GHz with 16GB RAM.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 1.  Introduction to Non-Life Insurance Pricing
27
This is an estimate for the estimation loss £(A,A*) w.r.t. the true model A*.6 The last term in (1.12) refers to over-fitting, in average it disappear if the estimation A is independent from AT". In our case it takes value
1   n (    A \
-      2 " Ni) log ——   = 0.0348 ■ 10"2,
i.e. it is negligible w.r.t. the other terms.
We will use the estimated estimation loss £(X, A*) as a measure of accuracy in all models considered, below. Note that we have £(X, A*) = 0 if and only if A = A*(ajj) for all i = 1,..., n. Thus, obviously, if we have heterogeneity between the expected frequencies of the insurance policies, the estimation loss cannot be zero for a homogeneous model. ■
Summary. In practice, model assessment is done w.r.t. cross-validation losses, see Table 1.1. In our special situation of knowing the true model and the true expected frequency function A*(-), we will use the estimation loss £(A,A*) for model assessment. This also allows us to check whether we draw the right conclusions based on the cross-validation analysis.
In mathematical statistics, the estimation loss (1.13) is related to the risk stemming from a decision rule. In our situation the decision rule is the MLE A7" i—> A = X(N) which is compared to the true parameters (A*(ajj))j=i n in terms of the loss function under the summation in (1.13).
8Strictly speaking we should consider the estimation loss £(\e, A*) of Ae = max{A, e} for some e > 0, because we need a strictly positive estimator, P-a.s., see Proposition 1.8.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
28
Chapter 1.  Introduction to Non-Life Insurance Pricing
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2
Generalized Linear Models
Generalized linear models (GLMs) are popular statistical models that are used in the framework of the exponential dispersion family (EDF). For standard references on GLMs and the EDL we refer to Nelder-Wedderburn [98] and McCullagh-Nelder [93]. We present the Poisson claims count model within a GLM framework in this chapter. Lhis extends the homogeneous Poisson portfolio consideration of Section 1.2.2 to the heterogeneous case. Similar derivations can be done for individual claim sizes using, for instance, a gamma or a log-normal distribution for individual claim sizes, for details we refer to Ohlsson-Johansson [102] and Wuthrich-Merz [141]. Additionally, we introduce a classification problem in this chapter which is based on the Bernoulli model and which describes a logistic regression problem, see Section 2.5 below.
2.1   Heterogeneous Poisson claims frequency model
Assume that the claims count random variable N has a Poisson distribution with given years at risk v > 0 and expected frequency A > 0. We aim at modeling the expected frequency A > 0 such that it allows us to incorporate structural differences (heterogeneity) between different insurance policies and risks; such structural differences are called systematic effects in the statistical literature. Assume we have q-dimensional features x = [x\,... ,xq)' G X belonging to the set of all possible features (called feature space X). A regression function A(-) maps this feature space X to expected frequencies
X:X^R+,       x^X = X(x). (2.1)
Lhe feature x describes the risk characteristics of a certain insurance policy, see Example 2.1 below, and A(-) describes the resulting structural differences (systematic effects) in the expected frequency described by these features.
Terminology. Lhere is different terminology and we use the ones in italic letters.
• x is called feature, covariate, explanatory variable, independent variable, predictor variable, measurement vector.
• A is called (expected) response, dependent variable, regressand, regression function, classifier (the latter three terminologies also depend on the particular type of
29
Electronic copy available at: https://ssrn.com/abstract=2870308
30
Chapter 2.  Generalized Linear Models
regression problem considered, this is further highlighted below).
• The components of x can be continuous, discrete or even finite. In the finite and discrete cases we distinguish between ordered (ordinal) and unordered (nominal) components (called categorical components). For instance, x\ 6 {female,male} is a nominal categorical feature component. The case of two categories is called binary.
Our main goal is to find the regression function A(-) as a function of the features x 6 X and to understand the systematic effects and their sensitivities in each feature component xi of x. In general, available observations are noisy, that is, we cannot directly observe A (a;), but only the frequency Y = N/v, which is generated by
Y = \(x)+e,
with residual (noise) term e satisfying in the Poisson case
E[e] = 0       and       Var(e) = X(x)/v.
Example 2.1 (car insurance). We model the claims frequency of Y = N/v with feature x by
(X(x)v)k
F[Y = k/v]   =F[N = k]  =  exp{-A(a;)t;} V V , for k e N0,
k\
with given years at risk v > 0 and regression function A(-) given by (2.1). We may now think of features x = (x\,. .. ,xq) characterizing different car drivers with, for instance, x\ describing the age of the driver (continuous component), X2 describing the price of the car (continuous component or discrete ordinal component if of type "cheap", "average", "expensive"), x% describing the gender of the driver (binary component), etc. The goal is to find (infer) the regression function A(-) so that it optimally characterizes the underlying risks in terms of the chosen features x 6 X. This inference needs to be done from noisy claims count observations N and the chosen features may serve as proxies for other (unobservable) risk drivers such as driving experience, driving skills, etc. ■
In view of the previous example it seems advantageous to include as many feature components as possible in the model. However, if the feature space is too complex and too large this will lead to a poor model with a poor predictive performance (big generalization loss, see Section 1.3). The reason for this being that we have to infer the regression function from a finite number of observations, and the bigger the feature space the more likely that irrelevant feature information may play a special role in a particular (finite) observation sample. For this reason it is important to carefully choose relevant feature information. In fact, feature extraction is one of the best studied and understood fields in actuarial practice with a long tradition. The Swiss car insurance market has been deregulated in 1996 and since then sophisticated pricing models have been developed that may depend on more than 30 feature components. One may ask, for instance, questions like "What is a sports car?", see Ingenbleek-Lemaire [72].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
31
2.2   Multiplicative regression model
A commonly used technique to infer a regression function A(-) is the MLE method within the family of GLMs. We consider the special case of the Poisson model with a multiplicative regression structure in this chapter. This multiplicative regression structure leads to a log-linear functional form (or the log-link, respectively).
Assume X C R9 and that the regression function A(-) is characterized by a parameter vector j3 = (/3o, fli,..., /3q)' £ R9+1 such that for x £ X we have representation
x       \og\{x) = (i0+liixi +...+fiqxq d=   (p,x). (2.2)
The last definition uses a slight abuse of notation because the features x £ X are extended by a zero component xq = 1 for the intercept component /3q. Formula (2.2) assumes that all feature components are real-valued. This requires that categorical feature components are transformed (pre-processed); a detailed treatment and description of categorical feature components' pre-processing is provided in Section 2.4.1, below. The task is to infer /3 from cases (N,Xi, Vi) £ T>, where we extend definition (1.6) of the data to
V = {(iVi, Xl,vi) ,...,(Nn, xn, vn)} , (2.3)
with Xi £ X being the feature information of policy i = 1,..., n. Assume that all cases are independent with Ni being Poisson distributed with expected frequency A(ajj) given by (2.2). The joint log-likelihood function of the data T> under these assumptions is given by
n
P -> ^(/3) = E-exP(/3'^)^ + Ar^(/3'^)+1°g^)-log(^0- (2-4)
i=i
The MLE may be found by the solution of1
§0 M/3) = 0. (2.5) We calculate the partial derivatives of the log-likelihood function for 0 < / < q
Q n n
—£n(P) = ^-exp(/3,xl)vlxhi + NiXij =  ^2 {-\{xl)vl + Ni) xhi = 0, (2.6)
' 1 i=i i=i
where Xi = (xiti,... ,Xi,q)' £ X describes the feature of the i-th case in T>, and for the intercept fto we add components Xi:o = 1. We define the design matrix X G Rrax(g+1) by
= (xi,l)l<i<n,0<l<q, (2-7)
thus, each row 1 < i < n describes the feature x\ of case i. Observe that (3£/3)j = (f3,Xi) = \og\(xi) and (X'N)i = ^27=1 ^ixi,i- This allows us to rewrite (2.6) as follows
n
(X'AOz = ^exp^X^}^ = (3£Vexp{3£/8})z,
i=i
1Formula (2.5) is a bit a sloppy notation of saying that the gradient V^£n(/3) is equal to the zero vector in K9+1. Solutions to (2.5) give critical points, maximas have negative definite Hessians, and in case of a concave log-likelihood function we have a unique maximum.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
32
Chapter 2.  Generalized Linear Models
where we define the weight matrix V = diag(wi,..., vn) and where the exponential function exp {X/3} is understood element-wise. We have just derived the following statement:
Proposition 2.2. The solution f3 to the MLE problem (2.5) in the Poisson case (2.4) may be found by the solution of
X'V exp{X/3} = X'N.
Lhis optimization problem in Proposition 2.2 is solved numerically using Lisher's scoring method or the iteratively re-weighted least squares (IRLS) method, see Nelder-Wedderburn [98]. Lhese algorithms are versions of the Newton-Raphson algorithm to find roots of sufficiently smooth functions.
Lhis MLE /3 is then used to estimate the regression function A(-) and we set for x G X
x       X(x) = exp(/3, x), (2.8) which provides the predictor for case (Y,x,v) in the Poisson GLM situation
Y = E[Y} = \(x) = exp(3, x).
Remarks 2.3.
• Lhe Hessian of £n(/3) is given by
H^n((3) = -X' diag(exp((3,x1)v1,...,exp((3,xn)vri)X G K(<z+i)x(9+i). (2.9)
Below we make assumptions on the rank of the design matrix X, which is necessary to have uniqueness of the MLE in Proposition 2.2. In fact, if X has full rank q+l < n then we have a negative definite Hessian H^_/v(/3) which provides a unique MLE. Lhis follows from the property that (2.4) provides a concave optimization problem in a minimal representation if X has full rank.
• In the derivation of the MLE (2.5) we assume that we know the true functional form for A(-) of the data generating mechanism and only its parameter j3 needs to be inferred, see (2.2). Lypically, the true data generating mechanism is not known and (2.8) is used as an approximation to the true (but unknown) mechanism. In the next chapters we present methods that assume less model structure in A(-).
• Lhe log-linear structure (2.2) for the expected frequency implies a multiplicative tariff structure (systematic effects)
Y = E[Y]= X(x) = exp(3, x) = exp{^0} II exP(A^}- (2-10)
i=i
Lhe term exp{/3o} from the intercept /3q describes the base premium and the factors exp{/3/a;/} describe the adjustments according to the feature components x\. Lhese factors exp{/3/a;/} are typically around 1 (if fto is appropriately normalized). Moreover, they can easily be interpreted. Lor instance, increasing feature component
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
33
xi from value 1 to value 2 adds an additional relative loading of size exp{/3/} to the price; and this relative loading is independent of any other feature component values.
Formula (2.10) shows that the feature components interact in a multiplicative way in our Poisson GLM. One of the main tasks below is to analyze whether this multiplicative interaction is appropriate.
The so-called score is obtained by the gradient of the log-likelihood function
s(/3,N) = ^-£N(f3) = X'N - X'Vexp{3lf3}.
Under the assumption that the data T> has been generated by this Poisson GLM we have E[s(/3, AT")] = 0 and we obtain Fisher's information l(/3) G R(9+1)x(9+1), see also Section 2.6 in the appendix,
d
l(ß)=E[s(ß,N)s(ß,N)'] = -E
-nßeN(ß).
From Proposition 2.2 we obtain unbiasedness of the volume-adjusted MLE (2.8) for the expected number of claims, see also Proposition 2.4, below. Moreover, the Cramer-Rao bound is attained, which means that we have a uniformly minimum variance unbiased (UMVU) estimator, see Section 2.7 in Lehmann [87].
Fisher's information matrix T(/3) plays a crucial role in uncertainty quantification of MLEs within GLMs. In fact, one can prove asymptotic normality of the MLE /3 if the volumes go to infinity and the asymptotic covariance matrix is a scaled version of the inverse of Fisher's information matrix, we refer to Chapter 6 in Lehmann [87], Fahrmeir-Tutz [41] and Chapter 5 in Wuthrich-Merz [141].
Finally, we consider the so-called balance property, see Theorem 4.5 in Buhlmann-Gisler [18]. This is an important property in insurance to receive the right price calibration on the portfolio level.
Proposition 2.4 (balance property). Under the assumptions of Proposition 2.2 we have for the MLE f3
n n n
1 = 1 1=1 1=1
Proof. The proof is a direct consequence of Proposition 2.2. Note that the first column in the design matrix X is identically equal to 1 (modeling the intercept). This implies
J n
# Vi exp<3, Xi) = (1,. .. , 1)V exp{X/3} = (1,. .. , 1)JV = ^ Nt.
i=l i=l
This proves the claim. □
Remark 2.5. The balance property holds true in general for GLMs within the EDF as long as we work with the canonical link. The canonical link of the Poisson model is the log-link, which exactly tells us that for regression function (2.2) the balance property has to hold. For more background on canonical links we refer to McCullagh-Nelder [93] and Section 5.1.5 in [141].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
34
Chapter 2.  Generalized Linear Models
2.3   Deviance residuals and parameter reduction
There are several ways to assess goodness of fit and to evaluate the generalization loss introduced in Section 1.3. For analyzing the generalization loss we consider the (stratified) K-fo\d Poisson deviance cross-validation loss described in (1.11), modified to the heterogeneous case. The (heterogeneous) Poisson deviance loss for regression function (2.2) is given by, see also (1.5),
> 0.
(2.11)
Note, again, that minimizing this Poisson deviance loss provides the MLE /3 under assumption (2.2).
For the goodness of fit we may consider the (in-sample) Pearson's residuals.  We set
Yl = N./v, for 1 < i < n
N - \{xl)vl
\{Xi)vi
Y - X(xi)
(2.12)
These Pearson's residuals should roughly be centered with unit variance (and close to independence) under our model assumptions. Therefore, we can consider scatter plots of these residuals (in relation to their features) and we should not detect any structure in these scatter plots.
Pearson's residuals 5f, 1 < i < n, are distribution-free, i.e. they do not (directly) account for a particular distributional form. They are most appropriate in a (symmetric) Gaussian case. For other distributional models one often prefers deviance residuals. The reason for this preference is that deviance residuals are more robust (under the right distributional choice): note that the expected frequency parameter estimate A appears in the denominator of Pearson's residuals in (2.12). This may essentially distort Pearson's residuals. Therefore, we may not want to rely on weighted residuals. Moreover, Pearson's residuals do not account for the distributional properties of the underlying model. Therefore, Pearson's residuals can be heavily distorted by skweness and extreme observations (which look very non-Gaussian). The Poisson deviance residuals are defined by
sgn [Ni
Remarks.
If we allow for an individual expected frequency parameter Ai for each observation Ni, then the MLE optimal model is exactly the saturated model with parameter estimates Aj = N/vi, see also (1.4). Therefore, each term in the summation on the right-hand side of the above deviance loss (2.11) is non-negative, i.e.
2 N
X(xl)vl    1_log( Kxi)vi
> o,
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
35
for all 1 < i < n. This implies that the deviance loss is bounded from below by zero, and a sequence of parameters (j3t)t>i that is decreasing in the sequence of deviance losses (D*(N, X = A^g ))t>i may provide convergence to a (local) minimum w.r.t. that loss function. This is important in any numerical optimization such as the gradient descent algorithm that we will meet below.
• Observe that maximizing the log-likelihood In (ft) for parameter j3 is equivalent to minimizing the deviance loss D*(N,X = \p) for /3. In this spirit, the deviance loss plays the role of the canonical objective function that should be minimized.
• D*(N,X) is the scaled Poisson deviance loss. Within the EDF there is a second deviance statistics that accounts for potential over- or under-dispersion <f> ^ 1. In the Poisson model this does not apply because by definition cf> = 1. Nevertheless, we can determine this dispersion parameter empirically on our data. There are two different estimators. Pearson's (distribution-free) dispersion estimate is given by
1 n 2
and the deviance dispersion estimate is given by
*D = n-(q+l) §1*0   = «-(? + !)■ (2J4)
Pearson's dispersion estimate should be roughly equal to 1 to support our model choice. The size of the deviance dispersion estimate depends on the size of the expected number of claims Xv, see Figure 1.1. For our true expected frequency A* we receive 4>D = D*(N,X*)/n = 27.7228 ■ 10"2, see (A.5).
Finally, we would like to test whether we need all components in j3 G R9+1 or whether a lower dimensional nested model can equally well explain the observations AT". We assume that the components in j3 are ordered in the appropriate way, otherwise we permute them (and accordingly the columns of the design matrix X).
Null hypothesis Ho'- fti = . .. = /3P = 0 for given 1 < p < q.
1. Calculate the deviance loss D*(N, Afuu) in the full model with MLE /3 G K9+1-
2. Calculate the deviance loss D*(N,\h0) under the null hypothesis Ho with MLE 3 G W+1-p.
Define the likelihood ratio test statistics, see Lemma 3.1 in Ohlsson-Johansson [102],
xl = D*(N, XHo) - D*(N, Ami) > 0. (2.15)
Under Hq, the likelihood ratio test statistics Xv 1S approximately x2-distributed with p degrees of freedom. Alternatively, one could use a Wald statistics instead of Xv- A Wald statistics is a second order approximation to the log-likelihood function which is motivated by asymptotic normality of the MLE /3. The z-test in the R output in Listing 2.2 refers to a rooted Wald statistics; for more details we refer to Section 5.3.2 in [141].
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
36
Chapter 2.  Generalized Linear Models
Remarks 2.6.
• Analogously, the likelihood ratio test and the Wald test can be applied recursively to a sequence of nested models. This leads to a step-wise reduction of model complexity, this is similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and it is often referred to as backward model selection, see dropl below.
• The tests presented apply to nested models. If we want to selected between nonnested models we can use cross-validation.
2.4   Example in car insurance pricing
We consider a car insurance example with synthetic data T> generated as described in Appendix A. The portfolio consists of n = 500'000 car insurance policies for which we have feature information Xi 6 X* and years at risk information Vi £ (0,1], for i = 1,.. . ,n. Moreover, for each policy i we have an observation N{. These (noisy) observations were generated from independent Poisson distributions based on the underlying expected frequency function A*(-) as described in Appendix A.2 Here, we assume that neither the relevant feature components, the functional form nor the involved parameters of this expected frequency function A*(-) are known (this is the typical situation in practice), and we aim at inferring an expected frequency function estimate A(-) from the data X>.3
2.4.1   Pre-processing features: categorical feature components
We have 4 categorical (nominal) feature components gas, brand, area and ct in our data T>. These categorical feature components need pre-processing (feature engineering) for the application of the Poisson GLM with regression function (2.2) because they are not real-valued. Component gas £ {Diesel, Regular} is binary and is transformed to 0 and 1, respectively. The area code is ordinal and is transformed by {A, ...,F} i—> {1,..., 6} Cl, see Figure A.9. Thus, there remains the car brand and the Swiss cantons ct. Car brand has 11 different levels and there are 26 Swiss cantons, see (A.l). We present dummy coding to transform these categorical feature components to numerical representations.
For illustrative purposes, we demonstrate dummy coding on the feature component car brand. In a first step we use binary coding to illustrate which level a selected car has. In Table 2.1 we provide the one-hot encoding of the car brands on the different rows. Each brand is mapped to a basis vector in R11, i.e. each car brand is represented in one-hot encoding by a unit vector cc^lh-) £ {0, l}11 with Yl}=ixilh^ = 1- This mapping is illustrated by the different rows in Table 2.1. In a second step we need to ensure that the resulting design matrix X (which includes an intercept component, see (2.7)) has full rank. This is achieved by declaring one level to be the reference level (this level is modeled by the intercept /3q).  Dummy coding then only measures relative differences
2The true (typically unknown) expected frequency is denoted by A*(-) and the corresponding feature space by X*, i.e. we have true regression function A* : X* —S- K+.
3The data is available from https://people.math.ethz.ch/~wmario/Lecture/MTPL_data.csv
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
37
Bl	1	0	0	0	0	0	0	0	0	0	0
BIO	0	1	0	0	0	0	0	0	0	0	0
Bll	0	0	1	0	0	0	0	0	0	0	0
B12	0	0	0	1	0	0	0	0	0	0	0
B13	0	0	0	0	1	0	0	0	0	0	0
B14	0	0	0	0	0	1	0	0	0	0	0
B2	0	0	0	0	0	0	1	0	0	0	0
B3	0	0	0	0	0	0	0	1	0	0	0
B4	0	0	0	0	0	0	0	0	1	0	0
B5	0	0	0	0	0	0	0	0	0	1	0
B6	0	0	0	0	0	0	0	0	0	0	1
Table 2.1: One-hot encoding x^lh^ 6 R11 for car brand, encoded on the different rows.
to this reference level. If we declare Bl to be the reference level, we can drop the first column of Table 2.1. This provides the dummy coding scheme for car brand given in Table 2.2.
Bl	0	0	0	0	0	0	0	0	0	0
BIO	1	0	0	0	0	0	0	0	0	0
Bll	0	1	0	0	0	0	0	0	0	0
B12	0	0	1	0	0	0	0	0	0	0
B13	0	0	0	1	0	0	0	0	0	0
B14	0	0	0	0	1	0	0	0	0	0
B2	0	0	0	0	0	1	0	0	0	0
B3	0	0	0	0	0	0	1	0	0	0
B4	0	0	0	0	0	0	0	1	0	0
B5	0	0	0	0	0	0	0	0	1	0
B6	0	0	0	0	0	0	0	0	0	1
Table 2.2: Dummy coding ccbrand 6 R10 for car brand, encoded on the different rows. We define the part of the feature space X that belongs to car brand as follows
^brand = ^brand £ ^ 1jlO.   ^ ^.brand £ ^ ^    c   R10_ (2.16)
That is, the resulting part Xhraild of the feature space X is 10 dimensional, the feature components can only take values 0 and 1, and the components of xbrand add up to either 0 or 1, indicating to which particular car brand a specific policy is belonging to. Note that this feature space may differ from the original feature space X* of Appendix A. For the 26 Swiss cantons we proceed completely analogously receiving Xct C {0, l}25 C R25 with, for instance, canton ZH being the reference level.
Remarks 2.7.
• If we have k categorical classes, then we need k — 1 indicators (dummy variables) to uniquely identify the parametrization of the (multiplicative) model including an
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
38
Chapter 2.  Generalized Linear Models
intercept. Choice (2.16) assumes that one level is the reference level. Lhis reference level is described by the intercept All other levels are measured relative to this reference level and are described by regression parameters /3/, with 1 < / < k — 1, see also (2.10). Lhis parametrization is called dummy coding or treatment contrast coding where the reference level serves as control group and all other levels are described by dummy variables relative to this control group.
• Other identification schemes (contrasts) are possible, as long as they lead to a full rank in the design matrix X. Lor instance, if we have the hypothesis that the first level is the best, the second level is the second best, etc., then we could consider Helmert's contrast coding diagram given in Lable 2.3. Lor illustrative purposes we only choose k = 5 car brands in Lable 2.3.
Bl	4/5	0	0	0
B2	-1/5	3/4	0	0
B3	-1/5	-1/4	2/3	0
B4	-1/5	-1/4	-1/3	1/2
B5	-1/5	-1/4	-1/3	-1/2
Lable 2.3: Helmert's contrast coding a;He:LlIiert i R4 encoded on the different rows.
Lhis coding in Lable 2.3 has the following properties: (i) each column sums to zero, (ii) all columns are orthogonal, and (iii) each level is compared to the mean of the subsequent levels, see also (2.17), below. In view of (2.2), this provides regression function for policy i (restricted to 5 car brand levels, only)
	' ßo +	Ißl				i	is	car	brand Bl,
	ßo-	lßl +	\ßl			i	is	car	brand B2,
= <	ßo-	x5ßl-	\ß2 +			i	is	car	brand B3,
	ßo-	kßl-	\ß2~	5Ä +	\ß,	i	is	car	brand B4,
	. ßo-	x5ßl-		5Ä-	5/34	i	is	car	brand B5.
(/3,^elmert)
Observe that the mean over all risk classes is described by (3q. Lor a given level, say car brand B2, the subsequent levels have the same mean (on the log scale) except in the variable to which it is compared to:
1 3
(log-) mean of car brand B2:      (3o--(3i + —/?2>
5 4
(log-) mean over car brands B3, B4, B5:      (3o--(3i — —/?2>
5 4
thus, the difference in this comparison is exactly one unit of /?2-
(2.17)
Lhe choice of the explicit identification scheme does not have any influence on the prediction, i.e. different (consistent) parametrizations lead to the same prediction. However, the choice of the identification may be important for the interpretation of the parameters (see the examples above) and it is important, in particular, if we explore parameter reduction techniques, e.g., backward selection.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
39
• If we have k = 2 categorical classes (binary case), then dummy coding is equivalent to a continuous consideration of that component.
Conclusion 2.8. If we consider the 5 continuous feature components age, ac, power, area and log(dens) as log-linear, the binary component gas as 0 or 1, and the 2 categorical components brand and ct by dummy coding we receive a feature space X = R5 x {0,1} x Xbrajad x Xct C Rq of dimension q = 5 + 1 + 10 + 25 = 41.
2.4.2   Pre-processing features: continuous feature components
In view of the previous conclusion, we need to ask ourselves whether a log-linear consideration of the continuous feature components age, ac, power, area and log(dens) in regression function (2.2) is appropriate.
observed log-frequency per age of di
observed log-frequency per age of c
observed log-frequency per power of c
log-frequency per area code
observed log-frequency per log density
Figure 2.1: Observed marginal log-frequencies of the continuous feature components age, ac, power, area and log(dens).
In Figure 2.1 we illustrate the observed marginal log-frequencies of the continuous feature components age, ac, power, area and log(dens) provided by the data T> in Appendix A. We see that some of these graphs are (highly) non-linear and non-monotone which does not support the log-linear assumption in (2.2). This is certainly true for the components age and ac. The other continuous feature components power, area and log(dens) need a more careful consideration because these marginal plots in Figure 2.1 can be (rather) misleading. Note that these marginal plots involve interactions between feature components and, thus, these marginal plots may heavily be influenced by such interactions. For instance, log(dens) at the lower end may suffer from insufficient years at risk and interactions with other feature components that drive low expected frequencies.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
40
Chapter 2.  Generalized Linear Models
GLM modeling decision. To keep the outline of the GLM chapter simple we assume that the components power, area and log(dens) can be modeled by a log-linear approach, and that the components age and ac need to be modeled differently. These choices are going to be challenged and revised later on.
The first way to deal with non-linear and non-monotone continuous feature components in GLMs is to partition them and then treat them as (nominal) categorical variables. We will present an other treatment in Chapter 3 on generalized additive models (GAMs). In Figure 2.2 we provide the chosen categorization which gives us a partition of age into 8 'age classes' and of ac into 4 'ac classes' (which hopefully are homogeneous w.r.t. claims frequency).
age	class 1	: 18-20
age	class 2	: 21-25
age	class 3	: 26-30
age	class 4	: 31-40
age	class 5	: 41-50
age	class 6	: 51-60
age	class 7	: 61-70
age	class 8	: 71-90
ac	class 0:	0
ac	class 1:	1
ac	class 2:	2
ac	class 3:	3+
						• •	
age of driver observed log-frequency per age of car
Figure 2.2: (lhs) Chosen 'age classes' and 'ac classes'; (rhs) resulting partition.
Listing 2.1: Categorical classes for GLM
1 cl   <-  c(18,   21,   26,   31,   41,   51,   61,   71, 91)
2 ageGLM  <-  cbind(c(18:90) ,c(rep(paste(cl [1] ,"to",cl [2]-1 ,   sep = ""),...
3 dat$ageGLM  <-  as.factor(ageGLM[dat$age-17 , 2])
4 dat$ageGLM  <-  relevel(dat$ageGLM,   ref="51to60")
5 dat$acGLM  <-  as.factor(pmin(dat$ac,3))
6 levels(dat$acGLM)   <-  c("acO","ac1","ac2","ac3+")
In Listing 2.1 we give the R code to receive this partitioning. Note that we treat these age and ac classes as categorical feature components. In R this is achieved by declaring these components being of factor type, see lines 3 and 5 of Listing 2.1. For the GLM approach we then use dummy coding to bring these (new) categorical feature components into the right structural form, see Section 2.4.1.   On line 4 of Listing 2.1 we define
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
41
the reference level of 'age class' for this dummy coding, and for 'ac class' we simply choose the first one acO as reference level. Thus, in full analogy to Section 2.4.1, we obtain Xage C {0, l}7 C R7 for modeling 'age classes', and Xac C {0, l}3 C R3 for modeling 'ac classes'. This implies that our feature space considers 3 log-linear continuous feature components power, area and log(dens), the binary feature component gas and 4 categorical feature components 'age class', 'ac class', brand and ct. This equips us with feature space
X = R3 x {0,1} x Xage x Xac x Xbrajad x Xct c R9, of dimension g = 3 + l + 7 + 3 + 10 + 25 = 49.
Remarks 2.9 (feature engineering).
• The choice of 'age classes' and 'ac classes' has been done purely expert based by looking at the marginal plots in Figure 2.2 (rhs). They have been built such that each resulting class is as homogeneous as possible in the underlying frequency. In Chapter 6 we will meet regression trees which allow for a data driven selection of classes.
• Besides the heterogeneity between and within the chosen classes, one should also pay attention to the resulting volumes. The smallest 'age class' 18to20 has a total volume of 1'741.48 years at risk, the smallest 'ac class' acO a total volume of 14'009.20 years at risk. The latter is considered to be sufficiently large, the former might be critical. We come back to this in formula (2.20), below. Thus, in building categorical classes one needs to find a good trade-off between homogeneity within the classes and minimal sizes of these classes for reliable parameter estimation.
• A disadvantage of this categorical coding of continuous variables is that the topology gets lost. For instance, after categorical coding as shown in Figure 2.2 it is no longer clear that, e.g., ages 70 and 71 are neighboring ages (in categorical coding). Moreover, the resulting frequency function will not be continuous at the boundaries of the classes. Therefore, alternatively, one could also try to replace non-monotone continuous features by more complex functions. For instance, we could map age to a 4 dimensional function having regression parameters /3/,..., /3j+3 G R
age H> /3/age + /3/+i log(age) + /3/+2age2 + /3/+3age3.
In this coding we keep age as a continuous variable and we allow for more modeling flexibility by providing a pre-specified functional form that will run into the regression function.
• In (2.10) we have seen that if we choose the log-link function then we get a multiplicative tariff structure. Thus, all feature components interact in a multiplicative way. If we have indication that interactions have a different structure, we can model this structure explicitly. As an example, we may consider
(age, ac) h-> /3/age + /3/+iac + /3/+2age/ac2.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
42
Chapter 2.  Generalized Linear Models
This gives us systematic effect (under log-link choice)
exp{/3/age} exp{/3/+:Lac} exp{/3/+2age/ac2}.
This shows that GLMs allow for a lot of modeling flexibility, but the modeling has to be done explicitly by the modeler. This requires deep knowledge about the data, so that features can be engineered and integrated in the appropriate way.
Example 2.10 (example GLM1). In a first example we only consider feature information age and ac, and we choose the (categorical) feature pre-processing as described in Listing 2.1. In the next section on 'data compression' it will become clear why we are starting with this simplified example.
We define feature space X = Xage x Xac C R9 which has dimension q = 7 + 3 = 10, and we assume that the regression function A : X —> R+ is given by the log-linear form (2.2).
Listing 2.2: Results of example GLM1 (Example 2.10)
1 Call:
2 glm(formula =  claims   ~   ageGLM  +  acGLM ,   family  =  poisson(),   data  = dat,
3 offset  = log(expo))
4
5 Deviance Residuals:
6	Min	iq	Median	3Q			Max		
7 Q	-1.1643 -0.	3967	-0.2862	-0.1635		4.3409			
O 9	Coefficients								
10		Estimate Std.		Error	z value		Pr (> !	z!)	
11	CIntercept)	-1.	41592 0.	02076	-68.	220	< 2e	-16	* * *
12	ageGLM18to20	1.	12136 0.	04814	23 .	293	< 2e	-16	* * *
13	ageGLM21to25	0.	48988 0.	02909	16 .	839	< 2e	-16	* * *
14	ageGLM26to30	0.	13315 0.	02473	5 .	383	7.32e	-08	* * *
15	ageGLM31to40	0.	01316 0.	01881	0.	700	0.48403		
16	ageGLM41to50	0.	05644 0.	01846	3.	058	0.00223		**
17	ageGLM61to70	-0.	17238 0.	02507	-6.	875	6.22e	-12	* * *
18	ageGLM71to90	-0.	13196 0.	02983	-4.	424	9.71e	-06	* * *
19	acGLMacl	-0.	60897 0.	02369	-25 .	708	< 2e	-16	* * *
20	acGLMac2	-0.	79284 0.	02588	-30.	641	< 2e	-16	* * *
21	acGLMac3+	-1.	08595 0.	01866	-58.	186	< 2e	-16	* * *
22	---								
23	S ignif. code	s :	0  *** 0.001	**  0.	01 *	0.	05   . 0	. 1	1
24									
25	(Dispersion	parameter  for poisson			family		taken	to	be
26									
27	Null devianc		e: 145532	on 499999		de	grees	of	free^
28	Residual devianc		e: 141755	on 499989		de	grees	of	free^
29	AIC: 192154								
30									
31	Number  of Fisher		Scoring it	erations :		6			
The MLE /3 S R9+1 is then calculated using the R package glm. The results are presented in Listing 2.2. We receive q + 1 = 11 estimated MLE parameters (lines 11-21), the intercept being (3o = —1.41592. Thus, the reference risk cell (51to60, acO) has an expected frequency of A(cc) = exp{ —1.41592} = 24.27%. All other risk cells are measured relative to this reference cell using the multiplicative structure (2.10).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
43
The further columns on lines 11-21 provide: column Std. Error gives the estimated standard deviation 07 in /3/, for details we refer to (2.33) in the appendix Section 2.6 where Fisher's information matrix is discussed; the column z value provides the rooted Wald statistics under the null hypothesis /3/ = 0 (for each component 0 < / < q individually), it is defined by z% = fii/df, finally, the last column Pr(>!z!) gives the resulting (individual two-sided) p-values for these null hypotheses under asymptotic normality. These null hypotheses should be interpreted as there does not exist a significant difference between the considered level 1 < / < q and the reference level (when using dummy coding). Note that these tests cannot be used to decide whether a categorical variable should be included in the model or not.4
Line 28 of Listing 2.2 gives the unsealed in-sample Poisson deviance loss, i.e. nC1^ = D*(N,X) = 141'755. This results in an in-sample loss of £% = 28.3510 ■ 10-2, see also Table 2.4. Line 27 gives the corresponding value of the homogeneous model (called null model) only considering an intercept, this is the model of Section 1.4. This allows us to consider the likelihood ratio test statistics (2.15) for the null hypothesis Hq of the homogeneous model versus the alternative heterogeneous model considered in this example.
xl = D*(N,XHo) - D*(N,Xfull) = 145'532 - 14l'755 = 3'777.
This test statistics has approximately a ^-distribution with q = 10 degrees of freedom. The resulting p-value is almost zero which means that we highly reject the homogeneous model in favor of the model in this example.
Finally, Akaike's information criterion (AIC) is given, and Fisher's scoring method for parameter estimation has converged in 6 iterations.
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev	/•CV J~D		/•is J~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.1) GLM1	3.1s	11	28.3543	28.3544	0.6052	28.3510	10.2691%
Table 2.4: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 1.1.
Exactly in the same structure as in Table 1.1, we present the corresponding cross-validation losses £§v, the estimation loss £(A,A*) and the in-sample loss Cf) on line (Ch2.1) of Table 2.4. Note that the estimation loss is given by
4The rooted Wald statistics uses a second order Taylor approximation to the log-likelihood. Based on asymptotic normality it allows for a z-test under known dispersion or a i-test for estimated dispersion whether a variable can be dropped from the full model. However, if we have categorical features, one has to be more careful because in this case the p-value only indicates whether a level significantly differs from the reference level.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
44
Chapter 2.  Generalized Linear Models
and this estimation loss can only be calculated because we know the true model A*, here. We observe that these loss figures are clearly better than in the homogeneous model on line (Chl.l), but worse compared to the true model A* on line (ChA.l). We also observe that we have a negligible over-fitting because cross-validation provides almost the same numbers compared to the in-sample loss.5
Next we estimate the dispersion parameter <f>. Lhe resulting Pearson's and deviance estimators are
4>P = 1.0061
and
4>D = 28.3516 ■ 10"
Lhe Pearson's estimator shows a small over-dispersion, the deviance estimator is more difficult to interpret because its true level for small frequencies is typically not exactly known, see also Ligure 1.1. Note that the deviance dispersion estimate (2.14) scales with the number of parameters q, that is, the degrees of freedom are given by 500'000 —(g+l) = 499'989. Lhis slightly corrects for model complexity. However, the resulting estimate is bigger than the cross-validation loss in our example.
GLM1: estimated frequency of age
GLM1: estimated frequency of ac
	• observed — GLM1 estimated
\	> •
o	
I' • ---= _._•                  » «*	
	• •
23   28   33   38   43   48   53   58   63   68 73
age of driver
2   4   6   8   10     13     16     19     22     25     28     31 34
age of car
Ligure 2.3: Estimated marginal frequencies in example GLM1 for age and ac.
In Ligure 2.3 we present the resulting marginal frequency estimates (averaged over our portfolio); these are compared to the marginal observations. Lhe orange graphs reflect the MLEs provided in Listing 2.2. Lhe graphs are slightly wiggly which is caused by the fact that we have a heterogeneous portfolio that implies non-homogeneous multiplicative interactions between the feature components age and ac.
Finally, in Figure 2.4 we show the resulting Pearson's and deviance residuals. Lhe different colors illustrate the underlying years at risk Vi G (0,1], i = 1,.. ., n. As previously mentioned, we observe that Pearson's residuals are not very robust for small years at risk Vi (red color) because we divide by these (small) volume measures for Pearson's residual
5Cross-validation is done on the same partition for all models considered in these notes. Run time measures the time to fit the model once on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99GHz with 16GB RAM.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
45
Figure 2.4: Deviance residuals Sf versus Pearson's residuals Sf; the colors illustrate the underlying years at risk Vi G (0,1].
definition (2.12). In general, one should work with deviance residuals because this is a distribution-adapted version of the residuals and considers skewness and tails in the right way (supposed that the model choice is appropriate). This finishes the example for the time being. ■
2.4.3   Data compression
We note that Example 2.10 only involves d = 8 ■ 4 = 32 different risk cells for the corresponding 'age classes' and 'ac classes', see also Figure 2.2. Within these risk cells we assume that the individual insurance policies are homogeneous, i.e. can be characterized by a common feature value x\ G X = Xage X Xac C R9 with q = 10. Running the R code in Listing 2.2 may be time consuming if the number of policies n is large. Therefore, we could/should try to first compress the data accordingly. For independent Poisson distributed random variables this is rather simple, the aggregation property of Lemma 1.3 implies that we can consider sufficient statistics on the d = \X\ = 32 different risk cells. These are described by the (representative) features {a;^,..., x^} = X. We define the aggregated portfolios in each risk cell k = 1,..., d by
n n
i=i i=i
Observe that and Vi are on an individual insurance policy level, whereas N£ and are on an aggregated risk cell (portfolio) level. The consideration of the latter provides a substantial reduction in computational time when calculating the MLE /3 of /3 because the n = 500'000 observations are compressed to d = 32 aggregated observations (sufficient statistics). Lemma 1.3 implies that all risk cells k = 1,..., d are independent and Poisson distributed
N+ ~Poi(A«K+). Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
46
Chapter 2.  Generalized Linear Models
The joint log-likelihood function on the compressed data T>+ = {(N^,x~^,v£); k = 1,. . ., d} is given by
d
= £-exp(/3,:E+H+ + iV+ ((/3, x+) + \ogv+) - log(iV+!). (2.19)
k=l
We introduce the design matrix X+ = (x~fci)i<k<d,o<i<q as above, and the MLE j3 is found as in Proposition 2.2.
Listing 2.3: Aggregation/compression of risk cells
> dat   <-  ddply(dat,   .(ageGLM, acGLM),   summarize,expo=sumCexpo),claims=sumCclaims))
> str(dat)
1 data.frame 1 :       32  obs.   of     4 variables:
$  ageGLM:   Factor  w/  8  levels "51to60" , "181o20" , .. :   1111222233 ...
$  acGLM   :   Factor  w/  4  levels "acO","acl","ac2",..:   1234123412 ...
$  expo     :   num    3144.4   6286.4 5438.8  38209.8  38.6 ... $  claims:   int     712  817  618  3174   19  31   12  410   179   192 ...
In Listing 2.3 we provide the R code for the data compression in each risk cell. Note that ageGLM and acGLM describe the categorical classes, see also Listing 2.1.
Example 2.10, revisited (example GLM2). We revisit example GLM1 (Example 2.10), but calculate the MLE /3 directly on the aggregated risk cells received from Listing 2.3. The results are presented in Listing 2.4.
We note that the result for the MLE /3 is exactly the same, compare Listings 2.2 and 2.4. Of course, this needs to be the case due to the fact that we work with (aggregated) sufficient statistics in the latter version. The only things that change are the deviance losses because we work on a different scale now. We receive in-sample loss = D*(N+,X)/d = 32.1610/32 = 1.0050 on the aggregate data D+. The degrees of freedom are d — (q + 1) = 32 — 11 = 21, this results in dispersion parameter estimates
4>P = 1.5048       and       4>D = 1.5315.
Thus, we obtain quite some over-dispersion which indicates that our regression function choice misses important structure.
In Figure 2.5 (lhs) we plot deviance residuals versus Pearson's residuals, the different colors showing the volumes of the underlying risk cells. We note that the two versions of residuals are almost identical on an aggregate risk cell level. Figure 2.5 (rhs) gives the Tukey-Anscombe plot which plots the residuals versus the fitted means. In this plot we would not like to discover any structure.
In Figure 2.6 we plot the estimated marginal frequencies of example GLM2 (on the different categorical levels). We notice that the observed frequencies Nj^~ /v£ (marginally aggregated) exactly match the estimated expected frequencies \(x~^) (also marginally aggregated). This is explained by the fact that the Poisson GLM method with categorical coding provides identical results to the method of the total marginal sums by Bailey [7] and Jung [76], see Section 7.1 in Wuthrich [135]. This also explains why the average estimated frequencies of the homogeneous model and example GLM1 are identical in the last column of Table 2.4. This finishes example GLM2. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
47
Listing 2.4: Results of example GLM2
1	Call :						
2	glm C formula	=  claims   -   ageGLM  + acGLM		, family		= poisson	
3 4	offset =	log(expo))					
5	Deviance Residuals:						
6	Min	1Q Median	3Q		Max		
7 Q	-1.9447 -0.	8175 -0.1509	0.6370	1.9360			
O 9	Coefficients						
10		Estimate Std.	Error  z value		Pr (> !	z!)	
11	CIntercept)	-1.41592 0.	02076 -68.	220	< 2e	-16	* * *
12	ageGLM18to20	1.12136 0.	04814 23.	293	< 2e	-16	* * *
13	ageGLM21to25	0.48988 0.	02909 16.	839	< 2e	-16	* * *
14	ageGLM26to30	0.13315 0.	02473 5.	383	7.32e	-08	* * *
15	ageGLM31to40	0.01316 0.	01881 0.	700	0.48403		
16	ageGLM41to50	0.05644 0.	01846 3.	058	0.00223		**
17	ageGLM61to70	-0.17238 0.	02507 -6.	875	6.22e	-12	* * *
18	ageGLM71to90	-0.13196 0.	02983 -4.	424	9.71e	-06	* * *
19	acGLMacl	-0.60897 0.	02369 -25.	708	< 2e	-16	* * *
20	acGLMac2	-0.79284 0.	02588 -30.	641	< 2e	-16	* * *
21	acGLMac3+	-1.08595 0.	01866 -58.	186	< 2e	-16	* * *
22	---						
23	S ignif. code	s :     0  *** 0.001	**  0.01 *	0.	05   . 0	. 1	1
24							
25	(Dispersion	parameter  for  poisson family			taken	to	be 1)
26							
27	Null   deviance: 3809.473		on  31 d	egr	ees of	f re	sedom
28	Residual   deviance: 32.161		on  21 d	egr	ees of	f re	sedom
29	AIC: 305.12						
30							
31	Number  of   Fisher   Scoring it		erations:	3			
deviance versus Pearson's residuals
Tukey-Anscombe plot
Figure 2.5: Example GLM2: (lhs) deviance versus Pearson's residuals, (rhs) Tukey-Anscombe plot.
2.4.4   Issue about low frequencies
In the previous section we have worked on aggregated data T>+ (sufficient statistics). This aggregation is possible on categorical classes, but not necessarily if we consider continuous feature components. This is exactly the reason why we have been starting
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
48
Chapter 2.  Generalized Linear Models
Figure 2.6: Estimated marginal frequencies in GLM2 Example 2.10 for 'age classes' and 'ac classes' risk cells.
by only considering two categorized feature components in Example 2.10 (GLM1 and GLM2). In this section we would like to emphasize a specific issue in insurance of having rather low expected frequencies in the range of 3% to 20%. If, for instance, X(x) = 5% and v = 1 then we obtain for N ~ Poi(X(x)v)
E[N] = 0.05       and       Vai(N)1/2 = 0.22.
This indicates that the pure randomness is typically of much bigger magnitude than possible structural differences (see also Figure 1.2), and we require a lot of information to distinguish good from bad car drivers. In particular, we need portfolios having appropriate volumes (years at risk). If, instead, these drivers would have observations of 10 years at risk, i.e. v = 10, then magnitudes of order start to change
E[N] = 0.50       and       Var(V)1/2 = 0.71.
If we take confidence bounds of 2 standard deviations we obtain for the expected frequency A (a;) the following interval
X(x)
X(x)
, A(a:)
This implies for X(x) = 5% that we need v = 2'000 years at risk to detect a structural difference to an expected frequency of 4%. Of course, this is taken care of by building sufficiently large homogeneous sub-portfolios (we have n = 500'000 policies). But if the dimension of the feature space X is too large or if we have too many categorical feature components then we cannot always achieve to obtain sufficient volumes for parameter estimation (and a reduction of dimension technique may need to be applied). Note that the smallest 'age class' 18to20 is rather heterogeneous but it only has a total volume of 1'741.48 years at risk, see Remarks 2.9. Moreover, these considerations also refer to the remark after Example 2.1.
Version October 27, 2021, M.V. Wuthrich h C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
49
2.4.5   Models GLM3+ considering all feature components
In this section we consider the a GLM considering all feature components and based on our GLM modeling decision on page 40: we assume that the components power, area and log(dens) can be modeled by a log-linear approach, component gas is binary, the components age and ac are modeled categorically according to Figure 2.2, and the remaining feature components brand and ct are categorical. This gives us the feature space
X = R3 x {0,1} x Xage x Xac x Xbrajad x Xct c R9, of dimension qr = 3 + l + 7 + 3 + 10 + 25 = 49. We call this model example GLM3.
Listing 2.5: Results of example GLM3
1	Call :								
2	glm C formula	=  claims ~		powe	r  + area		log(d	ens)   +  gas +	
3	acGLM +	brand +  ct,		family  = poissonO,				data = i	dat ,
5	Deviance Residuals:								
6	Min	iq	Median		3Q		Max		
7 Q	-1.1944 -0.	3841	-0.28	53	-0.1635	4	.3759		
O 9	Coefficients								
10		Estimate		Std	..   Error z	value		Pr(>!z ! )	
11	CIntercept)	-1.	7374212	0 .	0455594 -	38	. 135	< 2e-16	* * *
12	power	-0.	0008945	0 .	0031593	-0	. 283	0.777069	
13	area	0.	0442176	0 .	0192701	2	. 295	0.021755	*
14	log C dens)	0.	0318978	0 .	0143172	2	. 228	0.025884	*
15	gasRegular	0.	0491739	0 .	0128676	3	.822	0.000133	* * *
16	ageGLM18to20	1.	1409863	0 .	0483603	23	. 593	< 2e-16	* * *
17									
18	ageGLM71to90	-0.	1264737	0 .	0300539	-4	. 208	2.57e-05	* * *
19	acGLMacl	-0.	6026905	0 .	0237064 -	25	.423	< 2e-16	* * *
20	acGLMac2	-0.	7740611	0 .	0259994 -	29	.772	< 2e-16	* * *
21	acGLMac3+	-1.	0242291	0 .	0204606 -	50	. 059	< 2e-16	* * *
22	brandBlO	-0.	0058938	0 .	0445983	-0	. 132	0.894864	
23									
24	brandB5	0.	1300453	0 .	0292474	4	.446	8.73e-06	* * *
25	brandB6	-0.	0004378	0 .	0332226	-0	. 013	0.989486	
26	ct AG	-0.	0965560	0 .	0274785	-3	.514	0.000442	* * *
27									
28	ctVS	-0.	1110315	0 .	0317446	-3	.498	0.000469	* * *
29	ctZG	-0.	0724881	0 .	0463526	-1	. 564	0.117855	
30	---								
31	Signif. code	s :	0  *** 0	. 001	** 0.01	*	0. 05	. 0.1	1
32									
33	Null devianc		e: 145532		on 499999		degr e	es   of fr	e edo]
34	Residual devianc		e: 140969		on 499950		degr e	es   of fr	e edo]
35    AIC: 191445
The results for the MLE /3 are provided in Listing 2.5 and the run time is found in Table 2.5. From Listing 2.5 we need to question the modeling of (all) continuous variables power, area and log(dens). Either these variables should not be in the model or we should consider them in a different functional form.
In Table 2.5 we state the resulting loss figures of model GLM3. They are better than the ones of model GLM1 (only considering the two age and ac classes, respectively), but there is still room for improvements compared to the true model A*.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
50
Chapter 2.  Generalized Linear Models
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev >~D	J~D	£(A,A*)	>~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.1) GLM1	3.1s	11	28.3543	28.3544	0.6052	28.3510	10.2691%
(Ch2.3) GLM3	12.0s	50	28.2125	28.2133	0.4794	28.1937	10.2691%
(Ch2.4) GLM4	14.0s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch2.5) GLM5	13.3s	56	28.1508	28.1520	0.4128	28.1292	10.2691%
Table 2.5: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 2.4.
As a first model modification we consider power as categorical, merging powers above and including 9 to one class, thus, we consider 9 different 'power classes'. This equips us with a new feature space
X = AfP°wer x M2 x {0,1} x Xage x Xac x Xbrajad x Xct c K9, (2.21)
of dimension g = 8 + 2 + l + 7 + 3 + 10 + 25 = 56. We call this new model GLM4. The results are presented in Listing 2.6.
Based on this analysis we keep all components in the model. In particular, we should keep the variable power but a log-linear shape is not the right functional form to consider power. This also becomes apparent from the cross-validation analysis given in Table 2.5 on line (Ch2.4). Note that the number of parameters has been increasing from 11 (GLM1) to 57 (GLM4), and at the same time over-fitting is increasing (difference between cross-validation loss jC^,v and in-sample loss C1^). The cross-validation loss has decreased by 28.3543 — 28.1502 = 0.2041 which is consistent with the decrease of 0.1915 in estimation loss S(X, A*) from GLM1 to GLM4.
Another important note is that all GLMs fulfill the balance property of Proposition 2.4. This is also seen from the last column in Table 2.5.
Next we consider an analysis of variance (ANOVA) which nests GLMs w.r.t. the considered feature components. In this ANOVA, terms are sequentially added to the model. Since the order of this sequentially adding is important, we re-order the components on lines 2 and 3 of Listing 2.6 as follows acGLM, ageGLM, ct, brand, powerGLM, gas, log(dens) and area. This ordering is based on the rationale that the first one is the most important feature component and the last one is the least important one. The ANOVA is then done in R by the command anova.
In Listing 2.7 we present the results of this ANOVA. The column Df shows the number of model parameters /3/, 1 < I < q, the corresponding feature component uses. The column Deviance shows the amount of reduction in in-sample deviance loss D*(N,\) when sequentially adding this feature component, and the last column Resid.Dev shows the remaining in-sample deviance loss. We note that gas and area lead to a comparably small decrease of in-sample loss which may question the use of these feature components (in the current form). We can calculate the corresponding p-values of the x2-test statistics
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
51
Listing 2.6: Results of example GLM4
29
30
31
32
33
34
Call :
glmCformula =  claims   ~  powerGLM +  area +  log(dens)   +  gas  +  ageGLM +
acGLM +  brand +  ct,   family  =  poissonO,   data = dat,   offset  = log(expo))
Deviance Residuals:
Min IQ Median
-1.1373     -0.3820 -0.2838
Coefficients :
3Q
-0.1624
Max 4.3856
10		Estimate			Std	.. Error		z value		Pr(>!z ! )	
11	CIntercept)	-1	. 903e	ä+00	4.	699e	ä-02	-40	. 509	< 2e-16	* * *
12	powerGLM2	2	. 681e	ä-01	2 .	121e	ä-02	12	. 637	< 2e-16	* * *
13	powerGLM3	2	. 525e	ä-01	2 .	135e	ä-02	11	.828	< 2e-16	* * *
14	powerGLM4	1	.377e	ä-01	2 .	113e	ä-02	6	.516	7.22e-ll	* * *
15	powerGLM5	-2	.498e	ä-02	3.	063e	ä-02	-0	.816	0.414747	
16	powerGLM6	3	. 009e	ä-01	3.	234e	ä-02	9	.304	< 2e-16	* * *
17	powerGLM7	2	. 214e	ä-01	3.	240e	ä-02	6	.835	8.22e-12	* * *
18	powerGLM8	1	. 103e	ä-01	4.	128e	ä-02	2	. 672	0.007533	**
19	powerGLM9	-1	. 044e	ä-01	4.	708e	ä-02	-2	.218	0.026564	*
20	area	4	.333e	ä-02	1.	927e	ä-02	2	. 248	0.024561	*
21	log C dens)	3	. 224e	ä-02	1.	432e	ä-02	2	. 251	0.024385	*
22	gasRegular	6	.868e	ä-02	1.	339e	ä-02	5	. 129	2.92e-07	* * *
23	ageGLM18to20	1	. 142e	ä + 00	4.	836e	ä-02	23	. 620	< 2e-16	* * *
24											
25											
26	ctZG	-8	. 123e	ä-02	4.	638e	ä-02	-1	.751	0.079900	
27	---										
28	Signif. codes		0  ***   0.		. 001	**	0.01	* i	0. 05	.0.1 1	
(Dispersion  parameter  for poisson  family  taken  to  be 1)
Null deviance: 145532 on 499999 degrees of freedom Residual deviance: 140641 on 499943 degrees of freedom AIC: 191132
Listing 2.7: ANOVA results 1
1 Analysis   of   Deviance Table
2
3 Model:   poisson,   link: log
4
5 Response: claims
6
7 Terms   added  sequentially   (first  to last)
8
9
10		Df	Deviance		Resid. Df	Resid. Dev
11	NULL				499999	145532
12	acGLM	3	2927.	. 32	499996	142605
13	ageGLM	7	850.	. 00	499989	141755
14	ct	25	363.	. 29	499964	141392
15	brand	10	124.	.37	499954	141267
16	powerGLM	8	315 .	.48	499946	140952
17	gas	1	50.	. 53	499945	140901
18	log(dens)	1	255 .	. 22	499944	140646
19	area	1	5 .	. 06	499943	140641
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
52
Chapter 2.  Generalized Linear Models
Listing 2.8: ANOVA results 2
1 Analysis   of   Deviance Table
2
3 Model:   poisson ,   link: log
4
5 Response: claims
6
7 8	Terms adds	3d s	equentially		(first	to last)
9 10		Df	Deviance Re		sid. Df	Resid. Dev
11	NULL				499999	145532
12	acGLM	3	2927.	. 32	499996	142605
13	ageGLM	7	850.	. 00	499989	141755
14	ct	25	363.	. 29	499964	141392
15	brand	10	124.	.37	499954	141267
16	powerGLM	8	315 .	.48	499946	140952
17	gas	1	50.	. 53	499945	140901
18	area	1	255 .	. 20	499944	140646
19	log C dens)	1	5 .	. 07	499943	140641
Listing 2.9: drop 1 analysis
1 Single  term deletions
2
3 Model:
4 claims   -   acGLM +  ageGLM +  ct  +  brand + powerGLM +  gas  +  areaGLM +
5 log(dens)
6 Df   Deviance AIC LRT Pr(>Chi)
7 <none> 140641 191132
8 acGLM 3 142942 193426   2300.61 <  2.2e-16 ***
9 ageGLM 7 141485 191962 843.91 <  2.2e-16 ***
10 ct 25 140966 191406 324.86 <  2.2e-16 ***
11 brand 10 140791 191261 149.70 <  2.2e-16 ***
12 powerGLM 8 140969 191443 327.68 <  2.2e-16 ***
13 gas                 1 140667 191156 26.32 2.891e-07 ***
14 areaGLM         1 140646 191135 5.06 0.02453 *
15 log(dens)     1 140646 191135 5.07 0.02434 *
16 ---
17 Signif.   codes: 0   '***I   0.001   '**'   0.01   '*'   0.05   '.' 0.1
(2.15) with Df degrees of freedom. These p-values are all close to zero except for area and log (dens). These p-value are 2.5%, which may allow to work with a model reduced by area and/or log (dens).
One may argue that the ANOVA in Listing 2.7 is not "fully fair" for area because this feature component is considered as the last one in this sequential analysis. Therefore, we revert the order in a second analysis putting log (dens) to the end of the list, see Listing 2.8. Indeed, in this case we come to the conclusion that feature component log (dens) may not be necessary. In fact, these two feature components seem to be exchangeable which is explained by the fact that they are very highly correlated, see Table A.2 in the appendix.
For these reasons we study another model GLM5 where we completely drop area. The cross-validation results are provided on line (Ch2.5) of Table 2.5. These cross-validation
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
53
losses Ljj are slightly bigger for model GLM5 compared to model GLM4, therefore, we decide to keep area in the model, and we work with feature space (2.21) for the GLM. Remark that this decision is wrong because the estimation loss of 0.4128 in model GLM5 is smaller than the 0.4137 of model GLM4. However, these estimation losses £(\, A*) are not available in typical applications where the true data generating mechanism A* is not known.
	AIC
(Ch2.3) GLM3	191'445
(Ch2.4) GLM4	191'132
(Ch2.5) GLM5	191'135
Table 2.6: Akaike's information criterion AIC.
Alternatively to cross-validation, we could also compare Akaike's information criterion (AIC). In general, the model with the smallest AIC value should be preferred; for the validity of the use of AIC we refer to Section 4.2.3 in [141]. In view of Table 2.6 we reach to the same conclusion as with cross-validation losses in this example. The AN OVA in Listing 2.7 adds one feature component after the other. In practice, one usually does model selection rather by backward selection. Therefore, one starts with a complex/full model and drops recursively the least significant variables. If we start with the full model we can perform a dropl analysis which is given in Listing 2.9: this analysis individually drops each variable from the full model. Based on this analysis we would first drop areaGLM because it has the highest p-value (received from the Wald statistics, see Section 2.6). However, this p-value is smaller than 5%, therefore, we would not drop any variable on a 5% significance level. The same conclusion is drawn from AIC because the full model has the smallest value. This finishes the GLM example. ■
2.4.6   Generalized linear models: summary
In the subsequent chapters we will introduce other regression model approaches. These approaches will challenge our (first) model choice GLM4 having q + 1 = 57 parameters and feature space (2.21). We emphasize the following points:
• We did not fully fine-tune our model choice GLM4. That is, having n = 500'000 observations we could provide better feature engineering. For instance, we may explore explicit functional forms for ac or age. Moreover, in feature engineering we should also explore potential interactions between the feature components, see Remarks 2.9.
• Exploring functional forms, for instance, for age also has the advantage that neighboring 'age classes' stand in a neighborhood relationship to each other. This is not the case with the categorization used in Figure 2.2.
• Another issue we are going to meet below is over-parametrization and over-fitting. To prevent from over-fitting one may apply regularization techniques which may tell us that certain parameters or levels/labels are not needed. Regularization will be discussed in Section 4.3.2, below.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
54
Chapter 2.  Generalized Linear Models
• Linally, we may also question the Poisson model assumption. In many real applications one observes so-called zero-inflated claims counts which means that there are too many zero observations in the data. In this case often a zero-inflated Poisson (ZIP) model is used that adds an extra point mass to zero. If we face over-dispersion, then also an negative binomial model should be considered.
2.5    Classification problem
Lor classification we consider the following data
V={(Y1,x1),...,(Yn,xn)}, (2.22)
with features Xi 6 X and responses Yi taking values in a finite set y. Lor instance, we may consider genders 3^ = {female, male}. In general, we call the elements of 3^ classes or levels, and we represent the classes by a finite set of integers (labels) y = {0,..., J — 1}, for a given JeN. Lhese classes can either be of ordered type (e.g. small, middle, large) or of categorical type (e.g. female, male). Our goal is to construct a classification on X. Lhat is, we aim at constructing a classifier
C:X^y,       x^y = C(x). (2.23)
Lhis classifier may, for instance, describe the most likely outcome y 6 3^ of a (noisy) response Y having feature x. Lhe classifier C provides a finite partition of the feature space X given by
X=\JX(y),       X(y) = {x £ X; C(x) = y} . (2.24)
y&y
2.5.1    Classification of random binary outcomes
Lypically, also in classification the data generating mechanism is not known. Lherefore, we make a model assumption under which we infer a classifier C of the given data T>. Let us focus on the binary situation 3^ = {0,1}. We assume that all responses of the cases (Yi, x^ in T> are independent and generated by the following probability law
tvi(x) = F[Y = 1] = p(x)       and       ir0(x) = P [Y = 0] = 1 - p(x), (2.25)
for a given (but unknown) probability function
p:X->[Q,l],       x^p(x). (2.26)
Lormula (2.25) describes a Bernoulli random variable Y. A classifier C : X —> {0,1} can then be defined by
C(x) = argmax 7ry(x),
y&y
with a deterministic rule if both iry(x) are equally large. C(x) is the most likely outcome of Y = Y(x). Of course, this can be generalized to multiple response classes J > 2 with corresponding probabilities ttq(x), ..., irj_i(x) for features x 6 X.   Lhis latter
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
55
distribution is called categorical distribution. For simplicity we restrict to the binary (Bernoulli) case here.
If for all features x there exists y £ y with Try(x) = 1, there is no randomness involved, i.e. the responses Y are not noisy, and we obtain a deterministic classification problem. Nevertheless, also in this case we typically need to infer the unknown partition (2.24).
2.5.2   Logistic regression classification
Assume we consider the binary situation and the probability function (2.26) can be described by a logistic functional form through the scalar product (P,x) given in (2.2). This means that we assume for given j3 £ R9+1
p{x) =    exp(/3, ■£) or equivalents       (p, x) = log (   p{x)   ). (2.27)
l+exp(P,x) \l-p{x)J
The aim is to estimate /3 with MLE methods. Assume we have n independent responses in T>, given by (2.22), all being generated by a model of the form (2.25)-(2.26). The joint log-likelihood under assumption (2.27) is then given by, we set Y = (Y±,..., Yn)',
n n
= E Yi lo§K^) + (1 " Yi) log(l - p(Xi)) = m Y (p, Xl) - log(l + exp(/3, Xi)).
i=\ i=i
We calculate the partial derivatives of the log-likelihood function for 0 < / < q
P.yW   =   Y,(y,-- = dPi f^K       1 +exp(P,xl)J
Proposition 2.11. The solution P to the MLE problem (2.25)-(2.26) in the logistic case is given by the solution of
,_exp{Xp]_ _ 1 + exp{Xp}
for design matrix X given by (2.7) and where the ratio is understood element-wise.
If the design matrix X has full rank q+1 < n, the log-likelihood function is concave and, henceforth, the MLE is unique. Moreover, the root search problem of the score function in Proposition 2.11 is solved by Fisher's scoring method or the IRLS algorithm, see also Section 2.2.
The estimated logistic probabilities are obtained by
— / \    —/ \       exp(3,a;) ,       ^ , s     n    ^ , s 1
tti{x) = p{x) =--- and       t^o{x) = 1 — iri{x) —
l+exp(P,x) l+exp(P,x) This provides estimated classifier
C{x) = argmax tvv(x), (2.28)
y^y
with a deterministic rule if both 7Ty(x) are equally large.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
56
Chapter 2.  Generalized Linear Models
Remarks 2.12.
• Model (2.25)-(2.26) was designed for a binary classification problem with two response classes J = 2 (Bernoulli case). Similar results can be derived for multiple response classes J > 2 (categorical case).
• The logistic approach (2.27) was used to obtain a probability p{x) G (0,1). This is probably the most commonly used approach, but there exist many other (functional) modeling approaches which are in a similar spirit to (2.27).
• In machine learning, the logistic regression assumption (2.27) is also referred to the sigmoid activation function 4>{x) = (1 + e_x)_1 for iGR, We come back to this in the neural network chapter, see Table 5.1.
• In categorical problems with more than two classes one often replaces the logistic regression function by the so-called softmax function.
• Note that Try(x) is by far more sophisticated than C{x). For instance, if Y is an indicator whether a car driver has an accident or not, i.e.,
Y = 1{7V>1}>
then we have in the Poisson case
tti(cc) = p{x) = P [Y = 1] = P [N > 1] = 1 - exp{-X(x)v} = \{x)v + o (X(x)v),
(2.29)
as X{x)v —> 0. Thus, tvi(x) ~ X{x)v for typical car insurance frequencies, say, of magnitude 5% and one year at risk v = 1. This implies that in the low frequency situation we obtain classifier C{x) = 0 for all policies.
Moreover, for small expected frequencies X{x) we could also use the logistic regression modeling approach and (2.29) to infer the regression model, we refer to Sections 3.3 and 3.4 in Ferrario et al. [43].
In analogy to the Poisson case we can consider the deviance loss in the binomial case given by
D*(Y,p)   =   2{eY(Y)-eY0)) (2.30)
n
=   -2^Y (3, Xi) - log(l + exp(3, Xi)),
i=\
because the saturated model provides log-likelihood equal to zero. Similar to (2.15) this allows us for a likelihood ratio test for parameter reduction. It also suggests the Pearson's residuals and the deviance residuals, respectively,
= Yj -p(xj)
yjp{xi){\ -pix,))1
SP   =   sgn (Yi - 0.5) J-2 (Yi 0, x,) - log(l + exp(3, a^))).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 2.  Generalized Linear Models
57
Finally, for out-of-sample back-testing and cross-validation one often considers the probability of misclassification as generalization loss
P [Y + Č{x)\ ,
for a classifier C(-) estimated from randomly chosen i.i.d. data (Yí,Xí) ~ P, i = 1,..., n, where the distribution P has the meaning that we choose at random a policy with feature Xi and corresponding classifier Yi, and where (Y, x) is independent of C(-) having the same distribution as the cases (Yi,Xi).
Based on a learning sample V® we estimate the classifier by C®(-) and using the test sample V^c we can estimate the probability of misclassification defined by
=F[Y^ C{x)\ =        E \YiMty i2'31)
we also refer to Section 1.3.2. Thus, we use the loss function for misclassification
L(Y,C(x)) = l{Y^c(x)}. (2.32)
We can then apply the same techniques as in Section 1.3.2, i.e. the leave-one-out cross-validation or the K-iald cross-validation to estimate this generalization loss.
In relation to Remarks 2.12 one should note that for rare events the misclassification rate should be replaced by other loss functions. In the binary situation with iri(x) C 1/2 for all x G X, the trivial predictor Y = 0, a.s., obtains an excellent misclassification rate (because of missing sensitivity of this loss function towards rare events); in machine learning this problem is often called class imbalance problem. The binomial deviance loss (2.30) is also referred to the cross-entropy loss which may be used instead of misclassification.
Version October 27, 2021, M.V. Wůthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
58
Chapter 2.  Generalized Linear Models
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix to Chapter 2
2.6   Maximum likelihood estimation
Under fairly general conditions, the MLE satisfies asymptotic normality properties converging to a multivariate standard Gaussian distribution if properly normalized, see Theorem 6.2.3 in Lehmann [87].  In particular, the MLE is asymptotically unbiased with asymptotic variance described by the inverse of Fisher's information matrix. The log-likelihood function is in our Poisson GLM given by, see (2.4),
n
P ^ ^(/3) = E-exP(/3'^)^ + Ar^(/3'^)+log^)-1°g(Ar*!))
i=i
and the MLE is found by the roots of
I W> = °
Assuming full rank q + 1 < n of the design matrix X we receive a unique solution to this root search problem because under this assumption the log-likelihood function is concave. Fisher's information matrix is in this Poisson GLM given by the negative Hessian, see (2.9),
d2
Aß) = -E
^2 vl exp(/3, Xi)xitiXitr I = -Uß£N(ß).
0<l,r<q       \i=l / 0<l,r<q
Fisher's information matrix can be estimated by
( n
V*=l / 0<l,r<q
C n \
^wlexp|(x3)l}a;l,/a;l,r j = X' V~ X,
Vi=l / 0<l,r<q
with estimated diagonal working weight matrix
V~ = diag(exp j(£3)i}vi,---,exp j(x3)n}«n) = diag (Vexp{x3}) •
The inverse of the estimated Fisher's information matrix X(/3)_1 can now be used to estimate the covariance matrix of the estimate j3 — j3. In particular, we may use for 0 < / < q
*! = MxCP)-1),, , (2-33)
as the estimated standard error in the MLE /3/. The diagonal matrix V~ can be obtained from the R output by using d.glm$weights if we set d.glm <- glm(. . .).
59
Electronic copy available at: https://ssrn.com/abstract=2870308
60
Chapter 2.  Generalized Linear Models
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3
Generalized Additive Models
In the previous chapter we have assumed that the log-linear model structure (2.2) is appropriate for expected frequency modeling. We have met situations where this log-linear model structure has not really been justified by statistical analysis (feature components age or power). We have circumvented this problem by building categorical classes and then estimating a regression parameter for each of these categorical classes individually. This approach may lead to over-parametrization and may neglect dependencies between neighboring classes (if there is a natural ordering). In the present section we allow for more flexibility in model assumption (2.2) by considering generalized additive models (GAMs). This chapter is based on Hastie et al. [62], Ohlsson-Johansson [102], Pruscha [106], Wood [134] and Buhlmann-Machler [20].
3.1    Generalized additive models for Poisson regressions
Assume we are in modeling setup (2.1) and we would like to estimate the regression function A(-). In this chapter we replace the log-linear approach (2.2) by the following model assumption
where /;(■), I = 1,. . . ,q, are sufficiently nice functions and (3q 6 R denotes the intercept. Remarks 3.1.
• In approach (3.1) we assume that all feature components are real-valued. Categorical components are treated by dummy coding.
• Models of type (3.1) are called generalized additive models (GAMs), here applied to the Poisson case with log-link function (which is the canonical link for the Poisson model). GAMs go back to Hastie-Tibshirani [60, 61], we also refer to Wood [134].
• Approach (3.1) does not allow for non-multiplicative interactions between the feature components. More general versions of GAMs are available in the literature. These may capture other interactions between feature components by, for instance,
(3.1)
i=i
61
Electronic copy available at: https://ssrn.com/abstract=2870308
62
Chapter 3.  Generalized Additive Models
considering functions fix i2 (x^, x\2) for pair-wise interactions. For interactions between categorical and continuous feature components we also refer to Section 5.6 in [102]. Here, for simplicity, we restrict to models of type (3.1).
• At the moment, the functions //(■), / = 1,. .. , q, are not further specified. A necessary requirement to make them uniquely identifiable in the calibration procedure is a normalization condition. For given data T> we require that the functions // satisfy for all / = 1,.. . , q
1 n
-E/'K/)=0- (3-2)
1=1
• The log-link in (3.1) leads to a multiplicative tariff structure
X(x) = exp{(30}l[exp{fl(xl)}. (3.3)
i=i
Normalization (3.2) then implies that exp{/3o} describes the base premium and we obtain multiplicative correction factors exp{//(a;/)} relative to this base premium for features x G X, we also refer to (2.10).
In the sequel of this section we consider natural cubic splines for the modeling of //. 3.1.1   Natural cubic splines
A popular approach is to use natural cubic splines for the functions //(■) in (3.1). Choose a set of m knots u± < ... < um on the real line R and define the function / : [u±, um] —> R as follows:
f(x) = ht(x),        for x G [ut,ut+i), (3.4)
where for t = 1,. . ., m — 1 we choose cubic functions ht(x) = at + $tx + Tt^2 + $tx^ on R. For the last index t = m — 1 we extend (3.4) to the closed interval [um-i,um]. We say that the function / defined in (3.4) is a cubic spline if it satisfies the following differentiability (smoothing) conditions in the (internal) knots U2 < ■ ■ ■ < um—\
ht-i{ut) = ht(ut),       h[Mi(ut) = K(ut)       and       h"_1(ut)=h,"(ut), (3.5)
for all t = 2,..., m — 1. Function / has 4(m — 1) parameters and the constraints (3.5) reduce these by 3(m — 2). Therefore, a cubic spline has m + 2 degrees of freedom. Observe that (3.5) implies that a cubic spline is twice continuously differentiable on the interval (ui,um). If we extend this cubic spline / twice continuously differentiable to an interval [a, b] D [jM,um] with linear extensions on [a,iti] and [um,b], we call this spline natural cubic spline. This provides two additional boundary constraints f"(x) = 0 on [a,iti] U [ttm,6], thus, h'{(ui) = /i^l_1(ttm) = 0, and reduces the degrees of freedom by 2. Therefore, a natural cubic spline has m degrees of freedom.
Note that a natural cubic spline can be represented by functions x i—> (x — ttt)+, t = 1,... ,171. Namely,
m mm
f(x)=ao+'&ox + '^2ct{x-ut)+,        with ^c( = 0 and ^« = 0, (3.6)
t=i t=i t=i
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
63
gives a natural cubic spline. The two side constraints ensure that we have a smooth linear extension above um. This again provides m degrees of freedom and these natural cubic splines (with m given knots) build an m-dimensional linear space. The knots ui < ... < um play a particularly special role: assume that in these knots we are given values /J,. .. , Then, there exists a unique natural cubic spline / on [a, b] with f{ut) = /t* for all i = 1,..., m, see Theorem 5.1 in [102].
We would like to solve the following optimization problem:
Find the intercept /3q and the natural cubic splines fi, ■ ■ ■ ,fq satisfying normalization conditions (3.2) such that the following expression is minimized:
with observations AT" = (Ni,...,Nn)', intercept and natural cubic splines given by / = (/?o; fi, ■ ■ ■, fq) and determining the GAM regression function A(-) by (3.1), tuning parameters r\\ > 0, 1 = 1,.. . ,q, and Poisson deviance loss given by
where the right-hand side is set equal to 2\{xi)vi for iVj = 0. The supports [a/, 6/] of the natural cubic splines // should contain all observed feature components xn of T>.
Remarks 3.2.
• We refer to page 35 for the interplay between maximizing the log-likelihood function and minimizing the deviance loss. In short, we are minimizing the in-sample loss in (3.7), modeled by the deviance loss D*(N, A), which is equivalent to maximizing the corresponding log-likelihood function, subject to regularization conditions that we discuss next.
• The regularization conditions for the natural cubic splines //
guarantee in the optimization of (3.7) that the resulting (optimal) functions // do not look too wildly over their domains [a/, b{\ D [uii, umii], where u±i,. . ., umii are the mi knots of fi. In particular, we want to prevent from over-fitting to the data T>. The tuning parameters r\i > 0 balance the influence of the regularization conditions (3.8). Appropriate tuning parameters are (often) determined by cross-validation.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
i=i
(3.7)
(3.8)
Electronic copy available at: https://ssrn.com/abstract=2870308
64
Chapter 3.  Generalized Additive Models
Usually, one starts from a more general optimization problem, namely, minimize 9 rh
D*(N, a)   +   5> i' {fl'{xl)fdxl i=i Jai
E2
i=i
ßo+EU M*i,i)v. -Ni-Ni      +     /zKz)) + Ni log^Ni/vi
+ fl U'l'{xl))2dxl (3.9)
over fto G K and over all (normalized) functions /i,.. . ,M that are twice continuously differentiable on their domains.
> Assume that the feature component x\ has exactly m; < n different values a;^ j < ... < x* l among the observations T>. This implies that the deviance loss D*(N,X) only depends on these values j*x = ji{x*i) for i = 1,... ,m; and / = 1,..., g, see middle line of (3.9).
> Theorem 5.1 (part 1) in [102] says that for given values (x*l7 /*;)i=i,...,m; there exists a unique natural cubic spline f* on [a/, b\\ with /;*(a;*;) = /*; for all i = 1,... ,mj. Thus, a choice (a;*;, f*i)i=i,...,mi completely determines the natural cubic spline /*.
> Theorem 5.1 (part 2) in [102] says that any twice continuously differentiable function // with fi(x*j) = f*j for all i = 1,. .. , mj, has a regularization term (3.8) that is at least as big as the one of the corresponding natural cubic spline /*.
For these reasons we can restrict the (regularized) optimization to natural cubic splines. This basically means that we need to find the optimal values (/*;)i=i,...,m; in the feature components/knots (3^/)i=i,...m| subject to regularization conditions (3.8) for given tuning parameters r\\ > 0 and normalization (3.2) for / = 1,. .. , q.
• Note that m; < n is quite common, for instance, our n = 500'000 car drivers only have mi = 90 — 18 + 1 = 73 different drivers' ages, see Appendix A. In this case, the deviance loss D*(N,X) is for feature component x\ completely determined by the values fgt = fiix*^, i = 1,... ,m; = 73. For computational reasons one may even merge more of the feature component values (by rounding/bucketing) to get less values f*t and knots un = x*t in the optimization.
Since natural cubic splines with given knots x*i,..., x^ l build an m/-dimensional linear space, we may choose m; linearly independent basis functions s±i,.. . ,smii (of the form (3.6)) and represent the natural cubic splines // on [a/, 6/] by
mi
fl(xl) = Y1 Pk,lSk,l(xi), (3.10) k=l
for unique constants Pii,.. . ,/3mij. Observe that (3.10) brings us fairly close to the GLM framework of Chapter 2. Assume that all natural cubic splines // are represented in the form (3.10), having parameters ftii,.. . ,/3mii and corresponding basis functions
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
65
sii,..., smui, then we can rewrite (3.1) as
q mi
logAfa;) =/3o + EE/3MsM(a;/)   =f' (0, *(*)>, (3-H)
z=i fc=l
where the last identity defines the scalar product between /3 = (/3o, /5i,i, • • •, Pmq,q)' and s(x) = (1, si,i(a;i),..., smq^q(xq)Y £ W+1. Moreover, note that for these natural cubic splines we have
/   (fi'(xi)) dxi   =   /       \Z^Pk,iSk,i(xi)\ dxi
Jai Juu      \k=1 J
ml fUm  i ml
=      Y  Pk,lPj,l  /      1    4tl(xi)s"j(xi)dxi   A=       Yj  Pk,lPj,l ^fc]-k,j=l Ul<1 k,j=l
Therefore, optimization problem (3.7) is transformed to (we drop the irrelevant terms): Minimize over j3 £ W+1 the objective function
n
^2[exp(/3,S(;E^)>^-iV^(/3,s(;E^)>]+/3, n(rj) /3, (3.12)
i=i
with block-diagonal matrix Q(rj) = diag(0, Q\,..., Qd) S M(r+1)x(r'+1) having blocks
nl = nl(rn) = vi , £Rm>xm>,
V    " / k,j=l,...,mi
for / = 1,. .. , q, tuning parameters rj = (771,.. . and under side constraints (3.2).
The optimal parameter j3 of this optimization problem is found numerically. Remarks 3.3.
• The normalization conditions (3.2) provide identifiability of the parameters:
n n   mi mi   / n \
0 = Y Mx*,i) = Yl Yl fik,isk,i{xi,i) = Y (^fe.z Yl sk,i(xi,i)) •
i=i i=i /«=i /«=i V   i=i /
Often these conditions are not explored at this stage, but functions are only adjusted at the very end of the procedure. In particular, we first calculate a relative estimator and the absolute level is only determined at the final stage of the calibration. This relative estimator can be determined by the back-fitting algorithm for additive models, for details see Algorithm 9.1 in Hastie et al. [62] and Section 5.4.2 in Ohlsson-Johansson [102].
• A second important remark is that the calculation can be accelerated substantially if one "rounds" the feature components, i.e., for instance, for the feature component dens we may choose the units in hundreds, which substantially reduces m; and, hence, the computational complexity, because we obtain less knots and hence a lower dimensional j3. Often one does not lose (much) accuracy by this rounding and, in particular, one is less in the situation of a potential over-parametrization and over-fitting, we also refer to Section 2.4.3 on data compression.
• In the example below, we use the command gam from the R package mgcv. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
66
Chapter 3.  Generalized Additive Models
3.1.2   Example in motor insurance pricing, revisited
We present two applications of the GAM approach. We revisit the car insurance example of Section 2.4 which is based on the data generated in Appendix A. The first application only considers the two feature components age and ac, this is similar to Example 2.10. The second application considers all feature components, similarly to Section 2.4.5.
Example 3.4 (example GAM1). We consider the same set-up as in Example 2.10 (GLM1) but we model the feature components age and ac by natural cubic splines. Note that we have mi = 73 different ages of drivers and m-2 = 36 different ages of cars. For computational reasons, it is important in GAM calibrations that data is compressed accordingly, as described in Section 2.4.3. This gives us at most m*m,2 = 73 ■ 36 = 2'628 different risk cells. In our data T>, only 2'223 of these risk cells are non-empty, i.e. have a volume v£ > 0, for the latter see also (2.18). Thus, the data is reduced from n = 500'000 observations to a sufficient statistics of size d = 2'223.1
Listing 3.1: Results of example GAM1
1 Family: poisson
2 Link  function: log
3
4 Formula:
5 claims   ~   sCage,   bs  =   "er",   k =  kl)   +  s(ac,   bs  =  "er",   k =  k2)
6
7 Parametric coefficients:
8 Estimate   Std.   Error  z  value Pr(>!z!)
9 (Intercept)   -2.39356 0.02566     -93.29       <2e-16 ***
10 ---
11 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
12
13 Approximate   significance   of   smooth terms:
14 edf   Ref.df   Chi.sq p-value
15 s(age)   12.48     15.64       1162     <2e-16 ***
16 s(ac)     17.43     20.89       3689     <2e-16 ***
17 ---
18 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
19
20 R-sq.(adj)   =     0.963       Deviance   explained = 65%
21 UBRE =   -0.012852     Scale   est.   =1 n = 2223
The natural cubic splines approach is implemented in the command gam of the R package mgcv, and the corresponding results are provided in Listing 3.1. The feature components s (age ,bs=" cr" ,k=kl) and s(di,bs="cr" ,k=k2) are being modeled as continuous variables (indicated by s (■)) and we fit cubic splines (indicated by bs=" cr"). The parameters kl = mi = 73 and k2 = m-2 = 36 indicate how many knots we would like to have for each of the two feature components age and ac. Note that we choose the maximal number of possible knots (different labels) here, see also Remarks 3.2. This number is usually too large and one should choose a lower number of knots for computational reasons. We did not specify the tuning parameters r\\ in Listing 3.1. If we drop these tuning parameters in the R command gam, then a generalized cross-validation (GCV) criterion or an unbiased risk estimator (UBRE) criterion is applied to determine good tuning parameters
1This data compression reduces the run time of the GAM calibration from 1'018 seconds to 1 second!
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
67
(internally). The GCV criterion considers a scaled in-sample loss given by
GCVfo) = (1 ~ M(rj)/n)-2 £% =      _ 71      ^ (3.13)
where M(tj) is the effective degrees of freedom of the model. This effective degrees of freedom is obtained from the corresponding influence matrix, for more details we refer to Wood [134], Section 3.2.3, and Hastie et al. [62], Section 7.10.1. The GCV criterion has the advantage over K-iold cross-validation that it is computationally much more fast, which is important in the optimization algorithm. We can extract the resulting optimal tuning parameters with the command gam$sp which provides (771,772) = (273'401,2'591) in our example. The column edf in Listing 3.1 shows the effective degrees of freedom which corresponds to the number of knots needed (for this optimal tuning parameters), for details we refer to Section 5.4.1 in Hastie et al. [62]. If edf is close to 1 then we basically fit a straight line which means that we can use the GLM (log-linear form) for this feature component. The last two columns on lines 15-16 of Listing 3.1 give an approximate x2-test for assessing the significance of the corresponding smooth term. The corresponding p-values are roughly zero which says that we should keep both feature components.
	5   S "		I		1 ™ ■	|||(
		I				
20       40       60       80 0   5   10       20       30 20       40       60       80 0   5   10       20 30
Figure 3.1: GAM1 results with (771,772) = (273'401, 2'591), and kl = 73 and k2 = 36; (lhs) fitted splines s(age,bs="cr" ,k=kl) and s(di,bs="cr" ,k=k2) excluding observations, and (rhs) including observations; note that (771,772) is determined by GCV here.
In Figure 3.1 we provide the resulting marginal plots for age and ac. These are excluding (lhs) and including (rhs) the observations (and therefore have different scales on the y-axis). Moreover, we provide confidence bounds in red color. From these plots it seems that age is fitted well, but that ac is over-fitting for higher ages of the car, thus, one should either decrease the number of knots k2 or increase the tuning parameter 772. In Figure 3.2 we provide the resulting two dimensional surface (from two different angles). We observe a rather step slope for small age and small ac which reflects the observed frequency behavior.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
68
Chapter 3.  Generalized Additive Models
Figure 3.2: GAM1 results with (771,772) = (273'401. 2'591), and kl = 73 and k2 = 36 from two different angles.
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev	/•cv Id	£(A,A*)	Ma >~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Clil.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.1) GLM1	3.1s	11	28.3543	28.3544	0.6052	28.3510	10.2691%
(Ch3.1) GAM1	1.1s	108	28.3248	28.3245	0.5722	28.3134	10.2691%
Table 3.1: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model a*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 2.4.
In Table 3.1 we provide the corresponding prediction results on line (Ch3.1). The number of parameters '# param.' is determined by the number of chosen knots minus one, i.e. kl + k2 — 1 = mi + 777,2 — 1 = 73+ 36 —1 = 108, where the minus one corresponds to the intercept parameter minus two normalization conditions (3.2). We observe a smaller in-sample loss of GAM1 compared to GLM1, i.e. the GAM can better adapt to the data T> than the GLM (considering the same feature components age and ac). This shows that the choices of the categorical classes for age and ac can be improved in the GLM approach. This better performance of the GAM carries over to the cross-validation losses jC^,v and the estimation loss £(X, a*). We also note that the difference between the in-sample loss and the cross-validation loss increases, which shows that the GAM has a higher potential for over-fitting than the GLM, here.
In Figure 3.3 we present the results for different choices of the tuning parameters 77/. For small tuning parameters (77/ = 10 in our situation) we obtain wiggly pictures which follow closely the observations. For large tuning parameters (77/ = IO'000'OOO in our situation) we obtain rather smooth graphs with small degrees of freedom, see Figure 3.3 (rhs). Thus, we may get either over-fitting (lhs) and under-fitting (rhs) to the data.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
69
20       40 60
0   5   10       20 30
20       40 60
~i—i—i—I—i—i—r
0   5   10       20 30
Figure 3.3: GAM1 results with (lhs) (771,772) (IO'000'OOO, IO'000'OOO) for kl = 73 and k2 = 36.
(10,10)  and  (rhs) (771,772)
0   5   10       20 30
Figure 3.4: GAM1 results with number of knots (lhs) kl = 12 and k2 = 17 and (rhs) kl = 5 and k2 = 5; note that (771,772) is determined by GCV here.
In Figure 3.4 we change the number of knots (kl,k2) and we let the optimal tuning parameters be determined using the GCV criterion. We note that for less knots the picture starts to smooth and the confidence bounds get more narrow because we have less parameters to be estimated. For one knot we receive a log-linear GLM component. Finally, we evaluate the quality of the GCV criterion. From Listing 3.1 we see that the effective degrees of freedom are M(tj) = 1 + 12.48 + 17.43 = 30.91. Thus, we obtain
GCVfa)
C
Kn-M(r])J    D     V500'000 - 30.91, This is smaller than the K-iold cross-validation errors £gv on line (Ch3.1) of Table 3.1.
500'000
28.3134 ■ 10"2 = 28.3166 ■ 10"2.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
70
Chapter 3.  Generalized Additive Models
This indicates that GCV is likely to under-estimate over-fitting here. This closes our first GAM example. ■
Example 3.5 (example GAM2/3). In our second example we consider all available feature components. We keep gas, brand and ct as categorical. All other feature components are modeled with natural cubic splines, using the GCV criterion to determine the optimal tuning parameters r\\. To keep computational time under control we use data compression, and before that we modify log (dens )/2 to be rounded to one digit (which gives 48 different values). The data is then compressed correspondingly leading to d = 235'163 compressed observations.2 This compression gives us 73 + 36 + 12 + 2 + 11 + 6 + 48 + 26 + 1 - 8 = 207 parameters to be estimated. This should be compared to Conclusion 2.8.
Listing 3.2: Results of example GAM2
1 Family: poisson
2 Link  function: log
3
4 Formula:
5 claims   ~   sCage,   bs="cr",   k=73)   +  s(ac,   bs="cr",   k=36)   +  s(power,   bs="cr", k=12)
6 +  gas  + brand  +  s(area,   bs="cr",   k=6)   +  s(densGAM,   bs="cr",   k=48)   + ct
7
8 Parametric coefficients:
9 Estimate   Std.   Error    z  value  Pr(>!z!)
10 (Intercept)   -2.254150       0.020420   -110.391     <  2e-16 ***
11 gasRegular       0.073706       0.013506 5.457  4.83e-08 ***
12 brandBlO 0.044510       0.045052 0.988 0.323174
13 .
14 brandB5 0.107430       0.029505 3.641   0.000271 ***
15 brandB6 0.020794       0.033406 0.622 0.533627
16 ctAG -0.094460       0.027672       -3.414  0.000641 ***
17 .
18 ctZG -0.079805       0.046513       -1.716  0.086208 .
19 ---
20 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
21
22 Approximate   significance   of   smooth terms:
23 edf   Ref.df       Chi.sq p-value
24 s(age) 13.132   16.453   1161.419     <2e-16 ***
25 s(ac) 17.680   21.147  2699.012     <2e-16 ***
26 s(power) 9.731   10.461     311.483     <2e-16 ***
27 s(area) 1.009     1.016 5.815     0.0164 *
28 s(densGAM)     5.595     7.071       10.251 0.1865
29 ---
30 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
31
32 R-sq.(adj)   =     0.115       Deviance   explained = 4.747,
33 UBRE =   -0.56625     Scale   est.   =1 n = 235163
The estimation results are presented in Listing 3.2. The upper part (lines 10-18) shows the intercept estimate (3o as well as the estimates for the categorical variables gas, brand and ct. A brief inspection of these numbers shows that we keep all these variables in the model.
2The data compression takes 44 seconds.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
71
123456 02468 10
area densGAM
Figure 3.5: GAM2 results: marginal splines approximation of the continuous feature components age, ac, power, area and log(dens).
The lower part of Listing 3.2 (lines 24-28) shows the results of the 5 continuous feature components age, ac, power, area and log(dens). The first 3 continuous feature components should be included in this natural cubic spline form. The effective degrees of freedom of the feature component area is very close to 1, which suggests that this component should be modeled in log-linear form. Finally, the component log(dens) does not seem to be significant which means that we may drop it from the model. In Figure 3.5 we show the resulting marginal spline approximations, which confirm the findings of Listing 3.2, in particular, area is log-linear whereas log(dens) can be modeled by almost a horizontal line (respecting the red confidence bounds and keeping in mind that area and dens are strongly positively correlated, see Table A.2).
In Table 3.2 we provide the resulting losses of model GAM2 on line (Ch3.2). Firstly, we see that the run time of the calibration takes roughly 10 minutes, i.e. quite long. This run time could be reduced if we choose less knots in the natural cubic splines. Secondly, we observe that the estimation loss £(A,A*) and the in-sample loss C1^ become smaller
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
72
Chapter 3.  Generalized Additive Models
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	/•cv	/•CV J~D	£(A,A*)	/•is J~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4) GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.2) GAM2	678s	207	-	-	0.3877	28.0927	10.2690%
(Ch3.3) GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
Table 3.2: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and param.' gives the number of estimated model parameters, this table follows up from Table 2.5.
compared to model GLM4 which shows that the categorization in GLM4 is not optimal. We refrain from calculating the cross-validation losses for model GAM2 because this takes too much time. Finally, in Figure 3.6 we illustrate the resulting two-dimensional surfaces of the continuous feature components ac, power, age and log(dens).
Figure 3.6: GAM2 results of the continuous components ac, power, age and log (dens).
In view of Listing 3.2 we may consider a simplified GAM. We model the feature components age and ac by natural cubic splines having kl = 14 and k2 = 18 knots, respectively, because the corresponding effective degrees of freedom in Listing 3.2 are 13.132 and 17.680 (the tuning parameters r\\ are taken equal to zero for these two components). The component power is transformed to a categorical variable because the effective degrees of freedom of 9.731 is almost equal to the number of levels (these are 11 parameters under dummy coding), suggesting that each label should have its own regression parameter /3/. The component area is considered in log-linear form, the component dens is dropped from the model, and gas, brand and ct are considered as categorical variables. We call this model GAM3. The calibration of this GAM takes roughly 50 seconds, which makes it feasible to perform K-fo\d cross-validation.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
73
Listing 3.3: Results of example GAM3
1 Family: poisson
2 Link  function: log
3
4 Formula:
5 claims   ~   sCage,   bs  =   "cr",   k =  14)   +  s(ac,   bs  =  "cr",   k = 18)
6 + powerCat  +  gas  + brand +  area + ct
7
8 Parametric coefficients:
9 Estimate   Std.   Error  z  value  Pr(>!z!)
10 (Intercept)   -2.674145       0.032674   -81.844     <  2e-16 ***
11 powerGAM2 0.264250       0.021278     12.419     <  2e-16 ***
12 .
13 powerGAM12     -0.275665       0.109984     -2.506   0.012197 *
14 gasRegular       0.072807       0.013526       5.383  7.34e-08 ***
15 brandBlO 0.045986       0.045066       1.020 0.307524
16 .
17 brandB5 0.107218       0.029511       3.633   0.000280 ***
18 brandB6 0.021155       0.033407       0.633 0.526560
19 area 0.084586       0.005360     15.782     <  2e-16 ***
20 ctAG -0.102839       0.027335     -3.762   0.000168 ***
21 .
22 ctZG -0.085552       0.046349     -1.846   0.064917 .
23 ---
24 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
25
26 Approximate   significance   of   smooth terms:
27 edf   Ref.df   Chi.sq p-value
28 s(age)     13 13       1192     <2e-16 ***
29 s(ac)       17 17       2633     <2e-16 ***
30 ---
31 Signif.   codes:     0  ***   0.001   **  0.01  *   0.05   .   0.1 1
32
33 R-sq.(adj)   =     0.0334       Deviance   explained  = 3.44%
34 UBRE =   -0.71863     Scale   est.   =1 n = 500000
The results are presented on line (Ch3.3) of Table 3.2 and in Listing 3.3 (we do not choose any regularization here because the chosen numbers of knots are assumed to be optimal). We conclude that this model GAM3 has a clearly better performance than model GLM4. It is slightly worse than GAM2, however, this comes at a much lower run time. For this reason we prefer model GAM3 in all subsequent considerations.
In Figure 3.7 (lhs) we provide the scatter plot of the resulting estimated frequencies (on the log scale) between the two models GLM4 and GAM3. The colors illustrate the years at risk vl on a policy level. The cloud of frequencies lies on the 45 degrees axis which indicates that model GLM4 and model GAM3 make similar predictions A(a^) on a policy level. In Figure 3.7 (middle and rhs) we compare these predictions to the true frequencies A*(a^) (which are available for our synthetic data). These two clouds have a much wider diameter which says that the models GLM4 and GAM3 have quite some room for improvements. This is what we are going to explore in the next chapters. This finishes the GAM example. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
74
Chapter 3.   Generalized Additive Models
comparison of frequencies comparison of frequencies comparison of frequencies
log frequency GLM4 log of true frequency log or true frequency
Figure 3.7: Resulting estimated frequencies (on the log scale) of models GLM4 and GAM3, also compared to the true frequencies.
3.1.3   Generalized additive models: summary
We have seen that GAMs provide an improvement over GLMs for non log-linear feature components. In particular, GAMs make it redundant to categorize non log-linear feature components in GLM applications. This has the advantage that the ordinal relationship is kept. The drawback of GAMs is that they are computationally more expensive than GLMs.
We have not explored interactions beyond multiplicative structures, yet. This may be a main reason why the comparisons in Figure 3.7 (middle and rhs) do not look fully convincing. Having 500'000 observation we could directly explore such interactions in GLMs and GAMs, see also Remarks 2.9. We refrain from doing so but we will provide other data driven methods below.
3.2   Multivariate adaptive regression splines
The technique of multivariate adaptive regression splines (MARS) uses piece-wise linear basis functions. These piece-wise linear basis functions have the following two forms on X c W:
x h-> h_tj(x) = (xi - t)+       or       x h-> h+tj(x) = (t - xi)+, (3.14)
for a given constant t and a given feature component x\. The pair in (3.14) is called reflected pair with knot t for feature component x\. These reflected pairs consist of so-called rectified linear unit (ReLU) or hinge functions. We define the set of all such reflected pairs that are generated by the features in T>. These are given by the basis
H = {h_tl,i(-), h+thi{-); I = 1,.. . , q and t/ = x^h . .. , xnj} .
Note that ~H spans a space of continuous piece-wise linear functions (splines). MARS makes the following modeling approach: choose the expected frequency function as
M
x h-> logA(cc) = (30 +      (3mhm(x), (3.15)
m=l
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 3.  Generalized Additive Models
75
for given M > 0, and hm{-) are functions from the basis ~H or products of such basis functions in T-L. The latter implies that the non-zero part of hm(-) can also be non-linear.
Observe that this approach is rather similar to the GAM approach with natural cubic splines. The natural cubic splines are generated by functions x \—> {x\ — i)+ with t = Xij, see (3.6), which can be obtained by multiplying the basis function h_t,i{x) 6 ~H three times with itself.
This model is fit by an iterative (stage-wise adaptive) algorithm that selects at each step a function hm(-) of the present calibration, and splits this function into the two (new) functions
x     hmi(x) = hm(x) (xi - xhi)+        and        x ^ hm2(x) = hm(x) (xhi - xi)+,
for some / = 1,.. . , q and i = 1,. . ., n. Thereby, it chooses the additional optimal split (in terms of m, I and xn) to obtain the new (refined) model
logA(cc)   =   (30 + E Pm'hm>(x) + (3mihm(x)h_Xi]lii(x) + (3m2hm(x)h+Xi]lii(x)
=   Po + E Pm'hm'{x) + l3mihmi(x) + l3m2hm2(x). (3.16)
The optimal split can be determined relative to a loss function, for instance, one can consider the resulting in-sample loss from the Poisson deviance loss. This way we may grow a large complex model (in a stage-wise adaptive manner). By backward pruning (deletion) some of the splits may be deleted again, if they do not contribute sufficiently to the reduction in cross-validation error. For computational reasons, in MARS calibrations one often uses the GCV criterion, which provides a more crude error estimation in terms of an adjusted in-sample error, see (3.13) above and (9.20) in Hastie et al. [62]. We do not provide more details on pruning and deletion here, but pruning in a similar context will be treated in more detail in Chapter 6 in the context of regression trees. We remark the that refinement (3.16) can also be understood as a boosting step. Boosting is going to be studied in Section 7.4.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
76
Chapter 3.  Generalized Additive Models
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4
Credibility Theory
In Chapter 2 on GLMs we have seen that many years at risk are needed in order to draw statistically firm conclusions. This holds true, in particular, for problems with low expected frequencies. In this chapter we present credibility methods that allow us to smooth predictors and estimators with other sources of information in situations where we do not have sufficient volume or in situations where we have outliers. Credibility methods are based on Bayesian statistics. We do not present the entire credibility theory framework here, but we only consider some special cases that are related to the Pois-son claims frequency model and to a binary classification problem. For a comprehensive treatment of credibility theory we refer to the book of Buhlmann-Gisler [18]. Along the way we also meet the important technique of regularization. Moreover, we discuss simulation methods such as the Markov chain Monte Carlo (MCMC) method to numerically calculate Bayesian posterior distributions.
4.1   The Poisson-gamma model for claims counts 4.1.1    Credibility formula
In Section 1.2 we have assumed that N ~ Poi(Av) for a fixed expected frequency parameter A > 0 and for given years at risk v > 0. In this chapter we do not assume that the expected frequency parameter A is fixed, but it is modeled by a strictly positive random variable A. This random variable A may have different interpretations: (i) there is some uncertainty involved in the true expected frequency parameter and we reflect this uncertainty by a random variable A; (ii) we have a heterogeneous portfolio of different risks and we choose at random a policy from this portfolio. The latter is similar to a heterogeneous situation where a priori all policies are equivalent. Here, we make a specific distributional assumption for A by choosing a gamma prior distribution having density for 6 e K+
with shape parameter 7 > 0 and scale parameter c > 0. Note that the mean and variance of the gamma distribution are given by 7/c and 7/c2, respectively. For more information on the gamma distribution we refer to Section 3.2.1 in Wuthrich [135].
77
Electronic copy available at: https://ssrn.com/abstract=2870308
78
Chapter 4.  Credibility Theory
Definition 4.1 (Poisson-gamma model). Assume there is a fixed given volume v > 0.
• Conditionally, given A, the claims count N ~ Poi(Aw).
• A ~ r(7, c) with prior shape parameter 7 > 0 and prior scale parameter c > 0.
Theorem 4.2. Assume N follows the Poisson-gamma model of Definition J^.l. The posterior distribution of A, conditional on N, is given by
A\N ~ r(7 + iV, c + v).
Proof. The Bayes' rule says that the posterior density of A, given N, is given by (up to normalizing constants)
it(9\N)    cc    e-°" ^ F-1e-cB cc ^+441e-<c+»>fl.
This is a gamma density with the required properties.
Remarks 4.3.
• The posterior distribution is again a gamma distribution with updated parameters. The parameter update is given by
7 1—y 7P°st = 7 + AT       and       c ^ cpost = c + v.
Often 7 and c are called the prior parameters and jP°st and cpost the posterior parameters.
• The remarkable property of the Poisson-gamma model is that the posterior distribution belongs to the same family of distributions as the prior distribution. There are more examples of this kind, many of these examples belong to the EDF with conjugate priors, see Buhlmann-Gisler [18].
• For the estimation of the unknown parameter A we obtain the following prior estimator and the corresponding posterior (Bayesian) estimator
A  d==f'  E[A] = 1, c
AP°st   d==f'  E[A\N] = 1^1 = 1±^.
L  1   1       cP°st        c + v
• Assume that the claims counts N±,..., Nn are conditionally, given A, independent and Poisson distributed with conditional means Avi, for i = 1,. . . ,n. Lemma 1.3 implies that
n / n
n = Y.n*\a ^ Poi A$>«
i=i V 1=1
Thus, we can apply Theorem 4.2 also to the aggregate portfolio = J2i Ni, if the claims counts iVj all belong to the same frequency parameter A.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
79
• The unconditional distribution of N under the Poisson-gamma model is a negative binomial distribution, see Proposition 2.20 in Wuthrich [135].
Corollary 4.4. Assume N follows the Poisson-gamma model of Definition J^.l. The posterior estimator Apost has the following credibility form
Apost = a\ + (l-a) A,
with credibility weight a and observation based estimator A given by
v       .    . N a = - G (0,1)        and        X = —.
c + v v
The (conditional) mean square error of this estimator is given by
E
(A - Apost)
N
post 2
^post^2
(1 - a) - Apost.
c
Proof. In view of Theorem 4.2 we have for the posterior mean
Apost    =    1±_- = ^_--JV+_--l = QA + (l-a)A.
c + v        c + v v c + v c
This proves the first claim. For the estimation uncertainty we have
E^A-A^j   JVJ    =   Var(A|iV)  = j^-—yT = (1 - a) - Apost. This proves the claim. □ Remarks 4.5.
• Corollary 4.4 shows that the posterior estimator Apost is a credibility weighted average between the prior guess A and the MLE A with credibility weight a G (0,1).
• The credibility weight a has the following properties (which balances the weights given in the posterior estimator Apost):
1. for volume v —> oo: a —> 1;
2. for volume v —> 0: a —> 0;
3. for prior uncertainty going to infinity, i.e. c —> 0: a —> 1;
4. for prior uncertainty going to zero, i.e. c —> oo: a —> 0.
Note that
Var (A) = 0L = - \.
cz c
For c large we have an informative prior distribution, for c small we have a vague prior distribution and for c = 0 we have a non-informative or improper prior distribution. The latter means that we have no prior parameter knowledge (this needs to be understood in an asymptotic sense). These statements are all done under the assumption that the prior mean A = 7/c remains constant.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
80
Chapter 4.  Credibility Theory
• In motor insurance it often happens that N = 0. In this case, A = 0 and we cannot meaningfully calibrate a Poisson model with MLE because we receive a degenerate model. However, the resulting posterior estimator Apost = (1 — a)A > 0 still leads to a sensible model calibration in the credibility approach. This is going to be important in the application of Poisson regression trees in Chapter 6.
4.1.2 Maximum a posteriori estimator
In the previous section we have derived the posterior estimator Apost for the claims frequency based on observation N. We have calculated the posterior distribution of A, given N. Its log-likelihood function is given by
log7r(0|iV)   oc   (~f + N-l)log9-(c + v)£.
If we maximize this log-likelihood function we receive the maximum a posteriori (MAP) estimator given by
^map = 7 + N -1 = -    t _ a
C + V V
Note that the MAP estimator is always positive for 7 > 1, the latter corresponds to a prior coefficient of variation 7-1/2 being bounded from above by 1. Thus, if we have a prior distribution for A which is informative with maximal coefficient of variation of 1, the MAP estimator will always be positive.
4.1.3 Example in motor insurance pricing
The credibility formula of Corollary 4.4 is of particular interest when the portfolio is small, i.e. if v is small, because in that case we receive a credibility weight a that substantially differs from 1. To apply this model we a need prior mean A = 7/c > 0 and a scale parameter c > 0. These parameters may either be chosen from expert opinion or they may be estimated from broader portfolio information. In the present example we assume broader portfolio information and we calibrate the model according to Section 4.10 of Buhlmann-Gisler [18], for full details we refer to this latter reference.
Model Assumptions 4.6. Assume we have independent portfolios i = l,...,n all following the Poisson-gamma model of Definition 4-1 with portfolio dependent volumes Vi > 0 and portfolio independent prior parameters 7, c > 0.
This model is a special case of the Buhlmann-Straub model, see Buhlmann-Gisler [18]. Corollary 4.4 implies that for each portfolio i = 1,..., n we have
APOSt = ajAj + (1 - ai) A,
with credibility weights on and observation based estimators \ given by
Qj =--— G (0,1)        and        \ = —-,
c + Vi Vi
and the prior mean is given by A = 7/c. The optimal estimator (in a credibility sense, homogeneous credibility estimator) of this prior parameter is given by
1 n
2^i=l a* i=i
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
81
Thus, there only remains to determine the scale parameter c > 0. To estimate this last parameter we apply the iterative procedure given in Buhlmann-Gisler [18], pages 102-103. This procedure provides estimates f2 and Ao from which we estimate
c = A0/?2,
for details we refer to Buhlmann-Gisler [18]. This then provides estimated credibility weights
ai = -^-G(0,l). (4.2)
C + Vi
We apply this credibility model to the car insurance pricing data generated in Appendix A. As portfolios i = 1,... ,n we choose the 26 Swiss cantons ct and we consider the corresponding volumes Vi and the observed number of claims iVj in each Swiss canton. For the first analysis we consider the whole portfolio which provides a total of J2i vi = 253'022 years at risk. The largest Swiss canton ZH has 42'644 years at risk, and the smallest Swiss canton AI has 1'194 years at risk. For the second example we only consider drivers with an age below 20 for which we have a total of J2i vi = 1'741 years at risk. The largest Swiss canton ZH has 283 years at risk (drivers below age 20), and the smallest Swiss canton AR has 4 years at risk in this second example.
If we consider the entire portfolio we obtain fast convergence of the iterative procedure (4 iterations) and we set c = 456 and AQ = 10.0768% (based on the entire portfolio). We use this same estimate c = 456 for both examples, the entire portfolio and the young drivers only portfolio. From this we can calculate the estimated credibility weights on for all Swiss cantons i = 1,..., n. The results are presented in Figure 4.1. For the entire
credibility weights full portfolio credibility weights young drivers only
Figure 4.1: Estimated credibility weights (4.2) of (lhs) the entire portfolio and of (rhs) the young drivers only portfolio.
portfolio the estimated credibility weights lie between 72% and 99%. For the young drivers only portfolio the estimated credibility weights are between 0.8% and 40%. Using these credibility weights we could then calculate the Bayesian frequency estimators (which are smoothed versions of the MLE frequency estimators). This finishes our example. ■
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
82
Chapter 4.  Credibility Theory
4.2   The binomial-beta model for classification 4.2.1    Credibility formula
There is a similar credibility formula for the binary classification problem when we assume that the prior distribution of the parameter is given by a beta distribution.
Definition 4.7 (Binomial-beta model).
• Conditionally, given O, Y±,. .. ,Yn are i.i.d. Bernoulli distributed with success probability O.
• 0 ~ Beta(a, b) with prior parameters a, b > 0.
Theorem 4.8. Assume Y±,.. . ,Yn follow the binomial-beta model of Definition J^.l. The posterior distribution of 0, conditional on Y = (Yi,. .. , Yn), is given by
@\Y ~ Beta (a + YYi>b + n-YY*
i=i
%=i
def.
Beta(apos\6post
Proof. Using the Bayes' rule, the posterior density of G is given by (up to normalizing constants)
ir(e\Y) (x ^f[eY^(i-e)1-Y^jea-1(i-e)b-1.
This is a beta density with the required properties. □
For the estimation of the unknown success probability we obtain the following prior and posterior (Bayesian) estimators
p  d= E[0]
a + b1
Impost
def.
E[0|Y]
7post
_ =  a + E^iY
apost _|_ frpost a _|_ 5 + n
Corollary 4.9. Assume Y±,.. . ,Yn follow the binomial-beta model of Definition 4-7. The posterior estimator pPost has the following credibility form
pp°st =a p+(l-a) p.
with credibility weight a and observation based estimator p given by
n V™ Y
a = -■-        G(0,1)        and       p= ^l=1 \
a + o + n n
The (conditional) mean square error of this estimator is given by
E
1
1 + a + b + n
pPost ^ _£Post^
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
83
Proof. For the estimation uncertainty we have
apost^post
E [(e-?,ost)2|y]    =   Var(6|r) = This proves the claim.
(1 _|_ apost _|_ 5post^apost _|_ £post)2
4.2.2   Maximum a posteriori estimator
In the previous section we have derived the posterior estimator ppost. The corresponding log-likelihood function was given by
log7r(0|Y)   oc   fa + EF,-l)logg+ (b + n-J2^j]\^jJ)-If we maximize this log-likelihood function we receive the MAP estimator
MAP _ a + E"=l Yi ~ 1 _    a + b + n    , t
a
P      =-;-=-;- P
a + b + n — 2       a + b + n — 2 \ n
Here, we require that a > 1 for having a MAP estimator that has a positive probability for any observation Y.
4.3   Regularization and Bayesian MAP estimators 4.3.1   Bayesian posterior parameter estimator
For illustrative purposes we reconsider the GLM of Chapter 2. However, the theory presented in this section applies to any other parametric regression model. We consider GLM (2.4) for independent cases (iVj, Xi,Wi) G T>. The joint density of these cases is for regression parameter j3 given by
fß(Nl, ...,Nn) = exp (M/3)} = ft e-""V"* (wV^M* ,
i=l
Bayesian modeling in this context means that we choose a prior density 7r(/3) for the regression parameter j3. The joint distribution of the cases and the regression parameter then reads as
f(N1,...,Nn,P) = f[e-^^ (6XP</3>*;)vi)Nt ff(/3). (4.3) i=i iVl'
After having observed data T>, the parameter vector j3 has posterior density
ir(f3\V)  oc f(Nu...,Nn,P), (4.4)
where we have dropped the normalizing constants. Thus, we receive an explicit functional form for the posterior density of the parameter vector j3, conditionally given the data T>. Simulation methods like MCMC algorithms, see Section 4.4, or sequential Monte Carlo (SMC) samplers then allow us to evaluate numerically the Bayesian (credibility) estimator
3P°St = E[/3|P] = jp^(p\V)dp. (4.5) This Bayesian estimator considers both, the prior information tv(P) and the data T>.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
84
Chapter 4.  Credibility Theory
4.3.2   Ridge and LASSO regularization
The joint distribution of the data and the regression parameter (4.3) can also be viewed as a regularized version of the log-likelihood function In (ft), where the prior density tt is used for regularization. We have already met this idea in the calibration of GAMs, see (3.7). Assume that 7r(/3) oc exp{—?7||/3||p} for some tuning parameter r\ > 0 and || ■ ||p denoting the Lp-norm for some p > 1. In this case we receive joint log-likelihood function of data and regression parameter
n
eN(P)+\ogn(P) oc  Y,-eM(3,xl)vl+Nl((p,xl)vl + logvl)-f]\\p\\pp. (4.6)
i=i
^map
If we now calculate the MAP estimator P we receive an Lp-regularized version of the MLE P given in Proposition 2.2. Maximization of objective function (4.6) implies that too large values for P are punished (regularized/shrunk towards zero) with regularization parameter (tuning parameter) 77 > 0. In particular, this helps us to control complex models to not over-fit to the data (in-sample).
Remarks 4.10 (ridge and LASSO regularization).
• Lhe last term in (4.6) tries to keep P small. Lypically, the intercept (3o is excluded from this regularization because otherwise regularization induces a bias towards 0, see balance property of Proposition 2.4.
• Lor a successful application of regularization one requires that all feature components live on the same scale. Lhis may require scaling using, for instance, the MinMaxScaler, see (5.14), below. Categorical feature components may require a group treatment, see Section 4.3 in Hastie et al. [63].
• p = 2 gives a Gaussian prior distribution, and we receive the so-called ridge regularization or Likhonov regularization [127]. Note that we can easily generalize this ridge regularization to any Gaussian distribution. Assume we have a prior mean 6 and a positive definite prior covariance matrix £, then we can consider component-specific regularization
\ogn(P) oc -6)'£-1(0-6).
• p = 1 gives a Laplace prior distribution, and we receive the so-called LASSO regularization (least absolute shrinkage and selection operator), see Libshirani [126]. LASSO regularization shrinks (unimportant) components to exactly zero, i.e. LASSO regularization can be used for variable selection.
• Optimal tuning parameters 77 are determined by cross-validation.
Example 4.11 (regularization of GLM1). We revisit Example 2.10 (GLM1) and we apply ridge and LASSO regularization according to (4.6). We consider the feature components age and ac using the categorical classes as defined in Ligure 2.2. Lhe R code is provide in Listing 4.1. On line 1 we define the design matrix only consisting of the categorized 'age class' and 'ac class', respectively.   On lines 2-3 we fit
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
85
Listing 4.1: R code glmnet for regularization
1 X  <-  model.matrix(-   ageGLM +  acGLM, dat)
2 glm.ridge   <-  glmnet(x=X,y=dat$claims,family="poisson",alpha=0,offset=log(dat$expo))
3 exp(predict(glm.ridge ,   newx = X,   newoffset = log(dat$ expo)))
this regularized regression model. Parameter alpha = 0 gives ridge regularization, and alpha = 1 gives LASSO regularization. Running this algorithm as in Listing 4.1 fits the regularized GLM for 100 different values of the tuning parameter 77. These tuning parameters are in [0.00137,13.72341] for ridge regularization and in [0.00004,0.01372] for LASSO regularization in our example. Moreover, the intercept /3q is excluded from regularization.
10 10 10 10 10 10 10 10 10 10
-4-2 0 2 0 12 3 4
log(eta) L1 norm
^map
Figure 4.2: Estimated regression parameters j3 under ridge regularization, the a;-axis shows (lhs) log(77) and (rhs) the total L1-norm of the regression parameter (excluding the intercept), in blue color are the parameters for the 'age classes' and in orange color the ones for the 'ac classes', see also Listing 2.2.
^map
In Figure 4.2 we illustrate the resulting ridge regularized regression parameters j3 for different tuning parameters 77 (excluding the intercept). We observe that the components
^map
in j3       shrink with increasing tuning parameter towards the homogeneous model. In Table 4.1 lines (Ch4.1)-(Ch4.3) we present the resulting errors for different tuning parameters. We observe that choosing different tuning parameters 77 > 0 continuously closes the gap between the homogeneous model and model GLM1.
^map
In Figure 4.3 we illustrate the resulting LASSO regularized regression parameters j3 for different tuning parameters 77. We see that this is different from ridge regularization because parameters are set/shrunk to exactly zero for increasing tuning parameter 77.
^map
The a;-axis on the top of Figure 4.3 shows the number of non-zero components in j3 (excluding the intercept).   We note that at 77 = 0.0001 the regression parameter for age31to40 set to zero and at 77 = 0.0009 the one for age41to50 is set to zero. Since age51to60 is the reference label, this LASSO regularization implies that we receive one large age class from age 31 to age 60 for 77 = 0.0009. Lines (Ch4.4)-(Ch4.6) of Table
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
86
Chapter 4.  Credibility Theory
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev	/•cv J~D	£(A,A*)	Ms >~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch4.1) ridge r) = 13.723	13st	11			1.3439	29.1065	10.2691%
(Ch4.2) ridge r) = 0.1438	13st	11			1.0950	28.8504	10.2691%
(Ch4.3) ridge r) = 0.0014	13st	11			0.6091	28.3547	10.2691%
(Ch4.4) LASSO r? = 0.0137	7st	1			1.3439	29.1065	10.2691%
(Ch4.5) LASSO r? = 0.0009	7s+	9			0.6329	28.3791	10.2691%
(Ch4.6) LASSO r? = 0.0001	7s+	11			0.6053	28.3511	10.2691%
(Ch2.1) GLM1	3.1s	11	28.3543	28.3544	0.6052	28.3510	10.2691%
Table 4.1: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, the run time^ is received by simultaneously fitting the model for 100 tuning parameters 77, and param.' gives the number of estimated model parameters, this table follows up from Table 2.4.
log(eta) L1 norm
Figure 4.3: Estimated regression parameters j3 under LASSO regularization, the x-axis shows (lhs) log (77) and (rhs) the total L1-norm of the regression parameter (excluding the intercept), in blue color are the parameters for the 'age classes' and in orange color the ones for the 'ac classes', see also Listing 2.2.
4.1 show that also LASSO regularization closes the gap between the homogeneous model and model GLM1. This finishes this example. ■
In the previous example we have seen that the ridge and LASSO regressions behave fundamentally differently, namely, LASSO regression shrinks certain parameters perfectly to zero whereas ridge regression does not. We briefly analyze the reason therefore, and we give some more insight into regularization. For references we refer to Hastie et al. [63], Chapter 16 in Efron-Hastie [37], and Chapter 6 in Wuthrich-Merz [141]. We start from the following optimization problem
argmin - eN{/3),        subject to ||/3||£ < t, (4.7) P
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
87
Figure 4.4: Illustration of optimization problem (4.7) under a budget constraint (lhs) for p = 2 and (rhs) p = 1; this figure is taken from Wuthrich-Merz [141].
for some given budget constraint t G K+ and p > 1. This is a convex optimization problem with a convex constraint. Version (4.6) is obtained as the Lagrange version of this optimization problem (4.7) with fixed Lagrange multiplier 77. We illustrate optimization problem (4.7) in Figure 4.4. The red dot shows the MLE that maximizes the unconstrained log-likelihood £n(/3). The convex curve around this MLE shows a level set of the log-likelihood function given by {/3;^jy(/3) = co} f°r a given level cq. The blue area shows the constraint in (4.7), the (lhs) shows a Euclidean ball for p = 2 and the (rhs) shows a square for p = 1. The optimal regularized estimate for j3 is obtained by the tangential intersection of the appropriate level set with the blue budget constraint. This is illustrated by the orange dots. We observe that for p = 1 this might be in the corner of the square (Figure 4.4, rhs), this is the situation where the parameter for the first feature component shrinks exactly to zero. For p = 2 this does not happen, a.s.
4.4   Markov chain Monte Carlo method
In this section we give the basics of Markov chain Monte Carlo (MCMC) simulation methods. We refrain from giving rigorous derivations and proofs but we refer to the relevant MCMC literature like Congdon [28], Gilks et al. [53], Green [57, 58], Johansen et al. [75], Neal [97] or Robert [113].
A main problem in Bayesian modeling is that posterior distributions of type (4.4) cannot be found explicitly, i.e. we can neither directly simulate from that posterior distribution nor can we calculate the (posterior) Bayesian credibility estimator (4.5) explicitly. In this section we provide an algorithm that helps us to approximate posterior distributions. For this discussion we switchback the notation from ß to © because the latter is more standard in a Bayesian modeling context.
In a nutshell, if a probability density 7r(0) (or ir{6\T>) in a Bayesian context) of a random variable 0 is given up to its normalizing constant (i.e. if we explicitly know its functional form in 6), then we can design a discrete time Markov process that converges (in distribu-
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
88
Chapter 4.  Credibility Theory
tion) to tt (or tt(-\T>), respectively). More precisely, we can design a (time inhomogeneous) Markov process that converges (under reversibility, irreducibility and aperiodicity) to its (unique) stationary distribution (also called equilibrium distribution). Assume for simplicity that tt is discrete. If the (irreducible and aperiodic) Markov chain (Q^)t>0 has the following invariance property for tt
7t(0í) = 53tt(0j-) p[0(í+1) = 0l|©(i) =0j] ,        for all 0, and t > 0, (4.8)
then tt is the (unique) stationary distribution of (@^)t>o- Based on the latter property, we can use the samples (©^)t>t0 as an empirical approximation to tt after the Markov chain has sufficiently converged, i.e. for sufficiently large to. This to is sometimes also called the burn-in period.
Thus, we are aiming at constructing an irreducible and aperiodic Markov chain that satisfies invariance property (4.8) for tt. A Markov chain is said to fulfill the detailed balance condition (is reversible) w.r.t. tt if the following holds for all 9i,9j and t > 0
7r(0i) p [©(í+1) = 6j\©(i) = 0,] = tt(6j) p [e(í+1) = oÁ ©^ =   . (4.9)
We observe that the detailed balance condition (4.9) implies the invariance property (4.8) for tt (sum both sides over 6j). Therefore, reversibility is sufficient to obtain the invariance property. The aim now is to design (irreducible and aperiodic) Markov chains that are reversible for tt; the limiting samples of these Markov chains can be used as empirical approximations to tt. Crucial for this design is that we only need to know the functional form of the density tt; note that the normalizing constant of tt cancels in (4.9).
4.4.1   Metropolis—Hastings algorithm
The goal is to design an irreducible and aperiodic Markov chain that fulfills the detailed balance condition (4.9) for density tt. The main idea of this design goes back to Metropolis and Hastings [95, 64] and uses an acceptance-rejection method.
Assume that the Markov chain is in state at algorithmic time t and we would like to simulate the next state ©(i+1) of the chain. For the simulation of this next state we choose a proposal density q(-\&^) that may depend on the current state (and also on the algorithmic time i, not indicated in the notation).
Metropolis-Hastings Algorithm.
1. In a first step we simulate a proposed next state
0* ~ g(-|©(i)).
2. Using this proposed state we calculate the acceptance probability
3. Simulate independently U ~ Uniform[0,1].
Version October 27, 2021, M.V. Wůthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
89
4. Set for the state at algorithmic time t + 1
@(m) = | o* ift7<Q(e«e*),
otherwise.
Lemma 4.12. TTiz's algorithm satisfies the detailed balance condition (4.9) for tt, and hence is invariant according to (4.8) for tt.
This lemma is proved in Section 4.5, below.
Thus, the Metropolis-Hastings (MH) algorithm provides a reversible Markov chain that converges to the stationary distribution tt (under irreducibility and aperiodicity). The crucial point why this algorithm works for densities tt that are only known up to their normalizing constants lies in the definition of the acceptance probability (4.10); note that the acceptance probability is the only place where tt is used. In the definition of the acceptance probability (4.10) we consider the ratio tv(@*)/tv(@^) and therefore the normalizing constant cancels.
The remaining freedom is the choice of the (time inhomogeneous) proposal densities q. In general, we aim at using a choice that leads to a fast convergence to the stationary distribution tt. However, such a "good" choice is not straightforward. In a Gaussian context it was proved that the optimal average acceptance rate of the proposal in the MH algorithm is 0.234, see Roberts et al. [114].
A special case is the so-called random walk Metropolis-Hastings (RW-MH) algorithm (also called Metropolis algorithm). It is obtained by choosing a symmetric proposal density, i.e. q(6i\6j) = q(6j\6j). In this case the acceptance probability simplifies to
If 0 is not low dimensional, then often a block update MH algorithm is used. Assume that 0 = (0i,... ,@k) can be represented by K components. In that case a MH algorithm can be applied iteratively to each component separately, and we still obtain convergence of the Markov chain to its stationary distribution tt. Assume we are in state at algorithmic time t and we would like to simulate the next state ©(t+1). The block update MH algorithm works as follows:
Block Update Metropolis-Hastings Algorithm.
1. Initialize the block update ©(t+1) = ©(*).
2. Iterate for components k = 1,..., K
(a) simulate a proposed state 0£ ~ 9fc("|®^+1^) f°r component ©^ of 0, and set
0* = (er)>...>et+11),e^e£J11\...>eri)); (4.n)
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
90
Chapter 4.  Credibility Theory
(b) calculate the acceptance probability of this newly proposed component
(c) simulate independently U ~ Uniform[0,1];
(d) set for the component k at algorithmic time t + 1
0(m)( ift7<Q(e<t+1\e*),
fe I otherwise.
Observe that this block update MH algorithm updates component by component of to obtain ©(t+1). The (notation in the) initialization is (only) used to simplify the description of the block update, i.e. in fact the initialization does not update any component. This is reflected in (4.11) by the fact that components 1 to k — 1 are already updated, component k is about to get updated, and components k + 1 to K are not updated yet. This block update MH algorithm is of special interest for Gibbs sampling, which is described next.
4.4.2   Gibbs sampling
The block update MH algorithm is of special interest if it is possible to perform conditional sampling. Assume that for ajl k = 1,..., K the density ir is so nice that we can directly sample from the conditional distributions
0fcl(ei,...,efe_1,efe+1,...,e^) ~ tt(-|©i, • • • ,&k-i,&k+i, ■ ■ ■,&k)-
In this case we may choose as proposal distribution for in the above block update MH algorithm (called Gibbs sampling in this case)
~ 9fc(.|e(j^ I Q^+1),...>e£-11)>e£J11)>...,eri)) - *(e*),
where we use notation (4.11). These proposals are always accepted because a = 1.
4.4.3   Hybrid Monte Carlo algorithm
The hybrid Monte Carlo (HMC) algorithm goes back to Duane et al. [34] and is well described in Neal [97]. Note that the abbreviation HMC is also used for Hamiltonian Monte Carlo because the subsequent Markov process dynamics are based on Hamiltonians. Represent the density ir as follows
tt(0) = exp {log tt(0)}  oc  exp {^(0)} .
Observe that £ is the log-likelihood function of ir, (potentially) up to a normalizing constant. The acceptance probability (4.10) in the MH algorithm then reads as
a(0(t),e*) = exp (min {^(©*) - ^(©^) + logg(©W|©*) - logg(©*|©W), o}) .
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
91
In physics —£(6) can be interpreted as the potential energy of the system in state 9. We enlarge this system by also accounting for its momentum. In physics this is done by considering the kinetic energy (and using the law of conservation of energy). Assume that 0 and ^ are independent with joint density
/(0,V) = <0)9ty) ^  exp{*(0)-/#)}•
We assume that —£{■) and k(-) are differentiable, playing the role of potential and kinetic energy, respectively, and we define the Hamiltonian
H(6,u)) = -£(6) + k(u)).
Assume that the dimension of 9 = (6±,... ,6r) is r G N, then a typical choice for yj = (ipi,..., ipr) and its kinetic energy is
k{uh)
i=i
^V' (diag(t2,...,r2;
for given parameters r/ > 0. In other words, g{ip) is a Gaussian density with independent components ^i,...,^ and covariance matrix T = diag(r2,..., r2). The goal is to design a reversible, irreducible and aperiodic Markov process that has / as stationary distribution (note that we silently change to a continuous space and time setting, for instance, for derivatives being well-defined).
The crucial point now is the construction of the Markov process dynamics. Assume that we are in state (©(*), at time t and we would like to study an infinitesimal time interval dt. For this we consider a change (in infinitesimal time) that leaves the total energy in the Hamiltonian unchanged (law of conservation of energy). Thus, for all components / = 1,.. . , r we choose the dynamics at time t
d@
(t)
dt
and
d£{6)
dt
ew
(4.12)
This implies for the change in the Hamiltonian at time t
dt
E
1=1
r
E
i=i
OH(0,i/j)
d6l d£{6)
d@
(t)
dH(0,i/j)
(e('),*W) dt
d^
(t)
d6l
d@
(t)
dk(u))
ew dt
d^
dt
(ew,*W) dt = 0.
Thus, a Markov process (0(*),$(*))t>o which moves according to (4.12) leaves the total energy in the Hamiltonian unchanged. Moreover, this dynamics preserves the volumes, see (3.8) in Neal [97], and it is reversible w.r.t. f(9,yj). This implies that f(9,yj) is an invariant distribution for this (continuous-time) Markov process (©(*), ^^t^o- Note, however, that this Markov process is not irreducible because it cannot leave a given total energy state (and hence cannot explore the entire state space) for a given starting value. This deficiency can be corrected by randomizing the total energy level. Furthermore, for simulation we also need to discretize time in (4.12). These two points are described next.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
92
Chapter 4.  Credibility Theory
First, we mention that we can apply Gibbs sampling for the update of the kinetic energy state given ©(*), simply by using the Gaussian density g(ip). Assume we are in position (©(*), at algorithmic time t £ No and we aim at simulating state (©(*+!), $(*+!)). We do this in two steps, first we move the kinetic energy level and then we move along the corresponding (discretized) constant Hamiltonian level.
Hybrid (Hamiltonian) Monte Carlo Algorithm. 1. Using Gibbs sampling, we set
e(t)   ~ AA(0,r = diag(r12,...,t^
2. For fixed step size s > 0 and number of steps L 6 N, we initialize 6*(0) = ©(*) and V'*(0) = and we consider the leapfrog updates for j = 1,..., L
V*0'-i/2) = f(i-i) + ^v4ro'-i)),
e*U) = ^-i/ryo'-i/2), = ^(j-i/2) + | v0e(o*(j)).
This provides a proposal for the next state
(0*,$*) = (0*(L),-i/}*(L)). Accept this proposal with acceptance probability
^(0(*), $(*+!)), (0*, = exp (min {-#(©*, **) +    (0^, $(t+1)), o}) ,
otherwise keep (©(*), $(t+1)).
(4.13)
We give some remarks. Observe that in step 2 we may update both components ©(*) and ^(t+1). If the leapfrog discretization were exact (i.e. no discretization error), then it would define, as in (4.12), a reversible Markov process that leaves the total energy and the total volume unchanged, see Section 3.1.3 in Neal [97]. This would imply that the acceptance probability in (4.13) is equal to 1. The MH step with acceptance probability (4.13) corrects for the discretization error (corrects for a potential bias). Moreover, since the kinetic energy k{ip) is symmetric with respect to zero, the reflection of the momentum in the proposal ^* would not be necessary (if we perform Gibbs sampling in step 1). Note that the wanted distribution tt is obtained by considering the (for the first component) marginalized Markov process (©(*), ^^)t>o asymptotically.
4.4.4   Metropolis-adjusted Langevin algorithm
The above HMC algorithm provides as special case for L = 1 the Metropolis-adjusted Langevin algorithm (MALA), studied in Roberts-Rosenthal [115].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
93
First, observe that the Gibbs sampling step of ^(t+1) is independent of everything else, and we can interpret this Gibbs sampling step as a proposal distribution for updating in particular, because we are only interested into the marginalized process (0^t-))t>o of (©(*), fW)oo. The leapfrog update for L = 1 reads as
0*   =   0(*) + — T-1Velog^(e<t') + e T-^t+1\
= _£Veiog7r(e«)-| veiog7r(e*).
From the first line we see that the proposal is chosen as
6*le(0  ~ + jT-1Vd\ogir(eV),e2T-^J . (4.14)
Lemma 4.13. The acceptance probability obtained from (4.13) for L = 1 is equal to the acceptance probability of the MH algorithm using proposal (4.14).
This lemma is proved in Section 4.5, below.
Remarks and interpretation. The motivation for the study of the MALA has been that the classical MH algorithm may have a poor performance in higher dimensions, because the Markov process may behave like a random walk that explores the space in a rather inefficient way. The idea is to endow the process with an appropriate drift so that it finds the relevant corners of the space more easily. For this purpose the Langevin diffusion is studied: assume that (Wt)t>o is an r~dimensional Brownian motion on our probability space, then we can consider the r-dimensional stochastic differential equation
d0(*) = ptog7r(eW)dt + dWi.
When 7r is sufficiently smooth, then tt is a stationary distribution of this Langevin diffusion. The MALA described by the proposals (4.14) is then directly obtained by a discretized version of this Langevin diffusion for T = 1 and for s describing the step size. The acceptance-rejection step then ensures convergence to the appropriate stationary distribution tt. In Roberts-Rosenthal [115] it was proved that (under certain assumptions) the optimal average acceptance rate of the MALA is 0.574. Note that this is vastly different from the 0.234 in the classical MH algorithm, see Roberts et al. [114]. In Neal [97] it is argued that also the MALA may have a too random walk-like behavior and the full power of HMC algorithms is only experienced for bigger L's reflecting bigger steps along the (deterministic) Hamiltonian dynamics.
4.4.5   Example in Markov chain Monte Carlo simulation
In this section we make a simple example for the density tt that allows us to empirically compare the performance of the MH algorithm, the MALA and the HMC algorithm. We choose dimension r = 10 (which is not large) and let X 6 M7"*7" be a positive definite covariance matrix, the explicit choice is provided in Figure 4.5. We then assume that
tt (=> AA(0,S).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
94
Chapter 4.  Credibility Theory
				□□□ n
				□□DUDD
	□			□□□□□□
				□□    1 1
□□□□□□				■
Figure 4.5: Choice of covariance matrix X.
Thus, 7T is a centered multivariate Gaussian density. Of course, in this case we could directly simulate from ir. Nevertheless, we will use the MCMC algorithms to approximate samples of ir. This allows us to study the performance of these algorithms.
MH: average acceptance
MALA: average acceptance rate = 0.679
HMC(L=5): average acceptance rate = 0.698
\1
2000 aOOCI 6000 8000 10000
alcoritli ~iicti ~i«
0 2000 4000
algorithi
Figure 4.6: Trace plots of the first component of (0^t-))t=o,...,10000 of (lhs, black) the MH algorithm, (middle, blue) the MALA, and (rhs, red) the HMC algorithm with L = 5.
We start with the MH algorithm. For the MH algorithm we choose proposal distribution
O* ~ g(.|0(*)) (=> AA(0(t),e2l).
Thus, we do a RW-MH choice and the resulting acceptance probability is given by
a (©W,©*) = exp (min j-^©*)^"1©* + ^(e^'S"1©^ o}) .
The parameter s > 0 is fine-tuned so that we obtain an average acceptance rate of roughly 0.234. The trace plot of the first component (O^)t is given in Figure 4.6 (lhs, black). Note that we choose starting point ©(°) = (10,..., 10)' and the burn-in takes roughly to = 1000 algorithmic steps. Next we study the MALA. We choose proposal
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
95
from which the acceptance probability can easily be calculated, see also (4.17). The parameter s = 0.23 is fine-tuned so that we obtain an average acceptance rate of roughly 0.574. The trace plot of the first component (O^)t is given in Figure 4.6 (middle, blue). Note that we choose starting point ©(°) = (10,..., 10)' and the burn-in is faster than in the RW-MH algorithm.
Finally, we consider the HMC algorithm. We choose exactly the same e = 0.23 as in the MALA above, but we merge L = 5 steps along the Hamiltonian dynamics. We then apply the following steps. First, we simulate ^J/(t+1) ~ J\f (0,1). The leapfrog updates are obtained by initializing 0*(O) = ©^ and ^*(0) = $(t+1), and updating for j = 1,. .. , 5
0*U)   =   (l-jZ-^O'U-V + erU-D,
= ^0'-i)-f s-^O'-i)^^
This provides a proposal for the next state (©*, 5'*) = (0*(5), The acceptance
probability for this move is then easily calculated from (4.13). For this parametrization we obtain an average acceptance rate of 0.698. The trace plot of the first component (©!*'')t is given in Figure 4.6 (rhs, red). The striking difference is that we get much better mixing than in the RH-MH algorithm and in the MALA. This is confirmed by looking at
auto-correlation in MCMC samples auto-correlation in MCMC samples auto-correlation in MCMC sampl«
Figure 4.7: Autocorrelations of the first component ©^ of (@^)t=t0+i,...,10000 (after burn-in t0 = 1000) of (lhs) the MH algorithm (black), the MALA (blue) and the HMC algorithm with L = 5 (red); (middle) the HMC algorithm for L = 3,5,10 (orange, red, magenta); and (rhs) the HMC algorithm for constant total move Le = 1.15 (red, pink, brown).
the resulting empirical auto-correlations in the samples. They are presented in Figure 4.7 (lhs) for the first component of ©(*). From this we conclude that we should clearly favor the HMC algorithm with L > 1. In Figure 4.8 we also present the resulting QQ-plots (after burn-in to = 1000). They confirm the previous statement, in particular, the tails of the HMC algorithm look much better than in the other algorithms. In the HMC algorithm we still have the freedom of different choices of L and e. For the above chosen e = 0.23 we present three different choices of L = 3, 5,10 in Figure 4.9. In this case L = 10 has the best mixing properties, also confirmed by the autocorrelation plot in Figure 4.7 (middle). However, L should also not be chosen too large because the
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
96
Chapter 4.  Credibility Theory
Normal QQ-plotfor MH algorithm Normal QQ-plotTor MALA Normal QQ-plot for HMC(5| algorithm
Theoretical Quantiles Theoretical Guantiles Theoretical QtfanHei
Figure 4.8: QQ-plots of the first component ©^ of (@^)t=t0+i,...,10000 (after burn-in t0 = 1000) of (lhs, black) the MH algorithm, (middle, blue) the MALA, and (rhs, red) the HMC algorithm with L = 5.
HMC(L=3): average acceptance rate = 0.727 HMC(L=5): average acceptance rate = 0.G98 HMC(L=10): average acceptance rate = 0.655
Figure 4.9: Trace plots of the first component ©^ of (@^)t=o,...,10000 of the HMC algorithm for (lhs, orange) L = 3, (middle, red) L = 5, and (rhs, magenta) L = 10.
leapfrog update may use too much computational time for large L. Hoffman-Gelman [67] have developed the no-U-turn sampler (NUTS) which fine-tunes L in an adaptive way.
From Figure 4.9 we also observe that the choice of L does not influence the average acceptance probability too much. In Figure 4.10 we chose Le = 5 ■ 0.23 = 1.15 constant, i.e. we decrease the step size and increase the number of steps. Of course, this implies that we follow more closely the Hamiltonian dynamics and the average acceptance rate is increasing, see Figure 4.10 from left to right. In Figure 4.7 (rhs) we present the resulting autocorrelation picture. From this we conclude, that L = 10 with original step size e = 0.23, Figure 4.7 (middle, magenta), provides the best mixture of the Markov chain in our example (and should be preferred here). This finishes the example. ■
4.4.6   Markov chain Monte Carlo methods: summary
We have met several simulation algorithms to numerically approximate the density tt. Typically, these algorithms are used to determine posterior densities of type (4.4). These posterior densities are known up to the normalizing constants.   This then allows one
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 4.  Credibility Theory
97
Figure 4.10: Trace plots of the first component of (@^)t=o,...,10000 of the HMC algorithm for (lhs, red) L = 5 and e = 0.23, (middle, pink) L = 7 and e = 0.23 ■ 5/7, and (rhs, brown) L = 10 and e = 0.23/2, i.e. the total move Le = 1.15 is constant.
to empirically calculate the posterior Bayesian parameter (4.5). The advantage of the posterior Bayesian parameter is that we receive a natural regularization (determined by the choice of the prior distribution) and, moreover, we can quantify parameter estimation uncertainty through the posterior distribution of this parameter.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
98
Chapter 4.  Credibility Theory
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix to Chapter 4
4.5   Proofs of Section 4.4
Proof of Lemma 4.12. The MH algorithm provides transition probability for algorithmic time t > 0 r   (t+i)       i   m       if 9(^)1^) a(9i,9i) for 0, ^ 0,-,
pt+1|t(^.) = p[e^> = e,\ev = *<] = ( ^^V^;^,^(1 _ {9tt9k))   for*1£
For the detailed balance condition we need to prove for all 9i, 9j and t > 0
pt+nt^,^-) = tt(9j) p^WjA)- (4-15) The case 9i = 0,- is trivial, so consider the case 9i     Oj. We assume first that
<<>i) > <0i) q(0A0i). (4.16)
In this case we have acceptance probability a(9i, 9j) = 1, and, thus, pt+i\t(0i, Oj) = q(0j\0i). We start from the right-hand side in (4.15), note that under (4.16) we have a(0j,0i) = 7r(0j) q{0j\0i)/iT(0j) q{0i\0j) < 1,
t(^) Pt+nt^-.ft)   = 9(^1^) a^-.fc) = tt(^) g(f^) g(^|y
This proves the first case. For the opposite sign in (4.16) we have a(0j,0i) = 1, and we read the last identities from the right to the left as follows
=    irifij) q(9l\9J)  = tt^-) pt+i|t(^,0i). This proves the lemma. □
Proof of Lemma 4.13. The acceptance probability in (4.13) for L = 1 is
a ((6« §«+1>), (6%**)) = min (-^1 exp {-I^T"1** + i^C^T"1*^1)} , 1
Note that we have identity
z = ** + | Velog7r(e*) = -§(t+1)-| Velog7r(e(t)).
We define a* = § Ve log7r(6*) and a(t) = § Ve log7r(e(t)). Then we obtain
(tf')'T-1** - ($('+1))'T-1*(t+1) = (z - a'/T-^z - a*) - (z + bWJT^ + a(t)) =    _2(a* + a^/T-1* + (a*)'T" V - (a^T"1^.
For z we have using the proposal G*
z =        _ | Ve iog^(ew) = 1 r (e(t) - e*).
99
Electronic copy available at: https://ssrn.com/abstract=2870308
100 Chapter 4.  Credibility Theory
Therefore, we obtain
(^yr-1** - (fc+Dyr-1*^1) = - _(a* + a(t))' (e(t) - e*) + (a*)T-V - (aWj'r1^".
e v '
This provides acceptance probability (4.13) for L = 1 being the minimum of 1 and
7r(e*)
7T(GW)
exp
11(a' + a(t))' (e(t) - e*) - ±(a*)'T-V + i^T-V0}.
If we run an MH algorithm with proposal (4.14), we obtain acceptance probability being the minimum of 1 and
„(9-) e*p {-^ (e(t) - e* - (eW " e*" gT" v)}
7r(eW)exp{-^(e* -6W -eT-1aW)'T(e* -6W -eT-^W)}' Since the last two displayed formulas are identical, we obtain the same acceptance probability and the claim follows. □
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter
5
Neural
Networks
5.1   Feed-forward neural networks
Feed-forward neural networks are popular methods in data science and machine learning. Originally, they have been inspired by the functionality of brains and have their roots in the 1940s. In essence, feed-forward neural networks can be understood as high-dimensional non-linear regression functions, and resulting statistical models can be seen as parametric regression models. However, due to their high-dimensionality and non-interpretability of parameters, they are often called non-parametric regression models. Typically, one distinguishes between two types of feed-forward neural networks: (i) shallow feed-forward neural networks having one hidden layer, (ii) and deep feed-forward neural networks with more than one hidden layer. These will be described in detail in this chapter. This description is based on the q-dimensional feature space X C R9 introduced in Chapter 2 and we will again analyze regression problems of type (2.1).
Remark. We use neural networks for modeling regression functions. However, neural networks can be seen in a much broader context. They describe a powerful approximation framework to continuous functions. In fact, neural networks provide a general toolbox of basis functions that can be composed to a very flexible approximation framework. This may be seen in a similar way as the splines and MARS frameworks presented in Chapter 3 or as the toolbox of wavelets and other families of basis functions.
5.1.1    Generic feed-forward neural network construction
In this section we describe the generic architecture of feed-forward neural networks. Explicit examples and calibration techniques are provided in the subsequent sections. This section follows Chapter 7 of Wuthrich-Merz [141], and for a more in-depth introduction we refer to this latter reference.
Activation functions
The first important ingredient (hyperparameter) of a neural network architecture is the choice of the activation function <fi : R —> R. Since we would like to approximate nonlinear regression functions, these activation functions should be non-linear.  Table 5.1
101
Electronic copy available at: https://ssrn.com/abstract=2870308
102
Chapter 5.  Neural Networks
summarizes common choices. The first two examples in Table 5.1 are difierentiable (with
sigmoid/logistic activation function	4>{x)	= (l+e-*)"1	4>' = 0(1 " 0)
hyperbolic tangent activation function	4>{x)	= tanh(a;)	</>' = l - 4>2
step function activation	4>{x)	= l{x>0}	
rectified linear unit (ReLU) activation function	4>{x)		
Table 5.1: Typical choices of (non-linear) activation functions.
easy derivatives). This is an advantage in algorithms for model calibration that involve derivatives.
The hyperbolic tangent activation function satisfies
ex — e~x 2e2x / \ -l
x i y  tanh^) = —— = —-2^-1 = 2 (l+e-2x)    -1. (5.1) ex + e x       1 + ezx \ /
The right-hand side uses the sigmoid activation function. The hyperbolic tangent is symmetric w.r.t. the origin, and this symmetry is an advantage in fitting deep network architectures. We illustrate the sigmoid activation function in more detail. It is given by
x i y ^x) = ^^ = (l+e-xy1   £ (0,1). (5.2) 1 + ex
Figure 5.1 shows the sigmoid activation function x      <fi(wx) for w £ {1/4,1,4} and
sigmoid function
-10 -5 0 5 10
Figure 5.1: Sigmoid activation function x     <fi(wx) for w £ {1/4,1,4} and x £ (—10,10).
x £ (—10,10). We see that the signal is deactivated for small values of x, and it gets fully activated for big values, and w determines the intensity of activation. The increase of activation takes place in the neighborhood of the origin and it may be shifted by an appropriate choice of translation (also called intercept), i.e. x 4>{wq + wx) for a given intercept wq £ R. The sign of w determines the direction of the activation.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
103
Neural network layers
A neural network consist of (several) neural network layers. Choose hyperparameters qm-l,qm G N and an activation function <f> : R —> R. A neural network layer is a mapping
z(m) . ^ Rqm z ^ z(m)(z) = (z[m\Z), zj^(z))' ,
that has the following form. Neuron z^m\ 1 < j < qm, is described by
/ 9m-l \
4m) = *im)(*) = * [w%] + £ -S^J =• *<^mU. (5-3)
for given network weights w^™^ = ('it>j™'))o<Z<qm_i S R9m_1 .
Neuron z)™^ corresponds to a GLM regression w.r.t. feature z G R9m_1 that consists of a scalar product and then measures the resulting activation of this scalar product in a non-linear fashion <f>. Function (5.3) is called ridge function. A ridge function can be seen as a data compression because it reduces the dimension from qm-\ to 1. Since this dimension reduction goes along with a loss of information, we consider simultaneously qm different ridge functions in a network layer. This network layer has a network parameter w(m) = (w(m)?. . . ? Wq7^) of dimension gm(gm_i + 1). Often we abbreviate a neural network layer as follows
z ^ z{m\z) = 4>{W{m\z). (5.4)
Feed-forward neural network architecture
A feed-forward neural network architecture is a composition of several neural network layers. The first layer is the input layer which is exactly our feature space X C R9, and we set qo = q for the remainder of this chapter. We choose a hyperparameter d G N which denotes the depth of the neural network architecture. The activations in the m-th hidden layer for 1 < m < d are obtained by
x G X 1-4- z
We use the neurons z^d:1\x) G R9d in the last hidden layer as features in the final regression. We choose the exponential activation function because our responses are unbounded and live on the positive real line. This motivates the output layer (regression function)
xeX ^ A(x) =exp|/3o + ^/3J^(d:1)(x)| =exp(/3,z(d:1)(a;)), (5.6)
with output network weights j3 = {fij)o<j<qd G R9d+1. Thus, (5.6) defines a general regression function which has a network parameter 9 = (/3, W^d\ . . ., W^1-1) of dimension r = qd + 1 + Ylm=i Qm{qm-i + !)• The general aim is to fix a neural network architecture
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
104
Chapter 5.  Neural Networks
for the regression model, and fit the resulting network parameter 9 6 W as good as possible to the data.
,claims
Figure 5.2: (lhs) Neural network of depth d = 1 with two-dimensional input feature x = (age,ac)' and qi = 5 hidden neurons; (rhs) neural network of depth d = 2 with two-dimensional input feature x = (age, ac)' and 2 hidden layers with qi = 8 and qi = 5 hidden neurons, respectively; input layer is in blue color, hidden layers are in black color, and output layer is in red color.
If Figure 5.2 we illustrate two different feed-forward neural network architectures. These have network parameters 6 of dimensions r = 21 and r = 75, respectively.
Remarks 5.1.
• The neural network introduced in (5.6) is called feed-forward because the signals propagate from one layer to the next (directed acyclic graph). If the network has loops it is called recurrent neural network.
• Often, neural networks of depth d = 1 are called shallow networks, and neural networks of depth d > 2 are called deep networks.
• Here, we have defined a generic feed-forward neural network architecture for a regression problem. Similar architectures can be defined for classification problems. In that case the output layer will have multiple neurons, and often the softmax activation function is chosen to turn the output activations into categorical probabilities for classification.
We briefly summarize the models studied so far which brings a common structure to regression problems. We have studied three different types of regression functions:
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
105
Generalized linear models (GLMs): see (2.2),
x i—^ X(x) = exp (f3, x). Generalized additive models (GAMs): see (3.11),
x i—y X{x) = exp (/3, s{x)). Feed-forward neural networks: see (5.6),
x i—y X(x) = ex-p(/3,zid:1\x)) .
In all three examples we choose the log-link function which results in the exponential output activation. In the exponent, we have a scalar product in all three cases. In the first case, the feature x directly runs into the scalar product. As we have seen in Chapter 2 this requires feature engineering so that this log-linear form provides a reasonable regression function for the problem to be solved. In GAMs the linear structure in the exponent is replaced by natural cubic splines which directly pre-process features into an appropriate structure so that we can apply the scalar product on the log-scale. Finally, neural networks allow for even more modeling flexibility in feature engineering x i—y z(d:1\x), in particular, w.r.t. interactions as we will see below.
5.1.2   Shallow feed-forward neural networks
We start by analyzing shallow feed-forward neural networks, thus, we choose networks of depth d = 1. Figure 5.2 (lhs) shows such a shallow neural network. We start from go-dimensional features x = (x±,. .. ,xqo)' G X C R90 (we have set qo = q)- The hidden layer with qi neurons is given by (for notational convenience we drop the upper index m = 1 in the shallow network case)
x ^ z = z(x) = (f>(W,x) G Rqi. (5.7)
The output layer is then received by
x^ \ogX(x)=(30 + YJ(3jZj(x) = (p,z(x)), (5.8)
with output network weights j3 = (f3j)o<j<q1 G Thus, we have network parameter
9 = (/3, W) of dimension r = qi + 1 + qi(qo + 1).
Universality theorems
Functions of type (5.8) provide a very general family of regression functions (subject to the choice of the output activation). The following statements have been proved by Cybenko [29], Hornik et al. [69] and Leshno et al. [88]: shallow neural networks (under mild conditions on the activation function) can approximate any compactly supported continuous function arbitrarily well (in supremum norm or L2-norm), if we allow for an
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
106
Chapter 5.  Neural Networks
arbitrary number of hidden neurons qi £ N. Thus, shallow neural networks are dense in the class of compactly supported continuous functions; and shallow neural networks are sufficiently flexible to approximate any desired (sufficiently regular) regression function. These results are called universality theorems. A famous similar statement goes back to Kolmogorov [82] and his student Arnold [3] who were considering Hilbert's 13th problem which is related to the universality theorems. Thus, the universality theorems tell us that shallow neural networks are sufficient, nevertheless we are also going to use deep neural networks, below, because they have a better approximation capacity at lower complexity.
Example 5.2. We provide a simple illustrative example for the functioning of shallow neural networks. We make two different choices for the activation function <f>. We start by choosing the step function activation, see Table 5.1,
<f>(x) = l{x>0}-
Note that inside the activation function (5.7) we consider the scalar products Y^iLi wj,ixi which are translated by intercepts wj:q. For the step function activation we are interested in analyzing the property, for j = 1,. . ., qi,
go ? 1=1
For illustration we choose qo = qi = 2. The blue line in Figure 5.3 (lhs) corresponds
step function activation step function activation
Figure 5.3: Step function activation for qo = qi = 2 (lhs) x 6 i2 and (rhs) z £ {0, l}2.
to the translated scalar product of the first hidden neuron z\(x) and the red line to the second hidden neuron z^ix). In our example, all x above the blue line are mapped to z = (1,2:2)' and all x below the blue line to z = (0,2:2)'- Similarly, all x above the red line are mapped to z = (21,1)' and all x below the red line to z = (21,0)'. Thus, under the step function activation we have four possible values z £ {0, l}2 for the neuron activation, see colored dots in Figure 5.3 (rhs). This network can have at most four different expected responses.
As a second example we consider the sigmoid activation function provided in (5.2) for <Zo = <Zi = 2, and exactly the same parameters W as in the step function activation of
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
107
sigmoid activation function sigmoid activation function
-2 -1 D 1 2 0.0 0.2 OA 0.6 0.8 1.0
Figure 5.4: Sigmoid activation for qo = q± = 2 (lhs) x £ R2 and (rhs) z £ [0, l]2.
Figure 5.3. We consider the straight lines in the feature plane x £ R2 in Figure 5.4 (lhs) and map these lines to the z unit square [0, l]2. The blue line leaves the component z\(x) invariant and the red line leaves the component ziix) invariant. All other lines that are non-parallel to the blue and the red lines (e.g. the green, dark-green and magenta lines) connect the corners (0, 0) and (1,1) or (1,0) and (0,1) of the unit square in the z graph, see Figure 5.4 (rhs).
The ridge function reduces in the scalar product (wj,x) the dimension from qo to 1. Observe that x i—> Zj{x) = cf>(wj,x) is constant on hyperplanes orthogonal to Wj and the gradient points into the direction of Wj, i.e. Vx(f>(wj,x) = cf>'(wj,x)wj (subject to differentiability of the activation function <fi). ■
Gradient descent method for shallow neural networks
Assume we have chosen the shallow feed-forward neural network (5.7)-(5.8) for our regression modeling problem. The remaining difficulty is to find the optimal network parameter 9 = (/3, W) of that model. State-of-the-art of neural network calibration uses variants of the gradient descent method.
Initial remark. The choices of the activation function cf>, the depth d of the neural network and the numbers (qm)i<m<d °f hidden neurons are considered to be hyperpa-rameters. Therefore, we fit the network parameter 6 conditionally given these hyperpa-rameters. As stated in the universality theorems on page 105, a shallow neural network with an arbitrary number q\ of neurons is sufficient to approximate any compactly supported continuous function. This illustrates that a large value of qi may likely lead to over-fitting of the regression model to the data because we can approximate a fairly large class of functions with that neural network model. This suggests to go for a low dimensional neural network. But this is only half of the story: for model fitting (slight) over-parametrization is often needed in the gradient descent method, otherwise the algorithm may have poor convergence properties. Thus, the crucial point in gradient descent calibration is in most of the cases to stop the optimization algorithm at the right moment
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
108
Chapter 5.  Neural Networks
to prevent an over-parametrized model from over-fitting; this is called early stopping. We will discuss these issues in more detail below.
We illustrated the gradient descent algorithm on an explicit example. Assume that the expected frequency is modeled by (5.8), and we set
91
fi(x) = logA(a:) =     + = {P,z) = (P,z(x)),
3 = 1
with hidden neurons z satisfying (5.7) for a differentiable activation function <f>. Under the Poisson assumption
N,        Poi(A(ccl)wl),        for i = 1,... ,n,
we obtain Poisson deviance loss
n
D*(N, A)   =   E2[ANK-Ar^-Ar^log(ANK) + Ar^1°gAr^
i=i
n
=       2 [eKxi)Vi -Ni- Ni»(xi) + N, \og(Ni/vij\ .
i=i
The goal is to make this Poisson deviance loss D*(N,X) small in network parameter 6 which is equivalent to determining the MLE for 6. Typically, this optimal network parameter cannot be determined explicitly because the complexity of the problem is too high.1 The idea therefore is to design an algorithm that iteratively improves the network parameter by looking for locally optimal moves.
We calculate the gradient of the Poisson deviance loss D*(N,X) w.r.t. 9
n n
VeD*(N,X) =  Y,2 [e^^Vi - iVj] Vofj,(xi) =  £ 2 [X(Xl)Vl - Nt] V*/^).
i=i i=i
The last gradient is calculated component-wise. We have
jLi(x) = (z0(x), z(xYY = z+(x) G (5.9) where we set zq(x) = 1 for the intercept /3q. For j = 1,..., qi and / = 0,..., qo we have
■-—^(x) = l3j(f>' \ + Xj W3,kxkj Xi = /3j(j)'(Wj,x)xi,
where we set xq = 1 for the intercepts Wjfl. For the hyperbolic tangent activation function we have 4>' = 1 — </>2, see Table 5.1. This implies for the hyperbolic tangent activation function
d
Collecting all terms, we obtain the gradient under the hyperbolic activation function choice
-H(x) = ß3 (l - Zj(x)2^ xi. (5.10)
VeKx)  =   {z+{x)',ßi (l - z!(x)2^ xq, ... ,ßqi (l - zqi(x)2"j xqo"j
1The issue of not being able to find a global optimizer is discussed in more detail in Remarks 5.3.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
109
with dimension r = qi + 1 + qi(qo + 1).
The first order Taylor expansion of the Poisson deviance loss D*(AT", A) in 9 is given by, we set A = X$ and A = Ag-,
D*(N, A) = D*(N, A) + VeD*(N, A)' (o - O) + o (\\9 - 0\\2) ,
for \\9 — 9\\2 —> 0. Therefore, the negative gradient — V$D*(N, A) gives the direction for 9 of the maximal local decrease in Poisson deviance loss.
Assume we are in position 9^ after the t-th algorithmic step of the gradient descent method which provides regression function x h-> A^ (x) = A^(t) (x) with network parameter 0(*). For a given learning rate Qt+i > 0, the gradient descent algorithm updates this network parameter 9^ by
0(t) ^ =      _ gt+1\7dD*(N,\^). (5.11)
This update provides new in-sample Poisson deviance loss
D*(N,\W) = D*(N,\V) - Qt+i \\vdD*(N,X^)\\\o(gt+1), for gt+i —> 0. Iteration of this algorithm decreases the Poisson deviance loss step by step.
Remarks 5.3.
• Iteration of algorithm (5.11) does not increase the complexity of the model (which is fixed by the choices of c/rj and qi); this is different from the gradient boosting machine presented in Chapter 7 on stage-wise adaptive learning, which increases the number of parameters in each iteration of the algorithm.
• By appropriately fine-tuning (tempering) the learning rate Qt+i > 0, the gradient descent algorithm will converge to a local minimum of the objective function. This may lead to over-fitting for a large hyperparameter qi (in an over-parametrized model). Therefore, in neural network calibrations it is important to find a good balance between minimizing in-sample losses and over-fitting. Typically, the data is partitioned into a training set and a validation set. The gradient descent algorithm is performed on the training set (in-sample), and over-fitting is tracked on the validation set (out-of-sample), and as soon as there are signs of over-fitting on the validation data, the gradient descent algorithm is early stopped.
• Early stopped solutions are not unique, even if the stopping criterion is well-defined. For instance, two different initial values (seeds) of the algorithm may lead to different solutions. This issue is a major difficulty in insurance pricing because it implies that "best" prices are not unique in the sense that they may depend on the (random) selection of a seed, for a broader discussion we refer to Section 7.4.4 in [141]. More colloquially speaking this means that we have many competing models of similar quality, and we do not have an objective criterion to select one of them. This is further discussed in Section 5.1.6, below.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
110
Chapter 5.  Neural Networks
• The learning rate gt+i > 0 is usually chosen sufficiently small so that the update is still in the region of a locally decreasing deviance loss. Fine-tuning of this learning rate is crucial (and sometimes difficult).
• One may try to use higher order terms in the Taylor expansion (5.11). In most cases this is computationally too costly and one prefers other techniques like momentum-based updates, see (5.12)-(5.13).
• Big data has not been discussed, yet. If the number n of observations in T> is very large, then the gradient calculation V$D* involves high-dimensional matrix multiplications. This may be very slow in computations. Therefore, typically, the stochastic gradient descent (SGD) method is used. The SGD method considers for each step in (5.11) only a (random) sub-sample (batch) of all observations in T>. In one epoch the SGD algorithm runs through all cases in T> once. In applications one usually chooses the batch size and the number of epochs the SGD method should be iterated.
If one considers all observations in T> simultaneously, then one receives the optimal direction and for this reason that latter is sometimes also called steepest gradient descent method.
• The initial value for the network parameter in the gradient descent algorithm should be taken at random to avoid any kind of symmetry that may trap the algorithm in a saddle point.
Listing 5.1: Steepest gradient descent method for a shallow neural network
In <-  nrow(X) #  number  of   observations   in  design  matrix X
2 z <-  array(l,   c(ql+l,   n))   #  dimension  of   the  hidden layer
3 z[-l,]   <-  tanhCW %*%  t(X)) #   initialize  neurons  for W
4 lambda  <-  exp(t(beta)   %*%  z)       #   initialize   frequency  for beta
5 for   Ctl   in l:epochs){
6 delta.2       <-  2*(lambda*dat$expo   - dat$claims)
7 grad.beta  <-  z  '/,*'/, t(delta.2)
8 delta.1       <-   (beta[-l,]   '/,*'/,  delta.2)   *   (1   - z[-l,]~2)
9 grad.W <-  delta. 1  '/,*'/,  as. matrix (X)
10 beta <-  beta  -  rho [2]   *  gr ad . bet a/sqrt ( sum (gr ad . bet a ~2 ) )
11 W <- W  -  rho[1]   *  grad.W/sqrt(sum(grad.W~2))
12 z[-l,]   <-  tanh(W '/,*'/, t(X))
13 lambda  <-  exp(t(beta)   %*% z)
14 }
In Listing 5.1 we illustrate the implementation of the gradient descent algorithm for a shallow neural network having hyperbolic tangent activation function. On line 1 we extract the number of observations in the design matrix X, see also (2.7). On lines 2-4 we initialize the regression function A for an initial network parameter 6^ = (j3,W) (not further specified in Listing 5.1). Lines 7 and 9 give the gradient of the objective function for the hyperbolic tangent activation function. On lines 10-11 we update the network parameter as in (5.11) with different learning rates for the different layers. Finally, on lines 12-13 we calculate the updated regression function.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
111
Similar to the Newton method one may try to consider second order terms (Hessians) in gradient descent steps. However, this is not feasible computationally. Therefore, one tries to mimic second order terms by momentum-based gradient descent methods, see Rumelhart et al. [117]. Choose a momentum coefficient v G [0,1) and initialize momentum v(°) =0 G W-
We replace update (5.11) by
For v = 0 we have the classical gradient descent method, for v > 0 we also consider previous gradients (with exponentially decaying rates).
Another improvement has been proposed by Nesterov [99]. Nesterov has noticed that (for convex functions) the gradient descent update may behave in a zig-zag manner. He proposed the following adjustment. Initialize       = 8^°\ Update
Note that this is a special momentum update.
The R interface to Keras2 offers predefined gradient descent methods; for technical details we refer to Sections 8.3 and 8.5 in Goodfellow et al. [56].
Predefined gradient descent methods
• 'adagrad' chooses learning rates that differ in all directions of the gradient and that consider the directional sizes of the gradients ('ada' stands for adapted);
• 'adadelta' is a modified version of 'adagrad' that overcomes some deficiencies of the latter, for instance, the sensitivity to hyperparameters;
• 'rmsprop' is another method to overcome the deficiencies of 'adagrad' ('rmsprop' stands for root mean square propagation);
• 'adam' stands for adaptive moment estimation, similar to 'adagrad' it searches for directionally optimal learning rates based on the momentum induced by past gradients measured by an <?2-norm;3
• 'nadam' is a Nesterov [99] accelerated version of 'adam'.
2Keras is a user-friendly API to TensorFlow, see https://tensorflow.rstudio.com/keras/ 3We would like to indicate that there is an issue with 'adam'. The explanation of the functioning of 'adam' is based on Lemma 10.3 in Kingma-Ba [81], however this lemma is not correct.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
e(t)   _> e(t+i)
- Qt+1VgD*(N,\®)
Q(t) +v(*+l).
(5.12)
(5.13)
Electronic copy available at: https://ssrn.com/abstract=2870308
112
Chapter 5.  Neural Networks
Pre-processing features
We have learned how to pre-process categorical feature components in Section 2.4.1. Choosing a fixed reference level, a categorical feature component with k labels can be transformed into a (k — l)-dimensional dummy vector, for an example see (2.16). We will use this dummy coding here for categorical feature components. In the machine learning community often one-hot encoding is preferred because it is (slightly) simpler and the full rank property of the design matrix X is not an important property in an over-parametrized model.
For the gradient descent method to work properly (and not get trapped) we also need to pre-process continuous feature components. A necessary property therefore is that all feature components live on a comparable scale, otherwise the gradient is dominated by components that live on a bigger scale. In many cases the so-called MinMaxScaler is used. Denote by xj~ and xf the minimal and maximal feature value of the continuous feature component x\. Then we transform this continuous feature components for all cases 1 < i < n by
xiJL * ^M = 2^f4-1 G [-1,1]. (5.14)
xi xi
If the resulting feature values (a;^M)i<i<ra cluster in interval [—1,1], for instance, because of outliers in feature values, we should first transform them non-linearly to get more uniform values. For convenience we will drop the upper index in xff^ in the sequel.
Example 5.4 (example SNN1). We consider the same set-up as in Example 2.10 (GLM1) and Example 3.4 (GAM1). That is, we only use the feature components age and ac in the regression model. We choose two different shallow neural network architectures, a first one having qi = 5 hidden neurons and a second one having qi = 20 hidden neurons, respectively. The first one is illustrated in Figure 5.2 (lhs).
In a first analysis we study the convergence behavior of the gradient descent algorithm for these two shallow neural network architectures (having network parameters of dimensions r = 21 and r = 81, respectively). We therefore partition the data T> into a training set and a validation set of equal sizes. As in Example 3.4 (GAM1) we compress the data because we only have 73 different age's and 36 different ac's which results in at most 2'628 non-empty risk cells. In fact, the training data has 2'120 non-empty risk cells and the validation data has 2'087 non-empty risk cells. We then run the gradient descent algorithm on this compressed data (sufficient statistics).
In order to perform the network calibration we use the R interface to Keras which uses a TensorFlow backend.4 The corresponding code is provided in Listing 5.2. On line 1 we start the keras library and on lines 3-4 we define the design matrix X and the offsets \og(vi). On lines 6-9 we define the shallow neural network, and the output weights j3 are initialized on line 9 to give the homogeneous MLE A. On lines 11-13 we merge the network part with the offset, and we choose the exponential response activation function for the output layer (this part does not involve any trainable weights). On lines 15-16 the model is defined, the optimizer is specified and the objective function is chosen. Line
4see https://keras.io/backend/
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
113
Listing 5.2: Shallow neural network with q\ hidden neurons in Keras
1 1ibrary(keras)
2
3 Design  <-   layer_input(shape  =  c(qO),   dtype  =   Jfloat32J,   name  = 'Design5)
4 LogVol   <-   layer_input(shape  =  c(l),     dtype  =   Jfloat32J,   name  = JLogVolJ)
5
6 Network =  Design %>%
7 layer_dense(units=ql,   activation=3tanh3 ,       name = 3 Layer1 3) %>%
8 layer_dense(units =1,     activation=5linear 3 ,   name='Network 5 ,
9 we ights = list(array(0,   dim = c(ql,l)),   array(log(lambdaO),   dim = c(l))))
10
11 Response  =  1 ist(Network,   LogVol)   %>%  1 ayer_add(name=3 Add 3) %>%
12 layer_dense(units=1,   activation = k_exp,   name  =   'Response 1 ,   trainable = FALSE,
13 we ights = list(array(1,   dim = c(l,l)),   array(0,   dim = c(l))))
14
15 model   <-  keras_model(inputs  =  c(Design,   LogVol),   outputs  = c(Response))
16 model  %>%  compile(optimizer  =  optimizer_nadam() ,   loss  =   'poisson J)
17
18 summary(model)
19
20 model  %>%  fit(X.train,   Y.train,   validation_data = list (X.vali ,Y.vali),
21 epochs = 10000,   batch_s ize = nrow(X.train) , verbose=0)
18 summarizes the model which provides the output of Listing 5.3.
Listing 5.3: Shallow neural network with ql = 5 in Keras: model summary
1 2 3 4	Layer (type)	Output	Shape	Param #	Connected to
	Design   (InputLayer)	(None,	2)	0	
5 6	Layerl (Dense)	(None ,	5)	15	Design [0] [0]
7 8	Network (Dense)	(None,	1)	6	Layerl [0] [0]
9 10	LogVol   (InputLayer)	(None ,	1)	0	
11 12 13	Add (Add)	(None,	1)	0	Network [0] [0] LogVol [0] [0]
14 15	Response (Dense)	(None ,	1)	2	Add [0] [0]
16 Total  params: 23
17 Trainable  params: 21
18 Non-trainable  params: 2
On lines 20-21 of Listing 5.2 we fit this model on the training data and we validate it on the validation data. As batch size we choose the size of the observations (steepest gradient descent method) and we run the gradient descent algorithm for lO'OOO epochs using the nadam optimizer. We illustrate the convergence behavior in Figure 5.5. We observe that we receive a comparably slow convergence behavior for both shallow neural networks. We do not see any sign of over-fitting after lO'OOO gradient descent iterations. Moreover, there is not any visible difference between the two shallow neural networks.
(For no specific reason,) we select the shallow neural network with qi = 20 hidden neu-Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
114
Chapter 5.  Neural Networks
shallow net: gradient descents
0 2000 4000 S000 0000 10000
gradient descent steps
Figure 5.5: Convergence of the gradient descent algorithm for the two shallow neural networks with qi = 5, 20 hidden neurons; blue gives the training set (in-sample) and red gives the validation set (out-of-sample).
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	/■cv >~D	/•cv >~D	£(A,A*)	>~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.1) GLM1	3.1s	11	28.3543	28.3544	0.6052	28.3510	10.2691%
(Ch3.1) GAM1	1.1s	108	28.3248	28.3245	0.5722	28.3134	10.2691%
(Ch5.0) SNN1 gi = 20	87s	81	-	-	0.5730	28.3242	10.2096%
Table 5.2: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 3.1.
rons, and we fit this network over 20'000 epochs on the entire (compressed) data. We call this model SNN1. The results are illustrated on line (Ch5.0) of Table 5.2. An observation is that the neural network model is better than model GLM1 and comparably good as model GAM1, however, it uses much more computational time for calibration than model GAM1. We interpret the results as follows: (1) Model GLM1 is not competitive because the building of the classes for age and ac does not seem to have been done in the most optimal way. (2) Model GAM1 is optimal if the feature components age and ac only interact in a multiplicative fashion, see (3.3). This seems to be the case here,5 thus, model GAM1 is optimal in some sense, and model SNN1 (only) tries to mimic model GAM1.
In Figure 5.6 we plot the resulting marginal frequencies of the three models GLM1, GAM1 and SNN1. For the feature component age we do not see much (visible) differences between models GAM1 and SNN1. For the feature component ac the prediction of model
5In fact, the discovery that model GAM1 and model SNN1 are similarly good unravels some details of our choice of the true regression function A*.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
115
Figure 5.6: Comparison of predicted frequencies in the models GLM1, GAM1 and SNN1.
SNN1 looks like a smooth version of the one of model GAM1. Either model GAM1 over-fits for higher ages of cars or model SNN1 is not sufficiently sensitive in this part of the feature space. From the current analysis it is difficult to tell which model is better, and we may work with both of them. Note that these higher ages of cars have very small volumes (years at risk), see Figure A.5, and therefore may be influenced by special configurations of other feature components. This finishes this example. ■
In a next analysis we should incorporate all feature components of the individual cases. Before doing so, we briefly discuss deep feed-forward neural networks. In the previous example we have seen that the gradient descent algorithm may converge very slowly. This may even be more pronounced if we have interactions between feature components because modeling of interactions needs more neurons qi in the hidden layer. For this reason, we switch to deep neural networks before discussing the example on all feature components.
5.1.3   Deep feed-forward neural networks
As mentioned above, often deep neural networks are more efficient in model calibration, in particular, if the regression function has interactions between the feature components. For this reason, we study deep feed-forward neural networks in this section.
In view of definition (5.6), feed-forward neural networks of depth d > 2 are given by
id
xeX _> ^x)=\ogX(x) = (30 + YJ(3Jzf1\x) = {/3,z^1\x)),
3 = 1
with composition z^d:1\x) = (z^ o ■ ■ ■ o z^)(x) based on hidden layers 1 < m < d z e R9m-i ^ z(™)(z) = 4>{W(m\z) £ Rq'm.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
116
Chapter 5.  Neural Networks
Complexity of deep neural networks
Above we have met the universality theorems which say that shallow neural networks are sufficient from a pure approximation point of view. We present a simple example that can easily be reconstructed by a neural network of depth d = 2, but it cannot easily be approximated by a shallow one. Choose a two-dimensional feature space X = [0, l]2 and define the regression function A : X —> R+ by
X i y X(x) = l + l{X2>l/2} +1{x1>1/2, x2>l/2}   e  {1,2,3}. (5.15)
This regression function is illustrated in Figure 5.7. The aim is to model this regression function with a neural network having step function activation. We rewrite regression
regression function lambda
I-;-1 i—i- 3.D
1.0
feature component X1
Figure 5.7: Regression function (5.15).
function (5.15) choosing step function activations: for x G X we define the first hidden layer
ZW(X) = [z^\x), z£\x))' = l{*2>l/2})' e {0, l}2.
This provides us with
X(x)    =    1 + l{X2>l/2} + 1{xi>1/2}1{x2>1/2}
=   1 + z{21] (x) +      (x) z£\x)
1 + 1{z<i\x)>l/2} + 1{z<i1\x)+z<i\x)>3/2}
with neurons in the second hidden layer given by
z{-2\z)= \zf(z),zf(z))= (t{z2>1/2},t{zi+Z2>3/2})',       ioize [0,1]2. Thus, we obtain deep neural network regression function
x ^ \(x) = (ß,z(-2:1\x)) = (ß,(zWoz(-V)(x)), (5.16)
with output weights ß = (1,1,1)' G M3. We conclude that a deep neural network with d = 2 hidden layers and qi = q2 = 2 hidden neurons, i.e. with totally 4 hidden neurons,
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
117
can perfectly replicate example (5.15). With shallow neural networks there is no similarly efficient way of constructing this regression function with only a few neurons.
Unfortunately, there is no general theory about optimal choices of deep feed-forward neural network architectures. Some examples give optimal numbers of hidden layers under additional assumptions on the feature space, see, for instance, Shaham et al. [122]. An interesting paper is Zaslavsky [144] which proves that qi hyperplanes can partition the space R90 in at most
This is an upper complexity bound for shallow neural networks with step function activation. This bound has an exponential growth for qi < qo, and it slows down to a polynomial growth for qi > qo. Thus, with formula (5.17) we can directly bound the complexity of shallow neural networks having step function activation. A lower bound is given in Montufar et al. [96] for deep neural networks (having ReLU activation function). Shortly said, growing shallow neural networks in width is (much) less efficient in terms of the complexity of the resulting regression functions than growing networks simultaneously in depth and width.
Gradient descent method and back-propagation
Calibration becomes more involved when fitting deep neural networks with many hidden layers, i.e. we cannot simply calculate the gradients as in (5.9)-(5.10). However, there is a nice re-parametrization also known as the back-propagation method. This gives us a recursive algorithm for calculating the gradients. Define for 1 < m < d + 1 the matrices
where we set qd+1 = 1 and = f3jd for jd = 0,.. . , qd.
Corollary 5.5 (back-propagation algorithm). Choose a neural network of depth d > 1 and with hyperbolic tangent activation function. Denote by D*(N, A) the Poisson deviance loss of the single case (N,x,v) with expected frequency A = X(x) = X$(x) for network parameter 9 = (/3, W^d\ .. . ,W^).
• Define recursively (back-propagation)
disjoint sets.
(5.17)
— initialize
ö(d+1\x) = 2[X(x)v-N]   e R;
iterate for 1 < m < d
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
118
Chapter 5.  Neural Networks
• We obtain for 0 < m < d
dD*(N,X)
)
jm+1 = 1, • • • ,<2m+1 i jm =0,... ,q,
8{m+1)(x) (z{m:1)(x))' e R9™+ix(*»+1)
<m
where z^0:1\x)
x (including the intercept component).
For a proof we refer to Proposition 7.4 in [141]. Remarks.
• Corollary 5.5 gives an efficient way of calculating the gradient of the deviance loss function w.r.t. the network parameter. We have provided the specific form of the Poisson deviance loss and the hyperbolic tangent activation function. However, these two items can easily be exchanged by other choices.
• In Listing 5.4 we provide the steepest gradient descent algorithm for a neural network of depth d = 2 and using the back-propagation method of Corollary 5.5. This looks similar to Listing 5.1 where we have already been using the back-propagation notation. Our practical applications will again be based on the R interface to Keras library. I.e. we do not need to bother about the explicit implementation, and at the same time we benefit from momentum-based accelerations.
Example 5.6 (example DNN1). In this example we consider a claims frequency modeling attempt that is based on all feature components. We therefore implement a deep feedforward neural network architecture. The results will be comparable to the ones of models GLM4 and GAM3, and they follow up Table 3.2. We use a modeling approach similar to Conclusion 2.8, namely, we consider the 5 continuous feature components age, ac,
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Listing 5.4: Back-propagation for a deep neural network with d = 2
In <-  nrow(X) #  number  of   observations   in  design  matrix X
2 zl <-  array(1,   c(ql+l,   n))       #  dimension  of   the   1st  hidden layer
3 z2 <-  array(1,   c(q2+l,   n))       #  dimension  of   the  2nd hidden layer
4 zl[-l,]   <-  tanhCWl  '/,*'/,  t(X)) #   initialize  neurons  zl  for Wl
5 z2[-l,]   <-  tanhCW2  '/,*'/,  zl) #   initialize  neurons  z2  for W2
6 lambda  <-  expCtCbeta)   %*%  z2) #   initialize   frequency  for beta
7 for   Ctl   in l:epochs){
8 delta.3       <-  2*(lambda*dat$expo   - dat$claims)
9 grad.beta  <-  z2  '/,*'/, tCdelta.3)
10 delta.2       <-   Cbeta[-1,]   '/,*'/,  delta.3)   *   CI   - z2[-l,]~2)
11 grad.W2       <-  delta.2  '/,*'/, t(zl)
12 delta.1       <-   (t(W2[,-l])   '/,*'/,  delta.2)   *   CI  - zl[-l,]~2)
13 grad.Wl       <-  delta. 1  '/,*'/, as.matrixCX)
14 beta <-  beta  -  rho [3]   *  gr ad . bet a/sqrt C sum C gr ad . bet a ~2 ) )
15 W2 <-  W2   -  rho [2]   *  gr ad . W2 / sqrt ( sum (gr ad . W2 ~2 ) )
16 Wl <-  Wl   -  rho [1]   * grad.Wl/sqrt(sum(grad.Wl~2))
17 zl[-l,]       <-  tanhCWl  '/,*'/, t(X))
18 z2[-l,]       <-  tanhCW2  '/,*'/, zl)
19 lambda <-  expCtCbeta)   '/,*'/, z2)
20 }
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
119
power, area and log(dens), these are transformed with the MinMaxScaler (5.14) to the interval [—1,1]; the binary component gas is set to ±1/2; and the 2 categorical feature components brand and ct are treated by dummy coding. This provides a feature space
X = [-1, l]5 x {-1/2,1/2} x Xbrajad x Xct c R90, (5.18)
of dimension qo = 5 + 1 + 10 + 25 = 41.
Listing 5.5: Deep neural network in Keras
1 1ibrary(keras)
2
3 Design  <-   layer_input(shape  =  c(qO),   dtype  =   'float32',   name  = 'Design')
4 LogVol   <-   layer_input(shape  =  c(l),     dtype  =   'float32',   name  = 'LogVol')
5
6 Network =  Design %>%
7 layer_denseCunits=ql,   activation='tanh',       name='Layerl') %>%
8 layer_denseCunits=q2,   activation='tanh',       name='Layer2') %>%
9 layer_denseCunits=q3,   activation='tanh',       name='Layer3') %>%
10 layer_dense(units=1,     activation='linear',   name='Network',
11 weights = listCarray CO,   dim = cCq3,1)) ,   array(log(lambdaO) , dim=c(l))))
12
13 Response  =  1ist(Network,   LogVol)   %>%  1 ayer_add(name='Add') %>%
14 layer_dense(units=1,   activation=k_exp,   name  =   'Response',   trainable=FALSE,
15 weights = 1ist(array(1,   dim = c(1,1)) ,   array CO,   dim = c(l))))
16
17 model   <-  keras_model(inputs  =  cCDesign,   LogVol),   outputs  = c(Response))
18 model  %>%  compile(optimizer  =  optimizer_nadam(),   loss  = 'poisson')
19
20 summary(model)
Listing 5.6: Deep neural network with (qi, q2, qs) = (20,15,10): model summary
1 2 3 4	Layer (type)	Output	Shape	Param #	Connected to
	Design (InputLayer)	(None ,	41)	0	
5 6	Layerl (Dense)	(None ,	20)	840	Design [0] [0]
7 8	Layer2 (Dense)	(None ,	15)	315	Layerl [0] [0]
9 10	Layer3 (Dense)	(None ,	10)	160	Layer2 [0] [0]
11 12	Network (Dense)	(None ,	1)	11	Layer3 [0] [0]
13 14	LogVol (InputLayer)	(None ,	1)	0	
15 16 17	Add (Add)	(None ,	1)	0	Network [0] [0] LogVol [0] [0]
18 19	Response (Dense)	(None ,	1)	2	Add [0] [0]
20 Total  params: 1,328
21 Trainable  params: 1,326
22 Non-trainable  params: 2
We use a feed-forward neural network of depth d = 3 having qi = 20, q2 = 15 and 53 = 10 hidden neurons, respectively. The corresponding R code is given in Listing 5.5;
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
120
Chapter 5.  Neural Networks
we mention that this code is almost identical to the one in Listing 5.2 except that it has more hidden layers on lines 7-9. This network is illustrated in Figure 5.9 (lhs), below, and the different colors in the input layer exactly correspond to the ones in (5.18). In Listing 5.6 we present the summary of this network model having depth d = 3 with hidden neurons (qi, q2, c/3) = (20,15,10). This model has a network parameter 6 of dimension r = 1'326. The first hidden layer receives 840 parameters; this high number of parameters is strongly influenced by categorical feature components having many different labels. In our case feature component ct has 26 different labels which provides 25 ■ 20 = 500 parameters for the first hidden layer (dummy coding). In (5.19) on embedding layers we are going to present a different treatment of categorical feature components which (often) uses less parameters (and is related to representation learning).
deep net: different batch sizes 10-fold CV losses (non-stratified) 10-fold CV losses (stratified)
Figure 5.8: (lhs) SGD method on different batch sizes (the more light the color the smaller the batch size) with blue being the training data and red the validation data; (middle, rhs) 10-fold cross-validation losses of models GLM4 and DNN1 on the K = 10 individual partitions (non-stratified and stratified) on the identical scale.
We calibrate this deep neural network architecture using the momentum-based gradient descent optimizer nadam, see line 18 of Listing 5.5. In the application of this optimizer we still have the freedom of choosing the number of epochs and the batch size (of the stochastic gradient descent (SGD) method). The batch size should be related to the resulting confidence bounds (2.20) which quantify the amount of experience needed to detect structural differences in frequencies. We therefore consider a validation set (out-of-sample analysis) of 10% of the total data T>, and we learn the neural network on 90% of the data (training set) using the different batch sizes of 1'000, 2'000, 5'000 and lO'OOO policies, and we evaluate the resulting model out-of-sample on the validation data, exactly as described in formula (1.8) on out-of-sample cross-validation. In Figure 5.8 (lhs) we illustrate the resulting decreases in in-sample losses (blue) and out-of-sample losses (red), the more light the color the smaller the batch size.
We observe that for smaller batch sizes of l'OOO to 2'000 cases, the model starts to over-fit after roughly 50 epochs. For bigger batch sizes of 5'000 or lO'OOO cases the SGD algorithm seems to over-fit after roughly 100 epochs and it provides overall better results than for smaller batch sizes. For this reason we run the SGD method for 100 epochs on a batch size of lO'OOO cases. We call the resulting model DNN1, and we provide the results in Table 5.3, line (Ch5.1). We observe that model DNN1 has a clearly better
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
121
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev	/•CV J~D	£(A,A*)	/•is J~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4) GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3) GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.1) DNN1	56s	1'326	28.1163	28.1216	0.3596	27.9280	10.0348%
Table 5.3: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 3.2.
performance than models GLM4 and GAM3 in terms of the resulting estimation loss £(A, A*) of 0.3596 versus 0.4137 and 0.3967, respectively. This illustrates that model DNN1 can capture (non-multiplicative) interactions which have not been considered in models GLM4 and GAM3. This picture is negatively affected by the fact that we receive a (bigger) bias in model DNN1, i.e. the average frequency turns out to be too low. This bias is problematic in insurance because is says that the balance property of Proposition 2.4 is not fulfilled.
From Table 5.3 we see that typically the cross-validation losses between the non-stratified version and the stratified version are very similar (28.1163 vs. 28.1216 for DNN1). In particular, there does not seem to be any value added by considering the stratified version. In Figure 5.8 (middle, rhs) we revise this opinion. Figure 5.8 (middle, rhs) illustrates the 10-fold cross-validation losses on every of the K = 10 partitions individually for the non-stratified version (middle) and the stratified version (rhs). The crucial observation is that the stratified version (rhs) fluctuates much less compared to the non-stratified one (middle). Moreover, from the stratified version we see that there is a systematic improvement from model GLM4 to model DNN1, whereas in the non-stratified version this improvement could also be allocated to (pure) random fluctuations (coming from process uncertainty). This finishes this deep neural network example. ■
In the previous example we have seen that the deep neural network architecture outperforms the GLM and the GAM approaches. This indicates that the latter two models are missing important (non-multiplicative) interactions. In a next step we could explore these interactions to improve the GLM and GAM approaches. Another direction we could pursue is to explore other neural network architectures compared to the one considered in Example 5.6. Our next goal is slightly different, namely, we are going to discuss an other treatment of categorical feature components than the one used in (5.18).
Embedding layers for categorical feature components
In view of Example 5.6 it seems to be inefficient to use dummy coding (or one-hot encoding) for categorical feature component. In fact, this treatment of nominal labels almost seems to be a waste of network parameters. In natural language processing (NLP) one embeds words (or categorical features) into low dimensional Euclidean spaces
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
122
Chapter 5.  Neural Networks
R . This approach is also called representation learning and it has been proposed in the actuarial literature by Richman [108, 109]. We review it in this section. We use the feature component brand as illustration to discuss so-called embedding layers. We have 11 different car brands, and one-hot encoding provides the 11 unit vectors a;(ih) G Rii as
numerical representation, see Table 2.1. Thus, every car brand has its own unit vector and the resulting distances between all different car brands are the same (using the Euclidean norm in R11). The idea of an embedding layer and representation learning is that we embed categorical labels into low dimensional Euclidean spaces and proximity in these spaces should mean similarity for regression modeling. Choose b £ N and consider an embedding mapping (representation)
e : (Bl,..., B6} -> R6,       brand ^ e(brand) d= ebrand. (5.19)
If two car brands are very similar, say Bl and B3 are very similar (w.r.t. the regression task), then naturally their embedding vectors eB1 and eB3 should be close together in Rb; we illustrate this in Figure 5.11 (lhs), below, for an embedding dimension of b = 2. Embedding (5.19) can be viewed as an additional (initial) network layer, where each categorical label sends a signal to b embedding neurons ebrand = (ebrand,..., ebrand)', see Figure 5.9 (rhs) for a two-dimensional embedding b = 2 of brand. The corresponding optimal embedding weights (embedding vectors) will then be learned during model calibration adding an additional layer to the gradient descent and back-propagation method. In the subsequent analysis we choose separate embedding neurons (representations) for each categorical feature component (brand and ct), that is, we embed the two categorical feature components into two (parallel) embedding layers R615^ and Rb" of dimensions ebrand and bct, respectively. These are then concatenated with the remaining feature components, resulting in a feature space after embedding given by
Xe = [-1, l]5 x {-1/2,1/2} x R6brMd x R6" c R90, (5.20)
of dimension qo = 5 + 1 + frbrand + bct, see Figure 5.9.
We briefly discuss the resulting dimensions of the network parameter 6 for the different modeling approaches. Choose a shallow neural network with qi = 20 hidden neurons. If we use dummy coding for brand and ct, the input layer has dimension qo = 41, see (5.18). This results in a network parameter 6 of dimension r = qi + 1 + qi(qo + 1) = 20 + 1 + 20 ■ (41 + 1) = 861 for the dummy coding treatment of our categorical variables. Choose embedding dimensions bbT3ILd = ^ct = 2 for the two embedding layers; this results in 11 ■ 2 + 26 ■ 2 = 74 embedding weights. Feature space Xe has dimension qo = 10. Including the embedding weights this results in a network parameter 6 of dimension r = 20+ 1+20- (10-1-1) +74 = 315, thus, we obtain much less parameters to be calibrated in this embedding layer treatment of our categorical variables. In the remainder of this section we revisit Example 5.6, but we use embedding layers for brand and ct of different dimensions. This example also shows how embedding layers can be modeled in the R interface to Keras.
Example 5.7 (example DNN2/3 using embedding layers). We revisit Example 5.6 using a feed-forward neural network of depth d = 3 having (qi, q%, qs) = (20,15,10) hidden
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
123
Listing 5.7: Deep neural network with embedding layers in Keras
1 Design  <-   layer_input(shape  =  c(qOO),   dtype  =   'float32',   name  = 'Design')
2 Brand    <-   layer_input(shape  =  c(l),       dtype  =   'int32',       name  = 'Brand')
3 Canton  <-   layer_input(shape  =  c(l),       dtype  =   'int32',       name  = 'Canton')
4 LogVol   <-   layer_input(shape  =  c(l),       dtype  =   'float32',   name  = 'LogVol')
5
6 BrandEmb  =  Brand %>%
7 layer_embedding(input_dim=ll,   output_dim=l,   input_length=l,   name='BrandEmb') %>%
8 layer_flatten(name='Brand_flat')
9
10 CantonEmb  =  Canton %>%
11 layer_embedding(input_dim=26,   output_dim=l,   input_length=l,   name='CantonEmb') %>%
12 layer_flatten(name='Canton_flat')
13
14 Network =  list(Design,   BrandEmb,   CantonEmb)   %>%  layer_concatenate(name='Cone') %>%
15 layer_denseCunits=20,   activation='tanh',   name='Layerl') %>%
16 layer_denseCunits=15,   activation='tanh',   name='Layer2') %>%
17 layer_denseCunits=10,   activation='tanh',   name='Layer3') %>%
18 layer_denseCunits=l,   activation='linear',   name='Network',
19 weights = 1istCarray CO,   dim = c(10,1)) ,   array(log(lambdaO),   dim = c(l))))
20
21 Response  =  1ist(Network,   LogVol)   %>%  1 ayer_add(name='Add') %>%
22 layer_dense(units=l,   activation=k_exp,   name  =   'Response',   trainable=FALSE,
23 weights=1ist(array(1,   dim=c(1,1)),   array(0,   dim=c(1))))
24
25 model   <-  keras_model(inputs  =  c(Design,Brand,Canton,LogVol),   outputs  = c(Response))
26 model  %>%  compile(optimizer  =  optimizer_nadam(),   loss  = 'poisson')
Figure 5.9: Dummy coding with qo = 41 (lhs), embedding layers for brand and ct with frbrand = fret = 1 (middle) and frbrand = fret = 2 (rhs) resulting in qo = 8 and qo = 10, respectively.
neurons. The categorical feature components brand and ct are considered in embedding layers which provides us with feature space Xe given in (5.20).
We consider two different variants of embedding layers: model DNN2 considers embedding layers of dimension frbrand = fret = 1 (Figure 5.9, middle) and model DNN3 of dimension frbrand = fret = 2 (Figure 5.9, rhs). In Listing 5.7 we provide the R code for a deep neural network with embedding layers for brand and ct of dimension frbrand = fret = 1> see output_dim=l on lines 7 and 11 of Listing 5.7. In Listing 5.8 we present the summary of this model. It has r = 703 trainable network parameters which is roughly half as much
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
124
Chapter 5.  Neural Networks
Listing 5.8: Deep network with embeddings öbrand = bct = 1 and (qi, q2, qs) = (20,15,10)
1 2 3 4	Layer (type)	Output	Shape	Param #	Connected to
	Brand   (InputLayer)	(None ,	1)	0	
5 6	Canton   (InputLayer)	(None ,	1)	0	
7 8	BrandEmb (Embedding)	(None ,	1, 1)	11	Brand [0] [0]
9 10	CantonEmb (Embedding)	(None ,	1, 1)	26	Canton [0] [0]
11 12	Design (InputLayer)	(None ,	6)	0	
13 14	Brand_flat (Flatten)	(None ,	1)	0	BrandEmb [0] [0]
15 16	Canton_flat (Flatten)	(None ,	1)	0	CantonEmb [0] [0]
17 18 19 20	Cone (Concatenate)	(None ,	8)	0	Design [0] [0] Brand_f lat [0] [0] Canton_f lat [0] [0]
21 22	Layerl (Dense)	(None ,	20)	180	Cone [0] [0]
23 24	Layer2 (Dense)	(None ,	15)	315	Layerl [0] [0]
25 26	Layer3 (Dense)	(None ,	10)	160	Layer2 [0] [0]
27 28	Network (Dense)	(None ,	1)	11	Layer3 [0] [0]
29 30	LogVol (InputLayer)	(None ,	1)	0	
31 32 33	Add (Add)	(None ,	1)	0	Network [0] [0] LogVol [0] [0]
34 1*	Response (Dense)	(None ,	1)	2	Add [0] [0]
36 Total  params: 705
37 Trainable  params: 703
38 Non-trainable  params: 2
compared to the dummy coding approach of Listing 5.6.
In a first step, we again determine the optimal batch size for the SGD method. We therefore repeat the out-of-sample analysis on validation data of Example 5.6. We partition the data T> into a training sample of size 90% of the data, and the remaining 10% of the data are used for validation, exactly as described in formula (1.8) on out-of-sample cross-validation. In Ligure 5.10 we illustrate the resulting decreases in in-sample losses (blue) and out-of-sample losses (red), the more light the color the smaller the batch size for the two different choices of embedding dimensions frbrand = ^ct £ {L 2}. In general, we observe a slower convergence compared to Ligure 5.8 (lhs), and also we prefer a smaller batch size of roughly 5'000 cases.
In Lable 5.4 we present the results which are based on 200 epochs with a batch size of 5'000 cases. We observe that the embedding layers substantially increase the quality of the model, compare the estimation losses £(\, A*) of 0.1600 and 0.2094 to the corresponding ones of the other models.   Lrom this we conclude that we clearly favor deep neural
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
125
deep net: different batch sizes deep net: different batch sizes
D 50 100 150 200 0 50 100 150 200
epochs epochs
Figure 5.10: SGD method on different batch sizes with embedding layer dimensions (lhs) frbrand = bct = 1 and (rhs) frbrand = bct = 2; in blue the training losses (in-sample) and in red the validation losses (out-of-sample).
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	>~D	/•cv >~D	£(A,A*)	>~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4) GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3) GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.1) DNN1	56s	1'326	28.1163	28.1216	0.3596	27.9280	10.0348%
(Ch5.2) DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3) DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
Table 5.4: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 5.3.
networks with embedding layers for categorical feature components. However, this again comes at the price of a bigger bias in the average frequency, also compare to Proposition 2.4 on the balance property. Of course, this is problematic in insurance. K-to\& cross-validation is a bit time-consuming in this case. However, the results of Table 5.4 show that we clearly favor the deep neural network with embedding layers, though we cannot decide between the two models DNN2 and DNN3 having different embedding dimensions. Note that optimal embedding dimensions may also differ between the different categorical variables which has not been considered here. Another nice consequence of low dimensional embedding layers is that they allow us for graphical illustration of the representations learned. In Figure 5.11 we plot the resulting embedding weights ebrand 6 R2 and ect 6 R2 of the categorical variables brand and ct for embedding dimension frbrand = ^ct = 2. Such a representation may provide clustering among categorical variables, for instance, from Figure 5.11 (lhs) we conclude that car
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
126
Chapter 5.  Neural Networks
2-dimensional embedding of brand
2-dimensional embedding of ct
GL AR
P SO
SH A,
dimension 1
dimension 1
Figure 5.11: Embedding weights ebrand 6 R2 and ect 6 R2 of the categorical variables brand and ct for embedding dimension frbrand = bct = 2.
brand B12 is very different from all other car brands.
Figure 5.12: Resulting estimated frequencies (on log scale) of models DNN2 vs. GAM3 (lhs), DNN2 vs. DNN3 (middle), and true vs. DNN3 (rhs).
In Figure 5.12 we illustrate the resulting frequency estimates (on log scale) on an individual policy level. The graphs compare models DNN2 vs. GAM3 (lhs), models DNN2 vs. DNN3 (middle), as well as the true log frequency log A* to model DNN2 (rhs). Comparing the graphs on the right-hand side of Figures 3.7 and 5.12, we see a clear improvement in frequency estimation from model GAM3 to model DNN2. Especially, policies with small frequencies are captured much better in the latter model. What might be a bit worrying is that the differences between models DNN2 and DNN3 are comparably large. Indeed, this is a major issue in applications of network predictions that we may receive substantial differences on an individual policy level, which cannot be detected on a portfolio level. These differences may even be induced by choosing different seeds in the SGD calibration method. For a broader discussion we refer to Richman-Wuthrich [111]. This finishes the example. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
127
Special purpose layers
Besides the classical hidden feed-forward layers and the embedding layers, there are many other layers in neural network modeling that may serve a certain purpose. We mention some of them.
Drop-out layers
A method to prevent from over-training individual neurons to a certain purpose is to introduce so-called drop-out layers. A drop-out layer, say, after 'Layer2' on line 17 of Listing 5.7 would remove during a gradient descent step at random any of the 15 neurons in that layer with a given drop-out probability <f> 6 (0,1), and independently from the other neurons. This random removal will imply that the remaining neurons needs to sufficiently well cover the dropped-out neurons. This implies that a single neuron is not over-trained to a certain purpose because it may need to take over several different roles at the same time; we refer to Srivastava et al. [124] and Wager et al. [130].
Normalization layers
We pre-process all feature components of x so that they are living on a comparable scale, see (5.14) for the MinMaxScaler of continuous components. This is important for the gradient descent algorithm to work properly. If we have very deep neural network architectures with, for instance, ReLU activations, it may happen that the activations of the neurons in z(m) again live on different scales. To prevent from this scale-imbalance one may insert normalization layers between different hidden layers.
Skip connections
In the next section we are going to meet skip connections which are feed-forward connections that skip certain hidden layers. That is, we may decide that a given feature component xi activates neurons and that it is at the same time directly linked to, say, neurons z^\ skipping the second hidden layer. Skip connections are often used in very deep feed-forward neural network architectures because they may lead to faster convergence in calibration. In our examples above, we have used a skip connection to model the offset. Note that the volume directly impacts the output layer, skipping all hidden layers, see for instance Listing 5.7, line 21.
Depending on the problem there are many other special layers like convolutional layers and pooling layers in pattern recognition, recurrent layers in time series problems, noise layers, etc.
5.1.4   Combined actuarial neural network approach The CANN regression model
In this section we revisit the modeling approach proposed in the editorial "Yes, we CANN!" of ASTIN Bulletin [140].6 The main idea is to Combine a classical Actuarial
8In my presentation on "Yes, we CANN!" at the Waterloo Conference in Statistics, Actuarial Science, and Finance on April 25-26, 2019, I have been kindly introduced by Prof. Sheldon Lin (University of
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
128
Chapter 5.  Neural Networks
regression model with a Neural Network (CANN) approach. This will allow us to simultaneously benefit from both worlds. The basis of the CANN approach is that the classical actuarial regression model can be brought into a neural network structure. This is the case for many/most parametric regression models.
We illustrate the CANN approach on the GLM example of Chapter 2 and the deep feed-forward neural network of this chapter. Assume that the two models have common feature space X C R90. The GLM has regression function, see (2.2),
x _> AGLM(x)=exp(/3GLM,;E),
with GLM parameter ^GLM g R«+1, and the deep feed-forward neural network of depth d has regression function, see (5.6),
x _> ADNN(x) = exp(/3DNN,^:1)(;E)),
with network parameter 6>DNN = (,SDNN, W^d\ ..., W^) e M9d+1+^l=i
For the CANN approach we combine these two models, namely, we define the CANN regression function
x _> X(x) = exp{{(3GLM,x) + {(3Bm,z^1\x))}, (5.21)
with joint network parameter 9 = (/3GLM,0DNN) 6 W having dimension r = qo + 1 + qd + 1 + J2m=i qm(qm-i + !)• This is illustrated in Figure 5.13.
GLM skip connection
Figure 5.13: Illustration of the CANN approach.
The second ingredient in the CANN approach [140] is a clever fitting strategy: initialize the network parameter 9^ 6 M.r for gradient descent so that we exactly start in the classical MLE of the GLM. Denote by /3°LM the MLE of ,r3GLM given in Proposition 2.2.
Toronto). In his introduction Prof. Lin suggested that I may also rename my title to "Let's make Actuarial Science great again!".
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
129
Start the gradient descent algorithm (5.11) for network calibration of the CANN regression model (5.21) in
0(0) = ^LM ^DNN = Q) W(d)t        W(l)^ g Rr (5 22)
Remarks 5.8.
• For initialization (5.22), the gradient descent algorithm then tries to improve the GLM using network features. If the resulting loss substantially decreases during the gradient descent algorithm, then the GLM can be improved, otherwise the GLM is already good. In Chapter 7 we come back to this idea called boosting, which tries to improve models in a stage-wise adaptive way.
• The MLE /3G can either be declared to be trainable or non-trainable. In the latter case we build the network around the GLM compensating for weaknesses of the GLM.
The CANN regression approach in the Poisson case
The CANN regression model (5.21) can easily be implemented similar to Listing 5.7. Though, in some cases this might be a bit cumbersome. The Poisson regression model even admits for a much simpler implementation if we keep the (initial) value j3 as non-trainable for the GLM parameter /3GLM in (5.21).
In that case we have for all cases (Ni,Xi,Vi), i = 1,..., n,
^   Poi(exp{(3GLM,^) + (/3DNN,^:1)(^))}^)
(i>   Poi(exp(/3DNN,^1)(;El))^)   (i> Poi(ADNN(^), (5.23)
where we have defined the working weights
/-glm \
Vi = exp(p      ,Xi)Vi. (5.24)
Thus, in this case the CANN regression model calibration is identical to the classical feed-forward neural network calibration, only replacing the original volumes (years at risk) Vi by the working weights V{. In more statistical terms log^) acts as an offset in the regression model.
Remarks 5.9.
• Observe that in (5.24) we use the GLM estimate for the modification of the years at risk V{. Of course, we can replace the GLM by any other regression model, for instance, we may use the GAM instead. Thus, (5.23)-(5.24) provides a very general way of back-testing (boosting) any regression model adding neural network features to the original regression model. Boosting ideas are going to be discussed in detail in Chapter 7.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
130
Chapter 5.  Neural Networks
• In approach (5.23)-(5.24), the GLM is fully non-trainable. This could be modified by a credibility weight where we declare the credibility weight to be trainable. In practical implementations this means that the offset on lines 21-23 of Listing 5.7 may be declared to be (partially) trainable, too.
Example 5.10 (example CANN1/2). In this example we boost model GAM3 of Example 3.5 with a neural network (in a CANN approach). We have seen that model GAM3 is optimal if we only allow for multiplicative interactions. We aim at exploring other forms of interactions in this example. We use the frequency estimates AGAM3(cci) of model GAM3 to receive the working weights Vi = A AM3(a^)vi, similarly to (5.24), and we declare this part to be non-trainable (offset). We then calibrate a deep feed-forward neural network as in (5.23). We choose the network of Example 5.7 having depth d = 3 with (qi, q2, qs) = (20,15,10) hidden neurons. The categorical feature components brand and ct are considered in embedding layers which provides feature space (5.20). This CANN approach is illustrated in Figure 5.13, if we replace the GLM skip connection (in orange color) by a GAM skip connection. Lhis regression model can be calibrated with the R code given in Listing 5.7 replacing the years at risk Vi by the working weights d{. Lor model calibration we run the SGD method over 50 epochs on batch sizes of lO'OOO cases for embedding layer dimensions ^brand = ^ct =
1 (called model CANN1) and frbrand = bct = 2 (called model CANN2). Note that after roughly 50 epochs the model starts to over-fit to the training data. Lhis is much faster than in Example 5.7 where we have been running the gradient descent algorithm for 200 epochs. Lhe reason for this smaller number of necessary epochs is that the initial value received from model GAM3 is already reasonable, therefore less gradient descent steps are needed. On the other hand, we also see that the GAM3 calibration is rather good because it takes the gradient descent algorithm roughly 10 epochs to leave the GAM3 solution, i.e. in the first 10 SGD epochs the objective function only decreases very little.
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	/•cv	/•CV J~D		/•is J~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch3.3) GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.4) CANN1	28s	703+	27.9292	27.9362	0.2284	27.8940	10.1577%
(Ch5.5) CANN2	27s	780f	27.9306	27.9456	0.2092	27.8684	10.2283%
(Ch5.2) DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3) DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
Lable 5.5: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters (} only considers the network parameters and not the non-trainable GAM parameters), this table follows up from Lable 5.4.
Lhe results are presented in Lable 5.5. We observe that the CANN approach leads to a substantial improvement of model GAM3, the estimation loss £(A,A*) is roughly halved. Lhis indicates that we are missing important (non-multiplicative) interactions
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
131
in the GAM regression function. Moreover, the results of the CANN and the DNN approaches are comparably good, however, the CANN models need much less run time for calibration. This finishes this example. ■
In the previous example we have boosted model GAM3 by neural network features. We could also think of a more modular approach for detecting explicit weaknesses in a chosen model. This is exactly what we are going to describe in the next example.
Example 5.11 (exploring missing interactions). In this example we use the framework (5.23)-(5.24) to explore which pairwise (non-multiplicative) interactions are missing in model GAM3. Denote the estimated regression function of model GAM3 by x i—y AGAM3(cc). The working weights are then given by Vi = A AM3(a^)vi. For testing missing pairwise interactions between, say, components x\ and xk of x, we modify the CANN regression function x \-> ADNN(cc) in (5.23) as follows
(xhxk) i y \u(xl,xk) = eXp(pu,z(d:1\xl,xk)). (5.25)
Thus, we restrict the input space to the two variables x\ and xk under consideration. For this neural network module we use the same architecture of depth d = 3 as in the previous examples with embedding layers of dimension 2 for the categorical feature components, and we run the SGD algorithm over 50 epochs on batch sizes of lO'OOO cases.
	age	ac	power	gas	brand	area	dens	ct
age		28.1055	28.1055	28.1055	28.0783	28.1055	28.1055	28.1056
ac	0.3966		28.0796	28.0976	28.0071	28.1039	28.1028	28.1055
power	0.3968	0.3718		28.1029	28.1055	28.1130	28.1044	28.1057
gas	0.3966	0.3935	0.3965		28.1059	28.1056	28.1059	28.1055
brand	0.3904	0.3203	0.3968	0.3976		28.1058	28.1055	28.1057
area	0.3966	0.3950	0.4091	0.3965	0.3977		28.1053	28.1055
dens	0.3966	0.3939	0.3962	0.3993	0.3969	0.3964		28.1055
ct	0.3971	0.3967	0.3964	0.3965	0.3973	0.3966	0.3965	
Table 5.6: In-sample Poisson deviance losses (top-right) and estimation losses (bottom-left) of pairwise interaction improvements of model GAM3.
The results are presented in Table 5.6. The upper-right part shows the resulting in-sample losses after considering pairwise interactions, the base value of model GAM3 is 28.1055, see line (Ch3.3) of Table 5.5. The lower-left part shows the resulting estimation losses £(A, A*) after considering pairwise interactions, the base value of model GAM3 is 0.3967. The numbers in red color show pairwise interactions that lead to a substantial decrease in the corresponding loss functions. This indicates that model GAM3 should be enhanced by such non-multiplicative interactions, in particular, an additional interaction between the feature components ac and brand can substantially improve the regression model. The advantage of this neural network boosting (5.25) is that we do not need to assume any functional form for the missing interactions, but the neural network module is capable to find an appropriate functional form (if necessary).
We could now explore other interactions, for instance, between three feature components, etc. We refrain from exploring more missing properties, but we refer to Section 3.5 of Schelldorfer-Wuthrich [119] for another illustrative example. This finish our example. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
132
Chapter 5.  Neural Networks
5.1.5   The balance property in neural networks
We briefly recall the balance property studied in Proposition 2.4. In the GLM framework of Chapter 2 we consider a design matrix X G Rrax(g+1) of full rank q + 1 < n. The first column of this design matrix is identically equal to 1 modeling the intercept fto of the GLM regression function
x i—^ X(x) = exp(/3,x),
with regression parameter j3 G R9+1, see (2.2) and (2.7). Using the Poisson deviance loss D*(N,X) as objective function we receive a convex optimization problem in a minimal representation, which implies that we have a unique MLE j3 for j3 in this GLM context. For the intercept component this results in requirement
——D*(N,X) = E2 KexP(/3,^i) - = 0. (5.26) Cpo i=1
Thus, a critical point of the intercept component Bq implies that the balance property is fulfilled. Remark that for exactly this reason one should exclude the intercept from any regularization, see Remarks 4.10. In neural network calibration we do not pay any attention to this point, in fact, we apply early stopping to the gradient descent algorithm. As a consequence we do not stop the algorithm in a critical point of the intercept component and an identity similar to (5.26) typically fails to hold, see for instance the last column of Table 5.5.
Of course, it is not difficult to correct for this deficiency. We are going to discuss this along the lines of Wuthrich [139]. We come back to the deep feed-forward neural network (5.6) given by
x
&X       X(x) =exp\ßo + Y/ßjzf'\ = exp (ß, z^d:1\x)) .
By considering this expression we see that we have a classical GLM in the output layer based on modified features (pre-processed features)
x G X g> gd:l)(x) = (zW 0 ... 0 ZWj (j,.) G R9m_
Thus, the neural network x i—> z^d:1\x) does a feature engineering which is then used in a GLM step.
Assume that QGDM = (^ ? W^d\ ..., lU^1-1) is an early stopped gradient descent calibration which provides neural network pre-processing x i—> zSd'^{x). This motivates to define the pre-processed design matrix
3t=(^d:1)(Xi)) eRnx(gd+l) \ 1       v Vl<i<n:0<Z<oj
l<i<n;0</<gd
where we set (again) z^'^ixi) = 1, for all 1 < i < n, modeling the intercept component Po of j3 G M9d+1. Based on this design matrix X we can run a classical GLM receiving a unique MLE j3      from the following requirement, see also Proposition 2.2,
X'Uexp{X/3} = 3t'N, (5.27) Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
133
whenever X has full rank + 1 < n. In neural network terminology this means that for the final calibration step we freeze all network weig hts (t?(d),..., W^) and we only optimize over j3 which is a classical convex/concave optimization problem (depending on the sign), and it can be solved by Fisher's scoring method or the IRLS algorithm. This provides improved network parameter QGDM+ = (/3 ? W^d\ ..., W^), where the improvement has to be understood in-sample because it could lead to over-fitting. Over-fitting can be controlled by either a more early stopping rule and/or a low dimensional last hidden layer z^d\
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	/•cv	/•CV J~D	£(A,A*)	/•is J~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4) GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3) GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.1) DNN1	56s	1'326	28.1163	28.1216	0.3596	27.9280	10.0348%
(Ch5.2) DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3) DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
(Ch5.6) DNN3 (balance)	135s	780	-	-	0.2006	27.7509	10.2691%
Table 5.7: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 5.4.
Example 5.12 (bias regularization in neural networks). We revisit model DNN3 provided in Table 5.4. This model considers a deep neural network of depth d = 3 with hidden neurons (q±, q2, qs) = (20,15,10). For the two categorical features brand and ct it uses 2-dimensional embedding layers, see Figure 5.9 (rhs). We take the DNN3 calibration provided on line (Ch5.3) of Table 5.4 and we apply the additional GLM step (5.27) for bias regularization (the balance property). Observe that the model DNN3 provides an average frequency of 9.6908% which is (much) smaller than the balance property 10.2691%.
The results are presented on line (Ch5.6) of Table 5.7. We observe the mentioned in-sample improvement, the balance property, and in this case also an estimation loss improvement which says that in the present situation the additional GLM step (5.27) provides a real model improvement. In fact, these results are rather promising and they suggest that we should always explore this additional GLM step. ■
5.1.6   Network ensemble
In the previous example we have seen that the issue of the failure of the balance property in neural network calibrations can easily be solved by an additional GLM step (5.27). In the next example we further explore this issue. The key idea is to build several neural network calibrations for the same architecture. Averaging over these calibrations should
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
134
Chapter 5.  Neural Networks
provide better predictive models. This intuition is inspired by the ensemble learning methods presented in Section 7.2, below.
Example 5.13 (network blending/nagging predictor). In Example 5.12 we have provided a solution to be compliant with the balance property in neural network calibrations. The purpose of this example is two fold. First we would like to study how severe the failure of the balance property may be. Therefore, we choose the neural network architecture of model DNN3 and we fit this model to the data M = 50 times with 50 different seeds. Early stopping will provide 50 different neural network calibrations for the same architecture which are comparably good in terms of in-sample losses. The first purpose of this example is to analyze the volatility received in the balance property. The second purpose of this example is to see what happens if we blend these models by averaging over the resulting predictors resulting in the nagging predictor.
We run the gradient descent algorithm M = 50 times on the same network architecture. We choose all hyperparameters identical, this includes the network architecture, the gradient descent algorithm, its parameters (number of epochs = 200, batch size = 5'000, nadam optimizer, etc.) and the split in training and validation data. The only difference between the 50 different runs is the choice of the initial seed 8^ of the network parameter, see also (5.11).
balance property over 50 SGD calibrations in-sample losses over 50 SGD calibrations estimation losses over 50 SGD calibrations
averagefrequencies in-sample losses estimation losses
Figure 5.14: Gradient descent calibrations of the same network architecture using M = 50 different seeds: (lhs) average frequencies, (middle) in-sample losses on training data, (rhs) estimation losses; orange line corresponds to model DNN3 of Table 5.7, magenta line corresponds to the blended model.
In Figure 5.14 (lhs) we show the box plot of the resulting average frequencies over these 50 calibrations. We observe considerable differences between the different estimated models. The average frequency over all 50 runs is 10.2404% (magenta line), thus, slightly smaller than the balance property of 10.2691% (cyan line). However, the standard deviation in these observations is 0.3539% which is unacceptably high. If we allow for this uncertainty then we can easily charge a premium (on portfolio level) that is 5% too high or too low! Thus, we may receive very accurate prices on a policy level but we may misspecify the price on the portfolio level if we do not pay attention to the balance property. Figure 5.14 (middle, rhs) show the box plots of the in-sample losses C1^ on the training data and the estimation losses £(A,A*), respectively, over the 50 different seeds; the
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
135
orange lines show the figures of model DNN3 in Table 5.7. From these plots we conclude that model DNN3 has a rather typical behavior because the loss figures over the 50 calibrations spread around the ones of model DNN3.
Now comes the interesting fun part. The M = 50 different runs have provided 50 network calibrations A^m^(-), m = 1,..., M, always on the same network architecture. A natural idea is to consider the ensemble of these models by averaging them. This motivates the (blended) regression function called nagging predictor in Richman-Wuthrich [111]
1 m
x   ^   Ablend(x) = _£X(m)(a^ {5 28)
m=l
The magenta lines in Figure 5.14 (middle, rhs) show the in-sample loss C1^ and the estimation loss £(Ablend, A*) of this blended model. We observe that this blending blasts all other approaches in terms of model accuracy!
in-sample losses over 50 SGD calibrations
bias corrected
estimation losses over 50 SGD calibrations
bias corrected
Figure 5.15: Gradient descent calibrations of the same network architecture using M = 50 different seeds (balance property regularized versions in red color): (lhs) in-sample losses, (rhs) estimation losses; orange line corresponds to model DNN3 of Table 5.7, magenta and green lines correspond to the blended models.
We can even go one step further. Before blending the models A^(-), m = 1,..., M, we can apply the regularization step, i.e. the additional GLM step (5.27) to each calibration. This will generally decrease the in-sample losses and it may also act positively on the estimation losses (if we do not over-fit). This is exactly illustrated in Figure 5.15 and reflects the step from the blue box plots to the red ones. Indeed, we generally receive better models by balance property adjustments.
In the final step we can blend these regularized models to the nagging predictor analogously to (5.28). This corresponds to the green lines in Figure 5.15. We observe that for the blended models, the balance property adjustment is not necessary because the green and the magenta lines almost coincide.
We summarize our findings in Table 5.8. We observe that the blended method on line (Ch5.7) by far outperforms any other predictive model in terms of estimation loss,
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
136
Chapter 5.  Neural Networks
		run	#	CV loss	strat. CV	est. loss	in-sample	average
		time	param.	/•cv	/•CV J~D	£(A,A*)	/•is J~D	frequency
(ChA.l	true model A*			27.7278				10.1991%
(Chl.l)	homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4)	GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3)	GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.4)	CANN1	28s	703	27.9292	27.9362	0.2284	27.8940	10.1577%
(Ch5.5)	CANN2	27s	780	27.9306	27.9456	0.2092	27.8684	10.2283%
(Ch5.1)	DNN1	56s	1'326	28.1163	28.1216	0.3596	27.9280	10.0348%
(Ch5.2)	DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3)	DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
(Ch5.6)	DNN3 (balance)	135s	780	-	-	0.2006	27.7509	10.2691%
(Ch5.7)	blended DNN	-		-	-	0.1336	27.6939	10.2691%
Table 5.8: Poisson deviance losses of If-fold cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Tables 5.5 and 5.7.
i.e. compared to the true model. The in-sample loss of 27.6939 seems slightly too low, because it is lower than loss 27.7278 of the true model. This indicates (slight) over-fitting of the blended model.
In Richman-Wuthrich [111] we study convergence properties of the nagging predictor in an explicit motor third-party liability insurance example. The findings are that the nagging predictor converges in roughly 20 steps (20 blended neural network predictors) and the corresponding uncertainty becomes sufficiently small in 40 steps on portfolio level. On an individual policy level there still remains quite some uncertainty on a few policies after blending 40 models, and uncertainties only become small after blending over 400 neural network predictors. ■
Remarks 5.14.
• At latest now, we are in the two modeling cultures dilemma described by Breiman [15]. In the GLM chapter we have started off from the data modeling culture specifying explicit parametric models that can be analyzed and interpreted, for instance, in terms of goodness of fit. In the previous Example 5.13 we have fully arrived at the algorithmic modeling culture where it's all about predictive performance. That is, how can we twist and blend the models to tease a slightly better predictive accuracy.
• Our analysis has always been based on a training/learning set for fitting the models (and providing in-sample losses) and a validation set for tracking early stopping (and providing out-of-sample losses). We have been able to do our analysis based on two sets because we know the true model A* which allowed for model evaluation (in green color). If the true model is not known, one typically chooses a third data set called test data set which empirically plays the role of the true model. Thus, in that case the entire data is partitioned into three parts.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
137
• In the GLM Chapter 2 we have also discussed AIC for model selection. AIC cannot be used within neural networks. Note that AIC is based on a bias correction that is justified by asymptotic behaviors of MLEs. In neural networks we apply early stopping, thus, we do not work with MLEs and such asymptotic results do not apply, see Section 4.2.3 in [141].
5.2    Gaussian random fields
In view of the universality theorems of Cybenko [29], Hornik et al. [69] or Leshno et al. [88], it is tempting to study the space of infinite (shallow) neural networks. These neural networks are dense in the set of compactly supported continuous functions (for non-polynomial activation functions), and hence can approximate any continuous regression function /j : X —> R on a compact feature space X arbitrarily well. We do not consider a model in full generality here, but we restrict to a Gaussian Bayesian model because in that case everything can be calculated explicitly. We follow Neal [97].
5.2.1    Gaussian Bayesian neural network
We start with a shallow neural network having JeN hidden neurons and we assume that the activation function <f> is (non-polynomial and) uniformly bounded
H^Hoo = sup\^(x)\ < co. (5.29)
This shallow neural network provides regression function /i : X —> R on the c/o-dimensional feature space X taking the form
x i y fi(x) = (fi,z(x)),
where we set j3 = (/3o,..., f3qi)' G R9l+1, zq{x) = 1, and with hidden neurons
z(x) = (f>(W,x) G R91.
Note that we use the letter /j for the regression function here because we use the linear activation for the output layer.
Next we embed this regression function into a Bayesian context by choosing a prior distribution 7rqi(/3, w±,..., wqi) on the parameter space &qi = {(/3, w±,..., wqi)}. This choice implies that we have (prior) expected regression function
x e-> E[n(x)]=E[(P,z(x))]= /    (fi,z(x)) d,Tvqi(P,wi,...,wqi).
The goal is to analyze this (Bayesian) regression function for an increasing number qi —> oo of hidden neurons.
Model Assumptions 5.15. Assume uniform boundedness (5.29) of the activation function (f>. For the prior distribution 7rqi we make the following assumptions (for given q\):
(1) (/3j)j>o and (wj)j>i are independent. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
138
Chapter 5.  Neural Networks
(2) The weights Wj are i.i.d. Gaussian distributed with mean mw £ R9o+1 and positive definite covariance matrix Tw £ K(9o+l)x(go+l) for j > 1.
(3) The regression parameters /3j, j > 0, are independent and Af(rrij,t2)-distributed,
with means rrij = 0 and variances r = t2 for j > 1.
Under the above assumptions we do not assume a uniquely identifiable parametrization, in fact, when we will let qi —> oo, the identifiability issue will not be relevant (as we will see below). Under the above assumptions we have the following lemma.
Lemma 5.16. Under Model Assumptions 5.15 we receive for given features x\,. .. , xn £ X moment generating function in r = (r±,.. . ,rn)' £ M.n
2  I n
MKXl..n)(r) = exp I m0 £n + ^ I £
i=i
%=i
E
exp I ~o PSSl^^I'^)
%=i
91
where we define the n-dimensional random vector fi(Xi:n) = (n(xi))fi=1 n.
Proof of Lemma 5.16. We calculate the moment generating function of the random vector fi(Xi:n) in rgl*  It is given by
M^x1:n)(r) = E [exp {r'fi(X 1:n)}] = E
exp i Xr> [ ß°+X ßi<t>l<w^xi)
The above assumptions (l)-(3) imply that the latter decouples in j, and we obtain using normality in /3j
n >
E exp ^ ^ ^ n /3o
exp
E
=   exp < mo
91
exp t^y^jißj^jw^Xj) f "
exp < ^ ^ nßj(ß(wj, Xj
E
'SFTI (X-)
exp 1 "2" (x/^-"^''^'
The i.i.d. property of m,- provides the claim. □
5.2.2   Infinite Gaussian Bayesian neural network
We would like to consider the limit qi —> 00 in Lemma 5.16. Note that we have assumed Halloo < 00• This implies
logE
exp
r2r2
<
J2.J1
and we may have a similar lower bound. This suggests that we need to scale the variance t2 = t2 (qi) as a function of the number of hidden neurons q± so that the system does not explode for qi —> 00. This proposes the following extension of our model assumptions.
Model Assumptions 5.17. In addition to Model Assumptions 5.15 we make the following assumption under prior distribution 7rqi:
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
139
(4) The variance r2 of the regression parameters /3j, j > 1, is a function of the number of hidden neurons q\ with scaling r2 = t2(qi) = t2 jq\ for some r2 > 0.
Using Lemma 5.16 we can prove the following statement.
Proposition 5.18. Under Model Assumptions 5.17 we receive for features X\,. .. , xn £ X andr = (n,. .. ,rn)' G R™
ql^MrtXi-.n)^) = exP jmor'e„ + ir'C(Xi:jl)r|, with en = (1,.. . , 1)' G R™ and covariance matrix
C(X1:n) = (r02 + r2E [</>(«n, a:i)0(«>i, Xj)])
V / i,j = l,...,n
G Rnxn
Sketch of proof of Proposition 5.18. The statement can be seen as follows: the assumption ||0||oo < oo implies that the moment generating function of the random variable ri<t>{wi-/ xi)) 2 nas a positive
radius of convergence. This implies that the corresponding power series expansion exists around zero. This in turn implies that the limit qi —¥ oo can be evaluate element-wise in the power series expansion, and this provides the above result. □
The previous proposition tells us that the regression function /i converges in distribution to a Gaussian random field on X for qi —> oo. The goal now is to perform Bayesian inference under the given data T>, assuming that the prior regression function is given by a Gaussian random field on the feature space X.
For the further analysis it is going to be crucial to understand the covariance structure
(x1,x2)'eX2   i y   C{xi,x2) = t02 + t2E[<t){w1,x1)(t){w1,X2)} . (5.30)
In Section 2.1.3, Neal [97] analyses these covariance functions for smooth activation functions (f> and for the step function activation <f>. The step function activation is particularly nice because it allows for a simple closed form solution in (5.30). Therefore, we will restrict to the step function activation below.
5.2.3   Bayesian inference for Gaussian random field priors
Assume that the regression function jU : X —> R is modeled by the Gaussian random field on X obtained in Proposition 5.18. For a (Gaussian) Bayesian model we make the following assumption: given realization /i, we assume that the case (Y, x) is described by
Y\fi - M^(x),a2). (5.31)
Assume that for given /i, we have n independent cases (Y^, Xi)i=i n obeying (5.31). The joint density of the Gaussian random field fi(Xi:n) in these features X±:n = (xi)i=i n and the corresponding observations Y\-n = (Yi,. .. , Yn)' is given by
f (Yl:n,n(Xl:n))    (X    exp | - ^ (Yl:n ~ n(Xl:n))' (Y 1:n - [l(X1:n)) |
x  exp j-i (n(Xi:n) - m0en)'' C{Xi:n)~1 (^(X1:n) - m0e„)| . This provides the following corollary.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
140
Chapter 5.  Neural Networks
Corollary 5.19. Under the above assumptions we have posterior distribution of fi in the features X±:n = (xi)i=i n (assuming that the covariance matrix C(Xi:n) is positive definite) for given observations Y\-n = (Y±,. .. , Yn)'
V(X1:n)\Yl:n ~ M (film, Cl:n) ,
with first two moments fii:n = Ci:n(v~2Yi:n + m0C(Xi:n)~1enSj     and    C1:n = (c(Xi:n)_1 + cr"2l„^j .
Note that we need for the above corollary that Xi are mutually different for all i = 1,... ,n, otherwise the covariance matrix C(Xi:n) will not be invertible. However, if we have more than one case with the same feature value, say x\ = x2, then we may consider (the sufficient statistics)
Y1+Y2\li ~ M{2n{Xl),2a2),
and obtain a similar result. Concluding, Corollary 5.19 provides an explicit form of the posterior distribution of /i in the features X\:n. This corresponds to Bayesian inference in the cases X±:n = (xi)i=i n.
5.2.4   Predictive distribution for Gaussian random field priors
In exactly the same fashion as above we consider the predictive distribution of random vectors Yn+i:n+k = (Yn+i, ■ ■ ■, Yn+k), for k > 1, and the corresponding mean vectors fj,(Xn+1:n+k) = (Kxi)Yi=n+i,...,n+k e M?C' §iven observations Y1:n.
Corollary 5.20. Assume that the cases (Y±, xt), .. ., (Yn+kl xn+k) are conditionally independent, given fi, having conditional distribution (5.31), and fi is the Gaussian random field from above. The predictive distribution of the random vector Yn+i:n+k> given Yi:n, is multivariate Gaussian with the first two moments given by
fin+l:n+k\l:n    =    ^ [ fJ,(Xn+i:n+k) \ Yi:n] , Cn+l:n+k\l:n    =    Gov ( /i(Xn+i:n+fc) | Y\-n) + CT2lfc.
These first two moments are obtained from the last k components of
V-l:n+k\l:n    =    E [^{X1:n+k)\Yl:n]
=    Ci:ra+fc|l:ra (a~2(Y[.n, 0, . . . , 0)' + m0C(Xl:n+fc)_1 en+kj ,
Cl:n+k\l:n    =    Gov ( fl(Xi:n+k) \ Y 1:n) = ^C(Xi:n+fc)_1 + Cr~2ln+fc|n) ,
with (Y'1:n, 0,....W e Rn+k and tn+k\n = diag(l,..., 1, 0,..., 0) G R(™+fe)x(™+fe) with n entries being 1 and k entries being 0.
Remark. We again assume that C{X\:n+k) is invertible for the above result to hold.
Proof of Corollary 5.20. We start by calculating the moment generating function for r G Kfe
MYn+1.n+klYl^(r)    = E[exp{r'yn+i:n+fc}|yi:n]
=   E [E [exp {r'Yn+1,n+k}\ Y1:n,fi] \ Y1:n]
=   E [exp [r fi(Xn+i:n+k)} \ Y1:n] exp |ir'rcr2| .
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
141
Thus, the claim will follow, once we have proved that fi(Xn+i-.n+k) is multivariate Gaussian with the corresponding moments, conditionally given Vi:n. The latter claim follows completely similarly to Corollary 5.19. This finishes the proof. □
Remarks 5.21.
• Under the above Gaussian Bayesian assumptions the only remaining difficulties are the calculation of Gij = K[cj)(wi,Xi)(j)(wi,Xj)] and the inversion of (potentially) high-dimensional matrices.
• In view of the previous item we define the Gram matrix
G(X1:n+k)    = (Gi,j)itj=lt_tn+k
In view of Corollary 5.20 this Gram matrix has three parts: (1) G(Xi:n) G Rraxra is the in-sample Gram matrix, (2) G(Xn+i:n+k) G Rfexfe is the out-of-sample Gram matrix, and (3) (Gíj)í=1   n. j=n+i   n+k G Kraxfe is the cross Gram matrix.
• More general forms may be obtained under non-Gaussian assumptions: (i) the observations Yi may have different distributional assumptions than (5.31), and also (ii) the Gaussian random field may be replaced by other random fields. The latter is discussed in Neal [97], in particular, from the point of view that different neurons in neural networks may represent different information. This interpretation has got lost in our asymptotic consideration (and therefore also identifiability can be neglected here).
5.2.5   Step function activation
We consider the step function activation (f>(x) = l{x>o}- We have
Gitj =E[cj)(w1,Xi}(j)(w1,xj}]   =   E [l{<Wl)a.}>0}l{<Wl)iE..}>0}] .
The crucial size is the angle between Xi G R9o+1 and Xj G R9o+1 (we include the intercepts Xifl = Xjfi = 1 in Xi and Xj, respectively, here). This angle is given by
7 = arccos ( )   G [0,tt].
V ll^ill\\xj II /
Model Assumptions 5.22. In addition to Model Assumptions 5.17 we assume that the distribution of wi is rotationally invariant.
Under this additional assumption we obtain
Gíj = E [(f)(wi, Xi)(j)(wi,Xj)] = \- -^arccos ( ^X%^]\, )   G [0,1/2].
\ 11       11 11 *^ J 11 /
Version October 27, 2021, M.V. Wůthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
142
Chapter 5.  Neural Networks
In fact, the upper bound is obtained if and only if Xi and Xj point in the same direction (in the R9o+1 space). Thus, under the step function activation 4>{x) = l{x>o} and under rotational invariance of tu i we obtain covariance function
x1:n c{x1:n)=(r02+4 - 4- arcc°s (n^Nir3/ii)) e
27T \ ||a;,; II Ha;,-II //.._.
ij=i,...,ii
We study the sensitivity of this covariance function for small perturbations of the feature values. Therefore, we consider features x = a^ and Xj = x + e(0, z')' for £ 6 R and z G R90 with J J z J J = 1. In view of the previous identity we have for this perturbation
£ ^ 9x{e)   A=   ^[^{w^Xif^w^Xj)}   = E[cf)(w1,x)(f)(w1,x + e(0,z'Y)]
11 / ||a;||2 + e(x, z) \ ,
=---arccos -''  "„      x '  1 . (5.32)
2    2ir \\\x\\^/\\x\\2+2e(x,z)+e2) V '
Note that we have g(0) = 1/2. We abbreviate z = (x, z). The Cauchy-Schwarz inequality implies, we use ||z|| = 1,
go      \ 2     go go z2 = (x,z)2 = (     xizi     <Vi?<1 + = l|a;||2,
=1     /      i=i i=i
where the latter square norm involves the intercept xq = 1. This implies that ||a;|| — z > 0. We define
:(e)
\x\\2 + £Z
\x\\^/\\x\\2 + 2ez + e2 We first calculate the derivative of e i—> x{e). It is given by
d z(\\x\\2+2ez + e2) - (\\x\\2 + ez) {z + e) e (||a;||2 - z2
-x
de \\x\\ (\\x\\2 + 2ez + e2)3/2 \\x\\ (\\x\\2 + 2ez + e2)3/2 '
We conclude that
lim — x{e) = lim--^—^-'--— = 0.
e^ode e^o   ||a;||(||a;||2 + 2ez + e2)3/2
We now come back to function (5.32) given by
£ >-> gx(e) = ^ -     arccos (x(e)). Its derivative for s ^ 0 is given by
d 1 x'{e)
:9x{£) -
de    v '     2ir ^1 - x(e)
2
Observe that both enumerator and denominator of the last ratio converge to zero as e —> 0. Therefore, we can apply l'Hopital's rule to obtain, we also use ||a;||2 — z2 > 0,
d     . . 1 x'{e) 1  (..     {x'{£))2 X 1/2
lim ——qT(s)   =   — lim — = =--lim . . „
e^O d£     V ' 27T e^0 y/\ - x{£)2 27T \e^0 1 - x{s)2
1/2
=--lim—V4       =--\ \\x 2 - z2 < 0.
2tt \e^0 -x(s) J 2tt"  11    V 11 11
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 5.  Neural Networks
143
From this we conclude that the function (5.32) is approximately linear for small s > 0 with negative slope — (2tt)~1 ||a:||-2-\/||a:||2 — z1. Thus, locally the Gaussian random field behaves as a Brownian motion.
Remarks. The case for the step function activation 4>(x) = l{x>o} and rotationally invariant weights w\ (according to Model Assumptions 5.22) is completely solved. Note that through definition (5.30) of the covariance function C(Xi:n) we still keep track of the original shallow neural network model with qi —> oo hidden neurons. In general, however, this is not necessary and we could directly start with a Bayesian random field prior model by assuming that
• the regression function /i is a random field on the feature space X; and
• the cases (Y, x) are independent having conditional distribution
with fi(x) characterizing the conditional mean of Y with feature x £ X, given /i.
If we start from such a random field prior model we should ensure that /i is comparably smooth. Otherwise interpretation of insurance prices is not meaningful in the sense that neighboring features receive similar prices.
There are many more examples of analytically tractable Gram matrices. For instance, if the activation function <f> is a cumulative distribution function, then the Gram matrix is analytically tractable. This has been explored in Section 2.2 of Xiang [142].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
144
Chapter 5.  Neural Networks
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6
Classification and Regression Trees
In this chapter we present the classification and regression tree (CART) technique introduced and studied in Breiman et al. [16]. We closely follow this reference, but we use a modified notation. For our examples we use the R package rpart; as mentioned in Therneau-Atkinson [125], CART is a trademarked software package, therefore the routine in R is called rpart which comes from recursive partitioning.
6.1   Binary Poisson regression trees
Assume that the cases (Ni,Xi,Vi) are independent and iVj are Poisson distributed for i = 1,... ,n with expected claims counts E[iVj] = \{xi)vi for a given regression function A : X —> R+. In Chapter 2 on GLMs we have made log-linear assumption for the modeling of the expected frequency x i—> X(x), see (2.2). This GLM approach assumes a fixed structural form of the expected frequency A(-). Similarly to neural networks, regression trees are more general in the sense that they try to learn an optimal structural form from the data T>. The idea is to design an algorithm that partitions the feature space X into disjoint (homogeneous)1 subsets (Xt)t€Ti where T is a finite set of indexes. On each subset Xt we choose a frequency parameter At that describes the expected frequency on that (homogeneous) part Xt of the feature space X. Finally, we estimate the (unknown) expected frequency on the total feature space X by
There are two things we need to explain: (i) the choice of a 'good' partition (Xt)ter °f the feature space X and (ii) the choices of the frequency parameters At for each t G T.
Remark that this partitioning into homogeneous subsets is rather similar to categorical coding of continuous variables in GLMs when they do not meet the required functional form, for instance, the monotonicity assumption.
1The meaning of homogeneity will become clear below.
(6.1)
145
Electronic copy available at: https://ssrn.com/abstract=2870308
146
Chapter 6.  Classification and Regression Trees
6.1.1   Binary trees and binary indexes
For the partitioning we use a binary tree growing algorithm. This binary tree growing algorithm starts by partitioning the feature space X into two disjoint subsets Xq and X\; these subsets may then further be partitioned into two disjoint subsets, and so forth. To describe this partitioning process we introduce a binary tree notation.
The binary tree growing algorithm can be described recursively as follows.
• We initialize (root) generation k = 0. We define the initial binary index t = 0 and we set Xq = X. This is the root or root node of the binary tree.
• Assume Xt is a node of generation k with binary index t. This node can be partitioned into two nodes of generation k + 1 denoted by Xto and Xt\. Node Xto is called the left child and node Xt\ the right child of the mother node Xt- These children have binary indexes in {0} x {0, l}fe+1. Note that indexes tO and tl are always understood in the concatenation sense.
Choose, say, the node Xt in generation k = 3 with binary index t = 0101. This node is the right child of the mother node with index 010 (belonging to generation 2) and the potential children have binary indexes 01010 and 01011 (in generation 4). Thus, binary indexes (chosen in a consistent way) describe a tree and the length of the binary index always indicates to which generation this particular node is belonging to. Moreover, it allows us to identify the relationship between generations.
We denote a binary split of node Xt by q = {Xto, Xti). A binary split q = (Xto, Xti) is always assumed to be a partition of Xt, that is, Xto U Xt\ = Xt and Xto H Xt\ = 0.
• After applying a stopping rule to the binary tree growing algorithm we obtain a set of binary indexes T that label the resulting nodes (Xt)t£j- This set T describes the structure of a binary tree which allows us to identify generations and relationships.
All nodes that do not have children are called leaves of the binary tree and are indexed by T c T. The leaves (Xt)t^T provide a partition of X.
Binary Tree Growing Algorithm.
(0) Initialize X0 = X and T = T = {0}.
(1) Repeat until a stopping rule is exercised: select t 6 T and the corresponding Xt,
(a) apply a binary split qt = (Xto,Xti) to Xt;
(b) return the new binary tree (Xt)t^j with indexes
T   <- TU{t0,tl},
t  <- (t\{i})U{«Ml}.
(2) Return the binary tree (Xt)t^j with binary indexes T, leaf indexes T and partition (Xt)t€T of X.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
147
A 'good' partition (Xtjter °f th-e feature space X for our prediction problem will describe good choices of binary splits q = (Xto, Xti) of nodes Xt, and also a stopping rule when no further binary split should be applied to a particular node and binary tree, respectively.
6.1.2   Pre-processing features: standardized binary splits
In step (la) of the Binary Tree Growing Algorithm we may consider any possible (non-trivial) binary split q = (Xto, Xti) of Xt- Typically, this leads to too much complexity and computational burden. Therefore, we only consider so-called (non-trivial) standardized binary splits (SBS). Remark that these non-trivial SBS do not need pre-processing of features, i.e. this is different from the GLMs and neural networks considered above. The general set-up is that we have a q-dimensional feature space X and we want to estimate a regression function A(-) : X —> r+ with function (6.1). For this goal, we only consider SBS of nodes Xt in step (la) of the Binary Tree Growing Algorithm; these SBS lead to 'rectangles' on X and are described as follows.
Standardized Binary Split.
A standardized binary split (SBS) q = (Xto,Xti) of Xt is done as follows:
1. consider only a single component x\ of the feature x = (xi,..., xq) £ Xt at a time;
2. for ordered components x\ we ask split questions: x\ < c, for given values c;
3. for nominal components x\ we ask split questions: x\ 6 C, for non-empty subsets C of the possible labels of the feature component x%.
Since we only have finitely many cases (Ni,Xi,Vi) in the data T>, there exist at most finitely many non-trivial SBS questions until the observed (different) features x±,.. . , xn are fully atomized by the Binary Tree Growing Algorithm (if we assume that the separation lines always lie at half distance between continuous feature components). In the sequel we always exclude trivial SBS, where trivial SBS in this context means that either Xto or Xti does not contain at least one observed feature Xi of the data T>. This is a standard assumption that we will not mention each time below, but tacitly assume.
6.1.3   Goodness of split
To apply the Binary Tree Growing Algorithm with SBS we still need to explain how we select t g Tin step (1) and which SBS q = (Xto, Xti) we apply to Xt in step (la). This selection is done by a goodness of split criterion. Similarly to above we use the Poisson deviance loss as our objective function (under the assumption of having independent Poisson distributed samples).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
148
Chapter 6.  Classification and Regression Trees
Optimal splits for continuous feature components
Assume for the moment that all feature components of x = (x±,..., xq) £ X are continuous and we are searching for the optimal SBS q = (Xto, Xt\) of Xt for given data T> and given t £ T. Thus, we are looking for an optimal feature component x\ of x and an optimal constant c £ r such that the resulting Poisson deviance loss is minimized:
mm
i<z<<j, eel
min       Yl       D*(Nl,X0) + mm       £ D*(Nl,X1)
(6.2)
where D*(Ni,X) is the Poisson deviance loss of (single) case (A^aMi^) for expected frequency A > 0, see (1.5). As mentioned above, we only consider non-trivial SBS, which implies that we only consider SBS q = (Xto, Xti) on non-empty sets {a^ £ Xt : x^ / < c} for the left child Xto and {xi £ At : Xij > c} for the right child Xt\ of Aj. Moreover, for continuous feature components x\ the constant c is chosen at half distance between the feature components xn of "neighboring observations". The inner optimizations in (6.2) can then be solved explicitly and provide the MLEs for the SBS q = (Xto, %ti) °f Xt, set
t = 0,1,
at.
XtT = argmin     £    I?* (iVj, Ar)  = g \ (6.3)
This allows us to rewrite the SBS optimization problem (6.2) as follows: Determine the optimal (non-trivial) SBS q = (Xto,Xti) of Xt solving
where Ajr is the MLE on XtT given by (6.3) and the Poisson deviance loss on XtT, r = 0,1, is given by
XtrVi      1      .      ^ AtrWj
1 — lot
D*XtT(N,Xtr)   =      E 2iVi
where the right-hand side is set equal to 2XtTVi if iVj = 0.
Optimization problem (6.4) describes an in-sample Poisson deviance loss minimization.
Processing features: categorical feature components
In the above optimization (6.2) we have assumed that all feature components x\ are continuous (or at least ordered) which motivates the choice of the splitting values c. For a nominal categorical feature component xi, the split criterion c may be replaced by nonempty subsets C that describe the possible labels of the categorical feature component xi, see Section 6.1.2. In particular, features do not need pre-processing to apply this SBS tree growing algorithm. However, this evaluation of categorical feature labels may be computationally too expensive. For instance, if we aim at partitioning the criterion Swiss canton (there are 26 cantons in Switzerland) we obtain 226"1 - 1 = 33'554'431 possible SBS. For this reason, a categorical feature component x% is often replaced by an ordered version thereof for the SBS. In the Poisson case this is done as follows: calculate
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
149
for each categorical label of x\ on Xt the empirical frequency \t{xi) (°n X-t) and use these empirical frequencies x\ = \t(xi) as replacements for the categorical feature component xi for an ordered SBS of Xt w.r.t. x\.
This latter approach only works well as long as Xt contains sufficiently many cases {Ni,Xi,Vj). If the number of cases in Xt is too small, then it is likely that we do not observe sufficiently many claims for certain categorical feature labels, in particular, in the low frequency case, and thus the empirical frequencies do not provide a useful ordering.
Bayesian approach
For given data T> and a given node Xt c X we can find the best SBS q = (Xto,Xti) w.r.t. (6.4). A general issue that often occurs in insurance claims frequency modeling is that on certain nodes Xtr we may get a MLE \tT that is equal to zero, see (6.3). This happens in particular when the expected frequency and the overall volume on XtT are small. The overall volume on XtT is defined by (we always assume that there is at least one case (Ni,Xi,Vi) in the considered part of the feature space)
WtT =      ^2     Vi> 0, (6.5)
i: Xi^XtT
and the expected frequency for data T> on XtT is given by
Xtr = —      E     Vi A(^) > °> (6-6) Wtr ■ v
if we assume that the true model has expected frequency function x i—> X(x). Thus, if the MLE XtT = 0 for XtT > 0, we obtain a degenerate (calibrated) Poisson distribution on X^ which, of course, is nonsense from a practical point of view. For this reason, the MLE Xtr on X^ is replaced by a Bayesian estimator Af°st.
Model Assumptions 6.1. Assume we are given a portfolio (a^, i>i)i=i n and a regression function A(-) : X —> R_|_. Choose 0 ~ T(7, 7) and assume that conditionally, given 0, the random variables N±,.. . ,Nn are independent with conditional distribution
Ni\e  - Poi(A(xl)07jl),
for i = 1,. .. , n.
The above assumptions introduce uncertainty in the expected frequency using a multiplicative perturbation 0 of A. The coefficient of variation of the (expected) frequency A(cCj)0 is
Vco (A(aji)0) = Vco (0) = 7-1/2, that is, we have a coefficient of variation that does not depend on the features.
Lemma 6.2. Set Model Assumptions 6.1 and choose XtT c X such that Xi £ XtT for at least one i = 1,.. . ,n. Conditionally, given 0, we have
NtT=   Yl   Ni le ~ Poi (Atr0^tr),
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
150
Chapter 6.  Classification and Regression Trees
with XtT given by (6.6) and WtT given by (6.5). The posterior expected frequency on XtT, given observation NtT, is given by
E [Atr0| NtT] = XtTE [0| NtT] = atT XtT + (1 - atT) XtT,
with MLE Xtr given by (6.3) and with credibility weight
WtrKr       — / <->  -i \
atT = -;-r— G (0, 1).
7 + WtT XtT
Proof of Lemma 6.2. The first statement immediately follows from Lemma 1.3, see also Remarks 4.3. For the second statement we define AtT = VrO; AtT has a gamma distribution with shape parameter 7 and scale parameter j/\tT, see formula (3.5) in Wuthrich [135]. Therefore, the assumptions of Corollary 4.4 are fulfilled and we obtain the claim. □
Remarks 6.3.
• For given prior mean Ajr > 0 and given shape parameter 7 > 0 we can calculate the posterior mean E [Atr©| NtT] as an estimator for the expected frequency on XtT. This posterior mean is always strictly positive and therefore does not have the same deficiency as the MLE Ajr given by (6.3).
• The difficulty in practice is that the prior mean XtT is not known and needs to be replaced by another estimator. The natural choices are the MLE At on Xt or the Bayesian estimator E [AjO| Nt] on Xt. The former has the disadvantage that it does not solve the problem of positivity and the latter needs to be calculated recursively. Thus, for the latter we may define recursively
A]?rost = E [Atr0| NtT] = atT Xtr + (1 - atT) A]?ost, (6.7) with estimated credibility weight
\ post
. \ post
7 + wtTXl
and initialization AQ°st = Ao = YHl=i Ni/ YHl=i vi which is the (homogeneous) overall MLE for A0 = £?=i v, X{Xl)/E"=i Vi-
— The credibility weights atT and atT are increasing in XtT and Afost, respectively. Therefore, good risks Ajr < A^ost obtain a higher credibility weight under (6.7) and bad risks a lower credibility weight for the observations on XtT-
— Replacing XtT by A]?ost shrinks the prior mean of the credibility estimator towards the overall estimate A]?ost on Xt and diminishes structural differences.
— Of course, this is non-optimal, but an improved version would require more knowledge about the tree structure. Such tree structure knowledge would also allow us to replace Model Assumptions 6.1 by individual assumptions on each leaf of the tree, for instance, each leaf t G T may have its own (independent) Of Note that Model Assumptions 6.1 also introduce (undesirable) dependencies between the leaves through O.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
151
— The remaining parameter is the coefficient of variation 7-1/2. This coefficient of variation is an input parameter in the R package rpart, see Therneau-Atkinson [125], and needs to be chosen externally. It should not be chosen too small so that it allows us to model a good range of possible expected frequencies A(ajj) which may be of magnitude 4 between good and bad risks.
— A more thorough treatment would embed (6.7) into a hierarchical credibility model, see Buhlmann-Gisler [18].
The R command rpart (used for the examples below) uses a less sophisticated approach than (6.7), namely, it replaces Af°st by
xfT = afT xtT + (1 - afT) A0, (6.8)
with (homogeneous) overall MLE Aq as prior mean and with estimated credibility weights
wtT Ao
Ht
G (0,1),
7 + wtT Ao
and 7 is chosen exogenously. In our examples we will use this approach (6.8)-(6.9).
(6.9)
6.1.4   Standardized binary split tree growing algorithm
We come back to the Binary Tree Growing Algorithm introduced on page 146. In the previous two sections we have discussed how to choose an optimal SBS q = (Xto,Xti) of Xt for a given binary index t G T using the goodness of split criterion (6.4). In this section we discuss the choice of an optimal binary index t G T to perform the SBS.
The algorithm and Poisson deviance losses
We apply SBS q = (Xto, Xti) to grow a binary tree. The aim at each iteration step is to choose the partition in (6.4) that maximizes the increase in goodness of split, measured by the decrease of the Poisson deviance loss. We define the decrease in Poisson deviance loss of a binary split q = (Xto, Xt±) of Xt by
AD*XM = D*x (TV,\t) - [D*Xa (TV, At0) + D*Xa (n,Atl)] . (6.10)
Lemma 6.4. For any binary split q = (Xto,Xti) of Xt we have AZ?^(q) > 0. Proof of Lemma 6.4. Observe that we have the following inequality
D'Xt (N,\t
E 2N> E 2N>
1 - log
- 1 - log
1   DXta (N,Xto) +D*Xtl (IV, Äti
Ni
log
The latter follows because the MLEs minimize the corresponding Poisson deviance losses.
The above motivates the following binary tree growing algorithm.
SBS Tree Growing Algorithm.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
152
Chapter 6.  Classification and Regression Trees
(0) Initialize X0 = X and T = T = {0}.
(1) Repeat until a stopping rule is exercised:
(a) calculate the optimal SBS on the leaves indexed by T
(t,Q)=      argmax      AD*Xa(q8) > 0, (6.11)
with a deterministic rule if there is more than one optimal SBS;
(b) if ADxt($t) > 0) return the new binary tree (Xt)t^j with indexes
T   <- TU{tO,tl},
T   <- (T\{i})U{«Ml};
(c) decide whether the algorithm should be stopped and go to step (2), or otherwise return to step (la).
(2) Return the binary tree (Xt)t^j with binary indexes T, leaf indexes T and partition (Xt)t€T of X.
Remarks.
• The above algorithm always chooses the leaf index t G T which leads to the (locally) biggest improvement in the corresponding Poisson deviance loss. In view of step (lb) we only consider SBS that lead to a real improvement AD*x{qt) > 0, otherwise the algorithm should be terminated.
• Often the maximization in (6.11) is more restricted, for instance, we may restrict to SBS c;s = (Xsq, Xsi) of Xs which lead to new leaves Xsq and Xs\ that contain at least a minimal number of cases; this is an input parameter in rpart.
• The only open point is the choice of a sensible stopping rule, for instance, the improvement in Poisson deviance loss in (6.11) should exceed a certain (relative) threshold, otherwise the algorithm is terminated; this is an input parameter in rpart.
• Note that the above algorithm could also be considered for other objective functions to which a lemma similar to Lemma 6.4 applies.
Stopping rule
The Poisson deviance loss on the partition (Xt)ter °f X is given by
D*(N,(\t)t€T) =  Y,D*Xt(N,\t). (6.12)
Completely analogously to Lemma 6.4 we have the following statement. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
153
Corollary 6.5. Consider for a partition (Xtjt^T °f ^, and an additional SBS q = (Xto,Xti) of Xt for a given t £ T■ This gives the new leaf indexes T' = (T\{t})U{tO,tl}. We have D*(n, (As)seT) > D*(n, (As)seT')-
Proof. The proof is a straightforward consequence of Lemma 6.4. □
Corollary 6.5 says that every extra SBS decreases the objective function D*(n, (\s)S£t) and a sensible stopping rule would explore the amount of decrease of the Poisson deviance loss in relation to the number of splits (and parameters As, s £ T) involved. This is quite similar to the likelihood ratio test (2.15). However, in many situations this is not a feasible way and the problem is rather solved by growing a very large tree in a first step, and in a second step this large tree is pruned by splits that do not contribute sufficiently to a modeling improvement. Details are presented in Section 6.2 on pruning.
Assignment rule and regression function
The SBS Tree Growing Algorithm on page 151 establishes us with a partition (Xt)t^_T of X having leaf indexes T. This provides the regression tree estimator for the expected frequency A(-) : X —> R+ defined by
teT
we also refer to (6.1) and (6.3). Note that this is not necessarily a strictly positive function. In a non strictly positive situation one may replace the MLE At by the empirical credibility estimator Afost given in (6.7) or by Xf given in (6.8). We will discuss this just next.
Bayesian set-up
We have seen that there may be an issue with degenerate Poisson distributions on some of the leaves t £ T. In applications one therefore replaces (6.11) by
(t,Q)=      argmax      AZ7^(?S), (6.13)
where
AD*Xt(,t) = D*Xt (n,X?) - [D*Xt0 (n,X?0) + D*Xtl (n,>&)] ■
That is, we replace the MLEs At by the empirical credibility estimators Xf given in (6.8). This has the advantage that the resulting frequency estimators are always strictly positive, the disadvantage is that Lemma 6.4 does not hold true and errors may well be increasing by adding more splits. This is because only the MLE minimizes the corresponding Poisson deviance loss and maximizes the log-likelihood function, respectively. The SBS Tree Growing Algorithm with (6.11) replaced by (6.13) will, however, only consider SBS that lead to a real improvement (thanks to step (lb)) but it may miss the optimal one.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
154
Chapter 6.  Classification and Regression Trees
6.1.5   Example in motor insurance pricing, revisited
We revisit the synthetic MTPL data given in Appendix A and apply the SBS Tree Growing Algorithm of page 151 to estimate the expected frequency by (6.1).
Listing 6.1: R command rpart
1	tree	<-  rpart(claims	-   age  +  ac  + power  +  gas  + brand  +  area +  dens  + ct
2			+  offset(log(expo)),
3			data = dat,   method = "poisson",   parms = 1ist(shr ink = l) ,
4			control=rpart.control(xval=10,   minbucket=10000,   cp=0.001))
D 6	rpart	.plot C tree)	
7	tree		
We provide the R code in Listing 6.1. The variable dat contains the years at risk (expo = Vi), the claims counts Ni as well as all feature components. The method applied considers the Poisson deviance loss, and lines 3-4 of Listing 6.1 give the control variables, these are going to be described below. Due to the issue that we may receive MLE \tT = 0 on some of the leaves, the algorithm rpart always uses the Bayesian approach. Note that shrink = 7-1/2 is the shrinkage parameter used in the rpart command, see (6.9). The standard shrinkage parameter used is shrink = 1, which corresponds to 7 = 1 in (6.9).
Figure 6.1: Regression trees for different stopping rules.
In Figure 6.1 we present the resulting regression trees of the SBS Tree Growing Algorithm using split criterion (6.13). The two trees (only) differ in their sizes because we have applied different stopping rules. We analyze the tree in Figure 6.1 (lhs), this tree is given in more detail in Listing 6.2.
We have used n=500000 cases for its construction. Variable node) denotes the nodes of the tree, these indexes differ from our binary index in the sense that t = 0 corresponds to 1), t = 00,01 correspond to 2) and 3), etc. split denotes the split criterion applied; n denotes the number of cases in that node; deviance is the resulting Poisson deviance loss
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
155
Listing 6.2: SBS Tree Growing Algorithm result
1 n= 500000
2
3 node),   split,   n,   deviance, yval
4 *  denotes   terminal node
5
6 1)   root   500000   145532.300 0.10269060
7 2)   ac>=0.5   452749   126855.500 0.09407058
8 4)   ac>=2.5   345628     93113.030 0.08592400
9 8)   age>=23.5  332748  87712.610 0.08277720
10 16)   dens<  599.5   183105   45721.540   0.07437700 *
11 17)   dens>=599.5   149643  41788.160   0.09374741 *
12 9)   age<  23.5   12880  4845.229   0.19376530 *
13 5)   ac<  2.5   107121   33147.840   0.12489500 *
14 3)   ac<  0.5  47251   16399.780 0.24966230
15 6)   brand=Bl,B10,Bll,B13,B14,B2,B3,B4,B5,B6  14444  3790.369  0.12806680 *
16 7)   brand=B12  32807   12073.280   0.32044710 *
of that node; yval gives the frequency estimate of that node; and * indicates whether the node is a leaf or not. For instance, line 9 in Listing 6.2 provides:
>       8) age >=23.5 332748 87712.610 0.08277720
This means that we consider node 8) that corresponds to t = 0000. This node was obtained from node t = 000 (equal to node 4)) using the split question age>=23.5. The node contains 332'748 cases, the Bayesian estimator is given by A]3 = 0.08277720 and the resulting Poisson deviance loss on that node Xt is D\ (n, AtB) = 87'712.610. The graph in Figure 6.1 (lhs) gives the additional information that on node t = 0000 we have Nt =15e+3 claims observed and that 67% of the cases fall into that node. Note that this information is not sufficient to calculate the full statistics because the volume wt on node t is not displayed, see (6.5). The question that remains open for the moment is the stopping rule, i.e. should we rather use the small or the big tree in Figure 6.1? We treat this question in Section 6.2, below, together with an analysis of the predictive power of the regression tree estimator. This finishes the example for the moment.
6.1.6   Choice of categorical classes
In Section 2.4 the categorical classes for driver's age age and for the age of car ac were chosen based on pure expert opinion, see Figure 2.2 (lhs). We could also use a regression tree to determine these classes. We consider regression trees on the marginal observations.2 This marginal consideration neglects possible interaction between the feature components but gives a data based answer on optimal (marginal) choices of (univariate) categorical classes for continuous feature components.
We choose for the two feature components exactly as many classes as in Figure 2.2 (lhs). The results are provided in Figure 6.2 and they give the classes shown in Table 6.1 (under the assumption that we would like to have 8 age classes and 4 ac distance classes). We
2Note that we use the same data for the construction of the categorical classes and the GLM analysis, of course, this might be questioned.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
156
Chapter 6.  Classification and Regression Trees
0.25
Figure 6.2: Choice of categorical classes using regression trees.
	re class 1	18-23		
	;e class 2	24-25		
aj	re class 3	26-28	ac class 1:	0
aj	;e class 4	29-40	ac class 2:	1
aj	re class 5	41-58	ac class 3:	2
aj	re class 6	59-61	ac class 4:	3+
aj	re class 7	62-74		
aj	re class 8	75-90		
Table 6.1: Optimal (lhs) age classes and (rhs) ac classes based on marginal regression tree estimators.
observe that the chosen ac classes are optimal, but for the age classes the regression tree algorithm proposes a slightly different choice of classes (which provides more marginal homogeneity in terms of Poisson deviance losses).
In Table 6.2 we consider the GLM analyses based on the two choices for the categorical age classes given in Figure 2.2 (lhs) and Table 6.1 (lhs), respectively. We observe a slight decrease in in-sample loss but at the same time an increase in the estimation loss. From this we conclude that in the present example we cannot gain much by a different/better choice of age classes.
Remark. This marginal choice of categorical classes can in particular be useful, if one starts from a huge set of categorical labels (e.g. grouping of industry sector codes) from which one would like to build bigger categorical classes. Doing this by expert opinion is often not feasible, but leaving the original categorical classes on a very granular level can be very time consuming in computational algorithms.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
157
	run	#	CV loss	strat. CV	est. loss	in-sample	average
	time	param.	rev	rev	£{\,\*)	>~D	frequency
(ChA.l) true model A*			27.7278				10.1991%
(Chl.l) homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.3) GLM3	12.0s	50	28.2125	28.2133	0.4794	28.1937	10.2691%
(CI16.O) GLM3.Tree	11.9s	50	-	-	0.4895	28.1905	10.2691%
Table 6.2: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters, this table follows up from Table 2.5.
6.2   Tree pruning
6.2.1   Binary trees and pruning
The resulting Poisson deviance loss of the partition (Xt)t£_T °f X IS given by, see (6.12),
D*(AT,(At)ter) = £r^(7V,At).
This describes an in-sample loss of the given tree estimate.3 Corollary 6.5 states that any further SBS q = (Xto, Xi) of Xt reduces the Poisson deviance loss by AD*x{qt) > 0. Since often a good stopping rule in the SBS Tree Growing Algorithm is not feasible, we present a different strategy of a tree selection here. The idea is to first grow a very large binary tree using the SBS Tree Growing Algorithm. In a second step this large binary tree is reduced by pruning the nodes of the large tree for which the reduction in Poisson deviance loss is not sufficient to justify that split.
Recall that T denotes the binary indexes of a binary tree with nodes (Xt)t^j- In order to not overload the language, we synonymously use T for the binary tree and/or its corresponding binary indexes here.
Choose a node with binary index teT being, say, t = 0101. This node belongs to generation k = 3, its ancestors (previous generations k = 0,1, 2) are given by the binary indexes in
{0,01,010} C T,
and its descendants (later generations k > 4) are given by the binary indexes in seT; s = 0101t for some r G (J (0,1}£ > .
feN J
If we add to the latter set the binary index t = 0101 we obtain a new binary tree with t being its new root node (to be consistent in notation we could relabel the binary indexes, but we refrain from doing so, because we think that this is sufficiently clear from the
3To get in-sample loss C% given in (1.10) we still need to scale with the number of observations n.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
158
Chapter 6.  Classification and Regression Trees
context). This motivates the following definition: choose t G T and define the new binary tree indexes
T(t+) = {*} U I s G T; s = tr for some r G |J {0, l}e \ . (6.14) [ <»eN J
These binary tree indexes T(t+) define a new binary tree (Xs) ^t+) with root node t and with leaves indexed by T(t+)- The leaves provide a partition (Xs)s^j^t+) °^ ^ne subset Xt of the total feature space X.
Definition 6.6. Pruning a binary tree at a node with binary index t 6 T means that we delete all descendants of Xt from the tree and the pruned binary tree is obtained by the binary indexes
T(_t)= (T\T(t+))u{t} C T.
The tree T(_t) is the subtree of T that stops growing at binary index t, the tree T(t+) is the subtree of T that has as root node the binary index t, and T(_t) n T(t+) = The general idea now is to start with a very large binary tree T. We then prune this large binary tree step by step by deleting all binary splits that do not substantially reduce the Poisson deviance loss (in relation to their complexity).
6.2.2   Minimal cost-complexity pruning Theoretical results on smallest cost-optimal trees
An efficient way to obtain a pruning algorithm is to do minimal cost-complexity pruning which was invented by Breiman-Stone [17] and Breiman et al. [16].
Definition 6.7. Choose a binary tree (Xt)t^j with binary indexes T and resulting partition (Xt)t£T °f X indexed by T. For r\ > 0, we define the cost-complexity measure of T by
Rv(J) = D%N,(Xt)t€T)+r]\T\.
Remarks 6.8.
• We aim at minimizing the Poisson deviance loss D*(N, (\t)t€T) by choosing a sufficiently large tree T, and in order to not choose an overly large tree we punish (regularize) the choices of large trees by a factor ?7|7"| accounting for the number of leaves in that tree T. We call r\ the regularization parameter. Observe that this idea has already been used in GAMs, see (3.8), and we have been discussing ridge and LASSO regularization in Section 4.3.2. Here we replace the ridge (L2) or the LASSO (L1) regularization by the number of parameters (cardinality (LP) regularization).
• In the above definition we use the Poisson deviance loss for the definition of the cost-complexity measure -R^T) of the tree L. In all what follows we could also use any other sensible objective function.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
159
Every finite binary tree T has at most finitely many binary subtrees T'cT. Therefore, starting from a (very large) finite binary tree T, we can always find a binary subtree T'(r]) C T that minimizes the cost-complexity measure Rq(-) for a given regularization parameter 77 > 0. That is, there exists at least one minimizer T'(r]) to the problem
argmin RT)(J'), (6.15)
T'CT
where the minimization T' C T runs over all binary subtrees T' of the original tree T having identical root node t = 0. Optimization (6.15) can have more than one minimizer T'(r]) for a given 77. We will just prove that for every regularization parameter 77 there is a (unique) smallest cost-optimal binary subtree of T that solves (6.15). This smallest cost-optimal binary subtree of T will be denoted by T(r)). The next theorem is proved in Section 6.4, below.
Theorem 6.9. Choose a finite binary tree T and assume there is more than one minimizer T'(r]) of (6.15) for a given 77 > 0. Then there exists a unique subtree T(r]) ofT that minimizes (6.15) and which satisfies T(r]) C T'(r]) for all minimizers T'(r]) of (6.15).
We call this subtree T(?7) the (unique) smallest cost-optimal binary subtree of T for the given regularization parameter 77 > 0.
Summarizing this result provides the following corollary.
Corollary 6.10. The smallest (cost-optimal) minimizer T(q) of (6.15) is well-defined (and unique). We have for any minimizer T'(r]) of (6.15) the relationships T(q) C T'(?7) C T and
Rv(J(v)) = Rv(V(r1)) = min RV(V).
Thus, we have for every regularization parameter 77 > 0 a smallest cost-optimal binary subtree T(q) of the original (large) binary tree T. The remaining question is: how can we determine these smallest cost-optimal subtrees efficiently? Searching the whole binary tree T for every 77 is too expensive.
Construction of the smallest cost-optimal subtree
Choose any binary subtree T'cT with leaves T' and consider the function rji : R+ —> R+ given by
V -> rT,(V) = Rv(T) = D*(N,(\t)t£t')+v\T'\. (6.16)
Note that we introduce this new notation to indicate that 771—y r-p (77) is a function of the regularization parameter 77 (for a given subtree T'), whereas T 1—y /^(T') is a function of the subtrees T' (for a given regularization parameter 77). The function r-p (77) is linear in 77 with intercept D*(AT", (Xt)t^t') — 0 and slope \T'\ > 0. We define the function
77 1—y 7(77) = min rji(rf) = min i?^(T'), (6-17)
which is well-defined because every finite binary tree T has at most finitely many binary subtrees T'.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
160
Chapter 6.  Classification and Regression Trees
Theorem 6.11. Choose a finite binary tree T. The function r : R+ —> R+ defined in (6.17) is positive, strictly increasing, piece-wise linear and concave with finitely many values 0 = t]q < r\\ < . . . < r\K < r\K+\ = oo in which r(-) is not differentiable. Define tw = T(m) for£ = 0,...,n. We have
t3t(°b...3t(KI = {0}, (6.18)
and for all £ = 0,.. ., k
r(7?) = rt(£)(77) = i?^(tw)        for all 77 G [r]e,f]e+l)-This theorem is proved in Section 6.4, below, and illustrated in Figure 6.4 (lhs).
Theorem 6.11 gives the instructions for an efficient algorithm to find the smallest cost-optimal binary subtree T(r)) of the original binary tree t for all regularization parameters 77 > 0. We just need to determine the tree sequence (6.18) for the selected 77^'s.
We start from a (large) binary tree t that has been generated by the SBS Tree Growing Algorithm on page 151. This algorithm has the property that it only uses SBS which lead to a real improvement in the Poisson deviance loss in (6.11), i.e. for the chosen SBS q = (Xto,Xti) we have AZ?^(q) > 0. For this reason we (may) initialize t^0) = t, see also Remark 6.13 below. Assume that 7^ {0} has been constructed for the regularization parameter 77^; if t^ is the root tree we have found the final £ = k and stop the algorithm. To construct t^+1-) C t^ and determine the corresponding 77^+1 > 77^ we consider for any t G t^) \ the (root) tree {£} and the (non-trivial) binary subtree t|2^ C Since every SBS considered is assumed to lead to a real improvement in
the Poisson deviance loss we have
DUN> (As)se{t}) - P*Xt(N, (As)serW ) > 0.
(£+)
As long as4
r{t}(v) = Rv({t}) > Rv(Tf¥))=r w (??), (6.19)
v   ' (*+)
the root tree {£} has a higher cost-complexity than Therefore we do not stop
growing the tree in node t. But as soon as the last inequality becomes an equality (by increasing the regularization parameter 77) we should prune the tree tj^ to the simpler root tree {£} for that increased 77. The idea now is to find the weakest node t*e+1 G tw \ TW defined by
D*Xt(N, (As)se{t}) - D*Xt(N, (\s)s€T(fl teTw\rw 1^)1-1
t*e+1=  argmin   —-; -(6.20)
This weakest node determines the minimal regularization parameter over all nodes in tM \ fW to turn inequality (6.19) into an equality for exactly this node and this regularization parameter. We prune the tree t^ at this weakest node The next lemma is proved in Section 6.4, below.
4Note that we use an abuse of notation in (6.19) because we interpret the functions on trees with root node {t}; for more clarification we also refer to (6.37), below.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
161
Lemma 6.12. Choose £ > 0 with ^ {0}. The weakest node defined by (6.20) provides T^+1-) = T^,* and
v lt+l>
D*Xt.   (N,(\8)a€{t.  })-D*x    (N,(XS)     w )
ir^ üi
Z?*(iV,(Äs)    w )-Z?*(iV,(Äs)serW)
irw|-|7^. )|
D*(N, (Ä8)8£T(<+1)) - D*(N, (Xa)a€TW) |7-W| _ |7-(m)|
jyj^+i) = {o} we set K = £ + L
Remark. In fact, Lemma 6.12 uses an induction in its proof, which shows that the algorithm is well initialized for £ = 0 and it also determines the maximal constant k.
Minimal Cost-Complexity Pruning Algorithm.
(0) Initialization: Set       = T.
(1) Repeat until T^) is the root tree {0}:
(a) find the weakest node t}+1 e ~tfr> \       and define T^+1) c T^ and r]£+1 > r\i as in Lemma 6.12;
(b) if T<'+1) = {0} set k = £ + 1.
(2) Return (%)^=o,...,K and (T^))^=0,...,K.
Remark 6.13. The initialization step (0) sets T^0) = T. Since we grow the tree T by the SBS Tree Growing Algorithm, only SBS are considered that strictly improve the Poisson deviance loss. From this it immediately follows that the smallest cost-optimal tree for 77 = 770 = 0 is given by the original tree T. In a situation where this is not the case, i.e. where another tree growing algorithm has been used that does not have this property, one needs to already prune the starting tree, see Breiman et al. [16].
Regularization and cost-complexity parameters
The parameter 77 > 0 regularizes the complexity of the resulting smallest cost-optimal binary subtree T(7/), see Definition 6.7. For an arbitrary binary tree T the cost-complexity measure in 77 is given by
Rv(T) = D*(N,CXt)t£T)+v\T\. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
162
Chapter 6.  Classification and Regression Trees
From this we see that the regularization parameter 77 has the same unit as the Poisson deviance loss. Since this cost-complexity consideration can be applied to any loss function, one often normalizes the regularization by the corresponding loss of the root tree, i.e. one considers
RV(T)   = D*(N,(\t)t€T)+v\T\
=   D*(N, (At)teT) + cp D*(N, (At)te{0}) |T|,
where D*(N, (\t)tt{0}) *s the Poisson deviance loss of the root tree with (homogeneous) MLE Aq = Ya=1 Ni/ Yli=i vi, and we define the cost-complexity parameter
cp
D*(N,(Xt)t€{0}) The sequence 77^+1 , £ = 0,. . ., k — 1, given by
_ D*(N, (AS)S£T(,+1)) - D*(N, (AS)S£TW)
m+1 jrwj - ir('+1)|
then provides cost-complexity parameters
ip. 01 \
D*(N,(Xt)t€{0})
1 D*(N, (AS)S£T(W)) - D*(N, (A8)8£TW)
D*(AT,(At)te{0}) |rM|-|t(^)|
Remark. Note that we call 77 regularization parameter and its normalized counterpart cp cost-complexity parameter.
Example in motor insurance pricing, revisited
We revisit the example of Section 6.1.5, but for illustrative purposes we start with a small tree. This small tree T is given in Figure 6.3 and Listing 6.3, note that this small tree is a subtree of the trees in Figure 6.1.
26e+3 / 500e+3
6240/107e+3
Figure 6.3: Regression tree T for an early stopping rule.
This tree T has leaf indexes {000, 001, 01}, and it has two subtrees with root 0 and with leaf indexes {00,01} and {0}, the latter is the root tree. We denote these trees for the moment by T3,T2,Ti (the first one denoting the full tree T and the last one denoting
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
163
Listing 6.3: Regression tree T for an early stopping rule
1 n= 500000
2
3 node),   split,   n,   deviance, yval
4 *  denotes   terminal node
5
6 1)   root   500000   145532.30 0.10269060
7 2)   ac>=0.5  452749   126855.50 0.09407058
8 4)   ac>=2.5  345628     93113.03   0.08592400 *
9 5)   ac<  2.5   107121     33147.84  0.12489500 * 10 3)   ac<  0.5  47251     16399.78   0.24966230 *
the root tree). For these three trees Tj, i = 1,2,3, we can study the cost-complexity measures
V -> rTM =Rv(Tl) =D*(N,C\t)t£tt)+v\%L (6.22) where % are the leaves of Tj with |7i| = i for i = 1,2,3.
cost-complexity
cp	nsplit	rel error	xerror	xstd
0.0156463	0	1.00000	1.00001	0.0044701
0.0040858	1	0.98435	0.98437	0.0043582
0.0039000	2	0.98027	0.98139	0.0043392
1000 2000 3000
Figure 6.4: (lhs) Cost-complexity measures (6.22) for the three trees T3,T2,Ti and (rhs) resulting cost-complexity results.
In Figure 6.4 (lhs) we plot the cost-complexity measures (6.22) for our example given in Figure 6.3. We see the increasing, piece-wise linear and concave property of r(-), see Theorem 6.11, and that
m = 595,      m=r]K = 2'277,      cPl = 0.0040858,      cp2 = cpK = 0.0156463.
These values are obtained from the R command printcp(tree) where tree is obtained as in Listing 6.1. Note that on line 4 of Listing 6.1 we specify the parameter cp which determines to which cost-complexity parameter we grow the tree. If we set cp=0.0039 we exactly receive Ligure 6.4 (rhs). Lhe first column specifies the critical cost-complexity parameters cp-L and cp2, (6.21) then provides the corresponding regularization parameters 771 and 772. Decreasing cp in Listing 6.1 will provide a larger tree.
Moreover, we only consider splits so that each leaf contains at least minbucket=10000 cases, see line 4 of Listing 6.1. We choose this value comparably large because if
10'000
v=        vi = 5000       and       A = 10%,
i=i
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
164
Chapter 6.  Classification and Regression Trees
then we have estimated confidence bounds of two standard deviations given by, see also (2.20),
A
'X/v, A+ 2 J\/i
[9.1%, 10.9%]
That is, we obtain a precision of roughly 0.9% in the expected frequency estimate A (which is not very precise). The column rel error gives a scaled version of the x2-test statistics, see (2.15), given by
rel error,
D*(N, (Xt)t£Tj) D*(N,CXt)te{0})
for i = 1,2,3. Note that this relative error is an in-sample loss. We observe that the in-sample loss decays by roughly 2% by the first two splits.
The remaining question is the best choice of the regularization parameter 77 > 0. This will be done in Section 6.2.4, below.
6.2.3   Choice of the best pruned tree
The remaining difficulty is to find the optimal regularization parameter rjg, £ = 0,.. . , k, for pruning, i.e. which one of the optimally pruned (smallest cost-optimal) trees T = T(°) d TW d ... d T(K_1) d T(K) = {0} should be selected. The choice of the best pruned tree is often done by K-fo\d cross-validation. Assume that we use (stratified) If-fold cross-validation for the Poisson deviance loss, see (1.11),
£8vw - t< E w E2». -1 -log {—rt^ • <6-23)
where X^~Bk>{-) = X^~Bk\ ■ ; 77) is the regression tree estimator
X^Mx) = X(-Bk\x;V) Yl     *t t{x€Xt},
t€T<--Bk)(v)
on the optimally pruned tree T^~Bk\r)) C T^~Bk^ for regularization parameter 77, where for the construction of the (big) tree T^~Bk^ we have only used the (stratified) training data X>(-8fe) = T> \ VBk, and data X>(8fe) is used for validation. This optimally pruned tree provides an 77- and X>(-8fe)-dependent partition (Xt)t€j-(-Bk)^ of the feature space X and the corresponding estimators (Xt)t€-j-(-Bk)^ on that partition.
This (stratified) K-iold cross-validation error £^(77) can be considered as an out-of-sample prediction error estimate (using the Poisson deviance loss as objective function) on a portfolio consisting of the characteristics given by data T> and using an optimally pruned binary tree for the regularization parameter 77. That is, we estimate the out-of-sample loss of the optimally pruned tree T(r)) at regularization parameter 77 by
£^(77) = £gv(77). (6.24)
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
165
Note that we deliberately use the hat-notation on the left-hand side: the smallest cost-optimal tree T(r)) and the original tree T were constructed on the entire T>. In order to estimate the resulting out-of-sample loss we use (stratified) A'-fold cross-validation. This cross-validation separates the data into training samples X>(-8fe) and validation samples V®k, k = 1,. . ., K, from which the cross-validation error of the corresponding smallest cost-optimal trees T^~bk\t]) is calculated. That is, the right-hand and the left-hand sides of (6.24) do not use the data T> in exactly the same manner because cross-validation requires a training and a validation sample for back-testing (estimating to out-of-sample loss on a test data set). Therefore, we consider the right-hand side as an estimate of the left-hand side.
Choice of the Best Pruned Tree.
(0) Grow a large tree T on the entire data T> using the SBS Tree Growing Algorithm.
(1) Determine the regularization parameters 770,... ,77^+1, see Lemma 6.12, and define their geometric means for £ = 0,. . ., k by
m = Vmrn+i, witn Vk = oo.
(2) Partition the data T> into (stratified) subsets V^k, k = 1,...,K, for (stratified) A'-fold cross-validation:
(a) grow for every k a large tree T(_8fe) based on the (stratified) training sample V\VBk;
(b) determine for every k the optimally pruned subtrees T^~^k^(f)e), £ = 0,. .. , k;
(c) calculate the cross-validation error estimate CpY (f]£) for every £ = 0,...,k based on the optimally pruned subtrees T(~Bk\fj(>) and the corresponding validation samples V^k, k = 1,..., K.
(3) Determine £* 6 {0,..., k} by one of the following criteria:
— minimum cross-validation error: choose £* such that 77^* minimizes (6.24) for £ = 0,. . ., k; we denote this minimizing index by £min = £*;
— 1-SD rule: the 1 standard deviation rule acknowledges that there is some uncertainty involved in the estimate (6.23)-(6.24); we therefore provide an estimate for the standard deviation Var(£c,v(77^))1/2 (which is done empirically over the K cross-validation samples); the 1-SD rule proposes to choose £* minimal such that
£gV(w*) < mm£gv(77,)+Var(/:gv(77,))1/2;
— elbow criterion: when looking at the sequence (fi(>), £ = k, . .. , 0, one often observes that it steeply drops in £, then has a kink at a certain index £e and after that kink it is rather flat; the elbow criteria suggests to choose £* = £e-
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
166
Chapter 6.  Classification and Regression Trees
(4) Return the best pruned tree T(?^*) (for one of the previous criteria).
Remark.
• The geometric means fjf> = ^/rjirjl+i are considered to be typical values on the intervals [77^,77^+1). Note that the optimal regularization parameters of Lemma 6.12 will vary over the choices T^~Bk\ k = 1,..., K. Therefore, 'typical' choices 77^, £ = 0,..., k, based on all data T> are considered for cross-validation, i.e. around 77^ we expect the critical tree size of the optimal tree to be more stable because these values are sufficiently different from 77^ and 77^+1 where the tree sizes change.
• There is still the flexibility of the choice of the selection criteria in step (3). This is further discussed in the examples.
• There is still the freedom of growing the (large) trees T and T^~Bk\ k = 1,..., K. If we use the SBS Tree Growing Algorithm we typically use the same criterion for the determination of the algorithm for all these trees.
6.2.4   Example in motor insurance pricing, revisited
We consider two examples. The first one will be based on the small tree T of Figure 6.3. For the second example we will grow a large tree that we are going to prune.
Small tree of Figure 6.3
We continue the example of Figure 6.3: for illustrative purposes it is based on the small tree T = t3, only. For this small tree we have, see also Figure 6.4,
770 = 0,       771 = l'l64       and       772 = 00.
The column rel error in Figure 6.4 (rhs) gives an in-sample loss that is generally too small because of potential over-fitting to the learning data. The column xerror in Figure 6.4 (rhs) then provides the relative 10-fold cross-validation errors (note that we choose xval = K = 10 in Listing 6.1) for these regularization parameters 77^, that is,
xerror* —
D*(N,(\t)t€{Q)y
for £ = 1,2,3. According to the minimum cross-validation error rule we would choose the maximal tree t3 in this example. The final column xstd in Figure 6.4 (rhs) denotes the relative estimated standard error of the cross-validation, i.e.
^ nVar(£gv(77,))1/2
xstd/j = -=-,
D*(N,(Xt)t€{0})
for £ = 1,2,3. Note that these numbers are displayed in gray color in Figure 6.4 (rhs) because they are not calculated by version 4.1.10 of rpart as we require.5 Therefore,
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
167
cp	nsplit	rel error	xerror	xstd	xstd	strat. CV
				not stratified	stratified	ria >~D
0.0156463	0	1.00000	1.00001	0.01076	0.00218	29.1065
0.0040858	1	0.98435	0.98437	0.01120	0.00397	28.6516
0.0039000	2	0.98027	0.98139	0.00992	0.00402	28.5649
Table 6.3: Cost-complexity and 10-fold cross-validation outputs for the trees T3,T2,Ti, see Figure 6.3, the last column is in 10-2.
we have implemented our own cross-validation procedure. The results are given in Table 6.3. We observe a substantial decrease in xstd when applying stratified cross-validation compared to non-stratified one, thus, uncertainty in cross-validation errors substantially decreases when applying the stratified version. This observation is fully in line with Figure 5.8 (middle, rhs). The final column in Table 6.3 displays the (absolute) 10-fold cross-validation errors (in 10-2). These should be compared to Table 5.4 which provides the values for the models GLM4, GAM3 and DNN1-3. We observe that the tree T3 cannot compete with the models of Table 5.4, nevertheless the first two splits lead to a remarkable decrease in Poisson deviance loss. That this small tree is not competitive is not surprising because we did not grow a sufficiently large tree (and applied optimal pruning). This we are going to do next.
Optimally pruned large tree
Now we grow a large tree. We therefore use the R command given in Listing 6.1, but we control the size of the tree by setting cp=0. 000001 and minbucket=1000, i.e. we grow a large three T = T^0) having in each leaf at least l'OOO cases.
The stratified 10-fold cross-validation results are given in Figure 6.5. We see a decrease in cross-validation losses for the first 70 SBS (green vertical line in Figure 6.5). From the 71st split on the cross-validation losses start to increase which indicates over-fitting to the data. The error bars correspond to 1 standard deviation obtained by stratified 10-fold cross-validation. Using the 1-SD rule we obtain a comparably small tree with 13 leaves (blue line in Figure 6.5), using the minimum cross-validation error we choose a tree with 71 leaves (green line in Figure 6.5). These trees are shown in Figure 6.6. We call these binary trees T1_SD and Tmm for further reference.
For model comparison we then compare the resulting prediction errors, these are provided in Table 6.4. We observe that the 1-SD rule regression tree results provide higher estimation errors than the models GLM4 and GAM3 which shows that this regression tree T1_SD is not competitive. In fact, this is often observed that the 1-SD rule tends to go for a tree that is too small.
The minimum cross-validation error tree Tmm performs much better. It has a better performance than model GAM3 in terms of estimation loss £(\, A*). This illustrates (once more) that the GAM is missing important (non-multiplicative) interactions in the regression function. On the other hand, this regression tree has a worse performance than the neural network approaches. This is also a common observation. The neural network
5In our implementation we would like to use the empirical frequencies for choosing stratified samples.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
168
Chapter 6.  Classification and Regression Trees
-4 -5 -6 -7 -8 -9 -10
log(cp)
Figure 6.5: Cross-validation results from growing a large binary tree, error bars correspond to 1 standard deviation obtained by stratified 10-fold cross-validation.
Figure 6.6: Optimally pruned trees according to the 1-SD rule T and the minimum cross-validation error rule Tmm, see also Figure 6.5.
architecture with differentiable activation function can continuously inter- and extrapolate for continuous feature values. This makes it more powerful than the regression tree approach because the neural network is less sensitive in small (homogeneous) subport-folios as long as it can benefit from an ordinal relationship within feature components, i.e. can "continuously learn over neighboring leaves".
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
169
		run	#	CV loss	strat. CV	est. loss	in-sample	average
		time	param.	/•cv >~D	/•CV J~D	£(A,A*)	/•is J~D	frequency
(ChA.l	true model A*			27.7278				10.1991%
(Chl.l)	homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4)	GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3)	GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.4)	CANN1	28s	703f	27.9292	27.9362	0.2284	27.8940	10.1577%
(Ch5.5)	CANN2	27s	780f	27.9306	27.9456	0.2092	27.8684	10.2283%
(Ch5.2)	DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3)	DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
(Ch5.7)	blended DNN	-		-	-	0.1336	27.6939	10.2691%
(Ch6.1)	Treel T1_SD	65s*	13	-	28.2280	0.4814	28.1698	10.2674%
(Ch6.2)	Tree2 Tmin	65s*	71	-	28.1388	0.3748	27.9156	10.2570%
Table 6.4: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration (} includes 10-fold cross-validation to select the optimal tree), and '# param.' gives the number of estimated model parameters (t only considers the network parameters and not the non-trainable GAM parameters), this table follows up from Table 5.8.
6.3   Binary tree classification
Above we have introduced the SBS tree construction for the Poisson regression problem which provided us regression tree estimator (6.1). We now turn to the classification problem. Assume that the cases (Y, x) have a categorical distribution satisfying
7Ty(x) = P [Y = y] > 0       for y e y = {0,..., J - 1},
with normalization J2yey ^yC31) = 1 f°r an x £ X. The classifier C : X —> y provides for cases (Y, x) the labels
C{x) = argmax iry(x), (6.25)
y&y
with a deterministic rule if the maximum is not unique. This is similar to the logistic regression classification in Section 2.5.2. However, in this section we do not assume any structural form for the probabilities iry(x) in terms of the features x, but we aim at estimating and approximating this structure with a SBS tree construction.
6.3.1   Empirical probabilities
Assume we have data T> of the form (2.22). Choose a subset X' C X of the feature space. Denote by n(y,X';T>) the number of cases (Yi,Xi) in T> that have feature Xi 6 X' and response Yi = y. The (empirical) probability that a randomly chosen case (Y, x) of T> belongs to feature set X' and has response Y = y is given by
P(y,X>)=P(y,X>;V) = = (6.26)
J2y'&yn(y',X;V)
n
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
170
Chapter 6.  Classification and Regression Trees
We use the hat-notation for the empirical probabilities p to indicate that these depend on the data T>. The (empirical) probability that a randomly chosen case (Y, x) belongs to X' is given by
p{x') = YJp{y,x')
Ey^yn(y,X';V) _ Ey£y n(y, X'; V) Hv&yn{y,X-V) n
and the (empirical) probability that a randomly chosen case (Y, x) in X' has response y is given by
_p(y,X') _ _n(y,X';V)
' '    p(x>) £y,e:
supposed that p(X') > 0, i.e. there are cases in X'.
^-w-wwr (6'27)
Remark. These empirical probabilities p purely depend on the data T>. The theory also works in the setting where one has prior knowledge about the classification probabilities and, therefore, uses a modified version for the definition of p. We will meet the modified framework in Chapter 7 for the AdaBoost algorithm, for more details we also refer to Breiman et al. [16].
6.3.2   Standardized binary split tree growing algorithm for classification
Similar to Section 6.1.3 we need a goodness of split criterion which evaluates possible SBS of the feature space. The criterion that we consider first relates to misclassification.
Class assignment rule
We start by defining the class assignment rule. Assume we have a partition {Xt)t^_T °f the feature space X (with every Xt containing at least one case of the data T>). We define the class assignment rule
y\ = argmax p{y*\Xt) = argmax n(y*, Xt; T>). (6.28)
y*&y y*&y
That is, we choose the label yl on Xt that has the biggest likelihood (with a deterministic rule if there is more than one). This provides us classifier estimator C : X —> y
£{x) = Y,vl iw,}- (6-29)
This assigns a class C{x) 6 3^ to every feature x 6 X. Misclassification rate
Next we introduce the misclassification rate of class assignment rule (6.29) by
1 n
£f«}(V,= -E W(xo>- (6-30)
1=1
Note that this is in line with (2.31) and determines the in-sample misclassification rate. We can represent this misclassification rate as given in the following lemma.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
171
Lemma 6.14. The mis classification rate satisfies
4S{#}(^' (y*t)t€T) = £pW (i-m\xt)) ■
teT
Proof. We can exchange the summation in the definition of the misclassification rate which provides
1   n 1 n
i = l i = l t€T
1 n 1
= -EE ^^.w.)1^) = -EE^'*;I))
t€T i=i teT
From this the claim follows. □
This misclassification rate plays the role of the deviance loss D*(N, (\t)t€T) given m (6.12). It estimates the probability that a randomly chosen case (Y, x) G T> is misclassi-fied.
The above construction implicitly assumes that misclassification is equally weighted for all labels in y. If preference should be given to a certain misclassification, we may introduce another loss function with L(y,y*) > 0 for y* ^ y £ J, and L(y,y) = 0 for all y G y. We interpret L(y,y*) to be the loss of classifying a response as y* instead of the "true" label y. In this case we replace class assignment rule (6.28) by
yt*=argmin ^ L(y,y*)p(y\Xt), (6.31)
and the above misclassification rate is modified to
Ll(V (yDter) = Y,P(X) E Hy,y*t)p(y\Xt). (6.32) ter y&y
This latter loss C^iJ), (-)teT) is motivated by the fact that (y^)t^T defined in (6.31) minimizes (6.32) on partition (Xt)t^_T- Note that the misclassification rate (6.30) is the special case given by the loss function L(y,y*) = ^{y^y*}, we also refer to (2.32). Similar to Corollary 6.5 we have the following corollary.
Corollary 6.15. Consider for a partition (Xtjter °f X an additional SBS for a given t G T.  This gives the new leaves V = (T \ {*}) U {i0,tl}.  We have £f(D, (yt*)ter) >
Proof. Note that we always assume a non-trivial SBS, i.e. we require existence of t G T such that Xt contains at least two elements with different features. Then, we have for this binary index t
p(xt)Y,L(y,yt)p(y\Xt) = E^OpM)
y£V y£V
= E L(y> yt)p(y> Xt°) + E L(y> yt)p(y>Xti)
y£V y£V
>   p{Xto)^2L(y,yt0)p(y\Xto) + PJXg) E )p(v\Xti)-
y£V y£V
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
172
Chapter 6.  Classification and Regression Trees
This completes the proof. □
This implies that in the SBS Tree Growing Algorithm on page 151 we may replace (6.11) by a similar condition measuring the decrease in misclassification rate
AD*Xt(<;t) = Cf(V, (yt*)ter) - Cf(V, (y*t)t€T,) > 0, (6.33)
using the notation of Corollary 6.15.
Conclusion. We conclude for the tree classification case that all the results of the tree regression case remain true if we replace in (6.11) the deviance loss improvement (6.10) by the misclassification rate improvement (6.33).
Impurity measures
In definition (6.30) we have naturally assumed that we would like to measure the misclassification rate. This misclassification rate has the disadvantage that the considered impurity function
</>R(p(o\xt),...,p(j-i\xt)) = £i{y*,t.}*W) = i-piyiW),
y&y
with yl given by (6.28), is not differentiable in its arguments which might cause problems in optimization. In particular, <fi* contains y\ which is obtained by an optimization in the empirical probabilities p. For instance, for J = 2 we have on the 1-unit simplex (choose p G [0,1] and set q = 1 — p G [0,1])
4>r(p,q) = 4>r{p^-p) = 1 - max{p, 1 -p] ,
which is not differentiable in p = 1/2. Therefore, this impurity function <f>r is often replaced by other impurity functions. The two most popular ones are the Gini index function
</>G(p(o\Xt),...,p(J-i\Xt)) = Y,{l-p{y\X))p{y\xt),
y&y
and the entropy function
^E(p(o\Xt),-..,p(J-i\Xt)) = J2-lo&(p(y\xt))p(y\x).
y&y
These two impurity functions have the advantage that they are differentiable. For instance for J = 2 we have on the 1-unit simplex (for q = 1 — p G [0,1])
4>g (p, q) = 4>g (p, i - p) = 2p (l - p),
and
4>E{p,q) = -plogp- (l -p)log(i -p),
see also Figure 6.7. These considerations motivate to replace misclassification rate (6.30), Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 6.  Classification and Regression Trees
173
impurity functions
0.0 0.2 0.4 0.6 0.8 1.0
Figure 6.7: Impurity functions: misclassification error rate <f>R, Gini index function <f>Q, entropy function (fig for J = 2 as a function of p G [0,1].
see also Lemma 6.14, by the impurity measures
4S(X>, (Xt)t€T) = EP(xt) <t>» (iW), 0,P(J- l\Xt)), (6.34)
for o G {G, E}. We can use this impurity measure as objective function.
Choose a binary index t 6 T and consider a SBS q = (Xto, Xt\) of Xt. This SBS leads to a change in impurity given by
AD^kt) = CP, (l)ter) - C(X>, (Aft)teT'), (6-35)
where 7"' denotes the resulting leaves also considering the additional binary split q (similar to Corollary 6.15).
Remarks.
• Note that these two impurity functions <f>Q and (fig no longer depend on the class assignments (y^)t^T- Therefore, the class assignment (6.28) is done after the SBS Tree Growing Algorithm has been performed.
• 4>g and (f>e are called impurity functions because they measure the impurity on the (J — l)-unit simplex. They take the maximum in the center (1/J,..., 1/J), the minimum in the corners (1, 0,.. . , 0),. . ., (0,.. . , 0,1) and are Schur concave.
• By applying the SBS Tree Growing Algorithm with criterion (6.35) in (6.11) we improve the purity on the resulting leaves in each step of the algorithm, however, in terms of the misclassification rate there might be better SBS. On the other hand, misclassification rates are not always sufficiently sensitive in SBS and the Gini index or the entropy function may have a better behavior on the final result.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
174 Chapter 6.  Classification and Regression Trees
• Pruning and cross-validation can then be done considering misclassification rates.
• The entropy impurity measure is closely related to the deviance loss function of a categorical distribution.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix to Chapter 6
6.4   Proofs of pruning results
In this section we prove the results on cost-complexity pruning.
Proof of Theorem 6.9. This theorem is proved in Theorem 10.7 of [16]. Choose r] > 0 fixed and assume that T'cT describes a non-trivial binary (sub-)tree of T having identical root node t = 0. We decouple this binary tree T' into its root tree T^_0j = {0} and the binary subtrees T'^00+^ and T^01+j having new root nodes with binary indexes 00 and 01, respectively, see also (6.14). Observe that this provides a natural partition of the leaves of T' with leaf indexes T' = 7(oo+) J V(oi+)- This implies for the cost-complexity measure of the tree T'
R,(T)   =   D*(N,(\t)t€T,)+r,\r\ =   Y (D*t (N'~Xt)
=      Y   (DXt(N,\t)+v)+   Y   (Dk(N,\t)+v) (6.36)
teT('oo+) ter('o1+) =   -R>?(T(oo+)) +-R>?(T(oi+))) where the last identity denotes for r = 0,1
i?,(T'(0r+)) =   Y D*Xt(N,\t)+V\7f0T+)\.
Note that we use a slight abuse of notation here because the trees T'^Qt+^ provide a partition of Xqt = (Xt)teTr , and n°t °f the total feature space X. But observe that (Aot)t=o,i *s a partition of the total feature space X.
Identity (6.36) shows that finding minimizers T'(rj) of (6.15) can be done inductively by finding minimizers (independently) on the sub-trees T(0o+) and T(0i+) of T. Thus, (6.36) reduces the number of generations by one digit and iterating this provides the proof. The formal completion of the proof is obtained by an explicit construction of the smallest cost-optimal binary subtree using that (6.36) can be applied to any split in any generation and using the fact that there are only finitely many binary subtrees. □
Proof of Theorem 6.11. Choose a finite binary tree T. This finite binary tree has at most finitely many subtrees T' that have the same root node t = 0. Therefore, the minimum in (6.17) is well-defined for all rj > 0. For any subtree T' of T the function (6.16) is positive, strictly increasing and linear. Therefore, the minimum in (6.17) over finitely many such positive and strictly increasing linear functions needs to be positive, strictly increasing, piece-wise linear and concave with at most finitely many values 0 < rji < . .. < rjK < oo where it is not differentiable, i.e. where the linear functions (6.16) intersect for different subtrees T'. We additionally set rjo = 0 and J?K+i = oo. The smallest cost-optimal binary subtree of (6.15) in rji is given by Jw =T(r?£), for I = 0,. .. , k, see Corollary 6.10. This implies that
r(i]i) = min rT'(^) = min Rm{T') = Rne(T(-r]i)) = i?^(T(£)).
T'cT T'cT
Since on the open interval (rji,r]i+i) no other linear function rji(-) intersects the straight line fT(«)(-) it follows that r(rf) = rTW{rj) on [r]i,r]e+1).
175
Electronic copy available at: https://ssrn.com/abstract=2870308
176
Chapter 6.  Classification and Regression Trees
Finally, we need to show that the sequence of the smallest cost-optimal binary subtrees (T^)e=o,...,K is decreasing for increasing rji. Pursuing the last argument to the point rji+i, we see from the continuity of the functions rv(-) that RTll+1(T(r]i)) = Rnt+l (T(rj^+i)). Since T(j^+i) is the smallest cost-optimal binary subtree in r]i+i we have that T(rji) D T(j^+i) and T(rji) can be considered as a minimizing tree T'(rj£+i) in rji+i that is not smallest in the sense of as described by Theorem 6.9. Finally, the root tree {0} is the unique binary subtree with the smallest slope |T'| = 1, therefore for r] —¥ oo the root tree is cost-complexity optimal and T1-"-1 = {0}. This finishes the proof. □
Proof of Lemma 6.12. To initialize we distinguish £ = 0 and £ > 0. We start with £ = 0. Choose t £ T \ T. For every SBS considered in the SBS Tree Growing Algorithm we have a real improvement in the Poisson deviance loss which implies that
D"(N,(Xs)s€T) < D*(N, (Xs)s
i(-t)>
This implies that Ro(T) < i?o(T(_t)) and therefore provides the initialization T1-0-1 = T for regularization parameter r] = rjo = 0, i.e.
RV0(T(0)) < Rr,0(T<Vt))        for any t £ T<°> \ T(0). For £ > 0 we know by construction that
Rm (TW) < Rm (Tf}t))        for any t £ TW \ TW,
because T1^-1 = t{r)t) 7^ {0} is the smallest cost-optimal subtree that minimizes the cost-complexity measure for regularization constant rji.
For any t £ Tw \ T(i) we have \TW\ > |7^(5t)|- This implies that for any t £ Tw \ T(i) there exists an rj(t) > rji with
'(-t)
because the linear function rT(«) (•) has a bigger slope than the linear function r («)  (•). This        > is
(-*)
given by the solution of
0    =   rT(e)(V(t))-r w (^(i))
(-*)
=   D*(iV,(A3)3eT(f))-D*(iV,(A3)     w )+^)|TW|-r7(i)|^)t)|
H)
=   D*Xt(NXxa)£^-D*Xt(N,CXs)s£{t])+v(t)(y\T^ (6.37)
and thus
«-t(iV, (Ä3)3£{t}) - D^.t(iV,(Ä3)     (f) ) B(t) =-
ir(£)
The weakest node        is then given by the minimal rj(t) where the minimum is considered for all nodes
t £ T^ ' \ T• '. This determines the first intersection after rji of rT(t) (•) with any other function ry and therefore determines jj^+i and T^+1-) = T?^ » □
(<) I
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7
Ensemble Learning Methods
In this chapter we present estimation methods that are based on combining several estimators in different ways, for instance, by averaging or stage-wise adaptive learning. Combining multiple learning methods to one estimator is often called ensemble learning.
7.1   Bootstrap simulation
The bootstrap simulation method goes back to Efron [35] and Efron-Tibshirani [38]. The main idea behind bootstrap is to simulate from an estimated model in order to gain more insight. This can either be done in a parametric or in a non-parametric way. The presentation in this section is taken from the lecture notes of Buhlmann-Machler [20], Chapter 5, and it is similar to Section 4.3 in Wuthrich-Merz [141].
7.1.1   Non-parametric bootstrap
We start from i.i.d. observations Y±,.. . , Yn coming from an unknown distribution function F. Based on these observations we may construct an estimator
6n = g(Y1,...,Yn), (7.1)
of a (real-valued, unknown but well-defined) quantity 6 of interest. The function g(-) is known and we would like to understand its properties as a function of random variables Y±,... ,Yn. For instance, we may want to analyze the distributional properties of 6n, i.e. for measurable sets A we would like to study
P [en G A] =F[g(Y1,..., Yn) G A] = J t{g{yi,...,yn)€A}dFY(yi,... ,yn), (7.2)
where Fy is the joint (product) distribution function of the i.i.d. observations Yi ~ F. Since F is not known, the distributional properties of 6n cannot be determined. Bootstrap replaced the unknown distribution F by the empirical distribution of the data
1 n
Fn(x) = -^21{Yi<xh it.
i=l
and to simulate from this empirical distribution. This provides the following non-parametric bootstrap algorithm.
177
Electronic copy available at: https://ssrn.com/abstract=2870308
178
Chapter 7.  Ensemble Learning Methods
(Non-Parametric) Bootstrap Algorithm.
(1) Repeat for m = 1,..., M
(a) simulate i.i.d. observations Y*,... ,Y* from Fn;
(b) calculate the estimator
Oirr)=g(Y*,...,Y*).
(2) Return On \ . . ., 9n   ^ and the corresponding bootstrap distribution
1 m
We may now use the bootstrap distribution F* as an estimate of the distribution of 9n, that is, in view of (7.2) we estimate
, m
m=l
Remarks.
Strictly speaking there is an intermediate step in (7.3) that is highlighted in Section 4.3.1 of [141], namely, the quality of approximation (7.3) for (7.2) depends on two things. Firstly, on the richness of the observations Y±,..., Yn, because
GAj=Pjyii...iyB} [6nEA\.
Thus, the bootstrap probability P* = P|Yi Y j is a conditional probability, given the data Y±,..., Yn. Secondly, it depends on M and the random drawings from Fn, however, this source of uncertainty can be controlled by the law of large numbers.
The bootstrap distribution can be used to estimate, for instance, the mean of the estimator 9n given in (7.1) which is set as
m
-i IV 1
171=1
E
and similarly for its variance
1 m
Var (6n) = Var* (on) = — £ (e^ - E* [or
m=l
• We can then ask various questions about consistency, quality of bootstrap approximation, etc. We refer to Buhlmann-Machler [20] for more details.
• Basically, the bootstrap algorithm is based on the Glivenko-Cantelli [54, 22] theorem which says that for i.i.d. observations the empirical distribution Fn converges uniformly to the true distribution function F, see Theorem 20.6 in Billingsley [11].
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7. Ensemble Learning Methods 179 7.1.2   Parametric bootstrap
The parametric bootstrap differs from its non-parametric counterpart in the sense that we assume a given parametric distribution function F = Fq with unknown parameter 6 for the i.i.d. observations Y±,... ,Yn. We then use estimator
6n = g(Y1,...,Yn), (7.4)
for the unknown (say, real-valued) parameter 6 of interest.
(Parametric) Bootstrap Algorithm.
(1) Repeat for m = 1,..., M
(a) simulate i.i.d. observations Y-,*,..., Y* from F~ ;
(b) calculate the estimator
e^*)=giY*,...,Y:).
(2) Return 6n \ . . ., 0n   ^ and the corresponding bootstrap distribution
. m
m=l
for the estimated distribution of 9„.
The remainder is equivalent to the non-parametric bootstrap and the same remarks apply.
Remark. The parametric bootstrap needs some care because the chosen distribution can be misspecified. For instance, if we assume that Y±,..., Yn follow a Poisson distribution with estimated claim frequency A, then one should additionally check whether the data is not over-dispersed. If the data is over-dispersed and we work with a Poisson distribution, we will underestimate uncertainty. This is not the case in the non-parametric bootstrap as long as the data is a typical observation.
7.2 Bagging
Bagging goes back to Breiman [13]. It combines bootstrap and aggregating. We apply bagging to the classification and regression trees (CART) constructed in Chapter 6. Two main disadvantages of standardized binary splits (SBS) in the CART construction are that they only lead to piece-wise constant estimates (because we apply rectangular splits) and that they can be rather unstable under slight changes in the observations, i.e. a small change in an observation may lead to a different split (possibly close to the root of the tree). This different split may result in a completely different tree. For these reasons one aims at constructing a whole family (ensemble) of tree estimators which should become more stable under averaging (and aggregating). In this section we generate this family of tree estimators by the bootstrap algorithm.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
180
Chapter 7.  Ensemble Learning Methods
7.2.1 Aggregating
Before we describe bagging we would like to discuss aggregating (and averaging) of estimators. We do this with the explicit Poisson model at hand. For a given estimator A of the true frequency A* we have the following Poisson deviance loss
d%N,\)=2Y,-Hi + Ni\ogNi+\(xi)vi-Ni\og(\(xi)vi)   > 0,
i=i
where A^logA^ = 0 for ATj = 0. For the true expected frequency A* we obtain expected average Poisson deviance loss over our portfolio (a^,i>i)i=i n
1 2 n
-e[d*(N,\*)] = - V E [iVi log iVi] - A^KlogtA^H) > 0,
n n t—f
i=i
where the last inequality follows because y i—> ylogy is a convex function on R+. The single terms (expected values) of this sum are illustrated in Figure 1.1. This average Poisson deviance loss can be compared to the one obtained from the estimated frequency A. Assume that the assumptions of Proposition 1.8 are fulfilled for A, and AT" describes an independent copy of the data T> (which has been used to estimate A). Then we have, see (1.12),
-E \d*(N, A)j = -E [d*(N, A*)] + E if (A, A*)l , (7.5) n    L in L J
with estimation loss estimate, see (1.13),
1 n £(\,\*) = -
n .
i=i
Xix^-X^x^-X^x^lor' X^X
> 0,
the inequality follows similar to the one in Proposition 1.8. This provides the following corollary.
Corollary 7.1. Under the above assumptions we have
— E \d*(N, A)l   > - e[d*(N,X")} .
Next we construct an aggregated estimator. We therefore assume that we can construct i.i.d. estimators X^ A, m > 1, also being independent of AT" and fulfilling the
assumptions of Proposition 1.8. We define the aggregated estimator by
, M
^g(-) = ^£A(m)(-). (7.6)
m=l
We have already met this idea in (5.28).
Proposition 7.2. Under the above assumptions we have
— E \d*(N, A)l   >   -e\d*(N,\M)\   > -e[d*(N,\*)]. n     L J n     L ss J n
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
181
Proof. The second inequality follows similarly to Corollary 7.1. For the first inequality we have (using that A(m) are i.i.d. estimators having the same distribution as A)
n E[D*(iV,Afgg)j =- E\D'(N,\)
log
Xagg(Xj)
Jensen's inequality implies
E		> E
	.     \ X(xi) J.	
IVl
(m).
E logA(a;i)   = 0
This proves the claim.
Remarks to Proposition 7.2.
• A crucial remark is that the average bias remains unchanged by aggregation (7.6), but the estimation variance is reduced as can be seen from Proposition 7.2 (this is basically explained by the law of large numbers and the central limit theorem). Thus, aggregation (and averaging) has the nice effect of reducing the estimation uncertainty. The only open question is how to construct i.i.d. estimators \(m\ This is discussed next.
• If additionally we know that A is unbiased for A*, then the law of large numbers will imply that the aggregated estimator converges a.s. to the true A* as M —> oo, and a central limit type theorem provides that the rate of convergence is of order \f~M. Additionally, a convergence result can be proved for the expected deviance losses. For more details we refer to Richman-Wuthrich [111].
7.2.2   Bagging for Poisson regression trees
Assume that we have Poisson observations T>, see (2.3), and that we have determined the regression tree estimator X(x) according to Sections 6.1 and 6.2. We can then use this estimated expected frequency to apply the parametric bootstrap algorithm of Section 7.1.2 to receive estimators for aggregating, i.e. we combine bootstrap and aggregating in the following algorithm.1
Bagging for Poisson Regression Trees.
(1) Repeat for m = 1,..., M
(a) simulate independent observations for i = 1,. . ., n
N* ~ PoKA^H); (7.7)
(b) calculate the regression tree estimator x i—> \(m*\x) from these bootstrap observations V* = {(N*,xi,vi),..., (N*, xn,vn)}.
1Note that for the regression tree estimators A we use the credibility estimators (6.8). These are strictly positive if Aq > 0.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
182
Chapter 7.  Ensemble Learning Methods
(2) Return the bagging estimator
1 m
x ^ \™g(x) = — E ^{m*](x).
m=l
Remarks.
• In the bagging algorithm above we assume that we always choose the same portfolio (xi,Vi)i=i^^n. Of course, we could also resample the corresponding portfolio from its empirical distribution (drawing with replacements).
• The bagging estimator Ag^g(-) is a sample estimator approximation to A(-), using the bootstrap samples A^m*^(-), m = 1,..., M. In particular, we have
AElg(-) = A(-) + (AElg(-)-A(-)),
where the second term is the bootstrap bias. It (usually) turns out that the bagging estimator has a smaller variance (compared to the original tree estimator) at the cost of an additional bias \^ag(x) — X(x). This is in the spirit of Proposition 7.2, however, we cannot do an exact statement here because we only simulate (bootstrap) from an estimated model.
• The Bagging for Poisson Regression Trees algorithm uses a parametric bootstrap in (7.7). A non-parametric version would use sampling with replacements directly from the data T>. The latter may be advantageous if the bias in A is too large (due to a tree that was chosen too small) or if the resulting regression structure shows too much over-dispersion in the data, i.e., if the regression structure cannot explain the data up to a noise term of unit variance.
• We conclude that bagging is mainly a variance reduction technique and, for the moment, it is not clear whether it also improves the Poisson deviance loss.
Example 7.3 (Aggregating and bagging). We revisit the regression tree example considered in Table 6.4.
		E[D*(iV,AD)]/n	E[£(AD, A*)]
(1) true frequency AD = A*		27.7013	0.0000
(2) estimated frequency AD =	= A from Tmin	28.0774	0.3762
(3) estimated frequency AD =	- X5 ^agg	27.9101	0.2088
(4) estimated frequency AD =	- A5 - ABag	28.0645	0.3632
(5) estimated frequency AD =	- ABag	28.0512	0.3499
Table 7.1: Bagging results for the synthetic example given in Appendix A; in 10 .
Line (1) of Table 7.1 gives the expected average Poisson deviance loss E [D*(N, A*)] /n w.r.t. the true frequency A* as chosen in Appendix A. This value is obtained numerically
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
183
by Monte Carlo simulation of AT" from the true model on our given portfolio, see Figure A.3 (rhs). We note that this expected value of 27.7013 ■ 10-2 is slightly smaller than the one of the selected observation given by 27.7278 ■ 10-2, see Table 6.4. On line (2) of Table 7.1 we calculate the i.i.d. regression tree estimators \(m^ from i.i.d. samples n^m\ m > 1, using the true model of Appendix A. To construct the regression tree estimators we use exactly the same cost-complexity parameter cp = 8.217-10-5 as has been used to receive the minimum cross-validation error tree Tmm on line (Ch6.2) of Table 6.4. This allows us to empirically calculate the estimation error E[£(A, A*)] of this regression tree estimator. The expected average Poisson deviance loss is then estimated from (7.5) and using line (1) of Table 7.1. The estimation error of 0.3762 ■ 10-2 is very similar to the one of the selected sample given on line (Ch6.2) in Table 6.4. Thus, our tree Tmm seems to be a typical one. Moreover, the stratified 10-fold cross-validation error of 28.1388 ■ 10"2 matches well 28.0774 ■ 10"2 which expresses that cross-validation works properly here.
On line (3) of Table 7.1 we perform the same analysis as on line (2) except that we replace the i.i.d. regression tree estimators A^™-1 by i.i.d. aggregated estimators A^g = | J2m=i A((fe_1)5+m), k > 1, to estimate the aggregated figures w.r.t. X^gg (for M = 5) empirically. As suggested by Proposition 7.2, we receive (clearly) better results by aggregation, in fact, the results are competitive with the neural network results, see Table 6.4. Unfortunately, this positive result is not of practical use, not knowing the true regression function A* we cannot sample i.i.d. estimators A^gg. Therefore, this method remains intractable.
Finally, lines (4)-(5) of Table 7.1 show the bagging estimators for M = 5,10. To receive them, we replace the i.i.d. samples n^m\ m > 1, from the true regression function A* by i.i.d. boostrap samples n~(m*\ m > 1, from the estimated regression function A (using the minimum cross-validation error tree Tmm of line (Ch6.2) of Table 6.4), see also (7.7) for bootstrapping. The remaining steps are done completely analogously to the steps on lines (3) and (2) of Table 7.1. Bagging leads to a small improvement in Poisson deviance loss compared to the tree estimator on line (2). However, the reduction is comparably small which suggests that the estimation error term is dominated by the estimation bias in the tree estimator A used for bootstrapping. ■
From the previous example we see that bagging leads in our example only to a small improvement (if at all). This may come from the fact that bootstrapping from the tree estimator A is too static, i.e. for every bootstrap sample we use the same origin and the same information (and hence the same (expected) bias). We conclude that bagging is not very helpful in our problem.
7.3   Random forests
Random forests are motivated by the fact that the bagging algorithm presented in the previous section is too static and it does not really lead to (much) improvement and bias reduction. This deficiency may be eliminated by growing a very large tree (which typically leads to a smaller bias) and then applying a more noisy bootstrap on this large tree. If we average over sufficiently many noisy bootstrap samples (similarly to bagging)
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
184
Chapter 7.  Ensemble Learning Methods
we should still get a small estimation variance. This is the basic idea behind random forests which goes back to Breiman [14].
We construct a very large tree T and a tree estimator A, respectively, based on the original data T>. From this very large tree estimator we simulate (parametric) bootstrap samples according to (7.7). These bootstrap samples then serve as observations to construct binary regression trees. However, for each standardized binary split (SBS) decision we choose at random a fixed number q* of feature components on which we base the split decisions. Typically, 1 < q* < q, and as a consequence we may miss the optimal split because we do not consider all feature components.
Poisson Random Forest with 1 < q* < q.
(0) Choose 1 < q* < q fixed.
(1) Repeat for m = 1,..., M
(a) simulate independent observations for i = 1,. . ., n
N* ~ PoKA^K); (7.8)
(b) calculate a SBS regression tree estimator x i—> X^m'q*\x) from these bootstrap observations T>* = {(N*, x±,vi),..., (N*, xn,vn)}, where for each SBS decision in (6.11) and (6.13), respectively, we first choose at random q* feature components of x on which the SBS split decision is based on.
(2) Return the random forest estimator
, m
x i y \&(x) = - £ \(™><*\x). (7.9)
m=l
Remarks.
• Observe that the Poisson Random Forest algorithm differs from the Bagging for Poisson Regression Tree algorithm as soon as q* < q, because in this case we may miss the optimal split if it is not among the q* feature components selected. This leads to more noisy trees and averaging smooths out this noise.
• Often, one chooses q* < y/q or q* = q/3.
• In (7.8) we use a parametric bootstrap, a non-parametric bootstrap would be achieved by using sampling with replacements directly from the observations.
• If the bias is still (too) large, one may choose a bigger tree. However, this question can typically not be answered in a satisfactory manner because usually the best tree model is not known.
• Choosing features at random may also help to break colinearity between feature components (decorrelate trees).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
185
• Another interesting analysis is to study which feature component has been chosen how often in the random forest algorithm, and how much these splits decrease the deviance loss. This provides a measure of variable importance which may be illustrated in plots like Figure 17.5 in Efron-Hastie [37].
Example 7.4 (Random forests). We continue our MTPL example considered in Table 6.4. The Poisson deviance loss version of the random forest algorithm is implemented in the R package rfCountData. The corresponding R code is given in Listing 7.1.
Listing 7.1: R command rfPoisson
1 1ibraryCrfCountData)
2
3 rf   <-  rfPoisson(x = dat[,c("age","ac","power","gas","brand","area","dens","ct")] ,
4 offset=logCdat$expo),   y=dat$claims,
5 ntree=10,   mtry=3,   replace=TRUE,   nodesize=1000)
6
7 print (rf )
8 plot(rf)
9 dat$predRF   <-  predict(rf,   offset=log(dat$expo), newdata=dat)
Lines 3-4 specify the features x, the years at risk v as log-offsets, and the responses N. ntree denotes the number M of bootstrapped trees, mtry gives the number q* of feature components considered, and replace determines whether the drawing of cases should be done with our without replacements. Note that rfPoisson is based on a non-parametric bootstrap as it has been introduced by Breiman [14]. That is, (7.8) is replaced by drawings with or without replacements; on the non-selected samples the algorithm calculates an out-of-bag generalization loss, which is reported in Figure 7.1. Finally, nodesize determines the minimal number of cases in each leaf the tree estimators \(m,q*) gho^d contain.
In Table 7.2 we consider three different parametrizations of random forests: RF1 has M = 10 and nodesize = lO'OOO, RF2 has M = 10 and nodesize = l'OOO, and RF3 has M = 100 and nodesize = l'OOO. Thus, random forest RF1 is based on M = 10 comparably small tree estimators A^™'9*-1 having in each leaf at least lO'OOO cases. For random forest RF2 we decrease this number to l'OOO cases per leaf. From Figure 6.5 we know that these large trees tend to over-fit to the data. Averaging (7.9) should take care of this over-fitting, for RF2 we average over M = 10 tree estimators and for RF3 we average over M = 100 tree estimators.
From random forest estimate RF1 in Table 7.2 we observe that too small trees, i.e. too big values for nodesize, do not provide competitive models. However, increasing the size of the trees provides remarkably good predictive results. The two random forest estimates RF2 and RF3 are on a similar level as the neural network approaches in terms of estimation losses £(A,A*).
The difficulty with the large trees in RF2 and RF3 are the computational times. The construction of one such big tree takes roughly 60 seconds in the implementation of rfPoisson, thus, if averaging over M = 100 tree estimators takes roughly 1.5 hours.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
186
Chapter 7.  Ensemble Learning Methods
		run	#	CV loss	strat. CV	est. loss	in-sample	average
		time	par am.	rev	/•cv >~D	£(A,A*)	Ma >~D	frequency
(ChA.l	true model A*			27.7278				10.1991%
(Chl.l)	homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4)	GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3)	GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.4)	CANN1	28s	703	27.9292	27.9362	0.2284	27.8940	10.1577%
(Ch5.5)	CANN2	27s	780	27.9306	27.9456	0.2092	27.8684	10.2283%
(Ch5.2)	DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3)	DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
(Ch5.7)	blended DNN	-		-	-	0.1336	27.6939	10.2691%
(CI16.2)	Tree2 Tmm	65s	71	-	28.1388	0.3748	27.9156	10.2570%
(Ch7.1)	RF1	60s	-	28.1375+	-	0.3642	28.0153	10.1901%
(Ch7.2)	RF2	594s	-	28.2135+	-	0.2944	27.3573	10.0628%
(Ch7.3)	RF3	5'931s	-	27.9808+	-	0.2256	27.2942	10.0863%
Table 7.2: Poisson deviance losses of K-to\& cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; t correspond to out-of-bag losses; losses are reported in 10-2; run time gives the time needed for model calibration, and param.' gives the number of estimated model parameters, this table follows up from Table 6.4.
Figure 7.1: Decrease of out-of-bag losses in the random forest algorithm for trees of minimal leaf size of nodesize = l'OOO.
In Figure 7.1 we plot the out-of-bag generalization losses on the not-selected samples in the bootstrap iterations. This plot suggests that the predictive model can further be improved by using more iterations M in the random forest algorithm. We conclude that random forests give powerful predictors, however, at rather high computational costs (in the current implementation). This finishes the example. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
187
7.4   Boosting machines
Boosting is an iterative method that aims at combining many weak predictions into one powerful one. The underlying mechanism is completely different from bagging and random forests, namely, boosting aims at minimizing the in-sample loss by a stage-wise adaptive learning algorithm. Its roots go back to Valiant [128] and Kearns-Valiant [79, 80]. Schapire [118], Freund [44] and Freund-Schapire [45] popularized the method by presenting the AdaBoost algorithm and by providing rigorous results. This chapter is based on Chapter 10 of Hastie et al. [62].
7.4.1    Generic gradient boosting machine
Assume we are given an objective function L(-) and we aim at minimizing the in-sample loss over the (learning) sample T>. That is, we try to solve
where we restrict this optimization over a well-specified family of functions /. Observe that optimization (7.10) only specifies the function / : X —> R in the feature values xi,... ,xn G X. This implies that the above optimization problem can be replaced by trying to find optimal parameters / = (/i,.. ., fn)' = (/(cci),. .. , f{xn))' G R™ that minimize the in-sample loss
The saturated model would provide a minimal in-sample loss, but the saturated model leads to over-fitting. This is exactly the reason for introducing smoothness and regularity conditions on / in Chapter 3 on GAMs, see (3.7)-(3.8), where we restrict to natural cubic splines with multiplicative interactions (by using the log-link) for /. In general, a non-trivial global solution to (7.10) is not feasible. Here, we solve the problem by restricting to another specific class of predictors (and functions), namely, we will consider regression tree functions with a given small number of leaves. Therefore, we design an algorithm that locally improves the situation in each iteration. This is exactly what the gradient boosting machine (GBM) is designed for. The GBM was developed in Friedman [46, 47], we also refer to Ridgeway [112] that serves the purpose of describing the corresponding implementation in R of the package gbm.
In a nutshell, assume that we have found a minimizer y"m_i(-) w.r.t. (7.10), where the minimization has been performed over a (small) family of admissible functions. In a next step, we may try to adaptively improve this estimator by considering the optimization problem
where we restrict this optimization to sufficiently simple functions Qm D {(7 = 0}. This then provides an improved estimator fm = fm-i + 9m- Iteration of these optimizations (7.12) for m > 1 will stage-wise adaptively improve an initial (weak) learner /q. This
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
(7.10)
(7.11)
(7.12)
Electronic copy available at: https://ssrn.com/abstract=2870308
188
Chapter 7.  Ensemble Learning Methods
idea has already been presented in Section 5.1.4 on the CANN approach where we have started from a GLM estimator. This GLM estimator jo has been boosted in the sense of (7.12) by allowing for neural network features in gi.
Recall that optimization (7.10) is only specified in the observed features (7.11). Assume that in the (m — l)-st step of the optimization algorithm we have found values in Xi given by f m_\ = (/m-l,l> • • • > fm-l,n)' £ R™ that provide in-sample loss
1 n
£L(/m-l) = -£M^,/m-l,i,l>i).
n r-f
i=i
In the next step, we perturb /m_i locally so that it leads to a maximal local decrease in in-sample loss. This provides the next value fm of the algorithm. Assume that the loss function L(-) is sufficiently smooth. The gradient V£^(/m_1) indicates the locally biggest decrease in in-sample loss. This gradient is given by
dL(Yn,f,v„
/=/m-i,i '     ' df
i
n\ df
The first order Taylor expansion around f m_\ provides
C*L (/) = ?l (fm-l) + V^(/m_!Jj (/ " /m_x) + o(\\f - /„_!!!),
as 11/ — fm-i\\ ~> 0- H we choose a (small) step of size g > 0 we have locally optimal first order update
/m_i C /m_i - gVOHf^). (7.13) This provides first order approximation to the in-sample loss in the updated estimate
£1 (fm-l ~ QV£l(fm-lJ) = ^tif m-l) ~ 6\W £f(f m-lf + o(6), (7.14)
as g —> 0. Thus, we see that update (7.13) provides locally the biggest improvement in in-sample loss. This is completely analogous to the steepest gradient descent step given in (5.11). The optimal step size gm > 0 is then found by
gm = argmin £f l/m_i - gV£f(fm-i,
g>0 v
This provides locally optimal first order update
fm-l   ~^   fm = fm-l ~ QmS£■ L,(fm-l)■>
and we iterate this algorithm. This update is exactly in the spirit of (7.12) and motivates the following generic GBM.
(Generic) Gradient Boosting Machine (GBM).
(0) Initialize the constant function fo(x) = go = argmin X^JLi L (Yi, g, Vi).
e
(1) Repeat for m = 1,..., M
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
189
(a) calculate the working responses Rm^, i = 1
dL(Yi,f,Vi]
1 ' ' ' 1 ' v1
df
f=fm-l(Xi)
(b) fit a regression model gm : X —> R to the working responses (Rm,i)i=i,...,n having corresponding features and volumes (ajj, Wi)i=i,...,ra;
(c) compute the optimal step size
1  n ^ Qm = argmin - ^ L (Y%, fm-i(xi) + aam(
i=i
(x,),v,
(d) update
(x). (7.15)
(2) Return estimator x i—> f{x) = Jm(x).
Remarks to the generic GBM.
• Step (la): Observe that the scaled working responses (Rm,i/n)i=i,...,n exactly correspond to the components of the negative gradient —V£I£(/m_i(a;i),..., fm-i(xn)). Lhus, these working responses provide the locally optimal direction for the update.
• Step (lb): In the update (7.13) we optimize over all observed values (fm-i,i)i- Lhis is changed in step (lb) of the generic GBM for the following two reasons: (i) we would like to fit a function gm : X —> R that is defined on the entire feature space X (and not only in the observed feature values x±,. . ., xn); and (ii) we do not want over-fitting. Lherefore, we rather fit a "lower-dimensional" regression model gm to these working responses (and we do not choose the saturated improvement). More illustration to this step is provided in the Poisson case below.
• Step (Id): We receive potential over-fitting if the number of iterations M in the generic GBM is chosen too large. Lo prevent from this kind of over-fitting, often a smaller step size is executed in (7.15): choose fixed shrinkage constant a 6 (0,1) and define the shrinkage updates
fm-i{x) ->■ fm{x) = fm-i{x) + aQmgm{x). (7.16)
Regression step (lb) in the gradient boosting machine
Step (lb) of the GBM requires fitting a regression model gm : X —> R to the working responses (Rm,i,Xi,Vi), i = 1,... ,n. Observe that the optimal first order direction is, see (7.14),
-^V£1£(/m_1) = (Rm,l, ■ ■ ■ , Rm,n)' ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
190
Chapter 7.  Ensemble Learning Methods
We try to optimally approximate this direction by selecting the optimal regression function gm : X —> R within the family Qm of admissible regression functions on X. The optimal L2-distance approximation in step (lb) is given by
1 n
9m = argmin - ^ (i?m>l - gm(xi)) . (7-17)
Thus, if we approximate direction —nS/Dfj{fm_i) by gm = (jjm(xi),. .. ,gm(xn))' G R™ we obtain for (7.14) approximation, as g —> 0,
£L(fm-i + edm) = ^l(fm-i) + 0^^l(fm-i)'dm + o(g)
«    Cfifm-l) ~ Q\\dm\\2 ln-
From this we see that the regression step (lb) results in finding the L2-optimal regression function gm : X —> R in (7.17) that is close to the working responses. In particular, we use the square loss function (residual sum of squares) for this step of the generic GBM. For an other (more consistent) approach we refer to the next section on Poisson regression.
Remarks. The family Qm of admissible functions in (7.17) is often chosen to be the family of regression tree functions with a given small number of leaves. Having only a small number of leaves improves the regression function (7.15) in each stage-wise adaptive step weakly, and combining many weak learners typically leads to a powerful regression function / = /m-
7.4.2   Poisson regression tree boosting machine
The generic GBM is a bit artificial in the Poisson case with the Poisson deviance loss as objective function. Steps (la)-(ld) aim at locally improving the in-sample loss by stage-wise adaptive adding a new basis function gm to jm-\. This is done via the gradient of the objective function so that the optimal first order Taylor expansion can be studied. In the following Poisson case we can consider these steps directly. Assume that the functions / play the role of the logged frequency log A. Suppose /m_i has been constructed and we consider the (stage-wise adaptive) in-sample loss optimization (set step size g = 1)
9k
argmin
gm e Q~ n
1=1
i{xi)+gm{xi)
(7.18)
for a given family Qm of admissible regression functions on X. This in-sample loss is for the Poisson deviance loss function given by
D* ( N, e-
fm-l+g„
n	Vi
i=i	-
n	
i=\	
EL(F-e i=i	
0fm-1(xi)+gm(xi)
Ni
,Ám)e9m.(Xi)
1 - log
fm-l(xi)+gm(xi)
N,.
9m(xi) w(m)
1 - log
ivi
Ni
Version October 27, 2021, M.V. Wüthrich & C. Buser, ELK Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
191
with working weights for i = 1,. .. , n given by
™M = Uie/m-i(*0. (7.19)
From this we conclude that optimization problem (7.18) is in our Poisson case with deviance loss equivalent to solving
i n
gm = ajgmin-X)i(^,e9m(!Bi),^i   ), (7-20)
i=i
for a given family Qm of admissible regression functions on X. Thus, in every step of the Poisson boosting machine we receive updated working weights {w^"^)i=i n, playing the role of the volumes in the Poisson model, and replacing the role of the working responses in the boosting machine. This is not surprising because it is exactly what we have been using in (5.23)-(5.24) for boosting a GLM with neural network features in the Poisson case, that is, the logarithm of the working responses play the role of fixed offsets that are enhanced by a next regression model/boosting step.
This motivates iterative solutions of regression problems. For binary tree regressions this results in the following stage-wise adaptive learning algorithm. We set for the stage-wise adaptive improvements log flm = gm.
Poisson Regression Tree Boosting Machine.
(0) Initialization:
(a) choose J > 2 and a 6 (0,1] fixed;
(b) set homogeneous overall MLE fo{x) = log Xo(x) = log (J22=i ^i/J2?=i vi)-
(1) Repeat for m = 1,..., M
(a) set working weights        according to (7.19) and consider the working data
= {(N1, xuw™) ,...,(Nn, xn, wW)} ;
(b) construct a SBS Poisson regression tree estimator
tg7-(m)
with 171"-11 = J leaves for the working data T>^m\ see Section 6.1 and (6.1) (c) update the estimator
{x)+agm{x) (7.22)
< E :
fm_i(a;)+Qí  E  log(MÍm)) ^{xeX(m)y
(2) Return the expected frequency estimator
x    H4-    X(x) = exp |/M(cc)| .
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
192
Chapter 7.  Ensemble Learning Methods
Remarks on the Poisson Regression Tree Boosting Machine.
• In steps (la)-(lc) we directly study the optimal local improvement (without considering an approximation to the first order Taylor expansion). Local is meant here in the sense that we disturb the previous solution /m-i(-), which is included in the working weights (ifl-m'))i=iv..ira.
• In step (lb) we choose a SBS regression tree estimator for updating the fit. Of course, we could choose any other regression method here. Importantly, we fix the number of leaves J > 2 in advance, i.e. we do not aim for the optimal tree here. This J should not be chosen too large for several reasons:
— speed of calculations (in each iteration m= 1,..., M),
— over-fitting, i.e. in general we are only looking for small or moderate improvements in each step, otherwise we obtain over-fitting and too wildly looking functions. Over-fitting can also be tuned by choosing an appropriate shrinkage constant a < 1.
• The boosting machine is of particular interest for model back-testing. If we have an existing model Am(-) = exp{/m(-)}, then the boosting step analyzes whether we should additionally scale this model with a function fim+i(-) different from 1. This idea is illustrated in more detail in a mortality modeling example in Deprez et al. [32].
• The Poisson regression tree boosting machine can be generalized to other distributions such the exponential dispersion family, see Lee-Lin [86].
• The SBS Poisson regression tree estimator in step (lb) is called exact greedy algorithm because it looks for the best possible split among all SBSs. It has been noticed that this can be very time consuming, in particular, if we have many categorical feature components (we also refer to the run times of the random forests in Table 7.2). Meanwhile there are other algorithms that, in particular, can handle categorical variables and sparse data (from one-hot encoding) more efficiently. A very powerful algorithm is XGBoost developed by Chen-Guestrin [26] that overcomes several issues, we refer to Ferrario-Hammerli [42].
Example 7.5 (trees of depth 1). We consider the Poisson regression tree boosting machine with J = 2 leaves. That is, every stage-wise adaptive learner in (7.21) has exactly one SBS % = (^ir, <^oi^) °^ ^ne entire feature space Xq = X. Thus, in view of (6.4) we consider for each iteration m = 1,..., M the optimizations
min        D*,m) (n,p,$)+D*im) (n,^) , where for r = 0,1 we have MLEs
E,
E.
'i: Xi&Xr
E.
E,
.(m) vte-
'i: Xi&Xr
í: Xi&Xr
•Or
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
193
and the corresponding Poisson deviance losses are given by
D* x
(m)
E
2Ni
v. Xi^X^T '
lm)?,.P/m-i(^)
1 - loe
lm)?,.p/m-l(^)
The optimal SBS s0 = (xffi, ) selects one feature component x\ of x for this split and, thus, adds one "rectangular" split to X. In Figure 7.2 we illustrate these rectangular splits for M = 3 iterations on a feature space X C [—1, l]2 of dimension q = 2.
boosting machine with M=3 and J=2
iteration 1 iteration 2 iteration 3
feature component x_1
Figure 7.2: Poisson regression tree boosting with J = 2 illustrating M = 3 iterations.
Let us assume that the optimal SBS in iteration m is w.r.t. the (continuous) feature component xi*^. Moreover, assume that it corresponds to the split question xi*m < c*m, see Section 6.1.2. This then implies that we obtain multiplicative structure
\m{x) = e^W = ^(K)(|4mU{li,^}+#l{ll,>c}).
From this we see that in the special case J = 2 we obtain multiplicative correction factors which depend on one single feature component only. Thus, we have
M
XM(x)   =  A0(a:) J] (moo) 1{^m<c*m}+^) 1{xl*m>c*m})
m=l M q
= mp) n n exp {(iog^ooV^^}+log^oTVix^c,^}) i{^=/}}
m=l 1=1
This identity illustrates that we receive a multiplicative structure in the case J = 2. This is also nicely illustrated in Figure 7.2. We conclude that for trees of depth 1 we expect the Poisson regression tree boosting machine to have a similar performance as GLMs and GAMs because it (only) allows for multiplicative interactions. For more complex interactions we have to consider more deep SBS regression trees. This finishes this example. ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
194
Chapter 7.  Ensemble Learning Methods
Example 7.6 (regression trees, boosting machines and neural networks). In this example we consider regression trees, boosting machines and neural networks on a feature space X = [—1,1]2 of dimension q = qo = 2. For the neural networks we choose the step function activation which makes them directly comparable to regression trees.
regression tree with 3 standardized binary splits boosting machine with M=3and J=2 shallow neural network with q=3
Figure 7.3: (lhs) SBS split regression tree with 3 splits, (middle) Poisson boosting machine with M = 3 and J = 2, (rhs) shallow neural network with qi = 3 hidden neurons and step function activation.
We start by comparing the Poisson regression tree boosting machine with one split (J = 2) and M iterations to a shallow neural network (of depth d = 1) having q± = M hidden neurons. These two regression models are illustrated in Figure 7.3 (middle and rhs) for qi = M = 3. In Example 7.5 we have seen that regression tree boosting machines with J = 2 leaves remain in the family of multiplicative models. Shallow neural networks are more general because they allow for splits in any direction, compare Figure 7.3 (middle and rhs). This additional flexibility allows for non-multiplicative interactions and, in fact, the universality theorems on page 105 tell us that these shallow neural networks can approximate a large class of functions if we increase the number of hidden neurons qi sufficiently.
Increasing the number of hidden neurons qi in a shallow neural network means growing this network in width. On page 116 we have demonstrated that this growing in width may not be very efficient for capturing certain types of interactions, therefore, neural networks should be grown simultaneously in width and depth d to efficiently approximate any type of regression function.
The same can be said for the Poisson regression tree boosting machine. In Figure 7.3 (lhs) we illustrate a SBS regression tree with 3 splits, i.e. having J = 4 leaves. Observe that this provides us with a rather complex non-multiplicative interaction in the regression function for the components x\ and x2 because the second and third splits do not split across the entire feature space axes, i.e. are non-multiplicative. This complexity is similar to growing a neural network in depth, see (5.16) for an example. The Poisson regression tree boosting machine with J = 4 leaves multiplies (for m > 1) such non-trivial partitions of the type in Figure 7.3 (lhs) which indicates that we may construct very flexible regression functions by iterating the boosting machine for m > 1 (and for J > 2). ■
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
195
7.4.3   Example in motor insurance pricing, revisited
The GBM is implemented in the R package gmb. In Listing 7.2 we provide a short regression tree boosting algorithm that is based on the R package rpart.
Listing 7.2: Poisson regression tree boosting machine
1 d    <-  2 #  depth  of tree
2 M    <-   100 # iterations
3 alpha  <-  1 #  shrinkage constant
4
5 dat$fit   <- dat$expo  *  sum(dat$claims)/sum(dat$expo)
6
7 for   (m  in 1:M){
8 tree   <-  rpart(claims   ~   age  +  ac  +  power  +  gas  +  brand +  area +  dens  + ct
9 +  offset(log(fit)),
10 data=dat,   method="poisson",   parms=list(shrink=l),
11 control=rpart.control(xval=1,   minbucket=1000,   cp=0.000001,
12 maxdepth=d,   maxsurrogate=0))
13 dat$fit   <-  predict(tree)~alpha*dat$fit
14 }
Note that the R command rpart uses the variable maxdepth=d (depth of tree) to determine the size of the tree instead of the number of leaves J. If we use d = 1 the tree will have exactly 2 leaves, trees of depth d = 2 can have either 3 or 4 leaves, the explicit number will depend on the cost-complexity parameter cp on line 11 in Listing 7.2 (and also on the choice of minbucket). The variables xval and maxsurrogate are chosen such that run time is minimized.
Poisson boosting (d=1, alpha=1]
Poisson boosting (d=2, alpha=1]
Poisson boosting (d=3, alpha=1]
Figure 7.4: Decrease in estimation loss £(\, A*) of the Poisson regression tree boosting machine estimators over M = 100 iterations of trees of depths d 6 {1,2,3} (lhs, middle, rhs) and shrinkage constant a = 1.
We run the Poisson regression tree boosting machine of Listing 7.2 on our MTPL data considered in the previous examples. We choose M = 100 iterations, we set the shrinkage constant a = 1 and we choose trees of depths de {1, 2,3}. The resulting estimation losses £(\, A*) are presented in Figure 7.4. The plot on the lhs shows the results for trees of depth d = 1. As discussed in Example 7.5, this Poisson regression tree boosting machine can only model multiplicative interactions and, indeed, the corresponding estimation loss converges to the one of the GAM (blue horizontal line) as we increase the number
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
196
Chapter 7.  Ensemble Learning Methods
of iterations M. Thus, this Poisson regression tree boosting machine is not competitive because it does not allow for non-multiplicative interactions.
In Figure 7.4 (middle, rhs) we show the results of the Poisson regression tree boosting machine for trees of depths d = 2,3. We note that these configurations start to be competitive with the neural network and random forest approaches (magenta and green horizontal lines). We also see that the model with trees of depth d = 3 starts to over-fit after M = 20 iterations.
Poisson boosting (d=3, alpha-0.75) Poisson boosting (d=3, alpha-0.5) Poisson boosting (d=3, alpha=0.25)
iterations iterations iterations
Figure 7.5: Decrease in estimation loss £(\, A*) of the Poisson regression tree boosting machine estimators over M = 100 iterations of trees of depth d = 3 and shrinkage constants a 6 {0.75,0.5,0.25} (lhs, middle, rhs).
In Figure 7.5 we plot the same graphs of the estimation losses £(\, A*) of the Poisson regression tree boosting machine for trees of depth d = 3 and for different shrinkage constants a 6 {0.75,0.5,0.25} (lhs, middle, rhs). For a shrinkage constant of a = 0.25, i.e. only a moderate change of the previous estimator in (7.22), we do not receive over-fitting over the first M = 100 iterations. Moreover, the resulting boosted regression tree estimator outperforms all regression tree estimators we have met so far. We call this model PBM1, and the corresponding results are reported in Table 7.3.
Poisson boosting (d=4, alpha=0.25) Poisson boosting (d=4, alpha=0.1]
iterations iterations
Figure 7.6: Decrease in estimation loss £(\, A*) of the Poisson regression tree boosting machine estimators over M = 100 iterations of trees of depth d = 4 and shrinkage constants a 6 {0.25,0.1} (lhs, rhs).
Figure 7.6 shows the analogous results of the Poisson regression tree boosting machine Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
197
for trees of depth d = 4 and a 6 {0.25,0.1} (lhs, rhs). This configuration provides an even better performance for shrinkage constant a = 0.1, however, at a higher run time. We call this latter model PBM2, and the corresponding results are reported in Table 7.3.
		run	#	CV loss	strat. CV	est. loss	in-sample	average
		time	param.	rev	/•cv >~D		Ms >~D	frequency
(ChA.l	true model A*			27.7278				10.1991%
(Chl.l)	homogeneous	0.1s	1	29.1066	29.1065	1.3439	29.1065	10.2691%
(Ch2.4)	GLM4	14s	57	28.1502	28.1510	0.4137	28.1282	10.2691%
(Ch3.3)	GAM3	50s	79	28.1378	28.1380	0.3967	28.1055	10.2691%
(Ch5.4)	CANN1	28s	703	27.9292	27.9362	0.2284	27.8940	10.1577%
(Ch5.5)	CANN2	27s	780	27.9306	27.9456	0.2092	27.8684	10.2283%
(Ch5.2)	DNN2	123s	703	27.9235	27.9545	0.1600	27.7736	10.4361%
(Ch5.3)	DNN3	125s	780	27.9404	27.9232	0.2094	27.7693	9.6908%
(Ch5.7)	blended DNN	-		-	-	0.1336	27.6939	10.2691%
(Ch6.2)	Tree2 Tmm	65s	71	-	28.1388	0.3748	27.9156	10.2570%
(Ch7.3)	RF3	5'931s	-	27.9808	-	0.2256	27.2942	10.0863%
(Ch7.4)	PBM1	99s	725	27.8794	27.8783	0.1301	27.6149	10.2655%
(Ch7.5)	PBM2	115s	1314	27.8686	27.8763	0.1254	27.5921	10.2588%
Table 7.3: Poisson deviance losses of K-fo\d cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model A*; losses are reported in 10-2; run time gives the time needed for model calibration, and param.' gives the number of estimated model parameters, this table follows up from Table 7.2.
Figure 7.7: Resulting estimated frequencies (on log scale) of true vs. model DNN2 (lhs), true vs. model PBM2 (middle), and model PBM2 vs. DNN2 (rhs).
From these results we conclude that the boosting machine provides excellent results, on the same level as neural network ensembles, see Table 7.3, but at a reasonable run time. Another advantage of the Poisson regression tree boosting machine is that it provides rather stable average frequency estimates (last column of Table 7.3). In fact, the balance property only fails to hold here because we use the Bayesian version of the parameter estimates. Neural networks are also competitive in terms of estimation loss and run time, but the average frequency estimates need balance property regularization as shown in Section 5.1.5. Neural networks have the advantage that they allow for continuous inter-
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
198
Chapter 7.  Ensemble Learning Methods
and extrapolation if the activation functions have been chosen to be continuous. In Figure 7.7 we compare the log-frequencies of the neural network model DNN2, the Poisson regression tree boosting machine PBM2 and the true model A*. We see that the Poisson regression tree boosting machine is more concentrated around the true model than the neural network model, compare Figure 7.7 (lhs) and (middle). The only concern that we see with the Poisson regression tree boosting machine is that the low frequencies in the left-lower corner of Figure 7.7 (middle) are biased. This may be caused by the fact that we set a minimal leaf size of l'OOO cases, see line 11 of Listing 7.2. This leaf size may imply that we cannot distinguish the quality of good insurance policies. We could decrease this value. However, this comes at the price of more run time and with a higher potential of over-fitting. This closes our example.
7.4.4   AdaBoost algorithm
Usually, boosting is explained with the AdaBoost algorithm at hand which goes back to Freund-Schapire [45]. In analogy to the Poisson regression tree boosting machine, we can design a boosting algorithm for binary classification, see Section 2.5.1. We assume to have a binary classification problem
C : X -^y = {0,1},       x^y = C(x),
which needs to be estimated from the data
V={(Y1,x1),...,(Yn,xn)}, (7.23)
with features Xi G X and binary responses Yi G {0,1}. For a given classifier C we can study the misclassification rate on the data T> given by, we also refer to (2.31)-(2.32) and (6.30),
1 n
i=l
A weak classifier C is one that is (only slightly) better than random guessing, a more precise definition is given in (7.28), below. Boosting constructs a sequence (Cm)m=i m of weak classifiers together with a sequence of weights (ctm)m=i,...,M from which a more powerful classifier is constructed. The Ada(ptive)Boost algorithm works as follows.
AdaBoost Algorithm.
(0) Initialize working weights        = 1 for i = 1,. . ., n.
(1) Repeat for m = 1,..., M
(a) fit a weak classifier Cm to the working data
X><m> = { (li, si, 4m)) ,..., (y„, xn, WW) } , (7.24)
where we attach the working weights (tul-m'))i=i,...,ra to the original data (7.23); Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 7.  Ensemble Learning Methods
199
(b) calculate the weighted (in-sample) mis classification rate of Cm on X>(m) given
by
i n
c(m) 1 V7/;(m)1
™        (m) l^Wi     K {Y^Cm{Xl)Y
2^1=1 wi i=l
Ithm is terminates
(c) compute
the algorithm is terminated if = 0, otherwise
am = log
r(m)
(d) update the working weights i = 1
w\ exp
(2) Return the classifier (weighted majority vote
M
x    M-    C(a:) = sgn ( £ am (^(a:) - 0.5) ) . (7.26)
m=l
Comments on the AdaBoost algorithm.
• In (7.24) the observations V are extended by the working weights (tul-m'))i=iv..>ra. These working weights may be interpreted to give more attention to particular misclassification than to others, i.e. we pay more attention to cases (Yi,Xi) with a higher working weight w^m\ Such weights have not been used in classification, yet, for instance, in the binary tree classification algorithm of Section 6.3, but they could be implemented by modifying the empirical probabilities in (6.26) accordingly to
p(y,X';V^) = U=lWi   *K=y<**x'\ (7.27)
Observe that we initialize the above algorithm by setting       = 1 which provides
p(y,X';V^)
EiLi 1{Yi=y, xtex'} _ n(y,X';V)
n
J2y€yn(y,X;T>)'
Thus, a first classifier C\ for m = 1 in the AdaBoost algorithm can be constructed by the (classical) classification tree algorithm presented in Section 6.3.
Since Cm is a weak classifier for the working weights (tul-m'))i=iv..,ra, it is slightly better than random guessing for these weights, which is defined by the corresponding weighted (in-sample) misclassification rate satisfying requirement
< 1/2- (7.28)
This implies that am > 0 and henceforth update (7.25) increases the working weights w^m+1^ of the cases (Yi, Xj) that are misclassified by Cm, thus, more attention is paid to these misclassified cases in the next (weak) classifier Cm+i-
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
200
Chapter 7.  Ensemble Learning Methods
• In this sense, classifier (7.26) can be viewed as a GAM with basis functions Ci(-),..., C see (3.10). The main differences are that these basis functions are constructed on site (adaptive) and the optimization is stage-wise because only the latest parameter am is optimized (and the previous ones a±,. . ., am-l are kept fixed).
• It can be shown that the AdaBoost algorithm arises similarly to the Poisson regression tree boosting algorithm as a stage-wise adaptive optimization algorithm from an in-sample loss optimization problem. In this case we consider a classification problem with either having the exponential function or the binomial deviance loss as objective function, this also explains the choices of am, for details we refer to Hastie et al. [62], Sections 10.4-10.5.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8
Telematics Car Driving Data
In the previous chapters we have been analyzing claims frequency modeling of a heterogeneous car insurance portfolio. The risk drivers of this heterogeneous portfolio have been described by classical car insurance feature information. This classical car insurance feature information can be divided into several different groups:
• driver specific features: age of driver, gender of driver, marital status, date and place of driving test, occupation, medical conditions, size of household, type of flat, garage, credit record, leasing, etc.
• car specific features: type of car, car brand, size of car, weight of car, age of car, horse power, type of engine, cubic capacity, price of car, equipment, number of seats, etc.
• insurance contract specific features: type of contract, duration of contract, issue date of contract, sales channel, deductible, other insurance covers, etc.
• geographic features: province of living, zip code, city-rural area, etc.
• driving related features: annual distance, vehicle use, bonus-malus level, claims experience, etc.
Typical car insurance tariffs are based on more than 30 feature components. From the previous list, however, we see that many of these components are not directly related to the driving habits, driving styles and driving skills of the drivers.
Nowadays, many vehicles are equipped with technology that transmits car driving information via telecommunication systems to central data warehouses. These data transmissions (called telematics data) comprise detailed car driving information, and in the foreseeable future this information will be used for car insurance pricing because it displays driving habits and driving styles rather transparently. In fact, it is likely that this information will complement the classical feature information from above. These car driving transmissions may comprise rather different information (depending on the installed devices). This information may include high-frequency data about location (typically GPS location sec by sec), speed, acceleration and deceleration, left- and right-turns, engine revolutions (all sec by sec). Moreover, it may include number of trips, total distance of trips, total duration of trips, time stamp of trips, road conditions, road types,
201
Electronic copy available at: https://ssrn.com/abstract=2870308
202
Chapter 8.  Telematics Car Driving Data
weather information, as well as driver related information. The main difficulty from a statistical point of view is to convert this high-dimensional and high-frequency data into useful feature information for car insurance pricing.
If we assume that in average a car driver produces 100 KB of telematics data per day, this amounts to 40 MB of telematics data per year. For a comparably small portfolio of lOO'OOO car drivers this results in 4 TB of telematics data each year. This is a large amount of data, and proper data warehousing is crucial to handle this data. For instance, it may already be difficult to identify a given driver each day in that data, thus, even calculating the total annual distance of a given driver may result in a non-trivial exercise. From this it is obvious that we have to compress telematics data in an appropriate way to make it useful for statistical analysis.
In the actuarial literature there is an increasing number of contributions that study telematics data from a statistical point of view. Verbelen et al. [129] use GAMs to analyze the effect of telematics data in conjunction with classical feature information. Their study considers information like number of trips, total distance driven, road type, and daytime and weekday of driving. It does not study speed, acceleration and deceleration, intensity of left- and right-turns. Another interesting stream of literature explores pay-as-you-drive (PAYD) and usage-based (UB) car insurance products, see Ayuso et al. [5, 6] and Boucher et al. [12]. These works elaborate on the exposure measures and design pricing frameworks that are directly related to the effectively driven distances. The latter work is complemented by Denuit et al. [30] which adds to the above features additional information about distances traveled at night, distances driven in urban zones, and distances driven above speed limits. This information is still rather driving habits based, except the last one on 'excesses of speed limits' which indicates information about driving styles.
Probably, the first contribution in the actuarial literature that considers driving styles from telematics car driving data is Weidner et al. [132]. These authors use Fourier analysis for pattern recognition to study driving behavior and driving styles. In particular, they analyze the frequency spectrum obtained by single driving maneuvers and trips. Though, the main question of how to use these maneuvers and trip information as feature information for car insurance pricing is not finally answered in Weidner et al. [132], yet. In this chapter, we start by considering Wuthrich [136] who classifies telematics information of different car drivers into different groups of drivers according to a chosen similarity measure. This classification does not use claims frequency information and belongs to the field of unsupervised learning methods, dealing with cluster analysis and pattern recognition. Based on this starting point we will discuss other unsupervised learning methods for clustering different drivers.
8.1   Description of telematics car driving data 8.1.1    Simple empirical statistics
For the analysis in these notes we consider telematics car driving data which comprises GPS location data sec by sec. In Figure 8.1 we choose three different car drivers (call them drivers A, B and C, respectively) and we illustrate 200 individual trips of these
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
203
three car drivers. For confidentiality reasons and illustrative purposes all individual trips are initialized to start at location (0,0), and they are randomly rotated (at this origin) and potentially reflected at the coordinate axes. These transformations do not change the crucial driving characteristics like speed, acceleration and deceleration, and intensity of turns.
GPS coordinates of individual trips GPS coordinates of individual trips GPS coordinates of individual trips
Figure 8.1: 200 individual trips of the three different car drivers A (left), B (middle) and C (right); in orange color are the shorter 100 trips and in gray color the longer 100 trips; the square shows the area [—20 km, 20 km]2.
Figure 8.1 shows for each of these three drivers the square [—20 km, 20 km]2, in orange color we plot the shorter 100 trips and in gray color the longer 100 trips. We see that driver A usually travels short distances (within a radius of less than 5 km) but he also drives a few longer trips; driver B is a long-distance traveler (with many trips longer than 10 km); and driver C only does shorter drives.
From this GPS location data we can calculate many other quantities. Following [51], we calculate the average speed vt, the average acceleration and deceleration at, and the average change in direction (angle) At every second t of each individual trip. Let (xt,yt) denote the GPS location in meters every second t of a single trip of a given driver. From this GPS location data we calculate the average speed (velocity) at time t (in m/s)
and the average acceleration and deceleration at time t (in m/s2)
at = vt- vt-\.
For a positive average speed vt > 0 at time t, we define the direction of the heading by
<pt = atan2 (yt - yt-i,xt - xt-i)   G (-tt,tt],
where atan2 is a common modification of the arctan function that transforms Cartesian coordinates to polar coordinates such that the resulting polar angle is in (—tv,tv]. For positive speeds vs > 0 at times s = t — 1, t we can then consider the change in direction (angle) from t — 1 to t given by ipt — ft-l G (—2tt, 2tt) . For the change in direction (angle) at time t we define
At = \sm(<pt ~ <Pt-i) \ ■ Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
204
Chapter 8.  Telematics Car Driving Data
We choose absolute values of changes in angles because our telematics data analysis should not be influenced by the signs of the angles, but rather by the intensity of turns (no matter whether these are left- or right-turns). Moreover, we choose the sine of the change in angle: the reason for this choice is that GPS location data has some imprecision1 which manifests stronger at very low speeds, say, when the car is almost standing still. Taking the sine of the change in angle slightly dampens this effect, but requires ipt — (ft-i G (—2ir, — 3ir/2] U [—7r/2, 7r/2] U [3tv/2, 2ir) for identifiability reasons. Note that the latter is usually fulfilled because changes of directions within one second cannot exceed tt/2.
driver A, trip number 1
driver B, trip number 1
driver C, trip number 1
A..............................................1 ,		
		
L . 1		
time in seconds
Figure 8.2: Individual trips of the three drivers A (left), B (middle) and C (right): the lower line in blue color shows the speeds (vt)t (in km/h), the upper line in red color shows that acceleration and deceleration (atjt (in m/s2), and the middle line in black color shows the changes in angle (At)t over 180 sec, i.e. t G {0,..., 180}.
In Figure 8.2 we illustrate individual trips of lengths 180 sec of the three drivers A (left), B (middle) and C (right). The lower lines in blue color show the speed (velocity) patterns (vt)t (in km/h), the upper lines in red color show that acceleration and deceleration patterns {atjt (in m/s2), and the middle lines in black color show the changes in angle patterns (At)t. We note that changes in angle are bigger at lower speeds. From Figure 8.2 (rhs) we can very well see that these changes in angle often go along with deceleration first and then accelerating after changing the direction of the heading. We remark that the maximal acceleration and deceleration at has been capped at ±3m/s2 for the plots in Figure 8.2. [133] state that normal acceleration goes up to 2.5m/s2, and extreme acceleration can go up to 6m/s2 for vehicles driving straight ahead. Braking
1GPS location data is typically subject to quite some imprecision. First of all, often GPS location data is rounded. This provides a first source of imprecision which has a stronger influence on acceleration and changes in angle at speeds close to zero. A second source of imprecision may be that the GPS signal itself is not fully precise w.r.t. position and timing. Finally, it may happen that the GPS signal is not received at all, for instance, while driving through a tunnel. The latter can be identified more easily because it leads to missing values or accelerations beyond physical laws (if missing values are not marked). In many cases, one also directly receives speed and acceleration (in all directions) from installed devices. However, also this data is subject to imprecision. In particular, often one faces the issue that the devices are not correctly calibrated, for instance, there may be a transmission of a constant positive speed of a car standing still according to GPS location data. Moreover, depending on the type of device installed the speed may be rounded to km/h which may be too rough, etc., and other devices may calculate statistics on a coarser time scale due to data volume constraints.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
205
may be stronger and may vary up to — 8m/s2. In our data acceleration and deceleration beyond ±3m/s2 is rather sparse and may also be caused by data quality reasons of an imprecise GPS signal, therefore we typically cap extreme acceleration and deceleration.
	driver A	driver B	driver C
total distance (in km) average distance per trip (in km)	1'235 6.18	1'808 9.04	l'OOl 5.01
total time (in h) average time per trip (in min)	40.84 12.15	51.27 15.38	39.43 11.83
average velocity (in km/h) median velocity over trips (in km/h)	30.25 28.83	35.27 35.91	25.39 25.29
Table 8.1: Empirical statistics of the drivers A, B and C.
In Table 8.1 we provide some simple empirical statistics of these three selected drivers, these include the totally driven distance of all considered trips and the total time therefore used. We also provide the average trip lengths (in distance and time), the average velocity over all trips, and the median speed of the single trips is provided. These statistics reflect the graphs in Figure 8.1. Next we analyze the amount of time spent in different speed buckets. We therefore consider the speed intervals (in km/h)
[0] car is in idling mode (zero speed),
(0, 5] acceleration or braking phase (from/to speed 0),
(5, 20] low speeds,
(8.1)
(20,50] urban area speeds,
(50,80] rural area speeds,
(80,130] highway speeds (we truncate speeds above 130 km/h).
Figure 8.3: Speed bucket distribution (in driving time) of the three drivers A, B and C.
In Figure 8.3 we show the amount of time spent in each of these speed buckets for the three selected drivers A, B and C. We observe that the distribution of the resulting speeds varies considerably over the speed buckets; not surprisingly driver B drives almost half of his time with speeds above 50 km/h, whereas the other two drivers mostly drive with speeds below 50 km/h. Also remarkable is that driver A stands still 28% of his total trip time of 40.84 hours, the corresponding figures for drivers B and C are 24% of 51.27 hours and 19% of 39.43 hours, respectively. These numbers are rather typical for big
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
206
Chapter 8.  Telematics Car Driving Data
cities, for instance, the idling mode in peak hours on arterials in the city of Beijing takes roughly 29% of the total driving time and in the city of Shanghai 38% of the total driving time, see Table 6 in Wang et al. [131]. Of course, these numbers depend on the network topography of a city and therefore may vary considerably.
Up to now we have only been considering driving habits, i.e. whether we have a longdistance driver, an urban area driver, etc. We could calculate many more of these simple empirical statistics. These statistics can be transformed into feature information which provides important complementary information to the classical actuarial feature information. However, the statistics considered so far do not say much about driving styles. This analysis is our next aim.
8.1.2   The velocity-acceleration heatmap
To analyze driving styles we consider so-called velocity-acceleration (v-a) heatmaps that display the average velocity v on the a;-axis (in km/h) and the corresponding acceleration and deceleration a on the y-axis (in m/s2). For this analysis we consider the same speed buckets for the velocity as used in Figure 8.3. Speed buckets have the advantage that the different driving habits do not directly influence the v-a heatmap analysis, if we consider in each speed bucket the resulting empirical density (normalized to 1). To receive the v-a heatmap, say in R = (5, 20] x [—2, 2] (in km/hxm/s2), we partition R into J congruent rectangles R±,..., Rj with
J
(J R3; = R       and       R- n Rf = 0    for all j ^ j'. (8.2)
Let Xj > 0 denote the total relative time amount spent in Rj of a given driver. Then, x = (xi,.. . ,xj)' G Vj gives us a discrete probability distribution in the (J — l)-unit simplex in satisfying
J
Xj > 0 for all j = 1,.. ., J, and       Xj = 1.
The corresponding v-a heatmap is defined to be the graphical illustration of x on R. In Figure 8.4 we provide these graphical illustrations for our three selected drivers on the different speed buckets defined in (8.1): the column on the left-hand side shows driver A, the middle column gives driver B and the column on the right-hand side corresponds to driver C. The five different rows show the five non-zero speed buckets. In the first row we have the acceleration and deceleration styles of the three drivers in speed bucket (0, 5] km/h. We observe that these look quite differently. Driver B seems to accelerate more intensively than driver C at speed 0, and it seems that he brakes more heavily at a higher speed than driver C. This may indicate that driver B is a more aggressive driver than driver C (because he needs to brake more abruptly, i.e. he approaches a red light at a higher speed). Driver A has intermediate braking and acceleration maxima, this indicates that he drives a manual gear car. The same consideration applies to the speed bucket (5, 20] km/h in the second row of Figure 8.4, in particular, the v-a heatmaps of drivers B and C look much more smooth which is probably implied by driving an automatic gear car (compared to driver A).
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
207
Rows three and four of Figure 8.4 are interpreted such that driver B has the smoothest driving style in these two speed buckets because the vertical diameter (level sets) of his heatmaps are smaller compared to the other two drivers. This may also be caused by
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
208
Chapter 8.  Telematics Car Driving Data
the possibility that driver B drives a heavier car that is less agile than the other two drivers. Note also that the light upper and lower acceleration boundaries in these graphs are caused by the fact that we truncate (here) acceleration and deceleration at ±2 m/s2. Finally, the last row of Figure 8.4 shows the driving styles above speed 80 km/h. We observe that in this speed bucket we only have limited information because our three drivers only spend little time in this speed bucket. Similar graphs could be provided for left- and right-turns.
Remarks.
• There is a closely related stream of literature that aims at understanding vehicular emissions, energy consumption and impacts on traffic, see [40, 66, 70, 77, 131]. This literature aims at constructing typical driving cycles for cities considering the respective topology of the given city. For the selection of representative driving cycles this literature uses so-called speed acceleration probability distributions (SAPD), see Figures 5-6 in [70], and speed acceleration matrices, see [77], which are nothing else than our v-a heatmaps.
• In Gao et al. [49] we have been studying the robustness of v-a heatmaps. Convergence results show that roughly 300 minutes of driving experience are sufficient to receive stable v-a heatmaps in the speed bucket (5, 20] km/h.
8.2    Cluster analysis
Our main goal is to classify the v-a heatmaps, i.e. we would like to identify the drivers that have similar v-a heatmaps. These drivers are then interpreted to have similar driving styles (according to their v-a heatmaps), and are therefore put into the same categorical classes for car insurance pricing. Here, we do this for one single speed bucket only. We choose R = (5, 20] x [—2,2] (in km/hxm/s2) because this speed bucket has been identified to be very predictive for claims frequency modeling, see Gao et al. [52]. We partition this rectangle R as described in (8.2), and we denote the resulting discrete probability distributions in the (J — l)-unit simplex by Vj. Classification of individual car drivers is then achieved by looking at a classifier function
C:Vj->JC,       x^C(x), (8.3)
with K, = {1,...,K} describing the K different categorical classes considered. Thus, classification aims at finding such a classifier C that separates different driving styles. We will present a simple unsupervised machine learning method for this task. For more sophisticated methods on pattern recognition we refer to the related literature.
8.2.1   Dissimilarity function
In order to construct a classifier C we start by describing the dissimilarity between two different probability distributions Xi,xi 6 Vj.  The dissimilarity between Xi and x% is
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
209
defined by
1 J
d(xi,xi) = dw(xi,xi) = - (xij - xij) ,
1 3=1
where Wj > 0 are predefined weights. Remark that we choose the (weighted) square distance because it has nice analytical properties (as we will see below), other choices are described in Section 14.3.2 of Hastie et al. [62]. The weights w = (wj)j may allow us to emphasize dissimilarities on certain subsets Rj more than on others. Below, for simplicity, we only consider w = 1.
Assume we have n different car drivers with v-a heatmaps Xi £ Vj for i = 1,..., n. The total dissimilarity over all drivers is defined by
1 n
D(I) = - d(xi,X!). n^i
Lemma 8.1. The total dissimilarity over all drivers satisfies
j n
D(l) =  J2w3J2 fcij - x-jf ,
0 = 1 i=l
with (empirical) means Xj = n_1 Y^2=l xi,j> for J = 1,.. . , ^/-Proof of Lemma 8.1. A straightforward calculation provides
^      n       J ^      J n
1,1=1 j=i j=i 1,1=1
^      J n
= -- x ™j x (xi<j ~ xj^2+2 (xi<j ~ x^ (xj ~ xi-^ + (xj ~ xi^2
j=l i,l=l
J n ▼  J n
=   x wj x 1|U ~ Xj^2 + n 53 wj 53 ^xi,j ~      ^j ~ xl'^ ' j=i      i=i j=i i,i=i
The second term is zero and the claim follows. □
Lemma 8.1 has a nice interpretation: Xj is the empirical mean of all probability weights j of the n drivers on subset Rj and the dissimilarity in these probability weights is measured by the (empirical) variance defined by
2       1  n 2
i=i
Thus, the total dissimilarity over all drivers is given by
J
D{Z) = n J2wo sr
The weights Wj > 0 can now be chosen such that different subsets Rj are weighted differently. For instance, if we are concerned with high acceleration, then we would give more weight to subsets Rj that cover the high acceleration region in R. In the examples below we will choose w = 1. The following lemma is immediate.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
210
Chapter 8.  Telematics Car Driving Data
Lemma 8.2. The empirical means Xj are optimal in the sense that
n
Xj = argmin      (xi,j ~ mj)2 ■
m-j        . -i J i=l
Lemma 8.2 says that the total dissimilarity D(T) is found by solving an optimization problem on each subset Rj. Moreover, this optimal solution (xj)j=it...tj is itself a probability distribution on R because it can be interpreted as a mixing distribution: all elements satisfy Xj G [0,1] and
E^ = ^EE^j = 1-
j=l i=l j = l
Thus, (xj)j=i,...,j G Vj is the discrete probability distribution that solves the optimization problems in Lemma 8.2 simultaneously for all j = 1,..., J.
8.2.2   Classifier and clustering
In the previous section we have considered the total dissimilarity over all drivers given by
^     n     j j n j
D(T) = -^^^Wjixij-xij)2  =  ^Wj^(xij - xj)2  = nY,wjs23-
i,l=l j=l 3 = 1       i=l 3 = 1
This corresponds to a weighted sum of squares (WSS), additionally scaled by n_1. This WSS is an error measure if we do not assume any additional model structure, i.e. under the assumption that we have a homogeneous portfolio of drivers. In the subsequent analysis we introduce model structure, that aims at separating different drivers into homogeneous sub-classes. We introduce a classification structure by partitioning the set X = {1,.. . ,n} of all drivers into K (non-empty) clusters I\,.. . ,1k satisfying
k
[J Xfc = X       and       X^ n Xj/ = 0    for all k ^ k'.
k=l
These K clusters define a natural classifier C, restricted to the drivers i G X, given by
k
C:X->/C,       i H4- C(i) = E k 1{iexfe}-
k=l
Note that we use a slight abuse of notation here: at the moment the classifier C is only defined on X, this is in contrast to (8.3). Its extension to Vj is provided in Remarks 8.4, below.
Under this classification approach on X we can decompose the WSS into two parts, the within-cluster sum of squares (WCSS) and the between-cluster sum of squares (BCSS), the latter being explained by the chosen clustering structure. That is,
k j
k=l 3=1 ielfe
k    j 2 k j 2
= E E wo E (xij - xo\k) + E nk Ewo (xm - xj) »
k=l 3=1      ielfe k=l 3=1
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
211
where     = |Ifc | is the number of drivers in 1^ and the empirical means on 1^ are given
by
= — E xij- (8-5)
The last term on the right-hand side of (8.4) is interpreted as the between-cluster dissimilarity of classifier C, defined by
K       J 2
fc=i j=i
This is term that can be explained by the classification structure induced by the partition (Xfc)fc of X. The first term on the right-hand side of (8.4) is the aggregate within-cluster dissimilarity of classifier C, defined by
K   J 2
W(C) =  EE"'/ E (>•-    •'•//) • (8-6)
k=ij=i ieife
Thus, we immediately get the following simple corollary.
Corollary 8.3. For any partition (Ifc)fceK;        we ^awe the relationship
D(I) = W{C) + B{C) >  max {W(C), B(C)} .
This indicates that, in general, we try to find the partition (Ik)k€K °f ^ that gives a minimal aggregate within-cluster dissimilarity W(C), because this partition (based on K parameters) maximally explains the observations. We consider the single terms of W(C) and define the (individual) within-cluster dissimilarities by
J J 2
D^Tk) = 1k~ 5 E^- fa,* -xij)2 = E^- E
k i,l€lk j=l 3 = 1 i€lk
xi,j xj\k
An easy consequence is
j        I K
D(i) = E"'/E':-'-- %)2 ^ ^(c) = Eö(^)-
j=i i=i fc=i
This explains the idea that we will pursue, namely, find the K-partition of I that leads to a minimal aggregate within-cluster dissimilarity W(C) in (8.6), i.e. and a maximal between-cluster similarity B(C). This optimally describes the observed heterogeneity on the portfolio considered.
This optimization problem is of very similar nature as the ones studied above, for instance, the regression tree construction problem of Chapter 6. We try to solve a global minimization problem that typically is not tractable do to computational constraints. Therefore, we present an algorithm in the next section which (at least) converges to a local minimum of the objective function.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
212
Chapter 8.  Telematics Car Driving Data
8.2.3   i\-means clustering algorithm
In this section we describe the K-means clustering algorithm, see Algorithm 10.1 in James et al. [74], which provides a classifier C on X for a fixed number K of categorical classes. This X-means clustering algorithm is very popular (and it is probably the most used one for solving this kind of classification problem).
For a given classifier C : X —> K, we consider the aggregate within-cluster dissimilarity given by, see (8.6),
k j
W(C)   =   Z   HW3 Z (XiJ -X3\k) >
fc=i j=i ieife with empirical means on X^ given by Xj^ = n^1 Sieife xi,j-
The optimal classifier C* : X —> K, for given K is found by solving
k      j .2
C* = argmin W(C) = argmin    min Z_jWj
C:Z->/C C:I^C   K-|fc)iifc   fc=1   j = 1 i:C(i)=fc
This optimal classifier is approximated by alternating the two minimization steps. X-Means Clustering Algorithm.
(0) Choose an initial classifier         : X —> K, with corresponding empirical means
(1) Repeat for i > 1 until no changes are observed:
(a) given the present empirical means (a^|fc ^)j,k choose the classifier       : X —> K, such that for each driver i £ X we have
J 2 C(t)(i) = argmin Ewi (XiJ ~ xf\k ^) '■>
(b) calculate the empirical means (^*j.)j,fc on tne new partition induced by classifier C<*) : X -> JC.
Remarks 8.4.
• The X-means clustering algorithm converges: Note that due to the minimization in step (la) and due to Lemma 8.2 for step (lb) each iteration in (1) reduces the aggregate within-cluster dissimilarity, i.e., we have
W(C(0)) > ... > wic^-V) > W(C(t)) > ... > 0.
This provides convergence in finite time because we have finitely many K-partitions. However, we may end up in a local minimum. Therefore, one may use different (random) initial classifiers (seeds) in step (0) of the algorithm to back-test the solution.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
213
Another issue is the choice of the constant K for the number of clusters considered. We may start with running the algorithm for K = 2 which leads to a binary partition (Ifc)fc=i,2 with empirical means (xj^)j:k=i,2- For K = 3, we may then use these empirical means |fe)j,fc=i,2 together with (xj)j as initial values for the K-means clustering algorithm with K = 3. This choice ensures that the resulting aggregate within-cluster dissimilarity is decreasing in K. This is exactly what we consider in Listing 8.1, below.
For an arbitrary driver x £ Vj D {x\,..., xn} we extend classifier of the above algorithm to
j
x I—^ C^\x) = argmin ^
fce/C j=1
W3 \X3 Xj\k
-(*-i)x2
where (x^\k^)j,k are the empirical means obtained after the (t — l)-st iteration of the K-me&ns clustering algorithm.
8.2.4 Example
We apply the X-means clustering algorithm to an example. We therefore choose the synthetic data which is generated by the simulation machine available from:
https://people.math.ethz.ch/~wmario/Simulation_Machines/simulation.machine.heatmaps.VI.zip
We exactly use the set-up proposed in that simulation machine for n = 2'000 car drivers. Lhis generates heatmaps of grid size 20 x 20, i.e. with J = 400. Moreover, we set w = 1.
Listing 8.1: K-means algorithm for K = 2,..., 10
1 J <- 400
2 WCSS <-  array(NA, c(10))
3 WCSS [1] <-  nrow(X)   *  sum ( c olSds ( as . mat r ix (X ) ) ~2 )
4 Classifier     <-  array(1,   c(10, nrow(X)))
5
6 set.seed(100)
7 for   (K  in  2:10){
8 if   (K==2){(K_res <- kmeans(X.K) )}
9 if   (K>2){(K_res <- kmeans(X,K_centers) )}
10 clusters <- K_res$cluster
11 WCSS[K] <- sum(K_res$withins)
12 Classifier[K,] <- clusters
13 K_centers <- array(NA,   c(K+l, J))
14 K_centers[K+l,] <- colMeans(X)
15 K_centers[1:K,] <- K_res$centers
16 }
In Listing 8.1 we provide the R code of the K-means algorithm for K = 2,..., 10, where we always use the resulting centers for given K as initial values for the algorithm with K + l centers (line 14) and the new center is initialized by the overall mean (xj)j (line 13). In Ligure 8.5 we provide the results of iterations K = 1,..., 4 (rows 1-4). In Ligure 8.6 we provide the resulting decrease in aggregate within-cluster dissimilarities W(C) as a function ofi^=l,...,10. We observe that the first split is by far the most efficient one, and this is supported by rows 1 and 2 in Ligure 8.5, where we see that
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
214
Chapter 8.  Telematics Car Driving Data
			J... J !: ;l		1;	
	E	1 !	[  l l l ; I I	1 !		
kW
Figure 8.5: X-means clustering algorithm: resulting empirical means (xj\k)j,k=l,...,K f°r the different constants K = 1,2, 3,4 corresponding to rows 1-4.
decrease in aggregate within-cluster dissimilarity
Figure 8.6: Aggregate within-cluster dissimilarities W(C) as a function ofÄ" = l,...,10.
smooth plots are separated from plots with multiple local maxima. The second split Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
215
from K = 2 to K = 3 further splits the plots with multiple local maxima. The third split then separates more smooth v-a heatmaps (bottom-right) from the other heatmaps. Remark that individual policies may change clusters from K to K + 1 in Figure 8.5. Based on a fixed K for the number of labels, and depending on the chosen dissimilarity measure this provides us with categorical feature information that may be used in car frequency prediction (if we believe that the resulting clusters describe common driving styles and risk factor behaviors).
8.3   Principal component analysis
The K-me&ns clustering approach described in the previous section has the disadvantage that it leads to (nominal) categorical classes. We have seen that this may be a disadvantage in a claims frequency regression analysis because it may induce that many regression parameters are necessary (if we use, for instance, dummy coding). In this section we introduce dimension reduction techniques that lead to continuous (low-) dimensional representations of v-a heatmaps.
We denote by X = (x^,..., x'n)' £ Rrax J the n x J design matrix that contains the v-a heatmaps Xi £ Vj of all drivers i = 1,..., n (we assume n > J). Denote by X £ RraxJ a normalized version of X so that all column means are zero and have unit variance. The goal is to find a matrix Xq £ RraxJ of rank q < min{ J,n} such that the least-squares reconstruction error to X is minimized. This then implies that we receive a q-dimensional representation of X. As described in Section 14.5 of Hastie et al. [62], this means that we need to perform a principal component analysis (PCA) that provides a best linear approximation to X of rank q.2 The corresponding solutions are obtained by the following singular value decomposition (SVD) which says that there exists an n x J orthogonal matrix U (U'U = tj), & J x J orthogonal matrix V (V'V = lj) and a J x J diagonal matrix A = diag(Ai,..., Aj) with singular values Ai > ... > Aj > 0 such that
X = UAV',
see (14.51) in Hastie et al. [62]. Multiplying from the right with V, we receive the principal components of X as the columns of the matrix
XVr = *7A = [/diag(Ai,...,Aj). (8.7)
In PCA we now analyze how many singular values Ai > ... > Xq, 1 < q < J, and principal components (columns of XV = UA) we need to find good approximations to the true matrix X. The first q principal components are received from the first q columns of V called right-singular vectors of X, see (8.7).
We use the R command svd to receive the SVD. In Figure 8.7 we provide the resulting singular values Ai > ... > Aj > 0 on the original scale and on the log scale. We
2In a nutshell, a PCA provides an orthogonal transformation to a system of coordinates such that the projections onto these coordinates provide the directions of the biggest variances of the design matrix X (in decreasing order and always orthogonal to the previous ones). If we only keep the first q principal components, then we receive a rank q approximation Xq to X that has minimal least-squares reconstruction error, i.e. we project from dimension J to dimension q at a minimal loss of information.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
216
Chapter 8.  Telematics Car Driving Data
Figure 8.7: Singular values Ai > ... > Aj > 0 with J = 400 on the original scale (lhs) and on the log scale (rhs).
observe that the first three singular values are dominant, thus, reduce the reconstruction error most. Denote Ag = diag(Ai,..., Xq, 0,..., 0) G RJx J■ The best possible rank q approximation to X (i.e. with minimal least-squares reconstruction error) is given by
Xq = UAqV*
and doing the back-transformation to the original column means and variances, respectively, we obtain the rank q approximation Xq to X.
Figure 8.8: PCA approximation of the v-a heatmap of driver 1: (lhs) original heatmap x±, (other columns) approximations for q = 1,.. . , 3.
□
Figure 8.9: PCA approximation of the v-a heatmap of driver 100: (lhs) original heatmap ccioo, (other columns) approximations for q = 1,. .. , 3.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
217
Figure 8.10: PCA approximation of the v-a heatmap of driver 500: (lhs) original heatmap CC500, (other columns) approximations for q = 1,. .. , 3.
In Figures 8.8-8.10 we illustrated the three different drivers i = 1,100, 500. The first plot in each figure shows the original v-a heatmap Xi of that driver, and columns 2-4 show the PCA approximations for q = 1,2, 3 (rows i = 1,100,500 of matrix 3tq). Just by looking at these plots we claim that we need (at least) three principal components to represent our v-a heatmaps sufficiently accurately.
weights of principal component 1
weights of principal component 2
weights of principal component 3
	
	
	4-
I	
Figure 8.11: First three right-singular vectors of X, i.e. first three columns of V.
In Figure 8.11 we illustrate the first three right-singular vectors of X, i.e. the first three columns of V. These provide the first three principal components, see (8.7). We interpret these figures as follows: the first right-singular vector of X measures strong acceleration and deceleration (beyond ±1 m/s2), the second right-singular vector balances differences between high and low velocities in R, and the third right-singular vector takes care of local maxima.
In Figure 8.12 we provide the first three principal components, i.e. we consider the projection of X onto the first three principal components (setting q = 3). The dots in Figure 8.12 are colored blue-red-green according to the K-means classification received in Figure 8.5 for K = 3. We observe that the PCA with q = 3 and the A-means clustering with K = 3 basically come to the same results because the colored dots from the A-means clustering seem well separated in the PCA representation, see Figure 8.12.
Remarks.
• The PCA leads to a rank q approximation Xq to X. This approximation is a linear approximation that minimizes the least-square reconstruction error ||Xq — X\\%. The
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
218
Chapter 8.  Telematics Car Driving Data
Figure 8.12: First three principal components representation of all drivers, the colors (blue-red-green) illustrate the ii-means classes for K = 3 received in Figure 8.5.
PCA has the disadvantage that the resulting approximations are not necessarily v-a heatmaps, for instance, the linearity implies that some entries of Xq may be negative, violating the assumption of having probabilities xq £ Vj.
• More generally, we may consider so-called auto-encoders.  An auto-encoder is a composition it = ip o ip of two maps. The encoder is given by
(p : X —> Z,       z = (p(x),
and decoder is given by
V> : 2 ->■ X,       x = ip(z).
A good auto-encoder considers the two maps (p and ip such that tt(x) = (ipo(p)(x) pa x, that is, the output signal ir(x) cannot be distinguished from the input signal x. In that case we can use the value z = (f(x) 6 Z as representation for x, because the decoder tp allows us to reconstruct the input signal x from its encoded value z. If the dimension of Z is low we receive a low-dimensional representation of x £ X.
• Bottleneck neural networks are popular architectures of auto-encoders, we refer to Kramer [83] and Hinton-Salakhutdinov [65]. For a bottleneck neural network we
Version October 27, 2021, M.V. Wůthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Chapter 8.  Telematics Car Driving Data
219
consider a feed-forward neural network encoder
ip : Rqo -> Rq'm,       <p(x) = (z<m) o ■ ■ ■ o z*1)) (x),
and feed-forward neural network decoder
if) : Rqm -> Rq2m,       i/)(z) = (z(2m) o ■ ■ ■ o z(m+1)) (z),
with dimensions qo = q2m qm. This network is then trained such that we have tt(x) = (ip o <p)(x) ~ x, with qm-dimensional representation z = <p(x) of x. Since we typically assume qm <C qo = q2m, this neural network architecture is called bottleneck neural network with bottleneck activations z = (p(x) G R9m. Sometimes, this low-dimensional representation is also called non-linear PCA.
Figure 8.13: Bottleneck neural network with bottleneck q2 = 2 and go = 94 = 20.
In Figure 8.13 we illustrate a bottleneck neural network. Input and output dimensions are go = 94 = 20 (blue color), and the bottleneck dimension is q2 =2. Observe that the chosen neural network architecture in Figure 8.13 is symmetric w.r.t. the bottleneck. This has been done on purpose because this symmetry is essential for finding good calibrations, we refer to Hinton-Salakhutdinov [65].
• There remains to prove in a comprehensive study that these low dimensional representations of the v-a heatmaps have explanatory power for claims frequency predictions. First attempts on rather small portfolios have been done in [49, 52].
• Another direction that we may pursue is to directly analyze individual trips, and score these individual trips, see also Figure 8.2. This individual trip scoring is related to time series analysis, and the methods at hand are convolutional neural networks (for fixed length of the time series), long short-term memory (LSTM) networks or gated recurrent unit (GRU) networks. Preliminary analysis in [51] indicates that 180 sec of driving experience already allows us to allocate randomly chosen trips to the right drivers.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
220
Chapter 8.  Telematics Car Driving Data
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A
Motor Insurance Data
A.l    Synthetic data generation
We consider a motor third party liability (MTPL) insurance portfolio that consists of n = 500'000 car insurance policies. This portfolio has been constructed synthetically based on the French MTPL data f reMTPL2f req.1 In particular, we have been transforming the old French regions to Swiss cantons by meeting related population densities.
Listing A.l: Synthetic MTPL insurance portfolio
1	1 data.frame 1 :		500000   obs. of	10  variables:											
2	$ id	int	1  2  3 4  5  6 7	8  9 10...											
3	$ expo	num	0.33   0.08 0.92	1   0.63 1	0.13   1 1	OA	35								
4	$ age	int	66  31   60  66 63	53  61 41	41 39										
5	$ ac	int	4  1  6 4 3  5 13	114 6											
6	$ power	int	3 7  5  2  5  3 4	14  1 ...											
7	$ gas	Factor  w/  2 levels		"Diesel"."	Regular"	2	1	1	1	2	1	2	1	1	1 ...
8	$ brand	Factor  w/   11 levels		"Bl","BIO	","B11",		4	1	1	1	4	1	7	7	1  7 .
9	$ area	Factor  w/  6 levels		" A " , " B " , " C	" , " D " , . .	2	1	3	3	2	3	3	2	4	4 ...
10	$ dens	int	83  34  223 283	74  125 323	65   533 i	389									
11	$ ct	Factor  w/  26 levels		" ZH","AG"	,"AI " , . .	5	6	2	8	25		26	2	24	5 10
In Listing A.l we provide a short summary of the synthetic portfolio. For each of these car insurance policies we have feature information (on lines 4-11 of Listing A.l) and the corresponding years at risk (expo) information Vi 6 (0,1] (on line 3 of Listing A.l). Line 2 of Listing A.l gives the policy number (id). Since the policy number is not considered to be of explanatory character for claims prediction we drop this information from all our considerations.
We start by describing the exposure. The years at risk information expo is illustrated in Figure A.l (lhs and middle). For the years at risk we have the following properties imiiiVi = 0.02 and max^ Vi = 1, that is, the minimal observed time insured in our portfolio is 7.3 days and the maximal time insured is 1 year. The average time insured is J2ivi/n = 0.506, which corresponds to 185 days, the median time insured is 172 days, and (only) 21% of the policies are insured during the whole year.
1The data freMTPL2freq is included in the R package CASdatasets, see Charpentier [25]. It is described on page 55 of the reference manual [23], we also refer to the CASdatasets website http: //cas .uqam. ca.
221
Electronic copy available at: https://ssrn.com/abstract=2870308
222
Appendix A. Motor Insurance Data
Figure A.l: (lhs) Histogram and (middle) boxplot of the years at risk expo of the n = 500'000 car insurance policies, (rhs) histogram of the true frequencies \*(xi) of all insurance policies i = 1,..., n.
Next, we describe the feature information provided on lines 4-11 of Listing A.l:
age age of driver, continuous feature in {18,..., 90} years;
ac age of car, continuous feature in {0,..., 35} years;
power power of car, continuous feature in {1,.. . , 12};
gas fuel type of car (diesel/regular petrol), binary feature; (A.l)
brand brand of car, categorical feature with 11 labels;
area area code, categorical feature with 6 labels;
dens density at the living place of the driver, continuous feature in [1,27'000];
ct Swiss canton of the car license plate, categorical feature with 26 labels.
In Figure A.2 we illustrate the 26 Swiss cantons.
Swiss cantons
Figure A.2: Swiss cantons.
Based on this feature information we design a synthetic regression function Xi i—> \*(xi) from which we sample our claims observations Ni for 1 < i < n. In the first step we need to define a feature space X* containing the 8 explanatory variables given in (A.l). Note that this requires transformation of the categorical feature components into an appropriate form, for instance, by dummy coding. In the second step we design a "realistic" (true) expected frequency function and denote it by
\*:X*^M.+ , x^\*(x). Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A. Motor Insurance Data
223
The choice we make is "realistic" in the sense that it reflects typical characteristics of a real car insurance portfolio. Deliberately we do not provide the specific functional form of that regression function A* here, but we leave it to the reader to search for the true expected frequency function. On line 12 of Listing A.2 (denoted by truef req) we provide the true expected frequencies \*(xi) of all insurance policies i = 1,... ,n. The histogram of these true expected frequencies is given in Figure A.l (rhs). Note that these true expected frequencies are right-skewed.
Listing A.2: Synthetic MTPL insurance portfolio including claims
1	1 data.frame 1		500000   obs. of	12 variabl		es :										
2	$   id :	int	1  2  3 4  5  6 7	8  9 10												
3	$  expo :	num	0.33  0.08 0.92	1 0.63	1	0.13  1 1	OA	55								
4	$  age :	int	66  31  60  66 63	53 61	41	41 39										
5	$  ac :	int	4  1  6 4  3  5 13	11  4 6												
6	$  power :	int	3 7  5  2  5  3 4	14 1.												
7	$  gas :	Factor  w/  2 levels		"Diesel	" , "	Regular"	2	1	1	1	2	1	2	1	1	1 ...
8	$  brand :	Factor  w/  11 levels		"Bl","	BIO	","Bll" ,		4	1	1	1	4	1	7	7	1  7 .
9	$  area :	Factor  w/  6 levels		" A " , " B "	, "C	" , " D " , . .	2	1	3	3	2	3	3	2	4	4 ...
10	$  dens :	int	83  34  223 283	74 125	323	65   533 i	589									
11	$  ct :	Factor  w/  26 levels		"ZH","	AG"	,"AI" , . .	5	6	2	8	25		26	2	24	5 10
12	$  truefreq:	num	0.0599 0.1192	0.0743	0.0928 0.05											
13	$  claims :	int	0  0  0  0  0  0 0	0  0 0.												
Remarks.
• Knowing the true expected frequencies A*(a^) of all insurance policies i = 1,..., n has the (big) advantage that we can explicitly quantify the quality of all considered estimated models. However, this is not the typical situation in practice, therefore, any number that can only be calculated because we know the true model is highlighted in green color in these notes.
• Whenever we refer to the true model we use the *-sign in the notation of the feature space X* and the regression function A*.
• We leave it to the reader to unravel the true expected frequency function A* : X* —> R+. It contains non-log-linear terms and non-trivial interaction terms. Lhe data can be downloaded from
https://people.math.ethz.ch/~wmario/Lecture/MTPL_data.csv
Based on regression function A* : X* —> R+ we generate the observations N±,..., Nn which comprises the data
V = {(iVi, xuVl) ,...,(Nn, xn, vn)} . (A.2) Lhis data T> is simulated by independently generating observations iVj from
Ni        Poi (X*(xi)vi)        for i = 1,..., n. (A.3)
Lhe simulated claims are illustrated on line 13 of Listing A.2.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
224
Appendix A. Motor Insurance Data
We briefly analyze whether the generated claims are a typically realization of our simulation model. The true average portfolio frequency and the empirical average portfolio frequency are given by, respectively,
A* = SXi^te)"' = 10,1991%       and       A = ^fr1 Ni = 10.2691%. (A.4)
Ej=i vi El=i vi
Thus, the simulated data has a small positive bias compared to the true average portfolio frequency A*.
Figure A.3: (lhs) Empirical average portfolio frequencies A of l'OOO simulations with different seeds and (rhs) true Poisson deviance losses of l'OOO simulations with different seeds; the straight vertical lines show the empirical means, the dotted lines the 2 standard deviations confidence bounds, the blue dots specify the chosen simulation of Listing A.2.
We repeat l'OOO times the simulation of the data T> in (A.2) using the true model (A.3) with different seeds. In Figure A.3 (lhs) we illustrate the empirical density of the estimated average portfolio frequency A over the l'OOO simulations with the blue dot illustrating the one chosen in Listing A.2. The sample mean of these simulations is 10.2022% with a standard deviation of 0.0657%. We note that the chosen simulation has a typical empirical average portfolio frequency A, see Figure A.3 (lhs). Next, we calculate the Poisson deviance loss w.r.t. the true model A*
1 I  n    \ f
-D*(N,\*) = - V2 Xk(xi)vi-Ni-Ni\og[
i=i
\*{Xj)vi
27.7278 ■ 10-2.    (A.5)
Also this analysis we repeat l'OOO times using different seeds for the different simulations. In Figure A.3 (rhs) we illustrate the empirical density of the Poisson deviance loss D*(N, A*)/n over the l'OOO simulations of observations AT", the blue dot again illustrates the realization chosen in Listing A.2. We observe that the chosen data has a slightly bigger value than the empirical average of 27.7013 ■ 10-2 (straight vertical black line in Figure A.3 (rhs)), but it is within two standard deviations of the empirical average (the empirical standard deviation is 0.1225 ■ 10-2).
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A. Motor Insurance Data
225
A.2   Descriptive analysis
We provide a descriptive analysis of our synthetic MTPL claims data in this section.
	0	1	2	3
# policies	475'153	23773	1'012	62
in %	95.031%	4.755%	0.202%	0.012%
Table A.l: Split of the portfolio w.r.t. the number of claims.
In Table A.l we illustrate the distribution of the observed numbers of claims [Ni)i<i<n across the entire portfolio of our synthetic data T>. We note that 95% of the policies do not suffer a claim. This is the so-called class imbalance problem that often causes difficulties in model calibration.
age of driver structure observed frequency per age of driver
18   25   32   39   46   53   60   67   74   81 88
Figure A.4: Age of driver (age): (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) average age of driver per Swiss canton.
We provide for all 8 feature components the marginal plots in Figures A.4 to A.ll. The graphs on the left-hand side show the exposure structures (in years at risk). The middle graphs provide the observed marginal frequencies (the dotted lines correspond to the estimated 2 standard deviation confidence bounds). The graphs on the right-hand side show the average feature values per Swiss canton (green color means low value and red color means high value).
From Figure A.4 (lhs) we observe that our car drivers are mostly aged between 35 and 55 years. In the middle plot of Figure A.4 we see that the observed frequency sharply drops at young ages, and it stabilizes at the age of 30. Noticeable is that we have a local maximum around the age of 50: typically, the feature age denotes the age of the main driver on a given car, but the claims frequencies may also include multiple drivers on the same car. The local maximum at the age of 50 is explained by the fact that at this age young adults start to drive on their parents' cars. We also note that the uncertainty is large for higher ages because we only have very few older policyholders above the age of 80 in our portfolio. Finally, Figure A.4 (rhs) shows that our portfolio is considerably old in canton TG, and rather young in cantons SG, JU and BL.
We can do a similar descriptive analysis for all other feature components. We just mention some peculiarities of our portfolio. Figure A.5: New cars have a rather high observed
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
226
Appendix A. Motor Insurance Data
Figure A.5: Age of car (ac): (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) average age of car per Swiss canton.
Figure A.6: Power of car (power) (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) average power of car per Swiss canton.
Figure A.7: Fuel type of car (gas) (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) ratio of regular fuel cars per Swiss canton.
claims frequency. Our portfolio has comparably old cars in many cantons that are not so densely populated (mountain area). Figures A.6 and A.7 (rhs) have some similarities because the power of a car is often related to the fuel type. From Figure A.8 we observe that car brand B12 is rather special having a much higher observed frequency than all other car brands.
In Figures A.9 and A. 10 we provide the plots of the area codes and the densities. We observe quite some similarities between these two feature components which indicates
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A. Motor Insurance Data
227
observed frequency per brand of c
JDdUd^UUUU
B13 B2 B4 brand of car
Figure A.8: Brand of car (brand) (lhs) portfolio structure in terms of expo, (rhs) observed frequency.
Figure A.9: Area code (area) (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) average area code per Swiss canton, we map {A,..., F} i—> {1,..., 6}.
Figure A. 10: Density (dens) (lhs) portfolio structure in terms of expo, (middle) observed frequency, (rhs) average density per Swiss canton.
that they are dependent. In our data the canton GE is very densely populated and observed frequencies are increasing in density. Finally, in Figure A. 11 we show the empirical statistics per Swiss canton. Canton GE has the highest frequency which is likely caused by the density and the young car structure of that canton. Canton AI has the lowest frequency which needs to be explained by several different factors.2
2Note that in real Swiss MTPL data, typically, canton AI has the highest frequency which is caused by the fact that many rental car companies are located in that canton.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
228
Appendix A. Motor Insurance Data
Figure A.11: Swiss cantons (ct) (lhs) portfolio structure in terms of expo, (rhs) observed frequency.
Next we analyze the dependencies between the different feature components. We therefore distinguish between categorical and continuous ones. The area code is considered to be continuous and we use the corresponding mapping {A,..., F} i—> {1,..., 6}.
car brand groups car brand groups car brand groups
Figure A. 12: Distribution of the continuous feature components across the car brands: from top-left to bottom-right: age, ac, power, gas, area and dens.
In Figure A. 12 we illustrate the distribution of the continuous feature components across the car brands. Top-left: we observe that older people drive more likely car brand BIO and younger drivers B3. Striking is the age of the cars of brand B12, more than 50% of this brand is less than 3 years old (top-middle). Cars BIO and Bll seem to be very powerful cars (top-right), where the former is more likely a diesel car (bottom-left). In Figure A. 13 we give the distribution of the continuous feature components across the Swiss cantons. These figures reflect (in more detail) what we have already observed in
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A. Motor Insurance Data
229
driver's age profile among Swiss cantons age of car profile among Swiss cantons power of car profile among Swiss cantons
Swiss cantons Swiss cantons Swiss cantons
Figure A. 13: Distribution of the continuous feature components across the Swiss cantons: from top-left to bottom-right: age, ac, power, gas, area and dens.
car brand profile among Swiss cantons
ZH    AG    Al    AR    BE    BL    BS    FR    GE    GL    GR    JU    LU    NE   NW   OW   SG    SH    SO    SZ    TG    Tl     UR    VD    VS ZG
Swiss cantons
Figure A. 14: Distribution of car brands across the Swiss cantons.
the Swiss maps in Figures A.4 - A. 10. Canton GE plays a special role in terms of age of car, area code and density, and the mountain area AI, GL, GR, NW, OW, UR has a comparably old car portfolio (ac).
In Figure A. 14 we study the distribution of the car brands across the Swiss cantons. We note that car brand B12 is dominant in AR and SO. This car brand has mostly new cars which typically have a higher frequency. This may explain the high observed frequencies in these two cantons.
In Figure A. 15 we provide the contour plots of the portfolio distribution of the continuous feature components. These serve us to analyze colinearity in feature components. From
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
230
Appendix A. Motor Insurance Data
□
Figure A. 15: Two dimensional contour plots of the portfolio distribution of the continuous features age, ac, power, area and dens.
these plots we cannot detect much dependence structure between the continuous feature components, except between area and dens there seems to be a strong linear relationship (we have mapped {A,..., F} f-> {1,..., 6}). It seems that the area code has been set according to the density of the living place of the policyholder. This has already been noticed in Figures A.9 and A.10, and it is confirmed by the corresponding correlation analysis given in Table A.2. From this we conclude that if there is such a strong linear relationship between area and dens, then likely one of the two variables is superfluous in the following regression analysis, in fact, this is what we find in some of our regression models studied. This finishes our descriptive analysis.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Appendix A. Motor Insurance Data
231
	age	ac	power	area	dens
age		-0.07	-0.04	-0.03	-0.01
ac	-0.09		0.00	-0.02	-0.04
power	0.05	0.01		-0.03	0.02
area	-0.03	-0.02	-0.03		0.59
dens	-0.03	-0.02	-0.03	0.97	
Table A.2: Correlations in continuous feature components: top-right shows Pearson's correlation; bottom-left shows Spearman's p.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
232
Appendix A. Motor Insurance Data
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Bibliography
[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19/6, 716-723.
[2] Amari, S. (2016). Information Geometry and its Applications. Springer.
[3] Arnold, V.l. (1957). On functions of three variables. Doklady Akademii Nauk SSSR 114/4, 679-681.
[4] ASTIN Big Data/Data Analytics Working Party (2015). Phase 1 Paper - April 2015. http: //www.actuaries.org
[5] Ayuso, M., Guillen, M., Perez-Marin, A.M. (2016). Telematics and gender discrimination: some usage-based evidence on whether men's risk of accidents differs from women's. Risks 4/2, article 10.
[6] Ayuso, M., Guillen, M., Perez-Marin, A.M. (2016). Using GPS data to analyse the distance traveled to the first accident at fault in pay-as-you-drive insurance. Transportation Research Part C: Emerging Technologies 68, 160-167.
[7] Bailey, R.A. (1963). Insurance rates with minimum bias. Proceedings CAS 50, 4-11.
[8] Barndorff-Nielsen, O. (2014). Information and Exponential Families: In Statistical Theory. John Wiley & Sons.
[9] Barndorif-Nielsen, O.E., Jensen, J.L., Kendall, W.S. (1993). Networks and Chaos - Statistical and Probabilistic Aspects. Chapman & Hall.
[10] Barron, A.R. (1993). Universal approximation bounds for superpositions of sigmoidal functions. IEEE Transactions of Information Theory 39/3, 930-945.
[11] Billingsley, P. (1995). Probability and Measure. 3rd edition. Wiley.
[12] Boucher, J.-P., Cote, S., Guillen, M. (2017). Exposure as duration and distance in telematics motor insurance using generalized additive models. Risks 5/4, article 54.
[13] Breiman, L. (1996). Bagging predictors. Machine Learning 24/2, 123-140.
[14] Breiman, L. (2001). Random forests. Machine Learning 45/1, 5-32.
[15] Breiman, L. (2001). Statistical modeling: the two cultures. Statistical Science 16/3, 199-215.
[16] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series.
[17] Breiman, L., Stone, C.J. (1978). Parsimonious binary classification trees. Technical report. Santa Monica, California: Technology Service Corporation.
[18] Bühlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applications. Springer.
233
Electronic copy available at: https://ssrn.com/abstract=2870308
234
Bibliography
[19] Bühlmann, H., Straub, E. (1970). Glaubwürdigkeit für Schadensätze. Bulletin of the Swiss Association of Actuaries 1970, 111-131.
[20] Bühlmann, P., Mächler, M. (2014). Computational Statistics. Lecture Notes. Department of Mathematics, ETH Zurich.
[21] Burges, C.J.C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121-167.
[22] Cantelli, F.P. (1933). Sulla determinazione empirica delle leggi di probabilitä. Giornale DeU'Istituto Italiano Degli Attuari 4, 421-424.
[23] CASdatasets Package Vignette (2016). Reference Manual, May 28, 2016. Version 1.0-6. Available from http: //cas . uqam. ca.
[24] Casella, G., Berger, R.L. (2002). Statistical Inference. 2nd edition. Duxbury.
[25] Charpentier, A. (2015). Computational Actuarial Science with R. CRC Press.
[26] Chen, T., Guestrin, C. (2016). XGBoost: a scalable tree boosting system. arVw:1603.02754v3.
[27] China InsurTech Lab (2017). China InsurTech Development White Paper. Fudan University.
[28] Congdon, P. (2006). Bayesian Statistical Modelling. 2nd edition. Wiley.
[29] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, 303-314.
[30] Denuit, M., Guillen, M., Trufin, J. (2018). Multivariate credibility modeling for usage-based motor insurance pricing with behavioural data. Detralytics, 2018-2.
[31] Denuit, M., Marechal, X., Pitrebois, S., Walhin, J.-F. (2007). Actuarial Modelling of Claims Count. Wiley.
[32] Deprez, P., Shevchenko, P.V., Wüthrich, M.V. (2017). Machine learning techniques for mortality modeling. European Actuarial Journal 7/2, 337-352.
[33] Döhler, S., Rüschendorf, L. (2001). An approximation result for nets in functional estimation. Statistics and Probability Letters 52, 373-380.
[34] Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B 195/2, 216-222.
[35] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7/1, 1-26.
[36] Efron, B. (2020). Prediction, estimation, and attribution. Journal of the American Statistical Association 115/530, 636-655.
[37] Efron, B., Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press.
[38] Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall.
[39] Embrechts, P., Klüppelberg, C., Mikosch, T. (2003). Modelling Extremal Events for Insurance and Finance. 4th printing. Springer.
[40] Esteves-Booth, A., Muneer, T., Kirby, H., Kubie, J., Hunter, J. (2001). The measurement of vehicular driving cycle within the city of Edinburgh. Transportation Research Part D: Transport and Environment 6/3, 209-220.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Bibliography
235
[41] Fahrmeir, L., Tutz, G. (1994). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer.
[42] Ferrario, A., Hámmerli, R. (2019). On boosting: theory and applications. SSRN Manuscript ID 3402687.
[43] Ferrario, A., Noll, A., Wiithrich, M.V. (2018). Insights from inside neural networks. SSRN Manuscript ID 3226852.
[44] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121/2, 256-285.
[45] Freund, Y., Schapire, R.E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences 55/1, 119-139.
[46] Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29/5, 1189-1232.
[47] Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38/4, 367-378.
[48] Gabrielli, A., Richman, R., Wiithrich, M.V. (2020). Neural network embedding of the over-dispersed Poisson reserving model. Scandinavian Actuarial Journal 2020/1, 1-29.
[49] Gao, G., Meng, S., Wiithrich, M.V. (2019). Claims frequency modeling using telematics car driving data. Scandinavian Actuarial Journal 2019/2, 143-162.
[50] Gao, G., Wiithrich, M.V. (2018). Feature extraction from telematics car driving heatmap. European Actuarial Journal 8/2, 383-406.
[51] Gao, G., Wiithrich, M.V. (2019). Convolutional neural network classification of telematics car driving data. Risks 7/1, 6.
[52] Gao, G., Wiithrich, M.V., Yang, H. (2019). Evaluation of driving risk at different speeds. Insurance: Mathematics & Economics 88, 108-119.
[53] Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.
[54] Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilita. Giornale Dell'Istituto Italiano Degli Attuari 4, 92-99.
[55] Gneiting, T. (2011). Making and evaluation point forecasts. Journal of the American Statistical Association 106/494, 746-762.
[56] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press, http://www. deeplearningbook.org
[57] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82/4, 711-732.
[58] Green, P.J. (2003). Trans-dimensional Markov chain Monte Carlo. In: Highly Structured Stochastic Systems, P.J. Green, NX. Hjort, S. Richardson (eds.), Oxford Statistical Science Series, 179-206. Oxford University Press.
[59] Grohs, P., Perekrestenko, D., Elbráchter, D., Bolcskei, H. (2019). Deep neural network approximation theory. Submitted to IEEE Transactions on Information Theory (invited paper).
[60] Hastie, T., Tibshirani, R. (1986). Generalized additive models (with discussion). Statistical Science 1, 297-318.
Version October 27, 2021, M.V. Wůthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
236
Bibliography
[61] Hastie, T., Tibshirani, R. (1990). Generalized Linear Models. Chapman & Hall.
[62] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning. Data Mining, Inference, and Prediction. 2nd edition. Springer Series in Statistics.
[63] Hastie, T., Tibshirani, R., Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
[64] Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57/1, 97-109.
[65] Hinton, G.E., Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science 313, 504-507.
[66] Ho, S.-H., Wong, Y.-D., Chang, V.W.-C. (2014). Developing Singapore driving cycle for passenger cars to estimate fuel consumption and vehicular emissions. Atmospheric Environment 97, 353-362.
[67] Hoffman, M.D., Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15, 1351-1381.
[68] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
[69] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
[70] Hung, W.T., Tong, H.Y., Lee, CP., Ha, K., Pao, L.Y. (2007). Development of practical driving cycle construction methodology: a case study in Hong Kong. Transportation Research Part D: Transport and Environment 12/2, 115-128.
[71] Künsch H.R. (1993). Mathematische Statistik. Lecture Notes. Department of Mathematics, ETH Zurich.
[72] Ingenbleek, J.-F., Lemaire, J. (1988). What is a sports car? AST IN Bulletin 18/2, 175-187.
[73] Isenbeck, M., Rüschendorf, L. (1992). Completeness in location families. Probability and Mathematical Statistics 13, 321-343.
[74] James, C, Witten, D., Hastie, T., Tibshirani, R. (2015). An Introduction to Statistical Learning. With Applications in R. Corrected 6th printing. Springer Texts in Statistics.
[75] Johansen, A.M., Evers, L., Whiteley, N. (2010). Monte Carlo Methods. Lecture Notes, Department of Mathematics, University of Bristol.
[76] Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin 5, 41-48.
[77] Kamble, S.H., Mathew, T.V., Sharma, G.K. (2009). Development of real-world driving cycle: case study of Pune, India. Transportation Research Part D: Transport and Environment 14/2, 132-140.
[78] Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. M.Sc. Dissertation. Department of Mathematics, University of Chicago.
[79] Kearns, M., Valiant, L.G. (1988). Learning Boolean formulae or finite automata is hard as factoring. Technical Report TR-14-88. Harvard University Aiken Computation Laboratory.
[80] Kearns, M., Valiant, L.G. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery ACM 41/1, 67-95.
[81] Kingma, D., Ba, J. (2014). Adam: A method for stochastic optimization. arXj'w.4412.6980. Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Bibliography
237
[82] Kolmogorov, A. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR 114/5, 953-956.
[83] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37/2, 233-243.
[84] Kuhn, H.W., Tucker, A.W. (1951). Nonlinear programming. Proceedings of 2nd Berkeley Symposium. Berkeley: University of California Press, 481-492.
[85] Latuszyhski, K., Roberts, G.O., Rosenthal, J.S. (2013). Adaptive Gibbs samplers and related MCMC methods. Annals of Applied Probability 23/1, 66-98.
[86] Lee, S.C.K., Lin, S. (2018). Delta boosting machine with application to general insurance. North American Actuarial Journal 22/3, 405-425.
[87] Lehmann, E.L. (1983). Theory of Point Estimation. Wiley.
[88] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6/6, 861-867.
[89] Lorentzen, C, Mayer, M. (2020). Peeking into the black box: an actuarial case study for interpretable machine learning. SSRN Manuscript ID 3595944.
[90] Makavoz, Y. (1996). Random approximants and neural networks. Journal of Approximation Theory 85, 98-109.
[91] Marra, G., Wood, S.N. (2011). Practical variable selection for generalized additive models. Computational Statistics and Data Analysis 55/7, 2372-2387.
[92] McClure, P., Kriegeskorte, N. (2017). Representing inferential uncertainty in deep neural networks through sampling. ICLR conference paper.
[93] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.
[94] Meier, D., Wiithrich, M.V. (2020). Convolutional neural network case studies: (1) anomalies in mortality rates (2) image recognition. SSRN Manuscript ID 3656210.
[95] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21/6, 1087-1092.
[96] Montiifar, G., Pascanu, R., Cho, K., Bengio, Y. (2014). On the number of linear regions of deep neural networks. Neural Information Processing Systems Proceedings13 27, 2924-2932.
[97] Neal, R.M. (1996). Bayesian Learning for Neural Networks. Springer.
[98] Nelder, J.A., Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General) 135/3, 370-384.
[99] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Lou vain.
[100] Nielsen, M. (2017). Neural Networks and Deep Learning. Online book available on http: //neuralnetworksanddeeplearning.com
[101] Noll, A., Salzmann, R., Wiithrich, M.V. (2018). Case study: French motor third-party liability claims. SSRN Manuscript ID 3164764.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
238
Bibliography
[102] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized Linear Models. Springer.
[103] Park, J., Sandberg, I. (1991). Universal approximation using radial-basis function networks. Neural Computation 3, 246-257.
[104] Park, J., Sandberg, I. (1993). Approximation and radial-basis function networks. Neural Computation 5, 305-316.
[105] Petrushev, P. (1999). Approximation by ridge functions and neural networks. SIAM Journal on Mathematical Analysis 30/1, 155-189.
[106] Pruscha, H. (2006). Statistisches Methodenbuch: Verfahren, Fallstudien, Programmcodes. Springer.
[107] Rentzmann, S., Wiithrich, M.V. (2019). Unsupervised learning: What is a sports car? SSRN Manuscript ID 3439358. Version October 14, 2019.
[108] Richman, R. (2020). AI in actuarial science - a review of recent advances - part 1. Annals of Actuarial Science, to appear.
[109] Richman, R. (2020). AI in actuarial science - a review of recent advances - part 2. Annals of Actuarial Science, to appear.
[110] Richman, R., Wiithrich, M.V. (2019). Lee and Carter go machine learning: recurrent neural networks. SSRN Manuscript ID 3441030.
[Ill] Richman, R., Wiithrich, M.V. (2020). Nagging predictors. Risks 8/3, 83.
[112] Ridgeway, G. (2007). Generalized boosted models: a guide to the gbm package. Version of August 3, 2007.
[113] Robert, CP. (2001). The Bayesian Choice. 2nd edition. Springer.
[114] Roberts, G.O., Gelman, A., Gilks, W.R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability 7, 110-120.
[115] Roberts, G.O., Rosenthal, J.S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society, Series B 60/1, 255-268.
[116] Riiger, S.M., Ossen, A. (1997). The metric structure of weight space. Neural Processing Letters 5, 63-72.
[117] Rumelhart, D.E., Hinton, G.E., Williams, R.J. (1986). Learning representations by back-propagating errors. Nature 323/6088, 533-536.
[118] Schapire, R.E. (1990). The strength of weak learnability. Machine Learning 5/2, 197-227.
[119] Schelldorfer, J., Wiithrich, M.V. (2019). Nesting classical actuarial models into neural networks. SSRN Manuscript ID 3320525.
[120] Schelldorfer, J., Wiithrich, M.V. (2021). LocalGLMnet: a deep learning architecture for actuaries. SSRN Manuscript ID 3900350. Version August 6, 2021.
[121] Schwarz, G.E. (1978). Estimating the dimension of a model. Annals of Statistics 6/2, 461-464.
[122] Shaham, U, Cloninger, A., Coifman, R.R. (2015). Provable approximation properties for deep neural networks. arA"jw.4509.07385v3.
[123] Shmueli, G. (2010). To explain or to predict? Statistical Science 25/3, 289-310. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
Bibliography
239
[124] Srivastava, N., Hinton, G., Krizhevsky, A. Sutskever, I., Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning 15, 1929-1958.
[125] Therneau, T.M., Atkinson, E.J. (2015). An introduction to recursive partitioning using the RPART routines. R Vignettes, version of June 29, 2015. Mayo Foundation, Rochester.
[126] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society. Series B (Methodological) 58/1, 267-288.
[127] Tikhonov, A.N. (1943). On the stability of inverse problems. Doklady Akademii Nauk SSSR 39/5, 195-198.
[128] Valiant, L.G. (1984). A theory of learnable. Communications of the Association for Computing Machinery ACM 27/11, 1134-1142.
[129] Verbelen, R., Antonio, K., Claeskens, G. (2018). Unraveling the predictive power of telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C (Applied Statistics) 67/5, 1275-1304.
[130] Wager, S., Wang, S., Liang, P.S. (2013). Dropout training as adaptive regularization. In: Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (eds.), Curran Associates, Inc, 351-359.
[131] Wang, Q., Huo, H., He, K., Yao, Z., Zhang, Q. (2008). Characterization of vehicle driving patterns and development of driving cycles in Chinese cities. Transportation Research Part D: Transport and Environment 13/5, 289-297.
[132] Weidner, W., Transchel, F.W.G., Weidner, R. (2016). Classification of scale-sensitive telem-atic observables for riskindividual pricing. European Actuarial Journal 6/1, 3-24.
[133] Weidner, W., Transchel, F.W.G., Weidner, R. (2017). Telematic driving profile classification in car insurance pricing. Annals of Actuarial Science 11/2, 213-236.
[134] Wood, S.N. (2017). Generalized Additive Models: an Introduction with R. 2nd edition. CRC Press.
[135] Wiithrich, M.V. (2013). Non-Life Insurance: Mathematics & Statistics. SSRN Manuscript ID 2319328. Version March 20, 2019.
[136] Wiithrich, M.V. (2017). Covariate selection from telematics car driving data. European Actuarial Journal 7/1, 89-108.
[137] Wiithrich, M.V. (2017). Price stability in regression tree calibrations. Proceedings of CI-CIRCM 2017, Tsinghua University Press.
[138] Wiithrich, M.V. (2018). Machine learning in individual claims reserving. Scandinavian Actuarial Journal 2018/6, 465-480.
[139] Wiithrich, M.V. (2020). Bias regularization in neural network models for general insurance pricing. European Actuarial Journal 10/1, 179-202.
[140] Wiithrich, M.V., Merz, M. (2019). Editorial: Yes, we CANN! ASTIN Bulletin 49/1, 1-3.
[141] Wiithrich, M.V., Merz, M. (2021). Statistical Foundations of Actuarial Learning and its Applications. SSRN Manuscript ID 3822407.
[142] Xiang, Q. (2018). Bayesian Gaussian Random Fields Applied to Car Insurance Claim Size Modeling. Semester Thesis. Department of Mathematics, ETH Zurich.
Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308
240
Bibliography
[143] Yukich, J., Stinchcombe, M., White, H. (1995). Sup-norm approximation bounds for networks through probabilistic methods. IEEE Transactions on Information Theory 41/4, 1021-1027.
[144] Zaslavsky, T. (1975). Facing up to arrangements: face-count formulas for partitions of space by hyperplanes. Memoirs of the American Mathematical Society 154.
[145] Zöchbauer, P. (2016). Data Science in Non-Life Pricing: Predicting Claims Frequencies using Tree-Based Models. M.Sc. Thesis. Department of Mathematics, ETH Zurich.
[146] Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67/2, 301-320.
Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich
Electronic copy available at: https://ssrn.com/abstract=2870308