Data Analytics for Non-Life Insurance Pricing

Lecture Notes

Mario V. Wüthrich
RiskLab Switzerland, Department of Mathematics, ETH Zurich

Christoph Buser
AXA Versicherungen

Version October 27, 2021

Preface and Terms of Use

Lecture notes. These notes are the basis of our lecture on Data Analytics for Non-Life Insurance Pricing at ETH Zurich. They aim at giving insight and education to practitioners, academics and students in actuarial science on how to apply machine learning methods for statistical learning. There is some overlap in these notes with our book Statistical Foundations of Actuarial Learning and its Applications, which treats the topic of deep learning in much more depth, see [141].

Prerequisites. The prerequisites for these lecture notes are a solid education in mathematics, in particular, in probability theory and statistics. Moreover, knowledge of statistical software such as R is required.

Terms of Use. These lecture notes are an ongoing project which is continuously revised and updated. Of course, there may be errors in the notes and there is always room for improvement. Therefore, we appreciate any comment and/or correction that readers may have. However, we would like you to respect the following rules:

• These notes are provided solely for educational, personal and non-commercial use. Any commercial use or reproduction is forbidden.
• All rights remain with the authors. They may update the manuscript or withdraw the manuscript at any time. There is no right of availability of any (old) version of these notes. The authors may also change these terms of use at any time.
• The authors disclaim all warranties, including but not limited to the use or the contents of these notes. On using these notes, you fully agree to this.
• Citation: please use the SSRN URL https://ssrn.com/abstract=2870308
• All included figures were produced by the authors with the open source software R.

Versions of the 1st edition of these notes (before 2019): November 15, 2016; January 27, 2017; March 28, 2017; October 25, 2017; June 11, 2018
Versions of the 2nd edition of these notes (after 2019): February 5, 2019; June 4, 2019; September 10, 2020

Acknowledgment

Firstly, we thank RiskLab and the Department of Mathematics of ETH Zurich, who have always strongly supported this project. A special thank you goes to Peter Reinhard (AXA Insurance, Winterthur) who has initiated this project by his very supportive and forward-looking anticipation. Secondly, we kindly thank Patrick Zöchbauer, who has written his MSc ETH Mathematics thesis under our supervision at AXA-Winterthur on "Data Science in Non-Life Insurance Pricing: Predicting Claims Frequencies using Tree-Based Models". This MSc thesis has been a great stimulus for these lecture notes and it has also supported us to get more easily into this topic. The thesis was awarded the Walter Saxer Insurance Prize 2016. Next we thank the Institute of Statistical Mathematics (ISM), Tokyo, and, in particular, Prof. Tomoko Matsui (ISM) and Prof. Gareth Peters (University College London, UCL, and Heriot-Watt University, Edinburgh) for their support.
Part of these notes were written while MVW was visiting ISM. We kindly thank Philippe Deprez, John Ery and Andrea Gabrielli who have been carefully reading preliminary versions of these notes. This has helped us to substantially improve the outline. Finally, we thank many colleagues and students for very fruitful discussions, providing data and calculating examples. We mention in particular: Michel Baes, Peter Blum, Hans Bühlmann, Peter Bühlmann, Patrick Cheridito, Philippe Deprez, Paul Embrechts, Andrea Ferrario, Andrea Gabrielli, Guojun Gan, Guangyuan Gao, Donatien Hainaut, Christian Lorentzen, Nicolai Meinshausen, Michael Merz, Alexander Noll, Gareth Peters, Ronald Richman, Ulrich Riegel, Robert Salzmann, Jürg Schelldorfer, Pavel Shevchenko, Olivier Steiger, Qikun Xiang, Xian Xu.

Zurich, October 27, 2021
Mario V. Wüthrich & Christoph Buser

Contents

1 Introduction to Non-Life Insurance Pricing
  1.1 Introduction
  1.2 The compound Poisson model
    1.2.1 Model assumptions and first properties
    1.2.2 Maximum likelihood estimation: homogeneous case
    1.2.3 Poisson deviance loss
  1.3 Prediction uncertainty
    1.3.1 Generalization loss
    1.3.2 Cross-validation on test samples
    1.3.3 Leave-one-out cross-validation
    1.3.4 K-fold cross-validation
    1.3.5 Stratified K-fold cross-validation
  1.4 Example: homogeneous Poisson model

2 Generalized Linear Models
  2.1 Heterogeneous Poisson claims frequency model
  2.2 Multiplicative regression model
  2.3 Deviance residuals and parameter reduction
  2.4 Example in car insurance pricing
    2.4.1 Pre-processing features: categorical feature components
    2.4.2 Pre-processing features: continuous feature components
    2.4.3 Data compression
    2.4.4 Issue about low frequencies
    2.4.5 Models GLM3+ considering all feature components
    2.4.6 Generalized linear models: summary
  2.5 Classification problem
    2.5.1 Classification of random binary outcomes
    2.5.2 Logistic regression classification
  2.6 Maximum likelihood estimation

3 Generalized Additive Models
  3.1 Generalized additive models for Poisson regressions
    3.1.1 Natural cubic splines
    3.1.2 Example in motor insurance pricing, revisited
    3.1.3 Generalized additive models: summary
  3.2 Multivariate adaptive regression splines

4 Credibility Theory
  4.1 The Poisson-gamma model for claims counts
    4.1.1 Credibility formula
    4.1.2 Maximum a posteriori estimator
    4.1.3 Example in motor insurance pricing
  4.2 The binomial-beta model for classification
    4.2.1 Credibility formula
    4.2.2 Maximum a posteriori estimator
  4.3 Regularization and Bayesian MAP estimators
    4.3.1 Bayesian posterior parameter estimator
    4.3.2 Ridge and LASSO regularization
  4.4 Markov chain Monte Carlo method
    4.4.1 Metropolis-Hastings algorithm
    4.4.2 Gibbs sampling
    4.4.3 Hybrid Monte Carlo algorithm
    4.4.4 Metropolis-adjusted Langevin algorithm
    4.4.5 Example in Markov chain Monte Carlo simulation
    4.4.6 Markov chain Monte Carlo methods: summary
  4.5 Proofs of Section 4.4

5 Neural Networks
  5.1 Feed-forward neural networks
    5.1.1 Generic feed-forward neural network construction
    5.1.2 Shallow feed-forward neural networks
    5.1.3 Deep feed-forward neural networks
    5.1.4 Combined actuarial neural network approach
    5.1.5 The balance property in neural networks
    5.1.6 Network ensemble
  5.2 Gaussian random fields
    5.2.1 Gaussian Bayesian neural network
    5.2.2 Infinite Gaussian Bayesian neural network
    5.2.3 Bayesian inference for Gaussian random field priors
    5.2.4 Predictive distribution for Gaussian random field priors
    5.2.5 Step function activation

6 Classification and Regression Trees
  6.1 Binary Poisson regression trees
    6.1.1 Binary trees and binary indexes
    6.1.2 Pre-processing features: standardized binary splits
    6.1.3 Goodness of split
    6.1.4 Standardized binary split tree growing algorithm
    6.1.5 Example in motor insurance pricing, revisited
    6.1.6 Choice of categorical classes
  6.2 Tree pruning
    6.2.1 Binary trees and pruning
    6.2.2 Minimal cost-complexity pruning
    6.2.3 Choice of the best pruned tree
    6.2.4 Example in motor insurance pricing, revisited
  6.3 Binary tree classification
    6.3.1 Empirical probabilities
    6.3.2 Standardized binary split tree growing algorithm for classification
  6.4 Proofs of pruning results

7 Ensemble Learning Methods
  7.1 Bootstrap simulation
    7.1.1 Non-parametric bootstrap
    7.1.2 Parametric bootstrap
  7.2 Bagging
    7.2.1 Aggregating
    7.2.2 Bagging for Poisson regression trees
  7.3 Random forests
  7.4 Boosting machines
    7.4.1 Generic gradient boosting machine
    7.4.2 Poisson regression tree boosting machine
    7.4.3 Example in motor insurance pricing, revisited
    7.4.4 AdaBoost algorithm

8 Telematics Car Driving Data
  8.1 Description of telematics car driving data
    8.1.1 Simple empirical statistics
    8.1.2 The velocity-acceleration heatmap
  8.2 Cluster analysis
    8.2.1 Dissimilarity function
    8.2.2 Classifier and clustering
    8.2.3 K-means clustering algorithm
    8.2.4 Example
  8.3 Principal component analysis

A Motor Insurance Data
  A.1 Synthetic data generation
  A.2 Descriptive analysis

Chapter 1
Introduction to Non-Life Insurance Pricing

1.1 Introduction

The non-life insurance section ASTIN (Actuarial STudies In Non-life insurance) of the International Actuarial Association (IAA) launched a working party on big data and data analytics in 2014. The resulting document [4] states in its introduction:

"The internet has changed and continues to change the way we engage with the tangible world. The resulting change in communication platforms, commercial trade and social interactions make the world smaller and the data bigger. Whilst the fundamentals of analyzing data have not changed, our approach to collating and understanding data, creating accessible and useful information, developing skill sets and ultimately transforming huge and ever-growing repositories of data into actionable insights for our employers, shareholders and our communities more generally has entered a new paradigm. As a corollary, this has made the fundamentals of data processing, modeling techniques and aligning business structures accordingly more important - no longer can existing approaches suffice - we now face the pressures of a faster moving world, blurring business lines which make customer-centricity surpass a siloed product/service focus and challenges to the actuary being central to predictive modeling. We too need to evolve, working intelligently with a wide cross-function of skill sets such as data experts, data scientists, economists, statisticians, mathematicians, computer scientists and, so as to improve the transformation of data to information to customer insights, behavioral experts. This is a necessary evolution for actuaries and the profession to remain relevant in a high-tech business world."

The goal of these lecture notes is to face this paradigm. We aim at giving a broad toolbox to the actuarial profession so that they can cope with these challenges, and so that they remain highly competitive in their field of competence. In these lecture notes we start from the classical actuarial world of generalized linear models, generalized additive models and credibility theory. These models form the basis of the deeper statistical understanding. We then present several machine learning techniques such as neural networks, regression trees, bagging techniques, random forests and boosting machines.
One can view these machine learning techniques as non-parametric and semi-parametric statistical approaches, with the recurrent goal of optimizing a given objective function to receive optimal predictive power. In a common understanding we would like to see these machine learning methods as an extension of classical actuarial models, and we are going to illustrate how classical actuarial methods can be embedded into these non-parametric machine learning approaches, benefiting from both the actuarial world and the machine learning world. These lecture notes have also served as a preliminary version of our book [141] which treats the topic of statistical modeling and deep learning in much more depth, and the reader will notice that there is some (unavoidable) overlap between these lecture notes and our book [141].

A second family of methods that we are going to meet in these lecture notes are so-called unsupervised learning methods (clustering methods). These methods aim at finding common structure in data to cluster these data. We provide an example of unsupervised learning by analyzing telematics car driving data which poses the challenge of selecting feature information from high frequency data. For a broader consideration of unsupervised learning methods we refer to our tutorial [107].

We close this short introductory section by briefly reviewing major developments identified in the China InsurTech Development White Paper [27]. The notion of Insurance Technology (InsurTech) is closely related to Financial Technology (FinTech), and it is a commonly used term for technology and innovation in the insurance industry. Among other things it comprises the following key points:

• artificial intelligence, machine learning and statistical learning, which may learn and accumulate useful knowledge through data;
• big data analytics, which deals with the fact that data may be massive;
• cloud computing, which may be the art of performing real-time operations;
• blockchain technology, which may be useful for a more efficient and anonymous exchange of data;
• internet of things, which involves the integration and interaction of physical devices (e.g. wearables, sensors) through computer systems to reduce and manage risk.

The actuarial profession has started several initiatives in data analytics and machine learning to cope with these challenges. We mention the working party "Data Science" of the Swiss Association of Actuaries (SAV) that aims at building a toolbox for the actuarial profession: https://www.actuarialdatascience.org/

The SAV working party has prepared several tutorials on actuarial developments in machine learning. These tutorials are supported by computer code that can be downloaded from GitHub: https://github.com/JSchelldorfer/ActuarialDataScience

In the present lecture notes we work with synthetic data for which we exactly know the true data generating mechanism. This has the advantage that we can explicitly back-test the quality of all models presented. In typical real world applications this is not the case and one needs to fully rely on estimated models. This is demonstrated in the tutorials of the SAV working party which describe several statistical modeling techniques on real insurance data, see [43, 89, 101, 119, 120].
We highly recommend readers to perform similar analyses on their own real data, because one crucial pillar of statistical modeling is to derive the right intuition and understanding for the data and the models used.

1.2 The compound Poisson model

In this section we introduce the basic statistical approach for modeling non-life insurance claims. This approach splits the total claim amount into a compound sum which accounts for the number of claims and determines the individual claim sizes.

1.2.1 Model assumptions and first properties

The classical actuarial approach for non-life insurance portfolio modeling uses a compound random variable with N describing the number of claims that occur within a fixed time period and Z_1, ..., Z_N describing the individual claim sizes. The total claim amount in that fixed time period is then given by
\[
S = Z_1 + \ldots + Z_N = \sum_{k=1}^{N} Z_k.
\]
The main task in non-life insurance modeling is to understand the structure of such total claim amount models. The standard approach uses a compound distribution for S. Throughout, we assume to work on a sufficiently rich probability space (Ω, F, P).

Model Assumptions 1.1 (compound distribution). The total claim amount S is given by the following compound random variable on (Ω, F, P)
\[
S = Z_1 + \ldots + Z_N = \sum_{k=1}^{N} Z_k,
\]
with the three standard assumptions:
(1) N is a discrete random variable which takes values in N_0;
(2) Z_1, Z_2, ... are independent and identically distributed (i.i.d.);
(3) N and (Z_1, Z_2, ...) are independent.

Remarks.
• If S satisfies the three standard assumptions (1)-(3) of Model Assumptions 1.1 we say that S has a compound distribution. N is called claims count and Z_k, k ≥ 1, are the individual claim sizes or claim severities.
• The compound distribution has been studied extensively in the actuarial literature. The aim here is to give more structure to the problem so that it can be used for answering complex pricing questions on heterogeneous insurance portfolios.

In the sequel we assume that the claims count random variable N can be modeled by a Poisson distribution. We therefore assume that there exist an expected (claims) frequency λ > 0 and a volume v > 0. We say that N has a Poisson distribution, write N ~ Poi(λv), if
\[
P[N = k] = e^{-\lambda v} \frac{(\lambda v)^k}{k!} \qquad \text{for all } k \in \mathbb{N}_0.
\]
The volume v > 0 often measures the time exposure in yearly units. Therefore, throughout these notes, the volume v is called years at risk. If a risk is insured for part of the year, say, 3 months, we set v = 1/4.

For a random variable Z we denote its coefficient of variation by Vco(Z) = Var(Z)^{1/2}/E[Z] (subject to existence). We have the following lemma for the Poisson distribution.

Lemma 1.2. Assume N ~ Poi(λv) for fixed λ, v > 0. Then
\[
E[N] = \lambda v = \mathrm{Var}(N) \qquad \text{and} \qquad \mathrm{Vco}(N) = \frac{\mathrm{Var}(N)^{1/2}}{E[N]} = \sqrt{1/(\lambda v)} \to 0 \quad \text{as } v \to \infty.
\]

Proof. See Proposition 2.8 in Wüthrich [135]. □

Lemma 1.3. Assume that N_i, i = 1, ..., n, are independent and Poisson distributed with means λ_i v_i > 0. We have
\[
N = \sum_{i=1}^{n} N_i \sim \mathrm{Poi}\left( \sum_{i=1}^{n} \lambda_i v_i \right).
\]

Proof. This easily follows by considering the corresponding moment generating functions and using the independence assumption, for details see Wüthrich [135], Chapter 2. □

Definition 1.4 (compound Poisson model).
The total claim amount S has a compound Poisson distribution, write S ~ CompPoi(λv, G), if S has a compound distribution with N ~ Poi(λv) for given λ, v > 0 and individual claim size distribution Z_1 ~ G.

Proposition 1.5. Assume S ~ CompPoi(λv, G). We have, whenever they exist,
\[
E[S] = \lambda v\, E[Z_1], \qquad \mathrm{Var}(S) = \lambda v\, E[Z_1^2], \qquad \mathrm{Vco}(S) = \sqrt{\frac{\mathrm{Vco}(Z_1)^2 + 1}{\lambda v}}.
\]

Proof. See Proposition 2.11 in Wüthrich [135]. □

Remarks.
• If S has a compound Poisson distribution with fixed expected frequency λ > 0 and fixed claim size distribution G having finite second moment, then the coefficient of variation converges to zero at speed v^{-1/2} as the years at risk v increase to infinity. In industry, this property is often called diversification property.
• The compound Poisson distribution has the so-called aggregation property and the disjoint decomposition property. These are two extremely beautiful and useful properties which explain part of the popularity of the compound Poisson model, we refer to Theorems 2.12 and 2.14 in Wüthrich [135] for more details. The aggregation property for the Poisson distribution has already been stated in Lemma 1.3. It tells us that we can aggregate independent Poisson distributed random variables and we stay within the family of Poisson distributions.
• The years at risk v > 0 may have different interpretations in the compound Poisson context: either (i) we may consider a single risk which is insured over v accounting years, or (ii) we have a portfolio of independent compound Poisson risks and then v measures the volume of the aggregated portfolio (in years at risk). The latter uses the aggregation property which says that also the aggregated portfolio has a compound Poisson distribution (with volume weighted expected frequency), see Theorem 2.12 in Wüthrich [135].

One crucial property of compound distributions is that we have the following decomposition of their expected values
\[
E[S] = E[N]\, E[Z_1] = \lambda v\, E[Z_1],
\]
where for the second identity we have used the Poisson model assumption, see Proposition 1.5, and where we assume S ∈ L^1(Ω, F, P). This implies that for the modeling of the pure risk premium E[S] we can treat the (expected) number of claims E[N] and the (expected) individual claim sizes E[Z_1] separately in a compound model. We make the following restriction here:

In the present notes, for simplicity, we mainly focus on the modeling of the claims count N, and we only give broad indication about the modeling of Z_k. We do this to not overload these notes.
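To make Proposition 1.5 concrete, the following small R simulation (our own sketch, not part of the original text) generates compound Poisson total claim amounts and compares the empirical mean and variance with the formulas of Proposition 1.5; the gamma claim size distribution and all parameter values are chosen purely for illustration.

    # Monte Carlo illustration of Proposition 1.5 (claim size distribution G chosen arbitrarily)
    set.seed(100)
    lambda <- 0.1                  # expected claims frequency
    v      <- 1000                 # years at risk (aggregated portfolio volume)
    shape  <- 2; rate <- 1/1500    # G = Gamma(shape, rate), E[Z1] = shape/rate

    n.sim <- 10000
    S <- numeric(n.sim)
    for (j in 1:n.sim) {
      N    <- rpois(1, lambda * v)                         # N ~ Poi(lambda * v)
      S[j] <- sum(rgamma(N, shape = shape, rate = rate))   # compound sum (equals 0 for N = 0)
    }

    EZ1 <- shape / rate; EZ2 <- shape * (shape + 1) / rate^2
    c(empirical = mean(S), theoretical = lambda * v * EZ1)   # E[S]   = lambda * v * E[Z1]
    c(empirical = var(S),  theoretical = lambda * v * EZ2)   # Var(S) = lambda * v * E[Z1^2]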
In many situations it is more appropriate to consider volume scaled quantities. For the claims count N we may consider the claims frequency defined by
\[
Y = \frac{N}{v}. \tag{1.1}
\]
In the Poisson model, the claims frequency Y has the following two properties
\[
E[Y] = \lambda \qquad \text{and} \qquad \mathrm{Var}(Y) = \lambda/v.
\]
From this we see that confidence bounds for frequencies based on standard deviations Var(Y)^{1/2} = (λ/v)^{1/2} get more narrow with increasing years at risk v > 0. This is the reason why the equal balance (diversification) concept on a portfolio level works. This is going to be crucial in the subsequent chapters where we aim at detecting structural differences between different risks in terms of expected frequencies.

Anticipating this, one often rewrites (1.1) as follows
\[
Y = \lambda + \varepsilon,
\]
where ε is centered with variance λ/v. In this form, λ describes the structural behavior of Y and ε is understood as the noise term that describes the random deviation/fluctuation around the structural term when running the experiment. In all what follows, we try to separate the structural behavior from the random noise term, so that we are able to quantify the price level (pure risk premium) of individual insurance policies.

1.2.2 Maximum likelihood estimation: homogeneous case

Assume that N has a Poisson distribution with volume v > 0 and expected frequency λ > 0, that is, N ~ Poi(λv). For predicting this random variable N, we would typically like to use its expected value E[N] = λv as predictor because this minimizes the mean square error of prediction (MSEP); we highlight this in more detail in Section 1.3, below. However, this predictor is only useful if the parameters are known. Typically, we know the volume v > 0 but, in general, we do not assume knowledge of the true expected frequency λ. Hence, we need to estimate this latter parameter.

This estimation task is solved by assuming that one has a family of independent random variables N_1, ..., N_n with N_i ~ Poi(λv_i) for all i = 1, ..., n. Note that for the time being we assume a homogeneous portfolio in the sense that all random variables N_i are assumed to have the same expected frequency λ > 0; this is going to be relaxed in the subsequent chapters. Based on this model assumption one aims at estimating the common frequency parameter λ from given observations N = (N_1, ..., N_n)′. The joint log-likelihood function of these observations is given by
\[
\lambda \mapsto \ell_{N}(\lambda) = \sum_{i=1}^{n} \left[ -\lambda v_i + N_i \log(\lambda v_i) - \log(N_i!) \right].
\]
The parameter estimation problem is then commonly solved by calculating the maximum likelihood estimator (MLE), which is the parameter value λ̂ that maximizes the log-likelihood function, i.e., from the class of homogeneous Poisson models the one is selected that assigns the highest probability to the actual observation N. Thus, the MLE λ̂ for λ is found by the solution of
\[
\frac{\partial}{\partial \lambda} \ell_{N}(\lambda) = 0. \tag{1.2}
\]
This solution is given by
\[
\widehat{\lambda} = \frac{\sum_{i=1}^{n} N_i}{\sum_{i=1}^{n} v_i} \ge 0. \tag{1.3}
\]
Note that (∂²/∂λ²) ℓ_N(λ) < 0 for any λ > 0. (If λ̂ = Σ_{i=1}^n N_i / Σ_{i=1}^n v_i = 0, which happens with positive probability in the Poisson model, we get a degenerate model.) The MLE is unbiased for λ, i.e. E[λ̂] = λ, and its variance is given by
\[
\mathrm{Var}(\widehat{\lambda}) = \frac{\lambda}{\sum_{i=1}^{n} v_i},
\]
which converges to zero as the denominator goes to infinity. This allows us to quantify the parameter estimation uncertainty by the corresponding standard deviation. This is going to be important in the analysis of heterogeneous portfolios, below.
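As a small illustration (ours, not from the original notes), the homogeneous MLE (1.3) and the standard deviation implied by Var(λ̂) = λ/Σ_i v_i can be computed in R as follows; the claims counts N and years at risk v below are made-up toy values.

    # homogeneous Poisson MLE (1.3) and its estimated standard error
    N <- c(0, 1, 0, 0, 2, 0, 1, 0, 0, 0)          # illustrative claims counts N_i
    v <- c(1, 1, 0.5, 1, 1, 0.25, 1, 1, 0.5, 1)   # illustrative years at risk v_i
    lambda.hat <- sum(N) / sum(v)                 # MLE = sum(N_i) / sum(v_i)
    se.hat     <- sqrt(lambda.hat / sum(v))       # plug-in estimate of Var(lambda.hat)^(1/2)
    c(MLE = lambda.hat, std.error = se.hat)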
1.2.3 Poisson deviance loss

Instead of maximizing the log-likelihood function we could also try to minimize an appropriate objective function. The canonical objective function under our model assumptions is the Poisson deviance loss. We define the maximal log-likelihood (which is received from the saturated model)
\[
\ell_{N}(N) = \sum_{i=1}^{n} \left[ -N_i + N_i \log N_i - \log(N_i!) \right]. \tag{1.4}
\]
The saturated model is obtained by letting each observation N_i have its own parameter λ_i = E[N_i]/v_i. These individual parameters are estimated by their corresponding individual MLEs λ̂_i = N_i/v_i, that is, each policy i receives its own individual MLE parameter. We set the i-th term on the right-hand side of (1.4) equal to zero if N_i = 0 (we use this terminology throughout these notes).

The (scaled) Poisson deviance loss for expected frequency λ > 0 is defined by
\[
D^*(N, \lambda) = 2\left( \ell_{N}(N) - \ell_{N}(\lambda) \right)
= 2 \sum_{i=1}^{n} \left[ -N_i + N_i \log N_i + \lambda v_i - N_i \log(\lambda v_i) \right]
= \sum_{i=1}^{n} 2 N_i \left[ \frac{\lambda v_i}{N_i} - 1 - \log\left( \frac{\lambda v_i}{N_i} \right) \right] \ge 0, \tag{1.5}
\]
where the i-th term in (1.5) is set equal to 2λv_i for N_i = 0.

Remarks.
• Each term under the summation in (1.5) is non-negative because the saturated model maximizes each of these terms individually (by choosing the individual MLEs λ̂_i = N_i/v_i). Therefore, D*(N, λ) ≥ 0 for any λ > 0. Note that we even receive non-negativity in (1.5) if we choose arbitrary policy-dependent expected frequencies λ_i > 0 instead of the homogeneous expected frequency parameter λ. (Policy-dependent expected frequency parameters λ_i are going to be crucial in the subsequent chapters because the general goal of these notes is to price heterogeneous insurance portfolios.)
• Maximizing the log-likelihood function ℓ_N(λ) in λ is equivalent to minimizing the deviance loss function D*(N, λ) in λ.
• The deviance loss can be generalized to the exponential dispersion family which contains many interesting distributions such as the Gaussian, the Poisson and the gamma models. We refer to Section 4.1.2 in [141].

We close this section by analyzing the expected deviance loss
\[
E[D^*(N, \lambda)] = 2 E\left[ \lambda v - N - N \log(\lambda v / N) \right] = 2 E\left[ N \log(N/(\lambda v)) \right],
\]
of a Poisson distributed random variable N with expected value E[N] = λv > 0.

Figure 1.1: Expected deviance loss E[D*(N, λ)] = 2E[N log(N/(λv))] of a claims count N ~ Poi(λv) as a function of its expected value E[N] = λv on two different scales; these plots are obtained by Monte Carlo simulation, the blue dot shows mean E[N] = 5.16%, the vertical cyan ties show means E[N] = 5.16% ± 4.58%.

Figure 1.1 illustrates this expected deviance loss on two different scales for the expected number of claims E[N] = λv. The example which we are going to study in these notes has an average expected number of claims of λv = 5.16%. This gives an expected deviance loss of 30.9775 · 10^{-2} (blue dot in Figure 1.1). This is going to be important for the understanding of the remainder of these notes. Note that this value is bigger than 27.7278 · 10^{-2} given in (A.5) in the appendix. This difference is explained by the fact that the latter value has been obtained on a heterogeneous portfolio. In this heterogeneous portfolio the individual policies have a standard deviation in expected numbers of claims of 4.58%, see Figure A.1 in the appendix. The cyan vertical ties in Figure 1.1 (rhs) provide two policies with expected numbers of claims of 5.16% ± 4.58%. If we average their expected deviance losses we receive 26.3367 · 10^{-2} which corresponds to the cyan dot in Figure 1.1 (rhs).
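The scaled Poisson deviance loss (1.5) is straightforward to implement; the following R helper (our own sketch) uses the convention that the i-th term equals 2λv_i whenever N_i = 0, and it returns the average deviance loss over the portfolio.

    # average (scaled) Poisson deviance loss, see (1.5); lambda may be a scalar or a vector
    poisson.deviance <- function(N, lambda, v) {
      mu  <- lambda * v                       # expected numbers of claims
      dev <- 2 * (mu - N)                     # convention for N_i = 0
      idx <- N > 0
      dev[idx] <- 2 * N[idx] * (mu[idx] / N[idx] - 1 - log(mu[idx] / N[idx]))
      mean(dev)
    }
    # example: in-sample deviance loss of the homogeneous MLE on simulated data
    set.seed(200)
    v <- rep(1, 1000); N <- rpois(1000, 0.0516 * v)
    poisson.deviance(N, sum(N) / sum(v), v)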
1.3 Prediction uncertainty

In this section we quantify prediction uncertainty. This is mainly done in terms of the Poisson model and the Poisson deviance loss. For the general case within the exponential dispersion family and more insight we refer to Chapter 4 of [141]; in particular, we mention that deviance losses are strictly consistent scoring functions for mean estimation, which is important in forecast evaluation, see Gneiting [55].

1.3.1 Generalization loss

Assume we have n observations, called cases, given by
\[
\mathcal{D} = \left\{ (N_1, v_1), \ldots, (N_n, v_n) \right\}. \tag{1.6}
\]
D stands for data. Typically, we do not know the true data generating mechanism of D. Therefore, we make a model assumption: assume that all n cases (N_i, v_i) are independent and that the N_i are Poisson distributed with expected frequency λ > 0. We infer the expected frequency parameter λ from this data D. We denote the resulting estimator by λ̂; for the moment this can be any (sensible) D-based estimator for λ. We would like to analyze how well this estimator performs on cases which have not been seen during the estimation procedure of λ̂. In machine learning this is called generalization of the estimated model to unseen data, and the resulting error is called generalization error, out-of-sample error or prediction error.

To analyze this generalization error we choose a new random variable N ~ Poi(λv), which is independent from the data D and which has the same expected frequency λ. The frequency Y = N/v of this new random variable is predicted by
\[
\widehat{Y} = \widehat{E}[Y] = \widehat{\lambda}. \tag{1.7}
\]
Note that we deliberately write Ê because it is estimated from the data D.

Proposition 1.6 (MSEP generalization loss). Assume that all cases in D are independent and Poisson distributed having the same expected frequency λ > 0. Moreover, let case (N, v) be independent of D and Poisson distributed with the same expected frequency λ. We predict Y = N/v by Ŷ = λ̂, where the D-based estimator λ̂ is assumed to be square integrable. This prediction has MSEP
\[
E\left[ \left( Y - \widehat{Y} \right)^2 \right] = \left( E[Y] - E[\widehat{Y}] \right)^2 + \mathrm{Var}(\widehat{Y}) + \mathrm{Var}(Y).
\]

Proof of Proposition 1.6. We add and subtract E[Y] and use independence between D and (N, v),
\[
E\left[ \left( Y - \widehat{Y} \right)^2 \right]
= E\left[ \left( (Y - E[Y]) + (E[Y] - \widehat{Y}) \right)^2 \right]
= \mathrm{Var}(Y) + E\left[ \left( E[Y] - \widehat{Y} \right)^2 \right] + 2\, E\left[ Y - E[Y] \right] E\left[ E[Y] - \widehat{Y} \right]
= \mathrm{Var}(Y) + \left( E[Y] - E[\widehat{Y}] \right)^2 + \mathrm{Var}(\widehat{Y}),
\]
where the cross term vanishes because E[Y - E[Y]] = 0, and the last step decomposes E[(E[Y] - Ŷ)²] into squared bias and estimation variance. This proves the proposition. □

Remarks 1.7 (generalization loss, part I).
• The first term on the right-hand side of the statement of Proposition 1.6 denotes the squared bias, the second term the estimation variance and the last term the pure randomness (process variance) involved in the prediction problem. In general, we try to minimize simultaneously the bias and the estimation variance in order to get accurate predictions. Usually, these two terms compete in the sense that a decrease in one of them typically leads to an increase in the other one. This phenomenon is known as the bias-variance trade-off for which one needs to find a good balance (typically by controlling the complexity of the model). This is crucial for heterogeneous portfolios and it is going to be the main topic of these notes. We also refer to Section 7.3 in Hastie et al. [62] for the bias-variance trade-off.
• If we are in the atypical situation of having a homogeneous (in λ) Poisson portfolio and if we use the D-based MLE λ̂, the situation becomes much simpler. In Section 1.2.2 we have seen that the MLE λ̂ is unbiased and we have determined its estimation variance. Henceforth, we receive in this special (simple) case for the MSEP generalization loss
\[
E\left[ \left( Y - \widehat{Y} \right)^2 \right] = \left( E[Y] - E[\widehat{\lambda}] \right)^2 + \mathrm{Var}(\widehat{\lambda}) + \mathrm{Var}(Y)
= 0 + \frac{\lambda}{\sum_{i=1}^{n} v_i} + \frac{\lambda}{v}.
\]
We emphasize that this is an atypical situation because usually we do not assume to have a homogeneous portfolio and, in general, the MLE is not unbiased. (A Monte Carlo illustration of this decomposition is sketched below.)
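The homogeneous-case decomposition above is easy to verify by simulation; the sketch below (ours) repeatedly draws a portfolio, computes the MLE, predicts an independent case and compares the empirical MSEP with 0 + λ/Σ_i v_i + λ/v. All parameter values are arbitrary.

    # Monte Carlo check of Proposition 1.6 in the homogeneous Poisson case
    set.seed(300)
    lambda <- 0.1; n <- 1000; v.i <- rep(1, n); v.new <- 1
    n.sim  <- 20000
    sq.err <- numeric(n.sim)
    for (j in 1:n.sim) {
      lambda.hat <- sum(rpois(n, lambda * v.i)) / sum(v.i)   # D-based MLE
      Y.new      <- rpois(1, lambda * v.new) / v.new         # independent case (N, v)
      sq.err[j]  <- (Y.new - lambda.hat)^2
    }
    c(empirical.MSEP = mean(sq.err),
      theoretical    = 0 + lambda / sum(v.i) + lambda / v.new)   # bias^2 + Var(lambda.hat) + Var(Y)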
Proposition 1.6 considers the MSEP, which implicitly assumes that the weighted square loss function is the objective function of interest. However, in Section 1.2.2 we have been considering the Poisson deviance loss as objective function (to obtain the MLE) and, therefore, it is canonical to measure the generalization loss in terms of the out-of-sample Poisson deviance loss. Under the assumptions of Proposition 1.6 this means that we aim at studying
\[
E[D^*(N, \widehat{\lambda})] = 2 E\left[ \widehat{\lambda} v - N - N \log(\widehat{\lambda} v / N) \right].
\]

Proposition 1.8 (Poisson deviance generalization loss). Assume that all cases in D are independent and Poisson distributed having the same expected frequency λ > 0. Moreover, let case (N, v) be independent of D and Poisson distributed with the same expected frequency λ. We predict N by λ̂v, where the D-based estimator λ̂ is assumed to be integrable and λ̂ ≥ ε, P-a.s., for some ε > 0. This prediction has Poisson deviance generalization loss
\[
E[D^*(N, \widehat{\lambda})] = E[D^*(N, \lambda)] + \mathcal{E}(\widehat{\lambda}, \lambda),
\]
with estimation loss defined by
\[
\mathcal{E}(\widehat{\lambda}, \lambda) = 2 v \left( E[\widehat{\lambda}] - \lambda - \lambda\, E\left[ \log(\widehat{\lambda}/\lambda) \right] \right) \ge 0.
\]

Proof of Proposition 1.8. The assumptions on λ̂ imply log ε ≤ E[log λ̂] ≤ log E[λ̂] < ∞. We have
\[
E[D^*(N, \widehat{\lambda})] = 2 E\left[ \widehat{\lambda} v - \lambda v + \lambda v - N - N \log(\lambda v / N) - N \log(\widehat{\lambda}/\lambda) \right].
\]
The first claim follows by the independence between N and λ̂. There remains the proof of the positivity of the estimation loss. Observe
\[
\widehat{\lambda} - \lambda - \lambda \log(\widehat{\lambda}/\lambda) = \lambda \left( x - 1 - \log x \right) = \lambda g(x),
\]
where we have defined x = λ̂/λ > 0, and the last identity defines the function g(x) = x − 1 − log x. The claim now follows from log x ≤ x − 1, with the latter inequality being strict for x ≠ 1. □

Remarks 1.9 (generalization loss, part II).
• The estimation loss E(λ̂, λ) simultaneously quantifies the bias and the estimation volatility. Again, we try to make this term small by controlling the bias-variance trade-off.
• If we want to use the D-based MLE λ̂ of Section 1.2.2 for estimating λ, we need to ensure positivity, P-a.s., i.e. we need to exclude degenerate models. We set λ̂_ε = λ̂ + ε for some ε > 0. We receive the Poisson deviance generalization loss for the MLE-based estimator λ̂_ε
\[
E[D^*(N, \widehat{\lambda}_\varepsilon)] = E[D^*(N, \lambda)] + 2 v \varepsilon - 2 v \lambda\, E\left[ \log(\widehat{\lambda}_\varepsilon/\lambda) \right],
\]
with
\[
\mathcal{E}(\widehat{\lambda}_\varepsilon, \lambda) = 2 v \varepsilon - 2 v \lambda\, E\left[ \log(\widehat{\lambda}_\varepsilon/\lambda) \right] \ge 0,
\qquad
E[D^*(N, \lambda)] = -2 E\left[ N \log(\lambda v / N) \right] \ge 0.
\]
Note that we may also use λ̂_ε = max{λ̂, ε} for guaranteeing positivity, P-a.s., but in this case the bias is more difficult to determine.
• In view of the Poisson deviance generalization loss we can determine the optimal estimator for λ w.r.t. this optimization criterion
\[
\lambda^* = \underset{\mu}{\arg\min}\ E[D^*(N, \mu)] = \underset{\mu}{\arg\min}\ 2 E\left[ \mu v - N - N \log(\mu v / N) \right].
\]
This minimizer is given by λ* = E[N]/v, which implies that the Poisson deviance loss is strictly consistent for the expected value according to Gneiting [55]; strict consistency is a property needed to perform back-testing, we refer to Section 4.1.3 in [141].

We close this section by comparing the square loss function, the Poisson deviance loss function and the absolute value loss function.

Example 1.10. In this example we illustrate the three different loss functions:
\[
L(Y, \lambda, v=1) = (Y - \lambda)^2, \qquad
L(Y, \lambda, v=1) = 2\left[ \lambda - Y - Y \log(\lambda/Y) \right], \qquad
L(Y, \lambda, v=1) = |Y - \lambda|,
\]
the first one is the square loss function, the second one the Poisson deviance loss function and the third one the absolute value loss function.

Figure 1.2: Loss functions Y ↦ L(Y, λ, v=1) with λ = 5.16% for different scales on the x-axis (and y-axis).

These three loss functions are plotted in Figure 1.2 for an expected frequency of λ = 5.16%. From this plot we see that the pure randomness is (by far) the dominant term: the random variable Y = N (for v = 1) lives on the integer values N_0 and, thus, every positive outcome Y ≥ 1 substantially contributes to the loss L(·), see Figure 1.2 (middle). A misspecification of λ only marginally contributes (for small λ); this is exactly the class imbalance problem that makes model calibration of low frequency examples so difficult. Moreover, we see that the Poisson deviance loss reacts more sensitively than the square loss function for claims N ∈ {1, ..., 8} for our expected frequency choice of λ = 5.16%. Formula (6.4) and Figure 6.2 in McCullagh-Nelder [93] propose the following approximation, we also refer to Figure 5.5 in [141],
\[
L(Y, \lambda, v=1) \approx 9\, Y^{1/3} \left( Y^{1/3} - \lambda^{1/3} \right)^2.
\]
Thus, Poisson deviance losses have a rather different behavior from square losses. ■
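The following few lines of R (ours) evaluate the three loss functions of Example 1.10 and the cube-root approximation of McCullagh-Nelder on the claim counts Y = 0, 1, ..., 8 for λ = 5.16%, which makes the different sensitivities discussed above explicit.

    # the three loss functions of Example 1.10 (v = 1) and the cube-root approximation
    lambda <- 0.0516
    Y <- 0:8
    square.loss   <- (Y - lambda)^2
    deviance.loss <- ifelse(Y == 0, 2 * lambda, 2 * (lambda - Y - Y * log(lambda / Y)))
    absolute.loss <- abs(Y - lambda)
    approx.MN     <- 9 * Y^(1/3) * (Y^(1/3) - lambda^(1/3))^2
    round(cbind(Y, square.loss, deviance.loss, absolute.loss, approx.MN), 4)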
1.3.2 Cross-validation on test samples

Assessment of the quality of an estimated model is done by analyzing the generalization loss, see Propositions 1.6 and 1.8. That is, one back-tests the estimated model on cases that have not been seen during the estimation procedure. However, the true generalization loss cannot be determined because typically the true data generating mechanism is not known. Therefore, one tries to assess the generalization loss empirically. Denote by λ ↦ L(Y, λ, v) a (non-negative) loss function, for examples see Example 1.10. The generalization loss for this loss function is defined by (subject to existence)
\[
E\left[ L(Y, \widehat{\lambda}, v) \right].
\]
If the amount of data D in (1.6) is very large, we are tempted to partition the data D into a learning sample D_B and a test sample D_{B^c}, with B ⊂ {1, ..., n} labeling the cases D_B = {(N_i, v_i); i ∈ B} ⊂ D considered for learning the model, and with B^c = {1, ..., n} \ B labeling the remaining test cases. Based on the learning sample D_B we construct the predictor λ̂^B for Y = N/v, see (1.7), and we use the test sample D_{B^c} to estimate the generalization loss empirically by
\[
\widehat{\mathcal{L}}^{\mathrm{oos}} = \widehat{E}\left[ L(Y, \widehat{\lambda}^{B}, v) \right] = \frac{1}{|B^c|} \sum_{i \in B^c} L\left( Y_i, \widehat{\lambda}^{B}, v_i \right). \tag{1.8}
\]
The upper index in L̂^oos indicates that we do an out-of-sample analysis (estimation) because the data for estimation and back-testing has been partitioned into a learning sample and a disjoint back-testing sample. The out-of-sample Poisson deviance loss is analogously estimated by, assuming λ̂^B > 0,
\[
\widehat{\mathcal{D}}^{\mathrm{oos}} = \widehat{E}\left[ D^*(N, \widehat{\lambda}^{B}) \right] = \frac{1}{|B^c|} \sum_{i \in B^c} 2 N_i \left[ \frac{\widehat{\lambda}^{B} v_i}{N_i} - 1 - \log\left( \frac{\widehat{\lambda}^{B} v_i}{N_i} \right) \right]. \tag{1.9}
\]
In many situations it is too optimistic to assume that we can partition data D into a learning sample and a test sample because often the volume of the data is not sufficiently big. A naive way to solve this problem is to use the whole data D for learning the model and then back-testing this model on the same data. The model estimated from the whole data D is denoted by λ̂ (assume λ̂ > 0). The in-sample Poisson deviance loss is defined by
\[
\widehat{\mathcal{L}}^{\mathrm{is}}_{\mathcal{D}} = \frac{1}{n} D^*(N, \widehat{\lambda}) = \frac{1}{n} \sum_{i=1}^{n} 2 N_i \left[ \frac{\widehat{\lambda} v_i}{N_i} - 1 - \log\left( \frac{\widehat{\lambda} v_i}{N_i} \right) \right], \tag{1.10}
\]
i.e. this is exactly the empirical Poisson deviance loss of the estimated model.

This in-sample loss L̂^is_D is prone to over-fitting because it prefers more complex models that can follow observations more closely. However, this smaller in-sample loss does not necessarily imply that a more detailed model generalizes better to unseen cases. Therefore, we need other evaluation methods. These are discussed next.

1.3.3 Leave-one-out cross-validation

Choose i ∈ {1, ..., n} and consider the partition B_i = {1, ..., n} \ {i} and B_i^c = {i}. This provides on every B_i a predictor for Y = N/v given by
\[
\widehat{\lambda}^{(-i)} \stackrel{\text{def.}}{=} \widehat{\lambda}^{B_i}.
\]
The leave-one-out cross-validation loss for loss function L(·) is defined by
\[
\widehat{\mathcal{L}}^{\mathrm{loo}} = \frac{1}{n} \sum_{i=1}^{n} L\left( Y_i, \widehat{\lambda}^{(-i)}, v_i \right).
\]
Leave-one-out cross-validation uses all data D for learning: the data D is split into a training set D^{(-i)} for (partial) learning and a validation set D^{(i)} for an out-of-sample validation, iteratively for all i ∈ {1, ..., n}. Often leave-one-out cross-validation is computationally too expensive as it requires fitting the model n times, which is too demanding for large insurance portfolios.

1.3.4 K-fold cross-validation

For K-fold cross-validation we choose an integer K ≥ 2 and partition {1, ..., n} randomly into K disjoint subsets B_1, ..., B_K of roughly the same size. This provides for every k = 1, ..., K a training set
\[
\mathcal{D}^{(-B_k)} = \left\{ (N_i, v_i);\ i \notin B_k \right\} \subset \mathcal{D},
\]
and the corresponding estimator λ̂^{(-B_k)} for the expected frequency λ. The K-fold cross-validation loss for loss function L(·) is defined by
\[
\widehat{\mathcal{L}}^{\mathrm{CV}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in B_k} L\left( Y_i, \widehat{\lambda}^{(-B_k)}, v_i \right) \approx \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{i \in B_k} L\left( Y_i, \widehat{\lambda}^{(-B_k)}, v_i \right).
\]
Note that we use the whole data D for learning. As for leave-one-out cross-validation we split this learning data into a training set D^{(-B_k)} for (partial) learning and a validation set D^{B_k} for out-of-sample validation. This is done for all k = 1, ..., K, and the out-of-sample (generalization) loss is then estimated by the resulting average cross-validation loss. K-fold cross-validation is the method typically used, and in many applications one chooses K = 10. We do not further elaborate on this choice here, but we refer to the related literature.
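A compact R sketch of K-fold cross-validation for the homogeneous Poisson model (our own illustration, reusing the helper poisson.deviance from above; N and v again denote the claims counts and years at risk):

    # K-fold cross-validation with the Poisson deviance loss (homogeneous model)
    set.seed(400)
    K    <- 10
    n    <- length(N)
    fold <- sample(rep(1:K, length.out = n))     # random partition B_1, ..., B_K
    cv.loss <- numeric(K)
    for (k in 1:K) {
      train      <- fold != k
      lambda.k   <- sum(N[train]) / sum(v[train])                # estimator on D^(-B_k)
      cv.loss[k] <- poisson.deviance(N[!train], lambda.k, v[!train])
    }
    mean(cv.loss)    # estimated generalization loss, cf. the right-hand side of the display above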
1.3.5 Stratified K-fold cross-validation

K-fold cross-validation partitions {1, ..., n} into K disjoint random subsets B_1, ..., B_K of approximately the same size. If there are outliers then these outliers may fall into the same subset B_k, and this may disturb the K-fold cross-validation results. Stratified K-fold cross-validation distributes outliers more equally across the partition. This is achieved by ordering the observations Y_i = N_i/v_i, i = 1, ..., n, that is,
\[
Y_{(1)} \ge Y_{(2)} \ge \ldots \ge Y_{(n)},
\]
with a deterministic rule if there is more than one observation of the same size. Then we build urns U_j of size K for j = 1, ..., ⌈n/K⌉ (the last urn U_{⌈n/K⌉} may be smaller depending on the cardinality of n and K)
\[
\mathcal{U}_j = \left\{ Y_{(i)};\ (j-1)K + 1 \le i \le jK \right\}.
\]
We then build the partition (D_k)_{k=1,...,K} of D for k = 1, ..., K by
\[
\mathcal{D}_k = \left\{ \text{choose randomly from each urn } \mathcal{U}_1, \ldots, \mathcal{U}_{\lceil n/K \rceil} \text{ one case (without replacement)} \right\},
\]
where "choose randomly (without replacement)" is meant in the sense that all urns are randomly distributed, resulting in the partitioned data D_k, k = 1, ..., K; a possible implementation is sketched below. K-fold cross-validation is now applied to the partition D_1, ..., D_K. Note that this does not necessarily provide the same result as the original K-fold cross-validation because in the stratified version it is impossible that the two largest outliers fall into the same set of the partition (supposed that they are bigger than the remaining observations).

Summary. To estimate the generalization loss we typically choose the (stratified) K-fold Poisson deviance cross-validation loss given by
\[
\widehat{\mathcal{D}}^{\mathrm{CV}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in B_k} 2 \left[ \widehat{\lambda}^{(-B_k)} v_i - N_i - N_i \log\left( \frac{\widehat{\lambda}^{(-B_k)} v_i}{N_i} \right) \right] \ge 0. \tag{1.11}
\]

Remark. Evaluation of the K-fold Poisson deviance cross-validation loss (1.11) requires that the (random) partition B_1, ..., B_K of D is chosen such that λ̂^{(-B_k)} > 0 for all k = 1, ..., K. This is guaranteed in stratified K-fold cross-validation as soon as we have K observations with N_i > 0. For non-stratified K-fold cross-validation the situation is more complicated because the positivity constraint may fail with positive probability (if we have too many claims with N_i = 0).
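One possible implementation (ours) of the urn construction: order the observed frequencies, fill urns of size K along this ordering, and randomly distribute each urn over the K folds.

    # stratified K-fold assignment following the urn construction of Section 1.3.5
    stratified.folds <- function(N, v, K) {
      n    <- length(N)
      ord  <- order(N / v, decreasing = TRUE)   # ordering Y_(1) >= Y_(2) >= ... >= Y_(n)
      fold <- integer(n)
      for (j in seq(1, n, by = K)) {            # urn U_j contains the cases (j-1)K+1, ..., jK
        urn <- ord[j:min(j + K - 1, n)]
        fold[urn] <- sample(1:K, length(urn))   # one case of each urn per fold (without replacement)
      }
      fold
    }
    fold <- stratified.folds(N, v, K = 10)
    table(fold)                                  # folds of (roughly) equal size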
We observe that this 10-fold cross-validation losses of 29.1066 ■ 10-2 and 29.1065 ■ 10-2, respectively, match the in-sample loss C1^ = 29.1065 ■ 10-2, i.e. we do not have any sign of over-fitting, here. The column 'run time' shows the total run time needed,5 and '# param.' gives the number of estimated model parameters. Finally, we determine the estimation loss £(A, A*) w.r.t. the true model A*, see Appendix A and Proposition 1.8. We emphasize that we are in the special situation here of knowing the true model A* because we work with synthetic data. This is an atypical situation in practice and therefore we highlight all values in green color which can only be calculated because we know the true model. Using in-sample loss (1.10) we derive C D E2^ i=i 1 D*(N,X* \*(xi)vj _ \k(xi)vl + i(x,x* Auj 1 - log log A 1 n - ^2(A*(xlK-iVl)log A (1.12 i=l vA^) where the first term is the Poisson deviance loss w.r.t. the true model A*, see (A.5) 1 n i r / -D*(N, A*) = - 2 Xk(xi)vi -N.-N, log -1 n i=i L V The second term in (1.12) is defined by 27.7278 ■ 10"2. 1 n £(A,A*) = - j22u i=i A-A*(a:i)-A*(a:i)log 1.3439 ■ 10"2 > 0. (1.13) 5A11 run times are measured on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99GHz with 16GB RAM. Version October 27, 2021, M.V. Wüthrich & C. Buser, ETH Zurich Electronic copy available at: https://ssrn.com/abstract=2870308 Chapter 1. Introduction to Non-Life Insurance Pricing 27 This is an estimate for the estimation loss £(A,A*) w.r.t. the true model A*.6 The last term in (1.12) refers to over-fitting, in average it disappear if the estimation A is independent from AT". In our case it takes value 1 n ( A \ - 2 " Ni) log —— = 0.0348 ■ 10"2, i.e. it is negligible w.r.t. the other terms. We will use the estimated estimation loss £(X, A*) as a measure of accuracy in all models considered, below. Note that we have £(X, A*) = 0 if and only if A = A*(ajj) for all i = 1,..., n. Thus, obviously, if we have heterogeneity between the expected frequencies of the insurance policies, the estimation loss cannot be zero for a homogeneous model. ■ Summary. In practice, model assessment is done w.r.t. cross-validation losses, see Table 1.1. In our special situation of knowing the true model and the true expected frequency function A*(-), we will use the estimation loss £(A,A*) for model assessment. This also allows us to check whether we draw the right conclusions based on the cross-validation analysis. In mathematical statistics, the estimation loss (1.13) is related to the risk stemming from a decision rule. In our situation the decision rule is the MLE A7" i—> A = X(N) which is compared to the true parameters (A*(ajj))j=i n in terms of the loss function under the summation in (1.13). 8Strictly speaking we should consider the estimation loss £(\e, A*) of Ae = max{A, e} for some e > 0, because we need a strictly positive estimator, P-a.s., see Proposition 1.8. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich Electronic copy available at: https://ssrn.com/abstract=2870308 28 Chapter 1. Introduction to Non-Life Insurance Pricing Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich Electronic copy available at: https://ssrn.com/abstract=2870308 Chapter 2 Generalized Linear Models Generalized linear models (GLMs) are popular statistical models that are used in the framework of the exponential dispersion family (EDF). For standard references on GLMs and the EDL we refer to Nelder-Wedderburn [98] and McCullagh-Nelder [93]. 
Chapter 2
Generalized Linear Models

Generalized linear models (GLMs) are popular statistical models that are used in the framework of the exponential dispersion family (EDF). For standard references on GLMs and the EDF we refer to Nelder-Wedderburn [98] and McCullagh-Nelder [93]. We present the Poisson claims count model within a GLM framework in this chapter. This extends the homogeneous Poisson portfolio consideration of Section 1.2.2 to the heterogeneous case. Similar derivations can be done for individual claim sizes using, for instance, a gamma or a log-normal distribution for individual claim sizes, for details we refer to Ohlsson-Johansson [102] and Wüthrich-Merz [141]. Additionally, we introduce a classification problem in this chapter which is based on the Bernoulli model and which describes a logistic regression problem, see Section 2.5 below.

2.1 Heterogeneous Poisson claims frequency model

Assume that the claims count random variable N has a Poisson distribution with given years at risk v > 0 and expected frequency λ > 0. We aim at modeling the expected frequency λ > 0 such that it allows us to incorporate structural differences (heterogeneity) between different insurance policies and risks; such structural differences are called systematic effects in the statistical literature. Assume we have q-dimensional features x = (x_1, ..., x_q)′ ∈ X belonging to the set of all possible features (called feature space X). A regression function λ(·) maps this feature space X to expected frequencies
\[
\lambda: \mathcal{X} \to \mathbb{R}_+, \qquad x \mapsto \lambda = \lambda(x). \tag{2.1}
\]
The feature x describes the risk characteristics of a certain insurance policy, see Example 2.1 below, and λ(·) describes the resulting structural differences (systematic effects) in the expected frequency described by these features.

Terminology. There is different terminology and we use the ones in italic letters.
• x is called feature, covariate, explanatory variable, independent variable, predictor variable, measurement vector.
• λ is called (expected) response, dependent variable, regressand, regression function, classifier (the latter three terminologies also depend on the particular type of regression problem considered, this is further highlighted below).
• The components of x can be continuous, discrete or even finite. In the finite and discrete cases we distinguish between ordered (ordinal) and unordered (nominal) components (called categorical components). For instance, x_1 ∈ {female, male} is a nominal categorical feature component. The case of two categories is called binary.

Our main goal is to find the regression function λ(·) as a function of the features x ∈ X and to understand the systematic effects and their sensitivities in each feature component x_l of x. In general, available observations are noisy, that is, we cannot directly observe λ(x), but only the frequency Y = N/v, which is generated by
\[
Y = \lambda(x) + \varepsilon,
\]
with residual (noise) term ε satisfying in the Poisson case E[ε] = 0 and Var(ε) = λ(x)/v.

Example 2.1 (car insurance). We model the claims frequency Y = N/v with feature x by
\[
P[Y = k/v] = P[N = k] = \exp\{-\lambda(x) v\}\, \frac{(\lambda(x) v)^k}{k!}, \qquad \text{for } k \in \mathbb{N}_0,
\]
with given years at risk v > 0 and regression function λ(·) given by (2.1). We may now think of features x = (x_1, ..., x_q) characterizing different car drivers with, for instance, x_1 describing the age of the driver (continuous component), x_2 describing the price of the car (continuous component or discrete ordinal component if of type "cheap", "average", "expensive"), x_3 describing the gender of the driver (binary component), etc. The goal is to find (infer) the regression function λ(·) so that it optimally characterizes the underlying risks in terms of the chosen features x ∈ X.
This inference needs to be done from noisy claims count observations N, and the chosen features may serve as proxies for other (unobservable) risk drivers such as driving experience, driving skills, etc. ■

In view of the previous example it seems advantageous to include as many feature components as possible in the model. However, if the feature space is too complex and too large this will lead to a poor model with a poor predictive performance (big generalization loss, see Section 1.3). The reason for this being that we have to infer the regression function from a finite number of observations, and the bigger the feature space the more likely that irrelevant feature information may play a special role in a particular (finite) observation sample. For this reason it is important to carefully choose relevant feature information. In fact, feature extraction is one of the best studied and understood fields in actuarial practice with a long tradition. The Swiss car insurance market was deregulated in 1996 and since then sophisticated pricing models have been developed that may depend on more than 30 feature components. One may ask, for instance, questions like "What is a sports car?", see Ingenbleek-Lemaire [72].

2.2 Multiplicative regression model

A commonly used technique to infer a regression function λ(·) is the MLE method within the family of GLMs. We consider the special case of the Poisson model with a multiplicative regression structure in this chapter. This multiplicative regression structure leads to a log-linear functional form (or the log-link, respectively). Assume X ⊂ R^q and that the regression function λ(·) is characterized by a parameter vector β = (β_0, β_1, ..., β_q)′ ∈ R^{q+1} such that for x ∈ X we have representation
\[
x \mapsto \log \lambda(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_q x_q \stackrel{\text{def.}}{=} \langle \beta, x \rangle. \tag{2.2}
\]
The last definition uses a slight abuse of notation because the features x ∈ X are extended by a component x_0 = 1 for the intercept component β_0. Formula (2.2) assumes that all feature components are real-valued. This requires that categorical feature components are transformed (pre-processed); a detailed treatment and description of categorical feature components' pre-processing is provided in Section 2.4.1, below.

The task is to infer β from cases (N_i, x_i, v_i) ∈ D, where we extend definition (1.6) of the data to
\[
\mathcal{D} = \left\{ (N_1, x_1, v_1), \ldots, (N_n, x_n, v_n) \right\}, \tag{2.3}
\]
with x_i ∈ X being the feature information of policy i = 1, ..., n. Assume that all cases are independent with N_i being Poisson distributed with expected frequency λ(x_i) given by (2.2). The joint log-likelihood function of the data D under these assumptions is given by
\[
\beta \mapsto \ell_{N}(\beta) = \sum_{i=1}^{n} \left[ -\exp\langle \beta, x_i \rangle\, v_i + N_i \left( \langle \beta, x_i \rangle + \log v_i \right) - \log(N_i!) \right]. \tag{2.4}
\]
The MLE may be found by the solution of
\[
\frac{\partial}{\partial \beta} \ell_{N}(\beta) = 0. \tag{2.5}
\]
We calculate the partial derivatives of the log-likelihood function for 0 ≤ l ≤ q
\[
\frac{\partial}{\partial \beta_l} \ell_{N}(\beta) = \sum_{i=1}^{n} \left[ -\exp\langle \beta, x_i \rangle\, v_i\, x_{i,l} + N_i x_{i,l} \right] = \sum_{i=1}^{n} \left( -\lambda(x_i) v_i + N_i \right) x_{i,l} = 0, \tag{2.6}
\]
where x_i = (x_{i,1}, ..., x_{i,q})′ ∈ X describes the feature of the i-th case in D, and for the intercept β_0 we add the components x_{i,0} = 1. We define the design matrix 𝔛 ∈ R^{n×(q+1)} by 𝔛 = (x_{i,l})_{1 ≤ i ≤ n, 0 ≤ l ≤ q}. [...] Denoting the score by s(β, N) = ∂/∂β ℓ_N(β), and assuming that N has been generated by this Poisson GLM, we have E[s(β, N)] = 0 and we obtain Fisher's information I(β) ∈ R^{(q+1)×(q+1)}, see also Section 2.6 in the appendix,
\[
I(\beta) = E\left[ s(\beta, N)\, s(\beta, N)' \right] = -E\left[ \frac{\partial^2}{\partial \beta\, \partial \beta'} \ell_{N}(\beta) \right].
\]
From Proposition 2.2 we obtain unbiasedness of the volume-adjusted MLE (2.8) for the expected number of claims, see also Proposition 2.4, below. Moreover, the Cramér-Rao bound is attained, which means that we have a uniformly minimum variance unbiased (UMVU) estimator, see Section 2.7 in Lehmann [87]. Fisher's information matrix I(β) plays a crucial role in the uncertainty quantification of MLEs within GLMs. In fact, one can prove asymptotic normality of the MLE β̂ if the volumes go to infinity, and the asymptotic covariance matrix is a scaled version of the inverse of Fisher's information matrix; we refer to Chapter 6 in Lehmann [87], Fahrmeir-Tutz [41] and Chapter 5 in Wuthrich-Merz [141].

Finally, we consider the so-called balance property, see Theorem 4.5 in Buhlmann-Gisler [18]. This is an important property in insurance to receive the right price calibration on the portfolio level.

Proposition 2.4 (balance property). Under the assumptions of Proposition 2.2 we have for the MLE β̂

    Σ_{i=1}^n λ̂(x_i) v_i = Σ_{i=1}^n exp⟨β̂, x_i⟩ v_i = Σ_{i=1}^n N_i.

Proof. The proof is a direct consequence of Proposition 2.2. Note that the first column in the design matrix X is identically equal to 1 (modeling the intercept). This implies

    Σ_{i=1}^n v_i exp⟨β̂, x_i⟩ = (1, ..., 1) diag(v_1, ..., v_n) exp{Xβ̂} = (1, ..., 1) N = Σ_{i=1}^n N_i.

This proves the claim. □

Remark 2.5. The balance property holds true in general for GLMs within the EDF as long as we work with the canonical link. The canonical link of the Poisson model is the log-link, which exactly tells us that for regression function (2.2) the balance property has to hold. For more background on canonical links we refer to McCullagh-Nelder [93] and Section 5.1.5 in [141].

2.3 Deviance residuals and parameter reduction

There are several ways to assess the goodness of fit and to evaluate the generalization loss introduced in Section 1.3. For analyzing the generalization loss we consider the (stratified) K-fold Poisson deviance cross-validation loss described in (1.11), modified to the heterogeneous case. The (heterogeneous) Poisson deviance loss for regression function (2.2) is given by, see also (1.5),

    D*(N, λ) = 2 Σ_{i=1}^n N_i [ λ(x_i)v_i / N_i - 1 - log( λ(x_i)v_i / N_i ) ] ≥ 0,    (2.11)

where the i-th term is set equal to 2 λ(x_i)v_i for N_i = 0. Note, again, that minimizing this Poisson deviance loss provides the MLE β̂ under assumption (2.2).

For the goodness of fit we may consider the (in-sample) Pearson's residuals. We set Y_i = N_i/v_i, for 1 ≤ i ≤ n,

    δ_i^P = ( N_i - λ̂(x_i) v_i ) / sqrt( λ̂(x_i) v_i ) = sqrt(v_i) ( Y_i - λ̂(x_i) ) / sqrt( λ̂(x_i) ).    (2.12)

These Pearson's residuals should roughly be centered with unit variance (and close to independence) under our model assumptions. Therefore, we can consider scatter plots of these residuals (in relation to their features) and we should not detect any structure in these scatter plots. Pearson's residuals δ_i^P, 1 ≤ i ≤ n, are distribution-free, i.e. they do not (directly) account for a particular distributional form. They are most appropriate in a (symmetric) Gaussian case. For other distributional models one often prefers deviance residuals. The reason for this preference is that deviance residuals are more robust (under the right distributional choice): note that the expected frequency estimate λ̂ appears in the denominator of Pearson's residuals in (2.12). This may essentially distort Pearson's residuals. Therefore, we may not want to rely on such weighted residuals. Moreover, Pearson's residuals do not account for the distributional properties of the underlying model.
Therefore, Pearson's residuals can be heavily distorted by skewness and extreme observations (which look very non-Gaussian). The Poisson deviance residuals are defined by

    δ_i^D = sgn( N_i - λ̂(x_i) v_i ) sqrt( 2 N_i [ λ̂(x_i)v_i / N_i - 1 - log( λ̂(x_i)v_i / N_i ) ] ).

Remarks.

• If we allow for an individual expected frequency parameter λ_i for each observation N_i, then the MLE optimal model is exactly the saturated model with parameter estimates λ̂_i = N_i/v_i, see also (1.4). Therefore, each term in the summation on the right-hand side of the above deviance loss (2.11) is non-negative, i.e.

    2 N_i [ λ(x_i)v_i / N_i - 1 - log( λ(x_i)v_i / N_i ) ] ≥ 0,

for all 1 ≤ i ≤ n. This implies that the deviance loss is bounded from below by zero, and a sequence of parameters (β_t)_{t≥1} that is decreasing in the sequence of deviance losses (D*(N, λ = λ_{β_t}))_{t≥1} may provide convergence to a (local) minimum w.r.t. that loss function. This is important in any numerical optimization such as the gradient descent algorithm that we will meet below.

• Observe that maximizing the log-likelihood ℓ_N(β) over the parameter β is equivalent to minimizing the deviance loss D*(N, λ = λ_β) over β. In this spirit, the deviance loss plays the role of the canonical objective function that should be minimized.

• D*(N, λ) is the scaled Poisson deviance loss. Within the EDF there is a second deviance statistics that accounts for a potential over- or under-dispersion φ ≠ 1. In the Poisson model this does not apply because by definition φ = 1. Nevertheless, we can determine this dispersion parameter empirically on our data. There are two different estimators. Pearson's (distribution-free) dispersion estimate is given by

    φ̂_P = 1/(n - (q+1)) Σ_{i=1}^n (δ_i^P)²,    (2.13)

and the deviance dispersion estimate is given by

    φ̂_D = 1/(n - (q+1)) Σ_{i=1}^n (δ_i^D)² = D*(N, λ̂) / (n - (q+1)).    (2.14)

Pearson's dispersion estimate should be roughly equal to 1 to support our model choice. The size of the deviance dispersion estimate depends on the size of the expected number of claims λv, see Figure 1.1. For our true expected frequency λ* we receive φ̂_D = D*(N, λ*)/n = 27.7228 · 10^{-2}, see (A.5).

Finally, we would like to test whether we need all components in β ∈ R^{q+1} or whether a lower dimensional nested model can equally well explain the observations N. We assume that the components in β are ordered in the appropriate way, otherwise we permute them (and accordingly the columns of the design matrix X).

Null hypothesis H_0:  β_1 = ... = β_p = 0  for given 1 ≤ p ≤ q.

1. Calculate the deviance loss D*(N, λ̂_full) in the full model with MLE β̂ ∈ R^{q+1}.

2. Calculate the deviance loss D*(N, λ̂_{H_0}) under the null hypothesis H_0 with MLE β̂ ∈ R^{q+1-p}.

Define the likelihood ratio test statistics, see Lemma 3.1 in Ohlsson-Johansson [102],

    χ_p² = D*(N, λ̂_{H_0}) - D*(N, λ̂_full) ≥ 0.    (2.15)

Under H_0, the likelihood ratio test statistics χ_p² is approximately χ²-distributed with p degrees of freedom. Alternatively, one could use a Wald statistics instead of χ_p². A Wald statistics is a second order approximation to the log-likelihood function which is motivated by the asymptotic normality of the MLE β̂. The z-test in the R output in Listing 2.2 refers to a rooted Wald statistics; for more details we refer to Section 5.3.2 in [141].

Remarks 2.6.

• Analogously, the likelihood ratio test and the Wald test can be applied recursively to a sequence of nested models.
This leads to a step-wise reduction of model complexity; this is similar in spirit to the analysis of variance (ANOVA) in Listing 2.7, and it is often referred to as backward model selection, see the drop1 analysis below.

• The tests presented apply to nested models. If we want to select between non-nested models we can use cross-validation.

2.4 Example in car insurance pricing

We consider a car insurance example with synthetic data D generated as described in Appendix A. The portfolio consists of n = 500'000 car insurance policies for which we have feature information x_i ∈ X* and years at risk information v_i ∈ (0,1], for i = 1, ..., n. Moreover, for each policy i we have an observation N_i. These (noisy) observations were generated from independent Poisson distributions based on the underlying expected frequency function λ*(·) as described in Appendix A.² Here, we assume that neither the relevant feature components, the functional form nor the involved parameters of this expected frequency function λ*(·) are known (this is the typical situation in practice), and we aim at inferring an expected frequency function estimate λ̂(·) from the data D.³

²The true (typically unknown) expected frequency is denoted by λ*(·) and the corresponding feature space by X*, i.e. we have true regression function λ* : X* → R_+.
³The data is available from https://people.math.ethz.ch/~wmario/Lecture/MTPL_data.csv

2.4.1 Pre-processing features: categorical feature components

We have 4 categorical (nominal) feature components gas, brand, area and ct in our data D. These categorical feature components need pre-processing (feature engineering) for the application of the Poisson GLM with regression function (2.2) because they are not real-valued. Component gas ∈ {Diesel, Regular} is binary and is transformed to 0 and 1, respectively. The area code is ordinal and is transformed by {A, ..., F} ↦ {1, ..., 6} ⊂ R, see Figure A.9. Thus, there remain the car brand and the Swiss cantons ct. Car brand has 11 different levels and there are 26 Swiss cantons, see (A.1). We present dummy coding to transform these categorical feature components to numerical representations.

For illustrative purposes, we demonstrate dummy coding on the feature component car brand. In a first step we use binary coding to illustrate which level a selected car has. In Table 2.1 we provide the one-hot encoding of the car brands on the different rows. Each brand is mapped to a basis vector in R^{11}, i.e. each car brand is represented in one-hot encoding by a unit vector x^{(1h)} ∈ {0,1}^{11} with Σ_{l=1}^{11} x_l^{(1h)} = 1. This mapping is illustrated by the different rows in Table 2.1.

B1   1 0 0 0 0 0 0 0 0 0 0
B10  0 1 0 0 0 0 0 0 0 0 0
B11  0 0 1 0 0 0 0 0 0 0 0
B12  0 0 0 1 0 0 0 0 0 0 0
B13  0 0 0 0 1 0 0 0 0 0 0
B14  0 0 0 0 0 1 0 0 0 0 0
B2   0 0 0 0 0 0 1 0 0 0 0
B3   0 0 0 0 0 0 0 1 0 0 0
B4   0 0 0 0 0 0 0 0 1 0 0
B5   0 0 0 0 0 0 0 0 0 1 0
B6   0 0 0 0 0 0 0 0 0 0 1

Table 2.1: One-hot encoding x^{(1h)} ∈ R^{11} for car brand, encoded on the different rows.

In a second step we need to ensure that the resulting design matrix X (which includes an intercept component, see (2.7)) has full rank. This is achieved by declaring one level to be the reference level (this level is modeled by the intercept β_0). Dummy coding then only measures relative differences to this reference level.
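In R, the one-hot encoding of Table 2.1 and the dummy coding relative to a reference level can both be obtained from model.matrix(); the following sketch uses a small hypothetical brand factor for illustration (the full portfolio data is not needed):

brand <- factor(c("B1", "B10", "B2", "B3", "B1"),
                levels = c("B1", "B10", "B11", "B12", "B13", "B14",
                           "B2", "B3", "B4", "B5", "B6"))
model.matrix(~ brand - 1)   # one-hot encoding: 11 indicator columns, one per level (Table 2.1)
model.matrix(~ brand)       # dummy (treatment contrast) coding: intercept + 10 columns,
                            # the first level B1 acting as reference level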
If we declare B1 to be the reference level, we can drop the first column of Table 2.1. This provides the dummy coding scheme for car brand given in Table 2.2.

B1   0 0 0 0 0 0 0 0 0 0
B10  1 0 0 0 0 0 0 0 0 0
B11  0 1 0 0 0 0 0 0 0 0
B12  0 0 1 0 0 0 0 0 0 0
B13  0 0 0 1 0 0 0 0 0 0
B14  0 0 0 0 1 0 0 0 0 0
B2   0 0 0 0 0 1 0 0 0 0
B3   0 0 0 0 0 0 1 0 0 0
B4   0 0 0 0 0 0 0 1 0 0
B5   0 0 0 0 0 0 0 0 1 0
B6   0 0 0 0 0 0 0 0 0 1

Table 2.2: Dummy coding x^{brand} ∈ R^{10} for car brand, encoded on the different rows.

We define the part of the feature space X that belongs to car brand as follows

    X^{brand} = { x^{brand} ∈ {0,1}^{10} ; Σ_{l=1}^{10} x_l^{brand} ∈ {0,1} } ⊂ R^{10}.    (2.16)

That is, the resulting part X^{brand} of the feature space X is 10 dimensional, the feature components can only take the values 0 and 1, and the components of x^{brand} add up to either 0 or 1, indicating to which particular car brand a specific policy belongs. Note that this feature space may differ from the original feature space X* of Appendix A. For the 26 Swiss cantons we proceed completely analogously, receiving X^{ct} ⊂ {0,1}^{25} ⊂ R^{25} with, for instance, canton ZH being the reference level.

Remarks 2.7.

• If we have k categorical classes, then we need k − 1 indicators (dummy variables) to uniquely identify the parametrization of the (multiplicative) model including an intercept. Choice (2.16) assumes that one level is the reference level. This reference level is described by the intercept β_0. All other levels are measured relative to this reference level and are described by regression parameters β_l, with 1 ≤ l ≤ k − 1, see also (2.10). This parametrization is called dummy coding or treatment contrast coding, where the reference level serves as control group and all other levels are described by dummy variables relative to this control group.

• Other identification schemes (contrasts) are possible, as long as they lead to a full rank of the design matrix X. For instance, if we have the hypothesis that the first level is the best, the second level is the second best, etc., then we could consider Helmert's contrast coding given in Table 2.3. For illustrative purposes we only choose k = 5 car brands in Table 2.3.

B1   4/5    0     0     0
B2  -1/5   3/4    0     0
B3  -1/5  -1/4   2/3    0
B4  -1/5  -1/4  -1/3   1/2
B5  -1/5  -1/4  -1/3  -1/2

Table 2.3: Helmert's contrast coding x^{Helmert} ∈ R^4, encoded on the different rows.

This coding in Table 2.3 has the following properties: (i) each column sums to zero, (ii) all columns are orthogonal, and (iii) each level is compared to the mean of the subsequent levels, see also (2.17), below. In view of (2.2), this provides the regression function for policy i (restricted to the 5 car brand levels, only)

    ⟨β, x_i^{Helmert}⟩ =
        β_0 + (4/5) β_1                                               if i is car brand B1,
        β_0 - (1/5) β_1 + (3/4) β_2                                   if i is car brand B2,
        β_0 - (1/5) β_1 - (1/4) β_2 + (2/3) β_3                       if i is car brand B3,
        β_0 - (1/5) β_1 - (1/4) β_2 - (1/3) β_3 + (1/2) β_4           if i is car brand B4,
        β_0 - (1/5) β_1 - (1/4) β_2 - (1/3) β_3 - (1/2) β_4           if i is car brand B5.

Observe that the mean over all risk classes is described by β_0. For a given level, say car brand B2, the subsequent levels have the same mean (on the log scale) except in the variable to which it is compared:

    (log-) mean of car brand B2:              β_0 - (1/5) β_1 + (3/4) β_2,
    (log-) mean over car brands B3, B4, B5:   β_0 - (1/5) β_1 - (1/4) β_2,    (2.17)

thus, the difference in this comparison is exactly one unit of β_2.

• The choice of the explicit identification scheme does not have any influence on the prediction, i.e.
different (consistent) parametrizations lead to the same prediction. However, the choice of the identification may be important for the interpretation of the parameters (see the examples above) and it is important, in particular, if we explore parameter reduction techniques, e.g., backward selection.

• If we have k = 2 categorical classes (binary case), then dummy coding is equivalent to a continuous consideration of that component.

Conclusion 2.8. If we consider the 5 continuous feature components age, ac, power, area and log(dens) as log-linear, the binary component gas as 0 or 1, and the 2 categorical components brand and ct by dummy coding, we receive a feature space

    X = R^5 × {0,1} × X^{brand} × X^{ct} ⊂ R^q

of dimension q = 5 + 1 + 10 + 25 = 41.

2.4.2 Pre-processing features: continuous feature components

In view of the previous conclusion, we need to ask ourselves whether a log-linear consideration of the continuous feature components age, ac, power, area and log(dens) in regression function (2.2) is appropriate.

Figure 2.1: Observed marginal log-frequencies of the continuous feature components age, ac, power, area and log(dens) (panels: observed log-frequency per age of driver, per age of car, per power of car, per area code and per log density).

In Figure 2.1 we illustrate the observed marginal log-frequencies of the continuous feature components age, ac, power, area and log(dens) provided by the data D in Appendix A. We see that some of these graphs are (highly) non-linear and non-monotone, which does not support the log-linear assumption in (2.2). This is certainly true for the components age and ac. The other continuous feature components power, area and log(dens) need a more careful consideration because the marginal plots in Figure 2.1 can be (rather) misleading. Note that these marginal plots involve interactions between feature components and, thus, they may heavily be influenced by such interactions. For instance, log(dens) at the lower end may suffer from insufficient years at risk and from interactions with other feature components that drive low expected frequencies.

GLM modeling decision. To keep the outline of the GLM chapter simple we assume that the components power, area and log(dens) can be modeled by a log-linear approach, and that the components age and ac need to be modeled differently. These choices are going to be challenged and revised later on.

The first way to deal with non-linear and non-monotone continuous feature components in GLMs is to partition them and then treat them as (nominal) categorical variables. We will present another treatment in Chapter 3 on generalized additive models (GAMs). In Figure 2.2 we provide the chosen categorization, which gives us a partition of age into 8 'age classes' and of ac into 4 'ac classes' (which hopefully are homogeneous w.r.t. claims frequency).
age class 1: 18-20,  age class 2: 21-25,  age class 3: 26-30,  age class 4: 31-40,
age class 5: 41-50,  age class 6: 51-60,  age class 7: 61-70,  age class 8: 71-90;
ac class 0: 0,  ac class 1: 1,  ac class 2: 2,  ac class 3: 3+

Figure 2.2: (lhs) Chosen 'age classes' and 'ac classes'; (rhs) resulting partition of the observed log-frequency per age of driver and per age of car.

Listing 2.1: Categorical classes for GLM
1 cl <- c(18, 21, 26, 31, 41, 51, 61, 71, 91)
2 ageGLM <- cbind(c(18:90), c(rep(paste(cl[1], "to", cl[2]-1, sep=""), ...
3 dat$ageGLM <- as.factor(ageGLM[dat$age-17, 2])
4 dat$ageGLM <- relevel(dat$ageGLM, ref="51to60")
5 dat$acGLM  <- as.factor(pmin(dat$ac, 3))
6 levels(dat$acGLM) <- c("ac0", "ac1", "ac2", "ac3+")

In Listing 2.1 we give the R code to receive this partitioning. Note that we treat these age and ac classes as categorical feature components. In R this is achieved by declaring these components to be of factor type, see lines 3 and 5 of Listing 2.1. For the GLM approach we then use dummy coding to bring these (new) categorical feature components into the right structural form, see Section 2.4.1. On line 4 of Listing 2.1 we define the reference level of 'age class' for this dummy coding, and for 'ac class' we simply choose the first one, ac0, as reference level. Thus, in full analogy to Section 2.4.1, we obtain X^{age} ⊂ {0,1}^7 ⊂ R^7 for modeling 'age classes', and X^{ac} ⊂ {0,1}^3 ⊂ R^3 for modeling 'ac classes'. This implies that our feature space considers 3 log-linear continuous feature components power, area and log(dens), the binary feature component gas and 4 categorical feature components 'age class', 'ac class', brand and ct. This equips us with feature space

    X = R^3 × {0,1} × X^{age} × X^{ac} × X^{brand} × X^{ct} ⊂ R^q,

of dimension q = 3 + 1 + 7 + 3 + 10 + 25 = 49.

Remarks 2.9 (feature engineering).

• The choice of 'age classes' and 'ac classes' has been done purely expert based by looking at the marginal plots in Figure 2.2 (rhs). They have been built such that each resulting class is as homogeneous as possible in the underlying frequency. In Chapter 6 we will meet regression trees which allow for a data driven selection of classes.

• Besides the heterogeneity between and within the chosen classes, one should also pay attention to the resulting volumes. The smallest 'age class' 18to20 has a total volume of 1'741.48 years at risk, the smallest 'ac class' ac0 a total volume of 14'009.20 years at risk. The latter is considered to be sufficiently large, the former might be critical. We come back to this in formula (2.20), below. Thus, in building categorical classes one needs to find a good trade-off between homogeneity within the classes and minimal sizes of these classes for reliable parameter estimation.

• A disadvantage of this categorical coding of continuous variables is that the topology gets lost. For instance, after categorical coding as shown in Figure 2.2 it is no longer clear that, e.g., ages 70 and 71 are neighboring ages (in categorical coding). Moreover, the resulting frequency function will not be continuous at the boundaries of the classes. Therefore, alternatively, one could also try to replace non-monotone continuous features by more complex functional forms. For instance, we could map age to a 4 dimensional function having regression parameters β_l, ..., β_{l+3} ∈ R,

    age ↦ β_l age + β_{l+1} log(age) + β_{l+2} age² + β_{l+3} age³.
In this coding we keep age as a continuous variable and we allow for more modeling flexibility by providing a pre-specified functional form that enters the regression function.

• In (2.10) we have seen that if we choose the log-link function then we get a multiplicative tariff structure. Thus, all feature components interact in a multiplicative way. If we have indication that interactions have a different structure, we can model this structure explicitly. As an example, we may consider

    (age, ac) ↦ β_l age + β_{l+1} ac + β_{l+2} age/ac².

This gives us the systematic effect (under the log-link choice)

    exp{β_l age} exp{β_{l+1} ac} exp{β_{l+2} age/ac²}.

This shows that GLMs allow for a lot of modeling flexibility, but the modeling has to be done explicitly by the modeler. This requires deep knowledge about the data, so that features can be engineered and integrated in the appropriate way.

Example 2.10 (example GLM1). In a first example we only consider the feature information age and ac, and we choose the (categorical) feature pre-processing as described in Listing 2.1. In the next section on 'data compression' it will become clear why we are starting with this simplified example. We define the feature space X = X^{age} × X^{ac} ⊂ R^q which has dimension q = 7 + 3 = 10, and we assume that the regression function λ : X → R_+ is given by the log-linear form (2.2).

Listing 2.2: Results of example GLM1 (Example 2.10)
1  Call:
2  glm(formula = claims ~ ageGLM + acGLM, family = poisson(), data = dat,
3      offset = log(expo))
4
5  Deviance Residuals:
6       Min        1Q    Median        3Q       Max
7   -1.1643   -0.3967   -0.2862   -0.1635    4.3409
8
9  Coefficients:
10                Estimate Std. Error  z value Pr(>|z|)
11 (Intercept)   -1.41592    0.02076  -68.220  < 2e-16 ***
12 ageGLM18to20   1.12136    0.04814   23.293  < 2e-16 ***
13 ageGLM21to25   0.48988    0.02909   16.839  < 2e-16 ***
14 ageGLM26to30   0.13315    0.02473    5.383 7.32e-08 ***
15 ageGLM31to40   0.01316    0.01881    0.700  0.48403
16 ageGLM41to50   0.05644    0.01846    3.058  0.00223 **
17 ageGLM61to70  -0.17238    0.02507   -6.875 6.22e-12 ***
18 ageGLM71to90  -0.13196    0.02983   -4.424 9.71e-06 ***
19 acGLMac1      -0.60897    0.02369  -25.708  < 2e-16 ***
20 acGLMac2      -0.79284    0.02588  -30.641  < 2e-16 ***
21 acGLMac3+     -1.08595    0.01866  -58.186  < 2e-16 ***
22 ---
23 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
24
25 (Dispersion parameter for poisson family taken to be 1)
26
27     Null deviance: 145532  on 499999  degrees of freedom
28 Residual deviance: 141755  on 499989  degrees of freedom
29 AIC: 192154
30
31 Number of Fisher Scoring iterations: 6

The MLE β̂ ∈ R^{q+1} is then calculated using the R command glm. The results are presented in Listing 2.2. We receive q + 1 = 11 estimated MLE parameters (lines 11-21), the intercept being β̂_0 = -1.41592. Thus, the reference risk cell (51to60, ac0) has an estimated expected frequency of λ̂(x) = exp{-1.41592} = 24.27%. All other risk cells are measured relative to this reference cell using the multiplicative structure (2.10). The further columns on lines 11-21 provide the following information: the column Std.
Error gives the estimated standard deviation σ̂_l of β̂_l, for details we refer to (2.33) in the appendix, Section 2.6, where Fisher's information matrix is discussed; the column z value provides the rooted Wald statistics under the null hypothesis β_l = 0 (for each component 0 ≤ l ≤ q individually), it is defined by z_l = β̂_l / σ̂_l; finally, the last column Pr(>|z|) gives the resulting (individual two-sided) p-values for these null hypotheses under asymptotic normality. These null hypotheses should be interpreted as: there does not exist a significant difference between the considered level 1 ≤ l ≤ q and the reference level (when using dummy coding). Note that these tests cannot be used to decide whether a categorical variable should be included in the model or not.⁴

⁴The rooted Wald statistics uses a second order Taylor approximation to the log-likelihood. Based on asymptotic normality it allows for a z-test under known dispersion, or a t-test for estimated dispersion, whether a variable can be dropped from the full model. However, if we have categorical features, one has to be more careful because in this case the p-value only indicates whether a level significantly differs from the reference level.

Line 28 of Listing 2.2 gives the unscaled in-sample Poisson deviance loss, i.e. n L_D^is = D*(N, λ̂) = 141'755. This results in an in-sample loss of L_D^is = 28.3510 · 10^{-2}, see also Table 2.4. Line 27 gives the corresponding value of the homogeneous model (called null model) only considering an intercept; this is the model of Section 1.4. This allows us to consider the likelihood ratio test statistics (2.15) for the null hypothesis H_0 of the homogeneous model versus the alternative heterogeneous model considered in this example:

    χ_q² = D*(N, λ̂_{H_0}) - D*(N, λ̂_full) = 145'532 - 141'755 = 3'777.

This test statistics has approximately a χ²-distribution with q = 10 degrees of freedom. The resulting p-value is almost zero, which means that we highly reject the homogeneous model in favor of the model in this example. Finally, Akaike's information criterion (AIC) is given, and Fisher's scoring method for parameter estimation has converged in 6 iterations.

                        run time  # param.  CV loss L_D^CV  strat. CV loss  est. loss E(λ̂,λ*)  in-sample L_D^is  average frequency
(ChA.1) true model λ*       --        --              --              --                  --           27.7278          10.1991%
(Ch1.1) homogeneous       0.1s         1         29.1066         29.1065              1.3439           29.1065          10.2691%
(Ch2.1) GLM1              3.1s        11         28.3543         28.3544              0.6052           28.3510          10.2691%

Table 2.4: Poisson deviance losses of K-fold cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model λ*; losses are reported in 10^{-2}; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters; this table follows up from Table 1.1.

Exactly in the same structure as in Table 1.1, we present the corresponding cross-validation losses L_D^CV, the estimation loss E(λ̂, λ*) and the in-sample loss L_D^is on line (Ch2.1) of Table 2.4. Note that the estimation loss E(λ̂, λ*) is given by (1.13), and it can only be calculated because we know the true model λ* here. We observe that these loss figures are clearly better than in the homogeneous model on line (Ch1.1), but worse compared to the true model λ* on line (ChA.1).
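The K-fold cross-validation losses in Table 2.4 can be computed along the following lines; this is only a sketch on simulated data with hypothetical column names (and without the stratified version), not the exact implementation used for the tables:

set.seed(1)
K   <- 10
dat <- data.frame(x1 = rnorm(20000), expo = runif(20000, 0.1, 1))
dat$claims <- rpois(nrow(dat), exp(-2 + 0.3 * dat$x1) * dat$expo)

# Poisson deviance loss per observation, with the convention 2*mu for N = 0
pois_dev <- function(N, mu) 2 * (mu - N + ifelse(N == 0, 0, N * log(N / mu)))

fold <- sample(rep(1:K, length.out = nrow(dat)))      # random allocation to the K folds
cv   <- numeric(K)
for (k in 1:K) {
  fit   <- glm(claims ~ x1 + offset(log(expo)), family = poisson(),
               data = dat[fold != k, ])               # fit on the other K-1 folds
  mu    <- predict(fit, newdata = dat[fold == k, ], type = "response")  # lambda-hat * v
  cv[k] <- mean(pois_dev(dat$claims[fold == k], mu))  # out-of-sample deviance loss on fold k
}
mean(cv)                                              # K-fold cross-validation loss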
We also observe that we have a negligible over-fitting because cross-validation provides almost the same numbers as the in-sample loss.⁵

⁵Cross-validation is done on the same partition for all models considered in these notes. Run time measures the time to fit the model once on a personal laptop Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99GHz with 16GB RAM.

Next we estimate the dispersion parameter φ. The resulting Pearson's and deviance estimators are

    φ̂_P = 1.0061    and    φ̂_D = 28.3516 · 10^{-2}.

The Pearson's estimator shows a small over-dispersion; the deviance estimator is more difficult to interpret because its true level for small frequencies is typically not exactly known, see also Figure 1.1. Note that the deviance dispersion estimate (2.14) scales with the number of parameters q, that is, the degrees of freedom are given by 500'000 - (q+1) = 499'989. This slightly corrects for model complexity. However, the resulting estimate is bigger than the cross-validation loss in our example.

Figure 2.3: Estimated marginal frequencies in example GLM1 for age and ac (panels: GLM1 estimated frequency per age of driver and per age of car, observed versus GLM1 estimated).

In Figure 2.3 we present the resulting marginal frequency estimates (averaged over our portfolio); these are compared to the marginal observations. The orange graphs reflect the MLEs provided in Listing 2.2. The graphs are slightly wiggly, which is caused by the fact that we have a heterogeneous portfolio that implies non-homogeneous multiplicative interactions between the feature components age and ac.

Finally, in Figure 2.4 we show the resulting Pearson's and deviance residuals. The different colors illustrate the underlying years at risk v_i ∈ (0,1], i = 1, ..., n. As previously mentioned, we observe that Pearson's residuals are not very robust for small years at risk v_i (red color) because we divide by these (small) volume measures in Pearson's residual definition (2.12).

Figure 2.4: Deviance residuals δ_i^D versus Pearson's residuals δ_i^P; the colors illustrate the underlying years at risk v_i ∈ (0,1].

In general, one should work with deviance residuals because this is a distribution-adapted version of the residuals and considers skewness and tails in the right way (provided that the model choice is appropriate). This finishes the example for the time being. ■

2.4.3 Data compression

We note that Example 2.10 only involves d = 8 · 4 = 32 different risk cells for the corresponding 'age classes' and 'ac classes', see also Figure 2.2. Within these risk cells we assume that the individual insurance policies are homogeneous, i.e. they can be characterized by a common feature value x_k^+ ∈ X = X^{age} × X^{ac} ⊂ R^q with q = 10. Running the R code in Listing 2.2 may be time consuming if the number of policies n is large. Therefore, we could/should try to first compress the data accordingly. For independent Poisson distributed random variables this is rather simple: the aggregation property of Lemma 1.3 implies that we can consider sufficient statistics on the d = |X| = 32 different risk cells. These are described by the (representative) features {x_1^+, ..., x_d^+} = X.
We define the aggregated portfolios in each risk cell k = 1, ..., d by

    N_k^+ = Σ_{i=1}^n N_i 1_{{x_i = x_k^+}}    and    v_k^+ = Σ_{i=1}^n v_i 1_{{x_i = x_k^+}}.

Observe that N_i and v_i are on an individual insurance policy level, whereas N_k^+ and v_k^+ are on an aggregated risk cell (portfolio) level. The consideration of the latter provides a substantial reduction in computational time when calculating the MLE β̂ of β because the n = 500'000 observations are compressed to d = 32 aggregated observations (sufficient statistics). Lemma 1.3 implies that all risk cells k = 1, ..., d are independent and Poisson distributed,

    N_k^+ ~ Poi( λ(x_k^+) v_k^+ ).

The joint log-likelihood function on the compressed data D^+ = {(N_k^+, x_k^+, v_k^+); k = 1, ..., d} is given by

    β ↦ ℓ_{N^+}(β) = Σ_{k=1}^d [ -exp⟨β, x_k^+⟩ v_k^+ + N_k^+ (⟨β, x_k^+⟩ + log v_k^+) - log(N_k^+!) ].    (2.19)

We introduce the design matrix X^+ = (x_{k,l}^+)_{1≤k≤d, 0≤l≤q} ∈ R^{d×(q+1)} on the compressed data.

Listing 2.3: Data compression in R
> library(plyr)   # for ddply()
> dat <- ddply(dat, .(ageGLM, acGLM), summarize, expo = sum(expo), claims = sum(claims))
> str(dat)
'data.frame':   32 obs. of  4 variables:
 $ ageGLM: Factor w/ 8 levels "51to60","18to20",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ acGLM : Factor w/ 4 levels "ac0","ac1","ac2",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ expo  : num  3144.4 6286.4 5438.8 38209.8 38.6 ...
 $ claims: int  712 817 618 3174 19 31 12 410 179 192 ...

In Listing 2.3 we provide the R code for the data compression in each risk cell. Note that ageGLM and acGLM describe the categorical classes, see also Listing 2.1.

Example 2.10, revisited (example GLM2). We revisit example GLM1 (Example 2.10), but calculate the MLE β̂ directly on the aggregated risk cells received from Listing 2.3. The results are presented in Listing 2.4. We note that the result for the MLE β̂ is exactly the same, compare Listings 2.2 and 2.4. Of course, this needs to be the case due to the fact that we work with (aggregated) sufficient statistics in the latter version. The only things that change are the deviance losses because we work on a different scale now. We receive the in-sample loss L_D^is = D*(N^+, λ̂)/d = 32.1610/32 = 1.0050 on the aggregated data D^+. The degrees of freedom are d - (q+1) = 32 - 11 = 21, and this results in the dispersion parameter estimates φ̂_P = 1.5048 and φ̂_D = 1.5315. Thus, we obtain quite some over-dispersion, which indicates that our regression function choice misses important structure.

In Figure 2.5 (lhs) we plot the deviance residuals versus Pearson's residuals, the different colors showing the volumes of the underlying risk cells. We note that the two versions of residuals are almost identical on an aggregate risk cell level. Figure 2.5 (rhs) gives the Tukey-Anscombe plot which plots the residuals versus the fitted means. In this plot we would not like to discover any structure.

In Figure 2.6 we plot the estimated marginal frequencies of example GLM2 (on the different categorical levels). We notice that the observed frequencies N_k^+/v_k^+ (marginally aggregated) exactly match the estimated expected frequencies λ̂(x_k^+) (also marginally aggregated). This is explained by the fact that the Poisson GLM method with categorical coding provides identical results to the method of the total marginal sums by Bailey [7] and Jung [76], see Section 7.1 in Wuthrich [135]. This also explains why the average estimated frequencies of the homogeneous model and example GLM1 are identical in the last column of Table 2.4. This finishes example GLM2. ■
Listing 2.4: Results of example GLM2
Call:
glm(formula = claims ~ ageGLM + acGLM, family = poisson(), data = dat,
    offset = log(expo))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
 -1.9447   -0.8175   -0.1509    0.6370    1.9360

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -1.41592    0.02076  -68.220  < 2e-16 ***
ageGLM18to20   1.12136    0.04814   23.293  < 2e-16 ***
ageGLM21to25   0.48988    0.02909   16.839  < 2e-16 ***
ageGLM26to30   0.13315    0.02473    5.383 7.32e-08 ***
ageGLM31to40   0.01316    0.01881    0.700  0.48403
ageGLM41to50   0.05644    0.01846    3.058  0.00223 **
ageGLM61to70  -0.17238    0.02507   -6.875 6.22e-12 ***
ageGLM71to90  -0.13196    0.02983   -4.424 9.71e-06 ***
acGLMac1      -0.60897    0.02369  -25.708  < 2e-16 ***
acGLMac2      -0.79284    0.02588  -30.641  < 2e-16 ***
acGLMac3+     -1.08595    0.01866  -58.186  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3809.473  on 31  degrees of freedom
Residual deviance:   32.161  on 21  degrees of freedom
AIC: 305.12

Number of Fisher Scoring iterations: 3

Figure 2.5: Example GLM2: (lhs) deviance versus Pearson's residuals, (rhs) Tukey-Anscombe plot.

Figure 2.6: Estimated marginal frequencies in example GLM2 (Example 2.10) for the 'age class' and 'ac class' risk cells.

2.4.4 Issue about low frequencies

In the previous section we have worked on aggregated data D^+ (sufficient statistics). This aggregation is possible on categorical classes, but not necessarily if we consider continuous feature components. This is exactly the reason why we have been starting by only considering two categorized feature components in Example 2.10 (GLM1 and GLM2).

In this section we would like to emphasize a specific issue in insurance of having rather low expected frequencies in the range of 3% to 20%. If, for instance, λ(x) = 5% and v = 1, then we obtain for N ~ Poi(λ(x)v)

    E[N] = 0.05    and    Var(N)^{1/2} = 0.22.

This indicates that the pure randomness is typically of much bigger magnitude than possible structural differences (see also Figure 1.2), and we require a lot of information to distinguish good from bad car drivers. In particular, we need portfolios having appropriate volumes (years at risk). If, instead, these drivers had observations of 10 years at risk, i.e. v = 10, then the magnitudes of order start to change,

    E[N] = 0.50    and    Var(N)^{1/2} = 0.71.

If we take confidence bounds of 2 standard deviations, we obtain for the observed frequency N/v around the expected frequency λ(x) the interval

    [ λ(x) - 2 sqrt(λ(x)/v),  λ(x) + 2 sqrt(λ(x)/v) ].

This implies for λ(x) = 5% that we need v = 2'000 years at risk to detect a structural difference to an expected frequency of 4%. Of course, this is taken care of by building sufficiently large homogeneous sub-portfolios (we have n = 500'000 policies). But if the dimension of the feature space X is too large, or if we have too many categorical feature components, then we cannot always obtain sufficient volumes for parameter estimation (and a dimension reduction technique may need to be applied). Note that the smallest 'age class' 18to20 is rather heterogeneous but it only has a total volume of 1'741.48 years at risk, see Remarks 2.9.
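The volume consideration above is easily reproduced numerically; the following small R sketch evaluates the two-standard-deviation half-widths 2·sqrt(λ(x)/v) for λ(x) = 5%:

lambda <- 0.05
for (v in c(1L, 10L, 2000L)) {
  half_width <- 2 * sqrt(lambda / v)     # 2 standard deviations of N/v under Poi(lambda*v)
  cat(sprintf("v = %4d years at risk: interval 5%% +/- %.3f\n", v, half_width))
}
# only for v = 2000 does the half-width shrink to 0.01, separating 5% from 4%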
Moreover, these considerations also refer to the remark after Example 2.1.

2.4.5 Models GLM3+ considering all feature components

In this section we consider a GLM using all feature components, based on our GLM modeling decision on page 40: we assume that the components power, area and log(dens) can be modeled by a log-linear approach, the component gas is binary, the components age and ac are modeled categorically according to Figure 2.2, and the remaining feature components brand and ct are categorical. This gives us the feature space

    X = R^3 × {0,1} × X^{age} × X^{ac} × X^{brand} × X^{ct} ⊂ R^q,

of dimension q = 3 + 1 + 7 + 3 + 10 + 25 = 49. We call this model example GLM3.

Listing 2.5: Results of example GLM3
Call:
glm(formula = claims ~ power + area + log(dens) + gas + ageGLM +
    acGLM + brand + ct, family = poisson(), data = dat, offset = log(expo))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
 -1.1944   -0.3841   -0.2853   -0.1635    4.3759

Coefficients:
               Estimate Std. Error  z value Pr(>|z|)
(Intercept)  -1.7374212  0.0455594  -38.135  < 2e-16 ***
power        -0.0008945  0.0031593   -0.283 0.777069
area          0.0442176  0.0192701    2.295 0.021755 *
log(dens)     0.0318978  0.0143172    2.228 0.025884 *
gasRegular    0.0491739  0.0128676    3.822 0.000133 ***
ageGLM18to20  1.1409863  0.0483603   23.593  < 2e-16 ***
...
ageGLM71to90 -0.1264737  0.0300539   -4.208 2.57e-05 ***
acGLMac1     -0.6026905  0.0237064  -25.423  < 2e-16 ***
acGLMac2     -0.7740611  0.0259994  -29.772  < 2e-16 ***
acGLMac3+    -1.0242291  0.0204606  -50.059  < 2e-16 ***
brandB10     -0.0058938  0.0445983   -0.132 0.894864
...
brandB5       0.1300453  0.0292474    4.446 8.73e-06 ***
brandB6      -0.0004378  0.0332226   -0.013 0.989486
ctAG         -0.0965560  0.0274785   -3.514 0.000442 ***
...
ctVS         -0.1110315  0.0317446   -3.498 0.000469 ***
ctZG         -0.0724881  0.0463526   -1.564 0.117855
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Null deviance: 145532  on 499999  degrees of freedom
Residual deviance: 140969  on 499950  degrees of freedom
AIC: 191445

The results for the MLE β̂ are provided in Listing 2.5 and the run time is found in Table 2.5. From Listing 2.5 we need to question the modeling of (all) continuous variables power, area and log(dens): either these variables should not be in the model or we should consider them in a different functional form. In Table 2.5 we state the resulting loss figures of model GLM3. They are better than the ones of model GLM1 (which only considers the age and ac classes), but there is still room for improvement compared to the true model λ*.
                        run time  # param.  CV loss L_D^CV  strat. CV loss  est. loss E(λ̂,λ*)  in-sample L_D^is  average frequency
(ChA.1) true model λ*       --        --              --              --                  --           27.7278          10.1991%
(Ch1.1) homogeneous       0.1s         1         29.1066         29.1065              1.3439           29.1065          10.2691%
(Ch2.1) GLM1              3.1s        11         28.3543         28.3544              0.6052           28.3510          10.2691%
(Ch2.3) GLM3             12.0s        50         28.2125         28.2133              0.4794           28.1937          10.2691%
(Ch2.4) GLM4             14.0s        57         28.1502         28.1510              0.4137           28.1282          10.2691%
(Ch2.5) GLM5             13.3s        56         28.1508         28.1520              0.4128           28.1292          10.2691%

Table 2.5: Poisson deviance losses of K-fold cross-validation (1.11) with K = 10, corresponding estimation loss (1.13), and in-sample losses (1.10); green color indicates values which can only be calculated because we know the true model λ*; losses are reported in 10^{-2}; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters; this table follows up from Table 2.4.

As a first model modification we consider power as categorical, merging the powers above and including 9 into one class; thus, we consider 9 different 'power classes'. This equips us with a new feature space

    X = X^{power} × R^2 × {0,1} × X^{age} × X^{ac} × X^{brand} × X^{ct} ⊂ R^q,    (2.21)

of dimension q = 8 + 2 + 1 + 7 + 3 + 10 + 25 = 56. We call this new model GLM4. The results are presented in Listing 2.6. Based on this analysis we keep all components in the model. In particular, we should keep the variable power, but a log-linear shape is not the right functional form to consider power in. This also becomes apparent from the cross-validation analysis given in Table 2.5 on line (Ch2.4). Note that the number of parameters has increased from 11 (GLM1) to 57 (GLM4), and at the same time over-fitting is increasing (difference between the cross-validation loss L_D^CV and the in-sample loss L_D^is). The cross-validation loss has decreased by 28.3543 - 28.1502 = 0.2041, which is consistent with the decrease of 0.1915 in the estimation loss E(λ̂, λ*) from GLM1 to GLM4. Another important note is that all GLMs fulfill the balance property of Proposition 2.4. This is also seen from the last column in Table 2.5.

Next we consider an analysis of variance (ANOVA) which nests GLMs w.r.t. the considered feature components. In this ANOVA, terms are sequentially added to the model. Since the order of this sequential adding is important, we re-order the components on lines 2 and 3 of Listing 2.6 as follows: acGLM, ageGLM, ct, brand, powerGLM, gas, log(dens) and area. This ordering is based on the rationale that the first one is the most important feature component and the last one is the least important one. The ANOVA is then done in R by the command anova. In Listing 2.7 we present the results of this ANOVA. The column Df shows the number of model parameters β_l, 1 ≤ l ≤ q, the corresponding feature component uses. The column Deviance shows the amount of reduction in the in-sample deviance loss D*(N, λ̂) when sequentially adding this feature component, and the last column Resid. Dev shows the remaining in-sample deviance loss. We note that gas and area lead to a comparably small decrease of the in-sample loss, which may question the use of these feature components (in the current form). We can calculate the corresponding p-values of the χ²-test statistics (2.15) with Df degrees of freedom.
Listing 2.6: Results of example GLM4
1  Call:
2  glm(formula = claims ~ powerGLM + area + log(dens) + gas + ageGLM +
3      acGLM + brand + ct, family = poisson(), data = dat, offset = log(expo))
4
5  Deviance Residuals:
6       Min        1Q    Median        3Q       Max
7   -1.1373   -0.3820   -0.2838   -0.1624    4.3856
8
9  Coefficients:
10                Estimate Std. Error z value Pr(>|z|)
11 (Intercept)  -1.903e+00  4.699e-02 -40.509  < 2e-16 ***
12 powerGLM2     2.681e-01  2.121e-02  12.637  < 2e-16 ***
13 powerGLM3     2.525e-01  2.135e-02  11.828  < 2e-16 ***
14 powerGLM4     1.377e-01  2.113e-02   6.516 7.22e-11 ***
15 powerGLM5    -2.498e-02  3.063e-02  -0.816 0.414747
16 powerGLM6     3.009e-01  3.234e-02   9.304  < 2e-16 ***
17 powerGLM7     2.214e-01  3.240e-02   6.835 8.22e-12 ***
18 powerGLM8     1.103e-01  4.128e-02   2.672 0.007533 **
19 powerGLM9    -1.044e-01  4.708e-02  -2.218 0.026564 *
20 area          4.333e-02  1.927e-02   2.248 0.024561 *
21 log(dens)     3.224e-02  1.432e-02   2.251 0.024385 *
22 gasRegular    6.868e-02  1.339e-02   5.129 2.92e-07 ***
23 ageGLM18to20  1.142e+00  4.836e-02  23.620  < 2e-16 ***
24 ...
25 ctZG         -8.123e-02  4.638e-02  -1.751 0.079900
26 ---
27 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
28
29 (Dispersion parameter for poisson family taken to be 1)
30
31     Null deviance: 145532  on 499999  degrees of freedom
32 Residual deviance: 140641  on 499943  degrees of freedom
33 AIC: 191132

Listing 2.7: ANOVA results 1
Analysis of Deviance Table

Model: poisson, link: log

Response: claims

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                     499999     145532
acGLM      3  2927.32   499996     142605
ageGLM     7   850.00   499989     141755
ct        25   363.29   499964     141392
brand     10   124.37   499954     141267
powerGLM   8   315.48   499946     140952
gas        1    50.53   499945     140901
log(dens)  1   255.22   499944     140646
area       1     5.06   499943     140641

Listing 2.8: ANOVA results 2
Analysis of Deviance Table

Model: poisson, link: log

Response: claims

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev
NULL                     499999     145532
acGLM      3  2927.32   499996     142605
ageGLM     7   850.00   499989     141755
ct        25   363.29   499964     141392
brand     10   124.37   499954     141267
powerGLM   8   315.48   499946     140952
gas        1    50.53   499945     140901
area       1   255.20   499944     140646
log(dens)  1     5.07   499943     140641

Listing 2.9: drop1 analysis
Single term deletions

Model:
claims ~ acGLM + ageGLM + ct + brand + powerGLM + gas + areaGLM +
    log(dens)
          Df Deviance    AIC     LRT  Pr(>Chi)
<none>        140641  191132
acGLM      3  142942  193426 2300.61 < 2.2e-16 ***
ageGLM     7  141485  191962  843.91 < 2.2e-16 ***
ct        25  140966  191406  324.86 < 2.2e-16 ***
brand     10  140791  191261  149.70 < 2.2e-16 ***
powerGLM   8  140969  191443  327.68 < 2.2e-16 ***
gas        1  140667  191156   26.32 2.891e-07 ***
areaGLM    1  140646  191135    5.06   0.02453 *
log(dens)  1  140646  191135    5.07   0.02434 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These p-values are all close to zero except for area and log(dens); their p-values are roughly 2.5%, which may allow us to work with a model reduced by area and/or log(dens).
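The quoted p-values follow directly from the χ²-approximation of (2.15); for instance, for area with a deviance reduction of 5.06 on Df = 1 we can verify in R:

pchisq(5.06, df = 1, lower.tail = FALSE)      # ~ 0.0245, the roughly 2.5% quoted above
pchisq(2927.32, df = 3, lower.tail = FALSE)   # acGLM: numerically zero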
One may argue that the ANOVA in Listing 2.7 is not "fully fair" for area because this feature component is considered as the last one in this sequential analysis. Therefore, we revert the order in a second analysis, putting log(dens) to the end of the list, see Listing 2.8. Indeed, in this case we come to the conclusion that the feature component log(dens) may not be necessary. In fact, these two feature components seem to be exchangeable, which is explained by the fact that they are very highly correlated, see Table A.2 in the appendix.

For these reasons we study another model GLM5 where we completely drop area. The cross-validation results are provided on line (Ch2.5) of Table 2.5. These cross-validation losses L_D^CV are slightly bigger for model GLM5 compared to model GLM4; therefore, we decide to keep area in the model, and we work with the feature space (2.21) for the GLM. Remark that this decision is wrong because the estimation loss of 0.4128 in model GLM5 is smaller than the 0.4137 of model GLM4. However, these estimation losses E(λ̂, λ*) are not available in typical applications where the true data generating mechanism λ* is not known.

                 AIC
(Ch2.3) GLM3  191'445
(Ch2.4) GLM4  191'132
(Ch2.5) GLM5  191'135

Table 2.6: Akaike's information criterion AIC.

Alternatively to cross-validation, we could also compare Akaike's information criterion (AIC). In general, the model with the smallest AIC value should be preferred; for the validity of the use of AIC we refer to Section 4.2.3 in [141]. In view of Table 2.6 we come to the same conclusion as with the cross-validation losses in this example.

The ANOVA in Listing 2.7 adds one feature component after the other. In practice, one usually does model selection rather by backward selection. That is, one starts with a complex/full model and recursively drops the least significant variables. If we start with the full model we can perform a drop1 analysis, which is given in Listing 2.9: this analysis individually drops each variable from the full model. Based on this analysis we would first drop areaGLM because it has the highest p-value (received from the Wald statistics, see Section 2.6). However, this p-value is smaller than 5%; therefore, we would not drop any variable on a 5% significance level. The same conclusion is drawn from AIC because the full model has the smallest value. This finishes the GLM example. ■

2.4.6 Generalized linear models: summary

In the subsequent chapters we will introduce other regression model approaches. These approaches will challenge our (first) model choice GLM4 having q + 1 = 57 parameters and feature space (2.21). We emphasize the following points:

• We did not fully fine-tune our model choice GLM4. That is, having n = 500'000 observations we could provide better feature engineering. For instance, we may explore explicit functional forms for ac or age. Moreover, in feature engineering we should also explore potential interactions between the feature components, see Remarks 2.9.

• Exploring functional forms, for instance for age, also has the advantage that neighboring 'age classes' stand in a neighborhood relationship to each other. This is not the case with the categorization used in Figure 2.2.

• Another issue we are going to meet below is over-parametrization and over-fitting.
To prevent over-fitting one may apply regularization techniques, which may tell us that certain parameters or levels/labels are not needed. Regularization will be discussed in Section 4.3.2, below.

• Finally, we may also question the Poisson model assumption. In many real applications one observes so-called zero-inflated claims counts, which means that there are too many zero observations in the data. In this case often a zero-inflated Poisson (ZIP) model is used that adds an extra point mass at zero. If we face over-dispersion, then also a negative binomial model should be considered.

2.5 Classification problem

For classification we consider the following data

    D = {(Y_1, x_1), ..., (Y_n, x_n)},    (2.22)

with features x_i ∈ X and responses Y_i taking values in a finite set Y. For instance, we may consider genders Y = {female, male}. In general, we call the elements of Y classes or levels, and we represent the classes by a finite set of integers (labels) Y = {0, ..., J-1}, for a given J ∈ N. These classes can either be of ordered type (e.g. small, middle, large) or of categorical type (e.g. female, male). Our goal is to construct a classification on X. That is, we aim at constructing a classifier

    C : X → Y,    x ↦ y = C(x).    (2.23)

This classifier may, for instance, describe the most likely outcome y ∈ Y of a (noisy) response Y having feature x. The classifier C provides a finite partition of the feature space X given by

    X = ∪_{y ∈ Y} X(y),    X(y) = { x ∈ X ; C(x) = y }.    (2.24)

2.5.1 Classification of random binary outcomes

Typically, also in classification the data generating mechanism is not known. Therefore, we make a model assumption under which we infer a classifier C of the given data D. Let us focus on the binary situation Y = {0,1}. We assume that all responses of the cases (Y_i, x_i) in D are independent and generated by the following probability law

    π_1(x) = P[Y = 1] = p(x)    and    π_0(x) = P[Y = 0] = 1 - p(x),    (2.25)

for a given (but unknown) probability function

    p : X → [0,1],    x ↦ p(x).    (2.26)

Formula (2.25) describes a Bernoulli random variable Y. A classifier C : X → {0,1} can then be defined by

    C(x) = argmax_{y ∈ Y} π_y(x),

with a deterministic rule if both π_y(x) are equally large. C(x) is the most likely outcome of Y = Y(x). Of course, this can be generalized to multiple response classes J > 2 with corresponding probabilities π_0(x), ..., π_{J-1}(x) for features x ∈ X. This latter distribution is called categorical distribution. For simplicity we restrict to the binary (Bernoulli) case here.

If for all features x there exists y ∈ Y with π_y(x) = 1, there is no randomness involved, i.e. the responses Y are not noisy, and we obtain a deterministic classification problem. Nevertheless, also in this case we typically need to infer the unknown partition (2.24).

2.5.2 Logistic regression classification

Assume we consider the binary situation and that the probability function (2.26) can be described by a logistic functional form through the scalar product ⟨β, x⟩ given in (2.2). This means that we assume for given β ∈ R^{q+1}

    p(x) = exp⟨β, x⟩ / (1 + exp⟨β, x⟩),    or equivalently    ⟨β, x⟩ = log( p(x) / (1 - p(x)) ).    (2.27)

The aim is to estimate β with MLE methods.
Assume we have n independent responses in D, given by (2.22), all being generated by a model of the form (2.25)-(2.26). The joint log-likelihood under assumption (2.27) is then given by, setting Y = (Y_1, ..., Y_n)',

    ℓ_Y(β) = Σ_{i=1}^n Y_i log p(x_i) + (1 - Y_i) log(1 - p(x_i)) = Σ_{i=1}^n Y_i ⟨β, x_i⟩ - log(1 + exp⟨β, x_i⟩).

We calculate the partial derivatives of the log-likelihood function for 0 ≤ l ≤ q

    ∂/∂β_l ℓ_Y(β) = Σ_{i=1}^n ( Y_i - exp⟨β, x_i⟩ / (1 + exp⟨β, x_i⟩) ) x_{i,l}.

Proposition 2.11. The solution β̂ to the MLE problem (2.25)-(2.26) in the logistic case is given by the solution of

    X' Y = X' ( exp{Xβ̂} / (1 + exp{Xβ̂}) ),

for the design matrix X given by (2.7) and where the ratio is understood element-wise. If the design matrix X has full rank q+1 ≤ n, the log-likelihood function is concave and, henceforth, the MLE is unique.

Moreover, the root search problem of the score function in Proposition 2.11 is solved by Fisher's scoring method or the IRLS algorithm, see also Section 2.2. The estimated logistic probabilities are obtained by

    π̂_1(x) = p̂(x) = exp⟨β̂, x⟩ / (1 + exp⟨β̂, x⟩)    and    π̂_0(x) = 1 - π̂_1(x) = 1 / (1 + exp⟨β̂, x⟩).

This provides the estimated classifier

    Ĉ(x) = argmax_{y ∈ Y} π̂_y(x),    (2.28)

with a deterministic rule if both π̂_y(x) are equally large.

Remarks 2.12.

• Model (2.25)-(2.26) was designed for a binary classification problem with two response classes, J = 2 (Bernoulli case). Similar results can be derived for multiple response classes J > 2 (categorical case).

• The logistic approach (2.27) was used to obtain a probability p(x) ∈ (0,1). This is probably the most commonly used approach, but there exist many other (functional) modeling approaches which are in a similar spirit to (2.27).

• In machine learning, the logistic regression assumption (2.27) is also referred to as the sigmoid activation function φ(x) = (1 + e^{-x})^{-1} for x ∈ R. We come back to this in the neural network chapter, see Table 5.1.

• In categorical problems with more than two classes one often replaces the logistic regression function by the so-called softmax function.

• Note that π̂_y(x) is by far more sophisticated than Ĉ(x). For instance, if Y is an indicator of whether a car driver has an accident or not, i.e. Y = 1_{{N ≥ 1}}, then we have in the Poisson case

    π_1(x) = p(x) = P[Y = 1] = P[N ≥ 1] = 1 - exp{-λ(x)v} = λ(x)v + o(λ(x)v),    (2.29)

as λ(x)v → 0. Thus, π_1(x) ≈ λ(x)v for typical car insurance frequencies, say, of magnitude 5% and one year at risk v = 1. This implies that in the low frequency situation we obtain the classifier Ĉ(x) = 0 for all policies. Moreover, for small expected frequencies λ(x) we could also use the logistic regression modeling approach and (2.29) to infer the regression model, we refer to Sections 3.3 and 3.4 in Ferrario et al. [43].

In analogy to the Poisson case we can consider the deviance loss in the binomial case given by

    D*(Y, p̂) = 2 ( ℓ_Y(Y) - ℓ_Y(β̂) ) = -2 Σ_{i=1}^n [ Y_i ⟨β̂, x_i⟩ - log(1 + exp⟨β̂, x_i⟩) ],    (2.30)

because the saturated model provides a log-likelihood equal to zero. Similar to (2.15) this allows us for a likelihood ratio test for parameter reduction. It also suggests the Pearson's residuals and the deviance residuals, respectively,

    δ_i^P = ( Y_i - p̂(x_i) ) / sqrt( p̂(x_i) (1 - p̂(x_i)) ),

    δ_i^D = sgn(Y_i - 0.5) sqrt( -2 ( Y_i ⟨β̂, x_i⟩ - log(1 + exp⟨β̂, x_i⟩) ) ).
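A logistic regression classifier can be fitted in R with glm() and the binomial family; the sketch below uses simulated binary data (the feature and the true probability function are hypothetical) and applies the majority rule (2.28):

set.seed(1)
x <- rnorm(5000)
p <- 1 / (1 + exp(-(-1 + 2 * x)))        # true p(x) of the form (2.27) with beta = (-1, 2)
Y <- rbinom(5000, size = 1, prob = p)    # Bernoulli responses as in (2.25)

fit  <- glm(Y ~ x, family = binomial())  # MLE of beta, cf. Proposition 2.11
coef(fit)                                # close to the true (-1, 2)
phat <- predict(fit, type = "response")  # estimated probabilities pi-hat_1(x_i)
Chat <- as.integer(phat > 0.5)           # estimated classifier (2.28)
mean(Chat != Y)                          # in-sample misclassification rate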
Finally, for out-of-sample back-testing and cross-validation one often considers the probability of misclassification as generalization loss
P[ Y ≠ Ĉ(x) ],
for a classifier Ĉ(·) estimated from randomly chosen i.i.d. data (Y_i, x_i) ∼ P, i = 1, ..., n, where the distribution P has the meaning that we choose at random a policy with feature x_i and corresponding response Y_i, and where (Y, x) is independent of Ĉ(·) and has the same distribution as the cases (Y_i, x_i). Based on a learning sample we estimate the classifier Ĉ(·), and using the (disjoint) test sample we can estimate the probability of misclassification defined by
P[ Y ≠ Ĉ(x) ] = E[ 1_{Y ≠ Ĉ(x)} ],    (2.31)
we also refer to Section 1.3.2. Thus, we use the loss function for misclassification
L(Y, C(x)) = 1_{Y ≠ C(x)}.    (2.32)
We can then apply the same techniques as in Section 1.3.2, i.e. leave-one-out cross-validation or K-fold cross-validation, to estimate this generalization loss.

In relation to Remarks 2.12 one should note that for rare events the misclassification rate should be replaced by other loss functions. In the binary situation with π_1(x) ≪ 1/2 for all x ∈ X, the trivial predictor Ŷ = 0, a.s., obtains an excellent misclassification rate (because of the missing sensitivity of this loss function towards rare events); in machine learning this problem is often called the class imbalance problem. The binomial deviance loss (2.30) is also referred to as the cross-entropy loss, and it may be used instead of the misclassification rate.

Appendix to Chapter 2

2.6 Maximum likelihood estimation

Under fairly general conditions, the MLE satisfies asymptotic normality properties, converging to a multivariate standard Gaussian distribution if properly normalized, see Theorem 6.2.3 in Lehmann [87]. In particular, the MLE is asymptotically unbiased with asymptotic variance described by the inverse of Fisher's information matrix. The log-likelihood function in our Poisson GLM is given by, see (2.4),
ℓ_N(β) = Σ_{i=1}^n −exp⟨β, x_i⟩ v_i + N_i ( ⟨β, x_i⟩ + log v_i ) − log(N_i!),
and the MLE is found by the roots of
(∂/∂β) ℓ_N(β) = 0.
Assuming full rank q + 1 ≤ n of the design matrix X, we receive a unique solution to this root search problem because under this assumption the log-likelihood function is concave. Fisher's information matrix is in this Poisson GLM given by the negative Hessian, see (2.9),
I(β) = −E[ ∇²_β ℓ_N(β) ] = ( Σ_{i=1}^n v_i exp⟨β, x_i⟩ x_{i,l} x_{i,r} )_{l,r=0,...,q} = −∇²_β ℓ_N(β),
where the last identity holds because the Hessian of the Poisson log-likelihood does not depend on the observations N.
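In R, the estimated inverse of Fisher's information matrix is returned by vcov() for a fitted glm object; a minimal sketch (the Poisson GLM fit below, with the column names of Listing 4.1, is our assumption):

# minimal sketch, assuming a fitted Poisson GLM as in Chapter 2, e.g.
fit <- glm(claims ~ ageGLM + acGLM, data = dat, family = poisson(), offset = log(expo))
V  <- vcov(fit)        # estimated inverse Fisher information I(beta)^{-1}
se <- sqrt(diag(V))    # asymptotic standard errors, as reported in summary(fit)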
Chapter 3 Generalized Additive Models

3.1 Generalized additive models for Poisson regressions

Generalized additive models (GAMs) replace the log-linear terms of the Poisson GLM for the continuous feature components by smooth functions. In the Poisson case the regression function is of the form
log λ(x) = β_0 + Σ_{l=1}^q f_l(x_l),    (3.1)
for an intercept β_0 ∈ R and sufficiently smooth functions f_l.
• For identifiability we require that the functions f_l satisfy for all l = 1, ..., q
(1/n) Σ_{i=1}^n f_l(x_{i,l}) = 0.    (3.2)
• The log-link in (3.1) leads to a multiplicative tariff structure
λ(x) = exp{β_0} Π_{l=1}^q exp{f_l(x_l)}.    (3.3)
Normalization (3.2) then implies that exp{β_0} describes the base premium and we obtain multiplicative correction factors exp{f_l(x_l)} relative to this base premium for features x ∈ X; we also refer to (2.10).
In the sequel of this section we consider natural cubic splines for the modeling of f_l.

3.1.1 Natural cubic splines

A popular approach is to use natural cubic splines for the functions f_l(·) in (3.1). Choose a set of m knots u_1 < ... < u_m on the real line R and define the function f : [u_1, u_m] → R as follows:
f(x) = h_t(x),   for x ∈ [u_t, u_{t+1}),    (3.4)
where for t = 1, ..., m−1 we choose cubic functions h_t(x) = a_t + b_t x + c_t x² + d_t x³ on R. For the last index t = m−1 we extend (3.4) to the closed interval [u_{m−1}, u_m]. We say that the function f defined in (3.4) is a cubic spline if it satisfies the following differentiability (smoothing) conditions in the (internal) knots u_2 < ... < u_{m−1}
h_{t−1}(u_t) = h_t(u_t),   h'_{t−1}(u_t) = h'_t(u_t)   and   h''_{t−1}(u_t) = h''_t(u_t),    (3.5)
for all t = 2, ..., m−1. The function f has 4(m−1) parameters and the constraints (3.5) reduce these by 3(m−2). Therefore, a cubic spline has m+2 degrees of freedom. Observe that (3.5) implies that a cubic spline is twice continuously differentiable on the interval (u_1, u_m).

If we extend this cubic spline f twice continuously differentiably to an interval [a, b] ⊇ [u_1, u_m] with linear extensions on [a, u_1] and [u_m, b], we call this spline a natural cubic spline. This provides two additional boundary constraints, f''(x) = 0 on [a, u_1] ∪ [u_m, b], thus h''_1(u_1) = h''_{m−1}(u_m) = 0, and it reduces the degrees of freedom by 2. Therefore, a natural cubic spline has m degrees of freedom. Note that a natural cubic spline can be represented by the truncated power functions x ↦ (x − u_t)_+³, t = 1, ..., m. Namely,
f(x) = a_0 + b_0 x + Σ_{t=1}^m c_t (x − u_t)_+³,   with   Σ_{t=1}^m c_t = 0   and   Σ_{t=1}^m c_t u_t = 0,    (3.6)
gives a natural cubic spline. The two side constraints ensure that we have a smooth linear extension above u_m. This again provides m degrees of freedom, and these natural cubic splines (with m given knots) build an m-dimensional linear space.

The knots u_1 < ... < u_m play a particularly special role: assume that in these knots we are given values f*_1, ..., f*_m. Then there exists a unique natural cubic spline f on [a, b] with f(u_t) = f*_t for all t = 1, ..., m, see Theorem 5.1 in [102].

We would like to solve the following optimization problem: find the intercept β_0 and the natural cubic splines f_1, ..., f_q satisfying the normalization conditions (3.2) such that the following expression is minimized
D*(N, λ) + Σ_{l=1}^q η_l ∫_{a_l}^{b_l} ( f_l''(x_l) )² dx_l,    (3.7)
with observations N = (N_1, ..., N_n)', intercept and natural cubic splines f = (β_0; f_1, ..., f_q) determining the GAM regression function λ(·) by (3.1), tuning parameters η_l > 0, l = 1, ..., q, regularization terms
∫_{a_l}^{b_l} ( f_l''(x_l) )² dx_l,   for l = 1, ..., q,    (3.8)
and Poisson deviance loss given by
D*(N, λ) = 2 Σ_{i=1}^n [ λ(x_i) v_i − N_i + N_i log( N_i / (λ(x_i) v_i) ) ],
where the i-th term on the right-hand side is set equal to 2 λ(x_i) v_i for N_i = 0. The supports [a_l, b_l] of the natural cubic splines f_l should contain all observed feature components x_{i,l} of D.

Remarks 3.2.
• We refer to page 35 for the interplay between maximizing the log-likelihood function and minimizing the deviance loss. In short, we are minimizing the in-sample loss in (3.7), modeled by the deviance loss D*(N, λ), which is equivalent to maximizing the corresponding log-likelihood function, subject to the regularization conditions that we discuss next.
• The regularization conditions for the natural cubic splines f_l guarantee in the optimization of (3.7) that the resulting (optimal) functions f_l do not look too wild over their domains [a_l, b_l] ⊇ [u_{1,l}, u_{m_l,l}], where u_{1,l}, ..., u_{m_l,l} are the m_l knots of f_l. In particular, we want to prevent over-fitting to the data D. The tuning parameters η_l > 0 balance the influence of the regularization conditions (3.8). Appropriate tuning parameters are (often) determined by cross-validation.
Usually, one starts from a more general optimization problem, namely, minimize
D*(N, λ) + Σ_{l=1}^q η_l ∫_{a_l}^{b_l} ( f_l''(x_l) )² dx_l
  = 2 Σ_{i=1}^n [ exp{β_0 + Σ_{l=1}^q f_l(x_{i,l})} v_i − N_i − N_i ( β_0 + Σ_{l=1}^q f_l(x_{i,l}) ) + N_i log(N_i/v_i) ] + Σ_{l=1}^q η_l ∫_{a_l}^{b_l} ( f_l''(x_l) )² dx_l    (3.9)
over β_0 ∈ R and over all (normalized) functions f_1, ..., f_q that are twice continuously differentiable on their domains.
▷ Assume that the feature component x_l has exactly m_l ≤ n different values x*_{1,l} < ... < x*_{m_l,l} among the observations D. This implies that the deviance loss D*(N, λ) only depends on these values f*_{i,l} = f_l(x*_{i,l}) for i = 1, ..., m_l and l = 1, ..., q, see the middle line of (3.9).
▷ Theorem 5.1 (part 1) in [102] says that for given values (x*_{i,l}, f*_{i,l})_{i=1,...,m_l} there exists a unique natural cubic spline f*_l on [a_l, b_l] with f*_l(x*_{i,l}) = f*_{i,l} for all i = 1, ..., m_l. Thus, a choice (x*_{i,l}, f*_{i,l})_{i=1,...,m_l} completely determines the natural cubic spline f*_l.
▷ Theorem 5.1 (part 2) in [102] says that any twice continuously differentiable function f_l with f_l(x*_{i,l}) = f*_{i,l} for all i = 1, ..., m_l has a regularization term (3.8) that is at least as big as the one of the corresponding natural cubic spline f*_l.
For these reasons we can restrict the (regularized) optimization to natural cubic splines. This basically means that we need to find the optimal values (f*_{i,l})_{i=1,...,m_l} in the feature components/knots (x*_{i,l})_{i=1,...,m_l}, subject to the regularization conditions (3.8) for given tuning parameters η_l > 0 and the normalization (3.2) for l = 1, ..., q.
• Note that m_l ≤ n is quite common; for instance, our n = 500'000 car drivers only have m_1 = 90 − 18 + 1 = 73 different drivers' ages, see Appendix A. In this case, the deviance loss D*(N, λ) is for feature component x_1 completely determined by the values f*_{i,1} = f_1(x*_{i,1}), i = 1, ..., m_1 = 73. For computational reasons one may even merge more of the feature component values (by rounding/bucketing) to get fewer values f*_{i,l} and knots u_{i,l} = x*_{i,l} in the optimization.

Since natural cubic splines with given knots x*_{1,l}, ..., x*_{m_l,l} build an m_l-dimensional linear space, we may choose m_l linearly independent basis functions s_{1,l}, ..., s_{m_l,l} (of the form (3.6)) and represent the natural cubic splines f_l on [a_l, b_l] by
f_l(x_l) = Σ_{k=1}^{m_l} β_{k,l} s_{k,l}(x_l),    (3.10)
for unique constants β_{1,l}, ..., β_{m_l,l}. Observe that (3.10) brings us fairly close to the GLM framework of Chapter 2. Assume that all natural cubic splines f_l are represented in the form (3.10), having parameters β_{1,l}, ..., β_{m_l,l} and corresponding basis functions s_{1,l}, ..., s_{m_l,l}. Then we can rewrite (3.1) as
log λ(x) = β_0 + Σ_{l=1}^q Σ_{k=1}^{m_l} β_{k,l} s_{k,l}(x_l) = ⟨β, s(x)⟩,    (3.11)
where the last identity defines the scalar product between β = (β_0, β_{1,1}, ..., β_{m_q,q})' and s(x) = (1, s_{1,1}(x_1), ..., s_{m_q,q}(x_q))' ∈ R^{r+1}, with r = Σ_{l=1}^q m_l. Moreover, note that for these natural cubic splines we have
∫_{a_l}^{b_l} ( f_l''(x_l) )² dx_l = ∫_{u_{1,l}}^{u_{m_l,l}} ( Σ_{k=1}^{m_l} β_{k,l} s''_{k,l}(x_l) )² dx_l = Σ_{k,j=1}^{m_l} β_{k,l} β_{j,l} ∫_{u_{1,l}}^{u_{m_l,l}} s''_{k,l}(x_l) s''_{j,l}(x_l) dx_l = Σ_{k,j=1}^{m_l} β_{k,l} β_{j,l} ω^{(l)}_{k,j},
where the last identity defines the weights ω^{(l)}_{k,j}. Therefore, optimization problem (3.7) is transformed to (we drop the irrelevant terms): minimize over β ∈ R^{r+1} the objective function
Σ_{i=1}^n [ exp⟨β, s(x_i)⟩ v_i − N_i ⟨β, s(x_i)⟩ ] + β' Ω(η) β,    (3.12)
with block-diagonal matrix Ω(η) = diag(0, Ω_1, ..., Ω_q) ∈ R^{(r+1)×(r+1)} having blocks
Ω_l = Ω_l(η_l) = η_l ( ω^{(l)}_{k,j} )_{k,j=1,...,m_l} ∈ R^{m_l × m_l},
for l = 1, ..., q, tuning parameters η = (η_1, ..., η_q)', and under the side constraints (3.2). The optimal parameter β̂ of this optimization problem is found numerically.
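For η_l = 0 the penalty in (3.12) vanishes and we are back in a Poisson GLM with the spline basis (3.10) as design. A minimal sketch of this unpenalized special case, using the natural cubic spline basis generator ns() from the R package splines (the column names claims, age, ac, expo of the data frame dat and the chosen basis dimension df = 10 are our assumptions):

library(splines)
# unpenalized special case of (3.12): Poisson GLM in a natural cubic spline basis (3.10);
# the knots chosen by ns() play the role of the u_{t,l}, the intercept is not penalized
fit0 <- glm(claims ~ ns(age, df = 10) + ns(ac, df = 10),
            data = dat, family = poisson(), offset = log(expo))
lambda_hat <- predict(fit0, type = "response") / dat$expo   # estimated frequencies lambda(x)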
Remarks 3.3.
• The normalization conditions (3.2) provide identifiability of the parameters:
0 = Σ_{i=1}^n f_l(x_{i,l}) = Σ_{i=1}^n Σ_{k=1}^{m_l} β_{k,l} s_{k,l}(x_{i,l}) = Σ_{k=1}^{m_l} ( β_{k,l} Σ_{i=1}^n s_{k,l}(x_{i,l}) ).
Often these conditions are not enforced at this stage, but the functions are only adjusted at the very end of the procedure. In particular, we first calculate a relative estimator and the absolute level is only determined at the final stage of the calibration. This relative estimator can be determined by the back-fitting algorithm for additive models; for details see Algorithm 9.1 in Hastie et al. [62] and Section 5.4.2 in Ohlsson–Johansson [102].
• A second important remark is that the calculation can be accelerated substantially if one "rounds" the feature components. For instance, for the feature component dens we may choose the units in hundreds, which substantially reduces m_l and, hence, the computational complexity, because we obtain fewer knots and hence a lower dimensional β. Often one does not lose (much) accuracy by this rounding and, in particular, one is less in the situation of a potential over-parametrization and over-fitting; we also refer to Section 2.4.3 on data compression.
• In the example below, we use the command gam from the R package mgcv.

3.1.2 Example in motor insurance pricing, revisited

We present two applications of the GAM approach. We revisit the car insurance example of Section 2.4 which is based on the data generated in Appendix A. The first application only considers the two feature components age and ac; this is similar to Example 2.10. The second application considers all feature components, similarly to Section 2.4.5.

Example 3.4 (example GAM1). We consider the same set-up as in Example 2.10 (GLM1) but we model the feature components age and ac by natural cubic splines. Note that we have m_1 = 73 different ages of drivers and m_2 = 36 different ages of cars. For computational reasons, it is important in GAM calibrations that the data is compressed accordingly, as described in Section 2.4.3. This gives us at most m_1 · m_2 = 73 · 36 = 2'628 different risk cells. In our data D, only 2'223 of these risk cells are non-empty, i.e. have a volume v_t^+ > 0, for the latter see also (2.18). Thus, the data is reduced from n = 500'000 observations to a sufficient statistics of size 2'223. (This data compression reduces the run time of the GAM calibration from 1'018 seconds to 1 second!)

The natural cubic splines approach is implemented in the command gam of the R package mgcv, and the corresponding results are provided in Listing 3.1.

Listing 3.1: Results of example GAM1
 1 Family: poisson
 2 Link function: log
 3
 4 Formula:
 5 claims ~ s(age, bs = "cr", k = k1) + s(ac, bs = "cr", k = k2)
 6
 7 Parametric coefficients:
 8             Estimate Std. Error z value Pr(>|z|)
 9 (Intercept) -2.39356    0.02566  -93.29   <2e-16 ***
10 ---
11 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
12
13 Approximate significance of smooth terms:
14          edf Ref.df Chi.sq p-value
15 s(age) 12.48  15.64   1162  <2e-16 ***
16 s(ac)  17.43  20.89   3689  <2e-16 ***
17 ---
18 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
19
20 R-sq.(adj) = 0.963   Deviance explained = 65%
21 UBRE = -0.012852   Scale est. = 1   n = 2223
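A minimal sketch of how the data compression and the subsequent gam fit might be carried out (the column names claims, expo, age, ac of the policy data frame dat are naming assumptions):

library(mgcv)
# compress the policy data to the non-empty (age, ac) risk cells, see Section 2.4.3:
# sum claims and years at risk per cell
dat_comp <- aggregate(cbind(claims, expo) ~ age + ac, data = dat, FUN = sum)

# fit GAM1 with cubic regression splines; the tuning parameters are selected
# internally by the UBRE/GCV criterion because no `sp` argument is supplied
k1 <- 73; k2 <- 36
fit_gam1 <- gam(claims ~ s(age, bs = "cr", k = k1) + s(ac, bs = "cr", k = k2),
                data = dat_comp, family = poisson(), offset = log(expo))
fit_gam1$sp     # selected tuning parameters (eta_1, eta_2), see gam$sp below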
The feature components s(age, bs="cr", k=k1) and s(ac, bs="cr", k=k2) are being modeled as continuous variables (indicated by s(·)) and we fit cubic regression splines (indicated by bs="cr"). The parameters k1 = m_1 = 73 and k2 = m_2 = 36 indicate how many knots we would like to have for each of the two feature components age and ac. Note that we choose the maximal number of possible knots (different labels) here, see also Remarks 3.2. This number is usually too large and one should choose a lower number of knots for computational reasons.

We did not specify the tuning parameters η_l in Listing 3.1. If we drop these tuning parameters in the R command gam, then a generalized cross-validation (GCV) criterion or an unbiased risk estimator (UBRE) criterion is applied (internally) to determine good tuning parameters. The GCV criterion considers a scaled in-sample loss given by
GCV(η) = ( 1 − M(η)/n )^{−2} L^is_D = ( n / (n − M(η)) )² L^is_D,    (3.13)
where M(η) is the effective degrees of freedom of the model. This effective degrees of freedom is obtained from the corresponding influence matrix; for more details we refer to Wood [134], Section 3.2.3, and Hastie et al. [62], Section 7.10.1. The GCV criterion has the advantage over K-fold cross-validation that it is computationally much faster, which is important in the optimization algorithm. We can extract the resulting optimal tuning parameters with the command gam$sp, which provides (η̂_1, η̂_2) = (273'401, 2'591) in our example.

The column edf in Listing 3.1 shows the effective degrees of freedom, which corresponds to the number of knots needed (for these optimal tuning parameters); for details we refer to Section 5.4.1 in Hastie et al. [62]. If edf is close to 1 then we basically fit a straight line, which means that we can use the GLM (log-linear form) for this feature component. The last two columns on lines 15-16 of Listing 3.1 give an approximate χ²-test for assessing the significance of the corresponding smooth term. The corresponding p-values are roughly zero, which says that we should keep both feature components.

Figure 3.1: GAM1 results with (η̂_1, η̂_2) = (273'401, 2'591), and k1 = 73 and k2 = 36; (lhs) fitted splines s(age, bs="cr", k=k1) and s(ac, bs="cr", k=k2) excluding observations, and (rhs) including observations; note that (η̂_1, η̂_2) is determined by GCV here.

In Figure 3.1 we provide the resulting marginal plots for age and ac. These are excluding (lhs) and including (rhs) the observations (and therefore have different scales on the y-axis). Moreover, we provide confidence bounds in red color. From these plots it seems that age is fitted well, but that ac is over-fitting for higher ages of the car; thus, one should either decrease the number of knots k2 or increase the tuning parameter η_2. In Figure 3.2 we provide the resulting two-dimensional surface (from two different angles). We observe a rather steep slope for small age and small ac which reflects the observed frequency behavior.

Figure 3.2: GAM1 results with (η̂_1, η̂_2) = (273'401, 2'591), and k1 = 73 and k2 = 36, from two different angles.
                        run time  # param.   CV loss   strat. CV   est. loss   in-sample   avg. frequency
(ChA.1) true model λ*        --        --        --          --          --     27.7278         10.1991%
(Ch1.1) homogeneous        0.1s         1   29.1066     29.1065      1.3439     29.1065         10.2691%
(Ch2.1) GLM1               3.1s        11   28.3543     28.3544      0.6052     28.3510         10.2691%
(Ch3.1) GAM1               1.1s       108   28.3248     28.3245      0.5722     28.3134         10.2691%

Table 3.1: Poisson deviance losses of K-fold cross-validation (1.11) with K = 10 (CV loss and stratified CV loss L^CV_D), corresponding estimation loss E(λ̂, λ*) of (1.13), and in-sample losses L^is_D of (1.10); the estimation loss and the loss of the true model can only be calculated because we know the true model λ*; losses are reported in 10^{-2}; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters; this table follows up from Table 2.4.

In Table 3.1 we provide the corresponding prediction results on line (Ch3.1). The number of parameters '# param.' is determined by the number of chosen knots minus one, i.e. k1 + k2 − 1 = m_1 + m_2 − 1 = 73 + 36 − 1 = 108, where the minus one corresponds to the intercept parameter minus two normalization conditions (3.2). We observe a smaller in-sample loss of GAM1 compared to GLM1, i.e. the GAM can better adapt to the data D than the GLM (considering the same feature components age and ac). This shows that the choices of the categorical classes for age and ac can be improved in the GLM approach. This better performance of the GAM carries over to the cross-validation losses L^CV_D and the estimation loss E(λ̂, λ*). We also note that the difference between the in-sample loss and the cross-validation loss increases, which shows that the GAM has a higher potential for over-fitting than the GLM, here.

In Figure 3.3 we present the results for different choices of the tuning parameters η_l. For small tuning parameters (η_l = 10 in our situation) we obtain wiggly pictures which follow the observations closely. For large tuning parameters (η_l = 10'000'000 in our situation) we obtain rather smooth graphs with small degrees of freedom, see Figure 3.3 (rhs). Thus, we may get either over-fitting (lhs) or under-fitting (rhs) to the data.

Figure 3.3: GAM1 results with (lhs) (η_1, η_2) = (10, 10) and (rhs) (η_1, η_2) = (10'000'000, 10'000'000), for k1 = 73 and k2 = 36.

Figure 3.4: GAM1 results with number of knots (lhs) k1 = 12 and k2 = 17 and (rhs) k1 = 5 and k2 = 5; note that (η_1, η_2) is determined by GCV here.

In Figure 3.4 we change the number of knots (k1, k2) and we let the optimal tuning parameters be determined using the GCV criterion. We note that for less knots the picture starts to smooth and the confidence bounds get more narrow because we have less parameters to be estimated. For one knot we receive a log-linear GLM component.

Finally, we evaluate the quality of the GCV criterion. From Listing 3.1 we see that the effective degrees of freedom are M(η̂) = 1 + 12.48 + 17.43 = 30.91. Thus, we obtain
GCV(η̂) = ( n / (n − M(η̂)) )² L^is_D = ( 500'000 / (500'000 − 30.91) )² · 28.3134 · 10^{-2} = 28.3166 · 10^{-2}.
This is smaller than the K-fold cross-validation errors L^CV_D on line (Ch3.1) of Table 3.1.
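This adjustment can be reproduced directly from the reported numbers (a minimal sketch in R; the figures are taken from Listing 3.1 and Table 3.1):

n    <- 500000
M    <- 1 + 12.48 + 17.43   # effective degrees of freedom M(eta) from Listing 3.1
L_is <- 28.3134e-2          # in-sample Poisson deviance loss of GAM1, see Table 3.1
(n / (n - M))^2 * L_is      # GCV-adjusted loss (3.13), approximately 28.3166e-2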
This indicates that GCV is likely to under-estimate the over-fitting here. This closes our first GAM example. ■

Example 3.5 (examples GAM2/3). In our second example we consider all available feature components. We keep gas, brand and ct as categorical. All other feature components are modeled with natural cubic splines, using the GCV criterion to determine the optimal tuning parameters η_l. To keep the computational time under control we use data compression, and before that we modify log(dens)/2 to be rounded to one digit (which gives 48 different values). The data is then compressed correspondingly, leading to d = 235'163 compressed observations (the data compression takes 44 seconds). This compression gives us 73 + 36 + 12 + 2 + 11 + 6 + 48 + 26 + 1 − 8 = 207 parameters to be estimated. This should be compared to Conclusion 2.8.

The estimation results are presented in Listing 3.2.

Listing 3.2: Results of example GAM2
 1 Family: poisson
 2 Link function: log
 3
 4 Formula:
 5 claims ~ s(age, bs="cr", k=73) + s(ac, bs="cr", k=36) + s(power, bs="cr", k=12)
 6   + gas + brand + s(area, bs="cr", k=6) + s(densGAM, bs="cr", k=48) + ct
 7
 8 Parametric coefficients:
 9              Estimate Std. Error  z value Pr(>|z|)
10 (Intercept) -2.254150   0.020420 -110.391  < 2e-16 ***
11 gasRegular   0.073706   0.013506    5.457 4.83e-08 ***
12 brandB10     0.044510   0.045052    0.988 0.323174
13 .
14 brandB5      0.107430   0.029505    3.641 0.000271 ***
15 brandB6      0.020794   0.033406    0.622 0.533627
16 ctAG        -0.094460   0.027672   -3.414 0.000641 ***
17 .
18 ctZG        -0.079805   0.046513   -1.716 0.086208 .
19 ---
20 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
21
22 Approximate significance of smooth terms:
23               edf Ref.df   Chi.sq p-value
24 s(age)     13.132 16.453 1161.419  <2e-16 ***
25 s(ac)      17.680 21.147 2699.012  <2e-16 ***
26 s(power)    9.731 10.461  311.483  <2e-16 ***
27 s(area)     1.009  1.016    5.815  0.0164 *
28 s(densGAM)  5.595  7.071   10.251  0.1865
29 ---
30 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
31
32 R-sq.(adj) = 0.115   Deviance explained = 4.74%
33 UBRE = -0.56625   Scale est. = 1   n = 235163

The upper part of Listing 3.2 (lines 10-18) shows the intercept estimate β̂_0 as well as the estimates for the categorical variables gas, brand and ct. A brief inspection of these numbers shows that we keep all these variables in the model.

Figure 3.5: GAM2 results: marginal spline approximations of the continuous feature components age, ac, power, area and log(dens).

The lower part of Listing 3.2 (lines 24-28) shows the results for the 5 continuous feature components age, ac, power, area and log(dens). The first 3 continuous feature components should be included in this natural cubic spline form. The effective degrees of freedom of the feature component area are very close to 1, which suggests that this component should be modeled in log-linear form. Finally, the component log(dens) does not seem to be significant, which means that we may drop it from the model. In Figure 3.5 we show the resulting marginal spline approximations, which confirm the findings of Listing 3.2; in particular, area is log-linear whereas log(dens) can be modeled by almost a horizontal line (respecting the red confidence bounds and keeping in mind that area and dens are strongly positively correlated, see Table A.2).

In Table 3.2 we provide the resulting losses of model GAM2 on line (Ch3.2).
Firstly, we see that the run time of the calibration is roughly 10 minutes, i.e. quite long. This run time could be reduced if we chose fewer knots in the natural cubic splines. Secondly, we observe that the estimation loss E(λ̂, λ*) and the in-sample loss L^is_D become smaller compared to model GLM4, which shows that the categorization in GLM4 is not optimal. We refrain from calculating the cross-validation losses for model GAM2 because this takes too much time.

                        run time  # param.   CV loss   strat. CV   est. loss   in-sample   avg. frequency
(ChA.1) true model λ*        --        --        --          --          --     27.7278         10.1991%
(Ch1.1) homogeneous        0.1s         1   29.1066     29.1065      1.3439     29.1065         10.2691%
(Ch2.4) GLM4                14s        57   28.1502     28.1510      0.4137     28.1282         10.2691%
(Ch3.2) GAM2               678s       207        --          --      0.3877     28.0927         10.2690%
(Ch3.3) GAM3                50s        79   28.1378     28.1380      0.3967     28.1055         10.2691%

Table 3.2: Poisson deviance losses of K-fold cross-validation (1.11) with K = 10 (CV loss and stratified CV loss L^CV_D), corresponding estimation loss E(λ̂, λ*) of (1.13), and in-sample losses L^is_D of (1.10); the estimation loss and the loss of the true model can only be calculated because we know the true model λ*; losses are reported in 10^{-2}; run time gives the time needed for model calibration, and '# param.' gives the number of estimated model parameters; this table follows up from Table 2.5.

Finally, in Figure 3.6 we illustrate the resulting two-dimensional surfaces of the continuous feature components ac, power, age and log(dens).

Figure 3.6: GAM2 results of the continuous feature components ac, power, age and log(dens).

In view of Listing 3.2 we may consider a simplified GAM. We model the feature components age and ac by natural cubic splines having k1 = 14 and k2 = 18 knots, respectively, because the corresponding effective degrees of freedom in Listing 3.2 are 13.132 and 17.680 (the tuning parameters η_l are taken equal to zero for these two components). The component power is transformed to a categorical variable because its effective degrees of freedom of 9.731 are almost equal to the number of levels (these are 11 parameters under dummy coding), suggesting that each label should have its own regression parameter. The component area is considered in log-linear form, the component dens is dropped from the model, and gas, brand and ct are considered as categorical variables. We call this model GAM3. The calibration of this GAM takes roughly 50 seconds, which makes it feasible to perform K-fold cross-validation.
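A minimal sketch of how such a K-fold cross-validation loss might be computed for a fitted GAM (the data frame dat, the fold construction and the per-observation scaling of the Poisson deviance loss, following (1.10), are our assumptions):

library(mgcv)
K <- 10
set.seed(1)
fold <- sample(rep(1:K, length.out = nrow(dat)))     # random partition into K folds
dev_loss <- function(N, lam_v) {                     # average Poisson deviance loss
  2 * sum(lam_v - N + ifelse(N > 0, N * log(N / lam_v), 0)) / length(N)
}
cv <- numeric(K)
for (k in 1:K) {
  fit_k <- gam(claims ~ s(age, bs = "cr", k = 14) + s(ac, bs = "cr", k = 18)
               + powerCat + gas + brand + area + ct + offset(log(expo)),
               data = dat[fold != k, ], family = poisson())
  lam_v <- predict(fit_k, newdata = dat[fold == k, ], type = "response")  # lambda(x) * v
  cv[k] <- dev_loss(dat$claims[fold == k], lam_v)
}
mean(cv)    # K-fold cross-validation loss, cf. line (Ch3.3) of Table 3.2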
Listing 3.3: Results of example GAM3
 1 Family: poisson
 2 Link function: log
 3
 4 Formula:
 5 claims ~ s(age, bs = "cr", k = 14) + s(ac, bs = "cr", k = 18)
 6   + powerCat + gas + brand + area + ct
 7
 8 Parametric coefficients:
 9              Estimate Std. Error z value Pr(>|z|)
10 (Intercept) -2.674145   0.032674 -81.844  < 2e-16 ***
11 powerGAM2    0.264250   0.021278  12.419  < 2e-16 ***
12 .
13 powerGAM12  -0.275665   0.109984  -2.506 0.012197 *
14 gasRegular   0.072807   0.013526   5.383 7.34e-08 ***
15 brandB10     0.045986   0.045066   1.020 0.307524
16 .
17 brandB5      0.107218   0.029511   3.633 0.000280 ***
18 brandB6      0.021155   0.033407   0.633 0.526560
19 area         0.084586   0.005360  15.782  < 2e-16 ***
20 ctAG        -0.102839   0.027335  -3.762 0.000168 ***
21 .
22 ctZG        -0.085552   0.046349  -1.846 0.064917 .
23 ---
24 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
25
26 Approximate significance of smooth terms:
27        edf Ref.df Chi.sq p-value
28 s(age)  13     13   1192  <2e-16 ***
29 s(ac)   17     17   2633  <2e-16 ***
30 ---
31 Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
32
33 R-sq.(adj) = 0.0334   Deviance explained = 3.44%
34 UBRE = -0.71863   Scale est. = 1   n = 500000

The results are presented on line (Ch3.3) of Table 3.2 and in Listing 3.3 (we do not choose any regularization here because the chosen numbers of knots are assumed to be optimal). We conclude that this model GAM3 has a clearly better performance than model GLM4. It is slightly worse than GAM2; however, this comes at a much lower run time. For this reason we prefer model GAM3 in all subsequent considerations.

In Figure 3.7 (lhs) we provide the scatter plot of the resulting estimated frequencies (on the log scale) of the two models GLM4 and GAM3. The colors illustrate the years at risk v_i on a policy level. The cloud of frequencies lies on the 45 degrees axis, which indicates that model GLM4 and model GAM3 make similar predictions λ̂(x_i) on a policy level. In Figure 3.7 (middle and rhs) we compare these predictions to the true frequencies λ*(x_i) (which are available for our synthetic data). These two clouds have a much wider diameter, which says that the models GLM4 and GAM3 have quite some room for improvements. This is what we are going to explore in the next chapters. This finishes the GAM example. ■

Figure 3.7: Resulting estimated frequencies (on the log scale) of models GLM4 and GAM3, also compared to the true frequencies (x-axes: log frequency GLM4 and log of true frequency).

3.1.3 Generalized additive models: summary

We have seen that GAMs provide an improvement over GLMs for non log-linear feature components. In particular, GAMs make it redundant to categorize non log-linear feature components in GLM applications. This has the advantage that the ordinal relationship is kept. The drawback of GAMs is that they are computationally more expensive than GLMs.

We have not explored interactions beyond multiplicative structures, yet. This may be a main reason why the comparisons in Figure 3.7 (middle and rhs) do not look fully convincing. Having 500'000 observations we could directly explore such interactions in GLMs and GAMs, see also Remarks 2.9. We refrain from doing so, but we will provide other data driven methods below.

3.2 Multivariate adaptive regression splines

The technique of multivariate adaptive regression splines (MARS) uses piece-wise linear basis functions. These piece-wise linear basis functions have the following two forms on X ⊂ R^q:
x ↦ h_{−t,l}(x) = (x_l − t)_+   or   x ↦ h_{+t,l}(x) = (t − x_l)_+,    (3.14)
for a given constant t and a given feature component x_l. The pair in (3.14) is called a reflected pair with knot t for feature component x_l. These reflected pairs consist of so-called rectified linear unit (ReLU) or hinge functions. We define the set of all such reflected pairs that are generated by the features in D. These are given by the basis
H = { h_{−t_l,l}(·), h_{+t_l,l}(·) ;  l = 1, ..., q and t_l = x_{1,l}, ..., x_{n,l} }.
Note that H spans a space of continuous piece-wise linear functions (splines).
MARS makes the following modeling approach: choose the expected frequency function as
x ↦ log λ(x) = β_0 + Σ_{m=1}^M β_m h_m(x),    (3.15)
for given M > 0, where the h_m(·) are functions from the basis H or products of such basis functions in H. The latter implies that the non-zero part of h_m(·) can also be non-linear. Observe that this approach is rather similar to the GAM approach with natural cubic splines. The natural cubic splines are generated by functions x ↦ (x_l − t)_+³ with t = x_{i,l}, see (3.6), which can be obtained by multiplying the basis function h_{−t,l}(x) ∈ H three times with itself.

This model is fit by an iterative (stage-wise adaptive) algorithm that selects at each step a function h_m(·) of the present calibration, and splits this function into the two (new) functions
x ↦ h_{m1}(x) = h_m(x) (x_l − x_{i,l})_+   and   x ↦ h_{m2}(x) = h_m(x) (x_{i,l} − x_l)_+,
for some l = 1, ..., q and i = 1, ..., n. Thereby, it chooses the additional optimal split (in terms of m, l and x_{i,l}) to obtain the new (refined) model
log λ(x) = β_0 + Σ_{m'≠m} β_{m'} h_{m'}(x) + β_{m1} h_m(x) h_{−x_{i,l},l}(x) + β_{m2} h_m(x) h_{+x_{i,l},l}(x)
         = β_0 + Σ_{m'≠m} β_{m'} h_{m'}(x) + β_{m1} h_{m1}(x) + β_{m2} h_{m2}(x).    (3.16)
The optimal split can be determined relative to a loss function; for instance, one can consider the resulting in-sample loss from the Poisson deviance loss. This way we may grow a large complex model (in a stage-wise adaptive manner). By backward pruning (deletion) some of the splits may be deleted again, if they do not contribute sufficiently to the reduction in cross-validation error. For computational reasons, in MARS calibrations one often uses the GCV criterion, which provides a cruder error estimation in terms of an adjusted in-sample error, see (3.13) above and (9.20) in Hastie et al. [62]. We do not provide more details on pruning and deletion here, but pruning in a similar context will be treated in more detail in Chapter 6 in the context of regression trees. We remark that the refinement (3.16) can also be understood as a boosting step. Boosting is going to be studied in Section 7.4.
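As a crude illustration of the hinge basis (3.14)-(3.15) only (it does not implement the stage-wise adaptive knot search of MARS): with a few fixed, hand-picked knots one can build reflected pairs and fit the coefficients β_m by a Poisson GLM; a minimal sketch (the data frame dat with columns claims, expo, age, ac and the knots 30 and 5 are our assumptions):

# hinge (ReLU) basis functions of a reflected pair (3.14)
h_minus <- function(x, t) pmax(x - t, 0)
h_plus  <- function(x, t) pmax(t - x, 0)

# non-adaptive sketch of (3.15) with fixed knots instead of MARS' adaptive splits
dat$age_m <- h_minus(dat$age, 30); dat$age_p <- h_plus(dat$age, 30)
dat$ac_m  <- h_minus(dat$ac, 5);   dat$ac_p  <- h_plus(dat$ac, 5)
fit_hinge <- glm(claims ~ age_m + age_p + ac_m + ac_p + offset(log(expo)),
                 data = dat, family = poisson())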
Chapter 4 Credibility Theory

In Chapter 2 on GLMs we have seen that many years at risk are needed in order to draw statistically firm conclusions. This holds true, in particular, for problems with low expected frequencies. In this chapter we present credibility methods that allow us to smooth predictors and estimators with other sources of information in situations where we do not have sufficient volume or in situations where we have outliers. Credibility methods are based on Bayesian statistics. We do not present the entire credibility theory framework here, but we only consider some special cases that are related to the Poisson claims frequency model and to a binary classification problem. For a comprehensive treatment of credibility theory we refer to the book of Buhlmann-Gisler [18]. Along the way we also meet the important technique of regularization. Moreover, we discuss simulation methods such as the Markov chain Monte Carlo (MCMC) method to numerically calculate Bayesian posterior distributions.

4.1 The Poisson-gamma model for claims counts

4.1.1 Credibility formula

In Section 1.2 we have assumed that N ∼ Poi(λv) for a fixed expected frequency parameter λ > 0 and for given years at risk v > 0. In this chapter we do not assume that the expected frequency parameter λ is fixed, but it is modeled by a strictly positive random variable Λ. This random variable Λ may have different interpretations: (i) there is some uncertainty involved in the true expected frequency parameter and we reflect this uncertainty by a random variable Λ; (ii) we have a heterogeneous portfolio of different risks and we choose at random a policy from this portfolio. The latter is similar to a heterogeneous situation where a priori all policies are equivalent. Here, we make a specific distributional assumption for Λ by choosing a gamma prior distribution having density
π(θ) = c^γ / Γ(γ) · θ^{γ−1} e^{−cθ},   for θ ∈ R_+,
with shape parameter γ > 0 and scale parameter c > 0. Note that the mean and variance of the gamma distribution are given by γ/c and γ/c², respectively. For more information on the gamma distribution we refer to Section 3.2.1 in Wuthrich [135].

Definition 4.1 (Poisson-gamma model). Assume there is a fixed given volume v > 0.
• Conditionally, given Λ, the claims count N ∼ Poi(Λv).
• Λ ∼ Γ(γ, c) with prior shape parameter γ > 0 and prior scale parameter c > 0.

Theorem 4.2. Assume N follows the Poisson-gamma model of Definition 4.1. The posterior distribution of Λ, conditional on N, is given by
Λ | N ∼ Γ(γ + N, c + v).

Proof. Bayes' rule says that the posterior density of Λ, given N, is given by (up to normalizing constants)
π(θ|N) ∝ e^{−θv} (θv)^N / N! · θ^{γ−1} e^{−cθ} ∝ θ^{γ+N−1} e^{−(c+v)θ}.
This is a gamma density with the required properties. □

Remarks 4.3.
• The posterior distribution is again a gamma distribution with updated parameters. The parameter update is given by
γ ↦ γ^post = γ + N   and   c ↦ c^post = c + v.
Often γ and c are called the prior parameters and γ^post and c^post the posterior parameters.
• The remarkable property of the Poisson-gamma model is that the posterior distribution belongs to the same family of distributions as the prior distribution. There are more examples of this kind; many of these examples belong to the EDF with conjugate priors, see Buhlmann-Gisler [18].
• For the estimation of the unknown parameter Λ we obtain the following prior estimator and the corresponding posterior (Bayesian) estimator
λ̂ = E[Λ] = γ/c,    λ̂^post = E[Λ|N] = γ^post/c^post = (γ + N)/(c + v).
• Assume that the claims counts N_1, ..., N_n are conditionally, given Λ, independent and Poisson distributed with conditional means Λv_i, for i = 1, ..., n. Lemma 1.3 implies that
Σ_{i=1}^n N_i | Λ ∼ Poi( Λ Σ_{i=1}^n v_i ).
Thus, we can apply Theorem 4.2 also to the aggregate portfolio N = Σ_i N_i, if the claims counts N_i all belong to the same frequency parameter Λ.
• The unconditional distribution of N under the Poisson-gamma model is a negative binomial distribution, see Proposition 2.20 in Wuthrich [135].
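As a small numerical illustration of Theorem 4.2 (the prior parameters and the observation below are hypothetical choices):

# hypothetical prior: gamma with prior mean gamma0/c0 = 10% and scale parameter c0 = 456
gamma0 <- 0.10 * 456; c0 <- 456
N <- 3; v <- 25                  # hypothetical observation: 3 claims on 25 years at risk
gamma_post <- gamma0 + N         # posterior shape parameter, Theorem 4.2
c_post     <- c0 + v             # posterior scale parameter
gamma_post / c_post              # posterior estimator E[Lambda | N] = (gamma + N)/(c + v)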
c + v v The (conditional) mean square error of this estimator is given by E (A - Apost) N post 2 ^post^2 (1 - a) - Apost. c Proof. In view of Theorem 4.2 we have for the posterior mean Apost = 1±_- = ^_--JV+_--l = QA + (l-a)A. c + v c + v v c + v c This proves the first claim. For the estimation uncertainty we have E^A-A^j JVJ = Var(A|iV) = j^-—yT = (1 - a) - Apost. This proves the claim. □ Remarks 4.5. • Corollary 4.4 shows that the posterior estimator Apost is a credibility weighted average between the prior guess A and the MLE A with credibility weight a G (0,1). • The credibility weight a has the following properties (which balances the weights given in the posterior estimator Apost): 1. for volume v —> oo: a —> 1; 2. for volume v —> 0: a —> 0; 3. for prior uncertainty going to infinity, i.e. c —> 0: a —> 1; 4. for prior uncertainty going to zero, i.e. c —> oo: a —> 0. Note that Var (A) = 0L = - \. cz c For c large we have an informative prior distribution, for c small we have a vague prior distribution and for c = 0 we have a non-informative or improper prior distribution. The latter means that we have no prior parameter knowledge (this needs to be understood in an asymptotic sense). These statements are all done under the assumption that the prior mean A = 7/c remains constant. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich Electronic copy available at: https://ssrn.com/abstract=2870308 80 Chapter 4. Credibility Theory • In motor insurance it often happens that N = 0. In this case, A = 0 and we cannot meaningfully calibrate a Poisson model with MLE because we receive a degenerate model. However, the resulting posterior estimator Apost = (1 — a)A > 0 still leads to a sensible model calibration in the credibility approach. This is going to be important in the application of Poisson regression trees in Chapter 6. 4.1.2 Maximum a posteriori estimator In the previous section we have derived the posterior estimator Apost for the claims frequency based on observation N. We have calculated the posterior distribution of A, given N. Its log-likelihood function is given by log7r(0|iV) oc (~f + N-l)log9-(c + v)£. If we maximize this log-likelihood function we receive the maximum a posteriori (MAP) estimator given by ^map = 7 + N -1 = - t _ a C + V V Note that the MAP estimator is always positive for 7 > 1, the latter corresponds to a prior coefficient of variation 7-1/2 being bounded from above by 1. Thus, if we have a prior distribution for A which is informative with maximal coefficient of variation of 1, the MAP estimator will always be positive. 4.1.3 Example in motor insurance pricing The credibility formula of Corollary 4.4 is of particular interest when the portfolio is small, i.e. if v is small, because in that case we receive a credibility weight a that substantially differs from 1. To apply this model we a need prior mean A = 7/c > 0 and a scale parameter c > 0. These parameters may either be chosen from expert opinion or they may be estimated from broader portfolio information. In the present example we assume broader portfolio information and we calibrate the model according to Section 4.10 of Buhlmann-Gisler [18], for full details we refer to this latter reference. Model Assumptions 4.6. Assume we have independent portfolios i = l,...,n all following the Poisson-gamma model of Definition 4-1 with portfolio dependent volumes Vi > 0 and portfolio independent prior parameters 7, c > 0. This model is a special case of the Buhlmann-Straub model, see Buhlmann-Gisler [18]. 
Corollary 4.4 implies that for each portfolio i = 1, ..., n we have
λ̂^post_i = α_i λ̄_i + (1 − α_i) λ̂,
with credibility weights α_i and observation based estimators λ̄_i given by
α_i = v_i/(c + v_i) ∈ (0, 1)   and   λ̄_i = N_i/v_i,
and the prior mean is given by λ̂ = γ/c. The optimal estimator (in a credibility sense, the homogeneous credibility estimator) of this prior mean is given by
λ̂_0 = ( Σ_{i=1}^n α_i )^{−1} Σ_{i=1}^n α_i λ̄_i.
Thus, there only remains to determine the scale parameter c > 0. To estimate this last parameter we apply the iterative procedure given in Buhlmann-Gisler [18], pages 102-103. This procedure provides estimates τ̂² and λ̂_0 from which we estimate ĉ = λ̂_0/τ̂²; for details we refer to Buhlmann-Gisler [18]. This then provides estimated credibility weights
α̂_i = v_i/(ĉ + v_i) ∈ (0, 1).    (4.2)

We apply this credibility model to the car insurance pricing data generated in Appendix A. As portfolios i = 1, ..., n we choose the 26 Swiss cantons ct, and we consider the corresponding volumes v_i and the observed numbers of claims N_i in each Swiss canton. For the first analysis we consider the whole portfolio, which provides a total of Σ_i v_i = 253'022 years at risk. The largest Swiss canton ZH has 42'644 years at risk, and the smallest Swiss canton AI has 1'194 years at risk. For the second example we only consider drivers with an age below 20, for which we have a total of Σ_i v_i = 1'741 years at risk. The largest Swiss canton ZH has 283 years at risk (drivers below age 20), and the smallest Swiss canton AR has 4 years at risk in this second example.

If we consider the entire portfolio we obtain fast convergence of the iterative procedure (4 iterations) and we set ĉ = 456 and λ̂_0 = 10.0768% (based on the entire portfolio). We use this same estimate ĉ = 456 for both examples, the entire portfolio and the young drivers only portfolio. From this we can calculate the estimated credibility weights α̂_i for all Swiss cantons i = 1, ..., n. The results are presented in Figure 4.1.

Figure 4.1: Estimated credibility weights (4.2) of (lhs) the entire portfolio and of (rhs) the young drivers only portfolio.

For the entire portfolio the estimated credibility weights lie between 72% and 99%. For the young drivers only portfolio the estimated credibility weights are between 0.8% and 40%. Using these credibility weights we could then calculate the Bayesian frequency estimators (which are smoothed versions of the MLE frequency estimators). This finishes our example. ■
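For illustration, the reported ranges of the credibility weights can be reproduced directly from ĉ = 456 and the extreme canton volumes (a minimal sketch; only the largest and smallest cantons are shown):

c_hat <- 456
v <- c(ZH = 42644, AI = 1194)      # largest and smallest canton, entire portfolio
v / (c_hat + v)                    # credibility weights (4.2): about 0.99 and 0.72
v_young <- c(ZH = 283, AR = 4)     # largest and smallest canton, young drivers only
v_young / (c_hat + v_young)        # about 0.38 and 0.009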
4.2 The binomial-beta model for classification

4.2.1 Credibility formula

There is a similar credibility formula for the binary classification problem when we assume that the prior distribution of the parameter is given by a beta distribution.

Definition 4.7 (binomial-beta model).
• Conditionally, given Θ, the responses Y_1, ..., Y_n are i.i.d. Bernoulli distributed with success probability Θ.
• Θ ∼ Beta(a, b) with prior parameters a, b > 0.

Theorem 4.8. Assume Y_1, ..., Y_n follow the binomial-beta model of Definition 4.7. The posterior distribution of Θ, conditional on Y = (Y_1, ..., Y_n), is given by
Θ | Y ∼ Beta( a + Σ_{i=1}^n Y_i, b + n − Σ_{i=1}^n Y_i ) =: Beta(a^post, b^post).

Proof. Using Bayes' rule, the posterior density of Θ is given by (up to normalizing constants)
π(θ|Y) ∝ ( Π_{i=1}^n θ^{Y_i} (1 − θ)^{1−Y_i} ) θ^{a−1} (1 − θ)^{b−1}.
This is a beta density with the required properties. □

For the estimation of the unknown success probability we obtain the following prior and posterior (Bayesian) estimators
p̂ = E[Θ] = a/(a + b),    p̂^post = E[Θ|Y] = a^post/(a^post + b^post) = ( a + Σ_{i=1}^n Y_i )/( a + b + n ).

Corollary 4.9. Assume Y_1, ..., Y_n follow the binomial-beta model of Definition 4.7. The posterior estimator p̂^post has the following credibility form
p̂^post = α p̄ + (1 − α) p̂,
with credibility weight α and observation based estimator p̄ given by
α = n/(a + b + n) ∈ (0, 1)   and   p̄ = (1/n) Σ_{i=1}^n Y_i.
The (conditional) mean square error of this estimator is given by
E[ (Θ − p̂^post)² | Y ] = ( 1/(1 + a + b + n) ) p̂^post (1 − p̂^post).

Proof. For the estimation uncertainty we have
E[ (Θ − p̂^post)² | Y ] = Var(Θ|Y) = a^post b^post / ( (a^post + b^post)² (1 + a^post + b^post) ) = ( 1/(1 + a + b + n) ) p̂^post (1 − p̂^post).
This proves the claim. □

4.2.2 Maximum a posteriori estimator

In the previous section we have derived the posterior estimator p̂^post. The corresponding log-likelihood function is given by
log π(θ|Y) ∝ ( a + Σ_{i=1}^n Y_i − 1 ) log θ + ( b + n − Σ_{i=1}^n Y_i − 1 ) log(1 − θ).
If we maximize this log-likelihood function we receive the MAP estimator
p̂^MAP = ( a + Σ_{i=1}^n Y_i − 1 )/( a + b + n − 2 ).
Here, we require that a > 1 for having a MAP estimator that has a positive probability for any observation Y.

4.3 Regularization and Bayesian MAP estimators

4.3.1 Bayesian posterior parameter estimator

For illustrative purposes we reconsider the GLM of Chapter 2. However, the theory presented in this section applies to any other parametric regression model. We consider GLM (2.4) for independent cases (N_i, x_i, v_i) ∈ D. The joint density of these cases is for regression parameter β given by
f_β(N_1, ..., N_n) = exp{ ℓ_N(β) } = Π_{i=1}^n e^{−exp⟨β,x_i⟩ v_i} ( exp⟨β,x_i⟩ v_i )^{N_i} / N_i!.
Bayesian modeling in this context means that we choose a prior density π(β) for the regression parameter β. The joint distribution of the cases and the regression parameter then reads as
f(N_1, ..., N_n, β) = ( Π_{i=1}^n e^{−exp⟨β,x_i⟩ v_i} ( exp⟨β,x_i⟩ v_i )^{N_i} / N_i! ) π(β).    (4.3)
After having observed the data D, the parameter vector β has posterior density
π(β|D) ∝ f(N_1, ..., N_n, β),    (4.4)
where we have dropped the normalizing constants. Thus, we receive an explicit functional form for the posterior density of the parameter vector β, conditionally given the data D. Simulation methods like MCMC algorithms, see Section 4.4, or sequential Monte Carlo (SMC) samplers then allow us to evaluate numerically the Bayesian (credibility) estimator
β̂^post = E[β|D] = ∫ β π(β|D) dβ.    (4.5)
This Bayesian estimator considers both the prior information π(β) and the data D.

4.3.2 Ridge and LASSO regularization

The joint distribution of the data and the regression parameter (4.3) can also be viewed as a regularized version of the log-likelihood function ℓ_N(β), where the prior density π is used for regularization. We have already met this idea in the calibration of GAMs, see (3.7). Assume that π(β) ∝ exp{−η‖β‖_p^p} for some tuning parameter η > 0, with ‖·‖_p denoting the L^p-norm for some p ≥ 1.
In this case we receive the joint log-likelihood function of data and regression parameter
ℓ_N(β) + log π(β) ∝ Σ_{i=1}^n [ −exp⟨β, x_i⟩ v_i + N_i ( ⟨β, x_i⟩ + log v_i ) ] − η ‖β‖_p^p.    (4.6)
If we now calculate the MAP estimator β̂^MAP we receive an L^p-regularized version of the MLE β̂ given in Proposition 2.2. Maximization of the objective function (4.6) implies that too large values of β are punished (regularized/shrunk towards zero) with regularization parameter (tuning parameter) η > 0. In particular, this helps us to control complex models so that they do not over-fit to the data (in-sample).

Remarks 4.10 (ridge and LASSO regularization).
• The last term in (4.6) tries to keep β small. Typically, the intercept β_0 is excluded from this regularization because otherwise regularization induces a bias towards 0, see the balance property of Proposition 2.4.
• For a successful application of regularization one requires that all feature components live on the same scale. This may require scaling using, for instance, the MinMaxScaler, see (5.14), below. Categorical feature components may require a group treatment, see Section 4.3 in Hastie et al. [63].
• p = 2 gives a Gaussian prior distribution, and we receive the so-called ridge regularization or Tikhonov regularization [127]. Note that we can easily generalize this ridge regularization to any Gaussian distribution: assume we have a prior mean b and a positive definite prior covariance matrix Σ, then we can consider the component-specific regularization
log π(β) ∝ −(β − b)' Σ^{−1} (β − b) / 2.
• p = 1 gives a Laplace prior distribution, and we receive the so-called LASSO regularization (least absolute shrinkage and selection operator), see Tibshirani [126]. LASSO regularization shrinks (unimportant) components to exactly zero, i.e. LASSO regularization can be used for variable selection.
• Optimal tuning parameters η are determined by cross-validation.

Example 4.11 (regularization of GLM1). We revisit Example 2.10 (GLM1) and we apply ridge and LASSO regularization according to (4.6). We consider the feature components age and ac using the categorical classes as defined in Figure 2.2. The R code is provided in Listing 4.1.

Listing 4.1: R code glmnet for regularization
1 X <- model.matrix(~ ageGLM + acGLM, dat)
2 glm.ridge <- glmnet(x=X, y=dat$claims, family="poisson", alpha=0, offset=log(dat$expo))
3 exp(predict(glm.ridge, newx = X, newoffset = log(dat$expo)))

On line 1 we define the design matrix, only consisting of the categorized age classes and ac classes, respectively. On lines 2-3 we fit and predict this regularized regression model. The parameter alpha = 0 gives ridge regularization, and alpha = 1 gives LASSO regularization. Running this algorithm as in Listing 4.1 fits the regularized GLM for 100 different values of the tuning parameter η. These tuning parameters are in [0.00137, 13.72341] for ridge regularization and in [0.00004, 0.01372] for LASSO regularization in our example. Moreover, the intercept β_0 is excluded from regularization.

Figure 4.2: Estimated regression parameters β̂^MAP under ridge regularization; the x-axis shows (lhs) log(η) and (rhs) the total L^1-norm of the regression parameter (excluding the intercept); in blue color are the parameters for the age classes and in orange color the ones for the ac classes, see also Listing 2.2.
In Figure 4.2 we illustrate the resulting ridge regularized regression parameters β̂^MAP for different tuning parameters η (excluding the intercept). We observe that the components of β̂^MAP shrink towards the homogeneous model with increasing tuning parameter. In Table 4.1, lines (Ch4.1)-(Ch4.3), we present the resulting errors for different tuning parameters. We observe that choosing different tuning parameters η > 0 continuously closes the gap between the homogeneous model and model GLM1.

In Figure 4.3 we illustrate the resulting LASSO regularized regression parameters β̂^MAP for different tuning parameters η. We see that this is different from ridge regularization because parameters are set/shrunk to exactly zero for increasing tuning parameter η. The x-axis on the top of Figure 4.3 shows the number of non-zero components in β̂^MAP (excluding the intercept). We note that at η = 0.0001 the regression parameter for age31to40 is set to zero and at η = 0.0009 the one for age41to50 is set to zero. Since age51to60 is the reference label, this LASSO regularization implies that we receive one large age class from age 31 to age 60 for η = 0.0009. Lines (Ch4.4)-(Ch4.6) of Table 4.1 show that also LASSO regularization closes the gap between the homogeneous model and model GLM1.

                          run time  # param.   CV loss   strat. CV   est. loss   in-sample   avg. frequency
(ChA.1) true model λ*          --        --        --          --          --     27.7278         10.1991%
(Ch1.1) homogeneous          0.1s         1   29.1066     29.1065      1.3439     29.1065         10.2691%
(Ch4.1) ridge η = 13.723     13s*        11        --          --      1.3439     29.1065         10.2691%
(Ch4.2) ridge η = 0.1438     13s*        11        --          --      1.0950     28.8504         10.2691%
(Ch4.3) ridge η = 0.0014     13s*        11        --          --      0.6091     28.3547         10.2691%
(Ch4.4) LASSO η = 0.0137      7s*         1        --          --      1.3439     29.1065         10.2691%
(Ch4.5) LASSO η = 0.0009      7s*         9        --          --      0.6329     28.3791         10.2691%
(Ch4.6) LASSO η = 0.0001      7s*        11        --          --      0.6053     28.3511         10.2691%
(Ch2.1) GLM1                 3.1s        11   28.3543     28.3544      0.6052     28.3510         10.2691%

Table 4.1: Poisson deviance losses of K-fold cross-validation (1.11) with K = 10 (CV loss and stratified CV loss L^CV_D), corresponding estimation loss E(λ̂, λ*) of (1.13), and in-sample losses L^is_D of (1.10); the estimation loss and the loss of the true model can only be calculated because we know the true model λ*; losses are reported in 10^{-2}; run time gives the time needed for model calibration, run times marked with * are received by simultaneously fitting the model for 100 tuning parameters η, and '# param.' gives the number of estimated model parameters; this table follows up from Table 2.4.

Figure 4.3: Estimated regression parameters β̂^MAP under LASSO regularization; the x-axis shows (lhs) log(η) and (rhs) the total L^1-norm of the regression parameter (excluding the intercept); in blue color are the parameters for the age classes and in orange color the ones for the ac classes, see also Listing 2.2.

This finishes this example. ■
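As mentioned in Remarks 4.10, the tuning parameter η is typically chosen by cross-validation; in glmnet this can be sketched as follows (a minimal sketch using the variable names of Listing 4.1; note that glmnet calls its tuning parameter lambda, which plays the role of η in (4.6)):

library(glmnet)
# K-fold cross-validation over a path of tuning parameters; alpha = 1 gives LASSO, alpha = 0 ridge
cv.lasso <- cv.glmnet(x = X, y = dat$claims, family = "poisson",
                      alpha = 1, offset = log(dat$expo), nfolds = 10)
cv.lasso$lambda.min                 # tuning parameter with smallest cross-validation loss
coef(cv.lasso, s = "lambda.min")    # fitted coefficients at this tuning parameter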
In the previous example we have seen that the ridge and LASSO regressions behave fundamentally differently, namely, LASSO regression shrinks certain parameters perfectly to zero whereas ridge regression does not. We briefly analyze the reason therefore, and we give some more insight into regularization. For references we refer to Hastie et al. [63], Chapter 16 in Efron-Hastie [37], and Chapter 6 in Wuthrich-Merz [141]. We start from the following optimization problem
argmin_β − ℓ_N(β),   subject to ‖β‖_p^p ≤ t,    (4.7)
for some given budget constraint t ∈ R_+ and p ≥ 1. This is a convex optimization problem with a convex constraint. Version (4.6) is obtained as the Lagrange version of this optimization problem (4.7) with fixed Lagrange multiplier η.

Figure 4.4: Illustration of optimization problem (4.7) under a budget constraint (lhs) for p = 2 and (rhs) for p = 1; this figure is taken from Wuthrich-Merz [141].

We illustrate optimization problem (4.7) in Figure 4.4. The red dot shows the MLE that maximizes the unconstrained log-likelihood ℓ_N(β). The convex curve around this MLE shows a level set of the log-likelihood function given by {β; ℓ_N(β) = c_0} for a given level c_0. The blue area shows the constraint in (4.7); the (lhs) shows a Euclidean ball for p = 2 and the (rhs) shows a square for p = 1. The optimal regularized estimate for β is obtained by the tangential intersection of the appropriate level set with the blue budget constraint. This is illustrated by the orange dots. We observe that for p = 1 this might be in the corner of the square (Figure 4.4, rhs); this is the situation where the parameter for the first feature component shrinks exactly to zero. For p = 2 this does not happen, a.s.

4.4 Markov chain Monte Carlo method

In this section we give the basics of Markov chain Monte Carlo (MCMC) simulation methods. We refrain from giving rigorous derivations and proofs, but we refer to the relevant MCMC literature like Congdon [28], Gilks et al. [53], Green [57, 58], Johansen et al. [75], Neal [97] or Robert [113].

A main problem in Bayesian modeling is that posterior distributions of type (4.4) cannot be found explicitly, i.e. we can neither directly simulate from that posterior distribution nor can we calculate the (posterior) Bayesian credibility estimator (4.5) explicitly. In this section we provide an algorithm that helps us to approximate posterior distributions. For this discussion we switch back the notation from β to Θ because the latter is more standard in a Bayesian modeling context.

In a nutshell, if a probability density π(θ) (or π(θ|D) in a Bayesian context) of a random variable Θ is given up to its normalizing constant (i.e. if we explicitly know its functional form in θ), then we can design a discrete time Markov process that converges (in distribution) to π (or π(·|D), respectively). More precisely, we can design a (time inhomogeneous) Markov process that converges (under reversibility, irreducibility and aperiodicity) to its (unique) stationary distribution (also called equilibrium distribution). Assume for simplicity that π is discrete. If the (irreducible and aperiodic) Markov chain (Θ^(t))_{t≥0} has the following invariance property for π
π(θ_i) = Σ_j π(θ_j) P[ Θ^(t+1) = θ_i | Θ^(t) = θ_j ],   for all θ_i and t ≥ 0,    (4.8)
then π is the (unique) stationary distribution of (Θ^(t))_{t≥0}. Based on the latter property, we can use the samples (Θ^(t))_{t>t_0} as an empirical approximation to π after the Markov chain has sufficiently converged, i.e. for sufficiently large t_0. This t_0 is sometimes also called the burn-in period. Thus, we are aiming at constructing an irreducible and aperiodic Markov chain that satisfies the invariance property (4.8) for π.
A Markov chain is said to fulfill the detailed balance condition (to be reversible) w.r.t. π if the following holds for all θ_i, θ_j and t ≥ 0

   π(θ_i) P[Θ^(t+1) = θ_j | Θ^(t) = θ_i] = π(θ_j) P[Θ^(t+1) = θ_i | Θ^(t) = θ_j].        (4.9)

We observe that the detailed balance condition (4.9) implies the invariance property (4.8) for π (sum both sides over θ_j). Therefore, reversibility is sufficient to obtain the invariance property. The aim now is to design (irreducible and aperiodic) Markov chains that are reversible w.r.t. π; the limiting samples of these Markov chains can be used as empirical approximations to π. Crucial for this design is that we only need to know the functional form of the density π; note that the normalizing constant of π cancels in (4.9).

4.4.1 Metropolis-Hastings algorithm

The goal is to design an irreducible and aperiodic Markov chain that fulfills the detailed balance condition (4.9) for the density π. The main idea of this design goes back to Metropolis and Hastings [95, 64] and uses an acceptance-rejection method. Assume that the Markov chain is in state Θ^(t) at algorithmic time t and we would like to simulate the next state Θ^(t+1) of the chain. For the simulation of this next state we choose a proposal density q(·|Θ^(t)) that may depend on the current state Θ^(t) (and also on the algorithmic time t, not indicated in the notation).

Metropolis-Hastings Algorithm.

1. In a first step we simulate a proposed next state θ* ~ q(·|Θ^(t)).

2. Using this proposed state we calculate the acceptance probability

   α(Θ^(t), θ*) = min{ π(θ*) q(Θ^(t)|θ*) / ( π(Θ^(t)) q(θ*|Θ^(t)) ), 1 }.                (4.10)

3. Simulate independently U ~ Uniform[0,1].

4. Set for the state at algorithmic time t + 1

   Θ^(t+1) = θ*       if U ≤ α(Θ^(t), θ*),
   Θ^(t+1) = Θ^(t)    otherwise.

4.4.2 Gibbs sampling

For a multi-dimensional parameter Θ = (Θ_1, ..., Θ_r) it is often easier to update one component at a time. Assume that at algorithmic time t + 1 the components Θ_1^(t+1), ..., Θ_{k-1}^(t+1) have already been updated and that we now would like to update component k. For the proposed component θ_k* we use the notation

   θ* = (Θ_1^(t+1), ..., Θ_{k-1}^(t+1), θ_k*, Θ_{k+1}^(t), ..., Θ_r^(t)).                (4.11)

The update of component k is done as follows:

(a) simulate a proposed component θ_k* from a proposal density q_k( · | Θ_1^(t+1), ..., Θ_{k-1}^(t+1), Θ_k^(t), ..., Θ_r^(t));

(b) calculate the acceptance probability α_k of this newly proposed component, analogously to (4.10), where the proposed state is given by (4.11) and the current state by (Θ_1^(t+1), ..., Θ_{k-1}^(t+1), Θ_k^(t), Θ_{k+1}^(t), ..., Θ_r^(t));

(c) simulate independently U ~ Uniform[0,1];

(d) set for the component k at algorithmic time t + 1

   Θ_k^(t+1) = θ_k*       if U ≤ α_k,
   Θ_k^(t+1) = Θ_k^(t)    otherwise.

If the proposal density in (a) is chosen to be the conditional density of the k-th component under π, given all the other components, i.e. if

   q_k(θ_k* | Θ_1^(t+1), ..., Θ_{k-1}^(t+1), Θ_{k+1}^(t), ..., Θ_r^(t)) ∝ π(θ*),

where we use notation (4.11), then these proposals are always accepted because α = 1.
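The Metropolis-Hastings recipe above is short enough to be spelled out in a few lines of R. The following minimal sketch (our own illustration; the gamma target density is an arbitrary choice) implements a random-walk MH sampler; note that log_pi only needs to be known up to an additive constant, mirroring the fact that the normalizing constant of π cancels in (4.10).

# random-walk Metropolis-Hastings sampler, minimal sketch
# log_pi: log-density of the target pi, up to an additive constant
log_pi <- function(theta) dgamma(theta, shape = 2, rate = 1, log = TRUE)   # toy target

rw_mh <- function(log_pi, theta0, n_iter = 10000, sd_prop = 1) {
  theta <- numeric(n_iter)
  theta[1] <- theta0
  for (t in 2:n_iter) {
    theta_star <- rnorm(1, mean = theta[t - 1], sd = sd_prop)   # symmetric proposal q
    # acceptance probability (4.10); the proposal densities cancel for symmetric q
    log_alpha <- min(log_pi(theta_star) - log_pi(theta[t - 1]), 0)
    theta[t] <- if (log(runif(1)) <= log_alpha) theta_star else theta[t - 1]
  }
  theta
}

set.seed(1)
samples <- rw_mh(log_pi, theta0 = 1)
mean(samples[-(1:1000)])   # estimate of the target mean after burn-in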
4.4.3 Hybrid Monte Carlo algorithm

The hybrid Monte Carlo (HMC) algorithm goes back to Duane et al. [34] and is well described in Neal [97]. Note that the abbreviation HMC is also used for Hamiltonian Monte Carlo because the subsequent Markov process dynamics are based on Hamiltonians. Represent the density π as follows

   π(θ) = exp{log π(θ)} ∝ exp{ℓ(θ)}.

Observe that ℓ is the log-likelihood function of π, (potentially) up to a normalizing constant. The acceptance probability (4.10) in the MH algorithm then reads as

   α(Θ^(t), Θ*) = exp( min{ ℓ(Θ*) - ℓ(Θ^(t)) + log q(Θ^(t)|Θ*) - log q(Θ*|Θ^(t)), 0 } ).

In physics, -ℓ(θ) can be interpreted as the potential energy of the system in state θ. We enlarge this system by also accounting for its momentum. In physics this is done by considering the kinetic energy (and using the law of conservation of energy). Assume that Θ and ψ are independent with joint density

   f(θ, ψ) = π(θ) g(ψ) ∝ exp{ℓ(θ) - k(ψ)}.

We assume that -ℓ(·) and k(·) are differentiable, playing the roles of potential and kinetic energy, respectively, and we define the Hamiltonian

   H(θ, ψ) = -ℓ(θ) + k(ψ).

Assume that the dimension of θ = (θ_1, ..., θ_r) is r ∈ N. A typical choice for ψ = (ψ_1, ..., ψ_r) and its kinetic energy is

   ψ ~ N(0, T),   with T = diag(τ_1², ..., τ_r²)   and   k(ψ) = ∑_{l=1}^r ψ_l² / (2 τ_l²),

for given parameters τ_l > 0. In other words, g(ψ) is a Gaussian density with independent components ψ_1, ..., ψ_r and covariance matrix T = diag(τ_1², ..., τ_r²).

The goal is to design a reversible, irreducible and aperiodic Markov process that has f as stationary distribution (note that we silently change to a continuous space and time setting, for instance, for derivatives to be well-defined). The crucial point now is the construction of the Markov process dynamics. Assume that we are in state (Θ^(t), ψ^(t)) at time t and we would like to study an infinitesimal time interval dt. For this we consider a change (in infinitesimal time) that leaves the total energy in the Hamiltonian unchanged (law of conservation of energy). Thus, for all components l = 1, ..., r we choose the dynamics at time t

   dΘ_l^(t)/dt = ∂k(ψ)/∂ψ_l |_{ψ^(t)}    and    dψ_l^(t)/dt = ∂ℓ(θ)/∂θ_l |_{Θ^(t)}.      (4.12)

This implies for the change in the Hamiltonian at time t

   d H(Θ^(t), ψ^(t)) / dt
      = ∑_{l=1}^r [ ∂H(θ,ψ)/∂θ_l |_{(Θ^(t),ψ^(t))} · dΘ_l^(t)/dt + ∂H(θ,ψ)/∂ψ_l |_{(Θ^(t),ψ^(t))} · dψ_l^(t)/dt ]
      = ∑_{l=1}^r [ - ∂ℓ(θ)/∂θ_l |_{Θ^(t)} · ∂k(ψ)/∂ψ_l |_{ψ^(t)} + ∂k(ψ)/∂ψ_l |_{ψ^(t)} · ∂ℓ(θ)/∂θ_l |_{Θ^(t)} ] = 0.

Thus, a Markov process (Θ^(t), ψ^(t))_{t≥0} which moves according to (4.12) leaves the total energy in the Hamiltonian unchanged. Moreover, this dynamics preserves volumes, see (3.8) in Neal [97], and it is reversible w.r.t. f(θ, ψ). This implies that f(θ, ψ) is an invariant distribution for this (continuous-time) Markov process (Θ^(t), ψ^(t))_{t≥0}. Note, however, that this Markov process is not irreducible because, for a given starting value, it cannot leave a given total energy level (and hence cannot explore the entire state space). This deficiency can be corrected by randomizing the total energy level. Furthermore, for simulation we also need to discretize time in (4.12). These two points are described next.

First, we mention that we can apply Gibbs sampling for the update of the kinetic energy state ψ^(t) given Θ^(t), simply by using the Gaussian density g(ψ). Assume we are in position (Θ^(t), ψ^(t)) at algorithmic time t ∈ N_0 and we aim at simulating the state (Θ^(t+1), ψ^(t+1)). We do this in two steps: first we move the kinetic energy level and then we move along the corresponding (discretized) constant Hamiltonian level.

Hybrid (Hamiltonian) Monte Carlo Algorithm.

1. Using Gibbs sampling, we set

   ψ^(t+1) ~ N(0, T = diag(τ_1², ..., τ_r²)).

2. For fixed step size ε > 0 and number of steps L ∈ N, we initialize θ*(0) = Θ^(t) and ψ*(0) = ψ^(t+1), and we consider the leapfrog updates for j = 1, ..., L

   ψ*(j-1/2) = ψ*(j-1)   + (ε/2) ∇_θ ℓ(θ*(j-1)),
   θ*(j)     = θ*(j-1)   + ε T^{-1} ψ*(j-1/2),
   ψ*(j)     = ψ*(j-1/2) + (ε/2) ∇_θ ℓ(θ*(j)).

This provides a proposal for the next state (Θ*, ψ*) = (θ*(L), -ψ*(L)). Accept this proposal with acceptance probability

   α( (Θ^(t), ψ^(t+1)), (Θ*, ψ*) ) = exp( min{ -H(Θ*, ψ*) + H(Θ^(t), ψ^(t+1)), 0 } ),    (4.13)

otherwise keep (Θ^(t), ψ^(t+1)).

We give some remarks. Observe that in step 2 we may update both components Θ^(t) and ψ^(t+1). If the leapfrog discretization were exact (i.e. no discretization error), then it would define, as in (4.12), a reversible Markov process that leaves the total energy and the total volume unchanged, see Section 3.1.3 in Neal [97]. This would imply that the acceptance probability in (4.13) is equal to 1. The MH step with acceptance probability (4.13) corrects for the discretization error (corrects for a potential bias). Moreover, since the kinetic energy k(ψ) is symmetric with respect to zero, the reflection of the momentum in the proposal ψ* = -ψ*(L) would not be necessary (if we perform Gibbs sampling in step 1). Note that the wanted distribution π is obtained by considering the (for the first component) marginalized Markov process (Θ^(t), ψ^(t))_{t≥0} asymptotically.
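The following R sketch (our own illustration, with T chosen as the identity matrix and log_pi, grad_log_pi assumed to be user-supplied) implements one HMC update exactly along steps 1 and 2 above, including the momentum reflection and the acceptance step (4.13).

# one hybrid (Hamiltonian) Monte Carlo update, minimal sketch with T = identity
hmc_step <- function(theta, log_pi, grad_log_pi, eps, L) {
  psi <- rnorm(length(theta))                  # step 1: Gibbs update of the momentum
  theta_star <- theta
  psi_star   <- psi
  for (j in 1:L) {                             # step 2: L leapfrog steps
    psi_star   <- psi_star + eps / 2 * grad_log_pi(theta_star)
    theta_star <- theta_star + eps * psi_star
    psi_star   <- psi_star + eps / 2 * grad_log_pi(theta_star)
  }
  psi_star <- -psi_star                        # momentum reflection (see remark above)
  # Hamiltonian H = -log_pi + kinetic energy; acceptance probability (4.13)
  H_old <- -log_pi(theta) + sum(psi^2) / 2
  H_new <- -log_pi(theta_star) + sum(psi_star^2) / 2
  if (log(runif(1)) <= H_old - H_new) theta_star else theta
}

# usage sketch for a bivariate standard normal toy target
log_pi      <- function(theta) -sum(theta^2) / 2
grad_log_pi <- function(theta) -theta
set.seed(1)
theta <- c(10, 10)
chain <- matrix(NA, nrow = 5000, ncol = 2)
for (t in 1:nrow(chain)) chain[t, ] <- theta <- hmc_step(theta, log_pi, grad_log_pi, eps = 0.2, L = 5)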
4.4.4 Metropolis-adjusted Langevin algorithm

The above HMC algorithm provides as a special case, for L = 1, the Metropolis-adjusted Langevin algorithm (MALA), studied in Roberts-Rosenthal [115]. First, observe that the Gibbs sampling step of ψ^(t+1) is independent of everything else, and we can interpret this Gibbs sampling step as part of the proposal distribution for updating Θ^(t); in particular, because we are only interested in the marginalized process (Θ^(t))_{t≥0} of (Θ^(t), ψ^(t))_{t≥0}. The leapfrog update for L = 1 reads as

   Θ* = Θ^(t) + (ε²/2) T^{-1} ∇_θ log π(Θ^(t)) + ε T^{-1} ψ^(t+1),   with ψ^(t+1) ~ N(0, T).      (4.14)

If (W_t)_{t≥0} is an r-dimensional Brownian motion on our probability space, then we can consider the r-dimensional stochastic differential equation

   dΘ_t = ½ ∇ log π(Θ_t) dt + dW_t.

When π is sufficiently smooth, then π is a stationary distribution of this Langevin diffusion. The MALA described by the proposals (4.14) is directly obtained from a discretized version of this Langevin diffusion for T = 1 and with ε describing the step size. The acceptance-rejection step then ensures convergence to the appropriate stationary distribution π. In Roberts-Rosenthal [115] it was proved that (under certain assumptions) the optimal average acceptance rate of the MALA is 0.574. Note that this is vastly different from the 0.234 in the classical MH algorithm, see Roberts et al. [114]. In Neal [97] it is argued that also the MALA may have a too random walk-like behavior, and that the full power of HMC algorithms is only experienced for bigger L, reflecting bigger steps along the (deterministic) Hamiltonian dynamics.
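A minimal R sketch of one MALA update based on proposal (4.14) with T = 1 (our own illustration; log_pi and grad_log_pi are assumed to be user-supplied, and the two asymmetric Gaussian proposal densities enter the MH correction explicitly):

# one MALA update with T = identity, minimal sketch
mala_step <- function(theta, log_pi, grad_log_pi, eps) {
  # proposal (4.14): discretized Langevin drift plus Gaussian noise
  mean_fwd   <- theta + eps^2 / 2 * grad_log_pi(theta)
  theta_star <- mean_fwd + eps * rnorm(length(theta))
  mean_bwd   <- theta_star + eps^2 / 2 * grad_log_pi(theta_star)
  # MH acceptance probability with the (asymmetric) Gaussian proposal densities
  log_alpha <- log_pi(theta_star) - log_pi(theta) +
    sum(dnorm(theta, mean = mean_bwd, sd = eps, log = TRUE)) -
    sum(dnorm(theta_star, mean = mean_fwd, sd = eps, log = TRUE))
  if (log(runif(1)) <= min(log_alpha, 0)) theta_star else theta
}

# usage with a bivariate standard normal toy target
log_pi      <- function(theta) -sum(theta^2) / 2
grad_log_pi <- function(theta) -theta
set.seed(1)
theta <- c(10, 10)
for (t in 1:5000) theta <- mala_step(theta, log_pi, grad_log_pi, eps = 0.23)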
4.4.5 Example in Markov chain Monte Carlo simulation

In this section we consider a simple example for the density π that allows us to empirically compare the performance of the MH algorithm, the MALA and the HMC algorithm. We choose dimension r = 10 (which is not large) and let Σ ∈ R^{r×r} be a positive definite covariance matrix; the explicit choice is provided in Figure 4.5. We then assume that π is given by the N(0, Σ) distribution.

Figure 4.5: Choice of covariance matrix Σ.

Thus, π is a centered multivariate Gaussian density. Of course, in this case we could directly simulate from π. Nevertheless, we will use the MCMC algorithms to approximate samples of π. This allows us to study the performance of these algorithms.

Figure 4.6: Trace plots of the first component Θ_1^(t) of (Θ^(t))_{t=0,...,10000} of (lhs, black) the MH algorithm, (middle, blue) the MALA, and (rhs, red) the HMC algorithm with L = 5.

We start with the MH algorithm. For the MH algorithm we choose the proposal distribution

   Θ* ~ q(·|Θ^(t)) = N(Θ^(t), ε² 1),

where 1 denotes the r-dimensional identity matrix. Thus, this is a random walk MH (RW-MH) choice, and the resulting acceptance probability is given by

   α(Θ^(t), Θ*) = exp( min{ -½ (Θ*)' Σ^{-1} Θ* + ½ (Θ^(t))' Σ^{-1} Θ^(t), 0 } ).

The parameter ε > 0 is fine-tuned so that we obtain an average acceptance rate of roughly 0.234. The trace plot of the first component (Θ_1^(t))_t is given in Figure 4.6 (lhs, black). Note that we choose starting point Θ^(0) = (10, ..., 10)' and the burn-in takes roughly t_0 = 1000 algorithmic steps.

Next we study the MALA. We choose proposal (4.14) with T = 1, i.e.

   Θ* = (1 - (ε²/2) Σ^{-1}) Θ^(t) + ε ψ^(t+1),   with ψ^(t+1) ~ N(0, 1),

from which the acceptance probability can easily be calculated, see also (4.17). The parameter ε = 0.23 is fine-tuned so that we obtain an average acceptance rate of roughly 0.574. The trace plot of the first component (Θ_1^(t))_t is given in Figure 4.6 (middle, blue). Note that we choose starting point Θ^(0) = (10, ..., 10)' and the burn-in is faster than in the RW-MH algorithm.

Finally, we consider the HMC algorithm. We choose exactly the same ε = 0.23 as in the MALA above, but we merge L = 5 steps along the Hamiltonian dynamics. We then apply the following steps. First, we simulate ψ^(t+1) ~ N(0, 1). The leapfrog updates are obtained by initializing θ*(0) = Θ^(t) and ψ*(0) = ψ^(t+1), and updating for j = 1, ..., 5

   θ*(j) = (1 - (ε²/2) Σ^{-1}) θ*(j-1) + ε ψ*(j-1),
   ψ*(j) = ψ*(j-1) - (ε/2) Σ^{-1} ( θ*(j-1) + θ*(j) ).

This provides a proposal for the next state (Θ*, ψ*) = (θ*(5), -ψ*(5)). The acceptance probability for this move is then easily calculated from (4.13). For this parametrization we obtain an average acceptance rate of 0.698. The trace plot of the first component (Θ_1^(t))_t is given in Figure 4.6 (rhs, red). The striking difference is that we get much better mixing than in the RW-MH algorithm and in the MALA. This is confirmed by looking at the resulting empirical autocorrelations in the samples. They are presented in Figure 4.7 (lhs) for the first component of Θ^(t). From this we conclude that we should clearly favor the HMC algorithm with L > 1.

Figure 4.7: Autocorrelations of the first component Θ_1^(t) of (Θ^(t))_{t=t_0+1,...,10000} (after burn-in t_0 = 1000) of (lhs) the MH algorithm (black), the MALA (blue) and the HMC algorithm with L = 5 (red); (middle) the HMC algorithm for L = 3, 5, 10 (orange, red, magenta); and (rhs) the HMC algorithm for constant total move Lε = 1.15 (red, pink, brown).

In Figure 4.8 we also present the resulting QQ-plots (after burn-in t_0 = 1000). They confirm the previous statement; in particular, the tails of the HMC algorithm look much better than those of the other algorithms.

Figure 4.8: QQ-plots of the first component Θ_1^(t) of (Θ^(t))_{t=t_0+1,...,10000} (after burn-in t_0 = 1000) of (lhs, black) the MH algorithm, (middle, blue) the MALA, and (rhs, red) the HMC algorithm with L = 5.
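The mixing comparison of Figures 4.6-4.8 can also be summarized numerically. The following R sketch (our own illustration; chain_mh, chain_mala and chain_hmc are hypothetical matrices holding the post-burn-in samples of the three samplers, one column per component) uses the coda package to compare autocorrelations and effective sample sizes of the first component, and reproduces a normal QQ-plot as in Figure 4.8.

library(coda)

# assumed: chain_mh, chain_mala, chain_hmc are (10000 - t0) x 10 matrices of
# post-burn-in samples of the three samplers (hypothetical object names)
par(mfrow = c(1, 3))
acf(chain_mh[, 1],   main = "MH")        # autocorrelation of the first component
acf(chain_mala[, 1], main = "MALA")
acf(chain_hmc[, 1],  main = "HMC(L=5)")

# effective sample sizes: the better the mixing, the closer to the nominal sample size
effectiveSize(mcmc(chain_mh[, 1]))
effectiveSize(mcmc(chain_mala[, 1]))
effectiveSize(mcmc(chain_hmc[, 1]))

# normal QQ-plot of the first component, as in Figure 4.8
par(mfrow = c(1, 1))
qqnorm(chain_hmc[, 1]); qqline(chain_hmc[, 1])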
In the HMC algorithm we still have the freedom of different choices of L and ε. For the above chosen ε = 0.23 we present three different choices L = 3, 5, 10 in Figure 4.9. In this case L = 10 has the best mixing properties, also confirmed by the autocorrelation plot in Figure 4.7 (middle). However, L should also not be chosen too large because the leapfrog update may use too much computational time for large L. Hoffman-Gelman [67] have developed the no-U-turn sampler (NUTS), which fine-tunes L in an adaptive way. From Figure 4.9 we also observe that the choice of L does not influence the average acceptance probability too much.

Figure 4.9: Trace plots of the first component Θ_1^(t) of (Θ^(t))_{t=0,...,10000} of the HMC algorithm for (lhs, orange) L = 3 (average acceptance rate 0.727), (middle, red) L = 5 (average acceptance rate 0.698), and (rhs, magenta) L = 10 (average acceptance rate 0.655).

In Figure 4.10 we choose Lε = 5 · 0.23 = 1.15 constant, i.e. we decrease the step size and increase the number of steps. Of course, this implies that we follow the Hamiltonian dynamics more closely, and the average acceptance rate increases, see Figure 4.10 from left to right. In Figure 4.7 (rhs) we present the resulting autocorrelation picture. From this we conclude that L = 10 with the original step size ε = 0.23, Figure 4.7 (middle, magenta), provides the best mixing of the Markov chain in our example (and should be preferred here). This finishes the example. ■

Figure 4.10: Trace plots of the first component of (Θ^(t))_{t=0,...,10000} of the HMC algorithm for (lhs, red) L = 5 and ε = 0.23, (middle, pink) L = 7 and ε = 0.23 · 5/7, and (rhs, brown) L = 10 and ε = 0.23/2, i.e. the total move Lε = 1.15 is constant.

4.4.6 Markov chain Monte Carlo methods: summary

We have met several simulation algorithms to numerically approximate the density π. Typically, these algorithms are used to determine posterior densities of type (4.4). These posterior densities are known up to the normalizing constants. This then allows one to empirically calculate the posterior Bayesian parameter (4.5). The advantage of the posterior Bayesian parameter is that we receive a natural regularization (determined by the choice of the prior distribution) and, moreover, we can quantify parameter estimation uncertainty through the posterior distribution of this parameter.

Appendix to Chapter 4

4.5 Proofs of Section 4.4

Proof of Lemma 4.12. The MH algorithm provides, for algorithmic time t ≥ 0, the transition probability

   p_{t+1|t}(θ_i, θ_j) = P[Θ^(t+1) = θ_j | Θ^(t) = θ_i]
      = q(θ_j|θ_i) α(θ_i, θ_j)                                            for θ_j ≠ θ_i,
      = q(θ_i|θ_i) + ∑_{θ_k ≠ θ_i} q(θ_k|θ_i) (1 - α(θ_i, θ_k))           for θ_j = θ_i.

For the detailed balance condition we need to prove, for all θ_i, θ_j and t ≥ 0,

   π(θ_i) p_{t+1|t}(θ_i, θ_j) = π(θ_j) p_{t+1|t}(θ_j, θ_i).                              (4.15)

The case θ_i = θ_j is trivial, so consider the case θ_i ≠ θ_j. We assume first that

   π(θ_j) q(θ_i|θ_j) ≥ π(θ_i) q(θ_j|θ_i).                                                (4.16)

In this case we have acceptance probability α(θ_i, θ_j) = 1, and, thus, p_{t+1|t}(θ_i, θ_j) = q(θ_j|θ_i). We start from the right-hand side in (4.15); note that under (4.16) we have α(θ_j, θ_i) = π(θ_i) q(θ_j|θ_i) / ( π(θ_j) q(θ_i|θ_j) ) ≤ 1, and therefore

   π(θ_j) p_{t+1|t}(θ_j, θ_i) = π(θ_j) q(θ_i|θ_j) α(θ_j, θ_i)
      = π(θ_j) q(θ_i|θ_j) · π(θ_i) q(θ_j|θ_i) / ( π(θ_j) q(θ_i|θ_j) )
      = π(θ_i) q(θ_j|θ_i) = π(θ_i) p_{t+1|t}(θ_i, θ_j).

This proves the first case. For the opposite sign in (4.16) we have α(θ_j, θ_i) = 1, and we read the last identities from the right to the left as follows

   π(θ_i) p_{t+1|t}(θ_i, θ_j) = π(θ_i) q(θ_j|θ_i) α(θ_i, θ_j) = π(θ_j) q(θ_i|θ_j) = π(θ_j) p_{t+1|t}(θ_j, θ_i).

This proves the lemma. □
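The detailed balance property just proved can also be verified numerically on a small finite state space. The following R sketch (our own illustration, not part of the original proof) builds the MH transition matrix of Lemma 4.12 for an arbitrary target π and proposal q and checks condition (4.9) entrywise.

# numerical check of detailed balance (4.9) for the MH transition kernel
set.seed(1)
r   <- 4
pi0 <- c(0.1, 0.2, 0.3, 0.4)                          # target distribution pi
Q   <- matrix(runif(r^2), r, r); Q <- Q / rowSums(Q)  # arbitrary proposal density q(.|.)

alpha <- function(i, j) min(pi0[j] * Q[j, i] / (pi0[i] * Q[i, j]), 1)   # acceptance (4.10)

P <- matrix(0, r, r)
for (i in 1:r) for (j in 1:r) if (i != j) P[i, j] <- Q[i, j] * alpha(i, j)
diag(P) <- 1 - rowSums(P)                             # rejection mass stays in the state

# detailed balance: pi(i) P[i, j] == pi(j) P[j, i] for all i, j
D <- pi0 * P                                          # D[i, j] = pi(i) * P[i, j]
max(abs(D - t(D)))                                    # should be numerically zero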
Proof of Lemma 4.13. The acceptance probability in (4.13) for L = 1 is

   α( (Θ^(t), ψ^(t+1)), (Θ*, ψ*) )
      = min( π(Θ*)/π(Θ^(t)) · exp{ -½ (ψ*)' T^{-1} ψ* + ½ (ψ^(t+1))' T^{-1} ψ^(t+1) }, 1 ).

Note that we have the identity

   z = ψ* + (ε/2) ∇_θ log π(Θ*) = -ψ^(t+1) - (ε/2) ∇_θ log π(Θ^(t)).

We define a* = (ε/2) ∇_θ log π(Θ*) and a^(t) = (ε/2) ∇_θ log π(Θ^(t)). Then we obtain

   (ψ*)' T^{-1} ψ* - (ψ^(t+1))' T^{-1} ψ^(t+1)
      = (z - a*)' T^{-1} (z - a*) - (z + a^(t))' T^{-1} (z + a^(t))
      = -2 (a* + a^(t))' T^{-1} z + (a*)' T^{-1} a* - (a^(t))' T^{-1} a^(t).

For z we have, using the proposal Θ*,

   z = -ψ^(t+1) - (ε/2) ∇_θ log π(Θ^(t)) = (1/ε) T (Θ^(t) - Θ*).

Therefore, we obtain

   (ψ*)' T^{-1} ψ* - (ψ^(t+1))' T^{-1} ψ^(t+1)
      = -(2/ε) (a* + a^(t))' (Θ^(t) - Θ*) + (a*)' T^{-1} a* - (a^(t))' T^{-1} a^(t).

This provides acceptance probability (4.13) for L = 1 being the minimum of 1 and

   π(Θ*)/π(Θ^(t)) · exp{ (1/ε) (a* + a^(t))' (Θ^(t) - Θ*) - ½ (a*)' T^{-1} a* + ½ (a^(t))' T^{-1} a^(t) }.

If we run an MH algorithm with proposal (4.14), we obtain acceptance probability being the minimum of 1 and

   [ π(Θ*)     exp{ -1/(2ε²) (Θ^(t) - Θ* - ε T^{-1} a*)' T (Θ^(t) - Θ* - ε T^{-1} a*) } ]
   / [ π(Θ^(t)) exp{ -1/(2ε²) (Θ* - Θ^(t) - ε T^{-1} a^(t))' T (Θ* - Θ^(t) - ε T^{-1} a^(t)) } ].

Since the last two displayed formulas are identical, we obtain the same acceptance probability and the claim follows. □

Chapter 5

Neural Networks

5.1 Feed-forward neural networks

Feed-forward neural networks are popular methods in data science and machine learning. Originally, they have been inspired by the functionality of brains and have their roots in the 1940s. In essence, feed-forward neural networks can be understood as high-dimensional non-linear regression functions, and the resulting statistical models can be seen as parametric regression models. However, due to their high dimensionality and the non-interpretability of their parameters, they are often called non-parametric regression models. Typically, one distinguishes between two types of feed-forward neural networks:

(i) shallow feed-forward neural networks having one hidden layer,

(ii) and deep feed-forward neural networks with more than one hidden layer.

These will be described in detail in this chapter. This description is based on the q-dimensional feature space X ⊂ R^q introduced in Chapter 2, and we will again analyze regression problems of type (2.1).

Remark. We use neural networks for modeling regression functions. However, neural networks can be seen in a much broader context. They describe a powerful approximation framework for continuous functions. In fact, neural networks provide a general toolbox of basis functions that can be composed to a very flexible approximation framework. This may be seen in a similar way as the splines and MARS frameworks presented in Chapter 3, or as the toolbox of wavelets and other families of basis functions.
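To give a first impression of this regression point of view, the following R sketch (our own illustration with arbitrary, non-fitted weights) evaluates a shallow feed-forward network with one hidden layer of three sigmoid neurons and an exponential output as a non-linear regression function of a one-dimensional feature; the construction is formalized in Section 5.1.1 below.

# shallow feed-forward network with one hidden layer of 3 sigmoid neurons,
# used as a (one-dimensional) non-linear regression function
sigmoid <- function(x) 1 / (1 + exp(-x))

shallow_net <- function(x, W1, b1, beta0, beta) {
  z <- sigmoid(W1 %*% x + b1)        # hidden layer activations (3 neurons)
  exp(beta0 + sum(beta * z))         # exponential output for a positive expected response
}

# arbitrary (not fitted) weights, chosen only to illustrate the flexibility of x -> lambda(x)
W1 <- matrix(c(2, -1, 0.5), ncol = 1); b1 <- c(-1, 0, 1)
beta0 <- -2; beta <- c(0.8, -0.5, 1.2)

x_grid <- seq(-3, 3, by = 0.05)
lambda <- sapply(x_grid, shallow_net, W1 = W1, b1 = b1, beta0 = beta0, beta = beta)
plot(x_grid, lambda, type = "l", xlab = "feature x", ylab = "regression function")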
The first two examples in Table 5.1 are difierentiable (with sigmoid/logistic activation function 4>{x) = (l+e-*)"1 4>' = 0(1 " 0) hyperbolic tangent activation function 4>{x) = tanh(a;) ' = l - 4>2 step function activation 4>{x) = l{x>0} rectified linear unit (ReLU) activation function 4>{x) Table 5.1: Typical choices of (non-linear) activation functions. easy derivatives). This is an advantage in algorithms for model calibration that involve derivatives. The hyperbolic tangent activation function satisfies ex — e~x 2e2x / \ -l x i y tanh^) = —— = —-2^-1 = 2 (l+e-2x) -1. (5.1) ex + e x 1 + ezx \ / The right-hand side uses the sigmoid activation function. The hyperbolic tangent is symmetric w.r.t. the origin, and this symmetry is an advantage in fitting deep network architectures. We illustrate the sigmoid activation function in more detail. It is given by x i y ^x) = ^^ = (l+e-xy1 £ (0,1). (5.2) 1 + ex Figure 5.1 shows the sigmoid activation function x {wq + wx) for a given intercept wq £ R. The sign of w determines the direction of the activation. Version October 27, 2021, M.V. Wuthrich & C. Buser, ETH Zurich Electronic copy available at: https://ssrn.com/abstract=2870308 Chapter 5. Neural Networks 103 Neural network layers A neural network consist of (several) neural network layers. Choose hyperparameters qm-l,qm G N and an activation function : R —> R. A neural network layer is a mapping z(m) . ^ Rqm z ^ z(m)(z) = (z[m\Z), zj^(z))' , that has the following form. Neuron z^m\ 1 < j < qm, is described by / 9m-l \ 4m) = *im)(*) = * [w%] + £ -S^J =• *<^mU. (5-3) for given network weights w^™^ = ('it>j™'))o. Function (5.3) is called ridge function. A ridge function can be seen as a data compression because it reduces the dimension from qm-\ to 1. Since this dimension reduction goes along with a loss of information, we consider simultaneously qm different ridge functions in a network layer. This network layer has a network parameter w(m) = (w(m)?. . . ? Wq7^) of dimension gm(gm_i + 1). Often we abbreviate a neural network layer as follows z ^ z{m\z) = 4>{W{m\z). (5.4) Feed-forward neural network architecture A feed-forward neural network architecture is a composition of several neural network layers. The first layer is the input layer which is exactly our feature space X C R9, and we set qo = q for the remainder of this chapter. We choose a hyperparameter d G N which denotes the depth of the neural network architecture. The activations in the m-th hidden layer for 1 < m < d are obtained by x G X 1-4- z We use the neurons z^d:1\x) G R9d in the last hidden layer as features in the final regression. We choose the exponential activation function because our responses are unbounded and live on the positive real line. This motivates the output layer (regression function) xeX ^ A(x) =exp|/3o + ^/3J^(d:1)(x)| =exp(/3,z(d:1)(a;)), (5.6) with output network weights j3 = {fij)o