Statistics for Computer Sciences Lecture 04 to Lecture 09 Probabilistic and Statistical Models Stanislav Katina1 1Institute of Mathematics and Statistics, Masaryk University Honorary Research Fellow, The University of Glasgow December 1, 2015 Probabilistic and Statistical Models Random variable, random vector, data, individuals ◮ random variable and random vector ◮ random variable X is a function from a sample space to a set of real numbers X : Y → R (a set of all possible outcomes) ◮ 2-dimensional random vector (X1, X2)T : Y → R2 ◮ k-dimensional random vector (X1, X2, . . . , Xk )T : Y → Rk ◮ data – data vector and data matrix – the elements of a vector and the rows of a matrix are measured on individuals (statistical units) ◮ data as realisations of X – n-dimensional vector x = (x1, x2, . . . , xn)T , where n is a sample size ◮ data as realisations of (X1, X2)T – (n × 2)-dimensional matrix with rows (xi1, xi2)T , i = 1, 2, . . . , n and columns x1 and x2 ◮ data as realisations of (X1, X2, . . . , Xk )T – (n × k)-dimensional matrix with rows (xi1, xi2, . . . , xik )T , i = 1, 2, . . . , n and columns x1, x2 and xk Probabilistic and Statistical Models Model ◮ based on probabilistic sampling principles, the individuals are sampled from a population ◮ attribute – a specific value of a variable ◮ with certain precision, data are measured on individuals ◮ descriptive statistics – describing and summarising data ◮ inferential statistics (statistical inference) – inferring (drawing conclusions) about random variable based on a model fitted to data ◮ F is a set of models (probabilistic or statistical) ◮ X is characterised by a model F(·), F ∈ F ◮ (X1, X2)T is characterised by a model F(2) (·), F ∈ F ◮ (X1, X2, . . . , Xk )T is characterised by a model F(k) (·), F ∈ F ◮ parameter – a numerical quantity that characterises a model – one-dimensional parameter θ, k-dimensional vector of parameters θ = (θ1, θ2, . . . , θk )T Probabilistic and Statistical Models Distribution function, probability and density function ◮ useful assumption – Xi, i = 1, 2, . . . 
, n, are independently identically distributed random variables ◮ distribution function ◮ discrete random variable FX (x) = Pr (X ≤ x) = i:xi ≤x Pr (X = xi ) , where k(∞) i=1 pi = 1, Pr (X = xi ) = pi = fX (xi ) = f(xi ), ∀xi , where pi is probability mass function; {xi , pi } k(∞) i=1 , k ∈ N+ ◮ continuous random variable FX (x) = x −∞ f (t) dt, f (x) ≥ 0, where ∞ −∞ f (x) dx = 1, fX (x) = f(x) = ∂ ∂x FX (x) is density function Probabilistic and Statistical Models Parametric and non-parametric model ◮ Θ is a parametric space, the support of F(·; θ) is Yθ ⊆ Rn (the smallest set, where the distribution function is defined); sample space Y = ∪θ∈ΘYθ ◮ F as a parametric set of distribution functions F = F(·; θ) : θ ∈ Θ ⊆ Rk , ◮ F as a parametric set of probability or density functions F = f(·; θ) : θ ∈ Θ ⊆ Rk ◮ F as non-parametric set F = {a set of all density functions} , alternatively, probability or distribution function can be used Probabilistic and Statistical Models Reading of mathematical notation ◮ the term ”probability model” is often reduced to ”distribution” ◮ ”Random variable X is distributed as F(x)” or ”random variable X is characterised by distribution F(x)”, notation X ∼ FX (x); symbol ”∼” means ”asymptotically”, ”for sufficiently large n” (notation X ∼ fX (x) is used very rarely) ◮ ”Random variable X is distributed as random variable Y” or ”Random variable X and Y are identically distributed” (notation X ∼ Y or FX (x) ∼ FY (y) ◮ the term ”statistical model” is often reduced to ”model” (usually referred as causal statistical model or model of causal dependence) ◮ ”Y depends on X”, where X is independent variable and Y is dependent variable (notation Y|X) Probabilistic and Statistical Models Reading of mathematical notation ◮ ”X is normally distributed with parameters µ and σ2”, notation X ∼ N(µ, σ2), where θ = (µ, σ2)T ◮ ”X = (X1, X2)T is characterised by bivariate normal distribution with parameters µ1, µ2, σ2 1, σ2 2 and ρ”, notation X ∼ N2(µ, Σ), where θ = (µ1, µ2, σ2 1, σ2 2, ρ)T ◮ ”X = (X1, X2, . . . , , Xk )T is characterised by multivariate normal distribution with parameters µ1, µ2, . . ., µk , σ2 1, σ2 2, . . ., σ2 k , and ρ1,2, . . ., ρk−1,k , ”, notation X ∼ Nk (µ, Σ), where θ = (µ1, µ2, . . . , µk , σ2 1, σ2 2, . . . , σ2 k , ρ1,2, . . . , ρk−1,k )T ◮ ”X is binomially distributed with parameter p”, notation X ∼ Bin(N, p), where θ = p ◮ ”X is characterised by distribution with parameter λ”, notation X ∼ Poiss(λ), where θ = λ ◮ ”X = (X1, X2, . . . , , Xk )T is multinomially distributed with parameter p”, notation X ∼ Multk (N, p), where θ = p Probabilistic and Statistical Models Measures of normal distribution ◮ ”X is normally distributed with parameters µ and σ2”, notation X ∼ N(µ, σ2), where θ = (µ, σ2)T ◮ Random variable Z (Z-transformation) Pr(Z = X−µ σ < x1−α) = 1 − α, Z ∼ N(0, 1) ◮ Rule ”90 − 95 − 99” Pr (a ≤ X ≤ b) = 1 − α, where 1 − α = 0.90, 0.95 and 0.99, a = µ − x1−α/2σ and b = µ + x1−α/2σ ◮ Rule ”68.27 − 95.45 − 99.73” Pr (a ≤ X < b) = Pr (X < b)−Pr (X < a) = FX (b)−FX (a), where a = µ − kσ, b = µ + kσ, k = 1, 2 and 3 Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution Definition (approximation of binomial distribution by normal distribution) If random variable X is binomially distributed with parameter p, X ∼ Bin(N, p), where θ = p, then if Np > 5 a Nq > 5, where q = 1 − p, then the distribution of random variable X can be approximated by normal distribution, X ∼ N(Np, Npq), where θ = (Np, Npq)T . 
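The Np > 5 and Nq > 5 rule of thumb can be checked numerically in R. The sketch below is illustrative only (it is not part of the original slides); it compares the binomial and approximating normal distribution functions over all x and reports the largest discrepancy, using a continuity correction of 0.5 (the correction itself is an addition, not mentioned in the definition above).
# Sketch: how close is N(Np, Npq) to Bin(N, p) once Np > 5 and Nq > 5?
approx.error <- function(N, p) {
  x <- 0:N
  exact  <- pbinom(x, size = N, prob = p)
  approx <- pnorm(x + 0.5, mean = N * p, sd = sqrt(N * p * (1 - p)))  # continuity correction
  max(abs(exact - approx))   # worst-case difference over all x
}
approx.error(N = 11, p = 0.5)   # Np = Nq = 5.5, both just above 5
approx.error(N = 51, p = 0.1)   # Np = 5.1, Nq = 45.9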
Table: Examples of minimal N for fixed p
p 0.1 0.2 0.3 0.4 0.5
q 0.9 0.8 0.7 0.6 0.5
N 51 26 17 13 11
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Example
Let Pr(male) = 0.515 and Pr(female) = 0.485. Let X be the frequency of males and Y the frequency of females. Assuming that X ∼ Bin(N, p), calculate (a) Pr(X ≤ 3) if N = 5, (b) Pr(X ≤ 5) if N = 10 and (c) Pr(X ≤ 25) if N = 50. Compare the results with the normal approximation X ∼ N(Np, Npq).
Solution
(a) E[X] = Np = 5 × 0.515 = 2.575, E[Y] = 5 × 0.485 = 2.425; exactly, Pr(X ≤ 3) = Σ_{k≤3} C(5, k) 0.515^k 0.485^{5−k} = 0.793; under the approximation N(5 × 0.515, 5 × 0.515 × 0.485), Pr(X ≤ 3) = 0.648.
(b) E[X] = 10 × 0.515 = 5.15, E[Y] = 10 × 0.485 = 4.85; exactly, Pr(X ≤ 5) = Σ_{k≤5} C(10, k) 0.515^k 0.485^{10−k} = 0.586; under N(10 × 0.515, 10 × 0.515 × 0.485), Pr(X ≤ 5) = 0.462.
(c) E[X] = 50 × 0.515 = 25.75, E[Y] = 50 × 0.485 = 24.25; exactly, Pr(X ≤ 25) = Σ_{k≤25} C(50, k) 0.515^k 0.485^{50−k} = 0.471; under N(50 × 0.515, 50 × 0.515 × 0.485), Pr(X ≤ 25) = 0.416.
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Figure: Probability mass function of the binomial distribution superimposed by the density function of the approximating normal distribution (p = 0.515; N = 5, 10 and 50); panels Bin(5, 0.515), Bin(10, 0.515), Bin(50, 0.515)
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Figure: Distribution function of the binomial distribution superimposed by the distribution function of the approximating normal distribution (p = 0.1; N = 5, 10 and 50); panels Bin(5, 0.1), Bin(10, 0.1), Bin(50, 0.1)
Probabilistic and Statistical Models (Univariate) normal distribution
Definition (normal distribution) Random variable X is normally distributed with parameters µ and σ², i.e. X ∼ N(µ, σ²), where θ = (µ, σ²)T, if its density is f(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}, x ∈ R, σ > 0.
Definition (standardised normal distribution) Random variable X is normally distributed with parameters µ = 0 and σ = 1, i.e. X ∼ N(0, 1), where θ = (0, 1)T, if its density is φ(x) = f(x) = (1/√(2π)) exp{−x²/2}, x ∈ R.
Parameter µ is called the mean of X and σ² the variance of X.
Probabilistic and Statistical Models Bivariate normal distribution
Definition (bivariate normal distribution) Random vector (X, Y)T is normally distributed with parameters µ and Σ, i.e. (X, Y)T ∼ N2(µ, Σ), where µ = (µ1, µ2)T and Σ = [σ1², ρσ1σ2; ρσ1σ2, σ2²], θ = (µ1, µ2, σ1², σ2², ρ)T, (x, y)T ∈ R², µj ∈ R, σj² > 0, j = 1, 2, ρ ∈ (−1, 1); the density is defined as f(x, y) = (1/A) exp{−(1/B)[(x − µ1)²/σ1² − 2ρ(x − µ1)(y − µ2)/(σ1σ2) + (y − µ2)²/σ2²]}, where A = 2π√(σ1²σ2²(1 − ρ²)) and B = 2(1 − ρ²).
Probabilistic and Statistical Models Standardised bivariate normal distribution
Definition (bivariate standardised normal distribution) Random vector (X, Y)T is normally distributed with parameters µ and Σ, i.e. (X, Y)T ∼ N2(µ, Σ), where µ = (0, 0)T and Σ = [1, ρ; ρ, 1], θ = (0, 0, 1, 1, ρ)T, (x, y)T ∈ R², ρ ∈ (−1, 1); the density is defined as f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² − 2ρxy + y²)/(2(1 − ρ²))}.
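As a small illustration of the definition above, the standardised bivariate normal density can be evaluated on a grid and drawn as contours in R; this is a sketch only, and the function name, grid and ρ = 0.5 are chosen here for illustration.
# Sketch: contours of the standardised bivariate normal density with correlation rho
dbvnorm <- function(x, y, rho)
  1 / (2 * pi * sqrt(1 - rho^2)) *
    exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2)))
x <- y <- seq(-3, 3, length.out = 101)
z <- outer(x, y, dbvnorm, rho = 0.5)
contour(x, y, z, xlab = "x", ylab = "y", main = "standardised N2, rho = 0.5")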
Probabilistic and Statistical Models Standardised bivariate and multivariate normal distribution
Let x = x1, y = x2 and x = (x1, x2)T. Then the standardised bivariate density can be rewritten in matrix form: f(x) = (1/(2π(det(Σ))^{1/2})) exp{−(1/2) xT Σ⁻¹ x}.
Let (X1, X2, . . . , Xk)T ∼ Nk(µ, Σ) and let x be a k-dimensional vector; then the density is equal to f(x) = (1/((2π)^{k/2}(det(Σ))^{1/2})) exp{−(1/2)(x − µ)T Σ⁻¹ (x − µ)}.
Marginal distributions of:
◮ bivariate and multivariate normal distribution – Xj ∼ N(µj, σj²), j = 1, 2, . . . , k
◮ standardised bivariate and multivariate normal distribution – Xj ∼ N(0, 1), j = 1, 2, . . . , k
Probabilistic and Statistical Models Bivariate normal distribution – simulation
Simulation of pseudo-random numbers from the bivariate normal distribution:
1. let X1 ∼ N(0, 1) and X2 ∼ N(0, 1) be independent
2. then (Y1, Y2)T ∼ N2(µ, Σ), where Y1 = σ1X1 + µ1 and Y2 = σ2(ρX1 + √(1 − ρ²) X2) + µ2
Example
Simulate pseudo-random numbers from the bivariate normal distribution, where θ = (µ1, µ2, σ1², σ2², ρ)T.
(a) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0; (1) n = 50 and (2) n = 1000;
(b) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0.5; (1) n = 50 and (2) n = 1000;
(c) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1.2, ρ = 0.5; (1) n = 50 and (2) n = 1000.
Probabilistic and Statistical Models Mixture of two bivariate normal distributions
The mixture of two univariate normal distributions is defined as follows: pN(µ1, σ1²) + (1 − p)N(µ2, σ2²), where θ = (µ1, µ2, σ1², σ2²)T.
The mixture of two bivariate normal distributions is defined as follows: pN2(µ1, Σ1) + (1 − p)N2(µ2, Σ2), where θ = (µ11, µ12, σ11², σ12², ρ1, µ21, µ22, σ21², σ22², ρ2)T.
Probabilistic and Statistical Models Binomial distribution
Jacob Bernoulli (1655–1705) – one of the founding fathers of probability theory.
Definition (binomial distribution) Let N be the number of independent identical (random) Bernoulli trials Xi, where Xi = 1 is a success (event occurred) and Xi = 0 is a failure (event did not occur), i = 1, 2, . . . , N. Then the probability of success is Pr(Xi = 1) = p and the probability of failure is Pr(Xi = 0) = 1 − p. The number of successes is X = Σ_{i=1}^{N} Xi. The probability that random variable X is equal to x = n (realisation) is defined as Pr(X = x) = C(N, x) p^x (1 − p)^{N−x}, for x = 0, 1, 2, . . . , N.
Expected value of X is defined as E[X] = Σ_{x=0}^{N} x Pr(X = x) = Σ_{x=0}^{N} x C(N, x) p^x (1 − p)^{N−x} = Np.
Variance of X is defined as Var[X] = Σ_{x=0}^{N} (x − E[X])² Pr(X = x) = Σ_{x=0}^{N} (x − Np)² C(N, x) p^x (1 − p)^{N−x} = Np(1 − p).
Probabilistic and Statistical Models Binomial distribution
Reading: Random variable X is binomially distributed with parameters N and p, where θ = p.
Notation: X ∼ Bin(N, p), θ = p.
Do we need to change it? YES. Why? Due to generalisation. Equivalently, X ∼ Bin(N, p, 1 − p), where X = (X1, X2)T, θ = (p, 1 − p)T, X1 is the number of successes, X2 = N − X1 is the number of failures, X1 ∼ Bin(N, p) and X2 ∼ Bin(N, 1 − p). Then
◮ E[X1] = Np, E[X2] = N(1 − p),
◮ Var[X1] = Var[X2] = Np(1 − p),
◮ Cov[X1, X2] = −Np(1 − p) and
◮ Cor[X1, X2] = −1.
Finally, n = (n1, n2)T and p = (p1, p2)T, p1 = p and p2 = 1 − p. Then θ = p.
Probabilistic and Statistical Models Binomial distribution
Example (number of boys) The number of boys X in families with N children is binomially distributed, i.e. X ∼ Bin(N, p), where N = 12 and the number of families is M = 6115 (Geissler 1889). Calculate the theoretical frequencies m_{n,E}. You know that p̂ = Σ_{n=0}^{N} n m_n / (NM) = 0.5192 (weighted average; average of the number of families weighted by the number of boys).
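The theoretical frequencies asked for here follow directly from the binomial probability mass function; a short R sketch (the rounded values agree with the table that follows):
# Sketch: expected numbers of families with 0..12 boys under Bin(12, 0.5192)
M <- 6115; N <- 12; p <- 0.5192
m.E <- M * dbinom(0:N, size = N, prob = p)
round(m.E)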
Table: Observed and theoretical frequencies (mn,O and mn,E ) of families with n boys (O = observed, E = expected, theoretical) n 0 1 2 3 4 5 6 7 8 9 10 11 12 mn,O 3 24 104 286 670 1033 1343 1112 829 478 181 45 7 mn,E 1 12 72 258 628 1085 1367 1266 854 410 133 26 2 Probabilistic and Statistical Models Multinomial distribution Definition (multinomial distribution) Let N be number of independent identical (random) trials and in each of them J ≥ 2 distinct possible outcomes can occur, where Xji = 1 is a success (event occurred) and Xji = 0 is a failure (event did not occur), i = 1, 2, . . . , N, j = 1, 2, . . . , J. Number of successes Xj = N i=1 Xji, N = J j=1 Xj. Then probability of success of i-th outcome in j-th trial is equal to Pr(Xji = 1) = pj (cell probabilities) and probability of failure in j-th trial is equal to Pr(Xji = 0) = 1 − pj. Let X = (X1, X2, . . . , XJ)T . The probability that random variables Xj are equal to xj = nj is defined as Pr(X1 = x1, . . . , XJ = xJ) = N! x1! . . . xJ! px1 1 px2 2 . . . pxJ J = N! j xj! J j=1 p xj j . Probabilistic and Statistical Models Multinomial distribution Expected value of X is a vector defined as E[X] = Np. Covariance matrix of X is defined as Var[X] = N diag (p) − ppT , where (Var[X])ij = Npj(1 − pj)) if i = j −Npipj if i = j Marginal distributions are binomial, i.e. Xj ∼ Bin N, pj . Then ◮ E[Xj] = Npj, ◮ Var[Xj] = Npj 1 − pj ◮ Cov Xi, Xj = −Npipj ◮ Cor Xi, Xj = −pipj / pi (1 − pi) pj 1 − pj Probabilistic and Statistical Models Multinomial distribution Reading: Random vector X is multinomially distributed with parameters N and p, where θ = p. Notation: X ∼ MultJ(N, p). If J = 2, then Bin (N, p) ≈ Mult2 (N, p) Realisation of one trial xij could be (1, 0, . . . , 0)T or (0, 1, . . . , 0)T . Example (number of individuals with certain blood type) Number of individuals X = (X1, X2, X3, X4)T with certain blood group is multinomially distributed following Hardy-Wienberg equilibrium, i.e. X = (X1, X2, X3, X4)T ∼ Mult4(N, p), where N = 500 (Katina et al. 2015). Calculate theoretical frequencies nE,j. attributes (groups) 0 A B AB nO,j 209 184 81 26 nE,j 210 183 80 27 Probabilistic and Statistical Models Multinomial distribution Example (number of individuals with certain socioeconomic status, political philosophy and political affiliation) Number of individuals X1, . . . , X8 with socioeconomic status, political philosophy and political affiliation is multinomially distributed, i.e. X = (X1, . . . , X8)T ∼ Mult8(N, p), where p = (p1, p2, . . . , p8)T and N = 50 (Christensen 1990). Calculate (a) Var[X1], (b) Var[X3], (c) Cov [X1, X3] and (d) Corr [X1, X3]. Table: 2 × 4 contingency table of probabilities pj D-Li D-C R-Li R-C total H 0.12 0.12 0.04 0.12 0.4 Lo 0.18 0.18 0.06 0.18 0.6 total 0.30 0.30 0.10 0.30 1.0 Probabilistic and Statistical Models Multinomial distribution Notation: (1) socioeconomic status (high – H, low – Lo), (2) political philosophy (democrat – D, republican – R) a (3) political affiliation (liberal – Li, conservative – C). Then X1 (H-D-Li), X2 (H-D-C), X3 (H-R-Li), X4 (H-R-C), X5 (Lo-D-Li), X6 (Lo-D-C), X7 (Lo-R-Li) a X8 (Lo-R-C). Solution: Var[X1] = 50 × 0.12 × (1 − 0.12) = 5.28 Var[X3] = 50 × 0.04 × (1 − 0.04) = 1.92 Cov [X1, X3] = −50 × 0.12 × 0.04 = −0.24 Cor [X1, X3] = −0.24/ √ 5.28 × 1.92 = −0.075 What are the expected frequencies? 
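A short R sketch answering this question and reproducing the variances, covariance and correlation from the solution above (the expected frequencies also appear in the table that follows):
# Sketch: moments of Mult8(50, p) for the socioeconomic/political example
N <- 50
p <- c(0.12, 0.12, 0.04, 0.12, 0.18, 0.18, 0.06, 0.18)   # cell probabilities p1..p8
N * p                               # expected frequencies N*p_j
V <- N * (diag(p) - p %*% t(p))     # covariance matrix N(diag(p) - p p^T)
V[1, 1]                             # Var[X1] = 5.28
V[3, 3]                             # Var[X3] = 1.92
V[1, 3]                             # Cov[X1, X3] = -0.24
V[1, 3] / sqrt(V[1, 1] * V[3, 3])   # Cor[X1, X3] = -0.075 (approximately)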
Table: 2 × 4 contingency table of frequencies Xj D-Li D-C R-Li R-C H 6 6 2 6 Lo 9 9 3 9 Probabilistic and Statistical Models Multinomial distribution Example (number of individuals with certain eye and hair colour) Let X = (X1, X2, . . . , X12)T be random vector of number of individuals with certain eye colour (with levels blue Bl, green Gr, brown Br) and hair color (with levels blond Blo, light-brown LB, black Ble, red R), where X1 means Bl-Blo, X2 means Bl-LB, X3 means Bl-Ble, X4 means Bl-R, X5 means Gr-Blo, X6 means Gr-LB, X7 means Gr-Ble, X8 means Gr-R, X9 means Br-Blo, X10 means Br-LB, X11 means Br-Ble and X12 means Br-R. Let X ∼ Mult12(N, p), where N = 6800 (Yule and Kendall 1950). Table: 3 × 4 contingency table of frequencies nj Blo LB Ble R row sums Bl 1768 807 189 47 2811 Gr 946 1387 746 53 3132 Br 115 438 288 16 857 column sums 2829 2632 1223 116 6800 Probabilistic and Statistical Models Product-multinomial distribution Definition (product-multinomial distribution) Let Nk be number of independent identical (random) trials and in each of them J ≥ 2 distinct possible outcomes can occur, where Xkji = 1 is a success (event occurred) and Xkji = 0 is a failure (event did not occur), i = 1, 2, . . . , Nk , k = 1, 2, . . . , K, j = 1, 2, . . . , J. Number of successes Xkj = Nk i=1 Xkji and K k=1 Nk = N. Then probability of success of kj-th outcome in i-th trial is equal to Pr(Xkji = 1) = pkj (cell probabilities) and probability of failure of kj-th outcome in i-th trial is equal to Pr(Xkji = 0) = 1 − pkj. Let Xk = (Xk1, Xk2, . . . , XkJ)T si multinomially distributed with parameters Nk and pk , i.e. Xk ∼ MultJ (Nk , pk ), kde θk = pk a pk = (pk1, pk2, . . . , pkJ)T . Let realisations of Xk be xk . The xkj = nkj and nk = (nk1, nk2, . . . , nkJ)T . Additionally, Xk are independent. Probabilistic and Statistical Models Product-multinomial distribution The probability that random variables Xkj are equal to xkj = nkj (for all j and k) is defined as Pr(Xkj = xkj, ∀k, j) = K k=1 Pr(Xkj = xkj, ∀j). The probability that random variables Xkj are equal to xkj = nkj (for all j) is defined as Pr(Xkj = xkj, ∀j) =  Nk !/ J j=1 xkj!   J j=1 p xkj kj . Then Pr(Xkj = xkj, ∀k, j) = K k=1    Nk !/ J j=1 xkj!   J j=1 p xkj kj   . Probabilistic and Statistical Models Product-multinomial distribution Reading: Random matrix X is product-multinomially distributed with parameters N = (N1, N2, . . . , NK )T and P with the rows pk , where θk = pk , k = 1, 2, . . . , K. Notation: X ∼ ProdMultK (N, p). If K = 1, then MultJ (N, p ) ≈ ProdMult1 (N, p) Realisation of one trial xkij could be (1, 0, . . . , 0)T or (0, 1, . . . , 0)T . Then ◮ expected frequencies are equal to Nk pkj , ◮ within each Xk , variances Var[Xkj ], covariances Cov[Xkj] and correlations Cor[Xkj ] are calculated as for multinomial distribution, ◮ between Xk , e.g. Cov[X1, X2], k = 1, 2, are zeroes due to independence of Xk Probabilistic and Statistical Models Product-multinomial distribution Example (number of individuals with certain blood type) Let X = (X1, X2)T , where X1 = (X11, X12, X13, X14)T is number of individuals in Koˇsice (Slovakia) with certain blood group, X2 = (X21, X22, X23, X24)T is number of individuals in Prague (Czech Republic) with certain blood group. X is product-multinomially distributed, i.e. X ∼ ProdMult2(N, P), where N = (N1, N2)T , where N1 = 500 and N2 = 400 (Katina et al. 2015). Calculate theoretical frequencies nE,kj . 
Question: What are the probabilities of having a particular blood group in Prague and Košice?
Table: Observed frequencies of particular blood groups in Prague and Košice
attributes (groups) 0 A B AB
n1j = nPrague,j 209 184 81 26
n2j = nKošice,j 138 147 84 31
Probabilistic and Statistical Models Product-multinomial distribution
Example (number of individuals with certain socioeconomic status, political philosophy and political affiliation) The number of individuals X = (X1, X2)T with certain socioeconomic status, political philosophy and political affiliation is product-multinomially distributed, i.e. X ∼ ProdMult2(N, P), where X1 = (X11, X12, X13, X14)T is the number of individuals with high socioeconomic status, X2 = (X21, X22, X23, X24)T the number of individuals with low socioeconomic status, pk = (p1|k, p2|k, . . . , pJ|k)T, pkj = pj|k = nkj/nk, k = 1, 2, N = (N1, N2)T, N1 = 30, N2 = 20 (Christensen 1990). Calculate (a) the probabilities pj|k, (b) the expected frequencies, (c) Var[X3|1], (d) Cov[X1|2, X3|2] and (e) Cov[X1|1, X3|2].
Probabilistic and Statistical Models Product-multinomial distribution
Notation: (1) socioeconomic status (high – H, low – Lo), (2) political philosophy (democrat – D, republican – R) and (3) political affiliation (liberal – Li, conservative – C). Then X11 = X1|1 (H-D-Li), X12 = X2|1 (H-D-C), X13 = X3|1 (H-R-Li), X14 = X4|1 (H-R-C), X21 = X1|2 (Lo-D-Li), X22 = X2|2 (Lo-D-C), X23 = X3|2 (Lo-R-Li) and X24 = X4|2 (Lo-R-C).
Solution:
Table: 2 × 4 contingency table of probabilities pj|k
D-Li D-C R-Li R-C total
H 0.3 0.3 0.1 0.3 1.0
Lo 0.3 0.3 0.1 0.3 1.0
Probabilistic and Statistical Models Product-multinomial distribution
Table: 2 × 4 contingency table of frequencies nkj
D-Li D-C R-Li R-C total
H 9 9 3 9 30
Lo 6 6 2 6 20
Var[X3|1] = 30 × 0.1 × (1 − 0.1) = 2.7, Cov[X1|2, X3|2] = −20 × 0.3 × 0.1 = −0.6, Cov[X1|1, X3|2] = 0, due to the independence of X1 and X2.
Probabilistic and Statistical Models Poisson distribution
Definition (Poisson distribution) Let X be a random variable characterised by the Poisson distribution, i.e. X ∼ Poiss(λ), where θ = λ. Then Pr(X = x) = λ^x e^{−λ}/x!, x = 0, 1, . . . , where x = n is a realisation of X. Then E[X] = λ and Var[X] = λ.
Probabilistic and Statistical Models Poisson distribution
Example (Poisson distribution; killing by horse kicks) Data were published by the Russian economist Ladislaus Bortkiewicz in his book entitled Das Gesetz der kleinen Zahlen (The Law of Small Numbers) in 1898. Let X be the number of soldiers killed by horse kicks in one army corps of the Prussian army within one year (von Bortkiewicz 1898; 10 different army corps observed over 20 years, 1875–1894), n the number of annual deaths and mn the number of corps-years with that number of annual deaths, M = Σn mn = 10 × 20 = 200.
Then X ∼ Poiss(λ), where λ̂ = Σn n mn / Σn mn = 0.61 (weighted average; average of the numbers of army corps weighted by the numbers of annual deaths).
Table: Observed and theoretical frequencies (mn,O and mn,E) of corps-years with n soldiers killed by horse kicks (n annual deaths) over 20 years
n 0 1 2 3 4 5+
mn,O 109 65 22 3 1 0
mn,E 109 66 20 4 1 0
Probabilistic and Statistical Models Poisson distribution
Example (Poisson distribution; accidents in the factories) Let X be the number of accidents of a worker in the munition factories in England during the First World War (Greenwood and Yule 1920), n the number of accidents and mn the number of workers with that number of accidents, M = Σn mn = 647. Then X ∼ Poiss(λ), where λ̂ = Σn n mn / Σn mn = 0.47 (weighted average; average of the numbers of workers weighted by the numbers of accidents).
Table: Observed and theoretical frequencies (mn,O and mn,E) of workers with n accidents
n 0 1 2 3 4 ≥ 5
mn,O 447 132 42 21 3 2
mn,E 406 189 44 7 1 0
Probabilistic and Statistical Models Formulations of hypotheses about probability distributions
1. multinomial distribution – example – number of individuals with certain eye and hair colour: Are the rows and columns of a contingency table independent?
◮ Are the frequencies of individuals with certain eye colour (with levels blue, green, brown) independent of hair colour (with levels blond, light-brown, black, red)?
2. product-multinomial distribution: Are the vectors of frequencies the same in each row? Are the vectors of frequencies independent of the row index?
◮ example – number of individuals with certain socioeconomic status, political philosophy and affiliation – Are the vectors of frequencies of individuals (D-Li, D-C, R-Li, R-C) the same for each level of socioeconomic status (high and low)?
◮ example – blood groups – Is the distribution of the blood groups (0, A, B, AB) the same in Prague and Košice?
Probabilistic and Statistical Models Formulations of hypotheses about probability distributions
3. binomial distribution – example – number of boys:
◮ Is the number of boys in families with 12 children binomially distributed?
◮ Is the probability of having a boy in the family equal to 0.5?
4. Poisson distribution:
◮ example – killing by horse kicks – Is the distribution of the number of soldiers killed by horse kicks per army corps and year Poisson?
◮ example – accidents in the factories – Is the distribution of the number of accidents per worker Poisson?
Probabilistic and Statistical Models Types of contingency tables – multinomial distribution
1 × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
x1 x2 . . . xJ N
1 × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
p1 p2 . . . pJ 1
2 × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
row 1 x11 x12 . . . x1J N1
row 2 x21 x22 . . . x2J N2
2 × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
row 1 p11 p12 . . . p1J p1• = 1
row 2 p21 p22 . . . p2J p2• = 1
Probabilistic and Statistical Models Types of contingency tables – multinomial distribution
K × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
row 1 x11 x12 . . . x1J N1
row 2 x21 x22 . . . x2J N2
... ... ... . . . ... ...
row K xK1 xK2 . . . xKJ NK
K × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
row 1 p11 p12 . . . p1J p1• = 1
row 2 p21 p22 . . . p2J p2• = 1
... ... ... . . . ... ...
row K pK1 pK2 . . .
pKJ pK• = 1 Probabilistic and Statistical Models Types of contingency tables – product-multinomial distribution 1 × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum x1 x2 . . . xJ N 1 × J contingency table of probabilities (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum p1 p2 . . . pJ 1 2 × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum group 1 x11 x12 . . . x1J N1 group 2 x21 x22 . . . x2J N2 2 × J contingency table of probabilities outcome 1 outcome 2 . . . outcome J sum group 1 p1|1 p2|1 . . . pJ|1 1 group 2 p1|2 p2|2 . . . pJ|2 1 Probabilistic and Statistical Models Types of contingency tables – product-multinomial distribution K × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum group 1 x11 x12 . . . x1J N1 group 2 x21 x22 . . . x2J N2 ... ... ... . . . ... ... group K xK1 xK2 . . . xKJ NK K × J contingency table of probabilities outcome 1 outcome 2 . . . outcome J sum group 1 p1|1 p2|1 . . . pJ|1 1 group 2 p1|2 p2|2 . . . pJ|2 1 ... ... ... . . . ... ... group K p1|K p2|K . . . pJ|K 1 Probabilistic and Statistical Models Data structure for 1 × J contingency table – multinomial distribution outcome 1 outcome 2 . . . outcome J sum x1 1 0 . . . 0 1 x2 0 1 0 1 x3 0 1 0 1 x4 1 0 . . . 0 1 ... ... ... . . . ... ... xN−1 0 0 . . . 1 1 xN 1 0 . . . 0 1 sum= x x1 x2 . . . xJ N ◮ sum of each row is one ◮ sum of all row sums is N ◮ sum of each column is xj , where j = 1, 2, . . . , J ◮ sum of all xj , j = 1, 2, . . . , J, is N ◮ x = n Probabilistic and Statistical Models Data structure for K × J contingency table – (product-)multinomial distribution outcome 1 outcome 2 . . . outcome J sum xk1 1 0 . . . 0 1 xk2 0 1 0 1 xk3 0 1 0 1 xk4 1 0 . . . 0 1 ... ... ... . . . ... ... xk,Nk −1 0 0 . . . 1 1 xk,Nk 1 0 . . . 0 1 sum= xk xk1 xk2 . . . xkJ Nk ◮ sum of each row is one ◮ sum of all row sums is Nk ◮ sum of each column is xkj , where j = 1, 2, . . . , J ◮ sum of all xkj , j = 1, 2, . . . , J, is N ◮ xk = nk , where k = 1, 2, . . . , K Probabilistic and Statistical Models Assignments in Assignment number of boys: 1. Draw probability mass function of number of boys in the families with 12 children? 2. What are the probabilities of having n boys in the family (n = 1, 2, . . . , 12)? What is the probability of having eight or more boys in the family? What is the probability of having five to seven boys in the family? Assignment killing by horse kick: 1. Draw probability mass function of number of corpses with certain number of solders killed by horse kick? 2. What are the probabilities of having n annual deaths (n = 0, 1, 2, 3, 4, 5+)? What is the probability of having one or less annual deaths? Assignment accidents in the factories: 1. Draw probability mass function of number of workers having an accident? 2. What are the probabilities of having n accidents (n = 0, 1, 2, 3, 4, 5+)? What is the probability of having two or more accidents? Probabilistic and Statistical Models Assignments in Assignment number of individuals with certain socioeconomic status, political philosophy and affiliation: 1. What is the number of all 2 × 4 contingency table with N = 50? n+k−1 k = 57 7 = 57 50 = 264385836 choose(57,50)=choose(57,7) 2. What is the probability of getting the following 2 × 4 contingency table? D-Li D-C R-Li R-C H 5 7 4 6 Lo 8 7 3 10 Pr(X1 = x1, X2 = x2, . . . , X8 = x8) = 50! 5!7!4!6!8!7!3!10! 
0.125 0.127 0.044 0.126 0.188 0.187 0.063 0.1810 = 2.332506 × 10−6 n<-c(5,7,4,6,8,7,3,10) p<-c(.12,.12,.04,.12,.18,.18,.06,.18) dmultinom(x=n,prob=p) ## 2.332506e-06 Probabilistic and Statistical Models Assignments in Assignment number of individuals with certain socioeconomic status, political philosophy and affiliation: 3. What is the most probable 2 × 4 contingency table and what is the probability of getting it? D-Li D-C R-Li R-C H 6 6 2 6 Lo 9 9 3 9 Pr(X1 = x1, X2 = x2, . . . , X8 = x8) = 50! 6!6!2!6!9!9!3!9! 0.126 0.126 0.042 0.126 0.189 0.189 0.063 0.189 = 1.020471 × 10−5 4.375× more than in (2) n<-c(6,6,2,6,9,9,3,9) p<-c(.12,.12,.04,.12,.18,.18,.06,.18) dmultinom(x=n,prob=p) ## 1.020471e-05 4. Draw probability mass function of number of possible 2 × 4 contingency tables with N = 50? Probabilistic and Statistical Models Likelihood function Definition (likelihood function) For a statistical model F where we expect the data x ∈ R to be observed, the function L : Θ → R+ ∪ {0}, called likelihood function (likelihood), is defined as L(θ|x) = L(θ, x) = c(x)f(θ, x), where c ∈ R is independent of θ, f(θ|x) = f(θ, x) = n i=1 f(xi , θ). Likelihood L(θ|x) is used when describing a function of a parameter given an outcome. Density (probability mass function) f(xi , θ) = f(xi |θ) is used when describing a function of the outcome given a fixed parameter value. Probabilistic and Statistical Models Likelihood function The natural logarithm of the likelihood function, called the log-likelihood, is defined as ln(L(θ|x)) = l(θ|x) = ln c + ln(f(θ|x)). ◮ The log-likelihood, is more convenient to work with. ◮ We are searching for the maximum of likelihood function. ◮ Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself. Hence the log-likelihood can be used in place of the likelihood in finding the maximum. ◮ Finding the maximum of a function involves taking the (partial) derivative of a function, equaling it to zero, and solving for the parameter being maximized. Probabilistic and Statistical Models Likelihood function Definition (maximum-likelihood estimate) The estimate of a parameter θ, θML = θ, called maximum-likelihood estimate (MLE), is a value which maximises the likelihood function, i.e. θML = arg max ∀θ L(θ|x) = arg max ∀θ l(θ|x). The process of maximisation is called maximum-likelihood estimation: ◮ the first derivative of log-likelihood function (score function) S(θ) = ∂ ∂θ l(θ|x), ◮ likelihood equations (score equations) S(θ) = 0, ◮ the second derivative of log-likelihood function ∂2 ∂θ2 l(θ|x), ◮ in the maximum is the second derivative negative and the curvature in θ is equal to Fisher information I(θ) = − ∂2 ∂θ2 l(θ|x)|θ=θ. Probabilistic and Statistical Models Likelihood function ◮ The curvature is inversely related to the variance of θ, i.e. Var[θ] = 1/I(θ). ◮ Since Xi , i = 1, 2, . . . , n are independent, I(θ) = ni(θ). Ronald Aylmer Fisher (1890−1962) – English statistician, wrote in 1925: What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate measuring our order of preference among different possible populations, does not in fact obey the laws of probability. To distinguish it from probability, I have used the term ”likelihood” to designate this quantity. 
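The relation Var[θ̂] = 1/I(θ̂) can be checked numerically. The sketch below is not from the slides; the data x = 40, N = 100 are arbitrary illustrative values. It minimises a negative binomial log-likelihood with optim() and inverts the returned Hessian, which is the observed information.
# Sketch: MLE and its variance from the observed information, binomial example
x <- 40; N <- 100
negloglik <- function(p) -(x * log(p) + (N - x) * log(1 - p))
fit <- optim(par = 0.5, fn = negloglik, method = "Brent",
             lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)
fit$par                       # MLE p.hat = x/N = 0.4
1 / fit$hessian[1, 1]         # 1/I(p.hat), numerically
fit$par * (1 - fit$par) / N   # analytic p.hat*(1 - p.hat)/N = 0.0024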
Probabilistic and Statistical Models Profile likelihood function Let θ = (θ1, θ2)T , where s θ1 is the parameter of interest and θ2 a nuisance parameter. In some cases, the separation into two such components can be achieved after suitable reparametrisation. If θ2|θ1 denotes the value of θ2 which maximizes the likelihood (or log-likelihood) function for the given value of θ1, we define profile likelihood function LP(θ1|x) = L((θ1, θ2|θ1 )T |x) and profile log-likelihood function lP(θ1|x) = l((θ1, θ2|θ1 )T |x) The term ”profile” comes about through thinking of the usual (log-)likelihood function a s a hill being observed from a viewpoint with abscissa θ2 = ∞, so that, for any fixed θ1, only the highest value L((θ1, θ2|θ1 )T |x) or l((θ1, θ2|θ1 )T |x) is seen. Probabilistic and Statistical Models Likelihood function of binomial distribution Definition (likelihood and log-likelihood function of binomial distribution) Let X be binomially distributed with parameters N and θ = p, i.e. X ∼ Bin(N, p). Realisations of X be x = n. Then the likelihood function is equal to L(p|x) = N i=1 N xi pxi (1 − p) N−xi = px (1 − p) N−x N i=1 N xi . Since the product of binomial coefficients is not important in likelihood maximisation, only the kernel (often called likelihood as well) is used. Then L(p|x) ≈ px (1 − p) N−x . The log-likelihood function is equal to l(p|x) = x ln p + (N − x) ln (1 − p) . Probabilistic and Statistical Models Likelihood function of binomial distribution Example (maximum-likelihood estimation) Let X be binomially distributed with parameters N and θ = p, i.e. X ∼ Bin(N, p). Derive p and Var[p]. Solution (partial) S(p) = ∂ ∂p l(p|x) = x p − N − x 1 − p , ∂2 ∂p2 l(p|x) = − Np /p2 − N 1 − p / (1 − p) 2 . Then p = x N and Var[p] = p(1 − p) N . Probabilistic and Statistical Models Likelihood function of binomial distribution Example (maximal likelihood estimates of p) Generate in pseudo-random variables X ∼ Bin(N, p), where N = 20. Write -function to calculate (1) likelihood function L(p|x) of binomial distribution, where x = 2, N = 20, (2) likelihood function L(p|x) of binomial distribution, where x = 10, N = 20 and (3) likelihood function L(p|x) of binomial distribution, where x = 18, N = 20. Repeat the same for log-likelihood function. Calculate also p using function optimize(). Draw all three functions in three side-by-side windows with highlighted maxima. 
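One possible R sketch for this exercise (illustrative only; the log-likelihood version is analogous, and x = 2, 10, 18 are the values given above):
# Sketch: binomial likelihoods for x = 2, 10, 18 with N = 20, maximised by optimize()
N <- 20
lik <- function(p, x) p^x * (1 - p)^(N - x)    # kernel of L(p|x)
par(mfrow = c(1, 3))
for (x.obs in c(2, 10, 18)) {
  p.hat <- optimize(lik, interval = c(0, 1), x = x.obs, maximum = TRUE)$maximum
  curve(lik(p, x = x.obs), from = 0, to = 1, xname = "p",
        xlab = "probability p", ylab = "L(p|x)",
        main = paste("MLE of p =", round(p.hat, 2)))
  abline(v = p.hat, lty = 2)                   # highlight the maximum
}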
Solution (partial) l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 2, N = 20 l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 10, N = 20 l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 18, N = 20 Probabilistic and Statistical Models Likelihood function of binomial distribution 0.0 0.2 0.4 0.6 0.8 1.0 0.00000.00050.00100.0015 MLE of p = 0.1 probability p L(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 0e+002e−074e−076e−078e−07 MLE of p = 0.5 probability p L(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 0.00000.00050.00100.0015 MLE of p = 0.9 probability p L(p|x) Figure: Likelihood functions of binomial distribution X ∼ Bin(N, p), where N = 20 Probabilistic and Statistical Models Likelihood function of binomial distribution 0.0 0.2 0.4 0.6 0.8 1.0 −120−100−80−60−40−20 MLE of p = 0.1 probability p l(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 −70−60−50−40−30−20 MLE of p = 0.5 probability p l(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 −120−100−80−60−40−20 MLE of p = 0.9 probability p l(p|x) Figure: Log-likelihood functions of binomial distribution X ∼ Bin(N, p), where N = 20 Probabilistic and Statistical Models Likelihood function of multinomial distribution Definition (likelihood and log-likelihood function of multinomial distribution) Let X be multinomially with parameters N and θ = p, i.e. X ∼ MultJ (N, p). Realisations of Xj be xj = nj . Then the (kernel of) likelihood function is equal to L(p|x) = N i=1 N! J j=1 xj ! J j=1 p xji j ≈ J j=1 p xj j and the log-likelihood function is equal to l (p|x) = J j=1 xj ln pj . Probabilistic and Statistical Models Likelihood function of multinomial distribution Example (maximum-likelihood estimation) Let X be multinomially with parameters N and θ = p, i.e. X ∼ MultJ (N, p). Derive p and Var[p]. Solution (partial) Let pJ = 1 − J−1 j=1 pj and p = (p1, p2, . . . , pJ−1)T Then l(p|x) = J−1 j=1 nj ln pj + nJ ln(1 − J−1 j=1 pj ) and (S(p))j = ∂ ∂pj l(p|x) = nj pj − nJ pJ as the elements of S(p)). Then I(p) = − ∂ ∂p S(p) = diag n1 p2 1 , n2 p2 2 , . . . , nJ−1 p2 J−1 + nJ p2 J 11T . Probabilistic and Statistical Models Likelihood function of multinomial distribution I(p) = N diag 1 p1 , 1 p2 , . . . , 1 pJ−1 + 11T pJ . Then I(p) = N       1 p1 + 1 pJ 1 pJ 1 pJ . . . 1 pJ 1 pJ 1 p2 + 1 pJ 1 pJ . . . 1 pJ ... ... ... ... ... 1 pJ 1 pJ . . . 1 pJ 1 pJ−1 + 1 pJ       . Var[p] = I−1 (p) = 1 N diag p − ppT . Then Var[p] = 1 N      p1(1 − p1) −p1p2 . . . −p1pJ−1 −p2p1 p2(1 − p2) . . . −p2pJ−1 ... ... ... ... −pJ−1p1 −pJ−1p2 . . . pJ−1(1 − pJ−1)      . Probabilistic and Statistical Models Likelihood function of multinomial distribution p1 p2 ● 0 0.2 0.4 0.6 0.8 1 00.20.40.60.81 Figure: Log-likelihood function of multinomial (trinomial) distribution Probabilistic and Statistical Models Likelihood function of Poisson distribution Definition (likelihood and log-likelihood function of Poisson distribution) Let X be distributed as Poisson with parameter θ = λ, i.e. X ∼ Poiss(λ). Realisations of Xj be xj = nj . Then the (kernel of) likelihood function is equal to L (λ|x) = N i=1 λxi e−λ xi ! ≈ λ N i=1 xi e−Nλ and the log-likelihood function is equal to l (λ|x) = N i=1 xi ln λ − Nλ. In general, L(λ|x) = n pmn n , where pn = Pr(X = n) = e−λ λn /n! and l(λ|x) = −λ n mn + n nmn ln λ. Probabilistic and Statistical Models Likelihood function of Poisson distribution Example (maximum-likelihood estimation) Let X be distributed Poisson with parameter θ = λ, i.e. X ∼ Poiss(λ). Derive λ and Var[λ]. Solution (partial) S(λ) = ∂ ∂λ l (λ|x) = N i=1 xi λ − N, ∂2 ∂λ2 l (λ|x) = − N i=1 xi λ2 . 
Then λ̂ = Σ_{i=1}^{N} xi / N = x̄ and Var[λ̂] = x̄/N. In general notation, λ̂ = Σn n mn / Σn mn.
Probabilistic and Statistical Models Likelihood function of Poisson distribution
Example (maximum-likelihood estimates of λ) Write an R function to calculate the likelihood function L(λ|x) and the log-likelihood function l(λ|x) of the Poisson distribution X ∼ Poiss(λ) for the horse kick data. Calculate also λ̂ using the function optimize(). Draw both functions in two side-by-side windows with the maximum highlighted.
Solution (partial) l(λ|x) = −λ Σn mn + Σn n mn ln λ, where λ ∈ (0, 2)
Probabilistic and Statistical Models Likelihood function of Poisson distribution
Figure: Likelihood function L(λ|x) (left) and log-likelihood function l(λ|x) (right) of the Poisson distribution X ∼ Poiss(λ) for the horse kick data (MLE of λ = 0.61)
Probabilistic and Statistical Models Assignments in R
Assignment number of boys: Calculate p̂ (the probability of having a boy in a family) and Var[p̂] (the variance of the estimated probability of having a boy in a family).
Assignment killing by horse kicks: Calculate λ̂ (the mean number of annual deaths) and Var[λ̂] (the variance of the estimated mean number of annual deaths).
Assignment accidents in a factory: Calculate λ̂ (the mean number of accidents in a factory) and Var[λ̂] (the variance of the estimated mean number of accidents in a factory).
Probabilistic and Statistical Models Assignments in R
Assignment blood groups: In Prague and Košice, calculate p̂ (the probabilities of having a certain blood group in the particular city) and Var[p̂] (the covariance matrix of the estimated probabilities of having a certain blood group in the particular city).
Assignment eye and hair colour: Calculate p̂ (the probabilities of having a certain eye and hair colour) and Var[p̂] (the covariance matrix of the estimated probabilities of having a certain eye and hair colour).
Probabilistic and Statistical Models Likelihood function of normal distribution
Definition (likelihood and log-likelihood function of normal distribution) Let X be normally distributed with parameter θ = (µ, σ²)T, i.e. X ∼ N(µ, σ²). Let the realisations of Xi be xi. Then the likelihood function is equal to L(θ|x) = Π_{i=1}^{n} (1/√(2πσ²)) exp{−(1/2)((xi − µ)/σ)²} = (1/(2πσ²)^{n/2}) exp{−(1/(2σ²))[Σ_{i} xi² − 2µ Σ_{i} xi + nµ²]} and the log-likelihood function is equal to l(θ|x) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)².
Probabilistic and Statistical Models Likelihood function of normal distribution
Example (maximum-likelihood estimation) Let X be normally distributed with parameter θ = (µ, σ²)T, i.e. X ∼ N(µ, σ²). Derive θ̂ = (µ̂, σ̂²)T and Var[θ̂] = Σ.
Solution (partial) S1(θ) = ∂l(θ|x)/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ), S2(θ) = ∂l(θ|x)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xi − µ)². Then µ̂ = x̄ = (1/n) Σ_{i} xi, σ̂² = (1/n) Σ_{i} (xi − µ̂)², and I(θ) = diag(n/σ², n/(2σ⁴)).
Probabilistic and Statistical Models Likelihood function of normal distribution
Example (maximum-likelihood estimates of µ and σ²) Generate in R pseudo-random variables X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000. Write an R function to calculate (1) the (profile) likelihood function LP(µ|x) of the normal distribution for the generated data X, (2) the (profile) likelihood function LP(σ²|x) of the normal distribution for the generated data X, and (3) the likelihood function L(θ|x) of the normal distribution for the generated data X, where θ = (µ, σ²)T. Repeat the same for the log-likelihood function. Calculate also the MLEs using the functions optimize() and optim(). Draw all three functions in three side-by-side windows with the maxima highlighted.
Solution (partial)
lP(µ|x) = −(n/2) ln(2π) − (n/2) ln σµ² − (1/(2σµ²))[Σ_{i} xi² − 2µ Σ_{i} xi + nµ²], where µ ∈ (2, 6) and σµ² = 1;
lP(σ²|x) = −(n/2) ln(2π) − (n/2) ln σ² − Σ_{i} (xi − µσ)²/(2σ²), where µσ = 4 and σ² ∈ (0.5, 1.5);
l(θ|x) = −(n/2) ln(2π) − (n/2) ln σ² − Σ_{i} (xi − µ)²/(2σ²), where µ ∈ (2, 6) and σ² ∈ (0.5, 1.5).
Probabilistic and Statistical Models Likelihood function of normal distribution
n <- 1000
x <- rnorm(n, mean = 4, sd = 1)             # pseudo-random sample from N(4, 1)
# profile (negative) log-likelihood for mu, with sigma^2 fixed at 1
sigma.mu <- 1
negloglikemu <- function(mu)
  n/2*log(2*pi) + n/2*log(sigma.mu^2) +
    (sum(x^2) - 2*mu*sum(x) + n*mu^2)/(2*sigma.mu^2)
OPTmu <- optimize(negloglikemu, c(2, 6), maximum = FALSE)
OPTmu$minimum                               # approx. 4, e.g. 3.987524
# profile (negative) log-likelihood for sigma^2, with mu fixed at 4
mu.sigma <- 4
negloglikesigma <- function(sigma2)
  n/2*log(2*pi) + n/2*log(sigma2) + sum((x - mu.sigma)^2)/(2*sigma2)
OPTsigma <- optimize(negloglikesigma, c(0.5, 1.5), maximum = FALSE)
OPTsigma$minimum                            # approx. 1, e.g. 0.9630124
# joint negative log-likelihood in theta = (mu, sigma^2)
negloglike <- function(theta)
  (n/2)*log(2*pi) + (n/2)*log(theta[2]) + (1/(2*theta[2]))*sum((x - theta[1])^2)
OPTboth <- optim(c(3, 0.5), negloglike, method = "Nelder-Mead", hessian = TRUE)
OPTboth$par                                 # approx. (4, 1), e.g. 3.9875376 0.9627521
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Profile likelihood functions (left, middle) and likelihood function (right) of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here 10⁻⁴ L(·)
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Profile log-likelihood functions (left, middle) and log-likelihood function (right) of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here exp(10⁻⁴ L(·))
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Likelihood (left) and log-likelihood (right) function of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here exp(10⁻⁴ L(·)) and 10⁻⁴ L(·)
Probabilistic and Statistical Models Approximation of likelihood function
Definition (relative likelihood and log-likelihood function) The relative likelihood function is defined as L(θ|x)/L(θ̂|x) and the relative log-likelihood function as ln[L(θ|x)/L(θ̂|x)] = l(θ|x) − l(θ̂|x).
◮ It is often useful when the likelihood function can be approximated by a quadratic function.
◮ In addition to the location of the maximum of the likelihood function, we then also need the curvature around the maximum.
◮ Since the log-likelihood is more convenient to work with, we need a quadratic approximation of the log-likelihood function.
Probabilistic and Statistical Models Approximation of likelihood function
Definition (Taylor polynomial of order r) If a function g(x) has derivatives of order r, that is, g^(r)(x) = ∂^r g(x)/∂x^r exists, then for any constant a, the Taylor polynomial of order r about a is T_r(x) = Σ_{j=0}^{r} [g^(j)(a)/j!] (x − a)^j.
In practical statistical situations we assume that the remainder g(x) − Tr (x) converges to zero as n increases, therefore we are going to ignore it. There are many explicit forms, one of the most useful is g(x) − Tr (x) = x a g(r+1) (t) r! (x − t)r dt. If g(r) (a) = ∂r ∂xr g(x)|x=a exists, then lim x→a g(x) − Tr (x) (x − a)r = 0. Probabilistic and Statistical Models Approximation of likelihood function The quadratic approximation of log-likelihood function about θ defined as l(θ|x) ≈ l(θ|x) + S(θ)(θ − θ) − 1 2 I(θ)(θ − θ)2 , The quadratic approximation of relative likelihood function about θ is defined as ln L(θ|x) = ln L(θ|x) L(θ|x) = l(θ|x) − l(θ|x) ≈ − 1 2 I(θ)(θ − θ)2 . It is often useful to visualise a derivative of the quadratic approximation S(θ) ≈ −I(θ)(θ − θ) or −I−1/2 (θ)S(θ) ≈ I1/2 (θ)(θ − θ), where −I−1/2 (θ)S(θ) is visualised against I1/2 (θ)(θ − θ). If the quadratic approximation is correct, this should be a line with slope equal to one. Probabilistic and Statistical Models Approximation of likelihood function 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (a) x=8, N=10 approx log−like log−like −2 −1 0 1 2 05101520 scaled θ scaledscorefunction (a) linearity of score function approx log−like log−like 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (b) x=80, N=100 −2 −1 0 1 2 0102030405060 scaled θ scaledscorefunction (b) linearity of score function 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (c) x=800, N=1000 −2 −1 0 1 2 050100150200 scaled θ skalovanaskorefunkcia (c) linearity of score function Figure: Relative binomial log-likelihood, its quadratic approximation and linearity of score function Probabilistic and Statistical Models Numerical maximisation of likelihood function Isaac Newton (1643−1727) and Joseph Raphson (1648−1715). Definition (Newton-Raphson method) Having quadratic approximation of log-likelihood function about θ0 l(θ|x) ≈ l(θ0|x) + S(θ0)(θ − θ0) − 1 2 I(θ0)(θ − θ0)2 or linear approximation of score function about θ0 S(θ) ≈ S(θ0) − I(θ0)(θ − θ0), the numerical maximisation can be done via iterative function θ0 + S(θ0) I(θ0) . Probabilistic and Statistical Models Numerical maximisation of likelihood function The iterative process is defined as follows: 1. initialisation step – starting point θ(0) , where I(θ(0) ) = 0, 2. updating equations – iteration of θ(i) = θ(i−1) + S(θ(i−1) ) I(θ(i−1)) , where I(θ(i−1) ) = 0, pre i = 1, 2, . . . 3. stopping rule based on absolute convergence criteria – until |l(θ(i) |x) − l(θ(i−1) |x)| < ǫ, where the threshold ǫ is sufficiently small Geometrical interpretation: θ(i) is a crossing point of tangent of score function S(·) in the point [θ(i−1) , S(θ(i−1) )] with x-axis. In : ◮ optimize(f,interval,maximum= FALSE, tol,...) ◮ Newton-Raphson method is combined here with golden section method and successive parabolic interpolation to speed up the convergence. Probabilistic and Statistical Models Numerical maximisation of likelihood function Definition (multivariate Newton-Raphson method) Having quadratic approximation of log-likelihood function about θ0 l(θ|x) ≈ l(θ0|x) + S(θ0)(θ − θ0) − 1 2 (θ − θ0)T I(θ0)(θ − θ0) or linear approximation of score function about θ0 S(θ) ≈ S(θ0) − I(θ0)(θ − θ0). the numerical maximisation can be done via iterative function θ0 + (I(θ0))−1 S(θ0). Probabilistic and Statistical Models Numerical maximisation of likelihood function The iterative process is defined as follows: 1. initialisation step – starting point θ(0) , where I(θ(0) ) = 0, 2. 
updating equations – iteration of θ(i) = θ(i−1) + (I(θ(i−1) ))−1 S(θ(i−1) ), where I(θ(i−1) ) = 0, pre i = 1, 2, . . . 3. stopping rule based on absolute convergence criteria – until |l(θ(i) |x) − l(θ(i−1 |x)| < ǫ, where the threshold ǫ is sufficiently small In : ◮ optim(par,fn,gr,method,control,hessian =FALSE,...) ◮ Newton-Raphson method is often modified – Fisher scoring method, quasi Newton method, Broyden-FletcherGoldfarb-Shannon (BFGS) method Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood Nelder-Mead method (method of simplexes) – the idea of ”jumps” across triangles defined by the points θ (i−1) 1 , θ (i−1) 2 , θ (i−1) 3 , where l(θ (i−1) 1 |x) < l(θ (i−1) 2 |x) < l(θ (i−1) 3 |x). We are substituting point θ (i−1) 1 with a ”better” point θ (i) 1 , where l(θ (i) 1 |x) < l(θ (i−1) 1 |x). Then new point is defined based on reflection (point symmetry), contraction or extrapolation (expansion) as 1. reflection – z1 = θ (i) 1 = θ (i−1) 23 + 1 θ (i−1) 23 − θ (i−1) 1 , 2. reflection & expansion – z2 = θ (i) 1 = θ (i−1) 23 + 2 θ (i−1) 23 − θ (i−1) 1 , 3. contraction A – z3 = θ (i) 1 = θ (i−1) 23 + 1 2 θ (i−1) 23 − θ (i−1) 1 , 4. contraction B – z4 = θ (i) 2 = θ (i−1) 1 + 1 2 θ (i−1) 2 − θ (i−1) 1 and z5 = θ (i) 3 = θ (i−1) 1 + 1 2 θ (i−1) 3 − θ (i−1) 1 where θ (i−1) 23 = θ(i−1) 2 +θ(i−1) 3 2 , i.e. the mid-point of the line defined by the points θ (i−1) 2 and θ(i−1) . If l(θ (i) 1 |x) < l(θ (i−1) 1 |x) then new triangle is defined with θ (i) 1 , θ (i−1) 2 , θ (i−1) 3 for (1) to (3). Otherwise new triangle is θ (i−1) 1 , θ (i) 2 , θ (i) 3 . Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 x y x y function Figure: Demonstration of Nelder-Mead method or minimising the function ((x − y)2 + (x − 2)2 + (y − 3)4 )/10, number of iterations is 49 Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood x y 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 −2 0 2 4 0246 1 2 3 4 5678910111213141516171819202122232425262728293031323334353637383940414243444546474849 x y 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 −2 0 2 4 0246 ● Figure: Demonstration of Nelder-Mead method or minimising the function ((x − y)2 + (x − 2)2 + (y − 3)4 )/10, number of iterations is 49
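The test function from the last two figures can be minimised directly with optim(), whose default method is Nelder-Mead; a small sketch (the starting point below is arbitrary, so the number of evaluations need not match the 49 iterations quoted in the figure):
# Sketch: Nelder-Mead minimisation of ((x - y)^2 + (x - 2)^2 + (y - 3)^4)/10
f <- function(z) ((z[1] - z[2])^2 + (z[1] - 2)^2 + (z[2] - 3)^4) / 10
fit <- optim(par = c(0, 0), fn = f, method = "Nelder-Mead")
fit$par        # minimiser, approximately (2.25, 2.5)
fit$value      # minimum value of the function
fit$counts     # number of function evaluations used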