Statistical Inference I and II
Likelihood function

Stanislav Katina
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow

November 13, 2018

Probabilistic and Statistical Models: Likelihood function

Definition (likelihood function)
For a statistical model F under which we expect the data $x = (x_1, \ldots, x_n)^T$ to be observed, the function $L : \Theta \to \mathbb{R}^+ \cup \{0\}$, called the likelihood function (likelihood), is defined as
$$L(\theta|x) = L(\theta, x) = c(x)\, f(\theta|x),$$
where $c(x) > 0$ does not depend on $\theta$ and $f(\theta|x) = f(\theta, x) = \prod_{i=1}^{n} f(x_i, \theta)$.

The likelihood $L(\theta|x)$ is used when describing a function of the parameter given an outcome. The density (probability mass function) $f(x_i, \theta) = f(x_i|\theta)$ is used when describing a function of the outcome given a fixed parameter value.

The natural logarithm of the likelihood function, called the log-likelihood, is defined as
$$\ln L(\theta|x) = l(\theta|x) = \ln c(x) + \ln f(\theta|x).$$
The log-likelihood is more convenient to work with. We are searching for the maximum of the likelihood function. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself. Hence the log-likelihood can be used in place of the likelihood in finding the maximum. Finding the maximum of a function involves taking the (partial) derivative of the function, setting it equal to zero, and solving for the parameter being maximised.

Definition (maximum-likelihood estimate)
The estimate $\hat\theta_{ML} = \hat\theta$ of a parameter $\theta$, called the maximum-likelihood estimate (MLE), is a value which maximises the likelihood function, i.e.
$$\hat\theta_{ML} = \arg\max_{\theta} L(\theta|x) = \arg\max_{\theta} l(\theta|x).$$
The process of maximisation is called maximum-likelihood estimation:
1. the first derivative of the log-likelihood function (score function) $S(\theta) = \frac{\partial}{\partial\theta} l(\theta|x)$,
2. the likelihood equations (score equations) $S(\theta) = 0$,
3. the second derivative of the log-likelihood function $\frac{\partial^2}{\partial\theta^2} l(\theta|x)$; the second derivative is negative at the point of maximum, and the curvature at $\hat\theta$ is equal to the Fisher information $I(\hat\theta) = -\frac{\partial^2}{\partial\theta^2} l(\theta|x)\big|_{\theta=\hat\theta}$.

The curvature is inversely related to the variance of $\hat\theta$, i.e. $\widehat{\mathrm{Var}}[\hat\theta] = 1/I(\hat\theta)$. Since $X_i$, $i = 1, 2, \ldots, n$, are independent, $I(\theta) = n\, i(\theta)$, where $i(\theta)$ is the Fisher information of a single observation.

Ronald Aylmer Fisher (1890-1962), English statistician, wrote in 1925: "What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term 'likelihood' to designate this quantity."
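To make these definitions concrete, here is a minimal R sketch (assuming a single binomial observation x = 10 out of N = 20 trials, values that reappear later in the text) confirming numerically that the likelihood and the log-likelihood share the same maximiser:

  ## Minimal sketch: L(p|x) and l(p|x) have the same maximiser.
  ## Assumed data: one binomial observation x = 10 out of N = 20 trials.
  x <- 10; N <- 20
  lik    <- function(p) p^x * (1 - p)^(N - x)               # kernel of L(p|x)
  loglik <- function(p) x * log(p) + (N - x) * log(1 - p)   # l(p|x)
  p.hat.L <- optimize(lik,    interval = c(0, 1), maximum = TRUE)$maximum
  p.hat.l <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
  c(p.hat.L, p.hat.l, x/N)   # all agree up to numerical tolerance

The analytical answer $x/N$ is derived for the binomial model later in this section.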
Profile likelihood and log-likelihood function

Let $\theta = (\theta_1, \theta_2)^T$, where $\theta_1$ is the parameter of interest and $\theta_2$ a nuisance parameter. In some cases, the separation into two such components can be achieved after a suitable reparametrisation. If $\hat\theta_{2|\theta_1}$ denotes the value of $\theta_2$ which maximises the likelihood (or log-likelihood) function for the given value of $\theta_1$, the profile likelihood function is defined as
$$L_P(\theta_1|x) = \max_{\theta_2} L(\theta|x) = L((\theta_1, \hat\theta_{2|\theta_1})^T|x)$$
and the profile log-likelihood function as
$$l_P(\theta_1|x) = l((\theta_1, \hat\theta_{2|\theta_1})^T|x).$$
The term "profile" comes about through thinking of the usual (log-)likelihood function as a hill being observed from a viewpoint with abscissa $\theta_2 = \infty$, so that, for any fixed $\theta_1$, only the highest value $L((\theta_1, \hat\theta_{2|\theta_1})^T|x)$ or $l((\theta_1, \hat\theta_{2|\theta_1})^T|x)$ is seen.

Profile relative likelihood and log-likelihood function

The profile relative likelihood function is defined as
$$\bar L_P(\theta_1|x) = \frac{L((\theta_1, \hat\theta_{2|\theta_1})^T|x)}{L((\hat\theta_1, \hat\theta_2)^T|x)}$$
and the profile relative log-likelihood function as its natural logarithm, $\ln \bar L_P(\theta_1|x)$.

The estimated likelihood function is defined as $L_e(\theta_1|x) = L((\theta_1, \hat\theta_2)^T|x)$, where $\hat\theta_2$ is the unconstrained MLE of $\theta_2$, and the estimated log-likelihood function as $l_e(\theta_1|x) = l((\theta_1, \hat\theta_2)^T|x)$. The estimated relative likelihood function is defined as
$$\bar L_e(\theta_1|x) = \frac{L((\theta_1, \hat\theta_2)^T|x)}{L((\hat\theta_1, \hat\theta_2)^T|x)}$$
and the estimated relative log-likelihood function as its natural logarithm, $\ln \bar L_e(\theta_1|x)$.

Likelihood function of binomial distribution

Definition (likelihood and log-likelihood function of binomial distribution)
Let $X$ be binomially distributed with sample size $N$ and parameter $\theta = p$, i.e. $X \sim \mathrm{Bin}(N, p)$, and let $x$ be the observed realisation of $X$. Then the likelihood function is equal to
$$L(p|x) = \binom{N}{x} p^x (1-p)^{N-x}.$$
Since the binomial coefficient does not depend on $p$ and is therefore not important in likelihood maximisation, only the kernel (often called the likelihood as well) is used, i.e.
$$L(p|x) \propto p^x (1-p)^{N-x}.$$
The log-likelihood function is equal to
$$l(p|x) = x \ln p + (N-x) \ln(1-p).$$

Example (maximum-likelihood estimation of parameter p)
Let $X \sim \mathrm{Bin}(N, p)$. Derive $\hat p$ and $\widehat{\mathrm{Var}}[\hat p]$.

Solution
$$S(p) = \frac{\partial}{\partial p} l(p|x) = \frac{x}{p} - \frac{N-x}{1-p},$$
and $S(p) = 0$ gives $\hat p = x/N$. Further,
$$\frac{\partial^2}{\partial p^2} l(p|x) = -\frac{x}{p^2} - \frac{N-x}{(1-p)^2},$$
and substituting $x = N\hat p$ and evaluating at $p = \hat p$ gives
$$I(\hat p) = \frac{N\hat p}{\hat p^2} + \frac{N(1-\hat p)}{(1-\hat p)^2} = \frac{N}{\hat p(1-\hat p)}, \qquad \widehat{\mathrm{Var}}[\hat p] = \frac{\hat p(1-\hat p)}{N}.$$

Example (maximum-likelihood estimation of parameter p in R)
Generate in R pseudo-random variables $X \sim \mathrm{Bin}(N, p)$, where $N = 20$. Write an R function to calculate the likelihood function $L(p|x)$ of the binomial distribution and visualise it for (1) $x = 2$, $N = 20$, (2) $x = 10$, $N = 20$ and (3) $x = 18$, $N = 20$. Repeat the same for the log-likelihood function. Calculate also $\hat p$ using the function optimize(). Draw all three functions in three side-by-side windows with highlighted maxima.

Solution (partial)
$L(p|x) = p^x(1-p)^{N-x}$, where $p \in (0, 1)$, for $x = 2$, $x = 10$ and $x = 18$ with $N = 20$.
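One possible R solution sketch follows; the plotting details (labels, layout) are assumptions, and rbinom() can be used for the pseudo-random generation asked for in the exercise:

  ## One possible solution sketch for the binomial exercise above.
  ## Pseudo-random generation, e.g. rbinom(1, size = 20, prob = 0.5).
  lik.binom <- function(p, x, N) p^x * (1 - p)^(N - x)   # likelihood kernel
  N <- 20
  par(mfrow = c(1, 3))                                   # three side-by-side windows
  for (x in c(2, 10, 18)) {
    p.grid <- seq(0.001, 0.999, length.out = 500)
    p.hat  <- optimize(lik.binom, interval = c(0, 1), maximum = TRUE,
                       x = x, N = N)$maximum
    plot(p.grid, lik.binom(p.grid, x, N), type = "l",
         xlab = "probability p", ylab = "L(p|x)",
         main = paste("MLE of p =", round(p.hat, 2)))
    points(p.hat, lik.binom(p.hat, x, N), pch = 19)      # highlight the maximum
  }

The log-likelihood version is analogous, with the kernel replaced by $x \ln p + (N-x)\ln(1-p)$.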
Figure: Likelihood functions of the binomial distribution $X \sim \mathrm{Bin}(N, p)$, $N = 20$, for $x = 2$ (MLE $\hat p = 0.1$), $x = 10$ ($\hat p = 0.5$) and $x = 18$ ($\hat p = 0.9$).

Figure: Log-likelihood functions of the binomial distribution $X \sim \mathrm{Bin}(N, p)$, $N = 20$, for the same three data sets.

Likelihood function of multinomial distribution

Definition (likelihood and log-likelihood function of multinomial distribution)
Let $X$ be multinomially distributed with sample size $N$ and parameter $\theta = p$, i.e. $X \sim \mathrm{Mult}_J(N, p)$, and let the realisations of $X_j$ be $x_j = n_j$. Then the likelihood function is equal to
$$L(p|x) = \frac{N!}{\prod_{j=1}^{J} x_j!} \prod_{j=1}^{J} p_j^{x_j} \propto \prod_{j=1}^{J} p_j^{x_j}$$
(the multinomial coefficient plays the role of the constant $c(x)$) and the log-likelihood function is equal to
$$l(p|x) = \sum_{j=1}^{J} x_j \ln p_j.$$

Example (maximum-likelihood estimation of parameter p)
Let $X \sim \mathrm{Mult}_J(N, p)$. Derive $\hat p$ and $\widehat{\mathrm{Var}}[\hat p]$.

Solution
Let $p_J = 1 - \sum_{j=1}^{J-1} p_j$ and $p = (p_1, p_2, \ldots, p_{J-1})^T$. Then
$$l(p|x) = \sum_{j=1}^{J-1} n_j \ln p_j + n_J \ln\Big(1 - \sum_{j=1}^{J-1} p_j\Big),$$
$$(S(p))_j = \frac{\partial}{\partial p_j} l(p|x) = \frac{n_j}{p_j} - \frac{n_J}{p_J},$$
where $(S(p))_j$ are the elements of $S(p)$; setting $(S(p))_j = 0$ gives $\hat p_j = n_j/N$. Then
$$I(p) = -\frac{\partial}{\partial p} S(p) = \mathrm{diag}\Big(\frac{n_1}{p_1^2}, \frac{n_2}{p_2^2}, \ldots, \frac{n_{J-1}}{p_{J-1}^2}\Big) + \frac{n_J}{p_J^2}\, \mathbf{1}\mathbf{1}^T,$$
and, evaluated at $\hat p$,
$$I(\hat p) = N\Big[\mathrm{diag}\Big(\frac{1}{\hat p_1}, \frac{1}{\hat p_2}, \ldots, \frac{1}{\hat p_{J-1}}\Big) + \frac{\mathbf{1}\mathbf{1}^T}{\hat p_J}\Big] = N \begin{pmatrix} \frac{1}{\hat p_1} + \frac{1}{\hat p_J} & \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_J} \\ \frac{1}{\hat p_J} & \frac{1}{\hat p_2} + \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_J} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\hat p_J} & \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_{J-1}} + \frac{1}{\hat p_J} \end{pmatrix}.$$
Hence
$$\widehat{\mathrm{Var}}[\hat p] = I^{-1}(\hat p) = \frac{1}{N}\big(\mathrm{diag}(\hat p) - \hat p \hat p^T\big) = \frac{1}{N} \begin{pmatrix} \hat p_1(1-\hat p_1) & -\hat p_1 \hat p_2 & \cdots & -\hat p_1 \hat p_{J-1} \\ -\hat p_2 \hat p_1 & \hat p_2(1-\hat p_2) & \cdots & -\hat p_2 \hat p_{J-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat p_{J-1} \hat p_1 & -\hat p_{J-1} \hat p_2 & \cdots & \hat p_{J-1}(1-\hat p_{J-1}) \end{pmatrix}.$$

Figure: Log-likelihood function of the multinomial (trinomial) distribution, contour plot over $(p_1, p_2)$.
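A short sketch evaluating these formulas follows; the trinomial counts used here are hypothetical, chosen only to illustrate the computation:

  ## Sketch: multinomial MLEs and their estimated covariance matrix.
  ## Hypothetical trinomial counts (J = 3), for illustration only.
  n <- c(45, 35, 20)
  N <- sum(n)
  p.hat <- n / N                          # MLEs p_j = n_j / N
  p1 <- p.hat[-length(p.hat)]             # first J - 1 components
  V <- (diag(p1) - outer(p1, p1)) / N     # Var[p.hat] = (diag(p) - p p^T) / N
  p.hat; V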
Likelihood function of Poisson distribution

Definition (likelihood and log-likelihood function of Poisson distribution)
Let $X$ be distributed as Poisson with parameter $\theta = \lambda$, i.e. $X \sim \mathrm{Poiss}(\lambda)$, with realisations $x_1, \ldots, x_N$. Then the likelihood function is equal to
$$L(\lambda|x) = \prod_{i=1}^{N} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum_{i=1}^{N} x_i}\, e^{-N\lambda}$$
and the log-likelihood function is equal to
$$l(\lambda|x) = \sum_{i=1}^{N} x_i \ln\lambda - N\lambda.$$
In the general notation (from the examples), $L(\lambda|x) = \prod_n p_n^{m_n}$, where $p_n = \Pr(X = n) = e^{-\lambda}\lambda^n/n!$ and $m_n$ is the frequency of the value $n$, so that
$$l(\lambda|x) = \sum_n n\, m_n \ln\lambda - \lambda \sum_n m_n.$$

Example (maximum-likelihood estimation)
Let $X \sim \mathrm{Poiss}(\lambda)$. Derive $\hat\lambda$ and $\widehat{\mathrm{Var}}[\hat\lambda]$.

Solution (partial)
$$S(\lambda) = \frac{\partial}{\partial\lambda} l(\lambda|x) = \frac{\sum_{i=1}^{N} x_i}{\lambda} - N, \qquad \frac{\partial^2}{\partial\lambda^2} l(\lambda|x) = -\frac{\sum_{i=1}^{N} x_i}{\lambda^2}.$$
Then $\hat\lambda = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar x$ and $\widehat{\mathrm{Var}}[\hat\lambda] = \bar x/N$. In the general notation, $\hat\lambda = \sum_n n\, m_n / \sum_n m_n$.

Example (maximum-likelihood estimation of parameter λ in R)
Write an R function to calculate the likelihood function $L(\lambda|x)$ and the log-likelihood function $l(\lambda|x)$ of the Poisson distribution $X \sim \mathrm{Poiss}(\lambda)$ for the horse kick data. Calculate also $\hat\lambda$ using the function optimize(). Draw both functions in two side-by-side windows with highlighted maxima.

Solution (partial)
$l(\lambda|x) = \sum_n n\, m_n \ln\lambda - \lambda \sum_n m_n$, where $\lambda \in (0, 2)$.
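A sketch for this exercise is given below; since the data set is not listed in this section, the counts $m_n$ are assumed here to be the classical Prussian horse-kick data (von Bortkiewicz), with $m_n$ corps-years showing $n$ deaths:

  ## Sketch for the horse kick exercise. Assumed data: the classical Prussian
  ## horse-kick counts m_n of corps-years with n deaths, n = 0, ..., 4.
  n  <- 0:4
  mn <- c(109, 65, 22, 3, 1)
  loglik <- function(lambda) sum(n * mn) * log(lambda) - lambda * sum(mn)
  lambda.hat <- optimize(loglik, interval = c(0, 2), maximum = TRUE)$maximum
  lambda.hat   # approx. 0.61 = sum(n * mn) / sum(mn), matching the figure below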
Figure: Likelihood function $L(\lambda|x)$ (left) and log-likelihood function $l(\lambda|x)$ (right) of the Poisson distribution $X \sim \mathrm{Poiss}(\lambda)$ for the horse kick data; MLE $\hat\lambda = 0.61$.

Assignments in R

Assignment (number of boys): Calculate $\hat p$ (the probability of having a boy in a family) and $\widehat{\mathrm{Var}}[\hat p]$ (the variance of that probability).

Assignment (killing by horse kick): Calculate $\hat\lambda$ (the mean number of annual deaths) and $\widehat{\mathrm{Var}}[\hat\lambda]$ (the variance of the mean number of annual deaths).

Assignment (accidents in a factory): Calculate $\hat\lambda$ (the mean number of accidents in a factory) and $\widehat{\mathrm{Var}}[\hat\lambda]$ (the variance of the mean number of accidents in a factory).

Assignment (blood groups): For Prague and Košice, calculate $\hat p$ (the probabilities of having a certain blood group in the particular city) and $\widehat{\mathrm{Var}}[\hat p]$ (the covariance matrix of those probabilities).

Assignment (eye and hair colour): Calculate $\hat p$ (the probabilities of having a certain eye and hair colour) and $\widehat{\mathrm{Var}}[\hat p]$ (the covariance matrix of those probabilities).

Likelihood function of normal distribution

Definition (likelihood and log-likelihood function of normal distribution)
Let $X$ be distributed normally with parameter $\theta = (\mu, \sigma^2)^T$, i.e. $X \sim N(\mu, \sigma^2)$, with realisations $x_i$ of $X_i$. Then the likelihood function is equal to
$$L(\theta|x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2}\Big(\frac{x_i - \mu}{\sigma}\Big)^2\Big) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2\Big)\Big)$$
and the log-likelihood function is equal to
$$l(\theta|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2.$$

Example (maximum-likelihood estimation of parameters μ and σ²)
Let $X \sim N(\mu, \sigma^2)$. Derive $\hat\theta = (\hat\mu, \hat\sigma^2)^T$ and $\widehat{\mathrm{Var}}[\hat\theta] = \hat\Sigma$.

Solution (partial)
$$S_1(\theta) = \frac{\partial}{\partial\mu} l(\theta|x) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu), \qquad S_2(\theta) = \frac{\partial}{\partial\sigma^2} l(\theta|x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Then $\hat\mu = \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2$, and
$$I(\theta) = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix}.$$

Example (maximum-likelihood estimation of parameters μ and σ² in R)
Generate in R pseudo-random variables $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$. Write an R function to calculate (1) the (profile) likelihood function $L_P(\mu|x)$ of the normal distribution for the generated data, (2) the (profile) likelihood function $L_P(\sigma^2|x)$, and (3) the likelihood function $L(\theta|x)$, where $\theta = (\mu, \sigma^2)^T$. Repeat the same for the log-likelihood function. Calculate also the MLEs using the functions optimize() and optim(). Draw all three functions in three side-by-side windows with highlighted maxima.

Solution (partial)
$l_P(\mu|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\hat\sigma^2_\mu - \frac{1}{2\hat\sigma^2_\mu}\big(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\big)$, where $\mu \in (2, 6)$ and $\hat\sigma^2_\mu = \frac{1}{n}\sum_i (x_i - \mu)^2$;
$l_P(\sigma^2|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum_i (x_i - \hat\mu_\sigma)^2}{2\sigma^2}$, where $\hat\mu_\sigma = \bar x$ (close to the true value $\mu = 4$) and $\sigma^2 \in (0.5, 1.5)$;
$l(\theta|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum_i (x_i - \mu)^2}{2\sigma^2}$, where $\mu \in (2, 6)$ and $\sigma^2 \in (0.5, 1.5)$.
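A possible sketch of the joint estimation via optim() is shown below; the seed and the starting values are assumptions. The inverse of the returned Hessian of the negative log-likelihood estimates $\widehat{\mathrm{Var}}[\hat\theta] = I^{-1}(\hat\theta)$:

  ## Sketch: joint MLE of (mu, sigma^2) by minimising the negative log-likelihood.
  set.seed(1)                                  # assumed seed, for reproducibility
  x <- rnorm(1000, mean = 4, sd = 1)
  negloglik <- function(theta) {
    mu <- theta[1]; sigma2 <- theta[2]
    if (sigma2 <= 0) return(Inf)               # stay inside the parameter space
    length(x)/2 * log(2 * pi * sigma2) + sum((x - mu)^2) / (2 * sigma2)
  }
  fit <- optim(par = c(2, 0.5), fn = negloglik, hessian = TRUE)
  fit$par                # close to (mean(x), mean((x - mean(x))^2))
  solve(fit$hessian)     # approximates diag(sigma^2/n, 2 sigma^4/n)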
Figure: Profile likelihood functions (left, middle) and likelihood function (right) of the normal distribution $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$; all functions multiplied by a suitable constant.

Figure: Profile log-likelihood functions (left, middle) and log-likelihood function (right) of the same distribution; all functions multiplied by a suitable constant.

Figure: Likelihood (left) and log-likelihood (right) surfaces of the normal distribution $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$; all functions multiplied by a suitable constant.

Approximation of likelihood function

Definition (relative likelihood and log-likelihood function)
The relative likelihood function is defined as
$$\bar L(\theta|x) = \frac{L(\theta|x)}{L(\hat\theta|x)}$$
and the relative log-likelihood function as its natural logarithm, $\ln \bar L(\theta|x)$.

It is often useful to approximate the likelihood function by a quadratic function. Besides the location of the maximum of the likelihood function, we then also need the curvature around the maximum. Since the log-likelihood is more convenient to work with, we seek a quadratic approximation of the log-likelihood function.

Definition (Taylor polynomial of order r)
If a function $g(x)$ has derivatives of order $r$, that is, $g^{(r)}(x) = \frac{\partial^r}{\partial x^r} g(x)$ exists, then for any constant $a$ the Taylor polynomial of order $r$ about $a$ is
$$T_r(x) = \sum_{j=0}^{r} \frac{g^{(j)}(a)}{j!}(x - a)^j.$$
In practical statistical situations we assume that the remainder $g(x) - T_r(x)$ converges to zero as $r$ increases, and therefore we ignore it. There are many explicit forms of the remainder; one of the most useful is
$$g(x) - T_r(x) = \int_a^x \frac{g^{(r+1)}(t)}{r!}(x - t)^r\, dt.$$
If $g^{(r)}(a) = \frac{\partial^r}{\partial x^r} g(x)\big|_{x=a}$ exists, then
$$\lim_{x \to a} \frac{g(x) - T_r(x)}{(x - a)^r} = 0.$$

The quadratic approximation of the log-likelihood function about $\hat\theta$ is defined as
$$l(\theta|x) \approx l(\hat\theta|x) + S(\hat\theta)(\theta - \hat\theta) - \frac{1}{2} I(\hat\theta)(\theta - \hat\theta)^2.$$
Since $S(\hat\theta) = 0$, the quadratic approximation of the relative log-likelihood function about $\hat\theta$ is
$$\ln \bar L(\theta|x) = \ln \frac{L(\theta|x)}{L(\hat\theta|x)} = l(\theta|x) - l(\hat\theta|x) \approx -\frac{1}{2} I(\hat\theta)(\theta - \hat\theta)^2.$$
It is often useful to visualise a derivative of the quadratic approximation, $S(\theta) \approx -I(\hat\theta)(\theta - \hat\theta)$, or equivalently $-I^{-1/2}(\hat\theta)S(\theta) \approx I^{1/2}(\hat\theta)(\theta - \hat\theta)$, where $-I^{-1/2}(\hat\theta)S(\theta)$ is plotted against $I^{1/2}(\hat\theta)(\theta - \hat\theta)$. If the quadratic approximation is correct, this should be a line with slope equal to one.

Figure: Relative binomial log-likelihood with its quadratic approximation (top) and linearity of the score function (bottom) for (a) $x = 8$, $N = 10$, (b) $x = 80$, $N = 100$, (c) $x = 800$, $N = 1000$.

Numerical maximisation of likelihood function

Isaac Newton (1643-1727) and Joseph Raphson (1648-1715).

Definition (Newton-Raphson method)
Given the quadratic approximation of the log-likelihood function about $\theta_0$,
$$l(\theta|x) \approx l(\theta_0|x) + S(\theta_0)(\theta - \theta_0) - \frac{1}{2} I(\theta_0)(\theta - \theta_0)^2,$$
or the linear approximation of the score function about $\theta_0$,
$$S(\theta) \approx S(\theta_0) - I(\theta_0)(\theta - \theta_0),$$
the numerical maximisation can be done via the iterative function
$$\theta_0 + \frac{S(\theta_0)}{I(\theta_0)}.$$

The iterative process is defined as follows:
1. initialisation step – a starting point $\theta^{(0)}$, where $I(\theta^{(0)}) \neq 0$,
2. updating equations – iteration of $\theta^{(i)} = \theta^{(i-1)} + \frac{S(\theta^{(i-1)})}{I(\theta^{(i-1)})}$, where $I(\theta^{(i-1)}) \neq 0$, for $i = 1, 2, \ldots$,
3. stopping rule based on an absolute convergence criterion – iterate until $|l(\theta^{(i)}|x) - l(\theta^{(i-1)}|x)| < \varepsilon$, where the threshold $\varepsilon$ is sufficiently small.

Geometrical interpretation: $\theta^{(i)}$ is the crossing point of the tangent of the score function $S(\cdot)$ at the point $[\theta^{(i-1)}, S(\theta^{(i-1)})]$ with the x-axis.

In R: optimize(f, interval, maximum = FALSE, tol, ...). Here golden-section search is combined with successive parabolic interpolation to speed up the convergence.
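The iteration above can also be implemented directly; a minimal sketch for the binomial log-likelihood (score and information as derived earlier, with assumed data x = 10, N = 20) is:

  ## Minimal Newton-Raphson sketch for the binomial log-likelihood.
  ## Assumed data: x = 10, N = 20, so the iteration converges to x/N = 0.5.
  x <- 10; N <- 20
  S      <- function(p) x/p - (N - x)/(1 - p)            # score function
  I      <- function(p) x/p^2 + (N - x)/(1 - p)^2        # observed information
  loglik <- function(p) x * log(p) + (N - x) * log(1 - p)
  p <- 0.2                                               # starting point theta^(0)
  repeat {
    p.new <- p + S(p)/I(p)                               # updating equation
    if (abs(loglik(p.new) - loglik(p)) < 1e-10) break    # stopping rule
    p <- p.new
  }
  p.new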
Definition (multivariate Newton-Raphson method)
Given the quadratic approximation of the log-likelihood function about $\theta_0$,
$$l(\theta|x) \approx l(\theta_0|x) + S(\theta_0)^T(\theta - \theta_0) - \frac{1}{2}(\theta - \theta_0)^T I(\theta_0)(\theta - \theta_0),$$
or the linear approximation of the score function about $\theta_0$,
$$S(\theta) \approx S(\theta_0) - I(\theta_0)(\theta - \theta_0),$$
the numerical maximisation can be done via the iterative function
$$\theta_0 + (I(\theta_0))^{-1} S(\theta_0).$$

The iterative process is defined as follows:
1. initialisation step – a starting point $\theta^{(0)}$, where $I(\theta^{(0)})$ is regular,
2. updating equations – iteration of $\theta^{(i)} = \theta^{(i-1)} + (I(\theta^{(i-1)}))^{-1} S(\theta^{(i-1)})$, where $I(\theta^{(i-1)})$ is regular, i.e. $\det I(\theta^{(i-1)}) \neq 0$, for $i = 1, 2, \ldots$,
3. stopping rule based on an absolute convergence criterion – iterate until $|l(\theta^{(i)}|x) - l(\theta^{(i-1)}|x)| < \varepsilon$, where the threshold $\varepsilon$ is sufficiently small.

In R: optim(par, fn, gr, method, control, hessian = FALSE, ...). The Newton-Raphson method is often modified – Fisher scoring method, quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.

Numerical maximisation of likelihood ≈ minimising negative log-likelihood

Nelder-Mead method (method of simplexes) – the idea of "jumps" across triangles defined by the points $\theta_1^{(i-1)}, \theta_2^{(i-1)}, \theta_3^{(i-1)}$, where $-l(\theta_1^{(i-1)}|x) < -l(\theta_2^{(i-1)}|x) < -l(\theta_3^{(i-1)}|x)$. We substitute the point $\theta_1^{(i-1)}$ with a "better" point $\theta_1^{(i)}$, where $-l(\theta_1^{(i)}|x) < -l(\theta_1^{(i-1)}|x)$. The new point is defined based on reflection (point symmetry), contraction or extrapolation (expansion) as
1. reflection: $z_1 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + 1 \cdot (\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
2. reflection and expansion: $z_2 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + 2(\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
3. reflection and contraction: $z_3 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + \frac{1}{2}(\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
4. contraction A: $z_4 = \theta_2^{(i)} = \theta_1^{(i-1)} + \frac{1}{2}(\theta_2^{(i-1)} - \theta_1^{(i-1)})$ and B: $z_5 = \theta_3^{(i)} = \theta_1^{(i-1)} + \frac{1}{2}(\theta_3^{(i-1)} - \theta_1^{(i-1)})$,
where $\theta_{23}^{(i-1)} = \frac{\theta_2^{(i-1)} + \theta_3^{(i-1)}}{2}$, i.e. the mid-point of the line segment defined by the points $\theta_2^{(i-1)}$ and $\theta_3^{(i-1)}$. If $-l(\theta_1^{(i)}|x) < -l(\theta_1^{(i-1)}|x)$, then the new triangle is defined by $\theta_1^{(i)}, \theta_2^{(i-1)}, \theta_3^{(i-1)}$ for (1) to (3). Otherwise the new triangle is $\theta_1^{(i-1)}, \theta_2^{(i)}, \theta_3^{(i)}$.

Figure: Demonstration of the Nelder-Mead method minimising the function $((x - y)^2 + (x - 2)^2 + (y - 3)^4)/10$ (contour plot with the simplex path); the number of iterations is 49.
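The demonstration in the figure can be reproduced (up to plotting) with optim(), whose default method is Nelder-Mead; the starting point (0, 0) is an assumption:

  ## Sketch: Nelder-Mead via optim() on the function from the figure.
  f <- function(z) ((z[1] - z[2])^2 + (z[1] - 2)^2 + (z[2] - 3)^4) / 10
  fit <- optim(par = c(0, 0), fn = f, method = "Nelder-Mead")
  fit$par      # approximate minimiser, close to (2.25, 2.5)
  fit$counts   # number of function evaluations used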
Mixture of two univariate normal distributions – likelihood estimation

Given data $x = (x_1, x_2, \ldots, x_n)^T$, the likelihood function $L(\theta|x) = \prod_{i=1}^{n} f(x_i, \theta)$, where $\theta = (p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^T$, must be maximised numerically. One complication is that $l(\theta|x)$ is unbounded. To see how this can happen, fix $p$, $\mu_2$ and $\sigma_2^2$ at any set of values, with the exception that $p$ must not equal 0 or 1. Denote these fixed values by $p_*$, $\mu_{2,*}$ and $\sigma_{2,*}^2$, respectively. Now set $\mu_1 = x_i$ for any choice of $i \in \{1, 2, \ldots, n\}$. This leaves only $\sigma_1^2$ unspecified, and $\theta_{\sigma_1^2} = (p_*, x_i, \mu_{2,*}, \sigma_1^2, \sigma_{2,*}^2)^T$ can be used to denote the parameter vector with the values of the other parameters fixed as described. When $\theta = \theta_{\sigma_1^2}$, the binormal density function evaluated at $x_i$ is
$$f(x_i, \theta_{\sigma_1^2}) = \frac{p_*}{\sqrt{2\pi}\,\sigma_1} + \frac{1 - p_*}{\sqrt{2\pi}\,\sigma_{2,*}} \exp\Big(-\frac{(\mu_1 - \mu_{2,*})^2}{2\sigma_{2,*}^2}\Big).$$
Note that $f(x_i, \theta_{\sigma_1^2})$ can be made arbitrarily large by making $\sigma_1$ arbitrarily small.

Since $L(\theta_{\sigma_1^2}|x) = \prod_{i=1}^{n} f(x_i, \theta_{\sigma_1^2})$, and each $f(x_i, \theta_{\sigma_1^2})$ is bounded away from zero (by virtue of $p_*$, $\mu_{2,*}$ and $\sigma_{2,*}^2$ being fixed), it follows that $l(\theta_{\sigma_1^2}|x)$ can also be made arbitrarily large.

A further problem is that the parametrisation of the binormal model is not identifiable, because the roles of the two distributions in the mixture can be swapped. That is, the binormal distribution corresponding to parameters $(p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ is the same as that specified by parameters $(1 - p, \mu_2, \mu_1, \sigma_2^2, \sigma_1^2)$.

The unbounded likelihood and non-identifiability issues can be eliminated by a suitable restriction of the parameter space. One possibility is to constrain the ratio of the two variances by requiring that $0 < c < \sigma_1^2/\sigma_2^2 < 1$, where $c$ is some suitably small constant. In practice, despite the unbounded likelihood and non-identifiability, a sensible local maximum of the likelihood function can often be found using unconstrained numerical optimisation. This is especially the case if there is good separation between the two component normal distributions and the optimiser is given a starting value of $\theta$ that is somewhere in the general vicinity of the local maximum. Ultimately, it is the shape of the likelihood function in the neighbourhood of this local maximum that is relevant to inference.

The binormal density function is a linear combination of the density functions of the $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ distributions.

Figure: Mixture of two normal densities – histogram (absolute scale) and density estimate of the waiting times (minutes) from the faithful data.

The histogram of the waiting times shows that they look like a combination of (very roughly) 40 % from a $N(52, 25)$ distribution and 60 % from a $N(80, 25)$ distribution. The corresponding parameter values $\theta^{(0)} = (0.4, 52, 80, 25, 25)^T$ would make good starting values for finding a local MLE using numerical optimisation. To estimate $\theta$, use the optim() function.
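A possible optim() call is sketched below, using the suggested starting values; the negative log-likelihood is minimised and dnorm() supplies the component densities:

  ## Sketch: fitting the binormal mixture to the faithful waiting times.
  x <- faithful$waiting
  negloglik <- function(theta) {
    p <- theta[1]; mu1 <- theta[2]; mu2 <- theta[3]
    s1 <- sqrt(theta[4]); s2 <- sqrt(theta[5])   # sqrt of a proposed sigma^2 < 0
                                                 # yields NaN warnings, see below
    -sum(log(p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)))
  }
  fit <- optim(par = c(0.4, 52, 80, 25, 25), fn = negloglik)
  fit$par   # local MLE of (p, mu1, mu2, sigma1^2, sigma2^2)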
Such a call of optim() can produce warning messages (not shown), because it attempts to evaluate the negative log-likelihood at parameter values outside of the parameter space (e.g. $\sigma_1^2 < 0$ or $\sigma_2^2 < 0$). This can be avoided by using the lower and upper bound arguments in the optim() call.

Negative binomial distribution

Example (negative binomial distribution; accidents in the factories)
Let $X$ be the number of workers having an accident in the munition factories in England during the First World War (Greenwood and Yule 1920), $n$ the number of accidents, $m_n$ the number of workers with the particular number of accidents, and $M = \sum_n m_n = 647$. Question: Calculate the theoretical frequencies $m_{n,E}$.

Table: Observed and theoretical frequencies ($m_{n,O}$ and $m_{n,E}$) of workers with n accidents
  n        0    1    2    3    4    ≥5
  m_{n,O}  447  132  42   21   3    2
  m_{n,E}  446  134  44   15   5    3

The likelihood function is defined as
$$L((\alpha, \pi)^T|x) = \prod_{n=0}^{4} (\Pr(X = n))^{m_n} \Big(1 - \sum_{n=0}^{4} \Pr(X = n)\Big)^{m_{\geq 5}}$$
and the log-likelihood function as
$$l((\alpha, \pi)^T|x) = \sum_{n=0}^{4} m_n \ln \Pr(X = n) + m_{\geq 5} \ln\Big(1 - \sum_{n=0}^{4} \Pr(X = n)\Big).$$
Using numerical optimisation we get the estimates $\hat\alpha = 0.84$ and $\hat\pi = 0.64$, with estimated mean (risk) $\hat\mu = \frac{1 - \hat\pi}{\hat\pi}\hat\alpha = 0.47$.

Figure: Comparison of observed and expected frequencies (negative binomial distribution); the differences $m_{n,O} - m_{n,E}$ are 1, −2, −2, 6, −2, −1.

Zero-inflated Poisson (ZIP) distribution

Example (ZIP distribution; number of movements of a foetal lamb)
Let $X$ be the number of movements of a foetal lamb in 240 five-second periods (Leroux and Puterman 1992), $n$ the number of movements, $m_n$ the number of periods with the particular number of movements. Question: Calculate the theoretical frequencies $m_{n,E}$ using the Poisson and ZIP distributions.

Table: Observed and theoretical frequencies ($m_{n,O}$ and $m_{n,E}$) of five-second periods with n movements
  n                  0    1    2    3    4    5    6    7
  m_{n,O}            182  41   12   2    2    0    0    1
  m_{n,E} (Poisson)  168  60   11   1    0    0    0    0
  m_{n,E} (ZIP)      182  37   16   4    1    0    0    0

The likelihood function is defined as
$$L((\lambda, p)^T|x) = (p + (1 - p)f(0, \lambda))^{m_0} \prod_{n > 0} ((1 - p)f(n, \lambda))^{m_n}$$
and the log-likelihood function as
$$l((\lambda, p)^T|x) = m_0 \ln(p + (1 - p)f(0, \lambda)) + \sum_{n > 0} m_n \ln((1 - p)f(n, \lambda)),$$
where $f(n, \lambda)$ is the Poisson probability mass function. For the Poisson model, $\hat\lambda = \sum_n n\, m_n / \sum_n m_n = 86/240 = 0.358$. For the ZIP model, using numerical optimisation we get the estimates $\hat\lambda = 0.847$ and $\hat p = 0.577$.

Figure: Comparison of observed and expected frequencies, Poisson distribution (left; differences 14, −19, 1, 1, 2, 0, 0, 1) and ZIP distribution (right; differences 0, 4, −4, −2, 1, 0, 0, 1).
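The ZIP estimates quoted above can be checked with the following sketch, built directly from the table and the log-likelihood; the starting values are assumptions:

  ## Sketch: ZIP fit for the foetal lamb counts from the table above.
  n  <- 0:7
  mn <- c(182, 41, 12, 2, 2, 0, 0, 1)
  negloglik <- function(theta) {
    lambda <- theta[1]; p <- theta[2]
    if (lambda <= 0 || p < 0 || p >= 1) return(Inf)   # parameter space guard
    -(mn[1] * log(p + (1 - p) * dpois(0, lambda)) +
        sum(mn[-1] * log((1 - p) * dpois(n[-1], lambda))))
  }
  fit <- optim(par = c(0.5, 0.5), fn = negloglik)
  fit$par   # approx. (0.847, 0.577), as stated above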