Statistical Inference I and II
Likelihood function

Stanislav Katina
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow

November 13, 2018

Probabilistic and Statistical Models: Likelihood function

Definition (likelihood function)
For a statistical model F under which we expect the data $x = (x_1, \ldots, x_n)^T$ to be observed, the function $L : \Theta \to \mathbb{R}^+ \cup \{0\}$, called the likelihood function (likelihood), is defined as
$$L(\theta|x) = L(\theta, x) = c(x)\, f(\theta|x),$$
where $c(x) > 0$ does not depend on $\theta$ and $f(\theta|x) = f(\theta, x) = \prod_{i=1}^{n} f(x_i, \theta)$.

The likelihood $L(\theta|x)$ is used when describing a function of the parameter given an outcome. The density (probability mass function) $f(x_i, \theta) = f(x_i|\theta)$ is used when describing a function of the outcome given a fixed parameter value.

The natural logarithm of the likelihood function, called the log-likelihood, is defined as
$$\ln L(\theta|x) = l(\theta|x) = \ln c(x) + \ln f(\theta|x).$$
The log-likelihood is more convenient to work with. We are searching for the maximum of the likelihood function. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself. Hence the log-likelihood can be used in place of the likelihood in finding the maximum. Finding the maximum of a function involves taking the (partial) derivative of the function, setting it equal to zero, and solving for the parameter being maximised.

Definition (maximum-likelihood estimate)
The estimate $\hat\theta_{ML} = \hat\theta$ of a parameter $\theta$, called the maximum-likelihood estimate (MLE), is a value which maximises the likelihood function, i.e.
$$\hat\theta_{ML} = \arg\max_{\theta} L(\theta|x) = \arg\max_{\theta} l(\theta|x).$$
The process of maximisation is called maximum-likelihood estimation:
1. the first derivative of the log-likelihood function (score function) $S(\theta) = \frac{\partial}{\partial\theta} l(\theta|x)$,
2. the likelihood equations (score equations) $S(\theta) = 0$,
3. the second derivative of the log-likelihood function $\frac{\partial^2}{\partial\theta^2} l(\theta|x)$; the second derivative is negative at the point of maximum, and the curvature at $\hat\theta$ is equal to the Fisher information $I(\hat\theta) = -\frac{\partial^2}{\partial\theta^2} l(\theta|x)\big|_{\theta=\hat\theta}$.

The curvature is inversely related to the variance of $\hat\theta$, i.e. $\widehat{\mathrm{Var}}[\hat\theta] = 1/I(\hat\theta)$. Since $X_i$, $i = 1, 2, \ldots, n$, are independent, $I(\theta) = n\, i(\theta)$, where $i(\theta)$ is the Fisher information of a single observation.

Ronald Aylmer Fisher (1890-1962), English statistician, wrote in 1925: "What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term 'likelihood' to designate this quantity."
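To make these definitions concrete, here is a minimal R sketch (assuming a single binomial observation x = 10 out of N = 20 trials, values that reappear later in the text) confirming numerically that the likelihood and the log-likelihood share the same maximiser:

  ## Minimal sketch: L(p|x) and l(p|x) have the same maximiser.
  ## Assumed data: one binomial observation x = 10 out of N = 20 trials.
  x <- 10; N <- 20
  lik    <- function(p) p^x * (1 - p)^(N - x)               # kernel of L(p|x)
  loglik <- function(p) x * log(p) + (N - x) * log(1 - p)   # l(p|x)
  p.hat.L <- optimize(lik,    interval = c(0, 1), maximum = TRUE)$maximum
  p.hat.l <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
  c(p.hat.L, p.hat.l, x/N)   # all agree up to numerical tolerance

The analytical answer $x/N$ is derived for the binomial model later in this section.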
Profile likelihood and log-likelihood function

Let $\theta = (\theta_1, \theta_2)^T$, where $\theta_1$ is the parameter of interest and $\theta_2$ a nuisance parameter. In some cases, the separation into two such components can be achieved after a suitable reparametrisation. If $\hat\theta_{2|\theta_1}$ denotes the value of $\theta_2$ which maximises the likelihood (or log-likelihood) function for the given value of $\theta_1$, the profile likelihood function is defined as
$$L_P(\theta_1|x) = \max_{\theta_2} L(\theta|x) = L((\theta_1, \hat\theta_{2|\theta_1})^T|x)$$
and the profile log-likelihood function as
$$l_P(\theta_1|x) = l((\theta_1, \hat\theta_{2|\theta_1})^T|x).$$
The term "profile" comes about through thinking of the usual (log-)likelihood function as a hill being observed from a viewpoint with abscissa $\theta_2 = \infty$, so that, for any fixed $\theta_1$, only the highest value $L((\theta_1, \hat\theta_{2|\theta_1})^T|x)$ or $l((\theta_1, \hat\theta_{2|\theta_1})^T|x)$ is seen.

Profile relative likelihood and log-likelihood function

The profile relative likelihood function is defined as
$$\bar L_P(\theta_1|x) = \frac{L((\theta_1, \hat\theta_{2|\theta_1})^T|x)}{L((\hat\theta_1, \hat\theta_2)^T|x)}$$
and the profile relative log-likelihood function as its natural logarithm, $\ln \bar L_P(\theta_1|x)$.

The estimated likelihood function is defined as $L_e(\theta_1|x) = L((\theta_1, \hat\theta_2)^T|x)$, where $\hat\theta_2$ is the unconstrained MLE of $\theta_2$, and the estimated log-likelihood function as $l_e(\theta_1|x) = l((\theta_1, \hat\theta_2)^T|x)$. The estimated relative likelihood function is defined as
$$\bar L_e(\theta_1|x) = \frac{L((\theta_1, \hat\theta_2)^T|x)}{L((\hat\theta_1, \hat\theta_2)^T|x)}$$
and the estimated relative log-likelihood function as its natural logarithm, $\ln \bar L_e(\theta_1|x)$.

Likelihood function of binomial distribution

Definition (likelihood and log-likelihood function of binomial distribution)
Let $X$ be binomially distributed with sample size $N$ and parameter $\theta = p$, i.e. $X \sim \mathrm{Bin}(N, p)$, and let $x$ be the observed realisation of $X$. Then the likelihood function is equal to
$$L(p|x) = \binom{N}{x} p^x (1-p)^{N-x}.$$
Since the binomial coefficient does not depend on $p$ and is therefore not important in likelihood maximisation, only the kernel (often called the likelihood as well) is used, i.e.
$$L(p|x) \propto p^x (1-p)^{N-x}.$$
The log-likelihood function is equal to
$$l(p|x) = x \ln p + (N-x) \ln(1-p).$$

Example (maximum-likelihood estimation of parameter p)
Let $X \sim \mathrm{Bin}(N, p)$. Derive $\hat p$ and $\widehat{\mathrm{Var}}[\hat p]$.

Solution
$$S(p) = \frac{\partial}{\partial p} l(p|x) = \frac{x}{p} - \frac{N-x}{1-p},$$
and $S(p) = 0$ gives $\hat p = x/N$. Further,
$$\frac{\partial^2}{\partial p^2} l(p|x) = -\frac{x}{p^2} - \frac{N-x}{(1-p)^2},$$
and substituting $x = N\hat p$ and evaluating at $p = \hat p$ gives
$$I(\hat p) = \frac{N\hat p}{\hat p^2} + \frac{N(1-\hat p)}{(1-\hat p)^2} = \frac{N}{\hat p(1-\hat p)}, \qquad \widehat{\mathrm{Var}}[\hat p] = \frac{\hat p(1-\hat p)}{N}.$$

Example (maximum-likelihood estimation of parameter p in R)
Generate in R pseudo-random variables $X \sim \mathrm{Bin}(N, p)$, where $N = 20$. Write an R function to calculate the likelihood function $L(p|x)$ of the binomial distribution and visualise it for (1) $x = 2$, $N = 20$, (2) $x = 10$, $N = 20$ and (3) $x = 18$, $N = 20$. Repeat the same for the log-likelihood function. Calculate also $\hat p$ using the function optimize(). Draw all three functions in three side-by-side windows with highlighted maxima.

Solution (partial)
$L(p|x) = p^x(1-p)^{N-x}$, where $p \in (0, 1)$, for $x = 2$, $x = 10$ and $x = 18$ with $N = 20$.
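One possible R solution sketch follows; the plotting details (labels, layout) are assumptions, and rbinom() can be used for the pseudo-random generation asked for in the exercise:

  ## One possible solution sketch for the binomial exercise above.
  ## Pseudo-random generation, e.g. rbinom(1, size = 20, prob = 0.5).
  lik.binom <- function(p, x, N) p^x * (1 - p)^(N - x)   # likelihood kernel
  N <- 20
  par(mfrow = c(1, 3))                                   # three side-by-side windows
  for (x in c(2, 10, 18)) {
    p.grid <- seq(0.001, 0.999, length.out = 500)
    p.hat  <- optimize(lik.binom, interval = c(0, 1), maximum = TRUE,
                       x = x, N = N)$maximum
    plot(p.grid, lik.binom(p.grid, x, N), type = "l",
         xlab = "probability p", ylab = "L(p|x)",
         main = paste("MLE of p =", round(p.hat, 2)))
    points(p.hat, lik.binom(p.hat, x, N), pch = 19)      # highlight the maximum
  }

The log-likelihood version is analogous, with the kernel replaced by $x \ln p + (N-x)\ln(1-p)$.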
Figure: Likelihood functions of the binomial distribution $X \sim \mathrm{Bin}(N, p)$, $N = 20$, for $x = 2$ (MLE $\hat p = 0.1$), $x = 10$ ($\hat p = 0.5$) and $x = 18$ ($\hat p = 0.9$).

Figure: Log-likelihood functions of the binomial distribution $X \sim \mathrm{Bin}(N, p)$, $N = 20$, for the same three data sets.

Likelihood function of multinomial distribution

Definition (likelihood and log-likelihood function of multinomial distribution)
Let $X$ be multinomially distributed with sample size $N$ and parameter $\theta = p$, i.e. $X \sim \mathrm{Mult}_J(N, p)$, and let the realisations of $X_j$ be $x_j = n_j$. Then the likelihood function is equal to
$$L(p|x) = \frac{N!}{\prod_{j=1}^{J} x_j!} \prod_{j=1}^{J} p_j^{x_j} \propto \prod_{j=1}^{J} p_j^{x_j}$$
(the multinomial coefficient plays the role of the constant $c(x)$) and the log-likelihood function is equal to
$$l(p|x) = \sum_{j=1}^{J} x_j \ln p_j.$$

Example (maximum-likelihood estimation of parameter p)
Let $X \sim \mathrm{Mult}_J(N, p)$. Derive $\hat p$ and $\widehat{\mathrm{Var}}[\hat p]$.

Solution
Let $p_J = 1 - \sum_{j=1}^{J-1} p_j$ and $p = (p_1, p_2, \ldots, p_{J-1})^T$. Then
$$l(p|x) = \sum_{j=1}^{J-1} n_j \ln p_j + n_J \ln\Big(1 - \sum_{j=1}^{J-1} p_j\Big),$$
$$(S(p))_j = \frac{\partial}{\partial p_j} l(p|x) = \frac{n_j}{p_j} - \frac{n_J}{p_J},$$
where $(S(p))_j$ are the elements of $S(p)$; setting $(S(p))_j = 0$ gives $\hat p_j = n_j/N$. Then
$$I(p) = -\frac{\partial}{\partial p} S(p) = \mathrm{diag}\Big(\frac{n_1}{p_1^2}, \frac{n_2}{p_2^2}, \ldots, \frac{n_{J-1}}{p_{J-1}^2}\Big) + \frac{n_J}{p_J^2}\, \mathbf{1}\mathbf{1}^T,$$
and, evaluated at $\hat p$,
$$I(\hat p) = N\Big[\mathrm{diag}\Big(\frac{1}{\hat p_1}, \frac{1}{\hat p_2}, \ldots, \frac{1}{\hat p_{J-1}}\Big) + \frac{\mathbf{1}\mathbf{1}^T}{\hat p_J}\Big] = N \begin{pmatrix} \frac{1}{\hat p_1} + \frac{1}{\hat p_J} & \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_J} \\ \frac{1}{\hat p_J} & \frac{1}{\hat p_2} + \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_J} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\hat p_J} & \frac{1}{\hat p_J} & \cdots & \frac{1}{\hat p_{J-1}} + \frac{1}{\hat p_J} \end{pmatrix}.$$
Hence
$$\widehat{\mathrm{Var}}[\hat p] = I^{-1}(\hat p) = \frac{1}{N}\big(\mathrm{diag}(\hat p) - \hat p \hat p^T\big) = \frac{1}{N} \begin{pmatrix} \hat p_1(1-\hat p_1) & -\hat p_1 \hat p_2 & \cdots & -\hat p_1 \hat p_{J-1} \\ -\hat p_2 \hat p_1 & \hat p_2(1-\hat p_2) & \cdots & -\hat p_2 \hat p_{J-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat p_{J-1} \hat p_1 & -\hat p_{J-1} \hat p_2 & \cdots & \hat p_{J-1}(1-\hat p_{J-1}) \end{pmatrix}.$$

Figure: Log-likelihood function of the multinomial (trinomial) distribution, contour plot over $(p_1, p_2)$.
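A short sketch evaluating these formulas follows; the trinomial counts used here are hypothetical, chosen only to illustrate the computation:

  ## Sketch: multinomial MLEs and their estimated covariance matrix.
  ## Hypothetical trinomial counts (J = 3), for illustration only.
  n <- c(45, 35, 20)
  N <- sum(n)
  p.hat <- n / N                          # MLEs p_j = n_j / N
  p1 <- p.hat[-length(p.hat)]             # first J - 1 components
  V <- (diag(p1) - outer(p1, p1)) / N     # Var[p.hat] = (diag(p) - p p^T) / N
  p.hat; V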
Likelihood function of Poisson distribution

Definition (likelihood and log-likelihood function of Poisson distribution)
Let $X$ be distributed as Poisson with parameter $\theta = \lambda$, i.e. $X \sim \mathrm{Poiss}(\lambda)$, with realisations $x_1, \ldots, x_N$. Then the likelihood function is equal to
$$L(\lambda|x) = \prod_{i=1}^{N} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum_{i=1}^{N} x_i}\, e^{-N\lambda}$$
and the log-likelihood function is equal to
$$l(\lambda|x) = \sum_{i=1}^{N} x_i \ln\lambda - N\lambda.$$
In the general notation (from the examples), $L(\lambda|x) = \prod_n p_n^{m_n}$, where $p_n = \Pr(X = n) = e^{-\lambda}\lambda^n/n!$ and $m_n$ is the frequency of the value $n$, so that
$$l(\lambda|x) = \sum_n n\, m_n \ln\lambda - \lambda \sum_n m_n.$$

Example (maximum-likelihood estimation)
Let $X \sim \mathrm{Poiss}(\lambda)$. Derive $\hat\lambda$ and $\widehat{\mathrm{Var}}[\hat\lambda]$.

Solution (partial)
$$S(\lambda) = \frac{\partial}{\partial\lambda} l(\lambda|x) = \frac{\sum_{i=1}^{N} x_i}{\lambda} - N, \qquad \frac{\partial^2}{\partial\lambda^2} l(\lambda|x) = -\frac{\sum_{i=1}^{N} x_i}{\lambda^2}.$$
Then $\hat\lambda = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar x$ and $\widehat{\mathrm{Var}}[\hat\lambda] = \bar x/N$. In the general notation, $\hat\lambda = \sum_n n\, m_n / \sum_n m_n$.

Example (maximum-likelihood estimation of parameter λ in R)
Write an R function to calculate the likelihood function $L(\lambda|x)$ and the log-likelihood function $l(\lambda|x)$ of the Poisson distribution $X \sim \mathrm{Poiss}(\lambda)$ for the horse kick data. Calculate also $\hat\lambda$ using the function optimize(). Draw both functions in two side-by-side windows with highlighted maxima.

Solution (partial)
$l(\lambda|x) = \sum_n n\, m_n \ln\lambda - \lambda \sum_n m_n$, where $\lambda \in (0, 2)$.
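A sketch for this exercise is given below; since the data set is not listed in this section, the counts $m_n$ are assumed here to be the classical Prussian horse-kick data (von Bortkiewicz), with $m_n$ corps-years showing $n$ deaths:

  ## Sketch for the horse kick exercise. Assumed data: the classical Prussian
  ## horse-kick counts m_n of corps-years with n deaths, n = 0, ..., 4.
  n  <- 0:4
  mn <- c(109, 65, 22, 3, 1)
  loglik <- function(lambda) sum(n * mn) * log(lambda) - lambda * sum(mn)
  lambda.hat <- optimize(loglik, interval = c(0, 2), maximum = TRUE)$maximum
  lambda.hat   # approx. 0.61 = sum(n * mn) / sum(mn), matching the figure below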
Figure: Likelihood function $L(\lambda|x)$ (left) and log-likelihood function $l(\lambda|x)$ (right) of the Poisson distribution $X \sim \mathrm{Poiss}(\lambda)$ for the horse kick data; MLE $\hat\lambda = 0.61$.

Assignments in R

Assignment (number of boys): Calculate $\hat p$ (the probability of having a boy in a family) and $\widehat{\mathrm{Var}}[\hat p]$ (the variance of that probability).

Assignment (killing by horse kick): Calculate $\hat\lambda$ (the mean number of annual deaths) and $\widehat{\mathrm{Var}}[\hat\lambda]$ (the variance of the mean number of annual deaths).

Assignment (accidents in a factory): Calculate $\hat\lambda$ (the mean number of accidents in a factory) and $\widehat{\mathrm{Var}}[\hat\lambda]$ (the variance of the mean number of accidents in a factory).

Assignment (blood groups): For Prague and Košice, calculate $\hat p$ (the probabilities of having a certain blood group in the particular city) and $\widehat{\mathrm{Var}}[\hat p]$ (the covariance matrix of those probabilities).

Assignment (eye and hair colour): Calculate $\hat p$ (the probabilities of having a certain eye and hair colour) and $\widehat{\mathrm{Var}}[\hat p]$ (the covariance matrix of those probabilities).

Likelihood function of normal distribution

Definition (likelihood and log-likelihood function of normal distribution)
Let $X$ be distributed normally with parameter $\theta = (\mu, \sigma^2)^T$, i.e. $X \sim N(\mu, \sigma^2)$, with realisations $x_i$ of $X_i$. Then the likelihood function is equal to
$$L(\theta|x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2}\Big(\frac{x_i - \mu}{\sigma}\Big)^2\Big) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2\Big)\Big)$$
and the log-likelihood function is equal to
$$l(\theta|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2.$$

Example (maximum-likelihood estimation of parameters μ and σ²)
Let $X \sim N(\mu, \sigma^2)$. Derive $\hat\theta = (\hat\mu, \hat\sigma^2)^T$ and $\widehat{\mathrm{Var}}[\hat\theta] = \hat\Sigma$.

Solution (partial)
$$S_1(\theta) = \frac{\partial}{\partial\mu} l(\theta|x) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu), \qquad S_2(\theta) = \frac{\partial}{\partial\sigma^2} l(\theta|x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Then $\hat\mu = \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2$, and
$$I(\theta) = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix}.$$

Example (maximum-likelihood estimation of parameters μ and σ² in R)
Generate in R pseudo-random variables $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$. Write an R function to calculate (1) the (profile) likelihood function $L_P(\mu|x)$ of the normal distribution for the generated data, (2) the (profile) likelihood function $L_P(\sigma^2|x)$, and (3) the likelihood function $L(\theta|x)$, where $\theta = (\mu, \sigma^2)^T$. Repeat the same for the log-likelihood function. Calculate also the MLEs using the functions optimize() and optim(). Draw all three functions in three side-by-side windows with highlighted maxima.

Solution (partial)
$l_P(\mu|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\hat\sigma^2_\mu - \frac{1}{2\hat\sigma^2_\mu}\big(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\big)$, where $\mu \in (2, 6)$ and $\hat\sigma^2_\mu = \frac{1}{n}\sum_i (x_i - \mu)^2$;
$l_P(\sigma^2|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum_i (x_i - \hat\mu_\sigma)^2}{2\sigma^2}$, where $\hat\mu_\sigma = \bar x$ (close to the true value $\mu = 4$) and $\sigma^2 \in (0.5, 1.5)$;
$l(\theta|x) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum_i (x_i - \mu)^2}{2\sigma^2}$, where $\mu \in (2, 6)$ and $\sigma^2 \in (0.5, 1.5)$.
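A possible sketch of the joint estimation via optim() is shown below; the seed and the starting values are assumptions. The inverse of the returned Hessian of the negative log-likelihood estimates $\widehat{\mathrm{Var}}[\hat\theta] = I^{-1}(\hat\theta)$:

  ## Sketch: joint MLE of (mu, sigma^2) by minimising the negative log-likelihood.
  set.seed(1)                                  # assumed seed, for reproducibility
  x <- rnorm(1000, mean = 4, sd = 1)
  negloglik <- function(theta) {
    mu <- theta[1]; sigma2 <- theta[2]
    if (sigma2 <= 0) return(Inf)               # stay inside the parameter space
    length(x)/2 * log(2 * pi * sigma2) + sum((x - mu)^2) / (2 * sigma2)
  }
  fit <- optim(par = c(2, 0.5), fn = negloglik, hessian = TRUE)
  fit$par                # close to (mean(x), mean((x - mean(x))^2))
  solve(fit$hessian)     # approximates diag(sigma^2/n, 2 sigma^4/n)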
Figure: Profile likelihood functions (left, middle) and likelihood function (right) of the normal distribution $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$; all functions multiplied by a suitable constant.

Figure: Profile log-likelihood functions (left, middle) and log-likelihood function (right) of the same distribution; all functions multiplied by a suitable constant.

Figure: Likelihood (left) and log-likelihood (right) surfaces of the normal distribution $X \sim N(\mu, \sigma^2)$, where $\mu = 4$, $\sigma^2 = 1$ and $n = 1000$; all functions multiplied by a suitable constant.

Approximation of likelihood function

Definition (relative likelihood and log-likelihood function)
The relative likelihood function is defined as
$$\bar L(\theta|x) = \frac{L(\theta|x)}{L(\hat\theta|x)}$$
and the relative log-likelihood function as its natural logarithm, $\ln \bar L(\theta|x)$.

It is often useful to approximate the likelihood function by a quadratic function. Besides the location of the maximum of the likelihood function, we then also need the curvature around the maximum. Since the log-likelihood is more convenient to work with, we seek a quadratic approximation of the log-likelihood function.

Definition (Taylor polynomial of order r)
If a function $g(x)$ has derivatives of order $r$, that is, $g^{(r)}(x) = \frac{\partial^r}{\partial x^r} g(x)$ exists, then for any constant $a$ the Taylor polynomial of order $r$ about $a$ is
$$T_r(x) = \sum_{j=0}^{r} \frac{g^{(j)}(a)}{j!}(x - a)^j.$$
In practical statistical situations we assume that the remainder $g(x) - T_r(x)$ converges to zero as $r$ increases, and therefore we ignore it. There are many explicit forms of the remainder; one of the most useful is
$$g(x) - T_r(x) = \int_a^x \frac{g^{(r+1)}(t)}{r!}(x - t)^r\, dt.$$
If $g^{(r)}(a) = \frac{\partial^r}{\partial x^r} g(x)\big|_{x=a}$ exists, then
$$\lim_{x \to a} \frac{g(x) - T_r(x)}{(x - a)^r} = 0.$$

The quadratic approximation of the log-likelihood function about $\hat\theta$ is defined as
$$l(\theta|x) \approx l(\hat\theta|x) + S(\hat\theta)(\theta - \hat\theta) - \frac{1}{2} I(\hat\theta)(\theta - \hat\theta)^2.$$
Since $S(\hat\theta) = 0$, the quadratic approximation of the relative log-likelihood function about $\hat\theta$ is
$$\ln \bar L(\theta|x) = \ln \frac{L(\theta|x)}{L(\hat\theta|x)} = l(\theta|x) - l(\hat\theta|x) \approx -\frac{1}{2} I(\hat\theta)(\theta - \hat\theta)^2.$$
It is often useful to visualise a derivative of the quadratic approximation, $S(\theta) \approx -I(\hat\theta)(\theta - \hat\theta)$, or equivalently $-I^{-1/2}(\hat\theta)S(\theta) \approx I^{1/2}(\hat\theta)(\theta - \hat\theta)$, where $-I^{-1/2}(\hat\theta)S(\theta)$ is plotted against $I^{1/2}(\hat\theta)(\theta - \hat\theta)$. If the quadratic approximation is correct, this should be a line with slope equal to one.

Figure: Relative binomial log-likelihood with its quadratic approximation (top) and linearity of the score function (bottom) for (a) $x = 8$, $N = 10$, (b) $x = 80$, $N = 100$, (c) $x = 800$, $N = 1000$.

Numerical maximisation of likelihood function

Isaac Newton (1643-1727) and Joseph Raphson (1648-1715).

Definition (Newton-Raphson method)
Given the quadratic approximation of the log-likelihood function about $\theta_0$,
$$l(\theta|x) \approx l(\theta_0|x) + S(\theta_0)(\theta - \theta_0) - \frac{1}{2} I(\theta_0)(\theta - \theta_0)^2,$$
or the linear approximation of the score function about $\theta_0$,
$$S(\theta) \approx S(\theta_0) - I(\theta_0)(\theta - \theta_0),$$
the numerical maximisation can be done via the iterative function
$$\theta_0 + \frac{S(\theta_0)}{I(\theta_0)}.$$

The iterative process is defined as follows:
1. initialisation step – a starting point $\theta^{(0)}$, where $I(\theta^{(0)}) \neq 0$,
2. updating equations – iteration of $\theta^{(i)} = \theta^{(i-1)} + \frac{S(\theta^{(i-1)})}{I(\theta^{(i-1)})}$, where $I(\theta^{(i-1)}) \neq 0$, for $i = 1, 2, \ldots$,
3. stopping rule based on an absolute convergence criterion – iterate until $|l(\theta^{(i)}|x) - l(\theta^{(i-1)}|x)| < \varepsilon$, where the threshold $\varepsilon$ is sufficiently small.

Geometrical interpretation: $\theta^{(i)}$ is the crossing point of the tangent of the score function $S(\cdot)$ at the point $[\theta^{(i-1)}, S(\theta^{(i-1)})]$ with the x-axis.

In R: optimize(f, interval, maximum = FALSE, tol, ...). Here golden-section search is combined with successive parabolic interpolation to speed up the convergence.
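The iteration above can also be implemented directly; a minimal sketch for the binomial log-likelihood (score and information as derived earlier, with assumed data x = 10, N = 20) is:

  ## Minimal Newton-Raphson sketch for the binomial log-likelihood.
  ## Assumed data: x = 10, N = 20, so the iteration converges to x/N = 0.5.
  x <- 10; N <- 20
  S      <- function(p) x/p - (N - x)/(1 - p)            # score function
  I      <- function(p) x/p^2 + (N - x)/(1 - p)^2        # observed information
  loglik <- function(p) x * log(p) + (N - x) * log(1 - p)
  p <- 0.2                                               # starting point theta^(0)
  repeat {
    p.new <- p + S(p)/I(p)                               # updating equation
    if (abs(loglik(p.new) - loglik(p)) < 1e-10) break    # stopping rule
    p <- p.new
  }
  p.new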
Definition (multivariate Newton-Raphson method)
Given the quadratic approximation of the log-likelihood function about $\theta_0$,
$$l(\theta|x) \approx l(\theta_0|x) + S(\theta_0)^T(\theta - \theta_0) - \frac{1}{2}(\theta - \theta_0)^T I(\theta_0)(\theta - \theta_0),$$
or the linear approximation of the score function about $\theta_0$,
$$S(\theta) \approx S(\theta_0) - I(\theta_0)(\theta - \theta_0),$$
the numerical maximisation can be done via the iterative function
$$\theta_0 + (I(\theta_0))^{-1} S(\theta_0).$$

The iterative process is defined as follows:
1. initialisation step – a starting point $\theta^{(0)}$, where $I(\theta^{(0)})$ is regular,
2. updating equations – iteration of $\theta^{(i)} = \theta^{(i-1)} + (I(\theta^{(i-1)}))^{-1} S(\theta^{(i-1)})$, where $I(\theta^{(i-1)})$ is regular, i.e. $\det I(\theta^{(i-1)}) \neq 0$, for $i = 1, 2, \ldots$,
3. stopping rule based on an absolute convergence criterion – iterate until $|l(\theta^{(i)}|x) - l(\theta^{(i-1)}|x)| < \varepsilon$, where the threshold $\varepsilon$ is sufficiently small.

In R: optim(par, fn, gr, method, control, hessian = FALSE, ...). The Newton-Raphson method is often modified – Fisher scoring method, quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.

Numerical maximisation of likelihood ≈ minimising negative log-likelihood

Nelder-Mead method (method of simplexes) – the idea of "jumps" across triangles defined by the points $\theta_1^{(i-1)}, \theta_2^{(i-1)}, \theta_3^{(i-1)}$, where $-l(\theta_1^{(i-1)}|x) < -l(\theta_2^{(i-1)}|x) < -l(\theta_3^{(i-1)}|x)$. We substitute the point $\theta_1^{(i-1)}$ with a "better" point $\theta_1^{(i)}$, where $-l(\theta_1^{(i)}|x) < -l(\theta_1^{(i-1)}|x)$. The new point is defined based on reflection (point symmetry), contraction or extrapolation (expansion) as
1. reflection: $z_1 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + 1 \cdot (\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
2. reflection and expansion: $z_2 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + 2(\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
3. reflection and contraction: $z_3 = \theta_1^{(i)} = \theta_{23}^{(i-1)} + \frac{1}{2}(\theta_{23}^{(i-1)} - \theta_1^{(i-1)})$,
4. contraction A: $z_4 = \theta_2^{(i)} = \theta_1^{(i-1)} + \frac{1}{2}(\theta_2^{(i-1)} - \theta_1^{(i-1)})$ and B: $z_5 = \theta_3^{(i)} = \theta_1^{(i-1)} + \frac{1}{2}(\theta_3^{(i-1)} - \theta_1^{(i-1)})$,
where $\theta_{23}^{(i-1)} = \frac{\theta_2^{(i-1)} + \theta_3^{(i-1)}}{2}$, i.e. the mid-point of the line segment defined by the points $\theta_2^{(i-1)}$ and $\theta_3^{(i-1)}$. If $-l(\theta_1^{(i)}|x) < -l(\theta_1^{(i-1)}|x)$, then the new triangle is defined by $\theta_1^{(i)}, \theta_2^{(i-1)}, \theta_3^{(i-1)}$ for (1) to (3). Otherwise the new triangle is $\theta_1^{(i-1)}, \theta_2^{(i)}, \theta_3^{(i)}$.

Figure: Demonstration of the Nelder-Mead method minimising the function $((x - y)^2 + (x - 2)^2 + (y - 3)^4)/10$ (contour plot with the simplex path); the number of iterations is 49.
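The demonstration in the figure can be reproduced (up to plotting) with optim(), whose default method is Nelder-Mead; the starting point (0, 0) is an assumption:

  ## Sketch: Nelder-Mead via optim() on the function from the figure.
  f <- function(z) ((z[1] - z[2])^2 + (z[1] - 2)^2 + (z[2] - 3)^4) / 10
  fit <- optim(par = c(0, 0), fn = f, method = "Nelder-Mead")
  fit$par      # approximate minimiser, close to (2.25, 2.5)
  fit$counts   # number of function evaluations used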
Mixture of two univariate normal distributions – likelihood estimation

Given data $x = (x_1, x_2, \ldots, x_n)^T$, the likelihood function $L(\theta|x) = \prod_{i=1}^{n} f(x_i, \theta)$, where $\theta = (p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^T$, must be maximised numerically. One complication is that $l(\theta|x)$ is unbounded. To see how this can happen, fix $p$, $\mu_2$ and $\sigma_2^2$ at any set of values, with the exception that $p$ must not equal 0 or 1. Denote these fixed values by $p_*$, $\mu_{2,*}$ and $\sigma_{2,*}^2$, respectively. Now set $\mu_1 = x_i$ for any choice of $i \in \{1, 2, \ldots, n\}$. This leaves only $\sigma_1^2$ unspecified, and $\theta_{\sigma_1^2} = (p_*, x_i, \mu_{2,*}, \sigma_1^2, \sigma_{2,*}^2)^T$ can be used to denote the parameter vector with the values of the other parameters fixed as described. When $\theta = \theta_{\sigma_1^2}$, the binormal density function evaluated at $x_i$ is
$$f(x_i, \theta_{\sigma_1^2}) = \frac{p_*}{\sqrt{2\pi}\,\sigma_1} + \frac{1 - p_*}{\sqrt{2\pi}\,\sigma_{2,*}} \exp\Big(-\frac{(\mu_1 - \mu_{2,*})^2}{2\sigma_{2,*}^2}\Big).$$
Note that $f(x_i, \theta_{\sigma_1^2})$ can be made arbitrarily large by making $\sigma_1$ arbitrarily small.

Since $L(\theta_{\sigma_1^2}|x) = \prod_{i=1}^{n} f(x_i, \theta_{\sigma_1^2})$, and each $f(x_i, \theta_{\sigma_1^2})$ is bounded away from zero (by virtue of $p_*$, $\mu_{2,*}$ and $\sigma_{2,*}^2$ being fixed), it follows that $l(\theta_{\sigma_1^2}|x)$ can also be made arbitrarily large.

A further problem is that the parametrisation of the binormal model is not identifiable, because the roles of the two distributions in the mixture can be swapped. That is, the binormal distribution corresponding to parameters $(p, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ is the same as that specified by parameters $(1 - p, \mu_2, \mu_1, \sigma_2^2, \sigma_1^2)$.

The unbounded likelihood and non-identifiability issues can be eliminated by a suitable restriction of the parameter space. One possibility is to constrain the ratio of the two variances by requiring that $0 < c < \sigma_1^2/\sigma_2^2 < 1$, where $c$ is some suitably small constant. In practice, despite the unbounded likelihood and non-identifiability, a sensible local maximum of the likelihood function can often be found using unconstrained numerical optimisation. This is especially the case if there is good separation between the two component normal distributions and the optimiser is given a starting value of $\theta$ that is somewhere in the general vicinity of the local maximum. Ultimately, it is the shape of the likelihood function in the neighbourhood of this local maximum that is relevant to inference.

The binormal density function is a linear combination of the density functions of the $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ distributions.

Figure: Mixture of two normal densities – histogram (absolute scale) and density estimate of the waiting times (minutes) from the faithful data.

The histogram of the waiting times shows that they look like a combination of (very roughly) 40 % from a $N(52, 25)$ distribution and 60 % from a $N(80, 25)$ distribution. The corresponding parameter values $\theta^{(0)} = (0.4, 52, 80, 25, 25)^T$ would make good starting values for finding a local MLE using numerical optimisation. To estimate $\theta$, use the optim() function.
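A possible optim() call is sketched below, using the suggested starting values; the negative log-likelihood is minimised and dnorm() supplies the component densities:

  ## Sketch: fitting the binormal mixture to the faithful waiting times.
  x <- faithful$waiting
  negloglik <- function(theta) {
    p <- theta[1]; mu1 <- theta[2]; mu2 <- theta[3]
    s1 <- sqrt(theta[4]); s2 <- sqrt(theta[5])   # sqrt of a proposed sigma^2 < 0
                                                 # yields NaN warnings, see below
    -sum(log(p * dnorm(x, mu1, s1) + (1 - p) * dnorm(x, mu2, s2)))
  }
  fit <- optim(par = c(0.4, 52, 80, 25, 25), fn = negloglik)
  fit$par   # local MLE of (p, mu1, mu2, sigma1^2, sigma2^2)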
Such a call of optim() can produce warning messages (not shown), because it attempts to evaluate the negative log-likelihood at parameter values outside of the parameter space (e.g. $\sigma_1^2 < 0$ or $\sigma_2^2 < 0$). This can be avoided by using the lower and upper bound arguments in the optim() call.

Negative binomial distribution

Example (negative binomial distribution; accidents in the factories)
Let $X$ be the number of workers having an accident in the munition factories in England during the First World War (Greenwood and Yule 1920), $n$ the number of accidents, $m_n$ the number of workers with the particular number of accidents, and $M = \sum_n m_n = 647$. Question: Calculate the theoretical frequencies $m_{n,E}$.

Table: Observed and theoretical frequencies ($m_{n,O}$ and $m_{n,E}$) of workers with n accidents
  n        0    1    2    3    4    ≥5
  m_{n,O}  447  132  42   21   3    2
  m_{n,E}  446  134  44   15   5    3

The likelihood function is defined as
$$L((\alpha, \pi)^T|x) = \prod_{n=0}^{4} (\Pr(X = n))^{m_n} \Big(1 - \sum_{n=0}^{4} \Pr(X = n)\Big)^{m_{\geq 5}}$$
and the log-likelihood function as
$$l((\alpha, \pi)^T|x) = \sum_{n=0}^{4} m_n \ln \Pr(X = n) + m_{\geq 5} \ln\Big(1 - \sum_{n=0}^{4} \Pr(X = n)\Big).$$
Using numerical optimisation we get the estimates $\hat\alpha = 0.84$ and $\hat\pi = 0.64$, with estimated mean (risk) $\hat\mu = \frac{1 - \hat\pi}{\hat\pi}\hat\alpha = 0.47$.

Figure: Comparison of observed and expected frequencies (negative binomial distribution); the differences $m_{n,O} - m_{n,E}$ are 1, −2, −2, 6, −2, −1.

Zero-inflated Poisson (ZIP) distribution

Example (ZIP distribution; number of movements of a foetal lamb)
Let $X$ be the number of movements of a foetal lamb in 240 five-second periods (Leroux and Puterman 1992), $n$ the number of movements, $m_n$ the number of periods with the particular number of movements. Question: Calculate the theoretical frequencies $m_{n,E}$ using the Poisson and ZIP distributions.

Table: Observed and theoretical frequencies ($m_{n,O}$ and $m_{n,E}$) of five-second periods with n movements
  n                  0    1    2    3    4    5    6    7
  m_{n,O}            182  41   12   2    2    0    0    1
  m_{n,E} (Poisson)  168  60   11   1    0    0    0    0
  m_{n,E} (ZIP)      182  37   16   4    1    0    0    0

The likelihood function is defined as
$$L((\lambda, p)^T|x) = (p + (1 - p)f(0, \lambda))^{m_0} \prod_{n > 0} ((1 - p)f(n, \lambda))^{m_n}$$
and the log-likelihood function as
$$l((\lambda, p)^T|x) = m_0 \ln(p + (1 - p)f(0, \lambda)) + \sum_{n > 0} m_n \ln((1 - p)f(n, \lambda)),$$
where $f(n, \lambda)$ is the Poisson probability mass function. For the Poisson model, $\hat\lambda = \sum_n n\, m_n / \sum_n m_n = 86/240 = 0.358$. For the ZIP model, using numerical optimisation we get the estimates $\hat\lambda = 0.847$ and $\hat p = 0.577$.

Figure: Comparison of observed and expected frequencies, Poisson distribution (left; differences 14, −19, 1, 1, 2, 0, 0, 1) and ZIP distribution (right; differences 0, 4, −4, −2, 1, 0, 0, 1).
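The ZIP estimates quoted above can be checked with the following sketch, built directly from the table and the log-likelihood; the starting values are assumptions:

  ## Sketch: ZIP fit for the foetal lamb counts from the table above.
  n  <- 0:7
  mn <- c(182, 41, 12, 2, 2, 0, 0, 1)
  negloglik <- function(theta) {
    lambda <- theta[1]; p <- theta[2]
    if (lambda <= 0 || p < 0 || p >= 1) return(Inf)   # parameter space guard
    -(mn[1] * log(p + (1 - p) * dpois(0, lambda)) +
        sum(mn[-1] * log((1 - p) * dpois(n[-1], lambda))))
  }
  fit <- optim(par = c(0.5, 0.5), fn = negloglik)
  fit$par   # approx. (0.847, 0.577), as stated above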