Statistics for Computer Sciences Lecture 04 to Lecture 09 Probabilistic and Statistical Models Stanislav Katina1 1Institute of Mathematics and Statistics, Masaryk University Honorary Research Fellow, The University of Glasgow December 1, 2015 Probabilistic and Statistical Models Random variable, random vector, data, individuals ◮ random variable and random vector ◮ random variable X is a function from a sample space to a set of real numbers X : Y → R (a set of all possible outcomes) ◮ 2-dimensional random vector (X1, X2)T : Y → R2 ◮ k-dimensional random vector (X1, X2, . . . , Xk )T : Y → Rk ◮ data – data vector and data matrix – the elements of a vector and the rows of a matrix are measured on individuals (statistical units) ◮ data as realisations of X – n-dimensional vector x = (x1, x2, . . . , xn)T , where n is a sample size ◮ data as realisations of (X1, X2)T – (n × 2)-dimensional matrix with rows (xi1, xi2)T , i = 1, 2, . . . , n and columns x1 and x2 ◮ data as realisations of (X1, X2, . . . , Xk )T – (n × k)-dimensional matrix with rows (xi1, xi2, . . . , xik )T , i = 1, 2, . . . , n and columns x1, x2 and xk Probabilistic and Statistical Models Model ◮ based on probabilistic sampling principles, the individuals are sampled from a population ◮ attribute – a specific value of a variable ◮ with certain precision, data are measured on individuals ◮ descriptive statistics – describing and summarising data ◮ inferential statistics (statistical inference) – inferring (drawing conclusions) about random variable based on a model fitted to data ◮ F is a set of models (probabilistic or statistical) ◮ X is characterised by a model F(·), F ∈ F ◮ (X1, X2)T is characterised by a model F(2) (·), F ∈ F ◮ (X1, X2, . . . , Xk )T is characterised by a model F(k) (·), F ∈ F ◮ parameter – a numerical quantity that characterises a model – one-dimensional parameter θ, k-dimensional vector of parameters θ = (θ1, θ2, . . . , θk )T Probabilistic and Statistical Models Distribution function, probability and density function ◮ useful assumption – Xi, i = 1, 2, . . . 
, n, are independently identically distributed random variables ◮ distribution function ◮ discrete random variable FX (x) = Pr (X ≤ x) = i:xi ≤x Pr (X = xi ) , where k(∞) i=1 pi = 1, Pr (X = xi ) = pi = fX (xi ) = f(xi ), ∀xi , where pi is probability mass function; {xi , pi } k(∞) i=1 , k ∈ N+ ◮ continuous random variable FX (x) = x −∞ f (t) dt, f (x) ≥ 0, where ∞ −∞ f (x) dx = 1, fX (x) = f(x) = ∂ ∂x FX (x) is density function Probabilistic and Statistical Models Parametric and non-parametric model ◮ Θ is a parametric space, the support of F(·; θ) is Yθ ⊆ Rn (the smallest set, where the distribution function is defined); sample space Y = ∪θ∈ΘYθ ◮ F as a parametric set of distribution functions F = F(·; θ) : θ ∈ Θ ⊆ Rk , ◮ F as a parametric set of probability or density functions F = f(·; θ) : θ ∈ Θ ⊆ Rk ◮ F as non-parametric set F = {a set of all density functions} , alternatively, probability or distribution function can be used Probabilistic and Statistical Models Reading of mathematical notation ◮ the term ”probability model” is often reduced to ”distribution” ◮ ”Random variable X is distributed as F(x)” or ”random variable X is characterised by distribution F(x)”, notation X ∼ FX (x); symbol ”∼” means ”asymptotically”, ”for sufficiently large n” (notation X ∼ fX (x) is used very rarely) ◮ ”Random variable X is distributed as random variable Y” or ”Random variable X and Y are identically distributed” (notation X ∼ Y or FX (x) ∼ FY (y) ◮ the term ”statistical model” is often reduced to ”model” (usually referred as causal statistical model or model of causal dependence) ◮ ”Y depends on X”, where X is independent variable and Y is dependent variable (notation Y|X) Probabilistic and Statistical Models Reading of mathematical notation ◮ ”X is normally distributed with parameters µ and σ2”, notation X ∼ N(µ, σ2), where θ = (µ, σ2)T ◮ ”X = (X1, X2)T is characterised by bivariate normal distribution with parameters µ1, µ2, σ2 1, σ2 2 and ρ”, notation X ∼ N2(µ, Σ), where θ = (µ1, µ2, σ2 1, σ2 2, ρ)T ◮ ”X = (X1, X2, . . . , , Xk )T is characterised by multivariate normal distribution with parameters µ1, µ2, . . ., µk , σ2 1, σ2 2, . . ., σ2 k , and ρ1,2, . . ., ρk−1,k , ”, notation X ∼ Nk (µ, Σ), where θ = (µ1, µ2, . . . , µk , σ2 1, σ2 2, . . . , σ2 k , ρ1,2, . . . , ρk−1,k )T ◮ ”X is binomially distributed with parameter p”, notation X ∼ Bin(N, p), where θ = p ◮ ”X is characterised by distribution with parameter λ”, notation X ∼ Poiss(λ), where θ = λ ◮ ”X = (X1, X2, . . . , , Xk )T is multinomially distributed with parameter p”, notation X ∼ Multk (N, p), where θ = p Probabilistic and Statistical Models Measures of normal distribution ◮ ”X is normally distributed with parameters µ and σ2”, notation X ∼ N(µ, σ2), where θ = (µ, σ2)T ◮ Random variable Z (Z-transformation) Pr(Z = X−µ σ < x1−α) = 1 − α, Z ∼ N(0, 1) ◮ Rule ”90 − 95 − 99” Pr (a ≤ X ≤ b) = 1 − α, where 1 − α = 0.90, 0.95 and 0.99, a = µ − x1−α/2σ and b = µ + x1−α/2σ ◮ Rule ”68.27 − 95.45 − 99.73” Pr (a ≤ X < b) = Pr (X < b)−Pr (X < a) = FX (b)−FX (a), where a = µ − kσ, b = µ + kσ, k = 1, 2 and 3 Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution Definition (approximation of binomial distribution by normal distribution) If random variable X is binomially distributed with parameter p, X ∼ Bin(N, p), where θ = p, then if Np > 5 a Nq > 5, where q = 1 − p, then the distribution of random variable X can be approximated by normal distribution, X ∼ N(Np, Npq), where θ = (Np, Npq)T . 
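The Np > 5 and Nq > 5 rule of thumb can be checked numerically in R. The sketch below is illustrative only (it is not part of the original slides); it compares the binomial and approximating normal distribution functions over all x and reports the largest discrepancy, using a continuity correction of 0.5 (the correction itself is an addition, not mentioned in the definition above).
# Sketch: how close is N(Np, Npq) to Bin(N, p) once Np > 5 and Nq > 5?
approx.error <- function(N, p) {
  x <- 0:N
  exact  <- pbinom(x, size = N, prob = p)
  approx <- pnorm(x + 0.5, mean = N * p, sd = sqrt(N * p * (1 - p)))  # continuity correction
  max(abs(exact - approx))   # worst-case difference over all x
}
approx.error(N = 11, p = 0.5)   # Np = Nq = 5.5, both just above 5
approx.error(N = 51, p = 0.1)   # Np = 5.1, Nq = 45.9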
Table: Examples of minimal N for fixed p
p 0.1 0.2 0.3 0.4 0.5
q 0.9 0.8 0.7 0.6 0.5
N 51 26 17 13 11
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Example
Let Pr(male) = 0.515 and Pr(female) = 0.485. Let X be the frequency of males and Y the frequency of females. Assuming that X ∼ Bin(N, p), calculate (a) Pr(X ≤ 3) if N = 5, (b) Pr(X ≤ 5) if N = 10 and (c) Pr(X ≤ 25) if N = 50. Compare the results with the normal approximation X ∼ N(Np, Npq).
Solution
(a) E[X] = Np = 5 × 0.515 = 2.575, E[Y] = 5 × 0.485 = 2.425; exactly, Pr(X ≤ 3) = Σ_{k≤3} C(5, k) 0.515^k 0.485^{5−k} = 0.793; under the approximation N(5 × 0.515, 5 × 0.515 × 0.485), Pr(X ≤ 3) = 0.648.
(b) E[X] = 10 × 0.515 = 5.15, E[Y] = 10 × 0.485 = 4.85; exactly, Pr(X ≤ 5) = Σ_{k≤5} C(10, k) 0.515^k 0.485^{10−k} = 0.586; under N(10 × 0.515, 10 × 0.515 × 0.485), Pr(X ≤ 5) = 0.462.
(c) E[X] = 50 × 0.515 = 25.75, E[Y] = 50 × 0.485 = 24.25; exactly, Pr(X ≤ 25) = Σ_{k≤25} C(50, k) 0.515^k 0.485^{50−k} = 0.471; under N(50 × 0.515, 50 × 0.515 × 0.485), Pr(X ≤ 25) = 0.416.
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Figure: Probability mass function of the binomial distribution superimposed by the density function of the approximating normal distribution (p = 0.515; N = 5, 10 and 50); panels Bin(5, 0.515), Bin(10, 0.515), Bin(50, 0.515)
Probabilistic and Statistical Models Approximation of binomial distribution by normal distribution
Figure: Distribution function of the binomial distribution superimposed by the distribution function of the approximating normal distribution (p = 0.1; N = 5, 10 and 50); panels Bin(5, 0.1), Bin(10, 0.1), Bin(50, 0.1)
Probabilistic and Statistical Models (Univariate) normal distribution
Definition (normal distribution) Random variable X is normally distributed with parameters µ and σ², i.e. X ∼ N(µ, σ²), where θ = (µ, σ²)T, if its density is f(x) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}, x ∈ R, σ > 0.
Definition (standardised normal distribution) Random variable X is normally distributed with parameters µ = 0 and σ = 1, i.e. X ∼ N(0, 1), where θ = (0, 1)T, if its density is φ(x) = f(x) = (1/√(2π)) exp{−x²/2}, x ∈ R.
Parameter µ is called the mean of X and σ² the variance of X.
Probabilistic and Statistical Models Bivariate normal distribution
Definition (bivariate normal distribution) Random vector (X, Y)T is normally distributed with parameters µ and Σ, i.e. (X, Y)T ∼ N2(µ, Σ), where µ = (µ1, µ2)T and Σ = [σ1², ρσ1σ2; ρσ1σ2, σ2²], θ = (µ1, µ2, σ1², σ2², ρ)T, (x, y)T ∈ R², µj ∈ R, σj² > 0, j = 1, 2, ρ ∈ (−1, 1); the density is defined as f(x, y) = (1/A) exp{−(1/B)[(x − µ1)²/σ1² − 2ρ(x − µ1)(y − µ2)/(σ1σ2) + (y − µ2)²/σ2²]}, where A = 2π√(σ1²σ2²(1 − ρ²)) and B = 2(1 − ρ²).
Probabilistic and Statistical Models Standardised bivariate normal distribution
Definition (bivariate standardised normal distribution) Random vector (X, Y)T is normally distributed with parameters µ and Σ, i.e. (X, Y)T ∼ N2(µ, Σ), where µ = (0, 0)T and Σ = [1, ρ; ρ, 1], θ = (0, 0, 1, 1, ρ)T, (x, y)T ∈ R², ρ ∈ (−1, 1); the density is defined as f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² − 2ρxy + y²)/(2(1 − ρ²))}.
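As a small illustration of the definition above, the standardised bivariate normal density can be evaluated on a grid and drawn as contours in R; this is a sketch only, and the function name, grid and ρ = 0.5 are chosen here for illustration.
# Sketch: contours of the standardised bivariate normal density with correlation rho
dbvnorm <- function(x, y, rho)
  1 / (2 * pi * sqrt(1 - rho^2)) *
    exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2)))
x <- y <- seq(-3, 3, length.out = 101)
z <- outer(x, y, dbvnorm, rho = 0.5)
contour(x, y, z, xlab = "x", ylab = "y", main = "standardised N2, rho = 0.5")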
Probabilistic and Statistical Models Standardised bivariate and multivariate normal distribution
Let x = x1, y = x2 and x = (x1, x2)T. Then the standardised bivariate density can be rewritten in matrix form: f(x) = (1/(2π(det(Σ))^{1/2})) exp{−(1/2) xT Σ⁻¹ x}.
Let (X1, X2, . . . , Xk)T ∼ Nk(µ, Σ) and let x be a k-dimensional vector; then the density is equal to f(x) = (1/((2π)^{k/2}(det(Σ))^{1/2})) exp{−(1/2)(x − µ)T Σ⁻¹ (x − µ)}.
Marginal distributions of:
◮ bivariate and multivariate normal distribution – Xj ∼ N(µj, σj²), j = 1, 2, . . . , k
◮ standardised bivariate and multivariate normal distribution – Xj ∼ N(0, 1), j = 1, 2, . . . , k
Probabilistic and Statistical Models Bivariate normal distribution – simulation
Simulation of pseudo-random numbers from the bivariate normal distribution:
1. let X1 ∼ N(0, 1) and X2 ∼ N(0, 1) be independent
2. then (Y1, Y2)T ∼ N2(µ, Σ), where Y1 = σ1X1 + µ1 and Y2 = σ2(ρX1 + √(1 − ρ²) X2) + µ2
Example
Simulate pseudo-random numbers from the bivariate normal distribution, where θ = (µ1, µ2, σ1², σ2², ρ)T.
(a) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0; (1) n = 50 and (2) n = 1000;
(b) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0.5; (1) n = 50 and (2) n = 1000;
(c) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1.2, ρ = 0.5; (1) n = 50 and (2) n = 1000.
Probabilistic and Statistical Models Mixture of two bivariate normal distributions
The mixture of two univariate normal distributions is defined as follows: pN(µ1, σ1²) + (1 − p)N(µ2, σ2²), where θ = (µ1, µ2, σ1², σ2²)T.
The mixture of two bivariate normal distributions is defined as follows: pN2(µ1, Σ1) + (1 − p)N2(µ2, Σ2), where θ = (µ11, µ12, σ11², σ12², ρ1, µ21, µ22, σ21², σ22², ρ2)T.
Probabilistic and Statistical Models Binomial distribution
Jacob Bernoulli (1655–1705) – one of the founding fathers of probability theory.
Definition (binomial distribution) Let N be the number of independent identical (random) Bernoulli trials Xi, where Xi = 1 is a success (event occurred) and Xi = 0 is a failure (event did not occur), i = 1, 2, . . . , N. Then the probability of success is Pr(Xi = 1) = p and the probability of failure is Pr(Xi = 0) = 1 − p. The number of successes is X = Σ_{i=1}^{N} Xi. The probability that random variable X is equal to x = n (realisation) is defined as Pr(X = x) = C(N, x) p^x (1 − p)^{N−x}, for x = 0, 1, 2, . . . , N.
Expected value of X is defined as E[X] = Σ_{x=0}^{N} x Pr(X = x) = Σ_{x=0}^{N} x C(N, x) p^x (1 − p)^{N−x} = Np.
Variance of X is defined as Var[X] = Σ_{x=0}^{N} (x − E[X])² Pr(X = x) = Σ_{x=0}^{N} (x − Np)² C(N, x) p^x (1 − p)^{N−x} = Np(1 − p).
Probabilistic and Statistical Models Binomial distribution
Reading: Random variable X is binomially distributed with parameters N and p, where θ = p.
Notation: X ∼ Bin(N, p), θ = p.
Do we need to change it? YES. Why? Due to generalisation. Equivalently, X ∼ Bin(N, p, 1 − p), where X = (X1, X2)T, θ = (p, 1 − p)T, X1 is the number of successes, X2 = N − X1 is the number of failures, X1 ∼ Bin(N, p) and X2 ∼ Bin(N, 1 − p). Then
◮ E[X1] = Np, E[X2] = N(1 − p),
◮ Var[X1] = Var[X2] = Np(1 − p),
◮ Cov[X1, X2] = −Np(1 − p) and
◮ Cor[X1, X2] = −1.
Finally, n = (n1, n2)T and p = (p1, p2)T, p1 = p and p2 = 1 − p. Then θ = p.
Probabilistic and Statistical Models Binomial distribution
Example (number of boys) The number of boys X in families with N children is binomially distributed, i.e. X ∼ Bin(N, p), where N = 12 and the number of families is M = 6115 (Geissler 1889). Calculate the theoretical frequencies m_{n,E}. You know that p̂ = Σ_{n=0}^{N} n m_n / (NM) = 0.5192 (weighted average; average of the number of families weighted by the number of boys).
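The theoretical frequencies asked for here follow directly from the binomial probability mass function; a short R sketch (the rounded values agree with the table that follows):
# Sketch: expected numbers of families with 0..12 boys under Bin(12, 0.5192)
M <- 6115; N <- 12; p <- 0.5192
m.E <- M * dbinom(0:N, size = N, prob = p)
round(m.E)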
Table: Observed and theoretical frequencies (mn,O and mn,E ) of families with n boys (O = observed, E = expected, theoretical) n 0 1 2 3 4 5 6 7 8 9 10 11 12 mn,O 3 24 104 286 670 1033 1343 1112 829 478 181 45 7 mn,E 1 12 72 258 628 1085 1367 1266 854 410 133 26 2 Probabilistic and Statistical Models Multinomial distribution Definition (multinomial distribution) Let N be number of independent identical (random) trials and in each of them J ≥ 2 distinct possible outcomes can occur, where Xji = 1 is a success (event occurred) and Xji = 0 is a failure (event did not occur), i = 1, 2, . . . , N, j = 1, 2, . . . , J. Number of successes Xj = N i=1 Xji, N = J j=1 Xj. Then probability of success of i-th outcome in j-th trial is equal to Pr(Xji = 1) = pj (cell probabilities) and probability of failure in j-th trial is equal to Pr(Xji = 0) = 1 − pj. Let X = (X1, X2, . . . , XJ)T . The probability that random variables Xj are equal to xj = nj is defined as Pr(X1 = x1, . . . , XJ = xJ) = N! x1! . . . xJ! px1 1 px2 2 . . . pxJ J = N! j xj! J j=1 p xj j . Probabilistic and Statistical Models Multinomial distribution Expected value of X is a vector defined as E[X] = Np. Covariance matrix of X is defined as Var[X] = N diag (p) − ppT , where (Var[X])ij = Npj(1 − pj)) if i = j −Npipj if i = j Marginal distributions are binomial, i.e. Xj ∼ Bin N, pj . Then ◮ E[Xj] = Npj, ◮ Var[Xj] = Npj 1 − pj ◮ Cov Xi, Xj = −Npipj ◮ Cor Xi, Xj = −pipj / pi (1 − pi) pj 1 − pj Probabilistic and Statistical Models Multinomial distribution Reading: Random vector X is multinomially distributed with parameters N and p, where θ = p. Notation: X ∼ MultJ(N, p). If J = 2, then Bin (N, p) ≈ Mult2 (N, p) Realisation of one trial xij could be (1, 0, . . . , 0)T or (0, 1, . . . , 0)T . Example (number of individuals with certain blood type) Number of individuals X = (X1, X2, X3, X4)T with certain blood group is multinomially distributed following Hardy-Wienberg equilibrium, i.e. X = (X1, X2, X3, X4)T ∼ Mult4(N, p), where N = 500 (Katina et al. 2015). Calculate theoretical frequencies nE,j. attributes (groups) 0 A B AB nO,j 209 184 81 26 nE,j 210 183 80 27 Probabilistic and Statistical Models Multinomial distribution Example (number of individuals with certain socioeconomic status, political philosophy and political affiliation) Number of individuals X1, . . . , X8 with socioeconomic status, political philosophy and political affiliation is multinomially distributed, i.e. X = (X1, . . . , X8)T ∼ Mult8(N, p), where p = (p1, p2, . . . , p8)T and N = 50 (Christensen 1990). Calculate (a) Var[X1], (b) Var[X3], (c) Cov [X1, X3] and (d) Corr [X1, X3]. Table: 2 × 4 contingency table of probabilities pj D-Li D-C R-Li R-C total H 0.12 0.12 0.04 0.12 0.4 Lo 0.18 0.18 0.06 0.18 0.6 total 0.30 0.30 0.10 0.30 1.0 Probabilistic and Statistical Models Multinomial distribution Notation: (1) socioeconomic status (high – H, low – Lo), (2) political philosophy (democrat – D, republican – R) a (3) political affiliation (liberal – Li, conservative – C). Then X1 (H-D-Li), X2 (H-D-C), X3 (H-R-Li), X4 (H-R-C), X5 (Lo-D-Li), X6 (Lo-D-C), X7 (Lo-R-Li) a X8 (Lo-R-C). Solution: Var[X1] = 50 × 0.12 × (1 − 0.12) = 5.28 Var[X3] = 50 × 0.04 × (1 − 0.04) = 1.92 Cov [X1, X3] = −50 × 0.12 × 0.04 = −0.24 Cor [X1, X3] = −0.24/ √ 5.28 × 1.92 = −0.075 What are the expected frequencies? 
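A short R sketch answering this question and reproducing the variances, covariance and correlation from the solution above (the expected frequencies also appear in the table that follows):
# Sketch: moments of Mult8(50, p) for the socioeconomic/political example
N <- 50
p <- c(0.12, 0.12, 0.04, 0.12, 0.18, 0.18, 0.06, 0.18)   # cell probabilities p1..p8
N * p                               # expected frequencies N*p_j
V <- N * (diag(p) - p %*% t(p))     # covariance matrix N(diag(p) - p p^T)
V[1, 1]                             # Var[X1] = 5.28
V[3, 3]                             # Var[X3] = 1.92
V[1, 3]                             # Cov[X1, X3] = -0.24
V[1, 3] / sqrt(V[1, 1] * V[3, 3])   # Cor[X1, X3] = -0.075 (approximately)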
Table: 2 × 4 contingency table of frequencies Xj D-Li D-C R-Li R-C H 6 6 2 6 Lo 9 9 3 9 Probabilistic and Statistical Models Multinomial distribution Example (number of individuals with certain eye and hair colour) Let X = (X1, X2, . . . , X12)T be random vector of number of individuals with certain eye colour (with levels blue Bl, green Gr, brown Br) and hair color (with levels blond Blo, light-brown LB, black Ble, red R), where X1 means Bl-Blo, X2 means Bl-LB, X3 means Bl-Ble, X4 means Bl-R, X5 means Gr-Blo, X6 means Gr-LB, X7 means Gr-Ble, X8 means Gr-R, X9 means Br-Blo, X10 means Br-LB, X11 means Br-Ble and X12 means Br-R. Let X ∼ Mult12(N, p), where N = 6800 (Yule and Kendall 1950). Table: 3 × 4 contingency table of frequencies nj Blo LB Ble R row sums Bl 1768 807 189 47 2811 Gr 946 1387 746 53 3132 Br 115 438 288 16 857 column sums 2829 2632 1223 116 6800 Probabilistic and Statistical Models Product-multinomial distribution Definition (product-multinomial distribution) Let Nk be number of independent identical (random) trials and in each of them J ≥ 2 distinct possible outcomes can occur, where Xkji = 1 is a success (event occurred) and Xkji = 0 is a failure (event did not occur), i = 1, 2, . . . , Nk , k = 1, 2, . . . , K, j = 1, 2, . . . , J. Number of successes Xkj = Nk i=1 Xkji and K k=1 Nk = N. Then probability of success of kj-th outcome in i-th trial is equal to Pr(Xkji = 1) = pkj (cell probabilities) and probability of failure of kj-th outcome in i-th trial is equal to Pr(Xkji = 0) = 1 − pkj. Let Xk = (Xk1, Xk2, . . . , XkJ)T si multinomially distributed with parameters Nk and pk , i.e. Xk ∼ MultJ (Nk , pk ), kde θk = pk a pk = (pk1, pk2, . . . , pkJ)T . Let realisations of Xk be xk . The xkj = nkj and nk = (nk1, nk2, . . . , nkJ)T . Additionally, Xk are independent. Probabilistic and Statistical Models Product-multinomial distribution The probability that random variables Xkj are equal to xkj = nkj (for all j and k) is defined as Pr(Xkj = xkj, ∀k, j) = K k=1 Pr(Xkj = xkj, ∀j). The probability that random variables Xkj are equal to xkj = nkj (for all j) is defined as Pr(Xkj = xkj, ∀j) =  Nk !/ J j=1 xkj!   J j=1 p xkj kj . Then Pr(Xkj = xkj, ∀k, j) = K k=1    Nk !/ J j=1 xkj!   J j=1 p xkj kj   . Probabilistic and Statistical Models Product-multinomial distribution Reading: Random matrix X is product-multinomially distributed with parameters N = (N1, N2, . . . , NK )T and P with the rows pk , where θk = pk , k = 1, 2, . . . , K. Notation: X ∼ ProdMultK (N, p). If K = 1, then MultJ (N, p ) ≈ ProdMult1 (N, p) Realisation of one trial xkij could be (1, 0, . . . , 0)T or (0, 1, . . . , 0)T . Then ◮ expected frequencies are equal to Nk pkj , ◮ within each Xk , variances Var[Xkj ], covariances Cov[Xkj] and correlations Cor[Xkj ] are calculated as for multinomial distribution, ◮ between Xk , e.g. Cov[X1, X2], k = 1, 2, are zeroes due to independence of Xk Probabilistic and Statistical Models Product-multinomial distribution Example (number of individuals with certain blood type) Let X = (X1, X2)T , where X1 = (X11, X12, X13, X14)T is number of individuals in Koˇsice (Slovakia) with certain blood group, X2 = (X21, X22, X23, X24)T is number of individuals in Prague (Czech Republic) with certain blood group. X is product-multinomially distributed, i.e. X ∼ ProdMult2(N, P), where N = (N1, N2)T , where N1 = 500 and N2 = 400 (Katina et al. 2015). Calculate theoretical frequencies nE,kj . 
Question: What are the probabilities of having a particular blood group in Prague and Košice?
Table: Observed frequencies of particular blood groups in Prague and Košice
attributes (groups) 0 A B AB
n1j = nPrague,j 209 184 81 26
n2j = nKošice,j 138 147 84 31
Probabilistic and Statistical Models Product-multinomial distribution
Example (number of individuals with certain socioeconomic status, political philosophy and political affiliation) The number of individuals X = (X1, X2)T with certain socioeconomic status, political philosophy and political affiliation is product-multinomially distributed, i.e. X ∼ ProdMult2(N, P), where X1 = (X11, X12, X13, X14)T is the number of individuals with high socioeconomic status, X2 = (X21, X22, X23, X24)T the number of individuals with low socioeconomic status, pk = (p1|k, p2|k, . . . , pJ|k)T, pkj = pj|k = nkj/nk, k = 1, 2, N = (N1, N2)T, N1 = 30, N2 = 20 (Christensen 1990). Calculate (a) the probabilities pj|k, (b) the expected frequencies, (c) Var[X3|1], (d) Cov[X1|2, X3|2] and (e) Cov[X1|1, X3|2].
Probabilistic and Statistical Models Product-multinomial distribution
Notation: (1) socioeconomic status (high – H, low – Lo), (2) political philosophy (democrat – D, republican – R) and (3) political affiliation (liberal – Li, conservative – C). Then X11 = X1|1 (H-D-Li), X12 = X2|1 (H-D-C), X13 = X3|1 (H-R-Li), X14 = X4|1 (H-R-C), X21 = X1|2 (Lo-D-Li), X22 = X2|2 (Lo-D-C), X23 = X3|2 (Lo-R-Li) and X24 = X4|2 (Lo-R-C).
Solution:
Table: 2 × 4 contingency table of probabilities pj|k
D-Li D-C R-Li R-C total
H 0.3 0.3 0.1 0.3 1.0
Lo 0.3 0.3 0.1 0.3 1.0
Probabilistic and Statistical Models Product-multinomial distribution
Table: 2 × 4 contingency table of frequencies nkj
D-Li D-C R-Li R-C total
H 9 9 3 9 30
Lo 6 6 2 6 20
Var[X3|1] = 30 × 0.1 × (1 − 0.1) = 2.7, Cov[X1|2, X3|2] = −20 × 0.3 × 0.1 = −0.6, Cov[X1|1, X3|2] = 0, due to the independence of X1 and X2.
Probabilistic and Statistical Models Poisson distribution
Definition (Poisson distribution) Let X be a random variable characterised by the Poisson distribution, i.e. X ∼ Poiss(λ), where θ = λ. Then Pr(X = x) = λ^x e^{−λ}/x!, x = 0, 1, . . . , where x = n is a realisation of X. Then E[X] = λ and Var[X] = λ.
Probabilistic and Statistical Models Poisson distribution
Example (Poisson distribution; killing by horse kicks) Data were published by the Russian economist Ladislaus Bortkiewicz in his book entitled Das Gesetz der kleinen Zahlen (The Law of Small Numbers) in 1898. Let X be the number of soldiers killed by horse kicks in one army corps of the Prussian army within one year (von Bortkiewicz 1898; 10 different army corps observed over 20 years, 1875–1894), n the number of annual deaths and mn the number of corps-years with that number of annual deaths, M = Σn mn = 10 × 20 = 200.
Then X ∼ Poiss(λ), where λ̂ = Σn n mn / Σn mn = 0.61 (weighted average; average of the numbers of army corps weighted by the numbers of annual deaths).
Table: Observed and theoretical frequencies (mn,O and mn,E) of corps-years with n soldiers killed by horse kicks (n annual deaths) over 20 years
n 0 1 2 3 4 5+
mn,O 109 65 22 3 1 0
mn,E 109 66 20 4 1 0
Probabilistic and Statistical Models Poisson distribution
Example (Poisson distribution; accidents in the factories) Let X be the number of accidents of a worker in the munition factories in England during the First World War (Greenwood and Yule 1920), n the number of accidents and mn the number of workers with that number of accidents, M = Σn mn = 647. Then X ∼ Poiss(λ), where λ̂ = Σn n mn / Σn mn = 0.47 (weighted average; average of the numbers of workers weighted by the numbers of accidents).
Table: Observed and theoretical frequencies (mn,O and mn,E) of workers with n accidents
n 0 1 2 3 4 ≥ 5
mn,O 447 132 42 21 3 2
mn,E 406 189 44 7 1 0
Probabilistic and Statistical Models Formulations of hypotheses about probability distributions
1. multinomial distribution – example – number of individuals with certain eye and hair colour: Are the rows and columns of a contingency table independent?
◮ Are the frequencies of individuals with certain eye colour (with levels blue, green, brown) independent of hair colour (with levels blond, light-brown, black, red)?
2. product-multinomial distribution: Are the vectors of frequencies the same in each row? Are the vectors of frequencies independent of the row index?
◮ example – number of individuals with certain socioeconomic status, political philosophy and affiliation – Are the vectors of frequencies of individuals (D-Li, D-C, R-Li, R-C) the same for each level of socioeconomic status (high and low)?
◮ example – blood groups – Is the distribution of the blood groups (0, A, B, AB) the same in Prague and Košice?
Probabilistic and Statistical Models Formulations of hypotheses about probability distributions
3. binomial distribution – example – number of boys:
◮ Is the number of boys in families with 12 children binomially distributed?
◮ Is the probability of having a boy in the family equal to 0.5?
4. Poisson distribution:
◮ example – killing by horse kicks – Is the distribution of the number of soldiers killed by horse kicks per army corps and year Poisson?
◮ example – accidents in the factories – Is the distribution of the number of accidents per worker Poisson?
Probabilistic and Statistical Models Types of contingency tables – multinomial distribution
1 × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
x1 x2 . . . xJ N
1 × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
p1 p2 . . . pJ 1
2 × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
row 1 x11 x12 . . . x1J N1
row 2 x21 x22 . . . x2J N2
2 × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
row 1 p11 p12 . . . p1J p1• = 1
row 2 p21 p22 . . . p2J p2• = 1
Probabilistic and Statistical Models Types of contingency tables – multinomial distribution
K × J contingency table of frequencies
outcome 1 outcome 2 . . . outcome J sum
row 1 x11 x12 . . . x1J N1
row 2 x21 x22 . . . x2J N2
... ... ... . . . ... ...
row K xK1 xK2 . . . xKJ NK
K × J contingency table of probabilities
outcome 1 outcome 2 . . . outcome J sum
row 1 p11 p12 . . . p1J p1• = 1
row 2 p21 p22 . . . p2J p2• = 1
... ... ... . . . ... ...
row K pK1 pK2 . . .
pKJ pK• = 1 Probabilistic and Statistical Models Types of contingency tables – product-multinomial distribution 1 × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum x1 x2 . . . xJ N 1 × J contingency table of probabilities (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum p1 p2 . . . pJ 1 2 × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum group 1 x11 x12 . . . x1J N1 group 2 x21 x22 . . . x2J N2 2 × J contingency table of probabilities outcome 1 outcome 2 . . . outcome J sum group 1 p1|1 p2|1 . . . pJ|1 1 group 2 p1|2 p2|2 . . . pJ|2 1 Probabilistic and Statistical Models Types of contingency tables – product-multinomial distribution K × J contingency table of frequencies (≈ multinomial distribution) outcome 1 outcome 2 . . . outcome J sum group 1 x11 x12 . . . x1J N1 group 2 x21 x22 . . . x2J N2 ... ... ... . . . ... ... group K xK1 xK2 . . . xKJ NK K × J contingency table of probabilities outcome 1 outcome 2 . . . outcome J sum group 1 p1|1 p2|1 . . . pJ|1 1 group 2 p1|2 p2|2 . . . pJ|2 1 ... ... ... . . . ... ... group K p1|K p2|K . . . pJ|K 1 Probabilistic and Statistical Models Data structure for 1 × J contingency table – multinomial distribution outcome 1 outcome 2 . . . outcome J sum x1 1 0 . . . 0 1 x2 0 1 0 1 x3 0 1 0 1 x4 1 0 . . . 0 1 ... ... ... . . . ... ... xN−1 0 0 . . . 1 1 xN 1 0 . . . 0 1 sum= x x1 x2 . . . xJ N ◮ sum of each row is one ◮ sum of all row sums is N ◮ sum of each column is xj , where j = 1, 2, . . . , J ◮ sum of all xj , j = 1, 2, . . . , J, is N ◮ x = n Probabilistic and Statistical Models Data structure for K × J contingency table – (product-)multinomial distribution outcome 1 outcome 2 . . . outcome J sum xk1 1 0 . . . 0 1 xk2 0 1 0 1 xk3 0 1 0 1 xk4 1 0 . . . 0 1 ... ... ... . . . ... ... xk,Nk −1 0 0 . . . 1 1 xk,Nk 1 0 . . . 0 1 sum= xk xk1 xk2 . . . xkJ Nk ◮ sum of each row is one ◮ sum of all row sums is Nk ◮ sum of each column is xkj , where j = 1, 2, . . . , J ◮ sum of all xkj , j = 1, 2, . . . , J, is N ◮ xk = nk , where k = 1, 2, . . . , K Probabilistic and Statistical Models Assignments in Assignment number of boys: 1. Draw probability mass function of number of boys in the families with 12 children? 2. What are the probabilities of having n boys in the family (n = 1, 2, . . . , 12)? What is the probability of having eight or more boys in the family? What is the probability of having five to seven boys in the family? Assignment killing by horse kick: 1. Draw probability mass function of number of corpses with certain number of solders killed by horse kick? 2. What are the probabilities of having n annual deaths (n = 0, 1, 2, 3, 4, 5+)? What is the probability of having one or less annual deaths? Assignment accidents in the factories: 1. Draw probability mass function of number of workers having an accident? 2. What are the probabilities of having n accidents (n = 0, 1, 2, 3, 4, 5+)? What is the probability of having two or more accidents? Probabilistic and Statistical Models Assignments in Assignment number of individuals with certain socioeconomic status, political philosophy and affiliation: 1. What is the number of all 2 × 4 contingency table with N = 50? n+k−1 k = 57 7 = 57 50 = 264385836 choose(57,50)=choose(57,7) 2. What is the probability of getting the following 2 × 4 contingency table? D-Li D-C R-Li R-C H 5 7 4 6 Lo 8 7 3 10 Pr(X1 = x1, X2 = x2, . . . , X8 = x8) = 50! 5!7!4!6!8!7!3!10! 
0.125 0.127 0.044 0.126 0.188 0.187 0.063 0.1810 = 2.332506 × 10−6 n<-c(5,7,4,6,8,7,3,10) p<-c(.12,.12,.04,.12,.18,.18,.06,.18) dmultinom(x=n,prob=p) ## 2.332506e-06 Probabilistic and Statistical Models Assignments in Assignment number of individuals with certain socioeconomic status, political philosophy and affiliation: 3. What is the most probable 2 × 4 contingency table and what is the probability of getting it? D-Li D-C R-Li R-C H 6 6 2 6 Lo 9 9 3 9 Pr(X1 = x1, X2 = x2, . . . , X8 = x8) = 50! 6!6!2!6!9!9!3!9! 0.126 0.126 0.042 0.126 0.189 0.189 0.063 0.189 = 1.020471 × 10−5 4.375× more than in (2) n<-c(6,6,2,6,9,9,3,9) p<-c(.12,.12,.04,.12,.18,.18,.06,.18) dmultinom(x=n,prob=p) ## 1.020471e-05 4. Draw probability mass function of number of possible 2 × 4 contingency tables with N = 50? Probabilistic and Statistical Models Likelihood function Definition (likelihood function) For a statistical model F where we expect the data x ∈ R to be observed, the function L : Θ → R+ ∪ {0}, called likelihood function (likelihood), is defined as L(θ|x) = L(θ, x) = c(x)f(θ, x), where c ∈ R is independent of θ, f(θ|x) = f(θ, x) = n i=1 f(xi , θ). Likelihood L(θ|x) is used when describing a function of a parameter given an outcome. Density (probability mass function) f(xi , θ) = f(xi |θ) is used when describing a function of the outcome given a fixed parameter value. Probabilistic and Statistical Models Likelihood function The natural logarithm of the likelihood function, called the log-likelihood, is defined as ln(L(θ|x)) = l(θ|x) = ln c + ln(f(θ|x)). ◮ The log-likelihood, is more convenient to work with. ◮ We are searching for the maximum of likelihood function. ◮ Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself. Hence the log-likelihood can be used in place of the likelihood in finding the maximum. ◮ Finding the maximum of a function involves taking the (partial) derivative of a function, equaling it to zero, and solving for the parameter being maximized. Probabilistic and Statistical Models Likelihood function Definition (maximum-likelihood estimate) The estimate of a parameter θ, θML = θ, called maximum-likelihood estimate (MLE), is a value which maximises the likelihood function, i.e. θML = arg max ∀θ L(θ|x) = arg max ∀θ l(θ|x). The process of maximisation is called maximum-likelihood estimation: ◮ the first derivative of log-likelihood function (score function) S(θ) = ∂ ∂θ l(θ|x), ◮ likelihood equations (score equations) S(θ) = 0, ◮ the second derivative of log-likelihood function ∂2 ∂θ2 l(θ|x), ◮ in the maximum is the second derivative negative and the curvature in θ is equal to Fisher information I(θ) = − ∂2 ∂θ2 l(θ|x)|θ=θ. Probabilistic and Statistical Models Likelihood function ◮ The curvature is inversely related to the variance of θ, i.e. Var[θ] = 1/I(θ). ◮ Since Xi , i = 1, 2, . . . , n are independent, I(θ) = ni(θ). Ronald Aylmer Fisher (1890−1962) – English statistician, wrote in 1925: What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate measuring our order of preference among different possible populations, does not in fact obey the laws of probability. To distinguish it from probability, I have used the term ”likelihood” to designate this quantity. 
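The relation Var[θ̂] = 1/I(θ̂) can be checked numerically. The sketch below is not from the slides; the data x = 40, N = 100 are arbitrary illustrative values. It minimises a negative binomial log-likelihood with optim() and inverts the returned Hessian, which is the observed information.
# Sketch: MLE and its variance from the observed information, binomial example
x <- 40; N <- 100
negloglik <- function(p) -(x * log(p) + (N - x) * log(1 - p))
fit <- optim(par = 0.5, fn = negloglik, method = "Brent",
             lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)
fit$par                       # MLE p.hat = x/N = 0.4
1 / fit$hessian[1, 1]         # 1/I(p.hat), numerically
fit$par * (1 - fit$par) / N   # analytic p.hat*(1 - p.hat)/N = 0.0024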
Probabilistic and Statistical Models Profile likelihood function Let θ = (θ1, θ2)T , where s θ1 is the parameter of interest and θ2 a nuisance parameter. In some cases, the separation into two such components can be achieved after suitable reparametrisation. If θ2|θ1 denotes the value of θ2 which maximizes the likelihood (or log-likelihood) function for the given value of θ1, we define profile likelihood function LP(θ1|x) = L((θ1, θ2|θ1 )T |x) and profile log-likelihood function lP(θ1|x) = l((θ1, θ2|θ1 )T |x) The term ”profile” comes about through thinking of the usual (log-)likelihood function a s a hill being observed from a viewpoint with abscissa θ2 = ∞, so that, for any fixed θ1, only the highest value L((θ1, θ2|θ1 )T |x) or l((θ1, θ2|θ1 )T |x) is seen. Probabilistic and Statistical Models Likelihood function of binomial distribution Definition (likelihood and log-likelihood function of binomial distribution) Let X be binomially distributed with parameters N and θ = p, i.e. X ∼ Bin(N, p). Realisations of X be x = n. Then the likelihood function is equal to L(p|x) = N i=1 N xi pxi (1 − p) N−xi = px (1 − p) N−x N i=1 N xi . Since the product of binomial coefficients is not important in likelihood maximisation, only the kernel (often called likelihood as well) is used. Then L(p|x) ≈ px (1 − p) N−x . The log-likelihood function is equal to l(p|x) = x ln p + (N − x) ln (1 − p) . Probabilistic and Statistical Models Likelihood function of binomial distribution Example (maximum-likelihood estimation) Let X be binomially distributed with parameters N and θ = p, i.e. X ∼ Bin(N, p). Derive p and Var[p]. Solution (partial) S(p) = ∂ ∂p l(p|x) = x p − N − x 1 − p , ∂2 ∂p2 l(p|x) = − Np /p2 − N 1 − p / (1 − p) 2 . Then p = x N and Var[p] = p(1 − p) N . Probabilistic and Statistical Models Likelihood function of binomial distribution Example (maximal likelihood estimates of p) Generate in pseudo-random variables X ∼ Bin(N, p), where N = 20. Write -function to calculate (1) likelihood function L(p|x) of binomial distribution, where x = 2, N = 20, (2) likelihood function L(p|x) of binomial distribution, where x = 10, N = 20 and (3) likelihood function L(p|x) of binomial distribution, where x = 18, N = 20. Repeat the same for log-likelihood function. Calculate also p using function optimize(). Draw all three functions in three side-by-side windows with highlighted maxima. 
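One possible R sketch for this exercise (illustrative only; the log-likelihood version is analogous, and x = 2, 10, 18 are the values given above):
# Sketch: binomial likelihoods for x = 2, 10, 18 with N = 20, maximised by optimize()
N <- 20
lik <- function(p, x) p^x * (1 - p)^(N - x)    # kernel of L(p|x)
par(mfrow = c(1, 3))
for (x.obs in c(2, 10, 18)) {
  p.hat <- optimize(lik, interval = c(0, 1), x = x.obs, maximum = TRUE)$maximum
  curve(lik(p, x = x.obs), from = 0, to = 1, xname = "p",
        xlab = "probability p", ylab = "L(p|x)",
        main = paste("MLE of p =", round(p.hat, 2)))
  abline(v = p.hat, lty = 2)                   # highlight the maximum
}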
Solution (partial) l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 2, N = 20 l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 10, N = 20 l(p|x) = px (1 − p) N−x , where p ∈ (0, 1), x = 18, N = 20 Probabilistic and Statistical Models Likelihood function of binomial distribution 0.0 0.2 0.4 0.6 0.8 1.0 0.00000.00050.00100.0015 MLE of p = 0.1 probability p L(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 0e+002e−074e−076e−078e−07 MLE of p = 0.5 probability p L(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 0.00000.00050.00100.0015 MLE of p = 0.9 probability p L(p|x) Figure: Likelihood functions of binomial distribution X ∼ Bin(N, p), where N = 20 Probabilistic and Statistical Models Likelihood function of binomial distribution 0.0 0.2 0.4 0.6 0.8 1.0 −120−100−80−60−40−20 MLE of p = 0.1 probability p l(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 −70−60−50−40−30−20 MLE of p = 0.5 probability p l(p|x) 0.0 0.2 0.4 0.6 0.8 1.0 −120−100−80−60−40−20 MLE of p = 0.9 probability p l(p|x) Figure: Log-likelihood functions of binomial distribution X ∼ Bin(N, p), where N = 20 Probabilistic and Statistical Models Likelihood function of multinomial distribution Definition (likelihood and log-likelihood function of multinomial distribution) Let X be multinomially with parameters N and θ = p, i.e. X ∼ MultJ (N, p). Realisations of Xj be xj = nj . Then the (kernel of) likelihood function is equal to L(p|x) = N i=1 N! J j=1 xj ! J j=1 p xji j ≈ J j=1 p xj j and the log-likelihood function is equal to l (p|x) = J j=1 xj ln pj . Probabilistic and Statistical Models Likelihood function of multinomial distribution Example (maximum-likelihood estimation) Let X be multinomially with parameters N and θ = p, i.e. X ∼ MultJ (N, p). Derive p and Var[p]. Solution (partial) Let pJ = 1 − J−1 j=1 pj and p = (p1, p2, . . . , pJ−1)T Then l(p|x) = J−1 j=1 nj ln pj + nJ ln(1 − J−1 j=1 pj ) and (S(p))j = ∂ ∂pj l(p|x) = nj pj − nJ pJ as the elements of S(p)). Then I(p) = − ∂ ∂p S(p) = diag n1 p2 1 , n2 p2 2 , . . . , nJ−1 p2 J−1 + nJ p2 J 11T . Probabilistic and Statistical Models Likelihood function of multinomial distribution I(p) = N diag 1 p1 , 1 p2 , . . . , 1 pJ−1 + 11T pJ . Then I(p) = N       1 p1 + 1 pJ 1 pJ 1 pJ . . . 1 pJ 1 pJ 1 p2 + 1 pJ 1 pJ . . . 1 pJ ... ... ... ... ... 1 pJ 1 pJ . . . 1 pJ 1 pJ−1 + 1 pJ       . Var[p] = I−1 (p) = 1 N diag p − ppT . Then Var[p] = 1 N      p1(1 − p1) −p1p2 . . . −p1pJ−1 −p2p1 p2(1 − p2) . . . −p2pJ−1 ... ... ... ... −pJ−1p1 −pJ−1p2 . . . pJ−1(1 − pJ−1)      . Probabilistic and Statistical Models Likelihood function of multinomial distribution p1 p2 ● 0 0.2 0.4 0.6 0.8 1 00.20.40.60.81 Figure: Log-likelihood function of multinomial (trinomial) distribution Probabilistic and Statistical Models Likelihood function of Poisson distribution Definition (likelihood and log-likelihood function of Poisson distribution) Let X be distributed as Poisson with parameter θ = λ, i.e. X ∼ Poiss(λ). Realisations of Xj be xj = nj . Then the (kernel of) likelihood function is equal to L (λ|x) = N i=1 λxi e−λ xi ! ≈ λ N i=1 xi e−Nλ and the log-likelihood function is equal to l (λ|x) = N i=1 xi ln λ − Nλ. In general, L(λ|x) = n pmn n , where pn = Pr(X = n) = e−λ λn /n! and l(λ|x) = −λ n mn + n nmn ln λ. Probabilistic and Statistical Models Likelihood function of Poisson distribution Example (maximum-likelihood estimation) Let X be distributed Poisson with parameter θ = λ, i.e. X ∼ Poiss(λ). Derive λ and Var[λ]. Solution (partial) S(λ) = ∂ ∂λ l (λ|x) = N i=1 xi λ − N, ∂2 ∂λ2 l (λ|x) = − N i=1 xi λ2 . 
Then λ̂ = Σ_{i=1}^{N} xi / N = x̄ and Var[λ̂] = x̄/N. In general notation, λ̂ = Σn n mn / Σn mn.
Probabilistic and Statistical Models Likelihood function of Poisson distribution
Example (maximum-likelihood estimates of λ) Write an R function to calculate the likelihood function L(λ|x) and the log-likelihood function l(λ|x) of the Poisson distribution X ∼ Poiss(λ) for the horse kick data. Calculate also λ̂ using the function optimize(). Draw both functions in two side-by-side windows with the maximum highlighted.
Solution (partial) l(λ|x) = −λ Σn mn + Σn n mn ln λ, where λ ∈ (0, 2)
Probabilistic and Statistical Models Likelihood function of Poisson distribution
Figure: Likelihood function L(λ|x) (left) and log-likelihood function l(λ|x) (right) of the Poisson distribution X ∼ Poiss(λ) for the horse kick data (MLE of λ = 0.61)
Probabilistic and Statistical Models Assignments in R
Assignment number of boys: Calculate p̂ (the probability of having a boy in a family) and Var[p̂] (the variance of the estimated probability of having a boy in a family).
Assignment killing by horse kicks: Calculate λ̂ (the mean number of annual deaths) and Var[λ̂] (the variance of the estimated mean number of annual deaths).
Assignment accidents in a factory: Calculate λ̂ (the mean number of accidents in a factory) and Var[λ̂] (the variance of the estimated mean number of accidents in a factory).
Probabilistic and Statistical Models Assignments in R
Assignment blood groups: In Prague and Košice, calculate p̂ (the probabilities of having a certain blood group in the particular city) and Var[p̂] (the covariance matrix of the estimated probabilities of having a certain blood group in the particular city).
Assignment eye and hair colour: Calculate p̂ (the probabilities of having a certain eye and hair colour) and Var[p̂] (the covariance matrix of the estimated probabilities of having a certain eye and hair colour).
Probabilistic and Statistical Models Likelihood function of normal distribution
Definition (likelihood and log-likelihood function of normal distribution) Let X be normally distributed with parameter θ = (µ, σ²)T, i.e. X ∼ N(µ, σ²). Let the realisations of Xi be xi. Then the likelihood function is equal to L(θ|x) = Π_{i=1}^{n} (1/√(2πσ²)) exp{−(1/2)((xi − µ)/σ)²} = (1/(2πσ²)^{n/2}) exp{−(1/(2σ²))[Σ_{i} xi² − 2µ Σ_{i} xi + nµ²]} and the log-likelihood function is equal to l(θ|x) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)².
Probabilistic and Statistical Models Likelihood function of normal distribution
Example (maximum-likelihood estimation) Let X be normally distributed with parameter θ = (µ, σ²)T, i.e. X ∼ N(µ, σ²). Derive θ̂ = (µ̂, σ̂²)T and Var[θ̂] = Σ.
Solution (partial) S1(θ) = ∂l(θ|x)/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ), S2(θ) = ∂l(θ|x)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (xi − µ)². Then µ̂ = x̄ = (1/n) Σ_{i} xi, σ̂² = (1/n) Σ_{i} (xi − µ̂)², and I(θ) = diag(n/σ², n/(2σ⁴)).
Probabilistic and Statistical Models Likelihood function of normal distribution
Example (maximum-likelihood estimates of µ and σ²) Generate in R pseudo-random variables X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000. Write an R function to calculate (1) the (profile) likelihood function LP(µ|x) of the normal distribution for the generated data X, (2) the (profile) likelihood function LP(σ²|x) of the normal distribution for the generated data X, and (3) the likelihood function L(θ|x) of the normal distribution for the generated data X, where θ = (µ, σ²)T. Repeat the same for the log-likelihood function. Calculate also the MLEs using the functions optimize() and optim(). Draw all three functions in three side-by-side windows with the maxima highlighted.
Solution (partial)
lP(µ|x) = −(n/2) ln(2π) − (n/2) ln σµ² − (1/(2σµ²))[Σ_{i} xi² − 2µ Σ_{i} xi + nµ²], where µ ∈ (2, 6) and σµ² = 1;
lP(σ²|x) = −(n/2) ln(2π) − (n/2) ln σ² − Σ_{i} (xi − µσ)²/(2σ²), where µσ = 4 and σ² ∈ (0.5, 1.5);
l(θ|x) = −(n/2) ln(2π) − (n/2) ln σ² − Σ_{i} (xi − µ)²/(2σ²), where µ ∈ (2, 6) and σ² ∈ (0.5, 1.5).
Probabilistic and Statistical Models Likelihood function of normal distribution
n <- 1000
x <- rnorm(n, mean = 4, sd = 1)             # pseudo-random sample from N(4, 1)
# profile (negative) log-likelihood for mu, with sigma^2 fixed at 1
sigma.mu <- 1
negloglikemu <- function(mu)
  n/2*log(2*pi) + n/2*log(sigma.mu^2) +
    (sum(x^2) - 2*mu*sum(x) + n*mu^2)/(2*sigma.mu^2)
OPTmu <- optimize(negloglikemu, c(2, 6), maximum = FALSE)
OPTmu$minimum                               # approx. 4, e.g. 3.987524
# profile (negative) log-likelihood for sigma^2, with mu fixed at 4
mu.sigma <- 4
negloglikesigma <- function(sigma2)
  n/2*log(2*pi) + n/2*log(sigma2) + sum((x - mu.sigma)^2)/(2*sigma2)
OPTsigma <- optimize(negloglikesigma, c(0.5, 1.5), maximum = FALSE)
OPTsigma$minimum                            # approx. 1, e.g. 0.9630124
# joint negative log-likelihood in theta = (mu, sigma^2)
negloglike <- function(theta)
  (n/2)*log(2*pi) + (n/2)*log(theta[2]) + (1/(2*theta[2]))*sum((x - theta[1])^2)
OPTboth <- optim(c(3, 0.5), negloglike, method = "Nelder-Mead", hessian = TRUE)
OPTboth$par                                 # approx. (4, 1), e.g. 3.9875376 0.9627521
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Profile likelihood functions (left, middle) and likelihood function (right) of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here 10⁻⁴ L(·)
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Profile log-likelihood functions (left, middle) and log-likelihood function (right) of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here exp(10⁻⁴ L(·))
Probabilistic and Statistical Models Likelihood function of normal distribution
Figure: Likelihood (left) and log-likelihood (right) function of the normal distribution X ∼ N(µ, σ²), where µ = 4, σ² = 1 and n = 1000; all functions multiplied by a suitable constant, here exp(10⁻⁴ L(·)) and 10⁻⁴ L(·)
Probabilistic and Statistical Models Approximation of likelihood function
Definition (relative likelihood and log-likelihood function) The relative likelihood function is defined as L(θ|x)/L(θ̂|x) and the relative log-likelihood function as ln[L(θ|x)/L(θ̂|x)] = l(θ|x) − l(θ̂|x).
◮ It is often useful when the likelihood function can be approximated by a quadratic function.
◮ In addition to the location of the maximum of the likelihood function, we then also need the curvature around the maximum.
◮ Since the log-likelihood is more convenient to work with, we need a quadratic approximation of the log-likelihood function.
Probabilistic and Statistical Models Approximation of likelihood function
Definition (Taylor polynomial of order r) If a function g(x) has derivatives of order r, that is, g^(r)(x) = ∂^r g(x)/∂x^r exists, then for any constant a, the Taylor polynomial of order r about a is T_r(x) = Σ_{j=0}^{r} [g^(j)(a)/j!] (x − a)^j.
In practical statistical situations we assume that the remainder g(x) − Tr (x) converges to zero as n increases, therefore we are going to ignore it. There are many explicit forms, one of the most useful is g(x) − Tr (x) = x a g(r+1) (t) r! (x − t)r dt. If g(r) (a) = ∂r ∂xr g(x)|x=a exists, then lim x→a g(x) − Tr (x) (x − a)r = 0. Probabilistic and Statistical Models Approximation of likelihood function The quadratic approximation of log-likelihood function about θ defined as l(θ|x) ≈ l(θ|x) + S(θ)(θ − θ) − 1 2 I(θ)(θ − θ)2 , The quadratic approximation of relative likelihood function about θ is defined as ln L(θ|x) = ln L(θ|x) L(θ|x) = l(θ|x) − l(θ|x) ≈ − 1 2 I(θ)(θ − θ)2 . It is often useful to visualise a derivative of the quadratic approximation S(θ) ≈ −I(θ)(θ − θ) or −I−1/2 (θ)S(θ) ≈ I1/2 (θ)(θ − θ), where −I−1/2 (θ)S(θ) is visualised against I1/2 (θ)(θ − θ). If the quadratic approximation is correct, this should be a line with slope equal to one. Probabilistic and Statistical Models Approximation of likelihood function 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (a) x=8, N=10 approx log−like log−like −2 −1 0 1 2 05101520 scaled θ scaledscorefunction (a) linearity of score function approx log−like log−like 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (b) x=80, N=100 −2 −1 0 1 2 0102030405060 scaled θ scaledscorefunction (b) linearity of score function 0.5 0.6 0.7 0.8 0.9 1.0 −4−3−2−10 θ scaledl(θ|x) (c) x=800, N=1000 −2 −1 0 1 2 050100150200 scaled θ skalovanaskorefunkcia (c) linearity of score function Figure: Relative binomial log-likelihood, its quadratic approximation and linearity of score function Probabilistic and Statistical Models Numerical maximisation of likelihood function Isaac Newton (1643−1727) and Joseph Raphson (1648−1715). Definition (Newton-Raphson method) Having quadratic approximation of log-likelihood function about θ0 l(θ|x) ≈ l(θ0|x) + S(θ0)(θ − θ0) − 1 2 I(θ0)(θ − θ0)2 or linear approximation of score function about θ0 S(θ) ≈ S(θ0) − I(θ0)(θ − θ0), the numerical maximisation can be done via iterative function θ0 + S(θ0) I(θ0) . Probabilistic and Statistical Models Numerical maximisation of likelihood function The iterative process is defined as follows: 1. initialisation step – starting point θ(0) , where I(θ(0) ) = 0, 2. updating equations – iteration of θ(i) = θ(i−1) + S(θ(i−1) ) I(θ(i−1)) , where I(θ(i−1) ) = 0, pre i = 1, 2, . . . 3. stopping rule based on absolute convergence criteria – until |l(θ(i) |x) − l(θ(i−1) |x)| < ǫ, where the threshold ǫ is sufficiently small Geometrical interpretation: θ(i) is a crossing point of tangent of score function S(·) in the point [θ(i−1) , S(θ(i−1) )] with x-axis. In : ◮ optimize(f,interval,maximum= FALSE, tol,...) ◮ Newton-Raphson method is combined here with golden section method and successive parabolic interpolation to speed up the convergence. Probabilistic and Statistical Models Numerical maximisation of likelihood function Definition (multivariate Newton-Raphson method) Having quadratic approximation of log-likelihood function about θ0 l(θ|x) ≈ l(θ0|x) + S(θ0)(θ − θ0) − 1 2 (θ − θ0)T I(θ0)(θ − θ0) or linear approximation of score function about θ0 S(θ) ≈ S(θ0) − I(θ0)(θ − θ0). the numerical maximisation can be done via iterative function θ0 + (I(θ0))−1 S(θ0). Probabilistic and Statistical Models Numerical maximisation of likelihood function The iterative process is defined as follows: 1. initialisation step – starting point θ(0) , where I(θ(0) ) = 0, 2. 
updating equations – iteration of θ(i) = θ(i−1) + (I(θ(i−1) ))−1 S(θ(i−1) ), where I(θ(i−1) ) = 0, pre i = 1, 2, . . . 3. stopping rule based on absolute convergence criteria – until |l(θ(i) |x) − l(θ(i−1 |x)| < ǫ, where the threshold ǫ is sufficiently small In : ◮ optim(par,fn,gr,method,control,hessian =FALSE,...) ◮ Newton-Raphson method is often modified – Fisher scoring method, quasi Newton method, Broyden-FletcherGoldfarb-Shannon (BFGS) method Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood Nelder-Mead method (method of simplexes) – the idea of ”jumps” across triangles defined by the points θ (i−1) 1 , θ (i−1) 2 , θ (i−1) 3 , where l(θ (i−1) 1 |x) < l(θ (i−1) 2 |x) < l(θ (i−1) 3 |x). We are substituting point θ (i−1) 1 with a ”better” point θ (i) 1 , where l(θ (i) 1 |x) < l(θ (i−1) 1 |x). Then new point is defined based on reflection (point symmetry), contraction or extrapolation (expansion) as 1. reflection – z1 = θ (i) 1 = θ (i−1) 23 + 1 θ (i−1) 23 − θ (i−1) 1 , 2. reflection & expansion – z2 = θ (i) 1 = θ (i−1) 23 + 2 θ (i−1) 23 − θ (i−1) 1 , 3. contraction A – z3 = θ (i) 1 = θ (i−1) 23 + 1 2 θ (i−1) 23 − θ (i−1) 1 , 4. contraction B – z4 = θ (i) 2 = θ (i−1) 1 + 1 2 θ (i−1) 2 − θ (i−1) 1 and z5 = θ (i) 3 = θ (i−1) 1 + 1 2 θ (i−1) 3 − θ (i−1) 1 where θ (i−1) 23 = θ(i−1) 2 +θ(i−1) 3 2 , i.e. the mid-point of the line defined by the points θ (i−1) 2 and θ(i−1) . If l(θ (i) 1 |x) < l(θ (i−1) 1 |x) then new triangle is defined with θ (i) 1 , θ (i−1) 2 , θ (i−1) 3 for (1) to (3). Otherwise new triangle is θ (i−1) 1 , θ (i) 2 , θ (i) 3 . Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 x y x y function Figure: Demonstration of Nelder-Mead method or minimising the function ((x − y)2 + (x − 2)2 + (y − 3)4 )/10, number of iterations is 49 Probabilistic and Statistical Models Numerical maximisation of likelihood ≈ minimising negative log-likelihood x y 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 −2 0 2 4 0246 1 2 3 4 5678910111213141516171819202122232425262728293031323334353637383940414243444546474849 x y 1 2 3 4 5 5 6 6 7 7 8 8 9 9 910 10 −2 0 2 4 0246 ● Figure: Demonstration of Nelder-Mead method or minimising the function ((x − y)2 + (x − 2)2 + (y − 3)4 )/10, number of iterations is 49
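The test function from the last two figures can be minimised directly with optim(), whose default method is Nelder-Mead; a small sketch (the starting point below is arbitrary, so the number of evaluations need not match the 49 iterations quoted in the figure):
# Sketch: Nelder-Mead minimisation of ((x - y)^2 + (x - 2)^2 + (y - 3)^4)/10
f <- function(z) ((z[1] - z[2])^2 + (z[1] - 2)^2 + (z[2] - 3)^4) / 10
fit <- optim(par = c(0, 0), fn = f, method = "Nelder-Mead")
fit$par        # minimiser, approximately (2.25, 2.5)
fit$value      # minimum value of the function
fit$counts     # number of function evaluations used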