Statistical Inference

Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno
1.10.2021

Outline
- Maximum likelihood
- Bootstrap

Maximum likelihood

Statistical inference deals with the problem of quantifying uncertainty. By uncertainty we mean statistical uncertainty, not model uncertainty. Given that our sample size is limited: how sure or unsure are we about our parameter estimate?

Example 1 - Tossing a coin

We observe the following 500 tosses:
0000010000100100000001000010010100···0001000010000
97 heads, 403 tails.
These are independent flips of a single coin with a fixed probability p of showing heads.
$\Pr(X = 97) = \binom{500}{97} p^{97} (1-p)^{403}$
Is the coin fair? If p = 0.5, we would see 97 heads with probability $9.31491 \cdot 10^{-46}$ (strictly mathematically speaking: not a whole lot).

Example 1 - Tossing a coin

What value of p is the most likely? Find the one that makes Pr(X = 97) as large as possible.

Example 2 - Challenger Disaster

For each flight i, each of its six O-rings either fails or does not:
$O_{ij} \sim \mathrm{Bern}(p_i)$, $O_{ij} \perp O_{ik}$
$F_i = \sum_{j=1}^{6} O_{ij} \sim \mathrm{Bin}(6, p_i)$
$g(p_i) = \beta_0 + \beta_1 \,\mathrm{temp}_i$

Example 3 - Waiting time

We observe inter-arrival times of insurance claims (in days):
2.07 5.06 6.51 1.75 13.95 2.55 ... 18.03 1.92 1.03
(100 observations)
These may be exponentially distributed. What parameter value would fit the data best?

Notation

X ... random variable
$X_1, \ldots, X_n$ ... i.i.d. from a parametric distribution $f(x|\theta)$
$\theta \in \Theta$ ... unknown parameter to be estimated. The true value is denoted $\theta_0$.

Example: $X \sim \mathrm{Exp}(\lambda)$ with $f(x|\lambda) = \exp(-x/\lambda)/\lambda$,
$\lambda \in [0,\infty)$ ... unknown parameter to be estimated. The true value is denoted $\lambda_0$.

Likelihood function: $L_n(\theta) \equiv f(X_1|\theta) \cdots f(X_n|\theta) = \prod_i f(X_i|\theta)$
- unlike the density f, it is a function of the parameter $\theta$ with the data kept fixed
- the i.i.d. assumption is crucial

Example: $L_n(\lambda) = \prod_i \frac{1}{\lambda} \exp\!\left(-\frac{X_i}{\lambda}\right) = \frac{1}{\lambda^n} \exp\!\left(-\frac{n \bar{X}_n}{\lambda}\right)$

Maximum likelihood estimator: $\hat{\theta} \equiv \arg\max_\theta L_n(\theta)$
- what parameter value can rationalise the given data best?
- the estimator is a random variable, because the data are random
- has some favourable statistical properties
- can be computed analytically or numerically

Example: We need to solve the F.O.C.:
$0 = \frac{\partial}{\partial\lambda} L_n(\lambda) = -n \frac{1}{\lambda^{n+1}} \exp\!\left(-\frac{n\bar{X}_n}{\lambda}\right) + \frac{1}{\lambda^n} \exp\!\left(-\frac{n\bar{X}_n}{\lambda}\right) \frac{n\bar{X}_n}{\lambda^2}$
which gives $\hat{\lambda} = \bar{X}_n$.

Log-likelihood function: $\ell_n(\theta) \equiv \log L_n(\theta) = \sum_i \log f(X_i|\theta)$
- numerically more stable
- $\arg\max_\theta \ell_n(\theta) = \arg\max_\theta L_n(\theta)$

Example: $\ell_n(\lambda) = \sum_i \log f(X_i|\lambda) = \sum_i \left(-\log\lambda - \frac{X_i}{\lambda}\right) = -n\log\lambda - \frac{n\bar{X}_n}{\lambda}$

Expected log density: $\ell(\theta) \equiv E[\log f(X|\theta)]$
- under correct specification we have the likelihood analog principle: $\theta_0 = \arg\max_\theta \ell(\theta)$

Example: $\ell(\lambda) = E[\log f(X|\lambda)] = E[-\log\lambda - X/\lambda] = -\log\lambda - \frac{E[X]}{\lambda} = -\log\lambda - \frac{\lambda_0}{\lambda}$
The F.O.C. gives $0 = -\frac{1}{\lambda} + \frac{\lambda_0}{\lambda^2}$, which has the unique solution $\lambda = \lambda_0$.

Score function: $S_n(\theta) \equiv \frac{\partial}{\partial\theta} \ell_n(\theta) = \sum_i \frac{\partial}{\partial\theta} \log f(X_i|\theta)$
- how sensitive the likelihood is to $\theta$
- for an interior solution we have $S_n(\hat{\theta}) = 0$

Example: $S_n(\lambda) = \frac{\partial}{\partial\lambda}\left(-n\log\lambda - \frac{n\bar{X}_n}{\lambda}\right) = -\frac{n}{\lambda} + \frac{n\bar{X}_n}{\lambda^2}$

Likelihood Hessian: $H_n(\theta) \equiv -\frac{\partial^2}{\partial\theta\,\partial\theta^T} \ell_n(\theta) = -\sum_i \frac{\partial^2}{\partial\theta\,\partial\theta^T} \log f(X_i|\theta)$
- tells us how curved the log-likelihood is

Example: $H_n(\lambda) = -\frac{\partial^2}{\partial\lambda^2} \ell_n(\lambda) = -\frac{\partial}{\partial\lambda} S_n(\lambda) = -\frac{n}{\lambda^2} + \frac{2n\bar{X}_n}{\lambda^3}$

Efficient score: $S \equiv \frac{\partial}{\partial\theta} \log f(X|\theta_0)$
- derivative of the log-likelihood of a single observation
- a mean-zero random vector:
$E[S] = E\left[\frac{\partial}{\partial\theta} \log f(X|\theta_0)\right] = \frac{\partial}{\partial\theta} E[\log f(X|\theta_0)] = \frac{\partial}{\partial\theta} \ell(\theta_0) = 0$

Example: $S = \frac{\partial}{\partial\lambda} \log f(X|\lambda_0) = -\frac{1}{\lambda_0} + \frac{X}{\lambda_0^2}$ and
$E[S] = -\frac{1}{\lambda_0} + \frac{E[X]}{\lambda_0^2} = -\frac{1}{\lambda_0} + \frac{\lambda_0}{\lambda_0^2} = 0.$
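Before moving on to the Fisher information, here is a quick numerical check of Example 1 (the coin tosses). This is an illustrative Python sketch, not part of the original slides; only the counts of 97 heads out of 500 tosses come from the example.

```python
# A quick numerical check of Example 1: evaluate the binomial likelihood of
# 97 heads in 500 tosses and find the value of p that maximizes it.
from math import comb, log, exp

n, heads = 500, 97
log_binom = log(comb(n, heads))   # log of the binomial coefficient C(500, 97)

def log_lik(p):
    # log Pr(X = 97 | p) = log C(500, 97) + 97*log(p) + 403*log(1 - p)
    return log_binom + heads * log(p) + (n - heads) * log(1 - p)

# Under a fair coin the chance of exactly 97 heads is about 9.3e-46.
print(exp(log_lik(0.5)))

# Crude grid search for the maximizer; the analytic answer is 97/500 = 0.194.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_lik)
print(p_hat)
```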
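The running exponential example can be checked numerically in the same spirit: the log-likelihood $\ell_n(\lambda) = -n\log\lambda - n\bar{X}_n/\lambda$ is maximized at the sample mean, and the score vanishes there. A minimal sketch, with simulated data standing in for the 100 observed waiting times (the "true" value $\lambda_0 = 5$ and the random seed are assumptions of the demo, not from the slides):

```python
# Minimal sketch: numerical MLE for the exponential waiting-time example.
import numpy as np

rng = np.random.default_rng(0)
lam0 = 5.0                               # assumed "true" mean waiting time
x = rng.exponential(lam0, size=100)      # stand-in for the observed data

def loglik(lam):
    # l_n(lambda) = -n*log(lambda) - n*mean(x)/lambda
    return -len(x) * np.log(lam) - x.sum() / lam

def score(lam):
    # S_n(lambda) = -n/lambda + n*mean(x)/lambda**2
    return -len(x) / lam + x.sum() / lam**2

lam_grid = np.linspace(0.1, 30, 10000)          # crude grid search
lam_hat = lam_grid[np.argmax(loglik(lam_grid))]

print(x.mean())         # analytic MLE: the sample mean
print(lam_hat)          # numerical maximizer, essentially the same value
print(score(x.mean()))  # score evaluated at the MLE is (numerically) zero
```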
Fisher information: $J_\theta \equiv E[SS^T]$
- the variance of the efficient score S

Example: $J_\lambda = E[S^2] = \mathrm{Var}[S]$ (since $E[S]=0$) $= \mathrm{Var}\!\left[-\frac{1}{\lambda_0} + \frac{X}{\lambda_0^2}\right] = \frac{1}{\lambda_0^4}\mathrm{Var}[X] = \frac{1}{\lambda_0^2}$

Expected Hessian: $H_\theta \equiv -\frac{\partial^2}{\partial\theta\,\partial\theta^T} \ell(\theta_0)$
- under regularity conditions $H_\theta = -E\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T} \log f(X|\theta_0)\right]$

Example: $H_\lambda = -\frac{\partial^2}{\partial\lambda^2}\left(-\log\lambda - \frac{\lambda_0}{\lambda}\right)\Big|_{\lambda=\lambda_0} = \frac{1}{\lambda_0^2}$

Under correct specification of $f(x|\theta)$ (there exists some $\theta_0 \in \Theta$ such that $f(x|\theta_0) = f(x)$), we have the Information Matrix Equality: $J_\theta = H_\theta$.

Example: $J_\lambda = \frac{1}{\lambda_0^2} = H_\lambda$

MLE has some interesting properties:
- invariant to transformations
- asymptotically efficient in the class of unbiased estimators (even for transformations)
- consistent
- asymptotically normal

MLE is invariant to transformations:
$\hat{\theta}$ is the MLE of $\theta$ $\implies$ $\hat{\beta} = h(\hat{\theta})$ is the MLE of $\beta = h(\theta)$.

MLE asymptotically achieves the Cramér-Rao Lower Bound.
Under (i) correct specification, (ii) the support of X not depending on $\theta$, and (iii) $\theta_0$ lying in the interior of $\Theta$:
for any unbiased $\tilde{\theta}$ we have $\mathrm{Var}[\tilde{\theta}] \ge (nJ_\theta)^{-1}$.
For a transformation $\beta = h(\theta)$ (under some more regularity conditions) we get that for any unbiased estimator $\tilde{\beta}$ of $\beta$:
$\mathrm{Var}[\tilde{\beta}] \ge \frac{1}{n} H^T J_\theta^{-1} H$, where $H = \frac{\partial}{\partial\theta} h(\theta_0)^T$.

Average log-likelihood: $\bar{\ell}_n(\theta) \equiv \frac{1}{n} \ell_n(\theta) = \frac{1}{n} \sum_i \log f(X_i|\theta)$

MLE is consistent, $\hat{\theta} \to_P \theta_0$, under these conditions:
- $X_i$ are i.i.d.
- $|\log f(X|\theta)| \le G(X)$ with $E[G(X)] < \infty$
- $\log f(X|\theta)$ is continuous in $\theta$ with probability one
- $\Theta$ is compact
- $\forall \theta \ne \theta_0: \ell(\theta) < \ell(\theta_0)$ (so that the parameter $\theta$ is identified)

MLE is asymptotically normally distributed. Why? Taylor expansion around $\theta_0$:
$0 = \frac{\partial}{\partial\theta}\bar{\ell}_n(\hat{\theta}) \approx \frac{\partial}{\partial\theta}\bar{\ell}_n(\theta_0) + \frac{\partial^2}{\partial\theta\,\partial\theta^T}\bar{\ell}_n(\theta_0)(\hat{\theta} - \theta_0)$
$\sqrt{n}(\hat{\theta} - \theta_0) \approx \left(-\frac{\partial^2}{\partial\theta\,\partial\theta^T}\bar{\ell}_n(\theta_0)\right)^{-1} \sqrt{n}\,\frac{\partial}{\partial\theta}\bar{\ell}_n(\theta_0) \to_D N(0, H_\theta^{-1} J_\theta H_\theta^{-1}) = N(0, J_\theta^{-1})$,
since the first factor $\to_P H_\theta^{-1}$ and the second $\to_D N(0, J_\theta)$.

OLS is MLE under normal errors:
$y = X\beta + \varepsilon$; if we assume that $\varepsilon \sim N(0, \sigma^2 I)$, then $\hat{\beta}_{MLE} = (X^T X)^{-1} X^T y$ and $\hat{\sigma}^2 = \frac{1}{n} \hat{\varepsilon}^T \hat{\varepsilon}$.

Bootstrap

Example - Rolling a die (again)

Data is all we have:
- $\hat{F}_n \to F$
- we wish to understand the sampling variation, but we do not have F
- at least we have our data, i.e. $\hat{F}_n$
- use our $\hat{F}_n$ to simulate new "bootstrap" datasets

Bootstrap in understanding the sample variation

Suppose we are considering choosing between two different estimators $\tilde{\beta}$ and $\hat{\beta}$. These may possess different qualities. The question is: given that you have to pick only once, which one would you choose?

Assume we are in one of the following situations:
- small data sample $\implies$ asymptotic approximations are unreliable (Ex: n = 15 in linear regression)
- our estimator is complex and we cannot even derive an asymptotic approximation (Ex: the result of a numerical optimization)
- the asymptotic distribution depends on an unknown parameter (Ex: $X_1, X_2, \ldots, X_n \sim f(\cdot)$, sample median $\hat{m} \sim N\!\left(m, \frac{1}{4nf(m)^2}\right)$ asymptotically)
- the traditional estimator is based on dubious assumptions (Ex: stock returns may have fat tails)

*Example - Stamp thickness

Bootstrap - some remarks
- very general approach that makes few assumptions
- the bootstrapped distribution can be used to construct standard errors, confidence intervals, and bias corrections

*Bootstrap may fail

Paradox: we wish to use it in situations that are complex, but in those it may also be difficult to prove that it "works".
- It may fail if the parameter lies on the boundary of the parameter space (Ex: $X \sim N(\mu, 1)$ where $\mu \in [0,\infty)$; Andrews, 2000).
- It may fail if there is missing support information. Sample maximum: $F_0$ has support $[0,\theta_0]$, $\hat{\theta}_n = \max\{X_1,\ldots,X_n\}$, $\hat{T}_n = n(\hat{\theta}_n - \theta_0)$, $T^*_n = n(\hat{\theta}^*_n - \hat{\theta}_n)$. Then $P^*_n(T^*_n = 0) = 1 - (1 - 1/n)^n \to 1 - e^{-1}$, whereas $P(\hat{T}_n = 0) \to 0$.

*What if bootstrap fails?

Subsampling
- we draw smaller bootstrap samples without replacement
- intuition: we sample directly from the true distribution ($F_0$), not from the estimated one ($\hat{F}_n$)
- more general than the bootstrap
- less efficient than the regular bootstrap when the latter works
- practical problem: how to choose the subsample size?
Small code sketches of both the bootstrap and subsampling follow below.
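To make the resampling idea concrete, here is a minimal Python sketch of the nonparametric bootstrap, not taken from the slides: resample the data with replacement (i.e. draw from $\hat{F}_n$), recompute the statistic, and use the spread of the replications as an estimate of its sampling variation. The sample median is used because, as noted above, its asymptotic variance depends on the unknown density f(m); the simulated data, the seed and B = 2000 replications are illustrative assumptions.

```python
# Minimal sketch of the nonparametric bootstrap: standard error of the median.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(5.0, size=100)   # stand-in dataset
B = 2000                             # number of bootstrap replications

boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=len(x), replace=True)  # draw from F_hat_n
    boot_medians[b] = np.median(resample)

se_boot = boot_medians.std(ddof=1)             # bootstrap standard error
ci = np.percentile(boot_medians, [2.5, 97.5])  # simple percentile 95% CI
print(np.median(x), se_boot, ci)
```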
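And a matching sketch of subsampling, assuming the statistic converges at the usual $\sqrt{n}$ rate: draw many subsamples of size b < n without replacement, recompute the statistic, and rescale. The subsample size b = 30, the number of repetitions and the simulated data are again illustrative choices, not from the slides.

```python
# Minimal sketch of subsampling for a root-n consistent statistic (the mean).
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(5.0, size=200)   # stand-in dataset
n, b, S = len(x), 30, 2000           # sample size, subsample size, repetitions

theta_hat = np.mean(x)               # statistic of interest
reps = np.empty(S)
for s in range(S):
    sub = rng.choice(x, size=b, replace=False)     # subsample, no replacement
    reps[s] = np.sqrt(b) * (np.mean(sub) - theta_hat)

# Approximate standard error of theta_hat recovered from the subsampling draws,
# compared with the usual textbook standard error of the mean.
se_sub = reps.std(ddof=1) / np.sqrt(n)
print(theta_hat, se_sub, x.std(ddof=1) / np.sqrt(n))
```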
Thank you for your attention!

References

Very non-technical explanation of MLE in economics: Lanot, Gauthier. "Maximum likelihood and economic modeling." IZA World of Labor 326 (2017).

MLE is explained in chapter 10 of Hansen's Probability: https://www.ssc.wisc.edu/~bhansen/probability/

Appendix A in Faraway (2016) provides reasonable basics: Faraway, Julian J. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press, 2016.

A book-length treatment of the bootstrap by its inventors (47,000 Google Scholar citations): Efron, Bradley, and Robert J. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.

Bootstrap animations: https://www.stat.auckland.ac.nz/~wild/BootAnim/

A very short and succinct explanation of the bootstrap and subsampling in blog posts by Larry Wasserman: https://normaldeviate.wordpress.com/2013/01/19/bootstrapping-and-subsampling-part-i/ and https://normaldeviate.wordpress.com/2013/01/27/bootstrapping-and-subsampling-part-ii/

*A rigorous theory of the bootstrap is in chapter 23 of Van der Vaart, Aad W. Asymptotic Statistics. Vol. 3. Cambridge University Press, 2000.

*Andrews, Donald W. K. "Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space." Econometrica (2000): 399-405.