Theory of Point Estimation

Jana Jurečková
Charles University in Prague
2010

Contents

1 Basic concepts
  1.1 Loss function and risk
  1.2 Convex loss function
  1.3 Estimation of vector function
2 Unbiased estimates
  2.1 Uniformly best unbiased estimate
    2.1.1 How to find the best unbiased estimate
3 Equivariant estimators
  3.1 Estimation of the shift parameter
    3.1.1 The form of the Pitman (MRE) estimator
  3.2 Relation of equivariance and unbiasedness
4 Asymptotic behavior of estimates
  4.1 Consistency
  4.2 Efficiency
    4.2.1 Shift parameter
    4.2.2 Multiple root

Chapter 1  Basic concepts

1.1 Loss function and risk

Let X be an observable random vector with values in a sample space 𝒳, and let P = {P_θ, θ ∈ Θ} be a system of probability distributions indexed by an unobservable parameter θ. We want to estimate a function g(θ) : Θ → R^1. The observed values x of X constitute the data. An estimator of g(θ) is a function T : 𝒳 → R^1.

The loss incurred when we estimate g(θ) by t is measured by the loss function L(θ, t), which should satisfy
L(θ, t) ≥ 0 ∀θ, t;  L(θ, g(θ)) = 0 ∀θ.
The quality of an estimator T is measured by the risk function
R(θ, T) = IE_θ L(θ, T(X)).

We would like to find the uniformly best estimator T, i.e. the one minimizing R(θ, T) over T uniformly in θ ∈ Θ. Such an estimator exists only in special cases; if it does not exist, we minimize the risk only over a subclass of estimators, e.g.

• unbiased estimators: bias = IE_θ T(X) − g(θ) = 0;
• median unbiased estimators: P_θ(T(X) < g(θ)) = P_θ(T(X) > g(θ));
• if X ∼ F(x − θ) (shift parameter) and L(θ, t) = L(|θ − t|), the equivariant estimators satisfying T(X_1 + c, ..., X_n + c) = T(X) + c.

Other possibilities:

• Instead of minimizing the risk uniformly over θ ∈ Θ, we can minimize
$$\int_\Theta R(\theta, T)\, w(\theta)\, d\theta = \min$$
over T with respect to a weight function w. Such an estimator is called a formal Bayes estimator with the (generalized) prior density w(θ).
• Or we minimize sup_{θ∈Θ} R(θ, T) (the minimax estimator).

1.2 Convex loss function

A function φ is convex if φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y) for all x, y and 0 < λ < 1, and strictly convex if φ(λx + (1 − λ)y) < λφ(x) + (1 − λ)φ(y) whenever x ≠ y. If φ is convex on I = (a, b) and t_0 ∈ I is fixed, then there exists a straight line y = L(x) = c(x − t_0) + φ(t_0), passing through the point [t_0, φ(t_0)], such that L(x) ≤ φ(x) ∀x ∈ I.

Theorem 1.2.1 (Jensen inequality). If φ is convex on an open interval I and the random variable X satisfies P(X ∈ I) = 1 and |IE X| < ∞, then
φ(IE X) ≤ IE φ(X).   (1.2.1)
If φ is strictly convex and X is not constant with probability 1, then (1.2.1) holds as a sharp inequality.

Proof. Put t_0 = IE X and let L(x) be the straight line through [t_0, φ(t_0)] satisfying L(x) ≤ φ(x). Then
IE φ(X) ≥ IE L(X) = IE[c(X − IE X)] + φ(IE X) = φ(IE X).
If φ is strictly convex, then the line touches φ only at t_0 and L(x) < φ(x) elsewhere, so the inequality is sharp unless X = t_0 with probability 1. □
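The Jensen inequality is easy to check empirically. The following short simulation is not part of the original notes; the choice of φ(x) = x² and of an exponential sample is only an illustration of comparing φ(IE X) with IE φ(X) by Monte Carlo.

```python
import numpy as np

# Monte Carlo illustration of Jensen's inequality: phi(E X) <= E phi(X)
# for a convex phi.  The convex function and the distribution of X are
# arbitrary choices made for this sketch.
rng = np.random.default_rng(0)

phi = lambda x: x**2                          # convex (in fact strictly convex)
X = rng.exponential(scale=2.0, size=10**6)    # any distribution with E|X| < infinity

lhs = phi(X.mean())          # phi(E X), estimated by phi(sample mean)
rhs = phi(X).mean()          # E phi(X), estimated by the sample mean of phi(X)

print(f"phi(E X) ~ {lhs:.3f},  E phi(X) ~ {rhs:.3f}")
# For a non-degenerate X and a strictly convex phi the second value is strictly
# larger, in agreement with Theorem 1.2.1.
```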
Definition 1.2.1 The statistic S : 𝒳 → 𝒮 is called sufficient for the system P if there is a version of the conditional distribution P_θ(X ∈ A | S = s) = λ_s(A) independent of θ.

Theorem 1.2.2 (Rao–Blackwell). Let X be an observable random vector with distribution P_θ ∈ P = {P_ϑ : ϑ ∈ Θ} and let S be a sufficient statistic for P. Consider the estimation problem with a strictly convex loss function L(θ, t). Let T be an estimator of g(θ) with finite expectation and finite risk, i.e. R(θ, T) = IE_θ L(θ, T(X)) < ∞ ∀θ. Denote T*(s) = IE{T(X) | S(X) = s}. Then T*(S(X)) is an estimator satisfying
R(θ, T*) < R(θ, T),
unless T(X) = T*(S(X)) with probability 1.

Proof. Because S is sufficient, T*(S) does not depend on θ, and thus it is an estimator. Put φ(t) = L(θ, t). By the Jensen inequality applied to the conditional distribution given S = s,
φ(T*(s)) = L(θ, T*(s)) = L[θ, IE(T | S = s)] = φ(IE(T | S = s)) < IE(φ(T) | S = s) = IE[L(θ, T(X)) | S = s],
unless T(X) = T*(S(X)) with probability 1; hence
R(θ, T*) = IE_θ L[θ, T*(S(X))] < IE_θ L(θ, T(X)). □

Remark 1.2.1 If L(θ, t) is convex, but not strictly convex, then Theorem 1.2.2 holds with a non-sharp inequality.

Definition 1.2.2 (Admissibility). The estimator T is called inadmissible if there exists another estimator T′ dominating T, i.e. such that
R(θ, T′) ≤ R(θ, T) ∀θ, with a sharp inequality for at least one θ.   (1.2.2)
The estimator T is called admissible with respect to the loss L(θ, t) if there is no estimator T′ satisfying (1.2.2).

If L(θ, t) is strictly convex and an admissible estimator exists, then it is uniquely determined. More precisely:

Theorem 1.2.3 Let L(θ, t) be strictly convex and let T be an admissible estimator of g(θ). If T′ is another estimator with the same risk as T, i.e. R(θ, T) = R(θ, T′) ∀θ, then T(X) = T′(X) with probability 1.

Proof. Put T* = ½(T + T′). Then R(θ, T*) < ½[R(θ, T) + R(θ, T′)] = R(θ, T) ∀θ, unless T = T′ with probability 1. But this contradicts the admissibility of T. □

1.3 Estimation of vector function

The situation is analogous for the estimation of a vector function g(θ) = (g_1(θ), ..., g_k(θ)). Its estimator T(X) is then also a k-dimensional vector. A function φ : E → R^1, where E is a convex set, is called convex if
φ(λx_1 + (1 − λ)x_2) ≤ λφ(x_1) + (1 − λ)φ(x_2) ∀x_1, x_2 ∈ E and 0 < λ < 1.
If φ is twice differentiable on E, then φ is convex [strictly convex] iff the Hessian matrix
$$\mathbf H = \left(\frac{\partial^2 \varphi(x_1,\dots,x_k)}{\partial x_i \partial x_j}\right)_{i,j=1,\dots,k}$$
is positive semidefinite [positive definite]. If φ is convex on an open convex set E ⊂ R^k, then to every fixed point t^0 there exists a hyperplane
$$y = L(\mathbf x) = \varphi(\mathbf t^0) + \sum_{i=1}^k c_i (x_i - t^0_i)$$
passing through the point (t^0, φ(t^0)) and satisfying L(x) ≤ φ(x) ∀x ∈ E. If X is a random vector such that P(X ∈ A) = 1 for an open convex set A and IE X exists, then IE X ∈ A.
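The variance reduction promised by the Rao–Blackwell theorem is easy to see numerically. The sketch below is an illustration added here (not an example from the notes): it starts from the crude unbiased estimator T = X_1 of p in the Bernoulli model and conditions on the sufficient statistic S = ΣX_i, which gives T* = S/n.

```python
import numpy as np

# Rao-Blackwellization in the Bernoulli(p) model:
#   T  = X_1                 (unbiased for p, but crude)
#   S  = sum(X_i)            (sufficient statistic)
#   T* = E[X_1 | S] = S/n    (the conditioned estimator)
# Both are unbiased; Theorem 1.2.2 says T* has strictly smaller risk.
rng = np.random.default_rng(1)
p, n, reps = 0.3, 20, 100_000

X = rng.binomial(1, p, size=(reps, n))
T = X[:, 0]                  # first observation only
T_star = X.sum(axis=1) / n   # conditional expectation of X_1 given S

for name, est in [("T = X_1", T), ("T* = S/n", T_star)]:
    print(f"{name:10s} mean = {est.mean():.4f}  MSE = {((est - p)**2).mean():.5f}")
# The means agree with p (unbiasedness); the MSE of T* is roughly p(1-p)/n,
# much smaller than p(1-p) for T.
```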
Chapter 2  Unbiased estimates

2.1 Uniformly best unbiased estimate

T(X) is an unbiased estimate of g(θ) if IE_θ(T(X)) = g(θ) ∀θ ∈ Θ.

Example: Unbiased estimates need not exist. Let X have the binomial distribution B(n, p) and let g(p) = 1/p. If T were an unbiased estimate of g(p), then
$$\sum_{i=0}^{n} T(i)\binom{n}{i} p^i (1-p)^{n-i} = \frac1p \qquad \forall p \in (0,1). \tag{2.1.1}$$
But if p ↓ 0, then the left-hand side of (2.1.1) tends to T(0), while the right-hand side tends to ∞, which contradicts unbiasedness.

The function g(θ) is called estimable if there exists at least one unbiased estimate of g(θ).

Lemma 2.1.1 (Structure of the class of unbiased estimates). If T_0 is an unbiased estimate of g(θ), then every unbiased estimate T of g(θ) can be written in the form T = T_0 − U, where U is an unbiased estimate of zero, i.e. such that IE_θ U = 0 ∀θ ∈ Θ.

Proof. If T_0 is unbiased, then T_0 − U is unbiased for every such U. Conversely, if T is any unbiased estimate, then U = T_0 − T is an unbiased estimate of zero and T = T_0 − U. □

Specifically, consider the quadratic loss L(θ, t) = (t − g(θ))². Then the risk of an unbiased estimate T is its variance:
R(θ, T) = IE_θ(T − g(θ))² = var_θ T(X).
If T_0 minimizes var_θ T(X) for every θ among all unbiased estimates of g(θ), then it is called the best minimum variance estimate (BMVE) of g(θ).

Denote by Δ the set of all unbiased estimates of g(θ) satisfying IE_θ T² < ∞ ∀θ ∈ Θ, and by 𝒰 the set of all unbiased estimates of 0 which belong to Δ.

Theorem 2.1.1 Let X have distribution P_θ, θ ∈ Θ, and let T ∈ Δ. Then T is the BMVE of its expected value g(θ) if and only if
IE_θ[T(X) · U(X)] = 0 ∀U ∈ 𝒰 and ∀θ ∈ Θ.

Proof. (i) Necessity: Let T be the BMVE, IE_θ T = g(θ), and let U ∈ 𝒰, U ≢ 0. Put T′ = T + λU, λ ∈ R^1. Then IE_θ T′(X) = IE_θ T(X) = g(θ) ∀θ, hence var_θ T′(X) ≥ var_θ T(X) for every λ, i.e. IE_θ(T′)² ≥ IE_θ T²; thus
IE_θ T² + λ² IE_θ U² + 2λ IE_θ(T·U) ≥ IE_θ T², i.e.
λ² IE_θ U² + 2λ IE_θ(T·U) ≥ 0 ∀λ.   (2.1.2)
The roots of the quadratic equation λ² IE_θ U² + 2λ IE_θ(T·U) = 0 are λ = 0 and λ = −2 cov_θ(T, U)/var_θ U (note that IE_θ(T·U) = cov_θ(T, U) and IE_θ U² = var_θ U, because IE_θ U = 0). Unless cov_θ(T, U) = 0, the quadratic is negative for λ between the two roots, contradicting (2.1.2); hence IE_θ(T·U) = 0.

(ii) Sufficiency: Let IE_θ(T·U) = 0 ∀U ∈ 𝒰 and let T′ be an unbiased estimate of g(θ). If var_θ T′ = ∞, then T′ cannot be better than T. Let var_θ T′ < ∞. Then var_θ(T − T′) < ∞ and T − T′ ∈ 𝒰, thus
IE_θ(T(T − T′)) = 0 ⇒ IE_θ T² = IE_θ(T·T′) ⇒ IE_θ T² − g²(θ) = IE_θ(T·T′) − g²(θ) ⇒ var_θ T = cov_θ(T, T′)
⇒ 0 ≤ var_θ(T − T′) = var_θ T + var_θ T′ − 2 cov_θ(T, T′) = var_θ T′ − var_θ T ⇒ var_θ T ≤ var_θ T′. □

Definition 2.1.1 The statistic S(X) is called complete for the system of distributions P = {P_θ, θ ∈ Θ} if, for any function h(S),
IE_θ h(S(X)) = 0 ∀θ ⇒ h(S(X)) = 0 a.s. [P_θ], ∀θ ∈ Θ.

Theorem 2.1.2 Let X follow a distribution P_θ ∈ P and let S be a complete and sufficient statistic for P. Then every estimable function g(θ) has exactly one unbiased estimate that is a function of S.

Proof. Let T be an unbiased estimate of g(θ). Then T*(S(X)) = IE(T(X)|S(X)) is an unbiased estimate which is a function of S. Let T_1(S) and T_2(S) be two unbiased estimates of g(θ) depending only on S. Then IE_θ(T_1 − T_2) = 0 ∀θ, and because S is complete, this implies T_1 − T_2 = 0 a.s. [P_θ], θ ∈ Θ. □

Theorem 2.1.3 (Lehmann–Scheffé). Let X follow a distribution P_θ ∈ P and let S be a complete and sufficient statistic for P. Then
(i) For every estimable function g(θ) and every loss function L(θ, t) convex in t, there exists an unbiased estimate T of g(θ) which uniformly minimizes the risk R(θ, T).
(ii) T is the only unbiased estimate which is a function of S. If L is strictly convex in t, then T is the only unbiased estimate with minimum risk.

Proof. The Rao–Blackwell theorem applies to S and a convex loss function. By Theorem 2.1.2 the estimate T*(S(X)) = IE(T(X)|S(X)) is unique, and because S is complete, it cannot be further improved. □

2.1.1 How to find the best unbiased estimate

Let S(X) be a complete and sufficient statistic.

Method 1: The best unbiased estimate of an estimable function g(θ) is any function T(S) such that IE_θ T(S) = g(θ) ∀θ ∈ Θ.

Method 2: Start with any unbiased estimate T(X). Then T′(X) = IE(T(X)|S) is the best unbiased estimate.
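As an added illustration of Method 2 (not an example from the notes), consider estimating g(λ) = e^{−λ} = P_λ(X_1 = 0) from a Poisson(λ) sample. Starting from the crude unbiased estimate T = 1{X_1 = 0} and conditioning on the complete sufficient statistic S = ΣX_i gives T′ = (1 − 1/n)^S, because X_1 given S = s is binomial(s, 1/n). The simulation below checks unbiasedness and compares variances.

```python
import numpy as np

# Method 2 in the Poisson(lam) model, target g(lam) = exp(-lam) = P(X_1 = 0):
#   T  = 1{X_1 = 0}                 (unbiased, very crude)
#   S  = sum(X_i)                   (complete sufficient statistic)
#   T' = E[T | S] = (1 - 1/n)**S    (since X_1 | S = s ~ Binomial(s, 1/n))
rng = np.random.default_rng(2)
lam, n, reps = 1.5, 10, 200_000
target = np.exp(-lam)

X = rng.poisson(lam, size=(reps, n))
T = (X[:, 0] == 0).astype(float)
T_best = (1 - 1/n) ** X.sum(axis=1)

for name, est in [("T = 1{X1=0}", T), ("T' = (1-1/n)^S", T_best)]:
    print(f"{name:16s} mean = {est.mean():.4f} (target {target:.4f}), "
          f"var = {est.var():.5f}")
# Both estimates are unbiased; by the Lehmann-Scheffe theorem T' is the
# (essentially unique) best unbiased estimate, and its variance is indeed smaller.
```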
Chapter 3  Equivariant estimators

3.1 Estimation of the shift parameter

Let X_1, ..., X_n be a sample from a distribution with distribution function F(x − θ) and density f(x − θ). The problem is to estimate θ ∈ R^1 with respect to the loss L(θ, t). Consider a loss which is invariant to the shift, i.e. L(θ, t) = L(θ + c, t + c) ∀c ∈ R^1. Then L(θ, t) = L(0, t − θ), hence the loss depends only on the difference t − θ. If the loss is invariant, the whole problem is invariant to the shift: if we estimate θ by T(X), then a natural estimate of θ + c is T(X) + c.

Definition 3.1.1 The estimator T(X) is called equivariant (with respect to the shift) if it satisfies
T(X_1 + c, ..., X_n + c) = T(X_1, ..., X_n) + c ∀c ∈ R^1 and ∀X ∈ R^n.

Lemma 3.1.1 The bias, risk and variance of an equivariant estimate T(X) do not depend on the value of θ, and hence are constant in θ.

Proof. If X_1 has d.f. F(x − θ), then P(X_1 − θ ≤ z) = P(X_1 ≤ z + θ) = F(z), thus X_1 − θ has distribution function F(·). Then
bias = b(θ) = IE_θ(T(X)) − θ = IE_θ(T(X_1 − θ, ..., X_n − θ)) = IE_0(T(X)) = b,
var_θ T(X) = IE_θ(T(X) − IE_θ T(X))² = IE_θ[T(X − θ) + θ − IE_θ T(X)]² = IE_θ(T(X − θ) − b)² = IE_0(T(X) − b)²,
R(T, θ) = IE_θ[L(T(X) − θ)] = IE_θ[L(T(X − θ))] = IE_0[L(T(X))] = R(T). □
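Lemma 3.1.1 can be seen in a quick simulation. The sketch below is an added illustration (the double-exponential shift family and the sample median are arbitrary choices): it estimates the quadratic risk of an equivariant estimator at several values of θ, and up to Monte Carlo error the estimated risks coincide.

```python
import numpy as np

# Lemma 3.1.1: the risk of an equivariant estimator in a shift family does not
# depend on theta.  Here the family is the double-exponential (Laplace) shift
# family and the equivariant estimator is the sample median.
rng = np.random.default_rng(3)
n, reps = 15, 100_000

for theta in (-2.0, 0.0, 5.0):
    X = theta + rng.laplace(size=(reps, n))   # sample from f(x - theta)
    T = np.median(X, axis=1)                  # equivariant estimator
    risk = ((T - theta) ** 2).mean()          # Monte Carlo estimate of R(T, theta)
    print(f"theta = {theta:5.1f}   estimated risk = {risk:.4f}")
# The three estimated risks agree up to simulation noise, as the lemma asserts.
```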
We shall look for an equivariant estimate with minimal risk (MRE), i.e. T* such that R(T*) < R(T) for any equivariant estimator T ≠ T*. First we investigate the structure of the class of equivariant estimators.

Lemma 3.1.2 Let T_0(X) be an equivariant estimate. Then the estimate T(X) is equivariant if and only if there exists a statistic U(X), invariant to the shift, i.e. satisfying
U(X_1 + c, ..., X_n + c) = U(X_1, ..., X_n) ∀c ∈ R^1, ∀X,   (3.1.1)
such that
T(X) = T_0(X) + U(X) ∀X.   (3.1.2)

Proof.
• Let T satisfy (3.1.1) and (3.1.2). Then T(X + c) = T_0(X + c) + U(X + c) = T_0(X) + c + U(X) = T(X) + c, thus T is equivariant.
• Let T be equivariant and let T_0 be any equivariant estimator. Put U(X) = T(X) − T_0(X). Then U is invariant and T = T_0 + U. □

Lemma 3.1.3 Let n ≥ 2. The function U(x) is invariant if and only if it depends only on the differences y_i = x_i − x_1, i = 2, ..., n. If n = 1, then the only invariant functions are the constants.

Proof. If n = 1, then U(x + c) ≡ U(x) iff U(x) is a constant. Let n ≥ 2 and U(x + c) ≡ U(x). Then
U(x_1, ..., x_n) = U(x_1 − x_1, x_2 − x_1, ..., x_n − x_1) = U(0, y_2, ..., y_n) = Ũ(y_2, ..., y_n). □

Corollary 3.1.1 Let T_0 be an equivariant estimate and n ≥ 2. Then the estimator T is equivariant if and only if there exists a function Ũ(Y_2, ..., Y_n) of Y = (Y_2, ..., Y_n) such that T(X) ≡ T_0(X) − Ũ(Y).

Remark 3.1.1 The differences Y_2 = X_2 − X_1, ..., Y_n = X_n − X_1 determine all differences X_i − X_j, i ≠ j. Instead of Y we may take e.g. X_1 − X̄, ..., X_n − X̄.

Definition 3.1.2 The statistic S(X) is called maximal invariant with respect to the shift if it is invariant and S(x) = S(x′) holds if and only if x′_i = x_i + c, i = 1, ..., n, for some c ∈ R^1.

We see that (Y_2, ..., Y_n) or (X_1 − X̄, ..., X_n − X̄) are maximal invariants. Maximal invariants are important because of the following property:

Lemma 3.1.4 The function U(x) is invariant if and only if it is a function of a maximal invariant.

Proof. If U is a function of S, i.e. U(x) = h(S(x)), then it is invariant. Conversely, let U be invariant and let S(x) = S(x′). Then x′ = x + c, hence U(x′) = U(x), so U depends on x only through S(x). □

Theorem 3.1.1 (Minimum risk estimate). Let T_0 be an equivariant estimate with finite risk. If for every value y of the differences there exists v*(y) which minimizes
IE_0{ L[T_0(X) − v(Y)] | Y = y }
over functions v of y, then the minimum risk estimate exists and equals
T*(X) = T_0(X) − v*(Y).

Proof. Let T(X) = T_0(X) − v(Y). Then
$$R(T,\theta) = \mathbb{E}_\theta[L(T_0(\mathbf X) - v(\mathbf Y) - \theta)] = \mathbb{E}_0\{L[T_0(\mathbf X) - v(\mathbf Y)]\} = \mathbb{E}_0\big\{\mathbb{E}_0\big[L(T_0(\mathbf X) - v(\mathbf Y))\,\big|\,\mathbf Y\big]\big\} = \int \mathbb{E}_0\big[L(T_0(\mathbf X) - v(\mathbf y))\,\big|\,\mathbf Y = \mathbf y\big]\, dP_0^{\mathbf Y}(\mathbf y)$$
should be minimized with respect to v(·). But this is minimized if the integrand is minimized for every y. □

Corollary 3.1.2
(a) If L(t − θ) = (t − θ)², then v*(y) = IE_0[T_0(X) | Y = y].
(b) If L(t − θ) = |t − θ|, then v*(y) is the median of the conditional distribution of T_0(X) given Y = y.

Example 3.1.1 Let X_1, ..., X_n be a sample from the normal distribution N(ξ, σ²) with σ known. Put T_0(X) = X̄. Then X̄ and Y = (X_2 − X_1, ..., X_n − X_1) are independent; hence, considering IE_0[L(X̄ − v(Y)) | Y = y], we conclude that v(y) = const, determined by the condition IE_0[L(X̄ − v)] = min. Thus, if L is a convex and even (symmetric) function, then v = 0 and X̄ is the MRE (minimum risk estimator).

Theorem 3.1.2 Let ℱ be the class of all distribution functions with Lebesgue densities f which have a fixed finite variance, say σ² = 1. Let X_1, ..., X_n be a sample from the distribution with density f(x − ξ), where ξ = IE X. Let r_n(f) be the risk of the MRE of ξ with respect to the quadratic loss function. Then r_n(f) is maximal over ℱ for the normal f.

Proof. If f is normal, then X̄ is the MRE and IE(X̄ − ξ)² = 1/n. Because 1/n is also the quadratic risk of X̄ for every f ∈ ℱ, the risk of the MRE is ≤ 1/n for every f ∈ ℱ, with equality for the normal density. □

Remark 3.1.2 It follows from Corollary 3.1.2 that the MRE satisfies T*(X) = X̄ − IE_0(X̄ | Y), hence T*(X) = X̄ ⟺ IE_0(X̄ | Y) = 0. But by the theorem of Kagan, Linnik and Rao (1967), IE_0(X̄ | Y) = 0 holds if and only if the distribution of X_1, ..., X_n is normal.

Example 3.1.2 (Exponential distribution). Let X_1, ..., X_n have the distribution function
F(x − θ) = 1 − exp{−(x − θ)} for x ≥ θ, and F(x − θ) = 0 for x < θ.
Put T_0(X) = X_(1), where X_(1) ≤ X_(2) ≤ ... ≤ X_(n) are the order statistics. Then
P(X_(1) > x) = ∏_{i=1}^n P(X_i > x) = exp{−n(x − θ)}, x ≥ θ,
hence the density of X_(1) is n exp{−n(x − θ)}, x ≥ θ. Because X_(1) and Y are independent, the invariant function v(Y) is a constant, as in Example 3.1.1. We look for v such that IE_0[L(X_(1) − v)] = min. If L(t − θ) = (t − θ)², then IE_0(X_(1) − v)² is minimal for
$$v = \mathbb{E}_0 X_{(1)} = n\int_0^\infty x\, e^{-nx}\, dx = \frac1n \int_0^\infty y\, e^{-y}\, dy = \frac1n,$$
and the MRE is T*(X) = X_(1) − 1/n.

3.1.1 The form of the Pitman (MRE) estimator

Let X_1, ..., X_n be a sample from a distribution with density f(x − θ). Then the Pitman (MRE) estimator with respect to the quadratic loss is T*(X) = T_0(X) − IE_0[T_0(X) | Y], where T_0 is an initial equivariant estimator with finite risk. T*(X) can also be written in the form
$$T^*(\mathbf X) = \frac{\int_{-\infty}^{\infty} t\, f(X_1 - t)\cdots f(X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t)\cdots f(X_n - t)\, dt}.$$

Proof. Put T_0(X) = X_1. We look for the conditional density of X_1 given Y = y under θ = 0. Make the substitution y_i = x_i − x_1, i = 2, ..., n, x_1 = x_1. Then the density of Y* = (X_1, Y_2, ..., Y_n) is
p(y*) = f(x_1, x_1 + y_2, ..., x_1 + y_n),
where f(z_1, ..., z_n) = f(z_1)⋯f(z_n) denotes the joint density under θ = 0, and the conditional density of X_1 given y = (y_2, ..., y_n) is
$$\frac{f(x_1, x_1 + y_2, \dots, x_1 + y_n)}{\int_{-\infty}^{\infty} f(u, u + y_2, \dots, u + y_n)\, du}.$$
Hence
$$\mathbb{E}_0(X_1 \mid \mathbf Y = \mathbf y) = \frac{\int_{-\infty}^{\infty} u\, f(u, u + y_2, \dots, u + y_n)\, du}{\int_{-\infty}^{\infty} f(u, u + y_2, \dots, u + y_n)\, du} = \frac{\int_{-\infty}^{\infty} (X_1 - t)\, f(X_1 - t, X_2 - t, \dots, X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t, \dots, X_n - t)\, dt},$$
where we substituted t = X_1 − u and used y_i = X_i − X_1, i = 2, ..., n. Then
$$T^*(\mathbf X) = X_1 - \mathbb{E}_0(X_1 \mid \mathbf Y = \mathbf y) = \frac{\int_{-\infty}^{\infty} t\, f(X_1 - t)\cdots f(X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t)\cdots f(X_n - t)\, dt}. \qquad\square$$

Example 3.1.3 Let X_1, ..., X_n be a sample from the uniform distribution R(θ − ½, θ + ½) and let L(t − θ) = (t − θ)². Then
f(x_1, ..., x_n) = 1 if θ − ½ ≤ X_(1) ≤ X_(n) ≤ θ + ½, and 0 otherwise.
Hence, for θ = 0,
f(x_1 − t, ..., x_n − t) = 1 if X_(n) − ½ ≤ t ≤ X_(1) + ½, and 0 otherwise.
Put T_0 = ½(X_(1) + X_(n)). Then
$$\int t\, f(x_1 - t, \dots, x_n - t)\, dt = \int_{X_{(n)} - \frac12}^{X_{(1)} + \frac12} t\, dt = \frac12\Big[\big(X_{(1)} + \tfrac12\big)^2 - \big(X_{(n)} - \tfrac12\big)^2\Big]$$
and
$$\int f(x_1 - t, \dots, x_n - t)\, dt = \int_{X_{(n)} - \frac12}^{X_{(1)} + \frac12} dt = 1 - (X_{(n)} - X_{(1)}).$$
Finally,
$$T^*(\mathbf X) = \frac{\frac12\big(X_{(1)} + X_{(n)}\big)\big(1 - (X_{(n)} - X_{(1)})\big)}{1 - (X_{(n)} - X_{(1)})} = \frac12\big(X_{(1)} + X_{(n)}\big).$$
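The Pitman formula lends itself to direct numerical evaluation. The following sketch is added here as an illustration: it approximates the two integrals on a grid for a given density f; with the standard normal density the result agrees with X̄, as Example 3.1.1 predicts.

```python
import numpy as np

def pitman_estimate(x, f, grid_half_width=20.0, grid_size=20001):
    """Numerically evaluate the Pitman (MRE) estimator
        T*(x) = int t prod_i f(x_i - t) dt / int prod_i f(x_i - t) dt
    on a regular grid of t-values; `f` is the density of one observation for theta = 0."""
    t = np.linspace(x.mean() - grid_half_width, x.mean() + grid_half_width, grid_size)
    # log-likelihood of the shifted sample at each grid point, exponentiated stably
    loglik = np.sum(np.log(f(x[:, None] - t[None, :])), axis=0)
    w = np.exp(loglik - loglik.max())
    return np.trapz(t * w, t) / np.trapz(w, t)

rng = np.random.default_rng(4)
x = rng.normal(loc=3.0, scale=1.0, size=10)

normal_density = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
print("Pitman estimate:", pitman_estimate(x, normal_density))
print("Sample mean:    ", x.mean())
# For the normal density the two numbers coincide (up to the grid error),
# in agreement with Example 3.1.1; for other densities they generally differ.
```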
3.2 Relation of equivariance and unbiasedness

Lemma 3.2.1 Let L(t − θ) = (t − θ)².
(i) If T(X) is equivariant and has constant bias IE_θ T(X) − θ = b, b ≠ 0, then T(X) − b is an equivariant and unbiased estimator whose risk is smaller than the risk of T(X).
(ii) If the MRE is uniquely determined, then it is unbiased.
(iii) If there exists a uniformly best unbiased estimate which is equivariant, then it is the MRE.

Proof. (i) Let T_1(X) = T(X) − b. Then T_1 is equivariant, IE_θ(T_1(X)) = θ + b − b = θ, and
IE_0(T_1(X))² = IE_0(T(X) − b)² = IE_0 T²(X) − b² < IE_0 T²(X).
(ii) Let T* be the MRE and let T be any equivariant estimate with finite risk. Then T*(X) = T(X) − IE_0(T | Y), hence IE_0 T* = IE_0 T − IE_0[IE_0(T | Y)] = 0, i.e. IE_θ T* = θ, so T* is unbiased.
(iii) Let T be the uniformly best unbiased estimate and also equivariant, and let T_1 be any equivariant estimate. Then IE_θ T_1 = θ + b and, by (i), T_1 − b is equivariant and unbiased with IE_0(T_1 − b)² ≤ IE_0 T_1². Because T is the best unbiased estimate and T_1 − b is unbiased, this implies
IE_θ(T − θ)² = IE_0 T² ≤ IE_0(T_1 − b)² ≤ IE_0 T_1²,
hence T minimizes the risk among equivariant estimators. □

Definition 3.2.1 An estimator T of g(θ) is called risk unbiased with respect to the loss L if
IE_θ L(θ, T(X)) ≤ IE_θ L(θ′, T(X)) ∀θ′ ≠ θ.

The following theorem shows that the MRE is risk unbiased:

Theorem 3.2.1 Let X_1, ..., X_n be a sample from a distribution with density f(x − θ). Then the MRE with respect to the loss L(θ, t) = L(t − θ) is risk unbiased.

Proof. Risk unbiasedness means that IE_θ L(T(X) − θ′) ≥ IE_θ L(T(X) − θ) ∀θ′ ≠ θ; in other words,
IE_0 L(T(X) − a) ≥ IE_0 L(T(X)) ∀a ≠ 0.
Let T* be the MRE. Then T*(X) = T_0(X) − v*(Y), where IE_0[L(T_0(X) − v*(Y)) | Y = y] = min. Then
IE_0[L(T*(X) − a)] = IE_0[L(T_0(X) − v*(Y) − a)] = IE_0{IE_0[L(T_0(X) − v*(Y) − a) | Y]} ≥ IE_0{IE_0[L(T_0(X) − v*(Y)) | Y]} = IE_0[L(T*(X))],
where we used the fact that v*(Y) + a is also an invariant function. □

Chapter 4  Asymptotic behavior of estimates

4.1 Consistency

Let X_1, ..., X_n be independent observations with distribution P_θ, θ ∈ Θ. We want to estimate the function g(θ). The estimator T_n is called
• a weakly consistent estimate of g(θ) if T_n → g(θ) in probability [P_θ] for every θ ∈ Θ as n → ∞,
• a strongly consistent estimate of g(θ) if T_n → g(θ) a.s. [P_θ] for every θ ∈ Θ as n → ∞.

Let R(θ, T_n) = IE_θ(T_n(X) − g(θ))² be the quadratic risk. Then

Theorem 4.1.1
(i) If lim_{n→∞} R(θ, T_n) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.
(ii) If lim_{n→∞} IE_θ T_n(X) = g(θ) and lim_{n→∞} var_θ T_n(X) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.
(iii) In particular, if T_n is unbiased for every n and lim_{n→∞} var_θ T_n(X) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.

Proof. (i) By the Chebyshev inequality,
P_θ(|T_n(X) − g(θ)| > ε) ≤ (1/ε²) IE_θ|T_n(X) − g(θ)|² = (1/ε²) R(θ, T_n) → 0.
(ii) Writing b_n(θ) = IE_θ T_n − g(θ),
P_θ(|T_n(X) − g(θ)| > ε) ≤ (1/ε²) IE_θ[T_n − IE_θ T_n + IE_θ T_n − g(θ)]² ≤ (2/ε²)[var_θ T_n + (b_n(θ))²] → 0.
(iii) is a special case of (ii). □

The parameter θ, or the function g(θ), can be estimated consistently only if θ is identifiable, i.e. if θ_1 ≠ θ_2 implies P_{θ_1} ≠ P_{θ_2}.
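A quick simulation (added here; the exponential model is an arbitrary choice) illustrates Theorem 4.1.1(iii): the sample mean is unbiased for the expectation and its variance tends to zero, so the probability of an ε-deviation shrinks as n grows.

```python
import numpy as np

# Theorem 4.1.1 (iii): an unbiased estimator whose variance tends to zero is
# weakly consistent.  T_n = sample mean of n Exponential(scale=theta)
# observations, g(theta) = theta (the expectation); eps is fixed.
# The sample mean is drawn directly as Gamma(n, theta)/n to keep memory small.
rng = np.random.default_rng(5)
theta, eps, reps = 2.0, 0.2, 200_000

for n in (10, 100, 1000, 10000):
    Tn = rng.gamma(shape=n, scale=theta, size=reps) / n
    prob = np.mean(np.abs(Tn - theta) > eps)   # P(|T_n - theta| > eps)
    bound = theta**2 / (n * eps**2)            # Chebyshev bound R(theta, T_n)/eps^2
    print(f"n = {n:6d}   P(|T_n - theta| > eps) ~ {prob:.4f}   Chebyshev bound {bound:.4f}")
# The empirical probabilities (and the Chebyshev bound used in the proof) tend to zero.
```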
4.2 Efficiency

Definition 4.2.1 (Limiting risk efficiency of T_n with respect to T*_n). Assume that two sequences {T_n}, {T*_n} of estimates satisfy
$$\lim_{n\to\infty} n^r R(T_n, \theta) = \lim_{n\to\infty} n^r R(T^*_{m_n}, \theta) \tag{4.2.1}$$
for some sequence {m_n}_{n=1}^∞ and a fixed r > 0. Then the limit lim_{n→∞} m_n/n, if it exists and is independent of the particular choice of {m_n}, is called the limiting risk efficiency of T_n with respect to T*_n.

Definition 4.2.2 (Relative asymptotic efficiency of T_n with respect to T*_n). Let
$$\sqrt n\,(T_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2), \qquad \sqrt n\,(T^*_{m_n} - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2) \qquad \text{as } n\to\infty. \tag{4.2.2}$$
Then the limit e_{T,T*} = lim_{n→∞} m_n/n, if it exists and is independent of the particular choice of {m_n}, is called the relative asymptotic efficiency (ARE) of T_n with respect to T*_n.

Theorem 4.2.1 Let
$$\sqrt n\,(T_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2), \qquad \sqrt n\,(T^*_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma_*^2) \qquad \text{as } n\to\infty. \tag{4.2.3}$$
Then e_{T,T*} = σ*²/σ².

Proof. Assume (4.2.3). Then
$$\sqrt n\,(T^*_{m_n} - g(\theta)) = \sqrt{\tfrac{n}{m_n}}\; \sqrt{m_n}\,(T^*_{m_n} - g(\theta)),$$
where √n(T_n − g(θ)) → N(0, σ²) in distribution, n/m_n → 1/e_{T,T*}, and √(m_n)(T*_{m_n} − g(θ)) → N(0, σ*²) in distribution. Comparing the limits in (4.2.2) gives σ² = σ*²/e_{T,T*}, thus e_{T,T*} = σ*²/σ². □

Consider a system of distributions P = {P_θ; θ ∈ Θ} with densities f(x, θ) satisfying
(A0) P_{θ_1} ≠ P_{θ_2} for θ_1 ≠ θ_2.
(A1) B = {x : f(x, θ) > 0} does not depend on θ.
(A2) X_1, ..., X_n is a sample from a distribution with density f(x, θ_0), where θ_0 ∈ I ⊂ Θ for an open interval I.

Theorem 4.2.2 Under conditions (A0)–(A2), for any θ ≠ θ_0, θ ∈ Θ,
$$\lim_{n\to\infty} P_{\theta_0}\Big(\prod_{i=1}^n f(X_i, \theta_0) > \prod_{i=1}^n f(X_i, \theta)\Big) = 1. \tag{4.2.4}$$

Proof. By the law of large numbers and the Jensen inequality, as n → ∞,
$$\frac1n \sum_{i=1}^n \log\frac{f(X_i, \theta)}{f(X_i, \theta_0)} \xrightarrow{P_{\theta_0}} \mathbb{E}_{\theta_0} \log\frac{f(X, \theta)}{f(X, \theta_0)} < \log \mathbb{E}_{\theta_0} \frac{f(X, \theta)}{f(X, \theta_0)} = 0.$$
This implies
$$P_{\theta_0}\Big(\frac1n \sum_{i=1}^n \log\frac{f(X_i, \theta)}{f(X_i, \theta_0)} > 0\Big) \to 0,$$
which gives (4.2.4). □

Denote L(θ, X) = log ∏_{i=1}^n f(X_i, θ) (the log-likelihood). The maximum likelihood estimate (MLE) of θ is defined as a solution of the maximization L(θ, X) = max, θ ∈ Θ. It is one of the solutions of the likelihood equation
$$\frac{\partial L(\theta, \mathbf X)}{\partial\theta} = \sum_{i=1}^n \frac{\dot f(X_i, \theta)}{f(X_i, \theta)} = 0. \tag{4.2.5}$$

Assume that conditions (A0)–(A2) are satisfied and that f(x, θ) is differentiable in θ ∈ I ⊂ Θ, where θ_0 ∈ I. Then

Theorem 4.2.3 Under the above conditions there exists a root θ̂_n of the likelihood equation (4.2.5) such that θ̂_n → θ_0 in probability [P_{θ_0}] as n → ∞.

Proof. Let a > 0 be such that (θ_0 − a, θ_0 + a) ⊂ I. Let
S_n = {x : L(θ_0, x) > L(θ_0 − a, x) and L(θ_0, x) > L(θ_0 + a, x)}.
By Theorem 4.2.2, lim_{n→∞} P_{θ_0}(S_n) = 1. On S_n there is a local maximum θ̂_n between θ_0 − a and θ_0 + a, and it satisfies L′(θ̂_n) = 0. Let θ*_n be the root of L′(θ) = 0 closest to θ_0. Then
lim_{n→∞} P_{θ_0}(|θ*_n − θ_0| < a) = 1 ∀a > 0. □

Remark 4.2.1 We know that θ*_n exists as the root closest to θ_0, but we are not able to find it, because θ_0 is unknown; everything holds only with probability tending to 1. If the likelihood equation has only one root T_n for all n and all x, then T_n is a consistent estimate of θ_0.

Theorem 4.2.4 Let the conditions (A0)–(A2) be satisfied and let, moreover,
(A3) |∂³ log f(x, θ)/∂θ³| ≤ M(x) for x ∈ B and |θ − θ_0| < C, where M(x) satisfies IE_{θ_0} M(X) < ∞.
Then every consistent sequence θ̂_n = θ̂_n(X) of roots of the likelihood equation is asymptotically normally distributed,
$$\mathcal L\big(\sqrt n\,(\hat\theta_n - \theta_0)\big) \xrightarrow{\mathcal D} N\Big(0, \frac{1}{I(\theta_0)}\Big),$$
where
$$I(\theta) = \int \Big(\frac{\partial \log f(x,\theta)}{\partial\theta}\Big)^2 f(x,\theta)\, d\mu$$
is the Fisher information.

Some steps of the proof. Expanding L′_n(θ̂_n) around θ_0,
$$0 = n^{-1/2} L'_n(\hat\theta_n) = n^{-1/2}\sum_{i=1}^n \frac{\dot f(X_i, \hat\theta_n)}{f(X_i, \hat\theta_n)} = n^{-1/2} L'_n(\theta_0) + n^{1/2}(\hat\theta_n - \theta_0)\cdot\frac1n L''_n(\theta_0) + \frac12 n^{-1/2}\big[n^{1/2}(\hat\theta_n - \theta_0)\big]^2 \frac1n L'''_n(\theta^*_n)$$
with θ*_n between θ_0 and θ̂_n. Then
$$n^{1/2}(\hat\theta_n - \theta_0) \approx -\frac{n^{-1/2} L'_n(\theta_0)}{n^{-1} L''_n(\theta_0)} - \frac12 (\hat\theta_n - \theta_0)\,\frac{n^{-1} L'''_n(\theta^*_n)}{n^{-1} L''_n(\theta_0)}.$$
We should show that
$$n^{-1/2} L'_n(\theta_0) \xrightarrow{\mathcal D} N(0, I(\theta_0)), \tag{4.2.6}$$
$$-\frac1n L''_n(\theta_0) = -\frac1n \sum_{i=1}^n \frac{\ddot f(X_i, \theta_0)}{f(X_i, \theta_0)} + \frac1n \sum_{i=1}^n \Big(\frac{\dot f(X_i, \theta_0)}{f(X_i, \theta_0)}\Big)^2 \xrightarrow{p} I(\theta_0), \tag{4.2.7}$$
$$\frac1n L'''_n(\theta^*_n) = O_p(1). \tag{4.2.8}$$
Here (4.2.6) follows from the central limit theorem, (4.2.7) from the law of large numbers, and (4.2.8) from the consistency of θ̂_n and from (A3). Then we obtain
$$0 = n^{-1/2} L'_n(\hat\theta_n) \approx N(0, I(\theta_0)) - n^{1/2}(\hat\theta_n - \theta_0)\, I(\theta_0) + \frac{1}{2\sqrt n}\big[\sqrt n\,(\hat\theta_n - \theta_0)\big]^2 \frac1n L'''_n(\theta^*_n),$$
thus
$$\sqrt n\,(\hat\theta_n - \theta_0) \approx -\frac{n^{-1/2} L'_n(\theta_0)}{n^{-1} L''_n(\theta_0)} - \frac12(\hat\theta_n - \theta_0)\,\frac{n^{-1} L'''_n(\theta^*_n)}{n^{-1} L''_n(\theta_0)} \approx N\Big(0, \frac{1}{I(\theta_0)}\Big) + o_p(1). \qquad\square$$
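The conclusion of Theorem 4.2.4 can be checked by simulation. In the sketch below (an added illustration; the Poisson model is an arbitrary choice) the likelihood equation has the explicit root θ̂_n = X̄_n and I(θ) = 1/θ, so √n(θ̂_n − θ_0) should be approximately N(0, θ_0).

```python
import numpy as np

# Monte Carlo check of Theorem 4.2.4 in the Poisson(theta) model:
# the likelihood equation gives theta_hat = sample mean, and the Fisher
# information is I(theta) = 1/theta, so sqrt(n)(theta_hat - theta_0) should be
# approximately N(0, theta_0) for large n.
rng = np.random.default_rng(6)
theta0, n, reps = 3.0, 400, 20_000

X = rng.poisson(theta0, size=(reps, n))
theta_hat = X.mean(axis=1)              # root of the likelihood equation
Z = np.sqrt(n) * (theta_hat - theta0)   # normalized estimation error

print(f"empirical variance of sqrt(n)(theta_hat - theta_0): {Z.var():.3f}")
print(f"theoretical 1/I(theta_0) = theta_0:                 {theta0:.3f}")
# The empirical variance is close to 1/I(theta_0), and a histogram of Z would
# look approximately normal, as the theorem asserts.
```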
Remark 4.2.2 Such an estimator is called an efficient likelihood estimator. It usually is the maximum likelihood estimator, but not necessarily.

Corollary 4.2.1 If the likelihood equation has only one root, or if it has multiple roots with probability tending to 0 as n → ∞, then, under the conditions of Theorem 4.2.4, the maximum likelihood estimator is asymptotically efficient.

Example 4.2.1 (One-parameter exponential family). Let
f(x, θ) = exp{θ T(x) + A(θ)}.
The maximization
$$\sum_{i=1}^n \log f(X_i, \theta) = \theta\sum_{i=1}^n T(X_i) + nA(\theta) = \max$$
leads to the likelihood equation
$$\frac1n \sum_{i=1}^n T(X_i) = -A'(\theta) = \mathbb{E}_\theta\Big[\frac1n \sum_{i=1}^n T(X_i)\Big]. \tag{4.2.9}$$
Here, because ∫ f(x, θ) dμ = 1,
$$0 = \int \big(A'(\theta) + T(x)\big) \exp\{\theta T(x) + A(\theta)\}\, d\mu \;\Longrightarrow\; A'(\theta) = -\mathbb{E}_\theta T(X).$$
We can show that IE_θ T(X) is increasing in θ: indeed,
$$\frac{\partial}{\partial\theta}\, \mathbb{E}_\theta T(X) = \int T(x)\big(A'(\theta) + T(x)\big) \exp\{\theta T(x) + A(\theta)\}\, d\mu = \mathbb{E}_\theta T^2(X) - \big(\mathbb{E}_\theta T(X)\big)^2 = \mathrm{var}_\theta T(X) > 0.$$
Thus the likelihood equation IE_θ T(X) = (1/n)Σ_{i=1}^n T(X_i) has at most one solution, and the conditions of Theorem 4.2.4 are satisfied. Thus, with probability tending to 1, the likelihood equation has one root θ̂_n, which is consistent, asymptotically efficient and asymptotically normal,
$$\sqrt n\,(\hat\theta_n - \theta) \xrightarrow{\mathcal D} N\Big(0, \frac{1}{\mathrm{var}_\theta T}\Big),$$
because
$$I(\theta) = \mathbb{E}_\theta\Big(\frac{\partial \log f(X, \theta)}{\partial\theta}\Big)^2 = \mathbb{E}_\theta[T(X) + A'(\theta)]^2 = \mathrm{var}_\theta T(X).$$

Example 4.2.2 (Truncated normal distribution). Let X_1, ..., X_n have the normal distribution N(θ, 1) truncated to (a, b), a < b, with density
$$p(x, \theta) = \frac{\frac{1}{\sqrt{2\pi}} \exp\{-\frac{(x-\theta)^2}{2}\}}{\Phi(b - \theta) - \Phi(a - \theta)}, \qquad a < x < b,$$
and p(x, θ) = 0 otherwise. Thus
$$p(x, \theta) = \exp\Big\{\theta x - \frac{\theta^2}{2} - \frac{x^2}{2} + \log\frac{1}{\sqrt{2\pi}} - \log[\Phi(b - \theta) - \Phi(a - \theta)]\Big\}, \qquad a < x < b,$$
so T(x) = x and the likelihood equation has the form X̄_n = IE_θ X. If θ → −∞, then X → a in probability and IE_θ X → a; if θ → +∞, then IE_θ X → b. Since IE_θ X is continuous in θ and a < X̄_n < b, the likelihood equation has a root.

4.2.1 Shift parameter

Let X_1, ..., X_n be a sample from a population with density f(x − θ). The MLE θ̂_n is a solution of
$$\prod_{i=1}^n f(X_i - \theta) := \max$$
and it is equivariant. The Pitman estimate T*_n is asymptotically equivalent to θ̂_n in the sense that √n(θ̂_n − T*_n) → 0 in probability as n → ∞ (Stone 1974). The likelihood equation can be rewritten as
$$\sum_{i=1}^n \frac{f'(X_i - \theta)}{f(X_i - \theta)} = 0. \tag{4.2.10}$$
If f is strongly unimodal, i.e. −f′/f is strictly increasing, then (4.2.10) has at most one root. Because ∏_{i=1}^n f(X_i − θ) → 0 as θ → ±∞, the product ∏_{i=1}^n f(X_i − θ) attains its maximum inside the real line; hence the root of (4.2.10) exists and is asymptotically efficient.

4.2.2 Multiple root

Let L(θ, x) = log ∏_{i=1}^n f(x_i, θ). Assume that the equation
$$L'(\theta) = \sum_{i=1}^n \frac{\dot f(X_i, \theta)}{f(X_i, \theta)} = 0 \tag{4.2.11}$$
can have multiple roots, but that there exists a consistent estimate θ̃⁰_n.

Theorem 4.2.5
(i) Let θ̃⁰_n be a consistent estimate and let the conditions (A0)–(A2) hold. Then the root of equation (4.2.11) closest to θ̃⁰_n is also consistent, and hence it is asymptotically efficient.
(ii) Let θ̃_n be a consistent initial estimate satisfying √n(θ̃_n − θ) = O_p(1) as n → ∞. Put
$$T_n = \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)}.$$
Then T_n is an asymptotically efficient estimate of θ, i.e. √n(T_n − θ) → N(0, 1/I(θ)) in distribution.

The proof is similar to the proof of Theorem 4.2.4.
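Below is a minimal sketch of the one-step construction in Theorem 4.2.5(ii), added here as an illustration. In the Cauchy location model the likelihood equation may have several roots, the sample median is a √n-consistent initial estimate, and one Newton step from it is asymptotically efficient (1/I(θ) = 2, versus the asymptotic variance π²/4 ≈ 2.47 of the median).

```python
import numpy as np

# One-step (Newton) efficient estimator of Theorem 4.2.5 (ii) in the Cauchy
# location model f(x - theta) = 1 / (pi * (1 + (x - theta)^2)).
# Initial estimate: the sample median (sqrt(n)-consistent).
def score(theta, x):          # L'(theta) = sum of d/dtheta log f(x_i - theta)
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def score_deriv(theta, x):    # L''(theta)
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

rng = np.random.default_rng(7)
theta0, n, reps = 1.0, 200, 10_000
errors = np.empty(reps)
for r in range(reps):
    x = theta0 + rng.standard_cauchy(n)
    init = np.median(x)                                # consistent initial estimate
    Tn = init - score(init, x) / score_deriv(init, x)  # one Newton step
    errors[r] = Tn - theta0

print(f"n * var(T_n)    ~ {n * errors.var():.3f}   (efficient bound 1/I(theta) = 2)")
print(f"n * var(median) ~ {np.pi**2 / 4:.3f}   (asymptotic value, for comparison)")
# In finite samples L''(theta) can occasionally be close to zero; a common
# practical variant replaces it by -n * I(init) = -n/2 for stability.
```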