Theory of Point Estimation

Jana Jurečková
Charles University in Prague
2010

Contents

1 Basic concepts
  1.1 Loss function and risk
  1.2 Convex loss function
  1.3 Estimation of vector function
2 Unbiased estimates
  2.1 Uniformly best unbiased estimate
    2.1.1 How to find the best unbiased estimate
3 Equivariant estimators
  3.1 Estimation of the shift parameter
    3.1.1 The form of the Pitman (MRE) estimator
  3.2 Relation of equivariance and unbiasedness
4 Asymptotic behavior of estimates
  4.1 Consistency
  4.2 Efficiency
    4.2.1 Shift parameter
    4.2.2 Multiple root

Chapter 1  Basic concepts

1.1 Loss function and risk

Let X be an observable random vector with values in a sample space 𝒳, and let P = {P_θ, θ ∈ Θ} be a system of probability distributions indexed by an unobservable parameter θ. We want to estimate a function g(θ) : Θ → R^1. The observed values x of X constitute the data. An estimator of g(θ) is a function T : 𝒳 → R^1.

The loss incurred when we estimate g(θ) by t is measured by the loss function L(θ, t), which should satisfy
L(θ, t) ≥ 0 ∀θ, t;  L(θ, g(θ)) = 0 ∀θ.
The quality of an estimator T is measured by the risk function
R(θ, T) = IE_θ L(θ, T(X)).

We would like to find the uniformly best estimator T, i.e. the one minimizing R(θ, T) over T uniformly in θ ∈ Θ. Such an estimator exists only in special cases; if it does not exist, we minimize the risk only over a subclass of estimators, e.g.

• unbiased estimators: bias = IE_θ T(X) − g(θ) = 0;
• median unbiased estimators: P_θ(T(X) < g(θ)) = P_θ(T(X) > g(θ));
• if X ∼ F(x − θ) (shift parameter) and L(θ, t) = L(|θ − t|), the equivariant estimators satisfying T(X_1 + c, ..., X_n + c) = T(X) + c.

Other possibilities:

• Instead of minimizing the risk uniformly over θ ∈ Θ, we can minimize
$$\int_\Theta R(\theta, T)\, w(\theta)\, d\theta = \min$$
over T with respect to a weight function w. Such an estimator is called a formal Bayes estimator with the (generalized) prior density w(θ).
• Or we minimize sup_{θ∈Θ} R(θ, T) (the minimax estimator).

1.2 Convex loss function

A function φ is convex if φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y) for all x, y and 0 < λ < 1, and strictly convex if φ(λx + (1 − λ)y) < λφ(x) + (1 − λ)φ(y) whenever x ≠ y. If φ is convex on I = (a, b) and t_0 ∈ I is fixed, then there exists a straight line y = L(x) = c(x − t_0) + φ(t_0), passing through the point [t_0, φ(t_0)], such that L(x) ≤ φ(x) ∀x ∈ I.

Theorem 1.2.1 (Jensen inequality). If φ is convex on an open interval I and the random variable X satisfies P(X ∈ I) = 1 and |IE X| < ∞, then
φ(IE X) ≤ IE φ(X).   (1.2.1)
If φ is strictly convex and X is not constant with probability 1, then (1.2.1) holds as a sharp inequality.

Proof. Put t_0 = IE X and let L(x) be the straight line through [t_0, φ(t_0)] satisfying L(x) ≤ φ(x). Then
IE φ(X) ≥ IE L(X) = IE[c(X − IE X)] + φ(IE X) = φ(IE X).
If φ is strictly convex, then the line touches φ only at t_0 and L(x) < φ(x) elsewhere, so the inequality is sharp unless X = t_0 with probability 1. □
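The Jensen inequality is easy to check empirically. The following short simulation is not part of the original notes; the choice of φ(x) = x² and of an exponential sample is only an illustration of comparing φ(IE X) with IE φ(X) by Monte Carlo.

```python
import numpy as np

# Monte Carlo illustration of Jensen's inequality: phi(E X) <= E phi(X)
# for a convex phi.  The convex function and the distribution of X are
# arbitrary choices made for this sketch.
rng = np.random.default_rng(0)

phi = lambda x: x**2                          # convex (in fact strictly convex)
X = rng.exponential(scale=2.0, size=10**6)    # any distribution with E|X| < infinity

lhs = phi(X.mean())          # phi(E X), estimated by phi(sample mean)
rhs = phi(X).mean()          # E phi(X), estimated by the sample mean of phi(X)

print(f"phi(E X) ~ {lhs:.3f},  E phi(X) ~ {rhs:.3f}")
# For a non-degenerate X and a strictly convex phi the second value is strictly
# larger, in agreement with Theorem 1.2.1.
```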
Definition 1.2.1 The statistic S : 𝒳 → 𝒮 is called sufficient for the system P if there is a version of the conditional distribution P_θ(X ∈ A | S = s) = λ_s(A) independent of θ.

Theorem 1.2.2 (Rao–Blackwell). Let X be an observable random vector with distribution P_θ ∈ P = {P_ϑ : ϑ ∈ Θ} and let S be a sufficient statistic for P. Consider the estimation problem with a strictly convex loss function L(θ, t). Let T be an estimator of g(θ) with finite expectation and finite risk, i.e. R(θ, T) = IE_θ L(θ, T(X)) < ∞ ∀θ. Denote T*(s) = IE{T(X) | S(X) = s}. Then T*(S(X)) is an estimator satisfying
R(θ, T*) < R(θ, T),
unless T(X) = T*(S(X)) with probability 1.

Proof. Because S is sufficient, T*(S) does not depend on θ, and thus it is an estimator. Put φ(t) = L(θ, t). By the Jensen inequality applied to the conditional distribution given S = s,
φ(T*(s)) = L(θ, T*(s)) = L[θ, IE(T | S = s)] = φ(IE(T | S = s)) < IE(φ(T) | S = s) = IE[L(θ, T(X)) | S = s],
unless T(X) = T*(S(X)) with probability 1; hence
R(θ, T*) = IE_θ L[θ, T*(S(X))] < IE_θ L(θ, T(X)). □

Remark 1.2.1 If L(θ, t) is convex, but not strictly convex, then Theorem 1.2.2 holds with a non-sharp inequality.

Definition 1.2.2 (Admissibility). The estimator T is called inadmissible if there exists another estimator T′ dominating T, i.e. such that
R(θ, T′) ≤ R(θ, T) ∀θ, with a sharp inequality for at least one θ.   (1.2.2)
The estimator T is called admissible with respect to the loss L(θ, t) if there is no estimator T′ satisfying (1.2.2).

If L(θ, t) is strictly convex and an admissible estimator exists, then it is uniquely determined. More precisely:

Theorem 1.2.3 Let L(θ, t) be strictly convex and let T be an admissible estimator of g(θ). If T′ is another estimator with the same risk as T, i.e. R(θ, T) = R(θ, T′) ∀θ, then T(X) = T′(X) with probability 1.

Proof. Put T* = ½(T + T′). Then R(θ, T*) < ½[R(θ, T) + R(θ, T′)] = R(θ, T) ∀θ, unless T = T′ with probability 1. But this contradicts the admissibility of T. □

1.3 Estimation of vector function

The situation is analogous for the estimation of a vector function g(θ) = (g_1(θ), ..., g_k(θ)). Its estimator T(X) is then also a k-dimensional vector. A function φ : E → R^1, where E is a convex set, is called convex if
φ(λx_1 + (1 − λ)x_2) ≤ λφ(x_1) + (1 − λ)φ(x_2) ∀x_1, x_2 ∈ E and 0 < λ < 1.
If φ is twice differentiable on E, then φ is convex [strictly convex] iff the Hessian matrix
$$\mathbf H = \left(\frac{\partial^2 \varphi(x_1,\dots,x_k)}{\partial x_i \partial x_j}\right)_{i,j=1,\dots,k}$$
is positive semidefinite [positive definite]. If φ is convex on an open convex set E ⊂ R^k, then to every fixed point t^0 there exists a hyperplane
$$y = L(\mathbf x) = \varphi(\mathbf t^0) + \sum_{i=1}^k c_i (x_i - t^0_i)$$
passing through the point (t^0, φ(t^0)) and satisfying L(x) ≤ φ(x) ∀x ∈ E. If X is a random vector such that P(X ∈ A) = 1 for an open convex set A and IE X exists, then IE X ∈ A.
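The variance reduction promised by the Rao–Blackwell theorem is easy to see numerically. The sketch below is an illustration added here (not an example from the notes): it starts from the crude unbiased estimator T = X_1 of p in the Bernoulli model and conditions on the sufficient statistic S = ΣX_i, which gives T* = S/n.

```python
import numpy as np

# Rao-Blackwellization in the Bernoulli(p) model:
#   T  = X_1                 (unbiased for p, but crude)
#   S  = sum(X_i)            (sufficient statistic)
#   T* = E[X_1 | S] = S/n    (the conditioned estimator)
# Both are unbiased; Theorem 1.2.2 says T* has strictly smaller risk.
rng = np.random.default_rng(1)
p, n, reps = 0.3, 20, 100_000

X = rng.binomial(1, p, size=(reps, n))
T = X[:, 0]                  # first observation only
T_star = X.sum(axis=1) / n   # conditional expectation of X_1 given S

for name, est in [("T = X_1", T), ("T* = S/n", T_star)]:
    print(f"{name:10s} mean = {est.mean():.4f}  MSE = {((est - p)**2).mean():.5f}")
# The means agree with p (unbiasedness); the MSE of T* is roughly p(1-p)/n,
# much smaller than p(1-p) for T.
```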
Chapter 2  Unbiased estimates

2.1 Uniformly best unbiased estimate

T(X) is an unbiased estimate of g(θ) if IE_θ(T(X)) = g(θ) ∀θ ∈ Θ.

Example: Unbiased estimates need not exist. Let X have the binomial distribution B(n, p) and let g(p) = 1/p. If T were an unbiased estimate of g(p), then
$$\sum_{i=0}^{n} T(i)\binom{n}{i} p^i (1-p)^{n-i} = \frac1p \qquad \forall p \in (0,1). \tag{2.1.1}$$
But if p ↓ 0, then the left-hand side of (2.1.1) tends to T(0), while the right-hand side tends to ∞, which contradicts unbiasedness.

The function g(θ) is called estimable if there exists at least one unbiased estimate of g(θ).

Lemma 2.1.1 (Structure of the class of unbiased estimates). If T_0 is an unbiased estimate of g(θ), then every unbiased estimate T of g(θ) can be written in the form T = T_0 − U, where U is an unbiased estimate of zero, i.e. such that IE_θ U = 0 ∀θ ∈ Θ.

Proof. If T_0 is unbiased, then T_0 − U is unbiased for every such U. Conversely, if T is any unbiased estimate, then U = T_0 − T is an unbiased estimate of zero and T = T_0 − U. □

Specifically, consider the quadratic loss L(θ, t) = (t − g(θ))². Then the risk of an unbiased estimate T is its variance:
R(θ, T) = IE_θ(T − g(θ))² = var_θ T(X).
If T_0 minimizes var_θ T(X) for every θ among all unbiased estimates of g(θ), then it is called the best minimum variance estimate (BMVE) of g(θ).

Denote by Δ the set of all unbiased estimates of g(θ) satisfying IE_θ T² < ∞ ∀θ ∈ Θ, and by 𝒰 the set of all unbiased estimates of 0 which belong to Δ.

Theorem 2.1.1 Let X have distribution P_θ, θ ∈ Θ, and let T ∈ Δ. Then T is the BMVE of its expected value g(θ) if and only if
IE_θ[T(X) · U(X)] = 0 ∀U ∈ 𝒰 and ∀θ ∈ Θ.

Proof. (i) Necessity: Let T be the BMVE, IE_θ T = g(θ), and let U ∈ 𝒰, U ≢ 0. Put T′ = T + λU, λ ∈ R^1. Then IE_θ T′(X) = IE_θ T(X) = g(θ) ∀θ, hence var_θ T′(X) ≥ var_θ T(X) for every λ, i.e. IE_θ(T′)² ≥ IE_θ T²; thus
IE_θ T² + λ² IE_θ U² + 2λ IE_θ(T·U) ≥ IE_θ T², i.e.
λ² IE_θ U² + 2λ IE_θ(T·U) ≥ 0 ∀λ.   (2.1.2)
The roots of the quadratic equation λ² IE_θ U² + 2λ IE_θ(T·U) = 0 are λ = 0 and λ = −2 cov_θ(T, U)/var_θ U (note that IE_θ(T·U) = cov_θ(T, U) and IE_θ U² = var_θ U, because IE_θ U = 0). Unless cov_θ(T, U) = 0, the quadratic is negative for λ between the two roots, contradicting (2.1.2); hence IE_θ(T·U) = 0.

(ii) Sufficiency: Let IE_θ(T·U) = 0 ∀U ∈ 𝒰 and let T′ be an unbiased estimate of g(θ). If var_θ T′ = ∞, then T′ cannot be better than T. Let var_θ T′ < ∞. Then var_θ(T − T′) < ∞ and T − T′ ∈ 𝒰, thus
IE_θ(T(T − T′)) = 0 ⇒ IE_θ T² = IE_θ(T·T′) ⇒ IE_θ T² − g²(θ) = IE_θ(T·T′) − g²(θ) ⇒ var_θ T = cov_θ(T, T′)
⇒ 0 ≤ var_θ(T − T′) = var_θ T + var_θ T′ − 2 cov_θ(T, T′) = var_θ T′ − var_θ T ⇒ var_θ T ≤ var_θ T′. □

Definition 2.1.1 The statistic S(X) is called complete for the system of distributions P = {P_θ, θ ∈ Θ} if, for any function h(S),
IE_θ h(S(X)) = 0 ∀θ ⇒ h(S(X)) = 0 a.s. [P_θ], ∀θ ∈ Θ.

Theorem 2.1.2 Let X follow a distribution P_θ ∈ P and let S be a complete and sufficient statistic for P. Then every estimable function g(θ) has exactly one unbiased estimate that is a function of S.

Proof. Let T be an unbiased estimate of g(θ). Then T*(S(X)) = IE(T(X)|S(X)) is an unbiased estimate which is a function of S. Let T_1(S) and T_2(S) be two unbiased estimates of g(θ) depending only on S. Then IE_θ(T_1 − T_2) = 0 ∀θ, and because S is complete, this implies T_1 − T_2 = 0 a.s. [P_θ], θ ∈ Θ. □

Theorem 2.1.3 (Lehmann–Scheffé). Let X follow a distribution P_θ ∈ P and let S be a complete and sufficient statistic for P. Then
(i) For every estimable function g(θ) and every loss function L(θ, t) convex in t, there exists an unbiased estimate T of g(θ) which uniformly minimizes the risk R(θ, T).
(ii) T is the only unbiased estimate which is a function of S. If L is strictly convex in t, then T is the only unbiased estimate with minimum risk.

Proof. The Rao–Blackwell theorem applies to S and a convex loss function. By Theorem 2.1.2 the estimate T*(S(X)) = IE(T(X)|S(X)) is unique, and because S is complete, it cannot be further improved. □

2.1.1 How to find the best unbiased estimate

Let S(X) be a complete and sufficient statistic.

Method 1: The best unbiased estimate of an estimable function g(θ) is any function T(S) such that IE_θ T(S) = g(θ) ∀θ ∈ Θ.

Method 2: Start with any unbiased estimate T(X). Then T′(X) = IE(T(X)|S) is the best unbiased estimate.
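As an added illustration of Method 2 (not an example from the notes), consider estimating g(λ) = e^{−λ} = P_λ(X_1 = 0) from a Poisson(λ) sample. Starting from the crude unbiased estimate T = 1{X_1 = 0} and conditioning on the complete sufficient statistic S = ΣX_i gives T′ = (1 − 1/n)^S, because X_1 given S = s is binomial(s, 1/n). The simulation below checks unbiasedness and compares variances.

```python
import numpy as np

# Method 2 in the Poisson(lam) model, target g(lam) = exp(-lam) = P(X_1 = 0):
#   T  = 1{X_1 = 0}                 (unbiased, very crude)
#   S  = sum(X_i)                   (complete sufficient statistic)
#   T' = E[T | S] = (1 - 1/n)**S    (since X_1 | S = s ~ Binomial(s, 1/n))
rng = np.random.default_rng(2)
lam, n, reps = 1.5, 10, 200_000
target = np.exp(-lam)

X = rng.poisson(lam, size=(reps, n))
T = (X[:, 0] == 0).astype(float)
T_best = (1 - 1/n) ** X.sum(axis=1)

for name, est in [("T = 1{X1=0}", T), ("T' = (1-1/n)^S", T_best)]:
    print(f"{name:16s} mean = {est.mean():.4f} (target {target:.4f}), "
          f"var = {est.var():.5f}")
# Both estimates are unbiased; by the Lehmann-Scheffe theorem T' is the
# (essentially unique) best unbiased estimate, and its variance is indeed smaller.
```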
Chapter 3  Equivariant estimators

3.1 Estimation of the shift parameter

Let X_1, ..., X_n be a sample from a distribution with distribution function F(x − θ) and density f(x − θ). The problem is to estimate θ ∈ R^1 with respect to the loss L(θ, t). Consider a loss which is invariant to the shift, i.e. L(θ, t) = L(θ + c, t + c) ∀c ∈ R^1. Then L(θ, t) = L(0, t − θ), hence the loss depends only on the difference t − θ. If the loss is invariant, the whole problem is invariant to the shift: if we estimate θ by T(X), then a natural estimate of θ + c is T(X) + c.

Definition 3.1.1 The estimator T(X) is called equivariant (with respect to the shift) if it satisfies
T(X_1 + c, ..., X_n + c) = T(X_1, ..., X_n) + c ∀c ∈ R^1 and ∀X ∈ R^n.

Lemma 3.1.1 The bias, risk and variance of an equivariant estimate T(X) do not depend on the value of θ, and hence are constant in θ.

Proof. If X_1 has d.f. F(x − θ), then P(X_1 − θ ≤ z) = P(X_1 ≤ z + θ) = F(z), thus X_1 − θ has distribution function F(·). Then
bias = b(θ) = IE_θ(T(X)) − θ = IE_θ(T(X_1 − θ, ..., X_n − θ)) = IE_0(T(X)) = b,
var_θ T(X) = IE_θ(T(X) − IE_θ T(X))² = IE_θ[T(X − θ) + θ − IE_θ T(X)]² = IE_θ(T(X − θ) − b)² = IE_0(T(X) − b)²,
R(T, θ) = IE_θ[L(T(X) − θ)] = IE_θ[L(T(X − θ))] = IE_0[L(T(X))] = R(T). □
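Lemma 3.1.1 can be seen in a quick simulation. The sketch below is an added illustration (the double-exponential shift family and the sample median are arbitrary choices): it estimates the quadratic risk of an equivariant estimator at several values of θ, and up to Monte Carlo error the estimated risks coincide.

```python
import numpy as np

# Lemma 3.1.1: the risk of an equivariant estimator in a shift family does not
# depend on theta.  Here the family is the double-exponential (Laplace) shift
# family and the equivariant estimator is the sample median.
rng = np.random.default_rng(3)
n, reps = 15, 100_000

for theta in (-2.0, 0.0, 5.0):
    X = theta + rng.laplace(size=(reps, n))   # sample from f(x - theta)
    T = np.median(X, axis=1)                  # equivariant estimator
    risk = ((T - theta) ** 2).mean()          # Monte Carlo estimate of R(T, theta)
    print(f"theta = {theta:5.1f}   estimated risk = {risk:.4f}")
# The three estimated risks agree up to simulation noise, as the lemma asserts.
```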
We shall look for an equivariant estimate with minimal risk (MRE), i.e. T* such that R(T*) < R(T) for any equivariant estimator T ≠ T*. First we investigate the structure of the class of equivariant estimators.

Lemma 3.1.2 Let T_0(X) be an equivariant estimate. Then the estimate T(X) is equivariant if and only if there exists a statistic U(X), invariant to the shift, i.e. satisfying
U(X_1 + c, ..., X_n + c) = U(X_1, ..., X_n) ∀c ∈ R^1, ∀X,   (3.1.1)
such that
T(X) = T_0(X) + U(X) ∀X.   (3.1.2)

Proof.
• Let T satisfy (3.1.1) and (3.1.2). Then T(X + c) = T_0(X + c) + U(X + c) = T_0(X) + c + U(X) = T(X) + c, thus T is equivariant.
• Let T be equivariant and let T_0 be any equivariant estimator. Put U(X) = T(X) − T_0(X). Then U is invariant and T = T_0 + U. □

Lemma 3.1.3 Let n ≥ 2. The function U(x) is invariant if and only if it depends only on the differences y_i = x_i − x_1, i = 2, ..., n. If n = 1, then the only invariant functions are the constants.

Proof. If n = 1, then U(x + c) ≡ U(x) iff U(x) is a constant. Let n ≥ 2 and U(x + c) ≡ U(x). Then
U(x_1, ..., x_n) = U(x_1 − x_1, x_2 − x_1, ..., x_n − x_1) = U(0, y_2, ..., y_n) = Ũ(y_2, ..., y_n). □

Corollary 3.1.1 Let T_0 be an equivariant estimate and n ≥ 2. Then the estimator T is equivariant if and only if there exists a function Ũ(Y_2, ..., Y_n) of Y = (Y_2, ..., Y_n) such that T(X) ≡ T_0(X) − Ũ(Y).

Remark 3.1.1 The differences Y_2 = X_2 − X_1, ..., Y_n = X_n − X_1 determine all differences X_i − X_j, i ≠ j. Instead of Y we may take e.g. X_1 − X̄, ..., X_n − X̄.

Definition 3.1.2 The statistic S(X) is called maximal invariant with respect to the shift if it is invariant and S(x) = S(x′) holds if and only if x′_i = x_i + c, i = 1, ..., n, for some c ∈ R^1.

We see that (Y_2, ..., Y_n) or (X_1 − X̄, ..., X_n − X̄) are maximal invariants. Maximal invariants are important because of the following property:

Lemma 3.1.4 The function U(x) is invariant if and only if it is a function of a maximal invariant.

Proof. If U is a function of S, i.e. U(x) = h(S(x)), then it is invariant. Conversely, let U be invariant and let S(x) = S(x′). Then x′ = x + c, hence U(x′) = U(x), so U depends on x only through S(x). □

Theorem 3.1.1 (Minimum risk estimate). Let T_0 be an equivariant estimate with finite risk. If for every value y of the differences there exists v*(y) which minimizes
IE_0{ L[T_0(X) − v(Y)] | Y = y }
over functions v of y, then the minimum risk estimate exists and equals
T*(X) = T_0(X) − v*(Y).

Proof. Let T(X) = T_0(X) − v(Y). Then
$$R(T,\theta) = \mathbb{E}_\theta[L(T_0(\mathbf X) - v(\mathbf Y) - \theta)] = \mathbb{E}_0\{L[T_0(\mathbf X) - v(\mathbf Y)]\} = \mathbb{E}_0\big\{\mathbb{E}_0\big[L(T_0(\mathbf X) - v(\mathbf Y))\,\big|\,\mathbf Y\big]\big\} = \int \mathbb{E}_0\big[L(T_0(\mathbf X) - v(\mathbf y))\,\big|\,\mathbf Y = \mathbf y\big]\, dP_0^{\mathbf Y}(\mathbf y)$$
should be minimized with respect to v(·). But this is minimized if the integrand is minimized for every y. □

Corollary 3.1.2
(a) If L(t − θ) = (t − θ)², then v*(y) = IE_0[T_0(X) | Y = y].
(b) If L(t − θ) = |t − θ|, then v*(y) is the median of the conditional distribution of T_0(X) given Y = y.

Example 3.1.1 Let X_1, ..., X_n be a sample from the normal distribution N(ξ, σ²) with σ known. Put T_0(X) = X̄. Then X̄ and Y = (X_2 − X_1, ..., X_n − X_1) are independent; hence, considering IE_0[L(X̄ − v(Y)) | Y = y], we conclude that v(y) = const, determined by the condition IE_0[L(X̄ − v)] = min. Thus, if L is a convex and even (symmetric) function, then v = 0 and X̄ is the MRE (minimum risk estimator).

Theorem 3.1.2 Let ℱ be the class of all distribution functions with Lebesgue densities f which have a fixed finite variance, say σ² = 1. Let X_1, ..., X_n be a sample from the distribution with density f(x − ξ), where ξ = IE X. Let r_n(f) be the risk of the MRE of ξ with respect to the quadratic loss function. Then r_n(f) is maximal over ℱ for the normal f.

Proof. If f is normal, then X̄ is the MRE and IE(X̄ − ξ)² = 1/n. Because 1/n is also the quadratic risk of X̄ for every f ∈ ℱ, the risk of the MRE is ≤ 1/n for every f ∈ ℱ, with equality for the normal density. □

Remark 3.1.2 It follows from Corollary 3.1.2 that the MRE satisfies T*(X) = X̄ − IE_0(X̄ | Y), hence T*(X) = X̄ ⟺ IE_0(X̄ | Y) = 0. But by the theorem of Kagan, Linnik and Rao (1967), IE_0(X̄ | Y) = 0 holds if and only if the distribution of X_1, ..., X_n is normal.

Example 3.1.2 (Exponential distribution). Let X_1, ..., X_n have the distribution function
F(x − θ) = 1 − exp{−(x − θ)} for x ≥ θ, and F(x − θ) = 0 for x < θ.
Put T_0(X) = X_(1), where X_(1) ≤ X_(2) ≤ ... ≤ X_(n) are the order statistics. Then
P(X_(1) > x) = ∏_{i=1}^n P(X_i > x) = exp{−n(x − θ)}, x ≥ θ,
hence the density of X_(1) is n exp{−n(x − θ)}, x ≥ θ. Because X_(1) and Y are independent, the invariant function v(Y) is a constant, as in Example 3.1.1. We look for v such that IE_0[L(X_(1) − v)] = min. If L(t − θ) = (t − θ)², then IE_0(X_(1) − v)² is minimal for
$$v = \mathbb{E}_0 X_{(1)} = n\int_0^\infty x\, e^{-nx}\, dx = \frac1n \int_0^\infty y\, e^{-y}\, dy = \frac1n,$$
and the MRE is T*(X) = X_(1) − 1/n.

3.1.1 The form of the Pitman (MRE) estimator

Let X_1, ..., X_n be a sample from a distribution with density f(x − θ). Then the Pitman (MRE) estimator with respect to the quadratic loss is T*(X) = T_0(X) − IE_0[T_0(X) | Y], where T_0 is an initial equivariant estimator with finite risk. T*(X) can also be written in the form
$$T^*(\mathbf X) = \frac{\int_{-\infty}^{\infty} t\, f(X_1 - t)\cdots f(X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t)\cdots f(X_n - t)\, dt}.$$

Proof. Put T_0(X) = X_1. We look for the conditional density of X_1 given Y = y under θ = 0. Make the substitution y_i = x_i − x_1, i = 2, ..., n, x_1 = x_1. Then the density of Y* = (X_1, Y_2, ..., Y_n) is
p(y*) = f(x_1, x_1 + y_2, ..., x_1 + y_n),
where f(z_1, ..., z_n) = f(z_1)⋯f(z_n) denotes the joint density under θ = 0, and the conditional density of X_1 given y = (y_2, ..., y_n) is
$$\frac{f(x_1, x_1 + y_2, \dots, x_1 + y_n)}{\int_{-\infty}^{\infty} f(u, u + y_2, \dots, u + y_n)\, du}.$$
Hence
$$\mathbb{E}_0(X_1 \mid \mathbf Y = \mathbf y) = \frac{\int_{-\infty}^{\infty} u\, f(u, u + y_2, \dots, u + y_n)\, du}{\int_{-\infty}^{\infty} f(u, u + y_2, \dots, u + y_n)\, du} = \frac{\int_{-\infty}^{\infty} (X_1 - t)\, f(X_1 - t, X_2 - t, \dots, X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t, \dots, X_n - t)\, dt},$$
where we substituted t = X_1 − u and used y_i = X_i − X_1, i = 2, ..., n. Then
$$T^*(\mathbf X) = X_1 - \mathbb{E}_0(X_1 \mid \mathbf Y = \mathbf y) = \frac{\int_{-\infty}^{\infty} t\, f(X_1 - t)\cdots f(X_n - t)\, dt}{\int_{-\infty}^{\infty} f(X_1 - t)\cdots f(X_n - t)\, dt}. \qquad\square$$

Example 3.1.3 Let X_1, ..., X_n be a sample from the uniform distribution R(θ − ½, θ + ½) and let L(t − θ) = (t − θ)². Then
f(x_1, ..., x_n) = 1 if θ − ½ ≤ X_(1) ≤ X_(n) ≤ θ + ½, and 0 otherwise.
Hence, for θ = 0,
f(x_1 − t, ..., x_n − t) = 1 if X_(n) − ½ ≤ t ≤ X_(1) + ½, and 0 otherwise.
Put T_0 = ½(X_(1) + X_(n)). Then
$$\int t\, f(x_1 - t, \dots, x_n - t)\, dt = \int_{X_{(n)} - \frac12}^{X_{(1)} + \frac12} t\, dt = \frac12\Big[\big(X_{(1)} + \tfrac12\big)^2 - \big(X_{(n)} - \tfrac12\big)^2\Big]$$
and
$$\int f(x_1 - t, \dots, x_n - t)\, dt = \int_{X_{(n)} - \frac12}^{X_{(1)} + \frac12} dt = 1 - (X_{(n)} - X_{(1)}).$$
Finally,
$$T^*(\mathbf X) = \frac{\frac12\big(X_{(1)} + X_{(n)}\big)\big(1 - (X_{(n)} - X_{(1)})\big)}{1 - (X_{(n)} - X_{(1)})} = \frac12\big(X_{(1)} + X_{(n)}\big).$$
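The Pitman formula lends itself to direct numerical evaluation. The following sketch is added here as an illustration: it approximates the two integrals on a grid for a given density f; with the standard normal density the result agrees with X̄, as Example 3.1.1 predicts.

```python
import numpy as np

def pitman_estimate(x, f, grid_half_width=20.0, grid_size=20001):
    """Numerically evaluate the Pitman (MRE) estimator
        T*(x) = int t prod_i f(x_i - t) dt / int prod_i f(x_i - t) dt
    on a regular grid of t-values; `f` is the density of one observation for theta = 0."""
    t = np.linspace(x.mean() - grid_half_width, x.mean() + grid_half_width, grid_size)
    # log-likelihood of the shifted sample at each grid point, exponentiated stably
    loglik = np.sum(np.log(f(x[:, None] - t[None, :])), axis=0)
    w = np.exp(loglik - loglik.max())
    return np.trapz(t * w, t) / np.trapz(w, t)

rng = np.random.default_rng(4)
x = rng.normal(loc=3.0, scale=1.0, size=10)

normal_density = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
print("Pitman estimate:", pitman_estimate(x, normal_density))
print("Sample mean:    ", x.mean())
# For the normal density the two numbers coincide (up to the grid error),
# in agreement with Example 3.1.1; for other densities they generally differ.
```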
3.2 Relation of equivariance and unbiasedness

Lemma 3.2.1 Let L(t − θ) = (t − θ)².
(i) If T(X) is equivariant and has constant bias IE_θ T(X) − θ = b, b ≠ 0, then T(X) − b is an equivariant and unbiased estimator whose risk is smaller than the risk of T(X).
(ii) If the MRE is uniquely determined, then it is unbiased.
(iii) If there exists a uniformly best unbiased estimate which is equivariant, then it is the MRE.

Proof. (i) Let T_1(X) = T(X) − b. Then T_1 is equivariant, IE_θ(T_1(X)) = θ + b − b = θ, and
IE_0(T_1(X))² = IE_0(T(X) − b)² = IE_0 T²(X) − b² < IE_0 T²(X).
(ii) Let T* be the MRE and let T be any equivariant estimate with finite risk. Then T*(X) = T(X) − IE_0(T | Y), hence IE_0 T* = IE_0 T − IE_0[IE_0(T | Y)] = 0, i.e. IE_θ T* = θ, so T* is unbiased.
(iii) Let T be the uniformly best unbiased estimate and also equivariant, and let T_1 be any equivariant estimate. Then IE_θ T_1 = θ + b and, by (i), T_1 − b is equivariant and unbiased with IE_0(T_1 − b)² ≤ IE_0 T_1². Because T is the best unbiased estimate and T_1 − b is unbiased, this implies
IE_θ(T − θ)² = IE_0 T² ≤ IE_0(T_1 − b)² ≤ IE_0 T_1²,
hence T minimizes the risk among equivariant estimators. □

Definition 3.2.1 An estimator T of g(θ) is called risk unbiased with respect to the loss L if
IE_θ L(θ, T(X)) ≤ IE_θ L(θ′, T(X)) ∀θ′ ≠ θ.

The following theorem shows that the MRE is risk unbiased:

Theorem 3.2.1 Let X_1, ..., X_n be a sample from a distribution with density f(x − θ). Then the MRE with respect to the loss L(θ, t) = L(t − θ) is risk unbiased.

Proof. Risk unbiasedness means that IE_θ L(T(X) − θ′) ≥ IE_θ L(T(X) − θ) ∀θ′ ≠ θ; in other words,
IE_0 L(T(X) − a) ≥ IE_0 L(T(X)) ∀a ≠ 0.
Let T* be the MRE. Then T*(X) = T_0(X) − v*(Y), where IE_0[L(T_0(X) − v*(Y)) | Y = y] = min. Then
IE_0[L(T*(X) − a)] = IE_0[L(T_0(X) − v*(Y) − a)] = IE_0{IE_0[L(T_0(X) − v*(Y) − a) | Y]} ≥ IE_0{IE_0[L(T_0(X) − v*(Y)) | Y]} = IE_0[L(T*(X))],
where we used the fact that v*(Y) + a is also an invariant function. □

Chapter 4  Asymptotic behavior of estimates

4.1 Consistency

Let X_1, ..., X_n be independent observations with distribution P_θ, θ ∈ Θ. We want to estimate the function g(θ). The estimator T_n is called
• a weakly consistent estimate of g(θ) if T_n → g(θ) in probability [P_θ] for every θ ∈ Θ as n → ∞,
• a strongly consistent estimate of g(θ) if T_n → g(θ) a.s. [P_θ] for every θ ∈ Θ as n → ∞.

Let R(θ, T_n) = IE_θ(T_n(X) − g(θ))² be the quadratic risk. Then

Theorem 4.1.1
(i) If lim_{n→∞} R(θ, T_n) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.
(ii) If lim_{n→∞} IE_θ T_n(X) = g(θ) and lim_{n→∞} var_θ T_n(X) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.
(iii) In particular, if T_n is unbiased for every n and lim_{n→∞} var_θ T_n(X) = 0 ∀θ ∈ Θ, then T_n is weakly consistent.

Proof. (i) By the Chebyshev inequality,
P_θ(|T_n(X) − g(θ)| > ε) ≤ (1/ε²) IE_θ|T_n(X) − g(θ)|² = (1/ε²) R(θ, T_n) → 0.
(ii) Writing b_n(θ) = IE_θ T_n − g(θ),
P_θ(|T_n(X) − g(θ)| > ε) ≤ (1/ε²) IE_θ[T_n − IE_θ T_n + IE_θ T_n − g(θ)]² ≤ (2/ε²)[var_θ T_n + (b_n(θ))²] → 0.
(iii) is a special case of (ii). □

The parameter θ, or the function g(θ), can be estimated consistently only if θ is identifiable, i.e. if θ_1 ≠ θ_2 implies P_{θ_1} ≠ P_{θ_2}.
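A quick simulation (added here; the exponential model is an arbitrary choice) illustrates Theorem 4.1.1(iii): the sample mean is unbiased for the expectation and its variance tends to zero, so the probability of an ε-deviation shrinks as n grows.

```python
import numpy as np

# Theorem 4.1.1 (iii): an unbiased estimator whose variance tends to zero is
# weakly consistent.  T_n = sample mean of n Exponential(scale=theta)
# observations, g(theta) = theta (the expectation); eps is fixed.
# The sample mean is drawn directly as Gamma(n, theta)/n to keep memory small.
rng = np.random.default_rng(5)
theta, eps, reps = 2.0, 0.2, 200_000

for n in (10, 100, 1000, 10000):
    Tn = rng.gamma(shape=n, scale=theta, size=reps) / n
    prob = np.mean(np.abs(Tn - theta) > eps)   # P(|T_n - theta| > eps)
    bound = theta**2 / (n * eps**2)            # Chebyshev bound R(theta, T_n)/eps^2
    print(f"n = {n:6d}   P(|T_n - theta| > eps) ~ {prob:.4f}   Chebyshev bound {bound:.4f}")
# The empirical probabilities (and the Chebyshev bound used in the proof) tend to zero.
```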
4.2 Efficiency

Definition 4.2.1 (Limiting risk efficiency of T_n with respect to T*_n). Assume that two sequences {T_n}, {T*_n} of estimates satisfy
$$\lim_{n\to\infty} n^r R(T_n, \theta) = \lim_{n\to\infty} n^r R(T^*_{m_n}, \theta) \tag{4.2.1}$$
for some sequence {m_n}_{n=1}^∞ and a fixed r > 0. Then the limit lim_{n→∞} m_n/n, if it exists and is independent of the particular choice of {m_n}, is called the limiting risk efficiency of T_n with respect to T*_n.

Definition 4.2.2 (Relative asymptotic efficiency of T_n with respect to T*_n). Let
$$\sqrt n\,(T_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2), \qquad \sqrt n\,(T^*_{m_n} - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2) \qquad \text{as } n\to\infty. \tag{4.2.2}$$
Then the limit e_{T,T*} = lim_{n→∞} m_n/n, if it exists and is independent of the particular choice of {m_n}, is called the relative asymptotic efficiency (ARE) of T_n with respect to T*_n.

Theorem 4.2.1 Let
$$\sqrt n\,(T_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma^2), \qquad \sqrt n\,(T^*_n - g(\theta)) \xrightarrow{\mathcal D} N(0, \sigma_*^2) \qquad \text{as } n\to\infty. \tag{4.2.3}$$
Then e_{T,T*} = σ*²/σ².

Proof. Assume (4.2.3). Then
$$\sqrt n\,(T^*_{m_n} - g(\theta)) = \sqrt{\tfrac{n}{m_n}}\; \sqrt{m_n}\,(T^*_{m_n} - g(\theta)),$$
where √n(T_n − g(θ)) → N(0, σ²) in distribution, n/m_n → 1/e_{T,T*}, and √(m_n)(T*_{m_n} − g(θ)) → N(0, σ*²) in distribution. Comparing the limits in (4.2.2) gives σ² = σ*²/e_{T,T*}, thus e_{T,T*} = σ*²/σ². □

Consider a system of distributions P = {P_θ; θ ∈ Θ} with densities f(x, θ) satisfying
(A0) P_{θ_1} ≠ P_{θ_2} for θ_1 ≠ θ_2.
(A1) B = {x : f(x, θ) > 0} does not depend on θ.
(A2) X_1, ..., X_n is a sample from a distribution with density f(x, θ_0), where θ_0 ∈ I ⊂ Θ for an open interval I.

Theorem 4.2.2 Under conditions (A0)–(A2), for any θ ≠ θ_0, θ ∈ Θ,
$$\lim_{n\to\infty} P_{\theta_0}\Big(\prod_{i=1}^n f(X_i, \theta_0) > \prod_{i=1}^n f(X_i, \theta)\Big) = 1. \tag{4.2.4}$$

Proof. By the law of large numbers and the Jensen inequality, as n → ∞,
$$\frac1n \sum_{i=1}^n \log\frac{f(X_i, \theta)}{f(X_i, \theta_0)} \xrightarrow{P_{\theta_0}} \mathbb{E}_{\theta_0} \log\frac{f(X, \theta)}{f(X, \theta_0)} < \log \mathbb{E}_{\theta_0} \frac{f(X, \theta)}{f(X, \theta_0)} = 0.$$
This implies
$$P_{\theta_0}\Big(\frac1n \sum_{i=1}^n \log\frac{f(X_i, \theta)}{f(X_i, \theta_0)} > 0\Big) \to 0,$$
which gives (4.2.4). □

Denote L(θ, X) = log ∏_{i=1}^n f(X_i, θ) (the log-likelihood). The maximum likelihood estimate (MLE) of θ is defined as a solution of the maximization L(θ, X) = max, θ ∈ Θ. It is one of the solutions of the likelihood equation
$$\frac{\partial L(\theta, \mathbf X)}{\partial\theta} = \sum_{i=1}^n \frac{\dot f(X_i, \theta)}{f(X_i, \theta)} = 0. \tag{4.2.5}$$

Assume that conditions (A0)–(A2) are satisfied and that f(x, θ) is differentiable in θ ∈ I ⊂ Θ, where θ_0 ∈ I. Then

Theorem 4.2.3 Under the above conditions there exists a root θ̂_n of the likelihood equation (4.2.5) such that θ̂_n → θ_0 in probability [P_{θ_0}] as n → ∞.

Proof. Let a > 0 be such that (θ_0 − a, θ_0 + a) ⊂ I. Let
S_n = {x : L(θ_0, x) > L(θ_0 − a, x) and L(θ_0, x) > L(θ_0 + a, x)}.
By Theorem 4.2.2, lim_{n→∞} P_{θ_0}(S_n) = 1. On S_n there is a local maximum θ̂_n between θ_0 − a and θ_0 + a, and it satisfies L′(θ̂_n) = 0. Let θ*_n be the root of L′(θ) = 0 closest to θ_0. Then
lim_{n→∞} P_{θ_0}(|θ*_n − θ_0| < a) = 1 ∀a > 0. □

Remark 4.2.1 We know that θ*_n exists as the root closest to θ_0, but we are not able to find it, because θ_0 is unknown; everything holds only with probability tending to 1. If the likelihood equation has only one root T_n for all n and all x, then T_n is a consistent estimate of θ_0.

Theorem 4.2.4 Let the conditions (A0)–(A2) be satisfied and let, moreover,
(A3) |∂³ log f(x, θ)/∂θ³| ≤ M(x) for x ∈ B and |θ − θ_0| < C, where M(x) satisfies IE_{θ_0} M(X) < ∞.
Then every consistent sequence θ̂_n = θ̂_n(X) of roots of the likelihood equation is asymptotically normally distributed,
$$\mathcal L\big(\sqrt n\,(\hat\theta_n - \theta_0)\big) \xrightarrow{\mathcal D} N\Big(0, \frac{1}{I(\theta_0)}\Big),$$
where
$$I(\theta) = \int \Big(\frac{\partial \log f(x,\theta)}{\partial\theta}\Big)^2 f(x,\theta)\, d\mu$$
is the Fisher information.

Some steps of the proof. Expanding L′_n(θ̂_n) around θ_0,
$$0 = n^{-1/2} L'_n(\hat\theta_n) = n^{-1/2}\sum_{i=1}^n \frac{\dot f(X_i, \hat\theta_n)}{f(X_i, \hat\theta_n)} = n^{-1/2} L'_n(\theta_0) + n^{1/2}(\hat\theta_n - \theta_0)\cdot\frac1n L''_n(\theta_0) + \frac12 n^{-1/2}\big[n^{1/2}(\hat\theta_n - \theta_0)\big]^2 \frac1n L'''_n(\theta^*_n)$$
with θ*_n between θ_0 and θ̂_n. Then
$$n^{1/2}(\hat\theta_n - \theta_0) \approx -\frac{n^{-1/2} L'_n(\theta_0)}{n^{-1} L''_n(\theta_0)} - \frac12 (\hat\theta_n - \theta_0)\,\frac{n^{-1} L'''_n(\theta^*_n)}{n^{-1} L''_n(\theta_0)}.$$
We should show that
$$n^{-1/2} L'_n(\theta_0) \xrightarrow{\mathcal D} N(0, I(\theta_0)), \tag{4.2.6}$$
$$-\frac1n L''_n(\theta_0) = -\frac1n \sum_{i=1}^n \frac{\ddot f(X_i, \theta_0)}{f(X_i, \theta_0)} + \frac1n \sum_{i=1}^n \Big(\frac{\dot f(X_i, \theta_0)}{f(X_i, \theta_0)}\Big)^2 \xrightarrow{p} I(\theta_0), \tag{4.2.7}$$
$$\frac1n L'''_n(\theta^*_n) = O_p(1). \tag{4.2.8}$$
Here (4.2.6) follows from the central limit theorem, (4.2.7) from the law of large numbers, and (4.2.8) from the consistency of θ̂_n and from (A3). Then we obtain
$$0 = n^{-1/2} L'_n(\hat\theta_n) \approx N(0, I(\theta_0)) - n^{1/2}(\hat\theta_n - \theta_0)\, I(\theta_0) + \frac{1}{2\sqrt n}\big[\sqrt n\,(\hat\theta_n - \theta_0)\big]^2 \frac1n L'''_n(\theta^*_n),$$
thus
$$\sqrt n\,(\hat\theta_n - \theta_0) \approx -\frac{n^{-1/2} L'_n(\theta_0)}{n^{-1} L''_n(\theta_0)} - \frac12(\hat\theta_n - \theta_0)\,\frac{n^{-1} L'''_n(\theta^*_n)}{n^{-1} L''_n(\theta_0)} \approx N\Big(0, \frac{1}{I(\theta_0)}\Big) + o_p(1). \qquad\square$$
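The conclusion of Theorem 4.2.4 can be checked by simulation. In the sketch below (an added illustration; the Poisson model is an arbitrary choice) the likelihood equation has the explicit root θ̂_n = X̄_n and I(θ) = 1/θ, so √n(θ̂_n − θ_0) should be approximately N(0, θ_0).

```python
import numpy as np

# Monte Carlo check of Theorem 4.2.4 in the Poisson(theta) model:
# the likelihood equation gives theta_hat = sample mean, and the Fisher
# information is I(theta) = 1/theta, so sqrt(n)(theta_hat - theta_0) should be
# approximately N(0, theta_0) for large n.
rng = np.random.default_rng(6)
theta0, n, reps = 3.0, 400, 20_000

X = rng.poisson(theta0, size=(reps, n))
theta_hat = X.mean(axis=1)              # root of the likelihood equation
Z = np.sqrt(n) * (theta_hat - theta0)   # normalized estimation error

print(f"empirical variance of sqrt(n)(theta_hat - theta_0): {Z.var():.3f}")
print(f"theoretical 1/I(theta_0) = theta_0:                 {theta0:.3f}")
# The empirical variance is close to 1/I(theta_0), and a histogram of Z would
# look approximately normal, as the theorem asserts.
```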
Remark 4.2.2 Such an estimator is called an efficient likelihood estimator. It usually is the maximum likelihood estimator, but not necessarily.

Corollary 4.2.1 If the likelihood equation has only one root, or if it has multiple roots with probability tending to 0 as n → ∞, then, under the conditions of Theorem 4.2.4, the maximum likelihood estimator is asymptotically efficient.

Example 4.2.1 (One-parameter exponential family). Let
f(x, θ) = exp{θ T(x) + A(θ)}.
The maximization
$$\sum_{i=1}^n \log f(X_i, \theta) = \theta\sum_{i=1}^n T(X_i) + nA(\theta) = \max$$
leads to the likelihood equation
$$\frac1n \sum_{i=1}^n T(X_i) = -A'(\theta) = \mathbb{E}_\theta\Big[\frac1n \sum_{i=1}^n T(X_i)\Big]. \tag{4.2.9}$$
Here, because ∫ f(x, θ) dμ = 1,
$$0 = \int \big(A'(\theta) + T(x)\big) \exp\{\theta T(x) + A(\theta)\}\, d\mu \;\Longrightarrow\; A'(\theta) = -\mathbb{E}_\theta T(X).$$
We can show that IE_θ T(X) is increasing in θ: indeed,
$$\frac{\partial}{\partial\theta}\, \mathbb{E}_\theta T(X) = \int T(x)\big(A'(\theta) + T(x)\big) \exp\{\theta T(x) + A(\theta)\}\, d\mu = \mathbb{E}_\theta T^2(X) - \big(\mathbb{E}_\theta T(X)\big)^2 = \mathrm{var}_\theta T(X) > 0.$$
Thus the likelihood equation IE_θ T(X) = (1/n)Σ_{i=1}^n T(X_i) has at most one solution, and the conditions of Theorem 4.2.4 are satisfied. Thus, with probability tending to 1, the likelihood equation has one root θ̂_n, which is consistent, asymptotically efficient and asymptotically normal,
$$\sqrt n\,(\hat\theta_n - \theta) \xrightarrow{\mathcal D} N\Big(0, \frac{1}{\mathrm{var}_\theta T}\Big),$$
because
$$I(\theta) = \mathbb{E}_\theta\Big(\frac{\partial \log f(X, \theta)}{\partial\theta}\Big)^2 = \mathbb{E}_\theta[T(X) + A'(\theta)]^2 = \mathrm{var}_\theta T(X).$$

Example 4.2.2 (Truncated normal distribution). Let X_1, ..., X_n have the normal distribution N(θ, 1) truncated to (a, b), a < b, with density
$$p(x, \theta) = \frac{\frac{1}{\sqrt{2\pi}} \exp\{-\frac{(x-\theta)^2}{2}\}}{\Phi(b - \theta) - \Phi(a - \theta)}, \qquad a < x < b,$$
and p(x, θ) = 0 otherwise. Thus
$$p(x, \theta) = \exp\Big\{\theta x - \frac{\theta^2}{2} - \frac{x^2}{2} + \log\frac{1}{\sqrt{2\pi}} - \log[\Phi(b - \theta) - \Phi(a - \theta)]\Big\}, \qquad a < x < b,$$
so T(x) = x and the likelihood equation has the form X̄_n = IE_θ X. If θ → −∞, then X → a in probability and IE_θ X → a; if θ → +∞, then IE_θ X → b. Since IE_θ X is continuous in θ and a < X̄_n < b, the likelihood equation has a root.

4.2.1 Shift parameter

Let X_1, ..., X_n be a sample from a population with density f(x − θ). The MLE θ̂_n is a solution of
$$\prod_{i=1}^n f(X_i - \theta) := \max$$
and it is equivariant. The Pitman estimate T*_n is asymptotically equivalent to θ̂_n in the sense that √n(θ̂_n − T*_n) → 0 in probability as n → ∞ (Stone 1974). The likelihood equation can be rewritten as
$$\sum_{i=1}^n \frac{f'(X_i - \theta)}{f(X_i - \theta)} = 0. \tag{4.2.10}$$
If f is strongly unimodal, i.e. −f′/f is strictly increasing, then (4.2.10) has at most one root. Because ∏_{i=1}^n f(X_i − θ) → 0 as θ → ±∞, the product ∏_{i=1}^n f(X_i − θ) attains its maximum inside the real line; hence the root of (4.2.10) exists and is asymptotically efficient.

4.2.2 Multiple root

Let L(θ, x) = log ∏_{i=1}^n f(x_i, θ). Assume that the equation
$$L'(\theta) = \sum_{i=1}^n \frac{\dot f(X_i, \theta)}{f(X_i, \theta)} = 0 \tag{4.2.11}$$
can have multiple roots, but that there exists a consistent estimate θ̃⁰_n.

Theorem 4.2.5
(i) Let θ̃⁰_n be a consistent estimate and let the conditions (A0)–(A2) hold. Then the root of equation (4.2.11) closest to θ̃⁰_n is also consistent, and hence it is asymptotically efficient.
(ii) Let θ̃_n be a consistent initial estimate satisfying √n(θ̃_n − θ) = O_p(1) as n → ∞. Put
$$T_n = \tilde\theta_n - \frac{L'(\tilde\theta_n)}{L''(\tilde\theta_n)}.$$
Then T_n is an asymptotically efficient estimate of θ, i.e. √n(T_n − θ) → N(0, 1/I(θ)) in distribution.

The proof is similar to the proof of Theorem 4.2.4.
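Below is a minimal sketch of the one-step construction in Theorem 4.2.5(ii), added here as an illustration. In the Cauchy location model the likelihood equation may have several roots, the sample median is a √n-consistent initial estimate, and one Newton step from it is asymptotically efficient (1/I(θ) = 2, versus the asymptotic variance π²/4 ≈ 2.47 of the median).

```python
import numpy as np

# One-step (Newton) efficient estimator of Theorem 4.2.5 (ii) in the Cauchy
# location model f(x - theta) = 1 / (pi * (1 + (x - theta)^2)).
# Initial estimate: the sample median (sqrt(n)-consistent).
def score(theta, x):          # L'(theta) = sum of d/dtheta log f(x_i - theta)
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def score_deriv(theta, x):    # L''(theta)
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

rng = np.random.default_rng(7)
theta0, n, reps = 1.0, 200, 10_000
errors = np.empty(reps)
for r in range(reps):
    x = theta0 + rng.standard_cauchy(n)
    init = np.median(x)                                # consistent initial estimate
    Tn = init - score(init, x) / score_deriv(init, x)  # one Newton step
    errors[r] = Tn - theta0

print(f"n * var(T_n)    ~ {n * errors.var():.3f}   (efficient bound 1/I(theta) = 2)")
print(f"n * var(median) ~ {np.pi**2 / 4:.3f}   (asymptotic value, for comparison)")
# In finite samples L''(theta) can occasionally be close to zero; a common
# practical variant replaces it by -n * I(init) = -n/2 for stability.
```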