ESTIMATION WITH QUADRATIC LOSS

W. JAMES, Fresno State College, and CHARLES STEIN, Stanford University

This work was supported in part by an ONR contract at Stanford University.

1. Introduction

It has long been customary to measure the adequacy of an estimator by the smallness of its mean squared error. The least squares estimators were studied by Gauss and by other authors later in the nineteenth century. A proof that the best unbiased estimator of a linear function of the means of a set of observed random variables is the least squares estimator was given by Markov [12], a modified version of whose proof is given by David and Neyman [4]. A slightly more general theorem is given by Aitken [1]. Fisher [5] indicated that for large samples the maximum likelihood estimator approximately minimizes the mean squared error when compared with other reasonable estimators. This paper is concerned with optimum properties, or the failure of optimum properties, of the natural estimator in certain special problems, with the risk usually measured by the mean squared error or, in the case of several parameters, by a quadratic function of the estimators. We shall first mention some recent papers on this subject and then give some results, mostly unpublished, in greater detail.

Pitman [13] in 1939 discussed the estimation of location and scale parameters and obtained the best estimator among those invariant under the affine transformations leaving the problem invariant. He considered various loss functions, in particular, mean squared error. Wald [18], also in 1939, in what may be considered the first paper on statistical decision theory, did the same for location parameters alone, and tried to show in his theorem 5 that the estimator obtained in this way is admissible, that is, that there is no estimator whose risk is nowhere greater and somewhere smaller. However, his proof of theorem 5 is not convincing, since he interchanges the order of integration in (30) without comment, and it is not clear that this integral is absolutely convergent. To our knowledge, no counterexample to this theorem is known, but in higher-dimensional cases, where the analogous argument seems at first glance only slightly less plausible, counterexamples are given in Blackwell [2] (which is discussed briefly at the end of section 3 of this paper) and in [14], which is repeated in section 2 of this paper. In the paper of Blackwell an analogue of Wald's theorem 5 is proved for the special case of a distribution concentrated on a finite arithmetic progression. Hodges and Lehmann [7] proved some results concerning special exponential families, including Wald's theorem 5 for the special problem of estimating the mean of a normal distribution. Some more general results for the estimation of the mean of a normal distribution, including sequential estimation with a somewhat arbitrary loss function, were obtained by Blyth [3], whose method is a principal tool of this paper. Girshick and Savage [6] proved the minimax property (which is weaker than admissibility here) of the natural estimator in Wald's problem and also generalized the results of Hodges and Lehmann to an arbitrary exponential family. Karlin [8] proved Wald's theorem 5 for mean squared error and for certain other loss functions under fairly weak conditions and also generalized the results of Girshick and Savage for exponential families.
One of the authors, in [16], proved the result for mean squared error under weaker and simpler conditions than Karlin. This is given without complete proof as theorem 1 in section 3 of the present paper. Section 2 of this paper gives a new proof by the authors of the result of [14] that the usual estimator of the mean of a multivariate normal distribution with the identity as covariance matrix is inadmissible when the loss is the sum of squares of the errors in the different coordinates, if the dimension is at least three. An explicit formula is given for an estimator, still inadmissible, whose risk is never more than that of the usual estimator and is considerably less near the origin. Other distributions and other loss functions are considered later in section 2. In section 3 the general problem of admissibility of estimators for problems with quadratic loss is formulated, a sufficient condition for admissibility is given, and its relation to the necessary and sufficient condition of [15] is briefly discussed. In section 4 theorems are given which show that under weak conditions Pitman's estimator for one or two location parameters is admissible when the loss is taken to be equal to the sum of squares of the errors. Conjectures are discussed for the more difficult problem where unknown location parameters are also present as nuisance parameters, and Blackwell's example is given. In section 5 a problem in multivariate analysis is given where the natural estimator is not even minimax although it has constant risk. These are related to the examples of one of the authors quoted by Kiefer [9] and Lehmann [11]. In section 6 some unsolved problems are mentioned. The results of section 2 were obtained by the two authors working together. The remainder of the paper is the work of C. Stein.

2. Inadmissibility of the usual estimator for three or more location parameters

Let us first look at the spherically symmetric normal case, where the inadmissibility of the usual estimator was first proved in [14]. Let X be a normally distributed p-dimensional coordinate vector with unknown mean ξ = EX and covariance matrix equal to the identity matrix, that is, E(X − ξ)(X − ξ)' = I. We are interested in estimating ξ, say by ξ̂, and define the loss to be

(1) L(ξ, ξ̂) = (ξ̂ − ξ)'(ξ̂ − ξ) = ‖ξ̂ − ξ‖²,

using the notation

(2) ‖x‖² = x'x.

The usual estimator is φ₀, defined by

(3) φ₀(x) = x,

and its risk is

(4) ρ(ξ, φ₀) = E L[ξ, φ₀(X)] = E(X − ξ)'(X − ξ) = p.

It is well known that among all unbiased estimators, or among all translation-invariant estimators (those φ for which φ(x + c) = φ(x) + c for all vectors x and c), this estimator φ₀ has minimum risk for all ξ. However, we shall see that for p ≥ 3,

(5) E‖(1 − (p − 2)/‖X‖²)X − ξ‖² = p − (p − 2)² E[1/(p − 2 + 2K)] < p,

where K has a Poisson distribution with mean ‖ξ‖²/2. Thus the estimator φ₁ defined by

(6) φ₁(x) = (1 − (p − 2)/‖x‖²) x

has smaller risk than φ₀ for all ξ. In fact, the risk of φ₁ is 2 at ξ = 0 and increases gradually with ‖ξ‖² to the value p as ‖ξ‖² → ∞. Although φ₁ is not admissible, it seems unlikely that there are spherically symmetrical estimators which are appreciably better than φ₁. An analogous result is given in formulas (19) and (21) for the case where E(X − ξ)(X − ξ)' = σ²I, where σ² is unknown, but we observe S distributed independently of X as σ² times a χ² with n degrees of freedom. For p ≤ 2 it is shown in [14], and also follows from the results of section 3, that the usual estimator is admissible.
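The risk formula (5) is easy to check by simulation. The sketch below is an editorial illustration rather than part of the original paper; it assumes NumPy and estimates both risks by Monte Carlo for p = 5, where (5) gives risk 2 at ξ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 5, 200_000

def mc_risk(xi, shrink):
    """Monte Carlo estimate of E||estimate - xi||^2 for X ~ N_p(xi, I)."""
    X = rng.standard_normal((n_rep, p)) + xi
    est = X
    if shrink:
        norm2 = np.sum(X**2, axis=1, keepdims=True)
        est = (1.0 - (p - 2) / norm2) * X      # the estimator (6)
    return np.sum((est - xi)**2, axis=1).mean()

for c in (0.0, 1.0, 3.0):
    xi = np.full(p, c)
    print(f"||xi||^2 = {p * c * c:5.1f}: usual {mc_risk(xi, False):6.3f}, "
          f"James-Stein {mc_risk(xi, True):6.3f}")
```

The first column stays near p = 5, while the second starts near 2 at the origin and climbs toward p as ‖ξ‖² grows, as described after (6).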
We compute the risk of the estimator φ₂ defined by

(7) φ₂(x) = (1 − b/‖x‖²) x,

where b is a positive constant. We have

(8) ρ(ξ, φ₂) = E‖(1 − b/‖X‖²)X − ξ‖²
= E‖X − ξ‖² − 2b E[(X − ξ)'X/‖X‖²] + b² E[1/‖X‖²]
= p − 2b E[(X − ξ)'X/‖X‖²] + b² E[1/‖X‖²].

It is well known that ‖X‖², a noncentral χ² with p degrees of freedom and noncentrality parameter ‖ξ‖², is distributed the same as a random variable W obtained by taking a random variable K having a Poisson distribution with mean ‖ξ‖²/2 and then taking the conditional distribution of W given K to be that of a central χ² with p + 2K degrees of freedom. Thus

(9) E[1/‖X‖²] = E{E[1/χ²_{p+2K} | K]} = E[1/(p − 2 + 2K)].

To compute the expected value of the middle term on the right side of (8) let

(10) U = ξ'X/‖ξ‖, V = ‖X‖² − U².

Then

(11) W = ‖X‖² = U² + V,

and U is normally distributed with mean ‖ξ‖ and variance 1, while V is independent of U and has a χ² distribution with p − 1 degrees of freedom. The joint density of U and V is

(12) [1/√(2π)] exp{−(u − ‖ξ‖)²/2} · {1/[2^{(p−1)/2} Γ((p−1)/2)]} v^{(p−3)/2} e^{−v/2}

if v ≥ 0, and 0 if v < 0. Thus the joint density of U and W is

(13) {1/[√(2π) 2^{(p−1)/2} Γ((p−1)/2)]} (w − u²)^{(p−3)/2} exp{−‖ξ‖²/2 + ‖ξ‖u − w/2}

if u² ≤ w, and 0 elsewhere. It follows that

(14) E[U/W] = {exp(−‖ξ‖²/2)/[√(2π) 2^{(p−1)/2} Γ((p−1)/2)]} ∫₀^∞ (dw/w) e^{−w/2} ∫_{−√w}^{√w} u (w − u²)^{(p−3)/2} e^{‖ξ‖u} du.

Making the change of variable t = u/√w we find

(15) ∫₀^∞ (dw/w) e^{−w/2} ∫_{−√w}^{√w} u (w − u²)^{(p−3)/2} e^{‖ξ‖u} du
= ∫₀^∞ w^{(p−1)/2 − 1} e^{−w/2} dw ∫_{−1}^{1} t (1 − t²)^{(p−3)/2} exp{‖ξ‖ t √w} dt
= Σ_{j=0}^∞ [‖ξ‖^{2j+1}/(2j+1)!] ∫₀^∞ w^{p/2+j−1} e^{−w/2} dw ∫_{−1}^{1} t^{2j+2} (1 − t²)^{(p−3)/2} dt
= Σ_{j=0}^∞ [‖ξ‖^{2j+1}/(2j+1)!] 2^{p/2+j} Γ(p/2 + j) B(j + 3/2, (p−1)/2)
= √(2π) 2^{(p−1)/2} Γ((p−1)/2) Σ_{j=0}^∞ ‖ξ‖^{2j+1} / [2^{j+1} Γ(j+1) (p/2 + j)].

It follows from (10), (11), (14), and (15) that

(16) E[(X − ξ)'X/‖X‖²] = 1 − ‖ξ‖ E[U/W]
= 1 − exp(−‖ξ‖²/2) Σ_{j=0}^∞ [(‖ξ‖²/2)^j/j!] · ‖ξ‖²/(p + 2j)
= exp(−‖ξ‖²/2) Σ_{i=0}^∞ [(‖ξ‖²/2)^i/i!] · (p − 2)/(p − 2 + 2i)
= (p − 2) E[1/(p − 2 + 2K)],

where K again has a Poisson distribution with mean ‖ξ‖²/2. Combining (8), (9), and (16) we find

(17) ρ(ξ, φ₂) = E‖(1 − b/‖X‖²)X − ξ‖² = p − 2(p − 2)b E[1/(p − 2 + 2K)] + b² E[1/(p − 2 + 2K)].

This is minimized, for all ξ, by taking b = p − 2, and leads to the use of the estimator φ₁ defined by (6) and the formula (5) for its risk.

Now let us look at the case where X has mean ξ and covariance matrix given by

(18) E(X − ξ)(X − ξ)' = σ²I,

and we observe S, independent of X, distributed as σ² times a χ² with n degrees of freedom. Both ξ and σ² are unknown. We consider the estimator φ₃ defined by

(19) φ₃(x, s) = (1 − a s/‖x‖²) x,

where a is a nonnegative constant. We have

(20) ρ[(ξ, σ), φ₃] = E‖φ₃(X, S) − ξ‖²
= E‖X − ξ‖² − 2a E[S (X − ξ)'X/‖X‖²] + a² E[S²/‖X‖²]
= σ² {p − 2an(p − 2) E[1/(p − 2 + 2K)] + a² n(n + 2) E[1/(p − 2 + 2K)]}

by (9) and (16), with X and ξ replaced by X/σ and ξ/σ respectively, and using E(S/σ²) = n and E(S/σ²)² = n(n + 2). Here K has a Poisson distribution with mean ‖ξ‖²/2σ². The choice a = (p − 2)/(n + 2) minimizes (20) for all ξ, giving it the value

(21) ρ[(ξ, σ), φ₃] = σ² {p − [n/(n + 2)] (p − 2)² E[1/(p − 2 + 2K)]}.
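As a numerical check of (21) — again an editorial sketch assuming NumPy, not part of the paper — take p = 5, n = 10, σ = 2 and ξ = 0; then (21) gives σ²[p − (n/(n+2))(p−2)] = 10, against σ²p = 20 for the usual estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma, n_rep = 5, 10, 2.0, 200_000
a = (p - 2) / (n + 2)                   # the minimizing constant in (20)

xi = np.zeros(p)
X = xi + sigma * rng.standard_normal((n_rep, p))
S = sigma**2 * rng.chisquare(n, size=(n_rep, 1))    # sigma^2 times chi^2_n

est = (1.0 - a * S / np.sum(X**2, axis=1, keepdims=True)) * X   # phi_3 of (19)
print("usual:", np.sum((X - xi)**2, axis=1).mean())      # about 20
print("phi_3:", np.sum((est - xi)**2, axis=1).mean())    # about 10
```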
We can also treat the case where the covariance matrix is unknown but an estimate based on a Wishart matrix is available. Let X and S be independently distributed, X having a p-dimensional normal distribution with mean ξ and covariance matrix Σ, and S being distributed as a p × p Wishart matrix with n degrees of freedom and expectation nΣ, where both ξ and Σ are unknown and Σ is nonsingular. Suppose we want to estimate ξ by ξ̂ with loss function

(22) L[(ξ, Σ), ξ̂] = (ξ̂ − ξ)'Σ⁻¹(ξ̂ − ξ).

We consider estimators of the form

(23) φ(X, S) = (1 − c/(X'S⁻¹X)) X.

The risk function of φ is given by

(24) ρ[(ξ, Σ), φ] = E_{ξ,Σ} [φ(X, S) − ξ]'Σ⁻¹[φ(X, S) − ξ]
= E_{ξ,Σ} [(1 − c/(X'S⁻¹X))X − ξ]'Σ⁻¹[(1 − c/(X'S⁻¹X))X − ξ]
= E_{ξ*,I} ‖(1 − c/(X'S⁻¹X))X − ξ*‖²,

where ξ*' = [(ξ'Σ⁻¹ξ)^{1/2}, 0, ⋯, 0]. But it is well known (see, for example, Wijsman [19]) that the conditional distribution of X'S⁻¹X given X is that of X'X/S*, where S* is distributed as χ²_{n−p+1} independently of X. Comparing (24) and (20) we see that the optimum choice of c is (p − 2)/(n − p + 3) and, for this choice, the risk function is given by

(25) ρ[(ξ, Σ), φ] = p − [(n − p + 1)/(n − p + 3)] (p − 2)² E[1/(p − 2 + 2K)],

where K has a Poisson distribution with mean (1/2) ξ'Σ⁻¹ξ.
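Formula (25) can be checked the same way. In this editorial sketch (NumPy assumed; the covariance matrix is an arbitrary choice, not from the paper), p = 5, n = 20 and ξ = 0, so (25) gives p − [(n−p+1)/(n−p+3)](p−2) ≈ 2.33 against p = 5 for the usual estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, n_rep = 5, 20, 20_000
c = (p - 2) / (n - p + 3)              # the optimum constant for (23)

A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # an arbitrary nonsingular covariance
root = np.linalg.cholesky(Sigma)
Sig_inv = np.linalg.inv(Sigma)
xi = np.zeros(p)

loss_usual = loss_shrunk = 0.0
for _ in range(n_rep):
    X = xi + root @ rng.standard_normal(p)
    Z = rng.standard_normal((n, p)) @ root.T      # n rows from N(0, Sigma)
    S = Z.T @ Z                                   # Wishart, n degrees of freedom
    est = (1.0 - c / (X @ np.linalg.solve(S, X))) * X   # the estimator (23)
    loss_usual += (X - xi) @ Sig_inv @ (X - xi)
    loss_shrunk += (est - xi) @ Sig_inv @ (est - xi)

print(loss_usual / n_rep, loss_shrunk / n_rep)    # about 5 and about 2.33
```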
The improvement achieved by these estimators over the usual estimator may be understood better if we break up the error into its component along X and its component orthogonal to X. For simplicity we consider the case where the covariance matrix is known to be the identity. If we consider any estimator which lies along X, the error orthogonal to X is ξ − (ξ'X/‖X‖²)X and its mean square is

(26) E_ξ ‖ξ − (ξ'X/‖X‖²)X‖² = ‖ξ‖² − E_ξ[(ξ'X)²/‖X‖²] = (p − 1)‖ξ‖² E[1/(p + 2K)] = (p − 1){1 − (p − 2) E[1/(p − 2 + 2K)]}.

Thus the mean square of the component along X of the error of [1 − (p − 2)/‖X‖²]X is

(27) p − (p − 2)² E[1/(p − 2 + 2K)] − (p − 1){1 − (p − 2) E[1/(p − 2 + 2K)]} = 1 + (p − 2) E[1/(p − 2 + 2K)] ≤ 2.

On seeing the results given above several people have expressed the fear that they are closely tied up with the use of an unbounded loss function, which many people consider unreasonable. We give an example to show that, at least qualitatively, this is not so. Again, for simplicity, we suppose X a p-variate random vector normally distributed with mean ξ and covariance matrix equal to the identity matrix. Suppose we are interested in estimating ξ by ξ̂ with loss function

(28) L(ξ, ξ̂) = F(‖ξ̂ − ξ‖²),

where F has a bounded derivative and is continuously differentiable and concave (that is, F″(t) ≤ 0 for all t > 0). We shall show that for sufficiently small b and large a (independent of ξ) the estimator φ defined by

(29) φ(x) = (1 − b/(a + ‖x‖²)) x

has smaller risk than the usual estimator X. We have (with Y = X − ξ)

(30) ρ(ξ, φ) = E_ξ F(‖(1 − b/(a + ‖X‖²))X − ξ‖²)
= E_ξ F(‖X − ξ‖² − 2b X'(X − ξ)/(a + ‖X‖²) + b² ‖X‖²/(a + ‖X‖²)²)
≤ E_ξ F(‖Y‖²) − 2b E_ξ{[(Y + ξ)'Y/(a + ‖Y + ξ‖²)] F′(‖Y‖²)} + b² E_ξ{[‖Y + ξ‖²/(a + ‖Y + ξ‖²)²] F′(‖Y‖²)},

by the concavity of F, so that the improvement in risk over the usual estimator is at least

(31) 2b E_ξ{[(Y + ξ)'Y/(a + ‖Y + ξ‖²)] F′(‖Y‖²)} − b² E_ξ{[‖Y + ξ‖²/(a + ‖Y + ξ‖²)²] F′(‖Y‖²)}.

To see that this is everywhere positive if b is sufficiently small and a sufficiently large we look at

(32) f(A) = A E[‖Y‖² F′(‖Y‖²)/(A + ‖Y‖²)]

and

(33) g(A) = A E[F′(‖Y‖²)/(A + ‖Y‖²)].

It is clear that f and g are continuous strictly positive valued functions on [1, ∞) with finite nonzero limits at ∞. It follows that

(34) c = inf_{A ≥ 1} f(A) > 0,

(35) d = sup_{A ≥ 1} g(A) < ∞.

Then, if b is chosen so that

(36) 2c − bd > 0,

it follows from (30) and (31) that, for sufficiently large a,

(37) ρ(ξ, φ) < E_ξ F(‖X − ξ‖²)

for all ξ.

The inadmissibility of the usual estimator for three or more location parameters does not require the assumption of normality. We shall give the following result without proof. Let X be a p-dimensional random coordinate vector with mean ξ and finite fourth absolute moments:

(38) E(X_i − ξ_i)⁴ ≤ C < ∞.

For simplicity of notation we assume the X_i are uncorrelated and write

(39) σ_i² = E(X_i − ξ_i)².

Then for p ≥ 3 and

(40) b < 2(p − 2) min_i σ_i²,

and sufficiently large a, depending only on C,

(41) E{(1 − b/(σ_i²[a + Σ_j X_j²/σ_j²])) X_i − ξ_i}² < E(X_i − ξ_i)² = σ_i².

It would be desirable to obtain explicit formulas for estimators one can seriously recommend in the last two cases considered above.

3. Formulation of the general problem of admissible estimation with quadratic loss

Let 𝒵 be a set (the sample space), ℬ a σ-algebra of subsets of 𝒵, and λ a σ-finite measure on ℬ. Let Θ be another set (the parameter space), 𝒞 a σ-algebra of subsets of Θ, and p(·|·) a nonnegative valued ℬ-measurable function on 𝒵 × Θ such that for each θ ∈ Θ, p(·|θ) is a probability density with respect to λ, that is,

(42) ∫ p(z|θ) dλ(z) = 1.

Let A, the action space, be the k-dimensional real coordinate space, α a 𝒞-measurable function on Θ to the set of positive semidefinite symmetric k × k matrices, and η a 𝒞-measurable function on Θ to A. We observe Z distributed over 𝒵 according to the probability density p(·|θ) with respect to λ, where θ is unknown, then choose an action a ∈ A and suffer the loss

(43) L(θ, a) = [a − η(θ)]'α(θ)[a − η(θ)].

An estimator φ of η(θ) is a ℬ-measurable function on 𝒵 to A, the interpretation being that after observing Z one takes action φ(Z), or in other words, estimates η(θ) to be φ(Z). The risk function ρ(·, φ) is the function on Θ defined by

(44) ρ(θ, φ) = E_θ L[θ, φ(Z)] = E_θ [φ(Z) − η(θ)]'α(θ)[φ(Z) − η(θ)].
Suppose φ₁ is an estimator, ε > 0, and C a set in 𝒞 such that Π(C ∩ S_ε) > 0, where S_ε is the set of θ for which

(51) E_θ[φ₁(X) − η(θ)]'α(θ)[φ₁(X) − η(θ)] ≤ E_θ[φ(X) − η(θ)]'α(θ)[φ(X) − η(θ)] − ε.
Then, for any q ∈ g(C),

(52) ∫ q(θ) dΠ(θ) E_θ[φ₁(X) − η(θ)]'α(θ)[φ₁(X) − η(θ)]
≤ ∫ q(θ) dΠ(θ) E_θ[φ(X) − η(θ)]'α(θ)[φ(X) − η(θ)] − ε ∫_{C∩S_ε} q(θ) dΠ(θ).
It follows that

(53) ε Π(C ∩ S_ε) ≤ ε ∫_{C∩S_ε} q(θ) dΠ(θ) / ∫_C q(θ) dΠ(θ)
≤ {∫ q(θ) dΠ(θ) E_θ[φ(X) − η(θ)]'α(θ)[φ(X) − η(θ)] − ∫ q(θ) dΠ(θ) E_θ[φ₁(X) − η(θ)]'α(θ)[φ₁(X) − η(θ)]} / ∫_C q(θ) dΠ(θ)
≤ {∫ q(θ) dΠ(θ) E_θ[φ(X) − η(θ)]'α(θ)[φ(X) − η(θ)] − ∫ q(θ) dΠ(θ) E_θ[φ_{(q,Π)}(X) − η(θ)]'α(θ)[φ_{(q,Π)}(X) − η(θ)]} / ∫_C q(θ) dΠ(θ)
= ∫ q(θ) dΠ(θ) E_θ[φ(X) − φ_{(q,Π)}(X)]'α(θ)[φ(X) − φ_{(q,Π)}(X)] / ∫_C q(θ) dΠ(θ),
and this contradicts (49).
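The last inequality and the final equality in (53) both rest on the defining property of the Bayes estimator φ_{(q,Π)} with respect to the prior q dΠ; the following display, an editorial gloss written formally, records the orthogonality behind them.

```latex
% Let E denote (formal) expectation under the joint law of (\theta, Z), with
% \theta distributed according to q\,d\Pi. Given Z, the Bayes estimator is
%   \varphi_{(q,\Pi)}(Z) = \{E[\alpha(\theta) \mid Z]\}^{-1} E[\alpha(\theta)\eta(\theta) \mid Z],
% so that E[\alpha(\theta)\{\varphi_{(q,\Pi)}(Z) - \eta(\theta)\} \mid Z] = 0.
% Hence the cross term vanishes and, for any estimator \varphi,
\[
\begin{aligned}
E\,[\varphi - \eta(\theta)]'\alpha(\theta)[\varphi - \eta(\theta)]
  &= E\,[\varphi - \varphi_{(q,\Pi)}]'\alpha(\theta)[\varphi - \varphi_{(q,\Pi)}] \\
  &\quad + E\,[\varphi_{(q,\Pi)} - \eta(\theta)]'\alpha(\theta)[\varphi_{(q,\Pi)} - \eta(\theta)].
\end{aligned}
\]
```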
4. Admissibility of Pitman's estimator for location parameters
in certain low dimensional cases
The sample space 𝒵 of section 3 is now of the form 𝒳 × 𝒴, where 𝒳 is a finite-dimensional real coordinate space and 𝒴 is arbitrary, and the σ-algebra ℬ is a product σ-algebra ℬ = ℬ₁ × ℬ₂, where ℬ₁ consists of the Borel sets in 𝒳 and ℬ₂ is an arbitrary σ-algebra of subsets of 𝒴. Here λ is the product measure λ = μ × ν, where μ is Lebesgue measure on ℬ₁ and ν is an arbitrary probability measure on ℬ₂. The parameter space Θ and the action space A coincide with 𝒳. The loss function is

(54) L(θ, a) = (a − θ)'(a − θ).
We observe (X, Y) whose distribution, for given θ, is such that Y is distributed according to ν and the conditional density of X − θ given Y is p(·|y), a known density. We assume

(55) ∫ p(x|y) dx = 1

and

(56) ∫ x p(x|y) dx = 0
for all y. Condition (56) is introduced only for the purpose of making X the natural estimator (see [6] or [16]). The condition (49) for the natural estimator X to be almost admissible becomes the existence of a sequence π_ε of densities with respect to Lebesgue measure in 𝒳 such that

(57) [1 / inf_{‖ξ‖≤1} π_ε(ξ)] ∫ dν(y) ∫ dx ‖∫ η π_ε(x − η) p(η|y) dη‖² / ∫ π_ε(x − η) p(η|y) dη → 0.
This is derived in formula (63) below.
In [16] one of the authors proved the following theorem.
THEOREM 4.1. When dim 𝒳 = 1, if in addition to the above conditions

(58) ∫ dν(y) [∫ x² p(x|y) dx]^{3/2} < ∞,

then X is an admissible estimator of θ, that is, there does not exist a function φ such that

(59) ∫ dν(y) ∫ [φ(x, y) − θ]² p(x − θ|y) dx ≤ ∫ dν(y) ∫ x² p(x|y) dx

for all θ with strict inequality for some θ.
This is proved by first showing that (57) holds with

(60) π_ε(x) = 1/[πa(1 + x²/a²)],
so that X is almost admissible, and then proving that this implies that X is
admissible. It is not clear that the condition (58) is necessary.
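Concretely, Pitman's estimator here is the posterior mean of θ under a uniform (Lebesgue) prior, and condition (56) is exactly what makes it reduce to X. The sketch below is an editorial illustration, not part of the paper; it assumes NumPy and SciPy, and the error density is an arbitrary choice centered so that (56) holds.

```python
import numpy as np
from scipy.integrate import quad

def g(u):
    """Density of the error U = X - theta: a unit exponential shifted left
    by 1, so that E U = 0 and condition (56) holds."""
    return np.exp(-(u + 1.0)) if u >= -1.0 else 0.0

def pitman(x):
    """Pitman's estimator: the posterior mean of theta under a flat prior,
    i.e. the integral of t g(x - t) dt over the integral of g(x - t) dt."""
    num, _ = quad(lambda t: t * g(x - t), x - 40.0, x + 1.0)
    den, _ = quad(lambda t: g(x - t), x - 40.0, x + 1.0)
    return num / den

print(pitman(0.7), pitman(-3.2))   # each agrees with x itself, as (56) requires
```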
We shall sketch the proof of a similar but more difficult theorem in the two-dimensional case.
THEOREM 2. When dim 𝒳 = 2, if in addition to (55) and (56), for some δ > 0,

(61) ∫ dν(y) [∫ ‖x‖² log^{1+δ} ‖x‖² · p(x|y) dx] < ∞,

then X is an admissible estimator of θ, that is, there does not exist a function φ on 𝒳 × 𝒴 to 𝒳 such that

(62) ∫ dν(y) ∫ ‖φ(x, y) − θ‖² p(x − θ|y) dx ≤ ∫ dν(y) ∫ ‖x‖² p(x|y) dx

for all θ with strict inequality for some θ.
SKETCH OF PROOF. First we show that (57) implies (49) in the present case. We take Π to be Lebesgue measure on 𝒳 and, in accordance with the notation of the present section, write π instead of q. Because of the invariance of our problem under translation we need prove (49) only for C equal to a solid sphere of radius 1 about the origin. Then the integral in the numerator of (49) is
(63) ∫ π_ε(ξ) dξ E_ξ ‖∫ (X − η) p(X − η|Y) π_ε(η) dη / ∫ p(X − η|Y) π_ε(η) dη‖²
= ∫ π_ε(ξ) dξ ∫ dν(y) ∫ dx p(x − ξ|y) ‖∫ (x − η) p(x − η|y) π_ε(η) dη‖² / [∫ p(x − η|y) π_ε(η) dη]²
= ∫ dν(y) ∫ dx ‖∫ (x − η) p(x − η|y) π_ε(η) dη‖² / ∫ p(x − η|y) π_ε(η) dη
= ∫ dν(y) ∫ dx ‖∫ η π_ε(x − η) p(η|y) dη‖² / ∫ π_ε(x − η) p(η|y) dη,
and it is clear that (57) implies (49).
To prove theorem 2 we define π_ε by

(64) π_ε(ξ) =
 B log^{2λ}(Ma),  ‖ξ‖² < 1,
 B log^{2λ}(Ma/‖ξ‖²),  1 ≤ ‖ξ‖² < Ma,
 0,  ‖ξ‖² ≥ Ma,

where 1 < λ < 1 + δ, 0 < M < 1, a√M > 1, and these constants and B are chosen so that π_ε is continuous everywhere and continuously differentiable except at ‖ξ‖² = 1. A computation analogous to that in [16], but much more tedious, leads to the following bound for the inner integral on the right side of (63):
(65) I(y) = ∫ dx ‖∫ η π_ε(x − η) p(η|y) dη‖² / ∫ π_ε(x − η) p(η|y) dη.