9 Statistical errors, confidence intervals and limits

In Chapters 5-8, several methods for estimating properties of p.d.f.s (moments and other parameters) have been discussed, along with techniques for obtaining the variance of the estimators. Up to now the topic of 'error analysis' has been limited to reporting the variances (and covariances) of estimators, or equivalently the standard deviations and correlation coefficients. This turns out to be inadequate in certain cases, and other ways of communicating the statistical uncertainty of a measurement must be found.

After reviewing in Section 9.1 what is meant by reporting the standard deviation as an estimate of statistical uncertainty, the confidence interval is introduced in Section 9.2. This allows for a quantitative statement about the fraction of times that such an interval would contain the true value of the parameter in a large number of repeated experiments. Confidence intervals are treated for a number of important cases in Sections 9.3 through 9.6, and are extended to the multidimensional case in Section 9.7. In Sections 9.8 and 9.9, both Bayesian and classical confidence intervals are used to estimate limits on parameters near a physically excluded region.

9.1 The standard deviation as statistical error

Suppose the result of an experiment is an estimate of a certain parameter. The variance (or equivalently its square root, the standard deviation) of the estimator is a measure of how widely the estimates would be distributed if the experiment were to be repeated many times with the same number of observations per experiment. As such, the standard deviation $\sigma$ is often reported as the statistical uncertainty of a measurement, and is referred to as the standard error.

For example, suppose one has $n$ observations of a random variable $x$ and a hypothesis for the p.d.f. $f(x;\theta)$ which contains an unknown parameter $\theta$. From the sample $x_1, \ldots, x_n$ a function $\hat{\theta}(x_1, \ldots, x_n)$ is constructed (e.g. using maximum likelihood) as an estimator for $\theta$. Using one of the techniques discussed in Chapters 5-8 (e.g. analytic method, RCF bound, Monte Carlo, graphical), the standard deviation of $\hat{\theta}$ can be estimated. Let $\hat{\theta}_{\rm obs}$ be the value of the estimator actually observed, and $\hat{\sigma}_{\hat{\theta}}$ the estimate of its standard deviation. In reporting the measurement as $\hat{\theta}_{\rm obs} \pm \hat{\sigma}_{\hat{\theta}}$, one means that repeated estimates would be distributed according to a sampling p.d.f. $g(\hat{\theta})$ centered around some true value $\theta$ and with true standard deviation $\sigma_{\hat{\theta}}$, which are estimated to be $\hat{\theta}_{\rm obs}$ and $\hat{\sigma}_{\hat{\theta}}$.

For most practical estimators, the sampling p.d.f. $g(\hat{\theta})$ becomes approximately Gaussian in the large sample limit. If more than one parameter is estimated, then the p.d.f. becomes a multidimensional Gaussian characterized by a covariance matrix $V$. Thus by estimating the standard deviation, or for more than one parameter the covariance matrix, one effectively summarizes all of the information available about how repeated estimates would be distributed. By using the error propagation techniques of Section 1.6, the covariance matrix also gives the equivalent information, at least approximately, for functions of the estimators.

9.2 Classical confidence intervals (exact method)

Although the 'standard deviation' definition of statistical error bars could in principle be used regardless of the form of the estimator's p.d.f., a more precise statement of the uncertainty can be made with a confidence interval. Suppose the p.d.f. of an estimator $\hat{\theta}$ is $g(\hat{\theta};\theta)$, where $\theta$ is the true (unknown) value of the parameter. From $g(\hat{\theta};\theta)$ one can determine the value $u_\alpha$ such that there is a fixed probability $\alpha$ to observe $\hat{\theta} \ge u_\alpha$, and similarly the value $v_\beta$ such that there is a probability $\beta$ to observe $\hat{\theta} \le v_\beta$. The values $u_\alpha$ and $v_\beta$ depend on the true value of $\theta$, and are thus determined by

  $\alpha = P(\hat{\theta} \ge u_\alpha(\theta)) = \int_{u_\alpha(\theta)}^{\infty} g(\hat{\theta};\theta)\,d\hat{\theta} = 1 - G(u_\alpha(\theta);\theta)$ ,   (9.1)

  $\beta = P(\hat{\theta} \le v_\beta(\theta)) = \int_{-\infty}^{v_\beta(\theta)} g(\hat{\theta};\theta)\,d\hat{\theta} = G(v_\beta(\theta);\theta)$ ,   (9.2)

where $G$ is the cumulative distribution corresponding to $g(\hat{\theta};\theta)$.
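As a concrete sketch of these definitions, the belt functions $u_\alpha(\theta)$ and $v_\beta(\theta)$ take a simple closed form when $g(\hat{\theta};\theta)$ is Gaussian with known standard deviation. The following fragment (Python with scipy, used here in place of the CERNLIB routines cited later in this chapter; the values $\sigma_{\hat{\theta}} = 1$ and $\alpha = \beta = 0.05$ are purely illustrative) shows the construction:

```python
# Belt functions u_alpha(theta), v_beta(theta) of eqs. (9.1), (9.2) for a
# Gaussian estimator with known standard deviation (illustrative values).
from scipy.stats import norm

sigma = 1.0               # assumed standard deviation of the estimator
alpha, beta = 0.05, 0.05  # one-sided tail probabilities

def u_alpha(theta):
    # P(theta_hat >= u_alpha) = alpha  =>  u_alpha = theta + sigma * Phi^{-1}(1 - alpha)
    return theta + sigma * norm.ppf(1.0 - alpha)

def v_beta(theta):
    # P(theta_hat <= v_beta) = beta    =>  v_beta = theta - sigma * Phi^{-1}(1 - beta)
    return theta - sigma * norm.ppf(1.0 - beta)

print(u_alpha(3.0), v_beta(3.0))  # edges of the confidence belt at theta = 3
```

Plotting $u_\alpha(\theta)$ and $v_\beta(\theta)$ as functions of $\theta$ gives the confidence belt; reading it horizontally at the observed value $\hat{\theta}_{\rm obs}$ yields the interval $[a, b]$ discussed next.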
Fig. 9.2 Construction of the confidence interval $[a, b]$ given an observed value $\hat{\theta}_{\rm obs}$ of the estimator $\hat{\theta}$.

If the functions $u_\alpha(\theta)$ and $v_\beta(\theta)$ are monotonically increasing, one can define the inverse functions

  $a(\hat{\theta}) \equiv u_\alpha^{-1}(\hat{\theta})$ ,  $b(\hat{\theta}) \equiv v_\beta^{-1}(\hat{\theta})$ .   (9.3)

The inequalities $\hat{\theta} \ge u_\alpha(\theta)$ and $\hat{\theta} \le v_\beta(\theta)$ then imply respectively $a(\hat{\theta}) \ge \theta$ and $b(\hat{\theta}) \le \theta$. Equations (9.1) and (9.2) thus become

  $P(a(\hat{\theta}) \ge \theta) = \alpha$ ,  $P(b(\hat{\theta}) \le \theta) = \beta$ ,   (9.4)

or taken together,

  $P(a(\hat{\theta}) \le \theta \le b(\hat{\theta})) = 1 - \alpha - \beta$ .   (9.5)

The interval $[a, b]$ determined in this way is called a confidence interval at a confidence level (or coverage probability) of $1 - \alpha - \beta$.

Note the relation between a confidence interval and a test of goodness-of-fit. If one took $\theta = a$ as a hypothesis and regarded outcomes $\hat{\theta} \ge \hat{\theta}_{\rm obs}$ as having equal or less agreement with the hypothesis than the result obtained (a one-sided test), then the resulting $P$-value of the test is $\alpha$. For the confidence interval, however, the probability $\alpha$ is specified first, and the value $a$ is a random quantity depending on the data. For a goodness-of-fit test, the hypothesis, here $\theta = a$, is specified and the $P$-value is treated as a random variable. Note that one sometimes calls the $P$-value, here equal to $\alpha$, the 'confidence level' of the test, whereas the one-sided confidence interval $\theta > a$ has a confidence level of $1 - \alpha$. That is, for a test, small $\alpha$ indicates a low level of confidence in the hypothesis $\theta = a$. For a confidence interval, small $\alpha$ indicates a high level of confidence that the interval $\theta > a$ includes the true parameter. To avoid confusion we will use the term $P$-value or (observed) significance level for goodness-of-fit tests, and reserve the term confidence level to mean the coverage probability of a confidence interval.

The confidence interval $[a, b]$ is often expressed by reporting the result of a measurement as $\hat{\theta}^{+d}_{-c}$, where $\hat{\theta}$ is the estimated value, and $c = \hat{\theta} - a$ and $d = b - \hat{\theta}$ are usually displayed as error bars. In many cases the p.d.f. $g(\hat{\theta};\theta)$ is approximately Gaussian, so that an interval of plus or minus one standard deviation around the measured value corresponds to a central confidence interval with $1 - \gamma = 0.683$ (see Section 9.3). The 68.3% central confidence interval is usually adopted as the conventional definition for error bars even when the p.d.f. of the estimator is not Gaussian.

If, for example, the result of an experiment is reported as $\hat{\theta} = 5.79^{+0.32}_{-0.25}$, it is meant that if one were to construct the interval $[\hat{\theta} - c, \hat{\theta} + d]$ according to the prescription described above in a large number of similar experiments with the same number of measurements per experiment, then the interval would include the true value $\theta$ in a fraction $1 - \alpha - \beta$ of the cases. It does not mean that the probability (in the sense of relative frequency) that the true value of $\theta$ is in the fixed interval $[5.54, 6.11]$ is $1 - \alpha - \beta$. In the frequency interpretation, the true parameter $\theta$ is not a random variable and is assumed not to fluctuate from experiment to experiment. In this sense the probability that $\theta$ is in $[5.54, 6.11]$ is either 0 or 1, but we do not know which. The interval itself, however, is subject to fluctuations since it is constructed from the data.

A difficulty in constructing confidence intervals is that the p.d.f. of the estimator $g(\hat{\theta};\theta)$, or equivalently the cumulative distribution $G(\hat{\theta};\theta)$, must be known. An example is given in Section 10.4, where the p.d.f. for the estimator of the mean $\xi$ of an exponential distribution is derived, and from this a confidence interval for $\xi$ is determined. In many practical applications, estimators are Gaussian distributed (at least approximately). In this case the confidence interval can be determined easily; this is treated in detail in the next section. Even in the case of a non-Gaussian estimator, however, a simple approximate technique can be applied using the likelihood function; this is described in Section 9.6.
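The frequency interpretation above lends itself to a direct numerical check. The following sketch (Python with numpy; a Gaussian estimator is assumed and the true value, standard deviation and number of experiments are illustrative) simulates many repetitions of an experiment and counts how often the conventional 'one sigma' interval covers the true value:

```python
# Coverage check of the central 68.3% ('one sigma') confidence interval,
# assuming a Gaussian estimator; all numerical values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, n_expts = 5.79, 1.0, 100_000

theta_hat = rng.normal(theta_true, sigma, n_expts)  # one estimate per experiment
covered = (theta_hat - sigma <= theta_true) & (theta_true <= theta_hat + sigma)
print(covered.mean())  # fraction of intervals containing theta_true, ~0.683
```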
9.3 Confidence interval for a Gaussian distributed estimator

A simple and very important application of a confidence interval is when the distribution of $\hat{\theta}$ is Gaussian with mean $\theta$ and standard deviation $\sigma_{\hat{\theta}}$. That is, the cumulative distribution of $\hat{\theta}$ is

  $G(\hat{\theta};\theta,\sigma_{\hat{\theta}}) = \int_{-\infty}^{\hat{\theta}} \frac{1}{\sqrt{2\pi\sigma_{\hat{\theta}}^2}} \exp\!\left(\frac{-(\hat{\theta}'-\theta)^2}{2\sigma_{\hat{\theta}}^2}\right) d\hat{\theta}' = \Phi\!\left(\frac{\hat{\theta}-\theta}{\sigma_{\hat{\theta}}}\right)$ ,   (9.6)

where $\Phi$ is the cumulative distribution of the standard Gaussian. Using this in equations (9.1) and (9.2), the interval limits $a$ and $b$ given the observed value $\hat{\theta}_{\rm obs}$ are found by solving

  $\alpha = P(\hat{\theta} \ge \hat{\theta}_{\rm obs}; a) = 1 - G(\hat{\theta}_{\rm obs}; a, \sigma_{\hat{\theta}})$ ,  $\beta = P(\hat{\theta} \le \hat{\theta}_{\rm obs}; b) = G(\hat{\theta}_{\rm obs}; b, \sigma_{\hat{\theta}})$ ,   (9.9)

which gives

  $a = \hat{\theta}_{\rm obs} - \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\alpha)$ ,  $b = \hat{\theta}_{\rm obs} + \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\beta)$ .   (9.10)

The quantiles $\Phi^{-1}(1-\alpha)$ and $\Phi^{-1}(1-\beta)$ represent how far away the interval limits $a$ and $b$ are located with respect to the estimate $\hat{\theta}_{\rm obs}$ in units of the standard deviation $\sigma_{\hat{\theta}}$. The relationship between the quantiles of the standard Gaussian distribution and the confidence level is illustrated in Fig. 9.4(a) for central and Fig. 9.4(b) for one-sided confidence intervals.

Consider a central confidence interval with $\alpha = \beta = \gamma/2$. The confidence level $1-\gamma$ is often chosen such that the quantile is a small integer, e.g. $\Phi^{-1}(1-\gamma/2) = 1, 2, 3, \ldots$. Similarly, for one-sided intervals (limits) one often chooses a small integer for $\Phi^{-1}(1-\alpha)$. Commonly used values for both central and one-sided intervals are shown in Table 9.1. Alternatively one can choose a round number for the confidence level instead of for the quantile. Commonly used values are shown in Table 9.2. Other possible values can be obtained from [Bra92, Fro79, Dud88] or from computer routines (e.g. the routine GAUSIN in [CER97]).

Table 9.1 The values of the confidence level for different values of the quantile of the standard Gaussian $\Phi^{-1}$: for central intervals (left) the quantile $\Phi^{-1}(1-\gamma/2)$ and confidence level $1-\gamma$; for one-sided intervals (right) the quantile $\Phi^{-1}(1-\alpha)$ and confidence level $1-\alpha$.

  Φ^{-1}(1-γ/2)   1-γ        Φ^{-1}(1-α)   1-α
  1               0.6827     1             0.8413
  2               0.9544     2             0.9772
  3               0.9973     3             0.9987

Table 9.2 The values of the quantile of the standard Gaussian $\Phi^{-1}$ for different values of the confidence level: for central intervals (left) the confidence level $1-\gamma$ and the quantile $\Phi^{-1}(1-\gamma/2)$; for one-sided intervals (right) the confidence level $1-\alpha$ and the quantile $\Phi^{-1}(1-\alpha)$.

  1-γ     Φ^{-1}(1-γ/2)      1-α     Φ^{-1}(1-α)
  0.90    1.645              0.90    1.282
  0.95    1.960              0.95    1.645
  0.99    2.576              0.99    2.326

For the conventional 68.3% central confidence interval one has $\alpha = \beta = \gamma/2$, with $\Phi^{-1}(1-\gamma/2) = 1$, i.e. a '1$\sigma$ error bar'. This results in the simple prescription $[a, b] = [\hat{\theta}_{\rm obs} - \hat{\sigma}_{\hat{\theta}},\, \hat{\theta}_{\rm obs} + \hat{\sigma}_{\hat{\theta}}]$. One can also give only a lower or upper limit, i.e. a one-sided confidence interval,

  $\theta_{\rm lo} = \hat{\theta}_{\rm obs} - \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\alpha)$ ,  $\theta_{\rm up} = \hat{\theta}_{\rm obs} + \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\beta)$ .   (9.12)

9.4 Confidence interval for the mean of a Poisson variable

Consider now an experiment in which one counts a number of events $n$, regarded as a Poisson variable with unknown mean $\nu$. The ML estimator is $\hat{\nu} = n$, so that an experiment yielding $n_{\rm obs}$ events gives $\hat{\nu}_{\rm obs} = n_{\rm obs}$, and from this we would like to construct a confidence interval for the mean $\nu$.

For the case of a discrete variable, the procedure for determining the confidence interval described in Section 9.2 cannot be directly applied. This is because the functions $u_\alpha(\theta)$ and $v_\beta(\theta)$, which determine the confidence belt, do not exist for all values of the parameter $\theta$. For the Poisson case, for example, we would need to find $u_\alpha(\nu)$ and $v_\beta(\nu)$ such that $P(\hat{\nu} \ge u_\alpha(\nu)) = \alpha$ and $P(\hat{\nu} \le v_\beta(\nu)) = \beta$ for all values of the parameter $\nu$. But if $\alpha$ and $\beta$ are fixed, then because $\hat{\nu}$ only takes on discrete values, these equations hold in general only for particular values of $\nu$.

A confidence interval $[a, b]$ can still be determined, however, by using equations (9.9). For the case of a discrete random variable and a parameter $\nu$ these become

  $\alpha = P(\hat{\nu} \ge \hat{\nu}_{\rm obs}; a)$ ,  $\beta = P(\hat{\nu} \le \hat{\nu}_{\rm obs}; b)$ ,   (9.15)

and in particular for a Poisson variable one has

  $\alpha = \sum_{n=n_{\rm obs}}^{\infty} f(n;a) = 1 - \sum_{n=0}^{n_{\rm obs}-1} \frac{a^n}{n!} e^{-a}$ ,  $\beta = \sum_{n=0}^{n_{\rm obs}} f(n;b) = \sum_{n=0}^{n_{\rm obs}} \frac{b^n}{n!} e^{-b}$ .   (9.16)

For an estimate $\hat{\nu} = n_{\rm obs}$ and given probabilities $\alpha$ and $\beta$, these equations can be solved numerically for $a$ and $b$. Here one can use the following relation between the Poisson and $\chi^2$ distributions,

  $\sum_{n=0}^{n_{\rm obs}} \frac{\nu^n}{n!} e^{-\nu} = \int_{2\nu}^{\infty} f_{\chi^2}(z; n_d = 2(n_{\rm obs}+1))\,dz = 1 - F_{\chi^2}(2\nu;\, n_d = 2(n_{\rm obs}+1))$ ,   (9.17)

where $f_{\chi^2}$ is the $\chi^2$ p.d.f. for $n_d$ degrees of freedom and $F_{\chi^2}$ is the corresponding cumulative distribution.
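Relation (9.17) is easy to check numerically; a minimal sketch (Python with scipy assumed, with $\nu$ and $n_{\rm obs}$ arbitrary test values) compares the two sides:

```python
# Check of the Poisson - chi^2 relation (9.17):
#   sum_{n=0}^{n_obs} nu^n e^{-nu} / n! = 1 - F_chi2(2*nu; 2*(n_obs + 1)).
from scipy.stats import poisson, chi2

nu, n_obs = 4.2, 3                            # arbitrary test values
lhs = poisson.cdf(n_obs, nu)                  # left-hand side: Poisson sum
rhs = chi2.sf(2.0 * nu, df=2 * (n_obs + 1))   # right-hand side: chi^2 tail
print(lhs, rhs)                               # identical up to rounding
```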
Using this relation in equations (9.16), one then has

  $a = \tfrac{1}{2} F_{\chi^2}^{-1}(\alpha;\, n_d = 2 n_{\rm obs})$ ,  $b = \tfrac{1}{2} F_{\chi^2}^{-1}(1-\beta;\, n_d = 2(n_{\rm obs}+1))$ .   (9.18)

Quantiles $F_{\chi^2}^{-1}$ of the $\chi^2$ distribution can be obtained from standard tables (e.g. in [Bra92]) or from computer routines such as CHISIN in [CER97]. Some values for $n_{\rm obs} = 0, \ldots, 10$ are shown in Table 9.3. Note that the lower limit $a$ cannot be determined if $n_{\rm obs} = 0$.

Equations (9.15) say that if $\nu = a$ ($\nu = b$), then the probability is $\alpha$ ($\beta$) to observe a value greater (less) than or equal to the one actually observed. Because the case of equality, $\hat{\nu} = \hat{\nu}_{\rm obs}$, is included in the inequalities (9.15), one obtains a conservatively large confidence interval, i.e.

  $P(a \le \nu) \ge 1 - \alpha$ ,  $P(\nu \le b) \ge 1 - \beta$ ,  $P(a \le \nu \le b) \ge 1 - \alpha - \beta$ .   (9.19)

An important special case is when the observed number $n_{\rm obs}$ is zero, and one is interested in establishing an upper limit $b$. Equation (9.16) then becomes

  $\beta = \sum_{n=0}^{0} \frac{b^n}{n!} e^{-b} = e^{-b}$ ,   (9.20)

or $b = -\log\beta$; for $1 - \beta = 0.95$ this gives the well-known upper limit $b \approx 3.0$.

9.5 Confidence interval for correlation coefficient, transformation of parameters

For the sample correlation coefficient $r$ of $n$ data pairs, used as an estimator for the true correlation coefficient $\rho$, the p.d.f. is markedly non-Gaussian unless the sample is very large. A central confidence interval $[a', b']$ can nevertheless be constructed as in Section 9.3 for the transformed variable $z = \tanh^{-1} r$, which is approximately Gaussian, and converted back into an interval $[A, B]$ for $\rho$ simply by using the inverse of the transformation (9.22), i.e. $A = \tanh a'$ and $B = \tanh b'$.

Consider for example a sample of size $n = 20$ for which one has obtained the estimate $r = 0.5$. From equation (5.17) the standard deviation of $r$ can be estimated as $\hat{\sigma}_r = (1 - r^2)/\sqrt{n} = 0.168$. If one were to make the incorrect approximation that $r$ is Gaussian distributed for such a small sample, this would lead to a 68.3% central confidence interval for $\rho$ of $[0.332, 0.668]$, or $[0.067, 0.933]$ at a confidence level of 99%. Thus since the sample correlation coefficient $r$ is almost three times the standard error $\hat{\sigma}_r$ away from zero, the Gaussian approximation would suggest a significant correlation; for such a small sample, however, the statement should be based on the interval obtained from the transformed variable $z$.

9.7 Multidimensional confidence regions

For $n$ parameters $\theta = (\theta_1, \ldots, \theta_n)$, a confidence region with coverage probability $1-\gamma$ is given (in the large-sample limit, where the likelihood function becomes Gaussian) by the set of $\theta$ values for which

  $\log L(\theta) \ge \log L_{\rm max} - \frac{Q_\gamma}{2}$ ,   (9.37)

where $Q_\gamma$ is the corresponding quantile of the $\chi^2$ distribution. As in the single-parameter case, one can still use the prescription given by (9.37) even if the likelihood function is not Gaussian, in which case the probability statement (9.34) is only approximate. For an increasing number of parameters, the approach to the Gaussian limit becomes slower as a function of the sample size, and furthermore it is difficult to quantify when a sample is large enough for (9.34) to apply. If needed, one can determine the probability that a region constructed according to (9.37) includes the true parameter by means of a Monte Carlo calculation.

Quantiles $Q_\gamma = F_{\chi^2}^{-1}(1-\gamma; n)$ of the $\chi^2$ distribution for several confidence levels $1-\gamma$ and $n = 1, 2, 3, 4, 5$ parameters are given in Table 9.5; values of the confidence level for various values of the quantile $Q_\gamma$ are shown in Table 9.4.

Table 9.4 The values of the confidence level $1-\gamma$ for different values of $Q_\gamma$ and for $n = 1, 2, 3, 4, 5$ fitted parameters.

  Q_γ    n = 1   n = 2   n = 3   n = 4   n = 5
  1.0    0.683   0.393   0.199   0.090   0.037
  2.0    0.843   0.632   0.428   0.264   0.151
  4.0    0.954   0.865   0.739   0.594   0.451
  9.0    0.997   0.989   0.971   0.939   0.891

Table 9.5 The values of the quantile $Q_\gamma$ for different values of the confidence level $1-\gamma$, for $n = 1, 2, 3, 4, 5$ fitted parameters.

  1-γ     n = 1   n = 2   n = 3   n = 4   n = 5
  0.683   1.00    2.30    3.53    4.72    5.89
  0.90    2.71    4.61    6.25    7.78    9.24
  0.95    3.84    5.99    7.82    9.49    11.1
  0.99    6.63    9.21    11.3    13.3    15.1

For $n = 1$ the expression (9.36) for $Q_\gamma$ can be shown to imply

  $\sqrt{Q_\gamma} = \Phi^{-1}(1 - \gamma/2)$ ,   (9.38)

where $\Phi^{-1}$ is the inverse function of the standard normal cumulative distribution. The procedure here thus reduces to that for a single parameter given in Section 9.6, where $N = \sqrt{Q_\gamma}$ is the half-width of the interval in standard deviations (see equations (9.28), (9.29)). The values for $n = 1$ in Tables 9.4 and 9.5 are thus related to those in Tables 9.1 and 9.2 by equation (9.38).
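The entries in both tables follow directly from the cumulative $\chi^2$ distribution; a short sketch (Python with scipy, standing in for the tabulated values) reproduces them and checks relation (9.38):

```python
# Reproduce Tables 9.4 and 9.5 from the chi^2 distribution, and check
# relation (9.38) for n = 1.
from scipy.stats import chi2, norm

for Q in (1.0, 2.0, 4.0, 9.0):                       # Table 9.4: 1 - gamma
    print(Q, [round(chi2.cdf(Q, df=n), 3) for n in range(1, 6)])

for cl in (0.683, 0.90, 0.95, 0.99):                 # Table 9.5: Q_gamma
    print(cl, [round(chi2.ppf(cl, df=n), 2) for n in range(1, 6)])

gamma = 0.10                                         # e.g. 1 - gamma = 0.90
print(chi2.ppf(1 - gamma, df=1) ** 0.5,              # sqrt(Q_gamma) ...
      norm.ppf(1 - gamma / 2))                       # ... equals Phi^{-1}(1 - gamma/2)
```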
For increasing $n$, the confidence level for a given $Q_\gamma$ decreases. For example, in the single-parameter case, $Q_\gamma = 1$ corresponds to $1-\gamma = 0.683$. For $n = 2$, $Q_\gamma = 1$ gives a confidence level of only 0.393, and in order to obtain $1-\gamma = 0.683$ one needs $Q_\gamma = 2.30$. We should emphasize that, as in the single-parameter case, the confidence region $Q(\hat{\theta}, \theta) \le Q_\gamma$ is a random region in $\theta$-space. The confidence region varies upon repetition of the experiment, since $\hat{\theta}$ is a random variable. The true parameters, on the other hand, are unknown constants.

9.8 Limits near a physical boundary

Often the purpose of an experiment is to search for a new effect, the existence of which would imply that a certain parameter is not equal to zero. For example, one could attempt to measure the mass of the neutrino, which in the standard theory is massless. If the data yield a value of the parameter significantly different from zero, then the new effect has been discovered, and the parameter's value and a confidence interval to reflect its error are given as the result. If, on the other hand, the data result in a fitted value of the parameter that is consistent with zero, then the result of the experiment is reported by giving an upper limit on the parameter. (A similar situation occurs when absence of the new effect corresponds to a parameter being large or infinite; one then places a lower limit. For simplicity we will consider here only upper limits.)

Difficulties arise when an estimator can take on values in the excluded region. This can occur if the estimator $\hat{\theta}$ for a parameter $\theta$ is of the form $\hat{\theta} = x - y$, where both $x$ and $y$ are random variables, i.e. they have random measurement errors. The mass squared of a particle, for example, can be estimated by measuring independently its energy $E$ and momentum $p$, and using $\widehat{m^2} = E^2 - p^2$. Although the mass squared should come out positive, measurement errors in $E$ and $p$ could result in a negative value for $\widehat{m^2}$. Then the question is how to place a limit on $m^2$, or more generally on a parameter $\theta$ when the estimate is in or near an excluded region.

Consider further the example of an estimator $\hat{\theta} = x - y$ where $x$ and $y$ are Gaussian variables with means $\mu_x$, $\mu_y$ and variances $\sigma_x^2$, $\sigma_y^2$. One can show that the difference $\hat{\theta} = x - y$ is also a Gaussian variable with mean $\theta = \mu_x - \mu_y$ and variance $\sigma_{\hat{\theta}}^2 = \sigma_x^2 + \sigma_y^2$. (This can be shown using characteristic functions as described in Chapter 10.) Assume that $\theta$ is known a priori to be non-negative (e.g. like the mass squared), and suppose the experiment has resulted in a value $\hat{\theta}_{\rm obs}$ for the estimator $\hat{\theta}$. According to (9.12), the upper limit $\theta_{\rm up}$ at a confidence level $1-\beta$ is

  $\theta_{\rm up} = \hat{\theta}_{\rm obs} + \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\beta)$ .   (9.39)

For the commonly used 95% confidence level one obtains from Table 9.2 the quantile $\Phi^{-1}(0.95) = 1.645$. The interval $(-\infty, \theta_{\rm up}]$ is constructed to include the true value $\theta$ with a probability of 95%, regardless of what $\theta$ actually is.

Suppose now that the standard deviation is $\sigma_{\hat{\theta}} = 1$ and the experiment has yielded $\hat{\theta}_{\rm obs} = -2.0$. Equation (9.39) then gives an upper limit $\theta_{\rm up} = -0.355$ at 95% confidence level, i.e. the entire interval $(-\infty, \theta_{\rm up}]$ lies in the physically excluded region $\theta < 0$. By construction such an outcome occurs in a fraction $\beta$ of experiments, but the reported interval then contains no physically possible value of $\theta$. One possibility is to report the measured value and its standard deviation anyway, even if the estimate is in the physically excluded region. In this way, the average of many experiments (e.g. as in Section 7.6) will converge to the correct value as long as the estimator is unbiased. In cases where the p.d.f. of $\hat{\theta}$ is significantly non-Gaussian, the entire likelihood function $L(\theta)$ should be given, which can be combined with that of other experiments as discussed in Section 6.12.
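A minimal numerical sketch of equation (9.39) for this example (Python with scipy; $\hat{\theta}_{\rm obs} = -2.0$ and $\sigma_{\hat{\theta}} = 1$ as in the text) makes the difficulty explicit:

```python
# Classical upper limit (9.39) for the example above: sigma = 1.0 and
# theta_hat_obs = -2.0 (values taken from the text's example).
from scipy.stats import norm

theta_obs, sigma = -2.0, 1.0
for cl in (0.95, 0.99):
    print(cl, round(theta_obs + sigma * norm.ppf(cl), 3))
# 0.95 -> -0.355: the entire interval lies in the excluded region theta < 0
# 0.99 ->  0.326: positive, but smaller than the resolution sigma = 1
```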
Nevertheless, most experimenters want to report some sort of upper limit, and in situations such as the one described above a number of techniques have been proposed (see e.g. [Hig83, Jam91]). There is unfortunately no established convention on how this should be done, and one should therefore state what procedure was used.

As a solution to the difficulties posed by an upper limit in an unphysical region, one might be tempted to simply increase the confidence level until the limit enters the allowed region. In the previous example, if we had taken a confidence level $1-\beta = 0.99$, then from Table 9.2 one has $\Phi^{-1}(0.99) = 2.326$, giving $\theta_{\rm up} = 0.326$. This would lead one to quote an upper limit that is smaller than the intrinsic resolution of the experiment. Carried to an extreme, the confidence level could be tuned to give an arbitrarily small positive limit: $\Phi^{-1}(0.97725) = 2.00001$, or $\theta_{\rm up} = 10^{-5}$ at a confidence level of 97.725%!

In order to avoid this type of difficulty, a commonly used technique is to simply shift a negative estimate to zero before applying equation (9.39), i.e.

  $\theta_{\rm up} = \max(\hat{\theta}_{\rm obs}, 0) + \sigma_{\hat{\theta}}\,\Phi^{-1}(1-\beta)$ .   (9.40)

In this way the upper limit is always at least the same order of magnitude as the resolution of the experiment. If $\hat{\theta}_{\rm obs}$ is positive, the limit coincides with that of the classical procedure. This technique has a certain intuitive appeal and is often used, but the interpretation as an interval that will cover the true parameter value with probability $1-\beta$ no longer applies. The coverage probability is clearly greater than $1-\beta$, since the shifted upper limit (9.40) is in all cases greater than or equal to the classical one (9.39).

Another alternative is to report an interval based on the Bayesian posterior p.d.f. $p(\theta|x)$. As in Section 6.13, this is obtained from Bayes' theorem,

  $p(\theta|x) = \frac{L(x|\theta)\,\pi(\theta)}{\int L(x|\theta')\,\pi(\theta')\,d\theta'}$ ,   (9.41)

where $x$ represents the observed data, $L(x|\theta)$ is the likelihood function and $\pi(\theta)$ is the prior p.d.f. for $\theta$. In Section 6.13, the mode of $p(\theta|x)$ was used as an estimator for $\theta$, and it was shown that this coincides with the ML estimator if the prior density $\pi(\theta)$ is uniform. Here, we can use $p(\theta|x)$ to determine an interval $[a, b]$ such that for given probabilities $\alpha$ and $\beta$ one has

  $\alpha = \int_{-\infty}^{a} p(\theta|x)\,d\theta$ ,  $\beta = \int_{b}^{\infty} p(\theta|x)\,d\theta$ .   (9.42)

The prior knowledge $\theta \ge 0$ can easily be incorporated by setting the prior p.d.f. $\pi(\theta)$ to zero in the excluded region. Bayes' theorem then gives a posterior probability $p(\theta|x)$ with $p(\theta|x) = 0$ for $\theta < 0$. The upper limit is thus determined by

  $1 - \beta = \int_{-\infty}^{\theta_{\rm up}} p(\theta|x)\,d\theta = \frac{\int_{-\infty}^{\theta_{\rm up}} L(x|\theta)\,\pi(\theta)\,d\theta}{\int_{-\infty}^{\infty} L(x|\theta)\,\pi(\theta)\,d\theta}$ .   (9.43)

The difficulties here have already been mentioned in Section 6.13, namely that there is no unique way to specify the prior density $\pi(\theta)$. A common choice is

  $\pi(\theta) = 0$ for $\theta < 0$, $\pi(\theta) = 1$ for $\theta \ge 0$.   (9.44)

The prescription says in effect: normalize the likelihood function to unit area in the physical region, and then integrate it out to $\theta_{\rm up}$ such that the fraction of area covered is $1-\beta$. Although the method is simple, it has some conceptual drawbacks. For the case where one knows $\theta \ge 0$ (e.g. the neutrino mass) one does not really believe that $0 \le \theta \le 1$ has the same prior probability as $10^{40} \le \theta \le 10^{40} + 1$. Furthermore, the upper limit derived from $\pi(\theta) = $ constant is not invariant with respect to a nonlinear transformation of the parameter. It has been argued [Jef48] that in cases where $\theta \ge 0$ but with no other prior information, one should use

  $\pi(\theta) = 0$ for $\theta < 0$, $\pi(\theta) = 1/\theta$ for $\theta > 0$.   (9.45)

This has the advantage that upper limits are invariant with respect to a transformation of the parameter by raising to an arbitrary power. It is equivalent to a uniform (improper) prior of the form (9.44) for $\log\theta$. For this to be usable, however, the likelihood function must go to zero for $\theta \to 0$ and $\theta \to \infty$, or else the integrals in (9.43) diverge. It is thus not applicable in a number of cases of practical interest, including the example discussed in this section. Therefore, despite its conceptual difficulties, the uniform prior density is the most commonly used choice for setting limits on parameters.
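The three prescriptions can be compared directly, as in Fig. 9.8 below. For a Gaussian likelihood with a flat prior, equation (9.43) has a closed-form solution, since the posterior is simply a Gaussian truncated at zero. A minimal sketch (Python with scipy; $\sigma_{\hat{\theta}} = 1$ as in the example above, and the grid of $\hat{\theta}_{\rm obs}$ values is illustrative):

```python
# 95% upper limits for theta >= 0 with a Gaussian estimator: classical (9.39),
# shifted (9.40), and Bayesian with uniform prior (9.43)-(9.44).
from scipy.stats import norm

sigma, beta = 1.0, 0.05
q = norm.ppf(1.0 - beta)                 # Phi^{-1}(0.95) = 1.645

def classical(t):
    return t + sigma * q                 # eq. (9.39)

def shifted(t):
    return max(t, 0.0) + sigma * q       # eq. (9.40)

def bayes_flat(t):
    # Posterior is a Gaussian in theta, centred at t, truncated to theta >= 0;
    # c is the fraction of the untruncated Gaussian lying below zero.
    c = norm.cdf(-t / sigma)
    return t + sigma * norm.ppf(c + (1.0 - beta) * (1.0 - c))

for t in (-2.0, -1.0, 0.0, 1.0, 3.0):
    print(t, round(classical(t), 3), round(shifted(t), 3), round(bayes_flat(t), 3))
```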
Figure 9.8 shows the upper limits at 95% confidence level derived according to the classical, shifted and Bayesian techniques as a function of $\hat{\theta}_{\rm obs} = x - y$ for $\sigma_{\hat{\theta}} = 1$. For the Bayesian limit, a prior density $\pi(\theta) = $ constant was used. The shifted and classical techniques are equal for $\hat{\theta}_{\rm obs} > 0$. The Bayesian limit is always positive, and is always greater than the classical limit. As $\hat{\theta}_{\rm obs}$ becomes larger than the experimental resolution $\sigma_{\hat{\theta}}$, the Bayesian and classical limits rapidly approach each other.

Fig. 9.8 Upper limits at 95% confidence level for the example of Section 9.8 using the classical, shifted and Bayesian ($\pi(\theta) = $ const.) techniques. The shifted and classical techniques are equal for $\hat{\theta}_{\rm obs} > 0$.

9.9 Upper limit on the mean of Poisson variable with background

As a final example, recall Section 9.4 where an upper limit was placed on the mean $\nu$ of a Poisson variable $n$. Often one is faced with a somewhat more complicated situation where the observed value of $n$ is the sum of the desired signal events $n_s$ as well as background events $n_b$,

  $n = n_s + n_b$ ,   (9.46)

where both $n_s$ and $n_b$ can be regarded as Poisson variables with means $\nu_s$ and $\nu_b$, respectively. Suppose for the moment that the mean for the background $\nu_b$ is known without any uncertainty. For $\nu_s$ one only knows a priori that $\nu_s \ge 0$. The goal is to construct an upper limit for the signal parameter $\nu_s$ given a measured value of $n$.

Since $n$ is the sum of two Poisson variables, one can show that it is itself a Poisson variable, with the probability function

  $f(n; \nu_s, \nu_b) = \frac{(\nu_s + \nu_b)^n}{n!}\, e^{-(\nu_s + \nu_b)}$ .   (9.47)

The ML estimator for $\nu_s$ is

  $\hat{\nu}_s = n - \nu_b$ ,   (9.48)

which has zero bias since $E[n] = \nu_s + \nu_b$. Equations (9.15), which are used to determine the confidence interval, become

  $\alpha = P(\hat{\nu}_s \ge \hat{\nu}_s^{\rm obs};\, \nu_s^{\rm lo}) = \sum_{n=n_{\rm obs}}^{\infty} \frac{(\nu_s^{\rm lo} + \nu_b)^n}{n!}\, e^{-(\nu_s^{\rm lo} + \nu_b)}$ ,
  $\beta = P(\hat{\nu}_s \le \hat{\nu}_s^{\rm obs};\, \nu_s^{\rm up}) = \sum_{n=0}^{n_{\rm obs}} \frac{(\nu_s^{\rm up} + \nu_b)^n}{n!}\, e^{-(\nu_s^{\rm up} + \nu_b)}$ .   (9.49)

Using relation (9.17), the upper limit can be expressed as

  $\nu_s^{\rm up} = \tfrac{1}{2} F_{\chi^2}^{-1}(1-\beta;\, n_d = 2(n_{\rm obs}+1)) - \nu_b$ ,   (9.50)

which is shown as a function of $\nu_b$ in Fig. 9.9(a) for various numbers of observed events $n_{\rm obs}$. One sees that $\nu_s^{\rm up}$ can come out negative if fewer events are observed than the number expected from background alone; this is the analogue of the problem encountered in Section 9.8, with $\hat{\nu}_s = n - \nu_b$ playing the role of the estimator near the physical boundary $\nu_s = 0$. As before, one can report the estimate $\hat{\nu}_s$ along with its statistical error even if $\hat{\nu}_s$ comes out negative. In this way the average of many experiments will converge to the correct value.

If, in addition, one wishes to report an upper limit on $\nu_s$, the Bayesian method can be used with, for example, a uniform prior density [Hel83]. The likelihood function is given by the probability (9.47), now regarded as a function of $\nu_s$,

  $L(n_{\rm obs}|\nu_s) = \frac{(\nu_s + \nu_b)^{n_{\rm obs}}}{n_{\rm obs}!}\, e^{-(\nu_s + \nu_b)}$ .   (9.51)

The posterior probability density for $\nu_s$ is obtained as usual from Bayes' theorem,

  $p(\nu_s|n_{\rm obs}) = \frac{L(n_{\rm obs}|\nu_s)\,\pi(\nu_s)}{\int L(n_{\rm obs}|\nu_s')\,\pi(\nu_s')\,d\nu_s'}$ .   (9.52)

Fig. 9.9 Upper limits $\nu_s^{\rm up}$ at a confidence level of $1-\beta = 0.95$ for different numbers of events observed $n_{\rm obs}$ and as a function of the expected number of background events $\nu_b$. (a) The classical limit. (b) The Bayesian limit based on a uniform prior density for $\nu_s$.

Taking $\pi(\nu_s)$ to be constant for $\nu_s \ge 0$ and zero for $\nu_s < 0$, the upper limit $\nu_s^{\rm up}$ at a confidence level of $1-\beta$ is given by

  $1 - \beta = \frac{\int_0^{\nu_s^{\rm up}} L(n_{\rm obs}|\nu_s)\,d\nu_s}{\int_0^{\infty} L(n_{\rm obs}|\nu_s)\,d\nu_s}$ .   (9.53)
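Equation (9.53) can be solved directly by numerical integration, without the closed form derived below. A minimal sketch (Python with scipy; $n_{\rm obs} = 6$ and $\nu_b = 3.0$ are illustrative, and the upper integration limit of 50 stands in for infinity):

```python
# Solve (9.53) numerically for the Bayesian upper limit nu_s_up with a flat
# prior, using the Poisson likelihood (9.51); n_obs and nu_b are illustrative.
from math import exp, factorial
from scipy.integrate import quad
from scipy.optimize import brentq

n_obs, nu_b, beta = 6, 3.0, 0.05

def L(nu_s):
    nu = nu_s + nu_b
    return nu ** n_obs * exp(-nu) / factorial(n_obs)

norm_const, _ = quad(L, 0.0, 50.0)     # denominator of (9.53)

def excess(nu_up):
    num, _ = quad(L, 0.0, nu_up)
    return num / norm_const - (1.0 - beta)

nu_s_up = brentq(excess, 0.0, 50.0)    # root of (9.53)
print(round(nu_s_up, 2))               # cf. Fig. 9.9(b)
```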
The integrals in (9.53) can be related to incomplete gamma functions (see e.g. [Arf95]), or, since $n_{\rm obs}$ is a positive integer, they can be solved by making the substitution $x = \nu_s + \nu_b$ and integrating by parts $n_{\rm obs}$ times. Equation (9.53) then becomes

  $\beta = \frac{e^{-(\nu_s^{\rm up} + \nu_b)} \sum_{n=0}^{n_{\rm obs}} (\nu_s^{\rm up} + \nu_b)^n / n!}{e^{-\nu_b} \sum_{n=0}^{n_{\rm obs}} \nu_b^n / n!}$ .   (9.54)

This can be solved numerically for the upper limit $\nu_s^{\rm up}$. The upper limit as a function of $\nu_b$ is shown in Fig. 9.9(b) for various values of $n_{\rm obs}$. For the case without background, setting $\nu_b = 0$ gives

  $\beta = e^{-\nu_s^{\rm up}} \sum_{n=0}^{n_{\rm obs}} \frac{(\nu_s^{\rm up})^n}{n!}$ ,   (9.55)

which is identical to the equation for the classical upper limit (9.16). This can be seen by comparing Figs 9.9(a) and (b). The Bayesian limit is always greater than or equal to the corresponding classical one, with the two agreeing only for $\nu_b = 0$.

The agreement for the case without background must be considered accidental, however, since the Bayesian limit depends on the particular choice of a constant prior density $\pi(\nu_s)$. Nevertheless, the coincidence spares one the trouble of having to defend either the classical or Bayesian viewpoint, which may account for the general acceptance of the uniform prior density in this case.

Often the result of an experiment is not simply the number $n$ of observed events, but includes in addition measured values $x_1, \ldots, x_n$ of some property of the events. Suppose the probability density for $x$ is

  $f(x; \nu_s, \nu_b) = \frac{\nu_s f_s(x) + \nu_b f_b(x)}{\nu_s + \nu_b}$ ,   (9.56)

where the components $f_s(x)$ for signal and $f_b(x)$ for background events are both assumed to be known. If these p.d.f.s have different shapes, then the values of $x$ contain additional information on whether the observed events were signal or background. This information can be incorporated into the limit for $\nu_s$ by using the extended likelihood function,

  $L(\nu_s) = \frac{(\nu_s + \nu_b)^n}{n!}\, e^{-(\nu_s + \nu_b)} \prod_{i=1}^{n} \frac{\nu_s f_s(x_i) + \nu_b f_b(x_i)}{\nu_s + \nu_b} = \frac{e^{-(\nu_s + \nu_b)}}{n!} \prod_{i=1}^{n} \left[\nu_s f_s(x_i) + \nu_b f_b(x_i)\right]$ ,   (9.57)

as defined in Section 6.9, or by using the corresponding formula for binned data as discussed in Section 6.10.

In the classical case, one uses the likelihood function to find the estimator $\hat{\nu}_s$. In order to find the classical upper limit, however, one requires the p.d.f. of $\hat{\nu}_s$. This is no longer as simple to find as before, where only the number of events was counted, and must in general be determined numerically. For example, one can perform Monte Carlo experiments using a given value of $\nu_s$ (and the known value $\nu_b$) to generate numbers $n_s$ and $n_b$ from a Poisson distribution, and corresponding $x$ values according to $f_s(x)$ and $f_b(x)$. By adjusting $\nu_s$, one can find that value for which there is a probability $\beta$ to obtain $\hat{\nu}_s \le \hat{\nu}_s^{\rm obs}$. Here one must still deal with the problem that the limit can turn out negative.

In the Bayesian approach, $L(\nu_s)$ is used directly in Bayes' theorem as before. Solving equation (9.53) for $\nu_s^{\rm up}$ must in general be done numerically. This has the advantage of not requiring the sampling p.d.f. for the estimator $\hat{\nu}_s$, in addition to the previously mentioned advantage of automatically incorporating the prior knowledge $\nu_s \ge 0$ into the limit. Further discussion of the issue of Bayesian versus classical limits can be found in [Hig83, Jam91, Cou95]. A technique for incorporating systematic uncertainties in the limit is given in [Cou92].

10 Characteristic functions and related examples

10.1 Definition and properties of the characteristic function

The characteristic function $\phi_x(k)$ for a random variable $x$ with p.d.f. $f(x)$ is defined as the expectation value of $e^{ikx}$,

  $\phi_x(k) = E[e^{ikx}] = \int e^{ikx} f(x)\,dx$ .   (10.1)

This is essentially the Fourier transform of the probability density function.
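Definition (10.1) can be checked numerically for a case where the characteristic function is known in closed form; for a Gaussian of mean $\mu$ and standard deviation $\sigma$ it is $\exp(i\mu k - \sigma^2 k^2/2)$ (cf. Table 10.1). A minimal sketch (Python with scipy; the values of $\mu$, $\sigma$ and $k$ are illustrative):

```python
# Numerical check of (10.1) for a Gaussian: the integral should reproduce
# the known form exp(i*mu*k - sigma^2 * k^2 / 2). Values are illustrative.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma, k = 1.0, 2.0, 0.7

re, _ = quad(lambda x: np.cos(k * x) * norm.pdf(x, mu, sigma), -np.inf, np.inf)
im, _ = quad(lambda x: np.sin(k * x) * norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(complex(re, im), np.exp(1j * k * mu - 0.5 * (k * sigma) ** 2))
```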
The characteristic function is useful in proving a number of important theorems, in particular those involving sums of random variables. One can show that there is a one-to-one correspondence between the p.d.f. and the characteristic function, so that knowledge of one is equivalent to knowledge of the other. Some characteristic functions of important p.d.f.s are given in Table 10.1.

Suppose one has $n$ independent random variables $x_1, \ldots, x_n$, with p.d.f.s $f_1(x_1), \ldots, f_n(x_n)$ and corresponding characteristic functions $\phi_1(k), \ldots, \phi_n(k)$, and consider the sum $z = \sum_i x_i$. The characteristic function $\phi_z(k)$ for $z$ is related to those of the $x_i$ by

  $\phi_z(k) = \int \cdots \int \exp\!\left(ik \sum_i x_i\right) f_1(x_1) \cdots f_n(x_n)\,dx_1 \cdots dx_n$
  $\phantom{\phi_z(k)} = \int e^{ikx_1} f_1(x_1)\,dx_1 \cdots \int e^{ikx_n} f_n(x_n)\,dx_n = \phi_1(k) \cdots \phi_n(k)$ ,

i.e. the characteristic function of a sum of independent random variables is the product of the individual characteristic functions.
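This multiplicative property is easy to verify by simulation. A minimal sketch (Python with numpy; two exponential distributions with means $\xi_1$, $\xi_2$ are used, whose characteristic functions are $(1 - ik\xi)^{-1}$, cf. Table 10.1; all numerical values are illustrative):

```python
# Monte Carlo check that phi_z(k) = phi_1(k) * phi_2(k) for z = x1 + x2 with
# x1, x2 independent exponential variables of means xi1, xi2 (illustrative).
import numpy as np

rng = np.random.default_rng(7)
xi1, xi2, k, n = 1.0, 2.0, 0.8, 200_000

z = rng.exponential(xi1, n) + rng.exponential(xi2, n)
phi_z_mc = np.exp(1j * k * z).mean()                        # estimate of E[exp(ikz)]
phi_prod = 1.0 / ((1 - 1j * k * xi1) * (1 - 1j * k * xi2))  # product of analytic forms
print(phi_z_mc, phi_prod)                                   # agree to MC precision
```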