152
Characteristic function.
S and related example
for the parameter £ and th I
interval [«, b] glven ;he data ™ue is *     The goal is to determine an
and Pf, < 6] = /? ho]d for fixe i;;and" fdl Uf «« Probabilities P[« <    = Q
The confides interval is found hi °f the true value f
found by solvmg equations (9.9) for a ^ 6
/. s(i;a)di,
■■■ v — co
(10.30)
Figure 10.2 shows the 68.3% confidence intervals for various values of n assuming a measured value £obs = 1. Also shown are the intervals one would obtain from the measured value plus or minus the estimated standard deviation. As n becomes larger the p.d.f. g{£;n,£) becomes Gaussian (as it must by the central limit theorem) and the 68.3% central confidence interval approaches
68.3% confidence interval
i
A    Šobs * Of
Fig. 10.2 Classical confidence intervals for the parameter of the exponential distribution £ (between solid points) and the interval [£0bs — £obs + ^j] (between open triangles) for different values of the number of measurements n, assuming an observed value £obs = 1.
UnfbIiiN-4
Up to now we have considered random variables such as particle energies, decay times, etc., usually with the assumption that their values can be measured without error. The present chapter concerns the distortions to distributions which occur when the values of these variables are subject to additional random fluctuations due to the limited resolution of the measuring device. The procedure of correcting for these distortions is known as unfolding. The same mathematics can be found under the general heading of inverse problems, and is also called decon-volutionor unsmearing. Although the presentation here is mainly in the context of particle physics, the concepts have been developed and applied in fields such as optical image reconstruction, radio astronomy, crystallography and medical imaging.
The approach here, essentially that of classical statistics, follows in many ways that of [Any91, Any92, Bel85, Zhi83, Zhi88]. Some of the methods have a Bayesian motivation as well, however, cf. [Siv96, Ski85, Ski86, Jay86].
In Section 11.1 the unfolding problem is formulated and the notation defined. Unfolding by inversion of the response matrix is discussed in Section 11.2. This technique is rarely used in practice, but is a starting point for better solutions. A simple method based on correction factors is shown in Section 11.3. The mam topic of this chapter, regularized unfolding, is described in Sections 11.4 through 11.7. This includes the strategy used to find the solution, a survey of several regularization functions, and methods for estimating the variance and bias of the solution. These points are illustrated by means of examples in Section 11.8, and information on numerical implementation of the methods is given in Section 11.9.
It should be emphasized that in many problems it is not necessary to unfold the measured distribution, in particular if the goal is to compare the result with the prediction of an existing theory. In that case one can simply modify the prediction to include the distortions of the detector, and this can be directly compared with the measurement. This procedure is considerably simpler than unfolding the measurement and comparing it with the original (unmodified) theory.
Without unfolding, however, the measurement cannot be compared with the results of other experiments, in which the effects of resolution will in general be different. It can also happen that a new theory is developed many years after a measurement has been carried out, and the information needed to modify
154 Unfolding
Formulation of the unfolding problem 155
the theory for the effects of resolution, i.e. the response function or matrix (see below), may no longer be available. If a particularly important measured distribution is to retain its value, then both the measurement and the response matrix should be preserved. Unfortunately, this is often impractical, and it is rarely done.
By unfolding the distribution one provides a result, which can directly be compared with those of other experiments as well as with theoretical predictions. Other reasons for unfolding exist in applications such as image reconstruction, where certain features may not be recognizable in the uncorrected distribution. In this chapter we will assume that these arguments have been considered and that the decision has been made to unfold.
11.1   Formulation of the unfolding problem
Consider a random variable x whose p.d.f. we would like to determine. In this chapter we will allow for limited resolution in the measurement of x, as well as detection efficiency less than 100% and the presence of background processes. As an example, we could consider the distribution of electron energies resulting from the beta decay of radioactive nuclei, i.e. the variable x refers to the energy of the emitted electron.
By 'limited resolution' we mean that because of measurement errors, the measured values of x may differ in a random way from the values that were actually created. For example, a particular beta decay may result in an electron with a certain energy, but because of the resolution of the measuring device, the recorded value will in general be somewhat different. Each observed event is thus characterized by two quantities: a true value y (which we do not know) and an observed value x.
In general one must also allow for the occurrence of a true value that does not result in any measured value at all. For the example of beta decay, it could be that an emitted electron escapes completely undetected, since the detector may not cover the entire solid angle surrounding the radioactive source, or electron energies below a certain minimum threshold may not produce a sufficiently large signal to be detected. The probability that an event leads to some measured value is called the detection efficiency1 e(y), which in general depends on the true value of the event, y.
Suppose the true values are distributed according to the p.d.f. ftme(y). In order to construct a usable estimator for /true(y), it is necessary to represent it by means of some finite set of parameters. If no functional form for /true(y) is known a priori, then it can still be represented as a normalized histogram with M bins. The probability to find y in bin j is simply the integral over the bin,
'If the reason that the event went undetected is related to the geometry, e.g. limited solid angle of the detector, then the efficiency is often called acceptance. The term efficiency is sometimes used to refer to the conditional probability that an event is detected given that it is contained in the sensitive region of the detector. Here we will use efficiency in the more general sense, meaning the overall probability for an event to be detected.
Pj    f   /m,e(y)f/y- (Hi)
J bin j
Suppose we perform an experiment in which a certain total number of events mtot occur; this will differ in general from the number observed. The number TOtc could be treated as fixed or as a random variable. In either case, we will call the expectation value of the total number of events /jtot - E[rntol], so that the expected number of events in bin j is
Hj = fttotPj- I11-2)
Wc will refer to the vector fi = (m, ■. .,Hm) as the 'true histogram'. Note that these are not the actual numbers of events in the various bins, but rather the corresponding expectation values, i.e. the m are not in general integers. One could, for example, regard the true number of events in bin i as a random variable nii with mean fit. Because of the limited resolution and efficiency, however, m,-is not directly observable, and it does not even enter the present formulation of the problem. Instead, we will construct estimators directly for the parameters
For reasons of convenience one usually constructs a histogram of the observed values as well. Suppose that we begin with a sample of measured values of x, and that these are entered into a histogram with N bins, yielding n = («i,. • .,%)• These values could also be sample moments, Fourier coefficients, etc. In fact, the variable x could be multidimensional, containing not only a direct measurement of the true quantity of interest y, but also correlated quantities which provide additional information on y.
The number of bins N may in general be greater, less than, or equal to the number of bins M in the true histogram. Suppose the ith bin contains reentries, and that the total number of entries is ]T\ m = ntot. It is often possible to regard the variables n,- as independent Poisson variables with expectation values Vi. That is, for this model the probability to observe m entries in bin i is given by
n;!
Since a sum of Poisson variables is itself a Poisson variable (cf. Section 10.4), "tot will then follow a Poisson distribution with expectation value i/tot = 5Z»"»■ We may also consider the case where ntot is regarded as a fixed parameter, and where the n,- follow a multinomial distribution. Whatever the distribution, we will call the expectation values
Ui = E[m]. (11.4)
156 Unfolding
Formulation of the unfolding problem 157
The form of the probability distribution for the data n = (n1(..., nN) (Pois-son, multinomial, etc.) will be needed in order to construct the likelihood function, used in unfolding methods based on maximum likelihood. Alternatively, we may be given the covariance matrix,
covin,-, rij]
(11.5)
which is needed in methods based on least squares. We will assume that either the form of the probability law or the covariance matrix is known.
By using the law of total probability, (1.27), the expectation values v{ = E[nt] can be expressed as
Ptot P (event observed in bin i J dyP
Mtot
observed in bin i
true value y and event detected
— ßtot
dx
bin i
jdys(x\y)e(y) ftT,x(y).
(11.6)
Here s(x\y) is the conditional p.d.f. for the measured value x given that the true value was y, and given that the event was observed somewhere, i.e. it is normalized such that / s(x\y)dx = 1. We will call s the resolution function or in imaging applications the point spread function. One can also define a response function,
r{x\y) = s{x\y)e(y),
(11.7)
which gives the probability to observe x, including the effect of limited efficiency, given that the true value was y. Note that this is not normalized as a conditional p.d.f. for x. One says that the true distribution is folded with the response function, and thus the task of estimating /true is called unfolding.
Breaking the integral over y in equation (11.6) into a sum over bins and multiplying both numerator and denominator by ftj} the expected number of entries to be observed in bin i becomes
J2 jkaj ^ dV s(x\y) S(y) ftrue(y)
M
(11.8)
where the response matrix R is given by
Ri
/bin,-dar /bin i rfy *<ar>y> e(y> <y>
fbinjdyftrue{y)
P(observed in bin i and true value in bin j) P(true value in bin j)
=   P (observed in bin j | true value in bin j).
(11.9)
The response matrix element R;, is thus the conditional probability that an event will be found with measured value x in bin i given that the true value y was in bin j. The effect of off-diagonal elements in R is to smear out any fine structure. A peak in the true histogram concentrated mainly in one bin will be observed over several bins. Two peaks separated by less than several bins will be merged into a single broad peak.
As can be seen from the first line of equation (11.9), the response matrix depends on the p.d.f. /true(s/). This is a priori unknown, however, since the goal of the problem is to determine /true(s0- If the bms of the unfolded histogram are small enough that s(x\y) and s{y) are approximately constant over the bin of y, then the direct dependence on /true(y) cancels out. In the following we will assume that this approximation holds, and that the error in the response matrix due to any uncertainty in /true(y) can be neglected. In practice, the response matrix will be determined using whatever best approximation of /true(y) is available prior to carrying out the experiment.
Although s(x\y) and e[y) are by construction independent of the probability that a given value y occurs (i.e. independent of/true(y)), they are not in general completely model independent. The variable y may not be the only quantity that influences the probability to obtain a measured value x. For the example of beta decay where y represents the true and x the measured energy of the emitted electron, s(x\y) and e(y) will depend in general on the angular distribution of the electrons (some parts of the detector may have better resolution than others), and different models of beta decay might predict different angular distributions.
In the following we will neglect this model dependence and simply assume that the resolution function s(x\y) and efficiency e(y), and hence the response matrix Rij, depend only on the measurement apparatus. We will assume in fact that R can be determined with negligible uncertainty both from the standpoint of model dependence as well as from that of detector response. In practice, R is determined either by means of calibration experiments where the true value y is known a priori, or by using a Monte Carlo simulation based on an understanding of the physical processes that take place in the detector. In real problems the model dependence may not be negligible, and the understanding of the detector itself is never perfect. Both must be treated as a possible sources of systematic
Note that the response matrix Rij is not in general symmetri
c (nor even
158 Unfolding
inverting the response matrix 159
square), with the first index i = l,...,N denoting the bin of the observed histogram and the second index j — 1,..., M referring to a bin of the true histogram. Summing over the first index and using f s(x\y)dx = 1 gives
k   ~ h to/*-)
/bin,' ^^ /true(y)
/bin7 ftrue(y)dy
(11.10)
i.e. one obtains the average value of the efficiency over bin j.
Tn addition to limited resolution and efficiency, one must also allow for the possibility that the measuring device produces a value when no true event of the type under study occurred, i.e. the measured 'value was caused by some background process. In the case of beta decay, this could be the result of spurious signals in the detector, the presence of radioactive nuclei in the sample other than the type under study, interactions due to particles coming from outside the apparatus such as cosmic rays, etc. Suppose that we have an expectation value /?,• for the number of entries observed in bin i which originate from background processes. The relation (11.8) is then modified to be
m
vt = + j3i. (11.11)
i=i
Note that the (3, include the effects of limited resolution and efficiency of the detector. They will usually be determined either from calibration experiments or from a Monte Carlo simulation of both the background processes and the detector response. In the following we will assume that the values are known, although in practice this will only be true to a given accuracy. The uncertainty in the background is thus a source of systematic error in the unfolded result.
To summarize, we have the following vector quantities (referred to also in a general sense as histograms or distributions):
(1) the true histogram (expectation values of true numbers of entries in each bin), fx= (pi,. ..,hm),
(2) the normalized true histogram (probabilities), p = (pi, ... ,Pm) = m/Vtot,
(3) the expectation values of the observed numbers of entries, v = (yi,..,, vn),
(4) the actual number of entries observed (the data), n = (m,..., nN),
(5) efficiencies e = (Si,... ,Sm), and
(6) expected background values ft = (fa,..., j3N).
It is assumed either that we know the form of the probability distribution for the data n, which will allow us to construct the likelihood function, or that we
have the covariance matrix Vy = covfn.-.n,-], which can be used to construct a X2 function. In addition we have the response matrix TUj, where t = A
represents the bin of the observed histogram, and j = 1.....M gives the bin of
the true histogram. We will assume that 11 and ft are known. The vectors /*. v. (5 and the matrix R are related by
v = Rß ■"+ ft,
(11.12)
where (i, v and ft should be understood as column vectors in matrix equations. The goal is to construct estimators A for t he true histogram, or estimators p for the probabilities.
11.2   Inverting the response matrix
In this section we will examine a seemingly obvious method for constructing estimators for the true histogram /i, which, however, often leads to an unacceptable solution. Consider the case where the number of bins in the true and observed histograms arc equal, M = N. For now we will assume that the matrix relation v — Rfi + ft can be inverted to give
(l = Br1(y-ft).
(11.13)
An obvious choice for the estimators of v is given by the corresponding data values,
i> = n. (H-14) The estimators for the n are then simply
/IrrJl-^lk-jS). (1L15)
One can easily show that this is, in fact, the solution obtained from maximizing the log-likelihood function,
n
\ogL{p) - Yl log Tint; Vi),
(11.16)
i=i
where P(nt;i/,) is a Poisson or binomial distribution. It is also the least squares solution, where one minimizes
n
(11.17)
Note that logi(/x) and x2(m) can be written as functions of » or u, since the relation u = Rfi + ft always holds. That is, when differentiating (11.16) or (11.17) with respect to m one uses dvi/dfij = Rij-
160 Unfolding
inverting che response matrix 161
Before showing how the estimators constructed in this way can fail, it is interesting to investigate their bias and variance. The expectation value of ftj is given by
n n
m] = Y,(R~lh E[m - ßi] = ^(ir1),, fa - ßi)
i=l
(11.18)
so the estimators fa are unbiased, since by assumption, v{ = n{ is unbiased. For the covariance matrix we find
TV
k,l = l
N
)ik
(R-
jk Vk,
(11.19)
A = l
where to obtain the last line we have used the covariance matrix for independent Poisson variables, covfn/j. n{\ — Ski^k-
In the following wc will use the notation VJ-j = cov[n,-, n,-] for the covariance matrix of the data, and Vij = cov[/i,-, for that of the estimators of the true distribution. Equation (11.19) can then be written in matrix notation,
U = R'1 V (R-1f. (11.20)
Consider now the example shown in Fig. 11.1. The original true distribution /i is shown in Fig. 11.1(a), and the expectation values for the observed distribution v are shown in the histogram of Fig. 11.1(b).
The histogram v has been computed according to v = Ftp, i.e. the background (3 is taken to be zero. The response matrix R is based on a Gaussian resolution function with a standard deviation equal to 1.5 times the bin width, and the efficiencies e, are all taken to be unity. This results in a probability of approximately 26% for an event to remain in the bin where it was created, 21% for the event to migrate one bin, and 16% to migrate two or more bins.
Figure 11.1(c) shows the data n = (m,. ..,n«). These have been generated by the Monte Carlo method using Poisson distributions with the mean values Vi from Fig. 11.1(b). Since the number of entries in each bin ranges from around 102 to 103, the relative statistical errors (ratio of standard deviation to mean value) for the n; are in the range from 3 to 10%.
Figure 11.1(d) shows the estimates p, obtained from matrix inversion, equation (11.15). The error bars indicate the standard deviations for each bin. Far from achieving the 3-10% precision that we had for the n,-, the \ij oscillate
M
2000
1000
		--r-----1---— (a)
	i—s	A
	J__	1. .................i i
2000 h
1000
	-I-"- 1------- (b)
	1 H
u r i	J L 1 -
r ......i _j__i_—	Ll L _J__1-
0       0.2      0.4      0.6 0!
A
0      0.2     0.4     0.6     0.8 1
n
2000
1000
	(c)
	
-J	
	i i i
X10 10000
sooo I-
•5000
-10000
(d)
J=3.--1_
0.2
0.4 0.6
0.2
0.4 0.6
Fig. 11.1 (a) A hypothetical true histogram m. (b) the histogram of expectation values V = Kix, (c) the histogram of observed data n, and (d) the estimators £ obtained from inversion of the response matrix.
wildly from bin to bin, and the error bars are as large as the estimated values themselves. (Notice the increased vertical scale on this plot.) The correlation coefficients for neighboring bins are close to —1.
The reason for the catastrophic failure stems from the fact that we do not have the expectation values v\ if we did, we could simply compute fi = R~xv. Rather, we only have the data n, which are random variables and hence subject to statistical fluctuations. Recall that the effect of the response matrix is to smear out any fine structure. If there had been peaks close together in fj., then although these would be merged together in v, there would still remain a certain residual fine structure. Upon applying R~x to is, this remnant of the original structure would be restored. The data n have indeed statistical fluctuations from bin to bin, and this leads to the same qualitative result as would a residual fine structure in v. Namely, the unfolded result is given a large amount of fine structure, as is evident in Fig. 11.1(d).
It is interesting to compare the covariance matrix U (11.19) with that given by the RCF inequality (cf. Section 6.6); this gives the smallest possible variance for any choice of estimator. For this we will regard the n, as independent Poisson
162 Unfolding
Inverting the response matrix 163
variables with mean values     The log-likelihood function is
n n \0gL(ß) = £ log/>(„.; „,.) = £>g (^1
thus
Dropping additive terms that do not depend on /* gives
n
hgL(ti) = J2(ru \ogVi~Ui).
(11.21)
(11.22)
Comoonentn,nf€Ck *SCttmg ^ dellVativeS °f wilh ^ to the
components of p equal to zero,
dlogl dßk
n
n
{rt  dv>   dm, ~ 2-* \ — - 1) Rik = °>
(11.23)
one obtains in fact the same estimate f, — *. u nifWr,+;o+; estimators,     = n, as we have seen previously,
umerentiating one more time gr
lives
d2\ogL d/i.k dp
The RCF bound for the inverse tion (6.19)) is therefore
n
ni Rik Ri,
1 _____ v-^ rij Uj
i     4-i v.
(11.24)
covariance matrix for the case of zero bias (equa-
([/-%   = -E
\d2 logi
dßk dpi
n
_ \p M^]Rik Rü
i = l
Rik Ril
_ nth -
^ Vi
»=1
(11.25)
By multiplying both sides of the equation once by U, twice by JJ-1 and summing Z&ZSSSr mdlC6S' ^       SOlVS (lL25) *» the RCF bound fo^thf
n
k = l
(11.26)
This is the same as the result of the exact calculation (11.19) so we see that the maximum likelihood solution is both unbiased and efficient., Ie It has Jhe
smallest possible variance lor an estimator with zero bias. We would obtain the same result using the method of least squares; in that case, unbiased and efficient estimators are guaranteed by the Gauss-Markov theorem.
Although the solution in Fig. 11.1(d) bears little resemblance to the true distribution, it has certain desirable properties. It is simple to construct, has zero bias, and the variance is equal to the RCF bound. In order to be of use, however, the correlations must be taken into account. For example, one can test the compatibility of the estimators £ with a hypothesis /i0 by constructing a r statistic,
(11,27)
which uses the full covariance matrix U of the estimators. This test would be meaningless if the %2 were to be computed with only the diagonal elements of U. We should also note that although the variances are extremely large in the example shown here, they would be significantly smaller if the bins are made large compared to the width of the resolution function.
Regardless of its drawbacks, response-matrix inversion indicates some important lessons and provides a starting point for other methods. Since the inverse-matrix solution has zero bias and minimum variance as given by the RCF inequality, any reduction in variance can only be achieved by introducing a bias. The art of unfolding consists of constructing biased estimators p. such that the bias will be small if our prior beliefs, usually some assumptions concerning smoothness, are in fact correct. Roughly speaking, the goal is to find an optimal trade-off between bias and variance, although we will see in Section 11.7 that there is a certain arbitrariness in determining how this optimum is achieved.
The need to incorporate prior knowledge suggests using the Bayesian approach, where the a priori probabilities are combined with the data to yield a posteriori probabilities for the true distribution (cf. Sections 1.2, 6.13). This is a common starting point in the literature on unfolding. It suffers from the difficulty, however, that prior knowledge is often of a complicated or qualitative nature and is thus difficult to express in terms of prior probabilities. The fact that prior beliefs are inherently subjective is not a real disadvantage here; in the classical approach as well there is a certain subjectivity as to how one chooses a biased estimator. In the following we will mainly follow classical statistics, using bias and variance as the criteria by which to judge the quality of a solution, while pointing out the connections with the Bayesian techniques wherever possible.
As a final remark on matrix inversion, we can consider the case where the number of bins M in the unfolded histogram is not equal to the number of measured bins N. For M > N, the system of equations (11.12), v - Rp + (5, is underdetermined, and the solution is not unique. The methods presented in Section 11.4 can be used to select a solution as the estimator p.. For M < N, (11.12) is overdetermined, and an exact solution does not exist in general. An approximate solution can be constructed using, for example, the methods of maximum
164 Unfolding
General strategy of regularized unfolding 165
likelihood or of least squares, i.e. the problem is equivalent to parameter estimation as discussed in Chapters 5-8. If M is large, then correlations between the estimators can lead to large variances. In such a case it may be desirable to reduce the variances, at the cost of introducing bias, by using one of the regularization methods of Section 11.4.
MC
,MC
,MC
,MC
+ m
(11.31:
11.3   The method of correction factors
Consider the case where the bins of the true distribution fi are taken to be the same as those of the data n. One of the simplest and perhaps most commonly used techniques is to take as the estimator for m
fii = d{ni - /?;),
;n.28)
where /?,• is the expected background and C, is a multiplicative correction factor. The correction factors can be determined using a Monte Carlo program which includes both a model of the process under study as well as a simulation of the measuring apparatus. The factors Q are determined by running the Monte Carlo program once with and once without the detector simulation, yielding model predictions for the observed and true values of each bin, i/f*0 and fif10. Here j/mc refers to the signal process only, i.e. background is not included. The correction factor is then simply the ratio,
MC
.MC '
(11.29)
For now we will assume that it is possible to generate enough Monte Carlo data so that the statistical errors in the correction factors are negligible. If this is not the case, the uncertainties in the C\ can be incorporated into those of the estimates //,• by the usual procedure of error propagation.
If the effects of resolution are negligible, then the response matrix is diagonal, i.e. Rij = SijSj, and therefore one has
Pi — £j/x»
(11.30)
where i/?lg is the expected number of entries in bin i without background. Thus the correction factors become simply C; = 1/e,-, so that 1/C; plays the role of a generalized efficiency. When one has off-diagonal terms in the response matrix, however, the values of 1/C,- can be greater than unity. That is, because of migrations between bins, it is possible to find more entries in a given bin than the number of true entries actually created there. The expectation value of the estimator is
The estimator     thus lias a bias which is only zero if the ratios m/vf* are same for the Monte Carlo mode! and for the real experiment. The covariance matrix for the estimators is given by
the
/[/},-,/ij]   =   C'f cov[m,nj} = CfSijVi.
(11.32)
The last line uses the covariance matrix for the case where the n,- are independent Poisson variables with expectation values v4. For many practical problems, the Q are of order unity, and thus the variances of the estimates /),, are approximately the same as what one would achieve with perfect resolution. In addition, the technique is simple to implement, not even requiring a matrix inversion. The price that one pays is the bias,
h =
„MC Pi
„MC
Pi
(11.33)
A rough estimate of the systematic uncertainty due to this bias can be obtained by computing the correction factors with different Monte Carlo models. Clearly a better model leads to a smaller bias, and therefore it is often recommended that the estimated distribution fi be used to tune the Monte Carlo, i.e. by adjusting its parameters to improve the agreement between i/MC and the background subtracted data n - /3. One can then iterate the procedure and obtain improved correction factors from the tuned model.
A danger in the method of correction factors is that the bias often pulls the estimates £ towards the model prediction nMC. This complicates the task of testing the model, which may have been the purpose of carrying out the measurement in the first place. In such cases one must ensure that the uncertainty in the unfolded result due to the model dependence of the correction factors is taken into account in the estimated systematic errors, and that these are incorporated into any model tests.
11.4   General strategy of regularized unfolding
Although the method of correction factors is simple and widely practiced, it has a number of disadvantages, primarily related to the model dependence of the result. An alternative approach is to impose in some way a measure of smoothness on
166 Unfolding
Regularization functions 167
the estimators for the true histogram ft. This is known as regularization of the unfolded distribution.
As a starting point, let us return to the oscillating solution of Section 11.2 obtained from inversion of the response matrix. This estimate for \t is characterized by a certain maximum value of the log-likelihood function logLmax, or a minimum value of the x2- In the following we will usually refer only to the log-likelihood function; the corresponding relations using x2 can be obtained by the replacement logL = —x2/2.
One can consider a certain region of /i-space around the maximum likelihood (or least squares) solution as representing acceptable solutions, in the sense that they have an acceptable level of agreement between the predicted expectation values v and the data n. The extent of this region can be defined by requiring that log L stay within some limit of its maximum value. That is, one determines the acceptable region of fi-space by
bgL(ji) > logLmax - A log L or for the case of least squares,
(11.34)
X2(M) < xLn + Ax2 (H-35) for appropriately chosen A logL or Ax2. The values of A log L or Ax2 will determine the trade-off between bias and variance achieved in the unfolded histogram; we will return to this point in detail in Section 11.7.
In addition to the acceptability of the solution, we need to define a measure of its smoothness by introducing a function S(ft), called the regularization function. Several possible forms for S{fx) will be discussed in the next section. The general strategy is to choose the solution with the highest degree of smoothness out of the acceptable solutions determined by the inequalities (11.34) or (11.35).
Maximizing the regularization function 5(/x) with the constraint that log £>(/*) remain equal to log Lmax - A log L is equivalent to maximizing the quantity
apogL(M) - (logLmax - A logL)] + S(ut) (11.36) with respect to both \i and a. Here a is a Lagrange multiplier called the regularization parameter, which can be chosen to correspond to a specific value of A logL. For a given a, the solution is thus determined by finding the maximum of a weighted combination of logL and the S(n),
*(M) = a logL(M) + S(/i). (11.37) Setting a = 0 leads to the smoothest distribution possible; this ignores completely the data n. A very large a leads to the oscillating solution from inversion of the response matrix, corresponding to having the likelihood function equal to its maximum value.
In order for the prescription of maximizing ®{fi) to be in fact equivalent to the general strategy stated above, the surfaces of constant log L(fjt) and S(ft)
must be sufficiently well behaved; in the following we will assume tins to be the case. In particular, they should not change from convex to concave or have a complicated topology such that multiple local maxima exist.
Recall that we can write log L and S as functions of /tor u, since the relation v=Rti + l3 always holds. In a similar way, we will always take the relation
v = Rfi + 0
(11.38)
to define the estimators for u\ knowing these is equivalent to knowing the estimators p,. Note, however, that in contrast to the method of Section 11.2, we will no longer have i> = n. It should also be kept in mind that /«tot = Ylj N an(l ^tot = Yli vi — Yli j R'jVj are also functions of /x.
Here we will only consider estimators /X for which the estimated total number of events z>tot is equal to the number actually observed,
n
n m t=l i=l 3=1
This condition is not in general fulfilled automatically. It can be imposed by modifying equation (11.37) to read
Vtot
"tot-
(11.39)
<p{H,\) = a logL{ft) + S(n) + A
;JV.
"tot
(11.40)
where A is a Lagrange multiplier. Setting df/dX = 0 then leads to ]T\ Vi = ntot-As a technical aside, note that it does not matter whether the regularization parameter a is attached to the regularization function S(/.i) (as it is in many references) or with the likelihood function. In the particular numerical implementation given in Section 11.9, it is more convenient to associate a with the likelihood.
11.5   Regularization functions
11.5.1   Tikhonov regularization
A commonly used measure of smoothness is the mean value of the square of some derivative of the true distribution. This technique was suggested independently by Phillips [Phi62] and Tikhonov [Tik63, Tik77], and is usually called Tikhonov regularization. If we consider the p.d.f. /true(2/) before being discretized as a histogram, then the regularization function is
si/,,
.Ml = - /
dkftvue(y)
dyk
dy,
(11.41)
where the integration is over all allowed values of y. The munis sign comes from the convention taken here that we maximize <p as defined by (11.4U). lnat is, greater S corresponds to more smoothness. (Equivalents one can of course
168 Unfolding
Regularization functions 169
minimize a combination of regularization and log-likelihood functions with the opposite sign; this convention as well is often encountered in the literature.)
In principle, a linear combination of terms with different derivatives could be used; in practice, one value of A; is usually chosen. When /true(y) is represented as a histogram, the derivatives are replaced by finite differences. For equal bin widths, one can use for k = 1 (cf. [Pre92])
M-l
S{n) = - 5^(/i,-'->,-+i)2,
for k = 2
M-2
i=l
or for   = 3
M-3
(11.42)
(11.43)
(11.44)
A common choice for the derivative is k = 2, so that S^//) is related to the average curvature.
If the bin widths Ay,- are all equal, then they can be ignored in (11.42)-(11.44). This would only give a constant of proportionality, and can be effectively-absorbed into the regularization parameter a. If the Ay,- are not all equal, then this can be included in the finite differences in a straightforward manner. For k = 2, for example, one can assume a parabolic form for ftrue(y) within each group of three adjacent bins,
fi{y) = aoi + any + a2iy2
(11.45)
There are M-2 such groups, centered around bins i — 2,____M-l. The
coefficients can be determined in each group by setting the integrals of fi(y) over bins i — 1, i and i + 1 equal to the corresponding values of m and
The second derivative for the group centered around bin i is then /" = 2a2;, and the regularization function can thus be taken to be
M-i
s(ß) = ~ £ /pAj/,,
(11.46)
2=2
Note that the second derivative cannot be determined in the first and last bins. Here they are not included in the sum (11.46), i.e. they are taken to be zero; alternatively one could set them equal to the values obtained in bins 2 and M-l.
For any value of the derivative k and regardless of the bin widths, the functions S(n) given above can be expressed as
M
(11.47)
i,j=l
where G is a symmetric matrix of constants. For A- = 2 with equal bin widths (11.4.3), for example, G is given by
3 < i < M - 2,
Gu = 6
Gi,«±i = Gi±i,i — ~4
Gi,,±2 = Gi±2,» = 1
Gu = Gum — 1>
G22 = Gjw-i,m-i = 5)
G12 = G21 = Gm,m-i = Gm-i,m ~ -2,
(11.48)
with all other Gu equal to zero. .
In order to obtain the estimators and their covariance matrix (Section 11.6), we will need the first and second derivatives of S. These are
ÖS
m
-2^GijMj
and
d2S 9r
(11.49)
(11.50)
Tikhonov regularization using k = 2 has been widely applied in particle physics for the unfolding of structure functions (distributions of kinematic variables in lepton-nucleon scattering). Further descriptions can be found in [Blo85, Höc96, Ftoe92, Zec95].
11.5.2   Regularization functions based on entropy
Another commonly used regularization function is based on the entropy H of a probability distribution p = {pi,.. - ,Pm), defined as [Sha48]
m
H - - ^p.TogPi
(11.51)
«=i
The idea here is to interpret the entropy as a measure of the smoothness of a histogram fjt = (mi , • • • > A«M), and to use
m
5(^)^(m) = -E£;10^
(11.52)
i=i
as a regularization function. Estimators based on (11.52) are said to be constructed according to the principle of maximum entropy or MaxEnt. To see how
170 Unfolding
Regularization functions 171
2-hi'swf ^ t0 T°0thneSS' Wnsider the number of ways in which a par-icular histogram ,x = (^ ..,, m) can be constructed out of fhot entries (here the values ^ are integers). This is given by 1 "
£2(/i) =
A* tot!
Stft^V116 fa/tor appearS m the multinomial distribution (2.6).) By nilogn - 1), valid for large n, one obtains
(11.53)
logfi
m
^tot(log/itot - 1) - ^^(log/ii - 1)
= - y~!wiog—
=   ft„t%). (11.54)
We will use equation (11.54) to generalize logO to the case where the /x,- are not integers.
If all of the events are concentrated in a single bin, i.e. the histogram has the minimum degree of smoothness, then there is only one way of arranging them, and hence the entropy is also a minimum. At the other extreme, one can show that the entropy is maximum for the case where all m are equal, i.e. the histogram corresponds to a uniform distribution. To maximize H with the constraint J2iPi = 1, a Lagrange multiplier can be used.
For later reference, we list here the first and second derivatives of the entropy-based S(n):
dS_
dm
= _^logifL_^)
and
Mtot
Ptot
lhot
(11.55)
In the Bayesian approach, the values ft are treated as random variables in the sense of subjective probability (cf. Section 1.2), and the joint probability density /(/i|n) represents the degree of belief that the true histogram is given by fi. To update our knowledge about fj, in light of the data n, we use Bayes' theorem,
/(/i|n) oc L(n\fi) ir(li), (11.57)
where L[n\n) is the likelihood function (the conditional probability for the data n given fi) multiplied by the prior density n(fjt). The prior density represents our knowledge about fi before seeing the data n.
Here we will regard the total number of events //tot as an integer. This is in contrast to the classical approach, where j/tot represents an expectation value of an integer random variable, and thus is not necessarily an integer itself. Suppose we have no prior knowledge about how these /itot entries are distributed in the histogram. One can then argue that by symmetry, each of the possible ways of placing fitot entries into M bins is equally likely. The probability for a certain histogram (/i1;. . .,/jM) therefore should be, in the absence of any other prior information, proportional to the number of ways in which it can be made; this is just the number Q given by equation (11.53). The total number of ways of distributing the entries fi(/x) is thus interpreted as the prior probability 7r(/x),
ir(ß)   -   Q(ß) =
lhoť-
ßi\ß2l ■■■ /*M!
= exp(//tot#),
(11.58)
where H is the entropy given by equation (11.51).
From the strict Bayesian standpoint, the job is finished when we have determined f(n\n). It is not practical to report f{ß\n) completely, however, since this is a function of as many variables as there are bins M in the unfolded distribution. Therefore some way of summarizing it must be found; to do this one typically selects a single vector ß as the Bayesian estimator. The usual choice is the ß for which the probability f{ß\n), or equivalently its logarithm, is a maximum. According to equation (11.57), this is determined by maximizing
Ptot
i ßtot
+ log
Ptot
+ 2S(ß)
(11.56)
11.5.3   Bayesian motivation for the use of entropy
In much, of the literature on unfolding problems, the principle of maximum entropy is developed in the framework of Bayesian statistics. (See, for example, [Siv96, Jay86, Pre92].) This approach to unfolding runs into difficulties, however, as we will see below. It is nevertheless interesting to compare Bayesian MaxEnt with the classical methods of the previous section.
log/(/*|n)   oc   logI(/x|n) + log w(fj.)
-   log /.(//.jn) + //,„, It (ft) =   logI(/*|n) + fitotHht). (11.59) The Bayesian prescription thus corresponds to using a regularization function
M
S(ß) = ßtotH(ß) = - Y] ßi log—.
r-f ßtot
(11.60)
172 Unfolding
Variance and bias of the estimators 173
Furthermore, the regularization parameter o is no longer an arbitrary factor but is set equal to 1. If all of the efficiencies are equal, then the requirement "tot = "tot also implies that /itot is constant. This is then equivalent to using the previous regularization function S(n) = H with a = l//itot-
If the efficiencies are not all equal, however, then constant utot does not imply constant fitot, and as a result, the distribution of maximum S{fi) = Ht0tH(fJ.) is no longer uniform. This is because S can increase simply by increasing /itot, and thus in the distribution of maximum 5, bins with low efficiency are enhanced. In this case, then, using H and /Jtot-ff as regularization functions will lead to somewhat different results, although the difference is in practice not great if the efficiencies are of the same order of magnitude. In any event, S = H is easier to justify as a measure of smoothness, since the distribution of maximum H is always uniform.
We will see in Section 11.9 that the Bayesian estimator (11.59) gives too much weight to the entropy term (see Fig. 11.3(a) and [Ski86]). From the classical point of view one would say that it does not represent a good trade-off between bias and variance, having an unreasonably large bias. One can modify the Bayesian interpretation by replacing ^tot in (11.59) by an effective number of events /ieff, which can be adjusted to be smaller than /itot- The estimator is then given by the maximum of
logL(/i|n) + fief{H(n):
(11.61)
This is equivalent to using S(fi) = H(fx) as before, and the parameter /ieff plays the role of the regularization parameter.
The problem with the original Bayesian solution stems from our use of Q(m) as the prior density. From either the Bayesian or classical points of view, the quantities p = fi/ntot are given by some set of unknown, constant numbers, e.g. the electron energy distribution of specific type of beta decay. In either case, our prior knowledge about the form of the distribution (i.e. about p, not n) should be independent of the number of observations in the data sample that we obtain. This points to a fundamental problem in using 7r(^i) = since this becomes
increasingly concentrated about a uniform distribution (i.e. all p,- equal) as //tot increases.
It is often the case that we have indeed some prior beliefs about the form of the distribution p, but that these are difficult to quantify. We could say, for example, that distributions with large amounts of structure are a priori unlikely, since it may be difficult to imagine a physical theory predicting something with lots of peaks. On the other hand, a completely flat distribution may not seem very physical either, so fi(/x) does not really reflect our prior beliefs. Because of these difficulties with the interpretation of Cl(fJ.) as a prior p.d.f., we will stay with the classical approach here, and simply regard the entropy as one of the possible regularization functions.
11.5.4 Regularization function based on cross-entropy Recall that the distribution of maximum entropy is flat, and thus the bias introduced into the estimators p will tend to pull the result towards a more uniform distribution. Suppose we know a distribution q = (</i,.. - ,cjm) that we regard as the most likely a priori shape for the true distribution p = fj,/fitot- We will call q the reference distribution. Suppose that we do not know how to quantify our degree of belief in q, however, and hence we do not have a prior density Tt(fi) for use with Bayes' theorem. That is, q represents the normalized histogram fi/ntot for which the prior density is a maximum, but it does not specify the entire prior density.
In this case, the regularization function can be taken as
S(/i)= A'(p;q), (11.62) where A'(p;q) is called the cross-entropy [Kul64] or Shannon-Jaynes entropy [Jay68], defined as
A(p;q) =
m
E
i = l
Pi log
Pi
(11.63)
The cross-entropy is often defined without the factor of M, and also without the minus sign, in which case the principle of maximum entropy becomes the principle of minimum cross-entropy. We will keep the minus sign so as to maintain the similarity between K(p; q) and the Shannon entropy H(p) (11.51). Note that A(p;q) = H(p) when the reference distribution is uniform, i.e. qi = 1/M for all i.
One can easily show that the cross-entropy A'(p; q) is a maximum when the probabilities p are equal to those of the reference distribution q. The effect of using the regularization function (11.62) is that the bias of the estimators p will be zero (or small) if the true distribution is equal (or close) to the reference distribution.
11.6   Variance and bias of the estimators
The estimators p are functions of the data n, and are hence themselves random variables. In order to obtain the covariance matrix Uy = cov[/i;, fij], we can calculate an approximate expression for p as a function of n, and then use the error propagation formula (1.54) to relate U to the covariance matrix for the data, Vjj = cov[rii,nj].
The estimators p are found by maximizing the function <p(fJL, X) given by (11.40), which uses a given log-likelihood or x2 function and some form of the regularization function S(n) (Tikhonov, entropy, etc.). The estimators p and the Lagrange multiplier A are thus solutions to the system of M + 1 equations
Fi(t»,X,n)=0,   i= 1,...,M+1,
(11.64)
where
174 Unfolding
Variance and bias of the estimators 175
%,xb)=   ^ ' (11.65)
Iff    i=M + l.
Suppose the data actually obtained are given by the vector n, the corresponding estimates are fx — £(n), and the Lagrange multiplier A has the value A. We would like to know how ft and A would change if the data were given by some different values n. Expanding the functions Fi((x, A,n) to first order in a Taylor series about the values ft, A and n gives
dnk
= -{A-lB)ik = dk,
(11.71)
where the matrices A and B are given by equations (11.68) and (11.69). What we will use here is not the entire matrix C, but rather only the M x N submatrix, excluding the row i = M + 1. which refers to the Lagrange multiplier a. The final expression for the covariance matrix U can thus be expressed in the more compact form,
m r
Fi\H, A,n)   a    F,-(A~t.A.n) + £
3 = 1
dFi
+
dX
dfj,,
. (a-a) + y
£.A.ii
n
i=i
dFi
dm
M A n
. (ni ~ (11-66)
The first term F,-(/i,A,n) as well as the entire expression F,(fz, A,n) are both equal to zero, since both sets of arguments should represent solutions. Solving equation (11.66) for n gives
fi,(n) k p, - A"1 B{n-n), (11.67) where the M + 1 component of rz refers to the Lagrange multiplier A. The symmetric (M + 1) x [M + 1) matrix A is given by
3ft,d^j
.   «,J=.l.....W,
A-j —
(11.68)
gj& = 0,   t = M + l,j = M + l, and the (M + 1) x JV matrix 5 is
B      f aSk?   i = l,---,M,; = l,...,A', Uxfe = 1.   i=M + l,j=l,...,N. By using the error propagation formula (1.54), the covariance matrix for the estimators Uij = cov[/j,-,/jj-] is obtained from the covariance matrix for the data Vij — cov[rc,-, rij] by
(11.69)
n
(11.70)
The derivatives in (11.70) can be computed using (11.67) to be
u = CFC7
(11.72)
The derivatives in (11.68) and (11.69) depend on the choice of regularization function and on the particular log-likelihood function used to define >p(fi,X) (11.40), e.g. Poisson, Gaussian (logL = -v2/'2). etc. In the case, for example, where the data are treated as independent Poisson variables with covariance matrix Vij = SjjVi, and where the entropy-based regularization function (11.54) is used, one has
ay
dmdfij
N
E
aj=i
-a >   R-kiRkj —?
+
/-'tot
1 -
o~ij Mtot Hi
+ log ( Hp- ) + 2S(»)
and
6\p di-iidnj
aRj
(11.73)
(11.74)
The matrices A and B (and hence C) can be determined by evaluating the derivatives (11.73) and (11.74) with the estimates for n and v obtained in the actual experiment. Table 11.1 summarizes the necessary ingredients for Poisson and Gaussian log-likelihood functions. Note that for the Gaussian case, i.e. for the method of least squares, the quantities always refer to logL = — f x2i ancl not to x2 itself. The derivatives of Tikhonov and entropy-based regularization functions are given in Sections 11.5.1 and 11.5.2.
In order to determine the biases b{ = E[jm] — /"», we can compute the expectation values E[jj,i] by means of the approximate relation (11.67),
bi = E[(ii\ - m « ju,- + 2^ Cij{vj - hj) - Hi
3 = 1
(11.75)
This can be estimated by substituting the estimator from equation (11.67) for Hi and replacing Vj by its corresponding estimator i/j = Y^k Rjkfik, which yields
176 Unfolding
Choice of the regularization parameter 177
n
i=i on,-   J J
(11.76)
The approximations used to construct are valid for small (vj — rij), or equivalents, large values of the regularization parameter a. For small a, the matrix C in fact goes to zero, since the estimators fii are then decoupled from the measurements rij, cf. equation (11.71). In this case, however, the bias is actually at its largest. But since we will only use 6j and its variance in order to determine the regularization parameter a, the approximation is sufficient for our purposes.
By error propagation (neglecting the variance of the matrix C), one obtains the covariance matrix W for the 6,-,
% = c0V[Mi]= f; cik cjt
cov
k,l=l
Uh - nk), (vt -m)}.
(11.77)
This can be computed by using vk = V   z?    -   * ,
co\[Ok,i>t] to that of rhp /    * m, Am^m to reiate tne covariance matrix
whih is l i; s rtehdistribution' Ui>=w^
by 17 = CVC\ Putting thisqalH0giher gives ***** °f ^ ^
W   = (CRC~C)V(CRC~C)T = (CR-I)U(CR-Ift
(11.78)
c» be employed m . ^    aTffiS^"^ *™
e>2log£
£,(n, log i/j - i/ť) £j    -1)
5iHÍÍ!^líÍaštl^^^
~5 - rii)(V ^{vj - Uj)
~(flT V1 Jl)y (V-1 Jl)y
Before proceeding to the question of the regularization parameter, however, it is important to note that the biases are in general nonzero for all regularized unfolding methods, in the sense that they are given by some functions, not everywhere zero, of the true distribution. Their numerical values, however, can in fact be zero for particular values of /x. A guiding principle in unfolding is to choose a method such that the bias will be zero (or small) if fi has certain properties believed a priori to be true. For example, if the true distribution is uniform, then estimates based on Tikhonov regularization with k = 1 (11.42) will have zero bias; if the true distribution is linear, then k = 2 (11.43) gives zero bias, etc. If the true distribution is equal to a reference distribution q, then unfolding using the cross-entropy (11.63) will yield zero bias.
11.7   Choice of the regularization parameter
The choice of the regularization parameter a, or equivalently the choice of A log i (or Ax2), determines the trade-off between the bias and variance of the estimators fi. By setting a very large, the solution is dominated by the likelihood function, and one has logL = logLmax (or with least squares, \2 = Xmin) an<^ correspondingly very large variances. At the other extreme, a —> 0 puts all of the weight on the regularization function and leads to a perfectly smooth solution.
Various definitions of an optimal trade-off are possible; these can incorporate the estimates for the covariance matrix Uij = covf/i,-, ftj], the biases 6,-, and the covariance matrix of their estimators, Wij = cov[6,-, bj]. Here U and W will refer to the estimated values, U and W; the hats will not be written explicitly.
One possible measure of the goodness of the final result is the mean squared error, cf. equation (5.5), averaged over all bins,
MSE
1 M
M
(11.79)
The method of determining a so as to obtain a particular value of the MSE will depend on the numerical implementation. Often it is simply a matter of trying a value a, maximizing <p(fj,,X), and iterating the procedure until the desired solution is found.
One could argue, however, that the contribution to the mean squared error should be different for different bins depending on how accurately they are measured. Since the variance of a Poisson variable with mean value [i, is equal to one can define a weighted MSE,
M
MSE' =ÍjY]
Hi
(11.80)
in analogy with the x2 used in the method of least squares. For Poisson distributed data, the quantity MSE' represents the mean squared increase in the errors due to limited resolution. It is thus reasonable to require that this quantity be small.
178 Unfolding
Examples of unfolding 179
A popular choice for the regularization parameter is based on the idea that, on average, each bin should contribute approximately one unit to the x2> ie- a is determined such that \2 — N. This can be generalized to the log-likelihood case as AlogL = logLmax — logL = N/2, since for Gaussian distributed n one has logL = -x2/2.
Naively one might expect that an increase in the \ : of one unit would set the appropriate level of discrepancy between the data n and the estimates v. This typically leads, however, to solutions with unreasonably large variance. The problem can be traced to the fact that the estimator Vi receives contributions not only from n,- but also from neighboring bins as well. The coupling of the estimators i>; to the measurements rij can be expressed by the matrix
drij
d
M
■3 t_
(11.81)
k = l
A modification of the criterion A,\2 = 1 has been suggested in [Sch94] which incorporates this idea. It is based on an increase of one unit in an effective x2,
Axe2ff = (v-n)T RC V~l{RC)T (i> - n) = 1, (11.82)
where the matrix RC effectively takes into account the reduced coupling between the estimators     and the data n;.
Alternatively, one can look at the estimates of the biases and their variances. If the biases are significantly different from zero, then it is reasonable to subtract them. This is equivalent to going to a smaller value of A logL. As a measure of the deviation of the biases from zero, one can construct the weighted sum of squares,
The strategy is thus to reduce A log I (i.e. increase a) until xl is equal to a sufficiently small value, such as the number of bins M. At this point the standard deviations of the biases are approximately equal to the biases themselves, and therefore any further bias reduction would introduce as much error as it removes.
The bias squared, the variance, and their sum, the mean squared error, are shown as a function of A log L in Fig. 11.2. These are based on the example from Fig. 11.1, there unfolded by inverting the response matrix, and here treated using (a) maximum entropy and (b) Tikhonov regularization. The increase in the estimated bias for low AlogL reflects the variance of the estimators the true bias decreases to zero as A log I goes to zero. The arrows indicate solutions based on the various criteria introduced above; these are discussed further in the next section.
Further criteria for setting the regularization parameter have been proposed based on singular value analysis [H6c96], or using a procedure known as cross-validation [Wah79]. Unfortunately, the choice of a is still a somewhat open
10000
7500
5000
2500
(a) 'i	—r—■-1--r— — mean variance
l\	mean squared bias
1 \	— mean squared error
\	if   \l/b, I ...................
10000
7500
5000
2500
(b)	—i——-r-"~~~ ---mean variance
	mean squared bias .
-	- mean squared error
	\hVtf "c"
	xUA:/                        V--J
10 15 A log l
20
25
10
15
4 logL
Fig. 11.2 The estimated mean variance, mean squared bias, and their sum, the mean squared error, as a function of A log L for (a) MaxEnt and (b) Tikhonov regularization (k = 2). The arrows indicate the solutions from Figs 11.3 and 11.4: (b) is minimum MSE, (c) is A log I = N/2, (d) is AxJJfj = 1, and (e) is \2b = M. For the MaxEnt case, the Bayesian solution A logL = 970 is not shown. For Tikhonov regularization, (a) gives the solution for minimum weighted MSE.
question. In practice, the final estimates are relatively stable as the value of AlogL decreases, until a certain point where the variances suddenly shoot up (see Fig. 11.2). The onset of the rapid increase in the variances indicates roughly the natural choice for setting a.
11.8   Examples of unfolding
Figures 11.3 and 11.4 show examples based on maximum entropy and Tikhonov regularization, respectively. The distributions ß, u and n are the same as seen previously in Figs 11.l(a)-(c), having N = M = 20 bins, all efficiencies e equal to unity, and backgrounds ß equal to zero. The estimators ß are found by maximizing the function <p{n,\) (11.40), here constructed with a log-likelihood function based on independent Poisson distributions for the data. On the left, the original 'true' histograms /z are shown along with the unfolded solutions fa and error bars yfUTi corresponding to a given value of the regularization parameter a, or equivalently to a given AlogL. On the right are the corresponding estimates of the biases hi with their standard deviations y/Wu. These should not be confused with the true residuals /i,- — m, which one could not construct without knowledge of the true histogram \i. The estimates on the other hand, are determined from the data.
Consider first Fig. 11.3, with the entropy-based regularization function S(p) = H{fi). Figure 11.3(a) corresponds to a = 1/^tot, i.e- the Bayesian prescription (11.59), and gives AlogL = 970. We show this choice simply to illustrate that the prior density 7r(p) = does not lead to a reasonable solution. Although
the standard deviations y/Üü are very small, there is a large bias. The estimates hi shown on the right are indeed large, and from their error bars one can see that they are significantly different from zero. Note that the estimated biases here are
180 Unfolding
Examples of unfolding 181
not, in fact, in very good agreement with the true residuals /},,- — ju,-, owing to the approximations made in constructing the estimators 6,-, cf. equation (11.67). The approximations become better as AlogL is decreased, until the standard deviations y/Wn become comparable to the biases themselves.
Figure 11.3(b) shows the result based on minimum mean squared error (11.79). This corresponds to AlogL = 13.3 and xl — 154. Although the estimated biases are much smaller than for a = 1/^tot, they are still significantly different from zero.
The solution corresponding to AlogL = N/2 = 10 is shown in Fig. 11.3(c). Here the biases are somewhat smaller than in the result based on minimum MSE, but are still significantly different from zero, giving xl — 87. In this particular example, requiring minimum weighted mean squared error (11.80) gives AlogL = 10.5, and is thus similar to the result from AlogL = N/2 — 10.
The results corresponding to Ax^ff = 1 and xl — M are shown in Figs 11.3(d) and (e), respectively. Both of these have biases which are consistent with zero, at the expense of larger variances compared to the results from AlogL = N/2 or minimum MSE. The ^-xln — 1 case has xl — 20-8, and the xl = M case has A\-2ff = 0.85. so in this example they are in fact very similar.
Now consider Fig. 11.4, which shows examples based on the same distribution, but now using Tikhonov regularization with k = 2. The figures correspond to (a) minimum weighted MSE, (b) minimum MSE, (c) AlogL = N/2, (d) Axlfí =T, and (e) xl — M. Here in particular the solution from A log L = N/2 does not appear to go far enough; although the statistical errors \fU7i are quite small, the biases are large and significantly different from zero (6? ^> Wu). Reasonable results are achieved in (a), (b) and (e), but the requirement Axlfí = 1 (d) appears to go too far. The bias is consistent with zero, but no more so than in the case with xl — M. The statistical errors are, however, much larger.
A problem with Tikhonov regularization, visible in the right most bins in Fig. 11.4, is that the estimates can become negative. (All of the bins are positive only for Fig. 11.4(a).) There is in fact nothing in the algorithm to prevent negative values. If this must be avoided, then the algorithm has to be modified by, for example, artificially decreasing the errors on points where the negative estimates would occur. This problem is absent in MaxEnt unfolding, since there the gradient of S(fi) diverges if any m approach zero. This penalty keeps all of the fii positive.
The techniques discussed in this chapter can easily be generalized to multidimensional distributions. For the case of two dimensions, for example, unfolding methods have been widely applied to problems of image restoration [Fri72, Fri80, Fri83, Ski85], particularly in astronomy [Nar86], and medical imaging [Lou92]. A complete discussion is beyond the scope of this book, and we will only illustrate some main ideas with a simple example.
Figure 11.5 shows an example of MaxEnt unfolding applied to a test photograph with 56x56 pixels. Figure 11.5(a) is taken as the :true' image, representing the vector \i. In Fig. 11.5(b), the image has been blurred with a Gaussian resolution function with a standard deviation equal to 0.6 times the pixel size.
b rr
200
-200 -
200
-200
200
+
minimum MSE
Alog l=n/2
(c)
200
-200
200
+
AX>1
(d)
■I 4-4-	-1--r— I		
TT -   xa = M		T 'ti" I.	
(e)
0 0.2 0.4 0.6 0.8 1
X
Fie 11 3 MaxEnt unfolded distributions shown as points with the true distribution shown II 'his ogram(left) and the estimated biases (right) for different values of the -Sular,^n
as a nibiu»i«ui   s.  ; __ Raves an prescription a = 1/Mtot, tb) mini-
parameter«. The examples correspond to (a) the Bayes     P        P     = ^ ^ ^ ^
mum mean squared error, c) AlogL - n/2, (dj Axeff \ i *6 .
The^olution of minimum weighted MSE turns out sim.lar to case (c) with A logi - n/2.
182 Unfolding
Examples of unfolding 183
2000
1000
0
2000
1000
2000
1000
0
M
2000
1000
2000
r	-1--—,- FX	
rt-		v
		I i i
f		-
j	A	V
F J r		
_r		4_
1000 h
■	~l---1--r_____	-
	-1-<--j--_i	1 ■
0 0.2        0.4       0.6 0i
200
-200
minimum weighted MSE
(a)
-200
200
1   1 ~t~~*"~+                  1   ! -	-
-   minimum MSE	
(b)
-200
200
Alog l=niz
(c)
-200
200
(d)
-200
(e)
0 0-2        0.4        0.6       0.8 1
20000 h
10000
10
20 30 40 pixel number (row 36)
Fig. 11.5 (a) The original 'true' image fx. (b) The observed image n, blurred with a Gaussian point spread function with a standard deviation equal to 60% of the pixel size, (c) The maximum entropy unfolded image. The histograms to the right show the light intensity in pixel row 36 (indicated by arrows).
For purposes of this exercise, the effective number of 'photons' (or, depending on the type of imaging system, silver halide grains, photoelectrons, etc.) was assigned such that the brightest pixels have on the order of 104 entries. Thus if the number of entries in pixel * is treated as a Poisson variable n,- with expectation value Vi, the relative sizes of the fluctuations in the brighter regions are on the order of 1% (c,/"i = 1/y/vi). Figure 11.5(c) shows the unfolded image according to maximum entropy with AlogL = N/2 where N = 3136 is the number of pixels. The histograms to the right of Fig. 11.5 show the light intensity in pixel row 36 of the corresponding photographs.
For this particular example, the method of maximum entropy has certain advantages over Tikhonov regularization. First, there is the previously mentioned feature of MaxEnt unfolding that all of the bins remain positive by construction. Beyond that, one has the advantage that the entropy can be directly generalized
184 Unfolding
Numerical implementation 185
to multidimensional distributions. This follows immediately from the fact that the entropy H = — Ylj Pj^°&Pj 1S simply a sum over all bins, and pays no attention to the relative values of adjacent bins. For Tikhonov regularization, one can generalize the function S(n) to two dimensions by using a finite-difference approximation of the Laplacian operator; see e.g. [Pre92], Chapter 18.
A consequence of the fact that entropy is independent of the relative bin locations is that the penalty against isolated peaks is relatively slight. Large peaks occur in images as bright spots such as stars, which is a reason for MaxEnt's popularity among astronomers. For relatively smooth distributions such as those in Figs 11.3 and 11.4, Tikhonov regularization leads to noticeably smaller variance for a given bias. This would not be the case for distributions with sharp peaks, such as the photograph in Fig. 11.5.
A disadvantage of MaxEnt is that it necessarily leads to nonlinear equations for fi. But the number of pixels in a picture is typically too large to allow for solution by direct matrix inversion, so that one ends up anyway using iterative numerical techniques.
11.9   Numerical implementation
The numerical implementation of the unfolding methods described in the previous sections can be a nontrivial task. Finding the maximum of the function
<p(i*,X) = a log L{fi) + S(m) + A
ntot
n
j=i
(11.84)
with respect to \i and the Lagrange multiplier A implies solving the M 4-1 equations (11.64). If <p is a quadratic function of fi, then the equations (11.64) are linear. This occurs, for example, if one has a log-likelihood function for Gaussian distributed n, giving log I = — x2/2, in conjunction with Tikhonov regularization. Methods of solution for this case based on singular value decomposition are discussed in [H6c96]. If <p contains, for example, a log-likelihood function based on the Poisson distribution, or an entropy-based regularization function, then the resulting equations are nonlinear and must be solved by iterative numerical techniques.
Consider as an example the case of a Poisson-based likelihood function, cf. equations (11.21), (11.22),
n
log [/{ft) = ____,("; log v.- - vf),
(11.85)
with the regularization function S = H where H is the entropy (11.51).
A possible method of solution for MaxEnt regularization is illustrated in Fig. 11.6. The three axes represent three dimensions of //-space, and the diagonal plane is a subspace of constant __\ vi = ntot, ■ The two points indicated in the plane are the point of maximum entropy (all pi equal) and the point of maximum
subspace of constant E_v;
maximum likelihood
maximum entropy
Fig. 11.6 Throe dimensions of /i-space illustrating the numerical implementation of maximum entropy unfolding (see text).
likelihood. The curve connecting the two indicates possible solutions to (11.40) corresponding to different values of the regularization parameter a; for example, a = 0 gives the point of maximum entropy; a —> co corresponds to the point of maximum likelihood. The curve passes through the points at which contours of constant entropy and constant likelihood touch. Note that the point of maximum likelihood is not in the allowed region with all m > 0. This is, in fact, typical of the oscillating maximum likelihood solution, cf. Fig. 11.1(d).
The program used for the MaxEnt examples shown in Figs 11.3 and 11.5 employs the following algorithm, which includes some features of more sophisticated methods described in [Siv96, Ski85]. The point of maximum likelihood usually cannot be used for the initial value of fi, since there one often has negative pt, and hence the entropy is not defined. Instead, the point of maximum entropy is taken for the initial values. This is determined by requiring all m equal, subject to the constraint
JV n   m m
5> = £X>i« = I>«
= 1 j = l
= "toti
(11.86)
where is
the efficiency for bin j. The point of maximum entropy is thus
186 Unfolding
Numerical implementation 187
"tot
(11.87)
If one uses 5(/i) = /utot#, and if the efficiencies are not all equal, then the distribution of maximum S(/a) is not uniform, but rather is given by the solution to the M equations,
logi^ + ^i^O,
Ptot
ntot
1,...,M.
(11.88)
Starting from the point of maximum S((J.), one steps along the curve of maximum ip in the subspace of constant vtot. As long as one remains in this subspace, it is only necessary to maximize the quantity
*(/*) = o logL(fi) + S((i),
(11.89)
i.e. the same as <p(fJt) but without the Lagrange multiplier A, cf. (11.40). Simply requiring V3> = 0 will not, however, lead to the desired solution. Rather, V$ must be first projected into the subspace of constant Vtot and the components of the resulting vector set equal to zero. In this way the Lagrange multiplier A never enters explicitly into the algorithm. That is, the solution is found by requiring
D$ = V$-u(u-Y$) =0, (11.90) where u is a unit vector in the direction of Vi^ot- This is given by (cf. (11.10))
dvt,
n m
dpk
dp.j
(11.91)
so that the vector u is simply given by the vector of efficiencies, normalized to unit length,
a —
\DS\ D\ogL\
(11.93)
The parameter a can simply be set equal to the right-hand side of (11.93), and a side step taken to return to the curve of D3> = 0. This can be done using standard methods of function maximization (usually reformulated as minimization; cf. [Bra92, Pre92]). These side steps as well are made such that one remains in the subspace of vtot = ntot, i.e. the search directions are projected into this subspace. One then proceeds in this fashion, increasing a by means of the forwards steps along D log L and moving to the solution 1)$ = 0 with the side steps, until the desired value of Alogi = logLmax — log!/ is reached. Intermediate results can be stored and examined in order to determine the optimal stopping point.
Although the basic ideas of the algorithm outlined above can also be applied to Tikhonov regularization, the situation there is somewhat complicated by the fact that the solution of maximum S((J.) is not uniquely determined. For k = 2, for example, any linear function gives S — 0. One can simply start at fti — ntot
JM
and set the regularization parameter a sufficiently large that a unique solution is found.
It is also possible with Tikhonov regularization to leave off the condition }2i Vi — ntot, since here the regularization function does not tend to pull the solution to a very different total normalization. If the normalization condition is omitted, however, then one will not obtain exactly J2i &i = "tot - One can argue that j>tot should be an unbiased estimator for the total number of events, but since the bias is not large, the constraint is usually not included.
(11.92)
We will use the differential operator D to denote the projection of the gradient into the subspace of constant i/tot, as defined by equation (11.90).
One begins thus at the point of maximum entropy and takes a small step in the direction of DlogZ. The resulting fj, is in general not directly on the curve of maximums, but it will be close, as long as the step taken is sufficiently small. As a measure of distance from this curve one can examine |D3>|. If this exceeds a given limit then the step was too far; it is undone and a smaller step is taken.
If the resulting point fi were in fact on the curve of maximum then we would have O'DlogL + DS = 0 and the corresponding regularization parameter would be