Statistical Science 2003, Vol. 18, No. 1, 104–117
© Institute of Mathematical Statistics, 2003

Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays

Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan and Gilbert Chu

Robert Tibshirani is Professor, Departments of Health Research and Policy, and Statistics, Stanford University, Stanford, California 94305-5405 (e-mail: tibs@stat.stanford.edu). Trevor Hastie is Professor, Departments of Statistics, and Health Research and Policy, Stanford University, Stanford, California 94305-5405 (e-mail: hastie@stat.stanford.edu). Balasubramanian Narasimhan is Senior Research Associate, Departments of Statistics, and Health Research and Policy, Stanford University, Stanford, California 94305-5405 (e-mail: naras@stat.stanford.edu). Gilbert Chu is Professor, Departments of Biochemistry, and Medical Oncology, Stanford University, Stanford, California 94305-5151 (e-mail: chu@cmgm.stanford.edu).

Abstract. We propose a new method for class prediction in DNA microarray studies based on an enhancement of the nearest prototype classifier. Our technique uses “shrunken” centroids as prototypes for each class to identify the subsets of the genes that best characterize each class. The method is general and can be applied to other high-dimensional classification problems. The method is illustrated on data from two gene expression studies: lymphoma and cancer cell lines.

Key words and phrases: Sample classification, gene expression arrays.

1. INTRODUCTION

Class prediction with high-dimensional features is an important problem and has recently received a great deal of attention in the context of DNA microarrays. The task is to classify and predict the diagnostic category of a sample, based on its gene expression profile. Recent proposals for this problem include Golub et al. (1999), Hedenfalk et al. (2001), Hastie, Tibshirani, Botstein and Brown (2001) and the artificial neural network approach in Khan et al. (2001). The microarray problem is a unique and challenging classification task because there are a large number of inputs (genes) from which to predict classes and a relatively small number of samples. It is especially important to identify which genes contribute toward the classification. This can aid in biological understanding of the disease process and is also important in the development of clinical tests for early diagnosis. In this article we propose a simple approach to the problem that performs well and is easy to understand and interpret.

As an example, we consider data from Alizadeh et al. (2000), which are available from the authors’ web site. These data consist of expression measurements on 4,026 genes from samples of 59 lymphoma patients. The samples are classified into diffuse large B-cell lymphoma (DLCL), follicular lymphoma (FL) and chronic lymphocytic leukemia (CLL). We selected a random subset of 20 samples and set them aside as a test set; the remaining 39 samples formed the training set.

We began with a nearest centroid classification. Figure 1 (light grey bars) shows the training-set centroids (average expression of each gene) within each of the three classes. The overall average expression of the corresponding gene has been subtracted, so that these values are differences from the overall centroid. To apply the nearest centroid classification, we take the gene expression profile of the test sample and compute its squared distance from each of the three class centroids.
The predicted class is the one whose centroid is closest to the expression profile of the test sample. This procedure makes zero errors on the 20 test samples, but has the major drawback that it uses all 4,026 genes.

FIG. 1. Centroids (grey) and shrunken centroids (red) for the lymphoma/leukemia data set. Each centroid has the overall centroid subtracted; hence, what we see are contrasts. The horizontal units are log ratios of expression. Going from left to right, the number of training samples is 27, 5 and 7. The order of the genes is determined by hierarchical clustering.

We propose the “nearest shrunken centroid” method, which uses denoised versions of the centroids as prototypes for each class. The optimally shrunken centroids, derived using a method described below, are shown as red bars in Figure 1. Classification is then made to the nearest (shrunken) centroid. The resulting procedure has zero test errors. In addition, only 81 genes have a nonzero red bar for one or more classes in Figure 1 and, hence, are the only ones that contribute toward the classification. The amount of shrinkage is determined by cross-validation.

In the preceding example, the (unshrunken) nearest centroid method had the same error rate as the nearest shrunken centroid procedure. This is not always the case. Table 1 shows results taken from Tibshirani, Hastie, Narasimhan and Chu (2002) on classification of small round blue cell tumors. The data are taken from Khan et al. (2001). There are 25 test samples and 2,308 genes. The neural network and regularized discriminant analysis methods used in the table are described in Section 7.

TABLE 1
Results on classification of small round blue cell tumors

Method                              Test error rate   Number of genes used
Nearest centroid                    4/25              2,308
Nearest shrunken centroids          0/25              43
Neural network                      0/25              96
Regularized discriminant analysis   0/25              2,308

We gave a brief description of the nearest shrunken centroid method in Tibshirani, Hastie, Narasimhan and Chu (2002), focussing on the biological findings from two different applications. Here we give a broader and more thorough statistical treatment. In Section 2 we describe the basic method. We detail our procedure for adaptive choice of thresholds in Section 3. Additional issues and comparisons are discussed in Sections 4–8, including application of the method to capturing heterogeneity with an “abnormal” class compared to a control class, in Section 6. Finally we conclude with a brief discussion in Section 9.

2. NEAREST SHRUNKEN CENTROIDS

2.1 Details of the Proposal

Let $x_{ij}$ be the expression for genes $i = 1, 2, \ldots, p$ and samples $j = 1, 2, \ldots, n$. Each sample belongs to one of $K$ classes $1, 2, \ldots, K$. Let $C_k$ be the indices of the $n_k$ samples in class $k$. The $i$th component of the centroid for class $k$ is $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij}/n_k$, the mean expression in class $k$ for gene $i$; the $i$th component of the overall centroid is $\bar{x}_i = \sum_{j=1}^n x_{ij}/n$.

Our idea is to shrink the class centroids toward the overall centroid. However, we first normalize by the within-class standard deviation for each gene. Let

$$ d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot s_i}, \qquad (1) $$

where $s_i$ is the pooled within-class standard deviation for gene $i$,

$$ s_i^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2, \qquad (2) $$

and $m_k = \sqrt{1/n_k - 1/n}$ makes the denominator in Equation (1) equal to the estimated standard error of the numerator. Thus $d_{ik}$ is a $t$-statistic for gene $i$, comparing class $k$ to the average class.
(In fact, we also add a regularization parameter $s_0$ to the values $s_i$; see Section 9.) We can write

$$ \bar{x}_{ik} = \bar{x}_i + m_k s_i d_{ik}. \qquad (3) $$

Our proposal shrinks each $d_{ik}$ toward zero, giving $d'_{ik}$ and new shrunken centroids or prototypes

$$ \bar{x}'_{ik} = \bar{x}_i + m_k s_i d'_{ik}. \qquad (4) $$

The shrinkage we use is called soft thresholding: the absolute value of each $d_{ik}$ is reduced by an amount $\Delta$ and is set to zero if the result is less than zero. Algebraically, this is expressed as

$$ d'_{ik} = \mathrm{sign}(d_{ik}) (|d_{ik}| - \Delta)_+, \qquad (5) $$

where the subscript plus means positive part ($t_+ = t$ if $t > 0$ and zero otherwise). This is shown in Figure 2.

FIG. 2. Soft threshold function.

Since many of the $\bar{x}_{ik}$ will be noisy and close to the overall mean $\bar{x}_i$, soft thresholding usually produces “better” (more reliable) estimates of the true means (Donoho and Johnstone, 1994). The proposed method has the attractive property that many of the components (genes) are eliminated as far as class prediction is concerned if the shrinkage parameter $\Delta$ is large enough. Specifically, if $\Delta$ causes $d'_{ik}$ to shrink to zero for all classes $k$, then the centroid for gene $i$ is $\bar{x}_i$, the same for all classes. Thus gene $i$ does not contribute to the nearest centroid computation. We chose $\Delta$ by cross-validation, as illustrated below.

Note that the standardization by $s_i$ in (1) has the effect of giving higher weight to genes that have stable expression within samples of the same class. This same standardization is inherent in other common statistical methods, such as linear discriminant analysis (see Section 7).

The top panel of Figure 3 shows the training, 10-fold cross-validation and test errors as the shrinkage parameter $\Delta$ is varied. The top of the plot indicates the number of genes retained (for the training data) at that particular threshold. The left end of the figure represents no shrinkage, while the right end represents complete shrinkage. The test error is minimized near $\Delta = 0.463$; when the curve is flat near the minimum, we typically chose the largest value of $\Delta$ (smallest number of genes) that achieves the minimal error. The upper axis shows the number of active genes with at least one nonzero component $d'_{ik}$, as $\Delta$ is varied. At $\Delta = 0.463$ there are about 3,666 active genes. The numbers of genes with nonzero $d'_{ik}$ in each class are (3200, 2497, 3133).

FIG. 3. Lymphoma/leukemia data: training error (tr, blue), cross-validation error (cv, green) and test error (te, red) as the threshold parameter $\Delta$ is varied. In the top panel, the default soft threshold scaling is used: a solution with $\Delta = 0.463$ and 3,666 genes is chosen. In the bottom panel, adaptive threshold scaling was used; the value $\Delta = 4.01$ is chosen, resulting in a subset of just 81 genes, with the same test error rate as in the top panel.

Note that the selection of genes for a given value of $\Delta$ is carried out separately for each of the 10 cross-validation trials. This is important to avoid selection bias and an unrealistically optimistic cross-validation error rate. As pointed out by Ambroise and McLachlan (2002), a number of authors have made the mistake of selecting genes based on all of the training data (expression values and classes) and then subjecting only the selected genes to cross-validation. This can produce a wildly optimistic estimate for misclassification error: it is easy to simulate two-class examples in which the class labels are independent of the expression values (test error = 50%), but cross-validation after selection reports an error of zero.
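To make the construction above concrete, the following minimal Python/NumPy sketch implements Equations (1)–(5). It is an illustration only, not the PAM software mentioned in Section 9; the array layout (genes in rows, samples in columns, classes labeled 0, …, K−1) and all function and variable names are choices made here. The constant $s_0$ is included in both the statistic and the reconstruction so that the two steps invert each other.

```python
import numpy as np

def shrunken_centroids(x, y, delta, s0=None):
    """Sketch of Equations (1)-(5).  x: p x n expression matrix (genes x samples);
    y: length-n integer class labels 0..K-1; delta: shrinkage threshold."""
    p, n = x.shape
    K = y.max() + 1
    overall = x.mean(axis=1)                           # overall centroid xbar_i
    centroids = np.empty((p, K))
    nk = np.empty(K)
    ss = np.zeros(p)
    for k in range(K):
        xk = x[:, y == k]
        nk[k] = xk.shape[1]
        centroids[:, k] = xk.mean(axis=1)              # class centroid xbar_ik
        ss += ((xk - centroids[:, [k]]) ** 2).sum(axis=1)
    s = np.sqrt(ss / (n - K))                          # pooled within-class sd, Equation (2)
    if s0 is None:
        s0 = np.median(s)                              # regularization constant (Section 9)
    m = np.sqrt(1.0 / nk - 1.0 / n)                    # m_k, so that m_k * s_i is a standard error
    denom = (s + s0)[:, None] * m[None, :]
    d = (centroids - overall[:, None]) / denom         # t-like statistics d_ik, Equation (1)
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)   # soft thresholding, Equation (5)
    shrunk = overall[:, None] + denom * d_shrunk       # shrunken centroids, Equation (4)
    return shrunk, d_shrunk, s + s0

def nearest_shrunken_centroid(x_new, shrunk, s):
    """Assign a test profile to the nearest shrunken centroid, standardizing each gene by s;
    the prior-corrected discriminant score of Section 2.4 is not included here."""
    dist = (((x_new[:, None] - shrunk) / s[:, None]) ** 2).sum(axis=0)
    return int(np.argmin(dist))
```

Genes whose shrunken differences are zero in every class drop out of the classification entirely, which is how the small gene subsets shown in Figure 1 arise.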
Formula (1) takes into account the size of each class and effectively applies a larger threshold to a smaller (higher variance) class. Even after this adjustment, some classes may be farther away than others from the overall centroid and, hence, may be easier to distinguish. In this case, many of the nonzero genes for that class may not be needed for accurate classification. Thus we might try to vary the class thresholds to minimize the total number of nonzero genes needed to achieve a given error rate. The details of how we do this are discussed in Section 3. In this case the procedure increased the thresholds for the first and third classes, and was very successful: as shown in the bottom panel of Figure 3, it reduced the number of genes to just 81 without increasing the test error.

2.2 Finding the Predictors that Matter

Figure 4 shows the shrunken differences $d'_{ik}$ for the 81 genes that have at least one nonzero difference. Figure 5 shows the heat map of the chosen 81 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. Clear separation of the classes is evident. The top set of genes characterizes CLL, with some genes overexpressed and others underexpressed. Similarly, the middle set of genes characterizes FL. The genes in the bottom set of the figure are overexpressed in DLCL, and underexpressed in FL and CLL.

FIG. 4. Shrunken differences $d'_{ik}$ for the 81 genes that have at least one nonzero difference.

FIG. 5. Heat map of the chosen 81 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. The data for all 59 samples are shown.

2.3 The Log-Likelihood

It is quite common to have a small number of samples in each class, especially when the number of classes is large. This can result in a cross-validation curve that has discrete jumps and high variability. To help with this problem, we can use the mean cross-validated log-likelihood rather than misclassification error. Since our model produces class probability estimates [see Equation (8) in Section 2.4], the log-likelihood of a test sample $x^*$ with class label $y^*$ is $\log \hat{p}_{y^*}(x^*)$. The mean log-likelihood curve is typically smoother than the misclassification error curve. Figure 6 shows the test set log-likelihood and misclassification error curves for the lymphoma data. (This is for illustration only; we are not suggesting use of the test error to select $\Delta$.) They give a similar picture, although the choice of the smallest model where the log-likelihood starts to dip yields more genes than that from the misclassification error curve. In the next section we make use of the log-likelihood in estimation of class probabilities.

2.4 Class Probabilities and Discriminant Functions

We classify test samples to the closest shrunken centroid, again standardizing by $s_i$. We also make a correction for the relative abundance of members of each class. Details are given next. Suppose we have a test sample (vector) with expression levels $x^* = (x^*_1, x^*_2, \ldots, x^*_p)$. We define the discriminant score for class $k$ as

$$ \delta_k(x^*) = \sum_{i=1}^{p} \frac{(x^*_i - \bar{x}'_{ik})^2}{s_i^2} - 2 \log \pi_k. \qquad (6) $$

The first part of (6) is simply the standardized squared distance of $x^*$ to the $k$th shrunken centroid. The second part is a correction based on the class prior probability $\pi_k$, where $\sum_{k=1}^{K} \pi_k = 1$. This prior gives the overall proportion of class $k$ in the population. The classification rule is then

$$ C(x^*) = \ell \quad \text{if } \delta_\ell(x^*) = \min_k \delta_k(x^*). \qquad (7) $$

If the smallest distances are close and hence ambiguous, the prior correction gives a preference for larger classes, since they potentially account for more errors. We usually estimate the $\pi_k$ by the sample priors $\hat{\pi}_k = n_k/n$. If the sample prior is not representative of the population, then more realistic priors or even uniform priors $\pi_k = 1/K$ can instead be used.

We can use the discriminant scores to construct estimates of the class probabilities by analogy to Gaussian linear discriminant analysis:

$$ \hat{p}_k(x^*) = \frac{e^{-\delta_k(x^*)/2}}{\sum_{\ell=1}^{K} e^{-\delta_\ell(x^*)/2}}. \qquad (8) $$
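A small sketch of Equations (6)–(8), continuing the earlier NumPy illustration (again our own illustration rather than the paper's software): `shrunk` and `s` are the shrunken centroids and standardized scales returned by the previous sketch, and `priors` holds the estimated $\hat{\pi}_k$.

```python
import numpy as np

def discriminant_scores(x_new, shrunk, s, priors):
    """Equation (6): standardized squared distance to each shrunken centroid,
    plus the class-prior correction -2 log pi_k."""
    dist = (((x_new[:, None] - shrunk) / s[:, None]) ** 2).sum(axis=0)
    return dist - 2.0 * np.log(priors)

def class_probabilities(scores):
    """Equation (8): softmax of -scores/2, shifted for numerical stability."""
    w = np.exp(-0.5 * (scores - scores.min()))
    return w / w.sum()
```

The predicted class of Equation (7) is then `int(np.argmin(scores))`, which is also the class with the largest estimated probability.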
The left panel of Figure 7 displays these probabilities for the lymphoma data. For illustration, we used the largest value of $\Delta$ ($= 4.41$) that minimizes the test error in the bottom panel of Figure 3, rather than the cross-validation-minimizing value of 4.01 used earlier. The value $\Delta = 4.41$ yields 48 genes. We derived the probabilities using the centroids that were defined by applying this value of $\Delta$ to the test set. In Figure 6, the value $\Delta = 4.04$ gives exactly the same test error (in fact, the same class predictions) as $\Delta = 4.41$, but gives a higher log-likelihood value. The estimated probabilities resulting from $\Delta = 4.04$ are shown in the right panel of Figure 7. These probabilities are more extreme than those in the left panel. The rightmost probabilities are preferred, since they produce a higher log-likelihood score.

FIG. 6. Test set mean log-likelihood curve (red) and test set misclassification error curve (green). The latter has been translated so that it fits in the same plotting region. The broken line shows where the log-likelihood curve starts to dip, while the dotted line shows where the misclassification error starts to rise.

FIG. 7. Estimated test set probabilities using the 48 gene model from minimizing misclassification error (left) and the 78 gene model from maximizing the log-likelihood (right). Probabilities are partitioned by the true class. There are no classification errors in the test set.

3. ADAPTIVE CHOICE OF THRESHOLDS

In this section we describe the procedure for adaptive threshold choice in the nearest shrunken centroid method. We define a scaling vector $(\theta_1, \theta_2, \ldots, \theta_K)$ and initially set $\theta_k = 1$ for all $k$. These scalings are included in the denominator of expression (1), that is,

$$ d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \theta_k \cdot s_i}. \qquad (9) $$

We scale the values so that $\min_j(\theta_j) = 1$: values greater than 1 mean that a larger threshold is effectively used for class $k$. We applied the following procedure:

1. Find the class $k$ with the largest number of training errors, averaged over the grid of $\Delta$ values used.
2. Decrease $\theta_k$ by 10% and then rescale all $\theta_j$ so that $\min_j(\theta_j) = 1$.
3. Repeat the above steps for a number of iterations (here 10) and find the solution that gives the lowest average error, among the values of $(\theta_1, \theta_2, \ldots, \theta_K)$ visited.

Note that this procedure is based entirely on the training set and does not use information from cross-validation or a test set. It is admittedly heuristic, but does produce useful results in practice. For the lymphoma data, we obtained the solution $(\theta_1, \theta_2, \theta_3) = (1.88, 1.00, 1.52)$, which is the value we used to produce Figures 1 and 4. Most of the errors in the original solution occurred in class FL; the new thresholds are larger for classes DLCL and CLL, and hence many fewer genes are used to discriminate these classes. Remarkably, the total number of genes used has decreased from 3,666 to 81 without raising the test error.
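The search just described is simple enough to state in a few lines. The sketch below is schematic: it assumes a user-supplied function `errors_by_class(theta)` that refits the classifier with scalings $\theta$ and returns the per-class training errors averaged over the grid of $\Delta$ values; that helper, and the names used here, are placeholders of ours and are not part of the paper's software.

```python
import numpy as np

def adaptive_thresholds(errors_by_class, K, n_iter=10, step=0.10):
    """Schematic version of the Section 3 heuristic: shrink theta_k for the class
    with the most training errors, rescale so min(theta) = 1, and keep the best
    scaling vector visited."""
    theta = np.ones(K)
    best_theta, best_total = theta.copy(), np.inf
    for _ in range(n_iter):
        errs = np.asarray(errors_by_class(theta), dtype=float)
        if errs.sum() < best_total:              # step 3: track the best solution visited
            best_total, best_theta = errs.sum(), theta.copy()
        worst = int(np.argmax(errs))             # step 1: class with the most training errors
        theta = theta.copy()
        theta[worst] *= (1.0 - step)             # step 2: decrease theta_k by 10% ...
        theta /= theta.min()                     # ... and rescale so min_j theta_j = 1
    return best_theta
```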
To test this procedure further, we simulated some data consisting of 10 samples in each of four classes and 1,000 genes. We ran two different simulations, with the results shown in the top and bottom panels of Figure 8. For a concise description, let $r(a, n)$ represent the number $a$ repeated $n$ times. All expression values were independent Gaussian with variance 1. In the first simulation, the class centroids were $[r(3, 500), r(0.4, 500)]$, $[r(0.5, 100), r(0, 900)]$, $[r(0, 100), r(0.5, 100), r(0, 800)]$ and $[r(0, 100), r(0, 100), r(0.5, 100), r(0, 700)]$. The centroids are shown in the top panel of Figure 9. Thus the first class is far from the others, in the space spanned by the first 500 genes. The top panel of Figure 8 shows the mean ± 1 standard deviation of the test error over five simulations. The methods used were default (equal) thresholds (red) and adaptive thresholds (green). The average values of the adaptive threshold were 2.0, 1.0, 1.0 and 1.0. The adaptive threshold method generally has lower test error.

FIG. 8. Simulated data: mean ± 1 standard deviation of the test error over five simulations, for default (equal) thresholds (red) and adaptive thresholds (green). In the setup for the top panel, the class centroids are unevenly spaced; in the bottom panel, the within-class variances are unequal.

FIG. 9. Centroids for each of four classes for the two simulation scenarios. The standard deviations for each class are indicated at the top of the plot.

In the second simulation, the means in the four classes were $[r(0.5, 300), r(0, 700)]$, $[r(0.5, 150), r(-0.5, 150), r(0, 700)]$, $[r(-0.5, 150), r(0.5, 150), r(0, 700)]$ and $[r(-0.5, 150), r(-0.5, 150), r(0, 700)]$. The centroids are shown in the bottom panel of Figure 9. The standard deviations in each class were 2, 1.5, 1.5 and 1.0. Thus each class centroid is equidistant from the overall centroid (the origin), but the within-class standard deviations are different. The bottom panel of Figure 8 shows the results: again the adaptive threshold does better in terms of test error; the average values of the adaptive threshold were 1.4, 1.1, 1.2 and 1.0. With equal thresholds, the majority of nonzero genes were in class 1; under the adaptive thresholds, the distribution was more balanced.

4. SOFT VERSUS HARD THRESHOLDING

An alternative to the soft thresholding (5) would be to keep all differences greater in absolute value than $\Delta$ and discard the others; that is,

$$ d'_{ik} = d_{ik} \cdot I(|d_{ik}| > \Delta). \qquad (10) $$

This is sometimes known as hard thresholding. It differs from soft thresholding in that differences greater than $\Delta$ are unchanged, rather than shrunken toward zero by the amount $\Delta$. One drawback of hard thresholding is its “jumpy” nature: as the threshold is increased, a gene with a full contribution $d_{ik}$ suddenly is set to zero.

To investigate the relative behavior of hard versus soft thresholding, we generated standard normal expression data for 1,000 genes and 40 samples, with 20 samples in each of two classes. For the first 100 genes, we added a random effect $\mu_i \sim N(0.0, 0.5^2)$ to each expression level in class 2 for each gene $i$. Hence 100 of the 1,000 genes are differentially expressed in the two classes by varying amounts.
This experiment was repeated 10 times and the results were averaged. The left panel of Figure 10 shows the test error for hard and soft thresholding, as the threshold $\Delta$ is varied, while the right panel displays the mean squared error $\sum_i (\hat{\mu}_i - \mu_i)^2/p$, where $\hat{\mu}_i = \sum_{j=1}^{20} x_{ij}/20 - \sum_{j=21}^{40} x_{ij}/20$. In the left panel, we see that soft thresholding yields lower test error at its minimum; the right panel shows that soft thresholding does a much better job of estimating the gene expression differences.

FIG. 10. Simulated data in two classes. Left: Test misclassification error as the threshold is varied, using hard thresholding (h) and soft thresholding (s). Right: The estimation error $\sum_i (\hat{\mu}_i - \mu_i)^2/p$, where $\mu_i$ and $\hat{\mu}_i$ are the true and estimated difference in expression between class 1 and class 2 for gene $i$. Results are averages over 10 simulations: the standard error of the average is about 0.015 in the left panel and 0.01 in the right panel.

5. NATIONAL CANCER INSTITUTE CANCER LINES AND SUBCLASS DISCOVERY

Here we describe how to use nearest centroid shrinkage to discover subclasses. We consider data from Ross et al. (2000) that consist of measurements on 6,830 genes on 61 cell lines. The samples have been categorized into eight different cancer classes: breast (BRE), CNS, colon (COL), leukemia (LEU), melanoma (MEL), non-small cell lung cancer (NSC), ovarian (OVA) and renal (REN). We randomly chose a training set of size 40 and a test set of size 21, so that the classes were well represented in both sets. Default (equal) soft thresholding was used, with the prior probabilities set to the sample class proportions. The results are shown in Figure 11. The best cross-validated error rate occurs at about 5,000 genes, giving a test error of 5/21. Adaptive thresholding failed to improve this result. We also tried both support vector machines (Ramaswamy et al., 2001) and regularized discriminant analysis (Section 7). Both gave five errors on the test set. However, neither method gave a simple picture of the data.

FIG. 11. NCI cancer cell lines: training, cross-validation and test error curves.

It may be valuable biologically to look for distinct subclasses of diseases in microarray analyses. Next we show a generalization of the nearest shrunken centroid procedure that facilitates the discovery of such potentially important subclasses. Consider the problem illustrated in Figure 12. The values indicate average gene expression. There are two subclasses in class 2, and each of these can be distinguished from class 1 based on a small set of genes. However, nearest shrunken centroids will fail here, because the overall centroids for each class are the same. Linear separating classifiers, such as support vector machines (SVM), and linear discriminant analysis will also do poorly here. Either could be made to work with a suitable nonlinear transformation of the features (or choice of kernel for the SVM); while these may give low prediction error, they may not reveal the biologically important subclasses that are present.

FIG. 12. Two-class problem with distinct subclasses. Numbers indicate the average gene expression.

For any class, our idea is to apply $r$-means clustering to the samples in that class, resulting in $r$ subclasses for that class. Doing this for each of the $K$ classes results in a total of $K \cdot r$ subclasses. We apply nearest shrunken centroids to this $r \cdot K$ class problem. If the predicted class from this large problem is $h$, then our final predicted class is the class $k$ that contains subclass $h$. With typical sample sizes, the choice $r = 2$ will be most reasonable.
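A sketch of this subclass construction follows, again as an illustration rather than the paper's own code. It uses k-means (here taken from scikit-learn, an assumed dependency; any clustering routine would do) to split each class into $r$ subclasses; the nearest shrunken centroid procedure would then be run on the expanded labels and the predicted subclass mapped back to its parent class.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed available

def expand_to_subclasses(x, y, r=2, random_state=0):
    """Split each class into r subclasses by clustering its samples.
    x: p x n expression matrix; y: length-n integer class labels.
    Returns subclass labels y_sub and a dict mapping subclass -> original class."""
    y_sub = np.empty_like(y)
    sub_to_class = {}
    next_label = 0
    for k in np.unique(y):
        idx = np.where(y == k)[0]
        km = KMeans(n_clusters=r, n_init=10, random_state=random_state)
        clusters = km.fit_predict(x[:, idx].T)    # cluster the samples of class k
        y_sub[idx] = next_label + clusters
        for c in range(r):
            sub_to_class[next_label + c] = int(k)
        next_label += r
    return y_sub, sub_to_class
```

After fitting nearest shrunken centroids to (x, y_sub), a predicted subclass h would be reported as sub_to_class[h].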
TABLE 2
NCI subclass results: test errors (out of 21 samples) for the nearest shrunken centroid model with no subclasses (second column from left) and two subclasses per class (third column from left); the columns on the right show the resulting number of errors when the pair of subclasses for a given class is fused into one subclass

Number of genes   Zero subclasses   Two subclasses   BRE   CNS   COL   LEU   MEL   NSC   OVA   REN
6830              5                 6                6     6     6     6     6     7     5     8
6827              5                 6                5     5     7     6     6     7     6     6
6122              5                 5                6     5     5     5     5     5     5     5
3571              7                 6                8     7     6     6     6     6     7     6
1695              9                 6                8     7     6     6     6     6     7     6
696               9                 7                9     6     7     7     7     7     8     8
293               9                 6                8     7     7     6     6     7     7     6
119               10                6                8     8     8     6     8     7     7     8
42                10                12               13    14    14    12    12    12    12    12
17                14                14               14    14    16    14    14    14    14    13

Table 2 shows the results on the National Cancer Institute (NCI) data. Without subclasses, the test error rates start to rise when fewer than 2,000 or 3,000 genes are used. Using subclasses, we achieve about the same error rate with as few as 119 genes. The right part of the table shows that for 119 genes the subclasses are most important for BRE, CNS, COL, MEL and REN. The 119 gene solution is displayed in Figure 13 and shows some distinct subclasses among some of the main classes.

FIG. 13. NCI subclass results. Shown are pairs of centroids for each class for the genes that survived the thresholding.

6. CAPTURING HETEROGENEITY

In discriminating an “abnormal” from a “normal” group, the average gene expression may not differ between the groups. However, the variability in expression may be greater in the abnormal group, due to heterogeneity in the abnormal population. This is illustrated in Figure 14. Nearest centroid classification will not work in this case, since the class centroids are not separated. The subclass method of the previous section might help; we propose an alternative approach here. We define new features $x'_{ij} = |x_{ij} - \bar{m}_i|$, where $\bar{m}_i$ is the mean expression for gene $i$ in the normal group. Then we apply nearest shrunken centroids to the new features $x'_{ij}$.

FIG. 14. Illustration of heterogeneity in gene expression. Abnormal group A has the same average gene expression as the normal group N, but shows larger variability.

To illustrate this, we generated the expression of 1,000 genes in 40 samples—20 from a normal group and 20 from an abnormal group. All expression values were generated independently as standard Gaussian, except for the first 200 genes in the abnormal group, which had mean zero but standard deviation 2. An independent test set of size 200 was also generated. Nearest centroid shrinkage on the transformed features $x'_{ij}$ showed a test error rate of near zero, with 150 or more nonzero genes.
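The transformation used above is essentially one line of code. A minimal sketch, with `normal_label` (a name of our choosing) marking the normal group:

```python
import numpy as np

def heterogeneity_features(x, y, normal_label=0):
    """Section 6 transform x'_ij = |x_ij - mbar_i|: absolute deviation of each
    expression value from the gene-wise mean of the normal group."""
    mbar = x[:, y == normal_label].mean(axis=1)
    return np.abs(x - mbar[:, None])
```

Nearest shrunken centroids is then applied to the transformed matrix exactly as before.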
Figure 15 compares the results of nearest shrunken centroids on the raw expression values $x_{ij}$ and the transformed expression values $x'_{ij}$. Nearest centroid shrinkage on the raw values does poorly, with an error rate greater than 40%, while use of the transformed values reduces the error rate to near zero.

By transforming to the distance from the normal centroid, the use of the features $x'_{ij}$ might also provide discrimination in situations where the abnormal class is not heterogeneous, but is instead mean-shifted. The right panel of Figure 15 investigates this. The expression of the first 200 genes in the abnormal class has mean 0.5 and standard deviation 1 (versus 0 and 1 for the normal class). Now nearest shrunken centroids on the raw features is much more powerful, while use of the transformed features works poorly. We conclude that use of neither the raw nor transformed features dominates the other, and both should be tried on a given problem. We have successfully used the heterogeneity model to predict toxicity from radiation sensitivity using transcriptional responses to DNA damage in lymphoid cells (Rieger et al., 2003).

FIG. 15. Left: Test error for data simulated from the heterogeneous two-class problem, using nearest shrunken centroids on raw expression values (red) and transformed expression values $|x_{ij} - \bar{m}_i|$ (blue). Right: Same as in left panel, but data are simulated from the mean-shifted homogeneous two-class problem.

7. RELATIONSHIP TO OTHER APPROACHES

The discriminant scores (6) are similar to those used in linear discriminant analysis (LDA), which arise from using the Mahalanobis metric to compute the distance to centroids:

$$ \delta_k^{\mathrm{LDA}}(x^*) = (x^* - \bar{x}_k)^T W^{-1} (x^* - \bar{x}_k) - 2 \log \pi_k. \qquad (11) $$

Here we use vector notation and $W$ is the pooled within-class covariance matrix. With thousands of genes and tens of samples ($p \gg n$), $W$ is huge and any sample estimate will be singular (and hence its inverse is undefined). Our scores can be seen to be a heavily restricted form of LDA, necessary to cope with the large number of variables (genes). The differences are the following:

• We assume a diagonal within-class covariance matrix for $W$; without this, LDA would be ill-conditioned and fail.
• We use shrunken centroids rather than centroids as a prototype for each class.
• As the shrinkage parameter $\Delta$ increases, an increasing number of genes will have all their $d'_{ik} = 0$, $k = 1, \ldots, K$, due to the soft thresholding in (5). Such genes contribute no discriminatory information in (6), and in fact cancel in Equation (8).

Both our scores (6) and the LDA scores (11) are linear in $x^*_i$. If we expand the square in (6), discard the terms involving $x_i^{*2}$ (since they are independent of the class index $k$ and hence do not contribute toward class discrimination) and multiply by $-1/2$, we get

$$ \tilde{\delta}_k(x^*) = \sum_{i=1}^{p} \frac{x^*_i \bar{x}'_{ik}}{s_i^2} - \frac{1}{2} \sum_{i=1}^{p} \frac{\bar{x}'^{\,2}_{ik}}{s_i^2} + \log \pi_k, \qquad (12) $$

which is linear in $x^*_i$. Because of the sign change, our rule classifies to the largest $\tilde{\delta}_k(x^*)$. Likewise the LDA discriminant scores have the equivalent linear form

$$ \tilde{\delta}_k^{\mathrm{LDA}}(x^*) = x^{*T} W^{-1} \bar{x}_k - \tfrac{1}{2} \bar{x}_k^T W^{-1} \bar{x}_k + \log \pi_k. \qquad (13) $$

Regularized discriminant analysis (RDA; Friedman, 1989) leaves the centroids alone and modifies the covariance matrix in a different way,

$$ \delta_k^{\mathrm{RDA}}(x^*) = (x^* - \bar{x}_k)^T (W + \lambda I)^{-1} (x^* - \bar{x}_k), \qquad (14) $$

where $\lambda$ is a parameter (like our $\Delta$). The fattened $W + \lambda I$ is nonsingular, and as $\lambda$ gets large, this procedure approaches the nearest centroid procedure (with no variance scaling or centroid shrinking).
A slightly modified version uses $W + \lambda D$, where $D = \mathrm{diag}(s_1^2, s_2^2, \ldots, s_p^2)$. As $\lambda$ gets large, this approaches the variance-weighted nearest centroid procedure. In practice, we normalize this regularized covariance by dividing by $1 + \lambda$, leading to the convex combination $(1 - \alpha) W + \alpha D$, where $\alpha = \lambda/(1 + \lambda)$. Although the relative distances do not change, this is important when making the adjustment for the class priors. Although RDA shows some promise, it is more complicated than our nearest shrunken centroid procedure. Furthermore, in the process of its regularization, it does not select a subset of genes as the shrunken centroid procedure does. We are considering other hybrid approaches of RDA and nearest centroids in ongoing research projects.

8. NEAREST CENTROID CLASSIFIER VERSUS LDA

As discussed in the previous section, the nearest centroid classifier is equivalent to Fisher's linear discriminant analysis if we restrict the within-class covariance matrix to be diagonal. When is this restriction a good one? Consider a two-class microarray problem with $p$ genes and $n$ samples. For simplicity we consider the standard (unshrunken) nearest centroid classifier and standard (full within-class covariance) LDA. The recent thesis of Levina (2002) did some theoretical comparisons of these methods. She assumed $p \to \infty$, $n \to \infty$ and $p/n \to \gamma \in (0, 1)$, and analyzed the worst case error of each method. The relative performance of the two methods depends on the correlation structure of the features. Her results show that if $p$ is a large fraction of $n$, for a large class of correlation structures, nearest centroid classification outperforms full LDA.

Now in our problem, usually we have $p \gg n$: in that case, LDA is not even defined without some regularization. Hence to proceed we assume that $p$ is a little less than $n$ and hope that what we learn will extend to the case $p > n$. Let $x_j$ be a $p$-vector of gene expression values in class $j$. Suppose $x_1 \sim N(0, \Sigma)$ and $x_2 \sim N(\mu, \Sigma)$, where $\Sigma$ is a full (nondiagonal) matrix. Then LDA uses the maximum likelihood unbiased estimate of $\Sigma^{-1}\mu$, while nearest centroid uses a biased estimate. However, the LDA method estimates $\Sigma^{-1}\mu$ in a multivariate manner, and hence will tend to have higher variance. What is the resulting bias–variance tradeoff and how does it translate into misclassification error?

We did an experiment with $p = 30$ and $n = 40$, with 20 samples in each of two classes. We set the $ij$th element of $\Sigma$ to $\rho^{|i-j|}$, where $\rho$ was varied from 0 to 0.8. Each of the components of the mean vector $\mu$ was set to $\pm 1$ at random: such a mixed vector is needed to give full LDA a potential advantage over LDA with a diagonal covariance. For each simulation, an independent test set of size 500 was also generated. The results of 100 simulations from this model are shown in Figure 16.

FIG. 16. Simulation results: bias and variance (top panels) and mean-squared error and misclassification error (bottom panels) for linear discriminant analysis and the nearest centroid classifier. Details of the simulation are given in the text. The nearest centroid classifier outperforms LDA because of its smaller variance.

Bias, variance and mean-squared error refer to estimation of $\Sigma^{-1}\mu$. For small correlations, the underlying (diagonal covariance) model for nearest centroids is approximately correct and the method wins; LDA shows a small improvement in bias for larger correlations, but this is more than offset by the increased variance. Overall the nearest centroid method has lower mean-squared error and test misclassification error in all cases.
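A self-contained sketch of one run of this experiment is given below. It is our own illustration: the particular $\rho$, the random seed, the use of the estimated midpoint rule for classification, and the absence of any averaging over 100 repetitions are choices made here for concreteness.

```python
import numpy as np

def simulate_lda_vs_centroid(p=30, n_per_class=20, rho=0.5, n_test=500, seed=0):
    """One run of the Section 8 comparison: Sigma_ij = rho^|i-j|, mean vector with
    random +/-1 entries, full-LDA versus diagonal (nearest centroid) rule."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    mu = rng.choice([-1.0, 1.0], size=p)
    L = np.linalg.cholesky(Sigma)

    def sample(mean, m):
        return mean + rng.standard_normal((m, p)) @ L.T   # rows ~ N(mean, Sigma)

    x1, x2 = sample(np.zeros(p), n_per_class), sample(mu, n_per_class)
    xbar1, xbar2 = x1.mean(axis=0), x2.mean(axis=0)
    W = (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False)) / 2.0   # pooled within-class covariance
    D = np.diag(np.diag(W))                                           # diagonal restriction

    beta_full = np.linalg.solve(W, xbar2 - xbar1)   # full LDA direction
    beta_diag = np.linalg.solve(D, xbar2 - xbar1)   # nearest centroid (diagonal) direction

    t1, t2 = sample(np.zeros(p), n_test // 2), sample(mu, n_test // 2)
    mid = (xbar1 + xbar2) / 2.0

    def error(beta):
        e1 = np.mean((t1 - mid) @ beta > 0)   # class-1 test points called class 2
        e2 = np.mean((t2 - mid) @ beta < 0)   # class-2 test points called class 1
        return (e1 + e2) / 2.0

    return error(beta_full), error(beta_diag)
```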
Now for real microarray problems, $p \gg n$, and both LDA and nearest centroid methods can be improved by appropriate regularization or shrinkage. We have not included regularization in the above comparison, but the above results suggest that the bias–variance tradeoff will cause the nearest centroid method to outperform full LDA.

9. DISCUSSION

The nearest shrunken centroid classifier is potentially useful in any high-dimensional classification problem. In addition to its application to gene expression arrays, it could also be applied to other kinds of emerging genomic data, including mass spectroscopy for protein measurements, tissue arrays and single nucleotide polymorphism arrays. Our proposal can also be applied in conjunction with unsupervised methods. For example, it is now standard to use hierarchical clustering methods on expression arrays to discover clusters in the samples (Eisen, Spellman, Brown and Botstein, 1998). The methods described here can identify subsets of the genes that succinctly characterize each cluster.

Finally, we touch on computational issues. The computations involved in the nearest shrunken centroid method are straightforward. One important detail: in the denominator of the statistics $d_{ik}$ in Equation (1) we add the same positive constant $s_0$ to each of the $s_i$ values. This guards against the possibility of large $d_{ik}$ values arising by chance from genes at very low expression levels. We set $s_0$ equal to the median value of the $s_i$ over the set of genes. A similar strategy was used in the significance analysis of microarrays (SAM) methodology of Tusher, Tibshirani and Chu (2001). We have developed a package, in both Excel and R, called Prediction Analysis for Microarrays. It implements all of the nearest shrunken centroids methodology discussed in this article and is available at the website http://www-stat.stanford.edu/∼tibs/PAM.

REFERENCES

ALIZADEH, A. A., EISEN, M. B., DAVIS, R. E., MA, C., LOSSOS, I. S., ROSENWALD, A., BOLDRICK, J. C., SABET, H., TRAN, T., YU, X., POWELL, J. I., YANG, L., MARTI, G. E., MOORE, T., HUDSON, JR., J., LU, L., LEWIS, D. B., TIBSHIRANI, R., SHERLOCK, G., CHAN, W. C., GREINER, T. C., WEISENBURGER, D. D., ARMITAGE, J. O., WARNKE, R., LEVY, R., WILSON, W., GREVER, M. R., BYRD, J. C., BOTSTEIN, D., BROWN, P. O. and STAUDT, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 503–511.

AMBROISE, C. and MCLACHLAN, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. U.S.A. 99 6562–6566.

DONOHO, D. and JOHNSTONE, I. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.

EISEN, M. B., SPELLMAN, P. T., BROWN, P. O. and BOTSTEIN, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95 14863–14868.

FRIEDMAN, J. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 165–175.

GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J. P., COLLER, H., LOH, M., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. and LANDER, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531–537.
HASTIE, T., TIBSHIRANI, R., BOTSTEIN, D. and BROWN, P. (2001). Supervised harvesting of expression trees. Genome Biology 2(1) research/0003.

HEDENFALK, I., DUGGAN, D., CHEN, Y., RADMACHER, M., BITTNER, M., SIMON, R., MELTZER, P., GUSTERSON, B., ESTELLER, M., RAFFELD, M., YAKHINI, Z., BEN-DOR, A., DOUGHERTY, E., KONONEN, J., BUBENDORF, L., FEHRLE, W., PITTALUGA, S., GRUVBERGER, S., LOMAN, N., JOHANNSSON, O., OLSSON, H., WILFOND, B., SAUTER, G., KALLIONIEMI, O., BORG, A. and TRENT, J. (2001). Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine 344 539–548.

KHAN, J., WEI, J., RINGNER, M., SAAL, L., LADANYI, M., WESTERMANN, F., BERTHOLD, F., SCHWAB, M., ANTONESCU, C., PETERSON, C. and MELTZER, P. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7 673–679.

LEVINA, E. (2002). Statistical issues in texture analysis. Ph.D. dissertation, Dept. Statistics, Univ. California, Berkeley.

RAMASWAMY, S., TAMAYO, P., RIFKIN, R., MUKHERJEE, S., YEANG, C., ANGELO, M., LADD, C., REICH, M., LATULIPPE, E., MESIROV, J., POGGIO, T., GERALD, W., LODA, M., LANDER, E. and GOLUB, T. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. U.S.A. 98 15149–15154.

RIEGER, K., HONG, W., TUSHER, V., TANG, J., TIBSHIRANI, R. and CHU, G. (2003). Toxicity of radiation therapy associated with abnormal transcriptional responses to DNA damage. Submitted.

ROSS, D., SCHERF, U., EISEN, M., PEROU, C., REES, C., SPELLMAN, P., IYER, V., JEFFERY, S., VAN DE RIJN, M., WALTHAM, M., PERGAMENSCHIKOV, A., LEE, J., LASHKARI, D., SHALON, D., MYERS, T., WEINSTEIN, J., BOTSTEIN, D. and BROWN, P. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24 227–235.

TIBSHIRANI, R., HASTIE, T., NARASIMHAN, B. and CHU, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. U.S.A. 99 6567–6572.

TUSHER, V. G., TIBSHIRANI, R. and CHU, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98 5116–5121.