Masaryk University
Faculty of Informatics
}w¡¢£¤¥¦§¨!"#$%&123456789@ACDEFGHIPQRS`ye|
Computational biomarker
discovery: methods and
practice
Habilitation Thesis
Collection of articles
Ing. Vlad Popovici, M.Sc., Ph.D.
Brno, 2017
Acknowledgement
The articles collected in this thesis span more than a decade of research
activity and are the result of many fruitful collaborations with
researchers across the world. My warmest thanks are due to all coauthors
of the selected papers. Last but not least, I would like to thank
my family for their unwavering patience and especially my wife for
her continuous support and encouragement during all these years: without
her, this thesis would never have been written.
Abstract
The development of high-throughput techniques allows us to explore
biological samples at a scale never achieved before.
Only two decades ago, the bottleneck of biological discovery was on
the experimental side. Today, it has shifted to the analytical side,
drawing more and more computational disciplines into play.
The definition of bioinformatics nicely traces this evolution: in the
beginning, it was seen as the application of computer programs to sequence
alignment, protein structure prediction and virtual evolution,
while nowadays it is more of an umbrella term for a wide spectrum
of methods combining computer science, statistics, mathematics and
engineering with the goal of analyzing and interpreting biological
data.
The present thesis gathers twelve peer-reviewed journal articles
in the field of bioinformatics that are related to biomarker discovery
and validation. While the methods developed and employed are not
specific to any particular pathology, the majority of the results were
obtained in the field of oncology, particularly in the case of colon and
breast cancers. The articles reproduced here deal with various aspects
of biomarker discovery: (i) development of methods for gene expression
data normalization with applications (Chapters 7, 14, 15); (ii)
classifiers for biomarker design and their applications (Chapters 8, 11,
17); (iii) general methodological aspects for biomarker discovery and
validation applied to problems in breast and colon cancers (Chapters
9, 10, 13); and (iv) methods for histopathology image analysis in
the context of molecular data for proxy biomarker discovery (Chapters
12, 16, 18). Naturally, this is an over-simplified view since each of
these articles falls under several categories.
The thesis is written as a commentary to a collection of journal
articles with estimated personal contribution to each article varying
between 5% and 80%, for an average of about 40%.
Contents
I COMMENTARY 1
1 Introduction 3
2 Of DNA and gene expression 7
2.1 DNA and genetic information 7
2.2 DNA microarrays 9
3 Gene expression data preprocessing 13
3.1 Data acquisition and background correction 13
3.2 Signal estimation at probeset level 15
3.3 Quality control for Affymetrix microarrays 16
3.4 A note on normalizing PCR expression data 19
4 Comments on the performance of predictive and prognostic models built on gene expression data 21
4.1 General considerations on model learnability 22
4.2 A note on model performance estimation 25
5 Integration of pathology images: towards a multimodal biomarker discovery 31
6 Concluding remarks 37
II SELECTED ARTICLES 45
7 Selecting control genes for RT-QPCR using public microarray data 47
8 Rgtsp: a generalized top scoring pairs package for class prediction 59
9 Effect of training-sample size and classification difficulty on the accuracy of genomic predictors 63
10 The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models 79
11 Identification of a poor-prognosis BRAF-mutant-like population of patients with colon cancer 97
12 Identification of "BRAF-Positive" Cases Based on Whole-Slide Image Analysis 107
13 A robust genomic signature for the detection of colorectal cancer patients with microsatellite instability phenotype and high mutation frequency 117
14 Expression profiling with RNA from formalin-fixed, paraffin-embedded material 129
15 Molecular risk assessment of BIG 1-98 participants by expression profiling using RNA from archival tissue 147
16 Joint analysis of histopathology image features and gene expression in breast cancer 163
17 Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer 175
18 Image-based surrogate biomarkers for molecular subtypes of colorectal cancer 191
PART I
COMMENTARY
1 Introduction
The last years of the twentieth century witnessed a true technological
revolution in biology: the development of the first DNA microarrays. They
represented a major step forward from the previous semi-quantitative
techniques as, for the first time, it was possible to measure simultaneously
the expression level of hundreds (and later, tens of thousands) of genes.
Biology was entering the era of high-throughput data generation. The
first published results from expression profiling experiments were
extremely encouraging so, in the beginning, it was hoped that a cure
would be found within a decade for most of the diseases with high
(social and economic) impact. Yet, twenty years later we still face the
same problems in predicting the outcome of a treatment or the likelihood
of a cancer to metastasize, despite the tremendous developments
during this period. With a few exceptions (e.g. chronic myelogenous
leukemia driven by the BCR/ABL fusion gene now has a targeted treatment
with very good results), the large majority of cancers are still
treated with standard chemotherapy, as they were half a century ago.
So what went wrong? Actually, nothing! As with any new technology
of high impact, false hopes and plain naivety fooled us into believing
that, finally, the holy grail of modern medicine - individualized
treatment - was within grasp. However, these new technologies
allowed us to gain insights into a totally new dimension of biology
that greatly expanded our knowledge - but also brought numerous
challenges in digesting the new types of data.
This thesis is about such challenges of extracting actionable gems
of knowledge from large collections of high-throughput genomic data
and their transformation into predictive and prognostic models. Additionally,
we discuss later developments in integrating computational
pathology tools, both for biomarker discovery and for developing a
more comprehensive view of the disease of interest. A number of
methods for addressing these challenges are presented and discussed
and they represent a volume of work in bioinformatics spanning the
last decade.
It is clear that, today, microarrays - the main technological
platform used throughout this work - are slowly fading away, being
replaced by a more versatile technique: RNASeq. Nevertheless,
the work and the results reviewed here remain valid since most of
the problems one has in building predictive/prognostic models are
the same for RNASeq: normalization, batch effects, validation, model
learnability and model interpretability. These aspects are addressed
in the various articles reproduced here (and in the corresponding
supplemental materials available online from the respective journals)
and they represent but a drop in the ocean of all the choices one is
presented with when challenged to mine genomic data.
The expression of genes represents only one facet of the biological
reality. Many other perspectives could be added by considering the
information at the protein or epigenetic level, or even at a different
scale, such as the tissue or organism level (Figure 1.1). Ideally, all these data
would be taken into account when investigating a pathology, but our
current ability to manage, mine and interpret such complex
collections of data is still limited.
Figure 1.1: Data puzzle in biomarker discovery: a plethora of modalities
that each bring a different perspective on the investigated biological
phenomenon.
The rest of this first part starts with some background information
and a short overview of the technological aspects to facilitate the
understanding of the subsequent discussions (Chapters 2 and 3). Then, the
next chapters are dedicated to commenting on some aspects of biomarker
discovery from both gene expression and histopathology images (Chapters
4 and 5). The discussion includes some additional results that were
not published but which may help enrich the reproduced articles. Finally,
some concluding remarks are given in Chapter 6.
2 Of DNA and gene expression
2.1 DNA and genetic information
Deoxyribonucleic acid (DNA) is a large molecule that encodes all the
biological information needed for the development and reproduction
of all living organisms. It is formed of a pair of strands intertwined
in the so-called double helix. The constitutive unit of this molecule is a
nucleotide - a monomer consisting of a nucleobase (one of cytosine
(C), guanine (G), adenine (A) or thymine (T)), a sugar (deoxyribose)
and a phosphate group. The nucleotides are bound to one another
via covalent links between sugars and phosphates, which alternate
to form the sugar-phosphate backbone. The nucleobases of one
strand bind to the complementary ones from the opposite strand according
to the base pairing rules: C with G, A with T. The two strands
are thus said to be complementary and, since they run in opposite
directions, antiparallel. The process of base binding is called hybridization
(or annealing) and the opposite process, of decoupling the
two strands, is called denaturation (or melting).
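The base pairing rules can be illustrated with a minimal Python sketch that computes the reverse complement of a strand and checks whether a probe would bind a target. The function names are hypothetical, and perfect complementarity is a simplifying assumption: real hybridization tolerates some mismatches and depends on temperature and sequence composition.

```python
# Watson-Crick base pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the complementary strand, read in its own 5'->3' direction."""
    # The strands are antiparallel, hence the reversal.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hybridizes(probe, target):
    """Toy model (assumption): a probe binds only a perfectly complementary target."""
    return reverse_complement(probe) == target.upper()


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(hybridizes("GCAT", "ATGC"))
```

The same reversal-plus-complement operation is what relates a microarray probe to the sequence it is designed to capture.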
The information is coded in the sequences of bases and it relates in
part to the production of various proteins or the regulation of various
processes. The molecular unit of transmission of hereditary information
is the gene - a variable-length sequence of bases. In humans, it is
estimated that the number of genes is somewhere between 20,000 and
25,000 and they are organized in 23 pairs of chromosomes. The central
dogma of molecular biology provides a simplified workflow of information
transmission within a biological system: "DNA is used to
produce ribonucleic acid (RNA) (transcription), RNA is used to produce
proteins (translation)". However, there are many other information
flows that are not covered by this model, for example the methylation
processes which alter the gene expression levels. The main
flows of information at the molecular level are depicted in Figure 2.2.
DNA replication ensures the transmission of information from
parent to progeny and is carried out by a protein
complex called the replisome, usually in the S-phase of the cell cycle.
DNA transcription is the process of producing messenger RNA (mRNA)
from a segment of DNA by RNA polymerases, mainly under the control
of various transcription factors. After some post-transcriptional
modifications (most notably alternative splicing) of the mRNA precursors,
the mRNA is exported from the nucleus and its translation takes
place in ribosomes, leading to the production of polypeptides that,
after further processing, will finally result in proteins.
Figure 2.1: Schematic representation of the DNA molecule. Adapted
from https://en.wikipedia.org/wiki/Nucleic_acid
Figure 2.2: A very schematic representation of the main information
flow according to the central dogma
This overly simplified description of the information flow at the molecular
level provides the basis for understanding the interest in measuring
gene expression levels: in general, it is assumed that the
amount of mRNA produced from various genes can be equated to
gene expression levels and is proportional to the amount of the resulting
protein products. Thus, the key to the whole process is the reliable
estimation of the amount of specific mRNA sequences.
2.2 DNA microarrays
The DNA microarray technology has its roots in electrophoresis methods
for the detection of known DNA sequences (Southern blotting)
and dates back to the last decades of the twentieth century. It consists
of challenging a set of target DNA fragments with a series of
known (complementary) DNA sequences and measuring the abundance
of the resulting bound molecules, which is usually obtained
as the intensity of some electrical signal. In general, the DNA (micro)array
is a substrate (nylon membrane, glass or plastic) on which a
number of single-stranded DNA fragments of known length and sequence
(the probes) are deposited. The properly prepared single-stranded
target DNA (or RNA) is then "washed over" the array, with the intent of
having the fragments of target DNA attach to the probes (hybridize),
thus allowing the quantification of their abundance levels (Figure
2.3). While, in theory, this approach should allow a precise quantification
of gene expression levels (as abundance of specific sequences),
many factors influence the quality of the final measurements, from
microarray design (including the selection of probes and
their distribution over the microarray surface) and sample preparation
(quality of the extracted DNA/RNA, chemical agents, amplification
protocol, etc.), to the data preprocessing methods (normalization,
batch effects removal, etc.). An exhaustive presentation of this
subject is beyond the scope of the present dissertation, but we will
briefly present the data generation process on an Affymetrix
(http://www.affymetrix.com) platform, which became the de facto standard
for microarray experiments.
Figure 2.3: The core principle of gene expression level measuring.
We will exemplify the experimental protocol for the Affymetrix
GeneChip HG-U133 Plus 2.0 array. The array contains 1,300,000 unique
oligonucleotide probes (features, of length 25 nucleotides) targeting more
than 47,000 transcripts and variants representing about 39,000 human
genes (and candidate genes). A characteristic of the Affymetrix arrays
is the use of probe pairs: for each target sequence there is a pair
of probes designed such that one perfectly matches the target (perfect
match probe: PM) while the second one has a single mismatched
nucleotide (mismatch probe: MM) and is supposed to be used as a
negative control to improve the specificity of the measurements. In
this array, there are 11 pairs of probes for each sequence, forming a
probeset. As a side remark, we note that in the latest versions of analytical
protocols, the MM probes are no longer used.
A hybridization experiment (Figure 2.4) involves the following
main steps (see the technical manual at https://assets.thermofisher.com/TFS-Assets/LSG/manuals/expression_analysis_technical_manual.pdf):
1. isolation and quantitation of total RNA from the sample;
2. reverse transcription to obtain complementary DNA (cDNA);
3. transcription and labeling to obtain complementary RNA (cRNA), followed
by
4. fragmentation (by sonication) to obtain short single-stranded
RNA segments;
5. hybridization of the RNA segments on the array and, after washing
it, acquisition of the raw transcript abundance as the intensity
signal in a scanned image.
Figure 2.4: Overview of a hybridization experiment on an Affymetrix
platform (from [25])
Each of these steps has an influence on the final result and a deviation
from the protocol may lead to errors that are difficult to detect.
As a consequence, aside from following standardized protocols, one
has to resort to a number of data preprocessing steps and
preliminary analyses before the data can be considered clean enough for
proper analysis. Another consequence is that simply combining data
from different protocols is normally not possible, due to strong effects
introduced in the data generation step by individual laboratories (even
technicians), so a batch effect removal step is mandatory.
Before concluding this chapter, we note that the quality of the original
RNA extracted from the biological sample is equally important.
Most notably, the fixation of the biological specimens in formalin and
paraffin blocks leads to a degradation of the genetic material. Special
protocols need to be devised for such cases, protocols that are
accompanied by specific computational methods for data normalization
(see Sections 7, 14 and 15 in the present dissertation).
3 Gene expression data preprocessing
Due to the technical variations in gene expression level measurements,
their direct analysis is bound to fail. To cope with the various artifacts
introduced by the experimental conditions and to enhance the signal,
various preprocessing steps are needed. It is hoped that, after preprocessing,
most of the "true" signal is preserved while the noise is
reduced, making the measurements comparable across samples
and, if possible, across experiments.
The preprocessing of microarray experimental data is usually tailored
to the platform and many alternative paths are available. The
question of choosing the "right" preprocessing pipeline has received
considerable attention from the beginning (see, for example [3, 2]) but
no definite answer was given. Nevertheless, through experimentation
and learning from earlier failures, standard preprocessing workflows
emerged for major platforms. Since in almost all the examples discussed
in the present dissertation the Affymetrix platform is used (or
its derivatives), we will briefly review the main steps for data preprocessing
and curation for analysis. One has to bear in mind that,
while many of these preprocessing steps can be (and are) automated,
the detection of abnormalities relies in many cases on manual inspection
and ad hoc judgement.
3.1 Data acquisition and background correction
As mentioned, for the Affymetrix platform (as for the majority of microarray
platforms), the initial raw data is obtained by scanning the
microarrays and quantifying the intensity of the light at each probe
location (see Figure 3.1). The amplitude of the signal is given by the
quantizer (scanner) and is typically between 0 and 2^16 − 1 (for a 16-bit
quantizer). From Figure 3.1 it is apparent that the signal is affected by
both systematic noise (while the probes are distributed randomly on the array,
the images show a clear stripe pattern of lower intensities which
may be attributed to the scanner) and random noise (darker or lighter
spots, indicated in the image by white arrows). Background correction
aims to remove the systematic noise: it performs
a locally weighted adaptive background estimation (e.g. the 2nd percentile
of the signal intensity in the local neighborhood), followed by subtraction.
Figure 3.1: An example of scanned Affymetrix microarrays. The yellow
circles indicate control spots while the white arrows indicate artifacts.
(adapted from [28]).
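The idea of local background estimation and subtraction can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the vendor's algorithm: the image is split into square zones, the background of each zone is taken as a low percentile of its intensities, and the distance-based weighting between neighboring zones is omitted. The function names, the zone size and the floor value are hypothetical.

```python
import math

MAX_INTENSITY = 2**16 - 1  # upper end of the range of a 16-bit quantizer


def percentile(values, q):
    """Return the q-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def zone_background_correct(image, zone_size, q=2, floor=1.0):
    """Subtract a per-zone background estimate from a 2-D grid of intensities.

    Simplification (assumption): each cell uses only the background of its
    own zone; no weighting between neighboring zones is applied.
    """
    n_rows, n_cols = len(image), len(image[0])
    corrected = [row[:] for row in image]
    for r0 in range(0, n_rows, zone_size):
        for c0 in range(0, n_cols, zone_size):
            cells = [(r, c)
                     for r in range(r0, min(r0 + zone_size, n_rows))
                     for c in range(c0, min(c0 + zone_size, n_cols))]
            background = percentile([image[r][c] for r, c in cells], q)
            for r, c in cells:
                # Keep corrected values positive so a log transform stays defined.
                corrected[r][c] = max(image[r][c] - background, floor)
    return corrected


if __name__ == "__main__":
    toy_image = [[100, 110, 500, 520],
                 [105, 120, 510, 530]]
    print(zone_background_correct(toy_image, zone_size=2))
```

In the toy image the right half is uniformly brighter than the left half; after correction both halves are brought to a comparable scale, which is the intended effect on a systematic intensity gradient.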
3.2 Signal estimation at probeset level
The goal of this step is to estimate gene-specific (more precisely,
probeset-specific) intensity values as a proxy for the amount of transcript in
the sample. As mentioned, the Affymetrix microarrays use a pair of
probes - perfect match (PM) and mismatch (MM) - in an attempt to
improve the specificity of the signal by subtracting an estimate of
nonspecific hybridization (non-targeted molecules that, nevertheless,
bind to the probe). The estimates of these quantities are based on
robust statistics (like Tukey’s biweight estimate). Once they are estimated,
for each probeset an average difference is used as the final
signal intensity estimate and a scaling step is used to ensure that the
signal is comparable across arrays. This signal estimation procedure
is the one initially proposed by Affymetrix and implemented in their
MAS5 normalization method.
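To make the role of the robust statistic concrete, the following Python sketch implements a one-step Tukey biweight location estimate and applies it to the PM − MM differences of a single probeset. It is a simplified illustration and not the MAS5 implementation: the tuning constants `c` and `eps` are assumed values, the function names are hypothetical, and the further adjustment and scaling steps of the complete procedure are omitted.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight estimate of location.

    Observations far from the median (in units of the median absolute
    deviation) receive a weight of zero, so outliers do not affect the result.
    """
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


def probeset_signal(pm, mm):
    """Robust average of the PM - MM differences of one probeset (sketch)."""
    return tukey_biweight([p - m for p, m in zip(pm, mm)])


if __name__ == "__main__":
    pm = [120, 130, 125, 500]   # hypothetical perfect-match intensities
    mm = [100, 110, 105, 90]    # hypothetical mismatch intensities
    print(probeset_signal(pm, mm))
```

In this toy probeset one probe pair is clearly aberrant; an ordinary mean of the differences would be dominated by it, whereas the biweight estimate stays close to the value suggested by the other three pairs.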
We noted already that the use of MM probes has been discontinued
in the later versions of various preprocessing methods, mainly
because experience has shown that MM probes are unreliable and
often have higher intensities than the PM probes. This observation
led to the development of alternative probeset-level signal estimators
such as those implemented in RMA (log-scale robust multi-array
analysis), which tries to exploit the whole set of arrays in order to obtain
better estimates [12]. This method uses the iterative median polish
procedure to obtain the estimates of the probeset-level signals.
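A minimal sketch of median polish may help to see how the probeset-level signal is separated from the probe effects. It assumes a small matrix of log-scale intensities for one probeset, with probes as rows and arrays as columns; the function name and the stopping rule are illustrative choices, not the reference implementation of RMA.

```python
from statistics import median


def median_polish(matrix, n_iter=10, tol=1e-9):
    """Fit matrix[i][j] ~ overall + row_effect[i] + col_effect[j] by
    alternately sweeping out row medians and column medians."""
    resid = [list(row) for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the rows (probe effects).
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep the columns (array effects).
        col_med = [median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop when a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


if __name__ == "__main__":
    # Hypothetical log2 intensities: 3 probes (rows) on 3 arrays (columns).
    log_pm = [[8, 9, 10],
              [9, 10, 11],
              [7, 8, 9]]
    overall, probe_eff, array_eff, _ = median_polish(log_pm)
    # Probeset-level signal on each array: overall level plus array effect.
    print([overall + a for a in array_eff])
```

Because medians rather than means are swept out, a single aberrant probe on one array ends up in the residuals instead of distorting the probeset-level estimate.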
As a final step, it is customary to apply quantile normalization
over the whole batch of arrays in an experiment, in order to align
the distribution of the signal across all arrays. The justification of this
step comes from the observation that most of the genes are expressed
similarly across experimental conditions, with only a small fraction
being differentially expressed, hence it is reasonable to assume that the
overall distribution of signal intensities should not vary much across
arrays.
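Quantile normalization itself is simple enough to be sketched in plain Python. In the sketch below each array is a list of intensities over the same probesets; the reference distribution is the mean of the sorted arrays, and every value is replaced by the reference value of the same rank. Ties are broken by position, which is a simplification, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Force all arrays (lists of equal length) to share one distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean intensity at each rank across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    arrays = [[5, 2, 3],
              [4, 1, 6]]
    print(quantile_normalize(arrays))
```

After normalization every array contains exactly the same set of values; only the assignment of values to probesets, i.e. the ranking within each array, is retained from the original data.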
Aside from MAS5 and RMA, a number of additional methods
have been proposed, but they did not reach the popularity of the two
mentioned here. For a comparison and discussion, see [11]. In the various
articles included in the second part of the present dissertation,
these two methods are used, with RMA being used in almost all
cases.
One fundamental difference should be emphasized between the
MAS5 and RMA (and any other multi-array normalization method):
in contrast with MAS5, the multi-array methods estimate their parameters
and the probeset-level signal by considering all available arrays
(i.e. all arrays with enough quality, see next section) and, thus, the
final result is influenced, at least theoretically, by each array in an experiment.
This observation has implications in evaluating the performance
of the predictive models, since any re-sampling method relies
on the assumption of independence between training and testing sets.
It is clear that, in order to enforce this independence, a proper performance
estimation procedure would repeat the normalization of each
training set and normalize the testing set using the parameters estimated
on the training set alone. This aspect will be later discussed in
Chapter 4 (and in Sections 9 and 10).
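The principle of estimating the normalization parameters on the training set alone can be sketched for the quantile normalization step. In this hedged example, the only "parameter" is the reference distribution, which is learned from the training arrays and then applied unchanged to a test array. The function names are hypothetical, and the other parameters of a multi-array method (such as probe effects) are not covered.

```python
def fit_reference(training_arrays):
    """Learn the reference distribution from the training arrays only."""
    n = len(training_arrays[0])
    sorted_arrays = [sorted(a) for a in training_arrays]
    return [sum(s[k] for s in sorted_arrays) / len(sorted_arrays)
            for k in range(n)]


def apply_reference(array, reference):
    """Map an array onto a frozen reference distribution, rank by rank."""
    order = sorted(range(len(array)), key=lambda i: array[i])
    out = [0.0] * len(array)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out


if __name__ == "__main__":
    train = [[5, 2, 3], [4, 1, 6]]   # hypothetical training arrays
    test_array = [100, 300, 200]     # hypothetical test array
    reference = fit_reference(train)
    train_norm = [apply_reference(a, reference) for a in train]
    test_norm = apply_reference(test_array, reference)
    print(reference, test_norm)
```

In a cross-validation loop, `fit_reference` would be called again on every training fold, so that no information from the held-out arrays leaks into the normalized training data.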
3.3 Quality control for Affymetrix microarrays
An essential step - actually a suite of steps applied at different stages
of data preprocessing - is the quality control of the samples included
in the analysis. We have already mentioned the visual inspection
of the scanned images (see Figure 3.1), which can help identify obviously
defective arrays. However, as the number of arrays in an experiment
increases, this task becomes tedious, in addition to being
subjective. To provide a more quantitative measure of array
quality, different quality scores and coefficients have been proposed,
along with guidelines for selecting the quality criteria. Still, depending
on the type of experiment, these guidelines have to be adapted.
For example, in the case of profiling archival material (formalin-fixed,
paraffin-embedded (FFPE)), it is expected that the overall intensity of
the signal is lower than in the case of fresh frozen material, due to the
degradation of the RNA. This aspect is addressed on several occasions
in the articles reproduced here - see Sections 11, 14 and 15.
Two main criteria are used to judge the quality of the individual
microarrays:
• percentage of present calls (%PC): this is the main quality metric
provided by Affymetrix since the earlier versions of their arrays.
It is computed using the Wilcoxon signed-rank test to test whether significantly
more PMs have a higher signal than their corresponding
MMs, producing a detection call (absent, present or marginal).
This method is implemented within the MAS5 normalization
procedure. For example, if an array has a %PC below 80% then
one may choose to call it defective and remove it from further
processing.
• median normalized unscaled standard error (median NUSE): this is the
criterion usually used in the context of RMA normalization.
It relies on the estimation of the residuals from fitting a probe-level
model on all arrays in a batch. Briefly, the model assumes
that the normalized, background-adjusted, probe-level data is
a linear combination (in log-space) of the gene expression in an
array, a probe-level effect and a measurement error. Visualizing
the distribution of the residuals in an array can help in identifying
artifacts but, as mentioned, this is not a feasible approach
for large sets. Therefore, a summary statistic (like the median) can
be computed on a per-array basis and used in deciding whether
an array is of sufficient quality (for example, a cut-off value of
1.02 for median NUSE has proven reasonable for fresh frozen
samples).
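The per-array summary used in the second criterion can be sketched as follows, assuming that the standard errors of the probeset-level estimates are already available from a probe-level model fit (the fit itself is not shown). Each probeset's standard errors are divided by their median across arrays, and the median of these normalized values is taken per array. The function names and the toy numbers are hypothetical; the cut-off is the example value quoted above.

```python
from statistics import median


def median_nuse(se):
    """se[g][a]: standard error of probeset g on array a (assumed given).

    Returns one median NUSE value per array.
    """
    n_arrays = len(se[0])
    # Normalize each probeset so that its median standard error is 1.
    nuse = [[v / median(row) for v in row] for row in se]
    return [median([nuse[g][a] for g in range(len(se))])
            for a in range(n_arrays)]


def flag_low_quality(se, cutoff=1.02):
    """Indices of arrays whose median NUSE exceeds the chosen cut-off."""
    return [a for a, m in enumerate(median_nuse(se)) if m > cutoff]


if __name__ == "__main__":
    # Hypothetical standard errors: 3 probesets (rows) on 3 arrays (columns).
    se = [[0.1, 0.1, 0.2],
          [0.2, 0.2, 0.5],
          [0.3, 0.3, 0.3]]
    print(median_nuse(se), flag_low_quality(se))
```

An array of typical quality has a median NUSE close to 1; the cut-off is a parameter so that it can be relaxed for material where a lower signal is expected.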
An in-depth discussion of this matter is beyond the scope of the
present dissertation and has been addressed in several publications
(see, for example [13]). In the context of using gene expression data
derived from archival material (FFPE) we performed a less formal
comparison of the two approaches and found them to be highly concordant.
In Figure 3.2 the two criteria are plotted for a set of 240 arrays.
The "traditional" cut-off for fresh-frozen samples seemed to be
too drastic, therefore a less stringent value - indicated in red - for median
NUSE was adopted, since these arrays were expected to have, in general, a
lower signal. Still, it is apparent that there is a direct
relation between the two quality metrics.
Figure 3.2: Present call vs. median NUSE on a set of 240 customized
Affymetrix microarrays. This data has been used in articles reproduced
in Sections 11, 17.
3.4 A note on normalizing PCR expression data
Polymerase chain reaction (PCR) is a technique for amplifying a DNA
region by several orders of magnitude to allow the detection and then
quantification of the number of copies of that region, which is then
converted into an expression level. Without delving into details, we
just remark that this technique allows accurate measurement of the expression
levels of several tens of genes (thus, unlike microarrays, it is not a
high-throughput technique) and is used as a diagnostic tool due to its
low cost and relatively fast processing time. As always, this method
requires a proper data normalization before any gene-based score can
be computed.
The normalization implies computing a differential expression level
with reference to the expression of one or several control genes (housekeeping
genes) that are supposed to have stable expression in the
given condition. We have shown that some of the traditional control
genes may actually vary across a number of cases in breast and colon
cancer. Thus, a new set of control genes had to be selected for different
pathologies. To avoid the tedious trial-and-error cycle of experimental
biology, we developed a computational method that exploits
public data from microarray platforms ([23] and Section 7) and
proposes a score for estimating the suitability of the candidate gene.
This method has been used for selecting control genes in several experiments
and as the basis for computing genomic prognostic scores in
breast cancer (see Sections 14, 15, also [16, 1]).
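Normalization with respect to control genes is commonly expressed as a difference of cycle threshold (Ct) values. The sketch below is a generic illustration and not the scoring method of [23]: the Ct values are invented, the target gene name is a placeholder, and the conversion to a relative expression assumes that the amount of product doubles at every cycle.

```python
from statistics import mean


def delta_ct(ct, target, controls):
    """Ct of the target gene relative to the mean Ct of the control genes."""
    return ct[target] - mean([ct[g] for g in controls])


def relative_expression(ct, target, controls):
    """Relative expression, assuming perfect doubling at every PCR cycle."""
    return 2.0 ** (-delta_ct(ct, target, controls))


if __name__ == "__main__":
    # Hypothetical Ct values for one sample.
    ct = {"TARGET_GENE": 24.0, "RPLP0": 20.0, "UBB": 22.0}
    controls = ["RPLP0", "UBB"]
    print(delta_ct(ct, "TARGET_GENE", controls))
    print(relative_expression(ct, "TARGET_GENE", controls))
```

The sketch makes the dependence on the control genes explicit: if their expression is not stable across samples, the variation is transferred directly to every normalized target gene.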
In Figure 3.3 the variability (standard deviation) as a function of
mean expression level for several control genes is shown in a PCR
experiment involving 25 fresh frozen breast cancer samples. We note
that the proposed procedure allowed the identification of more robust
control genes (lower variability, stable low expression - RPLP0,
UBB, RPS11) which formed the basis of the reference point for the
genomic score.
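The kind of summary shown in the figure (standard deviation against mean expression) can be computed with a few lines of Python. The sketch ranks candidate control genes by their variability across samples; it is a simplified stand-in for the suitability score of [23], and the gene labels and expression values are invented for illustration.

```python
from statistics import mean, stdev


def rank_control_candidates(expression):
    """Return (gene, (mean, standard deviation)) pairs, most stable first.

    expression: dict mapping a gene name to its expression values
    across samples.
    """
    summary = {gene: (mean(values), stdev(values))
               for gene, values in expression.items()}
    return sorted(summary.items(), key=lambda item: item[1][1])


if __name__ == "__main__":
    # Hypothetical expression values across three samples.
    expression = {
        "GENE_A": [10.0, 10.5, 9.5],
        "GENE_B": [8.0, 12.0, 10.0],
    }
    for gene, (avg, sd) in rank_control_candidates(expression):
        print(gene, avg, sd)
```

Genes at the top of the ranking vary least across the samples and are therefore the more plausible reference points for a genomic score.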
Figure 3.3: Control genes (solid black triangles) and target genes for
the development of a genomic score. The newly proposed control
genes RPLP0, UBB, RPS11 perform better than the traditional ones
GUSB, ACTB and TFRC.
4 Comments on the performance of predictive
and prognostic models built on gene expression
data
From the beginning, it was clear that one of the main applications of
the microarray technology would be in the development of predictive
and prognostic models. Here we distinguish between two major types
of models built on gene expression data: in line with the standard
biostatistical nomenclature, we call a model predictive if it is intended
to predict whether a patient would respond or not to a treatment, and
we call a model prognostic if it is intended to predict whether or not
a patient would die from the disease (in a reasonable time frame) or
if it predicts the time to an event (such as disease relapse or death).
In general, building predictive models is much more complicated:
like the prognostic ones, they require a proper experimental design,
but, in addition, one is required to prove that the model is indeed
predictive of treatment response, rather than just prognostic (within
the specific treatment regimen). Another complication arises from the
fact that, at least in theory, building a predictive model requires
comparing a treated group of patients with an untreated group (normally,
patients would be randomly assigned to the two groups at
enrollment). In the case of severe pathologies like cancer it is nowadays
not possible to obtain such cohorts (with, maybe, the exception
of very early stages), since denying the treatment to a patient would
be unethical. Hence, most of the predictive models refer to predicting
the benefit from adding a new compound to the standard of care.
One of the early successes was the identification of the molecular
basis of a subtype of chronic leukemia characterized by the
expression of the chimeric BCR/ABL fusion gene [14] and a corresponding
targeted therapy [26]. The first prognostic models to reach
widespread use were intended to predict the survival of patients with breast
cancer. While they seemed to have a genuine prognostic value, they
also stirred a lot of criticism quite early on. The main concern was related
to the reproducibility of the models: independently developed
models led to different gene signatures and were apparently contradictory
(see, for example [9]). It took years and a sustained effort to
realize that breast cancer was a heterogeneous disease also from a
molecular perspective, hence the population sampling would dramatically
influence the gene expression signatures learned. Also, since
the genes were not independent but rather formed clusters of co-regulated
genes (gene modules), different models picked one or another
gene from the various clusters, so the final lists had only a small number of
common genes. On top of these biology-related causes, the chosen
computational modeling approaches were quite different and hence
the results differed as well. This early story should serve as a warning
that, from a computational/machine learning perspective, there is no
single "true" predictive model, but rather there are several views on
the same reality. What matters in the end is the validation and reproducibility
of the claimed results, not necessarily the names of the
genes in the models.
4.1 General considerations on model learnability
In practice, the usual scenario for developing a new predictive or
prognostic model starts with a biological or clinical problem - for
example, "build a prognostic model for triple-negative breast cancer
patients". This means that the patient population is (normally) well
defined (here, "triple-negative" meaning ER-, PgR- and Her2- breast
cancer), as is the end-point (say, time-to-relapse). However, there is
no guarantee that a solution to the problem actually exists or,
if it exists, that it can be found in the given feature space (in
our case, the gene expression space). Hence, the fundamental question
is: can a model actually be learned for the given problem? The
usual approach to answering this question is to systematically try solving
the problem using a number of different methods. But when
the results are not satisfactory, the question becomes even more difficult
to answer, because one can wonder whether the sample size was
sufficient, whether the methods attempted were appropriate, etc.
Ideally, we would like to have a "score" that would indicate how difficult
a problem is, independent of the methods. Clearly, as stated, this
is an unsolvable problem, but insights into the problem difficulty can
be gained by examining the performance of some basic classifiers. In
a sense, we would like to find a method of characterizing the problem
difficulty in the same way that classifiers can be characterized by their
Vapnik-Chervonenkis dimension [31].
[Figure 4.1 plot: "Relative complexity of the three problems"; x-axis: proportion of top features; y-axis: cumulative information; curves: ER status endpoint, pCR endpoint, pCR(ER-) endpoint]
Figure 4.1: Problem complexity as a function of cumulative information:
a "simpler" problem would have more informative features and,
hence, the corresponding curve would lie above those of the more complex
problems (e.g. black curve)
This is the context in which we set out to investigate the impact
of problem difficulty and sample size, with applications to a classification
problem in breast cancer. The MAQC-II project provided
the perfect opportunity (and the required data sets) for this investigation.
In our investigation (Section 9, [22]) we introduced a new index
called cumulative information, which was used to approximate the
problem complexity (Figure 4.1). This index is clearly an oversimplification
(for example, it does not account for the inter-variable
correlations), but it proved its utility in ranking the problems under
investigation. The same ranking was later reproduced by the observed
performance of the classifiers.
[Figure 4.2 plot: "Learning curves"; x-axis: training sample size (ticks: 20, 22, 26, 30, 36, 45, 60, 90, 180, Inf); y-axis: AUC; curves: ER status: FS1+QDA, pCR: FS5+LREG, pCR(ER-): FS4+LREG]
Figure 4.2: Learning curves for three problems in breast cancer. Note
the logarithmic scale on the x-axis. See Section 9 and [22] for details.
Interestingly, the obtained ranking of the biological problems mimicked
the clinicians' intuition that predicting the oestrogen-receptor
(ER) status is much easier than predicting whether a patient will have
a complete response to neo-adjuvant chemotherapy (in breast cancer).
We also studied the influence of the sample size on the quality
of the predictions. While the required sample size for constructing a
classifier for a given endpoint can be estimated only in toy examples
(under constraining assumptions regarding the underlying distributions),
we found that using learning curves for guiding the sample size
selection is more appropriate, even though much more computationally
intensive. For the same three problems ranked above, the learning
curves are shown in Figure 4.2.
The three learning curves suggest different behaviors of the classifiers
(only one representative classifier was chosen per problem):
for the easiest problem, increasing the sample size seems
to bring little benefit, whereas for the problem of average difficulty
the learning curve suggests
that the model could still be improved - at the cost of doubling
(or even tripling) the size of the data set. For the most difficult problem,
there seems to be little hope of gaining anything from more samples. Of course,
these observations should be taken cautiously, since extrapolating the
learning curves may prove delicate, even though the sample size used
was on the order of 200 cases.
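As an illustration only, the following Python sketch shows how a learning curve of the kind discussed above can be estimated by repeated subsampling of the training pool. It is not the procedure used in [22]: the simulated data, the nearest-centroid scoring rule and the function names (simulate, centroid_direction, learning_curve) are assumptions made for this example.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of correctly ordered (positive, negative) pairs; ties count 1/2."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    return sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))


def simulate(n, n_informative, n_noise, shift, rng):
    """Balanced two-class Gaussian data; only the first n_informative features differ."""
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * c if j < n_informative else 0.0, 1.0)
          for j in range(n_informative + n_noise)] for c in y]
    return X, y


def centroid_direction(X, y):
    """Difference between the class-1 and class-0 mean profiles."""
    m1 = [mean(col) for col in zip(*(x for x, c in zip(X, y) if c == 1))]
    m0 = [mean(col) for col in zip(*(x for x, c in zip(X, y) if c == 0))]
    return [a - b for a, b in zip(m1, m0)]


def learning_curve(X, y, X_test, y_test, sizes, repeats, rng):
    """Mean test-set AUC of classifiers trained on random balanced subsets of each size."""
    pos = [i for i, c in enumerate(y) if c == 1]
    neg = [i for i, c in enumerate(y) if c == 0]
    curve = {}
    for n in sizes:
        aucs = []
        for _ in range(repeats):
            idx = rng.sample(pos, n // 2) + rng.sample(neg, n // 2)
            w = centroid_direction([X[i] for i in idx], [y[i] for i in idx])
            scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X_test]
            aucs.append(auc(scores, y_test))
        curve[n] = mean(aucs)
    return curve


rng = random.Random(1)
X, y = simulate(180, 5, 95, 0.8, rng)            # hypothetical training pool
X_test, y_test = simulate(200, 5, 95, 0.8, rng)  # hypothetical independent test set
curve = learning_curve(X, y, X_test, y_test, [20, 36, 60, 90, 180], 10, rng)
for n, a in curve.items():
    print(f"n = {n:3d}  mean AUC = {a:.3f}")
```

Plotting the printed values against the training sample size gives the learning curve; a curve that is still rising at the largest available size suggests that more samples could help, with the caveat on extrapolation stated above.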
4.2 A note on model performance estimation
The fast uptake of the microarray technology in biological and clinical
studies put the existing data analysis capabilities
of various laboratories under pressure and led to a series of sub-optimal or even
erroneous analyses. For example, in an early critical review of the
gene expression-based prognostic and predictive models in oncology,
Dupuy and Simon [8] found that more than 50% of the studies
contained at least one of the three fundamental errors they considered: (i)
incorrect control for multiple testing in gene filtering; (ii) spurious
claims in class discovery studies (usually based on "visual" discovery
of classes in cluster analysis); and (iii) incorrect cross-validation procedures
resulting in optimistically biased performance estimation. A
decade later, the situation has improved dramatically, as much more experience
has been gained from the many failed trials.
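Error (iii) can be illustrated with a small simulation. The Python sketch below is not taken from [8]; the data are pure noise, and the ranking rule, the nearest-centroid scoring and all function names are assumptions chosen for brevity. It compares the cross-validated AUC obtained when the features are selected once on all samples (outside the cross-validation) with the one obtained when the selection is repeated inside each training fold.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of correctly ordered (positive, negative) pairs; ties count 1/2."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    return sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))


def top_features(X, y, rows, k):
    """Indices of the k features with the largest absolute class-mean difference on rows."""
    def gap(j):
        m1 = mean(X[i][j] for i in rows if y[i] == 1)
        m0 = mean(X[i][j] for i in rows if y[i] == 0)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_scores(X, y, train, test, feats):
    """Score test rows along the class-mean difference estimated on the training rows."""
    w = [mean(X[i][j] for i in train if y[i] == 1) -
         mean(X[i][j] for i in train if y[i] == 0) for j in feats]
    return [sum(wj * X[i][j] for wj, j in zip(w, feats)) for i in test]


def stratified_folds(y, k, rng):
    """Random split into k folds with (nearly) equal class proportions."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, c in enumerate(y) if c == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def cv_auc(X, y, k_feats, rng, select_inside, n_folds=5):
    """Mean per-fold AUC with feature selection inside or outside the CV loop."""
    rows = list(range(len(y)))
    leaked = top_features(X, y, rows, k_feats)  # sees the labels of every sample
    aucs = []
    for test in stratified_folds(y, n_folds, rng):
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        feats = top_features(X, y, train, k_feats) if select_inside else leaked
        scores = centroid_scores(X, y, train, test, feats)
        aucs.append(auc(scores, [y[i] for i in test]))
    return mean(aucs)


rng = random.Random(0)
n, p = 60, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]  # pure noise
y = [i % 2 for i in range(n)]                                     # labels unrelated to X
biased = cv_auc(X, y, 10, rng, select_inside=False)
honest = cv_auc(X, y, 10, rng, select_inside=True)
print(f"selection outside CV: AUC = {biased:.2f}")
print(f"selection inside CV:  AUC = {honest:.2f}")
```

Since the labels carry no information about the features, any estimate well above 0.5 is an artifact of the procedure; the difference between the two printed values is the optimistic bias introduced by selecting the features before splitting the data.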
The initial results of Dupuy and Simon [8] were among the reasons
for setting up the MAQC-II project by the US FDA. The main results
are reproduced in Section ??. Besides them, many other side projects
focused on more specific aspects. Here we present a different
perspective on the published results. Indeed, one key question
is whether the estimated performance (at the modeling stage) is reproduced
by the independent validations. The design of MAQC-II
allowed the investigation of these aspects on a large-scale collection
of predictive models. The setup of MAQC-II required that each
participating team (in total, there were 38 teams, most of them from
the US, but also a few from Europe and Asia) submit a data analysis
plan (written before having access to the data) to be applied to each
of the 13 predictive problems. Each team could choose to model
one, several or all of the problems (endpoints), but the modeling
procedure was required to be identical. In the end, more than 30,000
models were submitted for blind validation (some teams had chosen
to submit thousands of models, one model per combination of parameters)
and by comparing their observed performance with the initial
estimates, one can gain some insights into the stability of various analytical
approaches. While the full discussion and main results are
presented in Section 10 and [27], here we will briefly discuss some of
the results regarding the model performance estimation procedures.
At the time of the project, I was with the Swiss Institute of Bioinformatics
(SIB), which I represented in the project, hence "SIB" refers
to the results I obtained.
[Figure 4.3 diagram. Box labels: Quality Control; Normalization (MAS5); Feature selection/ranking; Classifier design; Model optimization; CV: performance estimation; Test on external data]
Figure 4.3: Overall design of the analytical pipeline for the MAQC-II
project, as put forward by the SIB team
Given the rather constraining nature of the exercise, we adopted
a very conservative approach, with well-tested procedures for feature
selection and classifier design. The drawback was clear: we might
not have profited from tuned-to-the-problem modeling strategies, but
the expected benefits were a more robust performance and a small bias
in its estimation. The overall design of the processing pipeline is given
in Figure 4.3.
A first observation is that the overall performance of the system
was evaluated by cross-validation (actually a repeated (10×) 5-fold
cross-validation), corresponding to the outer CV loop in Figure 4.3.
It was already mentioned elsewhere (Sections 3.2 and 3.3) that the
most commonly used quality control and normalization procedures
use batches of microarrays for parameter estimation. In order to
avoid repeatedly fitting these models inside the cross-validation, we
opted for procedures applicable to individual arrays (%PC for quality
control - Section 3.3, and MAS5 for normalization - Section 3.2), thus being
able to perform them only once, outside the cross-validation, without
violating the performance estimation assumptions (different data for
model building and model assessment).
Also, because the training set sample size was relatively small, the
feature selection method employed was based on single-variable assessment
(ratio of between- to within-group sum of squares - similar
to the Fisher criterion) and the optimal number of features was estimated
within an inner cross-validation loop (Figure 4.3). The same
constraints restricted the types of classifiers tested to those that experience
has shown to perform robustly on a large number of problems
(diagonal LDA, general LDA, logistic and penalized logistic regression,
and CART). Again, any meta-parameters of those methods
were optimized in the inner cross-validation loop. The details of SIB’s
data analysis plan were presented during the MAQC-II plenary meeting
at FDA headquarters in Washington DC (March 2008).
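The nested design described above can be sketched in a few lines of Python. This is a simplified illustration and not the SIB analysis plan itself: it uses a single repetition instead of the 10×5-fold scheme, diagonal LDA as the only classifier, simulated two-class data, and function names (bss_wss, fit_dlda, nested_cv_auc) that are assumptions of this example. For a feature j, the ranking criterion is BSS/WSS, where BSS is the sum over classes of n_k (m_kj - m_j)^2 and WSS is the sum of squared deviations of the samples from their class means m_kj.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the fraction of correctly ordered (positive, negative) pairs; ties count 1/2."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    return sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))


def class_stats(X, y, rows, j):
    """Class means, class sizes and within-class sum of squares of feature j."""
    g1 = [X[i][j] for i in rows if y[i] == 1]
    g0 = [X[i][j] for i in rows if y[i] == 0]
    m1, m0 = mean(g1), mean(g0)
    wss = sum((v - m1) ** 2 for v in g1) + sum((v - m0) ** 2 for v in g0)
    return m1, m0, len(g1), len(g0), wss


def bss_wss(X, y, rows, j):
    """Ratio of between-group to within-group sum of squares for feature j."""
    m1, m0, n1, n0, wss = class_stats(X, y, rows, j)
    m = (n1 * m1 + n0 * m0) / (n1 + n0)
    return (n1 * (m1 - m) ** 2 + n0 * (m0 - m) ** 2) / wss


def fit_dlda(X, y, rows, k):
    """Diagonal LDA on the k top-ranked features: (feature, weight, midpoint) triples."""
    ranked = sorted(range(len(X[0])), key=lambda j: bss_wss(X, y, rows, j), reverse=True)
    model = []
    for j in ranked[:k]:
        m1, m0, n1, n0, wss = class_stats(X, y, rows, j)
        model.append((j, (m1 - m0) * (n1 + n0 - 2) / wss, (m1 + m0) / 2))
    return model


def score(model, x):
    """Discriminant score of sample x; positive values favor class 1."""
    return sum(w * (x[j] - mid) for j, w, mid in model)


def folds(rows, y, k, rng):
    """Stratified random split of rows into k folds."""
    out = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i in rows if y[i] == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            out[pos % k].append(i)
    return out


def cv_auc(X, y, rows, choose_k, n_folds, rng):
    """Pooled CV AUC; choose_k(train) returns the number of features to use."""
    scores, labels = [], []
    for test in folds(rows, y, n_folds, rng):
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        model = fit_dlda(X, y, train, choose_k(train))
        scores += [score(model, X[i]) for i in test]
        labels += [y[i] for i in test]
    return auc(scores, labels)


def nested_cv_auc(X, y, grid, rng, outer=5, inner=3):
    """Outer loop estimates performance; inner loop (training rows only) picks k."""
    def tune(train):
        return max(grid, key=lambda k: cv_auc(X, y, train, lambda _: k, inner, rng))
    return cv_auc(X, y, list(range(len(y))), tune, outer, rng)


rng = random.Random(2)
y = [i % 2 for i in range(60)]  # hypothetical balanced labels
X = [[rng.gauss(1.0 * c if j < 5 else 0.0, 1.0) for j in range(200)] for c in y]
estimate = nested_cv_auc(X, y, grid=[2, 5, 20], rng=rng)
print(f"nested CV AUC = {estimate:.2f}")
```

The point of the design is that both the feature ranking and the choice of the number of features see only the training rows of the current outer fold, so the samples used for assessment never influence model building.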
Figure 4.4 shows the boxplots corresponding to the performance
estimates provided by each participating team. It is already
clear that a wide range of performances were expected to be observed
on the independent validation sets. More troubling, for the "positive
control" endpoints (H and L), which were supposed to be predicted
with a performance close to 1.0 (for AUC), some of the models seemed
to be far off-target.
Finally, when comparing the estimated/expected performance of
the models with the observed performance, the results were even
more worrying: in some cases, the AUC bias (in absolute value) was
around 0.5, which would be the difference between a perfect model
and a totally random one (see left panel in Figure 4.5). It was clear that
the performance estimation procedure of some teams was extremely
biased. This led to the selection of a set of rules that would guarantee,
in principle, an unbiased (or, more likely, low bias) estimate of
the performance (but not necessarily a good performance), recommendations
that are now part of the FDA’s guide for good practices
in biomarker discovery.
[Figure 4.4 plot: "AUC estimated by 10x5-CV"; x-axis: endpoint (A-M); y-axis: AUC]
Figure 4.4: Estimated performance of the SIB’s models by repeated
5-fold cross validation (red dots) and the estimated performance of
all other models submitted to MAQC-II, for the 13 endpoints (A-M).
An AUC above 0.6 was considered useful for the prediction of the
endpoint. The yellow endpoints (H and L) were later revealed to be
"positive controls" - problems easy to predict, while the orange endpoints
(I and M) were "negative controls" - randomly assigned labels.
The first three endpoints (A-C) were related to toxico-genomics and
were not modeled by SIB.
[Figure 4.5 plots: "CV - Validation AUC"; left panel x-axis: endpoint (A-M); right panel x-axis: organization; y-axis: deltaAUC]
Figure 4.5: Bias of the performance estimation procedures: estimated
AUC minus observed AUC. Left panel: estimation bias by endpoint
(red dots correspond to SIB), right panel: estimation bias by participating
team (cyan colored box corresponds to SIB).
5 Integration of pathology images: towards a
multimodal biomarker discovery
Modern investigation methods in biology and clinical research rely
more often than not on multiple sources of information. For example,
combining clinical observations, like patient survival or pathologic
response, with gene expression data is, nowadays, routinely used for
discovering new biomarkers or therapy targets. Similarly, combining
gene expression with copy number variation and/or
methylation data brings a new level of resolution when investigating
molecular changes at the cellular level. Each of these different modalities
provides another perspective on the same underlying biological
reality. The current proposal is concerned with the combination of
three modalities: histopathology imaging, gene expression and clinical
data.
Digital pathology is an active research field which employs methods
of image processing and analysis to assist the interpretation
and understanding of histopathology slide images. It has the potential
of proposing a more quantitative, thus less subjective, characterization
of the slides and of introducing new image descriptors,
which can be further mined for diagnostic and prognostic clues [10].
As an example of combining digital pathology and clinical information,
the recently proposed immune response score [15] relies on precise
counts of all T-cells (TH1) in whole slide images, a task that is
clearly too tedious for a human expert to perform for each sample
to be assessed. Initial tests show that the score has more prognostic
power than even the well-established TNM staging, providing an excellent
argument in favor of using digital pathology in clinical practice.
However, it relies on special staining for correctly labeling the
different types of T-cells.
The histopathology assessment of the samples can be combined
with the gene expression and clinical data in a joint model. For example,
the tumor grade (a histopathology categorical variable, usually
with three levels: "well differentiated", "moderately differentiated"
and "undifferentiated") can be combined with expression of ESR1
gene and a genomic proliferation score in order to build a prognostic
score for breast tumors (similar to [29]). In this approach the information
extracted from the histopathology modality is highly filtered
(the human expert extracts only several aspects from the pathology
slides, according to the current practice) and extremely summarized
(in the above example, only three values are possible), in contrast with
the gene expression, which preserves basically all its information.
While the process of filtering and summarization greatly improves
the signal-to-noise ratio and eases the interpretability of the data, it
does this at the expense of discarding some useful information and
limiting the descriptive vocabulary of the histopathology images.
In this context we set out to investigate different aspects of exploiting
and integrating whole slide imaging in the biomarker
discovery pipeline. Our approach, in contrast with many others, takes
a completely data-driven perspective, without benefitting from - nor
being biased by - a pathologist’s expert supervision. However, once the
models were built, pathologists were called upon to validate them.
The advantage is that the resulting models revealed new features,
some of them - this being the drawback - without a clear correspondence
in pathology practice. The full description is given in Sections 12,
16 and 18.
The computational approach taken was based on extensions to the
bag-of-visual-features method [7]. These extensions aimed at producing
more descriptive dictionaries for the histopathology images
and investigated the possibility of structuring the visual dictionaries
around some semantic terms, allowing an easier interpretation of
the results (see Figure 5.1).
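For readers unfamiliar with the basic bag-of-visual-features representation [7], a minimal Python sketch follows. It covers only the plain method (a k-means dictionary followed by a histogram of visual words) and none of the extensions discussed in this chapter; the two-dimensional synthetic "descriptors" stand in for real local image descriptors, and all function names are assumptions of this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(codebook, d):
    """Index of the visual word (codebook entry) closest to descriptor d."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], d))


def kmeans(descriptors, k, rng, iters=20):
    """Plain Lloyd's algorithm; the k centers form the visual dictionary."""
    centers = rng.sample(descriptors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        # An empty group keeps its previous center.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers


def encode(codebook, descriptors):
    """Represent one image as the normalized histogram of its visual words."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest(codebook, d)] += 1
    return [c / len(descriptors) for c in counts]


rng = random.Random(0)


def patches(center, n):
    """Hypothetical local descriptors scattered around a pattern-specific center."""
    return [[rng.gauss(c, 0.5) for c in center] for _ in range(n)]


# Dictionary learned from descriptors pooled over two synthetic tissue patterns.
codebook = kmeans(patches((0.0, 0.0), 100) + patches((5.0, 5.0), 100), 2, rng)
hist_a = encode(codebook, patches((0.0, 0.0), 50))  # image dominated by pattern A
hist_b = encode(codebook, patches((5.0, 5.0), 50))  # image dominated by pattern B
print(hist_a, hist_b)
```

Each image is thus reduced to a fixed-length vector of word frequencies, which can then be fed to any classifier or correlated with molecular data; the choice of descriptors and of the dictionary size is where most of the design effort goes.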
Another computational aspect addressed was the optimization of
the visual features for the purpose of analyzing pathology images.
In-depth analyses were performed using both "classical" features
(from Gabor wavelets to local binary patterns; see [5] for detailed
results) and convolutional neural network features (as used in [20]).
Also, in [4] we proposed a hierarchical quantification schema for building
multiresolution visual dictionaries. Anecdotally, this approach
led to features that were more appealing to an expert pathologist than
the convolutional features, but the overall performance of the system
was lower.
[Figure 5.1 image. Visual words C.41, C.69, C.67, C.17, C.50, C.64, C.56 and C.65, annotated as: proliferation patterns; invasion/differentiation patterns/connective tissue; sparse tumor nuclei/differentiation/fat]
Figure 5.1: The structure of a visual dictionary for breast cancer: three
main clusters of features can be observed, each related to a different
architectural pattern with clear interpretation [18].
The main message of all these investigations is that not only is it
possible to combine whole-slide imaging with molecular/gene expression
data, but this combination reveals new connections between
the "genetic program" and the tissue architecture. The results were
obtained from breast and colon cancer data but the techniques employed
are easily applicable to other pathologies as well. The most
important results were:
• Construction of a joint imaging and genomic prognostic score
in breast cancer.
In the case of breast cancer, proliferation of cancer cells is a
strong prognostic marker (in addition to, and independent of,
ER and Her2 statuses), and well known to pathologists.
Hence it was reassuring to see that a part of the image-based
features that were correlated with the outcome reproduced
this result. In Figure 5.1 the structure of the visual dictionary
is shown along with semantic annotations. The prognostic
score built solely on image features was almost as sensitive
as its gene expression counterpart. However, a combined
image-expression score performed even better [18].
• Recognition of "BRAF-positive" high-risk patients with colorectal
cancer.
In [21] we described a novel gene expression signature identifying
a high risk group of patients (positive by the signature,
hence called "BRAF-positive"). This group assembled both patients
with BRAF V600E mutation (a known risk marker) and
other patients not harboring this mutation but subject to the
same dismal outcome. In [17] and later in [24] we describe a
system for identifying most of these patients based solely on
histopathology images. This result shows that typical tumor
architecture patterns (including the papillary/serrated phenotype)
can be linked to this high risk group allowing its recognition
even in the absence of molecular profiling.
• First steps towards a computational imaging characterization
of inter-tumoral heterogeneity.
The fact that solid tumors are heterogeneous is well known and
the recent advances in molecular profiling confirmed and expanded
the characterization of tumor subtyping. In colorectal
cancer, we have proposed a molecular taxonomy based on 5
subtypes, with the observation that not all tumors could be assigned
with high confidence to these subtypes (there are probably
some lower-prevalence subgroups that were not sufficiently
represented in our data to be characterized) [6]. Using deep
learning image features and a hierarchy of support vector machines
we were able to construct a decision system capable of
predicting the molecular subtypes with high confidence (for
four out of the five subtypes). To the best of our knowledge,
this is the first image-based predictor of molecular subtypes
for any tumor type. The implications of our result go beyond
the prediction aspect. Indeed, we noticed that some parts of
a tumor correspond to one subtype while other parts may
correspond to different subtypes. The decision mostly reflects
the dominant subtype, but this observation clearly indicates the
sensitivity of the results obtained from molecular profiling to
the tumor sampling strategy, and it affects the large majority of
the results published so far in the field.
6 Concluding remarks
The previous chapters tried to present the context in which the articles
reproduced in the second part were written, and also to bring
to the attention of the reader some results that were left aside from
the publications. It is clear - also from browsing the articles - that the
methods evolved along with the technology and the type of problems
one is facing in biomarker discovery. The latest directions of research,
concerning the joint analysis of histopathology images and
gene expression (or other molecular data), clearly show that having
the right data and a modern computational infrastructure opens
novel ways of exploring an increasingly complex biological reality.
Bibliography
[1] Janine Antonov, Vlad Popovici, Mauro Delorenzi, Pratyaksha
Wirapati, Anna Baltzer, Andrea Oberli, Beat Thurlimann, Anita
Giobbie-Hurder, Giuseppe Viale, Hans Altermatt, Stefan Aebi,
and Rolf Jaggi. Molecular risk assessment of BIG 1-98 participants
by expression profiling using RNA from archival tissue.
BMC Cancer, 10(1):37, 2010.
[2] Reija Autio, Sami Kilpinen, Matti Saarela, Olli Kallioniemi,
Sampsa Hautaniemi, and Jaakko Astola. Comparison of
Affymetrix data normalization methods using 6,926 experiments
across five array generations. BMC Bioinformatics, 10
Suppl 1:S24–S24, December 2008.
[3] B M Bolstad, R A Irizarry, M Astrand, and T P Speed. A comparison
of normalization methods for high density oligonucleotide
array data based on variance and bias. Bioinformatics, 19(2):185–
193, 2003.
[4] Eva Budinská, Fred Bosman, and Vlad Popovici. Experiments
in molecular subtype recognition based on histopathology images.
In International Symposium on Biomedical Imaging, pages
1168–1172, Prague, CZ, June 2016. IEEE.
[5] Eva Budinská, Lenka Čápková, Daniel Schwarz, Ladislav Dušek,
Rolf Jaggi, Josef Feit, and Vlad Popovici. Gene expression-guided
selection of histopathology image features. In IEEE
15th International Conference on Bioinformatics and Bioengineering,
pages 1–6, Belgrade, 2015. IEEE.
[6] Eva Budinská, Vlad Popovici, Mauro Delorenzi, Sabine Tejpar,
Giovanni D’Ario, Nicolas Lapique, Katarzyna Otylia Sikora, Antonio
Fabio Di Narzo, Pu Yan, John Graeme Hodgson, Scott
Weinrich, Fred Bosman, and Arnaud Roth. Gene expression patterns
unveil a new level of molecular heterogeneity in colorectal
cancer. Journal of Pathology, 231(1):63–76, July 2013.
[7] G Csurka, C Dance, and L Fan. Visual categorization with bags
of keypoints. In ECCV International Workshop on Statistical Learning
in Computer Vision, 2004.
[8] A Dupuy and R M Simon. Critical Review of Published Microarray
Studies for Cancer Outcome and Guidelines on Statistical
Analysis and Reporting. JNCI Journal of the National Cancer Institute,
99(2):147–157, January 2007.
[9] Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan Domany.
Outcome signature genes in breast cancer: is there a
unique set? Bioinformatics, 21(2):171–178, January 2005.
[10] Metin N Gurcan, Laura E Boucheron, Ali Can, Anant Madabhushi,
Nasir M Rajpoot, and B Yener. Histopathological image
analysis: a review. IEEE Reviews in Biomedical Engineering, 2:147–
171, 2009.
[11] R A Irizarry. Summaries of Affymetrix GeneChip probe level
data. Nucleic Acids Research, 31(4):15e–15, February 2003.
[12] Rafael A Irizarry, Bridget Hobbs, Francois Collin, Yasmin D
Beazer-Barclay, Kristen J Antonellis, Uwe Scherf, and Terence P
Speed. Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics, 4(2):249–
264, April 2003.
[13] Jeanette N McClintick and Howard J Edenberg. Effects of filtering
by Present call on analysis of microarray experiments. BMC
Bioinformatics, 7:49, January 2006.
[14] J V Melo, D E Gordon, NCP Cross, and J M Goldman. The Abl-Bcr
Fusion Gene Is Expressed in Chronic Myeloid-Leukemia.
Blood, 81(1):158–165, 1993.
[15] B Mlecnik, M Tosolini, A Kirilovsky, A Berger, G Bindea,
T Meatchi, P Bruneval, Z Trajanoski, W H Fridman, F Pages, and
J Galon. Histopathologic-Based Prognostic Factors of Colorectal
Cancers Are Associated With the State of the Local Immune Reaction.
Journal of Clinical Oncology, 29(6):610–618, February 2011.
[16] Andrea Oberli, Vlad Popovici, Mauro Delorenzi, Anna Baltzer,
Janine Antonov, Sybille Matthey, Stefan Aebi, Hans Jörg Altermatt,
and Rolf Jaggi. Expression profiling with RNA from
formalin-fixed, paraffin-embedded material. BMC Medical Genomics,
1(1):9, April 2008.
[17] Vlad Popovici. Towards the identification of tissue-based proxy
biomarkers. In AMIA Joint Summits on Translational Science proceedings,
pages 75–83, San Francisco, US, 2016.
[18] Vlad Popovici, Eva Budinská, Lenka Čápková, Daniel Schwarz,
Ladislav Dušek, Josef Feit, and Rolf Jaggi. Joint analysis of
histopathology image features and gene expression in breast
cancer. BMC Bioinformatics, 17(1):209, 2016.
[19] Vlad Popovici, Eva Budinská, and Mauro Delorenzi. Rgtsp - a
generalized top scoring pairs package for class prediction. Bioinformatics,
27(12):1729–1730, 2011.
[20] Vlad Popovici, Eva Budinská, Ladislav Dušek, Michal Kozubek,
and Fred Bosman. Image-based surrogate biomarkers for molecular
subtypes of colorectal cancer. Bioinformatics, January 2017.
[21] Vlad Popovici, Eva Budinská, Sabine Tejpar, Scott Weinrich,
Heather Estrella, Graeme Hodgson, Eric Van Cutsem, Tao Xie,
Fred T Bosman, Arnaud D Roth, and Mauro Delorenzi. Identification
of a poor-prognosis BRAF-mutant-like population of patients
with colon cancer. Journal of Clinical Oncology, 30(12):1288–
1295, April 2012.
[22] Vlad Popovici, Weijie Chen, Brandon G Gallas, Christos Hatzis,
Weiwei Shi, Frank W Samuelson, Yuri Nikolsky, Marina
Tsyganova, Alex Ishkin, Tatiana Nikolskaya, Kenneth R Hess, Vicente
Valero, Daniel Booser, Mauro Delorenzi, Gabriel N Hortobagyi,
Leming Shi, W Fraser Symmans, and Lajos Pusztai. Effect
of training-sample size and classification difficulty on the accuracy
of genomic predictors. Breast Cancer Research, 12(1):R5–R5,
2010.
[23] Vlad Popovici, Darlene R Goldstein, Janine Antonov, Rolf Jaggi,
Mauro Delorenzi, and Pratyaksha Wirapati. Selecting control
genes for RT-QPCR using public microarray data. BMC Bioinformatics,
10:42–42, December 2008.
[24] Vlad Popovici, Aleš Křenek, and Eva Budinská. Identification of
“BRAF-Positive” Cases Based on Whole-Slide Image Analysis.
BioMed Research International, 2017(24):1–7, 2017.
[25] Margaret M Ryan, Stephen J Huffaker, Maree J Webster, Matt
Wayland, Tom Freeman, and Sabine Bahn. Application and optimization
of microarray technologies for human postmortem
brain studies. Biological Psychiatry, 55(4):329–336, February 2004.
[26] D G Savage and K H Antman. Imatinib mesylate - A new oral
targeted therapy. New England Journal of Medicine, 346(9):683–
693, 2002.
[27] Leming Shi, Gregory Campbell, Wendell D Jones, Fabien Campagne,
Zhining Wen, Stephen J Walker, Zhenqiang Su, Tzu-Ming
Chu, Federico M Goodsaid, Lajos Pusztai, John D Shaughnessy,
André Oberthuer, Russell S Thomas, Richard S Paules,
Mark Fielden, Bart Barlogie, Weijie Chen, Pan Du, Matthias Fischer,
Cesare Furlanello, Brandon D Gallas, Xijin Ge, Dalila B
Megherbi, W Fraser Symmans, May D Wang, John Zhang,
Hans Bitter, Benedikt Brors, Pierre R Bushel, Max Bylesjo, Minjun
Chen, Jie Cheng, Jing Cheng, Jeff Chou, Timothy S Davison,
Mauro Delorenzi, Youping Deng, Viswanath Devanarayan,
David J Dix, Joaquin Dopazo, Kevin C Dorff, Fathi Elloumi, Jianqing
Fan, Shicai Fan, Xiaohui Fan, Hong Fang, Nina Gonzaludo,
Kenneth R Hess, Huixiao Hong, Jun Huan, Rafael A Irizarry,
Richard Judson, Dilafruz Juraeva, Samir Lababidi, Christophe G
Lambert, Li Li, Yanen Li, Zhen Li, Simon M Lin, Guozhen Liu,
Edward K Lobenhofer, Jun Luo, Wen Luo, Matthew N McCall,
Yuri Nikolsky, Gene A Pennello, Roger G Perkins, Reena Philip,
Vlad Popovici, Nathan D Price, Feng Qian, Andreas Scherer,
Tieliu Shi, Weiwei Shi, Jaeyun Sung, Danielle Thierry-Mieg, Jean
Thierry-Mieg, Venkata Thodima, Johan Trygg, Lakshmi Vishnuvajjala,
Sue Jane Wang, Jianping Wu, Yichao Wu, Qian Xie,
Waleed A Yousef, Liang Zhang, Xuegong Zhang, Sheng Zhong,
Yiming Zhou, Sheng Zhu, Dhivya Arasappan, Wenjun Bao,
Anne Bergstrom Lucas, Frank Berthold, Richard J Brennan, Andreas
Buness, Jennifer G Catalano, Chang Chang, Rong Chen,
Yiyu Cheng, Jian Cui, Wendy Czika, Francesca Demichelis, Xutao
Deng, Damir Dosymbekov, Roland Eils, Yang Feng, Jennifer
Fostel, Stephanie Fulmer-Smentek, James C Fuscoe, Laurent
Gatto, Weigong Ge, Darlene R Goldstein, Li Guo, Donald N Halbert,
Jing Han, Stephen C Harris, Christos Hatzis, Damir Herman,
Jianping Huang, Roderick V Jensen, Rui Jiang, Charles D
Johnson, Giuseppe Jurman, Yvonne Kahlert, Sadik A Khuder,
Matthias Kohl, Jianying Li, Menglong Li, Quan-Zhen Li, Shao Li,
Zhiguang Li, Jie Liu, Ying Liu, Zhichao Liu, Lu Meng, Manuel
Madera, Francisco Martinez-Murillo, Ignacio Medina, Joseph
Meehan, Kelci Miclaus, Richard A Moffitt, David Montaner, Piali
Mukherjee, George J Mulligan, Padraic Neville, Tatiana Nikolskaya,
Baitang Ning, Grier P Page, Joel Parker, R Mitchell Parry,
Xuejun Peng, Ron L Peterson, John H Phan, Brian Quanz, Yi Ren,
Samantha Riccadonna, Alan H Roter, Frank W Samuelson, Martin
M Schumacher, Joseph D Shambaugh, Qiang Shi, Richard
Shippy, Shengzhu Si, Aaron Smalter, Christos Sotiriou, Mat
Soukup, Frank Staedtler, Guido Steiner, Todd H Stokes, Qinglan
Sun, Pei-Yi Tan, Rong Tang, Zivana Tezak, Brett Thorn, Marina
Tsyganova, Yaron Turpaz, Silvia C Vega, Roberto Visintainer,
Juergen von Frese, Charles Wang, Eric Wang, Junwei Wang, Wei
Wang, Frank Westermann, James C Willey, Matthew Woods,
Shujian Wu, Nianqing Xiao, Joshua Xu, Lei Xu, Lun Yang, Xiao
Zeng, Jialu Zhang, Li Zhang, Min Zhang, Chen Zhao, Raj K Puri,
Uwe Scherf, Weida Tong, and Russell D Wolfinger. The MicroArray
Quality Control (MAQC)-II study of common practices for
the development and validation of microarray-based predictive
models. Nature biotechnology, 28(8):827–838, July 2010.
[28] James P Stewart, Susan Richman, Tim Maughan, Mark Lawler,
Philip D Dunne, and Manuel Salto-Tellez. Standardising RNA
profiling based biomarker application in cancer—The need for
robust control of technical variables. Biochimica et Biophysica Acta
- Reviews on Cancer, 1868(1):258–272, August 2017.
[29] Carina Strand, Cecilia Ahlin, Pär-Ola Bendahl, Marie-Louise
Fjällskog, Ingrid Hedenfalk, Per Malmström, and Mårten Fernö.
Combination of the proliferation marker cyclin A, histological
grade, and estrogen receptor status in a new variable with high
prognostic impact in breast cancer. Breast Cancer Research and
Treatment, 131(1):33–40, 2012.
[30] Sun Tian, Paul Roepman, Vlad Popovici, Magali Michaut, Ian
Majewski, Ramon Salazar, Cristina Santos, Robert Rosenberg,
Ulrich Nitsche, Wilma E Mesker, Sjoerd Bruin, Sabine Tejpar,
Mauro Delorenzi, Rene Bernards, and Iris Simon. A robust genomic
signature for the detection of colorectal cancer patients
with microsatellite instability phenotype and high mutation frequency.
Journal of Pathology, 228(4):586–595, October 2012.
[31] V N Vapnik and A YA Chervonenkis. On the uniform convergence
of relative frequencies of events to their probabilities. Theory
of probability and its applications, 16(2):264–280, 1971.
PART II
SELECTED ARTICLES
This second part is dedicated to reproducing a number of articles
published over the years in various international journals, dealing
with different theoretical and practical aspects of mining, designing,
evaluating and validating a number of biomarkers and gene expression
signatures. For each article, its current (as of August 2017) bibliometrics
information is provided. Most of the articles are accompanied
by supplemental materials freely available online at the respective
journals' web pages.
7 Selecting control genes for RT-QPCR using
public microarray data
• BMC Bioinformatics 10(42), 2009
• IF: 2.448
• number of citations: 30
• personal contribution (60%): method design, data processing,
experiment implementation (with R code) and statistical analyses,
manuscript writing
BMC Bioinformatics
Open Access
Methodology article
Selecting control genes for RT-QPCR using public microarray data
Vlad Popovici*1, Darlene R Goldstein1,2, Janine Antonov3, Rolf Jaggi3,
Mauro Delorenzi1 and Pratyaksha Wirapati1
Address: 1Bioinformatics Core Facility, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland, 2Institut de mathématiques (IMA),
Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland and 3Department of Clinical Research, University of Bern,
CH-3010 Bern, Switzerland
Email: Vlad Popovici* - vlad.popovici@isb-sib.ch; Darlene R Goldstein - darlene.goldstein@epfl.ch;
Janine Antonov - janine.antonov@dkf.unibe.ch; Rolf Jaggi - rolf.jaggi@dkf.unibe.ch; Mauro Delorenzi - mauro.delorenzi@isb-sib.ch;
Pratyaksha Wirapati - pratyaksha.wirapati@isb-sib.ch
* Corresponding author
Abstract
Background: Gene expression analysis has emerged as a major biological research area, with real-time
quantitative reverse transcription PCR (RT-QPCR) being one of the most accurate and widely
used techniques for expression profiling of selected genes. In order to obtain results that are
comparable across assays, a stable normalization strategy is required. In general, the normalization
of PCR measurements between different samples uses one to several control genes (e.g.
housekeeping genes), from which a baseline reference level is constructed. Thus, the choice of the
control genes is of utmost importance, yet there is not a generally accepted standard technique for
screening a large number of candidates and identifying the best ones.
Results: We propose a novel approach for scoring and ranking candidate genes for their suitability
as control genes. Our approach relies on publicly available microarray data and allows the
combination of multiple data sets originating from different platforms and/or representing different
pathologies. The use of microarray data allows the screening of tens of thousands of genes,
producing very comprehensive lists of candidates. We also provide two lists of candidate control
genes: one which is breast cancer-specific and one with more general applicability. Two genes from
the breast cancer list which had not been previously used as control genes are identified and
validated by RT-QPCR. Open source R functions are available at
http://www.isrec.isb-sib.ch/~vpopovic/research/
Conclusion: We proposed a new method for identifying candidate control genes for RT-QPCR
which was able to rank thousands of genes according to some predefined suitability criteria and we
applied it to the case of breast cancer. We also empirically showed that translating the results from
microarray to PCR platform was achievable.
Published: 2 February 2009
BMC Bioinformatics 2009, 10:42 doi:10.1186/1471-2105-10-42
Received: 11 August 2008
Accepted: 2 February 2009
This article is available from: http://www.biomedcentral.com/1471-2105/10/42
© 2009 Popovici et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Real-time quantitative reverse transcription PCR (RT-QPCR)
has become a method of choice for gene expression
profiling in a large number of applications. However,
obtaining reliable measurements still depends on the
choice of control genes on which the baseline level is
constructed. Selecting the control genes remains a critical
point in the normalization process. Often, a short list of
candidates is produced based on non-systematic and/or
poorly defined biological considerations.
In early studies, normalization was usually based on a single
control gene. More recently, the trend is to use several
control genes whose average expression level (on a log-scale)
is used as baseline [1,2]. Suitable control genes are
selected from a short list of 10–15 genes by ranking them
according to a criterion that essentially selects those genes
having low variation across samples. We describe briefly a
few such methods below.
[2] introduces a stability coefficient which is used along
with the coefficient of variation for ranking the genes
from a predefined list of candidates. Gene stability is
defined in terms of average standard deviation of the log-ratios
of pairs of candidate genes. Genes are ranked by
iteratively removing those most unstable. This approach
has the drawback that repeated comparison of pairs of
genes is required, which is feasible only when the number
of candidates is small. In addition, the method implicitly
assumes that there are no co-regulated genes. A model-based
approach proposed by [1] aims at estimating the
overall variation as well as the between sample variation
of each candidate gene. However, with this approach it is
cumbersome to integrate different platforms. In an application
to plant pathogen profiling, [3] investigates a list of
18 pre-selected candidate housekeeping genes, using the
method proposed in [2] and RT-QPCR for measuring the
gene expressions. [4] proposes a PCA-based statistical
analysis to identify the most suitable control genes among
13 candidates which were selected such that they had
independent functions in cellular maintenance.
[5] introduces a strategy which combines the coefficient of
variation, maximum fold change and mean expression
value in a ranking criterion that is applied to a large
number of samples representing a wide variety of tissues.
All these samples were hybridized on either Affymetrix
HG-U133A or HG-U133 Plus 2.0 arrays and quantile-normalized
together prior to ranking. Only probesets common
to both arrays were used, with probesets targeting the
same gene averaged into a single value.
There are some important differences between the methods
described above and our approach (described below).
Firstly, in contrast with all the studies based on PCR, we
do not require a short list of candidate genes to be produced
before assessing their suitability as control genes.
Instead, we screen all the genes represented on the microarray
chips, giving us the opportunity to assess genes that
have not been reported previously. Moreover, we take a
meta-analytical approach to the problem, first creating an
independent ranking within each data set then aggregating
these rankings into a single list. This approach has the
advantage of being platform- and normalization-independent.
In addition, the approach is not limited to using
only genes common between different data sets. Also, by
not using the coefficient of variation, we can treat single-
and two-color arrays uniformly. Thus, we are
able to exploit data obtained from different platforms
without requiring them to be normalized together. Furthermore,
the meta-analytical approach allows us to integrate
gene lists produced using our ranking system with
other ranked gene lists from the literature. Another key
difference is that we introduce a new stability coefficient
that combines the mean expression and the standard deviation
in a ranking criterion that corresponds to our
requirements for candidate control genes for RT-QPCR. In
general, these requirements are:
• low variability across different specimens (e.g., subtypes
of tumors or normal tissues);
• high or moderate level of expression, such that control
genes with expression levels across a larger range may be
selected;
• consistency across experiments and platforms.
A key question is whether it is possible to select genes
from microarray studies that perform well as control genes on
a PCR platform, given that the two technologies are different.
We hypothesize that translating the list of candidate
genes from the microarray to the PCR platform is feasible and we
provide empirical evidence supporting this hypothesis.
Results
Data sets and pre-processing steps
We have collected ten publicly available data sets [6-15],
listed in Table 1, from which we derived the quantities of
interest: the mean and standard deviations of the log-intensities
(on Affymetrix platforms) or of the log-ratios
(on Agilent platforms).
Table 1: The ten public microarray data sets used (n = number of samples).
Data set ID and reference    n       Platform
BWH [6]                      47      Affymetrix U133v2
EMC [7]                      286     Affymetrix U133A
EXPO [8]                     1375    Affymetrix U133Plus2
JRH2 [9]                     61      Affymetrix U133A
MGH [10]                     60      Agilent
NKI [11]                     337     Agilent (custom)
STOCK [12]                   159     Affymetrix U133A, B
TGIF1 [13]                   49      Affymetrix U133A
UNC [14]                     153     Agilent HuA1
UPP [15]                     249     Affymetrix U133A, B
We note here that the original EXPO data set contains a
number of different pathologies, but we restrict analysis
here to eight different types of cancer (breast, colon,
endometrium, kidney, lung, ovary, prostate and uterus)
for which a sufficient number of samples existed. EXPO
breast cancer samples (n = 328) were used to produce
both the breast cancer and general cancer lists of candidate
genes.
The Affymetrix data are available as MAS5.0 normalized
values. The Agilent data contain log-ratios (base 10) and
mean-centered log-intensities. The standard deviations of
log-intensities (Affymetrix) and log-ratios (Agilent) were
used as measures of variability. The means of log-intensities
(both Affymetrix and Agilent) were used as measures
of average expression level.
When multiple probesets of the same gene are present,
only the most variable one is used. We consider all genes
from each platform, the aggregation methods used being
able to cope with 'missing' genes (those not represented
on the array). Considering only those genes common to
all platforms is an unnecessarily limiting constraint, as
increasing the number of data sets and the heterogeneity
of the collection leads to a successively smaller intersection
of genes.
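The per-gene summaries described above (mean and standard deviation of the log-expression, keeping only the most variable probeset of each gene) can be illustrated with a minimal Python sketch. The function name, argument names and data layout are hypothetical and are not taken from the authors' R code.

```python
from statistics import mean, stdev


def summarize_genes(probe_values, probe_to_gene):
    """Return {gene: (mean, sd)}, keeping the most variable probeset per gene.

    probe_values:  {probeset_id: [log-expression value per sample]}
    probe_to_gene: {probeset_id: gene symbol}; unmapped probesets are skipped.
    Both layouts are assumptions made for this illustration.
    """
    best = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None or len(values) < 2:
            continue  # no annotation, or too few samples for a standard deviation
        m, s = mean(values), stdev(values)
        # Keep the probeset with the largest standard deviation (worst case).
        if gene not in best or s > best[gene][1]:
            best[gene] = (m, s)
    return best
```

Genes that are absent from a platform simply do not appear in the returned dictionary, which is how 'missing' genes would be carried forward to the aggregation step.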
Before any further usage of the data, we reduce the variability
across platforms by scaling with a factor given by a
first order LOESS fit of the data. The effect of this transformation
can be seen in Figure 1, where the black line represents
the fitted curve. This simple approach seems
effective, except for genes with low expression. However,
as we are interested in genes with higher mean expression,
this deficiency is not problematic.
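As an illustration only, a first-order LOESS fit can be written as a tricube-weighted local linear regression. The text does not state which quantity is scaled; the sketch below assumes that each gene's standard deviation is divided by the fitted trend of standard deviation versus mean expression. The function names, the `span` parameter and this reading of the scaling step are all assumptions.

```python
def local_linear_fit(x, y, span=0.5):
    """First-order LOESS: tricube-weighted local linear regression of y on x."""
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th neighbour
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def scale_sd_by_trend(means, sds, span=0.5):
    """Divide each standard deviation by the fitted trend (assumed scaling)."""
    trend = local_linear_fit(means, sds, span)
    return [s / t if t > 0 else float("nan") for s, t in zip(sds, trend)]
```

After such a scaling the standard deviations are expressed relative to the typical variability at the same expression level, which is one way of making data sets from different platforms more comparable.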
Ranking the genes
Let us consider that we have $M$ microarray data sets, each
containing expression values of a set of genes $G_k$, $k = 1,\dots,M$,
and let $G = \bigcup_k G_k = \{1,\dots,N\}$ be the set of all genes
represented at least once in any of these data sets.
Figure 1: Example of variance stabilization by LOESS correction. LOESS correction applied to three data sets: BWH, NKI and
EXPO-breast, respectively. The first row shows the original data with the fitted first order LOESS curve, while the second row
shows the variance-stabilized data.
Gene scores
We aim to design a scoring function which ranks the genes
such that higher scores correspond to genes that are more
suitable to be used as control genes. As mentioned above,
the score has to combine each gene's mean expression and
standard deviation into a single value such that higher
expression levels and lower variances (standard deviations)
are favored. Moreover, the score must be independent
of the technology used to measure expression levels
and the method for normalization.
These requirements lead us to propose a new stability score
for the gene expression levels. This score for gene $i$ in data
set $k$, denoted $s_{ik}$, is defined as
\[
s_{ik} = \alpha \log_2\left(\max\{\hat{m}_{ik} - \beta_k, 0\}\right) - \hat{\sigma}_{ik}, \tag{1}
\]
where $\hat{m}_{ik}$ and $\hat{\sigma}_{ik}$ are the estimated mean log-intensity
and the standard deviation of the gene $i$ in data set $k$. The
coefficient $\alpha$ allows the user to control the trade-off
between the mean expression and the standard deviation
in gene scoring. Results reported here were obtained with
$\alpha = 0.25$. The $\beta_k$ parameter allows one to define the level of
mean expression below which the genes are not considered
for ranking, i.e. the score for these genes is $-\infty$. We
have set $\beta_k$ to be the 25th percentile of the mean expression,
for each data set $k$. Genes having a higher score are
considered more suitable as control genes. As we see from
Eq. 1, high variation in gene expression leads to a lower
score when mean expression levels are equal. This is one
reason we select the most highly variable probeset from
the probesets representing the same gene, in order to
encompass the worst-case scenario. Note also that there is
no need to normalize the scores to make them comparable
across data sets, because they are used solely for ranking
the genes within the same data set. Finally, having
computed the scores for all the genes within a data set, we
order the genes from high to low values of the scores, with
ties resolved by ordering by the mean expression (from
high to low). From this perspective, the scores can be seen
as defining classes of equivalence among genes: all the
genes in the same class (having the same score) are
equally useful as normalization genes. By using the second
ordering criterion, we can select control genes with a
desired expression level (examples of classes of equivalence
are the equal score levels in Figure 2).
Figure 2 displays the influence of the mean expression
level and the standard deviation on the gene score. All
genes located on the curves have the same score value
(they belong to the same equivalence class). Two consecutive
curves are separated by one score unit.
Using this stability score, we ranked the genes from each
data set, obtaining the lists that will be later combined. An
excerpt from the ten lists for the breast cancer data sets is
shown in Table 2 (first ten columns).
Figure 2: Scatter plots of standard deviation versus mean log-intensity for BWH, NKI and EXPO-breast data sets,
respectively. The shading codes the gene stability scores, with darker colors indicating higher scores. These three data sets
are from different microarray platforms. The light gray points indicate the discarded genes (those with mean expression level
below the value $\beta_k$ – see Eq. 1). The curves correspond to equal score levels and are one score unit apart.
Combining results from different data sets
Once genes are ranked according to their scores in each
data set (lower ranks correspond to higher scores), the
natural next step is to combine these rankings into a global
ranked list. We combine the ranks of the genes rather
than their scores to avoid normalizing the scores across
different data sets, thereby achieving platform-independence.
To this end we use the rank product score [16], which
is a fast and efficient method for combining ranked lists.
It computes, for each gene $i \in G$, a new score
\[
R_i = \left( \prod_k \mathrm{rank}_k(i) \right)^{1/n_i}, \tag{2}
\]
where $\mathrm{rank}_k(i)$ is the rank of $s_{ik}$, the score for gene $i$ in data
set $k$ (topmost gene has the rank 1), and $n_i$ is the number
of data sets in which the gene $i$ appears. The final list is
obtained by sorting the genes in increasing order of $R_i$. The
top 20 genes from the aggregated breast cancer list are
given in the 'Meta' (last) column of Table 2.
Table 2: Top 20 control genes from the ten breast cancer data sets and top 20 genes from the aggregated list (Meta column)
BWH EMC JRH2 MGH NKI STOCK TGIF1 UNC UPP EXPO-breast Meta
RPL37A PPIA RPL41 ZNF557 UBC RPS11 RPL41 RPS10 RPL9 CALM2 RPL37A
RPL41 CALM2 RPL39 CDR1 UBB RPS24 RPL37A RPS18 RPL37A HNRPA1 RPL27A
RPS18 SRP14 RPL23A PPP1R2 OAZ1 RPL9 EEF1A1 RPLP1 ACTG1 NACA RPS18
RPL39 RPL37A RPL37A TCN2 DYNLL1 RPL37A RPL30 RPS11 RPL27A UBA52 RPL30
RPL23A RPS18 EEF1A1 SSBP1 RAPSN RPL41 RPL39 RPS23 CFL1 LAPTM4A RPL41
RPL9 RPL30 RPS23 RPL27A PCBP1 RPL27A PPIA RPL37A RPS11 RPL27A CALM2
RPLP1 RPL27A RPS27 RPS3 KCNH3 RPL39 ACTG1 RPL11 RPS13 RPL30 RPL27
RPS27 RPS11 CALM2 BRCC3 RPL3 RPLP1 CFL1 RPS15 RPL27 RPL9 K-ALPHA-1
RPL27A RPL39 RPS18 PTMA RPL8 UBB RPS23 RPL14 RPL41 RPL31 RPS11
RPL30 RPS15 ACTG1 ABCF2 MYL6 RPS15A RPL10 NACA RPS18 RPL37 RPL39
RPS29 RPS24 RPL10 PCDH18 RPL14 CALM2 CALM2 RPL36AL RPS15 RPS11 RPS13
ACTG1 RPL32 RPS24 LAX1 RPL7A NACA RPS11 UBA52 RPL6 RPS29 NACA
CALM2 RPS15A RPS15A TPMT FAU RPL30 HNRPA1 NEDD8 RPLP1 RPS24 RPL23A
RPS13 RPLP1 RPL32 GALE ARF1 CFL1 RPL6 PCBP1 RPL32 RPS13 RPS24
HNRPA1 RPL9 RPL27A MTCH1 CCT3 RPS13 RPL23A NDUFB2 RPL31 RPS21 HNRPA1
RPS24 UBB UBB ATP5G2 PSAP RPS3A K-ALPHA-1 HNRPM RPL39 UBB RPL9
RPL31 K-ALPHA-1 RPS29 SF3B2 CD81 RPL37 RPS18 HNRPC UBB RPS27A RPLP1
RPL34 RPS13 RPL30 SND1 SQSTM1 RPS18 EEF1G NDUFB8 RPS24 RPS15 RPL32
RPS15A RPL27 PPIA RPL5 K-ALPHA-1 RPL27 TUBA6 ATP5J2 RPS27 RPL32 LAPTM4A
RPS21 FAU CFL1 SKAP2 CALR RPL24 RPS3A TARDBP DDX5 RPL24 RPS15A
Validation of the aggregated lists
There is no absolute criterion by which one can judge the
quality of the resulting lists. Rather, the aggregated list
could be used to select from the top genes (100, for example)
those genes that also satisfy further conditions of the
specific application.
We can, however, have a subjective impression of the
validity of the aggregated list by visualizing the resulting
top genes in data sets not used for producing the list. We
obtained a list of the top 100 genes by applying the
method described above on eight of the ten data sets, leaving
NKI and UPP aside as validation sets. The top 100
genes in both validation sets (different microarray platforms)
are plotted in Figure 3. As a comparison, we also
include the five control genes used in [17] (represented as
triangles in the figure). It is seen that the genes are generally
concentrated in the lower right part of the plot, corresponding
to high mean expression levels and low
variance. There is a notable difference between the quality
of the results (given by the concentration of the control
genes in the lower right corner) on the two platforms, due
to the fact that most of the data sets used for gene selection
are from Affymetrix platforms. While the top 100 lists
contain genes with high stability scores on the Affymetrix
platforms (the UPP data set), on the custom Agilent platform
(NKI) there are a number of genes that are missed.
Nevertheless, those selected still function well as control
genes.
Control genes lists
We have analyzed ten different data sets which have samples
hybridized on different versions of Agilent and
Affymetrix platforms. Using our proposed method, we
compiled two different lists of candidate control genes:
one specific to breast cancer [see Additional file 1] and
one resulting from the analysis of eight different types of
cancer, thus applicable to cancers in general [see Additional
file 2]. From the breast cancer list we selected two
new control genes which were validated in an RT-QPCR
assay that also included five previously used control genes
(ACTB, TFRC, GUSB, RPLP0 and GAPDH – see [17]) and
breast cancer-related genes (e.g. ESR1, ERBB2, AURKA,
etc.). The RT-QPCR results confirm the findings from the
microarray analysis and show that more stably expressed
control genes can be selected by applying the criteria mentioned
above. Also, they provide empirical evidence supporting
the working hypothesis that PCR control genes
can be selected from microarray data.
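As a rough illustration of how such lists could be compiled, the sketch below strings together the three steps of the method as we read them: the stability score of Eq. 1 (score = α·log2(max(mean − β, 0)) − standard deviation), the ranking within each data set, and the rank product of Eq. 2. It is not the authors' R implementation. All function names are hypothetical, the percentile is computed with a simple nearest-rank rule, and the product is assumed to run only over the data sets in which a gene was ranked.

```python
import math


def stability_scores(stats, alpha=0.25, percentile=25):
    """Score the genes of one data set following our reading of Eq. 1.

    stats maps gene -> (mean log-intensity, standard deviation).
    beta is the given percentile of the means (nearest-rank estimate,
    an assumption); genes whose mean does not exceed beta score -inf.
    """
    means = sorted(m for m, _ in stats.values())
    idx = max(0, math.ceil(percentile / 100 * len(means)) - 1)
    beta = means[idx]
    scores = {}
    for gene, (m, sd) in stats.items():
        excess = max(m - beta, 0.0)
        scores[gene] = alpha * math.log2(excess) - sd if excess > 0 else -math.inf
    return scores


def rank_genes(stats, scores):
    """Rank 1 is the best gene; score ties are broken by higher mean expression."""
    kept = [g for g in scores if scores[g] > -math.inf]  # discarded genes get no rank
    kept.sort(key=lambda g: (-scores[g], -stats[g][0]))
    return {g: rank for rank, g in enumerate(kept, start=1)}


def rank_product(rankings):
    """Geometric mean of a gene's ranks over the data sets where it is ranked."""
    logs = {}
    for ranking in rankings:
        for gene, rank in ranking.items():
            logs.setdefault(gene, []).append(math.log(rank))
    return {g: math.exp(sum(v) / len(v)) for g, v in logs.items()}


def aggregate(rankings):
    """Genes sorted by increasing rank product (best candidates first)."""
    rp = rank_product(rankings)
    return sorted(rp, key=rp.get)
```

Working in log space avoids overflow when many ranks are multiplied. Because only ranks are combined, no cross-platform normalization of the scores is needed, which matches the platform-independence argument made earlier.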
The list of the top 50 control genes obtained from the ten
breast cancer data sets is given in Table 3. More comprehensive
lists, including one containing the top 2000 candidate
breast cancer genes and a similar list compiled
from eight different types of cancer, are available [see
Additional file 1 and Additional file 2]. In the case of
breast cancer control genes, it is interesting to note that
some of the "classical" genes (e.g. ACTB, GAPDH, TFRC)
are not among the top 50.
Evaluation of control genes by RT-QPCR
Motivated by the consistency of the selection process for
suitable control genes among different microarray platforms,
we performed a small scale RT-QPCR experiment
to test the performance of two new control genes along
with a number of more commonly used control genes. In
this experiment, RNA was isolated from 25 cryo-preserved
breast cancer samples and the expression of 47 genes was
measured by RT-QPCR [18]. Test genes were selected
according to their relatedness to proliferation or estrogen
receptor functions. Some of the test genes had been previously
identified and used for characterizing primary
breast cancers [17]. Two genes, RPS11 and UBB, ranked 9
and 31 in Table 3 respectively, were compared to five
additional control genes and to a number of test genes
previously measured by [17]. Mean raw expression values
of all candidate control and test genes were plotted against
standard deviations of each gene (Figure 4). The raw Ct
(cycle threshold) value is the number of PCR cycles
required for the fluorescence signal to cross the background
threshold, so that low Ct values correspond to
high expression levels. RPS11 and UBB are clearly among
the most stably expressed genes, as their standard deviations
are both quite low. Other genes frequently used as
control genes are also shown. For comparison, mean
expression and standard deviation of several test genes are
also indicated. The expression of most test genes is much
more variable than UBB and RPS11.
The two new control genes, together with RPLP0, offer the
best trade-off between mean expression level and variability,
while others like ACTB or TFRC are less stably
expressed and therefore seem less suitable for use as normalization
genes.
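The comparison shown in Figure 4 amounts to computing, for every gene, the mean and the standard deviation of its raw Ct values across samples. A minimal sketch follows; the function name and the input layout are assumptions.

```python
from statistics import mean, stdev


def ct_stability(ct_by_gene):
    """Return [(gene, mean Ct, SD of Ct)] sorted from most to least stable.

    ct_by_gene maps a gene name to its raw Ct values, one per sample.
    A low mean Ct means high expression; a low SD means stable expression.
    """
    rows = [(gene, mean(values), stdev(values)) for gene, values in ct_by_gene.items()]
    return sorted(rows, key=lambda row: row[2])
```

Genes at the top of the returned list with a low mean Ct would correspond to the lower left corner of a plot such as Figure 4.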
Figure 3: Scatter plots of standard deviation versus mean log-intensity for two validation data sets (from left to right:
NKI and UPP). The top 100 breast cancer control genes resulting from aggregating eight data sets are plotted as circles. Triangles
correspond to the five control genes used in [17] (NKI does not contain the ACTB gene).
Discussion
We propose a new approach which leverages publicly
available microarray data to produce lists of candidate
control genes for RT-QPCR. Our method is independent
of the microarray platform or normalization methodology,
and is able to cope with gene lists that overlap only
partially. After screening thousands of genes (generally
more than 10,000 genes in each data set), we have produced
two separate lists of candidate genes: one specific to
breast cancer and one more broadly applicable to different
types of cancer. We do not, however, consider these lists to be
universally applicable, as the data used do not allow such generalization.
Different pathologies may have a different impact on
the control genes and some of the control genes we
selected may become ineffective in the case of a disease
which affects their particular functions. On the other
hand, more diverse data should be used if the goal is finding
global control genes. The list of the top 50 breast cancer
control genes (Table 3) is dominated by ribosomal
proteins. This finding is consistent with the fact that ribosomes
are a major component of basic physiologic processes
in all the cells and not a primary target of changing
conditions. Other genes present among the first 50 genes
code for protein turnover (ubiquitin), tubulin-related
proteins or actins, structures which are required in all living
cells.
Table 3: Top 50 control genes as resulting from aggregating the ten breast cancer data sets. Two genes – RPS11 and UBB – were
selected as control genes and validated by RT-QPCR
Rank  Gene symbol  Gene ID  Description
1     RPL37A       6168     ribosomal protein L37a
2     RPL27A       6157     ribosomal protein L27a
3     RPS18        6222     ribosomal protein S18
4     RPL30        6156     ribosomal protein L30
5     RPL41        6171     ribosomal protein L41
6     CALM2        805      calmodulin 2 (phosphorylase kinase, delta)
7     RPL27        6155     ribosomal protein L27
8     K-ALPHA-1    10376    alpha tubulin
9     RPS11        6205     ribosomal protein S11
10    RPL39        6170     ribosomal protein L39
11    RPS13        6207     ribosomal protein S13
12    NACA         4666     nascent-polypeptide-associated complex alpha polypeptide
12    RPL23A       6147     ribosomal protein L23a
14    RPS24        6229     ribosomal protein S24
15    HNRPA1       3178     heterogeneous nuclear ribonucleoprotein A1
16    RPL9         6133     ribosomal protein L9
17    RPLP1        6176     ribosomal protein, large, P1
18    RPL32        6161     ribosomal protein L32
19    LAPTM4A      9741     lysosomal-associated protein transmembrane 4 alpha
20    RPS15A       6210     ribosomal protein S15a
21    DYNLL1       8655     dynein, light chain, LC8-type 1
22    ACTG1        71       actin, gamma 1
23    TUBA6        84790    tubulin, alpha 6
24    SRP14        6727     signal recognition particle 14kDa (homologous Alu RNA binding protein)
25    MYL6         4637     myosin, light chain 6, alkali, smooth muscle and non-muscle
26    RPL24        6152     ribosomal protein L24
27    FAU          2197     Finkel-Biskis-Reilly murine sarcoma virus (FBR-MuSV) ubiquitously expressed (fox derived); ribosomal protein S30
28    RPL31        6160     ribosomal protein L31
29    RPS15        6209     ribosomal protein S15
30    MTCH1        23787    mitochondrial carrier homolog 1 (C. elegans)
31    UBB          7314     ubiquitin B
32    RPL37        6167     ribosomal protein L37
33    HMGN2        3151     high-mobility group nucleosomal binding domain 2
34    RPS27        6232     ribosomal protein S27 (metallopanstimulin 1)
35    GDF8         2660     growth differentiation factor 8
36    RPL38        6169     ribosomal protein L38
37    RPS29        6235     ribosomal protein S29
38    SULT1C2      27233    sulfotransferase family, cytosolic, 1C, member 2
39    RPL6         6128     ribosomal protein L6
40    UBC          7316     ubiquitin C
41    UBA52        7311     ubiquitin A-52 residue ribosomal protein fusion product 1
42    MRFAP1       93621    Mof4 family associated protein 1
43    HNRPK        3190     heterogeneous nuclear ribonucleoprotein K
44    PARK7        11315    Parkinson disease (autosomal recessive, early onset) 7
45    PSMC1        5700     proteasome (prosome, macropain) 26S subunit, ATPase, 1
46    LOC158572    158572   hypothetical protein LOC158572
47    RPS8         6202     ribosomal protein S8
48    ATP5A1       498      ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit 1, cardiac muscle
49    EIF4H        7458     eukaryotic translation initiation factor 4H
50    CD63         967      CD63 molecule
Our results are supported by recent findings of de Jonge et
al. [5], who used a different ranking method. In addition,
the lists of control gene candidates for breast cancer and
for diverse types of cancer are similar [see Additional file
1 and Additional file 2], as a large number of the top
ranked genes belong to the same functional category
(ribosomal genes, protein turnover).
Figure 4: RT-QPCR experiment. Standard deviation as a function of the mean expression level (expressed as raw Ct values) of 47
genes in a RT-QPCR experiment. Higher expression levels correspond to smaller raw Ct values. Control genes are represented
by triangles, test genes by circles. The new control genes RPS11 and UBB are in the lower left corner.
Another important finding is that some of the commonly
used control genes in breast cancer (ACTB and TFRC)
appear to be less stable than previously assumed. This has
an impact on the normalization strategy of the QPCR
measurements: indeed, in our more recent experiments
we have chosen to use the mean of RPLP0, RPS11 and
UBB (on the log2 scale) for normalizing the expression of
test genes.
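The normalization strategy mentioned above can be sketched as follows, assuming that raw Ct values are used directly as the log2-scale quantity and that the baseline is the mean Ct of the three control genes. The function name, the sign convention (baseline minus Ct, so that larger values mean higher relative expression) and the input layout are assumptions, not taken from the source.

```python
from statistics import mean


def normalize_ct(sample_ct, control_genes=("RPLP0", "RPS11", "UBB")):
    """Normalize the test genes of one sample against the control-gene baseline.

    sample_ct maps gene -> raw Ct for a single sample.
    Returns {test gene: baseline - Ct}; control genes are left out of the result.
    """
    baseline = mean(sample_ct[g] for g in control_genes)
    return {
        gene: baseline - ct
        for gene, ct in sample_ct.items()
        if gene not in control_genes
    }
```

For example, with purely made-up Ct values of 18, 19 and 20 for the three control genes and 24 for a test gene, the normalized value of the test gene would be 19 − 24 = −5.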
Finally, we would like to emphasize that these two lists
should not be taken in an absolute sense: a gene in the top 10
is not necessarily a better choice than a gene ranked between
20 and 30. We do, however, consider it to be a better candidate
than a gene outside the top 100. Nor do we consider the
resulting ranking as providing a solution to the problem
of finding normalization genes in all contexts. Rather, the
lists produced through this process are meant to guide the
choice of control genes while also taking into consideration
the specific requirements of any individual analysis.
Depending on the planned application, other parameters
must be considered. For example, short amplicons or
intron-spanning primers must be used when the starting
RNA is considerably degraded or when residual DNA contamination
might affect QPCR. The final choice of control
genes should not be made by blind adherence to the
ranked list, but should be guided by the intended application.
Conclusion
Starting from clearly defined criteria, we have designed a
novel method for ranking the candidate genes for their
suitability as control genes in RT-QPCR experiments. The
genes from a data set were ranked according to their stability
score, which represented a trade-off between a gene's
average expression level and its variance. Finally, the rankings
from several data sets were combined into a list of
candidate genes, with higher ranked genes being considered
to be more suitable as control genes. The proposed
approach had the advantage of being platform- and normalization-
independent and of not being restricted to
only the list of common genes across all data sets.
By applying the proposed method to two particular collections
of data sets we were able to produce two lists of candidate
genes from which control genes for either breast
cancer or more diverse cancers could be easily selected.
Two new control genes for breast cancer – UBB and RPS11
– have been identified and validated by RT-QPCR.
Our results support the hypothesis that selecting control
genes for QPCR from microarray data is feasible.
Authors' contributions
VP conducted the analysis, devised algorithms and wrote
the computer programs. PW collected the datasets and
remapped the probes. PW, DRG and MD designed the
study and statistical analyses. JA and RJ initiated the biological
problems and conducted the RT-PCR validation.
All authors have read and approved the final manuscript.
Additional material
Additional file 1
Top 2000 breast cancer candidate control genes. Excel file containing
the top 2000 genes resulting from combining the ten breast cancer data sets.
[http://www.biomedcentral.com/content/supplementary/1471-2105-10-42-S1.xls]
Additional file 2
Top 2000 diverse cancer candidate control genes. Excel file containing
the top 2000 genes resulting from combining the eight different types of cancer
from the EXPO data set.
[http://www.biomedcentral.com/content/supplementary/1471-2105-10-42-S2.xls]
Acknowledgements
This work has been funded by the Swiss National Science Foundation
through the National Centre for Competence in Research (Molecular
Oncology (MD, VP and PW)) and the National Centre for Competence in
Research (Plant Survival (DRG)), and by the Swiss Cancer League (Project
OCS 01704-04-2005 (RJ and MD)).
References
1. Andersen CL, Jensen JL, Ørntoft TF: Normalization of real-time
quantitative reverse transcription-PCR data: a model-based
variance estimation approach to identify genes suited for
normalization, applied to bladder and colon cancer data
sets. Cancer Research 2004, 64(15):5245-5250.
2. Vandesompele J, Preter KD, Pattyn F, Poppe B, Roy NV, Paepe AD,
Speleman F: Accurate normalization of real-time quantitative
RT-PCR data by geometric averaging of multiple internal
control genes. Genome Biology 2002, 3(7):RESEARCH0034.
3. Yan HZ, Liou RF: Selection of internal control genes for real-time
quantitative RT-PCR assays in the oomycete plant
pathogen Phytophthora parasitica. Fungal Genet Biol 2006,
43(6):430-438.
4. de Kok JB, Roelofs RW, Giesendorf BA, Pennings JL, Waas ET, Feuth
T, Swinkels DW, Span PN: Normalization of gene expression
measurements in tumor tissues: comparison of 13 endogenous
control genes. Laboratory Investigations 2005, 85:154-159.
5. de Jonge HJM, Fehrmann RSN, de Bont ESJM, Hofstra RMW, Gerbens
F, Kamps WA, de Vries EGE, Zee AGJ van der, te Meerman GJ, ter
Elst A: Evidence based selection of housekeeping genes. PLoS
ONE 2007, 2(9):e898.
6. Richardson AL, Wang ZC, Nicolo AD, Lu X, Brown M, Miron A, Liao
X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities
in basal-like human breast cancer. Cancer Cell 2006,
9(2):121-132.
7. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov
D, Timmermans M, van Gelder MEM, Yu J, Jatkoe T, Berns EMJJ,
Atkins D, Foekens JA: Gene-expression profiles to predict distant
metastasis of lymph-node-negative primary breast cancer.
Lancet 2005, 365(9460):671-679.
8. IGC: Expression Project for Oncology. 2008 [http://www.intgen.org/expo.cfm].
9. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H,
Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso
F, Peterse H, Nuyten D, Buyse M, de Vijver MJV, Bergh J, Piccart M,
Delorenzi M: Gene expression profiling in breast cancer:
7. Control genes for RT-QPCR
56
Publish with BioMed Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and publishedimmediately upon acceptance
cited in PubMed and archived on PubMed Central
yours — you keep the copyright
Submit your manuscript here:
http://www.biomedcentral.com/info/publishing_adv.asp
BioMedcentral
BMC Bioinformatics 2009, 10:42 http://www.biomedcentral.com/1471-2105/10/42
Page 10 of 10
(page number not for citation purposes)
understanding the molecular basis of histologic grade to
improve prognosis. J Natl Cancer Inst 2006, 98(4):262-272.
10. Ma XJ, Wang Z, Ryan PD, Isakoff SJ, Barmettler A, Fuller A, Muir B,
Mohapatra G, Salunga R, Tuggle JT, Tran Y, Tran D, Tassin A, Amon
P, Wang W, Wang W, Enright E, Stecker K, Estepa-Sabal E, Smith B,
Younger J, Balis U, Michaelson J, Bhan A, Habin K, Baer TM, Brugge J,
Haber DA, Erlander MG, Sgroi DC: A two-gene expression ratio
predicts clinical outcome in breast cancer patients treated
with tamoxifen. Cancer Cell 2004, 5(6):607-616.
11. Vijver MJ van de, He YD, van't Veer LJ, Dai H, Hart AAM, Voskuil
DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M,
Atsma D, Witteveen A, Glas A, Delahaye L, Velde T van der, Bartelink
H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression
signature as a predictor of survival in breast cancer. N
Engl J Med 2002, 347(25):1999-2009.
12. Pawitan Y, Bjöhle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg
L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A,
Sandelin K, Shaw PM, Smeds J, Skoog L, Wedrén S, Bergh J: Gene
expression profiling spares early breast cancer patients from
adjuvant therapy: derived and validated in two populationbased
cohorts. Breast Cancer Res 2005, 7(6):R953-R964.
13. Farmer P, Bonnefoi H, Becette V, Tubiana-Hulin M, Fumoleau P, Larsimont
D, Macgrogan G, Bergh J, Cameron D, Goldstein D, Duss S,
Nicoulaz AL, Brisken C, Fiche M, Delorenzi M, Iggo R: Identification
of molecular apocrine breast tumours by microarray analysis.
Oncogene 2005, 24(29):4660-4671.
14. Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, Livasy C, Carey LA,
Reynolds E, Dressler L, Nobel A, Parker J, Ewend MG, Sawyer LR,
Wu J, Liu Y, Nanda R, Tretiakova M, Orrico AR, Dreher D, Palazzo
JP, Perreard L, Nelson E, Mone M, Hansen H, Mullins M, Quackenbush
JF, Ellis MJ, Olopade OI, Bernard PS, Perou CM: The molecular
portraits of breast tumors are conserved across microarray
platforms. BMC Genomics 2006, 7:96.
15. Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan
Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53
status in human breast cancer predicts mutation status,
transcriptional effects, and patient survival. Proc Natl Acad Sci
USA 2005, 102(38):13550-13555.
16. Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a
simple, yet powerful, new method to detect differentially
regulated genes in replicated microarray experiments. FEBS
Lett 2004, 573(1–3):83-92.
17. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker
MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant
J, Wolmark N: A multigene assay to predict recurrence of
tamoxifen-treated, node-negative breast cancer. N Engl J Med
2004, 351(27):2817-2826.
18. Oberli A, Popovici V, Delorenzi M, Baltzer A, Antonov J, Matthey S,
Aebi S, Altermatt HJ, Jaggi R: Expression profiling with RNA
from formalin-fixed, paraffin-embedded material. BMC Med
Genomics 2008, 1:9.
7. Control genes for RT-QPCR
57
7. Control genes for RT-QPCR
58
8 Rgtsp: a generalized top scoring pairs package
for class prediction
• Bioinformatics, 27 (12):1729–1730, 2011
• IF: 7.307
• number of citations: 4
• personal contribution (70%): method design, data collection and
processing, experimental design and implementation (R package),
manuscript writing
Rgtsp: a generalized top scoring pairs package for class
prediction
Vlad Popovici 1,2,∗, Eva Budinská 1,3 and Mauro Delorenzi 1
1 Bioinformatics Core Facility, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
2 Swiss National Center of Competence in Research Molecular Oncology, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Switzerland
3 Institute of Biostatistics and Analyses, Masaryk University, Brno, Czech Republic
ABSTRACT
Summary: A top scoring pair (TSP) classifier consists of a pair
of variables whose relative ordering can be used for accurately
predicting the class label of a sample. This classification rule has
the advantage of being easily interpretable and more robust against
technical variations in data, such as those due to different microarray
platforms. Here we describe a parallel implementation of this classifier,
which significantly reduces the training time, and a number of
extensions, including a multi-class approach, which have the potential
to improve the classification performance.
Availability and Implementation: Full C++ source code and the
R package Rgtsp are freely available from http://lausanne.isb-sib.ch/~vpopovic/research/.
The implementation relies on existing OpenMP libraries.
Contact: vlad.popovici@isb-sib.ch
1 INTRODUCTION
Top scoring pairs (TSPs) (Geman et al., 2004) are simple two-variable
binary classifiers, in which the prediction of the class label
is based solely on the relative ranking of the expression levels of
the two genes. The rank-based approach to classification ensures a
higher degree of robustness to technical variations and makes the
rule easily portable across platforms. Also, the direct comparison of
the expression levels of the genes is easily interpretable in the clinical
context, making the TSPs attractive for medical tests.
Let $x = [x_i]_{i=1,\dots,m} \in \mathbb{R}^m$ be a vector of measurements (e.g.
gene expression) representing a sample and let the corresponding
class label be $y$, with two classes denoted by 0 and 1. Then, for all
pairs of variables $i$ and $j$, a score is computed,
$$s_{i,j} = P(x_i < x_j \mid y = 1) - P(x_i < x_j \mid y = 0), \quad 1 \le i, j \le m \qquad (1)$$
where $P$ are conditional probabilities estimated from the data, and
the corresponding decision rule is: if $\mathrm{sign}(s_{i,j})\,x_i < \mathrm{sign}(s_{i,j})\,x_j$
then predict $y = 1$, otherwise $y = 0$. The pairs are ordered by
the absolute values of their scores and the top $t$ pairs ($t \ge 1$)
are then considered for the final model (Xu et al., 2005; Tan
et al., 2005; Geman et al., 2004). Remarkably, training a TSP
does not require the optimization of any parameter and does not
depend on any threshold. Selecting a suitable value for $t$ should be
done following the usual machine learning paradigm for optimizing
meta-parameters (see, for example, Hastie et al., 2001). Figure 1
shows an example of a TSP predicting the estrogen receptor status.
The decision boundary (in grey) is always a line with a slope of 1.
Fig. 1. Predicting estrogen receptor status: if GSTP1 < ESR1, then the
sample is considered ER+ (red dots), otherwise ER- (blue dots).
∗To whom correspondence should be addressed.
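The score in Eq. (1) and the associated decision rule can be sketched in a few lines. The example below is written in Python purely for illustration (the package itself is implemented in C++ and R); the function names `tsp_scores`, `top_pairs` and `tsp_predict` and the toy data are assumptions introduced here, not part of Rgtsp.

```python
from itertools import combinations

def tsp_scores(X, y):
    """Return {(i, j): s_ij} for all i < j, following Eq. (1).

    X is a list of samples (each a list of m measurements);
    y is a list of 0/1 class labels.
    """
    m = len(X[0])
    n1 = sum(1 for label in y if label == 1)
    n0 = len(y) - n1
    scores = {}
    for i, j in combinations(range(m), 2):
        # Empirical estimates of P(x_i < x_j | y = 1) and P(x_i < x_j | y = 0)
        p1 = sum(1 for x, c in zip(X, y) if c == 1 and x[i] < x[j]) / n1
        p0 = sum(1 for x, c in zip(X, y) if c == 0 and x[i] < x[j]) / n0
        scores[(i, j)] = p1 - p0
    return scores

def top_pairs(scores, t=1):
    """Return the t pairs with the largest absolute score."""
    return sorted(scores, key=lambda pair: abs(scores[pair]), reverse=True)[:t]

def tsp_predict(x, i, j, s):
    """Decision rule: predict 1 if sign(s) * x_i < sign(s) * x_j, else 0."""
    sign = 1 if s >= 0 else -1
    return 1 if sign * x[i] < sign * x[j] else 0

# Toy data (hypothetical): variable 0 is below variable 2 only in class 1.
X = [[1, 5, 3], [2, 4, 6], [0, 7, 2],
     [5, 6, 1], [7, 3, 2], [9, 8, 4]]
y = [1, 1, 1, 0, 0, 0]
scores = tsp_scores(X, y)
best = top_pairs(scores, t=1)[0]
predictions = [tsp_predict(x, best[0], best[1], scores[best]) for x in X]
```

Note that no parameter is fitted: the only choices are the pair itself and, when several pairs are combined, the number t of pairs retained.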
2 IMPLEMENTATION
While the method briefly described above is simple and poses
no implementation problems, using it with high-dimensional
data requires evaluating an extremely large
number of pairs of variables, which makes it impractical, especially
in the context of resampling techniques for performance estimation.
However, most if not all modern desktop computers are multi-core
machines, making parallel programs a feasible alternative to
classical serial ones.
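To give a sense of scale, the sketch below counts the m(m-1)/2 pairs for an assumed m = 22283 variables (the number of probesets on an Affymetrix U133A array, used here only as an example) and shows one possible way of dividing the rows of the pair matrix among workers. It is written in Python for illustration and is not the package's OpenMP code; Python threads demonstrate the partitioning only and do not provide the speed-up of native threads.

```python
from concurrent.futures import ThreadPoolExecutor

def n_pairs(m):
    """Number of unordered variable pairs among m variables."""
    return m * (m - 1) // 2

def count_pairs_in_block(m, rows):
    """Stand-in for the per-block work: pairs (i, j) with i in rows and j > i."""
    return sum(m - 1 - i for i in rows)

def split_rows(m, n_workers):
    """Interleave the row indices so that each worker gets a similar load."""
    return [range(w, m, n_workers) for w in range(n_workers)]

m = 22283  # assumed number of variables (illustrative)
blocks = split_rows(m, n_workers=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(lambda rows: count_pairs_in_block(m, rows), blocks))
total = sum(partial)
```

Interleaving the rows matters because row i contributes m - 1 - i pairs: contiguous blocks would give the first worker far more work than the last.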
Our implementation in C++ exploits the multi-core architecture
by using the OpenMP libraries of the system (Chapman et al.,
2007), and is wrapped in an R package, Rgtsp. The full source
code and the R package are available from http://lausanne.isb-sib.ch/~vpopovic/research/.
As C++ is the main implementation
language, the library can easily be extended and integrated with
other software libraries. Also, the R functions are independent of
the domain of application, so they can be applied to any kind of
data.
Associate Editor: Dr. Jonathan Wren
© The Author (2011). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Bioinformatics Advance Access published April 19, 2011
3 USAGE EXAMPLES
We present a typical case of using the Rgtsp package. These examples
are only code snippets and do not cover the full process of
developing and assessing the performance of a classifier.
The data used in these examples consist of 130 samples of stage I
to III breast cancer (Hess et al., 2006) and the goal is to predict the
estrogen receptor status (positive or negative, coded as "+1" and
"0", respectively). For illustration purposes we use only a subset of
the full data set, which is available from the GEO repository under accession number
GSE16716.
Before starting R, the user has the option of choosing the
number of processing units that will be used, by setting the
environment variable OMP_NUM_THREADS. If not set, it defaults
to the maximum number of processing units available.
The first steps load the library and the data and build a list of TSPs
(note that the matrix X contains the variables as columns):
> library(Rgtsp)
> data(mdabr)
> tsp.list = tsp.n(X, y.erpos, 500)
> str(tsp.list)
> print(tsp.list)
The function tsp.n() returns at most n TSPs as a list with three
components: the first two correspond to the indexes of the selected
variables and the third one contains the associated scores. A similar
function, tsp.s(), returns all the TSPs that have a score larger
than a specified value.
For the p-th TSP, the prediction rule can be written as: predict
class "+1" if X[,tsp.list$I[p]] < X[,tsp.list$J[p]],
and this forms the core of the predict function. The decision
function for p = 1 in the above example is shown in Figure 1.
Given a list of TSPs, there are different ways to obtain
the final predicted labels. Currently, Rgtsp offers two means
of combining the predictions of individual TSPs: either by majority
voting or by weighting the votes with the corresponding scores,
giving more weight to the TSPs with better scores. This
functionality is available through the predict() generic function:
> yp = predict(tsp.list, X, combiner="majority")
> sum(yp != y.erpos) # count the errors
[1] 3
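The two ways of combining individual TSP votes can be illustrated as follows. This is a Python sketch of the idea rather than the package's code; in particular, the combiner name "weighted" and the tie-breaking convention are assumptions made for this example.

```python
def predict_pair(x, i, j):
    """Vote of one TSP: class 1 if x[i] < x[j], else class 0."""
    return 1 if x[i] < x[j] else 0

def predict_combined(x, pairs, scores, combiner="majority"):
    """Combine the votes of several TSPs for one sample x."""
    votes = [predict_pair(x, i, j) for i, j in pairs]
    if combiner == "majority":
        weights = [1.0] * len(votes)          # every pair counts equally
    elif combiner == "weighted":              # hypothetical name
        weights = [abs(s) for s in scores]    # better pairs count more
    else:
        raise ValueError("unknown combiner: " + combiner)
    weight_for_one = sum(w for v, w in zip(votes, weights) if v == 1)
    # Assumed convention: ties go to class 0
    return 1 if weight_for_one > sum(weights) / 2 else 0

# Three hypothetical pairs: the strongest one votes 1, the two weaker ones vote 0.
pairs = [(0, 1), (2, 3), (4, 5)]
scores = [0.9, 0.3, 0.3]
x = [1, 2, 5, 4, 7, 6]
by_majority = predict_combined(x, pairs, scores, combiner="majority")
by_weight = predict_combined(x, pairs, scores, combiner="weighted")
```

In this toy case the two combiners disagree: the majority vote follows the two weak pairs, whereas the weighted vote follows the single high-scoring pair.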
By inspecting the list of TSPs, it becomes clear that some
variables are selected many times, always having either a higher
or a lower value than all their pairing variables. We call such a structure
a TSP hub, and we can construct all the hubs larger than a specified
size (25 pairs, for example) using
> h = tsp.hub(tsp.list, min.hub.size=25)
> print(h)
Hub 1: 194 pairs
Center: 953 >
14 25 42 43 44 45 54 105 140 146 149 150 152 202 ...
This corresponds to a TSP hub in which the probeset
colnames(X)[953] (205225_at, ESR1) has a higher
expression than all other probesets in the list tsp.list. The TSP
hubs can also be used in predicting the labels, through the same
mechanism as above:
> yph = predict(h, X, combiner="majority")
> sum(yph != y.erpos) # no. of errors: 6
We see that in this particular case the prediction by TSP hubs is
slightly less accurate than the combined predictions of the individual
TSPs.
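The hub structure can be recovered from a list of pairs by grouping the pairs around each variable. The following Python sketch illustrates the idea under the assumption that each pair (i, j) encodes "x[i] < x[j] predicts the positive class"; the function name `tsp_hubs` and the returned data structure are hypothetical and do not mirror the package's tsp.hub() output.

```python
from collections import defaultdict

def tsp_hubs(pairs, min_hub_size=2):
    """Group pairs (i, j) into hubs around a shared center variable.

    Returns {(center, relation): [partners]}, where relation is '>' when the
    center is the larger variable in all of its pairs and '<' when it is the
    smaller one.
    """
    hubs = defaultdict(list)
    for i, j in pairs:
        hubs[(j, ">")].append(i)   # j lies above i in this pair
        hubs[(i, "<")].append(j)   # i lies below j in this pair
    return {key: sorted(partners) for key, partners in hubs.items()
            if len(partners) >= min_hub_size}

# Hypothetical pair list: variable 953 dominates three other variables.
pairs = [(14, 953), (25, 953), (42, 953), (7, 8)]
hubs = tsp_hubs(pairs, min_hub_size=3)
```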
The generalization performance of TSP classifiers can be
estimated by various methods. The Rgtsp package provides a
function, cv.tsp(), for k-fold cross-validation of the binary TSP classifiers
(built with either tsp.n() or tsp.s()); it
returns the training and validation performance of the classifier and
defaults to 5-fold cross-validation.
> r = cv.tsp(X, y.erpos)
> print(r)
$tr.m
Error.rate Sensitivity Specificity AUC
0.02884615 0.97812500 0.96000000 0.96906250
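For readers unfamiliar with the procedure, k-fold cross-validation of a single-pair TSP classifier can be sketched as below. This Python example is illustrative only: the fold assignment (by index modulo k), the helper names and the toy data are assumptions, and, unlike cv.tsp(), the sketch reports validation performance only and omits the AUC.

```python
from itertools import combinations

def best_pair(X, y):
    """Return (i, j, score) of the top scoring pair on the training data."""
    n1 = sum(y)
    n0 = len(y) - n1
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        p1 = sum(x[i] < x[j] for x, c in zip(X, y) if c == 1) / n1
        p0 = sum(x[i] < x[j] for x, c in zip(X, y) if c == 0) / n0
        s = p1 - p0
        if best is None or abs(s) > abs(best[2]):
            best = (i, j, s)
    return best

def predict(x, model):
    """Apply the TSP rule; a negative score reverses the comparison."""
    i, j, s = model
    return int(x[i] < x[j]) if s >= 0 else int(x[i] > x[j])

def cv_tsp(X, y, k=5):
    """k-fold cross-validation; the pair is re-selected in every fold."""
    n = len(y)
    tp = tn = fp = fn = 0
    for fold in range(k):
        train = [a for a in range(n) if a % k != fold]
        test = [a for a in range(n) if a % k == fold]
        model = best_pair([X[a] for a in train], [y[a] for a in train])
        for a in test:
            pred = predict(X[a], model)
            if y[a] == 1:
                tp += pred
                fn += 1 - pred
            else:
                fp += pred
                tn += 1 - pred
    return {"error_rate": (fp + fn) / n,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Toy data (hypothetical): variables 0 and 1 swap order between the classes.
y = [a % 2 for a in range(20)]
X = [[a, a + 10, 5] if c == 1 else [a + 10, a, 5] for a, c in enumerate(y)]
result = cv_tsp(X, y, k=5)
```

The essential point is that the pair selection is repeated inside each fold; selecting the pair once on all samples and then cross-validating only the prediction step would give an optimistic estimate.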
In the case of a multi-class problem, we propose to use
classification trees built on top of TSP predictions. For C > 2
classes, one can train TSPs to solve each of the C(C−1)/2 pairwise
binary classification problems (the one-versus-one (Hsu and
Lin, 2002) or round robin (Fürnkranz, 2002) strategy) and then
combine the predictions of the TSPs through a classification tree to
predict the original classes. For more details the reader is referred to
the package web page. This approach is implemented in the function
mtsp() and makes use of the ctree() function in the party R
package (y4 is an artificial 4-class label vector):
> m = mtsp(X, y4)
> yp = predict(m, X)
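The one-versus-one decomposition underlying this approach can be made concrete with a short Python sketch. It only enumerates the C(C-1)/2 binary sub-problems and recodes their labels; training the TSPs and combining them through a classification tree are not shown. The function name `pairwise_problems` and the label vector are assumptions made for illustration.

```python
from itertools import combinations

def pairwise_problems(labels):
    """Split a multi-class label vector into its C(C-1)/2 binary problems.

    Returns {(a, b): (sample_indices, binary_labels)}, where class b is
    recoded as 1 and class a as 0.
    """
    classes = sorted(set(labels))
    problems = {}
    for a, b in combinations(classes, 2):
        idx = [k for k, c in enumerate(labels) if c in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[k] == b else 0 for k in idx])
    return problems

# Hypothetical 4-class label vector, in the spirit of y4 above.
y4 = [1, 2, 3, 4, 1, 2, 3, 4]
problems = pairwise_problems(y4)
```

For four classes this yields six binary problems, each trained only on the samples of the two classes involved.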
ACKNOWLEDGEMENT
Funding: VP and MD acknowledge the support of the Swiss
National Science Foundation NCCR Molecular Oncology. EB
acknowledges the support of Fondation Medic.
REFERENCES
Chapman, B., Jost, G., and van der Pas, R. (2007). Using OpenMP. The MIT Press.
Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning
Research, 2, 721–747.
Geman, D., d'Avignon, C., Naiman, D. Q., and Winslow, R. L. (2004). Classifying
gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol
Biol, 3, Article19.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning.
Data Mining, Inference and Prediction. Springer-Verlag.
Hess, K. R., Anderson, K., Symmans, W. F., Valero, V., Ibrahim, N., Mejia, J. A.,
Booser, D., Theriault, R. L., Buzdar, A. U., Dempsey, P. J., Rouzier, R., Sneige,
N., Ross, J. S., Vidaurre, T., Gómez, H. L., Hortobagyi, G. N., and Pusztai, L.
(2006). Pharmacogenomic predictor of sensitivity to preoperative chemotherapy
with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast
cancer. J Clin Oncol, 24(26), 4236–4244.
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support
vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Tan, A. C., Naiman, D. Q., Xu, L., Winslow, R. L., and Geman, D. (2005).
Simple decision rules for classifying human cancers from gene expression profiles.
Bioinformatics, 21(20), 3896–3904.
Xu, L., Tan, A. C., Naiman, D. Q., Geman, D., and Winslow, R. L. (2005).
Robust prostate cancer marker genes emerge from direct integration of inter-study
microarray data. Bioinformatics, 21(20), 3905–3911.
9 Effect of training-sample size and classification
difficulty on the accuracy of genomic predictors
• Breast Cancer Research, 12(1):R5
• IF: 6.345
• number of citations: 76
• personal contribution (60%): data processing, experimental design
and implementation, statistical analysis of the results, manuscript
writing
RESEARCH ARTICLE Open Access
Effect of training-sample size and classification
difficulty on the accuracy of genomic predictors
Vlad Popovici 1, Weijie Chen 2, Brandon G Gallas 2, Christos Hatzis 3, Weiwei Shi 4, Frank W Samuelson 2, Yuri Nikolsky 4,
Marina Tsyganova 5, Alex Ishkin 5, Tatiana Nikolskaya 4,5, Kenneth R Hess 6, Vicente Valero 7, Daniel Booser 7,
Mauro Delorenzi 1,8, Gabriel N Hortobagyi 7, Leming Shi 9, W Fraser Symmans 10, Lajos Pusztai 7*
Abstract
Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of
univariate feature-selection methods and classification algorithms may influence the performance of genomic
predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent
validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight
different classifiers) for each of the three endpoints. Their classification performance was estimated on the training
set by using two different resampling methods and compared with the accuracy observed in the independent
validation set.
Results: A ranking of the three classification problems was obtained, and the performance of 120 models was
estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the
validation performance than were the cross-validation estimates. The required sample size for each endpoint was
estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample
size and classification difficulty. Variations on univariate feature-selection methods and choice of classification
algorithm have only a modest impact on predictor performance, and several statistically equally good predictors
can be developed for any given classification problem.
Introduction
Gene-expression profiling with microarrays represents a
novel tissue analytic tool that has been applied successfully
to cancer classification, and the first generation of
genomic prognostic signatures for breast cancer is
already on the market [1-3]. So far, most of the published
literature has addressed relatively simple classification
problems, including separation of cancer from
normal tissue, distinguishing between different types of
cancers, or sorting cancers into good or bad prognoses
[4]. The transcriptional differences between these conditions
or disease states are often large compared with
transcriptional variability within the groups, and therefore,
reasonably successful classification is possible. The
methodologic limitations and performance characteristics
of gene expression based classifiers have not been
examined systematically when applied to increasingly
challenging classification problems in real clinical data
sets.
The breast cancer data set (Table 1) of the MicroArray Quality
Control project, phase II (MAQC-II; a comprehensive study of common
practices for the development and validation of
microarray-based predictive models), offers a unique opportunity to study the
performance of genomic classifiers when applied across
a range of classification difficulties.
One of the most important discoveries in breast cancer
research in recent years has been the realization that
estrogen receptor (ER)-positive and -negative breast cancers
represent molecularly distinct diseases with large
differences in gene-expression patterns [5,6]. Therefore,
gene expression-based prediction of ER status represents
an easy classification problem.
* Correspondence: lpusztai@mdanderson.org
7 Department of Breast Medical Oncology, P.O. Box 301439, Houston, TX 77230-1439, USA
Popovici et al. Breast Cancer Research 2010, 12:R5
http://breast-cancer-research.com/content/12/1/R5
© 2010 Popovici et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
A somewhat more difficult problem is to predict
extreme chemotherapy sensitivity, including all breast
cancers in the analysis. This classification problem is
facilitated by the association between clinical disease
characteristics and chemotherapy sensitivity. For example,
ER-negative cancers are more chemotherapy sensitive
than are ER-positive tumors [7].
A third, and more difficult, classification problem is to
predict disease outcome in clinically and molecularly
homogeneous patient populations. Genomic predictors
could have the greatest clinical impact here, because traditional
clinical variables alone are only weakly discriminatory
of outcome in these populations. In the current
data set, prediction of chemotherapy sensitivity among
the ER-negative cancers represents such a challenge.
The goal of this analysis was to assess how the
degree of classification difficulty may affect which elements
of prediction methods perform better. We
divided the data into a training set (n = 130) and a
validation set (n = 100) and developed a series of
classifiers to predict (a) ER status, (b) pathologic complete
response (pCR) to preoperative chemotherapy for
all breast cancers, and (c) pCR for ER-negative breast
cancers. A predictor, or classifier, in this article is
defined as a set of informative features (generated by a
particular feature-selection method) and a trained discrimination
rule (produced by applying a particular
classification algorithm).
Table 1 Patient characteristics in the training and validation sets
Characteristic                        Training set (n = 130)    Validation set (n = 100)    P value
Median age (range)                    51 years (28-79 years)    50 years (26-73 years)
Race                                                                                        0.804
  Caucasian                           85 (65%)                  68 (68%)
  African American                    13 (10%)                  12 (12%)
  Asian                               9 (7%)                    7 (7%)
  Hispanic                            21 (16%)                  13 (13%)
  Mixed                               2 (2%)                    0
Cancer histology                                                                            0.047
  Invasive ductal (IDC)               119 (92%)                 85 (85%)
  Mixed ductal/lobular (IDC/ILC)      8 (6%)                    8 (8%)
  Invasive lobular (ILC)              1 (0.7%)                  7 (7%)
  Others                              2 (1.3%)                  0
Tumor size                                                                                  0.643
  T0                                  1 (1%)                    2 (2%)
  T1                                  12 (9%)                   8 (8%)
  T2                                  70 (54%)                  62 (62%)
  T3                                  21 (16%)                  13 (13%)
  T4                                  26 (20%)                  15 (15%)
Lymph node stage                                                                            0.935
  N0                                  39 (30%)                  27 (27%)
  N1                                  60 (46%)                  47 (47%)
  N2                                  14 (11%)                  13 (13%)
  N3                                  17 (13%)                  13 (13%)
Nuclear grade (BMN)                                                                         0.005
  1                                   2 (2%)                    11 (11%)
  2                                   52 (40%)                  42 (42%)
  3                                   76 (58%)                  47 (47%)
Estrogen receptor                                                                           0.813
  Estrogen receptor positive          80 (62%)                  60 (60%)
  Estrogen receptor negative          50 (38%)                  40 (40%)
HER-2                                                                                       < 0.001
  HER-2 positive                      33 (25%)                  7 (7%)
  HER-2 negative                      96 (74%)                  93 (93%)
Neoadjuvant therapy                                                                         0.005
  Weekly T × 12 + FAC × 4             112 (86%)                 98 (98%)
  3-Weekly T × 4 + FAC × 4            18 (14%)                  2 (2%)
Pathologic complete response (pCR)    33 (25%)                  15 (15%)                    0.055
Estrogen receptor: cases in which more than 10% of tumor cells stained positive for ER with immunohistochemistry (IHC) were considered positive. HER-2: cases
that showed either 3+ IHC staining or had gene copy number greater than 2.0 were considered HER-2 "positive." T = paclitaxel; FAC = 5-fluorouracil, doxorubicin,
and cyclophosphamide. The P values for the association tests were obtained from a χ² test unless the number of cases was fewer than five in any category, in
which case Fisher's exact test was used.
First, we examined whether the success of a predictor
was influenced by a feature-selection method. We examined
five different univariate feature-selection methods
including three variations of a t test-based ranking and
two methods that order features based on differences in
expression values. It has been shown that several different
classification algorithms can yield predictors with
rather similar performance metrics [8-10]. However, it
remains unknown whether the relative performances of
different methods may vary depending on the difficulty
of the prediction problem. We examined this question
for eight different classifiers representing a broad range
of algorithms, including linear (LDA), diagonal linear
(DLDA), and quadratic discriminant analysis (QDA);
logistic regression (LREG); and two versions each of support-vector
machines (SVM) and k-nearest neighbor (KNN)
methods. Altogether, 40 different predictors were developed
for each of the three classification problems (five
different feature-selection methods × eight different
classifiers). We were also interested in determining to what
extent the cross-validation classification performance is
influenced by different data-resampling methods and the
difficulty of the classification problem. We estimated the
classification performance by using 10-times-repeated
fivefold cross validation (10 × 5-CV) and leave-pair-out
(LPO) bootstrapping [11] (a method that better accounts
for training and testing variability). We calculated performance
metrics for each of the 120 predictors (40 predictors
× three endpoints) and compared the estimated
accuracy in the training set with the observed accuracy
in the independent validation set.
Materials and methods
Patients and materials
Gene-expression data from 230 stage I to III breast
cancers, without individual patient identifiers, were
provided to the MAQC project by the University of
Texas M.D. Anderson Cancer Center (MDACC) Breast
Cancer Pharmacogenomic Program. Gene-expression
results were generated from fine-needle aspiration specimens
of newly diagnosed breast cancers before any
therapy. The biopsy specimens were collected sequentially
during a prospective pharmacogenomic marker
discovery study approved by the institutional review
board between 2000 and 2008. These specimens
represent 70% to 90% pure neoplastic cells with minimal
stromal contamination [12]. All patients signed
informed consent for genomic analysis of their cancers.
Patients received 6 months of preoperative (neoadjuvant)
chemotherapy including paclitaxel, 5-fluorouracil,
cyclophosphamide, and doxorubicin, followed by surgical
resection of the cancer. Response to preoperative
chemotherapy was categorized as a pathologic complete
response (pCR = no residual invasive cancer in
the breast or lymph nodes) or residual invasive cancer
(RD). The prognostic value of pCR has been discussed
extensively in the medical literature [13]. Genomic
analyses of subsets of this sequentially accrued patient
population were reported previously [9,14,15]. For each
endpoint, we used the first 130 cases as a training set
to develop prediction models, and the next 100 cases
were set aside as an independent validation set. Table 1
and Additional file 1 show patient and sample characteristics
in the two data sets.
Gene-expression profiling
Needle-aspiration specimens of the cancer were placed
into RNAlater™ solution (Qiagen, Germantown, MD,
USA) and stored at -80°C until further analysis. RNA
extraction and gene-expression profiling were performed
in multiple batches over time, as described previously
[16,17] by using Affymetrix U133A (Affymetrix,
Santa Clara, CA, USA) microarrays. Gene-expression
data have been uploaded to the Gene Expression Omnibus
website under the accession number GSE16716.
Normalization was performed by using MAS 5.0 software
(Affymetrix, Santa Clara, CA, USA) with default
settings. Quality-control assessment of the hybridization
results was performed with the SimpleAffy software from
Bioconductor; the percentage of present calls had to be
more than 30%, the scaling factor less than 3, and the 3'/5'
ratios less than 3 for β-actin and less than
1.3 for GAPDH. These quality-control metrics are presented for
each case in Additional file 2.
Ranking of classification problems by informative feature
utility score
To assess the relative difficulty of the three classification
problems that we selected to study, we adopted an
approach similar to that described in [18]. This method
defines the utility of a feature $i$ as its Fisher score,
$$f_i = \frac{(\mu_{1i} - \mu_{2i})^2}{\sigma_{1i}^2 + \sigma_{2i}^2},$$
where $\mu_{1i}$ and $\mu_{2i}$ are the class means, and $\sigma_{1i}$ and $\sigma_{2i}$
are the class standard deviations for the feature $i$,
respectively. If features are ordered $f_1 \ge f_2 \ge \dots$ then, for
each endpoint, the cumulative information is defined as
$$F_j = \sum_{i=1}^{j} f_i, \quad j \le N,$$
where N is the sample size. This cumulative information
score assumes that the features are independent
and that their effect on the classification performance is
additive. This is rarely the case, as features are often
correlated. Nonetheless, this cumulative information
score is a simple and straightforward approach to estimate
the relative difficulty of a classification problem
early in the classifier-development process: an easier
problem tends to have larger values for F than does a
more difficult problem.
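The Fisher score and the cumulative information curve can be computed as in the following Python sketch. It is an illustration, not the code used in the study; the use of the sample variance (denominator n - 1) and the toy values are assumptions, since the text does not specify the estimator.

```python
from itertools import accumulate
from statistics import mean, variance

def fisher_score(values_class1, values_class2):
    """(mu1 - mu2)^2 / (sigma1^2 + sigma2^2) for a single feature."""
    mu1, mu2 = mean(values_class1), mean(values_class2)
    var1, var2 = variance(values_class1), variance(values_class2)  # sample variances (assumed)
    return (mu1 - mu2) ** 2 / (var1 + var2)

def cumulative_information(scores, n_samples):
    """Running sums F_1, F_2, ... of the scores sorted in decreasing order,
    using at most n_samples features."""
    ordered = sorted(scores, reverse=True)[:n_samples]
    return list(accumulate(ordered))

# Two hypothetical features measured in two classes of three samples each.
weak = fisher_score([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
strong = fisher_score([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
curve = cumulative_information([weak, strong], n_samples=6)
```

Comparing such curves between endpoints gives the ranking of problem difficulty: the curve of an easier endpoint rises faster and higher.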
Feature-selection methods
No prefiltering of probe sets was done; all probe sets were
considered by the feature-ranking methods, which were
(a) the unequal-variance t test (FS1); (b) the unequal-variance t
test with filtering of probe sets that were correlated with
one another (Pearson correlation > 0.75) to generate independently
informative features (FS2); (c) the same t test, but with the
correlated features combined into metafeatures
by averaging instead of being removed (FS3); (d) the ratio of
between-group to within-group sum of squares (FS4); and (e) the absolute
difference in the class means (FS5).
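As an illustration of the first two ranking methods, the Python sketch below ranks features by the absolute unequal-variance (Welch) t statistic and then filters correlated features. The greedy filtering order (walking down the FS1 ranking and dropping any feature whose Pearson correlation with an already retained feature exceeds 0.75) is an assumption, as the text does not describe how the filter was applied; the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Unequal-variance (Welch) t statistic for two groups."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    num = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    den = sqrt(sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v))
    return num / den

def rank_fs1(columns, y):
    """FS1: feature indices ordered by decreasing |t|."""
    def abs_t(col):
        a = [v for v, c in zip(col, y) if c == 1]
        b = [v for v, c in zip(col, y) if c == 0]
        return abs(welch_t(a, b))
    return sorted(range(len(columns)), key=lambda k: abs_t(columns[k]), reverse=True)

def rank_fs2(columns, y, max_corr=0.75):
    """FS2 (assumed greedy variant): keep a feature only if it is not
    correlated above max_corr with any feature already kept."""
    kept = []
    for k in rank_fs1(columns, y):
        if all(pearson(columns[k], columns[q]) <= max_corr for q in kept):
            kept.append(k)
    return kept

# Hypothetical data: feature 1 nearly duplicates feature 0; feature 2 is noise.
y = [1, 1, 1, 0, 0, 0]
columns = [[5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
           [5.1, 6.2, 7.1, 1.2, 2.1, 3.3],
           [2.0, 1.0, 3.0, 1.5, 2.5, 1.8]]
fs1 = rank_fs1(columns, y)
fs2 = rank_fs2(columns, y)
```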
Classification algorithms
We examined eight classifiers in combination with the
previously mentioned feature-selection methods, including
linear discriminant analysis (LDA), diagonal linear
discriminant analysis (DLDA), quadratic discriminant
analysis (QDA), logistic regression (LREG), two k-nearest
neighbor classifiers with k = 3 (KNN3) and k = 11
(KNN11), and support vector machines with a radial
basis function kernel with two different values for the
kernel parameter: γ = 0.5 (SVM05) and γ = 2.0 (SVM2),
respectively. Overall, 40 models were developed for each
of the three prediction problems.
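The model grid is simply the Cartesian product of the feature-selection methods, the classifiers and the endpoints, as the short Python sketch below shows. The endpoint labels are shorthand introduced here for illustration.

```python
from itertools import product

feature_selectors = ["FS1", "FS2", "FS3", "FS4", "FS5"]
classifiers = ["LDA", "DLDA", "QDA", "LREG", "KNN3", "KNN11", "SVM05", "SVM2"]
# Shorthand labels (assumed) for the three endpoints
endpoints = ["ER status", "pCR, all cases", "pCR, ER-negative cases"]

# Every (feature selection, classifier) combination for one endpoint
models_per_endpoint = list(product(feature_selectors, classifiers))
# Every combination across the three endpoints
all_models = list(product(endpoints, feature_selectors, classifiers))
```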
Estimation of predictive performance
Leave-N-out cross-validation and other resampling
methods of the training set are often used to select a
final predictor for independent validation. Therefore, it
is important to understand how resampling-based predictive
performance correlates with predictive performance
on independent validation cases. To study this
question, we used a nested two-level cross-validation
scheme, in which the cross-validation in the outer loop
had the role of estimating the performance of the whole
modeling procedure, whereas the cross-validation in the
inner loop was used for selecting the optimal number of
features [19].
The procedure in the inner loop is as follows. For
each combination of a feature-selection method F and
a classification algorithm C, the number of features j
(F, C) in the model was treated as a free parameter
(within a predefined set of allowable values) and was
optimized. In the inner loop, a repeated (5 times), stratified
(to preserve the proportion of the two classes in
all training and testing splits), fivefold cross-validation
was used to define the number of features that maximized
the AUC. A ranking of the features was first
obtained by applying F on the reduced internal training
set (obtained by leaving aside one fold from the
current training set). Then the classifier C was trained
on the same set, but considering only the top j(F, C)
features. The predictions on the internal testing set
(the left-out fold) were recorded, and the procedure
was repeated. At the end, an estimation of the AUC
was obtained, corresponding to the given combination
of F, C, and j(F, C). The procedure was repeated with
different folds, and an average estimate of the AUC
was obtained for each F, C, and j(F, C). The optimal
number of features j*(F, C) was selected as the value j
(F, C) yielding the highest average AUC. The number
of features allowed for each model was chosen a
priori, to avoid overfitting of models and to limit the
computation time. For the prediction of ER status, the
feature size was chosen to contain all values between 2
and 15, whereas for both pCR endpoints, it was
{2,5,8,...,41}; 41 being almost half the size of the smallest
training set (n = 85 ER-negative cancers). Pseudo-code
detailing the cross-validation scheme is provided
[see Additional file 3]. To avoid adding
variability due to random partitioning of the data into
folds, all estimates were obtained on the same splits of
the data.
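The inner loop described above can be sketched as follows. This is a simplified illustration, not the authors' pseudo-code: the ranking by absolute difference in class means stands in for F, a nearest-centroid scorer stands in for C, and all function names are hypothetical. Labels are assumed to be coded 0/1.

```python
import random
from statistics import mean

def stratified_folds(labels, k, rng):
    """Assign each case to one of k folds, preserving class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def auc(scores, labels):
    """Wilcoxon-Mann-Whitney estimate of the AUC (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_by_mean_difference(X, y):
    """Stand-in for F: rank features by absolute difference in class means."""
    def diff(j):
        m1 = mean(row[j] for row, lab in zip(X, y) if lab == 1)
        m0 = mean(row[j] for row, lab in zip(X, y) if lab == 0)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=diff, reverse=True)

def centroid_scores(X_train, y_train, X_test, feats):
    """Stand-in for C: squared distance to the class-0 centroid minus that to the class-1 centroid."""
    c1 = [mean(r[j] for r, lab in zip(X_train, y_train) if lab == 1) for j in feats]
    c0 = [mean(r[j] for r, lab in zip(X_train, y_train) if lab == 0) for j in feats]
    out = []
    for r in X_test:
        d1 = sum((r[j] - c) ** 2 for j, c in zip(feats, c1))
        d0 = sum((r[j] - c) ** 2 for j, c in zip(feats, c0))
        out.append(d0 - d1)
    return out

def select_feature_count(X, y, sizes, repeats=5, k=5, seed=0):
    """Return the candidate size with the highest average cross-validated AUC."""
    rng = random.Random(seed)
    # The same splits are reused for every candidate size.
    splits = [stratified_folds(y, k, rng) for _ in range(repeats)]
    avg_auc = {}
    for size in sizes:
        per_repeat = []
        for folds in splits:
            scores = [0.0] * len(y)
            for test_idx in folds:
                held_out = set(test_idx)
                train_idx = [i for i in range(len(y)) if i not in held_out]
                X_tr = [X[i] for i in train_idx]
                y_tr = [y[i] for i in train_idx]
                # Features are re-ranked inside every fold, on training cases only.
                feats = rank_by_mean_difference(X_tr, y_tr)[:size]
                preds = centroid_scores(X_tr, y_tr, [X[i] for i in test_idx], feats)
                for i, s in zip(test_idx, preds):
                    scores[i] = s
            per_repeat.append(auc(scores, y))
        avg_auc[size] = mean(per_repeat)
    best = max(sizes, key=lambda s: avg_auc[s])
    return best, avg_auc
```

The important design point is that the ranking is recomputed on each reduced training set, so the left-out fold never influences which features are chosen.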
We investigated two methods in the outer loop. The
first method is a stratified 10-times-repeated fivefold
cross-validation (10 × 5-CV). In each of the five cross-validation
iterations, 80% of the data were first used as
input to the inner loop procedure for feature selection
and training the classifier with the selected features,
and finally, the remaining 20% of the data were used
to test the classifier. The 95% CI for the area under
the receiver operating characteristics curve (AUC) was
approximated by [AUC - 1.96 SEM, AUC + 1.96
SEM]. The SEM was estimated by averaging the 10
estimates of the standard error of the mean obtained
from the five different estimates of the AUC produced
by the 5-CV.
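As a rough illustration of the interval just described, the sketch below takes one list of five per-fold AUCs for each of the 10 repetitions. It assumes that the "standard error of the mean" within a repetition is the standard deviation of the fold AUCs divided by the square root of the number of folds; the function name `repeated_cv_ci` is hypothetical.

```python
from statistics import mean, stdev

def repeated_cv_ci(fold_aucs, z=1.96):
    """fold_aucs: one list of per-fold AUCs per repetition (e.g. 10 lists of 5)."""
    rep_means = [mean(rep) for rep in fold_aucs]
    # SEM within each repetition: SD of the fold AUCs / sqrt(number of folds).
    rep_sems = [stdev(rep) / len(rep) ** 0.5 for rep in fold_aucs]
    auc = mean(rep_means)
    sem = mean(rep_sems)  # average of the per-repetition SEM estimates
    return auc, (auc - z * sem, auc + z * sem)
```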
The second method in the outer loop is a bootstrap-based
method, also known as a smoothed version of
Popovici et al. Breast Cancer Research 2010, 12:R5
http://breast-cancer-research.com/content/12/1/R5
9. Effect of training-samples size and classification difficulty
cross-validation [20]. Efron and Tibshirani [20] proposed
the leave-one-out bootstrap method on the performance
metric error rate, and their technique was recently
extended by Yousef and colleagues [11] to the performance
metric AUC. This method uses a leave-pair-out
(LPO) bootstrap approach to estimate the mean AUC
(mean over training sets) and a “delta method after
bootstrap” to estimate the variability of the estimated
mean AUC. We point out that this variability captures
both the effect of finite training-set size and the effect
of finite testing-set size. In the LPO approach, multiple
(n = 5,000) training sets are obtained by stratified bootstrap
resampling, and each training set is used as input
to the inner-loop procedure for feature selection and
training the classifier with the selected features. In testing,
any pair of cases (one from the positive class and
one from the negative class) is tested on the classifiers
trained on the bootstrap samples that do not contain
the two held-out cases. The Wilcoxon-Mann-Whitney
statistic of the prediction results on pairs of cases is
averaged over all bootstrap-training sets and is used to
estimate the mean AUC. An advantage of this technique
is that it allows estimating the variability of the AUC
estimator by using the influence function method
[11,20]. By assuming that the estimated AUC is asymptotically
normal, the 95% CI of the AUC can be approximated
by [AUC - 1.96 SEM; AUC + 1.96 SEM].
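The point estimate of the LPO bootstrap can be sketched as below. The sketch assumes that the classifiers have already been trained on each bootstrap sample and that their scores for the out-of-bag cases are available; the data layout (`in_bag`, `scores`) and the function names are hypothetical. The influence-function estimate of the variability is not covered.

```python
def wmw_kernel(pos_score, neg_score):
    """1 if the positive case is ranked above the negative one, 0.5 for a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    return 0.5 if pos_score == neg_score else 0.0

def lpo_bootstrap_auc(in_bag, scores, pos_idx, neg_idx):
    """
    in_bag[b] : set of case indices drawn into bootstrap training set b
    scores[b] : dict mapping each out-of-bag case index to the score given
                by the classifier trained on bootstrap set b
    """
    pair_means = []
    for i in pos_idx:
        for j in neg_idx:
            # Use only the classifiers whose training set contains neither case.
            kernels = [
                wmw_kernel(scores[b][i], scores[b][j])
                for b in range(len(in_bag))
                if i not in in_bag[b] and j not in in_bag[b]
            ]
            if kernels:  # pairs never left out together contribute nothing
                pair_means.append(sum(kernels) / len(kernels))
    if not pair_means:
        raise ValueError("no pair was out-of-bag in any bootstrap sample")
    return sum(pair_means) / len(pair_means)
```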
The estimated performance and the associated CIs
from the training and internal-assessment process are
compared with the independent validation performance.
The conditional validation performance was obtained by
selecting features and training the classifier with the
training data set and testing on the validation data set.
This performance is conditional on the particular finite
training set and may vary when the training set varies.
Therefore, we estimated the mean of this conditional
performance where the mean is over multiple training
sets and obtained by bootstrapping the training set multiple
times and averaging the conditional AUCs, as
tested on the validation set [21].
We also estimated the variability of the conditional
validation performance and decomposed the variance
into two components: the variability due to the finite
size of the training set and the variability due to the
finite size of the test set [21]. The training variability
reflects the stability of the classifier performance when
the training set varies, and the testing variability reflects
the expected performance variation for different test
sets.
To compare the ability of the performance estimates
of 10 × 5-CV and the LPO bootstrap to predict the performance
on the independent set, we used a root mean
square error (RMSE) measure, which is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{40}\sum_{F=1}^{5}\sum_{C=1}^{8}\left(A^{\mathrm{internal}}_{F,C} - A^{\mathrm{independent}}_{F,C}\right)^{2}},$$
where F and C index feature selection and classifier,
respectively, A denotes the mean AUC; the superscript
“internal” can be “10 × 5-CV” or “LPO bootstrap.”
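A direct transcription of this definition is sketched below, assuming the mean AUCs are stored in dictionaries keyed by (feature selection, classifier); the function name `rmse` and the dictionary layout are illustrative choices.

```python
from math import sqrt

def rmse(internal_auc, independent_auc):
    """Both arguments map (feature_selection, classifier) -> mean AUC."""
    if internal_auc.keys() != independent_auc.keys():
        raise ValueError("both estimates must cover the same models")
    squared = [(internal_auc[m] - independent_auc[m]) ** 2 for m in internal_auc]
    # With 5 feature-selection methods and 8 classifiers, len(squared) is 40.
    return sqrt(sum(squared) / len(squared))
```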
Estimation of predictor learning over increasing training-set
size
Predictor learning was evaluated for the models that
performed nominally the best in independent validation
for each of the three prediction problems. All 230
cases were included in the analysis to fit learning
curves to these three models. For the ER-status endpoint,
10 different training-sample sizes, ranging from
n = 60 to n = 220 by increments of 20, were used to
estimate the dependence of the performance parameters
on the sample size. For each sample size, 10
different random samples were drawn from the full set
by stratified sampling, and fivefold cross-validation was
used to assess the error rate and AUC of the models
where all the parameters of the models were recalculated.
A similar approach was taken for the pCR (n =
50, 70, ..., 210) and “pCR in ER-negative cancer” predictors
(n = 25, 40, ..., 85). By following the work of
Fukunaga [22], the following learning-curve model was
fit to the resulting AUC: Y = a+b/TrainingSize.
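Because the model Y = a + b/TrainingSize is linear in 1/TrainingSize, it can be fitted by ordinary least squares. The sketch below assumes an unweighted fit, which the text does not specify; the function names are hypothetical.

```python
def fit_inverse_learning_curve(sizes, aucs):
    """Least-squares fit of Y = a + b / n; returns (a, b)."""
    x = [1.0 / n for n in sizes]
    n_pts = len(x)
    mx = sum(x) / n_pts
    my = sum(aucs) / n_pts
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, aucs))
    b = sxy / sxx
    a = my - b * mx  # a is the plateau reached as n grows without bound
    return a, b

def predict_auc(a, b, n):
    """Projected AUC for a training set of size n."""
    return a + b / n
```

Under this model, a negative b corresponds to performance that improves with sample size, and a is the projected plateau.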
Congruence of different predictors at gene and functional
pathway level
We were interested in examining the congruence of
informative features that were selected by different
methods for the same prediction endpoint and also for
different endpoints. Both gene-level and pathway-level
analyses were performed as described previously [23].
MetaCore protein-function classification was used to
group genes into protein functions, and GeneGo Pathway
maps were used for functional classification of
predictive features. We assessed congruency by using
the kappa statistic. The input for kappa involves “learners”
that classify a set of objects into categories. We
considered each feature-selection method as a learner
and each probe set as an object. The probe sets used
in this analysis are presented in Additional file 4. Each
probe set from the rank-ordered lists is categorized by
each feature-selection method either as 1 (that is,
selected as informative) or 0 (that is, nonselected). By
using such a 0/1 matrix for all probe sets × all feature-selection
methods for every prediction endpoint
as input, we can calculate Cohen’s kappa function
for the congruency. For pathway-level analysis, we
mapped the probe sets to pathway lists by using
hypergeometric enrichment analysis. The pathways are
ranked by enrichment P values, and the top n pathways
(n equals the number of genes in the input list
for comparison and consistency between the two
levels) were selected for presentation.
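The kappa computation on the 0/1 selection matrix described above can be sketched as follows. Cohen's kappa is defined for two learners, so the sketch assumes that agreement across the five feature-selection methods is summarized by averaging the pairwise values; the text does not state how the pairs were combined, and the function names are hypothetical.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length 0/1 selection vectors."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # both learners made the same constant choice
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(selection):
    """selection: dict mapping learner name -> 0/1 vector over the same probe sets."""
    pairs = list(combinations(selection.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```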
All statistical analysis was performed by using R
software.
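Although the original analysis was run in R, the hypergeometric enrichment P value used for the pathway mapping can be illustrated with a short Python sketch. It assumes a one-sided test of over-representation; the argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

Pathways would then be ranked by this P value, smallest first.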
Results
Difficulty of the classification problems
Three distinct classification problems were studied: (a)
ER-status prediction, including 80 ER-positive (62%) and
50 ER-negative training cases (38%); (b) pCR prediction,
including 33 cases with pCR (25%) and 97 cases with
residual cancer (75%) for training; and (c) pCR prediction
for ER-negative cancers, including 27 training cases with
pCR (54%) and 23 with residual cancer (46%). Figure 1
shows the cumulative information scores for the three
endpoints: larger cumulative information is an indicator
for a simpler classification problem. The obtained ranking
implies that the three endpoints represent different
degrees of classification difficulty.
We also assessed the significance of the utility scores by
using permutation tests (10,000 permutations) for computing
the raw P values, followed by Benjamini-Hochberg
correction for multiple testing. For the ER-status endpoint,
1,502 features with significant utility scores (P
value < 0.0001) were used. For the pCR endpoint (all
cases), 252 significant features were used, and for pCR among
ER-negative cancers, only five features (corresponding
to the A2M [HGNC:7], RNMT [HGNC:10075],
KIAA0460 [HGNC:29039], AHNAK [HGNC:347], and
ACSM1 [HGNC:18049] genes) were used.
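The permutation test and the Benjamini-Hochberg correction mentioned above can be sketched as follows. The statistic is passed in as a function because the utility score itself is defined elsewhere in the article; the add-one correction in the permutation P value is an assumption of this sketch, and the function names are hypothetical.

```python
import random

def permutation_p_value(stat, values, labels, n_perm=10000, seed=0):
    """Fraction of label permutations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = stat(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps P above zero

def benjamini_hochberg(p_values):
    """Return BH-adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```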
Effect of feature-selection methods and classification
algorithms on cross-validation performance
Figure 2 illustrates the average cross-validation AUC
estimated by 10 × 5-CV for all predictors, stratified by
feature-selection method (left column). All feature-selection
methods performed similarly in combination with
various classification algorithms for a given endpoint.
The two non-t test-based methods, FS4 and FS5,
showed slightly better performances than did t test-based
feature selection for the most difficult prediction
Figure 1 Relative complexity of the three prediction problems. The cumulative information values have been scaled such that the
maximum value is 1. To make the curves comparable and to take into account the sample size, the ratio between the number of features used
in the cumulative information (F) and the sample size is used on the horizontal axis. Larger values of the cumulative information indicate simpler
problems.
endpoint “pCR on ER-negative cancers” in cross validation,
but confidence intervals widely overlapped. Additional
file 5 shows the average error rates and AUCs
generated from 10 × 5-CV for each prediction model
applied to all three classification problems, along with
the average number of features selected. Interestingly,
the number of selected features did not increase as the
prediction problem became more difficult. For the most
difficult problem, the number of selected features was
lower than that for the moderately difficult problem.
This is probably because of the lack of informative features:
as the classification problem becomes more difficult,
fewer features are informative for the outcome
(also see Figure 1).
Figure 2 also shows the variability of the classification
error rates and AUC estimated through 10 × 5-CV for
all predictors, stratified by classification algorithm (right
column). All methods performed similarly. The prediction
endpoint (that is, classification difficulty) had the
greatest effect on the cross-validation AUC. The effects
of feature-selection method and choice of classifier algorithm
were modest.
Bootstrap and independent-validation results
Figure 3 shows the estimated AUCs obtained with 10 ×
5-CV (black square), LPO bootstrap (black circle), and
the conditional AUC (blue circle) on the independent
validation set and its variability (blue error bar representing
± 2 SD) and mean (red cross). Additional file 5
includes the internal (10 × 5-CV and LPO bootstrap)
and independent validation-performance metrics for
each predictor. Both internal-estimation methods
yielded AUCs that were very close, well within 2 standard
deviations of the mean, to the conditional and
mean AUCs observed in the independent validation.
Internal-performance estimates generated within the
training set only slightly overestimated the performance
relative to independent validation, indicating both that
the modeling approach was correct and that no strong
batch effect occurred between training and validation
sets. Simpler linear methods, such as LREG, LDA, and
DLDA, performed generally well in both internal and
independent validation, and these methods were among
the top five nominally best-performing models for all
prediction endpoints [see Additional file 5]. The non-t
test-based feature-selection methods (FS4, FS5) that
showed good results in cross validation also performed
well in independent validation and were included in
four of the top five models for each endpoint. However,
the 95% CIs of the point estimates overlap broadly for
all predictors, and no single strategy emerged as clearly
superior for any particular endpoint.
To assess the confidence-interval estimation, we calculated
the RMSE for the AUC estimates obtained with 10
× 5-CV and LPO bootstrap for all the three endpoints.
Leave-pair-out bootstrap performed better than 10 × 5-CV
in terms of the agreement with the mean AUC estimated
in the independent-validation set: RMSEs for
LPO bootstrap were 0.0484, 0.0491, and 0.357 in comparison
with 0.0562, 0.0713, and 0.449 for 10 × 5-CV
for the ER status, pCR, and pCR within ER-negative
endpoints, respectively.
Figure 3 clearly shows that the variability of the estimated
classification performance increases as the level
of classification difficulty increases. This implies that,
to achieve the same level of statistical precision of the
estimated performance, more cases are needed for a
more-difficult endpoint. Figure 3 also shows both the
conditional (blue circle) and mean validation AUCs
(red cross). The larger the difference between the conditional
validation AUC and the mean validation AUC,
the less stable the predictor is with respect to varying
the training sets. A quantitative measure of classifier
stability is the training variability, and we have decomposed
the variability of the conditional validation AUC
shown in Figure 3 into two components (training
variability and testing variability) and put the results in
Additional file 5.
Predictor-performance and sample-size estimations
through learning curves
To estimate the training-set size that is necessary to
develop predictors that operate near their respective plateaus,
we examined how the performance characteristics
of each of the nominally best predictors for each endpoint
improved as the training-set size increased. For
ER-status prediction, we selected QDA with FS1 (conditional
validation AUC = 0.939); for pCR prediction
including both the ER-positive and -negative cancers, we
selected LREG with FS5 (conditional validation AUC =
0.805); and for pCR in ER-negative cancers, we selected
LREG with FS4 (conditional validation AUC = 0.627).
Figure 4 shows the observed changes in average AUCs
for each of the classifiers as the training-set size
increased from 60 to 220 (or from 25 to 85 for pCR prediction
in ER-negative cancers) and the projected
improvements for assumed larger training sets. The
results indicate that for the easiest problem (ER), the
predictor seems to perform at its best with a sample
size around 80 to 100. For the moderately difficult problem
(pCR), the steady increase of the learning curve
suggests that the performance of the model can be
improved by increasing the sample size, beyond the
highest value currently tested (220). For the pCR in ER-negative
cancer endpoint, the learning curves manifested
a very modest and gradual improvement in performance
between training sample sizes of 25 and 85, suggesting
that either too few samples were available for a reliable
estimation of the learning curve or that limited information
in the mRNA space exists to predict this particular
outcome with the methods applied in this analysis. The
only learning curve that had a slope significantly different
from 0 was the one for the pCR endpoint (P = 0.001;
ER endpoint, P = 0.05; pCR in ER-negative endpoint, P
= 0.365).
Functional analysis of predictive features
Our results demonstrate that several different feature
sets can yield predictors with statistically similar
performances [8-10,24]. This may occur because the
various probe sets that represent different genes capture
information from the same complex molecular pathways
that determine a particular clinical outcome [25]. In
other words, different features measure different components
of the same informative biologic pathway. To test
this concept, we mapped each of the 15 feature sets
used in the final validation models to known biologic
pathways. The different feature sets selected for a particular
prediction endpoint had a high level of congruency
at both the gene and the pathway levels across all the
Figure 2 Boxplots of the estimated area under the curve (AUC), stratified by feature-selection and classification methods. The boxplots
show the mean AUC in 10 times fivefold cross validation (CV). The left column contains the estimated AUC stratified by the feature-selection
method, and the right column contains the estimated AUC stratified by the classification method.
five different ranking methods (Table 2). The selected
gene sets and pathways were also rather similar to each
other for the ER and pCR prediction endpoints. However,
the genes and pathways predictive of pCR in ER-negative
cancers were very different from the other two
informative gene sets.
Additional file 6 contains the pathway-enrichment
tables for the three endpoints, including pathways with
enrichment P values < 0.1. Thirty-two pathways contributed
to the prediction of ER status; 36, to pCR prediction;
and 11, to pCR prediction within ER-negative
cancers across the five feature-selection methods. For
the ER endpoint, development, cell adhesion, cytoskeleton
remodeling, DNA damage, apoptosis, and ER transcription
factor activity were the most significant
pathway elements common to all informative feature
sets. We also noted that most pathways that were
involved in pCR prediction (31 of 36) were the same as
those involved in ER-status prediction. This is consistent
with the known association between pCR rate and
ER status [7]. Estrogen receptor-negative cancers had
significantly higher pCR rates than ER-positive cancers
[Figure 3 plot. Panel titles: ER status, pCR, pCR|ER−; row labels: FS1 to FS5; vertical axis: AUC (0.2 to 1); horizontal axis: DLDA, LDA, KNN3, KNN11, SVM05, SVM2, LREG, QDA; legend: 10x5−CV (± 2SD), LPO bootstrap (± 2SD), conditional indep. valid. (± 2SD), mean of indep. valid., chance line.]
Figure 3 Graphic summaries of the estimated and observed areas under the curve (AUCs) for each of the 120 models. For each
combination of feature-selection method and classification algorithm, the AUCs ± 2 standard deviations are plotted. Mean AUCs obtained from
10 × 5-CV (cross-validation; black square), LPO bootstrap (black dot), and the conditional (blue circle) and mean (red cross) validation AUCs are
shown.
(54% pCR in ER-negative cancers versus 7.5% pCR in
ER-positive cancers; χ² test P value = 1.068e-08). The
pathways that were selected for prediction of pCR in
ER-negative cancers were distinct from the pathways
that were predictive of pCR in all patients and included
immune response-related pathways (IL-2 and T-helper
cell activation), opioid-receptor signaling, and endothelial
cell-related pathways.
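The reported χ² P value can be checked from the training-set counts given earlier in the Results. The sketch below assumes a 2 × 2 test with Yates continuity correction (the default of R's chisq.test for 2 × 2 tables); the ER-positive count of 6 pCR cases out of 80 is derived here from the totals (33 pCR cases overall, 27 of them ER-negative) rather than stated explicitly in the text. The function name is hypothetical.

```python
from math import erfc, sqrt

def chi2_yates_2x2(a, b, c, d):
    """Chi-square test with Yates continuity correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Rows: ER-negative (27 pCR, 23 residual cancer), ER-positive (6 pCR, 74 residual cancer).
chi2, p = chi2_yates_2x2(27, 23, 6, 74)
```

With these counts the corrected statistic is about 32.7 on 1 degree of freedom, which agrees with the P value quoted above.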
Discussion
The goal of this analysis was to examine how the choice
of a univariate feature-selection method and classification
algorithm may influence the performance of predictors
under varying degrees of classification difficulty.
We examined the influence of changing two critical
components, feature selection and classification algorithm
in the predictor development process, for three
different prediction problems that represented three
levels of difficulty in a clinically annotated human breast
cancer data set. Classification of breast cancer into ER-positive
or -negative categories is an easy classification
problem; the large number of informative probe sets
and high information content of the features allow clear
separation of the groups. The AUC values for the 40
different prediction models for this endpoint ranged
from 0.875 to 0.939 in the independent validation set.
Prediction of pCR across all breast cancers, including
both ER-negative and ER-positive cases, represented a
slightly more difficult prediction problem with AUCs
ranging between 0.61 and 0.80 in the validation set. Prediction
of pCR in the molecularly more homogeneous
ER-negative breast cancer subpopulation proved to be
the most difficult classification challenge: the validation
AUCs ranged from 0.34 to 0.62. No predictor-development
strategy emerged as clearly superior for any of the
classification problems. The 95% CI of the prediction
accuracies overlapped broadly for most of the predictors.
However, LDA, DLDA, LREG, and QDA classification
algorithms were consistently among the best-performing
models for each problem. Interestingly, KNN3 and SVM
methods were often among the worst-performing models
in independent validation, even though these reached
relatively high AUC values in cross validation. It is possible
that further fine tuning of parameters for these
Figure 4 Learning curves for the best predictors for each of the three endpoints. For each endpoint, the learning curve of the best-performing
model on the validation set was estimated by fivefold cross-validation for gradually increasing sample sizes. The plot shows both
the estimated performance for different sample sizes and the fitted curve. The quadratic discriminant analysis (QDA) classifier required more than
60 samples, so the minimum sample size for it was 80. Note the nonlinear scale of the x-axis.
more-complex classifiers (in the sense of an implementable
decision boundary) could have improved predictive
performance. We examined only the radial function kernel
for SVM with two a priori set kernel parameters g =
0.5 and 2.0, and the parameter C (cost of misclassification)
was also fixed at 10. Fixing these parameters may
have resulted in “less than optimally trained” models
that could lead to added variability in the performance
of the classifiers. Also, we examined only two versions
of KNN with a priori set k of 3 and 11, and found that
KNN11 outperformed KNN3. Low values of k yield
local classifiers with low bias but high variance, whereas
higher values lead to more-global classifiers with higher
bias and lower variance; exploring a broader range of k
values could have optimized prediction results. Optimizing
the parameters g or k is not a straightforward task.
It should be done within the inner cross-validation process,
just as is done with feature selection. Fine tuning
different model parameters outside of the two-stage
cross-validation process would lead to model-selection
bias, or optimization bias [19].
An interesting observation was that simple feature-selection
methods that ranked features based on
difference in means performed very well in both
cross-validation and independent validation relative to
the more commonly used t statistic-based ranking.
Four of the top five models for each prediction problem
used features selected by the non-t test-based
methods. However, it is important to recognize that
all of the feature-selection methods that we examined
represented univariate filtering approaches that rank
features individually and independent of the classification
method. It is possible that nonparametric or
multivariate feature-selection methods could yield different
results. Penalized feature-selection methods,
which embed feature selection in the classifier fitting
step, may also have advantages, because features that
might not be discriminatory individually could be
jointly predictive in combination with other features.
At least one article suggested that multivariate sparse
penalized likelihood methods, including lasso and
elastic net, might have a slight edge compared with
univariate filtering [26]. Other publications that compared
several univariate and multivariate feature-selection
methods in public cancer data sets by using
10-fold cross-validation estimates found that simple
univariate feature-selection methods often outperformed
more-complex multivariate approaches
[27,28].
Our data demonstrate that many different feature sets
and classification methods can yield similarly accurate
predictors for a given endpoint. When we mapped the
feature sets generated by five different univariate feature-selection
methods to biologic pathways, each
method tended to identify similar genes and pathways.
The biologic pathways that were implicated in ER-status
or pCR prediction were distinct from the pathways that
were predictive of pCR in ER-negative cancers. This
pathway-level analysis is hypothesis generating and will
require further laboratory validation to determine the
importance of the identified pathways (for example,
immune response, endothelial-cell regulation, G-protein
signaling) in the biology of chemotherapy response in
ER-negative breast cancer.
To estimate potential improvements in predictive performance
of the nominally best predictors for each classification
problem, we pooled all cases and carried out a
series of split-sample training and validation analyses in
which the predictors were trained on increasingly larger
data sets. For the easy classification problem (ER-status),
relatively small sample sizes (80 to 100 samples) were
enough for constructing excellent predictors. In contrast,
for the moderately difficult problem (pCR prediction),
the accuracy of the model steadily improved as
the sample size increased. For the most difficult problem,
pCR prediction in ER-negative cancer, a minimal
improvement was observed over a range of 25 to 85
training cases. It is important to note that the pCR and
ER status predictors trained on 80 cases showed good
or excellent conditional AUCs (0.65 and 0.94, respectively).
This modest performance and limited improvement
of the pCR predictor for ER-negative cancer may
be due to (a) too small a sample size for training or (b) the
incompleteness of the mRNA expression-based feature
space, meaning that this class-separation problem cannot
be fully accomplished by using information only
from the available probes by using the methods that we
applied. However, fitting learning curves to preliminary
data sets could assist investigators in estimating
Table 2 Congruencies across different endpoints and different feature-selection methods

Same endpoint but different feature selection (FS)
Endpoint     Gene-level   Level of canonic-pathway maps
ER status    0.541        0.573
pCR          0.544        0.572
pCR (ER−)    0.593        0.532

Same FS but different endpoints
FS           Gene-level   Level of canonic-pathway maps
FS1          0.300        0.290
FS2          0.299        0.274
FS3          0.291        0.278
FS4          0.295        0.291
FS5          0.272        0.282

The table shows that kappa statistics (that is, congruency) are high for
different feature-selection methods for the same endpoint but are low for the
same feature-ranking method for different endpoints. Both gene-level and
pathway-level analyses show similar results.
sample-size requirements for a particular prediction problem
for any given model.
Conclusions
This analysis confirms that it is possible to build multigene
classifiers of clinical outcome that hold up in independent
validation. Predictor performance is determined
largely by an interplay between training-sample size and
classification difficulty. Variations on univariate feature-selection
methods and choice of classification algorithm
had only a modest impact on predictor performance,
and it is clear that within our statistical precision, several
equally good predictors can be developed for each
of our classification problems. Pathway-level analysis of
informative features selected by different methods
revealed a high level of congruency. This indicates that
similar biologic pathways were identified as informative
for a given prediction endpoint by the different univariate
feature-selection methods. The independent validation
results also showed that internal 10 × 5-CV and
LPO bootstrap both yielded reasonably good and only
slightly optimistic performance estimates for all the
endpoints.
Additional file 1: Supplemental Table S1. Clinical data for all the
patients in the training and validation sets.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S1.xls ]
Additional file 2: Supplemental Table S2. Quality control results.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S2.xls ]
Additional file 3: Supplemental Table S3. Pathways mapping for all
endpoints.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S3.xls ]
Additional file 4: Supplemental methods. Pseudo-code description of
the two-level external cross-validation scheme.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S4.pdf ]
Additional file 5: Supplemental Table S4. Features (probesets) selected
in the 120 models.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S5.xls ]
Additional file 6: Supplemental Table S5. Estimated and validation
performance of all models.
[ http://www.biomedcentral.com/content/supplementary/bcr2468-S6.xls ]
Abbreviations
10 × 5-CV: repeated (10 times) fivefold cross validation; AUC: area under the
receiver operating characteristic curve; CI: confidence interval; DLDA:
diagonal linear discriminant analysis; ER: estrogen receptor; KNN: k nearest-neighbors
classifier; LDA: linear discriminant analysis; LPO: leave-pair-out
bootstrap; LREG: logistic regression classifier; pCR: pathologic complete
response; QDA: quadratic discriminant analysis; RD: residual invasive cancer;
RMSE: root mean square error; SD: standard deviation; SEM: standard error of
the mean; SVM: support vector machine.
Acknowledgements
This research was supported by grants from the NCI R-01 program (LP), The
Breast Cancer Research Foundation (LP and WFS), The MD Anderson Cancer
Center Faculty Incentive Funds (WFS), and the Commonwealth Cancer
Foundation (LP, WFS). VP and MD acknowledge the support of the Swiss
National Science Foundation NCCR Molecular Oncology. Certain commercial
materials and equipment are identified to specify experimental procedures
adequately. In no case does such identification imply recommendation or
endorsement by the FDA, nor does it imply that the items identified are
necessarily the best available for the purpose. The views presented in this
article do not necessarily reflect those of the U.S. Food and Drug
Administration.
Author details
1 Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland.
2 Center for Devices and Radiological Health, US Food and Drug Administration, 10903 New Hampshire Ave WO62-3124, Silver Springs, MD 20993-0002, USA.
3 Nuvera Biosciences, 400 West Cummings Park, Woburn, MA 01801, USA.
4 GeneGo, Inc., 500 Renaissance Drive, St. Joseph, MI 49085, USA.
5 Department of Systems Biology, Vavilov Institute for General Genetics, Russian Academy of Sciences, Gubkina str. 3 korp. 1, Moscow 119333, Russia.
6 Department of Biostatistics, P.O. Box 301439, Houston, TX 77230-1439, USA.
7 Department of Breast Medical Oncology, P.O. Box 301439, Houston, TX 77230-1439, USA.
8 Swiss NCCR Molecular Oncology, Swiss Institute for Experimental Cancer Research (ISREC), School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland.
9 National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA.
10 Department of Pathology of the University of Texas M. D. Anderson Cancer Center, P.O. Box 301439, Houston, TX 77230-1439, USA.
Authors’ contributions
LP, VP, and LS designed the study. VP, WFS, and WC performed the
experiments. VP, WC, BG, CH, WS, FS, YN, MT, AI, TN, KH, MD, and LP
performed the statistical analyses and interpreted the results. VV, DB, GH,
WFS, and LP contributed the clinical, pathologic, and molecular data. All
authors contributed to the writing of the manuscript and read and
approved the manuscript.
Competing interests
The authors declare that they have no competing interests.
Received: 14 October 2009 Revised: 18 December 2009
Accepted: 11 January 2010 Published: 11 January 2010
References
1. Vijver van de MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW,
Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D,
Witteveen A, Glas A, Delahaye L, Velde van der T, Bartelink H, Rodenhuis S,
Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a
predictor of survival in breast cancer. N Engl J Med 2002, 347:1999-2009.
2. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG,
Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N: A
multigene assay to predict recurrence of tamoxifen-treated, node-negative
breast cancer. N Engl J Med 2004, 351:2817-2826.
3. Ross JS, Hatzis C, Symmans WF, Pusztai L, Hortobágyi GN: Commercialized
multigene predictors of clinical outcome for breast cancer. Oncologist
2008, 13:477-493.
4. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods
for the classification of tumors using gene expression data. J Am Statist
Assoc 2002, 97:77-87.
5. Perou CM, Sørlie T, Eisen MB, Rijn van de M, Jeffrey SS, Rees CA, Pollack JR,
Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C,
Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, Botstein D: Molecular
portraits of human breast tumours. Nature 2000, 406:747-752.
6. Pusztai L, Ayers M, Stec J, Clark E, Hess K, Stivers D, Damokosh A, Sneige N,
Buchholz TA, Esteva FJ, Arun B, Cristofanilli M, Booser D, Rosales M,
Valero V, Adams C, Hortobagyi GN, Symmans WF: Gene expression profiles
obtained from fine-needle aspirations of breast cancer reliably identify
routine prognostic markers and reveal large-scale molecular differences
between estrogen-negative and estrogen-positive tumors. Clin Cancer
Res 2003, 9:2406-2415.
7. Andre F, Mazouni C, Liedtke C, Kau S-W, Frye D, Green M, Gonzalez-Angulo
AM, Symmans WF, Hortobagyi GN, Pusztai L: HER2 expression and
efficacy of preoperative paclitaxel/FAC chemotherapy in breast cancer.
Breast Cancer Res Treat 2008, 108:183-190.
8. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in
breast cancer: is there a unique set? Bioinformatics 2005, 21:171-178.
9. Hess KR, Anderson K, Symmans WF, Valero V, Ibrahim N, Mejia JA, Booser D,
Theriault RL, Buzdar AU, Dempsey PJ, Rouzier R, Sneige N, Ross JS,
Vidaurre T, Gómez HL, Hortobagyi GN, Pusztai L: Pharmacogenomic
predictor of sensitivity to preoperative chemotherapy with paclitaxel
and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J
Clin Oncol 2006, 24:4236-4244.
10. Boutros PC, Lau SK, Pintilie M, Liu N, Shepherd FA, Der SD, Tsao M-S,
Penn LZ, Jurisica I: Prognostic gene signatures for non-small-cell lung
cancer. Proc Natl Acad Sci USA 2009, 106:2824-2828.
11. Yousef WA, Wagner RF, Loew MH: Estimating the uncertainty in the
estimated mean area under the ROC curve of a classifier. Pattern Recog
Lett 2005, 26:2600-2610.
12. Symmans WF, Ayers M, Clark EA, Stec J, Hess KR, Sneige N, Buchholz TA,
Krishnamurthy S, Ibrahim NK, Buzdar AU, Theriault RL, Rosales MFM,
Thomas ES, Gwyn KM, Green MC, Syed AR, Hortobagyi GN, Pusztai L: Total
RNA yield and microarray gene expression profiles from fine-needle
aspiration biopsy and core-needle biopsy samples of breast carcinoma.
Cancer 2003, 97:2960-2971.
13. Liedtke C, Mazouni C, Hess KR, André F, Tordai A, Mejia JA, Symmans WF,
Gonzalez-Angulo AM, Hennessy B, Green M, Cristofanilli M, Hortobagyi GN,
Pusztai L: Response to neoadjuvant therapy and long-term survival in
patients with triple-negative breast cancer. J Clin Oncol 2008,
26:1275-1281.
14. Ayers M, Symmans WF, Stec J, Damokosh AI, Clark E, Hess K, Lecocke M,
Metivier J, Booser D, Ibrahim N, Valero V, Royce M, Arun B, Whitman G,
Ross J, Sneige N, Hortobagyi GN, Pusztai L: Gene expression profiles
predict complete pathologic response to neoadjuvant paclitaxel and
fluorouracil, doxorubicin, and cyclophosphamide chemotherapy in
breast cancer. J Clin Oncol 2004, 22:2284-2293.
15. Peintinger F, Anderson K, Mazouni C, Kuerer HM, Hatzis C, Lin F,
Hortobagyi GN, Symmans WF, Pusztai L: Thirty-gene pharmacogenomic
test correlates with residual cancer burden after preoperative
chemotherapy for breast cancer. Clin Cancer Res 2007, 13:4078-4082.
16. Stec J, Wang J, Coombes K, Ayers M, Hoersch S, Gold DL, Ross JS, Hess KR,
Tirrell S, Linette G, Hortobagyi GN, Symmans WF, Pusztai L: Comparison of
the predictive accuracy of DNA array-based multigene classifiers across
cDNA arrays and Affymetrix genechips. J Mol Diagn 2005, 7:357-367.
17. Rouzier R, Perou CM, Symmans WF, Ibrahim N, Cristofanilli M, Anderson K,
Hess KR, Stec J, Ayers M, Wagner P, Morandi P, Fan C, Rabiul I, Ross JS,
Hortobagyi GN, Pusztai L: Breast cancer molecular subtypes respond
differently to preoperative chemotherapy. Clin Cancer Res 2005,
11:5678-5685.
18. Ho TK, Basu M: Complexity measures of supervised classification
problems. IEEE Trans Patt Anal Mach Intel 2002, 24:289-300.
19. Wood IA, Visscher PM, Mengersen KL: Classification based upon gene
expression data: bias and precision of error rates. Bioinformatics 2007,
23:1363-1370.
20. Efron B, Tibshirani R: Improvements on cross-validation: the 632+
bootstrap method. J Am Statist Assoc 1997, 92:548-560.
21. Yousef WA, Wagner RF, Loew MH: Assessing classifiers from two
independent data sets using ROC analysis: a nonparametric approach.
IEEE Trans Patt Anal Mach Intel 2006, 28:1809-1817.
22. Fukunaga K, Hayes RR: Effects of sample size in classifier design. IEEE
Trans Patt Anal Mach Intel 1989, 11:873-885.
23. Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J,
Stephens R, Baseler MW, Lane HC, Lempicki RA: The DAVID gene
functional classification tool: a novel biological module-centric algorithm
to functionally analyze large gene lists. Genome Biol 2007, 8:R183.
24. Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DSA, Nobel AB, van’t Veer LJ,
Perou CM: Concordance among gene-expression-based predictors for
breast cancer. N Engl J Med 2006, 355:560-569.
25. Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B,
Desmedt C, Ignatiadis M, Sengstag T, Schütz F, Goldstein DR, Piccart M,
Delorenzi M: Meta-analysis of gene expression profiles in breast cancer:
toward a unified understanding of breast cancer subtyping and
prognosis signatures. Breast Cancer Res 2008, 10:R65.
26. Zucknick M, Richardson S, Stronach EA: Comparing the characteristics of
gene expression profiles derived by univariate and multivariate
classification methods. Stat Appl Genet Mol Biol 2008, 7:Article7.
27. Lai C, Reinders MJT, van’t Veer LJ, Wessels LFA: A comparison of univariate
and multivariate gene selection techniques for classification of cancer
datasets. BMC Bioinformatics 2006, 7:235.
28. Lecocke M, Hess KR: An empirical study of univariate and genetic
algorithm-based feature selection in binary classification with microarray
data. Cancer Inform 2007, 2:313-327.
doi:10.1186/bcr2468
Cite this article as: Popovici et al.: Effect of training-sample size and
classification difficulty on the accuracy of genomic predictors. Breast
Cancer Research 2010 12:R5.
10 The MicroArray Quality Control (MAQC)-II
study of common practices for the development
and validation of microarray-based predictive
models
• Nature Biotechnology, 28(8):827–U109
• IF: 41.667
• number of citations: 339
• personal contribution (5%): responsible for SIB participation –
experimental design and implementation, manuscript writing
The MicroArray Quality Control (MAQC)-II study of
common practices for the development and validation
of microarray-based predictive models
MAQC Consortium*
*A full list of authors and affiliations appears at the end of the paper. Correspondence should be addressed to L.S. (leming.shi@fda.hhs.gov or leming.shi@gmail.com).
Received 2 March; accepted 30 June; published online 30 July 2010; doi:10.1038/nbt.1665
© 2010 Nature America, Inc. All rights reserved.
Nature Biotechnology, advance online publication. Articles.
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of
these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets
to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in
rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many
combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of
the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model
performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar
performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees
and independent investigators that evaluate methods for global gene expression analysis.
As part of the United States Food and Drug Administration’s (FDA’s)
Critical Path Initiative to medical product development
(http://www.fda.gov/oc/initiatives/criticalpath/), the MAQC consortium began in
February 2005 with the goal of addressing various microarray reliability
concerns raised in publications [1–9] pertaining to reproducibility
of gene signatures. The first phase of this project (MAQC-I) extensively
evaluated the technical performance of microarray platforms
in identifying all differentially expressed genes that would potentially
constitute biomarkers. The MAQC-I found high intra-platform reproducibility
across test sites, as well as inter-platform concordance of
differentially expressed gene lists [10–15] and confirmed that microarray
technology is able to reliably identify differentially expressed genes
between sample classes or populations [16,17]. Importantly, the MAQC-I
helped produce companion guidance regarding genomic data submission
to the FDA (http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm079855.pdf).
Although the MAQC-I focused on the technical aspects of gene
expression measurements, robust technology platforms alone are
not sufficient to fully realize the promise of this technology. An
additional requirement is the development of accurate and reproducible
multivariate gene expression–based prediction models, also
referred to as classifiers. Such models take gene expression data from
a patient as input and as output produce a prediction of a clinically
relevant outcome for that patient. Therefore, the second phase of the
project (MAQC-II) has focused on these predictive models [18], studying
both how they are developed and how they are evaluated. For
any given microarray data set, many computational approaches can
be followed to develop predictive models and to estimate the future
performance of these models. Understanding the strengths and limitations
of these various approaches is critical to the formulation
of guidelines for safe and effective use of preclinical and clinical
genomic data. Although previous studies have compared and benchmarked
individual steps in the model development process [19], no
prior published work has, to our knowledge, extensively evaluated
current community practices on the development and validation of
microarray-based predictive models.
Microarray-based gene expression data and prediction models are
increasingly being submitted by the regulated industry to the FDA
to support medical product development and testing applications [20].
For example, gene expression microarray–based assays that have
been approved by the FDA as diagnostic tests include the Agendia
MammaPrint microarray to assess prognosis of distant metastasis in
breast cancer patients [21,22] and the Pathwork Tissue of Origin Test
to assess the degree of similarity of the RNA expression pattern in
a patient’s tumor to that in a database of tumor samples for which
the origin of the tumor is known [23]. Gene expression data have
also been the basis for the development of PCR-based diagnostic
assays, including the xDx Allomap test for detection of rejection of
heart transplants [24].
The possible uses of gene expression data are vast and include diagnosis,
early detection (screening), monitoring of disease progression,
risk assessment, prognosis, complex medical product characterization
and prediction of response to treatment (with regard to safety or
efficacy) with a drug or device labeling intent. The ability to generate
models in a reproducible fashion is an important consideration in
predictive model development.
A lack of consistency in generating classifiers from publicly available
data is problematic and may be due to any number of factors
including insufficient annotation, incomplete clinical identifiers,
coding errors and/or inappropriate use of methodology [25,26]. There
are also examples in the literature of classifiers whose performance
cannot be reproduced on independent data sets because of poor study
design [27], poor data quality and/or insufficient cross-validation of all
model development steps [28,29]. Each of these factors may contribute
to a certain level of skepticism about claims of performance levels
achieved by microarray-based classifiers.
Previous evaluations of the reproducibility of microarray-based
classifiers, with only very few exceptions [30,31], have been limited
to simulation studies or reanalysis of previously published results.
Frequently, published benchmarking studies have split data sets at
random, and used one part for training and the other for validation.
This design assumes that the training and validation sets are produced
by unbiased sampling of a large, homogeneous population of samples.
However, specimens in clinical studies are usually accrued over years
and there may be a shift in the participating patient population and
also in the methods used to assign disease status owing to changing
practice standards. There may also be batch effects owing to
variations over time in tissue analysis or to distinct methods of sample
collection and handling at different medical centers. As a result,
when samples come from sequentially accrued patient populations, with
the first cohort used for developing predictive models and subsequent
patients included in validation (as was done in MAQC-II to mimic
clinical reality), the training and validation sets may differ from each
other in many ways that could influence the prediction performance.
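The difference between a random split and a split by accrual order can be sketched as follows. This is only an illustrative sketch, not the procedure used by the consortium: the cohort, the patient identifiers, the accrual dates and the function names `random_split` and `sequential_split` are all hypothetical.

```python
from datetime import date
import random


def random_split(samples, train_fraction, seed=0):
    """Split at random, as many benchmarking studies do."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def sequential_split(samples, cutoff):
    """Split by accrual date: earlier patients train, later patients validate."""
    train = [s for s in samples if s["accrued"] < cutoff]
    valid = [s for s in samples if s["accrued"] >= cutoff]
    return train, valid


# Hypothetical cohort accrued over several years (made-up values).
cohort = [
    {"id": "P1", "accrued": date(2001, 5, 2)},
    {"id": "P2", "accrued": date(2002, 1, 15)},
    {"id": "P3", "accrued": date(2003, 9, 30)},
    {"id": "P4", "accrued": date(2004, 6, 11)},
    {"id": "P5", "accrued": date(2005, 2, 20)},
    {"id": "P6", "accrued": date(2006, 8, 7)},
]

train, valid = sequential_split(cohort, cutoff=date(2004, 1, 1))
rand_train, rand_valid = random_split(cohort, train_fraction=0.5)
```

With the sequential split, every validation sample is accrued after every training sample, so any drift in patient population or laboratory practice falls between the two sets rather than being averaged out by shuffling.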
The MAQC-II project was designed to evaluate these sources of
bias in study design by constructing training and validation sets at
different times, swapping the test and training sets and also using
data from diverse preclinical and clinical scenarios. The goals of
MAQC-II were to survey approaches in genomic model development
in an attempt to understand sources of variability in prediction
performance and to assess the influences of endpoint signal strength
in data. By providing the same data sets to many organizations for
analysis, but not restricting their data analysis protocols, the project
has made it possible to evaluate to what extent, if any, results depend
on the team that performs the analysis. This contrasts with previous
benchmarking studies that have typically been conducted by single
laboratories. Enrolling a large number of organizations has also made
it feasible to test many more approaches than would be practical for
any single team. MAQC-II also strives to develop good modeling
practice guidelines, drawing on a large international collaboration of
experts and the lessons learned in the perhaps unprecedented effort
of developing and evaluating >30,000 genomic classifiers to predict
a variety of endpoints from diverse data sets.
MAQC-II is a collaborative research project that includes
participants from the FDA, other government agencies, industry
and academia. This paper describes the MAQC-II structure and
experimental design and summarizes the main findings and key
results of the consortium, whose members have learned a great deal
during the process. The resulting guidelines are general and should
not be construed as specific recommendations by the FDA for
regulatory submissions.
RESULTS
Generating a unique compendium of >30,000 prediction models
The MAQC-II consortium was conceived with the primary
goal of examining model development practices for generating
binary classifiers in two types of data sets, preclinical and clinical
(Supplementary Tables 1 and 2). To accomplish this, the project
leader distributed six data sets containing 13 preclinical and clinical
endpoints coded A through M (Table 1) to 36 voluntary participating
data analysis teams representing academia, industry
and government institutions (Supplementary Table 3). Endpoints
were coded so as to hide the identities of two negative-control endpoints
(endpoints I and M, for which class labels were randomly
assigned and are not predictable by the microarray data) and two
positive-control endpoints (endpoints H and L, representing the
sex of patients, which is highly predictable by the microarray data).
Endpoints A, B and C tested teams’ ability to predict the toxicity
of chemical agents in rodent lung and liver models. The remaining
endpoints were predicted from microarray data sets from human
patients diagnosed with breast cancer (D and E), multiple myeloma
(F and G) or neuroblastoma (J and K). For the multiple myeloma
and neuroblastoma data sets, the endpoints represented event-free
survival (abbreviated EFS), meaning a lack of malignancy or disease
recurrence, and overall survival (abbreviated OS) after 730 days
(for multiple myeloma) or 900 days (for neuroblastoma) post treatment
or diagnosis. For breast cancer, the endpoints represented
estrogen receptor status, a common diagnostic marker of this
cancer type (abbreviated ‘erpos’), and the success of treatment
involving chemotherapy followed by surgical resection of a tumor
(abbreviated ‘pCR’). The biological meaning of the control endpoints
was known only to the project leader and not revealed to
the project participants until all model development and external
validation processes had been completed.
To evaluate the reproducibility of the models developed by a data
analysis team for a given data set, we asked teams to submit models
from two stages of analyses. In the first stage (hereafter referred to as
the ‘original’ experiment), each team built prediction models for up to
13 different coded endpoints using six training data sets. Models were
‘frozen’ against further modification, submitted to the consortium
and then tested on a blinded validation data set that was not available
to the analysis teams during training. In the second stage (referred
to as the ‘swap’ experiment), teams repeated the model building and
validation process by training models on the original validation set
and validating them using the original training set.
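The two-stage design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the consortium's pipeline: the one-feature threshold classifier, the toy data and the names `fit_threshold`, `accuracy` and `run_original_and_swap` are hypothetical, and accuracy stands in for whatever metric a team would report.

```python
def fit_threshold(data):
    """Toy classifier: midpoint between the class means of a single feature.

    Assumes positives (label 1) tend to have larger feature values.
    """
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed for training")
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def accuracy(threshold, data):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(1 for x, y in data if (1 if x > threshold else 0) == y)
    return hits / len(data)


def run_original_and_swap(fit, score, set_a, set_b):
    """Train on A and validate on B ('original'), then exchange roles ('swap')."""
    results = {}
    for name, train, valid in (("original", set_a, set_b), ("swap", set_b, set_a)):
        model = fit(train)  # the model is 'frozen' after this call
        results[name] = score(model, valid)  # validation data never reach fit()
    return results


# Made-up (feature value, class label) pairs.
set_a = [(0.1, 0), (0.3, 0), (0.8, 1), (0.9, 1)]
set_b = [(0.2, 0), (0.4, 0), (0.7, 1), (0.6, 1)]

results = run_original_and_swap(fit_threshold, accuracy, set_a, set_b)
```

Keeping `fit` and `score` as separate arguments makes the key constraint explicit: nothing from the validation set is visible while the model is being built, in either direction.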
To simulate the potential decision-making process for evaluating a
microarray-based classifier, we established a process for each group
to receive training data with coded endpoints, propose a data analysis
protocol (DAP) based on exploratory analysis, receive feedback on
the protocol and then perform the analysis and validation (Fig. 1).
Analysis protocols were reviewed internally by other MAQC-II participants
(at least two reviewers per protocol) and by members of the
MAQC-II Regulatory Biostatistics Working Group (RBWG), a team
from the FDA and industry comprising biostatisticians and others
with extensive model building expertise. Teams were encouraged to
revise their protocols to incorporate feedback from reviewers, but
each team was eventually considered responsible for its own analysis
protocol and incorporating reviewers’ feedback was not mandatory
(see Online Methods for more details).
We assembled two large tables from the original and swap experiments
(Supplementary Tables 1 and 2, respectively) containing
summary information about the algorithms and analytic steps, or
‘modeling factors’, used to construct each model and the ‘internal’
and ‘external’ performance of each model. Internal performance
measures the ability of the model to classify the training samples,
based on cross-validation exercises. External performance measures
the ability of the model to classify the blinded independent validation
data. We considered several performance metrics, including Matthews
Correlation Coefficient (MCC), accuracy, sensitivity, specificity,
area under the receiver operating characteristic curve (AUC) and
root mean squared error (r.m.s.e.). These two tables contain data on
>30,000 models. Here we report performance based on MCC because
it is informative when the distribution of the two classes in a data set
is highly skewed and because it is simple to calculate and was available
for all models. MCC values range from +1 to −1, with +1 indicating
perfect prediction (that is, all samples classified correctly and none
incorrectly), 0 indicating random prediction and −1 indicating perfect
inverse prediction.
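For a binary classifier, MCC is computed from the confusion-matrix counts as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is an illustrative implementation, not code from the study; the function name, the 0/1 label coding and the convention of returning 0 when the denominator vanishes are assumptions made here.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: a degenerate confusion matrix scores 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Skewed toy example: one positive among ten samples, classifier always says 0.
y_true = [1] + [0] * 9
y_pred = [0] * 10
acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
mcc = matthews_corrcoef(y_true, y_pred)
```

In the skewed example the accuracy is 0.9 although the classifier carries no information, whereas the MCC is 0, which illustrates why MCC is informative when class proportions are highly unbalanced.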
The 36 analysis teams applied many different options under each
modeling factor for developing models (Supplementary Table 4)
including 17 summary and normalization methods, nine batch-effect
removal methods, 33 feature selection methods (between 1 and >1,000
features), 24 classification algorithms and six internal validation
methods. Such diversity suggests the community’s common practices are
well represented. For each of the models nominated by a team as being
the best model for a particular endpoint, we compiled the list of features
used for both the original and swap experiments (see the MAQC Web
site at http://edkb.fda.gov/MAQC/). These comprehensive tables represent
a unique resource. The results that follow describe data mining
efforts to determine the potential and limitations of current practices for
developing and validating gene expression–based prediction models.
Table 1  Microarray data sets used for model development and validation in the MAQC-II project
Data set code | Endpoint code | Endpoint description | Microarray platform | Training set(a): samples | Training: positives (P) | Training: negatives (N) | Training: P/N ratio | Validation set(a): samples | Validation: positives (P) | Validation: negatives (N) | Validation: P/N ratio
Hamner | A | Lung tumorigen vs. non-tumorigen (mouse) | Affymetrix Mouse 430 2.0 | 70 | 26 | 44 | 0.59 | 88 | 28 | 60 | 0.47
Iconix | B | Non-genotoxic liver carcinogens vs. non-carcinogens (rat) | Amersham Uniset Rat 1 Bioarray | 216 | 73 | 143 | 0.51 | 201 | 57 | 144 | 0.40
NIEHS | C | Liver toxicants vs. non-toxicants based on overall necrosis score (rat) | Affymetrix Rat 230 2.0 | 214 | 79 | 135 | 0.58 | 204 | 78 | 126 | 0.62
Breast cancer (BR) | D | Pre-operative treatment response (pCR, pathologic complete response) | Affymetrix Human U133A | 130 | 33 | 97 | 0.34 | 100 | 15 | 85 | 0.18
Breast cancer (BR) | E | Estrogen receptor status (erpos) | Affymetrix Human U133A | 130 | 80 | 50 | 1.6 | 100 | 61 | 39 | 1.56
Multiple myeloma (MM) | F | Overall survival milestone outcome (OS, 730-d cutoff) | Affymetrix Human U133Plus 2.0 | 340 | 51 | 289 | 0.18 | 214 | 27 | 187 | 0.14
Multiple myeloma (MM) | G | Event-free survival milestone outcome (EFS, 730-d cutoff) | Affymetrix Human U133Plus 2.0 | 340 | 84 | 256 | 0.33 | 214 | 34 | 180 | 0.19
Multiple myeloma (MM) | H | Clinical parameter S1 (CPS1). The actual class label is the sex of the patient. Used as a “positive” control endpoint | Affymetrix Human U133Plus 2.0 | 340 | 194 | 146 | 1.33 | 214 | 140 | 74 | 1.89
Multiple myeloma (MM) | I | Clinical parameter R1 (CPR1). The actual class label is randomly assigned. Used as a “negative” control endpoint | Affymetrix Human U133Plus 2.0 | 340 | 200 | 140 | 1.43 | 214 | 122 | 92 | 1.33
Neuroblastoma (NB) | J | Overall survival milestone outcome (OS, 900-d cutoff) | Different versions of Agilent human microarrays | 238 | 22 | 216 | 0.10 | 177 | 39 | 138 | 0.28
Neuroblastoma (NB) | K | Event-free survival milestone outcome (EFS, 900-d cutoff) | Different versions of Agilent human microarrays | 239 | 49 | 190 | 0.26 | 193 | 83 | 110 | 0.75
Neuroblastoma (NB) | L | Newly established parameter S (NEP_S). The actual class label is the sex of the patient. Used as a “positive” control endpoint | Different versions of Agilent human microarrays | 246 | 145 | 101 | 1.44 | 231 | 133 | 98 | 1.36
Neuroblastoma (NB) | M | Newly established parameter R (NEP_R). The actual class label is randomly assigned. Used as a “negative” control endpoint | Different versions of Agilent human microarrays | 246 | 145 | 101 | 1.44 | 253 | 143 | 110 | 1.30
Comments and references, by data set:
Hamner: The training set was first published in 2007 (ref. 50) and the validation set was generated for MAQC-II.
Iconix: The data set was first published in 2007 (ref. 51). Raw microarray intensity data, instead of ratio data, were provided for MAQC-II data analysis.
NIEHS: Exploratory visualization of the data set was reported in 2008 (ref. 53). However, the phenotype classification problem was formulated specifically for MAQC-II. A large amount of additional microarray and phenotype data were provided to MAQC-II for cross-platform and cross-tissue comparisons.
Breast cancer (BR): The training set was first published in 2006 (ref. 56) and the validation set was specifically generated for MAQC-II. In addition, two distinct endpoints (D and E) were analyzed in MAQC-II.
Multiple myeloma (MM): The data set was first published in 2006 (ref. 57) and 2007 (ref. 58). However, patient survival data were updated and the raw microarray data (CEL files) were provided specifically for MAQC-II data analysis. In addition, endpoints H and I were designed and analyzed specifically in MAQC-II.
Neuroblastoma (NB): The training data set was first published in 2006 (ref. 63). The validation set (two-color Agilent platform) was generated specifically for MAQC-II. In addition, one-color Agilent platform data were also generated for most samples used in the training and validation sets specifically for MAQC-II to compare the prediction performance of two-color versus one-color platforms. Patient survival data were also updated. In addition, endpoints L and M were designed and analyzed specifically in MAQC-II.
The first three data sets (Hamner, Iconix and NIEHS) are from preclinical toxicogenomics studies, whereas the other three data sets are from clinical studies. Endpoints H and L are positive
controls (sex of patient) and endpoints I and M are negative controls (randomly assigned class labels). The nature of H, I, L and M was unknown to MAQC-II participants except for the project
leader until all calculations were completed.
(a) Numbers shown are the actual number of samples used for model development or validation.
Performance depends on endpoint and can be estimated
during training
Unlike many previous efforts, the study design of MAQC-II provided
the opportunity to assess the performance of many different modeling
approaches on a clinically realistic blinded external validation data set.
This is especially important in light of the intended clinical or preclinical
uses of classifiers that are constructed using initial data sets and
validated for regulatory approval and then are expected to accurately
predict samples collected under diverse conditions perhaps months or
years later. To assess the reliability of performance estimates derived
during model training, we compared the performance on the internal
training data set with performance on the external validation data set
for each of the 18,060 models in the original experiment (Fig. 2a).
Models without complete metadata were not included in the analysis.
We selected 13 ‘candidate models’, representing the best model for
each endpoint, before external validation was performed. We required
that each analysis team nominate one model
for each endpoint they analyzed and we then
selected one candidate from these nominations
for each endpoint. We observed a
higher correlation between internal and
external performance estimates in terms
[Figure 1 graphic: timeline running from 9/1/2007 to 2/1/2009 with the workflow steps 1. Exploratory data analysis (36 DATs); 2. Data analysis protocol (DAP); 3. Review and approval of DAP by RBWG; 4. Six training data sets (13 endpoints); 5. Classifiers are frozen (mark one for validation); 6. MAQC-II’s candidate models; 7. Validation (blind test) data sets distribution; 8. Prediction results; 9-10. Meta-data distribution; 10. Table of model information (modeling factors, internal validation and external validation performance metrics for each model); 11. Swap prediction results; 12. Meta-data analysis and visualization (inset scatter plot labeled r = 0.8495, N = 17,092); a face-to-face meeting is also marked on the timeline.]
Figure 1  Experimental design and timeline
of the MAQC-II project. Numbers (1–11)
order the steps of analysis. Step 11 indicates
when the original training and validation
data sets were swapped to repeat steps 4–10.
See main text for description of each step.
Every effort was made to ensure the complete
independence of the validation data sets from
the training sets. Each model is characterized
by several modeling factors and seven internal
and external validation performance metrics
(Supplementary Tables 1 and 2). The modeling
factors include: (i) organization code; (ii) data
set code; (iii) endpoint code; (iv) summary and
normalization; (v) feature selection method;
(vi) number of features used; (vii) classification
algorithm; (viii) batch-effect removal method;
(ix) type of internal validation; and (x) number
of iterations of internal validation. The seven
performance metrics for internal validation and
external validation are: (i) MCC; (ii) accuracy;
(iii) sensitivity; (iv) specificity; (v) AUC;
(vi) mean of sensitivity and specificity; and
(vii) r.m.s.e. s.d. of metrics are also provided for
internal validation results.
[Figure 2: (a, b) scatter plots of external validation (MCC) against
internal validation (MCC), annotated r = 0.840, N = 18,060 (a) and
r = 0.951, N = 13 (b), with the points in b labeled by endpoint (A–M);
(c) distributions of MCC per endpoint for internal and external
validation. Endpoint labels in c: NB-positive, MM-positive, BR-erpos,
BR-pCR, MM-EFS, MM-OS, MM-negative, NB-negative, Hamner (mouse lung
tumor), NB-EFS, NB-OS, Iconix (rat liver tumor), NIEHS (rat liver
necrosis).]
Figure 2  Model performance on internal
validation compared with external validation.
(a) Performance of 18,060 models that were
validated with blinded validation data.
(b) Performance of 13 candidate models.
r, Pearson correlation coefficient; N, number
of models. Candidate models with binary and
continuous prediction values are marked as
circles and squares, respectively, and the
standard error estimate was obtained using
500-times resampling with bagging of the
prediction results from each model. (c) Distribution
of MCC values of all models for each endpoint in
internal (left, yellow) and external (right, green)
validation performance. Endpoints H and L (sex of
the patients) are included as positive controls and
endpoints I and M (randomly assigned sample
class labels) as negative controls. Boxes indicate
the 25% and 75% percentiles, and whiskers
indicate the 5% and 95% percentiles.
of MCC for the selected candidate models
(r = 0.951, n = 13, Fig. 2b) than for the overall
set of models (r = 0.840, n = 18,060, Fig. 2a),
suggesting that extensive peer review of
analysis protocols was able to avoid selecting
models that could result in less reliable
predictions in external validation. Yet, even
for the hand-selected candidate models, there is noticeable bias in the
performance estimated from internal validation. That is, the internal
validation performance is higher than the external validation performance
for most endpoints (Fig. 2b). However, for some endpoints
and for some model building methods or teams, internal and external
performance correlations were more modest as described in the following
sections.
To evaluate whether some endpoints might be more predictable
than others and to calibrate performance against the positive- and
negative-control endpoints, we assessed all models generated for each
endpoint (Fig. 2c). We observed a clear dependence of prediction
performance on endpoint. For example, endpoints C (liver necrosis
score of rats treated with hepatotoxicants), E (estrogen receptor status
of breast cancer patients), and H and L (sex of the multiple myeloma
and neuroblastoma patients, respectively) were the easiest to predict
(mean MCC > 0.7). Toxicological endpoints A and B and disease
progression endpoints D, F, G, J and K were more difficult to predict
(mean MCC ~0.1–0.4). Negative-control endpoints I and M were
totally unpredictable (mean MCC ~0), as expected. For 11 endpoints
(excluding the negative controls), a large proportion of the submitted
models predicted the endpoint significantly better than chance (MCC
> 0) and for a given endpoint many models performed similarly well
on both internal and external validation (see the distribution of MCC
in Fig. 2c). On the other hand, not all the submitted models performed
equally well for any given endpoint. Some models performed
no better than chance, even for some of the easy-to-predict endpoints,
suggesting that additional factors were responsible for differences in
model performance.
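
For readers less familiar with the Matthews correlation coefficient (MCC) used throughout this section, the following minimal sketch computes it from a binary confusion matrix, along with three of the other metrics listed in the Figure 1 caption. The function names and the toy labels are illustrative assumptions, not part of the MAQC-II code base.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 by convention if a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def summary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC (assumes both classes occur)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": mcc(tp, tn, fp, fn),
    }


# Toy labels (hypothetical): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(summary_metrics(y_true, y_pred))
```

An MCC of 1 corresponds to perfect prediction, 0 to chance-level prediction and −1 to complete disagreement, which is why the negative-control endpoints are expected to cluster around 0.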
Data analysis teams show different proficiency
Next, we summarized the external validation performance of the
models nominated by the 17 teams that analyzed all 13 endpoints
(Fig. 3). Nominated models represent a team’s best assessment of its
model-building effort. The mean external validation MCC per team
over 11 endpoints, excluding negative controls I and M, varied from
0.532 for data analysis team (DAT)24 to 0.263 for DAT3, indicating
appreciable differences in performance of the models developed by different
teams for the same data. Similar trends were observed when AUC
was used as the performance metric (Supplementary Table 5) or when
the original training and validation sets were swapped (Supplementary
Tables 6 and 7). Table 2 summarizes the modeling approaches that
were used by two or more MAQC-II data analysis teams.
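
The per-team mean can be reproduced from the values shown in Figure 3. The sketch below does so for the two extreme teams, DAT24 and DAT3. It assumes that the columns of Figure 3 are ordered L, H, C, E, K, J, B, D, A, G, F, I, M, as read from the extracted figure; the variable and function names are illustrative.

```python
# Endpoint order of the Figure 3 columns (an assumption based on the
# extracted figure header); I and M are the negative controls.
ENDPOINTS = ["L", "H", "C", "E", "K", "J", "B", "D", "A", "G", "F", "I", "M"]
NEGATIVE_CONTROLS = {"I", "M"}

# External validation MCC of the nominated models, copied from Figure 3.
external_mcc = {
    "DAT24": [0.982, 0.910, 0.845, 0.748, 0.575, 0.557, 0.311,
              0.323, 0.244, 0.193, 0.168, 0.011, -0.059],
    "DAT3": [0.982, 0.830, 0.595, 0.544, 0.036, -0.090, -0.027,
             0.336, -0.143, -0.030, -0.142, -0.047, 0.019],
}


def team_mean(values):
    """Mean MCC over the 11 non-random endpoints (negative controls dropped)."""
    kept = [v for e, v in zip(ENDPOINTS, values) if e not in NEGATIVE_CONTROLS]
    return sum(kept) / len(kept)


for team, values in external_mcc.items():
    print(team, round(team_mean(values), 3))
```

Rounded to three decimals, the two means agree with the 0.532 and 0.263 quoted in the text.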
Many factors may have played a role in the difference of external validation
performance between teams. For instance, teams used different
modeling factors, criteria for selecting the nominated models, and software
packages and code. Moreover, some teams may have been more
proficient at microarray data modeling and better at guarding against
clerical errors. We noticed substantial variations in performance among
the many K-nearest neighbor algorithm (KNN)-based models developed
by four analysis teams (Supplementary Fig. 1). Follow-up investigations
identified a few possible causes leading to the discrepancies in
performance32. For example, DAT20 fixed the parameter ‘number of
neighbors’ K = 3 in its data analysis protocol for all endpoints, whereas
DAT18 varied K from 3 to 15 with a step size of 2. This investigation
also revealed that even a detailed but standardized description of model
building requested from all groups failed to capture many important
tuning variables in the process. The subtle modeling differences not
captured may have contributed to the differing performance levels
achieved by the data analysis teams. The differences in performance
for the models developed by various data analysis teams can also be
observed from the changing patterns of internal and external validation
performance across the 13 endpoints (Fig. 3, Supplementary
Tables 5–7 and Supplementary Figs. 2–4). Our observations highlight
the importance of good modeling practice in developing and validating
microarray-based predictive models including reporting of computational
details for results to be replicated26. In light of the MAQC-II
experience, recording structured information about the steps and
parameters of an analysis process seems highly desirable to facilitate
peer review and reanalysis of results.
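
As an illustration of how a single tuning variable can separate two otherwise identical protocols, the sketch below contrasts a fixed K = 3 with a grid of K from 3 to 15 in steps of 2, and stores the choice in a structured record. This is not the code of any MAQC-II team: the synthetic data, the leave-one-out selection criterion and the record fields are all assumptions made for the example.

```python
import json
import math
import random


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0  # k is odd, so ties cannot occur


def loo_accuracy(data, k):
    """Leave-one-out accuracy of KNN with the given k."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


# Synthetic two-class data (hypothetical): 20 points per class in 2-D.
rng = random.Random(0)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for label, c in ((0, 0.0), (1, 2.0)) for _ in range(20)]

# Two protocols that differ only in the K grid, recorded explicitly.
fixed = {"protocol": "fixed-K", "algorithm": "KNN", "k_grid": [3]}
tuned = {"protocol": "tuned-K", "algorithm": "KNN", "k_grid": list(range(3, 16, 2))}

for record in (fixed, tuned):
    scores = {k: loo_accuracy(data, k) for k in record["k_grid"]}
    record["selected_k"] = max(scores, key=scores.get)
    # Note: reporting the accuracy that was used to pick K is optimistic.
    record["loo_accuracy"] = scores[record["selected_k"]]
    print(json.dumps(record))
```

Writing such a record alongside each model is one simple way of capturing tuning variables that a free-text protocol description tends to omit.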
Swap and original analyses lead to consistent results
To evaluate the reproducibility of the models generated by each team,
we correlated the performance of each team’s models on the original
training data set to performance on the validation data set and
repeated this calculation for the swap experiment (Fig. 4). The correlation
varied from 0.698–0.966 on the original experiment and from
[Figure 3: heat map of external validation MCC (color scale −1.0 to 1.0).
Rows, data analysis team code; columns, endpoint. Values as shown in the
figure:]

Team        Mean*      L      H      C      E      K      J      B      D      A      G      F      I      M
DAT24       0.532  0.982  0.910  0.845  0.748  0.575  0.557  0.311  0.323  0.244  0.193  0.168  0.011 −0.059
DAT13       0.513  0.973  0.918  0.829  0.792  0.493  0.437  0.322  0.306  0.307  0.202  0.060  0.044 −0.041
DAT25       0.504  0.965  0.801  0.816  0.652  0.514  0.349  0.383  0.360  0.217  0.243  0.247  0.016 −0.051
DAT11       0.500  0.991  0.752  0.750  0.778  0.509  0.483  0.345  0.305  0.295  0.193  0.099  0.029  0.012
DAT12       0.495  0.973  0.869  0.825  0.755  0.403  0.413  0.321  0.275  0.193  0.266  0.152 −0.016 −0.117
DAT32       0.489  0.982  0.762  0.823  0.702  0.533  0.557  0.284  0.203  0.143  0.257  0.129  0.043 −0.006
DAT10       0.485  0.982  0.871  0.445  0.728  0.472  0.249  0.429  0.353  0.295  0.293  0.222  0.016 −0.035
DAT20       0.483  0.930  0.838  0.805  0.773  0.542  0.386  0.345  0.289  0.225  0.181  0.000  0.067 −0.152
DAT4        0.473  0.982  0.847  0.835  0.737  0.488  0.344  0.118  0.324  0.110  0.176  0.247 −0.067 −0.112
DAT18       0.460  0.973  0.860  0.829  0.690  0.371  0.376  0.344  0.229  0.057  0.243  0.090 −0.059 −0.059
DAT36       0.457  0.956  0.815  0.847  0.773  0.491  0.202  0.185  0.385 −0.014  0.187  0.203  0.002 −0.075
DAT29       0.443  0.982  0.847  0.780  0.755  0.377  0.423  0.313 −0.042  0.198  0.241  0.000  0.000 −0.041
DAT35       0.427  0.725  0.782  0.824  0.770  0.531  0.344  0.168  0.349 −0.096  0.165  0.140  0.068  0.036
DAT7        0.371  0.982  0.707  0.782  0.466  0.499  0.184  0.271  0.000 −0.062  0.203  0.051  0.013 −0.103
DAT19       0.364  0.636  0.761  0.454  0.748  0.247  0.377  0.062  0.324  0.043  0.085  0.271  0.016 −0.020
DAT33       0.284  0.856  0.054  0.709  0.751  0.455 −0.213 −0.078  0.114  0.479 −0.096  0.091  0.051  0.024
DAT3        0.263  0.982  0.830  0.595  0.544  0.036 −0.090 −0.027  0.336 −0.143 −0.030 −0.142 −0.047  0.019
Median      0.488  0.973  0.830  0.816  0.748  0.491  0.376  0.311  0.306  0.193  0.193  0.129  0.016 −0.041
Candidate   0.511  0.982  0.891  0.829  0.732  0.403  0.479  0.429  0.301  0.217  0.162  0.196  0.067 −0.103

Figure 3  Performance, measured using MCC,
of the best models nominated by the 17 data
analysis teams (DATs) that analyzed all 13
endpoints in the original training-validation
experiment. The median MCC value for
an endpoint, representative of the level of
predictability of the endpoint, was calculated
based on values from the 17 data analysis
teams. The mean MCC value for a data analysis
team, representative of the team’s proficiency
in developing predictive models, was calculated
based on values from the 11 non-random
endpoints (excluding negative controls I and M).
Red boxes highlight candidate models. Lack
of a red box in an endpoint indicates that the
candidate model was developed by a data analysis
team that did not analyze all 13 endpoints.
0.443–0.954 on the swap experiment. For all but three teams (DAT3,
DAT10 and DAT11) the original and swap correlations were within
±0.2, and all but three others (DAT4, DAT13 and DAT36) were within
±0.1, suggesting that the model building process was relatively robust,
at least with respect to generating models with similar performance.
For some data analysis teams the internal validation performance
drastically overestimated the performance of the same model in predicting
the validation data. Examination of some of those models
revealed several reasons, including bias in the feature selection and
cross-validation process28, findings consistent with what was observed
from a recent literature survey33.
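
The per-team comparison described above reduces to two Pearson correlations, one for the original analysis and one for the swap analysis, followed by a check of how far apart they are. A minimal sketch follows; the MCC values are hypothetical placeholders, not data from the study, and the helper names are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def consistent(r_original, r_swap, tolerance):
    """True if the original and swap correlations differ by at most tolerance."""
    return abs(r_original - r_swap) <= tolerance


# Hypothetical internal and external MCC values for one team's models.
internal_orig = [0.95, 0.80, 0.45, 0.30, 0.10, 0.02]
external_orig = [0.90, 0.70, 0.35, 0.25, 0.15, -0.03]
internal_swap = [0.92, 0.75, 0.50, 0.28, 0.12, 0.00]
external_swap = [0.85, 0.72, 0.30, 0.31, 0.05, 0.04]

r_orig = pearson_r(internal_orig, external_orig)
r_swap = pearson_r(internal_swap, external_swap)
print(round(r_orig, 3), round(r_swap, 3), consistent(r_orig, r_swap, 0.2))
```

A high correlation only indicates that internal and external estimates rank the models similarly; it does not rule out a systematic offset, such as internal estimates that are uniformly too optimistic.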
Previously, reanalysis of a widely cited single study34 found that
the results in the original publication were very fragile—that is, not
reproducible if the training and validation sets were swapped35. Our
observations, except for DAT3, DAT11 and DAT36 with correlation
<0.6, mainly resulting from failure to accurately predict the positive-control
endpoint H in the swap analysis (likely owing to operator
errors), do not substantiate such fragility in the currently examined
data sets. It is important to emphasize that we repeated the entire
model building and evaluation processes during the swap analysis
and, therefore, stability applies to the model building process for
each data analysis team and not to a particular model or approach.
Supplementary Figure 5 provides a more detailed look at the correlation
of internal and external validation for each data analysis team
and each endpoint for both the original (Supplementary Fig. 5a) and
swap (Supplementary Fig. 5d) analyses.
As expected, individual feature lists differed from analysis group
to analysis group and between models developed from the original
and the swapped data. However, when feature lists were mapped to
biological processes, a greater degree of convergence and concordance
was observed. This has been proposed previously but has never been
demonstrated in a comprehensive manner over many data sets and
thousands of models as was done in MAQC-II36.
The effect of modeling factors is modest
To rigorously identify potential sources of variance that explain the
variability in external-validation performance (Fig. 2c), we applied
random effect modeling (Fig. 5a). We observed that the endpoint
itself is by far the dominant source of variability, explaining >65%
of the variability in the external validation performance. All other
factors explain <8% of the total variance, and the residual variance
is ~6%. Among the factors tested, those involving interactions with
endpoint have a relatively large effect, in particular the interaction
of endpoint with organization and classification algorithm,
highlighting variations in proficiency between analysis teams.
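
In generic notation, the variance decomposition described in this paragraph can be sketched as follows. The symbols are assumptions introduced here for illustration (the set F stands for the modeling factors and selected interactions of Fig. 5a), and the normality of the random effects is the usual textbook assumption rather than a detail stated in this section.

```latex
% External-validation MCC of a model, written as a sum of random effects
y = \mu + \sum_{f \in F} u_f + \varepsilon,
\qquad u_f \sim \mathcal{N}(0, \sigma_f^2),
\qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Proportion of variability attributed to factor f:
% its variance component divided by the total of all components
p_f = \frac{\sigma_f^2}{\sum_{g \in F} \sigma_g^2 + \sigma_\varepsilon^2}
```

With this reading, the statement that the endpoint explains more than 65% of the variability means that the proportion for the endpoint term exceeds 0.65.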
To further investigate the impact of individual levels within each
modeling factor, we estimated the empirical best linear unbiased predictors
(BLUPs)37. Figure 5b shows the plots of BLUPs of the corresponding
factors in Figure 5a with proportion of variation >1%.
The BLUPs reveal the effect of each level of a factor on the corresponding
MCC value. The BLUPs of the main endpoint effect show
that rat liver necrosis, breast cancer estrogen receptor status and the
sex of the patient (endpoints C, E, H and L) are relatively easy to
predict, contributing an advantage of ~0.2–0.4 to the corresponding
MCC values. The rest of the endpoints are relatively hard to
predict, contributing a disadvantage of about −0.1 to −0.2 to
the corresponding MCC values. The main factors of normalization,
classification algorithm, the number of selected features and
the feature selection method have an impact of −0.1 to 0.1 on the
corresponding MCC values. Loess normalization was applied to the
endpoints (J, K and L) for the neuroblastoma data set with the two-color
Agilent platform and gives an advantage of 0.1 to MCC values. Among
the Microarray Analysis Suite version 5 (MAS5), Robust Multichip
Analysis (RMA) and dChip normalization methods that were
applied to all endpoints (A, C, D, E, F, G and H) for Affymetrix data,
the dChip method has a lower BLUP than the others. Because
normalization methods are partially confounded with endpoints, it
may not be suitable to compare methods between different confounded
groups. Among classification methods, discriminant analysis has the
largest positive impact of 0.056 on the MCC values. Regarding the
number of selected features, a larger bin number has a more favorable
impact on average across endpoints. The bin number is assigned by applying
the ceiling function to the log base 10 of the number of selected features.
All the feature selection methods have a slight impact of −0.025 to 0.025
Table 2  Modeling factor options frequently adopted by MAQC-II data
analysis teams

                                          Original analysis (training => validation)
Modeling factor             Option        Number     Number        Number
                                          of teams   of endpoints  of models
Summary and normalization   Loess         12         3             2,563
                            RMA           3          7             46
                            MAS5          11         7             4,947
Batch-effect removal        None          10         11            2,281
                            Mean shift    3          11            7,279
Feature selection           SAM           4          11            3,771
                            FC+P          8          11            4,711
                            T-Test        5          11            400
                            RFE           2          11            647
Number of features          0~9           10         11            393
                            10~99         13         11            4,445
                            ≥1,000        3          11            474
                            100~999       10         11            4,298
Classification algorithm    DA            4          11            103
                            Tree          5          11            358
                            NB            4          11            924
                            KNN           8          11            6,904
                            SVM           9          11            986

Analytic options used by two or more of the 14 teams that submitted models for all endpoints in both
the original and swap experiments. RMA, robust multichip analysis; SAM, significance analysis of
microarrays; FC, fold change; RFE, recursive feature elimination; DA, discriminant analysis; Tree,
decision tree; NB, naive Bayes; KNN, K-nearest neighbors; SVM, support vector machine.
[Figure 4: scatter plot of 14 data analysis teams (DAT3, 4, 7, 10, 11,
12, 13, 18, 20, 24, 25, 29, 32 and 36). x axis, correlation in original
analysis (training → validation); y axis, correlation in swap analysis
(validation → training).]
Figure 4  Correlation between internal and external validation is
dependent on data analysis team. Pearson correlation coefficients
between internal and external validation performance in terms of MCC are
displayed for the 14 teams that submitted models for all 13 endpoints
in both the original (x axis) and swap (y axis) analyses. The unusually low
correlation in the swap analysis for DAT3, DAT11 and DAT36 is a result
of their failure to accurately predict the positive endpoint H, likely due to
operator errors (Supplementary Table 6).
on MCC values except for recursive feature elimination (RFE) that
has an impact of −0.006. In the plots of the four selected interactions,
the estimated BLUPs vary across endpoints. The large variation across
endpoints implies the impact of the corresponding modeling factor on
different endpoints can be very different. Among the four interaction
plots (see Supplementary Fig. 6 for a clear labeling of each interaction
term), the corresponding BLUPs of the three-way interaction
of organization, classification algorithm and endpoint show the highest
variation. This may be due to different tuning parameters applied
to individual algorithms for different organizations, as was the case
for KNN32.
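
The binning rule for the number of selected features, stated earlier in this paragraph as the ceiling of the base-10 logarithm, can be written out directly. The function name is an assumption made for this sketch.

```python
import math


def feature_bin(n_features):
    """Bin number: ceiling of log10 of the number of selected features."""
    if n_features < 1:
        raise ValueError("a model needs at least one selected feature")
    return math.ceil(math.log10(n_features))


for n in (1, 5, 50, 500, 5000):
    print(n, feature_bin(n))
```

Under this formula a count that is an exact power of ten falls into the lower bin, so the bins need not coincide exactly with the ranges listed in Table 2.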
We also analyzed the relative importance of modeling factors on
external-validation prediction performance using a decision tree
model38. The analysis results revealed observations (Supplementary
Fig. 7) largely consistent with those above. First, the endpoint code
was the most influential modeling factor. Second, feature selection
method, normalization and summarization method, classification
method and organization code also contributed to prediction performance,
but their contribution was relatively small.
Feature list stability is correlated with endpoint predictability
Prediction performance is the most important criterion for evaluating
the performance of a predictive model and its modeling process.
However, the robustness and mechanistic relevance of the model and
the corresponding gene signature is also important (Supplementary
Fig. 8). That is, given comparable prediction performance between
two modeling processes, the one yielding a more robust and reproducible
gene signature across similar data sets (e.g., by swapping the
training and validation sets), which is therefore less susceptible to
sporadic fluctuations in the data, or the one that provides new insights
to the underlying biology is preferable. Reproducibility or stability of
feature sets is best studied by running the same model selection protocol
on two distinct collections of samples, a scenario only possible, in
this case, after the blind validation data were distributed to the data
analysis teams that were asked to perform their analysis after swapping
their original training and test sets. Supplementary Figures 9 and 10
show that, although the feature space is extremely large for microarray
data, different teams and protocols were able to consistently select the
best-performing features. Analysis of the lists of features indicated that
for endpoints relatively easy to predict, various data analysis teams
arrived at models that used more common features and the overlap
of the lists from the original and swap analyses is greater than those
for more difficult endpoints (Supplementary Figs. 9–11). Therefore,
the level of stability of feature lists can be associated with the level of difficulty
of the prediction problem (Supplementary Fig. 11), although
multiple models with different feature lists and comparable performance
can be found from the same data set39. Functional analysis of the
most frequently selected genes by all data analysis protocols shows
[Figure 5: (a) bar chart of percentage of variation for each term:
Endpoint; Summary normalization; Summary normalization*endpoint;
Batch effect removal; Batch effect removal*endpoint; Organization;
Organization*endpoint; Organization*classification algorithm;
Organization*classification algorithm*endpoint; Validation iterations;
Validation iterations*endpoint; Feature selection; Feature
selection*endpoint; Number of features; Number of features*endpoint;
Classification algorithm; Classification algorithm*endpoint; Residual.
(b) BLUP plots for Endpoint, Summary normalization, Classification
algorithm, Number of features and Feature selection method, and for the
interactions Classification algorithm*endpoint, Number of
features*endpoint, Summary normalization*endpoint and
Organization*classification*endpoint, with endpoints grouped as Tox, BR,
MM and NB.]
Figure 5  Effect of modeling factors on estimates of model performance. (a) Random-effect models of external validation performance (MCC) were
developed to estimate a distinct variance component for each modeling factor and several selected interactions. The estimated variance components
were then divided by their total in order to compare the proportion of variability explained by each modeling factor. The endpoint code contributes the
most to the variability in external validation performance. (b) The BLUP plots of the corresponding factors having proportion of variation larger than 1%
in a. Endpoint abbreviations (Tox., preclinical toxicity; BR, breast cancer; MM, multiple myeloma; NB, neuroblastoma). Endpoints H and L are the sex
of the patient. Summary normalization abbreviations (GA, genetic algorithm; RMA, robust multichip analysis). Classification algorithm abbreviations
(ANN, artificial neural network; DA, discriminant analysis; Forest, random forest; GLM, generalized linear model; KNN, K-nearest neighbors; Logistic,
logistic regression; ML, maximum likelihood; NB, Naïve Bayes; NC, nearest centroid; PLS, partial least squares; RFE, recursive feature elimination;
SMO, sequential minimal optimization; SVM, support vector machine; Tree, decision tree). Feature selection method abbreviations (Bscatter, between-class
scatter; FC, fold change; KS, Kolmogorov-Smirnov algorithm; SAM, significance analysis of microarrays).
that many of these genes represent biological processes that are highly
relevant to the clinical outcome that is being predicted36. The sex-based
endpoints have the best overlap, whereas more difficult survival
endpoints (in which disease processes are confounded by many other
factors) have only marginally better overlap with biological processes
relevant to the disease than that expected by random chance.
Summary of MAQC-II observations and recommendations
The MAQC-II data analysis teams comprised a diverse group, some
of whom were experienced microarray analysts whereas others were
graduate students with little experience. In aggregate, the group’s
composition likely mimicked the broad scientific community engaged
in building and publishing models derived from microarray data. The
more than 30,000 models developed by 36 data analysis teams for
13 endpoints from six diverse clinical and preclinical data sets are a
rich source from which to highlight several important observations.
First, model prediction performance was largely endpoint (biology)
dependent (Figs. 2c and 3). The incorporation of multiple data
sets and endpoints (including positive and negative controls) in the
MAQC-II study design made this observation possible. Some endpoints
are highly predictable based on the nature of the data, which
makes it possible to build good models, provided that sound modeling
procedures are used. Other endpoints are inherently difficult to predict
regardless of the model development protocol.
Second, there are clear differences in proficiency between data
analysis teams (organizations) and such differences are correlated
with the level of experience of the team. For example, the top-performing
teams shown in Figure 3 were mainly industrial participants
with many years of experience in microarray data analysis, whereas
bottom-performing teams were mainly less-experienced graduate
students or researchers. Based on results from the positive and negative
endpoints, we noticed that simple errors were sometimes made,
suggesting rushed efforts due to lack of time or unnoticed implementation
flaws. This observation strongly suggests that mechanisms are
needed to ensure the reliability of results presented to the regulatory
agencies, journal editors and the research community. By examining
the practices of teams whose models did not perform well, future
studies might be able to identify pitfalls to be avoided. Likewise,
practices adopted by top-performing teams can provide the basis for
developing good modeling practices.
Third, the internal validation performance from well-implemented,
unbiased cross-validation shows a high degree of concordance with the
external validation performance in a strict blinding process (Fig. 2).
This observation was not possible from previously published studies
owing to the small number of available endpoints tested in them.
Fourth, many models with similar performance can be developed
from a given data set (Fig. 2). Similar prediction performance is
attainable when using different modeling algorithms and parameters,
and simple data analysis methods often perform as well as more
complicated approaches32,40. Although it is not essential to include
the same features in these models to achieve comparable prediction
performance, endpoints that were easier to predict generally yielded
models with more common features, when analyzed by different
teams (Supplementary Fig. 11).
Finally, applying good modeling practices appeared to be more
important than the actual choice of a particular algorithm over the
others within the same step in the modeling process. This can be seen
in the diverse choices of the modeling factors used by teams that produced
models that performed well in the blinded validation (Table 2)
where modeling factors did not universally contribute to variations in
model performance among good performing teams (Fig. 5).
Summarized below are the model building steps recommended to
the MAQC-II data analysis teams. These may be applicable to model
building practitioners in the general scientific community.
Step one (design). There is no exclusive set of steps and procedures,
in the form of a checklist, to be followed by any practitioner for all
problems. However, normal good practice on the study design and
the ratio of sample size to classifier complexity should be followed.
The frequently used options for normalization, feature selection and
classification are good starting points (Table 2).
Step two (pilot study or internal validation). This can be accomplished
by bootstrap or cross-validation such as the ten repeats of a
fivefold cross-validation procedure adopted by most MAQC-II teams.
The samples from the pilot study are not replaced for the pivotal
study; rather they are augmented to achieve ‘appropriate’ target size.
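
The resampling scheme mentioned in step two, ten repeats of a fivefold cross-validation, can be sketched as a generator of train/test index splits. This is an illustrative implementation, not the one used by any team; it does not stratify by class, and the seed and function name are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_folds] for i in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield repeat, f, train_idx, test_idx


splits = list(repeated_kfold(n_samples=23))
print(len(splits))  # number of model fits needed: repeats times folds
```

Each sample appears in exactly one test fold per repeat, so averaging over the splits gives every sample equal weight in the internal performance estimate.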
Step three (pivotal study or external validation). Many investigators
assume that the most conservative approach to a pivotal study is to
simply obtain a test set completely independent of the training set(s).
However, it is good to keep in mind the exchange34,35 regarding the
fragility of results when the training and validation sets are swapped.
Results from further resampling (including simple swapping as in
MAQC-II) across the training and validation sets can provide important
information about the reliability of the models and the modeling
procedures, but the complete separation of the training and validation
sets should be maintained41.
Finally, a perennial issue concerns reuse of the independent validation
set after modifications to an originally designed and validated
data analysis algorithm or protocol. Such a process turns the validation
set into part of the design or training set42. Ground rules must
be developed for avoiding this approach and penalizing it when it
occurs; and practitioners should guard against using it before such
ground rules are well established.
DISCUSSION
MAQC-II conducted a broad observational study of the current community
landscape of gene-expression profile–based predictive model
development. Microarray gene expression profiling is among the most
commonly used analytical tools in biomedical research. Analysis of
the high-dimensional data generated by these experiments involves
multiple steps and several critical decision points that can profoundly
influence the soundness of the results43. An important requirement
of a sound internal validation is that it must include feature selection
and parameter optimization within each iteration to avoid overly optimistic
estimations of prediction performance28,29,44. To what extent
this information has been disseminated and followed by the scientific
community in current microarray analysis remains unknown33.
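
The requirement that feature selection be repeated within each iteration can be illustrated on data whose labels carry no signal, in the spirit of the negative-control endpoints. The sketch below compares selecting features once on all samples (which lets the test folds leak into the selection) with selecting them inside each fold. The selection score (difference of class means), the nearest-centroid classifier and the random data are all assumptions chosen to keep the example short.

```python
import random


def top_features(X, y, idx, n_keep):
    """Rank features by |difference of class means| using only samples in idx."""
    pos = [i for i in idx if y[i] == 1]
    neg = [i for i in idx if y[i] == 0]

    def score(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def centroid_predict(X, y, train_idx, features, x_new):
    """Nearest-centroid prediction restricted to the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [i for i in train_idx if y[i] == label]
        dist = 0.0
        for j in features:
            center = sum(X[i][j] for i in members) / len(members)
            dist += (x_new[j] - center) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, n_folds, n_keep, select_inside):
    """k-fold accuracy with feature selection inside or outside the folds."""
    n = len(X)
    folds = [list(range(n))[f::n_folds] for f in range(n_folds)]
    global_features = top_features(X, y, range(n), n_keep)  # sees test samples
    hits = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        feats = top_features(X, y, train_idx, n_keep) if select_inside else global_features
        hits += sum(centroid_predict(X, y, train_idx, feats, X[i]) == y[i]
                    for i in test_idx)
    return hits / n


rng = random.Random(1)
n_samples, n_features = 20, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]  # labels unrelated to the features

biased = cv_accuracy(X, y, n_folds=5, n_keep=10, select_inside=False)
unbiased = cv_accuracy(X, y, n_folds=5, n_keep=10, select_inside=True)
print(f"selection outside CV: {biased:.2f}; selection inside CV: {unbiased:.2f}")
```

Because the labels are unrelated to the features, any accuracy well above 0.5 in the first setting would be an artifact of selecting features on the full data set, which is the kind of over-optimism the text warns against.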
Concerns have been raised that results published by one group of
investigators often cannot be confirmed by others even if the same
data set is used26. An inability to confirm results may stem from any
of several reasons: (i) insufficient information is provided about the
methodology that describes which analysis has actually been done;
(ii) data preprocessing (normalization, gene filtering and feature
selection) is too complicated and insufficiently documented to be
reproduced; or (iii) incorrect or biased complex analytical methods26
are performed. A distinct but related concern is that genomic data may
yield prediction models that, even if reproducible on the discovery
data set, cannot be extrapolated well in independent validation. The
MAQC-II project provided a unique opportunity to address some of
these concerns.
Notably, we did not place restrictions on the model building methods
used by the data analysis teams. Accordingly, they adopted numerous
different modeling approaches (Table 2 and Supplementary Table 4).
For example, feature selection methods varied widely, from statistical
significance tests, to machine learning algorithms, to those more
reliant on differences in expression amplitude, to those employing
knowledge of putative biological mechanisms associated with the
endpoint. Prediction algorithms also varied widely. To make internal
validation performance results comparable across teams for different
models, we recommended that a model’s internal performance be
estimated using a ten times repeated fivefold cross-validation, but this
recommendation was not strictly followed by all teams, which also
allows us to survey internal validation approaches. The diversity of
analysis protocols used by the teams is likely to closely resemble that
of current research going forward, and in this context mimics reality.
In terms of the space of modeling factors explored, MAQC-II is a survey
of current practices rather than a randomized, controlled experiment;
therefore, care should be taken in interpreting the results. For
example, some teams did not analyze all endpoints, causing missing
data (models) that may be confounded with other modeling factors.
Overall, the procedure followed to nominate MAQC-II candidate
models was quite effective in selecting models that performed reasonably
well during validation using independent data sets, although
generally the selected models did not do as well in validation as in
training. The drop in performance associated with the validation
highlights the importance of not relying solely on internal validation
performance, and points to the need to subject every classifier to at
least one external validation. The selection of the 13 candidate models
from many nominated models was achieved through a peer-review
collaborative effort of many experts and could be described as slow,
tedious and sometimes subjective (e.g., a data analysis team could
only contribute one of the 13 candidate models). Even though they
were still subject to over-optimism, the internal and external performance
estimates of the candidate models were more concordant than
those of the overall set of models. Thus the review was productive in
identifying characteristics of reliable models.
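One way to quantify the concordance between internal and external performance estimates is sketched below. The helper names (pearson, optimism_summary) and the choice of summary statistics are assumptions made for illustration, and the numbers are invented; they are not MAQC-II results.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def optimism_summary(internal, external):
    """Summarize how internal estimates relate to external validation."""
    drops = [i - e for i, e in zip(internal, external)]
    return {
        "mean_drop": sum(drops) / len(drops),          # average over-optimism
        "n_overoptimistic": sum(d > 0 for d in drops),  # models that got worse
        "pearson_r": pearson(internal, external),      # concordance across models
    }


# Made-up numbers for illustration only; these are not MAQC-II results.
internal = [0.80, 0.60, 0.90, 0.40]
external = [0.70, 0.55, 0.85, 0.45]
summary = optimism_summary(internal, external)
```

A positive mean drop indicates over-optimistic internal estimates, while a high correlation indicates that internal validation still ranks the models in roughly the same order as external validation.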
An important lesson learned through MAQC-II is that it is almost
impossible to retrospectively retrieve and document decisions that
were made at every step during the feature selection and model development
stage. This lack of complete description of the model building
process is likely to be a common reason for the inability of different
data analysis teams to fully reproduce each other’s results32. Therefore,
although meticulously documenting the classifier building procedure
can be cumbersome, we recommend that all genomic publications
include supplementary materials describing the model building and
evaluation process in an electronic format. MAQC-II is making available
six data sets with 13 endpoints that can be used in the future as a
benchmark to verify that software used to implement new approaches
performs as expected. Subjecting new software to benchmarks against
these data sets could reassure potential users that the software is
mature enough to be used for the development of predictive models
in new data sets. It would seem advantageous to develop alternative
ways to help determine whether specific implementations of modeling
approaches and performance evaluation procedures are sound, and to
identify procedures to capture this information in public databases.
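A machine-readable record of the model building process could take a form such as the following. This is a sketch under stated assumptions: the class ModelRecord, its field names and all example values are hypothetical and do not reproduce any format used by MAQC-II.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ModelRecord:
    """Steps taken to build and evaluate one model (all field names hypothetical)."""
    endpoint: str
    normalization: str
    feature_selection: str
    n_features: int
    classifier: str
    hyperparameters: dict = field(default_factory=dict)
    internal_validation: str = ""
    random_seed: int = 0
    software: dict = field(default_factory=dict)


def to_json(record):
    """Serialize a record with sorted keys so that files diff cleanly."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return ModelRecord(**json.loads(text))


# Placeholder values for illustration only.
record = ModelRecord(
    endpoint="example-endpoint",
    normalization="example normalization method",
    feature_selection="example ranking statistic, top features kept",
    n_features=50,
    classifier="k-nearest neighbors",
    hyperparameters={"k": 5},
    internal_validation="10 x 5-fold cross-validation",
    random_seed=12345,
    software={"language": "python"},
)
text = to_json(record)
```

Writing such a record at the time each decision is made, rather than reconstructing it afterwards, addresses the difficulty of retrospective documentation noted above.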
The findings of the MAQC-II project suggest that when the same
data sets are provided to a large number of data analysis teams, many
groups can generate similar results even when different model building
approaches are followed. This is concordant with studies29,33 that
found that given good quality data and an adequate number of informative
features, most classification methods, if properly used, will yield
similar predictive performance. This also confirms reports6,7,39 on
small data sets by individual groups that have suggested that several
different feature selection methods and prediction algorithms can
yield many models that are distinct, but have statistically similar
performance. Taken together, these results provide perspective on
the large number of publications in the bioinformatics literature that
have examined the various steps of the multivariate prediction model
building process and identified elements that are critical for achieving
reliable results.
An important and previously underappreciated observation from
MAQC-II is that different clinical endpoints represent very different
levels of classification difficulty. For some endpoints the currently
available data are sufficient to generate robust models, whereas for
other endpoints currently available data do not seem to be sufficient
to yield highly predictive models. An analysis done as part of the
MAQC-II project that focused on the breast cancer data demonstrates
these points in more detail40. It is also important to point out
that for some clinically meaningful endpoints studied in the MAQC-II
project, gene expression data did not seem to significantly outperform
models based on clinical covariates alone, highlighting the challenges
in predicting the outcome of patients in a heterogeneous population
and the potential need to combine gene expression data with
clinical covariates (unpublished data).
The accuracy of the clinical sample annotation information may
also play a role in the difficulty of obtaining accurate prediction results
on validation samples. For example, some samples were misclassified
by almost all models (Supplementary Fig. 12). This is true even for some
samples within the positive control endpoints H and L, as shown
in Supplementary Table 8. The clinical information of the neuroblastoma
patients who were uniformly misclassified for the positive control
endpoint L was rechecked, and the sex of three out of eight cases
(NB412, NB504 and NB522) was found to be incorrectly annotated.
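Samples that are misclassified by nearly all models, and that are therefore candidates for an annotation check, can be flagged with a simple screen such as the one below. The function name, the 0.9 threshold and the sample and model identifiers are assumptions chosen for illustration only.

```python
def consistently_misclassified(truth, predictions, min_fraction=0.9):
    """Return samples misclassified by at least `min_fraction` of the models.

    truth:       dict sample_id -> annotated label
    predictions: dict model_id -> dict sample_id -> predicted label
    Returns a dict sample_id -> fraction of models that disagreed.
    """
    flagged = {}
    for sample, label in truth.items():
        votes = [p[sample] for p in predictions.values() if sample in p]
        if not votes:
            continue  # no model predicted this sample
        wrong = sum(v != label for v in votes) / len(votes)
        if wrong >= min_fraction:
            flagged[sample] = wrong
    return flagged


# Hypothetical identifiers and labels, for illustration only.
truth = {"S1": "F", "S2": "M", "S3": "F"}
predictions = {
    "model_a": {"S1": "F", "S2": "F", "S3": "F"},
    "model_b": {"S1": "F", "S2": "F", "S3": "M"},
    "model_c": {"S1": "F", "S2": "F", "S3": "F"},
}
flagged = consistently_misclassified(truth, predictions)
```

A flagged sample is not necessarily mislabeled; the screen only identifies cases in which the annotation deserves to be rechecked against the original clinical records.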
The companion MAQC-II papers published elsewhere give more
in-depth analyses of specific issues such as the clinical benefits of
genomic classifiers (unpublished data), the impact of different
modeling factors on prediction performance45, the objective assessment
of microarray cross-platform prediction46, cross-tissue prediction47,
one-color versus two-color prediction comparison48,
functional analysis of gene signatures36 and recommendation of a
simple yet robust data analysis protocol based on the KNN32. For
example, we systematically compared the classification performance
resulting from one- and two-color gene-expression profiles of
478 neuroblastoma samples and found that analyses based on either
platform yielded similar classification performance48. This newly generated
one-color data set has been used to evaluate the applicability of
the KNN-based simple data analysis protocol to future data sets32. In
addition, the MAQC-II Genome-Wide Association Working Group
assessed the variabilities in genotype calling due to experimental or
algorithmic factors49.
In summary, MAQC-II has demonstrated that current methods
commonly used to develop and assess multivariate gene-expression
based predictors of clinical outcome were used appropriately by
most of the analysis teams in this consortium. However, differences
in proficiency emerged and this underscores the importance
of proper implementation of otherwise robust analytical methods.
Observations based on analysis of the MAQC-II data sets may be
applicable to other diseases. The MAQC-II data sets are publicly
available and are expected to be used by the scientific community
as benchmarks to ensure proper modeling practices. The experience
with the MAQC-II clinical data sets also reinforces the notion that
clinical classification problems represent several different degrees
of prediction difficulty that are likely to be associated with whether
mRNA abundances measured in a specific data set are informative for
the specific prediction problem. We anticipate that including other
types of biological data at the DNA, microRNA, protein or metabolite
levels will enhance our capability to more accurately predict
the clinically relevant endpoints. The good modeling practice guidelines
established by MAQC-II and lessons learned from this unprecedented
collaboration provide a solid foundation from which other
high-dimensional biological data could be more reliably used for the
purpose of predictive and personalized medicine.
Methods
Methods and any associated references are available in the online
version of the paper at http://www.nature.com/naturebiotechnology/.
Accession codes. All MAQC-II data sets are available through
GEO (series accession number: GSE16716), the MAQC Web site
(http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc/),
ArrayTrack (http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/)
or CEBS (http://cebs.niehs.nih.gov/) accession
number: 009-00002-0010-000-3.
Note: Supplementary information is available on the Nature Biotechnology website.
Acknowledgments
The MAQC-II project was funded in part by the FDA’s Office of Critical Path
Programs (to L.S.). Participants from the National Institutes of Health (NIH) were
supported by the Intramural Research Program of NIH, Bethesda, Maryland or
the Intramural Research Program of the NIH, National Institute of Environmental
Health Sciences (NIEHS), Research Triangle Park, North Carolina. J.F. was
supported by the Division of Intramural Research of the NIEHS under contract
HHSN273200700046U. Participants from the Johns Hopkins University were
supported by grants from the NIH (1R01GM083084-01 and 1R01RR021967-01A2
to R.A.I. and T32GM074906 to M.M.). Participants from the Weill Medical College
of Cornell University were partially supported by the Biomedical Informatics
Core of the Institutional Clinical and Translational Science Award RFA-RM-07-002.
F.C. acknowledges resources from The HRH Prince Alwaleed Bin Talal Bin
Abdulaziz Alsaud Institute for Computational Biomedicine and from the David A.
Cofrin Center for Biomedical Information at Weill Cornell. The data set from The
Hamner Institutes for Health Sciences was supported by a grant from the American
Chemistry Council’s Long Range Research Initiative. The breast cancer data set
was generated with support of grants from NIH (R-01 to L.P.), The Breast Cancer
Research Foundation (to L.P. and W.F.S.) and the Faculty Incentive Funds of the
University of Texas MD Anderson Cancer Center (to W.F.S.). The data set from
the University of Arkansas for Medical Sciences was supported by National Cancer
Institute (NCI) PO1 grant CA55819-01A1, NCI R33 Grant CA97513-01, Donna D.
and Donald M. Lambert Lebow Fund to Cure Myeloma and Nancy and Steven
Grand Foundation. We are grateful to the individuals whose gene expression data
were used in this study. All MAQC-II participants freely donated their time and
reagents for the completion and analyses of the MAQC-II project. The MAQC-II
consortium also thanks R. O’Neill for his encouragement and coordination among
FDA Centers on the formation of the RBWG. The MAQC-II consortium gratefully
dedicates this work to the memory of R.F. Wagner, who enthusiastically worked on the
MAQC-II project and inspired many of us until he unexpectedly passed away in
June 2008.
DISCLAIMER
This work includes contributions from, and was reviewed by, individuals at the
FDA, the Environmental Protection Agency (EPA) and the NIH. This work has
been approved for publication by these agencies, but it does not necessarily reflect
official agency policy. Certain commercial materials and equipment are identified
in order to adequately specify experimental procedures. In no case does such
identification imply recommendation or endorsement by the FDA, the EPA or the
NIH, nor does it imply that the items identified are necessarily the best available
for the purpose.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details accompany the full-text
HTML version of the paper at http://www.nature.com/naturebiotechnology/.
Published online at http://www.nature.com/naturebiotechnology/.
Reprints and permissions information is available online at
http://npg.nature.com/reprintsandpermissions/.
1.	 Marshall, E. Getting the noise out of gene arrays. Science 306, 630–631 (2004).
2.	 Frantz, S. An array of problems. Nat. Rev. Drug Discov. 4, 362–363 (2005).
3.	 Michiels, S., Koscielny, S. & Hill, C. Prediction of cancer outcome with microarrays:
a multiple random validation strategy. Lancet 365, 488–492 (2005).
4.	 Ntzani, E.E. & Ioannidis, J.P. Predictive ability of DNA microarrays for cancer
outcomes and correlates: an empirical assessment. Lancet 362, 1439–1444
(2003).
5.	 Ioannidis, J.P. Microarrays and molecular research: noise discovery? Lancet 365,
454–455 (2005).
6.	 Ein-Dor, L., Kela, I., Getz, G., Givol, D. & Domany, E. Outcome signature genes in
breast cancer: is there a unique set? Bioinformatics 21, 171–178 (2005).
7.	 Ein-Dor, L., Zuk, O. & Domany, E. Thousands of samples are needed to generate
a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA
103, 5923–5928 (2006).
8.	 Shi, L. et al. QA/QC: challenges and pitfalls facing the microarray community and
regulatory agencies. Expert Rev. Mol. Diagn. 4, 761–777 (2004).
9.	 Shi, L. et al. Cross-platform comparability of microarray technology: intra-platform
consistency and appropriate data analysis procedures are essential. BMC
Bioinformatics 6 Suppl 2, S12 (2005).
10.	Shi, L. et al. The MicroArray Quality Control (MAQC) project shows inter- and
intraplatform reproducibility of gene expression measurements. Nat. Biotechnol.
24, 1151–1161 (2006).
11.	Guo, L. et al. Rat toxicogenomic study reveals analytical consistency across
microarray platforms. Nat. Biotechnol. 24, 1162–1169 (2006).
12.	Canales, R.D. et al. Evaluation of DNA microarray results with quantitative gene
expression platforms. Nat. Biotechnol. 24, 1115–1122 (2006).
13.	Patterson, T.A. et al. Performance comparison of one-color and two-color platforms
within the MicroArray Quality Control (MAQC) project. Nat. Biotechnol. 24,
1140–1150 (2006).
14.	Shippy, R. et al. Using RNA sample titrations to assess microarray platform
performance and normalization techniques. Nat. Biotechnol. 24, 1123–1131
(2006).
15.	Tong, W. et al. Evaluation of external RNA controls for the assessment of microarray
performance. Nat. Biotechnol. 24, 1132–1139 (2006).
16.	Irizarry, R.A. et al. Multiple-laboratory comparison of microarray platforms. Nat.
Methods 2, 345–350 (2005).
17.	Strauss, E. Arrays of hope. Cell 127, 657–659 (2006).
18.	Shi, L., Perkins, R.G., Fang, H. & Tong, W. Reproducible and reliable microarray
results through quality control: good laboratory proficiency and appropriate data
analysis practices are essential. Curr. Opin. Biotechnol. 19, 10–18 (2008).
19.	Dudoit, S., Fridlyand, J. & Speed, T.P. Comparison of discrimination methods for
the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97,
77–87 (2002).
20.	Goodsaid, F.M. et al. Voluntary exploratory data submissions to the US FDA and
the EMA: experience and impact. Nat. Rev. Drug Discov. 9, 435–445 (2010).
21.	van 't Veer, L.J. et al. Gene expression profiling predicts clinical outcome of breast
cancer. Nature 415, 530–536 (2002).
22.	Buyse, M. et al. Validation and clinical utility of a 70-gene prognostic signature for
women with node-negative breast cancer. J. Natl. Cancer Inst. 98, 1183–1192
(2006).
23.	Dumur, C.I. et al. Interlaboratory performance of a microarray-based gene expression
test to determine tissue of origin in poorly differentiated and undifferentiated
cancers. J. Mol. Diagn. 10, 67–77 (2008).
24.	Deng, M.C. et al. Noninvasive discrimination of rejection in cardiac allograft recipients
using gene expression profiling. Am. J. Transplant. 6, 150–160 (2006).
25.	Coombes, K.R., Wang, J. & Baggerly, K.A. Microarrays: retracing steps. Nat. Med.
13, 1276–1277, author reply 1277–1278 (2007).
26.	Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression
analyses. Nat. Genet. 41, 149–155 (2009).
27.	Baggerly, K.A., Edmonson, S.R., Morris, J.S. & Coombes, K.R. High-resolution serum
proteomic patterns for ovarian cancer detection. Endocr. Relat. Cancer 11,
583–584, author reply 585–587 (2004).
28.	Ambroise, C. & McLachlan, G.J. Selection bias in gene extraction on the basis of
microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99, 6562–6566
(2002).
29.	Simon, R. Using DNA microarrays for diagnostic and prognostic prediction. Expert
Rev. Mol. Diagn. 3, 587–595 (2003).
30.	Dobbin, K.K. et al. Interlaboratory comparability study of cancer gene expression
analysis using oligonucleotide microarrays. Clin. Cancer Res. 11, 565–572
(2005).
31.	Shedden, K. et al. Gene expression-based survival prediction in lung adenocarcinoma:
a multi-site, blinded validation study. Nat. Med. 14, 822–827 (2008).
32.	Parry, R.M. et al. K-nearest neighbors (KNN) models for microarray gene-expression
analysis and reliable clinical outcome prediction. Pharmacogenomics J. 10, 292–309
(2010).
33.	Dupuy, A. & Simon, R.M. Critical review of published microarray studies for cancer
outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer Inst.
99, 147–157 (2007).
34.	Dave, S.S. et al. Prediction of survival in follicular lymphoma based on molecular
features of tumor-infiltrating immune cells. N. Engl. J. Med. 351, 2159–2169
(2004).
35.	Tibshirani, R. Immune signatures in follicular lymphoma. N. Engl. J. Med. 352,
1496–1497, author reply 1496–1497 (2005).
36.	Shi, W. et al. Functional analysis of multiple genomic signatures demonstrates that
classification algorithms choose phenotype-related genes. Pharmacogenomics J. 10,
310–323 (2010).
37.	Robinson, G.K. That BLUP is a good thing: the estimation of random effects.
Stat. Sci. 6, 15–32 (1991).
38.	Hothorn, T., Hornik, K. & Zeileis, A. Unbiased recursive partitioning: a conditional
inference framework. J. Comput. Graph. Statist. 15, 651–674 (2006).
39.	Boutros, P.C. et al. Prognostic gene signatures for non-small-cell lung cancer. Proc.
Natl. Acad. Sci. USA 106, 2824–2828 (2009).
40.	Popovici, V. et al. Effect of training sample size and classification difficulty on the
accuracy of genomic predictors. Breast Cancer Res. 12, R5 (2010).
41.	Yousef, W.A., Wagner, R.F. & Loew, M.H. Assessing classifiers from two independent
data sets using ROC analysis: a nonparametric approach. IEEE Trans. Pattern Anal.
Mach. Intell. 28, 1809–1817 (2006).
42.	Gur, D., Wagner, R.F. & Chan, H.P. On the repeated use of databases for testing
incremental improvement of computer-aided detection schemes. Acad. Radiol. 11,
103–105 (2004).
43.	Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from
disarray to consolidation and consensus. Nat. Rev. Genet. 7, 55–65 (2006).
44.	Wood, I.A., Visscher, P.M. & Mengersen, K.L. Classification based upon gene expression
data: bias and precision of error rates. Bioinformatics 23, 1363–1370 (2007).
45.	Luo, J. et al. A comparison of batch effect removal methods for enhancement of
prediction performance using MAQC-II microarray gene expression data.
Pharmacogenomics J. 10, 278–291 (2010).
46.	Fan, X. et al. Consistency of predictive signature genes and classifiers generated using
different microarray platforms. Pharmacogenomics J. 10, 247–257 (2010).
47.	Huang, J. et al. Genomic indicators in the blood predict drug-induced liver injury.
Pharmacogenomics J. 10, 267–277 (2010).
48.	Oberthuer, A. et al. Comparison of performance of one-color and two-color gene-expression
analyses in predicting clinical endpoints of neuroblastoma patients.
Pharmacogenomics J. 10, 258–266 (2010).
49.	Hong, H. et al. Assessing sources of inconsistencies in genotypes and their effects
on genome-wide association studies with HapMap samples. Pharmacogenomics J.
10, 364–374 (2010).
Leming Shi1, Gregory Campbell2, Wendell D Jones3, Fabien Campagne4, Zhining Wen1, Stephen J Walker5,
Zhenqiang Su6, Tzu-Ming Chu7, Federico M Goodsaid8, Lajos Pusztai9, John D Shaughnessy Jr10,
André Oberthuer11, Russell S Thomas12, Richard S Paules13, Mark Fielden14, Bart Barlogie10, Weijie Chen2,
Pan Du15, Matthias Fischer11, Cesare Furlanello16, Brandon D Gallas2, Xijin Ge17, Dalila B Megherbi18,
W Fraser Symmans19, May D Wang20, John Zhang21, Hans Bitter22, Benedikt Brors23, Pierre R Bushel13,
Max Bylesjo24, Minjun Chen1, Jie Cheng25, Jing Cheng26, Jeff Chou13, Timothy S Davison27, Mauro Delorenzi28,
Youping Deng29, Viswanath Devanarayan30, David J Dix31, Joaquin Dopazo32, Kevin C Dorff33, Fathi Elloumi31,
Jianqing Fan34, Shicai Fan35, Xiaohui Fan36, Hong Fang6, Nina Gonzaludo37, Kenneth R Hess38,
Huixiao Hong1, Jun Huan39, Rafael A Irizarry40, Richard Judson31, Dilafruz Juraeva23, Samir Lababidi41,
Christophe G Lambert42, Li Li7, Yanen Li43, Zhen Li31, Simon M Lin15, Guozhen Liu44, Edward K Lobenhofer45,
Jun Luo21, Wen Luo46, Matthew N McCall40, Yuri Nikolsky47, Gene A Pennello2, Roger G Perkins1, Reena Philip2,
Vlad Popovici28, Nathan D Price48, Feng Qian6, Andreas Scherer49, Tieliu Shi50, Weiwei Shi47, Jaeyun Sung48,
Danielle Thierry-Mieg51, Jean Thierry-Mieg51, Venkata Thodima52, Johan Trygg24, Lakshmi Vishnuvajjala2,
Sue Jane Wang8, Jianping Wu53, Yichao Wu54, Qian Xie55, Waleed A Yousef56, Liang Zhang53, Xuegong Zhang35,
Sheng Zhong57, Yiming Zhou10, Sheng Zhu53, Dhivya Arasappan6, Wenjun Bao7, Anne Bergstrom Lucas58,
Frank Berthold11, Richard J Brennan47, Andreas Buness59, Jennifer G Catalano41, Chang Chang50,
Rong Chen60, Yiyu Cheng36, Jian Cui50, Wendy Czika7, Francesca Demichelis61, Xutao Deng62,
Damir Dosymbekov63, Roland Eils23, Yang Feng34, Jennifer Fostel13, Stephanie Fulmer-Smentek58,
James C Fuscoe1, Laurent Gatto64, Weigong Ge1, Darlene R Goldstein65, Li Guo66, Donald N Halbert67,
Jing Han41, Stephen C Harris1, Christos Hatzis68, Damir Herman69, Jianping Huang36, Roderick V Jensen70,
Rui Jiang35, Charles D Johnson71, Giuseppe Jurman16, Yvonne Kahlert11, Sadik A Khuder72, Matthias Kohl73,
Jianying Li74, Li Li75, Menglong Li76, Quan-Zhen Li77, Shao Li36, Zhiguang Li1, Jie Liu1, Ying Liu35, Zhichao Liu1,
Lu Meng35, Manuel Madera18, Francisco Martinez-Murillo2, Ignacio Medina78, Joseph Meehan6, Kelci Miclaus7,
Richard A Moffitt20, David Montaner78, Piali Mukherjee33, George J Mulligan79, Padraic Neville7,
Tatiana Nikolskaya47, Baitang Ning1, Grier P Page80, Joel Parker3, R Mitchell Parry20, Xuejun Peng81,
Ron L Peterson82, John H Phan20, Brian Quanz39, Yi Ren83, Samantha Riccadonna16, Alan H Roter84,
Frank W Samuelson2, Martin M Schumacher85, Joseph D Shambaugh86, Qiang Shi1, Richard Shippy87,
Shengzhu Si88, Aaron Smalter39, Christos Sotiriou89, Mat Soukup8, Frank Staedtler85, Guido Steiner90,
Todd H Stokes20, Qinglan Sun53, Pei-Yi Tan7, Rong Tang2, Zivana Tezak2, Brett Thorn1, Marina Tsyganova63,
Yaron Turpaz91, Silvia C Vega92, Roberto Visintainer16, Juergen von Frese93, Charles Wang62, Eric Wang21,
Junwei Wang50, Wei Wang94, Frank Westermann23, James C Willey95, Matthew Woods21, Shujian Wu96,
Nianqing Xiao97, Joshua Xu6, Lei Xu1, Lun Yang1, Xiao Zeng44, Jialu Zhang8, Li Zhang8, Min Zhang1,
Chen Zhao50, Raj K Puri41, Uwe Scherf2, Weida Tong1 & Russell D Wolfinger7
1National Center for Toxicological Research, US Food and Drug Administration, Jefferson, Arkansas, USA. 2Center for Devices and Radiological Health, US Food and
Drug Administration, Silver Spring, Maryland, USA. 3Expression Analysis Inc., Durham, North Carolina, USA. 4Department of Physiology and Biophysics and HRH
Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, USA.
5Wake Forest Institute for Regenerative Medicine, Wake Forest University, Winston-Salem, North Carolina, USA. 6Z-Tech, an ICF International Company at NCTR/FDA,
Jefferson, Arkansas, USA. 7SAS Institute Inc., Cary, North Carolina, USA. 8Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring,
Maryland, USA. 9Breast Medical Oncology Department, University of Texas (UT) M.D. Anderson Cancer Center, Houston, Texas, USA. 10Myeloma Institute for Research
and Therapy, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA. 11Department of Pediatric Oncology and Hematology and Center for Molecular
Medicine (CMMC), University of Cologne, Cologne, Germany. 12The Hamner Institutes for Health Sciences, Research Triangle Park, North Carolina, USA. 13National
Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, USA. 14Roche Palo Alto LLC, South San Francisco,
California, USA. 15Biomedical Informatics Center, Northwestern University, Chicago, Illinois, USA. 16Fondazione Bruno Kessler, Povo-Trento, Italy. 17Department of
Mathematics & Statistics, South Dakota State University, Brookings, South Dakota, USA. 18CMINDS Research Center, Department of Electrical and Computer
Engineering, University of Massachusetts Lowell, Lowell, Massachusetts, USA. 19Department of Pathology, UT M.D. Anderson Cancer Center, Houston, Texas, USA.
20Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, USA. 21Systems Analytics Inc., Waltham,
Massachusetts, USA. 22Hoffmann-LaRoche, Nutley, New Jersey, USA. 23Department of Theoretical Bioinformatics, German Cancer Research Center (DKFZ),
Heidelberg, Germany. 24Computational Life Science Cluster (CLiC), Chemical Biology Center (KBC), Umeå University, Umeå, Sweden. 25GlaxoSmithKline, Collegeville,
Pennsylvania, USA. 26Medical Systems Biology Research Center, School of Medicine, Tsinghua University, Beijing, China. 27Almac Diagnostics Ltd., Craigavon, UK.
28Swiss Institute of Bioinformatics, Lausanne, Switzerland. 29Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, Mississippi, USA.
30Global Pharmaceutical R&D, Abbott Laboratories, Souderton, Pennsylvania, USA. 31National Center for Computational Toxicology, US Environmental Protection
Agency, Research Triangle Park, North Carolina, USA. 32Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain.
33HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York,
USA. 34Department of Operation Research and Financial Engineering, Princeton University, Princeton, New Jersey, USA. 35MOE Key Laboratory of Bioinformatics
and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing, China. 36Institute of Pharmaceutical Informatics, College of
Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, China. 37Roche Palo Alto LLC, Palo Alto, California, USA. 38Department of Biostatistics,
UT M.D. Anderson Cancer Center, Houston, Texas, USA. 39Department of Electrical Engineering & Computer Science, University of Kansas, Lawrence, Kansas, USA.
40Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA. 41Center for Biologics Evaluation and Research, US Food and Drug
Administration, Bethesda, Maryland, USA. 42Golden Helix Inc., Bozeman, Montana, USA. 43Department of Computer Science, University of Illinois at Urbana-Champaign,
Urbana, Illinois, USA. 44SABiosciences Corp., a Qiagen Company, Frederick, Maryland, USA. 45Cogenics, a Division of Clinical Data Inc., Morrisville,
North Carolina, USA. 46Ligand Pharmaceuticals Inc., La Jolla, California, USA. 47GeneGo Inc., Encinitas, California, USA. 48Department of Chemical and
Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. 49Spheromics, Kontiolahti, Finland. 50The Center for Bioinformatics and
The Institute of Biomedical Sciences, School of Life Science, East China Normal University, Shanghai, China. 51National Center for Biotechnology Information,
National Institutes of Health, Bethesda, Maryland, USA. 52Rockefeller Research Laboratories, Memorial Sloan-Kettering Cancer Center, New York, New York, USA.
53CapitalBio Corporation, Beijing, China. 54Department of Statistics, North Carolina State University, Raleigh, North Carolina, USA. 55SRA International (EMMES),
Rockville, Maryland, USA. 56Helwan University, Helwan, Egypt. 57Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA.
58Agilent Technologies Inc., Santa Clara, California, USA. 59F. Hoffmann-La Roche Ltd., Basel, Switzerland. 60Stanford Center for Biomedical Informatics Research,
Stanford University, Stanford, California, USA. 61Department of Pathology and Laboratory Medicine and HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute
for Computational Biomedicine, Weill Medical College of Cornell University, New York, New York, USA. 62Cedars-Sinai Medical Center, UCLA David Geffen School of
Medicine, Los Angeles, California, USA. 63Vavilov Institute for General Genetics, Russian Academy of Sciences, Moscow, Russia. 64DNAVision SA, Gosselies, Belgium.
65École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. 66State Key Laboratory of Multi-phase Complex Systems, Institute of Process
Engineering, Chinese Academy of Sciences, Beijing, China. 67Abbott Laboratories, Abbott Park, Illinois, USA. 68Nuvera Biosciences Inc., Woburn, Massachusetts,
USA. 69Winthrop P. Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA. 70Virginia Tech, Blacksburg, Virginia, USA.
71BioMath Solutions, LLC, Austin, Texas, USA. 72Bioinformatic Program, University of Toledo, Toledo, Ohio, USA. 73Department of Mathematics, University of
Bayreuth, Bayreuth, Germany. 74Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, USA. 75Pediatric Department,
Stanford University, Stanford, California, USA. 76College of Chemistry, Sichuan University, Chengdu, Sichuan, China. 77University of Texas Southwestern Medical
Center (UTSW), Dallas, Texas, USA. 78Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain. 79Millennium Pharmaceuticals Inc., Cambridge,
Massachusetts, USA. 80RTI International, Atlanta, Georgia, USA. 81Takeda Global R & D Center, Inc., Deerfield, Illinois, USA. 82Novartis Institutes of Biomedical
Research, Cambridge, Massachusetts, USA. 83W.M. Keck Center for Collaborative Neuroscience, Rutgers, The State University of New Jersey, Piscataway, New Jersey,
USA. 84Entelos Inc., Foster City, California, USA. 85Biomarker Development, Novartis Institutes of BioMedical Research, Novartis Pharma AG, Basel, Switzerland.
86Genedata Inc., Lexington, Massachusetts, USA. 87Affymetrix Inc., Santa Clara, California, USA. 88Department of Chemistry and Chemical Engineering, Hefei
Teachers College, Hefei, Anhui, China. 89Institut Jules Bordet, Brussels, Belgium. 90Biostatistics, F. Hoffmann-La Roche Ltd., Basel, Switzerland. 91Lilly Singapore
Centre for Drug Discovery, Immunos, Singapore. 92Microsoft Corporation, US Health Solutions Group, Redmond, Washington, USA. 93Data Analysis Solutions DA-SOL
GmbH, Greifenberg, Germany. 94Cornell University, Ithaca, New York, USA. 95Division of Pulmonary and Critical Care Medicine, Department of Medicine, University of
Toledo Health Sciences Campus, Toledo, Ohio, USA. 96Bristol-Myers Squibb, Pennington, New Jersey, USA. 97OpGen Inc., Gaithersburg, Maryland, USA.
nature biotechnology doi:10.1038/nbt.1665
ONLINE METHODS
MAQC-II participants. MAQC-II participants can be grouped into several
categories. Data providers are the participants who provided data sets to the
consortium. The MAQC-II Regulatory Biostatistics Working Group, whose
members included a number of biostatisticians, provided guidance and standard
operating procedures for model development and performance estimation. One
or more data analysis teams were formed at each organization. Each data analysis
team actively analyzed the data sets and produced prediction models. Other participants
also contributed to discussion and execution of the project. The 36 data
analysis teams listed in Supplementary Table 3 developed data analysis protocols
and predictive models for one or more of the 13 endpoints. The teams included
more than 100 scientists and engineers with diverse backgrounds in machine
learning, statistics, biology, medicine and chemistry, among others. They volunteered
tremendous time and effort to conduct the data analysis tasks.
Six data sets including 13 prediction endpoints. To increase the chance
that MAQC-II would reach generalized conclusions, consortium members
strongly believed that they needed to study several data sets, each of high
quality and sufficient size, which would collectively represent a diverse set of
prediction tasks. Accordingly, significant early effort went toward the selection
of appropriate data sets. Over ten nominated data sets were reviewed
for quality of sample collection and processing consistency, and quality of
microarray and clinical data. Six data sets with 13 endpoints were ultimately
selected among those nominated during a face-to-face project meeting with
extensive deliberations among many participants (Table 1). Importantly, three
preclinical (toxicogenomics) and three clinical data sets were selected to test
whether baseline practice conclusions could be generalized across these rather
disparate experimental types. An important criterion for data set selection
was the anticipated support of MAQC-II by the data provider and the commitment
to continue experimentation to provide a large external validation
test set of comparable size to the training set. The three toxicogenomics data
sets would allow the development of predictive models that predict toxicity
of compounds in animal models, a prediction task of interest to the pharmaceutical
industry, which could use such models to speed up the evaluation of
toxicity for new drug candidates. The three clinical data sets were for endpoints
associated with three diseases, breast cancer (BR), multiple myeloma (MM)
and neuroblastoma (NB). Each clinical data set had more than one endpoint,
and together incorporated several types of clinical applications, including
treatment outcome and disease prognosis. The MAQC-II predictive modeling
was limited to binary classification problems; therefore, continuous endpoint
values such as overall survival (OS) and event-free survival (EFS) times were
dichotomized using a ‘milestone’ cutoff of censor data. Prediction endpoints
were chosen to span a wide range of prediction difficulty. Two endpoints,
H (CPS1) and L (NEP_S), representing the sex of the patients, were used as
positive control endpoints, as they are easily predictable by microarrays. Two
other endpoints, I (CPR1) and M (NEP_R), representing randomly assigned
class labels, were designed to serve as negative control endpoints, as they
are not supposed to be predictable. Data analysis teams were not aware of
the characteristics of endpoints H, I, L and M until their swap prediction
results had been submitted. If a data analysis protocol did not yield models that
accurately predicted endpoints H and L, or if a data analysis protocol claimed to
yield models that accurately predicted endpoints I and M, something
must have gone wrong.
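The milestone dichotomization of survival endpoints described above can be illustrated with a minimal sketch. The function name, the label coding (1 = event before the milestone, 0 = event-free at the milestone) and the choice to leave samples censored before the milestone unlabeled are assumptions made for illustration only; the text does not state how the consortium handled such samples.

```python
from typing import Optional


def dichotomize_survival(time: float, event: bool, milestone: float) -> Optional[int]:
    """Turn a (time, event) survival record into a binary class label.

    Returns 1 if the event occurred before the milestone, 0 if the subject
    was still event-free at the milestone, and None if the subject was
    censored before the milestone (label undetermined; an assumption).
    """
    if time >= milestone:
        return 0      # no event observed before the milestone
    if event:
        return 1      # event observed before the milestone
    return None       # censored early: status at the milestone is unknown


# Hypothetical records: (follow-up time in months, event observed?)
records = [(30.0, True), (12.0, True), (10.0, False), (24.0, False)]
labels = [dichotomize_survival(t, e, milestone=24.0) for t, e in records]
```

With a 24-month milestone, the third record would be excluded from a binary classification task under this convention, because its status at the milestone cannot be determined.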
The Hamner data set (endpoint A) was provided by The Hamner Institutes
for Health Sciences. The study objective was to apply microarray gene expression
data from the lung of female B6C3F1 mice exposed to a 13-week treatment
of chemicals to predict increased lung tumor incidence in the 2-year
rodent cancer bioassays of the National Toxicology Program50. If successful,
the results may form the basis of a more efficient and economical approach
for evaluating the carcinogenic activity of chemicals. Microarray analysis was
performed using Affymetrix Mouse Genome 430 2.0 arrays on three to four
mice per treatment group, and a total of 70 mice were analyzed and used as
MAQC-II’s training set. Additional data from another set of 88 mice were
collected later and provided as MAQC-II’s external validation set.
The Iconix data set (endpoint B) was provided by Iconix Biosciences.
The study objective was to assess, upon short-term exposure, hepatic tumor
induction by nongenotoxic chemicals51, as there are currently no accurate and
well-validated short-term tests to identify nongenotoxic hepatic tumorigens,
thus necessitating an expensive 2-year rodent bioassay before a risk assessment
can begin. The training set consists of hepatic gene expression data from 216
male Sprague-Dawley rats treated for 5 d with one of 76 structurally and mechanistically
diverse nongenotoxic hepatocarcinogens and nonhepatocarcinogens.
The validation set consists of 201 male Sprague-Dawley rats treated for 5 d with
one of 68 structurally and mechanistically diverse nongenotoxic hepatocarcinogens
and nonhepatocarcinogens. Gene expression data were generated using the
Amersham Codelink Uniset Rat 1 Bioarray (GE HealthCare)52. The separation
of the training set and validation set was based on the time when the microarray
data were collected; that is, microarrays processed earlier in the study
were used as training and those processed later were used as validation.
The NIEHS data set (endpoint C) was provided by the National Institute
of Environmental Health Sciences (NIEHS) of the US National Institutes
of Health. The study objective was to use microarray gene expression data
acquired from the liver of rats exposed to hepatotoxicants to build classifiers
for prediction of liver necrosis. The gene expression ‘compendium’ data set
was collected from 418 rats exposed to one of eight compounds (1,2-dichlorobenzene,
1,4-dichlorobenzene, bromobenzene, monocrotaline, N-nitrosomorpholine,
thioacetamide, galactosamine and diquat dibromide). All eight
compounds were studied using standardized procedures, that is, a common
array platform (Affymetrix Rat 230 2.0 microarray), experimental procedures
and data retrieving and analysis processes. For details of the experimental
design see ref. 53. Briefly, for each compound, four to six male, 12-week-old
F344 rats were exposed to a low dose, mid dose(s) and a high dose of the toxicant
and sacrificed 6, 24 and 48 h later. At necropsy, liver was harvested for
RNA extraction, histopathology and clinical chemistry assessments.
Animal use in the studies was approved by the respective Institutional
Animal Use and Care Committees of the data providers and was conducted
in accordance with the National Institutes of Health (NIH) guidelines
for the care and use of laboratory animals. Animals were housed in fully
accredited American Association for Accreditation of Laboratory Animal
Care facilities.
The human breast cancer (BR) data set (endpoints D and E) was contributed
by the University of Texas M.D. Anderson Cancer Center. Gene expression data
from 230 stage I–III breast cancers were generated from fine needle aspiration
specimens of newly diagnosed breast cancers before any therapy. The biopsy
specimens were collected sequentially during a prospective pharmacogenomic
marker discovery study between 2000 and 2008. These specimens represent
70–90% pure neoplastic cells with minimal stromal contamination54. Patients
received 6 months of preoperative (neoadjuvant) chemotherapy including
paclitaxel (Taxol), 5-fluorouracil, cyclophosphamide and doxorubicin
(Adriamycin) followed by surgical resection of the cancer. Response to preoperative
chemotherapy was categorized as a pathological complete response
(pCR = no residual invasive cancer in the breast or lymph nodes) or residual
invasive cancer (RD), and used as endpoint D for prediction. Endpoint E is the
clinical estrogen-receptor status as established by immunohistochemistry55.
RNA extraction and gene expression profiling were performed in multiple
batches over time using Affymetrix U133A microarrays. Genomic analyses of
a subset of this sequentially accrued patient population were reported previously56.
For each endpoint, the first 130 cases were used as a training set and
the next 100 cases were used as an independent validation set.
The multiple myeloma (MM) data set (endpoints F, G, H and I) was contributed
by the Myeloma Institute for Research and Therapy at the University
of Arkansas for Medical Sciences. Gene expression profiling of highly purified
bone marrow plasma cells was performed in newly diagnosed patients with
MM57–59. The training set consisted of 340 cases enrolled in total therapy 2
(TT2) and the validation set comprised 214 patients enrolled in total therapy 3
(TT3)59. Plasma cells were enriched by anti-CD138 immunomagnetic bead
selection of mononuclear cell fractions of bone marrow aspirates in a central
laboratory. All samples applied to the microarray contained >85% plasma
cells as determined by two-color flow cytometry (CD38+ and CD45−/dim)
performed after selection. Dichotomized overall survival (OS) and event-free
survival (EFS) were determined based on a 2-year milestone cutoff. A gene
expression model of high-risk multiple myeloma was developed and validated
by the data provider58 and later on validated in three additional independent
data sets60–62.
The neuroblastoma (NB) data set (endpoints J, K, L and M) was contributed
by the Children’s Hospital of the University of Cologne, Germany. Tumor
samples were checked by a pathologist before RNA isolation; only samples
with ≥60% tumor content were used and total RNA was isolated from ~50 mg
of snap-frozen neuroblastoma tissue obtained before chemotherapeutic
treatment. First, 502 preexisting 11 K Agilent dye-flipped, dual-color replicate
profiles for 251 patients were provided63. Of these, profiles of 246 neuroblastoma
samples passed an independent MAQC-II quality assessment by majority
decision and formed the MAQC-II training data set. Subsequently, 514 dye-flipped,
dual-color 11 K replicate profiles for 256 independent neuroblastoma
tumor samples were generated and profiles for 253 samples were selected to
form the MAQC-II validation set. Of note, for one patient of the validation
set, two different tumor samples were analyzed using both versions of the
2 × 11K microarray (see below). All dual-color gene-expression profiles of the MAQC-II
training set were generated using a customized 2 × 11K neuroblastoma-related
microarray63. Furthermore, 20 patients of the MAQC-II validation set were
also profiled using this microarray. Dual-color profiles of the remaining
patients of the MAQC-II validation set were generated using a slightly revised
version of the 2 × 11K microarray. This version V2.0 of the array comprised
200 novel oligonucleotide probes whereas 100 oligonucleotide probes of the
original design were removed due to consistent low expression values (near
background) observed in the training set profiles. These minor modifications
of the microarray design resulted in a total of 9,986 probes present on both
versions of the 2 × 11K microarray. The experimental protocol did not differ
between the two sets, and gene-expression profiling was performed as described63.
Furthermore, single-color gene-expression profiles were generated for 478/499
neuroblastoma samples of the MAQC-II dual-color training and validation sets
(training set 244/246; validation set 234/253). For the remaining 21 samples
no single-color data were available, due to either shortage of tumor material
of these patients (n = 15), poor experimental quality of the generated single-color
profiles (n = 5), or correlation of one single-color profile to two different
dual-color profiles for the one patient profiled with both versions of the 2 ×
11K microarrays (n = 1). Single-color gene-expression profiles were generated
using customized 4 × 44K oligonucleotide microarrays produced by Agilent
Technologies. These 4 × 44K microarrays included all probes represented by
Agilent’s Whole Human Genome Oligo Microarray and all probes of the version
V2.0 of the 2 × 11K customized microarray that were not present in the
former probe set. Labeling and hybridization were performed following the
manufacturer’s protocol as described48.
Sample annotation information along with clinical co-variates of the patient
cohorts is available at the MAQC web site (http://edkb.fda.gov/MAQC/). The
institutional review boards of the respective providers of the clinical microarray
data sets had approved the research studies, and all subjects had provided
written informed consent to both treatment protocols and sample procurement,
in accordance with the Declaration of Helsinki.
MAQC-II effort and data analysis procedure. This section provides details
about some of the analysis steps presented in Figure 1. In steps 2–4, a first
round of analysis was conducted in which each data analysis team analyzed
MAQC-II data sets to generate predictive models and associated performance
estimates. After this first round of analysis, most participants attended
a consortium meeting where approaches were presented and discussed. The
meeting helped members decide on a common performance evaluation protocol,
which most data analysis teams agreed to follow to render performance
statistics comparable across the consortium. It should be noted that some data
analysis teams decided not to follow the recommendations for performance
evaluation protocol and used instead an approach of their choosing, resulting
in various internal validation approaches in the final results. Data analysis
teams were given 2 months to implement the revised analysis protocol (the
group recommended using fivefold stratified cross-validation with ten repeats
across all endpoints for the internal validation strategy) and submit their final
models. The amount of metadata to collect for characterizing the modeling
approach used to derive each model was also discussed at the meeting.
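The recommended internal validation scheme (fivefold stratified cross-validation with ten repeats) can be sketched as follows. This is an illustrative, standard-library-only sketch, not the consortium's implementation; the function name, the round-robin dealing strategy and the seed handling are assumptions.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal the samples of each class round-robin so that every fold
            # keeps roughly the class proportions of the full training set.
            for pos, idx in enumerate(shuffled):
                folds[(offset + pos) % n_splits].append(idx)
            offset += len(shuffled)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Hypothetical training set: 20 samples of class 0 and 10 of class 1.
example_labels = [0] * 20 + [1] * 10
example_splits = list(repeated_stratified_kfold(example_labels))
```

Each of the resulting 50 train/test partitions holds out one fifth of the samples; averaging a performance metric over them also gives the standard deviation that is reported alongside internal validation results.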
For each endpoint, each team was also required to select one of its
submitted models as its nominated model. No specific guideline was given
and groups could select nominated models according to any objective or
subjective criteria. Because the consortium lacked an agreed upon reference
performance measure (Supplementary Fig. 13), it was not clear how the
nominated models would be evaluated, and data analysis teams ranked models
by different measures or combinations of measures. Data analysis teams were
encouraged to report a common set of performance measures for each model
so that models could be reranked consistently a posteriori. Models trained
with the training set were frozen (step 6). MAQC-II selected for each endpoint
one model from the up-to 36 nominations as the MAQC-II candidate
for validation (step 6).
External validation sets lacking class labels for all endpoints were distributed
to the data analysis teams. Each data analysis team used its previously
frozen models to make class predictions on the validation data set (step 7).
The sample-by-sample prediction results were submitted to MAQC-II by
each data analysis team (step 8). Results were used to calculate the external
validation performance metrics for each model. Calculations were carried
out by three independent groups not involved in developing models, which
were provided with validation class labels. Data analysis teams that still had
no access to the validation class labels were given an opportunity to correct
apparent clerical mistakes in prediction submissions (e.g., inversion of class
labels). Class labels were then distributed to enable data analysis teams to
check prediction performance metrics and perform in depth analysis of results.
A table of performance metrics was assembled from information collected in
steps 5 and 8 (step 10, Supplementary Table 1).
To check the consistency of modeling approaches, the original validation and
training sets were swapped and steps 4–10 were repeated (step 11). Briefly, each
team used the validation class labels and the validation data sets as a training
set. Prediction models and evaluation performance were collected by internal
and external validation (considering the original training set as a validation
set). Data analysis teams were asked to apply the same data analysis protocols
that they used for the original ‘Blind’ Training → Validation analysis. Swap
analysis results are provided in Supplementary Table 2. It should be noted
that during the swap experiment, the data analysis teams inevitably already
had access to the class label information for samples in the swap validation set,
that is, the original training set.
Model summary information tables. To enable a systematic comparison of
models for each endpoint, a table of information was constructed containing
a row for each model from each data analysis team, with columns containing
three categories of information: (i) modeling factors that describe the model
development process; (ii) performance metrics from internal validation; and
(iii) performance metrics from external validation (Fig. 1; step 10).
Each data analysis team was requested to report several modeling factors for
each model they generated. These modeling factors are organization code, data
set code, endpoint code, summary or normalization method, feature selection
method, number of features used in final model, classification algorithm,
internal validation protocol, validation iterations (number of repeats of cross-validation
or bootstrap sampling) and batch-effect-removal method. A set of
valid entries for each modeling factor was distributed to all data analysis teams
in advance of model submission, to help consolidate a common vocabulary
that would support analysis of the completed information table. It should be
noted that since modeling factors are self-reported, two models that share a
given modeling factor may still differ in their implementation of the modeling
approach described by the modeling factor.
The seven performance metrics for internal validation and external validation
are MCC (Matthews Correlation Coefficient), accuracy, sensitivity, specificity,
AUC (area under the receiver operating characteristic curve), binary
AUC (that is, mean of sensitivity and specificity) and r.m.s.e. For internal
validation, s.d. for each performance metric is also included in the table.
Missing entries indicate that the data analysis team has not submitted the
requested information.
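For reference, the threshold-based metrics listed above can be computed from the four cells of a confusion matrix, as in the sketch below. The function name and the dictionary keys are illustrative assumptions; AUC and r.m.s.e. are omitted because they require continuous prediction scores rather than class calls, and the sketch assumes that both classes are present in the evaluated set.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based performance metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true-positive rate
    specificity = tn / (tn + fp)          # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {
        "MCC": mcc,
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "binary_AUC": (sensitivity + specificity) / 2,
    }


perfect = binary_metrics(tp=5, fp=0, tn=5, fn=0)
chance = binary_metrics(tp=5, fp=5, tn=5, fn=5)
```

A perfect classifier yields MCC = 1 and a classifier at chance level yields MCC = 0, which is the behavior expected for the positive and negative control endpoints, respectively.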
In addition, the lists of features used in the data analysis team’s nominated
models are recorded as part of the model submission for functional analysis
and reproducibility assessment of the feature lists (see the MAQC Web site at
http://edkb.fda.gov/MAQC/).
Selection of nominated models by each data analysis team and selection
of MAQC-II candidate and backup models by RBWG and the steering
committee. In addition to providing results to generate the model information
table, each team nominated a single model for each endpoint as its preferred
model for validation, resulting in a total of 323 nominated models, 318 of
which were applied to the prediction of the validation sets. These nominated
models were peer reviewed, debated and ranked for each endpoint by the
RBWG before validation set predictions. The rankings were given to the
MAQC-II steering committee, and those members not directly involved in
developing models selected a single model for each endpoint, forming the 13
MAQC-II candidate models. If there was sufficient evidence through documentation
to establish that the data analysis team had followed the guidelines
of good classifier principles for model development outlined in the standard
operating procedure (Supplementary Data), then their nominated models
were considered as potential candidate models. The nomination and selection
of candidate models occurred before the validation data were released.
Selection of one candidate model for each endpoint across MAQC-II was
performed to reduce multiple selection concerns. This selection process turned
out to be highly interesting and time-consuming, but worthwhile, as participants had
different viewpoints and criteria in ranking the data analysis protocols and
selecting the candidate model for an endpoint. One additional criterion was
to select the 13 candidate models in such a way that only one of the 13 models
would be selected from the same data analysis team to ensure that a variety
of approaches to model development were considered. For each endpoint, a
backup model was also selected under the same selection process and criteria
as for the candidate models. The 13 candidate models selected by MAQC-II
indeed performed well in the validation prediction (Figs. 2c and 3).
50.	Thomas, R.S., Pluta, L., Yang, L. & Halsey, T.A. Application of genomic biomarkers
to predict increased lung tumor incidence in 2-year rodent cancer bioassays. Toxicol.
Sci. 97, 55–64 (2007).
51.	Fielden, M.R., Brennan, R. & Gollub, J. A gene expression biomarker provides early
prediction and mechanistic assessment of hepatic tumor induction by nongenotoxic
chemicals. Toxicol. Sci. 99, 90–100 (2007).
52.	Ganter, B. et al. Development of a large-scale chemogenomics database to improve
drug candidate selection and to understand mechanisms of chemical toxicity and
action. J. Biotechnol. 119, 219–244 (2005).
53.	Lobenhofer, E.K. et al. Gene expression response in target organ and whole blood
varies as a function of target organ injury phenotype. Genome Biol. 9, R100
(2008).
54.	Symmans, W.F. et al. Total RNA yield and microarray gene expression profiles from
fine-needle aspiration biopsy and core-needle biopsy samples of breast carcinoma.
Cancer 97, 2960–2971 (2003).
55.	Gong, Y. et al. Determination of oestrogen-receptor status and ERBB2 status of
breast carcinoma: a gene-expression profiling study. Lancet Oncol. 8, 203–211
(2007).
56.	Hess, K.R. et al. Pharmacogenomic predictor of sensitivity to preoperative
chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide
in breast cancer. J. Clin. Oncol. 24, 4236–4244 (2006).
57.	Zhan, F. et al. The molecular classification of multiple myeloma. Blood 108,
2020–2028 (2006).
58.	Shaughnessy, J.D. Jr. et al. A validated gene expression model of high-risk multiple
myeloma is defined by deregulated expression of genes mapping to chromosome 1.
Blood 109, 2276–2284 (2007).
59.	Barlogie, B. et al. Thalidomide and hematopoietic-cell transplantation for multiple
myeloma. N. Engl. J. Med. 354, 1021–1030 (2006).
60.	Zhan, F., Barlogie, B., Mulligan, G., Shaughnessy, J.D. Jr. & Bryant, B. High-risk
myeloma: a gene expression based risk-stratification model for newly diagnosed
multiple myeloma treated with high-dose therapy is predictive of outcome in
relapsed disease treated with single-agent bortezomib or high-dose dexamethasone.
Blood 111, 968–969 (2008).
61.	Chng, W.J., Kuehl, W.M., Bergsagel, P.L. & Fonseca, R. Translocation t(4;14) retains
prognostic significance even in the setting of high-risk molecular signature.
Leukemia 22, 459–461 (2008).
62.	Decaux, O. et al. Prediction of survival in multiple myeloma based on gene
expression profiles reveals cell cycle and chromosomal instability signatures in
high-risk patients and hyperdiploid signatures in low-risk patients: a study of the
Intergroupe Francophone du Myelome. J. Clin. Oncol. 26, 4798–4805 (2008).
63.	Oberthuer, A. et al. Customized oligonucleotide microarray gene expression-based
classification of neuroblastoma patients outperforms current clinical risk
stratification. J. Clin. Oncol. 24, 5070–5078 (2006).
11 Identification of a poor-prognosis BRAF-mutant-like
population of patients with colon
cancer
• Journal of Clinical Oncology, 30(12):1288–1295
• IF: 24.008
• number of citations: 79
• personal contribution (60%): method design, data preprocessing,
experimental design and implementation, biomarker discovery,
results analysis, manuscript writing
97
Identification of a Poor-Prognosis BRAF-Mutant–Like
Population of Patients With Colon Cancer
Vlad Popovici, Eva Budinska, Sabine Tejpar, Scott Weinrich, Heather Estrella, Graeme Hodgson,
Eric Van Cutsem, Tao Xie, Fred T. Bosman, Arnaud D. Roth, and Mauro Delorenzi
See accompanying editorial on page 1255; listen to the podcast by Dr Meyerhardt at
www.jco.org/podcasts
Vlad Popovici, Eva Budinska, and Mauro
Delorenzi, Swiss Institute of Bioinformatics;
Fred T. Bosman and Mauro
Delorenzi, Lausanne University Medical
Center, Lausanne; Arnaud D. Roth,
Geneva University Hospital, Geneva;
Arnaud D. Roth, The Swiss Group for
Clinical Cancer Research, Bern, Switzerland;
Sabine Tejpar and Eric Van
Cutsem, University Hospital Gasthuisberg,
Katholieke Universiteit Leuven,
Leuven, Belgium; and Scott Weinrich,
Heather Estrella, Graeme Hodgson, and
Tao Xie, Pfizer, La Jolla, CA.
Submitted September 23, 2011;
accepted December 20, 2011;
published online ahead of print at
www.jco.org on March 5, 2012.
Supported by the Swiss National
Center for Competence in Research
Molecular Biology (V.P. and M.D.) and
Fondation Medic (E.B.); S.T. is a senior
clinical investigator of the Fund for
Scientific Research Flanders and has
received research grants from the
Belgian Federation Against Cancer and
from the Belgian National Cancer Plan.
Authors’ disclosures of potential conflicts
of interest and author contributions
are found at the end of this
article.
Corresponding author: Vlad Popovici,
PhD, Bioinformatics Core Facility, Swiss
Institute of Bioinformatics, Quartier
Sorge, Genopode, CH-1015 Lausanne,
Switzerland; e-mail: vlad.popovici@
isb-sib.ch.
© 2012 by American Society of Clinical
Oncology
0732-183X/12/3012-1288/$20.00
DOI: 10.1200/JCO.2011.39.5814
A B S T R A C T
Purpose
Our purpose was development and assessment of a BRAF-mutant gene expression signature for
colon cancer (CC) and the study of its prognostic implications.
Materials and Methods
A set of 668 stage II and III CC samples from the PETACC-3 (Pan-European Trials in Alimentary Tract
Cancers) clinical trial was used to assess differential gene expression between c.1799T>A (p.V600E)
BRAF mutant and non-BRAF, non-KRAS mutant cancers (double wild type) and to construct a gene
expression–based classifier for detecting BRAF mutant samples with high sensitivity. The classifier was
validated in independent data sets, and survival rates were compared between classifier positive and
negative tumors.
Results
A 64 gene-based classifier was developed with 96% sensitivity and 86% specificity for detecting
BRAF mutant tumors in PETACC-3 and independent samples. A subpopulation of BRAF wild-type
patients (30% of KRAS mutants, 13% of double wild type) showed a gene expression pattern and
had poor overall survival and survival after relapse, similar to those observed in BRAF-mutant
patients. Thus they form a distinct prognostic subgroup within their mutation class.
Conclusion
A characteristic pattern of gene expression is associated with and accurately predicts BRAF
mutation status and, in addition, identifies a population of BRAF mutated-like KRAS mutants and
double wild-type patients with similarly poor prognosis. This suggests a common biology between
these tumors and provides a novel classification tool for cancers, adding prognostic and biologic
information that is not captured by the mutation status alone. These results may guide therapeutic
strategies for this patient segment and may help in population stratification for clinical trials.
J Clin Oncol 30:1288-1295. © 2012 by American Society of Clinical Oncology
INTRODUCTION
Activation of the KRAS/BRAF/MEK/ERK cascade is
believed to occur frequently in colorectal cancer (CRC)
on the basis of the observed 40% incidence of
KRAS mutations and 10% to 15% incidence of
BRAF mutations.1-4 KRAS and BRAF mutations occur
in a mutually exclusive pattern in CRC, which
has long been interpreted as a sign of functional
redundancy. However, these mutations occur in different
histopathologic subtypes of CRC,5,6 and we
recently showed7 that the prognosis of patients with
KRAS and BRAF mutant metastatic CRC is quite
different, with a clearly worse prognosis for BRAF-mutant
disease. It has been suggested this could be
due to higher levels of mitogen-activated protein
kinase activation in BRAF-mutant (BRAFm) colon
cancer.8,9 Unlike the majority of KRAS-mutant
(KRASm) CRCs, BRAFm metastatic CRCs do not
respond to any current chemotherapy, and the outcome
of patients with BRAFm CRC is similar to that
of untreated patients.
Our main objective was to better understand
the underlying biology of BRAFm CRCs as
captured by gene expression. We developed a
BRAFm gene signature that allowed an accurate
identification of BRAFm samples, and which,
when applied to BRAF wild-type samples, identified
additional colon cancer (CC) samples that
manifested a similar gene expression pattern. Although
a substantial amount of work has been
dedicated to the development of BRAFm gene
JOURNAL OF CLINICAL ONCOLOGY: ORIGINAL REPORT
VOLUME 30, NUMBER 12, APRIL 20, 2012
1288 © 2012 by American Society of Clinical Oncology
11. BRAF-mutant-like colon cancer
98
expression signatures in melanoma,10-12 to the best of our knowledge,
there is no such published work in the CC context. Taking advantage
of a large series of tumors with gene expression and mutation data
from the PETACC-3 (Pan-European Trials in Alimentary Tract
Cancers) clinical trial,13 we studied the genes differentially expressed
between c.1799T>A (p.V600E) BRAFm and double-wild-type
(WT2) tumors, defined as non-BRAF mutant, non-KRAS
mutant. We purposely excluded the KRASm tumors from this comparison
because it was unclear whether KRASm carcinomas had overlapping
biology with BRAFm. Next, we built a classifier able to
recognize with high sensitivity BRAFm CCs in our own and external
data sets.
When the BRAF classifier was applied to the whole population, it
identified a BRAF wild-type subpopulation, with similar gene expression
and prognostic characteristics. Approximately 62% of these
BRAFm-like tumors were KRASm (30% of all KRASm were BRAFm-like),
with the rest being WT2 (13% of all WT2). In our data, the BRAFm-like
population represented 18% of CCs. This intriguing finding
suggests a common biology between these tumors, not predicted by
the mutation status. The results obtained show that our current classifications
of tumors as KRAS- or BRAF-mutant or mitogen-activated
protein kinase–active versus nonactive are inadequate to capture the
whole underlying biology and clinical behavior.
MATERIALS AND METHODS
Tumor Samples and Data Preparation
Within the PETACC-3 clinical trial,13 formalin-fixed paraffin-embedded
tissue blocks were collected after cancer diagnosis and independently
of future research plans, and DNA was extracted from 1,404
microdissected tissue sections. The analysis of KRAS exon 2 and BRAF
exon 15 was performed by allele-specific real-time polymerase chain reaction.7
The mutation status has been confirmed for all samples by a second
analysis, using Sequenom.14 RNA of sufficient quantity and quality was extracted
from 895 samples, and gene expressions were measured on the ALMAC
Colorectal Cancer DSA platform (Craigavon, Northern Ireland), a
customized Affymetrix chip with 61,528 probe sets mapping to 15,920 unique
Entrez Gene IDs, in two phases (phase 1: n = 322, phase 2: n = 573). In total,
688 unique samples passed the final quality control (phase 1: n = 265 [82.3%],
phase 2: n = 423 [73.8%]) and were used in subsequent analysis (Data Supplement).
Of this series of CCs, 257 (37.4%) were KRAS mutated, whereas
BRAF mutation was detected in 47 (6.8%) of the cases (Data Supplement).
The stage III subset included all samples for which profile data could be
obtained and is thus representative of the clinical population of the trial. The
stage II subset included all patients with relapse for whom profile data could be
obtained and is thus also representative of this group, whereas from the
nonrelapsing patients, a randomly selected population was profiled.
Three additional independent data sets15-17 were used for validation of
the signature, whereas a fourth data set,18 with available survival information,
was used for validating the prognostic value of the signature.
Statistical Analysis
PETACC-3 gene expression data were retrospectively analyzed to derive
the BRAF gene signature discriminating between c.1799T>A (p.V600E)
BRAFm and double-wild-type (WT2; BRAF and KRAS wild-type) tumors.
Samples with missing mutation information (n = 39) were discarded from the
gene signature development, but were included later in the survival analysis.
Gene expression data were normalized using RMA (Robust Multi-array
Average)19 and summarized at the gene level by choosing the probe set with
the highest standard deviation as a representative of each gene, in each
data set individually.
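The gene-level summarization step can be illustrated with a minimal sketch. It assumes that expression values are held in a plain dictionary keyed by probe set ID; the function name `summarize_by_max_sd`, the probe set IDs, and the toy values are hypothetical and are not taken from the study.

```python
from statistics import stdev

def summarize_by_max_sd(expr, probe_to_gene):
    """Pick, for each gene, the probe set with the highest standard deviation.

    expr          : dict, probe set ID -> list of expression values (one per sample)
    probe_to_gene : dict, probe set ID -> gene identifier
    Returns a dict, gene identifier -> representative probe set ID.
    """
    best = {}  # gene -> (standard deviation, probe set ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe sets without a gene annotation are skipped
        sd = stdev(values)
        if gene not in best or sd > best[gene][0]:
            best[gene] = (sd, probe)
    return {gene: probe for gene, (_, probe) in best.items()}

# Toy data (hypothetical): two probe sets map to GENE1, one to GENE2.
expr = {
    "probe_a": [7.0, 7.1, 6.9, 7.0],
    "probe_b": [5.0, 8.0, 6.0, 9.0],
    "probe_c": [4.0, 4.5, 5.0, 5.5],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}
representative = summarize_by_max_sd(expr, probe_to_gene)
```

Because the selection is repeated in each data set individually, the representative probe set of a gene may differ between data sets.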
Differentially expressed genes were obtained by fitting multivariate linear
models (using the LIMMA20 package) to probe set–level data to fully exploit
the potential of the platform. To account for the known association between
microsatellite instability-high (MSI-H), BRAFm, and right-sided tumors,7
the linear model for the whole population included factors for BRAF
mutation, MSI status, and tumor site (all binary variables). For the
microsatellite stable (MSS) subpopulation, the model included only the BRAF
mutation status and tumor site. The false discovery rate was controlled by
the Benjamini-Hochberg procedure21 and required to be at most 1%, whereas
the minimum absolute log-fold change was 0.585 (= log2 1.5). As the MSI-H
subpopulation was small and consisted only of right-sided samples, the
differentially expressed genes were derived by comparing BRAFm and WT2 only
in the right colon, with a false discovery rate less than 25% and no
constraint on the fold change.
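The selection criteria above (Benjamini-Hochberg adjusted P value of at most 1% and absolute log-fold change of at least log2 1.5) can be sketched as follows. This is only an illustration of the filtering step, not the LIMMA model fit itself; the function names and the toy P values and log-fold changes are hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_probe_sets(pvalues, lfcs, fdr=0.01, min_abs_lfc=math.log2(1.5)):
    """Indices of probe sets passing both the FDR and the fold-change filter."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        i
        for i, (q, lfc) in enumerate(zip(adjusted, lfcs))
        if q <= fdr and abs(lfc) >= min_abs_lfc
    ]

# Toy input (hypothetical values, one entry per probe set).
pvals = [0.0001, 0.0004, 0.03, 0.2, 0.0002]
lfcs = [1.2, 0.3, -0.9, 2.0, -0.7]
selected = select_probe_sets(pvals, lfcs)
```

For the MSI-H comparison described above, the same helper would be called with `fdr=0.25` and `min_abs_lfc=0.0`.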
For signature generation, an adapted version of the top scoring pairs
algorithm22
(multiple top scoring pairs [mTSP]; Data Supplement) was used,
resulting in gene pairs deemed as the most informative in the process of
classiﬁer construction. The ﬁnal classiﬁcation model consisted of two groups
of genes (G1 and G2), and the prediction was made comparing the averages of
these groups: If, for a given sample, the average of G1 was smaller than the
average of G2, then the sample was predicted to be BRAFm, otherwise WT2.
We also deﬁned a BRAF score (BS) as the difference between the average
expression of G2 genes and the average expression of G1 genes (from the
mTSP model) and used it to analyze the stratiﬁcation for different threshold
values (a threshold of 0 leading to the original decision rule). An alternative
threshold for the BRAF score was obtained as the value that maximized
Matthews correlation coefﬁcient23
on the PETACC-3 data set.
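The decision rule and the BRAF score can be written down in a few lines. The sketch below assumes that a sample is a dictionary of gene-level expression values; the two short gene lists are an illustrative subset of the G1 and G2 columns of Table 2, and all expression values, scores, and labels are invented toy data. The threshold search over midpoints between observed scores is one simple way of maximizing the Matthews correlation coefficient and is an assumption, not necessarily the procedure used in the study.

```python
import math

def braf_score(sample, g1_genes, g2_genes):
    """BRAF score: mean expression of G2 genes minus mean expression of G1 genes."""
    mean_g1 = sum(sample[g] for g in g1_genes) / len(g1_genes)
    mean_g2 = sum(sample[g] for g in g2_genes) / len(g2_genes)
    return mean_g2 - mean_g1

def predict(score, threshold=0.0):
    """A threshold of 0 reproduces the original rule: mean(G1) < mean(G2) -> BRAFm."""
    return "BRAFm" if score > threshold else "WT2"

def matthews_cc(truth, predicted, positive="BRAFm"):
    """Matthews correlation coefficient for binary labels (0 if undefined)."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, truth):
    """Midpoint between consecutive observed scores that maximizes the MCC."""
    distinct = sorted(set(scores))
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(cuts, key=lambda c: matthews_cc(truth, [predict(s, c) for s in scores]))

# Illustrative subset of Table 2 genes; expression values are toy data.
G1 = ["CDX2", "SATB2"]
G2 = ["CTSE", "AQP5"]
sample_a = {"CDX2": 5.0, "SATB2": 6.0, "CTSE": 9.0, "AQP5": 8.0}
sample_b = {"CDX2": 9.0, "SATB2": 8.0, "CTSE": 4.0, "AQP5": 5.0}

# Toy scores and labels for the threshold search.
toy_scores = [-4.0, -1.0, 0.5, 3.0]
toy_truth = ["WT2", "WT2", "WT2", "BRAFm"]
```

With these toy inputs, `sample_a` has a positive score and is called BRAFm, whereas `sample_b` has a negative score and is called WT2.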
The performance of the classifier was estimated by repeated (10 times)
stratified five-fold cross-validation, following the MAQC-II guidelines,24
and measured in terms of sensitivity, specificity, and error rate. The final
BRAF classifier was built from all BRAFm and WT2 samples in the PETACC-3
data set and then applied to the full PETACC-3 data set (including KRASm)
and independent validation sets for the analysis of stratification of the
population (Data Supplement). Because the stage II subgroup of PETACC-3 is
smaller and not fully representative, the analysis of the prognostic value
of the signature is focused on the stage III subgroup. However, results for
both stages are given (Data Supplement).
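The resampling scheme (10 repeats of stratified five-fold cross-validation) and the three performance measures can be sketched as below. The helper names, the round-robin fold assignment, and the toy class sizes are assumptions made for illustration; the classifier training step inside each fold is deliberately left out.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds

def repeated_stratified_cv(labels, k=5, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    n = len(labels)
    for r in range(repeats):
        for test_idx in stratified_folds(labels, k, seed + r):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            yield train_idx, test_idx

def performance(truth, predicted, positive="BRAFm"):
    """Return (sensitivity, specificity, error rate) for binary labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (fp + fn) / len(pairs)

# Toy labels (hypothetical class sizes, not the PETACC-3 counts).
labels = ["BRAFm"] * 10 + ["WT2"] * 40
splits = list(repeated_stratified_cv(labels))
```

In an actual run, the signature would be re-derived on each training part and evaluated on the held-out part, and the fold-level measures would then be averaged over all 50 splits.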
The association between predicted class and survival outcomes was
tested using Cox proportional hazards models (log-likelihood test) and the
log-rank test for dichotomous variables. Three survival outcomes were
considered: overall survival, relapse-free survival, and survival after
relapse. Fisher’s exact test was used for testing differences in
proportions in contingency tables.
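As an illustration of the contingency-table test, the sketch below implements a two-sided Fisher's exact test for a 2 x 2 table using only the standard library. The function name is hypothetical. The example table uses the BRAFm-like counts reported in the Results (76 of 257 KRASm and 46 of 345 WT2 samples); applying the test to this particular table is purely illustrative and is not a comparison reported in the article.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    The P value is the total probability of all tables with the same margins
    that are at most as probable as the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # The small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)

# Rows: KRASm, WT2; columns: BRAFm-like, not BRAFm-like (counts from the Results).
p_value = fisher_exact_two_sided(76, 257 - 76, 46, 345 - 46)
```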
RESULTS
BRAFm: Characteristic Genes and Classiﬁer
In the PETACC-3 data set, we identiﬁed 314 differentially expressed
probe sets between BRAFm and WT2 (see Materials and
Methods for details), mapping to 223 unique EntrezGene IDs. Top 50
differentially expressed probe sets are given in Table 1, with the full
table given in the Data Supplement. We also derived lists of differentially
expressed genes for the MSI-H and MSS tumors separately
(Data Supplement).
Using the technique of mTSP, a 32-gene pair BRAFm signature
(Table 2) was obtained by training on the c.1799T>A (p.V600E)
BRAFm and WT2 samples, considering all genes, whether or not they
were previously identiﬁed to be differentially expressed. Its performance
was estimated at a sensitivity of 95.8% and a speciﬁcity of
86.5% (Table 3). Fifty of the 64 genes of the signature were among the
223 differentially expressed genes (Data Supplement).
BRAFm-Like Tumors
To make the distinction between the true and classifier-predicted
mutation status, we prefix the predictions by “pred-”: pred-BRAFm
denotes the samples predicted to be BRAFm, whereas pred-BRAFwt
Identiﬁcation of BRAF-Like Patients
www.jco.org © 2012 by American Society of Clinical Oncology 1289
Information downloaded from jco.ascopubs.org and provided by at SWISS CONSORTIUM (Hauptbibliothek
Universität Zürich) on April 19, 2012 from 130.223.120.6
Copyright © 2012 American Society of Clinical Oncology. All rights reserved.
11. BRAF-mutant-like colon cancer
99
denotes those predicted to be BRAF wild type. The pred-BRAFm
samples consist of true BRAF mutants and the subset of WT2 and
KRASm samples that are positive for the signature. These tumors
share a common gene expression pattern, as can be seen in Appendix
Figure A1 (online only). We call the subset of BRAF wild-type samples
that are positive for the signature BRAFm-like to distinguish them
from the true BRAFm.
Having identiﬁed a population of BRAFm-like samples, we proceeded
to its characterization: In the population stratiﬁcation analysis
of PETACC-3, approximately 30% (76 of 257) of KRASm and 13%
Table 1. Top 50 Differentially Expressed Probe Sets Between c.1799T>A (p.V600E) BRAFm and WT2
Probe Set ID | Gene Symbol | Entrez GeneID | LFC | Official Full Name
ADXCRPD.7995.C1_x_at | AQP5 | 362 | -2.91 | Aquaporin 5
ADXCRIH.384.C1_s_at | REG4 | 83998 | -2.80 | Regenerating islet-derived family, member 4
ADXCRAG_BC014461_x_at | CDX2 | 1045 | 2.02 | Caudal type homeobox 2
ADXCRAG_BC014461_at | CDX2 | 1045 | 1.97 | Caudal type homeobox 2
ADXCRPD.10572.C1_at | HSF5 | 124535 | 1.70 | Heat shock transcription factor family member 5
ADXCRAG_AK024491_s_at | SOX8 | 30812 | -1.95 | SRY (sex determining region Y)-box 8
ADXCRSS.Hs#S2988180_at | HSF5 | 124535 | 2.02 | Heat shock transcription factor family member 5
ADXCRPD.7687.C1_at | TM4SF4 | 7104 | -1.70 | Transmembrane 4 L six family member 4
ADXCRAG_M14335_s_at | F5 | 2153 | -1.18 | Coagulation factor V (proaccelerin, labile factor)
ADXCRAG_AJ250717_s_at | CTSE | 1510 | -2.62 | Cathepsin E
ADXCRAG_AJ132099_s_at | VNN1 | 8876 | -0.93 | Vanin 1
ADXCRAD_NM_025113_s_at | C13orf18 | 80183 | 1.77 | Chromosome 13 open reading frame 18
ADXCRAG_NM_182510_s_at | LOC146336 | 146336 | -1.33 | Hypothetical LOC146336
ADXCRAG_BC028581_s_at | PIWIL1 | 9271 | -0.72 | Piwi-like 1 (Drosophila)
ADXCRAD_BX094012_s_at | SOX13 | 9580 | -0.72 | SRY (sex determining region Y)-box 13
ADXCRPDRC.4289.C1_at | RNF43 | 54894 | 1.38 | Ring finger protein 43
ADXCRPD.10016.C1_at | SATB2 | 23314 | 1.82 | SATB homeobox 2
ADXCRPDRC.8321.C1_s_at | TFCP2L1 | 29842 | 1.26 | Transcription factor CP2-like 1
ADXCRIH.1549.C1_at | ELOVL5 | 60481 | 0.94 | ELOVL family member 5, elongation of long chain fatty acids (FEN1/Elo2, SUR4/Elo3-like, yeast)
ADXCRAG_BC028581_x_at | PIWIL1 | 9271 | -1.72 | Piwi-like 1 (Drosophila)
ADXCRIH.1305.C1_s_at | LYZ | 4069 | -1.61 | Lysozyme
ADXCRSS.Hs#S1405714_at | RNF43 | 54894 | 1.27 | Ring finger protein 43
ADXCRSS.Hs#S3740849_at | HSF5 | 124535 | 1.21 | Heat shock transcription factor family member 5
ADXCRSS.Hs#S3012761_at | HSF5 | 124535 | 1.20 | Heat shock transcription factor family member 5
ADXCRAD_BM825250_s_at | TM4SF4 | 7104 | -0.99 | Transmembrane 4 L six family member 4
ADXCRPD.7300.C1_s_at | LOC388199 | 388199 | -1.28 | Proline rich 25
ADXCRIH.4080.C1_s_at | SPINK1 | 6690 | 2.09 | Serine peptidase inhibitor, Kazal type 1
ADXCRAD_NM_006113_s_at | VAV3 | 10451 | 1.38 | Vav 3 guanine nucleotide exchange factor
ADXCRIH.546.C1_at | GGH | 8836 | 1.49 | γ-glutamyl hydrolase (conjugase, folylpolygammaglutamyl hydrolase)
ADXCRAD_AJ709424_s_at | ABLIM3 | 22885 | -0.65 | Actin binding LIM protein family, member 3
ADXCRPDRC.1943.C1_at | AXIN2 | 8313 | 1.32 | Axin 2
ADXCRAD_BG470190_s_at | CDX2 | 1045 | 0.77 | Caudal type homeobox 2
ADXCRAG_XM_371238_at | TRNP1 | 388610 | -1.03 | TMF1-regulated nuclear protein 1
ADXCRAD_BU664688_s_at | SLC14A1 | 6563 | -0.82 | Solute carrier family 14 (urea transporter), member 1 (Kidd blood group)
ADXCRPD.12823.C1_s_at | SYT13 | 57586 | -0.77 | Synaptotagmin XIII
ADXCRAD_CK823169_at | ANXA10 | 11199 | -0.80 | Annexin A10
ADXCRPD.8346.C1_at | HSF5 | 124535 | 1.34 | Heat shock transcription factor family member 5
ADXCRPD.15182.C1_at | MIR142 | 406934 | 0.95 | MicroRNA 142
ADXCRIH.31.C9_at | LYZ | 4069 | -1.61 | Lysozyme
ADXCRAD_BP299698_s_at | VNN1 | 8876 | -0.96 | Vanin 1
ADXCRPD.14261.C1_at | ANO1 | 55107 | -1.12 | Anoctamin 1, calcium activated chloride channel
ADXCRAG_NM_002526_at | NT5E | 4907 | -1.27 | 5'-nucleotidase, ecto (CD73)
ADXCRAD_CN404528_s_at | DCBLD2 | 131566 | -0.76 | Discoidin, CUB and LCCL domain containing 2
ADXCRAD_BM852899_at | DUSP4 | 1846 | -0.98 | Dual specificity phosphatase 4
ADXCRAD_BP376354_at | AXIN2 | 8313 | 1.27 | Axin 2
ADXCRAG_U04313_s_at | SERPINB5 | 5268 | -0.89 | Serpin peptidase inhibitor, clade B (ovalbumin), member 5
ADXCRIH.482.C1_at | KLK6 | 5653 | -0.76 | Kallikrein-related peptidase 6
ADXCRAD_BM718216_s_at | TRNP1 | 388610 | -1.16 | TMF1-regulated nuclear protein 1
ADXCRAG_XM_031357_s_at | KIAA0802 | 23255 | -0.82 | KIAA0802
ADXCRPD.1115.C1_s_at | MLPH | 79083 | -1.32 | Melanophilin
NOTE. Positive LFC indicates higher expression in WT2.
Abbreviations: LFC, log fold change; WT2, double wild type.
Popovici et al
1290 © 2012 by American Society of Clinical Oncology JOURNAL OF CLINICAL ONCOLOGY
(46 of 345) of WT2 samples were BRAFm-like. The BRAFm-like
samples were significantly enriched in right-sided tumors in comparison
with non–BRAF-like overall and also separately for KRASm (51%
were right-sided) and WT2 (63% were right-sided). There was no
association with a particular KRAS mutation subtype. Approximately
29% of the BRAFm-like samples were MSI-H (whereas 41% of the
BRAFm were MSI-H). On the other hand, 50% of the MSI-H samples
were BRAFm-like, with an additional 27% being BRAFm (Data Supplement).
Separate hierarchical clustering of the KRASm and WT2 subpopulations,
based on the genes from the signature, showed a split
between BRAFm-like and the rest of the samples (Data Supplement). The
identified BRAFm-like subpopulation was further described in terms of
clinicopathologic features (Data Supplement), survival rates (Table 4 and
Data Supplement), and differentially expressed genes between BRAFm-like
and BRAFm samples (Data Supplement). The two groups of patients
were similar with respect to their clinical and pathologic parameters, with
the only exceptions being age (BRAFm-like comprise more patients younger
than 60 years) and tumor site (56% of BRAFm-like were right-sided,
whereas 77% of BRAFm are right-sided; Data Supplement).
Prognostic Value of the Classiﬁer
The prognostic value of the BRAF signature was assessed in
the combined stage II and III population and in the stage III only
subpopulation for three end points—overall survival (OS),
relapse-free survival (RFS), and survival after relapse (SAR)—
within the whole population, WT2 only, and KRASm only subpopulations,
respectively. To account for the known prognostic
effect of the MSI status (mainly for RFS) and its association with
the BRAF mutation, the survival analysis was also performed
within the MSS population only. The small number of MSI-H
samples prevented a similar analysis of the signature predictions
within MSI-H. In the whole population and in MSS, the BRAFm and
BRAFm-like patients have shorter survival times (OS and SAR), as
can be seen in Figure 1 and the Data Supplement for different
stratifications. The BRAFm-likeness showed the strongest prognostic
effect for SAR, for both KRASm and WT2 (in all and MSS-only
samples; see Figs 1F and 1H). The corresponding hazard ratios
and their 95% CIs as well as the corresponding log-rank test P
values for each of these comparisons are summarized in Table 4.
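The survival curves referred to above are Kaplan-Meier estimates. As a reminder of how such a curve is computed, the following minimal sketch implements the product-limit estimator on invented toy data; the function name and the follow-up times are hypothetical, and no patient data from the study are used.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  : follow-up time of each patient (for example, in months)
    events : 1 if the event (death) was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

# Toy follow-up data (hypothetical): months and event indicators.
toy_times = [6, 12, 12, 20, 30]
toy_events = [1, 1, 0, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients lower the number at risk without producing a step in the curve, which is why the numbers at risk are usually reported alongside such figures.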
No statistically signiﬁcant difference in survival was found between
the BRAFm and BRAFm-like subpopulations, even though a
tendency was observed for the patients with a BRAFm-like tumor to
have a slightly better prognosis than those with a BRAFm tumor.
To identify potential drivers of the prognostic effect, we assessed
the prognostic value of each of the 64 genes in the signature by fitting
univariate Cox regression models in the whole PETACC-3 population
and in the subset of BRAF wild-type samples (KRASm and WT2).
Most of these genes were found to be significantly associated with the
SAR end point, and, for 25 of them, the association was found also in
the BRAF wild-type subgroup. These results reveal multiple interesting
genes for future studies (Data Supplement).
External Validation
The BRAF signature was validated on three external data
sets: Koinuma,15 Kim,16 and an internal series of patients with
cetuximab-treated stage IV disease with gene expression data from
primary tumors.17 When genes from the signature were not represented
on a platform, only the complete pairs of genes were considered.
The aggregated observed sensitivity was 96.0% (24 of 25
BRAFm correctly identiﬁed) and the speciﬁcity was 86.24% (94 of
109 WT2 and KRASm correctly predicted; Table 3). This conﬁrmed
the highly sensitive recognition of tumors with a BRAFm
and their distinction from majority non-BRAFm tumors, whereas
approximately 14% of the latter were also wrongly classiﬁed as
BRAFm. The reported speciﬁcity refers to KRASm and WT2 samples
that should have been labeled as BRAF wild type by the
classiﬁer. The existence of a BRAFm-like group of patients is thus
conﬁrmed in these data sets.
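The aggregated figures quoted above can be re-derived from the per-data-set counts in Table 3 by pooling correct and incorrect calls across the three validation sets. The short sketch below does this arithmetic; the dictionary layout and the function name are illustrative choices.

```python
# Counts taken from Table 3 (validation sets only).
validation_counts = {
    "Kim":       {"pos_correct": 9, "pos_total": 9, "neg_correct": 6,  "neg_total": 11},
    "Koinuma":   {"pos_correct": 9, "pos_total": 9, "neg_correct": 8,  "neg_total": 11},
    "Cetuximab": {"pos_correct": 6, "pos_total": 7, "neg_correct": 80, "neg_total": 87},
}

def pooled_metrics(counts):
    """Pool the counts over data sets; return (sensitivity, specificity, error rate)."""
    pos_correct = sum(c["pos_correct"] for c in counts.values())
    pos_total = sum(c["pos_total"] for c in counts.values())
    neg_correct = sum(c["neg_correct"] for c in counts.values())
    neg_total = sum(c["neg_total"] for c in counts.values())
    errors = (pos_total - pos_correct) + (neg_total - neg_correct)
    return (
        pos_correct / pos_total,
        neg_correct / neg_total,
        errors / (pos_total + neg_total),
    )

sensitivity, specificity, error_rate = pooled_metrics(validation_counts)
```

Pooling gives 24 of 25 BRAFm and 94 of 109 BRAF wild-type samples correctly called, with 16 errors in 134 samples, in agreement with the aggregated row of Table 3.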
The prognostic value of the BRAF signature has been validated in
all and in the stage II and III only samples from the Moffitt data set18
for OS and SAR (RFS being only marginally significant in stage II and
III). No information on BRAF or KRAS mutational status was available,
Table 2. 32 Pairs of Genes Deﬁning the BRAF Signature
Pair Gene 1 (G1) Gene 2 (G2) Pair Gene 1 (G1) Gene 2 (G2)
1 C13orf18 CTSE 17 VAV3 OSBP2
2 DDC AQP5 18 CFTR KLK10
3 PPP1R14D REG4 19 PHYH DUSP4
4 HSF5 RSBN1L 20 PLCB4 HOXD3
5 SATB2 RASSF6 21 ZNF141 C11orf9
6 TNNC2 CRIP1 22 PPP1R14C CD55
7 GGH PPPDE2 23 FLJ32063 TRNP1
8 SPINK1 PLK2 24 APCDD1 FSCN1
9 PTPRO TM4SF4 25 ACOX1 KIAA0802
10 ZSWIM1 MLPH 26 C10orf99 PLLP
11 RNF43 RBM8A 27 MIR142 IRX3
12 CELP SOX8 28 ARID3A SLC25A37
13 CBFA2T2 PIWIL1 29 C20orf111 PIK3AP1
14 PTPRD LOC388199 30 AMACR TPK1
15 CDX2 S100A16 31 AIFM3 ZIC2
16 TSPAN6 RBBP8 32 CTTNBP2 SERPINB5
NOTE. A sample is predicted to be BRAF mutant if the average expression of
the genes in the Gene 1 (G1) columns is lower than the average expression of
genes in Gene 2 (G2) columns.
Table 3. Performance Metrics for the BRAF Signature
Data Set | Measure | Sensitivity | Specificity | Error Rate
PETACC-3,13 | % | 95.78 | 86.52 | 12.41
PETACC-3,13 | Standard deviation | 4.04 | 0.18 | 0.14
Kim,16 n = 20 | % | 100.00 | 54.55 | 25.00
Kim,16 n = 20 | No. | 9/9 | 6/11 | 5/20
Koinuma,15 n = 20 | % | 100.00 | 72.73 | 15.00
Koinuma,15 n = 20 | No. | 9/9 | 8/11 | 3/20
Cetuximab,17 n = 94 | % | 85.71 | 91.95 | 8.51
Cetuximab,17 n = 94 | No. | 6/7 | 80/87 | 8/94
Aggregated, on validation sets, n = 134 | % | 96.00 | 86.24 | 11.94
Aggregated, on validation sets, n = 134 | No. | 24/25 | 94/109 | 16/134
NOTE. PETACC-3: cross-validation estimated performance. For the other
data sets, the values indicate the observed performance.
Abbreviation: PETACC-3, Pan-European Trials in Alimentary Tract Cancers.
making it impossible to draw any conclusions on the prognostic value of
the signature within the KRASm or WT2 subpopulations. The signature
was confirmed to be prognostic for SAR and progression-free survival
(PFS) in the cetuximab17 data set as well (OS information was not available
for this data set). The survival analysis results and the corresponding
Kaplan-Meier curves are given in Table 4 and in the Data Supplement.
DISCUSSION
Our results show that for c.1799T>A (p.V600E) BRAFm tumors, a characteristic
gene expression signature of high sensitivity can be identified,
and this signature extends to a subgroup of BRAF wild-type
colon carcinomas (BRAFm-like) sharing similar clinicopathologic and
gene expression features of potential prognostic importance. The BRAF
mutation status has been previously shown to have prognostic value in
CRC,7,25-27 both in MSS and MSI-H tumors, and this feature is also shared
by our signature in the case of MSS tumors. Because of the limited number
of MSI-H tumors, we could not assess its prognostic value in those
samples. The BRAFm-like tumors, either KRASm or double wild type, show a
similarly poor prognosis in all and MSS-only samples. This effect was also
independent of tumor stage.
Globally, the group of BRAFm-like tumors discovered studying the
gene expression data shows clinicopathologic features more similar to the
BRAFm tumors (Data Supplement) than to pred-BRAFwt. As previously
described,13,28 BRAFm tumors are found with higher frequencies in right
(proximal) colon, are enriched for the MSI-H phenotype, and are of
higher grade. In our study, the frequencies of high-grade were 30% in
BRAFm, 20% in BRAFm-like, and 5% in pred-BRAFwt; of MSI-H, 30%,
30%, and 3%, respectively; of right-side, 75%, 55%, and 30%,
respectively. The mucinous tumors are most frequently BRAFm-like (45%) and
are less often BRAFm (30% v only 10% in pred-BRAFwt). The exception
is age, for which the frequency of young patients is highest in BRAFm-like
(55%) and lowest in BRAFm (35%).
From a biologic perspective, this finding supports the notion that
the poor outcome of tumors with BRAFm is shared with some non–
BRAF-mutated tumors, suggesting that they have common biology
that drives poor survival after relapse. For the genes in the signature,
the c.1799T>A (p.V600E) BRAFm tumors display a homogeneous
Table 4. Survival Analyses Results
Data Set and Comparison | OS P | OS HR | OS 95% CI | RFS P | RFS HR | RFS 95% CI | SAR P | SAR HR | SAR 95% CI
PETACC-3, all
pred-BRAFm/pred-BRAFwt | .0005 | 1.67 | 1.25 to 2.25 | .2447 | 1.17 | 0.90 to 1.53 | < .001 | 2.85 | 2.06 to 3.95
BRAFm/BRAFwt | .0021 | 2.01 | 1.28 to 3.17 | .1602 | 1.37 | 0.88 to 2.12 | < .001 | 3.68 | 2.20 to 6.16
Within KRASm: BRAFm-like/pred-BRAFwt | .5196 | 1.16 | 0.74 to 1.83 | .4724 | 1.17 | 0.76 to 1.78 | .0021 | 2.13 | 1.30 to 3.48
Within WT2: BRAFm-like/pred-BRAFwt | .1312 | 1.58 | 0.87 to 2.87 | .4866 | 1.20 | 0.72 to 2.01 | .0011 | 2.72 | 1.46 to 5.06
PETACC-3, stage III
pred-BRAFm/pred-BRAFwt | < .0001 | 1.93 | 1.41 to 2.65 | .0455 | 1.34 | 1.00 to 1.79 | < .0001 | 3.04 | 2.15 to 4.29
BRAFm/BRAFwt | .0024 | 2.14 | 1.29 to 3.55 | .1685 | 1.41 | 0.86 to 2.32 | < .0001 | 4.53 | 2.54 to 8.07
Within KRASm: BRAFm-like/pred-BRAFwt | .1916 | 1.37 | 0.85 to 2.21 | .8203 | 1.05 | 0.68 to 1.64 | .0038 | 2.09 | 1.26 to 3.46
Within WT2: BRAFm-like/pred-BRAFwt | .0365 | 1.90 | 1.03 to 3.50 | .2154 | 1.40 | 0.82 to 2.40 | .0012 | 2.75 | 1.45 to 5.19
PETACC-3, MSS
pred-BRAFm/pred-BRAFwt | < .0001 | 2.19 | 1.57 to 3.07 | .0159 | 1.46 | 1.07 to 1.99 | < .0001 | 3.16 | 2.17 to 4.59
BRAFm/BRAFwt | < .0001 | 2.91 | 1.74 to 4.88 | .0228 | 1.79 | 1.08 to 2.98 | < .0001 | 4.67 | 2.57 to 8.45
Within KRASm: BRAFm-like/pred-BRAFwt | .0511 | 1.59 | 0.99 to 2.53 | .4690 | 1.17 | 0.76 to 1.82 | .0043 | 2.07 | 1.24 to 3.43
Within WT2: BRAFm-like/pred-BRAFwt | .0642 | 1.98 | 0.95 to 4.16 | .3464 | 1.37 | 0.71 to 2.63 | .0001 | 4.24 | 1.89 to 9.47
PETACC-3, MSS/stage III
pred-BRAFm/pred-BRAFwt | < .0001 | 2.27 | 1.58 to 3.25 | .0105 | 1.54 | 1.10 to 2.15 | < .0001 | 2.97 | 2.01 to 4.40
BRAFm/BRAFwt | .0024 | 2.43 | 1.35 to 4.40 | .1149 | 1.59 | 0.89 to 2.86 | < .0001 | 3.88 | 1.99 to 7.56
Within KRASm: BRAFm-like/pred-BRAFwt | .0216 | 1.77 | 1.08 to 2.89 | .1765 | 1.37 | 0.87 to 2.16 | .0089 | 1.98 | 1.18 to 3.34
Within WT2: BRAFm-like/pred-BRAFwt | .0220 | 2.35 | 1.11 to 4.98 | .2789 | 1.46 | 0.73 to 2.93 | < .0001 | 4.67 | 2.05 to 10.63
Moffitt18
pred-BRAFm/pred-BRAFwt | .0376 | 1.67 | 1.02 to 2.73 | .0956 | 1.77 | 0.90 to 3.50 | .0014 | 3.78 | 1.58 to 9.04
pred-BRAFm/pred-BRAFwt (stages II,III) | .0003 | 3.22 | 1.66 to 6.26 | .0498 | 2.02 | 0.99 to 4.15 | .0017 | 3.97 | 1.58 to 9.99
pred-BRAFm/pred-BRAFwt (stage III) | .0002 | 4.26 | 1.87 to 9.69 | .0204 | 2.79 | 1.13 to 6.87 | .0028 | 4.95 | 1.58 to 15.44
Cetuximab,17 MSS (OS not available)
Comparison | PFS P | PFS HR | PFS 95% CI | SAR P | SAR HR | SAR 95% CI
pred-BRAFm/pred-BRAFwt | < .0001 | 4.49 | 2.40 to 8.38 | < .0001 | 4.58 | 2.45 to 8.56
BRAFm/BRAFwt | .0018 | 3.24 | 1.46 to 7.19 | < .0001 | 5.72 | 2.49 to 13.12
Within BRAFwt: BRAFm-like/pred-BRAFwt | .0017 | 3.45 | 1.56 to 7.63 | < .0001 | 3.26 | 1.47 to 7.22
NOTE. Highly significant results (P < .01) are set in bold. For the Cetuximab data set, only two end points could be considered: SAR and PFS. This data set contained
also only stage IV MSS patients. When the predictions are considered within KRASm or WT2 subpopulations, those samples positive for the signature are called
BRAFm-like (see the Results section). The comparison is given in the first column, with the reference category in italic font.
Abbreviations: BRAFm, true BRAF mutant; BRAFwt, true BRAF wild type; HR, hazard ratio; MSS, microsatellite stable; OS, overall survival; PETACC-3,
Pan-European Trials in Alimentary Tract Cancers; PFS, progression-free survival; pred-BRAFm, classifier-predicted BRAF mutant; pred-BRAFwt, classifier-predicted
BRAF wild type; SAR, survival after relapse.
gene expression pattern, which is also found in some KRASm and
WT2 samples (approximately 30% and 13% in our data, respectively;
Appendix Fig A1). It is interesting to note that BRAF mutations have
been strongly associated with the serrated adenoma pathway,29,30 and
thus the clear differences in gene expression between BRAFm and
other colon tumors may be related to a different adenoma-carcinoma
progression sequence. The existence of several subgroups of CCs,
defined by their DNA methylation and mutation status, was first
discovered in a population-based study31 and was then subsequently
confirmed.32,33 A recent study34 similarly presented evidence validating
the existence of a cluster that included all BRAFm samples and a
fraction of KRASm (18% of all KRASm) and WT2 samples and that
was enriched for CIMP-positive, MLH1 hypermethylated, and right-sided
tumors. For the moment, we can only speculate about the relation
between our BRAFm-like concept and this cluster. In any case, it also
supports the idea that c.1799T>A (p.V600E) BRAFm tumors form a
homogeneous group with respect to the genes in the signature and that a
sizeable set of other tumors show similar characteristics. The underlying
driver biology of this BRAFm-like group remains unknown, although
it is clearly associated with clinicopathologic features, such as MSI-H,
right-sidedness, and mucinous histology.
The identiﬁcation of a BRAFm-like subpopulation of CC that
includes KRASm and WT2 samples and that manifests a coherent
clinical behavior suggests that a new deﬁnition of CC subgroups is
needed. To the best of our knowledge, this is the ﬁrst reported split
based on gene expression data of the KRASm tumors (see also Data
Supplement), which were considered until now as a compact group,
based solely on their mutation status.
The genes associated with the BRAF c.1799T>A (p.V600E) mutation
in CC and in melanoma are dissimilar, indicating tissue-specific
biology that needs to be understood and targeted differently. It is
therefore not surprising that BRAF-specific inhibitors, such as
PLX4032 or GSK2118436, although very successful in BRAFm melanoma,
have failed in BRAFm colorectal cancer treatment.35,36
In summary, our results show that for c.1799T>A (p.V600E)
BRAFm tumors, a high-sensitivity gene expression signature can be
[Figure 1, panels A-D: Kaplan-Meier curves; x-axis, time (months); y-axis, survival (proportion).]
Panel A. Overall survival, pred-BRAFm v pred-BRAFwt: P < .001; HR, 1.93 (95% CI, 1.41 to 2.65).
No. at risk at 0, 12, 24, 36, 48, 60, 72, and 84 months: pred-BRAFwt 426, 417, 401, 371, 346, 321, 59, 5; pred-BRAFm 154, 142, 121, 109, 95, 90, 15, 1.
Panel B. Survival after relapse, pred-BRAFm v pred-BRAFwt: P < .001; HR, 3.04 (95% CI, 2.15 to 4.29).
No. at risk at 0, 12, 24, 36, 48, and 60 months: pred-BRAFwt 139, 109, 74, 45, 22, 3; pred-BRAFm 60, 25, 9, 4, 2, 1.
Panel C. Overall survival, BRAFm v BRAFm-like v pred-BRAFwt.
No. at risk at 0, 12, 24, 36, 48, 60, 72, and 84 months: BRAFm 37, 31, 27, 24, 21, 20, 2, 0; BRAFm-like 118, 111, 94, 85, 74, 70, 13, 1; pred-BRAFwt 425, 417, 401, 371, 346, 321, 59, 5.
Panel D. Survival after relapse, BRAFm v BRAFm-like v pred-BRAFwt.
No. at risk at 0, 12, 24, 36, 48, and 60 months: BRAFm 14, 3, 0, 0, 0, 0; BRAFm-like 47, 22, 9, 4, 2, 1; pred-BRAFwt 138, 109, 74, 45, 22, 3.
Fig 1. Kaplan-Meier curves for different stratiﬁcations of the stage III subpopulation and different end points. Columns correspond to overall survival and survival after
relapse end points, respectively. Panels A-D correspond to stratiﬁcations into samples predicted to be BRAF mutant (pred-BRAFm)/predicted to be BRAF wild type
(pred-BRAFwt; A, B) and BRAF mutant (BRAFm)/BRAF mutant like (BRAFm-like)/pred-BRAFwt (C, D) in the whole stage III subpopulation. Panels E-H correspond to
stratiﬁcations BRAFm-like/pred-BRAFwt within KRAS mutant (E, F) and double wild type (WT2; G, H) subpopulations, in microsatellite stable. For the cases when only
two populations are compared, the log-rank test P values and the hazard ratios (HRs; with 95% CIs) are given.
derived and that this signature identifies also a subgroup of BRAFm-like
tumors sharing similar clinicopathologic features of potential
prognostic importance. They also indicate histologic and prognostic
heterogeneity within the KRASm and thus challenge the current assumption
that these tumors can all be considered alike. This stratiﬁcation
may be of interest in randomized clinical trials and in drug
development studies and can easily be obtained by applying the proposed
classiﬁer.
AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS
OF INTEREST
Although all authors completed the disclosure declaration, the following
author(s) indicated a ﬁnancial or other interest that is relevant to the subject
matter under consideration in this article. Certain relationships marked
with a “U” are those for which no compensation was received; those
relationships marked with a “C” were compensated. For a detailed
description of the disclosure categories, or for more information about
ASCO’s conﬂict of interest policy, please refer to the Author Disclosure
Declaration and the Disclosures of Potential Conﬂicts of Interest section in
Information for Contributors.
Employment or Leadership Position: Scott Weinrich, Pﬁzer (C);
Heather Estrella, Pﬁzer (C); Graeme Hodgson, Pﬁzer (C); Tao Xie, Pﬁzer
(C) Consultant or Advisory Role: Sabine Tejpar, Pﬁzer (C); Fred T.
Bosman, Pﬁzer (C); Arnaud D. Roth, Pﬁzer (C) Stock Ownership: Scott
Weinrich, Pﬁzer; Heather Estrella, Pﬁzer; Graeme Hodgson, Pﬁzer
Honoraria: Arnaud D. Roth, Pﬁzer Research Funding: Sabine Tejpar,
Pﬁzer, Merck; Eric Van Cutsem, Pﬁzer; Fred T. Bosman, Pﬁzer Expert
Testimony: None Other Remuneration: None
AUTHOR CONTRIBUTIONS
Conception and design: Vlad Popovici, Eva Budinska, Sabine Tejpar,
Arnaud D. Roth, Mauro Delorenzi
Provision of study materials or patients: Eric Van Cutsem
Collection and assembly of data: Vlad Popovici, Eva Budinska, Sabine
Tejpar, Scott Weinrich, Heather Estrella, Graeme Hodgson, Eric Van
Cutsem, Tao Xie, Fred T. Bosman, Arnaud D. Roth
Data analysis and interpretation: Vlad Popovici, Eva Budinska, Sabine
Tejpar, Heather Estrella, Graeme Hodgson, Eric Van Cutsem, Fred T.
Bosman, Mauro Delorenzi
Manuscript writing: All authors
Final approval of manuscript: All authors
[Figure 1, panels E-H: Kaplan-Meier curves; x-axis, time (months); y-axis, survival (proportion).]
Panel E. Overall survival, BRAFm-like v pred-BRAFwt: P = .022; HR, 1.77 (95% CI, 1.08 to 2.89).
No. at risk at 0, 12, 24, 36, 48, 60, 72, and 84 months: BRAFm-like 51, 48, 38, 34, 29, 27, 7, 1; pred-BRAFwt 143, 140, 135, 119, 109, 98, 18, 3.
Panel F. Survival after relapse, BRAFm-like v pred-BRAFwt: P = .009; HR, 1.98 (95% CI, 1.18 to 3.34).
No. at risk at 0, 12, 24, 36, 48, and 60 months: BRAFm-like 24, 12, 5, 3, 1, 0; pred-BRAFwt 58, 42, 25, 15, 7, 1.
Panel G. Overall survival, BRAFm-like v pred-BRAFwt: P = .022; HR, 2.35 (95% CI, 1.11 to 4.98).
No. at risk at 0, 12, 24, 36, 48, 60, 72, and 84 months: BRAFm-like 21, 21, 19, 15, 11, 10, 1, 0; pred-BRAFwt 226, 222, 212, 200, 190, 179, 33, 2.
Panel H. Survival after relapse, BRAFm-like v pred-BRAFwt: P < .001; HR, 4.67 (95% CI, 2.05 to 10.63).
No. at risk at 0, 12, 24, 36, 48, and 60 months: BRAFm-like 9, 4, 1, 0, 0, 0; pred-BRAFwt 65, 53, 38, 24, 13, 2.
Fig 1. (continued).
REFERENCES
1. Samowitz WS, Albertsen H, Herrick J, et al:
Evaluation of a large, population-based sample supports
a CpG island methylator phenotype in colon
cancer. Gastroenterology 129:837-845, 2005
2. Nosho K, Irahara N, Shima K, et al: Comprehensive
biostatistical analysis of CpG island methylator
phenotype in colorectal cancer using a large
population-based sample. PLoS One 3:e3698, 2008
3. Brink M, de Goeij AF, Weijenberg MP, et al:
K-ras oncogene mutations in sporadic colorectal
cancer in The Netherlands Cohort Study. Carcinogenesis
24:703-710, 2003
4. English DR, Young JP, Simpson JA, et al:
Ethnicity and risk for colorectal cancers showing
somatic BRAF V600E mutation or CpG island methylator
phenotype. Cancer Epidemiol Biomarkers Prev
17:1774-1780, 2008
5. Rosenberg DW, Yang S, Pleau DC, et al:
Mutations in BRAF and KRAS differentially distinguish
serrated versus non-serrated hyperplastic
aberrant crypt foci in humans. Cancer Res 67:3551-
3554, 2007
6. Velho S, Moutinho C, Cirnes L, et al: BRAF,
KRAS and PIK3CA mutations in colorectal serrated
polyps and cancer: Primary or secondary genetic
events in colorectal carcinogenesis? BMC Cancer
8:255, 2008
7. Roth AD, Tejpar S, Delorenzi M, et al: Prognostic
role of KRAS and BRAF in stage II and III
resected colon cancer: Results of the translational
study on the PETACC-3, EORTC 40993, SAKK 60-00
trial. J Clin Oncol 28:466-474, 2010
8. Pratilas CA, Xing F, Solit DB: Targeting oncogenic
BRAF in human cancer. Curr Top Microbiol
Immunol [epub ahead of print on August 5, 2011]
9. Pratilas CA, Taylor BS, Ye Q, et al:
(V600E)BRAF is associated with disabled feedback
inhibition of RAF-MEK signaling and elevated transcriptional
output of the pathway. Proc Natl Acad Sci
U S A 106:4519-4524, 2009
10. Dry JR, Pavey S, Pratilas CA, et al: Transcriptional
pathway signatures predict MEK addiction and
response to selumetinib (AZD6244). Cancer Res
70:2264-2273, 2010
11. Pavey S, Johansson P, Packer L, et al: Microarray
expression proﬁling in melanoma reveals a
BRAF mutation signature. Oncogene 23:4060-4067,
2004
12. Kannengiesser C, Spatz A, Michiels S, et al:
Gene expression signature associated with BRAF
mutations in human primary cutaneous melanomas.
Mol Oncol 1:425-430, 2008
13. Van Cutsem E, Labianca R, Bodoky G, et al:
Randomized phase III trial comparing biweekly infusional
ﬂuorouracil/leucovorin alone or with irinotecan
in the adjuvant treatment of stage III colon cancer:
PETACC-3. J Clin Oncol 27:3117-3125, 2009
14. De Roock W, Claes B, Bernasconi D, et al:
Effects of KRAS, BRAF, NRAS, and PIK3CA mutations
on the efﬁcacy of cetuximab plus chemotherapy in
chemotherapy-refractory metastatic colorectal cancer:
A retrospective consortium analysis. Lancet Oncol
11:753-762, 2010
15. Koinuma K, Yamashita Y, Liu W, et al:
Epigenetic silencing of AXIN2 in colorectal carcinoma
with microsatellite instability. Oncogene
25:139-146, 2006
16. Kim IJ, Kang HC, Jang SG, et al: Oligonucleotide
microarray analysis of distinct gene expression
patterns in colorectal cancer tissues harboring BRAF
and K-ras mutations. Carcinogenesis 27:392-404,
2006
17. Budinska E, Delorenzi M, De Roock W, et al:
New insights to gene expression signatures from
primary FFPE tumors for the prediction of response
to cetuximab in KRAS and BRAF wild-type colorectal
cancer (CRC). J Clin Oncol 28, 243s, 2010 (suppl;
abstr 3588)
18. Smith JJ, Deane NG, Wu F, et al: Experimentally
derived metastasis gene expression proﬁle
predicts recurrence and death in patients with colon
cancer. Gastroenterology 138:958-968, 2010
19. Irizarry RA, Bolstad BM, Collin F, et al: Summaries
of Affymetrix GeneChip probe level data.
Nucleic Acids Res 31:e15, 2003
20. Smyth GK: Linear models and empirical Bayes
methods for assessing differential expression in
microarray experiments. Stat Appl Genet Mol Biol
3:Article3, 2004
21. Benjamini Y, Hochberg Y: Controlling the false
discovery rate: A practical and powerful approach to
multiple testing. J R Stat Soc B 57:289-300, 1995
22. Tan AC, Naiman DQ, Xu L, et al: Simple
decision rules for classifying human cancers from
gene expression proﬁles. Bioinformatics 21:3896-
3904, 2005
23. Matthews BW: Comparison of the predicted
and observed secondary structure of T4 phage lysozyme.
Biochim Biophys Acta 405:442-451, 1975
24. Shi L, Campbell G, Jones WD, et al: The
MicroArray Quality Control (MAQC)-II study of common
practices for the development and validation of
microarray-based predictive models. Nat Biotechnol
28:827-838, 2010
25. Samowitz WS, Sweeney C, Herrick J, et al:
Poor survival associated with the BRAF V600E mutation
in microsatellite-stable colon cancers. Cancer
Res 65:6063-6069, 2005
26. Ogino S, Nosho K, Kirkner GJ, et al: CpG
island methylator phenotype, microsatellite instability,
BRAF mutation and clinical outcome in colon
cancer. Gut 58:90-96, 2009
27. French AJ, Sargent DJ, Burgart LJ, et al:
Prognostic signiﬁcance of defective mismatch repair
and BRAF V600E in patients with colon cancer. Clin
Cancer Res 14:3408-3415, 2008
28. Li WQ, Kawakami K, Ruszkiewicz A, et al:
BRAF mutations are associated with distinctive clinical,
pathological and molecular features of colorectal
cancer independently of microsatellite instability
status. Mol Cancer 5:2, 2006
29. Snover DC: Update on the serrated pathway
to colorectal carcinoma. Hum Pathol 42:1-10, 2011
30. Leggett B, Whitehall V: Role of the serrated
pathway in colorectal cancer pathogenesis. Gastroenterology
138:2088-2100, 2010
31. Ogino S, Kawasaki T, Kirkner GJ, et al: CpG
island methylator phenotype-low (CIMP-low) in colorectal
cancer: Possible associations with male sex
and KRAS mutations. J Mol Diagn 8:582-588, 2006
32. Yagi K, Akagi K, Hayashi H, et al: Three DNA
methylation epigenotypes in human colorectal cancer.
Clin Cancer Res 16:21-33, 2010
33. Dahlin AM, Palmqvist R, Henriksson ML, et al:
The role of the CpG island methylator phenotype in
colorectal cancer prognosis depends on microsatellite
instability screening status. Clin Cancer Res
16:1845-1855, 2010
34. Hinoue T, Weisenberger DJ, Lange CP, et al:
Genome-scale analysis of aberrant DNA methylation
in colorectal cancer. Genome Res [epub ahead of
print on June 9, 2011]
35. Kopetz S, Desai J, Chan E, et al: PLX4032 in
metastatic colorectal cancer patients with mutant
BRAF tumors. J Clin Oncol 28:269s, 2010 (suppl;
abstr 3534)
36. Kefford R, Arkenau H, Brown MP, et al: Phase
I/II study of GSK2118436, a selective inhibitor of
oncogenic mutant BRAF kinase, in patients with
metastatic melanoma and other solid tumors. J Clin
Oncol 28:611s, 2010 (suppl; abstr 8503)
12 Identification of "BRAF-Positive" Cases Based
on Whole-Slide Image Analysis
• Biomed Research International, art. no.:3926498, 2017
• IF: 2.476
• number of citations: 0
• personal contribution (80%): image analysis method design, data
collection and processing, experimental design and implementation,
manuscript writing
107
Research Article
Identification of “BRAF-Positive” Cases Based on
Whole-Slide Image Analysis
Vlad Popovici,1 Aleš Klenek,2 and Eva Budinská3
1 Institute of Biostatistics and Analyses, Faculty of Medicine and Research Centre for Toxic Compounds in the Environment,
Faculty of Science, Masarykova Univerzita, Kamenice 5, 625 00 Brno, Czech Republic
2 Institute of Computer Science, Masarykova Univerzita, Šumavská 15, 602 00 Brno, Czech Republic
3 Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masarykova Univerzita, Kamenice 5,
625 00 Brno, Czech Republic
Correspondence should be addressed to Vlad Popovici; popovici@iba.muni.cz
Received 11 November 2016; Accepted 20 March 2017; Published 24 April 2017
Academic Editor: Xudong Huang
Copyright © 2017 Vlad Popovici et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
A key requirement for precision medicine is the accurate identification of patients that would respond to a specific treatment or
those that represent a high-risk group, and a plethora of molecular biomarkers have been proposed for this purpose during the last
decade. Their application in clinical settings, however, is not always straightforward due to relatively high costs of some tests, limited
availability of the biological material and time, and procedural constraints. Hence, there is an increasing interest in constructing
tissue-based surrogate biomarkers that could be applied with minimal overhead directly to histopathology images and which could
be used for guiding the selection of eventual further molecular tests. In the context of colorectal cancer, we present a method for
constructing a surrogate biomarker that is able to predict with high accuracy whether a sample belongs to the “BRAF-positive”
group, a high-risk group comprising V600E BRAF mutants and BRAF-mutant-like tumors. Our model is trained to mimic the
predictions of a 64-gene signature, the current definition of the BRAF-positive group, thus effectively identifying histopathology image
features that can be linked to a molecular score. Since the only required input is the routine histopathology image, the model can
easily be integrated in the diagnostic workflow.
1. Introduction
The pathologic assessment of the tumor specimen provides
the essential information for patient management, outcome
estimation, and treatment decision. In the case of colorectal
cancer (CRC), the main parameters of the pathologic
assessment include the TNM stage, histologic grade, tumor
type, vascular infiltration, and status of the resection margins
[1]. Aside from these classical parameters, the discovery of
molecular drivers and markers for resistance led to refined
prognostic and predictive models [2]. For example, it has
been shown that KRAS-mutated tumors are resistant to
anti-EGFR treatment [3, 4]. In parallel, several molecular
taxonomies partially explaining intertumoral heterogeneity
have been proposed for CRC [5–7]. Of interest for the current
study is the identification of a high-risk group of CRC patients
consisting of V600E BRAF mutants and a sizeable subset of BRAF
wild-type tumors which display a similar pattern
of gene activation, the so-called BRAF-mutant-like tumors
[8]. This group is collectively called BRAF-positive, as the
defining 64-gene signature has positive values for these cases
[8]. These are only a few of the many gene expression
signatures proposed for CRC (the situation is similar in other
types of cancer). They all share the
requirement for profiling a rather large panel of genes and
have seen limited use in clinical practice. Among the reasons
for their slow adoption are the associated costs for tests
and limited availability of biological material. On the other
hand, if one could robustly predict the outcome of some of
these molecular tests directly from the data available for the
pathologic assessment, significant speed-ups and cost cuts
would be achieved. This is one of the main justifications of
Hindawi
BioMed Research International
Volume 2017, Article ID 3926498, 7 pages
https://doi.org/10.1155/2017/3926498
the present study, in which we propose an image analysis
model for recognizing the “BRAF-positive” cases of CRC,
that is, to predict the (dichotomized) outcome of the BRAF
signature [8]. A second justification, broader in scope, is
the interest in identifying and understanding the connections
between tumor architecture and gene activity as captured by
transcriptomics.
Such connections between phenotypical appearance of
the tumor and gene activity have been established before. For
example, in the case of breast cancer the lobular phenotype
is associated with deletions in the CDH1 gene (encoding E-cadherin)
[9] and the mesenchymal/metaplastic features are
predictive in the case of AR-positive triple negative breast
cancers [10]. In the case of colorectal cancer (CRC) the association
of mucinous/serrated carcinomas with BRAF mutations
is well known and we have shown that such association can
be extended to the group of “BRAF-mutated-like” tumors,
characterized by a specific genomic signature [8]. Similarly,
connections between nuclear morphometry and molecular
data have been identified in glioblastoma [11] and exploited
in a multimodal prognostic signature in breast cancer [12].
When deriving molecular subtypes for colorectal cancer, we
have also identified tumor architecture patterns preferentially
enriched in those subtypes [5]. These observations all support
the idea that genomic and phenotypic traits can be put in
correspondence and, by consequence, that some phenotypic
features could potentially be used as proxies for genomic
markers.
In the present work, we propose an approach to building
a histology image-based classifier able to predict the “BRAF-positive”
status, as defined by the genomic signature. The gene
expression data for the signature are assumed to be obtained
from the same (or adjacent) tumor section as the histopathology
whole-slide image. The key point of our approach resides
in a convenient summarization of the imaging data into
a code vector used for building the classification model.
Apart from our own earlier results [13], there were no other
studies to guide our selection of image features useful for
this task. Hence, we took a data-driven approach in which
the implicit hypothesis was that local tumor appearance
contained enough information to build a predictor for the
genomic “BRAF-positive” status. Thus, our approach was
prior-free, in the sense that we did not restrict ourselves to
a set of predefined (by an expert pathologist) measurements,
with the potential drawback of limiting interpretability of the
results.
Having a tissue-based surrogate biomarker for a genomic
test allows an immediate integration in the routine diagnostic
workflow and may provide the pathologist with hints for
further genomic testing. This integration is supported by the
increased adoption of digital pathology solutions. Additionally,
such models can be applied to pathology image archives
for the selection of cases for retrospective studies.
2. Materials and Methods
2.1. Data. The data collection used consisted of 𝑛 = 291
samples for which both histopathology whole-slide images
and clinical data (including BRAF and KRAS mutation status)
were available, along with the gene expression data necessary for
computing the BRAF score [8]. These samples were a subset of
the data collected in the PETACC-3 clinical trial [14] and were
selected based on the image quality and availability of the
mutation information. A summary of the data is presented
in Table 1 detailing the following clinical and molecular
parameters, in this order: tumor stage; microsatellite stability
status (high microsatellite instability (MSI-H) versus
low microsatellite instability (MSI-L) or microsatellite stable
(MSS)); mutation status of BRAF (V600E mutation) and
KRAS (in codons 12 and 13) oncogenes; BRAF score (from
the genomic signature) and the mucinous histology status of
the tumor.

Table 1: Summary of main clinical parameters.
Parameter                            N      Proportion (%)
Stage
  Stage II                           55     18.9
  Stage III                          236    81.1
MSI
  MSI-H                              12     4.1
  MSI-L & MSS                        279    95.9
V600E BRAF status
  Mutated                            16     5.5
  Wild type                          275    94.5
KRAS (codons 12 and 13) status
  Mutated                            113    38.8
  Wild type                          178    61.2
BRAF score
  Positive                           59     20.3
  Negative                           232    79.7
Mucinous
  Yes                                33     11.3
  No                                 258    88.7
For each sample, a whole-slide image of a haematoxylin-eosin
(H&E) stained tumor section was acquired at 20x
magnification, using a Hamamatsu NanoZoomer C9600 scanner.
The resulting images were compressed by the image
acquisition software using the JPEG standard (at 80% quality)
and stored in the proprietary NDPI format. The resolution
of the images was 455 nm/pixel (equivalent to 55824 DPI) for
a typical size of 100,000 × 50,000 pixels (varying with the size
of the tissue section). The images were exported in standard
TIFF format using the OpenSlide software library [15].
2.2. Image Preprocessing. The whole-slide images were downscaled
to an equivalent 5x magnification and only tumoral
regions were retained from each sample (manually cut
following the pathologist’s annotations), the pixels outside
the tumors being set to zero. To obtain the intensity signal
corresponding to the haematoxylin and eosin dyes, the color
deconvolution method from [16] was used, resulting in two
single channel (intensity) images (H- and E-images).
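
The stain separation described above can be illustrated with a minimal sketch. It assumes the usual optical-density formulation of color deconvolution and a set of commonly quoted H&E stain vectors; the vectors, the function names, and the mask argument are assumptions made for this illustration and are not taken from the study.

```python
import numpy as np

# Assumed stain vectors in optical-density space (rows: haematoxylin, eosin,
# residual). These are commonly quoted values, not the ones used in the study.
STAINS = np.array([
    [0.65, 0.70, 0.29],   # haematoxylin
    [0.07, 0.99, 0.11],   # eosin
    [0.27, 0.57, 0.78],   # residual channel
])
STAINS = STAINS / np.linalg.norm(STAINS, axis=1, keepdims=True)
UNMIX = np.linalg.inv(STAINS)   # maps optical density (R, G, B) to stain amounts


def rgb_to_optical_density(rgb):
    """Convert 8-bit RGB values to optical density (Beer-Lambert law)."""
    rgb = np.asarray(rgb, dtype=float)
    return -np.log((rgb + 1.0) / 256.0)


def separate_h_and_e(rgb, tumor_mask=None):
    """Return the H- and E-images; pixels outside tumor_mask are set to zero."""
    od = rgb_to_optical_density(rgb)              # shape (rows, cols, 3)
    amounts = (od.reshape(-1, 3) @ UNMIX).reshape(od.shape)
    h_img, e_img = amounts[..., 0], amounts[..., 1]
    if tumor_mask is not None:
        h_img = np.where(tumor_mask, h_img, 0.0)
        e_img = np.where(tumor_mask, e_img, 0.0)
    return h_img, e_img
```

In this sketch a white (unstained) pixel maps to zero in both channels, which is consistent with setting the pixels outside the tumor to zero.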
2.3. Feature Extraction and Image Summarization. Our main
assumption for image data modeling was that local appearance
of the tissue section (local texture) contains enough
information to yield discriminative features. However, the
representation of an image in terms of a set of local descriptors
still does not allow a direct comparison of two images
(required for building a classifier); hence further summarization
and standardization of the representation are needed.
A suitable framework is represented by the image-retrieval
applications based on Bag-of-Visual-Words methods [17]. In
this framework, the local descriptors are used to construct a
codebook for image representation (the information in the
image is highly compressed) and the image is recoded in
terms of frequencies of elements (visual codewords) from the
codebook. We adapted this general approach to the problem
at hand, as follows.
We decided to use a two-level approach to image representation
with the first level (L1) being generic for all images
and the second one (L2) specific to each class. The main
reason behind this approach was that the first coding level
was designed to capture the appearance of small structures
(several cells, patches of stroma, parts of the colon crypts,
etc.), while the second level was intended to capture larger
arrangements of basic structures, which might be specific to
each class. Additionally, since the classification problem was
highly imbalanced, such separation would allow structures
of both classes to be equally represented. Such a multilevel
approach has already been used in natural scene categorization
[18]; in our method, however, we also used the class label in
generating the second-level representation.
The first level (L1) of coding considered local patches of
size 32 × 32 pixels as the basic processing unit. For such
patches, we used the Gabor descriptors computed on both H- and
E-images for each sample. These descriptors were based
on the real component of the Gabor filter [19]:

$$G(x, y; \nu, \theta, \sigma) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \times \exp\left(2\pi\nu j\,(x\cos\theta + y\sin\theta)\right), \tag{1}$$

where j = √−1 and ν was the frequency, θ the orientation,
and σ the bandwidth of the Gaussian kernel, respectively.
The parameters were fixed throughout all experiments: σ ∈
{1, 2√2}, θ ∈ {k(π/4) | k = 0, ..., 3}, and ν ∈ {3/4, 3/8, 3/16}.
In total, there were 24 Gabor filters that led to a 48-valued
descriptor vector for each H- and E-image, with the first 24
values representing the mean response and the last 24 values
representing the variance of the filter responses, over the
considered 32 × 32 pixel patch. Thus, each local patch
from the original images was described by a 96-value
vector obtained by concatenating the Gabor descriptors of
the H- and E-images.
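
A minimal sketch of this descriptor is given below. The parameter sets are those listed above; the kernel support (17 × 17 pixels), the zero-padded FFT filtering, the interpretation of the frequency as cycles per pixel, and all function names are assumptions of this illustration, since the source does not specify them.

```python
import numpy as np
from itertools import product

SIGMAS = (1.0, 2.0 * np.sqrt(2.0))
THETAS = tuple(k * np.pi / 4.0 for k in range(4))
FREQS = (3.0 / 4.0, 3.0 / 8.0, 3.0 / 16.0)


def gabor_kernel(nu, theta, sigma, half_size=8):
    """Real part of Eq. (1) on a (2*half_size+1)^2 grid (grid size is assumed)."""
    y, x = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1].astype(float)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * nu * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier


# 2 bandwidths x 4 orientations x 3 frequencies = 24 filters
BANK = [gabor_kernel(nu, th, s) for s, th, nu in product(SIGMAS, THETAS, FREQS)]


def filter_same(patch, kernel):
    """Linear convolution via zero-padded FFT, cropped to the patch size."""
    ph, pw = patch.shape
    kh, kw = kernel.shape
    shape = (ph + kh - 1, pw + kw - 1)
    full = np.fft.irfft2(np.fft.rfft2(patch, shape) * np.fft.rfft2(kernel, shape), shape)
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + ph, left:left + pw]


def channel_descriptor(patch):
    """48 values: 24 mean responses followed by 24 response variances."""
    responses = [filter_same(patch, k) for k in BANK]
    return np.array([r.mean() for r in responses] + [r.var() for r in responses])


def patch_descriptor(h_patch, e_patch):
    """96 values: descriptor of the H patch followed by that of the E patch."""
    return np.concatenate([channel_descriptor(h_patch), channel_descriptor(e_patch)])
```

Calling `patch_descriptor` on the 32 × 32 H and E patches of one location yields the 96-value vector that enters the first-level codebook.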
From each image in the training set (which will be generated
within the cross-validation loop, see Classifier Design),
1,000 random patch descriptors were selected for building
the L1 codebook using the standard k-means clustering, with
𝐾1 = 128 clusters. Then, all the patches were assigned a code
1, . . . , 𝐾1 based on the closest cluster (codeword) from the L1
codebook.
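
The codebook construction and the coding step can be sketched as follows. This is a plain Lloyd-style k-means written for illustration only; the initialization, the number of iterations, and the function names are assumptions, and the codes are numbered from 0 rather than from 1.

```python
import numpy as np


def assign_codes(points, centers):
    """Index of the closest codeword (squared Euclidean distance) for each point."""
    d2 = ((points ** 2).sum(axis=1)[:, None]
          - 2.0 * points @ centers.T
          + (centers ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)


def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the (n_clusters, dim) array of codewords."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        codes = assign_codes(points, centers)
        for c in range(n_clusters):
            members = points[codes == c]
            if len(members):            # an empty cluster keeps its old center
                centers[c] = members.mean(axis=0)
    return centers


def build_l1_codebook(descriptors_per_image, n_per_image=1000, n_clusters=128, seed=0):
    """Draw the same number of patch descriptors from every training image, then cluster."""
    rng = np.random.default_rng(seed)
    sample = [d[rng.choice(len(d), min(n_per_image, len(d)), replace=False)]
              for d in descriptors_per_image]
    return kmeans(np.vstack(sample), n_clusters, seed=seed)
```

Sampling an equal number of descriptors per image keeps any single large slide from dominating the codebook; `assign_codes` is then applied to all patches of an image, not only to the sampled ones.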
The second level of coding (L2) considered neighborhoods
of 15 × 15 L1 patches (i.e., 480 × 480 pixels). For each
such neighborhood, the descriptor computed was the vector
of frequencies of the L1 codes (a vector with 𝐾1 values).
Similarly to L1 coding, a new codebook was constructed
by clustering L2 descriptors (500 random L2 descriptors
selected from each image) with 𝐾2 = 128 clusters. Two
such codebooks were constructed, one for each class (BRAF-positive
and BRAF-negative), and then both were used for coding
each image, leading to a representation with codes 1, . . . , 2𝐾2.
The process described above led to a recoding of each
image in terms of a histogram with 2𝐾2 bins, each corresponding
to an L2 code. We note that, in all the steps for image
coding, the patches containing more than 50% of background
pixels were excluded.
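
The second coding level can be sketched as below. Several details are assumptions of this illustration because the source leaves them open: the neighborhoods are taken as non-overlapping tiles, a tile is skipped when more than half of its L1 patches were excluded as background, the two class codebooks are simply stacked before the nearest-codeword search, and the final histogram is normalized to sum to one.

```python
import numpy as np


def l2_descriptors(l1_codes, background, k1=128, block=15, max_background=0.5):
    """Frequencies of L1 codes in each block x block tile of the L1 code map.

    l1_codes   : 2-D integer array with one L1 code (0..k1-1) per 32 x 32 patch
    background : 2-D boolean array of the same shape, True for excluded patches
    """
    rows, cols = l1_codes.shape
    out = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            bg = background[r:r + block, c:c + block]
            if bg.mean() > max_background:        # mostly background: skip the tile
                continue
            codes = l1_codes[r:r + block, c:c + block][~bg]
            hist = np.bincount(codes, minlength=k1).astype(float)
            out.append(hist / hist.sum())
    return np.array(out).reshape(-1, k1)


def nearest(points, centers):
    """Index of the closest codeword for each row of points."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def image_histogram(descriptors, codebook_pos, codebook_neg):
    """Recode an image as a histogram over the 2*K2 class-specific codewords."""
    codebook = np.vstack([codebook_pos, codebook_neg])
    codes = nearest(descriptors, codebook)
    hist = np.bincount(codes, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

With 𝐾2 = 128 codewords per class, `image_histogram` returns the 256-value vector that serves as classifier input.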
2.4. Classifier Design. After the image recoding step, each
image was represented by a 2𝐾2-value vector, which constituted
the input data for the classifier design. The classifier design
included the following main steps:
(1) Classifier feature selection: features (elements of the
input vectors) were ordered using the recursive feature
elimination (RFE) method [20] and subsets of
features of sizes 𝑓 = 30, 50, . . . , 130 (approximately
half of the total number of features) were considered for
Step (2).
(2) For each subset of features, a Support Vector Machine
(SVM) [21] with Radial Basis Function (RBF) kernel
was trained and its metaparameters were optimized
in an inner cross-validation loop. Its performance was
estimated by cross-validation and the estimated area
under the ROC curve (AUC) recorded.
(3) The number of features yielding the maximum AUC
was deemed optimal and the final SVM was trained
on that number of features.
To estimate the performance of the system, the image
recoding procedure followed by Steps (1)–(3) above was
embedded into an external 10-fold stratified cross-validation
loop, thus ensuring an unbiased estimation. The vector of
predicted labels within this outer cross-validation was taken
to represent typical predictions of the model and used in
statistical analyses to avoid overly optimistic conclusions that
would have been obtained from the predictions made by the
model trained on the full data set.
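
The structure of the outer loop can be sketched as follows. The point of the sketch is that everything data-dependent (codebooks, feature selection, SVM tuning) is refitted on the training part of each fold, so that each sample receives exactly one held-out prediction. The `fit_pipeline` callable and the majority-class stand-in are hypothetical placeholders, not the pipeline used in the study, and the round-robin fold assignment is one simple way of stratifying.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)   # deal indices round-robin
    return folds


def outer_cross_validation(labels, fit_pipeline, n_folds=10, seed=0):
    """Return one held-out prediction per sample.

    fit_pipeline(train_idx) is a hypothetical callable that runs codebook
    construction, feature selection and SVM tuning on the training indices
    only, and returns a function mapping a sample index to a predicted label.
    """
    folds = stratified_folds(labels, n_folds, seed)
    predictions = [None] * len(labels)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predict = fit_pipeline(train_idx)
        for i in test_idx:
            predictions[i] = predict(i)
    return predictions


# Toy usage with the class sizes of Table 1 and a stand-in pipeline that
# always predicts the majority class of its training part.
labels = [1] * 59 + [0] * 232


def majority_pipeline(train_idx):
    positives = sum(labels[i] for i in train_idx)
    majority = int(2 * positives > len(train_idx))
    return lambda i: majority


held_out = outer_cross_validation(labels, majority_pipeline)
```

With 59 positive and 232 negative samples, each of the 10 folds produced this way contains 5 or 6 positive cases.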
2.5. Statistical Analyses. The main performance parameter
for the classifier was AUC, but sensitivity and specificity
were also measured. For sensitivity and specificity, 95%
confidence intervals were computed using the Agresti-Coull
approximation [22], while for AUC they were obtained by
bootstrap [23]. To test the association between individual
image features and the class label, univariable logistic regression
models were fit and the sign of the resulting coefficient
was used to determine the sense of the association. To test for
the association between clinical variables and classifier predictions
we used the χ² test on 2 × 2 contingency tables. Survival
analysis was performed using survival package (version 2.39-4)
from R statistical computing environment (version 3.3.1,
http://www.r-project.org). The estimation of hazard ratios
was obtained from Cox proportional hazards regression in
the absence of any other covariates, while the comparison of
survival experiences of different subgroups was assessed by
log-rank test (Mantel-Haenszel test). Statistical significance
level was chosen to be 𝑝 = 0.05 and no adjustment for
multiple hypotheses testing was performed.

[Figure 1, panel (a): “BRAF score versus predictions”, y-axis: BRAF score, samples color-coded as predicted BRAF− or predicted BRAF+; panel (b): “margin versus BRAF score”, x-axis: BRAF score, y-axis: margin, classification errors only.]
Figure 1: Analysis of the classifier’s predictions. (a) Waterfall plot of the BRAF scores and the corresponding predictions (color-coded). (b)
The relationship between the genomic score (𝑥-axis) and the prediction margin (𝑦-axis) for the misclassified samples.

Table 2: Confusion matrix for classifier predictions. The ground
truth is given by the genomic signature.
                          Predicted BRAF-negative    Predicted BRAF-positive
Genomic BRAF-negative     221                        11
Genomic BRAF-positive     9                          50
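
As an illustration of the interval estimation for proportions, the following sketch applies the Agresti-Coull approximation to the counts of Table 2. The normal quantile is fixed at 1.96 and the function name is an assumption of this example.

```python
import math


def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval for a binomial proportion (z = 1.96 for about 95%)."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2.0) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Counts from Table 2 (rows: genomic label, columns: predicted label).
tn, fp = 221, 11
fn, tp = 9, 50

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, k, n in [("Se", tp, tp + fn),
                   ("Sp", tn, tn + fp),
                   ("Acc", tp + tn, tp + tn + fp + fn)]:
    low, high = agresti_coull(k, n)
    print(f"{name} = {k / n:.3f}, 95% CI = ({low:.3f}-{high:.3f})")
```

The interval is centered on the adjusted proportion rather than on the observed one, so it need not be symmetric around the point estimate.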
3. Results and Discussion
3.1. Image-Based Predictor. The estimated performance of the
classifier was AUC = 0.938, 95% CI = (0.903–0.972), with a
default operating point yielding a sensitivity Se = 0.848, 95%
CI = (0.733–0.920), and a specificity Sp = 0.953, 95% CI =
(0.917–0.974), corresponding to an accuracy Acc = 0.931,
95% CI = (0.896–0.956). The optimal number of features
varied throughout the cross-validation iterations between 70
and 110. In Table 2, the confusion matrix from the cross-validation
predictions is shown.
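
A bootstrap interval for the AUC, of the kind reported above, can be sketched as follows. The source does not state which bootstrap variant was used; this example assumes a percentile bootstrap over resampled cases and computes the AUC through the Mann-Whitney statistic. Function names and defaults are assumptions of the illustration.

```python
import numpy as np


def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; resamples cases with replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        if labels[idx].all() or not labels[idx].any():
            continue                      # a one-class resample has no AUC
        stats.append(auc(labels[idx], scores[idx]))
    low, high = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(low), float(high)
```

In the setting of this study, `scores` would be the SVM decision values collected in the outer cross-validation and `labels` the dichotomized genomic score.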
The relationship between the image-based classifier predictions
(from cross-validation) and the genomic score can
be seen in Figure 1. The misclassified samples cover
the whole range of genomic scores (Figure 1(a)). For
SVMs, the margin of a sample can be viewed as the confidence
in the prediction; hence we studied the
classification errors in the context of their corresponding
margins. In Figure 1(b), the margins are shown as a function
of the genomic score. It appears that smaller margins correspond
to larger (in absolute value) BRAF scores, indicating that the
confidence in those (erroneous) predictions is rather low.
A different trade-off between sensitivity and specificity
could be obtained by adapting the classifier’s threshold: for
example, an operating point yielding Se = 0.915, 95% CI =
(0.812–0.967), and Sp = 0.776, 95% CI = (0.718–0.825), would
favor the detection of BRAF-positives.
3.2. Relationship with Clinical Parameters. Further investigation
of the classifier’s errors showed that most of the false
negatives were KRAS mutants (6 out of 9) while the majority
of the false positives were double wild type (BRAF and KRAS
wild type). We also note that the classifier labeled two cases
(out of 16) of BRAF mutant tumors as “BRAF-negative”;
however, one of them had also a negative genomic score. The
predictions were also associated with the mucinous status
of the tumors (χ² test 𝑝 value = 0.0066), the microsatellite
instability status (χ² test 𝑝 value < 0.0001), and the grade
(χ² test 𝑝 value = 0.0006) as expected [8] but not with
other clinical parameters including KRAS mutation status
and tumor stage.
The BRAF genomic signature was shown to have a strong
prognostic value for overall survival (OS) and survival after
relapse (SAR) and limited value for relapse-free survival
(RFS) [8]. In the subset of samples considered, the genomic
signature maintained its prognostic value and the classifier
predictions inherited, to some degree, this property: the
predictions were prognostic for OS (𝑝 = 0.007, HR = 1.81,
95% CI = (1.17–2.81)) and SAR (𝑝 = 0.010, HR = 1.89, 95%
CI = (1.16–3.10)) but not for RFS (𝑝 = 0.072, HR = 1.44, 95%
CI = (0.97–2.13)).
3.3. The Predictive Image Features. We investigated the structure
of the final model generated using the complete data set,
on which both image recoding and the classifier design steps
were applied as described above. For this model, 90 features
(corresponding to codewords from the L2 codebook) were
selected as the optimal set and using the logistic regression
coefficient (from single-variable models) they were divided
into “positive features” (preferentially present in BRAF-positive
cases, 58 features in total, see Figure 2) and “negative
features” (preferentially present in BRAF-negative cases, 32
features in total, see Figure 3). We note that a number
of features were dedicated to representing the border of
the tumors and that some were partially affected by the
markings present on the slides. It appears that the color
deconvolution used in combination with Gabor descriptors
made the representation robust to this type of noise. A second
observation was that there were, roughly, twice as many
image features representing the positive class compared to
the negative one. This was to some degree not unexpected:
indeed, in general, the BRAF-mutated and MSI-H CRC
tumors show more intratumoral heterogeneity than the rest;
however our results may suggest that this characteristic is
common to a larger group of tumors.

Figure 2: “Positive features”: image patterns associated with BRAF-positive class. Each feature is a 480 × 480 image patch and corresponds
to an L2 codeword. Higher resolution image is available at DOI: 10.5281/zenodo.376999.

Figure 3: “Negative features”: image patterns associated with BRAF-negative class. Each feature is a 480 × 480 image patch and corresponds
to an L2 codeword. Higher resolution image is available at DOI: 10.5281/zenodo.376999.
The exact contribution of each feature to the final decision
is less obvious as their involvement in the classifier’s prediction
is through the RBF kernel and since the support vectors
(actually a number of images from the training set) are
defining the separation boundary between classes. However,
a visualization of their spatial distribution in images may
help in qualitatively understanding the model: in Figure 4
two examples of correctly classified tumors are shown. It
appears that the features identified as “positive features” cover
a relatively larger region in the BRAF-positive tumors than
the “negative features.” The inverse relationship holds for the
BRAF-negative tumors.

Figure 4: Spatial distribution of (positive and negative) features in two correctly classified images. The regions with low contrast were not
involved in the classification process. (a-b) A BRAF-positive tumor: (a) positive image features; (b) negative image features. (c-d) A BRAF-negative
tumor: (c) positive image features; (d) negative image features. Higher resolution images are available at DOI: 10.5281/zenodo.376999.
We also investigated whether the codebooks (for both
levels of coding, L1 and L2) are biased towards one or a
small group of images. We recall that the codebooks have
been generated using an equal number of image patches
randomly selected from the images. None of the clusters
of the codebooks was dominated by a particular image,
indicating that the codebooks capture general features.
4. Conclusions
We presented an image-based classifier that was able to
predict with high accuracy the outcome of a genomic score.
The input images were scans of H&E pathology slides making
the system suitable for integration in the routine diagnostic
procedures. Since the predictions of the classifier (as those of
the corresponding genomic score) were not correlated with
the TNM staging, they brought an independent indication
of high-risk tumors (in the case of positive predictions). The
system could also be applied for the retrospective selection of
cases from tumor archives, reducing the volume of cases that
an expert would need to evaluate.
Another important outcome is the observation that
some gene expression based signatures may be translated
into an image-based surrogate biomarker. Such tissue-based
biomarkers may be used as a filtering step before the genomic
tests.
Disclosure
This article reflects only the author’s views and the Union is
not liable for any use that may be made of the information
contained therein.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This project is financed by the SoMoPro II programme.
The research leading to this result has acquired a financial
grant from the People Programme (Marie Curie action) of
the Seventh Framework Programme of EU according to the
REA Grant Agreement no. 291782. The research is further
cofinanced by the South Moravian Region. The necessary
computational resources were provided by the CESNET
LM2015042 and the CERIT Scientific Cloud LM2015085
projects under the programme “Projects of Large
Research, Development, and Innovations Infrastructures.”
The authors also thank the PETACC-3 translational research
group and Professor Fred Bosman in particular, for making
the data for the present study available.
References
[1] C. C. Compton, “Colorectal carcinoma: diagnostic, prognostic,
and molecular features,” Modern Pathology, vol. 16, no. 4, pp.
376–388, 2003.
[2] F. T. Bosman and P. Yan, “Molecular pathology of colorectal
cancer,” Polish Journal of Pathology, vol. 65, no. 4, pp. 257–266,
2014.
[3] A. Lièvre, J. B. Bachet, D. le Corre et al., “KRAS mutation status
is predictive of response to cetuximab therapy in colorectal
cancer,” Cancer Research, vol. 66, no. 8, pp. 3992–3995, 2006.
[4] S. Benvenuti, A. Sartore-Bianchi, F. di Nicolantonio et al.,
“Oncogenic activation of the RAS/RAF signaling pathway
impairs the response of metastatic colorectal cancers to anti-epidermal
growth factor receptor antibody therapies,” Cancer
Research, vol. 67, no. 6, pp. 2643–2648, 2007.
[5] E. Budinska, V. Popovici, S. Tejpar et al., “Gene expression
patterns unveil a new level of molecular heterogeneity in
colorectal cancer,” Journal of Pathology, vol. 231, no. 1, pp. 63–
76, 2013.
[6] L. Marisa, A. de Reyniès, A. Duval et al., “Gene expression
classification of colon cancer into molecular subtypes: characterization,
validation, and prognostic value,” PLoS Medicine, vol.
10, no. 5, Article ID e1001453, 2013.
[7] A. Sadanandam, C. A. Lyssiotis, K. Homicsko et al., “A colorectal
cancer classification system that associates cellular phenotype
and responses to therapy,” Nature Medicine, vol. 19, no. 5, pp.
619–625, 2013.
[8] V. Popovici, E. Budinska, S. Tejpar et al., “Identification of
a poor-prognosis BRAF-mutant-like population of patients
with colon cancer,” The Journal of Clinical Oncology, vol. 30, no.
12, pp. 1288–1295, 2012.
[9] G. Berx, A.-M. Cleton-Jansen, F. Nollet et al., “E-cadherin is
a tumour/invasion suppressor gene mutated in human lobular
breast cancers,” EMBO Journal, vol. 14, no. 24, pp. 6107–6115,
1995.
[10] B. D. Lehmann, J. A. Bauer, X. Chen et al., “Identification of
human triple-negative breast cancer subtypes and preclinical
models for selection of targeted therapies,” Journal of Clinical
Investigation, vol. 121, no. 7, pp. 2750–2767, 2011.
[11] J. Kong, L. A. D. Cooper, F. Wang et al., “Integrative, multimodal
analysis of glioblastoma using TCGA molecular data, pathology
images, and clinical outcomes,” IEEE Transactions on Biomedical
Engineering, vol. 58, no. 12, pp. 3469–3474, 2011.
[12] Y. Yuan, H. Failmezger, O. M. Rueda et al., “Quantitative image
analysis of cellular heterogeneity in breast tumors complements
genomic profiling,” Science Translational Medicine, vol. 4, no.
157, Article ID 3004330, 2012.
[13] V. Popovici, “Towards the identification of tissue-based proxy
biomarkers,” in Proceedings of the AMIA Joint Summits on
Translational Science, 2016.
[14] E. van Cutsem, R. Labianca, G. Bodoky et al., “Randomized
phase III trial comparing biweekly infusional fluorouracil/leucovorin
alone or with irinotecan in the adjuvant
treatment of stage III colon cancer: PETACC-3,” Journal of
Clinical Oncology, vol. 27, no. 19, pp. 3117–3125, 2009.
[15] M. Satyanarayanan, A. Goode, B. Gilbert, J. Harkes, and D.
Jukic, “OpenSlide: a vendor-neutral software foundation for
digital pathology,” Journal of Pathology Informatics, vol. 4, no.
1, p. 27, 2013.
[16] A. C. Ruifrok and D. A. Johnston, “Quantification of histochemical
staining by color deconvolution,” Analytical and
Quantitative Cytology and Histology, vol. 23, no. 4, pp. 291–299,
2001.
[17] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray,
“Visual categorization with bags of keypoints,” Proceeding of the
Workshop on Statistical Learning in Computer Vision, 2004.
[18] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features:
spatial pyramid matching for recognizing natural scene categories,”
in Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR ’06), pp.
2169–2178, June 2006.
[19] J. G. Daugman, “Uncertainty relation for resolution in
space, spatial frequency, and orientation optimized by two-dimensional
visual cortical filters,” Journal of the Optical Society
of America A: Optics and Image Science, and Vision, vol. 2, no.
7, pp. 1160–1169, 1985.
[20] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection
for cancer classification using support vector machines,”
Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
[21] C. Cortes and V. Vapnik, “Support-vector networks,” Machine
Learning, vol. 20, no. 3, pp. 273–297, 1995.
[22] A. Agresti and B. A. Coull, “Approximate is better than “exact”
for interval estimation of binomial proportions,” The American
Statistician, vol. 52, no. 2, pp. 119–126, 1998.
[23] X. Robin, N. Turck, A. Hainard et al., “pROC: an open-source
package for R and S+ to analyze and compare ROC curves,”
BMC Bioinformatics, vol. 12, article 77, 2011.
13 A robust genomic signature for the detection
of colorectal cancer patients with microsatellite
instability phenotype and high mutation
frequency
• Journal of Pathology, 228(4):586-595, 2012
• IF: 6.894
• number of citations: 16
• personal contribution (10%): data preprocessing, experimental
design, data analysis and results interpretation, manuscript
writing
Journal of Pathology
J Pathol 2012
Published online in Wiley Online Library
(wileyonlinelibrary.com) DOI: 10.1002/path.4092
ORIGINAL PAPER
A robust genomic signature for the detection of colorectal cancer
patients with microsatellite instability phenotype and high
mutation frequency#
Sun Tian,1 Paul Roepman,1 Vlad Popovici,2 Magali Michaut,3 Ian Majewski,3 Ramon Salazar,4 Cristina Santos,4
Robert Rosenberg,5 Ulrich Nitsche,5 Wilma E Mesker,6 Sjoerd Bruin,3 Sabine Tejpar,7 Mauro Delorenzi,2,8,9
Rene Bernards1,3 and Iris Simon1*
1 Agendia NV, Amsterdam, The Netherlands; and Agendia Inc., Irvine, CA, USA
2 Swiss Institute for Bioinformatics, Lausanne, Switzerland
3 Netherlands Cancer Institute, Amsterdam, The Netherlands
4 IDIBELL, Institut Catala d’Oncologia, L’Hospitalet de Llobregat, Barcelona, Spain
5 Klinikum Rechts der Isar, Technische Universität München, Germany
6 Leiden University Medical Centre, Leiden, The Netherlands
7 University Hospital Gasthuisberg, Katholieke Universiteit Leuven, Belgium
8 Département de Formation et Recherche, Centre Hospitalier Universitaire Vaudois, France
9 University of Lausanne, Switzerland
* Correspondence to: I Simon, Agendia NV, Amsterdam, The Netherlands; OR Agendia Inc. Irvine, CA, USA. e-mail: iris.simon@agendia.com
# Array data have been deposited on: http://research.agendia.com/
Re-use of this article is permitted in accordance with the Terms and Conditions set out at http://wileyonlinelibrary.com/onlineopen#OnlineOpen_Terms
Abstract
Microsatellite instability (MSI) occurs in 10–20% of colorectal tumours and is associated with good prognosis.
Here we describe the development and validation of a genomic signature that identiﬁes colorectal cancer
patients with MSI caused by DNA mismatch repair deﬁciency with high accuracy. Microsatellite status for 276
stage II and III colorectal tumours has been determined. Full-genome expression data was used to identify
genes that correlate with MSI status. A subset of these samples (n = 73) had sequencing data for 615 genes
available. An MSI gene signature of 64 genes was developed and validated in two independent validation
sets: the ﬁrst consisting of frozen samples from 132 stage II patients; and the second consisting of FFPE
samples from the PETACC-3 trial (n = 625). The 64-gene MSI signature identiﬁed MSI patients in the ﬁrst
validation set with a sensitivity of 90.3% and an overall accuracy of 84.8%, with an AUC of 0.942 (95% CI,
0.888–0.975). In the second validation, the signature also showed excellent performance, with a sensitivity
of 94.3% and an overall accuracy of 90.6%, with an AUC of 0.965 (95% CI, 0.943–0.988). Besides correct
identiﬁcation of MSI patients, the gene signature identiﬁed a group of MSI-like patients that were MSS by
standard assessment but MSI by signature assessment. The MSI-signature could be linked to a deﬁcient MMR
phenotype, as both MSI and MSI-like patients showed a high mutation frequency (8.2% and 6.4% of 615 genes
assayed, respectively) as compared to patients classiﬁed as MSS (1.6% mutation frequency). The MSI signature
showed prognostic power in stage II patients (n = 215) with a hazard ratio of 0.252 (p = 0.0145). Patients
with an MSI-like phenotype also had improved survival when compared to MSS patients. The MSI signature
was translated to a diagnostic microarray and technically and clinically validated in FFPE and frozen samples.
Copyright © 2012 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.
Keywords: Colorectal cancer; microsatellite instability; deﬁcient mismatch repair system; gene expression; mutation frequency; genomic
signature; prognosis
Received 25 June 2012; Revised 14 August 2012; Accepted 15 August 2012
Sun Tian, Paul Roepman, Rene Bernards and Iris Simon are employees of Agendia (the company that developed ColoPrint and the MSI signature).
Introduction
There are at least two recognized pathways of
colorectal carcinogenesis [1]. The most common
pathway is a progressive model that involves stepwise
accumulation of genetic alterations in several key
oncogenes and tumour suppressor genes, such as
KRAS, BRAF, TP53 and, importantly, the adenomatous
polyposis coli (APC) gene [2,3]. These tumours
account for approximately 85% of all sporadic disease
and commonly display a chromosomal instability
(CIN) phenotype that is associated with widespread
structural alterations. A second class of colon tumours
manifests a microsatellite instability (MSI) phenotype;
these tumours typically display various insertions or
deletions, most commonly in short tandem repeats,
the so-called microsatellites [4]. MSI is the molecular
ﬁngerprint of a deﬁcient mismatch repair system.
Approximately 15% of colorectal cancers (CRCs)
display MSI, owing either to epigenetic silencing
of MLH1 or to somatic or germline mutations in
one of the mismatch repair genes MLH1, MLH3,
MSH2, MSH6 or PMS2 [5]. Consequently, the MSI
phenotype is also referred to as the deﬁcient MMR
(dMMR) phenotype. MSI rates vary with tumour
stage and, in the adjuvant setting, MSI patients have
been associated with longer survival than patients
with microsatellite-stable (MSS) tumours [6,7]. The
deﬁciencies in MMR genes lead to loss of function
of tumour suppressor genes and are associated with
activating mutations in oncogenes such as BRAF [8].
Patients with MSI cancers might have different
responses to chemotherapy compared to MSS patients
[1,9]. The MMR involves the recognition and repair
of incorrectly paired nucleotides during DNA replication.
5-Fluorouracil (5-FU)-based chemotherapy is
the standard treatment for stage II and III CRCs
after surgery, and the survival advantage associated
with this treatment is about 10% [10]. Data from
patients with MSI and from cell lines with dMMR
indicate that MSI promotes resistance to 5-FU treatment
[1]. However, results from clinical studies are
conﬂicting. It seems that MSI patients with stage II
cancer have no beneﬁt from 5-FU treatment [11,12],
while stage III MSI patients might beneﬁt from treatment,
but this is predominantly seen in patients that
have a germline predisposition [13]. Evidence supporting
the preferential efficacy of irinotecan in MSI
tumours continues to emerge, but is still considered
preliminary [14]. Other studies have shown that
MSI colorectal cancer might be specifically sensitive
to compounds inhibiting the phosphatidylinositol 3-kinase
(PI3K)–AKT–mammalian target of rapamycin
(mTOR) pathway [15].
Considering the different prognosis and treatment
response of MSI patients when compared to MSS
patients, an accurate diagnosis is needed to facilitate
appropriate treatment decisions. Today, several
methods for the detection of MSI status are used.
MSI can be detected by PCR ampliﬁcation of speciﬁc
microsatellite repeats. The presence of instability is
determined by comparing the length of nucleotide
repeats in tumour cells and normal cells. A consensus
conference established a panel of microsatellite
markers with appropriate sensitivity and speciﬁcity to
diagnose MSI [16]. This reference panel, known as
the Bethesda panel, included ﬁve microsatellite loci:
two mononucleotides (Bat25 and Bat26) and three
dinucleotides (D5s346, D2s123 and D17s250) [17].
Immunohistochemical analysis of MMR proteins is
an alternative method to detect MSI in the clinical
setting and complements the genetic testing of Lynch
syndrome [18]. Lack of expression of one or more of
the MMR proteins is indicative of deﬁcient MMR, and
can help to determine which gene harbours a germline
mutation or has been inactivated by another mechanism.
However, traditional methods for determining
MSI status might not identify all patients with a deﬁcient
mismatch repair system and other methods might
be required for a more comprehensive detection [19].
As demonstrated by others [15,20] and in this paper,
patients with MSI have a very distinct gene expression
pattern that allows the development of strong gene
expression signatures. Pairwise comparisons between
studies showed that 94–98% of genes have consistent
changes in expression, even though samples were
analysed on different platforms and in different studies
[20]. Here we describe the development and validation
of a robust gene expression signature that identiﬁes
patients with MSI status, determined by standard methods
(PCR, IHC) with high accuracy, and additionally
identiﬁes a group of MSS patients with a MSI-like
phenotype. The signature was translated into a diagnostic
test that can be used in fresh or FFPE material
and can be performed in combination with other gene
expression signatures [21,22] for further classiﬁcation
of early-stage colon cancer patients.
Methods
Patients and samples
In this study, microsatellite instability was assessed in
three patient cohorts that have been described previously:
a development cohort (A) [22], a ﬁrst independent
validation cohort (B) [23] and a second independent
cohort in the subset of the PETACC-3 gene
expression dataset with complete MSI status information
(cohort D) [24–26]. The prognostic value of the
developed MSI signature was assessed on cohort B
combined with an additional set of samples with patient
follow-up data but without hospital-based MSI assessment
(cohort C). Patient and sample characteristics are
shown in Table 1. All tissue samples were collected
from patients with appropriate informed consent. The
study was carried out in accordance with the ethical
standards of the Helsinki Declaration and was approved
by the medical ethical boards of the participating medical
centres and hospitals.
Hospital-based assessment for microsatellite
instability (MSI)
For the development cohort (cohort A), fresh-frozen
tumour samples from patients with colorectal cancer
were collected (n = 276; Table 1). For 90 patients,
5 µm slides were immunohistochemically stained for
the markers MLH1 and PMS2. For the remaining 186
patients and for all patients in validation cohort B (n
= 132; Table 1) the MSI/MSS status was assessed
by PCR ampliﬁcation, following the standard protocol
of the hospital and described in [21,22,26] and in
Supplementary methods (see Supplementary material).
Table 1. Patient characteristics

Cohorts                       A            B           C                       D           Total
                              Development  Validation  Validation (prognosis)  Validation
Patients (n)                  276          132         131                     625         1164
Tissue type                   Fresh        Fresh       Fresh                   FFPE
Age
  < 70                        157          84          60                      529         830
  ≥ 70                        119          48          71                      96          334
Stage
  I                           40           –           –                       –           40
  II                          157          132         131                     104         524
  III                         78           –           –                       521         599
  IV                          1            –           –                       –           1
Gender
  Male                        165          74          66                      382         687
  Female                      111          58          65                      243         477
Location
  Left colon                  143          76          56                      391         666
  Right colon                 96           56          57                      234         443
  Rectum                      37           –           10                      –           47
  Not available               –            –           8                       –           8
Grade
  1                           83           1           21                      –           105
  2                           172          90          87                      567*        916*
  3                           20           41          21                      55*         137*
  Not available               1            –           2                       3           6
BRAF
  Activating mutation         24           18          13                      46          101
  Wild-type/unknown mutation  248          86          92                      577         1003
  Not available               4            28          26                      2           60
Microsatellite stability
  MSI                         29           31          –                       70          130
  MSS                         247          101         –                       555         903
  Not available               –            –           131                     –           131

* The PETACC-3 dataset dichotomized the grade information by grouping grades 1 and 2, and 3 and 4, respectively.
Patients who had at least two microsatellite unstable
markers were deﬁned as MSI. A tumour with only
normal markers was deﬁned as microsatellite-stable
(MSS). MSI assessment of the PETACC-3 samples
(cohort D) was performed as described previously,
using a standard panel of 10 mononucleotide and dinucleotide
microsatellite loci by PCR ampliﬁcation of
normal/tumour DNA pairs [26]. Irregularity in one
marker (two in the PETACC-3 study) was deﬁned
as low-grade microsatellite instability (MSI-L); irregularity
in more markers was deﬁned as high-grade
microsatellite instability (MSI) [27]. Patients with MSI-L
were classified as MSS in all analyses.
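The marker-counting rule described above can be sketched as a small function. This is an illustration only: the name `call_msi_status` and the parameter `msi_low_max` are hypothetical, and the rule encodes one reading of the text (one unstable marker counts as MSI-L in cohorts A and B, up to two in PETACC-3, and MSI-L is analysed as MSS).

```python
def call_msi_status(n_unstable, msi_low_max=1):
    """Return 'MSI' or 'MSS' from the number of unstable markers.

    msi_low_max is the largest count still treated as low-grade
    instability (MSI-L): 1 for cohorts A and B, 2 for the
    ten-marker PETACC-3 panel, following the text above.
    """
    if n_unstable < 0:
        raise ValueError("marker count cannot be negative")
    if n_unstable > msi_low_max:
        return "MSI"  # high-grade instability
    return "MSS"      # stable, or MSI-L analysed as MSS


# Two unstable markers: MSI in cohorts A/B, MSI-L (hence MSS) in PETACC-3.
status_ab = call_msi_status(2)
status_petacc3 = call_msi_status(2, msi_low_max=2)
```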
Development and validation of a 64-gene signature
associated with MSI status
RNA extraction, T7-based linear amplification, Cy-dye
labelling and hybridization to Agilent arrays were
performed as described previously [22]. All tumour
samples contained > 30% tumour cells. Samples were
analysed against a common reference that was generated
using a pool of 44 CRC samples. Gene expression
measurements were normalized (Lowess normalization)
and log-ratios were used for identiﬁcation of
genes that were associated with the MSI status of the
tumours (based on two-sided Student’s t-test). We used
a 10-fold cross-validation (CV10) procedure that has
been described previously [22,28]. The CV10 procedure
was applied on the development cohort (n =
276) and repeated 1000 times to determine classiﬁcation
performance and for robust gene selection. During
each CV10 round, genes were ranked by p value. The
64 genes (see Supplementary material, Table S1) with
the highest frequency of appearance within the top-ranking
genes in each of the 1000 CV loops were
selected as the ﬁnal set with the strongest MSI association
(http://research.agendia.com/).
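The repeated cross-validation gene selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `top_k` cut-off for "top-ranking" genes, and the small default number of repeats are all hypothetical. Genes are ranked by the absolute Student's t statistic, which gives the same order as ranking by two-sided p value within a fold because every gene shares the same degrees of freedom.

```python
import random
from collections import Counter
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def select_stable_genes(X, y, n_select, top_k, n_repeats=10, n_folds=10, seed=0):
    """Count how often each gene is top-ranked across repeated CV rounds.

    X: list of samples, each a list of log-ratios (one per gene).
    y: list of 0/1 labels (0 = MSS, 1 = MSI).
    Returns the indices of the n_select most frequently top-ranked genes.
    """
    rng = random.Random(seed)
    n_genes = len(X[0])
    counts = Counter()
    idx = list(range(len(X)))
    for _ in range(n_repeats):
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            scores = []
            for g in range(n_genes):
                msi = [X[i][g] for i in train if y[i] == 1]
                mss = [X[i][g] for i in train if y[i] == 0]
                scores.append((abs(t_statistic(msi, mss)), g))
            scores.sort(reverse=True)
            counts.update(g for _, g in scores[:top_k])
    return [g for g, _ in counts.most_common(n_select)]
```

In the setting described in the text, `n_select` would be 64 and `n_repeats` 1000; each training fold must contain at least two samples of each class for the variance to be defined.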
The 64 gene set was used to construct a nearest
centroid-based classiﬁcation method (cosine correlation);
a MSI gene signature index for the individual
samples was deﬁned as the difference of the two correlations.
Samples were classiﬁed within the MSI group
if their index exceeded a predeﬁned optimized threshold.
This threshold was determined to reach a maximal
overall accuracy (sum of sensitivity and speciﬁcity).
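A nearest-centroid read-out of this kind can be sketched as below. The sketch assumes that the index is the cosine correlation with the MSI centroid minus the cosine correlation with the MSS centroid; the text only states that it is "the difference of the two correlations", so the sign convention, the function names and the placeholder threshold of 0.0 are assumptions.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(profiles):
    """Gene-wise mean of a list of expression profiles."""
    return [sum(col) / len(col) for col in zip(*profiles)]


def msi_index(sample, msi_centroid, mss_centroid):
    """Difference of the two correlations; larger means closer to MSI."""
    return cosine(sample, msi_centroid) - cosine(sample, mss_centroid)


def classify(sample, msi_centroid, mss_centroid, threshold=0.0):
    """Call MSI when the index exceeds the (hypothetical) threshold."""
    index = msi_index(sample, msi_centroid, mss_centroid)
    return "MSI" if index > threshold else "MSS"
```

Because both correlations lie between -1 and 1, the index lies between -2 and 2; in practice the threshold would be tuned on the development cohort only, as the text describes.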
The 64-gene signature was validated on 132 independent
CRC samples analysed in the same way as
the development cohort, using the same microarray
platform and threshold (cohort B, Table 1). Samples
were classiﬁed as MSI if their index (the difference
of the two correlations) exceeded the predeﬁned optimized
threshold. A second validation was performed
on data from the PETACC-3 study comprising 625
colon tumour FFPE samples with known MSI status,
[Figure 1. Panel A: heat map of the 64-gene signature for the MSI phenotype in 276 colorectal tumours, ordered by signature index (-1 to 1), with bars for MSI status and BRAF status and the groups MSS, MSI-like and MSI. Panel B: ROC curve (sensitivity versus 1-specificity) for the first validation cohort: AUC 0.942, sensitivity 90.3%, specificity 83.2%. Panel C: ROC curve for the second validation cohort (FFPE): AUC 0.965, sensitivity 94.3%, specificity 90.1%. Gene labels in panel A: TNNT1, TRIM7, AGR2, RPL22L1, STRN3, LMO4, VNN2, MDM2, SMAD2, TNFSF9, SLC25A22, EPDR1, C13ORF18, ASCL2, QPRT, GPR143, SHROOM2, GGT7, SORBS1, ROCK2, GUCY2C, CEACAM5, CEACAM3, TFCP2L1, RNF43, AXIN2, SHROOM4, ARID3A, LOC157860, KCNK5, PPP1R3D, UNKL, GGA2, BC000986, ZSWIM3, ASXL1, C20orf11, C20orf43, DYNLRB1, DIDO1, PLAGL2, Unknown, SMCR7L, DUSP18, ZFP36L2, CEP68, LFNG, PRR15, ATP9A, TSPAN6, TCF7, KHDRBS3, MLH1, C10orf47, HNRNPL, EP300, TGFBR2, FBXO34, VAV3, ACSL6, TNNC2, GNG4, KRT23, OIT3.]
Figure 1. (A) A 64-gene expression signature for identiﬁcation of colorectal cancer samples with MSI, MSI-like and MSS phenotypes.
The MSI signature read-outs (index) are shown for 276 tumour samples (cohort A): red, relative up-regulation; green, down-regulation.
Standard hospital-based MSI assessment is indicated in the middle bars, together with the BRAF V600E mutation status: light grey, MSS
or BRAF wild-type, dark grey, MSI or BRAF mutation. (B) ROC curve and AUC of the signature read-out on validation cohort B. (C) ROC
curve and AUC of the MSI signature on validation cohort D (PETACC-3 study). The optimal sensitivity and speciﬁcity (with a sensitivity of
at least 0.9 and sum of sensitivity and speciﬁcity is maximal) is indicated in grey.
of which 70 (11.2%) were MSI (cohort D, Table 1). As
described previously [25], these 625 samples had been
hybridized to a custom Affymetrix platform optimized
for analysis of degraded RNA in FFPE samples. We
could identify 58 of the 64 MSI signature genes. Read-out
of the MSI gene signature index on the Affymetrix
data was done in a similar fashion as for the ﬁrst validation
cohort. A receiver operating characteristic (ROC)
curve was plotted and the area under the ROC curve
(AUC) was calculated. Sensitivity and speciﬁcity were
calculated based on the optimal overall accuracy, with
a sensitivity of at least 90%.
Besides the main binary classiﬁcation of MSS and
MSI samples, a secondary threshold was determined
to subclassify MSI-like samples that were positive by
MSI gene expression signature but typically classiﬁed
as MSS by hospital assessment. Both thresholds for
MSI and MSI-like classiﬁcation were determined using
the development cohort A only and are indicated in
Figure 1A.
Functional analysis of 64-gene signature
Functional analysis of the genes in the signature was
performed by using the Database for Annotation, Visualization
and Integrated Discovery (DAVID) software,
v 6.7 [29]. The enriched functional annotation clusters
were calculated by DAVID through grouping enriched
functional terms. The parameter set used had a similarity
threshold of 0.4, multiple linkage threshold of 0.3
and an EASE parameter of 0.5. Only clusters larger
than three functional terms were used.
Investigation of mutation frequency
DNA fragment libraries were prepared using the
TruSeq DNA Sample Preparation Kit (Illumina) and
were hybridized to the SureSelect Human Kinome bait
library according to the manufacturer’s protocol (Agilent).
Captured DNA samples were sequenced on a
HiSeq 2000 (Illumina), using a 55 bp paired-end protocol.
Sequence reads were aligned to the human genome
(GRCh37/hg19) and unique pairs were used for variant
calling. Candidate variants were identiﬁed using SAMtools
and the following inclusion criteria were applied:
minimum coverage 10; minimum variant count 5; a
variant must be detected on both strands. Variants were
assessed using the Ensembl variant effect predictor (v
62) to deﬁne those that were likely to impact protein
coding sequences and to ﬁlter out germline polymorphisms.
Matched germline DNA was sequenced for 19
of the 73 tumour samples and an additional 56 normal
samples were used to improve the removal of germline
SNPs and sequencing errors. In this paper we focus on
mutation load; a full analysis of the sequence alterations
is the subject of another study.
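The inclusion criteria and the mutation-load measure can be illustrated with a short sketch. The `Variant` record and its field names are hypothetical and do not correspond to SAMtools output fields; the sketch covers only the three stated read-level criteria (coverage, variant count, both strands) and omits the germline and effect-prediction filtering described above.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    coverage: int        # total reads covering the position
    fwd_alt_reads: int   # variant-supporting reads on the forward strand
    rev_alt_reads: int   # variant-supporting reads on the reverse strand


def passes_filters(v, min_coverage=10, min_variant_count=5):
    """Apply the three read-level inclusion criteria from the text."""
    alt = v.fwd_alt_reads + v.rev_alt_reads
    return (
        v.coverage >= min_coverage
        and alt >= min_variant_count
        and v.fwd_alt_reads > 0
        and v.rev_alt_reads > 0
    )


def mutation_frequency(variants, n_genes_assayed=615):
    """Percentage of assayed genes with at least one passing variant."""
    mutated = {v.gene for v in variants if passes_filters(v)}
    return 100.0 * len(mutated) / n_genes_assayed
```

Counting genes rather than variants means that several passing variants in the same gene contribute only once, which matches the "% of genes mutated" measure used in the results.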
Statistical and survival analysis
All analyses and statistical tests were performed
in Matlab (MathWorks) or R (v 2.14.1; www.r-project.org).
All tests were two-sided and the significance
level of p values was set to be 0.05. Survival
analysis was performed on cohorts B and C combined,
using Cox proportional hazard models with 10-year
distant metastasis-free survival (DMFS) as end point.
Results
Development of an MSI signature
A cohort of 276 colorectal tumour samples (cohort A,
Table 1) was analysed for their microsatellite status
[microsatellite instability (MSI) or stability (MSS)]
according to the local standard methodology at the
originating hospital (see Methods for details); 11%
(n = 29) of the tumours were identiﬁed as MSI
(Table 1). This cohort was used for identiﬁcation of
genes with expression strongly associated with MSI
status. Using a 10-fold cross-validation procedure,
we identiﬁed a set of 64 genes (see Supplementary
material, Table S1) that formed the basis of a single
sample-based classiﬁer to accurately identify MSI
tumours (Figure 1A). Optimal accuracy was reached
upon classiﬁcation of 57 samples as MSI by the
signature and 219 samples as MSS, corresponding
to a sensitivity of 93.1% and a speciﬁcity of 87.9%
(Table 2).
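As a consistency check, the reported percentages for cohort A can be converted back into approximate counts. The counts below are reconstructed by rounding and are not given in the source; the function name is hypothetical.

```python
def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Reconstruct approximate confusion-matrix counts from rates."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return {"TP": tp, "FN": n_pos - tp, "TN": tn, "FP": n_neg - tn}


# Cohort A: 29 MSI and 247 MSS tumours by hospital assessment.
cm = confusion_from_rates(n_pos=29, n_neg=247, sensitivity=0.931, specificity=0.879)
called_msi = cm["TP"] + cm["FP"]                  # samples called MSI by the signature
accuracy = (cm["TP"] + cm["TN"]) / (29 + 247)     # overall accuracy
```

Under this reconstruction, about 27 of the 29 hospital-assessed MSI tumours and about 30 hospital-assessed MSS tumours are called MSI by the signature, which agrees with the 57 MSI calls and the 88.4% overall accuracy reported for the development cohort.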
The 64-gene signature was validated in an independent
cohort of 132 stage II colon cancer samples
(validation cohort B, Table 1) that was analysed using
the single sample predictor (SSP), as established in
the development cohort. Performance in the validation
samples showed an area under the ROC curve (AUC)
of 0.942 (95% CI, 0.888–0.975) with a sensitivity of
90.3% and a speciﬁcity of 83.2% when applying the
established threshold for MSS and MSI classiﬁcation
(Figure 1B, Table 2).
A second independent validation of the gene signature
was performed on a prospective cohort of FFPE tissue
samples from the randomized PETACC-3 study (cohort
D, Table 1) [24]. Signature read-out in the PETACC-3
samples showed a very high concordance with hospital-based
MSI assessment, with an AUC of 0.965 (95%
CI, 0.943–0.988) and, at the optimal threshold, a sensitivity of
94.3% and a specificity of 90.1% (Figure 1C, Table 2).
Besides validating the signature in an independent
prospective study, this result showed that the developed
64-gene signature can be successfully translated to a
different microarray platform and can likely be used
for MSI assessment on FFPE samples.
MSI signature and mutation frequency
In all patient cohorts, the MSI signature was able to
correctly identify nearly all MSI patients (sensitivity
above 90%). In addition, a group of patients was MSS by
standard assessment but classified as MSI by the
gene signature (Figure 1A). We hypothesized that,
although these MSI-like tumours were assessed as
MSS by standard methods, they do have a deﬁcient
mismatch repair (dMMR)-related phenotype. As such,
the developed gene signature might be able to identify
MSI samples but also MSS samples that harbour a
dMMR phenotype (MSI-likes).
To test this hypothesis, we have deep-sequenced
73 tumour samples for their ‘cancer kinome’ (615
genes in total). The sequencing results conﬁrmed
that samples identiﬁed as MSI by the gene signature
have a signiﬁcantly higher mutation frequency (on
average, candidate mutations were identiﬁed in 7.4%
of the analysed genes) compared to MSS samples (on
average, candidate mutations were identiﬁed in 1.6% of
the genes) (Student’s t-test, p = 3.15e-12). Importantly,
further classiﬁcation into MSI and MSI-like samples
indicated that the mutation frequency of the MSI-like
tumours (average 6.4%) is also signiﬁcantly higher than
that of MSS samples (Student’s t-test, p = 6.26e-6) and
comparable to the mutation frequency in MSI samples
Table 2. Performance of the MSI gene signature: MSI and MSS classification by the 64-gene signature compared to standard local hospital methodology

Cohort                          Tissue  n    Sensitivity (%)  Specificity (%)  Overall accuracy (%)
Development cohort A            Fresh   276  93.1             87.9             88.4
Validation cohort B             Fresh   132  90.3             83.2             84.8
Validation cohort D (PETACC-3)  FFPE    625  94.3             90.1             90.6
(8.2%) (Figure 2). This result suggests that MSI-like
tumours also harbour a dMMR phenotype, resulting in
a higher mutation rate.
It is important to note that the MSI-like patients, as
identiﬁed by the signature, were not patients with a
low-grade MSI (MSI-low) assessment by the hospital
(data not shown), conﬁrming that the MSI-likes might
be a subclass that cannot be identiﬁed by standard MSI
assessment.
Investigation of activating mutations in BRAF
showed that 64.3% of all samples classiﬁed as MSI
by the gene signature harboured an activating BRAF
mutation (36 of 56 samples with a known BRAF
mutation status). In the MSI-like class, 17.4% of
the samples had an activating BRAF mutation, while
the MSS classiﬁed samples were almost exclusively
(98.0%) BRAF wild-type (342 of 349 samples).
Functional annotation
The association between the MSI gene signature and a
dMMR phenotype was further supported by functional
analysis. The results indicated that four functional
annotation clusters were signiﬁcantly enriched in the 64
signature genes (see Methods; see also Supplementary
material, Tables S1 and S2). Annotation cluster 1
indicated that the encoded proteins of the signature
are enriched with zinc-ﬁnger domain proteins, which
are often found as part of transcription, translation,
DNA replication and repair machineries [30]. Together
with the enrichment in functional terms related to
DNA binding and the nucleic acid metabolic processes
(annotation cluster 2), these results are in agreement
with the nature of DNA mismatch repair proteins as
DNA interacting/metabolism partners that often form
large complexes in the nucleus (annotation cluster 4)
[31]. In addition, annotation cluster 3 indicated that the
signature genes are also involved in apoptosis.
MSI-signature and prognosis
The prognostic value of the 64-gene MSI signature was
tested on 263 mostly (80%) untreated stage II tumours:
132 samples from validation cohort B, plus an additional
set of 131 stage II colon tumours with no available
hospital-based MSI assessment (validation cohort
C, Table 1). Patients with samples classiﬁed as MSI by
the gene signature showed a signiﬁcantly better distant
metastasis-free survival (DMFS) compared to patients
with MSS tumours, with a hazard ratio (HR) of 0.252
(95% CI, 0.076–0.83, p = 0.0145) (Figure 3A). After
further subclassiﬁcation into MSI, MSI-like and MSS,
the MSI-like group also showed a signiﬁcantly better
survival compared to MSS samples. Interestingly, the
MSI group with concordant MSI classiﬁcation by signature
and hospital method showed a 100% survival
rate (Figure 3B). In contrast to stage II, investigation
in stage III samples (n = 201) showed no prognostic
value of MSI/MSS classiﬁcation (p = 0.29) (data not
shown).
It has been postulated that MSI patients might be
resistant to 5-FU treatment and that this resistance is
associated with thymidylate synthase (TYMS) activity.
We therefore investigated the expression of TYMS
in the tumours. Samples classified as MSI showed
a significantly higher expression of TYMS compared
to samples classified as MSS (cohort A, p < 1e-18).
Samples classified as MSI-like also showed a
significantly higher expression of TYMS compared to
MSS (p = 3.9e-13) (Figure 4).
Technical validation of the MSI gene signature
The reproducibility of the MSI signature was investigated
by replicate hybridization and analysis of 53
samples. MSI gene signature results were highly reproducible,
with an R² value of 0.992 (Figure 5A)
and, importantly, all samples received the same
classification (100% concordance). Matching samples
from the same patients (n = 60) that were either
preserved as formalin-ﬁxed and parafﬁn-embedded
(FFPE) or preserved fresh in RNA-retain were analysed
to address tumour heterogeneity and technical differences
between FFPE and fresh preservation. The read-outs
of the MSI signature score from these two biopsies
were highly correlated (R = 0.93) and the binary results
(MSS versus MSI) were 98.4% concordant. In addition,
a repeated assessment was performed for three samples
over 20 consecutive days by ﬁve different technicians.
Signature read-out was stable across the 20 consecutive
days, with an average standard deviation of well below
5% of the total dynamic range (Figure 5B). Of the 60
measurements, only two read-outs resulted in a change
in classiﬁcation outcome (96.7% concordance). Finally,
validation of the signature on the PETACC-3 study
(Figure 1C) indicated that the gene signature, which
has been developed and validated on fresh-frozen tissue
samples, can be used for assessment of FFPE samples
as well as fresh tissue.
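The concordance figures quoted above are simple agreement rates between paired classifications. As a minimal sketch (the function name and the label lists are illustrative assumptions, not part of the study's software), the 96.7% value for the 20-day series corresponds to 58 unchanged calls out of 60 read-outs:

```python
def concordance(calls_a, calls_b):
    """Return the fraction of paired classifications that agree."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both call lists must have the same length")
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return matches / len(calls_a)


# Hypothetical labels that only reproduce the counts given in the text:
# 60 read-outs (3 samples x 20 days), 2 of which changed class.
reference_calls = ["MSS"] * 60
repeated_calls = ["MSS"] * 58 + ["MSI"] * 2

percent = 100 * concordance(reference_calls, repeated_calls)
print(f"{percent:.1f}% concordance")
```

The same function applies to any pair of call lists of equal length, for example matched FFPE and fresh samples.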
Discussion
In this report we describe the development of a
64-gene expression signature that identiﬁes patients
Copyright  2012 Pathological Society of Great Britain and Ireland. J Pathol 2012
Published by John Wiley & Sons, Ltd. www.pathsoc.org.uk www.thejournalofpathology.com
[Figure 2 image: 73 colorectal tumours grouped as MSI, MSI-like and MSS (p = 2.97e-5); columns: signature index, MSI status, ‘cancer kinome’ mutation frequency (% genes).]
Figure 2. MSI and MSI-like samples classified by the 64-gene signature show an increased mutation frequency. Seventy-three colorectal
tumour samples were sorted according to their MSI-signature index; the middle bar shows standard hospital-based MSI assessment when
available (light grey, MSS; dark grey, MSI) and the right barplot shows the mutation frequency (% of genes mutated) of each sample in the
‘cancer kinome’ (615 genes).
[Figure 3 image: (A) MSI (n = 79) versus MSS (n = 184), p = 0.0145; (B) MSI (n = 31), MSI-like (n = 48) and MSS (n = 184), p = 0.0103. Axes: distant metastasis-free survival (0.0 to 1.0) against years (0 to 10).]
Figure 3. Prognostic value of the 64-gene MSI signature in 263 stage II colorectal cancers (cohorts B and C combined). (A) Kaplan–Meier
(KM) survival curves for samples classified as MSI (MSI and MSI-like combined) and MSS by the gene signature; (B) KM curves for samples
classified as MSI, MSI-like and MSS by the gene signature. p values are based on the log-rank test.
with DNA mismatch repair deficiency resulting in
an MSI phenotype. The signature was developed and
independently validated in large sets of samples and
showed high reproducibility in technical validation
experiments. To our knowledge, this is the ﬁrst report
to describe a genomic MSI-signature directly linked to
mutation frequency, which was translated into a robust
diagnostic test.
The MSI-signature identiﬁes patients with MSI
status with high accuracy (85% and 91% accuracy in
validation sets B and D, respectively) but also identiﬁes
a group of MSI-like patients who are not recognized by
traditional methods as MSI but have features similar
to MSI patients, eg high mutation frequency, frequent
BRAF mutations, high TYMS expression and better
prognosis. This observation is in good agreement with
[Figure 4 image: TYMS expression (y axis, -0.8 to 0.6) for samples classified as MSS, MSI-like and MSI.]
Figure 4. Relative gene expression levels (log10 scale) of thymidylate
synthase (TYMS) in samples classified as MSS, MSI and MSI-like
by the 64-gene signature. Samples classified as MSI showed significantly
higher expression of TYMS than samples classified
as MSS (p < 1e-18, Student’s t-test). Samples classified as MSI-like
also showed significantly higher expression of TYMS than
MSS samples (p = 3.9e-13, Student’s t-test).
a recently published study from the Cancer Genome
Atlas (TCGA) Network that also identiﬁed a group
of patients with MSI-like features (high mutation
frequency) who were classiﬁed as MSS by traditional
methods [19]. This is clinically relevant because
these patients might be better served without adjuvant
chemotherapy if they are stage II. Additionally, these
results indicate that microsatellite instability is not
necessarily a good surrogate for dMMR in all patients.
Interestingly, in our study, the sample with the
highest mutation frequency (23.8%) is MSI-like by
gene signature but was classiﬁed MSS by standard
PCR assessment. This is again in good agreement with
observations from TCGA, in which six patients
with the highest mutation rates were classified as MSS
by standard methods. On the other hand, the single
sample that was MSI by standard methods but with
a strong MSS gene expression pattern in our set did
not have an increased mutation rate, suggesting that
this sample was incorrectly classiﬁed by standard MSI
assessment (Figure 2).
The more comprehensive identiﬁcation of MSI and
MSI-like patients might be explained by the fact that
the read-out of gene expression is a measurement of
cellular consequences of DNA MMR deficiency in colorectal
tumours, and is therefore independent of knowing
the cause of the defect. At this moment, not all
components of the MMR pathway in human cells are
known, eg the human counterparts of Escherichia coli
MutH and UvrD are not yet identiﬁed [31]. Although
the epigenetic silencing of MLH1 is often observed as
the main factor, other factors are known to play a role.
MMR defects can be caused by any genetic or epigenetic
alteration of the genes in the DNA MMR pathway.
Knock-out mouse models of Msh2, Msh3, Msh6,
Mlh1, Mlh3, Pms1, Pms2 and Exo1 all confer an MSI
phenotype [32,33]. It is therefore difﬁcult to comprehensively
measure all possible sources causing MMR
deﬁciency. Moreover, although somatic mutations in
known mismatch-repair genes might be detectable, the
mutations do not always result in microsatellite instability,
at least not in those microsatellites that are
traditionally assessed [19]. However, it is possible to
summarize the cellular consequence of DNA MMR
deﬁciency with a dominant gene expression pattern,
as with the 64-gene signature, that measures the downstream
effect. The functional annotation of the 64 genes
further supports the theory that the signature measures
an activation that is caused by MMR deﬁciency, rather
than the deficiency itself. Proteins with classical conserved
zinc-finger domains or DNA-binding domains, and proteins
associated with nucleic acid metabolic processes, were
enriched in the signature. The expression signature is
indicative of active DNA damage signalling, which in
turn leads to cell cycle arrest and apoptosis (see Supplementary
material, Table S2).
The 64-gene signature summarizes the gene expression
pattern displayed by colorectal tumours with DNA
MMR deﬁciency, regardless of the diverse causes of
this defect, and therefore might have advantages when
compared to IHC or PCR methods [9]. Using a gene
expression signature for MSI assessment might also
have technical advantages: it does not require a comparison
of DNA microsatellite regions from paired normal
and tumour tissues; in addition, the nature of a
molecular signature builds upon the read-out of a relatively
large set of genes, resulting in robust and reproducible
measurements; additionally, the MSI signature
can be read out from the same tissue biopsy and in
the same assay as other diagnostic signatures [20,21],
minimizing sample requirements and systematic errors.
It has been well established that stage II MSI
patients have better prognosis compared to patients
with functional mismatch repair [34]. Consistent with
this knowledge, we report here that tumours predicted
by the 64-gene signature as MSI showed better distant
metastasis-free survival. While the good prognosis of
MSI tumours is well documented, the value of MSI
to predict response to adjuvant chemotherapy is still
under investigation. Cell line models support the idea
that CRCs require a functional MMR system to induce
apoptosis in response to 5-FU treatment [35]. In addition,
meta-analysis of seven independent clinical studies
indicated that MSI patients do not beneﬁt from adjuvant
chemotherapy with 5-FU [12]. The mechanism of
action of 5-FU is through its metabolite dUMP, which
competes for the binding site of thymidylate synthase
(TYMS), an enzyme catalysing conversion of dUMP to
dTMP during DNA synthesis. The non-responsiveness
to 5-FU therapy in MSI patients might be related to
higher expression of TYMS in these tumours [36]. In
our dataset, we have conﬁrmed this association, as MSI
patients identiﬁed by the signature have high expression
of TYMS. MSI-like patients might present an additional
population of CRC patients that are unlikely to
respond to treatment with 5-FU.
[Figure 5 image: (A) signature index of replicate B against replicate A (-1.0 to 1.0), R² = 0.992; (B) signature index read-outs for tumour A, tumour B and tumour C.]
Figure 5. Reproducibility and precision of the 64-gene signature. (A) Replicate analysis of 53 tumour samples shows a very high correlation
in signature index. (B) Stability of the MSI signature read-out for three representative diagnostic samples across a time period of 20
consecutive days. In both panels, the classiﬁcation threshold (MSI vs MSS) is indicated by the dashed line.
To conclude, we have developed a 64-gene signature
characterizing DNA MMR deﬁciency in colorectal
tumours. This signature is technically robust and can
be used as an alternative diagnostic tool to assess
MSI status. It was implemented on a diagnostic array
and validated in both fresh-frozen and FFPE tumour
samples. The results from this test provide information
on the prognosis of colorectal cancer patients and
aid decision making for the selection of appropriate
chemotherapeutic agents.
Acknowledgements
The authors thank Mireille Snel, Frank Nieboer, Jeroen
Heijmans and Astrid Bosma for sample and microarray
processing, Christa Dreezen for support with clinical
annotation and Jeroen van den Akker and Annuska
Glas for technical support and discussion. This work
was funded in part by the European Commission 7th
Framework research programme COLTHERES (Modelling
and Predicting Sensitivity to Targeted Therapies
in Colorectal Cancers; Grant No. 259015). ST is a
senior clinical investigator of the Fund for Scientiﬁc
Research Flanders and has received research grants
from the Belgian Federation Against Cancer and Belgian
National Cancer Plan.
Author contributions
All authors were involved in writing the manuscript
and in reviewing the ﬁnal draft; ST, PR, RB and
IS conceived experiments and study design; ST, PR,
VP, MM, IA and MD performed data analysis; RS,
CS, RR, UN, WM, SB and SaT were involved in
sample collection, updating patient information and/or
generating MSI-data; and PR, VP, IA, RS, UN, SaT,
MD, RB and IS were involved in data interpretation.
Abbreviations
5-FU, 5-ﬂuorouracil; CRC, colorectal cancer; MMR,
mismatch repair; MSI, microsatellite instability; MSS,
microsatellite stability.
References
1. Warusavitarne J, Schnitzler M. The role of chemotherapy in
microsatellite unstable (MSI-H) colorectal cancer. Int J Colorect
Dis 2007; 22(7): 739–748.
2. Kinzler KW, Vogelstein B. Lessons from hereditary colorectal
cancer. Cell 1996; 87(2): 159–170.
3. Fearon ER, Vogelstein B. A genetic model for colorectal tumorigenesis.
Cell 1990; 61(5): 759–767.
4. Ionov Y, Peinado MA, Malkhosyan S, et al. Ubiquitous somatic
mutations in simple repeated sequences reveal a new mechanism
for colonic carcinogenesis. Nature 1993; 363(6429): 558–561.
5. Kane MF, Loda M, Gaida GM, et al. Methylation of the hMLH1
promoter correlates with lack of expression of hMLH1 in sporadic
colon tumors and mismatch repair-defective human tumor cell lines.
Cancer Res 1997; 57(5): 808–811.
6. Tejpar S, Bertagnolli M, Bosman F, et al. Prognostic and predictive
biomarkers in resected colon cancer: current status and future
perspectives for integrating genomics into biomarker discovery.
Oncologist 2010; 15(4): 390–404.
7. Koopman M, Kortman GA, Mekenkamp L, et al. Deﬁcient mismatch
repair system in patients with sporadic advanced colorectal
cancer. Br J Cancer 2009; 100(2): 266–273.
8. Miquel C, Jacob S, Grandjouan S, et al. Frequent alteration of DNA
damage signalling and repair pathways in human colorectal cancers
with microsatellite instability. Oncogene 2007; 26(40): 5919–5926.
9. Vilar E, Gruber SB. Microsatellite instability in colorectal cancer
– the stable evidence. Nat Rev Clin Oncol 2010; 7(3): 153–162.
10. Sargent D, Sobrero A, Grothey A, et al. Evidence for cure by
adjuvant therapy in colon cancer: observations based on individual
patient data from 20 898 patients on 18 randomized trials. J Clin
Oncol 2009; 27(6): 872–877.
11. Ribic CM, Sargent DJ, Moore MJ, et al. Tumor microsatellite-instability
status as a predictor of benefit from fluorouracil-based
adjuvant chemotherapy for colon cancer. N Engl J Med 2003;
349(3): 247–257.
12. Des GG, Schischmanoff O, Nicolas P, et al. Does microsatellite
instability predict the efﬁcacy of adjuvant chemotherapy in colorectal
cancer? A systematic review with meta-analysis. Eur J Cancer
2009; 45(10): 1890–1896.
13. Sinicrope FA, Foster NR, Thibodeau SN, et al. DNA mismatch
repair status and colon cancer recurrence and survival in clinical
trials of 5-ﬂuorouracil-based adjuvant therapy. J Natl Cancer Inst
2011; 103(11): 863–875.
14. Bertagnolli MM, Niedzwiecki D, Compton CC, et al. Microsatellite
instability predicts improved response to adjuvant therapy with
irinotecan, ﬂuorouracil, and leucovorin in stage III colon cancer:
Cancer and Leukemia Group B Protocol 89803. J Clin Oncol 2009;
27(11): 1814–1821.
15. Vilar E, Mukherjee B, Kuick R, et al. Gene expression patterns in
mismatch repair-deﬁcient colorectal cancers highlight the potential
therapeutic role of inhibitors of the phosphatidylinositol 3-kinase-AKT-mammalian
target of rapamycin pathway. Clin Cancer Res
2009; 15(8): 2829–2839.
16. Gonzalez-Garcia I, Moreno V, Navarro M, et al. Standardized
approach for microsatellite instability detection in colorectal carcinomas.
J Natl Cancer Inst 2000; 92(7): 544–549.
17. Umar A, Boland CR, Terdiman JP, et al. Revised Bethesda
Guidelines for hereditary nonpolyposis colorectal cancer (Lynch
syndrome) and microsatellite instability. J Natl Cancer Inst 2004;
96(4): 261–268.
18. Poulogiannis G, Frayling IM, Arends MJ. DNA mismatch repair
deﬁciency in sporadic colorectal cancer and Lynch syndrome.
Histopathology 2010; 56(2): 167–179.
19. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and
rectal cancer. Nature 2012; 487(7407): 330–337.
20. Jorissen RN, Lipton L, Gibbs P, et al. DNA copy-number alterations
underlie gene expression differences between microsatellite
stable and unstable colorectal cancers. Clin Cancer Res 2008;
14(24): 8061–8069.
21. Sun T, Simon I, Moreno V, et al. A combined oncogenic pathway
signature of BRAF, KRAS and PI3KCA mutation improves colorectal
cancer classiﬁcation and Cetuximab treatment prediction.
Gut 2012 Jul 14. [Epub ahead of print] PMID: 22798500
22. Salazar R, Roepman P, Capella G, et al. Gene expression signature
to improve prognosis prediction of stage II and III colorectal cancer.
J Clin Oncol 2011; 29(1): 17–24.
23. Maak M, Simon I, Nitsche U, et al. Independent validation of a
prognostic genomic signature (ColoPrint) for stage II colon cancer
patients. Ann Surg 2012 (in press).
24. Van CE, Labianca R, Bodoky G, et al. Randomized phase III
trial comparing biweekly infusional ﬂuorouracil/leucovorin alone or
with irinotecan in the adjuvant treatment of stage III colon cancer:
PETACC-3. J Clin Oncol 2009; 27(19): 3117–3125.
25. Popovici V, Budinska E, Tejpar S, et al. Identification of a poor-prognosis
BRAF-mutant-like population of patients with colon
cancer. J Clin Oncol 2012; 30(12): 1288–1295.
26. Roth AD, Tejpar S, Delorenzi M, et al. Prognostic role of KRAS
and BRAF in stage II and III resected colon cancer: results of
the translational study on the PETACC-3, EORTC 40993, SAKK
60–00 trial. J Clin Oncol 2010; 28(3): 466–474.
27. Nardon E, Glavac D, Benhattar J, et al. A multicenter study
to validate the reproducibility of MSI testing with a panel of
ﬁve quasimonomorphic mononucleotide repeats. Diagn Mol Pathol
2010; 19(4): 236–242.
28. Roepman P, Jassem J, Smit EF, et al. An immune response
enriched 72-gene prognostic proﬁle for early-stage non-small-cell
lung cancer. Clin Cancer Res 2009; 15(1): 284–290.
29. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources.
Nat Protoc 2009; 4(1): 44–57.
30. Krishna SS, Majumdar I, Grishin NV. Structural classiﬁcation of
zinc ﬁngers: survey and summary. Nucleic Acids Res 2003; 31(2):
532–550.
31. Li GM. Mechanisms and functions of DNA mismatch repair. Cell
Res 2008; 18(1): 85–98.
32. Wei K, Kucherlapati R, Edelmann W. Mouse models for human
DNA mismatch-repair gene defects. Trends Mol Med 2002; 8(7):
346–353.
33. Prolla TA, Baker SM, Harris AC, et al. Tumour susceptibility and
spontaneous mutation in mice deﬁcient in Mlh1, Pms1 and Pms2
DNA mismatch repair. Nat Genet 1998; 18(3): 276–279.
34. Popat S, Hubner R, Houlston RS. Systematic review of microsatellite
instability and colorectal cancer prognosis. J Clin Oncol 2005;
23(3): 609–618.
35. Arnold CN, Goel A, Boland CR. Role of hMLH1 promoter
hypermethylation in drug resistance to 5-ﬂuorouracil in colorectal
cancer cell lines. Int J Cancer 2003; 106(1): 66–73.
36. Ricciardiello L, Ceccarelli C, Angiolini G, et al. High thymidylate
synthase expression in colorectal cancer with microsatellite instability:
implications for chemotherapeutic strategies. Clin Cancer Res
2005; 11(11): 4234–4240.
SUPPORTING INFORMATION ON THE INTERNET
The following supporting information may be found in the online version of this article:
Supplementary methods
Table S1. Genes in the MSI-64 gene signature
Table S2. Functional annotation: functional category analysis of the 64 genes by DAVID software
14 Expression profiling with RNA from formalin-fixed,
paraffin-embedded material
• BMC Medical Genomics, 1(9), 2008
• IF: 2.848
• number of citations: 36
• personal contribution (30%): design of genomic signatures and
scores, statistical analyses, manuscript writing
BioMed Central
BMC Medical Genomics
Open Access
Technical advance
Expression profiling with RNA from formalin-fixed,
paraffin-embedded material
Andrea Oberli†1, Vlad Popovici†2, Mauro Delorenzi2,3, Anna Baltzer1,
Janine Antonov1, Sybille Matthey1, Stefan Aebi1, Hans Jörg Altermatt4 and
Rolf Jaggi*1
Address: 1Department of Clinical Research, University of Bern, Murtenstrasse 35 CH-3010 Bern, Switzerland, 2Swiss Institute of Bioinformatics
(SIB), CH-1015 Lausanne, Switzerland, 3National Center of Competence in Research (NCCR) Molecular Oncology, Swiss Institute for
Experimental Cancer Research (ISREC), Epalinges, Switzerland and 4Pathology Länggasse, Forstweg 56, CH-3012 Bern, Switzerland
Email: Andrea Oberli - andrea.oberli@dkf.unibe.ch; Vlad Popovici - vlad.popovici@isb-sib.ch;
Mauro Delorenzi - mauro.delorenzi@isrec.unil.ch; Anna Baltzer - anna.baltzer@dkf.unibe.ch; Janine Antonov - janine.antonov@dkf.unibe.ch;
Sybille Matthey - sybille.matthey@dkf.unibe.ch; Stefan Aebi - stefan.aebi@insel.ch; Hans Jörg Altermatt - altermatt@patholaenggasse.ch;
Rolf Jaggi* - rolf.jaggi@dkf.unibe.ch
* Corresponding author †Equal contributors
Abstract
Background: Molecular characterization of breast and other cancers by gene expression profiling
has corroborated existing classifications and revealed novel subtypes. Most profiling studies are
based on fresh frozen (FF) tumor material which is available only for a limited number of samples
while thousands of tumor samples exist as formalin-fixed, paraffin-embedded (FFPE) blocks.
Unfortunately, RNA derived from FFPE material is fragmented and chemically modified, impairing
expression measurements by standard procedures. Robust protocols for isolation of RNA from
FFPE material suitable for stable and reproducible measurement of gene expression (e.g. by
quantitative reverse transcriptase PCR, QPCR) remain a major challenge.
Results: We present a simple procedure for RNA isolation from FFPE material of diagnostic
samples. The RNA is suitable for expression measurement by QPCR when used in combination
with an optimized cDNA synthesis protocol and TaqMan assays specific for short amplicons. The
FFPE-derived RNA was compared to intact RNA isolated from the same tumors. Preliminary scores
were computed from genes related to the ER response, HER2 signaling and proliferation.
Correlation coefficients between intact and partially fragmented RNA from FFPE material were
0.83 to 0.97.
Conclusion: We developed a simple and robust method for isolating RNA from FFPE material.
The RNA can be used for gene expression profiling. Expression measurements from several genes
can be combined into robust scores representing the hormonal or the proliferation status of the
tumor.
Published: 19 April 2008
BMC Medical Genomics 2008, 1:9 doi:10.1186/1755-8794-1-9
Received: 17 October 2007
Accepted: 19 April 2008
This article is available from: http://www.biomedcentral.com/1755-8794/1/9
© 2008 Oberli et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Breast cancer has been widely studied in the past and
molecular characterization has increased the understanding
of biological pathways that are altered during neoplastic
transformation of cells [1-4]. However, the findings
based on molecular profiling have not yet altered diagnosis,
and decisions about treatment still rely mostly on histopathological
and immunohistochemical techniques
which are at best semi-quantitative [5,6]. Currently, many
patients with primary, non-metastatic breast cancer with
positive estrogen receptor (ER) status undergo several
cycles of chemotherapy, although a substantial proportion
of them does not benefit from it. For many patients, no conventional
parameters currently exist that identify
individuals who will benefit from chemotherapy.
Personalized diagnosis on the basis of highly specific
molecular analyses has the potential to improve the situation
of many patients by optimizing medication, and at
the same time, sparing others from unnecessary treatment
regimens.
DNA chip studies are based on measuring gene expression
for many genes in parallel [1,4,7,8]. Most protocols for
gene expression analysis on the basis of DNA chips are
sensitive to RNA degradation and RNA must be isolated
from freshly prepared or FF tumor material. As a consequence,
material is fairly limited and often originates
from convenience samples of heterogeneous patients.
Many of these studies including meta-analyses have
revealed genes and biological functions of their products
which are relevant for classification and prognosis [9,10].
However, many samples were derived from patients who
did not participate in clinical studies and their treatment
regimens were not standardized. Therefore, follow up
data must still be interpreted with caution.
Obviously, procedures based on formalin-fixed, paraffin-embedded
(FFPE) material would greatly facilitate and
speed up research in this area as large amounts of highly
valuable material and clinical data have already been collected.
In many cases, FFPE blocks are still available and
they could be used for a molecular analysis. Especially
material from clinical trials would allow investigating distinct
clinical questions with existing material rather than
material from newly designed studies.
Many efforts are currently made to individualize diagnosis
of breast cancer by including molecular parameters into
diagnosis. Fresh frozen material would obviously be ideal
for a molecular analysis by gene expression measurements
but it may be difficult to implement novel procedures
which complicate current workflows of daily routine. Procedures
based on FFPE material would be more feasible as
they do not interfere with current protocols and they do
not affect routine diagnosis as material for molecular
analysis could be collected after standard diagnosis has
been completed. Only relatively few molecular
approaches have been described which are based on FFPE
material. For example, Paik and co-workers have established
a recurrence score (RS, Oncotype DX) that
quantifies the likelihood of distant recurrence and predicts
the magnitude of chemotherapy benefit [11,12].
It is generally accepted that molecular profiles that
primarily reflect biological characteristics of tumor cells
may complement clinical and histopathological diagnosis,
resulting in a more detailed characterization of individual
tumors, a pre-requisite for better treatment
decisions. In this study we present the development of a
novel procedure for RNA isolation from FFPE material
and an optimized workflow for expression measurements
by QPCR.
Methods
Human breast cancer samples
Human breast cancer specimens were divided into two
aliquots, one of which was processed for histological diagnosis
by fixation with formalin and embedding in paraffin.
FFPE material was obtained from the Institute of
Pathology (University of Bern) and the Pathology Länggasse,
Bern. Tissue (3–5 mm thick slices of tumor) was
fixed overnight in buffered formalin and processed for
paraffin embedding in a Tissue Processing Center TPC 15
(Medite Medizintechnik, Germany). The second aliquot
was frozen on dry ice and stored at -80°C. Fresh frozen
material was obtained from the Tumorbank Bern. Both
FF and FFPE samples were checked by hematoxylin and
eosin staining, and only samples with more than 50%
tumor cells were used for this study. Informed consent
to use the material for research was obtained from all
patients.
RNA Extraction
Intact RNA was isolated from four 25 μm thick cryo-sections
of approximately 0.5 cm². The tissue was homogenized
in 420 μl lysis buffer (4 M guanidinium
thiocyanate, 30 mM Tris pH 8.0, 1% Triton-X-100)
using a TissueLyser (Mixer Mill, Retsch GmbH, Haan, Germany)
at 15 Hz for 3 min. Total RNA was bound to silica-based
columns (Epoch Biolabs, Houston, Texas), treated
with DNase I (30 Kunitz units for 20 min. at room temperature;
Qiagen, Hilden, Germany), washed once with
lysis buffer (containing 30% ethanol) and once with 20
mM NaCl (containing 20% ethanol) and eluted in 50 μl
10 mM Tris pH 7.4, 0.1 mM EDTA and stored at -20°C.
RNA quantity was measured on an ND-1000 spectrophotometer
(NanoDrop Technologies, Wilmington, DE) and
quality assessed by capillary electrophoresis with an Agilent
2100 Bioanalyzer (Agilent Technologies, Inc., Santa
Clara, CA) using Agilent RNA 6000 Series Nano kits.
RNA was isolated from ten 10 μm thick FFPE sections
according to the RNeasy FFPE protocol of Qiagen (Fig. 1,
lanes B), the ncLysis protocol of Applied Biosystems
(lanes C) and the protocol developed in our laboratory
(lanes D). Paraffin sections were de-paraffinized with
xylene, washed with ethanol and dried in a speed vac. For
our own protocol, 200 μl lysis buffer (4 M guanidinium
thiocyanate, 30 mM Tris, pH 8.0, 1% Triton-X-100) was
added to the dried sections and immediately homogenized
in a Mixer Mill at 20 Hz for 4 min. Proteinase K
(Roche Diagnostics, Mannheim, Germany) was added (1
mg/ml final concentration) and tissue was digested for 1
hour at 55°C. One milliliter dilution buffer (30 mM Tris,
pH 8.0, 1% Triton-X-100) was added to each lysate and
digestion continued for 1 hr after adding fresh proteinase
K (final concentration 1 mg/ml). RNA was de-modified by
adding 318 μl of de-modification solution (5 M NH4Cl)
and incubating at 94°C for 20 min or as described in the
text. RNA was bound to silica-based columns and digested
with DNase I as described for fresh-frozen tissue samples.
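For orientation, the approximate reagent concentrations implied by the volumes above can be worked out with the usual dilution relation C1·V1 = C2·V2. The sketch below is an illustration only: it assumes that volumes are additive and that the volumes of proteinase K and of the tissue itself are negligible, neither of which is stated in the protocol, and the function name is hypothetical.

```python
def diluted_concentration(stock_molar, stock_volume_ul, total_volume_ul):
    """Concentration after dilution, from C1 * V1 = C2 * V2."""
    return stock_molar * stock_volume_ul / total_volume_ul


# Volumes taken from the protocol text (in microlitres).
LYSIS_UL = 200.0      # lysis buffer, 4 M guanidinium thiocyanate
DILUTION_UL = 1000.0  # dilution buffer, no guanidinium
DEMOD_UL = 318.0      # de-modification solution, 5 M NH4Cl

# Guanidinium thiocyanate after adding the dilution buffer.
guanidinium_digest = diluted_concentration(4.0, LYSIS_UL, LYSIS_UL + DILUTION_UL)

# Concentrations during the de-modification step.
total_ul = LYSIS_UL + DILUTION_UL + DEMOD_UL
nh4cl_final = diluted_concentration(5.0, DEMOD_UL, total_ul)
guanidinium_final = diluted_concentration(4.0, LYSIS_UL, total_ul)

print(f"guanidinium during second digestion: {guanidinium_digest:.2f} M")
print(f"NH4Cl during de-modification: {nh4cl_final:.2f} M")
print(f"guanidinium during de-modification: {guanidinium_final:.2f} M")
```

Under these assumptions the second digestion runs at roughly 0.67 M guanidinium thiocyanate and the de-modification step at roughly 1 M NH4Cl; the true values depend on the reagent volumes that the sketch ignores.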
The reproducibility of our own procedure was tested by
isolating several independent RNAs from consecutive sections
of the same tissue block. About 10 μg of total RNA
could be isolated from 5 to 10 FFPE sections (0.5–1 cm² per
section). RNA was isolated from closely matched sections
using the RNeasy FFPE kit (Qiagen) or the ncLysis system
(Applied Biosystems) according to the protocols included
with the kits. In both cases, the RNA was purified on silica-based
columns. Twenty-two samples were available. In 14 cases, sufficient
RNA was obtained from all 4 parallel isolations. In
2 cases of FF material (samples 4 and 11) and in 6 cases of
FFPE material (samples 1, 5, 7, 9, 12 and 21) less than 1.5
μg RNA could be isolated with the ncLysis protocol. These
samples were excluded from further analysis.
cDNA synthesis and QPCR
Aliquots of 100 to 500 ng of total RNA were reverse transcribed
using MultiScribe™ MuLV reverse transcriptase
(High-Capacity cDNA Archive Kit; Applied Biosystems,
Foster City, CA, USA) and random or gene-specific primers.
Reverse primers were kindly provided by Applied Biosystems
and were used at 1 μM each. cDNAs were made
in the presence of 3, 10 or 22 reverse primers as 3-plex, 10-plex
or 22-plex reactions, respectively. Regular Assays on Demand
(Applied Biosystems) were used for QPCR (Table 1).
Manually designed assays coding for short, medium-size
and long amplicons of the insulin growth factor-binding
protein 5 (IGBP5) were selected with Primer Express (Version
3, Applied Biosystems). Forward primer and probe
were kept constant for all assays while reverse primers
were selected such that amplicons of different sizes were
generated [13]. QPCR reactions were carried out in triplicate
in a final volume of 10 μl in 1× FAST Master mix
(Applied Biosystems) and cDNA corresponding to 4 ng
total RNA. QPCR was performed on an ABI 7500 FAST
instrument (2 min at 95°C, followed by 45 cycles of 95°C
for 3 sec and 60°C for 30 sec). The quality of the assays
and the absence of contaminating DNA were assessed
with water and RNA instead of cDNA, respectively (data
not shown). Three positive controls containing cDNA
derived from ZR-75-1 cells were included on each 96-well
plate. Cycle threshold values (Ct) were determined using
the SDS software of the 7500 FAST System (Version
1.3.1). Constant threshold values were set for each gene
throughout the study.
Figure 1. RNA isolation and characterization. Total RNA was
isolated from cryo-sections (lanes A) and from paraffin sections
according to the RNeasy FFPE protocol of Qiagen
(lanes B), the ncLysis protocol of Applied Biosystems (lanes
C) or according to our own protocol (lanes D). Aliquots of
each RNA were separated by capillary electrophoresis (Agilent
Bioanalyzer) on Nano chips along with RNA ladder (L;
Ambion). Shown are RNAs from two representative tumors
(Tu#10 and #18).
Data processing and determination of breast cancer
classification scores
All the measured cycle threshold (Ct) values represent
log2 expression levels. These values need to be normalized
such that they are comparable across samples and suitable
for generating scores. For a gene, a large Ct value corresponds
to a low expression level, so the first processing
step needed was to reverse the sense of this relationship by
letting
Ct' = max(cut_off - Ct, 0)
be the new value for each measured gene. The cut off value
was set empirically to 35.0 as any higher raw Ct value was
deemed unreliable. This cut off was fixed a priori and kept
constant throughout all the experiments reported here.
Then, the final value of each target gene was taken to be
ΔCt = max_val*(Ct' - R + cut_off)/(2*cut_off),
where R represents the reference value and was taken as
the mean of Ct' values of 5 selected reference genes
(GAPDH, GUSB, RPLP0, TFRC, UBB, see Results section
for details). The approach guarantees that all ΔCt values
are positive and upper bounded by max_val (set to 33 for
all the results reported here).
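The two transformation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the dictionary layout and the example Ct values are hypothetical, while cut_off = 35.0, max_val = 33 and the five reference genes are taken from the text.

```python
from statistics import mean

CUT_OFF = 35.0   # raw Ct values above this are deemed unreliable (from the text)
MAX_VAL = 33.0   # upper bound of the rescaled values (from the text)
REFERENCE_GENES = ("GAPDH", "GUSB", "RPLP0", "TFRC", "UBB")


def invert_ct(ct, cut_off=CUT_OFF):
    """Ct' = max(cut_off - Ct, 0): larger values now mean higher expression."""
    return max(cut_off - ct, 0.0)


def delta_ct(raw_ct, cut_off=CUT_OFF, max_val=MAX_VAL):
    """Map the raw Ct values {gene: Ct} of one sample to normalized values.

    Returns {gene: dCt} for every gene that is not a reference gene.
    """
    inverted = {gene: invert_ct(ct, cut_off) for gene, ct in raw_ct.items()}
    # R: mean Ct' of the five reference genes in this sample.
    r = mean(inverted[g] for g in REFERENCE_GENES)
    return {
        gene: max_val * (value - r + cut_off) / (2.0 * cut_off)
        for gene, value in inverted.items()
        if gene not in REFERENCE_GENES
    }


# Hypothetical raw Ct values for one sample (illustrative numbers only).
sample = {
    "GAPDH": 22.0, "GUSB": 27.0, "RPLP0": 21.0, "TFRC": 26.0, "UBB": 24.0,
    "ESR1": 25.0, "MKI67": 36.5,
}
normalized = delta_ct(sample)
```

Because both Ct' and R lie between 0 and cut_off, the term (Ct' - R + cut_off) lies between 0 and 2*cut_off, which is why the result stays between 0 and max_val.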
We used the scores associated with three of the gene
groups listed in Table 1: the ER, HER2 and Proliferation
group. While for the HER2 and Proliferation groups the
scores were taken as the average ΔCt value of the genes in
the group, for the ER group more weight was given to the
ESR1 gene:
ER_score = 0.55*ESR1 + 0.15*(BCL2 + CEGP1 + PGR)
where the gene symbols stand for the corresponding ΔCt
values.
Finally, for each tumor a Total score was computed as
Total_score = (Proliferation_score + HER2_score - ER_score
+ max_val)/3
The Total score, together with the group scores as computed
above, are used in all subsequent discussions.
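As an illustration, the score definitions can be written as a short Python function. This is a sketch under assumptions: the input is a hypothetical dictionary of ΔCt values per gene, the group memberships follow Table 1, and the ER score is assumed to enter the Total score with a negative sign, in line with the statement in the Discussion that low expression of ER-related genes indicates higher risk.

```python
from statistics import mean

MAX_VAL = 33.0  # same constant as in the normalization step (from the text)

# Group memberships as listed in Table 1.
HER2_GENES = ("HER2", "GRB7")
PROLIFERATION_GENES = ("CCNB1", "MKI67", "MYBL2", "STK15", "SURV")


def module_scores(dct, max_val=MAX_VAL):
    """Compute ER, HER2, Proliferation and Total scores from {gene: dCt}."""
    # ER score: ESR1 is weighted more heavily than the other three genes.
    er = 0.55 * dct["ESR1"] + 0.15 * (dct["BCL2"] + dct["CEGP1"] + dct["PGR"])
    # HER2 and Proliferation scores: plain averages over the group.
    her2 = mean(dct[g] for g in HER2_GENES)
    prolif = mean(dct[g] for g in PROLIFERATION_GENES)
    # Total score: ER is subtracted (assumption, see lead-in); adding max_val
    # keeps the result non-negative.
    total = (prolif + her2 - er + max_val) / 3.0
    return {"ER": er, "HER2": her2, "Proliferation": prolif, "Total": total}


# Hypothetical sample in which every gene has the same normalized value.
genes = ("ESR1", "BCL2", "CEGP1", "PGR") + HER2_GENES + PROLIFERATION_GENES
flat = {g: 16.0 for g in genes}
flat_scores = module_scores(flat)

# Raising only ESR1 increases the ER score and lowers the Total score.
high_er = dict(flat, ESR1=20.0)
high_er_scores = module_scores(high_er)
```

The ER weights sum to one (0.55 + 3*0.15), so all three module scores remain on the same scale as the individual ΔCt values.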
Table 1: QPCR assays. QPCR assays (Assays on Demand) were from Applied Biosystems (Palo Alto, CA). Reverse primers from each
assay were used for the synthesis of gene-specific cDNAs. They were provided separately by Applied Biosystems. Three assays
(IGBP5_short, IGBP5_medium, IGBP5_long) were designed manually.
AoD_assay_ID Gene Acc_Nr AmpliconSize_bp Module
Hs00608023_m1 BCL2 NM_000633 81 Estrogen
Hs00221277_m1 CEGP1 NM_020974 64 Estrogen
Hs00174860_m1 ESR1 NM_000125 62 Estrogen
Hs00172183_m1 PGR NM_000926 118 Estrogen
Hs00180450_m1 GRB7 NM_005310 70 Her2
Hs01001598_g1 HER2 NM_004448 55 Her2
Hs00952036_m1 CTSL2 NM_001333 72 Invasion
Hs00171829_m1 STMY3 NM_005940 66 Invasion
Hs01030097_m1 CCNB1 NM_031966 66 Proliferation
Hs01032443_m1 MKI67 NM_002417 66 Proliferation
Hs00231158_m1 MYBL2 NM_002466 81 Proliferation
Hs00269212_m1 STK15 NM_003600 85 Proliferation
Hs00153353_m1 SURV NM_001168 93 Proliferation
Hs99999903_m1 ACTB NM_001101 171 Reference
Hs00266705_g1 GAPDH NM_002046 74 Reference
Hs99999908_m1 GUSB NM_000181 81 Reference
Hs99999902_m1 RPLP0 NM_001002 105 Reference
Hs00174609_m1 TFRC NM_003234 79 Reference
Hs00430290_m1 UBB NM_018955 120 Reference
Hs01630490_s1 RPL7A BX641050 84 Reference
Hs00817975_g1 RPS11 NM_001015 168 Reference
Hs01922548_s1 RPS23 NM_001025 90 Reference
Hs00185390_m1 BAG1 NM_004323 58
Hs00154355_m1 CD68 NM_001251 68
Hs01383449_s1 GSTM1 AY532925 65
(own design) IGBP5_short NM_000599 60 Test
(own design) IGBP5_medium NM_000599 109 Test
(own design) IGBP5_long NM_000599 147 Test
Results
Isolation of RNA from FFPE material
Total RNA was isolated from FF human breast cancer specimens,
which resulted in intact RNA in all samples (Fig. 1,
lanes A, shown are RNAs from two representative tumors
from a series of 14 tumors). RNA from FF tissue was used
as reference for partially fragmented RNA isolated from
FFPE material of the same tumors. RNA was assessed by
capillary electrophoresis. The size distribution of RNA isolated
according to our own protocol was in the range of
200 to 1000 nucleotides (Fig. 1, lanes D), while the majority
of RNA fragments were around 100 nucleotides long
when RNA was isolated according to RNeasy FFPE (lanes
B) or the ncLysis system (lanes C). Gene expression was
measured by QPCR using 25 commercially available and
three own TaqMan assays [13] (Tab. 1). The cycle threshold
values (Ct values) were determined from RNAs isolated
according to one of the three protocols for FFPE
material and compared to Cts obtained with intact RNA of
the same tumors. Fig. 2 shows correlation coefficients
between intact RNA (A) and FFPE-derived RNAs isolated
according to the RNeasy FFPE protocol (A vs B), the ncLysis
system (A vs C) or our own protocol (A vs D) for all
14 tumors using the expression levels of 5 genes (GAPDH,
GUSB, RPLP0, TFRC, UBB; see below). The cDNAs were
made in the presence of random (white boxes) or gene-specific
primers (gray boxes). Correlation coefficients
between intact and partially fragmented RNA were clearly
higher with gene-specific primers than with random primers,
and RNA isolated according to our own protocol yielded
cDNA that performed better in QPCR than cDNA
made from RNA isolated according to the RNeasy FFPE and
ncLysis protocols.
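The per-tumor comparison described above amounts to a Pearson correlation between matched Ct values. The following Python sketch shows the computation for one tumor; the function name and the Ct values are hypothetical and serve only as an illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two matched sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical Cts of the five reference genes in one tumor
# (order: GAPDH, GUSB, RPLP0, TFRC, UBB).
ct_ff = [20.0, 25.0, 19.0, 24.0, 22.0]
# FFPE-derived Cts shifted by a constant 3 cycles relative to intact RNA.
ct_ffpe = [c + 3.0 for c in ct_ff]
r_shifted = pearson(ct_ff, ct_ffpe)
```

A constant shift of all Cts leaves the correlation at 1, so this measure captures how well the relative expression pattern is preserved, not the absolute loss of signal in FFPE-derived RNA.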
Parameters affecting the RNA quality and QPCR
Several parameters were systematically optimized to
improve the protocol for RNA isolation from FFPE-derived
sections. For example, QPCR performed in the presence
of primers specific for large amplicons (Fig. 3, dashed
line) is very sensitive to RNA fragmentation and modification,
resulting in higher Ct values than primers specific for
medium-size amplicons (dotted line) or short amplicons
(solid line). In addition, the effect of de-modification
of FFPE-derived RNA is apparent: the Ct determined
from de-modified RNA is 3 or more units lower
than the Ct measured from the same RNA without de-modification.
The effect was consistently observed with
several tumors and also when expression was measured
Figure 2: Comparison of RNAs isolated according to different protocols. RNA was reverse transcribed in the presence of random
primers (white boxes) or gene-specific primers (hatched boxes). Gene expression was measured from an equivalent of 4
ng of RNA by QPCR for five reference genes (GAPDH, GUSB, RPLP0, TFRC and UBB). Pearson correlations were computed
between matched Cts for the five reference genes and each tumor RNA isolated from FF (A) and FFPE material. Shown are
correlations between intact RNA and RNA isolated from FFPE material according to the RNeasy FFPE protocol (A versus B),
intact RNA and RNA isolated from FFPE material according to the ncLysis system (A versus C) and intact RNA and RNA isolated
according to our own protocol (A versus D).
with TaqMan assays from Applied Biosystems (data not
shown). The optimum time of demodification was 20
min; longer times led to higher Ct values (not shown).
The different protocols of RNA isolation from FFPE material
were further compared by measuring expression levels
of reference genes in the 14 tumors and by comparing the
results to Cts generated from corresponding intact RNAs
(Fig. 4). Experimental variation was reduced by comparing
mean Ct values from 5 reference genes (GAPDH,
GUSB, RPLP0, TFRC, UBB) instead of their single values.
Mean Cts of the five reference genes were plotted for each
tumor and each protocol (panel A) and their distribution
summarized (panel B). As expected, intact RNA gave the lowest and
most stable Cts (diamonds). RNA prepared from FFPE tissue
according to our own protocol (circles) resulted in higher
but fairly constant Ct values (compare diamonds and circles).
RNA isolated according to the RNeasy FFPE protocol
(squares) and the ncLysis protocol (triangles) resulted in
Ct values that were not only much higher than with intact
RNA but also varied widely among different isolates when
compared to the corresponding Cts based on intact RNA. This
result suggests a generally poorer and more variable quality
of RNA isolated according to the two commercial protocols
than according to our own protocol, leading to relatively
large variations of Cts for the 5 reference genes among the
different tumors. The Ct values generated from RNA isolated
according to our own protocol were on average 2.9 units
higher than Cts from intact RNA. Cts from RNA isolated
according to RNeasy FFPE and ncLysis were 7.6 and 5.8 units
higher than Cts from intact RNA of the same tumors,
respectively (Fig. 4B). Standard deviations
of Cts for the 14 tumors were 0.45 for intact RNA,
Figure 3: De-modification of RNA results in higher efficiency during subsequent QPCR. RNA was isolated from FFPE material
according to our own protocol and compared to intact RNA derived from FF tissue. RNA samples were reverse transcribed
without previous de-modification (labeled "no") or after de-modification at room temperature (1), 94°C and pH 8.0 (2) or
94°C and pH 5.0 (3). Each RNA was tested by QPCR using three amplicons for IGBP5. The primers used yield short (60 bp, squares),
medium-size (109 bp) or long amplicons (147 bp, triangles). Shown are raw Ct values from intact RNAs from FF material and
from RNAs derived from FFPE material of the same tumors. The benefit of de-modification is visualized as delta Ct values. They
are indicated for short and long amplicons (dotted lines).
and 4.21, 2.69 and 1.01 for FFPE-derived RNA isolated
according to the RNeasy FFPE, ncLysis and our own protocol,
respectively.
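The summary statistics quoted above (mean Ct offset relative to intact RNA and standard deviation across tumors) can be computed as in the following Python sketch. The input values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_protocol(ff_means, ffpe_means):
    """Compare per-tumor mean reference-gene Cts of FFPE RNA with intact RNA.

    Both lists hold one value per tumor, in the same tumor order.
    """
    shifts = [ffpe - ff for ff, ffpe in zip(ff_means, ffpe_means)]
    return {
        "mean_shift": mean(shifts),     # average Ct increase vs. intact RNA
        "sd_ff": stdev(ff_means),       # spread across tumors, intact RNA
        "sd_ffpe": stdev(ffpe_means),   # spread across tumors, FFPE RNA
    }


# Hypothetical per-tumor means of the five reference genes (three tumors).
ff = [22.0, 22.5, 23.0]
ffpe = [25.0, 25.5, 25.8]
summary = summarize_protocol(ff, ffpe)
```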
An important aspect when working with RNA from FFPE
material relates to the reproducibility of the RNA isolation
procedure. This was directly tested for our own protocol
by isolating independent samples of RNA from closely
matched FFPE sections of the same tissue block and measuring
gene expression by QPCR from both RNAs (Fig. 5A
and 5B showing two representative examples). RNAs were
also isolated from two independent tumors from the
same patient, resulting in a third panel of data sets (C).
Data points are shown as polygonal diagrams of raw Cts
for each gene measured. Horizontal, parallel lines indicate
closely similar expression, crossing lines indicate discrepancies
between two measurements in matched samples.
The Pearson correlation of raw Cts between matched samples
was 0.99 for replicates shown in panels A and B and
0.74 for results shown in panel C.
Normalization
Results generated in the presence of partially fragmented
RNA cannot be directly aligned with results produced
from intact RNA and a suitable normalization is required
to eliminate or reduce the effects of fragmentation and
residual modification in RNA from FFPE material. Nine
putative reference genes were selected from the literature
[14] and from microarray results [15]. Expression was
measured from intact and FFPE-derived RNA and raw Cts
from all the 14 tumors are plotted for each putative reference
gene (Fig. 6). Analyses based on intact RNAs revealed
that 8 of the 9 tested genes performed similarly well
(panel A). RPS23, which was hardly measurable (mean Ct
in intact RNA > 37), was characterized by a large variation
between the different tumors. A slightly higher variation
was observed when expression levels were compared for
FFPE-derived RNAs (panel B): GAPDH, GUSB, RPLP0,
TFRC, RPL7A and UBB showed a similar performance and
small variations between the 14 tumors as was seen with
intact RNA. In contrast, the Ct values with RNA from FFPE
material revealed larger variations for ACTB and RPS11
and therefore, the two genes were excluded as reference
genes. The ACTB and RPS11 amplicons are larger than
amplicons for the other reference genes and also for the
test genes (Tab. 1, see also Fig. 3). Five genes were used as
reference genes: GAPDH, GUSB, RPLP0, TFRC and UBB.
For comparison, raw Ct values are shown for 4 genes
related to the ER response (BCL2, CEGP1, ESR1, PGR)
(Fig. 6, left). As expected, a high variation was observed
for these genes between the 14 tumors. Protocols B and C
did not yield enough usable data and were therefore excluded
from further analysis. For example, protocol B did not
have data for all the reference genes and for protocol C
Figure 4: Comparison of RNA isolation methods. Shown are the means of raw Cts of five reference genes (GAPDH, GUSB, RPLP0,
TFRC, UBB) for intact RNA (diamonds, FF) and for RNA isolated from matched FFPE material according to the protocols of Qiagen
(squares, Q), Applied Biosystems (triangles, AB) and our own (circles, own). Individual mean Cts of the 14 tumors and summarized box plots of
Cts are shown in panel A and panel B, respectively. Tumors are aligned according to increasing Ct in FFPE-derived RNA (Qiagen
protocol).
several test genes could not be measured reliably (e.g.
BCL2, PGR of the ER group).
RNAs isolated from FFPE material according to our own
protocol were also compared in a different way to RNA derived
from cryopreserved material of the same tumors. The
arithmetic mean of the five reference genes (GAPDH,
GUSB, RPLP0, TFRC and UBB) was used for normalizing
expression values of all the genes in each RNA. Normalized
expression values were compared between intact and
FFPE-derived RNA for each gene and each tumor [see
Additional File 1]. Good conservation of inter-tumor differences
was observed between cryopreserved and FFPE
samples for most genes.
Module scores
Normalized expression values were also used to compute
scores representing ER-related genes (ESR1, PGR, BCL2,
CEGP1), HER2-related genes (HER2 and GRB7), genes
related to proliferation (STK15/AURKA, CCNB1, MYBL2,
MKI67, BIRC5/SURV) and a Total score representing all
the genes of the three scores (for details see Methods).
Computing biologically meaningful scores from multiple
genes instead of relying on just one aims to
reduce noise. Module scores and Total scores
were computed separately from normalized expression
values of intact RNAs (circles) and of RNAs isolated
according to our own protocol (triangles) and Total scores
are depicted separately for each tumor (Fig. 7). The figure
demonstrates that similar values are obtained for each
tumor irrespective of whether they are computed from
intact RNA or from RNA derived from FFPE material. This
suggests that scores can be computed with RNA from FF
samples as well as with RNA from FFPE samples. ER and
HER2 scores were visualized in scatter plots, where the ER
and HER2 scores were represented on the x- and y-axis,
respectively (Fig. 8A and 8B). The three
immunohistochemically ER-negative tumors (#15, #18, #20)
had low ER scores, and the only immunohistochemically
HER2-positive tumor (#6) among the 14
tested tumors had a high HER2 score and an intermediate
ER score (see also Table 2). The remaining tumors were all
ER positive as assessed by immunohistochemistry (IHC)
and they had relatively high ER scores. ER-negative and
HER2-positive tumors all had high Proliferation scores
(visualized by the red color of the dots). A larger spectrum
of Proliferation scores (from blue to red) was found for ER
positive tumors. Similar distributions were found when
Figure 5: Reproducibility of RNA isolation from FFPE material. The RNAs were isolated from paraffin blocks according to our
own protocol. BM33 and BM36 (panel A) are two separate RNAs isolated from tissue block "BM", D33 and D36 are RNAs isolated
from block "D" (panel B). For comparison, 45T and 56T originate from two distinct tumors isolated from one patient
(panel C). Gene expression was measured by QPCR for 24 genes and raw Ct values are shown for each gene measured from
the two matching RNAs.
Figure 6: Stability of reference gene expression in RNA isolated from FF and FFPE material. Raw Cts are shown for 9 putative
reference genes (ACTB, GAPDH, GUSB, RPLP0, TFRC, RPL7A, RPS11, RPS23 and UBB). Results based on intact RNA
derived from FF material (A) and based on RNA isolated according to our own protocol from FFPE material (B) are depicted for
all the 14 tumors. The Ct values for 4 ER-related genes (BCL2, CEGP1, ESR1 and PGR) are shown for comparison (left).
scores were computed from intact RNA (Fig. 8A) and
FFPE-derived RNA that was isolated according to our own
protocol (B). A different presentation of scores is shown
where ER, HER2, Proliferation and Total Scores are plotted
separately for each tumor [see Additional file 2]. The
scores determined from the 14 FF and FFPE-derived samples
are in the same range and only a few tumors were ranked
in a different order between intact and FFPE-derived
RNAs (leading to crossing lines).
The similarity between the results generated from intact
and partially fragmented RNA was also assessed by calculating
Pearson correlation coefficients between the scores
of both RNAs. Correlation coefficients (with corresponding
p-values and 95% confidence intervals) were 0.966 (p
= 2.071*10^-8, CI = 0.893 to 0.989), 0.856 (p = 9.32*10^-5,
CI = 0.597 to 0.954) and 0.833 (p = 2.177*10^-4, CI = 0.541 to
0.946) for ER, HER2 and Proliferation scores, respectively.
The corresponding Spearman correlations were 0.938 (p <
2.2*10^-16), 0.851 (p = 1.167*10^-4) and 0.867 (p =
2.048*10^-5), respectively.
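The text does not state how the confidence intervals were obtained. A common choice for Pearson correlations is the Fisher z-transformation, and the Python sketch below assumes that method; the function name is illustrative. With r = 0.966 and n = 14 tumors it reproduces the interval reported for the ER score to within rounding.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z-transform.

    r: observed correlation (-1 < r < 1), n: number of paired observations.
    """
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                  # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    return math.tanh(z - q * se), math.tanh(z + q * se)


# ER score: r = 0.966 across the 14 tumors (values from the text).
lower, upper = pearson_ci(0.966, 14)
```

The interval is asymmetric around r because the transformation stretches values close to 1, which matches the shape of the intervals quoted above.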
Discussion
Methods and protocols for RNA isolation from formalin-fixed
tissues have been published for almost 20 years
[16-32].
RNA was quantified by dot blot hybridization [23], semiquantitative
PCR [19] and more recently, by QPCR
[24,18,26,13,17,33,32] and other methods [28-30]. RNA
derived from FFPE material is not only partially hydrolyzed
but also chemically modified: formalin reacts with nucleotides
leading to the formation of methylol groups in
nucleobases. These groups tend to further react and form
intra- and inter-molecular methylene bridges in RNA,
DNA [34,35,31] and protein [36]. As a result, reverse transcription
is impaired and threshold cycle values (Ct values)
increase during subsequent QPCR.
The protocol for RNA isolation described here was complemented
by a separate demodification step, which involves
incubation at elevated temperature in a buffer containing
ammonium chloride; this favors the reversion of methylol
groups to amino groups in nucleobases. The step not only
improves the efficiency of downstream applications (mainly
reverse transcription) but also improves the recovery of
RNA from FFPE sections. RNA
yield and quality can be further improved by extensive
digestion of FFPE material with protease in a buffer containing
guanidinium thiocyanate. Reverse transcription in
the presence of gene-specific primers prevents the initiation
of cDNA synthesis inside amplicons and therefore,
cDNA made in the presence of gene-specific primers is a
better template for QPCR than cDNA made from random
primers (Fig. 2). Several papers have demonstrated that
QPCR with primers yielding short amplicons is more
efficient than QPCR with primers yielding long amplicons
[17,20,24,13,32].
Finally, normalization of raw data is used to eliminate or
at least reduce the effect of the poorer quality of the starting RNA.
Various approaches to normalization have been proposed in
the literature [37,14,38,32]. They are based on calculating
relative expression values: expression levels of genes of
interest are expressed relative to the expression of one or a
panel of several suitable reference genes. An ideal reference
gene has a stable expression level in all the samples
under investigation. As such an "ideal" reference gene normally
does not exist, the mean or median expression level
of several suitably chosen reference genes is used as a relatively
stable reference Ct value. We used a formalized
approach to characterize all candidate reference genes.
Candidate reference genes were ranked according to their
standard deviations of raw Ct values in RNA from FF and
FFPE material. The final rank of each candidate reference
gene was taken as the mean of the two ranks obtained
with RNA from intact and FFPE material. Genes with
higher ranks were excluded as reference genes.
Table 2: Clinical and molecular parameters of breast cancers. Clinical and molecular parameters are given for each breast cancer used
in this study. Module scores for each tumor were calculated from the results based on intact RNA (FF material) and based on RNA
isolated from FFPE material according to our own method. N.A., data not available.
Columns: Tu#; clinical classification (T, N, Grade); immunohistochemistry (ER, PR, ErbB2); module scores given as FF/FFPE (ER, HER2, Prolif.); histological type.
Tu# T N Grade ER PR ErbB2 ER HER2 Prolif. Histological type
2 2 0 3 70% pos. neg. 1+ 16.6/17.1 15.8/16.3 14.2/14.5 invasive ductal
3 2 1a 2 70% pos. pos. 1+ 17.2/17.7 16.4/16.6 14.3/14.0 mixed (duct./lob)
6 1c 3a 3 >90% pos. pos. 3+ 15.7/16.2 17.2/18.1 14.5/15.4 invasive ductal
8 2 2a 3 >90% pos. pos. 2+ 16.5/17.2 15.8/16.3 14.7/14.7 invasive ductal
10 1c N.A. 2 >90% pos. pos. 2+ 14.5/16.6 15.2/16.2 13.5/14.4 invasive ductal
13 2 3a 3 >90% pos. neg. 1+ 16.6/17.0 15.7/15.6 14.6/14.5 invasive ductal
14 1c N.A. 2 >90% pos. neg. 1+ 16.5/16.9 16.2/16.4 13.8/13.5 invasive ductal
15 N.A. 3 neg. neg. 0 11.8/13.0 14.9/15.5 14.9/15.5 invasive ductal
16 2 N.A. 1 65% pos. pos. 0 17.9/18.3 16.4/16.6 13.8/14.1 invasive ductal/cribriform
17 2 0 3 >90% pos. pos. 2+ 16.5/17.2 16.3/16.9 14.9/15.6 invasive ductal
18 2 N.A. 3 neg. neg. 0 13.0/12.9 15.1/15.9 15.2/16.0 invasive ductal
19 2 N.A. 2 >90% pos. pos. 0 17.2/17.6 15.9/16.0 13.8/14.2 invasive ductal
20 1c N.A. 2 neg. neg. 0 12.4/13.0 15.7/15.9 14.6/15.2 invasive ductal
22 2 0 2 N.A. N.A. N.A. 16.8/17.4 15.6/16.1 13.3/13.9 invasive ductal
Figure 7: Comparison of Total scores computed from intact and FFPE-derived RNA. Total scores were computed from normalized
expression values based on the results of intact RNA (circles) and FFPE-derived RNA (triangles, own protocol) as described in the
Methods section. They are shown separately for each of the 14 tumors.
Figure 8: Module scores. ER, HER2 and proliferation scores were computed from expression values of 14 breast cancers and visualized
in a scatter plot. The ER score was determined from four genes, the HER2 score from 2 and the proliferation score from
5 genes (see Methods). Tumors are positioned according to their ER score (x-axis) and HER2 score (y-axis). Proliferation
scores are color coded. The histological ER status is indicated by a "-" or "+" sign next to the tumor numbers in the plot. The
results were computed from intact RNA derived from FF material (A) and RNA isolated from FFPE material according to our
own protocol (B). Individual scores for each tumor are given in Table 2.
We also applied GeNorm [14] to characterize candidate
reference genes: ACTB and RPS11 had the poorest stability
measure M [14] for FFPE-derived RNA and RPL7A had a
poor stability measure when RNA from FF material was
tested (data not shown). For these reasons GAPDH,
GUSB, RPLP0, TFRC and UBB were used as reference
genes in this study.
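The rank-based screening of candidate reference genes described above can be sketched in Python as follows. This is an illustration under assumptions: rank 1 is given to the gene with the smallest standard deviation, ties are not handled, the sample standard deviation is used, and the Ct values are hypothetical. It does not implement GeNorm, whose stability measure M is defined differently.

```python
from statistics import stdev


def rank_by_sd(ct_by_gene):
    """Rank genes by the SD of their raw Cts across tumors (1 = most stable).

    ct_by_gene maps a gene symbol to its list of raw Ct values, one per tumor.
    """
    ordered = sorted(ct_by_gene, key=lambda gene: stdev(ct_by_gene[gene]))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def combined_rank(ct_ff, ct_ffpe):
    """Final rank of each gene: mean of its FF rank and its FFPE rank."""
    rank_ff = rank_by_sd(ct_ff)
    rank_ffpe = rank_by_sd(ct_ffpe)
    return {gene: (rank_ff[gene] + rank_ffpe[gene]) / 2.0 for gene in ct_ff}


# Hypothetical raw Cts for three candidate genes in three tumors.
ct_ff = {
    "GAPDH": [22.0, 22.2, 21.9],
    "ACTB": [20.0, 20.3, 19.8],
    "RPS23": [37.0, 39.5, 35.0],
}
ct_ffpe = {
    "GAPDH": [25.0, 25.3, 24.9],
    "ACTB": [24.0, 27.0, 22.5],
    "RPS23": [38.0, 40.0, 36.5],
}
final_rank = combined_rank(ct_ff, ct_ffpe)
```

In this toy data set GAPDH is the most stable gene in both materials, while ACTB is stable in intact RNA but variable in FFPE-derived RNA, which is the kind of pattern that the combined rank is meant to expose.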
Our own RNA isolation protocol was compared to RNA
that was isolated from the same material but according to
commercial protocols and products (Qiagen RNeasy FFPE
and ncLysis system of Applied Biosystems). Additional
products for FFPE material from commercial providers
(e.g. Stratagene, Ambion) were tested and the results
obtained with our own protocol were superior to all tested
commercial products (data not shown).
We determined module scores for each of the 14 tumors
in this study. The limited number of samples does not
allow statements about the clinical significance of module
scores but they can be used to compare scores computed
from intact RNA from FF material and RNA isolated from
FFPE according to our own protocol. Pearson correlations
between these RNAs in the 14 tumors were 0.966, 0.856
and 0.833 for ER, HER2 and Proliferation scores, respectively.
As RNA from cryopreserved material and RNA from FFPE material
always originated from different portions of the same
tumor, a certain variation of gene expression cannot be
excluded and, as a consequence, part of the observed variability
between cryopreserved and FFPE material may be attributed
to biological heterogeneity in the tumors. The three module
scores were combined into a Total score. The Total score
is similar to the recurrence score described by Paik [11],
with high expression of genes related to proliferation and
HER2 and low expression of ER-related genes indicating
higher risk.
The data generated from FF and FFPE material were also
compared to ER and HER2 levels assessed by IHC
in the same tumors. Three tumors (#15, #18, #20) were
ER-negative and one was strongly HER2-positive (#6)
(Tab. 2). The three ER-negative tumors had low ER scores when
assessed by QPCR (Fig. 8). Tumor #6 had a high HER2
score and an intermediate ER score. These results are in
good agreement with the expected distribution of the
three scores [15,39]. Comparing QPCR-based data with
well-known tumor subtypes allowed us to validate the protocols
developed here, even though no new biological findings are
provided. The primary aim of this work was to document
that stable and robust expression values, close to the values
computed from intact RNA of the same tumors, can be
determined from FFPE-derived RNA. The
optimization and validation of the scoring procedure
remains an important issue, but the available
number of samples is not sufficient to deal with this
aspect; it will be addressed separately on a larger
collection of samples.
While IHC results are at most semi-quantitative, QPCR-based
results reflect more accurately the expression level
of the genes in question. The module scores proposed here
integrate quantitative gene expression data from several
genes, which makes the resulting scores more robust than
measurements based on single genes. QPCR is not only
quantitative but also very sensitive over a large dynamic
range. The number of genes which can be measured by
QPCR is not limited, and additional genes and module
scores can be included in the analysis if
required.
Importantly, certain predictive parameters still cannot be
determined with current technologies. For example,
breast cancers are classified into histological grade 1, 2 or
3. This grading most likely reflects the proliferative state of
tumor cells [40]. Grading may be especially important as
high grade tumors seem to respond more favorably to
chemotherapy than low grade tumors. Unfortunately,
many tumors are histological grade 2 and for those
tumors the benefit is not clear. Paik and co-workers documented
that their recurrence score (RS) was also predictive
for a response to chemotherapy [41]. The RS defined
by Paik and co-workers is composed of 16 test genes,
mainly representing ER response genes, proliferation-associated
genes, HER2-related genes and invasion genes,
and 5 genes for normalization [11,41].
The genes used in this study (Tab. 1) were selected
from published DNA chip studies with breast cancer samples
[15]. They mostly coincide with the genes used by
Paik and co-workers.
Conclusion
The results presented in this study reveal that RNA isolated
from FFPE material according to the protocol developed
in our laboratory can be used for expression
measurements by QPCR although the RNA is partially
degraded. The optimized isolation and de-modification
procedures combined with a normalization procedure
result in stable and robust gene expression data. Robustness
of results was further increased by computing scores
from several genes representing the hormonal and the
proliferation status of the tumor. Molecular profiling
from FFPE material may be of interest for routine diagnostics
in the near future as FFPE material is always available
[42]. Similarly, molecular profiling from FFPE material
may be of great interest in the context of existing and
newly planned clinical trials for which only formalin-fixed
samples exist.
Abbreviations
ER, estrogen receptor; FF, fresh frozen tissue; FFPE, formalin-fixed,
paraffin-embedded tissue; IHC, immunohistochemistry;
PGR, progesterone receptor; QPCR,
quantitative polymerase chain reaction.
Competing interests
The author(s) declare that they have no competing interests.
Authors' contributions
AO and AB developed the procedure, performed validation
studies and were involved in all the experiments presented
here. HJA and SA contributed clinical and
pathological information. JA and SM participated in the
design of the study and VP and MD performed the statistical
analysis and developed the scoring system. RJ conceived
the study, participated in its design and
coordination and drafted the manuscript. All the authors
read and approved the final manuscript.
Additional material
Acknowledgements
The authors thank Dr. R. Haener, University of Bern for helpful suggestions
on formalin-induced modification of RNA, Drs. M. Müller, C. Meier and A.
Günthert for providing tumors for the tumorbank, Drs. A. Fleischmann and
H. Burger for providing FFPE tumor material, and Dr. M. Schobesberger for
critical reading of the manuscript. This work was supported by the Swiss
Cancer League (to RJ and MD), the NCCR "Molecular Oncology" (to MD),
the Bernese Cancer League and Applied Biosystems, Rotkreuz Switzerland
(to RJ). Cryopreserved material was provided by the Tumorbank Bern.
The Tumorbank Bern is sponsored by the Department of Clinical Research,
the Institute of Pathology, University of Bern, and the Bernese Cancer
League. Informed consent was provided from the patients for all the samples
used in this study.
References
1. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie
T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC,
Brown PO, Botstein D, Eystein Lonning P, Borresen-Dale AL: Gene
expression patterns of breast carcinomas distinguish tumor
subclasses with clinical implications. Proc Natl Acad Sci U S A
2001, 98(19):10869-10874.
2. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S,
Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lonning PE,
Brown PO, Borresen-Dale AL, Botstein D: Repeated observation
of breast tumor subtypes in independent gene expression
data sets. Proc Natl Acad Sci U S A 2003, 100(14):8418-8423.
3. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat
P, Fox SB, Harris AL, Liu ET: Breast cancer classification and
prognosis based on gene expression profiles from a population-based
study. Proc Natl Acad Sci U S A 2003,
100(18):10393-10398.
4. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack
JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A,
Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein
D: Molecular portraits of human breast tumours. Nature
2000, 406(6797):747-752.
5. Rampaul RS, Pinder SE, Elston CW, Ellis IO: Prognostic and predictive
factors in primary breast cancer and their role in
patient management: The Nottingham Breast Team. Eur J
Surg Oncol 2001, 27(3):229-238.
6. Goldhirsch A, Glick JH, Gelber RD, Coates AS, Thurlimann B, Senn
HJ: Meeting highlights: international expert consensus on the
primary therapy of early breast cancer 2005. Ann Oncol 2005,
16(10):1569-1583.
7. Sotiriou C, Powles TJ, Dowsett M, Jazaeri AA, Feldman AL, Assersohn
L, Gadisetti C, Libutti SK, Liu ET: Gene expression profiles
derived from fine needle aspiration correlate with response
to systemic chemotherapy in breast cancer. Breast Cancer Res
2002, 4(3):R3.
8. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M,
Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ,
Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene
expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415(6871):530-536.
9. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins
PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC,
et al.: The MicroArray Quality Control (MAQC) project shows
inter- and intraplatform reproducibility of gene expression
measurements. Nat Biotechnol 2006, 24(9):1151-1161.
10. Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis
P, Harris A, Bergh J, Foekens JA, Klijn JG, Larsimont D, Buyse M, Bontempi
G, Delorenzi M, Piccart MJ, Sotiriou C: Definition of clinically
distinct molecular subtypes in estrogen receptor-positive
breast carcinomas through genomic grade. J Clin
Oncol 2007, 25(10):1239-1246.
11. Paik S: Molecular profiling of breast cancer. Curr Opin Obstet
Gynecol 2006, 18(1):59-63.
12. Cronin M, Pho M, Dutta D, Stephans JC, Shak S, Kiefer MC, Esteban
JM, Baker JB: Measurement of gene expression in archival paraffin-embedded
tissues: development and performance of a
92-gene reverse transcriptase-polymerase chain reaction
assay. Am J Pathol 2004, 164(1):35-42.
Additional file 1
Comparison of normalized expression for each gene in FF and FFPE
material. Expression was determined by QPCR from RNA derived of FF
and FFPE material (own protocol). Normalized expression levels (see
Methods for details) are shown for each gene and the 14 tumors as polygonal
plots.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-9-S1.jpeg]
Additional file 2
Polygonal representation of ER, HER2, Proliferation and Total scores.
Gene expression was measured from RNA derived of FF and FFPE material
(own protocol) and ER, HER2, proliferation and Total scores were
computed for each RNA of the 14 tumors and results are shown as polygonal
plots. Parallel lines indicate good correlations and crossing lines are
indicative of discrepancies between scores computed from FF and FFPE-derived
RNA.
[http://www.biomedcentral.com/content/supplementary/1755-8794-1-9-S2.jpeg]
BMC Medical Genomics 2008, 1:9 http://www.biomedcentral.com/1755-8794/1/9
13. Antonov J, Goldstein DR, Oberli A, Baltzer A, Pirotta M, Fleischmann
A, Altermatt HJ, Jaggi R: Reliable gene expression measurements
from degraded RNA by quantitative real-time PCR
depend on short amplicons and a proper normalization. Lab
Invest 2005, 85(8):1040-1050.
14. Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De
Paepe A, Speleman F: Accurate normalization of real-time
quantitative RT-PCR data by geometric averaging of multiple
internal control genes. Genome Biol 2002,
3(7):RESEARCH0034.
15. Wirapati P, Kunkel S, Goldstein DG, Farmer P, Pradervand S, Haibe-Kains
B, Desmedt C, Sengstag T, Schütz F, Piccart M, Sotiriou C,
Delorenzi M: Integrative analysis of gene-expression profiles:
toward a unified understanding of breast cancer subtyping
and prognosis signatures. 2007 in press.
16. Shibutani M, Uneyama C, Miyazaki K, Toyoda K, Hirose M: Methacarn
fixation: a novel tool for analysis of gene expressions in
paraffin-embedded tissue specimens. Lab Invest 2000,
80(2):199-208.
17. Abrahamsen HN, Steiniche T, Nexo E, Hamilton-Dutoit SJ, Sorensen
BS: Towards quantitative mRNA analysis in paraffin-embedded
tissues using real-time reverse transcriptase-polymerase
chain reaction: a methodological study on lymph nodes
from melanoma patients. J Mol Diagn 2003, 5(1):34-41.
18. Godfrey TE, Kim SH, Chavira M, Ruff DW, Warren RS, Gray JW,
Jensen RH: Quantitative mRNA expression analysis from formalin-fixed,
paraffin-embedded tissues using 5' nuclease
quantitative reverse transcription-polymerase chain reaction.
J Mol Diagn 2000, 2(2):84-91.
19. Houze TA, Gustavsson B: Sonification as a means of enhancing
the detection of gene expression levels from formalin-fixed,
paraffin-embedded biopsies. Biotechniques 1996, 21(6):1074-8,
1080, 1082.
20. Koopmans M, Monroe SS, Coffield LM, Zaki SR: Optimization of
extraction and PCR amplification of RNA extracts from paraffin-embedded
tissue in different fixatives. J Virol Methods
1993, 43(2):189-204.
21. Lewis F, Maughan NJ, Smith V, Hillan K, Quirke P: Unlocking the
archive--gene expression in paraffin-embedded tissue. J
Pathol 2001, 195(1):66-71.
22. Reichmuth C, Markus MA, Hillemanns M, Atkinson MJ, Unni KK,
Saretzki G, Hofler H: The diagnostic potential of the chromosome
translocation t(2;13) in rhabdomyosarcoma: a Pcr
study of fresh-frozen and paraffin-embedded tumour samples.
J Pathol 1996, 180(1):50-57.
23. Rupp GM, Locker J: Purification and analysis of RNA from paraffin-embedded
tissues. Biotechniques 1988, 6(1):56-60.
24. Specht K, Richter T, Muller U, Walch A, Werner M, Hofler H: Quantitative
gene expression analysis in microdissected archival
formalin-fixed and paraffin-embedded tumor tissue. Am J
Pathol 2001, 158(2):419-429.
25. Stanta G, Bonin S: RNA quantitative analysis from fixed and
paraffin-embedded tissues: membrane hybridization and
capillary electrophoresis. Biotechniques 1998, 24(2):271-276.
26. Thomazy VA, Luthra R, Uthman MO, Davies PJ, Medeiros LJ: Determination
of cyclin D1 and CD20 mRNA levels by real-time
quantitative RT-PCR from archival tissue sections of mantle
cell lymphoma and other non-Hodgkin's lymphomas. J Mol
Diagn 2002, 4(4):201-208.
27. Mies C: A simple, rapid method for isolating RNA from paraffin-embedded
tissues for reverse transcription-polymerase
chain reaction (RT-PCR). J Histochem Cytochem 1994,
42(6):811-813.
28. Bibikova M, Talantov D, Chudin E, Yeakley JM, Chen J, Doucet D,
Wickham E, Atkins D, Barker D, Chee M, Wang Y, Fan JB: Quantitative
gene expression profiling in formalin-fixed, paraffin-embedded
tissues using universal bead arrays. Am J Pathol
2004, 165(5):1799-1807.
29. Bibikova M, Chudin E, Arsanjani A, Zhou L, Garcia EW, Modder J,
Kostelec M, Barker D, Downs T, Fan JB, Wang-Rodriguez J: Expression
signatures that correlated with Gleason score and
relapse in prostate cancer. Genomics 2007, 89(6):666-672.
30. Haller AC, Kanakapalli D, Walter R, Alhasan S, Eliason JF, Everson RB:
Transcriptional profiling of degraded RNA in cryopreserved
and fixed tissue samples obtained at autopsy. BMC Clin Pathol
2006, 6:9.
31. Rait VK, Zhang Q, Fabris D, Mason JT, O'Leary TJ: Conversions of
formaldehyde-modified 2'-deoxyadenosine 5'-monophosphate
in conditions modeling formalin-fixed tissue dehydration.
J Histochem Cytochem 2006, 54(3):301-310.
32. Koch I, Slotta-Huspenina J, Hollweck R, Anastasov N, Hofler H, Quintanilla-Martinez
L, Fend F: Real-time quantitative RT-PCR shows
variable, assay-dependent sensitivity to formalin fixation:
implications for direct comparison of transcript levels in paraffin-embedded
tissues. Diagn Mol Pathol 2006, 15(3):149-156.
33. Hamatani K, Eguchi H, Takahashi K, Koyama K, Mukai M, Ito R, Taga
M, Yasui W, Nakachi K: Improved RT-PCR amplification for
molecular analyses with long-term preserved formalin-fixed,
paraffin-embedded tissue specimens. J Histochem Cytochem
2006, 54(7):773-780.
34. Masuda N, Ohnishi T, Kawamoto S, Monden M, Okubo K: Analysis
of chemical modification of RNA from formalin-fixed samples
and optimization of molecular biology applications for
such samples. Nucleic Acids Res 1999, 27(22):4436-4443.
35. Chaw YF, Crane LE, Lange P, Shapiro R: Isolation and identification
of cross-links from formaldehyde-treated nucleic acids.
Biochemistry 1980, 19(24):5525-5531.
36. Orlando V, Strutt H, Paro R: Analysis of chromatin structure by
in vivo formaldehyde cross-linking. Methods 1997,
11(2):205-214.
37. Fleige S, Walf V, Huch S, Prgomet C, Sehm J, Pfaffl MW: Comparison
of relative mRNA quantification models and the impact of
RNA integrity in quantitative real-time RT-PCR. Biotechnol
Lett 2006, 28(19):1601-1613.
38. Andersen CL, Jensen JL, Orntoft TF: Normalization of real-time
quantitative reverse transcription-PCR data: a model-based
variance estimation approach to identify genes suited for
normalization, applied to bladder and colon cancer data
sets. Cancer Res 2004, 64(15):5245-5250.
39. Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't
Veer LJ, Perou CM: Concordance among gene-expression-based
predictors for breast cancer. N Engl J Med 2006,
355(6):560-569.
40. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H,
Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso
F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart
M, Delorenzi M: Gene expression profiling in breast cancer:
understanding the molecular basis of histologic grade to
improve prognosis. J Natl Cancer Inst 2006, 98(4):262-272.
41. Paik S, Tang G, Shak S, Kim C, Baker J, Kim W, Cronin M, Baehner
FL, Watson D, Bryant J, Costantino JP, Geyer CE Jr., Wickerham DL,
Wolmark N: Gene expression and benefit of chemotherapy in
women with node-negative, estrogen receptor-positive
breast cancer. J Clin Oncol 2006, 24(23):3726-3734.
42. Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer
prognosis through the combination of clinical and
genetic markers. Bioinformatics 2007, 23(1):30-37.
Pre-publication history
The pre-publication history for this paper can be accessed
here:
http://www.biomedcentral.com/1755-8794/1/9/prepub
15 Molecular risk assessment of BIG 1-98 participants
by expression profiling using RNA from
archival tissue
• BMC Cancer, 10(37), 2010
• IF: 3.288
• number of citations: 6
• personal contribution (30%): design of genomic signatures and
scores, statistical analyses, manuscript writing
RESEARCH ARTICLE Open Access
Molecular risk assessment of BIG 1-98 participants
by expression profiling using RNA from archival
tissue
Janine Antonov1†, Vlad Popovici2†, Mauro Delorenzi2, Pratyaksha Wirapati2, Anna Baltzer1, Andrea Oberli1,
Beat Thürlimann3,8, Anita Giobbie-Hurder4, Giuseppe Viale5, Hans Jörg Altermatt6, Stefan Aebi1,7,8, Rolf Jaggi1*
Abstract
Background: The purpose of the work reported here is to test reliable molecular profiles using routinely processed
formalin-fixed paraffin-embedded (FFPE) tissues from participants of the clinical trial BIG 1-98 with a median follow-up
of 60 months.
Methods: RNA from fresh frozen (FF) and FFPE tumor samples of 82 patients were used for quality control, and
independent FFPE tissues of 342 postmenopausal participants of BIG 1-98 with ER-positive cancer were analyzed by
measuring prospectively selected genes and computing scores representing the functions of the estrogen receptor
(eight genes, ER_8), the progesterone receptor (five genes, PGR_5), Her2 (two genes, HER2_2), and proliferation
(ten genes, PRO_10) by quantitative reverse transcription PCR (qRT-PCR) on TaqMan Low Density Arrays. Molecular
scores were computed for each category and ER_8, PGR_5, HER2_2, and PRO_10 scores were combined into a
RISK_25 score.
Results: Pearson correlation coefficients between FF- and FFPE-derived scores were at least 0.94 and high
concordance was observed between molecular scores and immunohistochemical data. The HER2_2, PGR_5,
PRO_10 and RISK_25 scores were significant predictors of disease-free survival (DFS) in univariate Cox proportional
hazard regression. PRO_10 and RISK_25 scores predicted DFS in patients with histological grade II breast cancer
and in lymph node positive disease. The PRO_10 and PGR_5 scores were independent predictors of DFS in
multivariate Cox regression models incorporating clinical risk indicators; PRO_10 outperformed Ki-67 labeling index
in multivariate Cox proportional hazard analyses.
Conclusions: Scores representing the endocrine responsiveness and proliferation status of breast cancers were
developed from gene expression analyses based on RNA derived from FFPE tissues. The validation of the molecular
scores with tumor samples of participants of the BIG 1-98 trial demonstrates that such scores can serve as
independent prognostic factors to estimate disease free survival (DFS) in postmenopausal patients with estrogen
receptor positive breast cancer.
Trial Registration: Current Controlled Trials: NCT00004205
Background
Clinical and histopathological factors such as lymph
node status, tumor size, histological grade, age, and
expression of estrogen receptor (ER) and Her2 have traditionally
guided treatment decisions of patients with
operable breast cancer [1,2]. Various prognostic models
are based on these factors, for example the Nottingham
Prognostic Index (NPI) [3,4], Adjuvant!Online [5,6] and
others [7]. Despite providing excellent estimates of the
average risk of recurrence, there remains substantial variation
in outcome which may be explained by molecular
differences among these tumors [8,9].
* Correspondence: rolf.jaggi@dkf.unibe.ch
† Contributed equally
1 Department of Clinical Research, University of Bern, Bern, Switzerland
Antonov et al. BMC Cancer 2010, 10:37
http://www.biomedcentral.com/1471-2407/10/37
© 2010 Antonov et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
DNA-chip based expression analyses have confirmed
the heterogeneity of breast cancer and allowed the
development of clinically relevant gene “signatures” or
“profiles” [10-20]. Such profiles are being implemented
widely in routine patient care even though many signatures
were developed and validated on heterogeneous
patient cohorts with respect to stage of disease and therapy.
The utility of gene signatures as part of the decision
making process is being validated in ongoing studies
(TAILORx [21] and MINDACT [22]). Most profiling
studies are based on fresh-frozen (FF) or RNAlater conserved
tissue. Such material must be collected and processed
separately after surgery, complicating the
implementation of molecular analyses into the clinical
workflow. Procedures based on formalin-fixed, paraffin-embedded
(FFPE) material simplify the acquisition of
tumor material and can easily be established as part of
the routine pathological procedures. In addition, FFPE
tissues collected in the framework of clinical trials could
be a valuable resource for future research.
We prospectively selected genes from publicly available
microarray data and developed molecular scores
representing the ER, progesterone receptor (PgR), Her2
and proliferation (PRO) status, and the overall risk of
recurrence (RISK). The reproducibility and robustness
of the molecular scores were validated by comparing
expression data obtained with RNA from matched FF and
FFPE material of 82 tumors. Molecular scores were determined from
342 ER positive tumor samples of the BIG 1-98 clinical
trial. Multivariate Cox proportional hazard models
revealed that molecular scores are independent prognostic
factors to estimate disease free survival (DFS).
Methods
To assess the quality of expression profiling from FFPE
material, matched FF and FFPE samples from 82
human breast cancers were used. Histopathological
information was irreversibly anonymized according to
Swiss law. Independent FFPE blocks and corresponding
clinical data of 437 Swiss participants of the trial BIG 1-98
were provided by the International Breast Cancer
Study Group. The ethics committees and required
health authorities of each participating institution
approved the study protocol, and all patients gave written
informed consent (ClinicalTrials.gov number,
NCT00004205) [23]. Retrospective tissue collection was
carried out in accordance with institutional guidelines
and national laws. The patient and tumor characteristics
of these patients were similar to the entire BIG 1-98
population (Table 1). BIG 1-98 is a randomized controlled
clinical trial of adjuvant hormonal therapy for
postmenopausal patients with endocrine-responsive
breast cancer comparing 4 arms: 5 years of tamoxifen, 5
years of letrozole, two years of tamoxifen followed by 3
years of letrozole, or vice versa [24-26]. All the patients
from the BIG 1-98 were treated by mastectomy or
breast conserving surgery [24-26]. The available paraffin
blocks contained material derived from representative
tumor regions.
Tissue samples and data processing
The RNA was isolated from 4 sections (25 μm) of FF
material and from 10 paraffin sections (10 μm thick) as
described previously [27]. After demodification, the
RNA was bound to silica-based columns, DNase I
digested and eluted with water. The protocols and
reagents for RNA isolation from FF and FFPE tissues
were recently incorporated in commercial protocols
(RNAready and FFPE RNAready, AmpTec, Hamburg,
Germany). RNA qualities were assessed on an Agilent
2100 Bioanalyzer (Agilent Technologies, Inc., Santa
Clara, CA, USA). RNA prepared from FF material had an
RNA integrity number (RIN) > 6, whereas the RIN of RNA from
FFPE material was 2-3. The percentage of tumor cells in each
FFPE block was evaluated on stained tissue sections.
From 437 available FFPE samples 43 samples (9.8%)
with less than ~30% tumor cells, 10 ER-negative tumor
samples and 7 samples (1.6%) with less than 1.5 μg total
RNA recovery were excluded from further analysis.
Approximately 30% of the sections contained 30-50%
tumor cells, and about 60% contained 50-100% tumor
cells. Each of the remaining RNAs was tested by quantitative
reverse transcription PCR (qRT-PCR) with 3 control
genes (GUSB, RPLP0 and UBB). The mean of the
three raw Cts (cycle thresholds) was determined. In 35
samples (8%) the mean Ct was >31, indicating poor
quality of the RNA. These RNAs were excluded from
further analyses. For the remaining 342 RNAs (78.3%),
the expression of 34 genes (see Table 1) was measured
by qRT-PCR on TaqMan Low Density Arrays (TLDAs)
(Applied Biosystems, Foster City, CA, USA) using a one
step protocol (Invitrogen, Basel, Switzerland) on an
Applied Biosystems 7900HT instrument. Technical
replicates were performed for several intact and several
partially degraded RNAs from FF and FFPE material,
respectively. They revealed Pearson correlation coefficients
higher than 0.95 for all 34 assays.
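The sample flow described above can be retraced with a few lines of arithmetic. The following is a minimal sketch that uses only the counts reported in the text. The function names (`sample_flow`, `passes_ct_qc`) are hypothetical, and the quality check assumes that a sample is rejected exactly when the mean raw Ct of the three control genes exceeds 31.

```python
# Counts as reported in the text; percentages are relative to all 437 samples.
TOTAL = 437
EXCLUSIONS = [
    ("fewer than ~30% tumor cells", 43),
    ("ER-negative tumor", 10),
    ("less than 1.5 ug total RNA", 7),
    ("mean control Ct > 31", 35),
]


def sample_flow(total, steps):
    """Return (reason, n_removed, n_remaining, percent_of_total) for each step."""
    rows = []
    remaining = total
    for reason, n_removed in steps:
        remaining -= n_removed
        rows.append((reason, n_removed, remaining, round(100.0 * n_removed / total, 1)))
    return rows


def passes_ct_qc(ct_gusb, ct_rplp0, ct_ubb, max_mean_ct=31.0):
    """Assumed QC rule: keep a sample if the mean raw Ct of the controls is <= 31."""
    return (ct_gusb + ct_rplp0 + ct_ubb) / 3.0 <= max_mean_ct


flow = sample_flow(TOTAL, EXCLUSIONS)
analysed = flow[-1][2]
analysed_pct = round(100.0 * analysed / TOTAL, 1)

for reason, n_removed, remaining, pct in flow:
    print(f"{reason}: -{n_removed} ({pct}%), {remaining} remaining")
print(f"analysed: {analysed} ({analysed_pct}%)")
```

Applied in this order, the four exclusion steps leave 342 of the 437 samples, in agreement with the 78.3% stated in the text.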
Genes with high correlation to the expression of ER,
PgR, Her2 and proliferation related genes were prospectively
selected from publicly available microarray data
[28]. A complete list of microarray data sets used in the
meta-analysis is available at http://breast-cancer-research.com/content/10/4/R65/table/T1 [28]
(Additional File 1, Table S1). The scores were defined by giving
equal weight to each gene in the four groups (proliferation,
estrogen response, progesterone response, Her2
response). Thus, a training set was not used as the
scores were based on in silico gene selection.
Table 1 Gene Identifications, Categories and Score affiliations
Gene | Category | Accession Nr. | Description | AS | Score
GUSB | Control | NM_000181.1 | glucuronidase, beta | 81 | control
RPLP0 | Control | NM_053275.3, NM_001002.3 | ribosomal protein, large, P0 | 105 | control
UBB | Control | NM_018955.2 | ubiquitin B | 120 | control
AR | ER | NM_001011645.1, NM_000044.2 | androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease) | 72 | ER_8
ERBB4 | ER | NM_001042599.1, NM_005235.2 | v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian) | 77 | ER_8
ESR1 | ER | NM_000125.2 | estrogen receptor 1 | 62 | ER_8, ER_4
FOXA1 | ER | NM_004496.2 | forkhead box A1 | 74 | ER_8
GATA3 | ER | NM_001002295.1, NM_002051.2 | GATA binding protein 3 | 80 | ER_8
MAPT | ER | NM_016834.2, NM_016835.2, NM_016841.2, NM_005910.3 | microtubule-associated protein tau | 60 | ER_8
MYB | ER | NM_005375.2 | v-myb myeloblastosis viral oncogene homolog (avian) | 96 | ER_8
XBP1 | ER | NM_005080.2 | X-box binding protein 1 | 60 | ER_8
BCL2 | ER | NM_000633.2 | B-cell CLL/lymphoma 2 | 81 | ER_4
GREB1 | PGR | NM_033090.1, NM_148903.1, NM_014668.2 | GREB1 protein | 77 | PGR_5
PGR | PGR | NM_000926.3 | progesterone receptor | 118 | PGR_5, ER_4
RAB31 | PGR | NM_006868.2 | RAB31, member RAS oncogene family | 109 | PGR_5
RBBP8 | PGR | NM_203291.1, NM_203292.1, NM_002894.2 | retinoblastoma binding protein 8 | 75 | PGR_5
SERPINA3 | PGR | NM_001085.4 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | 70 | PGR_5
SCUBE2 | PGR | NM_020974.1 | CEGP1, signal peptide, CUB domain, EGF-like 2 | 64 | ER_4
ERBB2 | HER2 | NM_001005862.1, NM_004448.2 | v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian) | 120 | HER2_2
GRB7 | HER2 | NM_005310.2 | growth factor receptor-bound protein 7 | 70 | HER2_2
CCNB2 | Proliferation | NM_004701.2 | cyclin B2 | 73 | PRO_10
CCNE2 | Proliferation | NM_057735.1, NM_057749.1 | cyclin E2 | 70 | PRO_10
CDC2 | Proliferation | NM_033379.2, NM_001786.2 | cell division cycle 2, G1 to S and G2 to M | 92 | PRO_10
CENPF | Proliferation | NM_016343.3 | centromere protein F, 350/400 ka (mitosin) | 99 | PRO_10
KIF20A | Proliferation | NM_005733.1 | kinesin family member 20A | 130 | PRO_10
MKI67 | Proliferation | NM_002417.3 | antigen identified by monoclonal antibody Ki-67 | 131 | PRO_10, PRO_5
ORC6L | Proliferation | NM_014321.2 | origin recognition complex, subunit 6 like (yeast) | 78 | PRO_10
PRC1 | Proliferation | NM_199413.1, NM_199414.1, NM_003981.2 | protein regulator of cytokinesis 1 | 66 | PRO_10
SPAG5 | Proliferation | NM_006461.3 | sperm associated antigen 5 | 114 | PRO_10
TOP2A | Proliferation | NM_001067.2 | topoisomerase (DNA) II alpha 170 kDa | 125 | PRO_10
AURKA | Proliferation | NM_003600.2 | STK15 aurora kinase A | 85 | PRO_5
BIRC5 | Proliferation | NM_001012271.1, NM_001168.2 | baculoviral IAP repeat-containing 5 (survivin) | 93 | PRO_5
CCNB1 | Proliferation | NM_031966.2 | cyclin B1 | 104 | PRO_5
MYBL2 | Proliferation | NM_002466.2 | v-myb myeloblastosis viral oncogene homolog (avian)-like 2 | 81 | PRO_5
Abbreviation: AS, amplicon size
Raw Ct values were normalized against the mean
expression of GUSB, RPLP0 and UBB. Scores for ER
(ER_8), PgR (PGR_5), Her2 (HER2_2) and proliferation
(PRO_10) were defined as mean expression of all genes
in each category (Table 1). A RISK score comprising 25
genes was calculated as follows: RISK_25 = PRO_10
+ HER2_2 - (8 × ER_8 + 5 × PGR_5)/13. For comparison,
ER_4 and PRO_5 scores were calculated based on 4 and
5 genes described previously [27]. The genes underlying
the ER_4 and PRO_5 scores correspond to genes used
for calculating the recurrence score (RS)
[29].
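The score definitions can be illustrated with a short sketch. The gene groups follow Table 1 and the RISK_25 formula is the one given in the text. The sign convention of the normalization (mean control Ct minus gene Ct, so that larger values mean higher expression) is an assumption, since the text only states that raw Ct values were normalized against the mean of the three control genes. The function names are hypothetical.

```python
from statistics import mean

CONTROL_GENES = ["GUSB", "RPLP0", "UBB"]

# Gene groups as listed in Table 1 (8 + 5 + 2 + 10 = 25 genes).
SCORE_GENES = {
    "ER_8": ["AR", "ERBB4", "ESR1", "FOXA1", "GATA3", "MAPT", "MYB", "XBP1"],
    "PGR_5": ["GREB1", "PGR", "RAB31", "RBBP8", "SERPINA3"],
    "HER2_2": ["ERBB2", "GRB7"],
    "PRO_10": ["CCNB2", "CCNE2", "CDC2", "CENPF", "KIF20A",
               "MKI67", "ORC6L", "PRC1", "SPAG5", "TOP2A"],
}


def normalize(raw_ct):
    """Normalize raw Ct values against the mean Ct of the control genes.

    Assumed convention: normalized value = mean control Ct - gene Ct.
    """
    reference = mean(raw_ct[g] for g in CONTROL_GENES)
    return {g: reference - ct for g, ct in raw_ct.items() if g not in CONTROL_GENES}


def molecular_scores(raw_ct):
    """Equal-weight mean per gene group, then RISK_25 as defined in the text."""
    expr = normalize(raw_ct)
    s = {name: mean(expr[g] for g in genes) for name, genes in SCORE_GENES.items()}
    s["RISK_25"] = s["PRO_10"] + s["HER2_2"] - (8 * s["ER_8"] + 5 * s["PGR_5"]) / 13
    return s


# Made-up Ct values for one hypothetical tumor, for illustration only.
example_ct = {g: 25.0 for g in CONTROL_GENES}
example_ct.update({g: 24.0 for g in SCORE_GENES["ER_8"]})
example_ct.update({g: 26.0 for g in SCORE_GENES["PGR_5"]})
example_ct.update({g: 27.0 for g in SCORE_GENES["HER2_2"]})
example_ct.update({g: 28.0 for g in SCORE_GENES["PRO_10"]})
example_scores = molecular_scores(example_ct)
```

The weights 8/13 and 5/13 make the subtracted term the mean over all 13 ER and PgR genes, so each hormone-related gene contributes equally to RISK_25.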
Concordance of molecular scores and pathological
parameters
Histopathological data of BIG 1-98 samples were
derived from a central review, with the exception of the
grade which was locally assessed. The ER and PgR status
were dichotomized into positive (≥ 10% immunoreactive
cells) or negative (<10%) [30]. Her2 was measured by
fluorescence in-situ hybridization or immunohistochemistry
(IHC) and tumors were classified according to Rasmussen
et al. [31]. The Ki-67 labeling index (LI) was
centrally assessed by IHC as described and classified
into low or high using the median LI (11%) as cut-off
[32]. The same assays and cut-offs were used for the 82
matched samples with the exception of Her2 which was
measured using the CB11 monoclonal antibody and
using a cut-off of ≥ 50% [33]. Continuous molecular
scores were compared to binary IHC parameters using
the area under the curve (AUC). The 95% confidence
intervals (CI) were estimated by a bootstrap method
(100 bootstraps). Two-sided Mann-Whitney tests were
used to assess the association between clinicopathological
factors and scores.
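As an illustration of how a continuous score is compared with a binary IHC variable, the sketch below computes the AUC through its Mann-Whitney interpretation and a bootstrap confidence interval with 100 resamples. The text does not state which bootstrap variant was used; a simple percentile interval is assumed here. The function names and the toy data are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count one half, as in the Mann-Whitney U statistic. Labels are 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=100, alpha=0.05, seed=0):
    """Return (AUC, lower, upper) using a percentile bootstrap (assumed variant)."""
    point = auc(scores, labels)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return point, lower, upper


# Toy data: score values and IHC status (1 = positive) for four samples.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc, toy_lower, toy_upper = bootstrap_auc_ci(toy_scores, toy_labels)
```

With only 100 resamples the interval limits are coarse, which is worth keeping in mind when reading confidence intervals obtained this way.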
Statistical analyses
Primary endpoint of survival analyses was DFS as
defined previously [25]. Forty-five events were observed
in 342 patients with a median follow-up time (estimated
by reverse Kaplan-Meier [34]) of 60 months. DFS was
estimated by Kaplan-Meier analysis. Patients were classified
into low and high PRO or RISK scores using the
corresponding median score as cut-off. The differences
in survival experience between the two resulting groups
were assessed with log rank tests. Univariate and multivariate
Cox proportional hazard models were used [35]
and hazard ratios (HR), CIs and p-values were obtained.
The multivariate models were assessed using the log-likelihood
and the deviance of residuals. Likelihood ratio
tests (LRT) were used to compare different nested multivariate
models. No adjustments were made for multiple
testing. Univariate Cox proportional hazard models
were applied to estimate the rate of events and to produce
corresponding plots.
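The two basic steps of this analysis, the median split and the Kaplan-Meier estimate, can be sketched as follows. This is an illustration, not the code used in the study. How scores exactly equal to the median were assigned is not stated in the text, so placing them in the low group is an assumption. The function names and the toy data are hypothetical.

```python
from statistics import median


def split_at_median(values):
    """Label each value 'high' or 'low' relative to the median.

    Assumption: values equal to the median are placed in the low group.
    """
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    events: 1 = event observed, 0 = censored.
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everybody who has an event or is censored at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Toy follow-up data in months, for illustration only.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
toy_groups = split_at_median([0.2, 1.5, -0.3, 0.9])
```

Running the same estimator with the event indicator inverted (censoring treated as the event) gives the reverse Kaplan-Meier estimate used for the median follow-up time.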
Results
Reliable expression profiling from FFPE tumor tissue
Gene expression was measured from 34 genes using
TLDAs with RNA isolated from FF and FFPE material
of 82 breast cancers. These data were used solely for the
assessment of the expression profiling from FFPE material.
Pearson correlation coefficients between FF and
FFPE expression values for each tumor and all assays
ranged from 0.91 to 0.98. The mean increase of raw Ct
values derived from FFPE compared to matched FF tissues
was 1.30 units. This Ct shift was mostly compensated by
normalization (Additional File 2, Figure S1 and Additional
File 3, Figure S2).
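A small sketch may help to see why normalization can absorb a Ct shift. It uses made-up Ct values and assumes a perfectly uniform shift of 1.3 cycles, which real FFPE data will only approximate; the sign convention of the normalization (control mean minus gene Ct) is also an assumption. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_ct_shift(ct_ff, ct_ffpe):
    """Mean difference of raw Ct values, FFPE minus FF."""
    return sum(b - a for a, b in zip(ct_ff, ct_ffpe)) / len(ct_ff)


def normalize(ct_genes, ct_controls):
    """Assumed convention: mean control Ct minus gene Ct."""
    reference = sum(ct_controls) / len(ct_controls)
    return [reference - ct for ct in ct_genes]


# Made-up Ct values: four target genes and three control genes of one tumor.
ff_genes, ff_controls = [22.0, 25.5, 28.0, 30.0], [24.0, 23.0, 25.0]
ffpe_genes = [ct + 1.3 for ct in ff_genes]        # idealized uniform shift
ffpe_controls = [ct + 1.3 for ct in ff_controls]

shift = mean_ct_shift(ff_genes, ffpe_genes)
r = pearson(ff_genes, ffpe_genes)
norm_ff = normalize(ff_genes, ff_controls)
norm_ffpe = normalize(ffpe_genes, ffpe_controls)
```

Under this idealization the normalized FF and FFPE values coincide. A gene-dependent shift would leave a residual difference, which is consistent with the text describing the compensation as mostly, not fully, achieved.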
Unsupervised hierarchical clustering demonstrated the
stability of gene clusters and revealed an excellent agreement
between FF- and FFPE-based expression profiles
(Additional File 4, Figure S3). Molecular scores were
determined for ER, PGR, HER2 and PRO. A linear relationship
of scores was found for RNA from FF and
RNA from FFPE material (Figure 1). Pearson correlation
coefficients for the four scores were 0.968, 0.974, 0.942
and 0.944, respectively. The distributions of ER_8,
PGR_5 and HER2_2 scores are shown as histograms
together with the fitted mixture of two Gaussian distributions
(Additional File 1, Figure S4) used for discriminating
the subtypes.
The agreement between molecular scores and corresponding
binary IHC variables was assessed by receiver
operating characteristic (ROC) curves and AUC. AUCs
and 95% CI were calculated for ER_8 (FF = 0.940
(0.835-1.00), FFPE = 0.931 (0.804-1.00)), PGR_5 (FF =
0.919 (0.828-0.986), FFPE = 0.916 (0.806-0.987)) and
HER2_2 (FF = 0.961 (0.895-1.00), FFPE = 0.963 (0.915-0.993)).
PRO_10 was compared with IHC data for Ki-67
using a cut-off of 11% and the resulting AUCs were
0.798 (0.609-0.900) for FF and 0.810 (0.660-0.907) for
FFPE, respectively. In conclusion, the agreement of the
IHC with FFPE samples was as good as with FF samples.
Concordance between pathological parameters and
molecular scores for tumors of the BIG 1-98 clinical trial
Molecular scoring was applied to an independent set of
tissue samples from Swiss patients participating in the
BIG 1-98 randomized clinical trial and scores were compared
to centrally assessed histopathological data by
ROC curves. From a total of 437 provided tumor samples
342 ER-positive tumors (78.3%) were suitable for
analysis. The AUC was 0.974 (95% CI = 0.946-0.995) for
HER2_2 and 0.847 (95% CI = 0.794-0.902) for PGR_5.
PRO_10 scores positively correlated with Ki-67 LI (Pearson
correlation coefficient 0.51); the AUC was 0.815
(95% CI = 0.768-0.864) for Ki-67 binarized at 11% [32].
The PRO_10 score correlates with histological grade and
other clinical factors
The histological grade was assessed according to Elston
and Ellis [36]. The PRO_10 score positively correlated
with Elston and Ellis scores and with grade (Pearson
correlation coefficient 0.453 and 0.409, respectively)
(Figure 2). Furthermore, PRO_10 scores were significantly
higher in Her2 positive tumors, in tumors larger
than 2 cm and in tumors with axillary lymph node
metastasis as compared to Her2 negative tumors, T1
tumors and N0 tumors (p ≤ 0.0015, Mann-Whitney
tests), respectively (data not shown).
PRO and RISK scores predict disease free survival in
lymph node positive patients and patients with grade II
breast cancer
The prognostic values of PRO_10 and RISK_25 scores
were assessed by their ability to assign patients to low
and high risk groups. Patients were stratified according
to histological grade and low or high PRO_10 and
RISK_25 scores using the corresponding medians as
cut-offs (Figure 3). As expected, patients with grade III
tumors had poorer DFS than patients with grade I or
grade II tumors (p = 0.0019, panel A). High PRO_10
scores correlated with poorer DFS compared to low
scores in all (p = 0.0043, panel B) and in histological
grade II tumors (p = 0.0024, panel C). Similarly,
RISK_25 discriminated between favorable and poor DFS
in all (p = 0.0005, panel D) and in node positive tumors
(p = 0.0009, panel E). Univariate Cox proportional
hazards regression analysis confirmed these
observations.
The PGR_5, PRO_10 and RISK_25 scores were all significant
predictors of DFS (p < 0.05) as were histological
grade, tumor size, number of positive lymph nodes and
Ki-67 LI (Table 2). The PRO_5 score was also a significant
predictor of DFS but PRO_10 score was numerically
better than PRO_5 in terms of log-likelihood (L)
and deviance of residuals (D) (PRO_10: L = -223.35, D
= 225.83; PRO_5: L = -224.16, D = 227.57).
Figure 4 shows the estimated rate of recurrence as a
function of PRO_10, PGR_5 and RISK_25 scores. The
PRO_5, PRO_10 and the RISK_25 scores remained significant
predictors of DFS when applied to patients with
grade II breast cancer.
PRO_10 and PGR_5 scores are independent risk factors in
multivariate analyses
The impact of the molecular scores PRO_10 and PGR_5
was further documented in multivariate models
comprising clinicopathologic predictors and molecular
scores that were significant in univariate analyses.
Figure 1 Comparison of scores computed from intact RNA and partially degraded RNA from FFPE material. Scores were determined for
RNA from FF material and RNA from corresponding FFPE tumor material of 82 patients. Scatter plots are shown between scores from FF and
FFPE tissues representing ER_8 (A), PGR_5 (B), HER2_2 (C) and PRO_10 (D) for each tumor. Pearson correlations are indicated.
Multivariate analyses revealed that PRO_10 is a predictor
of DFS independent of tumor size (T), number of
positive lymph nodes (N), grade (G) and Ki-67 LI.
PRO_10 represents proliferation-related genes and it
was of interest to compare it to Ki-67. Table 2 shows
the results of multivariate analyses including T, N, G
and either Ki-67 (model 1) or PRO_10 (model 3) in
comparison with a model containing both markers
(model 2). The full model (model 2) was significantly
better than model 1 (LRT p = 0.0071). No significant
difference was found between models 2 and
3 (LRT p = 0.8075). Thus, adding PRO_10 to T, N, G
and Ki-67 significantly improved the model. In contrast,
adding Ki-67 to T, N, G and PRO_10 did not bring
additional information.
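The comparison of nested models by a likelihood ratio test can be sketched in a few lines for the case where the full model has exactly one additional parameter, as when a single continuous marker is added. The log-likelihood values below are hypothetical placeholders, since the text reports only the resulting p-values, and the function name is likewise an assumption.

```python
from math import erfc, sqrt


def lrt_one_df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models that differ by one parameter.

    The statistic 2 * (loglik_full - loglik_reduced) is referred to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic < 0:
        raise ValueError("the full model cannot have a lower log-likelihood")
    return statistic, erfc(sqrt(statistic / 2.0))


# Hypothetical log-likelihoods of a reduced and a full Cox model.
stat, p_value = lrt_one_df(-100.0, -98.079271)
```

A small p-value indicates that the additional variable improves the fit beyond what the reduced model already explains. This is the logic behind the statement that PRO_10 adds information to T, N, G and Ki-67, whereas Ki-67 does not add information to T, N, G and PRO_10.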
The same procedure was used to evaluate whether
PGR_5 further improved model 6 containing T, N, G
and PRO_10 (Table 3). The full model including all 5
variables (model 5) performed better than model 4 (T,
N, G, PGR_5; LRT p = 0.0089) and model 6 (T, N, G,
PRO_10; LRT p = 0.0339). Both PGR_5 and PRO_10
remained significant in model 5, suggesting that the two
scores contain independent information with respect to
prognosis and outcome.
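The likelihood ratio tests quoted above can be reproduced from the log-likelihoods listed for models 1 to 6 in Table 3. The following is a minimal sketch, not the authors' code: it assumes that each comparison involves nested models differing by a single covariate (one degree of freedom), and the helper name `lrt_pvalue` is hypothetical. Because the tabulated log-likelihoods are rounded to two decimals, the p-values agree with the published ones only up to rounding.

```python
import math

def lrt_pvalue(loglik_reduced, loglik_full):
    """Likelihood ratio test p-value for nested models differing by one parameter."""
    # Test statistic: twice the gain in log-likelihood of the full model.
    stat = 2.0 * (loglik_full - loglik_reduced)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# Log-likelihoods as reported in Table 3.
loglik = {1: -179.38, 2: -175.75, 3: -175.78, 4: -215.27, 5: -211.85, 6: -214.10}

comparisons = {
    "model 1 vs 2 (add PRO_10 to T, N, G, Ki-67)": (1, 2),
    "model 3 vs 2 (add Ki-67 to T, N, G, PRO_10)": (3, 2),
    "model 4 vs 5 (add PRO_10 to T, N, G, PGR_5)": (4, 5),
    "model 6 vs 5 (add PGR_5 to T, N, G, PRO_10)": (6, 5),
}
for label, (reduced, full) in comparisons.items():
    print(f"{label}: p = {lrt_pvalue(loglik[reduced], loglik[full]):.4f}")
```

Models 1 to 3 and models 4 to 6 are only compared within their own group, since the two groups were fitted on different numbers of patients.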
Discussion
Gene expression profiling defines clinically relevant gene
signatures [15,17,37,38]. For the present work, we
selected genes correlating with the ER, PgR, Her2 and
proliferative status using a meta-analysis of gene expression
profiles [28]. The prognostic power of the resulting
gene expression scores for ER, PgR, proliferation and
overall risk of recurrence was validated using tissues and
clinical data from a representative subset of participants
of trial BIG 1-98, confirming the correlation structure of
these genes and their association with clinical and outcome
variables.
Multiple genes representing each score were quantified
by qRT-PCR. RNA from 82 matched FF and FFPE
tissues was compared by qRT-PCR on TLDAs. The
mean increase of raw Ct values between RNA from FF
and FFPE tissues was 1.3 units. This is similar to the
Figure 2 Comparison of scores and immunohistochemical analysis. Correlation of histological grading and PRO_10 score. The 342 tumors
were classified according to histological grading. The data are shown as boxplots with median (solid line), interquartile ranges (boxes) and
minimum and maximum non-outlier values (whiskers). The PRO_10 scores higher and lower than the median are indicated as red and blue dots,
respectively for each grade.
findings of Cronin and co-workers (+2.0 units) in a
comparable setting [39]. Duration of formalin fixation,
storage time and conditions influence the quality of
RNA derived from FFPE tissues, with direct effects on the
sensitivity of subsequent PCR reactions [40]. However,
normalization effectively compensated for this shift of
Ct values (Additional File 2, Figure S1 and Additional File
3, Figure S2).
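To illustrate why normalization against control genes can absorb a uniform Ct shift, the sketch below subtracts each gene's raw Ct from the mean Ct of the three control genes (UBB, RPLP0 and GUSB, as named in Additional file 2). The sign convention, the function name `normalize_ct`, the target gene label `GENE_A` and all Ct values are assumptions made for illustration; they are not taken from the study.

```python
from statistics import mean

CONTROL_GENES = ("UBB", "RPLP0", "GUSB")  # control genes named in Additional file 2

def normalize_ct(raw_ct):
    """Return control-normalized expression (mean control Ct minus gene Ct).

    Assumed convention: higher values mean higher expression.
    """
    reference = mean(raw_ct[g] for g in CONTROL_GENES)
    return {g: reference - ct for g, ct in raw_ct.items() if g not in CONTROL_GENES}

# Hypothetical raw Ct values for one tumor measured from FF RNA.
ff = {"UBB": 22.0, "RPLP0": 21.0, "GUSB": 26.0, "GENE_A": 27.5}
# Same tumor from FFPE RNA, modeled as a uniform shift of +1.3 Ct units.
ffpe = {gene: ct + 1.3 for gene, ct in ff.items()}

print(normalize_ct(ff))
print(normalize_ct(ffpe))
```

A shift that affects every assay equally cancels out in this scheme. Gene-specific degradation effects would not cancel, which is one reason a paired FF/FFPE comparison is still informative.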
The mean expression of eight genes related to ER and
of five genes related to PgR was used to calculate the
ER_8 and PGR_5 scores, respectively. Scores representing different
functional categories were combined in the RISK_25 score.
The molecular scores determined from 82 paired samples
of FF and FFPE tumors were highly concordant, as
were molecular scores and immunohistochemically
assessed parameters, demonstrating the reliability of the
procedure.
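A score of this kind, and its concordance between paired FF and FFPE samples, can be sketched as follows. This is an illustration under assumptions: the gene labels and expression values are invented, and `gene_score` and `pearson` are hypothetical helper names, not the authors' implementation.

```python
from math import sqrt
from statistics import mean

def gene_score(expression, genes):
    """Score = mean normalized expression of the genes in the signature."""
    return mean(expression[g] for g in genes)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

SIGNATURE = ("G1", "G2", "G3")  # placeholder labels, not the genes of any real score

# Hypothetical normalized expression for four tumors, from FF and FFPE material.
ff_samples = [
    {"G1": 1.0, "G2": 2.0, "G3": 3.0},
    {"G1": -1.0, "G2": 0.0, "G3": 1.0},
    {"G1": 4.0, "G2": 3.0, "G3": 5.0},
    {"G1": 0.5, "G2": 0.5, "G3": 2.0},
]
ffpe_samples = [
    {"G1": 1.2, "G2": 1.7, "G3": 3.1},
    {"G1": -0.8, "G2": 0.1, "G3": 0.4},
    {"G1": 3.6, "G2": 3.3, "G3": 4.8},
    {"G1": 0.9, "G2": 0.2, "G3": 2.2},
]

ff_scores = [gene_score(s, SIGNATURE) for s in ff_samples]
ffpe_scores = [gene_score(s, SIGNATURE) for s in ffpe_samples]
print(f"Pearson r between FF and FFPE scores: {pearson(ff_scores, ffpe_scores):.3f}")
```

Averaging several genes per score is what makes such a summary less sensitive to noise in any single assay than a single-gene measurement would be.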
Molecular scores were validated in an independent set
of tumor tissues from 342 participants of trial BIG 1-98.
Histological analyses can also be performed on tissue
sections that contain considerable normal, stromal or fat
components. In contrast, the architecture of the tissue is
completely lost during work-up for molecular analyses, and
it was therefore important to exclude samples with
inadequate tumor content. A histological
section was taken from the immediate vicinity of each
sample that was used for molecular analyses. Each section
was assessed by an experienced pathologist (H.J.A.)
and molecular analyses were restricted to samples containing
at least 30% tumor cells. For comparison, RNA
Figure 3 Survival data based on molecular scores. Kaplan-Meier plots for DFS. Patients were stratified into grade I (blue), II (green) and III (red
line) (A), into low (blue) and high (red) PRO_10 scores in all samples (B) and in Grade II samples (C). The RISK_25 score is shown for all samples
(D) and for tumors of patients with lymph node positive (N+) cancer (E). Median values of the scores were used as cut-offs. The p-values
correspond to Log-rank test.
Table 2 Baseline characteristics.
Characteristic                  Patients with FFPE    Provided material     Patients of the       Overall BIG 1-98
                                profiles from Swiss   of Swiss              BIG 1-98 population   population
                                participants used in  participants          not used in the study (N = 8010)
                                the study (N = 342)   (N = 437)             (N = 7573)
Menopausal category - N (%)
  Postmen. before chemo         321 (93.9)            413 (94.5)            7279 (96.1)           7692 (96.0)
  Postmen. after chemo          10 (2.9)              11 (2.5)              181 (2.4)             192 (2.4)
  Premenopausal (ineligible)    0 (0.0)               2 (0.5)               21 (0.3)              23 (0.3)
  Uncertain status              10 (2.9)              10 (2.3)              92 (1.2)              102 (1.3)
  Unknown/missing               1 (0.3)               1 (0.2)               0                     1 (<0.1)
Age at randomization - years
  Median                        62                    62                    61                    61
  Range                         41-86                 41-86                 38-90                 38-90
Tumor size - N (%)
  ≤ 2 cm                        195 (57.0)            251 (57.4)            4706 (62.1)           4957 (61.9)
  > 2 cm                        144 (42.1)            179 (41.0)            2794 (36.9)           2973 (37.1)
  Unknown/missing               3 (0.9)               7 (1.6)               73 (1.0)              80 (1.0)
Tumor grade - N (%)
  Grade 1                       94 (27.5)             124 (28.4)            2007 (26.5)           2131 (26.6)
  Grade 2                       196 (57.3)            251 (57.4)            3649 (48.2)           3900 (48.7)
  Grade 3                       49 (14.3)             59 (13.5)             1166 (15.4)           1225 (15.3)
  Unknown/missing               3 (0.9)               3 (0.7)               751 (9.9)             754 (9.4)
Nodal status - N (%)
  Negative (including Nx)       186 (54.4)            245 (56.1)            4342 (57.3)           4587 (57.3)
  Positive                      152 (44.4)            188 (43.0)            3123 (41.2)           3311 (41.3)
  Unknown/missing               4 (1.2)               4 (1.0)               108 (1.4)             112 (1.4)
ER and PgR status - N (%)
  ER pos and PgR pos.           268 (78.4)            340 (77.8)            4715 (62.3)           5055 (63.1)
  ER pos and PgR neg.           66 (19.3)             87 (19.9)             1544 (20.4)           1631 (20.4)
  ER pos and PgR unknown        1 (0.3)               1 (0.2)               1153 (15.2)           1154 (14.4)
  ER neg and PgR pos.           5 (1.5)               7 (1.6)               136 (1.8)             143 (1.8)
  ER unknown, PgR pos.          0                     0                     7 (0.1)               7 (0.1)
  Other                         2 (0.6)               2 (0.5)               18 (0.3)              20 (0.2)
Local therapy - N (%)
  BCS and RT                    236 (69.0)            310 (70.9)            3987 (52.7)           4297 (53.7)
  BCS and no RT                 13 (3.8)              16 (3.7)              228 (3.0)             244 (3.0)
was also isolated from tumor-surrounding cells which
led to rather poor RNA recoveries from comparable tissue
areas (data not shown). However, this does not
exclude that tumor-surrounding cells may have a limited
impact on molecular scores in such analyses. Contamination
by non-tumor cells may be reduced by macrodissecting
tumors before RNA isolation and molecular
assessment. The same procedure would also make
tumors accessible to molecular analysis when sections
contain less than 30% tumor cells.
Classification of patients by low and high PRO_10 and
RISK_25 scores corresponded to low and high risk of
recurrence. PRO, RISK and PGR scores were prognostic
for DFS not only in the entire patient population but
also in a subpopulation of patients with node positive
disease (Figure 3D and 3E). We provide evidence independent
of Genomic Health™ that a RISK score based
on similar biological processes as the recurrence score
(RS), but with other genes selected through a different
procedure, can predict DFS [29,41,42]. In contrast to the
RS which was validated with tamoxifen-treated patients,
PRO_10, RISK_25 and PGR_5 scores were validated
with patients treated with tamoxifen, letrozole or a
sequence of both drugs; therefore, they may apply to
patients who received either of these drugs.
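The stratification used for the Kaplan-Meier plots (median score as cut-off, Figure 3) can be illustrated with a small product-limit estimator. The sketch below uses invented follow-up data; `kaplan_meier` and `split_by_median` are hypothetical helpers, and the log-rank test used in the study is not reproduced here.

```python
from statistics import median

def kaplan_meier(times, events):
    """Product-limit estimate: list of (event time, survival probability)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - failed / at_risk
        curve.append((t, surv))
    return curve

def split_by_median(scores):
    """Label each sample 'high' if its score exceeds the median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]

# Hypothetical data: score, months of follow-up, event indicator (1 = recurrence).
scores = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3, 2.7, 3.1]
months = [60, 60, 48, 60, 30, 55, 22, 15]
events = [0, 0, 1, 0, 1, 0, 1, 1]

groups = split_by_median(scores)
for label in ("low", "high"):
    idx = [i for i, g in enumerate(groups) if g == label]
    curve = kaplan_meier([months[i] for i in idx], [events[i] for i in idx])
    print(label, curve)
```

Censored patients (event indicator 0) contribute to the number at risk up to their last follow-up but never lower the curve themselves.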
Histological grading is an important factor in estimating
the risk of recurrence of patients with breast cancer
[2,43]. Recently, Sotiriou and colleagues have developed
the gene expression grade index (GGI) based on the
expression of 97 genes related to proliferation. They
demonstrated that grade II cancers are comprised of
tumors which are similar to genomic grade I or grade
III with corresponding clinical outcomes [16,44]. Our
findings agree with these observations as grade II
tumors could be further classified into low and high risk
of recurrence by 10 genes (PRO_10) (Figure 3C) or even
by 5 genes (PRO_5 score) (data not shown). Seven of
the PRO_10 and three of the PRO_5 genes are also part
of GGI. The PRO_5 genes (Table 1) corresponded to
the proliferation-related genes of the RS [29]. The
assessment of gene signatures related to proliferation
such as GGI or PRO scores is of special interest in ER
positive, grade II breast cancer, for which therapeutic
decisions are often difficult. Both GGI and RS were
shown to be associated with response to chemotherapy
[45,46]. In contrast to GGI, which requires FF tumor
material, PRO scores or RS can be determined from a
few microtome slices or cores such as those used for tissue
microarrays [47]. Material for molecular analysis can be
taken from the same FFPE tissue block used for histological
and immunohistochemical analyses without interfering
with clinicopathological workflow.
The prognostic value of Ki-67 in early breast cancer
was recently confirmed [48]. However, Ki-67 is not used
uniformly in clinical practice [49,50] as it appears to be
difficult to agree on cut-off values separating high and
low proliferation tumors or on its value in assisting the
choice of adjuvant therapy [50,51]. Therefore, instead of
dichotomizing Ki-67 it may be more feasible to use Ki-67
as a continuous variable [52]. Here, we made a comparison
between centrally assessed Ki-67 LI and a qRT-PCR-based
proliferation signature. The PRO_10 score
correlated with Ki-67 LI, and both were significant predictors
of DFS in univariate Cox analyses. In multivariate
models, however, PRO_10 offered superior
prognostic value and outperformed Ki-67 LI (Table 3).
Moreover, the PRO_10 score added independent prognostic
information to anatomical staging.
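When a marker enters a Cox model as a continuous variable, its hazard ratio refers to a one-unit increase, which makes hazard ratios of differently scaled covariates hard to compare at a glance. The sketch below rescales the univariate hazard ratio reported for Ki-67 LI in Table 3 and recovers an approximate Wald p-value from a reported 95% confidence interval. It assumes log-linear effects and Wald-type intervals that are symmetric on the log scale; the helper names are hypothetical, and results match the table only up to the rounding of the published figures.

```python
import math

def rescale_hr(hr_per_unit, units):
    """Hazard ratio for an increase of `units`, assuming a log-linear effect."""
    return hr_per_unit ** units

def wald_p_from_ci(hr, lower, upper):
    """Approximate two-sided Wald p-value from a hazard ratio and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * 1.96)  # SE of log(HR)
    z = math.log(hr) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Ki-67 LI, univariate: HR 1.02 per unit of labeling index (Table 3).
print(f"HR for a 10-unit increase in Ki-67 LI: {rescale_hr(1.02, 10):.2f}")

# Histological grade, univariate: HR 1.99 (1.26 - 3.14), reported p = 0.0032.
print(f"Approximate p for grade: {wald_p_from_ci(1.99, 1.26, 3.14):.4f}")
```

A per-unit hazard ratio close to 1 can therefore still correspond to a sizeable effect across the observed range of a continuous marker.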
PgR, as measured by immunohistochemistry [30] or
microarray analysis [53], was shown to positively correlate
with prognosis. Here we show that the molecular
PGR_5 score was also positively associated with DFS
(Figure 4) and added independent prognostic information
to anatomical staging and PRO_10 (Table 3). Thus,
PGR_5 and PRO_10 scores independently predict prognosis
in the BIG 1-98 population.
Compared to immunohistochemically assessed parameters,
qRT-PCR based scores are quantitative,
Table 2: Baseline characteristics. (Continued)
Local therapy - N (%) (continued)
  Mastectomy and RT             24 (7.0)              25 (5.7)              1415 (18.7)           1440 (18.0)
  Mastectomy and no RT          68 (19.9)             85 (19.5)             1926 (25.4)           2011 (25.1)
  Other                         1 (0.3)               1 (0.2)               17 (0.2)              18 (0.2)
Adjuvant or neoadjuvant chemo (or both) - N (%)
  Yes                           133 (38.9)            159 (36.4)            1865 (24.6)           2024 (25.3)
  No                            209 (61.1)            278 (63.6)            5708 (75.4)           5986 (74.7)
Abbreviations: BCS, breast conserving surgery; Nx, nodal status unknown; postmen., postmenopausal; RT, radiotherapy; PgR, progesterone receptor; pos., positive;
neg., negative
relatively independent of operator expertise and less
affected by inter-observer variability. The procedure is
simple, economical and can be standardized easily with
good control genes, reference samples and quality control
procedures.
The results of this study are based on a limited number
of patients and follow-up time (60 months). Similar
Figure 4 Expected rate of disease-free survival (DFS). The
expected rate of events at 60 months (solid line) is shown as a
function of PRO_10 (A), PGR_5 (B) and RISK_25 scores (C). The 95%
confidence intervals are indicated (dashed lines). Vertical lines
represent the median of all scores (solid line) and 25% and 75%
quantiles (dashed lines).
Table 3 Cox Proportional Hazard Analyses.
Covariate P-value HR (95% CI)
Univariate Analyses*
Clinicopathological Variables
HER2 0.7816 1.18 (0.36 - 3.84)
PgR 0.5147 0.78 (0.36 - 1.66)
Histological grade 0.0032 1.99 (1.26 - 3.14)
Ki-67 LI 0.0226 1.02 (1.00 - 1.04)
Tumor size 0.0047 1.22 (1.06 - 1.39)
Number of positive nodes <0.0001 1.13 (1.08 - 1.18)
Treatment (4 categories) 0.1540
Molecular scores
HER2_2 0.1080 1.20 (0.96 - 1.51)
PGR_5 0.0344 0.66 (0.44 - 0.97)
PRO_5 0.0003 2.14 (1.42 - 3.22)
PRO_10 <0.0001 2.09 (1.45 - 3.00)
RISK_25 0.0001 1.54 (1.24 - 1.91)
Multivariate Analyses: Comparison of PRO_10 and Ki-67 LI**
Model 1: log-likelihood = -179.38, Deviance = 188.11
Number of positive nodes <0.0001 1.19 (1.12 - 1.27)
Tumor size 0.0370 1.19 (1.01 - 1.39)
Grade 0.4200 1.25 (0.72 - 2.17)
Ki-67 LI 0.1300 1.02 (1.00 - 1.04)
Model 2: log-likelihood = -175.75, Deviance = 180.71
Number of positive nodes <0.0001 1.19 (1.12 - 1.27)
Tumor size 0.1300 1.14 (0.96 - 1.34)
Grade 0.9600 0.99 (0.55 - 1.76)
PRO_10 0.0092 2.12 (1.20 - 3.72)
Ki-67 LI 0.8100 1.00 (0.97 - 1.03)
Model 3: log-likelihood = -175.78, Deviance = 180.77
Number of positive nodes <0.0001 1.19 (1.12 - 1.27)
Tumor size 0.1200 1.14 (0.97 - 1.34)
Grade 0.9400 0.98 (0.55 - 1.74)
PRO_10 0.0026 2.03 (1.28 - 3.23)
Multivariate Analyses: Role of PGR_5***
Model 4: log-likelihood = -215.27, Deviance = 214.30
Number of positive nodes <0.0001 1.12 (1.07 - 1.16)
Tumor size 0.2000 1.11 (0.95 - 1.30)
Grade 0.0170 1.78 (1.11 - 2.87)
PGR_5 0.0570 0.68 (0.45 - 1.01)
Model 5: log-likelihood = -211.85, Deviance = 208.03
Number of positive nodes <0.0001 1.11 (1.06 - 1.16)
Tumor size 0.4300 1.07 (0.91 - 1.26)
Grade 0.3000 1.32 (0.78 - 2.23)
PRO_10 0.0092 1.73 (1.15 - 2.62)
PGR_5 0.0360 0.65 (0.43 - 0.97)
Model 6: log-likelihood = -214.10, Deviance = 211.25
Number of positive nodes <0.0001 1.11 (1.06 - 1.16)
Tumor size 0.1700 1.13 (0.95 - 1.34)
analyses with independent, larger sample sizes and more
mature follow-up data are planned to further consolidate
the prognostic and possibly predictive value of the
proposed scores in each treatment arm separately.
Gene expression profiling has improved the understanding
of molecular subtypes of breast cancer. FFPE
material is not widely used although it may facilitate
and speed up the development and validation of novel
gene signatures due to the availability of well-characterized
tissues from numerous clinical trials [54,55]. The
same material can be used for molecular diagnostics.
The investigation of gene signatures may become more
important in the future as an increasing proportion of
agents under development for breast cancer treatment
have defined molecular targets. Early integration of biomarker
analysis in the drug development process has
the potential to improve the specificity and efficiency of
novel therapeutics. This opens the possibility to further
individualize therapy of patients with breast cancer.
Conclusions
We define four molecular scores based on quantitative
measurement of gene expression with RNA derived from
FFPE tissues. The genes for each score were selected
from a large meta-analysis of microarrays. The genes do
not coincide with genes used for other molecular scores
like the RS (except genes that were previously used as
immunohistochemical markers such as ER, PgR or
Her2). Two of the described scores are shown to be
independent predictors of disease-free survival of postmenopausal
patients with operable, estrogen receptor
positive breast cancer. The proliferation-associated score
outperforms the Ki-67 labeling index measured by
immunohistochemistry.
List of abbreviations
AUC: area under the (ROC) curve; CI: confidence interval;
DFS: disease-free survival; ER: estrogen receptor; FF:
fresh frozen; FFPE: formalin-fixed, paraffin embedded;
HR: hazard ratio; IHC: immunohistochemistry; GGI:
gene expression grade index; LI: labeling index; LRT:
likelihood ratio test; PCR: polymerase chain reaction;
RIN: RNA integrity number; PgR: progesterone receptor;
ROC: receiver operating characteristic; RS: recurrence
score; TLDA: TaqMan Low Density Arrays.
Additional file 1: Publicly available gene expression data from breast
cancer studies.
[ http://www.biomedcentral.com/content/supplementary/1471-2407-10-37-S1.PDF ]
Additional file 2: Effect of normalization. Mean expression of 34
assays determined for 82 RNAs isolated from FFPE and from
corresponding FF tissue. Shown are the differences between FFPE and FF
before (Raw) and after normalization against the mean of three control
genes (UBB, RPLP0 and GUSB) (Normalized).
[ http://www.biomedcentral.com/content/supplementary/1471-2407-10-37-S2.PDF ]
Additional file 3: Unsupervised hierarchical clustering of data from
FF- and FFPE-derived RNA. Shown are heat maps based on normalized
expression from RNA of FF (A) and FFPE tissues (B). Proliferation (red
box), Her2 (blue box) and ER or PgR related genes (green box) are
indicated. The hormone receptor status of each tumor was also assessed
by IHC. ER negative (closed circles) and Her2 positive tumors (open
circles) are indicated.
[ http://www.biomedcentral.com/content/supplementary/1471-2407-10-37-S3.PDF ]
Additional file 4: Distribution of molecular scores. Shown are
histograms of ER, PGR and HER2 scores and fitted mixtures of Gaussian
distributions. Results of 82 matched samples are shown for ER_8 (A, B),
PGR_5 (C, D) and HER2_2 (E, F) scores derived from FF (A, C, E) and FFPE
tissues (B, D, F).
[ http://www.biomedcentral.com/content/supplementary/1471-2407-10-37-S4.PDF ]
Acknowledgements
We thank all patients, physicians and pathologists who participated in the
BIG 1-98 trial. We thank the IBCSG for providing material and corresponding
clinicopathological data. We thank Dr. M. Schobesberger and K. Price for
helpful discussions and critical reading of the manuscript and R. Kammler for
preparing FFPE material. This work was supported by Cancer Research
Switzerland/Oncosuisse, the NCCR “Molecular Oncology” and the Bernese
Cancer League. The IBCSG Statistical Center receives funding from the US
NCI (CA-75362). The BIG 1-98 trial was financed by Novartis and coordinated
by the International Breast Cancer Study Group (funded by the Swedish
Cancer Society, The Cancer Council Australia, Australian New Zealand Breast
Cancer Trials Group, Frontier Science and Technology Research Foundation,
Swiss Group for Clinical Cancer Research (SAKK), Cancer Research
Switzerland/Oncosuisse and the Foundation for Clinical Cancer Research of
Eastern Switzerland (OSKK)).
BIG 1-98 Steering Committee: B. Thürlimann (Chair), S. Aebi, L. Blacher, M.
Castiglione, A. S. Coates, T. Cufer, P. Dinh, J. F. Forbes, R. D. Gelber, A.
Giobbie-Hurder, A. Goldhirsch, A. Hiltbrunner, S. B. Holmberg, R. Maibach, A.
Martoni, L. Mauriac, G. McGrogan, H. T. Mouridsen, R. Paridaens, K. N. Price,
M. Rabaglio, B.B. Rasmussen, M.M. Regan, A. Santoro, I. E. Smith, A. Wardley,
G. Viale. Novartis: H. A. Chaudri-Ross, R. Dias, D. B. Evans, C. Sguotti, U.
Trostmann
Swiss Group for Clinical Cancer Research (SAKK): A. Goldhirsch (up to
January 2004), R. Herrmann (from June 2004): Kantonsspital Aarau,
Zentrum f. Onkologie, Aarau: A. Schönenberger, W. Mingrone, Ch.
Honegger, E. Bärtschi, M. Neter, M. Rederer, G. Schär; University Hospital
Basel, Basel: C. Rochlitz, R. Herrmann, D. Oertli, E. Wight, H. Moch; Institute
of Oncology of Southern Switzerland: Ospedale San Giovanni,
Table 3: Cox Proportional Hazard Analyses. (Continued)
Grade 0.2100 1.40 (0.83 - 2.37)
PRO_10 0.0150 1.71 (1.11 - 2.62)
*Histological grading was analyzed according to three categories (histological
grade I, II or III). Number of lymph node metastases and tumor size were
continuous variables. PgR and Her2 were centrally assessed and binary IHC
data were included in the analyses [30,31]. Centrally assessed Ki-67 labeling
index and molecular scores were included as continuous variables.
**Data of 299 patients with available Ki-67 LI were included in models 1, 2 and
3.
***Data of all 342 patients were included in models 4, 5 and 6.
Abbreviations: HR, hazard ratio; CI, confidence interval; LRT, likelihood ratio
test; Ki-67 LI, Ki-67 labeling index.
Models 3 and 6 should not be compared directly as they were fitted on
different sample sizes, due to missing data in Ki-67 LI.
Bellinzona: J. Bernier, L. Bronz, F. Cavalli, E. Gallerani, A. Richetti, A. Franzetti;
Ospedale Regionale di Lugano (Civico & Italiano), Lugano: M. Conti-Beltraminelli,
M. Ghielmini, T. Gyr, S. Mauri, P. C. Saletti; Ospedale Regionale
Beata Vergine, Mendrisio: A. Goldhirsch, O. Pagani, R. Graffeo, M. Locatelli,
S. Longhi, P.C. Rey, M. Ruggeri; Ospedale Regionale La Carità, Locarno: E.
Zucca, D. Wyss; Istituto Cantonale di Patologia, Locarno: L. Mazzucchelli,
E. Pedrinis, T. Rusca; Inselspital, Berne: S. Aebi, M. F. Fey, M. Castiglione, M.
Rabaglio; Kantonsspital Olten, Olten: S. Aebi, M. F. Fey, M. Zuber, G. Beck;
Bürgerspital, Solothurn: S. Aebi, M. F. Fey, R. Schönenberger; Spital Thun-Simmental
AG Thun: J.M. Lüthi, D. Rauch; Hôpital Cantonal Universitaire
HCUG, Geneva: H. Bonnefoi; Rätisches Kantons- und Regionalspital, Chur:
F. Egli, R. Steiner, P. Fehr; Centre Pluridisciplinaire d’Oncologie, Lausanne:
L. Perey, P. de Grandi, W. Jeanneret, S. Leyvraz, J.-F. Delaloye; Kantonsspital
St. Gallen, St. Gallen: B. Thürlimann, D. Köberle, F. Weisser, S. Mattmann, A.
Müller, T. Cerny, B. Späti, M. Höfliger, G. Fürstenberger, B. Bolliger, C.
Öhlschlegel, U. Lorenz, M. Bamert, J. Kehl-Blank, E. Vogel; Kantonales Spital
Herisau, Herisau: B. Thürlimann, D. Hess, I. Senn, D. Köberle, A. Ehrsam, C.
Nauer, C. Öhlschlegel, J. Kehl-Blank, E. Vogel; Stadtspital Triemli, Zürich: L.
Widmer, M. Häfner; Universitätsspital Zürich, Zürich: B. C. Pestalozzi, M.
Fehr, R. Caduff, Z. Varga, R. Trüb, D. Fink.
Swiss Private MDs: Private Praxis, Zürich: B. A. Bättig; Sonnenhof-Klinik
Engeried, Berne: K. Buser; Frauenklinik Limmattalspital, Schlieren: N. Bürki;
Private Praxis, Birsfelden: A. Dieterle; Private Praxis, Biel: L. Hasler; Private
Praxis, Baar: M. Mannhart-Harms; Brust-Zentrum, Zürich: C. Rageth; Private
Praxis, Berne: J. Richner; Private Praxis, Bellinzona: V. Spataro; Private Praxis,
Winterthur: M. Umbricht.
Author details
1 Department of Clinical Research, University of Bern, Bern, Switzerland.
2 National Center of Competence in Research (NCCR) Molecular Oncology,
Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
3 Senology Center of Eastern Switzerland, Kantonsspital, St. Gallen, Switzerland.
4 International Breast Cancer Study Group Statistical Center, Dana-Farber
Cancer Institute, Boston, MA, USA.
5 Division of Pathology and Laboratory Medicine, European Institute of Oncology,
University of Milan, Milan, Italy.
6 Pathology Länggasse, Bern, Switzerland.
7 Medical Oncology, University Hospital Bern, Bern, Switzerland.
8 Swiss Group of Clinical Cancer Research (SAKK), Bern, Switzerland.
Authors’ contributions
JA, SA and RJ organized the study, planned the experiments and wrote the
manuscript. SA and BT organized samples from the International Breast
Cancer Study Group. AO and AB carried out RNA isolations, quality controls
and gene expression measurements. VP, PW, MD and AGH carried out the
statistical analyses. HJA and GV were responsible for histological assessment
of stained sections. All authors contributed to the manuscript and read and
approved the final version.
Competing interests
JA, VP, MD, PW, AB, AO, AGH, GV, HJA, SA and RJ declare that they have no
competing interests. B.T. has held stock in Novartis (Ciba Geigy) since 1990.
Received: 10 September 2009
Accepted: 9 February 2010 Published: 9 February 2010
References
1. Carlson RW, Jahanzeb M, Kiel K, Marks LB, McCormick B, Pierce LJ, Ward JH,
Topham NS: NCCN Clinical Practice Guidelines in Oncology V.2.2008.
http://www.nccn.org.
2. Goldhirsch A, Wood WC, Gelber RD, Coates AS, Thurlimann B, Senn HJ:
Progress and promise: highlights of the international expert consensus
on the primary therapy of early breast cancer 2007. Ann Oncol 2007,
18:1133-1144.
3. Blamey RW, Pinder SE, Ball GR, Ellis IO, Elston CW, Mitchell MJ, Haybittle JL:
Reading the prognosis of the individual with breast cancer. Eur J Cancer
2007, 43:1545-1547.
4. Galea MH, Blamey RW, Elston CE, Ellis IO: The Nottingham Prognostic
Index in primary breast cancer. Breast Cancer Res Treat 1992, 22:207-219.
5. Olivotto IA, Bajdik CD, Ravdin PM, Speers CH, Coldman AJ, Norris BD,
Davis GJ, Chia SK, Gelmon KA: Population-based validation of the
prognostic model ADJUVANT! for early breast cancer. J Clin Oncol 2005,
23:2716-2725.
6. Ravdin PM, Siminoff LA, Davis GJ, Mercer MB, Hewlett J, Gerson N,
Parker HL: Computer program to assist in making decisions about
adjuvant therapy for women with early breast cancer. J Clin Oncol 2001,
19:980-991.
7. Ellis MJ, Tao Y, Luo J, A’Hern R, Evans DB, Bhatnagar AS, Chaudri Ross HA,
von Kameke A, Miller WR, Smith I, et al: Outcome prediction for estrogen
receptor-positive breast cancer based on postneoadjuvant endocrine
therapy tumor characteristics. J Natl Cancer Inst 2008, 100:1380-1388.
8. Andre F, Pusztai L: Molecular classification of breast cancer: implications
for selection of adjuvant chemotherapy. Nat Clin Pract Oncol 2006,
3:621-632.
9. Pusztai L, Mazouni C, Anderson K, Wu Y, Symmans WF: Molecular
classification of breast cancer: limitations and potential. Oncologist 2006,
11:868-877.
10. Brenton JD, Carey LA, Ahmed AA, Caldas C: Molecular classification and
molecular forecasting of breast cancer: ready for clinical application?. J
Clin Oncol 2005, 23:7350-7360.
11. Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d’Assignies MS,
Bergh J, Lidereau R, Ellis P, et al: Validation and clinical utility of a 70-gene
prognostic signature for women with node-negative breast cancer.
J Natl Cancer Inst 2006, 98:1183-1192.
12. Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G,
Delorenzi M, Zhang Y, d’Assignies MS, et al: Strong time dependence of
the 76-gene prognostic signature for node-negative breast cancer
patients in the TRANSBIG multicenter independent validation series. Clin
Cancer Res 2007, 13:3207-3214.
13. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR,
Ross DT, Johnsen H, Akslen LA, et al: Molecular portraits of human breast
tumours. Nature 2000, 406:747-752.
14. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S,
Johnsen H, Pesich R, Geisler S, et al: Repeated observation of breast
tumor subtypes in independent gene expression data sets. Proc Natl
Acad Sci USA 2003, 100:8418-8423.
15. Sotiriou C, Piccart MJ: Taking gene-expression profiling to the clinic:
when will molecular signatures become relevant to patient care?. Nat
Rev Cancer 2007, 7:545-553.
16. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P,
Praz V, Haibe-Kains B, et al: Gene expression profiling in breast cancer:
understanding the molecular basis of histologic grade to improve
prognosis. J Natl Cancer Inst 2006, 98:262-272.
17. Stadler ZK, Come SE: Review of gene-expression profiling and its clinical
use in breast cancer. Crit Rev Oncol Hematol 2008, 1-11.
18. van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW,
Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al: A gene-expression
signature as a predictor of survival in breast cancer. N Engl J Med 2002,
347:1999-2009.
19. van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL,
van der Kooy K, Marton MJ, Witteveen AT, et al: Gene expression profiling
predicts clinical outcome of breast cancer. Nature 2002, 415:530-536.
20. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D,
Timmermans M, Meijer-van Gelder ME, Yu J, et al: Gene-expression profiles
to predict distant metastasis of lymph-node-negative primary breast
cancer. Lancet 2005, 365:671-679.
21. Sparano JA, Paik S: Development of the 21-gene assay and its application
in clinical practice and clinical trials. J Clin Oncol 2008, 26:721-728.
22. Cardoso F, Van’t Veer L, Rutgers E, Loi S, Mook S, Piccart-Gebhart MJ:
Clinical application of the 70-gene profile: the MINDACT trial. J Clin
Oncol 2008, 26:729-735.
23. Viale G, Giobbie-Hurder A, Regan MM, Coates AS, Mastropasqua MG,
Dell’Orto P, Maiorano E, MacGrogan G, Braye SG, Ohlschlegel C, et al:
Prognostic and predictive value of centrally reviewed Ki-67 labeling
index in postmenopausal women with endocrine-responsive breast
cancer: results from Breast International Group Trial 1-98 comparing
adjuvant tamoxifen with letrozole. J Clin Oncol 2008, 26:5569-5575.
24. Coates AS, Keshaviah A, Thurlimann B, Mouridsen H, Mauriac L, Forbes JF,
Paridaens R, Castiglione-Gertsch M, Gelber RD, Colleoni M, et al: Five years
of letrozole compared with tamoxifen as initial adjuvant therapy for
postmenopausal women with endocrine-responsive early breast cancer:
update of study BIG 1-98. J Clin Oncol 2007, 25:486-492.
25. Thurlimann B, Keshaviah A, Coates AS, Mouridsen H, Mauriac L, Forbes JF,
Paridaens R, Castiglione-Gertsch M, Gelber RD, Rabaglio M, et al: A
comparison of letrozole and tamoxifen in postmenopausal women with
early breast cancer. N Engl J Med 2005, 353:2747-2757.
26. Mouridsen H, Giobbie-Hurder A, Goldhirsch A, Thurlimann B, Paridaens R,
Smith I, Mauriac L, Forbes JF, Price KN, Regan MM, et al: Letrozole therapy
alone or in sequence with tamoxifen in women with breast cancer. N
Engl J Med 2009, 361:766-776.
27. Oberli A, Popovici V, Delorenzi M, Baltzer A, Antonov J, Matthey S, Aebi S,
Altermatt HJ, Jaggi R: Expression profiling with RNA from formalin-fixed,
paraffin-embedded material. BMC Med Genomics 2008, 1:1-9.
28. Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B,
Desmedt C, Ignatiadis M, Sengstag T, Schutz F, et al: Meta-analysis of
gene-expression profiles in breast cancer: toward a unified
understanding of breast cancer sub-typing and prognosis signatures.
Breast Cancer Res 2008, 10:R65.
29. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG,
Watson D, Park T, et al: A multigene assay to predict recurrence of
tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004,
351:2817-2826.
30. Viale G, Regan MM, Maiorano E, Mastropasqua MG, Dell’Orto P,
Rasmussen BB, Raffoul J, Neven P, Orosz Z, Braye S, et al: Prognostic and
predictive value of centrally reviewed expression of estrogen and
progesterone receptors in a randomized trial comparing letrozole and
tamoxifen adjuvant therapy for postmenopausal early breast cancer: BIG
1-98. J Clin Oncol 2007, 25:3846-3852.
31. Rasmussen BB, Regan MM, Lykkesfeldt AE, Dell’Orto P, Del Curto B,
Henriksen KL, Mastropasqua MG, Price KN, Mery E, Lacroix-Triki M, et al:
Adjuvant letrozole versus tamoxifen according to centrally-assessed
ERBB2 status for postmenopausal women with endocrine-responsive
early breast cancer: supplementary results from the BIG 1-98
randomised trial. Lancet Oncol 2008, 9:23-28.
32. Viale G, Regan MM, Mastropasqua MG, Maffini F, Maiorano E, Colleoni M,
Price KN, Golouh R, Perin T, Brown RW, et al: Predictive value of tumor Ki-
67 expression in two randomized trials of adjuvant chemoendocrine
therapy for node-negative breast cancer. J Natl Cancer Inst 2008,
100:207-212.
33. Hayes DF, Thor AD, Dressler LG, Weaver D, Edgerton S, Cowan D,
Broadwater G, Goldstein LJ, Martino S, Ingle JN, et al: HER2 and response
to paclitaxel in node-positive breast cancer. N Engl J Med 2007,
357:1496-1506.
34. Schemper M, Smith TL: A note on quantifying follow-up in studies of
failure time. Control Clin Trials 1996, 17:343-346.
35. Cox DR: Regression models and life-tables. J R Stat Soc B 1972,
34:187-220.
36. Elston CW, Ellis IO: Pathological prognostic factors in breast cancer. I. The
value of histological grade in breast cancer: experience from a large
study with long-term follow-up. Histopathology 1991, 19:403-410.
37. Ma XJ, Hilsenbeck SG, Wang W, Ding L, Sgroi DC, Bender RA, Osborne CK,
Allred DC, Erlander MG: The HOXB13:IL17BR expression index is a
prognostic factor in early-stage breast cancer. J Clin Oncol 2006,
24:4611-4619.
38. Ma XJ, Salunga R, Dahiya S, Wang W, Carney E, Durbecq V, Harris A, Goss P,
Sotiriou C, Erlander M, Sgroi D: A five-gene molecular grade index and
HOXB13:IL17BR are complementary prognostic factors in early stage
breast cancer. Clin Cancer Res 2008, 14:2601-2608.
39. Cronin M, Pho M, Dutta D, Stephans JC, Shak S, Kiefer MC, Esteban JM,
Baker JB: Measurement of gene expression in archival paraffin-embedded
tissues: development and performance of a 92-gene reverse
transcriptase-polymerase chain reaction assay. Am J Pathol 2004,
164:35-42.
40. von Ahlfen S, Missel A, Bendrat K, Schlumpberger M: Determinants of RNA
quality from FFPE samples. PLoS ONE 2007, 2:e1261.
41. Goldstein LJ, Gray R, Badve S, Childs BH, Yoshizawa C, Rowley S, Shak S,
Baehner FL, Ravdin PM, Davidson NE, et al: Prognostic Utility of the 21-Gene
Assay in Hormone Receptor-Positive Operable Breast Cancer
Compared With Classical Clinicopathologic Features. J Clin Oncol 2008,
26:4063-4071.
42. Paik S: Methods for gene expression profiling in clinical trials of adjuvant
breast cancer therapy. Clin Cancer Res 2006, 12:1019s-1023s.
43. Carlson RW, Allred DC, Anderson BO, Burstein HJ, Carter WB, Edge SB,
Erban JK, Farrar WB, Goldstein LJ, Gradishar WJ, et al: NCCN Practice
Guidelines in Oncology: Breast Cancer., v.1 2009.
44. Desmedt C, Giobbie-Hurder A, Neven P, Paridaens R, Christiaens MR,
Smeets A, Lallemand F, Haibe-Kains B, Viale G, Gelber RD, et al: The Gene
expression Grade Index: a potential predictor of relapse for endocrine-treated
breast cancer patients in the BIG 1-98 trial. BMC Med Genomics
2009, 2:40.
45. Paik S, Tang G, Shak S, Kim C, Baker J, Kim W, Cronin M, Baehner FL,
Watson D, Bryant J, et al: Gene expression and benefit of chemotherapy
in women with node-negative, estrogen receptor-positive breast cancer.
J Clin Oncol 2006, 24:3726-3734.
46. Liedtke C, Hatzis C, Symmans WF, Desmedt C, Haibe-Kains B, Valero V,
Kuerer H, Hortobagyi GN, Piccart-Gebhart M, Sotiriou C, Pusztai L: Genomic
grade index is associated with response to chemotherapy in patients
with breast cancer. J Clin Oncol 2009, 27:3185-3191.
47. Schobesberger M, Baltzer A, Oberli A, Kappeler A, Gugger M, Burger H,
Jaggi R: Gene expression variation between distinct areas of breast
cancer measured from paraffin-embedded tissue cores. BMC Cancer 2008,
8:343.
48. Viale G, Giobbie-Hurder A, BIG 1-98 Collaborative Group and International
Breast Cancer Study Group (IBCSG): Value of centrally-assessed Ki-67
labeling index as a marker of prognosis and predictor of response to
adjuvant endocrine therapy in the BIG 1-98 trial of postmenopausal
women with estrogen receptor-positive breast cancer. Breast Cancer Res
Treat 2007, 106(Supplement 1):S17, Abstract 64.
49. Carlson RW, Allred DC, Anderson BO, Burstein HJ, Carter WB, Edge SB,
Erban JK, Farrar WB, Goldstein LJ, Gradishar WJ, et al: Breast cancer. Clinical
practice guidelines in oncology. J Natl Compr Canc Netw 2009, 7:122-192.
50. de Azambuja E, Cardoso F, de Castro G Jr, Colozza M, Mano MS, Durbecq V,
Sotiriou C, Larsimont D, Piccart-Gebhart MJ, Paesmans M: Ki-67 as
prognostic marker in early breast cancer: a meta-analysis of published
studies involving 12,155 patients. Br J Cancer 2007, 96:1504-1513.
51. Whitfield ML, George LK, Grant GD, Perou CM: Common markers of
proliferation. Nat Rev Cancer 2006, 6:99-106.
52. Urruticoechea A, Smith IE, Dowsett M: Proliferation marker Ki-67 in early
breast cancer. J Clin Oncol 2005, 23:7212-7220.
53. Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P,
Harris A, Bergh J, Foekens JA, et al: Definition of clinically distinct
molecular subtypes in estrogen receptor-positive breast carcinomas
through genomic grade. J Clin Oncol 2007, 25:1239-1246.
54. Chang JC, Makris A, Gutierrez MC, Hilsenbeck SG, Hackett JR, Jeong J,
Liu ML, Baker J, Clark-Langone K, Baehner FL, et al: Gene expression
patterns in formalin-fixed, paraffin-embedded core biopsies predict
docetaxel chemosensitivity in breast cancer patients. Breast Cancer Res
Treat 2008, 108:233-240.
55. Paik S: Molecular assays to predict prognosis of breast cancer. Clin Adv
Hematol Oncol 2007, 5:681-682.
Pre-publication history
The pre-publication history for this paper can be accessed here:
http://www.biomedcentral.com/1471-2407/10/37/prepub
doi:10.1186/1471-2407-10-37
Cite this article as: Antonov et al.: Molecular risk assessment of BIG 1-98
participants by expression profiling using RNA from archival tissue. BMC
Cancer 2010 10:37.
16 Joint analysis of histopathology image features
and gene expression in breast cancer
• BMC Bioinformatics, 17:209, 2016
• IF: 2.448
• number of citations: 1
• personal contribution (80%): image analysis method design, data
collection and processing, experimental design and implementation,
statistical analyses and results interpretation, manuscript
writing
Popovici et al. BMC Bioinformatics (2016) 17:209
DOI 10.1186/s12859-016-1072-z
RESEARCH ARTICLE Open Access
Joint analysis of histopathology image
features and gene expression in breast cancer
Vlad Popovici1*, Eva Budinská1,2, Lenka Čápková1, Daniel Schwarz1, Ladislav Dušek1, Josef Feit1
and Rolf Jaggi3
Abstract
Background: Genomics and proteomics are nowadays the dominant techniques for novel biomarker discovery.
However, histopathology images contain a wealth of information related to the tumor histology, morphology and
tumor-host interactions that is not accessible through these techniques. Thus, integrating the histopathology images
in the biomarker discovery workflow could potentially lead to the identification of new image-based biomarkers and
the refinement or even replacement of the existing genomic and proteomic signatures. However, extracting
meaningful and robust image features to be mined jointly with genomic (and clinical, etc.) data represents a real
challenge due to the complexity of the images.
Results: We developed a framework for integrating the histopathology images in the biomarker discovery workflow
based on the bag-of-features approach – a method that has the advantage of being assumption-free and data-driven.
The images were reduced to a set of salient patterns and additional measurements of their spatial distribution, with
the resulting features being directly used in a standard biomarker discovery application. We demonstrated this
framework in a search for prognostic biomarkers in breast cancer which resulted in the identification of several
prognostic image features and a promising multimodal (imaging and genomic) prognostic signature. The source
code for the image analysis procedures is freely available.
Conclusions: The framework proposed allows for a joint analysis of images and gene expression data. Its application
to a set of breast cancer cases resulted in image-based and combined (image and genomic) prognostic scores for
relapse-free survival.
Keywords: Histopathology images, Image analysis, Biomarker discovery, Gene expression, Multimodal data mining
*Correspondence: popovici@iba.muni.cz
1Institute of Biostatistics and Analyses, Faculty of Medicine, Masarykova
Univerzita, Kamenice 5, 62500 Brno, Czech Republic
Full list of author information is available at the end of the article
© 2016 Popovici et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
Recent technological progress has made whole-slide scanning of
pathology slides affordable and its integration in
the routine pathology workflow feasible. This resulted
in a revived interest in developing new computational
methods for nuclear morphometry and tissue architecture
characterization, as well as for developing new tissue-based
biomarkers [1]. In the last decade, genomic and
proteomic techniques have been the methods of choice
for novel biomarker discovery. When applied to the same
sample, the pathology imaging and *omics technologies
allow the investigation of the underlying biology from
different perspectives, increasing the chances of identifying
effective biomarkers. Ideally, these perspectives
could be integrated in a common data analytical framework,
to enable joint (or multimodal) data mining and
decision making [2].
Traditionally, the methods for analyzing pathology
images focused on extracting quantitative measures for a
set of predefined morphological parameters (e.g. counting,
classifying and characterizing the nuclei) and on
reproducing the expert’s decision in diagnostic applications
(for a review see Gurcan et al. [3]). More recently, a
number of applications of pathology image analysis combined
image-based quantitative features with genomic
information. For example, Yuan et al. [4] showed that
nuclear morphometry is an independent prognostic factor
that can improve a genomic signature. A similar approach
is discussed by Kong et al. [5] in the case of glioblastoma
where they show how nuclear and cytoplasmic
features can be linked to genomic profiles and survival
outcome. More advanced techniques combine several
image-derived characteristics, such as co-localization
of tumor nuclei and lymphocyte infiltration [6]. In all
these cases however, the imaging features were predefined
and based on previously known associations between
histopathology and diagnosis or prognosis.
Our interest is in developing a more general computational
framework that would allow the integration of
the standard histopathology images in the biomarker discovery
workflow and in which the image features would
be learned in a data-driven fashion, enabling a prior-free
data mining. The main challenge when analyzing the
pathology images stems from their high complexity and
size, and seeming incompatibility with *omics data. In
the present work we propose to use the bag-of-features
approach [7] for reducing the dimensionality of the images
and extracting salient features. This approach has already
been used in histopathology image classification applications
[8, 9] and has the main advantage of allowing
an unsupervised learning of image representation. The
features extracted describe mostly the textural appearance
of small neighborhoods and may be combined with
other types of features (e.g. nuclear morphometry) in later
stages of image analysis, but these approaches will not
be discussed here. As an alternative to bag-of-features,
one could use deep learning methods for learning image
features as proposed by Cireşan et al. [10] or Cruz-Roa
et al. [11]. However, these methods require a larger
sample size and were applied in a supervised learning
context.
We propose a novel representation of histopathology
images which extends the standard bag-of-features with
a number of derived measurements aimed at capturing
more global characteristics of the tissue sample. In addition,
we introduce an objective criterion for optimizing
the image representation. The new computational framework
is demonstrated in a biomarker discovery scenario,
where prognostic features (both imaging and gene expression)
for relapse-free survival in breast cancer are sought.
We see the application of this approach as a succession
of two independent steps, not necessarily performed on
the same data corpus. In the first step, a histopathology
image representation is learned from a collection of
images representative for the pathology under investigation.
In the second step, the images of interest are recoded
based on the constructed representation and the resulting
image features are jointly analyzed with the molecular and
clinical data.
Methods
Data
The data used in this study is a subset of the data from
Moor et al. [12], selected solely based on the availability of
the material for analysis. Overall there were n = 196 standard
pathology (haematoxylin-eosin-stained) slides with
breast tissue sections, not all containing a tumoral component
and not necessarily from different cases. All images
were obtained by whole-slide scanning of the pathology
slides at 40× magnification, resulting in color images of
about 150,000 × 100,000 pixels.
These data were partitioned into an image model learning
set (n = 131) and a biomarker discovery/data mining
set (n = 65). In the biomarker discovery set we kept
unique cases for which the slides contained > 70 % tumor
component and the clinical, survival and gene expression
data were all available. The expression profiles of 47 target
genes (including 5 control genes) were obtained by quantitative
real-time PCR (qRT-PCR). A full description of the
data set is available in Moor et al. [12] and the major characteristics
of the biomarker discovery set used here are
given in Additional file 1.
We computed the genomic prognostic signature
(PRO_10) as described in Antonov et al. [13] for all the
cases with full genomic profiles.
Image processing
Preprocessing
All images were downscaled to an equivalent of 2.5× magnification
by subsampling the Gaussian-filtered higher
resolution images (the 4th level in a Gaussian pyramid).
In the resulting images a mask corresponding to the tissue
regions was obtained by adaptive thresholding in the
green channel. The mask was subsequently refined by
morphological operations: erosion with a circular structuring
element with radius 13 followed by gap filling and
removal of small objects.
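As a worked check of the downscaling step, the following minimal sketch computes which Gaussian pyramid level corresponds to going from 40× to 2.5× magnification and the approximate size of the resulting image. The function names are hypothetical, and the sketch assumes that each pyramid level halves the resolution.

```python
import math

def pyramid_level(native_mag, target_mag):
    """Number of 2x downsampling steps between two magnifications."""
    factor = native_mag / target_mag
    level = round(math.log2(factor))
    if 2 ** level != factor:
        raise ValueError("magnification ratio is not a power of two")
    return level

def downscaled_size(width, height, level):
    """Approximate image size after `level` successive halvings."""
    scale = 2 ** level
    return width // scale, height // scale

# 40x scans reduced to the equivalent of 2.5x magnification
level = pyramid_level(40.0, 2.5)
size = downscaled_size(150_000, 100_000, level)
```

Under these assumptions a scan of about 150,000 × 100,000 pixels shrinks by a factor of 16 in each dimension, which is what makes whole-slide processing tractable.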
For each image we estimated the intensity of haematoxylin
(H) staining by deconvolving the RGB-images as
described by Ruifrok et al. [14]. The intensity levels of
the haematoxylin image (H-image) were adjusted by adaptive
histogram equalization. Finally, the background pixels
were masked out using the tissue region mask computed
as above. In all subsequent image processing steps, only
the H-images were used.
Learning the image representation
The bag-of-features [7] approach has two main stages:
(i) learning an appropriate codebook for representing the
images of interest and (ii) re-coding the images based
on the frequencies of each codeblock (codeword from
the codebook). Thus, the resulting representation of the
image is a histogram of the codeblocks. For the current
application, we extended this representation to include
several derived features. We point out that once an appropriate
image representation is learned, it can be applied
unchanged to other similar image collections; thus this
step does not need to be repeated on each new data set.
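The re-coding stage (ii) can be sketched as a nearest-codeblock assignment followed by a frequency count. The example below is a minimal illustration with made-up two-dimensional descriptors; the function names are hypothetical and the squared Euclidean distance is assumed as the matching criterion.

```python
def nearest_codeblock(descriptor, codebook):
    """Index of the codeblock closest (squared Euclidean) to a descriptor."""
    best, best_d = 0, float("inf")
    for k, center in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(descriptor, center))
        if d < best_d:
            best, best_d = k, d
    return best

def codeblock_histogram(descriptors, codebook):
    """Relative frequency of each codeblock over the patches of one image."""
    counts = [0] * len(codebook)
    for desc in descriptors:
        counts[nearest_codeblock(desc, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data: three codeblocks and four patch descriptors (illustrative only)
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
patches = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [3.8, 0.1]]
hist = codeblock_histogram(patches, codebook)
```

Because the codebook is fixed once learned, the same two functions can be applied to any new image collection without repeating the clustering.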
Codebook learning The codebook is a collection of representative
local descriptors {C1, . . . , CK } obtained as centers
of K clusters resulting from k-means clustering of a
number of image local descriptors (i.e. a vector quantization
procedure). For this, the images are decomposed
in a set of local neighborhoods for which descriptor vectors
are computed. The local descriptors range from pixel
intensities to responses to filter banks or other textural
descriptors. For the histopathology images, the Gabor
wavelets provide a good set of descriptors, so they were
adopted in the present work. Each local neighborhood
of size w × w was convolved with a bank of 24 Gabor
filters [15],
\[
G(x, y; \nu, \theta, \sigma) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \times \exp\bigl(2\pi\nu j\,(x\cos\theta + y\sin\theta)\bigr)
\]
where $j = \sqrt{-1}$, $\nu$ was the frequency, $\theta$ the orientation and
$\sigma$ the bandwidth of the Gaussian kernel. These parameters
were set to $\sigma \in \{1, 2\sqrt{2}\}$, $\theta \in \{k\pi/4 \mid k = 0, \ldots, 3\}$ and
$\nu \in \{3/4, 3/8, 3/16\}$, respectively. They were kept fixed
throughout all the experiments. For each filter response,
its mean and standard deviation were recorded, thus each
local neighborhood w × w was represented by 48 values
(24 means and 24 standard deviations). A comparison of
Gabor wavelets with other local descriptors, in the context
of histopathology image analysis, is given by Budinská
et al. [9].
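The construction of the 48-value descriptor can be sketched in plain Python as follows. Several details are assumptions made for illustration only: the kernels are truncated at about three standard deviations, the statistics are taken over the magnitude of the complex response, and only positions where the kernel fits entirely inside the patch are used.

```python
import cmath
import math
from itertools import product
from statistics import mean, pstdev

SIGMAS = (1.0, 2.0 * math.sqrt(2.0))
THETAS = tuple(k * math.pi / 4.0 for k in range(4))
FREQS = (3 / 4, 3 / 8, 3 / 16)

def gabor_kernel(nu, theta, sigma, half=None):
    """Complex Gabor kernel sampled on a (2*half+1) x (2*half+1) grid."""
    if half is None:
        half = math.ceil(3 * sigma)  # assumed truncation radius
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            phase = 2.0 * math.pi * nu * (x * math.cos(theta) + y * math.sin(theta))
            row.append(envelope * cmath.exp(1j * phase))
        kernel.append(row)
    return kernel

def filter_magnitudes(patch, kernel):
    """Magnitudes of the 'valid' correlation of a patch with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = len(patch), len(patch[0])
    out = []
    for i in range(ph - kh + 1):
        for j in range(pw - kw + 1):
            acc = 0j
            for u in range(kh):
                for v in range(kw):
                    acc += patch[i + u][j + v] * kernel[u][v]
            out.append(abs(acc))
    return out

def gabor_descriptor(patch):
    """24 response means followed by 24 response standard deviations."""
    bank = [gabor_kernel(nu, th, s) for s, th, nu in product(SIGMAS, THETAS, FREQS)]
    responses = [filter_magnitudes(patch, k) for k in bank]
    return [mean(r) for r in responses] + [pstdev(r) for r in responses]
```

The bank size follows directly from the parameter grid: 2 bandwidths × 4 orientations × 3 frequencies give 24 filters, hence 48 values per neighborhood. In practice an array library would replace the explicit loops.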
The size of the codebook (i.e. the number of clusters in
k-means clustering), K, is a free parameter that has to be
chosen/guessed at the moment of codebook construction
[8]. It can also be optimized for the problem at hand [9]
using, for example, the Gap statistic [16]. Here we took
advantage of having available a number of examples for
different tissue components (fat, fat foamy macrophages,
comedo necrosis, connective tissue and carcinoma infiltrating
fat – for examples see Additional file 1) which we
used as reference categories. The goal was to choose the
size of the dictionary K in such a way that the representations
of these categories are sparse and have a minimal
overlap. For each image i, let $y_i = \{j \mid \text{codeblock } C_j \text{ is used in coding sample } i\}$
be the set of codeblocks
used in its coding. Then we define the following quantities
(where | · | denotes the cardinality of a set):
• total Jaccard index,
\[
J(K) = 0.5 \sum_{(i,j)} \frac{|y_i \cap y_j|}{|y_i \cup y_j|},
\]
where the sum is taken over all pairs (i, j) of images
from different reference categories;
• total sum of within-cluster distances,
\[
D(K) = \sum_{k=1}^{K} \sum_{i \in \text{cluster } k} \| x_i - C_k \|^2,
\]
where $x_i$ are the descriptor vectors.
With these quantities, we defined an (empirical) objective
function:
\[
\Phi(K) = \log\frac{n_c(n_c - 1)}{2} - \log J(K) - \log D(K) - 0.75 \log K,
\]
where $n_c$ is the number of reference categories (in our case
$n_c = 5$). The overall goal of our image recoding step is
to find a low dimensional (sparse) representation which
still bears enough information for discriminating major
tissue components. For this, we minimize $J(K)$, i.e. the
overlap between the representations of the reference categories.
At the same time, we require tight clusters (small
within-cluster total distances $D(K)$) and sparse representation
(small K). Hence, the desired value for K is the one
that maximizes $\Phi(K)$, where we note that the first term is
constant (included to bring the values closer to 0) and that
the scaling factor 0.75 is used to reduce the influence of K.
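The selection of the codebook size can be sketched as below. This is an illustrative reading of the definitions above rather than the original implementation: the Jaccard sum is taken over unordered pairs (equivalent to half the sum over ordered pairs), all function names are hypothetical, and the candidate values in the usage example are invented.

```python
import math
from itertools import combinations

def total_jaccard(code_sets, categories):
    """Sum of Jaccard indices over unordered pairs from different categories."""
    total = 0.0
    for i, j in combinations(range(len(code_sets)), 2):
        if categories[i] != categories[j]:
            union = code_sets[i] | code_sets[j]
            if union:
                total += len(code_sets[i] & code_sets[j]) / len(union)
    return total

def within_cluster_distance(descriptors, labels, centers):
    """Total squared distance of descriptors to their assigned centers."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[k]))
        for x, k in zip(descriptors, labels)
    )

def objective(jaccard, distance, k, n_categories=5, weight=0.75):
    """Empirical objective; requires jaccard > 0 and distance > 0."""
    pairs = n_categories * (n_categories - 1) / 2
    return (math.log(pairs) - math.log(jaccard)
            - math.log(distance) - weight * math.log(k))

def best_codebook_size(candidates):
    """candidates maps K to (J(K), D(K)); returns the maximising K."""
    return max(candidates, key=lambda k: objective(*candidates[k], k))

# Invented values, for illustration only
candidates = {10: (2.0, 100.0), 20: (1.0, 80.0), 40: (0.9, 78.0)}
best_k = best_codebook_size(candidates)
```

In the toy example the middle candidate wins: going from 20 to 40 codeblocks reduces the overlap and the within-cluster distance only slightly, which does not compensate for the penalty on K.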
Image recoding Once a suitable K is found and a codebook
is constructed by k-means clustering, the standard
bag-of-features approach represents the images as codeblock
histograms. However, in this coding, all spatial
information about the distribution of the codeblocks is
lost. Consider the situation in Fig. 1a: all four images have
the same number of patches assigned to the same codeblock,
but the spatial arrangement is very different. In
order to characterize these spatial differences, we extend
the image representation with a number of statistics on
the distribution of the codeblocks. For a given image
and for each codeblock k ∈ {1, . . . , K}, we construct
a binary image in which 1s represent regions assigned
to the codeblock and 0s everything else. In these binary
images, the connected components (4-neighbor connectivity)
define individual objects and for each of them we
compute the area (in pixels) and the compactness index
(ratio of the squared perimeter to the area of the object).
Finally, for each image and each codeblock, we compute
(i) the median area, (ii) the maximum area, (iii) the ratio of
the maximum area to the total area of the objects, (iv) the
skewness of the distribution of the area values and (v) the
mean compactness. Thus, for each codeblock in an image,
aside from its frequency, we add five new values aimed
at characterizing the distribution of the codeblock in the
image. We will refer to these additional quantities as the
“extended set of features”. The final representation of an
image has a length of 6K: K values for the codeblock histogram
(the standard representation) and 5K values of the
extended representation.
Fig. 1 Codeblocks and codebook. a An example of four different hypothetical distributions of the codeblocks leading to identical frequencies. To
cope with such situations, the distribution of codeblocks is also taken into account through extended image features. b A visual representation of the
obtained codebook. The 70 image patches are the closest to the codeblocks obtained after k-means clustering. The three groups of codeblocks (with
29, 20 and 21 elements, respectively) correspond to the major clusters in Fig. 2 and the ordering of the image patches is the same as in the clustering
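The five extended features of one codeblock can be sketched from its binary image as follows. Two choices here are assumptions, since the text does not fix them: the perimeter is counted as the number of exposed pixel edges, and the skewness is the population third standardised moment (set to 0 when all areas are equal).

```python
from collections import deque
from statistics import mean, median

def connected_objects(mask):
    """Yield (area, perimeter) for each 4-connected component of a mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            area = perimeter = 0
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                area += 1
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if not inside or not mask[rr][cc]:
                        perimeter += 1  # exposed pixel edge
                    elif not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            yield area, perimeter

def skewness(values):
    """Population skewness; 0 when the values have no spread."""
    m = mean(values)
    var = mean((v - m) ** 2 for v in values)
    if var == 0:
        return 0.0
    return mean((v - m) ** 3 for v in values) / var ** 1.5

def extended_features(mask):
    """Median area, max area, max/total area, area skewness, mean compactness."""
    objects = list(connected_objects(mask))
    if not objects:
        return [0.0] * 5
    areas = [a for a, _ in objects]
    return [
        median(areas),
        max(areas),
        max(areas) / sum(areas),
        skewness(areas),
        mean(p * p / a for a, p in objects),
    ]

# Toy binary image for one codeblock: a 2x2 region and two isolated pixels
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
features = extended_features(mask)
```

Concatenating these five values with the codeblock frequency, for each of the K codeblocks, gives the 6K-element representation described above.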
Joint data mining
The new representation of the images allows for direct
application of standard data mining techniques. In the
case of multi-modality data mining, the choice of a proper
similarity metric/measure is of crucial importance. Two
main strategies may be attempted for defining a proper
similarity: combination of single, modality-specific, metrics
or building/learning a fully multi-modality metric.
The first approach has the advantage of using established
metrics usually resulting in easily interpretable models
and facilitating the comparison with known results.
The second approach promises to build a similarity metric
that better exploits the multi-modality nature of the
data. These ideas can be implemented, for example, in
the context of kernel machines (such as Support Vector
Machines) where composite kernels (based on closure
properties – see [17] p.75) would represent a possible
implementation of the first approach and multiple kernel
learning [18] an implementation of the latter.
In the present work and in order to demonstrate the
general analytical framework, we make use of standard
statistical tools. We aim at identifying image features that
could be linked to expression levels of the genes of interest
(genotype-phenotype association) and potential image
biomarkers that alone or in combination with gene expression
can be used for defining a prognostic signature.
Besides the gene expression, we also used a proliferation
gene signature PRO_10 [12, 13], which was shown to
be prognostic in various cohorts of patients with breast
cancer.
To test the association between image features and
tumor size (T) and grade (G), we dichotomized the clinical
variables (T: {T1, T2} vs. {T3, T4}, and G: {G1, G2}
vs. G3, respectively) and used two-sided t-tests with a 0.05
significance level. The association of image features with
gene expression was assessed based on correlation test
(Pearson) with significance level 0.05 and the condition
that the correlation coefficient was at least 0.5 (in absolute
value). We also used canonical correlation analysis (CCA)
to study the associations between image features and
molecular data with significance level of 0.05 for Wilks’
test. The association between image features and survival
outcome (relapse-free survival – RFS) was tested using
Cox proportional hazard models (log-likelihood test),
with significance level of 0.05. The hazard ratios were
estimated from interquartile range-standardized variables
(both image and genomic variables). To test if an image
feature improves the prognostic value of the gene signature,
we tested the difference between the models with
and without the variable of interest using likelihood ratio
tests. To assess the difference in survival between two
groups we used log-rank tests. We binarized the variables
by their median value, into high- and low- expressions or
values. Since the work reported here is purely exploratory
and the sample size is rather small, no adjustment for multiple
hypotheses testing was performed. We used hierarchical
clustering (Ward method) with Euclidean distance
between samples to cluster the codeblocks.
All statistical analyses were performed in the R environment for
statistical computing (http://www.r-project.org), version
3.2.2.
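Two of the steps above lend themselves to a short illustration: the interquartile-range standardization applied before estimating hazard ratios, and the likelihood ratio test between nested models. The sketch below is in Python for consistency with the image analysis code and is not the R code used for the reported analyses; the quartile convention and the assumption that the nested models differ by exactly one parameter are choices made here.

```python
import math
from statistics import quantiles

def iqr_standardise(values):
    """Scale a variable so that one unit equals its interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("zero interquartile range")
    return [v / iqr for v in values]

def likelihood_ratio_test(loglik_reduced, loglik_full):
    """LR statistic and p-value for nested models differing by one parameter."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # Chi-square survival function with 1 d.o.f.: P(X > s) = erfc(sqrt(s / 2))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value
```

With this scaling, a hazard ratio refers to a change of one interquartile range in the variable, which makes image and genomic variables comparable on the same footing.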
Results
Codebook
The image analysis methods described above were implemented
in a Python package (available at https://github.com/vladpopovici/WSItk),
using the scikit-image [19] and Mahotas [20] libraries.
For the codebook construction we used only the modeling
set of images, none of the images used in the data
mining phase being used for learning the codebook. From
each image, a set of 3000 random patches of size 32 × 32
was extracted and the corresponding Gabor descriptors
computed (vectors of 48 elements). These descriptor vectors
were clustered using the k-means algorithm to build
the codebooks. We estimated the optimal (in the sense
of the objective function described above) codebook
size by evaluating the objective function for K = 10, 20, . . . , 1000. The
optimal value was found to be K = 70 (see Additional
file 1 for a plot of the objective function), leading to a feature vector of 6K = 420 values
for each image. Since the codeblocks are centers of
the clusters (the means of descriptor vectors assigned to
the respective cluster), they might not necessarily correspond
to observed image regions. Thus we selected
the closest regions to the codeblocks (the corresponding
descriptor vectors were the closest to the codeblocks) to
provide an approximate visual representation of the codebook
- Fig. 1b. In the following, to designate a specific
codeblock from the codebook, we will use the notation
C.xy. We have extensively investigated the stability of
the learned codebooks and the resulting image representations
and we found the process to be stable – see
Additional file 1.
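Selecting an observed region to stand in for each codeblock amounts to a nearest-neighbour search in descriptor space. A minimal sketch, with a hypothetical function name and toy two-dimensional descriptors:

```python
def closest_patch_indices(descriptors, codebook):
    """For each codeblock, index of the observed descriptor nearest to it."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(descriptors)), key=lambda i: sqdist(descriptors[i], center))
        for center in codebook
    ]

# Toy data: three observed descriptors and two codeblocks (illustrative only)
descriptors = [[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]]
codebook = [[4.0, 4.5], [0.4, 0.1]]
representatives = closest_patch_indices(descriptors, codebook)
```

The returned indices point to the image patches that can be displayed as an approximate picture of the codebook.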
The hierarchical clustering of the codeblocks
(Fig. 2) revealed a rather structured content: three
major groups of codeblocks could be identified. We
tentatively labeled them as “proliferation patterns”,
“invasion/differentiation patterns/connective tissue” and
“sparse tumor nuclei/differentiation/fat” to indicate the
major components in the clusters - without claiming a
precise histopathological characterization.
A number of codeblocks were found to be associated
with tumor size (C.10, C.18, C.29, C.38, C.41, and C.42)
and grade (C.09, C.34, C.43, C.45, C.48, and C.62).
Correlations between image features and gene expression
The association analysis between image features and gene
expression identified a number of significant (p < 0.05
and ρ > 0.5) pairwise correlations (all in the range
0.50 − 0.60). In all, eight different codeblocks were associated
with different genes, most of them with CCNE1
and CCNB2. The codeblock C.31 was associated with
most genes (CCNE1, CCNB2, BIRC5, PRC1, SPAG5)
either by its frequency of appearance in the image or
by the skewness of its distribution. By summing the frequencies
corresponding to image features that are highly
correlated (e.g. C.38, C.31, C.01, C.51, C.41, C.68), the
correlation coefficients improved to 0.65–0.70.
CCA confirmed the association between these image features
and gene expression data (Wilks’ test p = 0.026).
The image features C.10, C.19, C.57, and C.68 and the
genes CCNE1, CCNB2, and SPAG5 had the strongest
impact on the canonical dimensions. These were also the
most stable image features-gene expression correlations
in the image representation stability experiments – see
Additional file 1.
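The pairwise screening and the summing of correlated features can be sketched as follows. Only the threshold on the correlation coefficient is shown (the significance test is omitted), and the feature names and values in the usage example are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_pairs(image_features, genes, min_abs_r=0.5):
    """Return (feature, gene, r) triples with |r| >= min_abs_r."""
    hits = []
    for fname, fvals in image_features.items():
        for gname, gvals in genes.items():
            r = pearson(fvals, gvals)
            if abs(r) >= min_abs_r:
                hits.append((fname, gname, r))
    return hits

def summed_feature(image_features, names):
    """Per-sample sum of several codeblock frequencies."""
    return [sum(vals) for vals in zip(*(image_features[n] for n in names))]

# Invented values for two hypothetical features and one hypothetical gene
image_features = {"C.a": [0.1, 0.2, 0.3, 0.4], "C.b": [0.4, 0.1, 0.3, 0.2]}
genes = {"G1": [1.0, 2.0, 3.0, 4.0]}
hits = screen_pairs(image_features, genes)
combined = summed_feature(image_features, ["C.a", "C.b"])
```

The summed feature can then be passed through the same screening to check whether the combination correlates more strongly than its components.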
Despite the fact that the PRO_10 gene signature is an
average of proliferation genes which were found to be
correlated with image features, the correlations between
image features and PRO_10 did not reach the required
significance level in all but one case: the skewness of
codeblock C.31.
Fig. 2 Hierarchical clustering of the codebook. Clustering the codeblocks led to identification of three major clusters, to which generic terms have
been assigned. The codeblocks correlated with gene expression are marked with red dots. The codeblocks with potential prognostic value (in
univariate analysis) are marked with blue squares (dark blue for p-value < 0.01, light blue for 0.01 ≤ p-value ≤ 0.05)
Survival analyses
The goal of the analyses performed was to assess the utility
of image-based variables for predicting relapse-free
survival independently, or combined with the PRO_10
signature. In the set of samples analyzed, the genomic
score is a strong prognostic marker (Cox regression:
p = 0.001, HR = 2.12, 95 % CI = (1.29, 3.51)).
Univariate Cox proportional hazards models were fit for
each of the 420 image features resulting in the identification
of several significant associations with relapse-free
survival endpoint. The most prognostic image features
were C.41, C.56, C.65, C.67, C.69, with p < 0.01 and
HR between 1.16 and 1.70. From the extended set of features,
the median area of the regions assigned to clusters
C.15 and C.26 were significantly associated with RFS (p <
0.05). The strongest predictor among the image features
was C.69 (p = 0.0018, HR = 1.7, 95 % CI = (1.22, 2.37)).
In combined models (image feature and genomic score)
a number of image features led to improved models (likelihood
ratio test p < 0.05), most of them from the
extended set of features. From all these image features,
C.69 remained significant in the multivariate model (with
PRO_10) and had no significant interaction with the
genomic signature.
We defined an image score variable by averaging C.41,
C.56, C.65, C.67, C.69 which resulted in a stronger prognostic
factor (Cox regression: p = 0.0003 and HR =
1.76, 95 % CI = (1.30, 2.40) - see also Figure 3). In a
regression model including the genomic and the image
scores, both remained independent significant variables
(PRO_10: p = 0.05, image score: p = 0.007, no significant
interaction) and the model was significantly better than the
corresponding univariate models (p = 0.013). In Fig. 4
the Kaplan-Meier curves for binarized (by median value)
scores are shown, together with corresponding p-values
(log-rank tests) and hazard ratios. Another visualization
of the prognostic scores is given in Fig. 5 where the
expected survival at 4 years is shown as a function of
the genomic, image-based, and combined scores, respectively.
Two examples of high risk cases, according to the
image-based score, are given in Additional files 2 and 3.
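The construction of the image score and its binarization can be sketched as below. The table and feature names are invented, and assigning samples exactly at the median to the low group is an assumption, since the text only states that the split was made at the median value.

```python
from statistics import mean, median

def image_score(feature_table, names):
    """Per-sample average of the selected image features."""
    return [mean(vals) for vals in zip(*(feature_table[n] for n in names))]

def median_split(scores):
    """Label a sample 'high' if its score exceeds the median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]

# Invented values for two hypothetical features over four samples
table = {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [3.0, 2.0, 5.0, 8.0]}
score = image_score(table, ["f1", "f2"])
groups = median_split(score)
```

The two resulting groups are the ones that would be compared with Kaplan-Meier curves and a log-rank test.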
Discussion
The main challenge in introducing the histopathology
images in the general data mining biomarker discovery
framework stems from their high complexity and low level
of information representation. Thus, while the images
contain a huge amount of data (in the order of 1010 pixels)
the extraction of information implies a considerable effort.
Fig. 3 Regions assigned to the most prognostic codeblocks.
512 × 512 regions from two different samples with high image score
(high risk of relapse), at 2.5× magnification. The image patches
represented in full color were assigned to one of the C.41, C.56, C.65,
C.67 or C.69 codeblocks. In Additional files 2 and 3, the corresponding
whole slide images are provided
Traditionally, this effort is performed by the expert pathologists
or, more recently, by using quantitative methods for
measuring a set of predefined morphological aspects to
complement the pathology report. In this work, we took
a third approach, in which the image data is reduced to
a number of essential patterns (the codeblocks) whose
frequency and spatial distribution in the image is used
for data mining. The codeblocks are learned independent
of any prior knowledge about the images, potentially
16. Histopathology image features and gene expression
169
Popovici et al. BMC Bioinformatics (2016) 17:209 Page 7 of 9
A B C
Fig. 4 Kaplan-Meier curves for binarized scores. The genomic (a), image-based (b) and combined scores (c) were binarized by the respective
median values into “low score” (low risk) and “high score” (high risk) categories. The combined score slightly improves on the genomic score
enabling the discovery of new image features not necessarily
assessed during the pathology review of the cases. The
obvious drawback is the difficulty of interpreting some of
the patterns and the possibility of having also artifacts in
the model. The adopted representation of local neighborhoods
in the image (responses to a bank of Gabor filters)
encouraged the identification of codeblocks with distinctive
textural appearance (Fig. 1). This local appearance
may be later on combined with a nuclei detector and classifier
(as in Yuan et al. [4]), for example, to obtain a more
comprehensive characterization of the image.
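The reduction of an image to codeblock frequencies can be illustrated with a minimal sketch. It assumes that each local neighborhood is described by a numeric vector (for instance, filter-bank responses) and that a neighborhood is assigned to the nearest codeblock centre in squared Euclidean distance; both the assignment rule and the function names are assumptions made for illustration, not a description of the authors' implementation.

```python
def nearest_codeblock(descriptor, codebook):
    """Index of the codebook centre closest (squared Euclidean) to a descriptor."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(descriptor, codebook[k]))


def codeblock_frequencies(descriptors, codebook):
    """Relative frequency of each codeblock among the local descriptors of one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeblock(d, codebook)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

The resulting frequency vector is the kind of per-image feature that can then enter correlation or survival analyses alongside gene expression values.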
By examining the similarities between codeblocks, we
identified three major aspects of the images that are captured:
proliferation, invasion/differentiation (within connective
tissue) and isolated tumor nuclei (within regions
predominantly with fat component) (Fig. 2). This result
combined with the observation that the whole third cluster
did not contribute to the prognostic models, suggests a
possible refinement of the current method, in which these
regions with high fat content are discarded in an initial
preprocessing stage and a more detailed model is used to
characterize the remaining regions.
We demonstrated the integration of the image features
in a standard biomarker discovery scenario, in which
both image-gene correlations (precursors to genotype-phenotype
associations) as well as various survival prognostic
models were tested. Since the main purpose of this
exercise was to demonstrate the integration of image features
with genomic information and the sample size was
relatively modest, we did not adjust for multiple hypotheses
testing and restricted ourselves to an exploratory
analysis. Thus the associations found, while hypothesis-generating,
have to be taken with caution and more
validation is needed.
Most of the genes in the panel were related to proliferation
processes, thus it is not surprising that the
correlations with image features involved almost exclusively
these genes. The strongest associations were found
Fig. 5 Prognostic scores at 4 years. Predicting the likelihood of an event (relapse) at 4 years, based on genomic signature (PRO_10 - panel a), the
image-based score (panel b) and the combined score (panel c)
with CCNE1 and CCNB2. Somewhat surprisingly, no significant
correlation was found with the MKI67 gene, a common
marker (with Ki-67 specific staining) for proliferation.
A number of image features were found to be prognostic
for RFS and we proposed a simple image-based
prognostic score which averages five basic image features.
The new score is strongly prognostic and is not
correlated with the genomic score considered (PRO_10).
When combining the two scores in a multivariable
Cox regression, the two remained significant (with a
marginal significance for the genomic score) and independent
predictors (no significant interaction) leading to
an improved model. Thus, the image-based score can be
used either alone - as a first line predictor - or in combination
with the genomic predictor. These results also
demonstrate the complementarity of the two modalities - histopathology
imaging and genomics - and suggest
that refined predictors can be built by a combination
thereof.
It must be noted that the sample size and the number
of events did not allow for more variables in the
regression models. Further analysis of the scores (either
image-based or combined) in the context of usual clinical
predictors (TNM-staging, hormonal status, etc.) is
required before a definite conclusion about their clinical utility
can be drawn. Nevertheless, the image-based score can
already be used in applications like searching or indexing
in histopathology image archives.
Conclusions
We proposed a general framework for integrating the
histopathology images in the routine genomic data analysis
pipeline. The image features used are based on
the responses of Gabor filters applied to single channel
images. The approach can easily be extended to exploit
the full color information and to include other types of
features.
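For readers unfamiliar with Gabor filters, the following sketch builds the real part of a 2-D Gabor kernel (a Gaussian envelope multiplied by an oriented cosine wave) and a small bank over several wavelengths and orientations. All parameter names and values are illustrative assumptions; the filter-bank settings actually used in the study are not restated here.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a 2-D Gabor filter as a size x size list of rows (size odd).

    wavelength : wavelength of the cosine carrier, in pixels
    theta      : orientation of the carrier, in radians
    sigma      : standard deviation of the Gaussian envelope
    gamma      : spatial aspect ratio of the envelope
    psi        : phase offset of the carrier
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # rotate coordinates to the filter orientation
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel


def filter_bank(size, wavelengths, n_orientations, sigma):
    """One kernel per (wavelength, orientation) pair, orientations evenly spaced in [0, pi)."""
    return [gabor_kernel(size, w, k * math.pi / n_orientations, sigma)
            for w in wavelengths
            for k in range(n_orientations)]
```

Convolving a single-channel image with each kernel of such a bank yields, at every pixel, a vector of responses that can serve as a local texture descriptor.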
When applying our method to a data collection of breast
cancer samples, we were able to identify a number of
associations between image features and gene expression
levels. More importantly, several prognostic image features
were identified, some of them complementary to the
genomic score. Thus, we could build an image-based and
a combined survival score, improving on the performance
of the genomic score. These results must be validated in
larger data sets.
The code implementing the methods described is
made freely available and continues to be under active
development.
Availability of data and materials
The source code for the image analysis methods described
in the paper is available from the GitHub repository
https://github.com/vladpopovici/WSItk.
The data used to demonstrate the methods described is
not publicly available.
Ethics approval and consent to participate
The data used to demonstrate the methods in this study
has been graciously provided by the Department of Medical
Oncology, Inselspital Bern, Switzerland. All patients
gave a general consent for the use of their tissue samples
in research.
Additional files
Additional file 1: Codebook construction details [PDF file]. The codebook
was optimized based on an objective function and a set of reference
categories. This file contains the plot of the objective function and example
images for the selected categories. (PDF 12390 kb)
Additional file 2: High risk carcinoma according to image-based score
(Example 1). [JPG file]. Whole-slide image of a tumor labeled as high risk by
the image score, with the regions used in scoring highlighted. (JPG 9758 kb)
Additional file 3: High risk carcinoma according to image-based score
(Example 2). [JPG file]. Whole-slide image of a tumor labeled as high risk by
the image score, with the regions used in scoring highlighted. (JPG 12800 kb)
Abbreviations
HR: hazard ratio; qRT-PCR: quantitative real-time polymerase chain reaction;
RFS: relapse-free survival.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
VP initiated and designed the study, performed image and statistical data
analyses. EB and LB performed bioinformatics and biostatistics analyses. JF
performed image analyses and contributed pathology interpretation of the
images. RJ performed the gene expression profiling, provided clinical and
biological interpretation of the results. DS and LD performed image and
statistical analyses. All authors contributed to the manuscript writing. All
authors read and approved the final manuscript.
Acknowledgements
The authors gratefully acknowledge the support of Ministry of Health of the
Czech Republic through grant NT/14134. The results presented here reflect
only the authors’ views.
Funding
This study was funded by the Ministry of Health of the Czech Republic
through grant NT/14134.
Author details
1Institute of Biostatistics and Analyses, Faculty of Medicine, Masarykova
Univerzita, Kamenice 5, 62500 Brno, Czech Republic. 2RECETOX, Masarykova
Univerzita, Kamenice 5, 62500 Brno, Czech Republic. 3Department of Clinical
Research, Faculty of Medicine, University of Bern, Bern, Switzerland.
Received: 9 December 2015 Accepted: 4 May 2016
References
1. Hamilton PW, Bankhead P, Wang Y, Hutchinson R, Kieran D, McArt DG,
James J, Salto-Tellez M. Digital pathology and image analysis in tissue
biomarker research. Methods. 2014;70(1):59–73.
2. Colen R, Foster I, Gatenby R, Giger ME, Gillies R, Gutman D, Heller M,
Jain R, Madabhushi A, Madhavan S, Napel S, Rao A, Saltz J, Tatum J,
Verhaak R, Whitman G. NCI Workshop Report: Clinical and Computational
Requirements for Correlating Imaging Phenotypes with Genomics
Signatures. Transl Oncol. 2014;7(5):556–69.
3. Gurcan MN, Boucheron LE, Can A, Madabhushi A, Rajpoot NM, Yener B.
Histopathological image analysis: a review. IEEE Rev Biomed Eng. 2009;2:
147–71.
4. Yuan Y, Failmezger H, Rueda OM, Ali HR, Graf S, Chin SF, Schwarz RF,
Curtis C, Dunning MJ, Bardwell H, Johnson N, Doyle S, Turashvili G,
Provenzano E, Aparicio S, Caldas C, Markowetz F. Quantitative Image
Analysis of Cellular Heterogeneity in Breast Tumors Complements
Genomic Profiling. Sci Transl Med. 2012;4(157):143.
5. Kong J, Cooper LAD, Wang F, Gutman DA, Gao J, Chisolm C, Sharma A,
Pan T, Van Meir EG, Kurc TM, Moreno CS, Saltz JH, Brat DJ. Integrative,
multimodal analysis of glioblastoma using TCGA molecular data,
pathology images, and clinical outcomes. IEEE Trans Biomed Eng.
2011;58(12):3469–74.
6. Nawaz S, Heindl A, Koelble K, Yuan Y. Beyond immune density: critical
role of spatial heterogeneity in estrogen receptor-negative breast cancer.
Mod Pathol. 2015;28(6):766–77.
7. Csurka G, Dance CR, Fan L, Willamowski J, Bray C. Visual categorization
with bags of keypoints. Work Stat Learn Comput Vision ECCV. 2004:59–74.
8. Caicedo JC, Cruz A, Gonzalez FA. Histopathology Image Classification
Using Bag of Features and Kernel Functions In: Combi C, Shahar Y,
Abu-Hanna A, editors. 12th Conference on Artificial Intelligence in
Medicine. Berlin Heidelberg: Springer; 2009. p. 126–35.
9. Budinská E, Čápková L, Schwarz D, Dušek L, Jaggi R, Feit J, Popovici V.
Gene expression-guided selection of histopathology image features. In:
15th International Conference on Bioinformatics and Bioengineering.
Belgrade: IEEE; 2015. p. 1–6.
10. Cireșan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection
in breast cancer histology images with deep neural networks. In:
International Conference on Medical Image Computing and
Computer-assisted Intervention. Berlin Heidelberg: Springer; 2013. p.
411–8.
11. Cruz-Roa A, Basavanhally A, González F, Gilmore H, Feldman M,
Ganesan S, Shih N, Tomaszewski J, Madabhushi A. Automatic detection
of invasive ductal carcinoma in whole slide images with convolutional
neural networks In: Gurcan MN, Madabhushi A, editors. SPIE Medical
Imaging. San Diego, USA: SPIE; 2014. p. 904103.
12. Moor AE, Guevara C, Altermatt HJ, Warth R, Jaggi R, Aebi S. PRO_10 – A
new tissue-based prognostic multigene marker in patients with early
estrogen receptor-positive breast cancer. Pathobiology. 2011;78(3):140–8.
13. Antonov J, Popovici V, Delorenzi M, Wirapati P, Baltzer A, Oberli A,
Thurlimann B, Giobbie-Hurder A, Viale G, Altermatt H, Aebi S, Jaggi R.
Molecular risk assessment of BIG 1-98 participants by expression profiling
using RNA from archival tissue. BMC Cancer. 2010;10(1):37.
14. Ruifrok AC, Johnston DA. Quantification of histochemical staining by
color deconvolution. Anal Quant Cytol Histol. 2001;23(4):291–9.
15. Daugman JG. Uncertainty relation for resolution in space, spatial
frequency, and orientation optimized by two-dimensional visual cortical
filters. J Opt Soc Am A. 1985.
16. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a
data set via the gap statistic. J R Stat Soc Ser B Stat Methodol. 2001.
17. Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis.
Cambridge, UK: Cambridge University Press; 2004.
18. McFee B, Lanckriet GRG. Learning Multi-modal Similarity. J Mach Learn
Res. 2011;12:491–523.
19. van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD,
Yager N, Gouillart E, Yu T, Scikit-image contributors. scikit-image: image
processing in Python. PeerJ. 2014;2:e453.
20. Coelho LP. Mahotas: Open source software for scriptable computer
vision. J Open Res Softw. 2013;1(1):e3.
17 Gene expression patterns unveil a new level
of molecular heterogeneity in colorectal cancer
• Journal of Pathology, 231(1):63-76, 2013
• IF: 6.894
• number of citations: 121
• personal contribution (20%): statistical analyses, results interpretation,
manuscript writing
175
Journal of Pathology
J Pathol 2013; 231: 63–76
Published online 8 July 2013 in Wiley Online Library
(wileyonlinelibrary.com) DOI: 10.1002/path.4212
ORIGINAL PAPER
Gene expression patterns unveil a new level of molecular
heterogeneity in colorectal cancer
Eva Budinska,1,2* Vlad Popovici,1,2 Sabine Tejpar,3 Giovanni D’Ario,1 Nicolas Lapique,1 Katarzyna Otylia Sikora,1
Antonio Fabio Di Narzo,1 Pu Yan,4 John Graeme Hodgson,5 Scott Weinrich,5 Fred Bosman,5 Arnaud Roth6,7 and
Mauro Delorenzi1,8
1 Bioinformatics Core Facility, Swiss Institute of Bioinformatics (SIB), Lausanne, 1015, Switzerland
2 Institute of Biostatistics and Analyses, Masaryk University, Brno, Czech Republic
3 Department of Oncology, University Hospital Gasthuisberg, Katholieke Universiteit Leuven, Belgium
4 University Institute of Pathology, Lausanne University Medical Centre, Switzerland
5 Pfizer Inc., Worldwide Research and Development, Oncology Research Unit, La Jolla, CA, USA
6 Oncosurgery, Geneva University Hospital, Switzerland
7 Swiss Group for Clinical Cancer Research (SAKK), Bern, Switzerland
8 Département de Formation et Recherche, Lausanne University Medical Centre, Switzerland
*Correspondence to: Eva Budinska, Institute of Biostatistics and Analyses, Masaryk University, Kotlarska 2, 611 37 Brno, Czech Republic
e-mail: budinska@iba.muni.cz
Abstract
The recognition that colorectal cancer (CRC) is a heterogeneous disease in terms of clinical behaviour and
response to therapy translates into an urgent need for robust molecular disease subclassifiers that can explain this
heterogeneity beyond current parameters (MSI, KRAS, BRAF). Attempts to fill this gap are emerging. The Cancer
Genome Atlas (TCGA) reported two main CRC groups, based on the incidence and spectrum of mutated genes,
and another paper reported a subgroup defined by an EMT expression signature. We performed a prior-free analysis of
CRC heterogeneity on 1113 CRC gene expression profiles and compared our findings with established molecular
determinants and clinical, histopathological and survival data. Unsupervised clustering based on gene modules
allowed us to distinguish at least five different gene expression CRC subtypes, which we call surface crypt-like,
lower crypt-like, CIMP-H-like, mesenchymal and mixed. A gene set enrichment analysis combined with literature
search of gene module members identified distinct biological motifs in different subtypes. The subtypes, which
were not derived based on outcome, nonetheless showed differences in prognosis. Known gene copy number
variations and mutations in key cancer-associated genes differed between subtypes, but the subtypes provided
molecular information beyond that contained in these variables. Morphological features significantly differed
between subtypes. The objective existence of the subtypes and their clinical and molecular characteristics were
validated in an independent set of 720 CRC expression profiles. Our subtypes provide a novel perspective on
the heterogeneity of CRC. The proposed subtypes should be further explored retrospectively on existing clinical
trial datasets and, when sufficiently robust, be prospectively assessed for clinical relevance in terms of prognosis
and treatment response predictive capacity. Original microarray data were uploaded to the ArrayExpress database
(http://www.ebi.ac.uk/arrayexpress/) under Accession Nos E-MTAB-990 and E-MTAB-1026.
© 2013 Swiss Institute of Bioinformatics. Journal of Pathology published by John Wiley & Sons Ltd on behalf of Pathological Society
of Great Britain and Ireland.
Keywords: colorectal cancer; histopathology; gene expression; molecular heterogeneity
Received 3 February 2013; Revised 10 May 2013; Accepted 14 May 2013
Conflict of interest: ST and FTB have received research funds from Pfizer; and AR is a compensated consultant for Pfizer.
Introduction
Current classiﬁcations of sporadic colorectal cancer
take into consideration stage, histological type and
grade [1]. Colorectal cancer (CRC) is a highly heterogeneous
disease, with clinicopathologically similar
tumours differing strikingly in treatment response
and patient survival. These differences are only partly
explained by current concepts regarding the molecular
events leading to CRC. In recent years, microsatellite
instability (MSI) emerged as an important classiﬁer
with signiﬁcant prognostic impact and potential for
patient stratiﬁcation for therapy [2,3]. Some molecular
markers, as well as the mutation status of BRAF
or KRAS genes (predictive for anti-EGFR [4]), are in
use for treatment decisions and patient stratiﬁcation.
However, patient groups deﬁned by these molecular
markers still differ remarkably in behaviour and therapy
response [5,6]. Several approaches to further subtype
CRC have been proposed, based on combinations
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium,
provided the original work is properly cited.
17. Molecular heterogeneity of colorectal cancer
176
64 E Budinska et al
of clinical, histopathological, gene expression, CNV,
epigenetic and single gene parameters [7–13]. Each of
these different modalities provides its own perspective
on the same underlying biological reality. The CpG
island methylator phenotype (CIMP) status is emerging
as an important molecular determinant of CRC heterogeneity
[11]. The Cancer Genome Atlas (TCGA) analysis
identified a hypermutant group not entirely captured
by MSI status [13]. Several studies have addressed
CRC subtyping using genome-wide gene expression
proﬁling of relatively large patient cohorts [12,14].
One study used unsupervised clustering of stage II and
III CRCs to identify three stage-independent subtypes,
with BRAF mutation and MSI status dominating one
of the subtypes [14]. A study of stage I–IV CRC samples
segregated CRC into two prognostic subtypes with
epithelial–mesenchymal transition (EMT) as a main
determinant [12]. Another study on 88 stage I–IV samples
identiﬁed four subtypes, one correlated with MSI,
BRAF mutation and mucinous histology, two with stromal
component and one with high nuclear β-catenin
expression [15].
We recently reported CRC expressing a BRAF-mutated
signature [6], which strongly overlaps with
the methylation-based group of Hinoue [11], and an
MSI-like gene expression group that captures the
hypermutant tumours of TCGA [13], indicating the
potential for identification of robust biological subgroups.
We now describe CRC subtypes based upon
unsupervised clustering of genome-wide expression
patterns. We characterized these subtypes in terms
of biological motifs, common clinical variables,
association with known CRC molecular markers
and morphological patterns. A key element in our
approach was the use of a system of unsupervised gene
modules—groups of genes with correlated expression.
They are more resistant to noise and have a higher
chance of having at least a few members represented
on various platforms. In addition, as each gene module
is represented by its median expression, the modules
with fewer genes contribute equally to the subtype
deﬁnition. We and others have successfully used similar
strategies previously [16–18]. We validated the
existence of the subtypes and their respective clinical
and molecular marker characteristics in an independent
dataset. Ultimately, it will be mandatory to integrate
the various sources of information on CRC heterogeneity
into an integrative, robust and reproducible
subclassiﬁer that can become a tool for clinical use.
Materials and methods
A detailed description of all the datasets and analysis
procedures is given in Supplementary methods and
results (see Supplementary material).
Data acquisition and processing
We have built two non-overlapping data collections:
a discovery collection, comprising four publicly
available (425 samples) and two previously unpublished
datasets (688 samples with 10 year follow-up in
a clinical trial setting and 64 normal samples) with
known stage status, and a validation collection of
eight publicly available datasets (720 CRC samples)
(see Supplementary material, Supplementary methods
and results). Observations derived from the analysis
of 64 normal samples were further validated on ﬁve
publicly available datasets, with both carcinoma and
normal samples available in one batch (totalling 205
normal/adenoma/carcinoma samples). Copy number
data were available for 154 of the PETACC3 samples, as in [19].
Our analysis included a total of 2102 samples.
The discovery collection contained the previously
unpublished 688 CRC formalin-fixed, paraffin-embedded
(FFPE) samples of PETACC3 [6] and
64 FFPE normal colon tissue samples from Centre
Hospitalier Universitaire Vaudois’s Biobank, which
were uploaded to ArrayExpress (http://www.ebi.ac.uk/
arrayexpress/), under Accession Nos E-MTAB-990
and E-MTAB-1026, respectively. Gene expression
data were processed by standard tools to obtain
normalized, probeset-level expression data. For each
EntrezID in the datasets, the probeset with the
highest variability was selected as representative and
the number of EntrezIDs entering the analysis was
reduced to 3025 by applying non-specific filtering.
For PETACC3 and normal colon samples, patients
signed an informed consent form in which the use of
tissue specimens was included, and all marker study
proposals were subjected to the approval of the trial
steering committee.
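The selection of one representative probeset per EntrezID can be sketched as below. The text states only that the probeset with the highest variability was chosen; using the variance across samples as the variability measure, as well as the data layout and function name, are assumptions made for this illustration.

```python
from statistics import pvariance


def most_variable_probesets(expression, probeset_to_gene):
    """For each gene, keep the probeset whose expression varies most across samples.

    expression       : dict mapping probeset id -> list of expression values
    probeset_to_gene : dict mapping probeset id -> gene id (e.g. an EntrezID)
    Returns a dict mapping gene id -> selected probeset id.
    """
    best = {}
    for probeset, values in expression.items():
        gene = probeset_to_gene.get(probeset)
        if gene is None:
            continue  # probeset without gene annotation is ignored
        v = pvariance(values)
        if gene not in best or v > best[gene][1]:
            best[gene] = (probeset, v)
    return {gene: probeset for gene, (probeset, _) in best.items()}
```

A robust spread measure (for example the interquartile range) could be substituted for the variance without changing the structure of the selection.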
Subtype definition and validation
For model development (gene modules and subtype
definition, classifier training, identification of subtype-specific
genes) only the 1113 CRC samples of the
discovery set were used, no sample in the validation
collection being used for any model tuning. Hierarchical
clustering (complete linkage, Pearson correlation
similarity measure) and dynamic cut tree [20] were
used to produce gene modules (groups of genes with
correlated expression), from which non-robust modules
(see Supplementary material, Supplementary methods
and results) and a gender-related module were discarded.
Each expression proﬁle was then reduced to
a vector of meta-genes by taking the median of the
values of genes in each gene module. The meta-genes
were then further grouped into clusters using hierarchical
clustering.
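The reduction of an expression profile to a vector of meta-genes can be sketched as follows. The module names and the dictionary layout are hypothetical; only the use of the median over the genes of each module is taken from the text. Genes missing from a given sample (for instance, because they are not represented on a platform) are skipped, which is an assumption of this sketch.

```python
from statistics import median


def metagene_profile(sample, modules):
    """Reduce one expression profile to meta-genes.

    sample  : dict mapping gene id -> expression value for one sample
    modules : dict mapping module name -> list of gene ids in the module
    Returns a dict mapping module name -> median expression of its genes
    that are present in the sample; modules with no measured gene are omitted.
    """
    profile = {}
    for name, genes in modules.items():
        values = [sample[g] for g in genes if g in sample]
        if values:
            profile[name] = median(values)
    return profile
```

Because each module contributes a single median value, a module with a handful of genes weighs as much in the subsequent clustering as one with hundreds.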
The subtypes were deﬁned in terms of core samples
—those samples from the discovery collection that
were assigned to clusters by hierarchical clustering,
using a consensus distance [21] followed by pruning
of the dendrogram (see Supplementary material,
Supplementary methods and results). The clusters to
which the core samples were assigned were called
Gene expression heterogeneity in colorectal cancer 65
subtypes. The rest of the samples from the discovery
collection, not assigned to subtypes by this procedure,
were called non-core samples. This approach
allowed the reduction of noise in subtype-deﬁning samples,
and thus a higher consistency of the resulting
subtypes deﬁning the ground truth for downstream
analyses. The stability of the obtained clusters was
assessed under different perturbations of the processing
pipeline (different parameters and clustering methods)
to ensure that the results were not simple artefacts
(see Supplementary material, Supplementary methods
and results). A multiclass linear discriminant (LDA)
[22] was trained on core samples with meta-genes as
variables to assign new samples to one of the subtypes.
Minimal gene sets characteristic to each subtype
were identiﬁed using ElasticNet [23] on gene-level
data.
In order to validate the existence of subtypes (and
their independence of data selection) and the modelling
choices in subtype discovery, we applied the
same subtyping procedure (including parameters) to the
validation collection. The clusters identified in the validation
collection were put in correspondence with the
subtypes in the training set by LDA predictions and
correlations of subtype-specific moderated t statistic
[24] values, corresponding to the gene-wise comparison
of the respective subtype with the other subtypes
(one-versus-all comparison). A simple classifier application
would have led the validation samples to be
classified as one of the subtypes, but it would not have
informed us of possible over-fitting of the data in the
discovery procedure.
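The correlation-based part of this matching can be sketched as follows: each cluster and each subtype is summarized by a vector of gene-wise statistics (one value per gene, in the same gene order), and a validation cluster is paired with the discovery subtype whose vector is most correlated with its own. The function names and the choice of simply taking the maximum Pearson correlation are assumptions for illustration; the moderated t statistics themselves would come from a differential expression analysis that is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def match_clusters(validation_profiles, discovery_profiles):
    """Pair each validation cluster with the most correlated discovery subtype.

    Both arguments map a cluster/subtype label to its vector of gene-wise
    statistics. Returns a dict: validation label -> (discovery label, correlation).
    """
    matches = {}
    for cluster, v in validation_profiles.items():
        best = max(discovery_profiles,
                   key=lambda s: pearson(v, discovery_profiles[s]))
        matches[cluster] = (best, pearson(v, discovery_profiles[best]))
    return matches
```

Low maximum correlations for a validation cluster would indicate that it has no clear counterpart among the discovery subtypes.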
Subtype characterization
Unless otherwise specified, all the reported p values
were adjusted for multiple hypothesis testing, using
the Benjamini–Hochberg procedure. The significance level
was set at 0.1. Pathway analysis for each set of
gene modules was carried out using the Database
for Annotation, Visualization and Integrated Discovery
(DAVID) [25]. Gene set enrichment analysis of
gene signatures was performed using the mygsea2
tool, in each subtype and normal samples, on average
expression-ordered median-centred lists of genes.
Differential expression analysis was performed using
limma [24] and sign test using BSDA [26]. The
Cox proportional hazards model was used to analyse
the prognostic value of interquartile range (IQR)-standardized
values of meta-genes, for overall survival
(OS), relapse-free survival (RFS) and survival after
relapse (SAR), stratiﬁed by dataset. The Wald test was
used to assess the global signiﬁcance of the models.
Pairwise differences in survival were assessed using the
log-rank test. For subtype comparison, the survival was
truncated at 7 years. Subtype enrichment for clinical or
molecular markers was assessed by the Fisher test to
the baseline, deﬁned as the proportion of the marker in
the whole dataset. Morphological pattern differences
were assessed pairwise by Fisher test.
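The Benjamini–Hochberg adjustment referred to above can be written in a few lines. The sketch below is a generic implementation of the standard step-up procedure, not the authors' code; the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With the significance level of 0.1 used in this section, a test would be reported as significant when its adjusted p value is below 0.1.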
Histology
The identiﬁed subtypes were characterized histologically
in terms of six different architectural patterns:
complex tubular; solid/trabecular; mucinous; papillary;
desmoplastic; and serrated (Figure 4A), which were
called dominant or secondary depending on their presence
in the histology slides (for details on immunohistochemistry,
see Supplementary material, Supplementary
methods and results).
Results
Gene modules and subtype deﬁnition
We identiﬁed 54 gene modules, reproducible across
all datasets in the discovery collection, comprising
658 genes from an initial list of 3025 identiﬁed
as the most variable. The assignment of genes to
gene modules and gene module clusters is listed in
Table S1 (see Supplementary material); meta-gene
expression proﬁles for the discovery set are shown in
Figure 1A; and between meta-gene correlations in
Figure S1C (see Supplementary material). Based on
gene modules, we identified five major subtypes:
surface crypt-like (A), lower crypt-like (B), CIMP-H-like
(C), mesenchymal (D) and mixed (E), totalling 765
samples (69% of discovery data; see Supplementary
material, Supplementary methods and results).
Subtype reproducibility in an independent
validation set
In the validation set of 720 CRC samples we identiﬁed
a set of subtypes comprising 602 samples (83.6%
of the validation set) and associated them with our
discovery subtypes using the subtype classiﬁer (see
Supplementary material, Table S2) and correlations
of subtype-speciﬁc patterns based on moderated t
statistic (see Supplementary material, Table S3). All
ﬁve major subtypes reappeared in the validation set,
conﬁrming the robustness of our approach. Figure S2
(see Supplementary material) presents gene expression
proﬁles of both discovery and validation sets. Two
notable differences were observed: (i) subtype B in the
validation set was split into two subgroups (B1, B2),
as observed in the discovery set too, but only at lower
pruning height; (ii) another cluster passed the minimal
size criteria, corresponding to the small subtype (F)
which, in the discovery set, was not considered for
further characterization because of small sample size.
Validation of other subtype characteristics (to the extent
of available information) is described in each of the
respective sections.
Subtypes are characterized by distinct biological
components
We set out to assign biological labels to gene modules
that deﬁne the subtypes (Table 1; see also Supplementary
material, Table S1). Of the 54 meta-genes,
Figure 1. Meta-gene expression pattern in subtypes, connected with prognostic effect of subtypes and meta-genes, in the discovery set.
(A) Two heat maps clustering normal (left) and CRC (right) samples (columns) and meta-genes (rows). Colours represent decreased (blue)
or increased (red) meta-gene expression relative to their medians. Normal samples were clustered independently on meta-genes centred
to CRC meta-gene medians. For comparative purposes, ordering of meta-genes in normal samples is imposed to correspond to that of CRC
samples. White horizontal lines denote eight unsupervised clusters of meta-genes, each assigned a colour bar on the left; meta-genes not
belonging to a cluster have no colour bar. Names of the meta-genes corresponding to gene modules with gene–gene correlations in normal
samples comparable to those in cancer samples are marked red (see Supplementary material, Figure S1D). (B) Effect of inter-quartile range
(IQR) standardized expression of meta-genes on RFS, OS and SAR. Points represent estimated hazard ratio (HR), bars represent 95% CI.
Bold lines represent effects signiﬁcant at 5% without adjustment for multiple hypothesis testing; red lines represent effects signiﬁcant
at FDR < 10%; details are provided in Table S6 (see Supplementary material). (C) Kaplan–Meier plots for RFS, OS and SAR, with HR for
signiﬁcant pairwise comparisons (p values adjusted for FDR). Numbers below x axes represent number of patients at risk at selected time
points.
41 could be further grouped into eight gene module
clusters; 13 meta-genes remained ungrouped, each possibly
representing a distinct biological motif. Pathway
analysis characterized ﬁve of eight gene module clusters
by the following biological motifs: chromosome
20q (cluster 2), proliferation (cluster 3), EMT/stroma
(cluster 5) and immune response (clusters 7 and 8).
Literature searching identiﬁed biological motifs associated
with other gene modules. We labelled cluster 1
as GDC (genes differentially expressed in CRC), as
Table 1. Biological identification of gene modules
Each entry lists: cluster name (number of genes); pathway analysis result (number of overlapping genes, p value) or description based on literature search; selected genes.

1. GDC (27 genes)
Description based on literature search: Genes involved in differentiation of colon crypt and/or whose expression was reported to be affected in colorectal cancer and/or with prognostic effect in CRC
Selected genes:
Intestinal differentiation genes: CDX2[45], IHH[46], VAV3[47], ASCL2[35], PLAGL2[48]
Genes reported altered in colorectal cancer with prognostic effect: PITX2[49], DDC[50], PRLR[51], SPINK1[52]
Other genes connected to CRC: GGH, connected to CIMP+ phenotype [53]; NR1I2, connected to chemoresistance [54]

2. Chromosome 20q genes (33 genes)
Pathway analysis: Chromosome 20 (26 genes, 9.2E-34)
Selected genes:
Other, non-20q genes: TP53RK, ANO9, NEU1, CLDN3, PRSS8

3. Proliferation (83 genes)
Pathway analysis:
Cell cycle (36 genes, 3.0E-33)
Mitosis (26 genes, 1.4E-29)
Chromosome (26 genes, 2.5E-17)
DNA metabolic process (20 genes, 4.9E-10)
Lipid synthesis (4 genes, 5.0E-2)
Selected genes:
Mitotic checkpoint kinases: BUB1, BUB1B
Cyclins: CCNA2, CCNB2
Centromere proteins: CENPA, CENPE, CENPN
Kinesins: KIF11, KIF23, KIF4A
Topoisomerase II (TOP2A)
Cell division cycle 2 (CDC2)

4. Colon crypt markers (secretory cells) (16 genes)
Selected genes:
AGR2[55], AGR3, MUC2, SPINK4[56], RETNLB[57], REG4[58]

5. EMT/stroma (310 genes)
Pathway analysis:
Extracellular region part (90 genes, 2.7E-36)
Cell adhesion (57 genes, 1.2E-17)
Extracellular matrix (44 genes, 5.3E-30)
Collagen (16 genes, 1.2E-15)
EGF-like domain (26 genes, 1.6E-12)
Cell motion (33 genes, 7.2E-8)
Blood vessel development (25 genes, 1.1E-8)
Growth factor binding (6 genes, 6.0E-5)
Frizzled related (5 genes, 6.7E-3)
Cell junction organization (7 genes, 1.8E-2)
WNT receptor signalling pathway (8 genes, 1.4E-1)
Selected genes:
Inhibitors of β-catenin-dependent canonical WNT: SFRP1, SFRP2, SFRP4, DKK3, FZD1,7, PRICKLE1, NXN
Mesenchymal markers: N-cadherin, OB cadherin, SPARC, DDR2
EMT inducers (TFs): SNAI2, ZEB1, ZEB2, TWIST1, CDH11
ECM remodelling and invasion: MMP14, VIM
ECM proteins: fibronectin 1, collagens
Angiogenesis: PLAT, PLAU, NRP1, NRP2, THBS1, THBS2, THBS4
TGFs, their receptors and binding proteins: IGF1, IGFBP5,
IGFBP7,TGFB, LTBP1, LTBP2, PDGFRA, PDGFRB
6. Unidentiﬁed 14 DUSP1, EGR2, SERPINE1
7 and 8. Immune
response
103 Immune response (42 genes) 2.0E-28
Positive regulation of immune system process
(16 genes) 4.0E-9
Antigen processing and presentation via MHC
class II (6 genes) 7.5E-5
Defence response (31 genes) 3.3E-17
Chemokine signalling pathway (9 genes)
2.2E-3
Lymphocyte activation (11 genes) 2.1E-5
Regulation of programmed cell death
(14 genes) 2.1E-2
Cytokines: CCL3, CXCL5, CXCL9,CXCL10, CXCL11, SPP1,
LTB
MHC class II: HLA-DMB, HLA-DPA1, HLA-DRA, CD74
MHC class I: HLA-F, TAP1, TAP2
Anti-apoptotic: BCL2A1, CD74, BIRC3, IFI6, TNFAIP3,
TNFAIP3
Apoptotic: STAT1, XAF1
Interferon-induced proteins: IFI30, IFI16, IFI44, IFI16,
IFIH1, IFIT3
Cluster-unassigned meta-genes with colon crypt cell markers (enterocytes/top of the crypt)
Meta-gene 105 6 Top of the crypt genes FAM55A, FAM55D, MUC12 and CEACAM7[59],
SLC26A2[59], SLC26A3[59]
Meta-gene 144 5 Enterocytes, goblet cells markers LOC644844, NGEF, HEPH, KRT20[59], MUC20[59]
Cluster-unassigned meta-genes associated with chromosomal location 0
Meta-gene 81 7 Chromosome X (7 genes) 1.1E-8 CXorf15, EIF1AX, HDHD1A, MED14, PNPLA4, SCML1,
SMC1A
Meta-gene 97 6 Chromosome 20p (5 genes) 5.0E-11 CDC25B, CSNK2A1, MRPS26, PTPRA, RP5-1022P6.2,
SNRPB
Meta-gene 84 7 Chromosome 8 (7 genes) 5.4E-9 AGPAT5, FDFT1, GTF2E2, LONRF1, MTUS1, VPS37A,
ZNF395
Other cluster-unassigned meta-genes
Meta-gene 141 5 EREG AK3L1, ARID3A, EREG, LDLRAD3, ZBTB10
Meta-gene 112 6 Lipid synthesis (4 genes) 5.0E-2 DHCR7, FASN, FGFBP1, HMGCS1, IDI1, PCSK9
Meta-gene 95 6 Homeobox genes HOXA10, HOXA11, HOXA13, HOXA5, HOXA7, HOXA9
Meta-gene 124 5 Metallothioneins MT1E, MT1F, MT1G, MT1M, MT1X
Meta-gene 131 5 Disulphide bonds (5 genes) 1.7E-02 CXCL5, IL6, MMP1, MMP3, PTGS2
Meta-gene 143 5 Unidentiﬁed DUSP5, ERRFI1, KLF6, MXD1, PLAUR
Meta-gene 80 7 Regulation of RNA metabolic process
(6 genes) 4.9E-2
ATF3, C8orf4, FOS, JUNB, NR4A1, SIK1, ZFP36
Meta-gene 71 8 Gut development (3 genes) 3.5E-2 CCL11, CH25H, EDNRB, F2RL2, FOXF1, FOXF2, PCDH18,
WNT5A
 2013 Swiss Institute of Bioinformatics. Journal of Pathology published by John Wiley & Sons Ltd J Pathol 2013; 231: 63–76
on behalf of Pathological Society of Great Britain and Ireland. www.pathsoc.org.uk www.thejournalofpathology.com
17. Molecular heterogeneity of colorectal cancer
180
68 E Budinska et al
Table 2. Subtype-specific minimal gene set as identified by Elastic net
Minimal gene sets specifying a subtype, listed as genes up-regulated and down-regulated from the population mean.

A. Surface crypt-like
   Up-regulated: ADTRP, B3GNT7, CLCA1, MUC2, NR3C2, PADI2, RETNLB, STYK1
   Down-regulated: CHI3L1, FNDC1, TIMP3, SULF1

B. Lower crypt-like
   Up-regulated: CCDC113, CDHR1, FARP1, GPSM2, GRM8, HNF4A, IHH, KCNK5, KIAA0226L, MYRIP, PLAGL2, PRR15, QPRT, RNF43, RPS6KA3, SLC5A6, TP53RK, TSPAN6, VAV3, YAE1D1
   Down-regulated: ALOX5, BASP1, CREB3L1, CXCR4, EPB41L3, FSCN1, GFPT2, GPX8, ITPRIP, KCNMA1, KCTD12, MT1E, RARRES3, RNASE1, SGK1, SOCS3

C. CIMP-H-like
   Up-regulated: ANP32E, EGLN3, IDO1, PLK2, RAB27B, RARRES3, RPL22L1, TFAP2A
   Down-regulated: ATP9A, C10orf99, CXCL14, KIAA0226L

D. Mesenchymal
   Up-regulated: ANK2, BOC, C7, CRYAB, DCHS1, DDR2, GEM, PRICKLE1, TAGLN
   Down-regulated: HOOK1, RBM47

E. Mixed
   Up-regulated: CEACAM6, CXCL5, HSD11B1, IL1B, IL6, MRPS31, PI15, RAP2A, UQCC
   Down-regulated: AGR3, RAB27B, REG4
it consisted of a number of genes significantly associated
with CRC. Analysis of pairwise intra-gene module
correlations in the normal samples of both the discovery
and validation sets identified the following gene
modules as cancer-specific: chromosome 20q, several
immune response, EMT/stroma and GDC gene modules,
homeobox genes and gut development (see Supplementary
material, Figure S1D). The relationship between subtypes
and meta-genes is illustrated by the heat map
(Figure 1A), in which the major molecular motifs
and their role in subtype deﬁnition stand out. Table
S4 (see Supplementary material) contains median subtype
values per meta-gene and the results of differential
meta-gene expression testing between subtypes.
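The pathway-analysis entries in Table 1 pair a number of overlapping genes with a p value. As a rough illustration of how such an over-representation p value can arise, the sketch below computes a plain hypergeometric upper tail. It is not the statistic reported by the tool used in the study (DAVID [25] applies its own variant), and the universe and annotation sizes in the usage line are hypothetical placeholders, not values from the paper.

```python
from math import comb


def hypergeom_upper_tail(universe, annotated, module, overlap):
    """P(X >= overlap) when `module` genes are drawn without replacement
    from `universe` genes, of which `annotated` belong to the pathway."""
    max_overlap = min(annotated, module)
    total = comb(universe, module)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, module - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes on the array, 500 of them annotated to
# the term, a 33-gene module with 26 genes falling in the term.
p_value = hypergeom_upper_tail(universe=20000, annotated=500, module=33, overlap=26)
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to floating-point underflow before the final division.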
Subtypes are not determined by individual biological
components; rather, each component contributes to the
molecular identity of the subtypes. The EMT/stroma
cluster stands out in subtypes A + B (low expression)
and D + E (high expression), while subtype C notably
showed high expression of the immunity-associated
cluster. High expression of meta-genes representing
upper colon crypt cells in subtypes A and B correlated
with serrated and papillary (A) and complex
tubular (B) morphological patterns (see below). Given
the enterocyte-like morphology and retained polarity
of the neoplastic cells in these patterns, they are considered
well differentiated. Subtype C is associated
with the mucinous phenotype. Interestingly, subtypes
A and C show high expression of metallothioneins,
subtypes C and E show high expression of the homeobox
gene module, while subtypes E and B strongly
express a gene module containing the EREG gene
(Table 1). The high expression of chromosome 20q
cluster in subtype B was correlated with a signiﬁcantly
higher copy number gain/ampliﬁcation of all of 20q in
this subtype (see Supplementary material, Figure S8).
The low expression of lipid synthesis genes is striking
for subtype D and low expression of the gut development
gene module for subtype C. A reﬁned picture
of differences is given by a quantitative comparison
of (meta-)gene expression between subtype pairs (see
Supplementary material, Tables S4 and S5, Figure S4).
For each subtype we also identiﬁed a minimum set
of characteristic genes (Table 2; for more details, see
Supplementary material, Supplementary methods and
results).
Normal colon mucosa in the context of subtypes
When applied to the 64 normal samples, the LDA classiﬁer
assigned them all to subtype A, with posterior
probability > 0.99, supporting the observation that A is
well differentiated and closest to normal colonic epithelium
in terms of gene expression pattern. For validation,
we analysed five public datasets comprising 205
profiles of normal/adenoma/carcinoma samples. Most
of the normal and adenoma samples were classified by
LDA as subtype A (74.5% of 51 and 69.0% of 71,
respectively) or subtype B (21.6% and 28.2%, respectively),
confirming subtype A as the most normal-like.
The 80 carcinoma samples were distributed over all
subtypes (26.2% A, 30.0% B, 11.3% C, 18.7% D and
13.8% E).
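The posterior probabilities quoted above come from an LDA classifier trained on the subtypes. A minimal sketch of how LDA turns class means, a shared variance and class priors into posterior probabilities is given below for a single feature; the classifier in the study is multivariate (it operates on meta-gene expression), and the means, variance and priors used here are made-up illustration values.

```python
from math import exp, log


def lda_posteriors(x, means, variance, priors):
    """Posterior class probabilities for a univariate LDA model, i.e.
    Gaussian classes with class-specific means and one shared variance."""
    # Linear discriminant score: x*mu/s2 - mu^2/(2*s2) + log(prior)
    scores = {
        c: x * means[c] / variance - means[c] ** 2 / (2 * variance) + log(priors[c])
        for c in means
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {c: exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical two-class example with equal priors.
example = lda_posteriors(
    x=-3.0,
    means={"A": 0.0, "B": 2.0},
    variance=1.0,
    priors={"A": 0.5, "B": 0.5},
)
```

With a shared variance the quadratic term in x cancels between classes, which is why the scores are linear in x and the decision boundaries are hyperplanes in the multivariate case.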
Subtypes and patient survival
We assessed whether subtypes differ in survival, as a
general read-out of biological signiﬁcance, and then
tested the association of each meta-gene with prognosis,
using the complete discovery set of 1113 patients
(Figure 1B, C; see also Supplementary material, Table
S6). Kaplan–Meier curves for RFS, OS, SAR, hazard
ratios (HRs) and p values of pairwise differences
between subtypes are shown in Figure 1C. The results
indicate that subtypes C and D are associated with poor
OS. For subtype D, this is primarily due to early relapse
correlated with high expression of EMT genes and low
expression of proliferation-associated genes. For subtype
C it is the result of short SAR, correlated with
low expression of GDC, top colon crypt, EREG and
Chr 20q genes and high expression of meta-gene 126
(see Supplementary material, Table S1). For subtype E
the trend towards poorer OS and RFS was not statistically
signiﬁcant, although borderline signiﬁcant poorer
SAR was found relative to subtype B. Subtypes A and
B had better prognosis than D for all three endpoints,
although for OS in subtype A this was not signiﬁcant.
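For readers less familiar with the survival curves referred to above, the following is a minimal sketch of the Kaplan–Meier product-limit estimator, together with the median survival time read from the curve. It illustrates the general technique only; the follow-up times in the usage lines are invented and unrelated to the study data. A median is returned as `None` when the curve never drops to 0.5, which corresponds to a median that is not attained.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.
    times: follow-up times; events: 1 = event observed, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Invented follow-up data (months); 0 marks a censored patient.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
toy_median = median_survival(toy_curve)
```

Patients censored at a given time are counted as at risk for events occurring at that same time, which is the usual convention.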
The analysis of clinical and molecular markers
(below) showed that subtype C is enriched for MSI
tumours and BRAF mutant tumours, the latter present
also in subtype D. The literature indicates that MSI
is associated with better RFS, while BRAF mutation
is an indicator of worse SAR [27]. To analyse how
these two contradictory components affect survival in
Table 3. Result of additive multivariate Cox proportional hazards model, with subtype, BRAF mutation, MSI and stage (a)

Variable   RFS HR   p         OS HR   p         SAR HR   p
A          0.906    0.760     1.381   0.390     1.726    0.180
C          0.940    0.850     1.560   0.220     3.675    0.0022*
D          1.688    0.0055*   2.161   0.0011*   1.906    0.014*
E          1.506    0.210     2.201   0.035*    2.046    0.075
BRAFm      1.633    0.085     2.472   0.0034*   3.361    0.00072*
MSI        0.478    0.044*    0.275   0.004*    0.356    0.036*
Stage 3    0.770    0.190     0.943   0.820     1.780    0.062*

(a) Baseline is subtype B, MSS, BRAF wt and Stage 2.
* Variables significant in the model.
Hazard ratios (HR) for relapse-free survival (RFS), overall survival (OS) and survival after relapse (SAR).
subtypes, we built a multivariate Cox proportional
hazard model with subtype, stage, BRAF and MSI
(Table 3; see also Supplementary material, Table S6).
Subtype C remained signiﬁcantly associated with poor
SAR, even after the adjustment for BRAF, MSI and
stage, but not with RFS. Subtypes B and D remained
signiﬁcantly prognostic for RFS, OS and SAR. No
equivalent survival data were available for the datasets
in the validation series, hence these observations could
not be validated.
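Table 3 reports hazard ratios with p values but no confidence intervals. Assuming the p values come from two-sided Wald tests on the log hazard ratio (an assumption, not something stated in the text), an approximate interval can be back-calculated as sketched below. The function name is illustrative, and the result is only as precise as the rounded values in the table.

```python
from math import exp, log
from statistics import NormalDist


def wald_ci_from_hr_and_p(hr, p, level=0.95):
    """Approximate confidence interval for a hazard ratio, reconstructed
    from the point estimate and a two-sided Wald p value (0 < p < 1)."""
    beta = log(hr)
    z_observed = NormalDist().inv_cdf(1 - p / 2)  # equals |beta| / SE
    standard_error = abs(beta) / z_observed
    z_critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (
        exp(beta - z_critical * standard_error),
        exp(beta + z_critical * standard_error),
    )


# Subtype D versus baseline B for RFS in Table 3: HR 1.688, p 0.0055.
low, high = wald_ci_from_hr_and_p(1.688, 0.0055)
```

Under these assumptions the interval for subtype D and RFS is roughly 1.2 to 2.4, which excludes 1, in line with the significance flag in the table.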
Colorectal stem cell and Wnt signatures within
subtypes
We investigated the association of subtypes with
Wnt [28–32], putative colon cancer stem cell (CSC)
[33–35] signatures, and two signatures speciﬁc for
upper and lower colon crypt compartments [36], using
gene set enrichment analysis (Figure 2; see also Supplementary
material, Table S7). Subtypes B and E
highly expressed canonical Wnt signalling target signatures.
Subtypes A and D and also normal samples,
however, showed low expression of these signatures.
This was in concordance with the differences in β-catenin
nuclear immunoreactivity at the invasion front
(IF; see Supplementary material, Figure S9 and Supplementary
methods and results). Subtypes B and E
showed the highest percentages, while subtypes A and
D showed significantly lower percentages of β-catenin-positive
nuclei. Subtype C exhibited almost no
β-catenin nuclear immunoreactivity at the IF. We analysed
CSC signatures derived from low colon crypt
compartment cells that had been identiﬁed either by a
Wnt reporter construct TOP GFP or by high surface
expression of EphB2. Subtypes D and E expressed
both TOP GFP and EphB2-derived CSC signatures,
while subtype B mainly expressed only the TOP GFP
signature (Figure 2).
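The exact enrichment score used for these signatures is defined in the Supplementary material and is not reproduced here. As a generic illustration of the idea, the sketch below computes an unweighted running-sum (Kolmogorov–Smirnov-like) enrichment score over a ranked gene list: positive when the signature genes concentrate among the up-regulated genes, negative when they concentrate among the down-regulated genes. The gene labels are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.
    ranked_genes: genes ordered from most up- to most down-regulated."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        # Step up on a signature gene, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Placeholder ranking of six genes.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
score_top = enrichment_score(ranking, {"g1", "g2"})
score_bottom = enrichment_score(ranking, {"g5", "g6"})
```

Significance would normally be assessed by permutation, which is omitted from this sketch.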
Subtypes complement clinical and molecular
markers
An important goal of this study was to assess how
our molecular subtypes complement known clinical
variables and molecular markers. We found that MSI,
BRAF mutation status, site, mucinous histology and
expression of p53 were signiﬁcantly associated with
various subtypes (Figure 3), but not tumour stage,
age, gender, SMAD4 or PIK3CA mutations (see Supplementary
material, Figure S5A). Subtype D was
not signiﬁcantly enriched for any of the tested variables
except for the BRAF mutated signature and
possibly represents a mixture of tumours that have
the EMT/stroma signature in common. KRAS mutants
occurred in all subtypes (see Supplementary material,
Figure S5C), supporting the emerging notion that
KRAS-mutated CRC are substantially heterogeneous
[5,6,37], the oncogenic role of KRAS varying per speciﬁc
mutation and the molecular background of the
tumour in which it occurs [38]. Subtype C expressed
the BRAF mutant signature we identiﬁed earlier [6]
(87.0%), a CIMP-H signature ([11], Figure 2), and
its characteristics (enrichment for MSI, right side and
mucinous histology) corresponded with those of the
previously reported CIMP-H phenotype [9,11,39,40]
and hypermutated tumours [13]. Regarding the latter,
subtype C had a similar low frequency of copy
number variations (see Supplementary material, Figure
S7). The distribution of MSI status, stage, age, gender,
grade and site over the subtypes in the validation
set followed the same patterns established in the discovery
set [cf Figures 3 and S5B (see Supplementary
material)]. A classiﬁcation tree, trained with a combination
of available clinical and molecular markers, did
not identify our subtypes (see Supplementary material,
Figure S5D), indicating that gene expression patterns
reveal a layer of heterogeneity that goes beyond conventional
CRC classiﬁcation approaches.
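Figure 3 flags a feature as enriched in a subtype when its proportion differs significantly from the population baseline. The test actually used is not specified in this section; one simple way to frame such a comparison is a one-sided exact binomial test, sketched below. The count of 55 MSI tumours is derived from the 60.4% of 91 subtype C samples quoted in the Figure 3 caption, whereas the baseline proportion of 0.15 is a hypothetical placeholder.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided p value for
    enrichment of a feature seen in k of n samples against baseline p0."""
    return sum(
        comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
        for i in range(k, n + 1)
    )


# 55 of 91 subtype C samples with MSI information are MSI (60.4%);
# the baseline of 0.15 is an assumed value for illustration only.
p_enriched = binomial_upper_tail(k=55, n=91, p0=0.15)
```

A depletion test would use the lower tail in the same way.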
Histological characteristics of subtypes
To study whether or not our molecular subtypes
are associated with histological patterns, we examined
haematoxylin and eosin (H&E)-stained parafﬁn
sections of a randomly selected subset of each subtype
(23, 31, 31, 29 and 19 cases for subtypes A, B, C, D and
E, respectively). In attempting to match histological
morphotypes to molecular subtypes, architectural patterns
were used, as illustrated in Figure 4A, rather than
the recognized WHO classiﬁcation of CRCs [1]. Not
surprisingly, given intratumour heterogeneity, none of
the tumours had a single pattern. However, the prevalent
patterns showed appreciable differences between
the subgroups (Figure 4B, C; see also Supplementary
material, Figure S6). In subtype A, the serrated pattern
was most frequent, followed by the papillary pattern; in
Figure 2. Subtypes and biological motifs. Subtype-speciﬁc ﬁngerprints of biological motifs, represented either as mean values of gene set
enrichment scores of gene sets from corresponding gene modules (EMT/stroma, immune, secretory cells, proliferation, GDC, chromosome
20q, top of the crypt—meta105 and meta144) or composed gene set enrichment scores of particular signatures (canonical Wnt targets,
CSC-TopGFP, CSC-EphB2, colon crypt bottom and CIMP-H). The gene set enrichment scores indicate whether the genes from the gene set
show statistically significant enrichment among the down-regulated (negative scores, light blue area) or up-regulated (positive scores)
genes of a given subtype; details of score calculation can be found in the Supplementary material (Supplementary methods and results
and Table S7).
Figure 3. Clinical and mutational characterization of subtypes. Columns represent variables and rows subtypes. Horizontal bar plots
represent proportions of the corresponding variable in each of the subtypes and non-core samples. Non-core samples were tested as one
group to ensure that they did not share a common characteristic that would set them apart. Numbers in brackets adjacent to subtype
name represent overall number of samples in the subtype. Under the title of each variable we denote the percentage representing baseline
proportion in the population, with available information, and N denotes the number of patients for which the information on the respective
feature was available. Bars in red represent signiﬁcant enrichment and bars in blue signiﬁcant depletion of a feature in the subtype in
comparison to baseline, at the 5% signiﬁcance level. Adjacent to each bar is the percentage of samples in the subtype with the speciﬁc
feature and in brackets the overall number of samples in the subtype with the information available. We can read that, for instance,
subtype C, comprising 154 samples, is enriched for microsatellite-unstable (MSI) tumours, where 60.4% of 91 samples with available
information are MSI.
Figure 4. Morphological CRC patterns. (A) Morphological CRC patterns scored in subtypes. (B, C) Distribution of dominant (B) and secondary
(C) histological patterns in subtypes. Columns represent subtypes and widths are proportional to subtype frequency (numbers of samples
in each subtype); rows represent dominant (B) or secondary (C) patterns and heights are proportional to pattern frequency. Boxes show
adjusted p values of pairwise statistical testing of morphological pattern distribution between subtypes.
subtypes B and E, complex tubular dominated; in subtype
C the solid pattern dominated, with mucinous as
the second; most striking was the presence of a strong
stromal reaction in subtype D.
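Several results in this article are reported as p values adjusted for multiple testing (for example the pairwise comparisons in Figures 1C and 4). The adjustment procedure is not named in this section; the sketch below shows the Benjamini–Hochberg step-up procedure, a common choice for controlling the false discovery rate, purely as an illustration. The input p values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented raw p values for four pairwise comparisons.
adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A comparison is then called significant at a given FDR, for example 10%, when its adjusted p value falls below that threshold.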
Discussion
Our approach, using gene modules on a large panel of
samples, allowed us to identify ﬁve main CRC gene
expression subtypes (Table 4). It is relevant to note
that subtyping can be performed on FFPE tissues, an
important prerequisite for wide clinical applications.
An example is the hypermutated group identiﬁed in
the TCGA study by whole exome sequencing [13], but
according to our data also by gene expression proﬁling
on routinely processed tissues (CIMP-H-like subtype).
The combination of gene expression, clinical, mutational,
survival and morphological data contributes new
insight into the heterogeneity of CRC. While the validation
conﬁrmed the robustness of our ﬁndings across
different platforms (ALMAC versus Affymetrix), sample
preparation methods (FFPE versus fresh-frozen)
and dataset collections, larger datasets are necessary
to assess and characterize the relevance of lower frequency
subtypes (eg F, or further segregation of B
into B1 and B2). Our data indicate that several major
biological processes are key determinants of a complex
subtype structure of CRC. Therefore our subtypes
deﬁned by gene expression do not substitute
but complement groups deﬁned by current clinicopathological
variables and molecular markers. Notably,
morphological subclassiﬁcation of CRC has clearly
reached its limits, given the often striking intratumour
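Table 4. Summary of subtype characteristics
Fields per subtype: CRC markers and mutations (MSI, BRAF, KRAS, P53); histopathology (dominant pattern); IHC (nuclear β-catenin at IF); median survival in months (OS, RFS, SAR); clinical (site, grade); gene expression (up-regulated, down-regulated). For the four marker columns, symbols are listed in column order and cells without a value are not shown.

A: Surface crypt-like
   CRC markers and mutations: –, +
   Dominant histopathology: Papillary or serrated
   Nuclear β-catenin at IF: –
   Median survival (months): OS NA; RFS NA; SAR 28.9
   Up-regulated: Top colon crypt, secretory cell, metallothioneins
   Down-regulated: EMT/stroma, Wnt, CSC, Chr20q, proliferation

B: Lower crypt-like
   CRC markers and mutations: –, –
   Dominant histopathology: Complex tubular
   Nuclear β-catenin at IF: +
   Median survival (months): OS NA; RFS NA; SAR 50.4
   Site: Left; Grade: 2
   Up-regulated: Top colon crypt, proliferation, Wnt
   Down-regulated: EMT/stroma, immune, secretory cell

C: CIMP-H-like
   CRC markers and mutations: +, +, –
   Dominant histopathology: Solid/trabecular or mucinous
   Nuclear β-catenin at IF: –
   Median survival (months): OS NA; RFS NA; SAR 6.9
   Site: Right; Grade: 3
   Up-regulated: Proliferation, immune, metallothioneins
   Down-regulated: GDC, top colon crypt, Chr20q

D: Mesenchymal
   Dominant histopathology: Desmoplastic
   Nuclear β-catenin at IF: –
   Median survival (months): OS NA; RFS 79.5; SAR 19.8
   Up-regulated: EMT/stroma, CSC, immune
   Down-regulated: Proliferation, secretory cell, top colon crypt, GDC, Wnt, Chr20q

E: Mixed
   CRC markers and mutations: –, –, +
   Dominant histopathology: Complex tubular
   Nuclear β-catenin at IF: +
   Median survival (months): OS NA; RFS NA; SAR 19.6
   Site: Left
   Up-regulated: EMT/stroma, immune, top colon crypt, Chr20q, GDC, CSC
   Down-regulated: Secretory cell

+, significantly enriched; –, significantly depleted; IF, invasion front; NA, not attained; no value, no significant enrichment in comparison to population baseline.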
heterogeneity, which made us use a (primary and
secondary) architectural pattern approach rather than
the canonized histological subtypes (WHO). Proﬁling
of microdissected patterns within a single tumour might
reveal molecular mechanisms responsible for these
morphotypes. This additional heterogeneity within the
subtypes may reﬂect tumour polyclonality, similar to
breast cancer [41]. Ultimately, aggregating clinical,
pathological and further detailed molecular characteristics
(including CNV, miRNA and methylation) will
contribute to a more detailed perception of CRC heterogeneity
and it is likely that more subtypes will emerge.
This, however, would need more detailed molecular
annotation of larger clinically well documented CRCs.
A striking association was found between the stromal
subtype D and the EMT signature. The previously
described EMT signature [12] also emerged from our analysis
as the largest cluster of meta-genes associated with
poor RFS (subtype D). Our histological assessment
suggests that the EMT signature is the reﬂection of
a strong mesenchymal stromal reaction, and this histological
characteristic deserves to be tested for its
capacity to predict resistance to therapy, in view of its
strong association with poor survival. Studies requiring
high tumour cell content as sample inclusion criteria
(eg [13]) could miss this poor prognosis subtype. Identiﬁcation
of this subtype in cell lines or xenograft models
is less straightforward and would beneﬁt from the
analysis of gene expression patterns between microdissected
tumour and stromal cells.
EMT, however important, only partly explains
CRC heterogeneity, as even subtypes with similar
expression of EMT-associated genes (A–C or D–E)
differ in survival, mutational, clinical and gene expression
characteristics. Additional biological components,
such as differentiation, immune response, proliferation,
chromosome 20q or cluster of genes deregulated in
CRCs, are important co-determinants that underpin a
need for further subdivision of CRCs. The ﬁndings
from the analysis of CSC and WNT signatures support
the recently suggested hypothesis that the colon stem
cell signature under the condition of silenced canonical
WNT targets is associated with higher risk of
recurrence (subtype D) [33]. This is consistent with
subtype D showing a signiﬁcantly lower percentage
of β-catenin-positive nuclei than subtype B, with its
Wnt-associated gene expression and better survival.
MSI tumours represent a subclass in most unsupervised
analyses and can be recognized at the gene
expression level [42]. The more recent gene expression
studies [14,15] suggest that MSI and BRAF share distinct
gene expression patterns. Subtype C was enriched
for both MSI and BRAF mutants and had one of the
best outcomes for RFS, but the worst outcome for SAR,
in concordance with previously reported results [43].
Subtype C retained its poor SAR prognostic value, even
in the population of MSS and BRAF wild-type patients.
Our data suggest that subtype C represents tumours
with a common biology and a gene expression pattern
that might best characterize a group of tumours resistant
to chemotherapy, once metastatic. In this sense,
our work not only agrees with the current known markers
(BRAF mutation status and MSI) but clearly adds
new insight, putting together these previously unrelated
clusters into one biologically meaningful group.
This observation is in line with recently published
work [6].
Our observations show that gene expression proﬁling
contributes substantially to our insight into CRC
heterogeneity in conﬁrming and complementing data
from sequencing, CNV and promoter methylation
analysis. Our subtypes can be further functionally
interrogated for driving oncogenes/events by in vitro
functional screens. Identification of the high-risk subtypes
C and D might contribute to therapeutic decision making in either
adjuvant or metastatic settings. Retrospective analysis
of clinical trial series may identify drug sensitivity
associated with particular subtypes, and might open
new treatment optimization strategies to be tested in
clinical trials with stratiﬁed cohorts, similar to the
I-SPY2 trial for breast cancer [44].
In conclusion, our unsupervised approach using
gene modules resulted in the identiﬁcation of distinct
molecularly deﬁned CRC subtypes, which adds
a new layer of complexity to CRC heterogeneity and
opens new opportunities for understanding the disease.
The challenge is now to assimilate conventional
and these new molecular approaches into a comprehensive
consensus classiﬁcation, which might then be
used in further clinical studies for patient stratiﬁcation
and experimental studies to further elucidate mechanisms
involved in the development and progression
of CRC.
Acknowledgements
We thank all the clinicians who enrolled patients and
participated in the PETACC-3 trial (see Appendix
at: http://jco.ascopubs.org/content/27/19/3117.long), in
particular the coordinators D Cunningham, R Labianca
and E Van Cutsem. We thank F Schütz, T Sengstag
and A Wirapati for providing analytical tools
for clustering and gene set enrichment analysis. This
work was supported in part by the Swiss National
Science Foundation (SNF; http://www.snf.ch/E/; Grant
No. 320030_135421, to AR and MD); Krebsforschung
Schweiz (KFS; http://www.krebsforschung.ch/; Grant
No. 02697-08-2010, to AR and MD); Fondation
Medic (to EB and MD); and the Swiss National
Centre for Competence in Research Molecular Biology
(http://www.nccr-oncology.ch/; to VP and MD).
ST is a Senior Clinical Investigator of the Fund
for Scientiﬁc Research Flanders (FWO-Vlaanderen)
and has received research grants from the Belgian
National Cancer Plan (Nationaal Kankerplan). We
thank the TCGA research network for the availability
of data.
Author contributions
EB and MD designed the study; YP, FTB, ST, JGH and
SW conceived and carried out microarray experiments;
YP and FTB performed histopathological experiments
and β-catenin scoring; EB, VP, GD, NL and AFN
analysed the data, EB, VP, ST, FTB, KOS, NL, JGH,
SW, MD and AR performed data interpretation; EB,
NL and KOS performed the literature search; and EB
generated ﬁgures and conceived the ﬁrst manuscript
draft. All authors were involved in writing the paper
and had ﬁnal approval of the submitted and published
versions.
References
1. Bosman FT, World Health Organization, International Agency
for Research on Cancer. WHO Classiﬁcation of Tumours of the
Digestive System, 4th edn. International Agency for Research on
Cancer (IARC): Lyons, 2010.
2. Tejpar S, Saridaki Z, Delorenzi M, et al. Microsatellite instability,
prognosis and drug sensitivity of stage II and III colorectal cancer:
more complexity to the puzzle. J Natl Cancer Inst 2011; 103:
841–844.
3. Sinicrope FA, Sargent DJ. Molecular pathways: microsatellite instability
in colorectal cancer: prognostic, predictive, and therapeutic
implications. Clin Cancer Res 2012; 18: 1506–1512.
4. Vecchione L, Jacobs B, Normanno N, et al. EGFR-targeted therapy.
Exp Cell Res 2011; 317: 2765–2771.
5. Martini M, Vecchione L, Siena S, et al. Targeted therapies: how
personal should we go? Nat Rev Clin Oncol 2011; 9: 87–97.
6. Popovici V, Budinska E, Tejpar S, et al. Identification of a poor-prognosis
BRAF-mutant-like population of patients with colon
cancer. J Clin Oncol 2012; 30: 1288–1295.
7. Jass JR. Classiﬁcation of colorectal cancer based on correlation
of clinical, morphological and molecular features. Histopathology
2007; 50: 113–130.
8. Shen L, Toyota M, Kondo Y, et al. Integrated genetic and epigenetic
analysis identiﬁes three different subclasses of colon cancer. Proc
Natl Acad Sci USA 2007; 104: 18654–18659.
9. Ogino S, Goel A. Molecular classiﬁcation and correlates in colorectal
cancer. J Mol Diagn 2008; 10: 13–27.
10. Furlan D, Carnevali IW, Bernasconi B, et al. Hierarchical clustering
analysis of pathologic and molecular data identiﬁes prognostically
and biologically distinct groups of colorectal carcinomas. Mod
Pathol 2011; 24: 126–137.
11. Hinoue T, Weisenberger DJ, Lange CP, et al. Genome-scale
analysis of aberrant DNA methylation in colorectal cancer. Genome
Res 2012; 22: 271–282.
12. Loboda A, Nebozhyn MV, Watters JW, et al. EMT is the dominant
program in human colon cancer. BMC Med Genom 2011; 4: 9.
13. Cancer Genome Atlas Network. Comprehensive molecular characterization of
human colon and rectal cancer. Nature 2012; 487: 330–337.
14. Salazar R, Roepman P, Capella G, et al. Gene expression signature
to improve prognosis prediction of stage II and III colorectal cancer.
J Clin Oncol 2011; 29: 17–24.
15. Perez Villamil B, Romera Lopez A, Hernandez Prieto S, et al.
Colon cancer molecular subtypes identiﬁed by expression proﬁling
and associated to stroma, mucinous type and different clinical
behavior. BMC Cancer 2012; 12: 260.
16. Wirapati P, Sotiriou C, Kunkel S, et al. Meta-analysis of
gene expression proﬁles in breast cancer: toward a uniﬁed
understanding of breast cancer subtyping and prognosis signatures.
Breast Cancer Res 2008; 10: R65.
17. Farmer P, Bonnefoi H, Becette V, et al. Identiﬁcation of molecular
apocrine breast tumours by microarray analysis. Oncogene 2005;
24: 4660–4671.
18. Shedden K, Taylor JM, Enkemann SA, et al. Gene expression-based
survival prediction in lung adenocarcinoma: a multi-site, blinded
validation study. Nat Med 2008; 14: 822–827.
19. Xie T, d'Ario G, Lamb JR, et al. A comprehensive characterization of
genome-wide copy number aberrations in colorectal cancer reveals
novel oncogenes and patterns of alterations. PLoS One 2012; 7:
e42001.
20. Langfelder P, Zhang B, Horvath S. Deﬁning clusters from a
hierarchical cluster tree: the Dynamic Tree Cut package for R.
Bioinformatics 2008; 24: 719–720.
21. Monti S, Tamayo P, Mesirov J, et al. Consensus clustering: a
resampling-based method for class discovery and visualization of
gene expression microarray data. Machine Learning 2003; 52:
91–118.
22. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, 2nd edn.
Springer: New York, 2009.
23. Zou H, Hastie T. Regularization and variable selection via the
elastic net. J R Stat Soc B 2005; 67: 301–320.
24. Smyth GK. Linear models and empirical bayes methods for
assessing differential expression in microarray experiments. Stat
Appl Genet Mol Biol 2004; 3: Article 3.
25. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources.
Nat Protoc 2009; 4: 44–57.
26. Arnholt AT. BSDA: Basic statistics and data analysis. R package
v 1.01, 2012; http://cran.r-project.org/web/packages/BSDA/
index.html
27. Tejpar S, Bertagnolli M, Bosman F, et al. Prognostic and predictive
biomarkers in resected colon cancer: current status and future
perspectives for integrating genomics into biomarker discovery.
Oncologist 2010; 15: 390–404.
28. Mokry M, Hatzis P, de Bruijn E, et al. Efﬁcient double fragmentation
ChIP-seq provides nucleotide resolution protein-DNA binding
proﬁles. PLoS One 2010; 5: e15092.
29. Hatzis P, van der Flier LG, van Driel MA, et al. Genome-wide
pattern of TCF7L2/TCF4 chromatin occupancy in colorectal cancer
cells. Mol Cell Biol 2008; 28: 2732–2744.
30. Van der Flier LG, Sabates-Bellver J, Oving I, et al. The intestinal
Wnt/TCF signature. Gastroenterology 2007; 132: 628–632.
31. Sansom OJ, Reed KR, Hayes AJ, et al. Loss of APC in vivo
immediately perturbs Wnt signaling, differentiation, and migration.
Genes Dev 2004; 18: 1385–1390.
32. Fevr T, Robine S, Louvard D, et al. Wnt/β-catenin is essential for
intestinal homeostasis and maintenance of intestinal stem cells. Mol
Cell Biol 2007; 27: 7551–7559.
33. de Sousa EMF, Colak S, Buikhuisen J, et al. Methylation of
cancer stem-cell-associated Wnt target genes predicts poor prognosis
in colorectal cancer patients. Cell Stem Cell 2011; 9:
476–485.
34. Merlos-Suarez A, Barriga FM, Jung P, et al. The intestinal stem
cell signature identiﬁes colorectal cancer stem cells and predicts
disease relapse. Cell Stem Cell 2011; 8: 511–524.
35. van der Flier LG, van Gijn ME, Hatzis P, et al. Transcription factor
achaete scute-like 2 controls intestinal stem cell fate. Cell 2009;
136: 903–912.
36. Kosinski C, Stange DE, Xu C, et al. Indian hedgehog regulates
intestinal stem cell fate through epithelial–mesenchymal
interactions during development. Gastroenterology 2010; 139:
893–903.
37. Faris JE, Ryan DP. Trees, forests, and other implications of a BRAF
mutant gene signature in patients with BRAF wild-type disease. J
Clin Oncol 2012; 30: 1255–1257.
38. Singh A, Sweeney MF, Yu M, et al. TAK1 inhibition promotes
apoptosis in KRAS-dependent colon cancers. Cell 2012; 148:
639–650.
39. Tanaka H, Deng G, Matsuzaki K, et al. BRAF mutation, CpG
island methylator phenotype and microsatellite instability occur
more frequently and concordantly in mucinous than non-mucinous
colorectal cancer. Int J Cancer 2006; 118: 2765–2771.
40. Hawkins N, Norrie M, Cheong K, et al. CpG island methylation
in sporadic colorectal cancers and its relationship to microsatellite
instability. Gastroenterology 2002; 122: 1376–1387.
41. The Cancer Genome Atlas. Comprehensive molecular portraits of
human breast tumours. Nature 2012; 490: 61–70.
42. Tian S, Roepman P, Popovici V, et al. A robust genomic signature
for the detection of colorectal cancer patients with microsatellite
instability phenotype and high mutation frequency. J Pathol 2012;
228: 586–595.
43. Dahlin AM, Palmqvist R, Henriksson ML, et al. The role of the
CpG island methylator phenotype in colorectal cancer prognosis
depends on microsatellite instability screening status. Clin Cancer
Res 2010; 16: 1845–1855.
44. Barker AD, Sigman CC, Kelloff GJ, et al. I-SPY 2: an adaptive
breast cancer trial design in the setting of neoadjuvant chemotherapy.
Clin Pharmacol Ther 2009; 86: 97–100.
45. Moskaluk CA, Zhang H, Powell SM, et al. Cdx2 protein expression
in normal and malignant human tissues: an immunohistochemical
survey using tissue microarrays. Mod Pathol 2003; 16: 913–919.
46. van den Brink GR, Bleuming SA, Hardwick JC, et al. Indian
Hedgehog is an antagonist of Wnt signaling in colonic epithelial
cell differentiation. Nat Genet 2004; 36: 277–282.
47. Liu JY, Seno H, Miletic AV, et al. Vav proteins are necessary
for correct differentiation of mouse cecal and colonic enterocytes.
J Cell Sci 2009; 122: 324–334.
48. Zheng H, Ying H, Wiedemeyer R, et al. PLAGL2 regulates Wnt
signaling to impede differentiation in neural stem cells and gliomas.
Cancer Cell 2010; 17: 497–509.
49. Hirose H, Ishii H, Mimori K, et al. The signiﬁcance of PITX2
overexpression in human colorectal cancer. Ann Surg Oncol 2011;
18: 3005–3012.
50. Kontos CK, Papadopoulos IN, Fragoulis EG, et al. Quantitative
expression analysis and prognostic signiﬁcance of L-DOPA decarboxylase
in colorectal adenocarcinoma. Br J Cancer 2010; 102:
1384–1390.
51. Bhatavdekar J, Patel D, Ghosh N, et al. Interrelationship of
prolactin and its receptor in carcinoma of colon and rectum: a
preliminary report. J Surg Oncol 1994; 55: 246–249.
52. Gaber A, Johansson M, Stenman UH, et al. High expression of
tumour-associated trypsin inhibitor correlates with liver metastasis
and poor prognosis in colorectal cancer. Br J Cancer 2009; 100:
1540–1548.
53. Kawakami K, Ooyama A, Ruszkiewicz A, et al. Low expression
of gamma-glutamyl hydrolase mRNA in primary colorectal cancer
with the CpG island methylator phenotype. Br J Cancer 2008; 98:
1555–1561.
54. Chen Y, Tang Y, Guo C, et al. Nuclear receptors in the multidrug
resistance through the regulation of drug-metabolizing enzymes and
drug transporters. Biochem Pharmacol 2012; 83: 1112–1126.
55. Park SW, Zhen G, Verhaeghe C, et al. The protein disulﬁde
isomerase AGR2 is essential for production of intestinal mucus.
Proc Natl Acad Sci USA 2009; 106: 6950–6955.
56. Noah TK, Kazanjian A, Whitsett J, et al. SAM pointed domain ETS
factor (SPDEF) regulates terminal differentiation and maturation of
intestinal goblet cells. Exp Cell Res 316: 452–465.
57. Steppan CM, Brown EJ, Wright CM, et al. A family of tissue-specific
resistin-like molecules. Proc Natl Acad Sci USA 2001; 98:
502–506.
58. Heiskala K, Giles-Komar J, Heiskala M, et al. High expression of
RELP (Reg IV) in neoplastic goblet cells of appendiceal mucinous
cystadenoma and pseudomyxoma peritonei. Virchows Arch 2006;
448: 295–300.
59. Dalerba P, Kalisky T, Sahoo D, et al. Single-cell dissection of
transcriptional heterogeneity in human colon tumors. Nat Biotechnol
2011; 29: 1120–1127.
60. *R Development Core Team. A Language and Environment for
Statistical Computing. R Foundation for Statistical Computing:
Vienna, Austria, 2012.
61. *Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open
software development for computational biology and bioinformatics.
Genome Biol 2004; 5: R80.
62. *Therneau T. A Package for Survival Analysis in S. R package
version 2.36–14, 2012.
63. *Bolstad BM, Collin F, Simpson KM, et al. Experimental design
and low-level analysis of microarray data. Int Rev Neurobiol 2004;
60: 25–58.
64. *Venables WN, Ripley BD. Modern Applied Statistics with S, 4th
edn. Springer: New York, 2002.
65. *Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning:
a conditional inference framework. J Comput Graph Statist 2006;
15: 651–674.
66. *Van Cutsem E, Labianca R, Bodoky G, et al. Randomized phase III
trial comparing biweekly infusional ﬂuorouracil/leucovorin alone or
with irinotecan in the adjuvant treatment of stage III colon cancer:
PETACC-3. J Clin Oncol 2009; 27: 3117–3125.
67. *Jorissen RN, Gibbs P, Christie M, et al. Metastasis-associated
gene expression changes predict poor outcomes in patients with
Dukes stage B and C colorectal cancer. Clin Cancer Res 2009; 15:
7642–7651.
68. *IGC. Expression Project for Oncology, 2008 [cited; available from:
http://www.intgen.org/expo/]
69. *Smith JJ, Deane NG, Wu F, et al. Experimentally derived
metastasis gene expression proﬁle predicts recurrence and death
in patients with colon cancer. Gastroenterology 2010; 138:
958–968.
70. *Skrzypczak M, Goryca K, Rubel T, et al. Modeling oncogenic
signaling in colon tumors by multidirectional analyses of microarray
data directed for maximization of analytical reliability. PLoS One
2010; 5: e13091.
71. *Hong Y, Ho KS, Eu KW, et al. A susceptibility gene set
for early onset colorectal cancer that integrates diverse signaling
pathways: implication for tumorigenesis. Clin Cancer Res 2007;
13: 1107–1114.
72. *Gyorffy B, Molnar B, Lage H, et al. Evaluation of microarray
preprocessing algorithms based on concordance with RT–PCR in
clinical samples. PLoS One 2009; 4: e5645.
73. *Galamb O, Sipos F, Solymosi N, et al. Diagnostic mRNA
expression patterns of inﬂamed, benign, and malignant colorectal
biopsy specimen and their correlation with peripheral blood results.
Cancer Epidemiol Biomarkers Prev 2008; 17: 2835–2845.
74. *Galamb O, Spisak S, Sipos F, et al. Reversal of gene expression
changes in the colorectal normal–adenoma pathway by NS398
selective COX2 inhibitor. Br J Cancer 2010; 102: 765–773.
75. *Koinuma K, Yamashita Y, Liu W, et al. Epigenetic silencing
of AXIN2 in colorectal carcinoma with microsatellite instability.
Oncogene 2006; 25: 139–146.
76. *Jorissen RN, Lipton L, Gibbs P, et al. DNA copy-number alterations
underlie gene expression differences between microsatellite
stable and unstable colorectal cancers. Clin Cancer Res 2008; 14:
8061–8069.
77. *Grone J, Lenze D, Jurinovic V, et al. Molecular proﬁles and
clinical outcome of stage UICC II colon cancer patients. Int J
Colorectal Dis 2011; 26: 847–858.
78. *Birnbaum DJ, Laibe S, Ferrari A, et al. Expression proﬁles in
stage II colon cancer according to APC gene status. Transl Oncol
2012; 5: 72–76.
79. *Giancarlo R, Scaturro D, Utro F. Computational cluster validation
for microarray data analysis: experimental assessment of Clest,
Consensus Clustering, Figure of Merit, Gap Statistics and Model
Explorer. BMC Bioinformat 2008; 9: 462.
80. *Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment
tools: paths toward the comprehensive functional analysis of
large gene lists. Nucleic Acids Res 2009; 37: x1–13.
*Cited only in the Supplementary material.
SUPPLEMENTARY MATERIAL ON THE INTERNET
The following supplementary material may be found in the online version of this article:
Supplementary methods and results (contains a further table and two further ﬁgures)
Figure S1. (A) Consensus clustering and similarity dendrogram of samples. (B) Subtype projection in the four-dimensional space of LDA axes.
(C) Heat map matrix of pairwise meta-gene Fisher Z-transformed Pearson pairwise correlations. (D) Box plots of intra gene module pairwise
gene–gene Pearson correlations in normal samples in both discovery and validation sets
Figure S2. Validation of meta-gene expression pattern of subtypes represented by heat maps
Figure S3. (A) Heat map representing validation of gene expression patterns of subtypes. (B) Pairwise Fisher Z-transformed correlations of
meta-genes in validation set. (C) Box plots representing medians of pairwise gene–gene Pearson correlations in the validation datasets
Figure S4. Expression of top ﬁve down- and top ﬁve up regulated genes from all pairwise comparisons between subtypes
Figure S5. (A) Other clinical and mutational markers tested and found non-signiﬁcant between subtypes. (B) Clinical variables tested in the
clusters of the validation test. (C) Distribution of signiﬁcant clinical and mutational markers across subtypes. (D) Classiﬁcation tree trained on
clinical variables
Figure S6. Graphs of joint distribution of dominant versus secondary patterns in each of the subtypes
Figure S7. Heat map of CNV proﬁles of 154 samples from the discovery set, randomly ordered inside each of the subtypes
Figure S8. Result of hypothesis testing of median log-scale copy number estimates of chromosome 20 of subtype B versus all other subtypes
Figure S9. Distribution of β-catenin immunoreactivity of the invasion front counts between subtypes
Table S1. Detailed description of gene module members and detailed results of meta-gene expression tests pairwise between subtypes and of
subtypes to meta-gene medians
Table S2. Multiclass linear discriminant (LDA) subtype assignment of samples from validation set
Table S3. Correlations of subtype-speciﬁc gene expression proﬁles (1 versus all moderated t test statistics) when accounting for subtype F in the
training set
Table S4. Detailed results of meta-gene expression tests pairwise between subtypes and of subtypes to meta-gene medians
Table S5. Detailed results of pairwise comparisons of differentially expressed gene between subtypes
Table S6. Detailed results of Cox proportional hazards models for RFS, OS and SAR for subtype, stage, MSI and BRAF and for meta-genes
Table S7. Results of GSEA comparison of enrichment tested signatures in individual subtypes and normal samples
© 2013 Swiss Institute of Bioinformatics. Journal of Pathology published by John Wiley & Sons Ltd on behalf of Pathological Society of Great Britain and Ireland. J Pathol 2013; 231: 63–76. www.pathsoc.org.uk www.thejournalofpathology.com
18 Image-based surrogate biomarkers for molecular
subtypes of colorectal cancer
• Bioinformatics, 33(13):2002–2009, 2017
• IF: 7.307
• number of citations: 0
• personal contribution (80%): image analysis method design, data
collection and processing, experimental design and implementation,
manuscript writing
Bioinformatics
Image-based surrogate biomarkers for molecular
subtypes of colorectal cancer
Vlad Popovici1,∗, Eva Budinská2, Ladislav Dušek1, Michal Kozubek3 and
Fred Bosman4
1 Institute of Biostatistics and Analyses, Faculty of Medicine, Masaryk University, Brno, Czech Republic
2 Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic
3 Faculty of Informatics, Masaryk University, Brno, Czech Republic and
4 University Institute of Pathology, University of Lausanne, Switzerland.
∗To whom correspondence should be addressed.
Abstract
Motivation: Whole genome expression proﬁling of large cohorts of different types of cancer led to the
identiﬁcation of distinct molecular subcategories (subtypes) that may partially explain the observed intertumoral
heterogeneity. This is also the case of colorectal cancer where several such categorizations have
been proposed. Despite recent developments, the problem of subtype deﬁnition and recognition remains
open, one of the causes being the intrinsic heterogeneity of each tumor, which is difﬁcult to estimate
from gene expression proﬁles. However, one of the observations of these studies indicates that there may
be links between the dominant tumor morphology characteristics and the molecular subtypes. Beneﬁting
from a large collection of colorectal cancer samples, comprising both gene expression and histopathology
images, we investigated the possibility of building image–based classiﬁers able to predict the molecular
subtypes. We employed deep convolutional neural networks for extracting local descriptors which were
then used for constructing a dictionary–based representation of each tumor sample. A set of support vector
machine classiﬁers were trained to solve different binary decision problems, their combined outputs being
used to predict one of the ﬁve molecular subtypes.
Results: A hierarchical decomposition of the multi-class problem was obtained with an overall accuracy of
0.84 (95%CI=(0.79-0.88)). The predictions from the image-based classiﬁer showed signiﬁcant prognostic
value similar to their molecular counterparts.
Availability: Source code used for the image analysis is freely available from https://github.com/higex/qpath
Contact: popovici@iba.muni.cz
Supplementary information: Supplementary data are available at Bioinformatics online.
The last two decades witnessed fundamental changes in the way
we investigate the biology of living organisms, with technological
developments fueling major breakthroughs in our understanding of
various pathologies and paving the road towards personalized medicine.
Currently, researchers are armed with a battery of techniques for
interrogating the same biological reality at various scales (from subcellular
to whole population) and from very diverse perspectives (clinical,
imaging, genomic, proteomic, etc) generating high throughput multimodal
data. The bottleneck is now represented by our limited ability to interpret
such data in an integrated way (Li et al. (2016)) and the need for a
more inter-disciplinary approach is epitomized by large scale projects
such as The Cancer Genome Atlas (TCGA). In cancer research, one
of the main goals is to identify homogeneous groups of patients - i.e.
to stratify the patient population - in the hope of finding the
common causes and tailored treatments. Traditional stratification of
cancer patients is based on histologic and morphologic assessment of
the tumor sample and it still defines the gold standard. Lately, various
molecular biomarkers have been proposed for the same purpose. The two
perspectives are partly overlapping and partly orthogonal, making their
integration more challenging. Our present work focuses on translating
a gene expression-based cancer patient population stratification into an
image-based biomarker, thus trying to bring transcriptomics data into a
histopathologic context.
Associate Editor: Prof. Robert Murphy
© The Author (201 ). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Colorectal cancer (CRC) is the third most frequent cancer worldwide
and the second leading cause of cancer mortality in Europe, with metastatic
disease accounting for 40% to 50% of newly diagnosed patients. At the
same time, it is a highly heterogeneous disease in terms of prognosis
and its response to therapy. Using whole-genome proﬁling of large data
collections, several systems for sub-categorization of CRC have been
proposed recently (Budinská et al., 2013; Marisa et al., 2013; Sadanandam
et al., 2013; Roepman et al., 2013; De Sousa E Melo et al., 2013). In
general, they relied on clustering the CRC tumors in order to identify
patterns of co-regulation of genes that could be indicative of common
oncogenic pathways and coherent treatment responses of these tumors.
Our own analysis (Budinská et al., 2013) identiﬁed ﬁve stable tumor
clusters (labeled as subtypes A, B,..., E), but also showed that a relatively
high proportion of cases remained unaccounted for by this system. A
recent effort (Guinney et al., 2015) to harmonize all these discoveries
conﬁrmed the presence of four distinct and reproducible subtypes across all
studies, labeled CMS1,..., CMS4, which closely match our subtypes A,...,
D. The current gold standard for the identification
of the molecular subtype of a given tumor requires the interrogation of a
large panel of genes and the application of a genomic classifier. In the
analyses reported here, we will use the subtypes as defined in Budinská
et al., 2013. There are several reasons for this choice. First, since
they were derived from the same gene expression data that accompany
the images we use, it is hoped that the subtype assignment is less noisy.
Second, in Budinská et al., 2013 it is noted that an expert pathologist,
when presented with the molecular categorization for a set of cases, was
able to identify a number of morphological features that were preferentially
enriched in one or a few of the subtypes, hence showing preliminary
evidence that such connections exist. Third, we are interested in
identifying the imaging support for the five previously identified subtypes.
The problem of recognizing the tumor subtype based on imaging data
is not new and probably the most studied is the case of breast cancer.
For these cancers, five molecular subtypes are currently considered - Luminal
A, Luminal B, basal, Her2-enriched and normal-like (Perou
et al., 2000) - and surrogate immunohistochemical stains are available
(corresponding to the status of ER, PR and Her2 and to the proliferation
marker Ki-67). Consequently, automatic stain quantification
is the strategy of choice for molecular subtype recognition from image
data and it was shown to outperform the human expert (Stålhammar et al.,
2016). A systematic review of the connections between histological and
molecular subtypes in breast cancer is given in Weigelt et al., 2010.
Other efforts concentrated on the recognition of the high risk group of
triple negative breast cancers on various imaging platforms (Agner et al.,
2014; Dogan and Turnbull, 2012). The quantitative image analysis of
pathology slides can also serve as a main means for subtype deﬁnition. For
example, Chang et al., 2011 found five subtypes of glioblastoma, one of
which had predictive value and was correlated with the expression of several
genes. Similarly, Lan et al., 2015 propose an alternative subtyping of
ovarian cancer based on quantitative analysis of tumor microenvironment.
A general approach to the identiﬁcation of disease subtype based on
morphologic analysis of pathology slides is described in Cooper et al.,
2012.
In the case of CRC, Budinská et al., 2013 showed that subtype A had
either serrated or papillary architecture, subtype B represented typical
colorectal adenoma with complex tubular architecture, subtype C was
mucinous or solid trabecular, subtype D was a mixture of desmoplastic
and complex tubular architecture, and subtype E was mixed (see Budinská
et al., 2013 for example images). However, these annotations did not lead
to a strong classiﬁer.
This observation - that associations can be found between the molecular
subtypes and morphological traits of the tumors - constitutes the starting
point of the investigations reported here. Our interest is to construct
a histopathology image-based classifier able to predict the molecular
subtype of a given tumor section without resorting to any other staining
but the standard haematoxylin-eosin. This classifier may be seen as a
surrogate image biomarker (actually, as we will see, a combination of
several biomarkers) for the molecular subtypes and, to the best of our
knowledge, it is the ﬁrst such biomarker to be proposed. This constitutes
the main contribution of our work reported here and it represents a largely
improved result from our earlier explorations (Budinská et al., 2016).
Equally important, our approach does not rely on predeﬁned morphopathological
features: the feature selection is guided by the prediction
task. This would allow identifying potentially unknown (or overlooked)
image features but may also make the interpretation of the models less
obvious.
There are many potential applications of such a system once established
and well tested. First, since it does not require any special laboratory work,
it could be easily integrated in the diagnostic workﬂow to provide hints
about the molecular subtype, with no extra costs. It could also be used for
sample stratiﬁcation and selection for retrospective studies, where large
collections of samples could easily be ﬁltered for the subtypes of interest
without the need of the much more expensive molecular proﬁling.
Currently, the molecular subtype is established by proﬁling the
expression of a set of genes from the DNA/RNA extracted from the tumoral
region of a tissue section and combining their values through a genomic
classiﬁer. The whole process involves a number of parameters (from
deﬁning the characteristics of the region to be proﬁled - tumor content,
presence/absence of stroma, etc. - to the cut-offs of the classifiers) that are
yet to be formalized; the process is thus error-prone and leads to noisy labels.
While we consider the molecular subtypes as the ground truth against which our
image-based classifier is measured, one has to keep in mind the somewhat
fuzzy nature of the class definition. These specific settings of our problem
make it even more challenging than the more classical applications in the
ﬁeld of digital/computational pathology.
The rest of the paper is structured as follows: the data and the methods
used are described in Section 1, followed by the discussion of the results
in Section 2 and conclusions in Section 3.
1 Methods
1.1 Data
The present work is based on the data from a subset of the PETACC3
clinical trial (Van Cutsem et al., 2009) samples. The trial compared two
treatment regimens (ﬂuorouracil/leucovorin alone or in combination with
irinotecan) in colorectal cancer and found no differences between the
two. The gene expression data for a set of n = 688 samples was used
(along with other data sets) in the derivation of the molecular subtypes of
CRC (Budinská et al., 2013) and is publicly available from ArrayExpress
under accession number E-MTAB-990. In Budinská et al., 2013 the
molecular subtypes (denoted A-E) were assigned to n = 458
cases, the rest being considered ambiguous (or representing other low-prevalence
subtypes) and labeled as "outliers". From those 458
samples, n = 300 cases were selected for this study based purely on
technical considerations (availability of histopathology tumor section,
acceptable whole slide image quality, tissue sample not too fragmented,
etc.). The "outlier" (from a molecular subtype perspective) cases were not
considered in the present study.
Fig. 1. Typical whole slide image from the data collection. At 10× magnification, this
image is 39936 × 22528 pixels in size. The regions marked with a "T" correspond to
the tumoral component, while the "N" annotations indicate normal tissue.
All molecular subtypes were represented in this collection with the
following frequencies: A: 21, B:140, C:37, D: 81, and E: 21, respectively.
The slides were annotated by an expert pathologist and these annotations
were present in the digital versions - a typical example is given in Figure 1
(note the annotations loosely delineating the tumoral and normal tissue
components).
From the whole collection of 300 images a subset of 100 images
was selected by stratiﬁed random sampling to form the development set.
This development set was used for selecting the image representation
model and for designing the classiﬁcation approach. We did not use
the whole available data in order to reduce the likelihood of obtaining
a model too adapted to our particular collection of samples (overﬁtted).
For the same reason we also preferred limiting the number of experiments,
comparing only several modeling approaches. The remaining 200 images
were added at a later stage when the multi-class classifier performance was
estimated by cross-validation. Other strategies of selecting a development
set (possibly larger, equal number of cases per class, etc.) could have
been attempted, with their own advantages and drawbacks, but we found
the chosen approach to provide a reasonable trade-off.
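The stratified selection of the development set described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the per-subtype counts of the development set, so proportional allocation with largest-remainder rounding is assumed, and the sample identifiers, the seed and the function names are hypothetical.

```python
import random
from collections import Counter

# Subtype frequencies as reported in the text (300 cases in total).
SUBTYPE_COUNTS = {"A": 21, "B": 140, "C": 37, "D": 81, "E": 21}


def proportional_allocation(counts, n_total):
    """Split n_total across classes proportionally (largest-remainder rounding)."""
    population = sum(counts.values())
    exact = {k: n_total * v / population for k, v in counts.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    # Hand the leftover units to the classes with the largest fractional parts.
    leftover = n_total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_split(labels, n_dev, seed=0):
    """Return (dev_ids, rest_ids); labels maps a sample id to its subtype."""
    rng = random.Random(seed)
    alloc = proportional_allocation(Counter(labels.values()), n_dev)
    dev = []
    for subtype, k in sorted(alloc.items()):
        members = sorted(s for s, lab in labels.items() if lab == subtype)
        dev.extend(rng.sample(members, k))
    dev_set = set(dev)
    rest = [s for s in sorted(labels) if s not in dev_set]
    return sorted(dev), rest


# Hypothetical sample identifiers, one per case (assumption for illustration).
labels = {f"{sub}{i:03d}": sub for sub, n in SUBTYPE_COUNTS.items() for i in range(n)}
dev_ids, rest_ids = stratified_split(labels, n_dev=100)
```

Under this assumed allocation every subtype keeps roughly one third of its cases in the development set, so even the two smallest subtypes (A and E, 21 cases each) remain represented.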
1.2 Image acquisition and preprocessing
All whole slide images of haematoxylin-eosin stained tumor sections
were acquired at 20× magniﬁcation, using a Hamamatsu NanoZoomer
C9600 scanner. The resulting images were compressed by the image
acquisition software using the JPEG standard (at 80% quality) and stored in the
proprietary NDPI format. The resolution of the images was 455 nm/pixel
(equivalent of 55824 DPI) for a typical size of 100,000 × 50,000 pixels
(depending on the size of the tissue section). The images were exported in
standard TIFF format using OpenSlide software library (Satyanarayanan
et al., 2013).
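The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below converts the pixel pitch into dots per inch and computes the pixel size of a slide after changing magnification; the function names are ours and are not part of OpenSlide or of the authors' code.

```python
NM_PER_INCH = 25.4e6  # 1 inch = 25.4 mm = 25.4e6 nm


def nm_per_pixel_to_dpi(nm_per_pixel):
    """Convert a pixel pitch in nanometres into dots (pixels) per inch."""
    return NM_PER_INCH / nm_per_pixel


def size_at_magnification(width, height, native_mag, target_mag):
    """Pixel dimensions of the same field of view at another magnification."""
    scale = target_mag / native_mag
    return round(width * scale), round(height * scale)


dpi = nm_per_pixel_to_dpi(455)                                # about 55824 DPI
w10, h10 = size_at_magnification(100_000, 50_000, 20, 10)     # 20x scan viewed at 10x
```

Going from 20× to 10× halves each dimension, so a typical 100,000 × 50,000 pixel scan becomes 50,000 × 25,000 pixels, a fourfold reduction in the number of pixels to process.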
The images were down-scaled to an equivalent 10× magniﬁcation
and only tumoral regions were retained from each sample (manually cut
following the pathologist’s annotations) - the pixels outside the tumors
being set to zero. For example, the image in Figure 1 contains two tumoral
regions (marked with "T"). No further preprocessing was applied to the
images.
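A minimal sketch of the two preprocessing operations mentioned above (halving the resolution and zeroing the pixels outside the tumor) is given here, assuming the image is held as a NumPy array and the pathologist's annotation is already available as a boolean mask. The block-averaging down-scaler is our own choice for illustration; the resampling method actually used is not stated in the text.

```python
import numpy as np


def downscale_by_2(image):
    """Halve the resolution of an H x W x 3 uint8 image by 2 x 2 block averaging."""
    h, w = image.shape[0] // 2 * 2, image.shape[1] // 2 * 2
    blocks = image[:h, :w].astype(np.float32).reshape(h // 2, 2, w // 2, 2, -1)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)


def keep_tumor_only(image, tumor_mask):
    """Return a copy of the image with every pixel outside the mask set to zero."""
    out = image.copy()
    out[~tumor_mask] = 0
    return out


# Toy stand-in for a slide region (assumption: real slides are far larger).
slide = (np.arange(8 * 12 * 3) % 256).astype(np.uint8).reshape(8, 12, 3)
small = downscale_by_2(slide)                 # 20x -> 10x equivalent
mask = np.zeros(small.shape[:2], dtype=bool)
mask[:, :3] = True                            # hypothetical tumor annotation
tumor = keep_tumor_only(small, mask)
```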
1.3 Local descriptors
We based our sample description on the aggregation of local information
over the tumor regions in the image. The choice of image features plays a
major role in the performance of image recognition/classification systems.
Traditionally, most such features are handcrafted, consisting of some
dense sampling of local patches, as in wavelet decomposition, Scale-Invariant
Feature Transform (SIFT) (Lowe, 1999), Local Binary Patterns
(LBP) (Ojala et al., 1996), etc. These local descriptors are later pooled
into a global representation by means of methods such as Bag-of-Visual-Words
(BoVW) (Csurka et al., 2004), Fisher Vector (FV) (Perronnin
and Dance, 2007), or Vector of Locally Aggregated Descriptors (VLAD)
(Jégou et al., 2010, 2012).
More recently, Convolutional Neural Networks (CNNs) (LeCun et al.,
1989, 2015) gained momentum due to the superior performance of the
systems employing them and to the increasing availability of dedicated
software (and hardware) systems facilitating their use. While the CNNs
also require a number of design decisions (such as their structure), they
also have a large number of parameters that are learned from data, leading
to adapted image descriptions. Cimpoi et al., 2016 provide a detailed
comparison of deep image features and some standard ones in the general
context of texture classiﬁcation. In biomedical imaging, there are a number
of successful recognition systems based on various CNNs architectures,
such as U-Net (Ronneberger et al., 2015). In general, training CNN-based
recognition systems requires a large number of labeled image examples:
the deeper the architecture, the more images are needed. For example,
well-known image recognition systems like the ImageNet-trained network of
Krizhevsky et al. (2012) or GoogLeNet (Szegedy et al., 2015) were trained on millions of
images. Such large data collections are usually not available in the biomedical
field, hence the interest in transferring general pre-trained CNN models to
the medical applications. For example, van Ginneken et al. (2015) and
Kawahara et al. (2016) describe such successful systems that are based on
pre-trained CNN features.
An alternate route for obtaining local descriptors is represented by the
autoencoding methods, where an identity function is learned under the
constraint of a lower dimensional (or sparse) internal representation. The
parameters of the function are obtained through an optimization process,
where the distance (usually L2) between the original and reconstructed
image is minimized, possibly with some additional constraints over the
parameters. Examples of such methods are predictive sparse
decomposition (as used in Chang et al. (2015), for example) and
deep autoencoding networks. We do not explore this direction further in
the present work.
For the problem addressed here, we chose to use a very deep CNN
trained on the ImageNet data collection, imagenet-vgg-f (Chatfield et al.,
2014), as implemented in the MatConvNet library (Vedaldi and Lenc,
2015; see footnote 1). The network is trained to predict the probability
of an input color image of size 224 × 224 belonging to each of 1,000
categories. By using the output of the next-to-last layer (relu7, before
the classification layers), a 4,096-element description vector can be
obtained. Since we will use Gaussian Mixture Models (GMMs, see
Section 1.4) for building the coding dictionary, such a high dimensional
space would require a prohibitively large number of samples for a good
fit of the models. We therefore perform PCA to further reduce the
dimension of the local descriptor vectors, retaining the first d = 128
coordinates (chosen to be fixed, non-trainable). Thus, a local RGB patch
of 224 × 224 pixels was reduced to a set of 128 values corresponding to
the projection of the 4,096-value ImageNet vector onto the first 128
principal axes.
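The dimensionality reduction step can be sketched as follows. This is a minimal illustration and not the original MATLAB code: the function names fit_pca and apply_pca are hypothetical, and random numbers stand in for the 4,096-value relu7 outputs. The PCA model (mean and principal axes) is fitted once and then reused to project new descriptors.

```python
import numpy as np

def fit_pca(X, n_components=128):
    """Fit PCA on descriptors stored row-wise (n_samples x n_features)."""
    mean = X.mean(axis=0)
    # The right singular vectors of the centered data are the principal axes,
    # already sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, axes):
    """Project descriptors onto previously fitted principal axes."""
    return (X - mean) @ axes.T

# Stand-in data (assumption): 200 random vectors instead of real CNN features.
rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(200, 4096))

mean, axes = fit_pca(cnn_features, n_components=128)
descriptors = apply_pca(cnn_features, mean, axes)
print(descriptors.shape)  # (200, 128)
```

Keeping the mean and the axes as a separate model makes it possible to apply exactly the same projection to held-out data.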
As a side note, we remark that the CNN-based descriptor vector is itself
the result of a combination of a number of ﬁlters applied to even smaller
neighborhoods. However, in this work we consider the basic neighborhood
to be the 224 × 224 patch on which the CNN is applied.
1 For the architecture see http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.svg
1.4 Aggregating local descriptors
Once a set of local descriptors is obtained from an image, they are pooled
into a summarizing feature vector intended to capture the global aspects of
the image. The first step of the process involves re-coding the image
in terms of elements of a visual dictionary (codebook), the same for all
classes, which is followed by the computation of the image representation.
For the construction of the codebook, k-means clustering and
Gaussian Mixture Models (GMMs) are the most common choices, and
are typically used with either the standard Bag-of-Visual-Words (BoVW;
Csurka et al., 2004) or other aggregators, such as the Fisher Vector (FV)
or the vector of locally aggregated descriptors (VLAD). Jégou et al. (2012)
give a comprehensive comparison of various design choices. Here we
briefly recall the main differences between BoVW, FV and VLAD:
• Bag-of-Visual-Words typically uses k-means clustering for obtaining
a codebook, with the K centroids from the clustering being the
codewords (visual words). The representation of an image is then
simply the histogram of the number of local descriptors assigned
to each codeword, thus an image is reduced to a K-dimensional
vector. This histogram can be further normalized using Manhattan or
Euclidean normalization (Jégou et al., 2012). One can also use a
soft-coding scheme in which the patches are assigned, for example, a code
based on the distance to the centroids (Sivic and Zisserman, 2003).
• Fisher Vector represents a generalization of BoVW as it encodes higher
order statistics of the distribution of the codewords. In this case, the
codebook is usually obtained as a GMM with K components fitted
via expectation maximization on the training data. The FV encodes
the gradient of a given sample's likelihood with respect to the parameters
of the fitted GMM, thus it indicates the direction in the parameter
space in which the learned GMM has to be modified to accommodate
the observed data (Jégou et al., 2012). For a full FV that accounts for
differences both in mean and variance between the model and observed
data, the resulting representation vector has 2Kd elements (d being
the size of the local descriptor vector).
• VLAD can be seen as a non-probabilistic version of FV (Jégou et al.,
2012) and was designed to provide a low dimensional representation
of the image (Jégou et al., 2010) that would allow the indexing of very
large image databases in memory. It tries to combine the simplicity of
BoVW with some ideas of FV: the codebook is learned via k-means
clustering and each patch is assigned the closest codeword as in BoVW,
but the feature vector accumulates the differences between each patch
and its corresponding codeword, similar to FV. See Arandjelovic and
Zisserman (2013) for a detailed discussion and further extensions.
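To make the difference between the hard-assignment aggregators concrete, the sketch below computes a BoVW histogram (K values) and a VLAD vector (K·d values) from the same descriptors and codebook. It is an illustration under stated assumptions: the codebook is given as an array of K centroids, the function names are hypothetical, and the normalizations (L1 for BoVW, L2 for VLAD) are one common choice rather than the settings used in this work.

```python
import numpy as np

def nearest_codeword(X, codebook):
    """Index of the closest codeword (Euclidean) for every descriptor."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bovw(X, codebook):
    """Histogram of codeword assignments, L1 (Manhattan) normalized."""
    K = codebook.shape[0]
    counts = np.bincount(nearest_codeword(X, codebook), minlength=K)
    return counts / counts.sum()

def vlad(X, codebook):
    """Sum of residuals (descriptor - codeword) per codeword, L2 normalized."""
    K, d = codebook.shape
    assign = nearest_codeword(X, codebook)
    v = np.zeros((K, d))
    for k in range(K):
        members = X[assign == k]
        if len(members):
            v[k] = (members - codebook[k]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy data (assumption): 500 descriptors of size d = 16, K = 8 codewords.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
codebook = rng.normal(size=(8, 16))

h = bovw(X, codebook)   # K-dimensional
v = vlad(X, codebook)   # K*d-dimensional
print(h.shape, v.shape)
```

The jump from K to K·d values (and to 2Kd for a full FV) is what makes the richer representations more demanding in terms of training data.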
In the present work, we decided to use the same method for constructing
the visual codebook in all cases, namely Gaussian Mixture Models.
This allowed us to test a soft-coding scheme as well, in which codes were
based on the posterior probabilities of being generated by a particular
component of the GMM.
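A soft-coding scheme of this kind can be sketched as below. The sketch assumes a GMM with diagonal covariances (a common choice, not stated in the text) whose parameters are already fitted; the function names are hypothetical. Each descriptor contributes its posterior probabilities to all components instead of a single vote to the closest codeword.

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Posterior p(k | x) for a diagonal-covariance GMM (assumption)."""
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2          # (N, K, d)
    log_pdf = -0.5 * (diff2 / variances[None]
                      + np.log(2.0 * np.pi * variances[None])).sum(axis=2)
    log_joint = np.log(weights)[None, :] + log_pdf            # (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)         # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def soft_bovw(X, weights, means, variances):
    """Average posterior per component: a soft K-dimensional histogram."""
    return gmm_posteriors(X, weights, means, variances).mean(axis=0)

# Toy parameters (assumption): K = 4 components in d = 8 dimensions.
rng = np.random.default_rng(1)
K, d = 4, 8
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, d))
variances = np.ones((K, d))
X = rng.normal(size=(100, d))

post = gmm_posteriors(X, weights, means, variances)
code = soft_bovw(X, weights, means, variances)
print(code)
```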
1.5 Classiﬁer training and performance estimation
Training the system could be summarized by the following steps:
1. for each image, extract the local descriptors (based on ImageNet) for
all non-overlapping regions corresponding to tumoral component(s);
2. construct a visual codebook by:
a. performing PCA and retaining the first 128 components (the PCA model
is saved for later application on the validation set);
b. fitting a K = 128-component GMM on the PCA-transformed local
descriptors (the visual codebook is saved for later use on the
validation set);
Table 1. Confusion matrix for BoVW. Rows are true subtypes, columns
are predicted subtypes. Null values are shown as "-".
                Predicted
     A    B    C    D    E    Precision  Recall
A    3    4    -    -    -    0.75       0.43
B    1    41   -    5    -    0.76       0.87
C    -    3    7    2    -    0.44       0.58
D    -    4    8    13   2    0.59       0.48
E    1    2    1    2    1    0.33       0.14
3. train the binary classifiers (save the models for validation). Each
such binary classifier was a support vector machine with a radial
basis function kernel. Two parameters were tuned in an inner
cross-validation loop: the γ parameter of the kernel and the C parameter
for the misclassification penalty. The final prediction of the subtype
label is made according to the decision tree in Figure 2. This particular
decomposition of the multi-class problem was the result of the analysis
of misclassified samples in the development set, which suggested that
subtypes A and B should first be separated from the rest (see Sec. 2.1).
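The final prediction rule can be sketched as a cascade of four binary decisions that follows the decomposition described for Figure 2. In this illustration each classifier is a plain callable that returns True for the first-named group. The function name, the argument names and the toy threshold classifiers are hypothetical stand-ins for the trained SVMs.

```python
def predict_subtype(x, ab_vs_cde, a_vs_b, cd_vs_e, c_vs_d):
    """Route one feature vector through the binary decision tree."""
    if ab_vs_cde(x):                      # first level: {A,B} vs {C,D,E}
        return "A" if a_vs_b(x) else "B"
    if cd_vs_e(x):                        # second level: {C,D} vs E
        return "C" if c_vs_d(x) else "D"
    return "E"

# Toy classifiers (assumption): thresholds on a single scalar feature.
toy = dict(
    ab_vs_cde=lambda x: x < 2,
    a_vs_b=lambda x: x < 1,
    cd_vs_e=lambda x: x < 4,
    c_vs_d=lambda x: x < 3,
)

labels = [predict_subtype(x, **toy) for x in (0.5, 1.5, 2.5, 3.5, 4.5)]
print(labels)  # ['A', 'B', 'C', 'D', 'E']
```

An error at the first level cannot be corrected further down the tree, which is why the accuracy of the {A,B} vs {C,D,E} split is worth reporting separately.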
Since the ImageNet model is external and independent of the data
analyzed, it does not need to be included in the cross-validation loop,
which is an additional reason for preferring a pre-built CNN model. The
other steps, however, were repeated at each cross-validation iteration on
the corresponding training data.
1.6 Statistical analyses
For the identification of image features enriched/depleted in a subtype with
respect to the other subtypes, we used Wilcoxon rank-sum tests since the
measurements were not normally distributed. For hierarchical clustering
we used the Ward method with a Euclidean distance between feature
vectors. Survival analysis was performed using the survival package
(version 2.39-4) from the R statistical computing environment (version 3.3.1,
www.r-project.org). The hazard ratios were estimated by
Cox proportional hazards regression in the absence of any other
covariates, while the survival experience of different
subgroups was compared by the log-rank test (Mantel-Haenszel test). The
statistical significance level was set to 0.01, and all tests yielding a
p-value 0.01 ≤ p ≤ 0.05 were considered marginally significant.
Finally, the 95% confidence intervals (95%CI) for binomial random
variables (such as accuracy) were estimated using the method of Agresti
and Coull (1998).
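For reference, the Agresti and Coull (1998) interval can be computed in a few lines. The sketch below is a plain restatement of the published formula, not the authors' code. The example count of 252 correct predictions out of 300 is an assumption, back-calculated from an accuracy of 0.84 on 300 samples.

```python
import math

def agresti_coull(successes, n, z=1.959964):
    """Agresti-Coull confidence interval for a binomial proportion.

    z is the standard normal quantile (1.959964 for a 95% interval).
    """
    n_adj = n + z * z                          # adjusted number of trials
    p_adj = (successes + z * z / 2.0) / n_adj  # adjusted proportion
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# Assumed counts: 252 correct out of 300 (accuracy 0.84).
lo, hi = agresti_coull(252, 300)
print(f"95%CI = ({lo:.2f} - {hi:.2f})")
```

The interval is centered on the adjusted proportion rather than on the raw accuracy, so it is not symmetric around 0.84.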
2 Results and discussion
The results discussed here are complemented by larger images on the
project's website: http://bias.cerit-sc.cz/somopro-subtypes.html.
2.1 Initial experiments
As mentioned, in an attempt to avoid overﬁtting the available data, a
development set has been used to guide the design decisions and to set
a number of meta-parameters. We tested dictionaries with K1 = 64 and
K2 = 128 codewords and compared the performance of BoVW, FV
and VLAD representations when predicting the ﬁve molecular subtypes.
We performed this comparison under two standard decompositions of the
multi-class classiﬁcation problem, namely 1-vs-all and 1-vs-1.
These tests showed that BoVW with GMM-based quantization
performed as well as the more involved FV and VLAD representations
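[Figure 2: binary decision tree. Root node: {A,B} vs {C,D,E}. The {A,B}
branch is split by the node "A vs B"; the {C,D,E} branch is split by the
node "C,D vs E", and the {C,D} branch by the node "C vs D". Leaves: A, B,
C, D, E.]
Fig. 2. Decomposition of the multi-class classification problem. For each non-terminal
node a binary classifier was trained to split the respective groupings of molecular subtypes.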
Table 2. 10-fold cross-validation confusion matrix for the multi-class
classifier and corresponding per-class performance metrics. Rows are true
subtypes, columns are predicted subtypes. Null values are shown as "-".
                Predicted
     A    B    C    D    E    Precision  Recall
A    21   -    -    -    -    0.95       1.00
B    1    119  -    13   7    0.91       0.85
C    -    2    29   6    -    0.91       0.78
D    -    8    1    71   1    0.75       0.88
E    -    2    2    5    12   0.60       0.57
(see Supplemental materials, Sec. 1). The small sample size certainly
influences this observation, since both FV and VLAD have much higher
dimensionality and would require more data for proper training. Table 1
shows the results for the BoVW method with the 1-vs-all decomposition of
the multi-class problem, on the development set (obtained by stratified
4-fold cross-validation). For the other approaches the results were
similar, so they are not detailed here.
Another important observation was that the 1-vs-1 and 1-vs-all
decompositions of the multi-class classification problem might not be
the best suited for the present case. By analyzing the confusion matrix
and taking into account the performance indices (precision and recall), it
appeared that a first split between classes A and B on one side and
C, D, and E on the other would be more advantageous. This observation is also
supported by the results in Budinská et al. (2013), where it is noted that
subtypes A, B, on one hand, and C, D, E, on the other hand, share dominant
and secondary dominant morphological features as well as similar survival
expectancy. So, the final design for the multi-class classifier was chosen
to be as depicted in Figure 2.
2.2 Prediction of molecular subtypes
Once the final decisions for the classification system were taken based on
the initial experiments described above, the performance of the system was
assessed using 10-fold cross-validation on the whole set of 300 samples.
The estimated overall accuracy of the multi-class classifier was Acc =
0.84, 95%CI = (0.79 - 0.88), with a weighted average recall and precision
of R = 0.85, 95%CI = (0.80 - 0.89) and P = 0.84, 95%CI =
(0.80 - 0.88), respectively. Table 2 details the performance metrics of
the classifier. We note the good performance of the first decision level
({A,B} vs {C,D,E}) (Acc = 0.89, 95%CI = (0.85 - 0.92)) but also the
poor recognition of subtype E.
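The per-class metrics and the first-level accuracy can be recomputed from the confusion matrix of Table 2, as sketched below. The counts are those of Table 2, but the placement of the null cells is an assumption: it was inferred so that the row and column sums reproduce the reported precision and recall values. Rows are taken as true subtypes and columns as predicted subtypes.

```python
labels = ["A", "B", "C", "D", "E"]
confusion = [              # rows: true subtype, columns: predicted subtype
    [21,   0,  0,  0,  0],
    [ 1, 119,  0, 13,  7],
    [ 0,   2, 29,  6,  0],
    [ 0,   8,  1, 71,  1],
    [ 0,   2,  2,  5, 12],
]
n = len(labels)
total = sum(sum(row) for row in confusion)

accuracy = sum(confusion[i][i] for i in range(n)) / total
recall = {labels[i]: confusion[i][i] / sum(confusion[i]) for i in range(n)}
precision = {
    labels[i]: confusion[i][i] / sum(row[i] for row in confusion)
    for i in range(n)
}

# First decision level: a sample counts as correct if the predicted subtype
# falls in the same group ({A,B} or {C,D,E}) as the true one.
group = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
first_level = sum(
    confusion[i][j]
    for i in range(n) for j in range(n)
    if group[labels[i]] == group[labels[j]]
) / total

print(f"Acc = {accuracy:.2f}, first level Acc = {first_level:.2f}")
for lab in labels:
    print(lab, f"P = {precision[lab]:.2f}", f"R = {recall[lab]:.2f}")
```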
We repeated the same experiments on the 200 samples not used in
the development set and the results were in line with those above (thus
not repeated here), only with subtype A being slightly worse separated
from subtype B (see Supplemental materials - Sec. 2). This indicates that
the current sample size may still be too small for some cases and some
improvements may be expected by enlarging the training set.
2.3 Associations between predictions and clinical data
Budinská et al. (2013) indicated that some associations could be
found between molecular subtypes and clinical variables and molecular
markers. Hence, we were interested in testing whether such associations
are transferable to the predictions made by the image-based classifier. To
avoid overly optimistic discoveries, we use the predictions (A-E labels)
produced during the cross-validation estimation of the system. There is
also one caveat: as explained, the selection of the cases was governed by
technical constraints and thus it does not represent the true
population-based statistics for various clinical variables, and the
results reported here should not be compared directly with those in
Budinská et al. (2013).
Nevertheless, we investigate these associations and compare them with
those found between gene expression-based subtypes and the clinical
variables, on the same set of cases.
We first tested whether the predicted subtypes were associated with
relapse-free survival (RFS). In Budinská et al. (2013), subtypes A and B
have a lower risk of relapse than subtypes C, D, and E. The same can be
observed in the set of 300 samples used here (p = 0.0014, HR = 1.75,
95%CI = (1.24 - 2.49), Figure 3(a)). The image-based subtype predictions
also produce a statistically significant stratification of the population
(p = 0.012, HR = 1.56, 95%CI = (1.10 - 2.21), Figure 3(b)).
We also found associations between the subtypes (both image-based and
gene expression-based) and microsatellite stability, BRAF and KRAS
mutations, and mucinous histology. In the case of image-based
predictions, subtypes A and C were enriched in mucinous histology
compared to the sample average, while subtype E was almost depleted
of it. BRAF mutated cases (5.8% of all cases) were mostly found in
subtype C (20% of cases predicted), and rarely in subtype B (2.4%), while
KRAS mutated cases (38.4% of all cases) represented 77% of cases predicted
as subtype A and only 29% and 22% of cases predicted as subtypes B and
E, respectively. Finally, high microsatellite instability (MSI) was almost
exclusively found in subtype C (10 out of 13 cases). The same trends were
found in gene-expression subtypes, with some variations below statistical
significance.
A related question was whether the misclassified samples were
enriched in any particular type of tumors. The only significant association
concerned the misclassified subtype B samples, which were enriched
in higher T-stage and N-stage tumors. This observation may provide hints
about further refinement of the classifier for subtype B. Detailed results
are given in Supplemental materials - Sec. 3.
2.4 Visual codebook
We explored the structure of the visual codebook as obtained by training
the model on the full data set. A visual depiction of the extracted codewords
(centers of the Gaussian components) is shown in Figure 4 and a higher
resolution image is given in Supplemental materials - Sec. 4. Note that
the visual codewords are the centers of the Gaussians in the GMM,
hence the means of feature vectors obtained by projecting the ImageNet
features in the PCA space. The patches shown are just the closest image
neighborhoods to these centers, thus they are an approximation of the
true centers (whose visual appearance would require inverting the CNN
function). We use this simpliﬁcation only for visualization purposes and
to get a qualitative assessment of the results.
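Selecting the patches used for this visualization amounts to a nearest-neighbor search in the PCA space, sketched below. The function name and the toy data are hypothetical. For a real data set with many patches the distance matrix should be computed in chunks to limit memory use.

```python
import numpy as np

def closest_patches(descriptors, centers):
    """For each GMM center, index of the nearest patch descriptor."""
    d2 = ((centers[:, None, :] - descriptors[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy data (assumption): 50 descriptors in 4 dimensions; the three "centers"
# are slightly perturbed copies of descriptors 7, 3 and 42.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 4))
centers = descriptors[[7, 3, 42]] + 1e-6

idx = closest_patches(descriptors, centers)
print(idx)  # indices of the patches to display for each codeword
```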
Fig. 5. Top four prototypes associated with each subtype: (a-d) subtype A; (e-h) subtype B; (i-l) subtype C; (m-p) subtype D; and (q-t) subtype E. Under each image the corresponding
p-value from Wilcoxon rank-sum test is shown.
Another question we will address in the future pertains to the classification
of the so-called "outliers": tumors for which no molecular subtype was
assigned. It would be interesting to see how the subtypes predicted by the
current image-based classifier correlate with the similarity between their
expression profiles and those of well assigned tumors.
One has to bear in mind that, despite recent efforts to consolidate the
molecular taxonomy of CRC, the sub-categorization of CRC is still not
definitive. Indeed, depending on the size of the cohort and the parameters
chosen for cut-offs, more or fewer molecular subtypes can be observed, thus
this categorization is still fluid. Nevertheless, in the present work it has
been considered the gold standard against which the image-based models
were compared. We believe that combining the observations
from the two modalities may lead to an even more refined subtyping of
CRC. However, this would probably involve a more supervised (by
expert pathologists) construction of the image-based models.
As they stand now, our results clearly support the possibility
of translating some molecular observations into image-based models, as
is the case for molecular subtypes. These results are reinforced by
similar observations made by an expert pathologist (Budinská et al., 2013),
where several tissue architectural patterns could be linked, in a supervised
analysis, to the molecular subtypes. It is interesting to note that some of
the regions/patterns found representative in our data-driven analysis
are also visually similar to those hand-picked by an expert (see example
images in Budinská et al., 2013). On the other hand, the intra-tumoral
heterogeneity and the pathology sampling region clearly influence a sample's
assignment to a molecular subtype (Dunne et al., 2016). In light of the
results presented here, one can imagine an image-analysis approach to
the delineation of the tissue sampling regions to improve the stability of
the subtype assignment.
While it is too early to consider any clinical application
of the models described here, they could be used for
indexing/annotating or for retrieval of samples of interest from archives.
Consider the situation in which one would like to test, on a retrospective
collection of samples, some biomarker that is hypothesized to work in one
or several subtypes. Since determining the molecular subtypes relies on
profiling hundreds of genes, it makes more sense to use a classifier such
as the one proposed here to select the most promising samples. This
can be implemented without significant effort, since more and more
pathology departments are adopting digital pathology workflows, which
makes the images readily available.
Acknowledgements
The necessary computational resources were provided by the CESNET
LM2015042 and the CERIT Scientific Cloud LM2015085 projects under
the programme "Projects of Large Research, Development, and
Innovations Infrastructures".
Funding
This project is financed from the SoMoPro II programme. The research
leading to this result has acquired a financial grant from the People
Programme (Marie Curie action) of the Seventh Framework Programme
of EU according to the REA Grant Agreement No. 291782. The research
is further co-financed by the South Moravian Region. This article reflects
only the authors' views and the Union is not liable for any use that may be
made of the information contained therein.
References
Agner, S. C., Rosen, M. A., Englander, S., Tomaszewski, J. E., Feldman, M. D.,
Zhang, P., Mies, C., Schnall, M. D., and Madabhushi, A. (2014). Computerized
image analysis for identifying triple-negative breast cancers and differentiating
them from other molecular subtypes of breast cancer on dynamic contrast-enhanced
MR images: a feasibility study. Radiology, 272(1), 91–99.
Agresti, A. and Coull, B. A. (1998). Approximate is Better than “Exact” for Interval
Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Arandjelovic, R. and Zisserman, A. (2013). All About VLAD. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 1578–1585. IEEE.
Budinská, E., Popovici, V., Delorenzi, M., Tejpar, S., D’Ario, G., Lapique, N.,
Sikora, K. O., Di Narzo, A. F., Yan, P., Hodgson, J. G., Weinrich, S., Bosman,
F., and Roth, A. (2013). Gene expression patterns unveil a new level of molecular
heterogeneity in colorectal cancer. Journal of Pathology, 231(1), 63–76.
Budinská, E., Bosman, F., and Popovici, V. (2016). Experiments in molecular
subtype recognition based on histopathology images. In International Symposium
on Biomedical Imaging, pages 1168–1172. Masaryk University, Brno, Czech
Republic, IEEE.
Chang, H., Fontenay, G. V., Han, J., Cong, G., Baehner, F. L., Gray, J. W., Spellman,
P. T., and Parvin, B. (2011). Morphometic analysis of TCGA glioblastoma
multiforme. BMC Bioinformatics, 12(1), 484.
Chang, H., Zhou, Y., Borowsky, A., Barner, K., Spellman, P., and Parvin, B. (2015).
Stacked Predictive Sparse Decomposition for Classiﬁcation of Histology Sections.
International Journal of Computer Vision, 113(1), 3–18.
Chatﬁeld, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the
Devil in the Details: Delving Deep into Convolutional Nets. In British Machine
Vision Conference.
Cimpoi, M., Maji, S., Kokkinos, I., and Vedaldi, A. (2016). Deep Filter Banks
for Texture Recognition, Description, and Segmentation. International Journal of
Computer Vision, 118(1), 65–94.
Cooper, L. A. D., Kong, J., Gutman, D. A., Wang, F., Gao, J., Appin, C., Cholleti,
S., Pan, T., Sharma, A., Scarpace, L., Mikkelsen, T., Kurc, T., Moreno, C. S.,
Brat, D. J., and Saltz, J. H. (2012). Integrated morphologic analysis for the
identiﬁcation and characterization of disease subtypes. Journal of the American
Medical Informatics Association, 19(2), 317–323.
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and Bray, C. (2004). Visual
categorization with bags of keypoints. In Workshop on Statistical Learning in
Computer Vision, ECCV, pages 59–74.
De Sousa E Melo, F., Wang, X., Jansen, M., Fessler, E., Trinh, A., de Rooij, L. P.
M. H., de Jong, J. H., de Boer, O. J., van Leersum, R., Bijlsma, M. F., Rodermond,
H., van der Heijden, M., van Noesel, C. J. M., Tuynman, J. B., Dekker, E.,
Markowetz, F., Medema, J. P., and Vermeulen, L. (2013). Poor-prognosis colon
cancer is deﬁned by a molecularly distinct subtype and develops from serrated
precursor lesions. Nature Medicine, pages 1–8.
Dogan, B. E. and Turnbull, L. W. (2012). Imaging of triple-negative breast cancer.
Annals of oncology : ofﬁcial journal of the European Society for Medical Oncology
/ ESMO, 23 Suppl 6, vi23–9.
Dunne, P. D., McArt, D. G., Bradley, C. A., O’Reilly, P. G., Barrett, H. L., Cummins,
R., O’Grady, T., Arthur, K., Loughrey, M. B., Allen, W. L., McDade, S. S.,
Waugh, D. J., Hamilton, P. W., Longley, D. B., Kay, E. W., Johnston, P. G.,
Lawler, M., Salto-Tellez, M., and Van Schaeybroeck, S. (2016). Challenging the
Cancer Molecular Stratiﬁcation Dogma: Intratumoral Heterogeneity Undermines
Consensus Molecular Subtypes and Potential Diagnostic Value in Colorectal
Cancer. Clinical Cancer Research, 22(16), 4095–4104.
Guinney, J., Dienstmann, R., Wang, X., de Reyniès, A., Schlicker, A., Soneson, C.,
Marisa, L., Roepman, P., Nyamundanda, G., Angelino, P., Bot, B. M., Morris, J. S.,
Simon, I. M., Gerster, S., Fessler, E., De Sousa E Melo, F., Missiaglia, E., Ramay,
H., Barras, D., Homicsko, K., Maru, D., Manyam, G. C., Broom, B., Boige, V.,
Perez-Villamil, B., Laderas, T., Salazar, R., Gray, J. W., Hanahan, D., Tabernero,
J., Bernards, R., Friend, S. H., Laurent-Puig, P., Medema, J. P., Sadanandam,
A., Wessels, L., Delorenzi, M., Kopetz, S., Vermeulen, L., and Tejpar, S. (2015).
The consensus molecular subtypes of colorectal cancer. Nature Medicine, 21(11),
1350–1356.
Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010). Aggregating local descriptors
into a compact image representation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 3304–3311. INRIA, Le Chesnay,
France, IEEE.
Jégou, H., Perronnin, F., Douze, M., Sanchez, J., Perez, P., and Schmid, C. (2012).
Aggregating Local Image Descriptors into Compact Codes. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.
Kawahara, J., BenTaieb, A., and Hamarneh, G. (2016). Deep features to classify
skin lesions. In IEEE International Symposium on Biomedical Imaging, pages
1397–1400. Simon Fraser University, Burnaby, Canada, IEEE.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classiﬁcation
with Deep Convolutional Neural Networks. In NIPS, pages 1106–1114.
Lan, C., Heindl, A., Huang, X., Xi, S., Banerjee, S., Liu, J., and Yuan, Y. (2015).
Quantitative histology analysis of the ovarian tumour microenvironment. Scientiﬁc
Reports, 5, 16317–16317.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W.,
and Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code
Recognition. Neural computation, 1(4), 541–551.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553),
436–444.
Li, G., Bankhead, P., Dunne, P. D., O’Reilly, P. G., James, J. A., Salto-Tellez, M.,
Hamilton, P. W., and McArt, D. G. (2016). Embracing an integromic approach to
tissue biomarker research in cancer: Perspectives and lessons learned. Brieﬁngs in
Bioinformatics.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In
Proceedings of the IEEE International Conference on Computer Vision, pages
1150–1157. The University of British Columbia, Vancouver, Canada.
Marisa, L., de Reyniès, A., Duval, A., Selves, J., Gaub, M. P., Vescovo, L.,
Etienne-Grimaldi, M.-C., Schiappa, R., Guenot, D., Ayadi, M., Kirzin, S.,
Chazal, M., Fléjou, J.-F., Benchimol, D., Berger, A., Lagarde, A., Pencreach,
E., Piard, F., Elias, D., Parc, Y., Olschwang, S., Milano, G., Laurent-Puig, P., and
Boige, V. (2013). Gene Expression Classiﬁcation of Colon Cancer into Molecular
Subtypes: Characterization, Validation, and Prognostic Value. PLoS Medicine,
10(5), e1001453.
Ojala, T., Pietikäinen, M., and Harwood, D. (1996). A comparative study of texture
measures with classiﬁcation based on featured distributions. Pattern Recognition,
29(1), 51–59.
Perou, C. M., Sorlie, T., Eisen, M. B., and van de Rijn, M. (2000). Molecular portraits
of human breast tumours. Nature, 406(6797), 747–752.
Perronnin, F. and Dance, C. (2007). Fisher kernels on visual vocabularies for image
categorization. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8. Xerox Research Centre Europe, Meulan, France,
IEEE.
Roepman, P., Schlicker, A., Tabernero, J., Majewski, I., Tian, S., Moreno, V., Snel,
M. H., Chresta, C. M., Rosenberg, R., Nitsche, U., Macarulla, T., Capella, G.,
Salazar, R., Orphanides, G., Wessels, L. F., Bernards, R., and Simon, I. M.
(2013). Colorectal cancer intrinsic subtypes predict chemotherapy benefit, deficient
mismatch repair and epithelial-to-mesenchymal transition. International Journal
of Cancer, 134(3), 552–562.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for
Biomedical Image Segmentation. In International Conference on Medical image
computing and computer-assisted intervention, pages 234–241, Cham. Springer
International Publishing.
Sadanandam, A., Lyssiotis, C. A., Homicsko, K., Collisson, E. A., Gibb, W. J.,
Wullschleger, S., Ostos, L. C. G., Lannon, W. A., Grotzinger, C., Del Rio, M.,
Lhermitte, B., Olshen, A. B., Wiedenmann, B., Cantley, L. C., Gray, J. W., and
Hanahan, D. (2013). A colorectal cancer classiﬁcation system that associates
cellular phenotype and responses to therapy. Nature Medicine, pages 1–8.
Satyanarayanan, M., Goode, A., Gilbert, B., Harkes, J., and Jukic, D. (2013).
OpenSlide: A vendor-neutral software foundation for digital pathology. Journal
of Pathology Informatics, 4(1), 27.
Sivic, J. and Zisserman, A. (2003). Video google: A text retrieval approach to
object matching in videos. In Proceedings of the IEEE International Conference
on Computer Vision, pages 1470–1477. University of Oxford, Oxford, United
Kingdom.
Stålhammar, G., Martinez, N. F., Lippert, M., Tobin, N. P., Mølholm, I., Kis, L.,
Rosin, G., Rantalainen, M., Pedersen, L., Bergh, J., Grunkin, M., and Hartman,
J. (2016). Digital image analysis outperforms manual biomarker assessment in
breast cancer. Modern Pathology, 29(4), 318–329.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In
IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9. IEEE.
Van Cutsem, E., Labianca, R., Bodoky, G., Barone, C., Aranda, E., Nordlinger,
B., Topham, C., Tabernero, J., Andre, T., Sobrero, A. F., Mini, E., Greil, R.,
Di Costanzo, F., Collette, L., Cisar, L., Zhang, X., Khayat, D., Bokemeyer, C.,
Roth, A. D., and Cunningham, D. (2009). Randomized Phase III Trial Comparing
Biweekly Infusional Fluorouracil/Leucovorin Alone or With Irinotecan in the
Adjuvant Treatment of Stage III Colon Cancer: PETACC-3. Journal of Clinical
Oncology, 27(19), 3117–3125.
van Ginneken, B., Setio, A. A. A., Jacobs, C., and Ciompi, F. (2015). Off-the-shelf
convolutional neural network features for pulmonary nodule detection in computed
tomography scans. In 2015 IEEE 12th International Symposium on Biomedical
Imaging (ISBI), pages 286–289. Radboud University Nijmegen Medical Centre,
Nijmegen, Netherlands, IEEE.
Vedaldi, A. and Lenc, K. (2015). MatConvNet – Convolutional Neural Networks for
MATLAB. In ACM International Conference on Multimedia, pages 1–55.
Weigelt, B., Geyer, F. C., and Reis-Filho, J. S. (2010). Histological types of breast
cancer: how special are they? Molecular Oncology, 4(3), 192–208.