10.1101/gr.080978.108Access the most recent version at doi:
2008 18: 1944-1954 originally published online October 2, 2008Genome Res.
Haibao Tang, Xiyin Wang, John E. Bowers, et al.
angiosperm gene maps
Unraveling ancient hexaploidy through multiply-aligned
Material
Supplemental http://genome.cshlp.org/content/suppl/2008/11/06/gr.080978.108.DC1.html
References
http://genome.cshlp.org/content/18/12/1944.full.html#related-urls
Article cited in:
http://genome.cshlp.org/content/18/12/1944.full.html#ref-list-1
This article cites 51 articles, 32 of which can be accessed free at:
service
Email alerting
click heretop right corner of the article or
Receive free email alerts when new articles cite this article - sign up in the box at the
http://genome.cshlp.org/subscriptions
go to:Genome ResearchTo subscribe to
Copyright  2008, Cold Spring Harbor Laboratory Press
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
Unraveling ancient hexaploidy through
multiply-aligned angiosperm gene maps
Haibao Tang,1,2
Xiyin Wang,1,3
John E. Bowers,1
Ray Ming,4
Maqsudul Alam,5
and Andrew H. Paterson1,2,6
1
Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA; 2
Department of Plant Biology,
University of Georgia, Athens, Georgia 30602, USA; 3
College of Science, Hebei Polytechnic University, Tangshan,
Hebei 063000, China; 4
Department of Plant Biology, University of Illinois at Urbana­Champaign, Champaign,
Illinois 61801, USA; 5
Advanced Studies in Genomics, Proteomics and Bioinformatics, University of Hawaii,
Honolulu, Hawaii 96822, USA
Large-scale (segmental or whole) genome duplication has been recurring in angiosperm evolution. Subsequent gene
loss and rearrangements further affect gene copy numbers and fractionate ancestral gene linkages across multiple
chromosomes. The fragmented "multiple-to-multiple" correspondences resulting from this distinguishing feature of
angiosperm evolution complicates comparative genomic studies. Using a robust computational framework that
combines information from multiple orthologous and duplicated regions to construct local syntenic networks, we
show that a shared ancient hexaploidy event (or perhaps two roughly concurrent genome fusions) can be inferred
based on the sequences from several divergent plant genomes. This "paleo-hexaploidy" clearly preceded the
rosid­asterid split, but it remains equivocal whether it also affected monocots. The model resulting from our
multi-alignments lays the foundation for approximating the number and arrangement of genes in the last universal
common ancestor of angiosperms. Comparative analysis of inferred homologous genes derived from this model
shows patterns of preferential gene retention or loss after polyploidy and reveals large variability of nucleotide
substitution rates among plant nuclear genomes.
[Supplemental material is available online at www.genome.org.]
Ancient genome duplications are evident for many lineages of
fungi (Kellis et al. 2004), animals (Jaillon et al. 2004), and plants
(Bowers et al. 2003), offering opportunities for the evolution of
new (Spillane et al. 2007) or modified (Hittinger and Carroll
2007) gene functions, altering gene dosages, and creating new
gene arrangements. Traces from past whole-genome duplication
events can often be detected from pairwise syntenic segments,
including two sets of retained paralogs that have maintained
relative genomic locations on syntenic chromosomes. In
angiosperms, genome duplications are recurring in many lineages
(Bowers et al. 2003), generating large numbers of paralogous
loci.
Gene loss at duplicated loci effectively fractionates ancestral
linkage patterns and reduces the density of continuous stretches
of "paleologous" gene pairs, which are the remaining signatures
of paleo-polyploidy (Thomas et al. 2006). Depending on the level
of gene loss, the remaining signatures of duplication are sometimes
so eroded that the homologous segments can no longer be
identified based only on similarity to one another. The problem
is multiplied when the species in question has undergone several
genome duplications, with recent duplications tending to obscure
synteny from more ancient events as is found in most angiosperm
genomes. Such highly degenerate duplicated segments
have been referred to as "ghost duplications" and can often be
resolved by comparison to an appropriate "outgroup" genome
that did not experience polyploidy or undergo massive gene loss
(Van de Peer 2004). For example, "bridging" of ghost duplications
using outgroups has clarified the history of polyploidy in
both Saccharomyces and Tetraodon (Jaillon et al. 2004; Kellis et al.
2004; Scannell et al. 2007).
Continuous stretches of duplicate genes can be computationally
deduced through synteny, using some variants of clustering
approaches (Vandepoele et al. 2002; Hampson et al. 2005)
or more specifically using dynamic programming with a customized
scoring scheme if conserved gene order (collinearity) is also
considered (Haas et al. 2004; Wang et al. 2006). Traditional methods
for deduction of synteny based on "best-in-genome" criteria
(Miller et al. 2007), uncovering one-to-one best matching regions
during pairwise genome comparisons, are relatively straightforward
in vertebrates yet difficult in angiosperms because of
additional challenges that are more prominent in angiosperm
genomes (Tang et al. 2008). These challenges include frequent
genome duplications and convoluted genome shuffling (rearrangements,
chromosomal fusions and fissions), such as the extensive
rearrangement that has occurred in Arabidopsis within
the past 5 million years (Kuittinen et al. 2004).
One approach for the computational de-convolution of paleopolyploidy
for deduction of ancestral gene orders is a bottomup
approach in which one attempts to resolve one duplication
event at a time, starting with the most recent one. This is exemplified
by studies in Arabidopsis and Paramecium where the most
recently duplicated segments are merged to generate hypothetical
intermediate profiles that are further recursively merged
(Bowers et al. 2003; Aury et al. 2006).
Herein, we elaborate on an alternative top-down approach
6
Corresponding author.
E-mail paterson@uga.edu; fax (706) 583-0160.
Article published online before print. Article and publication date are at http://
www.genome.org/cgi/doi/10.1101/gr.080978.108.
Methods
1944 Genome Research
www.genome.org
18:1944­1954 2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
(Tang et al. 2008) that is conceptually more attractive in that it
only requires one cycle of deduction--first searching for pairwise
synteny information and then combining the resulting pairs to
form a multi-way correspondence among all structurally similar
chromosomal segments. The efficacy of the top-down approach,
however, depends on the searching strategy because of the degenerate
synteny resulting from post-duplication gene loss. In
particular, a top-down search strategy can incorporate "ghost duplications"
(Van de Peer 2004), which are not discernible using a
bottom-up approach based on information from only one spe-
cies.
New angiosperm genome sequences (Table 1) promise to
qualitatively improve our deductions about the evolution of angiosperm
gene repertoire and arrangement. Arabidopsis (Arabidopsis
Genome Initiative 2000), rice (Oryza sativa) (International
Rice Genome Sequencing Project 2005), poplar (Populus trichocarpa)
(Tuskan et al. 2006), grapevine (Vitis vinifera) (Jaillon et al.
2007), and papaya (Carica papaya) (Ming et al. 2008) have been
sequenced, and more are in the pipeline. Indeed, Arabidopsis
thaliana--a leading botanical model--is
now known to be a relatively difficult
system from which to deduce ancient
gene orders. For example, many Carica
segments show collinearity with three
or four Arabidopsis segments, showing
that two genome duplications have affected
the Arabidopsis lineage since its
divergence from Carica (Ming et al.
2008). Individual Arabidopsis genome
segments correspond to only one Carica
segment, showing that Carica has not
duplicated since its divergence from Arabidopsis.
Both Vitis and Carica have
only one duplication event, , while
and occurred in the Arabidopsis lineage
after its divergence from the Carica
lineage (Ming et al. 2008; Tang et al.
2008).
Some newly sequenced genomes
have less complicated genome structure
and thus may represent better models
for comparative genomics than Arabidopsis.
In this study, we exploit fragmentary
conservation of plant gene orders
from multiple genomes along with a
new top-down algorithm MCscan, to
improve deductions about the course of
angiosperm genome structural evolu-
tion.
Results
MCscan: Algorithm for multiple gene order alignments
When several genomes and subgenomes (resulting from ancient
duplication events) are compared simultaneously, synteny and
collinearity between all possible pairs of genomes are tedious to
enumerate because chromosomal homology is "transitive." For
example, if there are corresponding chromosomal regions in
three genomes A, B, and C, comparisons between the genomes
would reveal three pairwise synteny blocks (A-B, B-C, A-C),
whereas it could be better represented as a single multiple synteny
block (A-B-C). To solve this problem, we implemented a
novel algorithm, MCscan, that exploits this transitivity property
of collinearity to perform multiple alignments by incorporating
pairwise synteny that is derived from shared evolutionary events.
The algorithm involves a four-stage pipeline illustrated in
Figure 1, with each individual stage described in further detail in
Methods.
We first use a sequence similarity search program to detect
Figure 1. Flow-chart of MCscan core algorithm.
Table 1. Summary of sequenced plant genomes based on respective genome publications
Species Assembly statusa
Assembled/estimate size Annotation version Annotated gene no.
Arabidopsis (Arabidopsis thaliana) BAC-by-BAC 115 Mb/160 Mb TAIR version 7 26784
Papaya (Carica papaya) WGS, N50 = 11 kb 278 Mb/372 Mb University of Hawaii 25536
Poplar (Populus trichocarpa) WGS, N50 = 125 kb 410 Mb/485 Mb JGI version 1.1 45554
Grape (Vitis vinifera) WGS, N50 = 65 kb 468 Mb/487 Mb Genoscope release 30434
Rice (Oryza sativa ssp. japonica) BAC-by-BAC 371 Mb/389 Mb RAP release 2b
29389
a
(BAC) Bacterial artificial chromosome; (WGS) whole-genome shotgun; (N50) maximum length L such that 50% of all bases are in contigs of length at
least L.
b
We only used mapped representative loci for the rice annotation project (RAP) release (Itoh et al. 2007).
Paleopolyploidy and angiosperm genome evolution
Genome Research 1945
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
matchings among genes in all possible pairs of chromosomes and
scaffolds and in both transcriptional directions. This is followed
by the "pairwise collinearity" stage, in which the neighboring
matches are chained along using dynamic programming. The
pairwise collinear blocks are combined in the "multicollinearity"
stage, by fixing one gene order as reference and then
heuristically stacking the pairwise synteny tracks one after another.
In this step, we need to use a "reference" gene order as the
basis for stacking the tracks; we then describe the aligned synteny
blocks as "threaded by the reference order," a procedure inspired
by TBA aligner (Blanchette et al. 2004). Once the multi-syntenic
blocks are identified, we can classify the segments and index
them to different evolutionary events, mainly duplications and
divergence.
As a result, MCscan condenses the combinatorial matches
between multiple chromosomal segments resulting from divergence
and recursive duplication events and creates a view of the
multiply-aligned segments.
Patterns of synteny conservation
Using the top-down algorithm MCscan, we have aligned large
portions of the five sequenced genomes (Arabidopsis, Carica,
Populus, Vitis, and Oryza) based on synteny. A total of 61% of the
Arabidopsis genes have preserved their ancestral locations based
on cross-species synteny (Table 2), versus 44%, 51%, and 46% of
Carica, Populus, and Vitis genes, respectively.
The variation in frequencies of aligned genes might be due
to different levels of synteny conservation in different species.
However, it is also correlated with the degree of contiguity of the
respective sequences (Table 1), with a higher percentage of genes
explained by synteny in the genomes with higher N50. Indeed, if
most genes are in small or unanchored scaffolds, it would be very
difficult for MCscan to detect them as syntenic, even if they do
remain in their ancestral locations.
Alignments with gene order preserved across four eudicot
species show clear triplicated structure in many local regions.
Each triplicated branch contains orthologous segments from up
to four Arabidopsis regions, one Carica region, two Populus regions,
and one Vitis region, supporting the hypothesis that this
genome triplication ( ) occurred in a common ancestor of all
four species; Populus has one duplication event (p) in its salicoid
lineage, and Arabidopsis has two duplications ( and ) in its
crucifer lineage. The multiple alignments were threaded by Vitis
as the reference order (Supplemental Data 1), since Vitis appeared
to have the most close-to-ancestral karyotype among the genomes
that we investigated (Jaillon et al. 2007). This is likely to
change in the future when we include additional genomes; however,
using Vitis as the current "reference" would produce the
best solution so far.
The triplication of gene loci is also evident from Table 2. For
example, we found that 88 aligned loci in Carica have multiplicity
levels of three (triplication ), with only one aligned locus
exceeding a multiplicity of 3; 54 aligned loci in Populus have the
expected multiplicity level of 6 (triplication duplication p),
but only three loci exceed 6. The loci that exceed the expected
multiplicity level are likely produced by additional small-scale
(single gene or segmental) duplications in each lineage.
Further circumscribing the duplication event
The duplication event was previously dated to have occurred
after the monocot­dicot separation but before the expansion of
the rosids (Jaillon et al. 2007). We investigated the lower boundary
of this claim by sampling genomic regions from other eudicots
outside the rosids for which long, contiguous sequences
(BACs) were available in GenBank, including tomato (Solanum
lycopersicum) and banana (Musa acuminata).
We first mapped unigenes onto 194 sequenced tomato (Solanum
lycopersicum) BACs as preliminary gene annotation and
inspected synteny to Vitis. Among the 78 Solanum BACs that
have more than 10 distinctively mapped unigenes, 72 have more
than 50% of genes showing primary synteny to a single Vitis
chromosome (Supplemental Data 2). Each individual tomato
BAC corresponds closely to only one of the triplicate regions
rather than showing equal matches to each of the three paleohomeologous
chromosomes in Vitis. Figure 2A shows one example
of a Solanum BAC that aligns to the Vitis gene order. Although
the Solanum BACs that we inspected only represent
2.5% of the genome, the evidence so far strongly supports the
hypothesis that triplication occurred in a common ancestor of
asterids and rosids. Under this scenario, each Solanum segment
would be expected to have up to four primary syntenic segments
in Arabidopsis, as has been suggested (Ku et al. 2000).
Based on a similar notion, Jaillon et al. (2007) calculated the
relative abundance of one-to-three cases between Oryza and Vitis
and suggested that the triplication occurred after the monocotdicot
split. It is tempting to push the dating of further, yet we
consider such dating to have uncertainties in view of current
evidence. Contrary to the well-conserved synteny within the eudicot
group, only 14% of Oryza genes could be placed in crossspecies
gene clusters (Table 2). This proportion represents the
actual extent of collinearity between Oryza and any of the four
eudicots, as Oryza is the only monocot genome included in this
Table 2. Number of clustered groups of genes at different multiplicity levels in five angiosperm species
Species
Multiplicity level
No. of ancestral
loci No. of genes (%)
WGD or segmental
expansion1 2 3 4 5 6 7 8 9 10
Arabidopsis 6742 2642 868 282 80 32 6 5 1 1 10,659 16,451 (61%) 54%
Carica 9118 942 88a
1 0 0 0 0 0 0 10,149 11,270 (44%) 11%
Populus 5147 6362 763 618 96 54a
3 0 0 0 13,043 23,457 (51%) 80%
Vitis 9926 1671 239a
15 2 0 0 0 0 0 11,853 14,055 (46%) 18%
Oryza 2197 685 140 35 9 2 0 0 0 0 3068 4184 (14%) 36%
The statistics are based only on groups that contain genes from at least two different species, as constructed from syntenic alignments. The number of
inferred ancestral loci is calculated by 10
m=1Nm, and the number of genes that maintain their ancestral positions is calculated by 10
m=1m Nm, where m is
the multiplicity level varying from 1 to 10 and Nm is the number of groups for each multiplicity level.
a
Expected multiplicities for Carica, Populus, and Vitis. The multiplicity for Arabidopsis is 12 (yet no gene groups retained all 12 copies), and equivocal for
Oryza.
Tang et al.
1946 Genome Research
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
study. Therefore, it is more difficult to make accurate inference of
synteny patterns because of the greater evolutionary distance involved
and additional duplication in the cereal lineage. While
several studies hinted that additional monocot duplication(s)
predated the cereal duplication (Zhang et al. 2005; Jaillon et al.
2007), whether such additional duplication(s) found in Oryza
correspond to the triplication we saw in core eudicots remains
to be determined.
We also examined synteny to Vitis for chromosomal regions
from a monocot species that is basal to the cereals--banana
(Musa acuminata). On average, the levels of synteny between
Musa BACs and Vitis chromosomes are 50% lower than synteny
between Solanum and Vitis. Furthermore, in contrast to the oneto-one
primary synteny pattern of Solanum
and Vitis, Musa BACs show roughly
equal matches to any of the three
homeologs in Vitis (Fig. 2B), a pattern
similar to Oryza­Vitis. However, failure
to detect one-to-one (as opposed to
one-to-three) correspondence between
monocot regions and Vitis cannot be
viewed as strong evidence that occurred
after the eudicot­monocot split.
An alternative but equally plausible scenario
is that the monocots and eudicots
share but diverged soon after occurred.
Under this scenario, the gene arrangements
between two orthologous
chromosomes would share very little
synteny because of stochastic, independent
gene losses in both lineages--
leading to similarly low levels of correspondence
of chromosome in one taxon
to each of its three paralogs in another
taxon.
While highly specific one-to-one
synteny is indicative that two lineages
share the triplication, frequent one-tothree
synteny is not necessarily indicative
that one lineage lacks the triplication.
So far we can only confidently
place the triplication before the asterid­rosid
split and consider the status of
the paleo-hexaploidy in the monocot
lineage to be unclear.
It is difficult to test the hypothesis
that the triplication predated the divergence
of monocots and eudicots. For
example, additional data from an outgroup
genome such as Amborella would
help, but does not necessarily solve the
placement of the triplication if is
found absent in that outgroup. Much of
the uncertainty is rooted in the fact that
the triplication is an ancient event that
at least predated the asterids­rosids, and
comparisons across this evolutionary
distance are often less effective. Therefore,
we need broader and more judicious
sampling of plant taxa. Indeed,
fortuitous discoveries of genomes like
grapevine that have close-to-ancestral
karyotypes facilitate comparisons across major angiosperm
clades. Similarly, additional karyotypically conserved monocot
or basal angiosperm genomes that are free of recent polyploidies
might better elucidate the scenario.
Comparisons of paleologs show that triplicate subgenomes
are mostly homogeneous
We tested whether any two of the three subgenomes are genetically
closer to one another than the third. We retrieved paleolog
groups that have retained genes from all three subgenomes,
on different chromosomes or scaffolds in Carica or Vitis, the two
genomes that are unaffected by additional duplications other
Figure 2. Collinearity between triplicate Vitis -homeologous regions with BAC sequences from
Solanum (A) and Musa (B). (Black glyphs) Genes with the tip showing the transcriptional direction;
(gray shades) synteny matches between a Vitis gene and Solanum or Musa sequences.
Paleopolyploidy and angiosperm genome evolution
Genome Research 1947
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
than . We then inferred gene trees for these triplet groups under
the assumption that if two subgenomes are, indeed, more similar
to each other than to the third, we expect to see only one prevalent
tree topology along paleolog groups within the same ancestral
duplicated (triplicated) segment. Only a limited data set is
suitable for this study since we need to have enough triplets
along the three subgenomes that are derived from the same ancestral
segment. We picked 10 blocks with five or more Vitis
triplets (this cutoff was chosen arbitrarily as we need enough
triplets within each block for inference, yet we do not have many
blocks that have more than six or seven triplets). Nonetheless, we
failed to find one dominant topology for any block, with a typical
example shown in Figure 3. The fact that the subgenomes
are indistinguishable from each other makes it unlikely that one
of the triplicated subgenomes may have originated from largescale
segmental duplications or aneuploidy. Instead, the triplication
may have been an ancient auto-hexaploidy formed from
fusions of three identical genomes, or allo-hexaploidy formed
from fusions of three somewhat diverged genomes. We are not
able to determine whether the fusion(s) were a single event or
two events a relatively short time apart (the latter case, e.g., characterizing
the well-studied evolution of hexaploid wheat). A
more definitive test of allo-hexaploidy versus auto-hexaploidy
would only be possible if extant diploid parental species can be
found, and this is unlikely since the genome duplications appear
to be pervasive throughout most angiosperm clades including
the basal lineages (Cui et al. 2006).
Discussion
By exploiting fragmentary conservation of plant gene orders, together
with a new top-down multi-alignment approach, limitations
of Arabidopsis for comparative genomics are mitigated by
using new angiosperm genome sequences to qualitatively improve
our deductions about the tempo and modes of evolution of
angiosperm genes and genomes.
Rate variations between paleologs within four eudicot species
Deduction of a consensus gene order for multiple taxa permits us
to directly compare estimates of the ages of gene duplications
based on rates of nucleotide substitution per synonymous site
(Ks) between paleolog pairs (syntenic paralogs), filtering out the
inevitable influence of background (i.e., single gene) duplications,
which superimpose an L-shaped curve on the relics of
whole-genome duplications (Blanc and Wolfe 2004; Cui et al.
2006). By excluding the single gene duplications,
we were able to analyze the Ks
distribution using mixtures of lognormals
(see Methods).
Although apparently occurred in
a common ancestor of Carica, Populus,
and Vitis, the median Ks between Vitis
paleologs (1.22) is much lower than
that of Carica (1.76) and Populus (1.54)
(Table 3). The median values of Ks
among duplicates in these three genomes
show highly significant difference
(Kruskal-Wallis one-way ANOVA,
P = 2.25 10 142
).
The Ks distributions analyzed with
mixture models show the expected
number of components for each species, except for Arabidopsis,
where we can find only two instead of three distinct components
(Table 3). This two-peak distribution (Fig. 4B) is similar to the
results of a previous study (Maere et al. 2005) even though MCscan
provides better deductions about the identities of paleologs.
We postulate that more rapid substitutions occur at synonymous
sites in Arabidopsis than in the other three eudicot species, with
Arabidopsis paleologs being saturated with synonymous substitutions.
Therefore, within Arabidopsis, Ks-based distances between
paralogs cannot differentiate duplicates from either the
tail of the distribution of duplicates, or from noise, or both. The
median Ks values between Arabidopsis and duplicates are close
to saturation (2.00), much larger than those of the duplicates in
the other three species (Table 3). Repeating the analysis using a
more conservative genetic distance­transversion rate at fourfold
degenerate sites (4DTV) (Fig. 4C) shows almost the same pattern
as using Ks, suggesting that the saturation effect of DNA substitutions
may have also affected 4DTV distance.
Differences in the median values of distances between the
paralogs that are derived from the common event can be explained
by different substitution rates among the four rosid lineages.
We constructed a phylogenetic tree with per-branch Ks
estimates, based on orthologous gene groups that are strictly
single copy in all five species (Fig. 4D). The same trend was
found, with increasing evolutionary rates in branches leading to
Vitis, Populus, Carica, and Arabidopsis, respectively, suggesting
that the variations of substitution rates are not confined to populations
of duplicate genes but are rather lineage-specific. A similar
range of nuclear rate variation in flowering plants has been documented
in previous studies, and is often associated with life history
(Gaut et al. 1996; Koch et al. 2000). In general, the short
generation time in the annual Arabidopsis might have contributed
to the fast substitution rates compared with Populus or Vitis,
Figure 3. Topologies for five proximal ancestral loci that contain three collinear Vitis genes. Vitis
gene names are abbreviated as "[chromosome].[gene index]" for graphing. Each tree was rooted using
one best-matching moss gene, identified by JGI protein accession number. The numbers above
branches are bootstrap values in the phylogenetic reconstruction. There are a total of 10 local blocks
that have more than five triplets in Carica and Vitis that are studied in the same way. Phylogenetic
analysis was performed using PHYLIP version 3.67 (Retief 2000). The analysis was carried out using the
protdist program (default parameters) followed by neighbor-joining using neighbor. We used the
seqboot program to simulate 100 bootstrap replicates and the consense program to retrieve one
consensus tree.
Table 3. Mixture model estimates for distributions of Ks between
paleologs in each species
Species
Sample
size
No. of
mixture
components Median Variance Proportion
Arabidopsis 7435 2 0.86 0.08 0.51
2.00 0.20 0.49
Carica 907 1 1.76 0.32 1
Populus 13,113 2 0.27 0.01 0.62
1.54 0.24 0.38
Vitis 2288 1 1.22 0.16 1
Tang et al.
1948 Genome Research
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
which are perennials. However, because life history attributes
tend to change over evolutionary time, the generation-time effect
is not sufficient to explain the rate heterogeneity among
different organisms (Gaut et al. 1996).
Because substitution rates vary among lineages, timing of
duplication or speciation events is hard to determine using genetic
distance measures alone. For the same reason, dating of
ancient events based on phylogenetic trees (Bowers et al. 2003;
Tuskan et al. 2006) could produce incongruous results since the
drastic differences in rates may lead to incorrect trees that are
artifacts because of long-branch attractions (Felsenstein 2004).
One phylogenetic model placed Vitis within the eurosid I
clade (Jaillon et al. 2007), in contrast with the prevailing view of
the Vitaceae as sister to both eurosid I and eurosid II (Davies et al.
2004; Soltis et al. 2005). Indeed, Populus and Vitis do show small
Ka or Ks values for substitutions between inferred orthologs
(Table 4). However, the seemingly smaller distance between
Populus and Vitis genes should be interpreted with caution since
both species appear to have relatively slow evolutionary rates.
The striking differences in evolutionary rates among these taxa at
the DNA sequence level may, in part, explain the controversial
placement of Vitis inside the eurosids by some investigators (Jaillon
et al. 2007). Indeed, we found that if we use Arabidopsis as the
reference point, the increasing Ks distances from Carica, Populus,
and Vitis appear to support the view that Vitis is an outgroup to
the rosids (Table 4).
Inferring the number and arrangement of genes
in the ancestral angiosperm
Top-down multiple alignments mitigate the fragmentation and
decay of ancestral gene orders, improving our ability to deduce
the number and arrangement of genes in the last common ancestor
of a group of genomes. When we align gene orders to
produce multiple collinear segments, corresponding genes are
collected and merged into a deduced "ancestral locus." A total of
18,447 deduced ancestral loci (corresponding gene groups) collectively
represent 77,059 genes in the five species we studied (see
Figure 4. (A,B) Distribution of Ks distances among Carica, Populus, Vitis, and Arabidopsis paleologs. Ks values are grouped into bins of 0.1 intervals.
Certain Ks intervals are highlighted as they correspond to several presumed whole-genome duplication events. Dotted lines are fitted mixtures of
log-normal distributions for the paleolog Ks distributions (see Methods). (C) Distribution of 4DTV distance among paleologs in the same four eudicot
lineages. (D) Phylogeny of single-copy ortholog set used in relative rate estimates. A total of 47 orthologous genes that are single copy in all five species
were used in the analysis. Protein alignments for each ortholog group were constructed and then used to guide DNA alignments. The alignments are
then concatenated, with 53,856 aligned nucleotide positions. Per-site Ks values on each branch were estimated by codeml in the PAML package (Yang
1997) using a constrained topology that reflects organismal relationships.
Paleopolyploidy and angiosperm genome evolution
Genome Research 1949
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
Supplemental Data 3 for complete compilation). Among these
loci, 3680 (20%) are specific to only one species, and 14,767
(80%) contain genes from at least two different species. We studied
the compositions of the cross-species groups (Table 2). If all
duplicates derived from each genome duplication event had been
retained, each clustered ancestral locus ideally would have three
Carica genes ( only), three Vitis genes ( only), six Populus genes
( , p), and 12 Arabidopsis genes ( , , ). Such extreme cases were
not observed. However, we still find cases in which the copy
numbers are close to saturation in these genomes (Table 2), and
two specific cases are further discussed in the next section.
Conceptually, the number of cross-species syntenic gene
clusters would reflect the gene number prior to , by far the most
ancient duplication detected by our collinearity algorithm. If we
assume that most genes retain their ancestral positions, then by
using only the set of genes that show cross-species synteny and
correcting for the "inflation" induced by genome duplications,
we can have a relatively accurate estimate of ancestral gene number.
The four fully sequenced eudicots each yield slightly different
estimates of this number, varying from 10,149 for Carica to
13,043 for Populus (Table 2). This range coincides closely with
previous estimate of ancestral angiosperm gene numbers of
12,000­14,000 based an independent gene birth model (Sterck et
al. 2007). Our number, however, may be an underestimate considering
that the alignment algorithm does not achieve perfect
sensitivity. Moreover, lower contiguity and less progress in annotation
may tend to reduce the Carica number, and appreciable
heterozygosity in the sequenced genotype (resulting in alleles
sometimes being considered different loci) may somewhat inflate
the Populus number.
In contrast to estimation of ancestral gene number, inference
of ancestral gene order is a much harder problem, and our
computationally reconstructed gene order should not be considered
as truly "ancestral." In our analysis, the inferred ancestral
gene order was deduced by taking the consensus of the aligned
gene orders among various chromosomes and scaffolds in the
five species we investigated, similar to some previous approaches
(Blanc et al. 2003). Ideally such consensus orders would be required
to reflect all the gene arrangements aligned in the same
block. However, the solution is not unique as there may be several
possible consensus arrangements under these constraints.
For example, different permutations of interleaving genes between
the syntenic anchors would have equal likelihood of being
"ancestral." In general, the gene groups that have fewer copies
may have fewer constraints in the consensus arrangements and
therefore cannot be precisely ordered computationally.
Implications for particular eudicot gene functional groups
By combining available positional information with sequence
homologies, our method improves on other orthology/paralogy
mapping algorithms that depend mainly on similarity scores,
such as OrthoMCL (Li et al. 2003), Inparanoid (O'Brien et al.
2005), and the like. Since the clusters are inferred by syntenic
alignments, any gene family constructed by our method contains
at least two genes. Genes duplicated by single-gene or tandem
duplications do not fall on collinear chains, and thus are excluded
from the syntenic gene groups by our algorithm. In contrast,
since some of these small-scale duplications are recent and
show higher similarities than the paleo-duplicates, they are more
easily included by traditional homology-based clustering meth-
ods.
The exponential growth in gene numbers resulting from
recurring polyploidies is often tempered by a massive yet progressive
amount of gene death in the subsequent diploidization
process. However, the probability of gene loss is not uniformly
distributed among all gene functional groups (Maere et al. 2005).
Convergent restoration of some genes to singleton status after
multiple rounds of duplication in independent lineages suggests
that there may be selective advantages for the organism to have
only single copies of these genes (Paterson et al. 2006). However,
the most extreme cases of "duplication resistance," gene functional
groups for which one and only one copy per nucleus is
adaptive, would provide too little information to be inferred as
duplication-resistant by previous 2
-based statistical methods
(Paterson et al. 2006). Multi-alignment improves our ability to
identify candidate duplication-resistant genes that fall into this
most extreme category, in that if a single gene is always restored
to singleton following a sufficient number of independent duplications,
then duplication resistance of that single gene might
be inferred. Such genes have curiously "resisted" multiple duplication
cycles in multiple independent lineages, specifically one
round of duplication ( ) in Carica and Vitis, two (p, ) in Populus,
three ( , , ) in Arabidopsis, and one ( ) or more in Oryza. Indeed,
some syntenic groups have preserved exactly one copy in
the ancestral location for each of the five species. Some genes in
these groups are not true "singletons," with non-syntenic copies
present in the genome because of single gene duplications. After
filtering out such non-syntenic copies, we found 47 strict singleton
groups for five angiosperm genomes preserved in collinear
linkage groups, supporting their inferred orthology to one another.
If we assume that the diploidization process is completely
independent in each of the five species, we can estimate the
expected number of singleton groups by multiplying the proportions
of singleton genes in each genome by the average gene
number. Under this estimate, the 47 singleton groups we found
are nearly 10 times more than the expected five groups. We also
found 247 strict singleton groups for only the four eudicot genomes
(versus 20 explicable by chance). The gene IDs and functional
annotations for the singleton groups are available in
Supplemental Data 5. Many of the singleton genes have only
putative classifications, and those of known functions are mostly
enzymes.
The multiplicities in ancestral loci constructed by MCscan
also revealed extreme cases in which ancestral loci were "deletion-resistant,"
with a tendency to be preserved in consistently
high copy numbers in multiple species (Table 5). Since both
Carica and Vitis have only one round of duplication with multiplicity
of 3, while Populus has two rounds of duplications and
Table 4. Ks and Ka values for syntenic orthologs of five sequenced
plant genomes
Arabidopsis Carica Populus Vitis Oryza
Arabidopsis -- 0.24 0.23 0.25 0.37
Carica 1.57 (6913) -- 0.17 0.19 0.35
Populus 1.64 (8366) 1.08 (8504) -- 0.16 0.31
Vitis 1.72 (7381) 1.12 (7920) 0.98 (10,143) -- 0.32
For each syntenic group, the smallest Ks or Ka value among all orthologous
pairs was retrieved to represent the value. The lower triangle shows
median Ks values, and the upper triangle shows median Ka values. Numbers
in brackets correspond to the number of syntenic groups used in
each comparison. Ks values between Oryza and four eudicots show saturated
substitutions and high variances and therefore should not be considered
reliable estimates.
Tang et al.
1950 Genome Research
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
multiplicity of 6, we specifically selected the ancestral loci that
are saturated with paleologs for these three species, requiring that
the groups that we chose have Carica multiplicity 3, Populus
multiplicity 5, and Vitis multiplicity 3 at the same time (Table
5). We set these copy number cutoffs because Carica, Populus, and
Vitis have expected copy numbers of 3, 6, and 3 respectively, and
we slightly loosened the Populus cutoff to look at more groups
that are close to saturation. A total of 30 such groups were found
(Table 5). Considering that very few groups have exceeded the
threshold for each species (Table 2), the chance that 30 random
groups satisfy all three thresholds is almost non-existent ( 2
-test,
P = 2.2 10 16
).
In contrast to "duplication-resistant" genes, many "deletion-resistant"
loci of known function are transcription factors,
consistent with previous findings that transcriptional regulators
are significantly over-retained in WGD duplicates (Seoighe and
Gehring 2004; Freeling and Thomas 2006). For example, N05829
contains five Arabidopsis MADS-box genes (AGL14, AGL19, SOC1,
AGL42, AGL72), all descended from a single ancestral preMADS-box
gene. N03285 (contains Arabidopsis genes LBD40,
LBD41, LBD42) and N07685 (contains Arabidopsis genes LBD37,
LBD38, LBD39) collectively comprise all six class II lateral organ
boundaries (LOB) gene family members characterized to date
(Shuai et al. 2002), which we infer to trace to two ancestral (pre- )
LOB class II genes.
Comparative analysis for genes derived from "deletionresistant"
loci that have largely expanded following each round
of polyploidy have important implications
for studying plant gene family evolution.
Because of less gene loss, such gene families
show improved power to resolve particular
evolutionary events. Using two ancestral
loci that are close to each other
in the local ancestral order and highly
saturated with paleo-duplicates, N01482
(C2H2 transcription factor family) and
N01483 (auxin-response protein), we constructed
phylogenetic trees for the gene
members. Both phylogenetic trees (Fig. 5)
support the coarse partitioning of three
subclades, with each clade containing up to
four Arabidopsis genes, two Populus genes,
one Carica gene, and one Vitis gene. These
two examples also support the inference
that Arabidopsis genes evolve more quickly
than Vitis genes. This is reflected by the
longer branches, that is, more nucleotide
substitutions for Arabidopsis genes
within individual subclades. Indeed, differential
evolutionary rates have some impact
on the N01482 tree topology, as one
Vitis gene (Vv4g1235) appears to be even
closer to one of its paleologs (Vv18g1188)
than to its orthologs in the three other
species. One possible alternative explanation
is that these two Vitis genes have
undergone homogenization, as has been
shown to occur in some paleo-duplicated
genes in Oryza genome (Wang et al.
2007).
Methods
Gene set and sequence homology search
Protein sequences from Arabidopsis, Carica, Populus, Vitis, and
Oryza genome annotations were used (Table 1). A few annotated
moss (Physcomitrella patens) genes (JGI annotation version 1.1)
were also used as the outgroup in gene tree analysis. Carica, Populus,
and Vitis gene names were renamed according to their incremental
position on the chromosomes or scaffolds (see Supplemental
Data 4 for a conversion table to original gene identifiers).
In case the original gene identifiers are subject to future changes,
the conversion table will be updated accordingly to ensure easy
translation. If a gene had more than one transcript, only the first
transcript in the annotation was considered. Each genome was
compared against itself and other genomes using BLASTP
(Altschul et al. 1990), retrieving the best five hits meeting an
E-value threshold of 1 10 5
.
Pairwise gene order alignments
The syntenic regions were grouped to form multiple alignments
using a novel algorithm MCscan (multiple collinearity scan). We
first took whole-genome BLASTP results and computed strictly
collinear segments for all possible pairs of chromosomes and
scaffolds. A pairwise alignment procedure was implemented using
an empirical scoring scheme similar to that of Haas et al.
(2004). The default scoring scheme (configurable) is min(log10 E,
50) match score for one gene pair, and 1 gap penalty for each
10-kb distance between any two consecutive gene pairs. The
Table 5. Thirty ancestral loci selected based on saturated paleolog copy numbers in
Carica (3 copies), Populus (5 copies), and Vitis (3 copies)
Ancestral
locus ID Carica Populus Vitis Arabidopsis Gene familyb
N00011 3 6 3 4
N00123 3 6 3 3
N00137 3 6 3 6
N00535 3 6 3 5 GRAS transcription factor
N00715 3 5 3 3
N01470 3 6 3 8
N01482a
3 6 3 7 C2H2 transcription factor
N01483a
3 5 3 10
N01501 4 5 3 6
N01504 3 5 3 6
N01732 3 6 3 8
N01831 3 5 3 3
N02420 3 5 3 6 C3H transcription factor
N02938 3 6 3 3
N03063 3 6 3 6
N03148 3 5 3 3 Phosphatidylinositol-4-kinase g
N03158 3 5 3 4
N03159 3 6 3 5 C2C2-Dof transcription factor
N03285 3 5 3 3 Lateral organ boundaries gene, class II
N03326 3 5 3 3
N03658 3 5 3 4
N03794 3 5 3 4
N04406 3 5 3 6 Kinesin-like proteins
N05304 3 6 3 4 C3H transcription factor
N05519 3 6 3 3 EF-hand containing proteins: Group IV
N05829 3 5 3 5 MADS transcription factor
N05866 3 6 4 6 AP2-EREBP transcription factor
N06369 3 5 3 5 C2C2-Gata transcription factor
N07685 3 6 3 3 Lateral organ boundaries gene, class II
N07692 3 6 3 3 Core cell cycle genes
The ancestral loci reference IDs are available in Supplemental Data 3.
a
Used in the phylogenetic reconstruction in Figure 5.
b
Based on curated Arabidopsis gene families from TAIR (Huala et al. 2001).
Paleopolyploidy and angiosperm genome evolution
Genome Research 1951
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
score for each pairwise collinear chain is then calculated via dynamic
programming through the following recurrence condition,
assuming that two gene pairs, u and v, are on the path
where u precedes v,
ChainScore v = MatchScore v + max
u
{ChainScore u
+ GapPenalty u,v ,0}
Tandem matches <50 kb apart are collapsed using a representative
pair that has the smallest BLASTP E-value. This threshold,
indeed, did not purge all tandems--we still found a very few
long-distance tandems in our clustered ancestral loci--however,
this is reasonable trade-off since increasing the threshold would
remove some of the intra-chromosomal WGD duplicates. All
pairwise segments with scores above 300 are reported. Each pairwise
segment consists of two distinct genomic locations with
aligned, collinear genes as anchors.
The expected number of occurrences of a pairwise collinearity
pattern could be estimated with the following, similar to the
one used in Wang et al. (2006),
E = 2PN
m
i=1
m-1 l1i
L1
l2i
L2
,
where N is the number of matching gene pairs (by BLASTP or
BLAT, etc.) between two chromosomal regions defined by the
syntenic block; m is the number of collinear gene pairs in the
identified block; L1 and L2 are respective lengths of the two chromosomal
regions; and l1i and l2i are distances between two adjacent
collinear gene pairs in the syntenic block. The expectation
multiplies by two since there are two possible orientation configurations
between two collinear segments. This is only an approximation
to a more rigorous yet computationally expensive
permutation test (Van de Peer 2004) and Monte Carlo methods
(Hampson et al. 2005); however, computational experiments and
analytical results (Wang et al. 2006) suggest that this gives a
reasonable estimate for the significance of the syntenic blocks.
All the pairwise alignments that we reported are significant at
E < 1 10 10
.
Multiple gene order alignments
Pairwise syntenic matches were clustered into multi-way anchors
through a Markov clustering algorithm MCL (Enright et al.
2002), in order to simplify the correspondences among multiple
loci. Multiple chromosomal regions threaded by consecutive ancestral
loci are recovered and aligned using a heuristic that constructs
the multiple alignments progressively by aligning one
closest-related region at a time by dynamic programming. We
then use a reference genome to report all the multiple blocks.
Notice that when we use a "reference" as the basis, we lose symmetry.
For example, let us assume A-B-C as a multiple alignment,
formed by syntenic regions A, B, and C. If we allow the blocks to
be threaded by A, B, or C, we can find this block three times;
however, the resulting multiple alignment may be slightly different
because of the order in which we stack A, B, and C. We
found that the "once a gap, always a gap" rule applies to the
multiple alignment of gene orders, in that the order of progressive
stacking does affect the resulting alignment. Therefore, we
implement a refinement procedure to ameliorate such effect by
Figure 5. Phylogenetic analysis of ancestral loci N01482 (A) and N01483 (B). Coding sequences of all members in four eudicot species for each
ancestral locus (19 genes in N01482, 21 in N01483) were aligned by CLUSTALW (Thompson et al. 1994) using parameters suggested by Hall (2007).
Phylogenetic relationships among the members and sequences were grouped into clades using MrBayes (Ronquist and Huelsenbeck 2003). The Bayesian
analysis was carried out for 500,000 generations using the General Time Reversible plus Gamma (GTR+G) substitution model selected based on
MODELTEST (Posada and Crandall 1998). All branches with support <50% are collapsed into a polytomy. A majority tree was presented in both cases.
The gene names for Carica, Populus, and Vitis are recoded to reflect relative orders on chromosome or scaffold (see Methods). The conversions from the
original locus identifiers to the re-indexed gene names are available as a conversion table in Supplemental Data 4. In case the original gene identifiers
are subject to future changes, the conversion table will be updated accordingly. Arabidopsis gene names follow their standard TAIR locus IDs. Scale bars
represent the number of substitutions per site following the GTR+G model.
Tang et al.
1952 Genome Research
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
iteratively realigning each segment, allowing the falsely placed
gaps to be corrected and further optimize the gap placement.
Clustering the multiply-aligned genomic regions
If we consider "gene retention at the ancestral locus" as the ancestral
state and "gene loss" as derived, then each aligned chromosomal
segment can be described as a vector of binary characters.
We could then search for hierarchical clustering based on
"Camin-Sokal parsimony" since genes that had been lost are
highly unlikely to re-emerge at original paleologous locations,
that is, reversal to the ancestral state is prohibited (Camin and
Sokal 1965). Using this simplistic parsimony principle, syntenic
genomic regions in multiple alignment blocks can be clustered,
using the "mix" program in the PHYLIP package (Retief 2000)
with 0/1-coded chromosomal regions within each block as input.
MCscan implementation and availability
The multi-aligned plant gene orders and implemented algorithm
and C++ source codes are publicly available (http://chibba.
agtec.uga.edu/duplication/mcscan/). The program uses only two
input files--a file containing BLASTP results and a file describing
gene coordinates--and outputs both pairwise syntenic blocks
and the multi-aligned gene orders threaded by a reference genome.
There are several parameters to configure according to the
user's need. For example, the significance cutoff would reduce
sensitivity but increase specificity for the uncovered syntenic
blocks.
Comparison between Vitis and Solanum, Musa
For Solanum, we downloaded 195-nt sequences for tomato (Solanum
lycopersicum) from NCBI (September 2007) that were 100
kb, discarding one chloroplast sequence from analysis, for a total
of 25 Mb (representing 2.5% of the tomato genome). We retrieved
53,792 TIGR Solanum unigenes (S. lycopersicum TIGR transcript
assembly version 5), mapping them to the collected BACs
(BLASTN E-value < 1 10 6
) and took the best hit that had
200-bp alignment length and 97% identity. This should accommodate
minor sequencing errors or cultivar differences between
the ESTs and BACs, if any. If multiple unigenes went within 300
bp on the tomato sequence, only the longest hit was retained.
This was to resolve cases in which the unigenes were not assembled
completely or correctly for a gene and the real gene was
represented by more than one unigene. A total of 2243 Solanum
unigenes, 4.2% of the total, were anchored to BACs. Solanum
unigenes were assigned their base-pair locations within the BACs,
and we used these mapped unigenes as tentative gene models on
these Solanum BACs. The mapped unigenes were then searched
for homology against the Vitis proteins using BLASTX
(E < 1 10 5
). We analyzed synteny of Vitis chromosomal regions
and 17 banana (Musa acuminata) BACs in a similar proce-
dure.
Synonymous substitution (Ks) and fourfold degenerate site
transversion (4DTV) calculation
For each pair of homologs, we aligned their protein sequences
using CLUSTALW (Thompson et al. 1994) and converted the
protein alignment to DNA alignment using PAL2NAL (Suyama et
al. 2006). Some homologous genes could not produce reliable
CLUSTALW alignment for various reasons and were discarded
from further analysis. Ks values were calculated using the NeiGojobori
algorithm (Nei and Gojobori 1986) implemented in the
PAML package (Yang 1997). We repeated the Ks calculation using
other algorithms and found that the differences are small, systematic
biases that do not affect major conclusions. We calculated
4DTV values between gene pairs using in-house Perl scripts.
4DTV values are calculated for gene pairs having 10 fourfold
degenerate sites. Fourfold degenerate sites are codons of amino
acid residues G, A, T, P, V, and R, S, L. Raw 4DTV values are then
corrected for possible multiple transversions at the same site using
this formula:
4DTVcorrected = -1 2 × ln 1 - 2 × 4DTVuncorrected .
Finite mixture models of genome duplications based on Ks
distribution
The actual distribution of Ks between paleologs can be modeled
as mixtures of log-transformed exponentials and normals, representing
single gene duplications and whole genome duplications,
respectively. Since we have identified the paralogs that
show segmental correspondence with most of the single gene
duplications excluded, the actual distributions can be described
as mixtures of log-normal components that represent multiple
rounds of genome duplications, using the EMMIX software
(http://www.maths.uq.edu.au/gjm/emmix/emmix.html). Ks
values that are <0.005 were discarded to avoid fitting a component
to infinity (Cui et al. 2006), and the mixed populations were
modeled with one to five components. We selected one best mixture
model for each paleolog distribution on the basis of Bayesian
information criterion (BIC) and an additional restriction on the
mean/variance structure for Ks (Cui et al. 2006).
Acknowledgments
We appreciate financial support from the U.S. National Science
Foundation (MCB-0450260 to A.H.P. and J.E.B., DBI-0421803 to
R.M. and A.H.P.), the University of Hawaii to M.A., and the U.S.
Department of Defense W81XWH0520013 to M.A. We thank
Guojun Li for helpful discussions on the synteny deduction al-
gorithm.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990.
Basic local alignment search tool. J. Mol. Biol. 215: 403­410.
Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence
of the flowering plant Arabidopsis thaliana. Nature 408: 796­815.
Aury, J.M., Jaillon, O., Duret, L., Noel, B., Jubin, C., Porcel, B.M.,
Segurens, B., Daubin, V., Anthouard, V., Aiach, N., et al. 2006.
Global trends of whole-genome duplications revealed by the ciliate
Paramecium tetraurelia. Nature 444: 171­178.
Blanc, G. and Wolfe, K.H. 2004. Widespread paleopolyploidy in model
plant species inferred from age distributions of duplicate genes. Plant
Cell 16: 1667­1678.
Blanc, G., Hokamp, K., and Wolfe, K.H. 2003. A recent polyploidy
superimposed on older large-scale duplications in the Arabidopsis
genome. Genome Res. 13: 137­144.
Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F., Roskin,
K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., et al.
2004. Aligning multiple genomic sequences with the threaded
blockset aligner. Genome Res. 14: 708­715.
Bowers, J.E., Chapman, B.A., Rong, J., and Paterson, A.H. 2003.
Unravelling angiosperm genome evolution by phylogenetic analysis
of chromosomal duplication events. Nature 422: 433­438.
Camin, J.H. and Sokal, R.R. 1965. A method for deducing branching
sequences in phylogeny. Evolution Int. J. Org. Evolution 19: 311­326.
Cui, L., Wall, P.K., Leebens-Mack, J.H., Lindsay, B.G., Soltis, D.E., Doyle,
J.J., Soltis, P.S., Carlson, J.E., Arumuganathan, K., Barakat, A., et al.
2006. Widespread genome duplications throughout the history of
flowering plants. Genome Res. 16: 738­749.
Davies, T.J., Barraclough, T.G., Chase, M.W., Soltis, P.S., Soltis, D.E., and
Savolainen, V. 2004. Darwin's abominable mystery: Insights from a
supertree of the angiosperms. Proc. Natl. Acad. Sci. 101: 1904­1909.
Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient
algorithm for large-scale detection of protein families. Nucleic Acids
Res. 30: 1575­1584.
Paleopolyploidy and angiosperm genome evolution
Genome Research 1953
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from
Felsenstein, J. 2004. Inferring phylogenies. Sinauer, Sunderland, MA.
Freeling, M. and Thomas, B.C. 2006. Gene-balanced duplications, like
tetraploidy, provide predictable drive to increase morphological
complexity. Genome Res. 16: 805­814.
Gaut, B.S., Morton, B.R., McCaig, B.C., and Clegg, M.T. 1996.
Substitution rate comparisons between grasses and palms:
Synonymous rate differences at the nuclear gene Adh parallel rate
differences at the plastid gene rbcL. Proc. Natl. Acad. Sci. 93:
10274­10279.
Haas, B.J., Delcher, A.L., Wortman, J.R., and Salzberg, S.L. 2004.
DAGchainer: A tool for mining segmental genome duplications and
synteny. Bioinformatics 20: 3643­3646.
Hall, B.G. 2007. Phylogenetic trees made easy: A how-to manual, 3d ed.
Sinauer, Sunderland, MA.
Hampson, S.E., Gaut, B.S., and Baldi, P. 2005. Statistical detection of
chromosomal homology using shared-gene density alone.
Bioinformatics 21: 1339­1348.
Hittinger, C.T. and Carroll, S.B. 2007. Gene duplication and the
adaptive evolution of a classic genetic switch. Nature 449: 677­681.
Huala, E., Dickerman, A.W., Garcia-Hernandez, M., Weems, D., Reiser,
L., LaFond, F., Hanley, D., Kiphart, D., Zhuang, M., Huang, W., et al.
2001. The Arabidopsis Information Resource (TAIR): A
comprehensive database and web-based information retrieval,
analysis, and visualization system for a model plant. Nucleic Acids
Res. 29: 102­105.
International Rice Genome Sequencing Project. 2005. The map-based
sequence of the rice genome. Nature 436: 793­800.
Itoh, T., Tanaka, T., Barrero, R.A., Yamasaki, C., Fujii, Y., Hilton, P.B.,
Antonio, B.A., Aono, H., Apweiler, R., Bruskiewich, R., et al. 2007.
Curated genome annotation of Oryza sativa ssp. japonica and
comparative genome analysis with Arabidopsis thaliana. Genome Res.
17: 175­183.
Jaillon, O., Aury, J.M., Brunet, F., Petit, J.L., Stange-Thomann, N.,
Mauceli, E., Bouneau, L., Fischer, C., Ozouf-Costaz, C., Bernot, A., et
al. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis
reveals the early vertebrate proto-karyotype. Nature 431: 946­957.
Jaillon, O., Aury, J.M., Noel, B., Policriti, A., Clepet, C., Casagrande, A.,
Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., et al. 2007. The
grapevine genome sequence suggests ancestral hexaploidization in
major angiosperm phyla. Nature 449: 463­467.
Kellis, M., Birren, B.W., and Lander, E.S. 2004. Proof and evolutionary
analysis of ancient genome duplication in the yeast Saccharomyces
cerevisiae. Nature 428: 617­624.
Koch, M.A., Haubold, B., and Mitchell-Olds, T. 2000. Comparative
evolutionary analysis of chalcone synthase and alcohol
dehydrogenase loci in Arabidopsis, Arabis, and related genera
(Brassicaceae). Mol. Biol. Evol. 17: 1483­1498.
Ku, H.M., Vision, T., Liu, J., and Tanksley, S.D. 2000. Comparing
sequenced segments of the tomato and Arabidopsis genomes:
Large-scale duplication followed by selective gene loss creates a
network of synteny. Proc. Natl. Acad. Sci. 97: 9121­9126.
Kuittinen, H., de Haan, A.A., Vogl, C., Oikarinen, S., Leppala, J., Koch,
M., Mitchell-Olds, T., Langley, C.H., and Savolainen, O. 2004.
Comparing the linkage maps of the close relatives Arabidopsis lyrata
and A. thaliana. Genetics 168: 1575­1584.
Li, L., Stoeckert Jr., C.J., and Roos, D.S. 2003. OrthoMCL: Identification
of ortholog groups for eukaryotic genomes. Genome Res. 13:
2178­2189.
Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper,
M., and Van de Peer, Y. 2005. Modeling gene and genome
duplications in eukaryotes. Proc. Natl. Acad. Sci. 102: 5454­5459.
Miller, W., Rosenbloom, K., Hardison, R.C., Hou, M., Taylor, J., Raney,
B., Burhans, R., King, D.C., Baertsch, R., Blankenberg, D., et al. 2007.
28-Way vertebrate alignment and conservation track in the UCSC
Genome Browser. Genome Res. 17: 1797­1808.
Ming, R., Hou, S., Feng, Y., Yu, Q., Dionne-Laporte, A., Saw, J.H., Senin,
P., Wang, W., Ly, B.V., Lewis, K.L., et al. 2008. The draft genome of
the transgenic tropical fruit tree papaya (Carica papaya Linnaeus).
Nature 452: 991­996.
Nei, M. and Gojobori, T. 1986. Simple methods for estimating the
numbers of synonymous and nonsynonymous nucleotide
substitutions. Mol. Biol. Evol. 3: 418­426.
O'Brien, K.P., Remm, M., and Sonnhammer, E.L. 2005. Inparanoid: A
comprehensive database of eukaryotic orthologs. Nucleic Acids Res.
33: D476­D480.
Paterson, A.H., Chapman, B.A., Kissinger, J.C., Bowers, J.E., Feltus, F.A.,
and Estill, J.C. 2006. Many gene and domain families have
convergent fates following independent whole-genome duplication
events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends
Genet. 22: 597­602.
Posada, D. and Crandall, K.A. 1998. MODELTEST: Testing the model of
DNA substitution. Bioinformatics 14: 817­818.
Retief, J.D. 2000. Phylogenetic analysis using PHYLIP. Methods Mol. Biol.
132: 243­258.
Ronquist, F. and Huelsenbeck, J.P. 2003. MrBayes 3: Bayesian
phylogenetic inference under mixed models. Bioinformatics 19:
1572­1574.
Scannell, D.R., Frank, A.C., Conant, G.C., Byrne, K.P., Woolfit, M., and
Wolfe, K.H. 2007. Independent sorting-out of thousands of
duplicated gene pairs in two yeast species descended from a
whole-genome duplication. Proc. Natl. Acad. Sci. 104: 8397­8402.
Seoighe, C. and Gehring, C. 2004. Genome duplication led to highly
selective expansion of the Arabidopsis thaliana proteome. Trends
Genet. 20: 461­464.
Shuai, B., Reynaga-Pena, C.G., and Springer, P.S. 2002. The lateral organ
boundaries gene defines a novel, plant-specific gene family. Plant
Physiol. 129: 747­761.
Soltis, D.E., Soltis, P.S., Endress, P.K., and Chase, M.W. 2005. Phylogeny
and evolution of angiosperms. Sinauer Associates, Sunderland, MA.
Spillane, C., Schmid, K.J., Laoueille-Duprat, S., Pien, S.,
Escobar-Restrepo, J.M., Baroux, C., Gagliardini, V., Page, D.R., Wolfe,
K.H., and Grossniklaus, U. 2007. Positive Darwinian selection at the
imprinted MEDEA locus in plants. Nature 448: 349­352.
Sterck, L., Rombauts, S., Vandepoele, K., Rouze, P., and Van de Peer, Y.
2007. How many genes are there in plants (... and why are they
there)? Curr. Opin. Plant Biol. 10: 199­203.
Suyama, M., Torrents, D., and Bork, P. 2006. PAL2NAL: Robust
conversion of protein sequence alignments into the corresponding
codon alignments. Nucleic Acids Res. 34: W609­W612.
Tang, H., Bowers, J., Wang, X., Ming, R., Alam, M., and Paterson, A.
2008. Synteny and collinearity in plant genomes. Science 320:
486­488.
Thomas, B.C., Pedersen, B., and Freeling, M. 2006. Following tetraploidy
in an Arabidopsis ancestor, genes were removed preferentially from
one homeolog leaving clusters enriched in dose-sensitive genes.
Genome Res. 16: 934­946.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W:
Improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res. 22:
4673­4680.
Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I.,
Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., Salamov, A., et al.
2006. The genome of black cottonwood, Populus trichocarpa (Torr. &
Gray). Science 313: 1596­1604.
Van de Peer, Y. 2004. Computational approaches to unveiling ancient
genome duplications. Nat. Rev. Genet. 5: 752­763.
Vandepoele, K., Saeys, Y., Simillion, C., Raes, J., and Van De Peer, Y.
2002. The automatic detection of homologous regions (ADHoRe)
and its application to microcolinearity between Arabidopsis and rice.
Genome Res. 12: 1792­1801.
Wang, X., Shi, X., Li, Z., Zhu, Q., Kong, L., Tang, W., Ge, S., and Luo, J.
2006. Statistical inference of chromosomal homology based on gene
colinearity and applications to Arabidopsis and rice. BMC
Bioinformatics 7: 447. doi: 10.1186/1471-2105-7-447.
Wang, X., Tang, H., Bowers, J.E., Feltus, F.A., and Paterson, A.H. 2007.
Extensive concerted evolution of rice paralogs and the road to
regaining independence. Genetics 177: 1753­1763.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by
maximum likelihood. Comput. Appl. Biosci. 13: 555­556.
Zhang, Y., Xu, G.H., Guo, X.Y., and Fan, L.J. 2005. Two ancient rounds
of polyploidy in rice genome. J. Zhejiang Univ. Sci. B 6: 87­90.
Received May 15, 2008; accepted in revised form September 2, 2008.
Tang et al.
1954 Genome Research
www.genome.org
Cold Spring Harbor Laboratory Presson June 13, 2009 - Published bygenome.cshlp.orgDownloaded from