Widespread genome duplications throughout the history of flowering plants
Liying Cui, P. Kerr Wall, James H. Leebens-Mack, Bruce G. Lindsay, Douglas E. Soltis, Jeff J. Doyle, Pamela S. Soltis, John E. Carlson, Kathiravetpilla Arumuganathan, Abdelali Barakat, Victor A. Albert, Hong Ma and Claude W. dePamphilis
Genome Res. published online May 15, 2006; doi:10.1101/gr.4825606
Supplementary data: "Supplemental Research Data" http://www.genome.org/cgi/content/full/gr.4825606/DC1

90% of paralogs. Moreover, the distributions for paralogous gene pairs sampled from two tissue sources (floral vs. nonfloral organs) were similar (Fig. 3B,C) and in agreement with previous analyses based on all duplicate gene pairs in this species (median Ks = 0.277 and 0.632) (Schlueter et al. 2004). Together, our tests found strong signals of deviation from the null model, and as expected, mixture model analyses suggested ancient polyploidy events in Arabidopsis, Glycine, and Solanum. Taken together, these results suggest that unbiased Ks distributions can be obtained from as few as 6000 unigenes sampled from complex cDNA libraries derived from developing floral organs.

Figure 2. Ks distribution from a sample of Arabidopsis unigenes and the diagnostic test according to the constant birth–death model (null model). (A) Ks estimates from four methods show strong agreement. (ML) Maximum likelihood method by Goldman and Yang; (NG) Nei-Gojobori method; (mNG) modified Nei-Gojobori method; (YN) Yang and Nielsen method. (B) Ks distributions for paralogs from four replicate unigene samples of 6000 sequences each. These sample sizes are comparable to the unigenes available for the species sequenced in this study. (C) The density plot of observed Ks distribution and simulated data based on the null model with parameter = 0.67. (D) The Q-Q plot of observed versus expected Ks values shows the poor fit of the null hypothesis that gene birth and death rates are constant (P < 0.0001).

Table 2. Summary of EST data sets and paralogous pairs identified in this study

Scientific name            ESTs     Unigenes  Pairs with Ks < 2  Source
Arabidopsis thaliana                6000a     205                dbEST
Glycine max                10,046   6240      125                dbEST
Solanum lycopersicum       10,028   5303      143                dbEST
Eschscholzia californica   9079     5713      178                PGN
Acorus americanus          7484     4663      149                PGN
Liriodendron tulipifera    9531     6520      92                 PGN
Persea americana           8735     6183      196                PGN
Saruma henryi              10,273   6293      184                PGN
Nuphar advena              8442     6205      138                PGN
Amborella trichopoda       8629     6099      69                 PGN
Pinus taeda                         6000b     276                PlantGDB
Pinus pinaster                      6000c     259                PlantGDB
Welwitschia mirabilis      9776     6048      157                PGN

a Sampled from 6369 unigenes.
b Sampled from 52,527 unigenes.
c Sampled from 8076 unigenes.

Cui et al., Genome Research, www.genome.org. Downloaded from www.genome.org on May 17, 2006.

Ancient polyploidy in a basal eudicot

Eschscholzia californica (California poppy, Papaveraceae) is a member of Ranunculales, the sister lineage to all other eudicots (Soltis et al. 2000; Zanis et al. 2002; Borsch et al. 2003). Analysis of the Ks distribution of 178 pairs of Eschscholzia paralogs rejected the constant birth and death model (P < 0.0001), and two components in the distribution were identified by the mixture model. The second component dominated the distribution, with 89% of the duplicate pairs (Fig. 3D), providing the first strong evidence of probable ancient genome duplication in a basal eudicot. Phylogenetic analyses of duplicated AGAMOUS and AP3 homologs (Kramer et al. 1998; Kramer and Irish 1999; Zahn et al. 2005a) have suggested that this duplication event occurred after the split between Ranunculales and core eudicots. Thus, the genome-wide duplication evident in the Eschscholzia paralogous pairs was probably independent of the genome duplications that have been inferred from analyses of the Arabidopsis genome (Vision et al. 2000; Bowers et al. 2003; Maere et al. 2005).

Basal monocot

Acorus americanus (Acoraceae, Acorales) represents the sister lineage to all other monocots (Duvall et al. 1993; Soltis et al. 2000, 2002; Zanis et al. 2002; Borsch et al. 2003; Hilu et al. 2003). Three components were identified in the paralogous pairs by the mixture model approach.
The second component, accounting for 33% of all duplicates, was shown as a sharp peak in the Ks distribution, while the third component, containing 65% of the duplicates, appeared as a broader peak (Fig. 3E). Based on the distinct modes observed in the raw Ks distribution, we hypothesize that the second and third components estimated in the mixture model represent two distinct large-scale duplication events. This hypothesis will be tested in future phylogenetic analyses of well-sampled gene families.

Magnoliids

Both shared and lineage-specific genome duplications were inferred from analyses of unigenes from three magnoliid species: Liriodendron tulipifera (Magnoliaceae, Magnoliales), Persea americana (Lauraceae, Laurales), and Saruma henryi (Aristolochiaceae, Piperales). A total of 92 paralogous pairs was detected in the Liriodendron unigene set. The constant birth–death model was rejected (P < 0.001), and a mixture of two components was identified in the Ks distribution, with the second component being dominant (Fig. 4A). The null birth–death model was also rejected in the P. americana (avocado) analysis (P < 0.0001) with 196 paralogous gene pairs. The optimal mixture model also included two components very similar to those seen for Liriodendron (Fig. 4B; Table 3).

To determine whether the duplication events inferred from the Ks distributions of Liriodendron and Persea represented events in a common ancestor, we first computed the median Ks of putatively orthologous gene pairs (408 pairs identified as reciprocal best hits in BLAST searches) and compared the median Ks for orthologs with Ks values for paralogous pairs within each species. The Ks distribution of putative ortholog pairs showed a single major component (median = 0.8057, variance = 0.0858) (Fig. 4F), inferred to be slightly older than the probable genome duplication observed in Persea (median = 0.6464, variance = 0.1197; P < 0.0001, Wilcoxon test).
The timing of the duplication event inferred from the Liriodendron Ks distribution (median = 0.7616, variance = 0.1328) relative to the divergence of the Persea and Liriodendron lineages was ambiguous (P = 0.35), and direct comparison of the Persea and Liriodendron Ks distributions may have been confounded by unequal substitution rates.

Table 3. Mixture model estimates for Ks distributions in each species

Scientific name                    n    P  lnL      BIC     Median  Variance  Proportion
Arabidopsis thaliana               202  2  162.498  351.54  0.2889  0.0473    0.21
                                                            0.751   0.0777    0.79
Glycine max                        123  2  147.358  318.78  0.1873  0.0398    0.29
                                                            0.6705  0.1066    0.71
Solanum lycopersicum (floral)      139  2  118.607  261.89  0.0643  0.0066    0.09
                                                            0.7894  0.1021    0.91
Solanum lycopersicum (nonfloral)   119  2  122.933  269.76  0.1857  0.0547    0.15
                                                            0.7885  0.1425    0.85
Eschscholzia californica           178  2  161.652  349.21  0.0871  0.0043    0.11
                                                            0.7098  0.087     0.89
Acorus americanus                  139  3  103.568  246.61  0.0118  0.001     0.01
                                                            0.455   0.0046    0.33
                                                            0.5813  0.1309    0.65
Liriodendron tulipifera            87   2  94.046   210.42  0.1005  0.0121    0.14
                                                            0.7616  0.1328    0.86
Persea americana                   186  2  196.998  420.12  0.0234  0.0004    0.07
                                                            0.6464  0.1197    0.93
Saruma henryi                      146  2  162.789  350.5   0.0913  0.0168    0.2
                                                            0.7927  0.1066    0.8
Nuphar advena                      134  3  159.416  358.02  0.1746  0.0461    0.37
                                                            0.4291  0.0202    0.56
                                                            1.3273  0.0084    0.07
Amborella trichopoda               49   1  80.676   169.14  0.2698  0.1147    1
Pinus taeda                        227  1  405.77   822.39  0.0839  0.0147    1
Pinus pinaster                     240  1  373.135  757.23  0.2499  0.0819    1
Welwitschia mirabilis              132  2  181.128  386.67  0.1139  0.0271    0.35
                                                            0.9519  0.1374    0.65

Initial tests against the null model (no genome duplication) were conducted, then a mixture analysis was applied to each species. The final mixture model was selected according to the Bayesian Information Criterion (BIC) and restriction on the mean/variance structure for Ks (see Methods). (n) Sample size; (P) number of mixture components; (lnL) log likelihood for the mixture model. For each mixture model, the proportions for each component (subpopulation) sum to 1.

To account for possible variation in synonymous substitution rates between the Persea and Liriodendron lineages, we aligned putatively orthologous genes from Liriodendron, Persea, and Saruma and estimated Ks values for each lineage on a phylogeny. We examined 19 putative orthologous gene sets in the three species with alignments of at least 400 bp for all taxa (see Methods; Supplemental Table S2) and found that the synonymous substitution rate on the lineage leading to Liriodendron was slower on average than the rate on the lineage leading to Persea. For example, in the tree for the orthologous set shown in Figure 4G, the branch length (in Ks units) for the branch leading to Persea is 1.31 times that leading to Liriodendron. The ratio of synonymous substitutions on the Persea branch relative to the Liriodendron branch ranged from 0.86 to 2.68, and the ratio was greater than one in 16 of 19 cases (Supplemental Table S2). When Liriodendron paralog Ks values were multiplied by the median branch-length ratio, 1.29, the peak in the scaled Liriodendron Ks distributions matched an older, but nonsignificant peak in the Persea Ks distribution (Fig. 4E).

Taken together, these analyses suggest that the prominent peak in the Liriodendron Ks distribution (median = 0.82) represents a duplication event in the common ancestral genome of Magnoliales and Laurales that had not been identified as a distinct component in the mixture model for the Persea Ks distribution. In line with the comparison of Ks values for Persea paralogs and putative Liriodendron–Persea orthologs, we interpret the dominant peak in the Persea Ks distribution to represent a genome-scale duplication event that occurred after the divergence of Magnoliales and Laurales.
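The rate correction described above amounts to multiplying each Liriodendron paralog Ks value by the median Persea/Liriodendron branch-length ratio. The following Python sketch illustrates that step only. The function name and all input values are hypothetical; the real per-gene ratios are in Supplemental Table S2 and are not reproduced here.

```python
from statistics import median


def scale_paralog_ks(ks_values, branch_ratios):
    """Scale Ks values from the slower lineage by the median branch-length ratio.

    Returns the scaling factor and the list of scaled Ks values.
    """
    factor = median(branch_ratios)
    return factor, [ks * factor for ks in ks_values]


# Hypothetical per-gene ratios (Persea branch / Liriodendron branch),
# chosen only so that their median equals the 1.29 quoted in the text.
example_ratios = [0.86, 1.10, 1.29, 1.31, 2.68]

# Hypothetical Liriodendron paralog Ks values.
example_ks = [0.10, 0.55, 0.76, 0.90]

factor, scaled = scale_paralog_ks(example_ks, example_ratios)
print("scaling factor:", factor)
print("scaled Ks:", [round(v, 4) for v in scaled])
```

Using the median rather than the mean keeps the factor robust to outlying genes such as the one with ratio 2.68.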
This hypothesis needs to be tested with additional data.

Saruma henryi is a member of Piperales, which (with Canellales) is sister to the Magnoliales/Laurales clade (Soltis et al. 2000, 2002; Zanis et al. 2002; Borsch et al. 2003). The Ks distribution of Saruma paralogs showed a distinct peak with median Ks = 0.7927 (Fig. 4C; Table 3). This is lower than the median Ks for 202 Saruma–Liriodendron ortholog pairs (0.9555, P = 0.0001) and the median Ks for 254 putative Saruma–Persea ortholog pairs (1.0121, P < 0.0001) (Fig. 4F). We therefore surmise that the peak in the Ks distribution of Saruma paralogous pairs represents a large-scale duplication in Piperales after divergence from the Magnoliales and Laurales lineages.

Basal-most angiosperms

Amborella trichopoda (Amborellaceae) and the water lilies (Nymphaeales) are either successive sister lineages to all other extant angiosperms, or together form a clade that is sister to the rest of the angiosperms (Zanis et al. 2002; Stefanovic et al. 2004; Leebens-Mack et al. 2005). The Ks distribution for a total of 69 Amborella paralogous pairs appeared to follow an exponential distribution, but the uniform birth–death process was rejected (P < 0.01; Fig. 5A). However, the mixture model analysis identified only one component containing all of the gene pairs (Table 3).

Nymphaeales are represented by Nuphar advena. A total of 138 paralogous pairs was identified, and the resulting Ks distribution did not fit the constant birth–death model (P < 0.01). Three mixture components were estimated from the Ks distribution. The second component, accounting for 56% of the paralogous pairs, provided strong evidence for ancient polyploidy in the history of the Nuphar genome (Fig. 5B). The third component, with a median Ks of 1.3273, may represent the oldest genome duplication to be detected in analyses of angiosperm Ks distributions.
The median Ks for the third component was not distinguishable from the median Ks value for putative Amborella–Nuphar orthologs (Fig. 5C) (median Ks[orthologs] = 1.24, variance = 0.1918, based on 113 putatively orthologous sequence pairs; P = 0.05, two-sample t-test on the log Ks[orthologs] and log Ks[third component of Nuphar paralogs]). Therefore, the third component in the Nuphar Ks distribution may correspond to a polyploidy event that occurred at approximately the time of divergence between the Amborella and Nuphar lineages (see Discussion).

Figure 3. Ks distributions of paralogs in selected angiosperm species, with fitted densities from mixture model analysis, suggest paleopolyploidy in eudicots and monocots. Each fitted line indicates a subpopulation in the mixture. The first (leftmost) component corresponds to paralogs from background gene duplications; other peaks indicate estimated median Ks for ancient duplications. (A) Glycine max (soybean). (B,C) Solanum lycopersicum (tomato), data from floral tissue (B) and nonfloral tissue (C). (D) A basal eudicot, Eschscholzia californica (California poppy). (E) A basal monocot, Acorus americanus.

Gymnosperms

We obtained 52,527 unigenes for Pinus taeda (loblolly pine) from PlantGDB (Dong et al. 2004), and a random sample of 6000 unigenes was drawn to match the sample size for other species we investigated. The Ks distribution showed a clear monotonic decay of paralogs with increasing age and no detectable sign of genome duplication in recent history (P = 0.16) (Fig. 5D). The frequency distribution for all paralogous pairs was essentially identical. The analysis of Pinus pinaster yielded a similar exponential distribution (Table 3).

The constant birth–death model was rejected for Welwitschia (P < 0.01), and a mixture analysis of the Ks distribution identified two components (Fig. 5E).
The second component, corresponding to the heavy right-hand tail of the distribution, may represent one or more ancient duplication events, or a reduced rate of gene death for older duplicates.

Discussion

In this paper, we introduce a model-based statistical test that accounts for estimation error in Ks values in terms of deviation from a constant rate of gene birth and death. This represents a refinement of previous studies using Ks distributions, which have yielded significant insights into genome duplications (Force et al. 1999; Lynch and Conery 2000; Blanc and Wolfe 2004; Schlueter et al. 2004). The birth–death model developed here for duplicated genes is a natural extension of stochastic birth-and-death models that have been widely used in population and phylogenetic approaches to studies of gene family evolution (Karev et al. 2004). Simulations based on this model have allowed us to investigate how specified death rates and duplication times result in Ks distributions with (or without) secondary peaks or heavy tails (e.g., Fig. 1). The model can be extended to incorporate variable rates of gene birth or death over time, and in the extreme, an instant burst of gene birth corresponding to a whole-genome duplication. Although our results could not exclude partial and segmental duplication events, the birth–death model was validated with genomes with known duplication histories where detection of whole-genome events was expected.

We found that three major factors influence the frequency and observed divergence of paralogous pairs arising from genome-wide duplications. The time since the duplication event, the rate of gene death, and the background rate of gene birth all influence observed Ks distributions. Very recent genome duplication events are associated with Ks values for resulting paralogous pairs that are indistinguishable from those of background single-gene duplications using EST data.
For example, polyploidy is not clearly evident in the Ks distribution for hexaploid wheat because there has been little divergence among the parental or homeologous gene copies, and the range of divergence for allelic variants was not distinct from that of paralogs arising from recent gene duplications (Blanc and Wolfe 2004). At the same time, evidence of very ancient genome duplications is eroded as synonymous substitutions reach saturation and variance in Ks increases. This may be evident in Ks plots for wheat, maize, rice, and barley, for which evidence for a genome duplication event some 50–60 million years ago (Mya) in the common ancestor of all major grain lineages has been obscured (Blanc and Wolfe 2004; Paterson et al. 2004a). Detection of very old duplication events in Ks distributions is especially difficult in species with high synonymous substitution rates. Conversely, evidence for the oldest detectable genome-wide duplications will be found in Ks distributions for species with the slowest substitution rates (see below).

Concurrent expansion of a few gene families could lead to moderate deviations from our null model. This is especially true if ancient duplication events are overrepresented in sets of sampled paralogous pairs, or if major adaptive radiations of individual gene families preceded or accompanied the diversifications of the organismal lineages under study. In this study, we avoided over-counting of ancient gene duplications by constraining genes to be included in only one paralogous pair. Our analysis of duplicated Arabidopsis genes verified that this approach produced Ks distributions similar to those of previous studies that implemented more elaborate corrections for gene family expansions (Maere et al. 2005). Moreover, sampled paralogous genes were not particularly biased toward large gene families.
Whereas most sampled duplicate genes belonged to the housekeeping functional categories, such as protein synthesis, proteolysis, and energy metabolism (Supplemental Table S1), none of the duplicate gene sets was dominated by a single gene family. Several transcription factor families were also identified in our paralog pairs, but again, no family accounted for more than a few percent of the duplicate gene pairs.

Figure 4. Ks distributions of paralogs and orthologs among magnoliids suggest independent duplications and possibly shared genome duplication events in Laurales (Persea) and Magnoliales (Liriodendron). (A,B,C) The Ks distributions for (A) Liriodendron, (B) Persea, and (C) Saruma, with fitted lines based on the mixture model analysis. (D) The Ks distribution for Liriodendron and Persea, without scaling for rate differences between lineages. (E) Ks distribution for paralogs in Liriodendron after rate calibration (adj = adjusted), compared with that of Persea, suggesting recent independent duplication and older shared genome-scale duplications. (F) Ks distribution for orthologs of two magnoliid species. (Ltu) Liriodendron; (Pam) Persea; (She) Saruma. (G) Phylogeny of one representative orthologous gene set used for relative rate estimates. The branch lengths show the estimated relative rates of synonymous evolution in respective species.

Our results for Persea (Lauraceae) and Liriodendron (Magnoliaceae) corroborate previous evidence of ancient polyploidy from isozyme studies (Soltis and Soltis 1990). Soltis and Soltis (1990) found that 25%–29% of the loci investigated were duplicated in both families, and hence could have arisen via polyploidy. All members of Magnoliaceae examined shared the same isozyme duplications (PGI, TPI, 6PGD), while the species of Lauraceae shared a similar suite of isozyme duplications (PGM, TPI, 6PGD, GDH).
These were interpreted as evidence for paleopolyploid events occurring very early in the evolutionary history of Magnoliaceae and Lauraceae. The Persea and Liriodendron paralogous genes suggest polyploidy in a common ancestor at least 100 Mya (Bell et al. 2005) followed by a second round of polyploidy in the Persea lineage (Fig. 4E), but this hypothesis must be tested with analyses of additional gene family phylogenies. If this scenario is correct, the duplicated isozyme loci observed in the Magnoliaceae and Lauraceae may have arisen from a polyploidy event that predated the separation of the two families (cf. Brysting and Borgen 2000).

Over time, nucleotide substitutions can become saturated, and therefore lineages with slow synonymous substitution rates will provide a deeper view into genome history relative to lineages with faster substitution rates. It is estimated that the synonymous substitution rate in palm (2.61 × 10⁻⁹ synonymous substitutions/year) (Gaut et al. 1996) is only about half that reported for grasses, eudicots (Lynch and Conery 2000), and grass–eudicot comparisons (Wolfe et al. 1987). We infer a similarly slow substitution rate for other basal angiosperms based on the Magnoliales–Laurales divergence as a calibration point. We estimated a synonymous site divergence of Ks = 0.7 for Liriodendron and Persea ortholog pairs (Fig. 4F). Using a divergence date estimate of 116 Mya for the Magnoliales–Laurales split (Bell et al. 2005), we estimate an average synonymous substitution rate of 3.02 × 10⁻⁹ synonymous substitutions/year. The low substitution rate in Liriodendron and Persea may be explained in part by their longer generation times (these lineages are trees and shrubs) relative to model eudicot and grass species.

We found that the median for the oldest component in the Nuphar Ks distribution is close to the median Ks for putative Amborella–Nuphar orthologs (median Ks = 1.24) (Fig. 5C).
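The magnoliid rate quoted above follows from dividing the ortholog divergence by twice the divergence time, since substitutions accumulate independently along both lineages after the split. A minimal worked example in Python (the function name is illustrative, not taken from the paper's scripts):

```python
def substitution_rate(ks, divergence_time_years):
    """Per-lineage synonymous substitution rate (per site per year).

    Ks between two orthologs reflects substitutions on both branches,
    hence the factor of 2 in the denominator.
    """
    return ks / (2.0 * divergence_time_years)


# Values from the text: Ks = 0.7 for Liriodendron-Persea orthologs,
# Magnoliales-Laurales split dated at 116 Mya.
rate = substitution_rate(0.7, 116e6)
print(f"{rate:.2e}")  # 3.02e-09
```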
This level of divergence is compatible with the synonymous divergence for the very early duplication in Arabidopsis (i.e., duplication) (Bowers et al. 2003; De Bodt et al. 2005; Maere et al. 2005). Direct dating of the early Nuphar peak based on the Ks data is challenging because of uncertainty in the branching relationships between Amborella, Nuphar, and the rest of the angiosperms, and the possibility of additional rate variation as was seen for magnoliids.

We adopted two approaches to date the earliest event in Nuphar. First, using the median Ks Amborella–Nuphar ortholog divergence of 1.24 and a calibration range of 134–165 Mya (Leebens-Mack et al. 2005) gives a rate of 4.66–3.79 × 10⁻⁹ substitutions per silent site per year. Therefore, Ks = 1.33 (the early Nuphar duplication event) would predict an age range between 143 and 173 Mya for the split between these two lineages. An alternative calculation, using the magnoliid calibration of 3.02 × 10⁻⁹ substitutions per silent site per year, leads to an estimate of 220 Mya for the divergence of lineages leading to Amborella and Nuphar.

This range of age estimates supports two alternative interpretations of the Nuphar and Amborella paralog Ks distributions. The third component in the Nuphar Ks distribution may represent polyploidy in a common ancestor of all angiosperms (Fig. 6), in agreement with recent analyses of MADS-box gene families (Kim et al. 2004; Buzgo et al. 2005; Zahn et al. 2005a). This scenario would require that evidence of ancient polyploidy has been sufficiently eroded as to be undetected in analyses of EST samples from Amborella and various other angiosperm species owing to gene death and/or saturation of synonymous substitutions as discussed above. For example, the nonsignificant peaks around Ks = 1.5 in the Liriodendron and Persea Ks distributions (Fig. 4A,B) may provide weak evidence of polyploidy early in angiosperm history.
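The age estimates in the dating paragraph above invert the rate relationship: an age is the Ks value divided by twice the per-lineage substitution rate. The sketch below, with an illustrative function name, reproduces the 220 Mya figure from the magnoliid calibration and the lower bound of the first approach. It makes no attempt to reproduce every rounded value in the text, since those depend on the unrounded inputs used by the authors.

```python
def age_in_mya(ks, rate_per_site_per_year):
    """Age (million years) implied by a Ks value and a per-lineage rate."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


# Magnoliid calibration from the text: 3.02e-9 substitutions per silent site per year.
print(round(age_in_mya(1.33, 3.02e-9)))  # 220

# Faster end of the Amborella-Nuphar calibration range from the text: 4.66e-9.
print(round(age_in_mya(1.33, 4.66e-9)))  # 143
```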
Alternatively, the earliest duplication peak detected in the Nuphar analysis may trace back to a genome duplication in the common ancestor of Nuphar and all extant angiosperm lineages other than Amborella (Fig. 6). This scenario would be consistent with the hypothesis that Amborella is sister to all other extant angiosperms (e.g., solid line in Fig. 6), and with the extremely low proportion of duplicate genes found in the Amborella unigene set. This scenario also would narrow the timing of a genome duplication to ~10 Myr separating the branching points for Amborella and all other extant angiosperm lineages (Leebens-Mack et al. 2005). As discussed above, however, there have been instances where known genome duplication events have not been detected in Ks distributions (Fig. 1; Blanc and Wolfe 2004; Paterson et al. 2004b); thus, lack of evidence for ancient polyploidy in the Amborella Ks distribution does not exclude the possibility of polyploidy in an ancestral genome. More sequence data, and ultimately whole genome sequences, will be needed from Amborella, water lilies, and other early branching angiosperm species to select among these alternative scenarios for polyploid origins of angiosperms.

Figure 5. Ks distributions suggest possible genome duplications in basal angiosperms, and no evidence for genome duplication events in Amborella and some gymnosperm species. (A) Ks distribution in Amborella, a basal-most angiosperm. No significant large-scale duplication is detected. (B) Three distinct components in the Ks distribution for Nuphar, also a basal-most angiosperm, suggest at least two large-scale genome duplications. (C) Ks distribution for putative orthologs between Amborella and Nuphar. (D) Pinus taeda (loblolly pine) paralogous pairs follow the null model (see Methods). (E) Ks distribution for paralogs in a gymnosperm, Welwitschia.
While genomic sequences have revealed evidence of polyploidy in Poaceae and core eudicots, the secondary peaks found in paralog Ks distributions for representatives of virtually all major angiosperm lineages support the notion that genome duplications are common in angiosperm history and gene birth and death are important processes in plant evolution (Lynch and Conery 2000). The evidence now supports the hypothesis proposed initially decades ago by Stebbins (1950) that angiosperms have experienced repeated rounds of polyploidization throughout their evolutionary history. Many questions follow: How many polyploidy events separate different plant lineages? What is the typical fate of genes generated through these duplication events? And perhaps most intriguingly, have polyploidy events been important engines of angiosperm diversification? Genome-scale sequencing of phylogenetically crucial angiosperm species would provide the data necessary to directly test whether the rapid diversification of flowering plants following their origin (Darwin 1903) was associated with one or more polyploidy events.

Methods

EST sequencing and assembly

EST sequences from floral cDNA libraries of seven species (Amborella trichopoda, yellow water lily [Nuphar advena], avocado [Persea americana], yellow-poplar [Liriodendron tulipifera], wild ginger [Saruma henryi], sweet flag [Acorus americanus], and California poppy [Eschscholzia californica]) are available through the Plant Genome Network (http://www.pgn.cornell.edu). cDNA library construction, EST sequencing, and assembly were described previously (Albert et al. 2005).

Public EST sets from selected libraries for Arabidopsis thaliana, soybean (Glycine max, Williams 82), and tomato (Solanum lycopersicum, cultivar TA496) were downloaded from the GenBank dbEST section, trimmed using seqclean, and assembled using CAP3 with the percent identity parameter P = 90 and overlap length 40 bp. A.
thaliana ESTs were from four libraries (root, flower, green silique, and 2- to 6-wk above-ground organs). To minimize the allelic variations in the EST sequence collection, the unigenes were mapped to the Arabidopsis genome, and redundant unigenes matching the same genomic locus were discarded. Only the sequences that matched the protein-coding regions were retained. From this screened unigene set, we drew replicate samples with 6000 unigenes in each sample. The sample size of 6000 Arabidopsis unigenes approximates the number of unigenes from new EST data sets we analyzed. To see if library sources influence estimates, we analyzed two samples of tomato ESTs, one from floral cDNA libraries and one from vegetative cDNA libraries. The soybean ESTs were sampled from cDNA libraries of flower, young seedling, root, and other vegetative organs. Unigenes for the gymnosperms Pinus taeda and Pinus pinaster, which had been built from public ESTs from all libraries, were downloaded from PlantGDB (Dong et al. 2004). For each species, we sampled 6000 unigenes for Ks analysis.

Ks calculation for paralogs and orthologs

Paralogous pairs of sequences were identified from best reciprocal matches in all-by-all BLASTN searches. For data sets with trace files, we discarded bases with Phred (Ewing and Green 1998; Ewing et al. 1998) quality values lower than 20. Only sequence pairs with alignment lengths >300 bp were used for Ks calculations. Translated sequences of unigenes generated by ESTScan (Iseli et al. 1999) were aligned using MUSCLE v3.3 (Edgar 2004). Nucleotide sequences were then forced to fit the amino acid alignments. The Ks value for each sequence pair was calculated using the Goldman and Yang maximum likelihood method (Goldman and Yang 1994) implemented in codeml with the F3×4 model (Yang 1997).
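The reciprocal-best-match step for identifying paralogous pairs can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the all-by-all search results have already been parsed into (query, subject, score) tuples, and the sequence identifiers and scores are hypothetical. The alignment-length filter described above is not included.

```python
def best_hits(hits):
    """Map each query to its highest-scoring non-self subject."""
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-matches from the all-by-all search
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_pairs(hits):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    top = best_hits(hits)
    pairs = set()
    for query, subject in top.items():
        if top.get(subject) == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)


# Hypothetical parsed search results: (query, subject, score).
example_hits = [
    ("u1", "u2", 500), ("u2", "u1", 480),
    ("u3", "u1", 300), ("u1", "u3", 310),
    ("u3", "u4", 120), ("u4", "u3", 150),
]
print(reciprocal_best_pairs(example_hits))  # [('u1', 'u2')]
```

Because each sequence has a single best hit, a sequence can appear in at most one reciprocal pair, which is in line with the constraint stated in the Discussion that genes be included in only one paralogous pair.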
In order to assess whether the shape of Ks distributions was dependent on the estimation procedure, the Nei-Gojobori method, the modified Nei-Gojobori method, and the YN00 method (Yang and Nielsen 2000) were also applied on the Arabidopsis set. The Ks frequency in each interval size of 0.05 within the range [0, 2.0] was plotted.

Figure 6. Phylogenetic summary of paleopolyploidy events estimated by the mixture model approach and their distribution among angiosperm and gymnosperm lineages. Scaled graph in center with Xs corresponding to median Ks of pairs from background gene duplications, while small ovals indicate the median Ks of possible concentrated duplications in the history of particular lineages. The phylogenetic tree at left shows the likely placement of detected genome-scale duplications. Uncertainty in phylogenetic timing of what may be a single duplication event at the base of the angiosperms is indicated with a wide oval that covers possible branching points compatible with the Ks evidence. Hollow ovals indicate duplications identified in previous studies using paralogous genes or genomic data from those lineages.

The age distribution of paralogs under a constant birth–death model (the null model)

We modeled the birth and death of paralogs formed by gene duplications under a constant-rate birth–death model in order to test whether an observed frequency distribution of Ks values indicates deviation from this process. The duplicate genes are generated by a Poisson process at rate $\lambda$, and the number of duplicate pairs decreases by age at an exponential rate $\mu$. We can estimate the age distribution of surviving paralogs (survivors), total $N$, by considering the process as sampling gene birth over time $[0, t]$, and decide if each birth was a survivor. The distribution for the number of survivors of age $t$ is

$$N(t) \sim \mathrm{Po}\left(\lambda \int_0^t \exp(-\mu s)\, ds\right) = \mathrm{Po}\big(\theta F(t)\big),$$

where $\theta = \lambda/\mu$, and $F(t) = 1 - \exp(-\mu t)$, the cumulative density function of exponential($\mu$). From this we deduce that the population size $N(\infty) \sim \mathrm{Po}(\theta)$. Furthermore, the survivors' age distribution is an empirical distribution of a sample of exponentially distributed random variables, generated with the parameter $\mu$.

To obtain an estimate of the true age, we must consider the error of Ks with respect to the true age of paralogs. If the true age is $T$, then we can calculate Ks (with error) as Ks $= T + (s \mid t)\, z$, where $s \mid t$ is the standard error for Ks at $T = t$, and $z$ is a standard normal random variable. The error can be estimated from the empirical standard error given by the PAML software. The mean of $s$ is expected to correlate with the time $t$, since older Ks estimates have larger variances. The conditional distribution of $s$ can be approximated by exponential($2/t$). The maximum likelihood estimate of the parameter $\mu$ from the data was obtained using a grid-based method, and a simulated sample under the null model was compared to the observed using a $\chi^2$ test.

A quantile–quantile plot (Q-Q plot) was used to visualize the difference between observed data and a simulated data set according to the null model. A strong deviation from the 45° line in the Q-Q plot suggests that the two distributions differ, and a bootstrapped Kolmogorov-Smirnov test (http://sekhon.polisci.berkeley.edu/matching/ks.boot.html) was applied to compare the observed and expected Ks distributions. The modeling and simulation scripts are available as Supplemental data.
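The null-model simulation described above can be sketched in Python as follows. This is not the authors' R script, and it rests on several assumptions: exponential(2/t) is read as an exponential distribution with rate 2/t (mean t/2), simulated values are kept only if they fall in (0, 2) to mirror the Ks < 2 cutoff, and the value 0.67 borrowed from the Figure 2 caption is treated as the death-rate parameter. A plain two-sample Kolmogorov-Smirnov statistic stands in for the bootstrapped test, and the histogram uses the 0.05-wide bins over [0, 2.0] mentioned in the Methods.

```python
import random


def simulate_null_ks(n_pairs, death_rate, rng, ks_max=2.0):
    """Simulate Ks values for surviving paralogs under constant birth and death."""
    values = []
    while len(values) < n_pairs:
        age = rng.expovariate(death_rate)      # true age T ~ Exponential(death_rate)
        if age <= 0.0:
            continue                           # guard against a zero draw
        se = rng.expovariate(2.0 / age)        # assumed: rate 2/t, so mean t/2
        ks = age + se * rng.gauss(0.0, 1.0)    # Ks = T + s * z
        if 0.0 < ks < ks_max:                  # assumed cutoff, mirrors Ks < 2
            values.append(ks)
    return values


def histogram(values, width=0.05, upper=2.0):
    """Count values in consecutive bins of the given width over [0, upper)."""
    n_bins = int(round(upper / width))
    counts = [0] * n_bins
    for v in values:
        if 0.0 <= v < upper:
            counts[min(int(v / width), n_bins - 1)] += 1
    return counts


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (maximum ECDF difference)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


rng = random.Random(1)
simulated = simulate_null_ks(200, 0.67, rng)
replicate = simulate_null_ks(200, 0.67, rng)
print(histogram(simulated)[:5])
print(ks_statistic(simulated, replicate))
```

In practice the second sample would be the observed Ks values rather than another simulated replicate; two draws from the same model are used here only to keep the example self-contained.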
Finite mixture model of genome duplications

To explore further how genome-wide duplication events influence the age distribution of paralogs and Ks distributions, we defined a "background duplication" as gene duplication under the constant-rate birth–death process, and a "genome duplication" as an instantaneous spike of gene births overlaid on the background. We modeled changes in Ks distributions with increasing time since a duplication event, assuming a constant rate of gene loss (death rate) and a constant background gene duplication rate (birth rate). Each simulation included a genome duplication (which produced n new duplicates) at time t. About 5% of duplicates were allowed to escape the death process.

In all instances when we rejected the constant-rates hypothesis, we surmised that the observed Ks distributions actually reflect a compound distribution generated by variable birth and/or death rates from the time of duplication. For example, a genome duplication event would generate an immediate spike in the birth of paralogs. Mixture models treat the distribution of interest as a mixture of several component distributions in various proportions. The EMMIX software is suitable for mixed populations in which each component can be described by a Gaussian density (McLachlan et al. 1999) (see http://www.maths.uq.edu.au/gjm/emmix/emmix.html for the Users' Guide). Following Schlueter et al. (2004), we modeled the log-transformed Ks distribution of paralogs. (The actual distribution is a mixture of log-transformed exponentials and normals.) Observations with Ks < 0.005 were excluded to avoid fitting a component to infinity (Schlueter et al. 2004). This truncation might also reduce the proportion of gene pairs attributed to background duplication. We modeled the mixed populations with one to four components and repeated the EM algorithm 100 times with random starting values, as well as 10 times with k-means starting values.
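The fitting step can be sketched in Python as follows. This is not EMMIX and does not reproduce its options: it is a plain EM loop for univariate Gaussian components on log Ks, restarted from random starting means, with the Ks < 0.005 cutoff applied beforehand. The function names, the iteration count, the variance floor, and the simulated input are all assumptions for illustration; no relationship between component variance and mean is enforced, and k-means starts are not included. Fits with different numbers of components are compared here with the Bayesian Information Criterion (Schwarz 1978).

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(x, weights, means, variances):
    """Log-likelihood of the data under the fitted mixture."""
    return sum(math.log(sum(w * normal_pdf(v, m, s2)
                            for w, m, s2 in zip(weights, means, variances)) or 1e-300)
               for v in x)


def fit_mixture(x, k, rng, n_iter=100, min_var=1e-6):
    """One EM run for a k-component Gaussian mixture from a random start."""
    n = len(x)
    grand_mean = sum(x) / n
    grand_var = max(sum((v - grand_mean) ** 2 for v in x) / n, min_var)
    weights = [1.0 / k] * k
    means = rng.sample(x, k)               # random starting means drawn from the data
    variances = [grand_var] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for v in x:
            dens = [w * normal_pdf(v, m, s2)
                    for w, m, s2 in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate proportions, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            variances[j] = max(min_var, sum(r[j] * (v - means[j]) ** 2
                                            for r, v in zip(resp, x)) / nj)
    return weights, means, variances, mixture_loglik(x, weights, means, variances)


def fit_best(x, k, rng, n_starts=10):
    """Keep the highest-likelihood fit over several random starts."""
    return max((fit_mixture(x, k, rng) for _ in range(n_starts)),
               key=lambda fit: fit[3])


def bic(loglik, k, n):
    """Schwarz criterion; a k-component fit has 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


# Simulated input for illustration only: two lognormal groups of Ks values.
rng = random.Random(0)
ks = ([math.exp(rng.gauss(-2.0, 0.3)) for _ in range(150)]
      + [math.exp(rng.gauss(0.0, 0.3)) for _ in range(150)])
log_ks = [math.log(v) for v in ks if v >= 0.005]    # exclude Ks < 0.005
for k in (1, 2):
    weights, means, variances, loglik = fit_best(log_ks, k, rng, n_starts=5)
    print(k, round(bic(loglik, k, len(log_ks)), 1))
```

The fixed iteration count is a deliberate simplification: a convergence tolerance can stop EM too early when two components start close together, so the sketch relies on several restarts instead.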
One restriction imposed on the variance structure of Ks is that the variance increases with the mean, in accordance with the empirical estimates. The observed data could therefore often be fitted to more than one component, with different means, variances, and mixture proportions. The mixture model with the best fit was identified using the Bayesian Information Criterion (Schwarz 1978). The mean and variance for each component (subpopulation of log Ks values) for the selected model were back-transformed to the original scale for plotting and interpretation.

Calibrating rate of synonymous substitution across lineages

When comparing Ks distributions among taxa, variation in the substitution rates among lineages must be taken into account. We used a phylogenetic approach to estimate lineage-specific synonymous substitution rates on branches leading to the magnoliids L. tulipifera, P. americana, and S. henryi. Orthologous genes from A. thaliana, rice, and the three magnoliid species were classified by InParanoid (Remm et al. 2001). Protein alignments of Arabidopsis and rice gene models (the TIGR Arabidopsis thaliana database, the TIGR rice database) were first constructed, and DNA alignments were then forced onto the protein alignments by codon position. A maximum likelihood tree was estimated using the HKY model in PHYML v.2.4.3 (Guindon and Gascuel 2003) for each putative ortholog set that included at least 400 aligned nucleotide positions. A per-site estimate of Ks was then made for each magnoliid branch in gene phylogenies consistent with organismal relationships ([Liriodendron, Persea] Saruma) using codeml in the PAML package (Yang 1997). The ratio of Ks values on the Persea branch relative to the Liriodendron branch was then estimated for each gene.

Two supplemental tables and R scripts for birth–death simulations are available as Supplemental material. Teri Solow and Lukas Mueller provided the EST sequence assembly for eight species (A. americanus, A. trichopoda, E. californica, L. tulipifera, N. advena, P. americana, S. henryi, and Welwitschia mirabilis), now available through the Plant Genome Network (http://pgn.cornell.edu/).

Acknowledgments

We thank Jongmin Nam for providing code for Ks computation; Lena Scheaffer, Yi Hu, and Shelia Plock for technical support on cDNA library construction and sequencing; Lukas Mueller, Dan Ilut, Teri Solow, and Steve Tanksley for the PGN Database; and anonymous reviewers for critical comments on the manuscript. This work was supported by NSF Plant Genome award DBI-0115684.

References

Abi-Rached, L., Gilles, A., Shiina, T., Pontarotti, P., and Inoko, H. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet. 31: 100–105.
Adams, K.L., Cronn, R., Percifield, R., and Wendel, J.F. 2003. Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proc. Natl. Acad. Sci. 100: 4649–4654.
Albert, V.A., Soltis, D.E., Carlson, J.E., Farmerie, W.G., Wall, P.K., Ilut, D.C., Solow, T.M., Mueller, L.A., Landherr, L.L., Hu, Y., et al. 2005. Floral gene resources from basal angiosperms for comparative genomics research. BMC Plant Biol. 5: 5.
Bell, C.D., Soltis, D.E., and Soltis, P.S. 2005. The age of the angiosperms: A molecular timescale without a clock. Evolution Int. J. Org. Evolution 59: 1245–1258.
Bennett, M.D., Leitch, I.J., Price, H.J., and Johnston, J.S. 2003. Comparisons with Caenorhabditis (approximately 100 Mb) and Drosophila (approximately 175 Mb) using flow cytometry show genome size in Arabidopsis to be approximately 157 Mb and thus approximately 25% larger than the Arabidopsis Genome Initiative estimate of approximately 125 Mb. Ann. Bot. (Lond.) 91: 547–557.
Bierne, N. and Eyre-Walker, A. 2003. The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: Implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics 165: 1587–1597.
Blanc, G. and Wolfe, K.H. 2004. Widespread paleopolyploidy in model plant species inferred from age distribution of duplicate genes. Plant Cell 16: 1667–1678.
Blanc, G., Barakat, A., Guyot, R., Cooke, R., and Delseny, M. 2000. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12: 1093–1101.
Blanc, G., Hokamp, K., and Wolfe, K.H. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 13: 137–144.
Bogart, J.P. 1979. Evolutionary implications of polyploidy in amphibians and reptiles. Basic Life Sci. 13: 341–378.
Borsch, T., Hilu, K.W., Quandt, D., Wilde, V., Neinhuis, C., and Barthlott, W. 2003. Noncoding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. J. Evol. Biol. 16: 558–576.
Bowers, J.E., Chapman, B.A., Rong, J., and Paterson, A.H. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433–438.
Brysting, A.K. and Borgen, L. 2000. Isozyme analysis of the Cerastium alpinum C-arcticum complex (Caryophyllaceae) supports a splitting of C. arcticum Lange. Plant Syst. Evol. 220: 199–221.
Buzgo, M., Soltis, P.S., Kim, S., and Soltis, D.E. 2005. The making of a flower. Biologist 52: 149–154.
Cannon, S.B., Mitra, A., Baumgarten, A., Young, N.D., and May, G. 2004. The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol. 4: 10.
Darlington, C.D. 1937. Recent advances in cytology. P. Blakiston's Son & Co., Philadelphia, PA.
Darwin, C.D. 1903. More letters of Charles Darwin. John Murray, London.
De Bodt, S., Maere, S., and Van de Peer, Y. 2005. Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20: 591–597.
Dehal, P. and Boore, J.L. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 3: e314.
deWet, J.M. 1979. Origins of polyploids. Basic Life Sci. 13: 3–15.
Dong, Q., Schlueter, S.D., and Brendel, V. 2004. PlantGDB, plant genome database and analysis tools. Nucleic Acids Res. 32: D354–D359.
Duvall, M.R., Learn Jr., G.H., Eguiarte, L.E., and Clegg, M.T. 1993. Phylogenetic analysis of rbcL sequences identifies Acorus calamus as the primal extant monocotyledon. Proc. Natl. Acad. Sci. 90: 4641–4644.
Edgar, R.C. 2004. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.
Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L., and Postlethwait, J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545.
Friedman, R. and Hughes, A.L. 2001. Gene duplication and the structure of eukaryotic genomes. Genome Res. 11: 373–381.
Gaut, B.S., Morton, B.R., McCaig, B.C., and Clegg, M.T. 1996. Substitution rate comparisons between grasses and palms: Synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc. Natl. Acad. Sci. 93: 10274–10279.
Goldman, N. and Yang, Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725–736.
Grant, V. 1963. The origin of adaptations. Columbia University Press, New York.
------. 1981. Plant speciation. Columbia University Press, New York.
Grant, D., Cregan, P., and Shoemaker, R.C. 2000. Genome organization in dicots: Genome duplication in Arabidopsis and synteny between soybean and Arabidopsis. Proc. Natl. Acad. Sci. 97: 4168–4173.
Gu, X., Wang, Y., and Gu, J. 2002. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat. Genet. 31: 205–209.
Guindon, S. and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52: 696–704.
Hilu, K.W. 1993. Polyploidy and the evolution of domesticated plants. Am. J. Bot. 80: 2521–2528.
Hilu, K.W., Borsch, T., Mueller, K., Soltis, D.E., Soltis, P.S., Savolainen, V., Chase, M.W., Powell, M., Alice, L.A., Evans, R., et al. 2003. Angiosperm phylogeny based on matK sequence information. Am. J. Bot. 90: 1758–1776.
Hughes, A.L. 1999. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol. 48: 565–576.
Hughes, A.L. and Friedman, R. 2003. 2R or not 2R: Testing hypotheses of genome duplication in early vertebrates. J. Struct. Funct. Genomics 3: 85–93.
Iseli, C., Jongeneel, C.V., and Bucher, P. 1999. ESTScan: A program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 138–148.
Karev, G.P., Wolf, Y.I., Berezovskaya, F.S., and Koonin, E.V. 2004. Gene family evolution: An in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evol. Biol. 4: 32.
Kellis, M., Birren, B.W., and Lander, E.S. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428: 617–624.
Kim, S., Yoo, M.-J., Albert, V.A., Farris, J.S., Soltis, P.S., and Soltis, D.E. 2004. Phylogeny and diversification of B-function MADS-box genes in angiosperms: Evolutionary and functional implications of a 260-million-year-old duplication. Am. J. Bot. 91: 2102–2118.
Kramer, E.M. and Irish, V.F. 1999. Evolution of genetic mechanisms controlling petal development. Nature 399: 144–148.
Kramer, E.M., Dorit, R.L., and Irish, V.F. 1998. Molecular evolution of genes controlling petal and stamen development: Duplication and divergence within the APETALA3 and PISTILLATA MADS-box gene lineages. Genetics 149: 765–783.
Ku, H.M., Vision, T., Liu, J., and Tanksley, S.D. 2000. Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl. Acad. Sci. 97: 9121–9126.
Leebens-Mack, J., Raubeson, L.A., Cui, L., Kuehl, J.V., Fourcade, M.H., Chumley, T.W., Boore, J.L., Jansen, R.K., and dePamphilis, C.W. 2005. Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone. Mol. Biol. Evol. 22: 1948–1963.
Li, W.H. and Graur, D. 1991. Fundamentals of molecular evolution. Sinauer Associates, Sunderland, MA.
Liu, B. and Wendel, J.F. 2003. Epigenetic phenomena and the evolution of plant allopolyploids. Mol. Phylogenet. Evol. 29: 365–379.
Lynch, M. and Conery, J.S. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155.
Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., and Van de Peer, Y. 2005. Modeling gene and genome duplications in eukaryotes. Proc. Natl. Acad. Sci. 102: 5454–5459.
Makalowski, W. 2001. Are we polyploids? A brief history of one hypothesis. Genome Res. 11: 667–670.
Masterson, J. 1994. Stomatal size in fossil plants: Evidence for polyploidy in majority of angiosperms. Science 264: 421–424.
McLachlan, G.J., Peel, D., Basford, K.E., and Adams, P. 1999. The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Softw. 4: 2.
McLysaght, A., Hokamp, K., and Wolfe, K.H. 2002. Extensive genomic duplication during early chordate evolution. Nat. Genet. 31: 200–204.
Müntzing, A. 1936. The evolutionary significance of autopolyploidy. Hereditas 21: 263–378.
Nei, M. and Gojobori, T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3: 418–426.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, New York.
Otto, S.P. and Whitton, J. 2000. Polyploid incidence and evolution. Annu. Rev. Genet. 34: 401–437.
Paterson, A.H., Bowers, J.E., and Chapman, B.A. 2004a. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl. Acad. Sci. 101: 9903–9908.
Paterson, A.H., Bowers, J.E., Chapman, B.A., Peterson, D.G., Rong, J., and Wicker, T.M. 2004b. Comparative genome analysis of monocots and dicots, toward characterization of angiosperm diversity. Curr. Opin. Biotechnol. 15: 120–125.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314: 1041–1052.
Schlueter, J.A., Dixon, P., Granger, C., Grant, D., Clark, L., Doyle, J.J., and Shoemaker, R.C. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47: 868–876.
Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6: 461–464.
Shoemaker, R.C., Polzin, K., Labate, J., Specht, J., Brummer, E.C., Olson, T., Young, N., Concibido, V., Wilcox, J., Tamulonis, J.P., et al. 1996. Genome duplication in soybean (Glycine subgenus soja). Genetics 144: 329–338.
Simillion, C., Vandepoele, K., Van Montagu, M.C., Zabeau, M., and Van de Peer, Y. 2002. The hidden duplication past of Arabidopsis thaliana. Proc. Natl. Acad. Sci. 99: 13627–13632.
Soltis, D.E. and Soltis, P.S. 1990. Isozyme evidence for ancient polyploidy in primitive angiosperms. Syst. Bot. 15: 328–337.
------. 1999. Polyploidy: Recurrent formation and genome evolution. Trends Ecol. Evol. 14: 348–352.
Soltis, P.S., Soltis, D.E., Zanis, M.J., and Kim, S. 2000. Basal lineages of angiosperms: Relationships and implications for floral evolution. Int. J. Plant Sci. 161: S97–S107.
Soltis, D.E., Soltis, P.S., and Zanis, M.J. 2002. Phylogeny of seed plants based on evidence from eight genes. Am. J. Bot. 89: 1670–1681.
Stebbins, G.L. 1950. Variation and evolution in plants. Columbia University Press, New York.
Stefanovic, S., Rice, D.W., and Palmer, J.D. 2004. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol. Biol. 4: 35.
Vision, T.J., Brown, D.G., and Tanksley, S.D. 2000. The origins of genomic duplications in Arabidopsis. Science 290: 2114–2117.
Wang, H.C., Singer, G.A., and Hickey, D.A. 2004a. Mutational bias affects protein evolution in flowering plants. Mol. Biol. Evol. 21: 90–96.
Wang, J.P., Lindsay, B.G., Leebens-Mack, J., Cui, L., Wall, K., Miller, W.C., and dePamphilis, C.W. 2004b. EST clustering error evaluation and correction. Bioinformatics 20: 2973–2984.
Wang, W., Tanurdzic, M., Luo, M., Sisneros, N., Kim, H.R., Weng, J.K., Kudrna, D., Mueller, C., Arumuganathan, K., Carlson, J., et al. 2005. Construction of a bacterial artificial chromosome library from the spikemoss Selaginella moellendorffii: A new resource for plant comparative genomics. BMC Plant Biol. 5: 10.
Wolfe, K.H. and Shields, D.C. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708–713.
Wolfe, K.H., Li, W.H., and Sharp, P.M. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc. Natl. Acad. Sci. 84: 9054–9058.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.
Yang, Z. and Nielsen, R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17: 32–43.
Yu, J.J., Wang, W., Lin, S., Li, H., Li, J., Zhou, P., Ni, W., Dong, S., Hu, C., Zeng, J., et al. 2005. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3: e38.
Zahn, L.M., Kong, H., Leebens-Mack, J.H., Kim, S., Soltis, P.S., Landherr, L.L., Soltis, D.E., dePamphilis, C.W., and Ma, H. 2005a. The evolution of the SEPALLATA subfamily of MADS-box genes: A preangiosperm origin with multiple duplications throughout angiosperm history. Genetics 169: 2209–2223.
Zahn, L.M., Leebens-Mack, J., dePamphilis, C.W., Ma, H., and Theissen, G. 2005b. To B or Not to B a flower: The role of DEFICIENS and GLOBOSA orthologs in the evolution of the angiosperms. J. Hered. 96: 225–240.
Zanis, M.J., Soltis, D.E., Soltis, P.S., Mathews, S., and Donoghue, M.J. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. 99: 6848–6853.
Zhang, J., Rosenberg, H.F., and Nei, M. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. 95: 3708–3713.

Received October 20, 2005; accepted in revised form March 27, 2006.