Genome Evolution: Overview

J Bruce Walsh, University of Arizona, Tucson, Arizona, USA

The genome is the total genetic constitution of an organism. Understanding of the structure and evolution of genomes is undergoing a revolution with the ability to sequence entire genomes. A key feature in the evolution of genomes is the creation of new genes, usually by duplication.

Introductory article

Article Contents
■ Introduction
■ How Genes and Genomes Are Organized
■ Variation in Gene Number
■ Variation in Genome Size: The C-value Paradox

Introduction

Our understanding of the structure and evolution of genomes (the total genetic constitution of an organism) is presently undergoing a revolution on a par with any other in the history of science. We now have the ability to sequence entire genomes, and this is happening at a rapidly increasing pace. Further, molecular studies have provided a fairly good understanding of the deep phylogeny of life (how the major groups of life are related to each other), providing the framework for examining the evolution of genomes across all of life. It is now established that rather than the old prokaryotic-eukaryotic division (bacteria versus cells with nuclei) there are instead three very deep, fundamental branches of life: the Bacteria (or eubacteria), the Archaea (or archaebacteria), and the Eucarya (or eukaryotes). Prokaryotes thus consist of two radically different groups (the eubacteria and archaebacteria) that are as distinct from each other as either is from the eukaryotes. While we will still use the term prokaryote when referring to features shared by both eubacteria and archaebacteria, the reality is that the archaebacteria are (slightly) more closely related to eukaryotes than they are to eubacteria.

Excluding those of viruses, genomes consist of one or more double-stranded molecules of DNA. Depending on the virus, viral genomes may use any of the four possible types of nucleic acid.
Some use double-stranded DNA, others use single-stranded DNA, and still others use either single-stranded or double-stranded RNA.

How Genes and Genomes Are Organized

The vast majority of genes code for messenger RNAs (mRNAs) that are translated into proteins, with the collection of all proteins encoded within a genome referred to as the proteome. Proteins can be broadly classified as those involved in basic cellular housekeeping functions common to essentially all cells, and those that have more specialized functions (such as those that appear only in specific tissues and/or only when the cell experiences specific environmental cues). Examples of housekeeping functions include information storage and retrieval (DNA replication and repair, transcription of DNA into RNA, and translation of mRNA into proteins), metabolism (synthesis of complex organic molecules from simpler precursors), energy management and storage, cellular transport and cellular division. While such housekeeping proteins are absolutely essential to the cell, the majority of proteins have tissue-specific or environment-specific roles. For example, the bulk of genes in multicellular organisms are probably involved in tissue-specific functions and in regulating the development of different cell types.

The genome also contains a smaller (but certainly no less important) set of genes coding for structural RNAs. In both prokaryotes and eukaryotes, ribosomal and transfer RNAs (rRNAs and tRNAs, respectively) play critical roles in translation. Additionally, eukaryotes have a number of small RNAs that play key roles in rRNA and mRNA processing, cellular transport and other functions. Collectively, all the protein-coding and structural RNA-coding genes constitute the genic DNA of a genome. For prokaryotes, this is the bulk of the entire genome. In eukaryotes, genic DNA comprises only a fraction (and in some cases a very small fraction) of the total genome.

Eubacteria

The Bacteria as well as the Archaea lack a nuclear membrane, and as a consequence there is no clear separation of the genome (DNA) from the cytoplasm. This lack of compartmentalization allows for coupled transcription-translation, wherein an mRNA can be translated into a protein on ribosomes while it is still being transcribed from the DNA. Typically, the genomes of these groups exist as a single circular chromosome with a single replication origin (DNA replication can only initiate at a single site). Genome size is small, with very little nongenic DNA. The genome may also be augmented by a number of plasmids (small circular DNAs), which usually (but not always) lack genes essential for normal cellular function.

ENCYCLOPEDIA OF LIFE SCIENCES © 2001, John Wiley & Sons, Ltd. www.els.net

Bacterial genes are tightly clustered, often arranged in an operon structure in which a single long transcript contains several distinct genes, each with its own initiation and termination codons. Such transcripts are called polycistronic, as they code for multiple cistrons (proteins). Individual genes typically start with a Shine-Dalgarno sequence that facilitates the binding of mRNA to the ribosome for translation. Bacterial genes are uninterrupted, lacking the internal intron sequences that are ubiquitous in eukaryotic protein-coding genes, and do not undergo the extensive mRNA processing found in eukaryotes. Transcriptional control of bacterial genes (and operons) occurs by proteins binding to the promoter and a few adjacent regions to either enhance or block transcription. Bacteria use a single RNA polymerase for transcribing both mRNAs and structural RNAs (rRNAs, tRNAs).

Eukaryotes

Eukaryotic genomes contain multiple linear chromosomes, each chromosome containing multiple origins of replication.
The presence of multiple linear chromosomes requires specialized sequences for proper replication of the chromosome ends (telomeres) and special sequences to ensure the correct segregation of chromosome pairs during cell division (centromeres).

Eukaryotic genomes are considerably larger (by orders of magnitude) than bacterial genomes. Part of this size difference is due to increases in the total number of genes, but the vast majority is due to a great increase in the fraction of nongenic DNA (see below). The presence of multiple replication origins probably accounts for eukaryotes' much larger genome sizes, as the speed of genome replication (and hence cell growth) is far less constrained by genome size relative to prokaryotes with their single origin.

A consequence of the much larger genome sizes in eukaryotes is that the DNA in eukaryotic chromosomes must be tightly packaged to fit within the nucleus (in some species, this is equivalent to fitting over a hundred miles of wire into an object the size of a basketball). At the lowest level of packing, the DNA is wrapped around complexes of histone proteins to form nucleosomes. Strings of nucleosomes are themselves further folded to greatly increase the DNA compaction. The result is chromatin, a DNA molecule extensively covered with proteins and highly condensed.

The eukaryotic genome is enclosed in a nuclear membrane that separates the DNA from the cytoplasm of the cell. This nuclear-cytoplasmic separation has profound influences on transcription and translation. Following transcription, extensive RNA processing occurs in the nucleus before the final mRNAs are transported to the cytoplasm for translation into proteins. Both ends of the initial mRNA transcript are modified: the starting (5') end receives a special nucleotide cap, while a run of adenines (the poly(A) tail) is added at the finishing (3') end.
By far the most striking feature of eukaryotic mRNA processing is that the original transcript often contains numerous introns, internal sequences that must be precisely removed (spliced out) to create the final product. Introns can easily make up the vast bulk of a gene, with the coding exons (those regions of the mRNA that remain following splicing) comprising only a small part of the initial transcript.

Transcription in eukaryotes is much more complex than in prokaryotes for several reasons. First, the configuration of the chromatin surrounding a gene has a strong influence on its expression. Second, transcriptional control often requires the binding of several transcription factors at multiple regulatory sites adjacent to the gene to form large multiprotein transcription complexes. Because of the tight packing of eukaryotic DNA, sequences very far apart on a chromosome may in fact lie close together in the nucleus. This probably accounts for the fact that transcription can be strongly influenced by enhancer sequences many thousands of bases from the gene, which can greatly increase the level of transcription. Eukaryotes use three distinct RNA polymerases for transcription: mRNAs (and some small RNAs) are transcribed by RNA polymerase II (Pol II), rRNAs use Pol I, and tRNAs and other small RNAs use Pol III. Eukaryotic mRNAs lack the Shine-Dalgarno sequence used by prokaryotes to bind the mRNA to the ribosome; in eukaryotes,
mRNA ribosome binding is facilitated by ribosomal protein interactions with the 5' cap. Finally, operon-like structures are only rarely found in eukaryotes (the most extreme case being the nematode Caenorhabditis elegans, in which roughly 25% of genes exist in short operons where a single initial transcript contains two or more separate genes). Even when operons are present, as a result of extensive RNA processing and splicing, the final mRNA that is transported into the cytoplasm codes for only a single protein. Thus, eukaryotic cytoplasmic mRNAs are not polycistronic.

Most eukaryotes contain one or two additional genomes beyond the nuclear DNA, as the two major cellular organelles (the mitochondrion and the plastid or chloroplast) each contain their own distinct genomes. Comparative sequencing clearly shows that each of these organelles originally arose by a single endosymbiotic event. First, a very primitive eukaryotic cell engulfed a eubacterium that gave rise to mitochondria. This endosymbiotic event occurred near the base of the eukaryotic tree. Second, the ancestor of the plants and algae later captured a cyanobacterium that gave rise to the plastid. In both cases, the majority of genes from the captured eubacterial ancestor were transferred to the eukaryotic nucleus. Despite their small size, organelle genomes still contain unmistakable eubacterial signatures from their ancestors (for example, most organelle genomes exist as single circular chromosomes).

Archaea

In many respects, the genomes of Archaea (the archaebacteria) are typical of bacteria: a single major circular chromosome with a single origin of replication, and a small genome consisting mainly of genic DNA. Archaeal protein-coding genes lack introns, are often clustered in operons, and have Shine-Dalgarno ribosome-binding sites. Despite these eubacterial similarities, sequencing studies have shown that archaebacterial genomes also have many eukaryotic features.
Much of the cellular informational processing machinery (DNA replication, transcription, translation) is far more like that found in eukaryotes. For example, the DNA replication enzyme (DNA polymerase) used by archaebacteria is homologous to the eukaryotic polymerases, both of which are unrelated to the polymerases used by eubacteria. Most other archaeal replication proteins are also far more eukaryote-like than eubacteria-like. Similar patterns are seen in genes involved in transcription and translation. Conversely, most archaeal metabolic genes are much more like eubacterial genes. This chimaeric distribution of archaeal genomes, with some genes being very eubacteria-like while others are very eukaryote-like, is consistent with the Archaea being a third distinct branch of life about equally distant from the other two groups. However, this distribution has also led to suggestions that the Archaea may be the result of an ancient fusion between a eubacterium and a eukaryote.

The first genomes

While the Earth did not have a stable crust until about 4.0 billion years ago (Bya), unmistakable and unambiguous evidence of cells is seen at 3.5 Bya and reasonable evidence at 3.7-3.8 Bya. Life thus exploded onto the Earth in what amounts to a cosmic instant. Given the complexity of even the smallest genomes (Mushegian and Koonin argue that the minimal number of genes required for a cell is around 250-300), how did the first genome (the progenote) arise and what was its nature? While present-day cells rely absolutely on proteins for almost all cellular functions, there are suggestions that primitive cells may have been entirely RNA-based. RNA can both store genetic information and catalyse biological reactions, circumventing the historical problem of which came first, proteins or DNA. The observation that DNA replication in all present-day genomes requires an RNA primer is consistent with an RNA-based progenote, potentially being a vestigial relict of an RNA genome.
If the progenote was indeed RNA-based, how did it transform into a DNA-based genome that uses proteins for most cellular functions? The concept of hypercycles provides a solution. Suppose factor 1 is required to make factor 2, factor 2 is required to make factor 3, and factor 3 is required to make factor 1. This is a simple example of a hypercycle, in which the replication of each component depends on replication of the others. If an additional factor (such as a protein component) can increase the speed of reaction, there is strong selection pressure to incorporate it into the hypercycle. Very complex biological interactions can be built up in this manner. Thus a complex RNA-protein hypercycle can evolve from an initially RNA-only hypercycle with far fewer components. At some point a primitive reverse-transcriptase enzyme converted RNA into DNA, allowing an RNA-based genome to be turned into a chemically more stable DNA genome.

Even allowing for hypercycles, the simplest biochemical pathways in present-day genomes are still extraordinarily complex. In a typical reaction the cell starts with factor A and, through a complex series of protein-mediated enzymatic interactions, converts it to the final product E (say) required by the cell; e.g. A → B → C → D → E. How do such complex metabolic pathways evolve? It has been suggested that they evolved backwards. Suppose that initially E was common in the environment, as might be expected on a primeval Earth rich in organic (i.e. carbon-based) molecules but largely devoid of life. The first life forms would use what is present (E). As the supply of E became rarer, there would be a great selective advantage to those cells that could use D (also present in the environment) and convert this to E. Proceeding in this fashion, the complex pathways of today could have evolved backwards by a series of such steps.
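
The backwards-evolution argument can be made concrete with a toy model. The sketch below is only an illustration under simplifying assumptions that are not in the article: each compound has a fixed environmental stock, a lineage consumes one unit per generation, and a new enzyme is gained the moment the compound currently taken from the environment runs out. The function name `evolve_backwards` and the stock values are hypothetical.

```python
def evolve_backwards(pathway, supply):
    """Toy model of retrograde pathway evolution.

    pathway: ordered compounds, final product last (e.g. "ABCDE").
    supply:  dict mapping each compound to its environmental stock (assumed values).
    Returns a list of (substrate, product, generation) tuples, one per enzymatic
    step, in the order in which the steps were acquired.
    """
    stock = dict(supply)          # copy so the caller's dict is not modified
    entry = len(pathway) - 1      # index of the compound currently taken from the environment
    acquired = []
    generation = 0
    while True:
        compound = pathway[entry]
        if stock[compound] > 0:
            stock[compound] -= 1  # one unit consumed per generation
            generation += 1
            continue
        if entry == 0:
            break                 # no earlier precursor left to fall back on
        # The entry compound is exhausted: selection favours an enzyme that
        # makes it from the preceding compound in the pathway.
        acquired.append((pathway[entry - 1], compound, generation))
        entry -= 1
    return acquired


steps = evolve_backwards("ABCDE", {"A": 5, "B": 1, "C": 2, "D": 2, "E": 3})
for substrate, product, generation in steps:
    print(f"generation {generation}: gained enzyme {substrate} -> {product}")
```

In this toy run the last step of the pathway (D to E) is the first enzyme acquired and the first step (A to B) is the last, which is the sense in which the pathway is assembled backwards.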

Another central question is what can be said about the genomic structure of the last common ancestor (the cenancestor) of the three major domains of life. We can draw inferences about its nature by examining the genes and cellular processes shared by all three groups. First, the translational code was in place in the cenancestor, because all present-day genomes are based on the same genetic code (specifying which codons code for which amino acids). Second, while the progenote may have been RNA-based, the cenancestor likely had a DNA-based genome. Evidence for this is that all three domains share homologous enzymes for dealing with DNA (such as DNA topoisomerases, gyrases, and DNA-dependent RNA polymerases). However, the replication machinery may not have been finalized in the cenancestor, as the DNA polymerases used by eubacteria are unrelated to those used by archaebacteria and eukaryotes.

Comparing the fully sequenced genome of a eukaryote (yeast) with that of a eubacterium and an archaebacterium shows 80 detectable orthologues (DNA sequences showing signs of common ancestry) common to all three groups. This is probably an underestimate, as some genes are expected to evolve to the point where it is very difficult to detect any signature of past common ancestry. Further, nonorthologous gene displacement has likely occurred in some lineages, with unrelated genes taking over the roles of other genes, displacing the original orthologous genes. A more detailed comparison between a eubacterial and an archaebacterial genome found around 260 genes in common, of which 130 were involved in informational processing (95 for translation, 18 for DNA replication, 8 for recombination/repair, 9 for transcription). The remainder were involved in nucleotide metabolism (23), general cellular metabolism (86), and other functions (18).

A widely debated issue is whether the cenancestor (and perhaps even the progenote) had interrupted genes (introns). The 'introns-early' view holds that introns were present in the cenancestor but were lost by eubacteria and archaebacteria, while the 'introns-late' hypothesis holds that introns invaded the protein-coding genes after the eukaryotic ancestor branched off from the cenancestor. The phylogenetic distribution of introns is far more consistent with the introns-late view. For example, nuclear-encoded genes originally from organelles contain introns, yet phylogeny shows that the ancestors of these organelles lacked introns, implying that introns arose following the transfer of these genes to the nucleus.

Variation in Gene Number

Estimates of the number of protein-coding genes in sequenced genomes are obtained by counting the number of open reading frames (ORFs) of sufficient length and with sufficient other signatures (such as start and stop codons and appropriate regulatory regions). Some ORFs can immediately be assigned to known families of proteins, while others remain as unidentified reading frames (URFs), potentially indicating genes of unknown function. The current numbers of such detected ORFs in eubacterial genomes range from 500 to 700 for certain intracellular parasites to 1000-4300 in free-living bacteria. Archaeal genomes show gene numbers ranging from 1800 to 2500, although this range is expected to grow as more species are sequenced. The range of gene number in eukaryotes is considerably higher: yeast (Saccharomyces) with 6154; nematodes (C. elegans) with 19 100; the fruit fly Drosophila with around 12 000; and humans with around 70 000. These numbers are probably underestimates, as small genes are often overlooked by ORF-searching programs as they scan for long open reading frames.
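
To illustrate why a length threshold causes small genes to be missed, here is a minimal sketch of an ORF scan. It is not the method used by any particular genome project: it looks only at the three forward-strand reading frames, treats ATG as the sole start codon, and ignores regulatory signals entirely. The function name `find_orfs`, the `min_codons` parameter and the example sequence are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon and is
    reported only if it has at least `min_codons` codons before the stop.
    `start` is the index of the A in ATG; `end` is one past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # codons before the stop, ATG included
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close the frame and keep scanning
    return orfs


# A six-codon ORF followed immediately by a one-codon ORF (made-up sequence).
example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "ATGTGA"
print(find_orfs(example, min_codons=5))   # only the longer ORF passes the threshold
print(find_orfs(example, min_codons=1))   # lowering the threshold also reports the short one
```

Raising `min_codons` reduces spurious hits from random sequence but discards genuinely short genes, which is the trade-off behind the underestimates mentioned above.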

One very sobering, and exciting, observation is that even in the two best-characterized organisms (the human gut eubacterium Escherichia coli and the yeast Saccharomyces), the majority of ORFs have unknown functions: 60% (2600) of E. coli and 56% (3500) of yeast ORFs. Further, in yeast a large fraction of the ORFs can be individually deleted with no obvious effect. As the genomes of higher eukaryotes (such as flies, mice and humans) are sequenced, we expect the percentage of unknown ORFs to be much higher.

Organelle genomes and genome miniaturization

Mitochondrial genomes (mtDNA) also show considerable variation in gene number, especially given that the eubacterial ancestor was expected to contain between 2000 and 4000 genes at the time of capture. Fully sequenced mtDNAs show between 3 and 62 protein-coding genes and between 5 and 25 RNAs (all contain the 16S and 23S rRNAs, and most contain 22-25 tRNAs), with most animal mitochondria containing no more than a dozen or so protein-coding genes. Looking across all sequenced mtDNAs, only four genes seem to have been conserved in all genomes: the cob and cox1 genes involved in cellular respiration, and the large and small rRNAs. Over a thousand nuclear-encoded proteins are imported into the mitochondria, mostly produced by genes originally from the eubacterial ancestor. For example, the ribosomal proteins and most other machinery to translate the mtDNA-encoded mRNAs are prokaryotic in nature, and distinct from those required for cytoplasmic ribosomes. Thus, following the initial endosymbiotic event, there was massive transfer of eubacterial genes into the eukaryotic nucleus.

Given this transfer, a key unresolved issue is why mitochondrial genomes exist at all. Indeed, there are a few eukaryotes that lack mitochondria.
While these amitochondrial eukaryotes were originally thought to be very deep-branching lineages predating the original endosymbiotic event, it is now known that some have a far more recent origin and have lost their mitochondria fairly recently. Other amitochondrial eukaryotes do appear to be very deep-branching, but sequencing has shown that they contain one or more key mitochondrial genes of eubacterial origin in their nucleus. Thus, the mitochondrial endosymbiotic event must have occurred very early in the eukaryotic radiation, with even very deep-branching amitochondrial eukaryotes originally containing mitochondria, which transferred at least some genes to the nucleus before the organelle itself was lost in these lineages.

Plastids (chloroplasts) also contain a very reduced set of genes relative to their cyanobacterial ancestor, although they contain many more genes than mitochondrial genomes (typically around 100 or so protein-coding genes). The plastids from certain groups provide very interesting examples of multiple rounds of gene transfer. It is clear for both structural and phylogenetic reasons that some eukaryotes have obtained their plastids from secondary endosymbiotic events by engulfing a plastid-containing alga. In at least two such cases of secondary endosymbiosis, the current plastid contains a vestigial remnant of the nucleus of the captured eukaryotic cell, the nucleomorph. These are the smallest eukaryotic genomes known, being on the order of 350 000 to 600 000 bases and containing an estimated 200 or so genes. Plastids with nucleomorphs represent cases of two massive gene transfers: first from the cyanobacterial ancestor to the eukaryotic nucleus in the primary endosymbiotic event, and then the transfer of these genes again from the nucleus of the engulfed alga to the nucleus of the new host eukaryote. Note that these eukaryotes have four separate genomes: nuclear, mitochondrial, plastid and nucleomorph.

Variation in Genome Size: The C-value Paradox

For historical reasons, the (haploid) genome size of an organism is often referred to as its C value. The genome size of eubacteria and archaebacteria shows over a 30-fold range, from 0.4 megabases (1 Mb = 10^6 bases) to 13 Mb of DNA. Most of this variation must be due to differences in the number of genes, as the correlation between genome size and number of protein-coding genes in fully sequenced prokaryotic genomes is extremely high. Considerable size variation is also seen in mtDNA genomes, which range from a low of 6000 bases to over 3 Mb. This is in contrast to the total number of mitochondrially encoded genes, which shows a 15-fold range, compared to the 500-fold range in the mitochondrial C value. Plastid genomes show a much smaller range in C values, and this variation is similar to the variation in gene number.

Eukaryotic nuclear genomes show an almost 80 000-fold range in genome size (from 3.5 Mb to over 700 000 Mb), with plants showing a 6000-fold range across different species, animals a 2000-fold range, and mammals a 4-fold range. This huge range has been referred to as the C-value paradox. While the minimal genome size increases with phylogenetic complexity (the smallest insect, fish, and mammal C values are 10, 38, and 1400 Mb, respectively), the upper limits within each group show no such correlation. For example, humans have 3400 Mb, while some species of pines have 68 000 Mb, lungfish have 140 000 Mb, certain ferns 160 000 Mb, and two species of amoeba have the largest genomes (300 000 and 670 000 Mb). The huge variation in C values is only partly due to differences in the number of genes, as there is only about a 50-fold variation among eukaryotes in the number of protein-coding genes. Rather, most of the variation is due to differences in the amount of nongenic DNA.
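
The fold ranges quoted above are simple ratios of the largest to the smallest genome size, and a few of them can be checked directly. The short sketch below does this for the prokaryotic and mitochondrial ranges and expresses some of the listed eukaryotic genomes as multiples of the human genome. The helper name `fold_range` and the dictionary labels are hypothetical; the sizes are the ones given in the text, with the prokaryotic lower bound read as 0.4 Mb.

```python
MB = 10**6  # bases per megabase


def fold_range(smallest, largest):
    """Return how many times larger `largest` is than `smallest`."""
    if smallest <= 0:
        raise ValueError("genome sizes must be positive")
    return largest / smallest


# Sizes in bases, taken from the figures quoted in the text.
prokaryote_fold = fold_range(400_000, 13 * MB)   # 0.4 Mb to 13 Mb
mtdna_fold = fold_range(6_000, 3 * MB)           # 6000 bases to 3 Mb

# Selected eukaryotic C values in Mb, relative to the human genome.
c_values_mb = {
    "human": 3400,
    "pine": 68_000,
    "lungfish": 140_000,
    "fern": 160_000,
    "amoeba": 670_000,
}
relative_to_human = {
    name: size / c_values_mb["human"] for name, size in c_values_mb.items()
}

print(f"prokaryotes: {prokaryote_fold:.1f}-fold")
print(f"mtDNA: {mtdna_fold:.0f}-fold")
for name, ratio in relative_to_human.items():
    print(f"{name}: {ratio:.1f} x human")
```

Comparing these ratios with the roughly 50-fold variation in eukaryotic gene number is what makes the mismatch between genome size and gene content apparent.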

The fraction of nongenic DNA in eukaryotes ranges from under 30% to over 99.9%, which (when differences in genome size are taken into account) translates into a 300 000-fold range in the total amount of nongenic DNA. Humans have around 10% genic DNA, implying that we carry in our genomes around 3000 Mb of DNA with no obvious cellular function. What is all this extra DNA doing? Some possible explanations follow from the fact that large fractions of nongenic DNA are repetitive DNAs of one sort or another.

Types of repetitive DNA

Repetitive DNAs are sequences that exist as multiple copies within a genome. Gene families, related genes coding for mRNAs or structural RNAs, are ubiquitous features of genomes, but the vast majority of repetitive sequences are nongenic. We can classify these by their genomic organization, in that many families of nongenic repeated sequences are clustered in one or a few localized repeats, while other families exist as dispersed repeats with their copies scattered around the genome. These organizations provide clues as to the origin and maintenance of such nongenic families.

Dispersed repeats are typically the result (directly or indirectly) of mobile genetic elements or transposons: DNA sequences that code for the ability to make additional copies of themselves at other genomic locations. These elements are genomic parasites. Transposons make up a large fraction of many eukaryotic genomes (they are also present in prokaryotes, but make up a much smaller fraction of the genome). For example, 36% of the human genome consists of sequences with similarity to known mobile genetic elements.
This figure is almost certainly an underestimate, as ancient families of mobile elements that are no longer active have decayed through mutation to the point of no longer being recognizable. Just two element types comprise almost 25% of the entire human genome: ALU elements (1 200 000 copies for 10% of the genome) and the longer LINE (long interspersed DNA element)-1 elements (600 000 copies for 15% of the genome).

Functionally, transposons can be classified into those that use an RNA intermediate to move about the genome (retrotransposons) and those that move entirely through DNA intermediates. The ALU and LINE elements in humans are both retrotransposons, but are evolutionarily unrelated. Full-length LINE elements encode the reverse transcriptase enzyme necessary for retrotransposition and use Pol II for transcription. ALU RNAs are Pol III transcripts and do not code for reverse transcriptase; rather, they rely on LINE elements to produce this enzyme. Thus, in a sense ALUs are parasitic on LINE elements, and both are parasitic in using other machinery from the human genome (such as the appropriate transcription factors) to make additional copies of themselves.

Examples of localized repeats include satellite DNAs of various types (the term following from the fact that such sequences form a band or satellite when whole genomic DNA is broken into small fragments and centrifuged at high speed). Such DNAs exist as moderate to very large numbers of tandem repeats of a common core sequence. Satellite DNAs are further divided into micro- and minisatellites depending on the length of the repeat unit, with microsatellites having a repeat unit of just a few bases while minisatellites have longer repeat units. The fraction of satellite DNA within a genome can be very large. For example, mammalian genomes typically consist of between 5% and 30% satellite DNAs, while in plants the figure is around 40%.
There are also more extreme cases, one example being the kangaroo rat. Here, over half the genome consists of families of just three basic repeats, with over two billion copies of a three-base repeat, two billion copies of a six-base repeat, and a billion copies of a ten-base repeat. Satellite DNAs typically do not code for RNAs (although there are rare exceptions), and they are generally found in genomic regions showing reduced recombination. They are probably generated by a combination of replication slippage (where DNA polymerase slips when copying a small repeat, generating extra copies) and incorrect recombination between multiple adjacent copies (unequal crossing-over). Other mechanisms of local gene amplification may also be involved.

Nongenic DNA: selfish, junk or structural?

Three not necessarily exclusive hypotheses have been proposed to account for the huge fraction, and variation therein, of nongenic DNA in eukaryotes. The selfish DNA hypothesis states that most nongenic DNA consists of sequences that exist solely to make additional copies of themselves (i.e. they are transposons). Such repetitive DNAs will spread even if there is some cost to the host. The junk DNA hypothesis states that much of the nongenic DNA is simply a byproduct of DNA replication, recombination and mutation (such as satellite DNAs). Rather than actively spreading copies through the genome (as under the selfish DNA model), nongenic DNA simply piles up like junk in a closet. This hypothesis assumes that the cost to the cell of carrying this extra DNA is negligible. Finally, the structural DNA hypothesis states that much of this nongenic DNA has important cellular functions. In particular, it has been argued that genome size influences cell size, and that selection for increased (or decreased) cell size can indirectly influence genome size.
At present, there is little evidence to support the structural DNA hypothesis, and the vast majority of nongenic DNA is thought to be either junk or selfish.

Concerted evolution of repeated sequences

One surprising observation is that gene family members within a species are often more similar to each other than they are to members from different species. This is especially true for nongenic repeats. This phenomenon is referred to as concerted evolution, as the individual repeats do not seem to evolve independently, but rather appear to evolve in concert within a species. Given that most nongenic DNA is not thought to be under selection, what mechanism(s) account for concerted evolution? The key is that several recombination-related processes allow for sequence exchange between repeats. One process is gene conversion, where one DNA region converts another region to its sequence. Conversion can occur between both localized and dispersed repeats. When repetitive sequences occur as tandem arrays, unequal crossing-over also results in some members of the array being over- or under-represented following the crossover event. Akin to genetic drift removing variation in a finite population, multiple rounds of gene conversion and/or unequal crossing-over result in a collection of repeats becoming more similar. These sequence exchange processes can thus produce concerted evolution even in the absence of any selection.

Expansion of genomes: the origin of new genes

One key feature in the evolution of genomes is the creation of new genes. Most new genes arise by duplication of existing genes, with the duplicate copies diverging and acquiring new functions. Exon shuffling is a variation on the theme of gene duplication, wherein the exons from two or more genes are joined (shuffled) together to create a new gene. While shuffling is an important mechanism for gene creation in eukaryotes, its role in the other domains of life is less clear.
Proponents of the introns-early school have argued that many early genes were formed by similar processes in the progenote, by shuffling between a small set of minigenes coding for different protein domains.

Not all duplications result in new genes, as one of the copies can acquire one or more inactivating mutations, becoming a pseudogene no longer under selection, so that its sequence signature decays away over time. Pseudogenes are common features of eukaryotic genomes, and are often found in clusters of related genes. Eukaryotic genomes also contain processed pseudogenes, which are reverse-transcribed mRNAs that have been inserted back into the genome. Such pseudogenes are inactive at the time of their formation, while the duplicate gene that eventually becomes a traditional pseudogene often remains functional for tens of millions of years before becoming inactivated.

While duplications can be local, involving one gene or a few genes, duplication of the entire genome also occurs. For example, the complete sequence of yeast (Saccharomyces) indicates that it has undergone a whole-genome duplication, but that only about 8% of the duplicated genes survived. There is also clear evidence for whole-genome duplication in many plants and in some bacteria. It has also been argued that two successive genome duplications occurred during the transition to the higher vertebrates, one between the invertebrates and the jawless fishes and the second occurring immediately after the jawless fishes. Consistent with this is the observation that several genes found as four copies in humans are present as two copies in amphibians and as single copies in insects. While there is still debate on whether these successive whole-genome duplications did indeed occur in vertebrates, the complete sequencing of the human and mouse genomes will go a long way towards resolving the issue.
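
Returning to the section on concerted evolution, the claim that repeated gene conversion homogenizes a repeat family without any selection can be illustrated with a small simulation. This is a deliberately crude sketch, not a published model: every repeat starts as a distinct variant, each conversion event picks a random donor and a random recipient, there is no mutation, and unequal crossing-over is ignored. The function name and all parameter values are hypothetical.

```python
import random


def simulate_gene_conversion(n_repeats=20, n_events=5000, seed=1):
    """Simulate random gene conversion among the repeats of one family.

    Each repeat is represented by an integer variant label. In every event a
    randomly chosen donor overwrites a randomly chosen recipient. Returns the
    final labels and the number of distinct variants after each event.
    """
    rng = random.Random(seed)           # seeded so that runs are repeatable
    repeats = list(range(n_repeats))    # start with every repeat different
    distinct = [len(set(repeats))]
    for _ in range(n_events):
        donor, recipient = rng.sample(range(n_repeats), 2)
        repeats[recipient] = repeats[donor]   # recipient is converted to the donor sequence
        distinct.append(len(set(repeats)))
    return repeats, distinct


final, distinct = simulate_gene_conversion()
print("distinct variants at start:", distinct[0])
print("distinct variants at end:  ", distinct[-1])
```

Because a conversion can remove a variant but never create one, the number of distinct variants can only fall over time, which mirrors the loss of variation by drift in a finite population.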