A new classification scheme of the genetic code

Thomas Wilhelm and Svetlana Nikolajewa
Institute of Molecular Biotechnology, Beutenbergstr. 11, 07745 Jena, Germany
Tel. +49 3641 656208, Fax +49 3641 656191
Email: wilhelm@imb-jena.de (corresponding author)

Abstract

Since the early days of the discovery of the genetic code non-random patterns have been searched for in the code in the hope of providing information about its origin and early evolution. Here we present a new classification scheme of the genetic code that is based on a binary representation of the purines and pyrimidines. This scheme reveals known patterns more clearly than the common one, for instance the classification of strong, mixed, and weak codons as well as the ordering of codon families. Furthermore, new patterns have been found that have not been described before: nearly all quantitative amino acid properties, such as Woese's polarity or the specific volume, show a perfect correlation to Lagerkvist's codon-anticodon binding strength. Our new scheme leads to new ideas about the evolution of the genetic code. It is hypothesized that it started with a binary doublet code and developed via a quaternary doublet code into the contemporary triplet code. Furthermore, arguments are presented against suggestions that a "simplet" code, where only the mid-base was informational, was at the origin of the genetic code.

Key Words: genetic code, origin of life, evolution, doublet code, pattern, amino acid properties

Introduction

Crick (1968) introduced the notion that the genetic code is simply the result of pure chance or a "frozen accident" and that it therefore does not need any further evolutionary explanation. Later, this view was questioned. Although certain knowledge of the origin and early stages of life is not likely to be obtained, there are some hints of possible evolutionary scenarios of the genetic code.
One direction of research (the "top-down approach" (Szathmary 1999)) analyzes patterns in the contemporary code (Knight and Landweber 1998, Szathmary 1999) and tries to infer appropriate chemical and selective forces. The bottom-up approach, on the other hand, is rooted in biochemistry and aims at constructing plausible scenarios for the origin of coding (Topal and Fresco 1976, Maizels and Weiner 1987, Szathmary 1993). It has been appreciated for a long time that the genetic code assigns similar amino acids to similar codons (Sonneborn 1965, Woese 1965, Zuckerkandl and Pauling 1965, Crick 1968). Two different rationales have been presented: first, mutation (Sonneborn 1965, Zuckerkandl and Pauling 1965) and translation (Woese 1967, Haig and Hurst 1991, Freeland and Hurst 1998) error minimization (or both (Ardell and Sella 2002)), and second, similar amino acids tend to directly interact with similar RNA sequences (Woese et al. 1966, Yarus 1998, 2000). Landweber and coworkers found further evidence to support both hypotheses. Extending previous work (Haig and Hurst 1991, Freeland and Hurst 1998), by quantifying amino acid similarity, these authors were able to show that "the canonical code is at or very close to a global optimum for error minimization" (Freeland et al. 2000). Based on the earlier work of Yarus (cf. Yarus 1998, 2000), by doing a statistical analysis of RNA aptamers (nucleic-acid molecules selected to bind specific ligands) they concluded that there is "the strongest support for an intrinsic affinity between any amino acid and its codons" (Knight and Landweber 1998). It has also been proposed that instead of the actual codons, some derivatives of them, such as the anticodons (Dunnill 1966, Jungck 1978) or codon-anticodon duplexes (Alberti 1997) were the original amino acid binding motifs. 
It could also be that the original amino acid recognition took place at the tRNA acceptor stem (Hopfield 1978) or that the specificity of aminoacylation is determined by the interaction of the tRNA synthetase with its tRNA (Weiner and Maizels 1987). Szathmary (1999) proposed that amino acid-RNA allocation took place even before the appearance of tRNA. He also gave a possible evolutionary scenario for the development of an anticodon hairpin to a longer structure with an operational code at the acceptor stem. Several patterns of the genetic code have been identified which can be illustrated within the classical scheme.

The common scheme of the genetic code

The common scheme of the genetic code (Alberts et al. 2002) arranges the 4^3 = 64 codons in a three-dimensional matrix where each dimension represents one of the three positions in the triplet code (Fig.1). Viewed this way, some patterns emerge: The first codon position seems to be correlated with amino acid biosynthetic pathways (Wong 1975, Taylor and Coates 1989), and to their evolution as evaluated by synthetic "primordial soup" experiments (Eigen 1977, Schwemmler 1994). The second position is correlated with the hydropathic properties of the amino acids (Crick 1968, Wolfenden et al. 1979, Taylor and Coates 1989), and the degeneracy of the third position could be related to the molecular weight or size of the amino acids (Hasegawa and Miyata 1980, Taylor and Coates 1989).

Lagerkvist (1978, 1981) divided the common illustration scheme (Fig.1) into a left part (containing the first and second column, i.e. U and C in the second position of the codon, respectively) and a right part (the third and fourth column, A and G in the second position). He observed that codon families (the amino acid of a codon family is uniquely determined by the first two nucleotides of a codon, cf. shaded regions in Fig.1) have a much higher probability to appear in the left part.
Furthermore, he found that "strong" codons (the first two nucleotides in the codon are G and/or C) always represent codon families, while "weak" codons (A and/or U as the first two nucleotides) never do so. "Mixed" codons in the right part of the scheme never represent codon families, whereas mixed codons in the left part always stand for a codon family. Lagerkvist (1978) speculated "that interactions between mixed codons and their anticodons are stronger in the left half of the codon square".

However, most amino acid properties show no clear pattern in the common scheme of the genetic code. Instead Jungck (1978) used 15 different quantitative measures of amino acid properties such as polarity or molecular volume to demonstrate that these properties are generally more closely correlated with anticodon than with codon dinucleoside monophosphate properties. This supports the hypothesis that the relationship between amino acids and their anticodon dinucleosides was the basis for the origin of the genetic code.

In this article we follow the "top-down approach" towards understanding the organization of the genetic code. We are thereby led to propose a new classification scheme for the code that helps us to identify new patterns which in turn suggest new speculations about its origin.

Results

A new classification scheme of the genetic code

Fig.2 shows our new scheme for presenting the genetic code. It is based on a binary classification of nucleic acid bases. The two components of all nucleic acids, purines and pyrimidines, are denoted by 1 and 0, respectively. The 8 rows in Fig.2 represent the 2^3 = 8 possible combinations of three binary digits. Since there are two purines (A,G) and two pyrimidines (U,C) for each row, there again exist 8 possibilities.

Our first observation is that 4 (and not 8) columns are sufficient to place all 20 amino acids, as well as the termination codons. Each row contains exactly 4 different amino acids (including the termination codon).
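The layout just described can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' own procedure: it assumes the standard code written in the usual UCAG ordering, and it assumes a row order (00, 10, 01, 11 for the first two positions) and a column order (strong, G/C-first mixed, A/U-first mixed, weak) that are inferred from the examples given in the text rather than read off Fig.2. All function and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code in the usual UCAG ordering ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed row order (inferred from the text): first two bits 00, 10, 01, 11.
ROW_ORDER = ["000", "001", "100", "101", "010", "011", "110", "111"]


def binary_pattern(codon: str) -> str:
    """Encode purines (A, G) as 1 and pyrimidines (U, C) as 0."""
    return "".join("1" if base in "AG" else "0" for base in codon)


def column_index(codon: str) -> int:
    """0 = strong, 1 and 2 = mixed, 3 = weak (uses the first two bases only)."""
    first_strong = codon[0] in "GC"
    second_strong = codon[1] in "GC"
    if first_strong and second_strong:
        return 0
    if first_strong:
        return 1
    if second_strong:
        return 2
    return 3


def build_scheme(code: dict[str, str]) -> dict[str, list[set[str]]]:
    """Return, for each binary row, four cells holding the encoded amino acids."""
    scheme = {row: [set() for _ in range(4)] for row in ROW_ORDER}
    for codon, aa in code.items():
        scheme[binary_pattern(codon)][column_index(codon)].add(aa)
    return scheme


if __name__ == "__main__":
    for row, cells in build_scheme(STANDARD_CODE).items():
        print(row, ["/".join(sorted(cell)) for cell in cells])
```

Each cell collects the two codons that share the first two bases and the purine/pyrimidine class of the third base. Under these assumptions, only the AU(A/G) and UG(A/G) cells hold two different entries in the standard code.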
In the standard code, exceptions are the second row with two leucines and, in the fourth row, the AU* start codon. Note that these are also positions at which non-standard codes deviate from the standard code. Interestingly, the yeast mitochondrial code shows no exception: each row contains exactly four different entries in four different columns. In this sense the yeast mitochondrial code is the most regular one.

The observation that four columns are sufficient in our scheme reflects the well-known fact that if the third position is important (in exactly half of our table this is not the case), then it is only decisive whether there is a purine (1) or a pyrimidine (0) (Fitch and Upper 1987), i.e. the third position is analyzed in a binary manner (Taylor and Coates 1989). This has been explained by Crick's wobble hypothesis (Crick 1966), according to which the first two nucleotides of the codon pair with their anticodon bases according to Watson-Crick rules, but the third base pairs according to the wobble rules, where G can also pair with U, for instance. The third codon position is exclusively analyzed in a binary manner in the mitochondrial codes of yeast, vertebrates, invertebrates, coelenterates and flatworms, as well as in the codes of mold, protozoan and mycoplasma/spiroplasma; for the other codes there are a few exceptions (cf. Elzanowski and Ostell 2000). Note that these few exceptions always have a purine at the third position of the codon (e.g. AUA (Ile) and AUG (Met) in the standard code).

Our scheme yields some support for the "adaptive genetic code" hypothesis (Freeland 2002) which states that the code has evolved to minimize the deleterious effects of mutation and translation error (Haig and Hurst 1991, Freeland and Hurst 1998). The purine-pyrimidine binary coding scheme, given in Fig.2, gives a much higher regularity than a binary coding according to the base pairs (A,U: 1; G,C: 0). This corresponds to the known fact that transition mutations (e.g. purine A vs.
purine G) occur more frequently than transversion mutations (e.g. purine A vs. pyrimidine U).

A second observation concerns the order of the columns. In the first column the first two positions are G and C. These always pair with their anticodon base via 3 hydrogen bonds, i.e. the first two bases together always guarantee 6 hydrogen bonds. For that reason Lagerkvist (1978) called them strong codons. In the second and third column, the first two bases guarantee exactly 5 bonds (mixed codons) and in the fourth column just 4 bonds (weak codons). This pattern corresponds very well to the importance of the third base in the triplet codon: if the first bases are G and/or C (first column), the third base is never important, and in the second and third column, the third base is important in exactly half of the cases (if there is a purine in the second position, i.e. in the lower half of the table). In the fourth column the third base is always necessary for the determination of the correct amino acid. In Fig.2, the order of codon families is illustrated by the shaded regions. It seems that for the first column, the first two bases alone guarantee sufficient stability in the codon-anticodon pairing to ensure the correct choice of the amino acid. In the case of mixed codons (second and third column) a codon family is guaranteed if there is a pyrimidine in the second position. Going beyond Lagerkvist's counting of hydrogen bonds, others provided some quantitative information about nucleotide binding strengths (Ornstein and Fresco 1983).

A third observation refers to two perfect symmetries in our scheme. The first is the codon-anticodon symmetry: the thick horizontal line in Fig.2 marks the symmetry axis. For instance, codon CCC (Pro, first column, first row) has the anticodon GGG (Gly, first column, last row), or codon ACG (Thr, third column, fourth row) has the anticodon UGC (Cys, third column, fifth row).
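The codon-anticodon symmetry can be checked exhaustively. The sketch below is illustrative only: it takes the anticodon as the position-wise Watson-Crick complement (as in the CCC/GGG and ACG/UGC examples above), and it assumes the same row and column ordering inferred from the text (rows 000, 001, 100, 101, 010, 011, 110, 111; columns strong, G/C-first mixed, A/U-first mixed, weak). The helper names are hypothetical.

```python
from itertools import product

# Assumed row order of the scheme, inferred from the examples in the text.
ROW_ORDER = ["000", "001", "100", "101", "010", "011", "110", "111"]
COMPLEMENT = str.maketrans("ACGU", "UGCA")
COLUMN_OF = {(True, True): 0, (True, False): 1, (False, True): 2, (False, False): 3}


def anticodon(codon: str) -> str:
    """Position-wise Watson-Crick complement (no strand reversal)."""
    return codon.translate(COMPLEMENT)


def position(codon: str) -> tuple[int, int]:
    """Zero-based (row, column) of a codon in the assumed layout."""
    bits = "".join("1" if base in "AG" else "0" for base in codon)
    strong = (codon[0] in "GC", codon[1] in "GC")
    return ROW_ORDER.index(bits), COLUMN_OF[strong]


def is_mirror_symmetric() -> bool:
    """True if every anticodon sits in the same column, mirrored across the middle."""
    for triplet in product("UCAG", repeat=3):
        codon = "".join(triplet)
        row, col = position(codon)
        if position(anticodon(codon)) != (7 - row, col):
            return False
    return True


if __name__ == "__main__":
    print(position("ACG"), position(anticodon("ACG")))
    print(is_mirror_symmetric())
```

The symmetry holds because complementation swaps purines with pyrimidines at every position (inverting all three binary digits) while leaving the strong/weak character of each base unchanged.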
The second perfect symmetry is the point symmetry corresponding to Halitsky's family-nonfamily symmetry operation ("E-M bifurcation", Halitsky 2003), indicated by the point in the center of Fig.2. Halitsky observed that all the 32 "family codons" CC*, CU*, UC*, GC*, GU*, AC*, CG*, GG* can be mapped into the 32 "nonfamily codons" UU*, AU*, CA*, UG*, UA*, GA*, AG*, AA* by exchanging the two amino bases A and C with one another, and the two keto bases U and G with one another. For instance, the family codon GUA (Val) is mapped into the nonfamily codon UGC (Cys). Thus, this point symmetry is behind the family-nonfamily symmetry in our scheme (shaded vs. unshaded regions).

A fourth observation concerns the deviations of non-standard genetic codes. As can be seen in Fig.2, nearly all deviations occur in codons with a purine at the third position. The only exception is the yeast mitochondrial code where CU* does not code for Leu, but rather for Thr.

Our fifth observation refers to hitherto unknown regularities of amino acid properties. Jungck (1978) collected 15 different measures of amino acid properties, as well as three measures for dinucleoside monophosphates. For all of these 18 measures we arranged a table with 8 rows and 4 columns corresponding to the scheme in Fig.2 (for AU(G/A) we took the Met values (e.g. vertebrate mitochondrial code), for UA(G/A) the Tyr values (mitochondrial flatworm code)). Then we analyzed all row and column sums. The row sums show a strong monotonicity only for the three dinucleoside monophosphate measures and for the hydrophobicity measure of Levitt. However, amazingly, the column sums of nearly all measures are perfectly correlated to the corresponding codon-anticodon binding strength in the sense of Lagerkvist (1978, 1981), in the following simply denoted as codon strength. This is demonstrated in Table 1. For this table we averaged the column sums of the second and third column, giving one "mixed codons" column.
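The monotonicity claim can be checked directly against the numbers in Table 1. The following sketch is only an illustration: the values are transcribed from Table 1 (strong, mixed, weak), the key prefixes "Dinucleoside:" and "Amino acid:" are added here merely to keep the labels unique, and "monotonic" is taken in the non-strict sense (ties allowed), which is an assumption about how the table is meant to be read.

```python
# Values transcribed from Table 1 as (strong, mixed, weak) codon columns.
TABLE_1 = {
    "Dinucleoside: Hydrophilicity (Weber & Lacey 1978)": (1.686, 1.434, 1.235),
    "Dinucleoside: Hydrophilicity (Barzilay et al. 1973)": (2.72, 2.26, 2.26),
    "Dinucleoside: Hydrophobicity (Garel et al. 1973)": (2.556, 3.413, 3.982),
    "Amino acid: Molec. Weight (Handbook value)": (907, 1065.6, 1217.5),
    "Amino acid: Molec. Volume (Grantham 1974)": (381, 637.5, 906),
    "Amino acid: Refractivity (Jones 1975)": (83.86, 140.03, 186.51),
    "Amino acid: Alpha pK1 (Zimmerman et al. 1968)": (16.96, 17.11, 17.43),
    "Amino acid: Bulkiness (Zimmerman et al. 1968)": (93.22, 124.345, 143.54),
    "Amino acid: Specific volume (McMeekin et al. 1964)": (5.26, 5.37, 5.8),
    "Amino acid: Polarity (Zimmerman et al. 1968)": (107.16, 109.58, 58.14),
    "Amino acid: Polarity (Woese et al. 1967)": (61.2, 59.15, 51),
    "Amino acid: Polarity (Grantham 1974)": (71.2, 67, 56.3),
    "Amino acid: Hydrophobicity (Jones 1975)": (9.18, 8.385, 16.93),
    "Amino acid: Hydrophobicity (Levitt 1976)": (-2.2, 1.6, 8.8),
    "Amino acid: Hydrophobicity (Bull & Breese 1974)": (3880, -165, -6790),
    "Amino acid: Hydrophilicity (Weber & Lacey 1978)": (7.02, 6.585, 5.59),
    "Amino acid: Partition coefficient (Garel et al. 1973)": (1.88, 5.58, 7.6),
    "Amino acid: Sequence Frequency (Jungck 1971)": (4280, 3522, 2966),
}


def is_monotonic(values: tuple) -> bool:
    """True if the sequence never increases or never decreases (ties allowed)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def non_monotonic_measures(table: dict) -> list[str]:
    """Names of measures whose strong/mixed/weak values are not monotonic."""
    return [name for name, values in table.items() if not is_monotonic(values)]


if __name__ == "__main__":
    for name in non_monotonic_measures(TABLE_1):
        print(name, TABLE_1[name])
```

Allowing ties matters for the Barzilay hydrophilicity row, where the mixed and weak values coincide.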
As can be seen in Table 1 there are just two exceptions. In the polarity measure of Zimmerman, the deviation is only very weak and, in contrast to all other measures, the values for the amino acids here vary by orders of magnitude. A problem only arises for the three hydrophobicity measures: The two monotonic measures "Levitt" and "BullBreese" are anticorrelated, and the "Jones" measure is not monotonic. The anticorrelation was already found by Jungck (1978), but he did not comment on this. The fact that the order of the second and third column is not fixed is also underlined by an individual consideration of the two mixed codon columns, instead of the averaging done in Table 1. In about half of the cases the order of the second and third column should be exchanged to guarantee the strong monotonicity of the amino acid measures as a function of the column number.

The observed pattern of strong correlation between amino acid properties and codon strength (which considers just the first two nucleotides) implies that both first positions together, and not the first or second position alone, must have been important for the amino acid-codon assignment in the evolution of the genetic code.

Evolution of the genetic code

What do the observed patterns tell us about the evolution of the genetic code? The so-called biosynthetic theory assumes that the genetic code evolved from a simpler form that encoded fewer amino acids (Crick 1968). A special version of this theory has been given by Wong (1975) who proposes that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. Although it has been shown that his analyses rest on wrong assumptions (Ronneberg et al. 2000), it is generally accepted that one can discriminate evolutionary old and new amino acids (Alberts et al. 2002).
Of course it could be that the binding allocation between nucleic acid molecules (RNAs or even PNAs (Knight and Landweber 2000b)) and amino acids did not start until all 20 amino acids were available; but it seems simpler to assume that as soon as there were amino acids and nucleic acids available (produced abiotically), both began to bind to each other. It now seems clear that "the code probably underwent a process of expansion from relatively few amino acids to the modern complement of 20" (Knight and Landweber 2000b).

Does our scheme yield some hints as to the evolution of the code? We already noted that the third nucleotide is nearly always (two exceptions in the standard code) analyzed just in a binary manner. Taking this for granted, we can reduce our original 8x8 scheme to an 8x4 scheme (shown in Fig.2). Looking at this scheme, we observe a high redundancy for each second row. Therefore, it is tempting to speculate that there was a period during code evolution where the third position was not needed at all. Assuming this, we can cancel each second row and are left with a pure doublet code that encodes 4x4=16 amino acids (or 15 plus a termination codon). Perhaps then, a doublet code preceded the triplet code, as has already been speculated (Jukes 1973, Hayes 1998). Conceivably, codon expansion from doublet to triplet could have arisen before this, or possibly not until all 16 amino acids were encoded. If one assumes the latter, then it is interesting to postulate for each doublet the corresponding old amino acid. Met (Wong 1975), Trp, Gln, Asn (Knight and Landweber 2000b), and Tyr (Alberts et al. 2002) seem to be newer amino acids. As mentioned above, Szathmary (1999) proposed an evolutionary mechanism of tRNA formation. In principle, this mechanism could also work starting with doublets instead of triplets.
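To make the reduction to a doublet code concrete, the sketch below collapses the standard code onto its first two bases and relates the result to the hydrogen-bond counting described earlier (3 bonds for G or C, 2 for A or U). It is a hedged illustration: the standard code is written in the usual UCAG ordering, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in the usual UCAG ordering ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def doublet_table(code: dict[str, str]) -> dict[str, set[str]]:
    """Map each of the 16 doublets to the set of entries its four codons encode."""
    table: dict[str, set[str]] = {}
    for codon, aa in code.items():
        table.setdefault(codon[:2], set()).add(aa)
    return table


def guaranteed_h_bonds(doublet: str) -> int:
    """Hydrogen bonds guaranteed by the first two bases (G/C: 3, A/U: 2)."""
    return sum(3 if base in "GC" else 2 for base in doublet)


def is_family(doublet: str, table: dict[str, set[str]]) -> bool:
    """A doublet is a codon family if the third base never changes the amino acid."""
    return len(table[doublet]) == 1


if __name__ == "__main__":
    table = doublet_table(STANDARD_CODE)
    for doublet in sorted(table, key=guaranteed_h_bonds, reverse=True):
        label = "family" if is_family(doublet, table) else "split"
        print(doublet, guaranteed_h_bonds(doublet), label, sorted(table[doublet]))
```

In the standard code this reproduces the pattern described in the text: doublets with 6 guaranteed bonds are always families, those with 4 never are, and those with 5 are families exactly when the second base is a pyrimidine.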
It should be possible to gain experimental evidence for a doublet code by studying amino acid-nucleic acid doublet binding in the same way as has been done for triplets. Knight and Landweber (2000a) showed that Arg triplet codons alone significantly associate with arginine binding sites. Perhaps the doublets show a higher specificity.

However, by proposing a doublet code one faces the frameshifting problem. It seems unthinkable that a sudden transition from a two-letter to a three-letter frame ever occurred. Instead, one can imagine a gradual evolution with an ancient three-letter reading frame where just the first two letters were analyzed by an ancient translation machinery. However, one then wonders about such inefficient use of coding space. Perhaps, simply for stereochemical reasons, the ancient translation machinery could not analyze a two-letter frame. In this context it is also interesting to note that even our contemporary code is somehow 'inefficient': a quaternary doublet code can already encode 16 amino acids (or 15 plus a termination codon). A third letter is necessary for just four (or five) further amino acids. Of course, this inefficiency has the advantage of robustness-enhancing redundancy.

Szathmary (1992, 2003) proposed a model which yields the result that two different base pairs represent an optimal compromise between the overall copying fidelity and an overall reproduction rate (metabolic efficiency). He assumed that the genetic code was developed before evolution invented proofreading. For higher copying fidelity (due to proofreading, etc.), the model predicts that three different base pairs are better than just two. It is tempting to speculate that in the earliest phases of biological evolution with the lowest copying fidelity just one base pair could have worked as well (The copying fidelity is always highest for just one base pair.
Nevertheless, Szathmary's simple model gives no one base pair optimum, but a more detailed model for the metabolic efficiency could do so.). So, perhaps, nucleic acid-amino acid mapping started with a binary code. This is in accordance with earlier speculations that the first genetic material contained only a single base-pairing unit (Crick 1968, Orgel 1968). An important argument in this context is the chemical instability of cytosine, so that it may be difficult to establish a genetic system with G-C base pairing (Levy and Miller 1998). Wächtershäuser (1988) proposed an all-purine precursor of nucleic acids. However, for the sake of self-replication it is more natural to assume a two-letter code that can give rise to complementary base pairing. Jimenez-Sanchez (1995) argued for an early (binary) A-U coding. Recently, a ribozyme composed of only two different nucleotides has been found by in vitro evolution that contained the pyrimidine uracil and the purine 2,6-diaminopurine (Reader and Joyce 2002). Note that uracil is the biosynthetic precursor of the pyrimidines cytosine and thymine (the corresponding precursor of the purines adenine and guanine is hypoxanthine). Of course, a binary encoding also would be the most aesthetic version from a purely mathematical point of view.

A binary triplet code would represent just one column in our scheme (Fig.2). Given the high redundancy between the rows, it is unlikely that this ever happened. However, an even simpler coding, a binary doublet code, seems conceivable. It is tempting to speculate which four amino acids, one per two consecutive rows, were the first encoded ones. In the first two rows (two pyrimidines, i.e. 00) Ser seems to be the oldest amino acid, and in the third and fourth row (10) Ala (Wong 1975). On the other hand the 01 rows obviously contain no really old amino acid while the 11 rows contain more than one: Gly, Asp, Glu (Wong 1975).
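The candidate sets for each binary doublet can be listed by grouping the standard code according to the purine/pyrimidine class of the first two bases. This is a small illustrative sketch under the same assumptions as before (standard code in UCAG ordering, purine = 1, pyrimidine = 0); it only enumerates what the contemporary code places under each binary doublet and says nothing about which amino acid is actually older.

```python
from itertools import product

# Standard genetic code in the usual UCAG ordering ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def binary_doublet(codon: str) -> str:
    """Purine/pyrimidine pattern (1/0) of the first two bases."""
    return "".join("1" if base in "AG" else "0" for base in codon[:2])


def binary_doublet_groups(code: dict[str, str]) -> dict[str, set[str]]:
    """Collect, for 00, 01, 10 and 11, every entry encoded under that pattern."""
    groups: dict[str, set[str]] = {}
    for codon, aa in code.items():
        groups.setdefault(binary_doublet(codon), set()).add(aa)
    return groups


if __name__ == "__main__":
    for pattern, entries in sorted(binary_doublet_groups(STANDARD_CODE).items()):
        print(pattern, "".join(sorted(entries)))
```

In one-letter notation, Ser (S) appears under 00, Ala (A) under 10, and Gly, Asp and Glu (G, D, E) under 11, matching the assignments discussed above; both stop codon doublets fall under 01.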
One could speculate that the termination marker was important from the very beginning and resulted in coding by the 01 binary doublet. It has been noted that the five amino acids coded by G** (Ala, Val, Gly, Asp, Glu) are all at or near the head of the amino acid synthesis pathways (Taylor and Coates 1989) and also the most abundantly formed ones in abiotic synthesis experiments (Miller 1953, 1987). Furthermore, it has been shown recently by extensive statistical analyses that the frequencies of all five G** amino acids are significantly greater in evolutionary conserved residues and it has been concluded that "these amino acids may have been the first introduced into the genetic code" (Brooks and Fresco 2002, 2003, Brooks et al. 2002). This is also consistent with physicochemical arguments proposing that the first sense codons had the form G** (Eigen and Schuster 1978). However, Gly is biochemically built from Ser, so Ser can be assumed as prior. It could be that in the beginning of nucleic acid-amino acid assignment Asp and Glu competed for the 11 doublet. Of course, code transfer from one amino acid to another one might also have occurred (Wong 1975).

Another scenario consistent with a binary doublet code has been given by Fitch's "ambiguity reduction" hypothesis (Fitch and Upper 1987). It states that early in evolution there was an ambiguity in the charging of amino acids to anticodon acceptors: in a first step just *pyrimidine* codons (*0*), coding for hydrophobic amino acids, and *purine* codons (*1*), coding for hydrophilic amino acids, were distinguished (binary singlet code). In a second step the more refined binary doublet code (00*, 01*, 10*, 11*) evolved. The idea that the doublet code was just the second state in the evolution of the genetic code and that this evolution started with just the mid-base as coding, has been worked out by others, who termed it "simplet" code (McClendon 1986, Schwemmler 1994).
However, in this hypothesis both old amino acids Ser (UC*) and Ala (GC*), as well as Asp (GA*) and Glu (GA*), cannot be discriminated. We therefore suggest that the first two positions were equally important from the very beginning. Although our suggestion also does not allow discrimination between the related amino acids Asp and Glu, it nevertheless allows discrimination between the functionally divergent amino acids Ser and Ala. A further argument for the evolutionary importance of the first two nucleotides is the strong correlation observed between codon strength and the amino acid properties.

Conclusion

Taylor and Coates (1989) stated that "Many parts of the patterns (of the genetic code) have been seen by others but ... it is the synthesis that adds up to the most interesting ...new insights." In this spirit, we note that in the work presented here different patterns appear more clearly than in the common scheme of the genetic code. An example is Lagerkvist's (1978) observation that all strong codons represent codon families while weak codons do not. Mixed codons represent codon families in half of the cases. Our presentation of the code also highlights new patterns, which were not seen before. As summarized in Table 1, nearly all measures of the amino acid properties strongly correlate with the codon strengths. Furthermore, there is a perfect codon-anticodon symmetry as well as a point symmetry corresponding to the family-nonfamily symmetry operation (Halitsky 2003) in the scheme presented here.

With regard to evolution, we hypothesize that codon assignments started from a binary doublet code (e.g., hypoxanthine and uracil) and developed later to a quaternary doublet code (A, G, C, U); thereafter, expansion to a triplet code took place. Although the third position is needed for correct amino acid recognition, it is still nearly always analyzed in a binary manner.
The conclusion that code evolution must have started with doublets and not with a single letter is also underlined by the correlation observed here between properties of amino acids and the codon strengths. Acknowledgments. We thank two anonymous reviewers for many valuable comments and refering us to relevant literature, and A. Beyer, F. Grosse and M.-L. Merten for critical reading of the manuscript. This work was supported by Grant 0312704E of the Bundesministerium für Bildung und Forschung. References Alberti S (1997) The origin of the genetic code and protein synthesis. J. Mol. Evol. 45:352-358 Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P (2002) Molecular Biology of the Cell. Garland Science, NY Ardell DH, Sella G (2002) No accident: genetic codes freeze in error-correcting patterns of the standard genetic code. Phil. Trans. R. Soc. Lond. B 357:1625-1642 Brooks DJ, Fresco JR (2002) Increased frequency of cysteine, tyrosine, and phenylalanine residues since the last universal ancestor. Mol. Cell. Prot. 1.2:125-131 Brooks DJ, Fresco JR, Lesk AM, Singh M (2002) Evolution of amino acid frequencies in proteins over deep time: inferred order of introduction of amino acids into the genetic code. Mol. Biol. Evol. 19(10):1645-1655 Brooks DJ, Fresco JR (2003) Greater GNN pattern bias in sequence elements encoding conserved residues of ancient proteins may be an indicator of amino acid composition of early proteins. Gene 303:177-185 Crick FHC (1966) Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19:548-555 Crick FHC (1968) The origin of the genetic code. J. Mol. Biol. 38:367-379 Dunnill P (1966) Triplet nucleotide ­ amino acid pairing: a stereochemical basis for the division between protein and nonprotein amino acids. Nature 210:1267-1268 Eigen M (1977) The hypercycle. A principle of natural self-organization. Part A: Emergence of the hypercycle. 
Naturwissenschaften 64:541-565 Eigen M, Schuster P (1978) The hypercycle: a principle of natural self-organization. Naturwissenschaften 65:341-368 Elzanowski A, Ostell J (2000) Genetic codes. http://www3.ncbi.nlm.nih.gov/htbin- post/Taxonomy/wprintgc?mode=t#SG1 Fitch WM, Upper K (1987) The phylogeny of tRNA sequences provides evidence for ambiguity reduction in the origin of the genetic code. Cold Spring Harbor Symposia on Quantitative Biology 52:759-767 Freeland SJ (2002) The Darwinian genetic code: An adaptation for adapting? Genetic Progamming and Evolvable Machines 3:113-127 Freeland SJ, Hurst LD (1998) The genetic code is one in a million. J. Mol. Evol. 47:238-248 Freeland SJ, Knight RD, Landweber LF, Hurst LD (2000) Early fixation of an optimal genetic code. Mol. Biol. Evol. 17:511-518 Haig D, Hurst LD (1991) A quantitative measure of error minimization in the genetic code. J. Mol. Evol. 33:412-417 9 Halitsky D (2003) Extending the (hexa-)rhombic dodecahedral model of the genetic code: the code's 6-fold degeneracies and the orthogonal projections of the 5-cube as 3-cube. Contributed paper (983-92-151), American Mathematical Society; and personal communication Hasegawa M, Miyata T (1980) On the antisymmetry of the amino acid code table. Orig. Life 10:265-270 Hayes B (1998) The invention of the genetic code. Amer. Scientist 86:8-14 Hopfield JJ (1978) Origin of the genetic code: a testable hypothesis based on tRNA structure, sequence, and kinetic proofreading. Proc. Natl. Acad. Sci. USA 75:4334-4338 Jimenez-Sanchez, A (1995) On the origin and evolution of the genetic code. J. Mol. Evol. 41:712-716 Jukes, TH (1973) Possibilities for the evolution of the genetic code from a preceding form. Nature 246:22-26 Jungck JR (1978) The genetic code as a periodic table. J. Mol. Evol. 11:211-224 Knight RD, Landweber LF (1998) Rhyme or reason: RNA-arginine interactions and the genetic code. Chem. & Biol. 
5:R215-R220 Knight RD, Landweber LF (2000a) Guilt by association: the arginine case revisited. RNA 6:499-510 Knight RD, Landweber LF (2000b) The early evolution of the genetic code. Cell 101:569-572 Knight RD, Freeland SJ, Landweber LF (2001) Rewiring the keyboard: evolvability of the genetic code. Nat. Rev. Gen. 2:49-58 Lagerkvist U (1978) "Two out of three": An alternative method for codon reading. Proc. Natl. Acad. Sci. USA 75:1759-1762 Lagerkvist U (1981) Unorthodox codon reading and the evolution of the genetic code. Cell 23:305-306 Levy M, Miller SL (1998) The stability of the RNA bases: implications for the origin of life. Proc. Natl. Acad. Sci. USA 95:7933-7938 Maizels N, Weiner AM (1987) Peptide-specific ribosomes, genomic tags, and the origin of the genetic code. Cold Spring Harbor Symposia on Quantitative Biology 52:743-749 McClendon JH (1986) The relationship between the origins of the biosynthetic paths to the amino acids and their coding. Origins Life 16:269-270 Miller SL (1953) Production of amino acids under possible primitive earth conditions. Science 117:528-529 Miller SL (1987) Which organic compounds could have occurred on the prebiotic earth? Cold Spring Harbor Symposia on Quantitative Biology 52:17-27 Orgel LE (1968) Evolution of the genetic apparatus. J. Mol. Biol. 38:381-393 Ornstein RL, Fresco JR (1983) Correlation of Tm, sequence, and H of complementary RNA helices and comparison with DNA helices. Biopolymers 22:2001-2016 Osawa S, Jukes TH, Watanabe K, Muto A (1992) Recent evidence for the evolution of the genetic code. Microbiol. Rev. 56(1):229-264 Reader JS, Joyce GF (2002) A ribozyme composed of only two different nucleotides. Nature 420:841-844 10 Ronneberg TA, Landweber LF, Freeland SJ (2000) Testing a biosynthetic theory of the genetic code: fact or artifact? Proc. Natl. Acad. Sci. USA 97:13690-13695 Schwemmler W (1994) Reconstruction of cell evolution: A periodic system of cells. 
CRC Press, Boca Raton, FL Sonneborn TM (1965) Degeneracy of the genetic code: extent, nature, and genetic implications. In: Bryson V, Vogel HJ (eds) Evolving Genes and Proteins. Academic Press, NY, pp. 297-377 Szathmary E (1992) What is the optimum size for the genetic alphabet? Proc. Natl. Acad. Sci. USA 89:2614-2618 Szathmary E (1993) Coding coenzyme handles: A hypothesis for the origin of the genetic code. Proc. Natl. Acad. Sci. USA 90:9916-9920 Szathmary E (1999) The origin of the genetic code. TIG 15:223-229 Szathmary E (2003) Why are there four letters in the genetic alphabet? Nat. Rev. Gen. 4:995- 1001 Thanbichler M, Böck A (2002) The function of SECIS RNA in translational control of gene expression in Escherichia coli. EMBO J. 21:6925-6934 Taylor FJR, Coates D (1989) The code within the codons. BioSystems 22:177-187 Topal MD, Fresco JR (1976) Base pairing and fidelity in codon-anticodon interaction. Nature 263:289-293 Wächtershäuser G (1988) An all-purine precursor of nucleic acids. Proc. Natl. Acad. Sci. USA 85:1134-1135 Weiner AM, Maizels N (1987) tRNA-like structures tag the 3' ends of genomic RNA molecules for replication: Implications for the origin of protein synthesis. Proc. Natl. Acad. Sci. USA 84:7383-7390 Woese CR (1965) On the evolution of the genetic code. Proc. Natl. Acad. Sci. USA 54:1546- 1552 Woese CR (1967) The genetic code: The molecular basis for Genetic Expression. Harper & Row, NY Woese CR, Dugre DH, Saxinger WC, Dugre SA (1966) The molecular basis for the genetic code. Proc. Natl. Acad. Sci. USA 55:966-974 Wolfenden RV, Cullis PM, Southgate CCF (1979) Water, protein folding, and the genetic code. Science 206:575-577 Wong JT-F (1975) A co-evolution theory of the genetic code. Proc. Natl. Acad. Sci. USA 72:1909-1912 Yarus M (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code's origin. J. Mol. Evol. 47:109-117 Yarus M (2000) RNA-ligand chemistry: a testable source for the genetic code. 
RNA 6:475-484
Zuckerkandl E, Pauling L (1965) Evolutionary divergence and convergence in proteins. In: Bryson V, Vogel HJ (eds) Evolving Genes and Proteins. Academic Press, NY, pp. 97-167

Measure                                     Strong codons   Mixed codons    Weak codons

Dinucleoside monophosphates
Hydrophilicity (Weber & Lacey 1978)         1.686           1.434           1.235
Hydrophilicity (Barzilay et al. 1973)       2.72            2.26            2.26
Hydrophobicity (Garel et al. 1973)          2.556           3.413           3.982

Amino acids
Molecular weight (Handbook value)           907             1065.6          1217.5
Molecular volume (Grantham 1974)            381             637.5           906
Refractivity (Jones 1975)                   83.86           140.03          186.51
Alpha pK1 (Zimmerman et al. 1968)           16.96           17.11           17.43
Bulkiness (Zimmerman et al. 1968)           93.22           124.345         143.54
Specific volume (McMeekin et al. 1964)      5.26            5.37            5.8
Polarity (Zimmerman et al. 1968)            107.16          109.58          58.14
Polarity (Woese et al. 1967)                61.2            59.15           51
Polarity (Grantham 1974)                    71.2            67              56.3
Hydrophobicity (Jones 1975)                 9.18            8.385           16.93
Hydrophobicity (Levitt 1976)                -2.2            1.6             8.8
Hydrophobicity (Bull & Breese 1974)         3880            -165            -6790
Hydrophilicity (Weber & Lacey 1978)         7.02            6.585           5.59
Partition coefficient (Garel et al. 1973)   1.88            5.58            7.6
Sequence frequency (Jungck 1971)            4280            3522            2966

Table 1 Correlation of codon strength and amino acid properties. Averaged values (per column, in our scheme of Fig. 2) of quantified dinucleoside monophosphate properties (codon and anticodon values give the same average because of the codon-anticodon symmetry) and amino acid properties for strong, mixed, and weak codons. Each row represents one of the measures published by Jungck (1978); Table 1 of that paper contains all detailed references as well as a short note on the determination procedure.

Figure Legends

Figure 1 The common presentation of the standard ("universal") genetic code. All deviations from this code (Elzanowski and Ostell 2000) are thought to be the result of later mutations (Osawa et al. 1992, Knight and Landweber 2000b, Knight et al. 2001). Shaded regions show codon families.
Figure 2 A new classification scheme of the standard genetic code based on a binary representation of purines (1) and pyrimidines (0). The third base is given in parentheses. Where the standard code differs from any other code, the number of deviations from the standard code is indicated. This comparison is based on 16 non-standard codes (Elzanowski and Ostell 2000). For instance, in the UG(G/A) field, 0/9 indicates that UGG encodes Trp in all codes, but UGA is not the termination codon in 9 of the 16 non-standard codes: in 8 different mitochondrial codes UGA encodes Trp, and in the euplotid nuclear code it represents Cys. Interestingly, in at least some bacteria the 21st amino acid, selenocysteine, can also be encoded by UGA (Osawa et al. 1992, Thanbichler and Böck 2002). Another example is the CU(G/A) field: in the yeast mitochondrion CUG and CUA encode Thr, and in the alternative yeast nuclear code CUG represents Ser. Shaded regions show codon families. The point in the center indicates the perfect point symmetry in this scheme, according to Halitsky's family-nonfamily symmetry operation (Halitsky 2003). The thick horizontal line marks the symmetry axis for codon-anticodon symmetry.
                                 Second letter
First letter   U          C          A          G          Third letter
U              UUU Phe    UCU Ser    UAU Tyr    UGU Cys    U
               UUC Phe    UCC Ser    UAC Tyr    UGC Cys    C
               UUA Leu    UCA Ser    UAA Stop   UGA Stop   A
               UUG Leu    UCG Ser    UAG Stop   UGG Trp    G
C              CUU Leu    CCU Pro    CAU His    CGU Arg    U
               CUC Leu    CCC Pro    CAC His    CGC Arg    C
               CUA Leu    CCA Pro    CAA Gln    CGA Arg    A
               CUG Leu    CCG Pro    CAG Gln    CGG Arg    G
A              AUU Ile    ACU Thr    AAU Asn    AGU Ser    U
               AUC Ile    ACC Thr    AAC Asn    AGC Ser    C
               AUA Ile    ACA Thr    AAA Lys    AGA Arg    A
               AUG Met    ACG Thr    AAG Lys    AGG Arg    G
G              GUU Val    GCU Ala    GAU Asp    GGU Gly    U
               GUC Val    GCC Ala    GAC Asp    GGC Gly    C
               GUA Val    GCA Ala    GAA Glu    GGA Gly    A
               GUG Val    GCG Ala    GAG Glu    GGG Gly    G

Figure 1

Code   Strong codons           Mixed codons            Mixed codons            Weak codons
       6 hydrogen bonds        5 hydrogen bonds        5 hydrogen bonds        4 hydrogen bonds
                               (first G or C)          (first U or A)
000    Pro CC (C/U)            Leu CU (C/U) 1/1        Ser UC (C/U)            Phe UU (C/U)
001    Pro CC (G/A)            Leu CU (G/A) 1/2        Ser UC (G/A) 1/0        Leu UU (G/A) 1/0
100    Ala GC (C/U)            Val GU (C/U)            Thr AC (C/U)            Ile AU (C/U)
101    Ala GC (G/A)            Val GU (G/A)            Thr AC (G/A)            Met/Ile AU (G/A) 5/0
010    Arg CG (C/U)            His CA (C/U)            Cys UG (C/U)            Tyr UA (C/U)
011    Arg CG (G/A)            Gln CA (G/A)            Trp/Stop UG (G/A) 9/0   Stop UA (G/A) 2/4
110    Gly GG (C/U)            Asp GA (C/U)            Ser AG (C/U)            Asn AA (C/U)
111    Gly GG (G/A)            Glu GA (G/A)            Arg AG (G/A) 6/6        Lys AA (G/A) 3/0

Figure 2
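The layout of Figure 2 can be reproduced mechanically from two properties of each codon. The following Python sketch is an illustration, not the authors' own procedure: it assumes that the row is the purine (1) / pyrimidine (0) pattern of the three bases and that the column follows from the Watson-Crick hydrogen bonds of the first two bases (3 for G or C, 2 for A or U). The names `binary_code`, `doublet_bonds`, `column_index`, and `build_scheme` are hypothetical helpers introduced here; amino acid assignments and deviation counts are not modeled.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
PURINES = {"A", "G"}  # encoded as 1; pyrimidines (C, U) as 0
# Assumption: Watson-Crick pairing, G-C with 3 and A-U with 2 hydrogen bonds
H_BONDS = {"G": 3, "C": 3, "A": 2, "U": 2}

# Row order as it appears in Figure 2
ROW_ORDER = ["000", "001", "100", "101", "010", "011", "110", "111"]
COLUMN_LABELS = [
    "strong (6 H-bonds)",
    "mixed (5 H-bonds, first G or C)",
    "mixed (5 H-bonds, first U or A)",
    "weak (4 H-bonds)",
]


def binary_code(codon):
    """Return the purine/pyrimidine pattern of a codon, e.g. 'AUG' -> '101'."""
    return "".join("1" if base in PURINES else "0" for base in codon)


def doublet_bonds(codon):
    """Hydrogen bonds formed by the first two codon positions (4, 5, or 6)."""
    return H_BONDS[codon[0]] + H_BONDS[codon[1]]


def column_index(codon):
    """Map a codon to one of the four columns of the scheme (0..3)."""
    bonds = doublet_bonds(codon)
    if bonds == 6:
        return 0
    if bonds == 4:
        return 3
    # Mixed doublet: the column depends on whether the first base is G/C
    return 1 if codon[0] in "GC" else 2


def build_scheme():
    """Group all 64 codons by (binary row code, column index)."""
    cells = defaultdict(list)
    for bases in product(BASES, repeat=3):
        codon = "".join(bases)
        cells[(binary_code(codon), column_index(codon))].append(codon)
    return dict(cells)


if __name__ == "__main__":
    scheme = build_scheme()
    for row in ROW_ORDER:
        cells = [",".join(sorted(scheme[(row, col)])) for col in range(4)]
        print(row, " | ".join(cells))
```

Under these assumptions each of the 32 cells holds exactly two codons that differ only in the third base, which matches the (C/U) and (G/A) pairs listed in the figure.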