Molecules, Languages and Automata
David B. Searls
Lower Gwynedd, PA 19454, USA
Abstract. Molecular biology is full of linguistic metaphors, from the
language of DNA to the genome as “book of life.” Certainly the organization
of genes and other functional modules along the DNA sequence
invites a syntactic view, which can be seen in certain tools used in bioinformatics
such as hidden Markov models. It has also been shown that
folding of RNA structures is neatly expressed by grammars that require
expressive power beyond context-free, an approach that has even been
extended to the much more complex structures of proteins. Processive
enzymes and other “molecular machines” can also be cast in terms of
automata. This paper brieﬂy reviews linguistic approaches to molecular
biology, and provides perspectives on potential future applications of
grammars and automata in this ﬁeld.
1 Introduction
The terminology of molecular biology from a very early point adopted linguistic
and cryptologic tropes, but it was not until some two decades ago that serious
attempts were made to apply formal language theory in this ﬁeld. These included
eﬀorts to model both the syntactic structure of genes, reﬂecting their hierarchical
organization, and the physical structure of nucleic acids such as DNA and RNA,
where grammars proved suitable for representing folding patterns in an abstract
manner. In the meantime, it was also recognized that automata theory could be
a basis for representing some of the key string algorithms used in the analysis of
macromolecular sequences. These varied approaches to molecular biology are all
bound together by formal language theory, and its close relationship to automata
theory. In reviewing these approaches, and discussing how they may be extended
in new directions within biology, we hope to demonstrate the power of grammars
as a uniform, computer-readable, executable speciﬁcation language for biological
knowledge.
2 Structural Grammars
Nucleic acids are polymers of four bases, and are thus naturally modeled as
languages over the corresponding alphabets, which for DNA comprise the wellknown
set Σ = {a, c, g, t}. (RNA bases are slightly diﬀerent, but for all practical
purposes can be treated the same.) DNA, which carries the genetic information
in our chromosomes, tends to form itself into a double helix with two strands
J.M. Sempere and P. Garc´ıa (Eds.): ICGI 2010, LNAI 6339, pp. 5–10, 2010.
c Springer-Verlag Berlin Heidelberg 2010
6 D.B. Searls
that are held together by a complementary pairing of the opposing bases, ‘a’
with ‘t’ and ‘g’ with ‘c’. RNA molecules are more often single-stranded, though
they can fold back on themselves to form regions of double-stranded structure,
called secondary structure.
Given these bare facts, the language of all possible RNA molecules is speciﬁed
by the following trivial grammar (with S the start symbol and the empty string,
as usual):
S → xS | for each x ∈ {a, c, g, u} (1)
It would seem to be natural to specify DNA helices, which comprise two strands,
as a pair of strings, and a DNA language as a set of such pairs. However, DNA
has an additional important constraint, in fact two: ﬁrst, the bases opposing one
another are complementary, and second, the strings have directionality (which
is chemically recognizable) with the base-paired strands running in opposite
directions.
We have previously shown how this sort of model can be extended to describe
a number of speciﬁc phenomena in RNA secondary structure, and in fact a
simple addition to the stem-and-loop grammar allows for arbitrarily branching
secondary structures:
S → xS¯x | SS | where ¯g=c, ¯c=g, ¯a=t, ¯t=a (2)
Examples of such branching secondary structure include a cloverleaf form such
as is found in transfer RNA or tRNA, an important adaptor molecule in the
translation of genetic information from messenger RNA to protein. The language
of (2) describes what is called orthodox secondary structure, which for our
purposes can be considered to be all fully base-paired structures describable by
context-free grammars.
There are, however, secondary structures that are beyond context-free, the
archetype of which are the so-called pseudoknots. Pseudoknots can be conceived
as a pair of stem-loop structures, one of whose loops constitutes one side of
the others stem. The corresponding (idealized) language is of the form uv¯uR
¯vR
,
which cannot be expressed by any context-free grammar. It is sometimes described
as the intersection of two context-free palindromes of the form uv¯uR
and
v¯uR
¯vR
, but of course context-free languages are not closed under intersection.
Pseudoknots and other non-context-free elements of the language of secondary
structure can be easily captured with context-sensitive grammars, but the resulting
complex movements of nonterminals in sentential forms tend not to enlighten.
Rather, grammars with more structured rules, such as Tree-Adjoining
Grammars (TAG), have been more proﬁtably used for this purpose [6].
A grammar variation that the author has proposed describes even more complex,
multi-molecular base-paired complexes of nucleic acids [4]. This is enabled
by the addition to any grammar of a new symbol δ which is understood to cut
the string at the point it appears. This means that derivations ultimately give
rise not to strings but to sets of strings arising from such cuts, which may be
base-paired in arbitrarily ramiﬁed networks.
Molecules, Languages and Automata 7
Proteins are more complex macromolecular structures with several kinds of
intermolecular interactions. Some of the basic types of such recurrent structural
themes have been described with a variety of grammars [6].
3 Gene Grammars
Genes, which are encoded in the DNA of organisms, have a hierarchical organization
to them that is determined by the process by which they are converted into
proteins (for the most part). Genes are ﬁrst transcribed into messenger RNA,
or mRNA, which constitutes a complementary copy of the gene, and then this
is translated into protein. The latter step requires the DNA/RNA code to be
adapted to that of proteins, whose alphabet comprises the twenty amino acids.
This encoding is called the genetic code, which appears as a table of triplets of
bases mapped to amino acids.
Transcription itself involves a number of complications regarding the structure
of genes, such as the fact that the actual coding sequence is interrupted
by segments that are spliced out at an intermediate step, establishing what is
called the intron/exon structure of the gene. In addition there are many signal
sequences embedded in the gene, including in ﬂanking non-coding regions, that
determine such things as the starting point of transcription, the conditions under
which transcription will occur, and the points at which splicing will occur.
The author has demonstrated how grammars can eﬀectively capture all these
features of genes, including ambiguities such as alternative splicing whereby different
versions of genes may arise from the same genome sequence [1]. Such grammars
have been used to recognize the presence of genes in raw sequence data by
means of parsing, in what amounts to an application of syntactic pattern recognition
[2]. (Modern ‘gene-ﬁnders’, however, use highly customized algorithms for
eﬃciency, though the most eﬀective of these do capture the syntactic structure of
the standard gene model.) As the variety of genes and related features (such as
immunoglobulin superfamily genes and microRNA) and their higher-level organization
in genomes continues to grow more complex, grammars may yet prove
to be a superior means to formally specify knowledge about such structure.
4 Genetic Grammars
Gregor Mendel laid the foundation for modern genetics by asserting a model
for the inheritance of traits based on a parsimonious set of postulates. While
many modiﬁcations have been required to account for a wider and wider set
of observations, the basic framework has proven robust. Many mathematical
and computational formalizations of these postulates and their sequelae have
been developed, which support such activities as pedigree analysis and genetic
mapping.
The author has been developing a grammar-based speciﬁcation of Mendelian
genetics which is able to depict the basic processes of gamete formation, segregation
of alleles, zygote formation, and phenotypic expression within a uniform
8 D.B. Searls
framework representing genetic crosses [unpublished]. With this basic ‘Mendelian
grammar,’ extensions are possible that account in a natural way for various
known mechanisms for modiﬁcation of segregation ratios, linkage, crossing-over,
interference, and so forth.
One possible use of this formalism is as a means to frame certain types of
analysis as a form of grammar inference. For example, mapping of genes to
linkage groups and ordering linkage groups can be seen as ﬁnding an optimal
structure of an underlying grammar so as to ﬁt experimental data. Especially
intriguing is the possibility of including together in one framework the genetic
analysis with phenotypic grammars, for example in the genetic dissection of
pathways.
5 Molecular Machines
Enzymes and other biochemical structures such as ribosomes are sometimes
called ‘molecular machines’ because they perform repetitive chemical and/or
mechanical operations on other molecules. In particular, a large class of such
objects process nucleic acids in various ways, many of them by attaching to
and moving along the DNA or RNA in what is termed processive fashion. This
immediately brings to mind computational automata which perform operations
on tapes.
Since automata have their analogues in grammars, it is natural to ask whether
grammars can model enzymes that act on DNA or RNA. In fact the trivial rightrecursive
grammar that we showed at the outset (1) can be considered a model
for terminal transferase, an enzyme that synthesizes DNA by attaching bases to
a growing chain, as in this derivation:
S ⇒ cS ⇒ ctS ⇒ ctcS ⇒ ctcaS ⇒ ctcaaS ⇒ ctcaagS ⇒ ctcaag
We can view the nonterminal S as the molecular machine, the terminal transferase
itself, laying down bases sequentially and then departing into solution.
Similarly, we can envision a context-sensitive grammar that models an exonuclease,
an enzyme that degrades nucleic acids a base at a time from one or the other
end. The orientation is important, because exonucleases are speciﬁc for which
end they chew on, and therefore whether they run in the forward or reverse
direction on the strand:
Fx → F | forward exonuclease
xR → R | reverse exonuclease
(3)
These can produce derivations such as the following, with the nonterminals again
physically mimicking the action of the corresponding enzymes:
Fgcaa ⇒ Fgcaa ⇒ Fcaa ⇒ Faa ⇒ Fa ⇒ F ⇒
atggacR ⇒ atggaR ⇒ atggR ⇒ atgR ⇒ atR ⇒ at
Molecules, Languages and Automata 9
In the ﬁrst derivation, the F completely digests the nucleic acid strand and then
itself disappears via the disjunct — into solution, as it were. On the other hand,
in the second example we show the R exonuclease departing without completing
the job, which mirrors the biological fact that enzymes can show greater or lesser
propensity to hop on or oﬀ the nucleic acid spontaneously. We could model the
tendency to continue the recursion (known as an enzyme’s processivity) with a
stochastic grammar, where probabilities attached to rules would establish the
half-lives of actions of the biological processes.
The author’s most recent eﬀorts [unpublished] have been to catalogue a wide
range of grammars describing the activities of enzymes acting on nucleic acids
in various circumstances. This requires the extension of the model to doublestranded
DNA, as well as the ability to act on more than one double-stranded
molecule at once. Wth the employment of stochastic grammars, it appears possible
to specify a wide variety of biochemical details of molecular machines.
6 Edit Grammars
Another view of the movement of nonterminals is as a means to perform editing
operations on strings. As in the case for processive enzymes, we view a nonterminal
as a ‘machine’ that begins at the left end of an input string and processes
to the right end, leaving an altered string as output.
xS
0
−→ Sx identity (x ∈ Σ)
yS
1
−→ Sx substitution (x = y)
S
1
−→ Sx deletion
xS
1
−→ S insertion
To frame this input/output process in a more standard fashion, one can simply
assert a new starting nonterminal S , a rule S →Swτ where w ∈ Σ∗
is the input
string and τ is a new terminal marker not in the language, and an absorbing rule
S→τ that is guaranteed to complete any derivation and leave only the output
string.
Note that the insertion rule is not strictly context-sensitive (the left side being
longer than the right), and can generate any string whatever as output. The
numbers above the arrows here represent a cost of applying the corresponding
edit rule. An overall derivation would again move the S nonterminal from the
beginning of an input string to the end, leaving the output to its left, as follows:
Sgact
0
=⇒ gSact
1
=⇒ gtSct
1
=⇒ gtcSt
2
=⇒ gtcgSt
2
=⇒ gtcgtS
Here the numbers above the double arrows represent the cumulative cost of the
derivation. The rules applied are an identity (for no cost), a substitution of a ‘t’
for an ‘a’ (adding a cost of 1), another identity, an insertion of a ‘g’ (adding a
cost of 1), and an identity.
10 D.B. Searls
Minimal edit distances are typically calculated with dynamic programming
algorithms that are O(nm) in the lengths of the strings being compared. The
same order of results can be obtained with the appropriate table-based parsers
for grammars such as that above, though perhaps with the sacriﬁce of some
eﬃciency for the sake of generality. The great advantage of the parsing approach
is that grammars and their cognate automata make it possible to describe more
complex models of string edits, and therefore of processes related to molecular
evolution. The author has recast a number of the algorithms developed for such
purposes in the form of automata, which can be shown to be equivalent to the
recurrence relations typically used to specify such algorithms [4].
References
1. Searls, D.B.: The linguistics of DNA. Am. Sci. 80, 579–591 (1992)
2. Dong, S., Searls, D.B.: Gene structure prediction by linguistic methods. Genomics
23, 540–551 (1994)
3. Searls, D.B.: String Variable Grammar: a logic grammar formalism for DNA sequences.
J. Logic Prog. 24, 73–102 (1995)
4. Searls, D.B.: Formal language theory and biological macromolecules. In: FarachColton,
M., Roberts, F.S., Vingron, M., Waterman, M. (eds.) Mathematical Support
for Molecular Biology, pp. 117–140. American Mathematical Society, Providence
(1999)
5. Searls, D.B.: The language of genes. Nature 420, 211–217 (2002)
6. Chiang, D., Joshi, A.K., Searls, D.B.: Grammatical representations of macromolecular
structure. J. Comp. Biol. 13, 1077–1100 (2006)