© 1999 Oxford University Press Nucleic Acids Research, 1999, Vol. 27, No. 19 3911–3920
Heuristic approach to deriving models for gene finding
John Besemer1 and Mark Borodovsky1,2,*
1School of Biology and 2School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
Received April 30, 1999; Revised and Accepted July 27, 1999
ABSTRACT
Computer methods of accurate gene finding in DNA
sequences require models of protein coding and
non-coding regions derived either from experimentally
validated training sets or from large amounts of
anonymous DNA sequence. Here we propose a new,
heuristic method producing fairly accurate inhomogeneous
Markov models of protein coding regions.
The new method needs such a small amount of DNA
sequence data that the model can be built ‘on the fly’
by a web server for any DNA sequence >400 nt. Tests
on 10 complete bacterial genomes performed with
the GeneMark.hmm program demonstrated the
ability of the new models to detect 93.1% of annotated
genes on average, while models built by
traditional training predict an average of 93.9% of
genes. Models built by the heuristic approach could
be used to find genes in small fragments of anonymous
prokaryotic genomes and in genomes of
organelles, viruses, phages and plasmids, as well as
in highly inhomogeneous genomes where adjustment
of models to local DNA composition is needed.
The heuristic method also gives an insight into the
mechanism of codon usage pattern evolution.
INTRODUCTION
Computer-aided gene finding frequently employs statistical
gene prediction methods based on Markov models (1–3).
Parameters of inhomogeneous Markov models for a protein
coding DNA sequence could be inferred from training sets of
experimentally annotated DNA sequences (1) or from a large
enough set of anonymous DNA sequences (2,4–6). In this
paper we present a rather simple training procedure that
produces fairly efficient Markov models using a minimum
amount of training data. The idea of the method is based on
two observations made upon analysis of the performance of the
prokaryotic versions of the GeneMark and GeneMark.hmm
programs (1,3). First, for the Escherichia coli genome, whose
genes have been divided into three classes that differ in codon
usage pattern (7–9), it was noticed that predicting genes of
each class by GeneMark did not require carefully tuned up
class-specific Markov models. For instance, the genes of Class
II, the highly expressed genes possessing the most biased
codon usage pattern, could be accurately predicted just by
using the models of the genes of Class I, encompassing the
majority of E.coli genes (10). Secondly, the GeneMark.hmm
program (3) was able to detect a vast majority of genes of all
three E.coli classes using second order Markov models trained
on the Class III E.coli genes, presumably horizontally transferred
genes whose E.coli-specific codon usage pattern was the
least pronounced. Having realized that practically useful
models of protein coding regions may be learned from a rather
small amount of genomic sequence, we attempted to avoid the
traditional training process. The proposed heuristic procedure
for Markov model derivation used a fragment of genomic
DNA just long enough to accurately estimate the nucleotide
composition. This procedure also used linear functions that
related nucleotide frequencies in the three codon positions to
the global nucleotide frequencies and linear functions that
related amino acid residue frequencies to genome GC content.
These functions were obtained by linear regression analysis of
DNA sequence data of several completely sequenced prokaryotic
genomes. Tests of the new approach were performed on 10
complete bacterial genomes. The heuristically derived models
were used with the GeneMark.hmm and GeneMark programs.
The tests have shown that the heuristic models worked surprisingly
well. Particularly, when GeneMark.hmm was used an
average 93.1% of annotated genes were detected, while in
comparison, models built by traditional training predicted an
average of 93.9% of genes. The heuristic approach to model
building will be useful for dealing with prokaryotic species
whose genomic sequence information is available in small
amounts and for small genomes of organelles, viruses, phages
and plasmids, as well as for genomes with highly inhomogeneous
DNA composition, when models need adjustment to local
DNA composition.
It should be noted that although we have not moved far
enough in this direction, the heuristic method produced an
interesting by-product, a heuristic codon usage table. Comparison
of the experimental and heuristically defined codon
frequencies revealed a strong correlation, especially for
species with a highly biased nucleotide composition. This
correlation indicates the presence of general factors related to
genome and proteome composition at the levels of nucleotides
and amino acids, respectively, that are involved in shaping the
species-specific codon usage patterns. For the species with a
balanced nucleotide composition, such as E.coli, the heuristic
codon usage frequencies may help to single out ‘outliers’, the
instances when codon frequencies deviate from expectation
due to some other important factors that become hidden in
genomes with a biased composition.
*To whom correspondence should be addressed at: School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA. Tel: +1 404 894 8432;
Fax: +1 404 894 0519; Email: mark@amber.biology.gatech.edu
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
3912 Nucleic Acids Research, 1999, Vol. 27, No. 19
MATERIALS AND METHODS
Materials
To obtain the information necessary to generate the heuristic
models, we examined the 17 complete bacterial genomes available
in GenBank as of November 1998. The species used were
Aquifex aeolicus (11), Archaeoglobus fulgidus (12), Bacillus
subtilis (13), Borrelia burgdorferi (14), Chlamydia trachomatis
(15), E.coli (16), Haemophilus influenzae (17), Helicobacter
pylori (18), Methanobacterium thermoautotrophicum
(19), Methanococcus jannaschii (4), Mycobacterium tuberculosis
(20), Mycoplasma genitalium (21), Mycoplasma pneumoniae
(22), Pyrococcus horikoshii (23), Rickettsia prowazekii
(24), Synechocystis PCC6803 (25) and Treponema pallidum
(26). Amino acid frequencies and positional nucleotide
frequencies were measured using genes annotated in these
genomic sequences. For testing purposes, we used 10 of the 17
complete genomes mentioned above as well as the complete
genomes for human immunodeficiency virus type I (27) and
human T cell lymphotrophic virus type I (28). We also used
650 Pseudomonas aeruginosa sequences and 385
Chlamydomonas reinhardtii sequences.
The heuristic method of Markov model derivation
This method was designed to build the set of Markov models
using a minimal amount of sequence information. When using
this approach, only three independent parameters, three of the
four nucleotide frequencies specific for the particular genomic
sequence, are needed to generate the models necessary to
utilize gene-finding programs.
Upon analysis of the 17 complete bacterial genomes we have
observed relationships between the positional nucleotide
frequencies and the global nucleotide frequencies (Fig. 1A–D)
as well as relationships between the amino acid frequencies
and the global GC% of the bacterial genomes (Fig. 2A–J).
These relationships were approximated by linear functions
using standard linear regression. Interestingly, the graphs for
positional frequencies of T and G nucleotides (Fig. 1A and C)
show a ‘Z-pattern’ caused by a noticeable difference in
frequencies at the first and second codon positions. The
frequencies of C and A nucleotides at the first and second
codon positions are close to each other and the ‘Z-pattern’ is
absent from the graphs (Fig. 1B and D).
Of the 20 amino acids, the frequency of only 10 were
observed to change significantly over the range of GC percentages
for the 17 complete genomes examined: 28.6% GC
(B.burgdorferi) to 65.6% GC (M.tuberculosis). Of these 10,
four amino acids are coded by SSN type codons (S stands for
C or G and designates strong Watson–Crick pairing): alanine
(A), glycine (G), proline (P) and arginine (R). Arginine,
though, is encoded not only by GCN codons, but also by AGA
and AGG. We conventionally considered arginine as an SSNtype
amino acid, since four of its six codons are of the SSN
type. Frequencies of all four SSN-type amino acids, A, G, P
and R, increased as the GC content of the genome increased
(Fig. 2A–D).
Five other amino acids whose frequencies significantly
changed over the GC% range are coded for by WWN-type
codons (W stands for A or T and designates weak Watson–
Crick pairing): phenylalanine (F), isoleucine (I), lysine (K),
asparagine (N) and tyrosine (Y). As could be expected, the
frequency of each of these amino acids decreased as GC%
increased (Fig. 2E–I). Methionine, although technically a
WWN-type amino acid, occurs very infrequently in bacterial
genomes (~1.8%) and changed the least of all 20 amino acids
over the GC% range examined.
Amino acids coded by combinations of strong and weak
nucleotides in the first two positions of a codon were considered
as neutral. Only one neutral amino acid, valine, had a frequency
that changed significantly as GC% changed and behaved like a
member of the SSN group (Fig. 2J). Valine belongs to the group
of aliphatic amino acids, along with isoleucine and leucine. The
frequency of isoleucine, classified as a WWN-type amino acid,
decreased as GC% increased (Fig. 2F). No significant change in
the frequency of leucine, with four of its six codons being of
neutral type and the other two of the WWN type, was observed.
Perhaps some evolutionary pressure exists to hold near constant
the sum of the frequencies of the aliphatic amino acids at the
level of the proteome. Thus, as genomic GC% increases, the
increase in frequency of valine may be explained as compensation
for a deficiency of isoleucine.
Given these observations, amino acid frequency dependence
on global GC% was taken into account only for the four amino
acids of SSN type, A, G, P and R, for the five amino acids of
WWN type, F, I, K, N and Y, and for the one neutral amino
acid, V. For all other amino acid frequencies, the values
observed in the E.coli proteome were used as constants. For
mycoplasma genomes, an additional constant was added to the
E.coli tryptophan frequency since the codon TGA, usually a
stop codon, codes for tryptophan in these species.
The parameters of the set of Markov models, the three-periodic
models for coding sequence of orders zero, one and two
and a single zero order model for non-coding sequence, were
derived as follows. Learning the global nucleotide frequencies
from a given genomic sequence allowed us to determine nucleotide
frequencies in each of the three codon positions using the
linear relationships shown in Figure 1A–D. Then, the initial
values of frequency of occurrence of each of the 61 codons,
fI(XYZ), were obtained as products of the three positional
nucleotide frequencies of corresponding nucleotides. The
frequency of a particular amino acid was determined for a
given GC content and was then used to modify the initial value
of codon frequency. For example, the refined frequency of the
alanine codon GCT is defined by the formula
fR(GCT) = falanine(GC%global) × [fI(GCT)/(fI(GCC) + fI(GCA) + fI(GCG) + fI(GCT))] 1
In the case of alanine, encoded by four codons, we could have
reached the same result by taking into account only the nucleotide
frequencies in the third codon position. Equation 1,
Figure 1. (Opposite) (A) Frequency of nucleotide T in three codon positions observed in 17 bacterial genomes shown as a function of global nucleotide T
frequency in a given genome. The equations of the lines approximating the observed data were obtained by linear regression analysis. (B–D) As in (A) for
positional frequencies of nucleotides C, A and G, respectively.
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
Nucleic Acids Research, 1999, Vol. 27, No. 19 3913
Figure 1. (A) Frequency of nucleotide T in three codon positions observed in 17 bacterial genomes shown as a function of global nucleotide T frequency in a given
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
3914 Nucleic Acids Research, 1999, Vol. 27, No. 19
however, functioned properly independently of the number of
synonymous codons, so all amino acids were handled in the
same manner. Obviously, this method guarantees that the sum
of the refined frequencies of codons is equal to the sum of the
frequencies of the amino acids. By completing this computation
for all 61 codons, we produced the heuristically built
codon usage table for the input genomic sequence.
To construct the three-periodic zero order Markov model of
a protein coding region the codon usage table is all that is
needed. For example, to determine the probability of A in the
first position of a codon, the probabilities of all codons that
start with A were added together. In the zero order model of
non-coding sequence the global frequencies of the respective
nucleotides were used.
For the first order three-periodic Markov model, the codon
usage table provides enough data to calculate only two
matrices of transition probabilities out of three. To define the
values of transition probabilities related to nucleotides
occupying the third position of one codon and the first position
of the next codon it was assumed that occurrences of adjacent
codons are independent events. Indeed, a rather weak correlation
exists between nucleotides of adjacent codons. Thus the
probability of nucleotide Y in the first position of a codon
following a nucleotide X in the third position of the previous
codon, P(X→Y) for the (..X||Y..) configuration, is equal to the
probability of nucleotide Y in the first position of a codon
defined previously for the zero order Markov model.
For the second order Markov model, only the transition
probabilities for the nucleotide in the third codon position
could be produced from the codon usage table.
To find the transition probabilities related to the first and
second codon positions, we used the same assumption of
Figure 2. (A) Frequency of amino acid alanine shown as a function of GC content of the bacterial genome for 17 genomes. The equation of the line approximating
the observed data was obtained by linear regression analysis. (B–J) As (A) for glycine, proline, arginine, phenylalanine, isoleucine, lysine, asparagine, tyrosine and
valine, respectively.
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
Nucleic Acids Research, 1999, Vol. 27, No. 19 3915
independence of adjacent codons. The second order transition
probability P(XY→Z) for the (.XY||Z..) configuration was
assumed to be equal to the probability of nucleotide Z in the
first codon position as defined in the zero order Markov model.
The transition probability P(XY→Z) for the (..X||YZ.) configuration
was equal to the value of probability of Z in the second
position following Y in the first position. This value was
already defined for the first order Markov model.
Gene finding accuracy estimation
To characterize the gene prediction accuracy of the models,
predictions made by the GeneMark and GeneMark.hmm
programs were compared with the GenBank annotation.
Although it may not always be true, the annotation was
assumed to be precisely correct in terms of gene (ORF) location
and position of the start codon. We use the term ‘close
prediction’ or mere ‘prediction’ for the case when the
predicted stop codon of the ORF matched the annotated stop,
regardless of whether the predicted start codon location
matched the annotated start. The term ‘exact prediction’
describes the case when both the positions of the predicted stop
and start codons matched the annotation. Since almost none of
the annotated sequence was directly used in building the
heuristic models, there was no need to use a cross-validation
procedure, which is regularly used to assess the accuracy of the
models with a large number of parameters learned from a
training set.
RESULTS AND DISCUSSION
Testing on 10 complete bacterial genomes
In order to gauge how well the matrices generated through the
heuristic approach performed when used in gene prediction,
they were tested on 10 complete bacterial genomes. One test
was done with the GeneMark.hmm program only. Another test
was performed using a combination of both GeneMark.hmm
and GeneMark. Note that the heuristic models could be
employed in any gene-finding program using Markov models.
Prediction of the 5′-end of a gene was aided by the RBS model
built from the E.coli sequence data (3).
The complete genomes of the following species were
analyzed in the tests: A.fulgidus, B.subtilis, E.coli, H.influenzae,
H.pylori, M.genitalium, M.jannaschii, M.pneumoniae,
M.thermoautotrophicum and Synechocystis PCC6803. By
using GeneMark.hmm with heuristic models, an average of
93.1% of the genes present in the GenBank annotations of the
above 10 genomes were closely predicted. At the same time,
72.1% of annotated genes were predicted exactly. These
results compare favorably with the results obtained using
‘native’ models derived from the genomic sequences by traditional
training (1). When native models were used in GeneMark.hmm,
93.9% of the annotated genes were closely
predicted and 77.4% of the genes were exactly predicted
(Table 1). This information is broken down into the 10 species
tested in Table 2, which shows the results using heuristic
models for every species.
A limitation of GeneMark.hmm is that a gene that overlaps
at its 3′-end with an adjacent gene in the opposite orientation
can be missed. To recover missing genes, the GeneMark
program was also run on the sequence. The predictions made
by the two programs were then parsed and a single list of
predicted genes was produced. In the case where both GeneMark.hmm
and GeneMark predicted a gene with the same stop
position, the GeneMark.hmm prediction was selected as the
representative one. Since we were specifically interested in
recovering rather long genes missed by GeneMark.hmm, we
only used the GeneMark predictions longer than 500 nt.
Using the combined approach, the heuristic models closely
predicted 94.6% of the genes annotated in the 10 bacterial
genomes. Native models worked better, predicting on average
97.3% of the genes (Table 1).
When run with the native models, GeneMark.hmm predicted
an average of 11.4% genes which were not present in the annotation
for the 10 bacterial genomes. These predictions were
denoted as potential new genes. The heuristic models predicted
an average of 12.4% potential new genes over the same condi-
tions.
Overall, the heuristic models performed similarly to native
models in terms of percentage of genes missed, exactly
predicted genes and percentage of potential new genes
predicted. Only minimal gains were made in terms of gene
prediction accuracy by moving from heuristic models to native
models. This suggests that heuristic models can be used to
accurately predict genes in cases where the amount of
sequence data necessary for traditional training of higher order
native models is not yet available or cannot be obtained at all.
Possible applications include sequencing projects very near
their beginning and prediction of genes in small genomes, such
as organelles, viruses, phages and plasmids. Heuristic models
may also help in analyzing genomes with highly inhomogeneous
nucleotide composition.
Table 1. Average gene prediction performance of the heuristic and native models as defined in tests on 10 bacterial genomes
The figures for annotated genes predicted give the percentage of annotated genes closely predicted (with possible misplacement of the start codon). The combination
of GeneMark.hmm with GeneMark described in Lukashin and Borodovsky (3) allows for a 3.4% improvement in the accuracy of native models and 1.5%
improvement in heuristic models.
Gene prediction program Model type Annotated genes predicted (%) Annotated genes exactly predicted (%) Potential new genes (%)
GeneMark.hmm Native 93.9 77.4 11.4
GeneMark.hmm Heuristic 93.1 72.1 12.4
GeneMark.hmm and GeneMark Native 97.3 77.4 11.4
GeneMark.hmm and GeneMark Heuristic 94.6 73.4 13.4
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
3916 Nucleic Acids Research, 1999, Vol. 27, No. 19
Prediction of short genes
Approximately 25% of the 4289 genes found in the E.coli
genome are shorter than 500 nt in length. Typically, using
computational methods these genes have been much more
difficult to accurately predict than longer genes. We have done
an additional analysis to find out how efficient the models
generated through the heuristic approach are in predicting
genes of short length in comparison with native models. The
percentages of genes that were closely predicted in the E.coli
genomic sequence using both types of models are shown in
Figure 3A. In all length categories other than the shortest one
(shorter than 100 nt), which includes 11 annotated genes out of
4289, the heuristic models and native models produced similar
results. Figure 3B shows a comparison of the percentage of
potential new genes predicted in the E.coli genome by both
native and heuristic models. The heuristic models and native
models predict similar numbers of potential new genes in all
length categories.
Testing on bacterial genomes with inhomogeneous
composition
Genomic sequences of P.aeruginosa were used for this test
since they range fairly widely in GC content from 41.6 to
70.3%. Native Markov models based on P.aeruginosa
sequence were generated using 650 records available in
GenBank. The whole GC content range was divided into three
bins and the native models were derived for each GC bin. A set
of 262 of these sequences was used as a test set for gene prediction
accuracy. The GeneMark.hmm program using native
models closely predicted 95% of the genes annotated in
P.aeruginosa sequences and exactly predicted 71% of the
annotated genes. The GeneMark.hmm program using the
heuristic models, tested on the same 262 genes, also made
close predictions for 95% of the annotated genes and exactly
predicted 75% of the annotated genes. The program using
native models predicted 23% genes not present in the annotation
while using heuristic models produced a slightly smaller
number of new predictions, 21%. Overall, better results were
achieved with heuristically derived models than with native
models. Note that the need to use a cross-validation procedure
to estimate the accuracy of the heuristic model was virtually
abolished since the number of the model parameters learned
from the genomic sequences is so small. Using crossvalidation
for testing the native models could lead to a decrease
in the accuracy figure given above by several percentage points.
Table 2. Gene prediction accuracy of the GeneMark.hmm program using heuristic models for each of 10 bacterial genomes
The figures in the third column give the percentage of annotated genes closely predicted by the GeneMark.hmm program (with possible misplacement of the start
codon). Percentage of potential new genes (false positives) is defined with regard to the number of annotated genes.
Genes annotated Genes predicted Genes predicted (%) Genes exactly predicted (%) Potential new genes (%)
A.fulgidus 2407 2516 88.1 70.7 13.2
B.subtilis 4099 4384 96.5 66.4 10.8
E.coli 4289 4426 93.4 75.3 10.7
H.influenzae 1717 1840 95.9 84.9 10.7
H.pylori 1566 1612 93.7 76.8 9.6
M.genitalium 467 509 88.9 66.8 19.3
M.jannaschii 1680 1841 94.5 68.2 13.2
M.pneumoniae 678 734 91.9 65.6 17.0
M.thermoautotrophicum 1869 1944 93.9 65.8 6.9
Synechocystis 3169 3360 94.6 80.5 12.6
Average 93.1 72.1 12.4
Figure 3. (A) Percent of genes predicted by the GeneMark.hmm program out
of those annotated in the E.coli genome as a function of gene length. (B) As
(A) for the percent of potential new genes.
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
Nucleic Acids Research, 1999, Vol. 27, No. 19 3917
Testing on the smallest genomes
Due to the small genome size of phages and viruses, there is
not enough sequence data to properly train the Markov models
in the traditional way. Since the heuristic approach eliminates
the need for large amounts of DNA sequence for model
building, genomes of phages and viruses were an ideal test
case for this method. The heuristic approach was tested on the
complete genomes of human immunodeficiency virus type I
and human T cell lymphotropic virus type I.
The HIV type I genome is only 9719 nt long and contains
eight annotated genes, two of which contain a single intron.
The intron for both genes is in exactly the same position, 6046–
9378, although the two genes have different start and stop locations.
The prokaryotic version of GeneMark.hmm used here
was unable to accurately predict genes with introns. Therefore,
we ignored the two genes that contain introns. GeneMark.hmm
predicted five of the six annotated genes and no genes that
were not annotated. The gene that was not predicted was 237 nt
in length and overlapped the preceding gene. The GeneMark
program recovered this annotated gene. However, GeneMark
also predicted the existence of five other short genes in this
sequence that were not present in the annotation. Four of these
five predictions were located in the long terminal repeat (LTR)
region of the genome as indicated by the annotation. A BLAST
search on these four predictions revealed that all four
sequences coded for the HIV type I nef protein. The nef protein
has been shown to play a role in virus replication and it has
recently been suggested that it may be related to pathogenicity
as well (30). The other prediction, interestingly, was located
inside the annotated intron. A BLAST search on this sequence
revealed that it coded for the Vpu protein. Therefore, all the
intronless predictions made by heuristically derived models
were supported by either nucleotide sequence annotation in
GenBank or by already existing entries in protein sequence
databases.
The GenBank record for the human T cell lymphotropic
virus type I, 9068 nt, contains only three annotated genes.
Using GeneMark.hmm with heuristic models, all three annotated
genes were closely predicted. In addition, four genes
were predicted that were not present in the annotation. Using
GeneMark, another potential new gene was predicted. Of the
five potential new genes, two were exactly identical and were
found in the LTR regions. Although there were no BLAST hits
for this duplicated putative gene, a similarity search against the
SCOP database produced a number of matches to known
protein structural domains (31). BLAST searches on the other
three predicted proteins revealed that two of them were already
annotated proteins for this virus. The similarity search for one
remaining protein produced no significant homology.
Interpretation of the results in terms of Kullback–Liebler
distance
A gene-finding method exploiting the maximum likelihood
concept, such as GeneMark.hmm (3), attempts to fit each one
of the set of initially defined Markov models, such as models
of coding or non-coding regions, to a given DNA sequence
eventually parsed into segments where one or another model
fits best. Essentially, the algorithm implements ‘competition’
among the models for the best fit for each given DNA segment.
In the simplest case of two competing models, this competition
is quantified by a likelihood ratio value that determines which
one of two models fits the particular DNA segment better. In
making this decision there are two types of possible errors,
false positive and false negative. The lower these errors are, the
higher is the discrimination power of the method. The
Kullback–Liebler (KL) distance or relative entropy, D(P||Q), is
a measure of affinity of two statistical (Markov) models P and
Q. Rigorous theory shows (29) that false positive (false negative)
error rates decrease exponentially with growth of the
sequence fragment length, n, with the exponent value proportional
to –nD(P||Q) [–nD(Q||P)]. This means that the discrimination
power should increase as the sequence fragment
increases. As shown earlier, this trend was indeed observed in
the tests of gene finding as the longer genes were detected with
higher accuracy.
The KL distance, as a single parameter, has proven to be a
useful characteristic for comparative analysis of performance
of gene-finding methods using high order Markov models
described by a large number of parameters. Some rather
surprising results could be explained in terms of the KL
distance (5). Particularly, it was observed that Markov models
trained on the set of E.coli Class I genes provided higher
discrimination power for detecting the E.coli Class II genes
than for genes from Class I itself (4). This observation agrees
with the fact that the effective KL distance, De(P||Q), between
one model (the E.coli Class II gene, P) and another (the E.coli
non-coding sequence, Q), defined by the formula De(P||Q) =
D(P||Q) – D(P||P*), where P* is the model of the E.coli Class I
gene, is larger than the regular KL distance, D(P*||Q), between
the E.coli Class I gene model and the non-coding sequence
model (M.Borodovsky, unpublished data).
In our computations we used the standard formula for the KL
distance (29) modified to deal with the periodic Markov
models P and Q:
D(P||Q) = 1/3∑3
c∑4
i∑4
j∑4
kpc
ipc
ijpc
ijk log2(pc
ijk/qc
ijk) 2
Here index c represents the three codon positions and indices i,
j and k represent the four types of nucleotides. We computed
the KL distance values between several second order models:
the three three-periodic Markov models for the E.coli genes of
Classes I, II and III; the heuristic model of E.coli genes; the
ordinary Markov model of the E.coli non-coding region (Table
3). It is seen that in terms of KL distance the E.coli heuristic
model lies close to the E.coli Class III model. Both of these
Table 3. Values of the Kullback–Liebler distance
The KL distance values defined by equation 2 are given for both D(P||Q) and
D(Q||P), the values related to false positive and false negative error rates (see
text).
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
3918 Nucleic Acids Research, 1999, Vol. 27, No. 19
models are closer to the model of the non-coding region than
the model of E.coli Class I and, especially, the model of E.coli
Class II genes. Let us assume that the E.coli Class III model or
the E.coli heuristic model is used in the gene-finding algorithm.
The effective KL distance between the E.coli Class I or
Class II gene and non-coding models is then large enough (see
5) and the genes of these classes should be predicted with sufficient
accuracy. This expectation was confirmed to be a reasonable
one for both the heuristic model (current paper) and for
the Class III model (M.Borodovsky and A.V.Lukashin, unpublished
data). The E.coli heuristic model was also able to predict
E.coli Class III genes, as would be expected from the closeness
of these two models in terms of KL distance. However, in
comparison with the Class III model the heuristic model has an
advantage in that it could be built in a regular way from a small
portion of DNA sequence data without dealing with clusterization
of the whole gene pool of a given species (5,6,10). Thus,
we conclude that the E.coli heuristic model is able to predict
genes of all E.coli classes, while also being easy to build. A
similar analysis could be conducted for genomes of other
bacterial species with the gene pools divided, if necessary, into
sets of typical and atypical genes (5).
Why the method works? Implications for evolution of
codon usage
The comparison of actual codon usage frequencies with the
codon frequencies defined by the heuristic codon usage table
might help understand why the heuristic model is efficient for
gene finding. In Figure 4A–B both sets of codon frequencies
are shown for the whole E.coli genome (Fig. 4A) and the E.coli
Class III genes (Fig. 4B). These sets are also shown for the
genomes of B.burgdorferi and M.tuberculosis (Fig. 4C and D)
having, respectively, the lowest and the highest GC% content
in the sample of 17 complete genomes. The correlation coefficient,
R, between the experimental codon frequencies and the
heuristic codon frequencies were calculated. The R values for
the E.coli whole genome and the E.coli Class III genes were
equal to 0.58 and 0.48, respectively. For B.burgdorferi and
M.tuberculosis the R values were equal to 0.94 and 0.87,
respectively. The R values indicate that the pattern of codon
usage frequency was captured by the heuristic procedure to
some extent for the case of the E.coli genome, with medium
GC content. This correlation, as indicated earlier, turned out to
be sufficient to generate models providing reasonably accurate
gene-finding ability. For the genomes with highly biased GC
content the high level of correlation indicates that the codon
usage pattern is modeled by the heuristic procedure in great
detail. This observation raises the question of to what extent
does the simple principle of the heuristic method explain the
diversity of codon usage patterns in bacterial and perhaps in
higher species. The basic parameters of the method, the global
nucleotide frequencies, appear to be the fundamental variables
defined by the complex structure of the species-specific
biochemical pathways producing the building blocks of the
A B
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
Nucleic Acids Research, 1999, Vol. 27, No. 19 3919
DNA double helix. The amino acid composition of a species
proteome represents other fundamental variables that are even
more conserved among species and vary slightly as the GC%
of the whole genome changes. Therefore, the observations,
though limited, suggest that a significant component of the
codon usage pattern in evolution may be the result of a
compromise between restrictions on the nucleotide and amino
acid compositions, achieved mainly using elasticity of the
silent codon positions. This mechanism of developing the
codon usage pattern seems to be more pronounced in the
species with highly biased nucleotide composition, such as
B.burgdorferi and M.tuberculosis. In species such as E.coli the
pressure of compositional restrictions is weaker and other
factors come more freely into play to form the codon usage
pattern (8,9,32).
Testing on eukaryotic genomes
Although the heuristic approach to model building was
designed with bacterial gene prediction in mind, we had
reasons to try this approach for eukaryotic DNA sequences as
well. The advantage of the heuristic model is the ability of
being produced ‘on the fly’ and of being adjusted to local
sequence composition. Thus, the heuristic models under
certain circumstances may replace a native protein coding
model, which is part of a whole set of models needed for
eukaryotic gene prediction. This whole set, as used in the
eukaryotic version of GeneMark.hmm (M.Borodovsky and
A.V.Lukashin, unpublished data), includes contextual models
for start codons, stop codons and splice sites as well as models
for length distributions for coding and non-coding regions
(exons, introns and intergenic regions). We have estimated the
parameters for these additional models from a set of 350
C.reinhardtii sequences available in GenBank. Parameters of
the ‘local’ heuristic models for protein coding regions were
derived using nucleotide frequencies counted in DNA fragments
of ~1000 nt. Although we used the linear functions
derived for bacterial genomes, a separate analysis has provided
evidence that similar linear functions are valid for the
Figure 4. (Opposite and above) (A) Frequencies of 61 codons as observed in E.coli protein coding regions are shown along with the codon frequencies defined by
the heuristic codon usage table. The order of codons follows the order accepted in the table of genetic code with the exception of serine and arginine codons.
Although a correlation between natural and model frequencies is seen in general, some significant differences exist in frequencies of particular codons: CTG, CCG,
CGT and CGC, as well as CTA, ATA, AAG, GAG, AGA and AGG. (B) As in (A) with observed codon frequencies taken from 158 genes of E.coli Class III. Again,
a general correlation is observed with the noticeable exception of CTC, CTA, AAG, GAC, GAG and AGG, as well as AAT and AAA. (C) As in (A) for B.burgdorferi.
A clear periodic pattern is seen in codon frequencies. This pattern relates to the type of nucleotide occupying the third codon position. The B.burgdorferi
genome is AT-rich. The codon frequencies for codons ending with A/T are greater in most cases than the frequencies of those ending with C/G. The natural and
heuristic frequencies are in good correlation. (D) As in (A) for M.tuberculosis. Here the periodic pattern similar to that observed in (C) is reversed since the
M.tuberculosis genome is GC-rich. The natural and heuristic frequencies correlate reasonably well.
C D
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom
3920 Nucleic Acids Research, 1999, Vol. 27, No. 19
eukaryotic case (data not shown). When GeneMark.hmm using
native models (of order five) was run on a randomly selected
test set of 58 C.reinhardtii sequences, 88.5% of the annotated
exons and 65% of the annotated genes were exactly predicted.
Using heuristic models of up to only order two, 82% of the
exons and 52% of the genes were predicted exactly. These
results suggest that a heuristic approach aimed at eukaryotic
genomes has some potential and requires further study. It could
be especially useful for genomes with a highly inhomogeneous
composition.
Gene finding with heuristic models via a web server
The software program that builds the heuristic model for input
sequences is accessible via the Internet at http://
dixie.biology.gatech.edu/GeneMark/heuristic.cgi . This program
produces heuristic models for a sequence longer than 400 nt.
The models are then applied to the analysis of the input
sequence by the GeneMark and GeneMark.hmm programs.
Output of the web server includes a list of predicted genes in
text format and, optionally, a list of predicted protein
sequences and a graph of protein coding potentials.
ACKNOWLEDGEMENTS
We are grateful to Dr Jeffrey Lawrence for useful discussion
and to Maribeth Norris for help with manuscript preparation.
J.B. and M.B. were supported in part by a grant from the US
National Institutes of Health. M.B. was also supported in
part by a grant from the US Civil Research Development
Foundation.
REFERENCES
1. Borodovsky,M.Y. and McIninch,J.D. (1993) Comput. Chem., 17,
123–133.
2. Salzberg,S.L., Delcher,A.L., Kasif,S. and White,O. (1998) Nucleic Acids
Res., 26, 544–548.
3. Lukashin,A.V. and Borodovsky,M. (1998) Nucleic Acids Res., 26,
1107–1115.
4. Bult,C.J., White,O., Olsen,G.J., Zhou,L., Fleischmann,R.D., Sutton,G.G.,
Blake,J.A., Fitzgerald,L.M., Clayton,R.A., Gocayne,J.D. et al. (1996)
Science, 273, 1058–1073.
5. Hayes,W.S. and Borodovsky,M. (1998) Genome Res., 8, 1154–1171.
6. Audic,S. and Claverie,J.M. (1998) Proc. Natl Acad. Sci. USA, 17,
10026–10031.
7. Grantham,R., Gautier,C., Gouy,M., Jacobzone,A. and Mercier,M. (1981)
Nucleic Acids Res., 9, r43–r74.
8. Gouy,M. and Gautier,C. (1982) Nucleic Acids Res., 10, 7055–7074.
9. Médigue,C., Rouxel,T., Vigier,P., Henaut,A. and Danchin,A. (1991)
J. Mol. Biol., 222, 851–856.
10. Borodovsky,M., McIninch,J.D., Koonin,E.V., Rudd,K.E., Médigue,C.
and Danchin,A. (1995) Nucleic Acids Res., 23, 3554–3562.
11. Deckert,G., Warren,P.V., Gaasterland,T., Young,W.G., Lenox,A.L.,
Graham,D.E., Overbeek,R., Snead,M.A., Keller,M., Aujay,M. et al.
(1998) Nature, 392, 353–358.
12. Klenk,H.P., Clayton,R.A., Tomb,J., White,O., Nelson,K.E.,
Ketchum,K.A., Dodson,R.J., Gwinn,M., Hickey,E.K., Peterson,J.D. et al.
(1997) Nature, 390, 364–370.
13. Kunst,F., Ogasawara,N., Moszer,I., Albertini,A.M., Alloni,G.,
Azevedo,V., Bertero,M.G., Bessieres,P., Bolotin,A., Borchert,S. et al.
(1997) Science, 390, 249–256.
14. Fraser,C.M., Casjens,S., Huang,W.M., Sutton,G.G., Clayton,R.A.,
Lathigra,R., White,O., Ketchum,K.A., Dodson,R., Hickey,E.K. et al.
(1997) Nature, 390, 580–586.
15. Stephens,R.S., Kalman,S., Lammel,C.J., Fan,J., Marathe,R., Aravind,L.,
Mitchell,W.P., Olinger,L., Tatusov,R.L., Zhao,Q. et al. (1998) Science,
282, 754–759.
16. Blattner,F.R., Plunkett,G., Bloch,C.A., Perna,N.T., Burland,V., Riley,M.,
Collado-Vides,J., Glasner,J.D., Rode,C.K., Mayhew,G.F. et al. (1997)
Science, 277, 1453–1462.
17. Fleischmann,R.D., Adams,M.D., White,O., Clayton,R.A., Kirkness,E.F.,
Kerlavage,A.R., Bult,C.J., Tomb,J., Dougherty,B.A., Merrick,J.M. et al.
(1995) Science, 269, 496–512.
18. Tomb,J.-F., White,O., Kerlavage,A.R., Clayton,R.A., Sutton,G.G.,
Fleischmann,R.D., Ketchum,K.A., Klenk,H.P., Gill,S., Dougherty,B.A.
et al. (1997) Nature, 388, 539–547.
19. Smith,D.R., Doucette-Stamm,L.A., Deloughery,C., Lee,H.-M., Dubois,J.,
Aldredge,T., Bashirzadeh,R., Blakely,D., Cook,R., Gilbert,K. et al.
(1997) J. Bacteriol., 179, 7135–7155.
20. Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D.,
Gordon,S.V., Eiglmeier,K., Gas,S., Barry,C.E. et al. (1998) Nature, 393,
537–544.
21. Fraser,C.M., Gocayne,J.D., White,O., Adams,M.D., Clayton,R.A.,
Fleischmann,R.D., Bult,C.J., Kerlavage,A.R., Sutton,G.G., Kelley,J.M.
et al. (1995) Science, 270, 397–403.
22. Himmelreich,R., Hilbert,H., Plagens,H., Pirkl,E., Li,B.-C. and
Herrmann,R. (1996) Nucleic Acids Res., 24, 4420–4449.
23. Kawarabayasi,Y., Sawada,M., Horikawa,H., Haikawa,Y., Hino,Y.,
Yamamoto,S., Sekine,M., Baba,S., Kosugi,H., Hosoyama,A. et al. (1998)
DNA Res., 5, 55–76.
24. Andersson,S.G.E., Zomorodipour,A., Andersson,J.O., SicheritzPonten,T.,
Alsmark,U.C.M., Podowski,R.M., Naeslund,A.K.,
Eriksson,A.S., Winkler,H.H. and Kurland,C.G. (1998) Nature, 396,
133–140.
25. Kaneko,T., Sato,S., Kotani,H., Tanaka,A., Asamizu,E., Nakamura,Y.,
Miyajima,N., Hirosawa,M., Sugiura,M., Sasamoto,S. et al. (1996)
DNA Res., 3, 109–136.
26. Fraser,C.M., Norris,S.J., Weinstock,G.M., White,O., Sutton,G.G.,
Dodson,R., Gwinn,M., Hickey,E.K., Clayton,R., Ketchum,K.A. et al.
(1998) Science, 281, 375–388.
27. Wong-Staal,F., Gallo,R.C., Chang,N.T., Ghrayeb,J., Papas,T.S.,
Lautenberger,J.A., Pearson,M.L., Petteway,S.R.,Jr, Ivanoff,L.,
Baumeister,K. et al. (1985) Nature, 313, 277–284.
28. Inoue,J., Watanabe,T., Sato,M., Oda,A., Toyoshima,K., Yoshida,M. and
Seiki,M. (1986) Virology, 150, 187–195.
29. Cover,T.M. and Thomas,J.A. (1991) Elements of Information Theory.
John Wiley & Sons, New York, NY.
30. Zaher,H., Kay,D.J., Rebai,N., Guimond,A., Jothy,S. and Jolicoeur,P.
(1998) Cell, 95, 163–175.
31. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol.
Biol., 247, 536–540.
32. Lawrence,J.G. and Ochman,H. (1997) J. Mol. Evol., 44, 383–397.
atMasarykUniversityonMarch11,2013http://nar.oxfordjournals.org/Downloadedfrom