BLAST Algorithm
Stephen F Altschul, National Center for Biotechnology Information, Bethesda,
Maryland, USA
BLAST is an acronym for basic local alignment search tool; the BLAST family of database
search programs takes as input a query DNA or protein sequence, and search DNA or
protein sequence databases for similarities that may indicate homology. The programs
implement variations of the BLAST algorithm, which is a heuristic method for rapidly
finding local alignments with scores sufficiently high to be statistically significant.
Deoxyribonucleic acid (DNA) or protein sequences
are aligned with one another for a wide variety of
reasons, for example, to infer homology, to predict
protein structure and function, to reconstruct phylogeny
and to locate exons and introns. The type of
sequence alignment that makes the fewest prior
assumptions is local alignment, which seeks similar
segments of unspeciﬁed length from the two sequences
being compared. Local sequence similarity is generally
measured with the aid of a nucleotide or amino acid
substitution matrix and associated gap costs. With
sequence similarity so deﬁned, a rigorous method for
ﬁnding optimal local alignments is provided by the
Smith–Waterman algorithm, which requires time
proportional to the product of the lengths of the
sequences it compares. However, using this algorithm,
a similarity search of a large nucleic acid sequence
database or protein database with a single query
sequence of moderate length may require an hour on
a modern work station. Accordingly, rapid heuristic
algorithms such as FASTA and basic local alignment
search tool (BLAST) have been developed that can
perform these searches up to two orders of magnitude
faster than the Smith–Waterman algorithm, but at the
cost of possibly missing an occasional similarity, that
is, local alignment, of interest.
BLAST Statistics
Local alignments of the kind sought by BLAST are
evaluated by alignment scores, deﬁned as the sum of
substitution scores for aligned pairs of letters, and gap
scores for strings of letters in one sequence aligned
with null characters introduced into the other.
Whether an alignment score is great enough to be of
interest depends on what scores are expected to arise
purely by chance. An analytic theory exists only for
local alignments that may not contain gaps. Assuming
that ‘random’ DNA or protein sequences are strings of
nucleotides or amino acids chosen independently, with
ﬁxed background residue frequencies, the number of
distinct local alignments with score at least S expected
to arise from the comparison of two sequences of
lengths m and n is well approximated by the formula
E ¼ Kmn eÀlS
where K and l are calculable parameters (Karlin and
Altschul, 1990). No such theory has been established
for gapped local alignments, but computational
experiments strongly suggest that the formula still
applies in this more general case (Altschul and Gish,
1996). However, in the gapped case, the statistical
parameters l and K can no longer be derived
analytically but must rather be estimated by random
simulation.
The BLAST programs convert raw local alignment
scores to ‘E-values’ using the above equation, m being
the length in residues of the query sequence, and n the
length of the database to which it is compared. The
original BLAST programs (Altschul et al., 1990)
sought only ungapped local alignments and therefore
could calculate l and K analytically. The gapped
BLAST programs (Altschul and Gish, 1996; Altschul
et al., 1997) allow users to choose among a set of
substitution and gap costs for which the statistical
parameters have been preestimated. Only local
alignments with an E-value lower than a set threshold,
usually 10 by default, are reported.
Basic Algorithm
Protein similarities of great biologic signiﬁcance may
be quite diffuse, sometimes involving long alignments
that nowhere contain more than two adjacent
matching pairs of amino acids. No rigorous method
substantially faster than the Smith–Waterman algorithm
is known for locating optimal local alignments.
On the other hand, computer science has very rapid
Advanced article
Article contents
 BLAST Statistics
 Basic Algorithm
 BLAST Variants
 Implementations and Servers
doi: 10.1038/npg.els.0005253
BLAST Algorithm
ENCYCLOPEDIA OF LIFE SCIENCES & 2005, John Wiley & Sons, Ltd. www.els.net 1
techniques for identifying perfectly matching
substrings of two input sequences. The heuristic
BLAST algorithm uses these techniques to locate,
with strong probability, optimal local alignments, even
when they represent relatively diffuse similarities.
Let a ‘word’ be a set-length string of adjacent letters
within a sequence. Given two sequences and a
threshold score T, let a ‘hit’ be a pair of words, one
from each sequence, whose aligned score is at least T.
The central idea behind the BLAST algorithm is that
for an appropriately chosen T, any local alignment
between the query and a database sequence with an
E-value worth reporting is highly likely to contain an
aligned pair of words constituting a hit (Altschul et al.,
1990). Therefore, a rapid search for hits involving
query and database sequences can be used to locate
candidate local alignments for reporting.
For the purpose of illustration, consider the original
ungapped BLAST algorithm (Altschul et al., 1990)
applied to protein sequence comparison using word
length 3, and the threshold T ¼ 11 for hits scored using
the BLOSUM-62 amino acid substitution matrix
(Henikoff and Henikoff, 1992). Omitting most details,
BLAST follows the following steps when supplied with
a query sequence and database.
Firstly, construct a table of all possible words that
would form a hit when aligned with some word in the
query sequence, and tag each word in the table with the
locations of all query sequence words that generated it.
For example, if the word ‘MVD’ appeared somewhere
in the query, it would generate the 11 words shown in
Table 1, each with a record of the location of ‘MVD’ in
the query.
Secondly, scan the database for any words that
appear in the table. Each such database word then
constitutes a hit in conjunction with one or more
known query sequence words.
Thirdly, consider each hit a nascent alignment
and extend it in each direction by aligning additional
pairs of amino acids from the query and database
sequences. Terminate an extension if the alignment
score drops more than a ﬁxed amount X below the best
score yet found seeded by this hit. Report the optimal
alignment for this hit if its score has a sufﬁciently low
E-value.
BLAST achieves its speed in large measure
by scanning the database only for exact matches
to words in the table created in the ﬁrst step
above. However, because it generates this table by
expanding each query word into a set of similar words,
BLAST is able to detect fairly diffuse similarities. For
example, six of the 11 hits generated by ‘MVD’ do not
contain two adjacent pairs of matching amino acids
(Table 1).
A key feature of the BLAST algorithm is its tunable
trade-off between speed and the probability of missing
relevant alignments. If the hit threshold score is raised
from 11 to 12, the number of words generated by
‘MVD’, for example, falls from 11 to 5 (Table 1). This
results in many fewer alignment extensions and a
corresponding decrease in execution time, but it also
renders it more likely that an alignment of interest will
contain no hit and therefore be overlooked. Gapped
versions of BLAST have additional conditions on and
procedures for generating gapped alignments
(Altschul and Gish, 1996; Altschul et al., 1997),
which will not be discussed here.
BLAST Variants
Variations on the BLAST algorithm have been
implemented for use in a variety of contexts.
Protein sequence comparison: BLASTP
BLASTP is used to compare a protein query sequence
to a database of protein sequences; recent versions
allow gapped alignments (Altschul and Gish, 1996;
Altschul et al., 1997).
DNA sequence comparison: BLASTN and
MEGABLAST
BLASTN is used to compare a DNA query sequence
to a database of DNA sequences. In lieu of the scorebased
hit deﬁnition used for protein comparisons, hits
are deﬁned as runs of matching pairs of nucleic acids of
a speciﬁed minimum length. MEGABLAST (Zhang
et al., 2000) is a faster variant on BLASTN best used
for very long query sequences and for ﬁnding
alignments of very closely related sequences.
Table 1 High-scoring words that align
to ‘MVD’
Word Score
MVD 15
MID 14
LVD 12
MLD 12
MMD 12
IVD 11
VVD 11
MAD 11
MTD 11
MVE 11
LID 11
Alignment scores are computed using the
BLOSUM-62 amino acid substitution matrix
(Henikoff and Henikoff, 1992). Amino acids
matching those to which they are aligned in
‘MVD’ are shown in bold.
BLAST Algorithm
2
Translating comparisons: BLASTX, TBLASTN
and TBLASTX
BLASTX (Gish and States, 1993) compares a DNA
query sequence to a protein database. The query
sequence is conceptually translated into protein in all
six possible reading frames, and the comparisons are
performed using an amino acid substitution matrix.
TBLASTN compares a protein query sequence to a
DNA sequence database. Database sequences are
conceptually translated in all six reading frames, and
the comparisons are performed at the protein level.
TBLASTX compares a DNA query sequence to a
DNA database, but at the protein level. Both query
and database sequences are translated into protein in
all possible reading frames, resulting in 36 comparisons
for each database sequence.
Profile searches: PSI-BLAST, IMPALA and
RPS-BLAST
PSI-BLAST (Altschul et al., 1997), an acronym for
‘position-speciﬁc iterated BLAST’, compares a protein
query sequence to a protein database in an iterative
manner. After each round of searching, any statistically
signiﬁcant alignments are combined into a
multiple alignment, from which a position-speciﬁc
score matrix or proﬁle is abstracted. This proﬁle is
compared to the database in the next round of
searching. PSI-BLAST has the potential to detect
much more subtle sequence relationships than
BLAST. IMPALA (Scha¨ ffer et al., 1999) and RPSBLAST
compare a protein query sequence to a
database of PSI-BLAST generated proﬁles; the former
uses the Smith–Waterman algorithm for the comparison
and the latter a variant of the BLAST algorithm.
Pattern-based searches: PHI-BLAST
PHI-BLAST (Zhang et al., 1998), an acronym for
‘pattern-hit initiated BLAST’, compares a protein
query sequence to a protein database but requires
any alignments found to contain a user-speciﬁed
pattern. This pattern serves in place of a hit table to
generate candidate alignments for extension and
evaluation.
Implementations and Servers
A number of BLAST servers are freely available on the
internet (see Web Links). These servers employ
different implementations of the BLAST programs.
Executable code is available from both sites, and
source code is available from the National Center for
Biotechnology Information website (Wheeler et al.,
2001).
See also
FASTA Algorithm
Protein Homology Modeling
Sequence Similarity
Similarity Search
Smith–Waterman Algorithm
References
Altschul SF and Gish W (1996) Local alignment statistics. Methods in
Enzymology 266: 460–480.
Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990)
Basic local alignment search tool. Journal of Molecular Biology
215: 403–410.
Altschul SF, Madden TL, Scha¨ ffer AA, et al. (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Research 25: 3389–3402.
Gish W and States DJ (1993) Identiﬁcation of protein coding regions
by database similarity search. Nature Genetics 3: 266–272.
Henikoff S and Henikoff JG (1992) Amino acid substitution matrices
from protein blocks. Proceedings of the National Academy of
Sciences of the United States of America 89: 10 915–10 919.
Karlin S and Altschul SF (1990) Methods for assessing the statistical
signiﬁcance of molecular sequence features by using general
scoring schemes. Proceedings of the National Academy of Sciences
of the United States of America 87: 2264–2268.
Scha¨ ffer AA, Wolf YI, Ponting CP, et al. (1999) IMPALA: matching
a protein sequence against a collection of PSI-BLASTconstructed
position-speciﬁc score matrices. Bioinformatics 15:
1000–1011.
Wheeler DL, Church DM, Lash AE, et al. (2001) Database resources
of the National Center for Biotechnology Information. Nucleic
Acids Research 29: 11–16.
Zhang Z, Scha¨ ffer AA, Miller W, et al. (1998) Protein sequence
similarity searches using patterns as seeds. Nucleic Acids Research
26: 3986–3990.
Zhang Z, Schwartz S, Wagner L and Miller W (2000) A greedy
algorithm for aligning DNA sequences. Journal of Computational
Biology 7: 203–214.
Further Reading
Altschul SF and Koonin EV (1998) Iterated proﬁle searches with
PSI-BLAST: a tool for discovery in protein databases. Trends in
Biochemical Sciences 23: 444–447.
Altschul SF, Boguski MS, Gish W and Wootton JC (1994) Issues
in searching molecular sequence databases. Nature Genetics 6:
119–129.
Altschul SF, Bundschuh R, Olsen R and Hwa T (2001) The
estimation of statistical parameters for local alignment score
distributions. Nucleic Acids Research 29: 351–361.
Baxevanis AD and Ouellette BFF (2001) Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins, 2nd edn. New York:
John-Wiley.
Dembo A, Karlin S and Zeitouni O (1994) Limit distribution of
maximal non-aligned two-sequence segmental score. Annals of
Probability 22: 2022–2039.
Ewens WJ and Grant GR (2001) Statistical Methods in Bioinformatics.
New York: Springer-Verlag.
Scha¨ ffer AA, Aravind L, Madden TL, et al. (2001) Improving the
accuracy of PSI-BLAST protein database searches with composition-based
statistics and other reﬁnements. Nucleic Acids
Research 29: 2994–3005.
BLAST Algorithm
3
Smith TF and Waterman MS (1981) Identiﬁcation of common
molecular subsequences. Journal of Molecular Biology 147:
195–197.
Smith TF, Waterman MS and Burks C (1985) The statistical
distribution of nucleic acid similarities. Nucleic Acids Research
13: 645–656.
Wilson AC, Carlson SS and White TJ (1977) Biochemical evolution.
Annual Review of Biochemistry 46: 573–639.
Web Links
National Center for Biotechnology Information BLAST server
http:/www.ncbi.nlm.nih.gov/BLAST/
Washington University BLAST server
http:/blast.wustl.edu/
BLAST Algorithm
4