Read Mapping Algorithms for Single Molecule
Sequencing Data
Vladimir Yanovsky1
, Stephen M. Rumble1
and Michael Brudno1,2
1
Department of Computer Science and
2
Donnelly Centre for Cellular and Biomolecular Research
University of Toronto
{volodyan,rumble,brudno}@cs.toronto.edu
Abstract. Single Molecule Sequencing technologies such as the Heliscope
simplify the preparation of DNA for sequencing, while sampling
millions of reads in a day. Simultaneously, the technology suffers from a
significantly higher error rate, ameliorated by the ability to sample multiple
reads from the same location. In this paper we develop novel rapid
alignment algorithms for two-pass Single Molecule Sequencing methods.
We combine the Weighted Sequence Graph (WSG) representation of all
optimal and near optimal alignments between the two reads sampled
from a piece of DNA with k-mer filtering methods and spaced seeds
to quickly generate candidate locations for the reads on the reference
genome. We also propose a fast implementation of the Smith-Waterman
algorithm using vectorized instructions that significantly speeds up the
matching process. Our method combines these approaches in order to
build an algorithm that is both fast and accurate, since it is able to
take complete advantage of both of the reads sampled during two pass
sequencing.
1 Introduction
Next generation sequencing (NGS) technologies are revolutionizing the study
of variation among individuals in a population. While classical, Sanger-style
sequencing machines were able to sequence 500 thousand basepairs per run
at a cost of over $1000 per megabase, new sequencing technologies, such as
Solexa/Illumina and AB SOLiD can sequence 4 billion nucleotides in the same
amount of time, at the cost of only $6 per megabase. The decreased cost and
higher throughput of NGS technologies, however, are offset by both a shorter
read length and a higher overall sequencing error rate.
Most NGS technologies reduce the cost of sequencing by running many sequencing
experiments in parallel. After the input DNA is sheared into smaller
pieces, DNA libraries are prepared by a Polymerase Chain Reaction (PCR)
method, with many identical copies of the molecules created, and attached to
a particular location on a slide. All of the molecules are sequenced in parallel
using sequencing by hybridization or ligation, with each nucleotide producing a
specific color as it is sequenced. The colors for all of the positions on the slide
are recorded via imaging and are then converted into base calls.
In this paper we address a specific type of NGS technology ­ Single Molecule
Sequencing (SMS). While the origins of SMS date back to 1989 [3], it is only
now becoming practical. The Heliscope sequencer, sold by Helicos, is the first
commercial product that allows for the sequencing of DNA with SMS. In a recent
publication, the complete genome of the M13 virus was resequenced with
the Heliscope [4]. The key advantage of the SMS methods over other NGS technologies
is the direct sequencing of DNA, without the need for the PCR step
described above. PCR has different success rates for different DNA molecules,
and introduces substitutions into the DNA sequence as it is copied. Additionally,
the direct sequencing of DNA via SMS significantly simplifies the preparation of
DNA libraries. SMS methods should also lead to more accurate estimates of the
quantity of DNA which is required for detection of copy number variations and
quantification of gene expression levels via sequencing. SMS technologies have
very different error distribution than standard Sanger-style sequencing. Because
only one physical piece of DNA is sequenced at a time, the sequencing signal is
much weaker, leading to a large number of "dark bases": nucleotides that are
skipped during the sequencing process leading to deletion errors. Additionally,
a nucleotide could be mis-read (substitution errors), or inserted, however these
errors are more typically the result of imaging, rather than chemistry problems,
and hence are rarer. A full description of SMS errors is in the supplementary
material of [4].
One advantage of the Heliscope sequencer, as well as several other proposed
SMS technologies (e.g. Pacific Biosciences' expected method) is the ability to
read a particular piece of DNA multiple times (called multi-pass sequencing).
The availability of multiple reads from a single piece of DNA can help confirm
the base calls, and reduce the overall error rate. In practice these methods
usually use two passes, as this offers a good tradeoff between error rate and
cost. In this paper we introduce a novel rapid alignment algorithm for multipass
single molecule sequencing methods. While we will use the most common
case of two-pass sequencing to describe our algorithms, they can be easily extended
to a larger number of passes. We combine the Weighted Sequence Graph
(WSG) [11] representation of all optimal and near optimal alignments between
the two reads sampled from a piece of DNA with k-mer filtering methods [9]
and spaced seeds [1] to quickly generate candidate locations for the reads on the
reference genome. We also propose a novel, fast implementation of the SmithWaterman
algorithm [12] using vectorized instructions that significantly speeds
up the matching process. Our algorithms are implemented as part of the SHort
Read Mapping Package (SHRiMP). SHRiMP is a set of tools for mapping reads
to a genome, and includes specialized algorithms for mapping reads from popular
short read sequencing platforms, such as Illumina/Solexa and AB SOLiD.
SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
2 Algorithms
In this section we will describe the alignment algorithms we use for mapping two
reads generated from a single DNA sequence with SMS to a reference genome.
We will start (Sections 2.1 and 2.2) by reviewing how the Weighted Sequence
Graph data structure [11] can be used to speed up the inherently cubic naive
alignment algorithm to near-quadratic running time. We will then demonstrate
how to combine the WSG alignment approach with standard heuristics for rapid
alignment, such as seeded alignment and k-mer filters in Section 2.3. Finally,
we will describe our implementation of the Smith-Waterman alignment algorithm
with SSE vector instructions to further speed up our algorithm, making
it practical for aligning millions of reads against a long reference genome.
2.1 Alignment with Weighted Sequence Graphs
Given unlimited computational resources, the best algorithm for mapping two
Fig. 1. An example of a WSG graph representing 1-suboptimal alignments of CTGACT
with CAGCAT. (A). Optimal alignments of score 9 and two 1-suboptimal of score 8;
two more 1-suboptimal alignments are possible. (B). Cells of the dynamic programming
matrix such that the best path through them is 9 are shaded in light grey, if it is 8 in
dark grey. (C). WSG corresponding to at worst 1-suboptimal alignments.
reads sampled from the same location to the reference genome is full threeway
alignment. This algorithm would require running time proportional to the
product of the sequence lengths. Because the reads are typically short (30bp),
the overall running time to map a single read pair to a genome of length n may
be practical (30  30  n = 900n), however the naive algorithm will not scale to
aligning millions of reads that an SMS platform produces every day.
An alternative approach, suggested in [4], is to align the two reads to each
other, thus forming a profile, which could then be aligned to the reference
genome. This approach, akin to the standard "progressive" alignment approaches
used, e.g., in CLUSTALW [2], has the advantage of significantly decreased runtime,
as the running time becomes proportional to the product of the lengths of
the genome and the longer read (30 times the length of the genome), however,
because the profile only incorporates a single optimal alignment, it loses important
information about possible co-optimal and suboptimal alignments between
the sequences. For example, Figure 1 demonstrates two short sequences with
two optimal alignments, as well as 4 more near-optimal ones. While the total
number of optimal alignments between two sequences could be exponential in
their lengths, Naor and Brutlag [7] and Hein [6] have suggested that all of these
alignments can be represented compactly using a directed acyclic graph (DAG).
Furthermore, Schwikowski and Vingron [11] have shown how to generalize the
standard sequence alignment paradigms to alignment of DAGs, which they call
weighted sequence graphs. Their original algorithm was stated in the context of
the tree alignment problem, but is easily generalizable to any progressive alignment
scoring scheme. Here we reintroduce the idea of representing alignments
as graphs, and extend it for the SMS mapping problem.
Definitions The following definitions will prove useful:
­ Given the sequence S of length |S| over an alphabet  = {A, C, T, G}, we
refer to its ith symbol as S[i]
­ We say that a sequence S over alphabet  =  `-`, where character `-`
is called a gap, spells S if removing the gaps from S results in sequence S.
When this does not cause ambiguity, we will write S instead of S .
­ A k-mer is any sequence of length k.
­ When addressing matrix indices, we use a notations "i1 : i2", ": i2", "i1 :"
and ":" to represent the sets of indices j such that i1  j  i2, j  i2,
i1  j and all possible indices respectively. For example, given matrix M of
size K × N we shall denote the i'th row of the matrix by M[i; :] and its j'th
column by M[:; j].
­ A global alignment of 2 sequences S1 and S2 over alphabet  is defined as a
matrix M = [2 × N] such that M[i; :] spells Si for i = 1, 2.
­ Given a real-valued score function sc :  ×   we define the score SC
of the alignment M = [2 × N] as
SC(M) = N
j=1sc(M[:; j])
­ The global alignment problem is to find an alignment M such that SC(M) is
maximum. It is known [8] that for any cost function sc satisfying condition
sc(-, -) < 0 the problem is well-defined though its solution is not necessarily
unique. We denote the maximum score with Opt.
­ We call an alignment M -suboptimal if SC(M)  Opt - .
While we defined the alignment for two sequences, the problem for more than
two sequences is defined similarly. In particular, the problem we consider in this
work ­ an alignment of a pair of sequences S1 and S2 to a reference sequence R
­ is to find a high scoring alignment matrix M = [3 × N] such that M[i; :] = Si
for i=1 and 2 respectively and M[3; :] is a substring of R.
Representing (Sub)-optimal Alignments in a Graph A sequence S can
be represented by a labeled directed graph G (S) = (V, E) with |S| + 1 nodes
{s = V [0]    V [|S|] = t]} and |S| edges: for every 0  i < |S| there is an edge
V [i]  V [i + 1] labeled with S[i] ­ the i'th symbol of sequence S; vertices s and
t will be called the source and the sink. We obtain the graph G(S) from graph
G by adding a self loop edge V [i]  V [i] labeled with a gap symbol '­' for every
i. There is a one-to-one correspondence between the sequences over alphabet 
spelling sequence S and the sequences produced by reading edge labels along
the paths in G(S) from s to t.
Given two strings, S1 and S2 and two graphs A = G(S1) and B = G(S2), their
cross product A × B is defined as a graph with the vertex set V = {(vA, vB)},
for any vA vertex of A and vB vertex of B, and edge set E = {v1  v2}, for
any v1 = (v1
A, v1
B) and v2 = (v2
A, v2
B), such that there is an edge v1
A  v2
A in A
and an edge v1
B  v2
B in B. The edges of all graphs considered in this work will
be labeled with strings over alphabet ; the edge v1
 v2
of the cross product
graph will receive a label which is the concatenation of the labels of an edge
v1
A  v2
A in A and an edge v1
B  v2
B in B.
This cross product graph, sometimes referred to as an edit graph, corresponds
to the Smith-Waterman dynamic programming matrix of the two strings. It is
easy to prove that there is a one-to-one correspondence between the paths from
source s = s1 × s2 to sink t = t1 × t2 and alignments: if the sequence of edges
along the path is e0    eN-1, then the corresponding alignment M is defined as
M(:; j) = label(ej) for all values of j ­ recall that label(e) is a pair of symbols.
Note that the only cycles in the cross-product graph we just defined are the self
loops labeled with '­ ­' at every node of the graph.
Now a WSG is defined as a subgraph of the cross-product graph such that its
vertex set V contains vertices s, t and for every vertex v  V it is on some path
from s to t. Intuitively, one may think of the WSG as a succinct representation
of a set of alignments between the two strings. We use WSGs in Section 2.2 for
the purpose of representing all high-scoring alignments.
2.2 Alignment of Read Pairs with WSG
During two-pass DNA sequencing, two reads are produced from every fragment
of DNA. Both of the reads may have sequencing errors, the most common of
which are skipped letters. These errors are nearly ten-fold more common than
mis-calls or insertions [4]. Our algorithm for aligning two-pass SMS reads is
based on the intuition that in a high-scoring three-way alignment of a pair of
reads, S1 and S2, to a reference sequence, the alignment of sequences S1 and S2
to each other should also be high-scoring. The algorithm proceeds as follows:
­ Build a cross-product graph G = (V, E) representing the set of all possible
alignments of S1 and S2, s is the source of G , t is the sink.
­ For every edge e = u  v in G we compute the score of the highest scoring
path from source s to sink t through the edge e; we denote this score as
w(e), the weight of the edge. To do this we use dynamic programming to
compute the scores of the highest scoring paths from s to every vertex of G
and of the highest scoring paths, using the edges of G reversed, from t to
every vertex of G. Time complexity of this step is O(E).
­ For a given suboptimality parameter , we build WSG G2 from G by discarding
all edges such that
w(e) < Opt - 
where Opt denotes the score of the highest scoring path from s to t in G .
Observe that while all -suboptimal alignments of S1 and S2 correspond to
paths from s to t, in G2, the converse is not true: not all paths from s to t
are -suboptimal.
­ Align WSG G2 to the reference genome: compute the cross product of G2
and the linear chain that represents the reference sequence, obtaining a WSG
spelling all possible three way alignments M, such that M[1 : 2; :] is one of the
alignments, represented by G2. Finally, use dynamic programming (similar
to Smith and Waterman [12]) to search this graph for the locally optimal
path. The score of this path is the score of mapping the pair of reads at the
given location.
While the use of the WSG leads to a significant speedup compared with the
full 3-dimensional dynamic programming, this is still insufficient for the alignment
of many reads to a long reference genome, as the resulting algorithm is at
best quadratic. In the subsequent section we combine the WSG alignment approach
with standard heuristics for rapid sequence alignment in order to further
improve the running time.
2.3 Seeded Alignment Approach for SMS Reads
Almost all heuristic alignment algorithms start with seed generation ­ the location
of short, exact or nearly exact matches between two sequences. Whenever
one or more seeds are found, the similarity is typically verified using a slow, but
more sensitive alignment algorithm. The seeds offer a tradeoff between running
time and sensitivity: short seeds are more sensitive than long ones, but lead to
higher running times. Conversely, longer seeds result in faster searches, but are
less sensitive. Our approach for SMS seeded alignment differs in two ways from
standard methods. First, we generate potential seeds not from the observed
reads, but rather from the -suboptimal WSGs representing their alignments.
Second, we combine spaced seeds [1] with k-mer-filtering techniques [9] to further
speed up our algorithm.
Seed Generation Intuitively, given a read pair and an integer k, the seeds we
want are k-long substrings (k-mers) likely to appear in the DNA segment represented
by the pair and segment's matching location on the reference sequence.
While k-mers present in each individual read may not match the genome directly
due to the deletion errors introduced during the sequencing process, it is much
less likely that both reads have an error at the same location. Consequently, we
generate the potential seeds not from the the reads directly, but from the highscoring
alignments between them. Because every high-scoring alignment between
the reads corresponds to a path through the WSG, we take all k-long subpaths
in the WSG and output the sequence of edge labels along the path. Recall that
each edge of the WSG is labeled by a pair of letters (or gaps), one from each
read. If one of the reads is gapped at the position, we output the letter from the
other read. If neither read is gapped we output a letter from an arbitrary read.
While the number of k-mers we generate per pair of reads can be large if the two
reads are very different, in practice the reads are similar and the WSG is small
and "narrow". In our simulations (described in the Results section) we saw an
average of 27 k-mers per read-pair for reads of approximately 30 nucleotides.
Genome Scan Given the sets of k-mers generated from the WSGs of readpairs,
we build a lookup table, mapping from the k-mers to the reads. We used
spaced seeds [1], where certain pre-determined positions are "squeezed" from the
k-mers to increase sensitivity. We then scan the genome, searching for matches
between k-mers present in the genome and the reads. If a particular read has as
many or more than a specified number of k-mer matches within a given window
of the genome, we execute a vectorized Smith-Waterman step, described in the
next section.
Unlike some local alignment programs that build an index of the genome and
then scan it with each query (read), we build an index of the reads and query
this index with the genome. This approach has several advantages: first, it allows
us to control memory usage, as our algorithm never needs memory proportional
to the size of the genome, while the large set of short reads can be easily divided
between many machines in a compute cluster. Secondly, our algorithm is able to
rapidly isolate which reads have several k-mer matches within a small window
by using a circular buffer to store all of the recent positions in the genome that
matched the read. The genome scan algorithm is illustrated in Figure 2A.
2.4 Vectorized Smith-Waterman Implementation
While the genome scan described in the previous section significantly reduces the
number of candidate regions, many false positives are encountered. To further reduce
the number of potential mapping positions given to the WSG-based aligner,
(A) (B)
Fig. 2. A. Overview of the k-mer filtering stage within SHRiMP: A window is moved
along the genome. If a particular read has a preset number of k-mers within the window
the vectorized Smith-Waterman stage is run to align the read to the genome. B.
Schematic of the vectorized-implementation of the Smith-Waterman algorithm. The
red cells are the vector being computed, on the basis of the vectors computed in the
last step (yellow) and the next-to-last (green). The match/mismatch vector for the
diagonal is determined by comparing one sequence with the other one reversed (indicated
by the red arrow below). To get the set of match/mismatch positions for the
next diagonal the lower sequence needs to be shifted to the right.
we first align both reads to the candidate region using a vectorized implementation
of the Smith-Waterman algorithm. Most contemporary mobile, desktop
and server-class processors have special vector execution units, which perform
multiple simultaneous data operations in a single instruction. For example, it is
possible to add the eight individual 16-bit elements of two 128-bit vectors in one
machine instruction. Over the past decade, several methods have been devised to
significantly enhance the execution speed of Smith-Waterman-type algorithms
by parallelizing the computation of several cells of the dynamic programming
matrix. The simplest such implementation by Wozniak [13] computes the dynamic
programming matrix using diagonals (See Figure 2B). Since each cell of
the matrix can be computed once the cell immediately above, immediately to
the left, and at the upper-left corner have been computed, one can compute
each successive diagonal once the two prior diagonals have been completed. The
problem can then be parallelized across the length of supported diagonals, often
leading to a vectorization factor of 4 to 16. The only portion of such an approach
that cannot be parallelized is the identification of match/mismatch scores for every
cell of the matrix. These operations are done sequentially, necessitating 24
independent, expensive data loads for 8-cell vectors. The approach becomes increasingly
problematic as vector sizes increase because memory loads cannot be
vectorized; when the parallelism grows, so does the number of lookups.
Rognes and Seeberg [10] developed an alternative algorithm with the following
innovations: first, they avoided the aforementioned loads by pre-calculating
a 'query profile' before computing the matrix. By pre-allocating vectors with
the appropriate scores in memory, they needed only load a single score vector
on each computation of a vector of cells. The up-front cost in pre-calculation
greatly reduced the expense of computing scores in the crucial, tight inner-loop
of the algorithm. The query profile, however, requires that scores be computed
per-column, rather than for each diagonal. This creates data dependencies between
several cells within the vector. However, the authors took advantage of the
fact that often cells contain the value 0 when computing the Smith-Waterman
recurrence, and any gap continuation would therefore be provably suboptimal.
This enables the conditional skipping of a significant number of steps. Although
highly successful for protein sequences, where only 1/20 elements on average
have a positive score, it is significantly less beneficial for DNA sequences where
usually at least 1/4 elements in the matrix match. Farrar [5] most recently introduced
a superior method: by striping the query profile layout in memory, he
significantly reduced the performance impact of correcting data dependencies.
We propose an alternate method, where the running time of the fully vectorized
algorithm is independent of the number of matches and mismatches in
the matrix, though it only supports fixed match/mismatch scores (rather than
full scoring matrices). Since SHRiMP only applies a vectorized Smith-Waterman
scan to short regions of confirmed k-mer hits, alternative approaches that benefit
by skipping computation in areas of dissimilarity are unable to take significant
advantage. Figure 2B demonstrates the essence of our algorithm: by storing one
of the sequences backwards, we can align them in such a way that a small number
of logical instructions obtain the positions of matches and mismatches for
a given diagonal. We then construct a vector of match and mismatch scores for
every cell of the diagonal without having to use expensive and un-vectorizable
load instructions or pre-compute a "query profile". In our tests (see Table 1),
our approach surpasses the performance of Wozniak's original algorithm [13]
and performs on par with Farrar's method [5]. The advantages of our method
over Farrar's approach are simplicity, independence of the running time and the
scores used for matches/mismatches/gaps, and linear scalability for larger vector
sizes. The disadvantages of our method are that it cannot support full scoring
matrices (i.e. it is restricted to match/mismatch scores) and is slower for queries
against large reference sequences where significant areas of dissimilarity are expected.
However, the former is less important for DNA alignment and the latter
does not apply to SHRiMP.
The vectorized Smith-Waterman approach described above is used to rapidly
determine if both of the reads have a strong match to a given region of the
genome. The locations of the top n hits for each read on the genome are stored in
a heap data structure, which is updated after every invocation of the vectorized
Smith-Waterman algorithm if the heap is not full, or if the attained score is
greater than or equal to the lowest scoring top hit. Once the whole genome
is processed, the highest scoring n matches are re-aligned using the full WSG
alignment algorithm described in Section 2.2.
Processor type Unvectorized Wozniak Farrar SHRiMP
P4 Xeon 97 261 335 338
Core 2 105 285 533 537
Table 1. Performance (in millions of cells per second) of the various Smith-Waterman
implementations, including a regular implementation, Wozniak's diagonal implementation
with memory lookups, Farrar's striped algorithm and our diagonal approach
without score lookups (SHRiMP). We used each method within SHRiMP to align 50
thousand reads to a reference genome with default parameters. The improvement of
the Core 2 architecture for vectorized instructions lead to a significant speedup for our
and Farrar's approaches, while the Wozniak algorithm's slight improvement is due to
the slow memory lookups.
3 Results
In order to test the efficacy of our read mapping algorithm we designed a simulated
dataset with properties similar to those expected from the first generation
SMS sequencing technologies, such as Helicos [4]. We sampled 10,000 pairs of
reads from human chromosome one (the total length of the chromosome is approximately
1/10th of the human genome). The first read was sampled to have
a mean length of 30 bases, with standard deviation of 8.9, the second one had a
mean length of 26 bases with standard deviation 8.6. The lengths of both reads
ranged from 12 to 58 bases. We introduced errors into each of the the reads,
with 7% deletion errors and 0.5% insertion and substitution errors uniformly at
random. We also introduced Single Nucleotide Polymorphisms (SNPs) at a rate
of 1%.
We used three approaches to map the read pairs back to the reference. Our
main approach was the WSG-based alignment algorithm described in Section 2.2.
For comparison we also implemented two alternatives: mapping the reads individually,
using the version of SHRiMP for Illumina/Solexa, but with the same
parameters as were used for WSG-based mapping in order to properly capture
the high expected deletion rates, and a profile-based approach, which follows
the overall framework of the WSG approach, but only considers a single top
alignment (one path in the WSG) for further mapping to the reference genome.
The methods are labeled WSG, SEPARATE, and PROFILE, respectively, in
Table 2.
We evaluated the performance of these algorithms based on three criteria: the
fraction of the reads that were not mapped at all (lower is better), the fraction of
the reads that were not mapped uniquely (lower is better: while some reads have
several equally good alignments due to repetitive DNA, others map with equal
scores to non-repetitive DNA segments, and this should be minimized), and the
percentage of the reads mapped uniquely and correctly (higher is better). When
evaluating the two reads separately, we considered a hit unique if either of the
two reads had a unique top hit, and we considered a unique top hit correct if
either of the top hits was correct. We ran all algorithms with three seed weights:
8, 9, and 10, with each spaced seed having a single wildcard character (zero) in
the middle of the seed (e.g. 1111101111 was the spaced seed of weight 9).
As one can see in Table 2, the approach of mapping the reads separately, while
leading to very good sensitivity (almost all reads aligned), has poor specificity
(only 70-78% of the reads have a unique top hit, with 61-64% both unique
and correct). The WSG and PROFILE approaches perform similarly, with the
WSG approach slightly more sensitive and having a higher fraction of unique
correct mappings for all seed sizes.
Type SEPARATE PROFILE WSG
seed weight 8 9 10 8 9 10 8 9 10
no hits % 0.000 0.131 2.945 1.741 4.905 10.52 1.609 4.314 10.15
multiple % 30.12 26.45 21.13 10.20 9.342 8.353 10.44 9.127 8.258
unique cor % 63.90 63.00 61.09 78.96 74.90 69.66 79.17 75.84 70.85
runtime 28m16 8m48 4m22 27m17 11m32 6m58 30m59 12m13 7m13
Table 2. Comparison of the WSG-based alignment with the two alternative approaches.
The first two rows describe the percentage of reads not mapped anywhere
on the genome or mapped in multiple places (with equal alignment scores). The third
row is the percentage of the unique hits that are correct. The sub-columns under each
method are the number of matching positions in the spaced seed (we allowed for a single
non-matching character in the middle of the seed). The last row shows the running
time in minutes and seconds on a 2.66 GHz Core2 processor. The best result in each
cathegory is in boldface.
4 Discussion
In this paper we propose a novel read alignment algorithm for next-generation
Single Molecule Sequencing (SMS) platforms. Our algorithm takes advantage of
spaced k-mer seeds and effective filtering techniques to identify potential areas of
similarity, a novel, vectorized implementation of the Smith-Waterman dynamic
programming algorithm to confirm the similarity and a weighted sequence graphbased
three way final alignment algorithm. Our approach is implemented as
part of SHRiMP, the SHort Read Mapping Package, and is freely available at
http://compbio.cs.toronto.edu/shrimp.
While the overall emphasis of the current paper was read mapping, we believe
that several of our methods may have broader applications. For example, our
implementation of the Smith-Waterman algorithm can be used in other sequence
alignment methods, and unlike previous vectorized implementations the running
time of our algorithm is independent of the scoring parameters. The weighted
sequence graph model could also be useful for ab initio genome assembly of SMS
data, where a variant of our method could be used for overlap graph construction.
References
1. Andrea Califano and Isidore Rigoutsos. Flash: A fast look-up algorithm for string
homology. In Proceedings of the 1st International Conference on Intelligent Systems
for Molecular Biology, pages 56­64. AAAI Press, 1993.
2. R. Chenna, H. Sugawara, T. Koike, R. Lopez, T. J. Gibson, D. G. Higgins, and
J. D. Thompson. Multiple sequence alignment with the clustal series of programs.
Nucleic Acids Res, 31(13):3497­3500, July 2003.
3. J.H. Jettand et al. High-speed DNA sequencing: an approach based upon fluorescence
detection of single molecules. Journal of biomolecular structure and
dynamics, 7(2):301­309, 1989.
4. Timothy D. Harris et al. Single-molecule DNA sequencing of a viral genome.
Science, 320(5872):106­109, April 2008.
5. M. Farrar. Striped Smith-Waterman speeds database searches six times over other
SIMD implementations. Bioinformatics, 23(2):156­161, 2007.
6. J. Hein. A new method that simultaneously aligns and reconstructs ancestral
sequences for any number of homologous sequences, when the phylogeny is given.
Molecular Biology and Evolution, 6(6):649­668, 1989.
7. D. Naor and D.L. Brutlag. On near-optimal alignments of biological sequences.
Journal of Computational Biology, 1(4):349­266, 1994.
8. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for
similarities in the amino acid sequences of two proteins. JMB, 48:443­453, 1970.
9. Kim R. Rasmussen, Jens Stoye, and Eugene W. Myers. Efficient q-gram filters for
finding all epsilon-matches over a given length. Journal of Computational Biology,
13(2):296­308, 2006.
10. T. Rognes and E. Seeberg. Six-fold speed-up of smith-waterman sequence database
searches using parallel processing on common microprocessors. Bioinformatics,
16(8):699­706, 2000.
11. Benno Schwikowski and Martin Vingron. Weighted sequence graphs: boosting
iterated dynamic programming using locally suboptimal solutions. Discrete Appl.
Math., 127(1):95­117, 2003.
12. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences.
Journal of Molecular Biology, 147:195­197, 1981.
13. A. Wozniak. Using video-oriented instructions to speed up sequence comparison.
Computer Applications in the Biosciences, 13(2):145­150, 1997.