[10:14 21/9/2010 Bioinformatics-btq485.tex] Page: 2534 2534–2540 BIOINFORMATICS ORIGINAL PAPER Vol. 26 no. 20 2010, pages 2534–2540 doi:10.1093/bioinformatics/btq485 Sequence analysis Advance Access publication August 24, 2010 GASSST: global alignment short sequence search tool Guillaume Rizk1,∗ and Dominique Lavenier2,∗ 1Univ-Rennes 1/IRISA and 2ENS-Cachan/IRISA, IRISA - Symbiose Campus universitaire de Beaulieu, 35042 Rennes Cedex, France Associate Editor: Alex Bateman ABSTRACT Motivation: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold—achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads. Results: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners. Availability: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/ Contact: guillaume.rizk@irisa.fr; dominique.lavenier@irisa.fr Supplementary information: Supplementary data are available at Bioinformatics online. Received on February 19, 2010; revised on July 26, 2010; accepted on August 17, 2010 1 INTRODUCTION Next-generation sequencing (NGS) technologies are now able to produce large quantities of genomic data. They are used for a wide range of applications, including genome resequencing or polymorphism discovery. A very large amount of short sequences are generated by these new technologies. For example, the IlluminaSolexa system can produce over 50 million 32–100 bp reads in a single run. A first step is generally to map these short reads over a reference genome. To enable efficient, fast and accurate mapping, new alignment programs have been recently developed. Their main ∗To whom correspondence should be addressed. goals are to globally align short sequences to local regions of complete genomes in a very short time. Furthermore, to increase sensitivity, a few alignment errors are permitted. The seed and extend technique is mostly used for this purpose. The underlying idea is that significant alignments include regions having exact matches between two sequences. For example, any 50 bp read alignments with up to three errors contains at least 12 identical consecutive bases. Thus, using the seed and extend technique, only sequences sharing common kmers are considered for a possible alignment. Detection of common kmers is usually performed through indexes localizing all kmers. Recently, several index methods have been investigated and implemented in various bioinformatics search tools. The first method, used by SHRiMP (Rumble et al., 2009) and MAQ (Li,H. et al., 2008), creates an index from the reads and scans the genome. The advantage is a rather small memory footprint. The second method makes the opposite choice: it creates an index from the genome, and then aligns each read iteratively. PASS (Campagna et al., 2009), SOAPv1 (Li,R. et al., 2008), BFAST (Homer et al., 2009) and our new aligner GASSST (Global Alignment Short Sequence Search Tool); use this approach. The last method, used in CloudBurst, indexes both the genome and the reads. Although more memory is needed, the algorithm exhibits better performance due to memory cache locality. Another short read alignment technique, used in Bowtie (Langmead et al., 2009), SOAPv2 (Li et al., 2009) and BWA (Li and Durbin, 2009), uses a method called backward search (Ferragina and Manzini, 2000) to search an index based on the Burrows–Wheeler transform (BWT; Burrows and Wheeler, 1994). Basically, it allows exact matches to be found before using a backtracking procedure that allows the addition of some errors. Although this technique reports extremely fast running times and small memory footprints, some data configurations lead to poor performances. (http://bowtie-bio.sourceforge.net/manual.shtml). Moreover, in order to speed-up computations, some methods restrict the type or the number of errors per alignment to a few mismatch and indel errors. In the building alignment process, computing the number of mismatches requires linear time, whereas indel errors require more costly algorithms such as the dynamic programming techniques used in the Smith–Waterman (Smith and Waterman, 1981) or Needleman–Wunsch (NW; Needleman and Wunsch, 1970) algorithms. For instance, MAQ, Eland and Bowtie do not allow gaps. EMBF (Wendi et al., 2009), SOAPv1 and SOAPv2 allow only one continuous gap, while PASS, SHRiMP, BFAST and SeqMap (Jiang and Wong, 2008) allow any combination of mismatch and indel errors. GASSST, as well, considers any combination of mismatch, insertion or deletion errors. In most applications, when reads are very short, dealing with a restricted © The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. atMasarykuniversityonNovember23,2014http://bioinformatics.oxfordjournals.org/Downloadedfrom [10:14 21/9/2010 Bioinformatics-btq485.tex] Page: 2535 2534–2540 GASSST number of errors is acceptable. On the other hand, when longer reads are processed, or when more distant reference genomes are compared, this restriction may greatly affect the quality of the search. In this article, we introduce GASSST, a new short read aligner for mapping reads with mismatch and indel errors at a very high speed. We show how a series of carefully designed filters allows false positive positions to be quickly discarded before the refinement extension step. In particular, GASSST is compared with similar state-of-the-art programs: BFAST, BWA, PASS and SSAHA2. 2 APPROACH GASSST uses the seed and extend strategy and indexes the genome. The seed step provides all potentially homologous areas in the genome with a given query sequence. The traditional way to accomplish this step is through a hash-table of kmers for selecting regions sharing common kmers with query sequences. To include gaps, the extend step is carried out with a dynamic programming algorithm (NW). This approach provides a high degree of accuracy, but is prohibitively expensive due to the high number of costly NW extensions that must be performed. To tackle this issue, several approaches have been proposed. The SHRiMP implementation (Rumble et al., 2009) provides an improved seed step with spaced seeds and Q-gram filters to restrict the size of the candidate hit space. Then, a carefully optimized vectorized Smith–Waterman extension is run to perform the extend step. BFAST (Homer et al., 2009) uses long spaced seeds to limit the amount of candidate locations, yet still achieves high sensitivity through the use of multiple indexes. PASS (Campagna et al., 2009) implements another solution: a filter is introduced before the full extend step to rule out areas that have too many differences with the query sequence. A precomputed table of all possible short words, aligned against each other, is built to perform a quick analysis of the flanking regions adjacent to seed words. The GASSST strategy is similar: a traditional anchoring method is used followed by a NW extension step. However, fast computation is achieved thanks to a very efficient filtering step. Candidate positions are selected through a new carefully designed series of filters of increasing complexity and efficiency. Two main types of filter are used. One is related to the computation of an Euler distance between nucleotide frequency vectors as defined by Wendi et al. (2009). The idea is this: if one sequence has, for example, three more ‘T’ nucleotides than another, then the alignment will have at least three errors (mismatches or indels). The other includes precomputed tables, as in PASS, to produce a score based on the NW algorithm, but brings the strategy to a higher level; instead of addressing large tables, as PASS does, we designed an algorithm able to reuse small tables along the whole query sequence. In this way, the filter is much more selective and discards a very large number of false positives, thus drastically decreasing the time spent in the final extend step. GASSST’s originality comes then from the use of a small lookup table. More precisely, the precomputed alignment scores of all possible pairs of words of length w can be stored in a memory of size 42w bytes, if a score is memorized in a single byte. For PASS, the size of the lookup table is 414 = 256 MB. This fits into any computer’s main memory, but not in the first-level CPU cache. Hence, random accessing of the table, even if it avoids many computations, may still be costly. On the contrary, the GASSST algorithm manipulates a small table of 64 KB which easily fits into the cache memory, providing fast filtering compared with non-precomputed calculations. 3 METHODS GASSST algorithm has three stages: (i) searching for exact matching seeds between the reference genome and the query sequences; (ii) quickly eliminating hits that have more than a user-specified number of errors; and (iii) computing the full gapped alignment with the NW algorithm. These three steps are referred to, respectively, as seed-filter-extend. The novelty of the GASSST approach relies on a new highly efficient filter method. 3.1 Seed GASSST creates an index of all possible kmers in the reference sequence and then scans every query sequence to find matching seeds. When seeds longer than 14 are selected the algorithm uses a simple hash-table mechanism. For each seed position in the reference sequence, the index contains the sequence number, the position where it occurs and the flanking regions (binary encoded). Up to 16 nt are stored on the left and on the right of the seed in order to speed-up the next filtering steps. The size of the index is equal to 16×N bytes, with N the size of the reference sequence. If large reference sequences which exceed the memory size are considered, a simple partitioning scheme is provided; the reference sequence is split into as many parts as necessary to fit in the available RAM. Each part is then processed iteratively. 3.2 Tiled-NW filter A lookup table, called pre-computed score table PSTl, containing all the NW alignment scores of all possible pairs of l nt long words is first computed. PASS uses a PST7 to analyze discrepancies near the 7 nt long flanking region adjacent to the seeds. The bigger the table, the better the filtering. But unfortunately, the table grows exponentially with l. To address this issue, GASSST analyzes discrepancies in a region of any length thanks to the reuse of a small PST4 table. The goal is to provide a lower bound approximation of the real NW score along the whole alignment. If estimated lower bounds are greater than the maximum number of allowed errors, then alignments are eliminated. If not, alignments are passed to the next step. The following values are used for the NW score computation: 0 for a match, and 1 for mismatch and indel errors. Consequently, the final score indicates the number of discrepancies in the alignment. In the following, only the right side of the seed is considered, the other side being symmetrical. Definition 1. We call TNWl(n) the score of a region of n nucleotides which are adjacent to the right of the seed and which are computed with a PSTl. TNW stands for Tiled NW. TNWl(full) operates on the maximum length available, i.e. from the right of the seed to the end of the query sequence. Definition 2. If S1 and S2 are the two sequences directly adjacent to the right of the seed, we call PSTl(i, j) the pre-computed NW score for the two l nt long words S1(i, i+l−1) and S2(j, j+l−1). If Gm is the maximum number of allowed gaps, TNWl(n) is computed with the following recursion: If n≤l then TNWl(n)=PSTl(1,1) (1) Else TNWl(n)=min A, B, C (2) With A=TNWl(n−l)+PSTl(n−l+1, n−l+1) B= min 1≤s≤Gm (max(s, TNWl(n−l))+PSTl(n−l+1, n−l+1−s)) 2535 atMasarykuniversityonNovember23,2014http://bioinformatics.oxfordjournals.org/Downloadedfrom [10:14 21/9/2010 Bioinformatics-btq485.tex] Page: 2536 2534–2540 G.Rizk and D.Lavenier B A Fig. 1. Computation of a semi-global alignment (only the query sequence needs to be globally aligned), in the case of a maximum of 1 error. (A) Dynamic Programming. With 12 nt long sequences and a traditional dynamic programming algorithm, cell calculations can be limited in a band, but there are still 34 cells to compute. (B) Tiled Algorithm. With a precomputed table score of size 4×4, the tiled algorithm is performed in only three steps. The first step requires one table access, while the second and third steps both require three table accesses. Here, the score given by the tiled algorithm is the same as the full dynamic programming algorithm—there are two errors in the alignment. In the general case, the tiled algorithm only gives a lower bound of the number of errors present in the alignment. C = min 1≤s≤Gm (max(s, TNWl(n−l))+PSTl(n−l+1−s, n−l+1)) Figure 1B shows a graphical representation of this recursion. Complexity of the filter. GASSST uses a PST4 lookup table. By storing each score in a single byte, its size is equal to 64 KB. This small size allows the PST4 to fit in a CPU cache of today’s desktop computers. But if TNW4(4) requires a few clock cycles to be accessed, its filtering power is limited. In the general case, the number of PSTl accesses Nacc[TNW(n)] needed for the computation is in O(n). If Gm is the maximum number of allowed gaps, the exact formula is: Nacc[TNW(n)]=( n l −1)·(1+2·Gm)+1 (3) The filter works with iterative l-sized levels. The first one takes one access, then each following level requires (1+2·Gm) PSTl accesses. Each new level filters more and more false positive alignments. Large PSTl tables lead to fewer levels and better accuracy, but they imply longer execution times since the number of memory cache misses increases rapidly with bigger tables. 3.3 Frequency vector filter The frequency vector filter of GASSST is the same as the one used in EMBF (Wendi et al., 2009). The idea is quite simple: if one sequence has, for example, three more ‘G’ nucleotides than another sequence, then their alignment will have at least three errors. If the user-specified maximum number of errors is 2, then the alignment can be directly eliminated. For a sequence S=s1s2 ...n of characters in the alphabet ={a1, a2,...,ap}, the frequency vector F ={f1, f2,..., fm} is defined as: ∀i∈[1;m] fi = 1≤k≤n δsk,ai (4) With δsk,ai = 1 if sk =ai 0 otherwise (5) The Euler distance has to be computed on similar frequency vectors, i.e. referring to sequences of equal length. The distance is computed with: ECD(F,G)= 1≤i≤m |ui −vi| 2 (6) In GASSST, frequency vectors of sequences of up to 16 nt on both sides of the seed are computed. Actually, this value is often limited by the length of the reads, and forbids vectors of the reference sequence to be computed at runtime. Indeed, the size of subsequences depends on the size available on the read, which is often less than 16 and which is only known at compute time. The Euler distance is computed between the frequency vectors of the read and the genome. If the result is greater than the maximum number of errors then the alignment is discarded. The computation of the frequency vectors is vectorized and quickly performed thanks to the binary format of the seed flanking regions. The frequency vectors and the Euler distance computations are both vectorized to benefit from vector execution units present in modern processors. Since a 16 nt sequence is 32-bit encoded, counting the frequency of a single nucleotide can be done with bit-level logical operations that operate on traditional 32-bit words. The vectorization consists of simultaneously computing values for the four nucleotides of the alphabet and operates on 128-bit wide words. The frequency distance vectorized filter is referred to as FD-vec. 3.4 Filters combination The goal of the filter step is to eliminate as many false positive alignments generated by the seed step as possible, and in the fastest possible way. This is done by ordering the individual filters from the fastest to the most complex and powerful. Algorithm 1 shows the main computation loop of GASSST with the series of filters. TNW4(4) is first since it is the fastest. It is followed by the vectorized frequency filter that rules out remaining false positive alignments. Then a more thorough filter is used (TNW4(16)). The last filter is a TNW7(full) filter which is applied on the full length of the sequence and with a PST7 in order to eliminate a maximum number of false positive alignments. It is computationally intensive, but it comes at the end when most alignments have already been ruled out. 2536 atMasarykuniversityonNovember23,2014http://bioinformatics.oxfordjournals.org/Downloadedfrom [10:14 21/9/2010 Bioinformatics-btq485.tex] Page: 2537 2534–2540 GASSST Finally, the true NW alignment is computed on alignments that go through all filters. This combination of filters ensures an efficient and fast filtering in a wide range of configurations, from short to longer reads, with low or high polymorphism. One important point is that these filters only discard alignments which are proven to have too many errors and that would have been eliminated by the NW algorithm. They never eliminate good alignments, hence they do not decrease sensitivity. Moreover, to reduce running time, the search is stopped when a maximal number of occurrences of a seed in the reference sequence has been reached. This kind of limitation is present in most other aligners. The only difference here is that a threshold is also checked in some stage of the filtering process. The threshold is automatically computed according to seed length and reference sequence size. Users can control the speed/sensitivity trade-off of this heuristic through a parameter s in 0–5 which modulates this threshold. Algorithm 1 Main computation loop 1. Input: 2. reference genome, short query sequences. 3. parameter: n (maximum errors allowed). 4. Pre-calculation: 5. Compute the reference genome index 6. for each query q do 7. for each overlapping seed s in q do 8. for each occurrence o of s in reference genome do 9. if TNW4(4){q,s,o}