Overlap Layout Consensus assembly Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you’re using them. For original Keynote files, email me. Real-world assembly methods Both handle unresolvable repeats by essentially leaving them out Fragments are contigs (short for contiguous) Unresolvable repeats break the assembly into fragments OLC: Overlap-Layout-Consensus assembly DBG: De Bruijn graph assembly a_long_long_long_time a_long_long_time a_long          long_time Assemble substrings with Greedy-SCS Assemble substrings with OLC or DBG Assembly alternatives Alternative 1: Overlap-Layout-Consensus (OLC) assembly Alternative 2: de Bruijn graph (DBG) assembly Overlap Layout Consensus Error correction de Bruijn graph Scaffolding Refine Overlap Layout Consensus Overlap Layout Consensus Build overlap graph Bundle stretches of the overlap graph into contigs Pick most likely nucleotide sequence for each contig Finding overlaps Can we be less naive than this? CTCTAGGCC TAGGCCCTC X: Y: Say l = 3 CTCTAGGCC TAGGCCCTC X: Y: Look for this in Y, going right-to-left Found it CTCTAGGCC TAGGCCCTC X: Y: Extend to left; in this case, we confirm that a length-6 prefix of Y matches a suffix of X We’re doing this for every pair of input strings Finding overlaps Can we use suffix trees for overlapping? Problem: Given a collection of strings S, for each string x in S find all overlaps involving a prefix of x and a suffix of another string y Hint: Build a generalized suffix tree of the strings in S Finding overlaps with suffix tree Generalized suffix tree for {“GACATA”,“ATAGAC”} GACATA$0ATAGAC$1 A 6 $0 C 13 $1 GAC TA 5 $0 C TA 9 GAC$ 1 1 ATA$0 11 $1 3 $0 7 GAC$ 1 2 ATA$0 12 $1 0 ATA$0 10 $1 4 $0 8 GAC$ 1 Say query = GACATA. From root, follow path labeled with query. Green edge implies length-3 suffix of second string equals length-3 prefix of queryATAGAC      |||      GACATA Finding overlaps with suffix tree Generalized suffix tree for {“GACATA”,“ATAGAC”} GACATA$0ATAGAC$1 A 6 $0 C 13 $1 GAC TA 5 $0 C TA 9 GAC$ 1 1 ATA$0 11 $1 3 $0 7 GAC$ 1 2 ATA$0 12 $1 0 ATA$0 10 $1 4 $0 8 GAC$ 1 For each string: Walk down from root and report any outgoing edge labeled with a separator. Each corresponds to a prefix/suffix match involving prefix of query string and suffix of string ending in the separator. Strategy: (1) Build tree (2) Finding overlaps with suffix tree Generalized suffix tree for {“GACATA”,“ATAGAC”} GACATA$0ATAGAC$1 A 6 $0 C 13 $1 GAC TA 5 $0 C TA 9 GAC$ 1 1 ATA$0 11 $1 3 $0 7 GAC$ 1 2 ATA$0 12 $1 0 ATA$0 10 $1 4 $0 8 GAC$ 1 GACATA      |||      ATAGAC GACATA          |          ATAGAC ATAGAC      |||      GACATA Now let query be second string: ATAGAC Finding overlaps with suffix tree A 6 $0 C 13 $1 GAC TA 5 $0 C TA 9 GAC$ 1 1 ATA$0 11 $1 3 $0 7 GAC$ 1 2 ATA$0 12 $1 0 ATA$0 10 $1 4 $0 8 GAC$ 1 Say there are d reads of length n, total length N = dn, and a = # read pairs that overlap Time to build generalized suffix tree: O(N) ... to walk down red paths: O(N) ... to find & report overlaps (green): O(a) Overall: O(N + a) d2 doesn’t appear explicitly, but a is O(d2) in worst case Assume for given string pair we report only the longest suffix/prefix match Finding overlaps What if we want to allow mismatches and gaps in the overlap? CTCGGCCCTAGG      |||  |||||      GGCTCTAGGCCC X: Y:I.e. How do we find the best alignment of a suffix of X to a prefix of Y? Dynamic programming But we must frame the problem such that only backtraces involving a suffix of X and a prefix of Y are allowed Finding overlaps with dynamic programming CTCGGCCCTAGG      |||  |||||      GGCTCTAGGCCC X: Y: We’ll use global alignment recurrence and score function Find the best alignment of a suffix of X to a prefix of Y A C G T -­‐ A 0 4 2 4 8 C 4 0 4 2 8 G 2 4 0 4 8 T 4 2 4 0 8 -­‐ 8 8 8 8 s(a, b) D[i, j] = min 8 < : D[i 1, j] + s(x[i 1], ) D[i, j 1] + s( , y[j 1]) D[i 1, j 1] + s(x[i 1], y[j 1]) But how do we force it to find prefix / suffix matches? Finding overlaps with dynamic programming Find the best alignment of a suffix of X to a prefix of Y A C G T -­‐ A 0 4 2 4 8 C 4 0 4 2 8 G 2 4 0 4 8 T 4 2 4 0 8 -­‐ 8 8 8 8 s(a, b) D[i, j] = min 8 < : D[i 1, j] + s(x[i 1], ) D[i, j 1] + s( , y[j 1]) D[i 1, j 1] + s(x[i 1], y[j 1]) -­‐ G G C T C T A G G C C C -­‐ C T C G G C C C T A G G X Y How to initialize first row & column so suffix of X aligns to prefix of Y? 0 0 0 0 0 0 0 0 0 0 0 0 0 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ First column gets 0s (any suffix of X is possible) First row gets ∞s (must be a prefix of Y) 4 12 20 28 36 44 52 60 68 76 84 92 4 8 14 20 28 36 44 52 60 68 76 84 4 8 8 16 20 28 36 44 52 60 68 76 0 4 12 12 20 24 30 36 44 52 60 68 0 0 8 16 16 24 26 30 36 44 52 60 4 4 0 8 16 18 26 30 34 36 44 52 4 8 4 2 8 16 22 30 34 34 36 44 4 8 8 6 2 10 18 26 34 34 34 36 4 8 10 8 8 2 10 18 26 34 36 36 2 6 12 14 12 10 2 10 18 26 34 40 0 2 10 16 18 16 10 0 10 18 26 34 0 0 6 14 20 22 18 10 2 10 18 26 CTCGGCCCTAGG      |||  |||||      GGCTCTAGGCCC X: Y: Backtrace from last row Finding overlaps with dynamic programming Find the best alignment of a suffix of X to a prefix of Y A C G T -­‐ A 0 4 2 4 8 C 4 0 4 2 8 G 2 4 0 4 8 T 4 2 4 0 8 -­‐ 8 8 8 8 s(a, b) D[i, j] = min 8 < : D[i 1, j] + s(x[i 1], ) D[i, j 1] + s( , y[j 1]) D[i 1, j 1] + s(x[i 1], y[j 1]) -­‐ G G C T C T A G G C C C -­‐ C T C G G C C C T A G G X Y Problem: very short matches got high scores by chance... 0 0 0 0 0 0 0 0 0 0 0 0 0 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ 4 12 20 28 36 44 52 60 68 76 84 92 4 8 14 20 28 36 44 52 60 68 76 84 4 8 8 16 20 28 36 44 52 60 68 76 0 4 12 12 20 24 30 36 44 52 60 68 0 0 8 16 16 24 26 30 36 44 52 60 4 4 0 8 16 18 26 30 34 36 44 52 4 8 4 2 8 16 22 30 34 34 36 44 4 8 8 6 2 10 18 26 34 34 34 36 4 8 10 8 8 2 10 18 26 34 36 36 2 6 12 14 12 10 2 10 18 26 34 40 0 2 10 16 18 16 10 0 10 18 26 34 0 0 6 14 20 22 18 10 2 10 18 26 ...which might obscure the more relevant match Say we want to enforce minimum overlap length l = 5 Finding overlaps with dynamic programming Find the best alignment of a suffix of X to a prefix of Y A C G T -­‐ A 0 4 2 4 8 C 4 0 4 2 8 G 2 4 0 4 8 T 4 2 4 0 8 -­‐ 8 8 8 8 s(a, b) D[i, j] = min 8 < : D[i 1, j] + s(x[i 1], ) D[i, j 1] + s( , y[j 1]) D[i 1, j 1] + s(x[i 1], y[j 1]) -­‐ G G C T C T A G G C C C -­‐ 0 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ C 0 4 12 20 28 36 44 52 60 68 76 84 92 T 0 4 8 14 20 28 36 44 52 60 68 76 84 C 0 4 8 8 16 20 28 36 44 52 60 68 76 G 0 0 4 12 12 20 24 30 36 44 52 60 68 G 0 0 0 8 16 16 24 26 30 36 44 52 60 C 0 4 4 0 8 16 18 26 30 34 36 44 52 C 0 4 8 4 2 8 16 22 30 34 34 36 44 C 0 4 8 8 6 2 10 18 26 34 34 34 36 T ∞ 4 8 10 8 8 2 10 18 26 34 36 36 A ∞ 12 6 12 14 12 10 2 10 18 26 34 40 G ∞ 20 12 10 16 18 16 10 0 10 18 26 34 G ∞ ∞ ∞ ∞ ∞ 20 22 18 10 2 10 18 26 X Y Solve by initializing certain additional cells to ∞ Cells whose values changed highlighted in red Now the relevant match is the best candidate Finding overlaps with dynamic programming Number of overlaps to try: O(d2) Size of each dynamic programming matrix: O(n2) Overall: O(d2n2) = O(N2) Say there are d reads of length n, total length N = dn, and a is total number of pairs with an overlap Contrast O(N2) with suffix tree: O(N + a), but where a is worst-case O(d2) But dynamic programming is more flexible, allowing mismatches and gaps Real-world overlappers mix the two, using indexes to filter out vast majority of non-overlapping pairs, then using dynamic programming for remaining pairs Finding overlaps Overlapping is typically the slowest part of assembly Consider a second-generation sequencing dataset with hundreds of millions or billions of reads! Approaches from alignment unit can be adapted to finding overlaps Could also have adapted efficient exact matching, approximate string matching, co-traversal, ... We saw adaptations of naive exact matching, suffix-treeassisted exact matching, and dynamic programming Finding overlaps http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA#Overlapper Celera Assembler’s overlapper is probably the best documented: Inverted substring indexes built on batches of reads Only look for overlaps between reads that share one or more substrings of some length Overlap Layout Consensus Overlap Layout Consensus Build overlap graph Bundle stretches of the overlap graph into contigs Pick most likely nucleotide sequence for each contig Layout Overlap graph is big and messy. Contigs don’t“pop out”at us. Below: part of the overlap graph for to_every_thing_turn_turn_turn_there_is_a_season l = 4, k = 7 ry_thin thing_t 4 _thing_ 5y_thing 6 urn_tur rn_turn 6 _turn_t 4 n_turn_ 5 5 6 turn_tu 4 turn_th 4 here_is e_is_a_ 4 ere_is_ 6 re_is_a 5 ing_tur 5hing_tu 6 ng_turn 4 urn_the _there_ 4 n_there 5rn_ther 6 5 4 there_i 6 4 6 4 5 6 6 4 5 5 4 5 4 66 4 6 5 4 6 g_turn_ 5 5 6 4 5 6 4 6 5 4 4 4 6 55 6 5 4 5 44 6 5 6 4 6 4 5 6 5 4 6 4 5 4 4 6 55 ry_thin 4 _thing_ 5y_thing 6 very_th 5ery_thi 6 4 _every_ 5 4 every_t 6 6 4 5 4 6 5 o_every 4 6 5 5 6 to_ever 5 4 6 Layout Anything redundant about this part of the overlap graph? Some edges can be inferred (transitively) from other edges E.g. green edge can be inferred from blue Layout Remove transitively-inferrible edges, starting with edges that skip one node: ry_thin thing_t 4 _thing_ 5y_thing 6 urn_tur rn_turn 6 _turn_t 4 n_turn_ 5 5 6 turn_tu 4 turn_th 4 here_is e_is_a_ 4 ere_is_ 6 re_is_a 5 ing_tur 5hing_tu 6 ng_turn 4 urn_the _there_ 4 n_there 5rn_ther 6 5 4 there_i 6 4 6 4 5 6 6 4 5 5 4 5 4 66 4 6 5 4 6 g_turn_ 5 5 6 4 5 6 4 6 5 4 4 4 6 55 6 5 4 5 44 6 5 6 4 6 4 5 6 5 4 6 4 5 4 4 6 55 Before: x Layout y_thing urn_tur rn_turn 6 n_turn_ 6 here_is ere_is_ 6 thing_t hing_tu 6 urn_the rn_ther 6 _there_ there_i 6 _thing_ 6 ing_tur 4 _turn_t turn_tu 6 turn_th 6 n_there 4 6 ng_turn 6 6 6 6 4 4 4 6 6 g_turn_ 4 6 6 4 6 4 4 4 6 After: x Remove transitively-inferrible edges, starting with edges that skip one node: Layout ry_thin y_thing 6 urn_tur rn_turn 6 n_turn_ 6 here_is ere_is_ 6 thing_t hing_tu 6 urn_the rn_ther 6 _there_ there_i 6 _thing_ 6 6 _turn_t turn_tu 6 turn_th 6 n_there 6 ing_tur ng_turn 6 6 6 4 6 6 g_turn_ 6 6 4 6 4 6 x Remove transitively-inferrible edges, starting with edges that skip one or two nodes: x Even simpler After: Layout Emit contigs corresponding to the non-branching stretches ry_thin y_thing 6 urn_tur rn_turn 6 a_seaso _season 6 n_turn_ 6 here_is ere_is_ 6 thing_t hing_tu 6 urn_the rn_ther 6 _there_ there_i 6 very_th ery_thi 6 _every_ every_t 6 _thing_ 6 e_is_a_ _is_a_s 6 6 6 _turn_t turn_tu 6 turn_th 6 n_there 6 o_every 6 ing_tur ng_turn 6 re_is_a 6 is_a_se s_a_sea 6 6 6 4 6 6 _a_seas 6 g_turn_ 6 to_ever 6 6 6 4 6 6 6 4 6 to_every_thing_turn_ turn_there_is_a_season Contig 1 Contig 2 Unresolvable repeat Layout In practice, layout step also has to deal with spurious subgraphs, e.g. because of sequencing error Possible repeat boundary Mismatcha b Mismatch could be due to sequencing error or repeat. Since the path through b ends abruptly we might conclude it’s an error and prune b. ... a bprune Overlap Layout Consensus Overlap Layout Consensus Build overlap graph Bundle stretches of the overlap graph into contigs Pick most likely nucleotide sequence for each contig Consensus Take reads that make up a contig and line them up At each position, ask: what nucleotide (and/or gap) is here? Complications: (a) sequencing error, (b) ploidy Say the true genotype is AG, but we have a high sequencing error rate and only about 6 reads covering the position. TAGATTACACAGATTACTGA  TTGATGGCGTAA  CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG  TTACACAGATTATTGACTTCATGGCGTAA  CTA TAGATTACACAGATTACTGACTTGATGGCGTAA  CTA TAGATTACACAGATTACTGACTTGATGGCGTAA  CTA TAGATTACACAGATTACTGACTTGATGGCGTAA  CTA Take consensus, i.e. majority vote Overlap Layout Consensus Overlap Layout Consensus Build overlap graph Bundle stretches of the overlap graph into contigs Pick most likely nucleotide sequence for each contig OLC drawbacks Building overlap graph is slow. We saw O(N + a) and O(N2) approaches. 2nd-generation sequencing datasets are ~ 100s of millions or billions of reads, hundreds of billions of nucleotides total Overlap graph is big; one node per read, and in practice # edges grows superlinearly with # reads Assembly alternatives Alternative 1: Overlap-Layout-Consensus (OLC) assembly Alternative 2: de Bruijn graph (DBG) assembly Overlap Layout Consensus Error correction de Bruijn graph Scaffolding Refine