De Bruijn Graph assembly Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you’re using them. For original Keynote files, email me. Real-world assembly methods Both handle unresolvable repeats by essentially leaving them out Fragments are contigs (short for contiguous) Unresolvable repeats break the assembly into fragments OLC: Overlap-Layout-Consensus assembly DBG: De Bruijn graph assembly a_long_long_long_time a_long_long_time a_long          long_time Assemble substrings with Greedy-SCS Assemble substrings with OLC or DBG De Bruijn graph assembly A formulation conceptually similar to overlapping/SCS, but has some potentially helpful properties not shared by SCS. k-mer “k-mer”is a substring of length k GGCGATTCATCG ATTCA 4-mer of S: S: All 3-mers of S: GGC  GCG    CGA      GAT        ATT          TTC            TCA              CAT                ATC                  TCG I’ll use“k-1-mer”to refer to a substring of length k - 1 mer: from Greek meaning“part” De Bruijn graph AAA, AAB, ABB, BBB, BBA As usual, we start with a collection of reads, which are substrings of the reference genome. AAB is a k-mer (k = 3). AA is its left k-1-mer, and AB is its right k-1-mer. AAB AA AB L R 3-mer AAB’s left 2-mer AAB’s right 2-mer De Bruijn graph AAA, AAB, ABB, BBB, BBA Take each length-3 input string and split it into two overlapping substrings of length 2. Call these the left and right 2-mers. AA, AA, AA, AB, AB, BB, BB, BB, BB, BA Let 2-mers be nodes in a new graph. Draw a directed edge from each left 2-mer to corresponding right 2-mer: AA AB BA BB L R L R L R L R L R Each edge in this graph corresponds to a length-3 input string AAABBBA take all 3-mers: form L/R 2-mers: De Bruijn graph AA AB BA BB An edge corresponds to an overlap (of length k-2) between two k-1 mers. More precisely, it corresponds to a k-mer from the input. AAA AAB ABB BBB BBA De Bruijn graph AA AB BA BB If we add one more B to our input string: AAABBBBA, and rebuild the De Bruijn graph accordingly, we get a multiedge. AAA AAB ABB BBB BBA BBB Directed multigraph Directed multigraph G(V, E) consists of set of vertices, V and multiset of directed edges, E Otherwise, like a directed graph a b c d V = { a, b, c, d } E = { (a, b), (a, b), (a, b), (a, c), (c, b) } Repeated Node’s indegree = # incoming edges Node’s outdegree = # outgoing edges De Bruijn graph is a directed multigraph Eulerian walk definitions and statements Node is balanced if indegree equals outdegree Node is semi-balanced if indegree differs from outdegree by 1 A directed, connected graph is Eulerian if and only if it has at most 2 semi-balanced nodes and all other nodes are balanced Graph is connected if each node can be reached by some other node Jones and Pevzner section 8.8 AA AB BA BB Eulerian walk visits each edge exactly once Not all graphs have Eulerian walks. Graphs that do are Eulerian. (For simplicity, we won’t distinguish Eulerian from semi-Eulerian.) De Bruijn graph AAA, AAB, ABB, BBB, BBA Back to our De Bruijn graph AA, AA, AA, AB, AB, BB, BB, BB, BB, BA AA AB BA BB L R L R L R L R L R Is it Eulerian? Argument 1: AA → AA → AB → BB → BB → BA Argument 2: AA and BA are semi-balanced, AB and BB are balanced Yes De Bruijn graph A procedure for making a De Bruijn graph for a genome Start with an input string: a_long_long_long_time Take each k mer and split into left and right k-1 mers Pick a substring length k: 5 long_ long  ong_ Add k-1 mers as nodes to De Bruijn graph (if not already there), add edge from left k-1 mer to right k-1 mer Assume perfect sequencing where each length-k substring is sequenced exactly once with no errors ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time a_lo _lon a_lo _lon long a_lo _lon long ong_ ong_ ng_l a_lo _lon long ng_l g_lo ong_ a_lo _lon long ng_l g_lo a_lo _lon long ong_ ng_l g_lo a_lo _lon long ong_ ng_l g_lo a_lo _lon long ong_ First 8 k-mer additions, k = 5 a_long_long_long_time De Bruijn graph ng_l g_lo a_lo _lon long ong_ ng_t ng_l g_lo a_lo _lon long ong_ ng_t g_ti ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time ng_l g_lo a_lo _lon long ong_ Finished graph a_long_long_long_time Last 5 k-mer additions, k = 5 De Bruijn graph With perfect sequencing, this procedure always yields an Eulerian graph. Why? ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time Node for k-1-mer from left end is semi-balanced with one more outgoing edge than incoming * Node for k-1-mer at right end is semi-balanced with one more incoming than outgoing * * Unless genome is circular Other nodes are balanced since # times k-1-mer occurs as a left k-1-mer = # times it occurs as a right k-1-mer De Bruijn graph De Bruijn graph implementation class  DeBruijnGraph:        """  A  De  Bruijn  multigraph  built  from  a  collection  of  strings.                User  supplies  strings  and  k-­‐mer  length  k.    Nodes  of  the  De                Bruijn  graph  are  k-­‐1-­‐mers  and  edges  join  a  left  k-­‐1-­‐mer  to  a                right  k-­‐1-­‐mer.  """          @staticmethod        def  chop(st,  k):                """  Chop  a  string  up  into  k  mers  of  given  length  """                for  i  in  xrange(0,  len(st)-­‐(k-­‐1)):  yield  st[i:i+k]                class  Node:                """  Node  in  a  De  Bruijn  graph,  representing  a  k-­‐1  mer  """                def  __init__(self,  km1mer):                        self.km1mer  =  km1mer                                def  __hash__(self):                        return  hash(self.km1mer)                def  __init__(self,  strIter,  k):                """  Build  De  Bruijn  multigraph  given  strings  and  k-­‐mer  length  k  """                self.G  =  {}          #  multimap  from  nodes  to  neighbors                self.nodes  =  {}  #  maps  k-­‐1-­‐mers  to  Node  objects                self.k  =  k                for  st  in  strIter:                        for  kmer  in  self.chop(st,  k):                                km1L,  km1R  =  kmer[:-­‐1],  kmer[1:]                                nodeL,  nodeR  =  None,  None                                if  km1L  in  self.nodes:                                        nodeL  =  self.nodes[km1L]                                else:                                        nodeL  =  self.nodes[km1L]  =  self.Node(km1L)                                if  km1R  in  self.nodes:                                        nodeR  =  self.nodes[km1R]                                else:                                        nodeR  =  self.nodes[km1R]  =  self.Node(km1R)                                self.G.setdefault(nodeL,  []).append(nodeR) Chop string into k-mers For each k-mer, find left and right k-1-mers Create corresponding nodes (if necessary) and add edge For Eulerian graph, Eulerian walk can be found in O(| E |) time. | E | is # edges. Convert graph into one with Eulerian cycle (add an edge to make all nodes balanced), then use this recursive procedure #  Make  all  nodes  balanced,  if  not  already tour  =  [] #  Pick  arbitrary  node src  =  g.iterkeys().next()   def  __visit(n): while  len(g[n])  >  0: dst  =  g[n].pop() __visit(dst)      tour.append(n)                 __visit(src) #  Reverse  order,  omit  repeated  node tour  =  tour[::-­‐1][:-­‐1] #  Turn  tour  into  walk,  if  necessary Insight: If C is a cycle in an Eulerian graph, then after removing edges of C, remaining connected components are also Eulerian http://www.algorithmist.com/index.php/Eulerian_tour De Bruijn graph Full illustrative De Bruijn graph and Eulerian walk: http://nbviewer.ipython.org/7237207 >>>  G  =  DeBruijnGraph(["a_long_long_long_time"],  5) >>>  print  G.eulerianWalkOrCycle() ['a_lo',  '_lon',  'long',  'ong_',  'ng_l',  'g_lo',   '_lon',  'long',  'ong_',  'ng_l',  'g_lo',  '_lon',   'long',  'ong_',  'ng_t',  'g_ti',  '_tim',  'time'] Example where Eulerian walk gives correct answer for small k whereas Greedy-SCS could spuriously collapse repeat: ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time De Bruijn graph Another example Eulerian walk: >>>  st  =  "to_every_thing_turn_turn_turn_there_is_a_season" >>>  G  =  DeBruijnGraph([st],  4) >>>  path  =  G.eulerianWalkOrCycle() >>>  superstring  =  path[0]  +  ''.join(map(lambda  x:  x[-­‐1],  path[1:])) >>>  print  superstring to_every_thing_turn_turn_turn_there_is_a_season to_ o_e n_t _tu _th g_t ery ry_ _ev y_t sea eas a_s _se ng_re_ e_i ing ver her ere s_a _a_ _is is_tur urn thithe aso son eve hin rn_ Recall: This is not generally possible or tractable in the overlap/SCS formulation De Bruijn graph Assuming perfect sequencing, procedure yields graph with Eulerian walk that can be found efficiently. ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time We saw cases where Eulerian walk corresponds to the original superstring. Is this always the case? De Bruijn graph No: graph can have multiple Eulerian walks, only one of which corresponds to original superstring BE EF AB BC BY CD FA ZA DA Right: graph for ZABCDABEFABY, k = 3 Alternative Eulerian walks: ZA → AB → BE → EF → FA → AB → BC → CD → DA → AB → BY ZA → AB → BC → CD → DA → AB → BE → EF → FA → AB → BY These correspond to two edge-disjoint directed cycles joined by node AB AB is a repeat: ZABCDABEFABY De Bruijn graph Case where k = 4 works: >>>  st  =  "to_every_thing_turn_turn_turn_there_is_a_season" >>>  G  =  DeBruijnGraph([st],  4) >>>  path  =  G.eulerianWalkOrCycle() >>>  superstring  =  path[0]  +  ''.join(map(lambda  x:  x[-­‐1],  path[1:])) >>>  print  superstring to_every_thing_turn_turn_turn_there_is_a_season But k = 3 does not: >>>  st  =  "to_every_thing_turn_turn_turn_there_is_a_season" >>>  G  =  DeBruijnGraph([st],  3) >>>  path  =  G.eulerianWalkOrCycle() >>>  superstring  =  path[0]  +  ''.join(map(lambda  x:  x[-­‐1],  path[1:])) >>>  print  superstring to_every_turn_turn_thing_turn_there_is_a_season Due to repeats that are unresolvable at k = 3 De Bruijn graph This is the first sign that Eulerian walks can’t solve all our problems Other signs emerge when we think about how actual sequencing differs from our idealized construction BE EF AB BC BY CD FA ZA DA De Bruijn graph Gaps in coverage can lead to disconnected graph Graph for a_long_long_long_time, k = 5: ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time De Bruijn graph Gaps in coverage can lead to disconnected graph Graph for a_long_long_long_time, k = 5 but omitting ong_t : ng_l g_lo a_lo _lon long ong_ g_ti _tim ng_t time Connected components are individually Eulerian, overall graph is not De Bruijn graph Differences in coverage also lead to nonEulerian graph Graph for a_long_long_long_time, k = 5 but with extra copy of ong_t : ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time Graph has 4 semi-balanced nodes, isn’t Eulerian De Bruijn graph Errors and differences between chromosomes also lead to non-Eulerian graphs Graph for a_long_long_long_time, k = 5 but with error that turns a copy of long_ into lxng_ Graph is not connected; largest component is not Eulerian ng_l g_lo a_lo _lon lxng xng_ long ong_ ng_t g_ti _tim time De Bruijn graph Casting assembly as Eulerian walk is appealing, but not practical Even if graph were Eulerian, repeats yield many possible walks Uneven coverage, sequencing errors, etc make graph non-Eulerian De Bruijn Superwalk Problem (DBSP) is an improved formulation where we seek a walk over the De Bruijn graph, where walk contains each read as a subwalk Kingsford, Carl, Michael C. Schatz, and Mihai Pop. "Assembly complexity of prokaryotic genomes using short reads." BMC bioinformatics 11.1 (2010): 21. Proven NP-hard! Medvedev, Paul, et al. "Computability of models for sequence assembly." Algorithms in Bioinformatics. Springer Berlin Heidelberg, 2007. 289-301. De Bruijn graph But first we note that using the De Bruijn graph representation has other advantages... In practice, De Bruijn graph-based tools give up on unresolvable repeats and yield fragmented assemblies, just like OLC tools. De Bruijn graph To build a De Bruijn graph in practice: Pick k. Assume k ≤ shortest read length (k = 30 to 50 is common). For each read: For each k-mer: Add k-mer’s left and right k-1-mers to graph if not there already. Draw an edge from left to right k-1-mer. d = 6 x 109 reads n = 100 nt m = 3 x 109 nt ≈ 1 sequencing run ≈ human Say a sequencer produces d reads of length n from a genome of length m De Bruijn graph a_long_long_long_time a_long_long_long,  ng_long_l,  g_long_time Genome: Reads: k-mers: Pick k = 8 a_long_l    _long_lo        long_lon        ong_long          ng_long_          g_long_l            _long_lo              long_lon                ong_long g_long_t  _long_ti    long_tim      ong_time ng_long_  g_long_l g_long_ _long_l _long_t long_ti ong_tim long_lo ong_lon ng_long ng_time a_long_ Given n (# reads), N (total length of all reads) and k, and assuming k < length of shortest read: Exact number of k-mers: N - n (k -1) O(N) This is also the number of edges, | E | Number of nodes | V | is at most 2 ∙ | E |, but typically much smaller due to repeated k-1-mers De Bruijn graph g_long_ _long_l _long_t long_ti ong_tim long_lo ong_lon ng_long ng_time a_long_ How much work to build graph? For each k-mer, add 1 edge and up to 2 nodes Reasonable to say this is O(1) expected work Assume hash map encodes nodes & edges Assume k-1-mers fit in O(1) machine words, and hashing O(1) machine words is O(1) work Querying / adding a key is O(1) expected work O(1) expected work for 1 k-mer, O(N) overall De Bruijn graph Timed De Bruijn graph construction applied to progressively longer prefixes of lambda phage genome, k = 14 0 10000 20000 30000 40000 50000 0.000.050.100.150.20 Length of genome Secondsrequiredtobuild O(N) expectation appears to work in practice, at least for this small example De Bruijn graph In typical assembly projects, average coverage is ~ 30 - 50 ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time De Bruijn graph Recall average coverage: average # reads covering a genome position GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT                                    CTAGGCCCTCAATTTTT                                CTCTAGGCCCTCAATTTTT                            GGCTCTAGGCCCTCATTTTTT                      CTCGGCTCTAGCCCCTCATTTT                TATCTCGACTCTAGGCCCTCA                TATCTCGACTCTAGGCC        TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT 177 nucleotides 35 nucleotides Average coverage = 177 / 35 ≈ 7x De Bruijn graph In typical assembly projects, average coverage is ~ 30 - 50 Same edge might appear in dozens of copies; let’s use edge weights instead ng_l g_lo a_lo _lon long ong_ ng_t g_ti _tim time ng_l g_lo 20 a_lo _lon 10 long ong_ 30 20 ng_t 10 g_ti _tim 10 10 20 30 time 10 Weight = # times k-mer occurs Using weights, there’s one weighted edge for each distinct k-mer Before: one edge per k-mer After: one weighted edge per distinct k-mer De Bruijn graph Say (a) reads are error-free, (b) we have one weighted edge for each distinct k-mer, and (c) length of genome is G # of nodes and edges both O(N); N is total length of all reads So # of nodes and edges are also both O(G) Can’t be more distinct k-mers than there are k-mers in the genome; likewise for k-1-mers Combine with the O(N) bound and the # of nodes and edges are both O(min(N, G)) There’s one node for each distinct k-1-mer, one edge for each distinct k-mer De Bruijn graph With high average coverage, O(G) size bound is advantageous Size of De Bruijn graph grows sublinearly when average coverage is high 10 20 30 40 50 20000400006000080000 Average coverage #deBruijngraphnodes+edges k = 30 Genome = lambda phage (~ 48.5 K nt) Draw random k-mers until target average coverage is reached (x axis) Build De Bruijn graph and total the # of nodes and edges (y axis) De Bruijn graph De Bruijn graph What De Bruijn graph advantages have we discovered? Can be built in O(N) expected time, N = total length of reads With perfect data, graph is O(min(N, G)) space; G = genome length Compares favorably with overlap graph Fast overlap graph construction (suffix tree) is O(N + a) time Space is O(N + a). a is O(n2) Note: when average coverage is high, G ≪ N De Bruijn graph What did we give up? Reads are immediately split into shorter k-mers; can’t resolve repeats as well as overlap graph Read coherence is lost. Some paths through De Bruijn graph are inconsistent with respect to input reads. Only a very specific type of“overlap”is considered, which makes dealing with errors more complicated, as we’ll see This is the OLC ⟷ DBG tradeoff Single most important benefit of De Bruijn graph is the O(min(G, N)) space bound, though we’ll see this comes with large caveats