Pan-genome Search and Storage

by Guillaume Holley

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor rerum naturalium at the Faculty of Technology, Bielefeld University

February 2018

Referees:
Professor Doctor Jens Stoye
Doctor Faraz Hach
Doctor Rayan Chikhi

ACKNOWLEDGEMENTS

I would like to thank everyone who has been directly or indirectly involved in this thesis. I thank Jens Stoye for welcoming me as his student in the Genome Informatics group and providing me the guidance I needed to achieve this thesis. I am thankful to Roland Wittler for supporting my ideas and helping me out the numerous times I entered his office and said “Do you have five minutes?” while knowing full well it would take much longer. I also thank Faraz Hach, from whom I have learned a lot during my research visit in Vancouver.

My gratitude goes to the entire population of the U10 and V10 floors scattered in multiple groups, to the former members of the AGGI group I have met during the past few years and to the badminton gang. A big thank-you goes to all DiDy students for all these ping-pong games, cool retreats, dinners at the Chinese restaurant, cinema and board game evenings. A special thank-you goes to Nina, my “office wife”, for sharing an office (and temporarily a house) with me and thoroughly reading this thesis. I could not have found a better office mate (thank you again Roland!) with whom I could play ping-pong, discuss scientific issues and share all my jokes of questionable quality. I would like to extend this thank-you to Tina, who did not back down when she started to use the rustic first version of my code.

This acknowledgement would be incomplete without thanking Pierre Peterlongo for introducing me to the awesome field of computational biology and sharing his contagious passion for the de Bruijn graph.

I would like to thank all the friends I met before and during my time in Bielefeld. I cannot thank enough Monika and Kerstin for being such good friends, for helping me on so many occasions, for all these geocaching afternoons and sushi dinners. I am also especially grateful to Sebastien, Jean-Baptiste and Paul for staying in touch after I arrived in Bielefeld even though we did not see each other for a long time.

I am thankful to my family for the enormous support provided during all these years and for visiting me numerous times. Particularly, I thank my parents, who taught me creativity and perseverance. Who would have guessed that installing an old computer in my room fifteen years ago would have resulted in the present thesis? Clearly, none of this would have been possible without the overwhelming support and the unlimited patience of Violette. The past few years have been a hell of an adventure and I am so glad we did it together.

Finally, I would like to acknowledge funding from the International Research Training Group GRK/1906 “Computational Methods for the Analysis of the Diversity and Dynamics of Genomes”, also known as DiDy, for three years, and funding from the Bielefeld Genome Informatics group and a Faculty of Technology scholarship for the past few months.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS  v
LIST OF FIGURES  x
LIST OF TABLES  xiv
ABSTRACT  xv

CHAPTER

I. Introduction  1
  1.1 Biological Background  1
    1.1.1 Preliminaries  1
    1.1.2 DNA Sequencing  3
    1.1.3 de novo Read Assembly  6
    1.1.4 Comparative Genomics  6
    1.1.5 Pan-genomes  7
  1.2 Computational Background  10
    1.2.1 Preliminaries  11
    1.2.2 Graphs  11
    1.2.3 Trees  13
    1.2.4 Hash Tables  19
    1.2.5 Bloom Filters  23
    1.2.6 Burrows-Wheeler Transform  26
    1.2.7 FM-Index  29
  1.3 Thesis Overview  30

II. Methods and Tools for Pan-genome Indexing  31
  2.1 Introduction  31
  2.2 Reference-based Methods  32
  2.3 Alignment-based Methods  35
  2.4 Graph-based Methods  36
  2.5 de Bruijn graph Methods  37
  2.6 k-mer based Methods  40

III. Pan-genome Indexing  42
  3.1 Introduction  42
  3.2 The Bloom Filter Trie  43
    3.2.1 Uncompressed Container  44
    3.2.2 Compressed Container  45
    3.2.3 Color Set  49
  3.3 Operations  49
    3.3.1 Container Insertion  50
    3.3.2 Tree Insertion  52
    3.3.3 Container Look-up  53
    3.3.4 Tree Look-up  54
  3.4 Successors and Predecessors Traversing  55
  3.5 Evaluation  58
    3.5.1 P. aeruginosa Dataset  61
    3.5.2 Human Tissues Dataset  63
  3.6 Conclusion  65

IV. Pan-genome Storage  67
  4.1 Introduction  67
    4.1.1 Existing Approaches  68
    4.1.2 Contributions  70
  4.2 The guided de Bruijn Graph  70
  4.3 Compression  73
    4.3.1 Read Clustering and Merging  74
    4.3.2 Spanning Super Read Encoding  76
    4.3.3 Partition Encoding  77
    4.3.4 Meta Data and gdBG Compression  80
  4.4 Update and Decompression  80
  4.5 Results  80
  4.6 Conclusion  86

V. Conclusion  87
  5.1 Perspectives  88

BIBLIOGRAPHY  91

LIST OF FIGURES

Figure
1.1 Representation of a DNA molecule.
Each strand is shown with a different grey color and their respective extremities are annotated with “3’” and “5’” to indicate their orientation. Nucleotides located on the strands are illustrated with a white color and are annotated with the symbol of the nucleobase they represent. Grey colored links between nucleotides represent the hydrogen bonds.  2
1.2 Comparison of sequencers based on their sequencing capacity and the length of the reads they produce (Nederbragt, 2016).  4
1.3 Annual sequencing cost of the human genome. The plot was produced by the National Human Genome Research Institute and is available at www.genome.gov/sequencingcostsdata/.  5
1.4 Representation of a pan-genome with 3 strains.  8
1.5 Trie of strings “aa”, “aba”, “abb”, “ba” and “bba” for an alphabet A = {a, b}.  15
1.6 Compaction methods for tries of strings “aa”, “aba”, “abb”, “ba” and “bba” for an alphabet A = {a, b}.  16
1.7 Suffix trie (a) and tree (b) of string “abaab$”.  17
1.8 Suffix array of string “abaab$” for alphabet order $ < a < b.  18
1.9 Burst trie of strings “aa”, “abb”, “ba” and “bba” for an alphabet A = {a, b}. Leaves have a capacity of two suffixes. Inserting string “aba” in the burst trie of (a) triggers a bursting of the left leaf, resulting in the burst trie setup shown in (b). A terminal symbol is added at the end of each string in order to handle the case of one string which is the prefix of another string.  20
1.10 Hash table with collisions.  21
1.11 Hash table with chaining.  22
1.12 Open addressing methods.  23
1.13 Insertion of two items e1 and e2 into a BF.  24
1.14 Query of two items. Item e3 is a true negative and item e4 is a false positive.  24
3.1 Insertion of six suffixes (that are here complete k-mers) with different colors into a BFT with k = 12, l = 4 and ϕ = 5. In (a), the first five suffixes are inserted at the root into an uncompressed container uc. When a sixth suffix “gcgccaggaatc” is inserted, uc exceeds its capacity and is burst, resulting in the BFT structure shown in (b) with one compressed container and four uncompressed containers. Note that in practice, container vertices might have more than one container and suffixes might have more than one color.  45
3.2 BF of four edge labels “aggc”, “ctca”, “gccc” and “gcgc” with f = 2 hash functions h1 and h2.  46
3.3 Tree representation of four edge labels “aggc”, “ctca”, “gccc” and “gcgc”.  47
3.4 Exact representation of four edge labels “aggc”, “ctca”, “gccc” and “gcgc” in a compressed container with |labelp| = 2 and |labels| = 2.  48
3.5 Insertion of edge label “gtat” in a compressed container with four edge labels “aggc”, “ctca”, “gccc” and “gcgc”. Inserted or changed parts are highlighted. Array edge is not represented.  52
3.6 Traversed paths for predecessors preda of k-mer x = “aggctatgctca” such that preda = a ⊙ x(1, |x| − 1) for all a ∈ A.
Content of vertices is only shown for the root, other vertices only have one traversed container and are represented with an empty rounded box. Color set vertices are not represented.  57
3.7 Traversed paths for successors succa of k-mer x = “aggctatgctca” such that succa = x(2, |x| − 1) ⊙ a for all a ∈ A. Content of vertices is only shown for the leaves, other vertices have only one traversed container and are represented with an empty rounded box. Color set vertices are not represented.  58
3.8 Internal representation of a compressed container with five edge labels “aggc”, “ctca”, “gccc”, “gcgc”, “gtat”, adapted as “ggca”, “tcac”, “cccg”, “cgcg”, “tatg”, respectively, for predecessor and successor traversal. Array edge is not represented.  59
3.9 Insertion of edge label “tggc” adapted as “ggct” in a compressed container with edge labels “aggc”, “ctca”, “gccc”, “gcgc” and “gtat”. Array edge is not represented. Inserted parts are highlighted.  59
4.1 The gdBG of sequence S = “cgtaagtaat” as constructed by Algorithm 5 with k = 3.  72
4.2 The gdBG of sequence S = cgtaagtaat using 3-mers overlapping on k − l = 1. The last symbol of S is not encoded in the gdBG as it cannot be part of a k-mer.  73
4.3 Minimizer clustering of 4 reads. Minimizers of length 2 are underlined. For the sake of convenience, the reverse-complement is not considered.  74
4.4 Merging of two reads into a super read. Minimizers of length 2 are underlined. For the sake of convenience, the reverse-complement is not considered.  75
4.5 Merging of three super reads into an SSR “acgttgatt”. Minimizers of length 2 are underlined with a dashed line. Secondary minimizers (for merging) are underlined with a plain line. For the sake of convenience, the reverse-complement is not considered.  75
4.6 Extraction of 4-mers overlapping on k − l = 2 from two similar SSRs, ssr1 and ssr2.  76
4.7 The gdBG of SSRs ssr1 = “acgtac” and ssr2 = “tccttc” using 4-mers (l = 2). Dotted edges are false implicit edges. The labeled solid edge exists by using the starting overlap of ssr2 after the traversal of ssr1, as described in Section 4.3.2.  79
4.8 Compression ratios in paired-end mode (left) and single-end mode (right).  82
4.9 Disk sizes in paired-end mode (left) and single-end mode (right).  82
4.10 DARRC disk size distribution in paired-end mode (left) and single-end mode (right).  83
4.11 Compression times in paired-end mode (left) and single-end mode (right).  84
4.12 Decompression times in paired-end mode (left) and single-end mode (right).  84
4.13 Compression main memory peaks in paired-end mode (left) and single-end mode (right).  85
4.14 Decompression main memory peaks in paired-end mode (left) and single-end mode (right).  85

LIST OF TABLES

Table
1.1 BWT computation of string “abaab$”. The computed BWT is highlighted.  27
1.2 Reconstruction of a string from the BWT string “bba$aa”. The reconstructed string is highlighted.  28
1.3 FM-Index of the BWT string “bba$aa”.  29
3.1 BFT and SBT construction evaluation for 473 P. aeruginosa isolates. Best results are highlighted.  61
3.2 SBT and BFT querying evaluation for sequences of length 100 bp from the sequencing experiment ERR431077. Best results are highlighted.  62
3.3 Evaluation of k-mer extraction and branching queries for a BFT constructed from 473 P. aeruginosa isolates.  63
3.4 BFT and SBT construction evaluation for 2,652 human tissue RNA-Seq experiments. Best results are highlighted.  64
3.5 SBT and BFT querying evaluation for the GENCODE sequences database. Best results are highlighted.  65
3.6 Evaluation of k-mer extraction and branching queries for a BFT constructed from 2,652 human tissue RNA-Seq experiments.  65
4.1 Naive representation of a partition set composed of integers 12534, 12567 and 28911.  79
4.2 Delta and Vbyte encoded representation of the same partition set used in Table 4.1.  79

ABSTRACT

Pan-genome Search and Storage
by Guillaume Holley

High Throughput Sequencing (HTS) technologies are constantly improving and making genome sequencing more affordable. However, HTS sequencers can only produce short overlapping genome fragments that are erroneous and cover the sequenced genomes unevenly. These genome fragments are assembled based on their overlaps to produce larger contiguous sequences. Since de novo genome assembly is computationally intensive, some species have a reference genome used as a guide for assembling genome fragments from the same species or as a basis for comparative genomics methods. Yet, assembling a genome is an error-prone process that depends on the quality of the sequencing data and the heuristics used during the assembly. Furthermore, analyses based on a reference are biased towards the reference. Finally, a single reference cannot reflect the dynamics and diversity of a population of genomes. Overcoming these issues requires moving away from the single-genome reference-centric paradigm and taking advantage of the multiple sequenced genomes available for each species. For this purpose, pan-genomes were introduced as sets of genomes from different strains of the same species. A pan-genome is represented by a multi-genome index exploiting the similarity and redundancy of the genomes it contains. Still, pan-genomes are more difficult to analyze than single genomes because of the large amount of data to be stored and indexed.

Current data structures for pan-genome indexing do not fulfill all requirements for pan-genome analysis. Indeed, these data structures are often immutable while the size of a pan-genome grows constantly with newly sequenced genomes. Frequently, these data structures consider only assemblies as input, while unassembled genome fragments abound in databases. Also, indexing variants and similarities between the genomes of a pan-genome usually requires time and memory consuming algorithms such as sequence alignments. Sometimes, pan-genome analysis tools just assume variants and similarities are provided as input.
While data structures already exist for pan-genome indexing, no solution is currently proposed for genome fragment compression in a pan-genome context. Indeed, it is often of interest to transmit and store all genome fragments of a pan-genome. However, HTS-specific compression tools are not dynamic and cannot update a compressed archive of genome fragments with new fragments of a genome without decompression. Hence, those tools are poorly adapted to the transmission and storage of genome fragments in a pan-genome context.

In this thesis, we aim to provide scalable solutions for pan-genome indexing and storage. We first address the problem of pan-genome indexing by proposing a new alignment-free, reference-free and incremental data structure that considers genome fragments as well as assemblies as input: the Bloom Filter Trie (BFT). The BFT is a tree data structure representing a colored de Bruijn graph in which k-mers, words of length k from the input genomes, are associated with sets of colors representing the genomes in which they occur. The BFT makes extensive use of Bloom filters to navigate in the tree and optimize the graph traversal. A “bursting” method is employed to perform an efficient path and level compaction of the tree. We show that the BFT outperforms a data structure that has similar features but is based on an approximation of the set of indexed k-mers.

Secondly, we address the problem of genome fragment compression in a pan-genome context by proposing a new abstract data structure, the guided de Bruijn graph. It augments the de Bruijn graph with k-mer partitions such that the graph traversal is guided to reconstruct exactly the genome fragments when decompressing. Different techniques are proposed to optimize the storage of fragments in the graph and the partition encoding. We show that the BFT described previously has all features required to index a guided de Bruijn graph, and it is used in the implementation of our compression method named DARRC. The evaluation of DARRC on a large pan-genome dataset compared to state-of-the-art HTS-specific and general purpose compression tools shows a 30 % compression ratio improvement over the second best performing tool of this evaluation.

CHAPTER I

Introduction

1.1 Biological Background

Section 1.1.1 introduces the preliminary biological notions that will be used in defining the problems this thesis aims to solve. Section 1.1.2 describes how genomic data are acquired and Section 1.1.3 details the complexity of transforming these data in order to analyze them. Then, Section 1.1.4 shows which analyses can be carried out on these data and what the current limitations are. Finally, Section 1.1.5 presents a new representation of genomic data from different sources to overcome these limitations.

1.1.1 Preliminaries

DNA (DeoxyriboNucleic Acid) is the foundation of the genetic support for life. It was discovered and isolated by Friedrich Miescher in 1869, but the paper of Watson and Crick (1953) is considered the work that pioneered the study of DNA. It is a macromolecule, a large group of atoms tied together by chemical bonds, composed of two strands forming a double helix shape. A DNA strand is a chain of nucleotides, each being composed of a nucleobase that is either Adenine (symbol “a”), Cytosine (symbol “c”), Guanine (symbol “g”) or Thymine (symbol “t”), plus a deoxyribose group and a phosphate group. Nucleotides are chained in a strand via their deoxyribose group binding to the phosphate group of the next nucleotide in the chain.
The 5’ end extremity of a strand is the nucleotide with a non-bound phosphate group and the 3’ end extremity of a strand is the nucleotide with a non-bound deoxyribose group. The strand extremities define its orientation: from the 5’ end to the 3’ end. Two strands of a DNA molecule are maintained together through hydrogen bonds connecting the nucleotides of each respective strand: Adenine is always paired with Thymine through two hydrogen bonds and Cytosine is always paired with Guanine through three hydrogen bonds. Two paired nucleotides from different strands of the same DNA molecule form a base pair (abbreviated bp) and each nucleotide of a pair is the complement of the other. The length of a DNA molecule is then the number of base pairs it contains. In a DNA molecule, each strand is the reverse-complement of the other strand: The sequence of nucleotides from a strand can be obtained by “reading” the other strand in the reverse direction (from the 3’ end to the 5’ end) and complementing each nucleotide read. A DNA molecule is schematically illustrated in Figure 1.1.

Figure 1.1: Representation of a DNA molecule. Each strand is shown with a different grey color and their respective extremities are annotated with “3’” and “5’” to indicate their orientation. Nucleotides located on the strands are illustrated with a white color and are annotated with the symbol of the nucleobase they represent. Grey colored links between nucleotides represent the hydrogen bonds.

A gene is a sequence of nucleotides which encodes a function such as metabolism, the process by which an organism maintains life. A locus is the position of a gene on a chromosome, a structure in which DNA is packed. A genome encodes all genetic information of an organism through its DNA, mainly contained in chromosomes, but also through extrachromosomal DNA contained in other structures (mitochondria, chloroplasts and plasmids). Each cell of an organism contains a copy of the genome, whose length varies greatly among different organisms. A genome is said to be diploid if it contains two sets of chromosomes, one copy from each parent, such as in the human genome. On the contrary, a genome having only one set of chromosomes from one parent corresponds to a haploid genome. Heredity refers to the process of inheriting genetic traits from the parents. During this process, the DNA of the inherited genome is permanently altered and such alterations are called mutations. An allele is one of the possible forms of a gene that has been mutated. Polymorphism refers to a recurrent variation of a genome within a population of genomes and a Single Nucleotide Polymorphism (SNP) is a polymorphism of a single nucleotide at a given position of a genome within a population of genomes. A species is the most basic category of biological classification and refers to a group of similar organisms able to breed with each other. A strain is a variant within a species.

1.1.2 DNA Sequencing

DNA sequencing is the process of determining the order of nucleotides in a DNA strand. It is performed by a sequencer, an instrument able to “read” the DNA of a genome. Yet, sequencers cannot read the full length of the DNA in a genome but, instead, only fragments from several copies of the sequenced genome, corresponding to overlapping nucleotide sequences called reads. The read length depends on the sequencing technology and currently ranges from a few dozen to a few thousand base pairs.
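To make the reverse-complement operation defined in Section 1.1.1 concrete, here is a minimal Python sketch that complements a strand read in the reverse direction; the function name and mapping are illustrative, not part of the thesis:

    # Minimal sketch: reverse complement of a DNA strand.
    COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

    def reverse_complement(strand: str) -> str:
        """Read the strand from 3' to 5' and complement each nucleotide."""
        return "".join(COMPLEMENT[base] for base in reversed(strand))

    assert reverse_complement("aggct") == "agcct"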
A paired-end read refers to a pair of nucleotide sequences read by a sequencer from different ends of the same DNA fragment. The coverage of a genome with respect to a set of reads indicates the expected number of reads covering a specific base of the sequenced genome. The coverage can be uneven depending on the sequenced genome and the sequencing technology. Sequencing machines produce sequencing errors for which the rate, the distribution and the type (insertion, deletion and substitution of a nucleotide) vary with the sequencing technology. Also, other attributes are used to characterize DNA sequencers such as the sequencing capacity (number of bases produced per run), the throughput (number of bases produced in a given amount of time) and the cost of the machine. A comparison of sequencers based on their sequencing capacity and the length of the reads they produce is given in Figure 1.2 (Nederbragt, 2016).

Figure 1.2: Comparison of sequencers based on their sequencing capacity and the length of the reads they produce (Nederbragt, 2016).

The first widely used sequencing technology, also called first generation sequencing, was developed by Sanger et al. (1977). It is characterized by reads up to 750 bp long (Schuster, 2008), a low error rate of 0.001 % to 1 % (Hoff, 2009) and a low sequencing capacity (Nederbragt, 2016). The length of the reads and their accuracy made this technology very successful for several decades, but its cost was prohibitive and prevented its use to sequence long genomes. The second generation of sequencing technology, also known as Next Generation Sequencing (NGS) or High Throughput Sequencing (HTS), was introduced in 2005 as highly parallel sequencing methods with a high throughput, multiple orders of magnitude more than with Sanger sequencing. Reads produced by HTS technologies are characterized by a shorter length than reads produced by Sanger sequencing and a low error rate of 0.01 % to 1 % (Glenn, 2011). The availability of such technologies marked the beginning of a new era in genomics because of its affordable price (Loh et al., 2012), making sequencing more accessible to the scientific community. Figure 1.3 shows that the cost of sequencing the human genome in 2001 (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001) was estimated at 100 million dollars while ten years later, this price was 10,000-fold lower. Finally, the third generation of sequencing technologies encompasses new methods such as Single Molecule Real Time sequencing and nanopore sequencing. Reads produced by these technologies are long, a few thousand base pairs, but noisy as the error rate ranges from 13 % to 38 % (Rhoads and Fai Au, 2015). More detailed reviews of sequencing technologies are given in (Shendure and Ji, 2008; Thudi et al., 2012; Rhoads and Fai Au, 2015).

Figure 1.3: Annual sequencing cost of the human genome. The plot was produced by the National Human Genome Research Institute and is available at www.genome.gov/sequencingcostsdata/.

1.1.3 de novo Read Assembly

The de novo read assembly problem (Compeau et al., 2011) is the problem of reconstructing a genome as a sequence of nucleotides per chromosome from a set of reads only. To this end, genome assembly software uses the overlaps of the reads to assemble them into larger sequences forming an assembly. This problem has been largely studied in computational biology because of its complexity and necessity. Indeed, genome analysis relies mainly on assembled genomes.
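The overlap detection at the heart of assembly can be illustrated with a naive sketch that finds the longest suffix-prefix match of two reads. This is a quadratic-time illustration only, not the thesis's method; actual assemblers rely on indexing structures such as those presented in Section 1.2:

    def longest_overlap(r1: str, r2: str, min_len: int = 3) -> int:
        """Length of the longest suffix of r1 equal to a prefix of r2."""
        for length in range(min(len(r1), len(r2)), min_len - 1, -1):
            if r1[-length:] == r2[:length]:
                return length
        return 0

    # Two reads overlapping on "gga"; merging them yields contig "acggatt".
    assert longest_overlap("acgga", "ggatt") == 3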
Assembling reads that are erroneous, potentially short and with an uneven coverage that may be low or equal to zero in some regions of the sequenced genome is very challenging. Also, the sequenced genome can be difficult to assemble as it might contain repeated regions. Furthermore, finding overlaps requires storing and indexing the reads to assemble, a very time and memory consuming task as a single run of sequencing can generate up to billions of reads. Hence, the completeness and quality of an assembly depend on all these factors as well as on the assembly algorithms, which can create mis-assemblies. A more detailed description of the theory and practice of read assembly is given in (Simpson and Pop, 2015).

1.1.4 Comparative Genomics

The field of comparative genomics consists in the comparison of genomes to understand their similarities and differences. The answer to this question is of major interest to study the evolution of a species and, therefore, shed light on vital life mechanisms. Comparing a set of genomes to a reference genome has been a standard in computational biology for many years. Indeed, it is more time and memory efficient to compare all genomes of a set to a reference genome than performing all possible pairwise comparisons of genomes from the set. Yet, reference-centric comparisons are biased towards the reference: Dolle et al. (2017) established that in the latest human reference genome GRCh38, 3.2 Mbp were removed and 13.7 Mbp were added relative to the previous reference GRCh37. Thus, results obtained from a comparison with the human reference GRCh37 might be different with the updated reference GRCh38.

With the growth of computing power, the increased capability of external storage, and the improvements made in the domain of HTS technologies, multiple reference genomes are nowadays available for the same species. However, reference-centric analysis methods cannot account for the very large and diverse variability among individuals of the same species. As an example, the 1000 Genomes Project (1000 Genomes Project Consortium, 2015) has sequenced 2,504 individuals from 26 populations, representing a large set of 88 million variants. An ideal analysis of a newly sequenced human genome would benefit from a comparison with all 2,504 human genomes of such a project. In contrast to the human species, which has a low degree of polymorphism, Ciona savignyi is a sea squirt species which has a very high degree of polymorphism, known to be one of the highest of any species (Leffler et al., 2012). Hence, a single genome of C. savignyi cannot be the representative of its species. Also, Mosquera-Rendón et al. (2016) mention that most of the difficulty in the development of an effective vaccine against Pseudomonas aeruginosa, an important pathogen in multiple types of infections, arises because each strain of the species exhibits different mechanisms responsible for its pathogenesis. A study by Tettelin et al. (2005) indicates the same obstacles with the pathogen Streptococcus agalactiae. Therefore, developing an efficient vaccine against P. aeruginosa and S. agalactiae is unfeasible using reference-centric analysis methods. Consequently, linear references do not fulfill the requirements of these problems and must be augmented with information from multiple genomes.

1.1.5 Pan-genomes

The term pan-genome stems from the Greek prefix pan, which is a combining form of “all” or “every”.
It was first mentioned by Sigaux (1999) to study genome and transcriptome alterations in the context of cancer evolution. However, the analysis of multiple S. agalactiae genomes by Medini et al. (2005) laid the foundation for the concept of pan-genomes as we know it nowadays. In the study of Medini et al. (2005), a pan-genome is defined as the global gene repertoire of a species and is composed of two parts. The core genome is defined as the genes present in all strains of the species. Yet, some studies such as (Mosquera-Rendón et al., 2016) relax this definition to genes present in almost all strains of the species. These so-called housekeeping genes encode vital functions for survival and are the genomic basis for the phylogeny of the species. Evidence from the P. aeruginosa pan-genome study (Mosquera-Rendón et al., 2016) shows that its core genome encodes its pathogenicity through genes involved in lung infections and antibiotic resistance genes present in 95% to 100% of the strains studied. A similar conclusion was established by Donati et al. (2010) from the pan-genome of Streptococcus pneumoniae: 31 out of 47 analyzed surface-exposed proteins, known to be involved in host interaction or virulence, are conserved in the core genome of S. pneumoniae.

The accessory genome, also called dispensable genome, is the pool of genes that are not present in the core genome. These genes account for the diversity of the species by encoding functions non-essential for survival. They are often the result of environmental adaptation or are acquired from different organisms of the same environment through gene transfer. Mosquera-Rendón et al. (2016) highlight that each strain of the P. aeruginosa pan-genome exhibits a large variety of resistance mechanisms to the immune system in its accessory genome. Genes specific to a single strain of the pan-genome are sometimes distinguished as singleton genes (Blom et al., 2009), hence forming the singleton genome. A graphical representation of a pan-genome is provided in Figure 1.4.

Figure 1.4: Representation of a pan-genome with 3 strains.

The proportion of core genome, accessory genome and singleton genome varies from one species to another. An analysis of the Escherichia coli pan-genome by Lukjancenko et al. (2010) shows that the E. coli accessory genome is predominant and represents about 90% of its pan-genome. Similarly, the accessory genome of P. aeruginosa (Mosquera-Rendón et al., 2016) composes 85% of its pan-genome. In contrast, the accessory genome of the S. pneumoniae pan-genome (Donati et al., 2010) represents an average 16% of its pan-genome, a number supported by Medini et al. (2005), who establish the accessory genome of group B Streptococcus at 18% of its pan-genome and the singleton genome at 1.5% of its pan-genome.

Tettelin et al. (2005) were the first to model the growth of a pan-genome as a function of the number of unique genes acquired per additional genome analyzed. The goal is to predict how many genomes are necessary to fully characterize the core and accessory genomes of a species. For this purpose, regression analysis is used to statistically estimate the pan-genome growth parameters from known data.
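As an illustration, such a power law regression can be sketched as a least-squares fit in log-log space. The gene counts below are hypothetical, and the fitted form P(n) = K · n^γ anticipates the Heaps' law formulation discussed next:

    import numpy as np

    # Hypothetical pan-genome sizes (distinct gene families) after
    # analyzing n = 1..8 genomes; illustrative values only.
    n = np.arange(1, 9)
    size = np.array([4500, 5200, 5650, 5980, 6240, 6460, 6650, 6820])

    # Fit P(n) = K * n**gamma by linear least squares in log-log space:
    # log P(n) = log K + gamma * log n.
    gamma, log_k = np.polyfit(np.log(n), np.log(size), deg=1)
    print(f"K = {np.exp(log_k):.0f}, gamma = {gamma:.2f}")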
In (Tettelin et al., 2008), two non-linear regression analyses were performed using the data from different species: a power law regression, whose function makes the output vary as a power of the input, and an exponential regression, whose function makes the output vary as an exponent of the input. Results show that the power law regression was better suited than the exponential one. This conclusion is also supported by the data of the P. aeruginosa pan-genome study from Mosquera-Rendón et al. (2016). Another model is the finite supragenome model (Hogg et al., 2007), which attempted to improve the method of Tettelin et al. (2005) by considering an unequal sampling probability of genes in the population. Donati et al. (2010) compared the power law regression model to the finite supragenome model and showed that for a large number of genomes analyzed (>40), the power law regression was more appropriate to model the pan-genome growth.

More specifically, Heaps' law, which belongs to the class of power laws, is well suited to model the pan-genome growth. It is generally used to describe the growth of unique attributes in a set of entities. For example, Heaps' law is used in linguistics to model the size of a vocabulary V_H (number of unique words present in a set of documents) as a function of the number of documents n considered. It is defined as follows:

V_H(n) = K · n^γ

in which parameters K and γ are determined empirically, as in the fitting sketch above.

Power law regression helps to determine the completeness of a pan-genome by estimating the number of genomes necessary to fully describe the species' pan-genome. A pan-genome is said to be closed if it can be determined with an estimated finite number of genomes (Tettelin et al., 2005). The opposite is an open pan-genome, in which each new strain analyzed is estimated to contribute new genes to the pan-genome. An open pan-genome is usually the sign of an extremely adaptive species that is inclined to new evolutionary opportunities. More theoretically, when the number of new genes discovered per additional genome is modeled as a decaying power law, an exponent γ > 1 indicates that the total number of genes converges to a constant, as expected in a closed pan-genome. On the contrary, γ ≤ 1 indicates an infinite number of genes in the pan-genome, potentially with a very slow gene discovery rate if γ = 1 (Tettelin et al., 2008). An outline of pan-genomes in different application domains as well as a description of pan-genomic approaches from a computational perspective are provided in (Computational Pan-Genomics Consortium, 2016).

1.2 Computational Background

The purpose of this section is to outline all computational notions that will be used throughout this thesis. Notations and preliminary notions are first introduced in Section 1.2.1, followed by a description of graphs and their variants in Section 1.2.2. Then, Section 1.2.3 details a specific type of graph, the tree, and provides an overview of trees specialized in string indexing with their variants. Hash tables are presented afterwards in Section 1.2.4 as a data structure type for indexing elements. The description of Bloom filters and their variants in Section 1.2.5 builds on the description of hash tables. Finally, Section 1.2.6 presents the Burrows-Wheeler Transform and Section 1.2.7 presents the FM-Index as a string indexing method and data structure, respectively. Note that in the figures of this section, grey colored labels and arrows are for illustration but are not stored in practice.
1.2.1 Preliminaries

A string S is a sequence of symbols drawn from a finite non-empty set of symbols called the alphabet A. Its length is denoted by |S|. A string of length 0 is an empty string. A substring of S is a string occurring in S: it has a starting position i, a length l and is denoted by S(i, l). A prefix of S is a substring starting at position 1 while a suffix of S is a substring finishing at position |S|. The concatenation operator ⊙ joins together two strings such that S = p ⊙ s for (potentially empty) prefix p and suffix s. A repeat is a substring that occurs at least twice in a string, i.e., S(i, l) = S(j, l) for two positions i ≠ j in a string S and l > 0. A repeat is left maximal if it cannot be extended to the left without introducing a mismatch, i.e., i = 1 or j = 1 or S(i − 1, 1) ≠ S(j − 1, 1) for all pairs of positions i ≠ j with S(i, l) = S(j, l). Conversely, it is right maximal if it cannot be extended to the right without introducing a mismatch, i.e., i = |S| − l + 1 or j = |S| − l + 1 or S(i + l, 1) ≠ S(j + l, 1) for all pairs of positions i ≠ j with S(i, l) = S(j, l). A maximal repeat is left and right maximal. A Maximal Exact Match (MEM) is a tuple (i, S1, j, S2, l) representing a maximal substring match of two strings S1 and S2, i.e., S1(i, l) = S2(j, l) with S1(i, l) and S2(j, l) being maximal over S1 and S2.

An array is an organization of elements in a contiguous space. Each element of an array A has a position i and is denoted by A[i]. Arrays are directly addressable: each element is accessible in O(1) time through its position in the array. A range of elements between positions i < j (inclusive) is denoted by A[i..j]. A dynamic array can be reallocated to change its size. A list is an abstract data structure representing a sequence of ordered values. The most common implementation of a list is the linked list, which is represented by a sequence of elements, each containing a value and a pointer to the next element of the list.

1.2.2 Graphs

Definition 1. A graph G = (V, E) is an abstract data structure composed of a set of vertices V and a set of edges E such that each edge e ∈ E connects two vertices of V.

A graph is directed if each edge has a start vertex and an end vertex such that the edge points away from the start vertex. The edge in-degree, respectively out-degree, of a vertex v is the number of edges ending at, respectively starting from, vertex v. The edge degree of a vertex is the sum of the edge in-degree and out-degree of this vertex. A vertex is branching if it has an edge in-degree > 1 or an edge out-degree > 1. A path is a sequence of vertices such that each pair of consecutive vertices in the sequence has an edge connecting them in the graph. A path is branching if it contains at least one branching vertex. A bubble is formed by two paths sharing only the same first vertex and the same last vertex. A graph is cyclic if it contains at least one cycle, i.e., a path starting and ending on the same vertex. A graph is connected if there exists a path between all pairs of vertices.

1.2.2.1 de Bruijn Graphs

A de Bruijn graph (dBG) is a directed graph G = (V, E) in which each vertex v ∈ V represents a k-mer, a string of length k over A. A directed edge e ∈ E from vertex v to vertex v′ representing k-mers x and x′, respectively, exists if and only if x(2, k − 1) = x′(1, k − 1). Each k-mer x has |A| possible successors x(2, k − 1) ⊙ a and |A| possible predecessors a ⊙ x(1, k − 1) with a ∈ A.
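A minimal sketch of these definitions follows: it extracts the k-mer vertices of a sequence and enumerates the successors present in the graph. Function names are illustrative, and only the k-mers occurring in the sequence are kept, anticipating the relaxed definition discussed next; the example sequence is the one used in the gdBG example of Chapter IV:

    def kmers(seq: str, k: int) -> set[str]:
        """All k-mers (vertices of the dBG) occurring in seq."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def successors(kmer: str, vertices: set[str], alphabet: str = "acgt") -> list[str]:
        """Successors of kmer in the graph: its (k-1)-suffix extended by one symbol."""
        return [kmer[1:] + a for a in alphabet if kmer[1:] + a in vertices]

    v = kmers("cgtaagtaat", 3)
    print(successors("gta", v))  # ['taa']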
Note that in the original definition of the dBG, all possible k-mers for an alphabet A are represented in the graph. However, the definition of the dBG used in the computational biology literature is less strict and only a subset of all possible k-mers is represented in the graph. These k-mers are extracted from the sequences the dBG is built from. A compacted de Bruijn graph merges all maximal non-branching paths of η vertices from the dBG into single vertices representing words of length k + η − 1. A colored de Bruijn graph (cdBG) is a graph G = (V, E, C) in which (V, E) is a dBG and C is a set of colors such that each vertex v ∈ V maps to a subset of C. A lightweight representation of dBGs and cdBGs does not store edges since they are implicitly given by vertices overlapping on k − 1 symbols. However, implicit edges can falsely connect vertices that share an overlap of k − 1 symbols but do not overlap in the sequences the graph was built from.

The dBG is a long-studied abstract data structure used in computational biology. It is particularly useful for the problem of de novo read assembly, in which it is necessary to find a Hamiltonian cycle in the graph, a path starting and ending on the same vertex that visits each vertex exactly once. Although heuristics exist to extract Hamiltonian cycles from a graph, the read assembly problem is yet to be solved because a Hamiltonian cycle is only one possible reconstruction of the original genome the graph was built from. Furthermore, the Hamiltonian cycle formulation of the assembly problem has no immediate practical application due to coverage gaps and genomic variants.

1.2.3 Trees

Definition 2. A tree T = (VT, ET) is an acyclic connected graph.

A tree is a special type of graph. In a tree T, vertices are separated into two groups. On the one hand, the internal vertices of T are characterized by an edge degree > 1. On the other hand, the leaves are vertices with an edge degree of exactly 1. If the tree is rooted, a vertex called the root is designated. A rooted tree may be directed, in which case the edges point from the root towards the leaves. Thus, we can define the notion of children and parents for each vertex of a directed rooted tree T. The children of a vertex v ∈ VT, denoted by children(v, T), is the set of vertices that can be reached by an edge starting at v. Reciprocally, the parent of a vertex v ∈ VT, denoted by parent(v, T), is the vertex having an edge that ends at v. The root of a rooted tree does not have a parent and the leaves of a rooted tree do not have children. The depth of a vertex v ∈ VT is denoted by depth(v, T) and is the number of edges between the root of T and v. The height of T, denoted by height(T), is the number of edges on the longest path from the root of T to one of its leaves. The level i ≥ 1 of T is defined as the set of vertices with a depth of i − 1. A k-ary rooted tree T is a tree for which all vertices v ∈ VT have |children(v, T)| ≤ k (a 2-ary rooted tree is also called a binary tree). A subtree T′ of a tree T is a tree whose vertices and edges are connected subsets of the ones in T.

A search tree is a tree used to index elements, known as keys, potentially associated with values. There exists a large variety of search trees such as the binary search tree, the B-tree and the red-black tree. A more detailed description of those trees is given by Cormen et al. (2009). In the following, we will focus on a subset of search trees named tries.
The premises of the trie were first introduced by De La Briandais (1959), but the concept of tries was mostly completed by Fredkin (1960). The name “trie” comes from the word retrieval, in an attempt to name this subset of search trees specialized in indexing sets of variable length strings. Strings are especially challenging to index in a memory-efficient manner compared to other key types because of their different and potentially long lengths. Also, strings offer a potential for redundancy to exploit as they usually come from natural language texts in which repetitions are expected to occur. Finally, indexing and retrieving strings might require type-specific operations such as prefix search. Note that tries do not restrict the key type to strings. For example, tries are typically used in routers for fast IP address lookup, in which an IP address is a series of four 8-bit integers, i.e., one 32-bit integer.

Definition 3. A trie T = (VT, ET) is a directed rooted tree in which each edge is labeled with a single symbol of an alphabet A such that for any vertex v ∈ VT, all edges starting at v are labeled with a different symbol.

A path from a vertex v ∈ VT to a vertex v′ ∈ VT represents a substring of length depth(v′, T) − depth(v, T), which is the concatenation of all symbols on the edges of this path. The string obtained from the concatenation of all symbols on the edges of a path from the root to a vertex v is the prefix of a string stored in T and is denoted by prefix(v, T). A subtrie of T rooted at vertex v represents a set of suffixes denoted by suffixes(v, T) such that for each suffix s in this set, prefix(v, T) ⊙ s is a string that is stored in T. If the vertex v is a leaf, then |s| = 0. Thus, tries inherit the dynamic nature of trees such that it is not necessary to know before insertion how many strings will be stored in a trie. An example of a trie is given in Figure 1.5.

Figure 1.5: Trie of strings “aa”, “aba”, “abb”, “ba” and “bba” for an alphabet A = {a, b}.

A trie can be implemented using various data structures such as a linked list or an array. The choice of the supporting data structure for a trie is of main importance, as well as the implementation design that must be chosen carefully. Indeed, a trie can consume significant memory if implemented in a naive way, such as representing edges with pointers to memory locations, a typical implementation based on linked lists. This design scatters vertices over the memory of a computer (main memory or disk memory) and forces the Central Processing Unit (CPU) to visit dispersed locations of the memory in a short time while traversing a trie, for example. At each traversed vertex, the CPU copies the data located at the memory address to a small multi-layer high-speed memory called cache, which is located near the CPU. The first assumption made by a CPU is that recently accessed memory is likely to be accessed again: operations carried out in the cache are much faster than in Random Access Memory (RAM) and a lot faster than on disk. Secondly, a CPU assumes the data locality property: It is likely that a memory location close to the one accessed in the first place is going to be read and possibly modified. If the content of a memory location to be accessed is not already present in one layer of the cache, the processor must fetch it, a so-called cache miss. Cache misses are the main bottleneck in applications with no data locality.
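The naive design criticized in the next paragraph can be sketched as follows: one object per trie vertex and one pointer per edge. This is an illustrative sketch of the trie of Figure 1.5, not an implementation from the thesis:

    class TrieNode:
        """Naive trie vertex: one pointer per outgoing edge."""
        def __init__(self):
            self.children = {}   # symbol -> TrieNode (one pointer each)
            self.is_end = False  # marks the end of an inserted string

    def insert(root: TrieNode, s: str) -> None:
        node = root
        for symbol in s:
            node = node.children.setdefault(symbol, TrieNode())
        node.is_end = True

    def contains(root: TrieNode, s: str) -> bool:
        node = root
        for symbol in s:
            if symbol not in node.children:
                return False
            node = node.children[symbol]
        return node.is_end

    root = TrieNode()
    for s in ("aa", "aba", "abb", "ba", "bba"):  # strings of Figure 1.5
        insert(root, s)
    assert contains(root, "aba") and not contains(root, "ab")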
Hence, a naive pointer-based implementation of a trie, as sketched above, is inefficient because it disregards cache locality and each indexed symbol is the label of an edge using a 64-bit pointer on current computer architectures. A trie has a dynamic structure and inserting new elements triggers the insertion of new vertices and edges. If the supporting data structure is cache friendly and memory efficient but requires moving large chunks of memory for each new edge or vertex inserted, the data structure will be time inefficient. Finally, even though shared prefixes are stored in a single trie path, the same suffixes must be stored multiple times if their corresponding prefixes are different.

In order to improve the memory efficiency of tries and decrease the number of cache misses while traversing them, the radix trie was proposed. A radix trie employs a path compaction method which merges a path of n > 1 vertices, each having at most one child, into one vertex whose incoming edge is labeled with a substring of n symbols. A binary radix trie is also referred to as a PATRICIA trie, from Morrison (1968). An orthogonal method, level compaction, was presented by Andersson and Nilsson (1993). Level compaction applies to a k-ary subtrie t for which each internal vertex v from level i = 1 (the root) to level j > i has |children(v, t)| = k children. For such a subtrie t, each path from the root to a vertex v at level j + 1 is replaced by a directed edge from the root to v. This edge is labeled with the concatenation of the symbols on the path. Path and level compactions are illustrated in Figure 1.6.

Figure 1.6: Compaction methods for tries of strings “aa”, “aba”, “abb”, “ba” and “bba” for an alphabet A = {a, b}: (a) path compressed trie; (b) level compressed trie.

Path and level compactions were combined by Nilsson and Karlsson (1999) in a data structure named the LC-trie. A more detailed review of tries is given by Askitis and Sinha (2010).

1.2.3.1 Suffix Tries and Suffix Trees

A trie as described above indexes a set of strings such that a prefix search can be performed in a natural top-down traversal of the trie. In order to perform a substring search, a suffix trie is required. Introduced by Weiner (1973), a suffix trie of a string S indexes all |S| suffixes of S. A prefix search can then be performed over the trie indexing the suffixes and can therefore simulate a substring search. As each suffix s of S has a length of at most |S| symbols, the total number of substrings in S is O(|S|^2) and corresponds to the number of vertices in the suffix trie of S.

The suffix tree is a path compacted suffix trie which has at most |S| − 1 internal vertices. Each internal vertex of the tree except the root has at least two children. To ensure this property, an extra symbol “$” called the terminal symbol is added at the end of S to guarantee that no suffix is the prefix of another suffix. Labels on the edges are not stored explicitly as they all relate to the same string: Each label of an edge can instead be replaced by its start position in S and a reference to the child list of the vertex pointed to by the edge. Hence, instead of the quadratic space of the suffix trie, the suffix tree reduces the space usage to O(|S|) and can be built in linear time using the McCreight (1976), Ukkonen (1995) and Farach (1997) algorithms. A vertex v of a suffix tree T represents the prefix of one or multiple suffixes.
A suffix link starting at a non-root vertex v for which prefix(v, T) = a ⊙ s connects to a vertex v′ for which prefix(v′, T) = s, with a ∈ A and s a (potentially empty) string. Suffix links are not part of the suffix tree but are used by the McCreight (1976) and Ukkonen (1995) algorithms. The Generalized Suffix Tree (GST) is a suffix tree of at least two strings. It is built from a collection of N > 1 strings S1 ⊙ ... ⊙ SN: Each string Si finishes with a different terminal symbol $i such that $1 < ... < $N. Suffix trees are useful for numerous applications such as finding MEMs and MUMs (Maximal Unique Matches).

Figure 1.7: Suffix trie (a) and tree (b) of string “abaab$”.

As a standard suffix tree uses O(|S| log |S|) bits for a string S over an alphabet A, the high memory cost of such a data structure for long strings has motivated the need for more lightweight data structures in order to get closer to the O(|S| log |A|) bits required to store S itself. Hence, multiple proposals have been made for a Compressed Suffix Tree (CST) (Sadakane, 2007; Fischer et al., 2009; Ohlebusch et al., 2010; Russo et al., 2011). The most lightweight representation of a CST was proposed by Russo et al. (2011) and uses |S|H_k + o(|S|) bits for a string S, with H_k being the empirical entropy of S. This representation supports all operations of the suffix tree in O(log |S| log log |S|) time, except returning the child of a vertex corresponding to a searched symbol, which is in O(log |S| · (log log |S|)^2) time.

The suffix array by Manber and Myers (1993) is the counterpart of the suffix tree. It is an array of the sorted suffixes of a string S in which suffixes are represented by their start position in S. The lexicographic order of the suffixes mimics a depth-first traversal order of the suffix tree. A suffix array interval is an interval of positions in the suffix array corresponding to a set of contiguous suffixes starting with a given prefix p. Such an interval represents the subtree of a suffix tree T for which the root has prefix(root, T) = p. A suffix array, just as the suffix tree, can be built in O(|S|) time and uses O(|S|) space. As a suffix array does not cover all functionalities of a suffix tree, an Enhanced Suffix Array (ESA) (Abouelhoda et al., 2004) augments the suffix array with at least one additional structure, the Longest Common Prefix (LCP) array, in order to enable all operations supported by a suffix tree. In practice, the suffix array and the ESA are preferred to the suffix tree as they can be implemented more efficiently. The suffix array and the ESA also improve the data locality compared to the suffix tree because an array is stored contiguously in memory.

    i       1  2     3    4       5   6
    SA[i]   6  3     4    1       5   2
    suffix  $  aab$  ab$  abaab$  b$  baab$

Figure 1.8: Suffix array of string “abaab$” for alphabet order $ < a < b.
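As an illustration, the suffix array of Figure 1.8 can be reproduced with a naive construction that simply sorts the suffixes, unlike the linear-time algorithms cited above:

    def suffix_array(s: str) -> list[int]:
        """Start positions (1-based, as in Figure 1.8) of the sorted suffixes of s."""
        return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

    print(suffix_array("abaab$"))  # [6, 3, 4, 1, 5, 2], with $ < a < b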
1.2.3.2 Burst Tries

Path and level compactions of tries are efficient methods in a limited number of cases, more specifically if a long unique substring or all possible equal-length substrings are rooted at a vertex. The burst trie is a compacted trie designed by Heinz et al. (2002). It is based on the simple idea that the main cause of memory and time inefficiency in tries is subtries suffixes(v, T) in which the paths from v to the leaves are not branching. Therefore, a more efficient representation of such subtries of suffixes is needed.

In the burst trie, edges starting from an internal vertex v are stored in an array of length |A| contained in v. Each element of such an array represents the label a ∈ A of the edge it stores. An edge exists at the position represented by label a ∈ A only if the edge points to a child for which all s ∈ suffixes(v, T) have s(1, 1) = a. Leaves of the burst trie are dynamic arrays representing sets of suffixes with a limited capacity. When the capacity of a leaf suffix set is exceeded during an insertion, a burst is triggered. Bursting a leaf replaces it with a new internal vertex storing edges labeled with the first symbol of every suffix the leaf contained. These edges point to new leaves containing the remaining symbols of the suffixes. An example of bursting is provided in Figure 1.9.

Figure 1.9: Burst trie of strings “aa”, “abb”, “ba” and “bba” for an alphabet A = {a, b}: (a) burst trie before inserting “aba”; (b) burst trie after inserting “aba”. Leaves have a capacity of two suffixes. Inserting string “aba” in the burst trie of (a) triggers a bursting of the left leaf, resulting in the burst trie setup shown in (b). A terminal symbol is added at the end of each string in order to handle the case of one string which is the prefix of another string.

Askitis and Sinha (2010) proposed multiple variants named the HAT-trie, the hybrid HAT-trie and the HAT+-trie, which explore different splitting techniques as well as different data structures for the leaves. Those variants are expected to be more space-efficient than the burst trie while having the disadvantage of being slower to construct and search. Edge label arrays in vertices of the burst trie are directly addressable. The drawback of this representation is that empty elements occupy memory. Also, even though suffixes are grouped together into leaf suffix sets, multiple locations of the memory representing internal vertices still need to be visited in order to reconstruct the prefixes.

1.2.4 Hash Tables

A hash table is an implementation of the so-called associative array or dictionary, also known as a “key-value” data structure. The purpose of such data structures is to associate information to elements, known as values and keys respectively, such that those values can be retrieved from the keys alone. One of the main differences with tree-like data structures accomplishing a similar purpose is that basic key-value operations (inserting, getting and removing a key-value pair) are expected to be carried out in constant time. Formally, n keys from a universe U have to be inserted and in most of the use cases, n ≪ |U|. Therefore, creating a direct addressing array of size |U| is not a suitable solution. Instead, a hash table H is an array of length m with m ≪ |U|, and each key e to insert is mapped to an element of H through a hash function h : U → {1, ..., m}. For the sake of convenience, we assume in the following a set representation based on a simplified hash table model in which the keys to insert are unique and there are no values. A key e is stored and can be found in the element H[h(e)], while removing e from H deletes the content of H[h(e)]. All these operations take O(1) time and H uses O(n + m) space.

1.2.4.1 Collisions

The previously described hash table design is minimal and creates collisions, i.e., two keys e1 and e2 collide if e1 ≠ e2 and h(e1) = h(e2). Figure 1.10 illustrates a hash table in which two keys are colliding.

Figure 1.10: Hash table with collisions.
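The simplified hash table model above, and how a collision arises, can be sketched as follows. The toy hash function is illustrative and deliberately collision-prone; positions are 0-based here for simplicity:

    # Simplified hash-set model: m slots, one key per slot, no values,
    # and no collision resolution yet (handled in the next paragraphs).
    m = 9
    H = [None] * m

    def h(key: str) -> int:
        return sum(map(ord, key)) % m  # toy hash function, slot in 0..m-1

    def insert(key: str) -> None:
        H[h(key)] = key  # O(1); silently overwrites on collision

    def contains(key: str) -> bool:
        return H[h(key)] == key  # O(1)

    insert("acg")
    assert contains("acg")
    print(h("acg") == h("gca"))  # True: permuted keys collide under this toy h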
1.2.4 Hash Tables

A hash table is an implementation of the so-called associative array or dictionary, also known as a "key-value" data structure. The purpose of such data structures is to associate information, known as values, to elements, known as keys, such that those values can be retrieved from the keys alone. One of the main differences with tree-like data structures accomplishing a similar purpose is that the basic key-value operations (inserting, getting and removing a key-value pair) are expected to be carried out in constant time. Formally, n keys from a universe U have to be inserted and, in most use cases, n ≪ |U|. Therefore, creating a direct addressing array of size |U| is not a suitable solution. Instead, a hash table H is an array of length m with m ≪ |U|, and each key e to insert is mapped to an element of H through a hash function h such that h(e) : e → {1, .., m}. For the sake of convenience, we assume in the following a set representation based on a simplified hash table model in which the keys to insert are unique and there are no values. A key e is stored and can be found in the element H[h(e)], while removing e from H deletes the content of H[h(e)]. All these operations take O(1) time and H uses O(n + m) space.

1.2.4.1 Collisions

The previously described hash table design is minimal and creates collisions, i.e., two keys e1 and e2 collide if e1 ≠ e2 and h(e1) = h(e2). Figure 1.10 illustrates a hash table in which two keys are colliding.

[Figure 1.10: Hash table with collisions.]

A hash function that minimizes the number of collisions is chosen independently from the keys, i.e., through randomization. As the set of all such functions is very large, universal hashing is an approach that designs a smaller class of hash functions: a τ-universal hash function has a collision probability between two random distinct keys e1 and e2 of Pr(h(e1) = h(e2)) ≤ τ/m, with m being the size of the hash table. A more detailed description of universal hashing is given by Cormen et al. (2009).

Chaining is a collision resolution method: each element of H does not contain a single key anymore but a chain of keys stored in a list. Inserting a key in H remains an O(1) time operation, but finding or deleting a key takes O(1 + n/m) expected time because the expected list length of each element is n/m. In the worst case, the hash function h maps all keys to the same element and finding or deleting a key in H takes Θ(n) time. An example of a hash table with chaining is given in Figure 1.11.

[Figure 1.11: Hash table with chaining.]

1.2.4.2 Open Addressing

Open addressing methods resolve collisions in hash tables by providing alternate locations to colliding keys, a mechanism known as probing. Hash tables using open addressing are self-contained, in contrast to hash tables using chaining, as they do not store anything outside the hash table itself. A first open addressing method is linear probing, which inserts a key e in the element H[b] that is the first non-occupied element from H[h(e)] with b ≥ h(e). If no element at position b ≤ m is available, the hash table can be extended to contain |H| > m elements. Finding e consists of iterating over the elements of H from position h(e) with an increment of i = 1 until e is found or an empty element is encountered, in which case e is not present. Therefore, inserting and finding a key in H takes Θ(|H|) worst-case time. Removing a key from a hash table using linear probing is more complex than from a chained hash table. Indeed, simply removing a key from its element H[b] creates an empty element that might cause the search for a different key e′ stored in an element at position b′ > b to stop prematurely. It is therefore necessary to store supplementary information for each element indicating whether it is empty because it never contained a key or because the key it contained was removed, in which case a search going through this element continues. Quadratic probing is similar to linear probing but uses a quadratic increment to iterate over the elements of the hash table in case of a collision. Finally, the double hashing method combines two hash functions in the probing:

h(i, e) = ((h1(e) + i · h2(e)) mod m) + 1

If element H[h(i, e)] is already occupied, i is incremented until an empty element is found. Compared to linear and quadratic probing, double hashing avoids long runs of occupied elements in the hash table.
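These probing schemes can be summarized in a few lines of Python. This is an illustrative sketch of the probe sequences only (0-based positions, unlike the 1-based formula above; table maintenance, deletion markers and resizing are omitted):

    def probe_linear(h, i, m):
        # Linear probing: positions h, h + 1, h + 2, ... modulo the table size m.
        return (h + i) % m

    def probe_quadratic(h, i, m):
        # Quadratic probing: the increment grows quadratically with i.
        return (h + i * i) % m

    def probe_double(h1, h2, i, m):
        # Double hashing, 0-based variant of h(i, e) = ((h1(e) + i * h2(e)) mod m) + 1.
        return (h1 + i * h2) % m

    def insert(table, key, h1, h2):
        # Probe with double hashing until an empty element (None) is found.
        m = len(table)
        for i in range(m):
            pos = probe_double(h1(key), h2(key), i, m)
            if table[pos] is None or table[pos] == key:
                table[pos] = key
                return pos
        raise RuntimeError("hash table is full")

    table = [None] * 9
    # h2 is kept non-zero so that the probe sequence visits several positions.
    print(insert(table, "e6", h1=lambda e: hash(e) % 9, h2=lambda e: 1 + hash(e[::-1]) % 8))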
The different open addressing methods are illustrated in Figure 1.12.

[Figure 1.12: Open addressing methods: (a) linear probing, (b) quadratic probing, (c) double hashing.]

Although hash tables are advantageous because of the key-value association handled by constant time operations, hash functions with good hashing performance can be slow to compute in practice because of their complex randomization machinery, particularly in the context of cryptographic applications. Furthermore, not all elements of a hash table are used if the hash table size was pre- or re-allocated without a priori knowledge of the number of keys to insert. Finally, when a hash table contains a number of keys n converging to the size of the hash table, collisions occur more often during insertion and open addressing methods spend more time finding empty elements, making the hash table time inefficient.

1.2.5 Bloom Filters

Introduced by Bloom (1970), a Bloom filter (BF) is a space and time efficient data structure that records the approximate membership of elements in a set. Based on the hash table principle, look-up and insertion times are constant. The BF is composed of a binary array B of m elements, initialized with 0s, in which the presence of n items is recorded. A set of f hash functions h1, ..., hf is used such that inserting an item into B and testing for its presence are

Insertbf(e, B) : B[hi(e)] ← 1 for all i = 1, ..., f

and

MayContain(e, B) : B[h1(e)] ∧ ... ∧ B[hf(e)],

respectively, in which ∧ is the logical conjunction operator. These two operations are illustrated in Figures 1.13 and 1.14, respectively.

[Figure 1.13: Insertion of two items e1 and e2 into a BF.]

[Figure 1.14: Query of two items. Item e3 is a true negative and item e4 is a false positive.]

Kirsch and Mitzenmacher (2006) demonstrated that two hash functions combined in a double hashing technique can be applied to BFs in order to simulate more than two hash functions and obtain similar hashing performance. The BF does not support deletion, as an element of B set to 1 might have been set by different hash functions for different items. The BF does not generate false negatives but may generate false positives, as MayContain can report the presence of items which are not present but are a result of independent insertions. In contrast to a hash table, a BF never overflows because it is always possible to insert new items. However, the number of false positives grows quickly with the number of inserted items. Indeed, the false positive ratio φ (Kirsch and Mitzenmacher, 2006) is

φ ≈ (1 − e^(−fn/m))^f.

The number of hash functions that minimizes the false positive ratio is

f = (m/n) · ln 2 ≈ 0.7 · m/n,

and, given n and φ, the optimal size of a BF is

m = −n · ln φ / (ln 2)².

The union and intersection of two sets represented by BFs with the same parameters are

B1 ∪ B2 = B1 ∨ B2 and B1 ∩ B2 = B1 ∧ B2,

respectively, in which ∨ is the logical disjunction operator. Such a union and intersection are computed in O(|B|) time.
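The following self-contained Python sketch (our illustration; a real implementation would use fast non-cryptographic hashing rather than salted SHA-256) implements Insertbf, MayContain and the union over a bit array of length m with f hash functions:

    import hashlib

    class BloomFilter:
        def __init__(self, m, f):
            self.m, self.f = m, f
            self.bits = [0] * m

        def _positions(self, item):
            # Simulate f independent hash functions by salting a single digest.
            for i in range(self.f):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest, "big") % self.m

        def insert(self, item):               # Insertbf
            for pos in self._positions(item):
                self.bits[pos] = 1

        def may_contain(self, item):          # MayContain: conjunction of f bits
            return all(self.bits[pos] for pos in self._positions(item))

        def union(self, other):               # B1 ∪ B2 = B1 ∨ B2 (same m and f)
            result = BloomFilter(self.m, self.f)
            result.bits = [a | b for a, b in zip(self.bits, other.bits)]
            return result

    bf = BloomFilter(m=64, f=3)
    bf.insert("acgt")
    print(bf.may_contain("acgt"), bf.may_contain("tttt"))  # True, almost surely False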
Note that the false positive ratio of the intersection of two BFs B1 and B2 is upper-bounded by the smallest false positive ratio of B1 and B2, because the number of bits set to 1 in the BF intersection is at most the smallest number of bits set to 1 in B1 and B2. On the contrary, the false positive ratio of the union of two BFs B1 and B2 is lower-bounded by the greatest false positive ratio of B1 and B2, because the number of bits set to 1 in the BF union is at least the greatest number of bits set to 1 in B1 and B2.

One main drawback of BFs is their poor data locality. Indeed, for one item, multiple and potentially distant locations of the BF have to be visited and modified. In practice, these locations are first copied from main memory to cache and are then copied back from cache to main memory if modified. Such transfers affect the cache efficiency and slow down the running time.

BFs, as hash tables, have been extensively studied and applied in numerous domains, resulting in a large number of variants. Among the most popular variants, Fan et al. (2000) proposed the Counting Bloom Filter (CBF), which supports deletions by maintaining a counter in each element of B instead of a binary value. Counters are initialized with 0s and the operation Insertbf increments the counters instead of setting them to 1. Similarly, the deletion operation decrements the counters. The CBF is more memory consuming due to the counter size, which, furthermore, must be extended when the maximum value a counter can hold is exceeded. Almeida et al. (2007) proposed the Scalable Bloom Filter, which is designed for a maximum false positive ratio but without a priori knowledge of the number of items to insert. It is made of a series of BFs for which the sizes are determined by a geometric progression. However, potentially all BFs of the series have to be queried to determine whether an item is present. Putze et al. (2009) overcame the data locality problem of BFs with the Blocked Bloom Filter (BBF), an array of smaller BFs individually fitting into one or multiple cache lines. To insert or look up an item, a supplementary hash function is used to determine which BF to load. As some BFs might be overloaded while some others might be sparse, it is difficult to establish the expected false positive ratio of a BBF. The authors of the data structure advise using a BBF size 30 % higher than a standard BF for the same parameters.

1.2.6 Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) by Burrows and Wheeler (1994) is a method to index and compress a string. The key idea of the BWT is to rearrange an input string S such that symbols with similar context are aggregated together. The transformation of S, denoted bwt(S), can be reversed to recover S without any additional information. Although bwt(S) has the same length as S, its runs of identical symbols make it more compressible than S by compression methods such as move-to-front encoding followed by run-length encoding. Similarly to the GST (Section 1.2.3.1), it is possible to build the BWT of multiple strings by concatenating strings that end with different terminal symbols. The BWT is based on a lexicographic sorting of all cyclic permutations of an input string S as rows of a matrix. The last column of the matrix, identified by the label L, corresponds to the computed BWT. Note that the first column of the matrix, identified by label F, corresponds to the symbols of S given in lexicographic order.
An example is given in Table 1.1, in which the cyclic permutations of the string "abaab$" are first computed (Table 1.1a) and then sorted (Table 1.1b). The computed BWT corresponds to the concatenation of the last symbols of all sorted permutations. The example of Tables 1.1a and 1.1b shows that the computed BWT contains two runs of two symbols while the input string contains only one such run.

Table 1.1: BWT computation of string "abaab$". The computed BWT is the last column L of the sorted permutations.

(a) Cyclic permutations: abaab$, $abaab, b$abaa, ab$aba, aab$ab, baab$a.
(b) Sorted cyclic permutations: $abaab, aab$ab, ab$aba, abaab$, b$abaa, baab$a.

Reading the last symbol of each sorted permutation yields bwt("abaab$") = "bba$aa".

The BWT construction algorithm previously described is inefficient and uses O(|S|²) space to compute bwt(S). Instead, bwt(S) can be computed in O(|S|) space by keeping in each row of the matrix a pointer to the corresponding permutation in the original string. Burrows and Wheeler, in the original BWT paper, noticed that by using a terminal symbol in S, the sorting step after computing the cyclic permutations of S is equivalent to sorting the suffixes of S. As described in Section 1.2.3.1, this is achieved in O(|S|) space and time using a suffix tree or array.

Reconstructing a string S from bwt(S) can be achieved naively with an iterative algorithm in O(|S|²) time and space. This algorithm is described first to give an intuition on how the reconstruction operates; a more efficient algorithm running in linear time and space is explained afterwards. The naive algorithm starts from an empty matrix and, in each iteration, fills the first empty column of the matrix with bwt(S) and sorts all rows of the matrix lexicographically. The reconstruction stops when all columns of the matrix are used, and the reconstructed string is in the row having the terminal symbol as its last symbol. An example is provided in Table 1.2.

[Table 1.2: Reconstruction of a string from the BWT string "bba$aa" in twelve alternating prepend-and-sort steps; the reconstructed string "abaab$" is the row ending with the terminal symbol.]

A more efficient reversal is based on the fact that each symbol in a row of column F is preceded in S by the symbol in the same row of column L. This ensures a property named LF-Mapping: the i-th occurrence of a symbol a in F matches the i-th occurrence of the same symbol a in L. Therefore, bwt(S) can be reversed unambiguously in a back-to-front manner using columns F and L only. As column F can be computed from column L, reversing bwt(S) takes O(|S| + |A|) space and time.
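Both directions can be written compactly in Python. This sketch follows the quadratic textbook algorithms of Tables 1.1 and 1.2 (our illustration, not how BWTs are computed in practice):

    def bwt(s):
        # Sort all cyclic permutations of s (which must end with a unique
        # terminal symbol) and concatenate their last symbols (column L).
        permutations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(row[-1] for row in permutations)

    def inverse_bwt(l):
        # Naive reversal of Table 1.2: repeatedly prepend L and sort the rows.
        rows = [""] * len(l)
        for _ in range(len(l)):
            rows = sorted(l[i] + rows[i] for i in range(len(l)))
        return next(row for row in rows if row.endswith("$"))

    print(bwt("abaab$"))          # "bba$aa", as computed in Table 1.1
    print(inverse_bwt("bba$aa"))  # "abaab$"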
The LF-Mapping property also enables pattern matching directly in the BWT of a string S. It is done by iteratively matching a pattern P in bwt(S) from the end to the beginning of P. At each iteration i = 1, ..., |P|, the symbol P(|P| − i + 1, 1) is searched within an interval corresponding to the suffix P(|P| − i + 1, i). The iterations stop when P has been entirely matched or when the symbol P(|P| − i + 1, 1) cannot be found within the current interval, in which case P is not present. This method, known as backward search, is described in more detail in (Li and Durbin, 2009). The LF-Mapping property on which the backward search relies is rarely used as is, but rather with the support of the FM-Index described in the following section.

1.2.7 FM-Index

The BWT is not only useful for compression but also for indexing because of its connections with the suffix array. The FM-Index (Ferragina and Manzini, 2000) was conceived as a direct application of the BWT by proposing a constant time LF-Mapping over the BWT of a string S. This allows locating and counting the occurrences of a pattern P in S with sublinear complexities. The FM-Index is made of a vector C and a matrix Occ. The vector C counts for each symbol a ∈ A of bwt(S) the number of occurrences of lexicographically smaller symbols in bwt(S). Matrix Occ represents a function indicating, for a symbol a ∈ A and a position pos in bwt(S), the number of occurrences of a in L(1, pos). Examples are given in Tables 1.3a and 1.3b.

Table 1.3: FM-Index of the BWT string "bba$aa".

(a) Vector C: C[$] = 0, C[a] = 1, C[b] = 4.
(b) Matrix Occ, with one column per position 1, ..., 6 in bwt(S):
$: 0 0 0 1 1 1
a: 0 0 1 1 2 3
b: 1 2 2 2 2 2

Given C and Occ, the LF-Mapping at position i in bwt(S) is

LF-Mapping(i) = C[L[i]] + Occ(L[i], i).

Using the constant time LF-Mapping from the FM-Index, counting the occurrences of a pattern P in S and locating them takes O(|P|) time. As the matrix Occ can be large, only a subset of its columns is stored in practice and the values of the non-stored columns are recomputed on the fly if needed. More details about the FM-Index and compression techniques to reduce its size are provided in (Ferragina et al., 2004).
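A compact Python sketch of C, Occ and the backward search, using the example of Table 1.3 (illustrative only; real FM-Indexes store Occ sampled and compressed as noted above):

    def fm_index(l):
        # C[a]: number of symbols in bwt(S) lexicographically smaller than a.
        # occ[a][i]: number of occurrences of a in L(1, i), with occ[a][0] = 0.
        alphabet = sorted(set(l))
        c, total = {}, 0
        for a in alphabet:
            c[a] = total
            total += l.count(a)
        occ = {a: [0] * (len(l) + 1) for a in alphabet}
        for i, symbol in enumerate(l, start=1):
            for a in alphabet:
                occ[a][i] = occ[a][i - 1] + (a == symbol)
        return c, occ

    def count(pattern, l):
        # Backward search: maintain the suffix array interval [lo, hi] of the
        # current pattern suffix; an empty interval means the pattern is absent.
        c, occ = fm_index(l)
        lo, hi = 1, len(l)
        for a in reversed(pattern):
            if a not in c:
                return 0
            lo = c[a] + occ[a][lo - 1] + 1
            hi = c[a] + occ[a][hi]
            if lo > hi:
                return 0
        return hi - lo + 1

    print(count("ab", "bba$aa"))  # 2: "ab" occurs twice in "abaab$"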
1.3 Thesis Overview

Pan-genomes are a new opportunity in comparative genomics to better understand life than single-genome reference-based methods allow. Yet, they constitute a tremendous challenge regarding our capacity to simultaneously explore and analyze data from different sources. On the basis of Sections 1.1 and 1.2 detailing the biological and computational background of this thesis, Chapter II describes the current state-of-the-art of pan-genome indexing through the data structures used in pan-genome analysis methods. From this literature, the first aim of this thesis is to establish a list of missing or inadequate features in data structures used to index pan-genomes. Based on this list, we present in Chapter III a new data structure for pan-genome indexing: the Bloom Filter Trie (BFT). We show that the BFT outperforms a data structure providing the same features but based on an approximation of the indexed data. The second aim of this thesis is to show that the problem of pan-genome read storage has never been addressed before and that current techniques for compressed storage of reads are not adapted to it. Hence, Chapter IV first defines the characteristics that a pan-genome read compression method must have. Then, a new tool for pan-genome read compression named DARRC is detailed. We show that the BFT is adapted to the task of pan-genome read compression and is the supporting data structure of this tool. DARRC outperforms all tested tools based on individual and independent genome compression.

CHAPTER II

Methods and Tools for Pan-genome Indexing

2.1 Introduction

Over the years, the definition of the pan-genome has acquired a broader meaning than its original definition by Medini et al. (2005). It nowadays also refers to a collection of sequences from different genomes (Computational Pan-Genomics Consortium, 2016). Such a collection can be analyzed naively by indexing each sequence it is composed of separately. This representation is however wasteful in memory, since the core genome and accessory genome need to be stored as many times as they occur in the strains. Furthermore, such a representation would require analyzing each genome separately, leading to time inefficient methods. If a pan-genome is composed of similar genomes, it is naturally more efficient to index the variants of a collection of genomes with respect to a reference. Such methods are reviewed in Section 2.2. Yet, computing similarities or variants is a complex task achieved by aligning sequences of the pan-genome. As a consequence, much attention has been paid to methods that either provide scalable sequence alignments for many similar genomes or extract the shared and unique regions of the pan-genome directly from a sequence alignment (Section 2.3). This exercise is not trivial and is the reason for the design of graph-based methods (Section 2.4) that take advantage of the similarity and difference not only with a reference but also with the other indexed genomes. A special case of graph-based methods are the de Bruijn graph methods (Section 2.5), which are alignment-free and reference-free. Other methods, reviewed in Section 2.6, enable large scale indexing and querying of genomic databases. Note that the following sections focus only on sequence-based data structures and methods for pan-genome indexing and querying at the DNA level, in contrast to gene-based methods. Other tools for computational pan-genomics are reviewed in (Computational Pan-Genomics Consortium, 2016). This chapter is based on (Zekic et al., 2018) and is a collaboration with Tina Zekic and Jens Stoye.

2.2 Reference-based Methods

Among the reference-based methods, Wandelt et al. (2013) proposed the Referentially Compressed Search Index (RCSI), a tool that uses referential indexing to index a pan-genome and to search for exact or inexact matches. The reference is first indexed using a CST (Section 1.2.3.1) and the other genomes are then encoded as ordered lists of subsequence matches to the reference, called referential match entries. The index is built with respect to a maximum query length and a parameter representing the maximum number of edit distance operations allowed during an inexact search. RCSI exploits the seed-and-extend paradigm (Rasmussen et al., 2006) based on the pigeonhole principle stating that for a sequence S, a pattern P and at most M mismatches, there exists at least one matching substring, called a seed, of length ⌊|P|/(M + 1)⌋ between S and P. When queried, the reference is first searched for matching seeds that are then extended. Matches found are transferred onto the referential match entries using a hashing structure to identify the genomes in which matches occur with at most M mismatches.
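The pigeonhole computation underlying this seed-and-extend strategy is easy to state in code; the following Python sketch (an illustration of the principle only, not of RCSI itself) extracts M + 1 disjoint seeds from a pattern:

    def seeds(pattern, max_mismatches):
        # With at most M mismatches, at least one of the M + 1 disjoint pieces
        # of length floor(|P| / (M + 1)) must match the text exactly.
        n = len(pattern) // (max_mismatches + 1)
        return [pattern[i:i + n] for i in range(0, n * (max_mismatches + 1), n)]

    print(seeds("acgtaacgttaa", 2))  # ['acgt', 'aacg', 'ttaa']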
Another approach to referential indexing is to index variants instead of similarities. The Multiple Genome Index (MuGI) from Danek et al. (2014) applies this approach by maintaining a database of variants. Additionally, the occurrence positions of k-mers in the genomes are stored in arrays, sorted by lexicographic order of the k-mers. The seed-and-extend paradigm employs the k-mer arrays for the exact or inexact matching algorithms such that genomes can be searched without scanning the reference, hence offering faster queries than other state-of-the-art data structures while using less memory during the search. MuGI requires less memory than RCSI to build the index, but its search algorithms can handle only mismatches, while RCSI also supports insertions and deletions. Also, MuGI expects variants to be provided as input, which allows keeping the reference non-indexed, while RCSI computes the subsequence matches itself by indexing the reference.

The Journaled String Tree (JST) from Rahn et al. (2014) is another data structure for indexing similar genomes based on encoding genome variants with respect to a reference. The novelty of this work is that it can be "plugged in" with existing sequential pattern matching algorithms instead of proposing a new search method. A JST is composed of a reference and an array of branch nodes representing variants of one or multiple genomes, forming branches with respect to the reference. Each genome is referentially compressed as a journaled string consisting of an insertion buffer and a binary search tree. The insertion buffer is a string which is the concatenation of the genome subsequences not present in the reference (insertions). The binary search tree maintains positions and lengths of inserted or shared subsequences with respect to the reference. Pattern matching algorithms scanning a sequence in a left-to-right manner can be used in combination with the JST: as soon as a branch node occurs in the reference, the state of the search algorithm is saved for later processing and the algorithm processes the occurring branch that is extracted from the corresponding journaled strings. Experiments with exact and inexact pattern matching algorithms show better running time and memory consumption compared to a naive scheme in which sequences are processed sequentially. However, MuGI shows better performance during the search, mainly due to its fully indexed search, while the JST supports only sequential search.

Self-indexes, like the FM-Index (Section 1.2.7) and the CST, are data structures that compress and index data while providing random access to the indexed data as well as pattern matching functionalities. Mäkinen et al. (2010) introduced a new family of self-indexes (Navarro, 2012) for storage, retrieval and search of highly similar sequence collections. The central idea of this approach is that previous self-indexes could not capture the high similarity of multiple sequences that differ only by a few variants. Hence, the newly proposed family extends existing self-indexes to achieve greater compression than the high-order entropy of the sequence collection. The approach is then used to introduce new basic structures, including an advanced suffix array sampling scheme, that are adapted afterwards to a CST. This representation of a collection of highly repetitive sequences uses O(n log(N/n) + ω log² N) bits on average, in which N is the total length of the sequence collection, n is the length of the reference and ω is the total number of mutations. Exact matching of a pattern P is achieved in O(|P| log N) time. The provided basic structures require each sequence to be pairwise aligned with the reference of the collection to provide the basic query operations.
BWBBLE (Huang et al., 2013) builds an augmented reference by including all variants detected between a set of genomes and a reference. For this purpose, a linear augmented reference is built and its BWT (Section 1.2.6) is computed such that it can then be used by an aligner based on the FM-Index. SNPs are handled by making use of the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code, in which each symbol represents either one nucleotide or multiple possible nucleotides. More complex variants such as indels (insertions or deletions) are linearized: one of the indels is incorporated into the augmented reference and its surrounding characters in the augmented reference are padded to the extremities of the other indels. These other indels are then concatenated to the augmented reference using a special separation symbol. Exact and inexact matching algorithms with the augmented reference use an extension of the BWT-based backward search (Sections 1.2.6 and 1.2.7) that handles IUPAC nucleotide symbols. As an IUPAC nucleotide symbol can represent multiple nucleotides, a query can match several different substrings in the augmented reference and, thus, different suffix array intervals. While being more accurate with the augmented reference than BWA with a single reference (Li and Durbin, 2009), the aligner is slower and uses significantly more memory.

Durbin (2014) describes another data structure based on the BWT, the positional Burrows-Wheeler Transform (pBWT), to represent a haplotype sequence collection with w binary allelic variable sites for efficient haplotype matching and compact storage. The proposed algorithms first derive a reverse prefix ordering of the sequences instead of the classical lexicographic ordering of suffixes in the suffix array. This new ordering ensures that locally maximal sequences are adjacent. Consequently, forward search of the sequences can be used to find all set-maximal matches from a new sequence or from a sequence of the collection. Both matchings can be computed in linear time: O(w) time in the former case and O(wN) time in the latter case. The pBWT is highly compressible and uses the FM-Index for the forward search.

2.3 Alignment-based Methods

Panseq (Laing et al., 2010) is an online tool for pan-genome analysis relying on Pairwise Sequence Alignment (PSA), Multiple Sequence Alignment (MSA) and Local Sequence Alignment (LSA) for its different modules, described in the following. The Novel Region Finder module extracts novel regions out of a set of input sequences by iteratively adding them to a database if there are no PSA matches with sequences already in the database. The Core and Accessory Genome Finder module first eliminates the singleton genome of the input sequences by creating a pool of input sequence segments that match at least two genomes. Segments are then broken into fragments of user-specified length such that an LSA determines whether they are part of the core or accessory genome. Core genome sequences are built for each input sequence by concatenating its core genome fragments. The core genome is provided as an MSA of the core genome sequences, while the accessory genome is provided as a binary matrix indicating the presence of each fragment in the input sequences. Finally, the Loci Selector module constructs a locus set based on the number of unique fingerprints and the discriminatory power of the loci among the input sequences.
The Harvest suite (Treangen et al., 2014) uses a variant of whole-genome alignment for rapid core genome extraction of highly similar microbial genomes (≥ 97 % average nucleotide similarity). The suite is composed of two tools: Parsnp for core genome MSA and Gingr for dynamic visualization of large scale MSAs as well as core genome SNP tree exploration. The first tool relies on the fact that core genome MSA has a lower complexity than MSA because it focuses only on regions shared by all genomes. Parsnp starts by indexing an input genome using a CST, which is then queried with the remaining genomes in order to identify, in a first step, multiple Maximal Unique Matches (multi-MUMs) present in all input genomes and, in a second step, Locally Collinear Blocks of multi-MUMs. Gaps between collinear multi-MUMs are then aligned. Additional post-processing steps are performed to filter out unreliable SNPs and build the core genome SNP tree. Parsnp was compared to whole-genome alignment methods, k-mer based methods and read mapping methods. Results show that its accuracy is a compromise between whole-genome alignment methods and read mapping methods but depends on the input data (draft or finished assemblies). However, Parsnp has the advantage of running in a small fraction of the other methods' running times.

Nguyen et al. (2015) formulated the pan-genome construction problem as a genome rearrangement problem in which the arrangement and orientation of sequence alignment blocks for a set of genomes partitioned by homology must be optimized. The problem has been shown to be NP-hard and a heuristic using Cactus graphs (Paten et al., 2011) was provided.

2.4 Graph-based Methods

The Variant-based Graph (VG) representation of a collection of genomes was introduced in the tool GenomeMapper (Schneeberger et al., 2009) for sequence alignment against multiple genomes. This representation follows a principle similar to reference indexing methods, with the exception that the underlying data structure does not index variants with respect to a reference but instead implements a fully indexed multi-genome reference in the form of a graph. GenomeMapper fragments each input genome into blocks of equal length maintained in a block table. Blocks corresponding to shared regions in the genomes are stored only once and each block is connected to its neighboring blocks, thus forming a graph in which bubbles are formed by blocks of divergent sequences. In addition, a k-mer hash-based index is built to map each k-mer occurring in the input genomes to a list of blocks and positions in the blocks. This index is used to identify seeds for the alignment.

The Generalized Compressed Suffix Array (GCSA) (Sirén et al., 2011, 2014) is a self-index data structure which indexes an MSA of genomes into a finite automaton. The latter is encoded using a generalization of the BWT to finite automata, achieving in the expected case, for a constant-size alphabet A, a representation of O(n · (1 + |A|ω)^O(log n)) bits. Possible applications are read alignment, split-read alignment using splicing graphs, probe and primer design, as well as alignment to assembly graphs. GCSA has been improved into GCSA2 (Sirén, 2017), which is the core data structure of a VG toolkit, vg (vg team, 2015), providing sequence alignment and read mapping over a multi-genome reference using the seed-and-extend paradigm.

HISAT2 (Kim et al., 2016) is a VG alignment tool for populations of genomes based on HISAT (Kim et al., 2015).
It uses a Hierarchical Graph FM-Index (HGFM) composed of a main Graph FM-Index (GFM) based on GCSA, representing the general population of genomes, and a large number of small GFMs, each representing a small genomic region.

PanCake (Ernst and Rahmann, 2013) is an extension of string graphs, known from genome assembly (Myers, 2005), which achieves compression based on PSA. PanCake graph vertices represent a reference subsequence and a set of tuples called feature instances. Such a tuple contains a chromosome identifier, the position in the identified chromosome of a subsequence similar to the reference subsequence, as well as the compressed information necessary to reconstruct this subsequence. Feature instances from the same chromosome are ordered in a doubly linked list, such that the entire chromosome can be reconstructed by iterating over the list and concatenating the reconstructed subsequences. New genomes to be inserted in the data structure are pairwise aligned with the chromosomes already inserted in the graph. One application of PanCake graphs is core and singleton genome extraction: singleton regions are the vertices that contain a unique feature instance, while core regions are the vertices containing at least one feature instance from every genome stored in the graph.

2.5 de Bruijn graph Methods

A usual step following the construction of a dBG (Section 1.2.2.1) is to compact it in order to decrease the memory requirements and simplify the graph for further analysis. As building a compacted dBG requires building the uncompacted dBG first, much attention has been focused on algorithms enabling the direct construction of the compacted dBG without its uncompacted counterpart.

SplitMEM (Marcus et al., 2014) makes use of the dBG to extract shared regions of a set of similar genomes. The algorithm directly builds the compacted dBG in O(N log g) time and O(N + |G|) space, with g being the length of the longest genome and |G| the size of the compacted dBG. To this end, SplitMEM exploits the relations between the dBG and the suffix tree (Section 1.2.3.1) with the use of suffix skips, a generalization of suffix links that connect internal vertices of a suffix tree by trimming multiple characters at the beginning of a suffix. The augmented suffix tree first allows identifying all MEMs of length at least k in the genomes. As MEMs can overlap and be nested, they are split using suffix skips in order to build the set of repeatNodes, substrings occurring at least twice in the pan-genome. Then, uniqueNodes, unique substrings of the pan-genome that link repeatNodes, as well as the outgoing edges of each vertex, are identified. Core or accessory genome extraction can be performed, with the core genome defined as the subsequences of the pan-genome occurring in at least 70 % of the indexed genomes.

Baier et al. (2016) improved SplitMEM with two algorithms, one using the CST and the other using the BWT. The CST algorithm follows the same two steps as SplitMEM: first, identifying repeatNodes based on left and right maximality of repeats, then identifying uniqueNodes that link repeatNodes, as well as the edges of the graph. It runs in O(N) instead of O(N log g) time, as suffix skips are not required to identify repeatNodes and a non-comparison sorting algorithm is used to identify uniqueNodes and edges of the dBG. The BWT algorithm instead computes the complete compacted dBG in a single backward pass over the pan-genome and runs in O(N log |A|) time for an alphabet A.
TwoPaCo (Minkin et al., 2016) is a highly parallel method to build the compacted dBG of many similar whole genome sequences. It reduces the problem of finding maximal non-branching paths in the dBG to finding, in the input sequences, the positions of junctions: k-mers that are branching and k-mers that start or finish an input sequence. For this purpose, the algorithm first considers a set of candidate junction positions in the input sequences. For each such position i, the two (k + 1)-mers starting at positions i and i − 1 are inserted into a data structure D. Then, each such position i is processed again by querying D for all possible successors and predecessors of the k-mer starting at position i. If the k-mer has an edge in-degree of 1 and an edge out-degree of 1, it is not a junction. TwoPaCo uses this method in a first pass with a BF (Section 1.2.5) as data structure D to be memory efficient and select the candidate junction positions. Those positions are then used as input for a second pass with a hash table (Section 1.2.4) as data structure D to remove the false positive junctions generated by the BF. Because the two-pass method can still be memory intensive, k-mers are divided into partitions and the two-pass method deals with one partition at a time. The expected running time is O(Nf + k · (|G| + σϕψ)), with f being the number of hash functions used in the BF, σ being the number of non-junctions in G, ϕ being the probability of a non-junction to be a false positive and ψ being the average number of times a false positive junction occurs in an input sequence. The expected memory is O(max(m, k · (δ + σϕ))), with m being the number of bits in the BF and δ being the number of junctions.

BCALM (Chikhi et al., 2015) and, by extension, its highly parallel version BCALM 2 (Chikhi et al., 2016) are methods to efficiently compact the dBG of reads and whole genome sequences. The approach is the inverse of the one of TwoPaCo: instead of identifying junction k-mers, BCALM 2 progressively compacts the dBG such that junction k-mers naturally arise from it. In BCALM 2, k-mers are first partitioned into clusters with regard to their minimizer (Roberts et al., 2004), the lexicographically smallest of their p-mers with p < k. Each k-mer x is attributed to the cluster corresponding to the minimizer minp of its prefix x(1, k − 1). It is also attributed to the cluster corresponding to the minimizer mins of the suffix x(2, k − 1) if minp ≠ mins. This ensures that k-mers with the same left or right minimizer are in the same cluster. Then, the clusters are compacted in parallel based on their respective (k − 1)-length overlaps and minimizers. Each such compaction produces a set of strings, with each string of length k + η − 1 being the compaction of η k-mers. Each compacted string of cluster min might have a left (respectively right) lonely end, a (k − 1)-length prefix (respectively suffix) for which min is not the minimizer. These strings are candidates for merging with strings of other clusters in a reuniting step.

Note that TwoPaCo and BCALM 2 build the compacted dBG of single or multiple genomes, but neither indexes the dBG nor stores the identity of the datasets from which the k-mers originate.
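The minimizer-based partitioning can be sketched in a few lines of Python (our simplified illustration of the clustering criterion only, ignoring reverse complements and BCALM 2's parallel, disk-based machinery):

    def minimizer(s, p):
        # Lexicographically smallest p-mer of s.
        return min(s[i:i + p] for i in range(len(s) - p + 1))

    def cluster(kmers, p):
        # Each k-mer joins the cluster of the minimizer of its (k-1)-length
        # prefix, and also that of its (k-1)-length suffix if the two differ.
        clusters = {}
        for x in kmers:
            for m in {minimizer(x[:-1], p), minimizer(x[1:], p)}:
                clusters.setdefault(m, []).append(x)
        return clusters

    print(cluster(["acgta", "cgtac", "gtacg"], p=2))
    # {'ac': ['acgta', 'cgtac', 'gtacg'], 'cg': ['acgta', 'cgtac']}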
The usage of cdBGs (Section 1.2.2.1) in a pan-genome context was introduced by the de novo assembler Cortex (Iqbal et al., 2012). In a pan-genomic context, the colors of the cdBG represent the genomes in which the k-mers occur. Cortex can perform assembly of single genomes and populations of genomes that scale to eukaryotic genome sizes, and its algorithms can genotype simple and complex variants. Bubble calling and path divergence algorithms make extensive use of the colors to aggregate information from different samples and accurately identify the variant types. The cdBG data structure of Cortex is a hash table in which hashed k-mers are associated with various information such as colors and coverage.

Vari (Belk et al., 2016) is also a succinct data structure for indexing a pan-genome as a cdBG. It uses the BOSS representation (Bowe et al., 2012) of succinct dBGs based on an adaptation of the FM-Index. The BOSS representation of a dBG G is defined by the edge-BWT of G and two bitvectors. The edge-BWT of G is the sequence of edge labels sorted according to the co-lexicographic order of the edges' starting nodes. The bitvectors register the positions of specific edges for each node. Vari builds the BOSS representation of the union of the individual dBGs and additionally stores a two-dimensional binary array containing the colors of each edge. Vari was compared to Cortex on multiple sets of similar whole genome sequences by constructing the cdBG and performing a bubble calling algorithm. Results show that Vari outperforms Cortex regarding memory efficiency, while Cortex is the more time efficient of the two.

2.6 k-mer based Methods

As q-gram indexes, also called k-mer indexes, require large amounts of memory and are often used for sequence alignment purposes, Claude et al. (2010) proposed two compressed q-gram indexes dedicated to highly repetitive sequence collections. The first method divides a sequence collection of total length N into blocks of length l and establishes an index that associates every occurring q-gram with a list of the blocks in which it appears. The sequence collection is encoded using a grammar-based compressor that can efficiently encode long repetitions and provides linear time decompression. Lists of blocks are first delta-encoded to limit the size of each number in the list. Then, each number is encoded using a variable-length encoding technique that assigns a shorter encoding to smaller numbers. Finally, LZ-77 (Ziv and Lempel, 1977) compresses the previously encoded list to take advantage of its internal repetitions. Searching for a q-gram takes O(N · (1 − (1 − l/N)^occ)) average time and O(min(l · occ, N)) worst-case time, with occ being the number of occurrences of the searched q-gram. The second proposed index is a grammar-compressed self-index based on a Straight-Line Program (SLP), a grammar generating a unique string. Indexing a collection of sequences based on SLPs of minimum size can achieve some compression, but computing them is an NP-complete problem. Instead, the authors proposed to use a grammar-based compressor that does not exactly generate an SLP but provides a good heuristic for this problem. This self-index requires γ · (log r + log γ + log(N/|A|)) + r · (3 log r + log N) bits, where γ is the length of the final compressed stream, r is the number of grammar rules and A is the alphabet used. The worst-case running time is O((q · (q + log N) + occ) · log N · log r).

The Sequence Bloom Tree (SBT) from Solomon and Kingsford (2016) is a data structure for large scale querying of genomic experiments. It is designed as a binary tree with BFs as vertices.
Leaves of the tree approximately represent genomic experiments: k-mers are extracted from each such experiment and inserted into the BF of the corresponding leaf. An internal vertex is the union of the BFs of its two children, i.e., a BF in which an element is set to 1 if the element at the same position in at least one of the two children is 1. As BFs generate false positives, leaves do not represent dBGs but approximations of k-mer pools instead. Although the BF size must be the same for all vertices of the tree, BFs are compressed using RRR (Raman et al., 2007) such that over-sized BFs have a higher compression ratio. An SBT is queried with a pattern P by decomposing it into k-mers, which are in turn used to query the SBT in a top-down manner. For each vertex v of an SBT t traversed during the querying, the subset of k-mers that are present according to the vertex BF is determined. This subset is then used to query children(v, t). It allows pruning the search in branches that do not contain the queried k-mers. A match of pattern P in a genomic experiment represented by leaf v is reported if θ · (|P| − k + 1) k-mers from P are reported present in v, with 0 ≤ θ ≤ 1 being a user-defined threshold. Because of the false positives and the heuristics used for the pattern matching, the set of experiments in which P occurs is a subset of the ones reported by the SBT.
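The pruning logic of the SBT query can be sketched as follows (plain Python sets stand in for the per-vertex BFs, so no false positives occur here; all names are ours):

    class SBTNode:
        """Vertex of an SBT; `bf` is its Bloom filter (a set stands in here)
        and `name` identifies the genomic experiment for leaves."""
        def __init__(self, bf, left=None, right=None, name=None):
            self.bf, self.left, self.right, self.name = bf, left, right, name

    def sbt_query(node, kmers, threshold):
        # Keep only k-mers reported present by this vertex; prune the subtree
        # if fewer than `threshold` (i.e. theta * (|P| - k + 1)) remain.
        present = [x for x in kmers if x in node.bf]
        if len(present) < threshold:
            return []
        if node.name is not None:  # leaf: one genomic experiment
            return [node.name]
        return (sbt_query(node.left, present, threshold)
                + sbt_query(node.right, present, threshold))

    leaf1 = SBTNode({"acg", "cgt"}, name="exp1")
    leaf2 = SBTNode({"ttt"}, name="exp2")
    root = SBTNode(leaf1.bf | leaf2.bf, left=leaf1, right=leaf2)  # union of children
    print(sbt_query(root, ["acg", "cgt"], threshold=2))  # ['exp1']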
CHAPTER III

Pan-genome Indexing

3.1 Introduction

As described in Chapter II, efficient tools have been proposed to index and analyze pan-genomes. However, such methods and data structures do not cover all expected features for pan-genome analysis. Most of them operate only on draft or finished assemblies as input, while such assemblies are available only for a small fraction of species. Furthermore, hundreds or thousands of such assemblies might be required to characterize the pan-genome of a species, a number far larger than what is available in most cases. By the end of February 2017, the National Center for Biotechnology Information (NCBI) Genome database (NCBI, 2017) contained 23,004 assembled genomes, of which about 85 % have only one assembly available. However, unassembled reads abound in databases and represent the vast majority of the data available. By the end of February 2017, the NCBI Sequence Read Archive (SRA) database (NCBI, 2007) contained about 9.8 petabases of reads. Also, methods using an assembly as reference introduce a bias in the analysis towards the reference. Finally, it has been shown that assembly errors can lead to an over-estimation of the number of genes inferred from an assembled genome (Denton et al., 2014). This can cause an over-estimation of the size and growth of the core, accessory and singleton genomes. Hence, an ideal data structure indexing a pan-genome should be reference-free and consider assemblies as well as reads as input, to take advantage of all the data available in genomic databases.

Most methods presented in the previous chapter also use time or memory costly algorithms to compute variants and similarities between the input genomes, often relying on sequence alignments. Some of these tools simply leave to the user the burden of obtaining a reference and a set of variants with respect to this reference. Rather, a data structure for pan-genome indexing should be capable of computing variants without using sequence alignment methods. More importantly, almost all methods we are aware of consider pan-genomes to be immutable, despite the fact that they are continuously growing with newly sequenced genomes. Indeed, large scale sequencing projects such as the 1000 Genomes Project (1000 Genomes Project Consortium, 2015) might take years to complete and regularly release newly sequenced genomes. Because methods indexing and analyzing pan-genomes are not incremental, any new genome added to the pan-genome causes its entire index to be recomputed.

In this chapter, we propose a new lightweight data structure for indexing a pan-genome as a cdBG (Section 1.2.2.1), the Bloom Filter Trie (BFT). It is alignment-free, reference-free, incremental and accepts assemblies as well as reads as input. The BFT provides insertion and look-up operations for strings of fixed length associated with a set of colors. The data structure is described in the next section, followed by the operations it supports. Then, a description of the traversal method of a cdBG stored as a BFT is provided. Finally, experimental results showing the performance of the data structure are presented. The work described in this chapter was published in (Holley et al., 2015, 2016) and is joint work with Roland Wittler and Jens Stoye. Note that in the figures of this chapter, grey colored labels and arrows are for illustration but are not stored in practice.

3.2 The Bloom Filter Trie

The Bloom Filter Trie (BFT) is a data structure based on the burst trie (Section 1.2.3.2) that implements a cdBG (Section 1.2.2.1). Each input genome, represented by a set of assembled sequences or by a set of reads, is first decomposed into its constituent k-mers. These k-mers are inserted into the BFT with colors representing the genomes in which they occur. In the following, let bft = (Vbft, Ebft) be a trie that we name BFT. The BFT is created for a fixed value of k, where we assume that k is a multiple of an integer l such that k-mers can be split into k/l equal-length substrings. The BFT has two types of vertices: container vertices and color set vertices. A container vertex is a list of containers, which are compressed or uncompressed. Such a list starts with zero or more compressed containers and finishes with at most one uncompressed container. A color set vertex contains a color set and is a leaf of the BFT. An edge (v, v′) ∈ Ebft starting from the compressed container of a container vertex v and pointing to a vertex v′ ∈ children(v, bft) represents a substring of length l. Thus, the maximum height of a BFT bft is height_max(bft) = k/l, since a k-mer is the concatenation of k/l substrings of length l, each of which is represented by at most one edge in a path starting from the root. Such a path ends with a color set vertex. Indeed, a color set vertex is always located at depth j = height_max(bft), while a container vertex is always located at a depth i < j. A k-mer may be represented by a path having fewer than height_max(bft) edges because its suffix may be stored in an uncompressed container, together with the color set of the k-mer. The bursting method replaces an uncompressed container with a compressed one in order to represent the indexed k-mers more efficiently. This method is adapted from the burst trie and is described in the following sections, together with a more detailed description of the BFT.
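As a small illustration of this decomposition (with the parameters of Figure 3.1 below, k = 12 and l = 4), the following sketch lists the edge labels on a k-mer's maximal path from the root:

    def path_labels(kmer, l):
        # Labels of the edges on the k-mer's maximal path from the BFT root:
        # k/l substrings of length l.
        assert len(kmer) % l == 0
        return [kmer[i:i + l] for i in range(0, len(kmer), l)]

    print(path_labels("aggctatgctca", 4))  # ['aggc', 'tatg', 'ctca']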
3.2.1 Uncompressed Container

An uncompressed container of a container vertex v in a BFT is used to index a small number of k-mer suffixes and their colors. It is a limited-capacity set of tuples ⟨s, color⟩, where p ⊙ s is a k-mer of which p is the prefix represented by the path from the root to v, s is the suffix, and color is the color set associated with the k-mer p ⊙ s. Tuples are lexicographically ordered in the uncompressed container according to their suffixes. The set of tuples is implemented as a dynamic array. When the number of suffixes stored exceeds the capacity ϕ of the uncompressed container, it is burst into a compressed container to represent the suffixes and colors it contains more efficiently. In this process, each suffix s of the uncompressed container is split into a prefix spref of length l and a suffix ssuf of length |s| − l such that s = spref ⊙ ssuf. The prefixes are stored in a new compressed container replacing the uncompressed one in vertex v. Such prefixes are the labels of edges starting from the compressed container and ending on new children containing new uncompressed containers that store the suffixes and their color sets. Suffixes sharing a common prefix are stored in the same child. In the following, we refer to the prefixes of k-mer suffixes stored in compressed containers as edge labels. An example of a BFT and a bursting is given in Figure 3.1.

[Figure 3.1: Insertion of six suffixes (that are here complete k-mers) with different colors into a BFT with k = 12, l = 4 and ϕ = 5. In (a), the first five suffixes are inserted at the root into an uncompressed container uc. When a sixth suffix "gcgccaggaatc" is inserted, uc exceeds its capacity and is burst, resulting in the BFT structure shown in (b) with one compressed container and four uncompressed containers. Note that in practice, container vertices might have more than one container and suffixes might have more than one color.]

3.2.2 Compressed Container

A bursting replaces an uncompressed container with a compressed one, used to store q ≤ ϕ edge labels in compressed form (in Figure 3.1(b), q = 4) and their corresponding edges pointing to children containing the suffixes. A compressed container is composed of multiple layers. The first layer is a BF (Section 1.2.5) used for the approximate membership of edge labels, while the second layer is an exact representation of the edge labels. Although those two layers contain redundant information, the primary purpose of the first layer is to accelerate container insertion and look-up, as it can be queried in constant time rather than the linear time of the second layer. BFs of compressed containers are denoted by bf and each such BF is a bit array of length m associated with f hash functions. An example of a BF for the edge labels of Figure 3.1b is provided in Figure 3.2.

[Figure 3.2: BF of four edge labels "aggc", "ctca", "gccc" and "gcgc" with f = 2 hash functions h1 and h2.]

As a BF is a one-sided data structure that generates false positives when queried, it is necessary to have an exact representation of the edge labels in the second layer of the compressed containers. An ideal index for this task is a path and level compacted tree but, as described in Section 1.2.3, tree representations can be highly memory inefficient and must be designed with special care.
For this purpose, we propose a representation of an edge label tree relying on a hierarchy of arrays. It is based on a method called quotienting from Bender et al. (2012) that was first suggested by Knuth (1998). In his book "The Art of Computer Programming: Sorting and Searching", Knuth describes a simple form of hashing called division hashing in which the hash fingerprint of a key e is h(e, i) = (⌊e/i⌋ mod m) + 1 and corresponds to the quotient of the key e divided by an integer i. This method is generally not used with hash tables because of its poor hashing performance. Bender et al. (2012) reused this concept as the core method of the Quotient filter, an approximate membership query data structure supporting all operations of the BF as well as deletion and resizing. In the Quotient filter, division hashing is used differently: a key e is first hashed with a universal hash function and its computed z-bit fingerprint is partitioned into its λ leftmost bits, the quotient, and its µ rightmost bits, the remainder, such that z = λ + µ. Naturally, the quotient is a binary prefix of the fingerprint and the remainder is a binary suffix of the fingerprint. While all remainders are stored in the Quotient filter, identical quotients are stored only once, with additional information to retrieve all remainders of each quotient. Although efficient, the Quotient filter cannot be used to index edge labels in compressed containers, because only the hash fingerprint is stored and not the keys themselves. Furthermore, false positives occur in such a data structure if and only if two different keys hash to the same fingerprint. Instead, the BFT uses the quotienting method in the compressed containers by directly splitting each edge label label into a prefix labelp and a suffix labels with respective binary representations α (the quotient) and β (the remainder) of lengths λ and µ bits. The alphabet we consider is the DNA alphabet A = {a, c, g, t}, for which each symbol can be stored using two bits. Edge label prefixes and suffixes are indexed in a three-level tree: edges between the first and second level index the prefixes while edges between the second and third level index the suffixes. An example of such a tree (cf. Figure 3.1b) is provided in Figure 3.3.

[Figure 3.3: Tree representation of four edge labels "aggc", "ctca", "gccc" and "gcgc".]
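The quotienting split itself can be illustrated with a few lines of Python (a sketch assuming the 2-bit DNA encoding above; function and variable names are ours):

    ENCODING = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}

    def quotienting(label, prefix_length):
        # 2-bit encode a DNA edge label and split its binary representation into
        # the quotient alpha (the lambda = 2 * prefix_length leftmost bits) and
        # the remainder beta (the mu rightmost bits).
        bits = 0
        for symbol in label:
            bits = (bits << 2) | ENCODING[symbol]
        mu = 2 * (len(label) - prefix_length)
        return bits >> mu, bits & ((1 << mu) - 1)

    alpha, beta = quotienting("aggc", prefix_length=2)
    print(alpha, beta)  # 2 and 9: "ag" = 0b0010, "gc" = 0b1001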
To represent such a tree, three structures pref, suf and clust are used:

• pref represents the root of the edge label tree and its outgoing edges. It is a bit array of length 2^λ, initialized with 0s and used to record prefix presence exactly. Here, the binary representation α of a prefix labelp is interpreted as an integer such that pref[α] = 1 records the presence of labelp in the edge label tree;

• suf represents the leaves of the edge label tree and their in-going edges. It is an array of the q suffixes labels, sorted in ascending lexicographic order;

• clust represents the vertices at the second level of the edge label tree. It is an array of q bits, one per suffix of array suf, that represent cluster starting points. A cluster is a list of consecutive suffixes in array suf sharing the same prefix. It represents a set of leaves in the edge label tree sharing the same parent. Each cluster has an index 1 ≤ icluster ≤ 2^λ and a start position icluster ≤ poscluster ≤ q in the array suf. Position poscluster in array clust is set to 1 to indicate that the suffix in suf[poscluster] starts a cluster because it is the lexicographically smallest suffix of its cluster. A cluster contains n ≥ 1 suffixes and, therefore, each position poscluster < i < poscluster + n in array clust is set to 0. The end of a cluster is implicitly indicated by the beginning of the next cluster or by the end of the array.

The exact representation of the edge labels shown in Figure 3.3 is shown in Figure 3.4.

[Figure 3.4: Exact representation of four edge labels "aggc", "ctca", "gccc" and "gcgc" in a compressed container with |labelp| = 2 and |labels| = 2: pref has bits set for prefixes "ag", "ct" and "gc", suf = ["gc", "ca", "cc", "gc"] and clust = [1, 1, 1, 0].]

Each compressed container has a third and last layer, the array edge. It contains q edges, one for each edge label stored, i.e., one for each suffix of array suf. Each of these edges points to a child. Arrays suf, clust and edge are dynamic, such that function Insertarray(e, pos, A) inserts an item e at position pos of the array A by reallocating it and shifting every item having an index i ≥ pos by one position in O(|A|) time. The dynamic nature of these arrays allows all children accessible through the edges of a compressed container to be stored contiguously in memory. In practice, only two pointers must be stored per compressed container, one for all container vertices and one for all color set vertices, to access all children of a compressed container. Therefore, the size of the edges in a BFT is negligible. The size of q edge labels indexed in a compressed container is m + 2^λ + q · (µ + 1) bits. The prefix length |labelp|, its bit length λ and the BF size m are chosen such that the set of edge labels stored in a compressed container does not occupy more memory than its original representation in an uncompressed container, i.e., m + 2^λ ≤ q · (λ − 1). Each edge label inserted after a bursting occupies only µ + 1 extra bits. When the average size per stored edge label is close to µ + 1 bits, the arrays pref, suf and clust can be recomputed by increasing |labelp| and decreasing |labels|, such that 2^λ′ + q · µ′ < 2^λ + q · µ, where λ′ and µ′ are the values of λ and µ, respectively, after resizing.
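To trace a look-up through these three arrays, consider the following Python sketch (0-based indexing in contrast to the 1-based notation above; names are ours, and a real compressed container consults the BF layer first):

    def hamming_weight(alpha, pref):
        # Number of 1s in pref[0..alpha]: the index of the cluster of prefix alpha.
        return sum(pref[:alpha + 1])

    def select(i, clust):
        # Start position of cluster i, i.e. the position of the i-th 1 in clust.
        seen = 0
        for pos, bit in enumerate(clust):
            seen += bit
            if seen == i:
                return pos
        return len(clust)  # cluster does not exist: one past the end

    def contains(alpha, suffix, pref, suf, clust):
        if pref[alpha] == 0:  # prefix absent, so the edge label is absent
            return False
        start = select(hamming_weight(alpha, pref), clust)
        end = start + 1       # the cluster ends before the next 1 in clust
        while end < len(suf) and clust[end] == 0:
            end += 1
        return suffix in suf[start:end]

    # Arrays of Figure 3.4: prefixes "ag" (index 2), "ct" (7) and "gc" (9) are set.
    pref = [0] * 16
    for i in (2, 7, 9):
        pref[i] = 1
    suf = ["gc", "ca", "cc", "gc"]
    clust = [1, 1, 1, 0]
    print(contains(9, "gc", pref, suf, clust))  # True:  "gcgc" is indexed
    print(contains(9, "ca", pref, suf, clust))  # False: "gcca" is not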
3.3 Operations

The BFT supports all operations necessary for storing, traversing and searching a pan-genome. Here we describe the most basic of them, insertion and look-up of k-mers and their colors, as well as how the sets of colors are compressed. The traversal of the graph is discussed in the next section. The operations described in the following use three auxiliary functions:

• HammingWeight(α, pref) counts the number of 1s in pref[1..α] and corresponds to the number of prefixes represented in array pref that are lexicographically smaller than or equal to a prefix label_p with binary representation α of length λ bits. This requires O(2^λ) time.

• Select(i, clust) returns the position of the i-th entry 1, corresponding to the start position of cluster i in array clust of length q. If the entry is not found, the function returns position q + 1. While Select could be implemented in O(1) time (Ferragina and Manzini, 2000), we use a more naive but space efficient O(q) time implementation (a sketch of these two functions follows this list).

• Insert_cc_arrays(bv, s, pos, cc) performs two tasks. First, it inserts a binary value bv and a suffix s at position pos of arrays clust and suf, respectively, in compressed container cc. Then, it creates a new edge pointing to a new child and inserts this edge at position pos of array edge in compressed container cc. This function takes O(q) time, with q being the number of edge labels in the compressed container cc. It returns the end vertex of the edge stored in edge[pos].
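A minimal C sketch of HammingWeight and Select as described above (bit arrays are assumed to be packed in bytes and 0-based, whereas the text uses 1-based positions; this is our own illustration, not the BFT code):

#include <stdint.h>
#include <stddef.h>

/* HammingWeight(alpha, pref): number of 1s in pref[0..alpha], i.e., the
 * rank of prefix alpha among the prefixes present in the container. */
static size_t hamming_weight(size_t alpha, const uint8_t *pref) {
    size_t count = 0;
    for (size_t i = 0; i <= alpha; i++)
        count += (pref[i >> 3] >> (i & 7)) & 1;
    return count;
}

/* Select(i, clust): position of the i-th 1 in clust[0..q-1], i.e., the
 * start position of cluster i. Returns q if the entry is not found,
 * playing the role of the q + 1 sentinel in 0-based indexing. */
static size_t select_1(size_t i, const uint8_t *clust, size_t q) {
    size_t seen = 0;
    for (size_t pos = 0; pos < q; pos++) {
        seen += (clust[pos >> 3] >> (pos & 7)) & 1;
        if (seen == i) return pos;
    }
    return q;
}

Both functions scan their input linearly, matching the O(2^λ) and O(q) time bounds given above; a production implementation would typically process whole bytes or words with population count instructions instead of single bits.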
3.3.1 Container Insertion

This section describes the functions inserting a k-mer suffix into an uncompressed container and inserting an edge label into a compressed container. These functions are used in the next section to describe how a k-mer insertion is performed in a BFT.

The function Insert_uc inserts a k-mer suffix s into an uncompressed container uc. Because these containers are lexicographically sorted, the function simply performs a binary search to find the position where s must be inserted. Insert_uc takes O(log ϕ + ϕ) time, with ϕ being the capacity of uc, and returns the color set associated with s such that new colors can be added to the set.

Inserting an edge and its label label = label_p ⊙ label_s with binary representation α ⊙ β into a compressed container cc is achieved with the function Insert_cc provided in Algorithm 1. First, line 1 records the presence of label in bf in O(f) time. Next, line 3 records the exact presence of prefix label_p by setting the value of pref[α] to 1. While those two insertions are simple, inserting label_s into its cluster is a more complicated task, especially when the cluster already exists, as it must then be expanded. First, line 4 computes in O(2^λ) time the Hamming weight i of label_p, i.e., the index id_cluster of the (potentially new) cluster in which suffix label_s will be inserted. From the index id_cluster, line 5 computes the cluster start position pos_cluster using Select. If pref[α] = 1 prior to insertion, a cluster of n ≥ 1 suffixes already exists for prefix label_p. Thus, lines 8 to 10 compute the insertion position pos of label_s in cluster id_cluster such that, after the insertion of label_s into suf, the suffixes of the cluster remain lexicographically sorted. In lines 11 to 18, suffix label_s is inserted into suf[pos] such that pos_cluster ≤ pos ≤ pos_cluster + n. If pos = pos_cluster, label_s starts its cluster: a bit 1 is inserted into clust[pos] and clust[pos + 1] is set to 0. Otherwise, a 0 is inserted into clust[pos]. If pref[α] ≠ 1 prior to insertion, cluster id_cluster is a new cluster: a bit 1 is inserted into clust[pos_cluster] to create a new cluster, and label_s is inserted into suf[pos_cluster] in line 20. Lines 5 to 20 take O(q) time. Hence, function Insert_cc takes O(f + 2^λ + q) time. The function returns the end vertex of the edge labeled with label. Note that if label is already present in the container, lines 12 and 17 ensure that it is not inserted a second time, in addition to line 19 which returns the end vertex of its edge. The returned vertex is used by Algorithm 2 to recursively insert k-mer suffixes in subtries of the BFT.

Algorithm 1 Insert_cc(label_p ⊙ label_s, cc)
1: Insert_bf(label_p ⊙ label_s, cc.bf)
2: prev ← cc.pref[α]
3: cc.pref[α] ← 1
4: i ← HammingWeight(α, cc.pref)
5: pos ← Select(i, cc.clust)
6: if prev = 1 then
7:   prev ← pos
8:   if pos ≤ |suf| and cc.suf[pos] < label_s then pos ← pos + 1
9:   while pos ≤ |suf| and cc.clust[pos] = 0 and cc.suf[pos] < label_s do
10:    pos ← pos + 1
11:  if pos > |suf| then return Insert_cc_arrays(0, label_s, pos, cc)
12:  else if cc.suf[pos] ≠ label_s then
13:    if prev = pos then
14:      cc.clust[pos] ← 0
15:      return Insert_cc_arrays(1, label_s, pos, cc)
16:    else return Insert_cc_arrays(0, label_s, pos, cc)
17:  else if cc.clust[pos] = 1 and prev ≠ pos then
18:    return Insert_cc_arrays(0, label_s, pos, cc)
19:  else return cc.edge[pos]
20: else return Insert_cc_arrays(1, label_s, pos, cc)

As an example, the internal representation of the compressed container shown in Figure 3.1b after insertion of the edge label “gtat” is given in Figure 3.5. The presence of prefix “gt” is recorded in pref[12]. Then, its cluster index and start position are computed as 4 and 5, respectively. Consequently, after reallocation of arrays suf and clust, suffix “at” is inserted in suf[5] and clust[5] is set to 1 to indicate that suf[5] starts a new cluster.

Figure 3.5: Insertion of edge label “gtat” in a compressed container with four edge labels “aggc”, “ctca”, “gccc” and “gcgc”. Inserted or changed parts are highlighted. Array edge is not represented.
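To complement Algorithm 1, which covers the compressed case, the following C sketch mirrors the behavior described for Insert_uc at the beginning of this section: a binary search followed by a shift of the array tail. The container layout is our own simplification (fixed-length suffixes, colors and error handling omitted):

#include <string.h>
#include <stdlib.h>

/* Simplified uncompressed container: a lexicographically sorted array
 * of q k-mer suffixes of len symbols each. */
typedef struct {
    char  *data;
    size_t q;
    size_t len;
} UC;

static void insert_uc(UC *uc, const char *s) {
    size_t lo = 0, hi = uc->q;
    while (lo < hi) {                       /* binary search: O(log phi) */
        size_t mid = (lo + hi) / 2;
        int cmp = memcmp(uc->data + mid * uc->len, s, uc->len);
        if (cmp == 0) return;               /* suffix already present */
        if (cmp < 0) lo = mid + 1; else hi = mid;
    }
    uc->data = realloc(uc->data, (uc->q + 1) * uc->len);
    memmove(uc->data + (lo + 1) * uc->len,  /* shift the tail: O(phi) */
            uc->data + lo * uc->len,
            (uc->q - lo) * uc->len);
    memcpy(uc->data + lo * uc->len, s, uc->len);
    uc->q++;
}

The two phases correspond to the O(log ϕ + ϕ) bound stated above.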
3.3.2 Tree Insertion

Function TreeInsert, described in Algorithm 2, inserts a k-mer into a BFT bft. Such an insertion is performed recursively by traversing bft from the root in a top-down manner. In each traversed container vertex v, either an edge and its label are inserted into a compressed container, or a k-mer suffix is inserted into an uncompressed container, using the functions of Section 3.3.1. For this purpose, TreeInsert iterates over the containers of v from the head to the tail of the container list, and the insertion of the edge label label into a compressed container is only triggered if its BF reports label as present in the container. If it is indeed present in the container, the insertion continues recursively on the end vertex of the edge labeled with label in the container. At this point, label might not be present in the container because it is a false positive of the BF. False positives of compressed container BFs are “recycled”, which is a nice property of the BFT: the BF remains unchanged, i.e., it is not necessary to perform line 1 of Algorithm 1, and only pref, suf, clust and edge are updated as described in lines 2 to 20 of Algorithm 1. This limits the number of burstings and compressed containers in a BFT. Algorithm 2 is initially called as TreeInsert(x, l, root, bft) to insert a k-mer x into a BFT bft, with root being the root vertex of bft. In the worst case, all container vertices on a traversed path represent all possible edge labels and the BFs have a false positive ratio of 0. In such a case, each traversed container vertex has ⌈|A|^l / ϕ⌉ containers. The longest path of a BFT has k/l container vertices. Therefore, the worst case time of TreeInsert is O(k/l · (⌈|A|^l / ϕ⌉ · f + 2^λ + q)). The function returns the color set associated with the inserted k-mer such that colors can be inserted into it.

Algorithm 2 TreeInsert(s, l, v, bft)
1: for each container cont in v do
2:   if cont is compressed then
3:     if MayContain(s(1, l), cont.bf) then
4:       v_child ← Insert_cc(s(1, l), cont)
5:       if v_child is a color set vertex then return v_child.color
6:       else return TreeInsert(s(l + 1, |s| − l), l, v_child, bft)
7:   else return Insert_uc(s, cont)

3.3.3 Container Look-up

Similarly to the previous sections, we describe here the functions searching for a k-mer suffix in an uncompressed container and for an edge in a compressed container. These functions are used in the next section to describe how a k-mer look-up is performed in a BFT.

Searching for a k-mer suffix s in an uncompressed container uc is performed by the function Contain_uc, which is similar to the function Insert_uc. A binary search of suffix s is performed in O(log ϕ) time, where ϕ is the capacity of uc. It returns the set of colors associated with the suffix s in uc if it is found, or an empty set otherwise. The function Contain_cc, testing whether an edge label label = label_p ⊙ label_s with binary representation α ⊙ β is stored in a compressed container cc, is given in Algorithm 3. Line 1 verifies the presence of label in bf and of its prefix label_p in the array pref in O(f) time. If label_p is present, line 2 computes in O(2^λ) time the Hamming weight i of label_p, i.e., the index of the cluster in which suffix label_s is possibly situated. Line 3 locates the rank of i, i.e., the start position of the cluster, and lines 4 to 7 compare the suffixes of the cluster to label_s. Lines 3 to 7 are computed in O(q) time. Algorithm 3 has therefore a worst case time of O(f + 2^λ + q). If label is found, the algorithm returns the end vertex of the edge labeled with label. Otherwise, a vertex containing an empty color set is returned.

Algorithm 3 Contain_cc(label_p ⊙ label_s, cc)
1: if MayContain(label_p ⊙ label_s, cc.bf) and cc.pref[α] = 1 then
2:   i ← HammingWeight(α, cc.pref)
3:   start ← Select(i, cc.clust)
4:   pos ← start
5:   while pos ≤ |suf| and (pos = start or cc.clust[pos] = 0) do
6:     if cc.suf[pos] = label_s then return cc.edge[pos]
7:     else if cc.suf[pos] < label_s then pos ← pos + 1
8:     else pos ← |suf| + 1
9: return a vertex containing ∅
3.3.4 Tree Look-up

The function TreeContain, testing whether a k-mer x is present in a BFT bft, is given in Algorithm 4. Each container vertex v is the root of a subtrie representing k-mer suffixes. The BFT look-up function traverses bft top-down from the root and, for a container vertex v, queries its containers from the head to the tail of the container list for the k-mer suffix x_suf = x(l · depth(v, bft) + 1, |x| − l · depth(v, bft)). If the queried container is compressed, its BF is queried for x_suf(1, l) using the function MayContain in O(f) time. In case of a positive answer, the function Contain_cc is used for an exact membership test of x_suf(1, l). The usage of MayContain before Contain_cc reduces the querying time of this container from O(f + 2^λ + q) to O(f) if the BF reports x_suf(1, l) as not present. If x_suf(1, l) is found, the traversal continues recursively on the corresponding child. Otherwise, the absence of x_suf(1, l) indicates the absence of x in bft, since x_suf(1, l) cannot be in another container of v because of the tree insertion process explained in Section 3.3.2. If the container is uncompressed, the presence of x_suf is detected using the function Contain_uc. As an uncompressed container has no children, a match indicates the presence of the k-mer. Algorithm 4 is initially called as TreeContain(x, l, root, bft) and its worst case time is O(k/l · (⌈|A|^l / ϕ⌉ · f + 2^λ + q)). If the queried k-mer is found, its color set is returned. Otherwise, an empty color set is returned.

Algorithm 4 TreeContain(s, l, v, bft)
1: for each container cont in v do
2:   if cont is compressed then
3:     if MayContain(s(1, l), cont.bf) then
4:       v_child ← Contain_cc(s(1, l), cont)
5:       if v_child is a color set vertex then return v_child.color
6:       else return TreeContain(s(l + 1, |s| − l), l, v_child, bft)
7:   else return Contain_uc(s, cont)
8: return ∅

3.4 Successors and Predecessors Traversing

The BFT is an implementation of the cdBG used to index a pan-genome. In order to analyze a pan-genome indexed with a BFT, it is necessary to traverse the cdBG it represents. For example, the traversal can be used to extract all paths of the cdBG for which the k-mers have all colors in their color sets. Such paths belong to the core genome. Let bft be a BFT representing a cdBG G. For a k-mer x, visiting all its predecessors or successors in G, denoted by pred(x, G) and succ(x, G), respectively, implies the look-up of |A| different k-mers in bft. Such a look-up would visit in the worst case |A| · height_max(bft) container vertices in bft. This section describes how to reduce the number of vertices and containers visited in bft during the traversal of a vertex in G.

Observation 1. Let G be a cdBG represented by a BFT bft and x a k-mer corresponding to a vertex of G. All k-mers of succ(x, G) share x(2, k − 1) as a common prefix and therefore share a common subpath in bft starting at the root. However, all k-mers of pred(x, G) have different first symbols and, therefore, except for the root of bft, do not share a common subpath. Hence, the maximum number of visited container vertices in bft is height_max(bft) for all k-mers of succ(x, G), and 1 + |A| · (height_max(bft) − 1) for all k-mers of pred(x, G).

Lemma 1. Let G be a cdBG represented by a BFT bft, x a k-mer in bft and v a container vertex of bft that terminates the shared subpath of the k-mers in succ(x, G). If depth(v, bft) = height_max(bft) − 1, the suffixes of succ(x, G) may be stored in any container of v. If not, they are stored in the uncompressed container of v.

Proof. In a BFT bft representing a cdBG G, a container vertex v is the root of a subtrie storing k-mer suffixes of length l · (height_max(bft) − depth(v, bft)) with l = k / height_max(bft). Let s be the suffix of a k-mer from succ(x, G) that is rooted at a vertex v ∈ V_bft. If depth(v, bft) ≠ height_max(bft) − 1 but s is rooted at a compressed container in v, then this compressed container stores s(1, l), and s(l + 1, |s| − l) is rooted in one of its children. As the divergent symbol between the k-mer suffixes of succ(x, G) is in position |s| − 1, this symbol is in s(l + 1, |s| − l), rooted at one child of this compressed container. Therefore, v does not terminate the common subpath shared by the succ(x, G) k-mers.
Observation 1 and Lemma 1 prove that the only two cases where a look-up of pred(x, G) or succ(x, G) must search in different containers of a vertex are:

• searching at the root of bft for k-mers of pred(x, G),

• searching at a vertex v with depth(v, bft) = height_max(bft) − 1 for suffixes of succ(x, G).

Restricting the hash functions used in the compressed containers to take only positions 2 through l − 1 of edge labels into account makes it possible to limit the search space further.

Lemma 2. Let bft be a BFT in which the f hash functions h_i of bf have the form h_i(label): label(2, l − 2) → {1, .., m} for i = 1, ..., f. Then, for a container vertex v of bft and an edge label label, all possible edge labels label′ = a1 ⊙ label(2, l − 2) ⊙ a2 are contained in the same container of v.

Proof. Assume a k-mer suffix s inserted into a container vertex v of bft. A look-up for s analyzes the containers of v from the head to the tail of the container list. In the worst case, s can be rooted, according to the BFs, at all compressed containers, as a true positive or as a false positive. However, a look-up stops either on the first compressed container claiming to contain the edge label label = s(1, l), or on the uncompressed container. Therefore, as the hash functions of the BFs consider only label(2, l − 2), a look-up will stop on the same container for any edge label label′ = a1 ⊙ label(2, l − 2) ⊙ a2.

Figures 3.6 and 3.7 illustrate how the containers of container vertices are traversed for the insertion of the predecessors and successors of k-mer “aggctatgctca”, before and after restricting the BF hash functions.

Figure 3.6: Traversed paths for predecessors pred_a of k-mer x = “aggctatgctca” such that pred_a = a ⊙ x(1, |x| − 1) for all a ∈ A. (a) Before restricting the BF hash functions, multiple containers are traversed at the root vertex v1. (b) After restricting the BF hash functions, only one container is traversed for each container vertex. Content of vertices is only shown for the root, other vertices only have one traversed container and are represented with an empty rounded box. Color set vertices are not represented.

Figure 3.7: Traversed paths for successors succ_a of k-mer x = “aggctatgctca” such that succ_a = x(2, |x| − 1) ⊙ a for all a ∈ A. (a) Before restricting the BF hash functions, multiple containers are traversed at the leaf v8. (b) After restricting the BF hash functions, only one container is traversed for each container vertex. Content of vertices is only shown for the leaves, other vertices have only one traversed container and are represented with an empty rounded box. Color set vertices are not represented.

As a consequence of Lemma 2, every edge label label stored or to be stored in arrays pref, suf and clust of BFTs is modified such that label′ = label(2, l − 1) ⊙ label(1, 1), which guarantees that all labels label′′ = label′(1, l − 2) ⊙ a2 ⊙ a1 are in the same container. Furthermore, suffixes stored in array suf are required to have a minimum length of two symbols to ensure that symbols a1 and a2, the variable parts between the different label′′, are stored in array suf. Hence, as all label′′ share label′(1, l − 2) as a prefix, they share the same cluster in arrays suf and clust. Suffix prefixes label′′ = label′(1, l − 1) ⊙ a1 also have consecutive suffixes in their cluster. As an example, Figure 3.8 shows the compressed container of five edge labels “aggc”, “ctca”, “gccc”, “gcgc” and “gtat” adapted for predecessor and successor traversal. Figure 3.9 illustrates the same compressed container in which the edge label “tggc” is inserted. Like “aggc”, which is already present in the container, the edge label “tggc” belongs to a predecessor of a k-mer x with x(1, 3) = “ggc”.

Figure 3.8: Internal representation of a compressed container with five edge labels “aggc”, “ctca”, “gccc”, “gcgc”, “gtat”, adapted as “ggca”, “tcac”, “cccg”, “cgcg”, “tatg”, respectively, for predecessor and successor traversal. Array edge is not represented.

Figure 3.9: Insertion of edge label “tggc”, adapted as “ggct”, in a compressed container with edge labels “aggc”, “ctca”, “gccc”, “gcgc” and “gtat”. Array edge is not represented. Inserted parts are highlighted.
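The label adaptation of Lemma 2 is a simple rotation, as the following C sketch shows (our own illustration on plain strings; the BFT stores labels in their 2-bit encoding):

#include <stdio.h>
#include <string.h>

/* label' = label(2, l-1) (+) label(1, 1): move the first symbol to the
 * end, so that all labels sharing the core label(2, l-2) fall into the
 * same cluster. */
static void adapt_label(const char *label, size_t l, char *out) {
    memcpy(out, label + 1, l - 1); /* symbols 2..l first */
    out[l - 1] = label[0];         /* first symbol goes last */
    out[l] = '\0';
}

int main(void) {
    const char *labels[] = { "aggc", "ctca", "gccc", "gcgc", "gtat" };
    char out[8];
    for (int i = 0; i < 5; i++) {
        adapt_label(labels[i], 4, out);
        printf("%s -> %s\n", labels[i], out);
    }
    return 0; /* prints ggca, tcac, cccg, cgcg, tatg, as in Figure 3.8 */
}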
3.5 Evaluation

In this section, we compare the BFT implementation written in C, version 0.8.5, to the SBT (Section 2.6) implementation written in C++, version 0.3.5. Indeed, the SBT presents the same characteristics as the BFT: it is alignment-free, reference-free, incremental and accepts assemblies as well as reads as input by decomposing each input experiment into k-mers. The BFT implementation is available at https://github.com/GuillaumeHolley/BloomFilterTrie.

All evaluations were carried out using a single thread on a server with 378 GB of RAM and two 8-core Intel Xeon E5-2630 v3 2.4 GHz processors. All input files were placed on a mechanical hard drive also used to write the output files. Because this hard drive was shared among users of the employed server, all times reported in the following evaluations are user times which, in contrast to wall-clock times, do not include the time spent waiting for disk input and output operations. Note that this is to the advantage of the SBT implementation, which is mainly disk-based, while the BFT implementation is mainly RAM-based (the disk is used to perform color set compression as described in Section 3.2.3).

The BFT implementation takes as input a list of k-mers for each experiment to index, such that users can employ the tool of their choice to extract k-mers from sequencing experiments. Then, the extracted k-mers are inserted into a BFT. The capacity ϕ of uncompressed containers influences the compression ratio as well as the time for insertion and look-up. We chose a value of ϕ = 248, as it showed a good compromise in practice. Also, the edge label length l determines the size of several internal structures of the BFT and how efficiently they can be stored. We selected l = 9, as this limits the internal fragmentation of the memory. The SBT implementation takes as input FASTQ or FASTA files and extracts their k-mers with the tool Jellyfish (Marçais and Kingsford, 2011), which is directly integrated in the implementation.
For each sequencing experiment to index, its constituent k-mers are extracted and inserted into a BF that forms a leaf of the SBT. Once all leaves are constructed, they are used by the implementation to build the internal vertices. Then, all vertices are compressed using RRR (Raman et al., 2007). Because the SBT implementation integrates k-mer counting and leaf construction into a single operation without the possibility to distinguish each step, we excluded the time to construct the leaves from the SBT construction time. This keeps the comparison as fair as possible by not charging the SBT for the time spent by Jellyfish to count the k-mers. Hence, the SBT construction time includes the time to construct the internal vertices from the leaves and the time to compress all vertices of the SBT.

Configuring the SBT implementation is not straightforward. While the data structure is incremental, all its BFs must have the same size, designed with respect to a number of k-mers to insert and a number of hash functions in the BFs. However, the number of k-mers per experiment of a dataset can fluctuate. Furthermore, in order to avoid saturation of 1s in the BFs composing the lowest levels of the data structure, Solomon and Kingsford (2016) advise computing the total count of k-mers for a subset of the experiments to insert and designing the BFs' bit size with respect to this number. However, it is unclear how this simplistic approach scales with new insertions in the SBT. For all computed SBTs in the following, we used one hash function for the BFs, as in (Solomon and Kingsford, 2016), and computed the total count of k-mers for a subset of the experiments representing about 3.8 % of each dataset, as in the evaluation performed in (Solomon and Kingsford, 2016).

The BFT and SBT were used to index and query two pan-genome datasets, a bacterial dataset and a human tissue dataset, for which the evaluations are shown in Sections 3.5.1 and 3.5.2, respectively. For each querying evaluation, we used two different thresholds, θ = 0.7 and θ = 0.9, indicating the ratio of k-mers from a query that must be present in an indexed sequencing experiment for the query to be reported as present in this experiment.

3.5.1 P. aeruginosa Dataset

The dataset used for this evaluation is composed of 473 clinical isolates of P. aeruginosa sampled from 34 patients (NCBI BioProject PRJEB5438), resulting in 338.61 Gbp of sequences. For each data structure, we inserted all 36-mers with a minimum number of 3 occurrences in each sequencing experiment of the dataset. As the size of the BFs used by the SBT implementation must be specified prior to the k-mer insertion and should be the same for all vertices, we computed from 18 isolates a total count of 123,994,112 36-mers that the SBT implementation used as the BFs' bit size. From this size and the maximum number of 36-mers extracted from a file of the dataset, a BF in a leaf of the SBT has a maximum false positive ratio of 6.2 % in this evaluation. In total, 3,270,528,926 36-mers were inserted in both data structures. The construction evaluation is shown in Table 3.1.

Table 3.1: BFT and SBT construction evaluation for 473 P. aeruginosa isolates. Best results are highlighted.

                          BFT         SBT
  Insertion time          127.74 min  > 6.69 min
  Main memory usage peak  3.82 GB     0.51 GB
  Disk usage peak         3.82 GB     13.64 GB
  Disk size               0.68 GB     4.83 GB

The difference in construction times between the BFT and the SBT can be explained by the fact that the SBT construction time does not include the time spent to build the leaves.
Also, user time was measured instead of wall-clock time. Indeed, the BFT implementation is RAM-based while the SBT implementation is disk-based, suggesting that in terms of wall-clock time, the SBT construction is several times slower than its reported user time. A more accurate construction comparison using a dedicated hard drive, measuring wall-clock time and distinguishing k-mer counting from leaf construction would provide more insight into the real time performance of each data structure during construction. While the SBT used about 3.57 times more disk space than the BFT during construction, the BFT used about 7.46 times more main memory than the SBT. The disk size indicates that the BFT is more suitable for long-term disk storage, as it uses 7.1 times less disk space than the SBT.

We queried both constructed data structures with subsets of 1,000 and 10,000 sequences of length 100 bp from the sequencing experiment ERR431077. As the average size of an SBT vertex is 5.23 MB, we configured the SBT implementation to load the SBT in main memory in batches of 150 vertices such that the main memory consumption of both data structures is equivalent. The results of the querying evaluation are shown in Table 3.2.

Table 3.2: SBT and BFT querying evaluation for sequences of length 100 bp from the sequencing experiment ERR431077. Best results are highlighted.

                           Time               Main memory peak
                           BFT     SBT        BFT        SBT
  1,000 queries   θ = 0.7  1.66 s  47.66 s    786.70 MB  815.97 MB
                  θ = 0.9  1.71 s  30.76 s    786.59 MB  833.77 MB
  10,000 queries  θ = 0.7  4.58 s  390.17 s   786.75 MB  898.28 MB
                  θ = 0.9  4.01 s  328.16 s   786.62 MB  895.52 MB

The results show that for a similar main memory usage, the BFT is 18 to 85 times faster to query than the SBT. The main memory usage of the BFT during querying is lower than during construction because its color sets are already compressed, while during construction, the raw color sets are stored in main memory before being compressed. An interesting result of this evaluation is that for a fixed number of queries but different thresholds θ (the ratio of k-mers that must be present for a query to be reported as present), the querying time of the BFT is stable while the querying time of the SBT varies greatly. This can be explained by the difference between the BFT and SBT indexing methods: the BFT indexes colors with respect to the k-mers while the SBT indexes the k-mers with respect to the colors. Hence, each k-mer is queried multiple times in the SBT but only once in the BFT.

Finally, a last evaluation for this dataset demonstrates the operations provided by the BFT that are absent from the SBT, because the latter is based on an approximation of the indexed k-mers. For this evaluation, we first extracted all k-mers stored in the BFT and then used these k-mers as queries to compute the number of branching k-mers in the cdBG represented by the BFT. Table 3.3 reports the time and main memory usage of both operations.

Table 3.3: Evaluation of k-mer extraction and branching queries for a BFT constructed from 473 P. aeruginosa isolates.

                     Time      Main memory peak
  Extracting k-mers  7.40 s    786.39 MB
  Branching queries  146.84 s  786.71 MB

In this evaluation, 55,355,097 36-mers were extracted and 1,518,746 of them are branching in the cdBG represented by the BFT.

3.5.2 Human Tissues Dataset

In this evaluation, we reused the dataset of the SBT evaluation in (Solomon and Kingsford, 2016). It is composed of 2,652 human tissue (blood, brain and breast) RNA-Seq experiments from the SRA database.
This dataset represents about 5 TB of SRA files, or about 2.7 TB of FASTA files compressed with gzip (Ziv and Lempel, 1977). Because the k-mer length must be a multiple of l = 9 in the BFT, we extracted k-mers of length k = 18 in order to be as close as possible to the original SBT evaluation, in which k = 20 was used. Also, we reused the k-mer abundance thresholds computed in (Solomon and Kingsford, 2016) to discard k-mers that do not occur often enough and are then considered as sequencing errors. Hence, for a FASTA file file from the dataset and its corresponding size Size(file), the k-mer abundance threshold is 1 for Size(file) ≤ 300 MB, 3 for 300 MB < Size(file) ≤ 500 MB, 10 for 500 MB < Size(file) ≤ 1 GB, 20 for 1 GB < Size(file) ≤ 3 GB and 50 for Size(file) > 3 GB. Using these abundance thresholds, 15,755,778,039 18-mers were extracted.
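These size-dependent thresholds are easy to restate as a small helper; the following C sketch is our own restatement of the rule above (it assumes decimal units, i.e., 1 MB = 10^6 bytes, which the original evaluation does not specify):

#include <stdint.h>

/* k-mer abundance threshold as a function of the FASTA file size,
 * following the thresholds of Solomon and Kingsford (2016). */
static uint32_t abundance_threshold(uint64_t size_bytes) {
    const uint64_t MB = 1000000ULL, GB = 1000ULL * MB;
    if (size_bytes <= 300 * MB) return 1;
    if (size_bytes <= 500 * MB) return 3;
    if (size_bytes <= 1 * GB)   return 10;
    if (size_bytes <= 3 * GB)   return 20;
    return 50;
}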
We also reused the size of the BFs mentioned in (Solomon and Kingsford, 2016). Using this size and the maximum number of 18-mers extracted from a file of the dataset, a BF in a leaf of the SBT has a maximum false positive ratio of 3.8 % in this evaluation. The construction evaluation is shown in Table 3.4.

Table 3.4: BFT and SBT construction evaluation for 2,652 human tissue RNA-Seq experiments. Best results are highlighted.

                          BFT        SBT
  Insertion time          19.3 h     > 4.89 h
  Main memory usage peak  144.21 GB  18.5 GB
  Disk usage peak         144.21 GB  1,174.65 GB
  Disk size               13.11 GB   92.46 GB

Similarly to the P. aeruginosa dataset, the difference in construction times between the BFT and the SBT can be explained by the fact that the SBT construction time does not include the time spent to build the leaves. Also, user time was measured instead of wall-clock time. While the SBT used about 8.15 times more disk than the BFT, the BFT used about 7.8 times more RAM than the SBT. The disk size of both data structures confirms the result from the P. aeruginosa dataset evaluation: the BFT is more adapted for long-term disk storage than the SBT, as it uses 7 times less disk space.

Then, the SBT and BFT were queried with subsets of 1,000 and 10,000 transcript sequences from the GENCODE (Harrow et al., 2012) database, version 25 (GENCODE group, 2013). These transcript sequences are transcripts from the reference chromosomes of the human genome. Because the average size of an SBT vertex in this evaluation is 17.85 MB, we configured the SBT implementation to load the SBT in main memory in batches of 800 vertices to obtain a main memory consumption similar to the BFT. The results of this querying evaluation are shown in Table 3.5.

Table 3.5: SBT and BFT querying evaluation for the GENCODE sequences database. Best results are highlighted.

                           Time                  Main memory peak
                           BFT       SBT         BFT       SBT
  1,000 queries   θ = 0.7  39.95 s   655.07 s    14.11 GB  14.51 GB
                  θ = 0.9  39.94 s   379.89 s    14.04 GB  16.67 GB
  10,000 queries  θ = 0.7  109.98 s  5,792.04 s  14.11 GB  15.33 GB
                  θ = 0.9  102.42 s  2,944.09 s  14.11 GB  16.19 GB

The results show that for a similar main memory usage, the BFT is 9.51 to 52.66 times faster to query than the SBT, and they confirm the results obtained from the querying evaluation of the P. aeruginosa dataset in Section 3.5.1. Finally, we extracted all k-mers stored in the BFT and used these k-mers as queries to compute the number of branching k-mers in the cdBG represented by the BFT. Table 3.6 reports the time and main memory usage of both steps. In this evaluation, 836,937,335 18-mers were extracted and 138,625,655 of them are branching in the cdBG represented by the BFT.

Table 3.6: Evaluation of k-mer extraction and branching queries for a BFT constructed from 2,652 human tissue RNA-Seq experiments.

                     Time       Main memory peak
  Extracting k-mers  1.58 min   13.92 GB
  Branching queries  35.89 min  13.92 GB

3.6 Conclusion

We proposed a novel alignment-free, reference-free, incremental data structure called the Bloom Filter Trie (BFT) to index a pan-genome as a colored de Bruijn graph (cdBG). It accepts assemblies and reads as input to take advantage of all sequencing data representations. The BFT is based on a burst trie and stores k-mers together with their colors representing the sequencing experiments in which the k-mers occur. A new representation of vertices is proposed to compress and index shared substrings of k-mers. It uses different types of containers to quickly verify the presence of substrings. The containers use Bloom filters to navigate in the trie and accelerate the cdBG traversal. The BFT was compared to a data structure presenting the same features but based on an approximation of the indexed k-mers. Using two pan-genome datasets, we showed that the BFT has a smaller disk size, can be queried multiple times faster and offers additional operations such as the extraction of k-mers and the traversal of the cdBG. Future work concerns a more advanced compression scheme for the color sets.

CHAPTER IV

Pan-genome Storage

4.1 Introduction

As the number of sequenced genomes grows exponentially (Land et al., 2015), storing and accessing these data is a problem of major importance. For example, the SRA public database was endangered in 2011 because of budgetary constraints (Genome Biology Editorial Team, 2011). In order to reduce storage and transmission costs, raw sequencing data are often compressed using general purpose compression tools such as gzip, based on LZ-77 (Ziv and Lempel, 1977), or bzip, based on the BWT (Section 1.2.6). Although these classic tools are used to compress most of the public data, they are not optimized for HTS compression (Holland and Lynch, 2013; Deorowicz and Grabowski, 2013; Giancarlo et al., 2014; Hosseini et al., 2016; Numanagić et al., 2016). The FASTQ format represents raw sequencing data, and each record of this format has three major components: (i) a unique identifier, (ii) the read sequence and (iii) the quality scores. A large variety of HTS-specific compression tools have been proposed (Jones et al., 2012; Hach et al., 2012; Bonfield and Mahoney, 2013; Roguski and Deorowicz, 2014; Saha and Rajasekaran, 2014; Rozov et al., 2014; Grabowski et al., 2015; Patro and Kingsford, 2015; Kingsford and Patro, 2015; Benoit et al., 2015) to compress either FASTQ files or only the read sequences. While these tools are very efficient, they are not adapted to the context of large-scale sequencing projects that produce tens to several thousands of genomes per species. A pan-genome is characterized by a high degree of similarity and redundancy between the genomes (Section 1.1.5), but all HTS-specific compression tools can only consider redundancy and similarity within a single genome and not in a collection of genomes. Hence, a tool for pan-genome read compression must compress the reads and the identity of the genomes they belong to while taking advantage of the similarity and redundancy within a pan-genome. Also, large-scale sequencing projects such as the 1000 Genomes Project (1000 Genomes Project Consortium, 2015) may take years to complete, so pan-genomes grow continually.
Therefore, a tool for pan-genome read compression must also be incremental, such that an archive containing the compressed reads of a pan-genome can be updated with the compressed reads of a new genome. In the following, Section 4.1.1 presents the existing approaches for compressing reads from a single genome and Section 4.1.2 provides an overview of our contributions to pan-genome read compression. Section 4.2 describes an abstract data structure for read indexing inspired by the colored de Bruijn graph (Section 1.2.2.1): the guided de Bruijn graph (gdBG). The methods using the gdBG for compression and the preprocessing applied to the reads to optimize the compression ratio are detailed in Section 4.3. The decompression and update operations are explained in Section 4.4. Finally, Section 4.5 shows the evaluation of our tool and of single-genome read compression tools on a pan-genome dataset. Note that in the figures of this chapter, grey colored labels and arrows are for illustration only and are not stored in practice.

4.1.1 Existing Approaches

HTS-specific compression tools are divided into two categories: reference-based and de novo. Reference-based methods generally provide a high compression ratio by encoding similarities or differences between the read sequences and a reference, usually by mapping the read sequences to the reference. In the following, we use the terms “read sequences” and “reads” interchangeably. Reference-based compression tools require a reference available for the species from which the reads are generated, or a similar genome. Note that only a small fraction of the sequenced species that are accessible in public databases have such a reference available. Furthermore, the reference used for compression must be provided with the compressed archive for decompression, adding extra storage and transmission costs. On the other hand, de novo compression tools perform similarity search within a set of reads in order to exploit its redundancy.

BARCODE (Rozov et al., 2014) is a reference-based method that makes use of BFs (Section 1.2.5) to compress reads. It inserts reads perfectly matching a reference into a BF that generates false positives. To reduce the number of false positives, BARCODE subsequently inserts them into cascading BFs (Salikhov et al., 2014) to tell apart false positives from true positives. The reference is then used during decompression to query the BFs. Kpath (Kingsford and Patro, 2015) constructs a dBG (Section 1.2.2.1) from the reference and encodes each read as a path within the graph. The paths within the graph are then encoded via arithmetic coding (Witten et al., 1987). The beginnings of such paths are stored separately in a trie and encoded with LZ-77. QUIP (Jones et al., 2012) uses a lossless compression algorithm based on adaptive arithmetic encoding of the identifier, read and quality score streams of the FASTQ format. A reference and a sequence alignment of the reads can be used to improve compression of the reads. QUIP can also perform assembly-based compression, in which it builds reference sequences by assembling a small portion (user defined) of the reads using a dBG and then mapping the reads back to the reference sequences. Similar methods are used in FASTQZ and FQZCOMP (Bonfield and Mahoney, 2013). SCALCE (Hach et al., 2012) uses core substrings as a measure of similarity in order to cluster similar reads together. Those core substrings are generated via Locally Consistent Parsing (LCP) (Sahinalp and Vishkin, 1996).
SCALCE compresses the reads in each cluster with gzip. ORCOM (Grabowski et al., 2015) also re-orders reads by similarity: it creates clusters of reads that share the same minimizer (Roberts et al., 2004), i.e., the lexicographically smallest p-mer of each read, with p usually between 8 and 15. Reads of the same cluster are then merged and compressed. Similar to ORCOM, Mince (Patro and Kingsford, 2015) uses the minimizer approach for clustering. However, it builds the clusters in two steps. A cluster is represented by a k-mer and a set of q-mers from the reads it contains, with q < k. For each read to process, a set of candidate clusters is first established from the k-mers it is composed of. The read is then assigned to the candidate cluster that maximizes the number of q-mers they share. If the read has no candidate cluster, it is assigned to a new cluster corresponding to its minimizer of length k. FQC (Saha and Rajasekaran, 2014) clusters reads using a k-mer hashing method. Clusters are used to build lists of potentially neighboring reads that share an overlap. From each neighboring list, a consensus sequence is built and used as a reference for compressing the reads of the list. DSRC 2 (Roguski and Deorowicz, 2014) compresses the different streams of FASTQ files with different methods: arithmetic coding, Huffman coding (Huffman, 1952), as well as 2 bits per base in the case of the DNA sequence stream. Finally, LEON (Benoit et al., 2015) encodes the reads as paths of a dBG represented with a BF. The dBG is built from solid k-mers of the reads, i.e., k-mers occurring multiple times in the reads. A read is anchored in the graph if it contains at least one solid k-mer, and it is encoded as a list of graph bifurcations from this anchor.

4.1.2 Contributions

In this chapter, we present a new Dynamic Alignment-free and Reference-free Read Compression method (DARRC). The main contribution of this work is the guided de Bruijn graph (gdBG), inspired by the cdBG (Section 1.2.2.1), which allows a unique traversal to reconstruct the reads it was built from. The gdBG is indexed using the BFT (Chapter III), which enables the update of the gdBG with reads of other similar genomes. Additional methods are presented to optimize the encoding of the reads. On a large P. aeruginosa dataset, DARRC outperforms all other tested tools. It provides a 30 % compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific and general purpose compression methods in our experiments. The work presented in this chapter was published in (Holley et al., 2017) and is joint work with Roland Wittler, Jens Stoye and Faraz Hach.

4.2 The guided de Bruijn Graph

The read assembly problem shows that different traversals of dBGs are possible. In the worst case, the number of possible paths between two vertices in a graph is infinite if the graph is cyclic, and exponential otherwise. Given a dBG built from a sequence and a starting vertex for the traversal, the dBG must be augmented with information that guides its traversal in order to reconstruct the sequence it was built from.

Definition 4. Given a dBG G built from a sequence S, a partition part(G, S) is a subgraph G′ of G such that G′ is a path graph that reconstructs a subsequence of S.

A path graph (V, E) is a non-branching connected graph with |E| = |V| − 1. A guided de Bruijn graph (gdBG) is a cdBG G = (V, E, P) built from a sequence S.
The set of colors, now denoted as P, represents partitions guiding the traversal of G to reconstruct S. Self-overlapping k-mers, for which the prefix of length k − 1 is equal to the suffix of length k − 1, require a special treatment to avoid looping on themselves within the same partition. Algorithm 5 creates a gdBG G from a sequence S using vertices of length k. It returns all information necessary to reconstruct S: the gdBG encoding S and the prefix of length k − 1 of the first k-mer of S, which starts the graph traversal for decoding. Note that self-overlapping k-mers terminate their partition such that the next inserted k-mers start a new partition (line 9). The algorithm requires O(|S|) time and O(|G|) space, where |G| = |V| + |P| if the gdBG uses an implicit representation of edges.

Algorithm 5 Encode(S, k)
1: p ← 1  ▷ partition index
2: G ← the empty graph
3: for i ← 1, . . . , |S| − k + 1 do
4:   x ← S(i, k)
5:   Y ← {y | y successor of x in G with p ∈ G[y]}
6:   if Y ≠ ∅ then p ← p + 1
7:   if x ∈ G then G[x].add(p)  ▷ add p to vertex x in G
8:   else G.add(x, p)  ▷ insert vertex x with p in G
9:   if x(2, k − 1) = x(1, k − 1) then p ← p + 1
10: return (G, S(1, k − 1))

Algorithm 6 decodes a sequence S from a gdBG G using vertices of length k, starting with the k-mer prefix x. Algorithm 5 guarantees that any k-mer with one of its partitions can only have zero or one successor in the graph with the same partition. Therefore, Algorithm 6 traverses the graph by searching, for each traversed vertex, the successor with the same partition. If it is not found, the partition index is incremented and the traversal continues. As for Algorithm 5, the algorithm requires O(|S|) time and O(|G|) space.

Algorithm 6 Decode(G, x, k)
1: p ← 1
2: z ← k-mer y in G with y(1, k − 1) = x and p ∈ G[y]
3: x ← z
4: S ← z
5: Z ← {z}
6: if z(2, k − 1) = z(1, k − 1) then p ← p + 1
7: while Z ≠ ∅ and p ∈ P do
8:   Z ← {z | z successor of x in G with p ∈ G[z]}
9:   if Z contains exactly one k-mer z then
10:    S ← S ⊙ z(k, 1)
11:    x ← z
12:    if x(2, k − 1) = x(1, k − 1) then p ← p + 1
13:  else
14:    p ← p + 1
15:    Z ← {x}
16: return S

Figure 4.1 represents a simple cyclic dBG built from a sequence containing a repetition. An infinite number of sequences could be extracted from the dBG because of the cycle. However, by augmenting the dBG with partitions, Algorithm 6 will traverse the cycle only once during the reconstruction of the sequence. Indeed, when Algorithm 5 tries to insert k-mer “agt” with partition 1, a successor with the same partition is found. Therefore, k-mer “agt” is inserted with partition 2, such that the cycle is not contained in one partition.

Figure 4.1: The gdBG of sequence S = “cgtaagtaat” as constructed by Algorithm 5 with k = 3. Vertices and their partition sets: cgt {1}, aag {1}, gta {1,2}, taa {1,2}, agt {2}, aat {2}.
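To see Algorithm 6 in action, the following self-contained C sketch hard-codes the gdBG of Figure 4.1 (partition sets as bitmasks) and replays the decoding; it is a toy illustration under our own representation choices and omits the self-overlapping k-mer case, which does not occur here:

#include <stdio.h>
#include <string.h>

/* Vertices of the gdBG of Figure 4.1 (k = 3); bit p-1 of parts is set
 * if and only if partition p is in the partition set of the vertex. */
typedef struct { const char *kmer; unsigned parts; } Vertex;

static const Vertex G[] = {
    {"cgt", 0x1}, {"gta", 0x3}, {"taa", 0x3},
    {"aag", 0x1}, {"agt", 0x2}, {"aat", 0x2},
};
static const int NV = 6;
static const unsigned MAX_P = 2; /* partitions used by the encoding */

/* Successor of x carrying partition p, or NULL; Algorithm 5
 * guarantees there is at most one such successor. */
static const char *successor(const char *x, unsigned p) {
    for (int i = 0; i < NV; i++)
        if (strncmp(G[i].kmer, x + 1, 2) == 0 && (G[i].parts >> (p - 1)) & 1)
            return G[i].kmer;
    return NULL;
}

int main(void) {
    char S[64] = "cgt";     /* the vertex with prefix "cg" and partition 1 */
    const char *x = "cgt";
    unsigned p = 1;
    while (p <= MAX_P) {
        const char *z = successor(x, p);
        if (z == NULL) { p++; continue; } /* dead end: next partition */
        strncat(S, z + 2, 1);             /* append the last symbol of z */
        x = z;
    }
    printf("%s\n", S);                    /* prints cgtaagtaat */
    return 0;
}

The run follows partition 1 through cgt, gta, taa and aag, then switches to partition 2 to traverse the cycle once more via agt, gta, taa and ends on aat, reconstructing S = “cgtaagtaat”.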
An important property of gdBGs using implicit edges is that no false implicit edge can be traversed during the decoding.

Proposition 1. Let G be a gdBG built from a sequence S using an implicit representation of edges. An edge between vertices v and v′, corresponding to k-mers x and x′ respectively, such that x(2, k − 1) = x′(1, k − 1) but x ⊙ x′(k, 1) is not a substring of S, is called a false implicit edge. Algorithm 6 does not consider any false implicit edge when traversing G to reconstruct S.

Proof. If a false implicit edge connects vertices not sharing a partition, Algorithm 6 will not consider this edge, as only successors with the same partition are traversed. If a false implicit edge connects vertices v and v′ which share a partition, the edge out-degree of v is at least 2 and the edge in-degree of v′ is at least 2: one true implicit edge each and at least one false implicit edge each. As these vertices are branching, Definition 4 guarantees that v and v′ are not in the same partition. Algorithm 5 does not distinguish true implicit edges from false implicit edges, ensuring that Definition 4 is always respected during the encoding.

Furthermore, partitions allow to apply the following generalized definition of edges in dBGs to gdBGs:

Definition 5. In a dBG, a directed edge from vertex v to vertex v′, representing k-mers x and x′, respectively, exists if and only if x(l + 1, k − l) = x′(1, k − l) with l ≥ 1.

For a sequence S to encode in a gdBG and l > 1, ⌊(|S| − k + 1)/l⌋ k-mers will be inserted instead of |S| − k + 1. However, the graph can contain more partitions, as each vertex now has |A|^l possible successors and predecessors. Figure 4.2 gives the gdBG encoding the same sequence as in Figure 4.1 using a k-mer overlap of k − 2 instead of k − 1. The resulting gdBG contains only half as many vertices as the one in Figure 4.1.

Figure 4.2: The gdBG of sequence S = “cgtaagtaat” using 3-mers overlapping on k − l = 1. Vertices and their partition sets: cgt {1}, taa {1,3}, agt {2}. The last symbol of S is not encoded in the gdBG as it cannot be part of a k-mer.

4.3 Compression

Section 4.2 presented methods to encode a sequence as a gdBG and to decode it. In this section, we describe how to use this methodology to compress reads. To improve compression efficiency, we preprocess the reads.

4.3.1 Read Clustering and Merging

A simple form of read assembly, extended from ORCOM (Grabowski et al., 2015), is performed to reduce the input data. It clusters reads according to their minimizer, then merges reads sharing an overlap within each cluster and finally merges reads sharing an overlap but originating from different clusters. These three steps are described in the following.

4.3.1.1 Clustering

The minimizer (Roberts et al., 2004) of a read r is the lexicographically smallest of its p-mers, with p ≪ |r|. The canonical minimizer of r is the lexicographically smaller of the minimizers of r and of its reverse-complement r̄. The following method is based on the simple assumption that reads sharing a minimizer are likely to share a longer overlap and therefore to be similar. Thus, the canonical minimizer m is computed for each read r, such that r or r̄ is assigned to its cluster m. An example of clustering is illustrated in Figure 4.3.

Figure 4.3: Minimizer clustering of the 4 reads “acgtcc”, “tcctgg”, “cgtcct” and “ctggtt” into clusters “ac”, “cc” and “ct”. Minimizers of length 2 are underlined. For the sake of convenience, the reverse-complement is not considered.
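A minimal C sketch of the canonical minimizer computation (our own illustration on lowercase DNA strings, without the efficiency tricks a real implementation would use):

#include <stdio.h>
#include <string.h>

/* Reverse complement of a DNA string of length n. */
static void rev_comp(const char *r, size_t n, char *out) {
    for (size_t i = 0; i < n; i++) {
        char c = r[n - 1 - i];
        out[i] = (c == 'a') ? 't' : (c == 't') ? 'a'
               : (c == 'c') ? 'g' : 'c';
    }
    out[n] = '\0';
}

/* Lexicographically smallest p-mer of r (its minimizer). */
static void minimizer(const char *r, size_t p, char *out) {
    size_t n = strlen(r);
    memcpy(out, r, p); out[p] = '\0';
    for (size_t i = 1; i + p <= n; i++)
        if (strncmp(r + i, out, p) < 0) memcpy(out, r + i, p);
}

int main(void) {
    const char *read = "acgtcc"; /* first read of Figure 4.3 */
    char rc[64], m1[16], m2[16];

    rev_comp(read, strlen(read), rc);
    minimizer(read, 2, m1);      /* minimizer of r     */
    minimizer(rc, 2, m2);        /* minimizer of r-bar */

    /* The canonical minimizer is the smaller of the two. */
    printf("canonical minimizer: %s\n", strcmp(m1, m2) < 0 ? m1 : m2);
    return 0;
}

For “acgtcc”, both strands yield “ac”, so the read joins cluster “ac”, as in Figure 4.3.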
4.3.1.2 Intra-cluster Merging

Within each cluster, the reads are sorted by decreasing position of their minimizer, and reads sharing the same minimizer position are sorted lexicographically. For each read r and its minimizer m at position p_m, all reads r′ with minimizers at positions p′_m ≤ p_m are considered for merging, in decreasing order of positions p′_m, to maximize the overlap lengths. To merge reads r and r′, they are first anchored at the position of their minimizers such that they overlap on o = p′_m + min(|r| − p_m, |r′| − p′_m) symbols. The reads are merged into a super read (Zimin et al., 2013) if r(p_m − p′_m, o) = r′(1, o) with at most d mismatches. The same process is applied to the created super read in order to merge it with the remaining reads of the cluster. For each super read, we encode all of its read meta data in separate streams: position, length, reverse-complement information and mismatches. An example of intra-cluster merging (cf. Figure 4.3) is shown in Figure 4.4.

Figure 4.4: Merging of the two reads “cgtcct” and “tcctgg” of cluster “cc” into the super read “cgtcctgg”. Minimizers of length 2 are underlined. For the sake of convenience, the reverse-complement is not considered.

4.3.1.3 Inter-cluster Merging

As an extension of the previous steps used by ORCOM, we additionally perform a process similar to the intra-cluster read merging described previously to merge super reads from multiple clusters. For each super read sr and its minimizer m at position p_m, a new minimizer m′ is computed in sr(p_m + 1, |sr| − p_m) and in the reverse-complement of sr. All super reads of cluster m′ are considered for a merging with sr or its reverse-complement. Merging two or more super reads creates a Spanning Super Read (SSR). The same process is applied to the created SSR until no super read can be merged with it. An example of inter-cluster merging (cf. Figure 4.4) is provided in Figure 4.5.

Figure 4.5: Merging of the three super reads “acgtcc”, “cgtcctgg” and “ctggtt” into the SSR “acgtcctggtt”. Minimizers of length 2 are underlined with a dashed line. Secondary minimizers (for merging) are underlined with a plain line. For the sake of convenience, the reverse-complement is not considered.

4.3.1.4 Paired-end Reads

Each mate of a pair is considered as a single read that is clustered and merged using the previously described methods. However, the clustering and merging steps keep track of the positions of the mates in the SSRs. This information is used afterwards to store in each read's meta data whether the read is the first mate of its pair. In that case, the position of its corresponding mate in the SSRs is stored as well.
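Restated in C, the anchored overlap test and merge of Section 4.3.1.2 looks as follows (0-based positions, with a mismatch budget of d = 0 for this toy run; DARRC allows up to 5 mismatches by default, cf. Section 4.5):

#include <stdio.h>
#include <string.h>

/* Number of mismatches between two windows of length o. */
static int mismatches(const char *a, const char *b, size_t o) {
    int d = 0;
    for (size_t i = 0; i < o; i++) d += (a[i] != b[i]);
    return d;
}

int main(void) {
    /* The two reads of cluster "cc" in Figure 4.4, with minimizer
     * "cc" at 0-based positions pm = 3 and pm2 = 1. */
    const char *r  = "cgtcct"; size_t pm  = 3;
    const char *r2 = "tcctgg"; size_t pm2 = 1;
    size_t lr = strlen(r), lr2 = strlen(r2);
    const int d = 0; /* mismatch budget */

    /* Overlap length o = pm2 + min(|r| - pm, |r2| - pm2). */
    size_t o = pm2 + (lr - pm < lr2 - pm2 ? lr - pm : lr2 - pm2);

    if (mismatches(r + (pm - pm2), r2, o) <= d) {
        char sr[64];                    /* super read: prefix of r, then r2 */
        memcpy(sr, r, pm - pm2);
        strcpy(sr + (pm - pm2), r2);
        printf("super read: %s\n", sr); /* cgtcctgg, as in Figure 4.4 */
    }
    return 0;
}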
4.3.2 Spanning Super Read Encoding

Encoding a set of SSRs using a gdBG requires extracting k-mers from the SSRs. If edges represent overlaps of length k − 1, all k-mers of the SSRs are extracted. If edges represent overlaps of length k − l with l > 1, k-mers are extracted every l positions. As a consequence, similar SSRs can have different sets of k-mers. An example is given in Figure 4.6, in which two similar SSRs, ssr1 and ssr2, do not share any k-mers because the k-mers are extracted every l = 2 positions from the first position of each SSR. By shifting the k-mer extraction start position by one position in the second SSR, as shown with ssr2′, two extracted k-mers are shared with the first SSR.

ssr1 = acgtcctgaatacgtgtcccctgtgaa
ssr2 = gacgtccggaagacgcgtctccgcgga
ssr2′ = gacgtccggaaacgtgtccccggggaa

Figure 4.6: Extraction of 4-mers overlapping on k − l = 2 from two similar SSRs, ssr1 and ssr2.

In order to keep the growth of the gdBG small when inserting a new SSR, we determine the k-mer extraction start position, called start position in the following, that maximizes the number of k-mers already stored in the gdBG. To this end, we maintain in memory a k-mer index recording all extracted k-mers. As the cost in time and memory of such an index is prohibitive, we use a BF, which has constant time insertion and look-up, instead of a BFT, which has linear time insertion and look-up. Algorithm 7 is a greedy approach making use of the BF to iteratively detect, for each SSR of a set R, its optimal start position, updating the BF with all novel k-mers. To encode all SSRs completely, it not only returns the k-mers to insert into a gdBG, because these do not necessarily cover the entire SSRs. It also returns the head and tail of each SSR, which are the uncovered prefix and suffix, respectively, not encoded in the gdBG. Additionally, to provide an entry point into the gdBG for the decoding, it returns the starting overlap of each SSR, which is the prefix of length k − l of the first k-mer. More precisely, we denote by x and y the first and last k-mers extracted, respectively, from an SSR ssr, with pos_x and pos_y as their respective occurrence positions in ssr. Then, the head of ssr is the prefix ssr(1, pos_x − 1), the tail of ssr is the suffix ssr(pos_y + k, |ssr| − pos_y − k + 1), and the starting overlap of ssr is ssr(pos_x, k − l). SSR heads, tails and starting overlaps are encoded in separate streams and compressed separately from the gdBG.

Algorithm 7 ExtractKmers(k, l, R)
1: function ExtractPosition(r, k, l, B)
2:   bestcount ← 0
3:   bestpos ← 1
4:   for i ← 1, . . . , l do
5:     pos ← i
6:     count ← 0
7:     while pos ≤ |r| − k + 1 do
8:       if MayContain(r(pos, k), B) then count ← count + 1
9:       pos ← pos + l
10:    if count > bestcount then
11:      bestcount ← count
12:      bestpos ← i
13:  return (bestpos, bestcount)
14:
15: function ExtractKmers(k, l, R)
16:   K ← ∅  ▷ ordered set of extracted k-mers
17:   L ← ∅  ▷ ordered set of SSR heads and tails
18:   O ← ∅  ▷ ordered set of SSR starting overlaps
19:   B ← empty Bloom filter
20:   for each ssr ∈ R do
21:     a ← ExtractPosition(ssr, k, l, B)
22:     b ← ExtractPosition(ssr̄, k, l, B)
23:     if a.bestcount > b.bestcount then
24:       ssr′ ← ssr
25:       pos ← a.bestpos
26:     else
27:       ssr′ ← ssr̄
28:       pos ← b.bestpos
29:     L ← L ∪ {ssr′(1, pos − 1)}
30:     O ← O ∪ {ssr′(pos, k − l)}
31:     while pos ≤ |ssr′| − k + 1 do
32:       K ← K ∪ {ssr′(pos, k)}
33:       if MayContain(ssr′(pos, k), B) = false then
34:         Insert(ssr′(pos, k), B)
35:       pos ← pos + l
36:     L ← L ∪ {ssr′(pos + k, |ssr′| − pos − k + 1)}
37:   return (K, L, O)

4.3.3 Partition Encoding

4.3.3.1 Encoding

Partition sets associated with k-mers in gdBGs are represented as lists of sorted integers. A naive way to store a partition set is to use a fixed number of bytes for each partition; for example, 4 bytes is a standard size for integers on current computer architectures. In order to decrease the memory footprint while keeping the lists indexed, partitions are first delta encoded by storing the difference between each integer and its predecessor in the list (taking 0 as the predecessor of the first integer). The resulting values are called deltas. However, this only decreases the minimum number of bits necessary to encode the partitions, not their final representation. Consequently, the deltas are Vbyte encoded (Williams and Zobel, 1999): each byte used to encode a delta has one bit indicating whether the byte starts a new delta or not, allowing the removal of unnecessary bytes from each delta. Thus, partitions use a variable number of bytes, proportional to the minimum number of bits necessary to encode their deltas. An example of a naively represented partition set is described in Table 4.1. The table summarizes the minimum number of bits (log) necessary to encode the partitions, their binary code and the number of bytes they use. Table 4.2 describes the same set represented with delta and Vbyte encoding: it uses 5 bytes instead of 12 in the naive representation.

Table 4.1: Naive representation of a partition set composed of integers 12534, 12567 and 28911.

  Partitions  log    Binary code                          Number of bytes
  12534       13.61  00000000 00000000 00110000 11110110  4
  12567       13.62  00000000 00000000 00110001 00010111  4
  28911       14.82  00000000 00000000 01110000 11101111  4

Table 4.2: Delta and Vbyte encoded representation of the same partition set used in Table 4.1.

  Partitions                    log    Binary code        Number of bytes
  ∆1 = 12534                    13.61  01100000 11101101  2
  ∆2 = 12567 − ∆1 = 33          5.04   01000011           1
  ∆3 = 28911 − ∆1 − ∆2 = 16344  14.00  11111110 10110001  2
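The delta and Vbyte encoding of Table 4.2 can be sketched in C as follows (the flag placement — a high bit marking the byte that starts a new delta — is one common Vbyte convention, assumed here; the exact bit layout used by DARRC may differ):

#include <stdint.h>
#include <stdio.h>

/* Delta + Vbyte encode a sorted partition list into out; returns the
 * number of bytes written. Each byte carries 7 payload bits, and the
 * high bit marks the byte starting a new delta. */
static size_t encode_partitions(const uint32_t *parts, size_t n,
                                uint8_t *out) {
    size_t len = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = parts[i] - prev; /* delta encoding */
        prev = parts[i];

        uint8_t buf[5];                   /* 7-bit chunks, low to high */
        int k = 0;
        do { buf[k++] = delta & 0x7F; delta >>= 7; } while (delta);
        buf[k - 1] |= 0x80;               /* flag the highest chunk */
        while (k--) out[len++] = buf[k];  /* emit most significant first */
    }
    return len;
}

int main(void) {
    const uint32_t parts[] = { 12534, 12567, 28911 }; /* as in Table 4.1 */
    uint8_t out[16];
    printf("%zu bytes\n", encode_partitions(parts, 3, out)); /* 5 bytes */
    return 0;
}

The deltas 12534, 33 and 16344 need 2, 1 and 2 bytes, respectively, reproducing the 5-byte total of Table 4.2.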
4.3.3.2 Recycling

As a small delta produces a small encoding, partition integers are recycled instead of naively using the next higher integer for every new partition as, for the sake of convenience, described in Algorithm 5. Partition sets a and b can share the same partition integer if they are not neighbors in the graph, i.e., no k-mer suffix or prefix of a overlaps a k-mer prefix or suffix of b, for suffixes and prefixes of length k − l. A trivial example is provided in Figure 4.7, in which k-mer “cttc” uses the same partition integer as k-mer “acgt” because they are not neighbors in the graph.

Figure 4.7: The gdBG of SSRs ssr1 = “acgtac” and ssr2 = “tccttc” using 4-mers (l = 2). Vertices and their partition sets: acgt {1}, gtac {2}, tcct {2}, cttc {1}. Dotted edges are false implicit edges. The labeled solid edge exists by using the starting overlap of ssr2 after the traversal of ssr1, as described in Section 4.3.2.

As there can be a large number of partitions in the graph, verifying the connectivity of one partition to all other partitions is often impractical. We propose instead a heuristic that verifies the connectivity only to the last t partitions inserted, t being a user-defined threshold, such that these t partitions are the only candidates for recycling. Using partition recycling requires saving the order in which the partitions are traversed, as it is no longer simply incremental as in Algorithms 5 and 6.

4.3.4 Meta Data and gdBG Compression

The steps described previously generate meta data specific to one input file, such as read lengths and positions in SSRs. These meta data are first encoded in separate streams and are then compressed using an LZ-type algorithm, LZMA (Pavlov, 1999). After all k-mers and partitions are inserted in the gdBG, the latter is written to disk. As it must be loaded in memory for every update and decompression, the gdBG is compressed with Zstd (Collet, 2015), a compression method based on Huffman coding and Asymmetric Numeral Systems (Duda, 2013) that favors compression and decompression speed over compression ratio.
4.4 Update and Decompression

In order to update a compressed archive with a new input file, only the previously created gdBG is decompressed and loaded in memory; the meta data are not used for the update. A fast procedure then iterates over all k-mers of the gdBG and inserts them into the BF of Algorithm 7, instead of starting with an empty BF, in order to optimize the choice of the k-mer extraction start positions in the SSRs. The gdBG is then updated with the new k-mers and partitions. The starting partition index is chosen greater than all partition indexes already present in the gdBG, ensuring that each input file is encoded with a unique set of partitions.

Decompressing a read file starts with decompressing its meta data and the gdBG it is encoded in. The gdBG is then loaded in memory and Algorithm 6 is used to traverse it, following only those partitions that are specific to the read file to decompress. This way, single files can be decompressed separately. As Algorithm 6 decodes SSRs, the meta data are used afterwards to extract the actual reads. If the reads are paired-end, the meta data are also used to reorganize them such that corresponding mates of the same pair are together in the decompressed file.

4.5 Results

DARRC is implemented in C and uses the BFT library for its gdBG. The BFT provides the time and space efficient functionalities required by DARRC: (i) updating the BFT with new k-mers and colors without recomputing the index, (ii) extracting k-mers from the BFT, and (iii) prefix search over the set of k-mers within the BFT. The software is available at https://github.com/GuillaumeHolley/DARRC.

We compared DARRC to three state-of-the-art de novo DNA sequence compression tools: ORCOM (Grabowski et al., 2015), LEON (Benoit et al., 2015) and Mince (Patro and Kingsford, 2015). DARRC was also compared to the same LZ-type algorithm used to compress its meta data, LZMA (Pavlov, 1999). Experiments were carried out on a server with 378 GB of RAM and two 8-core Intel Xeon E5-2630 v3 2.4 GHz processors. All input files were placed on a dedicated mechanical hard drive. Compressed archives and decompressed files, produced during compression and decompression, respectively, as well as temporary files such as read clusters, were written to a RAM-based partition whenever a tool allowed an output directory to be specified. As the current version of DARRC does not take advantage of parallelism, all tools were run using a single thread, except Mince, which requires a minimum of four threads. All de novo DNA sequence compression tools were run with their default parameters. LZMA was run with the same compression level as the one used to compress the DARRC meta data. The default parameters of DARRC are minimizers of length 9 for the clustering, 5 mismatches allowed per read merging, and 36-mers overlapping on 11 symbols for the gdBG. ORCOM, LEON, Mince and LZMA compressed all files into separate archives, while DARRC updated the same archive iteratively with the files to compress: each iteration decompressed and reloaded the necessary data written to disk in the previous iteration.

The dataset used for the experiment consists of 473 clinical isolates of P. aeruginosa sampled from 34 patients (NCBI BioProject PRJEB5438), resulting in 338.61 Gbp of high coverage sequences. Reads are 100 bp paired-end reads generated by an Illumina HiSeq 2000. Pair mates were placed in different files for every isolate.
The experiment was run in both single-end mode and paired-end mode for all tools; in single-end mode, every mate file is considered a separate single-end read file. The appropriate single-end and paired-end modes were used for DARRC and Mince. The mate files were concatenated for the paired-end experiment of ORCOM, as the tool neither preserves the order of the reads nor stores the paired-end information. LEON and LZMA do not have an explicit paired-end mode but keep the original order of the reads; thus the mate files of every isolate were concatenated for the paired-end experiments of LEON and LZMA.

Figure 4.8: Compression ratios in paired-end mode (left) and single-end mode (right), plotted against the number of isolates for DARRC, Mince, LEON, ORCOM and LZMA.

Figure 4.9: Disk sizes (GB) in paired-end mode (left) and single-end mode (right), plotted against the number of isolates for DARRC, Mince, LEON, ORCOM and LZMA.

Compression ratios in paired-end mode and single-end mode are shown in Figure 4.8. DARRC clearly outperforms all the other tested tools in both modes. In paired-end mode and single-end mode, DARRC uses about 0.261 bits per base and 0.204 bits per base, corresponding to a 57 % and 30 % compression ratio improvement over the second best result, respectively. The paired-end compression ratio of ORCOM, compared to its single-end compression ratio, shows that the tool is not adapted to paired-end read compression. The corresponding disk sizes are shown in Figure 4.9.

Figure 4.10: Distribution of the DARRC disk size (GB) between meta data and gdBG in paired-end mode (left) and single-end mode (right).

The distributions of the DARRC meta data and gdBG disk sizes are displayed in Figure 4.10. The gdBG represents about 10 % and 13 % of the data written to disk in paired-end mode and single-end mode, respectively.

DARRC compressed more than two times faster than LZMA but used the most time to decompress, as shown in Figures 4.11 and 4.12, respectively. The compression time overhead of DARRC is explained by the fact that at each iteration, the gdBG must be decompressed, loaded in memory and updated with the new k-mers and partitions. Main memory usage during compression and decompression is shown in Figures 4.13 and 4.14, respectively. All tools performed compression and decompression using a maximum of 3.4 GB of main memory, an amount nowadays available on most desktop computers and laptops. Even though it updates the same archive iteratively, DARRC compression used an amount of main memory similar to the memory footprint of the other tested tools.

Figure 4.11: Compression times (min) in paired-end mode (left) and single-end mode (right) for DARRC, Mince (4 threads), LEON, ORCOM and LZMA.

Figure 4.12: Decompression times (min) in paired-end mode (left) and single-end mode (right) for DARRC, Mince (4 threads), LEON, ORCOM and LZMA.
Figure 4.13: Compression main memory peaks (GB) in paired-end mode (left) and single-end mode (right) for DARRC, Mince, LEON, ORCOM and LZMA.

Figure 4.14: Decompression main memory peaks (GB) in paired-end mode (left) and single-end mode (right) for DARRC, Mince, LEON, ORCOM and LZMA.

4.6 Conclusion

We presented DARRC, a dynamic alignment-free and reference-free read compression method that can incrementally update compressed archives with new genome sequences without full decompression of the archives. DARRC uses a new abstract data structure, the guided de Bruijn graph, that allows a unique traversal of the de Bruijn graph to reconstruct the sequences it is built from. We showed that, on a large pan-genome dataset, our method outperforms several state-of-the-art DNA sequence compression methods and a general purpose compression tool in terms of compression ratio, while achieving reasonable running time and main memory usage. Furthermore, we showed that the compression ratio of DARRC is attractive even when only a few files are compressed. Future work concerns the parallelization of the software, particularly the read clustering and merging phase, which offers great potential for multi-threading.

CHAPTER V

Conclusion

In this thesis, we explored new solutions for the problems of pan-genome indexing and storage. After introducing biological notions and providing a retrospective on the acquisition of genomic data, we established the current limitations of single-genome reference-centric methods in comparative analysis. These limitations can be overcome by using a pan-genome, a set of genomes from different strains of the same species, as a reference instead of a single genome that cannot represent an entire population of genomes. We then provided a description of pan-genomes and their characteristics. Next, we reviewed the literature on data structures and methods to index and analyze pan-genomes at the DNA sequence level in Chapter II.

In Chapter III, we showed the current limitations and inadequacies of the data structures for pan-genome indexing mentioned in the literature. We established that the colored de Bruijn graph is an ideal representation of a pan-genome because it is alignment-free and reference-free. However, current data structures representing colored de Bruijn graphs are neither incremental nor time and memory efficient. Therefore, we proposed to represent a pan-genome as a colored de Bruijn graph with a new lightweight data structure, the Bloom Filter Trie, which is alignment-free, reference-free, incremental, and accepts assemblies as well as reads as input. The Bloom Filter Trie is based on a burst trie to efficiently represent k-mer suffixes in the deeper levels of the trie. The use of Bloom filters allows efficient navigation in the trie while optimizing the traversal of the colored de Bruijn graph. We evaluated the Bloom Filter Trie against a data structure offering similar features on two pan-genome datasets, a bacterial dataset and a human tissue dataset. We showed that, while the two data structures have different construction advantages, the Bloom Filter Trie has the smallest disk size, is faster to query and offers supplementary functionalities compared to the other data structure.
In Chapter IV, we discussed why pan-genomes are challenging to store, given the increasing amount of data produced by large scale sequencing projects. Compression is a solution to this problem, but no current compression tool takes advantage of the genome similarity and redundancy within a pan-genome. Furthermore, current compression tools are not dynamic and produce compressed archives that cannot be updated with new data. For this purpose, we proposed a new abstract data structure, the guided de Bruijn graph, which augments the k-mers stored in the graph with partition information. The guided de Bruijn graph stores the k-mers of the reads to compress, and the partitions are used during decompression to guide the traversal of the graph such that the compressed reads are reconstructed exactly. We also described different techniques to optimize the storage of reads in the graph and the encoding of the partitions. As the guided de Bruijn graph is inspired by the colored de Bruijn graph, we used the Bloom Filter Trie as the main data structure to implement the resulting compression tool, named DARRC. We evaluated DARRC in comparison to other state-of-the-art HTS-specific and general purpose compression tools on a bacterial dataset. We showed that DARRC outperforms the other tools with a 30 % compression ratio improvement over the second best performing tool of the evaluation.

5.1 Perspectives

A first interesting perspective concerns the growth of sequencing data. Large scale sequencing projects are flourishing because they benefit from the diversity of sequencing technologies and the decreasing cost of sequencing. Hence, the size of pan-genomes, currently up to a few thousand sequenced genomes, is going to increase at a very fast pace within the next few years. Indeed, Stephens et al. (2015) estimate that the human pan-genome will be composed of 100 million to 2 billion sequenced genomes by 2025. For example, the 100,000 Genomes Project (100,000 Genomes Project, 2012), undertaken in 2012, is expected to have sequenced 100,000 human genomes by 2017, while a US-based project (Ledford, 2016) aims to sequence two million individuals within ten years. Consequently, data structures for pan-genome indexing should be ready to handle this massive amount of data.

A second perspective, beyond indexing itself, is the analysis of pan-genomes based on efficient data structures such as those presented in this thesis. A first application of such a pan-genome index is conducted by Tina Zekic at Bielefeld University and consists of the extraction of core, accessory and singleton genomes directly from a colored de Bruijn graph represented with a Bloom Filter Trie. This work can be used in the context of ancestral genome reconstruction (Luhmann et al., 2016) to shed light on ancient species from which damaged DNA has been recovered. Indeed, the assembly of such data is usually difficult because of the short length and degraded quality of the reads. Reads from ancient species can be included in the pan-genome of extant strains to assist comparative genomics methods aiming to extract the regions conserved between ancient and extant strains, as defined by the core genome. These conserved regions then make it possible to analyze genome rearrangements in a phylogenetic context.

A third interesting perspective is the emergence of third generation sequencing technologies. The Oxford Nanopore MinION (Mikheyev and Tin, 2014) is a third generation sequencer that has the size of a large USB stick and costs around $1,000.
Its small size and low cost suggest a fast deployment of this sequencing technology all around the world. Third generation sequencing technologies produce long reads that can disambiguate repeats in genome assembly and variant calling. However, such reads usually have a high error rate that can be corrected with the help of the shorter but more accurate reads produced by HTS technologies. As these technologies complement each other, pan-genome indexes should neither be limited to one type of representation (assemblies or reads) nor be limited by the sequencing technology used to produce the reads. On the contrary, the future of pan-genome indexing resides in the unification of all types of sequencing data into a single index, such that comparative methods can take advantage of the features offered by each sequencing technology. An example of such unification is the preliminary work of Ehsan Haghshenas from Simon Fraser University, who is performing hybrid single variant detection using long and short reads. The core idea of this work is to map long reads to a colored de Bruijn graph of short reads represented with a Bloom Filter Trie. The colors can be used to distinguish short reads from different strains to enhance the chances of mapping a long read to the graph. An orthogonal concept is to include long and short reads in the same colored de Bruijn graph represented with a Bloom Filter Trie, such that long reads can assist the assembly of repeated regions. While such a representation is made possible by the guided de Bruijn graph presented in Chapter IV, the high k-mer diversity of long noisy reads makes it challenging to index them efficiently.

Finally, the work presented in this thesis shows that the Bloom Filter Trie is adapted to pan-genome indexing and compression. The next logical step is to combine both approaches into a single one enabling sequence queries directly on a compressed DARRC archive, without having to decompress the sequences beforehand and index them with a BFT afterwards. Besides the considerable amount of compute time and memory saved, this would make it possible to perform complex large scale methods such as read alignment and variant calling using multiple genomes at once.

BIBLIOGRAPHY

1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526(7571):68–74.
100,000 Genomes Project (2012). Genomics England. https://www.genomicsengland.co.uk/. [Online; accessed 16-February-2017].
Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. (2004). Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor., 2(1):53–86.
Almeida, P. S., Baquero, C., Preguiça, N., and Hutchison, D. (2007). Scalable Bloom filters. Inform. Process. Lett., 101(6):255–261.
Andersson, A. and Nilsson, S. (1993). Improved behaviour of tries by adaptive branching. Inform. Process. Lett., 46(6):295–300.
Askitis, N. and Sinha, R. (2010). Engineering scalable, cache and space efficient tries for strings. The VLDB Journal, 19(5):633–660.
Baier, U., Beller, T., and Ohlebusch, E. (2016). Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32(4):497–504.
Belk, K., Boucher, C., Bowe, A., Gagie, T., Morley, P., Muggli, M. D., Noyes, N. R., Puglisi, S. J., and Raymond, R. (2016). Succinct Colored de Bruijn Graphs. bioRxiv, page 040071.
Bender, M. A., Farach-Colton, M., Johnson, R., Kraner, R., Kuszmaul, B. C., Medjedovic, D., Montes, P., Shetty, P., Spillane, R. P., and Zadok, E. (2012). Don’t thrash: how to cache your hash on flash. Proc. of the VLDB Endowment, 5(11):1627–1637.
Benoit, G., Lemaitre, C., Lavenier, D., Drezen, E., Dayris, T., Uricaru, R., and Rizk, G. (2015). Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform., 16(1):288.
Blom, J., Albaum, S. P., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., and Goesmann, A. (2009). EDGAR: A software framework for the comparative analysis of prokaryotic genomes. BMC Bioinform., 10(1):154.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Comm. ACM, 13(7):422–426.
Bonfield, J. K. and Mahoney, M. V. (2013). Compression of FASTQ and SAM format sequencing data. PLoS One, 8(3):e59190.
Bowe, A., Onodera, T., Sadakane, K., and Shibuya, T. (2012). Succinct de Bruijn graphs. In Proc. of the 12th International Workshop on Algorithms in Bioinformatics (WABI’12), volume 7534, pages 225–235.
Burrows, M. and Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Digital SRC Research Report 124.
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T., and Medvedev, P. (2015). On the representation of de Bruijn graphs. J. Comp. Biol., 22(5):336–352.
Chikhi, R., Limasset, A., and Medvedev, P. (2016). Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201–i208.
Claude, F., Farina, A., Martínez-Prieto, M. A., and Navarro, G. (2010). Compressed q-gram indexing for highly repetitive biological sequences. In Proc. of the IEEE International Conference on BioInformatics and BioEngineering (BIBE’10).
Collet, Y. (2015). ZSTD compression library. https://github.com/facebook/zstd. [Online; accessed 20-December-2016].
Compeau, P. E., Pevzner, P. A., and Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol., 29(11):987–991.
Computational Pan-Genomics Consortium (2016). Computational pan-genomics: status, promises and challenges. Brief. Bioinform., page bbw089.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algorithms. MIT Press.
Danek, A., Deorowicz, S., and Grabowski, S. (2014). Indexes of large genome collections on a PC. PLoS One, 9(10):e109384.
De La Briandais, R. (1959). File searching using variable length keys. In Proc. of the Western Joint Computer Conference (IRE-AIEE-ACM’59), pages 295–298.
Denton, J. F., Lugo-Martinez, J., Tucker, A. E., Schrider, D. R., Warren, W. C., and Hahn, M. W. (2014). Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput. Biol., 10(12):e1003998.
Deorowicz, S. and Grabowski, S. (2013). Data compression for sequencing data. Algorithms Mol. Biol., 8:25.
Dolle, D.-D., Liu, Z., Cotten, M. L., Simpson, J. T., Iqbal, Z., Durbin, R., McCarthy, S. A., and Keane, T. M. (2017). Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes. Genome Res., 27(2):300–309.
Donati, C., Hiller, N. L., Tettelin, H., Muzzi, A., Croucher, N. J., Angiuoli, S. V., Oggioni, M., Dunning Hotopp, J. C., Hu, F. Z., Riley, D. R., et al. (2010). Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol., 11(10):R107.
Duda, J. (2013). Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv:1311.2540.
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30(9):1266–1272.
Ernst, C. and Rahmann, S. (2013). PanCake: A Data Structure for Pangenomes. In Proc. of the German Conference on Bioinformatics 2013 (GCB’13), volume 34, pages 35–45.
Fan, L., Cao, P., Almeida, J., and Broder, A. Z. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):281–293.
Farach, M. (1997). Optimal suffix tree construction with large alphabets. In Proc. of the 38th Symposium on Foundations of Computer Science (FOCS’97), pages 137–143.
Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In Proc. of the 41st Symposium on Foundations of Computer Science (FOCS’00), pages 390–398.
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. (2004). An alphabet-friendly FM-index. In Proc. of the International Symposium on String Processing and Information Retrieval (SPIRE’04), volume 3246, pages 150–160.
Fischer, J., Mäkinen, V., and Navarro, G. (2009). Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci., 410(51):5354–5364.
Fredkin, E. (1960). Trie Memory. Comm. ACM, 3(9):490–499.
GENCODE group (2013). GENCODE database. https://www.gencodegenes.org. [Online; accessed 23-February-2017].
Genome Biology Editorial Team (2011). Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biol., 12(3):402.
Giancarlo, R., Rombo, S. E., and Utro, F. (2014). Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief. Bioinform., 15(3):390–406.
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Mol. Ecol. Res., 11(5):759–769.
Grabowski, S., Deorowicz, S., and Roguski, L. (2015). Disk-based compression of data from genome sequencing. Bioinformatics, 31(9):1389–1395.
Hach, F., Numanagić, I., Alkan, C., and Sahinalp, S. C. (2012). SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics, 28(23):3051–3057.
Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B. L., Barrell, D., Zadissa, A., Searle, S., et al. (2012). GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res., 22(9):1760–1774.
Heinz, S., Zobel, J., and Williams, H. E. (2002). Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20(2):192–223.
Hoff, K. J. (2009). The effect of sequencing errors on metagenomic gene prediction. BMC Genom., 10:520.
Hogg, J. S., Hu, F. Z., Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J. C., and Ehrlich, G. D. (2007). Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol., 8(6):R103.
Holland, R. C. and Lynch, N. (2013). Sequence squeeze: an open contest for sequence compression. GigaScience, 2:5.
Holley, G., Wittler, R., and Stoye, J. (2015). Bloom Filter Trie – A Data Structure for Pan-Genome Storage. In Proc. of the 15th Workshop on Algorithms in Bioinformatics (WABI’15), volume 9289, pages 217–230.
Holley, G., Wittler, R., and Stoye, J. (2016). Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol., 11:3.
Holley, G., Wittler, R., Stoye, J., and Hach, F. (2017). Dynamic Alignment-free and Reference-free Read Compression. In Proc. of the 21st International Conference on Research in Computational Molecular Biology (RECOMB’17), volume 10229 of Lecture Notes in Computer Science, pages 50–65.
Hosseini, M., Pratas, D., and Pinho, A. J. (2016). A Survey on Data Compression Methods for Biological Sequences. Information, 7(4):56.
Huang, L., Popic, V., and Batzoglou, S. (2013). Short read alignment with populations of genomes. Bioinformatics, 29(13):i361–i370.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proc. of the IRE, 40(9):1098–1101.
International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921.
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., and McVean, G. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 44(2):226–232.
Jones, D. C., Ruzzo, W. L., Peng, X., and Katze, M. G. (2012). Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res., 40(22):e171.
Kim, D., Langmead, B., and Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat. Methods, 12(4):357–360.
Kim, D., Langmead, B., and Salzberg, S. L. (2016). HISAT2 implementation. https://github.com/infphilo/hisat2. [Online; accessed 23-February-2017].
Kingsford, C. and Patro, R. (2015). Reference-based compression of short-read sequences using path encoding. Bioinformatics, 31(12):1920–1928.
Kirsch, A. and Mitzenmacher, M. (2006). Less hashing, same performance: Building a better Bloom filter. In Proc. of the European Symposium on Algorithms (ESA’06), volume 4168, pages 456–467.
Knuth, D. E. (1998). The Art of Computer Programming: Sorting and Searching, volume 3. Pearson Education.
Laing, C., Buchanan, C., Taboada, E. N., Zhang, Y., Kropinski, A., Villegas, A., Thomas, J. E., and Gannon, V. P. J. (2010). Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinform., 11(1):461.
Land, M., Hauser, L., Jun, S.-R., Nookaew, I., Leuze, M. R., Ahn, T.-H., Karpinets, T., Lund, O., Kora, G., Wassenaar, T., et al. (2015). Insights from 20 years of bacterial genome sequencing. Funct. Integr. Genomics, 15(2):141–161.
Ledford, H. (2016). AstraZeneca launches project to sequence 2 million genomes. Nature, 532(7600).
Leffler, E. M., Bullaughey, K., Matute, D. R., Meyer, W. K., Segurel, L., Venkat, A., Andolfatto, P., and Przeworski, M. (2012). Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol., 10(9):e1001388.
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760.
Loh, P.-R., Baym, M., and Berger, B. (2012). Compressive genomics. Nat. Biotechnol., 30(7):627–630.
Luhmann, N., Thévenin, A., Ouangraoua, A., Wittler, R., and Chauve, C. (2016). The SCJ Small Parsimony Problem for Weighted Gene Adjacencies. In Proc. of the International Symposium on Bioinformatics Research and Applications (ISBRA’16), volume 9683, pages 200–210.
Lukjancenko, O., Wassenaar, T. M., and Ussery, D. W. (2010). Comparison of 61 sequenced Escherichia coli genomes. Microb. Ecol., 60(4):708–720.
Mäkinen, V., Navarro, G., Sirén, J., and Välimäki, N. (2010). Storage and retrieval of highly repetitive sequence collections. J. Comp. Biol., 17(3):281–308.
Manber, U. and Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM J. Comput., 22(5):935–948.
Marçais, G. and Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770.
Marcus, S., Lee, H., and Schatz, M. C. (2014). SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483.
McCreight, E. M. (1976). A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272.
Medini, D., Donati, C., Tettelin, H., Masignani, V., and Rappuoli, R. (2005). The microbial pan-genome. Curr. Opin. Genet. Dev., 15(6):589–594.
Mikheyev, A. S. and Tin, M. M. Y. (2014). A first look at the Oxford Nanopore MinION sequencer. Mol. Ecol. Res., 14(6):1097–1102.
Minkin, I., Pham, S., and Medvedev, P. (2016). TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, page btw609.
Morrison, D. R. (1968). PATRICIA – Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM, 15(4):514–534.
Mosquera-Rendón, J., Rada-Bravo, A. M., Cárdenas-Brito, S., Corredor, M., Restrepo-Pineda, E., and Benítez-Páez, A. (2016). Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species. BMC Genom., 17:45.
Myers, E. W. (2005). The fragment assembly string graph. Bioinformatics, 21:ii79–ii85.
Navarro, G. (2012). Indexing highly repetitive collections. In Proc. of the 23rd International Workshop on Combinatorial Algorithms (IWOCA’12), volume 7643, pages 274–279.
NCBI (2007). NCBI Sequence Read Archive database. https://trace.ncbi.nlm.nih.gov/Traces/sra/. [Online; accessed 23-February-2017].
NCBI (2017). NCBI Genome database. https://www.ncbi.nlm.nih.gov/genome/. [Online; accessed 23-February-2017].
Nederbragt, L. (2016). Developments in NGS. https://figshare.com/articles/developments_in_NGS/100940. [Online; accessed 20-December-2016: 10.6084/m9.figshare.100940.v9].
Nguyen, N., Hickey, G., Zerbino, D. R., Raney, B., Earl, D., Armstrong, J., Haussler, D., and Paten, B. (2015). Building a pangenome reference for a population. J. Comput. Biol., 22(5):387–401.
Nilsson, S. and Karlsson, G. (1999). IP-address lookup using LC-tries. IEEE J. Sel. Area. Comm., 17(6):1083–1092.
Numanagić, I., Bonfield, J. K., Hach, F., Voges, J., Ostermann, J., Alberti, C., Mattavelli, M., and Sahinalp, S. C. (2016). Comparison of high-throughput sequencing data compression tools. Nat. Methods, 13(12):1005–1008.
Ohlebusch, E., Fischer, J., and Gog, S. (2010). CST++. In Proc. of the International Symposium on String Processing and Information Retrieval (SPIRE’10), volume 6393, pages 322–333.
Paten, B., Diekhans, M., Earl, D., John, J. S., Ma, J., Suh, B., and Haussler, D. (2011). Cactus graphs for genome comparisons. J. Comput. Biol., 18(3):469–481.
Patro, R. and Kingsford, C. (2015). Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics, 31(17):2770–2777.
Pavlov, I. (1999). LZMA compression library. http://www.7-zip.org. [Online; accessed 20-December-2016].
Putze, F., Sanders, P., and Singler, J. (2009). Cache-, hash- and space-efficient Bloom filters. ACM J. Exp. Algorithmic, 14:9.
Rahn, R., Weese, D., and Reinert, K. (2014). Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics, 30(24):3499–3505.
Raman, R., Raman, V., and Rao, S. S. (2007). Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor., 3(4):43.
Rasmussen, K. R., Stoye, J., and Myers, E. W. (2006). Efficient q-gram filters for finding all ε-matches over a given length. J. Comp. Biol., 13(2):296–308.
Rhoads, A. and Fai Au, K. (2015). PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics, 13(5):278–289.
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363–3369.
Roguski, L. and Deorowicz, S. (2014). DSRC 2 – Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213–2215.
Rozov, R., Shamir, R., and Halperin, E. (2014). Fast lossless compression via cascading Bloom filters. BMC Bioinform., 15(9):S7.
Russo, L., Navarro, G., and Oliveira, A. L. (2011). Fully compressed suffix trees. ACM Trans. Algor., 7(4):53.
Sadakane, K. (2007). Compressed suffix trees with full functionality. Theor. Comput. Syst., 41(4):589–607.
Saha, S. and Rajasekaran, S. (2014). Efficient algorithms for the compression of FASTQ files. In Proc. of the International Conference on Bioinformatics and Biomedicine (BIBM’14).
Sahinalp, S. C. and Vishkin, U. (1996). Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proc. of the 37th Annual Symposium on Foundations of Computer Science (FOCS’96).
Salikhov, K., Sacomoto, G., and Kucherov, G. (2014). Using cascading Bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol. Biol., 9:2.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA, 74(12):5463–5467.
Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., and Weigel, D. (2009). Simultaneous alignment of short reads against multiple genomes. Genome Biol., 10(9):R98.
Schuster, S. C. (2008). Next-generation sequencing transforms today’s biology. Nat. Methods, 5:16–18.
Shendure, J. and Ji, H. (2008). Next-generation DNA sequencing. Nat. Biotechnol., 26(10):1135–1145.
Sigaux, F. (1999). Cancer genome or the development of molecular portraits of tumors. Bulletin de l’Academie nationale de medecine, 184(7):1441–7.
Simpson, J. T. and Pop, M. (2015). The Theory and Practice of Genome Sequence Assembly. Annu. Rev. Genom. Hum. Genet., 16:153–172.
Sirén, J. (2017). Indexing Variation Graphs. In Proc. of the 19th Workshop on Algorithm Engineering and Experiments (ALENEX’17), pages 13–27.
Sirén, J., Välimäki, N., and Mäkinen, V. (2011). Indexing finite language representation of population genotypes. In Proc. of the 11th International Workshop on Algorithms in Bioinformatics (WABI’11), volume 6833, pages 270–281.
Sirén, J., Välimäki, N., and Mäkinen, V. (2014). Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinf., 11(2):375–388.
Solomon, B. and Kingsford, C. (2016). Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol., 34(3):300–302.
Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., Iyer, R., Schatz, M. C., Sinha, S., and Robinson, G. E. (2015). Big data: astronomical or genomical? PLoS Biol., 13(7):e1002195.
Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., Angiuoli, S. V., Crabtree, J., Jones, A. L., Durkin, A. S., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl. Acad. Sci. USA, 102(39):13950–13955.
Tettelin, H., Riley, D., Cattuto, C., and Medini, D. (2008). Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol., 11(5):472–477.
Thudi, M., Li, Y., Jackson, S. A., May, G. D., and Varshney, R. K. (2012). Current state-of-art of sequencing technologies for plant genomics research. Brief. Funct. Genomics, 11(1):3–11.
Treangen, T. J., Ondov, B. D., Koren, S., and Phillippy, A. M. (2014). The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol., 15(11):524.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3):249–260.
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001). The Sequence of the Human Genome. Science, 291(5507):1304–1351.
vg team (2015). vg implementation. https://github.com/vgteam/vg. [Online; accessed 23-February-2017].
Wandelt, S., Starlinger, J., Bux, M., and Leser, U. (2013). RCSI: Scalable similarity search in thousand(s) of genomes. Proc. of the VLDB Endowment, 6(13):1534–1545.
Watson, J. D. and Crick, F. H. (1953). The structure of DNA. In Cold Spring Harb. Symp. Quant. Biol., volume 18, pages 123–131.
Weiner, P. (1973). Linear pattern matching algorithms. In Proc. of the 14th Annual Symposium on Switching and Automata Theory (SWAT’73).
Williams, H. E. and Zobel, J. (1999). Compressing integers for fast file access. Comput. J., 42(3):193–201.
Witten, I. H., Neal, R. M., and Cleary, J. G. (1987). Arithmetic coding for data compression. Commun. ACM, 30(6):520–540.
Zekic, T., Holley, G., and Stoye, J. (2018). Pangenome Storage and Analysis Techniques. In Setubal, J., Stoye, J., and Stadler, P., editors, Comparative Genomics, volume 1704 of Methods in Molecular Biology.
Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics, 29(21):2669–2677.
Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343.

Printed on non-ageing paper ∞ ISO 9706