The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing

YongKiat Wee, Salma Begum Bhyan, Yining Liu, Jiachun Lu, Xiaoyan Li and Min Zhao

Review paper. Briefings in Functional Genomics, 18(1), 2019, 1–12. doi: 10.1093/bfgp/ely037. Advance Access Publication Date: 21 November 2018.

YongKiat Wee is a PhD student at the University of the Sunshine Coast. Salma Begum Bhyan is a PhD student at the University of the Sunshine Coast. Yining Liu is a research fellow at Guangzhou Medical University. Jiachun Lu is a professor at the School of Public Health, the First Affiliated Hospital, Guangzhou Medical University. Xiaoyan Li is a research fellow and an expert in clinical sequencing at Beijing Anzhen Hospital, Capital Medical University. Min Zhao is a senior research fellow at the University of the Sunshine Coast.

Corresponding authors: Xiaoyan Li, Beijing Anzhen Hospital, Capital Medical University, Beijing, China. Tel.: +86010-64456199; Fax: +86010-64456169; E-mail: xiaoyanli82@163.com. Min Zhao, School of Science and Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland 4558, Australia. Tel.: +61 (0)423791085; Fax: +61 7 5456 3402; E-mail: mzhao@usc.edu.au

Abstract

The application of third-generation sequencing (TGS) technology in genetics and genomics has provided opportunities to categorize and explore individual genomic landscapes and mutations relevant to diagnosis and therapy using whole genome sequencing and de novo genome assembly. In general, the emerging TGS technology can produce high-quality long reads for the determination of overlapping reads and transcript isoforms. However, this technology still faces challenges, such as limited base-calling accuracy and high error rates. Here, we surveyed 39 TGS-related tools for de novo assembly and genome analysis to identify the differences in their characteristics, such as the required input, the interaction with the user, sequencing platforms, type of reads, error models, the possibility of introducing coverage bias, the simulation of genomic variants and the outputs provided. Decision trees are provided to help researchers find the most suitable tools for analyzing TGS data. Our comprehensive survey and evaluation of the computational features of existing TGS methods may provide a valuable guideline for researchers.

Key words: third-generation sequencing; genome assembly; mapping; genome sequencing

Introduction

The advent of next-generation sequencing (NGS) technologies had a revolutionary impact on human genomics. Since the first market launch in 2005, these technologies have accelerated genome mining through a dramatic reduction in the overall cost of sequencing [1]. NGS technologies are distinct in their approaches and are high-throughput in nature, providing millions of sequencing reactions simultaneously. Current NGS platforms include Roche/454 [2], Illumina/Solexa [3, 4], Ion Torrent and Sequencing by Oligonucleotide Ligation and Detection (SOLiD), among others. All these technologies have advantages and disadvantages relative to one another.
However, the common disadvantages of NGS technologies for genome assembly and analysis are (1) short read lengths (<300 bp), which create difficulties in de novo assembly; (2) regions such as high/low G+C regions, tandem repeat regions and interspersed repeat regions, which are hard to sequence on NGS platforms; and (3) de novo genome assemblies lacking entire portions of genomes and missing vital genes, which can be due to fragmentation [5]. The missing genome regions can yield assemblies that lack the robustness needed to examine whole genome organization and chromosome architecture [6].

Third-generation sequencing (TGS) technologies brought new insight into sequencing and can produce highly accurate de novo assemblies across different genomes [7]. These technologies improved sequencing efficiency through rapid sample preparation and real-time signaling. The major TGS platforms are Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing, Oxford Nanopore Technologies (ONT) sequencing and BioNano Genomics (BioNano) sequencing [8]. These platforms have advantages over first- and second-generation platforms: (1) long read lengths (half of the data in reads >20 kb, maximum read length >60 kb), (2) high consensus accuracy (>99.999% at 30× coverage depth, free of systematic errors), (3) low G+C content bias and (4) simultaneous epigenetic characterization (direct identification of DNA base modifications at one-base resolution) [9]. At the same time, TGS competes with alternative technologies that can perform similar analyses, often at a lower cost. In summary, compared with NGS platforms, TGS platforms offer three important improvements: (1) an increase in read length from tens of bases to tens of thousands of bases per read, (2) a reduction in sequencing time from days to hours (or to minutes for real-time applications) and (3) a reduction or elimination of the sequencing bias introduced by polymerase chain reaction (PCR) amplification [10].

ONT sequencing technologies have continued to evolve and improve over the past few years. A new device called the PromethION (Oxford Nanopore Technologies, Oxford, United Kingdom) was introduced recently; it is a larger counterpart to the MinION (Oxford Nanopore Technologies, Oxford, United Kingdom), the nanopore device designed for portability and an accessible workflow [11]. The PromethION is a standalone high-throughput benchtop instrument that offers the flexibility to load up to 192 libraries across the whole instrument asynchronously. Whereas the MinION runs a single 512-channel flow cell, the PromethION has a higher capacity and larger scale, with 48 individual flow cells of 3000 pores each (equivalent to 48 MinIONs) running at 500 bases per second; this is powerful enough to achieve high accuracy and high coverage for a larger genome such as the human genome [12]. In addition, users can start or stop analyses on demand or increase speed by dedicating multiple flow cells to a single analysis. Real-time base calling and further analysis can be performed in the integrated compute module [13]. The nanopore data generated by the MinION and the PromethION are integrated with Metrichor, a cloud-based analytics service powered by its EPI2ME platform [11].
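The flow-cell figures quoted above imply a simple upper bound on throughput. The back-of-envelope calculation below assumes only those quoted numbers (48 flow cells, 3000 pores each, 500 bases per second) and deliberately ignores pore occupancy, read-length limits and run-time losses, so real yields are far lower; it merely shows why the instrument is considered adequate for human-scale genomes.

```python
# Theoretical PromethION throughput ceiling from the figures quoted above.
# Real-world yield is much lower: pores are never all active at once.
FLOW_CELLS = 48
PORES_PER_CELL = 3_000
BASES_PER_SECOND = 500          # per pore, as quoted
SECONDS_PER_DAY = 86_400
HUMAN_GENOME = 3.2e9            # approximate haploid size in bases

bases_per_day = FLOW_CELLS * PORES_PER_CELL * BASES_PER_SECOND * SECONDS_PER_DAY
print(f"ceiling: {bases_per_day / 1e12:.1f} Tb/day")       # ~6.2 Tb/day
print(f"coverage: ~{bases_per_day / HUMAN_GENOME:,.0f}x")  # per day, in theory
```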
Metrichor allows the automation of data analysis workflows to aid in tracking, predicting and interpreting biological data in real time. ONT develops and offers several analysis software tools, such as MinKNOW (Oxford Nanopore Technologies, Oxford, United Kingdom), Albacore and Guppy. The nanopore data generated by these two instruments can be used for detecting complex structural variants (SVs), for uncovering highly repetitive sequences and for examining the structure of larger genomes in different species, such as mammalian genomes [12].

Because of the advantage of longer reads, TGS technologies have become a powerful tool for studying the evolution and genomic diversity of organisms [7]. TGS data have been widely applied in resequencing analyses, creating detailed maps of structural variation and phasing variants across large regions of human chromosomes. TGS is also widely recognized as a useful tool for studying transcriptomes and discovering thousands of novel isoforms, including alternative splicing events and gene fusions that were not identified using second-generation short-read sequencing [7]. A combined approach of TGS and mapping technologies can enhance the analysis of structural variation by forming super-contigs ('scaffolds') that span almost an entire chromosome arm. Although long-read sequencing in TGS overcomes the length limitation of NGS, it remains considerably more expensive and has lower throughput than other platforms, limiting its widespread adoption in favor of less expensive approaches.

Here, we present a systematic and comprehensive review of available software tools for de novo and whole genome sequencing analyses of TGS data. We review a total of 39 TGS analysis tools, all recently published or developed (Table S1). We discuss their various characteristics, such as the required input, interaction with the user, sequencing platforms, type of reads, error models, possibility of introducing coverage bias, simulation of genomic variants and output provided. This is done within the framework of potential applications, providing readers with guidelines for identifying the TGS de novo software applications best suited to their purposes. This review evaluates tools applied on the three main TGS platforms for genome assembly and further analysis; the benefits and drawbacks of each approach are discussed, and two distinct decision trees are presented to guide researchers in selecting suitable TGS de novo and genome-based sequencing analysis tools.

De novo assembly using TGS technologies

In recent years, there has been a major transformation in the way genomic information is extracted from organisms. De novo long-read genome assembly involves several steps, including raw read mapping, read error correction, assembly of the corrected reads and assembly polishing. Long-read genome assemblers normally use overlap-based procedures, such as overlap-layout-consensus (OLC) algorithms, to assemble the long reads [14]. An OLC assembler first generates alignments between the long reads, then computes the best overlap graph and finally derives the consensus sequences of the contigs from that graph; a toy sketch of this idea follows below. There are two approaches to long-read error correction: the first aligns the long reads against themselves, while the second uses short reads to correct the long reads [14].
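The following toy implementation illustrates only the overlap and layout phases of the OLC idea on error-free reads; it is not any particular assembler. Real long-read assemblers replace each step with error-tolerant algorithms and finish with a consensus (polishing) pass.

```python
# Toy overlap-layout: merge reads by their longest suffix-prefix overlaps.

def overlap(a: str, b: str, min_len: int = 4) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_layout(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the best overlap."""
    reads = reads[:]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and (k := overlap(a, b)) > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads would form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads)
                 if n not in (best_i, best_j)] + [merged]
    return max(reads, key=len)

reads = ["GATTACAGGT", "ACAGGTCCTA", "GTCCTATTGC"]
print(greedy_layout(reads))  # GATTACAGGTCCTATTGC
```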
Even though an error correction stage may be part of the assembly process, errors can still be found in the assembly, particularly in long-read assemblies. This can be remedied by polishing the assemblies with short or long reads to improve the accuracy of the base calls [15].

The development of high-throughput sequencing technologies, including TGS, has been instrumental in advancing research in all scientific areas. Reflecting the impressive increase in genomic data output, TGS tools have been developed to allow rapid and easy annotation, prioritization and navigation of large variant data sets from various platforms. Figure 1 highlights the major developments in TGS tools over the past 5 years. More than five tools were developed in 2015 and 2016 alone, most of them for de novo and genome-based sequencing analysis, such as RefAligner, Canu and Nanopore Synthetic-Long (NaS) reads.

Figure 1. Milestones in TGS analysis software development. The green boxes refer to de novo assembly tools, and the orange boxes refer to genome-based analysis tools.

The specialized assembly of sequencing reads is an essential process in de novo sequencing, where a novel genome is assembled for the first time. The read length of TGS provides a great advantage for the genome assembly process. Second-generation platforms include MiSeq, HiSeq and NextSeq from Illumina. These platforms use massively parallel sequencing to achieve high throughput and have high base-calling accuracy; however, the sequencing reads are short, which can result in split contigs at repetitive regions during sequence assembly. Third-generation sequencers, including PacBio instruments from Pacific Biosciences and the MinION from ONT, can generate very long reads at high throughput by sequencing single-molecule templates. TGS platforms can thus address the problems inherent to short reads by sequencing long single molecules in real time.

Assemblers are developed around various types of algorithms, including OLC, de Bruijn graphs (DBG) and string graphs [16]. A de novo OLC-based assembly involves three essential steps: preassembly, consensus build-up and consensus polishing. The main purpose of preassembly data processing is to produce long and accurate sequences by correcting base errors. Seed reads (a subset of the sequencing reads) are selected based on the read length distribution; each read is then mapped to the seed reads to generate a consensus sequence for the mapped reads, resulting in long and accurate fragments of the target genome. The computation in this step is very intensive, as it involves all-versus-all raw read mapping and base error correction. The next step is consensus building from the overlapping reads. A few options are available when selecting assembly algorithms, but OLC assemblers offer clear advantages for de novo assembly using multi-kb long reads. For genomes with repeats of any length, a single long error-corrected read can bridge the gaps between unique sequences and ensure that the consensus building process continues without interruption. When designing a de novo genome sequencing project, reasonable read coverage (50–60×) is needed to generate sufficient reads that uniquely anchor the longest repeat regions in the genome assembly; a minimal sketch of seed-read selection follows.
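To make the seed-read idea concrete, the sketch below picks the longest reads as seeds until a target seed depth is reached, in the spirit of hierarchical assemblers such as HGAP. The genome size, seed depth and simulated read-length distribution are illustrative assumptions, not any tool's defaults.

```python
import random

def pick_seed_reads(read_lengths, genome_size, seed_depth=30):
    """Longest-first selection of seed reads up to seed_depth x coverage."""
    target_bases = seed_depth * genome_size
    chosen, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        chosen.append(length)
        total += length
    return chosen[-1], chosen  # cutoff length, selected seed lengths

# E. coli-sized example: a 4.6-Mb genome sequenced to roughly 90x raw depth.
random.seed(0)
reads = [random.randint(1_000, 40_000) for _ in range(20_000)]
cutoff, seeds = pick_seed_reads(reads, genome_size=4_600_000)
print(f"{len(seeds)} seeds, cutoff {cutoff} bp, "
      f"{sum(seeds) / 4_600_000:.0f}x seed coverage")
```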
For preassembled reads, base errors can remain in repetitive regions, where raw base errors are coupled with repeats. Errors such as indels and substitutions in the preassembled reads can also easily be passed on to the consensus. Therefore, consensus polishing is needed for assemblies produced from TGS data.

Genome alignment and assembly tools for TGS platforms

Several tools are used to assemble the long reads produced by TGS platforms, and a decision tree is presented in Figure 2 to guide researchers in choosing suitable tools for different de novo sequencing analyses on the three platforms: the ONT, SMRT and BioNano sequencing platforms.

Figure 2. Decision tree for the selection of suitable TGS de novo sequencing analysis tools on the nanopore, SMRT and BioNano sequencing platforms. The selection of a TGS tool requires a set of sequential decisions. First, one must decide on the reads from the main TGS platforms: nanopore (ONT), SMRT (PacBio) and BioNano sequencing. Then, for SMRT sequencing, one must determine whether the genome read is from a eukaryote or a bacterium and whether automated aligning of long reads is needed. In addition, one must decide which analyses are involved in the de novo sequencing, including detection of SVs or intercellular heterogeneity, identification and quantification of isoforms, haplotype genome assembly, genome assembly with or without an error correction method and whether a mapping phase is required. On the BioNano platform, only one software package is available, for genome mapping. For nanopore sequencing, one must decide whether to perform de novo sequencing analysis for sequence variants at low coverage or faster long-read assembly (above 50 kbps).

MinHash Alignment Process (MHAP), PBJelly, Hierarchical Genome-Assembly Process (HGAP), FALCON and HINGE utilize long reads from the SMRT platform. On the BioNano platform, the alignment tool RefAligner implements a dynamic programming algorithm to align each molecule map to the reference maps by identifying the best matching region in the genome sequence [17]. The match score is then computed from the in silico nicking sites in the corresponding region of the reference sequence and the distribution of fluorescent labels on the molecule. For the ONT platform, PoreSeq and Nanocorr are available for de novo sequencing analysis. Two further programs, Minimap/miniasm and Circlator, utilize long reads from both the ONT and SMRT platforms for de novo assembly and circular genome assembly analysis, respectively.

MHAP identifies all overlaps among noisy long reads using a probabilistic hashing algorithm. MinHash sketches are implemented in this software for better alignment filtering: the algorithm estimates the Jaccard similarity of two reads from their min-mers (minimum k-mers). The time required to index, store, hash and compare k-mers is proportional to the sketch size; hence, smaller sketches are recommended [18].
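The sketch below is a minimal bottom-k MinHash comparison in the spirit of MHAP's overlap filter; the hash function, k and sketch size are illustrative choices, not MHAP's own parameters.

```python
import hashlib
import random

def minhash_sketch(seq: str, k: int = 12, sketch_size: int = 128) -> set:
    """Keep the sketch_size smallest hashed k-mers (the 'min-mers')."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
            "big")
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:sketch_size])

def jaccard_estimate(s1: set, s2: set) -> float:
    """Bottom-k estimate: shared fraction of the overall smallest hashes."""
    k = max(len(s1), len(s2))
    bottom = set(sorted(s1 | s2)[:k])
    return len(bottom & s1 & s2) / len(bottom)

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(5_000))
read_a, read_b = genome[:3_000], genome[2_000:]  # ~1 kb true overlap
print(f"{jaccard_estimate(minhash_sketch(read_a), minhash_sketch(read_b)):.2f}")
```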
Even though NGS technologies can perform sequencing faster and more cost-effectively, decoding a complete genome remains one of the important challenges in bioinformatics, particularly for complex genomes. Fragments, or 'gaps', in a de novo genome assembly can result from short read lengths, repetitive regions and low sequence coverage [18]. PBJelly is a software program that uses a scaffolding approach for gap closing during genome assembly [19]. Reads are first aligned to the contigs to establish a scaffold, and reads that span numerous contigs are then used as links to build a scaffold graph. PBJelly utilizes SMRT long reads rather than NGS short reads for gap closing [19].

HGAP is used for de novo assembly and applies the longest reads to assemble a genome [20]. The hierarchical genome assembly process relies on long-insert-size DNA shotgun template libraries with SMRT sequencing; the longest reads are selected as 'seed' reads, to which all other reads are mapped. Both HGAP and PBJelly are applicable to bacterial-sized genomes, whereas MHAP is used for eukaryotic-sized genomes. HGAP offers higher assembly quality and the ability to resolve repetitive regions, while MHAP has higher sensitivity in the overlapping process, as it integrates another software package, the Celera Assembler. PBJelly is designed to automate the finishing of genome assembly and needs only FASTA sequences as input; thus, it can perform the task faster than HGAP. In addition, MHAP has another great advantage over HGAP and PBJelly: it can improve telomere assemblies by reconstructing the repeat-rich heterochromatic regions of eukaryotic assemblies.

FALCON is a hierarchical haplotype genome assembly tool that follows the design of HGAP but uses more computationally optimized elements [21]. It is applied to long-read data from the SMRT platform. Daligner is used to split the sequence data into blocks for comparison. It first collates a list of k-mers with their read identifiers and read coordinates and sorts them lexicographically; matching k-mers from the individual blocks are then merged into a new list containing both the query identifiers and the corresponding coordinates [22]. This sorting approach generates overlap candidates by locating neighbouring matches adjacent to each other (a toy version is sketched below). From the alignments of these overlaps, a directed string graph containing heterozygosity information is built [21].

HINGE is another assembler that implements the OLC paradigm, without an error correction step [23]. Daligner is applied to detect the overlaps. The principle behind this assembler is that repeat regions not spanned by longer reads are marked with 'hinges'. The boundaries of unbridged repeats (in-hinges and out-hinges) are marked on the reads, and the coverage gradients of the alignments are used to identify the repeats [24]. Overlapping reads are not considered for hinge placement when a repeat is spanned by a completely bridged read, which results in separate bridged repeats. Before a consensus is derived, hinge-aided greedy graphs are useful for resolving the repeat junctions [23].
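Below is a toy version of that collate-sort-merge step: (k-mer, read identifier) entries are sorted lexicographically, runs of identical k-mers are merged, and read pairs sharing enough k-mers are reported as overlap candidates. Parameters are illustrative, and no blocking or alignment is attempted.

```python
from collections import defaultdict
from itertools import combinations

def overlap_candidates(reads: dict, k: int = 11, min_shared: int = 3) -> dict:
    """Candidate read pairs from lexicographically sorted k-mer entries."""
    entries = sorted(
        (read[i:i + k], name)
        for name, read in reads.items()
        for i in range(len(read) - k + 1)
    )
    shared = defaultdict(int)
    i = 0
    while i < len(entries):
        j = i
        while j < len(entries) and entries[j][0] == entries[i][0]:
            j += 1  # run of identical k-mers
        names = {name for _, name in entries[i:j]}
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
        i = j
    return {pair: n for pair, n in shared.items() if n >= min_shared}

reads = {"r1": "ACGTACGGATTACACGGATT",
         "r2": "GGATTACACGGATTCCAGTA",
         "r3": "TTTTTTTTTTTTTTTTTTTT"}
print(overlap_candidates(reads))  # {('r1', 'r2'): 4}
```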
For the ONT platform, PoreSeq is the only available open-source option; it is written in Python and performs consensus calling, variant calling and de novo sequencing [25]. To acquire de novo reads with higher accuracy and more uniform coverage, PoreSeq introduces a novel algorithm built on statistical models. Its base-calling algorithm works on the discretized ionic current data from an arbitrary number of nanopore reads of the same region of DNA, including reverse-complement or partial reads. PoreSeq is ideally suited to sequences with low coverage, as it classifies sequence variants at low coverage more accurately than other methods [25]. Nanocorr is a novel open-source hybrid error correction algorithm that uses complementary MiSeq data to generate de novo assemblies of notably high accuracy [26].

Three software programs, Minimap/miniasm, Circlator and Canu, utilize long reads from both SMRT and ONT. Minimap/miniasm is a de novo assembly package that maps and assembles SMRT and ONT reads faster than other available tools [27]. Miniasm implements the 'O' and 'L' steps of the OLC assembly paradigm [27]. It assembles long noisy reads without an error correction stage, which significantly speeds up the assembly process while attaining contiguity and large-scale accuracy similar to existing pipelines, at least for genomes without excessive repetitive sequences.

Although these new technologies emphasize the automated completion of genome sequencing, existing assembly software still presumes that the end products it generates, including contigs, are linear. However, the genomes of many species contain at least one circular DNA structure, including bacterial chromosomes and plasmids and the plastid and mitochondrial genomes of eukaryotes. Correct completion and circularization of these molecules are important if the technology is to be applied routinely in clinical practice. Hence, Circlator is the first tool created to automate the assembly and produce accurate linear representations of circular sequences using both SMRT and ONT long reads [28]. Contigs are circularized using local assemblies of corrected long reads at contig ends; this avoids searching for common sequence between low-quality contig ends and enables circularization even when overlaps do not exist [28].
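To make the circularization issue concrete, the toy function below trims a duplicated arc from an over-assembled circular contig. This is only an illustration of the underlying problem, not Circlator's algorithm, and the sequences are invented.

```python
# If a linear contig ends with a copy of its own start, the assembler has
# walked around the circle more than once; trim the duplicated arc.
def trim_circular_overlap(contig: str, min_overlap: int = 20) -> str:
    """Trim a terminal repeat of the contig's start, if one exists."""
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig.startswith(contig[-k:]):
            return contig[:-k]  # drop the duplicated arc
    return contig

plasmid_arc = "ATGCGTACCGGTTAACCGGATCGATTGCA"
over_assembled = plasmid_arc + plasmid_arc[:22]  # end re-walks the start
print(trim_circular_overlap(over_assembled) == plasmid_arc)  # True
```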
Although Minimap/miniasm performs sequencing analysis faster than other software, it assembles with lower accuracy because no error correction stage is applied; as a result, it can be difficult to identify the cause of low-identity matches between two long noisy reads. Furthermore, more disk space and RAM are needed to execute this software; hence, it is not memory efficient. Circlator yields higher quality assemblies than other existing approaches, but assembling long reads without consideration of circularization can be problematic, particularly for small plasmids whose lengths are shorter than the reads used to assemble them; the generated contigs can then contain the entire plasmid sequence two or more times.

Canu was developed to address the noisy read data of single-molecule sequencing and supports both PacBio and Oxford Nanopore data [29]. In comparison with the first two programs, it has a lower runtime and requires as little as 20× single-molecule coverage. The Canu pipeline consists of three stages (correction, trimming and assembly), each of which can be run independently; a hedged sketch of such a staged run follows below. Canu has another great advantage over miniasm: because miniasm lacks a correction step, it cannot resolve repeats or reduce error rates, so miniasm assemblies are less contiguous than Canu assemblies on large genomes [29]. Assemblies generated by miniasm can also be difficult to filter and polish, as they contain a higher frequency of large insertions and deletions and low base accuracy (<90%); hence, several rounds of polishing are needed before a miniasm assembly converges in quality, whereas Canu requires only a single round [29]. However, Canu is not the fastest route to a polished assembly: assembly with miniasm followed by the Racon polisher (discussed below) is faster than Canu itself.
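As a rough sketch of that staged design, the commands below follow the staged-run pattern in the Canu 1.x documentation as we understand it; the flags, genome size and intermediate file names are assumptions that should be verified against the installed version.

```python
import subprocess

def canu(stage: str, reads_flag: str, reads: str, outdir: str) -> None:
    # Assumed Canu 1.x command line; verify flags against your install.
    subprocess.run(["canu", stage, "-p", "asm", "-d", outdir,
                    "genomeSize=4.6m", reads_flag, reads], check=True)

# Each stage consumes the previous stage's (conventionally named) output.
canu("-correct", "-pacbio-raw", "reads.fastq.gz", "stage1")
canu("-trim", "-pacbio-corrected",
     "stage1/asm.correctedReads.fasta.gz", "stage2")
canu("-assemble", "-pacbio-corrected",
     "stage2/asm.trimmedReads.fasta.gz", "stage3")
```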
Sequencing tools for hybrid technologies of both NGS and TGS

To date, many researchers have adopted a hybrid strategy, using both NGS and TGS to perform genome sequencing and thereby producing higher coverage and accuracy. A decision tree for the hybrid strategy is presented in Figure 3. This combined approach is common in de novo assembly, and the relevant software programs include OPERA-LG, DBG2OLC, GMcloser, NaS and Cerulean. With this approach, the average per-base identity of ONT reads can be greatly improved, from 65% across all flow cell iterations to greater than 97%; hence, it produces highly contiguous and complete assemblies given sufficient read lengths and sequence coverage [26].

Figure 3. Decision tree for the selection of suitable TGS de novo sequencing analysis tools on hybrid sequencing platforms. If the reads come from both TGS platforms (nanopore and SMRT sequencing), one must decide whether the reads are from a circular genome and whether no error correction is required. If the reads come from both NGS and TGS platforms, one must identify whether they are from a eukaryotic-sized or bacterial-sized genome and determine the availability of the Newbler software. In addition, one must determine whether reads from both the NGS and TGS platforms need gap closing or scaffolding assembly.

OPERA-LG is used for assembly scaffolding, ideally for larger, repeat-rich genomes [30]. It provides a framework for the scaffolding of repetitive sequences and a structured approach for incorporating sequencing data from both second-generation and TGS technologies. To generate more accurate scaffolds for genomes with larger sizes and more repeats, OPERA-LG combines several novel features and improvements, including (a) optimized data structures that enhance its scalability, (b) improved edge-length estimation and the ability to use numerous libraries to enhance scaffolding accuracy and (c) extensions that allow the scaffolding of repeat sequences [30]. One of the greatest advantages of OPERA-LG is its speed: it runs in a few seconds with a few hundred megabytes of memory (largely for storing read-mapping information) and thus needs notably less memory than comparable tools.

DBG2OLC is a hybrid assembly approach that simultaneously utilizes NGS and TGS data to address both high error rates and the excessive cost of sequencing [31]. The software is designed around the following fundamental principles: (i) a compact representation of the long reads leads to efficient alignments; (ii) base-level errors can be ignored, but structural errors need to be detected and corrected; and (iii) structurally correct TGS reads are then assembled and polished [31]. Furthermore, since NGS and TGS data can compensate for each other, using NGS data also lowers the required TGS sequencing depth, reducing the cost of sequencing. A base-level correction-free assembly pipeline is obtained by directly analyzing and exploiting the overlap information in the long reads. DBG2OLC uses NGS assemblies to lower the computational burden of aligning TGS sequences rather than merely polishing the TGS data. This lets users take advantage of cheap and easily accessible NGS reads while avoiding the issues associated with existing hybrid approaches [31].

Both NaS and Cerulean are used to assemble microbial genomes. NaS is a hybrid method that enables the sequencing of microbial genomes using the MinION [32]. Cerulean is another hybrid assembler that uses both short (Illumina) and long (SMRT) reads [33]. It does not use the short reads directly; instead, it incorporates an assembly graph produced from the short-read data by other existing assembly tools. The algorithm starts from a simplified version of the assembly graph consisting only of long contigs; in each iteration, smaller contigs are then added to improve the assembly. In contrast to state-of-the-art long-read error correction methods, which need high computational resources and long run times on a supercomputer, Cerulean can produce a similar assembly on a standard desktop with a short running time [34]. In summary, DBG2OLC is applicable to eukaryotic-sized genomes, and both NaS and Cerulean are used for bacterial-sized genomes. However, Cerulean has many benefits over NaS: it is completely automated, has a short running time, is stand-alone software, lowers the use of computer resources and has high assembly accuracy.

Tools for genome-based sequence analysis

Sequence alignment tools

Genome-based sequence alignment involves several steps, including data processing and quality control; here, we focus on the tools applied to TGS data analysis. Figure 4 shows the tools that allow users to perform whole genome sequence analysis based on certain requirements and parameters. In total, there are 14 tools for the two TGS platforms.

Figure 4. Decision tree for the selection of a suitable TGS genome-based sequencing analysis tool on the SMRT, nanopore and hybrid sequencing platforms. A set of sequential decisions has to be made when performing genome-based sequencing analyses.
First, one must determine whether the reads are generated from the nanopore (ONT) or SMRT (PacBio) platform and whether the read is hybrid or non-hybrid. If reads from the SMRT platform are used for read alignment, one must identify the length of the reads and whether a hash table is required. For reads from the nanopore platform, one must decide whether to perform base calling or read alignment. For hybrid reads that utilize both platforms, one must decide whether the analyses should be carried out for read simulation or haplotype assembly.

For SMRT data, one popular read alignment tool is the regional Hashing-based Alignment Tool (rHAT), which is applicable to long reads only. rHAT uses a seed-and-extension read alignment method designed for noisy long reads [35]. A regional hash table (RHT) is implemented to index the reference genome by recording the short tokens occurring within local windows of the reference. During the seeding stage, rHAT deploys the RHT to determine candidate sites by counting the short-token matches between the fragmented read and local genomic windows. One advantage of rHAT is that it lowers the cost of aligning reads by implementing a sparse dynamic programming-based heuristic approach in the extension step [35].
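A highly simplified RHT-style seeding step is sketched below: the reference is cut into fixed windows, each window's k-mer tokens are indexed, and candidate windows for a read are ranked by shared-token counts. The window size, k and scoring are illustrative simplifications, not rHAT's actual parameters.

```python
import random
from collections import Counter, defaultdict

def build_rht(reference: str, window: int = 100, k: int = 11) -> dict:
    """Map each k-mer token to the set of reference windows containing it."""
    rht = defaultdict(set)
    for start in range(0, len(reference) - k + 1, window):
        w = start // window
        chunk = reference[start:start + window + k - 1]  # spans the boundary
        for i in range(len(chunk) - k + 1):
            rht[chunk[i:i + k]].add(w)
    return rht

def candidate_windows(read: str, rht: dict, k: int = 11, top: int = 3):
    """Rank reference windows by the number of tokens shared with the read."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        for w in rht.get(read[i:i + k], ()):
            hits[w] += 1
    return hits.most_common(top)

random.seed(2)
ref = "".join(random.choice("ACGT") for _ in range(2_000))
read = ref[640:900]  # an error-free read spanning windows 6-8
print(candidate_windows(read, build_rht(ref)))
```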
marginAlign, in turn, uses Oxford Nanopore long reads for read alignment. The rates of deletions, substitutions and insertions in MinION reads are deduced by maximum likelihood estimation using an expectation-maximization (EM) algorithm [36]. The EM procedure is applied to a hidden Markov model (HMM) for the robust interpretation of the error sources as several classes of genetic change, including insertions, deletions and mismatches [36]. Overall, marginAlign generates high-quality sequence alignments that allow users to call single-nucleotide variants accurately with its built-in caller, marginCaller. In addition, it enables users to characterize unresolved, repeat-rich parts of the alignment.

Base calling and polishing tools

Nanocall is a tool that uses Oxford Nanopore long reads for base calling. It divides the sequence of events into strands using several heuristics. First, it measures the basic current level and identifies islands of five or more consecutive abasic current estimates. Next, Nanocall chooses the island located nearest to the middle of the event sequence. If the selected island falls within the middle third of the whole event sequence, it is used to split the events corresponding to the two strands; otherwise, Nanocall stops [37]. An HMM is then applied to these events, in which the states are the k-mers being sequenced; the pore model emissions are optionally scaled using several rounds of EM based on posteriors computed with the forward-backward algorithm, and the base calls are generated by running the Viterbi algorithm [37]. The main advantage of Nanocall is that its double-strand pore model scaling performs better than single-strand scaling, suggesting that the latter may overfit the model.

Another base caller for nanopore data is Albacore (available from the ONT user community) [38]. It is a command-line base caller implemented for ultra-long reads. Albacore is memory efficient, as it can base call directly to FASTQ files and hence saves disk space [38]; this makes it practical and convenient compared with Nanopolish. Scrappie is the latest C-based local base calling tool developed by ONT [39]. It conducts transducer-based base calling to determine the accurate length of homopolymers and is known as the first base caller to resolve the homopolymer sequencing error problem. Furthermore, Scrappie can base call from the raw current signal without event detection [39]. DeepNano is a freely available base caller that uses a recurrent neural network to perform base calling for the MinION platform [40]; it is developed in Python. Both Scrappie and Albacore are considered better options for base calling than DeepNano, as they support multithreading [41].

Assembled reads can be improved through a polishing process, that is, post-assembly error correction. There are two state-of-the-art sequence polishing tools, Nanopolish and Racon. Nanopolish implements a hidden Markov method to enhance base quality by estimating the probability of each base from the raw signal data of the reads [42]. The accuracy of a draft genome can also be increased with Nanopolish, as it corrects the homopolymer-rich regions of the genome [42]. Racon is a stand-alone consensus module for correcting the raw contigs produced by assembly methods that lack a consensus phase [43]. It attempts to identify the best alignments in order to improve the accuracy and quality of the assembled reads [43] and supports data generated by both Pacific Biosciences and ONT. In terms of accuracy, Nanopolish yields better polishing results than Racon; however, Nanopolish is computationally expensive and hence time-consuming.

Simulation tools: testing sequence alignment software

To test such software, a range of simulation programs has been developed to produce synthetic TGS reads. Two distinct tools allow users to simulate SMRT or Oxford Nanopore long reads: SiLiCo and ReadSim. SiLiCo is among the first in silico tools to generate high-quality simulated reads for both TGS platforms [44]. It simulates sequencing results from the two technologies by randomly generating genomic coordinates and extracting the corresponding nucleotide sequences from a reference assembly. An in silico simulator has also been established to quantify the patterns of nick sites in sequencing libraries by generating an empirical distribution of terminal nucleotides in ideal long-read sequencing libraries. The scalability of SiLiCo enables the end user to build empirical distributions of various genomic characteristics, as it can scale up to a Monte Carlo simulation [44]. ReadSim is based on a data-driven model using support vector regression that can accurately estimate the assembly performance of a sequence [45]. It produces long reads imitating the read length distribution of an input file by choosing a random starting location in the genome and producing a read of the next observed length [45].
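A toy coordinate-sampling simulator in the same spirit is shown below; the Gaussian length distribution and the uniform per-base error model are deliberate oversimplifications, not the defaults of SiLiCo or ReadSim.

```python
import random

def simulate_read(genome: str, mean_len: int = 8_000,
                  sd: int = 2_000, error_rate: float = 0.12) -> str:
    """Sample a start coordinate, draw a length and inject random errors."""
    length = max(200, int(random.gauss(mean_len, sd)))
    start = random.randrange(max(1, len(genome) - length))
    read = []
    for base in genome[start:start + length]:
        r = random.random()
        if r < error_rate / 3:           # substitution
            read.append(random.choice("ACGT".replace(base, "")))
        elif r < 2 * error_rate / 3:     # insertion after this base
            read.append(base + random.choice("ACGT"))
        elif r < error_rate:             # deletion
            continue
        else:                            # correct call
            read.append(base)
    return "".join(read)

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(100_000))
print(len(simulate_read(genome)))  # a noisy read of roughly mean_len bases
```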
Another simulation tool, LongISLND (in silico sequencing of Lengthy and Noisy Datatypes), applies a 'learn-and-simulate' approach [46]. An empirical model is established through the learning process by capturing samples from a specific set of real data. LongISLND learns from the alignment data by recording the base calls, with and without errors, that correspond to the sequencing structure of the reference genome [46]. To examine these alignment records, a non-parametric model is implemented that captures, among other features, the error profile of the genome sequence [46]. Overall, LongISLND has an advantage over rHAT in that it allows users to customize the output formats.

Recently, two new simulators have been introduced for the nanopore sequencing platform: NanoSim and DeepSimulator. NanoSim is developed in Python for simulation and read-length analysis [47]. It examines experimental ONT reads to model read characteristics, including the sequence length distributions and error profiles, and then applies these characteristics to produce in silico reads from an input reference [47]. DeepSimulator imitates sequence reads using statistical models of the data, covering both nucleotide reads and raw electrical current signals [48]. It can serve as a benchmark when examining newly designed approaches for nanopore sequencing data analysis. The tool comprises two frameworks: sequence generation and simulation of the raw current signals [48].

The main difference between the two earlier programs is that ReadSim can perform both read simulation and performance prediction for genome sequence assembly, whereas SiLiCo is only capable of read simulation. However, SiLiCo has the great advantage of ensuring that all nucleotides have similar likelihoods of being chosen in a simulated read: it selects the start and end genomic coordinates using a buffer, which prevents end-selection bias. In summary, SiLiCo is more user-friendly than ReadSim, as it allows users to supply the corresponding parameters for the desired genome coverage and the mean and standard deviation of the read length.
Haplotype assembly tools

Haplotype assembly is the computational problem of reconstructing haplotypes, the two parental copies of a diploid genome. Haplotype assembly also plays an important role in measuring allele-specific expression [49]. The difficulties include sequencing errors, which demand greater computational resources, and uneven coverage across transcripts, which may introduce false variants or cause true heterozygous variants to be discarded from the analysis [49]. Hence, HapCol, a fast and memory-efficient method, was designed to overcome these problems in haplotype assembly. HapCol applies a fixed-parameter algorithm to the k-constrained minimum error correction problem, a recently developed variant of the weighted minimum error correction (wMEC) problem that takes into account important features of future high-throughput sequencing technologies, namely increased read lengths and more uniform distributions of sequencing errors. HapCol can be applied to long reads from both platforms; reads long enough to span numerous heterozygous sites are needed to assemble the haplotypes accurately [50]. HapCol drops the traditional all-heterozygous assumption and hence can phase data sets with much higher coverage. A few other methods cannot process long reads or coverage greater than 20×, whereas HapCol handles data sets with both long reads (exceeding 100 000 bp) and coverage up to 25× on standard workstations or small servers [50]. Depending on the error model, users can apply different error distributions by choosing the maximum number (k) of errors per position, and k can simply be raised until a useful solution is obtained. Even when the average error rate is low, as in data from existing Illumina sequencing technologies, this approach greatly reduces the impact of systematic sequencing errors when processing the data sets [50]. In summary, HapCol overcomes the traditional all-heterozygous assumption, processes data sets with 25× coverage on standard workstations or small servers and allows k to be adjusted easily according to the user's requirements, making it flexible.

Error correction tools

The error correction stage is essential for numerous analyses, including sequence assembly, haplotype inference and single nucleotide variant calling. The errors produced by common high-throughput sequencing platforms must be characterized in order to generate high-quality reads. Error correction approaches fall into two main categories, hybrid and de novo: hybrid approaches use both short- and long-read data, while non-hybrid (de novo) methods self-correct the reads by exploiting the overlaps in high-coverage data. We include five hybrid correctors (LoRDEC, proovread, Jabba, LSC and PacBioToCA) and two de novo correctors (PacBioToCA and LoRMA) in this review. PacBioToCA, proovread and LoRDEC use long reads from the SMRT platform for error correction in whole genome sequence analysis; LoRMA and Jabba are used with both SMRT and ONT long reads; and MultiBreak-SV can be used across different platforms.

PacBioToCA is a module of the Celera Assembler software package that works by mapping shorter, high-accuracy reads onto the long reads [51]. The strategy consists of two stages: a long-read correction phase and an assembly phase. Both are implemented as part of the Celera Assembler, but the output of the correction phase can be used as input to any other analysis or assembler capable of utilizing long FASTA sequences. The outline of the correction algorithm is as follows: (1) high-identity short reads are simultaneously mapped to all long reads; (2) repeats are resolved by placing each short read in its highest-identity repeat copy; (3) chimera and trimming problems within the long reads are identified and corrected; and (4) a consensus sequence is calculated for each long read from a multiple alignment of the short reads [51].

proovread is a hybrid correction pipeline and mapping-based approach for SMRT reads. It corrects the long reads by mapping the short reads onto them and building a consensus from the mapped short reads; a stripped-down version of this consensus step is sketched below. proovread-corrected sequences are longer and the throughput is higher; thus, proovread combines highly accurate correction with excellent adaptability to the available hardware, enhancing the utility of SMRT sequencing [52].
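The sketch below reduces that consensus step to its core: given short reads already placed on a long read, each column is replaced by the majority base of its pileup. Real correctors such as proovread additionally handle indels, base qualities and iterative remapping; the reads and offsets here are invented.

```python
from collections import Counter

def polish(long_read: str, placements: list) -> str:
    """placements: (offset, short_read) pairs; offsets on the long read."""
    pileup = [Counter({b: 1}) for b in long_read]  # the long read votes once
    for offset, short in placements:
        for i, base in enumerate(short):
            if 0 <= offset + i < len(long_read):
                pileup[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] for col in pileup)

noisy = "ACGTTXGCAAYCGT"                       # errors at positions 5 and 10
shorts = [(2, "GTTAGCA"), (5, "AGCAATCG"), (8, "AATCGT")]
print(polish(noisy, shorts))                   # ACGTTAGCAATCGT
```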
LoRDEC is a hybrid error correction method that builds a succinct de Bruijn graph (DBG) from the short reads and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph [53]. LoRDEC is at least six times faster and requires at least 93% less memory or disk space than comparable tools while achieving similar accuracy [53]. Compared with other correction algorithms, LoRDEC offers a novel graph-based approach: path searching in a DBG can handle higher error rates, although the search fails if either no path or too many paths exist between the source and target k-mers (a minimal version of this bridging search is sketched below). PacBioToCA generates higher quality assemblies with fewer errors and gaps than proovread, but the goal of proovread is not to generate maximally accurate reads or to reduce the cost of sequencing; rather, it was developed to run on standard computers as well as computer grids, independently of the computing infrastructure, and it can easily be adapted to various use cases. proovread adapts more flexibly than PacBioToCA and LoRDEC to existing hardware, from a laptop to a high-performance computing cluster. LoRDEC, however, has clear advantages over the other two programs: it allows trimming, is less biased when correcting SMRT reads and uses memory efficiently.

Jabba is a hybrid method that corrects long third-generation reads by mapping them onto a corrected DBG constructed from second-generation data [54]. Its distinguishing feature is a pseudo-alignment approach with a seed-and-extend methodology, in which the seeds are maximal exact matches (MEMs) between an individual read and a node of the graph. New algorithms based on the DBG were specifically designed to efficiently handle the assembly of huge amounts of NGS data [54]; overlaps between short reads sharing a k-mer are then established in linear time. Overall, pseudo-alignment with MEMs is a fast and reliable way to map long, highly erroneous sequences onto a DBG. Jabba is fast, is highly reliable on the generated aligned reads, achieves high accuracy (many of the aligned reads are error-free) and has low CPU usage.

LSCplus was designed to apply a hybrid sequencing approach combining NGS and SMRT data, improving long-read accuracy by short-read alignment [55]. LSCplus is designed for RNA-seq analysis; because of the high error rate in PacBio long reads, hybrid sequencing is required. The original error correction algorithm of LSC was optimized in LSCplus [55]. During error correction, if a given position is covered by only a few differing bases, the program cannot decide which one is real; increasing the coverage depth increases the number of true positives and hence the true positive rate. Overall, LSCplus is applicable to both long and short reads, whereas LoRMA and Jabba are suitable only for long reads.

MultiBreak-SV is an algorithm for identifying SVs from single-molecule sequencing data, paired-read sequencing data or a combination of data from different platforms [56]. It applies a probabilistic approach to reduce the error rates in sequencing. A study showed that MultiBreak-SV can determine SVs with high sensitivity and specificity when applied to PacBio data from four human fosmids [56].
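Returning to LoRDEC's graph strategy mentioned above, here is a minimal DBG bridging search: a graph is built from (assumed accurate) short reads, and a path is sought between two solid k-mers flanking an erroneous region of a long read. The value of k, the depth bound and the depth-first strategy are illustrative simplifications.

```python
from collections import defaultdict

def build_dbg(short_reads: list, k: int = 5) -> dict:
    """k-mer graph: each k-mer points to its observed successors."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def bridge(graph: dict, source: str, target: str, max_depth: int = 30):
    """Depth-first search for one source-to-target path of k-mers."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            return path[0] + "".join(p[-1] for p in path[1:])
        if len(path) < max_depth:
            for nxt in graph.get(node, ()):
                stack.append((nxt, path + [nxt]))
    return None  # no path (or too deep): region left uncorrected

shorts = ["ACGTAGGCTA", "GTAGGCTAAC", "AGGCTAACGT"]
print(bridge(build_dbg(shorts), "ACGTA", "TAACG"))  # ACGTAGGCTAACG
```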
LoRMA is used for error correction of long reads only. It involves two steps: first, an iterative alignment-free correction method based on DBGs with increasing k-mer lengths; and second, long-distance dependencies determined from multiple alignments are used to further improve the corrected reads [57]. The method demonstrates that efficient alignment-free approaches can work on highly erroneous long-read data [57].

Discussion

TGS technologies have three main advantages: they are fast, they are easy to use and they produce much longer reads. The availability of long reads has a major impact on genomics studies that involve assembly; assembling genomes solely from short reads, without any available reference genome, remains a challenge [58]. Long reads have proved invaluable for achieving high-quality assemblies because they span proportionally more of the repeats present in a genome.

Long-read assemblers implement an overlap graph or string graph approach that begins by comparing the entire set of long reads to each other. In Cerulean, long reads are used to find the best path in the DBG that bridges the gaps between large contigs. Although these software packages represent important advances for TGS genome assembly, resolving intricate ambiguities is inherently difficult. Furthermore, the underlying graph search algorithms usually have exponential complexity with respect to the search depth; highly repetitive regions (such as long repeats of simple sequences) lead to large search depths and are not resolvable. In addition, the more powerful read overlap graph structure of the long reads was not fully explored in any of these approaches; generally, these algorithms depend on heuristics such as contig lengths, and iterations are required.

Compared with the string graph approach, hybrid strategies that incorporate NGS data are more effective when only limited long-read coverage is available, especially below 30×, whereas self-correction is better suited to higher sequencing coverage because more reliable alignments can be made between the long reads. For example, HGAP was developed as a non-hybrid strategy for assembling SMRT sequencing data and does not require NGS short reads. HGAP contains a consensus algorithm that generates long, highly accurate overlapping sequences by correcting errors on the longest reads using shorter reads from the same library. This correction approach was proposed earlier in the hybrid setting and is widely implemented in assembly pipelines. However, this non-hybrid, hierarchical assembly technique needs relatively high sequencing coverage (50–100×) and substantial error correction time to obtain adequate results. Notably, most of the algorithms reviewed here alongside HGAP were originally designed for bacterial-sized genomes. Although recent advances in aligning erroneous long reads have shortened the computational time of TGS assembly, running these programs on large genomes, particularly mammalian-sized genomes, normally imposes a computational burden more appropriate to large computing clusters.

To help scientists choose the appropriate TGS tool(s) for genomic studies, we summarize our discussion of the whole genome sequencing analysis and de novo assembly tools collected in this paper, organized by TGS platform.
For the whole genome sequencing analysis tools used on the SMRT platform, both rHAT and LongISLND are applicable for read alignment; LongISLND is preferable when a hash table is not required. On the ONT platform, Nanocall is an effective tool for base calling, and marginAlign is applicable for read alignment using ONT long reads. HapCol, SiLiCo and ReadSim work with long reads from both platforms: HapCol for haplotype assembly, and SiLiCo and ReadSim for read simulation. ReadSim is suited to both long and short reads, whereas SiLiCo is used for long reads only.

For the de novo assembly tools using the SMRT platform, HGAP, PBJelly and HINGE are suitable for assembling bacterial-sized genomes, while MHAP is used for assembling mammalian genomes. FALCON is the only SMRT diploid-aware assembler, allowing researchers to investigate the haplotype structure and heterozygous structural variation of a genome sequence. On the ONT platform, PoreSeq is an efficient tool for assembling genomes when the sequence variants lie in low-coverage regions. Minimap/miniasm is suitable for assembling long reads from both platforms without any correction stage, and Circlator applies SMRT and ONT long reads to circular genome assemblies. Numerous tools, including GMcloser, OPERA-LG, Nanocorr, DBG2OLC, NaS and Cerulean, implement a hybrid approach to assemble reads from both TGS and NGS. GMcloser works for gap closing, while OPERA-LG is developed for scaffolding assembly. Nanocorr generates de novo assemblies with a built-in error correction algorithm. DBG2OLC is specific to eukaryotic-sized genomes, while both NaS and Cerulean are more suitable for bacterial-sized genomes.

The error correction stage is one of the challenges of TGS genome assembly. The mapping phase of an error correction method generally processes the sequencing reads by mapping them to a reference genome or aligning them to other sequences to form potential overlaps. Bad alignments are normally caused by noise introduced by errors in the reads. These low-quality alignments may then be removed from the downstream analysis, resulting in a loss of important information; this is especially problematic when examining low-quality reads in low-coverage genomic regions. Error correction methods can be implemented to overcome these difficulties. For example, the correction tools implemented for TGS include proovread, PacBioToCA and LSC. proovread is more flexible than the other two programs. LSC was developed mainly for the correction of (human) transcriptomic data; PacBioToCA can handle different data sets but is part of the Celera WGS pipeline and requires installation of the complete package. LSC does not trim the data, but both PacBioToCA and proovread can. To give the user maximum flexibility, proovread also reports the untrimmed corrected reads, and its trimming step is independent of the correction, enabling the user to easily optimize the trimming parameters for a given data set. All alignments can be optimized by correcting the errors in the reads, yielding more accurate, higher quality alignments and ultimately better downstream analysis.
Currently, a new generation of technologies is pushing the limits even further, resolving single nucleic acid molecules in less time. For example, MECAT, an ultra-fast mapping, error correction and de novo assembly tool, was developed for single-molecule sequencing reads and can be deployed on personal computers [59]. Building high-resolution TGS computational tools by integrating novel computational algorithms with new technological characteristics is an iterative procedure. All these efforts require committed collaboration between industrial and academic scientists, which should enable sequencing with much higher efficiency and accuracy.

Key points
• TGS technologies hold the promise of longer read lengths; they have been used to generate highly accurate de novo assemblies of hundreds of genomes, providing new insights into evolution and sequence diversity.
• We evaluated tools applied on the three main TGS platforms: PacBio SMRT sequencing, ONT sequencing and BioNano sequencing. We discussed their various characteristics, such as the required input, interaction with the user, sequencing platforms, type of reads, error models, possibility of introducing coverage bias, simulation of genomic variants and output provided. This was done within the framework of potential applications, providing readers with guidelines for identifying the TGS de novo software applications best suited to their purposes.
• We presented two distinct decision trees to guide researchers in selecting suitable TGS de novo and whole genome sequencing analysis tools.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bfg.

Acknowledgement

This work was supported by the research start-up fellowship of the University of the Sunshine Coast to M.Z.

References
1. Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Mol Cell 2015;58:586–97.
2. Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376.
3. Bennett S. Solexa Ltd. Pharmacogenomics 2004;5:433–8.
4. Shen R, Fan JB, Campbell D, et al. High-throughput SNP genotyping on universal bead arrays. Mutat Res 2005;573:70–82.
5. Alkan C, Sajjadian S, Eichler EE. Limitations of next-generation genome sequence assembly. Nat Methods 2011;8:61–5.
6. Denton JF, Lugo-Martinez J, Tucker AE, et al. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol 2014;10:e1003998.
7. Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet 2011;52:413–35.
8. Stankova H, Hastie AR, Chan S, et al. BioNano genome mapping of individual chromosomes supports physical mapping and sequence assembly in complex plant genomes. Plant Biotechnol J 2016;14:1523–31.
9. Nakano K, Shiroma A, Shimoji M, et al. Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area. Hum Cell 2017;30:149–61.
10. Lu H, Giordano F, Ning Z. Oxford Nanopore MinION sequencing and genome assembly. Genomics Proteomics Bioinformatics 2016;14:265–79.
11. Jain M, Olsen HE, Paten B, et al. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 2016;17:239.
12. Magi A, Semeraro R, Mingrino A, et al.
13. Leggett RM, Clark MD. A world of opportunities with nanopore sequencing. J Exp Bot 2017;68:5419–29.
14. de Lannoy C, de Ridder D, Risse J. The long reads ahead: de novo genome assembly using the MinION. F1000Res 2017;6:1083.
15. Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform 2018;19:23–40.
16. Henson J, Tischler G, Ning Z. Next-generation sequencing and large genome assemblies. Pharmacogenomics 2012;13:901–15.
17. Mak AC, Lai YY, Lam ET, et al. Genome-wide structural variation detection by genome mapping on nanochannel arrays. Genetics 2016;202:351–62.
18. Berlin K, Koren S, Chin CS, et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015;33:623–30.
19. English AC, Richards S, Han Y, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 2012;7:e47768.
20. Chin CS, Alexander DH, Marks P, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013;10:563–9.
21. Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016;13:1050–4.
22. Zimin AV, Puiu D, Luo MC, et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res 2017;27:787–92.
23. Kamath GM, Shomorony I, Xia F, et al. HINGE: long-read assembly achieves optimal repeat resolution. Genome Res 2017;27:747–56.
24. Jayakumar V, Sakakibara Y. Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief Bioinform 2017. doi: 10.1093/bib/bbx147.
25. Szalay T, Golovchenko JA. De novo sequencing and variant calling with nanopores using PoreSeq. Nat Biotechnol 2015;33:1087–91.
26. Goodwin S, Gurtowski J, Ethe-Sayers S, et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res 2015;25:1750–6.
27. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 2016;32:2103–10.
28. Hunt M, Silva ND, Otto TD, et al. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol 2015;16:294.
29. Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 2017;27:722–36.
30. Gao S, Bertrand D, Chia BK, et al. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biol 2016;17:102.
31. Ye C, Hill CM, Wu S, et al. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci Rep 2016;6:31900.
32. Madoui MA, Engelen S, Cruaud C, et al. Genome assembly using nanopore-guided long and error-free DNA reads. BMC Genomics 2015;16:327.
33. Deshpande V, Fung ED, Pham S, et al. Cerulean: a hybrid assembly using high throughput short and long reads. Springer 2013;8126:349–63.
34. Bao S, Jiang R, Kwan W, et al. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet 2011;56:406–14.
35. Liu B, Guan D, Teng M, et al. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics 2016;32:1625–31.
36. Jain M, Fiddes IT, Miga KH, et al. Improved data analysis for the MinION nanopore sequencer. Nat Methods 2015;12:351–6.
37. David M, Dursi LJ, Yao D, et al. Nanocall: an open source basecaller for Oxford Nanopore sequencing data. Bioinformatics 2017;33:49–55.
38. Oxford Nanopore Technologies. Albacore. https://github.com/Albacore/albacore (5 May 2018, date last accessed).
39. Oxford Nanopore Technologies. Scrappie. https://github.com/nanoporetech/scrappie (15 April 2018, date last accessed).
40. Boza V, Brejova B, Vinar T. DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 2017;12:e0178751.
41. Senol Cali D, Kim JS, Ghose S, et al. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinform 2018. doi: 10.1093/bib/bby017.
42. Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods 2015;12:733–5.
43. Vaser R, Sovic I, Nagarajan N, et al. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 2017;27:737–46.
44. Baker EAG, Goodwin S, McCombie WR, et al. SiLiCO: a simulator of long read sequencing in PacBio and Oxford Nanopore. bioRxiv 2016:1–3. doi: 10.1101/076901.
45. Lee H, Gurtowski J, Yoo S, et al. Error correction and assembly complexity of single molecule sequencing reads. bioRxiv 2014. doi: 10.1101/006395.
46. Lau B, Mohiyuddin M, Mu JC, et al. LongISLND: in silico sequencing of lengthy and noisy datatypes. Bioinformatics 2016;32:3829–32.
47. Yang C, Chu J, Warren RL, et al. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 2017;6:1–6.
48. Li Y, Han R, Bi C, et al. DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics 2018;34:2899–908.
49. Cao H, Wu H, Luo R, et al. De novo assembly of a haplotype-resolved human genome. Nat Biotechnol 2015;33:617–22.
50. Pirola Y, Zaccaria S, Dondi R, et al. HapCol: accurate and memory-efficient haplotype assembly from long reads. Bioinformatics 2016;32:1610–7.
51. Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 2012;30:693–700.
52. Hackl T, Hedrich R, Schultz J, et al. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 2014;30:3004–11.
53. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics 2014;30:3506–14.
54. Miclotte G, Heydari M, Demeester P, et al. Jabba: hybrid error correction for long sequencing reads. Algorithms Mol Biol 2016;11:10.
55. Hu R, Sun G, Sun X. LSCplus: a fast solution for improving long read accuracy by short read alignment. BMC Bioinformatics 2016;17:451.
56. Ritz A, Bashir A, Sindi S, et al. Characterization of structural variants with single molecule and hybrid sequencing approaches. Bioinformatics 2014;30:3458–66.
57. Salmela L, Walve R, Rivals E, et al. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 2017;33:799–806.
58. Lee H, Gurtowski J, Yoo S, et al. Third-generation sequencing and the future of genomics. bioRxiv 2016. doi: 10.1101/048603.
59. Xiao C, Chen Y, Xie SQ, et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods 2017;14:1072–4.