When do longer reads matter? A benchmark of long-read de novo assembly tools for eukaryotic genomes

Bianca-Maria Cosma 1, Ramin Shirali Hossein Zade 1, Erin Noel Jordan 1,2, Paul van Lent 1, Chengyao Peng 1, Stephanie Pillay 1, and Thomas Abeel 1,3,*

1 Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, The Netherlands; 2 Technical Biochemistry, TU Dortmund University, Emil-Figge-Straße 66, 44227, Dortmund, Germany; 3 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, USA

*t.abeel@tudelft.nl

Abstract

Background: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing platforms from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have overcome the short read lengths characteristic of next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects.

Results: We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing, to evaluate the assemblers. We include five commonly used long-read assemblers in our benchmark: Canu, Flye, Miniasm, Raven, and Redbean. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality.

Conclusions: Our benchmark concludes that no single assembler performs best in all evaluation categories. However, our results show that, overall, Flye is the best-performing assembler, both on real and simulated data. Furthermore, benchmarking with longer reads shows that increased read length can improve assembly quality, but the extent to which this can be achieved depends on the size and complexity of the reference genome.

Key words: De novo assembly, Third-generation sequencing, Benchmarking, Eukaryote genomes.

Introduction

De novo genome assembly is essential in several leading fields of research, including disease identification, gene identification, and evolutionary biology [1–4]. Unlike reference-based assembly, which relies on the use of a reference genome, de novo assembly only uses the genomic information contained within the sequenced reads.
Since it is not constrained to the use of a reference, high-quality de novo assembly is essential for studying novel organisms, as well as for the discovery of overlooked genomic features, such as gene duplication [5], in previously assembled genomes.

The introduction of Third Generation Sequencing (TGS) led to massive improvements in de novo assembly. The advent of TGS has addressed the main drawback of Next Generation Sequencing (NGS) platforms, namely the short read length, but has introduced new challenges in genome assembly because of the higher error rates of long reads. The leading platforms in long-read sequencing are Pacific Biosciences Single Molecule, Real-Time sequencing (often abbreviated as "PacBio") and Oxford Nanopore (ONT) sequencing [6].

Since the introduction of TGS platforms, many methods have been developed that aim to take full advantage of the longer read length and to overcome the new challenges caused by sequencing errors. Recent studies have compared long-read de novo assemblers. One such study was conducted by Wick and Holt [7], who focused on long-read de novo assembly of prokaryotic genomes. Eight assemblers were tested on real and simulated reads from PacBio and ONT sequencing, and evaluation metrics included sequence identities, circularisation of contigs, computational resources, as well as accuracy. Murigneux et al. [8] performed similar experiments on the genome of M. jansenii, although in this case the focus was on comparatively benchmarking Illumina sequencing and three long-read sequencing technologies, in addition to the comparison of long-read assembly tools. Studies narrowed down to a single type of sequencing technology include those of Jung et al. [9], who evaluated assemblers on real PacBio reads from five plant genomes, and Chen et al. [10], who used real and simulated Oxford Nanopore reads from bacterial pathogens in their comparison. Except for the Wick and Holt study, which provides a comprehensive comparison of de novo assembly for prokaryotic genomes, these studies either compare assemblers on a single genome or use data from a single sequencing platform. Here, we provide a comprehensive comparison of de novo assembly tools across all TGS technologies and six different eukaryotic genomes, to complement the study of Wick and Holt.

In this study, we benchmark these methods using 13 real and 72 simulated datasets (see Figure 1) from both PacBio and ONT platforms, to guide researchers in choosing the proper assembler for their studies. Benchmarking with simulated reads allows us to accurately compare the final assembly with the ground truth, and benchmarking with real reads can validate the results based on simulated reads. The assembler comparison presented in this manuscript complements the literature that has already been published by introducing an analysis of not just assembler performance, but also the effect of read length on assembly quality. Although increased read length is generally considered an advantage, we investigate whether it always translates into better assembly performance.
To that end, the scope of the study extends to six model eukaryotes that provide a performance indication for genomes of variable complexity, covering a wide range of taxa on the eukaryotic branch of the Tree of Life [11]. Complexity in genome assembly is determined by multiple variables, the most notable of which is the proportion of repetitive sequences within the genome of a particular organism. Complexity in eukaryotic genomes is further exacerbated by the size and organization of chromosomal architecture, including telomeres and centromeres, and the presence of circular elements such as mitochondrial and chloroplast DNA.

Figure 1: The benchmarking pipeline. We first select six representative eukaryotes from the Tree of Life [11] and use Badread's error and QScore model generation feature [15] to create three models of state-of-the-art long-read sequencing technologies. These models are input to the read simulation stage, where we simulate reads from all genomes with four different read length distributions. We then perform assembly of simulated and real reads, using five long-read assemblers. Lastly, we evaluate all assemblies based on several criteria.

De novo genome assembly evaluation remains challenging, as it represents a process that must account for variables such as the goal of an assembly and the existence of a ground-truth reference. A standard evaluation procedure was introduced in the literature by the two Assemblathon competitions [12,13], which outlined a selection of metrics that encompasses the most relevant aspects of genome assembly; however, these metrics require a reference sequence. Most of these metrics are adopted in our benchmark.

Consequently, this study addresses two main objectives. First, we provide a systematic comparison of five state-of-the-art long-read assembly tools, documenting their performance in assembling real and simulated PacBio Continuous Long Reads (CLRs), PacBio Circular Consensus Sequencing (CCS) HiFi reads, and Oxford Nanopore reads, generated from the genomes of S. cerevisiae, P. falciparum, C. elegans, A. thaliana, D. melanogaster, and T. rubripes. Our second objective is to investigate whether increased read length has a positive effect on overall assembly quality, given that increasing the length of reads is an ongoing effort in the development of Third Generation Sequencing platforms [14].

Materials and methods

Data

In this study, we use real and simulated data from various organisms to benchmark long-read de novo assembly tools.

Reference genomes

We selected six reference genomes from eukaryotic organisms represented in the Interactive Tree Of Life (iTOL) v6 [11]: S. cerevisiae (strain S288C), P. falciparum (isolate 3D7), C. elegans (strain VC2010), A. thaliana (ecotype Col-0), D. melanogaster (strain ISO-1), and T. rubripes. Assembly accessions are included in Supplementary Table S1.

The reference assemblies for C. elegans, D. melanogaster, and T. rubripes included uncalled bases. In these cases, before read simulation, each base N was replaced with base A, as done by Wick and Holt [7]. This avoids ambiguity in the read simulation process and consequently simplifies the evaluation of the simulated-read assemblies. As such, we used this modified version as a reference when evaluating all assemblies of simulated reads from these three genomes. In the evaluation of real-read assemblies, the original assemblies were used as references.
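As a minimal illustration of this preprocessing step, the sketch below replaces every uncalled base with an A in a FASTA file. The file names are placeholders, and the scripts actually used in the study may differ.

```python
# Sketch: replace uncalled bases (N) with A in a reference FASTA before read
# simulation, mirroring the preprocessing described above. File names are
# hypothetical; the study's own scripts may differ.
def replace_uncalled_bases(in_fasta: str, out_fasta: str) -> None:
    with open(in_fasta) as fin, open(out_fasta, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(line)  # keep sequence headers unchanged
            else:
                # replace both upper- and lower-case Ns in sequence lines
                fout.write(line.replace("N", "A").replace("n", "a"))

if __name__ == "__main__":
    replace_uncalled_bases("c_elegans_reference.fasta", "c_elegans_reference_noN.fasta")
```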
Simulated reads

All simulated read sets were generated using Badread v0.2.0 [15]. To create read error and QScore (quality score) models in addition to the simulator's own default models, Badread requires three inputs: a set of real reads, a high-quality reference genome, and an alignment file, obtained by aligning the reads to the reference genome. We used real read sets from the human genome to create error and QScore models that reflect the state of the art for three sequencing technologies: PacBio Continuous Long Reads (CLRs), PacBio Circular Consensus Sequencing (CCS) HiFi reads, and Oxford Nanopore reads.

To create the models, we used the real read sets sequenced from the human genome and aligned them to the latest high-quality human genome reference assembled by [16]: assembly T2T-CHM13v2.0, with RefSeq accession GCF_009914755.1. The alignment was performed using Minimap2 v2.24 [17] with default parameters. The sources for these sequencing data are outlined in Supplementary Table S2, together with the read identities for each technology, which are later passed as parameters in the simulation stage.

For each of the six reference genomes, we simulated reads that imitate PacBio CLR, PacBio HiFi, and Oxford Nanopore sequencing, with four different read length distributions, using Badread. The first read simulation represents the current state of the three long-read technologies. The other three simulations reflect read lengths in between the technology-specific values and ultra-long reads, read lengths similar to ultra-long reads, and read lengths longer than ultra-long reads. Since Badread's read length models are parameterized by gamma distributions, we need to define the mean and standard deviation of these distributions for each simulation. These values were selected as follows. First, we calculated the read length distributions of the real read sets in Supplementary Table S2 and simulated an initial iteration of reads using these technology-specific values.
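The technology-specific means and standard deviations can be derived directly from the real read sets; a minimal sketch of such a calculation is shown below. It assumes a plain, uncompressed FASTQ file, and the file name is a placeholder.

```python
# Sketch: estimate the mean and standard deviation of read lengths in a FASTQ
# file, as used to parameterize Badread's gamma-distributed read length model.
# Assumes a plain (uncompressed) FASTQ; the file name is a placeholder.
import statistics

def read_length_stats(fastq_path: str) -> tuple[float, float]:
    lengths = []
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each 4-line record holds the sequence
                lengths.append(len(line.strip()))
    return statistics.mean(lengths), statistics.pstdev(lengths)

if __name__ == "__main__":
    mean_len, stdev_len = read_length_stats("ont_human_reads.fastq")
    print(f"mean = {mean_len / 1000:.1f} kbp, stdev = {stdev_len / 1000:.1f} kbp")
```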
For choosing the values of the other three iterations, we analysed a set of Oxford Nanopore ultra-long reads used in the latest assembly of the human genome [16]. We selected GridION run SRR12564452, available as sequence data in BioProject PRJNA559484, with a mean read length of approximately 35.7 kbp and a standard deviation of 42.5 kbp.

A full overview of the mean and standard deviation of all four read length distributions is given in Table 1. Note that, for each of the technologies, the standard deviation for the last three distributions was derived from the mean, using the ratio between the mean and standard deviation reflected by the technology-specific values. Hence, for the last three iterations, the mean read length is consistent across sequencing technologies, but the standard deviation varies.

Table 1: The mean and standard deviation describing the read length distributions used in our simulations. Note that read length increases with each iteration, and the distribution parameters are different for each technology.

Read length distribution parameters (kbp), per technology

                                            PacBio CLR      PacBio HiFi     Oxford Nanopore
                                            Mean    Stdev   Mean    Stdev   Mean    Stdev
Iteration 1 (technology-specific values)    15.7    14.4    20.7    2.5     12.1    17.1
Iteration 2                                 25      22.5    25      3       25      35
Iteration 3 (imitates ultra-long reads)     35      31.5    35      4.2     35      49
Iteration 4                                 75      67.5    75      9       75      105

Consequently, we ran twelve simulations for each reference genome. As described above, we used our own models for each technology and passed them to the simulator via the --error_model and --qscore_model parameters. The read identities per technology were set to the values included in Supplementary Table S2. Across all simulations, we chose a coverage depth of 30x. Canu's documentation [18] specifies a minimum coverage of 20-25x for HiFi data and 20x for other types of data, while Flye's guidelines [19] indicate a minimum coverage of 30x. As there is no minimum recommended coverage indicated for the other assemblers in our benchmark, we simulated reads following the stricter of these two guidelines, that is, 30x coverage.

A summary of the Badread commands used in our simulation can be found in Supplementary Table S3. Note that, in the case of simulated HiFi reads, we additionally lowered the rates of glitches and of random, junk, and chimeric reads to reflect the higher accuracy of this technology. We set the percentage of chimeras to 0.04, as estimated by [20].
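For orientation, the sketch below shows how one such simulation run could be launched from Python. All paths and model names are placeholders, and the commands, models, and per-technology identity values actually used in the study are those listed in Supplementary Tables S2 and S3.

```python
# Sketch: one Badread v0.2.0 simulation run (PacBio CLR, iteration 1) driven
# from Python. Paths and model names are placeholders; the commands actually
# used in the study are listed in Supplementary Table S3.
import subprocess

cmd = [
    "badread", "simulate",
    "--reference", "s_cerevisiae_reference.fasta",
    "--quantity", "30x",                        # coverage depth used in all simulations
    "--length", "15700,14400",                  # mean,stdev in bp (Table 1, iteration 1, PacBio CLR)
    "--error_model", "pacbio_clr_error_model",  # custom model built from real human reads
    "--qscore_model", "pacbio_clr_qscore_model",
    # the per-technology --identity values from Supplementary Table S2 would be added here
]

with open("s_cerevisiae_pacbio_clr_iter1.fastq", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)  # Badread writes simulated reads to stdout
```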
Real reads

In support of our evaluation on simulated reads, we also performed a benchmark on real-read assemblies from Oxford Nanopore and PacBio reads sequenced from the reference genomes. These reads were sampled to approximately 30x coverage, to ensure a fair comparison with our simulated-read assemblies. The data sources for all real read sets are included in Supplementary Table S4.

Assemblies

Five long-read de novo assemblers are included in this benchmark: Canu v2.2 [18], Flye v2.9 [19], Redbean (also known as Wtdbg2) v2.5 [21], Raven v1.7.0 [22], and Miniasm v0.3_r179 [23].

The assemblies were performed with default values for most parameters. Canu and Wtdbg2 require the estimated genome size as a parameter, and we set the following values: S. cerevisiae = 12 Mbp, P. falciparum = 23 Mbp, A. thaliana = 135 Mbp, D. melanogaster = 139 Mbp, C. elegans = 103 Mbp, and T. rubripes = 384 Mbp. All commands used in the assembly pipelines are available in Supplementary Table S6. We note that further polishing of assemblies using high-fidelity short reads, although common in practice [24–26], is omitted in this study, as the focus is exclusively on assembler performance on long-read data and not on polishing tools.

We added a long-read polishing step for Miniasm and Wtdbg2, as their assembly pipelines do not include long-read based polishing. Following Raven's default pipeline, which performs two rounds of Racon polishing [27], we applied two rounds of Racon polishing to the Wtdbg2 and Miniasm assemblies. We note that for Miniasm, we used Minipolish [7], which simplifies Racon polishing by applying it in two iterations on the GFA (Graphical Fragment Assembly) files produced by the assembler. For both Miniasm and Wtdbg2, the alignments required for polishing were generated with Minimap2 v2.24.

Evaluation

We evaluated the assemblies using three categories of metrics. The COMPASS analysis compares the assemblies with their corresponding reference genome and provides insight into their similarities. The assembly statistics provide basic knowledge about contiguity and misassemblies. Finally, the BUSCO assessment investigates the presence of essential genes in the assemblies. Together, these three categories of metrics provide a comprehensive overview of assembly quality.

COMPASS analysis

For each assembly, we ran the COMPASS script to measure coverage, validity, multiplicity, and parsimony, as defined in Assemblathon 2 [13], to assess the quality of the assemblies. These metrics describe several characteristics that were deemed important for comparing de novo assembly tools, and were computed using three types of data: (1) the reference sequence, (2) the assembled scaffolds, and (3) the alignments (sequences from the assembled scaffolds that were aligned to the reference sequences). Definitions and formulas for the metrics are reported in Supplementary Table S5.

Assembly statistics and misassembly events

We use QUAST v5.0.2 [28] to measure the NG50 [12] of an assembly and the number of misassemblies. QUAST identifies misassemblies based on the definition outlined by [29]. The total number of misassemblies is the sum of all relocations, inversions, and translocations. Considering two adjacent flanking sequences, if they both align to the same chromosome but more than 1 kbp away from each other, or overlapping by more than 1 kbp, the event is counted as a relocation. If these flanking sequences, aligned to the same chromosome, are on opposite strands, the misassembly is considered an inversion. Lastly, translocations describe events in which two flanking sequences align to different chromosomes.
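The sketch below restates these definitions in code. It is a deliberately simplified reading of QUAST's classification (it ignores, for example, local misassemblies and the exact breakpoint bookkeeping), and the alignment record is hypothetical.

```python
# Sketch: simplified classification of a misassembly event from two adjacent
# flanking alignments, following the relocation/inversion/translocation
# definitions above. This illustrates the definitions only; it is not a
# re-implementation of QUAST's full logic.
from dataclasses import dataclass

@dataclass
class FlankAlignment:
    chromosome: str
    start: int   # reference start of the aligned flank
    end: int     # reference end of the aligned flank
    strand: str  # "+" or "-"

def classify_misassembly(left: FlankAlignment, right: FlankAlignment) -> str:
    if left.chromosome != right.chromosome:
        return "translocation"  # flanks align to different chromosomes
    if left.strand != right.strand:
        return "inversion"      # same chromosome, opposite strands
    gap = right.start - left.end  # negative values indicate an overlap
    if gap > 1000 or gap < -1000:
        return "relocation"     # flanks >1 kbp apart, or overlapping by >1 kbp
    return "no extensive misassembly"

print(classify_misassembly(FlankAlignment("chrI", 0, 50_000, "+"),
                           FlankAlignment("chrI", 120_000, 170_000, "+")))  # relocation
```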
BUSCO assessment

A BUSCO v5.4.2 assessment [30,31] is performed to evaluate the completeness of the essential genes in the assemblies, quantifying the number of single-copy, duplicated, fragmented, and missing orthologs in an assembled genome. From the set of orthologs specific to each lineage dataset, BUSCO identifies how many are present in the assembly (either as single-copy or duplicated), how many are fragmented, and how many are missing. We ran these evaluations with a different OrthoDB lineage dataset for each genome: S. cerevisiae - saccharomycetes, P. falciparum - plasmodium, A. thaliana - brassicales, D. melanogaster - diptera, C. elegans - nematoda, and T. rubripes - actinopterygii.

Results and discussion

Overview of the benchmarking pipeline

Figure 1 shows an overview of the benchmarking pipeline. We begin with the selection of six representative eukaryotes from the Interactive Tree Of Life [11]: S. cerevisiae, P. falciparum, A. thaliana, D. melanogaster, C. elegans, and T. rubripes. We also use three read sets from the latest human assembly project [16] to generate Badread error and QScore models [15] for PacBio Continuous Long Reads (CLRs), PacBio High Fidelity reads, and Oxford Nanopore reads (see Supplementary Table S2). The reference sequences and models become input to the Badread simulation stage. For each genome, we simulate reads with four different read length distributions and three sequencing technologies (see Table 1), amounting to a total of 12 simulated read sets per reference genome. These reads, as well as 13 real read sets, are assembled with five assembly tools: Canu, Flye, Miniasm, Raven, and Wtdbg2.
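As an indication of what these assembly runs look like, the sketch below launches two of the benchmarked assemblers on one simulated ONT read set. Paths and thread counts are placeholders, and the full set of commands actually used is given in Supplementary Table S6.

```python
# Sketch: example invocations of two of the benchmarked assemblers (Flye v2.9
# and Canu v2.2) on a simulated ONT read set. Paths and thread counts are
# placeholders; the commands used in the study are listed in Supplementary Table S6.
import subprocess

reads = "s_cerevisiae_ont_iter1.fastq"

# Flye: the read type is selected via the input flag (--nano-raw, --pacbio-raw, --pacbio-hifi)
subprocess.run(["flye", "--nano-raw", reads,
                "--out-dir", "flye_s_cerevisiae_ont", "--threads", "8"], check=True)

# Canu: requires an estimated genome size (12 Mbp for S. cerevisiae in this study)
subprocess.run(["canu", "-p", "s_cerevisiae", "-d", "canu_s_cerevisiae_ont",
                "genomeSize=12m", "-nanopore", reads], check=True)
```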
Next, the resulting assemblies are evaluated using COMPASS, QUAST, and BUSCO, and based on the reported metrics we distinguish six main evaluation categories: sequence identity, repeat collapse, rate of valid sequences, contiguity, misassembly count, and gene identification. The selected COMPASS metrics are the coverage, multiplicity, and validity of an assembly, which provide insight into sequence identity, repeat collapse, and the rate of valid sequences, respectively. In this regard, an ideal assembly has coverage, multiplicity, and validity close to 1. This suggests that a large fraction of the reference genome is assembled, repeats are generally collapsed instead of replicated, and most sequences in the assembly are validated by the reference. Among others, QUAST reports the number of misassemblies and the NG50 of an assembly. A high NG50 value is associated with high contiguity. In order to assess contiguity across genomes of different sizes, we report the ratio between the assembly's NG50 and the N50 of the reference. Lastly, gene identification is quantified in terms of the percentage of complete BUSCOs in an assembly.

The search for an optimal assembler is influenced by read sequencing technology, genome complexity, and research goal

To select an assembler that is most versatile across eukaryotic taxa, we simulate PacBio Continuous Long Reads (CLRs), PacBio High Fidelity (HiFi) reads, and Oxford Nanopore reads from the genomes of six model eukaryotes, assemble these reads, and evaluate the assemblers in the six main categories mentioned in the previous section. The results for each evaluation category are normalized to the range given by the worst and best values encountered in the evaluation of all assemblies of reads with default length. This highlights differences between assemblers, as well as between genomes and sequencing technologies.

The results of the benchmark on the read sets with default lengths, namely those belonging to the first iteration (see Table 1), are illustrated in Figure 2. A full report of the evaluation metrics in this figure is included in Supplementary Tables S7–S24, under "Iteration 1". We note that no assembler unanimously ranks first in all categories across different sequencing technologies and eukaryotic genomes, although our findings highlight some of their strengths and thus their potential for various research aims. The runtime and memory usage of the assembly tools on all simulated datasets are reported in Supplementary Tables S25–S30, since computational cost can also be a deciding factor, next to assembly quality, when researchers choose a suitable assembler for their purpose. We note that all assemblies were run on our local High Performance Computing cluster, and the runtime and RAM usage may have been affected by the heterogeneity of the shared computing environment in which the assembly jobs executed.

Miniasm, Raven, and Wtdbg2 are all well-rounded choices for the simpler S. cerevisiae, P. falciparum, and C. elegans genomes, with a balanced trade-off between assembly quality and computational resources. For PacBio HiFi reads, Raven is generally outperformed qualitatively by other assemblers such as Canu, Flye, and Miniasm, likely because its pipeline is not customized for all long-read sequencing technologies. Nonetheless, if computational resources are a concern, Raven is a more suitable choice, since Miniasm and Wtdbg2 do not scale well for larger genomes.

We can single out Flye as the most robust assembler across all six organisms, although for larger genomes such as T. rubripes, Canu is a better tool. Both produce assemblies with high sequence identity and validity, as well as good gene prediction, but Flye assemblies generally rank first when we compute the average score across all six metrics.
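To make this scoring procedure concrete, the sketch below applies the min-max normalization and averaging described above to a toy set of per-category values. The numbers and category values are illustrative only, not actual benchmark results.

```python
# Sketch: min-max normalization of the six evaluation categories followed by
# averaging, as used to rank assemblers. Values are illustrative placeholders.
# Because each category supplies its own "worst" and "best" observed value,
# categories where lower is better (e.g. misassembly count) are handled by the
# same formula: the best observed value always normalizes to 1, the worst to 0.

def min_max(value, worst, best):
    if best == worst:
        return 1.0
    return (value - worst) / (best - worst)

# toy per-category results for one genome/technology combination (hypothetical numbers)
categories = {
    #                      (value, worst observed, best observed)
    "sequence identity":   (0.97, 0.90, 0.99),
    "repeat collapse":     (1.05, 1.40, 1.00),
    "validity":            (0.96, 0.88, 0.99),
    "contiguity":          (0.45, 0.05, 0.80),
    "misassembly count":   (120,  900,  30),
    "gene identification": (0.93, 0.70, 0.98),
}

scores = [min_max(value, worst, best) for value, worst, best in categories.values()]
average_score = sum(scores) / len(scores)
print(f"average score across the six categories: {average_score:.2f}")
```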
For Canu, we notice more variation in assembly quality across different genomes, particularly for P. falciparum and A. thaliana, while Flye maintains more consistent results. Nonetheless, on the T. rubripes genome, Canu assemblies have higher sequence identity and contiguity, as well as more accurate gene identification.

Figure 2: The performance of the five assemblers on the read sets with default read lengths, from iteration 1 (see Table 1), generated from six eukaryotic genomes. Six evaluation categories are reported for each assembler, and the results are normalized among all assemblies included in the figure. Ranges for each metric are reported as the best and worst values computed for these assemblies. The best-performing assembler is highlighted for each read set and marked with a star.

Evaluation of real-read assemblies supports our rankings on simulated-read assemblies

To determine assembler performance on real reads and validate the rankings of the simulated-read assemblies, we assemble several real read sets from the six reference eukaryotes (Supplementary Table S4). The evaluation results on the real-read assemblies, summarized in Figure 3, indicate that assemblers which perform well on simulated reads perform similarly well in assembling the sets of real reads. The full report of metrics on the real-read assemblies is included in Supplementary Table S31. We conclude that, overall, the assembler rankings remain consistent. This illustrates that benchmarking on simulated data yields results comparable to benchmarking on real read sets. For reference-based metrics, we used the reference genomes given in Supplementary Table S1.

Notably, reference-based metrics in the evaluation of real-read assemblies rely on comparisons with an assembly, and not with the genome from which the reads were initially sequenced. In contrast to the evaluation of simulated-read assemblies, a ground-truth reference is not available in this case, but reference-based metrics are included for the sake of consistency with the simulated-read evaluation.

In the evaluation of real-read assemblies, Flye ranks first for nearly all datasets, with the exception of the T. rubripes and C. elegans PacBio reads, for which Raven performs better overall. However, even for C. elegans, Flye's performance is close to the best values in all metrics other than contiguity. As expected, overall assembler performance decreases for reference-based metrics such as sequence identity, repeat collapse, and validity, but surprisingly the misassembly count is considerably lower.
Figure 3: The performance of the five assemblers on the real reads (see Supplementary Table S4), sequenced from six eukaryotic genomes. As in Figure 2, six evaluation categories are reported for each assembler, and the results are normalized among all assemblies included in the figure. Ranges for each metric are reported as the best and worst values computed for these assemblies. The best-performing assembler is highlighted for each read set and marked with a star.

Longer reads lead to more contiguous assemblies of large genomes, but do not always improve assembly quality

To investigate the effect of increased read length on assembly quality, we use Badread to simulate Oxford Nanopore, PacBio CLR, and PacBio HiFi reads with different read length distributions (Table 1) from the genomes of S. cerevisiae, P. falciparum, C. elegans, A. thaliana, D. melanogaster, and T. rubripes. We assemble these reads with five state-of-the-art long-read assemblers and evaluate assembly quality based on six evaluation categories (see Overview of the benchmarking pipeline). It is worth mentioning that the Canu iteration 4 assemblies (the longest reads) of A. thaliana and T. rubripes did not finish within a reasonable time and are excluded from the evaluation.

Figure 4 shows a summary of the assemblers' performance on all simulated read sets, highlighting changes in performance for each read length distribution. All six evaluation metrics are normalized given the maximum and minimum metric values per genome, per sequencing technology, and combined to obtain an average score. We then average these three scores again and report a score between 1 and 10 for each assembler, per read length distribution. The results on all computed metrics are fully described in Supplementary Tables S7–S24.

The results imply that there is a correlation between the size and complexity of the reference genome and the extent of the improvement in assembly quality that can be achieved by increasing the length of the reads. While we observe no trend in assembly quality improvement for the assemblies of smaller genomes, the results on the T. rubripes assemblies are more conclusively in favour of the longer reads. For instance, on the shorter and simpler S. cerevisiae and P. falciparum genomes, identification of repetitive and complex regions is not aided by increased read length, likely because these regions are already spanned by the reads with default lengths. However, the benchmark results suggest that more complex and repetitive regions within the A. thaliana, D. melanogaster and, most notably, T. rubripes genomes are better captured by longer reads.

As recorded in Supplementary Tables S22–S24, for larger genomes, longer reads generally lead to significantly higher assembly contiguity and a lower misassembly count.
The latter implies that the resulting assemblies are more faithful to the references, although this is not necessarily supported by the other metrics. We cannot report any compelling improvements in sequence identity, multiplicity, validity, and gene identification.

Figure 4: The performance of the five assemblers on all simulated read sets, with four different read length distributions (as previously described in Table 1). A score of 1-10 is reported for each assembler. The results are normalized for each genome, per sequencing technology. An average score for each read length distribution is first computed per technology (ONT, PacBio CLR, PacBio HiFi), and then these three scores are averaged to obtain an overall score per read length distribution.

Conclusion

In fulfilment of the first objective of this study, we conclude that Flye is the highest-performing assembler when considering the overview of all evaluation categories in this benchmark, which include sequence identity, repeat collapse, rate of valid sequences, contiguity, misassembly count, and gene identification. Rankings are mostly consistent for all three sequencing platforms included in the study: PacBio CLR, PacBio HiFi, and ONT. However, no assembler ranks first in all evaluation categories, suggesting that the choice of assembler is often a trade-off between certain advantages and disadvantages. We have therefore corroborated, for eukaryotic organisms, the conclusion of Wick and Holt [7], who benchmarked long-read assemblers on prokaryotes, and recommend that these benchmarking criteria be considered in relation to the desired outcome of an assembly experiment.

Additionally, the tests performed on real reads validate our rankings of the simulated-read assemblies. Flye, the assembler that scored consistently well in most evaluation categories for assemblies of simulated reads, also ranks first when evaluated on several sets of real reads sequenced on long-read platforms.

Regarding our second objective, which addressed the effect of increased read length on assembly quality, the benchmarking of assemblers on read sets with different read length distributions suggests that longer reads have the potential to improve assembly quality. However, this depends on the size and complexity of the genome that is being reconstructed. We found that improvements in contiguity were the most significant among all metrics, in line with the conclusion of [8], who showed that using third-generation sequencing considerably improves contiguity when assembling a plant genome (M. jansenii).
However, we did not find significant improvements in other aspects of assembly quality, such as sequence identity or gene identification.

Data availability

All accessions for the reference genomes used in this study are included in Supplementary Table S1. The read sets that were used for the creation of error and QScore models for the simulator are included in Supplementary Table S2. These models are available at https://github.com/AbeelLab/long-read-assembly-benchmark. The accessions for the real reads we assembled are included in Supplementary Table S4. All other data are reproducible as per the commands in Supplementary Tables S3 and S6.

Code availability

Our evaluations were produced with QUAST v5.0.2 [28], BUSCO v5.4.2 [30,31], and COMPASS [13]. We also provide the scripts we used at https://github.com/AbeelLab/long-read-assembly-benchmark.

References

1. Boycott KM, Vanstone MR, Bulman DE, MacKenzie AE. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nature Reviews Genetics. 2013; doi: 10.1038/nrg3555.
2. Bras J, Guerreiro R, Hardy J. Use of next-generation sequencing and other whole-genome strategies to dissect neurological disease. Nature Reviews Neuroscience. 2012; doi: 10.1038/nrn3271.
3. Grada A, Weinbrecht K. Next-Generation Sequencing: Methodology and Application. Journal of Investigative Dermatology. 2013; doi: 10.1038/jid.2013.248.
4. Schlötterer C, Kofler R, Versace E, Tobler R, Franssen SU. Combining experimental evolution with next-generation sequencing: a powerful tool to study adaptation from standing genetic variation. Heredity. 2015; doi: 10.1038/hdy.2014.86.
5. Salazar AN, Gorter de Vries AR, van den Broek M, Wijsman M, de la Torre Cortés P, Brickwedde A, et al. Nanopore sequencing enables near-complete de novo assembly of Saccharomyces cerevisiae reference strain CEN.PK113-7D. FEMS Yeast Res. 2017; doi: 10.1093/femsyr/fox074.
6. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biology. 2020; doi: 10.1186/s13059-020-1935-5.
7. Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2021; doi: 10.12688/f1000research.21782.4.
8. Murigneux V, Rai SK, Furtado A, Bruxner TJC, Tian W, Harliwong I, et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience. 2020; doi: 10.1093/gigascience/giaa146.
9. Jung H, Jeon M-S, Hodgett M, Waterhouse P, Eyun S. Comparative Evaluation of Genome Assemblers from Long-Read Sequencing for Plants and Crops. Journal of Agricultural and Food Chemistry. 2020; doi: 10.1021/acs.jafc.0c01647.
10. Chen Z, Erickson DL, Meng J. Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing. International Journal of Molecular Sciences. 2020; doi: 10.3390/ijms21239161.
11. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research. 2021; doi: 10.1093/nar/gkab301.
12. Earl D, Bradnam K, John JS, Darling A, Lin D, Fass J, et al. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research. 2011; doi: 10.1101/gr.126599.111.
13. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013; doi: 10.1186/2047-217X-2-10.
14. Dijk EL van, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends in Genetics. 2014; doi: 10.1016/j.tig.2014.07.001.
15. Wick R. Badread: simulation of error-prone long reads. JOSS. 2019; doi: 10.21105/joss.01316.
16. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022; doi: 10.1126/science.abj6987.
17. Li H. Minimap2: pairwise alignment for nucleotide sequences. Birol I, editor. Bioinformatics. 2018; doi: 10.1093/bioinformatics/bty191.
18. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. 2017; doi: 10.1101/gr.215087.116.
19. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019; doi: 10.1038/s41587-019-0072-8.
20. Tvedte ES, Gasser M, Sparklin BC, Michalski J, Hjelmen CE, Johnston JS, et al. Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes. G3 Genes|Genomes|Genetics. 2021; doi: 10.1093/g3journal/jkab083.
21. Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nature Methods. 2020; doi: 10.1038/s41592-019-0669-3.
22. Vaser R, Šikić M. Time- and memory-efficient genome assembly with Raven. Nature Computational Science. 2021; doi: 10.1038/s43588-021-00073-4.
23. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016; doi: 10.1093/bioinformatics/btw152.
24. Chen Z, Erickson DL, Meng J. Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses. Genomics. 2021; doi: 10.1016/j.ygeno.2021.03.018.
25. Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing technologies: An overview. Human Immunology. 2021; doi: 10.1016/j.humimm.2021.02.012.
26. Wick RR, Holt KE. Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLOS Computational Biology. 2022; doi: 10.1371/journal.pcbi.1009802.
27. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research. 2017; doi: 10.1101/gr.214270.116.
28. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013; doi: 10.1093/bioinformatics/btt086.
29. Barthelson R, McFarlin AJ, Rounsley SD, Young S. Plantagora: Modeling Whole Genome Sequencing and Assembly of Plant Genomes. PLoS ONE. 2011; doi: 10.1371/journal.pone.0028436.
30. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; doi: 10.1093/bioinformatics/btv351.
31. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution. 2018; doi: 10.1093/molbev/msx319.