BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences Minoru Kanehisa1 , Yoko Sato2 and Kanae Morishima1 1 - Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan 2 - Healthcare Solutions Department, Fujitsu Kyushu Systems Ltd., Hakata-ku, Fukuoka 812-0007, Japan Correspondence to Minoru Kanehisa: kanehisa@kuicr.kyoto-u.ac.jp http://dx.doi.org/10.1016/j.jmb.2015.11.006 Edited by M. Sternberg Abstract BlastKOALA and GhostKOALA are automatic annotation servers for genome and metagenome sequences, which perform KO (KEGG Orthology) assignments to characterize individual gene functions and reconstruct KEGG pathways, BRITE hierarchies and KEGG modules to infer high-level functions of the organism or the ecosystem. Both servers are made freely available at the KEGG Web site (http://www.kegg.jp/blastkoala/). In BlastKOALA, the KO assignment is performed by a modified version of the internally used KOALA algorithm after the BLAST search against a non-redundant dataset of pangenome sequences at the species, genus or family level, which is generated from the KEGG GENES database by retaining the KO content of each taxonomic category. In GhostKOALA, which utilizes more rapid GHOSTX for database search and is suitable for metagenome annotation, the pangenome dataset is supplemented with Cd-hit clusters including those for viral genes. The result files may be downloaded and manipulated for further KEGG Mapper analysis, such as comparative pathway analysis using multiple BlastKOALA results. © 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Introduction Genome sequencing has become a routine task in many areas of biological sciences. It also has the potential to revolutionize healthcare, such as in personalized medicine using personal genome information and in combating infectious diseases and drug resistance using pathogen genome information. There is a growing need for better bioinformatics methods to fully make use of such genome sequencing data. We have been developing the KEGG (Kyoto Encyclopedia of Genes and Genomes) bioinformatics resource as a reference knowledge base for biological interpretation of genome sequences and other highthroughput data, especially for uncovering high-level functions and utilities of the cell, the organism and the ecosystem [1]. One unique aspect of the KEGG resource is that functional information is associated with ortholog groups and stored in the KO (KEGG Orthology) database, separately from the KEGG GENES database that accumulates all fully sequenced genomes. Genome annotation in KEGG is simply to assign KO identifiers (K numbers) rather than to rewrite text definitions of genes in the GENES database. In general, KO grouping of functional orthologs is defined in the context of KEGG pathways and other molecular networks, which are in fact represented as networks of nodes identified by K numbers. Therefore, once genes in the genome are given K numbers, KEGG pathways and other molecular networks can be automatically reconstructed. Internally, the GENES database is accompanied by the SSDB database, which contains amino acid sequence similarity scores and best-hit relations for all gene pairs in pairwise genome comparisons computed by the SSEARCH program [2]. SSDB is a huge weighted, directed graph and has been used to computationally define Ortholog Clusters as quasi-cliques of bidirectional best hits with significant similarity scores [3]. Genome annotation in KEGG involves both manual and computational procedures. 0022-2836/© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). J Mol Biol (2016) 428, 726–731 Databases/ Web Servers Basically, members of a KO group are first manually selected defining a core subgraph in the SSDB graph, which is then computationally extended by the KOALA (KEGG Orthology and Links Annotation) program [1]. Here we report automatic annotation servers for outside users, BlastKOALA and GhostKOALA, which are based on the methods similar to those used in internal annotations and the newly developed nonredundant GENES database. Brief comparisons are made to other servers including KAAS [4], RAST [5] and MG-RAST [6]. Results and Discussion Improved KO and GENES databases The KO database is a collection of orthologs, most of which are members of the KEGG molecular networks, including KEGG pathways, BRITE hierarchies and KEGG modules. There arealso unclassified KO entries whose relationships to molecular networks are unknown. In order to facilitate sequence informationbased assignment, each KO entry is defined as a sequence similarity group including some complicated cases. For example, one KO entry may consist of multiple sequence similarity groups, or one KO entry may be the remaining portion of a large sequence similarity group after extracting a closely related portion defined as another KO entry. Efforts are being made to associate each KO entry with experimental evidence of functionally characterized sequence data. As of June 2015, about 19,000 KO entries are made available, and PubMed links and sequence links are included in 75% and 42%, respectively. By the genome annotation procedure in KEGG, about 46% of 17 million genes are annotated with K numbers. The GENES database can be seen as structured by the assigned K number groups, where each group corresponds to a sequence informationbased extension of experimental evidence and functional information. Thus, the sequence similarity search against the structured GENES database is a search for most appropriate K numbers, which can easily be computerized as implemented in the KOALA and KAAS programs. This is a clear advantage over most other sequence databases, where functional information is associated with individual proteins or genes and the sequence similarity search will require processing of search results, which may contain a large amount of data. The GENES database is a collection of complete genomes taken from RefSeq and GenBank databases, each identified by the three- or four-letter organism code, such as “hsa” for Homo sapiens. We have recently added three new categories with two-letter codes: “pg” for plasmid genes taken from the RefSeq plasmid section excluding those already represented in the KEGG complete genomes, “vg” for viral genes taken from the RefSeq viral section and “ag” for the internally developed and publication-based collection of KEGG addendum genes. The “ag” category is a gene-based, not genomebased, collection and now fills the gap of missing sequence data in some pathway maps, where KO entries do not correspond to any GENES entries and simply linked to UniProt database because complete genome sequences have not been determined. With these new categories, the repertoire of KO entries has been significantly expanded toward capturing all knowledge on gene/protein functions. Non-redundant dataset of pangenome sequences As many closely related bacterial strains are sequenced, it has become customary to define selected sets of genomes for annotation and other purposes, such as reference and representative genomes in RefSeq [7] and reference proteomes in UniProt [8]. As of June 12, 2015, there are 3904 KEGG organisms (complete genomes): 304 eukaryotes in the ranks of 299 species, 227 genera and 172 families, as well as 3600 prokaryotes in the ranks of 1858 species and 809 genera. See the following Web pages for the current version: KEGG organisms† , KEGG species list‡ and KEGG genus list§ . Each species or genus name in these lists is linked to another Web page for analyzing an organism group, such as how pathway maps and BRITE hierarchies can be reconstructed from the combined set of assigned K numbers in the group. When a new genome is added in KEGG, the SSEARCH comparison is performed against all genomes in the GENES database for subsequent K number assignment by KOALA. Because of the size of the GENES database, this procedure is too time consuming to be implemented as a Web service. Thus, for use in the BlastKOALA and GhostKOALA servers, we have devised a way to reduce the size of the GENES database while retaining the functional content of the gene set in each taxonomic rank of species, genus or family. When multiple members are present in each species/genus/family group, the first genome in the KEGG organisms list is taken as a representative genome. When the other members in the group contain different K numbers that are not present in the representative genome, those genes are added as if they are present in additional chromosomes or plasmids. Therefore, the resulting dataset is not simply a collection of representative genomes. It is rather a non-redundant dataset of pangenome sequences in terms of the functional content of genes designated by assigned K numbers. The degree of reduction was as follows: 4,660,929 sequences in 304 eukaryotes were reduced to 4,598,413 (98.7%), 3,719,997 (79.8%) and 2,803,038 (60.1%) sequences at the species, genus 727Review: BlastKOALA and GhostKOALA and family levels, respectively, while 11,315,630 sequences in 3600 prokaryotes were reduced to 6,148,174 (54.3%) and 2,698,820 (23.9%) sequences at the species and genus levels, respectively. The annotation of metagenome sequences usually involves assignment of both functional and taxonomic information. The non-redundant dataset of pangenome sequences is supplemented by sequence similarity groups of genes without assigned K numbers to better characterize taxonomic compositions of metagenomes. This is accomplished by creating Cd-hit clusters [9] with 50% identity cutoff for each species/genus/ family group, and sequences are selected from each cluster when no members have assigned K numbers and no members from the representative genome are included. As a result, the number of family-level eukaryotic sequences increased from 2,803,038 to 3,423,406 (122%) and the number of genus-level prokaryotic sequences increased from 2,698,820 to 3,913,518 (145%). The new addendum (ag) and plasmid (pg) categories of the GENES database have also been included in the non-redundant dataset of pangenome sequences. The ag and pg sequences are grouped into species with the NCBI taxonomy identifier, such as 562 for Escherichia coli, and each species/genus/ family group is processed in the same way as the KEGG organisms. For the viral (vg) category, since the K number assignment rate is very low, currently 7% for viruses versus 46% for KEGG organisms, a separate dataset is created simply from the Cd-hit clusters with 90% identity cutoff. K number assignment In the internal annotation of the GENES database, the GFIT (Gene Function Identification Tool) table is generated from SSDB computation results, where a list of top-scoring genes (target genes) in all other genomes is shown for each query gene in each genome. The assignment of a K number to the query gene is performed by the KOALA algorithm, which computes the weighted sum of SW (Smith–Waterman) scores for each K number group, as well as the no K number group, of target genes. The weighting factors include the bidirectional best-hit information, the overlap length of the alignment, the ratio of the query and target sequence lengths, the degree of matches of taxonomic categories if known and the degree of matches of Pfam [10] domains. There are also some additional rules for how many target genes are to be examined before deciding a K number to be given or not to be given. A variant of this is a recently introduced rule for identical and almost identical sequences. For a predefined set of K numbers, which are called tight KOs, a single hit of an identical sequence or a few hits of almost identical sequences are considered sufficient to assign the K number of a tight KO. Tight KOs have been used, for example, to characterize antimicrobial resistance and distinguish substrate specificity in biodegradation pathways. The BlastKOALA and GhostKOALA servers utilize the BLASTP [11] and GHOSTX [12] programs, respectively, for searching the non-redundant GENES database. The search result by BLASTP or GHOSTX is converted into a GFIT-like table where multiple hits (paralogs) in each target genome are combined and only the top-scoring hit is shown. In BlastKOALA, the K number assignment is performed using the weighted sum of BLAST bit scores, where the weighting scheme is the same as the internally used KOALA algorithm excluding the bidirectional best-hit information. In GhostKOALA, the K number assignment is simply based on the sum of GHOSTX normalized scores without considering any weighting factors. The new rule to assign tight KOs is included only in BlastKOALA. Implementation and comparison The BlastKOALA and GhostKOALA servers are made freely available at the kegg.jp Web site (see Table 1 for URLs). Both servers are e-mail based. A job request from the Web interface can either be confirmed or be canceled by clicking on the link in the automatically sent e-mail, and the annotation result such as shown in Fig. 1 can be viewed from Table 1. KEGG automatic annotation servers and associated KEGG Mapper tools Program URL Content BlastKOALA www.kegg.jp/blastkoala/ Automatic annotation and KEGG mapping server suitable for functional characterization of complete genomes GhostKOALA www.kegg.jp/ghostkoala/ Automatic annotation and KEGG mapping server suitable for functional characterization of metagenomes Annotate Sequence www.kegg.jp/kegg/tool/annotate_sequence.html An interactive version of BlastKOALA suitable for use when related genomes are present in KEGG Reconstruct Pathway www.kegg.jp/kegg/tool/map_pathway.html KEGG Mapper tool for reconstructing KEGG pathway maps Reconstruct Brite www.kegg.jp/kegg/tool/map_brite.html KEGG Mapper tool for reconstructing BRITE functional hierarchies Reconstruct Module www.kegg.jp/kegg/tool/map_module.html KEGG Mapper tool for reconstructing KEGG modules Map Taxonomy www.kegg.jp/kegg/tool/map_taxonomy.html KEGG Mapper tool for reconstructing KEGG taxonomy or NCBI taxonomy 728 Review: BlastKOALA and GhostKOALA the link in the job completion notification e-mail. There is an option to run BlastKOALA interactively without giving an e-mail address. It is implemented as the Annotate Sequence tool in KEGG Mapper (Table 1), but this option is limited to searching a single pangenome at the genus or family level. Generally speaking, BlastKOALA is suitable for annotating fully sequenced genomes, while GhostKOALA is recommended for annotating a large amount of metagenome sequences. This is because GHOSTX uses suffix arrays for finding matching sequences and runs 100 times faster than BLAST, although the sensitivity is not as good as BLAST. Table 2 shows a summary of sample runs with different search programs and different datasets to be searched. The query genome was Kangiella geojedonensis [13], before it was included in KEGG with the organism code of kge (T03889). The first row of Table 2, SSDB/KOALA, is the subsequent annotation of this genome using the standard internal procedure of SSDB computation and KOALA assignment. Out of 2208 genes in the genome, 1368 genes were automatically annotated. By taking this set as a reference, we compared the performance of BlastKOALA with two different datasets, GhostKOALA, KAAS and the Annotate Sequence tool. As expected, the best accuracy was achieved by BlastKOALA with the species-level dataset, followed by BlastKOALA with the genus-level dataset and GhostKOALA, but the performance must be evaluated together with the execution time. The Annotate Sequence tool very much depends on the available dataset and may not have been very effective in this case because there was only a single species of Kangiella in KEGG. The KAAS server reported a number of false positives, presumably Fig. 1. Examples of (a) the BlastKOALA result page and (b) the GhostKOALA result page. Table 2. A comparison of K number assignments for the genome of K. geojedonensis. Program Dataset Size Elapsed timea Assignedb Match Mismatch False positive False negative SSDB/KOALA 1368 1368 BlastKOALA species_prokaryotes 6,098,670 1:20:54 1359 1350 9 18 BlastKOALA genus_prokaryotes 2,687,538 41:11 1342 1331 1 10 36 GhostKOALA c_genus_prokaryotes 3,889,798 6:57 1289 1278 2 9 88 KAAS representative set 134,684 25:29 1393 1239 50 104 79 Annotate Sequence Kangiella 2,632 1:55 1309 1280 12 17 76 a Elapsed time was measured on Sun Fire X4600 for KAAS and on SGI Altix UV1000 for all other programs. It is dependent on how many other jobs are running at the same time. b The number of genes in the genome was 2208. 729Review: BlastKOALA and GhostKOALA because the search is limited to a selected set of genomes and the algorithm to assign K numbers [4] is different from KOALA. We also made a comparison with the RAST server [5] by submitting the same genome as a query. The server assigned 1530 FIGfams (protein families), which can be compared to 1359 KOs assigned by BlastKOALA. We compared EC numbers given to FIGfams and KOs. The number of genes with full EC numbers was 581 in RAST and 643 in BlastKOALA and 460 were perfect matches. There is a conceptual difference in the design and implementation of the servers. RAST and its metagenome version, MGRAST, are an environment for the entire processing and subsequent analysis of genome and metagenome sequences, while BlastKOALA and GhostKOALA are meant to be used as a modular unit that may be integrated in the user's environment. BlastKOALA and GhostKOALA can also handle eukaryotic genomes and metagenomes containing eukaryotes, which is a clear advantage. KEGG Mapper analysis Figure 1a shows an example of the BlastKOALA result page, where the user can view or download the annotation data (the list of query genes and assigned K numbers) and perform KEGG Mapper analysis. The pie chart shows the functional category of annotated genes according to the KO system|| . In BlastKOALA, the BLAST search result in the form of a GFIT table can be examined for each query gene from the “View” page of annotation data. In GhostKOALA, the result page (Fig. 1b) contains an additional pie chart for the taxonomic composition of the query data together with a link for more detailed analysis, which is especially useful for metagenome data. Each query gene is assigned a taxonomic category according to the best-hit gene in the Cd-hit cluster supplemented version of the non-redundant pangenome dataset. The KEGG Mapper links generate lists of reconstructed KEGG pathways, BRITE hierarchies and KEGG modules according to the assigned K numbers, which will allow interpretation of high-level functions encoded in the genome or the metagenome. The same analysis can be performed by downloading the annotation data and using the Reconstruct Pathway, Reconstruct Brite and Reconstruct Module tools in the KEGG Mapper Web site (Table 1). Since the BlastKOALA/GhostKOALA result page is deleted after 1 week of job completion, it is recommended to locally store the annotation result (list of K number assignments). If the query genome size exceeds the limit of the server, it would be necessary to split the query data, download the result files, concatenate them and perform KEGG Mapper analysis. Furthermore, multiple results from different query genomes may be combined and, for example, displayed with different coloring on KEGG pathway maps by the Reconstruct Pathway tool. In GhostKOALA, the taxonomy data may also be downloaded and used to analyze, for example, metabolic capacities of selected taxonomic groups by using KEGG Mapper tools. Other types of analyses and usage In addition to the predefined set of analysis tools in KEGG Mapper, the downloaded data may be used for other types of analyses. One is the chemical reaction network analysis of small molecules, which can be performed by converting K numbers into R numbers of the KEGG REACTION database. Alternatively, K numbers may be converted into EC numbers in the KEGG ENZYME database, which contains most up-to-date enzyme information taken from the Enzyme Nomenclature database [14]. It is recommended, however, to use the K number-to-R number links because one EC number may represent multiple reactions and multiple enzyme sequence families. Another is the phylogenetic analysis. Each KO entry is associated with the taxonomic distribution of assigned genes. This may be used to identify phylogenetically interesting genes, such as eukaryotelike genes in an archaeal genome. The EC numbers and the NCBI taxonomy IDs may also be useful for integration of KEGG annotation results with other data and analysis tools. Furthermore, KEGG contains disease and drug information linked to disease genes and drug targets in human, together with special pathway maps with the five-letter code of “hsadd”. Therefore, if genes in the query genome can be linked to human orthologs through the assigned KOs, the analysis of diseases and drugs may be performed. The BlastKOALA and GhostKOALA servers reported in this paper have been used internally for automatic annotation of draft genomes in the KEGG DGENES database and metagenomes in the KEGG MGENES database, respectively. In fact, Fig. 1 was the results of annotating a butterfly genome ofAcropora digitifera (T10031) [15] in DGENES and a sample of Tara Oceans data (T30798) [16] in MGENES. The KEGG Mapper data linked from the result pages of Fig. 1 are equivalent to those made available under the categories of Pathway map, Brite hierarchy and Module in the T10031 and T30798 genome pages at the KEGG Web site. Acknowledgements We thank Dr. Hiroyuki Ogata for helpful comments on metagenome analysis. The KEGG project is partially supported by the National Bioscience Database Center of the Japan Science and Technology Agency. The computational resource for the BlastKOALA and 730 Review: BlastKOALA and GhostKOALA GhostKOALA servers is provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University. Received 26 June 2015; Received in revised form 23 September 2015; Accepted 5 November 2015 Available online 14 November 2015 Keywords: genome annotation; metagenome analysis; taxonomic composition; KEGG Orthology; KEGG pathway mapping † http://www.kegg.jp/kegg/catalog/org_list.html. ‡ http://www.kegg.jp/kegg/catalog/org_list4.html. § http://www.kegg.jp/kegg/catalog/org_list5.html. || http://www.kegg.jp/brite/ko00001. References [1] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, M. Tanabe, Data, information, knowledge and principle: Back to metabolisminKEGG,NucleicAcidsRes.42(2014)D199–D205. [2] W.R. Pearson, Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms, Genomics 11 (1991) 635–650. [3] A. Nakaya, T. Katayama, M. Itoh, K. Hiranuka, S. Kawashima, Y. Moriya, S. Okuda, M. Tanaka, T. Tokimatsu, Y. Yamanishi, A.C. Yoshizawa, M. Kanehisa, S. Goto, KEGG OC: A largescale automatic construction of taxonomy-based ortholog clusters, Nucleic Acids Res. 41 (2013) D353–D357. [4] Y. Moriya, M. Itoh, S. Okuda, A. Yoshizawa, M. Kanehisa, KAAS: An automatic genome annotation and pathway reconstruction server, Nucleic Acids Res. 35 (2007) W182–W185. [5] R.K. Aziz, D. Bartels, A.A. Best, M. DeJongh, T. Disz, R.A. Edwards, K. Formsma, S. Gerdes, E.M. Glass, M. Kubal, F. Meyer, G.J. Olsen, R. Olson, A.L. Osterman, R.A. Overbeek, L.K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G.D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke, O. Zagnitko, The RAST Server: Rapid annotations using subsystems technology, BMC Genomics 9 (2008) 75. [6] F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E.M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R.A. Edwards, The metagenomics RAST server—A public resource for the automatic phylogenetic and functional analysis of metagenomes, BMC Bioinf. 9 (2008) 386. [7] T. Tatusova, S. Ciufo, B. Fedorov, K. O'Neill, I. Tolstoy, RefSeq microbial genomes database: New representation and annotation strategy, Nucleic Acids Res. 42 (2014) D553–D559. [8] UniProt Consortium, UniProt: A hub for protein information, Nucleic Acids Res. 43 (2015) D204–D212. [9] W. Li, A. Godzik, Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659. [10] R.D. Finn, A. Bateman, J. Clements, P. Coggill, R.Y. Eberhardt, S.R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E.L. Sonnhammer, J. Tate, M. Punta, Pfam: The protein families database, Nucleic Acids Res. 42 (2014) D222–D230. [11] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J.Lipman, Gapped BLASTand PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402. [12] S. Suzuki, M. Kakuta, T. Ishida, Y. Akiyama, GHOSTX: An improved sequence homology search algorithm using a query suffix array and a database suffix array, PLoS One 9 (2014) e103833. [13] H. Choe, S. Kim, J. Oh, A. Nasir, B.K. Kim, K.M. Kim, Complete genome of Kangiella geojedonensis KCTC 23420T, putative evidence for recent genome reduction in marine environments, Mar. Genomics (2015) http://dx.doi.org/10.1016/j.margen. 2015.05.015. [14] A.G. McDonald, S. Boyce, K.F. Tipton, ExplorEnz: The primary source of the IUBMB enzyme list, Nucleic Acids Res. 37 (2009) D593–D597. [15] Heliconius Genome Consortium, Butterfly genome reveals promiscuous exchange of mimicry adaptations among species, Nature 487 (2012) 94–98. [16] P. Bork, C. Bowle, C. de Vargas, G. Gorsky, E. Karsenti, P. Wincker, Tara Oceans. Tara Oceans studies plankton at planetary scale. Introduction, Science 348 (2015) 873. 731Review: BlastKOALA and GhostKOALA