Faculty of Informatics
Masaryk University
Czech Republic

A Journey in Biomedical Discovery Informatics: From Ontology Learning to Knowledge Graph Embeddings

Habilitation Thesis

Vít Nováček

2021

Abstract

The thesis begins with a commentary part where I describe the broader context of the presented work. This is followed by two more parts that illustrate the evolution of my research in biomedical discovery informatics from text- and ontology-based solutions to predictive models based on relational machine learning. These latter parts are a collection of previously published works.

Discovery informatics is a loosely defined area of computer science that primarily aims at providing solutions to the information overload problem. More specifically, discovery informatics research tries to come up with new, more efficient ways of acquiring, integrating, organising, augmenting and utilising information and knowledge from data sets that are typically large, heterogeneous, poorly structured, fast-evolving, unreliable—or, in other words, realistic. These characteristics are arguably relevant to virtually any domain of human activity nowadays. However, they are particularly pertinent to life sciences. The presented work has been motivated chiefly by the following instances of the information overload problem in life sciences:

1. The vast and ever-growing breadth and depth of published articles that can hardly be utilised in a focused and exhaustive manner.

2. The untapped potential of networked biomedical resources for making discoveries.

My solutions to the first specific challenge were based on advances in ontology learning, population and integration. My more recent research motivated by the second challenge has been about enabling new discoveries by applying relational machine learning to link prediction in networked biomedical datasets.

The presented works have been internationally recognised, garnering over 70 citations in the Web of Science database (and roughly three times as many in Google Scholar). Moreover, the research reported in one of the included publications was awarded the 2nd prize in the Elsevier Grand Challenge in Knowledge Enhancement in Life Sciences, where we won $15,000 prize money in a tough competition of over 70 teams from world-renowned institutions like Stanford or Carnegie Mellon University. Another publication reports predictions of previously unknown protein interactions in cancer pathways that were then observed in living human cells—a strong real-world validation of my work. Last but not least, research reported in the last part of the thesis has been taken up for commercial development by Fujitsu Laboratories Limited and led to five patents (two pending, three granted in the USPTO, EPO and/or Japan jurisdictions). That also clearly demonstrates the industrial relevance of my work.

The core of the thesis consists of nine previously published works (7 high-impact journal articles, 2 A-ranked conference papers). I have been the first author of 4 of them, and senior author of the rest. I have conceptualised and coordinated the research that has led to all the publications, and substantially contributed to each of them (either in terms of implementation of the corresponding prototype, devising validation methodology and pilots, manuscript writing and/or editing, funding acquisition and overall coordination, or a combination thereof).
Acknowledgements

I am indebted to the coauthors of the included publications: Loredana Laera, Siegfried Handschuh, Brian Davis, Tudor Groza, Stefan Decker, Gully A.P.C. Burns, Pasquale Minervini, Luca Costabello, Emir Muñoz, Pierre-Yves Vandenbussche, Brian Walsh, Sameh K. Mohamed, Aayah Nounu, Gavin McGauran, David Matallanas, Adrián Vallejo Blanco, Piero Conca, Kamalesh Kanakaraj, Zeeshan Nawaz, Colm J. Ryan, Walter Kolch, Dirk Fey. Many of you will forever be not only my much cherished colleagues, but friends. It has been a blast to coordinate the various teams we have worked in together, and I have learned a lot during our interdisciplinary collaborations.

I also gratefully acknowledge various grants that have supported the work I present here: the European Union's 6th IST Framework projects Knowledge Web (EU FP6-507482) and NEPOMUK (EU FP6-027705), Science Foundation Ireland's projects DERI Líon I, II, Insight 1, 2 and my personal short-term travel fellowship (SFI/02/CE1/1131, SFI/08/CE/I1380, SFI/12/RC/2289, SFI/12/RC/2289.2 and SFI/08/CE/I1380-STTF 11 (2), respectively), the KI2NA and TOMOE projects funded by Fujitsu Laboratories Ltd., Japan, and the CLARIFY H2020 project funded by the European Commission under the grant number 875160.

Other than that, a shout out to the passionate pizzaiolos Ronan and Eugene, and all the amazing Dough Bros staff in Galway—without our team Fridays over your excellent napoletanas, many of the ideas presented here would be much less thoroughly baked. And, last but not least, love to Hanka, Vojta and Eli; you are my islands of sanity in the seas I've had to navigate all the way to the here and now.

Contents

I Context of the Work 4
1 Introduction 5
2 Overall Concept and Related Work 7
  2.1 Semantic Literature Search Enabled by Ontology Learning 7
    2.1.1 Overview of Related Work 8
  2.2 Automating Discoveries via Knowledge Graph Embeddings 9
    2.2.1 Overview of Related Work 10
3 Specific Contributions 11
  3.1 Semantic Literature Search Enabled by Ontology Learning 11
    3.1.1 Ontology Population and Evolution 11
    3.1.2 Semantic Literature Search 12
    3.1.3 Distributional Semantics in Semantic Literature Search 13
  3.2 Automating Discoveries via Knowledge Graph Embeddings 15
    3.2.1 Injecting Axioms into Knowledge Graphs 15
    3.2.2 Prediction of Adverse Drug Reactions 16
    3.2.3 Integrated Biomedical Knowledge Graph 17
    3.2.4 Discovering Protein Drug Targets 19
    3.2.5 Prediction of Kinase-Substrate Networks 20
    3.2.6 One Method to Rule Them All 21
4 Impact 23
  4.1 Reception in Academia 23
  4.2 Reception in Industry 24
II From Ontology Learning... 30
5 Ontology Population and Evolution 31
6 Semantic Literature Search 45
7 Distributional Semantics in Semantic Literature Search 52
III ...to Knowledge Graph Embeddings 91
8 Injecting Axioms into Knowledge Graphs 92
9 Prediction of Adverse Drug Reactions 109
10 Integrated Biomedical Knowledge Graph 123
11 Discovering Protein Drug Targets 132
12 Prediction of Kinase-Substrate Networks 141
13 One Method to Rule Them All 172

Part I Context of the Work

Chapter 1 Introduction

Automating discoveries has recently been touted as one of the foremost upcoming challenges for computer science [10]. However, the challenge is far from new.
Already decades ago, some research communities were trying hard to come up with computational approaches that could assist people in the process of turning data into information and knowledge [3]. According to later works like [5] or [8], such efforts can be conveniently grouped under a common label—discovery informatics. In [8], this field is succinctly defined as a discipline of applied computer science that aims at: i) formal description of the entire scientific process, amenable to machine understanding and processing; ii) design, development, and evaluation of computational artifacts based on such formalisation; iii) application of the resulting artifacts to advance science, either in a fully automated or machine-aided manner.

This thesis tracks the evolution of my discovery informatics research vision over the last 13 years, and can be classified as a coherent collection of previously published works that explore various applied AI approaches to specific problems. All these problems are, however, motivated by one of two high-level information overload challenges in life sciences:

1. The vast and ever-growing breadth and depth of published articles that can hardly be utilised in a focused and exhaustive manner.

2. The untapped potential of networked biomedical resources for making discoveries.

My solutions to the first specific challenge were based on advances in ontology learning, population and integration [26]. My more recent research motivated by the second challenge has been about enabling new discoveries by applying knowledge graph embeddings [25] to link prediction in networked biomedical datasets.

The rest of the thesis is organised as follows:

• In the remainder of this commentary part, I first introduce the overall concept of the presented research and discuss the most essential related approaches (Section 2). Then I describe my specific contributions (Section 3) and review the impact of the works included in the thesis (Section 4).

• Part II presents three of my published works that aim at making life science literature search more efficient and truly knowledge-based by means of text mining and ontology learning.

• Part III presents six of my published works that pave the way towards using knowledge graph embeddings for making discoveries about drugs, proteins and other biomedical entities of practical interest.

Chapter 2 Overall Concept and Related Work

This chapter consists of two sections where I describe the overall concept of the presented work. This reflects the specific focus of each of the two parts that list the corresponding detailed publications later on. In each section here, I also give a brief overview of the most relevant related works (details are then given in the particular included publications).

2.1 Semantic Literature Search Enabled by Ontology Learning

Bringing scientific publishing largely online in the last decades has made knowledge production and dissemination much more efficient than before. The publication process is faster, since the essential phases like authoring, submission, reviewing, and final typesetting are largely computerised. Moreover, the published content is easily disseminated to global audiences via the Internet. In effect, more and more knowledge is being made available. However, the big question is whether all this hypothetically available knowledge is also truly accessible.
In our works [19, 17, 16], we claimed the answer to this question is negative, and we showed how this particular instance of the information overload problem could be alleviated in the context of life science publishing. As of 2009, Medline, a comprehensive source of life sciences and biomedical bibliographic information (available via the PubMed search engine, cf. https://pubmed.ncbi.nlm.nih.gov/), hosted over 18 million resources. It had a growth rate of 0.5 million items per year, which represented around 1,300 new resources per day [23].

Using contemporary publication search engines, one could explore the vast and ever-growing article repositories using relevant keywords. But this was very often not enough. Imagine for instance a junior researcher compiling a survey on various types of leukemia (example taken from [17]). The researcher wants to state and motivate in the survey that acute granulocytic leukemia is different from T-cell leukemia. Although such a statement might be obvious to a life scientist, one should support it in the survey by a citation of a published paper. Our researcher may be a bit inexperienced in oncology and may not know the proper reference straightaway. Using, e.g., the PubMed search service, it is easy to find articles that contain both leukemia names. Unfortunately, one can find hundreds of such results. It is tedious or even impossible to go through them all to discover one that actually supports that acute granulocytic leukemia is different from T-cell leukemia. The problem was that asking queries more expressive than (boolean combinations of) mere keywords was virtually impossible.

My work presented in [17, 16] addressed this problem by technologies that can operate at an enhanced level, using more expressive concepts and their various relationships. This required collecting, extracting, and interrelating knowledge scattered across large numbers of available life science publications. For that we used the groundwork set forth in one of my earlier works on automated ontology population via text mining [19], and also a distributional semantics framework we introduced in [18].

2.1.1 Overview of Related Work

The first publication [19] included in this thesis deals with biomedical ontology population by new knowledge extracted from textual resources. It defines several practical requirements on an implementation of such type of knowledge integration (namely the ability to process texts and incorporate the contents extracted from them automatically, and resolve potential inconsistencies based on user-defined preferences). While books like [6] or [22] and other, more focused works offered potentially applicable solutions back then, none of them satisfied all the requirements defined in the article. This motivated the development presented there.

The second and third included publications [17, 16] describe two different solutions for knowledge-based search in biomedical literature, where the queries can express rather complex semantics going beyond keywords and their Boolean combinations. From the end-user's point of view, which is arguably the crucial perspective in this context, these publications were, at that time, most related to works like FindUR [12], Melisa [1], GoPubMed [4] or Textpresso [13].
The award-winning CORAAL tool described in [17] addressed the major limitation and scalability bottleneck of the contemporary state of the art systems—their dependence on manually provided ontologies enabling the semantic search—by extracting the ontologies automatically. The SKIMMR tool [16] then improved CORAAL by using our new distributional semantics framework [18] as the underlying knowledge representation, and by general streamlining of the original technology and front-end motivated by the simple, yet powerful metaphor of machine-aided skim reading.

2.2 Automating Discoveries via Knowledge Graph Embeddings

While letting people search in publications more efficiently can no doubt facilitate progress in life sciences, it cannot directly lead to new discoveries. Therefore, in the more recent stage of my research career, I decided to come up with new approaches that could address this problem.

Complex biological systems can be conveniently modelled as networks of interconnected biological entities [2]. Such networks can then be converted into so-called knowledge graphs [7], which are lightweight representations of interlinked knowledge that mitigate many disadvantages of the more traditional, logics-based ontologies. Owing to their simple design principles, knowledge graphs are robust, and easy to create and maintain, which makes them readily available in practical applications. Yet they are also sufficiently well formalised to support many machine learning and inference tasks, such as relation extraction, link prediction or knowledge base completion [25].

Until as recently as 2017, however, the potential of knowledge graphs in the field of biomedical informatics had been largely unexplored. For instance, our work [14] is arguably the first approach that addresses the problem of discovering adverse drug reactions using knowledge graphs (a substantially extended version [15] of this paper is included here). My recent works included in Part III of this thesis build on the initial success reported in [14]. More specifically, they show how problems as diverse as discovering adverse drug reactions, polypharmacy side effects, protein drug targets or cancer signalling reactions can be solved simply by casting them as a link prediction task, and solving that task by means of knowledge graph embeddings [25]. This essentially consists of learning low-rank vector representations of the knowledge graph nodes and edges that preserve the graph's inherent structure. Efficient prediction of new possible links in the original graph using the trained model can in turn lead to new discoveries in the domain the graph represents, as detailed in the particular works included in the thesis.

2.2.1 Overview of Related Work

Various research communities have been trying to apply advanced machine learning techniques to biomedical use cases for well over a decade now. For instance, [28] or [11] came up with state of the art models for predicting adverse drug reactions using sophisticated applications of supervised machine learning. Polypharmacy side effect prediction by relational learning on graph data was investigated in [29]. Competitive machine learning models for prediction of protein drug targets were recently introduced for instance in [21] or [24]. And approaches like [9] or [27] dealt with combining biological background knowledge and general-purpose machine learning and other AI methods to discover signalling protein interactions.
The success of most of these methods, however, was dependent on carefully crafted datasets, lots of extensive (and expensive) manual feature engineering, limited flexibility of the predictive pipelines and/or impractical training times. Motivated by these challenges, the research groups I have been coordinating since 2015 have been developing an increasingly refined approach based on knowledge graph embeddings. This eventually allowed for targeting many discovery tasks with essentially one method: largely automated conversion of relevant datasets into the knowledge graph form and consequent application of an off-the-shelf relational learning model to achieve state of the art performance. These results, as well as exhaustive lists of related approaches, are described in detail by a series of works included in Part III of the thesis.

Chapter 3 Specific Contributions

This chapter reviews the published works included in the thesis, providing a brief summary of each in a dedicated section (typically based on the abstract of the corresponding publication). I also list my specific personal contributions to each of the works here.

3.1 Semantic Literature Search Enabled by Ontology Learning

3.1.1 Ontology Population and Evolution

The first specific contribution of my research I present in this thesis is based on the following journal article:

• Vít Nováček, Loredana Laera, Siegfried Handschuh, and Brian Davis. Infrastructure for dynamic knowledge integration—automated biomedical ontology extension using textual resources. Journal of Biomedical Informatics, 41(5):816-828, 2008.

The article is a direct follow-up of my master's thesis in ontology learning. It describes the final results of Knowledge Web, an EU Network of Excellence, in which I coordinated a work package on ontology evolution during my research internship and, later on, early PhD studies at the DERI institute of the National University of Ireland Galway. More specifically, in the article we present a novel ontology integration technique that explicitly takes the dynamics and data-intensiveness of e-health and biomedicine application domains into account. Changing and growing knowledge, possibly contained in unstructured natural language resources, is handled by application of cutting-edge Semantic Web technologies. In particular, semi-automatic integration of ontology learning results into a manually developed ontology is employed. This integration relies on automatic negotiation of agreed alignments, inconsistency resolution and natural language generation methods. Their novel combination alleviates the end-user effort in the incorporation of new knowledge to a large extent. This allows for efficient application in many practical use cases, as we show in the paper.

I personally contributed to the publication in the following ways:

• I conceptualised the reported research and coordinated the corresponding work.
• I implemented a substantial portion of the presented research prototype.
• I designed, performed and interpreted a validation study for the prototype.
• I wrote the manuscript.

3.1.2 Semantic Literature Search

The second specific contribution of my research I present in this thesis is based on the following journal article:

• Vít Nováček, Tudor Groza, Siegfried Handschuh, and Stefan Decker. CORAAL—dive into publications, bathe in the knowledge. Web Semantics: Science, Services and Agents on the World Wide Web, 8(2-3):176-181, 2010.

The article describes our award-winning prototype CORAAL.
This research was motivated by the shortcomings of prevalent search engines used in online scientific publishing that mostly exploit raw publication data (bags of words) and shallow metadata (authors, keywords, citations, etc.). The knowledge contained implicitly in published texts remains largely unutilised. Following our long-term ambition to take advantage of such knowledge, we have implemented CORAAL (COntent extended by emeRgent and Asserted Annotations of Linked publication data), an enhanced-search prototype and the second-prize winner of the Elsevier Grand Challenge. CORAAL extracts asserted publication metadata together with the knowledge implicitly present in the relevant text, integrates the emergent content, and displays it using a multiple-perspective search and browse interface. This way we enable semantic querying for individual publications, and convenient exploration of the knowledge contained within them. In other words, recalling the metaphor in the article title, we let the users dive into publications more easily, and allow them to freely bathe in the related unlocked knowledge.

I personally contributed to the publication in the following ways:

• I conceptualised the reported research and coordinated the corresponding work.
• I implemented a substantial portion of the presented research prototype.
• I designed, performed and interpreted a validation study for the prototype.
• I wrote the manuscript.

3.1.3 Distributional Semantics in Semantic Literature Search

The third specific contribution of my research I present in this thesis is based on the following journal article:

• Vít Nováček and Gully A.P.C. Burns. SKIMMR: Facilitating knowledge discovery in life sciences by machine-aided skim reading. PeerJ, 2:e483, 2014.

The article describes a prototype facilitating the process of skim-reading scientific publications via automatically generated, interlinked graphical summaries. This research improved the previous contribution (i.e., the CORAAL prototype) by reworking the internal knowledge representation mechanism from scratch, using a distributional semantics framework I developed in the final stage of my PhD research. I also simplified the user interface and user interaction modes based on extensive feedback from various sample users of my research prototypes.

Unlike full reading, "skim-reading" involves the process of looking quickly over information in an attempt to cover more material whilst still being able to retain a superficial view of the underlying content. Within this work, we specifically emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text. For the extraction, we use shallow parsing, co-occurrence analysis and semantic similarity computation techniques. Our main motivation is to assist biomedical researchers and clinicians in coping with the increasingly large amounts of potentially relevant articles that are continually being published in life sciences.

To construct the high-level network overview of articles, we extract weighted binary statements from the text. We consider two types of these statements, co-occurrence and similarity, both organised in the same distributional representation (i.e., in a vector-space model). For the co-occurrence weights, we use point-wise mutual information that indicates the degree of non-random association between two co-occurring entities. For computing the similarity statement weights, we use cosine distance based on the relevant co-occurrence vectors.
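To make the two weighting schemes concrete, the following minimal Python sketch computes PMI-based co-occurrence weights and the cosine similarity of co-occurrence vectors. It is an illustration only, not SKIMMR's actual implementation; the function names, the smoothing-free probability estimates and the input format (one set of extracted entities per article) are my own simplifying assumptions.

import math
from collections import Counter
from itertools import combinations

def cooccurrence_weights(docs):
    # docs: iterable of sets of entity names extracted from individual articles
    entity_counts = Counter()
    pair_counts = Counter()
    for entities in docs:
        entity_counts.update(entities)
        pair_counts.update(combinations(sorted(entities), 2))
    n = float(sum(entity_counts.values()))
    weights = {}
    for (a, b), c_ab in pair_counts.items():
        p_ab = c_ab / n
        p_a, p_b = entity_counts[a] / n, entity_counts[b] / n
        # point-wise mutual information of the entity pair
        weights[(a, b)] = math.log(p_ab / (p_a * p_b), 2)
    return weights

def cosine_similarity(vec_a, vec_b):
    # vec_a, vec_b: sparse co-occurrence vectors given as {entity: weight} dictionaries
    dot = sum(vec_a[k] * vec_b[k] for k in set(vec_a) & set(vec_b))
    norm = math.sqrt(sum(v * v for v in vec_a.values())) * math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / norm if norm else 0.0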
These statements are used to build fuzzy indices of terms, statements and provenance article identifiers, which support fuzzy querying and subsequent result ranking. These indexing and querying processes are then used to construct a graph-based interface for searching and browsing entity networks extracted from articles, as well as articles relevant to the networks being browsed. Last but not least, we describe a methodology for automated experimental evaluation of the presented approach. The method uses formal comparison of the graphs generated by our tool to relevant gold standards based on manually curated PubMed, TREC challenge and MeSH data.

We provide a web-based prototype (called "SKIMMR") that generates a network of inter-related entities from a set of documents which a user may explore through our interface. When a particular area of the entity network looks interesting to a user, the tool displays the documents that are the most relevant to those entities of interest currently shown in the network. We present this as a methodology for browsing a collection of research articles. To illustrate the practical applicability of SKIMMR, we present examples of its use in the domains of Spinal Muscular Atrophy and Parkinson's Disease. Finally, we report on the results of experimental evaluation using the two domains and one additional dataset based on the TREC challenge. The results show how the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context.

I personally contributed to the publication in the following ways:

• I conceptualised the reported research and coordinated the corresponding work.
• I implemented the presented research prototype.
• I designed, performed and interpreted a validation study for the prototype.
• I wrote the manuscript.

3.2 Automating Discoveries via Knowledge Graph Embeddings

3.2.1 Injecting Axioms into Knowledge Graphs

The fourth specific contribution of my research I present in this thesis is based on the following conference paper:

• Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Regularizing knowledge graph embeddings via equivalence and inversion axioms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 668-683. Springer, 2017.

The paper presents a method for augmenting predictive models that are based on essentially statistical relational machine learning with symbolic knowledge in the form of logics-based axioms. While the application context and validation of the method presented in the paper is not specifically targeted at the domain of life sciences, it nevertheless presents a solid formal groundwork for injecting background knowledge into statistical predictive models trained purely from relational data. Thus it is an important stepping stone towards models that can be made more accurate and/or focused on concrete biological or clinical use cases via selected bits of domain knowledge provided by experts.

As to the method itself, it explores symbolic augmentation of knowledge graph embedding models. Learning embeddings of entities and relations into low-rank continuous vector spaces using neural architectures is an effective method of performing statistical learning on large-scale relational data, such as knowledge graphs.
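For context, the scoring functions of three widely used embedding models can be written as follows, with e_s and e_o denoting the subject and object embeddings and w_p the predicate embedding. TransE is explicitly named in the evaluation summary below; DistMult and ComplEx are listed here as representative further examples and are my assumption, not necessarily the paper's exact selection:

\begin{aligned}
\text{TransE:}   \quad & \phi(s,p,o) = -\lVert \mathbf{e}_s + \mathbf{w}_p - \mathbf{e}_o \rVert \\
\text{DistMult:} \quad & \phi(s,p,o) = \textstyle\sum_i e_{s,i}\, w_{p,i}\, e_{o,i} \\
\text{ComplEx:}  \quad & \phi(s,p,o) = \mathrm{Re}\Big(\textstyle\sum_i e_{s,i}\, w_{p,i}\, \overline{e_{o,i}}\Big)
\end{aligned}

An equivalence axiom p ≡ q can then be encouraged softly, for instance by penalising a distance between w_p and w_q during training; the precise model-dependent constraints, including those for inversion axioms, are derived in the paper itself.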
In this paper, we consider the problem of regularising the training of neural knowledge graph embeddings by leveraging external background knowledge. We propose a principled and scalable method for leveraging equivalence and inversion axioms during the learning process, by imposing a set of model-dependent soft constraints on the predicate embeddings. The method has several advantages: (i) the number of introduced constraints does not depend on the number of entities in the knowledge base; (ii) regularities in the embedding space effectively reflect available background knowledge; (iii) it yields more accurate results in link prediction tasks over non-regularized methods; and (iv) it can be adapted to a variety of models, without affecting their scalability properties. We demonstrate the effectiveness of the proposed method on several large knowledge graphs. Our evaluation shows that it consistently improves the predictive accuracy of several neural knowledge graph embedding models (for instance, the MRR of TransE on WordNet increases by 11%) without compromising their scalability properties.

I personally contributed to the publication in the following ways:

• I helped to conceptualise the reported research and coordinated the corresponding work.
• I performed the formal analysis of specific approaches to injecting the selected axioms into the three embedding models we chose for validating the concept.
• I wrote the corresponding draft sections of the manuscript, commented on the overall structure and contents and edited the final version.

3.2.2 Prediction of Adverse Drug Reactions

The fifth specific contribution of my research I present in this thesis is based on the following journal article:

• Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models. Briefings in Bioinformatics, 20(1):190-202, 2019.

This article is the first major work where we demonstrated the potential of knowledge graphs for solving practically relevant biomedical discovery informatics challenges via off-the-shelf, efficient machine learning methods. More specifically, the article deals with the problem of predicting adverse drug reactions (ADRs). Timely identification of ADRs is highly important in the domains of public health and pharmacology. Early discovery of potential ADRs can limit their effect on patient lives and also make drug development pipelines more robust and efficient. Reliable in silico prediction of ADRs can be helpful in this context, and thus, it has been intensely studied. Recent works achieved promising results using machine learning. The presented work focuses on machine learning methods that use drug profiles for making predictions and use features from multiple data sources. We argue that despite promising results, existing works have limitations, especially regarding flexibility in experimenting with different data sets and/or predictive models. We suggest addressing these limitations by generalisation of the key principles used by the state of the art. Namely, we explore the effects of: (1) using knowledge graphs—machine-readable interlinked representations of biomedical knowledge—as a convenient uniform representation of heterogeneous data; and (2) casting ADR prediction as a multi-label ranking problem.
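A minimal sketch of what point (2) amounts to in practice is given below, using generic scikit-learn components; the feature construction, the data and the specific models evaluated in the article differ, and all names and dimensions here are illustrative placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# X: one row per drug, columns = features derived from a knowledge graph
#    (e.g. binary indicators of linked targets, enzymes, pathways or indications).
# Y: one column per adverse drug reaction (1 = ADR known for that drug).
rng = np.random.default_rng(0)
X = rng.random((200, 50))              # toy feature matrix
Y = (X[:, :10] > 0.5).astype(int)      # toy multi-label ADR matrix

model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X, Y)

# Rank candidate ADRs for a drug by predicted probability, which matches
# the multi-label ranking view of the problem taken in the article.
drug = X[:1]
scores = np.column_stack([est.predict_proba(drug)[:, 1] for est in model.estimators_])
ranked_adr_indices = np.argsort(-scores[0])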
We present a specific way of using knowledge graphs to generate different feature sets and demonstrate favourable performance of selected off-the-shelf multi-label learning models in comparison with existing works. Our experiments suggest better suitability of certain multi-label learning methods for applications where ranking is preferred. The presented approach can be easily extended to other feature sources or machine learning methods, making it flexible for experiments tuned toward specific requirements of end users. Our work also provides a clearly defined and reproducible baseline for any future related experiments.

I personally contributed to the publication in the following ways:

• I helped to conceptualise the reported research and coordinated the corresponding work.
• I advised on the selection of training data sets and the process of their conversion into a knowledge graph form.
• I helped to design the validation study.
• I edited the manuscript.

3.2.3 Integrated Biomedical Knowledge Graph

The sixth specific contribution of my research I present in this thesis is based on the following conference paper:

• Brian Walsh, Sameh K. Mohamed, and Vít Nováček. BioKG: A knowledge graph for relational learning on biological data. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 3173-3180, 2020.

The paper describes a method and a software framework for automated fetching and integration of a number of widely used biomedical data sets covering proteins, genes, drugs, chemicals, pathways and their mutual interactions into a common knowledge graph. We also describe various machine learning benchmark data sets that can be derived from the knowledge graph and thus support training and validation of manifold models addressing challenges like drug side effect or protein interaction prediction.

A more specific summary of this contribution is as follows. Knowledge graphs have become a popular means of modelling complex biological systems, capturing the interactions between biological entities and their effects on the biological system. They also provide support for relational learning models, which are known to provide highly scalable and accurate predictions of associations between biological entities. Despite the success of combining biological knowledge graphs and relational learning models in biological predictive tasks, there is a lack of unified biological knowledge graph resources. This has forced the current efforts and studies applying relational learning models to biological data to compile and build their own biological knowledge graphs from open biological databases. This process is often performed inconsistently across such efforts, especially in terms of choosing the original resources, aligning identifiers of the different databases and assessing the quality of included data. To make relational learning on biomedical data more standardised and reproducible, we propose a new biological knowledge graph which provides a compilation of curated relational data from open biological databases in a unified format with common, interlinked identifiers. We also provide a new module for mapping identifiers and labels from different databases which can be used to align our knowledge graph with biological data from other heterogeneous sources.
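The following minimal sketch illustrates the kind of unified representation and identifier alignment described above. The relation names, the pathway identifier and the mapping structure are illustrative assumptions of mine, not the actual BioKG schema or API; the drug-target pair used as an example (aspirin, DrugBank DB00945, and the COX-1 protein, UniProt P23219) is a well-known interaction.

# Unified triple representation: (head, relation, tail) with database-qualified identifiers.
triples = [
    ("DRUGBANK:DB00945", "DRUG_TARGET", "UNIPROT:P23219"),                         # aspirin -> COX-1
    ("UNIPROT:P23219", "PROTEIN_PATHWAY_ASSOCIATION", "REACTOME:R-HSA-0000000"),   # placeholder pathway ID
]

# Identifier mapping: align identifiers coming from other sources to the graph's namespace.
id_map = {
    ("GENE_SYMBOL", "PTGS1"): ("UNIPROT", "P23219"),
}

def align(source_db, source_id):
    # Translate an external identifier into the namespace used in the graph, if a mapping is known.
    db, ident = id_map.get((source_db, source_id), (source_db, source_id))
    return "{}:{}".format(db, ident)

assert align("GENE_SYMBOL", "PTGS1") == "UNIPROT:P23219"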
Finally, to illustrate the practical relevance of our work, we provide a set of benchmarks based on the presented data that can be used to train and assess relational learning models in various tasks related to pathway and drug discovery.

I personally contributed to the publication in the following ways:

• I helped to conceptualise the reported research and coordinated the corresponding work.
• I advised on the selection of the original biomedical data sets and the process of their conversion into a knowledge graph form.
• I helped to design the task-specific benchmarks.
• I edited the manuscript.

3.2.4 Discovering Protein Drug Targets

The seventh specific contribution of my research I present in this thesis is based on the following journal article:

• Sameh K. Mohamed, Vít Nováček, and Aayah Nounu. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics, 36(2):603-610, 2020.

The article shows how state of the art performance in a number of standard drug target prediction benchmarks can be achieved by a relatively simple application of a knowledge graph embedding model. More specifically, the article falls under the broad area of computational approaches for predicting drug-target interactions (DTIs) that can provide valuable insights into the drug mechanism of action. DTI predictions can help to quickly identify new promising (on-target) or unintended (off-target) effects of drugs. However, existing models face several challenges. Many can only process a limited number of drugs and/or have poor proteome coverage. The current approaches also often suffer from high false positive prediction rates.

In this work, we propose a novel computational approach for predicting drug target proteins. The approach is based on formulating the problem as link prediction in knowledge graphs (robust, machine-readable representations of networked knowledge). We use biomedical knowledge bases to create a knowledge graph of entities connected to both drugs and their potential targets. We propose a specific knowledge graph embedding model, TriModel, to learn vector representations (i.e. embeddings) for all drugs and targets in the created knowledge graph. These representations are consequently used to infer candidate drug target interactions based on their scores computed by the trained TriModel model. We have experimentally evaluated our method using computer simulations and compared it to five existing models. This has shown that our approach outperforms all previous ones in terms of both area under the ROC and precision-recall curves in standard benchmark tests.

I personally contributed to the publication in the following ways:

• I helped to conceptualise the reported research and coordinated the corresponding work.
• I advised on the selection of the original biomedical data sets and the process of their conversion into a knowledge graph form.
• I helped to design the validation study.
• I edited the manuscript.

3.2.5 Prediction of Kinase-Substrate Networks

The eighth specific contribution of my research I present in this thesis is based on the following journal article:

• Vít Nováček, Gavin McGauran, David Matallanas, Adrián Vallejo Blanco, Piero Conca, Emir Muñoz, Luca Costabello, Kamalesh Kanakaraj, Zeeshan Nawaz, Brian Walsh, et al. Accurate prediction of kinase-substrate networks using knowledge graphs. PLoS Computational Biology, 16(12):e1007578, 2020.
The article describes a conceptually new computational approach to prediction of signalling protein interactions (phosphorylations). It addresses the problem from a totally different viewpoint than existing solutions, while outperforming them in two independent benchmarks, and in laboratory validations.

Phosphorylation of specific substrates by protein kinases is a key control mechanism for vital cell-fate decisions and other cellular processes. However, discovering specific kinase-substrate relationships is time-consuming and often rather serendipitous. Computational predictions alleviate these challenges, but the current approaches suffer from limitations like restricted kinome coverage and inaccuracy. They also typically utilise only local features without reflecting broader interaction context. To address these limitations, we have developed an alternative predictive model. It uses statistical relational learning on top of phosphorylation networks interpreted as knowledge graphs, a simple yet robust model for representing networked knowledge. Compared to a representative selection of six existing systems, our model has the highest kinome coverage and produces biologically valid high-confidence predictions not possible with the other tools. Specifically, we have experimentally validated predictions of previously unknown phosphorylations by the LATS1, AKT1, PKA and MST2 kinases in human. Thus, our tool is useful for focusing phosphoproteomic experiments, and facilitates the discovery of new phosphorylation reactions. Our model can be accessed publicly via an easy-to-use web interface (LinkPhinder).

I personally contributed to the publication in the following ways:

• I conceptualised the reported research and coordinated the corresponding work.
• I devised and implemented the process of converting site-specific kinase-substrate interaction data into the knowledge graph form and back.
• I advised on implementing the relational machine learning model for predicting the signalling reactions.
• I designed, coordinated and interpreted the computational validation studies for the prototype.
• I wrote the manuscript.

3.2.6 One Method to Rule Them All

The ninth specific contribution of my research I present in this thesis is based on the following journal article:

• Sameh K. Mohamed, Aayah Nounu, and Vít Nováček. Biological applications of knowledge graph embedding models. Briefings in Bioinformatics, 22(2):1679-1693, 2021.

The article reviews several of our previous works in a comprehensive comparison to other related approaches. This is done to present a survey on solving a broad range of biomedical prediction problems with essentially one suite of techniques based on knowledge graph embedding models. This can be done due to the fact that complex biological systems are traditionally modelled as graphs of interconnected biological entities. These graphs, i.e. biological knowledge graphs, are then processed using graph exploratory approaches to perform different types of analytical and predictive tasks. Despite the high predictive accuracy of these approaches, they have limited scalability due to their dependency on time-consuming path exploratory procedures. In recent years, owing to the rapid advances of computational technologies, new approaches for modelling graphs and mining them with high accuracy and scalability have emerged. These approaches, i.e.
knowledge graph embedding (KGE) models, operate by learning low-rank vector representations of graph nodes and edges that preserve the graph's inherent structure. These approaches were used to analyse knowledge graphs from different domains where they showed superior performance and accuracy compared to previous graph exploratory approaches. In this work, we study this class of models in the context of biological knowledge graphs and their different applications. We then show how KGE models can be a natural fit for representing complex biological knowledge modelled as graphs. We also discuss their predictive and analytical capabilities in different biology applications. In this regard, we present two example case studies that demonstrate the capabilities of KGE models: prediction of drug-target interactions and polypharmacy side effects. Finally, we analyse different practical considerations for KGEs, and we discuss possible opportunities and challenges related to adopting them for modelling biological systems.

I personally contributed to the publication in the following ways:

• I helped to conceptualise the reported research and coordinated the corresponding work.
• I helped to define the scope of the survey and the set of discovery problems to be covered.
• I edited the manuscript.

Chapter 4 Impact

This chapter gives a brief overview of the impact of my work in the real world, first in academic and then in industry settings.

4.1 Reception in Academia

The seven articles included in the thesis have been published in internationally disseminated peer-reviewed journals with an impact factor ranging from 1.897 (Web Semantics Journal, Elsevier) to 11.622 (Briefings in Bioinformatics, Oxford University Press). The other two works appeared in the proceedings of the ECML-PKDD and CIKM conferences, which are both A-ranked in the CORE conference database. As of October 2021, the works included in this thesis have been cited by over 70 other publications according to the Web of Science database. The number of citations according to the Google Scholar service is roughly three times higher. This demonstrates that my research has been acknowledged by the global scientific community and used in many consequent works that are pushing the state of the art further.

In terms of other means of academic recognition, I would like to highlight the following points:

• The work described in [17] has been awarded the 2nd prize in the Elsevier Grand Challenge in Knowledge Enhancement in Life Sciences, where we won $15,000 prize money in a tough competition of over 70 teams from world-renowned institutions like Stanford or Carnegie Mellon University. The competition was judged by world-leading scientists and publishers (such as Eduard H. Hovy, a renowned AI and NLP researcher, or Emilie Marcus, then the Editor in Chief of Cell, one of the most influential journals in the world across any field of science). The fact that such a prestigious committee considered my work relevant enough to be awarded the prize was a tremendous validation of my research vision and the capability to realise it already at a rather early stage of my career.

• The works described in [17, 16] were used for literature search on a daily basis by clinicians and by representatives of a large US patient organisation (Spinal Muscular Atrophy Foundation). Their positive feedback was another strong signal that I was on the right path in terms of turning my research into societally relevant services.
• The interdisciplinary character of my research vision has allowed me to establish a number of fruitful collaborations with biologists, clinicians and representatives of the pharma industry. Their feedback has been crucial for motivating my work by the actual needs of users who can benefit from the research outcomes right away. In many cases, these groups have also provided a strong, realistic validation of my work, which has been priceless.

• Last but not least, the research reported in this thesis was instrumental for building a platform I used to successfully acquire funding for my research groups from various bodies in Europe, the USA and Japan. The total amount of funding directly dedicated to groups I have supervised corresponds to ca. €1,779,000, or roughly 43.6 million CZK, as of October 2021. This is yet another confirmation of the international relevance of the presented research.

4.2 Reception in Industry

I have spent a substantial portion of my post-doctoral career (2012-2019) coordinating a research group in a large industrial collaboration fully funded by Fujitsu Laboratories Limited (FLL). FLL has appreciated the relevance and uniqueness of our research by applying for several patents on our behalf (6 in total, 5 of which have been related to the research reported in the publications here, with 3 granted and 2 pending as of October 2021). More importantly, however, the Fujitsu group's research scientists, developers and business units have been taking up our research outcomes. They have built dedicated internal teams around our prototypes that have been commercialising the resulting applications in Japan (for instance, in 2020, a genomic AI system based on our work was brought to a production stage by a subsidiary company of the Fujitsu group). This clearly demonstrates a tangible impact of my research vision not only in academia, but also in the global corporate world. More recently, this has been corroborated by my other involvements within the healthcare, pharma and biotech verticals, such as:

• Serving as an AI consultant in a large oncology hospital (Masaryk Memorial Cancer Institute).

• A request by the large biotech company QIAGEN to license experimental and prediction data reported in [20] as a part of their pathway analysis tool.

• An invitation to serve on the Advisory Board of the BioXcel Therapeutics, Inc. pharma company.

I expect to be involved in more such dissemination efforts in the future, which will further help to get my ideas out into the real world.

Bibliography

[1] Jose Maria Abasolo, M Gomez, and M Melisa. An ontology-based agent for information retrieval in medicine. In Proceedings of the First International Workshop on the Semantic Web (SemWeb2000), pages 73-82, 2000.

[2] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101-113, 2004.

[3] Brian L Claus and Dennis J Underwood. Discovery informatics: its evolving role in drug discovery. Drug Discovery Today, 7(18):957-966, 2002.

[4] Andreas Doms and Michael Schroeder. GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Research, 33(suppl_2):W783-W786, 2005.

[5] Yolanda Gil and Haym Hirsh. Discovery informatics: AI opportunities in scientific discovery. In 2012 AAAI Fall Symposium Series, 2012.

[6] Asuncion Gomez-Perez, Mariano Fernandez-Lopez, and Oscar Corcho. Ontological Engineering: with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web.
Springer Science & Business Media, 2006.

[7] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, Jose Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. Knowledge graphs. ACM Computing Surveys (CSUR), 54(4):1-37, 2021.

[8] Vasant G Honavar. The promise and potential of big data: A case for discovery informatics. Review of Policy Research, 31(4):326-330, 2014.

[9] Heiko Horn, Erwin M Schoof, Jinho Kim, Xavier Robin, Martin L Miller, Francesca Diella, Anita Palma, Gianni Cesareni, Lars Juhl Jensen, and Rune Linding. KinomeXplorer: an integrated platform for kinome biology studies. Nature Methods, 11(6):603-604, 2014.

[10] Hiroaki Kitano. Nobel Turing Challenge: creating the engine for scientific discovery. npj Systems Biology and Applications, 7(1):1-12, 2021.

[11] Mei Liu, Yonghui Wu, Yukun Chen, Jingchun Sun, Zhongming Zhao, Xue-wen Chen, Michael Edwin Matheny, and Hua Xu. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association, 19(e1):e28-e35, 2012.

[12] Deborah L McGuinness. Ontology-enhanced search for primary care medical literature. In Proceedings of the Medical Concept Representation and Natural Language Processing Conference, pages 16-19, 1999.

[13] Hans-Michael Müller, Eimear E Kenny, Paul W Sternberg, and Michael Ashburner. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2(11):e309, 2004.

[14] Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Using drug similarities for discovery of possible adverse reactions. In AMIA Annual Symposium Proceedings, volume 2016, page 924. American Medical Informatics Association, 2016.

[15] Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models. Briefings in Bioinformatics, 20(1):190-202, 2019.

[16] Vít Nováček and Gully APC Burns. SKIMMR: Facilitating knowledge discovery in life sciences by machine-aided skim reading. PeerJ, 2:e483, 2014.

[17] Vít Nováček, Tudor Groza, Siegfried Handschuh, and Stefan Decker. CORAAL—dive into publications, bathe in the knowledge. Web Semantics: Science, Services and Agents on the World Wide Web, 8(2-3):176-181, 2010.

[18] Vít Nováček, Siegfried Handschuh, and Stefan Decker. Getting the meaning right: A complementary distributional layer for the web semantics. In International Semantic Web Conference, pages 504-519. Springer, 2011.

[19] Vít Nováček, Loredana Laera, Siegfried Handschuh, and Brian Davis. Infrastructure for dynamic knowledge integration—automated biomedical ontology extension using textual resources. Journal of Biomedical Informatics, 41(5):816-828, 2008.

[20] Vít Nováček, Gavin McGauran, David Matallanas, Adrián Vallejo Blanco, Piero Conca, Emir Muñoz, Luca Costabello, Kamalesh Kanakaraj, Zeeshan Nawaz, Brian Walsh, et al. Accurate prediction of kinase-substrate networks using knowledge graphs. PLoS Computational Biology, 16(12):e1007578, 2020.

[21] Rawan S Olayan, Haitham Ashoor, and Vladimir B Bajic. DDR: efficient computational method to predict drug-target interactions using graph mining and machine learning approaches. Bioinformatics, 34(7):1164-1173, 2018.

[22] Steffen Staab and Rudi Studer. Handbook on ontologies. Springer Science & Business Media, 2010.

[23] Junichi Tsujii.
Refine and PathText, which combines text mining with pathways. Keynote at Semantic Enrichment of the Scientific Literature, 2009.

[24] Fangping Wan, Lixiang Hong, An Xiao, Tao Jiang, and Jianyang Zeng. NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions. Bioinformatics, 35(1):104-111, 2019.

[25] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724-2743, 2017.

[26] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Ontology learning from text: A look back and into the future. ACM Computing Surveys (CSUR), 44(4):1-36, 2012.

[27] Yu Xue, Jian Ren, Xinjiao Gao, Changjiang Jin, Longping Wen, and Xuebiao Yao. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Molecular & Cellular Proteomics, 7(9):1598-1608, 2008.

[28] Yoshihiro Yamanishi, Edouard Pauwels, and Masaaki Kotera. Drug side-effect prediction based on the integration of chemical and biological spaces. Journal of Chemical Information and Modeling, 52(12):3284-3292, 2012.

[29] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457-i466, 2018.

Part II From Ontology Learning...

Chapter 5 Ontology Population and Evolution

Journal of Biomedical Informatics 41 (2008) 816-828

Infrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resources

Vít Nováček a,*, Loredana Laera b, Siegfried Handschuh a, Brian Davis a

a Digital Enterprise Research Institute, National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Co. Galway, Ireland
b Department of Computer Science, University of Liverpool, UK

* Corresponding author. Fax: +353 91 495541. E-mail addresses: vit.novacek@deri.org (V. Nováček), lori@csc.liv.ac.uk (L. Laera), siegfried.handschuh@deri.org (S. Handschuh), brian.davis@deri.org (B. Davis).

ARTICLE INFO

Article history: Received 1 September 2007. Available online 24 July 2008.

Keywords: Dynamic ontology integration; Ontology evolution; Ontology alignment and negotiation; Ontology learning; Biomedical ontologies; Knowledge acquisition; Lifecycle

ABSTRACT

We present a novel ontology integration technique that explicitly takes the dynamics and data-intensiveness of e-health and biomedicine application domains into account. Changing and growing knowledge, possibly contained in unstructured natural language resources, is handled by application of cutting-edge Semantic Web technologies. In particular, semi-automatic integration of ontology learning results into a manually developed ontology is employed. This integration bases on automatic negotiation of agreed alignments, inconsistency resolution and natural language generation methods. Their novel combination alleviates the end-user effort in the incorporation of new knowledge to large extent. This allows for efficient application in many practical use cases, as we show in the paper. © 2008 Elsevier Inc. All rights reserved.

1. Introduction

Ontologies (formal knowledge bases) on the Semantic Web are often very likely subject to change given the dynamic nature of domain knowledge. Knowledge changes and evolves over time as experience accumulates—it is revised and augmented in the light of deeper understanding: new facts are becoming known while some of the older ones need to be revised and/or retracted at the same time.
This holds especially for scientific domains, however, even virtually any industrial domain is dynamic—changes typically occur in product portfolios, personnel structure or industrial processes, which can all be reflected by an ontology in a knowledge management policy. The domains of e-health and biomedicine are both scientific (biomedical research) and industrial (clinical practice, pharmaceutics).

The need for ontologies in biomedicine knowledge and data management has already been recognised by the community. Ontologies can serve as structured repositories giving a shared meaning to data and thus making it possible to process and query them in a more efficient and expressive manner. The shared meaning provided by ontologies also results in facilitation of integration between different medical data formats once they are bound to an ontology. Moreover, the state of the art ontology-based techniques (like alignment or reasoning as described in [39]) can help to integrate the data even if they adhere to different ontologies.

In the biomedical domain, ontology construction is usually a result of a collaboration involving ontology engineers and domain experts, where the knowledge is being extracted and modelled manually. However, it is not always feasible to process all the relevant data and extract the knowledge manually from domain resources, since we might not have a sufficiently large committee of ontology engineers and/or dedicated experts at hand in order to process new data anytime it arrives. This implies a need for automation of knowledge extraction and maintenance processes in dynamic and data-intensive medical environments. If the knowledge is available in textual resources, ontology learning (see [33]) can help in this task. Therefore, a lifecycle of an ontology development process apt for universal application in the medicine domain should also support appropriate mechanisms for the incorporation of dynamically extracted knowledge. In this paper, we introduce such a lifecycle scenario and a novel solution to the dynamic knowledge integration task.

Our efforts have several particular motivations. While there has been a great deal of work on ontology learning for ontology construction, e.g. in [10], as well as on manual or collaborative ontology development in [41], relatively little attention has been paid to the user-friendly integration of both approaches within an ontology lifecycle scenario. By user-friendly we mean especially accessible to users who are not experts in ontology engineering (i.e. biomedicine researchers or practitioners). In this paper, we
The dynamic nature of knowledge is one of the most challenging problems not only in biomedicine, but in the whole of current Semantic Web research. Here we provide a solution for dealing with these dynamics on a large scale, based on a purpose-built connection of ontology learning and dynamic manual development. We do not concentrate on a formal specification of the respective ontology integration operators; we rather focus on their implementation, following certain practical requirements:
(1) the ability to process new knowledge (resources) automatically whenever it appears and when it is inappropriate for human users to incorporate it;
(2) the ability to automatically compare the new knowledge with a "master" ontology that is manually maintained and select the new knowledge accordingly;
(3) the ability to resolve possible major inconsistencies between the new and current knowledge, possibly favouring the assertions from the presumably more complex and precise master ontology against the learned ones;
(4) the ability to automatically sort the new knowledge according to user-defined preferences and present it to them in a very simple and accessible way, thus further alleviating human effort in the task of knowledge integration.
On one hand, using the automatic methods, we are able to deal with large amounts of changing data. On the other hand, the final incorporation of new knowledge is to be decided by the expert human users, repairing possible errors and inappropriate findings of the automatic techniques. The key to success and applicability is to let machines do most of the tedious and time-consuming work and provide people with concise and simple suggestions on ontology integration.
The main contribution of the presented work is two-fold:
• proposal and implementation of a generic algorithm for dynamic integration of knowledge automatically extracted from various unstructured resources (e.g., natural language articles or web pages) into manually maintained formal ontologies (described in Sections 4 and 5);
• presentation of an example application of the implemented algorithm in a task of biomedical ontology extension by integrating knowledge automatically learned from textual domain resources, showing usability of the approach in the context of the presented use cases (Section 6).
The rest of the paper is organized as follows: Section 2 gives a basic overview of the essential notions and background of the paper, together with the relevant references. Section 3 discusses the related work. Section 4 gives an overview of our ontology lifecycle scenario and framework, whereas Section 5 presents the integration of manually designed and automatically learned ontologies in more detail. In Section 6, we describe an example practical application of our integration technique, using real world input data (from the biomedicine research domain). Preliminary evaluation and its discussion is also provided. Section 7 outlines relevant real-world settings, challenges and contributions our framework can bring in these contexts. A related user feedback analysis is provided in Section 7, too. Section 8 summarises the paper and future directions of the presented research.
2. Key notions
In the following list we give a brief description of the essential notions that are relevant for the presented content and describe how they relate to the field of bioinformatics:
• Semantic Web—the Semantic Web initiative is generally about giving a formal shared meaning to the data present on the normal world wide web in order to make them fully accessible and "comprehensible" by machines, not only by humans (see [5]). However, the technologies that have been developed within Semantic Web research are applicable to many other fields. In the case of bioinformatics, biomedical data management and decision support, we can exploit for instance Semantic Web methods of intelligent and efficient knowledge representation, reasoning, data integration or knowledge management.
• ontology—according to a popular definition in [24], an ontology is a representation of a shared conceptualisation. As such, ontologies are used for formal representation of knowledge in particular domains, i.e. various subfields of biomedicine.
• ontology integration—the process of merging, consolidating and the respective analysis and modification of two or more ontologies into one (integrated) ontology (see [38]). The process can be either manual or (semi-)automatic.
• ontology learning—acquisition of an ontology from unstructured or semi-structured natural language text, typically resources relevant for a particular domain (e.g. web pages, articles or other types of documents). Natural Language Processing and Machine Learning methods are mostly used as a base for ontology learning algorithms (see [33]).
• ontology alignment—ontology alignment establishes mappings between concepts and other entities (e.g. relations or instances) in two or more ontologies. Either manually designed mappings (created on the fly or contained in appropriate alignment repositories), or automatically generated ones can be used to align the ontologies (see [16,18]).
• ontology evolution—development and maintenance of ontologies in dynamic environments (see [40,36,25]), where the knowledge needs to be updated on a regular basis and changes in the domain conceptualisation occur often (e.g. science or business domains, where frequent introduction of new concepts or revision of the old ones is essential).
• ontology lifecycle—a methodology or scenario that describes how the particular phases of ontology development, maintenance and possibly also exploitation are mutually connected and dependent (see [23,34]).
3. Related work
Within the Semantic Web research, several approaches and methodologies have been defined and implemented in the context of ontology lifecycle and integration. Recent overviews of the state-of-the-art in ontologies and related methodologies can be found in [39] and [23]. However, none of them offers a direct solution to the requirements specified in Section 1.
The Methontology methodology by [19] was developed in the Esperonto EU project. It defines the process of designing ontologies and extends it towards evolving ontologies. It is provided with an ontology lifecycle based on evolving prototypes (see [20]) and defines stages from specification and knowledge acquisition to configuration management. The particular stages and their requirements are characterised, but rather in a general manner. The automatic ontology acquisition methods are considered in Methontology, however, their concrete incorporation into the whole lifecycle is not covered. The ODESeW and WebODE (see [9]) projects are based on Methontology and provide an infrastructure and tools for semantic application development/management, which is in the process of being extended for networked and evolving ontologies.
However, they focus rather on the application development part of the problem than on the ontology evolution and dynamic ontology integration parts.
The methods and tools referenced above lack concrete mechanisms that would efficiently deal with the dynamics of realistic domains (so characteristic for instance for e-health and biomedicine). Moreover, the need for automatic methods of ontology acquisition in data-intensive environments is acknowledged, but the role and application of the automatic techniques is usually not clearly studied and implemented. Our approach described in [34] offers a complex picture of how to deal with the dynamics in the general lifecycle scenario. The work we present here implements the fundamental semi-automatic dynamic integration component of the scenario.
There are more specific approaches similar to the one presented by our lifecycle framework. [14] incorporates automatic ontology extraction from a medical database and its consequent population by linguistic processing of corpus data. However, the mechanism is rather task-specific—the ontology is represented in the RDF(S) format (see [6]) that is less expressive than the OWL language (see [4]), which we use. The extraction is oriented primarily at taxonomies and does not take the dynamics directly into account. Therefore the approach can hardly be applied in universal settings, which is one of our aims. Protege (see [22]) and the related PROMPT (see [37]) tools are designed for manual ontology development and semi-automatic ontology merging, respectively. PROMPT provides heuristic methods for identification of similarities between ontologies. The similarities are offered to the users for further processing. However, the direct connection to ontology learning, which we find important for dynamic and data-intensive domains like e-health and biomedicine, is missing.
There are several works addressing directly the topic of ontology integration. Alasoud et al. [1] and Calvanese et al. [7] describe two approaches inspired mainly by database techniques of data mediation and query rewriting in order to provide an integrated (global) view on several (local) ontologies. Heflin and Hendler [28] present a web ontology integration method using SHOE, a web-based knowledge representation language, and semi-automatically generated alignments. Deen and Ponnamperuma [12] implement a dynamic and automatic ontology integration technique in multi-agent environments, based on relatively simple graph ontology model inclusions and other operations. Again, none of the approaches tackles the requirements we specify in Section 1. Even though the methods propose solutions to the integration problem in general, there is no direct way to integrate knowledge from unstructured resources while minimising human intervention. Furthermore, there is no emphasis on accessibility of the ontology integration to lay users. Our approach is distinguished by the fact that it pays special attention to these features, which we find essential for the application in e-health and/or bioinformatics.
4. DINO—a dynamic ontology lifecycle scenario
Our integration platform is a part of a broader lifecycle scenario (see [34]). We refer to both the lifecycle and the integration platform by the DINO abbreviation, evoking multiple features of our solution: it reflects three key elements of the lifecycle scenario—Dynamics, INtegration and Ontology; however, the first two parts can also be Data and INtensive; finally, DINO can be read as Dynamic INtegration of Ontologies, too.
All these features express the primary aim of our efforts—to make the knowledge (integration) efficiently and reasonably manageable in data-intensive and dynamic domains.
Fig. 1 depicts the scheme of the proposed dynamic and application-oriented ontology lifecycle that deals with the problems mentioned as a part of our motivations. Our ontology lifecycle builds on four basic phases of an ontology lifecycle: creation (comprises both manual and automatic ontology development and update approaches), versioning, evaluation and negotiation (comprises ontology alignment and merging as well as negotiation among different possible alignments). The four main phases are indicated by the boxes annotated by the respective names. Ontologies or their snapshots in time are represented by circles, with arrows expressing the information flow and transitions between them. The boxes labelled A_i represent actors (institutions, companies, research teams etc.) involved in ontology development, where A_1 is zoomed in to show the lifecycle's components in detail.
The general dynamics of the lifecycle goes as follows: (1) the community experts and/or ontology engineers develop a relatively precise and complex domain ontology (the Community part of the Creation component); (2) the experts use means for continuous ontology evaluation and versioning to maintain high quality and manage changes during the development process, respectively; (3) if the amount of data suitable for knowledge extraction (e.g. domain resources in natural language) is too large to be managed by the community, ontology learning takes place; (4) the ontology learning results are evaluated by human experts and eventually integrated (using the negotiation component) into the more precise reference community ontology, if the respective extensions have been found appropriate.
The integration in the scenario is based on alignment and merging covered by the negotiation component. Its proposal, implementation principles and application in a selected e-health use case form the key contribution of this paper (see Sections 5 and 6 for details). The negotiation component takes its place also when interchanging or sharing the knowledge with other independent actors in the field. All the phases support ontologies in the standard OWL format. In the following we will concentrate on the integration mechanism. More information on other parts of the lifecycle can be found in [34].
5. Dynamic integration of automatically learned knowledge
The key novelty of the presented lifecycle scenario is its support for incorporation of changing knowledge in data-intensive domains, especially when unstructured data (i.e. natural language) is involved. This is achieved by implementation of a specific integration mechanism introduced in this section. The scheme of the integration process is depicted in Fig. 2. The integration scheme details the combination of several generic lifecycle components—mainly the (automatic) creation and negotiation—in the process of incorporation of learned ontologies into a collaboratively developed one. The latter ontology serves as a master, presumably precise model in the process of learned knowledge integration. The master ontology—the O_M circle in Fig. 2—is supposed to be developed within a dedicated external application such as Protege (see http://protege.stanford.edu/). The DINO integration platform itself is implemented as a respective API library and GUI interface.
Simple research prototypes of these applications and user documentation can be downloaded at http://smile.deri.ie/tools/dino.
Fig. 1. Dynamic ontology lifecycle scheme.
Fig. 2. Dynamic ontology integration scheme (the DINO integration scheme realised within the DINO library).
O_M in Fig. 2 represents a reference for integration with the O_L ontology resulting from the learning process. The Ontology Alignment Wrapper produces an alignment ontology O_A that encodes mappings between O_M and O_L. All these ontologies are passed to the Ontology Merging Wrapper that resolves possible inconsistencies and produces an integrated ontology O_I. The Ontology Diff Wrapper compares O_I with the former master ontology O_M and passes the respective additional statements (not present in O_M) to the NLG and Sorted Suggestions Generator component. The NLG (Natural Language Generator) produces a comprehensive natural language representation of all the addition statements. The Sorted Suggestions Generator component outputs the final product of the integration process—particular natural language suggestions on the master ontology extension, sorted according to the user preferences. The suggestions agreed by human users form a base of the next version of the O_M ontology created after the integration. Note that during all phases of integration, we use the former O_M base namespace for all the other ontologies involved. The integration phases outlined in Fig. 2 are described in detail in the sections below.
5.1. Ontology learning wrapper
In this phase, machine learning and NLP methods are used for the processing of relevant resources and extracting knowledge from them (ontology learning). The ontology learning is realised using the Text2Onto framework (see [8]) that is able to extract an ontology from an arbitrary set of textual documents. Due to space restrictions, we cannot properly comment on the methods used for ontology extraction and post-processing in Text2Onto, however, they are described in detail in [32,8]. Note that this component does not tackle selection of the documents the ontology is to be learned from—this task needs to be performed manually by the system users.
In the current implementation, only a restricted subset of possible OWL (DL) constructs is being extracted: rdfs:subClassOf axioms, class instances, named class assertions, owl:disjointWith axioms and owl:ObjectProperty assertions with rdfs:domain and rdfs:range properties specified. owl:equivalentClass relations can be inferred from mutual rdfs:subClassOf axioms between particular classes. owl:equivalentClass and owl:sameAs constructs can also be extracted using the Text2Onto concept/instance similarity determination algorithms, however, their performance, precision and coverage were not found to be sufficient, therefore they are not included in the current version of the DINO framework. We have not performed a rigorous evaluation of the ontology learning step as such, however, the informal precision rate of ontology extraction was about 70% for the sample application described in Section 6 (given by the ratio of meaningful axioms to all extracted axioms).
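To make the above construct subset more tangible, the following minimal Python sketch (using the rdflib library) builds a toy learned-ontology fragment O_L containing exactly the kinds of axioms listed above. The namespace and the class, instance and property names are purely illustrative assumptions and do not reproduce actual Text2Onto output.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Illustrative namespace for a learned ontology fragment (not actual Text2Onto output).
EX = Namespace("http://example.org/learned#")

O_L = Graph()
O_L.bind("ex", EX)

# rdfs:subClassOf axiom between two named classes
O_L.add((EX.blood_cell_c, RDF.type, OWL.Class))
O_L.add((EX.cell_c, RDF.type, OWL.Class))
O_L.add((EX.blood_cell_c, RDFS.subClassOf, EX.cell_c))

# class instance (named individual assertion)
O_L.add((EX.centrosome_i, RDF.type, EX.organelle_c))

# owl:disjointWith axiom
O_L.add((EX.organelle_c, OWL.disjointWith, EX.membrane_c))

# owl:ObjectProperty assertion with rdfs:domain and rdfs:range specified
O_L.add((EX.contains, RDF.type, OWL.ObjectProperty))
O_L.add((EX.contains, RDFS.domain, EX.cell_c))
O_L.add((EX.contains, RDFS.range, EX.organelle_c))

print(len(O_L), "triples in the toy learned fragment O_L")

Fragments of this shape are what the alignment and merging wrappers described below have to reconcile with the master ontology O_M.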
Note that even an arbitrary external ontology can be integrated instead of the learned one, however, the integration results are not necessarily complete in the case of more complex ontologies (e.g., ontologies containing complex restrictions and anonymous classes). This is due to the fact that the current implementation is tailored specifically to the rather simple learned ontologies.
5.2. Ontology alignment wrapper
When the learned ontology O_L has been created, it has to be reconciled with the master ontology O_M since they cover the same domain, but might be structured differently. The reconciliation of these ontologies depends on the ability to reach an agreement on the semantics of the terms used. The agreement takes the form of an alignment between the ontologies, that is, a set of correspondences (or mappings) between the concepts, properties, and relationships in the ontologies. However, the ontologies are developed in different contexts and under different conditions and thus they might represent different perspectives over similar knowledge, so an agreement can only be reached through a negotiation process.
The negotiation process is performed using argumentation-based negotiation that uses preferences over the types of correspondences in order to choose the mappings that will be used to finally merge the ontologies (see Section 5.3). The preferences depend on the context and situation. A major feature of this context is the ontology, and the structural features thereof, such as the depth of the subclass hierarchy and branching factor, ratio of properties to concepts, etc. The analysis of the components of the ontology is aligned with the approach to ontology evaluation demonstrated in [13], and can be formalized in terms of feature metrics. Thus the preferences can be determined based on the characteristics of the ontology. For example, we can select a preference for terminological mapping if the ontology is lacking in structure, or prefer extensional mapping if the ontology is rich in instances.
Thus, the alignment/negotiation wrapper interfaces two tools: one for the ontology alignment discovery and one for the negotiation of an agreed alignment. We call these tools AKit and NKit, respectively, within this section. For the former, we use the ontology alignment API (see [18]) developed by INRIA Rhone-Alpes (see http://alignapi.gforge.inria.fr/ for up-to-date information on the API). For the negotiation we use the framework described in [30]. Both tools are used by the wrapper in order to produce O_A—an ontology consisting of axioms (using constructs like owl:equivalentClass, owl:sameAs, owl:equivalentProperty, rdfs:subClassOf or rdfs:subPropertyOf) merging classes, individuals and properties in the O_L and O_M ontologies. It is used in the consequent factual merging and refinement in the ontology reasoning and management wrapper (see Section 5.3 for details).
Algorithm 1. Meta-algorithm of the alignment and negotiation
Require: O_L, O_M—ontologies in OWL format
Require: AKit, NKit—ontology alignment and alignment negotiation tools, respectively
Require: ALMSET—a set of the alignment methods to be used
Require: PREFSET—a set of alignment formal preferences corresponding to the O_L, O_M ontologies (to be used in NKit)
1: S_A ← ∅
2: for method ∈ ALMSET do
3:   S_A ← S_A ∪ AKit.getAlignment(O_L, O_M, method)
4: end for
5: A_agreed ← NKit.negotiateAlignment(S_A, PREFSET)
6: O_A ← AKit.produceBridgeAxioms(A_agreed)
7: return O_A
The wrapper itself works according to the meta-code in Algorithm 1. The ontology alignment API offers several possibilities of actual alignment methods, which range from trivial lexical equality detection through more sophisticated string and edit-distance based algorithms to an iterative structural alignment by the OLA algorithm (see [17]). The ontology alignment API has recently been extended by a method for the calculation of a similarity metric between ontology entities, an adaptation of the SRMetric used in [44]. We also consider a set of justifications that explain why the mappings have been generated. This information forms the basis for the negotiation framework that dynamically generates arguments, supplies the reasons for the mapping choices and negotiates an agreed alignment for both ontologies O_L and O_M.
5.3. Ontology merging wrapper
This wrapper is used for merging of the O_L and O_M ontologies according to the statements in O_A (each of the ontologies technically represented as a respective Jena ontology model). Moreover, the wrapper resolves possible inconsistencies caused by the merging—favouring the assertions in the O_M ontology, which are supposed to be more relevant. The resulting ontology O_I is passed to the ontology diff wrapper to be compared with the former O_M master ontology. The respective addition model forms a basis for the natural language suggestions that are produced as a final product of the integration (see Sections 5.4 and 5.5 for details).
Algorithm 2. Meta-algorithm of the merging and inconsistency resolution
Require: O_L, O_M, O_A—ontologies in OWL format
Require: merge()—a function that merges the axioms from input ontologies, possibly implementing reasoning routines according to the ontology model used
Require: C—set of implemented consistency restrictions; each element r ∈ C can execute two functions r.detect() and r.resolve() that detect (and return) and resolve an inconsistency in the input ontology, respectively
1: O_I ← merge(O_M, O_L, O_A)
2: inconsistencies ← ∅
3: for r ∈ C do
4:   inconsistencies ← inconsistencies ∪ r.detect(O_I)
5:   O_I ← r.resolve(O_I)
6: end for
7: return O_I, inconsistencies
Algorithm 2 describes the meta-code of the process arranged by the ontology merging and reasoning wrapper. We currently employ no reasoning in the merge() function. However, sub-class subsumption (as implemented by the Jena framework) is used when detecting and resolving inconsistencies. The inconsistencies are constituted by user-defined restrictions. These restrictions are implemented as extensions of a generic inconsistency detector and resolver in the ontology merging wrapper. Thus we can implement either logical (in terms of Description Logics, see [2]) inconsistencies, or custom-defined inconsistencies (e.g. cyclic definitions) according to the requirements of particular practical applications.
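To illustrate the control flow of Algorithm 2 only (not the actual Jena-based DINO code), here is a minimal Python sketch in which an ontology is simplified to a set of subject-predicate-object triples and each consistency restriction is an object with detect() and resolve() callables; all names are illustrative assumptions.

from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]
Ontology = Set[Triple]

class Restriction:
    """One user-defined consistency restriction (cf. the set C in Algorithm 2)."""
    def __init__(self,
                 detect: Callable[[Ontology], List[Triple]],
                 resolve: Callable[[Ontology], Ontology]):
        self.detect = detect
        self.resolve = resolve

def merge(o_m: Ontology, o_l: Ontology, o_a: Ontology) -> Ontology:
    # Plain union of axioms; no reasoning, mirroring the paper's merge() step.
    return set().union(o_m, o_l, o_a)

def integrate(o_m: Ontology, o_l: Ontology, o_a: Ontology,
              restrictions: List[Restriction]) -> Tuple[Ontology, List[Triple]]:
    """Rough analogue of Algorithm 2: merge, then let every restriction
    report and repair the inconsistencies it is responsible for."""
    o_i = merge(o_m, o_l, o_a)
    inconsistencies: List[Triple] = []
    for r in restrictions:
        inconsistencies.extend(r.detect(o_i))
        o_i = r.resolve(o_i)
    return o_i, inconsistencies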
The automatic inconsistency resolution itself is somewhat tricky. However, we can apply a sort of "greedy" heuristic, considering the assertions in the master O_M ontology to be more valid. Therefore we can discard axioms from O_L or O_A that are inconsistent with axioms in O_M—we call such axioms candidate axioms in the text below. If there are more such axioms, we discard them one by one randomly until the inconsistency is resolved. (This is the currently implemented way, however, we plan to improve the selection of candidate axioms according to the confidence ranking produced by the Text2Onto tool—similarly to the technique described in [26]. This is scheduled for the next version of the DINO integration library.) If all the conflicting axioms originated in O_M, we just report them without resolution. We currently implement and resolve the following inconsistencies:
• Sub-class hierarchy cycles: these are resolved by cutting the cycle, i.e. removing a candidate rdfs:subClassOf statement;
• Disjointness-subsumption conflicts: if classes are said to be disjoint and a sub-class relationship holds between them at the same time, a candidate conflicting assertion is removed;
• Disjointness-superclass conflicts: if a class is said to be a sub-class of classes that are disjoint, a candidate conflicting assertion is removed;
• Disjointness-instantiation conflicts (a specialisation of the above): if an individual is said to be an instance of classes that are disjoint, a candidate conflicting assertion is removed.
The first one is a non-logical inconsistency, whereas the remaining three are examples of logical inconsistencies. More on the types and nature of logical (DL) inconsistencies can be found for instance in [21]. Since most logical inconsistencies are introduced by negative constructs like owl:disjointWith, owl:complementOf or owl:differentFrom, we can easily adapt the above techniques related to disjointness in order to support additional inconsistency types. A transparent and flexible support of arbitrary non-logical consistency constraints is a part of our future work. We plan to implement this feature on top of user-defined rules (expressing facts like "if X is a male mammal, then it does not have an ovary"). DINO will not include the learned statements that are in conflict with the respective rule-based constraints into the merged ontology. Certain more subtle issues related to the ontology design (such as possibly unwelcome multiple inheritance) cannot, however, be generally handled even by the rule-based inconsistency resolution, therefore the more sophisticated refinement of the integrated ontology is deliberately left to the user. Note that each element of the set of inconsistencies returned by Algorithm 2 (besides the integrated ontology itself) is associated with a respective simple natural language description. The descriptions are presented for further examination by human users in the DINO user interface.
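As a concrete example of one such restriction, the sketch below (again only an illustration, written with rdflib rather than the Jena models used by DINO) detects disjointness-subsumption conflicts and applies the greedy preference for the master ontology described above. It only inspects directly asserted rdfs:subClassOf statements, not inferred subsumption, and the class names it would operate on are arbitrary.

from rdflib import Graph
from rdflib.namespace import OWL, RDFS

def resolve_disjoint_subclass_conflicts(o_i: Graph, o_m: Graph):
    """Detect classes that are asserted to be disjoint and in a sub-class
    relation at the same time in the merged ontology o_i, and drop the
    conflicting axiom that is not asserted in the master ontology o_m."""
    reports = []
    for c1, _, c2 in list(o_i.triples((None, OWL.disjointWith, None))):
        for sub, sup in ((c1, c2), (c2, c1)):
            if (sub, RDFS.subClassOf, sup) in o_i:
                reports.append((sub, sup))
                if (sub, RDFS.subClassOf, sup) not in o_m:
                    # The subsumption is a learned/alignment axiom: discard it.
                    o_i.remove((sub, RDFS.subClassOf, sup))
                elif (c1, OWL.disjointWith, c2) not in o_m:
                    # Otherwise discard the disjointness if it did not come from o_m.
                    o_i.remove((c1, OWL.disjointWith, c2))
                # If both axioms originate in o_m, only report, as in Section 5.3.
    return reports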
5.4. Ontology diff wrapper
A possible extension of a master ontology O_M by elements contained in the merged and refined ontology O_I naturally corresponds to the differences between them. In particular, the possible extensions are equal to the additions that O_I brings into O_M. The additions can be computed in several ways. The ontology diff wrapper in DINO offers a way to uniformly interface the particular methods of addition computation. No matter which underlying method is employed, a respective Jena ontology model containing the additions is returned. Currently, the following methods are implemented within the wrapper:
(1) SemVersion-based diff computation—additions at the RDF (triple) level computed using the SemVersion library (see [43]);
(2) addition model computation by set operations on the underlying Jena RDF models;
(3) addition model computation by direct iterative querying of the former master ontology model, integrated model and alignment model for reference purposes (see Algorithm 3 for details on the implementation).
For the practical experiments with ontologies, we have used the third method—mainly due to the fact that it computes the additions directly at the ontology level and not at the lower triple level (which would mean subsequent processing load when getting back to the ontology model again).
Algorithm 3. Meta-algorithm of the addition model computation (by direct model querying)
Require: O_M, O_I, O_A—former master, integrated and alignment ontologies, respectively
Require: copyResource()—a function that returns a copy of an ontology resource (e.g. class or property) including all relevant features that are bound to it (e.g. subclasses, superclasses, instances for a class, or domain and range for a property)
1: O_added ← ∅
2: for c ∈ O_I.getNamedOntologyClasses() do
3:   if not (O_M.contains(c) or O_A.contains(c)) then
4:     O_added ← O_added ∪ copyResource(c)
5:   end if
6: end for
7: for p ∈ O_I.getOntologyProperties() do
8:   if not (O_M.contains(p) or O_A.contains(p)) then
9:     O_added ← O_added ∪ copyResource(p)
10:  end if
11: end for
12: return O_added
Note that the algorithm does not compute all differences between arbitrary ontologies in general. However, this is no drawback for the current implementation of DINO integration. We deal with a learned ontology extending the master one. The extensions originating in automatically learned knowledge do not cover the whole range of possible OWL constructs, thus we do not need to tackle e.g. anonymous classes and restrictions in the addition model computation. Therefore the employed custom addition computation can be safely applied without any loss of information. The computed addition ontology model is then passed to the suggestion sorter (see Section 5.5 for details).
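The following is a hedged Python/rdflib sketch of the addition model computation in the spirit of Algorithm 3. It treats a resource as "known" if it is described anywhere in O_M or O_A, and copies only directly attached statements, which is a simplification of the contains() and copyResource() operations assumed in the algorithm.

from rdflib import Graph
from rdflib.namespace import OWL, RDF

def addition_model(o_i: Graph, o_m: Graph, o_a: Graph) -> Graph:
    """Copy the named classes and properties of the integrated ontology o_i
    that occur neither in the former master o_m nor in the alignment o_a."""
    added = Graph()

    def known(resource) -> bool:
        # Simplified containment check: the resource is described somewhere
        # in o_m or o_a (wildcard triple patterns are supported by rdflib).
        return (resource, None, None) in o_m or (resource, None, None) in o_a

    def copy_resource(resource) -> None:
        for triple in o_i.triples((resource, None, None)):
            added.add(triple)

    for cls in o_i.subjects(RDF.type, OWL.Class):
        if not known(cls):
            copy_resource(cls)
    for prop in o_i.subjects(RDF.type, OWL.ObjectProperty):
        if not known(prop):
            copy_resource(prop)
    return added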
5.5. Sorted suggestions generator
The addition ontology passed to this component forms a base for the eventual extension suggestions for the domain experts. In order to reduce the effort in the final reviewing of the master ontology extensions, we create respective simple natural language suggestions that are associated with corresponding facts in the addition ontology model. The natural language suggestions are then presented to users—when a suggestion is accepted by the users, the associated fact is included into the master ontology model.
Table 1 shows a scheme of the natural language (NL) suggestion generation.
Table 1. Scheme of suggestion generation
Axiom pattern | NL suggestion scheme | Example
Class c1 is related by relation r to class c2 | The class c1.label() f(r) the class c2.label(). | The class "differences" is disjoint with the class "inclusions".
Individual i is a member of class c | The class c.label() has the i.label() instance. | The class "the_cytoskeleton_organiser_c" has the "centrosome_i" instance.
Property p1 with features x is related to property p2 by relation r | There is a p1.label() g(x) property. It is f(r) p2.label(). | There is a "contains" object property. It has the "hasParts" superproperty.
Property p1 with features x has domain/range class c | There is a p1.label() g(x) property. Its domain/range is the c.label() class. | There is a "contains" object property. Its range is the "organs" class.
The r variable represents possible relations between classes or properties (e.g. rdfs:subClassOf, rdfs:subPropertyOf or owl:disjointWith), mapped by the function f() to a respective natural language representation (e.g. is a sub-class of, is a sub-property of or is disjoint with). The x variable represents possible features of a property (e.g. owl:ObjectProperty or owl:FunctionalProperty), mapped by the function g() to a respective natural language representation (e.g. object or functional).
In general, the number of suggestions originating from the addition ontology model can be quite large, so an ordering that takes a relevance measure of possible suggestions into account is needed. Thus we can, for example, eliminate suggestions with a low relevance level when presenting the final set to the users (without overwhelming them with a large number of possibly irrelevant suggestions). As a possible solution to this task, we have proposed and implemented a method based on string subsumption and a specific distance measure (see [31]). These two measures are used within the relevance computation by comparing the lexical labels occurring in a suggestion with respect to two sets S_p, S_n of words, provided by users. The S_p and S_n sets contain preferred and unwanted words respectively, concerning the lexical level of optimal extensions. The suggestions T are sorted according to the respective rel(T, S_p) - rel(T, S_n) values, where rel(T, S) is a function measuring the relevance of the suggestion triple T with respect to the words in the set S. The higher the value, the more relevant the suggestion triple is. We develop the relevance function in detail in Algorithm 4.
Algorithm 4. The relevance function
Require: S_t—a set of (possibly multiword) lexical terms occurring in the suggestion
Require: S—set of words
Require: p ∈ (0,1)—influences the absolute value of the relevance measure
Require: t—integer constant; maximal allowed distance
Require: levDist(s1, s2)—Levenshtein distance implementation
1: for elem ∈ S_t do
2:   R_elem ← 0
3: end for
4: for elem ∈ S_t do
5:   if elem is a substring of or equal to any word in S, or vice versa, then
6:     R_elem ← 1
7:   else
8:     d ← ∞
9:     for v ∈ S do
10:      if levDist(elem, v) < d then d ← levDist(elem, v) end if
11:    end for
12:    if d ≤ t then
13:      R_elem ← 1 - d/t
14:    else if elem is a multiword term then
15:      L ← set of single terms in the elem label expression
16:      EXP ← 0
17:      for u ∈ L do
18:        if u is a substring of or equal to any word in S, or vice versa, then
19:          EXP ← EXP + 1
20:        else
21:          d ← ∞
22:          for v ∈ S do
23:            if levDist(u, v) < d then d ← levDist(u, v) end if
24:          end for
25:          if d ≤ t then EXP ← EXP + (1 - d/t) end if
26:        end if
27:      end for
28:      if EXP = 0 then R_elem ← 0 else R_elem ← p · EXP/|L| end if
29:    end if
30:  end if
31: end for
The function naturally measures the "closeness" of the labels occurring in the suggestion to the set of terms in S. The value of 1 is achieved when the label is a direct substring of or equal to any word in S or vice versa. When the Levenshtein distance between the label and a word in S is lower than or equal to the defined threshold t, the relevance decreases from 1 by a value proportional to the fraction of the distance and t. If this is not the case (i.e. the label's distance is greater than t for each word in S), a similar principle is applied for the possible word-parts of the label and the relevance is further proportionally decreased (the minimal possible value being 0).
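The following Python sketch restates the relevance computation described in the preceding paragraph. The threshold t, the scaling constant p and the aggregation of partial word matches are illustrative assumptions rather than the exact original constants and formula, and the underscore-based splitting of multiword labels merely mimics the learned ontology's naming convention.

def lev_dist(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def term_relevance(term: str, words: set, t: int = 3, p: float = 0.5) -> float:
    """Relevance of one suggestion term with respect to a word set (preferred
    or unwanted): 1.0 for substring/equality matches, decreasing with the
    Levenshtein distance, partial credit for matching word parts."""
    if any(term in w or w in term for w in words):
        return 1.0
    d = min((lev_dist(term, w) for w in words), default=t + 1)
    if d <= t:
        return 1.0 - d / t
    parts = [x for x in term.split("_") if x]
    if len(parts) > 1:
        exp = 0.0
        for u in parts:
            if any(u in w or w in u for w in words):
                exp += 1.0
            else:
                du = min((lev_dist(u, w) for w in words), default=t + 1)
                if du <= t:
                    exp += 1.0 - du / t
        return p * exp / len(parts)  # assumed aggregation of partial matches
    return 0.0

def rank_suggestions(suggestions, preferred: set, unwanted: set):
    """Sort suggestions (each given as a set of lexical terms) by
    rel(T, S_p) - rel(T, S_n), highest first."""
    def rel(terms, words):
        return sum(term_relevance(x, words) for x in terms) if words else 0.0
    return sorted(suggestions,
                  key=lambda terms: rel(terms, preferred) - rel(terms, unwanted),
                  reverse=True)

Suggestions whose resulting score is positive, zero or negative would then populate the S+, S0 and S- classes used in the evaluation in Section 6.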
Note that the complexity of the sorting itself mostly contributes to the overall complexity of the relevance-based sorting of suggestions. As can be found out from Algorithm 4, the complexity is in O(cmnl^2 + m log m) (c—maximal number of terms occurring in a suggestion, thus a constant; m—number of suggestions; n—number of words in the preference sets; l—maximal length of a word in the suggestion terms, basically a constant), which gives O(m(n + log m)). As the size of the sets of user preferences can be practically treated as constant, we obtain the O(m log m) complexity class with respect to the number of suggestions, which is feasible.
5.6. Natural language generation (NLG) component
The DINO framework is supposed to be used primarily by users who are not experts in ontology engineering. Therefore the suggestions are produced in the form of very simple natural language statements, as seen in the previous section. Moreover, we automatically create a natural language representation of the whole addition model, interfacing the framework described in [42]. This is meant to further support laymen users by a readable representation of the whole addition model in order to give them an overall impression of the changes. The single suggestions are still bound to the underlying statement in the addition ontology model. Therefore a user can very easily add the appropriate OWL axioms into the new version of the O_M master ontology without actually dealing with the intricate OWL syntax itself. Concrete examples of both suggestions and the continuous natural language representation of the addition model are given in Section 6.
6. Example application and results of DINO integration
We applied the integration technique described in Section 5 in the context of data typical for biomedical research. However, the way of exploiting the DINO integration technique reported in this section is rather general, since it aims at cost-efficient extension or population of a master ontology by knowledge learned from empirical data. Thus, a similar deployment of the integration can actually help to tackle the needs of many other possible use cases. Real world data for the master ontology and ontology learning sources were used. More specifically, we employed resources from the CO-ODE biomedicine ontology fragment repository (see http://www.co-ode.org/ontologies) and data from relevant Wikipedia topics, respectively.
Rigorous evaluation of the whole process of integration is a complex task involving a lot of open problems as its sub-problems (for instance, there is no standard ontology evaluation process applicable in general—see [27,13]). Moreover, there is an emphasis on the human-readable and laymen-oriented form of the integration process results. This dimension forms a primary axis of the evaluation, however, its realisation involves logistically demanding participation of a broader (biomedicine) expert community. Accomplishing the above tasks properly is a part of our future work. Nonetheless, there are several aspects that can be assessed and reported even without devising an optimal ontology evaluation method (which may be impossible anyway) and/or involving a large representative sample of domain experts:
• features of the learned ontology (e.g.
size or complexity);
• mappings established by alignment;
• basic assessment of the quality and correctness of suggestions and their sorting according to defined preferences.
These factors of integration are analysed and discussed within an experimental application described in Section 6.1. The negotiation component has recently been evaluated separately as a stand-alone module, using the Ontology Alignment Evaluation Initiative test suite (see http://oaei.ontologymatching.org/) and experiments on the impact that the argumentation approach has over a set of mappings. A comparison with respect to current alignment tools is presented in [29]. The preliminary results of these experiments are promising and suggest that the argumentation approach can be a beneficial and effective solution to the problem of dynamically aligning heterogeneous ontologies. This also justifies the application of the implemented technique in the ontology integration task.
6.1. Experimental integration of biomedical research knowledge—extension of a (blood) cells ontology fragment
In order to show the basic features of our novel integration technique in practice, we tested the implementation using knowledge resources from the biomedicine domain (should the reader be interested, all relevant resources used and/or created during the described experiment are available at http://smile.deri.ie/resources/2007/08/31/dino_exp_data.zip). In particular, we combined fragments of the GO cellular component description and a eukaryotic cell description (samples downloaded from the CO-ODE repository, see http://www.co-ode.org/ontologies/bio-tutorial/sources/GO_CELLULAR_COMPONENT_EXTRACT.owl and http://www.co-ode.org/ontologies/eukariotic/2005/06/01/eukariotic.owl, respectively) to form the master ontology. In the example scenario, we wanted to extend this master ontology using the content of the Wikipedia entries on Cells_(biology) and Red_blood_cell. These resources were passed to the ontology learning DINO component and a respective ontology was learned. Both master and learned ontology samples are displayed in Fig. 3 (on the left-hand and right-hand side, respectively).
Fig. 3. Sample from master and learned ontology.
Note that these master and learned ontologies correspond to the O_M, O_L ontologies displayed in Fig. 2, Section 5. The names in the learned ontology have specific suffixes (i.e. "_c"). This is due to the naming conventions of the ontology learning algorithm we use. We keep the suffixes in suggestions, since they help to easily discriminate what comes from empirical data and what from the master ontology. However, we filter them out when generating the text representing the whole extension model (see below for examples).
Table 2 compares the metric properties of the master and learned ontologies, as computed by the Protege tool. The particular metrics are expanded as follows: M1—number of named classes (all/primitive/defined); M2—number of parents per class (mean/median/maximum); M3—number of siblings per class (mean/median/maximum); M4—number of anonymous classes (restrictions); M5—number of properties (all/object/datatype); M6—Description Logics expressivity.
Table 2. Metrics of master and learned ontologies
Metric/ontology | M1 | M2 | M3 | M4 | M5
Learned | 391/379/12 | 3/1/5 | 7/1/16 | 0 | 13/13/0
Master | 40/36/4 | 2/1/2 | 5/1/15 | 16 (restr.) | 1/1/0
The learned ontology has a higher ratio of primitive classes; moreover, it contains no restriction on class definitions. There are some simple object properties with both domains and ranges defined. Its DL expressivity allows concept intersection, full universal and existential quantification, atomic and complex negation and datatypes.
The expressivity of the master ontology does not involve datatypes; however, it contains numeric restrictions. Summing up, the master ontology contains several complicated constructs not present in the learned ontology; however, the ontology learned from only two simple and relatively small resources is much larger.
When computing the negotiated alignment (the O_A ontology as given in Fig. 2, Section 5) between the master and learned ontology, 207 mappings were produced and, among them, 16 were accepted. A sample from the alignment ontology is displayed in Fig. 4.
Fig. 4. Sample alignment (e.g. <owl:equivalentClass rdf:resource="#the_ribosome_c"/>).
Merging of the learned and master ontologies according to the computed alignments results in several inconsistencies—the report generated by DINO is displayed in Fig. 5. Two of these three inconsistencies are resolved correctly (according to human intuition) by the algorithm, forming an integrated ontology O_I, as displayed in Fig. 2, Section 5. After resolving the inconsistencies (three inconsistencies per integrated resource were resolved on average within our experiment) and generating the addition model, natural language suggestions (associated with the respective OWL axioms) are produced. Sample suggestions associated with the respective relevance measures are displayed in Fig. 6. A portion of the continuous text generated by the NLG component that corresponds to the addition model is displayed in Fig. 7. Similar "pretty" texts are to be presented to users in the extended DINO interface (the current interface offers only raw text, however, the necessary parsing, filtering and highlighting of the ontology terms is under construction). It provides users with an additional source of lookup when deciding which suggestions to accept into the next version of the master ontology.
The suggestions are the ultimate output of the integration algorithm. Their main purpose is to facilitate laymen effort in the incorporation of new knowledge from unstructured resources into an ontology. Therefore we performed a basic evaluation of several parameters that influence the actual applicability of the suggestions. We ran the integration algorithm on the same data with four different suggestion-preference sets, simulating four generic trends in the preference definition:
• specification of a rather small number of preferred terms, no unwanted terms;
• specification of a rather small number of preferred and unwanted terms;
• specification of a larger number of preferred terms, no unwanted terms;
• specification of a larger number of preferred and unwanted terms.
Table 3 gives an overview of the four iterations, the particular preferred and unwanted terms and the distribution of suggestions into relevance classes. The terms were set by a human user arbitrarily, reflecting a general interest in clinical aspects of the experimental domain knowledge. The terms in the preference sets reflect possible topics to be covered by the automatic extension of the current ontology.
Table 3. Iterations—the preference sets and sizes of the resulting suggestion classes
Iteration | Preferred | Unwanted | S+ | S0 | S- | S
i1 | cell; autoimmune disease; transport; drug; gene; DNA | (none) | 310 | 429 | 0 | 739
i2 | cell; autoimmune disease; transport; drug; gene; DNA | bacteria; prokaryotic; organelle; wall; chromosome; creation | 250 | 344 | 145 | 739
i3 | cell; autoimmune disease; transport; drug; gene; DNA; eukaryotic; organ; function; part; protein; disease; treatment; cell part; immunosuppression; production | (none) | 485 | 254 | 0 | 739
i4 | cell; autoimmune disease; transport; drug; gene; DNA; eukaryotic; organ; function; part; protein; disease; treatment; cell part; immunosuppression; production | bilayer; bacteria; prokaryotic; additional function; organelle; macromolecule; archaeon; vessel; wall; volume; body; cell nucleus; chromosome; erythrocyte; creation | 314 | 292 | 133 | 739
S+, S0 and S- are the classes of suggestions with relevance greater than, equal to and lower than zero, respectively (S = S+ ∪ S0 ∪ S-). For each of the relevance classes induced by one iteration, we randomly selected 20 suggestions and computed two values on this sample:
• P_x, x ∈ {+, 0, -}—the ratio of suggestions correctly placed by the sorting algorithm into an order defined by a human user for the same set (according to the interest defined by the particular preferences);
• A_x, x ∈ {+, 0, -}—the ratio of suggestions that are considered appropriate by a human user according to his or her knowledge of the domain (among all the suggestions in the sample).
The results are summed up in Table 4. More details on the interpretation of all the experimental findings are given in the consequent Section 6.2.
Table 4. Evaluation of random suggestion samples per class
Iteration | P+ | A+ | P0 | A0 | P- | A-
i1 | 0.45 | 0.75 | 0.90 | 0.60 | - | -
i2 | 0.45 | 0.75 | 1.00 | 0.80 | 0.60 | 0.70
i3 | 0.70 | 0.80 | 0.95 | 0.75 | - | -
i4 | 0.55 | 0.75 | 0.70 | 0.85 | 0.50 | 0.85
Fig. 5. Report on inconsistencies:
Inconsistency: The following classes are disjoint and in a mutual sub-class relationship at the same time: "organelle_c" and "nucleus_c".
Inconsistency: The following classes are disjoint and in a mutual sub-class relationship at the same time: "cell_c" and "blood_cell_c".
Inconsistency: The following classes are disjoint and in a mutual sub-class relationship at the same time: "cell_wall_c" and "membrane_c".
Fig. 6. Sample suggestions:
Relevance: 0.75; Suggestion: The class "cell_nucleus_c" is disjoint with the class "compartment_c".
Relevance: 0.083333336; Suggestion: The class "Nucleus" is equivalent to the class "nucleus_c".
Relevance: 0.0; Suggestion: The class "organelle_c" has the "mitochondrium_c" subclass.
Relevance: 0.0; Suggestion: The class "Mitochondrion" is equivalent to the class "mitochondrium_c".
Relevance: -0.3333333; Suggestion: The class "chromosome_c" has the "Organelle" superclass.
Relevance: -0.9166666; Suggestion: The class "Chromosome" is equivalent to the class "chromosome_c".
Fig. 7. Sample from the generated continuous text:
There are "Cells", "Nucleuss", "bacteriums", and "genetic diseases". There are "red blood cells", "absorptions", "additional functions", "advantages", and "archaeons". There are "autoimmune diseases", "aplasiums", "appendages", "areas", and "atoms". There are "bacterias", "bacteriums", "beacons", "bilayers", and "blockages". There are "cannots", "capacitys", "capsules", "cells", and "changes". There are "chloroplasts", "chromosomals", "ciliums", "coagulations", and "comparisons".
6.2. Discussion of the experiment results
The DINO integration library allows users to submit the resources containing knowledge they would like to reflect in their current ontology. The only thing that is needed is to specify preferences on the knowledge to be included, using the sets of preferred and unwanted terms. After this, sorted suggestions on possible ontology extensions (after resolution or reporting of possible inconsistencies) can be produced and processed in minutes, whereas the purely manual development and integration of a respective ontology would take hours even for relatively simple natural language resources. Moreover, it would require a certain
experience with knowledge engineering, which is uncommon among biomedicine domain experts.
In Section 6.1 we described the application of our integration technique to an extension of a biomedical research ontology fragment. The analysed results show that the suggestions produced are mostly correct (even though rather simple and sometimes obvious) with respect to the domain in question, ranging from 50% to 85% among the algorithm iterations. The relevance-based sorting according to preferences is more appropriate in the case of irrelevant (zero relevance) suggestions, ranging from 70% to 100% of correctly placed suggestions. Its precision in the case of suggestions with positive and negative relevance is lower, ranging from 45% to 70%. More terms in the preference sets cause better sorting performance (the ratio of appropriate suggestions being independent of this fact). Thus, the best discrimination in terms of presenting the most relevant suggestions first is achieved for larger preference sets. However, even the discrimination for smaller sets is fair enough (as seen in Table 3 in the previous section).
The automatically produced natural language suggestions can be very easily browsed and assessed by users who are not familiar with ontology engineering at all. Since the respective axioms are associated with the suggestions, their inclusion into another version of the master ontology is pretty straightforward once a suggestion is followed by a user. The DINO integration technique still needs to be evaluated with a broader domain expert audience involved, however, even the preliminary results presented here are very promising in the scope of the requirements specified in Section 1.
7. Notes on realistic DINO deployment
The EU IST 6th Framework project RIDE has identified and analysed several biomedical use case areas in [15] relevant to the deployment of Semantic Web technologies (i.e., ontologies and related querying, knowledge and data management tools). The scope of [15] is rather broad, however, we can track a few specific areas with significant needs that can be covered by the DINO ontology lifecycle and integration framework (Section 7.1). Section 7.2 discusses the preliminary feedback of our potential users and consequently suggests the most appropriate modes of the DINO prototype exploitation.
7.1. Selected use case areas
7.1.1. Longitudinal electronic health record
The main topic here is the development of standards and platforms supporting the creation and management of long-term electronic health records of particular patients. These records should be able to integrate various sources of data coming from the different medical institutions a patient may have been treated in during their whole life. Quite obviously, one needs to integrate different data sources, very often present in unstructured natural language form. Ontologies extracted from the respective patient data resources can very naturally support their integration into longitudinal electronic health records by means of DINO.
7.1.2. Epidemiological registries
Epidemiology analyses diseases, their reasons, statistical origins and their relation to a selected population sample's socioeconomic characteristics. Epidemiological registries should be able to reasonably store and manage data related to population samples and their medical attributes in order to support efficient processing of the respective knowledge by the experts. In this use case area, one has to integrate knowledge from electronic health records in order to create population-wise repositories. Once the ontology-enabled electronic health records are created (DINO can help here as mentioned above), one can integrate them within another version of an "epidemiology" ontology (again, by means of the DINO framework). The resulting model can be employed to perform symbolic analysis (using ontology-based symbolic querying and logical inference) of the registry data, complementing the statistical numeric analysis methods.
7.1.3. Public health surveillance
Public health surveillance represents the ongoing collection, analysis, interpretation and dissemination of health-related data in order to facilitate a public health action reducing mortality and/or improving health. The use case area puts an emphasis on efficient dynamic processing of new data that are mostly in free natural language text form, which can be directly facilitated by the DINO integration of the respective learned ontologies. Ontologies created from and extended by urgent dynamic data can efficiently support expert decisions in risk management tasks. Continuous integration of less urgent data from various sources (either texts or ontologies) can then support studies on public health issues in the long term perspective.
7.1.4. Management of clinical trials
Clinical trials are studies of the effects of newly developed drugs on real patient samples. They are an essential part of the approval of new drugs for normal clinical use and present an important bridge between medical research and practice. Efficient electronic data representation and querying is crucial here. However, even if the data are electronically represented, problems with their heterogeneity and integration occur, as there are typically several different institutions involved in a single trial. The presented integration method can help in coping with the data heterogeneity here, especially when some of the data is present in natural language form.
7.2. Preliminary user feedback and lessons learned
We presented a DINO demo and/or sample knowledge integration results to biomedical domain and ontology engineering experts (namely researchers from the REMEDI institute, see http://www.nuigalway.ie/remedi/, Prof. Werner Ceusters, M.D., director of the Ontology Research Group of the New York State Center of Excellence in Bioinformatics and Life Sciences, and ontology engineers from the Knowledge Engineering Group at the University of Economics in Prague). We also discussed a sketch of the DINO application in the above practical use cases with them.
Their preliminary feedback can be summarised into the following three points: (1) the framework was considered a helpful complement to the traditional manual ontology development environments (such as Protege); (2) the results were found promising concerning the scalable ontology extension by the knowledge in unstructured domain resources, however, a certain refinement by ontology engineers was generally considered a must in order to maintain a high quality of the respective master biomedical ontologies; (3) the natural language presentation of the sorted extension suggestions was found to be very useful for the domain experts with no ontology engineering background. The last finding has been further supported by the recent evaluation of the natural language generation framework we use in DINO (see [11] for details).
Following the discussion with the domain and ontology engineering experts, we can distinguish between two practical and reliable DINO application modes with different requirements on the expert user involvement:
• Instance-only integration: an ontology learned from the textual resources is semi-automatically integrated into a master ontology, taking only instance-related assertions into account, i.e., the upper ontology is populated with new instances of the present concepts and with relations among the instances. Such an application does not require any extensive expert involvement of ontology engineers, since the instance-related suggestions produced by DINO are relatively reasonable according to our discussions with domain experts. Severe modelling errors can only be introduced very rarely, therefore the expert knowledge of the domain alone is generally enough to decide which DINO suggestions to follow in the master ontology extension.
• Full-fledged integration: an unrestricted processing of the DINO suggestions, i.e., taking also the class-related assertions into account, requires more careful expert involvement in order to guarantee a high quality of the integration results. Ontology experts are generally still needed when resolving possible modelling bugs (such as multiple class inheritance or redundant disjointness relations) that might be insufficiently tackled by the domain experts when processing the natural language DINO suggestions. State of the art methodologies such as ontology "re-engineering" as introduced in [3] can help when applying DINO this way.
8. Summary and future work
We have presented the basic principles of DINO—a novel lifecycle scenario and framework for ontology integration and maintenance in dynamic and data-intensive domains like medicine. As a core contribution of the paper, we have described the mechanism of integration of automatically learned and manually maintained medical knowledge. The presented method covers all the requirements specified in Section 1. The proposed combination of automatic and manual knowledge acquisition principles, integration and inconsistency resolution ensures more scalable production and extension of ontologies in dynamic domains.
We presented and analysed results of a preliminary practical application of the DINO integration technique in Section 6. Section 7 outlined possible applications in realistic use case areas that have been recently identified in the biomedicine and e-health fields. The section also summarised preliminary feedback of our potential users. Based on the feedback analysis, two practical DINO application modes were suggested. Note that we have also delivered prototype implementations of a DINO API library and a respective GUI interface (research prototypes of the software can be downloaded at http://smile.deri.ie/tools/dino). Since the primary funding project has finished, we are in the process (as of 2008) of securing further funding that could support additional improvements of DINO. These improvements consist mainly of an extended support for inconsistency resolution, integration with state-of-the-art ontology editors (primarily Protégé) and extension of the DINO user interface (e.g., providing explicit support for the two application modes given in Section 7.2). Moreover, we have recently started to work on another project with motivations similar to DINO's, however, with much more ambitious goals. [35] presents a preliminary proposal and results of a novel empirical knowledge representation and reasoning framework. One of the principal applications of the researched framework is a complex empirical inference-based integration of arbitrary emergent knowledge (e.g., learned ontologies) with precise manually designed knowledge bases. We plan to combine the ontology integration powered by the reasoning described in [35] with the results achieved within the DINO implementation in order to allow for more efficient, scalable, user-friendly and robust dynamic maintenance of (partially emergent) ontologies. Last but not least, we are going to continuously evaluate the resulting framework among broader biomedicine expert communities and improve it in line with demands of interested industry partners (possibly, but not only, within the presented real-world application domains). Acknowledgments The article is a significantly extended version of the paper: Vít Nováček, Loredana Laera, Siegfried Handschuh. Dynamic Integration of Medical Ontologies in Large Scale. In: Proceedings of the WWW2007/HCLSDI workshop. ACM Press, 2007. (see http://www2007.org/workshops/paper_141.pdf). The presented work has been kindly supported by the EU IST 6th Framework's Network of Excellence 'Knowledge Web' (FP6-507482), by the 'Lion' project funded by Science Foundation Ireland under Grant No. SFI/02/CE1/I131 and partially by the Academy of Sciences of the Czech Republic, 'Information Society' national research program, grant number AV 1ET100300419. Moreover, we greatly appreciated the anonymous reviewers' remarks that resulted in significant improvements of the submitted text. References [1] Alasoud A, Haarslev V, Shiri N. A hybrid approach for ontology integration. In: Proceedings of the 31st VLDB conference. Very Large Data Base Endowment; 2005. [2] Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF. The description logic handbook: theory, implementation, and applications. Cambridge, USA: Cambridge University Press; 2003. [3] Bechhofer S, Gangemi A, Guarino N, van Harmelen F, Horrocks I, Klein M, Masolo C, Oberle D, Staab S, Stuckenschmidt H, Volz R. Tackling the ontology acquisition bottleneck: an experiment in ontology re-engineering, 2003.
Retrieved at: http://citeseer.ist.psu.edu/bechhofer03taclding.html, Apr 3 2008. [4] Bechhofer S, van Harmelen F, Hendler J, Horrocks I, McGuinness DL, Patel-Schneider PF, Stein LA. OWL Web Ontology Language Reference, 2004. Available from (February 2006): http://www.w3.org/TR/owl-ref/. [5] Berners-Lee T, Hendler J, Lassila O. The semantic web. Scientific American, 5, 2001. [6] Brickley D, Guha RV. RDF Vocabulary Description Language 1.0: RDF Schema, 2004. Available from (February 2006): http://www.w3.org/TR/rdf-schema/. [7] Calvanese D, Giacomo GD, Lenzerini M. A framework for ontology integration. In: Proceedings of the first semantic web working symposium. Springer-Verlag; 2001. [8] Cimiano P, Völker J. Text2Onto—a framework for ontology learning and data-driven change discovery. In: Proceedings of the NLDB 2005 conference. Springer-Verlag; 2005. p. 227-38. [9] Corcho O, Lopez-Cima A, Gomez-Perez A. The ODESeW 2.0 semantic web application framework. In: Proceedings of WWW 2006. New York: ACM Press; 2006. p. 1049-50. [10] Brewster C, Ciravegna F, Wilks Y. User-centred ontology learning for knowledge management. In: Proceedings of the seventh international workshop on applications of natural language to information systems, Stockholm. 2002. [11] Davis B, Iqbal AA, Funk A, Tablan V, Bontcheva K, Cunningham H, Handschuh S. Roundtrip ontology authoring. In: Proceedings of ISWC 2008. Springer-Verlag; 2008. [12] Deen SM, Ponnamperuma K. Dynamic ontology integration in a multi-agent environment. In: Proceedings of AINA '06. IEEE Computer Society; 2006. [13] Dellschaft K, Staab S. On how to perform a gold standard based evaluation of ontology learning. In: Proceedings of the international semantic web conference. Athens, GA, USA. 2006. [14] Dieng-Kuntz R, Minier D, Ruzicka M, Corby F, Corby O, Alamarguy L. Building and using a medical ontology for knowledge management and cooperative work in a health care network. Comput Biol Med 2006;36:871-92. [15] Eichelberg M. Requirements analysis for the RIDE roadmap. Deliverable D2.1.1, RIDE. 2006. [16] Euzenat J, Bach TL, Barrasa J, Bouquet P, Bo JD, Dieng R, et al. D2.2.3: state of the art on ontology alignment. Technical report. Knowledge Web. 2004. [17] Euzenat J, Loup D, Touzani M, Valtchev P. Ontology alignment with OLA. In: Proceedings of the third international workshop on evaluation of ontology based tools (EON), Hiroshima, Japan, 2004. CEUR-WS. [18] Euzenat J. An API for ontology alignment. In: ISWC 2004: third international semantic web conference. Proceedings. Springer-Verlag; 2004. p. 698-712. [19] Fernandez-Lopez M, Gomez-Perez A, Juristo N. Methontology: from ontological art towards ontological engineering. In: Proceedings of the AAAI'97 spring symposium series on ontological engineering. Stanford, USA, March 1997. p. 33-40. [20] Fernandez-Lopez M, Gomez-Perez A, Rojas MD. Ontologies' crossed life cycles. In: Proceedings of international conference in knowledge engineering and management. Springer-Verlag; 2000. p. 65-79. [21] Flouris G, Huang Z, Pan JZ, Plexousakis D, Wache H. Inconsistencies, negations and changes in ontologies. In: Proceedings of AAAI 2006. AAAI Press; 2006. [22] Gennari JH, Musen MA, Fergerson RW, Grosso WE, Crubezy M, Eriksson H, et al. The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies 2003;58(1):89-123. [23] Gomez-Perez A, Fernandez-Lopez M, Corcho O. Ontological engineering. Advanced information and knowledge processing. Springer-Verlag; 2004. [24] Gruber TR. Towards principles for the design of ontologies used for knowledge sharing.
In: Guarino N, Poli R, editors. Formal ontology in conceptual analysis and knowledge representation. Deventer, The Netherlands: Kluwer Academic Publishers; 1993. [25] Haase P, Sure Y. State-of-the-art on ontology evolution. Deliverable 3.1.1.b, SEKT. 2004. [26] Haase P, Völker J. Ontology learning and reasoning—dealing with uncertainty and inconsistency. In: Proceedings of the URSW2005 workshop. Nov 2005; p. 45-55. [27] Hartmann J, Spyns P, Giboin A, Maynard D, Cuel R, Suarez-Figueroa MC, et al. Methods for ontology evaluation (D1.2.3). Deliverable D1.2.3, Knowledge Web. 2005. [28] Heflin J, Hendler J. Dynamic ontologies on the web. In: Proceedings of AAAI 2000. AAAI Press; 2000. [29] Laera L, Blacoe I, Tamma V, Payne T, Euzenat J, Bench-Capon T. Argumentation over ontology correspondences in MAS. In: Proceedings of the sixth international joint conference on autonomous agents and multi-agent systems (AAMAS 2007). New York, NY, USA: ACM Press; 2007. [30] Laera L, Tamma V, Euzenat J, Bench-Capon T, Payne TR. Reaching agreement over ontology alignments. In: Proceedings of the fifth international semantic web conference (ISWC 2006). Springer-Verlag; 2006. [31] Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Cybern Control Theory 1966;10:707-10. [32] Maedche A, Staab S. Learning ontologies for the semantic web. In: Semantic web workshop 2001. 2001. [33] Maedche A, Staab S. Ontology learning. In: Staab S, Studer R, editors. Handbook on ontologies. Springer-Verlag; 2004. p. 173-90. chapter 9. [34] Nováček V, Handschuh S, Laera L, Maynard D, Völkel M, Groza T, et al. Report and prototype of dynamics in the ontology lifecycle (D2.3.8v1). Deliverable 2.3.8v1, Knowledge Web. 2006. [35] Nováček V. Complex inference for emergent knowledge. Technical Report DERI-TR-2008-04-18, DERI, NUIG, 2008. Available from: http://smile.deri.ie/resources/2008/vit/pubs/aerTR0408.pdf. [36] Noy NF, Klein M. Ontology evolution: not the same as schema evolution. Knowledge Inf Syst 2004:428-40. [37] Noy N, Musen M. The PROMPT suite: interactive tools for ontology merging and mapping. 2002. [38] Pinto HS, Martins JP. A methodology for ontology integration. In: Proceedings of K-CAP'01. 2001. [39] Staab S, Studer R, editors. Handbook on ontologies. International handbooks on information systems. Springer-Verlag; 2004. [40] Stojanovic L. Methods and tools for ontology evolution. PhD thesis. University of Karlsruhe, 2004. [41] Sure Y, Erdmann M, Angele J, Staab S, Studer R, Wenke D. OntoEdit: collaborative ontology development for the Semantic Web. In: First international Semantic Web conference (ISWC 2002), Sardinia, Springer; 2002. [42] Tablan V, Polajnar T, Cunningham H, Bontcheva K. User-friendly ontology authoring using a controlled language. In: Proceedings of LREC 2006—fifth international conference on language resources and evaluation. ELRA/ELDA Paris, 2006. [43] Völkel M, Groza T. SemVersion: RDF-based ontology versioning system. In: Proceedings of the IADIS international conference WWW/Internet 2006 (ICWI 2006), 2006. [44] Tamma V, Blacoe I, Lithgow-Smith B, Wooldridge M. Introducing autonomic behaviour in semantic web agents. In: Proceedings of the fourth international Semantic Web conference (ISWC 2005), Galway, Ireland, November, 2005.
Chapter 6 Semantic Literature Search Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 176-181 Invited Paper CORAAL—Dive into publications, bathe in the knowledge Vít Nováček*, Tudor Groza, Siegfried Handschuh, Stefan Decker Digital Enterprise Research Institute, National University of Ireland Galway, IDA Business Park, Dangan, Galway, Ireland * Corresponding author. Tel.: +353 91 495738. E-mail addresses: vit.novacek@deri.org (V. Nováček), tudor.groza@deri.org (T. Groza), siegfried.handschuh@deri.org (S. Handschuh), stefan.decker@deri.org (S. Decker). ARTICLE INFO Article history: Received 26 May 2009 Received in revised form 9 October 2009 Accepted 26 March 2010 Available online 24 April 2010 Keywords: Knowledge acquisition Knowledge integration Life sciences Knowledge-based publication search ABSTRACT Search engines used in contemporary online scientific publishing mostly exploit raw publication data (bags of words) and shallow metadata (authors, key words, citations, etc.). The knowledge contained implicitly in published texts is still largely unexploited. Following our long-term ambition to take advantage of such knowledge, we have implemented CORAAL (Content extended by emeRgent and Asserted Annotations of Linked publication data), an enhanced-search prototype and the second-prize winner of the Elsevier Grand Challenge. CORAAL extracts asserted publication metadata together with the knowledge implicitly present in the relevant text, integrates the emergent content, and displays it using a multiple-perspective search&browse interface. This way we enable semantic querying for individual publications, and convenient exploration of the knowledge contained within them. In other words, recalling the metaphor in the article title, we let the users dive into publications more easily, and allow them to freely bathe in the related unlocked knowledge. © 2010 Elsevier B.V. All rights reserved. 1. Introduction Online scientific publishing makes knowledge production and dissemination much more efficient than before. The publication process is faster, since the essential phases like authoring, submission, reviewing, and final typesetting are largely computerised. Moreover, the published content is easily disseminated to global audiences via the Internet. In effect, more and more knowledge is being made available. However, is this growing body of knowledge also easily accessible? We believe the answer is negative, since the rapid growth of the number of available resources is making it harder to identify any particular desired piece of knowledge using current solutions. For instance, Medline, a comprehensive source of life sciences and biomedical bibliographic information (cf. http://medline.cos.com/), currently hosts over 18 million resources. It has a growth rate of 0.5 million items per year, which represents around 1300 new resources per day [9]. Using the current publication search engines,1 one can explore the vast and ever-growing article repositories using relevant keywords. But this is very often not enough. 1 For example, ScienceDirect, Elsevier's front-end to their journals (cf. http://www.sciencedirect.com/), or PubMed, a search service covering bibliographic entries from Medline and many additional life science journals together with links to article full texts (cf. http://www.ncbi.nlm.nih.gov/pubmed/).
Imagine, for instance, a junior researcher compiling a survey on various types of leukemia. The researcher wants to state and motivate in the survey that acute granulocytic leukemia is different from T-cell leukemia. Although such a statement might be obvious for a life scientist, one should support it in the survey by a citation of a published paper. Our researcher may be a bit inexperienced in oncology and may not know the proper reference straightaway. Using, e.g., the PubMed search service, it is easy to find articles that contain both leukemia names. Unfortunately, there are more than 500 such results. It is tedious or even impossible to go through them all to discover one that actually supports that acute granulocytic leukemia is different from T-cell leukemia. Given the wealth of knowledge in life science publications and the limitations of current search engines, it is often necessary to manually scan a lot of possibly irrelevant content. To overcome the problem that anything more expressive than (Boolean combinations of) mere keywords is virtually impossible to ask for today, it is necessary to develop technologies that can operate at an enhanced level, using more expressive concepts and their various relationships. This requires collecting, extracting, and interrelating the knowledge scattered across the large numbers of available life science publications. Unfortunately, manual creation of the necessary information is not really possible at large scale, and automated extraction produces noisy and sparse results [2]. We believe that a few essential elements will enable more knowledge-based search in scientific publications: (i) extraction of publication annotations asserted by people (e.g., author names, titles, references or text structure); (ii) extraction of knowledge implicitly present in publications (e.g., statements encoding typed relations between particular concepts, or structured representations of arguments made by authors in the text); (iii) comprehensive integration, augmentation, and refinement of the extracted content, possibly using extant machine-readable resources (e.g., life science thesauri or vocabularies); (iv) interlinking of the processed content (e.g., connecting relevant arguments across publications, or preserving provenance of statements about relations between particular concepts); (v) intuitive access to and display of the content extracted from publications, so that everybody can easily search for the extracted knowledge and track its provenance; (vi) methods for collaborative curation of the resulting content, so that global expert communities can contribute to further refinement of the extracted publication knowledge. CORAAL constitutes a particular solution to some of the principal problems inherent in knowledge-based publication search. Fig. 1. CORAAL architecture.
We provide an overview of the system and its implementation in Section 2. Then we describe its application and evaluation within the Elsevier Grand Challenge in Section 3. A summary of related systems is given in Section 4. Section 5 concludes the paper with a discussion and a future outlook. 2. CORAAL essentials In the following we introduce the basic features of the CORAAL system. Section 2.1 provides an overview of CORAAL, its architecture, and relevant implementation details. In Section 2.2 we describe the pipeline in which we extract, process, and display the knowledge and metadata extracted from publications. 2.1. Overview of the solution In order to provide comprehensive search capabilities in CORAAL, we augment standard (full-text) publication search with novel services that enable knowledge-based search. By knowledge-based search we mean the ability to query for and browse statements that capture relations between concepts in the retrieved source articles. CORAAL is built on top of two substantial research products of our group at DERI: the KONNEX [4] and EUREEKA [8] frameworks. The former is used for storing and querying full-text publications and associated metadata. The latter supports exploitation of the knowledge implicitly contained in the texts by means of knowledge-based search. CORAAL itself essentially extracts asserted publication metadata together with the knowledge implicitly present in the respective text, integrates the emergent content with existing domain knowledge, and displays it via a multiple-perspective search&browse interface. This allows fine-grained publication search to be combined with convenient and effortless large-scale exploitation of the knowledge associated with and implicit within the texts. 2.1.1. Architecture The architecture of CORAAL is depicted in Fig. 1. The EUREEKA library caters for knowledge extraction from text and other knowledge resources (e.g., ontologies or machine readable thesauri) via the knowledge extraction module. After being processed by the ACE knowledge refinement and augmentation pipeline (details provided in Section 2.2), new facts are added into a knowledge base. There can be multiple knowledge bases if users wish to keep content from different domains separate. The knowledge bases are exposed to consumers via a semantic query answering module. Indices are used to help optimize the retrieval and sorting of statements based upon relevance scores. Another crucial part of the CORAAL back-end is KONNEX, which processes the publication text and metadata in order to complement the knowledge-based search supported by EUREEKA with rather traditional full-text services. KONNEX integrates the texts and metadata extracted from the input publication corpus in a triple store, representing all the information as RDF graphs (cf. http://www.w3.org/TR/rdf-primer/). Operations related to data incorporation and necessary full-text indices (for the publication text and particular metadata types) are handled by dedicated manager modules. 2.1.2. Relevant implementation details Since CORAAL contains several conceptually separate modules, we utilise an inter-process communication layer implemented using the D-BUS framework (cf. http://dbus.freedesktop.org/). A set of proxy helper services rests on top of the core-level EUREEKA and KONNEX APIs.
These manage the user requests and forward the data returned by the core APIs to the so-called web hub layer, which is organized as a decoupled set of stateless web services, each of which handles particular types of search. The web services produce machine-readable RDF expressions that represent answers to user queries. The RDF has XSL style sheets attached, allowing both its rendering for human readability and its machine consumption to be provided simultaneously. The human-readable form is also enhanced by the Exhibit faceted browsing web front-end (cf. http://www.simile-widgets.org/exhibit/). A hidden advantage of this mechanism is the shifting of the processing of the data for visualization purposes from the server to the client side, as the XSL transformation is performed by the client's browser. Such a solution results in CORAAL being a pure Semantic Web application, as the data flow between the core infrastructure and the other modules is strictly based on RDF graphs. 2.2. Knowledge acquisition pipeline The publications' metadata and full text were stored and indexed within KONNEX for link processing [4]. After parsing the input articles (either directly from PDF, or from annotated text-based files as provided, e.g., by Elsevier), the metadata and structural annotations were processed by KONNEX. First we eliminated possible duplicate metadata annotations using a string-based similarity heuristic. Each article was then represented as a comprehensive RDF graph consisting of its shallow metadata, such as title, authors, linear structure with pointers to the actual content (sections, paragraphs, etc.), and references. The references were anchored in citation contexts (i.e., the paragraphs they occur in), and represented as individual graphs allowing for incremental enrichment over time. The article's full-text information was managed using multiple Lucene indices (cf. http://lucene.apache.org/), while the graphs were integrated and linked within the KONNEX RDF repository. While KONNEX catered for the raw publication text and metadata, exploitation of the more structured publication knowledge was tackled by our novel EUREEKA framework for emergent (e.g., automatically extracted) knowledge processing [8]. The framework builds on a simple subject-predicate-object triple model. We extend the subject-predicate-object triples by adding positive or negative certainty measures and organise them in so-called conceptual matrices, which concisely represent every positive and negative relation between an entity and other entities. Metrics can be easily defined on the conceptual matrices. The metrics then serve as a natural basis for context-dependent concept similarity scoring that provides the basic light-weight empirical semantics in EUREEKA. On top of the similarity-based semantics, we implemented two simple yet practical inference services: (i) retrieval of knowledge similar to an input concept, and/or its extension by means of similar stored content; (ii) rule-based materialisation of relations implied by the explicit knowledge base content, and/or complex querying (similarity as a basis for finding query variable instances for approximate evaluation of rules). The inference algorithms have anytime behaviour, meaning that it is possible to programmatically adjust their completeness/efficiency trade-off (i.e., one can either have a complete, but possibly largely irrelevant set of solutions in a long time, or an incomplete, but rather relevant set in a relatively short time).
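To make the certainty-weighted triple model more concrete, the following minimal Python sketch shows one possible way of representing conceptual matrices and of deriving a context-dependent similarity score from them. It is an illustration only, assuming a simple dictionary-based representation; the class and function names are hypothetical and do not correspond to the actual EUREEKA API.

# Illustrative sketch only (hypothetical names, not the EUREEKA API): conceptual
# matrices as maps from (predicate, object) pairs to signed certainty degrees,
# plus a cosine metric providing a context-dependent similarity of two concepts.
from collections import defaultdict
from math import sqrt

class ConceptualMatrix:
    """Signed certainty degrees of all relations asserted about one subject."""
    def __init__(self):
        self.rows = defaultdict(float)

    def add(self, predicate, obj, certainty):
        # Positive degrees assert a relation, negative degrees deny it.
        self.rows[(predicate, obj)] += certainty

def similarity(m1, m2):
    """Cosine of two matrices viewed as sparse vectors over (predicate, object) keys."""
    dot = sum(w * m2.rows.get(k, 0.0) for k, w in m1.rows.items())
    n1 = sqrt(sum(w * w for w in m1.rows.values()))
    n2 = sqrt(sum(w * w for w in m2.rows.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy usage: two leukemia concepts sharing part of their relational context.
agl, tcl = ConceptualMatrix(), ConceptualMatrix()
agl.add("type", "leukemia", 0.9)
agl.add("type", "T-cell leukemia", -0.7)   # negative statement (concept difference)
tcl.add("type", "leukemia", 0.8)
tcl.add("involves", "T lymphocytes", 0.6)
print(round(similarity(agl, tcl), 3))

Restricting the keys considered by the metric to a given context (e.g., only type relations) would yield the context-dependent flavour of the scoring mentioned above.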
Technical details of the solution are out of the scope of this system overview article, but one can find them in [8]. We applied our EUREEKA prototype to: (i) automate extraction of machine-readable knowledge from particular life science article texts; (ii) integrate, refine, and extend the extracted knowledge within one large emergent knowledge base; (iii) expose the processed knowledge via a query-answering and faceted browsing interface, tracking the article provenance of statements. For the initial knowledge extraction, we used heuristics based on natural language processing (NLP)—stemming essentially from [5,10]—to process chunk-parsed texts into subject-predicate-object-score quads.2 The scores were derived from aggregated absolute and document-level frequencies of subject/object and predicate terms. The extracted quads encoded three major types of ontological relations between concepts: (I) taxonomical—type—relationships; (II) concept difference (i.e., negative type relationships); (III) "facet" relations derived from verb frames in the input texts (e.g., has part, involves, or occurs in). Over 27,000 types of facet relations were extracted. We imposed a taxonomy on them, considering the head verb of the phrase as a more generic relation (e.g., involves expression of was assumed to be a type of involves). Also, several artificial relation types were introduced to restrict the semantics of some of the most frequent relations: a (positive) type was considered transitive and anti-symmetric, and same as was set transitive and symmetric. Similarly, part of was assumed transitive and being inverse of has part. Note that the has part relation has rather general semantics within the extracted knowledge, i.e., its meaning is not strictly physically mereological, it can refer also to, e.g., conceptual parts or possession of entities. The quads were processed as follows in the ACE pipeline (details of the particular steps are described in [8]): (I) Addition—The extracted quads were incrementally added into a growing knowledge base K, using a fuzzy aggregation of the relevant conceptual matrices. To take into account the basic domain semantics (i.e., synonymy relations and the core taxonomy of K), we used the EMTREE (http://www.embase.com/emtree/) and NCI (http://nciterms.nci.nih.gov) thesauri. (II) Closure—After the addition of new facts into K, we computed its materialisation according to imported RDFS entailment rules (cf. http://www.w3.org/TR/rdf-schema/). (III) Extension—The extracted concepts were analogically extended using similar stored knowledge. 2 Implemented for English only in the current version. However, the EUREEKA framework itself is language-agnostic—it requires only input entities and their relations to be represented in an RDF-compatible format. Porting CORAAL to another language is quite straightforward, given a suitable relation extraction pipeline. We display the content of the knowledge base via a query-answering module. Answers to queries are sorted according to their relevance scores and similarity to the query [8]. Answers are provided by an intersection of publication provenance sets corresponding to the respective statements' subject and object terms. The module currently supports queries in the following form: f | s : (NOT)? p : o (AND s : (NOT)? p : o)*, where NOT and AND stand for negation and conjunction, respectively (the ?
and * wildcards mean zero or one and zero or more occurrences of the preceding symbols, respectively, and | stands for OR). s, o, p may be either a variable—anything starting with the ? character, or even the ? character alone—or a lexical expression; f may be a lexical expression only. 3. Elsevier Grand Challenge deployment This section describes the deployment of CORAAL for the Elsevier Grand Challenge. Section 3.1 describes the data we processed, while Section 3.2 illustrates the query answering capabilities of CORAAL using the example outlined in the article's introduction. Finally, Sections 3.3 and 3.4 report on the continuous tests with real users and on the evaluation of the quality of the exposed knowledge. 3.1. Data Input: As of March 2009, we had processed 11,761 Elsevier journal articles from the provided XML repositories that were related to cancer research and treatment. Access to the articles was provided within the Elsevier Grand Challenge competition (cf. http://www.elseviergrandchallenge.com). The domain was selected to conform to the expertise of our sample users and testers from the Masaryk Oncology Institute in Brno, Czech Republic. We processed cancer-related articles from a selection of Elsevier journals focusing on oncology, genetics, pharmacology, biochemistry, general biology, cell research and clinical medicine. From the article repository, we extracted the knowledge and publication metadata for further processing by CORAAL. Besides the publications themselves, we employed extant machine-readable vocabularies for the refinement and extension of the extracted knowledge (currently, we use the NCI and EMTREE thesauri). Output: CORAAL exposes two datasets as an output of the publication processing: First, we populated a triple store with publication metadata (citations, their contexts, structural annotations, titles, authors and affiliations) and built auxiliary indices for each metadata type to facilitate full-text querying of the stored content. The resulting store contained 7,608,532 RDF subject-predicate-object statements describing the input articles. This included 247,392 publication titles and 374,553 authors (extracted from both processed articles and their literature reference lists). Apart from the triple store, we employed a custom EUREEKA knowledge base [8], containing facts of variable levels of certainty extracted and inferred from the article texts and the imported life science thesauri. Over 215,000 concepts were extracted from the articles. Together with the data from the initial thesauri, the domain lexicon contained 622,611 terms, referring to 347,613 unique concepts. The size of the emergent knowledge base was 4,715,992 weighted statements (ca. 99 and 334 extracted and inferred statements per publication on average, respectively). The contextual knowledge related to the statements, namely provenance information, amounted to more than 10,000,000 additional statements (when expressed in RDF triples). Query evaluation on the produced content typically took fractions of a second. Fig. 2. Knowledge-based query construction. 3.2. Asking queries, browsing answers CORAAL can answer classical full-text or knowledge-based queries using a simple yet powerful query language (details are given at http://smile.deri.ie/projects/egc/quickstart).
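As an illustration of the query form introduced in Section 2.2, the short Python sketch below parses a conjunctive knowledge-based query into its (possibly negated) subject-property-object atoms. The concrete surface syntax (separators, spacing) is an assumption made for this example; it is not CORAAL's actual parser.

# Hedged sketch of parsing queries of the form s:(NOT)? p:o (AND s:(NOT)? p:o)*
# described in Section 2.2. Separators and spacing are assumed for illustration;
# this is not the actual CORAAL implementation.
def parse_query(query):
    atoms = []
    for atom in query.split(" AND "):
        parts = [part.strip() for part in atom.split(":")]
        if len(parts) != 3:
            raise ValueError("expected an 's : p : o' atom, got: %r" % atom)
        s, p, o = parts
        negated = p.upper().startswith("NOT ")
        if negated:
            p = p[len("NOT "):].strip()
        # Anything starting with '?' (or the bare '?') would be treated as a variable.
        atoms.append({"subject": s, "property": p, "object": o, "negated": negated})
    return atoms

# The running example from the introduction: statements saying that acute
# granulocytic leukemia is NOT a type of T-cell leukemia.
print(parse_query("acute granulocytic leukemia : NOT type : T-cell leukemia"))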
Answers in CORAAL are presented as a list of query-conforming statements (for the knowledge-based search) or resources (publication titles, paragraphs or author names for the full-text search). The statement results can be filtered based on their particular elements (e.g., subjects, properties, and objects), associated contextual information and whether the statement is positive or negative. The resource results can be filtered according to the concepts associated with them (both extracted and inferred) and additional metadata (e.g., authors or citations present in the context of the resulting paragraphs). Using filtering (i.e., faceted browsing), one can quickly focus on items of interest within the whole result set. Recalling the example from Section 1, the query for sources supporting that acute granulocytic leukemia is different from T-cell leukemia can be conveniently constructed in CORAAL as depicted in Fig. 2 (guided query building using a form-based interface). The query builder includes a context-sensitive auto-completion capability; if one rests the cursor on, e.g., a subject, only relations (properties) actually associated with that subject in the knowledge base are displayed. Fig. 3 shows the highest ranked answer to the query constructed in Fig. 2, proving that the two types of leukemia are not the same. The source article of the statement (displayed as an inline summary in Fig. 3) is the desired reference supporting the claim. Fig. 3. Query answer detail (the top-ranked source article supporting the statement that acute granulocytic leukemia is not a type of T-cell leukemia, returned with a certainty of 0.664).
The particular types of contextual information associated with statements (as can be observed in Fig. 3) are: (I) source provenance—articles relevant to the statement, which can be expanded into an inline summary (as shown in Fig. 3) or explored in detail after clicking on the respective publication title; (II) context provenance—the domain of life sciences that the statement relates to (determined according to the main topic of the journal that contained the articles the statement was extracted from); (III) certainty—a number describing how certain the system is that the statement holds and is relevant to the query (values between 0 and 1; derived from the absolute value of the respective statement degree and from the actual similarity of the statement to the query); (IV) inferred—a Boolean value determining whether the statement was inferred or not (the latter indicating it was directly extracted). More information can be seen with CORAAL at http://coraal.deri.ie:8080/coraal. 3.3. Continuous tests with users During the development of the CORAAL prototype, we continually collaborated with several biomedical experts, who formed a committee of sample users and evaluators. Before the final stages of the Elsevier Grand Challenge, we prepared five tasks to be worked out with CORAAL and a baseline application (ScienceDirect or PubMed). Our hypothesis was that users should perform better with CORAAL than with the baseline, since the tasks were focused on knowledge rather than on a plain text-based search.3 3 For instance, the users were asked to find all authors who support the fact that the acute granulocytic leukemia and T-cell leukemia concepts are disjoint, or to find which process is used as a complementary method, while being different from the polymerase chain reaction, and identify publications that support their findings. Users indicated that the tasks used for evaluating CORAAL were relevant to their day-to-day work by giving them a score of 3.9 out of 6 (the scale was from 1 to 6, with 1 indicating no relevance, and 6 indicating high relevance). The success rate of task accomplishment was 60.7% with CORAAL and 10.7% with the baseline application. This confirms our above-mentioned hypothesis that users will be able to accomplish the given tasks better with CORAAL due to its enhanced querying and search capabilities. Besides evaluating the users' performance in sample knowledge-based search tasks, we interviewed them regarding the overall usability of the CORAAL interface. The most critical issue was related to the query language—half of the users were not always able to construct appropriate queries. However, CORAAL also offers a form-based query builder that assists the user as illustrated in Section 3.2. Using this feature, users performed up to six times faster and 40% more efficiently than with purely manually constructed queries. The expert users also had problems with too general, obvious, or irrelevant results. These concerns were expressed when the users were presented with a raw list of answer statements within the evaluation.
After being discussed with the users within the evaluation interviews, the problems were addressed by the following features in the user interface: (i) relevance-based sorting of concepts and statements [8]—the most relevant statements were displayed at the top of the results list; (ii) intuitive faceted browsing functionality—support for fast and easy reduction of the displayed results to a subset which references particular entities (i.e., statements having only certain objects, or authors writing about certain topics). The solutions were considered mostly sufficient regarding the users' concerns (an average 4.6 score on the 1-6 scale going from least to most sufficient). 3.4. Knowledge quality evaluation To evaluate the quality of the knowledge served by CORAAL,4 we generated 200 random queries composed of anything from single terms to a conjunction of multiple possibly negated statements. To ensure non-empty answer sets, the queries were generated from the actual content of the knowledge base. Also, we took into account only the content extracted from the input articles and not from the NCI or EMTREE seed thesauri. We let our domain expert committee vote on the relevance of queries to their day-to-day work and used the ten most relevant ones to evaluate the answers provided by CORAAL. We used the traditional notions of precision, recall, and F-measure for the evaluation of the quality of the answers. A gold standard set of statements relevant to the queries used in the evaluation was created by the user committee, who employed their own knowledge combined with the full-text search of the publications incorporated in CORAAL. For a baseline comparison, we imported the knowledge extracted by CORAAL from the input articles and thesauri into a state-of-the-art RDF store. The store had inference and querying support, however, it lacked proper means for emergent knowledge processing (namely regarding the negation, uncertainty, inconsistency resolution and approximate query processing features). The set of queries used for the CORAAL evaluation was executed using the baseline RDF store and the results were compared. Due to the novel support of the emergent knowledge, CORAAL quite substantially outperformed the baseline, achieving F-measures from two to eight times better for the various evaluated features. The absolute CORAAL results may still be considered rather poor when compared to the gold standard generated by the users (i.e., F-measures for some queries around 0.2). However, one must recognise that the answers to the gold standard questions took almost two working days for an expert committee to generate. In about the same time, the CORAAL knowledge base was produced purely automatically for much larger amounts of data (involving statements about hundreds of thousands of concepts instead of a few query entities). The queries take seconds to evaluate and one can find many relevant answers very quickly due to the relevance-based sorting of the results (the first 10 statements contained more than 67% of relevant answers on average, while the 200th to 400th results contained only about 5% correct statements). 4 Note that this section provides only an outline of the actual evaluation, summarising the most important points. A full description of the knowledge quality evaluation and the numeric results achieved is provided in [8]. The evaluation committee unequivocally considered the ability of CORAAL
to perform purely automatically as an acceptable trade-off for the detected noise in the results. 4. Related work Approaches tackling problems related to those addressed by the core technologies powering CORAAL are analysed in [8,4]. Here, we offer an overview of systems targeting similar problems to those tackled by our framework. State-of-the-art applications like ScienceDirect or PubMed Central require almost no effort in order to expose arbitrary life science publications for search (therefore we used them as a baseline in the user-centric experiment). However, the benefit they provide is rather limited when compared to cutting-edge approaches aimed at utilising also the publication knowledge within the query construction and/or result visualisation. Such innovative solutions may require much more a priori effort in order to work properly, though. FindUR [6], Melisa [1] and GoPubMed [3] are ontology-based interfaces to a traditional publication full-text search. GoPubMed allows for effective restriction and intelligent visualisation of the query results. FindUR and Melisa support focusing the queries on particular topics based on an ontology (FindUR uses a Description Logic ontology built from scratch, while Melisa employs a custom ontology based on MeSH, cf. http://www.nlm.nih.gov/mesh/). GoPubMed dynamically extracts parts of the Gene Ontology (cf. http://www.geneontology.org/) relevant to the query, which are then used for restriction and a sophisticated visualisation of the classical PubMed search results. Nevertheless, none of the tools mentioned so far offers querying for or browsing of arbitrary publication knowledge. Terms and relations not present in the systems' rather static ontologies simply cannot be reflected in the search. On the other hand, CORAAL works on any domain and extracts arbitrary knowledge from publications automatically, although the offered benefits may not be that high due to a possibly higher level of noise. Textpresso [7] is quite similar to CORAAL concerning searching for relations between concepts in particular chunks of text. However, the underlying ontologies and their instance sets have to be provided manually, whereas CORAAL can operate with or without any available ontology. Moreover, CORAAL includes far more full-text publications and concepts. The biggest challenge of systems with goals similar to CORAAL is a reliable automation of truly expressive content extraction. In contrast to CORAAL, none of the related systems addresses this problem appropriately, which makes them scale poorly, or makes them difficult to port to new domains. This is why we were not able to use the related systems for a baseline comparison in our domain-specific application scenario—we simply could not adapt them so that they would be able to perform reasonably, both due to technical difficulties and lack of necessary resources. 5. Conclusions and future work With CORAAL, we have addressed most of the elements of a truly knowledge-based scientific publication search as specified in Section 1. We are able to extract and integrate emergent knowledge and metadata from a large number of publications, as well as augment and refine the extracted content. CORAAL also allows for intuitive searching and browsing of the processed knowledge.
Although the primary focus of CORAAL is the knowledge-based search, the underlying technologies are straightforwardly applicable to many other tasks. These are for instance automated tagging of articles by the associated general concepts, population of existing domain-specific vocabularies, or utilisation of CORAAL as a general-purpose knowledge back-end exposing arbitrary services (e.g., knowledge-based retrieval of similar articles or profile-based article recommendation). However, we still have to tackle several challenges in order to fully realize the current potential of CORAAL. First, we want to utilise the wisdom of the crowds by supporting intuitive and unobtrusive community-based curation of the emergent knowledge, namely by validation or invalidation of existing statements, introduction of new statements and submission of new rules refining the domain semantics. Then we intend to make the step from CORAAL to a CORAAL reef, a distributed peer-to-peer model covering multiple CORAAL installations autonomously communicating with each other (e.g., asking for answers when no answer is available locally or exchanging appropriate rules to improve the local semantics). After incorporating the capabilities of the prospective CORAAL reefs into the ecosystem of current online publishing, we will be able to unlock, connect, augment and retrieve the knowledge with unprecedented scale and efficiency. Acknowledgments This work has been supported by the 'Lion' and 'Lion II' projects funded by SFI under Grants No. SFI/02/CE1/I131 and SFI/08/CE/I1380, respectively. Big thanks go to our evaluators: Doug Foxvog, Peter Grell, MD, Miloš Holánek, MD, Matthias Samwald, Holger Stenzhorn and Jiří Vyskočil, MD. We also appreciated the challenge judges' feedback that helped to streamline the final prototype a lot. We acknowledge the support provided by Noelle Gracy, Anita de Waard and other Elsevier people regarding the challenge organisation. Finally, we thank the anonymous reviewers and Ed Hovy and Susie Stephens, the special issue editors, who all helped us to improve the final article version a lot. References [1] J.M. Abasolo, M. Gómez, Melisa: an ontology-based agent for information retrieval in medicine, in: Proceedings of SemWeb2000, 2000. [2] S. Bechhofer, et al., Tackling the ontology acquisition bottleneck: an experiment in ontology re-engineering, at http://tinyurl.com/96w7ms, Apr'08 (2003). [3] H. Dietze, et al., GoPubMed: exploring PubMed with ontological background knowledge, in: Ontologies and Text Mining for Life Sciences, IBFI, 2008. [4] T. Groza, S. Handschuh, K. Moeller, S. Decker, KonneXSALT: first steps towards a semantic claim federation infrastructure, in: Proceedings of the ESWC 2008, Springer-Verlag, 2008. [5] A. Maedche, S. Staab, Discovering conceptual relations from text, in: Proceedings of the ECAI 2000, IOS Press, 2000. [6] D.L. McGuinness, Ontology-enhanced search for primary care medical literature, in: Proceedings of the Medical Concept Representation and NLP Conference, 1999. [7] H.M. Müller, E.E. Kenny, P.W. Sternberg, Textpresso: an ontology-based information retrieval and extraction system for biological literature, PLoS Biology 2 (11) (2004) e309. [8] V. Nováček, S. Decker, Towards lightweight and robust large scale emergent knowledge processing, in: Proceedings of the ISWC'09, 2009. [9] J. Tsujii, Refine and pathtext, which combines text mining with pathways, in: Keynote at SESL 2009, 2009. [10] J. Voelker, D. Vrandecic, Y. Sure, A. Hotho, Learning disjointness, in: Proceedings of the ESWC'07, Springer, 2007.
Chapter 7 Distributional Semantics in Semantic Literature Search PeerJ SKIMMR: facilitating knowledge discovery in life sciences by machine-aided skim reading Vít Nováček1 and Gully A.P.C. Burns2 1 Insight Centre (formerly DERI), National University of Ireland Galway, Galway, Ireland 2 Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA Submitted 6 December 2013 Accepted 23 June 2014 Published 22 July 2014 Corresponding author Vít Nováček, vit.novacek@deri.org Academic editor Harry Hochheiser Additional Information and Declarations can be found on page 35 DOI 10.7717/peerj.483 © Copyright 2014 Nováček and Burns Distributed under Creative Commons CC-BY 3.0 OPEN ACCESS ABSTRACT Background. Unlike full reading, 'skim-reading' involves the process of looking quickly over information in an attempt to cover more material whilst still being able to retain a superficial view of the underlying content. Within this work, we specifically emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text. For the extraction, we use shallow parsing, co-occurrence analysis and semantic similarity computation techniques. Our main motivation is to assist biomedical researchers and clinicians in coping with the increasingly large amounts of potentially relevant articles that are continually being published in life sciences. Methods. To construct the high-level network overview of articles, we extract weighted binary statements from the text. We consider two types of these statements, co-occurrence and similarity, both organised in the same distributional representation (i.e., in a vector-space model). For the co-occurrence weights, we use point-wise mutual information that indicates the degree of non-random association between two co-occurring entities. For computing the similarity statement weights, we use cosine distance based on the relevant co-occurrence vectors. These statements are used to build fuzzy indices of terms, statements and provenance article identifiers, which support fuzzy querying and subsequent result ranking. These indexing and querying processes are then used to construct a graph-based interface for searching and browsing entity networks extracted from articles, as well as articles relevant to the networks being browsed. Last but not least, we describe a methodology for automated experimental evaluation of the presented approach. The method uses formal comparison of the graphs generated by our tool to relevant gold standards based on manually curated PubMed, TREC challenge and MeSH data. Results. We provide a web-based prototype (called 'SKIMMR') that generates a network of inter-related entities from a set of documents which a user may explore through our interface. When a particular area of the entity network looks interesting to a user, the tool displays the documents that are the most relevant to those entities of interest currently shown in the network. We present this as a methodology for browsing a collection of research articles. To illustrate the practical applicability of SKIMMR, we present examples of its use in the domains of Spinal Muscular Atrophy and Parkinson's Disease. Finally, we report on the results of experimental evaluation using the two domains and one additional dataset based on the TREC challenge.
The results show how the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context. Subjects Bioinformatics, Neuroscience, Human-Computer Interaction, Computational Science Keywords Machine reading, Skim reading, Publication search, Text mining, Information visualisation INTRODUCTION In recent years, knowledge workers in life sciences have become increasingly overwhelmed by an ever-growing quantity of information. PubMed1 contained more than 23 million abstracts as of November 2013, with a new entry being added every minute. The current textual content available online as PubMed abstracts amounts to over 2 billion words (based on estimates derived from a random sample of about 7,000 records). Information retrieval technology helps researchers pinpoint individual papers of interest within the overall mass of documents, but how can scientists use that to acquire a sense of the overall organization of the field? How can users discover new knowledge within the literature when they might not know what they are looking for ahead of time? Strategic reading aided by computerised solutions may soon become essential for scientists (Renear & Palmer, 2009). Our goal is to provide a system that can assist readers to explore large numbers of documents efficiently. 1 The central US repository of published papers in the life sciences since the 1950s, see http://www.ncbi.nlm.nih.gov/pubmed. We present 'machine-aided skim-reading' as a way to extend the traditional paradigm of searching and browsing a text collection (in this case, PubMed abstracts) through the use of a search tool. Instead of issuing a series of queries to reveal lists of ranked documents that may contain elements of interest, we let the user search and browse a network of entities and relations that are explicitly or implicitly present in the texts. This provides a simplified and high-level overview of the domain covered by the text, and allows users to identify and focus on items of interest without having to read any text directly. Upon discovering an entity of interest, the user may transition from our 'skimming' approach to read the relevant texts as needed. This article is organised as follows. 'Methods' describes methods used in SKIMMR for: (1) extraction of biomedical entities from data; (2) computation of the co-occurrence and similarity relationships between the entities; (3) indexing and querying of the resulting knowledge base; (4) evaluating the knowledge base using automated simulations. Each of the methods is explained using examples. 'Results' presents the SKIMMR prototype and explains typical usage of the tool in examples based on user interactions. We also describe evaluation experiments performed with three different instances of the tool. In 'Discussion' we discuss the results, give an overview of related work and outline our future directions. There is also 'Formulae Definitions' that provides details on some of the more complex formulae introduced in the main text.
The main contributions of the presented work are: (1) machine-aided skim-reading as a new approach to semi-automated knowledge discovery; (2) a fuzzy indexing and querying method for efficient on-demand construction and presentation of the high-level graph-based article summaries; (3) detailed examples that explain the applied methods in a step-by-step fashion even to people with little or no computer science background; (4) an open-source prototype implementing the described method, readily available for processing custom data, and also in the form of two pre-computed instances deployed on Spinal Muscular Atrophy and Parkinson's Disease data; (5) an evaluation methodology based on simulations and formally defined measures of semantic coherence, information content and complexity that can be used not only for evaluating SKIMMR (as we did in the article), but also for assessment of other tools and data sets utilising graph structures. METHODS This section describes how the knowledge base supporting the process of machine-aided skim reading is generated from the input data (i.e., biomedical articles and data). Firstly we describe extraction of entities and basic co-occurrence relationships between them ('Extracting basic co-occurrence statements from texts'). 'Computing a knowledge base from the extracted statements' is about how we compute more general, corpus-wide relationships from the basic extracted co-occurrence statements. 'Indexing and querying the knowledge base' explains how the processed content can be indexed and queried in order to generate the graph-based summaries with links to the original documents. Finally, 'Evaluation methodology' introduces a method for a simulation-based evaluation of the generated content in the context of machine-aided skim reading. For the research reported in this article, we received an exemption from IRB review by the USC UPIRB, under approval number UP-12-00414. Extracting basic co-occurrence statements from texts We process the abstracts by a biomedical text-mining tool2 in order to extract named entities (e.g., drugs, genes, diseases or cells) from the text. 2 A part of the LingPipe suite, see http://alias-i.com/lingpipe/ for details. For each abstract with a PubMed ID PMID, we produce a set of (e_x, e_y, cooc((e_x, e_y), PubMed_PMID), PubMed_PMID) tuples, where e_x, e_y range over all pairs of named entities in the abstract with the PMID identifier, and cooc((e_x, e_y), PubMed_PMID) is a co-occurrence score of the two entities computed using the formula (1) detailed in 'Co-occurrences'. The computation of the score is illustrated in the following example. Example 1 Imagine we want to investigate the co-occurrence of the parkinsonism and DRD (dopa-responsive dystonia) concepts in a data set of PubMed abstracts concerned with clinical aspects of Parkinson's disease.3 3 Which we have processed in one of the pre-computed instances of SKIMMR, see 'Parkinson's disease' for details. There are two articles in the data set where the corresponding terms co-occur: • Jeon BS, et al. Dopamine transporter density measured by 123Ibeta-CIT single-photon emission computed tomography is normal in dopa-responsive dystonia (PubMed ID: 9629849). • Snow BJ, et al. Positron emission tomographic studies of dopa-responsive dystonia and early-onset idiopathic parkinsonism (PubMed ID: 8239569).
The relevant portions of the first abstract (PubMed ID: 9629849) are summarised in the following table (split into sentences numbered from the beginning of the text):

12  Therefore, we performed 123Ibeta-CIT single-photon emission computed tomography (123Ibeta-CIT SPECT) in clinically diagnosed DRD, PD, and JPD, and examined whether DAT imaging can differentiate DRD from PD and JPD.
14  Five females (4 from two families, and 1 sporadic) were diagnosed as DRD based on early-onset foot dystonia and progressive parkinsonism beginning at ages 7-12.
17  123Ibeta-CIT striatal binding was normal in DRD, whereas it was markedly decreased in PD and JPD.
22  A normal striatal DAT in a parkinsonian patient is evidence for a nondegenerative cause of parkinsonism and differentiates DRD from JPD.
23  Finding a new mutation in one family and failure to demonstrate mutations in the putative gene in other cases supports the usefulness of DAT imaging in diagnosing DRD.

Based on the sentence numbers in the excerpt, we can compute the co-occurrence score of the (parkinsonism, DRD) tuple by summing the 1 / (1 + |i - j|) contributions of the co-occurring sentence pairs (formula (1) in 'Co-occurrences'):

  cooc((parkinsonism, DRD), PubMed_9629849) = 3.416.

Similar to the above, the portions relevant to the (parkinsonism, DRD) co-occurrences according to the second abstract (PubMed ID: 8239569) are as follows:

1  There are two major syndromes presenting in the early decades of life with dystonia and parkinsonism: dopa-responsive dystonia (DRD) and early-onset idiopathic parkinsonism (EOIP).
2  DRD presents predominantly in childhood with prominent dystonia and lesser degrees of parkinsonism.
5  Some have suggested, however, that DRD is a form of EOIP.

The co-occurrence score is then:

  cooc((parkinsonism, DRD), PubMed_8239569) = 3.25.

Therefore the basic co-occurrence tuples produced from the two articles are:

  (parkinsonism, DRD, 3.416, PubMed_9629849), (parkinsonism, DRD, 3.25, PubMed_8239569).

Computing a knowledge base from the extracted statements
From the basic co-occurrence statements, we compute a knowledge base, which is a comprehensive network of interlinked entities. This network supports the process of navigating a skeletal structure of the knowledge represented by the corpus of the input PubMed articles (i.e., the actual skim reading). The knowledge base consists of two types of statements: (1) corpus-wide co-occurrence and (2) similarity. The way to compute the particular types of statements in the knowledge base is described in the following two sections.

Corpus-wide co-occurrence
The basic co-occurrence tuples extracted from the PubMed abstracts only express the co-occurrence scores at the level of particular documents. We need to aggregate these scores to examine co-occurrence across the whole corpus. For that, we use point-wise mutual information (Manning, Raghavan & Schütze, 2008), which determines how much two co-occurring terms are associated or disassociated, comparing their joint and individual distributions over a data set. We multiply the point-wise mutual information value by the absolute frequency of the co-occurrence in the corpus to prioritise more frequent phenomena. Finally, we filter and normalise values so that the results contain only scores in the [0,1] range. The scores are computed using the formulae (2)-(5) in 'Co-occurrences'.
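As a concrete illustration of the per-abstract scores that feed this corpus-wide aggregation, the following minimal Python sketch computes the score of formula (1) from the sentence numbers in which two entities were recognised. It is an illustration only, not the actual SKIMMR code; the function name, the input format (one sentence-number set per entity) and the 3-sentence distance cut-off are assumptions that happen to reproduce the value from Example 1.

def basic_cooc_score(sentences_x, sentences_y, max_dist=3):
    """Per-abstract co-occurrence score of two entities.

    sentences_x, sentences_y: sets of sentence numbers (counted from the
    beginning of the abstract) in which the two entities occur. Pairs of
    occurrences more than max_dist sentences apart are ignored, mirroring
    the cut-off mentioned in 'Co-occurrences'."""
    score = 0.0
    for i in sentences_x:
        for j in sentences_y:
            if abs(i - j) <= max_dist:
                score += 1.0 / (1 + abs(i - j))
    return score

# Example 1, second abstract (PubMed ID 8239569): parkinsonism occurs in
# sentences 1 and 2, DRD in sentences 1, 2 and 5.
print(basic_cooc_score({1, 2}, {1, 2, 5}))   # -> 3.25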
The aggregated co-occurrence statements that are added to the knowledge base are in the form of (x, cooc, y, ν(fpmi(x,y), P)) triples, where x, y range through all terms in the basic co-occurrence statements, the scores are computed across all the documents where x, y co-occur, and the cooc expression indicates co-occurrence as the actual type of the relation between x, y. Note that the co-occurrence relation is symmetric, meaning that if (x, cooc, y, w1) and (y, cooc, x, w2) are in the knowledge base, w1 must be equal to w2.

Example 2 Assuming our corpus consists only of the two articles from Example 1, the point-wise mutual information score of the (parkinsonism, DRD) tuple can be computed using the following data:
• p(parkinsonism, DRD): the joint distribution of the (parkinsonism, DRD) tuple within all the tuples extracted from the PubMed abstracts with IDs 9629849 and 8239569, which equals 3.416 + 3.25 = 6.666 (sum across all the (parkinsonism, DRD) basic co-occurrence tuples);
• p(parkinsonism), p(DRD): the individual distributions of the parkinsonism, DRD arguments within all extracted tuples, which equal 28.987 and 220.354, respectively (sums of the weights in all basic co-occurrence statements that contain parkinsonism or DRD as one of the arguments, respectively);
• F(parkinsonism, DRD), |T|: the absolute frequency of the parkinsonism, DRD co-occurrence and the number of all basic co-occurrence statements extracted from the abstracts, which equal 2 and 1,414, respectively;
• P: the percentile for the normalisation, equal to 95, which results in the normalisation constant 2.061 (a non-normalised score such that only 5% of the scores are higher than that).

The whole formula is then:

  npmi(parkinsonism, DRD) = ν(fpmi(parkinsonism, DRD), P)
    = ν( F(parkinsonism, DRD) · log_2 [ p(parkinsonism, DRD) / (p(parkinsonism) p(DRD)) ], 95 )
    = ( 2 · log_2 [ (6.666 · 1,414) / (28.987 · 220.354) ] ) / 2.061 = 0.545.

Thus the aggregated co-occurrence statement that is included in the knowledge base is (parkinsonism, cooc, DRD, 0.545).

Similarity
After having computed the aggregated and filtered co-occurrence statements, we add one more type of relationship: similarity. Many other authors have suggested ways for computing semantic similarity (see d'Amato, 2007 for a comprehensive overview). We base our approach on cosine similarity, which has become one of the most commonly used approaches in information retrieval applications (Singhal, 2001; Manning, Raghavan & Schütze, 2008). The similarity and related notions are described in detail in 'Similarities', formulae (6) and (7). Similarity indicates a higher-level type of relationship between entities that may not be covered by mere co-occurrence (entities not occurring in the same article may still be similar). This adds another perspective to the network of connections between entities extracted from literature, therefore it is useful to make similarity statements also a part of the SKIMMR knowledge base. To do so, we compute the similarity values between all combinations of entities x, y and include the statements (x, sim, y, sim(x,y)) into the knowledge base whenever the similarity value is above a pre-defined threshold (0.25 is used in the current implementation).4 A worked example of how to compute similarity between two entities in the sample knowledge base is given below.

4 Similar to the co-occurrence statements described before, the sim expression refers to the type of the relation between x, y, i.e., similarity.
Example 3 Let us use 'parkinsonisms', 'mrpi values' as sample entities a, b. In a full version of Parkinson s disease knowledge base (that contains the data used in the previous examples, but also hundreds of thousands of other statements), there are 19 shared entities among the ones related to a, b (for purposes of brevity, each item is linked to a short identifier to be used later on): (1) msa — p ~ in, (2) clinically unclassif iable parkinsonism ~ t\, (3) cup ~ <2. (4) vertical ocular slowness ~ (5) baseline clinical evaluation ~ £4, (6) mr ~ is, (7) parkinsonian disorders ~ tß, (8) pspphenotypes ~ t-j, (9) duration ~ ig, (10) patients ~ tg, (11) clinical diagnostic criteria ~ iin, (12) abnormal mrpi values ~ (13) pd ~ t\2, (14) magnetic resonance parkinsonism index ~ £13, (15) parkinson disease ~ tu, (16) mri ~ £15, (17) parkinson's disease (18) psp ~ tyj, (19) normal mrpi values ~ iig. ill, t\6i Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 H6/38 58 PeerJ The co-occurrence complements a, b of the parkinsonisms, mrpi values entities (i.e., associated co-occurrence context vectors) are summarised in the following table: to h »2 h f4 «5 t6 fy h f10 ill *13 f14 «15 h? fl8 a 0.14 0.39 1.0 0.08 0.26 0.06 0.18 0.4 0.07 0.27 0.09 0.7 0.03 0.14 0.33 0.25 b 0.26 0.57 1.0 0.3 0.82 0.2 0.33 0.26 0.39 0.43 0.36 0.41 0.06 0.34 1.0 1.0 Note that the elements fg, fn, fi6 are omitted since their weight in at least one of the complements is <0.01 and thus does not contribute significantly to the result. The sizes of the co-occurrence complement vectors are 3.048, 2.491 for parkinsonisms, mrpi values, respectively, while their dot product is 2.773. Therefore their similarity is equal to 3 048 2349i ^ 0-365 and the new statement to be added to the knowledge base is (parkinsonisms, sim,mrpi values,0.365). Indexing and querying the knowledge base The main purpose of SKIMMR is to allow users to efficiently search and navigate in the SKIMMR knowledge bases, and retrieve articles related to the content discovered in the high-level entity networks. To support that, we maintain several indices of the knowledge base contents. The way how the indices are built and used in querying SKIMMR is described in the following two sections. Knowledge base indices In order to expose the SKIMMR knowledge bases, we maintain three main indices: (1) a term index-a mapping from entity terms to other terms that are associated with them by a relationship (like co-occurrence or similarity); (2) a statement index-a mapping that determines which statements the particular terms occur in; (3) a source index-a mapping from statements to their sources, i.e., the texts from which the statements have been computed. In addition to the main indices, we use a full-text index that maps spelling alternatives and synonyms to the terms in the term index. 
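Returning briefly to the similarity computation illustrated in Example 3, the cosine similarity of two co-occurrence context vectors can be sketched as follows. The sketch assumes the vectors are already available as sparse dictionaries (formulae (6) and (7) in 'Similarities' define how they are derived from the knowledge base); the function name, the abridged example vectors and their values are illustrative, not the actual SKIMMR implementation or data.

import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse co-occurrence context vectors
    given as {shared_entity: weight} dictionaries."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Heavily abridged versions of the vectors from Example 3 (a = parkinsonisms,
# b = mrpi values); with the full 16-element vectors the similarity is 0.365.
a = {'cup': 1.0, 'clinically unclassifiable parkinsonism': 0.39, 'pd': 0.7}
b = {'cup': 1.0, 'clinically unclassifiable parkinsonism': 0.57, 'pd': 0.41}
sim = cosine_similarity(a, b)
if sim > 0.25:   # threshold used in the current implementation
    print(('parkinsonisms', 'sim', 'mrpi values', round(sim, 3)))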
The main indices are implemented as matrices that reflect the weights in the SKIMMR knowledge base:

         T_1    T_2    ...  T_n           S_1    S_2    ...  S_m           P_1    P_2    ...  P_q
  T_1   t_1,1  t_1,2  ...  t_1,n    T_1  s_1,1  s_1,2  ...  s_1,m    S_1  p_1,1  p_1,2  ...  p_1,q
  T_2   t_2,1  t_2,2  ...  t_2,n    T_2  s_2,1  s_2,2  ...  s_2,m    S_2  p_2,1  p_2,2  ...  p_2,q
  ...                               ...                              ...
  T_n   t_n,1  t_n,2  ...  t_n,n    T_n  s_n,1  s_n,2  ...  s_n,m    S_m  p_m,1  p_m,2  ...  p_m,q

where:
• T_1, ..., T_n are identifiers of all entity terms in the knowledge base and t_i,j ∈ [0,1] is the maximum weight among the statements of all types existing between entities T_i, T_j in the knowledge base (0 if there is no such statement);
• S_1, ..., S_m are identifiers of all statements present in the knowledge base and s_i,j ∈ {0,1} determines whether an entity T_i occurs in a statement S_j or not;
• P_1, ..., P_q are identifiers of all input textual resources, and p_i,j ∈ [0,1] is the weight of the statement S_i if P_j was used in order to compute it, or zero otherwise.

5 In reality, the number of source articles used for computing these statements in the Parkinson's disease knowledge base is much larger, but here we take into account only a few of them to simplify the example.

Example 4 To illustrate the notion of the knowledge base indices, let us consider a simple knowledge base with only two statements from Examples 1 and 3:

  S_1 ~ (parkinsonism, cooc, DRD, 0.545), S_2 ~ (parkinsonisms, sim, mrpi values, 0.365).

Furthermore, let us assume that: (i) the statement S_1 has been computed from the articles with PubMed identifiers 9629849, 8239569 (being referred to by the P_1, P_2 provenance identifiers, respectively); (ii) the statement S_2 has been computed from articles with PubMed identifiers 9629849, 21832222, 22076870 (being referred to by the P_1, P_3, P_4 provenance identifiers, respectively5). This corresponds to the following indices:

term index:
                 parkinsonism   DRD     parkinsonisms   mrpi values
  parkinsonism   0.0            0.545   0.0             0.0
  DRD            0.545          0.0     0.0             0.0
  parkinsonisms  0.0            0.0     0.0             0.365
  mrpi values    0.0            0.0     0.365           0.0

statement index:
                 S_1    S_2
  parkinsonism   1.0    0.0
  DRD            1.0    0.0
  parkinsonisms  0.0    1.0
  mrpi values    0.0    1.0

provenance index:
        P_1     P_2     P_3     P_4
  S_1   0.545   0.545   0.0     0.0
  S_2   0.0     0.0     0.365   0.365

6 One can expand the coverage of their queries using the advanced full-text search features like wildcards or boolean operators for the term look-up. Detailed syntax of the full-text query language we use is provided at http://pythonhosted.org/Whoosh/querylang.html.

Querying
The indices are used to efficiently query for the content of SKIMMR knowledge bases. We currently support atomic queries with one variable, and possibly nested combinations of atomic queries and propositional operators of conjunction (AND), disjunction (OR) and negation (NOT). An atomic query is defined as ? T, where ? refers to the query variable and T is a full-text query term.6 The intended purpose of the atomic query is to retrieve all entities related by any relation to the expressions corresponding to the term T. For instance, the ? parkinsonism query is supposed to retrieve all entities co-occurring-with or similar-to parkinsonism. Combinations consisting of multiple atomic queries linked by logical operators are evaluated using the following algorithm:
1. Parse the query and generate a corresponding 'query tree' (where each leaf is an atomic query and each node is a logical operator; the levels and branches of this tree reflect the nested structure of the query).
2.
Evaluate the atomic queries in the nodes by a look-up in the term index, fetching the term index rows that correspond to the query term in the atomic query. 3. The result of each term look-up is a fuzzy set (Hájek, 1998) of terms related to the atomic query term, with membership degrees given by listed weights. One can then naturally combine atomic results by applying fuzzy set operations corresponding to the logical operators in the parsed query tree nodes (where conjunction, disjunction and negation correspond to fuzzy intersection, union and complement, respectively). 4. The result is a fuzzy set of terms Rj = {{T\,wf), (T2, wj),..., (T„, wj)}, with their membership degrees reflecting their relevance as results of the query. The term result set Rj can then be used to generate sets of relevant statements (Rs) and provenances (Rp) using look-ups in the corresponding indices as follows: (a) RS = l(Si,ws1),(S2,ws2),...,(Sm,wsm)}, where wf = vs£;"=1 wfch„ (b) RP = {(Pi,<),(P2,wf),...,(P,,tvJ)}, where wf = vp£™ iwfwj,i- vs,vp are normalisation constants for weights. The weight for a statement S, in the result set Rg is computed as a normalised a dot product (i.e., sum of the element-wise products) of the vectors given by: (a) the membership degrees in the term result set Rj, and (b) the column in the statement index that corresponds to S,. Similarly, the weight for a provenance P, in the result set Rp is a normalised dot product of the vectors given by the Sj membership degrees and the column in the provenance index corresponding to P,. The fuzzy membership degrees in the term, statement and provenance result sets can be used for ranking and visualisation, prioritising the most important results when presenting them to the user. The following example outlines how a specific query is evaluated. Example 5 Let us assume we want to query the full SKIMMR knowledge base about Parkinsons Disease for the following: ? -o parkinsonism AND (? -o mrpi OR ? -o magnetic resonance parkinsonism index) This aims to find all statements (and corresponding documents) that are related to parkinsonism and either magnetic resonance parkinsonism index or its mrpi abbreviation. First of all, the full-text index is queried, retrieving two different terms conforming to the first atomic part of the query due to its stemming features: parkinsonism and parkinsonisms. The other two atomic parts of the initial query are resolved as is. After the look-up in the term index, four fuzzy sets are retrieved: 1. Tparkinsonism (3,714 results), 2. ^parkinsonisms (151 results), 3. Tmrpi (39 results). 4. ^magnetic resonance parkinsonism index (29 results). The set of terms conforming to the query is then computed as (^parkinsonism U ^parkinsonisms) ^ (Pmrpi U ^magnetic resonance parkinsonism index)- When using maximum and minimum as t-conorm and t-normfor computing the fuzzy union and intersection (Hájek, 1998), respectively, the resulting set has 29 elements with non-zero membership degrees. The top five of them are (1) cup, (2) mrpi, (3) magnetic resonance parkinsonism index, (4) clinically unclassif iable parkinsonism, (5) clinical evolution Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 ZD 9/38 61 PeerJ with membership degrees 1.0,1.0,0.704,0.39,0.34, respectively. Accordingto the statement index, there are 138 statements corresponding to the top five term results of the initial query, composed of 136 co-occurrences and 2 similarities. 
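The fuzzy evaluation just described can be condensed into a short sketch; the actual top-ranked statements for this query are listed right after it. The sketch combines fuzzy term result sets with minimum and maximum as t-norm and t-conorm, and then scores statements by a normalised dot product as in step 4. The index structures are plain dictionaries, the membership degrees in the toy data are made up, and the helper names are ours, not the SKIMMR implementation.

def fuzzy_and(a, b):
    """Fuzzy intersection (minimum t-norm) of two {term: degree} result sets."""
    return {t: min(a[t], b[t]) for t in set(a) & set(b)}

def fuzzy_or(a, b):
    """Fuzzy union (maximum t-conorm) of two {term: degree} result sets."""
    return {t: max(a.get(t, 0.0), b.get(t, 0.0)) for t in set(a) | set(b)}

def score_statements(term_result, statement_index):
    """Membership degrees of statements given a fuzzy term result set.

    statement_index maps term -> {statement_id: 0/1 occurrence flag}; the score
    of a statement is the dot product of the term degrees with the statement's
    column, normalised here by the maximum (a simplified stand-in for v_s)."""
    scores = {}
    for term, degree in term_result.items():
        for sid, flag in statement_index.get(term, {}).items():
            scores[sid] = scores.get(sid, 0.0) + degree * flag
    top = max(scores.values(), default=1.0)
    return {sid: v / top for sid, v in scores.items()}

# ? parkinsonism AND (? mrpi OR ? magnetic resonance parkinsonism index)
T_parkinsonism = {'cup': 0.4, 'mrpi': 0.3, 'clinical evolution': 0.34}
T_mrpi = {'cup': 1.0, 'mrpi': 1.0}
T_index = {'magnetic resonance parkinsonism index': 0.704, 'mrpi': 0.9}
result_terms = fuzzy_and(T_parkinsonism, fuzzy_or(T_mrpi, T_index))

statement_index = {'cup': {'S1': 1}, 'mrpi': {'S1': 1, 'S2': 1}}
print(result_terms)
print(score_statements(result_terms, statement_index))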
The top five co-occurrence statements and the two similarity statements are: Type Entityi Entity2 Membership degree cooc mrp i cup 1.0 cooc mrpi magnetic resonance parkinsonism index 0.852 cooc cup magnetic resonance parkinsonism index 0.852 cooc mrpi clinically unclassif iable parkinsonism 0.695 cooc cup clinically unclassif iable parkinsonism 0.695 sim psp patients magnetic resonance parkinsonism index 0.167 sim parkinsonism clinical evolution 0.069 where the membership degrees are computed from the combination of the term weights as described before the example, using an arithmetic mean for the aggregation. Finally, a look-up in the source index for publications corresponding to the top seven result statements retrieves 8 relevant PubMed identifiers (PMID). The top five of them correspond to the following list of articles: PMID Title Authors Weight 21832222 The diagnosis of neurodegenerative disorders based on clinical and pathological findings using an MRI approach Watanabe H et al. 1.0 21287599 MRI measurements predict PSP in unclassifiable parkinsonisms: a cohort study Morelli M et al. 0.132 22277395 Accuracy of magnetic resonance parkinsonism index for differentiation of progressive supranuclear palsy from probable or possible Parkinson disease Morelli M et al. 0.005 15207208 Utility of dopamine transporter imaging (123-1 Ioflupane SPECT) in the assessment of movement disorders Garcia Vicente AM et al. 0.003 8397761 Alzheimer's disease and idiopathic Parkinson's disease coexistence Rajput AH et al. 0.002 7 See the SMA SKIMMR instance at http:/ /www.skimmr.org:8008/data/html/trial. tmp for details. where the weights have been computed by summing up the statement set membership degrees multiplied by the source index weights and then normalising the values by their maximum. Evaluation methodology In addition to proposing specific methods for creating knowledge bases that support skim reading, we have also come up with a specific methodology for evaluating the generated knowledge bases. An ideal method for evaluating the proposed approach, implemented as a SKIMMR tool, would be to record and analyse user feedback and behaviour via SKIMMR instances used by large numbers of human experts. We do have such means for evaluating SKIMMR implemented in the user interface.7 However, we have not yet managed to collect sufficiently large sample of user data due to the early stage of the prototype deployment. Therefore we implemented an indirect methodology for automated quantitative evaluation of SKIMMR instances using publicly available manually Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 "J10/38 62 PeerJ MeSH (Medical Subject Headings) is a comprehensive, manually curated and regularly updated controlled vocabulary and taxonomy of biomedical terms. It is frequently used as a standard for annotation of biomedical resources, such as PubMed abstracts. See http:// www.ncbi.nlm.nih.gov/mesh for details. curated data. The methodology is primarily based on simulation of various types of human behaviour when browsing the entity networks generated by SKIMMR. We formally define certain properties of the simulations and measure their values in order to determine the utility of the entity networks for the purposes of skim reading. Details are given in the following sections. Overview of the evaluation methods The proposed methods intend to simulate human behaviour when using the data generated by SKIMMR. 
We apply the same simulations also to baseline data that can serve for the same or similar purpose as SKIMMR (i.e., discovery of new knowledge by navigating entity networks). Each simulation is associated with specific measures of performance, which can be used to compare the utility of SKIMMR with respect to the baseline. The primary evaluation method is based on random walks (Lovdsz, 1993) in an undirected entity graph corresponding to the SKIMMR knowledge base. For the baseline, we use a network of MeSH terms assigned by human curators to the PubMed abstracts that have been used to create the SKIMMR knowledge base.8 This represents a very similar type of content, i.e., entities associated with PubMed articles. It is also based on expert manual annotations and thus supposed to be a reliable gold standard (or at least a decent approximation thereof due to some level of transformation necessary to generate the entity network from the annotations). Example 6 Returning to the knowledge base statement from Example 2 in 'Corpus-wide co-occurrence': (parkinsonism, cooc,DRD, 0.545). In the SKIMMR entity graph, this corresponds to two nodes (parkinsonism,DRD) and one edge between them with weight 0.545. We do not distinguish between the types of the edges (i.e., co-occurrence or similarity), since it is not of significant importance for the SKIMMR users according to our experience so far (they are more interested in navigating the connections between nodes regardless of the connection type). A baseline entity graph is generated from the PubMed annotations with MeSH terms. For all entities X, Y associated with an abstract A, we construct an edge connecting the nodes X and Y in the entity graph. The weight is implicitly assumed to be 1.0 for all such edges. To explain this using concrete data, let us consider the two PubMed IDs from Example 1, 9629849 and 8239569. Selected terms from the corresponding MeSH annotations are {Parkinson Disease/radionuclide imaging,Male, Child}, {Parkinson Disease/radionuclide imaging,Dystonia/drug therapy}, respectively. The graph induced by these annotations is depicted in Fig. 1. The secondary evaluation method uses an index of related articles derived from the entities in the SKIMMR knowledge bases. For the baseline, we use either an index of related articles produced by a specific service of PubMed (Lin & Wilbur, 2007), or the evaluation data from the document categorisation task of the TREC'04 genomics track (Cohen & Hersh, 2006) where applicable. We use the TREC data since they were used also for evaluation of the actual algorithm used by PubMed to compute related articles. Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 111/38 63 PeerJ ^^ystonia/drug therap^^ ^^^arkinson Disease/radionuclide imagin^^^ Figure 1 Example of an entity graph derived from Pub Med. 9 The articles in the TREC data set are annotated by membership in a number of specific categories. The membership is gradual, with three possible values-definitely relevant, possibly relevant and not relevant. To generate the index of related articles from the SKIMMR data, we first use the knowledge base indices (see 'Extracting basic co-occurrence statements from texts') to generate a mapping Ep : £ —> 2P from entities from a set £ to a set of corresponding provenance identifiers (subsets of a set P). In the next step, we traverse the entity graph Ge derived from the statements in the SKIMMR knowledge base and build an index of related articles according to the following algorithm: 1. 
Initialise a map Mp between all possible (Pi,Pj) provenance identifier pairs and the weight of an edge between them so that all values are zero. 2. For all pairs of entities £i ,£„ (i.e., nodes in Ge), do: • If there is a path V of edges {(£i ,£2), (£2, £3), ••., (£„_!,£„)} in Ge- - compute an aggregate weight of the path as w-p = weue2 ' we2,e3 ■ ■■■■ we„-uE„ (as a multiplication of all weights along the path V); - set the values Mp(Pi,Pj) of the map Mp to max(Mp(Pi,Pj),w-p) for every Pi,Pj such that Pi € Ep{E\),Pj € Ep(En) (i.e., publications corresponding to the source and target entities of the path). 3. Interpret the Mp map as an adjacency matrix and construct a corresponding weighted undirected graph Gp. 4. For every node P in Gp, iteratively construct the index of related articles by associating the key P with a list L of all neighbours of P in Gp sorted by the weights of the corresponding edges. Note that in practice, we restrict the maximum length of the paths to three and also remove edges in Gp with weight below 0.1. This is to prevent a combinatorial explosion of the provenance graph when the entity graph is very densely connected. The baseline index of related publications according to the PubMed service is simply a mapping of one PubMed ID to an ordered list of the related PubMed IDs. The index based on the TREC data is generated from the article categories in the data set. For a PubMed ID X, the list of related IDs are all IDs belonging to the same category as X, ordered so that the definitely relevant articles occur before the possibly relevant ones.9 Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 12/38 64 PeerJ Motivation of the evaluation methods The random walks are meant to simulate user's behaviour when browsing the SKIMMR data, starting with an arbitrary entry point, traversing a number of edges linking the entities and ending up in a target point. Totally random walk corresponds to when a user browses randomly and tries to learn something interesting along the way. Other types of user behaviour can be simulated by introducing specific heuristics for selection of the next entity on the walk (see below for details). To determine how useful a random walk can be, we measure properties like the amount of information along the walk and in its neighbourhood, or semantic similarity between the source and target entities (i.e., how semantically coherent the walk is). The index of related articles has been chosen as a secondary means for evaluating SKIMMR. Producing links between publications is not the main purpose of our current work, however, it is closely related to the notion of skim reading. Furthermore, there are directly applicable gold standards we can use for automated evaluation of the lists of related articles generated by SKIMMR, which can provide additional perspective on the utility of the underlying data even if we do not momentarily expose the publication networks to users. Running and measuring the random walks To evaluate the properties of random walks in a comprehensive manner, we ran them in batches with different settings of various parameters. These are namely: (1) heuristics for selecting the next entity (one of the four denned below); (2) length of the walk (2, 5, 10 or 50 edges); (3) radius of the walk's envelope, i.e., the maximum distance between the nodes of the path and entities that are considered its neighbourhood (0, 1, 2); (4) number of repetitions (100-times for each combination of the parameter (1-3) settings). 
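Returning to the index of related articles constructed by steps 1 to 4 above (paths of at most three edges, path weights aggregated as products, weak links pruned), a minimal sketch of that construction is given below. The graph and provenance mapping are plain dictionaries and the function names are illustrative rather than taken from the SKIMMR code base; skipping article self-links is our simplification.

def related_articles(entity_graph, provenance, max_len=3, min_weight=0.1):
    """Approximate the related-article map M_P described in steps 1-4.

    entity_graph: {entity: {neighbour: weight}}, an undirected weighted graph.
    provenance:   {entity: set of PubMed IDs the entity was extracted from}.
    The weight of a path is the product of its edge weights; for each pair of
    source/target articles only the maximum path weight is kept, and links
    below min_weight are pruned, as in the note after the algorithm."""
    M = {}
    def record(src, dst, weight):
        if weight < min_weight:
            return
        for p_i in provenance.get(src, ()):
            for p_j in provenance.get(dst, ()):
                if p_i != p_j:   # article self-links are skipped (simplification)
                    M[(p_i, p_j)] = max(M.get((p_i, p_j), 0.0), weight)
    def walk(start, current, weight, length, visited):
        if length > 0:
            record(start, current, weight)
        if length == max_len:
            return
        for nxt, w in entity_graph.get(current, {}).items():
            if nxt not in visited:
                walk(start, nxt, weight * w, length + 1, visited | {nxt})
    for entity in entity_graph:
        walk(entity, entity, 1.0, 0, {entity})
    return M

# Toy graph built from the statements in Examples 2 and 3.
graph = {'parkinsonism': {'drd': 0.545}, 'drd': {'parkinsonism': 0.545},
         'parkinsonisms': {'mrpi values': 0.365}, 'mrpi values': {'parkinsonisms': 0.365}}
prov = {'parkinsonism': {9629849, 8239569}, 'drd': {9629849, 8239569},
        'parkinsonisms': {9629849}, 'mrpi values': {21832222, 22076870}}
index = related_articles(graph, prov)
print(sorted(index.items(), key=lambda kv: -kv[1]))   # neighbours sorted by weight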
Before we continue, we have to introduce few notions that are essential for the definition of the random walk heuristics and measurements. The first of them is a set of top-level (abstract) clusters associated with an entity in a graph (either from SKIMMR or from PubMed) according to the MeSH taxonomy. This is denned as a function Ca '■ E —> M, where E,M are the sets of entities and MeSH cluster identifiers, respectively. The second notion is a set of specific entity cluster identifiers Cs, denned on the same domain and range as Ca, i-e., Cs : E —> M. The MeSH cluster identifiers are derived from the tree path codes associated with each term represented in MeSH. The tree path codes have the form L\.Lj.....L„_i.L„ where L, are sub-codes of increasing specificity (i.e., L\ is the most general and Ln most specific). For the abstract cluster identifiers, we take only the top-level tree path codes into account as the values of Ca, while for Cs we consider the complete codes. Note that for the automatically extracted entity names in SKIMMR, there are often no direct matches in the MeSH taxonomy that could be used to assign the cluster identifiers. In these situations, we try to find a match for the terms and their sub-terms using a lemmatised full-text index implemented on the top of MeSH. This helps to increase the coverage two- to three-fold on our experimental data sets. Novacek and Burns (2014), PeerJ, DO110.7717/peerj.483 ZD 13/38 65 PeerJ For some required measures, we will need to consider the number and size of specific clusters associated with the nodes in random walks and their envelopes. Let us assume a set of entities Z QE. The number of clusters associated with the entities from Z, cn(Z), is then denned as cn(Z) = | UxezCPOl wnere C is one of Ca,Cs (depending on which type of clusters are we interested in). The size of a cluster Q e C(X), cs(Q), is denned as an absolute frequency of the mentions of Q among the clusters associated with the entities in Z. More formally, cs(Q) = \{X\X e Za Q e C(X)}\. Finally, we need a MeSH-based semantic similarity of entities siniM(X, Y), which is denned in detail in the formula (8) in 'Similarities'. Example 7 To illustrate the MeSH-based cluster annotations and similarities, let us consider two entities, supranuclear palsy, progressive, 3 and secondary parkinson disease. The terms correspond to the MeSH tree code sets {C10.228.662.700,...,C23.888.592.636.447.690,...,Cll.590.472.500,...} and {C10.228.662.600.700}, respectively, which are also the sets of specific clusters associated with the terms. The top-level clusters are {C10, Cll, C23} and {C10}, respectively. The least common subsumer of the two terms is C10.228.662 of depth 3 (the only possibility with anythingin common is C10.228.662.700 and C10.228.662.600.700J. The depths of the related cluster annotations are 4 and 5, therefore the semantic similarity is = |. We define four heuristics used in our random walk implementations. All the heuristics select the next node to visit in the entity graph according to the following algorithm: 1. Generate the list L of neighbours of the current node. 2. Sort L according to certain criteria (heuristic-dependent). 3. Initialise a threshold e to e„ a pre-defined number in the (0,1) range (we use 0.9 in our experiments). 4. For each node u in the sorted list L, do: • Generate a random number r from the [0,1] range. • If r < e: - return u as the next node to visit. • Else: - set e to e ■ e, and continue with the next node in L. 5. 
If nothing has been selected by now, return a random node from L. All the heuristics effectively select the nodes closer to the head of the sorted neighbour list more likely than the ones closer to the tail. The random factor is introduced to emulate the human way of selecting next nodes to follow, which is often rather fuzzy according to our observations of sample SKIMMR users. The distinguishing factor of the heuristics are the criteria for sorting the neighbour list. We employed the following four criteria in our experiments: (1) giving preference to the nodes that have not been visited before (H = 1); (2) giving preference to the nodes connected by edges with higher weight (H = 2); (3) giving preference to the nodes that are Novacek and Burns (2014), PeerJ, DO110.7717/peerj.483 ZD 14/38 66 PeerJ Biconnected components can be understood as sets of nodes in a graph that are locally strongly connected and therefore provide us with a simple approximation of clustering in the entity graphs based purely on their structural properties. more similar, using the simu function introduced before (H = 3); (4) giving preference to the nodes that are less similar (H = 4). The first heuristic simulates a user that browses the graph more or less randomly, but prefers to visit previously unknown nodes. The second heuristic models a user that prefers to follow a certain topic (i.e., focused browsing). The third heuristic represents a user that wants to learn as much as possible about many diverse topics. Finally, the fourth heuristic emulates a user that prefers to follow more plausible paths (approximated by the weight of the statements computed by SKIMMR). Each random walk and its envelope (i.e., the neighbourhood of the corresponding paths in the entity graphs) can be associated with various information-theoretic measures, graph structure coefficients, levels of correspondence with external knowledge bases, etc. Out of the multitude of possibilities, we selected several specific scores we believe to soundly estimate the value of the underlying data for users in the context of skim reading. Firstly, we measure semantic coherence of the walks. This is done using the MeSH-based semantic similarity between the nodes of the walk. In particular, we measure: (A) coherence between the source S and target T nodes as simM(S,T); (B) product coherence between all the nodes Ui,U2,...,Un of the walk as n Ui+i); (C) average coherence between all the nodes U\, Uj,---, Un of the walk as i Xiieji «-i) slmM(Lri> Ui+i). This family of measures helps us to assess how convergent (or divergent) are the walks in terms of focus on a specific topic. The second measure we used is the information content of the nodes on and along the walks. For this, we use the entropy of the association of the nodes with clusters denned either (a) by the MeSH annotations or (b) by the structure of the envelope. By definition, the higher the entropy of a variable, the more information the variable contains (Shannon, 1948). In our context, a high entropy value associated with a random walk means that there is a lot of information available for the user to possibly learn when browsing the graph. 
The specific entropy measures we use relate to the following sets of nodes: (D) abstract MeSH clusters, path only; (E) specific MeSH clusters, path only; (F) abstract MeSH clusters, path and envelope; (G) specific MeSH clusters, path and envelope; (H) clusters defined by biconnected components (Hop croft & Tarjan, 1973) in the envelope.10 The entropies of the sets (D-G) are defined by formulae (9) and (10) in'Entropies'. The last family of random walk evaluation measures is based on the graph structure of the envelopes: (I) envelope size (in nodes); (J) envelope size in biconnected components; (K) average component size (in nodes); (L) envelope's clustering coefficient. The first three measures are rather simple statistics of the envelope graph. The clustering coefficient is widely used as a convenient scalar representation of the structural complexity of a graph, especially in the field of social network analysis (Carrington, Scott & Wasserman, 2005). In our context, we can see it as an indication of how likely it is that the connections in the entity graph represent non-trivial relationships. To facilitate the interpretation of the results, we computed also the following auxiliary measures: (M) number of abstract clusters along the path; (N) average size of the abstract clusters along the path; (O) number of abstract clusters in the envelope; (P) average size of the abstract clusters in the envelope; (Q) number of specific clusters along the path; Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 15/38 67 PeerJ (R) average size of the specific clusters along the path; (S) number of specific clusters in the envelope; (T) average size of the specific clusters in the envelope. Note that all the auxiliary measures use the MeSH cluster size and number notions, i.e., cs(...) and cn{...) as denned earlier. Comparing the indices of related articles The indices of related articles have quite a simple structure. We can also use the baseline indices as gold standard, and therefore evaluate the publication networks implied by the SKIMMR data using classical measures of precision and recall (Manning, Raghavan & Schütze, 2008). Moreover, we can also compute correlation between the ranking of the items in the lists of related articles which provides an indication of how well SKIMMR preserves the ranking imposed by the gold standard. For the correlation, we use the standard Pearson's formula (Dowdy, Weardon & Chilko, 2005), taking into account only the ranking of articles occurring in both lists. The measures of precision and recall are defined using overlaps of the sets of related articles in the SKIMMR and gold standard indices. The detailed definitions of the specific notions of precision and recall we use are given in formulae (11) and (12) in 'Precision and recall'. The gold standard is selected depending on the experimental data set, as explained in the next section. In order to cancel out the influence of different average lengths of lists of related publications between the SKIMMR and gold standard indices, one can take into account only a limited number of the most relevant (i.e., top) elements in each list. RESULTS We have implemented the techniques described in the previous section as a set of software modules and provided them with a search and browse front-end. This forms a prototype implementation of SKIMMR, available as an open source software package through the GitHub repository (see 'Software packages' for details). 
We here describe the architecture of the SKIMMR software ('Architecture') and give examples on the typical use of SKIMMR in the domains of Spinal Muscular Atrophy and Parkinson's Disease ('Using SKIMMR'). 'Evaluation' presents an evaluation of the proposed approach to machine-aided skim reading using SKIMMR running on three domain-specific sets of biomedical articles. Architecture The SKIMMR architecture and data flow is depicted in Fig. 2. First of all, SKIMMR needs a list of PubMed identifiers (unique numeric references to articles indexed on PubMed) specified by the user of system administrator. Then it automatically downloads the abstracts of the corresponding articles and stores the texts locally. Alternatively, one can export results of a manual PubMed search as an XML file (using the 'send to file' feature) and then use a SKIMMR script to generate text from that file. From the texts, a domain-specific SKIMMR knowledge base is created using the methods described in 'Extracting basic co-occurrence statements from texts' and 'Computing a knowledge base from the extracted statements'. The computed statements and their article provenance are then indexed as described in 'Indexing and querying the knowledge base'. This allows Noväcek and Burns (2014), PeerJ, DO110.7717/peerj.483 ZD 16/38 68 PeerJ Pubmed Abstract Texts downloader Text mining Co-occurence analysis Similarity computation Full text Statements Provenance KB indexing Figure 2 Architecture of the SKIMMR system. 11 The live instances are running at http://www.skimmr.org:8008 and http: //www.skimmr.org:8090, respectively, as of June 2014. Canned back-up versions of them are available at http ://www.skimmr.org/resources/ skimmr/sma.tgz and http://www. skimmr.org/resources/skimmr/pd. tgz (SMA and Parkinson's Disease, respectively). If the SKIMMR dependencies are met (see https://github.com/ vitnov/SKIMMR), the canned instances can be used locally on any machine with Python installed (versions higher than 2.4 and lower than 3.0 are supported, while 2.6.* and 2.7.* probably work best). After downloading the archives, unpack them and switch to the resulting folder. Run the re-indexing script, following Section 3.6 in the README provided in the same folder. To execute the SKIMMR front-end locally, run the server as described in Section 3.7 of the README. users to search and browse the high-level graph summaries of the interconnected pieces of knowledge in the input articles. The degrees in the result sets (explained in detail in 'Indexing and querying the knowledge base') are used in the user interface to prioritise the more important nodes in the graphs by making their font and size proportional to the sum of the degrees of links (i.e., the number of statements) associated with them. Also, only a selected amount of the top scoring entities and links between them is displayed at a time. Using SKIMMR The general process of user interaction with SKIMMR can be schematically described as follows: 1. Search for an initial term of interest in a simple query text box. 2. A graph corresponding to the results of the search is displayed. The user has two options then: (a) Follow a link to another node in the graph, essentially browsing the underlying knowledge base along the chosen path by displaying the search results corresponding to the selected node and thus going back to step 1 above. (b) Display most relevant publications that have been used for computing the content of the result graph, going to step 3 below. 3. 
Access and study the displayed publications in detail using a re-direct to Pub Med. The following two sections illustrate the process using examples from two live instances of SKIMMR deployed on articles about Spinal Muscular Atrophy and Parkinson's Disease.11 The last section of this part of the article gives a brief overview of the open source software packages of SKIMMR available for developers and users interested in deploying SKIMMR on their own data. Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 117/38 69 PeerJ Ask a query: [etiology proteosomal dysfunction lixl exon sequence lixl-associated etiology iology searching skimming- reading An approximately 140-kb deletion associated with feline spinal muscular atrophy implies an essential 6. LIX1 function for motor neuron survival. Fyfe JC, Menotti-Raymond M, David VA, Brichta L, Schäffer AA, Agarwala R. Murphy WJ, Wedemeyer WJ, Gregory BL, Buzzell BG, Drummond MC. Wirth B, O'Brien SJ. Genome Res. 2006 Sep:16(9):1084-90. Epub 2006 Aug 9. PMID: 16899696 [PubMed - indexed for MEDLINE] Free PMC Article Related citations Figure 3 Exploring SMA etiology. 12 A genetic neurological disease caused by mutation of SMN1 gene that leads to death of motor neurons and consequent progressive muscle atrophy. It is the most common genetic cause of infant death and there is no cure as of now. See http://en.wikipedia.org/wiki/Spinal_ muscular .atrophy for details. 13 See http://www.smafoundation.org/. Spinal muscular atrophy Fig. 3 illustrates a typical session with the Spinal Muscular Atrophy12 instance of SKIMMR. The SMA instance was deployed on a corpus of 1,221 abstracts of articles compiled by SMA experts from the SMA foundation.13 The usage example is based on an actual session with Maryann Martone, a neuroscience professor from UCSD and a representative of the SMA Foundation who helped us to assess the potential of the SKIMMR prototype. Following the general template from the beginning of the section, the SMA session can be divided into three distinct phases: 1. Searching: The user was interested in the SMA etiology (studies on underlying causes of a disease). The keyword etiology was thus entered into the search box. 2. Skimming: The resulting graph suggests relations between etiology of SMA, various gene mutations, and the Lixl gene. Lixl is responsible for protein expression in limbs which seems relevant to the SMA manifestation, therefore the Lixl — associated etiology path was followed in the graph, moving on to a slightly different area in the underlying knowledge base extracted from the SMA abstracts. When browsing the graph along that path, one can quickly notice recurring associations with feline SMA. According to the neuroscience expert we consulted, the cat models of the SMA disease appear to be quite a specific and interesting fringe area of SMA Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 118/38 70 PeerJ research. Related articles may be relevant and enlightening even for experienced researchers in the field. 3. Reading: The reading mode of SKIMMR employs an in-line redirect to a specific Pub Med result page. This way one can use the full set of PubMed features for exploring and reading the articles that are mostly relevant to the focused area of the graph the user skimmed until now. The sixth publication in the result was most relevant for our sample user, as it provided more details on the relationships between a particular gene mutation in a feline SMA model and the Lixl function for motor neuron survival. 
This knowledge, albeit not directly related to SMA etiology in humans, was deemed as enlightening by the domain expert in the context of the general search for the culprits of the disease. The whole session with the neuroscience expert lasted about two minutes and clearly demonstrated the potential for serendipitous knowledge discovery with our tool.

Parkinson's disease
Another example of the usage of SKIMMR is based on a corpus of 4,727 abstracts concerned with the clinical studies of Parkinson's Disease (PD). A sample session with the PD instance of SKIMMR is illustrated in Fig. 4. Following the general template from the beginning of the section, the PD session can be divided into three distinct phases again:
1. Searching: The session starts with typing parkinson's into the search box, aiming to explore the articles from a very general entry point.
2. Skimming: After a short interaction with SKIMMR, consisting of a few skimming steps (i.e., following a certain path in the underlying graphs of entities extracted from the PD articles), an interesting area in the graph has been found. The area is concerned with the Magnetic Resonance Parkinsonism Index (MRPI). This is a numeric score calculated by multiplying two structural ratios: one for the area of the pons relative to that of the midbrain and the other for the width of the Middle Cerebellar Peduncle relative to the width of the Superior Cerebellar Peduncle. The score is used to diagnose PD based on neuroimaging data (Morelli et al., 2011).
3. Reading: When displaying the articles that were used to compute the subgraph surrounding MRPI, the user reverted to actual reading of the literature concerning MRPI and related MRI measures used to diagnose Parkinson's Disease as well as a range of related neurodegenerative disorders.
This example illustrates once again how SKIMMR provides an easy way of navigating through the conceptual space of a subject that is accessible even to novices, reaching interesting and well-specified component areas of the space very quickly.

Software packages
In addition to the two live instances described in the previous sections, SKIMMR is available for local installation and custom deployment either on biomedical article abstracts from PubMed, or on general English texts. Moreover, one can expose SKIMMR

Figure 4 A sample session with the Parkinson's Disease instance of SKIMMR: searching for parkinson's, skimming the entity graph (with nodes such as parkinsonisms, middle cerebellar peduncle, mrpi values, midbrain/pons ratio and magnetic resonance parkinsonism index), and reading the corresponding articles.

Table 6 Results for the related articles.
            PD                              SMA                             TREC
  pre_avg   rec_avg   C > 0.7     pre_avg   rec_avg   C > 0.7     pre_avg   rec_avg   C > 0.7
  0.0095    0.0240    0.5576      0.0139    0.0777    0.5405      0.0154    0.0487    0.5862

The C > 0.7 columns contain the ratio of SKIMMR results that have significant correlation (i.e., at least 0.7) with the corresponding baseline. The absolute values of the average precision and recall are very poor, in units of percents.
The correlation results are more promising, showing that more than half of the related document rankings produced by SKIMMR are reasonably aligned with the gold standard. Moreover, the correlation is highest for the TREC data set based on the only gold standard that is manually curated. DISCUSSION SKIMMR provides a computational instantiation of the concept of 'skim reading.' In the early prototype stage, we generally focussed on delivering as much of the basic functionality as possible in a lightweight interface. Lacking enough representative data collected from ongoing user studies, we have designed a series of automated experiments to simulate several skim reading modes one can engage in with SKIMMR. We evaluated these experiments using gold standards derived from manually curated biomedical resources. Here we offer a discussion of the results in relation to the concept of machine-aided skim reading as realised by the SKIMMR prototype. The discussion is followed by an overview of related work and an outline of possible future directions. Interpreting the results The secondary evaluation using lists of related publications induced by the SKIMMR knowledge bases did not bring particularly good results in terms of precision and recall. However, the correlation with the related document ranking provided by baselines was more satisfactory. This indicates that with better methods for pruning the rather extensive lists of related publications produced with SKIMMR, we may be able to improve the precision and recall substantially. Still, this evaluation was indirect since generating lists of related publications is not the main purpose of SKIMMR. Apart from indirect evaluation, we were also curious whether the data produced by SKIMMR could not be used also for a rather different task straightaway. The lesson learned is that this may be possible, however, some post-processing of the derived publication lists would be required to make the SKIMMR-based related document retrieval more accurate for practical applications. Our main goal was to show that our approach to machine-aided skim reading can be efficient in navigating high-level conceptual structures derived from large numbers of publications. The results of the primary evaluation experiment—simulations of various types of skimming behaviour by random walks—demonstrated that our assumption may indeed be valid. The entity networks computed by SKIMMR are generally more semantically coherent, more informative and more complex than similar networks based on Novacek and Burns (2014), PeerJ, DO110.7717/peerj.483 zd 28/38 80 PeerJ the manually curated Pub Med article annotations. This means that users will typically be able to browse the SKIMMR networks in a more focused way. At the same time, however, they will learn more interesting related information from the context of the browsing path, and can also potentially gain additional knowledge from more complex relationships between the concepts encountered on the way. This is very promising in the context of our original motivations for the presented research. Experiments with actual users would have brought many more insights regarding the practical relevance of the SKIMMR prototype. Still, the simulations we have proposed cover four distinct classes of possible browsing behaviour, and our results are generally consistent regardless of the particular heuristic used. 
This leads us to believe that the evaluation measures computed on paths selected by human users would not be radically different from the patterns observed within our simulations. Related work The text mining we use is similar to the techniques mentioned in Yan et al (2009), but we use a finer-grained notion of co-occurrence. Regarding biomedical text mining, tools like BioMedLEE (Friedman et al, 2004), MetaMap (Aronson &Lang, 2010) or SemRep (Liu et al, 2012) are closely related to our approach. The tools mostly focus on annotation of texts with concepts from standard biomedical vocabularies like UMLS which is very useful for many practical applications. However, it is relatively difficult to use the corresponding software modules within our tool due to complex dependencies and lack of simple APIs and/or batch scripts. The tools also lack the ability to identify concepts not present in the biomedical vocabularies or ontologies. Therefore we decided to use LingPipe's batch entity recogniser in SKIMMR. The tool is based on a relatively outdated GENIA corpus, but is very easy to integrate, efficient and capable of capturing unknown entities based on the underlying statistical model, which corresponds well to our goal of delivering a lightweight, extensible and easily portable tool for skim-reading. The representation of the relationships between entities in texts is very close to the approach of Baroni & Lend (2010), however, we have extended the tensor-based representation to tackle a broader notion of text and data semantics, as described in detail in Nováček, Handschuh & Decker (2011). The indexing and querying of the relationships between entities mentioned in the texts is based on fuzzy index structures, similarly to Zadrozny & Nowacka (2009). We make use of the underlying distributional semantics representation, though, which captures more subtle features of the meaning of original texts. Graph-based representations of natural language data have previously been generated using dependency parsing (Ramakrishnan etal, 2008; Biemann et al, 2013). Since these representations are derived directly from the parse structure, they are not necessarily tailored for the precise task of skim-reading but could provide a valuable intermediate representation. Another graph-based representation that is derived from the text of documents are similarity-based approaches derived from 'topic models' of document corpora (Talley et al, 2011). Although these analyses typically provide a visualization of the organization of documents, not of their contents, the topic modeling methods provide Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 zd 29/38 81 PeerJ statistical representation of the text that can then be leveraged to examine other aspects of the context of the document, such as its citations (Foulds & Smyth, 2013). A broad research area of high relevance to the presented work is the field of'Machine Reading' that can be defined as "the autonomous understanding of text" (Etzioni, Banko & Cafarella, 2006). It is an ambitious goal that has attracted much interest from NLP researchers (Mulkar etal, 2007; Strassel etal, 2010; Poon & Domingos, 2010). By framing the reading task as 'skimming' (which provides a little more structure than simply navigating a set of documents, but much less than a full representation of the semantics of documents), we hope to leverage machine reading principles into practical tools that can be used by domain experts straightforwardly. 
Our approach shares some similarities with applications of spreading activation in information retrieval which are summarised for instance in the survey (Crestani, 1997). These approaches are based on associations between search results computed either off-line or based on the "live" user interactions. The network data representation used for the associations is quite close to SKIMMR, however, we do not use the spreading activation principle to actually retrieve the results. We let the users to navigate the graph by themselves which allows them to discover even niche and very domain-specific areas in the graph's structure that may not be reached using the spreading activation. Works in literature based discovery using either semantic relationships (Hristovski et at, 2006) or corresponding graph structures (Wilkowski etal, 2011) are conceptually very similar to our approach to skim reading. However, the methods are quite specific when deployed, focusing predominantly on particular types of relationships and providing pre-defined schema for mining instances of the relationships from the textual data. We keep the process lightweight and easily portable, and leave the interpretation of the conceptual networks on the user. We do lose some accuracy by doing so, but the resulting framework is easily extensible and portable to a new domain within minutes, which provides for a broader coverage compensating the loss of accuracy. From the user perspective, SKIMMR is quite closely related to GoPubMed (Dietze et at, 2008), a knowledge-based search engine for biomedical texts. GoPubMed uses Medical Subject Headings and Gene Ontology to speed up finding of relevant results by semantic annotation and classification of the search results. SKIMMR is oriented more on browsing than on searching, and the browsing is realised via knowledge bases inferred from the texts automatically in a bottom-up manner. This makes SKIMMR independent on any pre-defined ontology and lets users to combine their own domain knowledge with the data present in the article corpus. Tools like DynaCat (Pratt, 1997) or QueryCat (Pratt & Wasserman, 2000) share the basic motivations with our work as they target the information overload problem in life sciences. They focus specifically on automated categorisation of user queries and the query results, aiming at increasing the precision of document retrieval. Our approach is different in that it focuses on letting users explore the content of the publications instead of the publications themselves. This provides an alternative solution to the information overload by leading Novacek and Burns (2014), PeerJ, DOI10.7717/peerj.483 ^^^^^^^^~Z~H 30/38 82 PeerJ users to interesting information spanning across multiple documents that may not be grouped together by Pratt (1997) and Pratt & Wasserman (2000). Another related tool is Exhibit (Huynh, Karger & Miller, 2007), which can be used for faceted browsing of arbitrary datasets expressed in jSON (Crockford, 2006). Using Exhibit one can dynamically define the scope from which they want to explore the dataset and thus quickly focus on particular items of interest. However, Exhibit does not provide any solution on how to get the structured data to explore from possibly unstructured resources (such as texts). Textpresso (Müller, Kenny & Sternberg, 2004) is quite similar to SKIMMR concerning searching for relations between concepts in particular chunks of text. 
However, the underlying ontologies and their instance sets have to be provided manually, which often requires years of work, whereas SKIMMR operates without any such costly input. Moreover, the system's scale regarding the number of publications' full texts and concepts covered is generally lower than that of the SKIMMR instances, which can be set up in minutes. CORAAL (Nováček et al., 2010) is our previous work for cancer publication search, which extracts relations between entities from texts based on the verb frames occurring in the sentences. The content is then exposed via a multiple-perspective search and browse interface. SKIMMR brings the following major improvements over CORAAL: (1) a more advanced back-end (built using our distributional data semantics framework introduced in Nováček, Handschuh & Decker, 2011); (2) simplified modes of interaction with the data, leading to increased usability and better user experience; (3) richer, more robust fuzzy querying; (4) a general streamlining of the underlying technologies and front-ends motivated by the simple, yet powerful metaphor of machine-aided skim reading. Future work Despite the initial promising results, there is still much to do in order to realise the full potential of SKIMMR as a machine-aided skim reading prototype. First of all, we need to continue our efforts in recruiting coherent and reliable sample user groups for each of the experimental SKIMMR instances in order to complement the presented evaluation with results of actual user studies. Once we get the users' feedback, we will analyse it and try to identify significant patterns emerging from the tracked behaviour data in order to correlate them with the explicit feedback, usability assessments and the results achieved in our simulation experiments. This will provide us with a sound basis for the next iteration of the SKIMMR prototype development, which will reflect more representative user requirements. Regarding the SKIMMR development itself, the most important things to improve are as follows. We need to extract more types of relations than just co-occurrence and the rather broadly defined similarity. One example of a domain-specific complex relation is the association of potential side effects with drugs. Another, more general example is taxonomical relations (super-concept, sub-concept), which may help provide an additional perspective for browsing the entity networks (i.e., starting with a high-level overview of the relations between more abstract concepts and then focusing on the structure of the connections between more specific sub-concepts of selected nodes). Other improvements related to the user interface are: (1) smoother navigation in the entity networks (the nodes have to be active and shift the focus of the displayed graph upon clicking on them; they may also display additional metadata, such as summaries of the associated source texts); (2) support for more expressive (conjunctive, disjunctive, etc.) search queries not only in the back-end, but also in the front-end, preferably with a dedicated graphical user interface that allows even lay users to formulate the queries easily; (3) higher-level visualisation features, such as the evolution of selected concepts' neighbourhoods over time on a sliding scale. We believe that the realisation of all these features will make SKIMMR a truly powerful tool for facilitating knowledge discovery (not only) in life sciences.
ACKNOWLEDGEMENTS We would like to thank our former colleagues Eduard H. Hovy and Drashti Dave for their generously shared insights regarding the NLP and biomedical aspects, respectively, of the presented work. Last but not least, we are indebted to Maryann Martone for her guidance concerning the Spinal Muscular Atrophy domain and for multiple testing sessions during SKIMMR development, which helped us to refine the tool in order to meet the actual requirements of life scientists. APPENDIX. FORMULAE DEFINITIONS In this appendix we give a full account of the definitions of some of the formal notions used throughout the main article but not covered in detail there. Co-occurrences The basic co-occurrence score $cooc((e_x,e_y),PubMed_{PMID})$ for two entities $e_x, e_y$ in an article $PubMed_{PMID}$, introduced in 'Extracting basic co-occurrence statements from texts', is computed as $$cooc((e_x,e_y),PubMed_{PMID}) = \sum_{i,j \in S(e_x,e_y)} \frac{1}{1 + |i - j|} \quad (1)$$ where $S(e_x,e_y)$ is a set of numbers of sentences that contain the entity $e_x$ or $e_y$ (assuming the sentences are numbered sequentially from the beginning of the text). In practice, one may impose a limit on the maximum allowed distance of entities to be taken into account in the co-occurrence score computation (we disregard entities occurring more than 3 sentences apart in the score sum). The non-normalised formula for corpus-wide co-occurrence for two outcomes (i.e., terms in our specific use case) $x, y$, using a base-2 logarithm (introduced in 'Corpus-wide co-occurrence'), is: $$fpmi(x,y) = F(x,y)\log_2 \frac{p(x,y)}{p(x)\,p(y)} \quad (2)$$ where $F(x,y)$ is the absolute frequency of the $x,y$ co-occurrence and $p(x,y), p(x), p(y)$ are the joint and individual distributions, respectively. In our case, the distributions are the weighted relative frequencies of the entity terms in the basic co-occurrence tuples generated from the input texts, which are computed as follows. Let us assume a set $T$ of tuples $t_1 = (e_{1,x}, e_{1,y}, cooc((e_{1,x},e_{1,y}),PubMed_{PMID_1}), PubMed_{PMID_1})$, $t_2 = (e_{2,x}, e_{2,y}, cooc((e_{2,x},e_{2,y}),PubMed_{PMID_2}), PubMed_{PMID_2})$, ..., $t_n = (e_{n,x}, e_{n,y}, cooc((e_{n,x},e_{n,y}),PubMed_{PMID_n}), PubMed_{PMID_n})$ as a result of the basic co-occurrence statement extraction described in the previous section. The joint distribution of terms $x, y$ specific to our case can then be computed as: $$p(x,y) = \frac{\sum_{w \in W(x,y,T)} w}{\sum_{(e_1,e_2,w,i) \in T} w} \quad (3)$$ where $W(x,y,T) = \{w \mid \exists e_1,e_2,w,i.\ (e_1,e_2,w,i) \in T \wedge ((e_1 = x \wedge e_2 = y) \vee (e_1 = y \wedge e_2 = x))\}$ is the set of weights in the basic co-occurrence tuples that contain both $x, y$ as entity arguments. Finally, the individual distribution of a term $z$ is computed as: $$p(z) = \frac{\sum_{w \in W(z,T)} w}{\sum_{(e_1,e_2,w,i) \in T} w} \quad (4)$$ where $W(z,T) = \{w \mid \exists e_1,e_2,w,i.\ (e_1,e_2,w,i) \in T \wedge (e_1 = z \vee e_2 = z)\}$ is the set of weights in the basic co-occurrence tuples that contain $z$ as any one of the entity arguments. In the eventual result, all co-occurrence tuples with a score lower than zero are omitted, while the remaining ones are normalised as follows: $$npmi(x,y) = \nu(fpmi(x,y), P) \quad (5)$$ where $\nu$ is a function that divides the scores by the $P$-th percentile of all the scores and truncates the resulting value to 1 if it is higher than that.
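To make the appendix definitions above more tangible, the following minimal Python sketch illustrates how the per-article co-occurrence score and the corpus-wide normalised point-wise mutual information could be computed. This is not the actual SKIMMR code: the pairing of sentence numbers across the two entities, the tuple layout and all function names are illustrative assumptions based on formulae (1)-(5).

```python
import math
from collections import defaultdict
from itertools import product

def cooc(sentences_x, sentences_y, max_dist=3):
    """Within-article co-occurrence score of two entities.

    sentences_x / sentences_y are the sentence numbers mentioning each
    entity; sentence pairs more than max_dist apart are ignored."""
    return sum(1.0 / (1.0 + abs(i - j))
               for i, j in product(sentences_x, sentences_y)
               if abs(i - j) <= max_dist)

def npmi_scores(tuples, percentile=0.95):
    """Corpus-wide normalised weighted point-wise mutual information.

    `tuples` is a list of (entity1, entity2, weight, article_id) records;
    returns a dict mapping unordered entity pairs to scores in (0, 1]."""
    total = sum(w for _, _, w, _ in tuples)
    p_joint = defaultdict(float)   # weighted relative pair frequencies
    p_single = defaultdict(float)  # weighted relative entity frequencies
    freq = defaultdict(int)        # absolute pair frequencies F(x, y)
    for e1, e2, w, _ in tuples:
        pair = tuple(sorted((e1, e2)))
        p_joint[pair] += w / total
        p_single[e1] += w / total
        p_single[e2] += w / total
        freq[pair] += 1
    fpmi = {pair: freq[pair] * math.log2(p / (p_single[pair[0]] * p_single[pair[1]]))
            for pair, p in p_joint.items()}
    # drop non-positive scores, normalise by the percentile, cap at 1
    positive = sorted(v for v in fpmi.values() if v > 0)
    if not positive:
        return {}
    cut = positive[min(len(positive) - 1, int(percentile * len(positive)))]
    return {pair: min(v / cut, 1.0) for pair, v in fpmi.items() if v > 0}
```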
The motivation for such a definition of the normalisation is that, using the percentile, one can flexibly reduce the influence of possibly disproportional distributions in the scores (i.e., when there are a few very high values, normalisation by the sum of all values or by the maximal value would result in most of the final scores being very low, whereas a carefully selected percentile can balance that out, reducing only a relatively low number of very high scores to a crisp 1). Similarities Firstly we define the cosine similarity introduced in 'Similarity'. For that we need a few auxiliary notions. The first of them is a so-called 'co-occurrence complement' $\bar{x}$ of an entity $x$: $$\bar{x} = \{(e,w) \mid \exists e,w.\ (e, cooc, x, w) \in KB \vee (x, cooc, e, w) \in KB\} \quad (6)$$ where $KB$ is the knowledge base, i.e., the set of the aggregated co-occurrence statements computed as shown in 'Corpus-wide co-occurrence'. Additionally, we define an element-set projection of an entity's co-occurrence complement $\bar{x}$ as $$\bar{x}_1 = \{y \mid \exists w.\ w > 0 \wedge (y,w) \in \bar{x}\},$$ i.e., the set of all the entities in the co-occurrence complement, abstracting from the corresponding co-occurrence weights. Finally, we use a shorthand notation $\bar{x}[y] = w$ such that $(y,w) \in \bar{x}$ for a quick reference to the weight corresponding to an entity in a co-occurrence complement. If an entity $y$ is missing in the co-occurrence complement of $x$, we define $\bar{x}[y] = 0$. Example 8 Assuming that the knowledge base consists only of one co-occurrence tuple (parkinsonism, cooc, DRD, 0.545) from the previous Example 2, we can define two co-occurrence complements on the entities in it: $\overline{parkinsonism} = \{(DRD, 0.545)\}$, $\overline{DRD} = \{(parkinsonism, 0.545)\}$. The element-set projection of $\overline{parkinsonism}$ is then the set $\{DRD\}$, while $\overline{parkinsonism}[DRD]$ equals 0.545. Now we can define the similarity between two entities $a, b$ in a SKIMMR knowledge base as $$sim(a,b) = \frac{\sum_{x \in \bar{a}_1 \cap \bar{b}_1} \bar{a}[x]\,\bar{b}[x]}{\sqrt{\sum_{x \in \bar{a}_1} \bar{a}[x]^2}\,\sqrt{\sum_{x \in \bar{b}_1} \bar{b}[x]^2}} \quad (7)$$ where $\bar{a}, \bar{b}$ are the co-occurrence complements of $a, b$, and $\bar{a}_1, \bar{b}_1$ their element-set projections. It can easily be seen that the formula directly corresponds to the definition of cosine similarity: its top part is the dot product of the co-occurrence context vectors corresponding to the entities $a, b$, while the lower part is the multiplication of the vectors' sizes (Euclidean norms in particular). The MeSH-based semantic similarity of entities, introduced in 'Running and measuring the random walks', is defined as $$sim_M(X,Y) = \max_{u \in C_s(X),\, v \in C_s(Y)} \frac{2 \cdot dpt(lcs(u,v))}{dpt(u) + dpt(v)} \quad (8)$$ where the specific tree codes in $C_s(X), C_s(Y)$ are interpreted as nodes in the MeSH taxonomy, $lcs$ stands for the least common subsumer of two nodes in the taxonomy, and $dpt$ is the depth of a node in the taxonomy (defined as zero if no node is supplied as an argument, i.e., if $lcs$ has no result). The formula we use is essentially based on a frequently used taxonomy-based similarity measure defined in Wu & Palmer (1994). We only maximise it across all possible cluster annotations of the two input entities to find the best match. Note that this strategy is safe in the case of a resource with as low ambiguity as MeSH - while there are often multiple annotations of a term, they do not refer to different senses but rather to different branches in the taxonomy. Therefore, using the maximum similarity corresponds to finding the most appropriate branch in the MeSH taxonomy along which the terms can be compared.
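The following is a small illustrative sketch (again, not the SKIMMR implementation) of the cosine similarity over co-occurrence complements defined in Eqs. (6) and (7); the knowledge base is assumed to be a plain list of aggregated (entity, entity, weight) statements, which is an assumption of mine.

```python
import math

def cooccurrence_complement(kb, x):
    """Collect the (entity, weight) pairs co-occurring with x in the KB.

    `kb` is an iterable of (e1, e2, weight) aggregated co-occurrence
    statements -- an illustrative stand-in for the SKIMMR knowledge base."""
    comp = {}
    for e1, e2, w in kb:
        if e1 == x:
            comp[e2] = w
        elif e2 == x:
            comp[e1] = w
    return comp

def sim(kb, a, b):
    """Cosine similarity of two entities via their co-occurrence complements."""
    ca, cb = cooccurrence_complement(kb, a), cooccurrence_complement(kb, b)
    shared = set(ca) & set(cb)
    dot = sum(ca[e] * cb[e] for e in shared)
    norm = math.sqrt(sum(w * w for w in ca.values())) * \
           math.sqrt(sum(w * w for w in cb.values()))
    return dot / norm if norm else 0.0

# usage mirroring Example 8 (a single aggregated statement in the KB)
kb = [("parkinsonism", "DRD", 0.545)]
print(sim(kb, "parkinsonism", "DRD"))  # 0.0 -- no shared co-occurring entity
```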
Entropies 'Running and measuring the random walks' introduced entropies for expressing the information value of SKIMMR evaluation samples (i.e., random walks and their contexts). The entropies are defined using the notion of MeSH cluster size ($cs(\cdot)$) introduced in the main part of the article. Given a set $Z$ of nodes of interest, the entropy based on MeSH cluster annotations, $H_M(Z)$, is computed as $$H_M(Z) = -\sum_{C_i \in C(Z)} \frac{cs(C_i)}{\sum_{C_j \in C(Z)} cs(C_j)} \log_2 \frac{cs(C_i)}{\sum_{C_j \in C(Z)} cs(C_j)} \quad (9)$$ where $C$ is one of $C_a, C_s$, depending on whether we consider the abstract or the specific nodes. Similarly, the component-based entropy $H_C(Z)$ is defined as $$H_C(Z) = -\sum_{C_i \in B(Z)} \frac{|C_i|}{\sum_{C_j \in B(Z)} |C_j|} \log_2 \frac{|C_i|}{\sum_{C_j \in B(Z)} |C_j|} \quad (10)$$ where $B(Z)$ is a function returning the set of biconnected components in the envelope of $Z$, which is effectively a set of subsets of nodes from $Z$. Precision and recall The indices of related articles are compared using precision and recall measures, as stated in 'Comparing the indices of related articles'. Let $I_S : P \to 2^P$, $I_G : P \to 2^P$ be the SKIMMR and gold standard indices of related publications, respectively ($P$ being a set of publication identifiers). Then the precision and recall for a publication $p \in P$ are computed as $$pre(p) = \frac{|I_S(p) \cap I_G(p)|}{|I_S(p)|}, \qquad rec(p) = \frac{|I_S(p) \cap I_G(p)|}{|I_G(p)|} \quad (11)$$ respectively. To balance the possibly quite different lengths of the lists of related articles, we limit the computation of the precision and recall to at most the 50 most relevant items in the lists. The average values of precision and recall for a corpus of articles $X \subseteq P$ are computed as $$pre_{avg}(X) = \frac{\sum_{p \in X} pre(p)}{|X|}, \qquad rec_{avg}(X) = \frac{\sum_{p \in X} rec(p)}{|X|} \quad (12)$$ respectively. ADDITIONAL INFORMATION AND DECLARATIONS Funding This publication has emanated from research supported in part by research grants from Science Foundation Ireland (SFI) under Grant Numbers SFI/08/CE/I1380, SFI/08/CE/I1380 - STTF 11 (2), and SFI/12/RC/2289. Work was also supported under NIH grants RO1-GM083871 and RO1-MH079068-01A2. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: Science Foundation Ireland (SFI): SFI/08/CE/I1380, SFI/08/CE/I1380-STTF 11 (2), SFI/12/RC/2289. NIH: RO1-GM083871, RO1-MH079068-01A2. Competing Interests The authors declare there are no competing interests. Author Contributions • Vít Nováček conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper, and implemented the SKIMMR prototype and the corresponding experimental validation scripts. • Gully A.P.C. Burns conceived and designed the experiments, wrote the paper, and reviewed drafts of the paper. Ethics The following information was supplied relating to ethical approvals (i.e., approving body and any reference numbers): IRB name: USC UPIRB, approval number: UP-12-00414.
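For concreteness, here is a minimal sketch of the entropy and the cut-off precision/recall computations from Eqs. (9)-(12); the index data structures and the handling of the 50-item cut-off are illustrative assumptions rather than the evaluation scripts actually used in the paper.

```python
import math

def shannon_entropy(sizes):
    """Entropy of the distribution given by raw cluster/component sizes."""
    total = float(sum(sizes))
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

def precision_recall(skimmr_index, gold_index, pub, cutoff=50):
    """Precision and recall of SKIMMR's related-article list for one publication.

    Both indices map a publication id to a ranked list of related ids;
    only the top `cutoff` items of each list are considered."""
    retrieved = set(skimmr_index.get(pub, [])[:cutoff])
    relevant = set(gold_index.get(pub, [])[:cutoff])
    if not retrieved or not relevant:
        return 0.0, 0.0
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

def averages(skimmr_index, gold_index, corpus):
    """Average precision and recall over a corpus of publication ids."""
    pairs = [precision_recall(skimmr_index, gold_index, p) for p in corpus]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))
```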
Data Deposition The following information was supplied regarding the deposition of related data: http://skimmr.org/resources/skimmr/pd.tgz (Parskinson's Disease instance of SKIMMR, canned archive version) http://skimmr.org/resources/skimmr/sma.tgz (Spinal Muscular Atrophy instance of SKIMMR, canned archive version) http://skimmr.org/resources/skimmr/trectgz (TREC instance of SKIMMR, canned archive version) http://skimmr.org/resources/skimmr/plots.tgz (complete plots of the results). REFERENCES AronsonAR, Lang F-M. 2010. An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3):229-236 DOI 10.1136/jamia.2009.002733. Baroni M, Lenci A. 2010. Distributional memory: a general framework for corpus-based semantics. Computational Linguistics 36(4):673-721 DOI 10.1162/coli_a_00016. Bieraann C, Coppola B, Glass MR, Gliozzo A, Hatěni M, Riedl M. 2013. JoBimText visualizer: a graph-based approach to contextualizing distributional similarity. In: Proceedings of TextGraphs-8 graph-based methods for natural language processing, Seattle, Washington, USA. Association for Computational Linguistics, 6-10. Carrington PJ, Scott J, Wasserraan S. 2005. Models and methods in social network analysis. Cambridge: Cambridge University Press. Nováček and Burns (2014), PeerJ, DO110.7717/peerj.483 ^^^^^^^^^mj 36/38 88 PeerJ Cohen AM, Hersh WR. 2006. The tree 2004 genomics track categorization task: classifying full text biomedical documents. Journal of Biomedical Discovery and Collaboration 1(1): Article 4 DOI 10.1186/1747-5333-1-4. Crestani F. 1997. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review ll(6):453-482 DOI 10.1023/A:1006569829653. Crockford D. 2006. The application/json media type for JavaScript Object Notation (JSON). Available at http://www.ietf.org/rfc/rfc4627.txt (accessed July 2013). d'Amato C. 2007. Similarity-based learning methods for the semantic web. PhD Thesis. Dietze H, Alexopoulou D, Alvers MR, Barrio-Alvers B, Doras A, Hakenberg I., Mönnich I, Plake C, Reischuk A, Royer L, Wächter T, Zschunke M, Schroeder M. 2008. GoPubMed: exploring pubmed with ontological background knowledge. In: Ashburner M, Leser U, Rebholz-Schuhmann D, eds. Ontologies and text mining for life sciences: current status and future perspectives. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik. Dowdy S, Weardon S, Chilko D. 2005. Statistics for research, 3rd edition. John Wiley & Sons, Inc. Etzioni O, Banko M, Cafarella MJ. 2006. Machine reading. In: Proceedings of the 2007 AAAI spring symposium on machine reading. AAAI Press. Foulds I, Smyth P. 2013. Modeling scientific impact with topical influence regression. In: Proceedings of the 2013 conference on empirical methods in natural language processing, Seattle, Washington, USA. Association for Computational Linguistics, 113-123. Friedman C, Shagina L, Lussier Y, Hripcsak G. 2004. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association ll(5):392-402 DOI 10.1197/jamia.M1552. Hájek P. 1998. Metamathematics of fuzzy logic. Dordrecht: Kluwer. Hopcroft J, Tarjan R. 1973. Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM 16(6):372-378 DOI 10.1145/362248.362272. Hristovski D, Friedman C, Rindflesch TC, Peterlin B. 2006. Exploiting semantic relations for literature-based discovery. In: AMIA annual symposium proceedings, vol. 2006. 
American Medical Informatics Association, 349-353. Huynh DF, Karger DR, Miller RC. 2007. Exhibit: lightweight structured data publishing. In: Proceedings of the 16th international conference on World Wide Web. 737-746. Lin I, Wilbur WJ. 2007. PubMed related articles: a probabilistic topic-based model for content similarity. BMCBioinformatics 8(1):423 DOI 10.1186/1471-2105-8-423. Liu Y, Bill R, Fiszman M, Rindflesch T, Pedersen T, Melton GB, Pakhomov SV. 2012. Using semrep to label semantic relations extracted from clinical text. In: AMIA annual symposium proceedings, vol. 2012. American Medical Informatics Association, 587. Lovász L. 1993. Random walks on graphs: a survey, Bolyai society mathematical studies, vol. 2. 1-46. Manning CD, Raghavan P, Schütze H. 2008. Introduction to information retrieval. Cambridge: Cambridge University Press. Morelli M, Arabia G, Salsone M, Novellino F, Giofrě L, Paletta R, Messina D, Nicoletti G, Condino F, Gallo O, Lanza P, Quattrone A. 2011. Accuracy of magnetic resonance parkinsonism index for differentiation of progressive supranuclear palsy from probable or possible parkinson disease. Movement Disorders 26(3):527-533 DOI 10.1002/mds.23529. Mulkar R, Hobbs JR, Hovy E, Chalupsky H, Lin C-Y. 2007. Learning by reading: two experiments. In: Proceedings of the IJCAI workshop on knowledge and reasoning for answering questions (KRAQ). Hyderabad, India. Nováček and Burns (2014), PeerJ, DOI10.7717/peerj.483 37/38 89 PeerJ Múller HM, Kenny EE, Sternberg PW. 2004. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2(ll):e309 DOI 10.13 71 /joumal.pbio.0020309. Nováček V, Groza T, Handschuh S, Decker S. 2010. CORAAL-dive into publications, bathe in the knowledge. Web Semantics: Science, Services and Agents on the World Wide Web 8(2-3):176-181 DOI 10.1016/j.websem.2010.03.008. Nováček V, Handschuh S, Decker S. 2011. Getting the meaning right: a complementary distributional layer for the web semantics. In: Proceedings ofISWC'11. Springer. Poon H, Doraingos P. 2010. Machine reading: a "killer app" for statistical relational AI. In: AAAI workshop on statistical relational artificial intelligence. AAAI Press. Pratt W. 1997. Dynamic organization of search results using the umls. In: Proceedings of the AMIA annual fall symposium. American Medical Informatics Association, 480. Pratt W, Wasserraan H. 2000. Querycat: automatic categorization of medline queries. In: Proceedings of the AMIA symposium. American Medical Informatics Association, 655. Raraakrishnan C, Mendes PN, da Gama RATS, Ferreira GCN, Sheth AP. 2008. Joint extraction of compound entities and relationships from biomedical literature. In: Web intelligence. IEEE, 398-401. Renear AH, Palmer CL. 2009. Strategic reading, ontologies, and the future of scientific publishing. Science 325(5942):828-832 DOI 10.1126/science.ll57784. Shannon CE. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379-423 DOI 10.1002/j.l538-7305.1948.tb01338.x. Singhal A. 2001. Modern information retrieval: a brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24:35-43. Strassel S, Adams D, Goldberg H, Herr J, Keesing R, Oblinger D, Simpson H, Schrag R, Wright J. 2010. The DARPA machine reading program-encouraging linguistic and reasoning research with a series of reading tasks. In: Proceedings of the international conference on language resources and evaluation. Valletta, Malta. 
Talley EM, Newman D, Mimno D, Herr BW, Wallach HM, Burns GAPC, Leenders AGM, McCallum A. 2011. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods 8(6):443-444 DOI 10.1038/nmeth.l619. Watts DJ, Strogatz SH. 1998. Collective dynamics of small-world' networks. Nature 393(6684):440-442 DOI 10.1038/30918. Wilkowski B, Fiszman M, Miller CM, Hristovski D, Arabandi S, Rosemblat G, Rindflesch TC. 2011. Graph-based methods for discovery browsing with semantic predications. In: AMIA annual symposium proceedings, vol. 2011. American Medical Informatics Association, 1514-1523. WuZ, Palmer M. 1994. Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics, ACL'94, Stroudsburg, PA, USA. Association for Computational Linguistics, 133-138. Yan Y, Okazaki N, Matsuo Y, Yang Z, Ishizuka M. 2009. Unsupervised relation extraction by mining wikipedia texts using information from the web. In: Proceedings ofACL/AFNLP'09. Association for Computational Linguistics, 1021-1029. Zadrozny S, Nowacka K. 2009. Fuzzy information retrieval model revisited. Fuzzy Sets and Systems 160(15):2173-2191 DOI 10.1016/j.fss.2009.02.012. Nováček and Burns (2014), PeerJ, DOI10.7717/peerj.483 ^^^^^^^^^^B 38/38 90 Part III ... to Knowledge Graph Embeddings 91 Chapter 8 Injecting Axioms into Knowledge Graphs 92 Regularizing Knowledge Graph Embeddings via Equivalence and Inversion Axioms Pasquale Minervini1^ ), Luca Costabello2, Emir Muňoz1'2, Vít Nováček1, and Pierre-Yves Vandenbussche2 1 Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland {pasquale.minervini,emir.munoz,vit.novacek}@insight-centre.org 2 Fujitsu Ireland Ltd., Galway, Ireland {luca.costabello,emir.munoz,pierre-yves.vandenbussche}@ie.fuj itsu.com Abstract. Learning embeddings of entities and relations using neural architectures is an effective method of performing statistical learning on large-scale relational data, such as knowledge graphs. In this paper, we consider the problem of regularizing the training of neural knowledge graph embeddings by leveraging external background knowledge. We propose a principled and scalable method for leveraging equivalence and inversion axioms during the learning process, by imposing a set of model-dependent soft constraints on the predicate embeddings. The method has several advantages: (i) the number of introduced constraints does not depend on the number of entities in the knowledge base; (ii) regularities in the embedding space effectively reflect available background knowledge; (Hi) it yields more accurate results in link prediction tasks over non-regularized methods; and (iv) it can be adapted to a variety of models, without affecting their scalability properties. We demonstrate the effectiveness of the proposed method on several large knowledge graphs. Our evaluation shows that it consistently improves the predictive accuracy of several neural knowledge graph embedding models (for instance, the MRR of TransE on WordNet increases by 11%) without compromising their scalability properties. 1 Introduction Knowledge graphs are graph-structured knowledge bases, where factual knowledge is represented in the form of relationships between entities: they are powerful instruments in search, analytics, recommendations, and data integration. This justified a broad line of research both from academia and industry, resulting in projects such as DBpedia (Auer et al. 2007), Freebase (Bollacker et al. 
2007), YAGO (Suchanek et al. 2007), NELL (Carlson et al. 2010), and Google's Knowledge Graph and Knowledge Vault projects (Dong et al. 2014). However, despite their size, knowledge graphs are often very far from being complete. For instance, 71% of the people described in Freebase have no known place of birth, 75% have no known nationality, and the coverage for less used relations can be even lower (Dong et al. 2014). Similarly, in DBpedia, 66% of the persons are also missing a place of birth, while 58% of the scientists are missing a fact stating what they are known for (Krompaß et al. 2015). In this work, we focus on the problem of predicting missing links in large knowledge graphs, so as to discover new facts about the world. In the literature, this problem is referred to as link prediction or knowledge base population: we refer to Nickel et al. (2016) for a recent survey on machine learning-driven solutions to this problem. Recently, neural knowledge graph embedding models (Nickel et al. 2016) - neural architectures for embedding entities and relations in continuous vector spaces - have received a growing interest: they achieve state-of-the-art link prediction results, while being able to scale to very large and highly-relational knowledge graphs. Furthermore, they can be used in a wide range of applications, including entity disambiguation and resolution (Bordes et al. 2014), taxonomy extraction (Nickel et al. 2016), and query answering on probabilistic databases (Krompaß et al. 2014). However, a limitation of such models is that they only rely on existing facts, without making use of any form of background knowledge. At the time of this writing, how to efficiently leverage preexisting knowledge for learning more accurate neural knowledge graph embeddings is still an open problem (Wang et al. 2015). Contribution — In this work, we propose a principled and scalable method for leveraging external background knowledge for regularising neural knowledge graph embeddings. In particular, we leverage background axioms in the form p ≡ q and p ≡ q⁻, where the former denotes that relations p and q are equivalent, such as in the case of relations partOf and componentOf, while the latter denotes that the relation p is the inverse of the relation q, such as in the case of relations partOf and hasPart. Such axioms are used for defining and imposing a set of model-dependent soft constraints on the relation embeddings during the learning process. Such constraints can be considered as regularizers, reflecting available prior knowledge on the distribution of embedding representations of relations. The proposed method has several advantages: (i) the number of introduced constraints is independent of the number of entities, allowing it to scale to large and Web-scale knowledge graphs with millions of entities; (ii) relationships between relation types in the embedding space effectively reflect available background schema knowledge; (iii) it yields more accurate results in link prediction tasks than state-of-the-art methods; and (iv) it is a general framework, applicable to a variety of embedding models.
We demonstrate the effectiveness of the proposed method in several link prediction tasks: we show that it consistently improves the predictive accuracy of the models it is applied to, without a negative impact on their scalability properties. 2 Preliminaries Knowledge Graphs — A knowledge graph is a graph-structured knowledge base, where factual information is stored in the form of relationships between entities. Formally, a knowledge graph $\mathcal{G} = \{(s,p,o)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ is a set of $(s,p,o)$ triples, each consisting of a subject $s$, a predicate $p$ and an object $o$, and encoding the statement "s has a relationship p with o". The subject and object $s, o \in \mathcal{E}$ are entities, $p \in \mathcal{R}$ is a relation type, and $\mathcal{E}, \mathcal{R}$ denote the sets of all entities and relation types, respectively. A link prediction model assigns a prediction score $\phi_{spo}$ to each triple $(s,p,o)$ as follows: $\phi_{spo} = \phi((s,p,o); \Theta)$, where the score $\phi_{spo}$ represents the confidence of the model that the statement encoded by the triple $(s,p,o)$ holds true, $\phi(\cdot\,; \Theta)$ denotes a triple scoring function, with $\phi : \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$, and $\Theta$ represents the parameters of the scoring function and thus of the link prediction model. Triples associated with a higher score by the link prediction model have a higher probability of encoding a true statement, and are thus considered for a completion of the knowledge graph $\mathcal{G}$. 3 Neural Knowledge Graph Embedding Models Recently, neural link prediction models received a growing interest (Nickel et al. 2016). They can be interpreted as simple multi-layer neural networks, where given a triple $(s,p,o)$, its score $\phi((s,p,o); \Theta)$ is given by a two-layer neural network architecture, composed of an encoding layer and a scoring layer. Encoding Layer — in the encoding layer, the subject and object entities $s$ and $o$ are mapped to distributed vector representations $\mathbf{e}_s$ and $\mathbf{e}_o$, referred to as embeddings, by an encoder $\psi : \mathcal{E} \to \mathbb{R}^k$ such that $\mathbf{e}_s = \psi(s)$ and $\mathbf{e}_o = \psi(o)$. Scoring Layer — in the scoring layer, the subject and object embeddings are combined by a predicate-dependent scoring function $\phi_p(\mathbf{e}_s, \mathbf{e}_o)$, so that $\phi((s,p,o); \Theta) = \phi_p(\mathbf{e}_s, \mathbf{e}_o)$ with $\mathbf{e}_s, \mathbf{e}_o = \psi(s), \psi(o)$, and the set of parameters $\Theta$ corresponds to $\Theta = \{\theta, \Psi\}$. Neural link prediction models generate distributed embedding representations for all entities in a knowledge graph, as well as a model for determining whether a triple is more likely than others, by means of a neural network architecture. For such a reason, they are also referred to as neural knowledge graph embedding models (Yang et al. 2015; Nickel et al. 2016). Several neural link prediction models have been proposed in the literature. For brevity, we overview a small subset of these, namely the Translating Embeddings model TransE (Bordes et al. 2013); the Bilinear-Diagonal model DistMult (Yang et al. 2015); and its extension in the complex domain, ComplEx (Trouillon et al. 2016). Unlike previous models, such models can scale to very large knowledge graphs, thanks to: (i) a space complexity that grows linearly with the number of entities $|\mathcal{E}|$ and relations $|\mathcal{R}|$; and (ii) efficient and scalable scoring functions and parameter learning procedures. In the following, we provide a brief and self-contained overview of such neural knowledge graph embedding models. TransE — The scoring layer in TransE is defined as follows: $\phi_p(\mathbf{e}_s, \mathbf{e}_o) = -\|\mathbf{e}_s + \mathbf{r}_p - \mathbf{e}_o\| \in \mathbb{R}$, where $\mathbf{e}_s, \mathbf{e}_o \in \mathbb{R}^k$ are the subject and object embeddings, $\mathbf{r}_p \in \mathbb{R}^k$ is a predicate-dependent translation vector, and $\|\cdot\|$ denotes the $L_1$ or $L_2$ norm. The score $\phi_p(\mathbf{e}_s, \mathbf{e}_o)$ is then given by the similarity between the translated subject embedding $\mathbf{e}_s + \mathbf{r}_p$ and the object embedding $\mathbf{e}_o$. DistMult — The scoring layer in DistMult is defined as follows: $\phi_p(\mathbf{e}_s, \mathbf{e}_o) = \langle \mathbf{r}_p, \mathbf{e}_s, \mathbf{e}_o \rangle \in \mathbb{R}$, where, given $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^k$, $\langle \mathbf{x}, \mathbf{y}, \mathbf{z} \rangle = \sum_{i=1}^{k} x_i y_i z_i$ denotes the standard component-wise multi-linear dot product, and $\mathbf{r}_p \in \mathbb{R}^k$ is a predicate-dependent vector.
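As an illustration of the scoring layers just described, the following NumPy sketch implements the TransE and DistMult scoring functions for a single triple; the toy embeddings and the function names are mine and do not come from any of the cited implementations.

```python
import numpy as np

def transe_score(e_s, r_p, e_o, norm=1):
    """TransE: negative L1 (or L2) distance between the translated subject
    embedding e_s + r_p and the object embedding e_o."""
    return -float(np.linalg.norm(e_s + r_p - e_o, ord=norm))

def distmult_score(e_s, r_p, e_o):
    """DistMult: component-wise multi-linear dot product <r_p, e_s, e_o>."""
    return float(np.sum(r_p * e_s * e_o))

# toy usage with k = 4 dimensional embeddings
rng = np.random.default_rng(42)
e_s, r_p, e_o = (rng.normal(size=4) for _ in range(3))
print(transe_score(e_s, r_p, e_o), distmult_score(e_s, r_p, e_o))
```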
ComplEx — The recently proposed ComplEx is related to DistMult, but uses complex-valued embeddings while retaining the mathematical definition of the dot product. The scoring layer in ComplEx is defined as follows: $$\phi_p(\mathbf{e}_s, \mathbf{e}_o) = \mathrm{Re}(\langle \mathbf{r}_p, \mathbf{e}_s, \overline{\mathbf{e}_o} \rangle) = \langle \mathrm{Re}(\mathbf{r}_p), \mathrm{Re}(\mathbf{e}_s), \mathrm{Re}(\mathbf{e}_o) \rangle + \langle \mathrm{Re}(\mathbf{r}_p), \mathrm{Im}(\mathbf{e}_s), \mathrm{Im}(\mathbf{e}_o) \rangle + \langle \mathrm{Im}(\mathbf{r}_p), \mathrm{Re}(\mathbf{e}_s), \mathrm{Im}(\mathbf{e}_o) \rangle - \langle \mathrm{Im}(\mathbf{r}_p), \mathrm{Im}(\mathbf{e}_s), \mathrm{Re}(\mathbf{e}_o) \rangle \in \mathbb{R},$$ where, given $\mathbf{x} \in \mathbb{C}^k$, $\overline{\mathbf{x}}$ denotes the complex conjugate of $\mathbf{x}$ (for $x \in \mathbb{C}$, the complex conjugate is $\overline{x} = \mathrm{Re}(x) - i\,\mathrm{Im}(x)$), while $\mathrm{Re}(\mathbf{x}) \in \mathbb{R}^k$ and $\mathrm{Im}(\mathbf{x}) \in \mathbb{R}^k$ denote the real part and the imaginary part of $\mathbf{x}$, respectively. 4 Training Neural Knowledge Graph Embedding Models In neural knowledge graph embedding models, the parameters $\Theta$ of the embedding and scoring layers are learnt from data. A widely popular strategy for learning the model parameters is described in Bordes et al. (2013); Yang et al. (2015); Nickel et al. (2016). In such works, the authors estimate the optimal parameters by minimizing the following pairwise margin-based ranking loss function $J$ defined on the parameters $\Theta$: $$J(\Theta) = \sum_{t^+ \in \mathcal{G}} \sum_{t^- \in \mathcal{C}(t^+)} [\gamma - \phi(t^+; \Theta) + \phi(t^-; \Theta)]_+ \quad (2)$$ where $[x]_+ = \max\{0, x\}$, and $\gamma > 0$ specifies the width of the margin. Positive examples $t^+$ are composed of all triples in $\mathcal{G}$, and negative examples $t^-$ are generated by using the following corruption process: $$\mathcal{C}((s,p,o)) \triangleq \{(\tilde{s},p,o) \mid \tilde{s} \in \mathcal{E}\} \cup \{(s,p,\tilde{o}) \mid \tilde{o} \in \mathcal{E}\},$$ which, given a triple, generates a set of corrupt triples by replacing its subject and object with all other entities in $\mathcal{G}$. This method of sampling negative examples is motivated by the Local Closed World Assumption (LCWA) (Dong et al. 2014). According to the LCWA, if a triple $(s,p,o)$ exists in the graph, other triples obtained by corrupting either the subject or the object of the triple that do not appear in the graph can be considered as negative examples. The optimal parameters can be learnt by solving the following minimization problem: $$\underset{\Theta}{\text{minimize}}\ J(\Theta) \quad \text{subject to} \quad \forall e \in \mathcal{E}: \|\mathbf{e}_e\| = 1, \quad (3)$$ which can be solved via projected stochastic gradient descent, as outlined in Algorithm 1. Algorithm 1. Learning the model parameters $\Theta$ via Projected SGD. Require: batch size $n$, epochs $\tau$, learning rates $\eta \in \mathbb{R}^\tau$. Ensure: optimal model parameters $\Theta_\tau$. 1: for $i = 1, \ldots, \tau$ do 2: $\mathbf{e}_e \leftarrow \mathbf{e}_e / \|\mathbf{e}_e\|,\ \forall e \in \mathcal{E}$ 3: {Sample a batch of positive and negative examples $\mathcal{B} = \{(t^+, t^-)\}$} 4: $\mathcal{B} \leftarrow \mathrm{SampleBatch}(\mathcal{G}, n)$ 5: {Compute the gradient of the loss function $J$ on the examples in $\mathcal{B}$} 6: $\mathbf{g}_i \leftarrow \nabla \sum_{(t^+, t^-) \in \mathcal{B}} [\gamma - \phi(t^+; \Theta_{i-1}) + \phi(t^-; \Theta_{i-1})]_+$ 7: {Update the model parameters via gradient descent} 8: $\Theta_i \leftarrow \Theta_{i-1} - \eta_i \mathbf{g}_i$ 9: end for 10: return $\Theta_\tau$. Given an equivalence axiom $p \equiv q$, the scoring function $\phi(\cdot\,; \Theta)$ and the parameters $\Theta$ should assign the same scores to the triples $(s,p,o)$ and $(s,q,o)$, for all entities $s, o \in \mathcal{E}$. Similarly, given an inversion axiom $p \equiv q^-$, the scoring function $\phi(\cdot\,; \Theta)$ and the parameters $\Theta$ should assign the same scores to the triples $(s,p,o)$ and $(o,q,s)$, for all entities $s, o \in \mathcal{E}$. Such requirements are imposed as soft constraints on the predicate embeddings via a penalty term added to the loss function, where $\lambda \geq 0$ is the weight associated with the soft constraints, and $D[\mathbf{x} \| \mathbf{y}]$ is a divergence measure between two vectors $\mathbf{x}$ and $\mathbf{y}$. In our experiments, we use the squared Euclidean distance as the divergence measure, i.e. $D[\mathbf{x} \| \mathbf{y}] = \|\mathbf{x} - \mathbf{y}\|_2^2$. In particular, the term $\mathcal{R}_S$ in Eq. (10) can be thought of as a schema-aware regularization term, which encodes our prior knowledge on the distribution of predicate embeddings. Note that the formulation in Eq. (10) allows us to freely interpolate between hard constraints ($\lambda = \infty$) and the original models represented by the loss function $J$ ($\lambda = 0$), permitting us to adaptively specify the relevance of each logical axiom in the embedding model.
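The following sketch complements the definitions above with the ComplEx scoring function, the pairwise margin-based ranking loss of Eq. (2), and a simple squared-distance penalty in the spirit of the equivalence and inversion soft constraints. It is only an illustration of the idea: the exact regularization term (Eq. (10)) is not reproduced in this reprint, so the penalty below encodes the model-specific relationships discussed in the text (r_p ≈ −r_q for TransE, r_p ≈ r_q for DistMult, r_p ≈ conj(r_q) for ComplEx) rather than the authors' exact formulation, and all names are illustrative.

```python
import numpy as np

def complex_score(e_s, r_p, e_o):
    """ComplEx: Re(<r_p, e_s, conj(e_o)>) with complex-valued embeddings."""
    return float(np.real(np.sum(r_p * e_s * np.conj(e_o))))

def margin_ranking_loss(pos_scores, neg_scores, gamma=1.0):
    """Pairwise hinge loss sum([gamma - phi(t+) + phi(t-)]_+) over example pairs."""
    return float(np.sum(np.maximum(0.0, gamma - pos_scores + neg_scores)))

def schema_regularizer(rel_emb, equiv_axioms, inv_axioms, model="complex"):
    """Soft equivalence/inversion constraints as a squared-distance penalty.

    For p = q the penalty pulls r_p towards r_q; for p = q^- it pulls r_p
    towards the model-specific image of r_q (-r_q for TransE, r_q for
    DistMult, conj(r_q) for ComplEx)."""
    def sq_dist(x, y):
        return float(np.sum(np.abs(x - y) ** 2))
    penalty = sum(sq_dist(rel_emb[p], rel_emb[q]) for p, q in equiv_axioms)
    for p, q in inv_axioms:
        if model == "transe":
            target = -rel_emb[q]
        elif model == "complex":
            target = np.conj(rel_emb[q])
        else:  # distmult scores are symmetric in subject and object
            target = rel_emb[q]
        penalty += sq_dist(rel_emb[p], target)
    return penalty
```

In a training loop, the penalty would simply be added to the ranking loss with weight lambda before computing gradients, which is what makes the constraints "soft" rather than hard.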
$$\mathrm{Re}(\langle \mathbf{r}_p, \mathbf{e}_s, \overline{\mathbf{e}_o} \rangle) = \mathrm{Re}\big(\overline{\langle \mathbf{r}_p, \mathbf{e}_s, \overline{\mathbf{e}_o} \rangle}\big) = \mathrm{Re}(\langle \overline{\mathbf{r}_p}, \overline{\mathbf{e}_s}, \mathbf{e}_o \rangle) = \mathrm{Re}(\langle \overline{\mathbf{r}_p}, \mathbf{e}_o, \overline{\mathbf{e}_s} \rangle),$$ i.e. the ComplEx score of $(s,p,o)$ under $\mathbf{r}_p$ coincides with the score of $(o,q,s)$ under $\mathbf{r}_q = \overline{\mathbf{r}_p}$. 6 Related Works How to effectively improve neural knowledge graph embeddings by making use of background knowledge is a largely unexplored field. Chang et al. (2014); Krompass et al. (2014); Krompaß et al. (2015) make use of type information about entities for only considering interactions between entities belonging to the domain and range of each predicate, assuming that type information about entities is complete. In Minervini et al. (2016), the authors assume that type information can be incomplete, and propose to adaptively decrease the score of each missing triple depending on the available type information. These works focus on type information about entities, while we propose a method for leveraging background knowledge about relation types, which can be used jointly with the aforementioned methods. Dong et al. (2014); Nickel et al. (2014); Wang et al. (2015) propose combining observable patterns in the form of rules and latent features for link prediction tasks. However, rules are not used during the parameter learning process, but rather after, in an ensemble fashion. Wang et al. (2015) suggest investigating how to incorporate logical schema knowledge during the parameter learning process as a future research direction. Rocktäschel et al. (2015) regularize relation and entity representations by grounding first-order logic rules. However, as they state in their paper, adding a very large number of ground constraints does not scale to domains with a large number of entities and predicates. In this work we focus on 2-way models rather than 3-way models (García-Durán et al. 2014), since the former have received increasing attention in recent years, mainly thanks to their scalability properties (Nickel et al. 2016). According to García-Durán et al. (2014), 3-way models such as RESCAL (Nickel et al. 2011; 2012) are more prone to overfitting, since they typically have a larger number of parameters. It is possible to extend the proposed model to RESCAL, whose score for an $(s,p,o)$ triple is $\mathbf{e}_s^\top \mathbf{W}_p \mathbf{e}_o$. For instance, it is easy to show that $\mathbf{e}_s^\top \mathbf{W}_p \mathbf{e}_o = \mathbf{e}_o^\top \mathbf{W}_p^\top \mathbf{e}_s$. However, extending the proposed method to more complex 3-way models, such as the latent factor model proposed by Jenatton et al. (2012) or the ER-MLP model (Dong et al. 2014), can be less trivial. 7 Evaluation We evaluate the proposed schema-based soft constraints on three datasets: WordNet, DBpedia and YAGO3. Each dataset is composed of a training, a validation and a test set of triples, as summarized in Table 1. All material needed for reproducing the experiments in this paper is available online.² WordNet (Miller 1995) is a lexical knowledge base for the English language, where entities correspond to word senses, and relationships define lexical relations between them: we use the version made available by Bordes et al. (2013). ² At https://github.com/pminervini/neural-schema-regularization.
Table 1. Statistics for the datasets used in experiments
Dataset    |E|       |R|   #training   #validation   #test
WordNet    40,943    18      141,442       5,000     5,000
DBpedia    32,510     7      289,825       5,000     5,000
YAGO3     123,182    37    1,079,040       5,000     5,000
YAGO3 (Mahdisoltani et al. 2015) is a large knowledge graph automatically extracted from several sources: our dataset is composed of facts stored in the YAGO3 Core Facts component of YAGO3. DBpedia (Auer et al.
2007) is a knowledge base created by extracting structured, multilingual knowledge from Wikipedia, and made available using Semantic Web and Linked Data standards. We consider a fragment extracted following the indications from Krompaß et al. (2014), by considering relations in the music domain.³ The axioms we used in the experiments are simple common-sense rules, and are listed in Table 1. Evaluation Metrics — For evaluation, for each test triple $(s,p,o)$, we measure the quality of the ranking of each test triple among all possible subject and object substitutions $(\tilde{s},p,o)$ and $(s,p,\tilde{o})$, with $\tilde{s}, \tilde{o} \in \mathcal{E}$.
[Figure 2: WordNet predicate embeddings learned using the TransE model, with k = 10 and regularization weight λ = 0 (left) and λ = 10⁶ (right); embeddings are represented as a heatmap, with values ranging from larger (red) to smaller (blue). Note that, assuming the axiom p ≡ q⁻ holds, using the proposed method leads to predicate embeddings such that r_p ≈ −r_q.]
Table 3. Average number of seconds required for training
            Plain     Regularized
WordNet     31.7 s    32.0 s
DBpedia     57.9 s    58.5 s
YAGO3      220.7 s   221.3 s
A similar phenomenon can be observed in Fig. 1 (right), where predicate embeddings have been trained using ComplEx: we can see that the model is naturally inclined to assign complex conjugate embeddings to inverse relations and, as a consequence, nearly-zero imaginary parts to the embeddings of symmetric predicates - since it is the only way of ensuring $\mathbf{r}_p \approx \overline{\mathbf{r}_p}$. However, we can enforce such relationships explicitly by means of model-specific regularizers, for increasing the predictive accuracy and generalization abilities of the models. We also benchmarked the computational overhead introduced by the novel regularizers by measuring the training time for unregularized (plain) models and for regularized ones - results are available in Table 3. We can see that the proposed method for leveraging background schema knowledge during the learning process adds a negligible overhead to the optimization algorithm - less than $10^{-1}$ s per epoch. 8 Conclusions and Future Works In this work we introduced a novel and scalable approach for leveraging background knowledge in neural knowledge graph embeddings. Specifically, we proposed a set of background knowledge-driven regularizers on the relation embeddings, which effectively enforce a set of desirable algebraic relationships among the distributed representations of relation types. We showed that the proposed method improves the generalization abilities of all considered models, yielding more accurate link prediction results without impacting the scalability properties of neural link prediction models. Future Works — A promising research direction consists in leveraging more sophisticated background knowledge - e.g. in the form of First-Order Logic rules - in neural knowledge graph embedding models. This can be made possible by extending the model in this paper to regularize over subgraph pattern embeddings (such as paths), so as to leverage relationships between such patterns, rather than only between predicates. Models for embedding subgraph patterns have been proposed in the literature - for instance, see (Niepert 2016; Guu et al. 2015).
For instance, it can be possible to enforce an equivalency between the path PARENtOfoparentOf and GRANdParentOf, effectively incorporating a First-Order rule in the model, by regularizing over their embeddings. Furthermore, a future challenge is also extending the proposed method to more complex models, such as ER-MLP (Dong et al. 2014), and investigating how to mine rules by extracting regularities from the latent representations of knowledge graphs. Acknowledgements. This work was supported by the TOMOE project funded by Fujitsu Laboratories Ltd., Japan and Insight Centre for Data Analytics at National University of Ireland Galway (supported by the Science Foundation Ireland grant 12/RC/2289). References Auer, S., Bizer, C, Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC -2007. LNCS, vol. 4825, pp. 722-735. Springer, Heidelberg (2007). https://doi.org/10.1007/ 978-3-540-76298-0.52 Baroni, M., Bernardi, R., Do, N-Q., Shan, C: Entailment above the word level in distributional semantics. In: EACL, pp. 23-32. The Association for Computer Linguistics (2012) 106 682 P. Minervini et al. Bollacker, K.D., Cook, R.P., Tufts, P.: Freebase: a shared database of structured general human knowledge. In: A A AI, pp. 1962-1963. AAAI Press (2007) Bordes, A., Usunier, N., Garcia-Durän, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787-2795 (2013) Bordes, A., Glorot, X., Weston, J., Bengio, Y.: A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation. Mach. Learn. 94(2), 233-259 (2014) Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hmschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: AAAI. AAAI Press (2010) Chang, K.-W., Yih, W., Yang, B., Meek, C: Typed tensor decomposition of knowledge bases for relation extraction. In: EMNLP, pp. 1568-1579. ACL (2014) Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: KDD, pp. 601-610. ACM (2014) Duchi, J.C., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121-2159 (2011) Garcia-Durän, A., Bordes, A., Usunier, N.: Effective blending of two and three-way interactions for modeling multi-relational data. In: ECML-PKDD, pp. 434-449 (2014) Guu, K., Miller, J., Liang, P.: Traversing knowledge graphs in vector space. In: EMNLP, pp. 318-327. The Association for Computational Linguistics (2015) Hayes, P., Patel-Scfmeider, P.: RDF 1.1 semantics. W3C recommendation, W3C, February 2014. http://www.w3.org/TR/2014/REC-rdfll-mt-20140225/ Jenatton, R., Roux, N.L., Bordes, A., Obozinski, G.: A latent factor model for highly multi-relational data. In: NIPS, pp. 3176-3184 (2012) Krompass, D., Nickel, M., Tresp, V.: Large-scale factorization of type-constrained multi-relational data. In: DSAA, pp. 18-24. IEEE (2014) Krompaß, D., Nickel, M., Tresp, V.: Querying factorized probabilistic triple databases. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8797, pp. 114-129. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11915-l_8 Krompaß, D., Baier, S., Tresp, V.: Type-constrained representation learning in knowledge graphs. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 640-655. Springer, Cham (2015). 
https://doi.org/10.1007/978-3-319-25007-6.37 Mahdisoltani, F., Biega, J., Suchanek, F.M.: YAG03: a knowledge base from multilingual wikipedias. In: CIDR (2015). www.cidrdb.org Meseguer, P., Rossi, F., Schiex, T.: Soft constraints. In: Handbook of Constraint Programming, of Foundations of Artificial Intelligence, vol. 2, pp. 281-328. Elsevier (2006) Miller, G.A.: WordNet: a lexical database for english. Commun. ACM 38(11), 39-41 (1995) Minervini, P., d'Amato, C, Fanizzi, N., Esposito, F.: Leveraging the schema in latent factor models for knowledge graph completion. In: SAC, pp. 327-332. ACM (2016) Nickel, M., Tresp, V., Kriegel, H.-P.: A three-way model for collective learning on multi-relational data. In: ICML, pp. 809-816 (2011) Nickel, M., Tresp, V., Kriegel, H.-P.: Factorizing YAGO: scalable machine learning for linked data. In: WWW, pp. 271-280. ACM (2012) Nickel, M., Jiang, X., Tresp, V.: Reducing the rank in relational factorization models by including observable patterns. In: NIPS, pp. 1179-1187 (2014) Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1), 11-33 (2016) 107 Regularizing Knowledge Graph Embeddings 683 Niepert, M.: Discriminative Gaifman models. In: NIPS, pp. 3405-3413 (2016) Rocktäschel, T., Singh, S., Riedel, S.: Injecting logical background knowledge into embeddings for relation extraction. In: HLT-NAACL, pp. 1119-1129. The Association for Computational Linguistics (2015) Schneider, M.: OWL 2 web ontology language RDF-based semantics, 2nd edn. W3C recommendation, W3C, December 2012. http://www.w3.org/TR/2012/REC-owl2-rdf- based- semantics- 20121211/ Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: WWW, pp. 697-706. ACM (2007) Trouillon, T., Welbl, J., Riedel, S., Gaussier, E., Bouchard, G.: Complex embeddings for simple link prediction. In: ICML, of JMLR Workshop and Conference Proceedings, vol. 48, pp. 2071-2080. JMLR. org (2016) Wang, Q., Wang, B., Guo, L.: Knowledge base completion using embeddings and rules. In: IJCAI, pp. 1859-1866. AAAI Press (2015) Yang, B., Yih, W-t., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: Proceedings of the International Conference on Learning Representations (ICLR) 2015, May 2015 108 Chapter 9 Prediction of Adverse Drug Reactions 109 Briefings in Bioinformatics, 2017,1-13 doi: 10.1093/bib/bbx099 Paper OXFORD Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models Emir Muňoz, Vít Nováček and Pierre-Yves Vandenbussche Corresponding author: Emir Muňoz, Fujitsu Ireland Ltd., Insight Building, IDA Business Park, Lower Dangan, Newcastle, Galway, Ireland. E-mail: emir.mu-noz@ie.fujitsu.com or emir.munoz@gmail.com Abstract Timely identification of adverse drug reactions (ADRs) is highly important in the domains of public health and pharmacology. Early discovery of potential ADRs can limit their effect on patient lives and also make drug development pipelines more robust and efficient. Reliable in silico prediction of ADRs can be helpful in this context, and thus, it has been intensely studied. Recent works achieved promising results using machine learning. The presented work focuses on machine learning methods that use drug profiles for making predictions and use features from multiple data sources. 
We argue that despite promising results, existing works have limitations, especially regarding flexibility in experimenting with different data sets and/or predictive models. We suggest to address these limitations by generalization of the key principles used by the state of the art. Namely, we explore effects of: (1) using knowledge graphs—machine-readable interlinked representations of biomedical knowledge—as a convenient uniform representation of heterogeneous data; and (2) casting ADR prediction as a multi-label ranking problem. We present a specific way of using knowledge graphs to generate different feature sets and demonstrate favourable performance of selected off-the-shelf multi-label learning models in comparison with existing works. Our experiments suggest better suitability of certain multi-label learning methods for applications where ranking is preferred. The presented approach can be easily extended to other feature sources or machine learning methods, making it flexible for experiments tuned toward specific requirements of end users. Our work also provides a clearly defined and reproducible baseline for any future related experiments. Keywords: adverse drug reactions (ADR); drug similarity; knowledge graphs; multi-label learning Introduction Adverse drug reactions (ADRs) can cause significant clinical problems and represent a major challenge for public health and the pharmaceutical industry. During a drug development process, pharmacology profiling leads to the identification of potential drug-induced biological system perturbations including primary effects (intended drug-target interactions) as well as secondary effects (off-target-drug interactions) mainly responsible for ADRs [1], Many ADRs are discovered during preclinical and clinical trials before a drug is released on the market. However, the use of a registered drug within a large population (demonstrating a wider range of clinical genotypes and phenotypes than considered in the clinical trials) can result in serious ADRs that have not been identified before. This has a large impact on patient safety and quality of life, and also has significant financial consequences for the pharmaceutical industry [2], Emir Muňoz is a PhD student at Insight Centre for Data Analytics, National University of Ireland Galway, and a Researcher at Fujitsu Ireland Ltd. His main interests lie within the areas of databases and machine learning. He is currently focused on representational learning and knowledge graphs mining. Vít Nováček holds a PhD from National University of Ireland, Galway. Vit has background in NLP, Semantic Web and knowledge representation, and his current research revolves around knowledge discovery from biomedical texts and data. He works as a project leader at the Insight Centre for Data Analytics in Galway. Pierre-Yves Vandenbussche holds a PhD in Information Technology from Paris VI University. Currently leading the Knowledge Engineering and Discovery research team in Fujitsu Ireland working with the Insight Centre, his research interest concerns methods to improve semantic data representation, knowledge extraction and knowledge graph mining. Submitted: 12 April 2017; Received (in revised form): 17 July 2017 © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com 110 2 | Munoz et al. 
The result of a recent review of epidemiological studies in Europe states that 3.5% of hospital admissions are because of ADRs and 10% of patients experience an ADR during their hospitalization [3], ADRs are a major cause of morbidity (and associated reduction of quality of life) and mortality [4, 2]. Recent estimates set the number of yearly drug-induced fatalities to 100 000 in the United States and almost 200 000 in Europe, making it the fourth cause of death before pulmonary diseases or diabetes [5, 3], In addition to the significance for the public health, ADRs are associated with an important economic burden imposed for public health systems and pharmaceutical industry. The extra costs are caused mainly by the withdrawal of dangerous drugs from the market, litigations and further hospitalizations to treat the adverse effects. The annual cost of ADRs in the United States is estimated at $136 billion [6]. Any improvements in the early identification of ADRs can decrease the high attrition rate in the drug discovery and development process. After the drug registration, better prediction of ADRs can alleviate associated clinical problems and decrease the adverse effect-induced extra costs. In silico approaches to predict ADRs of candidate drugs are now commonly used to complement costly and time-consuming in uitro methods [7]. Computational methods differ by the drug development/deployment stage they are applied at, and by the features used for the prediction of ADRs. Pharmacovigilance systems (monitoring the effects of drugs after they have been licensed for use) mine statistical evidence of ADRs from spontaneous reports by physicians, such as the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) [8-10]; from patient records [11]; or more recently from non-traditional sources, such as logs of search engine activity or social media [12, 13]. While these methods limit the risk of serious public health issues by identifying early occurrences of ADRs, they assume that such adverse effects are already demonstrated within a population. Computational prediction of ADRs during the development cycle of a drug (before the drug is licenced for use) can reduce the cost of drug development and provide a safer therapy for patients [14]. Most state-of-the-art techniques adopt a drug-centric approach and rely on the assumption that similar drugs share the same properties, mechanism of action and therefore also ADRs [15,16] (there are also methods that focus on the ADR information to overcome certain specific problems like drugs with little or no features at all, or ADRs with low number of related drugs [15, 9]. The methods are, however, less numerous and also harder to evaluate in a comparative manner). Predictions of new ADRs are then based on a drug-drug similarity network. In most of the early works, this network was based on the similarity of the substructures within the active ingredients of drugs [17-20]. More recent approaches combine data covering both chemical space of drugs and biological interaction-based features such as drug target, pathways, enzymes, transporters or protein-protein interactions [21-23]. Lately, integrative methods take into account also phenotypic observation-based features such as drug indications [24-27]. The availability of multi-source structured data has allowed for integration of complementary aspects of drugs and their links to side effects leading to higher accuracy [28]. 
The scope of this review is given by recent state-of-the-art methods (from 2011 on) that satisfy two key requirements. First, we consider methods that take advantage of multi-source structured data. Secondly, we focus on techniques that use machine learning to predict the likelihood of a side effect being caused by a given drug (drug-centric approach). Table 1 lists the reviewed approaches along with the features they use. Facilitating prediction of adverse drug reactions | 3 While many of the state-of-the-art approaches produce results that have great potential for being used in drug development pipelines, there are still things to improve. A limitation that is most relevant as a motivation for the presented review is the lack of flexibility that prevents users who are not proficient in machine learning from easily using the predictive models. This makes it difficult for people like biologists, pharmacologists or clinicians to experiment with data and models fine-tuned towards their specific requirements on the ADR prediction (such as increasing the sensitivity or specificity of the model). The main issue of existing models is that they typically work with data sets that have been manually preprocessed, and the particular prediction methods are adapted to the experimental data in a focused manner. We review the key points and limitations of existing approaches and introduce their generalization based on: (1) tapping into many diverse interlinked knowledge bases (i.e. knowledge graphs) related to drugs and adverse effects that substantially limits the manual effort required for data integration and feature engineering. (2) Rigorous formulation of the ADR prediction problem as a multi-label learning-to-rank problem that allows for easy experimentation with many off-the-shelf machine learning models. We show that specific applications of these two principles can lead to performance comparable with existing methods. Moreover, the proposed approach produces ranked predictions by default, with many relevant predictions present at the top of the ranked result lists. This is potentially useful in scenarios where experts (e.g. pharmaceutical researchers or public health officials) have limited time and/or resources, and thus they can only process a few prediction candidates out of many possibilities (there can often be hundreds of predictions for a single drug). The main contributions of this work are as follows. We propose a specific way of using knowledge graphs to generate different feature sets for ADR prediction and demonstrate the favourable performance of selected off-the-shelf multi-label learning models in comparison with existing works. In addition to that, we show how the approach can be easily extended to other feature sources or machine learning methods. This makes the method flexible for experiments tuned towards specific requirements of end users. Our results and data also provide a clearly defined and reproducible baseline for any future related experiments. Materials Various publicly available data sources can be used to define similarity between drugs [14]. Each data source describes a specific aspect of the pharmacological space of a drug such as its chemical, biological or phenotypic properties. For instance, SIDER database [30] presents information of side effects and indication for marketed drugs. PubChem Compound data [31] contain chemical structure description of drugs. 
DrugBank [32] provides detailed information about drugs, such as their binding proteins and targets, enzymes or transporters, thus informing on drugs' mechanisms of action and metabolism. The KEGG Genes, Drug, Compound and Disease databases [33] describe further information about the molecular interactions of drugs and their signalling pathways.

In the following, we review the materials (the results of data integration over multiple data sources) provided by the authors of the state-of-the-art methods. Because previous data integration activities were expensive and mostly carried out manually, we propose here a different data source and representation, which can be considered a superset of all the previously used data sets. This data source is represented using a graph database, a model in which it is simpler to integrate different data sources such as the ones already mentioned. We also provide an algorithm to generate the required drug profiles, similar to the ones provided by the reviewed methods (Supplementary Section D).

Table 2. The data set characteristics

  Data set            Number of drugs    Number of side effects
  Liu's data set      832                1385
  Bio2RDF data set    1824               5880
  SIDER 4 data set    1080               5579
  Aeolus data set     750                181

For comparisons, we use Liu's data set [24] and the data set of Zhang et al. [25], termed 'SIDER 4', as benchmarks. As presented in Table 1, Liu's data set contains six types of features covering the chemical, biological and phenotypic spaces of drugs, combined with information on their associated ADRs (cf. Table 2). We use this data set as the primary means to compare the reviewed methods. The SIDER 4 data set introduced by Zhang et al. [25] is an update of Liu's data set integrating the fourth version of SIDER. This data set is interesting, as it introduces newly approved drugs for which fewer post-market ADRs have been detected. We use the SIDER 4 data set as a secondary means to compare the methods.

A new, alternative source of multi-source graph data has recently become available via the Bio2RDF project [34]. Bio2RDF publishes the pharmacological databases used in many ADR prediction experiments in the form of a knowledge graph—a standardized, interlinked knowledge representation based on labelled relationships between entities of interest. Bio2RDF data were first used for the prediction of ADRs by Muñoz et al. [27], where drug similarities were computed by measuring the shared connections between drugs in the graph. Here, we build on top of that and evaluate the use of the Bio2RDF knowledge graph as a means to facilitate the generation of more expressive features for computing similarity between drugs. Such automatically generated data can be used to replace or enrich existing manually integrated feature sets, and to evaluate prediction methods in standard machine learning pipelines.

Finally, to get another perspective for interpreting the evaluation results, we use the FDA FAERS [8, 10]. FAERS publishes recent ADR reports coming from population-wide post-marketing drug effect surveillance activities. Extracting the most recent ADRs for newly marketed drugs helps us to evaluate the ability of various methods to predict ADRs of drugs after their release on the market. We extract this information from the Aeolus data set [35], which is a curated, annotated and machine-readable version of the FAERS database. We use Aeolus to generate an updated version of the SIDER 4 data set that also includes the latest ADRs as observed in the population.
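As a rough illustration of how such post-marketing evidence can be pulled into an experimental pipeline, the following sketch extracts drug-ADR pairs from an Aeolus-style CSV export, maps drugs to DrugBank identifiers via a user-supplied lookup, and ranks the pairs by a significance statistic. The file name, column names and the use of the proportional reporting ratio (PRR) as the ranking statistic are assumptions for illustration, not the actual Aeolus schema.

```python
import csv

# Hypothetical mapping from RxNorm concept IDs (as used in Aeolus) to DrugBank IDs.
rxnorm_to_drugbank = {"12345": "DB00531"}  # toy entry

pairs = []
with open("aeolus.csv", newline="") as f:          # hypothetical export file
    for row in csv.DictReader(f):
        drug = rxnorm_to_drugbank.get(row["drug_concept_id"])   # hypothetical column
        if drug is not None:
            # Keep (DrugBank ID, adverse effect, significance statistic).
            pairs.append((drug, row["outcome_name"], float(row["prr"])))

# Rank drug-ADR pairs by the significance statistic, most significant first.
pairs.sort(key=lambda t: t[2], reverse=True)
print(pairs[:10])
```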
For details on the generation of Liu's data set [24] and the SIDER 4 data set [25], we refer the readers to the original articles. We will now detail the construction of the 'Bio2RDF data set' and the 'Aeolus data set'.

Bio2RDF data set

The Bio2RDF project (http://bio2rdf.org/) aims at simplifying the use of publicly available biomedical databases by representing them in the form of an interconnected multigraph [34, 36]. The project provides a set of scripts (https://github.com/bio2rdf) to convert the typically relational or tabular databases (e.g. DrugBank, SIDER) into the more flexible triple-based RDF (Resource Description Framework) format. The project also makes available the output of these conversions; in its version 4, published in December 2015, Bio2RDF represented 30 databases, including PubChem, DrugBank, SIDER and KEGG, among other databases with valuable drug-related information. Each entity, such as a drug, protein or side effect, is represented as a node with a unique identifier, and each relation between entities as an edge with a qualified label, generating a network of great value for bioinformatics [37].

Here, we use release 4 of Bio2RDF and represent its data as a knowledge graph G = {(s, p, o)} ⊆ E × R × E, i.e. a set of (s, p, o) triples, each consisting of a subject s, a predicate p and an object o, and encoding the statement 's has a relationship p with o'. The subject and object s, o ∈ E are entities, p ∈ R is a relation type, and E, R denote the sets of all entities and relation types in the knowledge graph, respectively.

[Figure 1. A Bio2RDF fragment around the cyclophosphamide drug, showing the connections between three main databases: DrugBank, SIDER and KEGG.]

Figure 1 shows a fragment of the Bio2RDF knowledge graph that integrates three databases, namely DrugBank, SIDER and KEGG. Usually, connections between databases are made using identifiers such as the PubChem compound or Chemical Abstracts Service (CAS) number. Notice that a knowledge graph can also be built from the original databases using different methods or scripts; here, we select Bio2RDF because it already contains mappings for most of the relevant databases.
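To give a concrete (if simplified) picture of working with such a graph, here is a minimal sketch, not the authors' actual tooling, that merges hypothetical Bio2RDF N-Triples dumps of DrugBank, SIDER and KEGG into a single graph of (s, p, o) triples using the rdflib Python library. The dump file names are assumptions.

```python
from rdflib import Graph, URIRef

g = Graph()
for dump in ["drugbank.nt", "sider.nt", "kegg.nt"]:   # hypothetical Bio2RDF dumps
    g.parse(dump, format="nt")                        # each dump is a set of (s, p, o) triples

print(len(g))  # total number of triples in the merged knowledge graph

# Iterate over the outgoing edges of one drug, e.g. cyclophosphamide (drugbank:DB00531).
drug = URIRef("http://bio2rdf.org/drugbank:DB00531")
for _, p, o in g.triples((drug, None, None)):
    print(p, o)
```

Because all sources share the same triple-based representation, merging them is just a union of triple sets; the cross-database links discussed below then become ordinary edges in the merged graph.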
A knowledge graph G can contain biomedical facts [note that the URIs in the examples are used as unique identifiers; the availability of the corresponding records through an HTTP request (such as in a browser) depends on a third-party service (Bio2RDF.org)] such as:

(http://bio2rdf.org/drugbank:DB00531, label, "Cyclophosphamide") or
(http://bio2rdf.org/drugbank:DB00531, enzyme, http://bio2rdf.org/kegg:1.14.13.-).

This format allows for easy representation of equivalences or external links between data sources as an additional relation/edge. For instance, a relationship between a DrugBank drug and a PubChem compound can be expressed as:

(http://bio2rdf.org/drugbank:DB00531, x-pubchemcompound, http://bio2rdf.org/pubchem.compound:2907).

By simply merging multiple data sources from Bio2RDF, we are able to build an integrated knowledge graph with the links between databases materialized. During the integration, the PubChem compound of a drug is used to link DrugBank and SIDER, while the CAS number is used to link DrugBank and KEGG. This flexibility in generating training and testing data is currently impossible with the manual integration pipelines used by the reviewed methods. In our experiments, we shall use a knowledge graph integrating the DrugBank, SIDER and KEGG databases (cf. Table 3).

Table 3. Number of (s, p, o) triples in the Bio2RDF data set used in our experiments

  Data set   Type of information               Number of triples
  DrugBank   Drug types, chemical information  5 151 999
  SIDER      Side effects of drugs             5 578 286
  KEGG       Drugs, genes and pathway maps     4 387 541

Aeolus data set

Aeolus [35] is a curated and standardized adverse drug events resource meant to facilitate research in drug safety. The data in Aeolus come from the publicly available US FDA FAERS, but are extensively processed to allow for easy use in experiments. In particular, the cases (i.e. ADR events) in the FAERS reports are deduplicated, and the drug and outcome (i.e. effect) concepts are mapped to standard vocabulary identifiers (RxNorm and SNOMED-CT, respectively). A similar approach for extracting ADR terms from FDA-approved drug labels was applied in [38] to group similar drugs by topics; however, Aeolus is preferred here because of its curated status. The Aeolus data set is presented in a convenient comma-separated values (CSV) format, from which we can easily extract pairs of drugs and their adverse effects ranked by the statistical significance of their occurrences within the FAERS reports. We map the identifiers for drugs and adverse effects in Aeolus to the ones in DrugBank, which are used in our experiments. This means that we are able to use the FDA FAERS data as an additional manually curated resource for validating any adverse effect prediction method, as detailed later on in the description of our experiments.

Methods

In this section, we present details of the reviewed approaches for ADR prediction on the basis of a multi-label learning setting.

Multi-label learning framework

As a drug can generally have multiple adverse reactions, ADR prediction can naturally be formulated as a multi-label learning problem [39]. Multi-label learning addresses a special variant of the classification problem in machine learning, where multiple labels (i.e. ADRs) are assigned to each example (i.e. drug).
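To make this setting concrete before discussing solution strategies, the following toy sketch (with invented drug identifiers and ADR labels, not actual data set content) shows the usual multi-label target representation: a binary indicator matrix with one row per drug and one column per ADR, built from drug-ADR association pairs.

```python
import numpy as np

# Toy drug-ADR associations of the kind found in data sets like Liu's or SIDER 4.
associations = [
    ("DB00531", "nausea"),
    ("DB00531", "alopecia"),
    ("DB00945", "nausea"),
]

drugs = sorted({d for d, _ in associations})
adrs = sorted({a for _, a in associations})
drug_idx = {d: i for i, d in enumerate(drugs)}
adr_idx = {a: j for j, a in enumerate(adrs)}

# Binary indicator matrix Y: Y[i, j] = 1 iff drug i is associated with ADR j.
Y = np.zeros((len(drugs), len(adrs)), dtype=int)
for d, a in associations:
    Y[drug_idx[d], adr_idx[a]] = 1

print(Y)  # rows: drugs, columns: ADRs; here [[1 1], [0 1]]
```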
The problem can be solved either by transforming the multi-label problem into a set of binary classification problems, or by adapting existing machine learning techniques to the full multi-label problem (see https://en.wikipedia.org/wiki/Multi-label_classification for more details and a list of examples). Most of the current ADR prediction methods, however, do not fully exploit the convenient multi-label formulation, as they simply convert the main problem into a set of binary classification problems [40]. This is problematic for two main reasons. First, transforming the multi-label problem into a set of binary classification problems is typically computationally expensive for large numbers of labels (which is the case when predicting thousands of ADRs). Secondly, using binary classifiers does not accurately model the inherently multi-label nature of the main problem. We validate these two points empirically in the 'Results and discussion' section. Here, we follow the philosophy of algorithm adaptation: fit algorithms to data [40]. Yet, there are exceptions, such as the work in [25], presenting the multi-label learning method FS-MLKNN that integrates feature selection and k-nearest neighbours (kNN). Unlike most previous works, Zhang et al. [25] propose a method that does not generate binary classifiers per label but uses ensemble learning instead. (We shall provide more details of this and other methods in the 'Learning models' section.) Also, Muñoz et al. [27] proposed a multi-label solution to the prediction problem using a constrained propagation of ADRs between neighbouring drugs, making clear the benefits of a graph structure of the data (cf. Supplementary Section F).

In the following, we formalize the learning framework with Q labels as in [41, 42]. Let X = {x1, x2, ..., xN} be the instance space of N different data points (i.e. drugs) in R^d, and let Y = {y1, y2, ..., yQ} be the finite set of labels (i.e. ADRs). Given a training set D = {(xi, Yi) | 1 ≤ i ≤ N}, where xi ∈ X is a d-dimensional feature vector and Yi ⊆ Y is the set of labels associated with xi, the goal of the learning system is to output a multi-label classifier h : X → 2^Y which optimizes some specific evaluation metric. In most cases, however, the learning system will not output a multi-label classifier but will instead produce a real-valued function (a.k.a. regressor) of the form f : X × Y → R, where f(x, y) can be regarded as the confidence of y ∈ Y being a proper label of x. A well-performing model is then expected to assign larger values to the actual labels of x than to the others, i.e. f(x, y1) > f(x, y2) for any y1 ∈ Y and y2 ∉ Y. In other words, the model should consistently be more 'confident' about true positives (actual ADRs) than about false positives. Intuitively, the regressor f(·, ·) can be transformed into a ranking function rank_f(·, ·), which maps the outputs of f(x, y) for any y ∈ Y onto {1, 2, ..., Q} such that if f(x, y1) > f(x, y2), then rank_f(x, y1) < rank_f(x, y2). The ranking function can naturally be used, for instance, for selecting the top-k predictions for any given drug, which is useful in cases where only a limited number of prediction candidates can be further analysed by the experts. Our learning problem can now be formally stated as follows: given a drug x and a finite-size vector Y with its initially known adverse reactions (i.e. labels), seek a discriminant function f(x, Y) = Ŷ, where Ŷ is a finite-size vector representation of the labelling function, Ŷ = [f(x, y1), ..., f(x, yQ)]^T for yj