Centre of Natural Language Processing
Faculty of Informatics
Masaryk University Brno

Computer Processing of Czech Syntax and Semantics

Aleš Horák

Brno, 2008

Aleš Horák
Faculty of Informatics, Masaryk University
Centre of Natural Language Processing (NLP Centre)
Botanická 68a
CZ-602 00 Brno, Czech Republic
E-mail: hales@fi.muni.cz

Reviewed by Karel Pala, Masaryk University, Czech Republic

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from NLP Centre, Faculty of Informatics, Masaryk University. Violations are liable for prosecution under the Czech Copyright Law.

Copyright © NLP Centre, Faculty of Informatics, Masaryk University, 2008
ISBN: 978-80-7399-375-7

Preface

This book presents the results of research obtained in a number of natural language processing projects led by Aleš Horák at the Centre of Natural Language Processing (also known as the NLP Centre or NLP laboratory), Faculty of Informatics, Masaryk University in Brno. The presented results are therefore based on the teamwork of researchers and students who participated directly in the projects.

The text surveys advanced research methods at two complex levels of natural language (NL) processing: syntax and semantics. We do not try to cover all approaches in these areas. Instead, we focus on rule-based introspective methods that encapsulate empirical paradigms in the form of Figures of Merit (FOMs) of particular syntactic and semantic phenomena.
As a basis for many NLP research projects, we have developed a number of advanced natural language processing tools and language resources. In the second chapter, we give a detailed description of three years of development of VerbaLex, a large lexicon of Czech verb valencies in the form of complex valency frames. This part is followed by a presentation of software developed for working with this and other language resources: VisDic, DEBVisDic, DEBDict, PRALED and others. These tools are used by project teams all over the world.

The next chapter presents the latest development of the syntactic analyser synt, which has been under development in the NLP Centre for several years. Besides a comprehensive description of the synt internals and the formats used, we also provide a comparison with several other natural language parsers, in which we show that the qualities of synt are at least comparable to those of the best current parsers.

The fourth chapter outlines the advances made in the Normal Translation Algorithm (NTA) from [Hor02]. It describes the methods and techniques aimed at the automatic translation of an NL sentence into its meaning expressed as a construction in Transparent Intensional Logic. The description is not yet complete, but we have concentrated on selected phenomena for which we offer sample solutions or even prototype implementations.

The last chapter gives details of a project that concentrates on intelligent methods for increasing the reliability of electrical networks. One part of the project involves the development of a human-machine communication framework for dialogues about the specific knowledge domain of electrical power systems (EPS). Another part develops a multi-agent system for representing the EPS processes. These components allow different configurations of an EPS setup to be simulated, with automatic computation of the economic aspects of system failures.
Acknowledgements

For more than 10 years, all research in the NLP Centre has been driven and inspired by its head, Karel Pala. For this reason above all others, he is the first person I would like to thank for his invaluable help and support.

In addition, my thanks for cooperation on this work go to the following NLP Centre members (in alphabetical order): Pavel Cenek, Andrej Gardoň, Jiří Golembiovský, Dana Hlaváčková, Vladimír Kadlec, Vojtěch Kovář, Martin Kudlej, Petr Pelán, Martin Povolný, Miroslav Prýmek, Adam Rambousek, Pavel Rychlý, Lukáš Svoboda, Marek Veber and Radek Vykydal, as well as to co-authors of project publications from other research teams: Marie Duží, Tomáš Holan, Pavel Materna, Pavel Smrž and Piek Vossen.

The financial support of this work was provided by the Faculty of Informatics, Masaryk University, Brno and by the following grant projects:
• the EU project Balkanet IST-2000-29388,
• the Czech Science Foundation projects 201/05/2781 and 405/03/0913,
• the projects 1ET100300414, 1ET100300419 and 1ET200610406 of the Grant Agency of the Academy of Sciences of CR,
• the Ministry of Education of the Czech Republic within the Research Intent CEZ:J07/98:143300003,
• the Ministry of Education of the Czech Republic within the Center of basic research LC536, and
• the Ministry of Education of the Czech Republic in the National Research Programme II project 2C06009.

Contents

Preface
1 Introduction
2 New Language Resources and Tools
  2.1 VerbaLex – New Comprehensive Lexicon of Verb Valencies for Czech
    2.1.1 Linguistic Requirements for the VerbaLex Format
    2.1.2 Semantic Roles
    2.1.3 The Implementation of Editing and Exporting Tools
    2.1.4 Application of VerbaLex in Syntactic Analysis
  2.2 VisDic – Off-line WordNet Editor
    2.2.1 Basic Functionality
    2.2.2 Advanced Functionality
    2.2.3 XML Configuration
  2.3 DEBVisDic and other DEB Platform Applications
    2.3.1 The Features of the Platform for Lexicographers’ Tools
    2.3.2 Assets of the DEB Platform
    2.3.3 The DEB Administration Interface
    2.3.4 How To Make a Sample Dictionary
    2.3.5 Usage Variability – The Users’ Interfaces
  2.4 Future Work on the Language Resources and NLP Tools
3 synt – Czech Syntax Analyzer
  3.1 The Grammar Development Process
    3.1.1 Grammar Development Workbench
  3.2 New Meta-grammar Constructs in synt
    3.2.1 The Meta-grammar Design
    3.2.2 The Parsing Algorithm
    3.2.3 Evaluation of Contextual Constraints
    3.2.4 The synt Parser Implementation
  3.3 Best Analysis Selection – a Supervised Construction of Pruning Constraints
  3.4 Parsing with Verb Frame Information
    3.4.1 Automatic Extraction of Verb Frames from the Packed Shared Forest
    3.4.2 Examples
  3.5 The Beautified Chart Method – Pruning Technique Based on Linguistic Adequacy
    3.5.1 Beautified Trees
    3.5.2 The Previous Estimate of the Effect of the Beautified Chart Method
    3.5.3 The Beautified Chart Algorithm
    3.5.4 The Beautified Chart Results
  3.6 Parser Comparison Experiments
    3.6.1 synt and Moore’s parser
    3.6.2 Phrasal synt Compared with Dependency Parsers
  3.7 Further Development of synt
4 Transparent Intensional Logic as a Way to Semantics
  4.1 Overview of the Transparent Intensional Logic
    4.1.1 TIL Types
  4.2 The Logical Analysis of a Sentence
    4.2.1 Verb Frame Analysis
    4.2.2 The Sentence Analysis
  4.3 Sentence Logical Analysis Using Complex Valency Frames
    4.3.1 Examples of Logical Analysis
  4.4 TIL Knowledge Base Representation
    4.4.1 Knowledge Base Implementation – the Dolphin System
  4.5 Experiment of Using TIL in a Simulation System Easel
    4.5.1 Using TIL and Easel in Applications
  4.6 Long Way to Full Natural Language Semantics
5 Application in Dialogues – the Electrical Power Systems Simulation
  5.1 Multi-Agent Framework for EPS Simulation and Monitoring
    5.1.1 The Rice System Architecture
  5.2 Human-machine Dialogues with the Rice System
    5.2.1 Building a Specialized Corpus
    5.2.2 Morphological Tagging
    5.2.3 Syntactic Analysis of the Domain Texts
    5.2.4 Designing an Intelligent Dialogue Interface
  5.3 Rice Usage Scenarios
    5.3.1 Example Nodes Implementation
  5.4 Modelling of Economic Aspects of a Power System Failure
    5.4.1 Network Parts
    5.4.2 The Agents Implementation
    5.4.3 The Simulation Run
  5.5 Future Work on the Rice System
6 Conclusions and Future Directions
Bibliography
Annotation

Chapter 1

Introduction

The Natural Language Processing Laboratory was established at the Faculty of Informatics, Masaryk University in 1997 within the project VS97028 of the Ministry of Education of the Czech Republic. The founding members of the laboratory were Karel Pala, Aleš Horák, Pavel Rychlý and Pavel Smrž. In 2005 the NLP laboratory was renamed the Centre of Natural Language Processing. Since its beginnings, the NLP laboratory (now the NLP Centre) has remained true to its essential raison d’être: it is a place where about twenty researchers and dozens of undergraduate and postgraduate students work on research tasks from the exciting domain of the computational processing of written and spoken natural language.

In this text, we present the details of four projects in which the NLP Centre has been engaged over the last five years. What all these projects have in common is that the author of this text led the project work.

As the first project, we introduce new language resources and new language processing tools developed in the NLP Centre. The most important of the resources described is a new lexicon of complex valency frames of Czech, named VerbaLex. The lexicon includes all the usual verb valency features plus additional relevant information such as verb aspect, verb synonymy, types of use and semantic verb classes based on the VerbNet project. As far as computer processing is concerned, an important property of VerbaLex is its close relationship with the widely exploited English and Czech WordNet semantic networks.

We also present new tools based on a client/server XML database system called DEB II.
Thanks to the versatility of the XML format used, this system enables us to cover various applications, namely the management of machine-readable dictionaries, WordNet-like lexical databases and ontologies for Semantic Web applications. Considerable attention is paid to the inner workings of the DEB II framework as well as to particular DEB client tools, especially the WordNet development tools VisDic and DEBVisDic. These are well-designed systems for lexical database editing that are currently employed in many national WordNet building projects. We discuss the basic features of the tools as well as the more elaborate functions that facilitate linguistic work in multilingual environments, and we argue for the benefits that the new DEB II platform brings to WordNet editing and to XML databases in general.

In the following text, the main features and assets of the DEB II dictionary writing platform are outlined, and the implementation strategies of both the server and client parts of the platform are characterized. We also pay attention to the process of merging lexical data, particularly the Czech WordNet with VerbaLex, the lexicon of Czech valency frames, which was developed separately in XML format. We show an example of the merge, which also indicates how DEBVisDic can serve as a means for this kind of integration. We also point out that this type of merge can be extended to other languages, as was done with Bulgarian and Romanian in the Balkanet project.
The reader will also find here an overview of the current state of other DEB II applications, which include:
• PRALED – a client for building the Czech Lexical Database,
• DEBDict – a browser for parallel viewing of several electronic dictionaries,
• Cornetto – an augmented WordNet-like lexical database system,
• DEB CPA browser and editor – a client for building a database of verb patterns as they derive from corpora (called corpus patterns),
• DEB TEDI tool – an application for building specialized terminological dictionaries.
For each of these DEB clients we describe and demonstrate its main features and functionality.

The third chapter presents a survey of the latest development of the Czech sentence parsing system synt. The system uses a meta-grammar formalism, which allows us to define the grammar with a manageable number of meta-rules. At the same time, these meta-rules can be automatically translated into rules for efficient and fast head-driven chart parsing, supplemented with an evaluation of additional contextual constraints. In general, the system represents a rule-based approach to syntactic analysis. Within the parsing process, however, synt uses empirical parameters of the input lexical items in the form of so-called Figures of Merit (FOMs). In this respect, the system may be viewed as a combination of the rule-based approach and the stochastic approaches used in most of the Czech language parsing systems developed by members of the research group of Jan Hajič [HH07].

The text includes a comprehensive description of the meta-grammar constructs in synt as well as actual running times of the system tested on corpus data. The lexicographer’s environment, the so-called Grammar Development Workbench (GDW), is integrated with synt into one system that allows a team of experts (computational linguists as well as programmers) to cooperate on the development of a grammar covering all frequent Czech language phenomena.
Besides the description of the synt system, we illustrate the process of meta-grammar development. One of its first phases consists of constructing corpus data for testing. We demonstrate the use of the corpus in testing a method for Best Analysis Selection, together with the results of testing the synt analysis on a Czech corpus. The section on Best Analysis Selection discusses methods that enhance the algorithm for determining “the best” parse tree in the output of natural language syntactic analysis. It presents a method for pruning redundant parse trees based on information obtained from a dependency treebank corpus.

The VerbaLex valency lexicon from the second chapter is exploited in the syntactic parsing process. We describe the verb frame extraction algorithm and the measured results of running its implementation on a newspaper corpus. Verb frame information is presented as one of the language-specific features used in the tree ranking algorithm, which is a crucial part of the synt mode of analysis.

The effectiveness of the enhanced parser is demonstrated by the results of two inter-system parser comparison experiments. The first tests were run on the standard evaluation grammars ATIS, CT and PT, where the synt system outperforms the reference implementations. The second experiment compared the effectiveness of real-text parsers of Czech based on completely different approaches: stochastic parsers that provide dependency trees as their output, and the meta-grammar parser that generates a chart structure representing a packed forest of phrasal derivation trees. We formulate the main questions and problems accompanying such experiments, try to offer answers to them, and finally present the factual results of the tests as measured on 10 thousand Czech sentences.
In the fourth chapter, we describe the extended type hierarchy of the Transparent Intensional Logic (TIL [Tic04]) as a higher-order logic theory. We also present the basic ideas of TIL constructions as a suitable representation of natural language knowledge. TIL is a logical system designed for representing the meaning of natural language expressions. The system is built on a typed λ-calculus with a hierarchy of types. It was created as a parallel to Montague’s logic [Mon74]; however, TIL is more capable of describing natural language semantics while retaining the simplicity of the basic idea. Moreover, the inference rules for TIL are well defined, which enables us to use constructions as an instrument for representing sentence meaning in knowledge base systems. The connection between a construction and the constructed object is fact-independent and is driven by the mechanism of the typed λ-calculus. Constructions carry information about relations between the elementary parts of language expression objects.

TIL was introduced by Pavel Tichý [Tic88] with the purpose of overcoming paradoxes arising in other modern logical systems (first-order predicate logic as well as intensional logics [Mon74]). A short summary of the advantages of TIL over Montague’s dynamic logic can be found in [Hor02]. TIL is well suited to handling difficult language phenomena such as temporal relations, (hyper)intensionality and propositional attitudes.

The techniques described in this text are part of the long-term development of the Normal Translation Algorithm, which aims at the automatic translation of natural language sentences into TIL constructions. We describe methods for exploiting the VerbaLex lexicon of valency frames in relation to the Transparent Intensional Logic. We examine the relations between complex valency frames (CVFs) and TIL constructions of predicate-argument structures and discuss the procedure for the automatic acquisition of verbal object constructions.
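To make the notions of a type hierarchy and a construction more concrete, the following block gives a small illustration in the notation commonly used in the TIL literature. It is a sketch rather than material from this chapter: the sentence "Charles walks" and the constants Walk and Charles are assumptions chosen only for illustration.

```latex
% Base types: o = truth values, \iota = individuals,
% \tau = time moments (real numbers), \omega = possible worlds.
\[
  \text{base} = \{\, o,\ \iota,\ \tau,\ \omega \,\}
\]
% An intension with values of type \alpha is a function from possible
% worlds to functions from time moments to \alpha.
\[
  \alpha_{\tau\omega} \;=\; ((\alpha\,\tau)\,\omega)
\]
% "Charles walks" (illustrative): Walk is a property of individuals,
% Charles is an individual, and the closure constructs a proposition.
\[
  \mathit{Walk} / (o\,\iota)_{\tau\omega}, \qquad
  \mathit{Charles} / \iota, \qquad
  \lambda w \lambda t\, [{}^{0}\mathit{Walk}_{wt}\ {}^{0}\mathit{Charles}]
  \ \rightarrow\ o_{\tau\omega}
\]
```

Here the subscript in Walk_wt abbreviates the two successive applications [[Walk w] t], and the superscript 0 marks trivialization, the construction that simply supplies the named object.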
At the end of the chapter, we explain the design of the newly developed Dolphin system for the effective implementation of a knowledge base and basic question answering based on the Transparent Intensional Logic. We introduce the database that acts as a knowledge base for inference in TIL. The time aspect of the truth value of propositions is included, and the basic “thinking” capabilities of the Dolphin system are exemplified.

We also present an experiment in which the purely logically oriented type system of TIL is compared with the property-based types of the Easel [Fis99] world fact simulation language. The discussion also addresses the possibility of applications combining the two approaches.

The last of the projects we present deals with the role of biologically motivated emergent systems and intelligent agents in the simulation of electric power networks. The main aim is to provide a platform for analyzing the databases of power system failures in the Czech Republic (and a part of Slovakia) and to point out potential weak points.

The developed system, called Rice, is designed for simulating electrical power system processes. The system is based on the multi-agent approach, which allows unique versatility in the design and development of a particular power system network. The applications of the system aim at off-line analysis and prediction of power system failures. The communication among the defined agents is based on standards in multi-agent systems: the communication protocols CORBA (Common Object Request Broker Architecture) and KQML (Knowledge Query and Manipulation Language). The system itself makes use of open source implementations of these protocols. The whole system is able to perform an active simulation of the energy flow in a power system and to visualize it.
We take into account a possible future replacement of any particular agent with an on-line power equipment monitoring facility (ad hoc sensors), which would allow the whole power system to be monitored in real time.

The latest features developed in the Rice system allow users to capture the behaviour of dynamic emergent networks. To fulfil this requirement, the system has to cope with the complexity of local and global changes in the network characteristics, including such basic ones as the network topology. The text also includes several examples of typical power network components and the definitions of their behaviour in the emergent network environment.

We further present an account of the natural language processing tasks of the NLP Centre in the design and development of a natural language dialogue interface for querying large databases with temporal data about electrical power network failures. The implementation of such a dialogue interface includes the creation and preparation of several auxiliary resources required for natural language processing of texts in this specific domain. In this text, we describe the process and statistical results of creating a corpus of electrical power network texts consisting of more than 1 million positions. We also offer preliminary results of the syntactic analysis of the specialized corpus data and describe the problems of morphological and syntactic analysis of such domain-specific texts.

Chapter 2

New Language Resources and Tools

Each long-term natural language processing (NLP) project needs a firm basis of representative language resources and appropriate tools for processing them. The NLP Centre at Masaryk University in Brno takes part in research in all areas of the NLP field with the “handicap” of concentrating on the Czech language, in contrast to the most widely used languages such as English, German or French.
The situation with Czech language resources has, however, been improving in recent years, starting with valuable corpus resources (DESAM [PRS97], the Czech National Corpus [CNC06, KKK00] or the Prague Dependency Treebank [Haj04a]) and followed by specialized resources including the verb frame lexicons VALLEX [ŽL04] and VerbaLex (see Section 2.1).

In the first section of this chapter, we summarize three years of building the new VerbaLex lexicon of Czech verb frames, which contains more than 10 000 verbs. The features of the lexicon are designed to bring important semantic information to the computer processing of predicate constructions in running texts in the form of complex valency frames (CVFs). The most notable attributes of CVFs include synset (synonymical set) organization, two-level semantic labels linked to the Princeton WordNet and EuroWordNet¹ hierarchy, and surface verb frame patterns used for automatic syntactic analysis.

¹ See the further text for details.

In the area of lexical databases, ontologies and common sense knowledge resources, the Princeton WordNet [Mil90] has become one of the most popular. It is currently used in many areas of natural language processing, such as information retrieval, automatic summarization, document categorization, question answering and machine translation. To integrate WordNet into their applications, many researchers take the Princeton database and transform the data into their own proprietary formats. The Princeton team also developed a data browser for WordNet, which could be downloaded together with the English data from the WordNet project web page [WN07] for both Windows and UNIX platforms. This browser has since been replaced with a purely web-based application. No WordNet editing tools are provided, as the only instruments for the majority of the lexicographic work in Princeton are standard text editors. The consistency of the data is therefore not checked during the editing process itself but is postponed to later phases.

Year by year, the number of Princeton WordNet derivatives and WordNet-inspired initiatives has increased. In 1998–1999 the EU projects EuroWordNet 1 and 2 [Eur99] took place, in which a multilingual approach dominated and WordNets for 8 European languages (English, Dutch, Italian, Spanish, French, German, Czech and Estonian) were developed. The Interlingual Index (ILI), the Top Ontology, a set of Base Concepts and a set of Internal Language Relations were introduced as well [Vos98]. These changes also led to the design and development of a new database engine for EuroWordNet and resulted in the editing and browsing tool called Polaris [Lou98].

In 2001–2004 the EU project Balkanet [Bal04] was carried out, which can be viewed as a continuation of the EuroWordNet project. It was also conceived as multilingual, and within its framework WordNets for 6 languages (Greek, Turkish, Romanian, Bulgarian, Serbian and Czech) were developed or augmented. Before Balkanet started, it was already obvious that the Polaris tool had no future: its development had been closed, and as a licensed software product (by Lernout and Hauspie) it was rather expensive for most of the research institutions involved (typically universities). Moreover, the system was provided only for the MS Windows platform. That is why a specialized open source software system, VisDic, was developed by the MU NLP Centre during the work within the Balkanet project. VisDic was designed and implemented as a highly configurable multiplatform tool with an easy-to-use interface for working not only with WordNets but also with general dictionaries, using a variant of an XML schema for the dictionary entries. We present the details of this tool in Section 2.2.
In June 2000, the Global WordNet Association (GWA, see the web site [GWA07]) was established by Piek Vossen and Christiane Fellbaum. The purpose of this association is to “provide a platform for discussing, sharing and connecting WordNets for all languages in the world.” One of the most important activities of GWA is the Global WordNet Conference (GWC), which is held every two years in different places all over the world. The second GWC was organized by the MU NLP Centre in Brno, and the NLP Centre members actively participate in GWA plans and activities. A new idea that was born during the third GWC in Korea is the Global WordNet Grid, whose purpose is to provide a free network of (initially smaller) WordNets linked together through an Interlingual Ontology (as opposed to the Interlingual Index from EuroWordNet). The Grid preparation is currently just starting, and the MU NLP Centre is developing its software background.

The growing need to handle various lexical resources that take the form of dictionaries, semantic networks, ontologies, valency lexicons or FrameNets is the reason why researchers seek software systems that are able to store dictionary-like data using XML as the core element. Many dictionary publishing houses operate large systems with the complex functionality of so-called lexicographic stations that manipulate XML (DPS Longman [McN03]), and several companies offer dictionary writing programs of different complexity (TshwaneLex [JdS04], iLEX [Erl04] or Field Linguist’s Toolbox [Too07]). However, these and similar tools are not always able to efficiently manipulate resources obtained from data-driven NLP applications. Therefore, they cannot provide a universal environment for the management of lexical databases as well as semantic networks and ontologies. They are often rather large and quite complex systems, which is not always an advantage. And, last but not least, some of them are rather expensive.
That is why we decided to build a development framework on which the individual clients can be built. This solution is modular and flexible, since the clients can be adapted to a particular purpose in a short time. The description of this development platform, called DEB II, forms the third part of this chapter.

The content of this chapter is an extension of the previous work published in [HS03, HS04, HH05a, HH05b, HPRP06, HP06a, PH06a, HPRR06, HR07, HP07, HVR08].

2.1 VerbaLex – New Comprehensive Lexicon of Verb Valencies for Czech

The beginnings of the verb valency frame dictionary at the MU NLP Centre date back to 1997, when Karel Pala prepared the first version of a verb valency dictionary with 15 000 entries [PŠ97]. Since then, the dictionary, denoted as Brief, has undergone a long development and has been used in various tools, from semantic classification to syntactic analysis of Czech sentences [SH98]. The data in this dictionary can be entered in several mutually convertible formats:

brief format:
  jíst² hTc4,hTc4-hTc6r{na},hTc4-hTc7

verbose format:        display format:
  jíst²
  = co                 jíst něco
  = co & na čem        jíst něco na něčem
  = co & čím           jíst něco něčím

² ‘jíst’ = ‘to eat’, ‘co’ = ‘what’, ‘na čem’ = ‘on what’, ‘čím’ = ‘(with) what’, ‘něco’ = ‘something’, ‘na něčem’ = ‘on something’, ‘něčím’ = ‘(with) something’

The Brief dictionary contains about 15 000 verbs with 50 000 verb valency frames, which makes it an invaluable language resource with high coverage. However, it does not distinguish different verb senses.

Another advance in Czech verb valency processing came during the work on the Czech WordNet within the Balkanet project (see [Bal04]). The Czech WordNet was supplemented with a new language resource, the Czech WordNet valency frames dictionary. The novelties of this dictionary were the semantic roles and the links to the Czech WordNet semantic network.

While enhancing the list and adding new entries to it, we needed to compare its quality and features with the valency lexicon of Czech verbs denoted as VALLEX 1.0 [SLŽ02], which was created independently of the Czech WordNet verb frames. Based on these three verb frame resources, we have designed a new verb frame entry format named the complex valency frame (CVF). The resulting lexicon of CVFs of Czech verbs, named VerbaLex, contains linguistically grounded information useful for the automatic computer processing of verb frames.

The VerbaLex dictionary has been actively developed, checked and supplemented with new data since 2005. The work of 4 IT developers and 15 linguists is coordinated by Aleš Horák, with Dana Hlaváčková as the head of the linguistic team. Currently, VerbaLex contains 10 782 verb lemmata which, when gathered in synonymic groups, share 28 566 verb frames.

2.1.1 Linguistic Requirements for the VerbaLex Format

In this section, we justify the main differences between the VerbaLex and VALLEX 1.0 valency frame notations. VerbaLex differs from VALLEX 1.0 in an augmentation of the original format, a more detailed differentiation of valency frames and, above all, two-level semantic roles.³

The headwords in VerbaLex are formed by lemmata in a synonymic relation followed by their sense numbers (the standard Princeton WordNet notation; such an expression is called a WordNet literal). The lexical units in WordNet are organized into synsets (sets of synonyms) arranged in a hierarchy of word meanings (hypero-hyponymic relations). The standard definition of synonymy says that two synonymous words can always be substituted for each other in context.
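The literal notation for headwords can be illustrated with a short sketch. It assumes that a headword is serialized as a comma-separated list of `lemma:sense` items, as in the VerbaLex heads printed in Figure 2.1. The names `Literal` and `parse_headword` are hypothetical and introduced only for this illustration; they do not describe the actual VerbaLex storage format or tools.

```python
from typing import List, NamedTuple


class Literal(NamedTuple):
    """A WordNet literal: a lemma together with its sense number."""
    lemma: str
    sense: int


def parse_headword(headword: str) -> List[Literal]:
    """Split a comma-separated headword into (lemma, sense) literals."""
    literals = []
    for item in headword.split(","):
        # rpartition splits at the last colon, so multi-word lemmata
        # such as "dostat se" stay intact.
        lemma, _, sense = item.strip().rpartition(":")
        literals.append(Literal(lemma, int(sense)))
    return literals


# Headword of the synonymic group for arrive:1, get:5, come:2 (Figure 2.1).
headword = "dojít:1, dorazit:1, dostat se:1, přicestovat:1, přijet:1, přijít:1"
literals = parse_headword(headword)
```

Splitting at the last colon rather than the first is a deliberate choice in this sketch: reflexive verbs such as "dostat se" contain a space, and the sense number is always the final component.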
However, synonymy in synsets is understood as a very close sense affinity of the given words, so the substitution rule cannot be applied in all cases here. In VALLEX 1.0, a headword is one lemma, or possibly two or more lemmata in the case of lemma variants.⁴ Lemma variants in VerbaLex are considered independent lemmata and they are distinguished by their WordNet sense numbers. An example of two verb frame entries in VALLEX 1.0 and VerbaLex is displayed in Figure 2.1.

Lemma variants:
Princeton WordNet – plan:2
Definition: make plans for something
VALLEX 1.0: vymyslet2 / vymyslit2
VerbaLex: vymyslet:1, vymyslit:1, naplánovat:3

Word entries:
Princeton WordNet – arrive:1, get:5, come:2
Definition: reach a destination; arrive by movement or progress
VALLEX 1.0: dojít1
VerbaLex: dojít:1, dorazit:1, dostat se:1, přicestovat:1, přijet:1, přijít:1

Figure 2.1: Examples of verb frame entry heads for verbs with lemma variants and for synonymic verbs.

³ For more details about the VerbaLex semantic roles, see Section 2.1.2.
⁴ Lemmata with a small phoneme alternation (in Czech) that are interchangeable in any context without any change of meaning – bydlet/bydlit = to live (somewhere).

In VerbaLex, each word entry includes information about the verb aspect (perfective – pf., imperfective – impf., or both aspects – biasp.). VerbaLex valency frames are enriched with aspect differentiation for the examples containing the verb used with the given valency frame. This is important in the case of synonymic lemmata with different aspects:

Princeton WordNet – wade:1
Definition: walk (through relatively shallow water)
VerbaLex: brodit se:2 impf., přebrodit se:1 pf.
frame: AGobl who nom VERB SUBSobl (through+)what ins
example: přebrodil se blátem pf. / he waded through mud
example: brodil se pískem impf. / he waded through sand

The constituent elements of frame entries are enriched with pronominal terms (e.g.
who, what) and the morphological case number or a short word.⁵ This notation makes it possible to differentiate an animate or inanimate agent position:⁶

Princeton WordNet – bump:1, knock:3
Definition: knock against with force or violence
VerbaLex: narazit:1 pf. / narážet:1 impf.
frame: AGobl who nom VERB OBJobl to+what gen,at+what acc PARTobl (with+)what ins
example: I bumped to the wall with my head
frame: OBJobl what nom VERB OBJobl to+what gen,at+what acc
example: the car bumped to the tree

⁵ For the 7 grammatical cases in Czech – nom[inative], gen[itive], dat[ive], acc[usative], voc[ative], loc[ative] and ins[trumental].
⁶ The prepositions in VerbaLex are stated in Czech. In the examples presented here, we translate them into English.

2.1.1.1 Verb Usage and Verb Classes

VerbaLex captures additional information about types of verb use and semantic verb classes. Three types of verb use are displayed in the lexicon. The primary usage of a verb is marked with the abbreviation prim, metaphorical or figurative use with fig, and idiomatic and phraseological use with idiom (this notation corresponds to the original VALLEX 1.0). The assigned semantic verb classes have been adopted from Martha Palmer's VerbNet project [DKPR98]. The list of verb classes is based on Beth Levin's classes [Lev93] with more fine-grained sets of verbs. There are 395 classes in the current development version of VerbNet, which was provided by Martha Palmer's team. This number seems to be too high for Czech verbs, so the list of verb classes has been adapted to the conditions of the Czech language:

Princeton WordNet – cry:2, weep:1
Definition: shed tears because of sadness, rage, or pain
VerbaLex: brečet:1, plakat:1, ronit:1
class: nonverbal expression-40.2

Princeton WordNet – take care:2, mind:3
Definition: be in charge of or deal with
VerbaLex: dbát:2, starat se:2, pečovat:3
class: care-86

Princeton WordNet – be:11, live:5
Definition: have life, be alive
VerbaLex: žít:1, být:2, existovat:3
class: exist-47

2.1.2 Semantic Roles

Semantic role annotation is usually based on appropriate inventories of labels for semantic roles (deep cases, arguments of verbs, functors, actants) describing the predicate-argument structure of verbs. Different inventories are exploited in different projects (e.g. VALLEX [ŽL04], VerbNet [KDP00], FrameNet [FBS04], Salsa [BPGR06], CPA [Han04] or VerbaLex here).

The idea of semantic roles in VerbaLex came from the specification of the needs of Czech syntactic analysis – we need a technique for distinguishing sentence constituents as a) obligatory,⁷ b) typical (for the purpose of syntactic tree ranking⁸), or c) forbidden, for the sake of tree pruning. These considerations led us to design the inventory of two-level labels that are presently used for annotating semantic roles in VerbaLex. VerbaLex has thus introduced a different concept of semantic roles than the VALLEX 1.0 project.⁹ The functors used in VALLEX 1.0 valency frames seem to be too general and do not allow distinguishing different senses of verbs in the WordNet style. We suppose that a more specific subcategorization of the semantic role tags is necessary for the needs defined above.

⁷ With regard to the valency of another sentence constituent.
⁸ See Sections 3.2.2 and 3.2.3.
⁹ Semantic roles are denoted as functors in VALLEX 1.0.
The first level of VerbaLex semantic labels contains the main semantic roles proposed on the basis of the 1st and 2nd Order Entities from the EuroWordNet Top Ontology [VB+98]. On the second level, we use selected literals (lexical units) from the set of Princeton WordNet Base Concepts with the relevant sense numbers. We can thus specify groups of words (hyponyms of these literals) that can fill the positions of valency frames. This concept allows us to specify the valency frame notation with a large degree of sense differentiability.

The motivation for this choice is that the Princeton WordNet has a hierarchical structure which covers about 110 000 English lexical units (synsets). It is then possible to use general labels corresponding to selected top and middle nodes and go down the hyperonymy/hyponymy (H/H) tree until the particular synset is found or matched. This allows us to see the semantic structure of the analyzed sentences using their respective valency frames. At the same time, the nodes that we have to traverse when going down the H/H tree form a sequence of semantic features which characterize the meaning of the lexical unit fitting into a particular valency frame. These sequences can be interpreted as quite detailed selectional restrictions. Currently, we use about 40 1st-level and 200 2nd-level semantic roles in VerbaLex. For example, the literal writing implement:1 is a hypernym for any implement that is used to write.

Princeton WordNet – draw:6
Definition: represent by making a drawing of, as with a pencil, chalk, etc. on a surface
VerbaLex: kreslit:1, malovat:1
frame: AGobl who nom VERB ARTobl what acc INSobl (with+)what ins
example: my sister draws a picture with coloured pencils, the famous artist was drawing his painting only with charcoal

The left-side valency position is most frequently occupied by the semantic role AG, an agent.
The agent position in a valency frame is understood as a very general semantic role (functor ACT) in VALLEX 1.0. This label does not allow distinguishing various types of action cause. Two-level semantic role labels in VerbaLex are able to define the cause of an action quite precisely. The main semantic role AG is completed by an adequate literal depending on the verb sense and valency frame. Thus, we can identify whether the agent is a person AG, an animal AG, a group of people AG, an institution AG or a machine AG. For some verbs with a very specific sense, hyponyms of these literals are used. For example:

Princeton WordNet – sugar:1, saccharify:1
Definition: sweeten with sugar
VerbaLex: sladit:4, osladit:1, pocukrovat:1
frame: AGobl who nom VERB SUBSobl what acc SUBSobl (with+)what ins
example: sugar your tea with brown sugar

In VALLEX 1.0, each valency frame always starts with the functor ACT. In our opinion, it is useful to differentiate the sense of the left-side valency position (subject position) in more detail. According to our definition of the agent AG (somebody or something doing something actively), this position may also be occupied by other semantic roles. The subject position can contain objects OBJ, substances SUBS or a semantic role denoting abstract concepts – human activity ACT, knowledge KNOW, event EVEN, information INFO, state STATE. For example:

Princeton WordNet – follow:6, come after:1
Definition: come after in time, as a result
VerbaLex: přijít:25 / přicházet:25, následovat:4
frame: EVENobl what nom VERB EVENobl (after+)what loc
example: heavy rain followed flood

Princeton WordNet – fall:3
Definition: pass suddenly and passively into a state of body or mind
VerbaLex: zachvátit:2, zmocnit se:2
frame: STATEobl what nom VERB PATobl whom acc
example: he fell into a depression

Table 2.1: List of 1st-level semantic roles from VerbaLex that are used in the examples.
AG     the semantic role of the animate entity that instigates or causes the happening denoted by the verb in the clause; we extended this definition to an inanimate entity that does something actively (e.g. a machine)
ART    a man-made object taken as a whole
SUBS   that which has mass and occupies space
PART   a portion of a natural object, something determined in relation to something that includes it, something less than the whole of a human artifact
INS    a device that requires skill for proper use
OBJ    a tangible and visible entity; an entity that can cast a shadow
EVEN   something that happens at a given place and time
STATE  the way something is with respect to its main attributes

A large number of semantic roles inspired by the EuroWordNet Top Ontology roughly correspond to the PAT functor in VALLEX 1.0. The PAT label covers quite different senses, which can be very well identified. In our inventory, PAT is defined as the semantic role of an entity that is not the agent but is directly involved in or affected by the happening denoted by the verb in the clause (the definition of the patient:2 literal from the Princeton WordNet).

Princeton WordNet – experience:1, undergo:2, see:21, go through:1
Definition: go or live through
VALLEX 1.0: absolvovat2
frame: ACTobl 1 PATobl 4
VerbaLex: absolvovat:2, prožít:1 / prožívat:1
frame: AGobl who nom VERB EVENobl what acc
example: he underwent difficult surgery

Some second-level literals cannot be adopted from the Princeton WordNet Base Concepts – especially specifications of roles considered as “classic” deep cases. These literals (e.g. agent:6, patient:2, donor:1, addressee:1, or beneficiary:1) do not have any hyponyms in the Princeton WordNet and cannot be substituted by any word. For such cases, the literal person:1 is used (or another suitable literal with a large number of hyponyms, e.g. AG, PAT). These “classic” semantic roles are consistent with some functors in VALLEX 1.0 (ACT, PAT, ADDR, BEN etc.).
A list of the VerbaLex semantic roles that are used in the presented examples is displayed in Table 2.1.

2.1.2.1 Special Semantic Roles

VerbaLex describes not only the valency and semantic frames; it also includes other relevant information about Czech verbs, such as the verb position. In a free word order language like Czech, the position of the verb within the verb frame is usually not strictly specified. VerbaLex uses a special semantic role, VERB, which marks the canonical position of the verb in its verb frame. Such a default verb position is needed not only for the analysis of verb valencies; it can also be used directly in the generation of Czech sentences, e.g. as the output of a question-answering machine.

The left side of the verb position is traditionally occupied by the sentence subject, which is also marked in most of the verb frames in VerbaLex. However, there are cases where the verb frame has to obey different rules – e.g. the sentence Dalo se do deště (It started to rain) cannot contain any subject. For the notation of such cases, VerbaLex uses another special semantic role, ISUB, an inexplicit subject.

2.1.3 The Implementation of Editing and Exporting Tools

For the sake of editing and entry management in the newly adopted verb valency frame format for VerbaLex, we have implemented a new set of editing tools.

Figure 2.2: The tool for editing the verb valency frame dictionary in the VerbaLex format.

The main interactive tool for user editing of the valency dictionary, named verbalex.sh, is based on the highly configurable multi-platform editor VIM [Sch07] (see Figure 2.2). This approach enables a linguistic expert to easily enter computer-parseable data in a fixed plain text format and still, thanks to the flexible color syntax highlighting, he or she has full visual control of possible errors in the format.
The editing itself is not fixed to one platform: users can run the same environment under any of the currently popular operating systems (the VIM editor runs on nearly any platform). The authoring tool verbalex.sh currently offers these functions to the editing user:
• free editing of the dictionary entries
• regular expression searching in the dictionary
• template-based adding of a new verb entry or a new verb frame to the current entry
• menu-based adding of a new semantic role to the current frame
• multilevel folding – hiding/displaying of valency attributes, valencies or full valency frames
• visual marking of the current frame for further inquiry
• interactive merging of definitions from two parallel sources

Moreover, since the tool is interpreted, new features of the editing system are easier to implement.

The plain text format edited by a human expert was inspired by the editing format of VALLEX 1.0 [Žab05]. In further processing, this text format is transformed into a standard XML format, which enables conversions into the different formats used for visual checking, searching and presentation of the valency dictionary.

A good example of merging various lexical data is the work going on in the NLP Centre at FI MU, where the data from the Czech WordNet and the Czech valency lexicon VerbaLex are combined. The VerbaLex lexicon is currently being developed separately and independently of the Czech WordNet using a particular XML format (see Figure 2.3). However, the entries in VerbaLex are written in the form of WordNet synsets, which enables combining the data from both resources. The Czech WordNet currently contains a smaller set of valency frames in a plain PCDATA format (see Figure 2.4). The current work is directed at merging the VerbaLex valency frames with the Czech WordNet synset structures.
The Czech verbs for which valency frames already exist are or will be linked to their English equivalents by means of the ILI (Inter-Lingual Index). If Czech and English verbs (synsets) are linked correctly, the deep valency frames developed for Czech can also be valid for English (surface valencies are obviously different, since Czech is a synthetic language whereas English is an analytic one).

The XML schema for VerbaLex also took inspiration from VALLEX 1.0; the original schema had to be changed to suit the augmentation of the format in VerbaLex. The changes include
• adding a class attribute to the frame slot tag to cover WordNet basic concept literals
• including the WordNet word sense in the lemma tags
• shifting the verb aspect to the headword lemma, which now enumerates all the aspectual counterpart tuples.
An example of such an XML substructure can be found in Figure 2.3.

[XML listing of Figure 2.3: the markup was lost in text extraction; the surviving text content is: dodat ... dát ... vložit ... dok: připojili ke smlouvě své podpisy ... prim
...]

Figure 2.3: An example of (a part of) an entry in the VerbaLex XML format for the synset dát:8, vložit:1, vsunout:1, přidat:2, připojit:1, dodat:1 (i.e. insert:1, infix:1, enter:7, introduce:6 in the Princeton WordNet).

Synset: dát:8, vložit:1, vsunout:1, přidat:2, připojit:1, dodat:1

{dát, vložit, vsunout}
kdo1*AG(person:1)=co4*OBJ(object:1) & do čeho2*OBJ(container:1)

{dát, vsunout, přidat, vložit, dodat}
kdo1*AG(person:1)=co4*INFO(info:1) & do čeho2*COM(written communication:1)
%dodal do textu nové poznámky, přidal k článku obrázek

{dát,přidat, připojit, dodat}
kdo1*AG(person:1)=co4*INFO(info:1) & k čemu3*COM(written communication:1)
%připojili k smlouvě své podpisy

{přidat, připojit, dodat}
kdo1*AG(person:1)=co4*OBJ(object:1) ? k čemu3*OBJ(object:1)
%připojil hadici ke kohoutku

Figure 2.4: An example of valency frames in the Czech WordNet for the same synset insert:1, infix:1, enter:7, introduce:6 as displayed in Figure 2.3.

The resulting XML structure is then transformed into various output formats with the use of modified tools from VALLEX 1.0. The export formats are:
• HTML with navigation among the characteristic features of the dictionary entries,
• PostScript document for printing, including a page index of all verbs, and
• PDF, which allows navigation through the document in the same visual form as for hardcopy printing.

2.1.4 Application of VerbaLex in Syntactic Analysis

The design of the VerbaLex verb valency lexicon was driven mainly by the requirement to describe the verb frame (VF) features in a computer-readable form suitable for syntactic and semantic analysis.
The current CVF¹⁰ structure contains:
• morphological and syntactic features of constituents
• two-level semantic roles
• links to the Princeton WordNet and Czech WordNet hypero/hyponymic hierarchy
• differentiation of animate/inanimate constituents
• default verb position
• verb frames linked to verb senses
• VerbNet classes of verbs.

¹⁰ Complex Valency Frames

We are currently testing the application of VerbaLex in the syntactic analyzer synt (see Chapter 3), which is designed for parsing real-text sentences. For a detailed description of the verb frame extraction (VFE) process, see Section 3.4.1.

The system processing can be presented on an example sentence – see the syntactic tree in Figure 2.5 and the textual output of the part of the system that implements the VFE algorithm in Figure 2.6. The system first identifies the verb rule constituents (nterms). Then the corresponding groups, i.e. the actual sentence constituents that will play the role of verb frame arguments, are extracted from the forest of values. Groups usually do not correspond to nterms one-to-one, since they are stored within non-terminals deeper in the forest and not directly in the verb rule. This part of the VFE algorithm unfortunately has exponential time complexity; however, for common sentences the depth of the verb frame constituents is not more than three levels, so the actual running times are usually within fractions of a second.

After the identification of the groups, the algorithm looks for possible subjects. This is not as easy as it may look at first sight, since the sentence subject can be expressed not only by a noun phrase in the nominative (which is the most frequent option in Czech), but also by e.g. a prepositional phrase or a verb infinitive. If no possible subject is found, the algorithm supplies a pronoun for an inexplicit subject with the gender corresponding to the verb. The Clause valency list displays all possible combinations of the
translations of the verb arguments found into verb frame patterns. This list is then intersected with the list of lexicon entries for the verb to obtain the Matched valency list as the result of the VFE algorithm.

The effectiveness of the syntactic analysis with the VFE algorithm was measured on approximately 4 000 Czech corpus sentences with a median of 15 words per sentence; the Clause valency list contained 11 possible verb frames, and the running time was 0.07 seconds per sentence.

[Tree diagram of Figure 2.5: the layout was lost in text extraction; the node labels in extraction order are: start clause np ADJ Malé np N děti part adv ADV špatně V snáší np ADJ dlouhou np np N jízdu np N autem ends ’.’ .]

Figure 2.5: Syntactic tree of an example input sentence “Malé děti špatně snáší dlouhou jízdu autem.” (Small children badly withstand long journey by car.)

verb rule schema: 3 nterms, ’#2’
nterm 1: k1gNnPc1
nterm 2: k5eAp3nPtPmIaI
nterm 3: k1gFnSc4
group 1: 0,2, npnl -> .{ left modif } np . k1gMnSc1 "malé děti"
group 2: 2,3, ADV -> .’špatně’ . k6xMeAd1
group 3: 4,7, npnl -> .{ left modif } np . k1gFnSc4 "dlouhou jízdu autem"
possible subjects: #1
Clause valency list:
snášet -#2:(1)hH#1:(0)hPTc1-#3:(2)hPTc4
snášet(0) #1:(1)hH-#2:(2)hPTc4
Verb valency list:
snášet #2:hH-#1:hPTc4
snášet #1:hPTc4
Matched valency list:
snášet(0) #2:(1)hH-#1:(2)hPTc4

Figure 2.6: The output of the verb frame extraction algorithm during the example sentence analysis.

2.2 VisDic – Off-line WordNet Editor

As the developers of the Czech WordNet within the EuroWordNet project, we came to the conclusion that a new tool for WordNet browsing and editing had to be developed rather quickly. At the same time, we realized that it was necessary to look for a solution that would also support establishing the necessary standards for WordNet-like lexical (knowledge) databases. Thus we decided to develop a new tool for WordNets based on an XML data format, which can be used for lexical databases of various sorts.
The tool is called VisDic and it was implemented in 1999–2004 in the Natural Language Processing Laboratory at the Faculty of Informatics, Masaryk University, for both the Windows and Linux platforms [PP02, HS03].

2.2.1 Basic Functionality

VisDic was developed as a tool for the presentation and editing of (primarily WordNet-like) dictionary databases stored in an XML format. Most of the program behaviour and the dictionary design can be configured. With these capabilities, we can adapt VisDic to various dictionary types: monolingual, translational, thesaurus or generally linked WordNet lexicons.

2.2.1.1 Multiple Views of Multiple WordNets

The main working window is divided into several dictionary panels. Each panel represents a place for entering queries and browsing the context of one specified WordNet dictionary. The panels can display different WordNets as well as multiple contexts of the same dictionary.

Besides the query input and the list of matching results, a panel offers a set of overlapping notebook tabs, each of which represents one kind of view of the same entry from the list of results. The order, the type and even the content of each notebook tab are specified by the user in the configuration files (see Section 2.2.3). The main types of views are described in the following sections.

2.2.1.2 Freely Defined Text Views

The content of the Text View notebook tab is built entirely from the user definition that follows the XML structure of the WordNet entry. The editor can thus present an easily readable view of the entry, highlighting important parts of the entry content (see Figure 2.7).

Figure 2.7: An example of a freely defined text view of a WordNet entry.

2.2.1.3 Edit

The editing capabilities give the user full control over the content and linking of each entry in the WordNet hierarchy.
Rather than making the user move the entry as an object in a multicolored spider web of linkage relations, VisDic lets the linguist specify all the links in a textual dialog, where all the bindings are displayed in one place with consistency checks after each change request. The actual content of the Edit notebook tab is also entirely driven by user instructions in the configuration, where each editing field is given a textual label and is assigned to an XML tag from the entry structure.

2.2.1.4 Tree and RevTree

WordNet dictionaries are characterized by a dense network of various kinds of relations between the dictionary entries, whose function is to capture the ontology relations of the underlying natural language. Navigation in such an environment is thus crucial for successful linguistic work with WordNet data. Since the linkage relations generally do not need to obey any rules, the resulting structure can be an arbitrary directed graph, so VisDic implements a browsing mechanism for general graphs.

The navigation process works with two interconnected notebook tabs, which both always start at the same dictionary entry and display its position in the graph, represented as breadth-first path trees of all the linkage relations that lead from the entry to other entries in the dictionary. The two notebook tabs display mutually opposite linkage relations, allowing the user to choose the direction of graph navigation in every step. To facilitate orientation and to help position the entry in the WordNet hierarchy, the navigation also displays the path from the entry to its top in the hyper-hyponymic relation tree (see Figure 2.8). For more advanced navigation, the linguist may also use advanced tree browsing techniques (described in Section 2.2.2.3).
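The navigation over a general relation graph described above can be illustrated with a minimal sketch. It is not the VisDic implementation: the toy link table, the function names and the relation label hypernym are assumptions introduced here for illustration only. The sketch builds a breadth-first path tree from a starting entry, reverses the links to obtain the opposite direction shown in the second tab, and follows one relation upwards to obtain the path to the top.

```python
from collections import deque

# Hypothetical toy data: entry -> list of (relation, target) links.
LINKS = {
    "dog:1": [("hypernym", "canine:2")],
    "canine:2": [("hypernym", "carnivore:1")],
    "carnivore:1": [("hypernym", "animal:1")],
    "animal:1": [],
}


def reverse_links(links):
    """Build the opposite-direction link table (used for the reverse view)."""
    rev = {entry: [] for entry in links}
    for source, edges in links.items():
        for relation, target in edges:
            rev.setdefault(target, []).append((relation, source))
    return rev


def bfs_tree(links, start):
    """Return {entry: parent} describing a breadth-first path tree from start."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in links.get(node, []):
            # Every entry is attached to the tree only once, so the
            # traversal terminates even if the graph contains cycles.
            if target not in parent:
                parent[target] = node
                queue.append(target)
    return parent


def path_to_top(links, start, relation="hypernym"):
    """Follow one relation upwards until no unvisited target remains."""
    path, seen, node = [start], {start}, start
    while True:
        ups = [t for r, t in links.get(node, []) if r == relation and t not in seen]
        if not ups:
            return path
        node = ups[0]  # assumption: take the first link when several exist
        seen.add(node)
        path.append(node)
```

Calling bfs_tree on the output of reverse_links gives the tree for the opposite direction, which corresponds to the idea of two tabs that start at the same entry but follow mutually opposite relations.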
2.2.1.5 Query Result and External File Lists

Common actions in WordNet creation and editing often include processing a subset of entries based on certain criteria. VisDic offers a suitable kind of view for this situation, which allows preparing a notebook tab with a list of entries matching any user-specified query or a list of entries identified by entry IDs gathered in a plain text file.

Figure 2.8: The tree-like navigation in the WordNet linkage relations graph.

2.2.1.6 Plain XML View

Sometimes users need a thorough view of the data structures contained in a dictionary entry. The XML View notebook tab offers this possibility. In this view, the user can see a graphically structured XML text, which represents the entry structure as it is stored in the dictionary (see Figure 2.9).

Figure 2.9: Raw XML view of a synset entry.

2.2.2 Advanced Functionality

The basic functionality described in the previous section generally applies to any XML-based dictionary. However, linguistic work specialized in WordNet creation and editing requires some more specific and more sophisticated functions in the editor.

2.2.2.1 Synchronization

During the creation of a national (e.g. Czech) WordNet that should correspond to the English WordNet as a primary reference, one of the most frequent operations is a lookup of a dictionary entry (synset) from one WordNet in another dictionary. Such a lookup uses either the SYNSET.ID tag (as a direct equivalent of the entity) or one of the so-called equivalence tags (or attributes) defined in the configuration. An example of such a tag may be REVMAP or MAPHINT, used to help the linguist process ambiguous link references between various versions of the English WordNet.
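As an illustration of such a lookup, the following minimal sketch finds the counterpart of a synset in another dictionary, either by its ID or by an equivalence tag. It is only a sketch under stated assumptions and does not reproduce VisDic code: the root element WN, the ID values and the function names are hypothetical, and we assume that the value of the chosen tag in the source entry is matched against the SYNSET.ID of the target dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dictionaries. The tag names SYNSET, ID, SYNONYM,
# LITERAL and REVMAP are mentioned in the text; the root element WN and
# all ID values are invented for this example.
CZECH = """<WN>
  <SYNSET><ID>X-001</ID><REVMAP>Y-100</REVMAP>
    <SYNONYM><LITERAL>pes</LITERAL></SYNONYM></SYNSET>
</WN>"""

ENGLISH = """<WN>
  <SYNSET><ID>X-001</ID><SYNONYM><LITERAL>dog</LITERAL></SYNONYM></SYNSET>
  <SYNSET><ID>Y-100</ID><SYNONYM><LITERAL>domestic dog</LITERAL></SYNONYM></SYNSET>
</WN>"""


def index_by_id(xml_text):
    """Map each SYNSET.ID value to its SYNSET element."""
    root = ET.fromstring(xml_text)
    return {synset.findtext("ID"): synset for synset in root.iter("SYNSET")}


def show_by(source_synset, target_index, tag="ID"):
    """One-time lookup: read `tag` from the source entry and use its value
    as the ID to search for in the target dictionary (an assumption)."""
    key = source_synset.findtext(tag)
    if key is None:
        return None  # the source entry has no such tag
    return target_index.get(key)


def literals(synset):
    """Collect the literal strings of a synset."""
    return [lit.text for lit in synset.iter("LITERAL")]


czech = index_by_id(CZECH)
english = index_by_id(ENGLISH)
by_id = show_by(czech["X-001"], english)                # direct equivalent
by_revmap = show_by(czech["X-001"], english, "REVMAP")  # via equivalence tag
```

An automatic variant of the lookup would simply call show_by again whenever the current entry in the source tab changes.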
The lookup function in VisDic can work in two modes: as an instant (one-time) lookup, the Show (by) operation, and as a firmly established link between two notebook tabs called AutoLookUp (by). In the case of AutoLookUp, any move to another dictionary entry in the source notebook tab leads to an automatic lookup of the new entry in the destination dictionary. VisDic allows any acceptable combination of AutoLookUps among all the notebook tabs.

2.2.2.2 Editing Support

The efforts to unify national WordNets based on the English WordNet in many cases lead to copying synset information between dictionaries of different languages. This functionality in VisDic is split into two common situations: either the SYNSET.ID of an existing synset is to be unified with the ID of the English synset (the Take key from operation), or a whole new entry is to be copied to another dictionary (Copy entry to).

2.2.2.3 Tree Browsing

The basic navigation in related synsets (in some cases reduced to the tree of hypernymic and hyponymic relations) is supplemented with two important WordNet operations: Topmost entries and Full expansion. The Topmost entries operation identifies all synsets which are (in the tree subset of linkage relations) found as the roots of the relational hierarchy, i.e. are not hung below any other synset. This helps the linguist to identify the level 1 entries as well as entries that have not been filed yet. The Full expansion operation allows the user to see all possible descendants of a selected synset in the linkage relations graph. During the operation, cycle detection techniques check for any violations of tree properties in the graph. Some relations can also be configured to be left out of the full expansion process.¹¹

2.2.2.4 Consistency Checks

Semi-automatic processing, which often takes part in the creation of national WordNets, as well as common human processing of the data, inevitably brings in the possibility of mistakes.
The inconsistencies that can be revealed as duplicates are covered by the VisDic consistency checks, which contain
• check duplicate IDs
• check duplicate literals and senses
• check duplicate synset literals
• check duplicate synset links
These checks allow the linguist to identify the most common errors, e.g. after merging data from various sources.

¹¹ See also Visual Definitions in Section 2.2.3.2.

2.2.2.5 Journaling

The work on a large and representative national WordNet usually employs more than one linguist working on the data. The synchronization of the resulting dictionary is made possible in VisDic by journaling. During the work with VisDic, any change of the data is recorded in a journal file. Each journal file is specific to one dictionary and one user at a time. Such a journal file can then be “applied” to the dictionary data and merged with the original. In this way, the simultaneous work of several linguists can easily be merged into a common data source.

2.2.3 XML Configuration

Most of the functionality of the VisDic WordNet editor can be adapted to local needs by means of its configuration files. All settings for the VisDic application are stored in several XML files.

2.2.3.1 Global Configuration

The main configuration file (visdic.cfg) serves for global application data storage such as the list of dictionaries, the list of views, fonts, colors or query history. All information is stored in an XML structure. The first-level subsections of the global configuration are:

colors
In the COLOR section, the user can define colors which are then referenced by their names in dictionary configuration files. Each color is enclosed in its name tag and consists of three hexadecimal values separated by commas, representing, in this order, its red, green and blue components. Each value can be in the range from 0x0000 to 0xffff.
fonts
The FONT section defines fonts which can be referenced by the defined names from dictionary configurations. Each font definition is enclosed in its name tag and its value corresponds to the font string description.

application settings
The APPL section contains all global data that are related to the application state. The most common settings that can be found here are:
• DICT – path to a dictionary that is presented in the list offered to the user.
• OPEN – ordinal number of a dictionary that should be opened in one notebook tab.
• AUTOLOOKUP – definition of a synchronization link between two notebook tabs.
• SIZE – size of the notebook tab as a percentage of the main window width.
• HIST – history of the last queries that were entered by the user in a specific notebook entry line.

A shortened example of a global configuration file is displayed in Figure 2.10.

2.2.3.2 Dictionary Specific Configuration

Each WordNet dictionary has its own configuration file (named dictionary.cfg), which enables the linguist to set up most of the texts displayed in the application as well as the content of the notebook tabs specific to the particular dictionary with respect to the XML structure of the entries. The configuration contains attribute settings of the dictionary and sections describing the layout of the dictionary views. The main attributes in the dictionary configuration are:
• NAME – full name of the dictionary. This name is presented to the user in various places in the application, e.g. at the top of the dictionary notebook tab.
• SHORT NAME – short name of the dictionary.
• MAIN TAG – the default XML tag in the user queries (e.g. SYNSET.SYNONYM.LITERAL).
• MAX QUERY – limit of the number of results of a query.
• MAX VIEW – limit of the number of characters displayed in the user defined text view.
• CHARSET – name of a character set indicating the encoding of the dictionary.
This information is necessary for correct handling of the dictionary on some systems.

The rest of the dictionary configuration file contains sections defining the list of the available dictionary views and their content or the list of duplicate checking actions in the application menu.

Figure 2.10: The global configuration example (... stands for shortened parts).

Visual Definitions The VISUAL section describes how dictionary entries are displayed. Definitions are enclosed in tags corresponding to their names. VisDic primarily uses two special visual definitions. The first is called VISDIC SHORT and it presents the entry in a short one-line format (e.g. in the list of all entries matching the query or within a tree view). The second visual definition, named VISDIC, describes the content of the user defined text view, i.e. it presents the entry in a more descriptive way. Each tag from the entry XML structure can be displayed in its own way. The definition contains C-like string format specifications consisting of a string in double quotation marks and other parameters. These parameters have the following meaning:
• %c – color name (taken from visdic.cfg), changes the current color.
• %f – font name (taken from visdic.cfg), changes the current font.
• %s – string, it can be @tag:name for the current tag name, or @tag:value for the tag value.
• %i – includes the output of the processing of subtags.
• %K+ – in the tree view, stop expanding the Full expansion view under the current entry.
• %K – in the tree view, delete the current line from the tree.
The format string can include parts that are displayed only under a certain condition.
The available conditions are
• \\{^...\\} – display ... only if the tag is the first in the list.
• \\{$...\\} – display ... only if the tag is the last in the list.
• \\{*...\\} – display ... only if the tag is not the last in the list.
An example of the usage of the conditional parts of the format string is a comma-separated list of literals with their senses:

    "%i"
    "%s:%i\\{*, \\}",@tag:value
    "%s",@tag:value

The visual definition of each XML tag can contain a test for the value of the tag in the form ="value":"result". For instance, various types of WordNet relations between synsets can be rendered as colored one-letter acronyms like this:

    "%i"
    ="hypernym":"%cH",BLACK
    ="holo_member":"%cM",BLUE
    ="derived":"%cD",DARK_GREEN
    "%c[%s]",RED,@tag:value

A special tag named DEFAULT stands for any tag. It is used for tags that do not have their own definitions.

Views The VIEW section specifies the design of the notebook tabs. Each tab is described by one LIST subsection. Each tab has its own name in the NAME tag and its own type in the TYPE tag. According to the type, the LIST subsection can include other specifications of the tab content:
• XML view has no other options. It just displays the XML structure of a dictionary entry.
• USER type has a DEF tag referencing the name of a visual definition of the user defined text view.
• TREE type contains two tags specifying the parent and child link tags in the dictionary and the DEF tag for the visual definition used in the presented tree-like ordering of entries.
• EDIT type describes the form fields for editing one dictionary entry. The subsection contains ITEM or BUTTON tags. Items refer to XML tags in TAG, each has its own head label in HEAD and its own item type in TYPE. The appearance of the form field is specified in the EDIT tag. It can be a single-line entry (ENTRY), a multi-line entry (TEXT) or a checkbox (CHECKBOX).
All form fields used for editing the link or reverse link tags (see the field type R in Section 2.2.3.3) are displayed as combo boxes with an arrow on the right side of the box, which allows the user to navigate to the referred entry. All form fields that represent a tag which can occur more than once are supplemented by two buttons, used for adding another instance or removing the current instance of the tag. The BUTTON tags define buttons that run one of the storage actions. Each button has its label specified in the TEXT tag and its type in the TYPE tag. The type can be either NEW for creating a new entry, DELETE for deleting the current entry or UPDATE for saving the edited entry content.
• WORD type view presents a list of all words from the dictionary that can be found among the values of the given tag.
• ENTR type view is a list of entries that meet a condition given by the user query in the QUERY tag.

Main Menu Actions The MENU section describes a list of dictionary-specific actions which can be added to the VisDic main menu. All these actions are appended to the Dictionary submenu. An example of the actions that can be specified in the MENU section are the duplicate checking actions. These actions look for duplicate values within the dictionary, either among entries or within a single entry. The action definition is enclosed in the DUPL tag. The TYPE tag chooses the kind of comparison – ENTR for comparing entries or ITEM for comparing items within the range of one entry. The NAME tag contains the name of the action, which is displayed in the menu. The TAGS tag lists all tags that are included in the duplicate checking; multiple tags are separated with the vertical bar sign ‘|’. If a tag begins with a dot ‘.’, then the tag is considered a subtag of the previous tag.
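The TAGS notation described above can be illustrated with a short Python sketch. It is not taken from VisDic itself; the function name expand_tags is hypothetical, and the sketch assumes that a dotted subtag is appended to the full path of the tag immediately preceding it in the list.

```python
def expand_tags(spec):
    """Expand a TAGS value into a list of full tag paths.

    Assumption (not confirmed by the source): a part starting with '.'
    extends the full path of the part that precedes it.
    """
    paths = []
    for part in spec.split("|"):
        part = part.strip()
        if part.startswith("."):
            if not paths:
                raise ValueError("a subtag needs a preceding tag")
            # '.SENSE' after 'SYNSET.SYNONYM.LITERAL' becomes one long path
            paths.append(paths[-1] + part)
        else:
            paths.append(part)
    return paths


if __name__ == "__main__":
    for path in expand_tags("SYNSET.SYNONYM.LITERAL|.SENSE"):
        print(path)
```

Under this reading, the value SYNSET.SYNONYM.LITERAL|.SENSE names two tags that are compared together: the literal and its SENSE subtag.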
Examples of the duplicate checking actions are:
• searching for all entries (synsets in WordNet) having the same SYNSET.ILI value

    TYPE: ENTR
    NAME: Check duplicate ILI numbers
    TAGS: SYNSET.ILI

• identification of all literal:sense pairs in WordNet stored in more than one synset. Here, the .SENSE tag corresponds to the SYNSET.SYNONYM.LITERAL.SENSE subtag of the SYNSET.SYNONYM.LITERAL tag

    TYPE: ENTR
    NAME: Check duplicate literals & senses
    TAGS: SYNSET.SYNONYM.LITERAL|.SENSE

• finding all literals in WordNet that occur more than once in one entry (synset)

    TYPE: ITEM
    NAME: Check duplicate synset literals
    TAGS: SYNSET.SYNONYM.LITERAL

2.2.3.3 Dictionary Definition

Each dictionary has, besides its configuration file, an associated definition file named dictionary.def. This file describes the XML structure of the dictionary. The structure of the definition file contains features that are specific to the WordNet-like XML dictionaries. The definition file is a plain text file with each row corresponding to one XML tag. The line format is

    level tag min max type args

where the corresponding fields contain
• level – the tag level (0 for the top level).
• tag – the tag name.
• min – minimal number of occurrences of the tag within its supertag.
• max – maximum number of occurrences of the tag within its supertag (−1 means an unlimited number).

    0 SYNSET 1 1 N
    1 ID 1 1 K
    1 POS 1 1 N
    1 SYNONYM 1 1 N
    2 LITERAL 1 -1 N
    3 SENSE 1 1 I @maxbyparent+1
    3 LNOTE 0 1 N
    1 ILR 0 -1 L
    2 TYPE 1 1 N
    1 RILR 0 -1 R SYNSET.ILR
    1 BCS 0 1 N
    1 DEF 0 1 N
    1 USAGE 0 -1 N
    1 SNOTE 0 -1 N
    1 STAMP 0 1 N

Figure 2.11: An example of a dictionary definition file.

• type – the kind of the tag. It can be one of
– N – normal text entry.
– I – integer number entry. In the args column a function for the default value can be stated.
– K – key value uniquely identifying the entry. Such a key is used by the following L, R, and E kinds of tags.
– L – link to another synset, it represents a semantic relation.
– R – similar to L. It is defined as the reverse of the link specified in the args column.
– E – contains external information stored in another dictionary. The name of the external tag and the path to the dictionary are contained in the args column. The path is absolute or relative to the VisDic initial directory, not relative to the dictionary path.
• args – extra arguments for some kinds of tags.
An example of a dictionary definition file can be found in Figure 2.11.

2.3 DEBVisDic and other DEB Platform Applications

VisDic, during its rather short history, has already proved its suitability for lexical database creation. The main power of VisDic manifests itself especially in the development of highly interlinked databases such as WordNet. Its unique features have assured VisDic the leading role in many WordNet editing projects. In comparison with previous WordNet tools, VisDic exploits the XML data format, thus making the WordNet-like databases more standard and exchangeable. Moreover, thanks to the XML data format used and to its dictionary specific configurability, VisDic can serve for developing various types of dictionaries, i.e. monolingual, translational, thesauri and multilingually linked WordNet-like databases. The experience with the VisDic tool during the Balkanet project has been positive [HS04] and it was used as the main tool with which all Balkanet WordNets were developed.

VisDic, however, has its disadvantages: in particular, it is not based on the client/server architecture and it does not allow associating various attributes with literals or handling the links between them. It can work with links only between synsets, which is a limiting feature for enriching WordNets with various sorts of information, e.g. in Czech with word derivation relations existing within one part of speech as well as across them.
The experience with VisDic has led us to more systematic research into the usage of XML data formats within the field of computational lexicography. In parallel, we also pay attention to the relations between WordNets and the Semantic Web. This interest gives us a strong motivation for studying the properties of the XML data formats and tools for working with them. Thus we set as our task to design and implement a more universal dictionary writing system that could be exploited in various lexicographic applications to build large lexical databases. The system has been called Dictionary Editor and Browser (further DEB) and in its current version (named DEB II) it is used in several larger lexicographic projects that are described further. The design of DEB allows us to use it advantageously also for building WordNet-like databases.

Figure 2.12: The schema of the DEB II platform architecture

2.3.1 The Features of the Platform for Lexicographers’ Tools

The acronym DEB II denotes a platform or framework for building (especially) dictionary writing applications. It is based on the client/server architecture, thus the application falls into two parts (see the schema in Figure 2.12). The server provides the majority of the required functions, while each client part serves as a graphical user interface which transfers the user’s requests to the server, which returns the requested data. The server part is built from small parts, called servlets, which allow a modular composition of all services. The clients communicate with the servlets using HTTP requests in a manner similar to the recently popular concept in web development called AJAX (Asynchronous JavaScript and XML [RM98]). The data are transported (using plain HTTP) in the RDF, generic XML or plain-text formats or they are marshalled using the JSON (JavaScript Object Notation [Cro06]) data structure encapsulation.
The actual data storage backend on the server side is provided by Oracle Berkeley DB XML [DBX07, CRZ03, BS05], which is a native XML database providing XPath and XQuery access into a set of document containers. The metadata are stored in the widely-used Berkeley DB embedded database, which runs on many systems and devices ranging from Linux and Windows operating systems to mobile phones. Oracle Berkeley DB XML comes in the form of a C++ library with interfaces to many scripting languages.

Since the client applications are mostly oriented to graphical user interfaces (GUI), we have decided to adopt the concepts of the Mozilla Development Platform [O+02]. The Firefox Web browser is one of the many applications created using this platform. Other applications include the Mozilla Thunderbird mail client, the Netscape Web browser, the Komodo integrated development environment or the Nvu web page editor. The Mozilla Cross Platform Engine provides a clear separation between application logic and definition, presentation and language-specific texts. The application design is simple and allows concurrent work of different team members, which leads to significant time savings.

The Mozilla platform is open source free software, which ensures that it will stay free and its development will continue. Every new major version adds more features and possibilities. Also, thanks to the open source design, there is a large number of free extensions of existing applications or of the platform itself. Mozilla developers pay much attention to security and any reported bugs are promptly fixed. Applications built on the Mozilla platform work within many operating systems, actually any OS on which Mozilla runs (i.e. officially Windows, Linux, and Mac OS X, unofficially many others). The platform also provides an easy way (both for developers and users) of application installation and update.
The main “programming language” used for the GUI design of the DEB clients is called XUL (XML User-interface Language, pronounced “zool”). XUL is a user interface description language based on XML. It allows relatively simple creation of cross-platform applications with the possibility of easy customization of design, texts and localization. XUL itself is aimed mostly at the creation of user interfaces, e.g. windows, buttons or toolbars, but it incorporates a wide range of standardized technologies:
• Cascading Style Sheets (CSS) for describing the graphic appearance of the application,
• JavaScript as a programming language for simple application logic,
• Document Object Model (DOM), XSLT and XPath to work with HTML and XML documents,
• DTD for easy localization,
• RDF as a data source.

2.3.1.1 The DEB Server Side

The server side of DEB is implemented in the programming language called Ruby [TH01, RUB07]. Ruby (originating in Japan) is an object-oriented, interpreted programming language with dynamic typing. The DEB server also uses various additional libraries, both pure Ruby ones and interfaces to C/C++ libraries:
• REXML (XML processing) and the WEBrick HTTP and SOAP4R SOAP servers (client–server communication). These modules are pure Ruby.
• Oracle Berkeley DB XML API (storage backend).
• ICU (International Components for Unicode) by IBM – language dependent character manipulation, sorting and formatting [ICU07], libxslt and libxml2 (XSLT and additional XML processing) from the GNOME project. We actively participate in the development of both the ICU and the libxml2/libxslt bindings to the Ruby programming language.
• the GRASS geographical information system (GIS; see footnote 13) interface used in several client applications for displaying the geographical linkage of the linguistic data.
• SQL interface (connection to classical relational databases, used e.g. for the database of geographical data).
The DEB server suite runs on the Linux operating system. Currently it is tested with Ubuntu Dapper on the Intel x86 and AMD64 architectures, but it should generally run on any recent UNIX-based system (including Mac OS X). Current DEB server modules (i.e. the servlets) include:
• generic document servlet – serves data from a DB XML container, supports querying the database, fetching individual documents and storage of documents or XSLT transformation of the output;
• SQL servlet – provides an interface to relational data in a PostgreSQL (or other SQL) database;
• various specific servlets based on the generic document servlet – provide additional functions over XML data stored in the DB XML container;
• GRASS servlet – provides an interface to the GRASS GIS, it is used for map generation.

Footnote 13: see [NM04, GRA07].

2.3.2 Assets of the DEB Platform

The DEB platform is based on the client-server architecture, which brings along a lot of benefits. All the data are stored on the server and a considerable part of the functionality is also implemented on the server, while the client applications can be very lightweight. This approach provides an adequate basis for team cooperation; data modifications are immediately seen by all the users. The server part also provides authentication and authorization mechanisms.

The server can offer different interfaces using the same data structure and these interfaces can be reused by many client applications. For example, several client applications use the same interface to query XML based dictionaries (with different underlying structures). Although the clients are usually created using the Mozilla platform, the client software can be implemented in any way – it may be coded in any programming language or may even take the form of a simple web page.

One of the main benefits of developing a dictionary writing system on the DEB platform is the homogeneity of the data structure and presentation.
If the application administrator commits a change in the data presentation, this change automatically appears in each client. Likewise, any data flaws discovered can be instantly corrected; there is no need to change the client software or to provide new data files to each client. The data sources can be implemented with different structures, which the server transforms seamlessly to a homogeneous form that is then provided to the client applications.

Of course, a drawback of the client-server architecture is that an operating server is necessary for a fully functional application. However, in special situations, the server can be installed within a local environment, or, to allow simple off-line WordNet editing, the client may work in a degraded manner without a permanent connection to the server.

2.3.3 The DEB Administration Interface

Initially, the DEB server was developed with just command-line management of dictionaries and administration of user passwords for authentication. The configuration was realized by structured text files and data processing scripts. After the client applications had spread to more users world-wide and had been used, e.g., in several national WordNet projects (Dutch, Polish, Hungarian, Slovenian or Afrikaans WordNets), a more sophisticated administration interface for the DEB users and dictionaries was created by Adam Rambousek in the MU NLP Centre under the supervision of Aleš Horák. The interface was gradually transformed into a general and complex dictionary management application for the whole DEB server.

2.3.3.1 Overall Design Goals

The DEB server packages are currently being deployed on several servers in different organizations and often more than one user needs to administer a single DEB server without having direct server access. Thus, the administration interface must be accessible remotely and without any special tools.
The best choice for this task is a web-based interface, where the user needs just a web browser.

The interface should support easy administration of all the server areas. Of course, the main area of a dictionary management server is the dictionary management. Each dictionary is described with several basic attributes, like its name and code, the filename of its storage in the DB XML database, its dictionary type, the XML schema or indexed elements or XSLT templates for output displaying. Also, some projects may need extra specific settings – e.g. the DEBVisDic clients need to store information about the inter-dictionary links. After the dictionary is set up, the interface has to support the import and export of XML data into and from the DB XML format.

When the administrator sets up the server dictionaries, these can be grouped into “services.” A service is one individual part of the DEB server, usually used for one particular project. For example, DEBVisDic or DEBDict are separate services, but they share the same base libraries and management database. Several services can access the same dictionaries, each providing a different view of the data.

The user accounts are shared between all the services. Thanks to the database sharing between services, each user needs just one account for all the services he or she may use. The administrator can restrict access to selected services and for each service, more detailed access permissions can be set for each dictionary (read-only, read-write, update, . . . , see Figures 2.13 and 2.14). The actual usage of the dictionary access permissions depends completely on the service implementation. This means that one service can ignore the permissions altogether and another service can use complex access rights. Apart from access rights, the user account management provides all the needed functions – it allows creating, modifying and deleting user accounts.
Each user can log in to the administration interface and change his or her password. In case the user forgets the password, he or she can ask for a new random password.

To ease the deployment of the DEB platform, we are experimenting with the automated creation of client applications. Currently, the server is able to create straightforward applications based on the Relax NG Schema [vdV03] of the dictionary, and we are aiming at the automated creation of client packages for new national WordNets. Another very useful feature is uploading the client source files onto the server using the web interface. This way, the administrator can easily modify web page templates (XSLT) or other files without the need for direct (FTP, SSH) access to the server.

The server administration interface is based on the same principles as the other DEB server dictionaries and modules. The Oracle Berkeley DB XML database provides a storage backend for the administration meta-data. The server-side scripts are developed in the Ruby programming language. All the data about users, dictionaries, permissions and other control data are stored in the DB XML database in the XML format. Each dictionary module of the DEB server uses a common interface to access data from this administration database. The administration module provides several services – user authentication, access rights control, entry locking and journaling of dictionary changes.

Figure 2.13: User management showing how access rights modify the dictionary list in DEBDict; the list for the selected user is on the left, the list of all dictionaries is on the right.

The administration interface is a web-based application where the web pages are generated using an HTTP template, which allows easy design and content modification, and then served to the users by a lightweight web server – WEBrick [San04]. The users are authenticated using the standard HTTP authentication mechanism.
The administration module extends the standard interface for passwords stored in a file and loads the user’s login and password from the XML database. Each change in user accounts or access rights is propagated to all DEB services in real time.

2.3.3.2 The Dictionary Management

For each dictionary, the administrator has to define several attributes (see Figure 2.15). The minimal set of attributes contains a unique dictionary code, a database filename and a dictionary class (the implementation class in Ruby); the other attributes are more or less optional.

Figure 2.14: XML entry for the user from Figure 2.13.

Figure 2.15: Dictionary management showing basic information and indexed elements for the Czech WordNet dictionary.

The meaning of the dictionary attributes is:
• the dictionary name is displayed to users by the client application.
• the definition of the XML entry root tag and its key element are needed for XML import and for searching (in case the application does not have its own, more complex search method).
• indexes speed up search operations, so each element or attribute that is used in user queries should be indexed.
• XSLT templates transform XML data to another form suitable for presentation or machine processing.
Extra dictionary attributes are required for the WordNet dictionaries:
• each WordNet dictionary is linked to the client software by the client package code.
• the WordNet dictionaries can refer to each other using the specified “equivalence tags.”
• in the next field, the administrator can list dictionaries that should be reloaded after an edit action in the client (usually in another dictionary).
• the last option specifies related dictionaries – for example, several national WordNets linked with the ILI (Inter-Lingual Index).
It is possible to display the same entry in different languages or to copy entries between languages.

2.3.3.3 Import and Export

The import function takes an XML file and stores the data into the DB XML database. The XML file has to be uploaded to the server (it is possible to upload it through the web interface). All entries must share the same root tag (specified in the dictionary management); entries with different root tags are ignored. The administrator can choose whether to delete all the entries from the database before the import or just to add the new entries.

The import utilizes two methods for XML reading. The first method loads the whole XML file into memory and uses an XML parser on the big document. This method is accurate; unfortunately, it has exponential time complexity, so it can take hours for large XML files (over 10 MB). The second method uses regular expressions to read the entries one by one from the XML file and then each single entry is parsed. Entries are stored in the database with the value of the specified key tag as a unique key. The administrator is informed about the import progress on the web page – the number of processed entries, the total number of entries, the estimated time till the end and the last ten entry keys are displayed.

The administration module also supports export from the database to a plain XML file; the output files may be compressed to save disk space. The export also has an option to save the file in the form of a Ruby language script that will set up the database and import the initial data. This is needed for the administration database itself. The output files are saved in a specified directory on the server and the administrator is informed about the export progress. Once the export ends, the administrator is offered a link to download the file through the web interface. The same function is also used for the daily database backup.
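The second import method described in this section (regular expressions that cut out one entry at a time, followed by parsing each entry separately) can be sketched roughly as follows. The DEB server is written in Ruby and its actual import code is not shown here, so this Python version is only an illustration; the function names, the in-memory dictionary standing in for the DB XML container, and the assumption that entries with the same root tag are never nested are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def iter_entries(xml_text, root_tag):
    """Yield parsed entries one by one, without parsing the whole file.

    Assumes entries with the given root tag are not nested in each other.
    """
    tag = re.escape(root_tag)
    pattern = re.compile(r"<{0}(?:\s[^>]*)?>.*?</{0}>".format(tag), re.DOTALL)
    for match in pattern.finditer(xml_text):
        # Only the small matched fragment is handed to the XML parser.
        yield ET.fromstring(match.group(0))


def import_entries(xml_text, root_tag, key_tag, store=None):
    """Store each entry under the value of its key tag.

    `store` is a plain dict used here as a stand-in for the database.
    Elements with a different root tag never match and are thus ignored.
    """
    store = {} if store is None else store
    for entry in iter_entries(xml_text, root_tag):
        key = entry.findtext(key_tag)
        if key is None:
            continue  # entry without a key cannot be stored
        store[key] = ET.tostring(entry, encoding="unicode")
    return store


if __name__ == "__main__":
    data = (
        "<dict><entry><headword>dog</headword><sense>animal</sense></entry>"
        "<other><headword>x</headword></other>"
        "<entry><headword>cat</headword></entry></dict>"
    )
    for key in sorted(import_entries(data, "entry", "headword")):
        print(key)
```

The point of this design is that the parser only ever sees one small entry at a time, so the cost of parsing grows with the number of entries rather than with the size of one large document tree.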
2.3.3.4 Locking and Sequences of Identifiers

The administration interface offers entry locking management to other DEB server modules. If multiple users can edit the database at the same time (which is one of the basic advantages of the client-server architecture), it is crucial to provide exclusive write locking of entries so that two users are not able to edit the same entry at the same time. Decisions about entry locking depend on each application design: 1) when should an entry be locked and unlocked? 2) should only the edited entry be locked or should the locking affect other entries too? An application then sends the request to the administration module, which updates the lock database.

The administration module provides several functions – besides simple lock and unlock functions, it can tell which user has locked a given entry, return the list of locks for a selected user and/or dictionary, or group several locks together if they are related. The administrator has access to the list of all locks and he or she can also delete chosen locks if the application did not release them correctly.

Newly created entries should have a unique identifier. If the application does not generate its own identifiers, the administration module can provide such a service. It is possible to set an identifier pattern for each dictionary – this pattern looks like CZE-[id], where [id] is replaced with a sequentially increasing number. The administrator can also adjust the number used.

2.3.3.5 The Installation Packages

The administration interface supports the automated creation of Firefox Extension installation packages (XPI). If the administrator specifies a Relax NG schema for the dictionary, it is possible to automatically transform this schema to an application design description in the XUL description language and the supporting code in JavaScript.
The application created in this way supports basic forms – single and multiple text fields, select-boxes of specific values or relational links to other dictionaries. It can serve as a basis for custom modifications. Of course, the application is able to connect to the server, load data from the server and save a modified entry back. We are currently working on more complex support for the creation of new packages, mainly for the DEBVisDic client packages.

2.3.4 How To Make a Sample Dictionary

2.3.4.1 New Dictionary Definition

As a first step, the administrator needs to provide basic information about the dictionary. The dictionary data can be loaded from an XML file or built from scratch. The administrator must specify an entry root element, the XPath specification of a unique key, several indexes for fast querying and an XML schema of the entry.

Let us create a demonstration dictionary from scratch. We will name the root element entry and have the unique key identifier in the element /entry/headword. The corresponding Relax NG schema is given in Figure 2.16. This schema describes an entry with one headword element, with a pos attribute, and one or more sense elements. Of course, Relax NG supports the description of much more complex XML structures.

Figure 2.16: A part of the Relax NG dictionary schema.

Figure 2.17: An example client application generated by the DEB administration interface according to the dictionary schema from Figure 2.16.

2.3.4.2 Preparation of an Installation Package

The preparation of a new basic client application package requires the selection of a dictionary and running the package generation function. The administration module checks the Relax NG schema and finds all elements or attributes that contain the text child element. All such elements and attributes are transformed to XUL textbox fields with the respective name as a label describing the field.
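The schema inspection step can be approximated by the following Python sketch. It is an illustration only: the schema text is a hypothetical reconstruction that merely follows the description of the demonstration dictionary (an entry with a headword carrying a pos attribute and one or more sense elements), the function names are invented, and the emitted XUL fragment is a simplified guess at what a generated form might contain, not the output of the DEB administration module.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

# Hypothetical schema written to match the textual description only.
SCHEMA = """
<element name="entry" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="headword">
    <attribute name="pos"><text/></attribute>
    <text/>
  </element>
  <oneOrMore>
    <element name="sense"><text/></element>
  </oneOrMore>
</element>
"""


def _visit(node, repeatable, fields):
    """Collect (name, repeatable) for elements/attributes with a text child."""
    if node.tag in (RNG + "oneOrMore", RNG + "zeroOrMore"):
        for child in node:
            _visit(child, True, fields)
        return
    if node.tag in (RNG + "element", RNG + "attribute"):
        if node.find(RNG + "text") is not None:  # direct <text/> child
            fields.append((node.get("name"), repeatable))
        repeatable = False  # repetition does not carry over to children
    for child in node:
        _visit(child, repeatable, fields)


def text_fields(schema_xml):
    fields = []
    _visit(ET.fromstring(schema_xml.strip()), False, fields)
    return fields


def to_xul(fields):
    """Render one labelled textbox per field; add/remove buttons if repeatable."""
    rows = []
    for name, repeatable in fields:
        row = '<hbox><label value="{0}"/><textbox id="{0}"/>'.format(name)
        if repeatable:
            row += '<button label="+"/><button label="-"/>'
        rows.append(row + "</hbox>")
    return "\n".join(rows)


if __name__ == "__main__":
    print(to_xul(text_fields(SCHEMA)))
```

For this schema the sketch finds three text fields (headword, pos and sense) and marks only sense as repeatable, because it is the only one wrapped in oneOrMore.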
If an element can occur multiple times in the entry (such as the sense tag in our example), buttons for adding and removing the textbox are added to the application form, too.

The created JavaScript supports loading and saving documents and also searching for documents. The application thus enables querying each indexed field specified in the dictionary management interface. For example, users can easily find all nouns. All the created application files are then packaged into the Firefox extension installation package (XPI). Users can download this package for installation or the individual files for editing. An example of the resulting application is shown in Figure 2.17. For the new client, there are also two basic preview templates (in XSLT) saved on the server side. One provides a basic entry preview displaying all the data and the second displays the raw XML data.

Figure 2.18: A web service automatically built following a dictionary schema

For certain environments that either do not allow users to install new software packages or where the deployment of the software would be too time consuming, the DEB II server is able to generate a simple web service (see Figure 2.18). As with the XPI package generation, this function uses the dictionary Relax NG schema and generates a XUL form for remote access. To work with the dictionary, a user needs a web browser based on the Mozilla engine (Firefox, SeaMonkey, Netscape, Camino, . . . ). All parts of the generated web service are easily customizable via XSLT templates.

Figure 2.19: Change of a textbox field to a drop-down list (panels a and b).

2.3.4.3 Application Customization

Thanks to the design of applications based on the Mozilla development platform, these applications are easily customizable.
Any change in the layout and design of the form is done by editing the XUL (XML User-interface Language) files accompanied by standard CSS stylesheets. The application logic (i.e. the procedures implemented in JavaScript) stays the same for a new layout. The combination of the XUL and CSS languages is very powerful and supports a long list of features that are commonly used in desktop applications. To give a simple example, we can show how to change the Part-Of-Speech textbox field into a drop-down list, see Figure 2.19.

The application localization (translation of the user interface to another language) is one of the core features of the Mozilla XPI packages. As we can see in the example, the field labels contain XML element names only (marked with &). This allows the application designer to change them to general textual labels that are more informative to users. The actual texts are stored in a DTD (Document Type Definition) file as XML entity definitions, where they can be adjusted to any texts in one place. For the sake of the above mentioned localization of the application, it is possible to include several DTD files for different languages

application.xul: