HPSG parser & CCG parser
Lecture 9
qkang@fi.muni.cz
Syntactic formalisms for natural language parsing
FI MU autumn 2011

Outline
● HPSG Parser: Enju
  – Parsing method
  – Description of parser
  – Result
● CCG Parser: C&C Tools
  – Parsing method
  – Description of parser
  – Result
● Theoretical backgrounds
  – Lecture 3 about HPSG parsing
  – Lectures 6 & 7 about CCG parsing and combinatory logic

Enju (Y. Miyao, J. Tsujii, 2004, 2008)
● Syntactic parser for English
● Developed by the Tsujii Lab. of the University of Tokyo
● Based on a wide-coverage probabilistic HPSG
  – HPSG theory [Pollard and Sag, 1994]
● Useful links to Enju
  – http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html
  – http://www-tsujii.is.s.u-tokyo.ac.jp/enju/

Motivations
● Parsing based on a proper linguistic formalism is one of the core research fields in CL and NLP.
● But it is a monolithic, esoteric and inward-looking field, largely dissociated from real-world applications.
● So why not integrate linguistic grammar formalisms with statistical models, to obtain a parser that is robust, efficient and open to eclectic sources of information other than syntactic ones?

Two main ideas
● Development of wide-coverage linguistic grammars
● Deep parser which produces semantic representations (predicate-argument structures)

Parsing method
● Application of a probabilistic model to the HPSG grammar and development of an efficient parsing algorithm
  – Accurate deep analysis
  – Disambiguation
  – Wide coverage
  – High speed
  – Useful for high-level NLP applications

1. Parsing based on HPSG
  – Mathematically well-defined, with a sophisticated constraint-based system
  – Linguistically justified
  – Deep syntactic grammar that provides semantic analysis
● Difficulties in parsing based on HPSG
  ➢ Difficult to develop a broad-coverage HPSG grammar
  ➢ Difficult to disambiguate
  ➢ Low efficiency: very slow
● Solution: corpus-oriented development of an HPSG grammar
  – The principal aim of grammar development is treebank construction
  – The Penn Treebank is converted into an HPSG treebank
  – A lexicon and a probabilistic model are extracted from the HPSG treebank
● Approach:
  ➢ develop grammar rules and an HPSG treebank
  ➢ collect lexical entries from the HPSG treebank
● How to make an HPSG treebank? Convert the Penn Treebank into HPSG and develop the grammar by restructuring the treebank in conformity with the HPSG grammar rules.
● HPSG = lexical entries and grammar rules
  – The Enju grammar has 12 grammar rules and 3,797 lexical entries for 10,536 words (Miyao et al. 2004)

Overview of grammar development
1. Treebank conversion: modify constituent structures by adding feature structures
2. Grammar rule application: apply the grammar rules when a parse tree contains the correct analysis and the specified feature values are filled
3. Lexical entry collection: collect the terminal nodes of HPSG parse trees and assign predicate-argument structures

2. Probabilistic model and HPSG
● Log-linear model for unification-based grammars (Abney 1997, Johnson et al. 1999, Riezler et al. 2000, Miyao et al. 2003, Malouf and van Noord 2004, Kaplan et al. 2004, Miyao and Tsujii 2005)
● p(T|w), where
  – w = "A blue eyes girl with white hair and skin walked ..."
  – T = a parse tree of w (tree figure not reproduced)
● T ranges over all possible parse trees derived from w with the grammar. For example, p(T3|w) is the probability of selecting T3 from T1, T2, ..., Tn.
● Log-linear model for unification-based grammars
  – Input sentence w: a sequence of word/POS pairs, w = w1/P1, w2/P2, ..., wn/Pn
  – Output: parse tree T

\[ p(T \mid \mathbf{w}) = \frac{1}{Z} \exp\Big( \sum_{u} \lambda_u f_u(T) \Big) \]

  – Z: normalization factor
  – \lambda_u: weight for a feature function
  – f_u: feature function

Description of parser
Parsing proceeds in the following steps:
1. Preprocessing
  ● The preprocessor converts an input sentence into a word lattice.
2. Lexicon lookup
  ● The parser uses the predicate to find lexical entries for the word lattice.
3. Kernel parsing
  ● The parser performs phrase analysis using the defined grammar rules.

● Chart
  ➢ the parser's data structure
  ➢ a two-dimensional table
  ➢ each cell in the table is called a `CKY cell'
● Example: let the input sentence be s = w1, w2, w3, ..., wn with w1 = "I", w2 = "saw", w3 = "a", w4 = "girl", w5 = "with", w6 = "a", w7 = "telescope". The slide then shows how the chart is arranged for the sentence "I saw a girl with a telescope" (chart figure not reproduced).

System overview
[Figure: for the input "Mary loved John", the stages shown are Supertagger, Enumeration of assignments, and Deterministic disambiguation. Each word is assigned candidate lexical entries, drawn as feature structures with HEAD, SUBJ and COMPS (e.g. HEAD noun, SUBJ < >, COMPS < > for the nouns; HEAD verb with SUBJ and COMPS for "loved").]

● Demonstration
  – http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html

Results
● Fast, robust and accurate analysis
  – Phrase structures
  – Predicate-argument structures
● Accurate deep analysis: the parser can output both phrase structures and predicate-argument structures. The accuracy of predicate-argument relations is around 90% for newswire articles and biomedical papers.
● High speed: parsing takes less than 500 ms per sentence by default (faster than most Penn Treebank parsers), and less than 50 ms per sentence with the high-speed setting ("mogura").
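The chart from the parser description (a two-dimensional table of CKY cells) can be sketched in Python for the example sentence. This is only an illustration and not Enju's implementation: the names `build_chart` and `split_points`, and the choice of a dict keyed by `(start, end)` word positions, are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) word positions, end exclusive


def build_chart(words: List[str]) -> Dict[Span, List[str]]:
    """Create a CKY chart with one cell per contiguous span of the input."""
    n = len(words)
    chart: Dict[Span, List[str]] = {}
    for length in range(1, n + 1):              # span length
        for start in range(0, n - length + 1):  # span start
            chart[(start, start + length)] = []
    # Length-1 cells hold the words here; a real parser would store the
    # lexical entries found by lexicon lookup instead.
    for i, word in enumerate(words):
        chart[(i, i + 1)].append(word)
    return chart


def split_points(span: Span) -> List[Tuple[Span, Span]]:
    """All ways to split a span into two adjacent sub-spans (binary rules)."""
    start, end = span
    return [((start, mid), (mid, end)) for mid in range(start + 1, end)]


words = "I saw a girl with a telescope".split()
chart = build_chart(words)
print(len(chart))            # number of CKY cells for a 7-word sentence
print(chart[(3, 4)])         # cell covering "girl"
print(split_points((2, 4)))  # sub-cells that can combine into "a girl"
```

Kernel parsing would then fill the longer spans bottom-up by applying grammar rules to each pair of sub-cells returned by `split_points`.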

C&C tools
● Developed by Curran and Clark [Clark and Curran, 2002; Curran, Clark and Bos, 2007], University of Edinburgh
● Wide-coverage statistical parser based on CCG: the CCG Parser
● Computational semantics tool named Boxer
● Useful links
  – http://svn.ask.it.usyd.edu.au/trac/candc
  – http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Demo

CCG Parser [Clark, 2007]
● Statistical parsing and CCG
● Advantages of CCG
  ➢ it provides a compositional semantics for the grammar → completely transparent interface between syntax and semantics
  ➢ the recovery of long-range dependencies can be integrated into the parsing process in a straightforward manner

Parsing method
● Penn Treebank conversion: TAG, LFG, HPSG and CCG
● CCGBank [Hockenmaier and Steedman, 2007]
  ➢ CCG version of the Penn Treebank
  ➢ Grammar used in the CCG parser
[Diagram: CCGBank supplies the lexical category set, some of the rules used as the grammar, and the training data for the statistical models; the components shown are the Supertagger and the Parser.]

Parsing method (Cont.): CCGBank
● A corpus translated from the Penn Treebank, CCGBank contains
  – Syntactic derivations
  – Word-word dependencies
  – Predicate-argument structures
● Semi-automatic conversion of the phrase-structure trees in the Penn Treebank into CCG derivations
● Consists mainly of newspaper texts
● Grammar:
  – Lexical category set
  – Combinatory rules
  – Unary type-changing rules
  – Normal-form constraints
  – Punctuation rules

Parsing method (Cont.)
● Supertagging [Clark, 2002]: conditional maximum entropy models are used to implement a maximum entropy supertagger
● Set of 425 lexical categories from CCGBank
● The per-word accuracy of the supertagger is around 92% on unseen WSJ text.
  → Using the multi-supertagger increases the accuracy significantly (to over 98%) with only a small cost in increased ambiguity.
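The trade-off between accuracy and ambiguity in multi-tagging can be illustrated with a small Python sketch. It assumes the commonly described scheme in which a multi-supertagger keeps, for each word, every category whose probability is within a factor `beta` of the most probable one. The function name `multitag`, the values of `beta`, and the probabilities are invented for illustration and are not taken from the C&C tools.

```python
from typing import Dict, List


def multitag(probs: Dict[str, float], beta: float) -> List[str]:
    """Keep every category whose probability is at least beta * best."""
    best = max(probs.values())
    kept = [cat for cat, p in probs.items() if p >= beta * best]
    # Most probable category first.
    return sorted(kept, key=lambda cat: -probs[cat])


# Hypothetical distribution over CCG lexical categories for the word "saw".
saw = {r"(S\NP)/NP": 0.70, "N": 0.20, r"(S\NP)/PP": 0.08, "N/N": 0.02}

print(multitag(saw, beta=1.0))  # single best category only
print(multitag(saw, beta=0.1))  # more categories kept, more ambiguity
```

A smaller `beta` makes it more likely that the correct category is among those passed to the parser, at the price of more categories per word.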

Parsing method (Cont.): Supertagger
● Log-linear models in NLP applications:
  ➢ POS tagging
  ➢ Named entity recognition
  ➢ Chunking
  ➢ Parsing
  → also referred to as maximum entropy models and random fields
● Log-linear parsing models for CCG
  1) the dependency model: the probability of a dependency structure
  2) the normal-form model: the probability of a single derivation
  → modeling 2) is simpler than 1)
● 1) is defined as a sum over Δ(π), the set of derivations that lead to the dependency structure π:

\[ P(\pi \mid S) = \sum_{d \in \Delta(\pi)} P(d, \pi \mid S) \]

● 2) is defined using a log-linear form, where ω is a parse, ρ(S) is the set of possible parses of S, and Z_S is the normalization factor:

\[ P(\omega \mid S) = \frac{1}{Z_S} e^{\lambda \cdot f(\omega)}, \qquad Z_S = \sum_{\omega' \in \rho(S)} e^{\lambda \cdot f(\omega')} \]

● Features common to the dependency and normal-form models (table not reproduced)
● Predicate-argument dependency features for the dependency model (table not reproduced)
● Rule dependency features for the normal-form model (table not reproduced)

Description of parser
[Diagram: the components shown are CCGBank, the C&C taggers (Supertagger, POS tagger, Chunker), the input sentence, the Parser, and Boxer.]

● Demonstration
  – http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Demo

Results
● Supertagger ambiguity and accuracy on section 00 (table not reproduced)
● Parsing accuracy on DepBank (table not reproduced)
  – DepBank: PARC Dependency Bank [King et al. 2003]
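The log-linear form used for the normal-form model can be worked through numerically. The following Python sketch is a toy example and not the C&C implementation: the feature vectors, the weights, and the helper name `parse_probabilities` are invented for illustration, and the candidate parses are listed explicitly, whereas a real parser would compute the normalization over a packed chart.

```python
import math
from typing import Dict, List


def parse_probabilities(features: Dict[str, List[float]],
                        weights: List[float]) -> Dict[str, float]:
    """P(parse | S) = exp(lambda . f(parse)) / Z_S over the candidate parses."""
    scores = {
        name: sum(w * f for w, f in zip(weights, fvec))
        for name, fvec in features.items()
    }
    # Subtract the maximum score before exponentiating for numerical
    # stability; the shift cancels out in the ratio.
    top = max(scores.values())
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    z = sum(exps.values())  # normalization factor Z_S (rescaled by exp(-top))
    return {name: e / z for name, e in exps.items()}


# Hypothetical feature counts f(parse) for three candidate parses of S.
candidates = {
    "parse1": [1.0, 0.0, 2.0],
    "parse2": [0.0, 1.0, 1.0],
    "parse3": [1.0, 1.0, 0.0],
}
weights = [0.5, -0.2, 0.3]  # hypothetical weight vector lambda

probs = parse_probabilities(candidates, weights)
best = max(probs, key=probs.get)
print(probs)
print(best)
```

Because Z_S is the same for every parse of S, finding the most probable parse only requires comparing the unnormalized scores; the normalization matters when the weights are estimated from training data.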