Statistical parsing & Statistical parsers
Lecture 10
qkang@fi.muni.cz
Syntactic formalisms for natural language parsing
IA161, FI MU, autumn 2011

Study materials
➢ Course materials and homeworks are available on the following web site:
https://is.muni.cz/course/fi/autumn2011/IA161
➢ Refer to Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, D. Jurafsky and J. H. Martin, Prentice Hall, New Jersey, 2000.

Outline
● Introduction to statistical parsing methods
● Statistical parsers
● RASP system
● Stanford parser
● Collins parser
● Charniak parser
● Berkeley parser

1. Introduction to statistical parsing
● The main theoretical approaches behind modern statistical parsers
● Over the last 12 years statistical parsing has advanced significantly!
● NLP researchers have produced a range of statistical parsers
→ wide coverage and robust parsing accuracy
● They continue to improve the parsers year on year.

Application domains of statistical parsing
● High-precision question answering systems
● Named entity extraction
● Syntactically based sentence compression
● Extraction of people's opinions about products
● Improved interaction in computer games
● Helping linguists find data

NLP parsing problem and solution
● The structure of language is ambiguous!
→ local and global ambiguities
● Classical parsing problem
→ a simple grammar of 10 rules can generate 592 parses
→ a real-size wide-coverage grammar generates millions of parses

NLP parsing problem and solution (Cont.)
● NLP parsing solution: we need mechanisms that allow us to find the most likely parses
→ statistical parsing lets us work with very loose grammars that admit millions of parses for a sentence, but still quickly find the best parses

Improved methodology for robust parsing
● The annotated data: Penn Treebank (early 90's)
● Building a treebank seems a lot slower and less useful than building a grammar
● But it has many helpful properties:
● Reusability of the labor
● Broad coverage
● Frequencies and distributional information
● A way to evaluate systems

Characterization of Statistical parsing
● What is the grammar which determines the set of legal syntactic structures for a sentence? How is that grammar obtained?
● What is the algorithm for determining the set of legal parses for a sentence?
● What is the model for determining the probability of different parses for a sentence?
● What is the algorithm which, given the model and a set of possible parses, finds the best parse?

Characterization of Statistical parsing (Cont.)
Tbest = argmax_T Score(T, S)
● Two components:
➢ The model: a function Score which assigns scores (probabilities) to tree and sentence pairs
➢ The parser: the algorithm which implements the search for Tbest

Characterization of Statistical parsing (Cont.)
● Statistical parsing is seen as more of a pattern recognition/machine learning problem plus search
➢ The grammar is only implicitly defined by the training data and the method used by the parser for generating hypotheses

Statistical parsing models
● A probabilistic approach would suggest the following for the Score function:
Score(T, S) = P(T|S)
● Lots of research on different probability models for Penn Treebank trees
➢ Generative models, log-linear (maximum entropy) models, ...
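To make Score(T, S) and the argmax concrete, here is a minimal sketch (not part of the original slides) of how a generative model ranks candidate trees: a hand-written toy PCFG scores two analyses of a classic PP-attachment ambiguity, and the "parser" is just argmax over the two candidates. The grammar, probabilities and sentence are invented for illustration; for trees whose yield is the given sentence, ranking by P(T) is equivalent to ranking by P(T|S), and a real parser searches a packed forest rather than enumerating trees.

```python
from math import prod

# Toy PCFG: (lhs, rhs) -> probability.  Rules with the same left-hand side
# sum to 1.  All numbers are made up for illustration.
PCFG = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("NP", "PP")): 0.3,
    ("NP", ("astronomers",)): 0.2,
    ("NP", ("stars",)): 0.3,
    ("NP", ("telescopes",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V",  ("saw",)): 1.0,
    ("P",  ("with",)): 1.0,
}

def tree_prob(tree):
    """P(T) = product of the probabilities of all rules used in T.
    A tree is (label, child, child, ...); a leaf is a bare string."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return PCFG[(label, rhs)] * prod(tree_prob(c) for c in children)

# Two analyses of "astronomers saw stars with telescopes":
# t1 attaches the PP to the noun "stars", t2 attaches it to the verb.
t1 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"),
             ("NP", ("NP", "stars"),
                    ("PP", ("P", "with"), ("NP", "telescopes")))))
t2 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"),
             ("NP", "stars"),
             ("PP", ("P", "with"), ("NP", "telescopes"))))

for name, tree in (("noun attachment", t1), ("verb attachment", t2)):
    print(name, tree_prob(tree))

t_best = max((t1, t2), key=tree_prob)   # Tbest = argmax_T Score(T, S)
```

With these made-up numbers the verb-attachment analysis wins (0.0048 vs. 0.00216); changing the rule probabilities changes which tree the model prefers, which is exactly the role of the training data in a real statistical parser.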
2. Statistical parsers
● Many kinds of parsers based on statistical methods: probability, machine learning
● Different objectives: research, commercial, pedagogical
➢ RASP, Stanford parser, Berkeley parser, ...

RASP system
Robust Accurate Statistical Parsing (2nd release): [Briscoe & Carroll, 2002; Briscoe et al., 2006]
● System for syntactic annotation of free text
● Semantically-motivated output representation
● Enhanced grammar and part-of-speech tagger lexicon
● Flexible and semi-supervised training method for the structural parse ranking model
Useful links to RASP:
http://ilexir.co.uk/applications/rasp/download/
http://www.informatics.susx.ac.uk/research/groups/nlp/rasp/

Components of the system
● Input: unannotated text or transcribed (and punctuated) speech
● 1st step: sentence boundary detection and tokenisation modules
● 2nd step: tokenized text is tagged with one of 150 POS and punctuation labels (derived from the CLAWS tagset)
→ first-order ('bigram') HMM tagger
→ trained on the manually corrected tagged versions of the Susanne, LOB and BNC corpora

Components of the system (Cont.)
● 3rd step: morphological analyzer
● 4th step: manually developed wide-coverage tag sequence grammar in the parser
→ 689 unification-based phrase structure rules
→ the preterminals of this grammar are the POS and punctuation tags
→ the terminals are featural descriptions of the preterminals
→ non-terminals project information up the tree using an X-bar scheme with 41 attributes with a maximum of 33 atomic values

Components of the system (Cont.)
● 5th step: generalized LR parser
→ a non-deterministic LALR table is constructed automatically from a CF 'backbone' compiled from the feature-based grammar
→ the parser builds a packed parse forest using this table to guide the actions it performs
→ the n-best parses can be efficiently extracted by unpacking sub-analyses, following pointers to contained sub-analyses and choosing alternatives in order of probabilistic ranking

Components of the system (Cont.)
● Output: a set of named grammatical relations (GRs)
→ the resulting set of ranked parses can be displayed or passed on for further processing
→ transformation of derivation trees into a set of named GRs
→ the GR scheme captures aspects of predicate-argument structure

Evaluation
● The system has been evaluated using a re-annotation of the PARC dependency bank (DepBank; King et al., 2003)
● It consists of 560 sentences chosen randomly from section 23 of the WSJ, with grammatical relations made compatible with the RASP system
● Form of relations: (relation subtype head dependent initial)
→ relation: the type of relationship between the head and the dependent
→ subtype: encodes additional specifications of the relation type for some relations
→ initial: the initial or underlying logical relation of the grammatical subject in constructions such as passives

Evaluation (Cont.)
● Micro-averaged precision, recall and F1 scores are calculated from the counts for all relations in the hierarchy
● Macro-averaged scores are the mean of the individual scores for each relation
● Micro-averaged F1 score of 76.3% across all relations
[Table: Parsing accuracy on DepBank, Briscoe et al., 2006]
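The difference between the two averaging schemes above is easy to see in code. A minimal sketch follows; the relation names and the per-relation counts are invented for illustration and are not the DepBank figures.

```python
# Per-relation counts: (true positives, false positives, false negatives).
# Names and numbers are made up for illustration.
counts = {
    "nsubj": (900, 100, 80),
    "dobj":  (500, 60, 70),
    "iobj":  (20, 15, 25),
}

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Micro-average: pool the counts over all relations, then compute P/R/F1.
# Frequent relations dominate the result.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
print("micro P/R/F1:", prf(tp, fp, fn))

# Macro-average: compute F1 per relation, then take the unweighted mean.
# Every relation counts equally, however rare it is.
f1s = [prf(*c)[2] for c in counts.values()]
print("macro F1:", sum(f1s) / len(f1s))
```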
Stanford parser
Java implementation of probabilistic natural language parsers (version 1.6.9): [Klein and Manning, 2003]
● Parsing system for English; it has also been used for Chinese, German, Arabic, Italian, Bulgarian and Portuguese
● Implementation of both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser
● Useful links
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu:8080/parser/

● Input: various forms of plain text
● Output: various analysis formats
→ Stanford Dependencies (SD): typed dependencies as GRs
→ phrase structure trees
→ POS-tagged text
[Figure: graphical representation of the SD for the sentence "Bell, based in Los Angeles, makes and distributes electronic, computer and building products."]

Stanford typed dependencies [de Marneffe and Manning, 2008]
● provide a simple description of the grammatical relationships in a sentence
● represent all sentence relationships uniformly as typed dependency relations
● quite accessible to non-linguists thinking about tasks involving information extraction from text, and quite effective in relation extraction applications

Stanford typed dependencies [de Marneffe and Manning, 2008] (Cont.)
● For the example sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.
● the Stanford Dependencies (SD) representation is:
➢ nsubj(makes-8, Bell-1)
➢ nsubj(distributes-10, Bell-1)
➢ partmod(Bell-1, based-3)
➢ nn(Angeles-6, Los-5)
➢ prep_in(based-3, Angeles-6)
➢ root(ROOT-0, makes-8)
➢ conj_and(makes-8, distributes-10)
➢ amod(products-16, electronic-11)
➢ conj_and(electronic-11, computer-13)
➢ amod(products-16, computer-13)
➢ conj_and(electronic-11, building-15)
➢ amod(products-16, building-15)
➢ dobj(makes-8, products-16)
➢ dobj(distributes-10, products-16)

Output
● Example sentence: A line-up of masseurs was waiting to take the media in hand.
POS tagged text:
Parsing [sent. 4 len. 13]: [A, line-up, of, masseurs, was, waiting, to, take, the, media, in, hand, .]
CFPSG representation:
(ROOT
  (S
    (NP
      (NP (DT A) (NN line-up))
      (PP (IN of)
        (NP (NNS masseurs))))
    (VP (VBD was)
      (VP (VBG waiting)
        (S
          (VP (TO to)
            (VP (VB take)
              (NP (DT the) (NNS media))
              (PP (IN in)
                (NP (NN hand))))))))
    (. .)))
Typed dependencies representation:
det(line-up-2, A-1)
nsubj(waiting-6, line-up-2)
xsubj(take-8, line-up-2)
prep_of(line-up-2, masseurs-4)
aux(waiting-6, was-5)
root(ROOT-0, waiting-6)
aux(take-8, to-7)
xcomp(waiting-6, take-8)
det(media-10, the-9)
dobj(take-8, media-10)
prep_in(take-8, hand-12)
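The outputs above come from the parser's command-line interface. Below is a minimal sketch of driving it from Python by shelling out to the Java distribution, roughly what the bundled lexparser.sh script does; the file names (stanford-parser.jar, englishPCFG.ser.gz, input.txt) and the heap size are assumptions about a typical 1.6.x installation, so check the script and README shipped with your release for the exact invocation.

```python
import subprocess

# Assumed layout: stanford-parser.jar, the serialized grammar
# englishPCFG.ser.gz and a plain-text input.txt in the working directory.
cmd = [
    "java", "-mx600m",
    "-cp", "stanford-parser.jar",
    "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
    # Request both output formats shown above: the phrase structure tree
    # ("penn") and the typed dependencies.
    "-outputFormat", "penn,typedDependencies",
    "englishPCFG.ser.gz",
    "input.txt",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```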
Berkeley parser
Learning PCFGs, statistical parser (release 1.1, version 09.2009): [Petrov et al., 2006; Petrov and Klein, 2007]
● Parsing system for English; it has also been used for Chinese, German, Arabic, Bulgarian, Portuguese and French
● Implementation of an unlexicalized PCFG parser
● Useful links
http://nlp.cs.berkeley.edu/
http://tomato.banatao.berkeley.edu:8080/parser/parser.html
http://code.google.com/p/berkeleyparser/

Comparison of parsing an example sentence
A line-up of masseurs was waiting to take the media in hand.
[Figure: parse trees produced by the Berkeley parser and the Stanford parser for this sentence]

Charniak parser
Probabilistic LFG F-Structure Parsing: [Charniak, 2000; Bikel, 2002]
● Parsing system for English
● PCFG-based wide-coverage LFG parser
● Useful links
http://nclt.computing.dcu.ie/demos.html
http://lfg-demo.computing.dcu.ie/lfgparser.html

Collins parser
Head-Driven Statistical Models for natural language parsing (release 1.0, version 12.2002): [Collins, 1999]
● Parsing system for English
● Useful links
http://www.cs.columbia.edu/~mcollins/code.html

Bikel's parser
Multilingual statistical parsing engine (release 1.0, version 06.2008): [Charniak, 2000; Bikel, 2002]
● Parsing system for English, Chinese, Arabic, Korean
● Useful links
http://www.cis.upenn.edu/~dbikel/#stat-parser
http://www.cis.upenn.edu/~dbikel/software.html

[Figure: comparing parser speed on section 23 of the WSJ Penn Treebank]
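Speed comparisons of this kind are usually produced by running each parser over the same test set and measuring throughput. The sketch below is parser-agnostic and purely illustrative: parse_fn and the sentence list are placeholders for whichever system and data are being measured, and no parser from the slides is bundled here.

```python
import time

def sentences_per_second(parse_fn, sentences):
    """Time an arbitrary one-sentence parsing callable on a list of
    sentences and return its throughput in sentences per second."""
    start = time.perf_counter()
    for s in sentences:
        parse_fn(s)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed

# Usage sketch: load the sentences of WSJ section 23 (not included here)
# and call sentences_per_second(my_parser_parse, wsj23_sentences) once per
# parser under comparison, reporting the resulting numbers side by side.
```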