Lecture 10
Syntactic Formalisms for Parsing
Natural Languages
Aleš Horák, Miloš Jakubíček, Vojtěch Kovář
(based on slides by Juyeon Kang)
ia161@nlp.fi.muni.cz
Autumn 2013
IA161 Syntactic Formalisms for Parsing Natural Languages 1 / 31
Study materials
Course materials and homework are available on the
following website:
https://is.muni.cz/course/fi/autumn2011/IA161
Outline
Introduction to Statistical parsing methods
Statistical Parsers
RASP system
Stanford parser
Collins parser
Charniak parser
Berkeley parser
1. Introduction to statistical parsing
The main theoretical approaches behind modern statistical
parsers
Over the last 12 years, statistical parsing has made significant progress.
NLP researchers have produced a range of statistical parsers
→ wide-coverage, robust and accurate parsing
The parsers continue to improve year on year.
Application domains of statistical parsing
Question answering systems of high precision
Named entity extraction
Syntactically based sentence compression
Extraction of people's opinions about products
Improved interaction in computer games
Helping linguists find data
NLP parsing problem and solution
The structure of language is ambiguous!
→ local and global ambiguities
Classical parsing problem
→ a simple grammar of 10 rules can generate 592 parses
→ a realistic wide-coverage grammar generates millions of parses
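To make the growth in ambiguity concrete, here is a minimal sketch that counts the parses a small grammar assigns to a sentence, using a CKY-style chart. The grammar, lexicon and sentences are hypothetical toy examples (they are not the 10-rule grammar mentioned above), and the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not from the lecture).
BINARY = {  # parent -> list of (left child, right child)
    "S": [("NP", "VP")],
    "NP": [("NP", "PP"), ("Det", "N")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
}
LEXICON = {  # word -> possible preterminals
    "the": ["Det"], "a": ["Det"],
    "man": ["N"], "dog": ["N"], "park": ["N"], "telescope": ["N"],
    "saw": ["V"], "in": ["P"], "with": ["P"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees for `words` rooted in `root`."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of subtrees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1, tag] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for parent, expansions in BINARY.items():
                    for left, right in expansions:
                        chart[start, end, parent] += (
                            chart[start, split, left] * chart[split, end, right]
                        )
    return chart[0, n, root]

print(count_parses("the man saw the dog".split()))                                   # 1
print(count_parses("the man saw the dog in the park".split()))                       # 2
print(count_parses("the man saw the dog in the park with a telescope".split()))      # 5
```

Each added prepositional phrase multiplies the attachment choices, which is why even a small grammar yields many parses for longer sentences.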
NLP parsing problem and solution
NLP parsing solution
We need mechanisms that allow us to find the most likely
parses
→ statistical parsing lets us work with very loose grammars
that admit millions of parses for a sentence, yet still find
the best parses quickly
Improved methodology for robust parsing
The annotated data: Penn Treebank (early 90’s)
Building a treebank seems a lot slower and less useful than
building a grammar
But a treebank gives us many things:
Reusability of the labor
Broad coverage
Frequencies and distributional information
A way to evaluate systems
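As an illustration of the "frequencies and distributional information" point, the following sketch estimates rule probabilities from a treebank by relative frequency. The two-tree treebank and the nested-tuple tree encoding are assumptions made for this example; a real treebank such as the Penn Treebank would need its own reader.

```python
from collections import Counter

def rules(tree):
    """Yield (parent, child labels) for every node of a nested-tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))  # lexical rule, e.g. NN -> dog
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)

def estimate_rule_probs(treebank):
    """Relative-frequency estimate: count(A -> beta) / count(A)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Hypothetical two-sentence treebank.
TREEBANK = [
    ("S", ("NP", ("NN", "dogs")), ("VP", ("VB", "bark"))),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barks"))),
]

probs = estimate_rule_probs(TREEBANK)
print(probs[("NP", ("NN",))])        # 0.5
print(probs[("S", ("NP", "VP"))])    # 1.0
```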
Characterization of Statistical parsing
What is the grammar that determines the set of legal syntactic
structures for a sentence? How is that grammar obtained?
What is the algorithm for determining the set of legal parses for
a sentence?
What is the model for determining the probability of different
parses for a sentence?
What is the algorithm that, given the model and a set of possible
parses, finds the best parse?
Characterization of Statistical parsing
T_best = arg max_T Score(T, S)
Two components:
The model: a function Score which assigns scores
(probabilities) to tree and sentence pairs
The parser: the algorithm which implements the search for
T_best
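The split into model and parser can be sketched as two functions. This is only an illustration under stated assumptions: the rule probabilities are made-up numbers, lexical probabilities are ignored, and the "parser" is an exhaustive search over an explicit list of candidate trees rather than a real search algorithm.

```python
import math

# Hypothetical rule probabilities (illustrative numbers only).
RULE_PROB = {
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("N",)): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}

def leaves(tree):
    label, *children = tree
    if isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

def log_prob(tree):
    label, *children = tree
    if isinstance(children[0], str):
        return 0.0  # lexical probabilities are ignored in this sketch
    rule = (label, tuple(child[0] for child in children))
    return math.log(RULE_PROB[rule]) + sum(log_prob(c) for c in children)

def score(tree, sentence):
    """The model: a log-probability for a (tree, sentence) pair."""
    if leaves(tree) != sentence:
        return float("-inf")  # the tree does not yield this sentence
    return log_prob(tree)

def parse(sentence, candidates):
    """The parser: search for T_best, here by brute force."""
    return max(candidates, key=lambda tree: score(tree, sentence))

SENTENCE = ["saw", "dogs", "with", "telescopes"]
PP = ("PP", ("P", "with"), ("NP", ("N", "telescopes")))
T_VP = ("VP", ("VP", ("V", "saw"), ("NP", ("N", "dogs"))), PP)   # PP attached to the verb phrase
T_NP = ("VP", ("V", "saw"), ("NP", ("NP", ("N", "dogs")), PP))   # PP attached to the noun phrase

best = parse(SENTENCE, [T_NP, T_VP])
print(best is T_VP)  # True with these made-up probabilities
```

With other probabilities the noun-phrase attachment would win; the parser code stays the same, which is the point of separating the two components.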
Characterization of Statistical parsing
Statistical parsing is seen as more of a
pattern recognition/machine learning problem plus
search
The grammar is only implicitly defined by the training data
and the method used by the parser for generating hypotheses
Statistical parsing models
A probabilistic approach suggests the following
Score function:
Score(T, S) = P(T|S)
Lots of research on different probability models for Penn
Treebank trees
Generative models, log-linear (maximum entropy) models, …
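For a generative model, the conditional probability is usually obtained from a joint probability. The short derivation below shows why maximizing P(T, S) gives the same tree as maximizing P(T|S): P(S) does not depend on T. The second line assumes a PCFG as one example of a generative model; the rule notation A → β is introduced here and is not from the slides.

```latex
\begin{align*}
T_{\text{best}} &= \arg\max_{T} P(T \mid S)
   = \arg\max_{T} \frac{P(T, S)}{P(S)}
   = \arg\max_{T} P(T, S) \\
P(T, S) &= \prod_{(A \to \beta) \in T} P(A \to \beta \mid A)
\end{align*}
```

The product runs over every rule application in the tree T whose yield is S.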
2. Statistical parsers
Many kinds of parsers are based on statistical
methods: probability, machine learning
Different objectives: research, commercial, pedagogical
RASP, Stanford parser, Berkeley parser, …
RASP system
Robust Accurate Statistical Parsing (2nd release):
[Briscoe & Carroll, 2002; Briscoe et al., 2006]
System for syntactic annotation of free text
Semantically motivated output representation
Enhanced grammar and part-of-speech tagger lexicon
Flexible and semi-supervised training method for the structural
parse ranking model
Useful links to RASP
http://ilexir.co.uk/applications/rasp/download/
http://www.informatics.susx.ac.uk/research/groups/nlp/rasp/
Components of system
Pipeline: raw text → Tokeniser → PoS Tagger → Lemmatiser → Parser/Grammar → Parse Ranking Model
Input: unannotated text or transcribed (and punctuated) speech
1st step: sentence boundary detection and tokenisation modules
2nd step: tokenised text is tagged with one of 150 POS and punctuation labels (derived from the CLAWS tagset)
→ first-order ('bigram') HMM tagger
→ trained on the manually corrected tagged versions of the Susanne, LOB and BNC corpora
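A first-order (bigram) HMM tagger picks the tag sequence that maximizes the product of tag-transition and word-emission probabilities, typically with the Viterbi algorithm. The sketch below illustrates that idea only; it is not RASP's tagger. The three placeholder tags, all probabilities and the floor value used for unseen events are assumptions made for the example.

```python
import math

FLOOR = 1e-12  # crude stand-in for smoothing of unseen events (an assumption)

def logp(table, key):
    return math.log(table.get(key, FLOOR))

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence under a first-order (bigram) HMM."""
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{t: (logp(start_p, t) + logp(emit_p, (t, words[0])), None)
                for t in tags}]
    for word in words[1:]:
        prev_col, col = columns[-1], {}
        for t in tags:
            best_prev = max(
                tags, key=lambda p: prev_col[p][0] + logp(trans_p, (p, t)))
            col[t] = (prev_col[best_prev][0] + logp(trans_p, (best_prev, t))
                      + logp(emit_p, (t, word)), best_prev)
        columns.append(col)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for col in reversed(columns[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy model; a real tagger estimates these from tagged corpora.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8, ("NOUN", "NOUN"): 0.1}
EMIT = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.5,
        ("VERB", "barks"): 0.4, ("NOUN", "barks"): 0.05}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
# ['DET', 'NOUN', 'VERB']
```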
Components of system
3rd step: morphological analyser
4th step: manually developed wide-coverage tag sequence grammar in the parser
→ 689 unification-based phrase structure rules
→ preterminals of this grammar are the POS and punctuation tags
→ terminals are featural descriptions of the preterminals
→ non-terminals project information up the tree using an X-bar scheme with 41 attributes with a maximum of 33 atomic values
Components of system
5th step: generalized LR parser
→ a non-deterministic LALR table is constructed automatically from a CF 'backbone' compiled from the feature-based grammar
→ the parser builds a packed parse forest using this table to guide the actions it performs
→ the n-best parses can be efficiently extracted by unpacking sub-analyses, following pointers to contained sub-analyses and choosing alternatives in order of probabilistic ranking
Components of system
GR hierarchy (figure):
dependent → ta, arg_mod, det, aux, conj
arg_mod → mod, arg
mod → ncmod, xmod, cmod, pmod
arg → subj, comp
subj_or_dobj → subj, dobj
subj → ncsubj, xsubj, csubj
comp → obj, pcomp, clausal
obj → dobj, obj2, iobj
clausal → xcomp, ccomp
Output: set of named grammatical relations (GRs)
→ resulting set of ranked parses can be displayed or passed on for further processing
→ transformation of derivation trees into a set of named GRs
→ GR scheme captures those aspects of predicate-argument structure
Evaluation
The system has been evaluated using the re-annotation of the
PARC dependency bank (DepBank, King et al., 2003)
It consists of 560 sentences chosen randomly from section 23 of
the WSJ, with grammatical relations compatible with the RASP
system.
Form of relations:
(relation subtype head dependent initial)
relation: type of relationship between the head and the dependent
subtype, initial: encode additional specifications of the relation type for some relations, and the initial or underlying logical relation of the grammatical subject in constructions such as passive
Evaluation
Relation Precision Recall F1 std GRs (number of GRs in the gold standard)
dependent 79.76 77.49 78.61 10696
aux 93.33 91.00 92.15 400
conj 72.39 72.27 72.33 595
ta 42.61 51.37 46.58 292
det 87.73 90.48 89.09 1114
arg_mod 79.18 75.47 77.28 8295
mod 74.43 67.78 70.95 3908
ncmod 75.72 69.94 72.72 3550
xmod 53.21 46.63 49.70 178
cmod 45.95 30.36 36.56 168
pmod 30.77 33.33 32.00 12
arg 77.42 76.45 76.94 4387
subj_or_dobj 82.36 74.51 78.24 3127
subj 78.55 66.91 72.27 1363
ncsubj 79.16 67.06 72.61 1354
xsubj 33.33 28.57 30.77 7
csubj 12.50 50.00 20.00 2
comp 75.89 79.53 77.67 3024
obj 79.49 79.42 79.46 2328
dobj 83.63 79.08 81.29 1764
obj2 23.08 30.00 26.09 20
iobj 70.77 76.10 73.34 544
clausal 60.98 74.40 67.02 672
xcomp 76.88 77.69 77.28 381
ccomp 46.44 69.42 55.55 291
pcomp 72.73 66.67 69.57 26
macroaverage 62.12 63.77 62.94
microaverage 77.66 74.98 76.29
Parsing accuracy on DepBank [Briscoe et al., 2006]
Micro-averaged precision, recall and F1 score are calculated from the counts for all relations in the hierarchy
Macro-averaged scores are the mean of the individual scores for each relation
Micro-averaged F1 score of 76.3% across all relations
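The difference between the two averages can be sketched in a few lines. The per-relation counts below are hypothetical (the table above reports only scores and gold-standard totals), and the macro-average here follows the definition given above, the mean of the per-relation scores.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro(counts):
    """counts: relation -> (true positives, false positives, false negatives)."""
    per_relation = [prf(*c) for c in counts.values()]
    # Macro: average the per-relation scores, so every relation weighs the same.
    macro = tuple(sum(s[i] for s in per_relation) / len(per_relation)
                  for i in range(3))
    # Micro: pool the raw counts first, so frequent relations dominate.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = prf(*totals)
    return micro, macro

# Hypothetical counts for one frequent and one rare relation.
COUNTS = {"ncsubj": (90, 10, 30), "xsubj": (1, 2, 3)}
micro, macro = micro_macro(COUNTS)
print("micro P/R/F1:", [round(x, 4) for x in micro])
print("macro P/R/F1:", [round(x, 4) for x in macro])
```

Rare relations with low scores pull the macro-average down much more than the micro-average, which is consistent with the gap between the two rows in the table.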
Stanford parser
Java implementation of probabilistic natural language
parsers (version 1.6.9): [Klein and Manning, 2003]
Parsing system for English; has also been used for Chinese,
German, Arabic, Italian, Bulgarian and Portuguese
Implements both highly optimized PCFG and lexicalized
dependency parsers, and a lexicalized PCFG parser
Useful links
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu:8080/parser/
Stanford parser
Input
Various forms of plain text
Output
Various analysis formats
→ Stanford Dependencies (SD): typed dependencies
as GRs
→ phrase structure trees
→ POS tagged text
[Figure: graphical representation of the SD for the sentence “Bell, based in Los Angeles, makes and distributes electronic, computer and building products.”]
Stanford typed dependencies [de Marneffe and
Manning, 2008]
Provide a simple description of the grammatical relationships in
a sentence
Represent all sentence relationships uniformly as typed
dependency relations
Quite accessible to non-linguists thinking about tasks involving
information extraction from text, and quite effective in relation
extraction applications
Stanford typed dependencies [de Marneffe and
Manning, 2008]
For an example sentence:
Bell, based in Los Angeles, makes and distributes electronic,
computer and building products.
Stanford Dependencies (SD) representation is:
nsubj(makes-8, Bell-1)
nsubj(distributes-10, Bell-1)
partmod(Bell-1, based-3)
nn(Angeles-6, Los-5)
prep_in(based-3, Angeles-6)
root(ROOT-0, makes-8)
conj_and(makes-8, distributes-10)
amod(products-16, electronic-11)
conj_and(electronic-11, computer-13)
amod(products-16, computer-13)
conj_and(electronic-11, building-15)
amod(products-16, building-15)
dobj(makes-8, products-16)
dobj(distributes-10, products-16)
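Because every relation has the same `relation(head-index, dependent-index)` shape, the output is easy to load into a program. The sketch below reads the triples listed above with a regular expression; the function and variable names are made up for this example and are not part of the Stanford parser's API.

```python
import re
from collections import defaultdict

SD_OUTPUT = """
nsubj(makes-8, Bell-1)
nsubj(distributes-10, Bell-1)
partmod(Bell-1, based-3)
nn(Angeles-6, Los-5)
prep_in(based-3, Angeles-6)
root(ROOT-0, makes-8)
conj_and(makes-8, distributes-10)
amod(products-16, electronic-11)
conj_and(electronic-11, computer-13)
amod(products-16, computer-13)
conj_and(electronic-11, building-15)
amod(products-16, building-15)
dobj(makes-8, products-16)
dobj(distributes-10, products-16)
"""

# relation(head-index, dependent-index)
TRIPLE = re.compile(r"(\w+)\((.+)-(\d+), (.+)-(\d+)\)")

def parse_triples(text):
    """Return a list of (relation, (head, index), (dependent, index))."""
    triples = []
    for line in text.strip().splitlines():
        match = TRIPLE.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"not a typed dependency: {line!r}")
        rel, head, head_i, dep, dep_i = match.groups()
        triples.append((rel, (head, int(head_i)), (dep, int(dep_i))))
    return triples

triples = parse_triples(SD_OUTPUT)

# Index the heads of each word.
heads = defaultdict(set)
for rel, head, dep in triples:
    heads[dep].add((rel, head))

print(sorted(heads[("Bell", 1)]))
# [('nsubj', ('distributes', 10)), ('nsubj', ('makes', 8))]
```

In this representation "Bell" is the subject of both coordinated verbs, so a word can have more than one head and the structure is a graph rather than a tree.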
Output
A lineup of masseurs was waiting to take the media in hand.
POS tagged text:
Parsing [sent. 4 len. 13]: [A, lineup, of, masseurs,
was, waiting, to, take, the, media, in, hand, .]
CFPSG representation:
(ROOT
  (S
    (NP
      (NP (DT A) (NN lineup))
      (PP (IN of)
        (NP (NNS masseurs))))
    (VP (VBD was)
      (VP (VBG waiting)
        (S
          (VP (TO to)
            (VP (VB take)
              (NP (DT the) (NNS media))
              (PP (IN in)
                (NP (NN hand))))))))
    (. .)))
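The bracketed phrase structure output can be read back into a nested structure with a small recursive reader. This is a minimal sketch for well-formed input such as the tree above; the function names and the nested-tuple encoding are choices made for this example, not part of any parser's API.

```python
import re

TREE_TEXT = """
(ROOT (S
  (NP (NP (DT A) (NN lineup)) (PP (IN of) (NP (NNS masseurs))))
  (VP (VBD was) (VP (VBG waiting) (S (VP (TO to) (VP (VB take)
    (NP (DT the) (NNS media)) (PP (IN in) (NP (NN hand)))))))))
  (. .)))
"""

def read_tree(text):
    """Read one bracketed tree into nested tuples: (label, child, ...)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, *children)

    return node()

def tagged_words(tree):
    """List the (word, POS tag) pairs at the leaves, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    return [pair for child in children for pair in tagged_words(child)]

tree = read_tree(TREE_TEXT)
pairs = tagged_words(tree)
print(len(pairs))   # 13 tokens, matching "len. 13" in the parser log
print(pairs[:4])    # [('A', 'DT'), ('lineup', 'NN'), ('of', 'IN'), ('masseurs', 'NNS')]
```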
Typed dependencies representation:
det(lineup-2, A-1)
nsubj(waiting-6, lineup-2)
xsubj(take-8, lineup-2)
prep_of(lineup-2, masseurs-4)
aux(waiting-6, was-5)
root(ROOT-0, waiting-6)
aux(take-8, to-7)
xcomp(waiting-6, take-8)
det(media-10, the-9)
dobj(take-8, media-10)
prep_in(take-8, hand-12)
Berkeley parser
Learning PCFGs, statistical parser (release 1.1, version
09.2009): [Petrov et al., 2006; Petrov and Klein, 2007]
Parsing system for English; has also been used for Chinese,
German, Arabic, Bulgarian, Portuguese and French
Implementation of an unlexicalized PCFG parser
Useful links
http://nlp.cs.berkeley.edu/
http://tomato.banatao.berkeley.edu:8080/parser/parser.html
http://code.google.com/p/berkeleyparser/
Comparison of parsing an example sentence
A lineup of masseurs was waiting to take the media in hand.
Charniak parser
Probabilistic LFG F-structure parsing:
[Charniak, 2000; Bikel, 2002]
Parsing system for English
PCFG-based wide-coverage LFG parser
Useful links
http://nclt.computing.dcu.ie/demos.html
http://lfg-demo.computing.dcu.ie/lfgparser.html
Collins parser
Head-Driven Statistical Models for Natural Language
Parsing (release 1.0, version 12.2002): [Collins, 1999]
Parsing system for English
Useful links
http://www.cs.columbia.edu/~mcollins/code.html
Bikel’s parser
Multilingual statistical parsing engine (release 1.0,
version 06.2008): [Charniak, 2000; Bikel, 2002]
Parsing system for English, Chinese, Arabic, Korean
http://www.cis.upenn.edu/~dbikel/#stat-parser
http://www.cis.upenn.edu/~dbikel/software.html
Comparing parser speed on section 23 of the WSJ Penn
Treebank
Parser Time (min.)
Collins 45
Charniak 28
Sagae 11
CCG 1.9