Statistical parsing & Statistical parsers
Lecture 10
qkang@fi.muni.cz
Syntactic formalisms for natural language parsing
IA161, FI MU, autumn 2011

Study materials
➢ Course materials and homeworks are available on the following web site:
https://is.muni.cz/course/fi/autumn2011/IA161
➢ Refer to Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, D. Jurafsky and J. H. Martin, Prentice Hall, New Jersey, 2000.

Outline
● Introduction to statistical parsing methods
● Statistical parsers
● RASP system
● Stanford parser
● Collins parser
● Charniak parser
● Berkeley parser

1. Introduction to statistical parsing
● The main theoretical approaches behind modern statistical parsers
● Over the last 12 years statistical parsing has advanced significantly!
● NLP researchers have produced a range of statistical parsers
→ wide coverage and robust parsing accuracy
● They continue to improve the parsers year on year.

Application domains of statistical parsing
● High-precision question answering systems
● Named entity extraction
● Syntactically based sentence compression
● Extraction of people's opinions about products
● Improved interaction in computer games
● Helping linguists find data

NLP parsing problem and solution
● The structure of language is ambiguous!
→ local and global ambiguities
● Classical parsing problem
→ a simple grammar of 10 rules can generate 592 parses
→ a real-size wide-coverage grammar generates millions of parses

NLP parsing problem and solution (Cont.)
● NLP parsing solution: we need mechanisms that allow us to find the most likely parses
→ statistical parsing lets us work with very loose grammars that admit millions of parses for a sentence, but still quickly find the best parses

Improved methodology for robust parsing
● The annotated data: Penn Treebank (early 90's)
● Building a treebank seems a lot slower and less useful than building a grammar
● But it has many helpful properties:
● Reusability of the labor
● Broad coverage
● Frequencies and distributional information
● A way to evaluate systems

Characterization of Statistical parsing
● What is the grammar which determines the set of legal syntactic structures for a sentence? How is that grammar obtained?
● What is the algorithm for determining the set of legal parses for a sentence?
● What is the model for determining the probability of different parses for a sentence?
● What is the algorithm which, given the model and a set of possible parses, finds the best parse?

Characterization of Statistical parsing (Cont.)
Tbest = argmax_T Score(T, S)
● Two components:
➢ The model: a function Score which assigns scores (probabilities) to tree and sentence pairs
➢ The parser: the algorithm which implements the search for Tbest

Characterization of Statistical parsing (Cont.)
● Statistical parsing is seen as more of a pattern recognition/machine learning problem plus search
➢ The grammar is only implicitly defined by the training data and the method used by the parser for generating hypotheses

Statistical parsing models
● A probabilistic approach would suggest the following for the Score function:
Score(T, S) = P(T|S)
● Lots of research on different probability models for Penn Treebank trees
➢ Generative models, log-linear (maximum entropy) models, ...
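To make Score(T, S) and the argmax concrete, here is a minimal sketch (not part of the original slides) of how a generative model ranks candidate trees: a hand-written toy PCFG scores two analyses of a classic PP-attachment ambiguity, and the "parser" is just argmax over the two candidates. The grammar, probabilities and sentence are invented for illustration; for trees whose yield is the given sentence, ranking by P(T) is equivalent to ranking by P(T|S), and a real parser searches a packed forest rather than enumerating trees.

```python
from math import prod

# Toy PCFG: (lhs, rhs) -> probability.  Rules with the same left-hand side
# sum to 1.  All numbers are made up for illustration.
PCFG = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("NP", "PP")): 0.3,
    ("NP", ("astronomers",)): 0.2,
    ("NP", ("stars",)): 0.3,
    ("NP", ("telescopes",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V",  ("saw",)): 1.0,
    ("P",  ("with",)): 1.0,
}

def tree_prob(tree):
    """P(T) = product of the probabilities of all rules used in T.
    A tree is (label, child, child, ...); a leaf is a bare string."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return PCFG[(label, rhs)] * prod(tree_prob(c) for c in children)

# Two analyses of "astronomers saw stars with telescopes":
# t1 attaches the PP to the noun "stars", t2 attaches it to the verb.
t1 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"),
             ("NP", ("NP", "stars"),
                    ("PP", ("P", "with"), ("NP", "telescopes")))))
t2 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"),
             ("NP", "stars"),
             ("PP", ("P", "with"), ("NP", "telescopes"))))

for name, tree in (("noun attachment", t1), ("verb attachment", t2)):
    print(name, tree_prob(tree))

t_best = max((t1, t2), key=tree_prob)   # Tbest = argmax_T Score(T, S)
```

With these made-up numbers the verb-attachment analysis wins (0.0048 vs. 0.00216); changing the rule probabilities changes which tree the model prefers, which is exactly the role of the training data in a real statistical parser.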
2. Statistical parsers
● Many kinds of parsers based on statistical methods: probability, machine learning
● Different objectives: research, commercial, pedagogical
➢ RASP, Stanford parser, Berkeley parser, ...

RASP system
Robust Accurate Statistical Parsing (2nd release): [Briscoe & Carroll, 2002; Briscoe et al., 2006]
● System for syntactic annotation of free text
● Semantically-motivated output representation
● Enhanced grammar and part-of-speech tagger lexicon
● Flexible and semi-supervised training method for the structural parse ranking model
Useful links to RASP:
http://ilexir.co.uk/applications/rasp/download/
http://www.informatics.susx.ac.uk/research/groups/nlp/rasp/

Components of the system
● Input: unannotated text or transcribed (and punctuated) speech
● 1st step: sentence boundary detection and tokenisation modules
● 2nd step: tokenized text is tagged with one of 150 POS and punctuation labels (derived from the CLAWS tagset)
→ first-order ('bigram') HMM tagger
→ trained on the manually corrected tagged versions of the Susanne, LOB and BNC corpora

Components of the system (Cont.)
● 3rd step: morphological analyzer
● 4th step: manually developed wide-coverage tag sequence grammar in the parser
→ 689 unification-based phrase structure rules
→ the preterminals of this grammar are the POS and punctuation tags
→ the terminals are featural descriptions of the preterminals
→ non-terminals project information up the tree using an X-bar scheme with 41 attributes with a maximum of 33 atomic values

Components of the system (Cont.)
● 5th step: generalized LR parser
→ a non-deterministic LALR table is constructed automatically from a CF 'backbone' compiled from the feature-based grammar
→ the parser builds a packed parse forest using this table to guide the actions it performs
→ the n-best parses can be efficiently extracted by unpacking sub-analyses, following pointers to contained sub-analyses and choosing alternatives in order of probabilistic ranking

Components of the system (Cont.)
● Output: a set of named grammatical relations (GRs)
→ the resulting set of ranked parses can be displayed or passed on for further processing
→ transformation of derivation trees into a set of named GRs
→ the GR scheme captures aspects of predicate-argument structure

Evaluation
● The system has been evaluated using a re-annotation of the PARC dependency bank (DepBank; King et al., 2003)
● It consists of 560 sentences chosen randomly from section 23 of the WSJ, with grammatical relations made compatible with the RASP system
● Form of relations: (relation subtype head dependent initial)
→ relation: the type of relationship between the head and the dependent
→ subtype: encodes additional specifications of the relation type for some relations
→ initial: the initial or underlying logical relation of the grammatical subject in constructions such as passives

Evaluation (Cont.)
● Micro-averaged precision, recall and F1 scores are calculated from the counts for all relations in the hierarchy
● Macro-averaged scores are the mean of the individual scores for each relation
● Micro-averaged F1 score of 76.3% across all relations
[Table: Parsing accuracy on DepBank, Briscoe et al., 2006]
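The difference between the two averaging schemes above is easy to see in code. A minimal sketch follows; the relation names and the per-relation counts are invented for illustration and are not the DepBank figures.

```python
# Per-relation counts: (true positives, false positives, false negatives).
# Names and numbers are made up for illustration.
counts = {
    "nsubj": (900, 100, 80),
    "dobj":  (500, 60, 70),
    "iobj":  (20, 15, 25),
}

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Micro-average: pool the counts over all relations, then compute P/R/F1.
# Frequent relations dominate the result.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
print("micro P/R/F1:", prf(tp, fp, fn))

# Macro-average: compute F1 per relation, then take the unweighted mean.
# Every relation counts equally, however rare it is.
f1s = [prf(*c)[2] for c in counts.values()]
print("macro F1:", sum(f1s) / len(f1s))
```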
Stanford parser
Java implementation of probabilistic natural language parsers (version 1.6.9): [Klein and Manning, 2003]
● Parsing system for English; it has also been used for Chinese, German, Arabic, Italian, Bulgarian and Portuguese
● Implementation of both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser
● Useful links
http://nlp.stanford.edu/software/lex-parser.shtml
http://nlp.stanford.edu:8080/parser/

● Input: various forms of plain text
● Output: various analysis formats
→ Stanford Dependencies (SD): typed dependencies as GRs
→ phrase structure trees
→ POS-tagged text
[Figure: graphical representation of the SD for the sentence "Bell, based in Los Angeles, makes and distributes electronic, computer and building products."]

Stanford typed dependencies [de Marneffe and Manning, 2008]
● provide a simple description of the grammatical relationships in a sentence
● represent all sentence relationships uniformly as typed dependency relations
● quite accessible to non-linguists thinking about tasks involving information extraction from text, and quite effective in relation extraction applications

Stanford typed dependencies [de Marneffe and Manning, 2008] (Cont.)
● For the example sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.
● the Stanford Dependencies (SD) representation is:
➢ nsubj(makes-8, Bell-1)
➢ nsubj(distributes-10, Bell-1)
➢ partmod(Bell-1, based-3)
➢ nn(Angeles-6, Los-5)
➢ prep_in(based-3, Angeles-6)
➢ root(ROOT-0, makes-8)
➢ conj_and(makes-8, distributes-10)
➢ amod(products-16, electronic-11)
➢ conj_and(electronic-11, computer-13)
➢ amod(products-16, computer-13)
➢ conj_and(electronic-11, building-15)
➢ amod(products-16, building-15)
➢ dobj(makes-8, products-16)
➢ dobj(distributes-10, products-16)

Output
● Example sentence: A line-up of masseurs was waiting to take the media in hand.
POS tagged text:
Parsing [sent. 4 len. 13]: [A, line-up, of, masseurs, was, waiting, to, take, the, media, in, hand, .]
CFPSG representation:
(ROOT
  (S
    (NP
      (NP (DT A) (NN line-up))
      (PP (IN of)
        (NP (NNS masseurs))))
    (VP (VBD was)
      (VP (VBG waiting)
        (S
          (VP (TO to)
            (VP (VB take)
              (NP (DT the) (NNS media))
              (PP (IN in)
                (NP (NN hand))))))))
    (. .)))
Typed dependencies representation:
det(line-up-2, A-1)
nsubj(waiting-6, line-up-2)
xsubj(take-8, line-up-2)
prep_of(line-up-2, masseurs-4)
aux(waiting-6, was-5)
root(ROOT-0, waiting-6)
aux(take-8, to-7)
xcomp(waiting-6, take-8)
det(media-10, the-9)
dobj(take-8, media-10)
prep_in(take-8, hand-12)
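The outputs above come from the parser's command-line interface. Below is a minimal sketch of driving it from Python by shelling out to the Java distribution, roughly what the bundled lexparser.sh script does; the file names (stanford-parser.jar, englishPCFG.ser.gz, input.txt) and the heap size are assumptions about a typical 1.6.x installation, so check the script and README shipped with your release for the exact invocation.

```python
import subprocess

# Assumed layout: stanford-parser.jar, the serialized grammar
# englishPCFG.ser.gz and a plain-text input.txt in the working directory.
cmd = [
    "java", "-mx600m",
    "-cp", "stanford-parser.jar",
    "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
    # Request both output formats shown above: the phrase structure tree
    # ("penn") and the typed dependencies.
    "-outputFormat", "penn,typedDependencies",
    "englishPCFG.ser.gz",
    "input.txt",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```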
Berkeley parser
Learning PCFGs, statistical parser (release 1.1, version 09.2009): [Petrov et al., 2006; Petrov and Klein, 2007]
● Parsing system for English; it has also been used for Chinese, German, Arabic, Bulgarian, Portuguese and French
● Implementation of an unlexicalized PCFG parser
● Useful links
http://nlp.cs.berkeley.edu/
http://tomato.banatao.berkeley.edu:8080/parser/parser.html
http://code.google.com/p/berkeleyparser/

Comparison of parsing an example sentence
A line-up of masseurs was waiting to take the media in hand.
[Figure: parse trees produced by the Berkeley parser and the Stanford parser for this sentence]

Charniak parser
Probabilistic LFG F-Structure Parsing: [Charniak, 2000; Bikel, 2002]
● Parsing system for English
● PCFG-based wide-coverage LFG parser
● Useful links
http://nclt.computing.dcu.ie/demos.html
http://lfg-demo.computing.dcu.ie/lfgparser.html

Collins parser
Head-Driven Statistical Models for natural language parsing (release 1.0, version 12.2002): [Collins, 1999]
● Parsing system for English
● Useful links
http://www.cs.columbia.edu/~mcollins/code.html

Bikel's parser
Multilingual statistical parsing engine (release 1.0, version 06.2008): [Charniak, 2000; Bikel, 2002]
● Parsing system for English, Chinese, Arabic, Korean
● Useful links
http://www.cis.upenn.edu/~dbikel/#stat-parser
http://www.cis.upenn.edu/~dbikel/software.html

[Figure: comparing parser speed on section 23 of the WSJ Penn Treebank]
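Speed comparisons of this kind are usually produced by running each parser over the same test set and measuring throughput. The sketch below is parser-agnostic and purely illustrative: parse_fn and the sentence list are placeholders for whichever system and data are being measured, and no parser from the slides is bundled here.

```python
import time

def sentences_per_second(parse_fn, sentences):
    """Time an arbitrary one-sentence parsing callable on a list of
    sentences and return its throughput in sentences per second."""
    start = time.perf_counter()
    for s in sentences:
        parse_fn(s)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed

# Usage sketch: load the sentences of WSJ section 23 (not included here)
# and call sentences_per_second(my_parser_parse, wsj23_sentences) once per
# parser under comparison, reporting the resulting numbers side by side.
```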