IA161 Syntactic Formalisms for Parsing Natural Languages
Lecture 12: Parsing Evaluation
Aleš Horák, Miloš Jakubíček, Vojtěch Kovář (based on slides by Juyeon Kang)
ia161@nlp.fi.muni.cz
Autumn 2013

Parsing Results
- usually some complex (i.e. non-scalar) structure, mostly a tree or a graph-like structure
- crucial question: how to measure the "goodness" of the result?

Extrinsic vs. Intrinsic Evaluation
- intrinsic: by comparing the result to a "gold" (i.e. correct) representation
- extrinsic: by exploiting the result in a third-party task and evaluating the results of that task
- which is better?

Intrinsic Evaluation – Phrase-Structure Syntax
- i.e. compare two phrase-structure trees and produce a single number
- PARSEVAL metric
- LAA (leaf-ancestor assessment) metric

PARSEVAL metric
- basic idea: penalize crossing brackets in the tree
- i.e. compare all constituents in the test tree to those in the gold tree
  ⇒ parsing viewed as a classification problem

Precision, recall
- for classification problems in NLP, the standard evaluation is by means of precision and recall

  precision = |test ∩ gold| / |test|
  recall = |test ∩ gold| / |gold|

- two numbers, but we want just one – the F-score

  F1 = 2 · precision · recall / (precision + recall)

F-score
- also called F-measure
- general form: Fβ score

  Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

- the special case β = 1 is the harmonic mean of precision and recall
- β can be used to favour precision over recall (for β < 1) or vice versa (for β > 1)

PARSEVAL metric (continued)
- basic idea: penalize crossing brackets, i.e. compare all constituents in the test tree to those in the gold tree
  ⇒ parsing viewed as a classification problem
  ⇒ F-score on correct bracketings/constituents
- might even disregard non-terminal names
- a more or less standardized tool is available: the evalb script at http://nlp.cs.nyu.edu/evalb/

PARSEVAL metric – example (test vs. gold)
- test: [S [NP John] [VP [V likes] [NP ice cream] [PP with chocolate]]]
- gold: [S [NP John] [VP [V likes] [NP [NP ice cream] [PP with chocolate]]]]
- precision = 6/6 = 1.0, recall = 6/7 = 0.86, F-score = 0.92
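To make the bracket scoring concrete, here is a minimal sketch in Python (an illustration only, not the evalb tool): the constituents of the test and gold trees above are written out by hand as labelled token spans, and precision, recall and F1 are computed over the two span sets.

```python
# PARSEVAL-style labelled bracket scoring for the example above.
# Constituents are encoded by hand as (label, start, end) spans over the tokens
#   0:John 1:likes 2:ice 3:cream 4:with 5:chocolate
# (a real evaluator would extract these spans from the trees automatically).

def parseval(test, gold):
    """Return (precision, recall, F1) over labelled constituent spans."""
    matched = len(test & gold)
    precision = matched / len(test)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

test = {("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6),
        ("V", 1, 2), ("NP", 2, 4), ("PP", 4, 6)}

gold = {("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6),
        ("V", 1, 2), ("NP", 2, 6), ("NP", 2, 4), ("PP", 4, 6)}

print(parseval(test, gold))   # (1.0, 0.857..., 0.923...) as on the slide
```

The counts reproduce the slide's numbers: all 6 test constituents are correct (precision 1.0), while the gold tree's extra NP over "ice cream with chocolate" is missed (recall 6/7).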
PARSEVAL metric – criticism
- often subject to criticism (see e.g. Sampson, 2000)
- Sampson proposed another metric, the leaf-ancestor assessment (LAA)

LAA metric
- basic idea: for each leaf (word), compare its path to the root in the test and gold trees, compute the edit distance between the two paths, and finally average over all words
- in the previous example, the paths (lineages) are:
  (John) NP S vs. (John) NP S
  (likes) V VP S vs. (likes) V VP S
  (ice cream) NP VP S vs. (ice cream) NP NP VP S
  (with chocolate) PP VP S vs. (with chocolate) PP NP VP S

Intrinsic Evaluation – Dependency Syntax
- much easier
- just precision, labeled or unlabeled (as the number of correct dependencies)

Intrinsic Evaluation – Building Treebanks
- treebank = a syntactically annotated text corpus
- manual annotation according to some guidelines
- from the evaluation point of view: inter-annotator agreement (IAA) is a crucial property

Measuring IAA
- naïve approach: count how often the annotators agree
- problem: it does not account for agreement by chance

Chance-corrected coefficients for IAA
- S (Bennett, Alpert and Goldstein, 1954)
- π (Scott, 1955)
- κ (Cohen, 1960)
- (there is a lot of terminological confusion; we follow Ron Artstein, Massimo Poesio: Inter-coder Agreement for Computational Linguistics, 2008)
- Ao – observed agreement
- Ae – expected (chance) agreement
- all three coefficients are computed as

  S, π, κ = (Ao − Ae) / (1 − Ae)

Chance-corrected coefficients for IAA (continued)
- S (Bennett, Alpert and Goldstein, 1954): assumes that all categories and all annotators have a uniform probability distribution
- π (Scott, 1955): assumes that different categories have different distributions, shared across annotators
- κ (Cohen, 1960): assumes that different categories and different annotators have different distributions
- devised for 2 annotators; various modifications for more than 2 annotators are available
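The three coefficients differ only in how the expected agreement Ae is estimated. The following is a minimal sketch for two annotators, using hypothetical example data (not from the lecture):

```python
# S, pi and kappa for two annotators; all three are (Ao - Ae) / (1 - Ae)
# and differ only in the chance-agreement term Ae.
from collections import Counter

def observed_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def coefficient(a_o, a_e):
    return (a_o - a_e) / (1 - a_e)

def s_pi_kappa(a, b, categories):
    n = len(a)
    a_o = observed_agreement(a, b)
    # S: uniform chance distribution over all categories
    ae_s = 1 / len(categories)
    # pi: one category distribution shared by both annotators (pooled counts)
    pooled = Counter(a) + Counter(b)
    ae_pi = sum((pooled[c] / (2 * n)) ** 2 for c in categories)
    # kappa: a separate category distribution per annotator
    ca, cb = Counter(a), Counter(b)
    ae_k = sum((ca[c] / n) * (cb[c] / n) for c in categories)
    return (coefficient(a_o, ae_s),
            coefficient(a_o, ae_pi),
            coefficient(a_o, ae_k))

# hypothetical annotations of 10 items with labels {"N", "V"}
ann1 = ["N", "N", "V", "N", "V", "N", "N", "V", "N", "N"]
ann2 = ["N", "V", "V", "N", "V", "N", "N", "N", "N", "N"]
print(s_pi_kappa(ann1, ann2, {"N", "V"}))   # S = 0.6, pi ~ 0.47, kappa ~ 0.47
```

With the skewed label distribution in this example, the uniform chance model of S yields the lowest Ae and therefore the highest score, while π and κ correct for the prevalence of the majority label.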
Intrinsic Evaluation – Conclusions
- generally not easy
- builds on the assumption of having THE one correct parse
- there is evidence that it does not correlate with extrinsic evaluation, i.e. with how good the tool is for some particular job

Extrinsic Evaluation
- = evaluation on a particular task/application
- advantage: measures direct fitness for that task
- disadvantage: may not generalize to other tasks
- leads to the crucial question: what can parsing be used for?

What can parsing be used for?
- in theory, (full) parsing is suitable/appropriate/necessary for many NLP tasks
- in practice it turns out to be:
  often not accurate enough
  often too complicated to exploit
  sometimes just an overkill compared to shallow parsing or yet simpler approaches

What can parsing be used for?
- in theory, (full) parsing is suitable/appropriate/necessary for many NLP tasks:
  information extraction
  information retrieval
  machine translation
  corpus linguistics
  computer lexicography
  question answering
  …

Where is parsing actually used now?
- prototype systems: academic work
- production systems: ???

What to evaluate parsing on
- sample (more or less well-defined) applications:
  (partial) morphological disambiguation
  text-correcting systems
  word sketches
  phrase extraction
  a simple treebank with high IAA