IA161 Syntactic Formalisms for Parsing Natural Languages
Lecture 12: Parsing Evaluation
Aleš Horák, Miloš Jakubíček, Vojtěch Kovář (based on slides by Juyeon Kang)
ia161@nlp.fi.muni.cz
Autumn 2013

Parsing Results
- usually some complex (i.e. non-scalar) structure, mostly a tree or a graph-like structure
- crucial question: how to measure the "goodness" of the result?

Extrinsic vs. Intrinsic Evaluation
- intrinsic: by comparing the result to a "gold" (i.e. correct) representation
- extrinsic: by exploiting the result in a third-party task and evaluating the results of that task
- which is better?

Intrinsic Evaluation – Phrase-Structure Syntax
- i.e. compare two phrase-structure trees and produce a single number
- PARSEVAL metric
- LAA (leaf-ancestor assessment) metric

PARSEVAL metric
- basic idea: penalize crossing brackets in the tree
- i.e. compare all constituents in the test tree to those in the gold tree
  ⇒ parsing viewed as a classification problem

Precision, recall
- for classification problems in NLP, the standard evaluation is by means of precision and recall

  precision = |test ∩ gold| / |test|
  recall = |test ∩ gold| / |gold|

- two numbers, but we want just one – the F-score

  F1 = 2 · precision · recall / (precision + recall)

F-score
- also called F-measure
- general form: Fβ score

  Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

- the special case β = 1 is the harmonic mean of precision and recall
- β can be used to favour precision over recall (for β < 1) or vice versa (for β > 1)

PARSEVAL metric (continued)
- basic idea: penalize crossing brackets, i.e. compare all constituents in the test tree to those in the gold tree
  ⇒ parsing viewed as a classification problem
  ⇒ F-score on correct bracketings/constituents
- might even disregard non-terminal names
- a more or less standardized tool is available: the evalb script at http://nlp.cs.nyu.edu/evalb/

PARSEVAL metric – example (test vs. gold)
- test: [S [NP John] [VP [V likes] [NP ice cream] [PP with chocolate]]]
- gold: [S [NP John] [VP [V likes] [NP [NP ice cream] [PP with chocolate]]]]
- precision = 6/6 = 1.0, recall = 6/7 = 0.86, F-score = 0.92
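To make the bracket scoring concrete, here is a minimal sketch in Python (an illustration only, not the evalb tool): the constituents of the test and gold trees above are written out by hand as labelled token spans, and precision, recall and F1 are computed over the two span sets.

```python
# PARSEVAL-style labelled bracket scoring for the example above.
# Constituents are encoded by hand as (label, start, end) spans over the tokens
#   0:John 1:likes 2:ice 3:cream 4:with 5:chocolate
# (a real evaluator would extract these spans from the trees automatically).

def parseval(test, gold):
    """Return (precision, recall, F1) over labelled constituent spans."""
    matched = len(test & gold)
    precision = matched / len(test)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

test = {("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6),
        ("V", 1, 2), ("NP", 2, 4), ("PP", 4, 6)}

gold = {("S", 0, 6), ("NP", 0, 1), ("VP", 1, 6),
        ("V", 1, 2), ("NP", 2, 6), ("NP", 2, 4), ("PP", 4, 6)}

print(parseval(test, gold))   # (1.0, 0.857..., 0.923...) as on the slide
```

The counts reproduce the slide's numbers: all 6 test constituents are correct (precision 1.0), while the gold tree's extra NP over "ice cream with chocolate" is missed (recall 6/7).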
PARSEVAL metric – criticism
- often subject to criticism (see e.g. Sampson, 2000)
- Sampson proposed another metric, the leaf-ancestor assessment (LAA)

LAA metric
- basic idea: for each leaf (word), compare its path to the root in the test and gold trees, compute the edit distance between the two paths, and finally average over all words
- in the previous example, the paths (lineages) are:
  (John) NP S vs. (John) NP S
  (likes) V VP S vs. (likes) V VP S
  (ice cream) NP VP S vs. (ice cream) NP NP VP S
  (with chocolate) PP VP S vs. (with chocolate) PP NP VP S

Intrinsic Evaluation – Dependency Syntax
- much easier
- just precision, labeled or unlabeled (as the number of correct dependencies)

Intrinsic Evaluation – Building Treebanks
- treebank = a syntactically annotated text corpus
- manual annotation according to some guidelines
- from the evaluation point of view: inter-annotator agreement (IAA) is a crucial property

Measuring IAA
- naïve approach: count how often the annotators agree
- problem: it does not account for agreement by chance

Chance-corrected coefficients for IAA
- S (Bennett, Alpert and Goldstein, 1954)
- π (Scott, 1955)
- κ (Cohen, 1960)
- (there is a lot of terminological confusion; we follow Ron Artstein, Massimo Poesio: Inter-coder Agreement for Computational Linguistics, 2008)
- Ao – observed agreement
- Ae – expected (chance) agreement
- all three coefficients are computed as

  S, π, κ = (Ao − Ae) / (1 − Ae)

Chance-corrected coefficients for IAA (continued)
- S (Bennett, Alpert and Goldstein, 1954): assumes that all categories and all annotators have a uniform probability distribution
- π (Scott, 1955): assumes that different categories have different distributions, shared across annotators
- κ (Cohen, 1960): assumes that different categories and different annotators have different distributions
- devised for 2 annotators; various modifications for more than 2 annotators are available
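The three coefficients differ only in how the expected agreement Ae is estimated. The following is a minimal sketch for two annotators, using hypothetical example data (not from the lecture):

```python
# S, pi and kappa for two annotators; all three are (Ao - Ae) / (1 - Ae)
# and differ only in the chance-agreement term Ae.
from collections import Counter

def observed_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def coefficient(a_o, a_e):
    return (a_o - a_e) / (1 - a_e)

def s_pi_kappa(a, b, categories):
    n = len(a)
    a_o = observed_agreement(a, b)
    # S: uniform chance distribution over all categories
    ae_s = 1 / len(categories)
    # pi: one category distribution shared by both annotators (pooled counts)
    pooled = Counter(a) + Counter(b)
    ae_pi = sum((pooled[c] / (2 * n)) ** 2 for c in categories)
    # kappa: a separate category distribution per annotator
    ca, cb = Counter(a), Counter(b)
    ae_k = sum((ca[c] / n) * (cb[c] / n) for c in categories)
    return (coefficient(a_o, ae_s),
            coefficient(a_o, ae_pi),
            coefficient(a_o, ae_k))

# hypothetical annotations of 10 items with labels {"N", "V"}
ann1 = ["N", "N", "V", "N", "V", "N", "N", "V", "N", "N"]
ann2 = ["N", "V", "V", "N", "V", "N", "N", "N", "N", "N"]
print(s_pi_kappa(ann1, ann2, {"N", "V"}))   # S = 0.6, pi ~ 0.47, kappa ~ 0.47
```

With the skewed label distribution in this example, the uniform chance model of S yields the lowest Ae and therefore the highest score, while π and κ correct for the prevalence of the majority label.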
Intrinsic Evaluation – Conclusions
- generally not easy
- builds on the assumption of having THE one correct parse
- there is evidence that it does not correlate with extrinsic evaluation, i.e. with how good the tool is for some particular job

Extrinsic Evaluation
- = evaluation on a particular task/application
- advantage: measures direct fitness for that task
- disadvantage: may not generalize to other tasks
- leads to the crucial question: what can parsing be used for?

What can parsing be used for?
- in theory, (full) parsing is suitable/appropriate/necessary for many NLP tasks
- in practice it turns out to be:
  often not accurate enough
  often too complicated to exploit
  sometimes just an overkill compared to shallow parsing or yet simpler approaches

What can parsing be used for?
- in theory, (full) parsing is suitable/appropriate/necessary for many NLP tasks:
  information extraction
  information retrieval
  machine translation
  corpus linguistics
  computer lexicography
  question answering
  …

Where is parsing actually used now?
- prototype systems: academic work
- production systems: ???

What to evaluate parsing on
- sample (more or less well-defined) applications:
  (partial) morphological disambiguation
  text-correcting systems
  word sketches
  phrase extraction
  a simple treebank with high IAA