Lecture 4
Syntactic Formalisms for Parsing Natural Languages
Aleš Horák, Miloš Jakubíček, Vojtěch Kovář (based on slides by Juyeon Kang)
ia161@nlp.fi.muni.cz
Autumn 2013
Dependency Syntax and Parsing

Lecture 4 Outline
1 Motivation
2 Dependency Syntax
3 Dependency Parsing

Lecture 4 Motivation
what you have seen so far: applying the analysis of formal languages to a natural language – creating a phrase-structure derivation tree according to some grammar
PS accounts for one important syntactic property: constituency
is that all?
but what about: discontinuous phrases, structure sharing

Lecture 4 Motivation
another crucial syntactic phenomenon is dependency
what is a dependency? "some relation between two words"
how does it differ from phrase structure?
what does constituency express?
what does dependency express? A relation!
Dependency Relation
Let W be the set of all words in a sentence. The dependency relation → is a relation D ⊆ W × W such that:
D is anti-reflexive: a → b ⇒ a ≠ b
D is anti-symmetric: a → b ∧ b → a ⇒ a = b, which together with anti-reflexivity is equivalent to a → b ⇒ b ↛ a
D is anti-transitive: a → b ∧ b → c ⇒ a ↛ c
optionally: D is labeled: there is a mapping l : D → L, L being the set of labels
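The three conditions can be checked mechanically. The sketch below is illustrative only: it assumes the relation is given as a set of (dependent, head) pairs, that every word form in the sentence is distinct, and the function names are hypothetical rather than taken from the lecture.

```python
from itertools import product

def is_anti_reflexive(D):
    # no word depends on itself
    return all(a != b for a, b in D)

def is_anti_symmetric(D):
    # a -> b and b -> a may only hold together when a == b
    return all((b, a) not in D for a, b in D if a != b)

def is_anti_transitive(D):
    # a -> b and b -> c must never come with a -> c
    return all((a, c) not in D
               for (a, b), (b2, c) in product(D, D) if b == b2)

def is_dependency_relation(D):
    return (is_anti_reflexive(D)
            and is_anti_symmetric(D)
            and is_anti_transitive(D))

# each pair (a, b) is read as "a depends on b"
D = {("Green", "ideas"), ("ideas", "sleep"), ("furiously", "sleep")}
print(is_dependency_relation(D))
```

Adding the pair ("Green", "sleep") to D would break anti-transitivity, because "Green" already reaches "sleep" through "ideas".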

Lecture 4 Dependency Representation
a → b: a depends on b, a is a dependent of b, b is the head of a
a dependency graph
a dependency tree

Lecture 4 Dependency Tree vs. PS Tree
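[Figure: dependency tree (root "sleep" with dependents "ideas" and "furiously"; "Green" depends on "ideas") next to the phrase-structure tree (S → NP VP, NP → A N, VP → V ADV) for the sentence "Green ideas sleep furiously"]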
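One common way to store a dependency tree is a list of head indices, one per token. The sketch below assumes that encoding (position 0 is an artificial ROOT, `heads[i]` is the head of token i) and checks that the structure really is a tree with a single root word; the function name and the encoding are assumptions, not part of the slides.

```python
def is_single_rooted_tree(heads):
    """heads[i] is the head index of token i (1-based); heads[0] is unused.

    Index 0 stands for an artificial ROOT node.
    """
    n = len(heads) - 1
    # exactly one token may attach directly to ROOT
    if sum(1 for h in heads[1:] if h == 0) != 1:
        return False
    # every token must reach ROOT without running into a cycle
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

words = ["ROOT", "Green", "ideas", "sleep", "furiously"]
heads = [None, 2, 3, 0, 3]   # Green -> ideas, ideas -> sleep, furiously -> sleep
print(is_single_rooted_tree(heads))
```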

Lecture 4 Non-projectivity
a property of a dependency tree: a sentence is non-projective whenever drawing (projecting) a line from a node to the surface of the tree crosses an arc
a lot of attention has been paid to this problem
practical implications are rather limited (in most cases non-projectivity can be easily handled or avoided)
hard cases: koupil Malou chaloupku
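Projectivity can be tested directly on a head-index encoding. The sketch below uses the equivalent formulation that an arc from head h to dependent d is projective when every token strictly between them is dominated by h. The encoding (index 0 as artificial ROOT) and the English example sentence are illustrative assumptions, not taken from the lecture.

```python
def dominates(heads, anc, node):
    """True if anc is a proper ancestor of node (0 is the artificial ROOT)."""
    while node != 0:
        node = heads[node]
        if node == anc:
            return True
    return False

def non_projective_arcs(heads):
    """Return all (head, dependent) arcs that violate projectivity.

    heads[i] is the head of token i (1-based); heads[0] is unused.
    The input is assumed to be a well-formed tree.
    """
    bad = []
    for d in range(1, len(heads)):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        # every token between h and d must descend from h
        if any(not dominates(heads, h, k) for k in range(lo + 1, hi)):
            bad.append((h, d))
    return bad

# "A hearing is scheduled on the issue today"
# "on" attaches to "hearing", "today" attaches to "scheduled"
heads = [None, 2, 3, 0, 3, 2, 7, 5, 4]
print(non_projective_arcs(heads))
```

For this sentence the two reported arcs, hearing → on and scheduled → today, are exactly the pair that cross each other when drawn above the sentence.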

Lecture 4 Czech Tradition of Dependency Syntax
a long tradition of dependency syntax in the Prague linguistic school (Sgall, Hajičová, Panevová)
Institute of Formal and Applied Linguistics at Charles University
formalized as Functional Generative Description (FGD) of language
Prague Dependency Treebank (PDT)

Lecture 4 Dependencies vs. PS
is one of the formalisms clearly better than the other one? No.
dependencies: ⊕ account for relational phenomena, ⊕ simple
phrase-structure: ⊕ account for constituency, ⊕ easy chunking
can we transform one of the formalisms into the other one and vice versa? Technically yes, but . . .

Lecture 4 Dependency Parsing
rule-based vs. statistical
transition-based (→ deterministic parsing)
graph-based (→ spanning trees algorithms)
various other approaches (ILP, PS conversion, . . . )
very recent advances (vs. long studied PS parsing algorithms)

Lecture 4 Introduction to Dependency parsing
Motivation
a. dependency-based syntactic representations seem to be useful in many applications of language technology: machine translation, information extraction → transparent encoding of predicate-argument structure
b. dependency grammar is better suited than phrase-structure grammar for languages with free or flexible word order → analysis of diverse languages within a common framework
c. this has led to the development of accurate syntactic parsers for a number of languages → combination with machine learning from syntactically annotated corpora (e.g. treebanks)

Lecture 4 Introduction to Dependency parsing
Dependency parsing
"Task of automatically analyzing the dependency structure of a given input sentence"
Dependency parser
"Task of producing a labeled dependency structure of the kind depicted in the following figure, where the words of the sentence are connected by typed dependency relations"
[Figure: labeled dependency tree over "ROOT Economic news had little effect on financial markets ." with arc labels PRED, SBJ, OBJ, ATT, PC and PU]

Lecture 4 Definitions of dependency graphs and dependency parsing
Dependency graphs: syntactic structures over sentences
Def. 1.: A sentence is a sequence of tokens denoted by $S = w_0 w_1 \ldots w_n$
Def. 2.: Let $R = \{r_1, \ldots, r_m\}$ be a finite set of possible dependency relation types that can hold between any two words in a sentence. A relation type $r \in R$ is additionally called an arc label.

Lecture 4 Definitions of dependency graphs and dependency parsing
Dependency graphs: syntactic structures over sentences
Def. 3.: A dependency graph $G = (V, A)$ is a labeled directed graph consisting of a set of nodes $V$ and a set of arcs $A$, such that for a sentence $S = w_0 w_1 \ldots w_n$ and a label set $R$ the following holds:
1 $V \subseteq \{w_0, w_1, \ldots, w_n\}$
2 $A \subseteq V \times R \times V$
3 if $(w_i, r, w_j) \in A$ then $(w_i, r', w_j) \notin A$ for all $r' \neq r$
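The three conditions of Def. 3 translate into a small validity check. In this sketch tokens are referred to by their index, the first component of an arc is assumed to be the head, and the three arcs are an illustrative reading of the example sentence rather than a transcription of the figure; `Arc` and `is_dependency_graph` are hypothetical names.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: int    # index i of w_i (assumed here to be the head)
    label: str   # relation type r from R
    dep: int     # index j of w_j

def is_dependency_graph(n, labels, nodes, arcs):
    """Check conditions 1-3 of Def. 3 for a sentence w_0 ... w_n."""
    # 1: every node is a token of the sentence
    in_sentence = all(0 <= v <= n for v in nodes)
    # 2: every arc connects two nodes and carries a known label
    well_typed = all(a.head in nodes and a.label in labels and a.dep in nodes
                     for a in arcs)
    # 3: an ordered pair of nodes carries at most one label
    pairs = [(a.head, a.dep) for a in set(arcs)]
    one_label = len(pairs) == len(set(pairs))
    return in_sentence and well_typed and one_label

words = ["ROOT", "Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
labels = {"PRED", "SBJ", "OBJ", "ATT", "PC", "PU"}
nodes = set(range(len(words)))
arcs = [Arc(0, "PRED", 3), Arc(3, "SBJ", 2), Arc(3, "OBJ", 5)]
print(is_dependency_graph(len(words) - 1, labels, nodes, arcs))
```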

Lecture 4 Approach to dependency parsing
a. data-driven
it makes essential use of machine learning from linguistic data in order to parse new sentences
b. grammar-based
it relies on a formal grammar, defining a formal language, so that it makes sense to ask whether a given input is in the language defined by the grammar or not.
→ Data-driven approaches have attracted the most attention in recent years.

Lecture 4 Data-driven approach
data-driven approaches differ in the type of parsing model adopted, the algorithms used to learn the model from data, and the algorithms used to parse new sentences with the model
a. transition-based
start by defining a transition system, or state machine, for mapping a sentence to its dependency graph.
b. graph-based
start by defining a space of candidate dependency graphs for a sentence.

Lecture 4 Data-driven approach
a. transition-based
learning problem: induce a model for predicting the next state transition, given the transition history
parsing problem: construct the optimal transition sequence for the input sentence, given the induced model
b. graph-based
learning problem: induce a model for assigning scores to the candidate dependency graphs for a sentence
parsing problem: find the highest-scoring dependency graph for the input sentence, given the induced model

Lecture 4 Transition-based Parsing
A transition system consists of a set $C$ of parser configurations and a set $D$ of transitions between configurations.
Main idea: a sequence of valid transitions, starting in the initial configuration for a given sentence and ending in one of several terminal configurations, defines a valid dependency tree for the input sentence:
$D_{1,m} = d_1(c_1), \ldots, d_m(c_m)$
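To make the idea concrete, here is a minimal sketch of one possible transition system in the arc-standard style. The slides do not spell out the transitions, so the three transitions (SHIFT, LEFT-ARC, RIGHT-ARC), the configuration layout (stack, buffer, arc set) and the function name are assumptions chosen for illustration.

```python
def run_transitions(n_words, transitions):
    """Apply a transition sequence and return the set of (head, dependent) arcs.

    Tokens are numbered 1..n_words; 0 is an artificial ROOT.
    Initial configuration: stack [0], buffer [1..n], no arcs.
    Terminal configuration: stack [0], empty buffer.
    """
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), set()
    for t in transitions:
        if t == "SHIFT":
            # move the next input token onto the stack
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":
            # second-topmost token becomes a dependent of the topmost token
            if len(stack) < 2 or stack[-2] == 0:
                raise ValueError("LEFT-ARC not allowed here")
            dep = stack.pop(-2)
            arcs.add((stack[-1], dep))
        elif t == "RIGHT-ARC":
            # topmost token becomes a dependent of the second-topmost token
            if len(stack) < 2:
                raise ValueError("RIGHT-ARC not allowed here")
            dep = stack.pop()
            arcs.add((stack[-1], dep))
        else:
            raise ValueError(f"unknown transition: {t}")
    if buffer or stack != [0]:
        raise ValueError("sequence does not end in a terminal configuration")
    return arcs

# 1 Green, 2 ideas, 3 sleep, 4 furiously
sequence = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC",
            "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
print(sorted(run_transitions(4, sequence)))
```

In this system every token is shifted once and attached once, so a complete sequence for n words has exactly 2n transitions.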

Lecture 4 Transition-based Parsing
Definition
Score of $D_{1,m}$ factors by configuration-transition pairs $(c_i, d_i)$:
$s(D_{1,m}) = \sum_{i=1}^{m} s(c_i, d_i)$
Learning
Scoring function $s(c_i, d_i)$ for $d_i(c_i) \in D_{1,m}$
Inference
Search for the highest-scoring sequence $D^*_{1,m}$ given $s(c_i, d_i)$

Lecture 4 Transition-based Parsing
Inference for transition-based parsing
Common inference strategies:
Deterministic [Yamada and Matsumoto 2003, Nivre et al. 2004]
Beam search [Johansson and Nugues 2006, Titov and Henderson 2007]
Complexity given by upper bound on transition sequence length
Transition system
Projective $O(n)$ [Yamada and Matsumoto 2003, Nivre 2003]
Limited non-projective $O(n)$ [Attardi 2006, Nivre 2007]
Unrestricted non-projective $O(n^2)$ [Nivre 2008, Nivre 2009]
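Deterministic inference can be sketched as a greedy loop: in each configuration, take the single best-scoring legal transition and never backtrack. The sketch assumes an arc-standard-style projective system (SHIFT, LEFT-ARC, RIGHT-ARC) and a toy scoring function; neither is prescribed by the slides, and all names are hypothetical.

```python
def legal_transitions(stack, buffer):
    moves = []
    if buffer:
        moves.append("SHIFT")
    if len(stack) >= 2:
        if stack[-2] != 0:                 # ROOT can never become a dependent
            moves.append("LEFT-ARC")
        if stack[-2] != 0 or not buffer:   # attach to ROOT only at the very end
            moves.append("RIGHT-ARC")
    return moves

def apply_transition(stack, buffer, arcs, t):
    if t == "SHIFT":
        stack.append(buffer.pop(0))
    elif t == "LEFT-ARC":
        dep = stack.pop(-2)
        arcs.add((stack[-1], dep))
    else:  # RIGHT-ARC
        dep = stack.pop()
        arcs.add((stack[-1], dep))

def greedy_parse(n_words, score):
    """Greedy parsing: argmax over legal transitions at every step."""
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), set()
    sequence, total = [], 0.0
    while buffer or len(stack) > 1:
        config = (tuple(stack), tuple(buffer))
        best = max(legal_transitions(stack, buffer),
                   key=lambda t: score(config, t))
        total += score(config, best)       # sequence score = sum of step scores
        apply_transition(stack, buffer, arcs, best)
        sequence.append(best)
    return arcs, sequence, total

def toy_score(config, transition):
    # hypothetical stand-in for a learned model: ignores the configuration
    return {"LEFT-ARC": 2.0, "SHIFT": 1.0, "RIGHT-ARC": 0.0}[transition]

arcs, sequence, total = greedy_parse(3, toy_score)
print(sorted(arcs), len(sequence), total)
```

The loop runs for exactly 2n steps on n words, which is where the linear bound for projective systems comes from. Greedy search maximizes each step locally, so it is not guaranteed to find the sequence with the highest total score; beam search keeps several candidates alive to reduce that risk.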

Lecture 4 Transition-based Parsing – Nivre algorithm

Lecture 4 Transition-based Parsing
Learning for transition-based parsing
Typical scoring function: $s(c_i, d_i) = w \cdot f(c_i, d_i)$
where $f(c_i, d_i)$ is a feature vector over configuration $c_i$ and transition $d_i$, and $w$ is a weight vector [$w_j$ = weight of feature $f_j(c_i, d_i)$]
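With binary indicator features, the dot product reduces to summing the weights of the features that fire. The sketch below assumes a sparse representation (a list of active feature names and a weight dictionary) and three made-up feature templates over the top of the stack and the front of the buffer; real parsers use far richer templates.

```python
def extract_features(words, stack, buffer, transition):
    """Return the names of the binary features that fire for (c, d)."""
    s0 = words[stack[-1]] if stack else "<none>"    # word on top of the stack
    b0 = words[buffer[0]] if buffer else "<none>"   # next word in the buffer
    return [
        f"s0={s0}|t={transition}",
        f"b0={b0}|t={transition}",
        f"s0={s0}|b0={b0}|t={transition}",
    ]

def score(weights, active):
    """w . f for a sparse binary f: sum the weights of the active features."""
    return sum(weights.get(name, 0.0) for name in active)

words = ["ROOT", "Green", "ideas", "sleep", "furiously"]
weights = {   # hypothetical weights, as if learned from a treebank
    "s0=Green|b0=ideas|t=SHIFT": 1.5,
    "s0=Green|t=SHIFT": 0.5,
    "s0=Green|t=RIGHT-ARC": -1.0,
}
stack, buffer = [0, 1], [2, 3, 4]
for t in ("SHIFT", "RIGHT-ARC"):
    print(t, score(weights, extract_features(words, stack, buffer, t)))
```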
Problem
Learning is local but features are based on the global history

Lecture 4 Transition-based Parsing
Projectivization to pseudo-projectivity:

Lecture 4 Graph-based Parsing
For an input sentence $S$ we define a graph $G_S = (V_S, A_S)$ where $V_S = \{w_0, w_1, \ldots, w_n\}$ and $A_S = \{(w_i, w_j, l) \mid w_i, w_j \in V_S \text{ and } l \in L\}$
Score of a dependency tree $T$ factors by subgraphs $G_1, \ldots, G_m$:
$s(T) = \sum_{i=1}^{m} s(G_i)$
Learning: Scoring function $s(G_i)$ for a subgraph $G_i \in T$
Inference: Search for the maximum spanning tree $T^*$ of $G_S$ given $s(G_i)$
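For the simplest factorization, where each subgraph is a single unlabeled arc, inference can be illustrated by brute force on a tiny sentence. The sketch below enumerates every head assignment, keeps those that form a tree rooted in $w_0$, and returns the highest-scoring one. It is exponential and is not the Eisner or Chu-Liu-Edmonds algorithm; the score matrix is a made-up toy example.

```python
from itertools import product

def is_tree(heads):
    """heads[d-1] is the head of token d; every token must reach ROOT (0)."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:      # cycle or self-loop
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(scores):
    """scores[h][d] is the score of the arc h -> d; index 0 is ROOT."""
    n = len(scores) - 1
    best_score, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        # arc-factored model: tree score = sum of its arc scores
        total = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if total > best_score:
            best_score, best_heads = total, heads
    return best_heads, best_score

# tokens: 0 ROOT, 1 John, 2 saw, 3 Mary (toy scores)
scores = [
    [0,  9, 10,  9],   # ROOT -> John / saw / Mary
    [0,  0, 20,  3],   # John -> saw / Mary
    [0, 30,  0, 30],   # saw  -> John / Mary
    [0, 11,  0,  0],   # Mary -> John
]
print(best_tree(scores))
```

Picking the best head for each word independently would select both saw → John and John → saw, which is a cycle. Handling such cycles efficiently is what dedicated spanning-tree algorithms are for.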

Lecture 4 Graph-based Parsing
Learning graph-based models
Typical scoring function: $s(G_i) = w \cdot f(G_i)$
where $f(G_i)$ is a high-dimensional feature vector over subgraphs and $w$ is a weight vector [$w_j$ = weight of feature $f_j(G_i)$]
Structured learning [McDonald et al. 2005a, Smith and Johnson 2007]:
Learn weights that maximize the score of the correct dependency tree for every sentence in the training set
Problem
Learning is global (trees) but features are local (subgraphs)
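One simple way to push the weights toward the correct tree is a structured-perceptron update: add the features of the gold tree and subtract those of the predicted tree. This learner is swapped in purely for illustration and is not necessarily the one used in the cited work; the arc feature templates and all names below are assumptions.

```python
from collections import Counter

def arc_features(words, head, dep):
    # local features of a single arc (hypothetical templates)
    direction = "R" if head < dep else "L"
    return [f"hw={words[head]}|dw={words[dep]}", f"dir={direction}"]

def tree_features(words, heads):
    """Global feature vector of a tree = sum of its local arc features."""
    feats = Counter()
    for dep, head in enumerate(heads, start=1):
        feats.update(arc_features(words, head, dep))
    return feats

def tree_score(weights, words, heads):
    return sum(weights.get(f, 0.0) * c
               for f, c in tree_features(words, heads).items())

def perceptron_update(weights, words, gold_heads, predicted_heads, lr=1.0):
    """Move the weights toward the gold tree and away from the prediction."""
    if tuple(gold_heads) == tuple(predicted_heads):
        return
    gold = tree_features(words, gold_heads)
    pred = tree_features(words, predicted_heads)
    for f in gold.keys() | pred.keys():
        weights[f] = weights.get(f, 0.0) + lr * (gold[f] - pred[f])

words = ["ROOT", "John", "saw", "Mary"]
gold, wrong = (2, 0, 2), (0, 1, 2)
weights = {}
perceptron_update(weights, words, gold, wrong)
print(tree_score(weights, words, gold), tree_score(weights, words, wrong))
```

The update compares whole trees (global), yet each feature only looks at one arc (local), which is the tension named on the slide.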

Lecture 4 Graph-based Parsing – Eisner algorithm

Lecture 4 Graph-based Parsing – Chu-Liu-Edmonds algorithm

Lecture 4 Grammar-based approach
a. context-free dependency parsing
exploits a mapping from dependency structures to CFG structure representations and reuses parsing algorithms originally developed for CFG → chart parsing algorithms
b. constraint-based dependency parsing
parsing viewed as a constraint satisfaction problem
grammar defined as a set of constraints on well-formed dependency graphs
finding a dependency graph for a sentence that satisfies all the constraints of the grammar (having the best score)
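The constraint-satisfaction view can be sketched by brute force: enumerate candidate trees, test each constraint, and keep the best-scoring candidate. In this sketch a constraint is a (penalty, predicate) pair, a violated constraint multiplies the score by its penalty, and a penalty of 0.0 marks a hard constraint. This scoring convention, the tag-based constraints, and all names are assumptions for illustration, not the grammar of any particular system.

```python
from itertools import product

def is_tree(heads):
    """heads[d-1] is the head of token d; every token must reach ROOT (0)."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def head_tag_is(dep_tag, head_tag):
    """Constraint: every token tagged dep_tag has a head tagged head_tag."""
    def holds(tags, heads):
        return all(tags[h] == head_tag
                   for d, h in enumerate(heads, start=1) if tags[d] == dep_tag)
    return holds

def single_root(tags, heads):
    return heads.count(0) == 1

def best_analysis(tags, constraints):
    n = len(tags) - 1
    best = (0.0, None)
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        score = 1.0
        for penalty, holds in constraints:
            if not holds(tags, heads):
                score *= penalty          # 0.0 rules the candidate out
        if score > best[0]:
            best = (score, heads)
    return best

# Green/A ideas/N sleep/V furiously/ADV
tags = ["ROOT", "A", "N", "V", "ADV"]
constraints = [
    (0.0, single_root),
    (0.0, head_tag_is("V", "ROOT")),
    (0.0, head_tag_is("A", "N")),
    (0.0, head_tag_is("ADV", "V")),
    (0.5, head_tag_is("N", "V")),        # soft preference
]
print(best_analysis(tags, constraints))
```

Attaching "ideas" to "furiously" would still satisfy every hard constraint, but it violates the soft one and scores 0.5, so the analysis with "ideas" under "sleep" wins.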

Lecture 4 Grammar-based approach
a. context-free dependency parsing
Advantage: well-studied parsing algorithms such as CKY and Earley's algorithm can be used for dependency parsing as well.
→ need to convert dependency grammars into efficiently parsable context-free grammars (e.g. bilexical CFG, Eisner and Smith, 2005)
b. constraint-based dependency parsing
defines the problem as constraint satisfaction
Weighted constraint dependency grammar (WCDG, Foth and Menzel, 2005)
Transformation-based CDG

Lecture 4 Conclusions
1 Dependency syntax vs. constituency (phrase-structure) syntax
2 Non-projectivity
3 Graph-based and Transition-based methods