HMM Tagging
PA154 Jazykové modelování (6.2)
Pavel Rychlý
pary@fi.muni.cz
March 23, 2020

Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

Review
■ Recall:
► tagging ~ morphological disambiguation
► tagset V_T ⊂ (C_1, C_2, ..., C_n)
► C_i - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
► mapping w → {t ∈ V_T} exists
► restriction of Morphological Analysis: A⁺ → 2^(L, C_1, C_2, ..., C_n), where A is the language alphabet and L is the set of lemmas
► extension: punctuation and sentence boundaries (treated as words)

The Setting
■ Noisy Channel setting:
Input (tags): NNP VBZ DT ... → [the channel (adds "noise")] → Output (words): John drinks the ...
■ Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
► p(T|W) = p(W|T) p(T) / p(W)
► p(W) is fixed (W is given): argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model
■ Two models (d = |W| = |T| is the length of the word sequence):
► p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
► p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
■ Too many parameters (as always)
■ Approximation using the following assumptions:
► words do not depend on the context
► tag depends on a limited history: p(t_i | t_1, ..., t_{i-1}) ≅ p(t_i | t_{i-n+1}, ..., t_{i-1}), i.e. an n-gram tag "language" model
► word depends on its tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) ≅ p(w_i | t_i)

The HMM Model Definition
■ (Almost) general HMM:
► output (words) emitted by states (not arcs)
► states: (n-1)-tuples of tags if an n-gram tag model is used
► five-tuple (S, s_0, Y, P_S, P_Y), where:
► S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state
► Y = {y_1, y_2, ..., y_V} is the output alphabet (the words)
► P_S(s_j | s_i) is the set of probability distributions of transitions: P_S(s_j | s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}) for s_j = (t_{i-n+2}, ..., t_i), s_i = (t_{i-n+1}, ..., t_{i-1})
► P_Y(y_k | s_i) is the set of output (emission) probability distributions; another simplification: P_Y(y_k | s_i) = P_Y(y_k | s_j) if s_i and s_j contain the same tag as the rightmost element, so that P_Y(y_k | s_i) = p(w_i | t_i)

Supervised Learning (Manually Annotated Data Available)
■ Use MLE:
► p(w_i | t_i) = c_wt(t_i, w_i) / c_t(t_i)
► p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_tn(t_{i-n+1}, ..., t_{i-1}, t_i) / c_t(n-1)(t_{i-n+1}, ..., t_{i-1})
■ Smooth (both!):
► p(w_i | t_i): "Add 1" for all possible tag-word pairs using a predefined dictionary (thus some zeros are kept!)
► p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation, e.g. for a trigram model (see the code sketch below):
p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |V_T|

Unsupervised Learning
■ Completely unsupervised learning is impossible
► at least if we have the tagset given: how else would we associate words with tags?
■ Assumed (minimal) setting:
► tagset known
► dictionary/morphological analysis available (providing the possible tags for any word)
■ Use: the Baum-Welch algorithm (see class 15, 10/13)
► "tying": output (state-emitting) only, i.e. the same distribution from two states with the same "final" tag

Comments on Unsupervised Learning
■ Initialization of Baum-Welch:
► if some annotated data are available, use them
► keep 0 for impossible output probabilities
■ Beware of:
► degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
► use heldout data for cross-checking
■ Supervised learning is almost always better

Unknown Words
■ "OOV" words (out-of-vocabulary):
► we do not have a list of possible tags for them
► and we certainly have no output probabilities
■ Solutions:
► try all tags (uniform distribution)
► try open-class tags only (uniform or unigram distribution)
► try to "guess" the possible tags (based on the suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization)

Running the Tagger
■ Use Viterbi:
► remember to handle unknown words
► single-best and n-best outputs are possible (see the sketch at the end of this section)
■ Another option:
► always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no backward pass over the path)
► introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)

(Tagger) Evaluation
■ A must. Test data (S), previously unseen (in training):
► change test data often if at all possible! (to avoid "feedback cheating")
► error-rate based
■ Formally:
► Out(w) = set of output "items" for an input "item" w
► True(w) = the single correct output (annotation) for w
► Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
► Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))