Review: HMM Tagging
PA154 Jazykové modelování (6.2)
Pavel Rychlý, pary@fi.muni.cz
March 23, 2020

Recall
► tagging ~ morphological disambiguation
► tagset VT ⊂ (C1, C2, ..., Cn)
  ► Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
► mapping w → {t ∈ VT} exists
  ► restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
► extension to punctuation, sentence boundaries (treated as words)

Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Setting
Noisy Channel setting:
  Input (tags)       The channel        Output (words)
  NNP VBZ DT ...  →  (adds "noise")  →  John drinks the ...
Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
► p(T|W) = p(W|T) p(T) / p(W)
► p(W) is fixed (W is given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model
Two models (d = |W| = |T| = sentence length):
► p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
► p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
Too many parameters (as always)
Approximation using the following assumptions:
► words do not depend on the context
► tag depends on limited history: p(t_i | t_1, ..., t_{i-1}) ≈ p(t_i | t_{i-n+1}, ..., t_{i-1})
  ► n-gram tag "language" model
► word depends on its tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) ≈ p(w_i | t_i)

The HMM Model Definition
(Almost) general HMM:
► output (words) emitted by states (not arcs)
► states: (n-1)-tuples of tags if an n-gram tag model is used
► five-tuple (S, s_0, Y, P_S, P_Y), where:
  ► S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state
  ► Y = {y_1, y_2, ..., y_V} is the output alphabet (the words)
  ► P_S(s_j | s_i) is the set of probability distributions of transitions:
    P_S(s_j | s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}); s_j = (t_{i-n+2}, ..., t_i), s_i = (t_{i-n+1}, ..., t_{i-1})
  ► P_Y(y_k | s_i) is the set of output (emission) probability distributions
    ► another simplification: P_Y(y_k | s_i) = P_Y(y_k | s_j) if s_i and s_j contain the same tag as the rightmost element: P_Y(w_i | s_i) = p(w_i | t_i)

Supervised Learning (Manually Annotated Data Available)
Use MLE:
► p(w_i | t_i) = c_wt(t_i, w_i) / c_t(t_i)
► p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_tn(t_{i-n+1}, ..., t_{i-1}, t_i) / c_t(n-1)(t_{i-n+1}, ..., t_{i-1})
Smooth (both!):
► p(w_i | t_i): "Add 1" for all possible tag-word pairs using a predefined dictionary (thus some zeros are kept!)
► p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation:
  ► e.g. for a trigram model: p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |VT|
(a small estimation sketch appears after the unsupervised-learning notes below)

Unsupervised Learning
Completely unsupervised learning is impossible
► at least if only the tagset is given: how would we associate words with tags?
Assumed (minimal) setting:
► tagset known
► dictionary / morphological analysis available (providing possible tags for any word)
Use: Baum-Welch algorithm (see class 15, 10/13)
► "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Comments on Unsupervised Learning
Initialization of Baum-Welch:
► if some annotated data are available, use them
► keep 0 for impossible output probabilities
Beware of:
► degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
► use heldout data for cross-checking
Supervised learning is almost always better.
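To make the supervised estimation above concrete, here is a minimal sketch in Python (not the lecture's code): MLE counts from a toy tagged corpus, add-one smoothing of p(w|t) restricted to a predefined dictionary, and a linearly interpolated trigram tag model. All names (train, tag_dictionary, the λ values, ...) are illustrative assumptions; in practice the λ weights would be estimated by EM on heldout data rather than fixed by hand.

```python
# Illustrative sketch only: MLE counts, add-one smoothing of p(w|t) over a
# predefined dictionary, and an interpolated trigram tag model p'(t_i | t_{i-2}, t_{i-1}).
from collections import Counter

# toy annotated data: sentences as lists of (word, tag) pairs (assumed format)
train = [[("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("tea", "NN")],
         [("Mary", "NNP"), ("drinks", "VBZ"), ("coffee", "NN")]]

# predefined dictionary: possible tags for each known word
tag_dictionary = {"John": {"NNP"}, "Mary": {"NNP"}, "drinks": {"VBZ", "NNS"},
                  "the": {"DT"}, "tea": {"NN"}, "coffee": {"NN"}}

c_wt, c_t, c_t1, c_t2, c_t3 = Counter(), Counter(), Counter(), Counter(), Counter()
for sent in train:
    tags = ["<s>", "<s>"] + [t for _, t in sent]        # padded tag history
    for w, t in sent:
        c_wt[(t, w)] += 1                               # c_wt(t_i, w_i)
        c_t[t] += 1                                     # c_t(t_i)
    for i in range(2, len(tags)):
        c_t1[tags[i]] += 1
        c_t2[(tags[i - 1], tags[i])] += 1
        c_t3[(tags[i - 2], tags[i - 1], tags[i])] += 1

tagset = set(c_t)                                       # V_T
N = sum(c_t1.values())
allowed = Counter()                                     # per tag: number of dictionary words it is allowed for
for w, word_tags in tag_dictionary.items():
    for t in word_tags:
        allowed[t] += 1

def p_emit(w, t):
    """Add-one smoothed p(w|t), only for tag-word pairs the dictionary allows (zeros are kept)."""
    if t not in tag_dictionary.get(w, ()):
        return 0.0
    return (c_wt[(t, w)] + 1) / (c_t[t] + allowed[t])

def p_tag(t, t2, t1, lambdas=(0.05, 0.15, 0.3, 0.5)):
    """Interpolated trigram tag model; the lambda weights are fixed here, not EM-estimated."""
    l0, l1, l2, l3 = lambdas
    p3 = c_t3[(t2, t1, t)] / c_t2[(t2, t1)] if c_t2[(t2, t1)] else 0.0
    p2 = c_t2[(t1, t)] / c_t1[t1] if c_t1[t1] else 0.0
    p1 = c_t1[t] / N
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / len(tagset)

print(p_emit("drinks", "VBZ"), p_tag("VBZ", "<s>", "NNP"))
```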
Unknown Words
"OOV" words (out-of-vocabulary)
► we do not have a list of possible tags for them
► and we certainly have no output probabilities
Solutions:
► try all tags (uniform distribution)
► try open-class tags (uniform, unigram distribution)
► try to "guess" possible tags (based on suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization)

Running the Tagger
Use Viterbi:
► remember to handle unknown words
► single-best or n-best output possible (see the decoding sketch after the evaluation slides below)
Another option:
► always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers nor a path back-pass)
► introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)

(Tagger) Evaluation
A must. Test data (S), previously unseen (in training):
► change test data often if at all possible! ("feedback cheating")
► error-rate based
Formally:
► Out(w) = set of output "items" for an input "item" w
► True(w) = single correct output (annotation) for w
► Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
► Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
► Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Evaluation Metrics
Accuracy: single output (tagging: each word gets a single tag)
► Error rate: Err(S) = Errors(S) / |S|
► Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)
What if there are multiple (or no) outputs?
► Recall: R(S) = Correct(S) / |S|
► Precision: P(S) = Correct(S) / Generated(S)
► Combination: F measure: F = 1 / (α/P + (1 - α)/R)
  ► α is a weight given to precision vs. recall; for α = 0.5, F = 2PR / (R + P)
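For "Running the Tagger", a correspondingly minimal single-best Viterbi sketch follows. It is an illustration, not the lecture's implementation: it reuses p_emit, p_tag, tagset and tag_dictionary from the estimation sketch above, treats states as single tags (a bigram simplification of the general (n-1)-tuple states), and handles unknown words with a uniform distribution over an assumed open-class tag set, one of the OOV strategies listed earlier.

```python
# Single-best Viterbi sketch; assumes p_emit, p_tag, tagset and tag_dictionary from
# the estimation sketch above are in scope. States are single tags here (a bigram
# simplification of the (n-1)-tuple states in the HMM definition).
import math

OPEN_CLASS = {"NN", "NNP", "VBZ"}                     # illustrative open-class tags for OOV words

def emit(w, t):
    """Emission probability with a uniform open-class fallback for unknown words."""
    if w in tag_dictionary:
        return p_emit(w, t)
    return 1.0 / len(OPEN_CLASS) if t in OPEN_CLASS else 0.0

def viterbi(words):
    """Return argmax_T p(W|T) p(T) under the bigram simplification."""
    delta = {"<s>": 0.0}                              # log-prob of the best path ending in each tag
    backptr = []                                      # one back-pointer table per word
    for w in words:
        new_delta, back = {}, {}
        for t in tagset:
            pe = emit(w, t)
            if pe == 0.0:
                continue                              # impossible emission, prune
            best_prev, best = None, float("-inf")
            for prev, score in delta.items():
                pt = p_tag(t, "<s>", prev)            # "<s>" as dummy older history (bigram use)
                if pt > 0.0 and score + math.log(pt) > best:
                    best_prev, best = prev, score + math.log(pt)
            if best_prev is not None:
                new_delta[t] = best + math.log(pe)
                back[t] = best_prev
        delta = new_delta
        backptr.append(back)
    tags = [max(delta, key=delta.get)]                # best final state, then follow back pointers
    for back in reversed(backptr[1:]):
        tags.append(back[tags[-1]])
    return list(reversed(tags))

print(viterbi(["John", "drinks", "tea"]))             # e.g. ['NNP', 'VBZ', 'NN'] on the toy model
```

An n-best variant would keep the k best scores and back pointers per state instead of a single one.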
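The evaluation formulas above translate almost directly into code. A small sketch, assuming Out(w) is represented as a set of proposed tags and True(w) as a single gold tag (names and toy data are made up for illustration):

```python
# Sketch of the evaluation measures above; the input representation is an assumption.

def evaluate(out, gold):
    """out: list of sets of proposed tags Out(w_i); gold: list of single gold tags True(w_i)."""
    n = len(gold)
    errors = sum(1 for o, t in zip(out, gold) if o != {t})     # delta(Out(w_i) != True(w_i))
    correct = sum(1 for o, t in zip(out, gold) if t in o)      # delta(True(w_i) in Out(w_i))
    generated = sum(len(o) for o in out)                       # sum of |Out(w_i)|
    err = errors / n                                           # Err(S)
    acc = 1 - err                                              # Acc(S)
    recall = correct / n                                       # R(S)
    precision = correct / generated                            # P(S)
    return err, acc, precision, recall

def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives F = 2PR/(R + P)."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

# toy check: 4 words, the second gets two candidate tags, the last one is wrong
out = [{"NNP"}, {"VBZ", "NNS"}, {"DT"}, {"NN"}]
gold = ["NNP", "VBZ", "DT", "JJ"]
err, acc, p, r = evaluate(out, gold)
print(err, acc, p, r, f_measure(p, r))                         # 0.5 0.5 0.6 0.75 0.666...
```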