HMM Tagging
PA154 Jazykové modelování (6.2)
Pavel Rychlý, pary@fi.muni.cz
April 6, 2021
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Review

■ Recall:
► tagging ~ morphological disambiguation
► tagset V_T ⊂ (C_1, C_2, ..., C_n)
► C_i: morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
► a mapping w → {t ∈ V_T} exists
► restriction of Morphological Analysis: A⁺ → 2^(L, C_1, C_2, ..., C_n), where A is the language alphabet and L is the set of lemmas
► extension to punctuation, sentence boundaries (treated as words)

The Setting

■ Noisy Channel setting:
► Input (tags): NNP VBZ DT ...
► The channel (adds "noise")
► Output (words): John drinks the ...
■ Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
► p(T|W) = p(W|T) p(T) / p(W)
► p(W) is fixed (W is given): argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model

■ Two models (d = |W| = |T| is the sequence length):
► p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
► p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
■ Too many parameters (as always)
■ Approximation using the following assumptions:
► words do not depend on the context
► tag depends on a limited history: p(t_i | t_1, ..., t_{i-1}) ≅ p(t_i | t_{i-n+1}, ..., t_{i-1}), i.e. an n-gram tag "language" model
► word depends on its tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) ≅ p(w_i | t_i)

The HMM Model Definition

■ (Almost) general HMM:
► output (words) emitted by states (not arcs)
► states: (n-1)-tuples of tags if an n-gram tag model is used
► five-tuple (S, s_0, Y, P_S, P_Y), where:
► S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state
► Y = {y_1, y_2, ..., y_V} is the output alphabet (the words)
► P_S(s_j|s_i) is the set of probability distributions of transitions: P_S(s_j|s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}); s_j = (t_{i-n+2}, ..., t_i), s_i = (t_{i-n+1}, ..., t_{i-1})
► P_Y(y_k|s_i) is the set of output (emission) probability distributions; another simplification: P_Y(y_k|s_i) = P_Y(y_k|s_j) if s_i and s_j contain the same tag as the rightmost element, so that P_Y(y_k|s_i) = p(w_i|t_i)

Supervised Learning (Manually Annotated Data Available)

■ Use MLE:
► p(w_i|t_i) = c_{wt}(t_i, w_i) / c_t(t_i)
► p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_{tn}(t_{i-n+1}, ..., t_{i-1}, t_i) / c_{t(n-1)}(t_{i-n+1}, ..., t_{i-1})
■ Smooth (both!):
► p(w_i|t_i): "Add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!)
► p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation, e.g. for a trigram model:
p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |V_T|
► (a code sketch of this estimation follows the next slide)

Unsupervised Learning

■ Completely unsupervised learning is impossible
► at least if we have the tagset given: how would we associate words with tags?
■ Assumed (minimal) setting:
► tagset known
► dictionary / morphological analysis available (providing the possible tags for any word)
■ Use: Baum-Welch algorithm (see class 15, 10/13)
► "tying": output (state-emitting only, same distribution from two states with the same "final" tag)
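To make the supervised estimation above concrete, here is a minimal sketch in Python (not from the lecture): it collects the MLE counts, applies "Add 1" emission smoothing restricted to a predefined dictionary, and linearly interpolates the trigram tag model. The corpus format, the `lexicon` dictionary, and the fixed λ weights are assumptions; in practice the λ's would be estimated on heldout data (e.g. by EM).

from collections import Counter, defaultdict

def train_hmm(tagged_sentences, lexicon, lambdas=(0.05, 0.15, 0.30, 0.50)):
    """MLE estimation for a trigram HMM tagger (a sketch, with assumed I/O).

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs
    lexicon: dict word -> set of possible tags; "Add 1" is applied only to
             these pairs, so impossible (word, tag) pairs keep probability 0
    lambdas: (l0, l1, l2, l3) interpolation weights; fixed here for
             simplicity, normally estimated on heldout data
    """
    wt, uni, bi, tri = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ['<s>', '<s>'] + [t for _, t in sent]     # padded tag history
        for w, t in sent:
            wt[(w, t)] += 1
        for i in range(2, len(tags)):
            uni[tags[i]] += 1
            bi[(tags[i - 1], tags[i])] += 1
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1

    tagset = set(uni)
    total = sum(uni.values())
    allowed = defaultdict(set)            # tag -> words the dictionary allows
    for w, ts in lexicon.items():
        for t in ts:
            allowed[t].add(w)

    def p_emit(w, t):
        # p(w_i|t_i) with "Add 1" over dictionary-licensed pairs only
        if t not in lexicon.get(w, ()):
            return 0.0
        return (wt[(w, t)] + 1) / (uni[t] + len(allowed[t]))

    def p_trans(t, h2, h1):
        # p'(t_i|t_{i-2}, t_{i-1}) by linear interpolation, as on the slide
        l0, l1, l2, l3 = lambdas
        p3 = tri[(h2, h1, t)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
        p2 = bi[(h1, t)] / uni[h1] if uni[h1] else 0.0
        return l3 * p3 + l2 * p2 + l1 * uni[t] / total + l0 / len(tagset)

    return p_emit, p_trans, tagset

The dictionary restriction on "Add 1" is what keeps impossible (word, tag) pairs at zero, as the slide requires; the returned functions plug directly into the Viterbi sketch at the end of these notes.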
Comments on Unsupervised Learning

■ Initialization of Baum-Welch:
► if some annotated data are available, use them
► keep 0 for impossible output probabilities
■ Beware of:
► degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
► use heldout data for cross-checking
■ Supervised learning is almost always better

Unknown Words

■ "OOV" words (out-of-vocabulary):
► we do not have a list of possible tags for them
► and we certainly have no output probabilities
■ Solutions:
► try all tags (uniform distribution)
► try open-class tags (uniform or unigram distribution)
► try to "guess" the possible tags (based on the suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization)

Running the Tagger

■ Use Viterbi:
► remember to handle unknown words
► single-best or n-best output possible
► (a minimal Viterbi decoder is sketched at the end of these notes)
■ Another option:
► always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no backward pass along the path)
► introduces random errors and implausible sequences, but might reach higher accuracy (fewer secondary errors)

(Tagger) Evaluation

■ A must. Test data (S), previously unseen (in training):
► change the test data often if at all possible! ("feedback cheating")
► error-rate based
■ Formally:
► Out(w) = set of output "items" for an input "item" w
► True(w) = the single correct output (annotation) for w
► Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
► Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
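Below is a minimal Viterbi decoder matching the trigram model sketched earlier (an illustration, not the lecture's reference code): states are tag pairs (t_{i-1}, t_i), the (n-1)-tuples of the HMM definition above for n = 3, and OOV words fall back to a uniform emission distribution, the simplest of the unknown-word strategies listed. `p_emit`, `p_trans`, and `tagset` are the assumed outputs of the training sketch.

import math

def viterbi(words, tagset, p_trans, p_emit, start='<s>'):
    """Single-best tag sequence for a trigram HMM whose states are tag
    pairs (t_{i-1}, t_i); words are emitted through the rightmost tag.
    OOV words (zero emission under every tag) get a uniform emission,
    an assumption standing in for a real unknown-word model."""
    def emissions(w):
        e = {t: p_emit(w, t) for t in tagset}
        if not any(e.values()):                  # out-of-vocabulary word
            e = dict.fromkeys(tagset, 1.0 / len(tagset))
        return e

    # delta[(t1, t2)]: best log-prob of a partial sequence ending in t1, t2
    e = emissions(words[0])
    delta = {(start, t): math.log(p_trans(t, start, start) * e[t])
             for t in tagset if e[t] > 0}
    backptrs = []                                # one back-pointer map per word

    for w in words[1:]:
        e = emissions(w)
        new_delta, back = {}, {}
        for (t1, t2), score in delta.items():
            for t in tagset:
                if e[t] == 0:                    # tag impossible for this word
                    continue
                cand = score + math.log(p_trans(t, t1, t2) * e[t])
                if cand > new_delta.get((t2, t), float('-inf')):
                    new_delta[(t2, t)] = cand
                    back[(t2, t)] = (t1, t2)
        backptrs.append(back)
        delta = new_delta

    # follow the back pointers from the best final state
    state = max(delta, key=delta.get)
    tags = [state[1]]
    for back in reversed(backptrs):
        state = back[state]
        tags.append(state[1])
    return list(reversed(tags))

n-best output is a direct extension: keep the k best (score, back pointer) entries per state instead of a single one.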