Review: HMM Tagging
PA154 Jazykové modelování (6.2)
Pavel Rychlý, pary@fi.muni.cz
March 23, 2020

Recall
► tagging ~ morphological disambiguation
► tagset VT ⊂ (C1, C2, ..., Cn)
  ► Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
► mapping w → {t ∈ VT} exists
  ► restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
► extension to punctuation, sentence boundaries (treated as words)

Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Setting
Noisy Channel setting:
  Input (tags)       The channel        Output (words)
  NNP VBZ DT ...  →  (adds "noise")  →  John drinks the ...
Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
► p(T|W) = p(W|T) p(T) / p(W)
► p(W) is fixed (W is given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model
Two models (d = |W| = |T| = sentence length):
► p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
► p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
Too many parameters (as always)
Approximation using the following assumptions:
► words do not depend on the context
► tag depends on limited history: p(t_i | t_1, ..., t_{i-1}) ≈ p(t_i | t_{i-n+1}, ..., t_{i-1})
  ► n-gram tag "language" model
► word depends on its tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) ≈ p(w_i | t_i)

The HMM Model Definition
(Almost) general HMM:
► output (words) emitted by states (not arcs)
► states: (n-1)-tuples of tags if an n-gram tag model is used
► five-tuple (S, s_0, Y, P_S, P_Y), where:
  ► S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state
  ► Y = {y_1, y_2, ..., y_V} is the output alphabet (the words)
  ► P_S(s_j | s_i) is the set of probability distributions of transitions:
    P_S(s_j | s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}); s_j = (t_{i-n+2}, ..., t_i), s_i = (t_{i-n+1}, ..., t_{i-1})
  ► P_Y(y_k | s_i) is the set of output (emission) probability distributions
    ► another simplification: P_Y(y_k | s_i) = P_Y(y_k | s_j) if s_i and s_j contain the same tag as the rightmost element: P_Y(w_i | s_i) = p(w_i | t_i)

Supervised Learning (Manually Annotated Data Available)
Use MLE:
► p(w_i | t_i) = c_wt(t_i, w_i) / c_t(t_i)
► p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_tn(t_{i-n+1}, ..., t_{i-1}, t_i) / c_t(n-1)(t_{i-n+1}, ..., t_{i-1})
Smooth (both!):
► p(w_i | t_i): "Add 1" for all possible tag-word pairs using a predefined dictionary (thus some zeros are kept!)
► p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation:
  ► e.g. for a trigram model: p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |VT|
(a small estimation sketch appears after the unsupervised-learning notes below)

Unsupervised Learning
Completely unsupervised learning is impossible
► at least if only the tagset is given: how would we associate words with tags?
Assumed (minimal) setting:
► tagset known
► dictionary / morphological analysis available (providing possible tags for any word)
Use: Baum-Welch algorithm (see class 15, 10/13)
► "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Comments on Unsupervised Learning
Initialization of Baum-Welch:
► if some annotated data are available, use them
► keep 0 for impossible output probabilities
Beware of:
► degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
► use heldout data for cross-checking
Supervised learning is almost always better.
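To make the supervised estimation above concrete, here is a minimal sketch in Python (not the lecture's code): MLE counts from a toy tagged corpus, add-one smoothing of p(w|t) restricted to a predefined dictionary, and a linearly interpolated trigram tag model. All names (train, tag_dictionary, the λ values, ...) are illustrative assumptions; in practice the λ weights would be estimated by EM on heldout data rather than fixed by hand.

```python
# Illustrative sketch only: MLE counts, add-one smoothing of p(w|t) over a
# predefined dictionary, and an interpolated trigram tag model p'(t_i | t_{i-2}, t_{i-1}).
from collections import Counter

# toy annotated data: sentences as lists of (word, tag) pairs (assumed format)
train = [[("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("tea", "NN")],
         [("Mary", "NNP"), ("drinks", "VBZ"), ("coffee", "NN")]]

# predefined dictionary: possible tags for each known word
tag_dictionary = {"John": {"NNP"}, "Mary": {"NNP"}, "drinks": {"VBZ", "NNS"},
                  "the": {"DT"}, "tea": {"NN"}, "coffee": {"NN"}}

c_wt, c_t, c_t1, c_t2, c_t3 = Counter(), Counter(), Counter(), Counter(), Counter()
for sent in train:
    tags = ["<s>", "<s>"] + [t for _, t in sent]        # padded tag history
    for w, t in sent:
        c_wt[(t, w)] += 1                               # c_wt(t_i, w_i)
        c_t[t] += 1                                     # c_t(t_i)
    for i in range(2, len(tags)):
        c_t1[tags[i]] += 1
        c_t2[(tags[i - 1], tags[i])] += 1
        c_t3[(tags[i - 2], tags[i - 1], tags[i])] += 1

tagset = set(c_t)                                       # V_T
N = sum(c_t1.values())
allowed = Counter()                                     # per tag: number of dictionary words it is allowed for
for w, word_tags in tag_dictionary.items():
    for t in word_tags:
        allowed[t] += 1

def p_emit(w, t):
    """Add-one smoothed p(w|t), only for tag-word pairs the dictionary allows (zeros are kept)."""
    if t not in tag_dictionary.get(w, ()):
        return 0.0
    return (c_wt[(t, w)] + 1) / (c_t[t] + allowed[t])

def p_tag(t, t2, t1, lambdas=(0.05, 0.15, 0.3, 0.5)):
    """Interpolated trigram tag model; the lambda weights are fixed here, not EM-estimated."""
    l0, l1, l2, l3 = lambdas
    p3 = c_t3[(t2, t1, t)] / c_t2[(t2, t1)] if c_t2[(t2, t1)] else 0.0
    p2 = c_t2[(t1, t)] / c_t1[t1] if c_t1[t1] else 0.0
    p1 = c_t1[t] / N
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / len(tagset)

print(p_emit("drinks", "VBZ"), p_tag("VBZ", "<s>", "NNP"))
```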
Unknown Words
"OOV" words (out-of-vocabulary)
► we do not have a list of possible tags for them
► and we certainly have no output probabilities
Solutions:
► try all tags (uniform distribution)
► try open-class tags (uniform, unigram distribution)
► try to "guess" possible tags (based on suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization)

Running the Tagger
Use Viterbi:
► remember to handle unknown words
► single-best or n-best output possible (see the decoding sketch after the evaluation slides below)
Another option:
► always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers nor a path back-pass)
► introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)

(Tagger) Evaluation
A must. Test data (S), previously unseen (in training):
► change test data often if at all possible! ("feedback cheating")
► error-rate based
Formally:
► Out(w) = set of output "items" for an input "item" w
► True(w) = single correct output (annotation) for w
► Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
► Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
► Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Evaluation Metrics
Accuracy: single output (tagging: each word gets a single tag)
► Error rate: Err(S) = Errors(S) / |S|
► Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)
What if there are multiple (or no) outputs?
► Recall: R(S) = Correct(S) / |S|
► Precision: P(S) = Correct(S) / Generated(S)
► Combination: F measure: F = 1 / (α/P + (1 - α)/R)
  ► α is a weight given to precision vs. recall; for α = 0.5, F = 2PR / (R + P)
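For "Running the Tagger", a correspondingly minimal single-best Viterbi sketch follows. It is an illustration, not the lecture's implementation: it reuses p_emit, p_tag, tagset and tag_dictionary from the estimation sketch above, treats states as single tags (a bigram simplification of the general (n-1)-tuple states), and handles unknown words with a uniform distribution over an assumed open-class tag set, one of the OOV strategies listed earlier.

```python
# Single-best Viterbi sketch; assumes p_emit, p_tag, tagset and tag_dictionary from
# the estimation sketch above are in scope. States are single tags here (a bigram
# simplification of the (n-1)-tuple states in the HMM definition).
import math

OPEN_CLASS = {"NN", "NNP", "VBZ"}                     # illustrative open-class tags for OOV words

def emit(w, t):
    """Emission probability with a uniform open-class fallback for unknown words."""
    if w in tag_dictionary:
        return p_emit(w, t)
    return 1.0 / len(OPEN_CLASS) if t in OPEN_CLASS else 0.0

def viterbi(words):
    """Return argmax_T p(W|T) p(T) under the bigram simplification."""
    delta = {"<s>": 0.0}                              # log-prob of the best path ending in each tag
    backptr = []                                      # one back-pointer table per word
    for w in words:
        new_delta, back = {}, {}
        for t in tagset:
            pe = emit(w, t)
            if pe == 0.0:
                continue                              # impossible emission, prune
            best_prev, best = None, float("-inf")
            for prev, score in delta.items():
                pt = p_tag(t, "<s>", prev)            # "<s>" as dummy older history (bigram use)
                if pt > 0.0 and score + math.log(pt) > best:
                    best_prev, best = prev, score + math.log(pt)
            if best_prev is not None:
                new_delta[t] = best + math.log(pe)
                back[t] = best_prev
        delta = new_delta
        backptr.append(back)
    tags = [max(delta, key=delta.get)]                # best final state, then follow back pointers
    for back in reversed(backptr[1:]):
        tags.append(back[tags[-1]])
    return list(reversed(tags))

print(viterbi(["John", "drinks", "tea"]))             # e.g. ['NNP', 'VBZ', 'NN'] on the toy model
```

An n-best variant would keep the k best scores and back pointers per state instead of a single one.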
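The evaluation formulas above translate almost directly into code. A small sketch, assuming Out(w) is represented as a set of proposed tags and True(w) as a single gold tag (names and toy data are made up for illustration):

```python
# Sketch of the evaluation measures above; the input representation is an assumption.

def evaluate(out, gold):
    """out: list of sets of proposed tags Out(w_i); gold: list of single gold tags True(w_i)."""
    n = len(gold)
    errors = sum(1 for o, t in zip(out, gold) if o != {t})     # delta(Out(w_i) != True(w_i))
    correct = sum(1 for o, t in zip(out, gold) if t in o)      # delta(True(w_i) in Out(w_i))
    generated = sum(len(o) for o in out)                       # sum of |Out(w_i)|
    err = errors / n                                           # Err(S)
    acc = 1 - err                                              # Acc(S)
    recall = correct / n                                       # R(S)
    precision = correct / generated                            # P(S)
    return err, acc, precision, recall

def f_measure(p, r, alpha=0.5):
    """F = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.5 gives F = 2PR/(R + P)."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

# toy check: 4 words, the second gets two candidate tags, the last one is wrong
out = [{"NNP"}, {"VBZ", "NNS"}, {"DT"}, {"NN"}]
gold = ["NNP", "VBZ", "DT", "JJ"]
err, acc, p, r = evaluate(out, gold)
print(err, acc, p, r, f_measure(p, r))                         # 0.5 0.5 0.6 0.75 0.666...
```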