MUNI FI
HMM Tagging
PA154 Language Modeling (5.2)
Pavel Rychlý, pary@fi.muni.cz
March 16, 2023

Recall:
■ tagging ~ morphological disambiguation
■ tagset V_T ⊂ (C_1, C_2, ..., C_n)
■ C_i - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
■ mapping w → {t ∈ V_T} exists
■ restriction of Morphological Analysis: A+ → 2^(L,C_1,C_2,...,C_n), where A is the language alphabet, L is the set of lemmas
■ extension of punctuation, sentence boundaries (treated as words)

Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Setting

Noisy Channel setting:
Input (tags): NNP VBZ DT .  →  the channel (adds "noise")  →  Output (words): John drinks the .

Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence):
■ p(T|W) = p(W|T) p(T) / p(W)
■ p(W) is fixed (W is given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model

Two models (d = |W| = |T| = word sequence length):
■ p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
■ p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
Too many parameters (as always)

Approximation using the following assumptions:
■ words do not depend on the context
■ tag depends on limited history: p(t_i | t_1, ..., t_{i-1}) = p(t_i | t_{i-n+1}, ..., t_{i-1})
■ n-gram tag "language" model
■ word depends on tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) = p(w_i | t_i)

The HMM Model Definition

(Almost) general HMM:
■ output (words) emitted by states (not arcs)
■ states: (n-1)-tuples of tags if an n-gram tag model is used
■ five-tuple (S, s_0, Y, P_S, P_Y), where:
  ■ S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state,
  ■ Y = {y_1, y_2, ..., y_V} is the output alphabet (the words),
  ■ P_S(s_j|s_i) is the set of probability distributions of transitions: P_S(s_j|s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}), where s_j = (t_{i-n+2}, ..., t_i) and s_i = (t_{i-n+1}, ..., t_{i-1}),
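To make the noisy-channel decomposition above concrete, here is a minimal sketch of scoring one candidate tag sequence under a bigram (n = 2) tag model. The `trans` and `emit` tables and the `joint_prob` helper are illustrative toy values invented for this example, not probabilities from the lecture or from real data:

```python
# Minimal sketch of p(W|T) * p(T) with a bigram (n = 2) tag model.
# All probabilities below are hypothetical toy values.

trans = {  # p(t_i | t_{i-1}); "<s>" stands in for the initial state s0
    ("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.3, ("VBZ", "DT"): 0.5, ("DT", "."): 0.1,
}
emit = {  # p(w_i | t_i)
    ("NNP", "John"): 0.01, ("VBZ", "drinks"): 0.05, ("DT", "the"): 0.6, (".", "."): 1.0,
}

def joint_prob(words, tags):
    """p(W|T) * p(T) = prod_i p(w_i|t_i) * prod_i p(t_i|t_{i-1})."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

print(joint_prob(["John", "drinks", "the", "."], ["NNP", "VBZ", "DT", "."]))
```

The tagger's job (solved later with Viterbi) is then to find the tag sequence maximizing this product; any unseen transition or emission scores zero here, which is exactly why the smoothing discussed below is needed.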
  ■ P_Y(y_k|s_i) is the set of output (emission) probability distributions; another simplification: P_Y(y_k|s_i) = P_Y(y_k|s_j) if s_i and s_j contain the same tag as the rightmost element: P_Y(y_k|s_i) = p(w_i|t_i)

Supervised Learning (Manually Annotated Data Available)

Use MLE:
■ p(w_i|t_i) = c_wt(t_i, w_i) / c_t(t_i)
■ p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_tn(t_{i-n+1}, ..., t_{i-1}, t_i) / c_t(n-1)(t_{i-n+1}, ..., t_{i-1})

Smooth (both!):
■ p(w_i|t_i): "Add 1" for all possible tag, word pairs using a predefined dictionary (thus some zeros are kept!)
■ p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation:
  ■ e.g. for a trigram model: p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |V_T|

Unsupervised Learning

Completely unsupervised learning is impossible
■ at least if we have the tagset given - how would we associate words with tags?
Assumed (minimal) setting:
■ tagset known
■ dictionary/morphological analysis available (providing possible tags for any word)
Use: the Baum-Welch algorithm (see lecture 5.1)
■ "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Unknown Words

"OOV" words (out-of-vocabulary):
■ we do not have a list of possible tags for them
■ and we certainly have no output probabilities
Solutions:
■ try all tags (uniform distribution)
■ try open-class tags (uniform, unigram distribution)
■ try to "guess" possible tags (based on suffix/ending) - use a different output distribution based on the ending (and/or other factors, such as capitalization)

(Tagger) Evaluation

A must. Test data (S), previously unseen (in training):
■ change test data often if at all possible! ("feedback cheating")
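Before turning to the formal error measures, the supervised MLE estimates and linear-interpolation smoothing described above can be sketched in code. This is a bigram version of the slide's trigram formula, on a hypothetical toy corpus, with λ weights fixed arbitrarily rather than estimated on heldout data (as lecture 5.1's EM procedure would do):

```python
from collections import Counter

# Hypothetical toy annotated corpus; real training data would be far larger.
corpus = [("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("beer", "NN"),
          (".", "."), ("Mary", "NNP"), ("drinks", "VBZ"), ("tea", "NN"), (".", ".")]

tags = [t for _, t in corpus]
c_wt = Counter(corpus)               # c_wt(t_i, w_i): tag-word pair counts
c_t = Counter(tags)                  # c_t(t_i): tag unigram counts
c_bi = Counter(zip(tags, tags[1:]))  # tag bigram counts
V_T = set(tags)                      # the tagset observed in the toy data

def p_word_given_tag(w, t):
    # MLE: p(w_i|t_i) = c_wt(t_i, w_i) / c_t(t_i)
    return c_wt[(w, t)] / c_t[t]

def p_tag_interp(t, t_prev, l2=0.6, l1=0.3, l0=0.1):
    # Linear interpolation, bigram version of the slide's trigram formula:
    # p'(t|t_prev) = l2*p(t|t_prev) + l1*p(t) + l0/|V_T|
    bigram = c_bi[(t_prev, t)] / c_t[t_prev] if c_t[t_prev] else 0.0
    unigram = c_t[t] / len(tags)
    return l2 * bigram + l1 * unigram + l0 / len(V_T)

print(p_word_given_tag("drinks", "VBZ"))  # both VBZ tokens in the corpus are "drinks"
print(p_tag_interp("VBZ", "NNP"))
```

Because the λ weights sum to 1, the interpolated p'_λ(·|t_prev) is still a proper distribution over V_T, while never assigning zero to an unseen tag bigram.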
■ error-rate based

Formally:
■ Out(w) = set of output "items" for an input "item" w
■ True(w) = single correct output (annotation) for w
■ Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
■ Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
■ Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Comments on Unsupervised Learning

Initialization of Baum-Welch:
■ if some annotated data are available, use them
■ keep 0 for impossible output probabilities
Beware of:
■ degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
■ use heldout data for cross-checking
Supervised is almost always better

Running the Tagger

Use Viterbi:
■ remember to handle unknown words
■ single-best, n-best possible
Another option:
■ always assign the best tag at each word, but consider all possibilities for previous tags (no back pointers nor a path-backpass)
■ introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)

Evaluation Metrics

Accuracy: single output (tagging: each word gets a single tag)
■ Error rate: Err(S) = Errors(S)/|S|
■ Accuracy: Acc(S) = 1 - (Errors(S)/|S|) = 1 - Err(S)
What if multiple (or no) outputs?
■ Recall: R(S) = Correct(S)/|S|
■ Precision: P(S) = Correct(S)/Generated(S)
■ Combination: F measure: F = 1/(α/P + (1 - α)/R)
■ α is a weight given to precision vs. recall; for α = 0.5, F = 2PR/(R + P)
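The single-best Viterbi decoding from "Running the Tagger" can be sketched as follows. The transition and emission tables are hypothetical toy values, and the unknown-word handling discussed above is omitted, so this sketch assumes every word has at least one known tag:

```python
import math

# Illustrative toy bigram model; "<s>" stands in for the initial state s0.
trans = {("<s>", "NNP"): 0.4, ("<s>", "NN"): 0.3,
         ("NNP", "VBZ"): 0.5, ("NN", "VBZ"): 0.4,
         ("VBZ", "DT"): 0.5, ("DT", "NN"): 0.7, ("DT", "NNP"): 0.1}
emit = {("NNP", "John"): 0.02, ("NN", "John"): 0.001,
        ("VBZ", "drinks"): 0.05, ("DT", "the"): 0.6,
        ("NN", "beer"): 0.01, ("NNP", "beer"): 0.0001}
TAGS = ["NNP", "NN", "VBZ", "DT"]

def viterbi(words):
    """Single-best tag sequence: dynamic programming with back pointers."""
    V = {"<s>": 0.0}   # best log-probability of a path ending in each state
    back = []          # back pointers, one dict per position
    for w in words:
        nxt, ptr = {}, {}
        for t in TAGS:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue   # t cannot emit w (no unknown-word handling here)
            for prev, score in V.items():
                tr = trans.get((prev, t), 0.0)
                if tr == 0.0:
                    continue
                s = score + math.log(tr) + math.log(e)
                if t not in nxt or s > nxt[t]:
                    nxt[t], ptr[t] = s, prev
        V = nxt
        back.append(ptr)
    best = max(V, key=V.get)   # best final state; follow back pointers from it
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["John", "drinks", "the", "beer"]))
```

The slide's "another option" would instead greedily commit to the best tag at each position without keeping back pointers, trading guaranteed sequence-level optimality for speed and, sometimes, fewer secondary errors.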