MUNI FI
HMM Tagging
PA154 Language Modeling (5.2)
Pavel Rychlý, pary@fi.muni.cz
March 16, 2023

Recall:
■ tagging ~ morphological disambiguation
■ tagset V_T ⊂ (C_1, C_2, ..., C_n)
■ C_i - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
■ mapping w → {t ∈ V_T} exists
■ restriction of Morphological Analysis: A+ → 2^(L,C_1,C_2,...,C_n), where A is the language alphabet, L is the set of lemmas
■ extension of punctuation, sentence boundaries (treated as words)

Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Setting

Noisy Channel setting:
Input (tags): NNP VBZ DT .  →  the channel (adds "noise")  →  Output (words): John drinks the .

Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence):
■ p(T|W) = p(W|T) p(T) / p(W)
■ p(W) is fixed (W is given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model

Two models (d = |W| = |T| = word sequence length):
■ p(W|T) = Π_{i=1..d} p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d)
■ p(T) = Π_{i=1..d} p(t_i | t_1, ..., t_{i-1})
Too many parameters (as always)

Approximation using the following assumptions:
■ words do not depend on the context
■ tag depends on limited history: p(t_i | t_1, ..., t_{i-1}) = p(t_i | t_{i-n+1}, ..., t_{i-1})
■ n-gram tag "language" model
■ word depends on tag only: p(w_i | w_1, ..., w_{i-1}, t_1, ..., t_d) = p(w_i | t_i)

The HMM Model Definition

(Almost) general HMM:
■ output (words) emitted by states (not arcs)
■ states: (n-1)-tuples of tags if an n-gram tag model is used
■ five-tuple (S, s_0, Y, P_S, P_Y), where:
  ■ S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state,
  ■ Y = {y_1, y_2, ..., y_V} is the output alphabet (the words),
  ■ P_S(s_j|s_i) is the set of probability distributions of transitions: P_S(s_j|s_i) = p(t_i | t_{i-n+1}, ..., t_{i-1}), where s_j = (t_{i-n+2}, ..., t_i) and s_i = (t_{i-n+1}, ..., t_{i-1}),
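To make the noisy-channel decomposition above concrete, here is a minimal sketch of scoring one candidate tag sequence under a bigram (n = 2) tag model. The `trans` and `emit` tables and the `joint_prob` helper are illustrative toy values invented for this example, not probabilities from the lecture or from real data:

```python
# Minimal sketch of p(W|T) * p(T) with a bigram (n = 2) tag model.
# All probabilities below are hypothetical toy values.

trans = {  # p(t_i | t_{i-1}); "<s>" stands in for the initial state s0
    ("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.3, ("VBZ", "DT"): 0.5, ("DT", "."): 0.1,
}
emit = {  # p(w_i | t_i)
    ("NNP", "John"): 0.01, ("VBZ", "drinks"): 0.05, ("DT", "the"): 0.6, (".", "."): 1.0,
}

def joint_prob(words, tags):
    """p(W|T) * p(T) = prod_i p(w_i|t_i) * prod_i p(t_i|t_{i-1})."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

print(joint_prob(["John", "drinks", "the", "."], ["NNP", "VBZ", "DT", "."]))
```

The tagger's job (solved later with Viterbi) is then to find the tag sequence maximizing this product; any unseen transition or emission scores zero here, which is exactly why the smoothing discussed below is needed.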
  ■ P_Y(y_k|s_i) is the set of output (emission) probability distributions; another simplification: P_Y(y_k|s_i) = P_Y(y_k|s_j) if s_i and s_j contain the same tag as the rightmost element: P_Y(y_k|s_i) = p(w_i|t_i)

Supervised Learning (Manually Annotated Data Available)

Use MLE:
■ p(w_i|t_i) = c_wt(t_i, w_i) / c_t(t_i)
■ p(t_i | t_{i-n+1}, ..., t_{i-1}) = c_tn(t_{i-n+1}, ..., t_{i-1}, t_i) / c_t(n-1)(t_{i-n+1}, ..., t_{i-1})

Smooth (both!):
■ p(w_i|t_i): "Add 1" for all possible tag, word pairs using a predefined dictionary (thus some zeros are kept!)
■ p(t_i | t_{i-n+1}, ..., t_{i-1}): linear interpolation:
  ■ e.g. for a trigram model: p'_λ(t_i | t_{i-2}, t_{i-1}) = λ_3 p(t_i | t_{i-2}, t_{i-1}) + λ_2 p(t_i | t_{i-1}) + λ_1 p(t_i) + λ_0 / |V_T|

Unsupervised Learning

Completely unsupervised learning is impossible
■ at least if we have the tagset given - how would we associate words with tags?
Assumed (minimal) setting:
■ tagset known
■ dictionary/morphological analysis available (providing possible tags for any word)
Use: the Baum-Welch algorithm (see lecture 5.1)
■ "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Unknown Words

"OOV" words (out-of-vocabulary):
■ we do not have a list of possible tags for them
■ and we certainly have no output probabilities
Solutions:
■ try all tags (uniform distribution)
■ try open-class tags (uniform, unigram distribution)
■ try to "guess" possible tags (based on suffix/ending) - use a different output distribution based on the ending (and/or other factors, such as capitalization)

(Tagger) Evaluation

A must. Test data (S), previously unseen (in training):
■ change test data often if at all possible! ("feedback cheating")
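Before turning to the formal error measures, the supervised MLE estimates and linear-interpolation smoothing described above can be sketched in code. This is a bigram version of the slide's trigram formula, on a hypothetical toy corpus, with λ weights fixed arbitrarily rather than estimated on heldout data (as lecture 5.1's EM procedure would do):

```python
from collections import Counter

# Hypothetical toy annotated corpus; real training data would be far larger.
corpus = [("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("beer", "NN"),
          (".", "."), ("Mary", "NNP"), ("drinks", "VBZ"), ("tea", "NN"), (".", ".")]

tags = [t for _, t in corpus]
c_wt = Counter(corpus)               # c_wt(t_i, w_i): tag-word pair counts
c_t = Counter(tags)                  # c_t(t_i): tag unigram counts
c_bi = Counter(zip(tags, tags[1:]))  # tag bigram counts
V_T = set(tags)                      # the tagset observed in the toy data

def p_word_given_tag(w, t):
    # MLE: p(w_i|t_i) = c_wt(t_i, w_i) / c_t(t_i)
    return c_wt[(w, t)] / c_t[t]

def p_tag_interp(t, t_prev, l2=0.6, l1=0.3, l0=0.1):
    # Linear interpolation, bigram version of the slide's trigram formula:
    # p'(t|t_prev) = l2*p(t|t_prev) + l1*p(t) + l0/|V_T|
    bigram = c_bi[(t_prev, t)] / c_t[t_prev] if c_t[t_prev] else 0.0
    unigram = c_t[t] / len(tags)
    return l2 * bigram + l1 * unigram + l0 / len(V_T)

print(p_word_given_tag("drinks", "VBZ"))  # both VBZ tokens in the corpus are "drinks"
print(p_tag_interp("VBZ", "NNP"))
```

Because the λ weights sum to 1, the interpolated p'_λ(·|t_prev) is still a proper distribution over V_T, while never assigning zero to an unseen tag bigram.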
■ error-rate based

Formally:
■ Out(w) = set of output "items" for an input "item" w
■ True(w) = single correct output (annotation) for w
■ Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
■ Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
■ Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Comments on Unsupervised Learning

Initialization of Baum-Welch:
■ if some annotated data are available, use them
■ keep 0 for impossible output probabilities
Beware of:
■ degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
■ use heldout data for cross-checking
Supervised is almost always better

Running the Tagger

Use Viterbi:
■ remember to handle unknown words
■ single-best, n-best possible
Another option:
■ always assign the best tag at each word, but consider all possibilities for previous tags (no back pointers nor a path-backpass)
■ introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)

Evaluation Metrics

Accuracy: single output (tagging: each word gets a single tag)
■ Error rate: Err(S) = Errors(S)/|S|
■ Accuracy: Acc(S) = 1 - (Errors(S)/|S|) = 1 - Err(S)
What if multiple (or no) outputs?
■ Recall: R(S) = Correct(S)/|S|
■ Precision: P(S) = Correct(S)/Generated(S)
■ Combination: F measure: F = 1/(α/P + (1 - α)/R)
■ α is a weight given to precision vs. recall; for α = 0.5, F = 2PR/(R + P)
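The single-best Viterbi decoding from "Running the Tagger" can be sketched as follows. The transition and emission tables are hypothetical toy values, and the unknown-word handling discussed above is omitted, so this sketch assumes every word has at least one known tag:

```python
import math

# Illustrative toy bigram model; "<s>" stands in for the initial state s0.
trans = {("<s>", "NNP"): 0.4, ("<s>", "NN"): 0.3,
         ("NNP", "VBZ"): 0.5, ("NN", "VBZ"): 0.4,
         ("VBZ", "DT"): 0.5, ("DT", "NN"): 0.7, ("DT", "NNP"): 0.1}
emit = {("NNP", "John"): 0.02, ("NN", "John"): 0.001,
        ("VBZ", "drinks"): 0.05, ("DT", "the"): 0.6,
        ("NN", "beer"): 0.01, ("NNP", "beer"): 0.0001}
TAGS = ["NNP", "NN", "VBZ", "DT"]

def viterbi(words):
    """Single-best tag sequence: dynamic programming with back pointers."""
    V = {"<s>": 0.0}   # best log-probability of a path ending in each state
    back = []          # back pointers, one dict per position
    for w in words:
        nxt, ptr = {}, {}
        for t in TAGS:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue   # t cannot emit w (no unknown-word handling here)
            for prev, score in V.items():
                tr = trans.get((prev, t), 0.0)
                if tr == 0.0:
                    continue
                s = score + math.log(tr) + math.log(e)
                if t not in nxt or s > nxt[t]:
                    nxt[t], ptr[t] = s, prev
        V = nxt
        back.append(ptr)
    best = max(V, key=V.get)   # best final state; follow back pointers from it
    path = [best]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["John", "drinks", "the", "beer"]))
```

The slide's "another option" would instead greedily commit to the best tag at each position without keeping back pointers, trading guaranteed sequence-level optimality for speed and, sometimes, fewer secondary errors.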