Introduction to Natural Language Processing (600.465)
HMM Parameter Estimation: The Baum-Welch Algorithm
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

HMM: The Tasks
- HMM (the general case): a five-tuple (S, S0, Y, PS, PY), where:
  - S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
  - Y = {y1, y2, ..., y|Y|} is the output alphabet,
  - PS(sj|si) is the set of probability distributions of transitions,
  - PY(yk|si, sj) is the set of output (emission) probability distributions.
- Given an HMM and an output sequence Y = {y1, y2, ..., yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y;
  - (Task 3) estimate the parameters (the transition/output distributions).

A Variant of EM
- Idea (∼EM; for another variant see LM smoothing):
  - Start with (possibly random) estimates of PS and PY.
  - Compute (fractional) "counts" of the state transitions/emissions taken, from PS and PY, given the data Y.
  - Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
- Remarks:
  - many more parameters than in the simple four-way smoothing;
  - no proofs here; see Jelinek, Chapter 9.

Setting
- HMM (without PS, PY): (S, S0, Y), and data T = {yi ∈ Y}i=1..|T|
  - we will use T ∼ |T|
- HMM structure is given: (S, S0)
- PS: typically, one wants to allow a "fully connected" graph (i.e. no transitions forbidden ∼ no transitions set to a hard 0)
  - why? → better to leave that to the learning phase, based on the data!
  - sometimes it is possible to remove some transitions ahead of time
- PY: should be restricted (if not, we will not get anywhere!)
  - restricted ∼ hard 0 probabilities of p(y|s,s′)
- "Dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization
- For computing the initial expected "counts".
- Important part:
  - EM is guaranteed to find only a local maximum (albeit a good one in most cases).
- PY initialization is more important:
  - fortunately, often easy to determine;
  - together with the dictionary ↔ vocabulary mapping, get counts, then MLE.
- PS initialization is less important:
  - e.g. a uniform distribution for each p(·|s).

Data Structures
- Will need storage for:
  - the predetermined structure of the HMM (unless fully connected → no need to keep it!);
  - the parameters to be estimated (PS, PY);
  - the expected counts (same size as (PS, PY));
  - the training data T = {yi ∈ Y}i=1..|T|;
  - the trellis (if fully connected).

The Algorithm, Part I
1. Initialize PS, PY.
2. Compute "forward" probabilities:
   - follow the procedure for the trellis (summing), computing α(s, i) everywhere;
   - use the current values of PS, PY (p(s′|s), p(y|s,s′)):
     α(s′, i) = Σs→s′ α(s, i−1) × p(s′|s) × p(yi|s,s′)
   - NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
   - start at all nodes of the last stage and proceed backwards, computing β(s, i),
   - i.e., the probability of the "tail" of the data from stage i to the end of the data:
     β(s′, i) = Σs′←s β(s, i+1) × p(s|s′) × p(yi+1|s′,s)
   - also keep the β(s, i) at all trellis states.

The Algorithm, Part II
4. Collect counts: for each output/transition pair compute
   c(s, s′) = Σy∈Y c(y, s, s′)   (assuming all observed yi are in Y)
   c(s) = Σs′∈S c(s, s′)
5. Reestimate:
   p′(s′|s) = c(s, s′)/c(s)
   p′(y|s, s′) = c(y, s, s′)/c(s, s′)
6. Repeat steps 2-5 until the desired convergence limit is reached.
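To make the procedure concrete, here is a minimal Python sketch of one Baum-Welch iteration for the arc-emission HMM above (outputs generated on transitions, p(y|s,s′)). The data layout (trans, emit, data) and the function name are illustrative assumptions, not part of the original notes; the per-stage normalization recommended in the "Tips & Tricks" slide that follows is replaced here by a single division by P(Y), which is fine only for short data.

```python
from collections import defaultdict

def baum_welch_step(states, s0, trans, emit, data):
    """One forward-backward reestimation step (a sketch, not the original code).
    trans[s][s2] = p(s2|s); emit[(s, s2)][y] = p(y|s, s2); data = outputs y1..yk."""
    k = len(data)
    # Step 2: forward probabilities alpha[i][s] = P(y1..yi, state s at stage i)
    alpha = [defaultdict(float) for _ in range(k + 1)]
    alpha[0][s0] = 1.0
    for i in range(1, k + 1):
        y = data[i - 1]
        for s in states:
            for s2 in states:
                alpha[i][s2] += alpha[i - 1][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
    # Step 3: backward probabilities beta[i][s] = P(y_{i+1}..yk | state s at stage i)
    beta = [defaultdict(float) for _ in range(k + 1)]
    for s in states:
        beta[k][s] = 1.0
    for i in range(k - 1, -1, -1):
        y = data[i]
        for s in states:
            for s2 in states:
                beta[i][s] += beta[i + 1][s2] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
    total = sum(alpha[k].values())          # P(Y) under the current parameters
    if total == 0.0:
        return trans, emit                  # data impossible under current parameters; nothing to update
    # Step 4: collect fractional counts c(y,s,s'), c(s,s'), c(s)
    c_ysd, c_sd, c_s = defaultdict(float), defaultdict(float), defaultdict(float)
    for i in range(k):
        y = data[i]
        for s in states:
            for s2 in states:
                inc = (alpha[i][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
                       * beta[i + 1][s2] / total)
                c_ysd[(y, s, s2)] += inc
                c_sd[(s, s2)] += inc
                c_s[s] += inc
    # Step 5: reestimate by relative frequency (MLE from the fractional counts)
    new_trans = {s: {s2: c_sd[(s, s2)] / c_s[s] for s2 in states if c_s[s] > 0.0} for s in states}
    new_emit = {(s, s2): {y: c_ysd[(y, s, s2)] / c_sd[(s, s2)] for y in set(data)}
                for s in states for s2 in states if c_sd[(s, s2)] > 0.0}
    return new_trans, new_emit
```

Repeating baum_welch_step until the estimates stop changing implements step 6.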
Baum-Welch: Tips & Tricks
- Normalization is badly needed:
  - long training data → extremely small probabilities.
- Normalize α and β using the same normalization factor N(i) = Σs∈S α(s, i), as follows (see also the code sketch at the end of these Baum-Welch notes):
  - compute α(s, i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go;
  - at the end of each stage, recompute all alphas (for each state s): α*(s, i) = α(s, i)/N(i);
  - use the same N(i) for the βs at the end of each backward (Step 3) stage: β*(s, i) = β(s, i)/N(i).

Example
- Task: pronunciation of "the".
- Solution: build an HMM, fully connected, with 4 states:
  - S - short article, L - long article, C, V - word starting with a consonant / a vowel;
  - thus only "the" is ambiguous (a, an, the are not members of C, V).
- Output from states only (p(w|s,s′) = p(w|s′)).

Example: Initialization
- Output probabilities:
  - pinit(w|c) = c(c, w)/c(c), where c(S, the) = c(L, the) = c(the)/2
  - (other than that, everything is deterministic).
- Transition probabilities:
  - pinit(c′|c) = 1/4 (uniform).
- Don't forget:
  - about the space needed;
  - initialize α(X, 0) = 1 (X: the never-occurring front-buffer state);
  - initialize β(s, T) = 1 for all s (except for s = X).

Fill in Alpha, Beta
- Left to right, alpha: α(s′, i) = Σs→s′ α(s, i−1) × p(s′|s) × p(wi|s′), where the output wi is emitted by the state s′.
- Remember the normalization (N(i)).
- Similarly beta, on the way back from the end.

Counts & Reestimation
- One pass through the data.
- At each position i, go through all pairs (si, si+1).
- Increment the appropriate counters by the fractional counts (Step 4):
  - inc(yi+1, si, si+1) = α(si, i) × p(si+1|si) × p(yi+1|si+1) × β(si+1, i+1)
  - c(y, si, si+1) += inc (for the output y at position i+1)
  - c(si, si+1) += inc (always)
  - c(si) += inc (always)
- For example, if the word at position 8 is "big":
  - inc(big, L, C) = α(L, 7) × p(C|L) × p(big|C) × β(C, 8)
  - inc(big, S, C) = α(S, 7) × p(C|S) × p(big|C) × β(C, 8)
- Reestimate p(s′|s), p(y|s), and hope for an increase in p(C|S) and p(V|L)!

HMM: Final Remarks
- Parameter "tying":
  - keep certain parameters the same (∼ just one "counter" for all of them);
  - any combination is possible in principle;
  - example: smoothing (just one set of lambdas).
- Real-number outputs:
  - output alphabet Y of infinite size (R, Rⁿ);
  - a parametric distribution (typically with few parameters) is needed (e.g., a "Gaussian").
- "Empty" transitions: do not generate output
  - ∼ vertical arcs in the trellis; do not use them in "counting".
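As a companion to the "Tips & Tricks" slide above, here is a minimal sketch of the scaled forward pass. It reuses the assumed trans/emit layout of the earlier sketch; the function name and return values are illustrative, not from the original notes. A pleasant side effect of this scaling is that the log-likelihood can be recovered as log P(Y) = Σi log N(i).

```python
from collections import defaultdict

def scaled_forward(states, s0, trans, emit, data):
    """Forward pass with per-stage normalization: alpha*(s,i) = alpha(s,i)/N(i),
    where N(i) = sum over s of alpha(s,i). Returns the scaled alphas and the
    factors N(i); the same N(i) are reused to scale the betas on the way back."""
    k = len(data)
    alphas = [defaultdict(float)]
    alphas[0][s0] = 1.0
    norms = []
    for i in range(1, k + 1):
        y = data[i - 1]
        cur = defaultdict(float)
        for s in states:
            for s2 in states:
                cur[s2] += alphas[i - 1][s] * trans[s].get(s2, 0.0) * emit.get((s, s2), {}).get(y, 0.0)
        n = sum(cur.values()) or 1.0        # N(i); the 'or 1.0' guards an all-zero stage
        for s2 in cur:
            cur[s2] /= n                    # alpha*(s,i) = alpha(s,i)/N(i)
        alphas.append(cur)
        norms.append(n)
    return alphas, norms                    # log P(Y) = sum of log N(i)
```

The betas would then be computed as in Step 3 and divided by the same norms entry at each stage, exactly as the slide prescribes.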
Introduction to Natural Language Processing (600.465)
HMM Tagging
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

Review
- Recall: tagging ∼ morphological disambiguation.
- tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
- mapping w → {t ∈ VT} exists
  - a restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
- extension to punctuation, sentence boundaries (treated as words)

The Setting
- Noisy Channel setting.
- Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence):
  - p(T|W) = p(W|T) p(T) / p(W)
  - p(W) is fixed (W is given), so argmaxT p(T|W) = argmaxT p(W|T) p(T).

The Model
- Two models (d = |W| = |T| is the sequence length):
  - p(W|T) = Πi=1..d p(wi|w1,...,wi−1, t1,...,td)
  - p(T) = Πi=1..d p(ti|t1,...,ti−1)
- Too many parameters (as always).
- Approximation using the following assumptions:
  - words do not depend on the context;
  - the tag depends on a limited history: p(ti|t1,...,ti−1) ≅ p(ti|ti−n+1,...,ti−1) (an n-gram tag "language" model);
  - a word depends on its tag only: p(wi|w1,...,wi−1, t1,...,td) ≅ p(wi|ti).

The HMM Model Definition
- (Almost) general HMM:
  - output (words) emitted by states (not arcs);
  - states: (n−1)-tuples of tags if an n-gram tag model is used;
  - a five-tuple (S, s0, Y, PS, PY), where:
    - S = {s0, s1, ..., sT} is the set of states, s0 is the initial state,
    - Y = {y1, y2, ...} is the output alphabet (the words),
    - PS(sj|si) is the set of probability distributions of transitions:
      PS(sj|si) = p(ti|ti−n+1,...,ti−1), where sj = (ti−n+2,...,ti) and si = (ti−n+1,...,ti−1),
    - PY(yk|si) is the set of output (emission) probability distributions;
      another simplification: PY(yk|si) = PY(yk|sj) if si and sj contain the same rightmost tag, so that PY(yk|si) = p(wi|ti).

Supervised Learning (Manually Annotated Data Available)
- Use MLE:
  - p(wi|ti) = cwt(ti, wi)/ct(ti)
  - p(ti|ti−n+1,...,ti−1) = ctn(ti−n+1,...,ti−1, ti)/ct(n−1)(ti−n+1,...,ti−1)
- Smooth (both!):
  - p(wi|ti): "add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!);
  - p(ti|ti−n+1,...,ti−1): linear interpolation, e.g. for a trigram model:
    pλ(ti|ti−2, ti−1) = λ3 p(ti|ti−2, ti−1) + λ2 p(ti|ti−1) + λ1 p(ti) + λ0/|VT|

Unsupervised Learning
- Completely unsupervised learning is impossible:
  - at least if only the tagset is given: how would we associate words with tags?
- Assumed (minimal) setting:
  - tagset known;
  - dictionary/morphological analysis available (providing the possible tags for any word).
- Use:
  - the Baum-Welch algorithm (see class 15, 10/13);
  - "tying": output (state-emitting only; the same distribution for two states with the same "final" tag).

Comments on Unsupervised Learning
- Initialization of Baum-Welch:
  - if some annotated data are available, use them;
  - keep 0 for impossible output probabilities.
- Beware of:
  - degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!);
  - use heldout data for cross-checking.
- Supervised training is almost always better.

Unknown Words
- "OOV" words (out-of-vocabulary):
  - we do not have a list of possible tags for them,
  - and we certainly have no output probabilities.
- Solutions:
  - try all tags (uniform distribution);
  - try open-class tags (uniform or unigram distribution);
  - try to "guess" the possible tags (based on the suffix/ending): use a different output distribution based on the ending (and/or other factors, such as capitalization).

Running the Tagger
- Use Viterbi (a code sketch follows below):
  - remember to handle unknown words;
  - single-best or n-best output possible.
- Another option:
  - always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no path-backpass);
  - introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors).
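A minimal sketch of Viterbi decoding for the bigram (n = 2) version of the tagging HMM above, working in log space to avoid underflow. The interfaces tags_for, p_trans and p_emit are assumptions made for illustration; tags_for is expected to handle unknown words (e.g. by returning the open-class tags).

```python
import math

def viterbi_tag(words, tags_for, p_trans, p_emit, start="<s>"):
    """Single-best tag sequence for a bigram tag model:
    p_trans(t, t_prev) = p(t|t_prev), p_emit(w, t) = p(w|t);
    start is the assumed sentence-boundary tag known to p_trans."""
    # best[t] = (log prob of the best path ending in tag t, that path as a list of tags)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for t in tags_for(w):                       # only tags the dictionary allows for w
            emit = p_emit(w, t)
            if emit == 0.0:
                continue                            # respect the zeros kept by smoothing
            candidates = [
                (lp + math.log(p_trans(t, t_prev)) + math.log(emit), path + [t])
                for t_prev, (lp, path) in best.items() if p_trans(t, t_prev) > 0.0
            ]
            if candidates:
                new_best[t] = max(candidates)       # keep only the best predecessor
        best = new_best
    return max(best.values())[1] if best else []    # the highest-scoring tag sequence
```

For an n-gram tag model the same code applies, with states being (n−1)-tuples of tags instead of single tags.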
(Tagger) Evaluation
- A must. Test data (S), previously unseen (in training):
  - change the test data often, if at all possible! ("feedback cheating")
- Error-rate based.
- Formally:
  - Out(w) = the set of output "items" for an input "item" w
  - True(w) = the single correct output (annotation) for w
  - Errors(S) = Σi=1..|S| δ(Out(wi) ≠ True(wi))
  - Correct(S) = Σi=1..|S| δ(True(wi) ∈ Out(wi))
  - Generated(S) = Σi=1..|S| |Out(wi)|

Evaluation Metrics
- Accuracy: single output (tagging: each word gets a single tag):
  - Error rate: Err(S) = Errors(S)/|S|
  - Accuracy: Acc(S) = 1 − Errors(S)/|S| = 1 − Err(S)
- What if there are multiple (or no) outputs?
  - Recall: R(S) = Correct(S)/|S|
  - Precision: P(S) = Correct(S)/Generated(S)
  - Combination, the F measure: F = 1/(α/P + (1−α)/R)
    - α is the weight given to precision vs. recall; for α = 0.5, F = 2PR/(R+P).

Introduction to Natural Language Processing (600.465)
Transformation-Based Tagging
Dr. Jan Hajič, CS Dept., Johns Hopkins University
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

The Task, Again
- Recall: tagging ∼ morphological disambiguation.
- tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
- mapping w → {t ∈ VT} exists
  - a restriction of Morphological Analysis: A+ → 2^(L, C1, C2, ..., Cn), where A is the language alphabet and L is the set of lemmas
- extension to punctuation, sentence boundaries (treated as words)

Setting
- Not a source-channel view.
- Not even a probabilistic model (no "numbers" are used when tagging a text after the model is developed).
- Statistical, yes:
  - uses training data (a combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training);
  - learning [rules];
  - criterion: accuracy (that's what we are interested in in the end, after all!).

The General Scheme
- (diagram in the original slides)

The Learner
- (diagram in the original slides)

The I/O of an Iteration
- In (iteration i):
  - intermediate data (initial, or the result of the previous iteration);
  - the TRUTH (the annotated training data);
  - the pool of possible rules.
- Out:
  - one rule rselected(i) to enhance the set of rules learned so far;
  - intermediate data (the input data transformed by the rule learned in this iteration, rselected(i)).

The Initial Assignment of Tags
- One possibility: NN.
- Another: the most frequent tag for a given word form.
- Even: use an HMM tagger for the initial assignment.
- Not particularly sensitive.

The Criterion
- Error rate (or accuracy):
  - at the beginning of an iteration: some error rate Ein;
  - each possible rule r, when applied at every data position:
    - makes an improvement somewhere in the data (cimproved(r)),
    - makes it worse at some places (cworsened(r)),
    - and, of course, does not touch the remaining data.
- Rule contribution to the improvement of the error rate:
  - contrib(r) = cimproved(r) − cworsened(r)
- Rule selection at iteration i:
  - rselected(i) = argmaxr contrib(r)
- New error rate: Eout = Ein − contrib(rselected(i))
  (a code sketch of this selection step follows the next slide)

The Stopping Criterion
- Obvious: no improvement can be made (contrib(r) ≤ 0), or the improvement is too small (contrib(r) ≤ Threshold).
- NB: prone to overtraining!
  - therefore, setting a reasonable threshold is advisable.
- Heldout data (H)?
  - maybe: remove rules which degrade performance on H.
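A minimal sketch of the rule-selection step just described: compute contrib(r) = cimproved(r) − cworsened(r) for every candidate rule and pick the argmax, subject to the threshold. The data representation and the rule interface (applies, new_tag) are assumptions made for illustration, not part of the original notes.

```python
def select_rule(rules, data, truth, threshold=0):
    """data: list of (word, current_tag); truth: list of correct tags; rules: candidate pool.
    Returns (best rule, its contribution), or (None, 0) when the stopping criterion applies."""
    best_rule, best_contrib = None, threshold
    for r in rules:
        improved = worsened = 0
        for i, (_, tag) in enumerate(data):
            if not r.applies(data, i):
                continue                          # the rule does not touch this position
            if tag != truth[i] and r.new_tag == truth[i]:
                improved += 1                     # a wrong tag gets corrected
            elif tag == truth[i] and r.new_tag != truth[i]:
                worsened += 1                     # a correct tag gets spoiled
        contrib = improved - worsened             # contrib(r) = c_improved(r) - c_worsened(r)
        if contrib > best_contrib:
            best_rule, best_contrib = r, contrib
    return best_rule, (best_contrib if best_rule else 0)
```

The learner of the "In Other Words..." slide below repeatedly calls this selection, applies the chosen rule to the intermediate data, and stops once None is returned.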
The Pool of Rules (Templates)
- Format: change the tag at position i from a to b / condition.
- Context rules: the condition is defined by a "template" over the neighboring tags/words.

Lexical Rules
- The other type: lexical rules.
- Examples:
  - wi has the suffix -ied
  - wi has the prefix ge-

Rule Application
- Two possibilities:
  - immediate consequences (left-to-right): a change made at position i is already visible as context when the rule is considered at the following positions;
  - delayed ("fixed input"): use the original input for the context (in the slide's worked example, the rule then applies twice).

In Other Words...
1. Strip the tags off the truth; keep the original truth.
2. Initialize the stripped data by some simple method.
3. Start with an empty set of selected rules, S.
4. Repeat until the stopping criterion applies:
   - compute the contribution of each rule r: contrib(r) = cimproved(r) − cworsened(r);
   - select the rule r with the biggest contribution contrib(r), add it to the final set of selected rules S, and apply it to the current intermediate data.
5. Output the set S.

The Tagger
- Input:
  - untagged data;
  - the rules (S) learned by the learner.
- Tagging (a code sketch follows at the end of these notes):
  - use the same initialization as the learner did;
  - for i = 1..n (n = the number of rules learned): apply rule i to the whole intermediate data, changing (some) tags;
  - the last intermediate data is the output.

N-best & Unsupervised Modifications
- N-best modification:
  - allow rules to add tags;
  - criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1).
- Unsupervised modification:
  - use only unambiguous words for the evaluation criterion;
  - works extremely well for English;
  - does not work for languages with few unambiguous words.
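To close, a minimal sketch of the tagger described in "The Tagger" above: initialize the tags the same way the learner did, then apply the learned rules in order. The rule interface (applies, new_tag) and initial_tag are the same illustrative assumptions used in the selection sketch earlier; the list comprehension implements the delayed ("fixed input") application, since every position is checked against the data as it was before the current rule changed anything.

```python
def tbl_tag(words, rules, initial_tag):
    """words: word forms; rules: the learned rules, in the order they were learned;
    initial_tag(w): the same initial assignment the learner used (e.g. most frequent tag)."""
    data = [(w, initial_tag(w)) for w in words]        # same initialization as the learner
    for r in rules:                                    # apply each rule to the whole intermediate data
        data = [(w, r.new_tag if r.applies(data, i) else t)
                for i, (w, t) in enumerate(data)]      # conditions see the pre-rule data ("fixed input")
        # for immediate left-to-right application, update data[i] in place inside a loop instead
    return [t for _, t in data]                        # the last intermediate data is the output
```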