Markov Models
PA154 Language Modeling (4.1)
Pavel Rychlý
pary@fi.muni.cz
March 9, 2023

Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

Review: Markov Process

■ Bayes formula (chain rule):
  P(W) = p(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
■ n-gram language models:
  ■ Markov process (chain) of order n−1:
    P(W) = p(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_{i−n+1}, w_{i−n+2}, ..., w_{i−1})
  ■ Using just one distribution (Ex.: trigram model: p(w_i | w_{i−2}, w_{i−1})):
    Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
    Words:     My car broke down , and within hours Bob 's car broke down ,  too .
    p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)

Markov Properties

■ Generalize to any process (not just words/LM):
  ■ Sequence of random variables: X = (X_1, X_2, ..., X_T)
  ■ Sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
1. Limited History (Context, Horizon):
   ∀i ∈ 1..T: P(X_i | X_1, ..., X_{i−1}) = P(X_i | X_{i−1})
   [Figure: digit sequence 1 7 3 7 9 0 6 ... with only the immediately preceding digit highlighted as context]
2. Time invariance (M.C. is stationary, homogeneous):
   ∀i ∈ 1..T, ∀y, x ∈ S: P(X_i = y | X_{i−1} = x) = p(y | x)
   [Figure: the same transition highlighted wherever it occurs in the sequence — ok ... same distribution]

Long History Possible

■ What if we want trigrams:
  [Figure: digit sequence 1 7 3 7 [9 0] 6 7 3 4 5 ... with a two-digit history highlighted]
■ Formally, use a transformation: define new variables Q_i such that X_i = (Q_{i−1}, Q_i); then
  P(X_i | X_{i−1}) = P(Q_{i−1}, Q_i | Q_{i−2}, Q_{i−1})
  Predicting (X_i): 1 7 3 7 [9 0] 6 7 3 4 5 ...
  History (X_{i−1} = (Q_{i−2}, Q_{i−1})): 0 6 7 3 4 9 0 6 7 3

Graph Representation: State Diagram

■ S = {s_0, s_1, s_2, ..., s_n}: states
■ Distribution P(X_i | X_{i−1}): transitions (as arcs) with probabilities attached to them
  [Figure: state diagram over the letters t, o, e with transition probabilities on the arcs]
■ Bigram case: p(toe) = .6 × .88 × 1 = .528

The Trigram Case

■ S = {s_0, s_1, s_2, ..., s_n}: states are pairs s_i = (x, y)
■ Distribution P(X_i | X_{i−1}): (r.v. X generates pairs s_i)
  [Figure: trigram state diagram over letter pairs]
■ p(toe) = .6 × .88 × .07 ≈ .037
■ p(one) = ?

Finite State Automaton

■ States ~ symbols of the [input/output] alphabet
  ■ pairs (or more): last element of the n-tuple
■ Arcs ~ transitions (sequence of states)
■ [Classical FSA: alphabet symbols on arcs;
  ■ transformation: arcs ↔ nodes]
■ Possible thanks to the "limited history" Markov property
■ So far: Visible Markov Models (VMM)

Hidden Markov Models

■ The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":
  [Figure: HMM whose hidden states emit the letters of "toe"]
■ p(toe) = .6 × .88 × 1 = .528

Added Flexibility...

■ So far, no change; but different states may generate the same output (why not?):
■ p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

Output from Arcs...

■ Added flexibility: generate output from arcs, not states:
■ p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624

...and Finally, Add Output Probabilities

■ Maximum flexibility: [unigram] distribution (sample space: output alphabet) at each output arc (simplified!):
  [Figure: arcs annotated with emission distributions such as p(t)=.8, p(t)=0, p(o)=0, p(e)=1]
■ p(toe) = .6×.8 × .88×.7 × 1×.6 + .4×.5 × 1×1 × .88×.2 + .4×.5 × 1×1 × .12×1 ≈ .237

Slightly Different View

■ Allow for multiple arcs from s_i to s_j, mark them by the output symbol, and get rid of the output distributions:
  [Figure: arcs labeled with (symbol, probability) pairs such as o,.06 and e,.12]
■ p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≈ .237
■ In the future, we will use whichever view is more convenient for the problem at hand.
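As a concrete check of this path sum, here is a minimal sketch in Python (not part of the original foils) that enumerates the state paths of the arc-marked view and adds up their probabilities. The arc dictionary is an assumption reconstructed from the three products above, so arcs that lie on no path emitting "toe" are omitted.

```python
# Minimal sketch (not from the foils): probability of an output string in
# the "arcs marked by output symbol" view, summed over all state paths.
# The arc set below is reconstructed from the three products above.
arcs = {
    'x': [('1', 't', 0.48), ('3', 't', 0.20)],
    '1': [('4', 'o', 0.616), ('4', 'e', 0.176), ('2', 'e', 0.12)],
    '3': [('1', 'o', 1.0)],
    '4': [('2', 'e', 0.6)],
}

def output_prob(state, symbols):
    """Sum of the probabilities of all state paths emitting `symbols`
    starting from `state` (brute-force enumeration)."""
    if not symbols:
        return 1.0
    return sum(p * output_prob(nxt, symbols[1:])
               for nxt, sym, p in arcs.get(state, [])
               if sym == symbols[0])

print(round(output_prob('x', 'toe'), 3))  # 0.237, matching the foil
```

The recursion mirrors the slide's arithmetic exactly: each of the three nonzero paths x→1→4→2, x→3→1→4, and x→3→1→2 contributes one product to the sum.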
Formalization

HMM (the most general case) is a five-tuple (S, s_0, Y, P_S, P_Y), where:
■ S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
■ Y = {y_1, y_2, ..., y_V} is the output alphabet,
■ P_S(s_j | s_i) is the set of probability distributions of transitions,
  ■ size of P_S: |S|²,
■ P_Y(y_k | s_i, s_j) is the set of output (emission) probability distributions,
  ■ size of P_Y: |S|² × |Y|.
Example:
■ S = {x, 1, 2, 3, 4}, s_0 = x
■ Y = {t, o, e}

Formalization — Example

Example (for the graph, see foils 11 and 12):
■ S = {x, 1, 2, 3, 4}, s_0 = x
■ Y = {t, o, e}
■ P_S (transition probabilities; rows = from, columns = to):

        x    1    2    3    4
   x    0   .6    0   .4    0
   1    0    0  .12    0  .88
   2    0    0    0    0    1
   3    0    1    0    0    0
   4    0    0    1    0    0

■ P_Y (emission probabilities): the three per-symbol tables for t, o, e did not survive extraction; the entries recoverable from the p(toe) computation on foil 11 are P_Y(t | x,1) = .8, P_Y(t | x,3) = .5, P_Y(o | 1,4) = .7, P_Y(o | 3,1) = 1, P_Y(e | 1,2) = 1, P_Y(e | 1,4) = .2, and P_Y(e | 4,2) = .6.

Using the HMM

The generation algorithm (of limited value :-)):
1. Start in s = s_0.
2. Move from s to s' with probability P_S(s' | s).
3. Output (emit) symbol y_k with probability P_Y(y_k | s, s').
4. Repeat from step 2 (until somebody says enough).
(A small sampler sketch follows below.)
More interesting usage:
■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states that generated it (see the sketch at the end).
■ ...plus variations: e.g., n best state sequences.
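To make the generation algorithm concrete, here is a minimal Python sketch (not part of the original foils). The transition table P_S is the one from the example foil; the emission table P_Y is only partially recoverable from the foils, so the entries marked "assumed" below are hypothetical fill-ins chosen so that each arc's distribution sums to 1.

```python
import random

# Transition probabilities P_S(s' | s), from the example foil.
P_S = {
    'x': {'1': .6, '3': .4},
    '1': {'2': .12, '4': .88},
    '2': {'4': 1.0},
    '3': {'1': 1.0},
    '4': {'2': 1.0},
}

# Emission probabilities P_Y(y | s, s').  Entries recovered from the
# p(toe) computation are used as-is; the rest are hypothetical fill-ins.
P_Y = {
    ('x', '1'): {'t': .8, 'o': .1, 'e': .1},   # o, e assumed
    ('x', '3'): {'t': .5, 'o': .5},            # o assumed
    ('1', '2'): {'e': 1.0},
    ('1', '4'): {'o': .7, 'e': .2, 't': .1},   # t assumed
    ('2', '4'): {'t': 1.0},                    # assumed
    ('3', '1'): {'o': 1.0},
    ('4', '2'): {'e': .6, 't': .4},            # t assumed
}

def generate(n_steps, s0='x'):
    """Steps 1-4 of the generation algorithm: start in s0, repeatedly
    draw a next state from P_S and emit one symbol from P_Y per arc."""
    s, output = s0, []
    for _ in range(n_steps):
        nxt = random.choices(list(P_S[s]), weights=list(P_S[s].values()))[0]
        dist = P_Y[(s, nxt)]
        output.append(random.choices(list(dist), weights=list(dist.values()))[0])
        s = nxt
    return ''.join(output)

print(generate(3))  # e.g. 'toe'
```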
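The second task on this foil, finding the most likely state sequence, is classically solved by the Viterbi dynamic program. The foils do not give that algorithm here, so the sketch below only illustrates the underlying idea on the arc-marked view of foil 12: take the same recursion as the probability sum above and replace the sum over paths with a max (brute force, not the efficient DP).

```python
# Brute-force "most likely state sequence" over the arc-marked view of
# foil 12: same recursion as the probability sum, with max instead of +.
# The efficient version of this idea is the Viterbi dynamic program.
arcs = {
    'x': [('1', 't', 0.48), ('3', 't', 0.20)],
    '1': [('4', 'o', 0.616), ('4', 'e', 0.176), ('2', 'e', 0.12)],
    '3': [('1', 'o', 1.0)],
    '4': [('2', 'e', 0.6)],
}

def best_path(state, symbols):
    """Highest-probability state path emitting `symbols` from `state`."""
    if not symbols:
        return 1.0, [state]
    candidates = []
    for nxt, sym, p in arcs.get(state, []):
        if sym != symbols[0]:
            continue
        sub_p, sub_path = best_path(nxt, symbols[1:])
        candidates.append((p * sub_p, [state] + sub_path))
    return max(candidates, default=(0.0, [state]))

prob, path = best_path('x', 'toe')
print(path, round(prob, 3))  # ['x', '1', '4', '2'] 0.177
```

Of the three paths that can emit "toe", the direct x→1→4→2 path dominates with probability .48 × .616 × .6 ≈ .177.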