MUNI FI
Markov Models
PA154 Language Modeling (4.1)
Pavel Rychlý
pary@fi.muni.cz
March 9, 2023
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Review: Markov Process
■ Bayes formula (chain rule):
  P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-n}, w_{i-n+1}, ..., w_{i-1})
■ n-gram language models:
  ■ Markov process (chain) of the order n-1:
    P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
  ■ Using just one distribution (Ex.: trigram model: p(w_i | w_{i-2}, w_{i-1})):
    Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    Words:     My car broke down , and within hours Bob 's car broke down , too .
    p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)

Markov Properties
■ Generalize to any process (not just words/LM):
  ■ Sequence of random variables: X = (X_1, X_2, ..., X_T)
  ■ Sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
1. Limited History (Context, Horizon):
   ∀i ∈ 1..T: P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_{i-1})
2. Time invariance (M.C. is stationary, homogeneous):
   ∀i ∈ 1..T, ∀y, x ∈ S: P(X_i = y | X_{i-1} = x) = p(y | x)
   ...i.e. the same distribution at every position
   [figure: example digit sequences illustrating both properties]

Long History Possible
■ What if we want trigrams (a history of two previous states)?
■ Formally, use a transformation: define new variables Q_i such that X_i = (Q_{i-1}, Q_i);
  then P(X_i | X_{i-1}) = P(Q_{i-1}, Q_i | Q_{i-2}, Q_{i-1})
  [figure: predicting X_i from the history X_{i-1} = (Q_{i-2}, Q_{i-1}) on an example digit sequence]

Graph Representation: State Diagram
■ S = {s_0, s_1, s_2, ..., s_n}: states
■ Distribution P(X_i | X_{i-1}):
  ■ transitions (as arcs) with probabilities attached to them
■ Bigram case: sum of outgoing probabilities = 1
  [state diagram figure]
  p(toe) = .6 × .88 × .07 = .037
  p(one) = ?

The Trigram Case
■ S = {s_0, s_1, s_2, ..., s_n}: states are pairs s_i = (x, y)
■ Distribution P(X_i | X_{i-1}): (r.v. X generates pairs s_i)
  [state diagram figure]
  p(toe) = .6 × .88 × 1 = .528

Finite State Automaton
■ States ~ symbols of the [input/output] alphabet
  ■ pairs (or more): last element of the n-tuple
■ Arcs ~ transitions (sequences of states)
■ [Classical FSA: alphabet symbols on arcs;
  ■ transformation: arcs ↔ nodes]
■ Possible thanks to the "limited history" Markov Property

Hidden Markov Models
■ So far: Visible Markov Models (VMM)
■ The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":
  [state diagram figure]
  p(toe) = .6 × .88 × 1 = .528

Added Flexibility...
■ So far, no change; but different states may generate the same output (why not?):
  [state diagram figure]
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

Output from Arcs...
■ Added flexibility: generate output from arcs, not states:
  [state diagram figure]
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624

...and Finally, Add Output Probabilities
■ Maximum flexibility: [unigram] distribution (sample space: output alphabet) at each output arc:
  [state diagram figure with per-arc output distributions p(t), p(o), p(e)]
  p(toe) = .6 × .8 × .88 × .7 × 1 × .6
         + .4 × .5 × 1 × 1 × .88 × .2
         + .4 × .5 × 1 × 1 × .12 × 1  ≅ .237

Slightly Different View
■ Allow for multiple arcs from s_i → s_j, mark them by output symbols, and get rid of the output distributions:
  [state diagram figure with symbol-labeled arcs, e.g. o,.06 and e,.12]
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237
■ In the future, we will use whichever view is more convenient for the problem at hand.
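All of the p(toe) computations above follow the same pattern: sum, over every state path that can produce the output, the product of transition probabilities (and, in the arc-output views, of per-arc output probabilities). A minimal sketch in Python, assuming a small invented automaton (the states A, B, C and all probabilities below are illustrative placeholders, not the numbers from the slides' figures):

```python
# A minimal sketch of the "output on arcs" view: p(output) is the sum, over
# all state paths, of the product of transition and per-arc output
# probabilities. The model below is made up for illustration; it is NOT the
# automaton drawn on the slides.
from itertools import product

STATES = ["A", "B", "C"]
START = "A"

# P_S(s' | s): transition probabilities (each row sums to 1).
P_S = {
    "A": {"B": 0.6, "C": 0.4},
    "B": {"C": 1.0},
    "C": {"A": 1.0},
}

# P_Y(y | s, s'): output distribution attached to each arc.
P_Y = {
    ("A", "B"): {"t": 0.8, "o": 0.2},
    ("A", "C"): {"t": 0.5, "e": 0.5},
    ("B", "C"): {"o": 0.9, "e": 0.1},
    ("C", "A"): {"e": 0.6, "o": 0.4},
}

def output_probability(output):
    """Brute-force sum over all state paths of length len(output)."""
    total = 0.0
    for path in product(STATES, repeat=len(output)):
        prev, p = START, 1.0
        for s, y in zip(path, output):
            p *= P_S.get(prev, {}).get(s, 0.0)       # transition probability
            p *= P_Y.get((prev, s), {}).get(y, 0.0)  # output probability on the arc
            prev = s
        total += p
    return total

print(output_probability("toe"))
```

The exhaustive enumeration is exponential in the output length; it is only meant to make the arithmetic of the slides explicit, not to be an efficient algorithm.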
Formalization
■ HMM (the most general case):
  ■ a five-tuple (S, s_0, Y, P_S, P_Y), where:
    ■ S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
    ■ Y = {y_1, y_2, ..., y_V} is the output alphabet,
    ■ P_S(s_j | s_i) is the set of probability distributions of transitions,
      ■ size of P_S: |S|^2,
    ■ P_Y(y_k | s_i, s_j) is the set of output (emission) probability distributions,
      ■ size of P_Y: |S|^2 × |Y|

Formalization - Example
■ Example:
  ■ S = {x, 1, 2, 3, 4}, s_0 = x
  ■ Y = {t, o, e}

Using the HMM
■ The generation algorithm (of limited value :-)):
  1. Start in s = s_0.
  2. Move from s to s' with probability P_S(s' | s).
  3. Output (emit) symbol y_k with probability P_Y(y_k | s, s').
  4. Repeat from step 2 (until somebody says enough).
■ More interesting usage:
  ■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
  ■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states which has generated it.
  ■ ...plus variations: e.g., n best state sequences
■ Example (for the graph, see foils 11, 12):
  ■ S = {x, 1, 2, 3, 4}, s_0 = x
  ■ Y = {e, o, t}
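The generation algorithm and the "most likely state sequence" task can be sketched directly from the formalization. The two-state model below and all of its probabilities are invented for illustration (they do not correspond to the slides' example with S = {x, 1, 2, 3, 4}), and the decoder is a brute-force enumeration rather than an efficient algorithm:

```python
# A self-contained sketch: (a) the four-step generation algorithm, and
# (b) a brute-force answer to "which state sequence most likely generated
# this output?". The two-state HMM below is made up for illustration.
import random
from itertools import product

STATES = ["1", "2"]
S0 = "1"
P_S = {"1": {"1": 0.3, "2": 0.7},        # transition distributions P_S(s' | s)
       "2": {"1": 0.6, "2": 0.4}}
P_Y = {("1", "1"): {"t": 1.0},           # emission distributions P_Y(y | s, s')
       ("1", "2"): {"o": 0.5, "e": 0.5},
       ("2", "1"): {"e": 1.0},
       ("2", "2"): {"t": 0.2, "o": 0.8}}

def generate(length):
    """Steps 1-4 of the generation algorithm; stop after `length` symbols."""
    s, out = S0, []                                      # 1. start in s = s_0
    for _ in range(length):
        nxt, w = zip(*P_S[s].items())
        s_next = random.choices(nxt, weights=w)[0]       # 2. move to s'
        ys, ew = zip(*P_Y[(s, s_next)].items())
        out.append(random.choices(ys, weights=ew)[0])    # 3. emit y
        s = s_next                                       # 4. repeat from step 2
    return "".join(out)

def most_likely_path(output):
    """Enumerate all state paths (exponential; fine only for a toy model)."""
    best, best_p = None, 0.0
    for path in product(STATES, repeat=len(output)):
        prev, p = S0, 1.0
        for s, y in zip(path, output):
            p *= P_S[prev].get(s, 0.0) * P_Y[(prev, s)].get(y, 0.0)
            prev = s
        if p > best_p:
            best, best_p = path, p
    return best, best_p

print(generate(5))
print(most_likely_path("toe"))
```

Sampling with the generation algorithm is cheap; answering the two "more interesting" questions efficiently requires dynamic programming over the trellis instead of the exponential enumeration used here.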