Introduction to Natural Language Processing (600.465)
Markov Models
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu  www.cs.jhu.edu/~hajic

Review: Markov Process
• Bayes formula (chain rule):
  p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|w1,w2,...,wi-2,wi-1)
• n-gram language models:
  - Markov process (chain) of the order n-1:
    p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|wi-n+1,wi-n+2,...,wi-1)
  - Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
    Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
    Words:     My car broke down , and within hours Bob 's car broke down ,  too .
    p(,|broke down) = p(w5|w3,w4) = p(w14|w12,w13)

Markov Properties
• Generalize to any process (not just words/LM):
  - Sequence of random variables: X = (X1,X2,...,XT)
  - Sample space S (states), size N: S = {s0,s1,s2,...,sN}
• 1. Limited History (Context, Horizon):
     ∀i ∈ 1..T;  P(Xi|X1,...,Xi-1) = P(Xi|Xi-1)
• 2. Time invariance (the Markov chain is stationary/homogeneous):
     ∀i ∈ 1..T, ∀y,x ∈ S;  P(Xi=y|Xi-1=x) = p(y|x)   (...the same distribution at every position)

Long History Possible
• What if we want trigrams?
• Formally, use a transformation: define new variables Qi such that Xi = (Qi-1,Qi); then
  P(Xi|Xi-1) = P(Qi-1,Qi|Qi-2,Qi-1) = p(Qi|Qi-2,Qi-1)
  - Predicting (Xi); History (Xi-1 = (Qi-2,Qi-1)).

Graph Representation: State Diagram
• S = {s0,s1,s2,...,sN}: states
• Distribution P(Xi|Xi-1):
  - transitions (as arcs) with probabilities attached to them.

The Trigram Case
• S = {s0,s1,s2,...,sN}: states: pairs si = (x,y)
• Distribution P(Xi|Xi-1): (r.v. X: generates pairs si)
  p(toe) = .6 × .88 × .07 = .037
  p(one) = ?

Finite State Automaton
• States ~ symbols of the [input/output] alphabet
  - pairs (or more): last element of the n-tuple
• Arcs ~ transitions (sequence of states)
• [Classical FSA: alphabet symbols on arcs;
  - transformation: arcs ↔ nodes]
• Possible thanks to the "limited history" Markov Property
• So far: Visible Markov Models (VMM)

Hidden Markov Models
• The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible".

Added Flexibility
• So far, no change; but different states may generate the same output (why not?).

Output from Arcs...
• Added flexibility: generate output from arcs, not states.

...and Finally, Add Output Probabilities
• Maximum flexibility: a [unigram] distribution (sample space: output alphabet) at each output arc
  (simplified!); example arc distributions from the graph:
  p(t) = 0,  p(o) = 0,  p(e) = 1
  p(t) = .8, p(o) = .1, p(e) = .1
  p(t) = .5, p(o) = .2, p(e) = .3
• Alternative view: mark the arcs by their output symbols and get rid of the output distributions:
  p(toe) = .48 × .616 × .6  +  .2 × 1 × .176  +  .2 × 1 × .12  ≈ .237
• In the future, we will use whichever view is more convenient for the problem at hand.
Formalization
• HMM (the most general case):
  - a five-tuple (S, s0, Y, PS, PY), where:
    * S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
      - size of PS: |S|²,
    * PY(yk|si,sj) is the set of output (emission) probability distributions,
      - size of PY: |S|² × |Y|.
• Example:
  - S = {x, 1, 2, 3, 4},  s0 = x
  - Y = {t, o, e}
Formalization - Example
• Example (for the graph, see foils 11 and 12):
  - S = {x, 1, 2, 3, 4},  s0 = x
  - Y = {e, o, t}
  PS (transition probabilities; rows = from-state, columns = to-state, each row sums to 1):
        x    1    2    3    4
   x    0   .6    0   .4    0
   1    0    0  .12    0  .88
   2    0    0    0    0    1
   3    0    1    0    0    0
   4    0    0    1    0    0
  PY (output/emission probabilities, one distribution over Y = {t, o, e} per arc; each distribution sums to 1):
  the table is only partially recoverable here; see the output distributions attached to the arcs of the
  graph on foils 11 and 12 (e.g., p(t) = .8, p(o) = .1, p(e) = .1 on one of the arcs).
  (A minimal Python encoding of the five-tuple follows below.)
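For concreteness, a minimal Python sketch of how this five-tuple might be encoded. The dictionary layout is an assumption chosen for illustration; the PS numbers come from the table above, while the PY entry shown is a placeholder, since the emission table is not fully recoverable.

    # A minimal sketch of the HMM five-tuple (S, s0, Y, PS, PY) as Python data.
    S = ["x", 1, 2, 3, 4]          # states
    s0 = "x"                       # initial state
    Y = ["t", "o", "e"]            # output alphabet

    # PS[s_prev][s_next] = transition probability (zero entries omitted)
    PS = {
        "x": {1: .6, 3: .4},
        1:   {2: .12, 4: .88},
        2:   {4: 1.0},
        3:   {1: 1.0},
        4:   {2: 1.0},
    }

    # PY[(s_prev, s_next)][symbol] = emission probability on that arc
    # (illustrative values only; fill in the distributions from the graph)
    PY = {
        ("x", 1): {"t": .8, "o": .1, "e": .1},
        # ... one distribution per arc with nonzero transition probability
    }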
Using the HMM
• The generation algorithm (of limited value :-)); a minimal sketch follows below:
  1. Start in s = s0.
  2. Move from s to s' with probability PS(s'|s).
  3. Output (emit) symbol yk with probability PY(yk|s,s').
  4. Set s = s' and repeat from step 2 (until somebody says enough).
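A minimal Python sketch of this generation procedure, assuming the dictionary encoding of PS and PY sketched above (the function name and the use of random.choices are my own choices, not from the slides):

    import random

    def generate(s0, PS, PY, length=10):
        """Randomly walk the HMM and emit one symbol per transition (a sketch)."""
        s, output = s0, []
        for _ in range(length):                       # "until somebody says enough"
            next_states = list(PS[s].keys())
            trans_probs = list(PS[s].values())
            s_next = random.choices(next_states, weights=trans_probs)[0]   # step 2
            emit_dist = PY[(s, s_next)]
            y = random.choices(list(emit_dist.keys()),
                               weights=list(emit_dist.values()))[0]        # step 3
            output.append(y)
            s = s_next                                 # step 4: move on
        return output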
• More interesting usage:
  - Given an output sequence Y = {y1,y2,...,yk}, compute its probability.
  - Given an output sequence Y = {y1,y2,...,yk}, compute the most likely sequence of states which has generated it.
  - ...plus variations: e.g., the n best state sequences.
Introduction to Natural Language Processing (600.465)
HMM Algorithms: Trellis and Viterbi
Dr. Jan Hajic CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu  www.cs.jhu.edu/~hajic
HMM: The Two Tasks
• HMM (the general case):
  - a five-tuple (S, s0, Y, PS, PY), where:
    * S = {s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
    * PY(yk|si,sj) is the set of output (emission) probability distributions.
• Given an HMM & an output sequence Y = {y1,y2,...,yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y.
Trellis - Deterministic Output
• HMM (graph) vs. trellis: the trellis unrolls the HMM over time/position t = 0, 1, 2, 3 for the output Y: t o e.
• trellis state: (HMM state, position)
• each trellis state holds one number (a probability): α; e.g., α(x,0) = 1, α(A,1) = .6, ...
• probability of Y: the α in the last state.
Creating the Trellis: The Start
• Start in the start state (x),
  - set its α(x,0) to 1.
• Create the first stage (position/stage 1):
  - get the first "output" symbol y1,
  - create the first stage (column),
  - but only those trellis states which generate y1,
  - set their α(state,1) to PS(state|x) × α(x,0).
• ...and forget about the 0-th stage.
Trellis: The Next Step
• Suppose we are in stage i. Creating the next stage (i+1):
  - create all trellis states in the next stage which generate yi+1, but only those reachable from any of the stage-i states,
  - set their α(state,i+1) to:
      PS(state|prev.state) × α(prev.state,i)
    (add up all such numbers on arcs going to a common trellis state),
  - ...and forget about stage i.
Trellis: The Last Step
• Continue until the "output" is exhausted:
  - |Y| = 3: until stage 3.
• Add together all the α(state,|Y|). That's the P(Y).
• Observation (pleasant):
  - memory usage max: 2|S|
  - multiplications max: |S|²|Y|
• In the example, the last position/stage gives P(Y) = α = .568.
• (A minimal sketch of the whole trellis computation follows below.)
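A minimal Python sketch of this trellis computation for the deterministic-output case; the (prev_state, symbol) arc dictionary is an encoding chosen for illustration, not the slides' notation:

    # arcs: dict mapping (prev_state, symbol) -> list of (next_state, probability)
    def trellis_prob(arcs, s0, output):
        """P(Y) for an HMM whose arcs are each labeled with one output symbol (a sketch)."""
        alpha = {s0: 1.0}                              # stage 0: only the start state
        for sym in output:                             # one trellis stage per output symbol
            new_alpha = {}
            for prev, a in alpha.items():
                for nxt, p in arcs.get((prev, sym), []):
                    # add up numbers on arcs going to a common trellis state
                    new_alpha[nxt] = new_alpha.get(nxt, 0.0) + p * a
            alpha = new_alpha                          # ...and forget about stage i (memory <= 2|S|)
        return sum(alpha.values())                     # P(Y): add all alpha(state, |Y|)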
Trellis: The General Case (still bigrams)
• Start as usual:
  - start state (x), set its α(x,0) to 1.
• p(toe) = .48 × .616 × .6  +  .2 × 1 × .176  +  .2 × 1 × .12  ≈ .237
General Trellis: The Next Step
• We are in stage i:
  - Generate the next stage i+1 as before (except that now the arcs generate the output, so use only those arcs marked by the output symbol yi+1).
  - For each generated state, compute (summing over all incoming arcs marked yi+1):
      α(state,i+1) = Σ PS(state|prev.state) × PY(yi+1|prev.state,state) × α(prev.state,i)
    (e.g., α = .48 after the first symbol in the example).
• ...and forget about stage i as usual.
• (A sketch of this general forward step follows below.)
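A minimal Python sketch of this general forward computation, assuming the same PS/PY dictionary encoding used earlier (my own encoding, not the slides'):

    def forward_prob(s0, PS, PY, output):
        """Trellis (forward) computation for an HMM that emits on arcs (a sketch)."""
        alpha = {s0: 1.0}                               # stage 0: start state only
        for y in output:                                # build stage i+1 from stage i
            new_alpha = {}
            for prev, a in alpha.items():
                for s, p_trans in PS.get(prev, {}).items():
                    p_emit = PY.get((prev, s), {}).get(y, 0.0)   # arc must be "marked" by y
                    if p_trans and p_emit:
                        new_alpha[s] = new_alpha.get(s, 0.0) + a * p_trans * p_emit
            alpha = new_alpha                           # ...and forget about stage i
        return sum(alpha.values())                      # P(Y)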
Trellis: The Complete Example
The Case of Trigrams
• Like before, but:
  - states correspond to bigrams,
  - the output function always emits the second symbol of the pair (state) to which the arc goes.
• Multiple paths are not possible → trellis not really needed.
Trigrams with Classes
• More interesting:
  - n-gram class LM: p(wi|wi-2,wi-1) = p(wi|ci) × p(ci|ci-2,ci-1)
  - → states are pairs of classes (ci-1,ci), and they emit "words" (letters in our example).
• Usual, non-overlapping classes:
  p(t|C) = 1,  p(o|V) = .3,  p(e|V) = .6,  p(y|V) = .1
• Examples (a quick arithmetic check follows below):
  p(toe) = .6 × 1 × .88 × .3 × .07 × .6 ≈ .00665
  p(teo) = .6 × 1 × .88 × .6 × .07 × .3 ≈ .00665
  p(toy) = .6 × 1 × .88 × .3 × .07 × .1 ≈ .00111
  p(tty) = .6 × 1 × .12 × 1 × 1 × .1 = .0072
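A quick check of these products (plain Python arithmetic, using only the numbers quoted above):

    # Verify the class-trigram probabilities quoted above.
    p_toe = .6 * 1 * .88 * .3 * .07 * .6   # ~ 0.00665
    p_teo = .6 * 1 * .88 * .6 * .07 * .3   # ~ 0.00665 (same factors, reordered)
    p_toy = .6 * 1 * .88 * .3 * .07 * .1   # ~ 0.00111
    p_tty = .6 * 1 * .12 * 1 * 1 * .1      # = 0.0072
    print(round(p_toe, 5), round(p_teo, 5), round(p_toy, 5), round(p_tty, 5))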
Class Trigrams: the Trellis
• Trellis generation (Y = "toy"):
  p(t|C) = 1,  p(o|V) = .3,  p(e|V) = .6,  p(y|V) = .1
• Again, the trellis is useful but not really needed:
  α = .6 × .88 × .3 = .1584 (after "to"),
  α = .1584 × .07 × .1 = .00111 (after "toy").
Overlapping Classes
• Imagine that classes may overlap:
  - e.g., 'r' is sometimes a vowel and sometimes a consonant; it belongs to V as well as C.
• p(t|C) = .3,  p(r|C) = .7
  p(o|V) = .1,  p(e|V) = .3,  p(y|V) = .4,  p(r|V) = .2
• p(try) = ?
Overlapping Classes: Trellis Example
Trellis: Remarks
• So far, we went left to right (computing α).
• Same result: going right to left (computing β)
  - supposing we know where to start (finite data).
• In fact, we might start in the middle, going left and right.
• Important for parameter estimation (the Forward-Backward algorithm, alias Baum-Welch).
• Implementation issues (a log-space sketch follows below):
  - scaling/normalizing probabilities, to avoid too-small numbers & addition problems with many transitions.
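One common way to address the scaling issue is to work with log probabilities and combine them with a log-sum-exp; a minimal sketch (this particular remedy is an illustration of the point, not spelled out on the slide):

    import math

    def logsumexp(log_values):
        """Stable log(sum(exp(x) for x in log_values))."""
        if not log_values:
            return float("-inf")
        m = max(log_values)
        return m + math.log(sum(math.exp(x - m) for x in log_values))

    # Combining two trellis contributions .3 and .32 in log space:
    log_alpha = logsumexp([math.log(.3), math.log(.32)])   # equals log(.62)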
The Viterbi Algorithm
• Solving the task of finding the most likely sequence of states which generated the observed data,
• i.e., finding
  Sbest = argmaxS P(S|Y),
  which is equal to (Y is constant and thus P(Y) is fixed):
  Sbest = argmaxS P(S,Y)
        = argmaxS P(s0,s1,s2,...,sk,y1,y2,...,yk)
        = argmaxS ∏i=1..k p(yi|si,si-1) × p(si|si-1)
The Crucial Observation
• Imagine the trellis built as before (but do not compute the αs yet; assume they are o.k.); stage i:
• NB: remember the previous state from which we got the maximum:
  α = max(.3, .32) = .32
  ("reverse" the arc, i.e., keep a back pointer to that state).
• This is certainly the "backwards" maximum to (D,2)... but it cannot change even when we go forward
  (Markov Property: Limited History).
• (A minimal Viterbi sketch follows below.)
Viterbi Example
• 'rry' classification (C or V?, and which sequence?):
  p(t|C) = .3,  p(r|C) = .7
  p(o|V) = .1,  p(e|V) = .3,  p(y|V) = .4,  p(r|V) = .2
• argmaxXYZ p(rry|XYZ) = ?
• Possible state sequences:
  (x,V)(V,C)(C,V) [VCV],  (x,C)(C,C)(C,V) [CCV],  (x,C)(C,V)(V,V) [CVV]
Viterbi Computation
n-best State Sequences
• Keep track of the n best "back pointers".
• Ex.: n = 2: two "winners": VCV (best), CCV (2nd best).
• From the trellis: α = .42 × .12 × .7 = .03528;  α = .07392 × .07 × ...
• (A minimal n-best sketch follows below.)
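A minimal sketch of the n-best extension, keeping up to n (probability, path) entries per trellis state instead of one (the heapq-based pruning is my own choice of implementation):

    import heapq

    def viterbi_nbest(s0, PS, PY, output, n=2):
        """n best state sequences for an arc-emission HMM (a sketch)."""
        best = {s0: [(1.0, [s0])]}                       # state -> up to n (prob, path) entries
        for y in output:
            candidates = {}
            for prev, entries in best.items():
                for s, p_trans in PS.get(prev, {}).items():
                    p_emit = PY.get((prev, s), {}).get(y, 0.0)
                    if p_trans == 0.0 or p_emit == 0.0:
                        continue
                    for p, path in entries:
                        candidates.setdefault(s, []).append((p * p_trans * p_emit, path + [s]))
            # keep only the n best "back pointers" per trellis state
            best = {s: heapq.nlargest(n, c, key=lambda t: t[0]) for s, c in candidates.items()}
        final = [entry for entries in best.values() for entry in entries]
        return heapq.nlargest(n, final, key=lambda t: t[0])   # overall n best (prob, state sequence)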