Introduction to Natural Language Processing (600.465)
Markov Models
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu  www.cs.jhu.edu/~hajic

Review: Markov Process
• Bayes formula (chain rule):
  p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|w1,w2,...,wi-2,wi-1)
• n-gram language models:
  - Markov process (chain) of the order n-1:
    p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|wi-n+1,wi-n+2,...,wi-1)
  - Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
    Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
    Words:     My car broke down , and within hours Bob 's car broke down ,  too .
    p(,|broke down) = p(w5|w3,w4) = p(w14|w12,w13)

Markov Properties
• Generalize to any process (not just words/LM):
  - Sequence of random variables: X = (X1,X2,...,XT)
  - Sample space S (states), size N: S = {s0,s1,s2,...,sN}
• 1. Limited History (Context, Horizon):
     ∀i ∈ 1..T;  P(Xi|X1,...,Xi-1) = P(Xi|Xi-1)
• 2. Time invariance (the Markov chain is stationary/homogeneous):
     ∀i ∈ 1..T, ∀y,x ∈ S;  P(Xi=y|Xi-1=x) = p(y|x)   (...the same distribution at every position)

Long History Possible
• What if we want trigrams?
• Formally, use a transformation: define new variables Qi such that Xi = (Qi-1,Qi); then
  P(Xi|Xi-1) = P(Qi-1,Qi|Qi-2,Qi-1) = p(Qi|Qi-2,Qi-1)
  - Predicting (Xi); History (Xi-1 = (Qi-2,Qi-1)).

Graph Representation: State Diagram
• S = {s0,s1,s2,...,sN}: states
• Distribution P(Xi|Xi-1):
  - transitions (as arcs) with probabilities attached to them.

The Trigram Case
• S = {s0,s1,s2,...,sN}: states: pairs si = (x,y)
• Distribution P(Xi|Xi-1): (r.v. X: generates pairs si)
  p(toe) = .6 × .88 × .07 = .037
  p(one) = ?

Finite State Automaton
• States ~ symbols of the [input/output] alphabet
  - pairs (or more): last element of the n-tuple
• Arcs ~ transitions (sequence of states)
• [Classical FSA: alphabet symbols on arcs;
  - transformation: arcs ↔ nodes]
• Possible thanks to the "limited history" Markov Property
• So far: Visible Markov Models (VMM)

Hidden Markov Models
• The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible".

Added Flexibility
• So far, no change; but different states may generate the same output (why not?).

Output from Arcs...
• Added flexibility: generate output from arcs, not states.

...and Finally, Add Output Probabilities
• Maximum flexibility: a [unigram] distribution (sample space: output alphabet) at each output arc
  (simplified!); example arc distributions from the graph:
  p(t) = 0,  p(o) = 0,  p(e) = 1
  p(t) = .8, p(o) = .1, p(e) = .1
  p(t) = .5, p(o) = .2, p(e) = .3
• Alternative view: mark the arcs by their output symbols and get rid of the output distributions:
  p(toe) = .48 × .616 × .6  +  .2 × 1 × .176  +  .2 × 1 × .12  ≈ .237
• In the future, we will use whichever view is more convenient for the problem at hand.
Formalization
• HMM (the most general case):
  - a five-tuple (S, s0, Y, PS, PY), where:
    * S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
      - size of PS: |S|²,
    * PY(yk|si,sj) is the set of output (emission) probability distributions,
      - size of PY: |S|² × |Y|.
• Example:
  - S = {x, 1, 2, 3, 4},  s0 = x
  - Y = {t, o, e}
Formalization - Example
• Example (for the graph, see foils 11 and 12):
  - S = {x, 1, 2, 3, 4},  s0 = x
  - Y = {e, o, t}
  PS (transition probabilities; rows = from-state, columns = to-state, each row sums to 1):
        x    1    2    3    4
   x    0   .6    0   .4    0
   1    0    0  .12    0  .88
   2    0    0    0    0    1
   3    0    1    0    0    0
   4    0    0    1    0    0
  PY (output/emission probabilities, one distribution over Y = {t, o, e} per arc; each distribution sums to 1):
  the table is only partially recoverable here; see the output distributions attached to the arcs of the
  graph on foils 11 and 12 (e.g., p(t) = .8, p(o) = .1, p(e) = .1 on one of the arcs).
  (A minimal Python encoding of the five-tuple follows below.)
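For concreteness, a minimal Python sketch of how this five-tuple might be encoded. The dictionary layout is an assumption chosen for illustration; the PS numbers come from the table above, while the PY entry shown is a placeholder, since the emission table is not fully recoverable.

    # A minimal sketch of the HMM five-tuple (S, s0, Y, PS, PY) as Python data.
    S = ["x", 1, 2, 3, 4]          # states
    s0 = "x"                       # initial state
    Y = ["t", "o", "e"]            # output alphabet

    # PS[s_prev][s_next] = transition probability (zero entries omitted)
    PS = {
        "x": {1: .6, 3: .4},
        1:   {2: .12, 4: .88},
        2:   {4: 1.0},
        3:   {1: 1.0},
        4:   {2: 1.0},
    }

    # PY[(s_prev, s_next)][symbol] = emission probability on that arc
    # (illustrative values only; fill in the distributions from the graph)
    PY = {
        ("x", 1): {"t": .8, "o": .1, "e": .1},
        # ... one distribution per arc with nonzero transition probability
    }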
Using the HMM
• The generation algorithm (of limited value :-)); a minimal sketch follows below:
  1. Start in s = s0.
  2. Move from s to s' with probability PS(s'|s).
  3. Output (emit) symbol yk with probability PY(yk|s,s').
  4. Set s = s' and repeat from step 2 (until somebody says enough).
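A minimal Python sketch of this generation procedure, assuming the dictionary encoding of PS and PY sketched above (the function name and the use of random.choices are my own choices, not from the slides):

    import random

    def generate(s0, PS, PY, length=10):
        """Randomly walk the HMM and emit one symbol per transition (a sketch)."""
        s, output = s0, []
        for _ in range(length):                       # "until somebody says enough"
            next_states = list(PS[s].keys())
            trans_probs = list(PS[s].values())
            s_next = random.choices(next_states, weights=trans_probs)[0]   # step 2
            emit_dist = PY[(s, s_next)]
            y = random.choices(list(emit_dist.keys()),
                               weights=list(emit_dist.values()))[0]        # step 3
            output.append(y)
            s = s_next                                 # step 4: move on
        return output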
• More interesting usage:
  - Given an output sequence Y = {y1,y2,...,yk}, compute its probability.
  - Given an output sequence Y = {y1,y2,...,yk}, compute the most likely sequence of states which has generated it.
  - ...plus variations: e.g., the n best state sequences.
Introduction to Natural Language Processing (600.465)
HMM Algorithms: Trellis and Viterbi
Dr. Jan Hajic CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu  www.cs.jhu.edu/~hajic
HMM: The Two Tasks
• HMM (the general case):
  - a five-tuple (S, s0, Y, PS, PY), where:
    * S = {s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
    * PY(yk|si,sj) is the set of output (emission) probability distributions.
• Given an HMM & an output sequence Y = {y1,y2,...,yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y.
Trellis - Deterministic Output
• HMM (graph) vs. trellis: the trellis unrolls the HMM over time/position t = 0, 1, 2, 3 for the output Y: t o e.
• trellis state: (HMM state, position)
• each trellis state holds one number (a probability): α; e.g., α(x,0) = 1, α(A,1) = .6, ...
• probability of Y: the α in the last state.
Creating the Trellis: The Start
• Start in the start state (x),
  - set its α(x,0) to 1.
• Create the first stage (position/stage 1):
  - get the first "output" symbol y1,
  - create the first stage (column),
  - but only those trellis states which generate y1,
  - set their α(state,1) to PS(state|x) × α(x,0).
• ...and forget about the 0-th stage.
Trellis: The Next Step
• Suppose we are in stage i. Creating the next stage (i+1):
  - create all trellis states in the next stage which generate yi+1, but only those reachable from any of the stage-i states,
  - set their α(state,i+1) to:
      PS(state|prev.state) × α(prev.state,i)
    (add up all such numbers on arcs going to a common trellis state),
  - ...and forget about stage i.
Trellis: The Last Step
• Continue until the "output" is exhausted:
  - |Y| = 3: until stage 3.
• Add together all the α(state,|Y|). That's the P(Y).
• Observation (pleasant):
  - memory usage max: 2|S|
  - multiplications max: |S|²|Y|
• In the example, the last position/stage gives P(Y) = α = .568.
• (A minimal sketch of the whole trellis computation follows below.)
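A minimal Python sketch of this trellis computation for the deterministic-output case; the (prev_state, symbol) arc dictionary is an encoding chosen for illustration, not the slides' notation:

    # arcs: dict mapping (prev_state, symbol) -> list of (next_state, probability)
    def trellis_prob(arcs, s0, output):
        """P(Y) for an HMM whose arcs are each labeled with one output symbol (a sketch)."""
        alpha = {s0: 1.0}                              # stage 0: only the start state
        for sym in output:                             # one trellis stage per output symbol
            new_alpha = {}
            for prev, a in alpha.items():
                for nxt, p in arcs.get((prev, sym), []):
                    # add up numbers on arcs going to a common trellis state
                    new_alpha[nxt] = new_alpha.get(nxt, 0.0) + p * a
            alpha = new_alpha                          # ...and forget about stage i (memory <= 2|S|)
        return sum(alpha.values())                     # P(Y): add all alpha(state, |Y|)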
Trellis: The General Case (still bigrams)
• Start as usual:
  - start state (x), set its α(x,0) to 1.
• p(toe) = .48 × .616 × .6  +  .2 × 1 × .176  +  .2 × 1 × .12  ≈ .237
General Trellis: The Next Step
• We are in stage i:
  - Generate the next stage i+1 as before (except that now the arcs generate the output, so use only those arcs marked by the output symbol yi+1).
  - For each generated state, compute (summing over all incoming arcs marked yi+1):
      α(state,i+1) = Σ PS(state|prev.state) × PY(yi+1|prev.state,state) × α(prev.state,i)
    (e.g., α = .48 after the first symbol in the example).
• ...and forget about stage i as usual.
• (A sketch of this general forward step follows below.)
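A minimal Python sketch of this general forward computation, assuming the same PS/PY dictionary encoding used earlier (my own encoding, not the slides'):

    def forward_prob(s0, PS, PY, output):
        """Trellis (forward) computation for an HMM that emits on arcs (a sketch)."""
        alpha = {s0: 1.0}                               # stage 0: start state only
        for y in output:                                # build stage i+1 from stage i
            new_alpha = {}
            for prev, a in alpha.items():
                for s, p_trans in PS.get(prev, {}).items():
                    p_emit = PY.get((prev, s), {}).get(y, 0.0)   # arc must be "marked" by y
                    if p_trans and p_emit:
                        new_alpha[s] = new_alpha.get(s, 0.0) + a * p_trans * p_emit
            alpha = new_alpha                           # ...and forget about stage i
        return sum(alpha.values())                      # P(Y)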
Trellis: The Complete Example
The Case of Trigrams
• Like before, but:
  - states correspond to bigrams,
  - the output function always emits the second symbol of the pair (state) to which the arc goes.
• Multiple paths are not possible → trellis not really needed.
Trigrams with Classes
• More interesting:
  - n-gram class LM: p(wi|wi-2,wi-1) = p(wi|ci) × p(ci|ci-2,ci-1)
  - → states are pairs of classes (ci-1,ci), and they emit "words" (letters in our example).
• Usual, non-overlapping classes:
  p(t|C) = 1,  p(o|V) = .3,  p(e|V) = .6,  p(y|V) = .1
• Examples (a quick arithmetic check follows below):
  p(toe) = .6 × 1 × .88 × .3 × .07 × .6 ≈ .00665
  p(teo) = .6 × 1 × .88 × .6 × .07 × .3 ≈ .00665
  p(toy) = .6 × 1 × .88 × .3 × .07 × .1 ≈ .00111
  p(tty) = .6 × 1 × .12 × 1 × 1 × .1 = .0072
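A quick check of these products (plain Python arithmetic, using only the numbers quoted above):

    # Verify the class-trigram probabilities quoted above.
    p_toe = .6 * 1 * .88 * .3 * .07 * .6   # ~ 0.00665
    p_teo = .6 * 1 * .88 * .6 * .07 * .3   # ~ 0.00665 (same factors, reordered)
    p_toy = .6 * 1 * .88 * .3 * .07 * .1   # ~ 0.00111
    p_tty = .6 * 1 * .12 * 1 * 1 * .1      # = 0.0072
    print(round(p_toe, 5), round(p_teo, 5), round(p_toy, 5), round(p_tty, 5))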
Class Trigrams: the Trellis
• Trellis generation (Y = "toy"):
  p(t|C) = 1,  p(o|V) = .3,  p(e|V) = .6,  p(y|V) = .1
• Again, the trellis is useful but not really needed:
  α = .6 × .88 × .3 = .1584 (after "to"),
  α = .1584 × .07 × .1 = .00111 (after "toy").
Overlapping Classes
• Imagine that classes may overlap:
  - e.g., 'r' is sometimes a vowel and sometimes a consonant; it belongs to V as well as C.
• p(t|C) = .3,  p(r|C) = .7
  p(o|V) = .1,  p(e|V) = .3,  p(y|V) = .4,  p(r|V) = .2
• p(try) = ?
Overlapping Classes: Trellis Example
Trellis: Remarks
• So far, we went left to right (computing α).
• Same result: going right to left (computing β)
  - supposing we know where to start (finite data).
• In fact, we might start in the middle, going left and right.
• Important for parameter estimation (the Forward-Backward algorithm, alias Baum-Welch).
• Implementation issues (a log-space sketch follows below):
  - scaling/normalizing probabilities, to avoid too-small numbers & addition problems with many transitions.
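One common way to address the scaling issue is to work with log probabilities and combine them with a log-sum-exp; a minimal sketch (this particular remedy is an illustration of the point, not spelled out on the slide):

    import math

    def logsumexp(log_values):
        """Stable log(sum(exp(x) for x in log_values))."""
        if not log_values:
            return float("-inf")
        m = max(log_values)
        return m + math.log(sum(math.exp(x - m) for x in log_values))

    # Combining two trellis contributions .3 and .32 in log space:
    log_alpha = logsumexp([math.log(.3), math.log(.32)])   # equals log(.62)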
The Viterbi Algorithm
• Solving the task of finding the most likely sequence of states which generated the observed data,
• i.e., finding
  Sbest = argmaxS P(S|Y),
  which is equal to (Y is constant and thus P(Y) is fixed):
  Sbest = argmaxS P(S,Y)
        = argmaxS P(s0,s1,s2,...,sk,y1,y2,...,yk)
        = argmaxS ∏i=1..k p(yi|si,si-1) × p(si|si-1)
The Crucial Observation
• Imagine the trellis built as before (but do not compute the αs yet; assume they are o.k.); stage i:
• NB: remember the previous state from which we got the maximum:
  α = max(.3, .32) = .32
  ("reverse" the arc, i.e., keep a back pointer to that state).
• This is certainly the "backwards" maximum to (D,2)... but it cannot change even when we go forward
  (Markov Property: Limited History).
• (A minimal Viterbi sketch follows below.)
Viterbi Example
• 'rry' classification (C or V?, and which sequence?):
  p(t|C) = .3,  p(r|C) = .7
  p(o|V) = .1,  p(e|V) = .3,  p(y|V) = .4,  p(r|V) = .2
• argmaxXYZ p(rry|XYZ) = ?
• Possible state sequences:
  (x,V)(V,C)(C,V) [VCV],  (x,C)(C,C)(C,V) [CCV],  (x,C)(C,V)(V,V) [CVV]
Viterbi Computation
n-best State Sequences
• Keep track of the n best "back pointers".
• Ex.: n = 2: two "winners": VCV (best), CCV (2nd best).
• From the trellis: α = .42 × .12 × .7 = .03528;  α = .07392 × .07 × ...
• (A minimal n-best sketch follows below.)
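A minimal sketch of the n-best extension, keeping up to n (probability, path) entries per trellis state instead of one (the heapq-based pruning is my own choice of implementation):

    import heapq

    def viterbi_nbest(s0, PS, PY, output, n=2):
        """n best state sequences for an arc-emission HMM (a sketch)."""
        best = {s0: [(1.0, [s0])]}                       # state -> up to n (prob, path) entries
        for y in output:
            candidates = {}
            for prev, entries in best.items():
                for s, p_trans in PS.get(prev, {}).items():
                    p_emit = PY.get((prev, s), {}).get(y, 0.0)
                    if p_trans == 0.0 or p_emit == 0.0:
                        continue
                    for p, path in entries:
                        candidates.setdefault(s, []).append((p * p_trans * p_emit, path + [s]))
            # keep only the n best "back pointers" per trellis state
            best = {s: heapq.nlargest(n, c, key=lambda t: t[0]) for s, c in candidates.items()}
        final = [entry for entries in best.values() for entry in entries]
        return heapq.nlargest(n, final, key=lambda t: t[0])   # overall n best (prob, state sequence)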