Cross Entropy
PA154 Jazykové modelování (2.1) Pavel Rychlý
pary@fi.muni.cz
March 2, 2017
Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic
"Coding" Interpretation of Entropy
■ The least (average) number of bits needed to encode a message (string, sequence, series, ...) (each element having being a result of a random process with some distribution p): = H(p)
■ Remember various compressing algorithms?
► they do well on data with repeating (= easily predictable = = low entropy) patterns
► their results though have high entropy ^> compressing compressed data does nothing
PA154 Jazykové modelování (2.1)
Cross Entropy
2/12
Coding: Example
■ How many bits do we need for ISO Latin 1?
► ^> the trivial answer: 8
■ Experience: some chars are more common, some (very) rare:
► ... so what if we use more bits for the rare, and less bits for the frequent? (be careful: want to decode (easily)!)
► suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x)^.0004
► code: 'a' ^ 00, 'b' ^ 01, 'c' ^ 10, rest: 116162^3^4^5^6^7^8
► code 'acbbecbaac':
00   10   01   01    1100001111    10   01   00   00 10
acbb e cbaac
► number of bits used: 28 (vs. 80 using "naive" coding)
■ code length ~ —log(probability)
PA154 Jazykové modelování (2.1)
Cross Entropy
3/12
Entropy of Language
■ Imagine that we produce the next letter using
p(ln+i\li, • • • In),
where /i,... In is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call /i,... In the history /7(/7n+i), and all histories H:
■ Then compute its entropy:
■ Not very practical, isn't it?
PA154 Jazykové modelování (2.1)
Cross Entropy
4/12
Cross-Entropy
■ Typical case: we've got series of observations
T — {ti, t2, £3, £4,..., tn} (numbers, words, ...; t\ G ft); estimate
(sample): Vy G ft : p(y) = ~pFj^»
def. c(y) = |{te T;t = y}\
■ ... but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p (instead of p)?
■ Idea: simulate actual p by using a different T (or rather: by using different observation we simulate the insufficiency of T vs. some other data ("random" difference))
PA154 Jazykové modelování (2.1)
Cross Entropy
5/12
Cross Entropy: The Formula
Hp,(~p) = H(p') + D(p'\\~p)
p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p
note on notation (confusing ...): — <H> p, also Hjr(p)
P
(Cross)Perplexity: Gp/(p) = GT/(p) = 2Hp'^p)
PA154 Jazykové modelování (2.1)
Cross Entropy
6/12
Conditional Cross Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x)...
■ In practice: virtually always conditioning on context
■ Interested in: sample space v|/, r.v. Y, y e *|/; context: sample space fi, r.v.X, x 6 Q:
"our" distribution p(y|x), test against p;(y,x), which is taken from some independent data:
PA154 Jazykové modelování (2.1)
Cross Entropy
7/12
Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the space(s) (especially for cross entropy!)
■ Use the following formula: Hpr(p) —
-EyG^GQ p'(y>x) lQg2 p(y|*) = - VI^IEz-i...!^! lQg2 p(y/l*/)
■ This is in fact the normalized log probability of the "test" data:
Hp/(p) = -i/\r\iog2 H p(yi\xi)
;=1...I7~'I
PA154 Jazykové modelování (2.1)
Cross Entropy
8/12
Computation Example
Q = {a, b, ..,z}, prob. distribution (assumed/estimated from data): p(a) = .25, p(b) = .5, p(a) =     for a G {c.r}, = 0 for the rest: s,t,u,v,w,x,y,z
Data (test): barb P'(a) = p'(r) = .25, p'(b) = .5 Sum over Q:
a a   bcdefg...pq   r    s t  ... z
-p\d)\og2p(<^)   -5+.5+0+0+0+0+0+0+0+O+0+1.5+0+Q+0+0+0 = 2
Sum over data:
i/sx 1/b     2/a     3/r     4/b ^^li\T\
-iog^Si) 1   +   2+   6   +   1    =10    (1/4)  x 10 =2
PA154 Jazykové modelování (2.1)
Cross Entropy
9/12
Cross Entropy: Some Observations
H(p) ??<,=,>?? Hp/(p) : ALL!
Previous example:
p(a) = .25, p(b) = .5, p(a)= A for a £ {c.r}, = 0 for the rest: s,t,u,v,w,x,y,z
H(p) = 2.5bits = H(p')(barb)
Other data: probable: (±)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6)= 4.25
H(p) < A.25bits = H{p'){probable)
And finally: abba: (±)(2 + 1 + 1 + 2) = 1.5
H(p) > 1.5bits = H(p')(abba) But what about: baby -p'('y') log2 p('y') = -.25 log2 0 = oo (??)
PA154 Jazykové modelování (2.1)
Cross Entropy
10/12
Cross Entropy: Usage
■ Comparing data??
NO! (we believe that we test on real data!)
■ Rather: comparing distributions (vs. real data)
■ Have (got) 2 distributions: p and q (on some Q,X)
► which is better?
► better: has lower cross-entropy (perplexity) on real data S
■ "Real" data: S
				
Hs(P) = -1/	5|E,=i..|si loS2P{y,	x,) © Hs(q) = -1/	5|E,=i..|s| loS2q{y,	
PA154 Jazykové modelování (2.1)
Cross Entropy
11/12
Comparing Distributions
p(.) from previous example:
p(a) = .25, p(b) = .5, p(a) =     for a G {c.r}, = 0 for the rest: s,t,u,v,w,x,y,z
q(.|.) (conditional; defined by a table):
	a	b	e	i	□	p	r	other
	0	í	0	0	0	.125	D	0
b	i	0	D	0	1	.125	0	0
e	0	0	0	1	0	125	0	
1	0	.5	0	D	a	.125	0 s	0
□	0	0	0	0	0	.1 2Í	ľ	0
P	0	0	0	G	0	.1 25	n	1^
r	0	I]	■;i	0	0	.1 25	- n —	
								
other	0	0	i	0	ú	.1 25	0	Ij
1
_gx.: q(o|r)
q(r|p) =. 125
(1/8) (log(p|oth.)+log(rjP)+log(ojr)+log(b|o)+log(a|b)+log(b|a)+log(I|b)+log(e|l)) (1/8) (    0      +3+0+0+1    +    0+1   +   0 )
H3(q) = .625
PA154 Jazykové modelovaní (2.1)
Cross Entropy
12/12