Cross Entropy
PA154 Jazykové modelování (2.1)
Pavel Rychlý, pary@fi.muni.cz
March 2, 2017
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

"Coding" Interpretation of Entropy
■ The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
■ Remember various compression algorithms?
► they do well on data with repeating (= easily predictable = low entropy) patterns
► their results, though, have high entropy ⇒ compressing compressed data does nothing

Coding: Example
■ How many bits do we need for ISO Latin 1?
► the trivial answer: 8
■ Experience: some chars are more common, some (very) rare:
► ... so what if we use more bits for the rare ones, and fewer bits for the frequent ones? (be careful: we still want to decode (easily)!)
► suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ 0.0004
► code: 'a' → 00, 'b' → 01, 'c' → 10, rest: 11b₁b₂b₃b₄b₅b₆b₇b₈
► code 'acbbecbaac': 00 10 01 01 1100001111 10 01 00 00 10
                     (a  c  b  b  e          c  b  a  a  c)
► number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
■ code length ≈ −log(probability)

Entropy of Language
■ Imagine that we produce the next letter using p(l_{n+1} | l₁, ..., l_n), where l₁, ..., l_n is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call l₁, ..., l_n the history h(l_{n+1}), and all histories H:
■ Then compute its entropy (summing l over the alphabet):
  H = −Σ_{h∈H} Σ_l p(l, h) log₂ p(l | h)
■ Not very practical, is it?

Cross-Entropy
■ Typical case: we've got a series of observations T = {t₁, t₂, t₃, t₄, ..., t_n} (numbers, words, ...; tᵢ ∈ Ω); estimate (sample):
  ∀y ∈ Ω: p̃(y) = c(y) / |T|, where c(y) = |{t ∈ T; t = y}|
■ ... but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p̃ (instead of p)?
■ Idea: simulate the actual p by using a different set of observations T' (by testing on different data we simulate the insufficiency of T, i.e. its "random" difference from other data)

Cross Entropy: The Formula
■ H_{p'}(p̃) = H(p') + D(p' || p̃)
  H_{p'}(p̃) = −Σ_{x∈Ω} p'(x) log₂ p̃(x)
■ p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̃
■ note on notation (confusing ...): H_{p'}(p̃) is also written H_{T'}(p̃)
■ (Cross-)Perplexity: G_{p'}(p̃) = G_{T'}(p̃) = 2^{H_{p'}(p̃)}

Conditional Cross Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x) ...
■ In practice: virtually always conditioning on context
■ Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω:
  "our" distribution p(y|x), tested against p'(y, x), which is taken from some independent data:
  H_{p'}(p) = −Σ_{y∈Ψ, x∈Ω} p'(y, x) log₂ p(y|x)

Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the sample space(s) (especially for cross entropy!)
■ Use the following formula:
  H_{p'}(p) = −Σ_{y∈Ψ, x∈Ω} p'(y, x) log₂ p(y|x) = −1/|T'| Σ_{i=1..|T'|} log₂ p(yᵢ|xᵢ)
■ This is in fact the normalized log probability of the "test" data:
  H_{p'}(p) = −1/|T'| log₂ Π_{i=1..|T'|} p(yᵢ|xᵢ)
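To make the last formula concrete, here is a minimal Python sketch (not part of the original slides): it computes cross entropy as the normalized negative log₂-probability of test data. The function names and the model representation are made up for illustration; the numbers reuse the letter distribution from the worked example that follows.

```python
import math

def cross_entropy(model, test_data):
    """H_p'(p) = -1/|T'| * sum_i log2 p(y_i | x_i):
    the normalized negative log2-probability of the test data."""
    total = 0.0
    for x, y in test_data:              # (context, outcome) pairs from T'
        prob = model(y, x)
        if prob == 0.0:                 # an unseen event makes the cross entropy infinite
            return math.inf
        total -= math.log2(prob)
    return total / len(test_data)

# Toy unigram model from the worked example that follows (context is ignored):
# p(a) = .25, p(b) = .5, p(alpha) = 1/64 for alpha in {c..r}, 0 otherwise.
def p(y, x=None):
    if y == 'a':
        return 0.25
    if y == 'b':
        return 0.5
    return 1 / 64 if 'c' <= y <= 'r' else 0.0

test = [(None, ch) for ch in "barb"]    # T' = barb, no real context
print(cross_entropy(p, test))           # -> 2.5 (bits per letter)
```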
Computation Example
■ Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
■ Data (test): barb; p'(a) = p'(r) = .25, p'(b) = .5
■ Sum over Ω:
  α:                 a    b    c d e f g ... p q   r    s t ... z
  −p'(α) log₂ p(α):  .5 + .5 + 0+0+0+0+0+...+0+0 + 1.5 + 0+...+0  = 2.5
■ Sum over data:
  i/sᵢ:              1/b   2/a   3/r   4/b
  −log₂ p(sᵢ):        1  +  2  +  6  +  1  = 10
  −1/|T'| Σᵢ log₂ p(sᵢ) = (1/4) × 10 = 2.5

Cross Entropy: Some Observations
■ H(p) ?? ≤, =, ≥ ?? H_{p'}(p): ALL of them!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
  H(p) = 2.5 bits = H_{p'}(p) (barb)
■ Other data: probable: (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  H(p) < 4.25 bits = H_{p'}(p) (probable)
■ And finally: abba: (1/4)(2 + 1 + 1 + 2) = 1.5
  H(p) > 1.5 bits = H_{p'}(p) (abba)
■ But what about: baby
  −p'('y') log₂ p('y') = −.25 log₂ 0 = ∞ (??)

Cross Entropy: Usage
■ Comparing data??
  ► NO! (we believe that we test on real data!)
■ Rather: comparing distributions (vs. real data)
■ Have (got) 2 distributions: p and q (on some Ω, X)
  ► which is better?
  ► better: the one with lower cross entropy (perplexity) on real data S
■ "Real" data: S
  H_S(p) = −1/|S| Σ_{i=1..|S|} log₂ p(yᵢ|xᵢ)   vs.   H_S(q) = −1/|S| Σ_{i=1..|S|} log₂ q(yᵢ|xᵢ)

Comparing Distributions
■ p(.) from the previous example:
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
■ q(.|.) (conditional; defined by a table, rows = y, columns = x, entries = q(y|x)):

        x:   a     b     e     l     o     p     r    other
  y: a       0    .5     0     0     0    .125   0     0
     b       1     0     0     0     1    .125   0     0
     e       0     0     0     1     0    .125   0     0
     l       0    .5     0     0     0    .125   0     0
     o       0     0     0     0     0    .125   1     0
     p       0     0     0     0     0    .125   0     1
     r       0     0     0     0     0    .125   0     0
     other   0     0     1     0     0    .125   0     0

  ex.: q(o|r) = 1, q(r|p) = .125
■ Test data S: probable
  H_S(q) = −(1/8)(log₂ q(p|oth.) + log₂ q(r|p) + log₂ q(o|r) + log₂ q(b|o) + log₂ q(a|b) + log₂ q(b|a) + log₂ q(l|b) + log₂ q(e|l))
         = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0)
  H_S(q) = .625
■ Compare with H_S(p) = 4.25 on the same data: the conditional q is the better model of S.
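A short Python sketch (again an illustration, not part of the original slides) that re-derives both numbers on S = probable. The dictionary Q transcribes only the nonzero entries of the q(y|x) table that the test data actually uses; p is the unigram model from the computation example.

```python
import math

# Unigram p(.): p(a) = .25, p(b) = .5, p(alpha) = 1/64 for alpha in {c..r}, 0 otherwise.
def p(y, x=None):                        # context x is ignored
    if y == 'a':
        return 0.25
    if y == 'b':
        return 0.5
    return 1 / 64 if 'c' <= y <= 'r' else 0.0

# Bigram q(y|x) stored as Q[x][y]; only the entries needed for S (the rest are 0).
Q = {
    'other': {'p': 1.0},                 # q(p|other) = 1
    'p':     {'r': 0.125},               # q(.|p) is uniform, .125 each
    'r':     {'o': 1.0},
    'o':     {'b': 1.0},
    'b':     {'a': 0.5, 'l': 0.5},
    'a':     {'b': 1.0},
    'l':     {'e': 1.0},
}
def q(y, x):
    return Q.get(x, {}).get(y, 0.0)

def cross_entropy(model, pairs):
    return -sum(math.log2(model(y, x)) for x, y in pairs) / len(pairs)

S = "probable"
pairs = list(zip(['other'] + list(S[:-1]), S))   # (context, letter) pairs
print(cross_entropy(p, pairs))   # H_S(p) = 4.25 bits
print(cross_entropy(q, pairs))   # H_S(q) = 0.625 bits  -> q is the better model
```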