"Coding" Interpretation of Entropy Cross Entropy PA154 Jazykové modelování (2.1) Pavel Rychlý pary@fi.muni.cz February 24, 2020 The least (average) number of bits needed to encode a message (string, sequence, series, ...) (each element having being a result of a random process with some distribution p): — H(p) Remember various compressing algorithms? ► they do well on data with repeating (= easily predictable = = low entropy) patterns ► their results though have high entropy => compressing compressed data does nothing Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.c5.jhu.edu/~hajic Coding: Example Entropy of Language How many bits do we need for ISO Latin 1? ► the trivial answer: 8 Experience: some chars are more common, some (very) rare: ► ... so what if we use more bits for the rare, and less bits for the frequent? (be careful: want to decode (easily)!) ► suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x)S ► code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11 fai£i2i>3i>41>51>6i>7iiB ► code 'acbbecbaac': 00 10 01 01 1100001111 10 01 00 00 10 acbb e cbaac ► number of bits used: 28 (vs. 80 using "naive" coding) code length ~ —log(probability) .0004 Imagine that we produce the next letter using p('n+i|'i, ■ ■ ■ In), where /i,... /„ is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call k,... /„ the history h(hn+i), and all histories H: Then compute its entropy: -£/ieH£/e,lp(','>)i°g2P('l'0 Not very practical, isn't it? Cross-Entropy Cross Entropy: The Formula (sample): Vy g ft : p(y) = Typical case: we've got series of observations T = {ti, t2, f3, t4,..., f„} (numbers, words, ...; ti 6 ft); estimate cM \T\ ' def. c(y) = \{t g T;t = y}\ ... but the true p is unknown; every sample is too small! Natural question: how well do we do using p (instead of p)? Idea: simulate actual p by using a different T (or rather: by using different observation we simulate the insufficiency of T vs. some other data ("random" difference)) Hp,(p) = H(p') + D(p'||p) HP'(P) = - Ex€S2 P'M |0S2 P(x) p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p note on notation (confusing . -op, also HT/(p) (Cross)Perplexity: Gp/(p) = GT,(p) = 2Hp'(r5) PA154 Jazykové modelováni (2.1) Conditional Cross Entropy Sample Space vs. Data So far: "unconditional" distribution(s) p(x), p'(x)... In practice: virtually always conditioning on context Interested in: sample space V, r.v. Y, y g V; context: sample space Q., r.v.X, xeQ: "our" distribution p(y|x), test against p'(y,x), which is taken from some independent data: - p'(y 1 + 2+6 + 1 10 1/|T'| (1/4) x 10 =2- Cross Entropy: Some Observations ■ H(p) ??<, =, >?? Hp/(p) : ALL! ■ Previous example: p(a) = .25, p(b) = .5, p(a)= ^ for a £ {c.r}, = 0 for the rest: s,t,u,v,w,x,y,z H(p) = 2.5bits = H(p')(barb) ■ Other data: probable: (|)(6 + 6 + 6 + 1+ 2 + 1 + 6 + 6) = 4.25 H(p) < 4.25b/ts = H(p')(probable) ■ And finally: abba: (i)(2 + 1 + 1 + 2) = 1.5 H(p) > l.bbits = H(p')(abba) ■ But what about: baby -p'(y)log2p(y) = -.25log20 = oo (??) Cross Entropy: Usage Comparing Distributions Comparing data?? ► NO! (we believe that we test on real data!) Rather: comparing distributions fvs. real data) Have (got) 2 distributions: p and q (on some ft,X) ► which is better? ► better: has lower cross-entropy (perplexity) on real data S "Real" data: S Hs(p) = -1/|S| £;=i..is| log2p(y;\x;) © Hs(q) = -1/|S| £ log2q(yi\xi) p(.) 
Comparing Distributions

p(.) from the previous example: H_S(p) = 4.25
    p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
q(.|.) (conditional; defined by a table):

    q(y|x)    x=a    x=b    x=e    x=l    x=o    x=p    x=r    x=other
    y=a        0     .5      0      0      0    .125     0       0
    y=b        1      0      0      0      1    .125     0       0
    y=e        0      0      0      1      0    .125     0       0
    y=l        0     .5      0      0      0    .125     0       0
    y=o        0      0      0      0      0    .125     1       0
    y=p        0      0      0      0      0    .125     0       1
    y=r        0      0      0      0      0    .125     0       0
    y=other    0      0      1      0      0    .125     0       0

    ex.: q(o|r) = 1, q(r|p) = .125

Test data S: probable
    H_S(q) = −(1/8)(log₂ q(p|other) + log₂ q(r|p) + log₂ q(o|r) + log₂ q(b|o) + log₂ q(a|b) + log₂ q(b|a) + log₂ q(l|b) + log₂ q(e|l))
           = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
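The same check works for the conditional case. A minimal sketch, assuming the table reconstruction above: q is stored as q[x][y], and OTHER and conditional_cross_entropy are illustrative names (the word boundary is treated as the "other" context, matching the q(p|other) term above).

```python
import math

# Bigram model q(y|x) from the table above, nonzero entries only.
# q[x][y] = q(y|x); OTHER stands for the "other" (here: word-boundary) context.
OTHER = '<other>'
q = {
    'a': {'b': 1.0},
    'b': {'a': 0.5, 'l': 0.5},
    'e': {OTHER: 1.0},
    'l': {'e': 1.0},
    'o': {'b': 1.0},
    'p': {y: 0.125 for y in list('abelopr') + [OTHER]},  # the uniform x=p column
    'r': {'o': 1.0},
    OTHER: {'p': 1.0},
}

def conditional_cross_entropy(model, pairs):
    """H_S(q) = -(1/|S|) * sum over (x_i, y_i) in S of log2 q(y_i|x_i)."""
    total = 0.0
    for x, y in pairs:
        prob = model.get(x, {}).get(y, 0.0)
        if prob == 0.0:
            return math.inf          # an unseen (context, letter) pair
        total += -math.log2(prob)
    return total / len(pairs)

# Test data S = "probable", read off as (context, next letter) pairs.
word = 'probable'
pairs = list(zip([OTHER] + list(word[:-1]), word))
print(conditional_cross_entropy(q, pairs))   # 0.625 bits
```

Since 0.625 << 4.25, q has the lower cross-entropy on the real data "probable" and is therefore the better distribution by the criterion of the "Usage" slide.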