Essential Information Theory I
Pavel Rychlý
PA154 Statistické nástroje pro korpusy, Spring 2014

Source: Introduction to Natural Language Processing (600.465), Dr. Jan Hajič, CS Dept., Johns Hopkins Univ., hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

The Notion of Entropy
■ Entropy: "chaos", fuzziness, the opposite of order, ...
  ■ you know it:
  ■ it is much easier to create a "mess" than to tidy things up ...
■ Comes from physics:
  ■ entropy does not go down unless energy is used
■ Measure of uncertainty:
  ■ if low ... low uncertainty
■ The higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment.

The Formula
■ Let p_X(x) be the distribution of a random variable X over the set of basic outcomes (alphabet) Ω.
■ H(X) = −∑_{x∈Ω} p(x) log₂ p(x)
■ Unit: bits (with the natural logarithm: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
■ Toss a fair coin: Ω = {head, tail}
  ■ p(head) = .5, p(tail) = .5
  ■ H(p) = −0.5 log₂(0.5) + (−0.5 log₂(0.5)) = 2 × ((−0.5) × (−1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ■ H(p) = −∑_{i=1..32} p(x_i) log₂ p(x_i) = −32 (p(x_1) log₂ p(x_1)) (since p(x_i) = p(x_1) = 1/32 for all i) = −32 × (1/32 × (−5)) = 5
  ■ (now you see why it's called bits?)
■ Unfair coin:
  ■ p(head) = .2 ... H(p) = .722
  ■ p(head) = .01 ... H(p) = .081

Example: Book Availability
■ (graph on the slide: entropy as a function of probability; not reproduced here)

The Limits
■ When is H(p) = 0?
  ■ when the result of the experiment is known ahead of time:
  ■ necessarily: ∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
■ Upper bound?
  ■ none in general
  ■ for |Ω| = n: H(p) ≤ log₂ n
  ■ nothing can be more uncertain than the uniform distribution

Entropy and Expectation
■ Recall: E(X) = ∑_{x∈X(Ω)} p_X(x) · x
■ Then: E(log₂(1/p_X(x))) = ∑_{x∈X(Ω)} p_X(x) log₂(1/p_X(x)) = −∑_{x∈X(Ω)} p_X(x) log₂ p_X(x) = H(p_X) =_notation H(p)

Perplexity: Motivation
■ Recall:
  ■ 2 equiprobable outcomes: H(p) = 1 bit
  ■ 32 equiprobable outcomes: H(p) = 5 bits
  ■ 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ■ 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ■ H(p) = 1 bit
  ■ is there a measure for comparing the entropy (i.e. the uncertainty/difficulty of prediction) also across random variables with different numbers of outcomes?

Perplexity
■ Perplexity: G(p) = 2^{H(p)}
■ ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ it is easier to imagine:
  ■ NLP example: the size of a vocabulary with uniform distribution that is equally hard to predict
■ the "wilder" (more biased) the distribution, the better:
  ■ lower entropy, lower perplexity (checked numerically in the sketch below)
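A minimal Python sketch (not part of the original slides) that reproduces the entropy and perplexity values above; a distribution is represented as a dict mapping outcomes to probabilities:

    from math import log2

    def entropy(p):
        """Entropy H(p) in bits; outcomes with p(x) = 0 contribute 0."""
        return -sum(px * log2(px) for px in p.values() if px > 0)

    def perplexity(p):
        """Perplexity G(p) = 2^H(p)."""
        return 2 ** entropy(p)

    fair_coin = {"head": 0.5, "tail": 0.5}
    die32 = {i: 1 / 32 for i in range(32)}
    unfair1 = {"head": 0.2, "tail": 0.8}
    unfair2 = {"head": 0.01, "tail": 0.99}

    print(entropy(fair_coin))              # 1.0 bit
    print(entropy(die32))                  # 5.0 bits
    print(round(entropy(unfair1), 3))      # 0.722
    print(round(entropy(unfair2), 3))      # 0.081
    print(perplexity(fair_coin))           # 2.0
    print(perplexity(die32))               # 32.0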
Joint Entropy and Conditional Entropy
■ Two random variables: X (sample space Ω), Y (sample space Ψ)
■ Joint entropy:
  ■ no big deal ((x,y) is considered a single event):
  ■ H(X,Y) = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(x,y)
■ Conditional entropy:
  ■ H(Y|X) = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(y|x)
  ■ recall that H(X) = E(log₂(1/p_X(x))) (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)
■ Other definition: H(Y|X) = ∑_{x∈Ω} p(x) H(Y|X = x)
■ For H(Y|X = x) we can use the single-variable definition (x is a constant):
  H(Y|X) = ∑_{x∈Ω} p(x) (−∑_{y∈Ψ} p(y|x) log₂ p(y|x))
         = −∑_{x∈Ω} ∑_{y∈Ψ} p(y|x) p(x) log₂ p(y|x)
         = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(y|x)

Properties of Entropy
■ Entropy is non-negative:
  ■ H(X) ≥ 0
  ■ proof (recall: H(X) = −∑_{x∈Ω} p(x) log₂ p(x)):
    ■ log₂ p(x) is negative or zero because p(x) ≤ 1,
    ■ p(x) is non-negative, so the product p(x) log₂ p(x) is non-positive,
    ■ a sum of non-positive numbers is non-positive,
    ■ and −f is non-negative for a non-positive f
■ Chain rule:
  ■ H(X,Y) = H(Y|X) + H(X), as well as
  ■ H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y)) (verified on a toy example in the sketch below)

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ■ H(Y|X) ≤ H(Y)
  ■ H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  ■ equality iff X, Y are independent
  ■ (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book-availability graph?)
  ■ a function f is concave over an interval (a,b) iff ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1−λ)y) ≥ λf(x) + (1−λ)f(y)
  ■ a function f is convex if −f is concave
  ■ for proofs and generalizations, see Cover & Thomas

"Coding" Interpretation of Entropy
■ H(p) is the least (average) number of bits needed to encode a message (string, sequence, series, ...), each element of which is the result of a random process with some distribution p.
■ Remember the various compression algorithms?
  ■ they do well on data with repeating (= easily predictable = low-entropy) patterns
  ■ their output, though, has high entropy ⇒ compressing compressed data does nothing

Coding: Example
■ How many bits do we need for ISO Latin 1?
  ■ ⇒ the trivial answer: 8
■ Experience: some characters are more common, some (very) rare:
  ■ ... so what if we use fewer bits for the frequent characters and more bits for the rare ones? (be careful: we still want to decode easily!)
  ■ suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
  ■ code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, the rest: 11 b₁b₂b₃b₄b₅b₆b₇b₈
  ■ coding the string a c b b ⟨rare character⟩ c b a a c: 00 10 01 01 11 00001111 10 01 00 00 10
  ■ number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
  ■ code length ~ 1/probability (more precisely, −log₂ of the probability); conditional probabilities are OK too!

Entropy of Language
■ Imagine that we produce the next letter using p(l_{n+1} | l₁,...,l_n), where l₁,...,l_n is the sequence of all the letters uttered so far (i.e. n is really big!); let's call l₁,...,l_n the history h (= h_{n+1}), and the set of all histories H:
■ Then compute its entropy:
  ■ −∑_{h∈H} ∑_{l∈A} p(l,h) log₂ p(l|h)
■ Not very practical, is it?
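A small Python sketch (not in the original slides) illustrating joint entropy, conditional entropy, the chain rule, and H(Y|X) ≤ H(Y) on a made-up toy joint distribution:

    from math import log2

    def entropy(p):
        """H(p) in bits over the values of a probability dict."""
        return -sum(v * log2(v) for v in p.values() if v > 0)

    def marginal(pxy, axis):
        """Marginal distribution over one coordinate of the joint p(x,y)."""
        m = {}
        for key, p in pxy.items():
            m[key[axis]] = m.get(key[axis], 0.0) + p
        return m

    def joint_entropy(pxy):
        return entropy(pxy)  # (x, y) treated as a single event

    def conditional_entropy(pxy):
        """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
        px = marginal(pxy, 0)
        return -sum(p * log2(p / px[x]) for (x, y), p in pxy.items() if p > 0)

    # toy joint distribution p(x, y) -- illustration only, not from the slides
    pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}

    print(joint_entropy(pxy))                                    # 1.5
    print(conditional_entropy(pxy))                              # 0.5
    print(conditional_entropy(pxy) + entropy(marginal(pxy, 0)))  # 1.5 = H(Y|X) + H(X), chain rule
    print(entropy(marginal(pxy, 1)))                             # ~0.811 >= H(Y|X): conditioning never hurts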
Cross-Entropy
■ Typical case: we have a series of observations T = {t₁, t₂, t₃, t₄, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (sample): ∀y ∈ Ω: p̃(y) = c(y)/|T|, where c(y) = |{t ∈ T; t = y}|
■ ... but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p̃ (instead of p)?
■ Idea: simulate the actual p by using different data T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data (the "random" difference))

Cross-Entropy: The Formula
■ H_{p'}(p̃) = H(p') + D(p'||p̃)
  H_{p'}(p̃) = −∑_{x∈Ω} p'(x) log₂ p̃(x)
■ p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̃
■ note on notation (confusing ...): p ↔ p̃; also written H_{T'}(p)
■ (Cross-)Perplexity: G_{p'}(p̃) = G_{T'}(p̃) = 2^{H_{p'}(p̃)}

Conditional Cross-Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x), ...
■ In practice: we virtually always condition on context.
■ Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω; "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  H_{p'}(p) = −∑_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x)

Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the space(s) Ψ, Ω (especially for cross-entropy!)
■ Use the following formula instead:
  H_{p'}(p) = −∑_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x) = −1/|T'| ∑_{i=1..|T'|} log₂ p(y_i|x_i)
■ This is in fact the normalized log probability of the "test" data:
  H_{p'}(p) = −1/|T'| log₂ ∏_{i=1..|T'|} p(y_i|x_i)

Computation Example
■ Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data): p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ Test data: "barb"; p'(a) = p'(r) = .25, p'(b) = .5
■ Sum over Ω (terms −p'(α) log₂ p(α) for α = a, b, c, ..., z):
  .5 (a) + .5 (b) + 0 + ... + 0 + 1.5 (r) + 0 + ... + 0 = 2.5
■ Sum over the data ("b", "a", "r", "b"; terms −log₂ p(t_i)):
  1 + 2 + 6 + 1 = 10, and (1/4) × 10 = 2.5
■ (these values, and those on the next slide, are reproduced in the sketch below)

Cross-Entropy: Some Observations
■ Is H(p) ≤, =, or ≥ H_{p'}(p)? ALL of these are possible!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest (s, t, u, v, w, x, y, z):
  H(p) = 2.5 bits = H_{p'}(p) on "barb"
■ Other data, "probable": (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  H(p) < 4.25 bits = H_{p'}(p) on "probable"
■ And finally, "abba": (1/4)(2 + 1 + 1 + 2) = 1.5
  H(p) > 1.5 bits = H_{p'}(p) on "abba"
■ But what about "baby"? −p'('y') log₂ p('y') = −.25 log₂ 0 = ∞ (!?)

Cross-Entropy: Usage
■ Comparing data??
  ■ NO! (we believe that we test on real data!)
■ Rather: comparing distributions (against real data)
■ We have (got) two distributions: p and q (on some Ω, X)
  ■ which one is better?
  ■ the better one has lower cross-entropy (perplexity) on real data S
■ "Real" data: S
  ■ H_S(p) = −1/|S| ∑_{i=1..|S|} log₂ p(y_i|x_i)
  ■ H_S(q) = −1/|S| ∑_{i=1..|S|} log₂ q(y_i|x_i)
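A minimal Python sketch (not part of the slides) that reproduces the per-character cross-entropy values above using the normalized log-probability formula from the Sample Space vs. Data slide (here without conditioning on context):

    from math import log2, inf

    # distribution from the slides: p(a)=.25, p(b)=.5, p=1/64 for c..r, 0 otherwise
    p = {"a": 0.25, "b": 0.5}
    for c in "cdefghijklmnopqr":
        p[c] = 1 / 64

    def cross_entropy(p, data):
        """Cross-entropy estimated on test data: -1/|T'| * sum_i log2 p(t_i)."""
        total = 0.0
        for t in data:
            if p.get(t, 0.0) == 0.0:
                return inf          # the model assigns zero probability to observed data
            total += -log2(p[t])
        return total / len(data)

    print(cross_entropy(p, "barb"))      # 2.5
    print(cross_entropy(p, "probable"))  # 4.25
    print(cross_entropy(p, "abba"))      # 1.5
    print(cross_entropy(p, "baby"))      # inf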
Comparing Distributions
■ p(.) from the previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest (s, t, u, v, w, x, y, z)
  ■ H_S(p) = 4.25 on the test data S = "probable"
■ q(.|.) (conditional; defined by the following table; rows = the predicted symbol y, columns = the context x, so each column sums to 1):

    q(y|x)      x: a     b     e     l     o     p      r     other
    y: a           0    .5     0     0     0    .125    0     0
       b           1     0     0     0     1    .125    0     0
       e           0     0     0     1     0    .125    0     0
       l           0    .5     0     0     0    .125    0     0
       o           0     0     0     0     0    .125    1     0
       p           0     0     0     0     0    .125    0     1
       r           0     0     0     0     0    .125    0     0
       other       0     0     1     0     0    .125    0     0

  ■ examples: q(o|r) = 1, q(r|p) = .125
■ H_S(q) on S = "probable" (predicting each letter from the preceding one, starting from the "other" context):
  (1/8)(−log₂ q(p|other) − log₂ q(r|p) − log₂ q(o|r) − log₂ q(b|o) − log₂ q(a|b) − log₂ q(b|a) − log₂ q(l|b) − log₂ q(e|l))
  = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0)
■ H_S(q) = .625
■ ⇒ q has a much lower cross-entropy on S than p (.625 vs. 4.25 bits), so q is the better model of this data.
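A minimal Python sketch (not part of the slides) that recomputes H_S(q) for the bigram table above on S = "probable", assuming "other" as the starting context:

    from math import log2

    # non-zero entries of q(y|x) from the table, keyed as (context x, next symbol y)
    q = {
        ("other", "p"): 1.0, ("r", "o"): 1.0, ("o", "b"): 1.0, ("a", "b"): 1.0,
        ("l", "e"): 1.0, ("e", "other"): 1.0, ("b", "a"): 0.5, ("b", "l"): 0.5,
    }
    for y in ["a", "b", "e", "l", "o", "p", "r", "other"]:
        q[("p", y)] = 0.125  # q(.|p) is uniform over the eight symbols

    def conditional_cross_entropy(q, data, start="other"):
        """H_S(q) = -1/|S| * sum_i log2 q(y_i | x_i), with x_1 = start context."""
        context = start
        total = 0.0
        for y in data:
            total += -log2(q[(context, y)])
            context = y
        return total / len(data)

    print(conditional_cross_entropy(q, "probable"))  # 0.625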