Essential Information Theory I
Pavel Rychlý
PA154 Statistické nástroje pro korpusy, Spring 2014

Source: Introduction to Natural Language Processing (600.465), Dr. Jan Hajič, CS Dept., Johns Hopkins Univ., hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic

The Notion of Entropy
■ Entropy: "chaos", fuzziness, the opposite of order, ...
  ■ you know it:
  ■ it is much easier to create a "mess" than to tidy things up ...
■ Comes from physics:
  ■ entropy does not go down unless energy is used
■ Measure of uncertainty:
  ■ if low ... low uncertainty
■ The higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment.

The Formula
■ Let p_X(x) be the distribution of a random variable X over the set of basic outcomes (alphabet) Ω.
■ H(X) = −∑_{x∈Ω} p(x) log₂ p(x)
■ Unit: bits (with the natural logarithm: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
■ Toss a fair coin: Ω = {head, tail}
  ■ p(head) = .5, p(tail) = .5
  ■ H(p) = −0.5 log₂(0.5) + (−0.5 log₂(0.5)) = 2 × ((−0.5) × (−1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ■ H(p) = −∑_{i=1..32} p(x_i) log₂ p(x_i) = −32 (p(x_1) log₂ p(x_1)) (since p(x_i) = p(x_1) = 1/32 for all i) = −32 × (1/32 × (−5)) = 5
  ■ (now you see why it's called bits?)
■ Unfair coin:
  ■ p(head) = .2 ... H(p) = .722
  ■ p(head) = .01 ... H(p) = .081

Example: Book Availability
■ (graph on the slide: entropy as a function of probability; not reproduced here)

The Limits
■ When is H(p) = 0?
  ■ when the result of the experiment is known ahead of time:
  ■ necessarily: ∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
■ Upper bound?
  ■ none in general
  ■ for |Ω| = n: H(p) ≤ log₂ n
  ■ nothing can be more uncertain than the uniform distribution

Entropy and Expectation
■ Recall: E(X) = ∑_{x∈X(Ω)} p_X(x) · x
■ Then: E(log₂(1/p_X(x))) = ∑_{x∈X(Ω)} p_X(x) log₂(1/p_X(x)) = −∑_{x∈X(Ω)} p_X(x) log₂ p_X(x) = H(p_X) =_notation H(p)

Perplexity: Motivation
■ Recall:
  ■ 2 equiprobable outcomes: H(p) = 1 bit
  ■ 32 equiprobable outcomes: H(p) = 5 bits
  ■ 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ■ 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ■ H(p) = 1 bit
  ■ is there a measure for comparing the entropy (i.e. the uncertainty/difficulty of prediction) also across random variables with different numbers of outcomes?

Perplexity
■ Perplexity: G(p) = 2^{H(p)}
■ ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ it is easier to imagine:
  ■ NLP example: the size of a vocabulary with uniform distribution that is equally hard to predict
■ the "wilder" (more biased) the distribution, the better:
  ■ lower entropy, lower perplexity (checked numerically in the sketch below)
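A minimal Python sketch (not part of the original slides) that reproduces the entropy and perplexity values above; a distribution is represented as a dict mapping outcomes to probabilities:

    from math import log2

    def entropy(p):
        """Entropy H(p) in bits; outcomes with p(x) = 0 contribute 0."""
        return -sum(px * log2(px) for px in p.values() if px > 0)

    def perplexity(p):
        """Perplexity G(p) = 2^H(p)."""
        return 2 ** entropy(p)

    fair_coin = {"head": 0.5, "tail": 0.5}
    die32 = {i: 1 / 32 for i in range(32)}
    unfair1 = {"head": 0.2, "tail": 0.8}
    unfair2 = {"head": 0.01, "tail": 0.99}

    print(entropy(fair_coin))              # 1.0 bit
    print(entropy(die32))                  # 5.0 bits
    print(round(entropy(unfair1), 3))      # 0.722
    print(round(entropy(unfair2), 3))      # 0.081
    print(perplexity(fair_coin))           # 2.0
    print(perplexity(die32))               # 32.0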
Joint Entropy and Conditional Entropy
■ Two random variables: X (sample space Ω), Y (sample space Ψ)
■ Joint entropy:
  ■ no big deal ((x,y) is considered a single event):
  ■ H(X,Y) = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(x,y)
■ Conditional entropy:
  ■ H(Y|X) = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(y|x)
  ■ recall that H(X) = E(log₂(1/p_X(x))) (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)
■ Other definition: H(Y|X) = ∑_{x∈Ω} p(x) H(Y|X = x)
■ For H(Y|X = x) we can use the single-variable definition (x is a constant):
  H(Y|X) = ∑_{x∈Ω} p(x) (−∑_{y∈Ψ} p(y|x) log₂ p(y|x))
         = −∑_{x∈Ω} ∑_{y∈Ψ} p(y|x) p(x) log₂ p(y|x)
         = −∑_{x∈Ω} ∑_{y∈Ψ} p(x,y) log₂ p(y|x)

Properties of Entropy
■ Entropy is non-negative:
  ■ H(X) ≥ 0
  ■ proof (recall: H(X) = −∑_{x∈Ω} p(x) log₂ p(x)):
    ■ log₂ p(x) is negative or zero because p(x) ≤ 1,
    ■ p(x) is non-negative, so the product p(x) log₂ p(x) is non-positive,
    ■ a sum of non-positive numbers is non-positive,
    ■ and −f is non-negative for a non-positive f
■ Chain rule:
  ■ H(X,Y) = H(Y|X) + H(X), as well as
  ■ H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y)) (verified on a toy example in the sketch below)

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ■ H(Y|X) ≤ H(Y)
  ■ H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  ■ equality iff X, Y are independent
  ■ (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book-availability graph?)
  ■ a function f is concave over an interval (a,b) iff ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1−λ)y) ≥ λf(x) + (1−λ)f(y)
  ■ a function f is convex if −f is concave
  ■ for proofs and generalizations, see Cover & Thomas

"Coding" Interpretation of Entropy
■ H(p) is the least (average) number of bits needed to encode a message (string, sequence, series, ...), each element of which is the result of a random process with some distribution p.
■ Remember the various compression algorithms?
  ■ they do well on data with repeating (= easily predictable = low-entropy) patterns
  ■ their output, though, has high entropy ⇒ compressing compressed data does nothing

Coding: Example
■ How many bits do we need for ISO Latin 1?
  ■ ⇒ the trivial answer: 8
■ Experience: some characters are more common, some (very) rare:
  ■ ... so what if we use fewer bits for the frequent characters and more bits for the rare ones? (be careful: we still want to decode easily!)
  ■ suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
  ■ code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, the rest: 11 b₁b₂b₃b₄b₅b₆b₇b₈
  ■ coding the string a c b b ⟨rare character⟩ c b a a c: 00 10 01 01 11 00001111 10 01 00 00 10
  ■ number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
  ■ code length ~ 1/probability (more precisely, −log₂ of the probability); conditional probabilities are OK too!

Entropy of Language
■ Imagine that we produce the next letter using p(l_{n+1} | l₁,...,l_n), where l₁,...,l_n is the sequence of all the letters uttered so far (i.e. n is really big!); let's call l₁,...,l_n the history h (= h_{n+1}), and the set of all histories H:
■ Then compute its entropy:
  ■ −∑_{h∈H} ∑_{l∈A} p(l,h) log₂ p(l|h)
■ Not very practical, is it?
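A small Python sketch (not in the original slides) illustrating joint entropy, conditional entropy, the chain rule, and H(Y|X) ≤ H(Y) on a made-up toy joint distribution:

    from math import log2

    def entropy(p):
        """H(p) in bits over the values of a probability dict."""
        return -sum(v * log2(v) for v in p.values() if v > 0)

    def marginal(pxy, axis):
        """Marginal distribution over one coordinate of the joint p(x,y)."""
        m = {}
        for key, p in pxy.items():
            m[key[axis]] = m.get(key[axis], 0.0) + p
        return m

    def joint_entropy(pxy):
        return entropy(pxy)  # (x, y) treated as a single event

    def conditional_entropy(pxy):
        """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
        px = marginal(pxy, 0)
        return -sum(p * log2(p / px[x]) for (x, y), p in pxy.items() if p > 0)

    # toy joint distribution p(x, y) -- illustration only, not from the slides
    pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}

    print(joint_entropy(pxy))                                    # 1.5
    print(conditional_entropy(pxy))                              # 0.5
    print(conditional_entropy(pxy) + entropy(marginal(pxy, 0)))  # 1.5 = H(Y|X) + H(X), chain rule
    print(entropy(marginal(pxy, 1)))                             # ~0.811 >= H(Y|X): conditioning never hurts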
Cross-Entropy
■ Typical case: we have a series of observations T = {t₁, t₂, t₃, t₄, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (sample): ∀y ∈ Ω: p̃(y) = c(y)/|T|, where c(y) = |{t ∈ T; t = y}|
■ ... but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p̃ (instead of p)?
■ Idea: simulate the actual p by using different data T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data (the "random" difference))

Cross-Entropy: The Formula
■ H_{p'}(p̃) = H(p') + D(p'||p̃)
  H_{p'}(p̃) = −∑_{x∈Ω} p'(x) log₂ p̃(x)
■ p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̃
■ note on notation (confusing ...): p ↔ p̃; also written H_{T'}(p)
■ (Cross-)Perplexity: G_{p'}(p̃) = G_{T'}(p̃) = 2^{H_{p'}(p̃)}

Conditional Cross-Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x), ...
■ In practice: we virtually always condition on context.
■ Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω; "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  H_{p'}(p) = −∑_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x)

Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the space(s) Ψ, Ω (especially for cross-entropy!)
■ Use the following formula instead:
  H_{p'}(p) = −∑_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x) = −1/|T'| ∑_{i=1..|T'|} log₂ p(y_i|x_i)
■ This is in fact the normalized log probability of the "test" data:
  H_{p'}(p) = −1/|T'| log₂ ∏_{i=1..|T'|} p(y_i|x_i)

Computation Example
■ Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data): p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ Test data: "barb"; p'(a) = p'(r) = .25, p'(b) = .5
■ Sum over Ω (terms −p'(α) log₂ p(α) for α = a, b, c, ..., z):
  .5 (a) + .5 (b) + 0 + ... + 0 + 1.5 (r) + 0 + ... + 0 = 2.5
■ Sum over the data ("b", "a", "r", "b"; terms −log₂ p(t_i)):
  1 + 2 + 6 + 1 = 10, and (1/4) × 10 = 2.5
■ (these values, and those on the next slide, are reproduced in the sketch below)

Cross-Entropy: Some Observations
■ Is H(p) ≤, =, or ≥ H_{p'}(p)? ALL of these are possible!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest (s, t, u, v, w, x, y, z):
  H(p) = 2.5 bits = H_{p'}(p) on "barb"
■ Other data, "probable": (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  H(p) < 4.25 bits = H_{p'}(p) on "probable"
■ And finally, "abba": (1/4)(2 + 1 + 1 + 2) = 1.5
  H(p) > 1.5 bits = H_{p'}(p) on "abba"
■ But what about "baby"? −p'('y') log₂ p('y') = −.25 log₂ 0 = ∞ (!?)

Cross-Entropy: Usage
■ Comparing data??
  ■ NO! (we believe that we test on real data!)
■ Rather: comparing distributions (against real data)
■ We have (got) two distributions: p and q (on some Ω, X)
  ■ which one is better?
  ■ the better one has lower cross-entropy (perplexity) on real data S
■ "Real" data: S
  ■ H_S(p) = −1/|S| ∑_{i=1..|S|} log₂ p(y_i|x_i)
  ■ H_S(q) = −1/|S| ∑_{i=1..|S|} log₂ q(y_i|x_i)
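A minimal Python sketch (not part of the slides) that reproduces the per-character cross-entropy values above using the normalized log-probability formula from the Sample Space vs. Data slide (here without conditioning on context):

    from math import log2, inf

    # distribution from the slides: p(a)=.25, p(b)=.5, p=1/64 for c..r, 0 otherwise
    p = {"a": 0.25, "b": 0.5}
    for c in "cdefghijklmnopqr":
        p[c] = 1 / 64

    def cross_entropy(p, data):
        """Cross-entropy estimated on test data: -1/|T'| * sum_i log2 p(t_i)."""
        total = 0.0
        for t in data:
            if p.get(t, 0.0) == 0.0:
                return inf          # the model assigns zero probability to observed data
            total += -log2(p[t])
        return total / len(data)

    print(cross_entropy(p, "barb"))      # 2.5
    print(cross_entropy(p, "probable"))  # 4.25
    print(cross_entropy(p, "abba"))      # 1.5
    print(cross_entropy(p, "baby"))      # inf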
Comparing Distributions
■ p(.) from the previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest (s, t, u, v, w, x, y, z)
  ■ H_S(p) = 4.25 on the test data S = "probable"
■ q(.|.) (conditional; defined by the following table; rows = the predicted symbol y, columns = the context x, so each column sums to 1):

    q(y|x)      x: a     b     e     l     o     p      r     other
    y: a           0    .5     0     0     0    .125    0     0
       b           1     0     0     0     1    .125    0     0
       e           0     0     0     1     0    .125    0     0
       l           0    .5     0     0     0    .125    0     0
       o           0     0     0     0     0    .125    1     0
       p           0     0     0     0     0    .125    0     1
       r           0     0     0     0     0    .125    0     0
       other       0     0     1     0     0    .125    0     0

  ■ examples: q(o|r) = 1, q(r|p) = .125
■ H_S(q) on S = "probable" (predicting each letter from the preceding one, starting from the "other" context):
  (1/8)(−log₂ q(p|other) − log₂ q(r|p) − log₂ q(o|r) − log₂ q(b|o) − log₂ q(a|b) − log₂ q(b|a) − log₂ q(l|b) − log₂ q(e|l))
  = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0)
■ H_S(q) = .625
■ ⇒ q has a much lower cross-entropy on S than p (.625 vs. 4.25 bits), so q is the better model of this data.
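A minimal Python sketch (not part of the slides) that recomputes H_S(q) for the bigram table above on S = "probable", assuming "other" as the starting context:

    from math import log2

    # non-zero entries of q(y|x) from the table, keyed as (context x, next symbol y)
    q = {
        ("other", "p"): 1.0, ("r", "o"): 1.0, ("o", "b"): 1.0, ("a", "b"): 1.0,
        ("l", "e"): 1.0, ("e", "other"): 1.0, ("b", "a"): 0.5, ("b", "l"): 0.5,
    }
    for y in ["a", "b", "e", "l", "o", "p", "r", "other"]:
        q[("p", y)] = 0.125  # q(.|p) is uniform over the eight symbols

    def conditional_cross_entropy(q, data, start="other"):
        """H_S(q) = -1/|S| * sum_i log2 q(y_i | x_i), with x_1 = start context."""
        context = start
        total = 0.0
        for y in data:
            total += -log2(q[(context, y)])
            context = y
        return total / len(data)

    print(conditional_cross_entropy(q, "probable"))  # 0.625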