Essential Information Theory
PA154 Language Modeling (1.3)
Pavel Rychlý
pary@fi.muni.cz
February 16, 2023
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Notion of Entropy
■ Entropy - "chaos", fuzziness, opposite of order, ...
  ■ you know it
  ■ it is much easier to create "mess" than to tidy things up ...
■ Comes from physics:
  ■ entropy does not go down unless energy is used
■ Measure of uncertainty:
  ■ if low ... low uncertainty
  ■ the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula
■ Let p_X(x) be the distribution of the random variable X
■ Basic outcomes (alphabet) Ω
  ■ H(X) = -Σ_{x∈Ω} p(x) log2 p(x)
■ Unit: bits (with natural log: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
■ Toss a fair coin: Ω = {head, tail}
  ■ p(head) = .5, p(tail) = .5
  ■ H(p) = -0.5 log2(0.5) - 0.5 log2(0.5) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ■ H(p) = -Σ_{i=1..32} p(x_i) log2 p(x_i) = -32 × (p(x_i) log2 p(x_i)) (since p(x_i) = 1/32 for all i) = -32 × (1/32 × (-5)) = 5
  ■ (now you see why it's called bits?)
■ Unfair coin:
  ■ p(head) = .2 ... H(p) = .722
  ■ p(head) = .01 ... H(p) = .081

Example: Book Availability
[Figure: binary entropy H(p) plotted against p(Book Available) over [0, 1]; the curve is 0 at both ends and peaks at p = 0.5, with "bad bookstore" and "good bookstore" marked on it.]

The Limits
■ When is H(p) = 0?
  ■ if the result of an experiment is known ahead of time:
  ■ necessarily: ∃x ∈ Ω: p(x) = 1 & ∀y ∈ Ω: y ≠ x ⇒ p(y) = 0
■ Upper bound?
  ■ none in general
  ■ for |Ω| = n: H(p) ≤ log2 n
  ■ nothing can be more uncertain than the uniform distribution

Entropy and Expectation
■ Recall:
  ■ E(X) = Σ_{x∈X(Ω)} p_X(x) · x
■ Then:
  ■ E(log2(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log2(1/p_X(x)) = -Σ_{x∈X(Ω)} p_X(x) log2 p_X(x) = H(p_X) (notation: H(p))

Perplexity: motivation
■ Recall:
  ■ 2 equiprobable outcomes: H(p) = 1 bit
  ■ 32 equiprobable outcomes: H(p) = 5 bits
  ■ 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ■ 32 outcomes, 2 equiprobable at 0.5, rest impossible:
  ■ H(p) = 1 bit
  ■ is there a measure for comparing the entropy (i.e. uncertainty / difficulty of prediction) (also) for random variables with different numbers of outcomes?

Perplexity
■ Perplexity:
  ■ G(p) = 2^H(p)
  ■ ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ it is easier to imagine:
  ■ NLP example: the size of a vocabulary with a uniform distribution that would be equally hard to predict
■ the "wilder" (more biased) the distribution, the better:
  ■ lower entropy, lower perplexity
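A minimal sketch (not part of the original slides) of the definitions above: entropy in bits and perplexity G(p) = 2^H(p), checked against the coin and die examples. The helper names entropy and perplexity are illustrative.

```python
import math

def entropy(probs):
    """Entropy H(p) in bits of a distribution given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity G(p) = 2 ** H(p)."""
    return 2 ** entropy(probs)

print(entropy([0.5, 0.5]))                   # fair coin: 1.0 bit
print(entropy([1 / 32] * 32))                # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))                   # unfair coin: ~0.722 bits
print(perplexity([1 / 32] * 32))             # back to 32
print(perplexity([0.5, 0.5] + [0.0] * 30))   # 32 outcomes, only 2 possible: 2.0
```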
Joint Entropy and Conditional Entropy
■ Two random variables: X (space Ω), Y (space Ψ)
■ Joint entropy:
  ■ no big deal ((X,Y) considered a single event):
  ■ H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)
■ Conditional entropy:
  ■ H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)

Properties of Entropy I
■ Entropy is non-negative: H(X) ≥ 0
  ■ proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x))
  ■ log2 p(x) is negative or zero since p(x) ≤ 1,
  ■ p(x) is non-negative; their product p(x) log2 p(x) is thus negative or zero,
  ■ a sum of non-positive numbers is non-positive,
  ■ and -f is non-negative for non-positive f
■ Chain rule:
  ■ H(X,Y) = H(Y|X) + H(X), as well as
  ■ H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ■ H(Y|X) ≤ H(Y)
■ H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  ■ equality iff X, Y independent
  ■ (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ■ a concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  ■ a function f is convex if -f is concave
  ■ for proofs and generalizations, see Cover/Thomas
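A small numeric check (not from the original slides) of the chain rule and the inequalities above on a made-up 2×2 joint distribution; the probability values are purely illustrative.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zero entries contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Purely illustrative joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y).
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

H_XY = H(p_xy.values())
H_X = H(p_x.values())
H_Y = H(p_y.values())
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x).
H_Y_given_X = -sum(v * math.log2(v / p_x[x]) for (x, y), v in p_xy.items() if v > 0)

print(abs(H_XY - (H_Y_given_X + H_X)) < 1e-12)  # chain rule: H(X,Y) = H(Y|X) + H(X)
print(H_Y_given_X <= H_Y)                        # conditioning never increases entropy
print(H_XY <= H_X + H_Y)                         # subadditivity
```

Any other valid joint distribution should leave all three checks true; the last two become equalities only when X and Y are independent.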