Essential Information Theory
PA154 Jazykové modelování (1.3)
Pavel Rychlý
pary@fi.muni.cz
23 February 2017

Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

The Notion of Entropy

■ Entropy - "chaos", fuzziness, the opposite of order, ...
  ► you know it
  ► it is much easier to create a "mess" than to tidy things up
■ Comes from physics:
  ► entropy does not go down unless energy is used
■ Measure of uncertainty:
  ► if low ... low uncertainty
■ The higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment.

The Formula

■ Let p_X(x) be a distribution of the random variable X
■ Basic outcomes (alphabet): \Omega
■ H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x)
■ Unit: bits (with the natural logarithm: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example

■ Toss a fair coin: \Omega = {head, tail}
  ► p(head) = .5, p(tail) = .5
  ► H(p) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 2 \times ((-0.5) \times (-1)) = 2 \times 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ► H(p) = -\sum_{i=1..32} p(x_i) \log_2 p(x_i) = -32 (p(x_1) \log_2 p(x_1))
    (since p(x_i) = p(x_1) = 1/32 for all i)
    = -32 \times (1/32 \times (-5)) = 5 (now you see why it's called bits?)
■ Unfair coin:
  ► p(head) = .2 ... H(p) = .722
  ► p(head) = .01 ... H(p) = .081

Example: Book Availability

■ (graph of H(p) as a function of p; referred to later when discussing the concavity of H)

The Limits

■ When is H(p) = 0?
  ► if the result of the experiment is known ahead of time:
  ► necessarily: \exists x \in \Omega: p(x) = 1 and \forall y \in \Omega: y \neq x \Rightarrow p(y) = 0
■ Upper bound?
  ► none in general
  ► for |\Omega| = n: H(p) \leq \log_2 n
  ► nothing can be more uncertain than the uniform distribution

Entropy and Expectation

■ Recall:
  ► E(X) = \sum_{x \in X(\Omega)} p_X(x) \cdot x
■ Then:
  ► E(\log_2(1/p_X(x))) = \sum_{x \in X(\Omega)} p_X(x) \log_2(1/p_X(x))
    = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x) = H(p_X) =_{notation} H(p)

Perplexity: motivation

■ Recall:
  ► 2 equiprobable outcomes: H(p) = 1 bit
  ► 32 equiprobable outcomes: H(p) = 5 bits
  ► 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ► 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ► H(p) = 1 bit
  ► is there a measure for comparing the entropy (i.e. the uncertainty / difficulty of prediction) (also) of random variables with different numbers of outcomes?

Perplexity

■ Perplexity:
  ► G(p) = 2^{H(p)}
■ ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ It is easier to imagine:
  ► NLP example: the size of a vocabulary with a uniform distribution which is equally hard to predict
■ The "wilder" (more biased) the distribution, the better:
  ► lower entropy, lower perplexity
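As a quick illustration (not part of the original slides), here is a minimal Python sketch that computes H(p) and the perplexity G(p) = 2^{H(p)} for the examples above; the function names are ad hoc.

import math

def entropy(probs):
    """H(p) = -sum_x p(x) log2 p(x); zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """G(p) = 2 ** H(p)."""
    return 2 ** entropy(probs)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1 / 32] * 32))     # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))        # unfair coin, p(head) = .2: ~0.722 bits
print(entropy([0.01, 0.99]))      # unfair coin, p(head) = .01: ~0.081 bits
print(perplexity([1 / 32] * 32))  # 32.0
print(perplexity([0.5, 0.5]))     # 2.0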
Joint Entropy and Conditional Entropy

■ Two random variables: X (space \Omega), Y (space \Psi)
■ Joint entropy:
  ► no big deal ((X, Y) considered a single event):
  ► H(X,Y) = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(x,y) \log_2 p(x,y)
■ Conditional entropy:
  ► H(Y|X) = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(x,y) \log_2 p(y|x)
  ► recall that H(X) = E(\log_2(1/p_X(x)))
    (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)

■ Other definition:
  H(Y|X) = \sum_{x \in \Omega} p(x) H(Y|X = x) =
  (for H(Y|X = x) we can use the single-variable definition, with x a constant)
  = \sum_{x \in \Omega} p(x) (-\sum_{y \in \Psi} p(y|x) \log_2 p(y|x)) =
  = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(y|x) p(x) \log_2 p(y|x) =
  = -\sum_{x \in \Omega} \sum_{y \in \Psi} p(x,y) \log_2 p(y|x)

Properties of Entropy I

■ Entropy is non-negative:
  ► H(X) \geq 0
  ► proof: (recall: H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x))
  ► \log_2 p(x) is negative or zero for p(x) \leq 1,
  ► p(x) is non-negative; their product p(x) \log_2 p(x) is thus negative or zero,
  ► the sum of such terms is negative or zero,
  ► and -f is non-negative for non-positive f
■ Chain rule:
  ► H(X,Y) = H(Y|X) + H(X), as well as
  ► H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II

■ Conditional entropy is better (than unconditional):
  ► H(Y|X) \leq H(Y)
■ H(X,Y) \leq H(X) + H(Y) (follows from the previous (in)equalities)
  ► equality iff X, Y are independent
  ► (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ► a function f is concave over an interval (a,b) iff \forall x, y \in (a,b), \forall \lambda \in [0,1]: f(\lambda x + (1-\lambda)y) \geq \lambda f(x) + (1-\lambda) f(y)
  ► a function f is convex if -f is concave
■ For proofs and generalizations, see Cover & Thomas, Elements of Information Theory.
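As a small numerical check (not from the original slides; the joint distribution below is made up), this Python sketch verifies the chain rule H(X,Y) = H(Y|X) + H(X) stated above on a toy 2x2 distribution.

import math

# A made-up joint distribution p(x, y) over X in {a, b}, Y in {0, 1}; sums to 1.
joint = {('a', 0): 0.3, ('a', 1): 0.2, ('b', 0): 0.1, ('b', 1): 0.4}

def marginal_x(p):
    """p(x) = sum_y p(x, y)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return px

def h_joint(p):
    """H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def h_x(p):
    """H(X), computed from the marginal p(x)."""
    return -sum(v * math.log2(v) for v in marginal_x(p).values() if v > 0)

def h_y_given_x(p):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)."""
    px = marginal_x(p)
    return -sum(v * math.log2(v / px[x]) for (x, _), v in p.items() if v > 0)

print(h_joint(joint))                  # ~1.846 bits
print(h_y_given_x(joint) + h_x(joint)) # same value, by the chain rule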