The Notion of Entropy
Essential Information Theory
PA154 Jazykové modelování (1.3)
Pavel Rychlý (pary@fi.muni.cz)
23 February, 2017
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

■ Entropy: "chaos", fuzziness, the opposite of order, ...
  ► you know it
  ► it is much easier to create a "mess" than to tidy things up
■ Comes from physics:
  ► entropy does not go down unless energy is used
■ Measure of uncertainty:
  ► if low ... low uncertainty
  ► the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula

■ Let $p_X(x)$ be a distribution of the random variable $X$
■ Basic outcomes (alphabet): $\Omega$
■ $H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x)$
■ Unit: bits (with the natural log instead of $\log_2$: nats)
■ Notation: $H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)$

Using the Formula: Example

■ Toss a fair coin: $\Omega = \{\text{head}, \text{tail}\}$
  ► $p(\text{head}) = 0.5$, $p(\text{tail}) = 0.5$
  ► $H(p) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 2 \times ((-0.5) \times (-1)) = 2 \times 0.5 = 1$
■ Take a fair, 32-sided die: $p(x) = \frac{1}{32}$ for every side $x$
  ► $H(p) = -\sum_{i=1}^{32} p(x_i) \log_2 p(x_i) = -32 \cdot p(x_1) \log_2 p(x_1)$ (since $p(x_i) = p(x_1) = \frac{1}{32}$ for all $i$) $= -32 \times (\frac{1}{32} \times (-5)) = 5$
  ► (now you see why it's called bits?)
■ Unfair coin:
  ► $p(\text{head}) = 0.2$ ... $H(p) \approx 0.722$
  ► $p(\text{head}) = 0.01$ ... $H(p) \approx 0.081$
■ (These values are reproduced in the first sketch at the end of these notes.)

Example: Book Availability

[figure: $H(p)$ plotted against $p(\text{Book Available})$, with the probability axis running from 0 to 1]

The Limits

■ When is $H(p) = 0$?
  ► if the result of the experiment is known ahead of time:
  ► necessarily: $\exists x \in \Omega: p(x) = 1$ and $\forall y \in \Omega, y \neq x: p(y) = 0$
■ Upper bound?
  ► none in general
  ► for $|\Omega| = n$: $H(p) \leq \log_2 n$
  ► nothing can be more uncertain than the uniform distribution

Entropy and Expectation

■ Recall: $E(X) = \sum_{x \in X(\Omega)} p_X(x) \cdot x$
■ Then: $E\!\left(\log_2 \frac{1}{p_X(x)}\right) = \sum_{x \in X(\Omega)} p_X(x) \log_2 \frac{1}{p_X(x)} = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x) = H(p_X)$, which in our notation is $H(p)$

Perplexity: Motivation

■ Recall:
  ► 2 equiprobable outcomes: $H(p) = 1$ bit
  ► 32 equiprobable outcomes: $H(p) = 5$ bits
  ► 4.3 billion equiprobable outcomes: $H(p) = 32$ bits
■ What if the outcomes are not equiprobable?
  ► 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ► $H(p) = 1$ bit
  ► is there a measure of entropy (i.e. uncertainty / difficulty of prediction) that also allows comparing random variables with different numbers of outcomes?

Perplexity

■ Perplexity: $G(p) = 2^{H(p)}$
  ► the "equivalent" number of equiprobable outcomes (see the second sketch at the end of these notes)

Joint Entropy and Conditional Entropy

■ Joint entropy of two random variables $X$ and $Y$: $H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$
■ Conditional entropy: $H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$

Properties of Entropy I

■ Entropy is non-negative:
  ► $H(X) \geq 0$
  ► proof: (recall $H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x)$)
  ► $\log_2 p(x)$ is negative or zero for $p(x) \leq 1$,
  ► $p(x)$ is non-negative, so the product $p(x) \log_2 p(x)$ is negative or zero,
  ► a sum of non-positive numbers is non-positive,
  ► and $-f$ is non-negative for non-positive $f$
■ Chain rule:
  ► $H(X, Y) = H(Y \mid X) + H(X)$, as well as
  ► $H(X, Y) = H(X \mid Y) + H(Y)$ (since $H(Y, X) = H(X, Y)$)

Properties of Entropy II

■ Conditional entropy is better (than unconditional):
  ► $H(Y \mid X) \leq H(Y)$
■ $H(X, Y) \leq H(X) + H(Y)$ (follows from the previous (in)equalities)
  ► equality iff $X$, $Y$ are independent
  ► (recall: $X$, $Y$ independent iff $p(X, Y) = p(X)\,p(Y)$)
  ► (the third sketch at the end of these notes checks these inequalities on a small joint distribution)
■ $H(p)$ is concave (remember the book availability graph?)
  ► a function $f$ is concave over an interval $(a, b)$ iff $\forall x, y \in (a, b), \forall \lambda \in [0, 1]: f(\lambda x + (1 - \lambda) y) \geq \lambda f(x) + (1 - \lambda) f(y)$
  ► a function $f$ is convex if $-f$ is concave
■ For proofs and generalizations, see Cover/Thomas.
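
Sketch 1: The entropy formula (not part of the original slides). A minimal Python implementation of $H(p) = -\sum_x p(x) \log_2 p(x)$ that reproduces the numbers from "Using the Formula: Example"; the function name `entropy` and the convention that zero-probability outcomes contribute nothing are my own choices.

```python
import math

def entropy(probs):
    """H(p) = -sum_x p(x) * log2 p(x), in bits.
    Zero-probability outcomes are skipped (0 * log2 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
print(entropy([1/32] * 32))      # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))       # unfair coin, p(head) = .2:  ~0.722
print(entropy([0.01, 0.99]))     # unfair coin, p(head) = .01: ~0.081
print(entropy([1.0]))            # outcome known ahead of time: 0.0

# Upper bound check: H(p) <= log2 n, with equality only for the uniform distribution
print(entropy([0.7, 0.1, 0.1, 0.1]) <= math.log2(4))   # True
```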
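
Sketch 2: Perplexity. A short sketch of $G(p) = 2^{H(p)}$ showing how it answers the motivation question: it reports the "equivalent" number of equiprobable outcomes, so the 32-outcome distribution with only two equiprobable outcomes scores the same as a fair coin. Again an illustrative sketch with my own naming, not code from the course.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity G(p) = 2 ** H(p)."""
    return 2 ** entropy(probs)

print(perplexity([0.5, 0.5]))               # 2.0  (2 equiprobable outcomes)
print(perplexity([1/32] * 32))              # 32.0 (32 equiprobable outcomes)
# 32 possible outcomes, but only 2 equiprobable at 0.5 and the rest impossible:
print(perplexity([0.5, 0.5] + [0.0] * 30))  # 2.0, same as a fair coin
print(perplexity([0.2, 0.8]))               # ~1.65, easier to predict than a fair coin
```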
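
Sketch 3: Chain rule and the entropy inequalities. This sketch checks $H(Y \mid X) \leq H(Y)$ and $H(X, Y) \leq H(X) + H(Y)$ on a small joint distribution invented purely for illustration; the conditional entropy is obtained through the chain rule $H(Y \mid X) = H(X, Y) - H(X)$ rather than from its definition.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint distribution p(x, y); X and Y are clearly dependent.
joint = {('a', 0): 0.4, ('a', 1): 0.1,
         ('b', 0): 0.1, ('b', 1): 0.4}

# Marginal distributions p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_xy = entropy(joint.values())   # joint entropy H(X, Y): ~1.722
H_x = entropy(p_x.values())      # H(X): 1.0
H_y = entropy(p_y.values())      # H(Y): 1.0
H_y_given_x = H_xy - H_x         # chain rule: H(Y|X) = H(X, Y) - H(X): ~0.722

print(f"H(Y|X) = {H_y_given_x:.3f} <= H(Y) = {H_y:.3f}")
print(f"H(X,Y) = {H_xy:.3f} <= H(X) + H(Y) = {H_x + H_y:.3f}")
# Equality in H(X,Y) <= H(X) + H(Y) would hold iff X and Y were independent.
```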