The Notion of Entropy
Essential Information Theory
PA154 Jazykové modelování (1.3)
Pavel Rychlý, pary@fi.muni.cz
February 17, 2020
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Entropy
■ Entropy — "chaos", fuzziness, opposite of order, ...
  ► you know it
  ► it is much easier to create a "mess" than to tidy things up
■ Comes from physics:
  ► entropy does not go down unless energy is used
■ Measure of uncertainty:
  ► if low ... low uncertainty
  ► the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula
■ Let p_X(x) be a distribution of random variable X
■ Basic outcomes (alphabet): Ω
■ Entropy: H(X) = -Σ_{x∈Ω} p(x) log₂ p(x)
■ Unit: bits (with the natural logarithm: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
(the values below are recomputed in the first sketch at the end of this section)
■ Toss a fair coin: Ω = {head, tail}
  ► p(head) = .5, p(tail) = .5
  ► H(p) = -0.5 log₂(0.5) + (-0.5 log₂(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ► H(p) = -Σ_{i=1..32} p(xᵢ) log₂ p(xᵢ) = -32 (p(x₁) log₂ p(x₁))   (since p(xᵢ) = p(x₁) = 1/32 for all i)
    = -32 × (1/32 × (-5)) = 5   (now you see why it's called bits?)
■ Unfair coin:
  ► p(head) = .2 ... H(p) = .722
  ► p(head) = .01 ... H(p) = .081

The Limits
■ When is H(p) = 0?
  ► if the result of an experiment is known ahead of time
  ► necessarily: ∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
■ Upper bound?
  ► none in general
  ► for |Ω| = n: H(p) ≤ log₂ n
  ► nothing can be more uncertain than the uniform distribution

Example: Book Availability
[Figure: entropy H(p) as a function of p(Book Available), 0 ≤ p ≤ 1 — zero at the endpoints, maximal at p = 0.5]

Entropy and Expectation
■ Recall: E(X) = Σ_{x∈X(Ω)} p_X(x) · x
■ Then: E(log₂(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log₂(1/p_X(x)) = -Σ_{x∈X(Ω)} p_X(x) log₂ p_X(x) = H(p_X) =_notation H(p)

Perplexity: Motivation
■ Recall:
  ► 2 equiprobable outcomes: H(p) = 1 bit
  ► 32 equiprobable outcomes: H(p) = 5 bits
  ► 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ► 32 outcomes, 2 equiprobable at 0.5, rest impossible: H(p) = 1 bit
  ► is there a measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) for random variables with different numbers of outcomes?

Perplexity
■ Perplexity: G(p) = 2^H(p)
  ► i.e. the entropy expressed as the equivalent number of equiprobable outcomes

Joint Entropy and Conditional Entropy
■ Two random variables: X (over Ω), Y (over Ψ)
■ Joint entropy: H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log₂ p(x,y)
■ Conditional entropy: H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log₂ p(y|x)

Properties of Entropy I
■ Entropy is non-negative: H(X) ≥ 0
  ► proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log₂ p(x))
  ► log₂ p(x) is negative or zero for p(x) ≤ 1,
  ► p(x) is non-negative; their product p(x) log₂ p(x) is thus negative or zero,
  ► a sum of non-positive numbers is non-positive,
  ► and -f is non-negative for non-positive f
■ Chain rule (checked numerically in the second sketch at the end of this section):
  ► H(X,Y) = H(Y|X) + H(X), as well as
  ► H(X,Y) = H(X|Y) + H(Y)   (since H(Y,X) = H(X,Y))

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ► H(Y|X) ≤ H(Y)
■ H(X,Y) ≤ H(X) + H(Y)   (follows from the previous (in)equalities)
  ► equality iff X, Y independent
  ► (recall: X, Y independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ► concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  ► function f is convex if -f is concave
■ for proofs and generalizations, see Cover/Thomas
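
To make the entropy formula and the perplexity G(p) = 2^H(p) concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the coin and die examples above. The function names `entropy` and `perplexity` and the example distributions are illustrative choices, not anything defined in the lecture.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def perplexity(p):
    """G(p) = 2**H(p): the equivalent number of equiprobable outcomes."""
    return 2 ** entropy(p)

fair_coin   = [0.5, 0.5]
fair_die_32 = [1 / 32] * 32
unfair_a    = [0.2, 0.8]    # p(head) = .2
unfair_b    = [0.01, 0.99]  # p(head) = .01

print(entropy(fair_coin))                # 1.0 bit
print(entropy(fair_die_32))              # 5.0 bits
print(round(entropy(unfair_a), 3))       # 0.722
print(round(entropy(unfair_b), 3))       # 0.081
print(perplexity(fair_die_32))           # 32.0 -- back at the number of equiprobable outcomes
```

Note how the 32-outcome distribution with only two outcomes at 0.5 each would give H(p) = 1 bit and perplexity 2, matching the motivation slide.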
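
The chain rule H(X,Y) = H(Y|X) + H(X) and the inequality H(X,Y) ≤ H(X) + H(Y) can also be checked numerically. The sketch below is again not from the slides; the joint distribution `p_xy` is a hypothetical example chosen only for illustration.

```python
import math
from collections import defaultdict

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution p(x, y) over X in {'a', 'b'} and Y in {0, 1}.
p_xy = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}

# Marginals p(x) and p(y).
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x).
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

print(H(p_xy))               # joint entropy H(X,Y)
print(H_y_given_x + H(p_x))  # same value, by the chain rule
print(H(p_x) + H(p_y))       # >= H(X,Y); equal only if X and Y are independent
```

Running it shows H(X,Y) ≈ 1.846 from both the definition and the chain rule, while H(X) + H(Y) ≈ 1.971 is strictly larger because this X and Y are dependent.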