MUNI FI
PA154 Language Modeling (1.3)
Essential Information Theory
Pavel Rychlý, pary@fi.muni.cz
February 16, 2023

The Notion of Entropy

Entropy: "chaos", fuzziness, the opposite of order, ...
■ you know it:
■ it is much easier to create a "mess" than to tidy things up...
Comes from physics:
■ entropy does not go down unless energy is used
Measure of uncertainty:
■ if low ... low uncertainty
■ the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Formula

Let p_X(x) be the distribution of a random variable X
Basic outcomes (alphabet): Ω
Entropy: H(X) = -Σ_{x∈Ω} p(x) log2 p(x)
Unit: bits (with the natural log: nats)
Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example

Toss a fair coin: Ω = {head, tail}
■ p(head) = .5, p(tail) = .5
■ H(p) = -0.5 log2(0.5) + (-0.5 log2(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
Take a fair, 32-sided die: p(x) = 1/32 for every side x
■ H(p) = -Σ_{i=1..32} p(x_i) log2 p(x_i) = -32 × (p(x_1) log2 p(x_1)) (since p(x_i) = p(x_1) = 1/32 for all i)
  = -32 × (1/32 × (-5)) = 5 (now you see why it's called bits?)
Unfair coin:
■ p(head) = .2 ... H(p) = .722
■ p(head) = .01 ... H(p) = .081

Example: Book Availability

[Figure: H(p) plotted against p(Book Available); zero at p = 0 and p = 1, maximal at p = 0.5]

The Limits

When is H(p) = 0?
■ if the result of an experiment is known ahead of time:
■ necessarily: ∃x ∈ Ω: p(x) = 1 & ∀y ∈ Ω, y ≠ x: p(y) = 0
Upper bound?
■ none in general
■ for |Ω| = n: H(p) ≤ log2 n
■ nothing can be more uncertain than the uniform distribution

Entropy and Expectation

Recall:
■ E(X) = Σ_{x∈X(Ω)} p_X(x) · x
Then:
■ E(log2(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log2(1/p_X(x)) = -Σ_{x∈X(Ω)} p_X(x) log2 p_X(x) = H(p_X) =_notation H(p)

Perplexity: motivation

Recall:
■ 2 equiprobable outcomes: H(p) = 1 bit
■ 32 equiprobable outcomes: H(p) = 5 bits
■ 4.3 billion equiprobable outcomes: H(p) = 32 bits
What if the outcomes are not equiprobable?
■ 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ■ H(p) = 1 bit
■ we need a measure for comparing entropy (i.e. uncertainty / difficulty of prediction) (also) across random variables with different numbers of outcomes!

Perplexity

Perplexity:
■ G(p) = 2^H(p)
■ ...so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ it is easier to imagine:
  ■ NLP example: the size of a vocabulary with a uniform distribution that is equally hard to predict
■ the "wilder" (more biased) the distribution, the better:
  ■ lower entropy, lower perplexity
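A quick numerical check of the coin and die examples above (not part of the original slides): a minimal Python sketch; the entropy function and the example distributions are illustrative assumptions, not the course's own code.

import math

def entropy(dist):
    # H(p) = -sum_x p(x) * log2 p(x); outcomes with p(x) = 0 contribute nothing
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy({"head": 0.5, "tail": 0.5}))              # 1.0  (fair coin)
print(entropy({side: 1 / 32 for side in range(32)}))    # 5.0  (fair 32-sided die; log2 32, the uniform upper bound)
print(round(entropy({"head": 0.2, "tail": 0.8}), 3))    # 0.722  (unfair coin)
print(round(entropy({"head": 0.01, "tail": 0.99}), 3))  # 0.081  (very unfair coin)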

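Continuing the sketch above (again an illustration, not from the slides), perplexity as the size of a uniform "vocabulary" that is equally hard to predict:

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    # G(p) = 2 ** H(p): the size of a uniform distribution that is equally hard to predict
    return 2 ** entropy(dist)

uniform_vocab = {w: 1 / 32 for w in range(32)}   # 32 equiprobable "words"
print(perplexity(uniform_vocab))                 # 32.0
print(perplexity({"head": 0.5, "tail": 0.5}))    # 2.0  (fair coin)
biased_vocab = {0: 0.5, 1: 0.5}                  # 2 outcomes at 0.5, the other 30 impossible
print(perplexity(biased_vocab))                  # 2.0: lower entropy, lower perplexity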
Joint Entropy and Conditional Entropy

Two random variables: X (space Ω), Y (space Ψ)
Joint entropy:
■ no big deal ((X,Y) considered a single event):
  H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)
Conditional entropy:
  H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)
■ recall that H(X) = -Σ_{x∈Ω} p(x) log2 p(x) (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)

Another definition: for H(Y|X = x), we can use the single-variable definition (x ~ a constant):
  H(Y|X) = Σ_{x∈Ω} p(x) H(Y|X = x) =
         = Σ_{x∈Ω} p(x) (-Σ_{y∈Ψ} p(y|x) log2 p(y|x)) =
         = -Σ_{x∈Ω} Σ_{y∈Ψ} p(y|x) p(x) log2 p(y|x) =
         = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)

Properties of Entropy I

Entropy is non-negative:
■ H(X) ≥ 0
■ proof (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x)):
  ■ log2 p(x) is negative or zero for p(x) ≤ 1,
  ■ p(x) is non-negative; their product p(x) log2 p(x) is thus negative or zero,
  ■ a sum of non-positive numbers is non-positive,
  ■ and -f is non-negative for non-positive f
Chain rule:
■ H(X,Y) = H(Y|X) + H(X), as well as
■ H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II

Conditional entropy is better (than unconditional):
■ H(Y|X) ≤ H(Y)
■ H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  ■ equality iff X, Y are independent
  ■ (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ■ a concave function f over an interval (a,b): ∀x, y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  ■ a function f is convex if -f is concave
  ■ for proofs and generalizations, see Cover/Thomas
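To make the chain rule and the H(Y|X) ≤ H(Y) inequality concrete, here is a small numerical illustration in the same Python style; the toy joint distribution p_xy and its numbers are my own assumptions, not from the slides.

import math
from collections import defaultdict

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A toy joint distribution p(x, y); the numbers are arbitrary, just for illustration
p_xy = {("a", 0): 0.30, ("a", 1): 0.10,
        ("b", 0): 0.20, ("b", 1): 0.40}

# Marginals p(x) and p(y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

H_xy = entropy(p_xy)                          # joint entropy H(X,Y)
H_y_given_x = -sum(p * math.log2(p / p_x[x])  # H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)
                   for (x, y), p in p_xy.items() if p > 0)

print(abs(H_xy - (H_y_given_x + entropy(p_x))) < 1e-12)  # chain rule: H(X,Y) = H(Y|X) + H(X)
print(H_y_given_x <= entropy(p_y))                       # conditioning never increases entropy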