Essential Information Theory
PA154 Language Modeling (1.3)
Pavel Rychlý
pary@fi.muni.cz
February 16, 2023
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

The Notion of Entropy
■ Entropy - "chaos", fuzziness, opposite of order, ...
  ■ you know it
  ■ it is much easier to create "mess" than to tidy things up ...
■ Comes from physics:
  ■ entropy does not go down unless energy is used
■ Measure of uncertainty:
  ■ if low ... low uncertainty
  ■ the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula
■ Let p_X(x) be the distribution of the random variable X
■ Basic outcomes (alphabet) Ω
  ■ H(X) = -Σ_{x∈Ω} p(x) log2 p(x)
■ Unit: bits (with natural log: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
■ Toss a fair coin: Ω = {head, tail}
  ■ p(head) = .5, p(tail) = .5
  ■ H(p) = -0.5 log2(0.5) - 0.5 log2(0.5) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ■ H(p) = -Σ_{i=1..32} p(x_i) log2 p(x_i) = -32 × (p(x_i) log2 p(x_i)) (since p(x_i) = 1/32 for all i) = -32 × (1/32 × (-5)) = 5
  ■ (now you see why it's called bits?)
■ Unfair coin:
  ■ p(head) = .2 ... H(p) = .722
  ■ p(head) = .01 ... H(p) = .081

Example: Book Availability
[Figure: binary entropy H(p) plotted against p(Book Available) over [0, 1]; the curve is 0 at both ends and peaks at p = 0.5, with "bad bookstore" and "good bookstore" marked on it.]

The Limits
■ When is H(p) = 0?
  ■ if the result of an experiment is known ahead of time:
  ■ necessarily: ∃x ∈ Ω: p(x) = 1 & ∀y ∈ Ω: y ≠ x ⇒ p(y) = 0
■ Upper bound?
  ■ none in general
  ■ for |Ω| = n: H(p) ≤ log2 n
  ■ nothing can be more uncertain than the uniform distribution

Entropy and Expectation
■ Recall:
  ■ E(X) = Σ_{x∈X(Ω)} p_X(x) · x
■ Then:
  ■ E(log2(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log2(1/p_X(x)) = -Σ_{x∈X(Ω)} p_X(x) log2 p_X(x) = H(p_X) (notation: H(p))

Perplexity: motivation
■ Recall:
  ■ 2 equiprobable outcomes: H(p) = 1 bit
  ■ 32 equiprobable outcomes: H(p) = 5 bits
  ■ 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ■ 32 outcomes, 2 equiprobable at 0.5, rest impossible:
  ■ H(p) = 1 bit
  ■ is there a measure for comparing the entropy (i.e. uncertainty / difficulty of prediction) (also) for random variables with different numbers of outcomes?

Perplexity
■ Perplexity:
  ■ G(p) = 2^H(p)
  ■ ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
■ it is easier to imagine:
  ■ NLP example: the size of a vocabulary with a uniform distribution that would be equally hard to predict
■ the "wilder" (more biased) the distribution, the better:
  ■ lower entropy, lower perplexity
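A minimal sketch (not part of the original slides) of the definitions above: entropy in bits and perplexity G(p) = 2^H(p), checked against the coin and die examples. The helper names entropy and perplexity are illustrative.

```python
import math

def entropy(probs):
    """Entropy H(p) in bits of a distribution given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity G(p) = 2 ** H(p)."""
    return 2 ** entropy(probs)

print(entropy([0.5, 0.5]))                   # fair coin: 1.0 bit
print(entropy([1 / 32] * 32))                # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))                   # unfair coin: ~0.722 bits
print(perplexity([1 / 32] * 32))             # back to 32
print(perplexity([0.5, 0.5] + [0.0] * 30))   # 32 outcomes, only 2 possible: 2.0
```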
Joint Entropy and Conditional Entropy
■ Two random variables: X (space Ω), Y (space Ψ)
■ Joint entropy:
  ■ no big deal ((X,Y) considered a single event):
  ■ H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)
■ Conditional entropy:
  ■ H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)

Properties of Entropy I
■ Entropy is non-negative: H(X) ≥ 0
  ■ proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x))
  ■ log2 p(x) is negative or zero since p(x) ≤ 1,
  ■ p(x) is non-negative; their product p(x) log2 p(x) is thus negative or zero,
  ■ a sum of non-positive numbers is non-positive,
  ■ and -f is non-negative for non-positive f
■ Chain rule:
  ■ H(X,Y) = H(Y|X) + H(X), as well as
  ■ H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ■ H(Y|X) ≤ H(Y)
■ H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  ■ equality iff X, Y independent
  ■ (recall: X, Y are independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ■ a concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  ■ a function f is convex if -f is concave
  ■ for proofs and generalizations, see Cover/Thomas
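A small numeric check (not from the original slides) of the chain rule and the inequalities above on a made-up 2×2 joint distribution; the probability values are purely illustrative.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zero entries contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Purely illustrative joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y).
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

H_XY = H(p_xy.values())
H_X = H(p_x.values())
H_Y = H(p_y.values())
# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x).
H_Y_given_X = -sum(v * math.log2(v / p_x[x]) for (x, y), v in p_xy.items() if v > 0)

print(abs(H_XY - (H_Y_given_X + H_X)) < 1e-12)  # chain rule: H(X,Y) = H(Y|X) + H(X)
print(H_Y_given_X <= H_Y)                        # conditioning never increases entropy
print(H_XY <= H_X + H_Y)                         # subadditivity
```

Any other valid joint distribution should leave all three checks true; the last two become equalities only when X and Y are independent.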