The Notion of Entropy
Essential Information Theory
PA154 Jazykové modelování (1.3)
Pavel Rychlý, pary@fi.muni.cz
February 17, 2020
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Entropy
■ Entropy — "chaos", fuzziness, opposite of order, ...
  ► you know it
  ► it is much easier to create a "mess" than to tidy things up
■ Comes from physics:
  ► entropy does not go down unless energy is used
■ Measure of uncertainty:
  ► if low ... low uncertainty
  ► the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula
■ Let p_X(x) be a distribution of random variable X
■ Basic outcomes (alphabet): Ω
■ Entropy: H(X) = -Σ_{x∈Ω} p(x) log₂ p(x)
■ Unit: bits (with the natural logarithm: nats)
■ Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

Using the Formula: Example
(the values below are recomputed in the first sketch at the end of this section)
■ Toss a fair coin: Ω = {head, tail}
  ► p(head) = .5, p(tail) = .5
  ► H(p) = -0.5 log₂(0.5) + (-0.5 log₂(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
■ Take a fair, 32-sided die: p(x) = 1/32 for every side x
  ► H(p) = -Σ_{i=1..32} p(xᵢ) log₂ p(xᵢ) = -32 (p(x₁) log₂ p(x₁))   (since p(xᵢ) = p(x₁) = 1/32 for all i)
    = -32 × (1/32 × (-5)) = 5   (now you see why it's called bits?)
■ Unfair coin:
  ► p(head) = .2 ... H(p) = .722
  ► p(head) = .01 ... H(p) = .081

The Limits
■ When is H(p) = 0?
  ► if the result of an experiment is known ahead of time
  ► necessarily: ∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
■ Upper bound?
  ► none in general
  ► for |Ω| = n: H(p) ≤ log₂ n
  ► nothing can be more uncertain than the uniform distribution

Example: Book Availability
[Figure: entropy H(p) as a function of p(Book Available), 0 ≤ p ≤ 1 — zero at the endpoints, maximal at p = 0.5]

Entropy and Expectation
■ Recall: E(X) = Σ_{x∈X(Ω)} p_X(x) · x
■ Then: E(log₂(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log₂(1/p_X(x)) = -Σ_{x∈X(Ω)} p_X(x) log₂ p_X(x) = H(p_X) =_notation H(p)

Perplexity: Motivation
■ Recall:
  ► 2 equiprobable outcomes: H(p) = 1 bit
  ► 32 equiprobable outcomes: H(p) = 5 bits
  ► 4.3 billion equiprobable outcomes: H(p) = 32 bits
■ What if the outcomes are not equiprobable?
  ► 32 outcomes, 2 equiprobable at 0.5, rest impossible: H(p) = 1 bit
  ► is there a measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) for random variables with different numbers of outcomes?

Perplexity
■ Perplexity: G(p) = 2^H(p)
  ► i.e. the entropy expressed as the equivalent number of equiprobable outcomes

Joint Entropy and Conditional Entropy
■ Two random variables: X (over Ω), Y (over Ψ)
■ Joint entropy: H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log₂ p(x,y)
■ Conditional entropy: H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log₂ p(y|x)

Properties of Entropy I
■ Entropy is non-negative: H(X) ≥ 0
  ► proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log₂ p(x))
  ► log₂ p(x) is negative or zero for p(x) ≤ 1,
  ► p(x) is non-negative; their product p(x) log₂ p(x) is thus negative or zero,
  ► a sum of non-positive numbers is non-positive,
  ► and -f is non-negative for non-positive f
■ Chain rule (checked numerically in the second sketch at the end of this section):
  ► H(X,Y) = H(Y|X) + H(X), as well as
  ► H(X,Y) = H(X|Y) + H(Y)   (since H(Y,X) = H(X,Y))

Properties of Entropy II
■ Conditional entropy is better (than unconditional):
  ► H(Y|X) ≤ H(Y)
■ H(X,Y) ≤ H(X) + H(Y)   (follows from the previous (in)equalities)
  ► equality iff X, Y independent
  ► (recall: X, Y independent iff p(X,Y) = p(X)p(Y))
■ H(p) is concave (remember the book availability graph?)
  ► concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  ► function f is convex if -f is concave
■ for proofs and generalizations, see Cover/Thomas
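
To make the entropy formula and the perplexity G(p) = 2^H(p) concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the coin and die examples above. The function names `entropy` and `perplexity` and the example distributions are illustrative choices, not anything defined in the lecture.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def perplexity(p):
    """G(p) = 2**H(p): the equivalent number of equiprobable outcomes."""
    return 2 ** entropy(p)

fair_coin   = [0.5, 0.5]
fair_die_32 = [1 / 32] * 32
unfair_a    = [0.2, 0.8]    # p(head) = .2
unfair_b    = [0.01, 0.99]  # p(head) = .01

print(entropy(fair_coin))                # 1.0 bit
print(entropy(fair_die_32))              # 5.0 bits
print(round(entropy(unfair_a), 3))       # 0.722
print(round(entropy(unfair_b), 3))       # 0.081
print(perplexity(fair_die_32))           # 32.0 -- back at the number of equiprobable outcomes
```

Note how the 32-outcome distribution with only two outcomes at 0.5 each would give H(p) = 1 bit and perplexity 2, matching the motivation slide.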
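
The chain rule H(X,Y) = H(Y|X) + H(X) and the inequality H(X,Y) ≤ H(X) + H(Y) can also be checked numerically. The sketch below is again not from the slides; the joint distribution `p_xy` is a hypothetical example chosen only for illustration.

```python
import math
from collections import defaultdict

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution p(x, y) over X in {'a', 'b'} and Y in {0, 1}.
p_xy = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}

# Marginals p(x) and p(y).
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x).
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

print(H(p_xy))               # joint entropy H(X,Y)
print(H_y_given_x + H(p_x))  # same value, by the chain rule
print(H(p_x) + H(p_y))       # >= H(X,Y); equal only if X and Y are independent
```

Running it shows H(X,Y) ≈ 1.846 from both the definition and the chain rule, while H(X) + H(Y) ≈ 1.971 is strictly larger because this X and Y are dependent.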