The Notion of Entropy
Essential Information Theory
PA154 Jazykové modelování (1.3)
Pavel Rychlý (pary@fi.muni.cz)
23 February, 2017
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

■ Entropy: "chaos", fuzziness, the opposite of order, ...
  ► you know it
  ► it is much easier to create a "mess" than to tidy things up
■ Comes from physics:
  ► entropy does not go down unless energy is used
■ Measure of uncertainty:
  ► if low ... low uncertainty
  ► the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula

■ Let $p_X(x)$ be a distribution of the random variable $X$
■ Basic outcomes (alphabet): $\Omega$
■ $H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x)$
■ Unit: bits (with the natural log instead of $\log_2$: nats)
■ Notation: $H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)$

Using the Formula: Example

■ Toss a fair coin: $\Omega = \{\text{head}, \text{tail}\}$
  ► $p(\text{head}) = 0.5$, $p(\text{tail}) = 0.5$
  ► $H(p) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 2 \times ((-0.5) \times (-1)) = 2 \times 0.5 = 1$
■ Take a fair, 32-sided die: $p(x) = \frac{1}{32}$ for every side $x$
  ► $H(p) = -\sum_{i=1}^{32} p(x_i) \log_2 p(x_i) = -32 \cdot p(x_1) \log_2 p(x_1)$ (since $p(x_i) = p(x_1) = \frac{1}{32}$ for all $i$) $= -32 \times (\frac{1}{32} \times (-5)) = 5$
  ► (now you see why it's called bits?)
■ Unfair coin:
  ► $p(\text{head}) = 0.2$ ... $H(p) \approx 0.722$
  ► $p(\text{head}) = 0.01$ ... $H(p) \approx 0.081$
■ (These values are reproduced in the first sketch at the end of these notes.)

Example: Book Availability

[figure: $H(p)$ plotted against $p(\text{Book Available})$, with the probability axis running from 0 to 1]

The Limits

■ When is $H(p) = 0$?
  ► if the result of the experiment is known ahead of time:
  ► necessarily: $\exists x \in \Omega: p(x) = 1$ and $\forall y \in \Omega, y \neq x: p(y) = 0$
■ Upper bound?
  ► none in general
  ► for $|\Omega| = n$: $H(p) \leq \log_2 n$
  ► nothing can be more uncertain than the uniform distribution

Entropy and Expectation

■ Recall: $E(X) = \sum_{x \in X(\Omega)} p_X(x) \cdot x$
■ Then: $E\!\left(\log_2 \frac{1}{p_X(x)}\right) = \sum_{x \in X(\Omega)} p_X(x) \log_2 \frac{1}{p_X(x)} = -\sum_{x \in X(\Omega)} p_X(x) \log_2 p_X(x) = H(p_X)$, which in our notation is $H(p)$

Perplexity: Motivation

■ Recall:
  ► 2 equiprobable outcomes: $H(p) = 1$ bit
  ► 32 equiprobable outcomes: $H(p) = 5$ bits
  ► 4.3 billion equiprobable outcomes: $H(p) = 32$ bits
■ What if the outcomes are not equiprobable?
  ► 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
  ► $H(p) = 1$ bit
  ► is there a measure of entropy (i.e. uncertainty / difficulty of prediction) that also allows comparing random variables with different numbers of outcomes?

Perplexity

■ Perplexity: $G(p) = 2^{H(p)}$
  ► the "equivalent" number of equiprobable outcomes (see the second sketch at the end of these notes)

Joint Entropy and Conditional Entropy

■ Joint entropy of two random variables $X$ and $Y$: $H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$
■ Conditional entropy: $H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$

Properties of Entropy I

■ Entropy is non-negative:
  ► $H(X) \geq 0$
  ► proof: (recall $H(X) = -\sum_{x \in \Omega} p(x) \log_2 p(x)$)
  ► $\log_2 p(x)$ is negative or zero for $p(x) \leq 1$,
  ► $p(x)$ is non-negative, so the product $p(x) \log_2 p(x)$ is negative or zero,
  ► a sum of non-positive numbers is non-positive,
  ► and $-f$ is non-negative for non-positive $f$
■ Chain rule:
  ► $H(X, Y) = H(Y \mid X) + H(X)$, as well as
  ► $H(X, Y) = H(X \mid Y) + H(Y)$ (since $H(Y, X) = H(X, Y)$)

Properties of Entropy II

■ Conditional entropy is better (than unconditional):
  ► $H(Y \mid X) \leq H(Y)$
■ $H(X, Y) \leq H(X) + H(Y)$ (follows from the previous (in)equalities)
  ► equality iff $X$, $Y$ are independent
  ► (recall: $X$, $Y$ independent iff $p(X, Y) = p(X)\,p(Y)$)
  ► (the third sketch at the end of these notes checks these inequalities on a small joint distribution)
■ $H(p)$ is concave (remember the book availability graph?)
  ► a function $f$ is concave over an interval $(a, b)$ iff $\forall x, y \in (a, b), \forall \lambda \in [0, 1]: f(\lambda x + (1 - \lambda) y) \geq \lambda f(x) + (1 - \lambda) f(y)$
  ► a function $f$ is convex if $-f$ is concave
■ For proofs and generalizations, see Cover/Thomas.
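
Sketch 1: The entropy formula (not part of the original slides). A minimal Python implementation of $H(p) = -\sum_x p(x) \log_2 p(x)$ that reproduces the numbers from "Using the Formula: Example"; the function name `entropy` and the convention that zero-probability outcomes contribute nothing are my own choices.

```python
import math

def entropy(probs):
    """H(p) = -sum_x p(x) * log2 p(x), in bits.
    Zero-probability outcomes are skipped (0 * log2 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
print(entropy([1/32] * 32))      # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))       # unfair coin, p(head) = .2:  ~0.722
print(entropy([0.01, 0.99]))     # unfair coin, p(head) = .01: ~0.081
print(entropy([1.0]))            # outcome known ahead of time: 0.0

# Upper bound check: H(p) <= log2 n, with equality only for the uniform distribution
print(entropy([0.7, 0.1, 0.1, 0.1]) <= math.log2(4))   # True
```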
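
Sketch 2: Perplexity. A short sketch of $G(p) = 2^{H(p)}$ showing how it answers the motivation question: it reports the "equivalent" number of equiprobable outcomes, so the 32-outcome distribution with only two equiprobable outcomes scores the same as a fair coin. Again an illustrative sketch with my own naming, not code from the course.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity G(p) = 2 ** H(p)."""
    return 2 ** entropy(probs)

print(perplexity([0.5, 0.5]))               # 2.0  (2 equiprobable outcomes)
print(perplexity([1/32] * 32))              # 32.0 (32 equiprobable outcomes)
# 32 possible outcomes, but only 2 equiprobable at 0.5 and the rest impossible:
print(perplexity([0.5, 0.5] + [0.0] * 30))  # 2.0, same as a fair coin
print(perplexity([0.2, 0.8]))               # ~1.65, easier to predict than a fair coin
```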
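
Sketch 3: Chain rule and the entropy inequalities. This sketch checks $H(Y \mid X) \leq H(Y)$ and $H(X, Y) \leq H(X) + H(Y)$ on a small joint distribution invented purely for illustration; the conditional entropy is obtained through the chain rule $H(Y \mid X) = H(X, Y) - H(X)$ rather than from its definition.

```python
import math

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint distribution p(x, y); X and Y are clearly dependent.
joint = {('a', 0): 0.4, ('a', 1): 0.1,
         ('b', 0): 0.1, ('b', 1): 0.4}

# Marginal distributions p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_xy = entropy(joint.values())   # joint entropy H(X, Y): ~1.722
H_x = entropy(p_x.values())      # H(X): 1.0
H_y = entropy(p_y.values())      # H(Y): 1.0
H_y_given_x = H_xy - H_x         # chain rule: H(Y|X) = H(X, Y) - H(X): ~0.722

print(f"H(Y|X) = {H_y_given_x:.3f} <= H(Y) = {H_y:.3f}")
print(f"H(X,Y) = {H_xy:.3f} <= H(X) + H(Y) = {H_x + H_y:.3f}")
# Equality in H(X,Y) <= H(X) + H(Y) would hold iff X and Y were independent.
```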