Cross Entropy
PA154 Language Modeling (1.4)
Pavel Rychlý, pary@fi.muni.cz
February 24, 2023
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

"Coding" Interpretation of Entropy
■ The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
■ Remember the various compression algorithms?
  ■ they do well on data with repeating (= easily predictable = low-entropy) patterns
  ■ their output, though, has high entropy ⇒ compressing compressed data does nothing

Coding: Example
■ How many bits do we need for ISO Latin 1?
  ■ the trivial answer: 8
■ Experience: some characters are more common, some (very) rare:
  ■ ...so what if we use more bits for the rare ones and fewer bits for the frequent ones? (be careful: we still want to decode (easily)!)
  ■ suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ 0.0004
  ■ code: 'a' → 00, 'b' → 01, 'c' → 10, rest: 11b₁b₂b₃b₄b₅b₆b₇b₈
  ■ code 'acbbecbaac' (here 'e' is one of the rare characters): 00 10 01 01 1100001111 10 01 00 00 10
  ■ number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
■ code length ≈ -log(probability)

Entropy of Language
■ Imagine that we produce the next letter using p(l_{n+1} | l_1, ..., l_n), where l_1, ..., l_n is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call l_1, ..., l_n the history h(h_{n+1}), and all histories H:
■ Then compute its entropy:
  H = -Σ_{h∈H} Σ_l p(l,h) log₂ p(l|h)   (summing over all letters l and all histories h)
■ Not very practical, is it?

Cross-Entropy
■ Typical case: we've got a series of observations T = {t_1, t_2, t_3, t_4, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (sample):
  ∀y ∈ Ω: p̂(y) = c(y)/|T|, where c(y) = |{t ∈ T; t = y}|
■ ...but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p̂ (instead of p)?
■ Idea: simulate the actual p by using a different T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data (a "random" difference))

Cross Entropy: The Formula
■ H_p'(p̂) = H(p') + D(p' || p̂)
  (where H_p'(p̂) = -Σ_{x∈Ω} p'(x) log₂ p̂(x))
■ p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̂
■ note on notation (confusing...): we test p̂ against p'; the subscript may also name the test data, i.e. H_T'(p̂)
■ (cross-)perplexity: G_p'(p̂) = G_T'(p̂) = 2^{H_p'(p̂)}

Conditional Cross Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x), ...
■ In practice: virtually always conditioning on context
■ Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω; "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  H_p'(p) = -Σ_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x)

Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the space(s) Ψ, Ω (especially for cross entropy!)
■ Use the following formula instead:
  H_p'(p) = -Σ_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x) = -(1/|T'|) Σ_{i=1..|T'|} log₂ p(y_i|x_i)
■ This is in fact the normalized log probability of the "test" data:
  H_p'(p) = -(1/|T'|) log₂ Π_{i=1..|T'|} p(y_i|x_i)

Computation Example
■ Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data): p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ Data (test): barb; so p'(a) = p'(r) = .25, p'(b) = .5
■ Sum over Ω:
  α:                  a    b    c   d  ...  q    r    s  ...  z
  -p'(α) log₂ p(α):  .5 + .5 + 0 + 0 + ... + 0 + 1.5 + 0 + ... + 0 = 2.5
■ Sum over the data: -(1/4)(log₂ p(b) + log₂ p(a) + log₂ p(r) + log₂ p(b)) = (1/4)(1 + 2 + 6 + 1) = 2.5
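The barb numbers above can be checked mechanically. The following Python sketch is not part of the original slides; the dictionaries and variable names are illustrative only. It computes H_p'(p) both ways shown on the slide: once by summing over the sample space Ω and once by averaging over the test data.

```python
import math

# Distribution p as assumed on the slide:
# p(a) = .25, p(b) = .5, p(x) = 1/64 for x in c..r, p(x) = 0 for s..z
p = {'a': 0.25, 'b': 0.5}
p.update({ch: 1 / 64 for ch in 'cdefghijklmnopqr'})
p.update({ch: 0.0 for ch in 'stuvwxyz'})

data = "barb"                                    # test data T'

# Empirical ("real world") distribution p' read off the test data
p_prime = {ch: data.count(ch) / len(data) for ch in set(data)}

# (1) Sum over the sample space:  H_p'(p) = -sum_x p'(x) * log2 p(x)
h_over_space = -sum(p_prime[x] * math.log2(p[x]) for x in p_prime)

# (2) Sum over the data:  H_p'(p) = -(1/|T'|) * sum_i log2 p(t_i)
h_over_data = -sum(math.log2(p[t]) for t in data) / len(data)

print(h_over_space, h_over_data)                 # both print 2.5
```

The equality of the two results is exactly the point of the "Sample Space vs. Data" slide: when p' is the empirical distribution of the test data, summing -p'(x) log₂ p(x) over Ω and averaging -log₂ p(t_i) over the data are the same computation.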
Cross Entropy: Some Observations
■ H(p) ?? <, =, > ?? H_p'(p): ALL three are possible!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, = 0 for the rest: s, t, u, v, w, x, y, z
  H(p) = 2.5 bits = H_p'(p) for p' taken from barb
■ Other data, probable: (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  H(p) < 4.25 bits = H_p'(p) for p' taken from probable
■ And finally, abba: (1/4)(2 + 1 + 1 + 2) = 1.5
  H(p) > 1.5 bits = H_p'(p) for p' taken from abba
■ But what about baby? -p'('y') log₂ p('y') = -.25 · log₂ 0 = ∞ (the cross-entropy blows up whenever the model assigns zero probability to something that occurs in the test data)

Cross Entropy: Usage
■ Comparing data?? NO! (we believe that we test on real data!)
■ Rather: comparing distributions (vs. real data)
■ Have (got) two distributions: p and q (on some Ω, X)
  ■ which one is better?
  ■ the better one has the lower cross-entropy (perplexity) on real data S
■ "Real" data: S
  H_S(p) = -(1/|S|) Σ_{i=1..|S|} log₂ p(y_i|x_i)   vs.   H_S(q) = -(1/|S|) Σ_{i=1..|S|} log₂ q(y_i|x_i)

Comparing Distributions
■ p(.) from the previous example: H_S(p) = 4.25
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, = 0 for the rest: s, t, u, v, w, x, y, z
■ q(.|.) (conditional; defined by a table of q(y|x), y = predicted letter, x = previous letter):

  q(y|x)    x: a    b    e    l    o    p     r    other
  y = a        0    .5   0    0    0    .125  0    0
  y = b        1    0    0    0    1    .125  0    0
  y = e        0    0    0    1    0    .125  0    0
  y = l        0    .5   0    0    0    .125  0    0
  y = o        0    0    0    0    0    .125  1    0
  y = p        0    0    0    0    0    .125  0    1
  y = r        0    0    0    0    0    .125  0    0
  y = other    0    0    1    0    0    .125  0    0

  ex.: q(o|r) = 1, q(r|p) = .125
■ On S = probable:
  H_S(q) = -(1/8)(log₂ q(p|oth.) + log₂ q(r|p) + log₂ q(o|r) + log₂ q(b|o) + log₂ q(a|b) + log₂ q(b|a) + log₂ q(l|b) + log₂ q(e|l))
         = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0)
  H_S(q) = .625
■ q has the (much) lower cross-entropy on S, so q is the better model of the data.
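To close the loop on this comparison, here is a short Python sketch that is not part of the original slides. It scores both models on S = probable with the per-bigram formula H_S(m) = -(1/|S|) Σ log₂ m(y_i|x_i): the unigram p simply ignores its context, and q is entered as a sparse map containing only the table entries needed for this test string. Names such as cross_entropy, p_model and q_model are illustrative.

```python
import math

def cross_entropy(model, data):
    """H_S(model) = -(1/|S|) * sum_i log2 model(y_i | x_i), over (previous letter, letter) pairs."""
    pairs = list(zip(['oth.'] + list(data[:-1]), data))   # context of the first letter is "other"
    return -sum(math.log2(model(y, x)) for x, y in pairs) / len(pairs)

# Unigram p from the slides; it ignores the context x entirely
p = {'a': 0.25, 'b': 0.5, **{ch: 1 / 64 for ch in 'cdefghijklmnopqr'}}
def p_model(y, x):
    return p[y]

# Bigram q: only the non-zero table entries needed for S = "probable"
q = {('oth.', 'p'): 1, ('p', 'r'): 0.125, ('r', 'o'): 1, ('o', 'b'): 1,
     ('b', 'a'): 0.5, ('a', 'b'): 1, ('b', 'l'): 0.5, ('l', 'e'): 1}
def q_model(y, x):
    return q[(x, y)]

S = "probable"
print(cross_entropy(p_model, S))   # 4.25 bits
print(cross_entropy(q_model, S))   # 0.625 bits -> q is the better model on S
```

Lower cross-entropy means the model assigns the test data a higher probability, which is precisely the comparison criterion stated on the Usage slide.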