"Coding" Interpretation of Entropy Cross Entropy PA154 Jazykové modelování (2.1) Pavel Rychlý pary@fi.muni.cz February 24, 2020 The least (average) number of bits needed to encode a message (string, sequence, series, ...) (each element having being a result of a random process with some distribution p): — H(p) Remember various compressing algorithms? ► they do well on data with repeating (= easily predictable = = low entropy) patterns ► their results though have high entropy => compressing compressed data does nothing Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.c5.jhu.edu/~hajic Coding: Example Entropy of Language How many bits do we need for ISO Latin 1? ► the trivial answer: 8 Experience: some chars are more common, some (very) rare: ► ... so what if we use more bits for the rare, and less bits for the frequent? (be careful: want to decode (easily)!) ► suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x)S ► code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11 fai£i2i>3i>41>51>6i>7iiB ► code 'acbbecbaac': 00 10 01 01 1100001111 10 01 00 00 10 acbb e cbaac ► number of bits used: 28 (vs. 80 using "naive" coding) code length ~ —log(probability) .0004 Imagine that we produce the next letter using p('n+i|'i, ■ ■ ■ In), where /i,... /„ is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call k,... /„ the history h(hn+i), and all histories H: Then compute its entropy: -£/ieH£/e,lp(','>)i°g2P('l'0 Not very practical, isn't it? Cross-Entropy Cross Entropy: The Formula (sample): Vy g ft : p(y) = Typical case: we've got series of observations T = {ti, t2, f3, t4,..., f„} (numbers, words, ...; ti 6 ft); estimate cM \T\ ' def. c(y) = \{t g T;t = y}\ ... but the true p is unknown; every sample is too small! Natural question: how well do we do using p (instead of p)? Idea: simulate actual p by using a different T (or rather: by using different observation we simulate the insufficiency of T vs. some other data ("random" difference)) Hp,(p) = H(p') + D(p'||p) HP'(P) = - Ex€S2 P'M |0S2 P(x) p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p note on notation (confusing . -op, also HT/(p) (Cross)Perplexity: Gp/(p) = GT,(p) = 2Hp'(r5) PA154 Jazykové modelováni (2.1) Conditional Cross Entropy Sample Space vs. Data So far: "unconditional" distribution(s) p(x), p'(x)... In practice: virtually always conditioning on context Interested in: sample space V, r.v. Y, y g V; context: sample space Q., r.v.X, xeQ: "our" distribution p(y|x), test against p'(y,x), which is taken from some independent data: - p'(y 1 + 2+6 + 1 10 1/|T'| (1/4) x 10 =2- Cross Entropy: Some Observations ■ H(p) ??<, =, >?? Hp/(p) : ALL! ■ Previous example: p(a) = .25, p(b) = .5, p(a)= ^ for a £ {c.r}, = 0 for the rest: s,t,u,v,w,x,y,z H(p) = 2.5bits = H(p')(barb) ■ Other data: probable: (|)(6 + 6 + 6 + 1+ 2 + 1 + 6 + 6) = 4.25 H(p) < 4.25b/ts = H(p')(probable) ■ And finally: abba: (i)(2 + 1 + 1 + 2) = 1.5 H(p) > l.bbits = H(p')(abba) ■ But what about: baby -p'(y)log2p(y) = -.25log20 = oo (??) Cross Entropy: Usage Comparing Distributions Comparing data?? ► NO! (we believe that we test on real data!) Rather: comparing distributions fvs. real data) Have (got) 2 distributions: p and q (on some ft,X) ► which is better? ► better: has lower cross-entropy (perplexity) on real data S "Real" data: S Hs(p) = -1/|S| £;=i..is| log2p(y;\x;) © Hs(q) = -1/|S| £ log2q(yi\xi) p(.) 
Comparing Distributions

p(.) from the previous example: H_S(p) = 4.25
    p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
q(.|.) (conditional; defined by a table):

    q(y|x)    x=a    x=b    x=e    x=l    x=o    x=p    x=r    x=other
    y=a        0     .5      0      0      0    .125     0       0
    y=b        1      0      0      0      1    .125     0       0
    y=e        0      0      0      1      0    .125     0       0
    y=l        0     .5      0      0      0    .125     0       0
    y=o        0      0      0      0      0    .125     1       0
    y=p        0      0      0      0      0    .125     0       1
    y=r        0      0      0      0      0    .125     0       0
    y=other    0      0      1      0      0    .125     0       0

    ex.: q(o|r) = 1, q(r|p) = .125

Test data S: probable
    H_S(q) = −(1/8)(log₂ q(p|other) + log₂ q(r|p) + log₂ q(o|r) + log₂ q(b|o) + log₂ q(a|b) + log₂ q(b|a) + log₂ q(l|b) + log₂ q(e|l))
           = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
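The same check works for the conditional case. A minimal sketch, assuming the table reconstruction above: q is stored as q[x][y], and OTHER and conditional_cross_entropy are illustrative names (the word boundary is treated as the "other" context, matching the q(p|other) term above).

```python
import math

# Bigram model q(y|x) from the table above, nonzero entries only.
# q[x][y] = q(y|x); OTHER stands for the "other" (here: word-boundary) context.
OTHER = '<other>'
q = {
    'a': {'b': 1.0},
    'b': {'a': 0.5, 'l': 0.5},
    'e': {OTHER: 1.0},
    'l': {'e': 1.0},
    'o': {'b': 1.0},
    'p': {y: 0.125 for y in list('abelopr') + [OTHER]},  # the uniform x=p column
    'r': {'o': 1.0},
    OTHER: {'p': 1.0},
}

def conditional_cross_entropy(model, pairs):
    """H_S(q) = -(1/|S|) * sum over (x_i, y_i) in S of log2 q(y_i|x_i)."""
    total = 0.0
    for x, y in pairs:
        prob = model.get(x, {}).get(y, 0.0)
        if prob == 0.0:
            return math.inf          # an unseen (context, letter) pair
        total += -math.log2(prob)
    return total / len(pairs)

# Test data S = "probable", read off as (context, next letter) pairs.
word = 'probable'
pairs = list(zip([OTHER] + list(word[:-1]), word))
print(conditional_cross_entropy(q, pairs))   # 0.625 bits
```

Since 0.625 << 4.25, q has the lower cross-entropy on the real data "probable" and is therefore the better distribution by the criterion of the "Usage" slide.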