Seminar 3 Algorithm 1 (Variable byte code) A number 𝑛 is encoded in variable byte code in the following procedure: 1. Take a binary representation of 𝑛 with padding to the length of a multiple of 7. 2. Split into of 7 bit blocks right-to-left. 3. Add 1 to the beginning of the last block and 0 to the beginning of all previous blocks. Example: 𝑉 𝐡(824) = 00000110 10111000 Definition 1 (𝛼 code) Unary code, also referred to as 𝛼 code, is a coding type where a number 𝑛 is represented by a sequence of 𝑛 1s (or 0s) and terminated with one 0 (or 1). That is, 6 in unary code is 1111110 (or 0000001). The alternative representation in parentheses is equivalent but for this course we use the default representation. Definition 2 (𝛾 code) 𝛾 code is a coding type, that consists of an offset and its length: 𝛾(𝑛) = 𝛼 (οΈ€ length of offset(𝑛) )οΈ€ ,offset(𝑛). Offset is a binary representation of a number 𝑛 without the highest bit (1). The length of this offset encoded in the unary (𝛼) code. Then the number 60 is encoded in 𝛾 as 111110,11100. Definition 3 (𝛿 code) A number 𝑛 is encoded in 𝛿 code in the following way: 𝛿(𝑛) = 𝛾 (οΈ€ length of offset(𝑛) )οΈ€ ,offset(𝑛). Analogously, 600 is encoded in 𝛿 as 1110,001,001011000. Definition 4 (Zipf’s law) Zipf’s law says that the 𝑖-th most frequent term has the frequency 1 𝑖 . In this exercise we use the dependence of the Zipf’s law 𝑐𝑓𝑖 ∝ 1 𝑖 = 𝑐𝑖 π‘˜ where 𝑐𝑓𝑖 is the number of terms 𝑑𝑖 in a given collection with π‘˜ = βˆ’1. Definition 5 (Heaps’ law) Heaps’ law expresses an empiric dependency of collection size (number of all words) 𝑇 and vocabulary size (number of distinct words) 𝑀 by 𝑀 = π‘˜π‘‡ 𝑏 where 30 ≀ π‘˜ ≀ 100 and 𝑏 β‰ˆ 1 2 . Exercise 3/1 Count variable byte code for the postings list ⟨777, 17 743, 294 068, 31 251 336⟩. Bear in mind that the gaps are encoded. Write in 8-bit blocks. Encode the list of gaps ⟨777, 16 966, 276 325, 30 957 268⟩. Variable byte code of the gaps: βˆ™ 𝑉 𝐡(777) = 00000110 10001001 βˆ™ 𝑉 𝐡(16 966) = 00000001 00000100 11000110 βˆ™ 𝑉 𝐡(276 325) = 00010000 01101110 11100101 βˆ™ 𝑉 𝐡(30 957 268) = 00001110 01100001 00111101 11010100 Result: 𝑉 𝐡(⟨777, 17 743, 294 068, 31 251 336⟩) = 00000110 10001001 00000001 00000100 11000110 00010000 01101110 11100101 00001110 01100001 00111101 11010100 1 Exercise 3/2 Count 𝛾 and 𝛿 codes for the numbers 63 and 1023. According to the definition 2 it is necessary to count the offsets as binary representations without the highest bit 6310 = 1111112 and offset(63) = 11111. Offset length is encoded in 𝛼 as |11111| = 5 𝛼(5) = 111110. Finally, 𝛾(63) = 111110, 11111. Analogically for 1023. 102310 = 11111111112, offset is 111111111, its length is |111111111| = 9 𝛼(9) = 1111111110. Then 𝛾(1023) = 1111111110, 111111111. 𝛿 is a little more complicated. First we count the offset 63 = 11111 and its length |11111| = 5. The value of 5 we encode in 𝛾 so 𝛾(5) = 110, 01. By definition 3 we have 𝛿(63) = 110, 01, 11111. And finally, 𝛿(1023) = 1110, 001, 111111111. Exercise 3/3 Calculate the variable byte code, 𝛾 code and 𝛿 code of the postings list 𝑃 = [32, 160, 162]. Note that gaps are encoded. Include intermediate results (offsets, lengths). offset 32 = 00000 and 𝛼(|00000|) = 111110 𝛾(32) = 111110, 00000 offset 128 = 0000000 and 𝛼(|0000000|) = 11111110 𝛾(128) = 11111110, 0000000 offset 2 = 0 and 𝛼(|0|) = 10 𝛾(2) = 10, 0 𝛾(𝑃) = 11111000000111111100000000100 offset 32 = 00000 and 𝛾(|00000|) = 110, 01 𝛿(32) = 110, 01, 00000 offset 128 = 0000000 and 𝛾(|0000000|) = 110, 11 𝛿(128) = 110, 11, 0000000 offset 2 = 0 and 𝛾(|0|) = 0 𝛿(2) = 0, , 0 𝛿(𝑃) = 110010000011011000000000 Exercise 3/4 Consider a posting list with the following list of gaps 𝐺 = [4, 6, 1, 2048, 64, 248, 2, 130]. Using variable byte encoding, βˆ™ What is the largest gap in 𝐺 that you can encode in 1 byte? βˆ™ What is the largest gap in 𝐺 that you can encode in 2 bytes? βˆ™ How many bytes will 𝐺 occupy after encoding? βˆ™ 64 βˆ™ 2048 βˆ™ 11 2 Exercise 3/5 From the following sequence of 𝛾-encoded gaps, reconstruct first the gaps list and then the original postings list. Recall that the 𝛼 code encodes a number 𝑛 with 𝑛 1s followed by one 0. 1110001110101011111101101111011 [1110001, 11010, 101, 11111011011, 11011] [1001, 110, 11, 111011, 111] [9, 6, 3, 59, 7] [9, 15, 18, 77, 84] Exercise 3/6 What does the Zipf’s law say? Answers can vary. For official definition refer to the Manning book. Exercise 3/7 What does the Heaps’ law say? Answers can vary. For official definition refer to the Manning book. Exercise 3/8 A collection of documents contains 4 words: one, two, three, four of decreasing word frequencies 𝑓1, 𝑓2, 𝑓3 and 𝑓4. The total number of tokens in the collection is 5000. Assume that the Zipf’s law holds for this collection perfectly. What are the word frequencies? We use the Zipf’s law in Definition 4. The least frequent term is four, then three, two and the most frequent is one. Applying the Zipf’s law we get 𝑐𝑓1 + 𝑐𝑓2 + 𝑐𝑓3 + 𝑐𝑓4 = 5000 𝑐 Β· 1βˆ’1 + 𝑐 Β· 2βˆ’1 + 𝑐 Β· 3βˆ’1 + 𝑐 Β· 4βˆ’1 = 5000 𝑐 + 1 2 𝑐 + 1 3 𝑐 + 1 4 𝑐 = 5000 12𝑐 + 6𝑐 + 4𝑐 + 3𝑐 = 60000 25𝑐 = 60000 𝑐 = 2400 and, plugging in to the formula 𝑐𝑓𝑖 = π‘π‘–βˆ’1 , we obtain the term frequency values: 𝑐𝑓1 = 24001 1 = 2400 𝑐𝑓2 = 24001 2 = 1200 𝑐𝑓3 = 24001 3 = 800 𝑐𝑓4 = 24001 4 = 600 3 Exercise 3/9 How many distinct terms are expected in a document of 1,000,000 tokens? Use the Heaps’ law with parameters π‘˜ = 44 and 𝑏 = 0.5 By Definition 5, 44 Γ— 1,000,0000.5 = 44,000. 4