Seminar 3 Algorithm 1 (Variable byte code) A number 𝑛 is encoded in variable byte code in the following procedure: 1. Take a binary representation of 𝑛 with padding to the length of a multiple of 7. 2. Split into of 7 bit blocks right-to-left. 3. Add 1 to the beginning of the last block and 0 to the beginning of all previous blocks. Example: 𝑉 𝐡(824) = 00000110 10111000 Definition 1 (Unary code) Unary code, also referred to as 𝛼 code, is a coding type where a number 𝑛 is represented by a sequence of 𝑛 1s (or 0s) and terminated with one 0 (or 1). That is, 6 in unary code is 1111110 (or 0000001). The alternative representation in parentheses is equivalent but for this course we use the default representation. Definition 2 (𝛾 code) 𝛾 code is a coding type, that consists of an offset and its length: 𝛾(𝑛) = length of offset(𝑛) in 𝛼,offset(𝑛). Offset is a binary representation of a number 𝑛 without the highest bit (1). Length of this offset encoded in the unary (𝛼) code. Then the number 60 is encoded in 𝛾 as 111110,11100. Definition 3 (𝛿 code) A number 𝑛 is encoded in 𝛿 code in the following way. First calculate the offset of 𝑛 and the length of 𝑛 encode with 𝛾 code. Then add the offset of 𝑛. The final form is 𝛿(𝑛) = length of offset(𝑛) in 𝛾,offset(𝑛). Analogously, 600 is encoded in 𝛿 as 1110,001,001011000. Definition 4 (Zipf’s law) Zipf’s law says that the 𝑖-th most frequent term has the frequency 1 𝑖 . In this exercise we use the dependence of the Zipf’s law 𝑐𝑓𝑖 ∝ 1 𝑖 = 𝑐𝑖 π‘˜ where 𝑐𝑓𝑖 is the number of terms 𝑑𝑖 in a given collection with π‘˜ = βˆ’1. Definition 5 (Heaps’ law) Heaps’ law expresses an empiric dependency of collection size (number of all words) 𝑇 and vocabulary size (number of distinct words) 𝑀 by 𝑀 = π‘˜π‘‡ 𝑏 where 30 ≀ π‘˜ ≀ 100 and 𝑏 β‰ˆ 1 2 . Exercise 1 Count variable byte code for the postings list ⟨777, 17 743, 294 068, 31 251 336⟩. Bear in mind that the gaps are encoded. Write in 8-bit blocks. Exercise 2 Count 𝛾 and 𝛿 codes for the numbers 63 and 1023. 1 Exercise 3 A collection of documents contains 4 words: one, two, three, four of decreasing word frequencies 𝑓1, 𝑓2, 𝑓3 and 𝑓4. The total number of tokens in the collection is 5000. Assume that the Zipf’s law holds for this collection perfectly. What are the word frequencies? Exercise 4 How many distinct terms are expected in a document of 1,000,000 tokens? Use the Heaps’ law with parameters π‘˜ = 44 and 𝑏 = 0.5 2