Index Compression (Chapter 5)

Algorithm 1 (Variable byte code) A number n is encoded in variable byte code by the following procedure:
1. Take the binary representation of n, padded on the left so that its length is a multiple of 7.
2. Split it into 7-bit blocks, right to left.
3. Prepend 1 to the last block and 0 to all previous blocks.
Example: VB(824) = 00000110 10111000

Definition 1 (𝛼 code) Unary code, also referred to as 𝛼 code, is a coding scheme in which a number n is represented by a sequence of n 1s (or 0s) terminated with one 0 (or 1). That is, 6 in unary code is 1111110 (or 0000001). The alternative representation in parentheses is equivalent, but in this course we use the default representation.

Definition 2 (𝛾 code) 𝛾 code is a coding scheme that consists of an offset and its length: 𝛾(n) = 𝛼(length of offset(n)), offset(n). The offset is the binary representation of the number n without the highest bit (the leading 1). The length of this offset is encoded in the unary (𝛼) code. For example, the number 60 is encoded in 𝛾 as 111110,11100.

Definition 3 (𝛿 code) A number n is encoded in 𝛿 code as follows: 𝛿(n) = 𝛾(length of offset(n)), offset(n). Analogously, 600 is encoded in 𝛿 as 1110,001,001011000.

Definition 4 (Zipf's law) Zipf's law says that the i-th most frequent term has frequency proportional to 1/i. In this exercise we use the form cf_i ∝ 1/i, i.e. cf_i = c · i^k with k = βˆ’1, where cf_i is the number of occurrences of the term t_i in a given collection.

Definition 5 (Heaps' law) Heaps' law expresses an empirical dependency between the collection size T (the number of all tokens) and the vocabulary size M (the number of distinct words): M = k · T^b, where 30 ≀ k ≀ 100 and b β‰ˆ 1/2.

Exercise 5/1 Compute the variable byte code for the postings list ⟨777, 17 743, 294 068, 31 251 336⟩. Bear in mind that the gaps are encoded. Write the result in 8-bit blocks.

Exercise 5/2 Compute the 𝛾 and 𝛿 codes for the numbers 63 and 1023.

Exercise 5/3 Calculate the variable byte code, 𝛾 code, and 𝛿 code of the postings list P = [32, 160, 162]. Note that gaps are encoded. Include intermediate results (offsets, lengths).

Exercise 5/4 Consider a postings list with the following list of gaps G = [4, 6, 1, 2048, 64, 248, 2, 130]. Using variable byte encoding,
β€’ What is the largest gap in G that you can encode in 1 byte?
β€’ What is the largest gap in G that you can encode in 2 bytes?
β€’ How many bytes will G occupy after encoding?

Exercise 5/5 From the following sequence of 𝛾-encoded gaps, reconstruct first the gap list and then the original postings list. Recall that the 𝛼 code encodes a number n with n 1s followed by one 0.
1110001110101011111101101111011

Exercise 5/6 What does Zipf's law say?

Exercise 5/7 What does Heaps' law say?

Exercise 5/8 A collection of documents contains 4 words: one, two, three, four, with decreasing word frequencies f_1, f_2, f_3, and f_4. The total number of tokens in the collection is 5000. Assume that Zipf's law holds perfectly for this collection. What are the word frequencies?

Exercise 5/9 How many distinct terms are expected in a document of 1,000,000 tokens? Use Heaps' law with parameters k = 44 and b = 0.5.
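The codes defined above are easy to check mechanically. The following Python sketch is not part of the original exercise sheet; the function names (gaps, vb_encode, unary, offset, gamma_encode, delta_encode, gamma_decode) are assumptions introduced here, but the logic follows Algorithm 1 and Definitions 1–3, and the assertions reproduce the examples given in those definitions. It may be useful for verifying answers to Exercises 5/1–5/5.

```python
def gaps(postings):
    """Turn a postings list into the list of gaps that is actually encoded."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def vb_encode(n):
    """Variable byte code of a single gap n (Algorithm 1), as 8-bit blocks."""
    blocks = []
    while True:                       # split into 7-bit blocks, right to left
        blocks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    # prepend 1 to the last block and 0 to all previous blocks
    return ' '.join(('1' if i == len(blocks) - 1 else '0') + format(b, '07b')
                    for i, b in enumerate(blocks))


def unary(n):
    """Unary (alpha) code: n 1s followed by one 0 (Definition 1)."""
    return '1' * n + '0'


def offset(n):
    """Binary representation of n without the highest bit (Definition 2)."""
    return format(n, 'b')[1:]


def gamma_encode(n):
    """gamma(n) = unary(length of offset(n)), offset(n) (Definition 2)."""
    return unary(len(offset(n))) + offset(n)


def delta_encode(n):
    """delta(n) = gamma(length of offset(n)), offset(n) (Definition 3)."""
    return gamma_encode(len(offset(n))) + offset(n)


def gamma_decode(bits):
    """Split a concatenation of gamma codes back into numbers (Exercise 5/5)."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == '1':         # unary part: count 1s up to the first 0
            length += 1
            i += 1
        i += 1                        # skip the terminating 0
        numbers.append(int('1' + bits[i:i + length], 2))
        i += length
    return numbers


# Sanity checks against the examples in the definitions above.
assert vb_encode(824) == '00000110 10111000'
assert gamma_encode(60) == '111110' + '11100'
assert delta_encode(600) == '1110' + '001' + '001011000'
assert gamma_decode(gamma_encode(60) + gamma_encode(9)) == [60, 9]
```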
Scoring, Term Weighting, and the Vector Space Model (Chapter 6)

Definition 6 (Inverse document frequency) The inverse document frequency of a term t is defined as idf_t = log(N / df_t), where N is the number of all documents and df_t (the document frequency of t) is the number of documents that contain t.

Definition 7 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term t in a document d has weight tf-idf_{t,d} = tf_{t,d} · idf_t, where tf_{t,d} (the term frequency of t in d) is the number of occurrences of the term t in the document d.

Definition 8 (β„“2 (cosine) normalization) A vector v is cosine-normalized by v_j ← v_j / ||v|| = v_j / sqrt(Ξ£_{k=1}^{|v|} v_k^2), where v_j is the element at the j-th position of v.

Exercise 6/1 Consider the frequency table of the words of the three documents doc1, doc2, doc3 below. Calculate the tf-idf weight of the terms car, auto, insurance, and best for each document. The idf values of the terms are given in the table.

              doc1   doc2   doc3    idf
car             27      4     24   1.65
auto             3     33      0   2.08
insurance        0     33     29   1.62
best            14      0     17   1.5

Table 1: Term frequencies and idf values for Exercise 6/1.

Exercise 6/2 Compute document representations as Euclidean normalized weight vectors for each document from the previous exercise. Each vector has four components, one for each term.

Exercise 6/3 Based on the weights from the previous exercise, compute the similarity scores (scalar products) of the three documents for the query car insurance. Use each of the two weighting schemes for the query:
a) The term weight is 1 if the query contains the word and 0 otherwise.
b) Euclidean normalized tf-idf.
Please note that a document and a representation of that document are different things. The document is always fixed, but its representations may vary under different settings and conditions. In this exercise we fix the document representations from the previous exercises and compute the similarity scores between the query and the documents under two different representations of the query. It may be helpful to view the query as just another document, since it is also a sequence of words.
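As with the compression codes, the scoring computations can be checked mechanically. The sketch below is not part of the original sheet: the dictionaries simply restate Table 1, the helper names (tf_idf, normalize, doc_vector, score) are assumptions, and only query scheme (a) from Exercise 6/3 is wired up; scheme (b) would normalize a tf-idf vector of the query in the same way as the document vectors.

```python
import math

# Term frequencies and idf values restated from Table 1.
TF = {
    'car':       {'doc1': 27, 'doc2':  4, 'doc3': 24},
    'auto':      {'doc1':  3, 'doc2': 33, 'doc3':  0},
    'insurance': {'doc1':  0, 'doc2': 33, 'doc3': 29},
    'best':      {'doc1': 14, 'doc2':  0, 'doc3': 17},
}
IDF = {'car': 1.65, 'auto': 2.08, 'insurance': 1.62, 'best': 1.5}
TERMS = ['car', 'auto', 'insurance', 'best']
DOCS = ['doc1', 'doc2', 'doc3']


def tf_idf(term, doc):
    """tf-idf weight of Definition 7: raw term frequency times idf."""
    return TF[term][doc] * IDF[term]


def normalize(vector):
    """l2 (cosine) normalization of Definition 8."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector] if norm else vector


def doc_vector(doc):
    """Cosine-normalized tf-idf vector of a document (Exercise 6/2)."""
    return normalize([tf_idf(t, doc) for t in TERMS])


def score(query_vector, doc):
    """Similarity as the scalar product of query and document vectors (Exercise 6/3)."""
    return sum(q * w for q, w in zip(query_vector, doc_vector(doc)))


# Query "car insurance", scheme (a): weight 1 for words in the query, 0 otherwise.
query = [1 if t in ('car', 'insurance') else 0 for t in TERMS]
for doc in DOCS:
    print(doc, round(score(query, doc), 3))
```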