Compression Term statistics Dictionary compression Postings compression
PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 5: Index compression
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-03-14
(compiled on 2024-03-08 15:01:09)
Sojka, IIR Group: PV211: Index compression 1 / 57
Compression Term statistics Dictionary compression Postings compression
Overview
1 Compression
2 Term statistics
3 Dictionary compression
4 Postings compression
Sojka, IIR Group: PV211: Index compression 2 / 57
Compression Term statistics Dictionary compression Postings compression
Roadmap
Today: index compression, and vector space model
Next week: the whole picture of complete search system,
scoring and ranking
Sojka, IIR Group: PV211: Index compression 3 / 57
Compression Term statistics Dictionary compression Postings compression
Take-away today
Motivation for compression in information retrieval systems
How can we compress the dictionary component of the
inverted index?
How can we compress the postings component of the inverted
index?
Term statistics: how are terms distributed in document
collections?
Sojka, IIR Group: PV211: Index compression 4 / 57
Compression Term statistics Dictionary compression Postings compression
Inverted index
For each term t, we store a list of all documents that contain t.
Brutus −→ 1 2 4 11 31 45 173 174
Caesar −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
...
dictionary postings ﬁle
Today:
How much space do we need for the dictionary?
How much space do we need for the postings ﬁle?
How can we compress them?
Sojka, IIR Group: PV211: Index compression 6 / 57
Compression Term statistics Dictionary compression Postings compression
Why compression? (in general)
Use less disk space (saves money).
Keep more stuﬀ in memory (increases speed).
Increase speed of transferring data from disk to memory
(again, increases speed).
[read compressed data and decompress in memory]
is faster than
[read uncompressed data]
Premise: Decompression algorithms are fast.
This is true of the decompression algorithms we will use.
Sojka, IIR Group: PV211: Index compression 7 / 57
Compression Term statistics Dictionary compression Postings compression
Why compression in information retrieval?
First, we will consider space for dictionary:
Main motivation for dictionary compression: make it small
enough to keep in main memory.
Then for the postings ﬁle
Motivation: reduce disk space needed, decrease time needed to
read from disk.
Note: Large search engines keep signiﬁcant part of postings in
memory.
We will devise various compression schemes for dictionary and
postings.
Sojka, IIR Group: PV211: Index compression 8 / 57
Compression Term statistics Dictionary compression Postings compression
Lossy vs. lossless compression
Lossy compression: Discard some information
Several of the preprocessing steps we frequently use can be
viewed as lossy compression:
downcasing, stop words, porter, number elimination
Lossless compression: All information is preserved.
What we mostly do in index compression
Sojka, IIR Group: PV211: Index compression 9 / 57
Compression Term statistics Dictionary compression Postings compression
Model collection: The Reuters collection
symbol statistic value
N documents 800,000
L avg. # word tokens per document 200
M word types 400,000
avg. # bytes per word token (incl. spaces/punct.) 6
avg. # bytes per word token (without spaces/punct.) 4.5
avg. # bytes per word type 7.5
T non-positional postings 100,000,000
Sojka, IIR Group: PV211: Index compression 11 / 57
Compression Term statistics Dictionary compression Postings compression
Eﬀect of preprocessing for Reuters
word types non-positional positional postings
(terms) postings (word tokens)
size of dictionary non-positional index positional index
size ∆cml size ∆ cml size ∆cml
unﬁltered 484,494 109,971,179 197,879,290
no numbers 473,723 -2 -2 100,680,242 -8 -8 179,158,204 -9 -9
case folding 391,523 -17 -19 96,969,056 -3 -12 179,158,204 -0 -9
30 stopw’s 391,493 -0 -19 83,390,443 -14 -24 121,857,825 -31 -38
150 stopw’s 391,373 -0 -19 67,001,847 -30 -39 94,516,599 -47 -52
stemming 322,383 -17 -33 63,812,300 -4 -42 94,516,599 -0 -52
Explain diﬀerences between numbers non-positional vs positional:
−3 vs 0, −14 vs −31, −30 vs −47, −4 vs 0
Sojka, IIR Group: PV211: Index compression 12 / 57
Compression Term statistics Dictionary compression Postings compression
How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume there is an upper bound?
Not really: At least 7020 ≈ 1037 diﬀerent words of length 20.
The vocabulary will keep growing with collection size.
Heaps’ law: M = kTb
M is the size of the vocabulary, T is the number of tokens in
the collection.
Typical values for the parameters k and b are: 30 ≤ k ≤ 100
and b ≈ 0.5.
Heaps’ law is linear in log-log space.
It is the simplest possible relationship between collection size
and vocabulary size in log-log space.
Empirical law
Sojka, IIR Group: PV211: Index compression 13 / 57
Compression Term statistics Dictionary compression Postings compression
Heaps’ law for Reuters
0 2 4 6 8
0123456
log10 T
log10M
Vocabulary size M as a
function of collection size
T (number of tokens) for
Reuters-RCV1. For these
data, the dashed line
log10 M =
0.49 ∗ log10 T + 1.64 is the
best least squares ﬁt.
Thus, M = 101.64
T0.49
and k = 101.64
≈ 44 and
b = 0.49.
Sojka, IIR Group: PV211: Index compression 14 / 57
Compression Term statistics Dictionary compression Postings compression
Empirical ﬁt for Reuters
Good, as we just saw in the graph.
Example: for the ﬁrst 1,000,020 tokens Heaps’ law predicts
38,323 terms:
44 × 1,000,0200.49
≈ 38,323
The actual number is 38,365 terms, very close to the
prediction.
Empirical observation: ﬁt is good in general.
Sojka, IIR Group: PV211: Index compression 15 / 57
Compression Term statistics Dictionary compression Postings compression
Exercise
1 What is the eﬀect of including spelling errors vs. automatically
correcting spelling errors on Heaps’ law?
2 Compute vocabulary size M
Looking at a collection of web pages, you ﬁnd that there are
3,000 diﬀerent terms in the ﬁrst 10,000 tokens and
30,000 diﬀerent terms in the ﬁrst 1,000,000 tokens.
Assume a search engine indexes a total of 20,000,000,000
(2 × 1010
) pages, containing 200 tokens on average
What is the size of the vocabulary of the indexed collection as
predicted by Heaps’ law?
Sojka, IIR Group: PV211: Index compression 16 / 57
Compression Term statistics Dictionary compression Postings compression
Zipf’s law
Now we have characterized the growth of the vocabulary in
collections.
We also want to know how many frequent vs. infrequent
terms we should expect in a collection.
In natural language, there are a few very frequent terms and
very many very rare terms.
Zipf’s law: The ith most frequent term has frequency cfi
proportional to 1/i.
cfi ∝ 1
i
cfi is collection frequency: the number of occurrences of the
term ti in the collection.
Sojka, IIR Group: PV211: Index compression 17 / 57
Compression Term statistics Dictionary compression Postings compression
Zipf’s law
Zipf’s law: The ith most frequent term has frequency
proportional to 1/i.
cfi ∝ 1
i
cf is collection frequency: the number of occurrences of the
term in the collection.
So if the most frequent term (the) occurs cf1 times, then the
second most frequent term (of) has half as many occurrences
cf2 = 1
2cf1 . . .
. . . and the third most frequent term (and) has a third as
many occurrences cf3 = 1
3 cf1, etc.
Equivalent: cfi = cik and log cfi = log c + k log i (for k = −1)
Example of a power law
Sojka, IIR Group: PV211: Index compression 18 / 57
Compression Term statistics Dictionary compression Postings compression
Zipf’s law for Reuters
0 1 2 3 4 5 6 7
01234567
log10 rank
log10cf
Fit is not great. What
is important is the
key insight: Few frequent
terms, many
rare terms.
Sojka, IIR Group: PV211: Index compression 19 / 57
Compression Term statistics Dictionary compression Postings compression
Dictionary compression
The dictionary is small compared to the postings ﬁle.
But we want to keep it in memory.
Also: competition with other applications, cell phones,
onboard computers, fast startup time
So compressing the dictionary is important.
Sojka, IIR Group: PV211: Index compression 21 / 57
Compression Term statistics Dictionary compression Postings compression
Recall: Dictionary as array of ﬁxed-width entries
term document
frequency
pointer to
postings list
a 656,265 −→
aachen 65 −→
. . . . . . . . .
zulu 221 −→
space needed: 20 bytes 4 bytes 4 bytes
Space for Reuters: (20+4+4)*400,000 = 11.2 MB
Sojka, IIR Group: PV211: Index compression 22 / 57
Compression Term statistics Dictionary compression Postings compression
Fixed-width entries are bad.
Most of the bytes in the term column are wasted.
We allot 20 bytes for terms of length 1.
We cannot handle hydrochlorofluorocarbons and
supercalifragilisticexpialidocious
Average length of a term in English: 8 characters (or a little
bit less)
How can we use on average 8 characters per term?
Sojka, IIR Group: PV211: Index compression 23 / 57
Compression Term statistics Dictionary compression Postings compression
Dictionary as a string
. . . syst i l esyzyget i csyzygial syzygyszaibe ly i teszec inszono. . .
freq.
9
92
5
71
12
. . .
4 bytes
postings ptr.
→
→
→
→
→
. . .
4 bytes
term ptr.
3 bytes
. . .
Sojka, IIR Group: PV211: Index compression 24 / 57
Compression Term statistics Dictionary compression Postings compression
Space for dictionary as a string
4 bytes per term for frequency
4 bytes per term for pointer to postings list
8 bytes (on average) for term in string
3 bytes per pointer into string (need log2 8 · 400,000 < 24 bits
to resolve 8 · 400,000 positions)
Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to
11.2 MB for ﬁxed-width array)
Sojka, IIR Group: PV211: Index compression 25 / 57
Compression Term statistics Dictionary compression Postings compression
Dictionary as a string with blocking
. . . 7 sys t i l e9 syzyget i c8 syzyg i a l 6 syzygy11sza i be l y i te6 szec i n. . .
freq.
9
92
5
71
12
. . .
postings ptr.
→
→
→
→
→
. . .
term ptr.
. . .
Sojka, IIR Group: PV211: Index compression 26 / 57
Compression Term statistics Dictionary compression Postings compression
Space for dictionary as a string with blocking
Example block size k = 4
Where we used 4 × 3 bytes for term pointers without blocking
. . .
. . . we now use 3 bytes for one pointer plus 4 bytes for
indicating the length of each term.
We save 12 − (3 + 4) = 5 bytes per block.
Total savings: 400,000/4 ∗ 5 = 0.5 MB
This reduces the size of the dictionary from 7.6 MB to
7.1 MB.
Sojka, IIR Group: PV211: Index compression 27 / 57
Compression Term statistics Dictionary compression Postings compression
Lookup of a term without blocking
aid
box
den
ex
job
ox
pit
win
Sojka, IIR Group: PV211: Index compression 28 / 57
Compression Term statistics Dictionary compression Postings compression
Lookup of a term with blocking: (slightly) slower
aid box den ex
job ox pit win
Sojka, IIR Group: PV211: Index compression 29 / 57
Compression Term statistics Dictionary compression Postings compression
Front coding
One block in blocked compression (k = 4) . . .
8 a u t o m a t a 8 a u t o m a t e 9 a u t o m a t i c 10 a u t o m a t i o n
⇓
. . . further compressed with front coding.
8 a u t o m a t ∗ a 1 ⋄ e 2 ⋄ i c 3 ⋄ i o n
Sojka, IIR Group: PV211: Index compression 30 / 57
Compression Term statistics Dictionary compression Postings compression
Dictionary compression for Reuters: Summary
data structure size in MB
dictionary, ﬁxed-width 11.2
dictionary, term pointers into string 7.6
∼, with blocking, k = 4 7.1
∼, with blocking & front coding 5.9
Sojka, IIR Group: PV211: Index compression 31 / 57
Compression Term statistics Dictionary compression Postings compression
Exercise
Which preﬁxes should be used for front coding? What are the
tradeoﬀs?
Input: list of terms (= the term vocabulary)
Output: list of preﬁxes that will be used in front coding
Sojka, IIR Group: PV211: Index compression 32 / 57
Compression Term statistics Dictionary compression Postings compression
Postings compression
The postings ﬁle is much larger than the dictionary, factor of
at least 10.
Key desideratum: store each posting compactly
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per
docID when using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per
docID.
Our goal: use a lot less than 20 bits per docID.
Sojka, IIR Group: PV211: Index compression 34 / 57
Compression Term statistics Dictionary compression Postings compression
Key idea: Store gaps instead of docIDs
Each postings list is ordered in increasing order of docID.
Example postings list: computer: 283154, 283159,
283202, . . .
It suﬃces to store gaps: 283159 − 283154 = 5,
283202 − 283159 = 43
Example postings list using gaps: computer: 283154, 5,
43, . . .
Gaps for frequent terms are small.
Thus: We can encode small gaps with fewer than 20 bits.
Sojka, IIR Group: PV211: Index compression 35 / 57
Compression Term statistics Dictionary compression Postings compression
Gap encoding
encoding postings list
the docIDs . . . 283042 283043 283044 283045 . . .
gaps 1 1 1 . . .
computer docIDs . . . 283047 283154 283159 283202 . . .
gaps 107 5 43 . . .
arachnocentric docIDs 252000 500100
gaps 252000 248100
Sojka, IIR Group: PV211: Index compression 36 / 57
Compression Term statistics Dictionary compression Postings compression
Variable length encoding
Aim:
For arachnocentric and other rare terms, we will use
about 20 bits per gap (= posting).
For the and other very frequent terms, we will use only a few
bits per gap (= posting).
In order to implement this, we need to devise some form of
variable length encoding.
Variable length encoding uses few bits for small gaps and
many bits for large gaps.
Sojka, IIR Group: PV211: Index compression 37 / 57
Compression Term statistics Dictionary compression Postings compression
Variable byte (VB) code
Used by many commercial/research systems
Good low-tech blend of variable-length coding and sensitivity
to alignment matches (bit-level codes, see later).
Dedicate 1 bit (high bit) to be a continuation bit c.
If the gap G ﬁts within 7 bits, binary-encode it in the 7
available bits and set c = 1.
Else: encode lower-order 7 bits and then use one or more
additional bytes to encode the higher order bits using the
same algorithm.
At the end set the continuation bit of the last byte to 1
(c = 1) and of the other bytes to 0 (c = 0).
Sojka, IIR Group: PV211: Index compression 38 / 57
Compression Term statistics Dictionary compression Postings compression
VB code examples
docIDs 824 829 215406
gaps 5 214577
VB code 00000110 10111000 10000101 00001101 00001100 10110001
Sojka, IIR Group: PV211: Index compression 39 / 57
Compression Term statistics Dictionary compression Postings compression
VB code encoding algorithm
VBEncodeNumber(n)
1 bytes ←
2 while true
3 do Prepend(bytes, n mod 128)
4 if n < 128
5 then Break
6 n ← n div 128
7 bytes[Length(bytes)] += 128
8 return bytes
VBEncode(numbers)
1 bytestream ←
2 for each n ∈ numbers
3 do bytes ← VBEncodeNumber(n)
4 bytestream ← Extend(bytestream, bytes)
5 return bytestream
Sojka, IIR Group: PV211: Index compression 40 / 57
Compression Term statistics Dictionary compression Postings compression
VB code decoding algorithm
VBDecode(bytestream)
1 numbers ←
2 n ← 0
3 for i ← 1 to Length(bytestream)
4 do if bytestream[i] < 128
5 then n ← 128 × n + bytestream[i]
6 else n ← 128 × n + (bytestream[i] − 128)
7 Append(numbers, n)
8 n ← 0
9 return numbers
Sojka, IIR Group: PV211: Index compression 41 / 57
Compression Term statistics Dictionary compression Postings compression
Other variable codes
Instead of bytes, we can also use a diﬀerent “unit of
alignment”: 32 bits (words), 16 bits, 4 bits (nibbles), etc.
Variable byte alignment wastes space if you have many small
gaps – nibbles do better on those.
There is work on word-aligned codes that eﬃciently “pack” a
variable number of gaps into one word – see resources at the
end
Sojka, IIR Group: PV211: Index compression 42 / 57
Compression Term statistics Dictionary compression Postings compression
Codes for gap encoding
You can get even more compression with another type of
variable length encoding: bitlevel code.
Gamma code is the best known of these.
First, we need unary code to be able to introduce gamma
code.
Unary code
Represent n as n 1s with a ﬁnal 0.
Unary code for 3 is 1110
Unary code for 1 is 10, for 0 is 0, for 30 is
1111111111111111111111111111110
Sojka, IIR Group: PV211: Index compression 43 / 57
Compression Term statistics Dictionary compression Postings compression
Gamma code
Represent a gap G as a pair of length and oﬀset.
Oﬀset is the gap in binary, with the leading bit chopped oﬀ.
For example 13 → 1101 → 101 = oﬀset
Length is the length of oﬀset.
For 13 (oﬀset 101), this is 3.
Encode length in unary code: 1110.
Gamma code of 13 is the concatenation of length and oﬀset:
1110101.
Sojka, IIR Group: PV211: Index compression 44 / 57
Compression Term statistics Dictionary compression Postings compression
Another Gamma code (γ) examples
number unary code length oﬀset γ code
0 0
1 10 0 0
2 110 10 0 10,0
3 1110 10 1 10,1
4 11110 110 00 110,00
9 1111111110 1110 001 1110,001
13 1110 101 1110,101
24 11110 1000 11110,1000
511 111111110 11111111 111111110,11111111
1025 11111111110 0000000001 11111111110,0000000001
Sojka, IIR Group: PV211: Index compression 45 / 57
Compression Term statistics Dictionary compression Postings compression
The universal coding of the integers: Elias codes
☞ unary code α(N) = 11 . . . 1
N
0. α(4) = 11110
☞ binary code β(1) = 1, β(2N + j) = β(N)j, j = 0, 1. β(4) = 100
☞ β is not uniquely decodable (it is not a preﬁx code).
☞ ternary τ(N) = β(N)#. τ(4) = 100#
☞ β′
(1) = ǫ, β′
(2N) = β′
(N)0, β′
(2N + 1) = β′
(N)1,
τ′
(N) = β′
(N)#. β′
(4) = 00.
☞ γ(N) = α|β′
(N)|β′
(N). γ(4) = 11000
☞ alternatively, γ′
: every bit β′
(N) is inserted between a pair from
α(|β′
(N)|). the same length as γ (bit permutation γ(N)), but less
readable
☞ example: γ′
(4) = 10100
☞ Cγ = {γ(N) : N > 0} = (1{0, 1})∗
0 is regular and therefore it is
decodable by ﬁnite automaton.
Sojka, IIR Group: PV211: Index compression 46 / 57
Compression Term statistics Dictionary compression Postings compression
Elias codes: gamma, delta, omega: formal deﬁnitions II
☞ δ(N) = γ(|β(N)|)β′(N)
☞ example: δ(4) = γ(3)00 = 01100
☞ decoder δ: δ(?) = 1001?
☞ ω:
K := 0;
while ⌊log2(N)⌋ > 0 do
begin K := β(N)K;
N := ⌊log2(N)⌋
end.
Sojka, IIR Group: PV211: Index compression 47 / 57
Compression Term statistics Dictionary compression Postings compression
Exercise
Compute the variable byte code of 130
Compute the gamma code of 130
Compute δ(42)
Sojka, IIR Group: PV211: Index compression 48 / 57
Compression Term statistics Dictionary compression Postings compression
Length of gamma code
The length of oﬀset is ⌊log2 G⌋ bits.
The length of length is ⌊log2 G⌋ + 1 bits,
So the length of the entire code is 2 × ⌊log2 G⌋ + 1 bits.
γ codes are always of odd length.
Gamma codes are within a factor of 2 of the optimal encoding
length log2 G.
(assuming the frequency of a gap G is proportional to log2 G –
only approximately true)
Sojka, IIR Group: PV211: Index compression 49 / 57
Compression Term statistics Dictionary compression Postings compression
Gamma code: Properties
Gamma code is preﬁx-free: a valid code word is not a preﬁx of
any other valid code.
Encoding is optimal within a factor of 3 (and within a factor
of 2 making additional assumptions).
This result is independent of the distribution of gaps!
We can use gamma codes for any distribution. Gamma code
is universal.
Gamma code is parameter-free.
Sojka, IIR Group: PV211: Index compression 50 / 57
Compression Term statistics Dictionary compression Postings compression
Gamma codes: Alignment
Machines have word boundaries – 8, 16, 32 bits
Compressing and manipulating at granularity of bits can be
slow.
Variable byte encoding is aligned and thus potentially more
eﬃcient.
Another word aligned scheme: Anh and Moﬀat 2005
Regardless of eﬃciency, variable byte is conceptually simpler
at little additional space cost.
Sojka, IIR Group: PV211: Index compression 51 / 57
Compression Term statistics Dictionary compression Postings compression
Compression of Reuters
data structure size in MB
dictionary, ﬁxed-width 11.2
dictionary, term pointers into string 7.6
∼, with blocking, k = 4 7.1
∼, with blocking & front coding 5.9
collection (text, xml markup, etc.) 3600.0
collection (text) 960.0
T/D incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ encoded 101.0
Sojka, IIR Group: PV211: Index compression 52 / 57
Compression Term statistics Dictionary compression Postings compression
Term-document incidence matrix
Anthony Julius The Hamlet Othello Macbeth . . .
and Caesar Tempest
Cleopatra
Anthony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
worser 1 0 1 1 1 0
. . .
Entry is 1 if term occurs.
Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if term does not occur.
Example: Calpurnia doesn’t occur in The tempest.
Sojka, IIR Group: PV211: Index compression 53 / 57
Compression Term statistics Dictionary compression Postings compression
Compression of Reuters
data structure size in MB
dictionary, ﬁxed-width 11.2
dictionary, term pointers into string 7.6
∼, with blocking, k = 4 7.1
∼, with blocking & front coding 5.9
collection (text, xml markup, etc.) 3600.0
collection (text) 960.0
T/D incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ encoded 101.0
Sojka, IIR Group: PV211: Index compression 54 / 57
Compression Term statistics Dictionary compression Postings compression
Summary
We can now create an index for highly eﬃcient Boolean
retrieval that is very space eﬃcient.
Only 4% of the total size of the collection.
Only 10–15% of the total size of the text in the collection.
However, we’ve ignored positional and frequency information.
For this reason, space savings are less in reality.
Sojka, IIR Group: PV211: Index compression 55 / 57
Compression Term statistics Dictionary compression Postings compression
Take-away today
Motivation for compression in information retrieval systems
How can we compress the dictionary component of the
inverted index?
How can we compress the postings component of the inverted
index?
Term statistics: how are terms distributed in document
collections?
Sojka, IIR Group: PV211: Index compression 56 / 57
Compression Term statistics Dictionary compression Postings compression
Resources
http://ske.fi.muni.cz
Chapter 5 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/
and http://cislmu.org, materials in MU IS and FI MU
library
Original publication on word-aligned binary codes by Anh and
Moﬀat (2005); also: Anh and Moﬀat (2006a).
Original publication on variable byte codes by Scholer,
Williams, Yiannis and Zobel (2002).
More details on compression (including compression of
positions and frequencies) in Zobel and Moﬀat (2006).
Sojka, IIR Group: PV211: Index compression 57 / 57