PV211: Introduction to Information Retrieval
https://www.fi.muni.cz/~sojka/PV211
IIR 4: Index construction
Handout version
Petr Sojka, Hinrich Schütze et al.
Faculty of Informatics, Masaryk University, Brno
Center for Information and Language Processing, University of Munich
2024-03-07

Overview
1 Introduction
2 BSBI algorithm
3 SPIMI algorithm
4 Distributed indexing
5 Dynamic indexing

Take-away
- Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
- Distributed index construction: MapReduce
- Dynamic index construction: how to keep the index up to date as the collection changes

Hardware basics
- Many design decisions in information retrieval are based on hardware constraints.
- We begin by reviewing the hardware basics that we will need in this course.

Hardware basics
- Access to data is much faster in memory than on disk (roughly a factor of 10 for SSDs, 100+ for rotational disks).
- Disk seeks are "idle" time: no data is transferred from disk while the disk head is being positioned.
- To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
- Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
- Assuming an efficient decompression algorithm, the total time of reading and then decompressing compressed data is usually less than the time of reading uncompressed data.
- Servers used in IR systems typically have many GB of main memory and TB of disk space.
- Fault tolerance is expensive: it is cheaper to use many regular machines than one fault-tolerant machine.

Some stats (ca. 2008)

symbol  statistic                                           value
s       average seek time                                   5 ms = 5 × 10⁻³ s
b       transfer time per byte                              0.02 µs = 2 × 10⁻⁸ s
        processor's clock rate                              10⁹ s⁻¹
p       low-level operation (e.g., compare & swap a word)   0.01 µs = 10⁻⁸ s
        size of main memory                                 several GB
        size of disk space                                  1 TB or more
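To see why seeks dominate, here is a small back-of-the-envelope calculation in Python using the ca. 2008 numbers above; the 10 MB payload and the 100-chunk split are illustrative assumptions, not figures from the slides.

```python
# Rough disk I/O cost model using the ca. 2008 stats above.
SEEK_S = 5e-3            # s: average seek time
PER_BYTE_S = 2e-8        # b: transfer time per byte

data_bytes = 10_000_000  # read 10 MB in total (illustrative)

# One contiguous read: a single seek, then one sequential transfer.
one_chunk = SEEK_S + data_bytes * PER_BYTE_S        # 0.005 + 0.2 = 0.205 s

# The same 10 MB scattered over 100 chunks: the 100 seeks dominate.
scattered = 100 * SEEK_S + data_bytes * PER_BYTE_S  # 0.5 + 0.2 = 0.7 s

print(f"contiguous: {one_chunk:.3f} s, 100 chunks: {scattered:.3f} s")
```

Reading the same data takes more than three times as long once it is split into 100 pieces, which is why the algorithms below read and write large sequential runs.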
RCV1 collection
- Shakespeare's collected works are not large enough to demonstrate many of the points in this course.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection: English newswire articles sent over the wire in 1996 and 1997 (one year).

A Reuters RCV1 document
[Figure: a sample Reuters RCV1 newswire document.]

Reuters RCV1 statistics

symbol  statistic                                value
N       documents                                800,000
L       tokens per document                      200
M       terms (= word types)                     400,000
        bytes per token (incl. spaces/punct.)    6
        bytes per token (without spaces/punct.)  4.5
        bytes per term (= word type)             7.5
T       non-positional postings                  100,000,000

Exercise:
- What is the average frequency of a term (i.e., how many tokens per term)?
- 4.5 bytes per word token vs. 7.5 bytes per word type: why the difference?
- How many positional postings are there?

Goal: construct the inverted index

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132 ...
Calpurnia → 2 31 54 101
...
(dictionary on the left, postings lists on the right)

Index construction in IIR 1: Sort postings in memory

Postings in parsing order (term, docID):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (by term, then docID):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

Scaling index construction
- How can we construct an index for very large collections?
- We must take into account the hardware constraints we just learned about: memory, disk, speed, etc.

Sort-based index construction
- As we build the index, we parse docs one at a time.
- The final postings list for any term is incomplete until the end.
- Can we keep all postings in memory and then do the sort in memory at the end? No, not for large collections.
- Thus: we need to store intermediate results on disk.

Same algorithm for disk?
- Can we use the same index construction algorithm for larger collections, but use disk instead of memory?
- No: sorting very large sets of records on disk is too slow – too many disk seeks.
- We need an external sorting algorithm.

"External" sorting algorithm (using few disk seeks)
- We must sort T = 100,000,000 non-positional postings.
- Each posting has size 12 bytes (4+4+4: termID, docID, term frequency).
- Define a block to consist of 10,000,000 such postings. We can easily fit that many postings into memory.
- We will have 10 such blocks for RCV1.
- Basic idea of the algorithm:
  - For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk.
  - Then merge the blocks into one long sorted order (see the Python sketch after the BSBI pseudocode below).

Merging two blocks

Block 1 (on disk): brutus d3, caesar d4, noble d3, with d4
Block 2 (on disk): brutus d2, caesar d1, julius d1, killed d2
Merged postings (on disk): brutus d2, brutus d3, caesar d1, caesar d4, julius d1, killed d2, noble d3, with d4

Blocked Sort-Based Indexing

BSBIndexConstruction()
 1  n ← 0
 2  while (all documents have not been processed)
 3  do n ← n + 1
 4     block ← ParseNextBlock()
 5     BSBI-Invert(block)
 6     WriteBlockToDisk(block, fn)
 7  MergeBlocks(f1, ..., fn; f_merged)
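To make BSBI concrete, here is a minimal runnable Python sketch of the same idea. It is illustrative only: postings arrive as (termID, docID) pairs from an assumed upstream parser, runs are written as plain text files, and the default block size mirrors the 10,000,000-posting blocks above.

```python
import heapq, os, tempfile

def bsbi_index(postings_stream, block_size=10_000_000):
    """BSBI sketch: sort fixed-size blocks of (termID, docID) postings in
    memory, spill each block to disk as a sorted run, then do a single
    multi-way merge over all runs."""
    run_files, block = [], []
    for posting in postings_stream:          # posting = (termID, docID)
        block.append(posting)
        if len(block) == block_size:
            run_files.append(write_run(block))
            block = []
    if block:
        run_files.append(write_run(block))
    # heapq.merge is lazy: only one posting per run is in memory at a time,
    # and each run is read sequentially (few disk seeks).
    yield from heapq.merge(*(read_run(f) for f in run_files))

def write_run(block):
    block.sort()                             # in-memory sort of one block
    fd, name = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        for term_id, doc_id in block:
            f.write(f"{term_id}\t{doc_id}\n")
    return name

def read_run(name):
    with open(name) as f:
        for line in f:
            term_id, doc_id = line.split("\t")
            yield int(term_id), int(doc_id)
```

For RCV1 this produces the 10 runs mentioned above; the merged stream can then be grouped by termID to emit one postings list per term.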
Problem with sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings ...
- ... but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

Single-pass in-memory indexing
- Abbreviation: SPIMI
- Key idea 1: Generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.

SPIMI-Invert

SPIMI-Invert(token_stream)
 1  output_file ← NewFile()
 2  dictionary ← NewHash()
 3  while (free memory available)
 4  do token ← next(token_stream)
 5     if term(token) ∉ dictionary
 6     then postings_list ← AddToDictionary(dictionary, term(token))
 7     else postings_list ← GetPostingsList(dictionary, term(token))
 8     if full(postings_list)
 9     then postings_list ← DoublePostingsList(dictionary, term(token))
10     AddToPostingsList(postings_list, docID(token))
11  sorted_terms ← SortTerms(dictionary)
12  WriteBlockToDisk(sorted_terms, dictionary, output_file)
13  return output_file

Merging of blocks is analogous to BSBI.
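The same idea in a short runnable Python sketch, with stand-ins for the helpers named in the pseudocode: a Python dict plays the per-block dictionary, postings-list doubling is handled by the built-in list, and the "free memory available" test is approximated by a simple posting counter.

```python
import tempfile

def spimi_invert(token_stream, max_postings=10_000_000):
    """SPIMI sketch: accumulate postings per term in a per-block hash
    (no sorting while indexing, no global term-termID mapping); sort the
    terms only once, when the block is written out."""
    dictionary = {}                      # term -> postings list (docIDs)
    n_postings = 0
    for term, doc_id in token_stream:    # token = (term, docID)
        # One dict lookup replaces AddToDictionary/GetPostingsList;
        # Python lists already grow (double) on demand.
        dictionary.setdefault(term, []).append(doc_id)
        n_postings += 1
        if n_postings == max_postings:   # stand-in for "memory is full"
            break
    out = tempfile.NamedTemporaryFile("w", delete=False, suffix=".blk")
    with out:
        for term in sorted(dictionary):  # sort the terms once per block
            docs = " ".join(map(str, dictionary[term]))
            out.write(f"{term}\t{docs}\n")
    return out.name                      # one complete index block on disk
```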
SPIMI: Compression
- Compression makes SPIMI even more efficient:
  - compression of terms,
  - compression of postings.
- See next lecture.

Distributed indexing
- For web-scale indexing (don't try this at home!): must use a distributed computer cluster.
- Individual machines are fault-prone: they can unpredictably slow down or fail.
- How do we exploit such a pool of machines?

Google data centers (2007 estimates; Gartner)
- Google data centers mainly contain commodity machines.
- Data centers are distributed all over the world.
- 1 million servers, 3 million processors/cores
- Google installs 100,000 servers each quarter, based on expenditures of 200–250 million dollars per year.
- This would be 10% of the computing capacity of the world!
- If, in a non-fault-tolerant system with 1,000 nodes, each node has 99.9% uptime, what is the uptime of the system (assuming it does not tolerate failures)? Answer: 0.999¹⁰⁰⁰ ≈ 0.37, i.e., 37%.
- Suppose a server fails after 3 years. For an installation of 1 million servers, what is the interval between machine failures? Answer: 3 years ≈ 9.5 × 10⁷ s; divided by 10⁶ servers, that is roughly 95 s – less than two minutes.

Distributed indexing
- Maintain a master machine directing the indexing job – considered "safe".
- Break up indexing into sets of parallel tasks.
- The master machine assigns each task to an idle machine from a pool.

Parallel tasks
- We will define two sets of parallel tasks and deploy two types of machines to solve them: parsers and inverters.
- Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI).
- Each split is a subset of documents.

Parsers
- The master assigns a split to an idle parser machine.
- The parser reads a document at a time and emits (termID, docID) pairs.
- The parser writes the pairs into j term partitions, each covering a range of terms' first letters, e.g., a–f, g–p, q–z (here: j = 3).

Inverters
- An inverter collects all (termID, docID) pairs (= postings) for one term partition (e.g., for a–f).
- It sorts them and writes the postings lists.

Data flow
[Figure: MapReduce data flow. The master assigns splits to parsers (map phase); each parser writes (termID, docID) pairs into segment files partitioned by term range (a–f, g–p, q–z); in the reduce phase, each inverter collects one term partition from all segment files and writes its postings lists.]

MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce is a robust and conceptually simple framework for distributed computing ...
- ... without having to write code for the distribution part.
- The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
- Index construction was just one phase. Another phase: transform the term-partitioned index into a document-partitioned one.

Index construction in MapReduce

Schema of map and reduce functions:
map:    input → list(k, v)
reduce: (k, list(v)) → output

Instantiation of the schema for index construction:
map:    web collection → list(termID, docID)
reduce: (⟨termID1, list(docID)⟩, ⟨termID2, list(docID)⟩, ...) → (postings_list1, postings_list2, ...)

Example for index construction:
map:    d2: "C died." d1: "C came, C c'ed." → (⟨C, d2⟩, ⟨died, d2⟩, ⟨C, d1⟩, ⟨came, d1⟩, ⟨C, d1⟩, ⟨c'ed, d1⟩)
reduce: (⟨C, (d2, d1, d1)⟩, ⟨died, (d2)⟩, ⟨came, (d1)⟩, ⟨c'ed, (d1)⟩) → (⟨C, (d1:2, d2:1)⟩, ⟨died, (d2:1)⟩, ⟨came, (d1:1)⟩, ⟨c'ed, (d1:1)⟩)
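A tiny single-process Python sketch of the map/reduce schema above, with no actual distribution: map_index and reduce_index are hypothetical names, documents are plain strings, and terms stand in for termIDs.

```python
from collections import defaultdict

def map_index(doc_id, text):
    """map: one document -> list of (term, docID) pairs.
    Naive whitespace tokenization; a real parser would strip punctuation."""
    return [(term, doc_id) for term in text.split()]

def reduce_index(pairs):
    """reduce: group (term, docID) pairs into postings lists with term
    frequencies, as in the (d1:2, d2:1) notation above."""
    grouped = defaultdict(lambda: defaultdict(int))
    for term, doc_id in pairs:
        grouped[term][doc_id] += 1       # count occurrences per document
    return {term: sorted(freqs.items()) for term, freqs in grouped.items()}

# The example from the slide:
docs = {"d1": "C came, C c'ed.", "d2": "C died."}
pairs = [p for doc_id, text in docs.items() for p in map_index(doc_id, text)]
print(reduce_index(pairs))   # {'C': [('d1', 2), ('d2', 1)], ...}
```

In the distributed setting, the pairs would additionally be partitioned by term range (a–f, g–p, q–z), so that each inverter runs the reduce step on exactly one partition.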
Exercise
- What information does the task description contain that the master gives to a parser?
- What information does the parser report back to the master upon completion of the task?
- What information does the task description contain that the master gives to an inverter?
- What information does the inverter report back to the master upon completion of the task?

Dynamic indexing
- Up to now, we have assumed that collections are static.
- They rarely are: documents are inserted, deleted, and modified.
- This means that the dictionary and postings lists have to be modified dynamically.

Dynamic indexing: Simplest approach
- Maintain a big main index on disk.
- New docs go into a small auxiliary index in memory.
- Search across both, merge the results.
- Periodically, merge the auxiliary index into the big index.
- Deletions: keep an invalidation bit vector for deleted docs; filter the docs returned by the index using this bit vector.

Issue with auxiliary and main index
- Frequent merges
- Poor search performance during index merges

Logarithmic merge
- Logarithmic merging amortizes the cost of merging indexes over time. → Users see a smaller effect on response times.
- Maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory.
- The larger ones (I0, I1, ...) are on disk.
- If Z0 gets too big (> n), write it to disk as I0 ...
- ... or merge it with I0 (if I0 already exists) and write the merger to I1, etc.

LMergeAddToken(indexes, Z0, token)
 1  Z0 ← Merge(Z0, {token})
 2  if |Z0| = n
 3  then for i ← 0 to ∞
 4       do if Ii ∈ indexes
 5          then Zi+1 ← Merge(Ii, Zi)
 6               (Zi+1 is a temporary index on disk.)
 7               indexes ← indexes − {Ii}
 8          else Ii ← Zi (Zi becomes the permanent index Ii.)
 9               indexes ← indexes ∪ {Ii}
10               Break
11  Z0 ← ∅

LogarithmicMerge()
 1  Z0 ← ∅ (Z0 is the in-memory index.)
 2  indexes ← ∅
 3  while true
 4  do LMergeAddToken(indexes, Z0, getNextToken())
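A runnable Python sketch of the same scheme, under simplifying assumptions: an "index" is just a sorted list of postings, Merge is list merging, and n is small so the cascade is easy to watch. The binary-counter analogy on the next slide describes exactly how the levels below fill up.

```python
def lmerge_add_token(indexes, z0, token, n=4):
    """Logarithmic merge sketch: indexes[i] holds an index of size
    n * 2**i or None. A full in-memory index Z0 cascades down the
    levels like a carry in binary addition."""
    z0.append(token)
    if len(z0) < n:
        return z0
    z = sorted(z0)                       # Z0 spills as a sorted index
    i = 0
    while True:
        if i == len(indexes):
            indexes.append(None)         # grow the series of levels
        if indexes[i] is None:
            indexes[i] = z               # Z_i becomes permanent I_i
            break
        # I_i exists: merge it with Z_i into the temporary Z_{i+1}.
        z = sorted(indexes[i] + z)       # stand-in for Merge(I_i, Z_i)
        indexes[i] = None
        i += 1
    return []                            # fresh empty Z0

# Usage: after 16 tokens with n = 4, only I2 (size 16) is occupied,
# mirroring the binary counting on the following slide.
indexes, z0 = [], []
for t in range(16):
    z0 = lmerge_add_token(indexes, z0, t)
print([len(x) if x else 0 for x in indexes])   # [0, 0, 16]
```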
Binary numbers: I3 I2 I1 I0 = 2³ 2² 2¹ 2⁰
0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, ...
(The occupied indexes I3 I2 I1 I0 count through the binary numbers as postings arrive.)

Logarithmic merge
- The number of indexes is bounded by O(log T) (T is the total number of postings read so far).
- So query processing requires the merging of O(log T) indexes.
- The time complexity of index construction is O(T log T) ...
- ... because each of the T postings is merged O(log T) times.
- Auxiliary index: index construction time is O(T²), as each posting is touched in each merge.
- Suppose the auxiliary index has size a: a + 2a + 3a + 4a + ... + na = a · n(n+1)/2 = O(n²).
- So logarithmic merging is an order of magnitude more efficient.

Dynamic indexing at large search engines
- Often a combination:
  - frequent incremental changes,
  - rotation of large parts of the index that can then be swapped in,
  - occasional complete rebuild (becomes harder with increasing size – not clear if Google can do a complete rebuild).

Building positional indexes
- Basically the same problem, except that the intermediate data structures are large.

Take-away
- Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
- Distributed index construction: MapReduce
- Dynamic index construction: how to keep the index up to date as the collection changes

Resources
- Chapter 4 of IIR
- Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
- Original publication on MapReduce by Dean and Ghemawat (2004)
- Original publication on SPIMI by Heinz and Zobel (2003)
- YouTube video: Google data centers