PA152: Efficient Use of DB 4. Indexing Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2024 2 Indexing ◼ Reason: faster access to records than sequential (table) scan ◼ Variants: Conventional indexes B-tree Hashing ? recordkey value file key value …… PA152, Vlastislav Dohnal, FI MUNI, 2024 3 Terminology ◼ Sequential file  Index-sequential file ◼ Index  Primary index  Secondary index  Dense index  Sparse index  Multilevel index ◼ Search key  Primary key PA152, Vlastislav Dohnal, FI MUNI, 2024 4 File Sequential file 20 10 40 30 60 50 80 70 100 90 Index-sequential file 20 10 40 30 60 50 80 70 100 90 10 30 50 70 90 PA152, Vlastislav Dohnal, FI MUNI, 2024 5 Index ◼ Collection of items:  20 10 40 30 60 50 80 70 Sparse index 10 30 50 70 Dense index 10 20 30 40 50 60 70 80 PA152, Vlastislav Dohnal, FI MUNI, 2024 9 Problem of Duplicate Keys ◼ Index type dense index? sparse index? 10 10 20 10 30 20 30 30 45 40 File PA152, Vlastislav Dohnal, FI MUNI, 2024 27 Secondary Index ◼ File ordered by another key i.e., index created for different key than the primary file Or the file is not ordered at all ◼ Which type: Dense or sparse? PA152, Vlastislav Dohnal, FI MUNI, 2024 28 Secondary Index ◼ Must use pointers to records! Search key 50 30 70 20 40 80 10 100 60 90 Dense index 10 20 30 40 50 60 70 ... 10 50 90 ... Sparse for higher levels PA152, Vlastislav Dohnal, FI MUNI, 2024 29 Secondary Index: Duplicate Keys ◼ Replicated in index Increases ◼ space requirements ◼ access time 10 20 40 20 40 10 40 10 40 30 10 10 10 20 20 30 40 40 40 40 ... PA152, Vlastislav Dohnal, FI MUNI, 2024 30 Secondary Index: Duplicate Keys ◼ Index item contains list of pointers But the item is of variable length 10 20 40 20 40 10 40 10 40 30 10 40 30 20 PA152, Vlastislav Dohnal, FI MUNI, 2024 31 Secondary Index: Duplicate Keys ◼ Shift the variable-length list to “buckets” 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60 ... buckets index blocks file blocks PA152, Vlastislav Dohnal, FI MUNI, 2024 32 Secondary Index: Buckets with Pointers ◼ Advantage: a list of records for querying Evaluate more selection constrains without accessing records ◼ Example: Relation ◼ employee(name, department, room) Indexes: ◼ name – primary index ◼ department – secondary index ◼ room – secondary index PA152, Vlastislav Dohnal, FI MUNI, 2024 33 Secondary Index: Duplicate Keys ◼ Query employees of Toys dept. in room G243 ◼ Intersect buckets  To get pointers to matching employee records  Also used in text information retrieval Toys G243 File (employee)Index (department) Index (room) Example: Text Information Retrieval ◼ “Full-text” index for documents ◼ Split documents into words ◼ Build an inverted file over all documents i.e., a file of records PA152, Vlastislav Dohnal, FI MUNI, 2024 34 Document 1 Word List for Doc1 Caesar and Brutus are ambitious. •Caesar •Brutus •ambitious 1. Split to words 2. Stemming 3. Ignore words in stop-list Example: Text Information Retrieval ◼ Inverted file ◼ Retrieve docs containing Brutus & Caesar Read posting lists for Brutus and Caesar Intersect them PA152, Vlastislav Dohnal, FI MUNI, 2024 35 Term docID ambitious 1 brutus 1 brutus 3 capitol 2 caesar 1 caesar 2 Term Posting list of docIDs ambitious 1 brutus 1, 3 capitol 1 caesar 1, 2 Relational view Inverted file PA152, Vlastislav Dohnal, FI MUNI, 2024 37 B-trees ◼ Another index type  Sequential order not necessary  Balanced – max I/Os guarantee ◼ More variants  B-tree, B+-tree, B*-tree, … ◼ Typically, by saying “B-tree” we mean “B+-tree”! ◼ Origin  Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972) ◼ They did not explain what, if anything, the B stands for. ◼ Douglas Comer explains:  The origin of "B-tree" has never been explained by the authors. As we shall see, "balanced," "broad," or "bushy" might apply. Others suggest that the "B" stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as "Bayer"-trees. * Source: Wikipedia PA152, Vlastislav Dohnal, FI MUNI, 2024 38 B+-tree ◼ Example n=4 100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 … pointers to record in file … PA152, Vlastislav Dohnal, FI MUNI, 2024 65 B+-tree ◼ B+-tree as file Leaves store the records themselves. ◼ Duplicate keys Pointers in leaves = pointers to buckets ◼ i.e., blocks with a list of record pointers with the same key value ◼ Variable-length key values (e.g., strings) Store completely → low arity, varying arity, … Use prefixes (prefix compression) Lecture’s Takeaways ◼ Principle of indexing Use of record pointers and their utilization Handling duplicate keys ◼ Efficiency of B+ trees also, with respect to query types ◼ Revision of terminology Dense / sparse index Primary / secondary index ◼ Clustered / non-clustered index Covering index PA152, Vlastislav Dohnal, FI MUNI, 2024 66