PA152: Efficient Use of DB 5. Hashing Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2024 2 Outline ◼ Hash index ◼ Bitmap index ◼ Multi-attribute indexes ◼ Grid index PA152, Vlastislav Dohnal, FI MUNI, 2024 3 Hashing: Principle ◼ Key-to-address transformation Based on a function returning an address ◼ for an input key value . . . Buckets (1 bucket ≈ 1 block) key h (key) address PA152, Vlastislav Dohnal, FI MUNI, 2024 4 Variants ◼ Direct addressing Typically address of a block ◼ Good for hashed file Address of a record ◼ Rare solution for secondary storage ◼ Used in in-mem hash tables File (collection of blocks) key h (key) . . . Record with . . . PA152, Vlastislav Dohnal, FI MUNI, 2024 5 Variants ◼ Indirect addressing Used in secondary indexes Has overhead . . . Buckets (1 bucket ≈ 1 block) key h (key) File (collection of blocks) . . . Record with . . . address Record pointer PA152, Vlastislav Dohnal, FI MUNI, 2024 6 Hash Function – Examples ◼ Hash function h: K → A Key space “K” ◼ key as a byte-array, e.g., n bytes Address space “A” ◼ B buckets ◼ Examples h(x0x1…xn-1) → y ◼ y = (x0+x1+…+xn-1) mod B ◼ y = (x0*31n-1+x1*31n-2+…+xn-1) mod B Applicable to variable-length characters PA152, Vlastislav Dohnal, FI MUNI, 2024 7 Hash Function ◼ Properties: Uniform ◼ Equal occupation of all buckets Random ◼ Low correlation between its input and output ◼ „Big science“ → special literature Good hash function must be uniform Locality-sensitive hashing Vector quantization, product quantization … PA152, Vlastislav Dohnal, FI MUNI, 2024 8 Address Space ◼ Collection of buckets ◼ Bucket Capacity larger than 1 record Sort the keys? ◼ Yes, if access should be faster ◼ Yes, if updates are rare ◼ No, if updates are frequent PA152, Vlastislav Dohnal, FI MUNI, 2024 9 Terminology ◼ Hash function ◼ Collisions in direct addressing, another record present on the returned address in indirect addressing, no problem if more keys/recs can be stored ◼ Overflows  Bucket capacity is exhausted  Overflow blocks ◼ Static vs. dynamic hashing Address space cardinality can be modified PA152, Vlastislav Dohnal, FI MUNI, 2024 10 Static Hashing: Collision Solutions ◼ Closed addressing (= open hashing) Returned address is fixed On overflow, allocate a new block (overflow area) ◼ and chain overflow blocks Used in secondary memory indexes/files ◼ Open addressing (= closed hashing) Collision function is introduced ◼ Linear, quadratic, double-hashing Used in in-mem hash tables PA152, Vlastislav Dohnal, FI MUNI, 2024 11 Example ◼ Static closed addressing Collisions handled by overflow blocks Block capacity = 2 keys 0 1 2 3 h (key) Skip revision… Block chaining PA152, Vlastislav Dohnal, FI MUNI, 2024 12 Example: Insert ◼ h(„b“) = 2 0 1 2 3 b a c d PA152, Vlastislav Dohnal, FI MUNI, 2024 13 Example: Insert ◼ h(„e“) = 1 0 1 2 3 b a c d e PA152, Vlastislav Dohnal, FI MUNI, 2024 14 Example: Deletion ◼ Free overflow blocks up ◼ Delete „b” 0 1 2 3 b a c d e PA152, Vlastislav Dohnal, FI MUNI, 2024 15 Example: Deletion ◼ Free overflow blocks up ◼ Delete „c” 0 1 2 3 a c e d e PA152, Vlastislav Dohnal, FI MUNI, 2024 16 Example: Deletion ◼ Free overflow blocks up ◼ Delete „a” 0 1 2 3 a e d PA152, Vlastislav Dohnal, FI MUNI, 2024 17 Rule of thumb ◼ Keep space utilization btw. 50% and 80% Utilization = # stored keys / max. capacity (in keys) < 50% → wasting space ≥ 80% → too many collisions ◼ Overflow areas degrade searching and insertion performance PA152, Vlastislav Dohnal, FI MUNI, 2024 18 Dynamic Data ◼ Static hashing Overflows → (global) reorganization ◼ i.e., designing a new hash function ◼ Dynamic hashing Extendible Linear PA152, Vlastislav Dohnal, FI MUNI, 2024 19 Extendible Hashing ◼ Idea: Use top i of b bits output by hash function Indirection added – directory ◼ Size is power of 2 h (key)[0:i ] 0 1 2 Buckets 1 2 2 Directory 2 0001 PA152, Vlastislav Dohnal, FI MUNI, 2024 20 Extendible Hashing: Insertion ◼ Example: h(x) = x, i.e., identity ◼ Locate a bucket is free space available? ◼ Yes → insert; ◼ No → so split the bucket & redistribute its content ◼ Insert 0001 1010 200 01 10 11 0 1 2 1000 1111 1 0100 2 2 1010 ◼ Insert 1010 Split bucket 1010 1111 1111 PA152, Vlastislav Dohnal, FI MUNI, 2024 21 Extendible Hashing: Insertion 200 01 10 11 1000 1 0100 2 0001 2 2 2 PA152, Vlastislav Dohnal, FI MUNI, 2024 22 Extendible Hashing: Insertion ◼ Insert 1011 Directory is full → double it (use one more bit) 3000 001 010 011 1000 3 0100 2 0001 2 1010 1011 3 100 101 110 111 1111 2 PA152, Vlastislav Dohnal, FI MUNI, 2024 23 Extendible Hashing: Deletion ◼ Delete keys Delete key Bucket gets empty ◼ No operation → assume new data arriving ◼ Merge neighboring buckets  Of the same prefix (shorted by one bit)  Directory can shrink too PA152, Vlastislav Dohnal, FI MUNI, 2024 24 Extendible Hashing: Summary ◼ Advantages (over static hashing) Handles growing files Wasting less space Local reorganization ◼ Disadvantages Indirection ◼ Not bad if directory in memory Directory doubles ◼ May not fit in memory ◼ Number of buckets grows linearly PA152, Vlastislav Dohnal, FI MUNI, 2024 25 Linear Hashing ◼ Idea: Use i low order bits of address (of b bits) No directory File grows linearly h (key)[b-i :b] 00 Buckets 1000 1100 0001 1011 0010 1 10 PA152, Vlastislav Dohnal, FI MUNI, 2024 26 Linear Hashing: Insertion ◼ Parameters: ◼ b – length of hash address ◼ i – suffix length => h(k)[b-i:b] ◼ n – number of allocated buckets ◼ Insert 1011b h(1011b)[2:4] = 11b = m ◼ If m < n, insert into bucket m ◼ Otherwise, insert into bucket m – 2i-1 h(1011b)[2:4] 00 1000 1100 1 0001 1011 10 0010 Current parameter values: i=2, n=3, b=4 PA152, Vlastislav Dohnal, FI MUNI, 2024 27 Linear Hashing: Insertion ◼ Insert 1001 h(1001)[2:4] = 01 No room → create overflow bucket h(1001)[2:4] 00 1000 1100 1 0001 1011 10 0010 1001 i=2, n=3, b=4 PA152, Vlastislav Dohnal, FI MUNI, 2024 28 Linear Hashing: Bucket Splitting ◼ Keep track of bucket utilization >80%, then split a bucket (smallest address of 0abcd…z) ◼ Add new bucket → address is 1abcd...z ◼ Distribute keys from bucket 0abcd…z and its overflow to buckets [01]abcd…z This address (n – 2i-1) is always split, i.e., 3-21 =1 h(1001)[2:4] 00 1000 1100 01 0001 1001 10 0010 1001 11 1011 n=4 i=3 PA152, Vlastislav Dohnal, FI MUNI, 2024 29 Linear Hashing ◼ Inserting new keys May lead to overflow buckets Utilization exceeding 80% ◼ Do the bucket split ◼ The bucket being split may differ from the overflowing one! After allocating 2i-th bucket, increment i ◼ i.e., the bucket count exceeds 2i -1 PA152, Vlastislav Dohnal, FI MUNI, 2024 30 Linear Hashing: Summary ◼ Advantages (over static hashing) Handles growing files Less wasted space Local reorganization No indirection like extendible hashing ◼ Disadvantages Overflow chains PA152, Vlastislav Dohnal, FI MUNI, 2024 31 Linear Hashing: Example ◼ „Bad“ data Last x bits do not distinguish keys One bucket “stores” all data; others are empty ◼ Utilization is decreased by split  → need to track the length of overflow blocks! h (key)[b-i :b] … PA152, Vlastislav Dohnal, FI MUNI, 2024 32 Hashing vs. Indexing: Query Types ◼ Hashing Good for exact match SELECT … WHERE a=5 ◼ Indexing Good for range search SELECT … WHERE a>5 ◼ What if cardinality of ``a’’ is low? PA152, Vlastislav Dohnal, FI MUNI, 2024 33 Bitmap (raster) index ◼ Attribute X has got h unique values ◼ Collection of h bit arrays One array (vector) per value Array length is equal to # records (n) ◼ Good for attributes with „few“ values Up to 10% of records ◼ Queries: WHERE a = 5 OR a = 7 ◼ Relation R(F,G) ◼ Bitmap index on G PA152, Vlastislav Dohnal, FI MUNI, 2024 34 Bitmap index: Example 50 foo 30 foo 30 bar 40 baz 30 baz 40 bar F G foo 1001001 bar 0100100 baz 0010010 Value Vector 34 foo PA152, Vlastislav Dohnal, FI MUNI, 2024 35 Bitmap index: Summary ◼ Disadvantages Storage demands ◼ If key is the primary key, then n(n + log2 n) bits ◼ Many records with the same value (many 1’s) or too many records in general  Compress to the bit vector of blocks (instead of records) Record updates ◼ New value → add new array ◼ New record → extend all arrays PA152, Vlastislav Dohnal, FI MUNI, 2024 36 Bitmap index: Summary ◼ Advantages Fast bit operations (AND, OR) Applicable to “range” queries Easy to combine multiple indexes together ◼ which are typically single-attribute indexes PA152, Vlastislav Dohnal, FI MUNI, 2024 37 Bitmap index: Compression ◼ Reduce individual arrays  Few 1’s, many 0’s  ends with one  trailing zeros are omitted ◼ Can be filled up to # records ◼ Run-Length Encoding (RLE)  Split to blocks ◼ Sequence of ”i” zeros followed by one  Encode “i” binary  Code for “i” : „number length + number itself“ PA152, Vlastislav Dohnal, FI MUNI, 2024 38 Bitmap index: RLE ◼ Example of compression – 24 bits  Sequence: 0000 0000 0000 0110 0010 0000 ◼ 13x zeros, 1x one ◼ 0x zero, 1x one ◼ 3x zero, 1x one ◼ 5x zero followed by nothing… → ignore  Binary: ◼ 13d → 1101b ◼ 0d → 0b ◼ 3d → 11b RLE code: ◼ Sequence of „blocks“:  Binary number length in prefix code, binary number itself ◼ Code: 11101101001011 PA152, Vlastislav Dohnal, FI MUNI, 2024 39 Bitmap index: RLE ◼ Example of decompression Code 11101101001011 Split into block and decompress… ◼ Binary number length:  bit count (ones followed by one zero) ◼ 11101101001011 Decoding parts ◼ 11101101 → 0000 0000 0000 01 ◼ 00 → 1 ◼ 1011 → 0001 Resulting sequence: ◼ 0000 0000 0000 0110 001 ◼ Missing trailing zeros filled up to # records PA152, Vlastislav Dohnal, FI MUNI, 2024 40 Bitmap index: Operations ◼ Bit operations AND, OR ◼ RLE strings Decompression is easy No need to decompress ◼ More complex implementation, but possible ◼ AND: sum of “i”’s in codes must match ◼ OR: by analogy… ◼ 11101101001011 OR/AND 1101011101001010 PA152, Vlastislav Dohnal, FI MUNI, 2024 41 Bitmap index: Implementation ◼ Issues for efficient use: 1. Locate bit arrays for the passed key value 2. Having the array, how to get records? 3. Updating records, how to update index? PA152, Vlastislav Dohnal, FI MUNI, 2024 42 Bitmap index: Solutions ◼ Ad 1: (Locate bit arrays for the passed key value)  We have a bit array for each key value ◼ B+-tree for keys ◼ Pointers to arrays in leaves  Storing bit arrays ◼ records of variable length ◼ Ad 2: (Having the array, how to get records?)  Get record r (ordinal number of record) ◼ Secondary index for record numbers ◼ Array of pointers to records (replacement of bit arrays) ◼ List of block occupations in the number of records PA152, Vlastislav Dohnal, FI MUNI, 2024 43 Bitmap index: Solutions ◼ Ad 3: (Updating records, how to update index?) Record numbers fixed (not seq. file / sec. index) ◼ Delete record → tombstone in file and clear a bit in the corresponding array ◼ Insert record → append to end of file (a new record number) → append bit 1 to the corresponding bit array → if not existing, create new one PA152, Vlastislav Dohnal, FI MUNI, 2024 44 Bitmap index: Solutions ◼ Ad 3: (Updating records, how to update index?) Record numbers are not fixed ◼ Reorganize all arrays (delete 1 bit) ◼ Update array of pointers to records ◼ Rarely used Bitmaps – data (TPC-H) Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); create bitmap index b_lin_2 on lineitem(l_returnflag); create bitmap index b_lin_3 on lineitem(l_linestatus); create bitmap index b_lin_4 on lineitem(l_linenumber);  100000 rows ; cold buffer  Dual Pentium II (450MHz, 512KB), 512 MB RAM, 3x18GB drives (10000RPM), Windows 2000. PA152, Vlastislav Dohnal, FI MUNI, 2024 45 Bitmaps -- queries Queries: 1 attribute select count(*) from lineitem where l_returnflag = 'N'; 2 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3; 3 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3 and l_linestatus = 'F'; PA152, Vlastislav Dohnal, FI MUNI, 2024 46 Bitmaps ◼ Order of magnitude improvement compared to table scan. ◼ Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity. PA152, Vlastislav Dohnal, FI MUNI, 2024 47 A N R l_returnflag O F l_linestatus 0 2 4 6 8 10 12 1 2 3 Throughput(Queries/sec) linear scan bitmap PA152, Vlastislav Dohnal, FI MUNI, 2024 48 Multi-key / Composite index ◼ Index on multiple attributes ◼ Motivation: SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000 ◼ Alternatives: a) Index on one attribute + filtering b) Combine two indexes + intersection of candidate record pointers c) Index in index d) Concatenation of key values into one PA152, Vlastislav Dohnal, FI MUNI, 2024 49 Index on one attribute ◼ SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000 ◼ Index on dept Filter found records using salary < 10000 I1 PA152, Vlastislav Dohnal, FI MUNI, 2024 50 Combining Indexes ◼ Index on dept ◼ Index on salary ◼ Each index returns a list of candidate records Intersection of lists → query result dept=‘Toys’ salary<10000 bucket with record pointers PA152, Vlastislav Dohnal, FI MUNI, 2024 51 Index in index ◼ Index on the first attribute  In leaves, pointers to embedded indexes on 2nd attribute  I1 on dept  Ix, x=2..k on salary ◼ I2 contains only records having the same dept. value (Electrical Supp.) I1 I2 I3 Electrical Supp. Toys … PA152, Vlastislav Dohnal, FI MUNI, 2024 52 Index in index: Example ◼ SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000 name=Peter dept=Toys salary=7000 Electr. Toys Garden… 10000 15000 17000 21000 5000 7000 15000 19000 Index dept Indexes on salary Record PA152, Vlastislav Dohnal, FI MUNI, 2024 53 Index in index ◼ What queries can use this solution? SELECT name, salary FROM emp WHERE a) dept = ‘Toys’ AND salary ≥ 10000 b) dept = ‘Toys’ c) salary = 10000 name=Peter dept=Toys salary=7000 Electr. Toys Gardening 10000 15000 17000 21000 5000 7000 15000 19000 Index dept Indexy salary Record PA152, Vlastislav Dohnal, FI MUNI, 2024 54 Concatenation of key values ◼ Analogous to an index on one attribute Key values are merged ◼ String concatenation, number multiplication, … ◼ Indexing used in PostgreSQL ◼ Hashing Partitioned hash function PA152, Vlastislav Dohnal, FI MUNI, 2024 55 Partitioned hash function ◼ Idea: Two keys Two hash functions One address h1 h2 010110 1110010 key1 key2 address PA152, Vlastislav Dohnal, FI MUNI, 2024 56 Partitioned hash function: Example ◼ Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1 ◼ Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00 ◼ Records to insert  000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2024 57 Partitioned hash function: Example ◼ Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1 ◼ Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00 ◼ Find  employess in Toys dept. with salary of 40000. 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2024 58 Partitioned hash function: Example ◼ Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1 ◼ Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00 ◼ Find  employees with salary 30000 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2024 59 Partitioned hash function: Example ◼ Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1 ◼ Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00 ◼ Find  employes in Toys dept. 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2024 60 Grid Index: another multi-key index ◼ Idea: records with key1=V3, key2=X2 V1 V2 Vn X1 X2 … Xm … key1 key2 PA152, Vlastislav Dohnal, FI MUNI, 2024 61 Grid Index: Properties ◼ Efficient for exact match queries key1 = Vi and key2 = Xj key1 = Vi key2 = Xj ◼ Range queries key1  V3 and key2 < X3 ◼ Rectangular area V1 V2 Vn X1 X2 … Xm … key1 key2 V3 X3 ◼ How to store grid on disk? one-dimensional array Tradeoff: grid dimensions vs. cell capacity ◼ Disadvantage Grid dimensions must be fixed ◼ They are needed to calculate position of cell . Limited cell capacity PA152, Vlastislav Dohnal, FI MUNI, 2024 62 Grid Index: Implementation X1 X2 X3 X4 X1 X2 X3 X4 X1 X2 X3 X4 V1 V2 V3 PA152, Vlastislav Dohnal, FI MUNI, 2024 63 Grid Index: Implementation ◼ Use buckets, i.e., indirection Grid cell points to a bucket ◼ Grid is fixed-sized Overhead with indirection Allows allocating buckets linearly. V1 V2 X1 X2 X3 V3 V4 Buckets -- ---- -- ---- -- ---- -- ---- … PA152, Vlastislav Dohnal, FI MUNI, 2024 64 Grid Index: Defining Grid ◼ Analysis of data and query types Find out dimensions of the grid Value on an axis can be an interval → binning ◼ E.g., numeric domains Grows linearly 1 2 3 Toys Electr. Gardening 0-20000 1 20000-50000 2 50000- 3 Salary Grid Dept PA152, Vlastislav Dohnal, FI MUNI, 2024 65 Grid Index: Summary ◼ Advantages Good for multikey indexing ◼ Disadvantages Grid size is fixed; may waste space ◼ Alternative is a hierarchical grid Choice of grid intervals (quantization) → uniform data distribution Index Design Guidelines ◼ Good selectivity (i.e., low selectivity)  Fraction of matching rows is small ◼ “Short” attribute values  Increase tree fan-out ◼ On “join” attributes ◼ Prefer more single-attribute indexes  for one multi-attribute one ◼ Few indexes on highly-updated tables ◼ No indexes on tiny tables PA152, Vlastislav Dohnal, FI MUNI, 2024 66 Lecture’s Takeaways ◼ Techniques of hashing ◼ Recall handling duplicate keys i.e., multiple records with the same key value. ◼ Alternative indexes Bitmap index ◼ do not mix up with bitmap scan in Pg’s explain plan Grid index ◼ Multi-attribute indexes and query predicates ◼ Further terms (follows…) PA152, Vlastislav Dohnal, FI MUNI, 2024 67 Further Terminology ◼ Clustered index  = index-sequential file / B+-file  Records are stored in leaf nodes and cannot be accessed in any other way (than by this index) ◼ Non-clustered index  = secondary index / B+-tree ◼ Covering index  Query can be fully answered using the index only ◼ i.e., records are not accessed.  MS SQL Server: Index with included columns ◼ Not a multi-attribute index, but leaf node values are enriched with values of the included cols. ◼ Indexed view  materialized view with clustered index ◼ Multidim. data and indexes (R-trees, Quad Trees, kd-trees) PA152, Vlastislav Dohnal, FI MUNI, 2024 68