PA152: Efficient Use of DB 5. Hashing Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2022 2 Hashing – concept  Key-to-address transformation Based on a function returning an address  for an input key value . . . Buckets (1 bucket ≈ 1 block) key h (key) address PA152, Vlastislav Dohnal, FI MUNI, 2022 3 Alternatives  Direct addressing Record address  Typically address of a block  Good for hashed file File (collection of blocks) key h (key) . . . Record with . . . PA152, Vlastislav Dohnal, FI MUNI, 2022 4 Alternatives  Indirect addressing Used in secondary indexes Has overhead . . . Buckets (1 bucket ≈ 1 block) key h (key) File (collection of blocks) . . . Record with . . . address Record pointer PA152, Vlastislav Dohnal, FI MUNI, 2022 5 Hash Function – Examples  key is a string n bytes  Address space B buckets  Hash function h(x0x1…xn-1) (x0+x1+…+xn-1) mod B (x0*31n-1+x1*31n-2+…+xn-1) mod B Applicable to variable-length characters PA152, Vlastislav Dohnal, FI MUNI, 2022 6 Hash Function  Properties: Uniform  Equal occupation of all buckets Random  Low correlation between its input and output  „Big science“  special literature Good hash function must be uniform PA152, Vlastislav Dohnal, FI MUNI, 2022 7 Address Space  Collection of buckets  Bucket Capacity larger than 1 record Sort the keys?  Yes, if access should be faster  Yes, if updates are rare  No, if updates are frequent PA152, Vlastislav Dohnal, FI MUNI, 2022 8 Terminology  Hash function  Collisions Another record present on the returned address No problem if more keys/recs can be stored  Overflows  Bucket capacity is exhausted  Overflow blocks  Static vs. dynamic hashing Address space cardinality can be modified PA152, Vlastislav Dohnal, FI MUNI, 2022 9 Static Hashing: Collision Solutions  Closed addressing (= open hashing) Returned address is fixed On overflow, allocate a new block (overflow area)  Chaining overflow blocks  Open addressing (= closed hashing) Collision function is introduced  Linear, quadratic, double-hashing PA152, Vlastislav Dohnal, FI MUNI, 2022 10 Example  Static closed addressing Collisions handled by overflow blocks Block capacity = 2 keys 0 1 2 3 h (key) Skip revision… Block chaining PA152, Vlastislav Dohnal, FI MUNI, 2022 11 Example: Insert  h(„b“) = 2 0 1 2 3 b a c d PA152, Vlastislav Dohnal, FI MUNI, 2022 12 Example: Insert  h(„e“) = 1 0 1 2 3 b a c d e PA152, Vlastislav Dohnal, FI MUNI, 2022 13 Example: Deletion  Free overflow blocks up  Delete „b” 0 1 2 3 b a c d e PA152, Vlastislav Dohnal, FI MUNI, 2022 14 Example: Deletion  Free overflow blocks up  Delete „c” 0 1 2 3 a c e d e PA152, Vlastislav Dohnal, FI MUNI, 2022 15 Example: Deletion  Free overflow blocks up  Delete „a” 0 1 2 3 a e d PA152, Vlastislav Dohnal, FI MUNI, 2022 16 Rule of thumb  Keep space utilization btw. 50% and 80% Utilization = # stored keys / max. capacity (in keys) < 50%  wasting space ≥ 80%  too many collisions  Overflow areas degrade searching and insertion performance PA152, Vlastislav Dohnal, FI MUNI, 2022 17 Dynamic Data  Static hashing Overflows  reorganization  i.e., designing a new hash function  Dynamic hashing Extendible Linear PA152, Vlastislav Dohnal, FI MUNI, 2022 18 Extendible Hashing  Idea: Use top i of b bits output by hash function Indirection added – directory  Size is power of 2 h (key)[0:i ] 0 1 2 Buckets 1 2 2 Directory 2 0001 PA152, Vlastislav Dohnal, FI MUNI, 2022 19 Extendible Hashing: Insertion  Example: h(x) = x, i.e., identity  Locate a bucket is free space available?  Yes  insert;  No  so split the bucket & redistribute its content  Insert 0001 1010 200 01 10 11 0 1 2 1000 1111 1 0100 2 2 1010  Insert 1010 Split bucket 1010 1111 1111 PA152, Vlastislav Dohnal, FI MUNI, 2022 20 Extendible Hashing: Insertion 200 01 10 11 1000 1 0100 2 0001 2 2 2 PA152, Vlastislav Dohnal, FI MUNI, 2022 21 Extendible Hashing: Insertion  Insert 1011 Directory is full  double it (use one more bit) 3000 001 010 011 1000 3 0100 2 0001 2 1010 1011 3 100 101 110 111 1111 2 PA152, Vlastislav Dohnal, FI MUNI, 2022 22 Extendible Hashing: Deletion  Delete keys Delete key Bucket gets empty  No operation  assume new data arriving  Merge neighboring buckets  Of the same prefix (shorted by one bit)  Directory can shrink too PA152, Vlastislav Dohnal, FI MUNI, 2022 23 Extendible Hashing: Summary  Advantages Handles growing files Wasting less space (than static hashing) Local reorganization  Disadvantages Indirection  Not bad if directory in memory Directory doubles  May not fit in memory  Number of buckets grows linearly PA152, Vlastislav Dohnal, FI MUNI, 2022 24 Linear Hashing  Idea: Use i low order bits of address (of b bits) No directory File grows linearly h (key)[b-i :b] 00 Buckets 1000 1100 0001 1011 0010 1 10 PA152, Vlastislav Dohnal, FI MUNI, 2022 25 Linear Hashing: Insertion  Parameters: i – suffix length => h(k)[b-i:b] n – number of allocated buckets  Insert 1011b h(1011b)[2:4] = 11b = m  If m < n, insert into bucket m  Otherwise, insert into bucket m – 2i-1 h(1011b)[2:4] 00 1000 1100 1 0001 1011 10 0010 i=2, n=3 PA152, Vlastislav Dohnal, FI MUNI, 2022 26 Linear Hashing: Insertion  Insert 1001 h(1001)[2:4] = 01 No room  create overflow bucket h(1001)[2:4] 00 1000 1100 1 0001 1011 10 0010 1001 PA152, Vlastislav Dohnal, FI MUNI, 2022 27 Linear Hashing: Bucket Splitting  Keep track of bucket utilization >80%, then split a bucket (its address is 0abcd…z)  Add new bucket  address is 1abcd...z  Distribute keys from bucket 0abcd…z and its overflow to buckets [01]abcd…z This address (n – 2i-1) is always split, i.e., 3-21 =1 h(1001)[2:4] 00 1000 1100 01 0001 1001 10 0010 1001 11 1011 n=4 i=3 PA152, Vlastislav Dohnal, FI MUNI, 2022 28 Linear Hashing  Inserting new keys May lead to overflow buckets Utilization exceeding 80%  Do the bucket split  The bucket being split may differ from the overflowing one! After allocating 2i-th bucket, increment i  i.e., the bucket count exceeds 2i -1 PA152, Vlastislav Dohnal, FI MUNI, 2022 29 Linear Hashing: Summary  Advantages Handles growing files Less wasted space (than static hashing) Local reorganization No indirection like extendible hashing  Disadvantages Overflow chains PA152, Vlastislav Dohnal, FI MUNI, 2022 30 Linear Hashing: Example  „Bad“ data Last x bits do not distinguish keys One bucket “stores” all data; others are empty  Utilization is decreased by split   need to track the length of overflow blocks! h (key)[b-i :b] … PA152, Vlastislav Dohnal, FI MUNI, 2022 31 Hashing vs. Indexing  Hashing Good for exact match SELECT … WHERE a=5  Indexing Good for range search SELECT … WHERE a>5  What if cardinality of ``a’’ is low? PA152, Vlastislav Dohnal, FI MUNI, 2022 32 Bitmap (raster) index  Attribute X has got h unique values  Collection of h bit arrays One array (vector) per value Array length is equal to # records (n) = inverted file  Good for attributes with „few“ values Up to 10% of records  Queries: WHERE a = 5 OR a = 7  Relation R(F,G)  Bitmap index on G PA152, Vlastislav Dohnal, FI MUNI, 2022 33 Bitmap index: Example 50 foo 30 foo 30 bar 40 baz 30 baz 40 bar F G foo 1001001 bar 0100100 baz 0010010 Value Vector 34 foo PA152, Vlastislav Dohnal, FI MUNI, 2022 34 Bitmap index: Summary  Disadvantages Storage demands  If key is the primary key, then n(n + log2 n) bits  Many records with the same value (many 1’s) or too many records in general  Compress to the bit vector of blocks (instead of records) Record updates  New value  add new array  New record  extend all arrays PA152, Vlastislav Dohnal, FI MUNI, 2022 35 Bitmap index: Summary  Advantages Fast bit operations (AND, OR) Applicable to ``range’’ queries Easy to combine multiple indexes together  which are typically single-attribute indexes PA152, Vlastislav Dohnal, FI MUNI, 2022 36 Bitmap index: Compression  Reduce individual arrays  Few 1’s, many 0’s  ends with one  trailing zeros are omitted  Can be filled up to # records  Run-Length Encoding (RLE)  Split to blocks  Sequence of ”i” zeros followed by one  Encode “i” binary  Code for “i” : „Binary number length + number itself“ PA152, Vlastislav Dohnal, FI MUNI, 2022 37 Bitmap index: RLE  Example of compression – 24 bits  Sequence: 0000 0000 0000 0110 0010 0000  13x zeros, 1x one  0x zero, 1x one  3x zero, 1x one  5x zero followed by nothing…  ignore  Binary:  13d  1101b  0d  0b  3d  11b RLE code:  Sequence of „blocks“:  Binary number length in prefix code, binary number itself  Code: 11101101001011 PA152, Vlastislav Dohnal, FI MUNI, 2022 38 Bitmap index: RLE  Example of decompression Code 11101101001011 Split into block and decompress…  Binary number length:  bit count (ones followed by one zero)  11101101001011 Decoding parts  11101101  0000 0000 0000 01  00  1  1011  0001 Resulting sequence:  0000 0000 0000 0110 001  Missing trailing zeros filled up to # records PA152, Vlastislav Dohnal, FI MUNI, 2022 39 Bitmap index: Operations  Bit operations AND, OR  RLE strings Decompression is easy No need to decompress  More complex implementation, but possible  AND: sum of “i”’s in codes must match  OR: by analogy…  11101101001011 OR/AND 1101011101001010 PA152, Vlastislav Dohnal, FI MUNI, 2022 40 Bitmap index: Implementation  Issues for efficient use: 1. Locate bit arrays for the passed key value 2. Having the array, how to get records? 3. Updating records, how to update index? PA152, Vlastislav Dohnal, FI MUNI, 2022 41 Bitmap index: Solutions  Ad 1: (Locate bit arrays for the passed key value)  We have a bit array for each key value  B+-tree for keys  Pointers to arrays in leaves  Storing bit arrays  records of variable length  Ad 2: (Having the array, how to get records?)  Get record r (ordinal number of record)  Secondary index for record numbers  Array of pointers to records (replacement of bit arrays)  List of block occupations in the number of records PA152, Vlastislav Dohnal, FI MUNI, 2022 42 Bitmap index: Solutions  Ad 3: (Updating records, how to update index?) Record numbers fixed (not seq. file / sec. index)  Delete record  tombstone in file and clear a bit in the corresponding array  Insert record  append to end of file (a new record number)  append bit 1 to the corresponding bit array  if not existing, create new one PA152, Vlastislav Dohnal, FI MUNI, 2022 43 Bitmap index: Solutions  Ad 3: (Updating records, how to update index?) Record numbers are not fixed  Reorganize all arrays (delete 1 bit)  Update array of pointers to records  Rarely used Bitmaps – data (TPC-H) Settings: lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT ); create bitmap index b_lin_2 on lineitem(l_returnflag); create bitmap index b_lin_3 on lineitem(l_linestatus); create bitmap index b_lin_4 on lineitem(l_linenumber);  100000 rows ; cold buffer  Dual Pentium II (450MHz, 512KB), 512 MB RAM, 3x18GB drives (10000RPM), Windows 2000. PA152, Vlastislav Dohnal, FI MUNI, 2022 44 Bitmaps -- queries Queries: 1 attribute select count(*) from lineitem where l_returnflag = 'N'; 2 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3; 3 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3 and l_linestatus = 'F'; PA152, Vlastislav Dohnal, FI MUNI, 2022 45 Bitmaps  Order of magnitude improvement compared to table scan.  Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity. PA152, Vlastislav Dohnal, FI MUNI, 2022 46 A N R l_returnflag O F l_linestatus 0 2 4 6 8 10 12 1 2 3 Throughput(Queries/sec) linear scan bitmap PA152, Vlastislav Dohnal, FI MUNI, 2022 47 Multi-key / Composite index  Index on multiple attributes  Motivation: SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000  Alternatives: a) Index on one attribute + filtering b) Combine two indexes + intersection of candidate record pointers c) Index in index d) Concatenation of key values into one PA152, Vlastislav Dohnal, FI MUNI, 2022 48 Index on one attribute  SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000  Index on dept Filter found records using salary < 10000 I1 PA152, Vlastislav Dohnal, FI MUNI, 2022 49 Combining Indexes  Index on dept  Index on salary  Each index returns a list of candidate records Intersection of lists  query result dept=‘Toys’ salary<10000 bucket with record pointers PA152, Vlastislav Dohnal, FI MUNI, 2022 50 Index in index  Index on the first attribute  In leaves, pointers to embedded indexes on 2nd attribute  I1 on dept  Ix, x=2..k on salary  I2 contains only records having the same dept. value (Electrical Supp.) I1 I2 I3 Electrical Supp. Toys … PA152, Vlastislav Dohnal, FI MUNI, 2022 51 Index in index: Example  SELECT name, salary FROM emp WHERE dept=‘Toys’ AND salary < 10000 name=Peter dept=Toys salary=7000 Electr. Toys Garden… 10000 15000 17000 21000 5000 7000 15000 19000 Index dept Indexes on salary Record PA152, Vlastislav Dohnal, FI MUNI, 2022 52 Index in index  What queries can use this solution? SELECT name, salary FROM emp WHERE a) dept = ‘Toys’ AND salary ≥ 10000 b) dept = ‘Toys’ c) salary = 10000 name=Peter dept=Toys salary=7000 Electr. Toys Gardening 10000 15000 17000 21000 5000 7000 15000 19000 Index dept Indexy salary Record PA152, Vlastislav Dohnal, FI MUNI, 2022 53 Concatenation of key values  Analogous to an index on one attribute Key values are merged  String concatenation, number multiplication, …  Indexing rarely used unless (string) keys are compressed  Hashing Partitioned hash function PA152, Vlastislav Dohnal, FI MUNI, 2022 54 Partitioned hash function  Idea: Two keys Two hash functions One address h1 h2 010110 1110010 key1 key2 address PA152, Vlastislav Dohnal, FI MUNI, 2022 55 Partitioned hash function: Example  Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1  Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00  Records to insert  000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2022 56 Partitioned hash function: Example  Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1  Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00  Find  employess in Toys dept. with salary of 40000. 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2022 57 Partitioned hash function: Example  Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1  Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00  Find  employees with salary 30000 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2022 58 Partitioned hash function: Example  Dept h1(Electr.)=0 h1(Toys) =1 h1(Gardening) =1  Salary h2(10000) =01 h2(20000) =11 h2(30000) =10 h2(40000) =00  Find  employes in Toys dept. 000 001 010 011 100 101 110 111 PA152, Vlastislav Dohnal, FI MUNI, 2022 59 Another Multi-key Index  Grid index  Idea: records with key1=V3, key2=X2 V1 V2 Vn X1 X2 … Xm … key1 key2 PA152, Vlastislav Dohnal, FI MUNI, 2022 60 Grid Index: Properties  Efficient for exact match queries key1 = Vi and key2 = Xj key1 = Vi key2 = Xj  Range queries key1  V3 and key2 < X3  Rectangular area V1 V2 Vn X1 X2 … Xm … key1 key2 V3 X3  How to store grid on disk? one-dimensional array Tradeoff: grid dimensions vs. cell capacity  Disadvantage Fixed grid dimensions needed to calculate position of cell . Limited cell capacity PA152, Vlastislav Dohnal, FI MUNI, 2022 61 Grid Index: Implementation X1 X2 X3 X4 X1 X2 X3 X4 X1 X2 X3 X4 V1 V2 V3 PA152, Vlastislav Dohnal, FI MUNI, 2022 62 Grid Index: Implementation  Use buckets, i.e., indirection Grid cell points to a bucket  Grid is fixed-sized Overhead with indirection Allows allocating buckets linearly. V1 V2 X1 X2 X3 V3 V4 Buckets -- ---- -- ---- -- ---- -- ---- … PA152, Vlastislav Dohnal, FI MUNI, 2022 63 Grid Index: Defining Grid  Analysis of data and query types Find out dimensions of the grid Value on an axis can be an interval  binning  E.g., numeric domains Grows linearly 1 2 3 Toys Electr. Gardening 0-20000 1 20000-50000 2 50000- 3 Salary Grid Dept PA152, Vlastislav Dohnal, FI MUNI, 2022 64 Grid Index: Summary  Advantages Good for multikey indexing  Disadvantages Grid size is fixed; may waste space  Alternative is a hierarchical grid Choice of grid intervals (quantization)  uniform data distribution Index Design Guidelines  Good selectivity (i.e., low selectivity)  “Short” attribute values Increase tree fan-out  On “join” attributes  Prefer more single-attribute indexes for one multi-attribute one  Few indexes for highly-updated tables  No indexes for tiny tables PA152, Vlastislav Dohnal, FI MUNI, 2022 65 Lecture’s Takeaways  Techniques of hashing  Recall handling duplicate keys i.e. multiple records with the same key value.  Alternative indexes Bitmap index  do not mix up with bitmap scan in Pg’s explain plan Grid index  Further terms (follows…) PA152, Vlastislav Dohnal, FI MUNI, 2022 66 Further Terms  Clustered index  = index-sequential file / B+-file  Records are stored in leaf nodes and cannot be accessed in any other way (than by this index)  Non-clustered index  = secondary index / B+-tree  Covering index  Query can be fully answered using the index only  I.e. records are not accessed.  MS SQL Server: Index with included columns  Not a multi-attribute index, but leaf node values are enriched with values of the included cols.  Indexed view  materialized view with clustered index  Multidim. data and indexes (R-trees, Quad Trees, kd-trees) PA152, Vlastislav Dohnal, FI MUNI, 2022 67