PA152: Efficient Use of DB 8. Summary of Algorithm Costs and Limits Vlastislav Dohnal Costs of One-Pass Algorithms Algorithm Costs I/O Relation Size Memory Blocks Pipelining Distinct B(R) B(R) ≤ M-2 1-input, 1-output, M-2 search table yes Group by B(R) B(R) ≤ M-1 1-input, 0-output, M-1 aggregates no Set union B(R)+B(S) B(R)+B(S) ≤ M-2 1-input, 1-output, M-2 search table yes Set intersection B(R)+B(S) min(B(R),B(S)) ≤ M-2 1-input, 1-output, M-2 search table yes, after reading S Set diff (R-S) B(R)+B(S) B(R)+B(S) ≤ M-2 1-input, 1-output, M-2 search table yes, after reading S Set diff (S-R) B(R)+B(S) B(S) ≤ M-1 1-input, 0-output, M-1 search table no Bag union B(R)+B(S) unlimited 1-input, 1-output yes Bag intersection B(R)+B(S) B(S) ≤ M-2 1-input, 1-output, M-2 search table yes, after reading S Bag diff (R-S) B(R)+B(S) B(S) ≤ M-2 1-input, 1-output, M-2 search table yes, after reading S Bag diff (S-R) B(R)+B(S) B(S) ≤ M-1 1-input, 0-output, M-1 search table no Cross join B(R)+B(S) B(S) ≤ M-2 1-input, 1-output, M-2 cache for S yes, after reading S (any) Join B(R)+B(S) B(S) ≤ M-2 1-input, 1-output, M-2 search table yes, after reading S PA152, Vlastislav Dohnal, FI MUNI, 2021 2 * S is always the smaller relation. Costs of Join Algorithms Algorithm Costs I/O Relation Size Memory Blocks Pipelining Block-based Nested-loop *** B(S)(1+B(R)) unlimited 2-input, 1-output yes Cached BB NL *** B(S)/(M-2)  (M-2 + B(R)) unlimited 1-input, 1-output, M-2 cache of S yes Merge Join (w/o sorting) B(R)+B(S) unlimited 2-input, 1-output (+x when too many matches) yes Merge Join (incl. sorting) 5(B(R) + B(S)) B(R) ≤ M(M-1) + 1 sorting: M-input for a run, 0-output merging: (M-1)-runs, 1-output joining: 2-input, 1-output (+x when too many matches) yes, after sorting R&S Sort Join 3(B(R) + B(S)) 𝑀 = 𝐵(𝑅) 𝑀 + 𝐵(𝑆) 𝑀 + 1 approx. B(R)+B(S) ≤ M(M-1) sorting: M-input for a run, 0-output joining: (M-1)-runs, 1-output (+x when too many matches) yes, after sorting R&S Index Join (R.Y index) (max costs) B(S) + T(S)  (HT + ) e.g.,  = T(R)/V(R,Y) unlimited 2-input, 1-output (+x for index cache) yes PA152, Vlastislav Dohnal, FI MUNI, 2021 3 * S is always the smaller relation. ** Y are common attributes. *** 1.5-pass algorithms Costs of Join Algorithms Algorithm Costs I/O Relation Size Memory Blocks Pipelining Hash Join 3(B(R) + B(S)) B(S) ≤ (M-2)(M-1) hashing: 1-input, M-1-buckets joining: 1-bucket of R, 1-output, M-2-a bucket of S yes, after hashing R&S Hybrid HJ *** 3 𝐵 𝑅 + 𝐵(𝑆) − 2 𝐵 𝑅 + 𝐵(𝑆) 𝐵 𝑅 𝐵 𝑆 ≪ 𝑀2 𝑀 = 𝐵(𝑅) 𝐵 𝑅 + 𝐵 𝑅 + 1 hashing: 1-input, x= 𝐵(𝑆) -1st bucket of S, M-1-x-buckets joining: 1-bucket of R, x-bucket of S, 1-output yes, after hashing S Pointer HJ B(S)+B(R)+T(R)   e.g.,  = T(S)/V(S,Y) “unlimited”, hash index on S.Y + pointers must fit in M indexing: 1-input, M-3-hash index on S joining: 1-block of R, 1-block of S, 1-output, M-3-hash index on S yes, after indexing S PA152, Vlastislav Dohnal, FI MUNI, 2021 4 * S is always the smaller relation. ** Y are common attributes. *** keeping just 1 bucket of S in memory Costs of Two-Pass Algorithms Algorithm Costs I/O Relation Size Memory Blocks Pipelining Joins – see slides above Distinct (sorting) 3B(R) B(R) ≤ M(M-1) + 1 sorting: M-input for a run, 0-output distinct: (M-1)-runs, 1-output yes, after initial sorting Distinct (hashing) 3B(R) B(R) ≤ (M-1)(M-1) hashing: 1-input, M-1-buckets distinct: M-1-bucket, 1-output yes, after hashing Group by (sorting) 3B(R) B(R) ≤ M(M-2) sorting: M-input for a run, 0-output group by: (M-2)-runs, 1-output, 1-aggregates yes, after initial sorting Group by (hashing) 3B(R) hashing: 1-input, M-1-buckets group by: M-2-bucket, 1-output, 1-aggregates yes, after hashing PA152, Vlastislav Dohnal, FI MUNI, 2021 5 * S is always the smaller relation. Costs of Two-Pass Algorithms Algorithm Costs I/O Relation Size Memory Blocks Pipelining Set union (sorting) 3(B(R)+B(S)) 𝑀 = 𝐵(𝑅) 𝑀 + 𝐵(𝑆) 𝑀 + 1 approx. B(R)+B(S) ≤ M(M-1) sorting: M-input for a run, 0-output union: M-1-all runs, 1-output yes, after initial sorting Set union (hashing) 3(B(R)+B(S)) B(S) ≤ (M-2)(M-1) hashing: 1-input, M-1-buckets union: 1-buckets of R, 1-output, M-2-bucket of S yes, after hashing Set/Bag ∩, - (sorting) 3(B(R)+B(S)) 𝑀 = 𝐵(𝑅) 𝑀 + 𝐵(𝑆) 𝑀 + 1 approx. B(R)+B(S) ≤ M(M-1) sorting: M-input for a run, 0-output oper: M-1-all runs, 1-output, (+1 for counts) yes, after initial sorting Set/Bag ∩,S-R (hashing) 3(B(R)+B(S)) B(S) ≤ (M-2)(M-1) hashing: 1-input, M-1-buckets oper: 1-buckets of R, 1-output, M-2-bucket of S yes, after hashing R Set/Bag R-S (hashing) 3(B(R)+B(S)) B(R) ≤ (M-2)(M-1) hashing: 1-input, M-1-buckets diff: 1-buckets of S, 1-output, M-2-bucket of R yes, after hashing PA152, Vlastislav Dohnal, FI MUNI, 2021 6 * S is always the smaller relation.