PA152: Efficient Use of DB 3. Query Processing Vlastislav Dohnal PA152, Vlastislav Dohnal, FI MUNI, 2024 2 Query Processing ◼ Procedure: Query in SQL Checking syntax and semantics ◼ Parse Tree – in relational algebra Logical query plan ◼ Plan modifications, estimate relation sizes Physical query plan ◼ Concrete algorithms for individual operations ◼ Estimate costs Evaluation ◼ Choose the “best” physical query plan PA152, Vlastislav Dohnal, FI MUNI, 2024 3 Example ◼ Relation R(A,B,C) S(C,D,E) ◼ Query select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2 PA152, Vlastislav Dohnal, FI MUNI, 2024 4 Example R A B C S C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2 PA152, Vlastislav Dohnal, FI MUNI, 2024 5 Example R A B C S C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 Result: B D 2 x PA152, Vlastislav Dohnal, FI MUNI, 2024 6 How to evaluate this query? ◼ Cartesian product ◼ Selecting records ◼ Projection 1. way PA152, Vlastislav Dohnal, FI MUNI, 2024 7 R  S R.A R.B R.C S.C S.D S.E a 1 10 10 x 2 a 1 10 20 y 2 . . c 2 10 10 x 2 . . PA152, Vlastislav Dohnal, FI MUNI, 2024 8 This record matches R  S R.A R.B R.C S.C S.D S.E a 1 10 10 x 2 a 1 10 20 y 2 . . c 2 10 10 x 2 . . Output – query result select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2 PA152, Vlastislav Dohnal, FI MUNI, 2024 9 Describing Evaluation Plans ◼ Using relational algebra  select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2 B,D [ sR.A=‘c’ S.E=2  R.C = S.C (RS)] ◼ Example of Plan 1: Query plan B,D sR.A=‘c’ S.E=2  R.C=S.C  R S PA152, Vlastislav Dohnal, FI MUNI, 2024 10 Describing Evaluation Plans ◼ Example of Plan 2: B,D sR.A = ‘c’ sS.E = 2 R S natural join PA152, Vlastislav Dohnal, FI MUNI, 2024 11 Describing Evaluation Plans ◼ Example of Plan 2: R S A B C s (R) s(S) C D E a 1 10 A B C C D E 10 x 2 b 1 20 c 2 10 10 x 2 20 y 2 c 2 10 20 y 2 30 z 2 d 2 35 30 z 2 40 x 1 e 3 45 50 y 3 PA152, Vlastislav Dohnal, FI MUNI, 2024 12 Describing Evaluation Plans ◼ Plan 3: (physical plan) Assume an index on R.A and on S.C Scan index R.A for looking up records of R satisfying R.A = “c” ◼ For each record found, take value of R.C to scan index on S.C to get matching records of S ◼ Filter out records of S, where S.E ≠ 2 Join the corresponding records of R and S Project on B,D PA152, Vlastislav Dohnal, FI MUNI, 2024 13 R S A B C C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 A C I1 I2 PA152, Vlastislav Dohnal, FI MUNI, 2024 14 R S A B C C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 A C I1 I2 =‘c’ PA152, Vlastislav Dohnal, FI MUNI, 2024 15 R S A B C C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 A C I1 I2 =“c” <10,x,2> PA152, Vlastislav Dohnal, FI MUNI, 2024 16 R S A B C C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 A C I1 I2 =“c” <10,x,2> Check e=2? Output: <2,x> PA152, Vlastislav Dohnal, FI MUNI, 2024 17 next record: R S A B C C D E a 1 10 10 x 2 b 1 20 20 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50 y 3 A C I1 I2 =“c” <10,x,2> Check=2? Output: <2,x> … PA152, Vlastislav Dohnal, FI MUNI, 2024 18 Overview on Query Optimization parser translate rules estimate sizes create physical query plans estimate costs of p.q.p. choose best execute {P1,P2,…..} {(P1,C1),(P2,C2)...} Pi result SQL query parse tree logical query plan „improved“ l.q.p. l.q.p. with sizes statistics data PA152, Vlastislav Dohnal, FI MUNI, 2024 19 Example: SQL query ◼ Relations  StarsIn(title, year, starName)  MovieStar(name, birthdate) ◼ Query  Select movies with stars born in 1960:  SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE extract(year from birthdate) = 1960 ); PA152, Vlastislav Dohnal, FI MUNI, 2024 20 Example: Parse Tree SELECT FROM WHERE IN title StarsIn ( ) starName SELECT FROM WHERE = name MovieStar extract(year from birthdate) 1960 PA152, Vlastislav Dohnal, FI MUNI, 2024 21 Example: Parse Tree in Rel. Algebra ◼ Selection has two arguments Must be transformed… ◼ IN operator i.e., avoiding nested queries (sub-selects) title s StarsIn IN name sextract(year from birthdate) = 1960 starName MovieStar PA152, Vlastislav Dohnal, FI MUNI, 2024 22 Example: Logical Query Plan ◼ IN operator replaced with product title sstarName=name StarsIn name sextract(year from birthdate) = 1960 MovieStar  PA152, Vlastislav Dohnal, FI MUNI, 2024 23 Example: Improving Logical Plan ◼ Substitution of cross product and selection by join ◼ Next option Push project to relation StarsIn? ◼ Depends on:  index on StarsIn.starName?  many attributes in StarsIn title starName=name StarsIn name sextract(year from birthdate) = 1960 MovieStar ◼ Before generating physical plans ◼ Influences estimation of evaluation costs selection of algorithm PA152, Vlastislav Dohnal, FI MUNI, 2024 24 Example: Estimate Result Sizes Estimate size StarsIn MovieStar name sextract(year from birthdate) = 1960 PA152, Vlastislav Dohnal, FI MUNI, 2024 25 Example: Possible Physical Plan Parameters: relation order, memory size, project attributes,... Index join SEQ scan index scan Parameters: select condition, ... StarsIn MovieStar PA152, Vlastislav Dohnal, FI MUNI, 2024 26 Example: Estimate Costs Logical Query Plan Phys. Plan1 Plan2 …. Plann Cost1 Cost2 …. Costn Pick the plan with the best cost! PA152, Vlastislav Dohnal, FI MUNI, 2024 27 Query Optimization ◼ Relational algebra level ◼ Detailed query plan level Estimate costs ◼ Without indexes ◼ With indexes ◼ Consider size of memory buffers Generate and compare plans PA152, Vlastislav Dohnal, FI MUNI, 2024 28 Optimization in Relational Algebra ◼ Transformation rules Must preserve equivalence What are good transformations? PA152, Vlastislav Dohnal, FI MUNI, 2024 29 Transformation Rules in R.A. ◼ Natural join Relation order is not important since all attributes are preserved ◼ Example: R S = S R (R S) T = R (S T) R S S T RT PA152, Vlastislav Dohnal, FI MUNI, 2024 30 Transformation Rules ◼ Same for cartesian product and union R  S = S  R (R  S)  T = R  (S  T) R  S = S  R R  (S  T) = (R  S)  T PA152, Vlastislav Dohnal, FI MUNI, 2024 31 Transformation Rules ◼ Selects sp1p2(R) = sp1p2(R) = sp1p2(R) = sp1 [ sp2 (R)] [ sp1 (R)]  [ sp2 (R)] [ sp1 (R)]  [ sp2 (R)] PA152, Vlastislav Dohnal, FI MUNI, 2024 32 Question of Tuple Duplicates ◼ Sets vs. bags?  Relations are bags (in reality) ◼ Example  R = {a,a,b,b,b,c}  S = {b,b,c,c,d} ◼ R  S = ?  MIN: R  S = {b,b,c} SQL: INTERSECT ALL ◼ R  S = ?  SUM: R  S = {a,a,b,b,b,b,b,c,c,c,d} SQL: UNION ALL  MAX: R  S = {a,a,b,b,b,c,c,d} R and S can be observed as ouput of some operations, e.g., B,D [ sA=‘c’ (tbl1)] PA152, Vlastislav Dohnal, FI MUNI, 2024 33 Option MAX: Select Decomposition ◼ Select decomposition: ◼ Example: R={a,a,b,b,b,c} a,b satisfy p1; b,c satisfy p2 sp1p2 (R) = sp1(R)  sp2(R) sp1p2 (R) = {a,a,b,b,b,c} sp1(R) = {a,a,b,b,b} sp2(R) = {b,b,b,c} sp1(R) max sp2(R) = {a,a,b,b,b,c} PA152, Vlastislav Dohnal, FI MUNI, 2024 34 Choice of Correct Option ◼ Pragmatic solution for  Use “SUM” for bag union “MAX” for dividing disjunctive predicates () ◼ Some rules cannot be applied to bags! Associativity of except: R – (S – T) Distributivity: R  (S  T) != (R  S)  (R  T) PA152, Vlastislav Dohnal, FI MUNI, 2024 35 Transformation Rules ◼ Notation: X = set of attributes Y = set of attributes XY = X  Y ◼ Project pxy (R) = px [py (R)] PA152, Vlastislav Dohnal, FI MUNI, 2024 36 Transformation Rules ◼ Combining select and natural join ◼ Let p = expr. containing only attrs. of R q = expr. containing only attrs. of S m = expr. containing attrs. of both R, S [sp (R)]  S R  [sq (S)] sp (R  S) = sq (R  S) = PA152, Vlastislav Dohnal, FI MUNI, 2024 37 Transformation Rules ◼ Combining select and natural join Other rules can be derived spq (R  S) = [sp (R)]  [sq (S)] spqm (R  S) = sm [(sp (R))  (sq (S))] spq (R  S) = [(sp (R))  S] max [R (sq (S))] Transformation Rules ◼ Combining select and natural join Example of rule derivation PA152, Vlastislav Dohnal, FI MUNI, 2024 38 spq (R  S) = sp [sq (R  S) ] = sp [ R  sq (S) ] = [sp (R)]  [sq (S)] PA152, Vlastislav Dohnal, FI MUNI, 2024 39 Transformation Rules ◼ Combining select and natural join Example of rule derivation Let m = expr. containing only attrs. common in R and S, but does not compare them [sm (R)]  [sm (S)]sm (R  S) = PA152, Vlastislav Dohnal, FI MUNI, 2024 40 Transformation Rules ◼ Combining project and select ◼ Let x = attribute subset of R z = attributes referenced in expr. P (subset of R) (sp [ px (R) ])px pxz px[sp (R) ] = PA152, Vlastislav Dohnal, FI MUNI, 2024 41 Transformation Rules ◼ Combining project and natural join ◼ Let x = attribute subset of R y = attribute subset S z = attributes common in R and S pxy([pxz (R) ]  [pyz (S) ]) pxy (R  S) = ◼ Combining previous with select PA152, Vlastislav Dohnal, FI MUNI, 2024 42 Transformation Rules pxy (sp (R  S)) = pxy (sp [pxz’ (R)  pyz (S)]) z’ = z  {attributes referenced in P} ◼ Combining project, select and Cartesian product PA152, Vlastislav Dohnal, FI MUNI, 2024 43 Transformation Rules pxy (sp (R  S)) = pxy (sp [px’ (R)  py (S)]) x’ = x  {attributes referenced in P} PA152, Vlastislav Dohnal, FI MUNI, 2024 44 Transformation Rules ◼ Combining select and union ◼ Combining select and except Select can also be applied to S ◼ May be convenient for shrinking relation before doing except ◼ Are there some limits on P in both above? sp(R sum S) = sp(R) sum sp(S) sp(R – S) = sp(R) – S = sp(R) – sp(S) PA152, Vlastislav Dohnal, FI MUNI, 2024 45 Good Transformations ◼ Select early ◼ Project early Example: ◼ R(A,B,C,D,E,F,G,H,I,J) result={E} ◼ Filter using P: (A=3)  (B=“cat”) pE (sp (R)) vs. pE (sp (pABE(R))) px [sp (R)] → px (sp [px’ (R)]) sp (R  S) → [sp (R)]  S x’ = x  {attributes referenced in P} PA152, Vlastislav Dohnal, FI MUNI, 2024 46 Good Transformations sp1p2 (R) → sp1 [sp2 (R)] → sp2 [sp1 (R)] → [ sp1 (R)]  [ sp2 (R)] sp1p2 (R) → sp1(R) max sp2(R) R  S → S  R PA152, Vlastislav Dohnal, FI MUNI, 2024 47 Good Transformations ◼ Assume indexes On A and on B s(A=3)  (B=“cat”)(R) = s(A=3)(R)  s(B=“cat”)(R) A = 3B = ‘cat’ Record pointer: Intersect pointers to get pointers to matching tuples Good Transformations ◼ Previous case in SQL and PostgreSQL PA152, Vlastislav Dohnal, FI MUNI, 2024 48 Bitmap Heap Scan on customers (cost=25.76..61.62 rows=10 width=13) Recheck Cond: ((username)::text < 'user100'::text) -> BitmapAnd (cost=25.76..25.76 rows=10 width=0) -> Bitmap Index Scan on ix_cust_username (cost=0.00..5.75 rows=200 w Index Cond: ((username)::text < 'user100'::text) -> Bitmap Index Scan on customers_pkey (cost=0.00..19.75 rows=1000 w Index Cond: (customerid < 1000) SELECT * FROM customer WHERE username < ‘user100’ AND customerid < 1000; Good Transformations ◼ Push select to relations → apparently OK  But: Firstly, move them as far as possible; next push them back ◼ Example:  Relations: StarsIn(title, year, starName) Movie(title, year, studioName)  View: create view MoviesOf1996 as select * from Movie where year = 1996; ◼ Query: select starName, studioName from MoviesOf1996 natural join StarsIn; PA152, Vlastislav Dohnal, FI MUNI, 2024 49 starName,studioName MoviesOf1996 StarsIn syear=1996 starName,studioName Movie StarsIn syear=1996 starName,studioName Movie StarsIn syear=1996 prerequisite to be able to do it!!! PA152, Vlastislav Dohnal, FI MUNI, 2024 50 Good Transformations: Summary ◼ Recommendations: No transformation is always good. Usually, good to ◼ select early ◼ project early ◼ Eliminate common sub-expressions ◼ Eliminate tuple duplicates PA152, Vlastislav Dohnal, FI MUNI, 2024 51 Query Processing: Overview ◼ Relational algebra level Transformation rules Apply good rules ◼ Detail (physical) query plan level Estimate costs Generate and compare plans PA152, Vlastislav Dohnal, FI MUNI, 2024 52 Estimating Cost of Query Plan ◼ Why? To “grade” a physical plan. Choose proper algorithms to save time and RAM during evaluation. ◼ Many rows may need more RAM buffers. ◼ Multiple scans over data to save RAM. ◼ Why saving RAM? More queries to be processed at parallel. Individual operations of one query may need to keep all data in RAM buffers. ◼ May lead to storing it to a disk. PA152, Vlastislav Dohnal, FI MUNI, 2024 53 Estimating Cost of Query Plan ◼ The costs are in an arbitrary unit. not milliseconds or any other time unit. ◼ Process: 1. Estimate size of result ◼ Contributes to CPU costs 2. Estimate number of I/O operations ◼ Contributes to “slowness” of the query ◼ Done for each operation in the logical query plan ◼ Based on statistics (metadata about tables) PA152, Vlastislav Dohnal, FI MUNI, 2024 54 Estimating Result Size ◼ Keep statistics for relation R T(R) – # tuples in R S(R) – # of bytes in each R tuple ◼ S(R,A) – length in bytes of values of attribute A B(R) – # of blocks to hold all R tuples V(R, A) – # distinct values in R for attribute A ◼ Good estimates need Statistics up to date! PA152, Vlastislav Dohnal, FI MUNI, 2024 55 Statistics Example ◼ Relation R Attribute A – string, max. 20 ◼ S(R,A) = 3 ← average length Attribute B – integer, 4 B Attribute C – date, 8 B Attribute D – string, max. 5 ◼ S(R,D) = 1 ◼ Statistics T(R) = 5 S(R) = 16 V(R,A) = 3 V(R,B) = 1 V(R,C) = 5 V(R,D) = 4 A B C D cat 1 10.2.98 a cat 1 20.3.98 b dog 1 30.4.98 a dog 1 14.6.98 c bat 1 15.6.98 d Mind encoding – utf8 takes more bytes per char. PA152, Vlastislav Dohnal, FI MUNI, 2024 56 Estimating Result Size ◼ Cartesian product W = R1  R2 T(W) = T(R1)  T(R2) S(W) = S(R1) + S(R2) PA152, Vlastislav Dohnal, FI MUNI, 2024 57 Estimating Result Size ◼ Select W = sZ=val(R) S(W) = S(R) T(W) = ? ◼ W = sA=‘cat’(R) ◼ W2 = sB=2(R) T(W2) = ? V(R,A)=3 V(R,B)=1 V(R,C)=5 V(R,D)=4 T(W) = T(R) V(R,A) = 5/3 A B C D cat 1 10.2.98 a cat 1 20.3.98 b dog 1 30.4.98 a dog 1 14.6.98 c bat 1 15.6.98 d PA152, Vlastislav Dohnal, FI MUNI, 2024 58 Estimating Result Size ◼ Assumption of last estimate Values are uniformly distributed over possible V(R,Z) values! ◼ f(val) = 1 / V(R,Z) ◼ T(sZ=val(R)) = T(R)  f(val) ◼ Alternative Uniform distribution of values over domain with DOM(R,Z) values ◼ # values in domain is denoted as DOM(R,Z) ◼ f(val) = 1 / DOM(R,Z) PA152, Vlastislav Dohnal, FI MUNI, 2024 59 Estimating Result Size ◼ Select W = sZ=val(R) T(W) = ? ◼ using DOM(R,*) ◼ Formula derivation W = sC=val(R) ◼ T(W) = f(val)  T(R) = 1/10 * 5 = 0,5 W = sB=val(R) ◼ T(W) = (1/10)*5 W = sA=val(R) ◼ T(W) = 0,5 V(R,A)=3 DOM(R,A)=10 V(R,B)=1 DOM(R,B)=10 V(R,C)=5 DOM(R,C)=10 V(R,D)=4 DOM(R,D)=10 A B C D cat 1 10.2.98 a cat 1 20.3.98 b dog 1 30.4.98 a dog 1 14.6.98 c bat 1 15.6.98 d PA152, Vlastislav Dohnal, FI MUNI, 2024 60 Estimating Result Size ◼ Select W = sZ=val(R)  Original solution  Alternative  Mostly used: ◼ histogram over the most frequent ones ◼ plus, uniform distribution of remaining ones. T(W) = T(R) V(R,Z) V(R,A)=3 DOM(R,A)=10 V(R,B)=1 DOM(R,B)=10 V(R,C)=5 DOM(R,C)=10 V(R,D)=4 DOM(R,D)=10 T(W) = T(R) DOM(R,Z) A B C D cat 1 10.2.98 a cat 1 20.3.98 b dog 1 30.4.98 a dog 1 14.6.98 c bat 1 15.6.98 d PA152, Vlastislav Dohnal, FI MUNI, 2024 61 Estimating Result Size ◼ Select W = sZ≥val(R) Solution 1 ◼ T(W) = T(R) / 2 Solution 2 ◼ T(W) = T(R) / 3 Solution 3 ◼ Estimate values in range – using histogram. PA152, Vlastislav Dohnal, FI MUNI, 2024 62 Estimating Result Size ◼ Select – estimate values in range ◼ Calculate fraction of unique values in range T(W) = f  T(R) Min=1 V(R,Z)=10 W= sz  15 (R) Max=20 ZR f = 20-15+1 20-1+1 6 20 = PA152, Vlastislav Dohnal, FI MUNI, 2024 63 Estimate # values: Histogram ◼ Histogram of attribute values Replaces V(R,A) and DOM(R,A) More precise estimates ◼ Distinct values Few → abs. number per each Many → quantization ◼ Ranges (intervals) of equal size (in recs) ◼ Percentiles ◼ Most frequent ones only  others all together (i.e., uniformly distributed) 10 23 37 42 8 15 28 37 f PA152, Vlastislav Dohnal, FI MUNI, 2024 64 Estimating Result Size ◼ Select W = szval(R) T(W) = T(R)(1 – f(val)) = T(R)(1 – 1/V(R,Z)) Typical solution ◼ T(W) = T(R) Mind special cases ◼ e.g., V(R,Z) = 1 and the value is “val”. = T(R) – T(R) V(R,Z) PA152, Vlastislav Dohnal, FI MUNI, 2024 65 Estimating Result Size ◼ Natural join W = R1  R2 Notation ◼ X – attributes of R1 ◼ Y – attributes of R2 ◼ Case 1 X  Y = Ø Same as R1  R2 ◼ Case 2 X  Y = Z Follows… PA152, Vlastislav Dohnal, FI MUNI, 2024 66 Estimating Result Size: Natural Join ◼ Assumption Z={A} and also: V(R1,A) ≤ V(R2,A) → every A value in R1 is in R2 V(R1,A) ≥ V(R2,A) → every A value in R2 is in R1 R1 A B C R2 A DR1 R2 PA152, Vlastislav Dohnal, FI MUNI, 2024 67 Estimating Result Size: Natural Join ◼ V(R1,A) ≤ V(R2,A) ◼ One record is matched with T(R2) / V(R2,A) records Assumption of uniform distribution ◼ Result: R1 A B C R2 A D Take 1 tuple Match T(W) = T(R2) V(R2,A) T(R1)  PA152, Vlastislav Dohnal, FI MUNI, 2024 68 Estimating Result Size: Natural Join ◼ Overview of both variants V(R1,A) ≤ V(R2,A) V(R2,A) ≤ V(R1,A) ◼ Difference is in denominator T(W) = T(R2) V(R2,A) T(R1)  T(W) = T(R1) V(R1,A) T(R2)  PA152, Vlastislav Dohnal, FI MUNI, 2024 69 Estimating Result Size: Natural Join ◼ In general W = R1 R2 T(W) = T(R1)  T(R2) max { V(R1,A), V(R2,A) } PA152, Vlastislav Dohnal, FI MUNI, 2024 70 Estimating Result Size: Natural Join ◼ Alternate solution Values uniformly distributed over domain ◼ One rec. matches with T(R2)/DOM(R2,A) recs. ◼ Result: R1 A B C R2 A D Take 1 tuple Match T(W) = T(R1)  T(R2) DOM(R2,A) = T(R1)  T(R2) DOM(R1,A) assuming to be same PA152, Vlastislav Dohnal, FI MUNI, 2024 71 Estimating Result Size: Natural Join ◼ W = R1(X), R2(Y), X  Y = Z ◼ Size of record S(W) = S(R1) + S(R2) – S(R1,Z) Valid for all cases ◼ Number of tuples if Z contains more attrs.? Assume they are independent R1 R2 T(W) = T(R1)  T(R2) max{ V(R1,A1), V(R2,A1) }  max{ V(R1,A2), V(R2,A2) } PA152, Vlastislav Dohnal, FI MUNI, 2024 72 Estimating Size: Project, Select ◼ Project W = AB(R) T(W)=T(R) S(W)=S(R, AB) ◼ Select W = sA=aB=b(R) S(W)=S(R) Let n=T(R), T(W) = n(1 – (1-m1/n)(1-m2/n)) ◼ m1=T(R) / V(R,A) m2=T(R) / V(R,B) PA152, Vlastislav Dohnal, FI MUNI, 2024 73 Estimating Size: Set Operations ◼ Union, intersect, except  , , – ◼ T(W) – choose average size  E.g., ◼ T(R  S) = T(R) + T(S) … if  means UNION ALL ◼ T(R  S) = [ max{T(R), T(S)} , T(R) + T(S) ]  So, use: T(R  S) = avg{ max{T(R), T(S)} , T(R) + T(S) } ▪ If set union is evaluated ◼ T(R – S) = T(R) – 1/2T(S) ◼ T(R  S) = avg{ 0, min{T(R), T(S)} } ◼ DISTINCT  All attributes ◼ min{ 1/2T(R), (V(R,A)*V(R,B)*…) } PA152, Vlastislav Dohnal, FI MUNI, 2024 74 Estimating Result Size ◼ For complex expressions, intermediate results and their stats are needed ◼ Example W = [sA=a(R1)] R2 denote as U ◼ T(U) = T(R1) / V(R1,A) S(U) = S(R1) Need V(U,*) to estimate costs of W! ◼ at least for the join attributes PA152, Vlastislav Dohnal, FI MUNI, 2024 75 Estimate Number of Values ◼ To estimate V(U,*) U = sA=a(R1) Assume R1(A,B,C,D) PA152, Vlastislav Dohnal, FI MUNI, 2024 76 Estimate # values: Example ◼ Relation R1 ◼ U = sA=a(R1) ◼ Estimates V(U,A) = 1 V(U,B) = 1 V(U,C) = 1 .. (T(R1) / V(R1,A)) V(U,D) = 1 .. (T(R1) / V(R1,A)) V(R,A)=3 V(R,B)=1 V(R,C)=5 V(R,D)=4 A B C D cat 1 10.2.98 a cat 1 20.3.98 b dog 1 30.4.98 a dog 1 14.6.98 c bat 1 15.6.98 d sth. in between Not easy, if there is no dependence btw. A and B/C/D. ➔ functional dependencies PA152, Vlastislav Dohnal, FI MUNI, 2024 77 Estimate # values: Reality ◼ Common solution U = sA=a(R1) V(U,A) = 1 V(U,K) = T(U) ◼ Primary key K of R1 is an exception V(U,*) = V(R,*) or V(U,*) = T(U) ◼ As a result, original V(R,*) can be used V(U,*) = min { V(R,*), T(U) } PA152, Vlastislav Dohnal, FI MUNI, 2024 78 Estimate # values ◼ Join U = R1(A,B) R2(A,C) ◼ Estimates: V(U,A) = min{ V(R1,A), V(R2,A) } V(U,B) = min { V(R1,B), T(U) } V(U,C) = min { V(R2,C), T(U) } PA152, Vlastislav Dohnal, FI MUNI, 2024 79 Estimate # values ◼ Example Z = R1(A,B) R2(B,C) R3(C,D) ◼ T(R1) = 1000 V(R1,A)=50 V(R1,B)=100 ◼ T(R2) = 2000 V(R2,B)=200 V(R2,C)=300 ◼ T(R3) = 3000 V(R3,C)=90 V(R3,D)=500 Intermediate result ◼ U = R1(A,B) R2(B,C) ◼ T(U) = T(R1)  T(R2) / max{ V(R1,B), V(R2,B) } = 10k ◼ V(U,A) = 50 ◼ V(U,B) = min{ V(R1,B), V(R2,B) } = 100 ◼ V(U,C) = 300 PA152, Vlastislav Dohnal, FI MUNI, 2024 80 Estimate # values ◼ Final result Z = U R3(C,D) ◼ U(A,B,C) Estimates: ◼ T(Z) = 10 000  3 000 / 300 = 100 000 ◼ V(Z,A) = 50 ◼ V(Z,B) = 100 ◼ V(Z,C) = 90 ◼ V(Z,D) = 500 PA152, Vlastislav Dohnal, FI MUNI, 2024 81 Example of Stats in PostgreSQL ◼ Connect to student instance of PostgreSQL See the first lecture for instructions ◼ Check the schema xdohnal Relations: predmet, skupina, hotel ◼ Observe both the relation and attribute statistics Explanation of individual items in doc ◼ https://www.postgresql.org/docs/current/view-pg-stats.html PA152, Vlastislav Dohnal, FI MUNI, 2024 82 Example of Stats in PostgreSQL ◼ Relation hotel PA152, Vlastislav Dohnal, FI MUNI, 2024 83 Example of Stats in PostgreSQL ◼ Attribute hotel.id ◼ Attribute hotel.name PA152, Vlastislav Dohnal, FI MUNI, 2024 84 Example of Stats in PostgreSQL ◼ Attribute hotel.state ◼ Attribute hotel.distance_to_center PA152, Vlastislav Dohnal, FI MUNI, 2024 85 Summary ◼ Estimating size of results is an “art” ◼ Do not forget: Statistics must be kept up to date for precise estimates → necessity to maintain them during table updates What are the costs to update stats? PA152, Vlastislav Dohnal, FI MUNI, 2024 86 Statistics Update ◼ Stats do not change rapidly in short time ◼ Inaccurate stats may also help ◼ Instant stats update Can become a bottleneck ◼ Stats are used very often ◼ → Do not update often PA152, Vlastislav Dohnal, FI MUNI, 2024 87 Statistics Update ◼ Run periodically After some time elapses After some number of updates are made ◼ Slow for V(R,A) Especially if histograms are kept → Use a data sample ◼ If almost all are distinct → V(R,A)T(R) ◼ If not many are distinct → we likely saw most of them PA152, Vlastislav Dohnal, FI MUNI, 2024 88 Estimating Costs of Query Plan: Outline ◼ Estimating the size of results Already done ◼ Estimating # of IOs Next lecture… ◼ Generate and compare plans Takeaways ◼ Query plan What it is How to transform it and why ◼ Term “costs” ◼ Types of statistics maintained by DBMS ◼ How to estimate result size of an operation and the other statistics PA152, Vlastislav Dohnal, FI MUNI, 2024 89