PA152: Efficient Use of DB
3. Query Processing
Vlastislav Dohnal
PA152, Vlastislav Dohnal, FI MUNI, 2024 2
Query Processing
◼ Procedure:
Query in SQL
Checking syntax and semantics
◼ Parse Tree – in relational algebra
Logical query plan
◼ Plan modifications, estimate relation sizes
Physical query plan
◼ Concrete algorithms for individual operations
◼ Estimate costs
Evaluation
◼ Choose the “best” physical query plan
PA152, Vlastislav Dohnal, FI MUNI, 2024 3
Example
◼ Relation
R(A,B,C)
S(C,D,E)
◼ Query
select B,D
from R,S
where R.C=S.C and R.A=‘c’ and S.E=2
PA152, Vlastislav Dohnal, FI MUNI, 2024 4
Example
R A B C S C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2
PA152, Vlastislav Dohnal, FI MUNI, 2024 5
Example
R A B C S C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
Result: B D
2 x
PA152, Vlastislav Dohnal, FI MUNI, 2024 6
How to evaluate this query?
◼ Cartesian product
◼ Selecting records
◼ Projection
1. way
PA152, Vlastislav Dohnal, FI MUNI, 2024 7
R  S R.A R.B R.C S.C S.D S.E
a 1 10 10 x 2
a 1 10 20 y 2
.
.
c 2 10 10 x 2
.
.
PA152, Vlastislav Dohnal, FI MUNI, 2024 8
This record
matches
R  S R.A R.B R.C S.C S.D S.E
a 1 10 10 x 2
a 1 10 20 y 2
.
.
c 2 10 10 x 2
.
.
Output – query result
select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2
PA152, Vlastislav Dohnal, FI MUNI, 2024 9
Describing Evaluation Plans
◼ Using relational algebra
 select B,D from R,S where R.C=S.C and R.A=‘c’ and S.E=2
B,D [ sR.A=‘c’ S.E=2  R.C = S.C (RS)]
◼ Example of Plan 1:
Query plan B,D
sR.A=‘c’ S.E=2  R.C=S.C

R S
PA152, Vlastislav Dohnal, FI MUNI, 2024 10
Describing Evaluation Plans
◼ Example of Plan 2:
B,D
sR.A = ‘c’ sS.E = 2
R S
natural join
PA152, Vlastislav Dohnal, FI MUNI, 2024 11
Describing Evaluation Plans
◼ Example of Plan 2:
R S
A B C s (R) s(S) C D E
a 1 10 A B C C D E 10 x 2
b 1 20 c 2 10 10 x 2 20 y 2
c 2 10 20 y 2 30 z 2
d 2 35 30 z 2 40 x 1
e 3 45 50 y 3
PA152, Vlastislav Dohnal, FI MUNI, 2024 12
Describing Evaluation Plans
◼ Plan 3: (physical plan)
Assume an index on R.A and on S.C
Scan index R.A for looking up records of R
satisfying R.A = “c”
◼ For each record found, take value of R.C to scan
index on S.C to get matching records of S
◼ Filter out records of S, where S.E ≠ 2
Join the corresponding records of R and S
Project on B,D
PA152, Vlastislav Dohnal, FI MUNI, 2024 13
R S
A B C C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
A C
I1 I2
PA152, Vlastislav Dohnal, FI MUNI, 2024 14
R S
A B C C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
A C
I1 I2
=‘c’
<c,2,10>
PA152, Vlastislav Dohnal, FI MUNI, 2024 15
R S
A B C C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
A C
I1 I2
=“c”
<c,2,10> <10,x,2>
PA152, Vlastislav Dohnal, FI MUNI, 2024 16
R S
A B C C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
A C
I1 I2
=“c”
<c,2,10> <10,x,2>
Check
e=2?
Output: <2,x>
PA152, Vlastislav Dohnal, FI MUNI, 2024 17
next record:
<c,7,15>
R S
A B C C D E
a 1 10 10 x 2
b 1 20 20 y 2
c 2 10 30 z 2
d 2 35 40 x 1
e 3 45 50 y 3
A C
I1 I2
=“c”
<c,2,10> <10,x,2>
Check=2?
Output: <2,x>
…
PA152, Vlastislav Dohnal, FI MUNI, 2024 18
Overview on Query Optimization
parser
translate
rules
estimate sizes
create physical query plans estimate costs of p.q.p.
choose best
execute
{P1,P2,…..}
{(P1,C1),(P2,C2)...}
Pi
result
SQL query
parse tree
logical query plan
„improved“ l.q.p.
l.q.p. with sizes
statistics
data
PA152, Vlastislav Dohnal, FI MUNI, 2024 19
Example: SQL query
◼ Relations
 StarsIn(title, year, starName)
 MovieStar(name, birthdate)
◼ Query
 Select movies with stars born in 1960:
 SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE extract(year from birthdate) = 1960
);
PA152, Vlastislav Dohnal, FI MUNI, 2024 20
Example: Parse Tree
<Query>
<SFW>
SELECT <SelList> FROM <FromList> WHERE <Condition>
<Attribute> <RelName> <Tuple> IN <Query>
title StarsIn <Attribute> ( <Query> )
starName <SFW>
SELECT <SelList> FROM <FromList> WHERE <Condition>
<Attribute> <RelName> <Function> = <Value>
name MovieStar extract(year from birthdate) 1960
PA152, Vlastislav Dohnal, FI MUNI, 2024 21
Example: Parse Tree in Rel. Algebra
◼ Selection has two arguments
Must be transformed…
◼ IN operator
i.e., avoiding
nested
queries
(sub-selects)
title
s
StarsIn <condition>
<tuple> IN name
<attribute> sextract(year from birthdate) = 1960
starName MovieStar
PA152, Vlastislav Dohnal, FI MUNI, 2024 22
Example: Logical Query Plan
◼ IN operator replaced with product
title
sstarName=name
StarsIn name
sextract(year from birthdate) = 1960
MovieStar

PA152, Vlastislav Dohnal, FI MUNI, 2024 23
Example: Improving Logical Plan
◼ Substitution of cross product and selection
by join
◼ Next option
Push project to
relation StarsIn?
◼ Depends on:
 index on StarsIn.starName?
 many attributes in StarsIn
title
starName=name
StarsIn name
sextract(year from birthdate) = 1960
MovieStar
◼ Before generating physical plans
◼ Influences
estimation of evaluation costs
selection of algorithm
PA152, Vlastislav Dohnal, FI MUNI, 2024 24
Example: Estimate Result Sizes
Estimate size
StarsIn
MovieStar
name
sextract(year from birthdate) = 1960
PA152, Vlastislav Dohnal, FI MUNI, 2024 25
Example: Possible Physical Plan
Parameters: relation order,
memory size, project attributes,...
Index join
SEQ scan index scan Parameters:
select condition, ...
StarsIn MovieStar
PA152, Vlastislav Dohnal, FI MUNI, 2024 26
Example: Estimate Costs
Logical Query Plan
Phys. Plan1 Plan2 …. Plann
Cost1 Cost2 …. Costn
Pick the plan with the best cost!
PA152, Vlastislav Dohnal, FI MUNI, 2024 27
Query Optimization
◼ Relational algebra level
◼ Detailed query plan level
Estimate costs
◼ Without indexes
◼ With indexes
◼ Consider size of memory buffers
Generate and compare plans
PA152, Vlastislav Dohnal, FI MUNI, 2024 28
Optimization in Relational Algebra
◼ Transformation rules
Must preserve equivalence
What are good transformations?
PA152, Vlastislav Dohnal, FI MUNI, 2024 29
Transformation Rules in R.A.
◼ Natural join
Relation order is not important since all
attributes are preserved
◼ Example: R S = S R
(R S) T = R (S T)
R S S T
RT
PA152, Vlastislav Dohnal, FI MUNI, 2024 30
Transformation Rules
◼ Same for cartesian product and union
R  S = S  R
(R  S)  T = R  (S  T)
R  S = S  R
R  (S  T) = (R  S)  T
PA152, Vlastislav Dohnal, FI MUNI, 2024 31
Transformation Rules
◼ Selects
sp1p2(R) =
sp1p2(R) =
sp1p2(R) =
sp1 [ sp2 (R)]
[ sp1 (R)]  [ sp2 (R)]
[ sp1 (R)]  [ sp2 (R)]
PA152, Vlastislav Dohnal, FI MUNI, 2024 32
Question of Tuple Duplicates
◼ Sets vs. bags?
 Relations are bags (in reality)
◼ Example
 R = {a,a,b,b,b,c}
 S = {b,b,c,c,d}
◼ R  S = ?
 MIN: R  S = {b,b,c} SQL: INTERSECT ALL
◼ R  S = ?
 SUM: R  S = {a,a,b,b,b,b,b,c,c,c,d} SQL: UNION ALL
 MAX: R  S = {a,a,b,b,b,c,c,d}
R and S can be observed as ouput of some
operations, e.g., B,D [ sA=‘c’ (tbl1)]
PA152, Vlastislav Dohnal, FI MUNI, 2024 33
Option MAX: Select Decomposition
◼ Select decomposition:
◼ Example: R={a,a,b,b,b,c}
a,b satisfy p1; b,c satisfy p2
sp1p2 (R) = sp1(R)  sp2(R)
sp1p2 (R) = {a,a,b,b,b,c}
sp1(R) = {a,a,b,b,b}
sp2(R) = {b,b,b,c}
sp1(R) max sp2(R) = {a,a,b,b,b,c}
PA152, Vlastislav Dohnal, FI MUNI, 2024 34
Choice of Correct Option
◼ Pragmatic solution for 
Use “SUM” for bag union
“MAX” for dividing disjunctive predicates ()
◼ Some rules cannot be applied to bags!
Associativity of except: R – (S – T)
Distributivity: R  (S  T)
!= (R  S)  (R  T)
PA152, Vlastislav Dohnal, FI MUNI, 2024 35
Transformation Rules
◼ Notation:
X = set of attributes
Y = set of attributes
XY = X  Y
◼ Project
pxy (R) = px [py (R)]
PA152, Vlastislav Dohnal, FI MUNI, 2024 36
Transformation Rules
◼ Combining select and natural join
◼ Let
p = expr. containing only attrs. of R
q = expr. containing only attrs. of S
m = expr. containing attrs. of both R, S
[sp (R)]  S
R  [sq (S)]
sp (R  S) =
sq (R  S) =
PA152, Vlastislav Dohnal, FI MUNI, 2024 37
Transformation Rules
◼ Combining select and natural join
Other rules can be derived
spq (R  S) = [sp (R)]  [sq (S)]
spqm (R  S) =
sm [(sp (R))  (sq (S))]
spq (R  S) =
[(sp (R))  S] max [R (sq (S))]
Transformation Rules
◼ Combining select and natural join
Example of rule derivation
PA152, Vlastislav Dohnal, FI MUNI, 2024 38
spq (R  S) =
sp [sq (R  S) ] =
sp [ R  sq (S) ] =
[sp (R)]  [sq (S)]
PA152, Vlastislav Dohnal, FI MUNI, 2024 39
Transformation Rules
◼ Combining select and natural join
Example of rule derivation
Let
m = expr. containing only attrs. common in
R and S, but does not compare them
[sm (R)]  [sm (S)]sm (R  S) =
PA152, Vlastislav Dohnal, FI MUNI, 2024 40
Transformation Rules
◼ Combining project and select
◼ Let
x = attribute subset of R
z = attributes referenced in expr. P
(subset of R)
(sp [ px (R) ])px
pxz
px[sp (R) ] =
PA152, Vlastislav Dohnal, FI MUNI, 2024 41
Transformation Rules
◼ Combining project and natural join
◼ Let
x = attribute subset of R
y = attribute subset S
z = attributes common in R and S
pxy([pxz (R) ]  [pyz (S) ])
pxy (R  S) =
◼ Combining previous with select
PA152, Vlastislav Dohnal, FI MUNI, 2024 42
Transformation Rules
pxy (sp (R  S)) =
pxy (sp [pxz’ (R)  pyz (S)])
z’ = z  {attributes referenced in P}
◼ Combining project, select and Cartesian
product
PA152, Vlastislav Dohnal, FI MUNI, 2024 43
Transformation Rules
pxy (sp (R  S)) =
pxy (sp [px’ (R)  py (S)])
x’ = x  {attributes referenced in P}
PA152, Vlastislav Dohnal, FI MUNI, 2024 44
Transformation Rules
◼ Combining select and union
◼ Combining select and except
Select can also be applied to S
◼ May be convenient for shrinking relation before
doing except
◼ Are there some limits on P in both above?
sp(R sum S) = sp(R) sum sp(S)
sp(R – S) = sp(R) – S = sp(R) – sp(S)
PA152, Vlastislav Dohnal, FI MUNI, 2024 45
Good Transformations
◼ Select early
◼ Project early
Example:
◼ R(A,B,C,D,E,F,G,H,I,J) result={E}
◼ Filter using P: (A=3)  (B=“cat”)
pE (sp (R)) vs. pE (sp (pABE(R)))
px [sp (R)] → px (sp [px’ (R)])
sp (R  S) → [sp (R)]  S
x’ = x  {attributes referenced in P}
PA152, Vlastislav Dohnal, FI MUNI, 2024 46
Good Transformations
sp1p2 (R) → sp1 [sp2 (R)] → sp2 [sp1 (R)]
→ [ sp1 (R)]  [ sp2 (R)]
sp1p2 (R) → sp1(R) max sp2(R)
R  S → S  R
PA152, Vlastislav Dohnal, FI MUNI, 2024 47
Good Transformations
◼ Assume indexes
On A and on B
s(A=3)  (B=“cat”)(R)
= s(A=3)(R)  s(B=“cat”)(R)
A = 3B = ‘cat’
Record pointer:
<block_id;item_id>
Intersect pointers to get
pointers to matching tuples
Good Transformations
◼ Previous case in SQL and PostgreSQL
PA152, Vlastislav Dohnal, FI MUNI, 2024 48
Bitmap Heap Scan on customers (cost=25.76..61.62 rows=10 width=13)
Recheck Cond: ((username)::text < 'user100'::text)
-> BitmapAnd (cost=25.76..25.76 rows=10 width=0)
-> Bitmap Index Scan on ix_cust_username (cost=0.00..5.75 rows=200 w
Index Cond: ((username)::text < 'user100'::text)
-> Bitmap Index Scan on customers_pkey (cost=0.00..19.75 rows=1000 w
Index Cond: (customerid < 1000)
SELECT * FROM customer
WHERE username < ‘user100’ AND customerid < 1000;
Good Transformations
◼ Push select to relations → apparently OK
 But: Firstly, move them as far as possible; next push them back
◼ Example:
 Relations: StarsIn(title, year, starName)
Movie(title, year, studioName)
 View: create view MoviesOf1996 as
select * from Movie where year = 1996;
◼ Query: select starName, studioName
from MoviesOf1996 natural join StarsIn;
PA152, Vlastislav Dohnal, FI MUNI, 2024 49
starName,studioName
MoviesOf1996 StarsIn syear=1996
starName,studioName
Movie
StarsIn syear=1996
starName,studioName
Movie StarsIn
syear=1996
prerequisite to be able to do it!!!
PA152, Vlastislav Dohnal, FI MUNI, 2024 50
Good Transformations: Summary
◼ Recommendations:
No transformation is always good.
Usually, good to
◼ select early
◼ project early
◼ Eliminate common sub-expressions
◼ Eliminate tuple duplicates
PA152, Vlastislav Dohnal, FI MUNI, 2024 51
Query Processing: Overview
◼ Relational algebra level
Transformation rules
Apply good rules
◼ Detail (physical) query plan level
Estimate costs
Generate and compare plans
PA152, Vlastislav Dohnal, FI MUNI, 2024 52
Estimating Cost of Query Plan
◼ Why? To “grade” a physical plan.
Choose proper algorithms to save time and
RAM during evaluation.
◼ Many rows may need more RAM buffers.
◼ Multiple scans over data to save RAM.
◼ Why saving RAM?
More queries to be processed at parallel.
Individual operations of one query may need to
keep all data in RAM buffers.
◼ May lead to storing it to a disk.
PA152, Vlastislav Dohnal, FI MUNI, 2024 53
Estimating Cost of Query Plan
◼ The costs are in an arbitrary unit.
not milliseconds or any other time unit.
◼ Process:
1. Estimate size of result
◼ Contributes to CPU costs
2. Estimate number of I/O operations
◼ Contributes to “slowness” of the query
◼ Done for each operation
in the logical query plan
◼ Based on statistics (metadata about tables)
PA152, Vlastislav Dohnal, FI MUNI, 2024 54
Estimating Result Size
◼ Keep statistics for relation R
T(R) – # tuples in R
S(R) – # of bytes in each R tuple
◼ S(R,A) – length in bytes of values of attribute A
B(R) – # of blocks to hold all R tuples
V(R, A) – # distinct values in R for attribute A
◼ Good estimates need
Statistics up to date!
PA152, Vlastislav Dohnal, FI MUNI, 2024 55
Statistics Example
◼ Relation R
Attribute A – string, max. 20
◼ S(R,A) = 3 ← average length
Attribute B – integer, 4 B
Attribute C – date, 8 B
Attribute D – string, max. 5
◼ S(R,D) = 1
◼ Statistics
T(R) = 5 S(R) = 16
V(R,A) = 3 V(R,B) = 1
V(R,C) = 5 V(R,D) = 4
A B C D
cat 1 10.2.98 a
cat 1 20.3.98 b
dog 1 30.4.98 a
dog 1 14.6.98 c
bat 1 15.6.98 d
Mind encoding – utf8 takes
more bytes per char.
PA152, Vlastislav Dohnal, FI MUNI, 2024 56
Estimating Result Size
◼ Cartesian product W = R1  R2
T(W) = T(R1)  T(R2)
S(W) = S(R1) + S(R2)
PA152, Vlastislav Dohnal, FI MUNI, 2024 57
Estimating Result Size
◼ Select W = sZ=val(R)
S(W) = S(R)
T(W) = ?
◼ W = sA=‘cat’(R)
◼ W2 = sB=2(R) T(W2) = ?
V(R,A)=3
V(R,B)=1
V(R,C)=5
V(R,D)=4
T(W) =
T(R)
V(R,A)
= 5/3
A B C D
cat 1 10.2.98 a
cat 1 20.3.98 b
dog 1 30.4.98 a
dog 1 14.6.98 c
bat 1 15.6.98 d
PA152, Vlastislav Dohnal, FI MUNI, 2024 58
Estimating Result Size
◼ Assumption of last estimate
Values are uniformly distributed over possible
V(R,Z) values!
◼ f(val) = 1 / V(R,Z)
◼ T(sZ=val(R)) = T(R)  f(val)
◼ Alternative
Uniform distribution of values over domain
with DOM(R,Z) values
◼ # values in domain is denoted as DOM(R,Z)
◼ f(val) = 1 / DOM(R,Z)
PA152, Vlastislav Dohnal, FI MUNI, 2024 59
Estimating Result Size
◼ Select W = sZ=val(R)
T(W) = ?
◼ using DOM(R,*)
◼ Formula derivation
W = sC=val(R)
◼ T(W) = f(val)  T(R)
= 1/10 * 5 = 0,5
W = sB=val(R)
◼ T(W) = (1/10)*5
W = sA=val(R)
◼ T(W) = 0,5
V(R,A)=3 DOM(R,A)=10
V(R,B)=1 DOM(R,B)=10
V(R,C)=5 DOM(R,C)=10
V(R,D)=4 DOM(R,D)=10
A B C D
cat 1 10.2.98 a
cat 1 20.3.98 b
dog 1 30.4.98 a
dog 1 14.6.98 c
bat 1 15.6.98 d
PA152, Vlastislav Dohnal, FI MUNI, 2024 60
Estimating Result Size
◼ Select W = sZ=val(R)
 Original solution
 Alternative
 Mostly used:
◼ histogram over the most
frequent ones
◼ plus, uniform distribution
of remaining ones.
T(W) =
T(R)
V(R,Z)
V(R,A)=3 DOM(R,A)=10
V(R,B)=1 DOM(R,B)=10
V(R,C)=5 DOM(R,C)=10
V(R,D)=4 DOM(R,D)=10
T(W) =
T(R)
DOM(R,Z)
A B C D
cat 1 10.2.98 a
cat 1 20.3.98 b
dog 1 30.4.98 a
dog 1 14.6.98 c
bat 1 15.6.98 d
PA152, Vlastislav Dohnal, FI MUNI, 2024 61
Estimating Result Size
◼ Select W = sZ≥val(R)
Solution 1
◼ T(W) = T(R) / 2
Solution 2
◼ T(W) = T(R) / 3
Solution 3
◼ Estimate values in range – using histogram.
PA152, Vlastislav Dohnal, FI MUNI, 2024 62
Estimating Result Size
◼ Select – estimate values in range
◼ Calculate fraction of unique values in range
T(W) = f  T(R)
Min=1 V(R,Z)=10
W= sz  15 (R)
Max=20
ZR
f =
20-15+1
20-1+1
6
20
=
PA152, Vlastislav Dohnal, FI MUNI, 2024 63
Estimate # values: Histogram
◼ Histogram of attribute values
Replaces V(R,A) and DOM(R,A)
More precise estimates
◼ Distinct values
Few → abs. number per each
Many → quantization
◼ Ranges (intervals) of equal size (in recs)
◼ Percentiles
◼ Most frequent ones only
 others all together (i.e., uniformly distributed)
10 23 37 42
8
15
28
37
f
PA152, Vlastislav Dohnal, FI MUNI, 2024 64
Estimating Result Size
◼ Select W = szval(R)
T(W) = T(R)(1 – f(val)) = T(R)(1 – 1/V(R,Z))
Typical solution
◼ T(W) = T(R)
Mind special cases
◼ e.g., V(R,Z) = 1 and the value is “val”.
= T(R) –
T(R)
V(R,Z)
PA152, Vlastislav Dohnal, FI MUNI, 2024 65
Estimating Result Size
◼ Natural join W = R1  R2
Notation
◼ X – attributes of R1
◼ Y – attributes of R2
◼ Case 1
X  Y = Ø
Same as R1  R2
◼ Case 2
X  Y = Z
Follows…
PA152, Vlastislav Dohnal, FI MUNI, 2024 66
Estimating Result Size: Natural Join
◼ Assumption Z={A} and also:
V(R1,A) ≤ V(R2,A)
→ every A value in R1 is in R2
V(R1,A) ≥ V(R2,A)
→ every A value in R2 is in R1
R1 A B C R2 A DR1 R2
PA152, Vlastislav Dohnal, FI MUNI, 2024 67
Estimating Result Size: Natural Join
◼ V(R1,A) ≤ V(R2,A)
◼ One record is matched with T(R2) / V(R2,A)
records
Assumption of uniform distribution
◼ Result:
R1 A B C R2 A D
Take 1 tuple
Match
T(W) =
T(R2)
V(R2,A)
T(R1) 
PA152, Vlastislav Dohnal, FI MUNI, 2024 68
Estimating Result Size: Natural Join
◼ Overview of both variants
V(R1,A) ≤ V(R2,A)
V(R2,A) ≤ V(R1,A)
◼ Difference is in denominator
T(W) =
T(R2)
V(R2,A)
T(R1) 
T(W) =
T(R1)
V(R1,A)
T(R2) 
PA152, Vlastislav Dohnal, FI MUNI, 2024 69
Estimating Result Size: Natural Join
◼ In general
W = R1 R2
T(W) =
T(R1)  T(R2)
max { V(R1,A), V(R2,A) }
PA152, Vlastislav Dohnal, FI MUNI, 2024 70
Estimating Result Size: Natural Join
◼ Alternate solution
Values uniformly distributed over domain
◼ One rec. matches with T(R2)/DOM(R2,A)
recs.
◼ Result:
R1 A B C R2 A D
Take 1 tuple
Match
T(W) =
T(R1)  T(R2)
DOM(R2,A)
=
T(R1)  T(R2)
DOM(R1,A)
assuming to be same
PA152, Vlastislav Dohnal, FI MUNI, 2024 71
Estimating Result Size: Natural Join
◼ W =
R1(X), R2(Y), X  Y = Z
◼ Size of record
S(W) = S(R1) + S(R2) – S(R1,Z)
Valid for all cases
◼ Number of tuples if Z contains more attrs.?
Assume they are independent
R1 R2
T(W) =
T(R1)  T(R2)
max{ V(R1,A1), V(R2,A1) }  max{ V(R1,A2), V(R2,A2) }
PA152, Vlastislav Dohnal, FI MUNI, 2024 72
Estimating Size: Project, Select
◼ Project W = AB(R)
T(W)=T(R)
S(W)=S(R, AB)
◼ Select W = sA=aB=b(R)
S(W)=S(R)
Let n=T(R), T(W) = n(1 – (1-m1/n)(1-m2/n))
◼ m1=T(R) / V(R,A) m2=T(R) / V(R,B)
PA152, Vlastislav Dohnal, FI MUNI, 2024 73
Estimating Size: Set Operations
◼ Union, intersect, except
 , , –
◼ T(W) – choose average size
 E.g.,
◼ T(R  S) = T(R) + T(S) … if  means UNION ALL
◼ T(R  S) = [ max{T(R), T(S)} , T(R) + T(S) ]
 So, use: T(R  S) = avg{ max{T(R), T(S)} , T(R) + T(S) }
▪ If set union is evaluated
◼ T(R – S) = T(R) – 1/2T(S)
◼ T(R  S) = avg{ 0, min{T(R), T(S)} }
◼ DISTINCT
 All attributes
◼ min{ 1/2T(R), (V(R,A)*V(R,B)*…) }
PA152, Vlastislav Dohnal, FI MUNI, 2024 74
Estimating Result Size
◼ For complex expressions, intermediate
results and their stats are needed
◼ Example
W = [sA=a(R1)] R2
denote as U
◼ T(U) = T(R1) / V(R1,A) S(U) = S(R1)
Need V(U,*) to estimate costs of W!
◼ at least for the join attributes
PA152, Vlastislav Dohnal, FI MUNI, 2024 75
Estimate Number of Values
◼ To estimate V(U,*)
U = sA=a(R1)
Assume R1(A,B,C,D)
PA152, Vlastislav Dohnal, FI MUNI, 2024 76
Estimate # values: Example
◼ Relation R1
◼ U = sA=a(R1)
◼ Estimates
V(U,A) = 1
V(U,B) = 1
V(U,C) = 1 .. (T(R1) / V(R1,A))
V(U,D) = 1 .. (T(R1) / V(R1,A))
V(R,A)=3
V(R,B)=1
V(R,C)=5
V(R,D)=4
A B C D
cat 1 10.2.98 a
cat 1 20.3.98 b
dog 1 30.4.98 a
dog 1 14.6.98 c
bat 1 15.6.98 d
sth. in
between
Not easy, if there is no
dependence btw. A and
B/C/D.
➔ functional
dependencies
PA152, Vlastislav Dohnal, FI MUNI, 2024 77
Estimate # values: Reality
◼ Common solution
U = sA=a(R1)
V(U,A) = 1
V(U,K) = T(U)
◼ Primary key K of R1 is an exception
V(U,*) = V(R,*) or V(U,*) = T(U)
◼ As a result, original V(R,*) can be used
V(U,*) = min { V(R,*), T(U) }
PA152, Vlastislav Dohnal, FI MUNI, 2024 78
Estimate # values
◼ Join
U = R1(A,B) R2(A,C)
◼ Estimates:
V(U,A) = min{ V(R1,A), V(R2,A) }
V(U,B) = min { V(R1,B), T(U) }
V(U,C) = min { V(R2,C), T(U) }
PA152, Vlastislav Dohnal, FI MUNI, 2024 79
Estimate # values
◼ Example
Z = R1(A,B) R2(B,C) R3(C,D)
◼ T(R1) = 1000 V(R1,A)=50 V(R1,B)=100
◼ T(R2) = 2000 V(R2,B)=200 V(R2,C)=300
◼ T(R3) = 3000 V(R3,C)=90 V(R3,D)=500
Intermediate result
◼ U = R1(A,B) R2(B,C)
◼ T(U) = T(R1)  T(R2) / max{ V(R1,B), V(R2,B) } = 10k
◼ V(U,A) = 50
◼ V(U,B) = min{ V(R1,B), V(R2,B) } = 100
◼ V(U,C) = 300
PA152, Vlastislav Dohnal, FI MUNI, 2024 80
Estimate # values
◼ Final result
Z = U R3(C,D)
◼ U(A,B,C)
Estimates:
◼ T(Z) = 10 000  3 000 / 300 = 100 000
◼ V(Z,A) = 50
◼ V(Z,B) = 100
◼ V(Z,C) = 90
◼ V(Z,D) = 500
PA152, Vlastislav Dohnal, FI MUNI, 2024 81
Example of Stats in PostgreSQL
◼ Connect to student instance of PostgreSQL
See the first lecture for instructions
◼ Check the schema xdohnal
Relations: predmet, skupina, hotel
◼ Observe both the relation and attribute statistics
Explanation of individual items in doc
◼ https://www.postgresql.org/docs/current/view-pg-stats.html
PA152, Vlastislav Dohnal, FI MUNI, 2024 82
Example of Stats in PostgreSQL
◼ Relation hotel
PA152, Vlastislav Dohnal, FI MUNI, 2024 83
Example of Stats in PostgreSQL
◼ Attribute hotel.id
◼ Attribute hotel.name
PA152, Vlastislav Dohnal, FI MUNI, 2024 84
Example of Stats in PostgreSQL
◼ Attribute hotel.state
◼ Attribute hotel.distance_to_center
PA152, Vlastislav Dohnal, FI MUNI, 2024 85
Summary
◼ Estimating size of results is an “art”
◼ Do not forget:
Statistics must be kept up to date for precise
estimates
→ necessity to maintain them during table
updates
What are the costs to update stats?
PA152, Vlastislav Dohnal, FI MUNI, 2024 86
Statistics Update
◼ Stats do not change rapidly
in short time
◼ Inaccurate stats may also help
◼ Instant stats update
Can become a bottleneck
◼ Stats are used very often
◼ → Do not update often
PA152, Vlastislav Dohnal, FI MUNI, 2024 87
Statistics Update
◼ Run periodically
After some time elapses
After some number of updates are made
◼ Slow for V(R,A)
Especially if histograms are kept
→ Use a data sample
◼ If almost all are distinct → V(R,A)T(R)
◼ If not many are distinct → we likely saw most of them
PA152, Vlastislav Dohnal, FI MUNI, 2024 88
Estimating Costs of Query Plan: Outline
◼ Estimating the size of results
Already done
◼ Estimating # of IOs
Next lecture…
◼ Generate and compare plans
Takeaways
◼ Query plan
What it is
How to transform it and why
◼ Term “costs”
◼ Types of statistics maintained by DBMS
◼ How to estimate
result size of an operation
and the other statistics
PA152, Vlastislav Dohnal, FI MUNI, 2024 89