Logical/physical operators
Logical operators
what they do
e.g., union, selection, project, join, grouping
Physical operators
how they do it
e.g., nested loop join, sort-merge join, hash join, index join
Query Execution Plans
select C.cust_name from Sale S, Manufacturer M, Customer C, Product P where M.city = 'Lima' and P.prod_id = S.prod_id and P.manufactr_id = M.manufactr_id and S.cust_id = C.cust_id
Create a Query Plan:
- logical tree
- implementation choice at every node
- scheduling of operations.
Some operators are from relational algebra, and others (e.g., scan, group) are not.
Cost Parameters
Cost parameters
- M = number of blocks that fit in main memory
- B(R) = number of blocks holding R
- T(R) = number of tuples in R
- V(R,a) = number of distinct values of the attribute a
Estimating the cost:
- Important in optimization (next lecture)
- Compute I/O cost only
- We compute the cost to read the tables
- We don’t compute the cost to write the result (because pipelining)
Sorting
Two pass multi-way merge sort
Step 1:
- Read M blocks at a time, sort, write
- Result: have runs of length M on disk
Step 2:
- Merge M-1 at a time, write to disk
- Result: have runs of length M(M-1)»M2
Cost: 3B(R), Assumption: B(R) < M2
Scanning Tables
The table is clustered (I.e. blocks consists only of records from this table):
- Table-scan: if we know where the blocks are
- Index scan: if we have index to find the blocks
The table is unclustered (e.g. its records are placed on blocks with other tables)
- May need one read for each record
Clustered relation:
- Table scan: B(R); to sort: 3B(R)
- Index scan: B(R); to sort: B(R) or 3B(R)
Unclustered relation
- T(R); to sort: T(R) + 2B(R)
One Pass Algorithms
Selection $\sigma$(R), projection $\pi$(R)
- Both are tuple-at-a-Time algorithms
- Cost B(R)
Duplicate elimination $\delta$(R)
Need to keep a dictionary in memory:
- balanced search tree
- hash table
- etc
Cost: B(R)
Assumption: B(d(R)) <= M
Grouping: $\gamma$city, sum(price) (R)
- Need to keep a dictionary in memory
- Also store the sum(price) for each city
- Cost: B(R)
- Assumption: number of cities fits in memory
Binary operations: R $\cap$ S, R $\cup$ S, R – S
- Assumption: min(B(R), B(S)) <= M
- Scan one table first, then the next, eliminate duplicates
- Cost: B(R)+B(S)
Nested Loop Joins
Tuple-based nested loop R $\bowtie$ S
R=outer relation, S=inner relation
for each tuple r in R do for each tuple s in S do if r and s join then output (r,s)
Cost: T(R) T(S), sometimes T(R) B(S)
Block-based Nested Loop Join
for each (M-1) blocks bs of S do for each block br of R do for each tuple s in bs do for each tuple r in br do if r and s join then output(r,s)
Cost:
• Read S once: cost B(S)
• Outer loop runs B(S)/(M-1) times, and each time need to read R: costs B(S)B(R)/(M-1)
• Total cost: B(S) + B(S)B(R)/(M-1)
Notice: it is better to iterate over the smaller relation first– i.e., S smaller
Two Pass algorithms (based on sorting)
Duplicate elimination $delta$(R)
Simple idea: like sorting, but include no duplicates
Step 1: sort runs of size M, write
- Cost: 2B(R)
Step 2: merge M-1 runs, but include each tuple only once
- Cost: B(R)
- Total cost: 3B(R), Assumption: B(R)<= M2
Index-based algorithms (zero pass)
In a clustered index all tuples with the same value of the key are clustered on as few blocks as possible:
Index based selection
Selection on equality: $\sigma_{a=v}$(R)
Clustered index on a: cost B(R)/V(R,a)
Unclustered index on a: cost T(R)/V(R,a)
Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20, compute the cost of $\sigma_{a=v}$(R)
Cost of table scan:
- If R is clustered: B(R) = 2000 I/Os
- If R is unclustered: T(R) = 100,000 I/Os
Cost of index based selection:
- If index is clustered: B(R)/V(R,a) = 100
- If index is unclustered: T(R)/V(R,a) = 5000
Notice: when V(R,a) is small, then unclustered index is useless
Index based join
R $\bowtie$ S
Assume S has an index on the join attribute
Iterate over R, for each tuple fetch corresponding tuple(s) from S
Assume R is clustered. Cost:
- If index is clustered: B(R) + T(R)B(S)/V(S,a)
- If index is unclustered: B(R) + T(R)T(S)/V(S,a)
Assume both R and S have a sorted index (B+ tree) on the join attribute
- Then perform a merge join (called zig-zag join)
Cost: B(R) + B(S)
Estimating Sizes
S = $\sigma_{a=c}(R)$
- T(S) can be anything from 0 to T(R) – V(R,A) + 1
- Mean value: T(S) = T(R)/V(R,A)
S = $\sigma_{a<c}(R)$
- T(S) can be anything from 0 to T(R)
- Heuristics: T(S) = T(R)/3
Estimating the size of a natural join, R join S
- When the set of A values are disjoint, then T(R join S) = 0
- When A is a key in S and a foreign key in R, then T(R join S) = T(R)
- When A has a unique value, the same in R and S, then T(R join S) = min(T(R), T(S))
In general: T(R join S) = T(R) T(S) / max(V(R,A),V(S,A))
- T(R) = 10000, T(S) = 20000
- V(R,A) = 100, V(S,A) = 200
- How large is R$\bowtie$S ? = 10000*20000/200 = 1000000
Query Optimization
At the heart of the database engine
- Step 1: convert the SQL query to some logical plan
- Step 2: find a better logical plan, find an associated physical plan
SQL –> Logical Query Plans:
select a1, …, an from R1, …, Rk Where C $\rightarrow$ $\pi$a1,…,an($\sigma$C(R1 x R2 x … x Rk))
Now we have one logical plan
Algebraic laws:
foundation for every optimization
Two approaches to optimizations:
- Rule-based (heuristics): apply laws that seem to result in cheaper plans
- Cost-based: estimate size and cost of intermediate results, search systematically for best plan
Optimizer has three things:
- Algebraic laws
- An optimization algorithm
- A cost estimator
Algebraic Laws
Commutative and Associative Laws
- R $\cup$ S = S $\cup$ R, R $\cup$ (S $\cup$ T) = (R $\cup$ S) $\cup$ T
- R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T
- R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
Distributive Laws
- R ⋈ (S $\cup$ T) = (R ⋈ S) $\cup$ (R ⋈ T)
Laws involving selection:
- s C AND C’(R) = s C(s C’(R)) = s C(R) ∩ s C’(R)
- s C OR C’(R) = s C(R) $\cup$ s C’(R)
- s C (R ⋈ S) = s C (R) ⋈ S
When C involves only attributes of R
- s C (R – S) = s C (R) – S
- s C (R $\cup$ S) = s C (R) $\cup$ s C (S)
- s C (R ∩ S) = s C (R) ∩ S
Rule Based Optimization
Query rewriting based on algebraic laws
Result in better queries most of the time
Heuristic number 1: Push selections down
Example:
select C.cust_name from Sale S, Manufacturer M, Customer C, Product P where M.city = 'Lima' and P.prod_id = S.prod_id and P.manufactr_id = M.manufactr_id and S.cust_id = C.cust_id
Heuristics number 2: Sometimes push selections up, then down
When would this be helpful? Come up with an example, when pushing selections up and then back down would be helpful.