Query Execution

Logical/physical operators

Logical operators
                what they do
                e.g., union, selection, project, join, grouping

Physical operators
                how they do it
                e.g., nested loop join, sort-merge join, hash join, index join

Query Execution Plans

select C.cust_name 
from Sale S, Manufacturer M, Customer C, Product P 
where M.city = 'Lima' and 
	P.prod_id = S.prod_id and 
	P.manufactr_id = M.manufactr_id and 
	S.cust_id = C.cust_id

Create a Query Plan:

  • logical tree
  • implementation   choice at every node
  • scheduling of operations.

Some operators are from relational algebra, and others (e.g., scan, group) are not.

Cost Parameters

Cost parameters 

  • M = number of blocks that fit in main memory
  • B(R) = number of blocks holding R
  • T(R) = number of tuples in R
  • V(R,a) = number of distinct values of the attribute a

Estimating the cost:

  • Important in optimization (next lecture)
  • Compute I/O cost only
  • We compute the cost to read the tables
  • We don’t compute the cost to write the result (because pipelining)

Sorting

Two pass multi-way merge sort

Step 1:

  • Read M blocks at a time, sort, write
  • Result: have runs of length M on disk

Step 2:

  • Merge M-1 at a time, write to disk
  • Result: have runs of length M(M-1)»M2

Cost: 3B(R),  Assumption: B(R) < M2

Scanning Tables

The table is clustered (I.e. blocks consists only of records from this table):

  • Table-scan: if we know where the blocks are
  • Index scan: if we have index to find the blocks

The table is unclustered (e.g. its records are placed on blocks with other tables)

  • May need one read for each record

Clustered relation:

  • Table scan:  B(R); to sort: 3B(R)
  • Index scan:  B(R); to sort: B(R) or 3B(R)

Unclustered relation

  • T(R); to sort: T(R) + 2B(R)

One Pass Algorithms

Selection $\sigma$(R), projection $\pi$(R)

  • Both are tuple-at-a-Time algorithms
  • Cost B(R)

Duplicate elimination $\delta$(R)

Need to keep a dictionary in memory:

  • balanced search tree
  • hash table
  • etc

Cost: B(R)

Assumption: B(d(R)) <= M

Grouping: $\gamma$city, sum(price) (R)

  • Need to keep a dictionary in memory
  • Also store the sum(price) for each city
  • Cost: B(R)
  • Assumption: number of cities fits in memory

Binary operations: R $\cap$ S, R $\cup$ S, R – S

  • Assumption: min(B(R), B(S)) <= M
  • Scan one table first, then the next, eliminate duplicates
  • Cost: B(R)+B(S)

Nested Loop Joins

Tuple-based nested loop R $\bowtie$ S

R=outer relation, S=inner relation

for each tuple r in R do
   for each tuple s in S do
      if r and s join then output (r,s)

Cost: T(R) T(S),  sometimes T(R) B(S)

Block-based Nested Loop Join

for each (M-1) blocks bs of S do
   for each block br of R do
      for each tuple s in bs do
         for each tuple r in br do
            if r and s join then output(r,s)

Cost:
• Read S once: cost B(S)
• Outer loop runs B(S)/(M-1) times, and each time need to read R: costs B(S)B(R)/(M-1)
• Total cost: B(S) + B(S)B(R)/(M-1)
Notice: it is better to iterate over the smaller relation first– i.e., S smaller

Two Pass algorithms (based on sorting)

Duplicate elimination $delta$(R)

Simple idea: like sorting, but include no duplicates

Step 1: sort runs of size M, write

  • Cost: 2B(R)

Step 2: merge M-1 runs, but include each tuple only once

  • Cost: B(R)
  • Total cost: 3B(R), Assumption: B(R)<= M2

Index-based algorithms (zero pass)

In a clustered index all tuples with the same value of the key are clustered on as few blocks as possible:

Index based selection

Selection on equality: $\sigma_{a=v}$(R)

Clustered index on a:  cost B(R)/V(R,a)

Unclustered index on a: cost T(R)/V(R,a)

Example: B(R) = 2000,  T(R) = 100,000, V(R, a) = 20, compute the cost of $\sigma_{a=v}$(R)

Cost of table scan:

  • If R is clustered: B(R) = 2000 I/Os
  • If R is unclustered: T(R) = 100,000 I/Os

Cost of index based selection:

  • If index is clustered: B(R)/V(R,a) = 100
  • If index is unclustered: T(R)/V(R,a) = 5000

Notice: when V(R,a) is small, then unclustered index is useless

Index based join

R $\bowtie$ S

Assume S has an index on the join attribute

Iterate over R, for each tuple fetch corresponding tuple(s) from S

Assume R is clustered. Cost:

  • If index is clustered:  B(R) + T(R)B(S)/V(S,a)
  • If index is unclustered: B(R) + T(R)T(S)/V(S,a)

Assume both R and S have a sorted index (B+ tree) on the join attribute

  • Then perform a merge join (called zig-zag join)

Cost: B(R) + B(S)

Estimating Sizes

S = $\sigma_{a=c}(R)$

  • T(S) can be anything from 0 to T(R) – V(R,A) + 1
  • Mean value: T(S) = T(R)/V(R,A)

S = $\sigma_{a<c}(R)$

  • T(S) can be anything from 0 to T(R)
  • Heuristics: T(S) = T(R)/3

Estimating the size of a natural join, R join S

  • When the set of A values are disjoint, then T(R join S) = 0
  • When A is a key in S and a foreign key in R, then T(R join S) = T(R)
  • When A has a unique value, the same in R and S, then T(R join S) = min(T(R), T(S))

In general: T(R join S) = T(R) T(S) / max(V(R,A),V(S,A))

  • T(R) = 10000,  T(S) = 20000
  • V(R,A) = 100,  V(S,A) = 200
  • How large is R$\bowtie$S ?  = 10000*20000/200 = 1000000

Query Optimization

At the heart of the database engine

  • Step 1: convert the SQL query to some logical plan
  • Step 2: find a better logical plan, find an associated physical plan

SQL –> Logical Query Plans:

select a1, …, an from R1, …, Rk Where C $\rightarrow$ $\pi$a1,…,an($\sigma$C(R1 x R2 x … x Rk))

Now we have one logical plan

Algebraic laws:
                foundation for every optimization

Two approaches to optimizations:

  • Rule-based (heuristics): apply laws that seem to result in cheaper plans
  • Cost-based: estimate size and cost of intermediate results, search systematically for best plan

Optimizer has three things:

  • Algebraic laws
  • An optimization algorithm
  • A cost estimator

Algebraic Laws

Commutative and Associative Laws

  • R $\cup$ S = S $\cup$ R,  R $\cup$ (S $\cup$ T) = (R $\cup$ S) $\cup$ T
  • R ∩ S = S ∩ R,  R ∩ (S ∩ T) = (R ∩ S) ∩ T
  • R ⋈ S = S ⋈ R,  R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T

Distributive Laws

  • R ⋈ (S $\cup$ T)  =  (R ⋈ S) $\cup$ (R ⋈ T)

Laws involving selection:

  • s C AND C’(R) = s C(s C’(R)) = s C(R) ∩ s C’(R)
  • s C OR C’(R) = s C(R) $\cup$ s C’(R)
  • s C (R ⋈ S) = s C (R) ⋈ S

When C involves only attributes of R

  • s C (R – S) = s C (R) – S
  • s C (R $\cup$ S) = s C (R) $\cup$ s C (S)
  • s C (R ∩ S)  = s C (R) ∩ S

Rule Based Optimization

Query rewriting based on algebraic laws

Result in better queries most of the time

Heuristic number 1: Push selections down

Example:

select C.cust_name 
from Sale S, Manufacturer M, Customer C, Product P 
where M.city = 'Lima' and 
	P.prod_id = S.prod_id and 
	P.manufactr_id = M.manufactr_id and 
	S.cust_id = C.cust_id

Heuristics number 2: Sometimes push selections up, then down

When would this be helpful? Come up with an example, when pushing selections up and then back down would be helpful.