Advanced Database Management
Systems
Query Processing: Chapter 1
Database Instance and Available Indexes:
Examples from: Principles of Database Systems by Greg Riccardi, Addison Wesley, 2001
select * from Customer
where accountId = 101
select * from Customer
where accountId >= 101 and accountId < 300
select * from Customer where lastName < 'D'
Scanning,
SQL Expression Tree
Parsing,
Query (algebra)
Validating
Optimizer
Execution
Query Code Plan
Generator
Query DB Runtime
Result
Code Processor
Check syntax
Select * from Eployees having salary >1000
Check Schema Elements
Attributes, relations … used
Converts SQL to RA expression
Query optimization determines
the most efficient (or sufficiently efficient)
process for executing the query
Optimization reorganizes
an expression tree for a query
using algebraic transformation rules
Information used:
implementation techniques for algebra operators
algebraic transformation rules
heuristic rules
cost estimates
Commutative: σ ⋈ x ⋃ ⋂
Associative: ⋈ x ⋃ ⋂
Cascades of select:
σ c1 AND c2 AND … AND cn (R) ≡ σc1(σc2( … σcn(R))…))
selects can then be commuted
Cascades of project:
πL1(πL2( … πLn(R))…)) ≡ πL1(R)
projects cannot be commuted
Commuting select and project:
πA1,A2,…,An(σc(R)) ≡ σc(πA1,A2,…,An(R))
valid when selection condition only involves
attributes in projection list
Converting select/cross-product into join:
σc(R x S) ≡ R ⋈c S
Commuting select and join or cross-product:
σc(R ⋈ S) ≡ σc(R) ⋈ S
valid when selection condition involves only attributes in R
Commuting project with join or cross-product:
πL(R ⋈c S) ≡ (πA1,A2,…,An(R)) ⋈c (πB1,B2,…,Bn(S))
where A1,…,An ⊆ L are attributes of R,
and B1,…,Bn ⊆ L are attributes of S,
and A1,…,An, B1,…,Bn are involved in C
(slightly more complicated if C contains attributes not in L)
Break up conjunctive select conditions
into a cascade of selects
more flexibility in moving selects
Push selects as early as possible,
using commutativity of select with other operators
Reorder sub-expressions such
that most restrictive selects are done first
Combine select/cross-product into join
Push projects as early as possible
keep only necessary attributes
Enumerate the query plans
Apply algebraic transformations & heuristics
to get all reasonable trees
Estimate the cost of each plan
Account for size of relations, available indexes, information
about file layout, info about value distributions …
Choose the best (fastest) plan
select e.lname, e.fname, w.pno, w.hours
from employee e, works_on w
where e.ssn = w.essn and w.hours > 20;
πe.lname, e.fname, w.pno, w.hours
σe.ssn=w.essn AND w.hours > 20
Employee e works_on w
πe.lname, e.fname, w.pno, w.hours
Heuristic: σw.hours > 20
Conjunctive Note: selects
select cascade could commute
of selects
σe.ssn=w.essn
Employee e works_on w
πe.lname, e.fname, w.pno, w.hours
Heuristic:
Combine select
σw.hours > 20
and cross join
⋈e.ssn=w.essn
Employee e works_on w
πe.lname, e.fname, w.pno, w.hours
Heuristic:
Push selects as
early as possible ⋈e.ssn=w.essn
Employee e σw.hours > 20
works_on w
πe.lname, e.fname, w.pno, w.hours
πe.ssn, e.lname, e.fname, w.essn, w.pno, w.hours
First cascade
Heuristic: projects to get
attributes needed
⋈e.ssn=w.essn
Push projects as
early as possible to push through
join
Employee e σw.hours > 20
works_on w
πe.lname, e.fname, w.pno, w.hours
Heuristic:
⋈e.ssn=w.essn
Push projects as Push project
early as possible through join
πw.essn, w.pno, w.hours
πe.ssn, e.lname, e.fname
Employee e
σw.hours > 20
works_on w
πe.lname, e.fname, w.pno, w.hours
Heuristic:
⋈e.ssn=w.essn
Push projects as Push project
early as possible through select
πe.ssn, e.lname, e.fname σw.hours > 20
Employee e πw.essn, w.pno, w.hours
works_on w
At this point we’ve generated
nine different query plans for the same query
Adding strategies/algorithms for implementing
individual operations would add
even more potential plans
Next Step: Estimate the cost of executing each
plan
Disk Access
Cpu cycle
Transit Time in network- distributed system
CPU cost is difficult to calculate and cpu cost is insignificant
compared with Disk Access cost
Consider only disk access
No. of seeks (N) –Random I/O cost < Sequential
I/O cost
Cost =N*avg seek time
No. of blocks read
Cost =N*avg block read cost
No. of blocks written
Cost = N*avg block write cost
Cost of writing >> cost of reading because once
written it has to be read to check if it is written
correctly