Unit:2 Query Processing
Definition:
• A procedure for transforming a high-level query(SQL) into a correct and efficient execution plan
in low level language.
• A query processor selects the appropriate execution plan to respond to the user request.
• Query processing goes through a series of query complication steps called Execution plan before
it begins execution.
Steps in Query Processing
1. Syntax Analyzer- Parses the query and checks if it obeys the syntax rules.
2. Query Decomposer-Translates the query into relational algebra query using
transformation rules.
3. Query Optimizer-Performing optimization using equivalent expressions.
4. Query Code Generator-Generating code for the queries.
5. Runtime Database Processor –Selecting the optimal plan and execution.
Syntax Analyser
• Takes up a query, parses it into tokens , analyses the tokens inorder to make sure it comply with
the rules of the language grammer.
• If an error is found, it is rejected and an error code together with explanation is returned.
• Ex: A simple grammar to implement a sql statement,
Query: Select clause+from clause+where clause
Select clause: ‘select’ +<column list>
From clause: ‘From’+ <table list>
Where clause: ‘where’+value1opvalue2
Value1:value/column name
Value2:value/column name
Op:+,-,*,/,=
The above grammar can be used to implement a query such as,
Select eno, ename from emp where eno=10;
2. Query Decomposition
The first phase of query processing whose aim is to transform a high level query into a
relational algebra query and to check if the query is syntactically and semantically correct.
This consists of the following 5 stages,
1. Query Analysis
2. Query Normalization
3. Semantic Analysis
4. Query Simplifier
5. Query Restructuring
Query Analysis
Equivalence Rules
Query Normalization
Data Dictionary
Semantic Analysis
Idempotency Rules
Query Simplifier
Transformation Rules
Query Restructuring
Relational Algebra Query
Query Analysis
• Query is lexically and syntactically analysed using the compilers(parsers) to find out syntax
errors.
• Syntactically Legal queries is then validated with the system catalogue, whether the relations
and attributes mentioned exists.
• The type specification of the query qualifiers are also checked at this stage.
• Ex: of Invalid query which is rejected..
Select ename, desig from emp where desig>100;
• The outcome of this phase is an internal representation called Query Tree Notation.
Query Tree Notation
It is a tree data structure that corresponds to a relational algebra expression.
Components of Query Tree
• Leaf nodes-base input relations of the query.
• Non-leaf nodes-represent the intermediate relation as a result of applying an operation
• Root represents the result of the query
• Sequence of operations is directed from leaves to root.
Initial Canonical Query Tree Notation
Ex:
• Project the projno,deptno,and the managers details for every project located in Mumbai.
RA expression is,
∏proj-no,dept-no,ename,addr,dob’(σproj-loc=‘mumbai’(PROJECT))’ X dept-no=dnum(DEPARTMENT)
X mgrid=empid(EMPLOYEE)
∏p.proj-no,p.dept-no,e.ename,e.addr,e.dob
Root
σproj-loc=‘mumbai’ and p.dept-no=d.dnum and .mgrid=e.empid
Intermediate Op
Leaf
E
X
P D Leaf
In the initial query tree notation, the Cartesian product is applied first, then the selection and join
conditions of where clause are applied, followed by the projection on the select attributes.
Query tree corresponding to the relational algebra expression.
∏p.proj_no,p.Deptno,e.emp_name,e.addr,e,dob
d.mgrid=e.empid
p.deptno=d.deptno
σProj_loc=’mumbai’
D
Leaf node
P
Query Graph Notation
A Query graph corresponds to a relational calculus expression.
• Relational nodes represented as single circle.
• Constant values from the query selection are represented as double circles.
• Selection and join are represented by edges.
• Attributes to be retrieved are put in [].
• Disadvantage: Does not indicate the order in which operation is performed.
P D E
‘Chennai’
2. Query Normalization
• Primary Goal: Avoid Redundancy.
• A set of equivalency rule is applied so that projection and selection operations are
simplified.
• Applying these rules the predicate is converted into 2 normal forms– CNF & DNF.
• Conjuctive Normal Form is a sequence of conjucts connected with ^(AND). Conjucts
can contain one or more terms connected with v (OR).
• Disjunctive Normal Form is a sequence of disjuncts connected with V (OR). Disjuncts
contain one or more terms connected with ^ (AND)
• DNF is often used as it allows the query to be broken into a series of independent sub-
queries linked by unions.
Example:
CNF-Conjuctive Normal Form
(emp-design=‘Programmer’ V emp-sal>40000)^ location=‘mumbai’
This can be expressed in DNF as
(emp-design=‘Programmer’ ^location=‘mumbai’)
(emp-sal>40000 ^ location=‘mumbai’)
The equivalency rules are,
• Rule:1 Commutativity of Unary operation:
Unary op1 unary op2 REL Unary op2 unary op1 REL
• Rule:2 Commutativity of Binary operation:
REL1 binop (REL2 binop REL3) (REL1 binop REL2) binop REL3
• Rule:3 Idempotency of Unary op
unaryop1 unaryop2 REL unaryop REL
• Rule: 4 Distributivity of unary w.r.t binary
unaryop (REL1 binop REL2) unaryop (REL1) binop unaryop(REL2)
• Rule:5 Factorisation
Unaryop (REL1) binop unaryop(REL2) unaryop(REL1 binop REL2)
2.3 Semantic Analyser
• Goal: Reduce the no. of Predicates that must be evaluated by refuting the incorrect or
contradictory queries.
• This rejects normalized query which are incorrectly formulated or which are contradictory.
• A query is Incorrectly formulated- if the query does not contribute to the result. Ex: missing joins
• A Query is Contradictory if the predicate cannot be satisfied by any tuple in the relation.
• Ex: emp_desig=‘programmer’ ^ emp_desig=‘analyst’ (Contradictory query)
• Semantic Analyser also examines the query to make sure that only data objects defined in the
catalogue is used.
Connection Graphs
• Connection graphs can be constructed to check the correctness and contradictions as follows.
1. Constructing a Query graph for relation
2. Constructing a Query graph for normalized attribute
Ex:1
Select (p.projno,p.projloc)
from project as p,viewing as v,depart as d
where d.deptid=v.deptid and d.maxbudget>=85000 and d.compl_year=‘2005’ and p.proj_
mgr=‘Mathew’;
• The Query graph representation is not fully connected which means the query is not correctly
formulated. In this graph the join condition v.proj_no=p.proj_no has been omitted.
D Result
V P
Incorrectly formulated Query Graph
• Example 2
Select (p.projno,p.loc)
from project as p,cost_of_project as c,depart as d
where d.max-budget>85000 and d.compl_year=‘2005’ and d.maxbudget<50000
This graph has a cycle between the nodes 0 and max_budget with a negative sum.
50000
0
2500
-2500
-85000 d.max_budg d.compl_
et year
2.4 Query simplifier
• Goal: Detect redundancy, eliminate common sub expressions and transform query to
semantically equivalent and easily computed forms.
• Integrity constraints, view definitions and access restrictions are considered here and the query
will be rejected if it is contradictory.
• The final form of simplification is obtained using the Idempotence rule of Boolean algebra
Idempotence Rules of Boolean Algebra
1. Pred AND Pred = Pred
2. Pred AND true = Pred
3. Pred AND False = False
4. Pred AND not(Pred) = False
5. Pred1 and (Pred1 or Pred2) = pred1
6. Pred OR Pred = Pred
7. Pred OR true = True
8. Pred OR False = Pred
9. Pred OR not(Pred) = True
10. Pred1 OR (Pred1 AND Pred2) = pred1
Example of Using Idempotence Rules.
Consider the following query.
Select d.dept_id,m.branch_mgr,m.branch_id,b,branch_id,b.nranch_loc,e.empname,e.sal
From department as d,manager as m,branch as b
Where d.dept-id=m.dept-id
And m.branch-id=b.branch-id
And b.branch_loc=’mumbai’
And not(b.branch_loc=’mumbai’and b.branch_loc=’delhi’)
And e.empsal>85000
And not(b.brach_loc=’Delhi’)
And d.dept_loc=’Bangalore’
Let us examine the following part of the query to greater detail
And b.branch_loc=’mumbai’
And not(b.branch_loc=’mumbai’and b.branch_loc=’delhi’)
Let us equate
b.branch_loc=’mumbai’=PRED1
b.branch_loc=’mumbai’=PRED2
b.branch_loc=’delhi’=PRED3
It can now be represented in the form of idempotence rule as
PRED1 and not(PRED2 and PRED3)
Applying the equivalence rule,
(PRED1 and not(PRED1))and not(PRED3)
By the rule 4 of idempotence law we get the following form
FALSE and not(PRED3)
This is equivalent to not(PRED3)
Now translating the predicate into sql we get
Not(b.branch_loc=’delhi’)
Thus the original query which contains many redundant predicate can be eliminated and the query is
simplified now.
2.5 Query Restructuring
This is the final stage of Query Decomposition.
The Query can be restructured to give a more efficient implementation using transformation
rules. The query can be considered as a relational algebra program.
3. Query Optimization
• Goal: Choose an efficient execution strategy for processing a query.
• Atempts to minimise the use of resources-I/Os and CPU time. By choosing the best access plans
• A query has many possible execution strategies, choosing the suitable one is called Query
Optimization
Block diagram of Query Optimizer Estimated Formula
Query Query Optimizer Join Manager
Simplifier
Statistical data
Execution Plan
Generator
Database
Cost Model Catalogue
Execution Plan
Four main input modules are
1. Relational algebra query from the query decomposer.
2. Estimated formulas used to determine the cardinality.
3. A cost model
4. Statistical data from database catalogue.
The basic issues in Query Optimization are
1. How to use available indexes.
2. How to use memory to accumulate information and perform sorting.
3. How to determine the order in which join is to be performed.
• Two techniques for implementing Query Optimization:
1. Heuristic Rule
2. Systematic estimation of the cost of different strategies.
Heuristic Query Optimization-Intro’
• Heuristics rules are applied to improve the performance of the query.
• One main heuristic rule is , Apply select before applying join or other operations. This reduces
the size of the file. The SELECT and PROJECT operation will reduce the size of the file and hence
should be applied before a JOIM or other binary operation.
• The heuristic query optimizer then transforms the initial query tree into a final query tree using
equivalence transformation rules. The final queries are efficient to execute.
• Cartesian product with selection can be replaced by a join.
• Projection operations must be included in the query tree.
Illustration of the steps involved in converting query tree during optimization
Consider the following relations on a company database,
Employee(emp-name,emp-id,dob,addr,sex,sal,dep_no)
Project(proj-name,proj-no,proj-loc,proj-deptno)
Works_on(eid,pno,hours)
Consider the query to find the names of employees born after 1970 who works on a project named
‘Growth’.
SELECT EMP-NAME
FROM EMPLOYEE,WORKS-ON,PROJECT
WHERE PROJ-NAME=’GROWTH’ AND PROJ-NO=PNO AND EID=EMP-ID AND DOB=’31-12-1970’;
∏emp_name
σproj-name=’growth’ and proj_no=p-no and eid=emp-id and dob>1970
project
X
Initial Query tree
employee Works_on
∏Emp_name
σproj_no=p-no
σeid=emp-id
σproj_name=’Growth’
σdob>1970 project
Works_on
employee
Improved Query tree by applying SELECT operation
∏emp_name
σe-id=emp-id
σproj_no=p-no σ dob>1970
X employee
σproject_name=’Growth’
Works_on
Project
Improved Query tree by applying restrictive SELECT operation
∏emp_name
eid=emp-id
∏Emp-name,emp-id
∏eid
σdob>1970
proj_no=p-no
employee
∏Proj_no
σproject_name=’Growth’ ∏ e-id,p-no
Works_on
project
Improved Query tree by moving PROJECT operations down with JOIN
Steps in Heuristic Optimization Algorithm
Step Action Description
1 Perform Selection at the earliest Use transformation rule to
break u any select
operations with
conjunctive condition into
a cascade of selec.
2. Perform commutativity of SELECT Use transformation rules
operation 2,4,6,9and move selection
operation as far down the
tree.
3. Combine the Cartesian product Use transformation rule
with subsequent select operation 13 to combine a Cartesian
and selection into a join
operation.
4. Use Commutativity and Use transformations rule
associativity of binary operations 5,11,12 to rearrange the
leaf nodes so that most
restrictive selections are
done first.
5. Perform Projection operation as Use transformation rules
early as possible 3,4,7,10,break down and
move the projection
attributes down the tree
as far as possible.
6. Computer common expressions Identify subtrees that can
once be executed by a single
algorithm
Heuristics Transformation Rules
1. Cascading of Selection: (σ)
σbranch_loc=‘chennai’ ^ σ sal>85000(EMP) =
σbranch_loc=‘chennai’(σsal>85000(EMP))
2. Commutativity of Selection (σ)
σbranch_loc=‘chennai’(σsal>85000(EMP)) =
σsal>85000 (σbranch_loc=‘chennai’(EMP))
3. Commutativity of Projection (∏)
∏ename ∏branch_loc,ename(EMP)=∏ename,branch_loc (EMP)
4. Commutativity of σ and ∏
∏ename,dob(σename=‘thomas’ (EMP) =
σename=‘thomas’(∏ename,dob(EMP))
5. Commutativity of Join and cartesian product.
R X S=SXR
R cS =S cR
6. Commutativity of selection and join or cartesian
σcR S=σc S
σc(RXS)=(σc(R)) X S
7. Commutativity of ∏, or X
∏l1U l2(R c S)=∏l1(R) c ∏l2(S)
8.Commutativity of Union & Intersection
R U S=S U R
R n S= S n R
9. Commutativity of selection and set operations.
σc(R U S)=σc(S) U σc (R)
σc(R n S)=σc(S) n σc (R)
σc(R - S)=σc(S) - σc (R)
10. Commutativity of ∏ and U
∏L(R U S)=∏L(R) U ∏L(S)
11. Associativity of Join & Cartesian
(R X S)X T=R X (SXT)
(R S) T=R (S T)
12. Associativity of Union & Intersection
(R U S)U T=S U (R U T)
13. Converting a selection & cartesian into Join σc(RXS)= R c S
COST ESTIMATION IN QUERY OPTIMIZATION
The main aim is to choose the most efficient way of implementing the relational algebra
operations at the lowest possible cost. The method of optimizing the query by choosing a strategy
with the minimum cost is called cost-based query optimization.
COST COMPONENTS OF QUERY EXECUTION
a. Access cost to secondary storage (Larger databases)
b. Storage Cost
c. Computation Cost (Smaller databases)
d. Memory Uses Cost
e. Communication Cost (Distributed databases)
To estimate the cost of the various execution strategies, the DBMS is expected to hold the following
types of information,
a. nTuples(R) - no of tuples in relation R
b. The average record size in R
c. The no. Of blocks required to store relation R, nBlocks(R)
d. The blocking factor of Relation R- bFactor(R)
e. Primary access method for each file
f. Primary access attributes of each file.
g. nLevels(I)-the no. of levels of each multi-level index I.
h. nBlocks(I)- the no. of first-level index blocks
i. nDistinct(R)- the no. of distinctive values that appear for attribute A
j. The selectivity of an attribute.
k. The selection cardinality of a attribute A in Relation R
ESTIMATED COST OF STRATEGIES FOR SELECTION OPERATION
STATEGIES Cost Estimates
Linear Search nBlocks(R)/2, if record is found
nBlocks(R), if no record satisfies
Binary Search Log2(nBlocks(R), if equity condition on key
attribute,
[log2(nBlocks(R)]+sc(R)/bFactor(R)-1,
otherwise
Using Primary key or 1, assuming no overflow
hash key to retrieve a
single record
Equity condition on nLevels(I)+1
primary key
Inequity condition on [nLevels(I)+1]+[nBlocks[R]/2]
primary key
Using inequality Nlevels(R)+[nlfBlocks(I)/2]+[nTuples(R)/2]
condition on a
secondary index (b+
tree)
Equity condition on [nLeveles(I)+1]+[SC(R)/bFactor(R)]
clustering index
Equity condition on [nLevels(I)+1]+[sc(R)]
non-clustering
Pipelining
• When a query is composed of several RA operators, the result of one operator is
pipelined to another operator without creating a temporary relation to hold the intermediate
result.
• This is called on-the-fly-processing.
• Pipelining is used to Improve the performance of the query.
• Adv: Saves the cost of creating temporary relations.
• Disadv: Input to the operations are not available all at once for processing.
Materialization
• If the output of an operator is saved in a temporary relation for processing by the next operator,
the tuples are said to be materialized.
• Process is called Materialization because the results of the intermediate operations are created
(or materialised) and then evaluated for next level operation.
• By repeating the process, the operation at the root of the tree is evaluated giving the final result
of the execution.
STRUCTURE OF QUERY EVALUATION PLAN
An evaluation plan is used to define exactly what algorithm should be used for each operation
and how the execution of the operations should be coordinated.
It may be classified into the following,
Left-deep (join) tree query execution plan
Right-deep (join) tree query execution plan
Linear Tree query execution plan
Bushy(non-linear) tree query execution plan
Left-deep tree starts from a relation and constructs the result by successively adding an operation
involving a single relation until the query is completed. That is only one input into a binary operation is
an intermediate result. It reduces the searching space and allows the query optimizer to be based on
dynamic programming technique. The disadvantage is that many alternative execution strategies are not
considered.
Right-deep tree execution has applications where there is large main memory.
Combination of left-deep and right-deep are called Linear tree. Here the relation on one side of the
operator is always a base relation. The inner relation must always be materialized.
Bushy (Non-Linear) trees are the most general type of trees. They allow both inputs into binary
operation to be the intermediate results.
This allows wide variety of plans to be considered, and the disadvantage is it increases the search space.
∏ (Sort )
Hash Join
Relation3
Pipelining Pipelining
σ (Use index 1) σ (use linear scan)
Relation1 Relation2
A possible evaluation plan
Left deep execution plan
R4
R3
R11 R2
Right-deep execution plan
R4
R3
R21 R1
Linear Tree Execution Plan
R4
R3
R21 R1
Bushy execution plan
R3 R4 R5
R21 R1