KEMBAR78
Unit 5 Query Processing Detail | PDF | Relational Database | Relational Model
0% found this document useful (0 votes)
61 views38 pages

Unit 5 Query Processing Detail

Uploaded by

rudraghankute07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views38 pages

Unit 5 Query Processing Detail

Uploaded by

rudraghankute07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 38

MITCOM MCA DS

BCA SECOND YEAR


SEM 3

INTRODUCTION TO DBMS

MITCOM
MIT ADT UNIVERSITY

Unit 5
QUERY PROCESSING

Compiled by
Dr Ashwin Tomar
[MCA, MBA, PhD(Computer Sc.)]
drtomarashwin@gmail.com

1
MITCOM MCA DS

Syllabus
Query Processing & Query Optimization 2 Hours
9.1 Overview, measures of query cost, selection operation, sorting, Join
9.2 Evaluation of expressions, transformation of relational expressions,
estimating statistics of expression results, evaluation plans, materialized views

UNIT - 4
Query Processing
Query Processing includes translations on high level Queries into low level
expressions that can be used at physical level of file system, query optimization
and actual execution of query to get the actual result.

Block Diagram of Query Processing

Scanning, Parsing, Validating.


[Scanner identifies the language token like SQL keywords, attributes and relational names]
Parser - checks syntax of query.
Validation – to check all attributes and names are valid
SQL Query is converted to Query tree and to Query graph].

2
MITCOM MCA DS

Detailed Diagram is drawn as:

Method of Optimising the query by choosing the strategy that results in minimum cost is
called as cost based query optimisation. The cost of algorithm depends on cardinality of
input, output. The cost of executing query includes Cost of access to secondary storage,
Cost of storage, Cost of computation, Cost of memory usage, Cost of communication.
Minimisation of cost depends on the size and type of database application.

It is done in the following steps:


Step-1:

Parser: During parse call, the database performs the following checks-
Syntax check, Semantic check and Shared pool check, after converting the query
into relational algebra.
Parser performs the following checks as (refer detailed diagram):
1. Syntax check – concludes SQL syntactic validity.
Example:
SELECT * FORM employee
Here error of wrong spelling of FROM is given by this check.

3
MITCOM MCA DS

2. Semantic check – determines whether the statement is meaningful or


not. Example: query contains a table name which does not exist is
checked by this check.
3. Shared Pool check – Every query possess a hash code during its
execution. So, this check determines existence of written hash code in
shared pool if code exists in shared pool then database will not take
additional steps for optimization and execution.

Hard Parse and Soft Parse –

If there is a fresh query and its hash code does not exist in shared pool then
that query has to pass through from the additional steps known as hard
parsing otherwise if hash code exists then query does not passes through
additional steps. It just passes directly to execution engine. This is known as

Hard Parse includes following steps Optimizer and Row source generation.

Step-2:
soft parsing.
Optimizer: During optimization stage, database must perform a hard
parse at least for one unique DML statement and perform optimization during this
parse. This database never optimizes DDL unless it includes a DML component
such as sub query that require optimization.
It is a process in which multiple query execution plan for satisfying a query are
examined and most efficient query plan is satisfied for execution.
Database catalog stores the execution plans and then optimizer passes the lowest
cost plan for execution.
Row Source Generation –
The Row Source Generation is a software that receives a optimal execution
plan from the optimizer and produces an iterative execution plan that is
usable by the rest of the database. The iterative plan is the binary program
that when executes by the sql engine produces the result set.

Step-3:

Execution Engine: Finally runs the query and display the required result.

4
MITCOM MCA DS

Query Optimization in Relational Algebra


Query: A query is a request for information from a database.

Query Plans: A query plan (or query execution plan) is an ordered set of steps
used to access data in a SQL relational database management system.

Query Optimization: A single query can be executed through different


algorithms or re-written in different forms and structures. Hence, the question of
query optimization comes into the picture – Which of these forms or pathways is
the most optimal? The query optimizer attempts to determine the most efficient
way to execute a given query by considering the possible query plans.

Importance: The goal of query optimization is to reduce the system resources


required to fulfill a query, and ultimately provide the user with the correct result
set faster.

 First, it provides the user with faster results, which makes the application
seem faster to the user.
 Secondly, it allows the system to service more queries in the same amount
of time, because each request takes less time than unoptimized queries.
 Thirdly, query optimization ultimately reduces the amount of wear on the
hardware (e.g. disk drives), and allows the server to run more efficiently (e.g.
lower power consumption, less memory usage).

There are broadly two ways a query can be optimized:

1. Analyze and transform equivalent relational expressions: Try to minimize


the tuple and column counts of the intermediate and final query processes
(discussed here).

2. Using different algorithms for each operation: These underlying algorithms


determine how tuples are accessed from the data structures they are stored
in, indexing, hashing, data retrieval and hence influence the number of disk
and block accesses (discussed in query processing).

Analyze and transform equivalent relational expressions

Here, we shall talk about generating minimal equivalent expressions. To analyze


equivalent expression, listed are a set of equivalence rules. These generate
equivalent expressions for a query written in relational algebra. To optimize a
query, we must convert the query into its equivalent form as long as an
equivalence rule is satisfied.

5
MITCOM MCA DS

Relational Algebra

Relational algebra is a procedural query language. It gives a step by step process to


obtain the result of the query. It uses operators to perform queries.

Types of Relational operation

1. Select Operation:

o The select operation selects tuples that satisfy a given predicate.


o It is denoted by sigma (σ).

Notation: σ p(r)
Where:

σ is used for selection prediction


r is used for relation
p is used as a propositional logic formula which may use connectors like: AND OR and NOT.

These relational can use as relational operators like =, ≠, ≥, <, >, ≤.

For example: LOAN Relation

BRANCH_NAME LOAN_NO AMOUNT

6
MITCOM MCA DS

Downtown L-17 1000

Redwood L-23 2000

Perryride L-15 1500

Downtown L-14 1500

Mianus L-13 500

Roundhill L-11 900

Perryride L-16 1300

Input:

σ BRANCH_NAME ="perryride" (LOAN)

Output:

BRANCH_NAME LOAN_NO AMOUNT

Perryride L-15 1500

Perryride L-16 1300

2. Project Operation:

o This operation shows the list of those attributes that we wish to appear in the result.
Rest of the attributes are eliminated from the table.
o It is denoted by ∏.

Notation: ∏ A1, A2, An (r)

Where

A1, A2, A3 is used as an attribute name of relation r.

Example: CUSTOMER RELATION

NAME STREET CITY


Jones Main Harrison

Smith North Rye

7
MITCOM MCA DS

Hays Main Harrison


Curry North Rye
Johnson Alma Brooklyn
Brooks Senator Brooklyn

Input:

∏ NAME, CITY (CUSTOMER)

Output:

NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
3. Union Operation:

o Suppose there are two tuples R and S. The union operation contains all the tuples that
are either in R or S or both in R & S.
o It eliminates the duplicate tuples. It is denoted by ∪.

Notation: R ∪ S

A union operation must hold the following condition:

o R and S must have the attribute of the same number.


o Duplicate tuples are eliminated automatically.

Example:

DEPOSITOR RELATION

CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284

8
MITCOM MCA DS

BORROW RELATION

CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17

Input:

∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
o Suppose there are two tuples R and S. The set intersection operation contains all tuples
that are in both R & S.
o It is denoted by intersection ∩.

Notation: R ∩ S

Example: Using the above DEPOSITOR table and BORROW table

Input:

∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Smith
Jones

9
MITCOM MCA DS

5. Set Difference:
o Suppose there are two tuples R and S. The set intersection operation contains all tuples
that are in R but not in S.
o It is denoted by intersection minus (-).

Notation: R - S

Example: Using the above DEPOSITOR table and BORROW table

Input:

∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Jackson
Hayes
Willians
Curry

6. Cartesian product
o The Cartesian product is used to combine each row in one table with each row in the
other table. It is also known as a cross product.
o It is denoted by X.

Notation: E X D

Example:
EMPLOYEE

EMP_ID EMP_NAME EMP_DEPT


1 Smith A
2 Harry C
3 John B

DEPARTMENT

DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal

Input:

EMPLOYEE X DEPARTMENT

10
MITCOM MCA DS

Output:

EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME


1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal

7. Rename Operation:

The rename operation is used to rename the output relation. It is denoted by rho (ρ).

Example: We can use the rename operator to rename STUDENT relation to STUDENT1.

ρ(STUDENT1, STUDENT)

Relational Model concept

Relational model can represent as a table with columns and rows. Each row is known as a tuple.
Each table of the column has a name or attribute.

Domain: It contains a set of atomic values that an attribute can take.

Attribute: It contains the name of a column in a particular table. Each attribute Ai must have
a domain, dom(Ai)

Relational instance: In the relational database system, the relational instance is represented
by a finite set of tuples. Relation instances do not have duplicate tuples.

Relational schema: A relational schema contains the name of the relation and name of all
columns or attributes.

Relational key: In the relational key, each row has one or more attributes. It can identify the
row in the relation uniquely.

Example: STUDENT Relation

ROLL_NO PHONE_NO ADDRESS AGE


Ram 14795 7305758992 Noida 24
Shyam 12839 9026288936 Delhi 35
Laxman 33289 8583287182 Gurugram 20

11
MITCOM MCA DS

Mahesh 27857 7086819134 Ghaziabad 27


Ganesh 17282 9028 9i3988 Delhi 40

o In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the
attributes.
o The instance of schema STUDENT has 5 tuples.
o t3 = <Laxman, 33289, 8583287182, Gurugram, 20>

Properties of Relations

o Name of the relation is distinct from all other relations.


o Each relation cell contains exactly one atomic (single) value
o Each attribute contains a distinct name
o Attribute domain has no significance
o tuple has no duplicate value
o Order of tuple can have a different sequence

3.2.2 Relational model constraints

Integrity Constraints

o Integrity constraints are a set of rules. It is used to maintain the quality of information.
o Integrity constraints ensure that the data insertion, updating, and other processes have
to be performed in such a way that data integrity is not affected.
o Thus, integrity constraint is used to guard against accidental damage to the database.

Types of Integrity Constraint

1. Domain constraints
o Domain constraints can be defined as the definition of a valid set of values for an
attribute.

12
MITCOM MCA DS

o The data type of domain includes string, character, integer, time, date, currency, etc.
The value of the attribute must be available in the corresponding domain.

Example:

2. Entity integrity constraints [primary key value can't be null – identifies rows]
o The entity integrity constraint states that primary key value can't be null.
o This is because the primary key value is used to identify individual rows in relation and
if the primary key has a null value, then we can't identify those rows.
o A table can contain a null value other than the primary key field.

Example:

3. Referential Integrity Constraints


o A referential integrity constraint is specified between two tables.
o In the Referential integrity constraints, if a foreign key in Table 1 refers to the Primary
Key of Table 2, then every value of the Foreign Key in Table 1 must be null or be
available in Table 2.

Example:

13
MITCOM MCA DS

4. Key constraints
o Keys are the entity set that is used to identify an entity within its entity set uniquely.
o An entity set can have multiple keys, but out of which one key will be the primary key.
A primary key can contain a unique and null value in the relational table.

Example:

14
MITCOM MCA DS

Transformation Rules:

15
MITCOM MCA DS

16
MITCOM MCA DS

17
MITCOM MCA DS

Minimality
A set of equivalence rules is said to be minimal if no rule can be derived from any
combination of the others. A query is said to be optimal when it is minimal.
Examples
instructor(ID, name, dept_name, salary)
teaches(ID, course_id, sec_id, semester, year)
course(course_id, title, dept_name, credits)

18
MITCOM MCA DS

19
MITCOM MCA DS

When a query is submitted to the database, it is responsibility of database to


determine algorithm to evaluate the query. The DBMS has to decide a low cost
algorithm to evaluate the query and then optimal path to evaluate the query. The
query may be simple or complex. Depending on the cost it has to pick better
execution path. This is called query optimization.

There are various factors affecting the performance of the query like

 number of records in the table,


 number of blocks allocated to each table,
 number of records in each block,
 size of the record,
 duplicate records,
 height of B+ tree,
 constraints and indexes etc.

How to select a better performing Query ?

Transformation of Relational Expressions

When a SQL query is submitted to DB, it can be evaluated in number of ways. For
example, consider the below case:

SELECT EMP_ID, DEPT_NAME


FROM EMP, DEPT
WHERE EMP.DEPT_ID = DEPT.DEPT_ID
AND EMP.DEPT_ID = 10;

Above query selects the EMP_ID and DEPT_NAME from EMP and DEPT table for
DEPT_ID = 10. But when it is given to the DBMS, it divides the query into tokens
and sees how it can be put together so that performance will be better. This is the
duty of query optimizer. But altering the order of tokens in the query should not
change the result. In either way it should give same result. Order of records can
change and are least important. This is called equivalent query. There is set of rules
to put tokens in the query. This is called equivalence rule.

Above query can be broken down in either ways below:

 Select the records of EMP with DEPT_ID = 10 first then join them with DEPT
table to get all matching records of EMP and DEPT. Then select only the
columns EMP_ID and DEPT_NAME to display the result.
20
MITCOM MCA DS

 Select all matching records from EMP and DEPT, from which filter on
DEPT_ID = 10 and select only EMP_ID and DEPT_NAME to display.

Both the steps above are same irrespective of how it is performed. Hence both are
called equivalent query. These are not written in SQL, but using relational algebra,
graph or tree.

∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 (EMP ∞DEPT))

or

σ DEPT_ID = 10 (∏ EMP_ID, DEPT_NAME, DEPT_ID (EMP ∞DEPT))

21
MITCOM MCA DS

Above relational algebra and tree shows how DBMS depicts the query inside it. But
the cost of both of them may vary. This is because the number of records in each
step changes depending on the join and filters we use, and the algorithms used to
evaluate them. For example we may have huge number of records in second case
tree above to filter. But in the first case we are filtering the record first; hence
number of records to join would have been reduced. This makes lots of difference
and query optimizer calculates this difference and selects the optimal tree for
query evaluation.

The query optimizer uses these two techniques to determine which process or
expression to consider for evaluating the query.

There are two methods of Query Optimization

1. Cost based Optimization (Physical)

This is based on the cost of the query. The query can use different paths based on
indexes, constraints, sorting methods etc. This method mainly uses the statistics
like record size, number of records, number of records per block, number of
blocks, table size, whether whole table fits in a block, organization of tables,
uniqueness of column values, size of columns etc.

Suppose, we have series of table joined in a query.

T1 ∞ T2 ∞ T3 ∞ T4∞ T5 ∞ T6

22
MITCOM MCA DS

For above query we can have any order of evaluation. We can start taking any two
tables in any order and start evaluating the query. Ideally, we can have join
combinations in (2(n-1))! / (n-1)! ways.

For example, suppose we have 5 tables involved in join, then we can have 8! / 4! =
1680 combinations. But when query optimizer runs, it does not evaluate in all
these ways always. It uses dynamic programming where it generates the costs for
join orders of any combination of tables. It is calculated and generated only once.
This least cost for all the table combination is then stored in the database and
is used for future use. i.e.; say we have a set of tables, T = { T1 , T2 , T3 .. Tn}, then
it generates least cost combination for all the tables and stores it.

 Dynamic Programming

The least cost for the joins of any combination of table is generated here. These
values are stored in the database and when those tables are used in the query, this
combination is selected for evaluating the query.

While generating the cost, it follows below steps :

Suppose we have set of tables, T = {T1 , T2 , T3 .. Tn}, in a DB. It picks the first table,
and computes cost for joining with rest of the tables in set T. It calculates cost for
each of the tables and then chooses the best cost. It continues doing the same with
rest of the tables in set T. It will generate 2n – 1 cases and it selects the lowest cost
and stores it. When a query uses those tables, it checks for the costs here and that
combination is used to evaluate the query. This is called dynamic programming.

In this method, time required to find optimized query is in the order of 3n, where
n is the number of tables. Suppose we have 5 tables, then time required in 35 =
243, which is lesser than finding all the combination of tables and then deciding
the best combination (1680). Also, the space required for computing and storing
the cost is also less and is in the order of 2n. In above example, it is 25 = 32.

 Left Deep Trees

This is another method of determining the cost of the joins. Here, the tables and
joins are represented in the form of trees. The joins always form the root of the
tree and table is kept at the right side of the root. LHS of the root always point to
the next join. Hence it gets deeper and deeper on LHS. Hence it is called as left

23
MITCOM MCA DS

deep tree.

Here instead of calculating the best join cost for set of tables, best join cost for
joining with each table is calculated. In this method, time required to find
optimized query is in the order of n2n, where n is the number of tables. Suppose
we have 5 tables, then time required in 5*25 =160, which is lesser than dynamic
programming. Also, the space required for computing storing the cost is also less
and is in the order of 2n. In above example, it is 25 = 32, same as dynamic
programming.

 Interesting Sort Orders

This method is an enhancement to dynamic programming. Here, while calculating


the best join order costs, it also considers the sorted tables. It assumes, calculating
the join orders on sorted tables would be efficient. i.e.; suppose we have unsorted
tables T1 , T2 , T3 .. Tn and we have join on these tables.

(T1 ∞T2) ∞ T3 ∞… ∞ Tn

This method uses hash join or merge join method to calculate the cost. Hash Join
will simply join the tables. We get sorted output in merge join method, but it is
costlier than hash join. Even though merge join is costlier at this stage, when it
moves to join with third table, the join will have less effort to sort the tables. This
is because first table is the sorted result of first two tables. Hence it will reduce the
total cost of the query.

24
MITCOM MCA DS

But the number of tables involved in the join would be relatively less and this
cost/space difference will be hardly noticeable.

All these cost based optimizations are expensive and are suitable for large number
of data. There is another method of optimization called heuristic optimization,
which is better compared to cost based optimization.

2. Heuristic Optimization (Logical)

This method is also known as rule based optimization. This is based on the
equivalence rule on relational expressions; hence the number of combination of
queries get reduces here. Hence the cost of the query too reduces.

This method creates relational tree for the given query based on the equivalence
rules. These equivalence rules by providing an alternative way of writing and
evaluating the query, gives the better path to evaluate the query. This rule need
not be true in all cases. It needs to be examined after applying those rules. The
most important set of rules followed in this method is listed below:

 Perform all the selection operation as early as possible in the query. This
should be first and foremost set of actions on the tables in the query. By
performing the selection operation, we can reduce the number of records
involved in the query, rather than using the whole tables throughout the
query.

Suppose we have a query to retrieve the students with age 18 and studying in class
DESIGN_01. We can get all the student details from STUDENT table, and class
details from CLASS table. We can write this query in two different ways.

The queries will return same result. But when we observe them closely we can see
that first query will join the two tables first and then applies the filters. That
means, it traverses whole table to join, hence the number of records involved is
more. But he second query, applies the filters on each table first. This reduces the
number of records on each table (in class table, the number of record reduces to

25
MITCOM MCA DS

one in this case!). Then it joins these intermediary tables. Hence the cost in this
case is comparatively less.

Instead of writing query the optimizer creates relational algebra and tree for
above case.

 Perform all the projection as early as possible in the query. This is similar to
selection but will reduce the number of columns in the query.

Suppose for example, we have to select only student name, address and class name
of students with age 18 from STUDENT and CLASS tables.

Here again, both the queries look alike, results alike. But when we compare the
number of records and attributes involved at each stage, second query uses less
records and hence more efficient.

26
MITCOM MCA DS

Next step is to perform most restrictive joins and selection operations. When we
say most restrictive joins and selection means, select those set of tables and views
which will result in comparatively less number of records. Any query will have
better performance when tables with few records are joined. Hence throughout
heuristic method of optimization, the rules are formed to get less number of
records at each stage, so that query performance is better. So is the case here too.

Suppose we have STUDENT, CLASS and TEACHER tables. Any student can attend
only one class in an academic year and only one teacher takes a class. But a class
can have more than 50 students. Now we have to retrieve STUDENT_NAME,
ADDRESS, AGE, CLASS_NAME and TEACHER_NAME of each student in a school.

∏STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME ((STUDENT ∞ CLASS_ID


CLASS)∞ TECH_IDTEACHER)

Not So efficient

∏STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME (STUDENT ∞ CLASS_ID


(CLASS∞ TECH_IDTEACHER))

Efficient

In the first query, it tries to select the records of students from each class. This
result in a very huge intermediary table. This table is joined with another small
table. Hence the traversing of number of records is also more.

But in the second query, CLASS and TEACHER are joined first, which has one to
one relation here. Hence the number of resulting record is STUDENT table give the
final result. Hence this second method is more efficient.

 Sometimes we can combine above heuristic steps with cost based


optimization technique to get better results.

All these methods need not be always true. It also depends on the table size,
column size, type of selection, projection, join sort, constraints, indexes, statistics
etc. Above optimization describes the best way of optimizing the queries.

27
MITCOM MCA DS

Evaluation Plans
When a query is submitted to DB, it is parsed and translated to relational algebra.
It is verified for its validity and correctness. Once it passes this stage, different
ways of evaluating the query is generated. It is checked for various factors and
its execution plan is generated. It may be based on cost of the query or based on
the equivalence rules. Once cost based execution and rule based execution plans
are generated, optimizer has to decide, which plan to be selected for evaluation.
This is the most important step in processing a query.

As we have seen in other articles, the cost or the heuristic execution plan may not
be always effective in all the tables with same type of query. They are all general
guidelines to evaluate a query. There are lot many factors affecting the
performance of a query. The evaluation plans exactly defines the system which
plan / algorithm is to be used to evaluate, which index to be used etc.

Consider below example on EMP and DEPT tables.

SELECT EMP_ID, DEPT_NAME


FROM EMP, DEPT
WHERE EMP.DEPT_ID = DEPT.DEPT_ID
AND EMP.DEPT_ID = 10
AND EMP.EMP_LAST_NAME = ‘Joseph’ ;

Above query selects the EMP_ID and DEPT_NAME from EMP and DEPT table for
DEPT_ID = 10 with employee’s last name as ‘Joseph’. But when it is given to the
DBMS, it divides the query into tokens and sees how it can be put together so
that performance will be better. This is the duty of query optimizer. But
altering the order of tokens in the query should not change the result. In either
way it should give same result. Order of records can change and are least
important. This is called equivalent query. There is set of rules to put tokens in
the query. This is called equivalence rule.

Above query can be broken down by the DBMS in either ways below :

Select the records of EMP with DEPT_ID = 10 and EMP_LAST_NAME = ‘Joseph’


first then join them with DEPT table to get all matching records of EMP and
DEPT. Then select only the columns EMP_ID and DEPT_NAME to display the
result.

28
MITCOM MCA DS

Select all matching records from EMP and DEPT, from which filter on DEPT_ID =
10 and EMP_LAST_NAME = ‘Joseph’ and then select only EMP_ID and
DEPT_NAME to display.

Select all matching records from EMP and DEPT, from which select only EMP_ID
and DEPT_NAME and then filter on DEPT_ID = 10 and EMP_LAST_NAME =
‘Joseph’ and then.

Both the steps above are same irrespective of how it is performed. Hence both
are called equivalent query. These are not written in SQL, but using relational
algebra, or tree.

∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (EMP) ∞DEPT)

Or

∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (EMP ∞DEPT))

Or

DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (∏ EMP_ID, DEPT_NAME, DEPT_ID (EMP ∞DEPT


))

29
MITCOM MCA DS

The optimizer can produce relational expression and tree in above three
formats. According to evaluation rule

The optimizer can produce relational expression and tree in above three
formats. According to evaluation rule, the first query seems to be the best
one. But considering other factors of tables, other plans may also be efficient.

30
MITCOM MCA DS

Choice of Evaluation Plans


Which evaluation plan has to be selected so that it can be traversed
efficiently ?

It collects all the statistics, costs, access/ evaluation paths, relational trees etc.
It then analyses them and chooses the best evaluation path.

Same query is written in different forms of relational algebra. Corresponding trees


for them too is drawn by DBMS. Statistics for them based on cost based
evaluation and heuristic methods are collected. It checks the costs based on the
different techniques that we have seen so far. It checks for the operator, joining
type, indexes, number of records, selectivity of records, distinct values etc from
the data dictionary. Once all these information’s are collected, it picks the best
evaluation plan.

Relational algebra and tree for EMP and DEPT.

∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (EMP) ∞DEPT)

Or

∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (EMP ∞DEPT))

Or

σ DEPT_ID = 10 AND EMP_LAST_NAME = ‘Joseph’ (∏ EMP_ID, DEPT_NAME, DEPT_ID (EMP ∞DE


PT))

31
MITCOM MCA DS

What can be observed here?

First tree reduces the number of records for joining and seems to be efficient. But
what happens if we have index on DEPT_ID? Then the join between EMP and EMP
can also be more efficient. But we see the filter condition on EMP table, we have
DEPT_ID = 10, which is index column. Hence first applying selection condition and
then join will reduce the number of records as well as make the join more efficient
than without index. Next are the projected columns – EMP_ID and DEPT_NAME.
they are all distinct values. There cannot be duplicate values for them. But we are
selecting those values for DEPT_ID = 10, hence DEPT_NAME has only one value.
Hence their selectivity is same as number of employees working for DEPT_ID = 10.
But we are selecting only those employees whose last name is ‘Joseph’. Hence the
selectivity is min (distinct (employee (DEPT_10)), distinct (employee (DEPT_10,
JOSEPH)). Obliviously distinct (employee (DEPT_10, JOSEPH)) would have lesser
value. The optimizer decides all these factors for above 3 trees and then decides
first tree would be more efficient. Hence it evaluates the query using first tree.

32
MITCOM MCA DS

This is how when any query is submitted to DB is traversed and evaluated.

Pipelining & Materialism

There are two methods of evaluating the query.

Materialization

In this method, queries are broken into individual queries and then the results of
which are used to get the final result. To be more specific, suppose there is a
requirement to find the students who are studying in class ‘DESIGN_01’.

SELECT * FROM STUDENT s, CLASS c


WHERE s.CLASS_ID = c.CLASS_ID AND c.CLASS_NAME = ‘DESIGN_01’;

Here we can observe two queries: one is to select the CLASS_ID of ‘DESIGN_01’ and
another is to select the student details of the CLASS_ID retrieved in the first query.
The DBMS also does the same. It breaks the query into two as mentioned above.
Once it is broken, it evaluates the first query and stores it in the temporary table
in the memory. This temporary table data will be then used to evaluate the second
query.

This is the example of two level queries in materialization method. Any number
of levels and so many numbers of temporary tables.

Although this method looks simple, the cost of this type of evaluation is always
more. It takes the time to evaluate and write into temporary table, then retrieve
33
MITCOM MCA DS

from this temporary table and query to get the next level of result and so on.
Hence cost of evaluation in this method is:

Cost = cost of individual SELECT + cost of write into temporary table

Pipelining

In this method, DBMS do not store the records into temporary tables. Instead, it
queries each query and result of which will be passed to next query to process and
so on. It will process the query one after the other and each will use the result of
previous query for its processing.

In the example above, CLASS_ID of DESIGN_01 is passed to the STUDENT table to


get the student details.

In this method no extra cost of writing into temporary tables. It has only cost of
evaluation of individual queries; hence it has better performance than
materialization.

There are two types of pipelining:

In this method, the result of lower level queries are not passed to the higher level
automatically. It will be passed to higher level only when it is requested by the
higher level. In this method, it retains the result value and state with it and it will
be transferred to the next level only when it is requested.

34
MITCOM MCA DS

In our example above, CLASS_ID for DESIGN_01 will be retrieved, but it will not be
passed to STUDENT query only when it is requested. Once it gets the request, it is
passed to student query and that query will be processed.

Producer Driven or Eager Pipelining

In this method, the lower level queries eagerly pass the results to higher level
queries. It does not wait for the higher level queries to request for the results. In
this method, lower level query creates a buffer to store the results and the higher
level queries pulls the results for its use. If the buffer is full, then the lower level
query waits for the higher level query to empty it. Hence it is also called as PULL
and PUSH pipelining.

There are still more methods of pipelining like Linear and non-linear methods of

pipelining, left deep tree, right deep tree etc.

35
MITCOM MCA DS

36
MITCOM MCA DS

37
MITCOM MCA DS

38

You might also like