Chapter Three
Query Processing and Optimization
        22
   2 5/                                      QUERY OPTIMIZATION   1
  /
10
OUTLINE
        I.   Query Processing and Optimization: Why?
        II. Steps of Processing
        III. Methods of Optimization
                Heuristic (Logical Transformations)
                    Transformation Rules
                    Heuristic Optimization Guidelines
                Cost Based (Physical Execution Costs)
                    Data Storage/Access Refresher
                Semantics based
        22
   2 5/                                                  QUERY OPTIMIZATION   2
  /
10
        INTRODUCTION TO QUERY OPTIMISATION
        With declarative languages such as SQL, the user specifies what data is required
         E.g. SELECT name FROM Staff
               WHERE gender = ‘F’;
             rather than how it is to be retrieved
        This relieves the user of knowing what constitutes good execution strategy.
        It also gives DBMS more control over system performance.
        22
   2 5/                                                                     QUERY OPTIMIZATION   3
  /
10
 What is Query Processing?
 Steps required to transform high level SQL query into a correct and
  “efficient” strategy for execution and retrieval.
 What is Query Optimization?
 The activity of choosing a single “efficient” execution strategy
  (from hundreds) as determined by database catalog statistics.
 Which relational algebra expression, equivalent to the given query,
  will lead to the most efficient solution plan?
 How do operations pass data (main memory buffer, disk buffer,…)?
 Will this plan minimize resource usage? (CPU/Response
  Time/Disk)
        22
   2 5/                                              QUERY OPTIMIZATION   4
  /
10
   Example:
                Identify all managers who work in a London branch
                     SELECT * FROM Staff s, Branch b
                     WHERE s.branchNo = b.branchNo AND s.position = ‘Manager’ AND
                                 b.city = ‘london’;
Results in these equivalent relational algebra statements
   (1)(position=‘Manager’)^(city=‘London’)^(Staff.branchNo=Branch.branchNo) (Staff X Branch)
   (2) (position=‘Manager’)^(city=‘London’) (Staff Staff.branchNo = Branch.branchNo Branch)
   (3) [(position=‘Manager’) (Staff)] Staff.branchNo = Branch.branchNo [(city=‘London’) (Branch)]
 Assume:
            –   1000 tuples in Staff.
            –   50 Managers
            –   50 tuples in Branch.
            –   5 London branches
            –   No indexes or sort keys
         22 –   All temporary results are written back to disk (memory      is small)
    2 5/                                                          QUERY OPTIMIZATION                    5
   /
 10
            –   Tuples are accessed one at a time (not in blocks)
QUERY 1 (BAD)
       (position=‘Manager’)^(city=‘London’)^(Staff.branchNo=Branch.branchNo) (Staff X Branch)
 Requires (1000+50) disk accesses to read from Staff and Branch relations
 Creates temporary relation of Cartesian Product-mathematical operation that
  returns a set of from multiple set (1000*50) tuples
 Requires (1000*50) disk access to read in temporary relation and test
  predicate
                       Total Work = (1000+50) + 2*(1000*50) =
                                      101,050 I/O operations
Query 2 (Better)
                  (position=‘Manager’)^(city=‘London’) (Staff Staff.branchNo = Branch.branchNo Branch)
              – Again requires (1000+50) disk accesses to read from Staff and Branch
              – Joins Staff and Branch on branchNo with 1000 tuples
                (1 employee : 1 branch )
              – Requires (1000) disk access to read in joined relation and check predicate
                                 Total Work = (1000+50) + 2*(1000) =
                                              3050 I/O operations
         22
                                  3300% Improvement over Query 1
    2 5/                                                                                       QUERY OPTIMIZATION   6
   /
 10
QUERY 3 (BEST)
         [ (position=‘Manager’) (Staff) ] Staff.branchNo = Branch.branchNo [ (city=‘London’) (Branch) ]
 Read Staff relation to determine ‘Managers’ (1000 reads)
   Create 50 tuple relation(50 writes)
 Read Branch relation to determine ‘London’ branches (50 reads)
   Create 5 tuple relation(5 writes)
 Join reduced relations and check predicate (50 + 5 reads)
                           Total Work = 1000 + 2*(50) + 5 + (50 + 5) =
                                      1160 I/O operations
                                 8700% Improvement over Query 1
          22
     2 5/                                                                             QUERY OPTIMIZATION      7
    /
  10
       QUERY PROCESSING
       STEPS
 • Processing can be divided into :Decomposition, Optimization , and
 Execution ,Code generation main categories
   1. Query Decomposition
It is the process of transforming a high level query into a relational algebra
    query, and to check that the query is syntactically and semantically
    correct. It Consists of parsing and validation
  10/25/22                               QUERY OPTIMIZATION           8
  TYPICAL STAGES IN QUERY DECOMPOSITION ARE:
i. Analysis: lexical and syntactical analysis of the
   query(correctness) based on attributes, data type.. ,.
 Query tree will be built for the query containing leaf node for base
  relations,
 one or many non-leaf nodes for relations produced by relational
  algebra operations and
 root node for the result of the query.
 Sequence of operation is from the leaves to the root.
• (SELECT * FROM Catalog c ,Author a Where a.authorid = c.authorid
 AND c.price>200 AND a.country= ‘ USA’ )
ii. Normalization: convert the query into a normalized
    form.
 The predicate WHERE will be converted to Conjunctive (∨) or
  Disjunctive (∧) Normal form.
          22
     2 5/                                            QUERY OPTIMIZATION   9
    /
  10
 CONT… TYPICAL STAGES IN QUERY DECOMPOSITION ARE:
iii. Semantic Analysis: to reject normalized queries that
     are not correctly formulated or contradictory.
 Incorrect if components do not contribute to generate result.
 Contradictory if the predicate can not be satisfied by any tuple. Say
  for example,(gender=“m”  gender= “f”) since a given book author
  can only be classified in either of the category at a time
iv. Simplification: to detect redundant qualifications,
    eliminate common sub-expressions, and transform
    the query to a semantically equivalent but more
    easily and effectively computed form.
 For example, If a user don’t have the necessary access to all of the
  objects of the query , it should be rejected.
          22
     2 5/                                            QUERY OPTIMIZATION   10
    /
  10
2. Query Optimization
      Everyone wants the performance of their database to be optimal. In
      particular, there is often a requirement for a specific query or object
      that is query based, to run faster.
      Problem of query optimization is to find the sequence of steps that
      produces the answer to user request in the most efficient manner,
      given the database structure.
 The performance of a query is affected by the tables or queries that
  underlies the query and by the complexity of the query.
      Given a request for data manipulation or retrieval, an optimizer
      will choose an optimal plan for evaluating the request from among
      the manifold alternative strategies. i.e. there are many ways (access
      paths) for accessing desired file/record.
      hence ,DBMS is responsible to pick the best execution strategy
      based on various considerations( Least amount of I/O and CPU resources. )
            22
       2 5/                                                 QUERY OPTIMIZATION   11
      /
    10
        22
   2 5/      QUERY OPTIMIZATION   12
  /
10
        22
   2 5/      QUERY OPTIMIZATION   13
  /
10
        22
   2 5/      QUERY OPTIMIZATION   14
  /
10
        22
   2 5/      QUERY OPTIMIZATION   15
  /
10
A .USING HEURISTICS IN QUERY OPTIMIZATION
   Query block: The basic unit that can be translated into the algebraic
    operators and optimized.
   A query block contains a single SELECT-FROM-WHERE expression,
    as well as GROUP BY and HAVING clause if these are part of the block.
   Nested queries within a query are identified as separate query blocks.
             Process for heuristics optimization
              1. The parser of a high-level query generates an initial internal
                  representation;
              2. Apply heuristics rules to optimize the internal representation.
              3. A query execution plan is generated to execute groups of
                  operations based on the access paths available on the files
                  involved in the query.
             The main heuristic is to apply first the operations that reduce
              the size of intermediate results.
               E.g. Apply SELECT and PROJECT operations before applying
                  the JOIN or other binary operations.
         22
    2 5/                                                      QUERY OPTIMIZATION   16
   /
 10
• Query tree:
   – A tree data structure that corresponds to a relational algebra
     expression.
   – It represents the input relations of the query as leaf nodes of the
     tree, and represents the relational algebra operations as internal
     nodes.
• Query graph:
   – A graph data structure that corresponds to a relational calculus
     expression. It does not indicate an order on which operations to
     perform first. There is only a single graph corresponding to each
     query.
  Example:
  For every project located in ‘Stafford’, retrieve the project number, the controlling
   department number and the department manager’s last name, address and birthdate.
  Relation algebra:
  SQL query:
 Q2:       SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
           FROM PROJECT AS P ,DEPARTMENT AS D, EMPLOYEE AS E
                 WHERE P.DNUM=D.DNUMBER AND
       /2
         2           D.MGRSSN=E.SSN AND
                                                                    17
 10/2
      5              P.PLOCATION=‘STAFFORD’;     QUERY OPTIMIZATION
        22
   2 5/      QUERY OPTIMIZATION   18
  /
10
Q2:          SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
                          FROM      PROJECT AS P ,DEPARTMENT AS D, EMPLOYEE AS E
                                WHERE P.DNUM=D.DNUMBER AND
                                       D.MGRSSN=E.SSN AND
                                 P.PLOCATION=‘STAFFORD’;
        22
   2 5/                                                        QUERY OPTIMIZATION   19
  /
10
        22
   2 5/      QUERY OPTIMIZATION   20
  /
10
Heuristic Optimization of Query Trees:
 The same query could correspond to many different relational algebra
  expressions — and hence many different query trees.
 The task of heuristic optimization of query trees is to find a final
  query tree that is efficient to execute.
Example:
Q:        SELECT            LNAME
          FROM              EMPLOYEE, WORKS_ON, PROJECT
          WHERE             PNAME = ‘AQUARIUS’ AND
                            PNMUBER=PNO AND ESSN=SSN
                            AND BDATE > ‘1957-12-31’;
        22
   2 5/                                                QUERY OPTIMIZATION   21
  /
10
        22
   2 5/      QUERY OPTIMIZATION   22
  /
10
        22
   2 5/      QUERY OPTIMIZATION   23
  /
10
Which tree is better?
               brName                                   brName
                   |X| b.branchNo = s.branchNo             |X| b.branchNo = s.branchNo
Branch
                                                  s.name = ‘Mary’           Branch
                          s.name = ‘Mary’
                                                         Staff
                                 Staff
         22
    2 5/                                                               QUERY OPTIMIZATION   24
   /
 10
Summary of Heuristics for Algebraic Optimization:
             1. The main heuristic is to apply first the operations that reduce
                the size of intermediate results.
             2. Perform select operations as early as possible to reduce the
                number of tuples and perform project operations as early as
                possible to reduce the number of attributes. (This is done by
                moving select and project operations as far down the tree as possible.)
             3. The select and join operations that are most restrictive should
                be executed before other similar operations. (This is done by
                reordering the leaf nodes of the tree among themselves and
                adjusting the rest of the tree appropriately.)
        22
   2 5/                                                              QUERY OPTIMIZATION   25
  /
10
B. Cost Estimation Approach to Query Optimization
The main idea is to minimize the cost of processing a query. The cost
  function is comprised of:
I/O cost + CPU processing cost + communication cost + Storage cost
These components might have different weights in different
  processing environments
The DBMs will use information stored in the system catalogue for
  the purpose of estimating cost.
The main target of query optimization is to minimize the size of the
   intermediate relation. The size will have effect in the cost of:
 Disk Access
 Data Transportation
 Storage space in the Primary Memory
 Writing on Disk
        22
   2 5/                                              QUERY OPTIMIZATION   26
  /
10
1. Access Cost of Secondary Storage
Data is going to be accessed from secondary storage, as a query will be
   needing some part of the data stored in the database. The disk access
   cost can again be analyzed in terms of:
 Searching
 Reading, and
 Writing, data blocks used to store some portion of a relation.
Remark: The disk access cost will vary depending on
 The file organization used and the access method implemented for the file
  organization.
 whether the data is stored contiguously or in scattered manner, will affect the
  disk access cost.
2. Storage Cost
•While processing a query, as any query would be composed of many database
operations, there could be one or more intermediate results before reaching the
final output. These intermediate results should be stored in primary memory for
further processing. The bigger the intermediate relation, the larger the memory
requirement, which will have impact on the limited available space.
         22
    2 5/                                                       QUERY OPTIMIZATION   27
   /
 10
3. Computation Cost
Query is composed of many operations. The operations could be
   database operations like reading and writing to a disk, or
   mathematical and other operations like:
 Searching
 Sorting
 Merging
 Computation on field values
4. Communication Cost
o In most database systems the database resides in one station and
various queries originate from different terminals. This will have impact
on the performance of the system adding cost for query processing. Thus,
the cost of transporting data between the database site and the terminal
from where the query originate should be analyzed.
        22
   2 5/                                                QUERY OPTIMIZATION   28
  /
10
3. Query Execution Plans
       – An execution plan for a relational algebra query consists of a
              combination of the relational algebra query tree and information
              about the access methods to be used for each relation as well as the
              methods to be used in computing the relational operators stored in
              the tree.
         22
    2 5/                                                        QUERY OPTIMIZATION   29
   /
 10