KEMBAR78
Query Optimization Part1 | PDF | Relational Database | Data Management
0% found this document useful (0 votes)
14 views52 pages

Query Optimization Part1

The document outlines the execution process of SQL queries in a typical RDBMS, detailing steps from parsing the query to optimizing and executing it. It discusses various methods for query execution, including interpretation, vectorization, and compilation, along with their pros and cons. Additionally, it covers rule-based logical plan optimization and the fundamentals of relational algebra as they relate to query optimization.

Uploaded by

RishabhRao
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views52 pages

Query Optimization Part1

The document outlines the execution process of SQL queries in a typical RDBMS, detailing steps from parsing the query to optimizing and executing it. It discusses various methods for query execution, including interpretation, vectorization, and compilation, along with their pros and cons. Additionally, it covers rule-based logical plan optimization and the fundamentals of relational algebra as they relate to query optimization.

Uploaded by

RishabhRao
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 52

BAN 610 Database Management

Instructor:
Peng Xie

Department of
Management
How are SQL Queries Executed?
Typical RDBMS Execution
Step 1, Query

starName Title
- Suppose we have a query
Star1 Movie1
Star1 Movie2
Star2 Movie1
…… ……
Step 2, Parse tree

- Parse the query into a tree


Step 3, Generate query plan

- Generate a logical query plan from the parsed tree


Step 3, Generate query plan

- Generate a logical query plan from the parsed tree

One column of
star names
Step 3, Generate query plan

- Generate a logical query plan from the parsed tree

Cross join with Starsin table

One column of
star names
Step 3, Generate query plan

- Generate a logical query plan from the parsed tree

Use the foreign key constraint to remove


unrealistic records
Cross join with Starsin table

One column of
star names
Step 3, Generate query plan

- Generate a logical query plan from the parsed tree

Take the title of the result table

Use the foreign key = primary key to remove


unrealistic records
Cross join with Starsin table

One column of
star names
Step 4, Optimize query plan

- Using pre-defined rules to optimize the query

Select * from A inner join B


On A.key = B.key
(inner join)

Is better than

Select * from A, B where A.key = B.key


(cross join then predicate)
Step 5, Estimate Intermediate Table Size

- Estimate intermediate table size to select best physical plan

?
?
Step 6, Calculate cost and pick the best physical plan

- Estimate intermediate table size to select best physical plan


Now we have a plan, How do we run it

- Several different options that trade between complexity, setup time &
performance

- Example: Simple Query


Method 1: Interpretation
Method 1: Interpretation

- Example Expression Class


Method 1: Interpretation

- Example Expression Class


Method 1: Interpretation

- Example Operator Class

These expr class are


actualy attribute
class
Method 1: Interpretation

- Running Query with Interpretation


Method 1: Interpretation

- Pros and Cons


- Pros
- Straightforward and easy to implement
- Cons
- Reading tables one record at a time
- Each time it does compute, it has to trace back to
the expr and see what it is, is it a times or plus.
Branches (i.e., tracing back to parent classes)
slows down CPU
Method 2: Vectorization

- Keep recursive interpretation


- But make Operators and Expressions run on Batches
Method 2: Vectorization

- Implementing Vectorization

Values stored in columnar arrays (e.g. int[]), with a separate bit array to
mark nulls if some recording in a array does not meet condition
Tuple batches fit in L1 or L2 cache
Method 2: Vectorization

- Pros and Cons


- Faster than record-at-a-time if the query processes many
records
- Relatively simple to implement
- Lots of nulls in batches if query is selective
- Lots of usage of L1 or L2 cache, since batches are stored in
there
- Lots of data traveling between CPU and Cache
Method 3: Compilation

- Directly turn the query into executable without creating


operator or expression classes
Method 3: Compilation

- Pros and Cons


- Potential to get fastest possible execution
- Complex to implement a compiler
- Compilation takes time
- Errors in compilation are hard to find and debug. May not be
worth the trouble
Rule based logical plan optimization
Rule based optimization

- What is a rule?
- Procedure to simplify part of the query based on a pattern
- Example: When I see expr OR TRUE for an expression expr, simplify
this with TRUE
Rule based optimization

- Implementing a rule
- Each rule is typically a function that walks through query plan to
search for its pattern
Rule based optimization

- Implementing a rule
- Rules are often grouped into phases
- E.g., simplify Boolean expressions, pushdown selects, choose join
algorithms, etc.
- Each phase runs rules till they no longer apply
Simple Rules can Work Together to Optimize Complex
Queries

Purple and orange -> reduce the number of predicates to run


Green -> Make short predicates ahead of longer ones in OR (if the short
predicate is True, we don’t need to calculate the rest of the predicates)
Extensible Optimizer

- People are creating more rules over time!


Common Rule-Based
Optimizations

- Simplifying expressions in select, project, etc


- Boolean algebra, numeric expressions, string expressions, etc
- Many redundancies because queries are optimized for readability or
produced by code

- Simplifying relational operator graphs


- Select, project, join, etc
- These relational optimizations have the most impact

- Selecting access paths and operator


- Index column predicate ⇒ use index
- Small table ⇒ use hash join against it
- Aggregation on field with few values ⇒ use in-memory hash table
Relational Algebra

- To study how to optimize queries systematically we need to start with


relational algebra
- Basic set operators:
- Intersection: R ∩ S
- Employee ∩ Staff
- select E.first_name, E.last_name
- from employee as E
- inner join staff as S On E.first_name=S.first_name
- and E.last_name=S.last_name
- Union: R ∪ S
- Employee ∩ Staff
- select last_name, first_name, dept from employee
- Union
- select last_name, first_name, dept from staff
Relational Algebra

- Basic set operators:


- Difference: R – S
- Employee – Staff
- select id, name from Employee
- except
- select id, name from Staff

- Cartesian Product: R ⨯ S -> { (r, s) | r ∈ R, s ∈ S } -> cross join


- Employee ⨯ Registration
- Select *
- From Employee, Registration
Relational Algebra

- Special query operators:


- Selection: 𝝈𝒄𝒐𝒏𝒅𝒊𝒕𝒊𝒐𝒏 (𝑹) -> { r ∈ R | condition(r) is true }
- 𝜎𝑑𝑒𝑝𝑡=𝑀𝐺𝑀𝑇 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒
- Select *
- from Employee
- where Dept = “MGMT”
- Projection: 𝝅𝒆𝒙𝒑𝒓𝒆𝒔𝒔𝒊𝒐𝒏𝒔 (𝑹) -> { expressions(r) | r ∈ R }
- 𝜋𝑠𝑎𝑙𝑎𝑟𝑦/12 (𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒)
- Select (salary/12) as MonthlySalary
- from Employee

- Expressions can be a set of attributes we want


- 𝜋(𝐸𝑚𝑝𝐼𝐷,𝑠𝑎𝑙𝑎𝑟𝑦) (𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒)
- Select EmpID and Salary
- From Employee
Relational Algebra

- Special query operators:


- Inner join: R ⨝ S -> { (r, s) ∈ R ⨯ S) | r.key = s.key }
- Employee ⨝ Registration
- Select *
- from Employee, Registration
- where Employee.EmpID = Registration.EmpID
- Aggregation: 𝒌𝒆𝒚𝒔 𝑮𝒂𝒈𝒈 𝒂𝒕𝒕𝒓 (𝑹)
- 𝑑𝑒𝑝𝑡 𝐺𝑚𝑎𝑥 𝑠𝑎𝑙𝑎𝑟𝑦 (𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒)
- Select max(salary)
- from Employee
- group by dept
Relational Algebra

- Properties for Unions, Products, and Join


- R∪S=S∪R
- R ∪ (S ∪ T) = (R ∪ S) ∪ T
- R⨯S=S⨯R
- (R ⨯ S) ⨯ T = R ⨯ (S ⨯ T)
- R⨝S=S⨝R
- (R ⨝ S) ⨝ T = R ⨝ (S ⨝ T)
Relational Algebra
𝝈

- Properties for Select


- 𝝈𝒑∧ 𝒒 𝑹 = 𝝈𝒑 𝝈𝒒 𝑹
- The following two queries are equivalent:
- 𝜎salary >50000∧ salary <70000 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒
- Select * from Employee
- Where salary >50000
- And salary < 70000

- 𝜎salary >50000 𝜎salary <70000 𝑅


- Select * from
- (select * from Employee where salary < 70000) as T
- Where salary > 50000
Relational Algebra
𝝈

- But which funs faster?


- 𝜎salary >50000∧ salary <70000 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒
- Select * from Employee
- Where salary >50000
- And salary < 70000

- 𝜎salary >50000 𝜎salary <70000 𝑅


- Select * from
- (select * from Employee where salary < 70000) as T
- Where salary > 50000
Relational Algebra
𝝈

- Properties for Select


- 𝝈𝒑∨ 𝒒 𝑹 = 𝝈𝒑 𝑹 ∪ 𝝈𝒒 𝑹
- The following two queries are equivalent:
- 𝜎salary < 50000 ∨ dept ="MGMT" 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒
- Select * from Employee
- Where salary < 50000
- Or dept = “MGMT”

- 𝜎salary <50000 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒 ∪ 𝜎dept ="MGMT" 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒


- Select * from Employee Where salary < 50000
- Union
- Select * from Employee Where dept = “MGMT”
Relational Algebra
𝝈

- But which one runs faster?


- 𝜎salary < 50000 ∨ dept ="MGMT" 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒
- Select * from Employee
- Where salary < 50000
- Or dept = “MGMT”

- 𝜎salary <50000 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒 ∪ 𝜎dept ="MGMT" 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒


- Select * from Employee Where salary < 50000
- Union
- Select * from Employee Where dept = “MGMT”
Relational Algebra
σ+⨝
- Properties for σ + ⨝
- Let p = predicate with only R attributes
- q = predicate with only S attributes
- m = predicate with only R, S attributes

- 𝝈𝒑 R ⨝ S = 𝝈𝒑 R ⨝ S
- The following two queries are equivalent
- 𝜎(𝑠𝑎𝑙𝑎𝑟𝑦>50000) Employee ⨝ 𝑅𝑒𝑔𝑖𝑠𝑡𝑟𝑎𝑡𝑖𝑜𝑛 :
- Select *
- From Employee inner join Registration
- on Employee.EmpID = Registration.EmpID
- where salary > 50000

- 𝜎(𝑠𝑎𝑙𝑎𝑟𝑦>50000) Employee ⨝ 𝑅𝑒𝑔𝑖𝑠𝑡𝑟𝑎𝑡𝑖𝑜𝑛 :


- Select * from
- (select * from Employee where salary > 50000) as T inner join Registration
- On T.EmpID = Registration.EmpID
Relational Algebra
σ+⨝

- But which runs faster?


- 𝜎(𝑠𝑎𝑙𝑎𝑟𝑦>50000) Employee ⨝ 𝑅𝑒𝑔𝑖𝑠𝑡𝑟𝑎𝑡𝑖𝑜𝑛
- Discard records with salary < 50000 after inner join

- 𝜎(𝑠𝑎𝑙𝑎𝑟𝑦>50000) Employee ⨝ 𝑅𝑒𝑔𝑖𝑠𝑡𝑟𝑎𝑡𝑖𝑜𝑛


- Discard records with salary < 50000 before inner join
Relational Algebra
σ+⨝

- Properties for σ + ⨝
- Let p = predicate with only R attributes
- q = predicate with only S attributes
- m = predicate with both R, S attributes
- Note: predicates are the conditions in the where clause

- 𝝈𝒒 R ⨝ S = 𝑹⨝𝝈𝒒 S , which is faster?


- 𝝈𝒑∧ 𝒒 R ⨝ S = 𝝈𝒑 R ⨝ 𝝈𝒒 S , which is faster?
- 𝝈𝒑∧ 𝒒∧ 𝒎 R ⨝ S = 𝝈𝒎 (𝝈𝒑 R ⨝ 𝝈𝒒 S ), which is faster?
- 𝝈𝒑∨𝒒 R ⨝ S = (𝝈𝒑 R ⨝ S) ∪ (R ⨝ 𝝈𝒒 S ) , which is faster?
Relational Algebra
σ+⨝
p24

- 𝝈𝒑∧ 𝒒 R ⨝ S
p28 - = 𝝈𝒑 (𝝈𝒒 R ⨝ S )
- = 𝝈𝒑 (R ⨝ 𝝈𝒒 S )
- = 𝝈𝒑 R ⨝ 𝝈𝒒 S

You can try others for practice.


Relational Algebra
𝝅+σ

- Properties for 𝜋 + σ
- Let x = subset of R attributes, e.g., R = Employee, x = (EmpID, dept)
- z = attributes in predicate p (subset of R attributes), e,g, z = salary

- 𝝅𝒙 𝝈𝒑 (𝑹) = 𝝅𝒙 (𝝈𝒑 (𝝅𝒙∪ z 𝑹 ))


- The following two versions are equivalent
- 𝜋(𝐸𝑚𝑝𝐼𝐷,𝑑𝑒𝑝𝑡) 𝜎𝑠𝑎𝑙𝑎𝑟𝑦>50000 (𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒) :
- Select EmpID, dept
- From employee
- Where salary > 50000

- 𝜋(𝐸𝑚𝑝𝐼𝐷,𝑑𝑒𝑝𝑡) (𝜎𝑠𝑎𝑙𝑎𝑟𝑦>50000 (𝜋𝐸𝑚𝑝𝐼𝑑,𝑑𝑒𝑝𝑡,𝑠𝑎𝑙𝑎𝑟𝑦 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒 ))


- select R.emp_id, R.dept
- from
- (select * from
- (select emp_id, dept, salary from employee) as T
- where T.salary > 50000) as R
Relational Algebra
𝝅+σ

- But which one runs better?


- 𝜋(𝐸𝑚𝑝𝐼𝐷,𝑑𝑒𝑝𝑡) 𝜎𝑠𝑎𝑙𝑎𝑟𝑦>50000 (𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒)
- It is like you move all you belongs from A to B, and then discard
the things you don’t need anymore

- 𝜋(𝐸𝑚𝑝𝐼𝐷,𝑑𝑒𝑝𝑡) (𝜎𝑠𝑎𝑙𝑎𝑟𝑦>50000 (𝜋𝐸𝑚𝑝𝐼𝑑,𝑑𝑒𝑝𝑡,𝑠𝑎𝑙𝑎𝑟𝑦 𝐸𝑚𝑝𝑙𝑜𝑦𝑒𝑒 ))


- It is like you first discard the things you don’t need before the
moving, and then move everything left from A to B.
Relational Algebra
𝝅+σ

- Properties for 𝜋 + σ
- Let x = subset of R attributes, e.g., R = Employee, x = (EmpID, dept)
- z = attributes in predicate p (subset of R attributes), e,g, z = salary

- Is the following correct? Does projection and predicate swappable?


- 𝝅𝒙 𝝈𝒑 (𝑹) = 𝝈𝒑 (𝝅𝒙 (𝐑))
Relational Algebra
𝝅+⨝
- Properties for 𝜋 + ⨝
- Let x = subset of R attributes, e.g., R = Employee, x = (dept)
- Let y = subset of S attributes, e.g., S = Registration, y = (training_course,
course_date)
- z = intersection of R, S attributes, e.g., z = (Emp_ID)

- 𝝅𝒙∪y R ⨝ S = 𝝅𝒙∪y ( 𝝅𝒙∪ 𝒛 (R) ⨝ 𝝅y∪ 𝒛 (S) )


- The following two versions are equivalent
- 𝝅(dept,Training_course,course_date) Employee⨝Registration
- Select dept, training_course, course_date from
- (select * from Employee inner join Registration
- on Employee.Emp_ID = Registration.Emp_ID)

- 𝝅(dept,training_course,course_date) ( 𝝅dept∪ Emp_ID (R) ⨝


𝝅training_course, course_date∪ Emp_ID (S) )
- Select dept, training_course, course_date from
- ((select dept, Emp_ID from Employee) as A(dept, Emp_ID) inner join
(select training_course, course_date, Emp_ID from registration) as
B(training _course, course_date, Emp_ID) on A.Emp_ID = B.Emp_ID)
Relational Algebra
𝝅+⨝

- Which funs faster?


- 𝝅𝒙∪y R ⨝ S = 𝝅𝒙∪y ( 𝝅𝒙∪ 𝒛 (R) ⨝ 𝝅y∪ 𝒛 (S) )
- Inner join two tables first and then take out useful information

- 𝝅(dept,training_course,course_date) ( 𝝅dept∪ Emp_ID (R) ⨝


- Take out useful information first and then inner join
Summary

If you have to discard something eventually,


do it as early as possible
Summary

- Logial query plan rules:


- Group 1: Push select as far down as possible
- 𝝈𝒑 𝑹⨝𝑺 = 𝝈𝒑 𝑹 ⨝𝑺, if p only references R
- 𝝈𝒒 𝑹⨝𝑺 = 𝑹 ⨝ 𝝈𝒒 𝑺 , if q only references S
- 𝝈𝒑^𝒒 𝑹⨝𝑺 = 𝝈𝒑 𝑹 ⨝ 𝝈𝒒 𝑺 , if p on R and q on S
- Group 2:Push projection as far down as possible
- 𝝅𝒙 𝝈𝒑 (𝑹) = 𝝅𝒙 (𝝈𝒑 (𝝅𝒙∪ z 𝑹 )), z = the fields in p
- 𝝅𝒙∪y R ⨝ S = 𝝅𝒙∪y ( 𝝅𝒙∪ 𝒛 (R) ⨝ 𝝅y∪ 𝒛 (S) ), x = fields in R, y = fields
in S, z=fields in both

You might also like