Advanced Databases
UNIT: III (Chapter-2)
Query Processing and
Decomposition
Reference:
Chapter – 6 & 7
Principles of Distributed Database Systems, M.Tamer Ozsu, Patrick Valduriez, 3rd Edition,
Springer
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/1
Outline
• Objectives of Query Processing
• Characterization of query processors
• Layers of query processing
• Query decomposition
• Localization of distributed data
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/2
Query Processing in a DDBMS
high level user query
query
processor
Low-level data manipulation
commands for D-DBMS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/3
Query Processing Components
• Query language that is used
➡ SQL
• Query execution methodology
➡ The steps that one goes through in executing high-level (declarative) user
queries.
• Query optimization
➡ How do we determine the “best” execution plan?
• We assume a homogeneous D-DBMS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/4
Selecting Alternatives
EMP(ENO, ENAME, TITLE)
ASG(ENO, PNO, RESP, DUR)
SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager“
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/5
Selecting Alternatives
EMP(ENO, ENAME, TITLE)
SELECT ENAME
ASG(ENO, PNO, RESP, DUR)
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager"
Strategy 1
ENAME(RESP=“Manager”EMP.ENO=ASG.ENO(EMP×ASG))
Strategy 2
ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))
Strategy 2 avoids Cartesian product, so may be “better”
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/6
What is the Problem?
Site 1 Site 2 Site 3 Site 4 Site 5
ASG1=ENO≤“E3”(ASG) ASG2= ENO>“E3”(ASG) EMP1= ENO≤“E3”(EMP) EMP2= ENO>“E3”(EMP) Result
Site 5
Strategy-B
Strategy-A Site 5
result EMP1' EMP2' result= (EMP1 × EMP2)⋈ENOσRESP=“Manager”(ASG1× ASG2)
EMP1' EMP2'
Site 3 Site 4 ASG1 ASG2 EMP1 EMP2
EMP’1=EMP1 ⋈ENO ASG’1 EMP’2=EMP2 ⋈ENO ASG’2
Site 1 Site 2 Site 3 Site 4
ASG 1' ASG '2
Site 1 Site 2
ASG 1' σ RESP "Manager"ASG 1 ASG '2 σ RESP "Manager"ASG 2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/7
Cost of Alternatives
• Assume
➡ size(EMP) = 400, size(ASG) = 1000 20 Managers in ASG
➡ tuple access cost = 1 unit tuple transfer cost = 10 units
• Strategy-A
➡ produce ASG': (10+10) tuple access cost =20
➡ transfer ASG' to the sites of EMP: (10+10) tuple transfer cost=200
➡ produce EMP': (10+10) tuple access cost 2=40
➡ transfer EMP' to result site: (10+10) tuple transfer cost= 200
Total Cost 460
• Strategy-B
➡ transfer EMP to site 5: 400 tuple transfer cost 4,000
➡ transfer ASG to site 5: 1000 tuple transfer cost 10,000
➡ produce ASG': 1000 tuple access cost 1,000
➡ join EMP and ASG': 400 20 tuple access cost 8,000
Total Cost 23,000
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/8
Objectives of Query Processing
• Minimize a cost function
I/O cost + CPU cost + Communication cost
These might have different weights in different distributed environments
• Wide area networks
➡ Communication cost may dominate or vary much
✦ bandwidth
✦ speed
✦ high protocol overhead
• Local area networks
➡ communication cost not that dominant
➡ total cost function should be considered
• Can also maximize throughput
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/9
Characteristics of Query Processors
• Languages
• Types of Optimizers
• Optimization Timing
• Statistics
• Decision Sites
• Network Topology
• Replicated Fragments
• Use of Semijoins
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/10
Characteristics of Query Processors
Complexity of Relational Operations
Operation Complexity
Select
• Assume Project O(n)
(without duplicate elimination)
➡ relations of cardinality n
➡ sequential scan Project
(with duplicate elimination) O(n log n)
Group
Join
Semi-join O(n log n)
Division
Set Operators
Cartesian Product O(n2)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/11
Characteristics of Query Processors
Types of Optimization
• Exhaustive search – all possible execution strategies are considered
➡ Cost-based
➡ Optimal
➡ Combinatorial complexity in the number of relations
• Heuristics
➡ Not optimal
➡ Regroup common sub-expressions
➡ Perform selection, projection first
➡ Replace a join by a series of semijoins
➡ Reorder operations to reduce intermediate relation size
➡ Optimize individual operations
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/12
Characteristics of Query Processors
Optimization Granularity
• Single query at a time
➡ Cannot use common intermediate results
• Multiple queries at a time
➡ Efficient if many similar queries
➡ Decision space is much larger
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/13
Characteristics of Query Processors
Optimization Timing
• Static
➡ Compilation optimize prior to the execution
➡ Difficult to estimate the size of the intermediate results. error
propagation
➡ Can amortize over many executions
• Dynamic
➡ Run time optimization
➡ Exact information on the intermediate relation sizes
➡ Have to reoptimize for multiple executions
➡ Distributed INGRES
• Hybrid
➡ Compile using a static algorithm
➡ If the error in estimate sizes > threshold, reoptimize at run time
➡ Mermaid
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/14
Characteristics of Query Processors
Statistics
• Relation
➡ Cardinality
➡ Size of a tuple
➡ Fraction of tuples participating in a join with another relation
• Attribute
➡ Cardinality of domain
➡ Actual number of distinct values
• Common assumptions
➡ Independence between different attribute values
➡ Uniform distribution of attribute values within their domain
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/15
Characteristics of Query Processors
Decision Sites
• Centralized
➡ Single site determines the “best” schedule
➡ Simple
➡ Need knowledge about the entire distributed database
• Distributed
➡ Cooperation among sites to determine the schedule
➡ Need only local information
➡ Cost of cooperation
• Hybrid
➡ One site determines the global schedule
➡ Each site optimizes the local subqueries
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/16
Characteristics of Query Processors
Network Topology
• Wide area networks (WAN) – point-to-point
➡ Characteristics
✦ Low bandwidth
✦ Low speed
✦ High protocol overhead
➡ Communication cost will dominate; ignore all other cost factors
➡ Global schedule to minimize communication cost
➡ Local schedules according to centralized query optimization
• Local area networks (LAN)
➡ Communication cost not that dominant
➡ Total cost function should be considered
➡ Broadcasting can be exploited (joins)
➡ Special algorithms exist for star networks
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/17
Layering scheme for Distributed
Query Processing
Calculus Query on Distributed Relations
Query GLOBAL
Decomposition SCHEMA
Algebraic Query on Distributed
Relations
CONTROL
Data FRAGMENT
SITE Localization SCHEMA
Fragment Query
Global STATS ON
Optimization FRAGMENTS
Optimized Fragment Query
with Communication Operations
LOCAL Local LOCAL
Optimization SCHEMAS
SITES
Optimized Local Queries
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/18
Query Decomposition
&
Localization of distributed data
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/19
Query Decomposition
Input : Calculus query on global relations
• Normalization
➡ manipulate query quantifiers and qualification
• Semantic Analysis
➡ detect and reject “incorrect” queries
➡ possible for only a subset of relational calculus
• Simplification
➡ eliminate redundant predicates
• Restructuring
➡ calculus query algebraic query
➡ more than one translation is possible
➡ use transformation rules
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/20
Normalization
• Lexical and syntactic analysis
➡ check validity (similar to compilers)
➡ check for attributes and relations
➡ type checking on the qualification
• Put into normal form
➡ Conjunctive normal form
(p11 p12 … p1n) … (pm1 pm2 … pmn)
➡ Disjunctive normal form
(p11 p12 … p1n) … (pm1 pm2 … pmn)
➡ OR's mapped into union
➡ AND's mapped into join or selection
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/21
Semantic Analysis
• prove incorrect queries
• Type incorrect
➡ If any of its attribute or relation names are not defined in the global schema
➡ If operations are applied to attributes of the wrong type
• Semantically incorrect
➡ Components do not contribute in any way to the generation of the result
➡ Only a subset of relational calculus queries can be tested for correctness
➡ Those that do not contain disjunction and negation
➡ To detect
✦ connection graph (query graph)
✦ join graph
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/22
Semantic Analysis – Example
SELECT ENAME,RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"
Query graph Join graph
DUR≥36
ASG ASG
EMP.ENO=ASG.ENO ASG.PNO=PROJ.PNO EMP.ENO=ASG.ENO ASG.PNO=PROJ.PNO
TITLE =
EMP RESP PROJ EMP PROJ
“Programmer”
ENAME
RESULT
PNAME=“CAD/CAM”
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/23
Semantic Analysis
If the query graph is not connected, the query may be wrong or
use Cartesian product
SELECT ENAME,RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR > 36
AND TITLE = "Programmer"
ASG
EMP RESP PROJ
ENAME
RESULT
PNAME=“CAD/CAM”
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/24
Simplification
• Why simplify?
➡ Remember the example
• How? Use transformation rules
➡ Elimination of redundancy
✦ idempotency rules
p1 ¬( p1) false
p1 (p1p2) p1
p1 false p1
…
➡ Application of transitivity
➡ Use of integrity rules
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/25
Simplification – Example
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
OR (NOT(EMP.TITLE = "Programmer")
AND (EMP.TITLE = "Programmer"
OR EMP.TITLE = "Elect. Eng.")
AND NOT(EMP.TITLE = "Elect. Eng."))
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/26
Restructuring
• Convert relational calculus to relational ENAME Project
algebra
• Make use of query trees σDUR=12 OR DUR=24
• Example
Find the names of employees other than
J. Doe who worked on the CAD/CAM
σPNAME=“CAD/CAM” Select
project for either 1 or 2 years.
SELECT ENAME σENAME≠“J. DOE”
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO ⋈PNO
AND ASG.PNO = PROJ.PNO
AND ENAME≠ "J. Doe" ⋈ENO Join
AND PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24) PROJ ASG EMP
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/27
Example
Recall the previous example: ENAME
Project
Find the names of employees other
than J. Doe who worked on the DUR=12 DUR=24
CAD/CAM project for either one or
two years.
PNAME=“CAD/CAM” Select
SELECT ENAME
FROM PROJ, ASG, EMP ENAME≠“J. DOE”
WHERE ASG.ENO=EMP.ENO
AND ASG.PNO=PROJ.PNO ⋈PNO
AND ENAME ≠ "J. Doe"
AND PROJ.PNAME="CAD/CAM" ⋈ENO Join
AND (DUR=12 OR DUR=24)
PROJ ASG EMP
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/28
Equivalent Query
ENAME
PNAME=“CAD/CAM” (DUR=12 DUR=24) ENAME≠“J. Doe”
⋈PNO,ENO
EMP PROJ ASG
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/29
Data Localization
Input: Algebraic query on distributed relations
• Determine which fragments are involved
• Localization program
➡ substitute for each global query its materialization program
➡ optimize
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/30
Example
Assume ENAME
➡ EMP is fragmented into EMP1, EMP2, DUR=12 DUR=24
EMP3 as follows:
✦ EMP1= ENO≤“E3”(EMP) PNAME=“CAD/CAM”
✦ EMP2= “E3”<ENO≤“E6”(EMP)
ENAME≠“J. DOE”
✦ EMP3= ENO≥“E6”(EMP)
➡ ASG fragmented into ASG1 and ASG2 ⋈PNO
as follows:
✦ ASG1= ENO≤“E3”(ASG) ⋈ENO
✦ ASG2= ENO>“E3”(ASG) PROJ
Replace EMP by (EMP1 EMP2 EMP3)
and ASG by (ASG1 ASG2) in any query EMP1EMP2 EMP3 ASG1 ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/31
Provides Parallellism
⋈ENO ⋈ENO ⋈ENO ⋈ENO
EMP1 ASG1 EMP2 ASG2 EMP3 ASG1 EMP3 ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/32
Eliminates Unnecessary Work
⋈ENO ⋈ENO ⋈ENO
EMP1 ASG1 EMP2 ASG2 EMP3 ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/33
Reduction for PHF
• Reduction with selection
➡ Relation R and FR={R1, R2, …, Rw} where Rj=p (R)
j
pi(Rj)= if x in R: ¬(pi(x) pj(x))
➡ Example
SELECT *
FROM EMP
WHERE ENO="E5"
ENO=“E5” ENO=“E5”
EMP1 EMP2 EMP3 EMP2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/34
Reduction for PHF
• Reduction with join
➡ Possible if fragmentation is done on join attribute
➡ Distribute join over union
(R1 R2)⋈S (R1⋈S) (R2⋈S)
➡ Given Ri =p (R) and Rj = p (R)
i j
Ri ⋈Rj = if x in Ri, y in Rj: ¬(pi(x) pj(y))
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/35
Reduction for PHF
• Assume EMP is fragmented as ⋈ENO
before and
➡ ASG1: ENO ≤ "E3"(ASG)
➡ ASG2: ENO > "E3"(ASG)
• Consider the query EMP1 EMP2 EMP3 ASG1 ASG2
SELECT *
FROM EMP,ASG
WHERE EMP.ENO=ASG.ENO
• Distribute join over unions
• Apply the reduction rule ⋈ENO ⋈ENO ⋈ENO
EMP1 ASG1 EMP2 ASG2 EMP3 ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/36
Reduction for VF
• Find useless (not empty) intermediate relations
Relation R defined over attributes A = {A1, ..., An} vertically fragmented
as Ri =A'(R) where A' A:
D,K(Ri) is useless if the set of projection attributes D is not in A'
Example: EMP1=ENO,ENAME (EMP); EMP2=ENO,TITLE (EMP)
SELECT ENAME
FROM EMP
ENAME ENAME
⋈ENO
EMP1 EMP2 EMP1
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/37
Reduction for DHF
• Rule :
➡ Distribute joins over unions
➡ Apply the join reduction for horizontal fragmentation
• Example
ASG1: ASG ⋉ENO EMP1
ASG2: ASG ⋉ENO EMP2
EMP1: TITLE=“Programmer” (EMP)
EMP2: TITLE=“Programmer” (EMP)
• Query
SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = "Mech. Eng."
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/38
Reduction for DHF
Generic query ⋈ENO
TITLE=“Mech. Eng.”
ASG1 ASG2 EMP1 EMP2
Selections first ⋈ENO
TITLE=“Mech. Eng.”
ASG1 ASG2 EMP2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/39
Reduction for DHF
Joins over unions
⋈ENO ⋈ENO
TITLE=“Mech. Eng.” TITLE=“Mech. Eng.”
ASG1 EMP2 ASG2 EMP2
Elimination of the empty intermediate relations
(left sub-tree) ⋈ENO
TITLE=“Mech. Eng.”
ASG2 EMP2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/40
Reduction for Hybrid Fragmentation
• Combine the rules already specified:
➡ Remove empty relations generated by contradicting selections on horizontal
fragments;
➡ Remove useless relations generated by projections on vertical fragments;
➡ Distribute joins over unions in order to isolate and remove useless joins.
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/41
Reduction for Hybrid Fragmentation
Example
ENAME
Consider the following hybrid
fragmentation: ENAME
ENO=“E5”
EMP1= ENO≤"E4" (ENO,ENAME (EMP))
EMP2= ENO>"E4" (ENO,ENAME (EMP))
⋈ENO
ENO=“E5”
EMP3= ENO,TITLE (EMP)
and the query
EMP2
SELECT ENAME
FROM EMP
WHERE ENO="E5" EMP1 EMP2 EMP3
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/42