CSE 544
Principles of Database
Management Systems
Fall 2016
Lecture 9 – Structural query optimization
Conjunctive Queries
• Definition:
Q(X) :- R1(X1), R2(X2), ..., Rm(Xm)
• Same as a single datalog rule
• Terminology:
– Atoms
– Head variables
– Existential variables
• CQ = denotes the set of conjunctive queries
CSE 544 - Fall 2016 2
Examples
• Example of CQ
q(x,y) = ∃z.(R(x,z) ∧ ∃u.(R(z,u) ∧ R(u,y)))
q(x) = ∃z.∃u.(R(x,z) ∧ R(z,u) ∧ R(u,y))
• Examples of non-CQ:
q(x,y) = S(x,y)∧∀z.(R(x,z) à R(y,z))
q(x) = T(x) ∨ ∃z.S(x,z)
3
Types of CQ
• Full CQ: head variables are all variables
Q(x,y,z,u) :- R(x,y),S(y,z),T(z,u)
• Boolean CQ: no head variables
Q() :- R(x,y),S(y,z),T(z,u)
• With or without self-joins:
Q(x,u) :- R(x,y),S(y,z),R(z,u)
Q(x,u) :- R(x,y),S(y,z),T(z,u)
CSE 544 - Fall 2016 4
Extensions
• With inequalities CQ<:
Q(x) :- R(x,y),S(y,z),T(z,u),y<u
• With disequalities CQ≠:
Q(x) :- R(x,y),S(y,z),T(z,u),y≠u
• With aggregates:
Q(x,count(*)) :- R(x,y),S(y,z),T(z,u)
Q(x, sum(u)) :- R(x,y),S(y,z),T(z,u)
CSE 544 - Fall 2016 5
Complexity of Query Evaluation
• The query evaluation problem is this:
given a query Q and a database D, compute Q(D)
• Three complexity measures:
– Data complexity. Fix Q. The complexity is f(|D|)
Variation: f(|Input|, |Output|)
– Query (or expression) complexity. Fix D. The complexity is f(|Q|)
– Combined complexity. The complexity if f(|D|,|Q|)
• Example: data complexity of R ⋈ S
6
Complexity of Query Evaluation
• The query evaluation problem is this:
given a query Q and a database D, compute Q(D)
• Three complexity measures:
– Data complexity. Fix Q. The complexity is f(|D|)
Variation: f(|Input|, |Output|)
– Query (or expression) complexity. Fix D. The complexity is f(|Q|)
– Combined complexity. The complexity if f(|D|,|Q|)
• Example: data complexity of R ⋈ S
– PTIME in |R|, |S| (trivially so...)
7
Complexity of Query Evaluation
• The query evaluation problem is this:
given a query Q and a database D, compute Q(D)
• Three complexity measures:
– Data complexity. Fix Q. The complexity is f(|D|)
Variation: f(|Input|, |Output|)
– Query (or expression) complexity. Fix D. The complexity is f(|Q|)
– Combined complexity. The complexity if f(|D|,|Q|)
• Example: data complexity of R ⋈ S
– PTIME in |R|, |S| (trivially so...)
– Better: O(|R| * |S|)
8
Complexity of Query Evaluation
• The query evaluation problem is this:
given a query Q and a database D, compute Q(D)
• Three complexity measures:
– Data complexity. Fix Q. The complexity is f(|D|)
Variation: f(|Input|, |Output|)
– Query (or expression) complexity. Fix D. The complexity is f(|Q|)
– Combined complexity. The complexity if f(|D|,|Q|)
• Example: data complexity of R ⋈ S
– PTIME in |R|, |S| (trivially so...)
– Better: O(|R| * |S|)
– Even better: O( (|R|+|S|) log (|R|+|S|) + |Output| )
9
Complexity of Query Evaluation
• The query evaluation problem is this:
given a query Q and a database D, compute Q(D)
• Three complexity measures:
– Data complexity. Fix Q. The complexity is f(|D|)
Variation: f(|Input|, |Output|)
– Query (or expression) complexity. Fix D. The complexity is f(|Q|)
– Combined complexity. The complexity if f(|D|,|Q|)
• Example: data complexity of R ⋈ S
– PTIME in |R|, |S| (trivially so...)
– Better: O(|R| * |S|)
– Even better: O( (|R|+|S|) log (|R|+|S|) + |Output| )
• Discuss more about complexity in class...
10
Question in Class
• Q(x,w) :- R(x,y),S(y,z),T(z,u),K(u,v),L(v,w)
• Assume |R|=|S|=|T|=|K|=|L| = N
• What is the complexity of Q?
CSE 544 - Fall 2016 11
Question in Class
• Q(x,w) :- R(x,y),S(y,z),T(z,u),K(u,v),L(v,w) Πxw
• Assume |R|=|S|=|T|=|K|=|L| = N ⨝
• What is the complexity of Q? ⨝ L
• What is the complexity of this plan? ⨝ K
⨝ T
CSE 544 - Fall 2016
R S 12
Question in Class
• Q(x,w) :- R(x,y),S(y,z),T(z,u),K(u,v),L(v,w) Πxw
• Assume |R|=|S|=|T|=|K|=|L| = N ⨝
• What is the complexity of Q? ⨝ L
• What is the complexity of this plan? ⨝ K
• Can you find a more efficient plan? ⨝ T
CSE 544 - Fall 2016
R S 13
Question in Class
Πxw
• Push projections down:
What about this complexity?
⨝
Πxu
• Can we still improve?
L(v,w)
⨝
Πxu
⨝ K(u,v)
Πxz
T(z,u)
⨝
R(x,y) S(y,z)
Law of Semijoins
• Input: R(A1,…An), S(B1,…,Bm)
• Output: T(A1,…,An)
Definition: the semi-join operation is
R ⋉ S = Π A1,…,An (R ⨝ S)
• Data complexity: O(|R| + |S|) ignoring log-factors
• The law of semijoins is:
R ⨝ S = (R ⋉ S) ⨝ S
CSE 544 - Fall 2016 15
Laws with Semijoins
• Very important in parallel databases
• Often combined with Bloom Filters (my plan is to discuss
them in the next lecture)
• Read pp. 747 in the textbook
• Also used in query optimization, sometimes called “magic
sets” (see Chaudhuri’s paper)
• Historical note: magic sets were invented after semi-join
reductions, and the connection became clear only later
CSE 544 - Fall 2016 16
Semijoin Reducer
• Given a query:
Q = R 1 ⨝ R2 ⨝ . . . ⨝ Rn
• A semijoin reducer for Q is
Ri1 = Ri1 ⋉ Rj1
Ri2 = Ri2 ⋉ Rj2
.....
Rip = Rip ⋉ Rjp
such that the query is equivalent to:
Q = Rk1 ⨝ Rk2 ⨝ . . . ⨝ Rkn
• A full reducer is such that no dangling tuples remain
CSE 544 - Fall 2016 17
Example
• Example:
Q = R(A,B) ⨝ S(B,C)
• A semijoin reducer is:
R1(A,B) = R(A,B) ⋉ S(B,C)
• The rewritten query is:
Q = R1(A,B) ⨝ S(B,C)
18
Semijoin Reducer
• More complex example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(C,D,E)
• What is a full reducer?
19
Semijoin Reducer
• More complex example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(C,D,E)
• A full reducer is:
S’(B,C) := S(B,C) ⋉ R(A,B)
T’(C,D,E) := T(C,D,E) ⋉ S’(B,C)
S’’(B,C) := S’(B,C) ⋉ T’(C,D,E)
R’(A,B) := R (A,B) ⋉ S’’(B,C)
Q = R’(A,B) ⨝ S’’(B,C) ⨝ T’(C,D,E)
20
Practice at Home...
• Find semi-join reducer for
R(x,y),S(y,z),T(z,u),K(u,v),L(v,w)
CSE 544 - Fall 2016 21
Not All Queries Have Full Reducers
• Example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(A,C)
• Can write many different semi-join reducers
• But no full reducer of length O(1) exists
22
Acyclic Queries
• Fix a Conjunctive Query without self-joins
• Q is acyclic if its atoms can be organized in a tree
such that for every variable the set of nodes that
contain that variable form a connected component
R(x,y)
K(z,v) R(x,y)
S(y,z,u)
S(y,z)
L(v,m) T(z,x)
T(y,z,w) R(x,y),S(y,z),T(z,x)
Acyclic CSE 544 - Fall 2016 23
is cyclic
Yannakakis Algorithm
• Given: acyclic query Q
• Compute Q on any database in time O(|Input|+|Output|)
• Step 1: semi-join reduction
– Pick any root node x in the tree decomposition of Q
– Do a semi-join reduction sweep from the leaves to x
– Do a semi-join reduction sweep from x to the leaves
• Step 2: compute the joins bottom up, with early
projections
CSE 544 - Fall 2016 24
Examples in Class
• Boolean query: Q() :- ...
R(x,y)
K(z,v)
• Non-boolean: Q(x,m) :- ...
S(y,z,u)
• With aggregate: Q(x,sum(m)) :- ... L(v,m)
• And also: Q(x,count(*)) :- ...
T(y,z,w)
In all cases: runtime = O(|R|+|S|+...+|L| + |Output|)
Testing if Q is Acyclic
• An ear of Q is an atom R(X) with the following property:
– Let X’ ⊆ X be the set of join variables (meaning: they occur in at
least one other atom)
– There exists some other atom S(Y) such that X’ ⊆ Y
• The GYO algorithm (Graham,Yu,Özsoyoğlu) for testing if
Q is acyclic:
– While Q has an ear R(X), remove the atom R(X) from the query
– If all atoms were removed, then Q is acyclic
– If atoms remain but there is no ear, then Q is cyclic
• Show example in class
CSE 544 - Fall 2016 26
Computing Cyclic Queries
R(x,y) ρ*=3/2
K(z,v) R(x,y),
S(y,z,u) A(y,s),B(x,s) ρ*=2
K(z,v),F(v,q)
L(v,m) S(y,z,u)
ρ*=1 ρ*=1
T(y,z,w)
ρ*=2
L(v,m)
T(y,z,w),
O(n + |Output|) C(z,t),D(t,p),E(p,y)
O(n2 + |Output|)
ftw = max(ρ*) = 2
Next lecture...