KEMBAR78
11 Iterators Relalg | PDF | Relational Model | Theoretical Computer Science
0% found this document useful (0 votes)
7 views46 pages

11 Iterators Relalg

The document discusses the Iterator Interface in database management systems, focusing on how relational operators are implemented as subclasses of an abstract Iterator class. It highlights the challenges of reusing algorithms for different relational operations and the need for a pull-based computation model. Additionally, it provides examples of various operators such as Select, Heap Scan, and Group By, explaining their initialization, execution, and closure processes.

Uploaded by

duong.td158
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views46 pages

11 Iterators Relalg

The document discusses the Iterator Interface in database management systems, focusing on how relational operators are implemented as subclasses of an abstract Iterator class. It highlights the challenges of reusing algorithms for different relational operations and the need for a pull-based computation model. Additionally, it provides examples of various operators such as Select, Heap Scan, and Group By, explaining their initialization, execution, and closure processes.

Uploaded by

duong.td158
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 46

Iterators Interface

Alvin Cheung
Fall 2024

Reading: R&G, Chapters 4.1 - 4.2


Iterator Interface
• We have now seen how sorting, hashing, and joins are implemented

• Design question: how to package these algos such that they can be
reused by others?

• Two challenges
• Different relational operations want to use the same algo
• Tuples are not all read in at the same time
• sort(data) doesn’t work!

• These challenges are common across different relational operations


Iterator Interface
The relational operators implemented as subclasses of the class Iterator:
abstract class iterator { // Invoked when “wiring” dataflow
void setup(List<Iterator> inputs); // Configure the input (children) args
void init(args); // Invoked before calling next: sets up state
tuple next(); // Invoked repeatedly: return another tuple
void close(); // Invoked when finished
}

• Pull-based computation model


• e.g., Console calls init on root operator of query plan, and then next
• If tuple is not ready, this next request propagates down the query plan recursively
• init/next can result in either streaming (“on-the-fly”) or blocking (“batch”) algorithm for that operator:
• streaming: small, constant amount of work per call
• blocking: does not produce output until it consumes its entire input, i.e., all rows from children!
• Q: examples?
• Any iterator can be input to any other, since they all implement the same interface (composability)
• State: iterators may maintain substantial private “internal” state
• e.g., hash tables, running counts, large sorted files …
Example: Select (streaming aka on-the-fly)
• A streaming operator: small amount of work per tuple produced

• init(predicate):
child.init()
pred = predicate; // local variable storing state
current = NULL; // local cursor

• next():
while (current != EOF && !pred(current))
current = child.next(); // give us another tuple
} // exit if pred is satisfied or EOF
return current; // return current tuple or EOF

• close():
child.close()
Example: Heap Scan
• Leaf of the query plan

• init(relation):
heap = open heap file for this relation; // file handle
cur_page = heap.first_page(); // first page
cur_slot = cur_page.first_slot(); // first slot on that page

• next():
if (cur_page == NULL) return EOF;
current = [cur_page, cur_slot]; // we will return this recordId
cur_slot = cur_slot.advance(); // advance the slot for subseq. calls
if (cur_slot == NULL) { // advance to next page, first slot
cur_page = cur_page.advance();
if (cur_page != NULL)
cur_slot = cur_page.first_slot();
}
return current;

• close():
heap.close() // close file
Example: Group By on Sorted input
agg_type state init merge(x) final
COUNT count 0 count ++ count
GroupBy
SUM sum 0 sum += x sum
AVG ? ? ? ?
Sorted on
group col
Sort
MIN ? ? ? ?

• Say input is sorted, and we want to do a group by


• Sort not necessary, can also do group by with hashing
• Keep “group” in iterator state, add merge(tuple) function
• Initialize group state
• Operate one tuple at a time in next, and merge tuple with existing group state
• Create new group if needed
• Return result tuple when done with group
Example: Group By on Sorted input
agg_type state init merge(x) final
COUNT count 0 count ++ count
GroupBy
SUM sum 0 sum += x sum
AVG [count, [0, 0] [count++, sum
Sorted on
sum] sum+=x] / count group col
Sort
MIN min +infinity min > x ? min
x : min

• Say input is sorted, and we want to do a group by


• Sort not necessary, can also do group by with hashing
• Keep “group” in iterator state, add merge(tuple) function
• Initialize group state
• Operate one tuple at a time in next, and merge tuple with existing group state
• Create new group if needed
• Return result tuple when done with group
Example: Group By on Sorted input
agg_type state init merge(x) final
COUNT count 0 count ++ count
• init(group_keys, aggs):
child.init(); SUM sum 0 sum += x sum
cur_group = NULL; // no group so far GroupBy
AVG [count, [0, 0] [count++, sum
sum] sum+=x] / count
next():
result = NULL; MIN min +infinity min > x ? min
do { x : min Sort
tup = child.next();
if (group(tup) != cur_group) { // New group!
if (cur_group != NULL) // Have we seen a group previously?
result = [cur_group, final() of all aggs] // Form result for that current group
cur_group = group(tup); // Initialize new group state
call init() on group state
}
call merge(tup) to merge tup into state
} while (!result); // Exit if current group result is formed
return result;

• close():
child.close()

Neat: only maintains one tuple of partial results in memory at any time!
A Full (Single Table) Query Plan
• A Query Plan is Single-threaded!
GroupBy
• Exercise: trace the calls that lead to flow of tuples:
• Call init() on the root GroupBy
• How does init() recurse down the chain and return? Sort
• Call next() on root
• How does next() recurse down the chain and return a tuple?
Select
• Note how the blocking operator (sort) interacts with the other,
streaming operators
Heap
• Select and GroupBy are essentially streaming operators Scan
• We don’t store each operator output on disk; tuples stream through
the plan’s call stack
• Some operators like Sort use disk internally – but not exposed outside
the operator
Summary
• We just finished our tour of the upper layer of the DBMS

• We have now seen how SQL queries can be represented using relational
algebra trees

• We have also seen how relational algebra operators are implemented


• init / next / close
• Streaming vs blocking

• We saw earlier the different algos for implementing various relational


algebra operators
• Sort, hash, join, etc

• We didn’t cover how SQL text are translated into RA trees


• Take CS164!
Relational Algebra

Alvin Cheung
Fall 2024

Reading: R&G, Chapters 4.1 - 4.2


Architecture of a DBMS: What we’ve learned
Completed SQL Client

Query Parsing Today: Packaging


& Optimization implementations of our
Relational Operators
algos as relational
We are here!
operators.
Completed Database
Files and Index Management
Management Coming soon:
Completed Buffer Management optimization
System
Completed Disk Space Management

Database
An Overview of Query Execution
SQL Query Relational Algebra
SELECT S.name
FROM Reserves R, Sailors S Query Parser !S.name("bid=100⋀rating>5(
WHERE R.sid = S.sid
AND R.bid = 100
& Optimizer
Reserves ⋈R.sid=S.sid Sailors))
AND S.rating > 5


a le n t to
E q u iv On-the-fly
Optimized (Physical) Query Plan:
(Logical) Query Plan:
Project Iterator
!S.name !S.name
On-the-fly
But actually will "S.rating>5 Select Iterator
"R.bid=100 ⋀ S.rating > 5
⋈R.sid=S.s
produce…
Indexed Nested
Loop Join Iterator
⋈R.sid=S.s Operator Code
id
Heap Scan
id B+-Tree
"R.bid=100 Sailors Iterator
Reserves Sailors Indexed Scan
Iterator Reserves
SQL vs Relational Algebra
SQL Query Relational Algebra
SELECT S.name
FROM Reserves R, Sailors S Query Parser !S.name("bid=100⋀rating>5(
WHERE R.sid = S.sid
AND R.bid = 100
& Optimizer
Reserves ⋈R.sid=S.sid Sailors))
AND S.rating > 5


a le n t to
E q u iv On-the-fly
Optimized (Physical) Query Plan:
(Logical) Query Plan:
Project Iterator
!S.name
SQL
!S.name Relational Algebra On-the-fly
But actually will "S.rating>5
Operational description of Select Iterator
A"declarative expression
R.bid=100 ⋀ S.rating >

produce…
5 of the query result a computation.
R.sid=S.s Indexed Nested
Loop Join Iterator
⋈R.sid=S.s Operator Code
id
Heap Scan
id "R.bid=100 Sailors
B+-Tree Systems execute relational algebra
Iterator
Reserves Sailors Indexed Scan query plan.
Iterator Reserves
Relational Algebra Preliminaries
• Algebra of operators on relation instances
• Just like other algebras: linear algebra or elementary algebra
• Operating on matrices or variables
• !S.name("R.bid=100 ⋀ S.rating>5(R ⋈R.sid=S.sid S))
• Closed: result is also a relation instance
• Enables rich composition!
• Just like a linear algebraic expression on matrices returns a matrix

• Typed: input schema determines output


• Can statically check whether queries are legal.
• Same story for linear algebra – input sizes determine output sizes
Relational Algebra and Sets
• Pure relational algebra has set semantics
• No duplicate tuples in a relation instance
• vs. SQL, which has multiset (bag) semantics
• We will switch to multiset in the system discussion
Relational Algebra Operators: Unary
• Unary Operators: on single relation
• Projection (π ): Retains only desired columns (vertical)
• Selection (σ ): Selects a subset of rows (horizontal)
• Renaming ( ! ): Rename attributes and relations.
Relational Algebra Operators: Binary
• Binary Operators: on pairs of relations
• Union ( ∪ ): Tuples in r1 or in r2.
• Set-difference ( — ): Tuples in r1, but not in r2.
• Cross-product ( × ): Allows us to combine two relations.
Relational Algebra Operators: Compound

• Compound Operators: ops that can be expressed using the 6 basic


ops earlier

• Intersection ( ∩ ): Tuples in r1 and in r2.


• Joins ( ⋈# , ⋈ ): Combine relations that satisfy predicates
Relational Algebra Operators
• Projection (π ): Retains only desired columns (vertical)
• Selection (σ ): Selects a subset of rows (horizontal)
• Renaming ( ! ): Rename attributes and relations
• Union ( ∪ ): Tuples in r1 or in r2.
• Set-difference ( — ): Tuples in r1, but not in r2.
• Cross-product ( × ): Allows us to combine two relations.

• Intersection ( ∩ ): Tuples in r1 and in r2.


• Joins ( ⋈% , ⋈ ): Combine relations that satisfy predicates
Projection (π)
• Corresponds to the SELECT list in SQL
• Schema determined by schema of attribute list
• Names and types correspond to input attributes
• Selects a subset of columns (vertical slice)

πsname,age(S2)
Relational Instance S2
List of Attributes
sid sname rating age
sname age
28 yuppy 9 35.0
yuppy 35.0
31 lubber 8 55.5
lubber 55.5
44 guppy 5 35.0
guppy 35.0
58 rusty 10 35.0
rusty 35.0
Projection (π), cont.
• Set semantics results in fewer rows
• Real systems don’t automatically remove duplicates
• Why? (Semantics and Performance reasons)

πage(S2)
Relational Instance S2 Multiset
age
sid sname rating age Set
35.0 age
28 yuppy 9 35.0
55.5 35.0
31 lubber 8 55.5
35.0 55.5
44 guppy 5 35.0
35.0
58 rusty 10 35.0
Selection (!)
• Selects a subset of rows (horizontal slice)
• Corresponds to the WHERE clause in SQL
• Output schema same as input
• Duplicate Elimination? Not needed if input is a set.
!rating>8(S2)
Relational Instance S2
Selection Condition (Boolean Expression)
sid sname rating age
28 yuppy 9 35.0 sid sname rating age
31 lubber 8 55.5 28 yuppy 9 35.0
44 guppy 5 35.0 58 rusty 10 35.0
58 rusty 10 35.0
Composing Select and Project

• Names of sailors with rating > 8: πsname(!rating>8(S2))


sid sname rating age
28 yuppy 9 35.0 sid sname rating age sname
31 lubber 8 55.5 28 yuppy 9 35.0 yuppy
44 guppy 5 35.0 58 rusty 10 35.0 rusty
58 rusty 10 35.0 !rating>8 πsname

• What about: !rating>8(πsname(S2))


• Invalid types. Input to !rating>8 does not contain rating.
Union (∪)
• Takes the set union of two sets
• The two input relations must be compatible:
• Same sequence of attributes and types thereof
• SQL Expression: UNION

S1 ∪ S2

S1 S2 S1 ∪ S2
Union (∪) VS Union ALL
•In union under set semantics, duplicate elimination is needed
•SQL Expression: UNION (get rid of duplicates) vs. UNION ALL (keep dup.)

S1 ∪ S2
Relational Instance S1 Relational Instance S2
sid sname rating age
sid sname rating age sid sname rating age
22 dustin 7 45
22 dustin 7 45.0 28 yuppy 9 35.0
28 yuppy 9 35.0
31 lubber 8 55.5 31 lubber 8 55.5
31 lubber 8 55.5
58 rusty 10 35.0 44 guppy 5 35.0
44 guppy 5 35.0
58 rusty 10 35.0
58 rusty 10 35.0
Set Difference ( − )
• Same as with union, both input relations must be compatible.

• SQL Expression: EXCEPT

S1 − S2

S1 S2 S1 − S2
Set Difference ( − ), cont.
• Q: Do we need to eliminate duplicates like in UNION?
• Not required if inputs are sets
• SQL Expression: EXCEPT vs EXCEPT ALL
• Same as UNION/UNION ALL
• In EXCEPT duplicates are eliminated if they exist S1 − S2
sid sname rating age
Relational Instance S1 Relational Instance S2
22 dustin 7 45
sid sname rating age sid sname rating age
22 dustin 7 45.0 28 yuppy 9 35.0
31 lubber 8 55.5 31 lubber 8 55.5
S2 − S1
58 rusty 10 35.0 44 guppy 5 35.0 sid sname rating age

58 rusty 10 35.0 28 yuppy 9 35.0


44 guppy 5 35.0
Cross-Product (×)
• R1 × S1: Each row of R1 paired with each row of S1
sid bid day sid sname rating age

S1: 22 101 10/10/96 22 dustin 7 45.0


R1:
22 101 10/10/96 31 lubber 8 55.5
sid sname rating age
sid bid day 22 101 10/10/96 58 rusty 10 35.0
22 101 10/10/96
×
22 dustin 7 45.0 = 58 103 11/12/96 22 dustin 7 45.0
31 lubber 8 55.5
58 103 11/12/96 58 103 11/12/96 31 lubber 8 55.5
58 rusty 10 35.0
58 103 11/12/96 58 rusty 10 35.0
• How many rows in result?
• |R1|*|S1|
• Do we need to worry about schema compatibility?
• Not needed.
• Do we need to do duplicate elimination?
• None generated.
Renaming ( ! = “rho” )
• Renames relations and their attributes
• Convenient to avoid confusion when two relations overlap in attributes
• Can omit output name if we don’t want to rename the output R1:
sid bid day
22 101 10/10/96

!R(sid2, bid2, day) (R1) 58 103 11/12/96

R:
!R(sid->sid2, bid->bid2) (R1) sid2 bid2 day
22 101 10/10/96
58 103 11/12/96
Renaming ( ! = “rho” ) contd.
• Yet another shorthand for renaming
!Temp1(R1.sid -> sid1, S1.sid -> sid2) (R1 × S1)
Output List of cols Input
Name to rename Relation

• Again, can omit output name if we don’t want to rename the output
• For this case, we can equivalently rename each relation and then do cross-product

R1 × S1 Temp1
sid bid day sid sname rating age sid1 bid day sid2 sname rating age

22 101 10/10/96 22 dustin 7 45.0 22 101 10/10/96 22 dustin 7 45.0


22 101 10/10/96 31 lubber 8 55.5 22 101 10/10/96 31 lubber 8 55.5
22 101 10/10/96 58 rusty 10 35.0 22 101 10/10/96 58 rusty 10 35.0
58 103 11/12/96 22 dustin 7 45.0 58 103 11/12/96 22 dustin 7 45.0
58 103 11/12/96 31 lubber 8 55.5 58 103 11/12/96 31 lubber 8 55.5
58 103 11/12/96 58 rusty 10 35.0 58 103 11/12/96 58 rusty 10 35.0
Relational Algebra Operators
• Projection (π ): Retains only desired columns (vertical)
• Selection (σ ): Selects a subset of rows (horizontal)
• Renaming ( ! ): Rename attributes and relations
• Union ( ∪ ): Tuples in r1 or in r2.
• Set-difference ( — ): Tuples in r1, but not in r2.
• Cross-product ( × ): Allows us to combine two relations.

• Intersection ( ∩ ): Tuples in r1 and in r2. Next


• Joins ( ⋈% , ⋈ ): Combine relations that satisfy predicates
Compound Operator: Intersection
• Same as with union, both input relations must be compatible.
• SQL Expression: INTERSECT

S1 ∩ S2

S1 S2 S1 ∩ S2
Intersection (∩)
• Same story as — with respect to duplicates
• Duplicates don’t need to be eliminated if inputs are sets

S1 ∩ S2
Relational Instance S1 Relational Instance S2
sid sname rating age sid sname rating age
22 dustin 7 45.0 28 yuppy 9 35.0 sid sname rating age
31 lubber 8 55.5 31 lubber 8 55.5 31 lubber 8 55.5
58 rusty 10 35.0 44 guppy 5 35.0 58 rusty 10 35.0
58 rusty 10 35.0
Intersection (∩), Pt 2
S1 ∩ S2
• We saw that ∩ is a compound
operator.
• S1 ∩ S2 = ?
Intersection (∩), Pt 3
S1 ∩ S2
• S1 ∩ S2 = S1 – ?
• Q: What is “?”

= S1 – ?
Intersection (∩), Pt 4
S1 ∩ S2
• S1 ∩ S2 = S1 – (S1 – S2)

= S1 – ( S1
? – S2 )
Compound Operator: Join
• Joins are compound operators (like intersection):
• Generally, !"( R × S)
• With possibly a rename in there (for natural join)

• Increasing degree of specialization


• Theta Join ( ⋈" ): join on logical expression "
• Equi-Join: theta join with theta being a conjunction of equalities
• Natural Join ( ⋈ ): equi-join on all matching column names
• Only one copy per column preserved!
• Relating information across tables using joins/cross-products is super useful
and important
• Want to avoid cross-products
• We just learned about efficient join algorithms
Theta Join (⋈") Semantics

• R ⋈" S = #"(R × S)
• Apply a cross-product, then filter out tuples that
don’t match.
• If " only contains equality conditions (with an AND
between them), this is called an equi-join
Theta Join (⋈") Example
• We want to find boats that people have reserved
• R1 ⋈sid=sid S1
• Confusing… hard to interpret!
S1:
R1:
sid sname rating age sid bid day sid sname rating age
sid bid day
22 101 10/10/96 ⋈sid=sid 22 dustin 7 45.0 = 22 101 10/10/96 22 dustin 7 45.0
31 lubber 8 55.5 58 103 11/12/96 58 rusty 10 35.0
58 103 11/12/96
58 rusty 10 35.0

• Q: How do we fix?
Theta Join (⋈") Example
• #(sid->sid1)(R1) ⋈sid1=sid S1
S1:
R1:
sid1 bid day sid sname rating age sid1 bid day sid sname rating age
22 101 10/10/96 ⋈sid1=sid 22 dustin 7 45.0 = 22 101 10/10/96 22 dustin 7 45.0
58 103 11/12/96 31 lubber 8 55.5 58 103 11/12/96 58 rusty 10 35.0
58 rusty 10 35.0
Natural Join (⋈)
• Special case of equi-join in which equalities are specified for all
matching fields and duplicate fields are projected away

• Compute R × S
• Select rows where fields appearing in both relations have equal values
• Project onto the set of all unique fields.
Other Join Variants
• We have convenient symbols for Outer joins:
• Left Outer join
• R⟕S
• Right Outer join
• R⟖S
• Full Outer join
• R⟗S
Extended Relational Algebra
• Group By / Aggregation Operator (!):
• !age, AVG(rating)(Sailors)
• With selection (HAVING clause):
• !age, AVG(rating), COUNT(*)>2(Sailors)

• Implicitly combines GROUP BY, HAVING and SELECT


Relational Operators and Query Plans
!sname(!sid(!bid("color=‘red’(Boats)) ⋈ Reserves) ⋈ Sailors)

• Expression Tree Representation = Query plan


!sname
• Edges encode “flow” of tuples
• Vertices = Relational Alg Operators ⋈
• Source vertices = table access operators !sid

• Also called dataflow graph ⋈


• Here, “flow of tuples” !bid
• Not specific to DBMSs "color=‘red
• E.g., “big data systems”, ML systems
Boats Reserves Sailors
Summary
• Relational Algebra: a small set of operators mapping relations to
relations
• Operational, in the sense that you specify the explicit order of
operations
• A closed set of operators! Mix and match.
• Easy to manipulate/rewrite/simplify
• Super powerful! Can encapsulate a lot of SQL functionality
• Basic ops include: σ, π, ×, ∪, —
• Important compound ops: ∩, ⋈

You might also like