▪ We start @ 14:15
▪ Remember to join MS Teams – code: 6esizxc
▪ You can download the lecture from MS Teams for your convenience
Instructor: Krystian Wojtkiewicz
School of Computer Science and Engineering
International University, VNU-HCMC Lecture 3: Relational
Model and Algebra
ACKNOWLEDGEMENT
The following slides are referenced from
Dr. Sudeepa Roy, Duke University.
Pilot in the Royal Air Force in WW2
Inventor of the relational model and
algebra while at IBM
Turing Award, 1981
RDBMS = Relational DBMS
http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg
Drinkers Frequent Bars
“X” times a week ▪ Bars Serve Beers At price “Y”
Bars Serve Beers
At price “Y” ▪ Beers - Each has a brewer
▪ Drinkers Likes Beers
▪ Drinkers - Each has an address
▪ Drinkers Frequent Bars “X” times a week
Drinkers Likes Beers
Drinkers Beers
Each has an address Each has a brewer
Bar Beer
name address Name brewer
The Edge 108 Morris Street Budweiser Anheuser-Busch Inc.
Satisfaction 905 W. Main Street Corona Grupo Modelo
Dixie Dixie Brewing
Drinker Frequents
“BEERS” AS A name
Amy
address
100 W. Main Street
drinker
Ben
bar
Satisfaction
times_a_week
2
RELATIONAL Ben
Dan
101 W. Main Street
300 N. Duke Street
Dan
Dan
The Edge
Satisfaction
1
2
DATABASE
Likes Serves
drinker beer bar beer price
Amy Corona The Edge Budweiser 2.50
Dan Budweiser The Edge Corona 3.00
Dan Corona Satisfaction Budweiser 2.25
Ben Budweiser
• A database is a collection of relations (or tables) Serves
• Each relation has a set of attributes (or columns)
• Each attribute has a name and a domain (or type)
• Set-valued attributes are not allowed
bar beer price
• Each relation contains a “set” of tuples (or rows)
• Each tuple has a value for each attribute of the relation The Edge Budweiser 2.50
• Duplicate tuples are not allowed (Two tuples are
duplicates if they agree on all attributes) The Edge Corona 3.00
• Ordering of rows doesn’t matter (even
though output is always in some order)
Satisfaction Budweiser 2.25
However, SQL supports “bag” or duplicate tuples (why?)
Simplicity is a virtue!
Ordering of rows doesn’t matter (even though output is always in some order)
Beer Frequents Serves
Name brewer drinker bar times_a_week bar beer price
Budweiser Anheuser-Busch Inc. Ben Satisfaction 2 The Edge Budweiser 2.50
Corona Grupo Modelo Dan The Edge 1 The Edge Corona 3.00
Dixie Dixie Brewing Dan Satisfaction 2 Satisfaction Budweiser 2.25
• Schema (metadata)
• Specifies the logical structure of data
• Is defined at setup time
• Rarely changes
• Instance
• Represents the data content
• Changes rapidly, but always conforms to the schema
Compare to types vs. collections of objects of these types in a
programming language
Schema (metadata)
• Beer (name string, brewer string)
• Serves (bar string, beer string, price float)
• Frequents (drinker string, bar string, times_a_week int)
Instance
• Beer {<Budweiser, Anheuser-Busch Inc.>, <corona, Grupo
Modelo>,…}
• Serves {<the Edge, Budweiser, 2.50>,<The Edge, Corona, 3.0>,..}
• Frequents {<Ben, Satisfaction,2>,<Dan, The Edge, 1>,…}
A language for querying relational data
based on “operators”
RelOp
RelOp
• Core operators:
• Selection, projection, cross product, union, difference,
and renaming
• Additional, derived operators:
• Join, natural join, intersection, etc.
• Compose operators to make complex queries
Input: a table 𝑅
𝑝 is called a
Notation: 𝜎P𝑅 selection condition
(or predicate)
Purpose: filter rows according to some
criteria
Output: same columns as 𝑅, but only
rows of 𝑅 that satisfy 𝑝 (𝑠𝑒𝑡!)
Find beers with price < 2.75
Serves 𝝈𝒑𝒓𝒊𝒄𝒆 < 𝟐.𝟕𝟓 Serves
bar beer price bar beer price
The Edge Budweiser 2.50 The Edge Budweiser 2.50
The Edge Corona 3.00
Satisfaction Budweiser 2.25 Satisfaction Budweiser 2.25
• Selection condition can include any column of 𝑅, constants,
comparison (=, ≤, etc.) and Boolean connectives (∧: and, ∨:
or, ¬: not)
• Example: Serves tuples for “The Edge” or price >= 2.75
𝜎bar= ‘The Edge’ ∨ price 2.75𝑆𝑒𝑟𝑣𝑒𝑠
• You must be able to evaluate the condition over each single
row of the input table!
• Example: the most expensive beer at any bar
𝜎price every price in Serves 𝑈𝑠𝑒𝑟
Serves
bar beer price
The Edge Budweiser 2.50
The Edge Corona 3.00
Satisfaction Budweiser 2.25
Input: a table 𝑅
𝐿 is a list of
Notation: 𝜋L 𝑅 columns in 𝑅
Purpose: output chosen columns
Output: same rows, but only the
columns in 𝐿 (𝑠𝑒𝑡!)
Example: Find all the prices for each beer
Serves
𝝅𝒃𝒆𝒆𝒓,𝒑𝒓𝒊𝒄𝒆 Serves
bar beer price
The Edge Budweiser 2.50 beer price
Budweiser 2.50
The Edge Corona 3.00
Corona 3.00
Satisfaction Budweiser 2.25
Budweiser 2.25
Output of 𝜋beerServes?
Duplicate output rows are removed (by definition)
Example: beer on servers
𝝅𝒃𝒆𝒆𝒓 Serves
Serves
bar beer price beer
The Edge Budweiser 2.50 Budweiser
The Edge Corona 3.00 Corona
Satisfaction Budweiser 2.25 Budweiser
Input: two tables 𝑅 and 𝑆
Natation: 𝑅×𝑆
Purpose: pairs rows from two tables
Output: for each row 𝑟 in 𝑅 and each 𝑠 in 𝑆, output
a row 𝑟𝑠 (concatenation of 𝑟 and 𝑠)
Bar Frequents
name address drinker bar times_a_week
The Edge 108 Morris Street Ben Satisfaction 2
Satisfaction 905 W. Main Street Dan The Edge 1
Dan Satisfaction 2
Bar x Frequents
name address drinker bar times_a_
w eek
The Edge 108 Morris Street Ben Satisfaction 2
The Edge 108 Morris Street Dan The Edge 1
The Edge 108 Morris Street Dan Satisfaction 2
Satisfaction 905 W. Main Street Ben Satisfaction 2
Satisfaction 905 W. Main Street Dan The Edge 1
Satisfaction 905 W. Main Street Dan Satisfaction 2
Ordering of columns is unimportant as far as contents are concerned.
name address drinker bar time drinker bar time name address
s_a_ s_a_
w w
eek eek
The Edge 108 Morris Street Ben Satisfaction 2 Ben Satisfaction 2 The Edge 108 Morris Street
The Edge 108 Morris Street Dan The Edge 1 Dan The Edge 1 The Edge 108 Morris Street
The Edge 108 Morris Street Dan Satisfaction 2 = Dan Satisfaction 2 The Edge 108 Morris Street
Satisfaction 905 W. Main Street Ben Satisfaction 2 Ben Satisfaction 2 Satisfaction 905 W. Main Street
Satisfaction 905 W. Main Street Dan The Edge 1 Dan The Edge 1 Satisfaction 905 W. Main Street
Satisfaction 905 W. Main Street Dan Satisfaction 2 Dan Satisfaction 2 Satisfaction 905 W. Main Street
So cross product is commutative, i.e., for any R and S, R X S = S X R (up to the ordering
of columns)
(Also known as “theta-join”: most general joins)
• Input: two tables 𝑅 and 𝑆
• Notation: 𝑅 ⋈P 𝑆 One of the most important operations!
• 𝑝 is called a join condition (or predicate)
• Purpose: relate rows from two tables according to some criteria
• Output: for each row 𝑟 in 𝑅 and each row 𝑠 in 𝑆, output a row 𝑟𝑠 if 𝑟 and 𝑠
satisfy 𝑝
• Shorthand for 𝜎P(R×𝑆)
• Predicate p only has equality (A = 5 ∧ B = 7) : equijoin
Extend Frequents relation with addresses of the bars
𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠 ⋈bar = name 𝐵𝑎𝑟
Ambiguous attribute? Prefix a column reference with table name and “.” to disambiguate identically named columns
from different tables. Ex. Use Bar.name
Bar Frequents
name address drinker bar times_a_week
The Edge 108 Morris Street Ben Satisfaction 2
Satisfaction 905 W. Main Street Dan The Edge 1
Dan Satisfaction 2
name address drinker bar times_a_week
The Edge 108 Morris Street Ben Satisfaction 2
The Edge 108 Morris Street Dan The Edge 1
The Edge 108 Morris Street Dan Satisfaction 2
Satisfaction 905 W. Main Street Ben Satisfaction 2
Satisfaction 905 W. Main Street Dan The Edge 1
Satisfaction 905 W. Main Street Dan Satisfaction 2
Theta Join
Equi-Join
Natural Join
Later, (left/right) outer join, semi-join
Input: two tables 𝑅 and 𝑆
Notation: 𝑅 ⋈ 𝑆 (i.e. no subscript)
Purpose: relate rows from two tables, and
• Enforce equality between identically named columns
• Eliminate one copy of identically named columns
Shorthand for 𝜋L(R ⋈P 𝑆) , where
• 𝑝 equates each pair of columns common to 𝑅 and 𝑆
• 𝐿 is the union of column names from 𝑅 and 𝑆 (with duplicate columns removed)
Serves ⋈ 𝐿𝑖𝑘𝑒𝑠
= 𝜋? (𝑆𝑒𝑟𝑣𝑒𝑠 ⋈? 𝐿𝑖𝑘𝑒𝑠)
bar, beer, price, drinker (𝑆𝑒𝑟𝑣𝑒𝑠 ⋈Serves.beer = Likes.beer 𝐿𝑖𝑘𝑒𝑠)
=𝜋
Serves Likes
drinker beer
bar beer price
The Edge Budweiser 2.50 Amy Corona
The Edge Corona 3.00 Dan Budweiser
Satisfaction Budweiser 2.25 Dan Corona
Ben Budweiser
Serves ⋈ 𝐿𝑖𝑘𝑒𝑠
bar beer price drinker
The Edge Budweiser 2.50 Dan Natural Join is on beer.
The Edge Budweiser 2.50 Ben
Only one column for beer in the output
The Edge Corona 3.00 Amy
What happens if the tables
The Edge Corona 3.00 Dan
have two or more common columns?
... …. …..
Input: two tables 𝑅 and 𝑆
Notation: 𝑅 ∪ 𝑆
• 𝑅 and 𝑆 must have identical schema
Output:
• Has the same schema as 𝑅 and 𝑆
• Contains all rows in 𝑅 and all rows in 𝑆 (with duplicate rows
removed)
Important for set operations:
Union Compatibility
Important for set operations:
Union Compatibility
• Input: two tables 𝑅 and 𝑆
• Notation: 𝑅 − 𝑆
• 𝑅 and 𝑆 must have identical schema
• Output:
DIFFERENCE • Has the same schema as 𝑅 and 𝑆
• Contains all rows in 𝑅 that are not in 𝑆
Example on board
Input: two tables 𝑅 and 𝑆
Notation: 𝑅 ∩ 𝑆
• 𝑅 and 𝑆 must have identical schema
Output:
• Has the same schema as 𝑅 and 𝑆
• Contains all rows that are in both 𝑅 and 𝑆
How can you write it using other operators?
Shorthand for 𝑅 − (𝑅 − 𝑆)
Also equivalent to 𝑆 − (𝑆 − 𝑅)
And to 𝑅 ⋈ 𝑆
Important for set operations:
Union Compatibility
Notation: 𝜌S 𝑅, 𝜌(A1, A2,…)𝑅,
Input: a table 𝑅 or 𝜌S
(A1,A2,…)R
Output: a table with the
Purpose: “rename” a
same rows as 𝑅, but
table and/or its columns
called differently
Used to As with all other
• Avoid confusion caused by relational operators, it
identical column names doesn’t modify the
• Create identical column database
names for natural joins
• Think of the renamed table as
a copy of the original
Frequents
• Find drinkers who drinker bar times_a_week
frequent both “The Edge”
Ben Satisfaction 2
and “Satisfaction”
Dan The Edge 1
Dan Satisfaction 2
𝜋drinker 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠 ⋈ Bar= ‘The Edge ∧ 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
Bar = ‘Satisfaction’∧
drinker = drinker
𝜌 d1, b1, t1 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
𝜋uid1 ⋈b 1 = ‘The Edge’∧ b2 = Satisfaction’ ∧ d1=d2
𝜌 d2, b2, t2 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
What if you move 𝜎 to the top?
Find addresses of all bars that Still correct?
‘Dan’ frequents More or less efficient?
Bar
name address
Also called logical Plan tree
The Edge 108 Morris Street 𝜋address
Satisfaction 905 W. Main Street
⋈bar=name
Frequents
drinker bar times_a_week 𝜎drinker = ‘Dan’
Ben Satisfaction 2
Dan The Edge 1
𝐵𝑎𝑟 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
Dan Satisfaction 2
Equivalent to
𝜋 address(𝐵𝑎𝑟 ⋈bar = name (𝜎drinker =‘Dan’𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠))
• Selection: 𝜎P 𝑅
• Projection: 𝜋L 𝑅
• Cross product: 𝑅×𝑆
• Union: 𝑅 ∪ 𝑆
• Difference: 𝑅 − 𝑆
• Renaming: 𝜌S A1, A2 ,… 𝑅
• Does not really add “processing” power
SUMMARY OF DERIVED
OPERATORS
▪ • Join: 𝑅 ⋈P 𝑆
▪ Natural join: 𝑅 ⋈ 𝑆
▪ Intersection: 𝑅 ∩ 𝑆
▪ Many more
▪ Semijoin, anti-semijoin, quotient, …
Frequents(drinker, bar, times_of_week)
Exercise Bar(name, address)
Drinker(name, address)
Bars that drinkers in address “300 N. Duke Street” do not frequent
Frequents(drinker, bar, times_of_week)
Bar(name, address)
Exercise Drinker(name, address)
Bars that drinkers in address “300 N. Duke Street”
do not frequent
− Bars that the drinkers at this
All bars address frequent
𝜌bar 𝜋bar
⋈ drinker = 𝑛𝑎𝑚𝑒
𝜋name
𝜎address=‘300 N.Duke Street
𝐵𝑎𝑟 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
𝐷𝑟𝑖𝑛𝑘𝑒𝑟
41
Frequents(drinker, bar, times_of_week)
Bar(name, address)
A trickier Exercise Drinker(name, address)
For each bar, find the drinkers who frequent it max no. times a
week
42
Frequents(drinker, bar, times_of_week)
Bar(name, address)
A trickier Exercise Drinker(name, address)
For each bar, find the drinkers who frequent it max no. times a week
• Who do NOT visit a bar max no. of times?
• Whose times_of_weeks is lower than somebody else’s for a given bar
𝜋bar, drinker 𝜋F1.bar, F1.drinker
⋈F1.times-of-week < F2.times-of-week
∧F1.bar=F2.bar
𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
𝜌F1 𝜌F2
A deeper question:
When (and why) is “−” needed?
𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠 𝐹𝑟𝑒𝑞𝑢𝑒𝑛𝑡𝑠
EXPRESSIONS IN A
SINGLE ASSIGNMENT
Example: the theta-join R3 = R1 ⋈C R2 can be written: R3
:= σC (R1 Χ R2)
Precedence of relational operators:
1. [σ, π, ρ] (highest)
2. [Χ, ⋈]
3. ∩
4. [∪, —]
▪ Sailors(sid, sname, rating, age)
▪ Boats(bid, bname, color)
▪ Reserves(sid, bid, day)
Sailors(sid, sname, rating, age) Use of rename operation
Boats(bid, bname, color)
Reserves(sid, bid, day)
Can identify all red or green boats, then find sailors who’ve reserved one of these boats:
(Tempboats, ( color ='red ' color =' green' Boats))
sname(Tempboats ⨝ Reserves ⨝ Sailors)
Can also define Tempboats using union. Try the “AND” version yourself
Not supported as a primitive operator, but useful for expressing queries like:
Find sailors who have reserved all boats.
Let A have 2 fields, x and y; B have only field y:
A/B = x (A) − x (( x (A) B)−A)
• i.e., A/B contains all x tuples (sailors) such that for every y tuple (boat) in B, there is
an xy tuple in A.
• Or: If the set of y values (boats) associated with an x value (sailor) in A contains all y
values in B, the x value is in A/B.
sno pno pno pno pno
s1 p1 p2 p2 p1
s1 p2 p4 p2
B1
s1 p3 p4
s1 p4
B2
sno B3
s2 p1 s1
s2 p2 s2 sno
s3 p2 s3 s1 sno
s4 p2 s4 s4 s1
s4 p4
A/B1 A/B2 A/B3
A
Division is not crucial , but a useful shorthand.
(Also true of joins, but joins are so common that systems implement
joins specially)
Idea: For A/B, compute all x values that are not
`disqualified’ by some y value in B.
x value is disqualified if by attaching y value from B, we obtain an xy
tuple that is not in A.
Disqualified x values: all disqualified tuples
A/B: x (A) − x (( x (A) B)−A)
Find the name of sailors who’ve
reserved all boats
Uses division; schemas of the input relations to/ must be carefully chosen:
(Tempsids, ( sid,bidReserves) / ( bid Boats))
sname (Tempsids ⨝ Sailors) Sailors(sid, sname, rating, age)
Boats(bid, bname, color)
To find sailors who’ve reserved all ‘Interlake’ boats: Reserves(sid, bid, day)
..... / ( Boats)
bid bname =' Interlake'
Add more rows RelOp What happens
to the input...
to the output?
If some old output rows may need to be
removed then the operator is non-monotone
Otherwise the operator is monotone
MONOTONE
OPERATORS i.e., old output rows always remain “correct”
when more rows are added to the input
Formally, for a monotone operator 𝑜𝑝:
𝑅 ⊆ 𝑅' implies 𝑜𝑝(𝑅) ⊆ 𝑜𝑝(𝑅’) for any 𝑅, 𝑅'
44
• Selection: 𝜎P 𝑅 Monotone
• Projection: 𝜋L 𝑅 Monotone
• Cross product: 𝑅×𝑆 Monotone
▪ Join: 𝑅 ⋈P 𝑆 Monotone
• Natural join: 𝑅 ⋈ 𝑆 Monotone
▪ Union: 𝑅 ∪ 𝑆 Monotone
• Difference: 𝑅 − 𝑆 Monotone w.r.t. 𝑅; non-monotone w.r.t 𝑆
• Intersection: 𝑅 ∩ 𝑆 Monotone
45
• Composition of monotone operators produces a monotone query
• Old output rows remain “correct” when more rows are added to the input
• Is the “highest” query monotone?
• No!
• Current highest price 3.0
• Add another row with price 3.01
• Old answer is invalidated
So, it must use difference!
▪ Difference: The only non-monotone operator
▪ Projection: The only operator that removes columns
▪ Cross product: The only operator that adds columns
▪ Union: The only operator that allows you to add rows? A
more rigorous argument?
▪ Selection: Homework problem
DUPLICATE HANDLING (“BAG GROUPING AND “EXTENSION” (OR “EXTENDED ☞ ALL THESE WILL COME UP
ALGEBRA”) AGGREGATION PROJECTION”) TO ALLOW WHEN WE TALK ABOUT SQL ☞
NEW COLUMN VALUES TO BE BUT FOR NOW WE WILL STICK
COMPUTED TO STANDARD RELATIONAL
ALGEBRA WITHOUT THESE
EXTENSIONS
Simple: • A small set of core operators
• Semantics are easy to grasp
• Yes, compared with older languages like CODASYL
Declarative? • Though assembling operators into a query does feel
somewhat “procedural”
Complete? • With respect to what?