Database Design: Normalization
Agenda
1. Database Design
2. Normal forms & functional dependencies
3. Finding functional dependencies
4. Closures, superkeys & keys
Design Theory
The biggest problem needed to be solved in database is data redundancy.
Why data redundancy is the problem? Because it causes:
Insert Anomaly
Update Anomaly
Delete Anomaly
Design theory is about how to represent your data to avoid anomalies.
Achieved by Data Normalization, a process of analyzing a relation to ensure
that it is well formed.
Normalization involves decomposing relations with anomalies to produce
smaller well structured relations.
If a relation is normalized (or well formed), rows can be inserted, deleted and
modified without creating anomalies.
Data Anomalies & Constraints
Constraints Prevent (some)
Anomalies in the Data
A poorly designed database causes
anomalies:
Student Course Room
Mary CSC261 101
Joe CSC261 101
Sam CSC261 101
.. .. ..
If every course is in only one room,
contains redundant information!
Constraints Prevent (some)
Anomalies in the Data
A poorly designed database causes
anomalies:
Student Course Room
Mary CSC261 101
Joe CSC261 703
Sam CSC261 101
.. .. ..
If we update the room number for one tuple, we
get inconsistent data = an update anomaly
Constraints Prevent (some)
Anomalies in the Data
A poorly designed database causes
anomalies:
Student Course Room
.. .. ..
If everyone drops the class, we lose what
room the class is in! = a delete anomaly
Constraints Prevent (some)
Anomalies in the Data
A poorly designed database causes
anomalies:
Student Course Room
Similarly, we
Mary CSC261 B01 can’t reserve a
Joe CSC261 B01 room without
Sam CSC261 B01 students = an
… CSC461 703 insert anomaly
.. .. ..
Constraints Prevent (some)
Anomalies in the Data
Student Course Is this form better?
Mary CSC261 Course Room
• Redundancy?
Joe CSC261 CSC261 101 • Update anomaly?
Sam CSC261 CSC257 601 • Delete anomaly?
• Insert anomaly?
.. ..
Today: develop theory to understand why this design
may be better and how to find this decomposition…
Database Anomalies ACTIVITY Relation
Example 2 StudentID Activity Fee
100 Skiing 200
Anomalies are problems caused by bad database design. 100 Golf 65
Example: 175 Squash 50
175 Swimming 50
ACTIVITY(StudentID, Activity, Fee)
200 Swimming 50
An insertion anomaly occurs when a row cannot be added to a 200 Golf 65
relation, because not all data are available (or one has to invent “dummy” data)
Example: we want to store that scuba diving costs $175, but have no place to put
this information until a student takes up scuba-diving (unless we create a fake
student)
A deletion anomaly occurs when data is deleted from a relation, and other critical data
are unintentionally lost
Example: if we delete the record with StudentID = 100, we forget that skiing costs
$200
An update anomaly occurs when one must make many changes to reflect the
modification of a single datum
Example: if the cost of swimming changes, then all entries with swimming
Activity must be changed too
Cause of Anomalies
Anomalies are primarily caused by:
1. Data redundancy: replication of the same field in multiple tables, other than foreign
keys
2. Functional dependencies including:
Partial dependency
Transitive dependency
Multi-value dependency
Functional Dependencies
Functional Dependencies for Dummies
• A relationship between attributes where one attribute (or
group of attributes) determines the value of another
attribute (or group of attributes) in the same table.
• Example:
SSN uniquely identify any Person
(SSN) (First Name, Last Name)
Candidate Keys/Primary Keys and Functional Dependencies
By definition:
• A candidate key of a relation functionally determines all
other non-key attributes in the row.
Implies:
• A primary key of a relation functionally determines all other
non-key attributes in the row.
EmployeeID (EmployeeName, EmpPhone)
Functional Dependency
Def: Let A, B be sets of attributes, we write A B or say
A functionally determines B if, for any tuples t1 and t2:
t1[A] = t2[A] implies t1[B] = t2[B] and we
call A B a functional dependency
A B means that
“whenever two tuples agree on A then they agree on B.”
A It is a determinant set.
B It is a dependent attribute.
A functionally determines B.
{A → B}
B is a functionally dependent on A.
A Picture of FDs
Defn (again):
A1 … Am B1 … Bn Given attribute sets A = {A1,…,Am}
and B = {B1,…Bn} in R,
A Picture of FDs
Defn (again):
A1 … Am B1 … Bn Given attribute sets A={A1,…,Am}
and B = {B1,…Bn} in R,
The functional dependency A B
on R holds if for any ti, tj in R:
A Picture of FDs
Defn (again):
Given attribute sets A={A1,…,Am}
and B = {B1,…Bn} in R,
A1 … Am B1 … Bn
The functional dependency A B
ti
on R holds if for any ti,tj in R:
tj
ti[A1] = tj[A1] AND ti[A2]=tj[A2] AND …
AND ti[Am] = tj[Am]
If t1, t2 agree here..
A Picture of FDs
Defn (again):
Given attribute sets A={A1,…,Am}
and B = {B1,…Bn} in R,
A1 … Am B1 … Bn
The functional dependency A B
ti
on R holds if for any ti,tj in R:
tj
if ti[A1] = tj[A1] AND ti[A2]=tj[A2] AND
… AND ti[Am] = tj[Am]
If t1,t2 agree here.. …they also agree here! then ti[B1] = tj[B1] AND ti[B2]=tj[B2]
AND … AND ti[Bn] = tj[Bn]
FDs for Relational Schema Design
High-level idea: why do we care about FDs?
1. Start with some relational schema (e.g., design by ER diagram)
2. Find out its functional dependencies (FDs)
3. Use these to design a better schema
• One which minimizes the possibility of anomalies
Functional Dependencies as Constraints
A functional dependency is a form of
constraint Student Course Room
Mary CS145 B01
• Holds on some instances not others. Joe CS145 B01
Sam CS145 B01
• Part of the schema, helps define a valid
instance. .. .. ..
Note: The FD
{Course} -> {Room} holds on
Recall: an instance of a schema is a multiset of this instance
tuples conforming to that schema, i.e. a table
Functional Dependencies as Constraints
Note that:
• You can check if an FD is Student Course Room
violated by examining a single Mary CS145 B01
instance;
Joe CS145 B01
Sam CS145 B01
• However, you cannot prove
that an FD is part of the .. .. ..
schema by examining a single
instance. However, cannot prove that the
• This would require checking FD {Course} -> {Room} is part of
every valid instance the schema
More Examples
An FD is a constraint which holds, or does not hold on
an instance:
EmpID Name Phone Position
E0045 Smith 1234 Clerk
E3542 Mike 9876 Salesrep
E1111 Smith 9876 Salesrep
E9999 Mary 1234 Lawyer
More Examples
EmpID Name Phone Position
E0045 Smith 1234 Clerk
E3542 Mike 9876 Salesrep
E1111 Smith 9876 Salesrep
E9999 Mary 1234 Lawyer
{Position} {Phone}
More Examples
EmpID Name Phone Position
E0045 Smith 1234 Clerk
E3542 Mike 9876 Salesrep
E1111 Smith 9876 Salesrep
E9999 Mary 1234 Lawyer
but not {Phone} {Position}
ACTIVITY
Find at least three FDs which
A B C D E hold on this instance:
1 2 4 3 6
3 2 5 1 8
1 4 4 5 7 {A } {C }
{A,B } {C }
1 2 4 3 6 {E } {D }
3 2 5 1 8
Armstrong inference rules
Armstrong's Axioms is a set of rules.
It provides a simple technique for reasoning about functional
dependencies.
It was developed by William W. Armstrong in 1974.
It is used to infer all the functional dependencies on a relational
database.
Armstrong inference rules
Axioms:
Reflexivity: if YX, then X→Y
Augmenta on: if X→Y, then WX→WY
Transi vity: if X→Y and Y→Z, then X→Z
Derived Rules:
Union: if X→Y and X→Z, the X→YZ
Decomposi on: if X→YZ, then X→Y and X→Z
Pseudo transi vity: if X→Y and WY→Z, then XW→Z
Armstrong inference rules
Axioms are both
Sound:
when applied to a set of functional dependencies they only
produce dependency tables that belong to the transitive closure
of that set
Complete:
can produce all dependency tables that belong to the transitive
closure of the set
Armstrong inference rules
Three last rules can be derived from the first three (the axioms)
Let us look at the union rule:
if X→Y and X→Z, the X→YZ
Using the first three axioms, we have:
if X→Y, then XX→XY same as X→XY (2nd)
if X→Z, then YX→YZ same as XY→YZ (2nd)
if X→XY and XY→YZ, then X→YZ (3rd)
Example:
Consider relation E = (P, Q, R, S, T, U) having set of Functional Dependencies (FD).
P→Q P→R
QR → S Q→T
QR → U PR → U
Calculate some members of axioms are as follows:
1. P → T
2. PR → S Axioms:
3. QR → SU Reflexivity: if YX, then X→Y
4. PR → SU Augmenta on: if X→Y, then WX→WY
Transi vity: if X→Y and Y→Z, then X→Z
Derived Rules:
Union: if X→Y and X→Z, the X→YZ
Decomposi on: if X→YZ, then X→Y and X→Z
Pseudo transi vity: if X→Y and WY→Z, then XW→Z
Axioms:
Reflexivity: if YX, then X→Y
Augmenta on: if X→Y, then WX→WY
Transi vity: if X→Y and Y→Z, then X→Z
Solution: Derived Rules:
Union: if X→Y and X→Z, the X→YZ
1. P → T Decomposi on: if X→YZ, then X→Y and X→Z
Pseudo transi vity: if X→Y and WY→Z, then XW→Z
In the FD set, P → Q and Q → T
So, Using Transi ve Rule: If {A → B} and {B → C}, then {A → C}
∴ If P → Q and Q → T, then P → T.
2. PR → S
In the above FD set, P → Q
As, QR → S
So, Using Pseudo Transi vity Rule: If{A → B} and {BC → D}, then {AC → D}
∴ If P → Q and QR → S, then PR → S.
3. QR → SU
In above FD set, QR → S and QR → U
So, Using Union Rule: If{A → B} and {A → C}, then {A → BC}
∴ If QR → S and QR → U, then QR → SU.
4. PR → SU
So, Using Pseudo Transi vity Rule: If{A → B} and {BC → D}, then {AC → D}
∴ If PR → S and PR → U, then PR → SU.
Trivial Functional Dependency
If A holds B {A → B}, where B is a subset of A, then it is called a Trivial
Trivial
Functional Dependency. Trivial always holds Functional Dependency.
If A holds B {A → B}, where B is not a subset A, then it is called as a Non-
Non-Trivial
Trivial Functional Dependency.
Normalization
https://www.youtube.com/watch?v=UrYLYV7WSHM
https://www.youtube.com/watch?v=l5DCnCzDb8g
Normalization
Normalization is the process of removing redundant data from your
tables to improve storage efficiency, data integrity, and scalability.
Normalization generally involves splitting existing tables into multiple
ones, which must be re-joined or linked each time a query is issued.
Why normalization?
The relation derived from the user view or data store will most
likely be unnormalized.
The problem usually happens when an existing system uses
unstructured file, e.g. in MS Excel.
Unnormalized Form (table)
Example
Normalization Example
• (Student ID) (Student Name, DormName, DormCost)
• However, if
– (DormName) (DormCost)
Then, DormCost should be put into its own relation, resulting
in:
(Student ID) (Student Name, DormName)
(DormName) ( DormCost)
Normalization Example
• (AttorneyID, ClientID) (ClientName, MeetingDate,
Duration)
• However, if
– ClientID ClientName
• Then: ClientName should be in its own relation:
• (AttorneyID, ClientID) (MeetingDate, Duration)
• (ClientID) (ClientName)
Steps of Normalization
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
Domain Key Normal Form (DKNF)
In practice, 1NF, 2NF, 3NF, and BCNF are enough for
database.
Normal Forms
• 1st Normal Form (1NF) = All tables are flat
• 2nd Normal Form (2NF)
DB designs based on
functional dependencies,
• 3 rd Normal Form (3NF) intended to prevent data
anomalies
• Boyce-Codd Normal Form (BCNF)
• 4th and 5th Normal Forms = see text books
Normalization Steps
Table with Remove
First Normal
multivalued multivalued
Form (1NF)
attributes
attributes
First Normal Remove Second
Partial Normal Form
Form (1NF) Dependencies (2NF)
Second Remove
Third Normal
Normal Transitive Form (3NF)
Dependencies
Form (2NF)
First Normal Form (1NF)
The official qualifications for 1NF are:
1. Each attribute name must be unique.
2. Each attribute value must be single.
3. Each row must be unique.
4. There is no repeating groups.
Additional:
Choose a primary key.
Reminder:
A primary key is unique, not null, unchanged. A primary
key can be either an attribute or combined attributes.
1st Normal Form (1NF)
Student Courses
Student Courses
Mary CS145
Mary {CS145, CS229}
Mary CS229
Joe {CS145, CS106}
Joe CS145
… …
Joe CS106
Violates 1NF. In 1st NF
1NF Constraint: Types must be atomic!
First Normal Form (1NF) (Cont.)
Example of a table not in 1NF :
Group Topic Student Score
Group A Intro MongoDB Sok San 18 marks
Sao Ry 17 marks
Group B Intro MySQL Chan Tina 19 marks
Tith Sophea 16 marks
It violates the 1NF because:
Attribute values are not single.
Repeating groups exists.
First Normal Form (1NF) (Cont.)
After eliminating:
Group Topic Family Name Given Name Score
A Intro MongoDB Sok San 18
A Intro MongoDB Sao Ry 17
B Intro MySQL Chan Tina 19
B Intro MySQL Tith Sophea 16
Now it is in 1NF.
However, it might still violate 2NF and so on.
Functional Dependencies
We say an attribute, B, has a functional dependency on another
attribute, A, if for any two records, which have the same value
for A, then the values for B in these two records must be the
same. We illustrate this as:
AB (read as: A determines B or B depends on A)
Employee_name Project Email_address
Joe San POS Mart Sys soksan@yahoo.com
Sao Ry Univ Mgt Sys sao@yahoo.com
Joe San Web Redesign soksan@yahoo.com
Chan Sokna POS Mart Sys chan@gmail.com
Sao Ry DB Design sao@yahoo.com
Employee_name Email_address
Functional Dependencies (cont.)
EmpNum EmpEmail EmpFname EmpLname
123 jdoe@abc.com John Doe
456 psmith@abc.com Peter Smith
555 alee1@abc.com Alan Lee
633 pdoe@abc.com Peter Doe
787 alee2@abc.com Alan Lee
If EmpNum is the PK then the FDs:
EmpNum EmpEmail, EmpFname, EmpLname
must exist.
Functional Dependencies (cont.)
EmpNum EmpEmail, EmpFname, EmpLname
3 different ways
you might see
FDs depicted
EmpEmail
EmpNum EmpFname
EmpLname
EmpNum EmpEmail EmpFname EmpLname
Determinant
Functional Dependency
EmpNum EmpEmail
Attribute on the left hand side is known as the
determinant
• EmpNum is a determinant of EmpEmail
Second Normal Form (2NF)
The official qualifications for 2NF are:
1. A table is already in 1NF.
2. All non-key attributes are fully dependent on the
primary key.
All partial dependencies are removed to place in another
table.
Partial Dependencies
• Partial dependency is a functional dependency whose determinant is
part of the primary key (but not all of it)
• Example:
ACTIVITY(StudentID, Activity, Fee)
StudentID Activity Fee
StudentID
100 Skiing 200
Fee
100 Golf 65
175 Squash 50
Activity 175 Swimming 50
200 Swimming 50
200 Golf 65
Example of a table not in 2NF:
CourseID SemesterID Num Student Course Name
IT101 201301 25 Database
IT101 201302 25 Database
IT102 201301 30 Web Prog
IT102 201302 35 Web Prog
IT103 201401 20 Networking
Primary Key
The Course Name depends on only CourseID, a part of the primary key not the
whole primary {CourseID, SemesterID}. It’s called partial dependency.
Solution:
Remove CourseID and Course Name together to create a new table.
CourseID Course Name C ourseI D SemesterID Num Student
IT101 Database IT101 201301 25
IT101 Database IT101 201302 25
IT102 Web Prog IT102 201301 30
IT102 Web Prog IT102 201302 35
IT103 Networking IT103 201401 20
Done?
Oh no, it is still not in
1NF yet.
Remove the repeating CourseID Course Name
groups too. IT101 D atabase
Finally, connect the IT102 Web Prog
relationship.
IT103 Networking
Third Normal Form (3NF)
The official qualifications for 3NF are:
1. A table is already in 2NF.
2. Nonprimary key attributes do not depend on other
nonprimary key attributes
(i.e. no transitive dependencies)
All transitive dependencies are removed to place
in another table.
Transitive Dependencies
Transitive dependency is a functional dependency whose determinant is
not the primary key, part of the primary key, or a candidate key.
Transitive functionality is a functional dependency in which a non-key
attribute is determined by another non-key attribute.
Example:
ACTIVITY(StudentID, Activity, Fee)
StudentID Activity Fee
100 Skiing 200
100 Golf 65
175 Squash 50
175 Swimming 50
200 Swimming 50
200 Golf 65
Example of a Table not in 3NF:
StudyID CourseName TeacherName TeacherTel
1 Database Sok Piseth 012 123 456
2 Database Sao Kanha 0977 322 111
3 Web Prog Chan Veasna 012 412 333
4 Web Prog Chan Veasna 012 412 333
5 Networking Pou Sambath 077 545 221
Primary Key
The TeacherTel is a nonkey attribute, and the
TeacherName is also a nonkey attribute. But
TeacherTel depends on TeacherName.
It is called transitive dependency.
Solution:
Remove Teacher Name and TeacherTel together to create a
new table.
Done?
Teacher Name Teacher Tel
Oh no, it is still not in 1NF yet.
Sok Piseth 012 123 456 Remove Repeating row.
Sao Kanha 0977 322 111
StudyI D C ourse N am e T.I D
Chan Veasna 012 412 333
1 Database T1
Chan Veasna 012 412 333
2 Database T2
Pou Sambath 077 545 221
3 Web Prog T3
Teacher Name Teacher Tel 4 Web Prog T3
Sok Piseth 012 123 456 5 Networking T4
Sao Kanha 0977 322 111
Chan Veasna 012 412 333
Pou Sambath 077 545 221
Note about primary key: ID Teacher Name Teacher Tel
-In theory, you can choose T1 Sok Piseth 012 123 456
TeacherName to be a primary key. T2 Sao Kanha 0977 322 111
-But in practice, you should add
T3 Chan Veasna 012 412 333
TeacherID as the primary key.
T4 Pou Sambath 077 545 221
Boyce Codd Normal Form (BCNF) – 3.5NF
The official qualifications for BCNF are:
1. A table is already in 3NF.
2. All determinants must be superkeys.
All determinants that are not superkeys are removed to
place in another table.
K is a superkey for relation R if K functionally determines
all of R.
K is a (candidate)key for R if K is a superkey, but no
proper subset of K is a superkey.
Boyce Codd Normal Form (BCNF) (Cont.)
Example of a table not in BCNF:
Student Course Teacher
Sok DB John
Sao DB William
Chan E-Commerce Todd
Sok E-Commerce Todd
Chan DB William
Key: {Student, Course}
Functional Dependency:
{Student, Course}Teacher
Teacher Course
Problem: Teacher is not a superkey but determines Course.
Student Course Solution: Decouple a table
contains Teacher and Course
Sok DB
from original table (Student,
Sao DB Course). Finally, connect the
C han E-Commerce new and old table to third
table contains Course.
Sok E-Commerce
C han DB Course
DB
E-Commerce
Course Teacher
DB John
DB W illiam
E-Commerce Todd
Forth Normal Form (4NF)
The official qualifications for 4NF are:
1. A table is already in BCNF.
2. A table contains no multi-valued dependencies.
Multi-valued dependency: MVDs occur when two or more independent multi
valued facts about the same attribute occur within the same table.
A ->-> B (B multi-valued depends on A)
Example: MVD
Customer(name, addr, phones, email_drinksLiked)
A drinker’s phones are independent of the drinks they like.
name->->phones and name ->->drinksLiked.
Thus, each of a drinker’s phones appears with each of the drinks they like in
all combinations.
Tuples Implied by name->->phones
If we have tuples:
name addr phones drinksLiked
sue a p1 d1
sue a p2 d2
sue a p2 d1
sue a p1 d2
Then these tuples must also be in the relation.
Picture of MVD X ->->Y
X Y others
equal
exchange
Forth Normal Form (4NF) (Cont.)
Example of a table not in 4NF:
Student Major Hobby
Sok IT Football
Sok IT Volleyball
Sao IT Football
Sao Med Football
Chan IT NULL
Puth NULL Football
Tith NULL NULL
Key: {Student, Major, Hobby}
MVD: Student -->> Major, Hobby
Solution: Decouple to each Student Major
table contains MVD. Finally, Sok IT
connect each to a third table
contains Student. Sao IT
Sao Med
Student Chan IT
Sok Puth NULL
Sao Tith NULL
Chan
Puth Student Hobby
Tith Sok Football
Sok Volleyball
Sao Football
Chan NULL
Puth Football
Tith NULL
Fifth Normal Form (5NF)
The official qualifications for 5NF are:
1. A table is already in 4NF.
2. The attributes of multi-valued dependencies are related.
Fifth Normal Form (5NF) (Cont.)
Example of a table not in 5NF:
Seller Company Product
Sok MIAF Trading Zenya
Sao Coca-Cola Corp Coke
Sao Coca-Cola Corp Fanta
Sao Coca-Cola Corp Sprite
Chan Angkor Brewery Angkor Beer
Chan Cambodia Brewery Cambodia Beer
Key: {Seller, Company, Product}
MVD: Seller -->> Company, Product
Product is related to Company.
C ompany
Seller C ompany
Seller 1 M Sok M 1 MIAF Trading
MIAF Trading
Sok Coca-Cola Corp
Sao Coca-Cola Corp
Sao Angkor Brewery
C han Angkor Brewery
C han Cambodia Brewery
C han Cambodia Brewery
1 1
M
M
1 C ompany Product
Seller Product
MIAF Trading Z enya
Sok Z enya
M Coca-Cola Corp C oke
Sao C oke
Product Coca-Cola Corp Fanta
Sao Fanta
Z enya Coca-Cola Corp Sprite
Sao Sprite
C oke Angkor Brewery Angkor Beer
C han Angkor Beer
Fanta C ambodia C ambodia
C han C ambodia Brewery Beer
Sprite
Beer M
Angkor Beer 1
Normalization Practice
Please see the normalization example slides
Acknowledgement
Some of these slides are taken from cs145 course offered by
Stanford University.