Normalization
lecture 6
In this lecture, you will learn
• Overview
• A little background and terminology:
• What is a relational database?
• What is a primary key?
• What is a foreign key?
• Things to know about designing a database:
• The normalization process and how/why use it
• Relating tables
• Types of relationships
Relational Database Management
System
• Collection of information organized in tables
• Tables are also “relations”
• Tables are constructed and associated to each other
through shared fields–“common” fields
– Fields are also “columns” or “attributes”
• A set of attributes comprises a record
– Records are also “rows” or “tuples”
• Tables are related through common fields designated as
primary and foreign keys
• Allow us to find, update, and delete data quickly, and
help to ensure accuracy
Primary and Foreign Key
Fields
• Primary Key
– Primary key fields must be unique and cannot contain
a null value.
• Each table should have a primary key field.
• Concatenated keys: using more than one field as a
primary key field.
• Foreign Key: Fields in a table that refer to the primary
key in another table
• The data in this field must exactly match data
contained in the primary key field.
What is Normalization?
• The process by which we efficiently organize
data to achieve these goals:
– Eliminating redundancy
– Ensuring data is stored in the correct table
– Eliminating need for restructuring database
when data is added.
– Five levels of normal form
• In order to achieve one level of normal form,
each previous level must be met
Definition
• Normalization is a process for evaluating and
correcting table structures to minimize data
redundancies, thereby helping to eliminate
data anomalies. It helps us evaluate table
structures and produce good tables.
Problems of redundancy
• If redundancy exists then this can cause
problems during normal database operations:
– When data is inserted the database the data
must be duplicated where ever redundant
versions of that data exists.
– When data is updated, all redundant data
must be simultaneously updated to reflect that
change.
What are the stages of Normalization?
• Normalization Works through a series of stages
called normal forms:
o Normal form (1NF)
o Second normal form (2NF)
o Third normal form (3NF)
o Boyce-Codd Normal Form (BCNF)
o Forth Normal (4NF)
o Fifth Normal (5NF)
o Domain-key normal form (DKNF)
Normal forms
• The data in the database can be considered to be in
one of a number of `normal forms'. Basically the
normal form of the data indicates how much
redundancy is in that data. The normal forms have a
strict ordering:
– 1st Normal Form
– 2nd Normal Form
– 3rd Normal Form
– BCNF
• There are more forms after BCNF. These are rarely
utilised in system design and are not considered
further here.
Stages of Normalization
• From a structural point of view, 2NF is better
than 1NF, and 3NF is better than 2NF. For
most business database design purposes,
3NF is as high as we need to go in the
normalization process. And some very
specialized applications may require
normalization beyond 4NF.
Integrity Constraints
• An integrity constraint is a rule that restricts the
values that may be present in the database.
• entity integrity - The rows (or tuples) in a relation
represent entities, and each one must be uniquely
identified. Hence we have the primary key that must
have a unique non-null value for each row.
• referential integrity - This constraint involves the
foreign keys. Foreign keys tie the relations together,
so it is vitally important that the links are correct.
Every foreign key must either be null or its value must
be the actual value of a key in another relation.
Understanding Data
• Sometimes the starting point for understanding a problem’s
data requirements is given using functional dependencies.
• A functional dependency is two lists of attributes
separated by an arrow. Given values for the LHS
uniquely identifies a single set of values for the RHS
attributes.
• Consider
R(matrix_no,firstname,surname,tutor_no,tutor_name)
tutor_no -> tutor_name
– A given tutor_no uniquely identifies a tutor_name.
– An implied daterminant is also present:
• matrix_no -> firstname,surname,tutor_no,tutor_name
Extracting understanding
• It is possible that the functional dependencies
have to be extracted by looking a real data
from the database. This is problematic as it is
possible that the data does not contain
enough information to extract all the
dependencies, but it is a starting point.
Extracting understanding
• It is possible that the functional dependencies
have to be extracted by looking a real data
from the database. This is problematic as it is
possible that the data does not contain
enough information to extract all the
dependencies, but it is a starting point.
Example
matric_no Name date_of_birth subject grade
960100 Smith, J 14/11/1977 Databases C
Soft_Dev A
ISDE D
960105 White, A 10/05/1975 Soft_Dev B
ISDE B
960120 Moore, T 11/03/1970 Databases A
Soft_Dev B
Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
Soft_Dev D
ISDE C
Workshop D
Student(matric_no, name, date_of_birth, ( subject, grade ) )
name, date_of_birth -> matric_no
Flattened Tables
matric_no name date_of_birth Subject grade
960100 Smith, J 14/11/1977 Databases C
960100 Smith, J 14/11/1977 Soft_Dev A
960100 Smith, J 14/11/1977 ISDE D
960105 White, A 10/05/1975 Soft_Dev B
960105 White, A 10/05/1975 ISDE B
960120 Moore, T 11/03/1970 Databases A
960120 Moore, T 11/03/1970 Soft_Dev B
960120 Moore, T 11/03/1970 Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
960150 Black, D 21/08/1973 Soft_Dev D
960150 Black, D 21/08/1973 ISDE C
960150 Black, D 21/08/1973 Workshop B
What is a repeating group?
A repeating group is a column that can accommodate
multiple values. The columns in a base table in SQL are
explicitly named and typed and therefore can accommodate
only a single value of that type. Base table in SQL cannot
have a repeating group. SQL column with an integer data
type, for example, cannot contain a repeating group
(multivalued attributes) containing a set of several integers.
First Normal Form
• First normal form (1NF) deals with the `shape' of the
record.
• A relation is in 1NF if, and only if, it contains no
repeating attributes or groups of attributes.
• Example:
– The Student table with the repeating group is not
in 1NF
– It has repeating groups, it is an `unnormalised
table'.
• To remove the repeating group, either:
– flatten the table and extend the key, or
– decompose the relation- leading to First Normal
Form
Flatten table and Extend Primary Key
• The Student table with the repeating group can be
written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
• If the repeating group was flattened, as in the Student
#2 data table, it would look something like:
Student(matric_no, name, date_of_birth, subject, grade )
• This does not have repeating groups, but has
redundancy. For every matric_no/subject
combination, the student name and date of birth is
replicated. This can lead to errors:
Flattened table problems
• With the relation in its flattened form, strange
anomalies appear in the system. Redundant data
is the main cause of insertion, deletion, and
updating anomalies.
– Insertion anomaly – at subject is now part of the
primary key, we cannot add a student until they
have at least one subject. Remember, no part
of a primary key can be NULL.
– Update anomaly – changing the name of a
student means finding all rows of the database
where that student exists and changing each
one separately.
– Deletion anomaly- for example deleting all
database subject information also deletes
student 960145.
Decomposing the relation
• The alternative approach is to split the table into two parts,
one for the repeating groups and one of the non-repeating
groups.
• the primary key for the original relation is included in both of
the new relations
Record Student
matric_no subject grade matric_n name date_of_birth
o
960100 Databases C 960100 Smith,J 14/11/1977
960100 Soft_Dev A 960105 White,A 10/05/1975
960100 ISDE D
960120 Moore,T 11/03/1970
960105 Soft_Dev B
960105 ISDE B 960145 Smith,J 09/01/1972
... ... ...
960150 Black,D 21/08/1973
960150 Workshop B
Relations
• We now have two relations, Student and Record.
– Student contains the original non-repeating groups
– Record has the original repeating groups and the
matric_no
Student(matric_no, name, date_of_birth )
Record(matric_no, subject, grade )
• This version of the relations does not have insertion,
deletion, or update anomalies.
• Without repeating groups, we say the relations are in First
Normal Form (1NF).
Conversion to Second Normal Form
• The rule of conversion from INF format to 2NF format is:
Eliminate all partial dependencies from the 1NF format.
• The conversion from 1NF to 2NF format is done in two
steps:
Step 1: Identify All the Key Components
Step 2: Identify the Dependent
• Because a partial dependency can exist only if a table’s
primary key is composed of several attributes, a table
whose primary key consists of only a single attribute is
automatically in 2NF if it is in 1NF.
Functional Dependencies
• Functional dependencies allow us to express
constraints that cannot be expressed using
superkeys.
• We use functional dependencies to:
– test relations to see if they are legal under a given
set of functional dependencies. If a relation r is
legal under a set F of functional dependencies, we
say that r satisfies F.
– Specify constraints on the set of legal relations; we
say that F holds on R if all legal relations on R
satisfy the set of functional dependencies F.
Use of Functional Dependencies
• We use functional dependencies to:
– test relations to see if they are legal under a given
set of functional dependencies. If a relation r is
legal under a set F of functional dependencies, we
say that r satisfies F.
– Specify constraints on the set of legal relations; we
say that F holds on R if all legal relations on R
satisfy the set of functional dependencies F.
Two types of dependencies
• Partial dependencies: Dependencies based on only a
part of a composite primary key are called partial
dependencies.
• Transitive dependencies: A transitive dependency is a
dependency of one nonprime attribute on another
nonprime attribute. The problem with transitive
dependencies is that they still yield data anomalies.
Second Normal Form
• A relation is in 2NF if, and only if, it is in 1NF and every non-
key attribute is fully functionally dependent on the whole key.
(It includes no partial dependencies); i.e, no attribute is
dependent on only a portion of the primary key.
• Thus the relation is in 1NF with no repeating groups, and all
non-key attributes must depend on the whole key, not just
some part of it. Another way of saying this is that there must
be no partial key dependencies (PKDs).
• The problems arise when there is a compound key, e.g. the
key to the Record relation - matric_no, subject. In this case
it is possible for non-key attributes to depend on only part of
the key - i.e. on only one of the two key attributes. This is
what 2NF tries to prevent.
Example
• Consider again the Student relation from the flattened
Student #2 table:
Student(matric_no, name, date_of_birth, subject, grade )
• There are no repeating groups, so the relation is in 1NF
• However, we have a compound primary key - so we must
check all of the non-key attributes against each part of the
key to ensure they are functionally dependent on it.
– matric_no determines name and date_of_birth, but not
grade.
– subject together with matric_no determines grade,
but not name or date_of_birth.
• So there is a problem with potential redundancies
Dependency Diagram
• A dependency diagram is used to show how non-
key attributes relate to each part or combination of
parts in the primary key.
Student
matric_no name date_of_bith subject grade
PKD
Fully Dependent
• This relation is not in 2NF
– It appears to be two tables squashed into one.
– the solutions is to split the relation into component
parts.
• separate out all the attributes that are solely dependent
on matric_no - put them in a new Student_details relation,
with matric_no as the primary key
• separate out all the attributes that are solely dependent
on subject - in this case no attributes are solely
dependent on subject.
• separate out all the attributes that are solely dependent
on matric_no + subject - put them into a separate Student
relation, keyed on matric_no + subject
All attributes in each relation
Student Details are fully functionally
matrix_no name date_of_birth
dependent upon its primary
key
Student
matrix_no subject grade
These relations are now in
2NF
What is interesting is that this set of
relations are the same as the ones where we
realised that there was a repeating group.
Conversion to Third Normal Form
• The rule of conversion from 2NF format to 3NF format is:
Eliminate all transitive dependencies from the 2NF format.
• The conversion from 2NF to 3NF format is done in three
steps:
Step 1: Identify Each New Determinant
• For every transitive dependency, write its determinant as a
PK for a new table. (A determinant is any attribute whose
value determines other values within a row.)
• If you have three different transitive dependencies, you will
have three different determinants.
Conversion to Third Normal Form
Step 2: Identify the Dependent Attributes
• Identify the attributes that are dependent on
each determinant identified in Step 1 and
identify the dependency.
• Name the table to reflect its contents and
function.
Conversion to Third Normal Form
Step 3: Remove the Dependent Attributes from Transitive
Dependencies
o Eliminate all the dependent attributes in the transitive
relationship(s) from each of the table that has such a
transitive relationship.
o Draw a new dependency diagram to show all the tables
defined in Steps 1-3.
o Check the new tables as well as the tables modified in Step 3
to make sure that each table has a determinant and that no
table contains inappropriate dependencies (partial or
transitive).
Third Normal Form
• 3NF is an even stricter normal form and
removes virtually all the redundant data :
• A relation is in 3NF if, and only if, it is in 2NF
and there are no transitive functional
dependencies . (It contains no transitive
dependencies)
Third Normal Form
• Transitive functional dependencies arise:
– when one non-key attribute is functionally
dependent on another non-key attribute:
• FD: non-key attribute -> non-key attribute
– and when there is redundancy in the database
• By definition transitive functional dependency can
only occur if there is more than one non-key field, so
we can say that a relation in 2NF with zero or one
non-key field must automatically be in 3NF.
Example
project_no manager address
p1 Black,B 32 High Street
Project has more than one
p2 Smith,J 11 New Street non-key fields so we must
check for transitive
dependency:
p3 Black,B 32 High Street
p4 Black,B 32 High Street
Extract
• Address depends on the value of manager.
• From the table we can propose:
Project(project_no, manager, address)
manager -> address
• In this case address is transitively dependent on
manager. The primary key is project_no, but the LHS and
RHS have no reference to this key, yet both sides are
present in the relation.
Fix
• Data redundancy arises from this
– we duplicate address if a manager is in
charge of more than one project
– causes problems if we had to change the
address- have to change several entries, and
this could lead to errors.
Fix
• Eliminate transitive functional dependency by
splitting the table
– create two relations - one with the transitive
dependency in it, and another for all of the
remaining attributes.
– split Project into Project and Manager.
• the determinant attribute becomes the primary key in
the new relation - manager becomes the primary key
to the Manager relation
• the original key is the primary key to the remaining
non-transitive attributes - in this case, project_no
remains the key to the new Projects table.
Result
• Now we need to store
the address only once Project project_no manager
p1 Black,B
• If we need to know a
manager's address we p2 Smith,J
can look it up in the
Manager relation p3 Black,B
p4 Black,B
• The manager attribute
is the link between the Manager manager address
two tables, and in the
Projects table it is now Black,B 32 High Street
a foreign key.
• These relations are Smith,J 11 New Street
now in third normal
form.
Summary: 1NF
• A relation is in 1NF if it contains no repeating groups
• To convert an unnormalised relation to 1NF either:
– Flatten the table and extend the primary key, or
– Decompose the relation into smaller relations, one for the
repeating groups and one for the non-repeating groups.
• Remember to put the primary key from the original
relation into both new relations.
• This option is liable to give the best results.
R(a,b,(c,d)) becomes
R(a,b)
R1(a,c,d)
Summary: 2NF
• A relation is in 2NF if it contains no repeating groups
and no partial key functional dependencies
– Rule: A relation in 1NF with a single key field must
be in 2NF
– To convert a relation with partial functional
dependencies to 2NF. create a set of new
relations:
• One relation for the attributes that are fully
dependent upon the key.
• One relation for each part of the key that has
partially dependent attributes
R(a,b,c,d) and a->c becomes
R(a,b,d) and R1(a,c)
Summary: 3NF
• A relation is in 3NF if it contains no repeating groups,
no partial functional dependencies, and no transitive
functional dependencies
• To convert a relation with transitive functional
dependencies to 3NF, remove the attributes involved
in the transitive dependency and put them in a new
relation
Summary: 3NF
• Rule: A relation in 2NF with only one non-key
attribute must be in 3NF
• In a normalised relation a non-key field must provide
a fact about the key, the whole key and nothing but
the key.
• Relations in 3NF are sufficient for most practical
database design problems. However, 3NF does not
guarantee that all anomalies have been removed.
Review Questions
Case of a Construction Company (For Question 1)
– Building project -- Project number, Name, Employees
assigned to the project.
– Employee -- Employee number, Name, Job
classification
– The company charges its clients by billing the hours
spent on each project. The hourly billing rate is
dependent on the employee’s position.
Summary of Operations:
• The company manages many projects.
• Each project requires the services of many employees.
• An employee may be assigned to several different projects.
• Some employees are not assigned to a project and perform
duties not specifically related to a project.
• Some employees are part of a labor pool, to be shared by
all project teams.
• Each employee has a (single) primary job classification.
• This job classification determines the hourly billing rate.
• Many employees can have the same job classification.
Question 1: Normalize the following table to
3NF
Question 2: Normalize the following table to
3NF
Question 3: Normalize the following relation
into 2NF
Question 4:Normalize the following table into
3NF
Question 5: Given the dependency diagram
below; define the following terms:
• Partial dependency
• Transitive dependency