Normalization
A large database defined as a single relation may result in data duplication.
This repetition of data may result in:
• Making relations very large.
• It isn't easy to maintain and update data as it would involve searching
many records in relation.
• Wastage and poor utilization of disk space and resources.
• The likelihood of errors and inconsistencies increases.
So, to handle these problems, we should analyze and decompose the relations
with redundant data into smaller, simpler, and well-structured relations that
are satisfy desirable properties. Normalization is a process of decomposing the
relations into relations with fewer attributes.
What is Normalization?
• Normalization is the process of organizing the data in the database.
• Normalization is used to minimize the redundancy from a relation or set
of relations.
• It is also used to eliminate undesirable characteristics like Insertion,
Update, and Deletion Anomalies.
• Normalization divides the larger table into smaller and links them using
relationships.
Why do we Need Normalization?
• The main reason for normalizing the relations is removing certain
anomalies. These are: Data Redundancy, Insertion Anomaly, Deletion
Anomaly, Updation Anomaly.
• Failure to eliminate anomalies leads to data redundancy and can cause
data integrity and other problems as the database grows.
• Normalization consists of a series of guidelines that helps to guide you
in creating a good database structure.
1|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Types of Anomalies in Database
Following are the types of anomalies that make the table inconsistency, loss of
integrity, and redundant data.
1. Data Redundancy: Data redundancy occurs in a relational database when
two or more rows or columns have the same value or repetitive value leading
to unnecessary utilization of the memory. For example: there are two students in
the above table, 'James' and 'Ritchie Rich', whose records are repetitive when
we enter a new CourseID. Hence it repeats the studRegistration, StudName and
address attributes.
Student Table
StudRegistration CourseID StudName Address Course
205 6204 James Los Angeles Economics
205 6247 James Los Angeles Economics
230 6204 Ritchie Rich Egypt Computer
230 6208 Ritchie Rich Egypt Accounts
2. Insertion Anomaly: Insertion Anomaly refers to when one cannot insert a
new tuple into a relationship due to lack of data. For example, In the Student
table, if we want to insert a new courseID, we need to wait until the student
enrolled in a course. In this way, it is difficult to insert new record in the table.
Hence, it is called insertion anomalies.
3. Updatation Anomalies: The update anomaly is when an update of a single
data value requires multiple rows of data to be updated. For example,
suppose there is a student 'James' who belongs to Student table. If we want to
update the course in the Student, we need to update the same in the Course
table; otherwise, the data can be inconsistent. And it reflects the changes in a
table with updated values where some of them will not.
4. Deletion Anomalies: The delete anomaly refers to the situation where the
deletion of data results in the unintended loss of some other important data.
For example, if we want to remove Trent Bolt from the Student table, it also
removes his address, course and other details from the Student table.
Therefore, we can say that deleting some attributes can remove other attributes
of the database table.
2|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Types of Normal Forms
Normalization works through a series of stages called Normal forms. The
normal forms apply to individual relations. The relation is said to be in
particular normal form if it satisfies constraints. Following are the various types
of Normal forms. These are:
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce and Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)
First Normal Form (1NF)
• A relation will be 1NF if it contains an atomic value.
• It states that an attribute of a table cannot hold multiple values. It must
hold only single-valued attribute.
• First normal form disallows the multi-valued attribute, composite
attribute, and their combinations.
Example of 1NF
Relation EMPLOYEE is not in 1NF because of multi-valued attribute
EMP_PHONE.
EMPLOYEE Table
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385, UP
9064738238
20 Harry 8574783832 Bihar
12 Sam 7390372389, Punjab
8589830302
The decomposition of the EMPLOYEE table into 1NF has been shown below.
3|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385 UP
14 John 9064738238 UP
20 Harry 8574783832 Bihar
12 Sam 7390372389 Punjab
12 Sam 8589830302 Punjab
Second Normal Form (2NF)
• To be in second normal form, a relation must be in 1NF and relation
must not contain any partial dependency.
• A relation is in 2NF if it has No Partial Dependency, i.e., no non-prime
attribute (attributes which are not part of any candidate key) is
dependent on any proper subset of any candidate key of the table.
Non-prime Attribute: Attributes which are not part of any candidate key.
Partial Dependency: If the proper subset of candidate key determines non-
prime attribute, it is called partial dependency.
Example of 2NF
Consider the following table.
STUD_NO COURSE_NO COURSE_FEE
1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000
• Note that, there are many courses having the same course fee.
• Here, COURSE_FEE cannot alone decide the value of COURSE_NO
or STUD_NO. That means, COURSE_FEE alone cannot be a Key.
4|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
• COURSE_FEE together with STUD_NO cannot decide the value of
COURSE_NO. That means, {COURSE_FEE, STUD_NO} cannot be
a candidate key.
• COURSE_FEE together with COURSE_NO cannot decide the value
of STUD_NO. That means, {COURSE_FEE, COURSE_NO} cannot
be a candidate key.
• Hence, COURSE_FEE would be a non-prime attribute, as it does not
belong to the one only candidate key {STUD_NO, COURSE_NO}.
• But, COURSE_NO -> COURSE_FEE, i.e., COURSE_FEE is
dependent on COURSE_NO, where COURSE_NO is a proper subset
of the candidate key.
• That means, here, non-prime attribute COURSE_FEE is dependent on
a proper subset of the candidate key, which is a partial dependency
and so this relation is not in 2NF.
• To convert the above relation to 2NF, we need to split the table into
two tables such as: Table 1: STUD_NO, COURSE_NO Table 2:
COURSE_NO, COURSE_FEE.
Table 1 Table 2
STUD_NO COURSE_NO COURSE_NO COURSE_FEE
1 C1 C1 1000
2 C2 C2 1500
1 C4 C3 1000
4 C3 C4 2000
4 C1 C5 2000
2 C5
NOTE: 2NF tries to reduce the redundant data getting stored in memory.
For instance, if there are 100 students taking C1 course, we don’t need to
store its Fee as 1000 for all the 100 records, instead, once we can store it in
the second table as the course fee for C1 is 1000.
5|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Third Normal Form (3NF)
• A relation will be in 3NF if it is in 2NF and not contain any transitive
partial dependency.
• If there is no transitive dependency for non-prime attributes, then the
relation must be in 3NF.
• A relation is in 3NF if it holds at least one of the following conditions
for every non-trivial function dependency X → Y.
i. X is a super key.
ii. Y is a prime attribute, i.e., each element of Y is part of some
candidate key.
Transitive Functional Dependency: If A->B and B->C are two functional
dependencies (FDs), then A->C is called transitive dependency.
The normalization of 2NF relations to 3NF involves the removal of
transitive dependencies. If a transitive dependency exists, we remove the
transitively dependent attribute(s) from the relation by placing the attribute(s)
in a new relation along with a copy of the determinant.
Non-trivial Functional Dependency: A → B has a non-trivial functional
dependency if B is not a subset of A. When A intersection B is NULL, then A
→ B is called as complete non-trivial. Example:
ID → Name,
Name → DOB
Example of 3NF
Consider the following Employee_Details Table.
EMPLOYEE_DETAIL Table
EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
555 Katharine 06389 UK Norwich
666 John 462007 MP Bhopal
6|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
In the above table:
• Super Key: {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID,
EMP_NAME, EMP_ZIP}, …… and so on
• Candidate Key: {EMP_ID}
• Non-prime Attributes: In the given table, all attributes except EMP_ID
are non-prime.
• Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and
EMP_ZIP dependent on EMP_ID. The non-prime attributes
(EMP_STATE, EMP_CITY) transitively dependent on super key
(EMP_ID). It violates the rule of third normal form.
• To solve this problem, we need to move the EMP_CITY and
EMP_STATE to the new <EMPLOYEE_ZIP> table, with EMP_ZIP
as a Primary key.
EMPLOYEE Table
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
555 Katharine 06389
666 John 462007
EMPLOYEE_ZIP Table
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
06389 UK Norwich
462007 MP Bhopal
7|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Boyce Codd normal form (BCNF)
• BCNF is the advance version of 3NF. It is stricter than 3NF.
• A table is in BCNF if every functional dependency X → Y, X is the super
key of the table.
• For BCNF, the table should be in 3NF, and for every FD, LHS is super
key.
Example of BCNF
Let's assume there is a company where employees work in more than one
department.
EMPLOYEE Table
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
364 UK Developing D283 549
In the above table Functional Dependencies are as follows:
• EMP_ID → EMP_COUNTRY
• EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate Key: {EMP-ID, EMP-DEPT}
• The table is not in BCNF because neither EMP_DEPT nor EMP_ID
alone are keys. i.e., left side part of both the functional dependencies
are not super key.
• To convert the given table into BCNF, we decompose it into three
tables.
EMP_COUNTRY Table
EMP_ID EMP_COUNTRY
264 India
364 UK
8|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
EMP_DEPT Table
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING Table
EMP_ID EMP_DEPT
264 283
264 300
364 232
364 549
Functional Dependencies of the above tables are:
• EMP_ID → EMP_COUNTRY
• EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Super Key / Candidate keys of the above tables are:
• For the first table: EMP_ID
• For the second table: EMP_DEPT
• For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional
dependencies is a super key.
9|Page
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Fourth Normal Form (4NF)
A relation R is in 4NF if and only if the following conditions are satisfied:
• It should be in the Boyce-Codd Normal Form (BCNF).
• The table should not have any Multi-valued Dependency.
A table with a multivalued dependency violates the normalization standard of
the Fourth Normal Form (4NF) because it creates unnecessary redundancies
and can contribute to inconsistent data. To bring this up to 4NF, it is necessary
to break this information into two tables.
Multi-valued Dependency
• For a dependency A → B, if for a single value of A, multiple values of
B exist, then the relation will be a multi-valued dependency.
• Put another way, two attributes (or columns) in a table are independent
of one another, but both depend on a third attribute.
• A multivalued dependency always requires at least three attributes
because it consists of at least two attributes that are dependent on a
third.
Example of 4NF
The given below STUDENT table is in 3NF, but the COURSE and HOBBY
are two independent entities. Hence, there is no relationship between COURSE
and HOBBY.
STUDENT
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
10 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
In the STUDENT relation, a student with STU_ID, 21 contains two
courses, Computer and Math and two hobbies, Dancing and Singing. So, there
is a multi-valued dependency on STU_ID, which leads to unnecessary
repetition of data.
So, to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
11 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Fifth Normal Form (5NF)
• A relation is in 5NF if it is in 4NF and not contains any join dependency
and joining should be lossless.
• 5NF is also known as Project-join normal form (PJ/NF).
Example of 5NF
SUBJECT LECTURER SEMESTER
Computer Anshika Semester 1
Computer John Semester 1
Math John Semester 1
Math Akash Semester 2
Chemistry Praveen Semester 1
• In the above table, John takes both Computer and Math class for
Semester 1 but he doesn't take Math class for Semester 2. In this case,
combination of all these fields required to identify a valid data. That means,
primary key for the table is combination of all the fields (columns).
• Suppose we add a new Semester as Semester 3 but do not know about
the subject and who will be taking that subject so we leave Lecturer and
Subject as NULL. But all three columns together act as a primary key,
so we can't leave other two columns blank.
• So, to make the above table into 5NF, we can decompose it into three
relations P1, P2 & P3.
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
12 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMSTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
Relational Decomposition
• When a relation in the relational model is not in appropriate normal
form then the decomposition of a relation is required.
• In a database, it breaks the table into multiple tables.
• If the relation has no proper decomposition, then it may lead to
problems like loss of information.
• Decomposition is used to eliminate some of the problems of bad design
like anomalies, inconsistencies, and redundancy.
13 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Types of Decomposition
Lossless Decomposition
• If the information is not lost from the relation that is decomposed, then
the decomposition will be lossless.
• The lossless decomposition guarantees that the join of relations will
result in the same relation as it was decomposed.
• The relation is said to be lossless decomposition if natural joins of all the
decomposition give the original relation.
Example
EMPLOYEE_DEPARTMENT
EMP_ EMP_ EMP_ EMP_ DEPT_ DEPT_
ID NAME AGE CITY ID NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
The above EMPLOYEE_DEPARTMENT relation is decomposed into two
relations EMPLOYEE and DEPARTMENT.
14 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
EMPLOYEE
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
DEPARTMENT
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing
Now, when these two relations are joined on the common column "EMP_ID",
then the resultant relation will be lossless and will look like:
Employee ⋈ Department
EMP_ EMP_ EMP_ EMP_ DEPT_ DEPT_
ID NAME AGE CITY ID NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
15 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)
Dependency Preserving
• It is an important constraint of the database.
• In the dependency preservation, at least one decomposed table must
satisfy every dependency.
• If a relation R is decomposed into relation R1 and R2, then the
dependencies of R either must be a part of R1 or R2 or must be
derivable from the combination of functional dependencies of R1 and
R2.
• For example, suppose there is a relation R (A, B, C, D) with functional
dependency set (A->BC). The relational R is decomposed into
R1(ABC) and R2(AD) which is dependency preserving because FD A-
>BC is a part of relation R1(ABC).
16 | P a g e
Ishtiaq Ahammad (Lecturer, Dept. of CSE)