Normalization of Database
Database Normalisation is a technique of organizing the data in the database. Normalization is
a systematic approach of decomposing tables to eliminate data redundancy and undesirable
characteristics like Insertion, Update and Deletion Anamolies. It is a multi-step process that
puts data into tabular form by removing duplicated data from the relation tables.
Normalization is used for mainly two purpose,
Eliminating reduntant(useless) data.
Ensuring data dependencies make sense i.e data is logically stored.
Normalization rule are divided into following normal form.
1.
First Normal Form
2.
Second Normal Form
3.
Third Normal Form
4.
Fourth normal form
5.
Fifth normal form
6.
BCNF
First Normal Form (1NF)
As per First Normal Form, no two Rows of data must contain repeating group of information i.e
each set of column must have a unique value, such that multiple columns cannot be used to
fetch the same row. Each table should be organized into rows, and each row should have a
primary key that distinguishes it as unique.
The Primary key is usually a single column, but sometimes more than one column can be
combined to create a single primary key. For example consider a table which is not in First
normal form
Student Table :
Student
Age
Subject
Adam
15
Biology, Maths
Alex
14
Maths
Stuart
17
Maths
In First Normal Form, any row must not have a column in which more than one value is saved,
like separated with commas. Rather than that, we must separate such data into multiple rows.
Student Table following 1NF will be :
Student
Age
Subject
Adam
15
Biology
Adam
15
Maths
Alex
14
Maths
Stuart
17
Maths
Using the First Normal Form, data redundancy increases, as there will be many columns with
same data in multiple rows but each row as a whole will be unique.
As per the Second Normal Form there must not be any partial dependency of any column on
primary key. It means that for a table that has concatenated primary key, each column in the
table that is not part of the primary key must depend upon the entire concatenated key for its
existence. If any column depends only on one part of the concatenated key, then the table
fails Second normal form.
In example of First Normal Form there are two rows for Adam, to include multiple subjects
that he has opted for. While this is searchable, and follows First normal form, it is an
inefficient use of space. Also in the above Table in First Normal Form, while the candidate key
is {Student, Subject}, Age of Student only depends on Student column, which is incorrect as
per Second Normal Form. To achieve second normal form, it would be helpful to split out the
subjects into an independent table, and match them up using the student names as foreign
keys.
New Student Table following 2NF will be :
Student
Age
Adam
15
Alex
14
Stuart
17
Student
Subject
Adam
Biology
Adam
Maths
Alex
Maths
Stuart
Maths
In Student Table the candidate key will be Student column, because all other column
i.e Age is dependent on it.
New Subject Table introduced for 2NF will be :
In Subject Table the candidate key will be {Student, Subject} column. Now, both the above
tables qualifies for Second Normal Form and will never suffer from Update Anomalies.
Although there are a few complex cases in which table in Second Normal Form suffers Update
Anomalies, and to handle those scenarios Third Normal Form is there.
Third Normal Form (3NF)
Third Normal form applies that every non-prime attribute of table must be dependent on
primary key, or we can say that, there should not be the case that a non-prime attribute is
determined by another non-prime attribute. So this transitive functional dependency should be
removed from the table and also the table must be in Second Normal form. For example,
consider a table with following fields.
Student_Detail Table :
Student_id
Student_name
DOB
Street
city
State
Zip
In this table Student_id is Primary key, but street, city and state depends upon Zip. The
dependency between zip and other fields is called transitive dependency. Hence to
apply 3NF, we need to move the street, city and state to new table, with Zip as primary key.
New Student_Detail Table :
Student_id
Student_name
DOB
Address Table :
Zip
Street
city
state
Zip
The advantage of removing transtive dependency is,
Amount of data duplication is reduced.
Data integrity achieved.
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a higher version of the Third Normal form. This form deals
with certain type of anamoly that is not handled by 3NF. A 3NF table which does not have
multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, following
conditions must be satisfied:
R must be in 3rd Normal Form
and, for each functional dependency ( X -> Y ), X should be a super Key.
Fourth normal form (4NF) is a level of database normalization where there are no nontrivial multivalued dependencies other than a candidate key.
It builds on the first three normal forms (1NF, 2NF and 3NF) and the Boyce-Codd Normal
Form (BCNF). It states that, in addition to a database meeting the requirements of BCNF, it
must not contain more than one multivalued dependency. Decomposition and 4NF If X >
>Y is a 4NF violation violation for relation relation R, we can decompose R using the same
technique as for BCNF. 1. XY is one of the decomposed relatio2. All but Y X is the other.
Example Drinkers Drinkers(name, addr, phones, beersLiked beersLiked) FD: name > addr
MVDs: name >> phones name >> beersLiked Key is { a e, n m p o es, h n bee s ed} r
Liked}. Which dependencies violate 4NF ?
Example, Continued Decompose using name > addr: 1. Drinkers1 Drinkers1(name, addr)
In 4NF, only dependency is name > addr. 2. Drinkers2 Drinkers2(name, phones,
beersLiked) beersLiked) Not in 4NF. MVDs name >> phones and name >> beersLiked
beersLiked apply. Key ? No FDs so all three attributes form the key 21 No FDs, so
all three attributes form the key.
FIFTH NORMAL FORM
A relation R is in Fifth Normal Form (5NF) if and only if the following conditions are satisfied
simultaneously:
1.
R is already in 4NF.
2.
It cannot be further non-loss decomposed.
5NF is of little practical use to the database designer, but it is of interest from a theoretical
point of view and a discussion of it is included here to complete the picture of the further
normal forms.
In all of the further normal forms discussed so far, no loss decomposition was achieved by the
decomposing of a single table into two separate tables. No loss decomposition is possible
because of the availability of the join operator as part of the relational model. In considering
5NF, consideration must be given to tables where this non-loss decomposition can only be
achieved by decomposition into three or more separate tables. Such decomposition is not
always possible as is shown by the following example.
Consider the table
AGENT_COMPANY_PRODUCT (Agent, Company, Product _Name)
This table lists agents, the companies they work for and the products they sell for those
companies. 'The agents do not necessarily sell all the products supplied by the companies
they do business with. An example of this table might be.
The redundancy has been eliminated, but the information about which companies make which
products and which of these products they supply to which agents has been lost. The natural
join of these projections over the 'agent' columns is:
The table resulting from this join is spurious, since the asterisked row of the table contains
incorrect information. Now suppose that the original table were to be decomposed into three
tables, the two projections, P I and P2 which have already shown, and the final, possible
projection, P3.
If a join is taken of all three projections, first of PI and P2 with the (spurious) result shown
above, and then of this result with P3 over the 'Company' and 'Product name' column, the
following table is obtained:
But now consider the different case where, if an agent is an agent for a company and that
company makes a product, then he always sells that product for the company. Under these
circumstances, the 'agent company product' table as shown below:
The assumption being that ABC makes both Nuts and Bolts and that CDE makes Bolts only.
This table can be decomposed into its three projections without loss of information as
demonstrated below:
All redundancy has been removed, if the natural join of PI and P2 IS taken, the result is:
The spurious row as asterisked. Now, if this result is joined with P3 over the column 'company
'product_name' the following table is obtained: