Normalization
A large database defined as a single relation may result in data duplication. This repetition of
data may result in:
   o   Making relations very large.
   o   It isn't easy to maintain and update data as it would involve searching many records in
       relation.
   o   Wastage and poor utilization of disk space and resources.
   o   The likelihood of errors and inconsistencies increases.
So to handle these problems, we should analyze and decompose the relations with redundant
data into smaller, simpler, and well-structured relations that are satisfy desirable properties.
Normalization is a process of decomposing the relations into relations with fewer attributes.
What is Normalization?
   o   Normalization is the process of organizing the data in the database.
   o   Normalization is used to minimize the redundancy from a relation or set of relations.
       It is also used to eliminate undesirable characteristics like Insertion, Update, and
       Deletion Anomalies.
   o   Normalization divides the larger table into smaller and links them using relationships.
   o   The normal form is used to reduce redundancy from the database table.
Why do we need Normalization?
The main reason for normalizing the relations is removing these anomalies. Failure to
eliminate anomalies leads to data redundancy and can cause data integrity and other problems
as the database grows. Normalization consists of a series of guidelines that helps to guide you
in creating a good database structure.uestions On Second Normal Form(2NF)
Data modification anomalies can be categorized into three types:
   o   Insertion Anomaly: Insertion Anomaly refers to when one cannot insert a new tuple
       into a relationship due to lack of data.
   o   Deletion Anomaly: The delete anomaly refers to the situation where the deletion of
       data results in the unintended loss of some other important data.
   o   Updatation Anomaly: The update anomaly is when an update of a single data value
       requires multiple rows of data to be updated.
Types of Normal Forms:
Normalization works through a series of stages called Normal forms. The normal forms apply
to individual relations. The relation is said to be in particular normal form if it satisfies
constraints.
 Normal Form                                                     Description
 1NF      A relation is in 1NF if it contains an atomic value.
 2NF      A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functional
          dependent on the primary key.
 3NF      A relation will be in 3NF if it is in 2NF and no transition dependency exists.
 BCNF     A stronger definition of 3NF is known as Boyce Codd's normal form.
 4NF      A relation will be in 4NF if it is in Boyce Codd's normal form and has no multi-valued
          dependency.
 5NF      A relation is in 5NF. If it is in 4NF and does not contain any join dependency, joining
          should be lossless.
Following are the various types of Normal forms:
Advantages of Normalization
   o   Normalization helps to minimize data redundancy.
        o   Greater overall database organization.
        o   Data consistency within the database.
        o   Much more flexible database design.
        o   Enforces the concept of relational integrity.
     Disadvantages of Normalization
        o   You cannot start building the database before knowing what the user needs.
        o   The performance degrades when normalizing the relations to higher normal forms,
            i.e., 4NF, 5NF.
        o   It is very time-consuming and difficult to normalize relations of a higher degree.
        o   Careless decomposition may lead to a bad database design, leading to serious
            problems.
     First Normal Form (1NF)
        o   A relation will be 1NF if it contains an atomic value.
        o   It states that an attribute of a table cannot hold multiple values. It must hold only
            single-valued attribute.
        o   First normal form disallows the multi-valued attribute, composite attribute, and their
            combinations.
     Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute
     EMP_PHONE.
     EMPLOYEE table:
EMP_ID         EMP_NAME                      EMP_PHONE                      EMP_STATE
14             John                          7272826385,                   UP
                                             9064738238
20             Harry                         8574783832                    Bihar
12             Sam                           7390372389,                   Punjab
                                             8589830302
     The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID         EMP_NAME                      EMP_PHONE                      EMP_STATE
14               John                           7272826385                    UP
14               John                           9064738238                    UP
20               Harry                          8574783832                    Bihar
12               Sam                            7390372389                    Punjab
12               Sam                            8589830302                    Punjab
     Second Normal Form (2NF)
           o   In the 2NF, relational must be in 1NF.
           o   In the second normal form, all non-key attributes are fully functional dependent on the
               primary key
     Example: Let's assume, a school can store the data of teachers and the subjects they teach. In
     a school, a teacher can teach more than one subject.
     TEACHER table
      TEACHER_ID                       SUBJECT                           TEACHER_AGE
      25                              Chemistry                          30
      25                              Biology                            30
      47                              English                            35
      83                              Math                               38
      83                              Computer                           38
     In the given table, non-prime attribute TEACHER_AGE is dependent on TEACHER_ID
     which is a proper subset of a candidate key. That's why it violates the rule for 2NF.
     To convert the given table into 2NF, we decompose it into two tables:
     TEACHER_DETAIL table:
      TEACHER_ID                                          TEACHER_AGE
      25                                                 30
 47                                               35
 83                                               38
TEACHER_SUBJECT table:
 TEACHER_ID                                     SUBJECT
 25                                            Chemistry
 25                                            Biology
 47                                            English
 83                                            Math
 83                                            Computer
Transitive Dependency
Let's consider a relation R(A B C). Here A, B, and C are known as the attribute of the relation
R. When the following condition arises; then the transitive dependency occurs in the DBMS.
The conditions are A → B, B → C. Therefore, the condition becomes A → C. In other words,
we can say that when dependencies are made by two functional dependencies, then the
functional dependencies become transitive dependencies.
What is Transitive Dependency?
Consider a relation R(A B C). Here A, B, and C are known as the attribute of the relation R.
When the following condition arises; then the transitive dependency occurs in the DBMS.
The conditions are A → B, B → C. Therefore, the condition becomes A → C. in transitive
functional dependencies, a dependent is directly dependent on the determinant.
Let's understand the transitive dependencies with the help of the following example.
 Author_ID          Author              Book                               Author_Nationality
 A1                Arundhati Roy          The God of Small Things         India
 A1                Kiran Desai            The Inheritance Of Loss         India
 A2                R. K. Narayan          The Man Eater of Malgudy        India
In the above author table, we get that.
   o   Book → Author: Here, the author attribute is determined by the book attribute. If
       someone knows the book's name, they can also learn the author's name.
   o   Author → Author_Nationality: If someone knows the author's name, they can also
       learn the author's nationality.
   o   Book → Author_Nationality : If someone knows the book's name, they can also
       learn the author's nationality.
If we look closely at the functional dependencies discussed above, we find the following
pattern.
   o   A → B and B → C B→C; therefore, A → CA → C.
   o   A → Book, B → Author B → Author and C → Author_Nationality
       C→Author_Nationality
Suppose we apply Third Normal Form (3NF) in the above author table. In that case, the
transitive dependency of the author table must be removed, and the process of removing the
transitive dependencies of the database is known as the Normalization process.
How to Avoid Transitive Dependencies?
Let's consider the above author table again and see what type of issue arises during the table
creation.
Authors table:
 Author_ID          Author               Book                                Author_Nationality
 A1                Arundhati Roy        The God of Small Things              India
 A1                Kiran Desai          The Inheritance Of Loss              India
 A2                R. K. Narayan        The Man Eater of Malgudy             India
The data anomalies (like updation, insertion, and deletion anomalies) and inconsistency can
be contributed by the author table. When there is too much redundancy in the database, then
data anomalies are caused in the data. Data anomalies are also caused when there is a
problem with updating, deleting, and inserting new data. For example, in the author table-
   o   We cannot add a new author until we cannot add a book to the table.
   o   We cannot delete the author until we completely delete the book from the database.
   o   If we want to delete the book "The God of Small Things," the author's id, author, and
       nationality also got deleted.
The above issues occur in any relationship that contains transitive dependencies.
Third Normal Form by Removing Transitive Dependency
Let's consider the Author table with three attributes (Author_ID, Author, Author_Nationality)
and try to find and eliminate the Transitive dependency from this table,
Authors Table:
 Author_ID          Author              Book                                Author_Nationality
 A1                Arundhati Roy        The God of Small Things            India
 A1                Kiran Desai          The Inheritance Of Loss            India
 A2                R. K. Narayan        The Man Eater of Malgudy           India
The above Author's table is not in 3NF because it has a Transitive dependency. Let's see how
   o   Author → Author_Nationality Author → Author_Nationality
   o   Author_ID → Author
Therefore the following functional dependency also exists,
   o   Authir_ID → Author_Nationality is forming a pattern similar to what we discussed
       above.
Now to eliminate the Transitive dependency, all we need to do is to split the Author's table in
such a manner that Author_ID will no longer functionally depend on Author_Nationality.
Let's create two tables, one containing only { Author_ID, Author} and another containing
{Author_Nationality}. The new tables will look like this,
Author Table
 Author_ID                          Author
 A1                                The God of Small Things
 A2                                The Inheritance Of Loss
 A3                                The Man Eater of Malgudy
Author Nationality Table
 Author                                     Author_Nationality
 Arundhati Roy                             India
 Kiran Desai                               India
 R. K. Narayan                             India
Now the new Author table and Author Nationality table contains no Transitive dependency
and the relation is now in 3NF.
Third Normal Form (3NF)
   o   A relation will be in 3NF if it is in 2NF and not contain any transitive partial
       dependency.
   o   3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
   o   If there is no transitive dependency for non-prime attributes, then the relation must be
       in third normal form.
A relation is in third normal form if it holds atleast one of the following conditions for every
non-trivial function dependency X → Y.
   1. X is a super key.
   2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
       Example:
       EMPLOYEE_DETAIL table:
         EMP_ID           EMP_NAME              EMP_ZIP           EMP_STATE               EMP_CITY
        222               Harry                201010            UP                      Noida
        333               Stephan              02228             US                      Boston
        444               Lan                  60007             US                      Chicago
        555               Katharine            06389             UK                      Norwich
        666               John                 462007            MP                      Bhopal
       Super key in the table above:
   1. {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}....so on
       Candidate key: {EMP_ID}
       Non-prime attributes: In the given table, all attributes except EMP_ID are non-
       prime.
       Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP dependent
       on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) transitively
       dependent on super key(EMP_ID). It violates the rule of third normal form.
       That's why we need to move the EMP_CITY and EMP_STATE to the new
       <EMPLOYEE_ZIP> table, with EMP_ZIP as a Primary key.
       EMPLOYEE table:
         EMP_ID                      EMP_NAME                              EMP_ZIP
        222                         Harry                                 201010
        333                         Stephan                               02228
        444                         Lan                                   60007
        555                         Katharine                             06389
        666                         John                                  462007
       EMPLOYEE_ZIP table:
         EMP_ZIP                     EMP_STATE                           EMP_CITY
        201010                       UP                                  Noida
        02228                        US                                  Boston
        60007                        US                                  Chicago
        06389                        UK                                  Norwich
        462007                       MP                                  Bhopal
Functional Dependency
The functional dependency is a relationship that exists between two attributes. It typically
exists between the primary key and non-key attribute within a table.
X → Y
The left side of FD is known as a determinant, the right side of the production is known as a
dependent.
For example:
Assume we have an employee table with attributes: Emp_Id, Emp_Name, Emp_Address.
   Here Emp_Id attribute can uniquely identify the Emp_Name attribute of employee table
   because if we know the Emp_Id, we can tell that employee name associated with it.
   Functional dependency can be written as:
   Emp_Id → Emp_Name
   We can say that Emp_Name is functionally dependent on Emp_Id.
   Types of Functional dependency
   1. Trivial functional dependency
      o   A → B has trivial functional dependency if B is a subset of A.
      o   The following dependencies are also trivial like: A → A, B → B
   Example:
1. Consider a table with two columns Employee_Id and Employee_Name.
2. {Employee_id, Employee_Name} →          Employee_Id is a trivial functional dependency as
3. Employee_Id is a subset of {Employee_Id, Employee_Name}.
4. Also, Employee_Id → Employee_Id and Employee_Name →              Employee_Name are trivial
   dependencies too.
   2. Non-trivial functional dependency
      o   A → B has a non-trivial functional dependency if B is not a subset of A.
      o   When A intersection B is NULL, then A → B is called as complete non-trivial.
  Example:
1. ID →     Name,
2. Name →      DOB
  Boyce Codd normal form (BCNF)
     o    BCNF is the advance version of 3NF. It is stricter than 3NF.
     o    A table is in BCNF if every functional dependency X → Y, X is the super key of the
          table.
     o    For BCNF, the table should be in 3NF, and for every FD, LHS is super key.
  Example: Let's assume there is a company where employees work in more than one
  department.
  EMPLOYEE table:
   EMP_ID           EMP_COUNTRY              EMP_DEPT           DEPT_TYPE         EMP_DEPT_NO
   264             India                    Designing          D394              283
   264             India                    Testing            D394              300
   364             UK                       Stores             D283              232
   364             UK                       Developing         D283              549
  In the above table Functional dependencies are as follows:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
  Candidate key: {EMP-ID, EMP-DEPT}een
  The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
  To convert the given table into BCNF, we decompose it into three tables:
  EMP_COUNTRY table:
   EMP_ID                                EMP_COUNTRY
   264                                   India
   264                                   India
  EMP_DEPT table:
   EMP_DEPT                       DEPT_TYPE                        EMP_DEPT_NO
   Designing                      D394                            283
   Testing                        D394                            300
   Stores                         D283                            232
   Developing                     D283                            549
  EMP_DEPT_MAPPING table:
   EMP_ID                                        EMP_DEPT
   D394                                          283
   D394                                          300
   D283                                          232
   D283                                          549
  Functional dependencies:
1. EMP_ID →      EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
  Candidate keys:
  For the first table: EMP_ID
  For the second table: EMP_DEPT
  For the third table: {EMP_ID, EMP_DEPT}
  Now, this is in BCNF because left side part of both the functional dependencies is a key.
  Multivalued Dependency
     o   Multivalued dependency occurs when two attributes in a table are independent of each
         other but, both depend on a third attribute.
     o   A multivalued dependency consists of at least two attributes that are dependent on a
         third attribute that's why it always requires at least three attributes.
  Example: Suppose there is a bike manufacturer company which produces two colors(white
  and black) of each model every year.
   BIKE_MODEL                               MANUF_YEAR                              COLOR
  M2011                                    2008                                     White
  M2001                                    2008                                     Black
  M3001                                    2013                                     White
  M3001                                    2013                                     Black
  M4006                                    2017                                     White
  M4006                                    2017                                     Black
  Here columns COLOR and MANUF_YEAR are dependent on BIKE_MODEL and
  independent of each other.
  In this case, these two columns can be called as multivalued dependent on BIKE_MODEL.
  The representation of these dependencies is shown below:
1. BIKE_MODEL → → MANUF_YEAR
2. BIKE_MODEL → → COLOR
  This can be read as "BIKE_MODEL                       multidetermined     MANUF_YEAR"     and
  "BIKE_MODEL multidetermined COLOR"
  Fourth normal form (4NF)
      o   A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued
          dependency.
      o   For a dependency A → B, if for a single value of A, multiple values of B exists, then
          the relation will be a multi-valued dependency.
Example
STUDENT
 STU_ID                              COURSE                              HOBBY
 21                                 Computer                             Dancing
 21                                 Math                                 Singing
 34                                 Chemistry                            Dancing
 74                                 Biology                              Cricket
 59                                 Physics                              Hockey
The given STUDENT table is in 3NF, but the COURSE and HOBBY are two independent
entity. Hence, there is no relationship between COURSE and HOBBY.
In    the   STUDENT     relation,   a    student    with    STU_ID, 21 contains two
courses, Computer and Math and two hobbies, Dancing and Singing. So there is a Multi-
valued dependency on STU_ID, which leads to unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
 STU_ID                                              COURSE
 21                                                  Computer
 21                                                  Math
 34                                                  Chemistry
 74                                                  Biology
 59                                                    Physics
STUDENT_HOBBY
 STU_ID                                                   HOBBY
 21                                                       Dancing
 21                                                       Singing
 34                                                       Dancing
 74                                                       Cricket
 59                                                       Hockey
Join Dependency
      o   Join decomposition is a further generalization of Multivalued dependencies.
      o   If the join of R1 and R2 over C is equal to relation R, then we can say that a join
          dependency (JD) exists.
      o   Where R1 and R2 are the decompositions R1(A, B, C) and R2(C, D) of a given
          relations R (A, B, C, D).
      o   Alternatively, R1 and R2 are a lossless decomposition of R.
      o   A JD ⋈ {R1, R2,..., Rn} is said to hold over a relation R if R1, R2,....., Rn is a
          lossless-join decomposition.
      o   The *(A, B, C, D), (C, D) will be a JD of R if the join of join's attribute is equal to the
          relation R.
      o   Here, *(R1, R2, R3) is used to indicate that relation R1, R2, R3 and so on are a JD of
          R.
Fifth normal form (5NF)
      o   A relation is in 5NF if it is in 4NF and not contains any join dependency and joining
          should be lossless.
      o   5NF is satisfied when all the tables are broken into as many tables as possible in order
          to avoid redundancy.
     o   5NF is also known as Project-join normal form (PJ/NF).
Example
 SUBJECT                         LECTURER                            SEMESTER
 Computer                       Anshika                              Semester 1
 Computer                       John                                 Semester 1
 Math                           John                                 Semester 1
 Math                           Akash                                Semester 2
 Chemistry                      Praveen                              Semester 1
In the above table, John takes both Computer and Math class for Semester 1 but he doesn't
take Math class for Semester 2. In this case, combination of all these fields required to
identify a valid data.
Suppose we add a new Semester as Semester 3 but do not know about the subject and who
will be taking that subject so we leave Lecturer and Subject as NULL. But all three columns
together acts as a primary key, so we can't leave other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
 SEMESTER                                                 SUBJECT
 Semester 1                                              Computer
 Semester 1                                              Math
 Semester 1                                              Chemistry
 Semester 2                                              Math
P2
 SUBJECT     LECTURER
Computer     Anshika
Computer     John
Math         John
Math         Akash
Chemistry    Praveen
P3
 SEMSTER      LECTURER
Semester 1   Anshika
Semester 1   John
Semester 1   John
Semester 2   Akash
Semester 1   Praveen