Unit Two: Normalization
This unit is developed to provide you the necessary information regarding the following
content coverage and topics
          Identification of suitable business data
          Rules of normalization
          Normalize business data and document results
          Compare normalization results with ER diagram
          Reconcile differences between data
2.1.       Identify Suitable Business Data
When developing normalization for a database, it is important to identify suitable business data that accurately
represents the entities and relationships within the system. Here are some steps to help identify suitable business data
for normalization:
1. Understand the Business Processes: Gain a deep understanding of the business processes and operations that the
    database will support. This involves analyzing the workflow, data flows, and interactions between different
    entities within the organization. Identify the key entities, their attributes, and the relationships between them.
2. Conduct Stakeholder Interviews: Engage with stakeholders, such as business owners, managers, and end-users,
    to gather their input on the data requirements. Conduct interviews to understand their needs, pain points, and
    how they interact with the data. This will help identify the critical data elements and relationships that need to
    be captured in the database.
3. Review Existing Documentation: Examine any existing documentation, such as business requirements
    documents, process flowcharts, or data dictionaries. This can provide valuable insights into the data elements
    that are already identified or documented. Reviewing existing documentation will help ensure that no important
    data elements are overlooked during the normalization process.
4. Analyze Sample Data: Analyze sample data sets to identify the different types of data that need to be captured.
    This can be done by reviewing existing spreadsheets, forms, reports, or any other sources of data. Identify the
    unique values, common patterns, and potential data dependencies within the sample data. This analysis will help
    determine the appropriate entities, attributes, and relationships that should be included in the normalized
    database.
5. Identify Key Business Rules: Understand the business rules that govern the data and its relationships. These rules
    define the logic, constraints, and dependencies that shape the data. Identify any unique constraints, mandatory
    fields, or conditional relationships that need to be captured in the database design.
6. Consider Future Growth and Scalability: Anticipate future growth and scalability requirements of the database.
    Identify data elements that are likely to change or expand over time. This includes considering potential new
    features, additional data sources, or changes in business processes. By planning for future growth, you can ensure
    that the normalized database can accommodate evolving business needs without significant redesign or
    restructuring.
7. Collaborate with Database Experts: Seek guidance from database experts or experienced database administrators.
    They can provide valuable insights into best practices for normalization and help identify suitable business data
    based on their expertise and knowledge.
Remember, the goal of normalization is to eliminate redundancy and improve data integrity. By carefully identifying
suitable business data, you can create a well-structured and efficient database that accurately represents the business
processes and supports the organization's goals.
2.2.    Understand Rules of Normalization
2.2.1   Normalization
   It is a technique for producing a set of relations with desirable properties, given the data requirements of an
    enterprise.Database designed based on ER model may have some amount of inconsistency, ambiguity and
    redundancy. To resolve these issues, some amount of refinement is required. This refinement process is called
    as Normalization.
   The process of normalization is a formal method that identifies relations based on their primary key.Primarily it
    is a tool to validate and improve a logical design so that it satisfies certain constraints that avoid unnecessary
    redundancy of data.
   It is the process of decomposing relations with anomalies to produce smaller, well-structured relations and helps
    eliminate data anomalies/problems
2.2.2 Benefits of database normalization
 Reduced usage of storage space by intelligently categorizing data.
 It enables better, faster, stronger searches as it entails fewer entities to scan in comparison with the earlier
    searches based on mixed entities.
 Improves data integrity: it splits all the data into individual entities yet building strong linkages with the related
    data.
 More efficient database structure.
 Better understanding of data.
 More flexible database structure.
 Easier to maintain database structure.
 Few (if any) costly surprises down the road.
 Validates your common sense and intuition.
 Avoids redundant fields.
 Ensures that distinct tables exist when necessary
2.2.3 Anomalies in Normalization
Anomalies in normalization refer to inconsistencies or issues that can occur in a database when it is not properly
normalized. These anomalies can affect data integrity, accuracy, and the ability to perform efficient data operations.
There are three main types of anomalies that can occur:
1. Insertion Anomaly: An insertion anomaly happens when it is not possible to insert a new record into a table
    without including additional, unrelated data. This occurs when a table has attributes that are functionally
    dependent on only a part of the primary key.
2. Update Anomaly: An update anomaly occurs when modifying data in a table leads to inconsistencies or
    redundant updates in other parts of the table. This happens when a table has redundant data or dependencies
    between non-key attributes.
3. Deletion Anomaly: A deletion anomaly occurs when removing data from a table unintentionally removes other
    related data that should have been preserved. This happenswhen a table has dependencies between attributes, and
    removing data leads to the loss of other necessary data.
Normalization helps to eliminate these anomalies by organizing data into well-structured tables, ensuring data
dependencies are properly defined, and reducing redundancy. By achieving higher levels of normalization, such as
3NF or BCNF, the likelihood of anomalies occurring is minimized, and data integrity is improved
Example
We’ll be using a student database as an example in this article, which records student, class, and teacher information.
Insert Anomaly
For example, if we wanted to add a new student but did not know their course name this will be how
Update Anomaly
For example, let’s say the class Biology 1 was changed to “Intro to Biology”. We would have to query all of the
columns that could have this Class field and rename each one that was found.
Delete Anomaly
For example, let’s say Susan Johnson quits and her record needs to be deleted from the system. We could delete her
row:
Normalization stages
     1NF - First normal form
     2NF - Second normal form
     3NF - Third normal form
     3.5NF - Boyce Codd Normal Form (BCNF)
     4NF - Fourth normal form
     5NF - Fifth normal form
    
The Normal Forms
The database community has developed a series of guidelines for ensuring that databases are normalized. These are
referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as First normal
form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF
along with the occasional 4NF. Fifth normal form is very rarely seen.
  I.    First normal form (1NF):Eliminating Repeating Groups
     Is the "basic" level of normalization, corresponds to the definition of any database
     It contains two-dimensional tables with rows and columns.
     Each column corresponds to a sub-object or an attribute of the object represented by the entire table.
     Each row represents a unique instance of that sub-object or attribute and must be different in some way from
        any other row (no duplicate rows are possible).
     All entries in any column must be of the same kind. For example, in the column labeled "Customer," only
        customer names or numbers are permitted.
Requirements:
     There should be a primary key, and
     No column should contain more than one value (no multi-valued attributes)
In order to perform first normalizationrule, we have to consider the following concepts
 Eliminate duplicative columns from the same table.
 Create separate tables for each group of related data and identify each row with a unique column or set of columns
    (theprimary key).
 Eliminate composite attributes
                                  Example 1: Develop 1NF of the following table
 Student Details                          Course Details                              Result details
 1001 Ram        11/09/1986              M4      Basic Math’s          7            11/11/2004     89       A
 1002 Shyam 12/08/1987                   M4      Basic Math’s          7            11/11/2004     78       B
 1001 Ram       23/06/1987                H6                           4             11/11/2004     87      A
 1003 Sita      16/07/1985                C3    Basic Chemistry       11             11/11/2004     90      A
 1004 Gita       24/09/1988               B3                           8             11/11/2004     78      B
 1002 Shyam 23/06/1988                    P3    Basic Physics         13             11/11/2004     67      C
 1005 Sunita 14/09/1987                   P3    Basic Physics         13             11/11/2004     78      B
 1003 Sita      23/10/1987                B4                           5             11/11/2004     67      C
 1005 Sunita 13/03/1990                   H6                          4              11/11/2004     56      D
 1004 Gita       21/08/1987               M4    Basic Math’s           7             11/11/2004     78      B
The above table Student Details, Course Details and Result Details can be further divided.
      Student Details attribute is divided into Student#(Student Number), Student Name and date of birth.
      Course Details is divided into Course#, Course Name and duration.
      Results attribute is divided into Date ofexam, Marks and Grade.
 II.   Second normal form (2NF): Eliminating Redundant Data
      Second normal form (2NF) requires that all non-key columns are fully dependent on the entire primary key.
       If the table has only a single-column primary key, this requirement is easily met.
      At this level of normalization, each column in a table that is not a determiner of the contents of another
       column must itself be a function of the other columns in the table. For example, in a table with three columns
       containing customer ID, product sold, and price of the product when sold, the price would be a function of
       the customer ID (entitled to a discount) and the specific product.
In order to perform first normalizationrule, we have to consider the following concepts
     Meet all the requirements of the first normal form and remove subsets of data that apply to multiple rows of
        a table and place them in separate tables.
     Create relationships between these new tables and their predecessors through the use of foreign keys.
Example 2
Let us re-visit 1NF table structure.
     Student# is key attribute for Student,
     Course# is key attribute for Course
     Student# and Course# together form the composite key attributes for result relationship.
     Other attributes are non - key attributes.
To make this table 2NF compliant, we have to remove all the partial dependencies.
     Student Name and Date ofBirth depends only on student#.
     CourseName, Pre-Requisite and DurationInDays depends only on Course#
     Date ofExam depends only on Course#.
To remove this partial dependency, we need to split Student_Course_Result table into four separate tables,
STUDENT, COURSE, RESULT and EXAM_DATE tables as shown in the following:
STUDENT TABLE
                 Student #         Student Name                    DateofBirth
               1001              Ram                              Some value
               1002              Shyam                            Some value
               1003              Sita                             Some value
               1004              Geeta                            Some value
               1005              Sunita                           Some value
COURSE TABLE
                 Course#        CourseName                     Duration of days
                 C3             Bio Chemistry                  3
                 B3             Botany                         8
                 P3             Nuclear Physics                1
                 M4             Applied Mathematics            4
                 H6             American History               5
                 B4             Zoology                        9
RESULT TABLE
           Student#                Course#             Marks                Grade
                1001               M4                  89                   A
                1002               M4                  78                   B
                1001               H6                  87                   A
                1003               C3                  90                   A
                1004               B3                  78                   B
                1002               P3                  67                   C
                1005               P3                  78                   B
                1003               B4                  67                   C
                1005               H6                  56                   D
                1004               M4                  78                   B
EXAM DATE Table
                                    Course#           DateOfExam
                                    M4                Some value
                                    H6                Some value
                                    C3                Some value
                                    B3                Some value
                                    P3                Some value
                                    B4                Some value
     In STUDENT table, the key attribute is Student# and all other non-key attributes, Student name and Date
      ofBirth are fully functionally dependent on the key attribute.
     In COURSE table, the key attribute is Course# and all the non-key attributes, Course name, Duration in days
      are fully functional dependent on the key attribute.
     In RESULT table, the key attributes are #StudentCourse# together and all other non-key attributes, Marks and
      Grade are fully functional dependent on the key attributes.
      In EXAM DATE table, the key attribute is Course# and the non key attribute Date ofExam is fully
       functionally dependent on the key attribute.
   At first look it appears like all our anomalies are taken away! Now we are storing Student 1003 and M4 record
   only once. We can insert prospective students and courses at our will. We will update only once if we need to
   change any data in STUDENT, COURSE tables. We can get rid of any course or student details by deleting just
   one row.
Let us analyze the RESULT Table
We already concluded that:
    All attributes are atomic in nature
    No partial dependency exists between the key attributes and non-key attributes
    RESULT table is in 2NF
 Assume, at present, as per the university evaluation policy,
    Students who score more than or equal to 80 marks are awarded with “A” grade
    Students who score more than or equal to 70 marks up till 79 are awarded with “B” grade
    Students who score more than or equal to 60 marks up till 69 are awarded with “C” grade
    Students who score more than or equal to 50 marks up till 59 are awarded with “D” grade
 The University management which is committed to improve the quality of education wants to change the existing
grading system to a new grading system. In the present RESULT table structure,
    We don’t have an option to introduce new grades like A+, B- and E
    We need to do multiple updates on the existing record to bring them to new grading definition
    We will not be able to take away “D” grade if we want to.
    2NF does not take care of all the anomalies and inconsistencies.
III.    Third normal form (3NF): Eliminating Columns Not Dependent on Keys
Requires that there are no transitive dependencies, where one column depends on another column which depends on
the primary key.At the 2NF, modifications are still possible because a change to one row in a table may affect data
that refers to this information from another table. For example, using the customer table just cited, removing a row
describing a customer purchase (because of a return perhaps) will also remove the fact that the product has a certain
price. In the third normal form, these tables would be divided into two tables so that product pricing would be tracked
separately.
In order to perform first normalizationrule, we have to consider the following concepts
      Meet all the requirements of the second normal form.
      Remove columns that are not dependent upon the primary key.
      No transitive dependency exists between non-key attributes and key attributes.
Example 3:In the above RESULT table Student# and Course# are the key attributes. All other attributes, except
grade are non-partially, non-transitively dependent on key attributes. The grade attribute is dependent on “Marks “,
and in turn “Marks” is dependent on #Student#Course. To bring the table in 3NF, we need to take off this transitive
dependency.
                           Student#              Course#               Marks
                           1001                  M4                    89
                           1002                  M4                    78
                           1001                  H6                    87
                           1003                  C3                    90
                           1004                  B3                    78
                           1002                  P3                    67
                           1005                  P3                    78
                           1003                  B4                    67
                             1005                 H6                    56
                             1004                 M4                    78
                            UpperBound             LowerBound            Grade
                            100                    95                    A+
                            94                     90                    A
                            89                     85                    B+
                            84                     80                    B
                            79                     75                    B-
                            74                     70                    C
                            69                     65                    C-
 After normalizing tables to 3NF, we got rid of all the anomalies and inconsistencies. Now we can add new grade
systems, update the existing one and delete the unwanted ones. Hence the Third Normal form is the most optimal
normal form and 99% of the databases which require efficiency in
         INSERT, UPDATE and DELETE Operations are designed in this normal form
IV.       Boyce-Codd Normal Form (BCNF or 3.5NF)
         The Boyce-Codd Normal form, also referred to as the "third and half (3.5) normal form", adds one more
          requirement:
               Meet all the requirements of the third normal form.
               Every determinant must be a candidate key.
         Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. A row is in Boyce Codd normal form if
          and only if every determinant is a candidate key.
         Most entities in 3NF are already in BCNF.
 V.       Fourth Normal Form (4NF)
         Fourth normal form (4NF) has one additional requirement:
               Meet all the requirements of the third normal form.
               A relation is in 4NF if it has no multi-valued dependencies.
         An entity is in Fourth Normal Form (4NF) if and only if it is in 3NF and has no multiple sets of multi-valued
          dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship
          within an entity if the one-to-many attributes are independent of each other.
VI.       Fifth Normal Form (5NF)
         5NF specifies that every join dependency for the entity must be a consequence of its candidate keys.
Summary of normal forms
   A row is in first normal form if and only if all underlying domains contain atomic values only. 1NF eliminates
    repeating groups by putting each into a separate table and connecting them with a one-to-many relationship.
   A row is in second normal form if and only if it is in first normal form and every non-key attribute is fully
    dependent on the key. 2NF eliminates functional dependencies on a partial key by putting the fields in a
    separate table from those that are dependent on the whole key. A row is in third normal form if and only if it
    is in second normal form and every non-key attribute is non-transitively dependent on the primary key. 3NF
    eliminates functional dependencies on non-key fields by putting them in a separate table.non-key fields are
    dependent on the key.
      Functional Dependency
Functional dependency is a relationship between two sets of attributes in a database table. It describes the dependency
of one attribute (or a set of attributes) on another attribute (or a set of attributes). In other words, if changing the
value of one attribute determines the value of another attribute(s), then a functional dependency exists.
   Example: In a table called "Employees," if the attribute "EmployeeID" determines the attribute "EmployeeName,"
it can be represented as: EmployeeID -> EmployeeName. This means that for every value of EmployeeID, there is
a unique value of EmployeeName associated with it.
      Full Functional Dependency
Full functional dependency occurs when an attribute is functionally dependent on the entire primary key of a table,
and not just a part of it. In other words, all non-key attributes depend on the entire primary key and not on any subset
of it.
  Example: In a table called "Students," if the primary key is composed of "StudentID" and "CourseID," and the
attribute "Grade" depends on both of these attributes, it can be represented as: StudentID, CourseID -> Grade. This
means that for every combination of StudentID and CourseID, there is a unique value of Grade associated with it.
     Partial Dependency
Partial dependency occurs when an attribute is functionally dependent on only a part of the primary key, rather than
the entire primary key. It means that a non-key attribute depends on only a subset of the primary key, and not on the
entire primary key.
  Example: In a table called "Orders," if the primary key is "OrderID," and the attributes "CustomerName" and
"CustomerAddress" depend on only the attribute "OrderID," it indicates a partial dependency. This can be
represented as: Ordered ->CustomerName, CustomerAddress. To remove the partial dependency, the table can be
split into two separate tables: "Orders" and "Customers," where the customer details are stored separately.
     Transitive Dependency
Transitive dependency occurs when an attribute depends on another non-key attribute, rather than directly depending
on the primary key. It means that the dependency is indirectly established through another attribute.
  Example: In a table called "Employees," if the primary key is "EmployeeID," and the attributes "Department" and
"DepartmentLocation" depend on each other, it indicates a transitive dependency. This can be represented as:
EmployeeID -> Department -> DepartmentLocation. To remove the transitive dependency, the table can be split into
two separate tables: "Employees" and "Departments," where the department details are stored separately.
Identifying and eliminating partial and transitive dependencies are crucial in achieving higher levels of normalization
(such as 3NF or BCNF) to ensure data integrity, reduce redundancy, and avoid anomalies in a database.
2.3.   Normalize business data and document results
Undertaking normalization of business data involves the process of organizing and structuring the data in a database
to eliminate redundancy, improve data integrity, and reduce anomalies. The goal is to ensure that each piece of data
is stored in the most efficient and logical manner.
The steps involved in normalization typically include:
1. Analyzing the data: The first step is to analyze the existing data in the database. This involves identifying the
    various entities, attributes, and relationships between them.
2. Applying normalization rules: Next, the data is normalized by applying normalization rules, specifically the rules
    outlined in normal forms, such as First Normal Form (1NF), Second Normal Form (2NF), and so on. Each
    normalization form has specific criteria that need to be met.
3. Breaking down tables: In order to meet the criteria for normalization forms, it may be necessary to break down
    existing tables into multiple tables, with each table focusing on a specific entity or relationship.
4. Resolving dependencies: During the normalization process, dependencies between attributes are identified and
    resolved. This includes identifying and eliminating partial dependencies and transitive dependencies.
5. Documenting the results: Once the normalization process is complete, it is important to document the results.
    This documentation includes the structure of the normalized tables, the relationships between them, and any
    changes made to the original data model.
By undertaking normalization of business data and documenting the results, organizations can ensure that their
databases are efficiently structured, leading to improved data quality, easier data maintenance, and more effective
data operations.
2.4.   Compare normalization results with ERdiagram
When comparing normalization results with an entity relationship diagram (ERD), it is important to understand the
relationship between the two and how they complement each other in the process of designing a database.
An entity relationship diagram is a visual representation of the entities (tables), attributes, and relationships between
them in a database. It helps in understanding the structure and organization of the data, as well as the dependencies
between different entities. An ERD typically includes entity boxes, attribute labels, and lines representing the
relationships between entities.
Normalization, on the other hand, is a set of rules and guidelines used to eliminate redundancy, improve data
integrity, and reduce anomalies in a database. It involves breaking down tables, resolving dependencies, and
organizing data into well-structured tables.
When comparing normalization results with an ERD, the main focus is on ensuring that the normalization process
aligns with the relationships and dependencies depicted in the ERD.
Suppose we have an ER diagram representing a library database. The diagram includes entities such as "Books,"
"Authors," and "Publishers," with relationships like "Author writes Book" and "Publisher publishes Book."
Now, let's say we apply normalization to this database. During the normalization process, we break down the initial
table into smaller, more atomic tables to eliminate redundancy and improve data integrity.
For instance, we might have initially had a single table with columns like "Book ID," "Book Title," "Author Name,"
and "Publisher Name." After normalization, we would have separate tables for "Books," "Authors," and "Publishers,"
each with their own unique identifiers and relevant attributes.
When comparing the normalization results with the ER diagram, we would examine if the relationships depicted in
the ER diagram are accurately represented in the normalized tables. We would ensure that the foreign keys in the
normalized tables correctly establish the relationships between entities.
In our example, we would check if the "Author ID" in the "Books" table references the corresponding "Author ID"
in the "Authors" table, and if the "Publisher ID" in the "Books" table references the appropriate "Publisher ID" in
the "Publishers" table. This comparison ensures that the normalization process has preserved the intended
relationships                                             between                                                 entities.
Additionally, we would verify if the attributes in the normalized tables align with the attributes specified in the ER
diagram. We would ensure that no redundant data exists and that the data is properly organized according to
normalization rules.
By comparing the normalization results with the ER diagram, we can validate the accuracy and consistency of the
database design, ensuring that the normalized tables effectively capture the structure and relationships depicted in
the ER diagram.
2.5.   Reconcile differences between data
Reconciling differences between data in normalization refers to the process of resolving conflicts or inconsistencies
that may arise during the normalization process. These conflicts can occur when attempting to organize and structure
data into normalized tables, especially when there are dependencies or relationships between attributes and entities.
Here are some key points to understand about reconciling differences between data in normalization
1. Identify Inconsistencies: The first step is to identify any inconsistencies or conflicts within the data. This may
    involve analyzing the relationships, dependencies, and functional dependencies between attributes and entities.
2. Analyze Dependencies: Evaluate the dependencies between attributes and entities to determine if they are
    accurately represented. This includes identifying partial dependencies (where an attribute depends on only a part
    of the primary key) or transitive dependencies (where an attribute depends on another non-key attribute).
3. Normalize Data: Apply normalization rules, such as First Normal Form (1NF), Second Normal Form (2NF), and
    so on, to organize the data into well-structured tables. This may involve breaking down tables, creating separate
    tables for related entities, and defining appropriate primary and foreign keys.
4. Resolve Conflicts: Address any conflicts or inconsistencies that arise during the normalization process. This may
    involve making decisions on how to handle partial dependencies or transitive dependencies. One approach is to
    split tables and create additional tables to ensure that data is properly organized and dependencies are accurately
    represented.
5. Ensure Data Integrity: As you reconcile differences, it is crucial to maintain data integrity. This means that the
    data in the normalized tables should accurately represent the relationships and dependencies between entities. It
    also involves ensuring that there are no duplicate or redundant data.
6. Validate Results: Validate the results of the normalization process to ensure that the reconciled data aligns with
    the intended structure and relationships. This can be done by comparing the normalized tables with the original
    data, verifying that the relationships and dependencies are accurately represented.
Reconciling differences between data in normalization is an essential step in achieving a well-structured and efficient
database design. It helps eliminate redundancy, reduce anomalies, and improve data integrity.