BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Module 4
Normalization: Database Design Theory – Introduction to Normalization using Functional and
Multivalued Dependencies: Informal design guidelines for relation schema, Functional Dependencies,
Normal Forms based on Primary Keys, Second and Third Normal Forms, Boyce-Codd Normal Form,
Multivalued Dependency and Fourth Normal Form, Join Dependencies and Fifth Normal Form.
Database design may be performed using two approaches: bottom-up or top-down.
A bottom-up design methodology (also called design by synthesis)
It considers the basic relationships among individual attributes as the starting
point and uses those to construct relation schemas.
This approach is not very popular in practice because it suffers from the
problem of having to collect a large number of binary relationships among
attributes as the starting point.
For practical situations, it is next to impossible to capture binary relationships
among all such pairs of attributes.
In contrast, a top-down design methodology (also called design by analysis)
It starts with a number of groupings of attributes into relations that exist
together naturally, for example, on an invoice, a form, or a report.
The relations are then analyzed individually and collectively, leading to further
decomposition until all desirable properties are met.
Top-down design approach, is more appropriate when performing design of
databases by analysis and decomposition of sets of attributes that appear
together in files, in reports, and on forms in real-life situations.
Relational database design ultimately produces a set of relations.
The implicit goals of the design activity are information preservation and
minimum redundancy.
The relational design must preserve all of these concepts, which are originally
captured in the conceptual design after the conceptual to logical design mapping.
Minimizing redundancy implies minimizing redundant storage of the same
information and reducing the need for multiple updates to maintain consistency
across multiple copies of the same information in response to real-world events that
require making an update.
DEPT OF AI&ML,BMSIT Page 1 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Informal Design Guidelines for Relation Schemas
There are four informal guidelines that may be used as measures to determine the quality of
relation schema design:
1. Making sure that the semantics of the attributes is clear in the schema.
Whenever we group attributes to form a relation schema, we assume that attributes
belonging to one relation have certain real-world meaning and a proper
interpretation associated with them.
The semantics of a relation refers to its meaning resulting from the interpretation of
attribute values in a tuple.
Guideline 1. Design a relation schema so that it is easy to explain its meaning. Do not
combine attributes from multiple entity types and relationship types into a single
relation. Intuitively, if a relation schema corresponds to one entity type or one
relationship type, it is straightforward to explain its meaning. Otherwise, if the
relation corresponds to a mixture of multiple entities and relationships, semantic
ambiguities will result and the relation cannot be easily explained.
Example of Violating Guideline 1:
2. Reducing the redundant information in tuples.
One goal of schema design is to minimize the storage space used by the base relations
(and hence the corresponding files).
Grouping attributes into relation schemas has a significant effect on storage space.
Storing natural joins of base relations leads to an additional problem referred to as
update anomalies.
These can be classified into insertion anomalies, deletion anomalies, and modification
anomalies
Insertion Anomalies. Insertion anomalies can be differentiated into two types,
illustrated by the following examples based on the EMP_DEPT relation:
To insert a new employee tuple into EMP_DEPT, we must include either the attribute
values for the department that the employee works for, or NULLs (if the employee
does not work for a department as yet). For example, to insert a new tuple for an
employee who works in department number 5, we must enter all the attribute values
DEPT OF AI&ML,BMSIT Page 2 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
of department 5 correctly so that they are consistent with the corresponding values
for department 5 in other tuples in EMP_DEPT.
It is difficult to insert a new department that has no employees as yet in the
EMP_DEPT relation. The only way to do this is to place NULL values in the attributes
for employee. This violates the entity integrity for EMP_DEPT because its primary key
Ssn cannot be null. Moreover, when the first employee is assigned to that department,
we do not need this tuple with NULL values anymore.
Deletion Anomalies. The problem of deletion anomalies is related to the second
insertion anomaly situation just discussed. If we delete from EMP_DEPT an employee
tuple that happens to represent the last employee working for a particular department,
the information concerning that department is lost inadvertently from the database.
Modification Anomalies. In EMP_DEPT, if we change the value of one of the attributes
of a particular department—say, the manager of department 5—we must update the
tuples of all employees who work in that department; otherwise, the database will
become inconsistent. If we fail to update some tuples, the same department will be
shown to have two different values for manager in different employee tuples, which
would be wrong
Guideline 2. Design the base relation schemas so that no insertion, deletion, or
modification anomalies are present in the relations. If any anomalies are present, note
them clearly and make sure that the programs that update the database will operate
correctly.
3. Reducing the NULL values in tuples.
In some schema designs we may group many attributes together into a “fat” relation.
If many of the attributes do not apply to all tuples in the relation, we end up with
many NULLs in those tuples.
This can waste space at the storage level and may also lead to problems with
understanding the meaning of the attributes and with specifying JOIN operations at
the logical level.
Another problem with NULLs is how to account for them when aggregate operations
such as COUNT or SUM are applied. SELECT and JOIN operations involve
comparisons; if NULL values are present, the results may become unpredictable.
Guideline 3. As far as possible, avoid placing attributes in a base relation whose
values may frequently be NULL. If NULLs are unavoidable, make sure that they apply
in exceptional cases only and do not apply to a majority of tuples in the relation.
DEPT OF AI&ML,BMSIT Page 3 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
4. Disallowing the possibility of generating spurious tuples.
Guideline 4. Design relation schemas so that they can be joined with equality
conditions on attributes that are appropriately related (primary key, foreign key)
pairs in a way that guarantees that no spurious tuples are generated. Avoid relations
that contain matching attributes that are not (foreign key, primary key) combinations
because joining on such attributes may produce spurious tuples.
Functional Dependencies
A functional dependency is a constraint between two sets of attributes from the
database.
Definition. A functional dependency, denoted by X → Y, between two sets of
attributes X and Y that are subsets of R specifies a constraint on the possible tuples
that can form a relation state r of R. The constraint is that, for any two tuples t1 and
t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
This means that the values of the Y component of a tuple in r depend on, or are
determined by, the values of the X component; alternatively, the values of the X
component of a tuple uniquely (or functionally) determine the values of the Y
component.
We also say that there is a functional dependency from X to Y, or that Y is functionally
dependent on X.
Thus, X functionally determines Y in a relation schema R if, and only if, whenever two
tuples of r(R) agree on their X-value, they must necessarily agree on their Y-value.
Note the following:
If a constraint on R states that there cannot be more than one tuple with a
given X-value in any relation instance r(R)—that is, X is a candidate key of
R—this implies that X → Y for any subset of attributes Y of R (because the key
constraint implies that no two tuples in any legal state r(R) will have the same
value of X). If X is a candidate key of R, then X→R.
If X→Y in R, this does not say whether or not Y →X in R.
A functional dependency is a property of the semantics or meaning of the
attributes. The database designers will use their understanding of the semantics of
the attributes of R—that is, how they relate to one another—to specify the functional
DEPT OF AI&ML,BMSIT Page 4 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
dependencies that should hold on all relation states (extensions) r of R.
Relation extensions r(R) that satisfy the functional dependency constraints are called
legal relation states (or legal extensions) of R.
Hence, the main use of functional dependencies is to describe further a relation
schema R by specifying constraints on its attributes that must hold at all times.
Consider the relation schema EMP_PROJ in Figure; from the semantics of the
attributes and the relation, we know that the following functional dependencies
should hold:
a. Ssn →Ename
b. Pnumber → {Pname, Plocation}
c. {Ssn, Pnumber} → Hours
A functional dependency is a property of the relation schema R, not of a particular legal
relation state r of R.
Therefore, an FD cannot be inferred automatically from a given relation extension r
but must be defined explicitly by someone who knows the semantics of the attributes
of R.
Inference Rules for Functional Dependencies
Definition: An FD X →Y is inferred from or implied by a set of dependencies F
specified on R if X →Y holds in every legal relation state r of R; that is, whenever r
satisfies all the dependencies in F, X→Y also holds in r.
For example, if each department has one manager, so that Dept_no uniquely
determines Mgr_ssn (Dept_no →Mgr_ssn), and a manager has a unique phone number
called Mgr_phone (Mgr_ssn→Mgr_phone), then these two dependencies together
imply that Dept_no → Mgr_phone.
This is an inferred or implied FD and need not be explicitly stated in addition to the
two given FDs. Therefore, it is useful to define a concept called closure formally that
includes all possible dependencies that can be inferred from the given set F.
CLOSURE OF F (F+).
Definition. Formally, the set of all dependencies that include F as well as all
dependencies that can be inferred from F is called the closure of F; it is denoted by
F+.
For example, suppose that we specify the following set F of obvious functional
dependencies on the relation schema in Figure :
DEPT OF AI&ML,BMSIT Page 5 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
F = {Ssn →{Ename, Bdate, Address, Dnumber}, Dnumber →{Dname, Dmgr_ssn} }]
Some of the additional functional dependencies that we can infer from F are the
following:
Ssn → {Dname, Dmgr_ssn}
Ssn → Ssn
Dnumber →Dname
The closure F+ of F is the set of all functional dependencies that can be inferred from
F.
To determine a systematic way to infer dependencies, we must discover a set of
inference rules that can be used to infer new dependencies from a given set of
dependencies.
INFERENCE RULES:
IR1 (reflexive rule): If X ⊇ Y, then X →Y.
Proof of IR1: Suppose that X ⊇ Y and that two tuples t1 and t2 exist in some relation instance
r of R such that t1 [X] = t2 [X]. Then t1[Y] = t2[Y] because X ⊇ Y; hence, X → Y must hold in r.
IR2 (augmentation rule): {X → Y} |=XZ → YZ.
Proof of IR2 (by contradiction): Assume that X → Y holds in a relation instance r of R but
that XZ → YZ does not hold. Then there must exist two tuples t1 and t2 in r such that (1) t1
[X] = t2 [X], (2) t1 [Y] = t2 [Y], (3) t1 [XZ] = t2 [XZ], and (4) t1 [YZ] ≠ t2 [YZ]. This is not possible
because from (1) and (3) we deduce (5) t1 [Z] = t2 [Z], and from (2) and (5) we deduce (6)
t1 [YZ] = t2 [YZ], contradicting (4).
IR3 (transitive rule): {X → Y, Y → Z} |=X → Z.
Proof of IR3: Assume that (1) X → Y and (2) Y → Z both hold in a relation r. Then for any two
tuples t1 and t2 in r such that t1 [X] = t2 [X], we must have (3) t1 [Y] = t2 [Y], from assumption
(1); hence we must also have (4) t1 [Z] = t2 [Z] from (3) and assumption (2); thus X → Z must
hold in r. IR4 (decomposition, or projective, rule): {X → YZ} |=X → Y.
Proof of IR4 : 1. X → YZ (given)
2. YZ → Y (using IR1 Rule)
3. X → Y (using IR3 on 1 and 2)
DEPT OF AI&ML,BMSIT Page 6 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
IR5 (union, or additive, rule): {X → Y, X → Z} |=X → YZ.
1. X →Y (given).
2. X → Z (given).
3. X → XY (using IR2 on 1 by augmenting with X; notice that XX = X).
4. XY → YZ (using IR2 on 2 by augmenting with Y).
5. X → YZ (using IR3 on 3 and 4).
IR6 (pseudotransitive rule): {X → Y, WY → Z} |=WX → Z.
1. X → Y (given)
2. WY → Z (given)
3. WX → WY (using IR2 on 1 by augmenting with W)
4. WX → Z (using IR3 on 3 and 2)
Three rules IR1 through IR3 that are well-known inference rules for functional
dependencies. They were proposed first by Armstrong (1974) and hence are known
as Armstrong’s axioms.
Armstrong has shown that inference rules IR1 through IR3 are sound and complete.
By sound, we mean that given a set of functional dependencies F specified on a
relation schema R, any dependency that we can infer from F by using IR1 through IR3
holds in every relation state r of R that satisfies the dependencies in F.
By complete, we mean that using IR1 through IR3 repeatedly to infer dependencies
until no more dependencies can be inferred results in the complete set of all possible
dependencies that can be inferred from F. In other words, the set of dependencies F+,
which we called the closure of F, can be determined from F by using only inference
rules IR1 through IR3.
CLOSURE OF X UNDER F.( X+)
Definition. For each such set of attributes X, we determine the set X+ of attributes that are
functionally determined by X based on F; X+ is called the closure of X under F.
Algorithm 15.1: Determining X+, the Closure of X under F
Input: A set F of FDs on a relation schema R, and a set of attributes X, which is a subset of R.
X+ := X;
repeat
oldX+ := X+;
for each functional dependency Y → Z in F do
if X+ ⊇ Y then X+ := X+ � Z;
until (X+ = oldX+);
DEPT OF AI&ML,BMSIT Page 7 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
For example, consider the following relation schema about classes held at a
university in a given academic year.
CLASS (Classid,Course#,Instr_name,Credit_hrs, Text, Publisher, Classroom, Capacity).
Let F, the set of functional dependencies for the above relation include the following
f.d.s:
FD1: Classid → Course#, Instr_name, Credit_hrs, Text, Publisher,
Classroom, Capacity;
FD2: Course# → Credit_hrs;
FD3: {Course#, Instr_name} → Text, Classroom;
FD4: Text → Publisher
FD5: Classroom → Capacity
Using the inference rules about the FDs and applying the definition of closure, we can
define the following closures:
{ Classid } + = { Classid , Course#, Instr_name, Credit_hrs, Text, Publisher, Classroom,
Capacity } = CLASS
{ Course#} + = { Course#, Credit_hrs}
{ Course#, Instr_name } + = { Course#, Credit_hrs, Text, Publisher,Classroom, Capacity }
Equivalence of Sets of Functional Dependencies
Definition. A set of functional dependencies F is said to cover another set of
functional dependencies E if every FD in E is also in F+; that is, if every dependency in
E can be inferred from F; alternatively, we can say that E is covered by F.
Definition. Two sets of functional dependencies E and F are equivalent if E+ = F+.
Therefore, equivalence means that every FD in E can be inferred from F, and every FD
in F can be inferred from E; that is, E is equivalent to F if both the conditions—E covers
F and F covers E—hold.
We can determine whether F covers E by calculating X+ with respect to F for each FD
X → Y in E, and then checking whether this X+ includes the attributes in Y. If this is the
case for every FD in E, then F covers E.
We determine whether E and F are equivalent by checking that E covers F and F
covers E
DEPT OF AI&ML,BMSIT Page 8 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
DEPT OF AI&ML,BMSIT Page 9 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Minimal Sets of Functional Dependencies
Definition: An attribute in a functional dependency is considered an extraneous
attribute if we can remove it without changing the closure of the set of dependencies.
Formally, given F, the set of functional dependencies, and a functional dependency X
→ A in F, attribute Y is extraneous in X if Y ⊂ X, and F logically implies (F −
(X → A) � { (X − Y) → A } ).
We can formally define a set of functional dependencies F to be minimal if it satisfies
the following conditions:
1. Every dependency in F has a single attribute for its right-hand side.
2. We cannot replace any dependency X → A in F with a dependency Y → A, where Y is a
proper subset of X, and still have a set of dependencies that is equivalent to F.
3. We cannot remove any dependency from F and still have a set of dependencies that
is equivalent to F.
We can think of a minimal set of dependencies as being a set of dependencies in a
standard or canonical form and with no redundancies.
Condition 1 just represents every dependency in a canonical form with a single
attribute on the right-hand side, and it is a preparatory step before we can evaluate if
conditions 2 and 3 are met.
Conditions 2 and 3 ensure that there are no redundancies in the dependencies
either by having redundant attributes (referred to as extraneous attributes) on the
left-hand side of a dependency (Condition 2) or by having a dependency that can be
inferred from the remaining FDs in F (Condition 3).
Definition. A minimal cover of a set of functional dependencies E is a minimal set of
dependencies (in the standard canonical form and without redundancy) that is
equivalent to E.
Algorithm 1. Finding a Minimal Cover F for a Set of Functional Dependencies E
Input: A set of functional dependencies E.
Note: Explanatory comments are given at the end of some of the steps. They follow the
format: (*comment*).
1. Set F := E.
2. Replace each functional dependency X → {A1, A2, … , An} in F by the n functional
dependencies X →A1, X →A2, … , X → An.
DEPT OF AI&ML,BMSIT Page 10 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
(*This places the FDs in a canonical form for subsequent testing*)
3. For each functional dependency X → A in F
for each attribute B that is an element of X
if { {F − {X → A} } � { (X − {B} ) → A} } is equivalent to F then replace X
→ A with (X − {B} ) → A in F.
(*This constitutes removal of an extraneous attribute B contained in the lefthand side X of a
functional dependency X → A when possible*)
4. For each remaining functional dependency X → A in F
if {F − {X → A} } is equivalent to F,
then remove X → A from F.
(*This constitutes removal of a redundant functional dependency X → A from
F when possible*)
Example 1: Let the given set of FDs be E: {B → A, D → A, AB → D}. We have to
find the minimal cover of E.
All above dependencies are in canonical form (that is, they have only one attribute on
the right-hand side), so we have completed step 1 of Algorithm 1 and can proceed to
step 2. In step 2 we need to determine if AB → D has any redundant (extraneous)
attribute on the left-hand side; that is, can it be replaced by B → D or A → D?
Since B → A, by augmenting with B on both sides (IR2), we have
BB → AB, or B → AB (i).
However, AB → D as given (ii).
Hence by the transitive rule (IR3), we get from (i) and (ii), B → D. Thus AB → D may
be replaced by B → D.
We now have a set equivalent to original E, say E′: {B → A, D → A, B → D}. No further
reduction is possible in step 2 since all FDs have a single attribute on the left-hand
side.
In step 3 we look for a redundant FD in E′. By using the transitive rule on B → D and
D → A, we derive B → A. Hence B → A is redundant in E′ and can be eliminated.
Therefore, the minimal cover of E is F: {B → D, D → A}.
Example 2: Let the given set of FDs be G: {A → BCDE, CD → E}.
Here, the given FDs are NOT in the canonical form. So we first convert them into:
E: {A → B, A→ C, A→ D, A→ E, CD → E}.
In step 2 of the algorithm, for CD → E, neither C nor D is extraneous on the left-hand
side, since we cannot show that C → E or D → E from the given FDs. Hence we cannot
replace it with either.
In step 3, we want to see if any FD is redundant. Since A→ CD and CD → E, by
transitiverule (IR3), we get A→ E. Thus, A→ E is redundant in G.
DEPT OF AI&ML,BMSIT Page 11 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
So we are left with the set F, equivalent to the original set G as: {A → B, A→ C, A→ D,
CD → E}. F is the minimum cover.
we can combine the first three FDs using the union rule (IR5) and express the
minimum cover as:
Minimum cover of G, F: {A → BCD, CD → E}.
Example 3: Find the minimal cover of the set of functional dependencies given;
{A → C, AB → C, C → DI, CD → I, EC → AB, EI → C}
Simple properties/steps of minimal cover:
1. Right Hand Side (RHS) of all FDs should be single attribute.
2. Remove extraneous attributes.
3. Eliminate redundant functional dependencies.
Let us apply these properties to F = {A → C, AB → C, C → DI, CD → I, EC → AB, EI → C}
1. Right Hand Side (RHS) of all FDs should be single attribute.
So we write F as F1, as follows;
F1 = {A → C, AB → C, C → D, C → I, CD → I, EC → A, EC → B, EI → C}
2. Remove extraneous attributes.
Extraneous attribute is a redundant attribute on the LHS of the functional dependency. In
the set of FDs, AB → C, CD → I, EC → A, EC → B, and EI → C have more than one attribute in
the LHS. Hence, we check one of these LHS attributes are extraneous or not.
To check, we need to find the closure of each attribute on the LHS; [apply the closure finding
algorithm – refer here]
(i) A+ = ACDI
(ii) B+ = B
(iii) C+ = CDI
(iv) D+ = D
(v) E+ = E
(vi) I+ = I
From (i), the closure of A included the attribute C. So, B is extraneous in AB → C, and B can
be removed.
DEPT OF AI&ML,BMSIT Page 12 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
From (iii), the closure of C included the attribute I. So, D is extraneous in CD → I, and D can
be removed. No more extraneous attributes are found. Hence, we write F1 as F2 after
removing extraneous attributes from F1 as follows;
F2 = {A → C, C → D, C → I, EC → A, EC → B, EI → C}
3. Eliminate redundant functional dependency.
None of the FDs in F2 is redundant. Hence, F2 is minimal cover.
Hence, set of functional dependencies F2 is the minimal cover for the set F.
Finding a Key K for R Given a Set F of Functional Dependencies
Algorithm 2. Finding a Key K for R Given a Set F of Functional DependenciesInput: A relation
R and a set of functional dependencies F on the attributes
of R.
1. Set K := R.
2. For each attribute A in K
{compute (K − A)+ with respect to F;
if (K − A)+ contains all the attributes in R,
then set K := K − {A} };
Example 1 :
R = (ABCDE), F = {A -> C, E -> D, B -> C}
The first step in finding the candidate keys is to find the attribute closure given F.
A+ = AC
B+ = BC
C+ = C
D+ = D
E+ = DE
From this information we need to find the candidate keys. Any attribute that only appears
on the right side in a trivial dependency must be in the candidate key. For this, that includes
ABE. Does ABE+ get us to a candidate key? ABE+ = ABCDE – yes it does. The candidate key is
ABE.
DEPT OF AI&ML,BMSIT Page 13 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Example 2:
R = ABCDE, F = {A -> BE, C -> BE, B -> D}
Compute the attribute closure:
A+ = ABDE
B+ = BD
C+ = CBDE
D+ = D
E+ = E
The 2 attributes that only appear on the right side in a trivial are AC. Is AC a candidate key?
Yes. Because AC+=ABCDE
Example 3 :
R = ABCDEF, F = {A -> B, B -> D, C -> D, E -> F}
Compute the attribute closures:
A+ = ABD
B+ = BD
C+ = CD
D+ = D
E+ = EF
Ok, let’s start with those attributes that only appear on the right side in trivial FDs. They are
ACE . ACE+ = ABCDEF, so ACE is a candidate key.
Example 4 :
R = ABCD, F={AB -> C, BC -> D, CD -> A}
Computing the attribute closure:
The single letters are all trivial.
AB+ = ABCD,
BC+=ABCD,
CD+=ACD
So our candidate keys are AB and BC.. Why isn’t BCD+ a candidate key? Because it is not
minimal. The D is extraneous since BC -> D.
Example 5 :
R = ABCD, F={A -> BCD, C -> A}
Attribute closure:
A+ - ABCD
B+ = B
DEPT OF AI&ML,BMSIT Page 14 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
C+ = ABCD
D+ = D
Our candidate keys are A and C.
Normal Forms Based on Primary Keys
Normalization of data can be considered a process of analyzing the given relation
schemas based on their FDs and primary keys to achieve the desirable properties
of (1) minimizing redundancy and (2) minimizing the insertion, deletion, and
update Anomalies.
It can be considered as a “filtering” or “purification” process to make the design have
successively better quality.
An unsatisfactory relation schema that does not meet the condition for a normal
form—the normal form test—is decomposed into smaller relation schemas that
contain a subset of the attributes and meet the test that was otherwise not met by the
original relation.
Thus, the normalization procedure provides database designers with the following:
A formal framework for analyzing relation schemas based on their keys and on
the functional dependencies among their attributes.
A series of normal form tests that can be carried out on individual relation
schemas so that the relational database can be normalized to any desired
degree
Definition. The normal form of a relation refers to the highest normal form
condition that it meets, and hence indicates the degree to which it has been
normalized.
Note: Normal forms, when considered in isolation from other factors, do not
guarantee a good database design.
Definition. Denormalization is the process of storing the join of higher normal form
relations as a base relation, which is in a lower normal form.
Definition. A superkey of a relation schema R = {A1, A2, … , An} is a set of attributes
S ⊆ R with the property that no two tuples t1 and t2 in any legal relation state r of R
will have t1[S] = t2[S].
DEPT OF AI&ML,BMSIT Page 15 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Definition. A key K is a superkey with the additional property that removal of any
attribute from K will cause K not to be a superkey anymore.
If a relation schema has more than one key, each is called a candidate key. One of the
candidate keys is arbitrarily designated to be the primary key, and the others are
called secondary keys.
NOTE: In a practical relational database, each relation schema must have a primary
key. If no candidate key is known for a relation, the entire relation can be treated as a
default superkey.
{Ssn} is the only candidate key for EMPLOYEE, so it is also the primary key.
Definition. An attribute of relation schema R is called a prime attribute of R if it is a
member of some candidate key of R.
Definition. An attribute is called nonprime attribute if it is not a prime attribute—
that is, if it is not a member of any candidate key.
First Normal Form
First normal form (1NF)is now considered to be part of the formal definition of a
relation in the basic (flat) relational model; historically, it was defined to disallow
multivalued attributes, composite attributes, and their combinations. It states that the
domain of an attribute must include only atomic (simple, indivisible) values and that
the value of any attribute in a tuple must be a single value from the domain of that
attribute.
Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both
as an attribute value for a single tuple.
In other words, 1NF disallows relations within relations or relations as attribute
values within tuples.
The only attribute values permitted by 1NF are single atomic (or indivisible) values.
DEPT OF AI&ML,BMSIT Page 16 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
The DEPARTMENT relation in Figure is not in 1NF; in fact, it does not even qualify as
a relation according to our definition of relation.
There are two ways we can look at the Dlocations attribute:
The domain of Dlocations contains atomic values, but some tuples can have a
set of these values. In this case, Dlocations is not functionally dependent on the
primary key Dnumber.
The domain of Dlocations contains sets of values and hence is nonatomic. In
this case, Dnumber → Dlocations because each set is considered a single
member of the attribute domain.
There are three main techniques to achieve first normal form for such a relation:
1. Remove the attribute Dlocations that violates 1NF and place it in a separate
relation DEPT_LOCATIONS along with the primary key Dnumber of DEPARTMENT.
The primary key of this newly formed relation is the combination {Dnumber,
Dlocation}
2. Expand the key so that there will be a separate tuple in the original DEPARTMENT
relation for each location of a DEPARTMENT, as shown in Figure (c). In this case, the
primary key becomes the combination {Dnumber, Dlocation}. This solution has the
disadvantage of introducing redundancy in the relation and hence is rarely adopted.
DEPT OF AI&ML,BMSIT Page 17 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
3. If a maximum number of values is known for the attribute—for example, if it is
known that at most three locations can exist for a department—replace the
Dlocations attribute by three atomic attributes: Dlocation1, Dlocation2, and
Dlocation3. This solution has the disadvantage of introducing NULL values if most
departments have fewer than three locations. It further introduces spurious
semantics about the ordering among the location values; that ordering is not
originally intended. Queryingon this attribute becomes more difficult.
First normal form also disallows multivalued attributes that are themselves
composite. These are called nested relations because each tuple can have a relation
within it.
Figure below shows how the EMP_PROJ relation could appear if nesting is allowed.
Each tuple represents an employee entity, and a relation PROJS(Pnumber, Hours)
within each tuple represents the employee’s projects and the hours per week that
employee works on each project.
The schema of this EMP_PROJ relation can be represented as follows:
EMP_PROJ(Ssn, Ename, {PROJS(Pnumber, Hours)})
The set braces { } identify the attribute PROJS as multivalued, and we list the component
attributes that form PROJS between parentheses ( ).
DEPT OF AI&ML,BMSIT Page 18 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Second normal form (2NF)
2NF is based on the concept of full functional dependency.
A functional dependency X → Y is a full functional dependency if removal of any
attribute A from X means that the dependency does not hold anymore; that is, for any
attribute A ε X, (X − {A}) does not functionally determine Y.
A functional dependency X → Y is a partial dependency if some attribute A ε X can
be removed from X and the dependency still holds; that is, for some A ε X, (X
– {A}) → Y.
In Figure , {Ssn, Pnumber} → Hours is a full dependency (neither Ssn → Hours nor
Pnumber → Hours holds). However, the dependency {Ssn, Pnumber} → Ename is
partial because Ssn → Ename holds.
Definition. A relation schema R is in 2NF if every nonprime attribute A in R is fully
functionally dependent on the primary key of R.
General Definition. A relation schema R is in second normal form (2NF) if every
nonprime attribute A in R is not partially dependent on any key of R.
The test for 2NF involves testing for functional dependencies whose left-hand side
attributes are part of the primary key.
If the primary key contains a single attribute, the test need not be applied at all.
If a relation schema is not in 2NF, it can be second normalized or 2NF normalized into
a number of 2NF relations in which nonprime attributes are associated only with the
part of the primary key on which they are fully functionally dependent.
DEPT OF AI&ML,BMSIT Page 19 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
DEPT OF AI&ML,BMSIT Page 20 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Third normal form (3NF)
3NF is based on the concept of transitive dependency. A functional dependency X → Y
in a relation schema R is a transitive dependency if there exists a set of attributes Z
in R that is neither a candidate key nor a subset of any key of R, and both X → Z and Z
→ Y hold.
Definition. According to Codd’s original definition, a relation schema R is in 3NF if it
satisfies 2NF and no nonprime attribute of R is transitively dependent on the primary
key.
General Definition. A relation schema R is in third normal form (3NF) if, whenever
a nontrivial functional dependency X → A holds in R, either (a) X is a superkey of R, or
(b) A is a prime attribute of R.
Alternative Definition. A relation schema R is in 3NF if every nonprime attribute of
R meets both of the following conditions:
It is fully functionally dependent on every key of R.
It is nontransitively dependent on every key of R.
Intuitively, we can see that any functional dependency in which the left-hand side is
part (a proper subset) of the primary key, or any functional dependency in which the
left-hand side is a nonkey attribute, is a problematic FD.
2NF and 3NF normalization remove these problem FDs by decomposing the original
relation into new relations.
In terms of the normalization process, it is not necessary to remove the partial
dependencies before the transitive dependencies, but historically, 3NF has been
defined with the assumption that a relation is tested for 2NF first before it is tested
for 3NF.
The relation schema EMP_DEPT is in 2NF, since no partial dependencies on a key exist.
However, EMP_DEPT is not in 3NF because of the transitive dependency of Dmgr_ssn (and
also Dname) on Ssn via Dnumber. We can normalize EMP_DEPT by decomposing it into the
two 3NF relation schemas ED1 and ED2.
DEPT OF AI&ML,BMSIT Page 21 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
SUMMARY
Normal Test Remedy (Normalization)
Form
1NF Relation should have no multivalued Form new relations for each multivalued
attributes or nested relations. attribute or nested relation.
2NF For relations where primary key Decompose and set up a new relation for
contains multiple attributes, no each partial key with its dependent
nonkey attribute should be attribute(s). Make sure to keep a relation
functionally dependent on a part of with the original primary key and any
the primary key. attributes that are fully functionally
dependent on it.
3NF Relation should not have a nonkey Decompose and set up a relation that
attribute functionally determined by includes the nonkey attribute(s) that
another nonkey attribute (or by a set Functionally determine(s) other nonkey
of nonkey attributes). That is, there attribute(s).
should be no transitive dependency
of a nonkey attribute on the primary
key.
DEPT OF AI&ML,BMSIT Page 22 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
DECOMPOSING the LOTS relation with its functional dependenciesFD1 through FD4
Normalization into 2NF and 3NF.
.
Boyce-Codd Normal Form (BCNF):
Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was
found to be stricter than 3NF. That is, every relation in BCNF is also in 3NF; however,
a relation in 3NF is not necessarily in BCNF.
DEPT OF AI&ML,BMSIT Page 23 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Definition. A relation schema R is in BCNF if whenever a nontrivial functional
dependency X → A holds in R, then X is a superkey of R.
In practice, most relation schemas that are in 3NF are also in BCNF. Only if there exists
some f.d. X → A that holds in a relation schema R with X not being a superkey and A
being a prime attribute will R be in 3NF but not in BCNF.
FIND THE HIGHEST NORMAL FORM OF A RELATION
Steps to find the highest normal form of a relation:
1. Find all possible candidate keys of the relation.
2. Divide all attributes into two categories: prime attributes and non-prime attributes.
3. Check for 1st normal form then 2nd and so on. If it fails to satisfy nth normal form
condition, highest normal form will be n-1.
Example 1. Find the highest normal form of a relation R(A,B,C,D,E) with FD set
F={AD, BA, BC D, ACBE}
Step 1. As we can see, (AC)+ ={A,C,B,E,D} but none of its subset can determine all attribute of
relation, So AC will be candidate key. A can be derived from B, so we can replace A in AC by B.
So BC will also be a candidate key. So there will be two candidate keys {AC, BC}.
Step 2. Prime attribute are those attribute which are part of candidate key {A,B,C} in this
example and others will be non-prime {D,E} in this example.
Step 3. The relation R is in 1st normal form as a relational DBMS does not allow multi-valued
or composite attribute.
The relation is not in 2nd Normal form because A->D is partial dependency (A which is subset
of candidate key AC is determining non-prime attribute D) and 2nd normal form does not allow
partial dependency.
So the highest normal form will be 1st Normal Form.
Example 2. Find the highest normal form of a relation R(A,B,C,D,E) with FD set
{BA, AC, BCD, ACBE}
Step 1. As we can see, (B)+ ={B,A,C,D,E}, so B will be candidate key. B can be derived from AC
using AC->B (Decomposing AC->BE to AC->B and AC->E). So AC will be super key but (C)+ ={C}
and (A)+ ={A,C,B,E,D}. So A (subset of AC) will be candidate key. So there will be two candidate
keys {A,B}.
Step 2. Prime attribute are those attribute which are part of candidate key {A,B} in this
example and others will be non-prime {C,D,E} in this example.
Step 3. The relation R is in 1st normal form as a relational DBMS does not allow multi-valued
or composite attribute.
DEPT OF AI&ML,BMSIT Page 24 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
The relation is in 2nd normal form because B->A is in 2nd normal form (B is a super key) and A-
>C is in 2nd normal form (A is super key) and BC->D is in 2nd normal form (BC is a super key)
and AC->BE is in 2nd normal form (AC is a super key).
The relation is in 3rd normal form because LHS of all FD’s are super keys. The relation is in
BCNF as all LHS of all FD’s are super keys. So the highest normal form is BCNF.
Example 3. Find the highest normal form of a relation R(A,B,C,D,E) with FD set as
{BC->D, AC->BE, B->E}
Step 1. As we can see, (AC)+ ={A,C,B,E,D} but none of its subset can determine all attribute of
relation, So AC will be candidate key. A or C can’t be derived from any other attribute of the
relation, so there will be only 1 candidate key {AC}.
Step 2. Prime attribute are those attribute which are part of candidate key {A,C} in this
example and others will be non-prime {B,D,E} in this example.
Step 3. The relation R is in 1st normal form as a relational DBMS does not allow multi-valued
or composite attribute.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not proper subset
of candidate key AC) and AC->BE is in 2nd normal form (AC is candidate key) and B->E is in
2nd normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is a
prime attribute) and in B->E (neither B is a super key nor E is a prime attribute) but to satisfy
3rd normal for, either LHS of an FD should be super key or RHS should be prime attribute.
So the highest normal form of relation will be 2nd Normal form.
Multivalued Dependency and Fourth Normal Form
Definition. A multivalued dependency X → Y specified on relation schema R, where X
and Y are both subsets of R, specifies the following constraint on any relation state r
of R: If two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4
should also exist in r with the following properties, where we use Z to denote (R −
(X � Y)):
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[Z] = t2[Z] and t4[Z] = t1[Z]
Whenever X ↠ Y holds, we say that X multidetermines Y. Because of the symmetry
in the definition, whenever X ↠Y holds in R, so does X ↠ Z.
Hence, X↠Y implies X ↠Z and therefore it is sometimes written as X↠Y|Z.
An MVD X ↠ Y in R is called a trivial MVD if (a) Y is a subset of X, or (b) X � Y = R.
DEPT OF AI&ML,BMSIT Page 25 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
For example, the relation EMP_PROJECTS in Figure (b) has the trivial MVD Ename
↠ Pname and the relation EMP_DEPENDENTS has the trivial MVD Ename
↠Dname.
An MVD that satisfies neither (a) nor (b) is called a nontrivial MVD.
A trivial MVD will hold in any relation state r of R; it is called trivial because it does
not specify any significant or meaningful constraint on R.
If we have a nontrivial MVD in a relation, we may have to repeat values redundantly
in the tuples. F (that includes functional dependencies and multivalued
dependencies) if, for every nontrivial multivalued dependency X ↠ Y in F+, X is a
superkey for R.
We can state the following points:
An all-key relation is always in BCNF since it has no FDs.
An all-key relation such as the EMP relation in Figure (a), which has no FDs but has
the MVD Ename ↠Pname | Dname, is not in 4NF.
A relation that is not in 4NF due to a nontrivial MVD must be decomposed to convert
it into a set of relations in 4NF.
The decomposition removes the redundancy caused by the MVD.
Note: The process of normalizing a relation involving the nontrivial MVDs that is not in 4NF
consists of decomposing it so that each MVD is represented by a separate relation
where it becomes a trivial MVD.
DEPT OF AI&ML,BMSIT Page 26 of 39
BMSINSTITUTE OF TECHNOLOGY AND MANAGEMENT
Department of Artificial Intelligence and Machine Learning
Module 3: Normalization
Consider the EMP relation in Figure (a). EMP is not in 4NF because in the nontrivial
MVDs Ename↠Pname and Ename ↠ Dname, and Ename is not a superkey of EMP.
We decompose EMP into EMP_PROJECTS and EMP_DEPENDENTS, shown in Figure
(b).
Both EMP_PROJECTS and EMP_DEPENDENTS are in 4NF, because the MVDs Ename
↠ Pname in EMP_PROJECTS and Ename ↠ Dname in EMP_DEPENDENTS are trivial
MVDs.
No other nontrivial MVDs hold in either EMP_PROJECTS or EMP_DEPENDENTS. No
FDs hold in these relation schemas either.
Join Dependencies and Fifth Normal Form (Project-Join Normal Form)
Definition. A relation schema R is in fifth normal form (5NF) (or project-join normal
form (PJNF)) with respect to a set F of functional, multivalued, and join dependencies if, for
every nontrivial join dependency JD(R1, R2, … , Rn) in F+ (that is, implied by F), every Ri is a
superkey of R.
Definition: A join dependency (JD), denoted by JD(R1, R2, … , Rn), specified on relation
schema R, specifies a constraint on the states r of R. The constraint states that every legal
state r of R should have a nonadditive join decomposition into R1, R2, … , Rn. Hence, for every
such r we have * (πR1(r), πR2(r), … , πRn(r)) = r
Notice that an MVD is a special case of a JD where n = 2. That is, a JD denoted as JD(R1, R2)
implies an MVD (R1 ∩ R2) →→ (R1 − R2)(or, by symmetry, (R1 ∩ R2) →→ (R2 − R1)). A join
dependency JD(R1, R2, … , Rn), specified on relation schema R, is a trivial JD if one of the
relation schemas Ri in JD(R1, R2, … , Rn) is equal to R. Such a dependency is called trivial
because it has the nonadditive join property for any relation state r of R and thus does not
specify any constraint on R.
EXAMPLES TO BE SOLVED:
1. Consider the universal relation R = {A, B, C, D, E, F, G, H, I, J} and the set of
functional dependencies F = {{A, B}→{C}, {A}→{D, E}, {B}→{F}, {F}→{G, H},
{D}→{I, J}}. What is the key for R? Decompose R into 2NF and then 3NF
relations.
2. Consider a relation R(A, B, C, D, E) with the following dependencies: AB → C,
CD → E, DE → B.
Is AB a candidate key of this relation? If not, is ABD? Explain your answer.
DEPT OF AI&ML,BMSIT Page 27 of 39