Database Design for Students
Database Design for Students
Module – 04
Insertion Anomalies
     Insertion anomalies can be differentiated into two types, illustrated by the following
      examples based on the EMP_DEPT relation:
     To insert a new employee tuple into EMP_DEPT, we must include either the attribute
      values for the department that the employee works for, or NULLs (if the employee
      does not work for a department as yet).
     It is difficult to insert a new department that has no employees as yet in the
      EMP_DEPT relation. The only way to do this is to place NULL values in the attributes
      for employee.
Deletion Anomalies
    If we delete from EMP_DEPT an employee tuple that happens to represent the last
       employee working for a particular department, the information concerning that
       department is lost inadvertently from the database.
Modification Anomalies
                                       Guideline 2
    Design the base relation schemas so that no insertion, deletion, or modification
     anomalies are present in the relations.
    If any anomalies are present,4 note them clearly and make sure that the programs
     that update the database will operate correctly.
    The second guideline is consistent with and, in a way, a restatement of the first
     guideline. This way, whenever the base relation is updated, we do not end up with
     inconsistencies.
    In general, it is advisable to use anomaly-free base relations and to specify views that
     include the joins for placing together the attributes frequently referenced in
     important queries.
    If many of the attributes do not apply to all tuples in the relation, we end up with
     many NULLs in those tuples.
    This can waste space at the storage level and may also lead to problems with
     understanding the meaning of the attributes and with specifying JOIN operations at
     the logical level.
    Another problem with NULLs is how to account for them when aggregate operations
     such as COUNT or SUM are applied.
    Moreover, NULLs can have multiple interpretations, such as the following:
                                       Guideline 3
    As far as possible, avoid placing attributes in a base relation whose values may
     frequently be NULL.
    If NULLs are unavoidable, make sure that they apply in exceptional cases only and do
     not apply to a majority of tuples in the relation.
    Using space efficiently and avoiding joins with NULL values are the two overriding
     criteria that determine whether to include the columns that may have NULLs in a
     relation or to have a separate relation for those columns.
    Consider the two relation schemas EMP_LOCS and EMP_PROJ1 in Figure 14.5(a),
     which can be used instead of the single EMP_PROJ relation in Figure 14.3(b).
    A tuple in EMP_LOCS means that the employee whose name is Ename works on at
     least one project located at Plocation.
    A tuple in EMP_PROJ1 refers to the fact that the employee whose Social Security
     number is Ssn works the given Hours per week on the project whose name, number,
     and location are Pname, Pnumber, and Plocation.
    Additional tuples that were not in EMP_PROJ are called spurious tuples because they
     represent spurious information that is not valid.
                                       Guideline 4
    Design relation schemas so that they can be joined with equality conditions on
     attributes that are appropriately related (primary key, foreign key) pairs in a way that
     guarantees that no spurious tuples are generated.
    Avoid relations that contain matching attributes that are not (foreign key, primary
     key) combinations because joining on such attributes may produce spurious tuples.
    This informal guideline obviously needs to be stated more formally.
Functional Dependencies
 The single most important concept in relational schema design theory is that of a
  functional dependency.
 A functional dependency is a constraint between two sets of attributes from the database.
 Suppose that our relational database schema has n attributes A1, A2, … , An; let us think of the
    whole database as being described by a single universal relation schema R = {A1, A2, … , An}.
 We do not imply that we will actually store the database as a single universal table; we use this
    concept only in developing the formal theory of data dependencies.
 A functional dependency is a property of the semantics or meaning of the attributes.
 The database designers will use their understanding of the semantics of the attributes of R—that
    is, how they relate to one another—to specify the functional dependencies that should hold on all
    relation states (extensions) r of R.
        Definition
                A functional dependency, denoted by X → Y, between two sets of attributes X
                and Y that are subsets of R specifies a constraint on the possible tuples that
                can form a relation state r of R.
                The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X],
                they must also have t1[Y] = t2[Y].
                The abbreviation for functional dependency is FD or f.d. The set of attributes X is
                called the left-hand side of the FD, and Y is called the right-hand side.
Normalization of Relations
    The normalization process, as first proposed by Codd (1972a), takes a relation schema
     through a series of tests to certify whether it satisfies a certain normal form.
    Initially, Codd proposed three normal forms, which he called first, second, and third
     normal form.
    A stronger definition of 3NF—called Boyce-Codd normal form (BCNF)—was proposed
     later by Boyce and Codd.
    All these normal forms are based on a single analytical tool: the functional
     dependencies among the attributes of a relation.
    Later, a fourth normal form (4NF) and a fifth normal form (5NF) were proposed, based
     on the concepts of multivalued dependencies and join dependencies, respectively.
    Normalization of data can be considered a process of analyzing the given relation
     schemas based on their FDs and primary keys to achieve the desirable properties of
     (1) minimizing redundancy and (2) minimizing the insertion, deletion, and update
     anomalies.
    Thus, the normalization procedure provides database designers with the following:
Definition
    The normal form of a relation refers to the highest normal form condition that it
     meets, and hence indicates the degree to which it has been normalized.
    Normal forms, when considered in isolation from other factors, do not guarantee a
     good database design.
    It is generally not sufficient to check separately that each relation schema in the
     database is, say, in BCNF or 3NF.
Definition
Denormalization is the process of storing the join of higher normal form relations as a base
relation, which is in a lower normal form.
A superkey of a relation schema R = {A1, A2, … , An} is a set of attributes S R with the
property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]. A
key K is a superkey with the additional property that removal of any attribute from K will
cause K not to be a superkey anymore.
    If a relation schema has more than one key, each is called a candidate key.
    One of the candidate keys is arbitrarily designated to be the primary key, and the
     others are called secondary keys.
    In a practical relational database, each relation schema must have a primary key. If no
     candidate key is known for a relation, the entire relation can be treated as a default
     superkey.
    In Figure 14.1, {Ssn} is the only candidate key for EMPLOYEE, so it is also the primary
     key.
Definition
   •    1NF, 2NF, and 3NF. These were proposed by Codd (1972a) as a sequence to achieve
        the desirable state of 3NF relations by progressing through the intermediate states of
        1NF and 2NF if needed.
   •    As we shall see, 2NF and 3NF independently attack different types of problems arising
        from problematic functional dependencies among attributes.
   •    The domain of Dlocations contains atomic values, but some tuples can have a set of
        these values. In this case, Dlocations is not functionally dependent on the primary key
        Dnumber.
   •    The domain of Dlocations contains sets of values and hence is nonatomic. In this case,
        Dnumber → Dlocations because each set is considered a single member of the
        attribute domain.
    The test for 2NF involves testing for functional dependencies whose left-hand side
     attributes are part of the primary key.
    The nonprime attribute Ename violates 2NF because of FD2, as do the nonprime attributes
     Pname and Plocation because of FD3.
    Each of the functional dependencies FD2 and FD3 violates 2NF because Ename can be
     functionally determined by only Ssn, and both Pname and Plocation can be functionally
     determined by only Pnumber.
    Therefore, the functional dependencies FD1, FD2, and FD3 in Figure 14.3(b) lead to the
     decomposition of EMP_PROJ into the three relation schemas EP1, EP2, and EP3 shown in
     Figure 14.11(a), each of which is in 2NF.
Definition
According to Codd’s original definition, a relation schema R is in 3NF if it satisfies 2NF and no
nonprime attribute of R is transitively dependent on the primary key.
    2NF and 3NF normalization remove these problem FDs by decomposing the original
     relation into new relations.
    In terms of the normalization process, it is not necessary to remove the partial
     dependencies before the transitive dependencies, but historically, 3NF has been
     defined with the assumption that a relation is tested for 2NF first before it is tested
     for 3NF.
Boyce-Codd Normal Form
    Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was
     found to be stricter than 3NF.
    That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is not
     necessarily in BCNF.
    We pointed out in the last subsection that although 3NF allows functional
     dependencies that conform to the clause (b) in the 3NF definition, BCNF disallows
     them and hence is a stricter definition of a normal form.
Definition
    The formal definition of BCNF differs from the definition of 3NF in that clause (b) of
     3NF, which allows f.d.’s having the RHS as a prime attribute, is absent from BCNF.
    That makes BCNF a stronger normal form compared to 3NF. In our example, FD5
     violates BCNF in LOTS1A because Area is not a superkey of LOTS1A.
    Achieving the normalization status of just 1NF or 2NF is not considered adequate,
     since both were developed historically to be intermediate normal forms as stepping
     stones to 3NF and BCNF.
    For Example:-
    Hence this relation is in 3NF but not BCNF. Decomposition of this relation schema
     into two schemas is not straightforward because it may be decomposed into one of
     the three following possible pairs:
    If two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4
     should also exist in r with the following properties, where we use Z to denote (R − (X
       Y)):
    We now present the definition of fourth normal form (4NF), which is violated when a
     relation has undesirable multivalued dependencies and hence can be used to identify
     and decompose such relations.
    A relation schema R is in 4NF with respect to a set of dependencies F (that includes
     functional dependencies and multivalued dependencies) if, for every nontrivial
     multivalued dependency X →→ Y in F+, 21 X is a superkey for R.
    The process of normalizing a relation involving the nontrivial MVDs that is not in 4NF
     consists of decomposing it so that each MVD is represented by a separate relation
     where it becomes a trivial MVD.
A join dependency (JD), denoted by JD(R1, R2, … , Rn), specified on relation schema R,
specifies a constraint on the states r of R. The constraint states that every legal state r of R
should have a nonadditive join decomposition into R1, R2, … , Rn.
 Hence, for every such r we have that an MVD is a special case of a JD where n = 2.
    That is, a JD denoted as JD(R1, R2) implies an MVD (R1 ∩ R2) →→ (R1 − R2)(or, by
     symmetry, (R1 ∩ R2) →→ (R2 − R1)).
A relation schema R is in fifth normal form (5NF) (or project-join normal form (PJNF)) with
respect to a set F of functional, multivalued, and join dependencies if, for every nontrivial
join dependency JD(R1, R2, … , Rn) in F+ (that is, implied by F),22 every Ri is a superkey of R.
    For example, if each department has one manager, so that Dept_no uniquely
     determines Mgr_ssn (Dept_no → Mgr_ssn), and a manager has a unique phone
     number called Mgr_phone (Mgr_ssn → Mgr_phone), then these two dependencies
     together imply that Dept_no → Mgr_phone.
    This is an inferred or implied FD and need not be explicitly stated in addition to the
     two given FDs.
    Therefore, it is useful to define a concept called closure formally that includes all
     possible dependencies that can be inferred from the given set F.
Definition
Formally, the set of all dependencies that include F as well as all dependencies that can be
inferred from F is called the closure of F; it is denoted by F+.
    The closure F+ of F is the set of all functional dependencies that can be inferred from
     F. To determine a systematic way to infer dependencies, we must discover a set of
     inference rules that can be used to infer new dependencies from a given set of
     dependencies.
    We consider some of these inference rules next. We use the notation F |=X → Y to
     denote that the functional dependency X → Y is inferred from the set of functional
     dependencies F.
    In the following discussion, we use an abbreviated notation when discussing
     functional dependencies. We concatenate attribute variables and drop the commas
     for convenience.
    Hence, the FD {X,Y} → Z is abbreviated to XY → Z, and the FD {X, Y, Z} → {U, V} is
     abbreviated to XYZ → UV.
    We present below three rules IR1 through IR3 that are well-known inference rules for
     functional dependencies.
    They were proposed first by Armstrong (1974) and hence are known as Armstrong’s
     axioms.
Definition
For each such set of attributes X, we determine the set X+ of attributes that are functionally
determined by X based on F; X+ is called the closure of X under F.
The closure concept is useful in understanding the meaning and implications of attributes or
sets of attributes in a relation.
    Algorithm 15.1 starts by setting X+ to all the attributes in X. By IR1, we know that all
     these attributes are functionally dependent on X.
    Using inference rules IR3 and IR4, we add attributes to X+, using each functional
     dependency in F.
                                         Equivalence
Definition
Definition
inferred from E; that is, E is equivalent to F if both the conditions—E covers F and F covers
E—hold.
    We can determine whether F covers E by calculating X+ with respect to F for each FD
     X → Y in E, and then checking whether this X+ includes the attributes in Y.
    If this is the case for every FD in E, then F covers E. We determine whether E and F
     are equivalent by checking that E covers F and F covers E.
    It is left to the reader as an exercise to show that the following two sets of FDs are
     equivalent:
                                        Minimal cover
    Informally, a minimal cover of a set of functional dependencies E is a set of functional
     dependencies F that satisfies the property that every dependency in E is in the
     closure F+ of F.
    In addition, this property is lost if any dependency from the set F is removed; F must
     have no redundancies in it, and the dependencies in F are in a standard form.
Definition
Definition
    We can always find at least one minimal cover F for any set of dependencies E using
     Algorithm 15.2.
    If several sets of FDs qualify as minimal covers of E by the definition above, it is
     customary to use additional criteria for minimality.
    For example, we can choose the minimal set with the smallest number of
     dependencies or with the smallest total length (the total length of a set of
     dependencies is calculated by concatenating the dependencies and treating them as
     one long character string).
Using the functional dependencies, the algorithms decompose the universal relation schema
R into a set of relation schemas D = {R1, R2, … , Rm} that will become the relational database
schema; D is called a decomposition of R.
Definition
Given a set of dependencies F on R, the projection of F on Ri, denoted by πRi (F) where Ri is
a subset of R, is the set of dependencies X → Y in F+ such that the attributes in X Y are all
contained in Ri.
Formally, a decomposition D = {R1, R2, … , Rm} of R has the lossless (nonadditive) join
property with respect to the set of dependencies F on R if, for every relation state r of R that
satisfies F, the following holds, where * is the NATURAL JOIN of all the relations in D: *(πR1
(r), … , πRm(r)) = r.
    The word loss in lossless refers to loss of information, not to loss of tuples.
    If a decomposition does not have the lossless join property, we may get additional
     spurious tuples after the PROJECT (π) and NATURAL JOIN (*) operations are applied;
     these additional tuples represent erroneous or invalid information.
    We prefer the term nonadditive join because it describes the situation more
     accurately.
    We provide a general procedure for testing whether any decomposition D of a
     relation into n relations is nonadditive with respect to a set of given functional
     dependencies F in the relation; it is presented as Algorithm 15.3.
    Figure 15.1(a) shows how we apply Algorithm 15.3 to the decomposition of the
     EMP_PROJ relation schema from Figure 14.3(b)into the two relation schemas
     EMP_PROJ1 and EMP_LOCS in Figure 14.5(a).
If a decomposition D = {R1, R2, … , Rm} of R has the nonadditive (lossless) join property with
respect to a set of functional dependencies F on R, and if a decomposition Di = {Q1, Q2, … ,
Qk} of Ri has the nonadditive join property with respect to the projection of F on Ri, then the
decomposition D2 = {R1, R2, … , Ri−1, Q1, Q2, … , Qk, Ri+1, … , Rm} of R has the nonadditive
join property with respect to F.
    Now we give an algorithm where we achieve conditions 1 and 2 and only guarantee
     3NF.
    The original lost FDs can be recovered by a JOIN operation over the results of
     decomposition.
    Each time through the loop in Algorithm 15.5, we decompose one relation schema Q
     that is not in BCNF into two relation schemas.
    According to property NJB for binary decompositions and claim 2, the decomposition
     D has the nonadditive join property. At the end of the algorithm, all relation schemas
     in D will be in BCNF.
    It is important to note that the theory of nonadditive join decompositions is based on
     the assumption that no NULL values are allowed for the join attributes.
 As with functional dependencies (FDs), inference rules for MVDs have been developed. It
  is better, though, to develop a unified framework that includes both FDs and MVDs so
  that both types of constraints can be considered together.
 The following inference rules IR1 through IR8 form a sound and complete set for
  inferring functional and multivalued dependencies from a given set of dependencies.
 Assume that all attributes are included in a universal relation schema R = {A1, A2, … , An}
  and that X, Y, Z, and W are subsets of R.
 IR1 through IR3 are Armstrong’s inference rules for FDs alone.
 IR4 through IR6 are inference rules pertaining to MVDs only.
 IR7 and IR8 relate FDs and MVDs.
 In particular, IR7 says that a functional dependency is a special case of a multivalued
  dependency; that is, every FD is also an MVD because it satisfies the formal definition of
  an MVD.
 However, this equivalence has a catch: An FD X → Y is an MVD X →→ Y with the
  additional implicit restriction that at most one value of Y is associated with each value of
  X.
 Given a set F of functional and multivalued dependencies specified on R = {A1, A2, … ,
  An}, we can use IR1 through IR8 to infer the (complete) set of all dependencies
  (functional or multivalued) F+ that will hold in every relation state r of R that satisfies F.
 We again call F+ the closure of F.
Fourth Normal Form Revisited
Definition
    If the relation has nontrivial MVDs, then insert, delete, and update operations on
     single tuples may cause additional tuples to be modified besides the one in question.
    If the update is handled incorrectly, the meaning of the relation may change.
     However, after normalization into 4NF, these update anomalies disappear.
 We can use a slight modification of Algorithm 15.5 to develop Algorithm 15.7, which
  creates a nonadditive join decomposition into relation schemas that are in 4NF (rather
  than in BCNF).
 As with Algorithm 15.5, Algorithm 15.7 does not necessarily produce a decomposition
  that preserves FDs.
After defining JD, we defined the fifth normal form based on it in Fifth normal form has also been known
as project join normal form or PJNF (Fagin, 1979).
In the remaining part of this section, we introduce some other types of dependencies that have been
identified.
Among them, the inclusion dependencies and those based on arithmetic or similar functions are used
frequently.
Inclusion Dependencies
 For example, in the relation each tuple represents an item from an order with a
  particular quantity, and the price per unit for that item. In this relation, (Quantity,
  Unit_price) → Extended_price by the formula.
 A relation schema is said to be in DKNF if all constraints and dependencies that should
  hold on the valid relation states can be enforced simply by enforcing the domain
  constraints and key constraints on the relation.
 For example, consider a relation CAR(Make, Vin#) (where Vin# is the vehicle
  identification number) and another relation MANUFACTURE(Vin#, Country) (where
  Country is the country of manufacture).