Agenda
• Data Normalization – Additional Material
• Homewrok 3
• Kaggle Datasets
• https://www.kaggle.com/datasets
• Exam # 1 Date
1
Normalization
Normalization
Process for evaluating and correcting table structures to
minimize data redundancies .
“Reduces data anomalies”
Normalization is a process that “improves” a database
design by generating relations that are of higher normal
forms.
The objective of normalization:
“to create relations where every dependency is on the key,
the whole key, and nothing but the key”.
2
Normalization
There is a sequence to normal forms:
1NF is considered the weakest,
2NF is stronger than 1NF,
3NF is stronger than 2NF, and
BCNF is considered the strongest
Also,
any relation that is in BCNF, is in 3NF;
any relation in 3NF is in 2NF; and
any relation in 2NF is in 1NF.
3
Normalization
1NF a relation in BCNF, is also
in 3NF
2NF a relation in 3NF is also in
2NF
3NF
a relation in 2NF is also in
1NF
BCNF
4
Normalization
We consider a relation in BCNF to be fully normalized.
The benefit of higher normal forms is that update semantics for
the affected data are simplified.
This means that applications required to maintain the database
are simpler.
A design that has a lower normal form than another design has
more redundancy. Uncontrolled redundancy can lead to data
integrity problems.
First we introduce the concept of functional dependency
5
Functional Dependencies
Functional Dependencies
We say an attribute, B, has a functional dependency on
another attribute, A, if for any two records, which have
the same value for A, then the values for B in these two
records must be the same. We illustrate this as:
AB
Example: Suppose we keep track of employee email
addresses, and we only track one email address for each
employee. Suppose each employee is identified by their
unique employee number. We say there is a functional
dependency of email address on employee number:
employee number email address
6
Functional Dependencies
EmpNum EmpEmail EmpFname EmpLname
123 jdoe@abc.com John Doe
456 psmith@abc.com Peter Smith
555 alee1@abc.com Alan Lee
633 pdoe@abc.com Peter Doe
787 alee2@abc.com Alan Lee
If EmpNum is the PK then the FDs:
EmpNum EmpEmail
EmpNum EmpFname
EmpNum EmpLname
must exist.
7
Functional Dependencies
EmpNum EmpEmail
EmpNum EmpFname 3 different ways
EmpNum EmpLname you might see FDs
depicted
EmpEmail
EmpNum EmpFname
EmpLname
EmpNum EmpEmail EmpFname EmpLname
8
Determinant
Functional Dependency
EmpNum EmpEmail
Attribute on the LHS is known as the determinant
• EmpNum is a determinant of EmpEmail
9
• Product table (Prodcut_code, price, type)
What is the determinant?
10
Transitive dependency
Transitive dependency
Consider attributes A, B, and C, and where
A B and B C.
Functional dependencies are transitive, which
means that we also have the functional dependency
AC
We say that C is transitively dependent on A
through B.
11
Transitive dependency
EmpNum DeptNum
EmpNum EmpEmail DeptNum DeptNname
DeptNum DeptName
EmpNum EmpEmail DeptNum DeptNname
DeptName is transitively dependent on EmpNum via DeptNum
EmpNum DeptName
12
• Table A (Student_id, Student_name, Prof_id,
Prof_name)
– How many functional dependencies are there in
the relation (table)? What are they?
– Where is the “transitive dependency”?
May 2012 91.2814 13
Partial dependency
A partial dependency exists when an attribute B is
functionally dependent on an attribute A, and A is a
component of a multipart candidate key.
InvNum LineNum Qty InvDate
Candidate keys: {InvNum, LineNum} InvDate is
partially dependent on {InvNum, LineNum} as
InvNum is a determinant of InvDate and InvNum is
part of a candidate key
14
15
First Normal Form
First Normal Form
We say a relation is in 1NF if all values stored in the
relation are single-valued and atomic.
1NF places restrictions on the structure of relations.
Values must be simple.
16
First Normal Form
The following in not in 1NF
EmpNum EmpPhone EmpDegrees
123 233-9876
333 233-1231 BA, BSc, PhD
679 233-1231 BSc, MSc
EmpDegrees is a multi-valued field:
employee 679 has two degrees: BSc and MSc
employee 333 has three degrees: BA, BSc, PhD
17
Second Normal Form
Second Normal Form
A relation is in 2NF if it is in 1NF, and every non-key attribute
is fully dependent on each candidate key. (That is, we don’t
have any partial functional dependency.)
• 2NF (and 3NF) both involve the concepts of key and
non-key attributes.
• A key attribute is any attribute that is part of a key;
any attribute that is not a key attribute, is a non-key
attribute.
•A relation in 2NF will not have any partial dependencies
18
Second Normal Form
Consider this InvLine table (in 1NF):
InvNum LineNum ProdNum Qty InvDate
InvNum, LineNum ProdNum, Qty
There are two
candidate keys.
InvDate is the only
non-key attribute, and
InvNum InvDate it is dependent on
InvLine is not 2NF since there is a partial InvNum
dependency of InvDate on InvNum
InvLine is
only in 1NF
19
Second Normal Form
InvLine
InvNum LineNum ProdNum Qty InvDate
The above relation has redundancies: the invoice date is
repeated on each invoice line.
We can improve the database by decomposing the relation
into two relations:
InvNum LineNum ProdNum Qty
InvNum InvDate
Question: What is the highest normal form for these
relations? 2NF? 3NF? BCNF?
20
2NF, but not in 3NF:
EmployeeDept
ename ssn bdate address dnumber dname
since dnumber is not a candidate key and we have:
dnumber dname.
22
Third Normal Form
Third Normal Form
• A relation is in 3NF if the relation is in 1NF and all
determinants of non-key attributes are candidate keys
That is, for any functional dependency: X Y, where Y is
a non-key attribute (or a set of non-key attributes), X is a
candidate key.
• This definition of 3NF differs from BCNF only in the
specification of non-key attributes - 3NF
• A relation in 3NF will not have any transitive dependencies
of non-key attribute on a candidate key through another
non-key attribute.
23
Third Normal Form
Consider this Employee relation Candidate keys
are? …
EmpNum EmpName DeptNum DeptName
EmpName, DeptNum, and DeptName are non-key attributes.
DeptNum determines DeptName, a non-key attribute, and
DeptNum is not a candidate key.
Is the relation in 3NF? … no Is the relation in BCNF? … no
Is the relation in 2NF? … yes
24
Third Normal Form
EmpNum EmpName DeptNum DeptName
We correct the situation by decomposing the original relation
into two 3NF relations. Note the decomposition is lossless.
EmpNum EmpName DeptNum DeptNum DeptName
Verify these two relations are in 3NF.
25
Patient relation example
Patient # Surgeon # Surg. date Patient Name Patient Addr Surgeon Surgery Postop drug
Drug side effects
Gallstone
s removal;
Jan 1, 15 New St. Beth Little Kidney
145 1995; June New York, Michael stones Penicillin, rash
1111 311 12, 1995 John White NY Diamond removal none- none
Eye
Charles Cataract
Apr 5, Field removal
243 1994 May 10 Main St. Patricia Thrombos Tetracyclin Fever
1234 467 10, 1995 Mary Jones Rye, NY Gold is removal e none none
Dogwood
Lane Open
Jan 8, Harrison, David Heart Cephalosp
2345 189 1996 Charles Brown NY Rosen Surgery orin none
55 Boston
Post Road,
Nov 5, Chester, Cholecyst
4876 145 1995 Hal Kane CN Beth Little ectomy Demicillin none
Blind Brook Gallstone
May 10, Mamaronec s
5123 145 1995 Paul Kosher k, NY Beth Little Removal none none
Eye
Cornea
Replacem
Apr 5, Hilton Road ent Eye
1994 Dec Larchmont, Charles cataract Tetracyclin
6845 243 15, 1984 Ann Hood NY Field removal e Fever
26
Patient #
Unnormalized Relation
Surgeon # Surg. date Patient Name Patient Addr Surgeon Surgery Postop drug
Drug side effects
Gallstone
s removal;
Jan 1, 15 New St. Beth Little Kidney
145 1995; June New York, Michael stones Penicillin, rash
1111 311 12, 1995 John White NY Diamond removal none- none
Eye
Charles Cataract
Apr 5, Field removal
243 1994 May 10 Main St. Patricia Thrombos Tetracyclin Fever
1234 467 10, 1995 Mary Jones Rye, NY Gold is removal e none none
Dogwood
Lane Open
Jan 8, Harrison, David Heart Cephalosp
2345 189 1996 Charles Brown NY Rosen Surgery orin none
55 Boston
Post Road,
Nov 5, Chester, Cholecyst
4876 145 1995 Hal Kane CN Beth Little ectomy Demicillin none
Blind Brook Gallstone
May 10, Mamaronec s
5123 145 1995 Paul Kosher k, NY Beth Little Removal none none
Eye
Cornea
Replacem
Apr 5, Hilton Road ent Eye
1994 Dec Larchmont, Charles cataract Tetracyclin
6845 243 15, 1984 Ann Hood NY Field removal e Fever
First Normal Form
• To move to First Normal Form a relation must
contain only atomic values at each row and
column.
– No repeating groups
– A column or set of columns is called a Candidate
Key when its values can uniquely identify the row
in the relation.
First Normal Form
Patient # Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name Surgery Drug adminSide Effects
15 New St.
New York, Gallstone
1111 145 01-Jan-95 John White NY Beth Little s removal Penicillin rash
15 New St. Kidney
New York, Michael stones
1111 311 12-Jun-95 John White NY Diamond removal none none
Eye
10 Main St. Cataract Tetracyclin
1234 243 05-Apr-94 Mary Jones Rye, NY Charles Field removal e Fever
10 Main St. Thrombos
1234 467 10-May-95 Mary Jones Rye, NY Patricia Gold is removal none none
Dogwood
Lane Open
Charles Harrison, Heart Cephalosp
2345 189 08-Jan-96 Brown NY David Rosen Surgery orin none
55 Boston
Post Road,
Chester, Cholecyst
4876 145 05-Nov-95 Hal Kane CN Beth Little ectomy Demicillin none
Blind Brook Gallstone
Mamaronec s
5123 145 10-May-95 Paul Kosher k, NY Beth Little Removal none none
Eye
Hilton Road Cornea
Larchmont, Replacem Tetracyclin
6845 243 05-Apr-94 Ann Hood NY Charles Field ent e Fever
Hilton Road Eye
Larchmont, cataract
6845 243 15-Dec-84 Ann Hood NY Charles Field removal none none
1NF Storage Anomalies
• Insertion: A new patient has not yet undergone surgery --
hence no surgeon # -- Since surgeon # is part of the key we
can’t insert.
• Insertion: If a surgeon is newly hired and hasn’t operated yet -
- there will be no way to include that person in the database.
• Update: If a patient comes in for a new procedure, and has
moved, we need to change multiple address entries.
• Deletion (type 1): Deleting a patient record may also delete all
info about a surgeon.
• Deletion (type 2): When there are functional dependencies
(like side effects and drug) changing one item eliminates other
information.
Second Normal Form
• A relation is said to be in Second Normal Form
when every nonkey attribute is fully
functionally dependent on the primary key.
– That is, every nonkey attribute needs the full
primary key for unique identification
Second Normal Form
Patient # Patient Name Patient Address
15 New St. New
1111 John White York, NY
10 Main St. Rye,
1234 Mary Jones NY
Charles Dogwood Lane
2345 Brown Harrison, NY
55 Boston Post
4876 Hal Kane Road, Chester,
Blind Brook
5123 Paul Kosher Mamaroneck, NY
Hilton Road
6845 Ann Hood Larchmont, NY
Second Normal Form
Surgeon # Surgeon Name
145 Beth Little
189 David Rosen
243 Charles Field
311 Michael Diamond
467 Patricia Gold
Second Normal Form
Patient # Surgeon # Surgery Date Surgery Drug Admin Side Effects
Gallstones
1111 145 01-Jan-95 removal
Kidney Penicillin rash
stones
1111 311 12-Jun-95 removal none none
Eye Cataract
1234 243 05-Apr-94 removal Tetracycline Fever
Thrombosis
1234 467 10-May-95 removal none none
Open Heart Cephalospori
2345 189 08-Jan-96 Surgery n none
Cholecystect
4876 145 05-Nov-95 omy Demicillin none
Gallstones
5123 145 10-May-95 Removal none none
Eye cataract
6845 243 15-Dec-84 removal none none
Eye Cornea
6845 243 05-Apr-94 Replacement Tetracycline Fever
1NF Storage Anomalies Removed
• Insertion: Can now enter new patients without
surgery.
• Insertion: Can now enter Surgeons who haven’t
operated.
• Deletion (type 1): If Charles Brown dies the
corresponding tuples from Patient and Surgery tables
can be deleted without losing information on David
Rosen.
• Update: If John White comes in for third time, and
has moved, we only need to change the Patient table
2NF Storage Anomalies
• Insertion: Cannot enter the fact that a particular drug
has a particular side effect unless it is given to a
patient.
• Deletion: If John White receives some other drug
because of the penicillin rash, and a new drug and
side effect are entered, we lose the information that
penicillin can cause a rash
• Update: If drug side effects change (a new formula)
we have to update multiple occurrences of side
effects.
Third Normal Form
• A relation is said to be in Third Normal Form if there
is no transitive functional dependency between
nonkey attributes
– When one nonkey attribute can be determined with one
or more nonkey attributes there is said to be a transitive
functional dependency.
• The side effect column in the Surgery table is
determined by the drug administered
– Side effect is transitively functionally dependent on drug
so Surgery is not 3NF
Third Normal Form
Patient # Surgeon # Surgery Date Surgery Drug Admin
1111 145 01-Jan-95 Gallstones removal Penicillin
Kidney stones
1111 311 12-Jun-95 removal none
1234 243 05-Apr-94 Eye Cataract removal Tetracycline
1234 467 10-May-95 Thrombosis removal none
2345 189 08-Jan-96 Open Heart Surgery Cephalosporin
4876 145 05-Nov-95 Cholecystectomy Demicillin
5123 145 10-May-95 Gallstones Removal none
6845 243 15-Dec-84 Eye cataract removal none
Eye Cornea
6845 243 05-Apr-94 Replacement Tetracycline
Third Normal Form
Drug Admin Side Effects
Cephalosporin none
Demicillin none
none none
Penicillin rash
Tetracycline Fever
2NF Storage Anomalies Removed
• Insertion: We can now enter the fact that a
particular drug has a particular side effect in
the Drug relation.
• Deletion: If John White recieves some other
drug as a result of the rash from penicillin, but
the information on penicillin and rash is
maintained.
• Update: The side effects for each drug appear
only once.