KEMBAR78
Bit3107 Database Systemsii | PDF | Databases | Data Model
0% found this document useful (0 votes)
42 views171 pages

Bit3107 Database Systemsii

The document outlines the course structure for Database Systems II (BIT3107), detailing expected learning outcomes, course content, assessment methods, and recommended texts. Students will evaluate various database management techniques, explore advanced SQL, and understand emerging trends in database technology. The course includes a comprehensive weekly breakdown of topics, including data modeling, transaction management, database security, and object-oriented databases.

Uploaded by

grantonkipkoech
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views171 pages

Bit3107 Database Systemsii

The document outlines the course structure for Database Systems II (BIT3107), detailing expected learning outcomes, course content, assessment methods, and recommended texts. Students will evaluate various database management techniques, explore advanced SQL, and understand emerging trends in database technology. The course includes a comprehensive weekly breakdown of topics, including data modeling, transaction management, database security, and object-oriented databases.

Uploaded by

grantonkipkoech
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 171

DEPARTMENT OF INFORMATION TECHNOLOGY

UNIT CODE: BIT3107

UNIT NAME: DATABASE SYSTEMS II


ii
COURSE OUTLINE

Contact hours: 42
Pre-requisites: Database Systems BIT 2102
Purpose:
Expected Learning Outcomes of the Course:
By the end of the course, students should be able to:-
i. Evaluate various file organization methods, distribution strategies, concurrency
mechanisms, query optimization techniques, and database recovery and security
procedures.
ii. Critically examine and reflect on the theoretical aspect of the process of normalization
and its practical implementation.
iii. Evaluate and use various modern database management systems tools in building a
small but realistic database system.
iv. Describe advances in trends and emerging standards in database technology;
understand the comparative features and crucial aspects of data and knowledge
management technology and the impact of the Internet on database technology

Course Content:

1. Physical DB organization; Database Administration;


2. Transaction Management and Concurrency Control; Database Recovery
3. Advanced sql, Inner joins, outer joins, left outer joins.
4. Database Security;
5. Query Evaluation, Processing and Optimization.
6. Weaknesses of Relational DBMSs;
7. Distributed Databases; Client/Database Server Architecture; Homogeneous,
Heterogeneous and Federated Distributed Database Management Systems; Schema
Architecture for Distributed DBMS;
8. Design Issues: Data Fragmentation, Allocation and Replication;
9. Object-Oriented Databases, Object Modeling Concepts and Object-Oriented DBMSs:
10. Architecture of OODBMS; The OODBS manifesto; Object-Relational Database
management system; Comparison of ORDBMS and OODBMS; The ODMG Standards.
11. Advances in Databases: Data Warehousing; OLAP vs. OLTP; Multidimensional Modeling;
12. Database and the Internet Technology; Web enabled Databases and XML technology.

i
Course Assessment

Examination - 70%
Continuous Assessment Test (CATS) - 20%
Assignments - 10%
Total - 100%

Recommended Text Books:

i. Database Systems by Connolly & Begg 5th Edition


ii. Fundamentals of Database Systems by Elmasri & Navathe 5th edition
iii. Date C. J. (1990) An Introduction to Database Systems. Volume 1, 5th Edition Addison
Wesley
iv. Korth HF & Silberschatz A (1991) Database System Concepts Second Edition McGraw-Hill

Text Books for further Reading:

i. Object Data Management. OO & Extended Relational Database, Systems Addison-Wesley


Publishing Atkinson,M., DeWitt,D., Maier,D.,Bancilhon,F., Dittrich,K., Zdonik,S.(1990)
ii. Object-Oriented Database Systems Morgan Kaufmann Publishers Won Kim (1991)
iii. Introduction to Object-Oriented Databases, The MIT Press, Cambridge
(Massachusetts)
Unland,R., Schlagetter,G.
iv. Object-Oriented Database Systems: State of the Art and Research Problems. In: Expert
Database Systems, 1992, Academic Press Ltd

ii
COURSE OUTLINE

WEEK ONE : INTRODUCTION TO DATABASE SYSTEMS


Database Systems ......................................................................................................................................................................................... 1
Users and their Roles..................................................................................................................................................................................... 2
Data and Database Administration ................................................................................................................................................................ 2
Database Architecture ................................................................................................................................................................................... 4
Database Development Life Cycle .................................................................................................................................................................. 4
Database Design............................................................................................................................................................................................ 5
Conceptual database design .......................................................................................................................................................................... 5
Logical Database Design for the Relational Model .......................................................................................................................................... 5
Physical Database Design for Relational Databases ........................................................................................................................................ 6
Data model ................................................................................................................................................................................................... 6
Relational databases ..................................................................................................................................................................................... 7
Traditional File Processing ............................................................................................................................................................................. 8
Database Management System (DBMS) ........................................................................................................................................................10
Components of the Database Environment...................................................................................................................................................11
Enterprise Database Applications .................................................................................................................................................................12
History of Database Systems ........................................................................................................................................................................14
Review Questions.........................................................................................................................................................................................15

WEEK TWO : DATA MODELING


Entity Modeling ............................................................................................................................................................................................16
Data Models:................................................................................................................................................................................................16
Conceptual Data Modeling: ..........................................................................................................................................................................19
Review Questions.........................................................................................................................................................................................23

WEEK THREE: RELATIONAL DATABASE SYSTEMS


Relational Database Systems ........................................................................................................................................................................25
Relational Data Structures ............................................................................................................................................................................25
Properties of Relations .................................................................................................................................................................................26
Relational Data Languages............................................................................................................................................................................26
Data Manipulation Language ........................................................................................................................................................................26
Relational Algebra ........................................................................................................................................................................................27
Data Definition Language .............................................................................................................................................................................28
Data Definition Language (DDL) ............................................................................................................................................................................. 28
Review Questions.........................................................................................................................................................................................28

WEEK FOUR: QUERY OPTIMISATION


Query optimisation ......................................................................................................................................................................................30
Four Stages of Optimisation Process .........................................................................................................................................................35
Some General Principles...............................................................................................................................................................................38
Review Questions.........................................................................................................................................................................................43

WEEK FIVE : INTRODUCTION TO SQL


Introduction to SQL ......................................................................................................................................................................................43
SQL Transactions ..........................................................................................................................................................................................47
Joined Relations ...........................................................................................................................................................................................48
Advanced SQL ..............................................................................................................................................................................................57
Embedded SQL .............................................................................................................................................................................................59
Review Questions.........................................................................................................................................................................................60

WEEK SIX : TRANSACTION MANAGEMENT


Transaction management .............................................................................................................................................................................61
Transaction State .........................................................................................................................................................................................62
Concurrent Executions ..................................................................................................................................................................................64
Interference Between Concurrent Transactions ............................................................................................................................................67
The Lost Update Problem .............................................................................................................................................................................67
Uncommitted Dependency Problem .............................................................................................................................................................67
Inconsistent Analysis Problem ......................................................................................................................................................................68
Locking .........................................................................................................................................................................................................69
Deadlocks ....................................................................................................................................................................................................71
Concurrency Control without Locking ...........................................................................................................................................................73
Review Questions.........................................................................................................................................................................................74

iii
WEEK SEVEN: DATABASE SECURITY
Characteristics of database security ..............................................................................................................................................................75
Threats.........................................................................................................................................................................................................75
Causes of Failures: ........................................................................................................................................................................................77
Recovery Procedures ....................................................................................................................................................................................78
Database Recovery Features.........................................................................................................................................................................78
DBMS Recovery Facilities..............................................................................................................................................................................78
Review Questions.........................................................................................................................................................................................79

WEEK EIGHT: DISTRIBUTED DBMS


Distributed DBMS.........................................................................................................................................................................................82
Parallel DBMS ..............................................................................................................................................................................................82
Multi database System (MDBS) ....................................................................................................................................................................84
Functions of a DDBMS ..................................................................................................................................................................................85
Distributed Database Design ........................................................................................................................................................................87
Fragmentation .............................................................................................................................................................................................87
Data Allocation ............................................................................................................................................................................................88
Replication ...................................................................................................................................................................................................89
Types of Fragmentation ................................................................................................................................................................................90
Date’s 12 Rules for a DDBMS ........................................................................................................................................................................97
Review Questions.........................................................................................................................................................................................98

WEEK NINE : OBJECT-ORIENTED DATABASE SYSTEMS


Object-Oriented Database Systems ..............................................................................................................................................................99
Origins of Object-Oriented Database Concepts ........................................................................................................................................... 100
OODBS Golden Rules: ................................................................................................................................................................................. 101
OODBS: Products & Prototypes .................................................................................................................................................................. 107
OODBS: Future Directions & Research ........................................................................................................................................................ 108
SQL3 .......................................................................................................................................................................................................... 108
ODMG major components .......................................................................................................................................................................... 110
Review Questions....................................................................................................................................................................................... 111
1. Explain what are Database Objects? ....................................................................................................................................................... 111

WEEK TEN: WEB DATABASES


Web Technology and DBMSs Transparencies .............................................................................................................................................. 112
Requirements for Web-DBMS Integration ................................................................................................................................................... 116
Advantages of Web-DBMS Approach .......................................................................................................................................................... 117
Disadvantages of Web-DBMS Approach...................................................................................................................................................... 117
Approaches to Integrating Web and DBMSs................................................................................................................................................ 118
Review Questions....................................................................................................................................................................................... 140

WEEK ELEVEN: DATA WAREHOUSING AND DATA MINING


Introduction To Data Warehousing ............................................................................................................................................................. 141
Data Warehousing Architecture.................................................................................................................................................................. 142
Data Mart .................................................................................................................................................................................................. 143
Introduction to Data Mining ....................................................................................................................................................................... 144
Data Mining Functionalities ........................................................................................................................................................................ 149
Major Issues in Data Mining ....................................................................................................................................................................... 150
Review Questions....................................................................................................................................................................................... 151

WEEK TWELVE: MULTI-DIMENSIONAL DATABASES


MDD and Relational Databases .................................................................................................................................................................. 154
MDD Features: Multidimensional Computations......................................................................................................................................... 163
Benefits of MDD ......................................................................................................................................................................................... 163
Review Questions....................................................................................................................................................................................... 164
References ................................................................................................................................................................................................. 165

iv
Database and Database System

Week One

Introduction to Database Systems


A Database is an organized collection of logically related data. The purpose of database is to
store information about certain types of objects termed entities or objects.
A database is a shared collection of logically related data designed to meet the information
needs of an organization.
Components of Database Systems
1. Database
2. Hardware
3. Software - DBMS
4. Procedures
5. Users

Database
The data in the database will be expected to be both integrated and shared particularly on
multi-user systems
Integration - The database may be thought of as a unification of several otherwise
distinct files, with any redundancy among these files eliminated
Shared - individual pieces of data in the database may be shared among several
different users

Data: Refers to stored representations of meaningful objects and events. Includes


– Structured: numbers, text, dates
– Unstructured (multimedia): images, video, documents
Information: Refers to data processed to increase knowledge of the person who uses the data
– Place data in a context
– Involves summarizing, processing and presentation of data
Metadata
Metadata: Is data that describes the properties or characteristics of user data
– It is the primary mechanism for providing context for data
– Data without metadata can be confusing, misinterpreted, or erroneous

1
o

Hardware
These are secondary storage on which the database physically resides, together with the
associated I/O devices, device controllers etc.

Procedures
These are instructions and rules that govern design and use of databases e.g. starting and
stopping database, making backups, handling failures.

Users and their Roles


 Database Designers - designs conceptual and logical database
 Application Developers - writes application programs that use the database
 Data and Database Administrator
 End - user - interacts with the system from an on-line terminal by using Query Languages
etc.

Data and Database Administration


1. Data Administrator – Is a business manager responsible for controlling the overall
corporate data resources

2
2. Database Administrator (DBA) – Is a technical person responsible for development of
the total system

Advantages of the Database Approach


 Program-data independence: the separation of data description from the application
program; metadata is stored in central repository
 Planned data redundancy: Control of data redundancy
 Improved data consistency
 Improved data sharing and productivity
 User view: a logical description of some portion of the database that is required by a
user to perform some task
 Increased application development productivity
 Enforcement of standards
 Naming conventions, data quality standards and uniform procedures for accessing,
updating and protecting data
 Improved data quality
 Integrity constraints
 Improved data accessibility and responsiveness
 SQL (structured query language) can be used to extract meaning from data.
 Reduced program maintenance
 Improved decision support
 Multipurpose use of data
 Balance conflicting user requirement
 Improved data accessibility and responsiveness
 Improved maintenance through data independence
 Increased concurrency
 Improved backup and recovery services.

Disadvantages: - Costs and Risks of the Database Approach


 Complexity
 Size
 Cost of DBMS : Installation and management cost and complexity
 Additional hardware costs
 New, specialized personnel
 Conversion costs
 Need for explicit backup and recovery
 Organizational conflict

3
Database Architecture

External Level – concerned with the way users perceive the database
Conceptual Level – concerned with abstract representation of the database in its entirety
Internal Level – concerned with the way data is actually stored

Data Independence
1. Logical Data Independence – users and user programs are independent of logical
structure of the database
2. Physical Data Independence – the separation of structural information about the data
from the programs that manipulate and use the data i.e. the immunity of application
programs to changes in the storage structure and access strategy

Database Development Life Cycle


This consists of
– Database planning
4
– System definition
– Requirement collection and analysis
– Database design
– DBMS selection
– Application design
– Prototyping
– Implementation
– Data conversion and loading
– Testing
– Operational maintenance

Database Design
Conceptual database design – Is the process of constructing a model of the information used in
an organization, independent of all physical considerations?
Step 1 Build local conceptual data model for each user view
– Identify entity types
– Identify relationship types
– Identify and associate attributes with entity or relationship
– Determine attributes domains
– Determine candidate and primary key attributes
– Specialize/generalize entity types (optional step)
– Draw Entity-Relationship diagram
– Review local conceptual data model with user
Logical Database Design for the Relational Model: the process of constructing a model of the
info used in an organization based on a specific data model, but independent of a particular
DBMS and other physical considerations
Step 2 Build and validate local data model for each user view
– Map local conceptual data model to local logical data model
– Derive relations from local logical data model
– Validate model using normalization
– Validate model against user transactions
– Draw Entity-Relationship diagram
– Define integrity constraints
– Review local logical data model with user
Step 3 Build and validate global logical data model
– Merge local logical data models into global model

5
– Validate global data model
– Check for future growth
– Draw final Entity-Relationship diagram
– Review global logical data model with users
Physical Database Design for Relational Databases: the process of producing a description of
the implementation of the database on secondary storage.
Step 4 Translate global data model for target DBMS
– Design base relations
– Design enterprise constraints for target DBMS
Step 5 Design physical representations
– Analyze transactions
– Choose file organizations
– Choose secondary indexes
– Consider the introduction of controlled redundancy
– Estimate disk space requirements
Step 6 Design security mechanisms
– Design user views
– Design access rules
Step 7 Monitor and tune the operational system

Data model
A data model is a collection of tools for describing
– Data
– Data relationships
– Data semantics
– Data constraints
Data models include the following
§ Relational model
§ Entity-Relationship data model (mainly for database design)
§ Object-based data models (Object-oriented and Object-relational)
§ Semi structured data model (XML)
§ Other older models:
Network model
Hierarchical model

6
A Data model consists of
Entities
– A person, place, object, event or concept in the user environment about which
the organization wishes to maintain data.
Relationships
– Association between entities
– 1:M, M:N

Relational databases
A Relational Database is: -
– A database that represents data as a collection of tables in which all data
relationships are represented by common values in related tables
– Database technology involving tables (relations) representing entities and
primary/foreign keys representing relationships

7
Traditional File Processing
Dates back to before we had databases, used to store, manipulate and retrieve large files of
data to support business operation
It is still in use today, including backup of database systems
This approach to information systems design focused on the data processing needs of individual
departments, instead of evaluating the overall information needs of the organization.

8
Disadvantages of File Processing
Drawbacks of using file systems to store data include
i. Program-Data Dependence. All programs maintain metadata for each file they use
ii. Data redundancy and inconsistency. Multiple file formats, duplication of information in
different files.
iii. Duplication of Data results in different systems/programs having separate copies of the
same data i.e multiple files
iv. Difficulty in accessing data. Need to write a new program to carry out each new task
v. Integrity problems. Integrity constraints (e.g., account balance > 0) become “buried” in
program code rather than being stated explicitly
vi. Limited Data Sharing. No centralized control of data
vii. Hard to add new constraints or change existing ones: Lengthy Development Times.
Programmers must design their own file formats
viii. Atomicity of updates. Failures may leave database in an inconsistent state with partial
updates carried out. Example: Transfer of funds from one account to another should
either complete or not happen at all
ix. Concurrent access by multiple users. Concurrent access needed for performance
Uncontrolled concurrent accesses can lead to inconsistencies –Example: Two people
reading a balance (say 100) and updating it by withdrawing money (say 50 each) at the
same time
x. Security problems. Hard to provide user access to some, but not all, data

9
xi. Excessive Program Maintenance. Can take up to 80% of information systems budget

Database systems offer solutions to all the above problems

Problems with Data Dependency


i. Each application programmer must maintain his/her own data
ii. Each application program needs to include code for the metadata of each file
iii. Each application program must have its own processing routines for reading, inserting,
updating, and deleting data
iv. Lack of coordination and central control
v. Non-standard file formats

Problems with Data Redundancy


i. Waste of space to have duplicate data
ii. Causes more maintenance headaches
iii. The biggest problem is :
o Data changes in one file could cause inconsistencies
o Compromises in data integrity

SOLUTION:
The solution to these problems is the DATABASE Approach. The approach has the following
characteristics
i. Emphasizes the integration and sharing of data throughout the organization
ii. Central repository of shared data
iii. Data is managed by a controlling agent
iv. Stored in a standardized, convenient form

The solution requires a Database Management System (DBMS)

Database Management System (DBMS)


A Database Management System contains information about a particular enterprise
– Collection of interrelated data
– Set of programs to access the data
– An environment that is both convenient and efficient to use

Examples of Database Applications:


Banking: transactions
Airlines: reservations, schedules
Universities: registration, grades
Sales: customers, products, purchases

10
Online retailers: order tracking, customized recommendations
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions

Databases can be very large. Databases touch all aspects of our lives

Components of the Database Environment


 CASE Tools–computer-aided software engineering
 Repository–centralized storehouse of metadata
 Database Management System (DBMS) –software for managing the database
 Database–storehouse of the data
 Application Programs–software using the data
 User Interface–text and graphical displays to users
 Data/Database Administrators–personnel responsible for maintaining the database
 System Developers–personnel responsible for designing databases and software
 End Users–people who use the applications and databases

The Range of Database Applications includes


Personal databases
– Single user, improve personal productivity
– Used on PC, PDA, cell phone
11
Workgroup databases
– Collaborative project within small team of people
– Database server and client are LAN connected
– Data security, data integrity and concurrent updating issues
Departmental/divisional databases
– Functional unit within an organization
– Most common type of database
Enterprise database
– Support organization-wide operation and decision making

Enterprise Database Applications: These typically consist of


Enterprise Resource Planning (ERP)
– Business management systems that integrate all enterprise functions
(manufacturing, finance, sales, marketing, inventory, accounting, human
resources)
Data Warehouse
– Integrated decision support system derived from various operational databases
– Work with historical data to identify patterns and trends; help form business
strategies

12
Types of databases:
Databases can be classified by:
1. Location
Centralized:
– Supports data located at a single site
Distributed:
– Supports data distributed across several sites
2. use:
Transactional (or production):
– Supports a company’s day-to-day operations
Data warehouse:
– Stores data used to generate information required to make tactical or
strategic decisions
– Often used to store historical data
– Structure is quite different

Data Definition Language (DDL)


Is a specification notation for defining the database schema

Example: create table instructor (


ID char(5),
name varchar(20),
dept_name varchar
(20), salary numeric(8,2))

The DDL compiler generates a set of table templates stored in a data dictionary
The Data dictionary contains metadata (i.e., data about data). This consists of
o Database schema
o Integrity constraints
– Primary key (ID uniquely identifies instructors)
– Referential integrity (references constraint in SQL) e.g. dept_name value
in any instructor tuple must appear in department relation
o Authorization

Structured query language


SQL: Is a widely used non-procedural language
Example: Find the name of the instructor with ID 22222
select name
from instructor
13
where instructor.ID = ‘22222’

Example: Find the ID and building of instructors in the Physics dept.


select instructor.ID, department.building
from instructor, department
where instructor.dept_name = department.dept_name and department.dept_name =
‘Physics’
 Application programs generally access databases through one of Language extensions to
allow embedded SQL
 Application program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a
database

History of Database Systems


1950s and early 1960s:
Data processing using magnetic tapes for storage Tapes provided only sequential access
Punched cards for input
Late 1960s and 1970s:
Hard disks allowed direct access to data
Network and hierarchical data models in widespread use
Ted Codd defines the relational data model Would win the ACM Turing Award for this
work
IBM Research begins System R prototype
UC Berkeley begins Ingres prototype
High-performance (for the era) transaction processing
1980s:
Research relational prototypes evolve into commercial systems SQL becomes industrial
standard
Parallel and distributed database systems
Object-oriented database systems

1990s:
Large decision support and data-mining applications
Large multi-terabyte data warehouses
Emergence of Web commerce
Early 2000s:
XML and XQuery standards
Automated database administration
Later 2000s:
Giant data storage systems Google BigTable, Yahoo PNuts, Amazon, ..

14
Review Questions
1. Define the following terms
 Data
 Meta data
 End user data
 Database
 Database management system
 Database system
2. Discuss the main characteristics of database approach and how it defers from traditional
file system.
3. Discuss the components of database systems
4. Compare any three common databases used in the corporate world.
5. Discuss the capabilities provided by a typical DBMS.
6. Outline the main components of a DBMS

15
Week Two

Conceptual Data Modeling: Entity Modeling

Data Models:
A database can be modeled as: a collection of entities and relationship among entities.
§ An Entity Model (also called Entity-Relationship Model) is a representation of the data
used by a system. It shows how data is logically grouped together and the relationships
between these logical data groupings as defined by the business requirements of the
system.
§ An Entity Model comprises:
§ Entity Relationship Diagram (ERD)
§ Associated Documentation of Entities, Relationships and Attributes.
Entity
An entity is an object that exists and is distinguishable from other objects. Example: a specific
person, company, event, plant etc.
§ It is a logical grouping of data relevant to the application under investigation. An
identifiable object, concept or activity belonging to the application about which
descriptive data is stored.
§ An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays

Entity Type Vs Entity Occurrence:


Entity-type refers to a generic class of things such as Company. Entity is the short form
of entity-type.

Entity-occurrence refers to specific instances or examples of a type. For example, one


occurrence of the entity Car is Chevrolet Cavalier.

An entity usually has attributes (i.e., data elements) that further describe it. Each
attribute is a characteristic of the entity. An entity must possess a set of one or more
attributes that uniquely identify it (called a primary key).

The entities on an Entity-Relationship Diagram are represented by boxes (i.e.,


rectangles). The name of the entity is placed inside the box.

Attribute
Entities have attributes Example: people have names and addresses
16
An attribute is a property of an entity that distinguishes one entity occurrence from
another
§ There are two types: Key Attribute and Non-Key Attributes
§ Attributes can belong to an Attribute Domain - set of values that may be assigned to an
attribute and they can be:
– Simple attribute or Composite attribute - multiple components
– Single-valued or Multi-valued
– Derived attribute
Relationships
§ Relationship is a meaningful association between two or more entities
§ Degree of a Relationship refers to the number participating entities in a relationship

§ One-to-one - For any A there may only be one member of B and for any B there is only
one member of A associated with it at any time.
§ One to Many - For any A there may be many members of B and for any B there is only
one member of A associated with it.
Example Customer – Order
§ Many to Many - For any A there may be many members of B and for any B there may be
many members of A associated with it. Example Customer – Product
§ One-to-One relationships are, in general, discouraged and it is normal to merge the two
entities involved into one
§ One-to-Many relationships defines the association between a master entity and a detail
entity
§ Many-to-Many relationships must be decomposed into One-to-Many relationships
Constructing the Entity Model
Step 1 - Develop the Initial Entities
§ Identify the objects about which the system currently holds data. This may be done
using interviews, notes, document reviews, observation
Step 2 - Identify Initial Relationships
§ Successful identification of relationships between entities demands constant reference
to the business requirements and environments of the system.
Step 3 Validate against Current System
§ The initial ERD is validated with users. The ERD is logical and so does not show physical
aspects of the current system, instead of "Blue Enquiry Form", "Order Form", "Employee
Master Listing“use "Enquiry", "Order" "Employee".
Step 4 - Extend EM to Cover New Requirements
17
§ The EM is extended to cover entities needed to fulfil the new requirements of the
system.
Step 5 - Rationalize the Structure
§ A review of the ERD is required to ensure that it is the minimum logical model required
to support the system requirements.
§ If there are two access paths between two entities, the access logic may be is duplicated
and redundant.
§ If link entities are found to contain similar data, they may be merged.
Step 6 - Revalidate the EM with Users
More on Relationships
§ Recursive Relationships - A recursive relationship is one in which a the same entity
participates more than once in different roles
Example:
Employee married to Employee
Employee manages Employee
§ Optional/Mandatory Relationships
In an optional relationship, an entity occurrence in one entity need not have an
occurrence in the other entity associated with it.
§ Cardinality Constraints - describes the maximum number of possible relationships for
each participating entity
• It is a function of policies established by an enterprise
• The most common is binary (1:1), one-to-many (1:M) and many-to-many (M:N)
§ Cardinality Constraints - Example
A Lecturer teaches at most three Classes per semester and each Class has at most one
Lecturer. A student is limited to six classes per semester and class size is limited to 35.
 Participation Constraint (Existence – Dependent) - Determines whether the existence of
an entity depends on it being related to another entity through the relationship.
Example: Course generates Class

18
§ Participation Constraint (Existence – Dependent) - Determines whether the existence of
an entity depends on it being related to another entity through the relationship.
Example: Course generates Class
Enhanced Entity Model
§ Generalization Vs Specialization
generalization and specialization are important relationships that exist between a higher level
entity set and one or more lower level entity sets.
1. Generalization is the result of taking the union of two or lower level entity sets
to produce a higher level entity sets. Specialization is the results of taking
subsets of a higher level entity set to form a lower level entity sets.
2. In generalization, each higher level entity must also be a lower level entity.
In specialization, some higher level entities may not have lower-level entity
sets at all.
3. Specialization is a Top Down process where as Generalization is Bottom Up
process.

Conceptual Data Modeling:


Normalization
§ Normalization is a data analysis technique for producing a set of relations with desirable
properties
§ There are several levels of normalization: Basic level - First (FNF), Second (SNF), and
Third (TNF). There are also Fourth Normal Form, Fifth Normal Form
§ It is the TNF that is usually used as the basis for the design of the data model and for
mapping onto a database
Advantages of Normalization
§ It is a formal technique with each stage of normalization process eliminating a particular
type of undesirable dependency
§ It highlights constraints and dependencies in the data and hence aids in understanding
the nature of data
§ The TNF produces well designed databases which provide a higher degree of
independence
Un-normalized Form
§ A relation that contains one or more repeating groups i.e. repeated values for particular
attributes with a single record
First Normal Form
§ A relation is said to be in FNF if and only it contains no repeating groups

19
§ Remove repeating groups and propagate higher level primary keys by partitioning the
un-normalized relation
Second Normal Form
§ A relation is in SNF if and only if it is in FNF and every non-key attribute is fully
functionally dependent on the key attribute
* Remove any partial dependencies by partitioning the relation
* A relation that is in FNF and has no composite key is necessarily in SNF!
Third Normal Form
§ A relation is in TNF if and only if it is in SNF and every non-key attribute is independent
of all other non-key attributes
* Remove any non-key attributes that depend on other non-key dependencies
(transitive dependencies)
Functional Dependency
§ Given a relation R, attribute Y of R is functionally dependent on attribute X of R, if and
only if each X-value in R has associated with it precisely one Y-value in R (at any one
time)
R.X R.Y (R.X functionally determines R.Y)
Transitive Dependency
§ A B C
If B depends on A and C depends on B, then C transitively depends on A via B
Example Staff No name, salary, branchno, address
branch No address
Normalization Steps
§ Represent the un-normalized relation
 List Attributes
 Identify Repeating Groups
 Identify Key Attributes
§ Convert to FNF
• By removing Repeating Groups
§ Understand the Dependencies
• Functional Dependency Diagrams may be used
§ Convert to SNF
• by removing partial dependencies
§ Convert to TNF
• by removing transitive dependencies

20
21
22
Fourth Normal Form
§ Fourth Normal Form is concerned with multi-valued dependencies
• A functional dependency indicates that for each value of X there is one value of Y
• A multi-valued dependency states that for each value of attribute X there is a finite
set of values of Y. Example Book/Authors
Fifth Normal Form
§ Fifth Normal Form is concerned with relations which cannot be partitioned into two
relations without losing information, but can be partitioned into three or more relations
* Has little practical importance

Review Questions
1. An engineering consultancy firm supplies temporary specialized staff to bigger companies in
the country to work on their project for certain amount of time. The table below lists the
time spent by each of the company’s employees at other companies to carry out projects.
The National Insurance Number (NIN) is unique for every member of staff.

NIN Contract No Hours Employee Company ID Company Location


Name

616681B SC1025 72 P. White SC115 Belfast

23
674315A SC1025 48 R. Press SC115 Belfast

323113B SC1026 24 P. Smith SC23 Bangor

616681B SC1026 24 P. White SC23 Bangor

a) Explain in which normal form this table is


b) Find the Primary Key for this relation and explain your choice.
c) Find the Fully Functional Dependencies on the PK and the Partial Dependencies on the
PK.
d) Normalize the table to 2NF
e) Find the transitive dependencies on the 2NF tables
f) Normalize the tables to 3NF – Express the tables in DBML language and show the PK and
FK in all the relations.
2. Draw an E-R diagram that shows the hospital environment, theatres, patients (in and
out-patients) doctors, nurses, wards and ward beds.

24
Week Three

Relational Database Systems


Relational databases are tuned towards:
• simple data
• simple, ad-hoc queries
• multiple users

Relational Data Structures


§ Relation is a table with columns and rows.
§ Attribute is a column
§ Domain is allowable values for one or more attributes
§ Tuple is a row
§ Degree - degree of a relation is the number of attributes of contains.
§ Cardinality – cardinality of a relation is the number of tuples it contains
§ Heading - consists of a fixed set of attributes
§ Body - consists of a time-varying set of tuples
25
Properties of Relations
§ The relation has a distinct name
§ Each attribute has a distinct name
§ Each cell of the relation contains one atomic value
§ The values of an attribute are from the same domain
§ The order of attributes has no significance
§ No duplicate rows
§ The order of rows has no significance (theoretically)

Relational Keys
i. Candidate Key - a set of unique identifiers
ii. Primary Key - The candidate key selected to be the primary key
iii. Composite Key - a candidate key that consists of two or more attributes
iv. Foreign Key - A foreign key is an attribute(s) in one relation whose values are
required to match those of the primary key of some relation

Relational Data Integrity


§ Entity Integrity - No attribute participating in the primary key of a relation is allowed to
accept null values
§ Referential Integrity - If a FK exist in a relation either the foreign FK value must match a
candidate key value in a home relation or FK value must be wholly null.
§ Null – represents a value for an attribute that is currently unknown or is not applicable
for this tuple
§ Enterprise Constraints – Additional rules specified by the users or DBA
§ Domain Constraints – restriction on the values allowed for attributes

Relational Data Languages

Data Manipulation Language

The Data Manipulation Languages (DML) of relational databases is based on Relational Algebra.
Each operator of the relational algebra takes either one or two relations as its input and
produces a new relation as its output
26
Data Manipulation is:

§ retrieval of information from the database


§ insertion of new information into the database
§ deletion of information in the database
§ modification of information in the database

A DML is a language which enables users to access and manipulate data. The goal is to provide
efficient human interaction with the system. There are two types of DML:
i. procedural: the user specifies what data is needed and how to get it
ii. nonprocedural: the user only specifies what data is needed

– Easier for user


– May not generate code as efficient as that produced by procedural
languages

A query language is a portion of a DML involving information retrieval only. The terms DML and
query language are often used synonymously.

Data manipulation languages were initially only used by computer programs, but (with the
advent of SQL) have come to be used by people, as well.

Data Manipulation Languages have their functional capability organized by the initial word in a
statement, which is almost always a verb. In the case of SQL, these verbs are:

1. Select
2. Insert
3. Update
4. Delete

Relational Algebra
§ SELECT - Extracts specified tuples (rows/records) from a specified relation
§ PROJECT - Extracts specified attributes (columns/fields) from a specified relation
§ JOIN - Produces a relation from two specified relations consisting of all possible
concatenated pairs of tuples, one from each of the two specified relations, such that in
each pair the two tuples satisfy some specified condition
§ DIVIDE - Takes two relations, one binary and one unary, and builds a relation consisting
of all values of one attribute of the binary relation that match (in the other attribute) all
values in the unary relation
§ UNION - Produces a relation consisting of all tuples appearing in either or both of two
specified relations

27
§ INTERSECTION - Produces a relation consisting of all tuples appearing in both of two
specified relations
§ DIFFERENCE - Produces a relation consisting of all tuples appearing in the first and not in
the second of two specified relations
§ PRODUCT-Produces a relation from two specified relations consisting of all possible
concatenated pairs of tuples, one from each of the two specified relations

Data Definition Language

A Data Definition Language (DDL) is a computer language for defining data structures. The
term was first introduced in relation to the Codasyl database model, where the schema* of the
database was written in a Data Definition Language describing the records, fields, and "sets"
making up the user Data Model. Initially it referred to a subset of SQL, but is now used in a
generic sense to refer to any formal language for describing data or information structures, like
XML schemas.

*Schema: The schema of a database system is its structure described in a formal language
supported by the database management system (DBMS). In a relational database, the schema
defines the tables, the fields in each table, and the relationships between fields and tables.
Schemas are generally stored in a data dictionary. Although a schema is defined in text
database language, the term is often used to refer to a graphical depiction of the database
structure

Data Definition Language (DDL)

1. Used to specify a database scheme as a set of definitions expressed in a DDL


2. DDL statements are compiled, resulting in a set of tables stored in a special file called a
data dictionary or data directory.
3. The data directory contains metadata (data about data)
4. The storage structure and access methods used by the database system are specified by
a set of definitions in a special type of DDL called a data storage and definition language
5. basic idea: hide implementation details of the database schemes from the users

Review Questions
1. Define the following terms
 Data model
 Referential integrity
 Primary candidates
28
 Candidate key
 Value set (domain)
 Attribute
 Attribute value
2. Differentiate between the following terms: -
 Entity type and entity set
 Relationship type, relationship instance and relationship set.
 Weak entity types and strong (regular) entity types
3. Describe briefly the following data models
 Relational data models
 Network data models
 Hierarchical data models
4. What is a participatory role? When is it necessary to use role names in the
description of relationship type?
5. Define the following terms:-
a. Internal schema
b. Conceptual schema
c. External schema
d. Data independence
e. Data definition language
f. Query language
g. Data manipulation language
6. Differentiate between
 Database state and database schema
 Logical data independence and physical data independence
 Procedural and non-procedural DMLs.
7. Discuss the weaknesses of a relational database model

29
Week Four

Query optimisation
Objectives

understand the importance of query optimization for relational


database systems;
explain the main objective and principle of a query optimizer;
quantify the effects of optimization by calculating the number of tuple
I/O operations, and compare different query plans using this parameter;
identify and discuss techniques used in the different stages of
the optimization process;
Discuss the role of database statistics in the form of a data dictionary
(or system catalogue) in the optimization process.

Query optimization is an important component of a modern relational database system. When


processing user queries, the query optimizer transforms them into a set of low level operations,
and decides in what order these operations are to be executed in the most efficient way.

In a non-relational database system, any optimization has to be carried out manually.


Relational database systems, however, offer a system-managed optimization facility by
making use of the wealth of statistical information available to the system. The overall
objective of a query optimizer is to improve the efficiency and performance of a relational
database system.

The Need for Query Optimization

A query optimizer is essentially a program which is used for the efficient evaluation of relational
queries, making use of relevant statistic information.
Objective

The main objective of query optimization is to choose the most efficient strategy for
implementing a given relational query, and thereby improve on the system performance

30
Need for Query optimization

§ To perform automatic navigation

A database system based on either the hierarchical model or network model requires users to
navigate across the database to locate and retrieve the desired data. A relational database
system, however, allows users simply to state what data they require and leave the system to
locate that data in the database.

§ To achieve acceptable performance

Although there are many different ways available to perform a single user query, this automatic
navigation can be optimized by choosing and executing the most efficient plan, selected on the
basis of system statistical information. By executing queries in the most cost-effective way, the
overall system performance can be improved /enhanced.

§ To minimize existing discrepancies

The speed discrepancy between CPU and I/O devices is still enormous despite rapid recent
advances in computer technology. As an I/O access is one of the most costly and frequently
performed operations in a typical database system, it is the main aim of the query optimiser to
minimize the I/O activities by choosing the cheapest query plan for a given query, usually the
one with a minimum number of tuple I/O operations.

Effects of Optimization - An Example


In order to appreciate the effects of query optimization on database performance, we use a
simple example as an illustration.

Consider the following 3 tables:

Student, Lending and Book. The attributes highlighted are the keys for the relevant relations
shown in below(fig v4.1)

V4.1
Effects of Optimization - 1

Student = (Stud_No, Stud_Name, Gender, Address)

Lending = (Lending_No, Stud_No, Book_No)

Book = (Book_No, Title, Author, Edition)

Consider the following query:


31
Retrieve the names of students who have borrowed the book B1.

This query can be expressed in SQL as shown in fig. V4.2.


We further make the following two assumptions:
V4.2.
Effects of Optimisation – 2

Query: Retrieve the names of students who have borrowed the


book B1
Select Distinct Student.Stud_Name
From Student, Lending
Where Student.Stud_No = Lending.Stud_No

And Lending.Book_No = 'B1'


• Suppose the database contains 100 students and 10,000
lendings, of which only 50 are for book B1
• Assume that results of up to 50 tuples can be kept in memory

 The database contains 100 students and 10,000 lendings, of which only
50 are for book B1.
 It is possible to hold up to 50 tuples in memory, without having to write
back to disk.
Query Execution Plan A - No Optimization
V4.3
Query Execution Plan A - No
Optimization
Join-Select-Project
1.Join relations Student and Lending over
Stud_No
2.Select the result of Step 1 for just the tuples for
book B1
3.Project the result of Step 2 over Stud_Name to
get result (50 max)
Total tuple I/O: 1,020,000

In this first approach, the operations required by the query are performed in the sequence:
Join-Select-Project

We calculate the number of database accesses (tuple I/O operations) occurred in each
operation.
1. Join relations Student and Lending over Stud_no.
- For each Student row, every Lending row will be retrieved
32
a n d tested (reading each of the 100 Student rows 10,000 times);
- Every Lending row will match with one Student row, so the
number of joined rows in this intermediate relation is 10,000.
These have to written back to disk (only 50 tuples can be held
in memory – see assumptions).

So, the number of tuple I/Os occurred in this step is:

(100*10,000) + 10,000 = 1,010,000

2. Select the result of Step 1 for just the tuples for book B1.
- This results reading the 10,000 joined tuples (obtained in step
1) Back into memory.
- Then Select produces a relation containing only 50 tuples, which
Can be kept in memory (see assumption).

The number of tuple I/Os in this step is:

10,000 + 0 = 10,000

3. Project the result of Step 2 over Stud_Name to get result (50 max).
- This results in reading a relation of 50 tuples (obtained in step
2) Which is already in memory, and producing a final relation of no
more than 50 tuples, again in memory.

The number of tuple I/O in this step is:

0+0=0

Therefore, the total number of tuple I/Os for query plan A is:

(1,010,000 + 10,000).

Total tuple I/O: 1,020,000

33
Query Execution Plan B - With Optimization
V4.4

With Optimization
Select-Join-Project
1. Select the relation Lending to just the tuples
for Book B1
2. Join the result of Step 1 to relation Student
over Stud_No
3. Project the result of Step 2 over Stud_Name
Total tuple I/O: 10,100

In this approach, the sequence of the operations has been changed to the
following:

Select-Join-Project

1.Select the relation lending for just the tuples for Bppk B1 a relation with
50 tuples,which will be kept in
memory (see assumption).
The number of tuple I/Os: 10,000 + 0 = 10,000
2. Join the result of Step 1 with relation Student over Stud_No.

- This results in reading 100 tuples of Student relation, and


joining them with the relation obtained in step 1 which is already
in memory. This join produces a relation of 50 tuples, which
again will be kept in memory.
The number of tuple I/Os: 100 + 0 = 100
3. Project the result of Step 2 over Stud_Name.
- Same as step 3 of Query Plan A.

Therefore, the total number of tuple I/Os for query plan B is (10,000 + 100)
Total tuple I/O: 10,100

Comparison of A and B

Ratio (Plan A to Plan B): 1,020,000 / 10,100 (102 / 1)

Although plan A and plan B will produce the same end result for the given
query, the order in which the required relational operations are executed is
different. It is obvious that plan B requires far fewer tuple I/O accesses than
plan A, and therefore is far more efficient in terms of performance.
If, furthermore, use was made in Plan B of an index on Book-No for the Lending
relation, the number of tuples read in Step 1 would be further reduced from
10,000 to just 50. Similarly, an index on Student.Stud-No would reduce
34
retrievals in Step 2 to only 50 tuples. One or more similar examples should be
shown to the students to demonstrate how tuple I/Os should be calculated and
used as a parameter in the comparison of different query execution plans.

Four Stages of Optimization Process

Four Stages of
Optimization - 1
1. Convert query into internal form
suitable for machine manipulation -
e.g. Query tree, Relational algebra
2. Further convert internal form into
equivalent and more efficient canonical
form, using well-defined transformation
rules (laws)

Having considered the above example, now we describe the query


optimization in a more systematic way.

The whole optimization process generally involves 4 stages (Date):


1. Convert the query into a more suitable internal form.

First, the original query is converted into some internal


representation which is more suitable for machine manipulation.
The typical forms used for this representation are:

- Query tree;
- Relational algebra.

2. Convert to a more efficient canonical form.

This internal representation is further converted into some


Equivalent canonical form which is more efficient, making use
of well-defined transformation rules (see Visuals V4.5 and V4.6).

V4.5
Four Stages of Optimization – 2
3. Choose set of candidate low-level
procedures, using statistics about the
database
– Low-level operation
– Implementation procedure
35
– Cost formula
4. Generate query plans and choose the
best (cheapest) plan by evaluating the
cost formulae

3. Choose low-level procedures.

Having completed steps 1 and 2, the query optimizer must now decide how to evaluate
the transformed query. The basic principle here is to consider the transformed query
representation as specifying a series of low-level operations, using statistical
information about the database
The main results produced in this stage are:

- a set of low-level operations (for example, join, restriction, etc.);


- a set of implementation procedures (one for each low-level
operation);
- a set of cost formulae (one for each procedure).

4. Generate query plans and choose the best.

In this stage, a set of candidate query plans are generated by


c o m b i n i n g together the set of candidate implementation
procedures. A final decision is made to choose the best (cheapest) of
those plans by evaluating the cost formulae.

Transformation Rules

V4.6
Transformation Rules - 1
Rule 1
(A where Restrict-1) where Restrict-2
≡ A (where Restrict-1 AND Restrict-2)
Rule 2
A ([Project]) where Restrict
≡ (A where Restrict) [Project]
Rule 3
(A [Project-1]) [Project-2] ≡ A [Project-2]
Rule 4
(A Join B) where Restrict-on-A AND Restrict-on-B
≡ (A where Restrict-on-A) Join (B where Restrict-on-B)

• Rule 1
Transform a sequence of restrictions against a given relation into
a single restriction.
(A where Restrict-1) where Restrict-2
≡ A (where Restrict-1 AND Restrict-2)
• Rule 2
36
Transform a restriction of a projection into a projection of a
restriction.
A ([Project]) where Restrict
≡ (A where Restrict) [Project]
• Rule 3
Transform a sequence of projections against a given relation into a
single (the last) projection.

(A [Project-1]) [project-2] ≡ A [Project-2]

• Rule 4
Distributivity (for restrictions and projections).

(A Join B) where Restrict-on-A AND Restrict-on-B


≡ (A where Restrict-on-A) Join (B where Restrict-on-B)

V4.7
Transformation Rules - 2
Rule 5
where p OR (q AND r)
≡ where (p OR q) AND (p OR r)
Rule 6
(A Join B ) Join C ≡ A Join (B Join C)
Rule 7
Perform projections as early as possible
Rule 8
Perform restrictions as early as possible

• Rule 5
Distributivity (for logical
expressions).

where p OR (q AND r)
≡ where (p OR q) AND (p OR r)

• Rule 6
Choose an optimal ordering of
the joins to keep the
intermediate results low in size.

A Join (B Join C)

37
(A Join B ) Join C ≡

• Rule 7
Perform projections as early as possible.

• Rule 8
Perform restrictions as early as possible.

Some General Principles


Some general guidelines can be drawn from the above rules. These include:

• Avoid having to build Cartesian products.


• Perform selects and projects before joins if possible to reduce the
number of rows which have to be joined.
• Use indexes for selects and joins where appropriate.
At this point, various examples should be shown to the students to
illustrate the use of the above rules

6 Database Statistics

Database Statistics
System catalogue, or data dictionary
Typical statistics maintained by it include:
l For each base table
– cardinality
– number of pages for this table
l For each column of each base table
– maximum, minimum, and average value
– actual values and their frequencies
l For each index
– whether a ‘clustering index’
– number of leaf pages
– number of levels

Various decisions which have to be made in the optimization process are based upon the
database statistics stored in the system, often in the form of a system catalogue or a data
dictionary.
Typical statistics maintained by the system include
For each base relation, for example:
- cardinality of the relation;
- number of pages for this relation.

For each column of each base relation, for example:


- number of distinct values in this column;
38
- maximum, minimum, and average value for this column;
- actual values in this column and their frequencies (a histogram).

For each index, for example:


- whether this is a clustering index;
- number of leaf pages in this index;
- number of levels in this index.

Summary
The following key points should be emphasized for this topic:

 Objective - The importance of query optimization in terms of


minimising I/O operations, speeding up query execution and improving
the efficiency and performance of a relational database system.

 Effects:
The dramatic effects of optimization are illustrated by the
example. Students should be able to compare different query executing
plans by calculating the number of tuple I/Os for each plan, as shown in the
example.

 Processes:
The main tasks involved in each of the four stages of optimization.

 Information needed:
The critical role of database statistics in the form of a system
catalogue O r data dictionary.

Effects of Optimization - An Example


Query

In order to appreciate the effects of query optimization on database


performance, we use a
simple example as an illustration.

Consider the following 3 tables: Student, Lending & Book. The attributes
highlighted are
the keys for the relevant relations:

Student = (Stud_No, Stud_Name, Gender, Address)


39
Lending = (Lending_No, Stud_No, Book_No)
Book = ( Book_No, Title, Author, Edition)

Consider the following query:

Retrieve the names of students who have borrowed the book B1.

This query can be expressed in SQL:

Select Distinct Student.Stud_Name


From Student, Lending
Where Student.Stud_No = Lending.Stud_No
And Lending.Book_No = 'B1'

We further make the following two assumptions:

 The database contains 100 students and 10,000 lendings, of which only
50 are for book B1.
 It is possible to hold up to 50 tuples in memory, without having to write
back to disk.

Query Execution Plan A - No Optimization

In this first approach, the operations required by the query are performed in the
sequence:

Join-Select-Project

We calculate the number of database accesses (tuple I/O operations)


occurred in each operation.
1. Join relations Student and Lending over Stud_no.
- For each Student row, every Lending row will be retrieved
and tested (reading each of the 100 Student rows 10,000 times);
- Every Lending row will match with one Student row, so the
number of joined rows in this intermediate relation is 10,000.

These have to written back to disk (only 50 tuples can be held


in memory – see assumptions).
So, the number of tuple I/Os occurred in this step is:

(100*10,000) + 10,000 = 1,010,000

2. Select the result of Step 1 for just the tuples for book B1.
- This results reading the 10,000 joined tuples (obtained in step
1) Back into memory.
40
- Then Select produces a relation containing only 50 tuples, which
can Be kept in memory (see assumption).
The number of tuple I/Os in this step is:

10,000 + 0 = 10,000

3. Project the result of Step 2 over Stud_Name to get result (50 max).
- This results in reading a relation of 50 tuples (obtained in step
2) which is already in memory, and producing a final relation of
no more than 50 tuples, again in memory.
The number of tuple I/O in this step is:

0+0=0

Therefore, the total number of tuple I/Os for query plan A is:

(1,010,000 + 10,000).

Total tuple I/O: 1,020,000

Query Execution Plan B - With Optimisation

In this approach, the sequence of the operations has been changed to the
following:

Select-Join-Project

1. Select the relation lending for just the tuples for Book B1.

- This results in reading 10,000 tuples of lending relation,


but only generates a relation with 50 tuples, which will be kept
in memory (see assumption).
The number of tuple I/Os: 10,000 + 0 = 10,000

2. Join the result of Step 1 with relation Student over Stud_No.


- This results in reading 100 tuples of Student relation, and
Joining them with the relation obtained in step 1 which is already
in memory. This join produces a relation of 50 tuples, which
again will be kept In memory.
The number of tuple I/Os: 100 + 0 = 100

3. Project the result of Step 2 over Stud_Name.

41
- Same as step 3 of Query Plan A.
Therefore, the total number of tuple I/Os for query plan B is (10,000 + 100)

Total tuple I/O: 10,100

Transformation Rules

• Rule 1
Transform a sequence of restrictions against a given relation into a single
restriction.
(A where Restrict-1) where Restrict-2
A (where Restrict-1 AND Restrict-2)

• Rule 2
Transform a restriction of a projection into a projection of a restriction.

A ([Project]) where Restrict


(A where Restrict) [Project]
• Rule 3
Transform a sequence of projections against a given relation into a single (the
last) projection.

(A [Project-1]) [project-2] ≡ A [Project-2]


• Rule 4
Distributivity (for restrictions and projections).

(A Join B) where Restrict-on-A AND Restrict-on-B


(A where Restrict-on-A) Join (B where Restrict-on-B)

• Rule 5
Distributivity (for logical expressions).

where p OR (q AND r)
where (p OR q) AND (p OR r)

• Rule 6
Choose an optimal ordering of the joins to keep the intermediate results
low in size.

(A Join B ) Join C ≡ A Join (B Join C)


• Rule 7
Perform projections as early as possible.

42
• Rule 8
Perform restrictions as early as possible.

Review Questions
1. Discuss the reasons for converting SQL queries into relational algebra before
optimization is done.
2. Briefly discuss the process of query optimization
3. Why are queries converted into canonical form during optimization?
4. What is meant by the term heuristic optimization? Discuss the main heuristics that are
applied during query optimization

Week Five

Introduction to SQL
What is SQL?

 SQL stands for Structured Query Language


 SQL lets you access and manipulate databases
 SQL is an ANSI (American National Standards Institute) standard

What Can SQL do?

 SQL can execute queries against a database


 SQL can retrieve data from a database
 SQL can insert records in a database
 SQL can update records in a database
 SQL can delete records from a database
 SQL can create new databases
 SQL can create new tables in a database
 SQL can create stored procedures in a database
 SQL can create views in a database
 SQL can set permissions on tables, procedures, and views

Using SQL in Your Web Site

To build a web site that shows data from a database, you will need:

43
 An RDBMS database program (i.e. MS Access, SQL Server, MySQL)
 To use a server-side scripting language, like PHP or ASP
 To use SQL to get the data you want
 To use HTML / CSS

SQL Data Definition Language


The SQL data-definition language (DDL) allows the specification of information about relations,
including:

§ The schema for each relation.


§ The domain of values associated with each attribute.
§ Integrity constraints
§ And as we will see later, also other information such as
– The set of indices to be maintained for each relations.
– Security and authorization information for each relation.
– The physical storage structure of each relation on disk.

Domain Types in SQL


 char(n). Fixed length character string, with user-specified length n.
 varchar(n). Variable length character strings, with user-specified maximum length n.
 int. Integer (a finite subset of the integers that is machine-dependent).
 smallint. Small integer (a machine-dependent subset of the integer domain type).
 numeric(p,d). Fixed point number, with user-specified precision of p digits, with n digits
to the right of decimal point.
 real, double precision. Floating point and double-precision floating point numbers, with
machine-dependent precision.
 float(n). Floating point number, with user-specified precision of at least n digits.

Some of The Most Important SQL Commands

 SELECT - extracts data from a database


 UPDATE - updates data in a database
 DELETE - deletes data from a database
 INSERT INTO - inserts new data into a database
 CREATE DATABASE - creates a new database
 ALTER DATABASE - modifies a database
 CREATE TABLE - creates a new table
 ALTER TABLE - modifies a table
 DROP TABLE - deletes a table
 CREATE INDEX - creates an index (search key)
 DROP INDEX - deletes an index

44
Create Table Construct
An SQL relation is defined using the create table command:
Create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1), ..., (integrity
constraintk))
r is the name of the relation
each Ai is an attribute name in the schema of relation r
Di is the data type of values in the domain of attribute Ai

Example:

create table instructor


( ID char(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2))

Insert into instructor values (‘10211’, ’Smith’, ’Biology’, 66000);


Insert into instructor values (‘10211’, null, ’Biology’, 66000);

Integrity Constraints in Create Table


These are
not null
primary key (A1, ..., An )
foreign key (Am, ..., An ) references r
Example: Declare ID as the primary key for instructor .
create table instructor(
ID char(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2),
primary key (ID),
foreign key (dept_name)
references department)

Note that primary key declaration on an attribute automatically ensures not null

Drop and Alter Table Constructs


Drop table student deletes the table and its contents
Delete from student Deletes all contents of table, but retains table
45
Alter table
Alter table r add A D
– Where A is the name of the attribute to be added to relation r and D is the domain of A.
– All tuples in the relation are assigned null as the value for the new attribute.
Alter table r drop A
– where A is the name of an attribute of relation r
– Dropping of attributes not supported by many databases

Basic Query Structure

The SQL data-manipulation language (DML) provides the ability to query information, and
insert, delete and update tuples
A typical SQL query has the form:
Select A1, A2, ..., An
From r1, r2, ..., rm
Where P

– Ai represents an attribute
– Ri represents a relation
– P is a predicate.

The result of an SQL query is a relation.

Views
In some cases, it is not desirable for all users to see the entire logical model (that is, all the
actual relations stored in the database.)
Consider a person who needs to know an instructors name and department, but not the salary.
This person should see a relation described, in SQL, by select ID, name, dept_name from
instructor
A view provides a mechanism to hide certain data from the view of certain users.
Any relation that is not of the conceptual model but is made visible to a user as a “virtual
relation” is called a view.

View Definition
A view is defined using the create view statement which has the form
Create view v as < query expression >
Where <query expression> is any legal SQL expression.
The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the view
generates.

46
View definition is not the same as creating a new relation by evaluating the query expression
Rather, a view definition causes the saving of an expression; the expression is substituted into
queries using the view.
Example Views
A view of instructors without their salary
Create view faculty as
Select ID, name, dept_name
From instructor
Find all instructors in the Biology department
Select name
From faculty
Where dept_name = ‘Biology’

Create a view of department salary totals


Create view departments_total_salary(dept_name, total_salary) as
Select dept_name, sum (salary)
From instructor
Group by dept_name;

SQL Transactions
A transaction is a unit of work. An Atomic transaction either fully executed or rolled back as if it
never occurred. Isolation from concurrent transactions
Transactions begin implicitly Ended by commit work or rollback work
But default on most databases: each SQL statement commits automatically. One can turn off
auto commit for a session (e.g. using API)
In SQL: 1999 can use: begin atomic …. end Not supported on most databases

Integrity Constraints
Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
Integrity Constraints on a Single Relation are
not null
primary key
unique
check (P), where P is a predicate

Referential Integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a
certain set of attributes in another relation. Example: If “Biology” is a department name
appearing in one of the tuples in the instructor relation, then there exists a tuple in the
department relation for “Biology”.
47
Built-in Data Types in SQL
date: Dates, containing a (4 digit) year, month and date Example: date ‘2005-7-27’
time: Time of day, in hours, minutes and seconds. Example: time ‘09:00:30’ time ‘09:00:30.75’
timestamp: date plus time of day Example: timestamp ‘2005-7-27 09:00:30.75’
interval: period of time Example: interval ‘1’ day
– Subtracting a date/time/timestamp value from another gives an interval value
– Interval values can be added to date/time/timestamp values

Authorization
Forms of authorization on parts of the database:
Read - allows reading, but not modification of data.
Insert - allows insertion of new data, but not modification of existing data.
Update - allows modification, but not deletion of data.
Delete - allows deletion of data.

Forms of authorization to modify the database schema


Index - allows creation and deletion of indices.
Resources - allows creation of new relations.
Alteration - allows addition or deletion of attributes in a relation.
Drop - allows deletion of relations.

Privileges in SQL
select: allows read access to relation,or the ability to query using the view
Example: grant users U1, U2, and U3 select authorization on the instructor relation:
grant select on instructor to U1, U2, U3
insert: the ability to insert tuples
update: the ability to update using the SQL update statement
delete: the ability to delete tuples.
all privileges: used as a short form for all the allowable privileges

Joined Relations
Join operations take two relations and return as a result another relation.
A join operation is a Cartesian product which requires that tuples in the two relations match
(under some condition). It also specifies the attributes that are present in the result of the join
The join operations are typically used as sub query expressions in the from clause

48
Join operations – Example

Relation course

Relation prerequisites

Observe that
prerequisite information is missing for CS-315 and
course information is missing for CS-437

Outer Join
§ Is an extension of the join operation that avoids loss of information
§ Computes the join and then adds tuples form one relation that does not match tuples in
the other relation to the result of the join.
§ Uses null values.

Joined Relations
Join operations take two relations and return as a result another relation.
– These additional operations are typically used as subquery expressions in the from
clause
Join condition – defines which tuples in the two relations match, and what attributes are
present in the result of the join.
49
Join type – defines how tuples in each relation that do not match any tuple in the other relation
(based on the join condition) are treated.

SQL LEFT JOIN Keyword

The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in
the right table (table2). The result is NULL in the right side when there is no match.

SQL LEFT JOIN Syntax


SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name=table2.column_name;

or:

SELECT column_name(s)
FROM table1
LEFT OUTER JOIN table2
ON table1.column_name=table2.column_name;

Note! In some databases LEFT JOIN is called LEFT OUTER JOIN.

Demo Database

In this example we will use a sample database.

Below is a selection from the "Customers" table:

50
CustomerID CustomerName ContactName Address City PostalCode Country

1 Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

Ana Trujillo Avda. de la


México
2 Emparedados y Ana Trujillo Constitución 05021 Mexico
D.F.
helados 2222

Antonio Moreno Antonio Mataderos México


3 05023 Mexico
Taquería Moreno 2312 D.F.

And a selection from the "Orders" table:

OrderID CustomerID EmployeeID OrderDate ShipperID

10308 2 7 1996-09-18 3

10309 37 3 1996-09-19 1

10310 77 8 1996-09-20 2

SQL LEFT JOIN Example

The following SQL statement will return all customers, and any orders they might have:

Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;

51
SQL RIGHT JOIN Keyword

The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows
in the left table (table1). The result is NULL in the left side when there is no match.

SQL RIGHT JOIN Syntax


SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name=table2.column_name;

or:

SELECT column_name(s)
FROM table1
RIGHT OUTER JOIN table2
ON table1.column_name=table2.column_name;

Note! In some databases RIGHT JOIN is called RIGHT OUTER JOIN.

Demo Database

In this example we will use a sample database.

Below is a selection from the "Customers" table:

CustomerID CustomerName ContactName Address City PostalCode Country

1 Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

Ana Trujillo Avda. de la


México
2 Emparedados y Ana Trujillo Constitución 05021 Mexico
D.F.
helados 2222

Antonio Moreno Antonio Mataderos México


3 05023 Mexico
Taquería Moreno 2312 D.F.

And a selection from the "Orders" table:

52
OrderID CustomerID EmployeeID OrderDate ShipperID

10308 2 7 1996-09-18 3

10309 37 3 1996-09-19 1

10310 77 8 1996-09-20 2

SQL RIGHT JOIN Example

The following SQL statement will return all orders, and any customers that might have placed
them:

Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
RIGHT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;

SQL FULL OUTER JOIN Keyword

The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right
table (table2).

The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.

SQL FULL OUTER JOIN Syntax


SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name=table2.column_name;

Demo Database

In this example we will use a sample database.

Below is a selection from the "Customers" table:

53
CustomerID CustomerName ContactName Address City PostalCode Country

1 Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

Ana Trujillo Avda. de la


México
2 Emparedados y Ana Trujillo Constitución 05021 Mexico
D.F.
helados 2222

Antonio Moreno Antonio Mataderos México


3 05023 Mexico
Taquería Moreno 2312 D.F.

And a selection from the "Orders" table:

OrderID CustomerID EmployeeID OrderDate ShipperID

10308 2 7 1996-09-18 3

10309 37 3 1996-09-19 1

10310 77 8 1996-09-20 2

SQL FULL OUTER JOIN Example

The following SQL statement selects all customers, and all orders:

SELECT Customers.CustomerName, Orders.OrderID


FROM Customers
FULL OUTER JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;

A selection from the result set may look like this:

CustomerName OrderID

Alfreds Futterkiste

54
Ana Trujillo Emparedados y helados 10308

Antonio Moreno Taquería 10365

10382

10351

Note: The FULL OUTER JOIN keyword returns all the rows from the left table (Customers), and
all the rows from the right table (Orders). If there are rows in "Customers" that do not have
matches in "Orders", or if there are rows in "Orders" that do not have matches in "Customers",
those rows will be listed as well.

The SQL UNION Operator

The UNION operator is used to combine the result-set of two or more SELECT statements.

Notice that each SELECT statement within the UNION must have the same number of columns.
The columns must also have similar data types. Also, the columns in each SELECT statement
must be in the same order.

SQL UNION Syntax


SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

Note: The UNION operator selects only distinct values by default. To allow duplicate values, use
the ALL keyword with UNION.

SQL UNION ALL Syntax


SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;

Note: The column names in the result-set of a UNION are usually equal to the column names in
the first SELECT statement in the UNION.

Demo Database

In this example we will use a sample database.


55
Below is a selection from the "Customers" table:

CustomerID CustomerName ContactName Address City PostalCode Country

1 Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

Ana Trujillo Avda. de la


México
2 Emparedados y Ana Trujillo Constitución 05021 Mexico
D.F.
helados 2222

Antonio Moreno Antonio Mataderos México


3 05023 Mexico
Taquería Moreno 2312 D.F.

And a selection from the "Suppliers" table:

SupplierID SupplierName ContactName Address City PostalCode Country

Charlotte 49 Gilbert
1 Exotic Liquid Londona EC1 4SD UK
Cooper St.

New Orleans Cajun P.O. Box New


2 Shelley Burke 70117 USA
Delights 78934 Orleans

Grandma Kelly's 707 Oxford


3 Regina Murphy Ann Arbor 48104 USA
Homestead Rd.

SQL UNION Example

The following SQL statement selects all the different cities (only distinct values) from the
"Customers" and the "Suppliers" tables:

Example
SELECT City FROM Customers
UNION
SELECT City FROM Suppliers
ORDER BY City;

56
Note: UNION cannot be used to list ALL cities from the two tables. If several customers and
suppliers share the same city, each city will only be listed once. UNION selects only distinct
values. Use UNION ALL to also select duplicate values!

SQL UNION ALL Example

The following SQL statement uses UNION ALL to select all (duplicate values also) cities from the
"Customers" and "Suppliers" tables:

Example
SELECT City FROM Customers
UNION ALL
SELECT City FROM Suppliers
ORDER BY City;

SQL UNION ALL With WHERE

The following SQL statement uses UNION ALL to select all (duplicate values also) German cities
from the "Customers" and "Suppliers" tables:

Example
SELECT City, Country FROM Customers
WHERE Country='Germany'
UNION ALL
SELECT City, Country FROM Suppliers
WHERE Country='Germany'
ORDER BY City;

Advanced SQL
JDBC and ODBC

API (application-program interface) for a program to interact with a database server


Application makes calls to
– Connect with the database server
– Send SQL commands to the database server
– Fetch tuples of result one-by-one into program variables
ODBC (Open Database Connectivity) works with C, C++, C#, and Visual Basic
– Other API’s such as ADO.NET sit on top of ODBC
JDBC (Java Database Connectivity) works with Java

57
Java Database Connectivity- JDBC
§ JDBC is a Java API for communicating with database systems supporting SQL.
§ JDBC supports a variety of features for querying and updating data, and for retrieving
query results.
§ JDBC also supports metadata retrieval, such as querying about relations present in the
database and the names and types of relation attributes.
§ Model for communicating with the database:
– Open a connection
– Create a “statement” object
– Execute queries using the Statement object to send queries and fetch results
– Exception mechanism to handle errors

JDBC Code
public static void JDBCexample(String dbid, String userid, String passwd)
{
try {
Class.forName ("oracle.jdbc.driver.OracleDriver");
Connection conn = DriverManager.getConnection(
"jdbc:oracle:thin:@db.yale.edu:2000:univdb", userid, passwd);
Statement stmt = conn.createStatement();
….. Do actual work ……..
Stmt.close();
Conn.close();
}
catch
(SQLException
sqle) {
System.out.println("SQLException : " + sqle);
}

Update to database
try {
stmt.executeUpdate(
"insert into instructor values(’77987’, ’Kim’, ’Physics’, 98000)");
} catch (SQLException sqle)
{ System.out.println("Could not insert tuple. " + sqle);
}

Execute query and fetch and print results


58
ResultSet rset = stmt.executeQuery(
"select dept_name, avg (salary)
from instructor
group by dept_name");
while (rset.next()) {
System.out.println(rset.getString("dept_name") + " " + rset.getFloat(2));
}

JDBC Code Details


§ Getting result fields:
rs.getString(“dept_name”) and rs.getString(1) equivalent if dept_name is the first
argument of select result.
§ Dealing with Null values int a = rs.getInt(“a”);
if (rs.wasNull()) Systems.out.println(“Got null value”);

SQL Injection

§ Suppose query is constructed using "select * from instructor where name = ’" + name +
"’"
§ Suppose the user, instead of entering a name, enters: X’ or ’Y’ = ’Y
§ then the resulting statement becomes:
– "select * from instructor where name = ’" + "X’ or ’Y’ = ’Y" + "’"
– which is: select * from instructor where name = ’X’ or ’Y’ = ’Y’
– User could have even used X’; update instructor set salary = salary + 10000; --
§ Prepared statement internally uses: "select * from instructor where name = ’X\’ or \’Y\’
= \’Y’ Always use prepared statements, with user inputs as parameters

Embedded SQL
§ The SQL standard defines embeddings of SQL in a variety of programming languages
such as C, Java, and Cobol.
§ A language to which SQL queries are embedded is referred to as a host language, and
the SQL structures permitted in the host language comprise embedded SQL.
§ The basic form of these languages follows that of the System R embedding of SQL into
PL/I.
§ EXEC SQL statement is used to identify embedded SQL request to the preprocessor

EXEC SQL <embedded SQL statement > END_EXEC


Note: this varies by language (for example, the Java embedding uses # SQL { …. }; )

59
Example Query
From within a host language, find the ID and name of students who have completed more than
the number of credits stored in variable credit_amount.

Specify the query in SQL and declare a cursor for it

EXEC SQL
declare c cursor for
select ID, name
from student
where tot_cred > :credit_amount
END_EXEC

Review Questions
1. List the data types that are allowed in SQL
2. How does SQL allow implementation of the entity integrity and referential integrity
constraints?
3. What is a view in SQL and how is it defined?
4. Discuss the problems that may arise when one attempts to update a view
5. Describe the six clauses in the syntax of an SQL query and show what type of constructs
can be can be specified in each of the six clauses. Which of the six clauses are required
and which are optional?
6. Describe conceptually how an SQL query will be executed by specifying the conceptual
order of executing each of the six clauses. Create a Student table and a course table
using SQL.
7. Feed some student records and Course records using SQL.
8. Use SQL to retrieve data from both the tables to show the Student details and the
course they take.

60
Week Six

Transaction management
Objectives
 Transaction Concept
 Transaction State
 Concurrent Executions
 Recoverability
 Implementation of Isolation
 Transaction Definition in SQL
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various data
items.
Two main issues to deal with:
i. Failures of various kinds, such as hardware failures and system crashes
ii. Concurrent execution of multiple transactions

Transaction Management
* What if the system fails?
* What if more than one user is concurrently updating the same data?

A transaction is a collection of operations that performs a single logical function in a database


application
Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes) and
transaction failures.
Concurrency-control manager controls the interaction among the concurrent transactions, to
ensure the consistency of the database.
A transaction comprises a unit of work performed within a database management system (or
similar system) against a database, and treated in a coherent and reliable way independent of
other transactions.

Transactions in a database environment have two main purposes:


i. To provide reliable units of work that allow correct recovery from failures and keep a
database consistent even in cases of system failure, when execution stops (completely
or partially) and many operations upon a database remain uncompleted, with unclear
status.
ii. To provide isolation between programs accessing a database concurrently. If this
isolation is not provided, the program's outcomes are possibly erroneous.

What is a transaction?

61
A transaction is a unit of program execution that accesses and possibly updates various data
items.To preserve the integrity of data the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or
none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions Ti
and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started
execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures
Transactions Management deals with the problem of always keeping the database in a
consistent state even when concurrent accesses and failures occur
The database can (and usually is) temporarily inconsistent during the execution of a
transaction. The important point is that the database should be consistent when the
transaction terminates
Transaction:
• A transaction is a series of actions, carried out by a single user or application
program, which must be treated as a logical unit of work.
• A transaction is a unit of consistent and reliable computation.
• Example the transfer of funds from one account to another: either the entire
operations are carried out or none of the operation is carried out.
• Transactions transform the database from consistent state to another consistent
state.

Transaction State
Active – the initial state; the transaction stays in this state while it is executing
Partially committed – after the final statement has been executed.
Failed -- after the discovery that normal execution can no longer proceed.
Aborted – after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted: restart the
transaction can be done only if no internal logical error kill the transaction
Committed – after successful completion.

62
Transaction State

ACID Properties
A database transaction, by definition, must be atomic, consistent, isolated and durable.
Database practitioners often refer to these properties of database transactions using the
acronym ACID.

Transactions provide an "all-or-nothing" proposition, stating that each work-unit performed in a


database must either complete in its entirety or have no effect whatsoever. Further, the system
must isolate each transaction from other transactions, results must conform to existing
constraints in the database, and transactions that complete successfully must get written to
durable storage.

A transaction is a unit of program execution that accesses and possibly updates various data
items.To preserve the integrity of data the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or
none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions Ti
and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started
execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.

63
Properties of a Transaction: ACID
§ Atomicity
– Transactions are either done or not done
– They are never left partially executed
§ Consistency
– Transactions should leave the database in a consistent state
§ Isolation
– Transactions must behave as if they were executed in isolation
§ Durability
– Effects of completed transactions are resilient against failures

Transaction management is concerned with:


1. Concurrency Control
2. Database Recovery

Concurrent Executions

Multiple transactions are allowed to run concurrently in the system.


Advantages are:
§ increased processor and disk utilization, leading to better transaction throughput
E.g. one transaction can be using the CPU while another is reading from or writing to the
disk
§ reduced average response time for transactions: short transactions need not wait
behind long ones.
Concurrency control schemes – Are mechanisms to achieve isolation. that is, to control the
interaction among the concurrent transactions in order to prevent them from destroying the
consistency of the database

Transactions in SQL
§ Consider the following SQL statement
UPDATE Project
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
§ This can be translated into a transaction as follows:
Begin_Transaction Budget_Update
Begin
EXEC SQL UPDATE Project
64
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
End

Note that
§ (Implicit beginning of transaction)
– SELECT …;
– UPDATE …;
– ……
– ROLLBACK | COMMIT;
§ ROLLBACK (transaction abort)
– Will undo the the partial effects of the transaction
– May be initiated by the DBMS

Transaction Definition in SQL


§ Data manipulation language must include a construct for specifying the set of actions
that comprise a transaction.
§ In SQL, a transaction begins implicitly.
§ A transaction in SQL ends by:
 Commit work commits current transaction and begins a new one.
 Rollback work causes current transaction to abort.
§ In almost all database systems, by default, every SQL statement also commits implicitly
if it executes successfully
 Implicit commit can be turned off by a database directive
4 E.g. in JDBC, connection.setAutoCommit(false);

65
Concurrency Control

DBMS Components
 Transactions Manager - It is the responsibility of the Transactions Manager of a DBMS
to ensure transactions do not interfere with one another and corrupt the database.
 Scheduler - Responsible for implementing a particular strategy for concurrency control.
Also called the Lock Manager (for those systems that use locking as the means of
managing concurrency)
 Recovery Manager- Ensures that the database is restored to the state it was in before
the start of the transaction.
 Buffer Manager - Responsible for the transfer of data between disk storage and
memory.
Conflicting Operations
§ Two operations on the same data item conflict if at least one of the operations is a write
– r(X) and w(X) conflict
– w(X) and r(X) conflict
– w(X) and w(X) conflict
– r(X) and r(X) do not
– r/w(X) and r/w(Y) do not
§ Order of conflicting operations matters
– If T1 .r(A) precedes T2 .w(A), then conceptually, T1 should precede T2

66
Interference Between Concurrent Transactions
There are many ways in which concurrently executing transactions can interfere with one
answer and so compromise the integrity and consistency of the DB. Three examples of such
interference are:
 Lost Update Problem
 Uncommitted Dependency Problem
 Inconsistency Analysis Problem

The Lost Update Problem


§ An apparently successfully completed update operation by one user can be overridden
by another user.
§ Consider this situation.
Time Transaction A Transaction B
t1 Read R -
t2 - Read R
t3 Write R -
t4 - Write R
§ Transaction A’s update is lost at time t4 because transaction B overwrites it without
even looking at it!
Example: Bank Balances
– Transaction A: retrieves some record R at time t1
– Transaction B: retries the same record R at time t2
– Transaction A: updates record at time t3 on the basis of values read at time t1
– Transaction B updates the same record at time t4 on the basis of values read at
t2 which are the same as those at t1

Solutions to the Lost Update Problem


 The loss of Transaction A’s update is avoided by preventing Transaction B from reading
the value of R until after Transaction A's update has been completed.
 Write Lock R at t1

Uncommitted Dependency Problem


 A violation of the integrity constraints governing the database can arise when two
transactions are allowed to execute concurrently without being synchronized.
67
 Consider this situation:
Time Transaction A Transaction B
t1 - Write R
t2 Read R -
t3 ROLLBACK
Example: Scheduling of Surgeons for Operations
 Transaction A reads on uncommitted update at time t2 e.g. assigning Theatre to a
Surgeon and a Patient
 The update is then undone at time t3 e.g. Surgeon or the Patient becomes Unavailable
 Transaction A is therefore operating on false assumption i.e.Transaction A becomes
dependent on an uncommitted update at time t2
Solution to the Uncommitted Dependency Problem
 The problem is avoided by preventing Transaction A from reading the value of R until
after the decision has been made either to commit or abort Transaction B's effect.
 Write Lock R at t1

Inconsistent Analysis Problem


 Transactions that only read the database can obtain inaccurate results if they are
allowed to read partial results of incomplete transactions, which are simultaneously
updating the database
 Consider two transactions A and B operating on bank account records
 Transaction B is summing account balances.
 Transaction A is transferring an amount 10 from account 1 to account 3
 Account 1 = 100; Account 2 = 50; Account 3 = 25

68
The Inconsistent Analysis Problem

Solution to Inconsistent Analysis Problem


 The problem is avoided by preventing Transaction B from reading the balance of
Account 1 and Account 3 until after Transaction A has completed its updates

Locking
• If a transaction wants to read an object, it must first request a shared lock (S mode) on
that object
• If a transaction wants to modify an object, it must first request an exclusive lock (X
mode) on that object
• Allow one exclusive lock, or multiple shared locks

69
The Lock Manager may schedule the tx as follows:
Two Phase Locking

Problem of Two phase Locking

70
Strict Two phase Locking
 Only release locks at commit/abort time
– A writer will block all other readers until the writer commits or aborts
 Used in most commercial DBMS (except Oracle)

Deadlocks

System is deadlocked if there is a set of transactions such that every transaction in the set is
waiting for another transaction in the set.

71
Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
Some prevention strategies : Require that each transaction locks all its data items before it
begins execution (predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data items only
in the order specified by the partial order (graph-based protocol).

Dealing with Deadlocks


§ Impose an order for locking objects
o Must know in advance which objects a transaction will access
§ Timeout
o If a transaction has been blocked for too long, just abort
§ Prevention
o Idea: abort more often, so blocking is less likely
o Wait/die versus wound/wait
§ Detection using wait-for graph
o Idea: deadlock is rare, so only deal it when it becomes an issue
o How often do we detect deadlocks?
o Which transactions do we abort in case of deadlock?
Deadlock Recovery
When deadlock is detected : Some transaction will have to rolled back (made a victim) to break
deadlock. Select that transaction as victim that will incur minimum cost.
Rollback -- determine how far to roll back transaction Total rollback: Abort the transaction and
then restart it.
More effective to roll back transaction only as far as necessary to break deadlock.

72
Starvation happens if same transaction is always chosen as victim. Include the number of
rollbacks in the cost factor to avoid starvation

Concurrency Control without Locking


§ Optimistic (validation-based)
§ Timestamp-based
Optimistic (validation-based)
 Locking is pessimistic
– Use blocking to avoid conflicts
– Overhead of locking even if contention is low
 Optimistic concurrency control
– Assume that most transactions do not conflict
– Let them execute as much as possible
– If it turns out that they conflict, abort and restart
– Timestamp-based
Optimistic: Sketch of Protocol
 Read phase: transaction executes, reads from the database, and writes to a
private space
 Validate phase: DBMS checks for conflicts with other transactions; if conflict is
possible, abort and restart
– Requires maintaining a list of objects read and written by each
transaction
 Write phase: copy changes in the private space to the database
Pessimistic vs Optimistic
 Overhead of locking vs overhead of validation and copying private space
 Blocking versus Aborts and Restarts
 Note
– Locking has better throughput for environments with medium-to-high
contention
– Optimistic concurrency control is better when resource utilization is low enough
Timestamp-based
 Associate each database object with a read timestamp and a write timestamp
 Assign a timestamp to each transaction (Timestamp order is commit order)
 When transaction reads/writes an object, check the object’s timestamp for conflict with
a younger transaction; if so, abort and restart

73
Review Questions
1. Discuss the meaning of concurrent execution of database transaction in a multi-user
system.
2. Discuss why concurrency control is necessary in transaction management.
3. State and explain using suitable diagram the states that a transaction goes through
during execution.
4. Discuss the ACID properties of a transaction.
5. Define the following terms as used in transcations
 Schedule
 Serial schedule
 Serializable schedule

74
Week Seven

Database Security
Database security can be defined as the protection of the database against
• unauthorized access to or modification of the database,
• denial of service to authorized users and
• provision of service to unauthorized users
It also includes the measures necessary to detect, document, and counter threats

Characteristics of database security


§ Confidentiality – protection against disclosure to unauthorized parties
§ Integrity – data is not accidentally or maliciously manipulated, altered or corrupted
§ Availability – accessibility, reliability and assurance of continuity of operation

Threats
Browsing - accessing info
Misuse – malice, errors of omission etc.
Penetration – Unauthorized access
Systems Flaws – h/w and s/w errors
Component Failure – malfunctioning of h/w, s/w or media
Tampering – attacks to physical and logical components
Eavesdropping – passive surveillance of telecomm channel e.g. tapping, sniffing
Denial of Service – preventing or delaying performance e.g. jamming, traffic flooding
Browsing - accessing info
Misuse – malice, errors of omission etc.
Penetration – Unauthorized access
Systems Flaws – h/w and s/w errors
Component Failure – malfunctioning of h/w, s/w or media
Tampering – attacks to physical and logical components
Eavesdropping – passive surveillance of telecomm channel e.g. tapping, sniffing
Denial of Service – preventing or delaying performance e.g. jamming, traffic flooding

75
Countermeasures – computer-based controls
authorization
views
backup and recovery
integrity
encryption
Redundant array of independent disks (RAID)
Authorization
The granting of a right or privilege that enables a subject to have legitimate
access to a database system or a database system’s object.
Authentication
A mechanism that determines whether a user is, who he or she claims to be.
View
A view is a virtual table that does not necessarily exist in the database but can be
produced upon request by a particular user, at the time of request.
Backup
Process of periodically taking a copy of the database and log file (and possibly programs)
onto offline storage media.

76
Journaling
Process of keeping and maintaining a log file (or journal) of all changes made to
database to enable recovery to be undertaken effectively in the event of failure.
Integrity
Prevents data from becoming invalid, and hence giving misleading or incorrect
results.
Encryption
Encoding the data by a special algorithm that renders the data unreadable by any
program without the decryption key.
Redundant array of independent disks (RAID)
Hardware that the DBMS runs on must be fault-tolerant, meaning that the DBMS
should continue to operate even if one of the hardware components fails.
Suggests having redundant components that can be seamlessly integrated into the
working system whenever there are failures.
The main hardware components that should be fault-tolerant include disk drives,
disk controllers, CPU, power supplies, and cooling fans.
Disk drives are the most vulnerable components with the shortest times between
failure of any of the hardware components.
One solution is to provide a large disk array comprising an arrangement of several
independent disks that are organized to improve reliability and at the same time
increase performance.

Database recovery
Database Recovery is the process of restoring the database to a correct state in the event of a
failure

Causes of Failures:
Transaction-Local - The failure of an individual transaction can be caused in several ways:
* Transaction – Induced Abort e.g. insufficient funds
* Unforeseen Transaction Failure, arising from bugs in the application
programs
* System-Induced Abort – e.g. when Transaction Manager explicitly aborts
a transaction because it conflicts with another transaction or to break a
deadlock.
§ Site Failures - Occurs due to failure of local CPU and results in a System Crashing.

77
• Total Failure – all sites in DDS are down
• Partial Failure – some sites are down
§ Media Failures e.g. head crashes
§ Network Failures - a failure may occur in the communications links.
§ Disasters e.g. fire or power failures
§ Carelessness - unintentional destruction of data by users
§ Sabotage - intentional corruption of destruction of data, h/w or s/w facilities

Recovery Procedures
Recovering from any type of system failure requires the following:
• Determining which data structures are intact and which ones need recovery.
• Following the appropriate recovery steps.
• Restarting the database so that it can resume normal operations.
• Ensuring that no work has been lost nor incorrect data entered in the database

Database Recovery Features


• recovery from system, software, or hardware failure
• automatic database instance recovery at database start up
• recovery of individual offline table spaces or files while the rest of a database is
operational
• time-based and change-based recovery operations to recover to a transaction-
consistent state specified by the database administrator
• increased control over recovery time in the event of system failure
• the ability to apply redo log entries in parallel to reduce the amount of time for
recovery
• Export and Import utilities for archiving and restoring data in a logical data format,
rather than a physical file backup

DBMS Recovery Facilities


 A backup mechanism
 Logging Facilities
 Checkpoint Facility to enable updates to database that are in progress to be permanent
 Recovery Manager - It is the role of the Recovery Manager to guarantee two of the four
A.C.I.D properties i.e. durability and atomicity, in the presence of unpredictable
features.
78
Log File
§ All operations on the database carried out by all transactions are recorded in the log
file (journal).
§ Each log record contains
 Transaction identifier
 Type of log record e.g. begin, write, commit, abort etc.
 Identifier of data object affected
 Before-image of data object
 After-image of data object
 Log management information
§ The Log is also used for monitoring and audit purposes.

Recovery Techniques
§ Restart Procedures – No transactions are accepted until the database has been
repaired. Includes:
• Emergency Restart - follows when a system fails without warning e.g. due to
power failure.
• Cold Restart - system restarted from archive when the log and/or restart file has
been corrupted.
• Warm Restart - follows controlled shutdown of system
§ Backups - particularly useful if there has been extensive damage to database.
§ Mirroring - Two complete copies of the DB are maintained on-line on different stable
storage devices. Used in environments with non-stop, fault-tolerant operations.
§ Undo/Redo - Undoing and Redoing a transaction after failure. The Transaction Manager
keeps an active-list, an abort-list and a commit-list; transactions that have began,
transactions aborted, and committed transactions respectively

Review Questions

1. Discuss what is meant by the following terms


 Database authorization
 Access control
 Data encryption
 Audit trail
2. How is view mechanism used as an authorization mechanism?
3. Discuss the different form of security measures used in design and implementation of
databases.
4. Define the following terms as used in databases
79
i. Checkpoint
ii. Database security
iii. Threat

80
Week Eight

Distributed DBMSs - Concepts and Design Transparencies


Objectives
§ Concepts.
§ Advantages and disadvantages of distributed databases.
§ Functions and architecture for a DDBMS.
§ Distributed database design.
§ Levels of transparency.
§ Comparison criteria for DDBMSs.

Concepts
Distributed Database: A logically interrelated collection of shared data (and a description of
this data), physically distributed over a computer network.
Distributed DBMS: Software system that permits the management of the distributed database
and makes the distribution transparent to users.
§ Collection of logically-related shared data.
§ Data split into fragments.
§ Fragments may be replicated.
§ Fragments/replicas allocated to sites.
§ Sites linked by a communications network.
§ Data at each site is under control of a DBMS.
§ DBMSs handle local applications autonomously.
§ Each DBMS participates in at least one global application

81
Distributed DBMS

Distributed Processing
A centralized database that can be accessed over a computer network.

Parallel DBMS
A DBMS running across multiple processors and disks designed to execute operations in
parallel, whenever possible, to improve performance.
§ Based on premise that single processor systems can no longer meet requirements for
cost-effective scalability, reliability, and performance.
§ Parallel DBMSs link multiple, smaller machines to achieve same throughput as single,
larger machine, with greater scalability and reliability.

82
§ Main architectures for parallel DBMSs are:
Shared memory,
Shared disk,
Shared nothing.
Parallel DBMS

Advantages of DDBMSs
§ Reflects organizational structure
§ Improved share ability and local autonomy
§ Improved availability
§ Improved reliability
§ Improved performance
§ Economics
§ Modular growth

Disadvantages of DDBMSs
§ Complexity
§ Cost
§ Security
§ Integrity control more difficult
§ Lack of standards

83
§ Lack of experience
§ Database design more complex

Types of DDBMS
1) Homogeneous DDBMS
2) Heterogeneous DDBMS
Homogeneous DDBMS
§ All sites use same DBMS product.
§ Much easier to design and manage.
§ Approach provides incremental growth and allows increased performance.
Heterogeneous DDBMS
§ Sites may run different DBMS products, with possibly different underlying data models.
§ Occurs when sites have implemented their own databases and integration is considered
later.
§ Translations required to allow for:
o Different hardware.
o Different DBMS products.
o Different hardware and different DBMS products.
§ Typical solution is to use gateways.

Open Database Access and Interoperability


§ Open Group formed a Working Group to provide specifications that will create a
database infrastructure environment where there is:
– Common SQL API that allows client applications to be written that do not need
to know vendor of DBMS they are accessing.
– Common database protocol that enables DBMS from one vendor to
communicate directly with DBMS from another vendor without the need for a
gateway.
– A common network protocol that allows communications between different
DBMSs.
§ Most ambitious goal is to find a way to enable transaction to span DBMSs from different
vendors without use of a gateway.
§ Group has now evolved into DBIOP Consortium and are working in version 3 of DRDA
(Distributed Relational Database Architecture) standard.

Multi database System (MDBS)


DDBMS in which each site maintains complete autonomy.
84
§ DBMS that resides transparently on top of existing database and file systems and
presents a single database to its users.
§ Allows users to access and share data without requiring physical database integration.
§ Unfederated MDBS (no local users) and federated MDBS.

Overview of Networking
Network - Interconnected collection of autonomous computers, capable of exchanging
information.
Local Area Network (LAN) intended for connecting computers at same site.
Wide Area Network (WAN) used when computers or LANs need to be connected over
long distances.
WAN relatively slow and less reliable than LANs. DDBMS using LAN provides much faster
response time than one using WAN.

Functions of a DDBMS
§ Expect DDBMS to have at least the functionality of a DBMS.
§ Also to have following functionality:
– Extended communication services.
– Extended Data Dictionary.
– Distributed query processing.
– Extended concurrency control.
– Extended recovery services.
85
Reference Architecture for DDBMS
§ Due to diversity, no accepted architecture equivalent to ANSI/SPARC 3-level
architecture.
§ A reference architecture consists of:
o Set of global external schemas.
o Global conceptual schema (GCS).
o Fragmentation schema and allocation schema.
o Set of schemas for each local DBMS conforming to 3-level ANSI/SPARC.
§ Some levels may be missing, depending on levels of transparency supported.
Reference Architecture for DDBMS

§ In DDBMS, GCS is union of all local conceptual schemas.


§ In FMDBS, GCS is subset of local conceptual schemas (LCS), consisting of data that
each local system agrees to share.
§ GCS of tightly coupled system involves integration of either parts of LCSs or local
external schemas.
§ FMDBS with no GCS is called loosely coupled.
Architecture for Tightly-Coupled FMDBS
86
Components of a DDBMS

Distributed Database Design


Three key issues:
– Fragmentation,
– Allocation,
– Replication.

Fragmentation
Relation may be divided into a number of sub-relations, which are then distributed.
Definition and allocation of fragments carried out strategically to achieve:
– Locality of Reference.

87
– Improved Reliability and Availability.
– Improved Performance.
– Balanced Storage Capacities and Costs.
– Minimal Communication Costs.
Involves analyzing most important applications, based on quantitative/qualitative
information.
Quantitative information may include:
– frequency with which an application is run;
– site from which an application is run;
– performance criteria for transactions and applications.
Qualitative information may include transactions that are executed by application, type
of access (read or write), and predicates of read operations.

Data Allocation
Each fragment is stored at site with “optimal” distribution.
Four alternative strategies regarding placement of data:
– Centralized,
– Partitioned (or Fragmented),
– Complete Replication,
– Selective Replication.
• Centralized: Consists of single database and DBMS stored at one site with users
distributed across the network.
• Partitioned: Database partitioned into disjoint fragments, each fragment
assigned to one site.
• Complete Replication: Consists of maintaining complete copy of database at
each site.
• Selective Replication: Combination of partitioning, replication, and
centralization.

88
Replication
Copy of fragment may be maintained at several sites.
Comparison of Strategies for Data Distribution

Why Fragment?
§ Usage
o Applications work with views rather than entire relations.
§ Efficiency
o Data is stored close to where it is most frequently used.
o Data that is not needed by local applications is not stored.
§ Parallelism
o With fragments as unit of distribution, transaction can be divided into several
subqueries that operate on fragments.
§ Security
o Data not required by local applications is not stored and so not available to
unauthorized users.
Disadvantages
o Performance,
o Integrity.

Correctness of Fragmentation
There are three correctness rules:
– Completeness,
– Reconstruction,
– Disjointness.

89
Completeness
If relation R is decomposed into fragments R1, R2, ... Rn, each data item that can be
found in R must appear in at least one fragment.

Reconstruction
• Must be possible to define a relational operation that will reconstruct R from the
fragments.
• Reconstruction for horizontal fragmentation is Union operation and Join for vertical .
Disjointness
• If data item di appears in fragment Ri, then it should not appear in any other
fragment.
• Exception: vertical fragmentation, where primary key attributes must be repeated to
allow reconstruction.
• For horizontal fragmentation, data item is a tuple.
• For vertical fragmentation, data item is an attribute.

Types of Fragmentation
Four types of fragmentation:
– Horizontal,
– Vertical,
– Mixed,
– Derived.
– Other possibility is no fragmentation:
– If relation is small and not updated frequently, may be better not to fragment
relation.

90
Horizontal and Vertical Fragmentation

Horizontal Fragmentation
§ Consists of a subset of the tuples of a relation.
§ Defined using Selection operation of relational algebra:
 p(R)
§ For example:
 P1 =  type=‘House’(PropertyForRent)
 P2 =  type=‘Flat’(PropertyForRent)

§ This strategy is determined by looking at predicates used by transactions.


§ Involves finding set of minimal (complete and relevant) predicates.
§ Set of predicates is complete, if and only if, any two tuples in same fragment are
referenced with same probability by any application.
§ Predicate is relevant if there is at least one application that accesses fragments
differently.
Vertical Fragmentation
§ Consists of a subset of attributes of a relation.
§ Defined using Projection operation of relational algebra:
o a1, ... ,an(R)
For example:
 S1 = staffNo, position, sex, DOB, salary(Staff)
 S2 = staffNo, fName, lName, branchNo(Staff)
§ Determined by establishing affinity of one attribute to another.

91
Mixed Fragmentation

Mixed Fragmentation
§ Consists of a horizontal fragment that is vertically fragmented, or a vertical fragment
that is horizontally fragmented.
§ Defined using Selection and Projection operations of relational algebra:
o  p(a1, ... ,an(R)) or
o a1, ... ,an(σp(R))
Example - Mixed Fragmentation
S1 = staffNo, position, sex, DOB, salary(Staff)
S2 = staffNo, fName, lName, branchNo(Staff)
S21 =  branchNo=‘B003’(S2)
S22 =  branchNo=‘B005’(S2)
S23 =  branchNo=‘B007’(S2)
Derived Horizontal Fragmentation
§ A horizontal fragment that is based on horizontal fragmentation of a parent relation.
§ Ensures that fragments that are frequently joined together are at same site.
§ Defined using Semijoin operation of relational algebra:
 Ri = R F Si, 1iw
Example - Derived Horizontal Fragmentation
S3 =  branchNo=‘B003’(Staff)
S4 =  branchNo=‘B005’(Staff)
S5 =  branchNo=‘B007’(Staff)
Could use derived fragmentation for Property:
Pi = PropertyForRent branchNo Si, 3  i  5
Derived Horizontal Fragmentation
§ If relation contains more than one foreign key, need to select one as parent.
§ Choice can be based on fragmentation used most frequently or fragmentation with
better join characteristics.

92
Distributed Database Design Methodology
1. Use normal methodology to produce a design for the global relations.
2. Examine topology of system to determine where databases will be located.
3. Analyze most important transactions and identify appropriateness of horizontal/vertical
fragmentation.
4. Decide which relations are not to be fragmented.
5. Examine relations on 1 side of relationships and determine a suitable fragmentation
schema. Relations on many side may be suitable for derived fragmentation.

Transparencies in a DDBMS
§ Distribution Transparency
– Fragmentation Transparency
– Location Transparency
– Replication Transparency
– Local Mapping Transparency
– Naming Transparency
§ Transaction Transparency
– Concurrency Transparency
– Failure Transparency
§ Performance Transparency
– DBMS Transparency
Distribution transparency allows user to perceive database as single, logical entity.
If DDBMS exhibits distribution transparency, user does not need to know:
– data is fragmented (fragmentation transparency),
– location of data items (location transparency),
– otherwise call this local mapping transparency.
With replication transparency, user is unaware of replication of fragments

Naming Transparency
§ Each item in a DDB must have a unique name.
§ DDBMS must ensure that no two sites create a database object with same name.
§ One solution is to create central name server. However, this results in:
– loss of some local autonomy;
– central site may become a bottleneck;
– low availability; if the central site fails, remaining sites cannot create any new
objects.

93
§ Alternative solution - prefix object with identifier of site that created it.
§ For example, Branch created at site S1 might be named S1.BRANCH.
§ Also need to identify each fragment and its copies.
§ Thus, copy 2 of fragment 3 of Branch created at site S 1 might be referred to as
S1.BRANCH.F3.C2.
§ However, this results in loss of distribution transparency
§ An approach that resolves these problems uses aliases for each database object.
§ Thus, S1.BRANCH.F3.C2 might be known as Local Branch by user at site S1.
§ DDBMS has task of mapping an alias to appropriate database object.
Transaction Transparency
§ Ensures that all distributed transactions maintain distributed database’s integrity and
consistency.
§ Distributed transaction accesses data stored at more than one location.
§ Each transaction is divided into a number of sub-transactions, one for each site that has
to be accessed.
§ DDBMS must ensure the indivisibility of both the global transaction and each of the sub-
transactions.
Example - Distributed Transaction
T prints out names of all staff, using schema defined above as S 1, S2, S21, S22, and S23. Define
three sub-transactions TS3, TS5, and TS7 to represent agents at sites 3, 5, and 7.

Concurrency Transparency
§ All transactions must execute independently and be logically consistent with results
obtained if transactions executed one at a time, in some arbitrary serial order.
§ Same fundamental principles as for centralized DBMS.
§ DDBMS must ensure both global and local transactions do not interfere with each other.
§ Similarly, DDBMS must ensure consistency of all subtransactions of global transaction.

94
Classification of Transactions
In IBM’s Distributed Relational Database Architecture (DRDA), four types of
transactions:
– Remote request
– Remote unit of work
– Distributed unit of work
– Distributed request.

§ Replication makes concurrency more complex.


§ If a copy of a replicated data item is updated, update must be propagated to all copies.
§ Could propagate changes as part of original transaction, making it an atomic operation.
§ However, if one site holding copy is not reachable, then transaction is delayed until site
is reachable.
§ Could limit update propagation to only those sites currently available. Remaining sites
updated when they become available again.
§ Could allow updates to copies to happen asynchronously, sometime after the original
update. Delay in regaining consistency may range from a few seconds to several hours.

Failure Transparency
§ DDBMS must ensure atomicity and durability of global transaction.
§ Means ensuring that subtransactions of global transaction either all commit or all abort.
§ Thus, DDBMS must synchronize global transaction to ensure that all subtransactions
have completed successfully before recording a final COMMIT for global transaction.
§ Must do this in presence of site and network failures.
95
Performance Transparency
§ DDBMS must perform as if it were a centralized DBMS.
– DDBMS should not suffer any performance degradation due to distributed
architecture.
– DDBMS should determine most cost-effective strategy to execute a request.
§ Distributed Query Processor (DQP) maps data request into ordered sequence of
operations on local databases.
§ Must consider fragmentation, replication, and allocation schemas.
§ DQP has to decide:
– which fragment to access;
– which copy of a fragment to use;
– which location to use.
§ DQP produces execution strategy optimized with respect to some cost function.
§ Typically, costs associated with a distributed request include:
– I/O cost;
– CPU cost;
– Communication cost.

Performance Transparency – Example


Property(propNo, city) 10000 records in London
Client(clientNo,maxPrice) 100000 records in Glasgow
Viewing(propNo, clientNo) 1000000 records in London
SELECT p.propNo
FROM Property p INNER JOIN
(Client c INNER JOIN Viewing v ON c.clientNo = v.clientNo)
ON p.propNo = v.propNo
WHERE p.city=‘Aberdeen’ AND c.maxPrice > 200000;

Assume:
Each tuple in each relation is 100 characters long.
10 renters with maximum price greater than £200,000.
100 000 viewings for properties in Aberdeen.
Computation time negligible compared to communication time.

96
Date’s 12 Rules for a DDBMS
Fundamental Principle To the user, a distributed system should look exactly like a non-
distributed system.
1. Local Autonomy
2. No Reliance on a Central Site
3. Continuous Operation
4. Location Independence
5. Fragmentation Independence
6. Replication Independence
7. Distributed Query Processing
8. Distributed Transaction Processing
9. Hardware Independence
10. Operating System Independence
11. Network Independence
12. Database Independence
Last four rules are ideals.

97
Review Questions
1. What are the reasons for and advantages of distributed databases.
2. What is fragmentation? . What are the main types of fragments?
3. Why is the concept of fragmentation useful in distributed database design?
4. Outline the major advantages and disadvantages of:-
- Fragmentation
- Replication
5. Why is data replication useful in DDBMSs? . What typical Chapters of data are replicated?
6. How is a horizontal partitioning of a relation specified? How can a relation be put back together
from a complete horizontal partitioning?
7. How is a vertical partitioning of a relation specified? How can a relation be put back together
from a complete vertical partitioning?
8. Discuss the main problems of distributed databases.
9. List and explain any five advantages of a distributed database management system.
10. With help of illustrative diagram, main discuss three architectures of parallel DBMS

98
Week Nine

Object-Oriented Database Systems


Objectives
1) The Object-Oriented Paradigm
2) Basic Definitions and Features of OODBS
3) The Object-Oriented Database System Manifesto
4) Examples of Object-Oriented DBMS
“Old” Database Applications: Features
i. Uniformity
ii. Record orientation
iii. Small data items
iv. Atomic fields
v. Short transactions
vi. Static conceptual schemes
Who Needs Object-Oriented Databases?
i. Computer-aided design (CAD)
ii. Computer-aided software engineering (CASE)
iii. Multimedia databases
iv. Office information systems (OIS)
v. Expert database systems
Conventional DBMS for New Applications: Disadvantages
a) Artificial representation of complex structured objects.
b) Properties of objects cannot be modeled in an appropriate way
c) Operational semantics of complex structured objects is not expressible
d) Different levels of abstraction are not supported
e) Lack of trigger, constraint and event mechanisms
f) Clumsy database access (impedance mismatch)
Object-Oriented Paradigm
Our environment exclusively consists of objects. Together with the objects comes their
behavior; i.e. objects are described by their functionality
Mostly we know the functionality of objects but we don't know how this functionality is
realized (encapsulation)
Objects react to messages. Only the objects themselves decide in which way they react to a
message. Different objects may react to the same message in different ways (polymorphism)
Objects inherit characteristics and abilities as a result of their membership of a special class or
category
99
The concepts of inheritance and polymorphism can best be exploited if the underlying system
supports run-time type checking and late binding - the operation to be used is chosen at run
time and converted to a program address
Object-Oriented Databases: Basic Definitions
In an object-oriented system everything is an object
§ Objects are encapsulated which means that there is no way to access an object except
through the public interface - a set of operations - specified for it
§ To emphasize object independence, objects communicate by message passing
§ A class is a template that special operations (like new) can use to create new objects
§ Inheritance makes it possible to declare a class as a specialization of another class, thus
supporting the management of (hierarchical) relationships among classes as well as the
reusability of software
§ The ability of different objects to respond differently to the same message is known as
polymorphism
§ The concepts of inheritance and polymorphism can best be exploited if the underlying
system supports run-time type checking and late binding - operation names are
translated into program addresses at run-time
O-O Databases: Top-Down & Bottom-Up Approach
§ Historical roots in
– database technology, including semantic data models
– abstraction-based programming languages, e.g., Simula-67, Smalltalk
– Knowledge-representation of AI technology
§ Bottom-Up approach - by database community
– Structurally object-oriented
§ Top-Down approach - by programming languages people
– behavioral object-orientation
– persistence - objects remain in existence even if an application program ends
– object sharing - independent applications may use the same data concurrently

Origins of Object-Oriented Database Concepts


1. Traditional database systems
– Persistence
– Sharing
– Query Language
– Concurrency control
– Transaction Management

100
2. Semantic
– data models
– Aggregation
– Generalization
3. Object-Oriented Programming
– Complex objects
– Object Identity
– Classes and Methods
– Encapsulation
– Inheritance
– Extensibility
4. Object-Oriented Data Model

The Object-Oriented Database System Manifesto


An object-oriented database system must satisfy two criteria: it should be a DBMS, and
it should be an object-oriented system, i.e., to the extent possible, it should be
consistent with the current crop of object-oriented programming languages. The first
criterion translates into five features: persistence, secondary storage management,
concurrency, recovery and an ad hoc query facility. The second one translates into eight
features: complex objects, object identity, encapsulation, types or classes, inheritance,
overriding combined with late binding, extensibility and computational completeness.
– The above listed features are classified as the mandatory characteristics
– Optional features include the ones that can be added to make the system better,
but which are not mandatory. These are multiple inheritance, type checking and
inferencing, distribution, design transactions and versions.
– Open features are the points where the designer can make a number or choices.
These are the programming paradigm, the representation system, the type
system, and uniformity.

OODBS Golden Rules:


Support for Complex Objects (1)
 Complex objects are built from simpler ones by applying constructors to them
 The simplest objects are objects such as integers, characters, byte strings of any length,
booleans and floats
 There are various complex object constructors: tuples, sets, bags, lists, and arrays
 The minimal set of constructors that the system should have are set, list and tuple

101
 Sets are critical because they are a natural way of representing collections from the real
world
 Tuples are critical because they are a natural way of representing properties of an entity
 Lists or arrays are important because they capture order, which occurs in the real world
 The object constructors must be orthogonal: any constructor should apply to any object
 Operations on a complex object must propagate transitively to all its components

Support for Object Identity (2)


 The idea of object identity is the following: in a model with object identity, an object has
an existence which is independent of its value
 Each Object has a unique, unchangable (immutable) ID independent of its current state
or behaviour
 Logical ID - implementation free (PK) - value oriented
 OID - system generated - unique to an environment not between environments
 Thus two notions of object equivalence exist: two objects can be identical or they can be
equal. This has two implications: object sharing and object updates
 In an identity-based model, two objects can share a component
– Example: A Person has a name, an age and a set of children. Assume Peter and
Susan both have a 15-year-old child named John. In real life, two situations may
arise: Susan and Peter are parents of the same child or there are two children
involved. In a system without identity, Peter is represented by:
 (peter, 40, {(john, 15, {})}) and Susan is represented by:
 (susan, 41, {(john, 15, {})})
 In an identity-based model, these two structures can share the common
part (john, 15, {}) or not, thus capturing either situations. (SETS in
CODASYL)
 Support for many-to-many relationships?
Encapsulation (3)
 The idea of encapsulation comes from:
– the need to cleanly distinguish between the specification and the
implementation of an operation and
– the need for modularity
 There are two views of encapsulation
– the programming language view - abstract data types
– the database adaptation of that view - encapsulate both program and data

102
Support for Types or Classes (4)
 There are two main categories of object-oriented systems, those supporting the notion
of class and those supporting the notion of type
 A type, in an object-oriented system, summarizes the common features of a set of
objects with the same characteristics. It has two parts: the interface and the
implementation (or implementations).
– The interface consists of a list of operations together with their signatures (i.e.,
the type of the input parameters and the type of the result)
– The type implementation consists of a data part and an operation part
 The notion of class is different from that of type. Its specification is the same as that of a
type, but it is more of a run-time notion. It contains two aspects: an object factory and an
object warehouse.
– The object factory can be used to create new objects, by performing the
operation new on the class, or by cloning some prototype object representative
of the class
– The object warehouse means that attached to the class is its extension, i.e., the
set of objects that are instances of the class
Class or Type Hierarchies (5)
 Inheritance has two advantages:
– it is a powerful modeling tool, because it gives a concise and precise description
of the world and
– it helps in factoring out shared specifications and implementations in
applications
 Example
– Employee(name, age, salary) - can die, get married, be paid
– Student(name, age, set-of-grades) - can die, get married, have GPA computed
– Person(name, age) - can die, get married
 Employee(+salary) - pay
 Student(+set-of-grades) - GPA computation
 There are at least four types of inheritance:
– substitution inheritance
– inclusion inheritance
– constraint inheritance
– specialization inheritance

103
Overriding, Overloading and Late Binding (6)
 Example The display operation
for x in X do
begin
case of type(x)
person: display(x); for x in X do display(X)
bitmap: display-bitmap(x)
graph: display-graph(x);
end
end
 In an object-oriented system, we define the display operation at the object type level
(the most general type in the system). Thus, display has a single name and can be used
indifferently on graphs, persons and pictures. However, we redefine the
implementation of the operation for each of the types according to the type (this
redefinition is called overriding). This results in a single name (display) denoting three
different programs (this is called overloading). To display the set of elements, we simply
apply the display operations to each one of them, and let the system pick the
appropriate implementation at run-time.
 In order to provide this new functionality, the system cannot bind operation names to
programs at compile time. Therefore, operation names must be resolved (translated
into program addresses) at run-time. This delayed translation is called late binding.
Computational Completeness (7)
 From a programming language point of view the computational completeness means
that one can express any computable function, using the DML of the database system
 From a database point of view this is a novelty, since SQL, for instance, is not complete.
 Computational completeness can be introduced through a reasonable connection to
existing programming languages
 Note that this is different from being "resource complete", i.e., being able to access all
resources of the system (e.g. screen and remote communication) from within the
language
 Database query languages usually impose severe restrictions on the kind of
computations that can be performed. As a result application programs must be
implemented in general-purpose languages while access to data is realized via
declarative data sublanguages, like SQL. As a consequence data has to be passed
between these two languages. Since both languages are usually semantically as well as
structurally different such transformations may lead to a loss of information. This
problem is known as impedance mismatch.
104
Extensibility (8)
 The database system comes with a set of predefined types. These types can be used at
will by programmers to write their applications. This set of types must be extensible in
the following sense:
– there is a means to define new types and
– there is no distinction in usage between system defined and user defined types
 Relational database systems rely on a simple data model. It offers a fixed number of
predefined atomic data types and very few type constructors.
 In order to permit the integration of arbitrary data types into a data model, it must be
possible to define new basic data types, to use previously defined data types as building
blocks for other data types, or to nest data structures to arbitrary levels
 Since object-oriented database systems are meant to offer an interface which integrates
a programming language and data management facilities they must allow the
programmer to define and execute any kind of operations
Persistence (9)
 Persistence is the ability of the programmer to have her/his data survive the execution
of a process, in order to eventually reuse it in another process
 Persistence should be orthogonal, i.e., each object, independent of its type, is allowed
to become persistent as such
 Persistence should also be implicit: the user should not have to explicitly move or copy
data to make it persistent
 Persistence can be achieved in several ways:
 A first solution is to define the persistence property in the methods by using operations
which make an object persistent
– Another straightforward solution is to make everything persistent
– Different approach is to let the system designer specify which classes are to be
persistent or to store only data which is explicitly identified by a unique name
Secondary Storage Management (10)
 Secondary storage management is a classical feature of database management systems.
It is usually supported through a set of mechanisms. These include index management,
data clustering, data buffering, access path selection and query optimization.
 None of these is visible to the user: they are simply performance features
 The application programmer should not have to write code to maintain indices, to
allocate disk storage, or to move data between disk and main memory
 Since objects in object-oriented database systems may be rather complex as well as
extremely large it seems to be necessary to support access strategies within objects (e.g.

105
indexes on attribute types which consist of collections of values or instances of other
types)
Support for Concurrent Users (11)
 With respect to the management of multiple users concurrently interacting with the
system, the system should offer the same level of service as current database systems
provide
 It should therefore insure harmonious coexistence among users working simultaneously
on the database
 The system should therefore support the standard notion of atomicity of a sequence of
operations and of controlled sharing
 Serializability of operations should at least be offered, although less strict alternatives
may be offered
 The existence of a class hierarchy and the fact that complex objects may contain other
objects need to be considered especially by the concurrency-control mechanism for the
following reasons:
– If the scope of a query is not a single class but the complete subclass hierarchy
the set of objects of the subclass hierarchy must not be modified by a conflicting
transaction
– If a complete composite object O is accessed by a transaction it has to be
ensured that shared subobjects of O cannot be accessed in an incompatible way
by concurrent transactions
– Since a subclass S inherits structure and behavior from its superclasses it must be
forbidden that a concurrent transaction modifies the definition of any of the
superclasses of S while a transaction T accesses instances of S (otherwise
inherited attributes or methods may be deleted from S or new attributes may be
added to S while instances of S are accessed by T)
Recovery (12)
 The object-oriented database system should provide the same level of service as current
database systems
 Therefore, in case of hardware or software failures, the system should recover, i.e.,
bring itself back to some coherent state of the data
 Hardware failures include both processor and disk failures
 Support for long transactions needs a finer unit of recovery since it is unreasonable for a
user to lose all his work (of days or weeks) in case of, e.g. a system malfunction
Ad Hoc Query Facility (13)
A query language should be:
– application-independent
106
– high-level (descriptive)
– optimizable
– complete
– generic
– powerful
– adequate
– extensible
– efficient

OODBS Manifesto: Optional Features


 Multiple inheritance
 Type checking and type inferencing
 Distribution
 Design transactions
– inheritance - lock sub-superclasses
– support for long tx
– softlocks and notification
– choose your own cc mechanism
 Versions
– when to create a new version?
– how to represent versions
 Change Management
– How do we handle schema changes?
– Write once classes, immediate update, lazy update, schema mapping

OODBS: Products & Prototypes


 All the systems we categorize as object-oriented databases have the common thread of
objects as the basic data structure, public methods (and, in some cases, attributes)
associated with objects as the mechanism to operate on objects, private attributes (and,
in some cases, procedures) associated with objects as the underlying representation,
inheritance of attributes and procedures from supertypes, the ability to define
dynamically new simple attribute types as well as new object types, and the
representation of relationships by attributes.
 ONTOS is based on C++, and stores the binary code for methods associated with objects
in C++ binaries rather than the database, linking the methods with object data when
objects are accessed. ONTOS operations are invoked by calls on a run-time library, and

107
persistent application classes inherit methods from an ONTOS-supplied persistent object
class.
 GemStone was designed based on a derivative of Smalltalk called OPAL. However, it has
subsequently been integrated with C++ as well.
 ORION was designed around extensions of Common LISP, as is the product followon,
ITASCA. LISP methods are stored in the database, and are invoked by the system.

OODBS: Future Directions & Research


 Conceptual
– Query languages for OODBs are still rough
– An active area in database-systems research is deductive or logic databases
 System Engineering
– Optimization technology for object-oriented systems
– Languages in OODBs are currently more procedural than declarative
– Encapsulation hides the big picture for what is going on in a query
– Storage management for OODBs is in its infancy
– Parallelism has been a major topic in database research throughout the 1980s
– Emergence of nonstandard architectures for database systems
 Applications
– Object-oriented schema design is more complex than is design of record-based
databases, because behaviour must be modelled, and its partitioning between
classes must be decided
– Hypertext systems share features with OODBs,
– OODBs are good candidates for supporting environments for cooperative work

SQL3
 Most relevant to extended relational approach
 ANSI and ISO draft standard
 “Work on SQL3 is well underway, but the final standards is several years away”
 Most important new features:
– Abstract data types w. attributes and routines
– Inheritance of ADTs and tables
– Collections, parameterized types
– Procedures, functions: computational completeness

108
Strategies for developing an oodbms
 Extend an Existing OOP Language with Database Capabilities: Add traditional
database capabilities to an existing OOP language e.g. Smalltalk, C++ or Java.
Used in GemStone.
 Provide Extensible OODBMS Libraries: Similar to the above. However, rather
than extending the language, class libraries are provided that support
persistence, aggregation, data types, transactions, cocurrency, etc. Used in
Ontos, Versant, ObjectStore.
 Embed OO Database Language Constructs in a Conventional Host Language:
used in O2, which supports embedded extensions for C.
 Extend an Existing Database Language with OO Capabilities: e.g. extending SQL
to provide OO constructs (SQL 3).
 Develop a New Database Data Model/Data Language: develop an entirely new
database language and DBMS with OO capabilities.

Object database standards


These are being developed by
 The Object Management Group (OMG)
 Object Data Management Group (ODMG

Object management group


 A non-profit corporation dedicated to developing object standards. They have
published
 CORBA - Common Object Request Broker Architecture :Creates a standard protocol for
an object to place requests and receives responses from other objects.
 OMGs goal is to get everybody to agree on
 a message format and how objects talk to each other
 language and model data structure
 common interfaces
 how to do security & containment
 OMG provides an open forum for industry, discussion, education and promotion of
OMG endorsed object technology.

Object data management group

109
 a consortium of object-oriented database management system (ODBMS) vendors and
interested parties working on standards to allow portability of customer software across
ODBMS products.
 Current voting members:
 GemStone Systems
 IBEX Computing (ITASCA)
 O2 Technology
 Object Design (ObjectStore)
 Objectivity
 POET Software
 UniSQL
 Versant Object Technology

ODMG major components


 Object Model
 based on OMG Object Model
 Object Definition Language (ODL)
 Language for describing database schemas
 based on OMG IDL
 Object Query Language (OQL)
 Declarative query language, not computationally complete
 Can be optimized
 Syntax based on SQL
 C++ Binding
 Smalltalk Binding
 (Java Binding)
TRENDS
 Object Technology will continue to grow, especially:
 visual development tools
 re-use, components
 Background:
 C++, CORBA, Extended RDMS, OODBMS, tools, infrastructure
 Foreground:
 Java, ActiveX, Internet, visual development, quick and light solutions, end-user
involvement, scalability

110
Review Questions

1. Explain what are Database Objects?


2. Discuss the components of a Object oriented database environment
3. Explain the following:
i. Class
ii. Class defining object
iii. Instance
iv. Encapsulation
v. Object
vi. Message
vii. Method
5. Investigate any object system that might be available to you. What programming
languages does the system support? Does it support a query language? If so what is it?
In your opinion is it more or less powerful than SQL? Is there any view support?
6. Object systems typically support integrity constraints in a procedural fashion, via
methods: the main exception is that referential constraints are typically supported
declaratively. What are the advantages of procedural support? Why do you think
referential constraints are handled differently?

111
Week Ten

Web Technology and DBMSs Transparencies


Objectives
1. Basics of Internet, Web, HTTP, HTML, URLs.
2. Advantages and disadvantages of Web as a database platform.
3. Approaches for integrating databases into Web:
– Scripting Languages
– Common Gateway Interface (CGI)
– HTTP Cookies
– Extending the Web Server
– Java, J2EE, JDBC, SQLJ, CMP, JDO, Servlets, and JSP
– Microsoft Web Platform: .NET, ASP, and ADO
– Oracle Internet Platform.
Introduction
§ Web is the most popular and powerful networked information system to date.
§ As architecture of Web was designed to be platform-independent, can significantly
lower deployment and training costs.
§ Organizations using Web as strategic platform for innovative business solutions, in
effect becoming Web-centric.
§ Many Web sites today are file-based where each Web document is stored in
separate file.
§ For large sites, this can lead to significant management problems.
§ Also many Web sites now contain more dynamic information, such as product and
pricing data.
§ Maintaining such data in both a database and in separate HTML files is problematic.
§ Accessing database directly from Web would be a better approach.

Internet
Worldwide collection of interconnected networks.
§ Began in late ‘60s in ARPANET, a US DOD project, investigating how to build
networks that could withstand partial outages.
§ Starting with a few nodes, Internet estimated to have over 945 million users by end
of 2004.
§ 2 billion users projected by 2010.
§ About 3.5 billion documents on Internet (550 billion if intranets/extranets included).
Intranet and Extranet
112
§ Intranet - Web site or group of sites belonging to an organization, accessible only by
members of that organization.
§ Extranet - An intranet that is partially accessible to authorized outsiders.
Whereas intranet resides behind firewall and is accessible only to people who are
members of same organization, extranet provides various levels of accessibility to
outsiders.
eCommerce and eBusiness
§ eCommerce - Customers can place and pay for orders via the business’s Web site.
§ eBusiness - Complete integration of Internet technology into economic
infrastructure of the business.
§ Business-to-business transactions may reach $2.1 trillion in Europe and $7 trillion in
US by 2006.
§ eCommerce may account for $12.8 trillion in worldwide corporate revenue by 2006
and could represent 18% of sales in the global economy.

The Web
Hypermedia-based system that provides a simple ‘point and click’ means of browsing
information on the Internet using hyperlinks.
§ Information presented on Web pages, which can contain text, graphics, pictures,
sound, and video.
§ Can also contain hyperlinks to other Web pages, which allow users to navigate in a
non-sequential way through information.
§ Web documents written using HTML.
§ Web consists of network of computers that can act in two roles:
– as servers, providing information;
– as clients (browsers), requesting information.
§ Protocol that governs exchange of information between Web server and browser is
HTTP and locations within documents identified as a URL.
§ Much of Web’s success is due to its simplicity and platform-independence.

Basic Components of Web Environment

113
HyperText Transfer Protocol (HTTP)
Protocol used to transfer Web pages through Internet.
– Based on request-response paradigm:
Connection - Client establishes connection with Web server.
Request - Client sends request to Web server.
Response - Web server sends response (HTML document)
to client.
Close - Connection closed by Web server.
HTTP/1.0 is stateless protocol - each connection is closed once server provides response.
This makes it difficult to support concept of a session that is essential to basic DBMS
transactions.

HyperText Markup Language (HTML)


Document formatting language used to design most Web pages.
– A simple, yet powerful, platform-independent document language.
– HTML is application of Standardized Generalized Markup Language (SGML), a system for
defining structured document types and markup languages to represent instances of
those document types.

114
Uniform Resource Locators (URLs)
String of alphanumeric characters that represents location or address of a resource on Internet
and how that resource should be accessed.
– Defines uniquely where documents (resources) can be found.
– Uniform Resource Identifiers (URIs) - generic set of all Internet resource
names/addresses.
– Uniform Resource Names (URNs) - persistent, location-independent name. Relies on
name lookup services.
– URL consists of three basic parts:
u protocol used for the connection,
u host name,
u path name on host where resource stored.
– Can optionally specify:
u port through which connection to host should be made,
u query string.
http://www.w3.org/Markup/MarkUp.html

Static and Dynamic Web Pages


– HTML document stored in file is static Web page.
– Content of dynamic Web page is generated each time it is accessed.
– Thus, dynamic Web page can:
o respond to user input from browser;

115
o be customized by and for each user.
– Requires hypertext to be generated by servers.
– Need scripts that perform conversions from different data formats into HTML ‘on-the-
fly’.

Web Services
Collection of functions packaged as single entity and published to network for use by other
programs.
– Web services are important paradigm in building applications and business processes
for the integration of heterogeneous applications.
– Based on open standards and focus on communication and collaboration among people
and applications.
– Unlike other Web-based applications, Web services have no user interface and are not
targeted for browsers. Instead, consist of reusable software components designed to be
consumed by other applications.
Web Services – Technologies & Standards
– eXtensible Markup Language (XML).
– SOAP (Simple Object Access Protocol) protocol, based on XML, used for communication
over Internet.
– WSDL (Web Services Description Language) protocol, again based on XML, used to
describe the Web service.
– UDDI (Universal Discovery, Description and Integration) protocol used to register the
Web service for prospective users.
– Common example is stock quote facility, which receives a request for current price of a
specified stock and responds with requested price.
– Second example is Microsoft MapPoint Web service that allows high quality maps,
driving directions, and other location information to be integrated into a user
application, business process, or Web site.

Requirements for Web-DBMS Integration


– Ability to access valuable corporate data in a secure manner.
– Data- and vendor-independent connectivity to allow freedom of choice in DBMS
selection.
– Ability to interface to database independent of any proprietary browser or Web server.
– Connectivity solution that takes advantage of all the features of an organization’s DBMS.
– Open architecture to allow interoperability with a variety of systems and technologies.
For example:

116
different Web servers;
Microsoft's (Distributed) Common Object Model (DCOM/COM);
CORBA/IIOP (Internet Inter-ORB protocol);
Java/Remote Method Invocation (RMI);
XML;
Web services (SOAP, WSDL, UDDI).
– Cost-effective solution that allows for scalability, growth, and changes in strategic
directions, and helps reduce applications development costs.
– Support for transactions that span multiple HTTP requests.
– Support for session- and application-based authentication.
– Acceptable performance.
– Minimal administration overhead.
– Set of high-level productivity tools to allow applications to be developed, maintained,
and deployed with relative ease and speed.

Advantages of Web-DBMS Approach


DBMS advantages
Simplicity
Platform independence
Graphical User Interface
Standardization
Cross-platform support
Transparent network access
Scalable deployment
Innovation

Disadvantages of Web-DBMS Approach


Reliability
Security
Cost
Scalability
Limited functionality of HTML
Statelessness
Bandwidth
Performance
Immaturity of development tools

117
Approaches to Integrating Web and DBMSs
Scripting Languages.
Common Gateway Interface (CGI).
HTTP Cookies.
Extending the Web Server.
Java, J2EE, JDBC, SQLJ, JDO, Servlets, and JSP.
Microsoft Web Solution Platform: .NET, ASP, and ADO.
Oracle Internet Platform.
Scripting Languages (JavaScript and VBScript)
– Scripting languages can be used to extend browser and Web server with database
functionality.
– As script code is embedded in HTML, it is downloaded every time page is accessed.
– Updating browser is simply a matter of changing Web document on server.
– Some popular scripting languages are: JavaScript, VBScript, Perl, and PHP.
– They are interpreted languages, not compiled, making it easy to create small
applications.
Common Gateway Interface (CGI)
Specification for transferring information between a Web server and a CGI program.
– Server only intelligent enough to send documents and to tell browser what kind of
document it is.
– But server also knows how to launch other programs.
– When server sees that URL points to a program (script), it executes script and sends
back script’s output to browser as if it were a file.

CGI – Environment

– CGI defines how scripts communicate with Web servers.

118
– A CGI script is any script designed to accept and return data that conforms to the CGI
specification.
– Before server launches script, prepares number of environment variables
representing current state of the server, who is requesting the information, and so
on.
– Script picks this up and reads STDIN.
– Then performs necessary processing and writes its output to STDOUT.
– Script responsible for sending MIME header, which allows browser to differentiate
between components.
– CGI scripts can be written in almost any language, provided it supports reading and
writing of an operating system’s environment variables.
Four primary methods for passing information from browser to a CGI script:
o Passing parameters on the command line.
o Passing environment variables to CGI programs.
o Passing data to CGI programs via standard input.
o Using extra path information.

CGI - Passing Parameters on Command Line

119
CGI – Advantages
CGI is the de facto standard for interfacing Web servers with external applications.
Possibly most commonly used method for interfacing Web applications to data sources.
Advantages:
– simplicity,
– language independence,
– Web server independence,
– wide acceptance.

CGI – Disadvantages
– Communication between client and database server must always go through
Web server.
– Lack of efficiency and transaction support, and difficulty validating user input
inherited from statelessness of HTTP protocol.
– HTTP never intended for long exchanges or interactivity.
– Server has to generate a new process or thread for each CGI script.
– Security.
HTTP Cookies
Cookies can make CGI scripts more interactive.
Cookies are small text files stored on Web client.
CGI script creates cookie and has Web server send it to client’s browser to store
on hard disk.
Later, when client revisits Web site and uses a CGI script that requests this
cookie, client’s browser sends information stored in the cookie.

120
Cookies can be used to store registration information or preferences (e.g. for
virtual shopping cart).
However, not all browsers support cookies.

Extending the Web Server


To overcome limitations of CGI, many servers provide an API that adds
functionality to server.
Two of main APIs are Netscape’s NSAPI and Microsoft’s ISAPI.
Scripts are loaded in as part of the server, giving back-end applications full access
to all the I/O functions of server.
One copy of application is loaded and shared between multiple requests to
server.
Approach more complex than CGI, possibly requiring specialized programmers.
Can provide very flexible and powerful solution.
API extensions can provide same functionality as a CGI program, but as API runs
as part of the server, API approach can perform significantly better than CGI.
Extending Web server is potentially dangerous, since server executable is being
changed.

Comparison of CGI and API


CGI and API both extend capabilities of server.
CGI scripts run in environment created by Web server program.
Scripts only execute once Web server interprets request from browser, then
returns results back to the server.
API approach not nearly so limited in its ability to communicate.
API-based extensions are loaded into same address space as Web server.
Java
Proprietary language developed by Sun.
Originally intended to support environment of networked machines and embedded
systems.
Now, Java is rapidly becoming de facto language for Web computing.
Interesting because of its potential for building Web applications (applets) and
server applications (servlets).
Java is ‘A simple, object-oriented, distributed, interpreted, robust, secure, architecture
neutral, portable, high-performance, multi-threaded and dynamic language’.
Has a machine-independent target architecture, the Java Virtual Machine (JVM).

121
Since almost every Web browser vendor has already licensed Java and implemented
an embedded JVM, Java applications can currently be deployed on most end-user
platforms.

Before Java application can be executed, it must first be loaded into memory.
Done by Class Loader, which takes ‘.class’ file(s) containing bytecodes and transfers
it into memory.
Class file can be loaded from local hard drive or downloaded from network.
Finally, bytecodes must be verified to ensure that they are valid and do not violate
Java’s security restrictions.
Loosely speaking, Java is a ‘safe’ C++.
Safety features include strong static type checking, automatic garbage collection,
and absence of machine pointers at language level.
Safety is central design goal: ability to safely transmit Java code across Internet.
Security is also integral part of Java’s design - sandbox ensures untrusted application
cannot gain access to system resources.

Java 2 Platform
In mid-1999, Sun announced it would pursue a distinct and integrated Java enterprise
platform:
– J2ME: aimed at embedded and consumer-electronics platforms.
– J2SE: aimed at typical desktop and workstation environments. Serves as
foundation for J2EE and Web services.
– J2EE: aimed at robust, scalable, multiuser, and secure enterprise applications.
– J2EE was designed to simplify complex problems with development,
deployment, and management of multi-tier enterprise applications.

122
Cornerstone of J2EE is Enterprise JavaBeans (EJB), a standard for building server-side
components in Java.
Three types of EJB components:
– EJB Session Beans, components implementing business logic, business rules, and
workflow.
– EJB Message-Driven Beans (MDBs), which process messages sent by clients, EJBs,
or other J2EE components.
– EJB Entity Beans, components encapsulating some data contained by the
enterprise. Entity Beans are persistent.
Two types of entity beans:
– Bean-Managed Persistence (BMP), which requires developer to write code top
make bean persist using an API such as JDBC or SQLJ.
– Container-Managed Persistence (CMP), where persistence is provided
automatically by container.

Discuss 5 ways to access a database: JDBC, SQLJ, CMP, JDO, and JSP.

JDBC
Modeled after ODBC, JDBC API supports basic SQL functionality.
With JDBC, Java can be used as host language for writing database applications.
On top of JDBC, higher-level APIs can be built.
Currently, two types of higher-level APIs:
– An embedded SQL for Java (e.g. SQLJ).
– A direct mapping of relational database tables to Java classes (e.g. TopLink from
Oracle).
JDBC API consists of two main interfaces: an API for application writers, and a lower-
level driver API for driver writers.
123
Applications and applets can access databases using:
– ODBC drivers and existing database client libraries;
– JDBC API with pure Java JDBC drivers.

JDBC - Advantages/Disadvantages
Advantage of using ODBC drivers is that they are a de facto standard for PC database
access, and are available for many DBMSs, for very low price.
Disadvantages with this approach:
– Non-pure JDBC driver will not necessarily work with a Web browser.
– Currently downloaded applet can connect only to database located on host
machine.
– Deployment costs increase.

SQLJ
Another JDBC-based approach uses Java with static embedded SQL.
SQLJ comprises a set of clauses that extend Java to include SQL constructs as statements
and expressions.
SQLJ translator transforms SQLJ clauses into standard Java code that accesses database
through a CLI.
Comparison of JDBC and SQLJ
SQLJ is based on static embedded SQL while JDBC is based on dynamic SQL.
Thus, SQLJ facilitates static analysis for syntax checking, type checking, and schema
checking, which may help produce more reliable programs at loss of some functionality.
It also potentially allows DBMS to generate an execution strategy for the query, thereby
improving performance of the query.
JDBC is low-level middleware tool with features to interface Java application with
RDBMS.
Developers need to design relational schema to which they will map Java objects, and
write code to map Java objects to rows of relations.
Problems:
– need to be aware of two different paradigms (object and relational);
– need to design relational schema to map onto object design;
– need to write mapping code
EJBs
EJBs have 3 elements in common:
– an indirection mechanism;
– a bean implementation;

124
– a deployment description.
With indirection mechanism clients do not invoke EJB methods directly.
Session and entity beans provide access to their operations via interfaces.
home interface defines methods that manage lifecycle of a bean. The corresponding
server-side implementation classes are generated at deployment time.
To provide access to other operations, bean can expose a local interface (if client and
bean are colocated), a remote interface, or both.
Local interfaces expose methods to clients running in same container or JVM.
Remote interfaces make methods available to clients no matter where deployed.
When a client invokes create() method (which returns an interface) on home interface,
EJB container calls ejbCreate() to instantiate bean, at which point client can access bean
through remote or local interface returned by create().

Bean implementation is a Java class that implements business logic defined in remote
interface.
Transactional semantics are described declaratively and captured in the deployment
descriptor.
Deployment descriptor, written in XML, lists a bean’s properties and elements, which
may include:
– home interface, remote interface, local interface;
– Web service endpoint interface,
– bean implementation class,
– JNDI name for bean, transaction attributes, security attributes, and per-method
descriptors.

Container-Managed Persistence (CMP)


Instead of writing Java code to implement BMP, CMP is defined declaratively in
deployment descriptor.
At runtime, container manages bean’s data by interacting with data source designated
in deployment descriptor.

125
Following steps need to be followed for CMP:
– Define CMP fields in local interface.
– Define CMP fields in entity bean class implementation.
– Define CMP fields in deployment descriptor.
– Define PK field and its type in deployment descriptor.
EJB container can manage relationships between entity beans and session beans.
Relationships have a multiplicity, which can be 1:1, 1:M, or M:M, and a direction, which
can be unidirectional or bidirectional.
Local interfaces provide foundation for CMR.
With CMR, beans use local interfaces to maintain relationships with other beans.
For example, a Staff bean can use collection of PropertyForRent local interfaces to
maintain a 1:M relationship
Container can also manage referential integrity.
CMR relationships are described declaratively in deployment descriptor file outside
enterprise-beans element.
Need to specify both beans involved in relationship.
Relationship is defined in ejb-relations element, with each role defined in ejb-
relationship-role element.
When bean is deployed, the container provider’s tools parse deployment descriptor and
generate code to implement underlying classes.

EJB Query Language (EJB-QL


Used to define queries for entity beans that operate with CMP. EJB-QL can express
queries for two different styles of operations:
– finder methods, which allow results of an EJB-QL query to be used by clients of
the entity bean. Finder methods are defined in home interface.
– select methods, which find objects or values related to state of an entity bean
without exposing results to client. Select methods are defined in entity bean
class.
An object-based approach for defining queries against persistent store; conceptually
similar to SQL.
As with CMP and CMR fields, queries are defined in the deployment descriptor.
EJB container is responsible for translating EJB-QL queries into query language of
persistent store, resulting in query methods that are more flexible.
<query>
<query-method>
<method-name>findAll</method-name>

126
<method-params></method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s) FROM Staff s]]></ejb-ql>
</query>

<query>
<query-method>
<method-name>findByStaffName</method-name>
<method-params>java.lang.String</method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s)
FROM Staff s WHERE s.name = ?1]]>
</ejb-ql>
</query>

Java Data Objects (JDO)


ODMG submitted their Java binding to Java Community Process as basis of JDO.
Development of JDO had two major aims:
– To provide standard interface between application objects and data sources,
such as relational databases, XML databases, legacy databases, and file systems.
– To provide developers with a transparent Java-centric mechanism for working
with persistent data to simplify application development. (Aim of JDO was to
reduce need to explicitly code such things as SQL statements and transaction
management into applications).
Java Data Objects (JDO) – Interfaces
PersistenceCapable makes a Java class capable of being persisted by a persistence manager.
Every class whose instances can be managed by a JDO PersistenceManager must implement
this interface.
Most JDO implementations provide an enhancer that transparently adds code to implement
this interface to each persistent class.
The interface defines methods that allow an application to examine runtime state of an
instance and to get its associated PersistenceManager if it has one.
PersistenceManagerFactory obtains PersistenceManager instances. PMF instances can be
configured and serialized for later use.

127
PersistenceManager contains methods to manage the lifecycle of PersistenceCapable
instances and is also the factory for Query and Transaction instances.
A PersistenceManager instance supports one transaction at a time and uses one connection
to the underlying data source at a time.
Query allows applications to obtain persistent instances from data source. Can be many
Query instances associated with a PersistenceManager and multiple queries may be
designated for simultaneous execution.
This interface is implemented by each JDO vendor to translate expressions in JDOQL into
native query language of data store.

Java Data Objects (JDO) – Interfaces and Classes


Extent is a logical view of all objects of a particular class that exist in the data source.
Extents are obtained from a PersistenceManager and can be configured to also include
subclasses.
Extent has two possible uses: (a) to iterate over all instances of a class; (b) to execute a
query in the data source over all instances of a particular class.
Transaction contains methods to mark start/end of transactions.
JDOHelper class defines static methods that allow a JDO-aware application to examine
runtime state of instances and to get its associated PersistenceManager if it has one.
JDO – Creating Persistent Classes
1. Ensure each class has a no-arg constructor. If class has no constructors defined,
complier automatically generates a no-arg constructor; otherwise, developer will need
to specify one.
2. Create a JDO metadata file to identify the persistent classes. The JDO metadata file is
expressed as an XML document.
3. Enhance classes so that they can be used in a JDO runtime environment. JDO
specification describes a number of ways that classes can be enhanced, however, most
common way is using an enhancer program that reads a set of .class files and JDO
metadata file and creates new .class files that have been enhanced to run in a JDO
environment.
JDO – Reachability-based Persistence
JDO supports reachability-based persistence.
Thus, any transient instance of a persistent class will become persistent at commit if it is
reachable, directly or indirectly, by a persistent instance.
Instances are reachable through either a reference or collection of references.
Reachability algorithm is applied to all persistent instances transitively through all their
references to instances in memory, causing the complete closure to become persistent.

128
Allows developers to construct complex object graphs in memory and make them
persistent simply by creating a reference to graph from a persistent instance.
Instances have to be explicitly deleted.

JDO Query Language (JDOQL)


Data source-neutral query language based on Java boolean expressions.
Syntax is same as standard Java syntax, with a few exceptions.
A Query object is used to find persistent objects matching certain criteria. A Query is
obtained through one of newQuery() methods of PersistenceManager.
Basic JDOQL query has following 3 components:
o a candidate class (usually a persistent class);
o a candidate collection containing persistent objects (usually an Extent);
o a filter, a boolean expression in a Java-like syntax
Query result is a subcollection of candidate collection containing only those instances of
candidate class that satisfy filter.
Queries can include optional parameter declarations that act as placeholders in filter
string, variable declarations, imports, and ordering expressions.
Query query = pm.newQuery(PropertyForRent.class,
“this.rent < 400”);
Collection result = (Collection) query.execute();

Java Servlets
Servlets are programs that run on Java-enabled Web server and build Web pages,
analogous to CGI.
Have a number of advantages over CGI:
– improved performance;
– portability;
– extensibility;
– simpler session management;
– improved security and reliability.

Java Server Pages (JSP)


Java-based server-side scripting language that allows static HTML to be mixed with
dynamically-generated HTML.
Compiled into Java servlet and processed by a Java-enabled Web server (JSP works
with most Web servers).

129
Since servlet is compiled, performance is improved.

Java Web Services – Document-Oriented


Deal directly with processing XML documents.
Java API for XML Processing (JAXP), processes XML documents using various parsers and
transformations. JAXP supports both SAX and DOM. Also supports the XSLT.
Java Architecture for XML Binding (JAXB), processes XML documents using schema-
derived JavaBeans component classes. JAXB provides methods for unmarshalling an
XML instance document into a tree of Java objects, and marshalling tree back into an
XML document.
SOAP with Attachments API for Java (SAAJ), provides standard way to send XML
documents over Internet from Java platform. Based on SOAP 1.1 and SOAP with
Attachments, which define a basic framework for exchanging XML messages.

Java Web Services – Procedure-Oriented

§ Java API for XML-based RPC (JAX-RPC), sends SOAP method calls to remote clients over
Internet and receives results.
§ Client written in language other than Java can access a Web service developed and
deployed on Java platform.
§ Also, client written in Java can communicate with service developed and deployed using
some other platform.
§ Java API for XML Registries (JAXR), provides standard way to access business registries
and share information.
§ JAXR gives Java developers a uniform way to use business registries based on open
standards (such as ebXML) or industry consortium-led specifications (such as UDDI).

Microsoft Web Platform - .NET


“Software is delivered as a service, accessible by any device, any time, any place, and is fully
programmable and personalizable.”
u Contains various tools, services, and technologies, such as:
– Windows 2000, 2007 etc
– Exchange Server,
– Visual Studio,
– HTML/XML,
– scripting languages,
– components (Java, ActiveX).

130
Object Linking and Embedding for DataBases (OLE DB)
§ Microsoft has defined set of data objects, collectively known as OLE DB.
§ Allows OLE-oriented applications to share and manipulate sets of data as objects.
§ OLE DB is an object-oriented specification based on C++ API.
§ Components can be treated as data consumers and data providers. Consumers take
data from OLE DB interfaces and providers expose OLE DB interfaces.

OLE DB Architecture

Active Server Pages (ASP)


§ ASP is programming model that allows dynamic, interactive Web pages to be created on
server.
§ ASP provides flexibility of CGI, without performance overhead discussed previously.
§ ASP runs in-process with the server, and is optimized to handle large volume of users.
§ When an ‘.asp’ file is requested, Web server calls ASP, which reads requested file,
executes any commands, and sends generated HTML page back to browser.

131
ASP Architecture

ActiveX Data Objects (ADO)


Programming extension of ASP supported by Microsoft IIS for database connectivity.
Supports following key features:
Independently-created objects.
Support for stored procedures.
Support for different cursor types.
Batch updating.
Support for limits on number of returned rows.
Support for multiple recordsets.
Designed as an easy-to-use interface to OLE DB.

ADO Object Model

Remote Data Services (RDS)

132
§ Microsoft technology for client-side database manipulation across Internet.
§ Still uses ADO on server-side to execute query and return recordset to client, which can
then execute other queries on recordset.
§ RDS provides mechanism to send updated records back to server.
§ A disconnected recordset model.

Comparison of ASP and JSP


Both designed to enable developers to separate page design from programming logic
through use of callable components.
Differences:
– JSP is essentially platform and server independent whereas ASP primarily
restricted to MS Windows-based platforms.
– JSP perhaps more extensible as JSP developers can extend the JSP tags
available.
– JSP components are reusable across platforms.
– JSP benefits from in-built Java security model.

Microsoft .NET
§ Number of limitations with Microsoft’s platform:
– a number of languages supported with different programming models (J2EE
composed solely of Java);
– no automatic state management;
– relatively simple user interfaces for Web compared to traditional Windows user
interfaces;
– need to abstract operating system (Windows API difficult to program).
§ Next, and current, evolution in Microsoft’s Web solution strategy was development of
.NET.
§ Various tools, services, technologies in .NET:
– Windows Server,
– BizTalk Server (to build XML-based business processes across applications and
organizations),
– Commerce Server (to build scalable e-Commerce solutions),
– Application Center (to deploy and manage scalable Web applications),
– Mobile Information Server (to support handheld devices),
– SQL Server,
– Microsoft Visual Studio .NET
– Microsoft .NET Framework (CLR + Class Library).

133
.NET Framework

.NET – Common Language Runtime


§ An execution engine that loads, executes, and manages code compiled into an
intermediate bytecode format - Microsoft Intermediate Language (MSIL) - analogous to
Java bytecodes.
§ Not interpreted but compiled to native binary format before execution by a JIT compiler
built into CLR.
§ Allows one language to call another, and even inherit and modify objects from another
language.
§ Provides number of services such as memory management, code and thread execution,
uniform error handling, and security.
§ Enforces strict type-and-code-verification system called common type system (CTS),
which contains range of pre-built data types representing both simple data types for
objects such as numbers and text values, and more complex data types for developing
user interfaces, data systems, file management, graphics, and Internet services.
§ Also supports side-by-side execution allowing application to run on single computer that
has multiple versions of .NET Framework installed, without application being affected.

.NET Framework Class Library


§ Collection of reusable classes, interfaces, and types that integrate with CLR providing
standard functionality such as:
– string management, input/output, security management,

134
– network communications, thread management,
– user interface design features,
– database access and manipulation.
§ 3 main components:
– Windows Forms to support user interface development.
– ASP.NET to support development of Web applications and Web services.
Reengineered version of ASP to improve performance and scalability.
– ADO.NET to help applications connect to databases.
ADO.NET
§ Designed to address 3 main weaknesses with ADO:
– providing a disconnected data access model required for Web;
– providing compatibility with .NET Framework class library;
– providing extensive support for XML.
§ Different from connected style of programming that existed in traditional 2-tier C-S
architecture, where connection was held open for duration of program’s lifetime and no
special handling of state was required.
§ Also ADO data model is primarily relational and could not easily handle XML with a data
model that is heterogeneous and hierarchical.
§ Recognizing that ADO was a mature technology and widely used, ADO has been retained
in the .NET Framework, accessible through the .NET COM interoperability services.
§ Two main layers:
– a connected layer (similar to ADO);
– a disconnected layer, the DataSet (providing a similar functionality to RDS).

ADO.NET ARCHITECTURE

ADO.NET Object Model


135
§ Main replacements for ADO Recordset are:
– DataAdapter, acts as bridge between vendor-dependent data source and
vendor-neutral DataSet. While data source may be RDB, may also be an XML
document.
– DataReader, provides connected, forward-only, read-only stream of data from
data source. A DataReader can be used independently of a DataSet for increased
performance.
– DataSet, provides disconnected copies of records from data source. DataSet
stores records from one or more tables in memory without holding a connection
to the data source, but unlike RDS DataSet maintains information on
relationships between tables and constraints.
§ Several ways a DataSet can be used:
– user can create DataTable, DataRelation, and Constraint within DataSet and
populate table with data programmatically.
– user can populate DataSet with data from existing relational data source using a
DataAdapter.
– contents of DataSet can be loaded from an XML stream or document, which can
be either data, XML schema information, or both.
§ Also, a DataSet can be made persistent using XML (with or without a corresponding XML
Schema).

Microsoft Web Services


§ .NET Framework built on number of standards to promote interoperability with non-
Microsoft solutions.
§ For example, Visual Studio .NET automatically creates necessary XML and SOAP
interfaces required to turn application into a Web service.

136
§ In addition, .NET Framework provides set of classes that conform to all the underlying
communication standards, such as SOAP, WSDL, and XML.
§ Microsoft UDDI SDK enables developers to add UDDI functionality to development tools,
installation programs, and any other software that needs to register or locate and bind
remote Web services.

Microsoft Access and Web Page Generation


Access provides wizards for automatically generating HTML/XML:
– Static pages: user can export data to HTML format.
– Dynamic pages using ASP: user can export data to an ‘asp’ file on Web server.
– Dynamic pages using HTX/IDC files: user can export data to HTX/IDC files on
server.
– Dynamic pages using data access pages: data access pages are Web pages bound
directly to data in the database. Can be used like Access forms, except pages are
stored as external files.
– XML: data can be output as an XML document along with associated schema
and an XSL file.

Oracle Internet Platform


Comprises Oracle Application Server and Oracle DBMS.
It is n-tier architecture based on industry standards such as:
– HTTP and HTML/XML for Web enablement.
– Java, J2EE, EJB, JDBC, and SQLJ for database connectivity, Java servlets, and JSP.
Also supports JNDI and stored Java procedures.
– OMG’s CORBA technology.
– IIOP for object interoperability and RMI.
– Web services: SOAP, WSDL, UDDI, ebXML, WebDav, LDAP.

137
Oracle Internet Platform Architecture

Oracle Application Server (OracleAS)


A reliable, scalable, secure, middle-tier application server designed to support
eBusiness.
Currently available in three versions:
– Java Edition: lightweight Web server with minimal application support;
– Standard Edition: for medium to large Web sites that handle large volume of
transactions;
– Enterprise Edition: Standard Edition + extras.
Communication Services
Handles all incoming requests received by OracleAS, some processed by Oracle HTTP
Server and some by other areas of OracleAS.
Oracle HTTP Server is extended version of Apache Server.
Oracle HTTP Server Modules (mods)
Oracle has enhanced several of Apache mods, and has added Oracle-specific ones; e.g.:
– mod_oc4j, routes HTTP requests for J2EE to OracleAS Containers for J2EE (OC4J);
– mod_plsql, routes requests for stored procedures to database server;

138
– mod_fastcgi, enhanced version of CGI that runs programs in pre-spawned
process;
– mod_oradav, provides support for WebDAV;
– mod_ossl, provides standard S-HTTP;
– mod_osso, enables transparent single sign-on.

OracleAS Containers for J2EE (OC4J)


A fully compliant J2EE 1.3 server.
Runs on J2SE and executes and manages J2EE application components such as:
– Servlets Servlet container provided that manages execution of Web
components and J2EE applications.
– JSPs JSP translator provided to convert JSP files into Java source that container
can then compile and execute as a servlet.
– EJBs EJB container provided that manages execution of EJBs for J2EE
applications. Container has configurable settings that customize the underlying
support , such as security, transaction management, JNDI lookups, and remote
connectivity. Container also manages EJB lifecycles, database connection
resource pooling, data persistence, and access to J2EE APIs.
OracleAS supports both JDBC and SQLJ database access mechanisms, and provides
following drivers:
– Oracle JDBC drivers, for use with Oracle database. Have extensions to support
Oracle-specific datatypes and to enhance their performance.
– J2EE Connectors, part of J2EE platform, provide a Java-based solution for
connecting various application servers and EISs.
– DataDirect Connect Type 4 JDBC drivers, for connecting to non-Oracle databases
Oracle Portal
§ A portal is Web-based application that provides a common, integrated entry point for
accessing dissimilar data types on a single Web page.
§ A portal is divided into a number of portlets.
§ Oracle Portal provides a number of tools to generate and customize portals and
portlets.
Oracle Wireless
Provides services and tools for delivering information and applications to mobile
devices.
§ Includes Multi-Channel Server (MCS) that supports development of applications that are
accessible from multiple channels including wireless browsers, voice, and messaging.

139
§ MCS automatically translates applications written in Oracle Wireless XML, XHTML
Mobile Profile, or XHTML+XForms for any device and network.
§ Also allows portal sites to be created that use Web pages, Java applications, and XML-
based applications.

Business Intelligence
Functions to track, extract, and analyze business intelligence to support strategic
decision-making:
– Oracle Reports Services enable users to run Oracle Reports over Internet.
– Oracle Discoverer allows users to produce queries, reports, and analysis of
information from databases, OLTP systems, and data warehouses using a Web
browser.
– Oracle Clickstream provides services to capture and analyze aggregate
information about Web site usage.
– Oracle Personalization enables users to track activity of specific user and
personalize information for that user.

Review Questions
1. Discuss and do a critical review of the technologies for enabling web databases
Integration
2. How are XHML, HTML, and SGML related?
3. Explain the following term s in your own words
HTML, HTTP, URL, search engine, XML,XML schema, XQuery, SGML, web crawler,
4. How might we consider storing XML data in a database? What are the advantages and
disadvantages of this approach?
5. It is sometimes said that XML data is “schemaless”. How do you go about querying data that
has no schema? How would you design a query language for such data?.Discuss.

140
Week Eleven

Data Warehousing and Data Mining

Introduction To Data Warehousing

Objectives of Today’s Businesses


 Access and combine data from a variety of data stores
 Perform complex data analysis across these date stores
 Create multidimensional views of data and its metadata
 Easily summarize and roll up the information across subject areas and business
dimensions
These objectives cannot be met easily because
 Data is scattered in many types of incompatible structures.
 Lack of documentation has prevented from integration older legacy systems with newer
systems
 Internet software like searching engine needs to be improved
 Accurate and accessible metadata across multiple organizations is hard to get

Four Levels of Analytical Processing


In modern organization, at least four levels of analytical processing should be supported by
information systems
– First level: Consists of simple queries and reports against current and historical
data
– Second level: Goes deeper and requires the ability to do “what if” processing
across data store dimensions
– Third level: Needs to step back and analyze what has previously occurred to
bring about the current state of the data
– Fourth level: Analyzes what has happened in the past and what needs to be
done in the future in order to bring some specific change

Definition of Data Warehouse


A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management's decision making process.

§ Subject-Oriented: a data warehouse is organized around the major subjects of an


organization e.g. customers, products, sales rather than the major application
areas e.g. stock control, invoicing.
141
§ Integrated: coming from different sources.
§ Time-variant: data in a data warehouse is only accurate and valid at some point
in time or over some time interval.
§ Non-volatile: the data is not updated in real-time but is refreshed from
operational systems on a regular basis.

Data Warehousing Architecture

Three-Tier Architecture
1) Warehouse database server
– Almost always a relational DBMS; rarely flat files
2) OLAP servers
– Relational OLAP (ROLAP): extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
– Multidimensional OLAP (MOLAP): special purpose server that directly
implements multidimensional data and operations.
3) Clients
– Query and reporting tools.
142
– Analysis tools
– Data mining tools (e.g., trend analysis, prediction)

Data Warehouse – Components


1. Operational Data
2. Load Manager - performs all the operations associated with the extraction and loading
of data into the warehouse
3. Warehouse Manager - performs all operations associated with the management of the
data in the warehouse e.g. analysis of data to ensure consistency, transforming and
merging of data sources
4. Query Manager - performs all operations associated with the management of user
queries e.g. directing queries to appropriate tables and scheduling the execution of
queries
5. End-User Access Tools - Data reporting and query tools, application development tools,
executive information system tools, OLAP tools, data mining tools

Data Warehousing: Two Distinct Issues


1. How to get information into warehouse
– “Data warehousing”
2. What to do with data once it’s in warehouse -“Warehouse DBMS”
– Both rich research areas
– Industry has focused on (2)
Issues in Data Warehousing
• Warehouse Design
• Extraction
– Wrappers, monitors (change detectors)
• Integration
– Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)

Data Mart
Data Mart is a subset of data warehouse that supports the requirements of a particular
department or business function.

143
Data Warehouse Vs Data Marts
• Enterprise warehouse: collects all information about subjects (customers, products,
sales, assets, personnel) that span the entire organization.
– Requires extensive business modeling
– May take years to design and build
• Data Marts: Departmental subsets that focus on selected subjects: Marketing data
mart: customer, products, sales.
– Faster roll out, but complex integration in the long run.
• Virtual warehouse: views over operational dbs
– Materialize some summary views for efficient query processing
– Easier to build
– Requisite excess capacity on operational db servers

Introduction to Data Mining

Objectives
 Motivation: Why data mining?
 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Classification of data mining systems
 Major issues in data mining

Necessity Is the Mother of Invention


 Data explosion problem
Automated data collection tools and mature database technology lead to
tremendous amounts of data accumulated and/or to be analyzed in databases,
data warehouses, and other information repositories
 We are drowning in data, but starving for knowledge!

Solution
Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Mining interesting knowledge (rules, regularities, patterns, constraints) from
data in large databases

144
Evolution of Database Technology
 1960s:
Data collection, database creation, IMS and network DBMS
 1970s:
Relational data model, relational DBMS implementation
 1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
Data mining, data warehousing, multimedia databases, and Web databases
 2000s
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems

What Is Data Mining?


* Data Mining is the process of discovering new correlations, patterns, and trends by
digging into (mining) large amounts of data stored in warehouses, using artificial
intelligence, statistical and mathematical techniques.
* Data mining can also be defined as “The nontrivial extraction of implicit, previously
unknown, and potentially useful information from data”.
* Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, etc.

Why Data Mining? : Potential Applications


§ Data analysis and decision support
 Market analysis and management
– Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
 Risk analysis and management
– Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
 Fraud detection and detection of unusual patterns (outliers)
§ Other Applications

145
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 DNA and bio-data analysis

Market Analysis and Management


 Where does the data come from?
– Credit card transactions, loyalty cards, discount coupons, customer complaint
calls, plus (public) lifestyle studies
 Target marketing
– Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits..
– Determine customer purchasing patterns over time
 Cross-market analysis
– Associations/co-relations between product sales, & prediction based on such
association
 Customer profiling
– What types of customers buy what products (clustering or classification)
 Customer requirement analysis
– identifying the best products for different customers
– predict what factors will attract new customers
 Provision of summary information
– multidimensional summary reports
– statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management


 Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
 Resource planning
– summarize and compare the resources and spending
 Competition
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market
Fraud Detection & Mining Unusual Patterns
 Approaches: Clustering & model construction for frauds, outlier analysis

146
 Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
 Professional patients, ring of doctors, and ring of references
 Unnecessary or correlated screening tests
 Telecommunications: phone-call fraud
– Phone call model: destination of the call, duration, time of day or week. Analyze
patterns that deviate from an expected norm
 Retail industry
– Analysts estimate that 38% of retail shrink is due to dishonest employees
 Anti-terrorism

Data Mining: A KDD Process

Steps of a KDD Process


 Learning the application domain
– relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation.
147
 Choosing functions of data mining
summarization, classification, regression, association, clustering

 Choosing the mining algorithm(s)


 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge

Data Mining and Business Intelligence

148
Data Mining: On What Kinds of Data?
1. Relational database
2. Data warehouse
3. Transactional database
4. Advanced database and information repository
– Object-relational database
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Heterogeneous and legacy database
– Text databases & WWW

Data Mining Functionalities


1. Concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics
2. Association (correlation and causality)
3. Classification and Prediction
149
– Construct models (functions) that describe and distinguish classes or concepts
for future prediction
– Presentation: decision-tree, classification rule, neural network
– Predict some unknown or missing numerical values
4. Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to
find distribution patterns
– Maximizing intra-class similarity & minimizing interclass similarity
5. Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? No! useful in fraud detection, rare events analysis
6. Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis

Major Issues in Data Mining


1. Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
150
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing one: knowledge fusion
2. User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
3. Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation,
data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification,
clustering, outlier and trend analysis..
• Data mining systems and architectures
• Major issues in data mining

Review Questions
1. Define a data warehouse. What are the major components of a data warehouse?
2. Discuss the importance of data mining in databases
3. Discuss the new trends and challenges of data warehousing technology

151
Week Twelve

Multi-Dimensional Databases
Learning Objectives
1. Multidimensional Databases
– What are they? Why are they needed?
2. Contrast MDD and Relational Databases
3. When is MDD (In) appropriate?
4. MDD Features
– Rotation, Ranging, Roll-Up and Drill-Down, Computations
5. Benefits of MDD

Organization of Concepts

Large-Scale Data Management for


Organizational Decision-Making

Data Warehouse Multidimension Databases Data Mining


(Data Integration) (Enabling Technology) (Relationship Discovery)

Purpose Purpose Purpose

Basic Features Basic Features Applications

Functions
Definition MDDB vs. Relational Database
Data Requirements Rotation
Ranging Association
Transaction Processing vs. Roll-Up
Data Warehousing Drill-Down Availability Bias
Data Structures Computations Sequencial Pattern
Example Benefits Classifying
Clustering
Architecture Hybrid
Tools
Emergent Applications
Data Mapping
Example Architecture of an IBM PowerPlay
Data Warehouse

Preparing Data for DW

Reasons for Failure

152
Why Multi-Dimensional Databases?
§ No single "best" data structure for all applications within an enterprise
§ From econometric research conducted at MIT in the 1960s, the multidimensional
database has matured into the database engine of choice for data analysis applications
§ Inherent ability to integrate and analyze large volumes of enterprise data
§ Offers a good conceptual fit with the way end-users visualize business data
– Most business people already think about their businesses in multidimensional
terms
– Managers tend to ask questions about product sales in different markets over
specific time periods
What is a Multi-Dimensional Database?
A multidimensional database (MDD) is a computer software system designed to allow for the
efficient and convenient storage and retrieval of large volumes of data that is (1) intimately
related and (2) stored, viewed and analyzed from different perspectives. These perspectives
are called dimensions.

153
Contrasting Relational and Multi-Dimensional Models:
An Example

SALES VOLUMES FOR GLEASON DEALERSHIP

MODEL COLOR SALES VOLUME


MINI VAN BLUE 6
MINI VAN RED 5
MINI VAN WHITE 4
SPORTS COUPE BLUE 3
SPORTS COUPE RED 5
SPORTS COUPE WHITE 5
SEDAN BLUE 4
SEDAN RED 3
SEDAN WHITE 2

RELATIONAL STRUCTURE

Differences between MDD and Relational Databases


1. Multidimensional array structure represents a higher level of organization than the
relational table

154
2. Perspectives are embedded directly into the structure in the multidimensional model
All possible combinations of perspectives containing a specific attribute
(the color BLUE, for example) line up along the dimension position for
that attribute.
3. Perspectives are placed in fields in the relational model - tells us nothing about field
contents.
4. MDD makes data browsing and manipulation intuitive to the end-user
5. Any data manipulation action possible with a MDD is also possible using relational
technology
6. Substantial cognitive advantages in query formulation
7. Substantial computational performance advantages in query processing when using
MDD

155
Viewing Data - An Example
Sales Volumes

M
O
D
E
L

DEALERSHIP

COLOR

• Assume that each dimension has 10 positions, as shown in the cube above
• How many records would be there in a relational table?
• Implications for viewing data from an end-user standpoint?

156
157
158
When is MDD (In)appropriate?
• Our sales volume dataset has a great number of meaningful interrelationships
• Interrelationships more meaningful than individual data elements themselves.
• The greater the number of inherent interrelationships between the elements of a
dataset, the more likely it is that a study of those interrelationships will yield business
information of value to the company.
• Highly interrelated dataset types be placed in a multidimensional data structure for
greatest ease of access and analysis

159
160
161
162
MDD Features: Multidimensional Computations
• Well equipped to handle demanding mathematical functions.
• Can treat arrays like cells in spreadsheets. For example, in a budget analysis situation,
one can divide the ACTUAL array by the BUDGET array to compute the VARIANCE array.
• Applications based on multidimensional database technology typically have one
dimension defined as a "business measurements" dimension.
• Integrates computational tools very tightly with the database structure.

The Time Dimension


TIME as a predefined hierarchy for rolling-up and drilling-down across days, weeks,
months, years and special periods, such as fiscal years.
– Eliminates the effort required to build sophisticated hierarchies every time a
database is set up.
– Extra performance advantages

Contrasting Relational Model and MD Model

Criteria Relational Model Multidimensional


Databases
Focus Data integrity of each piece Facilitate exploration of
of data interrelationships between
dimensions
Organization structure One-dimensional array Multi-dimensional arrays
Perspectives Embedded in fields Embedded directly in
MDDB structure
Computational power for Joining tables often Structure designed for
Query processing required; computationally OLAP; computationally
expensive cheap
Cognitive issues in Cumbersome Intuitive
querying data
Query Languages SQL or SQL front-ends, Point-and-click emphasis
such as QBE No standardized language
Management of Time Not well suited Well suited
Dimension

Benefits of MDD
• Cognitive Advantages for the User
• Ease of Data Presentation and Navigation

163
– Obtaining the same views in a relational world requires the end user to either
write complex SQL queries or use an SQL generator against the relational
database to convert the table outputs into a more intuitive format.
• Ease of Maintenance
– Because data is stored in the same way as it is viewed (i.e. according to its
fundamental attributes), no additional overhead is required to translate user
queries into requests for data
• Performance
– Multidimensional databases achieve performance levels that are difficult to
match in a relational environment.

Review Questions
1. Explain what are Multi dimensional database s and their usefulness
2. Differentiate between MDD and Relational Databases
3. Discuss the Features of MDD
4. What are the benefits of MDD

164
References
1. Database Systems by Connolly & Begg 5th Edition
2. Fundamentals of Database Systems by Elmasri & Navathe 5th edition
3. Date C. J. (1990) An Introduction to Database Systems. Volume 1, 5th Edition Addison
Wesley
4. Korth HF & Silberschatz A (1991)
5. Database System Concepts Second Edition McGraw-Hill
6. Object Data Management. OO & Extended Relational Database, Systems Addison-
Wesley Publishing Atkinson,M., DeWitt,D., Maier,D.,Bancilhon,F., Dittrich,K.,
Zdonik,S.(1990)
7. Object-Oriented Database Systems Morgan Kaufmann Publishers Won Kim (1991)
8. Introduction to Object-Oriented Databases, The MIT Press, Cambridge (Massachusetts)
Unland,R., Schlagetter,G.
9. Object-Oriented Database Systems: State of the Art and Research Problems. In: Expert
Database Systems, 1992, Academic Press Ltd

165

You might also like