KEMBAR78
Database Systems Essentials | PDF | Databases | Relational Database
0% found this document useful (0 votes)
72 views113 pages

Database Systems Essentials

Uploaded by

Hareesh Talari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
72 views113 pages

Database Systems Essentials

Uploaded by

Hareesh Talari
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 113

1

UNIT-1
OVERVIEW OF DATABASE SYSTEMS
(1) Overview of DBMS
Data:
➢ Data is collection of raw facts and figures, such as numbers, words, measurements, observations or
just descriptions of things.
➢ Alone it tells you nothing.
➢ The real goal is turn the data into information.
➢ Data is not in meaningful way, it is unorganized format.
➢ Once the data is processed, organized, structured or presented in given context, it can become useful.
EX: your name, age, height etc.
Information:
➢ Information is credited from the data
➢ Information is nothing but refined data that have been put into a meaningful and useful context.
➢ Information is the back bone of any otganization.it consists of images, text, document and voice etc.
➢ The 3 key attributes of information are ,
▪ Accuracy(correctness)
▪ Timelines
▪ Relevancy

Data Base:
➢ The data is an organised form of the word “DATUM” that means “single piece of information”.
➢ Data base is organised collection of data, so that it can be easily stored, managed and accessed
electrically from a computer system.
➢ It is also used to organise the data in the form of rows and columns and index etc. To make it easier
to find relevant information.
➢ The main purpose of database is to operate large amount of information by storing, retrieving and
managing the data.
➢ There are many databases are available , such as
MySQL, Sybase, Oracle, SQL server etc.,
➢ A cylindrical structure is used to represent the data.

DBMS:
➢ DBMS is software which is used to manage the database.
➢ It is a collection of programs that enable to store, modify, and extract information from the database.
➢ It is a piece of software that provides services for accessing a database, while maintaining all the
required features of data
➢ It provides security and protection to the database; in case of multiple users it also maintains data
consistency.
2

➢ It provides an interface to perform various operations like database creations, storing data in it,
updating data, creating tables in database etc.,

Characteristics of Database Management System:


Here are the characteristics and properties of Database Management System:
➢ Provides security and removes redundancy
➢ Self-describing nature of a database system
➢ Insulation between programs and data abstraction
➢ Support of multiple views of the data
➢ Sharing of data and multiuser transaction processing
➢ Database Management Software allows entities and relations among them to form tables.
➢ It follows the ACID concept:
▪ Atomicity,
▪ Consistency,
▪ Isolation,
▪ Durability.
➢ DBMS supports multi-user environment that allows users to access and manipulate data in parallel.

Advantages of DBMS:
• Reducing Data Redundancy:
o The file based data management systems contained multiple files that were stored in many
different locations in a system or even across multiple systems. Because of this, there were
sometimes multiple copies of the same file which lead to data redundancy.
• Sharing of Data:
o In a database, the users of the database can share the data among themselves. There are
various levels of authorisation to access the data, and consequently the data can only be
shared based on the correct authorisation protocols being followed.
• Data Integrity:
o Data integrity means that the data is accurate and consistent in the database. Data Integrity is
very important as there are multiple databases in a DBMS. All of these databases contain data
that is visible to multiple users.
3

• Data Security:
o Data Security is vital concept in a database. Only authorised users should be allowed to
access the database and their identity should be authenticated using a username and
password.
• Privacy:
o The privacy rule in a database means only the authorized users can access a database
according to its privacy constraints. There are levels of database access and a user can only
view the data he is allowed to
• Backup and Recovery:
o Database Management System automatically takes care of backup and recovery. The users
don't need to backup data periodically because this is taken care of by the DBMS. Moreover,
it also restores the database after a crasher system failure to its previous condition.
• Data Consistency:
o Data consistency is ensured in a database because there is no data redundancy. All data
appears consistently across the database and the data is same for all the users viewing the
database..
Disadvantages of DBMS:
1. High Cost:
The high cost of software and hardware is the main disadvantage of the database management system.
Database users require a high-speed processor and huge memory size to use the database on the
DBMS.
2. Complexity:
Database management system (DBMS) is so complex for non-technical users. So, it isn’t easy to
manage and maintain database systems.
3. Increased Staff Cost:
DBMS requires an educated and skilled staff for managing and maintaining the databases. So, we need
to spend a lot of money to get this level of trained and experienced staff.
4. Requirement of Technical Staff:
A non-technical people can’t understand the complexity of the database. So, the technical staffs are
required for maintaining and handling the database management system.
5. Cost of Data Conversion:
It is one of the big disadvantages of the database management system because the cost of data
conversion is very high.
6. Performance:
Performance is another big disadvantage of database systems because the speed of the
database systems for small firms and organizations is very slow.
Objectives of DBMS:
1. Eliminate redundant data.
2. Make access to the data easy for the user.
3. Provide for mass storage of relevant data.
4. Protect the data from physical harm and un-authorised systems.
5. Allow for growth in the data base system.
6. Make the latest modifications to the data base available immediately.
7. Allow for multiple users to be active at one time.
8. Provide prompt response to user requests for data.
(2)Describing and storing data in DBMS:
➢ A data model is a collection of high-level data description constructs that hide many low-level storage
details
➢ A widely used semantic data model called the entity-relationship (ER) model allows us to pictorially
denote entities and the relationships among them
1. Data abstraction -: (3 Layer/Level Architecture)
Data abstraction is one of the fundamental characteristic of any database, which helps in making data
more accurate and easy to use.
Data abstraction refers to the act of representing data without giving details that how data are stored
or maintained. There are different levels of abstraction.
4

Physical View (internal schema):


• The physical schema specifies additional storage detail, summarizes how the relations described in
conceptual schema are actually stored on secondary storage devices such as disks and tapes.
It is called the internal level. This level is the lowest level of abstraction which specified storage
details that how the data are store in disk.
This layer describe manner in which records are stored either as the collection of pages or as the
collection of records.

Logical Level -: Conceptual Schema


• The conceptual schema (sometimes called the logical schema) describes the stored data in terms of the
data model of the DBMS.
• Relations contain information about entities and relationships
• This is the conceptual view .It is next higher level of abstraction what data are stored in the database
and what relationship exists among those data.
• There is only one conceptual schema per database.
• This schema also contains the method of deriving the object on the conceptual view from the internal
views.
• This level of abstraction is used by database administration.
View level:-External Schema
External schemas allow data access to be customized and authorized at the level of individual user or
groups of users.
• Each external schema consists of a collection of views and relations from the conceptual schema.
• A view is conceptually a relation, but the records in a view are not stored in the DBMS. The records are
computed using a definition for the view, in terms of relations stored in the DBMS.
• The external schema design is guided by the end user requirements.
It is the highest level of abstraction which describes different views of the entire database.
These views are designed according to the requirement of user who want to access only part of database.
(2) Instance and schemas –
The collection of information stored in the database at a particular moment is called an instance of the
database.
The overall design of the database is called the database schema. Schema is changed infrequently.
Database systems have several schemas, partitioned according to the levels of abstraction.
At the lowest level is the physical schema, at the intermediate level is the logical schema and at the
highest level is a subschema.
In general database system supports one physical schema one logical schema, and several subschemas.

(3)Queries in DBMS:
A query is a question or a request for information expressed in a formal manner. In computer science,
a query is essentially the same thing, the only difference is the answer or retrieved information comes from
a database
5

A database query is either an action query or a select query. A select query is one that
retrieves data from a database. An action query asks for additional operations on data, such as insertion,
updating, deleting or other forms of data manipulation.
Query languages are used to make queries in a database, and Microsoft Structured Query Language
(SQL) is the standard.
Note: SQL and MySQL are not the same, as the latter is a software extension that uses SQL. Other
language extensions of the language include Oracle SQL and NuoDB.
Although Microsoft's SQL is the most popular language, there are many other types of databases and
languages.
Query by Example (QBE)
➢ In this method, the system displays a blank record and lets you identify the fields and values that
define the query.
➢ This is a method of query creation that authorises the user to look for documents based on an
example in the form of a selected text string, or in the form of a document name, or even a list of
documents. Because the QBE system develops the actual query, QBE is easier to grasp than formal
query languages, while still enabling powerful searches.
Before we delve into the examples, below are the main benefits of using a query:
• Review data from multiple tables simultaneously.
• Filter records containing only certain fields and of certain criteria.
• Automate data management tasks and perform calculations .

(4)Transaction management:

What is a Database Transaction:


A Database Transaction is a logical unit of processing in a DBMS which entails one or more
database access operation.

ACID properties of transactions

In the context of transaction processing, the acronym ACID refers to the four key properties of a transaction:

Atomicity
All changes to data are performed as if they are a single operation. That is, all the changes are
performed, or none of them are.
For example, in an application that transfers funds from one account to another, the atomicity property
ensures that, if a debit is made successfully from one account, the corresponding credit is made to the other
account.
Consistency
Data is in a consistent state when a transaction starts and when it ends.
For example, in an application that transfers funds from one account to another, the consistency
property ensures that the total value of funds in both the accounts is the same at the start and end of each
transaction.
6

Isolation
The intermediate state of a transaction is invisible to other transactions. As a result, transactions that
run concurrently appear to be serialized.
For example, in an application that transfers funds from one account to another, the isolation property
ensures that another transaction sees the transferred funds in one account or the other, but not in both, nor in
neither.
Durability
After a transaction successfully completes, changes to data persist and are not undone, even in the
event of a system failure. For example, in an application that transfers funds from one account to another,
the durability property ensures that the changes made to each account will not be reversed.

Facts about Database Transactions


➢ A transaction is a program unit whose execution may or may not change the contents of a database.
➢ The transaction concept in DBMS is executed as a single unit.
➢ If the database operations do not update the database but only retrieve data, this type of transaction is
called a read-only transaction.
➢ A successful transaction can change the database from one CONSISTENT STATE to another
➢ DBMS transactions must be atomic, consistent, isolated and durable
➢ If the database were in an inconsistent state before a transaction, it would remain in the inconsistent
state after the transaction.

(5)Structure of DBMS:
• Database Management System (DBMS) is software that allows access to data stored in a database and
provides an easy and effective method of
• Defining the information.
• Storing the information.
• Manipulating the information.
• Protecting the information from system crashes or data theft.
• Differentiating access permissions for different users.
• The database system is divided into three components: Query Processor, Storage Manager, and Disk
Storage. These are explained as following below.
• STRUCTURE OF DBMS
• DBMS (Database Management System) acts as an interface between the user and the database. The user
requests the DBMS to perform various operations (insert, delete, update and retrieval) on the database.
The components of DBMS perform these requested operations on the database and provide necessary
data to the users. The various components of DBMS are shown below: -
• 1. DDL Compiler - Data Description Language compiler processes schema definitions specified in the
DDL. It includes metadata information such as the name of the files, data items, storage details of each
file, mapping information and constraints etc.
• 2. DML Compiler and Query optimizer - The DML commands such as insert, update, delete, retrieve
from the application program are sent to the DML compiler for compilation into object code for database
access. The object code is then optimized in the best way to execute a query by the query optimizer and
then send to the data manager.
• 3. Data Manager - The Data Manager is the central software component of the DBMS also knows
as Database Control System.

7


• Fig. 2.1 Structure Of DBMS

• The Main Functions Of Data Manager Are: –


• • Convert operations in user's Queries coming from the application programs or combination of DML
Compiler and Query optimizer which is known as Query Processor from user's logical view to
physical file system.
• • Controls DBMS information access that is stored on disk.
• • It also controls handling buffers in main memory.
• • It also enforces constraints to maintain consistency and integrity of the data.
• • It also synchronizes the simultaneous operations performed by the concurrent users.
• • It also controls the backup and recovery operations.
• 4. Data Dictionary - Data Dictionary is a repository of description of data in the database. It
contains information about
• • Data - names of the tables, names of attributes of each table, length of attributes, and number of
rows in each table.
• • Detailed information on physical database design such as storage structure, access paths, files and
record sizes.
• • Access Authorization - is the Description of database users their responsibilities and their access
rights.
• Data dictionary is used to actually control the data integrity, database operation and accuracy. It may
be used as a important part of the DBMS.
• Importance of Data Dictionary -
• Data Dictionary is necessary in the databases due to following reasons:
• • It improves the control of DBA over the information system and user's understanding of use of the
system.
• • It helps in documenting the database design process by storing documentation of the result of every
design phase and design decisions.
• • It helps in searching the views on the database definitions of those views.
• 5. Data Files - It contains the data portion of the database.
• 6. Compiled DML - The DML complier converts the high level Queries into low level file access
commands known as compiled DML.
• 7. End Users - They are already discussed in previous section.
8

(5)Data Models
Data Model : Data Model gives us an idea that how the final system will look like after its complete
implementation.
It defines the data elements and the relationships between the data elements. Data Models are used to
show how data is stored, connected, accessed and updated in the database management system.
Here, we use a set of symbols and text to represent the information so that members of the
organisation can communicate and understand it.
Though there are many data models being used nowadays but the Relational model is the most
widely used model.
Apart from the Relational model, there are many other types of data models about which we will
study in details in this blog. Some of the Data Models in DBMS are:
1. Hierarchical Model
2. Network Model
3. Entity-Relationship Model
4. Relational Model
5. Object-Oriented Data Model
6. Object-Relational Data Model
7. Flat Data Model
8. Semi-Structured Data Model
Hierarchical Model
Hierarchical Model was the first DBMS model. This model organises the data in the hierarchical tree
structure.
The hierarchy starts from the root which has root data and then it expands in the form of a tree adding
child node to the parent node.
This model easily represents some of the real-world relationships like food recipes, sitemap of a website
etc.
Example: We can represent the relationship between the shoes present on a shopping website in the
following way:

Features of a Hierarchical Model


1. One-to-many relationship: The data here is organised in a tree-like structure where the one-to-
many relationship is between the datatypes. Also, there can be only one path from parent to any
node. Example: In the above example, if we want to go to the node sneakers we only have one
path to reach there i.e through men's shoes node.
2. Parent-Child Relationship: Each child node has a parent node but a parent node can have more
than one child node. Multiple parents are not allowed.
3. Deletion Problem: If a parent node is deleted then the child node is automatically deleted.
4. Pointers: Pointers are used to link the parent node with the child node and are used to navigate
between the stored data. Example: In the above example the 'shoes' node points to the two other
nodes 'women shoes' node and 'men's shoes' node.
Advantages of Hierarchical Model
• It is very simple and fast to traverse through a tree-like structure.
9

• Any change in the parent node is automatically reflected in the child node so, the integrity of data
is maintained.
Disadvantages of Hierarchical Model
• Complex relationships are not supported.
• As it does not support more than one parent of the child node so if we have some complex
relationship where a child node needs to have two parent node then that can't be represented using
this model.
• If a parent node is deleted then the child node is automatically deleted.
Network Model
This model is an extension of the hierarchical model.
It was the most popular model before the relational model. This model is the same as the hierarchical
model, the only difference is that a record can have more than one parent.
It replaces the hierarchical tree with a graph.
Example: In the example below we can see that node student has two parents i.e. CSE Department and
Library. This was earlier not possible in the hierarchical model.

Features of a Network Model


1. Ability to Merge more Relationships: In this model, as there are more relationships so data is
more related. This model has the ability to manage one-to-one relationships as well as many-to-
many relationships.
2. Many paths: As there are more relationships so there can be more than one path to the same
record. This makes data access fast and simple.
3. Circular Linked List: The operations on the network model are done with the help of the circular
linked list. The current position is maintained with the help of a program and this position
navigates through the records according to the relationship.
Advantages of Network Model
• The data can be accessed faster as compared to the hierarchical model. This is because the data is
more related in the network model and there can be more than one path to reach a particular node.
So the data can be accessed in many ways.
• As there is a parent-child relationship so data integrity is present. Any change in parent record is
reflected in the child record.
Disadvantages of Network Model
• As more and more relationships need to be handled the system might get complex. So, a user must
be having detailed knowledge of the model to work with the model.
• Any change like updation, deletion, insertion is very complex.
Entity-Relationship Model
Entity-Relationship Model or simply ER Model is a high-level data model diagram. In this model,
we represent the real-world problem in the pictorial form to make it easy for the stakeholders to
understand. It is also very easy for the developers to understand the system by just looking at the ER
diagram. We use the ER diagram as a visual tool to represent an ER Model. ER diagram has the following
three components:
• Entities: Entity is a real-world thing. It can be a person, place, or even a
concept. Example: Teachers, Students, Course, Building, Department, etc are some of the entities
of a School Management System.
10

• Attributes: An entity contains a real-world property called attribute. This is the characteristics of
that attribute. Example: The entity teacher has the property like teacher id, salary, age, etc.
• Relationship: Relationship tells how two attributes are related. Example: Teacher works for a
department.
Example:

In the above diagram, the entities are Teacher and Department.


• The attributes of Teacher entity are Teacher_Name, Teacher_id, Age, Salary, Mobile_Number.
• The attributes of entity Department entity are Dept_id, Dept_name.
• The two entities are connected using the relationship. Here, each teacher works for a department.
Features of ER Model
• Graphical Representation for Better Understanding: It is very easy and simple to understand so
it can be used by the developers to communicate with the stakeholders.
• ER Diagram: ER diagram is used as a visual tool for representing the model.
• Database Design: This model helps the database designers to build the database and is widely
used in database design.
Advantages of ER Model
• Simple: Conceptually ER Model is very easy to build. If we know the relationship between the
attributes and the entities we can easily build the ER Diagram for the model.
• Effective Communication Tool: This model is used widely by the database designers for
communicating their ideas.
• Easy Conversion to any Model: This model maps well to the relational model and can be easily
converted relational model by converting the ER model to the table. This model can also be
converted to any other model like network model, hierarchical model etc.
Disadvatages of ER Model
• No industry standard for notation: There is no industry standard for developing an ER model.
So one developer might use notations which are not understood by other developers.
• Hidden information: Some information might be lost or hidden in the ER model. As it is a high-
level view so there are chances that some details of information might be hidden.
Relational Model
Relational Model is the most widely used model. In this model, the data is maintained in the form of a
two-dimensional table. All the information is stored in the form of row and columns. The basic structure
of a relational model is tables. So, the tables are also called relations in the relational model. Example: In
this example, we have an Employee table.
11

Features of Relational Model


• Tuples: Each row in the table is called tuple. A row contains all the information about any
instance of the object. In the above example, each row has all the information about any
specific individual like the first row has information about John.
• Attribute or field: Attributes are the property which defines the table or relation. The values
of the attribute should be from the same domain. In the above example, we have different
attributes of the employee like Salary, Mobile_no, etc.
Advnatages of Relational Model
• Simple: This model is more simple as compared to the network and hierarchical model.
• Scalable: This model can be easily scaled as we can add as many rows and columns we
want.
• Structural Independence: We can make changes in database structure without changing
the way to access the data. When we can make changes to the database structure without
affecting the capability to DBMS to access the data we can say that structural independence
has been achieved.
Disadvantages of Relatinal Model
• Hardware Overheads: For hiding the complexities and making things easier for the user
this model requires more powerful hardware computers and data storage devices.
• Bad Design: As the relational model is very easy to design and use. So the users don't need
to know how the data is stored in order to access it. This ease of design can lead to the
development of a poor database which would slow down if the database grows.
But all these disadvantages are minor as compared to the advantages of the relational model.
These problems can be avoided with the help of proper implementation and organisation.
Object-Oriented Data Model
The real-world problems are more closely represented through the object-oriented data
model.
In this model, both the data and relationship are present in a single structure known as an
object. We can store audio, video, images, etc in the database which was not possible in the
relational model(although you can store audio and video in relational database, it is adviced not to
store in the relational database).
In this model, two are more objects are connected through links. We use this link to relate
one object to other objects. This can be understood by the example given below.

In the above example, we have two objects Employee and Department. All the data and
relationships of each object are contained as a single unit. The attributes like Name, Job_title of
the employee and the methods which will be performed by that object are stored as a single object.
The two objects are connected through a common attribute i.e the Department_id and the
communication between these two will be done with the help of this common id.
12

Object-Relational Model
As the name suggests it is a combination of both the relational model and the object-oriented
model. This model was built to fill the gap between object-oriented model and the relational
model. We can have many advanced features like we can make complex data types according to
our requirements using the existing data types. The problem with this model is that this can get
complex and difficult to handle. So, proper understanding of this model is required.
Flat Data Model
It is a simple model in which the database is represented as a table consisting of rows and
columns. To access any data, the computer has to read the entire table. This makes the modes slow
and inefficient.
Semi-Structured Model
Semi-structured model is an evolved form of the relational model. We cannot differentiate
between data and schema in this model.
Example: Web-Based data sources which we can't differentiate between the schema and data of
the website.
(6)What is Database Architecture?
A Database Architecture is a representation of DBMS design. It helps to design, develop, implement,
and maintain the database management system. DBMS architecture allows dividing the database system into
individual components that can be independently modified, changed, replaced, and altered. It also helps to
understand the components of a database.

Types of DBMS Architecture:


There are mainly three types of DBMS architecture:
• One Tier Architecture (Single Tier Architecture)
• Two Tier Architecture
• Three Tier Architecture

1-Tier Architecture:
1Tier Architecture in DBMS is the simplest architecture of Database in which the client, server, and
Database all reside on the same machine. A simple one tier architecture example would be anytime you
install a Database in your system and access it to practice SQL queries. But such architecture is rarely used
in production.

2-Tier Architecture:
A 2 Tier Architecture in DBMS is a Database architecture where the presentation layer runs on a
client (PC, Mobile, Tablet, etc.), and data is stored on a server called the second tier. Two tier architecture
provides added security to the DBMS as it is not exposed to the end-user directly. It also provides direct and
faster communication.
13

In the above 2 Tier client-server architecture of database management system, we can see that one
server is connected with clients 1, 2, and 3.
3-Tier Architecture:
A 3 Tier Architecture in DBMS is the most popular client server architecture in DBMS in which the
development and maintenance of functional processes, logic, data access, data storage, and user interface is
done independently as separate modules.
Three Tier architecture contains a presentation layer, an application layer, and a database server.
3-Tier database Architecture design is an extension of the 2-tier client-server architecture.
A 3-tier architecture has the following layers:

• Presentation layer (your PC, Tablet, Mobile, etc.)


• Application layer (server)
• Database Server

The Application layer resides between the user and the DBMS, which is responsible for
communicating the user’s request to the DBMS system and send the response from the DBMS to the user.
The application layer (business logic layer) also processes functional logic, constraint, and rules before
passing data to the user or down to the DBMS.
The goal of Three Tier client-server architecture is:
• To separate the user applications and physical database
• To support DBMS characteristics
• Program-data independence
• Supporting multiple views of the data

Chapter 2: INTRODUCTION TO DATABASE DESIGN
(7) What is Database Design? Beyond ER design
• Database Design is a collection of processes that facilitate the designing, development,
implementation and maintenance of enterprise data management systems.
• Properly designed database are easy to maintain, improves data consistency and are cost effective
in terms of disk storage space. The database designer decides how the data elements correlate and
what data must be stored.
• The main objectives of database design in DBMS are to produce logical and physical designs
models of the proposed database system.
• The logical model concentrates on the data requirements and the data to be stored independent of
physical considerations. It does not concern itself with how the data will be stored or where it will
be stored physically.
• The physical data design model involves translating the logical DB design of the database onto
physical media using hardware resources and software systems such as database management
systems (DBMS).
14

What is ER Diagram? Entity Relationship Diagram – ER Diagram in DBMS


An Entity–relationship model (ER model) describes the structure of a database with the help of a
diagram, which is known as Entity Relationship Diagram (ER Diagram).
An ER model is a design or blueprint of a database that can later be implemented as a database. The
main components of E-R model are: entity set and relationship set.
What is an Entity Relationship Diagram (ER Diagram)?
An ER diagram shows the relationship among entity sets. An entity set is a group of similar entities
and these entities can have attributes.
In terms of DBMS, an entity is a table or attribute of a table in database, so by showing relationship
among tables and their attributes, ER diagram shows the complete logical structure of a database.

A simple ER Diagram:
In the following diagram we have two entities Student and College and their relationship.
• The relationship between Student and College is many to one as a college can have many
students however a student cannot study in multiple colleges at the same time.

Rectangle: Represents Entity sets.


Ellipses: Attributes
Diamonds: Relationship Set
Lines: They link attributes to Entity Sets and Entity sets to Relationship Set
Double Ellipses: Multivalued Attributes
Dashed Ellipses: Derived Attributes
Double Rectangles: Weak Entity Sets
Double Lines: Total participation of an entity in a relationship set Components of a ER Diagram

As shown in the above diagram, an ER diagram has three main components:


1. Entity
2. Attribute
3. Relationship

1. Entity
An entity is an object or component of data. An entity is represented as rectangle in an ER diagram.
For example: In the following ER diagram we have two entities Student and College and these two entities
have many to one relationship as many students study in a single college. We will read more about
relationships later, for now focus on entities.
15

Weak Entity:
An entity that cannot be uniquely identified by its own attributes and relies on the relationship with other
entity is called weak entity. The weak entity is represented by a double rectangle. For example – a bank
account cannot be uniquely identified without knowing the bank to which the account belongs, so bank
account is a weak entity.
2. Attribute
An attribute describes the property of an entity. An attribute is represented as Oval in an ER diagram.
There are four types of attributes
1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute
16

1. Key attributes:

A key attribute can uniquely identify an entity from an entity set. For example, student roll number can
uniquely identify a student from a set of students. Key attribute is represented by oval same as other
attributes however the text of key attribute is underlined.

2. Composite attribute:
An attribute that is a combination of other attributes is known as composite attribute. For example, In
student entity, the student address is a composite attribute as an address is composed of other attributes such
as pin code, state, country.

3. Multivalued attribute:
An attribute that can hold multiple values is known as multivalued attribute. It is represented
with double ovals in an ER Diagram. For example – A person can have more than one phone numbers so
the phone number attribute is multivalued.
4. Derived attribute:
A derived attribute is one whose value is dynamic and derived from another attribute. It is represented
by dashed oval in an ER Diagram. For example – Person age is a derived attribute as it changes over time
and can be derived from another attribute (Date of birth).

E-R diagram with multivalued and derived attributes:


3. Relationship
A relationship is represented by diamond shape in ER diagram, it shows the relationship among
entities. There are four types of relationships:
1. One to One
2. One to Many
3. Many to One
4. Many to Many
17

1. One to One Relationship

When a single instance of an entity is associated with a single instance of another entity then it is
called one to one relationship. For example, a person has only one passport and a passport is given to one
person.
2. One to many Relationship

When a single instance of an entity is associated with more than one instances of another entity then it
is called one to many relationship. For example – a customer can place many orders but a order cannot be
placed by many customers.

3. Many to One Relationship

When more than one instances of an entity is associated with a single instance of another entity then it
is called many to one relationship. For example – many students can study in a single college but a student
cannot study in many colleges at the same time.

4. Many to Many Relationship

When more than one instances of an entity is associated with more than one instances of another entity
then it is called many to many relationship. For example, a can be assigned to many projects and a project
can be assigned to many students.
18

Total Participation of an Entity set

A Total participation of an entity set represents that each entity in entity set must have at least one
relationship in a relationship set. For example: In the below diagram each college must have at-least one
associated Student.
Types of relationship are as follows:

a. One-to-One Relationship:
When only one instance of an entity is associated with the relationship, then it is known as one to one
relationship.
For example, a female can marry to one male, and a male can marry to one female.

b. One-to-many relationship:
When only one instance of the entity on the left, and more than one instance of an entity on the right
associates with the relationship then this is known as a one-to-many relationship.
For example, Scientist can invent many inventions, but the invention is done by the only specific
scientist.
c. Many-to-one relationship:

When more than one instance of the entity on the left, and only one instance of an entity on the right
associates with the relationship then it is known as a many-to-one relationship.
For example, Student enrolls for only one course, but a course can have many students.
19

d. Many-to-many relationship:
When more than one instance of the entity on the left, and more than one instance of an entity on the
right associates with the relationship then it is known as a many-to-many relationship.
For example, Employee can assign by many projects and project can have many employees.

(8)Extended E-R Features:


The basic E-R concepts can model most database features, some aspects of a database may be more
aptly expressed by certain extensions to the basic E-R model. The extended E-R features are specialization,
generalization, higher- and lower-level entity sets, attribute inheritance, and aggregation.
EER is a high-level data model that incorporates the extensions to the original ER model. Enhanced
ERD are high level models that represent the requirements and complexities of complex database.
In addition to ER model concepts EE-R includes −
• Subclasses and Super classes.
• Specialization and Generalization.
• Category or union type.
• Aggregation.
These concepts are used to create EE-R diagrams.
Subclasses and Super class
Super class is an entity that can be divided into further subtype.
For example − consider Shape super class.

Super class shape has sub groups: Triangle, Square and Circle.
Sub classes are the group of entities with some unique attributes. Sub class inherits the properties and
attributes from super class.
Specialization and Generalization
Generalization is a process of generalizing an entity which contains generalized attributes or properties
of generalized entities.

It is a Bottom up process i.e. considers we have 3 sub entities Car, Truck and Motorcycle. Now these
three entities can be generalized into one super class named as Vehicle.
20

Specialization is a process of identifying subsets of an entity that share some different characteristic. It
is a top down approach in which one entity is broken down into low level entity.
In above example Vehicle entity can be a Car, Truck or Motorcycle.
Owner is the subset of two super classes: Vehicle and House.
Aggregation
Represents relationship between a whole object and its component.
Consider a ternary relationship Works_On between Employee, Branch and Manager. Now the best
way to model this situation is to use aggregation, So, the relationship-set, Works_On is a higher level entity-
set. Such an entity-set is treated in the same manner as any other entity-set. We can create a binary
relationship, Manager, between Works_On and Manager to represent who manages what tasks

Conceptual Design with the ER Model:


• Conceptual design follows requirements analysis,
– Yields a high-level description of data to be stored
• ER model popular for conceptual design
– Constructs are expressive, close to the way people think about their applications.
– Note: There are many variations on ER model
• Both graphically and conceptually
• Basic constructs: entities, relationships, and attributes (of entities and relationships).
• Some additional constructs: weak entities, ISA hierarchies (see text if you’re curious), and
aggregation.
• When developing an information system, we first specify what is required and produce a design to
meet these requirements. For reasons outlined earlier, we recommend first developing this design at
the conceptual level using Object-Role Modeling. This entails describing the structure of the UoD
formally in terms of an ORM conceptual schema. If an ORM context is understood, we'll often
shorten “ORM conceptual schema” to just “conceptual schema”.
• The procedure for designing a conceptual schema small enough to manage as a single unit is referred
to as the Conceptual Schema Design Procedure (CSDP).
• Large business domains are first divided into subareas (which may overlap).
• These subareas are normally prioritized to determine which ones to develop first, and a
conceptual subschema is designed for each using the CSDP.
• This top-down design approach may be summarized as follows:
• Divide the universe of discourse into manageable subareas.
• Apply the CSDP to each subarea.
• Integrate the subschemas into a global conceptual schema.
21

– For each manageably sized business domain or subarea, the conceptual schema design is
performed in seven steps:
1 Transform familiar examples into elementary facts.
• 2. Draw the fact types, and apply a population check.
• 3. Check for entity types to be combined, and note any arithmetic derivations.
• 4. Add uniqueness constraints, and check the arity of fact types.
• 5. Add mandatory role constraints, and check for logical derivations.
• 6. Add value, set-comparison, and subtyping constraints.
• 7. Add other constraints and perform final checks.
22

UNIT-2
INTRODUCTION TO DATABASE DESIGN
(1)Introduction to the Relational Model:
Relational Model was proposed by E.F.Codd to model data in the form of relations or tables. After
designing the conceptual model of Database using ER diagram, we need to convert the conceptual model
in the relational model which can be implemented using any RDBMS languages like Oracle SQL, MySQL
etc. So we will see what Relational Model is.

What is Relational Model?


Relational Model (RM) represents the database as a collection of relations. A relation is nothing but a
table of values. Every row in the table represents a collection of related data values. These rows in the table
denote a real-world entity or relationship.
The table name and column names are helpful to interpret the meaning of values in each row.
The data are represented as a set of relations. In the relational model, data are stored as tables.
However, the physical storage of the data is independent of the way the data are logically organized.
Some popular Relational Database management systems are:

• DB2 and Informix Dynamic Server – IBM


• Oracle and RDB – Oracle
• SQL Server and Access – Microsoft
Relational Model Concepts
1. Attribute: Each column in a Table. Attributes are the properties which define a relation. e.g.,
Student_Rollno, NAME,etc.
2. Tables – In the Relational model the, relations are saved in the table format. It is stored along with
its entities. A table has two properties rows and columns. Rows represent records and columns
represent attributes.
3. Tuple – It is nothing but a single row of a table, which contains a single record.
4. Relation Schema: A relation schema represents the name of the relation with its attributes.
5. Degree: The total number of attributes which in the relation is called the degree of the relation.
6. Cardinality: Total number of rows present in the Table.
7. Column: The column represents the set of values for a specific attribute.
8. Relation instance – Relation instance is a finite set of tuples in the RDBMS system. Relation
instances never have duplicate tuples.
9. Relation key – Every row has one, two or multiple attributes, which is called relation key.
10. Attribute domain – Every attribute has some pre-defined value and scope which is known as
attribute domain
23

Relational Integrity Constraints


Relational Integrity constraints in DBMS are referred to conditions which must be present for a valid
relation. These Relational constraints in DBMS are derived from the rules in the mini-world that the
database represents.
There are many types of Integrity Constraints in DBMS. Constraints on the Relational database
management system are mostly divided into three main categories are:
1. Domain Constraints
2. Key Constraints
3. Referential Integrity Constraints
Domain Constraints
Domain constraints can be violated if an attribute value is not appearing in the corresponding domain
or it is not of the appropriate data type.
Domain constraints specify that within each tuple, and the value of each attribute must be unique. This
is specified as data types which include standard data types integers, real numbers, characters, Booleans,
variable length strings, etc.
Example: Create DOMAIN Customer Name
CHECK (value not NULL)
The example shown demonstrates creating a domain constraint such that Customer Name is not NULL

Key Constraints
An attribute that can uniquely identify a tuple in a relation is called the key of the table. The value of
the attribute for different tuples in the relation has to be unique.

Example: In the given table, CustomerID is a key attribute of Customer Table. It is most likely to
have a single key for one customer, CustomerID =1 is only for the Customer Name =” Google”.

CustomerID CustomerName Status

1 Google Active

2 Amazon Active

3 Apple Inactive

Referential Integrity Constraints


Referential Integrity constraints in DBMS are based on the concept of Foreign Keys. A foreign key is
an important attribute of a relation which should be referred to in other relationships. Referential integrity
constraint state happens where relation refers to a key attribute of a different or same relation. However, that
key element must exist in the table.

Example: Tuple for CustomerID =1 is referenced twice in the relation Billing. So we know
CustomerName=Google has billing amount $300
24

In the above example, we have 2 relations, Customer and Billing.

Operations in Relational Model


Four basic update operations performed on relational database model are
Insert, update, delete and select.

• Insert is used to insert data into the relation


• Delete is used to delete tuples from the table.
• Modify allows you to change the values of some attributes in existing tuples.
• Select allows you to choose a specific range of data.

Whenever one of these operations are applied, integrity constraints specified on the relational database
schema must never be violated.
Insert Operation
The insert operation gives values of the attribute for a new tuple which should be inserted into a
relation.

Update Operation
You can see that in the below-given relation table CustomerName= ‘Apple’ is updated from Inactive
to Active.
25

Delete Operation
To specify deletion, a condition on the attributes of the relation selects the tuple to be deleted.

In the above-given example, CustomerName= “Apple” is deleted from the table.

The Delete operation could violate referential integrity if the tuple which is deleted is referenced by
foreign keys from other tuples in the same database.

Select Operation

In the above-given example, CustomerName=”Amazon” is selected

Advantages of using Relational Model


• Simplicity: A Relational data model in DBMS is simpler than the hierarchical and network model.
• Structural Independence: The relational database is only concerned with data and not with a
structure. This can improve the performance of the model.
• Easy to use: The Relational model in DBMS is easy as tables consisting of rows and columns are
quite natural and simple to understand
• Query capability: It makes possible for a high-level query language like SQL to avoid complex
database navigation.
• Data independence: The Structure of Relational database can be changed without having to change
any application.
• Scalable: Regarding a number of records, or rows, and the number of fields, a database should be
enlarged to enhance its usability.

Disadvantages of using Relational Model


• Few relational databases have limits on field lengths which can’t be exceeded.
• Relational databases can sometimes become complex as the amount of data grows, and the relations
between pieces of data become more complicated.
• Complex relational database systems may lead to isolated databases where the information cannot be
shared from one system to another.

(2) Enforcing data integrity in databases:


Data integrity refers to the correctness and completeness of data within a database.
• To enforce data integrity, you can constrain or restrict the data values that users can insert,
delete, or update in the database.
26

For example, the integrity of data in the pubs2 and pubs3 databases requires that a book title in
the titles table must have a publisher in the publishers table.
You cannot insert books that do not have a valid publisher into titles, because it violates the data
integrity of pubs2 or pubs3.
Transact-SQL provides several mechanisms for integrity enforcement in a database such as rules,
defaults, indexes, and triggers. These mechanisms allow you to maintain these types of data integrity:
• Requirement – requires that a table column must contain a valid value in every row; it cannot allow
null values. The create table statement allows you to restrict null values for a column.
• Check or validity – limits or restricts the data values inserted into a table column. You can use
triggers or rules to enforce this type of integrity.
• Uniqueness – no two table rows can have the same non-null values for one or more table columns.
You can use indexes to enforce this integrity.
• Referential – data inserted into a table column must already have matching data in another table
column or another column in the same table. A single table can have up to 192 references.
As an alternative to using rules, defaults, indexes, and triggers, Transact-SQL provides a series of integrity
constraints as part of the create table statement to enforce data integrity as defined by the SQL standards.
These integrity constraints are described later in this chapter.

(3)Querying relational data:


➢ A relational database query is a question about the data& the answer consist of a new relation
containing result.
➢ SQL is the most popular commercial query language for RDBMS(relational database management
system)
➢ A relational query must specify the tables required a query language is a specialised language for
writing queries
Example:
S.id Name Login Age GPA
53831 Rameswaran reameswaran@GATE 11 7.8
53832 Krishnan Krishnan@GATE 12 8.0

➢ Do you want to find all students younger than 18 or all students enrolled in registration number 203
➢ Students with age less than 18 on instance instance 51
➢ We can retrieve rows corresponding to students who are younger than 18 with the following query
➢ Select *from students S where s. age < 18.
➢ Select is a command comes under data query language of SQL commands
➢ Select statement retrieves zero or more rows from one or more database tables
➢ Select most commonly used data manipulation language command
➢ The symbol * means that we relation all fields of selected tuples in the result
➢ ‘S’ as a variable that takes a on the value of each tuple in students table one double after other
➢ The condition s.age < 18 in the where clause specify that we want to select only tuples when the age
field has a value less than 18
➢ This Query evaluates to the relation show fig1
➢ In addition to selecting a subset of tuples a query can extract a subset of fields of each selected tuple
➢ We can compute the of names of login
➢ We can compute the names of logins of students who are younger than 18 with the following query
SELECT s.name, s.login FROM students S WHERE s.age<18;
➢ The Output for the query:
name Login
Rameswaran rameswaran@GATE
Krishnan Krishnan@GATE
27

➢ We can also combine information in the student and enrolled relations.


➢ If we want to obtain the names of the all students who obtained a grade and the id of the course in
which they got an A, the query is
SELECT s.name, E.cid FROM student S, Enrolled E WHERE s.id = E.student AND E.grades;
➢ This query can be returning a single tuple.

(4)Logical data base Design


Logical Database Design: from Conceptual to Logical Schema:
Logical database design is the process of transforming (or mapping) a conceptual schema of the
application domain into a schema for the data model underlying a particular DBMS, such as the relational
or object-oriented data model. This mapping can be understood as the result of trying to achieve two
distinct sets of goals:
(i) representation goal: preserving the ability to capture and distinguish all valid states of the
conceptual schema;
(ii) Data management goals: addressing issues related to the ease and cost of querying the logical
schema, as well as costs of storage and constraint maintenance. This entry focuses mostly on the mapping
of (Extended) Entity-Relationship (EER) diagrams to relational databases.
➢ The ER model is convenient for representing an initial, high level DB design.
➢ The ER diagram describing a DB, a standard approach is taken to generating relational DB schema.
➢ Now describe how to translate an ER diagram into collection of tables with associated constraint is
a relational DB schema.
i.Entity sets to tables.
ii.Relationship sets to Tables (without constraint)
iii.Translating relationship sets with key constraint.
iv.Translating relationship sets with participation constraint.
v.Translating weak entity set.
vi.Translating class hierarchy.
vii.Translating ER – diagrams with aggression.
(5).Introduction to Views destroying/ altering Tables and Views.
What is a View?
SQL has a special version of tables called View, which is a virtual table that is compiled in runtime.
A View is just an SQL statement, and the data associated with it is not physically stored in the view but is
stored in the base tables of it.
Advantages of views:
➢ Security
Each user can be given permission to access the database only through a small set of views that
contain the specific data the user is authorized to see, thus restricting the user's access to stored data
➢ Query Simplicity:
A view can draw data from several different tables and present it as a single table, turning multi-table
queries into single-table queries against the view.
➢ Structural simplicity:
Views can give a user a "personalized" view of the database structure, presenting the database as a set
of virtual tables that make sense for that user.
➢ Data Integrity:
If data is accessed and entered through a view, the DBMS can automatically check the data to ensure
that it meets the specified integrity constraints.
➢ Logical data independence:
28

View can make the application and database tables to a certain extent independent. If there is no
view, the application must be based on a table. With the view, the program can be established in view of
above, to view the program with a database table to be separated.
Disadvantages of views:
➢ Performance:
Views create the appearance of a table, but the DBMS must still translate queries against the view into
queries against the underlying source tables. If the view is defined by a complex, multi-table query then
simple queries on the views may take considerable time.
➢ Update restrictions:
When a user tries to update rows of a view, the DBMS must translate the request into an update on rows of
the underlying source tables. This is possible for simple views, but more complex views are often restricted
to read-only.
Types of Views:
➢ Simple View: A view based on only a single table, which doesn't contain GROUP BY clause and any
functions.
➢ Complex View: A view based on multiple tables, which contain GROUP BY clause and functions.
➢ Inline View: A view based on a subquery in FROM Clause, that subquery creates a temporary table
and simplifies the complex query.
➢ Materialized View: A view that stores the definition as well as data. It creates replicas of data by
storing it physically.

Student_Detail

STU_ID NAME ADDRESS

1 Stephan Delhi

2 Kathrin Noida

3 David Ghaziabad

4 Alina Gurugram

Student_Marks
STU_ID NAME MARKS AGE

1 Stephan 97 19

2 Kathrin 86 21

3 David 74 18

4 Alina 90 20

5 John 96 18

1. Creating view
29

A view can be created using the CREATE VIEW statement. We can create a view from a single
table or multiple tables.
Syntax:
CREATE VIEW view_name AS SELECT column1, column2..... FROM table_name WHERE condition;
2. Creating View from a single table
In this example, we create a View named DetailsView from the table Student_Detail.
Query:
CREATE VIEW DetailsView AS SELECT NAME, ADDRESS FROM Student_Details
WHERE STU_ID < 4;
Just like table query, we can query the view to view the data.
SELECT * FROM DetailsView;

Output:
NAME ADDRESS

Stephan Delhi

Kathrin Noida

David Ghaziabad
3. Creating View from multiple tables:
View from multiple tables can be created by simply include multiple tables in the SELECT statement.
In the given example, a view is created named MarksView from two tables Student_Detail and
Student_Marks.
Query:
CREATE VIEW MarksView AS SELECT Student_Detail.NAME, Student_Detail.ADDRESS,
Student_Marks.MARKS FROM Student_Detail, Student_Mark WHERE Student_Detail.NAME =
Student_Marks.NAME;
To display data of View MarksView:
SELECT * FROM MarksView;

NAME ADDRESS MARKS

Stephan Delhi 97

Kathrin Noida 86

David Ghaziabad 74

Alina Gurugram 90
4. Deleting View
A view can be deleted using the Drop View statement.
Syntax
DROP VIEW view_name;
Example:
30

If we want to delete the View MarksView, we can do this as:


DROP VIEW MarksView;

5. Altering/ updating view:


The ALTER VIEW command allows you to modify a view. A view is based on the result set from a
query consisting of a SELECT statement or a UNION of two or more SELECT statements. See CREATE
VIEW for further information on using a UNION.
affecting any other table.
Syntax:
UPDATE < view_name > SET<column1>=<value1>,<column2>=<value2>,.....
WHERE <condition>;
31

Chapter 2: Relational algebra and Relational calculus

(1) What is Relational Algebra in DBMS?

Relational algebra is a procedural query language that works on relational model. The purpose of a
query language is to retrieve data from database or perform various operations such as insert, update, delete
on the data. When I say that relational algebra is a procedural query language, it means that it tells what data
to be retrieved and how to be retrieved.
On the other hand relational calculus is a non-procedural query language, which means it tells what
data to be retrieved but doesn’t tell how to retrieve it. We will discuss relational calculus in a separate
tutorial.

Types of operations in relational algebra

We have divided these operations in two categories:


1. Basic Operations
2. Derived Operations
Basic/Fundamental Operations:
1. Select (σ)
2. Project (∏)
3. Union (∪)
4. Set Difference (-)
5. Cartesian product (X)
6. Rename (ρ)
Derived Operations:
1. Natural Join (⋈)
2. Left, Right, Full outer join (⟕, ⟖, ⟗)
3. Intersection (∩)
4. Division (÷)
Let’s discuss these operations one by one with the help of examples.

Select Operator (σ)


Select Operator is denoted by sigma (σ) and it is used to find the tuples (or rows) in a relation (or
table) which satisfy the given condition.
If you understand little bit of SQL then you can think of it as a where clause in SQL, which is used for
the same purpose.
Syntax of Select Operator (σ)
σ Condition/Predicate(Relation/Table name)
Select Operator (σ) Example
Table: CUSTOMER
---------------

Customer_Id Customer_Name Customer_City


----------- ------------- -------------
C10100 Steve Agra
C10111 Raghu Agra
C10115 Chaitanya Noida
C10117 Ajeet Delhi
C10118 Carl Delhi
Query:
32

σ Customer_City="Agra" (CUSTOMER)
Output:

Customer_Id Customer_Name Customer_City


----------- ------------- -------------
C10100 Steve Agra
C10111 Raghu Agra
Project Operator (∏)
Project operator is denoted by ∏ symbol and it is used to select desired columns (or attributes) from a
table (or relation).
Project operator in relational algebra is similar to the Select statement in SQL.
Syntax of Project Operator (∏)
∏ column_name1, column_name2, ...., column_nameN(table_name)
Project Operator (∏) Example
In this example, we have a table CUSTOMER with three columns, we want to fetch only two columns
of the table, which we can do with the help of Project Operator ∏.
Table: CUSTOMER

Customer_Id Customer_Name Customer_City


----------- ------------- -------------
C10100 Steve Agra
C10111 Raghu Agra
C10115 Chaitanya Noida
C10117 Ajeet Delhi
C10118 Carl Delhi
Query:

∏ Customer_Name, Customer_City (CUSTOMER)


Output:

Customer_Name Customer_City
------------- -------------
Steve Agra
Raghu Agra
Chaitanya Noida
Ajeet Delhi
Carl Delhi
Union Operator (∪)
Union operator is denoted by ∪ symbol and it is used to select all the rows (tuples) from two tables
(relations).
Lets discuss union operator a bit more. Lets say we have two relations R1 and R2 both have same
columns and we want to select all the tuples(rows) from these relations then we can apply the union operator
on these relations.
Note: The rows (tuples) that are present in both the tables will only appear once in the union set. In
short you can say that there are no duplicates present after the union operation.
Syntax of Union Operator (∪)

table_name1 ∪ table_name2
Union Operator (∪) Example
Table 1: COURSE
33

Course_Id Student_Name Student_Id


--------- ------------ ----------
C101 Aditya S901
C104 Aditya S901
C106 Steve S911
C109 Paul S921
C115 Lucy S931
Table 2: STUDENT

Student_Id Student_Name Student_Age


------------ ---------- -----------
S901 Aditya 19
S911 Steve 18
S921 Paul 19
S931 Lucy 17
S941 Carl 16
S951 Rick 18
Query:

∏ Student_Name (COURSE) ∪ ∏ Student_Name (STUDENT)


Output:

Student_Name
------------
Aditya
Carl
Paul
Lucy
Rick
Steve
Note: As you can see there are no duplicate names present in the output even though we had few
common names in both the tables, also in the COURSE table we had the duplicate name itself.

Intersection Operator (∩)


Intersection operator is denoted by ∩ symbol and it is used to select common rows (tuples) from two
tables (relations).
Lets say we have two relations R1 and R2 both have same columns and we want to select all those
tuples(rows) that are present in both the relations, then in that case we can apply intersection operation on
these two relations R1 ∩ R2.
Note: Only those rows that are present in both the tables will appear in the result set.
Syntax of Intersection Operator (∩)
table_name1 ∩ table_name2
Intersection Operator (∩) Example
Lets take the same example that we have taken above.
Table 1: COURSE

Course_Id Student_Name Student_Id


--------- ------------ ----------
C101 Aditya S901
C104 Aditya S901
34

C106 Steve S911


C109 Paul S921
C115 Lucy S931
Table 2: STUDENT

Student_Id Student_Name Student_Age


------------ ---------- -----------
S901 Aditya 19
S911 Steve 18
S921 Paul 19
S931 Lucy 17
S941 Carl 16
S951 Rick 18
Query:

∏ Student_Name (COURSE) ∩ ∏ Student_Name (STUDENT)


Output:

Student_Name
------------
Aditya
Steve
Paul
Lucy

Set Difference (-)

Set Difference is denoted by – symbol. Lets say we have two relations R1 and R2 and we want to
select all those tuples(rows) that are present in Relation R1 but not present in Relation R2, this can be done
using Set difference R1 – R2.
Syntax of Set Difference (-)
table_name1 - table_name2
Set Difference (-) Example
Lets take the same tables COURSE and STUDENT that we have seen above.
Query:
Lets write a query to select those student names that are present in STUDENT table but not present in
COURSE table.
∏ Student_Name (STUDENT) - ∏ Student_Name (COURSE)
Output:
Student_Name
------------
Carl
Rick

Cartesian product (X)

Cartesian Product is denoted by X symbol. Lets say we have two relations R1 and R2 then the
cartesian product of these two relations (R1 X R2) would combine each tuple of first relation R1 with the
each tuple of second relation R2. I know it sounds confusing but once we take an example of this, you will
be able to understand this.
35

Syntax of Cartesian product (X)

R1 X R2
Cartesian product (X) Example
Table 1: R

Col_A Col_B
----- ------
AA 100
BB 200
CC 300
Table 2: S

Col_X Col_Y
----- -----
XX 99
YY 11
ZZ 101
Query:
Lets find the cartesian product of table R and S.

RXS
Output:

Col_A Col_B Col_X Col_Y


----- ------ ------ ------
AA 100 XX 99
AA 100 YY 11
AA 100 ZZ 101
BB 200 XX 99
BB 200 YY 11
BB 200 ZZ 101
CC 300 XX 99
CC 300 YY 11
CC 300 ZZ 101
Note: The number of rows in the output will always be the cross product of number of rows in each
table. In our example table 1 has 3 rows and table 2 has 3 rows so the output has 3×3 = 9 rows.
Rename (ρ)
Rename (ρ) operation can be used to rename a relation or an attribute of a relation.
Rename (ρ)
Syntax: ρ(new_relation_name, old_relation_name)
Rename (ρ) Example
Lets say we have a table customer, we are fetching customer names and we are renaming the resulted
relation to CUST_NAMES.
36

Table: CUSTOMER
Customer_Id Customer_Name Customer_City
----------- ------------- -------------
C10100 Steve Agra
C10111 Raghu Agra
C10115 Chaitanya Noida
C10117 Ajeet Delhi
C10118 Carl Delhi
Query:

ρ(CUST_NAMES, ∏(Customer_Name)(CUSTOMER))
Output:

CUST_NAMES
----------
Steve
Raghu
Chaitanya
Ajeet
Carl

SQL Joins and its types Explained Visually


Merging two data sets using SQL or SQL tools can be accomplished through JOINS. A JOIN is a SQL
instruction in the FROM clause of your query that is used to identify the tables you are querying and how
they should be combined.

Primary and Foreign Keys


Typically in a relational database, data is organized into various tables made of attributes (columns)
and records (rows).
In each table there exist a column that is the primary key which is a column where each entry
uniquely represents a single row in that table.
This is usually the ID (short for identifier) column.
A column in a table that establishes an association with another table’s primary key via shared values
is called a foreign key.
Foreign keys are also typically titled IDs but prepended with the name of the referenced table.
This concept is applied when combining two or more tables together using a JOIN.
37

Notice that between the two tables there is a common column (dimension) highlighted in green, User
ID. In the User Table, the ID column is the user ID and it’s the primary key for that table whereas, in the
Event Table, the User_ID column is the foreign key since that column refers to the ID column in the Users
table. We can use this relationship to join the two tables together to get the user and events information in
one table.

Meet the joins


There are three common ways you can join any two or more tables together we’ll talk about
first: Outer Join, Inner Join, and Left Join. Using the example User and Event tables above, let’s look at
some examples of joins…
Outer Join
Let’s say you want to have a table that contains all your user and event table data together.
You would use an Outer Join to join the tables together. An outer join combines the columns from all
tables on one or more common dimension when possible, and includes all data from all tables.

Inner Join
What if you want to have a table that contains only users that have done an action?
You would use an Inner Join to join the tables together. An inner join combines the columns on a
common dimension (the first N columns) when possible, and only includes data for the columns that share
the same values in the common N column(s). In the example, the User ID would be the common dimension
used for the inner join.

Left Join
Now, what if you want to have a table that contains all the users’ data and only actions that those users
have done? Actions performed by other users not in the users table should not be included?
You would use a Left Join to join the tables together. A left join combines the columns on a common
dimension (the first N columns) when possible, returning all rows from the first table with the
matching rows in the consecutive tables. The result is NULL in the consecutive tables when there is no
match. In this case, we would make the User Table the first (left table) to use for the left join.
38

For a more Detailed look at the Left Join click here.


Union and Cross Join
In addition to these common join types, there are some methods which will result in additional rows in
your output table as well as more columns. Two of these join types are called Union and Cross Join. These
join types probably wouldn’t be as appropriate for our example tables above, but for the sake of this article
we can still use them to see how these joins function. A Union Join will stack tables on top of each other
resulting in new rows.

A good use case for this would be if you’re looking to combine two tables by appending them rather
than joining them. A Cross Join would result in a table with all possible combinations of your tables’ rows
together. This can result in enormous tables and should be used with caution.

Cross Joins will likely only be used when your tables contain single values that you want to join
together without a common dimension.

What is Relational Calculus?

Relational calculus is a non-procedural query language that tells the system what data to be retrieved
but doesn’t tell how to retrieve it. the user is concerned with the details of how to obtain the end results.

➢ The relational calculus tells what to do but never explains how to do.
39

Types of Relational calculus:

Tuple relational Calculus


1. Tuple Relational Calculus (TRC):
➢ The tuple relational calculus is specified to select the tuples in a relation. In TRC, filtering variable
uses the tuples of a relation.
➢ The result of the relation can have one or more tuples.

Notation:
{T | P (T)} or {T | Condition (T)}
Where T is the resulting tuples
P(T) is the condition used to fetch T.
For example:
{ T.name | Author(T) AND T.article = 'database' }
OUTPUT: This query selects the tuples from the AUTHOR relation. It returns a tuple with
'name' from Author who has written an article on 'database'.
TRC (tuple relation calculus) can be quantified. In TRC, we can use Existential (∃) and Universal
Quantifiers (∀).
For example:
{ R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}
Output: This query will yield the same result as the previous one.

. Tuple Relational Calculus (TRC)


Tuple relational calculus is used for selecting those tuples that satisfy the given
condition.
Table: Student

First_Name Last_Name Age


---------- --------- ----
Ajeet Singh 30
Chaitanya Singh 31
Rajeev Bhatia 27
Carl Pratap 28
Lets write relational calculus queries.

Query to display the last name of those students where age is greater than 30
40

{ t.Last_Name | Student(t) AND t.age > 30 }


In the above query you can see two parts separated by | symbol. The second part is
where we define the condition and in the first part we specify the fields which we want to
display for the selected tuples.

The result of the above query would be:

Last_Name
---------
Singh
Query to display all the details of students where Last name is ‘Singh’

{ t | Student(t) AND t.Last_Name = 'Singh' }


Output:

First_Name Last_Name Age


---------- --------- ----
Ajeet Singh 30
Chaitanya Singh 31

Domain relational calculus -


2. Domain Relational Calculus (DRC):
➢ The second form of relation is known as Domain relational calculus. In domain relational calculus,
filtering variable uses the domain of attributes.
➢ Domain relational calculus uses the same operators as tuple calculus. It uses logical connectives ∧
(and), ∨ (or) and ┓ (not).
➢ It uses Existential (∃) and Universal Quantifiers (∀) to bind the variable.
Notation:
{ a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}
Where
a1, a2 are attributes
P stands for formula built by inner attributes
For example:
{< article, page, subject > | ∈ javatpoint ∧ subject = 'database'}
Output: This query will yield the article, page, and subject from the relational javatpoint, where the subject
is a database.
In domain relational calculus the records are filtered based on the domains.
Again we take the same table to understand how DRC works.
Table: Student

First_Name Last_Name Age


41

---------- --------- ----


Ajeet Singh 30
Chaitanya Singh 31
Rajeev Bhatia 27
Carl Pratap 28
Query to find the first name and age of students where student age is greater than 27

{< First_Name, Age > | ∈ Student ∧ Age > 27}


Note:
The symbols used for logical operators are: ∧ for AND, ∨ for OR and ┓ for NOT.

Output:

First_Name Age
---------- ----
Ajeet 30
Chaitanya 31
Carl 28

(4). Expressive Power of Algebra and calculus:

➢ a query language is said to relationally complete, if it can express all the queries that can be
expressed in relational algebra (SQL).
➢ SQL is relationally complete.
➢ SQL produces additional expressive power beyond relational algebra.
➢ Expressive power (theorem due to codd) every query that can be expressed in relational algebra can
be expressed as safe query in relational calculus, the converse Is also true.
Relational completeness:
➢ Query language (sql) can express query that is expressible in relational algebra/calculus (actually
SQL is macro powerful)
A Remark: unsafe query
∃ syntactically correct calculus queries that have am initiate no.of answers! Unsafe queries.
Ex: {s|┓ (s∈ sailors)}
➢ A query which yields an infinite (quasi) answers said to be unsafe, of course should not allowed by
system. It is possible to define a safe formula in TRC& DRC
Safety:
Certain queries stated in the relational calculus may lead to answers which contains am infinite no.of
tuples or atleast as many as the system can handle.
42

UNIT -3
INTRODUCTION

(1) What is MySQL & its Features:


MySQL is an open-source relational database management system that works on many platforms. It
provides multi-user access to support many storage engines and is backed by Oracle. So, you can buy a
commercial license version from Oracle to get premium support services.
Developed in the mid-90s., MySQL was one of the first open-source database available in the market.
Today there are many alternatives variants of MySQL,. However, the differences between the variants are
not significant as they use the same syntax, and basic functionality also remains same.
The features of MySQL are as follows:
• Ease of Management – The software very easily gets downloaded and also uses an event scheduler to
schedule the tasks automatically.
• Robust Transactional Support – Holds the ACID (Atomicity, Consistency, Isolation, Durability)
property, and also allows distributed multi-version support.
• Comprehensive Application Development – MySQL has plugin libraries to embed the database into
any application. It also supports stored procedures, triggers, functions, views and many more for
application development. You can refer to the RDS Tutorial, to understand Amazon’s RDBMS.
• High Performance – Provides fast load utilities with distinct memory caches and table index
partitioning.
• Secure Data Protection – MySQL supports powerful mechanisms to ensure that only authorized users
have access to the databases.
• High Availability – MySQL can run high-speed master/slave replication configurations and it offers
cluster servers.
• Scalability & Flexibility – With MySQL you can run deeply embedded applications and create data
warehouses holding a humongous amount of data.
MYSQL data base terms
• Database
A database is a named collection of tables. (see table). A database can also contain views,
indexes, sequences, data types, operators, and functions. Other relational database products
use the term catalog.
• Command
A command is a string that you send to the server in hopes of having the server do something
useful. Some people use the word statement to mean command. The two words are very
similar in meaning and, in practice, are interchangeable.
• Query
A query is a type of command that retrieves data from the server.
• Table (relation, file, class)
A table is a collection of rows. A table usually has a name, although some tables are
temporary and exist only to carry out a command. All the rows in a table have the same shape
(in other words, every row in a table contains the same set of columns). In other database
systems, you may see the terms relation, file, or even class? these are all equivalent to a table.
• Column (field, attribute)
A column is the smallest unit of storage in a relational database. A column represents one
piece of information about an object. Every column has a name and a data type. Columns are
grouped into rows, and rows are grouped into tables. In Figure 1.1, the shaded area depicts a
single column. The terms field and attribute have similar meanings.
43

• Row (record, tuple)


A row is a collection of column values. Every row in a table has the same shape (in other
words, every row is composed of the same set of columns). If you are trying to model a real-
world application, a row represents a real-world object. For example, if you are running an
auto dealership, you might have a vehicles table. Each row in the vehicles table represents a
car (or truck, or motorcycle, and so on). The kinds of information that you store are the same
for all vehicles (that is, every car has a color, a vehicle ID, an engine, and so on). In Figure
1.2, the shaded area depicts a row.
You may also see the terms record or tuple? these are equivalent to a row.
• View
A view is an alternative way to present a table (or tables). You might think of a view as a
"virtual" table. A view is (usually) defined in terms of one or more tables. When you create a
view, you are not storing more data, you are instead creating a different way of looking at
existing data. A view is a useful way to give a name to a complex query that you may have to
use repeatedly.
• Transaction
A transaction is a collection of database operations that are treated as a unit. PostgreSQL
guarantees that all the operations within a transaction complete or that none of them
complete. This is an important property?it ensures that if something goes wrong in the middle
of a transaction, changes made before the point of failure will not be reflected in the
database. A transaction usually starts with a BEGIN command and ends with
a COMMIT or ROLLBACK (see the next entries).
• Commit
A commit marks the successful end of a transaction. When you perform a commit, you are
telling PostgreSQL that you have completed a unit of operation and that all the changes that
you made to the database should become permanent.
• Rollback
A rollback marks the unsuccessful end of a transaction. When you roll back a transaction,
you are telling PostgreSQL to discard any changes that you have made to the database (since
the beginning of the transaction).
• Index
An index is a data structure that a database uses to reduce the amount of time it takes to
perform certain operations. An index can also be used to ensure that duplicate values don't
appear where they aren't wanted. I'll talk about indexes in Chapter 4, "Query Optimization."
MySQL Data Types
• Numeric – This data type includes integers of various sizes, floating-point(real) of various
precisions and formatted numbers.
• Character-string – These data types either have a fixed, or a varying number of characters.
This data type also has a variable-length string called CHARACTER LARGE
OBJECT (CLOB) which is used to specify columns that have large text values.
• Bit-string – These data types are either of a fixed length or varying length of bits. There is
also a variable-length bit string data type called BINARY LARGE
OBJECT(BLOB), which is available to specify columns that have large binary values, such
as images.
• Boolean – This data type has TRUE or FALSE values. Since SQL, has NULL values, a
three-valued logic is used, which is UNKNOWN.
44

• Date & Time – The DATE data type has: YEAR, MONTH, and DAY in the form YYYY-
MM-DD. Similarly, the TIME data type has the components HOUR, MINUTE, and
SECOND in the form HH:MM: SS. These formats can change based on the requirement.
• Timestamp & Interval – The TIMESTAMP data type includes a minimum of six positions,
for decimal fractions of seconds and an optional WITH TIME ZONE qualifier in addition to
the DATE and TIME fields. The INTERVAL data type mentions a relative value that can be
used to increment or decrement an absolute value of a date, time, or timestamp.
Access mysql:
In order to access your MySQL database, please follow these steps:
1. Log into your Linux web server via Secure Shell.
2. Open the MySQL client program on the server in the /usr/bin directory.
3. Type in the following syntax to access your database:
$ mysql -h {hostname} -u username -p {databasename}
Password: {your password}
hostname: the name of the MySQL server that you are assigned to, for example,
mysql4.safesecureweb.com
databasename: the name of your MySQL database
password: the password you use to access your MySQL database

Difference between dbms and rdbms:


No. DBMS RDBMS

1) DBMS applications RDBMS applications store data in


store data as file. a tabular form.

2) Normalization is not present Normalization is present in


in DBMS. RDBMS.

3) DBMS does not apply any RDBMS defines the integrity


security with regards to data constraint for the purpose of ACID
manipulation. (Atomocity, Consistency, Isolation and
Durability) property.

4) DBMS uses file system to in RDBMS, data values are stored


store data, so there will be no in the form of tables, so
relation between the tables. a relationship between these data values
will be stored in the form of a table as
well.

5) DBMS does not support RDBMS supports distributed


distributed database. database.

6) DBMS is meant to be for RDBMS is designed to handle


small organization and deal with large amount of data. it supports multiple
small data. it supports single user. users.

7) Examples of DBMS are file Example of RDBMS


systems, xml etc. are mysql, postgre, sql server, oracle etc.
45

Difference between sql and mysql:

SQL MySQL

SQL is Structured Query MySQL is a relational database management system used to


Language used to manage the store, retrieve, modify and administer a database using SQL. We
relational databases. have a lot of database software available in the market. The
popular ones include MySQL, SQL Server, Oracle, Informix, etc.

It’s a query language. It’s database software. It uses SQL as a language to query
the database.

Since this is a language, it Since it’s a software, it gets frequent updates.


does not get updates. SQL
commands always remain the
same.

MYSQL COMMAND SYNTAX


SQL commands are instructions. It is used to communicate with the database. It is also used to
perform specific tasks, functions, and queries of data.
SQL can perform various tasks like create a table, add data to tables, drop the table, modify the table,
set permission for users.

1. Data Definition Language (DDL)


o DDL changes the structure of the table like creating a table, deleting a table, altering a table, etc.
o All the command of DDL are auto-committed that means it permanently save all the changes in the
database.
46

Here are some commands that come under DDL:


o CREATE
o ALTER
o DROP
o TRUNCATE
CREATE It is used to create a new table in the database.
Syntax:
CREATE TABLE TABLE_NAME (COLUMN_NAME DATATYPES[,....]);
Example:
CREATE TABLE EMPLOYEE(Name VARCHAR2(20), Email VARCHAR2(100), DOB DATE);
DROP: It is used to delete both the structure and record stored in the table.
Syntax
DROP TABLE table_name;
Example
DROP TABLE EMPLOYEE;
ALTER: It is used to alter the structure of the database. This change could be either to modify the
characteristics of an existing attribute or probably to add a new attribute.
Syntax:
To add a new column in the table
ALTER TABLE table_name ADD column_name COLUMN-definition;
To modify existing column in the table:
ALTER TABLE table_name MODIFY(column_definitions....);

EXAMPLE
ALTER TABLE STU_DETAILS ADD(ADDRESS VARCHAR2(20));
ALTER TABLE STU_DETAILS MODIFY (NAME VARCHAR2(20));
d. TRUNCATE: It is used to delete all the rows from the table and free the space containing the table.
Syntax:
TRUNCATE TABLE table_name;
Example:
TRUNCATE TABLE EMPLOYEE;

2. Data Manipulation Language


o DML commands are used to modify the database. It is responsible for all form of changes in the
database.
o The command of DML is not auto-committed that means it can't permanently save all the changes in
the database. They can be rollback.
Here are some commands that come under DML

o INSERT
o UPDATE
o DELETE
INSERT: The INSERT statement is a SQL query. It is used to insert data into the row of a table.
Syntax:
INSERT INTO TABLE_NAME (col1, col2, …. col N) VALUES (value1, value2, , .... valueN);
Or
INSERT INTO TABLE_NAME VALUES (value1, value2, value3, .... valueN);
For example:
INSERT INTO javatpoint (Author, Subject) VALUES ("Sonoo", "DBMS");
47

UPDATE: This command is used to update or modify the value of a column in the table.
Syntax:
UPDATE table_name SET [column_name1= value1,...column_nameN = valueN]
[WHERE CONDITION]
For example:
UPDATE students SET User_Name = 'Sonoo' WHERE Student_Id = '3'
DELETE: It is used to remove one or more row from a table.
Syntax:
DELETE FROM table_name [WHERE condition];
For example:
DELETE FROM javatpoint WHERE Author="Sonoo";

Data Control Language


DCL commands are used to grant and take back authority from any database user.
Here are some commands that come under DCL:
o Grant
o Revoke
Grant: It is used to give user access privileges to a database.
Example
GRANT SELECT, UPDATE ON MY_TABLE TO SOME_USER, ANOTHER_USER;

Revoke: It is used to take back permissions from the user.


Example
REVOKE SELECT, UPDATE ON MY_TABLE FROM USER1, USER2;

Transaction Control Language


TCL commands can only use with DML commands like INSERT, DELETE and UPDATE only.
These operations are automatically committed in the database that's why they cannot be used while
creating tables or dropping them.
Here are some commands that come under TCL:
o COMMIT
o ROLLBACK
o SAVEPOINT
Commit: Commit command is used to save all the transactions to the database.
Syntax:
COMMIT;
Example:
DELETE FROM CUSTOMERS WHERE AGE = 25; COMMIT;
b. Rollback: Rollback command is used to undo transactions that have not already been saved to the
database.
Syntax:
ROLLBACK;
Example:
DELETE FROM CUSTOMERS WHERE AGE = 25; ROLLBACK;
SAVEPOINT: It is used to roll the transaction back to a certain point without rolling back the entire
transaction.
Syntax:
SAVEPOINT SAVEPOINT_NAME;
48

Data Query Language


DQL is used to fetch the data from the database.
It uses only one command:
SELECT
SELECT: This is the same as the projection operation of relational algebra. It is used to select the attribute
based on the condition described by WHERE clause.
Syntax:
SELECT expressions
FROM TABLES
WHERE conditions;
For example:
SELECT emp_name FROM employee WHERE age > 20;

LIMIT clause
The LIMIT clause is used in the SELECT statement to constrain the number of rows to return.
The LIMIT clause accepts one or two arguments. The values of both arguments must be zero or
positive integers.
The following illustrates the LIMIT clause syntax with two arguments:
SELECT
select_list
FROM
table_name
LIMIT [offset,] row_count;
Code language: SQL (Structured Query Language) (sql)

In this syntax:
• The offset specifies the offset of the first row to return. The offset of the first row is 0, not 1.
• The row_count specifies the maximum number of rows to return.

Sort Results:
This MySQL tutorial explains how to use the MySQL ORDER BY clause with syntax and examples.
Description
The MySQL ORDER BY clause is used to sort the records in your result set.
Syntax
The syntax for the ORDER BY clause in MySQL is:
SELECT expressions
FROM tables
[WHERE conditions]
ORDER BY expression [ ASC | DESC ];
49

SQL Logical Operators


The Logical Operators in SQL perform the Boolean operations, which give two results True and
False. These operators provide True value if both operands match the logical condition.
Following are the various logical operators which are performed on the data stored in the SQL
database tables:
1. SQL ALL operator
2. SQL AND operator
3. SQL OR operator
4. SQL BETWEEN operator
5. SQL IN operator
6. SQL NOT operator
7. SQL ANY operator
8. SQL LIKE operator
SQL ALL Operator
The ALL operator in SQL compares the specified value to all the values of a column from the sub-
query in the SQL database.
This operator is always used with the following statement:
1. SELECT,
2. HAVING, and
3. WHERE.
Syntax of ALL operator:
SELECT column_Name1, ...., column_NameN FROM table_Name WHERE column Comparison_ope
rator ALL (SELECT column FROM tablename2)

SQL AND Operator


The AND operator in SQL would show the record from the database table if all the conditions
separated by the AND operator evaluated to True. It is also known as the conjunctive operator and is used
with the WHERE clause.
Syntax of AND operator:
SELECT column1, ...., columnN FROM table_Name WHERE condition1 AND condition2 AND c
ondition3 AND ....... AND conditionN;

SQL OR Operator
The OR operator in SQL shows the record from the table if any of the conditions separated by the OR
operator evaluates to True. It is also known as the conjunctive operator and is used with the WHERE clause.
Syntax of OR operator:
SELECT column1, ...., columnN FROM table_Name WHERE condition1 OR condition2 OR cond
ition3 OR ....... OR conditionN;
SQL BETWEEN Operator
The BETWEEN operator in SQL shows the record within the range mentioned in the SQL query.
This operator operates on the numbers, characters, and date/time operands.
If there is no value in the given range, then this operator shows NULL value.
Syntax of BETWEEN operator:
SELECT column_Name1, column_Name2 ...., column_NameN FROM table_Name WHERE
column_nameBETWEEN value1 and value2;
50

SQL IN Operator
The IN operator in SQL allows database users to specify two or more values in a WHERE clause.
This logical operator minimizes the requirement of multiple OR conditions.
This operator makes the query easier to learn and understand. This operator returns those rows whose
values match with any value of the given list.
Syntax of IN operator:
SELECT column_Name1, column_Name2 ...., column_NameN FROM table_Name WHERE
column_name IN (list_of_values);

SQL NOT Operator


The NOT operator in SQL shows the record from the table if the condition evaluates to false. It is
always used with the WHERE clause.
Syntax of NOT operator:
SELECT column1, column2 ...., columnN FROM table_Name WHERE NOT condition;
SQL ANY Operator
The ANY operator in SQL shows the records when any of the values returned by the sub-query
meet the condition. The ANY logical operator must match at least one record in the inner query and
must be preceded by any SQL comparison operator.
Syntax of ANY operator:
SELECT column1, column2 ...., columnN FROM table_Name WHERE column_name compar
ison_operator ANY ( SELECT column_name FROM table_name WHERE condition(s)) ;

SQL LIKE Operator


The LIKE operator in SQL shows those records from the table which match with the given pattern
specified in the sub-query.
The percentage (%) sign is a wildcard which is used in conjunction with this logical operator.
This operator is used in the WHERE clause with the following three statements:
1. SELECT statement
2. UPDATE statement
3. DELETE statement

Syntax of LIKE operator:


SELECT column_Name1, column_Name2 ...., column_NameN FROM table_Name WHERE column_
name LIKE pattern;

ADVANCED USAGE of MYSQL


MySQL Primary Key
MySQL primary key is a single or combination of the field, which is used to identify each record in a
table uniquely. If the column contains primary key constraints, then it cannot be null or empty. A table may
have duplicate columns, but it can contain only one primary key. It always contains unique value into a
column.
When you insert a new row into the table, the primary key column can also use
the AUTO_INCREMENT attribute to generate a sequential number for that row
automatically. MySQL automatically creates an index named "Primary" after defining a primary key into
the table. Since it has an associated index, we can say that the primary key makes the query performance
fast.
51

Rules for Primary key


Following are the rules for the primary key:
1. The primary key column value must be unique.
2. Each table can contain only one primary key.
3. The primary key column cannot be null or empty.
4. MySQL does not allow us to insert a new row with the existing primary key.
5. It is recommended to use INT or BIGINT data type for the primary key column.
We can create a primary key in two ways
➢ CREATE TABLE Statement
➢ ALTER TABLE Statement
Let us discuss each one in detail.
Primary Key Using CREATE TABLE Statement
In this section, we are going to see how a primary key is created using the CREATE
TABLE statement.
Syntax
CREATE TABLE table_name( col1 datatype PRIMARY KEY, col2 datatype, ... );

Primary Key Using ALTER TABLE Statement


This statement allows us to do the modification into the existing table. When the table does not have
a primary key, this statement is used to add the primary key to the column of an existing table.
Syntax
ALTER TABLE table_name ADD PRIMARY KEY(column_list);

DROP Primary Key


The ALTER TABLE statement also allows us to drop the primary key from the table. The following
syntax is used to drop the primary key:
syntax
ALTER TABLE table_name DROP PRIMARY KEY;

MYSQL String functions:


ASCII()
This function returns the numeric value of the leftmost character of the string str. Returns 0 if str is
the empty string. Returns NULL if str is NULL
Syntax : ASCII(str)
Example : SELECT ASCII('2');
Output : 50
CHAR_LENGTH()
Returns the length of the string str, measured in characters. A multi-byte character counts as a single
character. This means that for a string containing five 2-byte characters,
LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.
Syntax : CHAR_LENGTH(str)
Example : SELECT CHAR_LENGTH('test string');
Output : 11
52

LCASE()
MySQL LCASE() converts the characters of a string to lower case characters.
Syntax : LCASE(str)
Example : SELECT LCASE('MYTESTSTRING');
Output : myteststring
LENGTH()
MySQL LENGTH() returns the length of a given string.
Syntax : LENGTH(str)
Example : SELECT LENGTH('text');
Output : 4
LOWER()
MySQL LOWER() converts all the characters in a string to lowercase characters.
Syntax : LOWER(str)
Example : SELECT LOWER('MYTESTSTRING');
Output : myteststring
UPPER()
MySQL UPPER() converts all the characters in a string to uppercase characters.
Syntax : UPPER(str)
Example : SELECT UPPER('myteststring');
Output : MYTESTSTRING
SUBSTR()
MySQL SUBSTR() returns the specified number of characters from a particular position of a given
string. SUBSTR() is a synonym for SUBSTRING().
Syntax : SUBSTR(str,pos,len)
Example : SELECT SUBSTR('w3resource',4,3);
Output : eso
CHAR()
CHAR() interprets each argument N as an integer and returns a string consisting of the characters
given by the code values of those integers. NULL values are skipped.
Syntax : CHAR(N,... [USING charset_name])
Example : SELECT CHAR(77,121,83,81,'76');
Output : MySQL

CONCAT()
Returns the string that results from concatenating one or more arguments. If all arguments are
nonbinary strings, the result is a nonbinary string. If the arguments include any binary strings, the
result is a binary string. A numeric argument is converted to its equivalent nonbinary string form.
Syntax : CONCAT(str1,str2,...)
Example : SELECT CONCAT('w3resource','.','com');
Output : w3resource.com
LTRIM(str)
MySQL LTRIM() removes the leading space characters of a string passed as argument.
Syntax : LTRIM(str)
Example : SELECT LTRIM(' Hello')
Output : Hello ( leading spaces have been exclude)
LEFT()
MySQL LEFT() returns a specified number of characters from the left of a given string. Both the
number and the string are supplied in the arguments as str and len of the function.
Syntax : LEFT(str,len)
Example : SELECT LEFT('w3resource', 3);
Output : w3r
53

RIGHT()
MySQL RIGHT() extracts a specified number of characters from the right side of a given string.
Syntax : RIGHT(str,len)
Example : SELECT RIGHT('w3resource',8);
Output : resource
MID()
MySQL MID() extracts a substring from a string. The actual string, position to start extraction and
length of the extracted string - all are specified as arguments.
Syntax : MID(str,pos,len)
Example : SELECT MID('w3resource',4,3);
Output : eso
REPLACE()
MySQL REPLACE() replaces all the occurrences of a substring within a string.
Syntax : REPLACE(str,from_str,to_str)
Example : SELECT REPLACE('w3resource','ur','r');
Output : w3resorce
REPEAT()
MySQL REPEAT() repeats a string for a specified number of times.
The function returns NULL either any either of the arguments are NULL.
Syntax : REPEAT(str,count)
Example : SELECT REPEAT('**-',5);
Output : **-**-**-**-**-
LOCATE()
MySQL LOCATE() returns the position of the first occurrence of a string within a string. Both of
these strings are passed as arguments. An optional argument may be used to specify from which position of
the string (i.e. string to be searched) searching will start. If this position is not mentioned, searching starts
from the beginning.
Syntax : LOCATE(substr,str,pos);
Example : SELECT LOCATE('st','myteststring');
Output : 5
REVERSE()
Returns a given string with the order of the characters reversed.
Syntax : REVERSE(str)
Example : SELECT REVERSE('w3resource');
Output : ecruoser3w
FIELD()
Returns the index (position) of str in the str1, str2, str3, ... list. Returns 0 if str is not found.
Syntax : FIELD(str,str1,str2,str3,...)
Example : SELECT FIELD('ank', 'b', 'ank', 'of', 'monk');
Output : 2
STRCMP()
MySQL strcmp() function is used to compare two strings. It returns 0 if both of the strings are same
and returns -1 when the first argument is smaller than the second according to the defined order and 1 when
the second one is smaller the first one.
Syntax:
STRCMP (expr1, expr2)
54

Code:
SELECT STRCMP('mytesttext', 'mytesttext');
Sample Output:
mysql> SELECT STRCMP('mytesttext', 'mytesttext');
+------------------------------------+
| STRCMP('mytesttext', 'mytesttext') |
+------------------------------------+
| 0|
+------------------------------------+
1 row in set (0.01 sec)
MySQL STRCMP() function with unmatched strings
SQL Aggregate Functions
o SQL aggregation function is used to perform the calculations on multiple rows of a single column of
a table. It returns a single value.
o It is also used to summarize the data.
Types of SQL Aggregation Function

1. COUNT FUNCTION
o COUNT function is used to Count the number of rows in a database table. It can work on both
numeric and non-numeric data types.
o COUNT function uses the COUNT(*) that returns the count of all the rows in a specified table.
COUNT(*) considers duplicate and Null.
Syntax
COUNT(*) or
COUNT( [ALL|DISTINCT] expression )
Sample table:
PRODUCT_MAST
PRODUCT COMPANY QTY RATE COST

Item1 Com1 2 10 20

Item2 Com2 3 25 75

Item3 Com1 2 30 60

Item4 Com3 5 10 50

Item5 Com2 2 20 40
55

Example: COUNT()
SELECT COUNT(*) FROM PRODUCT_MAST;
2. SUM Function
Sum function is used to calculate the sum of all selected columns. It works on numeric fields only.
Syntax
SUM()
or
SUM( [ALL|DISTINCT] expression )
Example: SUM()
SELECT SUM(COST) FROM PRODUCT_MAST;
3. AVG function
The AVG function is used to calculate the average value of the numeric type. AVG function returns
the average of all non-Null values.
Syntax
AVG()
or
AVG( [ALL|DISTINCT] expression )
Example:
SELECT AVG(COST) FROM PRODUCT_MAST;
4. MAX Function
MAX function is used to find the maximum value of a certain column. This function determines the
largest value of all selected values of a column.
Syntax
MAX()
or
MAX( [ALL|DISTINCT] expression )
Example:
SELECT MAX(RATE) FROM PRODUCT_MAST;
5. MIN Function
MIN function is used to find the minimum value of a certain column. This function determines the
smallest value of all selected values of a column.
Syntax
MIN()
or
MIN( [ALL|DISTINCT] expression )
Example:
SELECT MIN(RATE) FROM PRODUCT_MAST;
MySQL Date Functions
The following table lists the most important built-in date functions in MySQL:
Function Description

NOW() Returns the current date and time

CURDATE() Returns the current date

CURTIME() Returns the current time


56

DATE() Extracts the date part of a date or date/time expression

DATE_ADD() Adds a specified time interval to a date

DATE_SUB() Subtracts a specified time interval from a date

DATEDIFF() Returns the number of days between two dates

DATE_FORMAT() Displays date/time data in different formats

SQL Server Date Functions


The following table lists the most important built-in date functions in SQL .
MySQL: DATE Function

Function Description

GETDATE() Returns the current date and time

DATEPART() Returns a single part of a date/time

DATEADD() Adds or subtracts a specified time interval from a date

DATEDIFF() Returns the time between two dates

CONVERT() Displays date/time data in different formats


Description
The MySQL DATE function extracts the date value from a date or date time expression.
Syntax
The syntax for the DATE function in MySQL is:
DATE( expression )
Parameters or Arguments
Expression : The date or datetime value from which the date should be extracted.
Note
• If expression is not a date or a datetime value, the DATE function will return NULL.
.
Example
mysql> SELECT DATE('2014-02-14');
Result: '2014-02-14'

mysql> SELECT DATE('2014-02-14 18:20:19');


Result: '2014-02-14'

mysql> SELECT DATE('2014-02-15 06:18:01.000001');


Result: '2014-02-15'

mysql> SELECT DATE('The date is 2014-02-14');


57

Result: NULL

mysql> SELECT DATE(NULL);


Result: NULL

BASICS OF FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RATIONAL


DATABASE
(Q)1 Informal Design Guidelines for Relational Databases
1.1Semantics of the Relation Attributes
1.2 Redundant Information in Tuples and Update Anomalies
1.3 Null Values in Tuples
1.4 Spurious Tuples
1.1Semantics of the Relation Attributes
Informally, each tuple in a relation should represent one entity or relationship instance. (Applies to
individual relations and their attributes).
• Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be mixed
in the same relation
• Only foreign keys should be used to refer to other entities
• Entity and relationship attributes should be kept apart as much as possible.

1.2 Redundant Information in Tuples and Update Anomalies


• Mixing attributes of multiple entities may cause problems
• Information is stored redundantly wasting storage
• Problems with update anomalies
– Insertion anomalies
– Deletion anomalies
– Modification anomalies
– Update Anomaly: Changing the name of project number P1 from “ProductX” to “Product Y”
may cause this update to be made for all employees working on project P1.
• Insert Anomaly: Cannot insert a project unless an employee is assigned to .
Inversely - Cannot insert an employee unless an he/she is assigned to a project.
58

• Delete Anomaly: When a project is deleted, it will result in deleting all the employees who
work on that project. Alternately, if an employee is the sole employee on a project, deleting
that employee would result in deleting the corresponding project.
• Attributes that are NULL frequently could be placed in separate relations (with the primary
key)

1.3 Null Values in Tuples


• Reasons for nulls:
• attribute not applicable or invalid
• attribute value unknown (may exist)
• value known to exist, but unavailable
1.4 Spurious Tuples
• Bad designs for a relational database may result in erroneous results for certain JOIN
operations
• The "lossless join" property is used to guarantee meaningful results for join operations
There are two important properties of decompositions:
(a) non-additive or losslessness of the corresponding join
(b) Preservation of the functional dependencies.
• GUIDELINE 1: Design a schema that can be explained easily relation by relation. The
semantics of attributes should be easy to interpret.
• GUIDELINE 2: Design a schema that does not suffer from the insertion, deletion and
update anomalies. If there are any present, then note them so that applications can be
made to take them into account
• GUIDELINE 3: Relations should be designed such that their tuples will have as few
NULL values as possible
• GUIDELINE 4: The relations should be designed to satisfy the lossless join condition.
No spurious tuples should be generated by doing a natural-join of any relations.
2.1 Functional Dependencies
• Definition of FD
• Inference Rules for FDs
• Equivalence of Sets of FDs
• Minimal Sets of FDs

• Definition
59

• Functional dependencies (FDs) are used to specify formal measures of the "goodness" of
relational designs
• FDs and keys are used to define normal forms for relations
• FDs are constraints that are derived from the meaning and interrelationships of the data
attributes
• A set of attributes X functionally determines a set of attributes Y if the value of X
determines a unique value for Y

– X -> Y holds if whenever two tuples have the same value for X, they must have the
same value for Y
– For any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then
t1[Y]=t2[Y]
– X -> Y in R specifies a constraint on all relation instances r(R)
– Written as X -> Y; can be displayed graphically on a relation schema as in Figures. (
denoted by the arrow: ).
– FDs are derived from the real-world constraints on the attributes
• Examples of FD
– social security number determines employee name
SSN -> ENAME
– project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
– employee ssn and project number determines the hours per week that the employee
works on the project
{SSN, PNUMBER} -> HOURS
– An FD is a property of the attributes in the schema R
– The constraint must hold on every relation instance r(R)
– If K is a key of R, then K functionally determines all attributes in R (since we never
have two distinct tuples with t1[K]=t2[K])
Inference Rules for FDs
• Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold
Armstrong's inference rules:
IR1. (Reflexive) If Y subset-of X, then X -> Y
IR2. (Augmentation) If X -> Y, then XZ -> YZ
(Notation: XZ stands for X U Z)
IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z
• IR1, IR2, IR3 form a sound and complete set of inference rules
Some additional inference rules that are useful:
(Decomposition) If X -> YZ, then X -> Y and X -> Z
(Union) If X -> Y and X -> Z, then X -> YZ
(Psuedotransitivity) If X -> Y and WY -> Z, then WX -> Z
• The last three inference rules, as well as any other inference rules, can be deduced from IR1,
IR2, and IR3 (completeness property)
• Closure of a set F of FDs is the set F+ of all FDs that can be inferred from F
• Closure of a set of attributes X with respect to F is the set X + of all attributes that are
functionally determined by X

2.3 Equivalence of Sets of FDs


• Two sets of FDs F and G are equivalent if:
60

- every FD in F can be inferred from G, and


- every FD in G can be inferred from F
• Hence, F and G are equivalent if F + =G +
Definition: F covers G if every FD in G can be inferred from F (i.e., if G + subset-of F +)
• F and G are equivalent if F covers G and G covers F
• There is an algorithm for checking equivalence of sets of FDs

2.4 Minimal Sets of FDs


• A set of FDs is minimal if it satisfies the following conditions:
(1) Every dependency in F has a single attribute for its RHS.
(2) We cannot remove any dependency from F and have a set of dependencies that is equivalent
to F.
(3) We cannot replace any dependency X -> A in F with a dependency Y -> A, where Y proper-
subset-of X ( Y subset-of X) and still have a set of dependencies that is equivalent to F.
(4) Every set of FDs has an equivalent minimal set
(5) There can be several equivalent minimal sets
(6) There is no simple algorithm for computing a minimal set of FDs that is equivalent to a set F
of FDs
(7) To synthesize a set of relations, we assume that we start with a set of dependencies that is a
minimal set (e.g., see algorithms 11.2 and 11.4)
Normal Forms Based on Primary Keys
3.1 Normalization of Relations
• Normalization: The process of decomposing unsatisfactory "bad" relations by breaking up
their attributes into smaller relations
• Normal form: Condition using keys and FDs of a relation to certify whether a relation
schema is in a particular normal form
• 2NF, 3NF, BCNF based on keys and FDs of a relation schema
4NF based on keys, multi-valued dependencies : MVDs; 5NF based on keys, join
Dependencies :
Practical Use of Normal Forms
• Normalization is carried out in practice so that the resulting designs are of high quality and
meet the desirable properties
• The practical utility of these normal forms becomes questionable when the constraints on
which they are based are hard to understand or to detect
• The database designers need not normalize to the highest possible normal form. (usually up
to 3NF, BCNF or 4NF)
• Denormalization: the process of storing the join of higher normal form relations as a base
relation—which is in a lower normal form
Definitions of Keys and Attributes Participating in Keys
• A superkey of a relation schema R = {A1, A2, ...., An} is a set of attributes S subset-of R with
the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]
• A key K is a superkey with the additional property that removal of any attribute from K will
cause K not to be a super key anymore.
• If a relation schema has more than one key, each is called a candidate key. One of the
candidate keys is arbitrarily designated to be the primary key, and the others are called
secondary keys.

• A Prime attribute must be a member of some candidate key


61

• A Nonprime attribute is not a prime attribute—that is, it is not a member of any candidate
key.
Dependencies in DBMS is a relation between two or more attributes. It has the following types
in DBMS −

• Functional Dependency
• Fully-Functional Dependency
• Transitive Dependency
• Multivalued Dependency
• Partial Dependency
Let us start with Functional Dependency −
Functional Dependency
If the information stored in a table can uniquely determine another information in the same
table, then it is called Functional Dependency. Consider it as an association between two attributes
of the same relation.
If P functionally determines Q, then

P -> Q

Let us see an example −


<Employee>

EmpID EmpName EmpAge


E01 Amit 28
E02 Rohit 31

In the above table, EmpName is functionally dependent on EmpID because EmpName can take
only one value for the given value of EmpID:

EmpID -> EmpName

The same is displayed below −


Fully-functionally Dependency
An attribute is fully functional dependent on another attribute, if it is Functionally Dependent
on that attribute and not on any of its proper subset.
For example, an attribute Q is fully functional dependent on another attribute P, if it is
Functionally Dependent on P and not on any of the proper subset of P.
62

Let us see an example −


<ProjectCost>

ProjectID ProjectCost
001 1000
002 5000

<EmployeeProject>

EmpID ProjectID Days (spent on the


project)
E099 001 320
E056 002 190

The above relations states:

EmpID, ProjectID, ProjectCost -> Days

However, it is not fully functional dependent.


Whereas the subset {EmpID, ProjectID} can easily determine the {Days} spent on the project
by the employee.
This summarizes and gives our fully functional dependency −

{EmpID, ProjectID} -> (Days)

Transitive Dependency
When an indirect relationship causes functional dependency it is called Transitive
Dependency.
If P -> Q and Q -> R is true, then P-> R is a transitive dependency.
Multivalued Dependency
When existence of one or more rows in a table implies one or more other rows in the same
table, then the Multi-valued dependencies occur.
If a table has attributes P, Q and R, then Q and R are multi-valued facts of P.
It is represented by double arrow −

->->

For our example:


63

P->->Q
Q->->R

In the above case, Multivalued Dependency exists only if Q and R are independent attributes.
Partial Dependency
Partial Dependency occurs when a nonprime attribute is functionally dependent on part of a
candidate key.
The 2nd Normal Form (2NF) eliminates the Partial Dependency. Let us see an example −
<StudentProject>

StudentID ProjectNo StudentName ProjectName


S01 199 Katie Geo Location
S02 120 Ollie Cluster
Exploration

In the above table, we have partial dependency; let us see how −


The prime key attributes are StudentID and ProjectNo.
As stated, the non-prime attributes i.e. StudentName and ProjectName should be functionally
dependent on part of a candidate key, to be Partial Dependent.
The StudentName can be determined by StudentID that makes the relation Partial
Dependent.
The ProjectName can be determined by ProjectID, which that the relation Partial Dependent.

Types of Functional dependency

1. Trivial functional dependency


o A → B has trivial functional dependency if B is a subset of A.
o The following dependencies are also trivial like: A → A, B → B

Example:
Consider a table with two columns Employee_Id and Employee_Name.
64

{Employee_id, Employee_Name} → Employee_Id is a trivial functional dependency as Employe


e_Id is a subset of {Employee_Id, Employee_Name}.
Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial depende
ncies too.

2. Non-trivial functional dependency


o A → B has a non-trivial functional dependency if B is not a subset of A.
o When an intersection B is NULL, then A → B is called as complete non-trivial.
Example:
ID → Name,
Name → DOB
Normalization
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize the redundancy from a relation or set of relations. It is also used to
eliminate the undesirable characteristics like Insertion, Update and Deletion Anomalies.
o Normalization divides the larger table into the smaller table and links them using relationship.
o The normal form is used to reduce redundancy from the database table.
Types of Normal Forms
There are the four types of normal forms:

Norm Description
al Form

1NF A relation is in 1NF if it contains an atomic value.

2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully
functional dependent on the primary key.

3NF A relation will be in 3NF if it is in 2NF and no transition dependency


exists.

4NF A relation will be in 4NF if it is in Boyce Codd normal form and has no
multi-valued dependency.

5NF A relation is in 5NF if it is in 4NF and not contains any join dependency
and joining should be lossless.
First Normal Form (1NF)
o A relation will be 1NF if it contains an atomic value.
o It states that an attribute of a table cannot hold multiple values. It must hold only single-valued
attribute.
o First normal form disallows the multi-valued attribute, composite attribute, and their combinations.
Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute EMP_PHONE.

EMPLOYEE table:
65

EMP_I EMP_NAME EMP_PHONE EMP_STATE


D

14 John 7272826385, UP
9064738238

20 Harry 8574783832 Bihar

12 Sam 7390372389, Punjab


8589830302
The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_I EMP_NAME EMP_PHONE EMP_STATE
D

14 John 7272826385 UP

14 John 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389 Punjab

12 Sam 8589830302 Punjab


Second Normal Form (2NF)
o In the 2NF, relational must be in 1NF.
o In the second normal form, all non-key attributes are fully functional dependent on the primary key
Example: Let's assume, a school can store the data of teachers and the subjects they teach. In a school,
a teacher can teach more than one subject.
TEACHER table
TEACHER_ID SUBJECT TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38
In the given table, non-prime attribute TEACHER_AGE is dependent on TEACHER_ID which is a
proper subset of a candidate key. That's why it violates the rule for 2NF.

To convert the given table into 2NF, we decompose it into two tables:
66

TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE

25 30

47 35

83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT

25 Chemistry

25 Biology

47 English

83 Math

83 Computer
Third Normal Form (3NF)
o A relation will be in 3NF if it is in 2NF and not contain any transitive partial dependency.
o 3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
o If there is no transitive dependency for non-prime attributes, then the relation must be in third normal
form.
A relation is in third normal form if it holds atleast one of the following conditions for every non-
trivial function dependency X → Y.
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
EMP EMP_NA EMP_ EMP_STA EMP_CI
_ID ME ZIP TE TY

222 Harry 201010 UP Noida

333 Stephan 02228 US Boston

444 Lan 60007 US Chicago

555 Katharine 06389 UK Norwich

666 John 462007 MP Bhopal


Super key in the table above:
{EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}....so on

Candidate key: {EMP_ID}


67

Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP dependent on
EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) transitively dependent on super
key(EMP_ID). It violates the rule of third normal form.
That's why we need to move the EMP_CITY and EMP_STATE to the new
<EMPLOYEE_ZIP> table, with EMP_ZIP as a Primary key.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_ZIP

222 Harry 201010

333 Stephan 02228

444 Lan 60007

555 Katharine 06389

666 John 462007


EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY

201010 UP Noida

02228 US Boston

60007 US Chicago

06389 UK Norwich

462007 MP Bhopal
Boyce Codd normal form (BCNF)
o BCNF is the advance version of 3NF. It is stricter than 3NF.
o A table is in BCNF if every functional dependency X → Y, X is the super key of the table.
o For BCNF, the table should be in 3NF, and for every FD, LHS is super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
EM EMP_COUN EMP_D DEPT_T EMP_DEPT
P_ID TRY EPT YPE _NO

264 India Designin D394 283


g

264 India Testing D394 300

364 UK Stores D283 232

364 UK Developi D283 549


ng

In the above table Functional dependencies are as follows:


68

EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP-ID, EMP-DEPT}
Difference between structure and union in Hindi
Keep Watching
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY

264 India

264 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO

Designing D394 283

Testing D394 300

Stores D283 232

Developing D283 549


EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT

D394 283

D394 300

D283 232

D283 549
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional dependencies is a key.
Fourth normal form (4NF)
o A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency.
o For a dependency A → B, if for a single value of A, multiple values of B exists, then the relation will
be a multi-valued dependency.
Example
STUDENT
STU_ID COURSE HOBBY

21 Computer Dancing
69

21 Math Singing

34 Chemistry Dancing

74 Biology Cricket

59 Physics Hockey
The given STUDENT table is in 3NF, but the COURSE and HOBBY are two independent entity.
Hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, a student with STU_ID, 21 contains two courses, Computer and Math and
two hobbies, Dancing and Singing. So there is a Multi-valued dependency on STU_ID, which leads to
unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
HTML Tutorial
STUDENT_COURSE
STU_ID COURSE

21 Computer

21 Math

34 Chemistry

74 Biology

59 Physics
STUDENT_HOBBY
STU_ID HOBBY

21 Dancing

21 Singing

34 Dancing

74 Cricket

59 Hockey
Fifth normal form (5NF)
o A relation is in 5NF if it is in 4NF and not contains any join dependency and joining should be
lossless.
o 5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid
redundancy.
o 5NF is also known as Project-join normal form (PJ/NF).

Example
SUBJECT LECTURER SEMESTER

Computer Anshika Semester 1


70

Computer John Semester 1

Math John Semester 1

Math Akash Semester 2

Chemistry Praveen Semester 1


In the above table, John takes both Computer and Math class for Semester 1 but he doesn't take Math
class for Semester 2. In this case, combination of all these fields required to identify a valid data.
Suppose we add a new Semester as Semester 3 but do not know about the subject and who will be
taking that subject so we leave Lecturer and Subject as NULL. But all three columns together acts as a
primary key, so we can't leave other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
OOPs Concepts in Java
SEMESTER SUBJECT

Semester 1 Computer

Semester 1 Math

Semester 1 Chemistry

Semester 2 Math
P2
SUBJECT LECTURER

Computer Anshika

Computer John

Math John

Math Akash

Chemistry Praveen

P3
SEMSTER LECTURER

Semester 1 Anshika

Semester 1 John
71

Semester 1 John

Semester 2 Akash

Semester 1 Praveen
72

UNIT-4
Introduction to transaction processing
Transaction processing
• The transaction is a set of logically related operation. It contains a group of tasks.
• A transaction is an action or series of actions. It is performed by a single user to perform
operations for accessing the contents of the database.

Example: Suppose an employee of bank transfers Rs.800 from X's account to Y's account.
This small transaction contains several low-level tasks:

X's Account :

1. Open_Account(X)
2. Old_Balance = X.balance
3. New_Balance = Old_Balance - 800
4. X.balance = New_Balance
5. Close_Account(X)
Y's Account
1. Open_Account(Y)
2. Old_Balance = Y.balance
3. New_Balance = Old_Balance + 800
4. Y.balance = New_Balance
5. Close_Account(Y)
Operations of Transaction:
Following are the main operations of transaction : in SQL
Read(X): Read operation is used to read the value of X from the database and stores it in a
buffer in main memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
Let's take an example to debit transaction from an account which consists of following
operations:
1. R(X); // Read
2. X = X - 500; //Update
3. W(X); // Write
Commit: It is used to save the work done permanently.

WHY WE NEED TRANSACTIONS


 A database is a shared resource accessed by many users and processes concurrently.
 Not managing this concurrent access to a shared resource will cause problems (not unlike in
operating systems)
73

• Problems due to concurrency


• Problems due to failures

TRANSACTION NOTATION

Transaction and System Concepts


In this section we discuss additional concepts relevant to transaction processing.
(1). Transaction States and Additional Operations
o A transaction is an atomic unit of work that should either be completed in its entirety
or not done at all.
o For recovery purposes, the system needs to keep track of when each transaction starts,
terminates, and commits or aborts.
o Therefore, the recovery manager of the DBMS needs to keep track of the following
operations:
• BEGIN_TRANSACTION. This marks the beginning of transaction execution.
• READ or WRITE. These specify read or write operations on the database items that are
executed as part of a transaction.
• END_TRANSACTION. This specifies that READ and WRITE transaction
operations have ended and marks the end of transaction execution. However, at this point it
may be necessary to check whether the changes introduced by the transaction can be
permanently applied to the database (committed) or whether the transaction has to be aborted
because it violates serializability .
• COMMIT_TRANSACTION. This signals a successful end of the transaction so that any
changes (updates) executed by the transaction can be safely committed to the database and
will not be undone.
74

• ROLLBACK (or ABORT). This signals that the transaction has ended unsuccessfully, so
that any changes or effects that the transaction may have applied to the database must
be undone.

• Figure 21.4 shows a state transition diagram that illustrates how a transaction moves through
its execution states.
• A transaction goes into an active state immediately after it starts execution, where it can
execute its READ and WRITE operations.
• When the transaction ends, it moves to the partially committed state.
• At this point, some recovery protocols need to ensure that a system failure will not result in an
inability to record the changes of the transaction permanently (usually by recording changes in
the system log, discussed in the next section).
• Once this check is successful, the transaction is said to have reached its commit point and
enters the committed state.
• When a transaction is committed, it has concluded its execution successfully and all its
changes must be recorded permanently in the database, even if a system failure occurs.
• However, a transaction can go to the failed state if one of the checks fails or if the transaction
is aborted during its active state.
• The transaction may then have to be rolled back to undo the effect of its WRITE operations on
the database.
• The terminated state corresponds to the transaction leaving the system.
• The transaction information that is maintained in system tables while the transaction has been
running is removed when the transaction terminates.
• Failed or aborted transactions may be restarted later—either automatically or after being
resubmitted by the user—as brand new transactions.
(2).The System Log

• To be able to recover from failures that affect transactions, the system maintains
a log to keep track of all transaction operations that affect the values of database items,
as well as other transaction information that may be needed to permit recovery from
failures.
• The log is a sequential, append-only file that is kept on disk, so it is not affected by any
type of failure except for disk or catastrophic failure.
75

• Typically, one (or more) main memory buffers hold the last part of the log file, so that
log entries are first added to the main memory buffer.
• When the log buffer is filled, or when certain other conditions occur, the log buffer
is appended to the end of the log file on disk. In addition, the log file from disk is
periodically backed up to archival storage (tape) to guard against catastrophic failures.
• The following are the types of entries—called log records—that are written to the log
file and the corresponding action for each log record.
• In these entries, T refers to a unique transaction-id that is generated automatically by
the system for each transaction and that is used to identify each transaction:

• [start_ transaction, T]. Indicates that transaction T has started execution.


• [Writ_ item, T, X, old _value, new _value]. Indicates that transaction T has changed the
value of database item X from old_ value to new_ value.
• [Read_ item, T, X]. Indicates that transaction T has read the value of database item X.
• [Commit, T]. Indicates that transaction T has completed successfully, and affirms that
its effect can be committed (recorded permanently) to the data-base.
• [Abort, T]. Indicates that transaction T has been aborted.

• Protocols for recovery that avoid cascading rollbacks (see Section 21.4.2)—which
include nearly all practical protocols—do not require that READ operations are writ-ten
to the system log. However, if the log is also used for other purposes—such as auditing
(keeping track of all database operations)—then such entries can be included.
Additionally, some recovery protocols require simpler WRITE entries only include one
of new_value and old_value instead of including both
(3). Commit Point of a Transaction
• A transaction T reaches its commit point when all its operations that access the database
have been executed successfully and the effect of all the transaction operations on the
database have been recorded in the log.
• Beyond the commit point, the transaction is said to be committed, and its effect must
be permanently recorded in the database.
• The transaction then writes a commit record [commit, T] into the log.
• If a system failure occurs, we can search back in the log for all transactions T that have written
a [start_transaction, T] record into the log but have not written their [commit, T] record yet;
these transactions may have to be rolled back to undo their effect on the database during the
recovery process.
• Transactions that have written their commit record in the log must also have recorded all
their WRITE operations in the log, so their effect on the database can be redone from the log
records.
• Notice that the log file must be kept on disk, updating a disk file involves copying the
appropriate block of the file from disk to a buffer in main memory, updating the buffer in
main memory, and copying the buffer to disk.
• It is common to keep one or more blocks of the log file in main memory buffers, called
the log buffer, until they are filled with log entries and then to write them back to disk only
once, rather than writing to disk every time a log entry is added.
76

• This saves the overhead of multiple disk writes of the same log file buffer.
• At the time of a system crash, only the log entries that have been written back to disk are
considered in the recovery process because the contents of main memory may be lost.
• Hence, before a transaction reaches its commit point, any portion of the log that has not been
written to the disk yet must now be written to the disk.
• This process is called force-writing the log buffer before committing a transaction.
Desirable Properties of Transactions
• Transactions should possess several properties, often called the ACID properties; they should
be enforced by the concurrency control and recovery methods of the DBMS. The following
are the ACID properties:
• Atomicity. A transaction is an atomic unit of processing; it should either be performed in its
entirety or not performed at all.
o The atomicity property requires that we execute a transaction to completion.
o It is the responsibility of the transaction recovery subsystem of a DBMS to ensure
atomicity.
o If a transaction fails to complete for some reason, such as a system crash .
o On the other hand, write operations of a committed transaction must be eventually
written to disk.
• Consistency preservation. A transaction should be consistency preserving, meaning that if it
is completely executed from beginning to end without interference from other transactions, it
should take the database from one consistent state to another.
o The preservation of consistency is generally considered to be the responsibility of the
programmers who write the database programs or of the DBMS module that enforces
integrity constraints.
o Recall that a database state is a collection of all the stored data items (values) in the
database at a given point in time.
o A consistent state of the database satisfies the constraints specified in the schema as
well as any other constraints on the database that should hold.
o A database program should be written in a way that guarantees that, if the database is
in a consistent state before executing the transaction, it will be in a consistent state
after the complete execution of the transaction, assuming that no interference with
other transactions occurs.
• Isolation. A transaction should appear as though it is being executed in isolation from other
transactions, even though many transactions are executing concurrently. That is, the execution
of a transaction should not be interfered with by any other transactions executing
concurrently.
o The isolation property is enforced by the concurrency control subsystem of the DBMS.
o If every transaction does not make its updates (write operations) visible to other
transactions until it is committed, one form of isolation is enforced that solves the
temporary update problem and eliminates cascading rollbacks, but does not eliminate all
other problems.
o There have been attempts to define the level of isolation of a transaction.
o A transaction is said to have level 0 (zero) isolation if it does not overwrite the dirty reads
of higher-level transactions.
77

o Level 1 (one) isolation has no lost updates, and level 2 isolation has no lost updates and
no dirty reads.
o Finally, level 3 isolation (also called true isolation) has, in addition to level 2 properties,
repeatable reads.
• Durability or permanency. The changes applied to the database by a com-mitted transaction
must persist in the database.
o These changes must not be lost because of any failure.
o And last, the durability property is the responsibility of the recovery subsystem of the
DBMS.

Characterizing Schedules Based on Serializability


• Characterize the types of schedules that are always considered to be correct when concurrent
transactions are executing. Such schedules are known as serializable schedules.
• In DBMS, schedules may be classified as-


We have discussed-
• A schedule is the order in which the operations of multiple transactions appear for execution.
• Serial schedules are always consistent.
78

• Non-serial schedules are not always consistent.

Serializability in DBMS-

• Some non-serial schedules may lead to inconsistency of the database.


• Serializability is a concept that helps to identify which non-serial schedules are correct and will
maintain the consistency of the database.

Serializable Schedules-
If a given non-serial schedule of ‘n’ transactions is equivalent to some serial schedule of ‘n’
transactions, then it is called as a serializable schedule.

Characteristics-
Serializable schedules behave exactly same as serial schedules.
Thus, serializable schedules are always-
• Consistent
• Recoverable
• Casacadeless
• Strict

Serial Schedules Vs Serializable Schedules-

Serial Schedules Serializable Schedules

No concurrency is allowed.
Concurrency is allowed.
Thus, all the transactions
Thus, multiple transactions can
necessarily execute serially one after the
execute concurrently.
other.

Serial schedules lead to less Serializable schedules improve both


resource utilization and CPU throughput. resource utilization and CPU throughput.

Serial Schedules are less efficient as Serializable Schedules are always


compared to serializable schedules. better than serial schedules.
(due to above reason) (due to above reason)

Types of Serializability- Serializability is mainly of two types-

1. Conflict Serializability
1.2. View Serializability
79

Conflict Serializability-
If a given non-serial schedule can be converted into a serial schedule by swapping its non-
conflicting operations, then it is called as a conflict serializable schedule.
Conflicting Operations-
Two operations are called as conflicting operations if all the following conditions hold true
for them-
• Both the operations belong to different transactions
• Both the operations are on the same data item
• At least one of the two operations is a write operation

Example-
Consider the following schedule-

In this schedule,
• W1 (A) and R2 (A) are called as conflicting operations.
• This is because all the above conditions hold true for them.

Checking Whether a Schedule is Conflict Serializable Or Not-


Follow the following steps to check whether a given non-serial schedule is conflict
serializable or not-

Step-01: Find and list all the conflicting operations.

Step-02: Start creating a precedence graph by drawing one node for each transaction.

Step-03:
• Draw an edge for each conflict pair such that if Xi (V) and Yj (V) forms a conflict pair then draw
an edge from Ti to Tj.
• This ensures that Ti gets executed before Tj.

Step-04:
Check if there is any cycle formed in the graph.
• If there is no cycle found, then the schedule is conflict serializable otherwise not.

Characterizing Schedules Based on Recoverability

Non-Serializable Schedules-
• A non-serial schedule which is not serializable is called as a non-serializable schedule.
80

• A non-serializable schedule is not guaranteed to produce the the same effect as produced
by some serial schedule on any consistent database.

Characteristics-
Non-serializable schedules-
• may or may not be consistent
• may or may not be recoverable
Irrecoverable Schedules-
If in a schedule,
• A transaction performs a dirty read operation from an uncommitted transaction
• And commits before the transaction from which it has read the value
then such a schedule is known as an Irrecoverable Schedule.
Example- Consider the following schedule-

Here,
•T2 performs a dirty read operation.
• T2 commits before T1.
• T1 fails later and roll backs.
• The value that T2 rea Characterizing Schedules Based on Recoverabilityd
now stands to be incorrect.
• T2 can not recover since it has already committed.
Recoverable Schedules-
81

If in a schedule,
• A transaction performs a dirty read operation from an uncommitted transaction
• And its commit operation is delayed till the uncommitted transaction either commits or roll backs
then such a schedule is known as a Recoverable Schedule.
Here,
• The commit operation of the transaction that performs the dirty read is delayed.
• This ensures that it still has a chance to recover if the uncommitted transaction fails later.
Example- Consider the following schedule-

Here,

T2 performs a dirty read operation.
• The commit operation of T2 is delayed till T1 commits or roll backs.
• T1 commits later.
• T2 is now allowed to commit.
• In case, T1 would have failed, T2 has a chance to recover by rolling back.
Checking Whether a Schedule is Recoverable or Irrecoverable-
Method-01:
Check whether the given schedule is conflict serializable or not.
• If the given schedule is conflict serializable, then it is surely recoverable. Stop and report your
answer.
• If the given schedule is not conflict serializable, then it may or may not be recoverable. Go and
check using other methods.
82

Thumb Rules
• All conflict serializable schedules are recoverable.
• All recoverable schedules may or may not be conflict serializable.

Method-02:
Check if there exists any dirty read operation.
(Reading from an uncommitted transaction is called as a dirty read)
• If there does not exist any dirty read operation, then the schedule is surely recoverable. Stop and
report your answer.
• If there exists any dirty read operation, then the schedule may or may not be recoverable.
If there exists a dirty read operation, then follow the following cases-
Case-01:
If the commit operation of the transaction performing the dirty read occurs before the commit
or abort operation of the transaction which updated the value, then the schedule is irrecoverable.
Case-02:
If the commit operation of the transaction performing the dirty read is delayed till the commit
or abort operation of the transaction which updated the value, then the schedule is recoverable.

Chapter 2:Concurrency Control Techniques


Concurrency Control in DBMS

Concurrency Control deals with interleaved execution of more than one transaction. In the
next article, we will see what is serializability and how to find whether a schedule is serializable or
not.

What is Transaction?

A set of logically related operations is known as transaction. The main operations of a


transaction are:
Read(A): Read operations Read(A) or R(A) reads the value of A from the database and stores
it in a buffer in main memory.
Write (A): Write operation Write(A) or W(A) writes the value back to the database from
buffer.
Let us take a debit transaction from an account which consists of following operations:
1. R(A);
2. A=A-1000;
3. W(A);
Assume A’s value before starting of transaction is 5000.
83

• The first operation reads the value of A from database and stores it in a buffer.
• Second operation will decrease its value by 1000. So buffer will contain 4000.
• Third operation will write the value from buffer to database. So A’s final value will be
4000.

But it may also be possible that transaction may fail after executing some of its operations.
The failure can be because of hardware, software or power etc. For example, if debit transaction
discussed above fails after executing operation 2, the value of A will remain 5000 in the database
which is not acceptable by the bank. To avoid this, Database has two important operations:
Commit: After all instructions of a transaction are successfully executed, the changes made
by transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the changes
made by transaction are undone.
Properties of a transaction
Atomicity: As a transaction is set of logically related operations, either all of them should
be executed or none. A debit transaction discussed above should either execute all three operations
or none.If debit transaction fails after executing operation 1 and 2 then its new value 4000 will not
be updated in the database which leads to inconsistency.
Consistency: If operations of debit and credit transactions on same account are executed
concurrently, it may leave database in an inconsistent state.
For Example, T1 (debit of Rs. 1000 from A) and T2 (credit of 500 to A) executing
concurrently, the database reaches inconsistent state.

T T1’s buffer T2’s Buffer Dat


1 space T2 Space abase

A=5
000

R A=5
(A); A=5000 000

R(A A=5
A=5000 ); A=5000 000

A
=A- A=5
1000; A=4000 A=5000 000
84

A=A
A=4000 +500; A=5500

W A=4
(A); A=5500 000

W( A=5
A); 500

• Let us assume Account balance of A is Rs. 5000. T1 reads A(5000) and stores the value in its
local buffer space. Then T2 reads A(5000) and also stores the value in its local buffer space.
• T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1 buffer space. Then T2
performs A=A+500 (5000+500=5500) and 5500 is stored in T2 buffer space. T1 writes the value
from its buffer back to database.
• A’s value is updated to 4000 in database and then T2 writes the value from its buffer back to
database. A’s value is updated to 5500 which shows that the effect of debit transaction is lost and
database has become inconsistent.
• To maintain consistency of database, we need concurrency control protocols which will be
discussed in next article. The operations of T1 and T2 with their buffers and database have been
shown in Table 1.

What is a Schedule?
A schedule is a series of operations from one or more transactions. A schedule can be of two
types:
• Serial Schedule: When one transaction completely executes before starting another
transaction, the schedule is called serial schedule. A serial schedule is always consistent. e.g.;
If a schedule S has debit transaction T1 and credit transaction T2, possible serial schedules
are T1 followed by T2 (T1->T2) or T2 followed by T1 ((T2->T1).
• Concurrent Schedule: When operations of a transaction are interleaved with operations of
other transactions of a schedule, the schedule is called Concurrent schedule. e.g.; Schedule of
debit and credit transaction shown in Table 1 is concurrent in nature. But concurrency can
lead to inconsistency in the database. The above example of a concurrent schedule is also
inconsistent.
Question: Consider the following transaction involving two bank accounts x and y:

1. read(x);
2. x := x – 50;
3. write(x);
4. read(y);
5. y := y + 50;
6. write(y);
The constraint that the sum of the accounts x and y should remain constant is that of?
85

1. Atomicity
2. Consistency
3. Isolation
4. Durability
Lock Based Concurrency Control Protocol in DBMS
A Database may provide a mechanism that ensures that the schedules are either conflict or
view serializable and recoverable (also preferably cascadeless). Testing for a schedule for
Serializability after it has executed is obviously too late!
So we need Concurrency Control Protocols that ensures Serializability .
Concurrency-control protocols:
• Allow concurrent schedules, but ensure that the schedules are conflict/view serializable, and
are recoverable and maybe even cascade less.
• These protocols do not examine the precedence graph as it is being created, instead a
protocol imposes a discipline that avoids non-seralizable schedules.
• Different concurrency control protocols provide different advantages between the amount of
concurrency they allow and the amount of overhead that they impose.
• We’ll be learning some protocols which are important for GATE CS. Questions from this
topic is frequently asked and it’s recommended to learn this concept.
Different categories of protocols:
• Lock Based Protocol
▪ Basic 2-PL
▪ Conservative 2-PL
▪ Strict 2-PL
▪ Rigorous 2-PL
• Graph Based Protocol
• Time-Stamp Ordering Protocol
• Multi-version Protocol

Lock Based Protocols –

• A lock is a variable associated with a data item that describes a status of data item with
respect to possible operation that can be applied to it.
• They synchronize the access by concurrent transactions to the database items.
• It is required in this protocol that all the data items must be accessed in a mutually exclusive
manner.
• Let me introduce you to two common locks which are used and some terminology followed
in this protocol.
• A lock is a mechanism to control concurrent access to a data item
• Data items can be locked in two modes :
• Exclusive (X) mode. Data item can be both read as well as
• Written. X-lock is requested using lock-X instruction.
• Shared (S) mode. Data item can only be read. S-lock is
• Requested using lock-S instruction.
• Lock requests are made to the concurrency-control manager by the programmer. Transaction
can proceed only after request is granted.
86

Lock-compatibility matrix
• A transaction may be granted a lock on an item if the requested lock is compatible with locks
already held on the item by other transactions
• Any number of transactions can hold shared locks on an item,
o But if any transaction holds an exclusive on the item no other transaction may hold
any lock on the item.

o
• If a lock cannot be granted, the requesting transaction is made to wait till all incompatible
locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)

A locking protocol is a set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules
• This protocol ensures conflict-serializable schedules.
• Phase 1: Growing Phase
▪ Transaction may obtain locks
▪ Transaction may not release locks
• Phase 2: Shrinking Phase
87

▪ Transaction may release locks


▪ Transaction may not obtain locks
• The protocol assures serializability. It can be proved that the transactions can be serialized in
the order of their lock points (i.e., the point where a transaction acquired its final lock).
• There can be conflict serializable schedules that cannot be obtained if two-phase locking is
used.
• Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj that
uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.


• Two-phase locking with lock conversions:
• – First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
• – Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
• This protocol assures serializability. But still relies on the programmer to insert the various
locking instructions.
1. Shared Lock (S): also known as Read-only lock. As the name suggests it can be shared between
transactions because while holding this lock the transaction does not have the permission to
update data on the data item. S-lock is requested using lock-S instruction.
2. Exclusive Lock (X): Data item can be both read as well as written.This is Exclusive and cannot
be held simultaneously on the same data item. X-lock is requested using lock-X instruction.
3. Upgrade / Downgrade locks : A transaction that holds a lock on an item A is allowed under
certain condition to change the lock state from one state to another.
Upgrade: A S(A) can be upgraded to X(A) if Ti is the only transaction holding the S-lock on
element A.
Downgrade: We may downgrade X(A) to S(A) when we feel that we no longer want to write on
data-item A. As we were holding X-lock on A, we need not check any conditions.
Shortly we’ll use 2-Phase Locking (2-PL) which will use the concept of Locks to avoid
deadlock. So, applying simple locking, we may not always produce Serializable results, it may
lead to Deadlock Inconsistency.
4. Deadlock – consider the above execution phase. Now, T1 holds an Exclusive lock over B,
and T2 holds a Shared lock over A. Consider Statement 7, T2 requests for lock on B, while in
88

Statement 8 T1 requests lock on A. This as you may notice imposes a Deadlock as none can
proceed with their execution.
5. Starvation – is also possible if concurrency control manager is badly designed. For example: A
transaction may be waiting for an X-lock on an item, while a sequence of other transactions
request and are granted an S-lock on the same item. This may be avoided if the concurrency
control manager is properly designed.
Problem With Simple Locking…: Consider the Partial Schedule:
T1 T2

1 lock-X(B)

2 read(B)

3 B:=B-50

4 write(B)

5 lock-S(A)

6 read(A)

7 lock-S(B)

8 lock-X(A)

9 …… ……

Deadlocks
A deadlock is a condition where two or more transactions are waiting indefinitely for one
another to give up locks. Deadlock is said to be one of the most feared complications in DBMS as
no task ever gets finished and is in waiting state forever.
For example:
In the student table, transaction T1 holds a lock on some rows and needs to update some rows
in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and
needs to update the rows in the Student table held by Transaction T1.
Now, the main problem arises. Now Transaction T1 is waiting for T2 to release its lock and
similarly, transaction T2 is waiting for T1 to release its lock.
89

All activities come to a halt state and remain at a standstill. It will remain in a standstill until
the DBMS detects the deadlock and aborts one of the transactions.

• Consider the partial schedule

• Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release
its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
• Such a situation is called a deadlock.

Deadlock Avoidance
o When a database is stuck in a deadlock state, then it is better to avoid the database rather than
aborting or restating the database. This is a waste of time and resource.
o Deadlock avoidance mechanism is used to detect any deadlock situation in advance. A
method like "wait for graph" is used for detecting the deadlock situation but this method is
suitable only for the smaller database. For the larger database, deadlock prevention method
can be used.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, then the DBMS should
detect whether the transaction is involved in a deadlock or not. The lock manager maintains a Wait
for the graph to detect the deadlock cycle in the database.
• Two-phase locking does not ensure freedom from deadlocks.
90

• In addition to deadlocks, there is a possibility of starvation.


• Starvation occurs if the concurrency control manager is badly designed. For example:
o A transaction may be waiting for an X-lock on an item, while a sequence of other
transactions request and are granted an S-lock on the same item.
o The same transaction is repeatedly rolled back due to deadlocks.
• Concurrency control manager can be designed to prevent starvation.
• When a deadlock occurs there is a possibility of cascading roll-backs.
• Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified
protocol called strict two-phase locking -- a transaction must hold all its exclusive locks till
it commits/aborts.
• Rigorous two-phase locking is even stricter. Here, all locks are held till commit/abort. In
this protocol transactions can be serialized in the order in which they commit.
Implementation of Locking
▪ A lock manager can be implemented as a separate process to which transactions send lock
and unlock requests
▪ The lock manager replies to a lock request by sending a lock grant messages (or a message
asking the transaction to roll back, in case of a deadlock)
▪ The requesting transaction waits until its request is answered
▪ The lock manager maintains a data-structure called a lock table to record granted locks and
pending requests
▪ The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
Deadlock Handling
• System is deadlocked if there is a set of transactions such that every transaction in the set is
waiting for another transaction in the set.
• Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
Some prevention strategies :
• Require that each transaction locks all its data items before it begins execution (pre
declaration).
• Impose partial ordering of all data items and require that a transaction can lock data items
only in the order specified by the partial order.
Timeout-Based Schemes:
o A transaction waits for a lock only for a specified amount of time. If the lock has not
been granted within that time, the transaction is rolled back and restarted,
o Thus, deadlocks are not possible
o Simple to implement; but starvation is possible. Also difficult to determine good value
of the timeout interval.
• Deadlocks can be described as a wait-for graph, which consists of a pair G = (V,E),
o V is a set of vertices (all the transactions in the system)
o E is a set of edges; each element is an ordered pair Ti →Tj.
The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a
deadlock-detection algorithm periodically to look for cycles
91

Wait-for graph without a cycle Wait-for graph with a cycle


Deadlock Recovery :When deadlock is detected :

• Some transaction will have to rolled back (made a victim) to break deadlock. Select that
transaction as victim that will incur minimum cost.
o Rollback -- determine how far to roll back transaction
o Total rollback: Abort the transaction and then restart it.
• More effective to roll back transaction only as far as necessary to break deadlock.
• Starvation happens if same transaction is always chosen as victim. Include the number of
rollbacks in the cost factor to avoid starvation.

Timestamp Ordering Protocol


o The Timestamp Ordering Protocol is used to order the transactions based on their
Timestamps. The order of transaction is nothing but the ascending order of the transaction
creation.
o The priority of the older transaction is higher that's why it executes first. To determine the
timestamp of the transaction, this protocol uses system time or logical counter.
o The lock-based protocol is used to manage the order between conflicting pairs among
transactions at the execution time. But Timestamp based protocols start working as soon as a
transaction is created.
o Let's assume there are two transactions T1 and T2. Suppose the transaction T1 has entered
the system at 007 times and transaction T2 has entered the system at 009 times. T1 has the
higher priority, so it executes first as it is entered the system first.
o The timestamp ordering protocol also maintains the timestamp of last 'read' and 'write'
operation on a data.
• The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
• In order to assure such behavior, the protocol maintains for each data Q two timestamp
values:
o W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q)
successfully.
o R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q)
successfully.
• The timestamp ordering protocol ensures that any conflicting read and write operations are
executed in timestamp order.
92

Basic Timestamp ordering protocol works as follows:


1. Check the following condition whenever a transaction Ti issues a Read (X) operation:
o If W_TS(X) >TS(Ti) then the operation is rejected.
o If W_TS(X) <= TS(Ti) then the operation is executed.
o Timestamps of all the data items are updated.
2. Check the following condition whenever a transaction Ti issues a Write(X) operation:
o If TS(Ti) < R_TS(X) then the operation is rejected.
o If TS(Ti) < W_TS(X) then the operation is rejected and Ti is rolled back otherwise the operation is
executed.
Where,Skip Ad
TS(TI) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the Read time-stamp of data-item X.
W_TS(X) denotes the Write time-stamp of data-item X.
• EXAMPLE : A partial schedule for several data items for transactions with timestamps
1, 2, 3, 4, 5

• The timestamp-ordering protocol guarantees serializability since all the arcs in the
precedence graph are of the form:

• Thus, there will be no cycles in the precedence graph


• Timestamp protocol ensures freedom from deadlock as no transaction ever waits.
• But the schedule may not be cascade-free, and may not even be recoverable.
Recoverability and Cascade Freedom
• Problem with timestamp-ordering protocol:
o Suppose Ti aborts, but Tj has read a data item written by Ti
o Then Tj must abort; if Tj had been allowed to commit earlier, the schedule is not
recoverable.
93

o Further, any transaction that has read a data item written by Tj must abort
o This can lead to cascading rollback --- that is, a chain of rollbacks
• Solution 1:
o A transaction is structured such that its writes are all performed at the end of its
processing
o All writes of a transaction form an atomic action; no transaction may execute while a
transaction is being written
o A transaction that aborts is restarted with a new timestamp
• Solution 2: Limited form of locking: wait for data to be committed before reading it
• Solution 3: Use commit dependencies to ensure recoverability
Thomas’ Write Rule
• Modified version of the timestamp-ordering protocol in which obsolete write operations
may be ignored under certain circumstances.
• When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to
write an obsolete value of {Q}.
o Rather than rolling back Ti as the timestamp ordering protocol would have done, this
{write} operation can be ignored.
• Otherwise this protocol is the same as the timestamp ordering protocol.
• Thomas' Write Rule allows greater potential concurrency.
o Allows some view-serializable schedules that are not conflict-serializable.
Validation-Based Protocol
• Execution of transaction Ti is done in three phases.
o Read and execution phase: Transaction Ti writes only to
▪ temporary local variables
o Validation phase: Transaction Ti performs a ''validation test''
▪ to determine if local variables can be written without violating
▪ serializability.
o Write phase: If Ti is validated, the updates are applied to the
▪ database; otherwise, Ti is rolled back.
• The three phases of concurrently executing transactions can be interleaved, but each
transaction must go through the three phases in that order.
o Assume for simplicity that the validation and write phase occur together, atomically
and serially
▪ I.e., only one transaction executes validation/write at a time.
• Also called as optimistic concurrency control since transaction executes fully in the hope
that all will go well during validation
• Each transaction Ti has 3 timestamps
o Start(Ti) : the time when Ti started its execution
o Validation(Ti): the time when Ti entered its validation phase
o Finish(Ti) : the time when Ti finished its write phase
94

• Serializability order is determined by timestamp given at validation time; this is done to


increase concurrency.
o Thus, TS(Ti) is given the value of Validation(Ti).
• This protocol is useful and gives greater degree of concurrency if probability of conflicts is
low.
o because the serializability order is not pre-decided, and
o relatively few transactions will have to be rolled back.

Example: of schedule produced using validation

Multi-Version Concurrency Control Techniques


Multiversion concurrency control (MCC or MVCC), is a concurrency control method
commonly used by database management systems to provide concurrent access to the
database and in programming languages to implement transactional memory.
• Unlike most other database systems which use locks for concurrency
conntrol, Postgres maintains data consistency by using a multiversion model.
• This means that while querying a database each transaction sees a snapshot of data
(a database version) as it was some time ago, regardless of the current state of the
underlying data.
• This protects the transaction from viewing inconsistent data that could be caused by
(other) concurrent transaction updates on the same data rows, providing transaction
isolation for each database session.
• The main difference between multiversion and lock models is that in MVCC locks acquired
for querying (reading) data don't conflict with locks acquired for writing data and so
reading never blocks writing and writing never blocks reading.
• Multiversion schemes keep old versions of data item to increase concurrency.
• Multiversion Timestamp Ordering
• Multiversion Two-Phase Locking
• Each successful write results in the creation of a new version of the data item written.
• Use timestamps to label versions.
• When a read(Q) operation is issued, select an appropriate version of Q based on the
timestamp of the transaction, and return the value of the selected version.
• reads never have to wait as an appropriate version is returned immediately.

Multiversion Timestamp Ordering


95

• Each data item Q has a sequence of versions <Q1, Q2,...., Qm>. Each version Qk contains three
data fields:
• Content -- the value of version Qk.
• W-timestamp(Qk) -- timestamp of the transaction that created (wrote) version Qk
• R-timestamp(Qk) -- largest timestamp of a transaction that successfully read version
Qk
• When a transaction Ti creates a new version Qk of Q, Qk's W-timestamp and R-timestamp are
initialized to TS(Ti).
• R-timestamp of Qk is updated whenever a transaction Tj reads Qk, and TS(Tj) > R-
timestamp(Qk).
• Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the
version of Q whose write timestamp is the largest write timestamp less than or equal to
TS(Ti).
• If transaction Ti issues a read(Q), then the value returned is the content of version Qk.
• If transaction Ti issues a write(Q)
• if TS(Ti) < R-timestamp(Qk), then transaction Ti is rolled back.
• if TS(Ti) = W-timestamp(Qk), the contents of Qk are overwritten
• else a new version of Q is created.
• Observe that
• Reads always succeed
• A write by Ti is rejected if some other transaction Tj that (in the serialization order
defined by the timestamp values) should read
Ti's write, has already read a version created by a transaction older than Ti.
• Protocol guarantees serializability
• Differentiates between read-only transactions and update transactions
• Update transactions acquire read and write locks, and hold all locks up to the end of the
transaction. That is, update transactions follow rigorous two-phase locking.
• Each successful write results in the creation of a new version of the data item written.
• Each version of a data item has a single timestamp whose value is obtained from a
counter ts-counter that is incremented during commit processing.
• Read-only transactions are assigned a timestamp by reading the current value of ts-counter
before they start execution;
They follow the multiversion timestamp-ordering protocol for performing reads.
• When an update transaction wants to read a data item:
• it obtains a shared lock on it, and reads the latest version.
• When it wants to write an item
• it obtains X lock on; it then creates a new version of the item and sets this version's
timestamp to .
• When update transaction Ti completes, commit processing occurs:
• Ti sets timestamp on the versions it has created to ts-counter + 1
• Ti increments ts-counter by 1
• Read-only transactions that start after Ti increments ts-counter will see the values updated
by Ti.
• Read-only transactions that start before Ti increments the ts-counter will see the value before
the updates by Ti.
96

• Only serializable schedules are produced.

MVCC: Implementation Issues

• Creation of multiple versions increases storage overhead


• Extra tuples
• Extra space in each tuple for storing version information
• Versions can, however, be garbage collected
• E.g. if Q has two versions Q5 and Q9, and the oldest active transaction has timestamp
> 9, than Q5 will never be required again
97

UNIT-5
Disk storage basic file structures and hashing
Disk Storage Devices
Databases are stored as files of records stored on disks - Physical database file structures –
Physical levels of three schema architecture
The collection of data in a DB must be stored on some storage medium.
The DBMS software can retrieve, update, and process this data as needed -

◼ Preferred secondary storage device for high storage capacity and low cost.
◼ Data stored as magnetized areas on magnetic disk surfaces.
◼ A disk pack contains several magnetic disks connected to a rotating spindle.
◼ Disks are divided into concentric circular tracks on each disk surface.
Track capacities vary typically from 4 to 50 Kbytes or more
◼ A track is divided into smaller blocks or sectors
◼ because it usually contains a large amount of information
◼ The division of a track into sectors is hard-coded on the disk surface and cannot be changed.
◼ One type of sector organization calls a portion of a track that subtends a fixed angle at the
center as a sector.
◼ A track is divided into blocks.
◼ The block size B is fixed for each system.
◼ Typical block sizes range from B=512 bytes to B=4096 bytes.
◼ Whole blocks are transferred between disk and main memory for processing.
◼ Storage media forms a hierarchy

◼ A read-write head moves to the track that contains the block to be transferred.
◼ Disk rotation moves the block under the read-write head for reading or writing.
◼ A physical disk block (hardware) address consists of:
◼ a cylinder number (imaginary collection of tracks of same radius from all recorded surfaces)
◼ the track number or surface number (within the cylinder)
◼ and block number (within track).
98

◼ Reading or writing a disk block is time consuming because of the seek time s and rotational delay
(latency) rd.
◼ Double buffering can be used to speed up the transfer of contiguous disk blocks.

Blocking
◼ Blocking:
◼ Refers to storing a number of records in one block on the disk.
◼ Blocking factor (bfr) refers to the number of records per block.
◼ There may be empty space in a block if an integral number of records do not fit in one block.
◼ Spanned Records:
Refers to records that exceed the size of one or more blocks and hence span a number of blocks

Files of Records
◼ A file is a sequence of records, where each record is a collection of data values (or data items).
◼ A file descriptor (or file header) includes information that describes the file, such as the field names
and their data types, and the addresses of the file blocks on disk.
◼ Records are stored on disk blocks.
◼ The blocking factor bfr for a file is the (average) number of file records stored in a disk block.
◼ A file can have fixed-length records or variable-length records.
◼ File records can be unspanned or spanned
◼ Unspanned: no record can span two blocks
◼ Spanned: a record can be stored in more than one block.
◼ The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or
indexed.
(blocks on Hard Disk → Physical block;
blocks on memory → logical blocks).
◼ In a file of fixed-length records, all records have the same format.
◼ Usually, unspanned blocking is used with such files.
◼ Files of variable-length records require additional information to be stored in each record, such as
separator characters and field types.
◼ Usually spanned blocking is used with such files.

Operation on Files
◼ Typical file operations include:
◼ OPEN: Readies the file for access, and associates a pointer that will refer to a current
file record at each point in time.
◼ FIND: Searches for the first file record that satisfies a certain condition, and makes it
the current file record.
◼ FINDNEXT: Searches for the next file record (from the current record) that satisfies a
certain condition, and makes it the current file record.
◼ READ: Reads the current file record into a program variable.
◼ INSERT: Inserts a new record into the file & makes it the current file record.
◼ DELETE: Removes the current file record from the file, usually by marking the
record to indicate that it is no longer valid.
◼ MODIFY: Changes the values of some fields of the current file record.
◼ CLOSE: Terminates access to the file.
◼ REORGANIZE: Reorganizes the file records.
◼ For example, the records marked deleted are physically removed from the file
or a new organization of the file records is created.
◼ READ_ORDERED: Read the file blocks in order of a specific field of the file.
99

Unordered Files
◼ Also called a heap or a pile file.
◼ New records are inserted at the end of the file.
◼ A linear search through the file records is necessary to search for a record.
◼ This requires reading and searching half the file blocks on the average, and is hence
quite expensive.
◼ Record insertion is quite efficient (its advantage).
Reading the records in order of a particular field requires sorting the file records

Unordered Files
1. Insert (a new record): Very efficient!

last block
on the
disk

Copy

Buffer

Add

New
record
Re-write

Block back
to
the disk
update

Address of the last file


block kept in the
‘File header’
100

Delete a record:

Find block
on the
disk

Copy

Buffer

Delete

Del
record

Re-write
Block back
to
the disk

update

Address of the
file
block kept in the
‘File header’

3. Search for a record: expensive!

• Linear search → the file block by block


• If one record satisfies the search condition, then, on the average the
program will read into the memory and search half of the file blocks
before it finds the record.
• Ex: blocks number b → reads at least (b/2)
• If not? Will read all blocks..
Ordered Files
101

◼ Also called a sequential file.


◼ File records are kept sorted by the values of an ordering field.
◼ We can physically order the records of a file on disk based on the
values of their fields called the ordering field; this leads to an ordered
or sequential file.
◼ If the ordering field is also a key field of the file, then, filed must have
a unique value in each record. Hence, the field is called ordering key
for the file.
◼ Insertion is expensive: records must be inserted in the correct order.
◼ It is common to keep a separate unordered overflow (or transaction)
file – temporary file – for new records to improve insertion
efficiency; this is periodically merged with the main ordered file.
◼ A binary search can be used to search for a record on its ordering field
value.
◼ This requires reading and searching log2 of the file blocks on the
average, an improvement over linear search.
◼ Reading the records in order of the ordering field is quite efficient.
102
103

Average Access Times

◼ The following table shows the average access time to access a specific
record for a given type of file

Hashing Techniques

◼ Def: another type of primary file organization based on hashing which provide very fast
access to records under certain search conditions. This organization is usually called a hash
file.
◼ The search condition must be an equality on a single field called hash filed. In most cases, the
has filed is also a key field of the file, in which case it is called the hash key.
The idea behind hashing is to provide a function h, called a hash function or randomizing
function, which is applied to the hash field value of a record and yields the address of the disk block
in which the record is sorted.
Hashing Techniques Types
1. Internal Hashing (inside)
◼ For internal files, hashing is implemented as a hash table through the use of an array of
records. Assume that the array index range is from [ 0 to M – 1, where M is the number of
blocks ].
◼ The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ...,
bucketM-1.
◼ Typically, a bucket corresponds to one (or a fixed number of) disk block
◼ One of the file fields is designated to be the hash key of the file.
◼ The record with hash key value K is stored in bucket i, where i=h(K), where h is the hashing
function.
◼ Hash function 1: h(k)= k mod M → returns the remainder of an integer hash filed
value K after division by M; this value then used for the record address.
◼ Function 2 (folding): involving applying an arithmetic function (i.e. addition) or a
logic function (i.e. exclusive) to differentiate portions of the hash field value to
compute the hash address
◼ Search is very efficient on the hash key.
104

Hashed Files

◼ Collisions occur when a new record hashes to a bucket that is already full.
◼ An overflow file is kept for sorting such records.
◼ Overflow records that hash to each bucket can be linked together.
The process of finding another position is called collision resolution
◼ There are numerous methods for collision resolution, including the following:
◼ Open addressing: Proceeding from the occupied position specified by the hash
address, the program checks the subsequent positions in order until an unused (empty)
position is found.
◼ Chaining: For this method, various overflow locations are kept, usually by extending
the array with a number of overflow positions. In addition, a pointer field is added to
each record location. A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash address location to the
address of that overflow location.
105

Multiple hashing: The program applies a second hash function if the first results in a collision.
If another collision results, the program uses open addressing or applies a third hash function and
then uses open addressing if necessary

Hashing Techniques Types


2. External Hashing (Outside)
◼ Hashing for disk files is called External Hashing
◼ To suit the characteristics of disk storage, the target address space is made of buckets, each
of which holds multiple records.
◼ A bucket is either one disk block or a cluster of contiguous disk blocks.
◼ The hashing function maps a key into a relative bucket number, rather than assigning an
absolute block address to the bucket.
◼ A table maintained in the file header converts the bucket number into the corresponding disk
block address.
◼ The collision problem is less severe with the buckets, because as many records as will fit in
a bucket can hash to the same bucket without causing problems.
◼ However, the pointers in the linked list should be record pointers, which include both a
block address and a relative record position with the block.
106

◼ To reduce overflow records, a hash file is typically kept 70-80% full.


◼ The hash function h should distribute the records uniformly among the buckets
◼ Otherwise, search time will be increased because many overflow records will exist.
◼ Main disadvantages of static external hashing:
◼ Fixed number of buckets M is a problem if the number of records in the file grows or
shrinks.
◼ Ordered access on the hash key is quite inefficient (requires sorting the records).

Dynamic And Extendible Hashed Files


◼ Dynamic and Extendible Hashing Techniques
◼ Hashing techniques are adapted to allow the dynamic growth and shrinking of the
number of file records.
◼ These techniques include the following: dynamic hashing, extendible hashing, and
linear hashing.
◼ Both dynamic and extendible hashing use the binary representation of the hash value h(K)
in order to access a directory (all addresses).
107

◼ In dynamic hashing the directory is a binary tree.


◼ In extendible hashing the directory is an array of size 2d where d is called the global
depth.
◼ The directories can be stored on disk, and they expand or shrink dynamically.
◼ Directory entries point to the disk blocks that contain the stored records.
◼ An insertion in a disk block that is full causes the block to split into two blocks and the
records are redistributed among the two blocks.
◼ The directory is updated appropriately.
◼ Dynamic and extendible hashing do not require an overflow area.
◼ Linear hashing does require an overflow area but does not use a directory.
◼ Blocks are split in linear order as the file expands.
◼ Dynamic hashing uses an access structure based on binary
◼ tree data structures.


108

Extendible hashing
◼ In extendible hashing, stores an access structure in addition to the file, and hence is
somewhat similar to indexing.
◼ The main differences is that the access structure is based on the values that result after
application of the hash function to the search field. In indexing, the access structure is based
on the values of the search field itself.
◼ A type of directory an array 2 bucket addresses is maintained, where d is called the global
depth of the directory; đ is a local depth.
◼ The integer value corresponding to the first (high-order) d bits of a hash value is used
as an index to the array to determine a directory entry, and the address in that entry
determines the bucket in which the corresponding records are sorted.
◼ A local depth ɗ, stored with each bucket specifies the number of bits on which the
bucket contents are based.
109

Linear hashing

◼ The idea behind it is to allow a hash file to expand and shrink its number of buckets
dynamically without needing to a directory.
Parallelizing Disk Access using RAID Technology

◼ Secondary storage technology must take steps to keep up in performance and reliability with
processor technology.
◼ A major advance in secondary storage technology is represented by the development of
RAID, which originally stood for Redundant Arrays of Inexpensive Disks.
The main goal of RAID is to even out the widely different rates of performance improvement
of disks against those in memory and microprocessors.
RAID Technology
◼ A natural solution is a large array of small independent disks acting as a single higher-
performance logical disk.
◼ A concept called data striping is used, which utilizes parallelism to improve disk
performance.
◼ Data striping distributes data transparently over multiple disks to make them appear as a
single large, fast disk.


◼ Different raid organizations were defined based on different combinations of the two factors
of granularity of data interleaving (striping) and pattern used to compute redundant
information.
◼ Raid level 0 (uses striping/distribution) has no redundant data and hence has the best
write performance at the risk of data loss
◼ Raid level 1 uses mirrored disks.
◼ Raid level 2 uses memory-style redundancy by using Hamming codes, which contain
parity bits for distinct overlapping subsets of components. Level 2 includes both error
detection and correction.
◼ Raid level 3 uses a single parity disk relying on the disk controller to figure out which
disk has failed.
◼ Raid Levels 4 and 5 use block-level data striping, with level 5 distributing data and
parity information across all disks.
110

◼ Raid level 6 applies the so-called P + Q redundancy scheme using Reed-Soloman


codes to protect against up to two disk failures by using just two redundant disks.
Use of RAID Technology

◼ Different raid organizations are being used under different situations


◼ Raid level 1 (mirrored disks) is the easiest for rebuild of a disk from other disks
◼ It is used for critical applications like logs
◼ Raid level 2 uses memory-style redundancy by using Hamming codes, which contain
parity bits for distinct overlapping subsets of components.
◼ Level 2 includes both error detection and correction.
◼ Raid level 3 (single parity disks relying on the disk controller to figure out which disk
has failed) and level 5 (block-level data striping) are preferred for Large volume
storage, with level 3 giving higher transfer rates.
◼ Most popular uses of the RAID technology currently are:
◼ Level 0 (with striping), Level 1 (with mirroring) and Level 5 with an extra drive for
parity.
◼ Design Decisions for RAID include:
◼ Level of RAID, number of disks, choice of parity schemes, and grouping of disks for
block-level striping.

Storage Area Networks

◼ The demand for higher storage has risen considerably in recent times.
◼ Organizations have a need to move from a static fixed data center oriented operation to a
more flexible and dynamic infrastructure for information processing.
◼ Thus they are moving to a concept of Storage Area Networks (SANs).
◼ In a SAN, online storage peripherals are configured as nodes on a high-speed network
and can be attached and detached from servers in a very flexible manner.
◼ This allows storage systems to be placed at longer distances from the servers and provide
different performance and connectivity options.
111

◼ Advantages of SANs are:


◼ Flexible many-to-many connectivity among servers and storage devices using fiber
channel hubs and switches.
◼ Up to 10km separation between a server and a storage system using appropriate fiber
optic cables.
◼ Better isolation capabilities allowing non-disruptive addition of new peripherals and
servers.
◼ SANs face the problem of combining storage options from multiple vendors and dealing with
evolving standards of storage management software and hardware.
INDEXING
▪ Indexing is a data structure technique which allows you to quickly retrieve records from a
database file.
▪ An index is a small table having only two columns.
▪ The first column comprises a copy of the primary or candidate key of table.
▪ Second column contains a set of pointers for holding the address of the disk block where that
specific key value is stored.
▪ An index->
• takes search key as input
• efficiently returns a collection of matching records.

TYPES OF INDEX METHODS/ TECHNIQUES :-


TYPES OF INDEX

PRIMARY INDEX SECONDARY INDEX CLUSTERING


INDEX

DENSE SPARSE
o indexing in database is defined based on its indexing attributes.
PRIMARY INDEX :-
primary index is an ordered file which is fixed length size with two fields.
The first field is same as a primary key and second field is pointed to that specific data
block.
The primary indexing in DBMS is also divided into two types
1. DENSE INDEX
2. SPARSE INDEX
DENSE INDEXING :-
• In dense index, a record is created for every search key valued in the
database.
• This helps to search faster.
• But needs more space to store index records.
SPARSE INDEXING:-
112

• It is an index record that appears for only some of the values in the
file.
• Sparse index helps you to resolve the issues of dense indexing in
DBMS.
• However, sparse index stores index records for only some search_key
values.
• It needs less space, less maintenance over level for insertion and
deletion.
• But slower compared to the dense index for locating records.
• The sparse index maintains ordered data indexing, we can’t
maintain.
SECONDARY INDEX :-
The secondary index in DBMS can be generated by a field which has
unique value for each record, and it should be a candidate key.
It is also known as non-clustering index.
This is two level database indexing technique, that reduces the
mapping size of first level.
As the size of table grows, the size of mapping also grows.
If mapping size grows then fetching the address itself becomes
slower.
In this case sparse, index will not be efficient
To overcome the problem, secondary index is introduced
CLUSTER INDEX:-
A clustered index can be defined as an ordered data file.
Some times index is created on non-primary key columns which may
not be unique for each record.
In this case, to identify the record faster we will group 2 or more
columns to get the unique value & create index out of them .
This method is called clustering.
This schema little confusing because one disk block is shared by records
which belongs to disk clusters.
If we use separate disk block for separate clusters, then it is called better
technique.
MULTILEVEL INDEX:-
Multi-level indexing in database Is created when a primary index does not fit in
memory.

• Index records comprise search key values & data pointers.


• Multi-level index is stored on the disk along with the actual database files.
• As the size of database grows, so does the size of the indices.
• There is a immense need to keep the index records in the main memory so as to speed up the
search operations.
113

• If single-level, index is used, then a large size index can’t be kept in memory which leads to
multiple disk accesses
• Multi-level index helps in breaking down the index into several smaller indices in order to
make the make the outermost level so small that it can be saved in a single disk block, which
can easily be accommodated anywhere in the main memory
B+ TREE:-
Is balanced BST that follows a multi-level index format.
• B+ tree can support random access & sequential access.
• B+ tree leaf nodes are actual data pointer.
• B+ tree ensures that all leaf nodes are linked using link list, and all leaf nodes remain at same
height, thus balanced.

You might also like