DBMS Notes
DATA MODELS IN DBMS
A Data Model is the logical structure of a database. It describes the design of the database to reflect entities, attributes, relationships among data, constraints etc.

Types of Data Models:

Object based logical Models – describe data at the conceptual and view levels.
1. E-R Model
An entity-relationship model (ER model) is a systematic way of describing and defining a business process. An ER model is typically implemented as a database. The main components of the E-R model are the entity set and the relationship set.
2. Object oriented Model
An object data model is a data model based on object-oriented programming, associating methods (procedures) with objects that can benefit from class hierarchies. Thus, "objects" are levels of abstraction that include both attributes and behaviour.

Record based logical Models – like object based models, they also describe data at the conceptual and view levels, but they specify the logical structure of the database with records, fields and attributes.
1. Relational Model
In the relational model, data and relationships are represented by a collection of inter-related tables. Each table is a group of columns and rows, where a column represents an attribute of an entity and a row represents a record.
Sample relational model: a Student table with 3 columns and three records.
Stu_Id  Stu_Name  Stu_Age
111     Ashish    23
123     Saurav    22
169     Lester    24
2. Hierarchical Model
In the hierarchical model, data is organized into a tree-like structure, with each record having one parent record and many children. The main drawback of this model is that it can have only one-to-many relationships between nodes.
Sample hierarchical model diagram: (figure not reproduced)
3. Network Model
The network model is the same as the hierarchical model except that it has a graph-like rather than a tree-based structure. Unlike the hierarchical model, this model allows each record to have more than one parent record.

Physical Data Models – these models describe data at the lowest level of abstraction.

Three Schema Architecture
The goal of the three schema architecture is to separate the user applications from the physical database. The schemas can be defined at the following levels:
1. The internal level – has an internal schema which describes the physical storage structure of the database. It uses a physical data model and describes the complete details of data storage and access paths for the database.
2. The conceptual level – has a conceptual schema which describes the structure of the database for users. It hides the details of the physical storage structures and concentrates on describing entities, data types, relationships, user operations and constraints. Usually a representational data model is used to describe the conceptual schema.
3. The external or view level – includes external schemas or user views. Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group. It is represented using a representational data model.
The three schema architecture is used to visualize the schema levels in a database. The three schemas are only descriptions of data; the data itself exists only at the physical level.
COMPONENTS OF DBMS
Database Users
Users are differentiated by the way they expect to interact with the system:
• Application programmers
• Sophisticated users
• Specialized users
• Naive users
• Database Administrator

Application programmers: professionals who write application programs, and through these application programs they interact with the database system.
Sophisticated users: these users interact with the database system without writing programs; they submit queries to retrieve information.
Specialized users: those who write specialized database applications to interact with the database system.
Naive users: interact with the database system by invoking application programs that have been written previously by application programmers. E.g. people accessing a database over the web.
Database Administrator: coordinates all the activities of the database system; the database administrator has a good understanding of the enterprise's information resources and needs. Duties include:
• Schema definition
• Access method definition
• Schema and physical organization modification
• Granting user authority to access the database
• Monitoring performance

EVOLUTION OF RDBMS
Before the acceptance of Codd's Relational Model, database management systems were just ad hoc collections of data designed to solve a particular type of problem, later extended to serve more basic purposes. This led to complex systems which were difficult to understand, install, maintain and use. These database systems were plagued with the following problems:
• They required large budgets and staffs of people with special skills that were in short supply.
• Database administrators' staff and application developers required prior preparation to access these database systems.
• End-user access to the data was rarely provided.
• These database systems did not support the implementation of business logic as a DBMS responsibility.
Hence, the objective of developing a relational model was to address each and every one of the shortcomings that plagued the systems existing at the end of the 1960s, and to make DBMS products more widely appealing to all kinds of users.
The existing relational database management systems offer powerful yet simple solutions for a wide variety of commercial and scientific application problems. Almost every industry uses relational systems to store, update and retrieve data for operational, transaction, as well as decision support systems.

RELATIONAL DATABASE
A relational database is a database system in which the database is organized and accessed according to the relationships between data items, without the need for any consideration of physical orientation and relationship. Relationships between data items are expressed by means of tables.
It is a tool which can help you store, manage and disseminate information of various kinds. It is a collection of objects — tables, queries, forms, reports and macros — all stored in a computer program, all of which are inter-related.
It is a method of structuring data in the form of records, so that relations between different entities and attributes can be used for data access and transformation.

RELATIONAL DATABASE MANAGEMENT SYSTEM
A Relational Database Management System (RDBMS) is a system which allows us to perceive data as tables (and nothing but tables), and the operators necessary to manipulate that data are at the user's disposal.
Primary Key: a minimum super key; a unique identifier for the table (a column or a column combination with the property that at any given time no two rows of the table contain the same value in that column or column combination).
Foreign Key: a foreign key is a field (or collection of fields) in one table that uniquely identifies a row of another table. In simpler words, the foreign key is defined in a second table, but it refers to the primary key in the first table.
Candidate Key: there may be two or more attributes, or combinations of attributes, that uniquely identify an instance of an entity set. These attributes or combinations of attributes are called candidate keys.
Secondary Key: a secondary key is an attribute or combination of attributes that may not be a candidate key, but that classifies the entity set on a particular characteristic. Any key consisting of a single attribute is called a simple key, while one consisting of a combination of attributes is called a composite key.

Referential Integrity
Referential integrity can be defined as an integrity constraint specifying that the value (or existence) of an attribute in one relation depends on the value (or existence) of an attribute in the same or another relation. Referential integrity in a relational database is consistency between coupled tables. It is usually enforced by the combination of a primary key and a foreign key: for referential integrity to hold, any field in a table that is declared a foreign key can contain only values from a parent table's primary key field. For instance, deleting a record that contains a value referred to by a foreign key in another table would break referential integrity.

Relational Model
The relational data model is the primary data model, used widely around the world for data storage and processing. This model is simple, and it has all the properties and capabilities required to process data with storage efficiency.
Concepts
Tables − In the relational data model, relations are saved in the format of tables. This format stores the relation among entities. A table has rows and columns, where rows represent records and columns represent attributes.
Tuple − A single row of a table, which contains a single record for that relation, is called a tuple.
Relation instance − A finite set of tuples in the relational database system represents a relation instance. Relation instances do not have duplicate tuples.
Relation schema − A relation schema describes the relation name (table name), attributes, and their names.
Relation key − Each row has one or more attributes, known as the relation key, which can identify the row in the relation (table) uniquely.
Attribute domain − Every attribute has some pre-defined value scope, known as the attribute domain.
Constraints
Every relation has some conditions that must hold for it to be a valid relation. These conditions are called Relational Integrity Constraints. There are three main integrity constraints:
Key constraints
Domain constraints
Referential integrity constraints
Key Constraints
There must be at least one minimal subset of attributes in the relation which can identify a tuple uniquely. This minimal subset of attributes is called the key for that relation. If there is more than one such minimal subset, they are called candidate keys.
Key constraints enforce that no two tuples can have identical values for the key attributes, and that a key attribute cannot have NULL values.

Relational Algebra
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss these operations in the following sections.

Select Operation (σ)
It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
where σ stands for the selection predicate and r stands for the relation; p is a propositional logic formula which may use connectives like and, or and not, and relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − selects tuples from Books where subject is 'database'.
σsubject = "database" and price = "450"(Books)
Output − selects tuples from Books where subject is 'database' and price is 450.
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − selects tuples from Books where subject is 'database' and price is 450, or those books published after 2010.

Project Operation (∏)
It projects the column(s) that satisfy a given predicate.
Notation − ∏A1, A2, …, An(r)
where A1, A2, …, An are attribute names of relation r.
Duplicate rows are automatically eliminated, as a relation is a set.
For example −
∏subject, author (Books)
Selects and projects the columns named subject and author from the relation Books.

Union Operation (∪)
It performs a binary union between two given relations and is defined as r ∪ s = { t | t ∈ r or t ∈ s }.
Notation − r ∪ s
where r and s are either database relations or relation result sets (temporary relations).
For a union operation to be valid, the following conditions must hold −
• r and s must have the same number of attributes.
• Attribute domains must be compatible.
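The select, project and union operations above can be sketched in ordinary Python, with a relation standing in as a list of dictionaries (the Books data here is made up purely for illustration):

```python
# Tiny relational-algebra sketch: a relation is a list of dicts (tuples).
def select(r, p):
    # sigma_p(r): keep the tuples that satisfy predicate p
    return [t for t in r if p(t)]

def project(r, attrs):
    # pi_attrs(r): keep the named columns; duplicates vanish (a relation is a set)
    seen, out = set(), []
    for t in r:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

def union(r, s):
    # r U s: every tuple in r or in s, without duplicates
    out = []
    for t in r + s:
        if t not in out:
            out.append(t)
    return out

Books = [
    {"title": "DB Concepts", "subject": "database", "price": 450},
    {"title": "C Primer", "subject": "programming", "price": 300},
]
print(select(Books, lambda t: t["subject"] == "database"))
print(project(Books, ["subject"]))
```

Running the two prints shows one selected tuple and two distinct subject rows, mirroring how σ filters rows and ∏ filters columns.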
For example −
∏author(Books) ∪ ∏author(Articles)
Output − yields a relation which shows all the books and articles written by tutorialspoint.

Rename Operation (ρ)
The results of relational algebra are also relations, but without any name. The rename operation allows us to rename the output relation. The rename operation is denoted by the small Greek letter rho (ρ).
Notation − ρx(E)
where the result of expression E is saved with the name x.
Additional operations are −
Set intersection
Assignment
Natural join

SQL FUNDAMENTALS
SQL is a standard computer language for accessing and manipulating databases.

What is SQL?
SQL stands for Structured Query Language
SQL allows you to access a database
SQL is an ANSI standard computer language
SQL can execute queries against a database
SQL can retrieve data from a database
SQL can insert new records in a database
SQL can delete records from a database
SQL can update records in a database
SQL is easy to learn

SQL Data Manipulation Language (DML)
The most important DML statements in SQL are:
UPDATE - updates data in a database table
DELETE - deletes data from a database table
INSERT INTO - inserts new data into a database table

SQL Data Definition Language (DDL)
The Data Definition Language (DDL) part of SQL permits database tables to be created or deleted. We can also define indexes (keys), specify links between tables, and impose constraints between database tables.
The most important DDL statements in SQL are:
CREATE TABLE - creates a new database table
ALTER TABLE - alters (changes) a database table
DROP TABLE - deletes a database table
CREATE INDEX - creates an index (search key)
DROP INDEX - deletes an index

The SQL SELECT Statement
The SELECT statement is used to select data from a table. The tabular result is stored in a result table (called the result-set).
Syntax
SELECT column_name(s)
FROM table_name
Note: SQL statements are not case sensitive. SELECT is the same as select.

SQL SELECT Example
To select the content of the columns named "LastName" and "FirstName" from the database table called "Persons", use a SELECT statement like this:
SELECT LastName, FirstName FROM Persons
Note: we do not have to put a semicolon after each SQL statement, but some database programs force you to use it.

The SELECT DISTINCT Statement
The DISTINCT keyword is used to return only distinct (different) values.
The SELECT statement returns information from table columns. But what if we only want to select distinct elements? With SQL, all we need to do is add a DISTINCT keyword to the SELECT statement:
Syntax
SELECT DISTINCT column_name(s) FROM table_name

Using the DISTINCT keyword
To select ALL values from the column named "Company" we use a SELECT statement like this:
SELECT Company FROM Orders
"Orders" table
Company    OrderNumber
Sega       3412
W3Schools  2312
Trio       4678
W3Schools  6798
Result
Company
Sega
W3Schools
Trio
W3Schools
Note that "W3Schools" is listed twice in the result-set.
To select only DIFFERENT values from the column named "Company" we use a SELECT DISTINCT statement like this:
SELECT DISTINCT Company FROM Orders
Result:
Company
Sega
W3Schools
Trio

Using the WHERE Clause
The WHERE clause supports operators such as =, <> (not equal), >, <, >=, <=, BETWEEN, and LIKE (search for a pattern).
Note: In some versions of SQL the <> operator may be written as !=
To select only the persons living in the city "Sandnes", we add a WHERE clause to the SELECT statement:
SELECT * FROM Persons
WHERE City='Sandnes'
"Persons" table
LastName   FirstName  Address       City       Year
Hansen     Ola        Timoteivn 10  Sandnes    1951
Svendson   Tove       Borgvn 23     Sandnes    1978
Svendson   Stale      Kaivn 18      Sandnes    1980
Pettersen  Kari       Storgt 20     Stavanger  1960
Result
LastName   FirstName  Address       City       Year
Hansen     Ola        Timoteivn 10  Sandnes    1951
Svendson   Tove       Borgvn 23     Sandnes    1978
Svendson   Stale      Kaivn 18      Sandnes    1980

Quotes around conditional values
Note that we have used single quotes around the conditional values in the examples. SQL uses single quotes around text values (most database systems will also accept double quotes). Numeric values should not be enclosed in quotes.
For text values:
This is correct:  SELECT * FROM Persons WHERE FirstName='Tove'
This is wrong:    SELECT * FROM Persons WHERE FirstName=Tove
For numeric values:
This is correct:  SELECT * FROM Persons WHERE Year>1965
This is wrong:    SELECT * FROM Persons WHERE Year>'1965'
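The SELECT, WHERE and DISTINCT behaviour above can be tried out with SQLite from Python; the data mirrors the sample "Persons" table (a sketch, not tied to any particular RDBMS):

```python
import sqlite3

# In-memory database holding the sample "Persons" table from the notes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Persons (LastName TEXT, FirstName TEXT, Address TEXT, City TEXT, Year INT)")
con.executemany("INSERT INTO Persons VALUES (?,?,?,?,?)", [
    ("Hansen", "Ola", "Timoteivn 10", "Sandnes", 1951),
    ("Svendson", "Tove", "Borgvn 23", "Sandnes", 1978),
    ("Svendson", "Stale", "Kaivn 18", "Sandnes", 1980),
    ("Pettersen", "Kari", "Storgt 20", "Stavanger", 1960),
])

# WHERE: text values take single quotes, numeric values do not.
sandnes = con.execute("SELECT LastName FROM Persons WHERE City='Sandnes'").fetchall()
after65 = con.execute("SELECT LastName FROM Persons WHERE Year>1965").fetchall()

# DISTINCT: duplicate values collapse to a single row in the result-set.
cities = con.execute("SELECT DISTINCT City FROM Persons").fetchall()
print(sandnes)  # three matching rows
print(cities)   # two distinct cities
```

The three Sandnes rows and the two distinct cities match the result tables shown above.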
The INSERT INTO Statement
Syntax
INSERT INTO table_name
VALUES (value1, value2, ...)
You can also specify the columns for which you want to insert data:
INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...)

Insert a New Row
This "Persons" table:
LastName   FirstName  Address    City
Pettersen  Kari       Storgt 20  Stavanger
And this SQL statement:
INSERT INTO Persons
VALUES ('Hetland', 'Camilla', 'Hagabakka 24', 'Sandnes')
Will give this result:
LastName   FirstName  Address       City
Pettersen  Kari       Storgt 20     Stavanger
Hetland    Camilla    Hagabakka 24  Sandnes

Insert Data in Specified Columns
This "Persons" table:
LastName   FirstName  Address       City
Pettersen  Kari       Storgt 20     Stavanger
Hetland    Camilla    Hagabakka 24  Sandnes
And this SQL statement:
INSERT INTO Persons (LastName, Address)
VALUES ('Rasmussen', 'Storgt 67')
Will give this result:
LastName   FirstName  Address       City
Pettersen  Kari       Storgt 20     Stavanger
Hetland    Camilla    Hagabakka 24  Sandnes
Rasmussen             Storgt 67
The unspecified columns hold Null (no value — not a space, not an empty string).

The UPDATE Statement
The UPDATE statement is used to modify the data in a table.
Syntax
UPDATE table_name
SET column_name = new_value
WHERE column_name = some_value

Update several Columns in a Row
We want to change the address and add the name of the city:
UPDATE Person
SET Address = 'Stien 12', City = 'Stavanger'
WHERE LastName = 'Rasmussen'
Result:
LastName   FirstName  Address     City
Nilsen     Fred       Kirkegt 56  Stavanger
Rasmussen  Nina       Stien 12    Stavanger

The DELETE Statement
The DELETE statement is used to delete rows in a table.
Syntax
DELETE FROM table_name
WHERE column_name = some_value
"Person" table:
LastName   FirstName  Address     City
Nilsen     Fred       Kirkegt 56  Stavanger
Rasmussen  Nina       Stien 12    Stavanger
Delete a Row
"Nina Rasmussen" is going to be deleted:
DELETE FROM Person WHERE LastName = 'Rasmussen'
Result
LastName  FirstName  Address     City
Nilsen    Fred       Kirkegt 56  Stavanger
Delete All Rows
It is possible to delete all the rows in a table without deleting the table itself, so that the table structure, attributes, and indexes stay intact:
DELETE FROM table_name
or
DELETE * FROM table_name
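The INSERT, UPDATE and DELETE statements above can be walked through with SQLite from Python (a sketch on an in-memory table, mirroring the Person examples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (LastName TEXT, FirstName TEXT, Address TEXT, City TEXT)")

# INSERT: values listed positionally, or only for named columns (the rest become NULL).
con.execute("INSERT INTO Person VALUES ('Nilsen', 'Fred', 'Kirkegt 56', 'Stavanger')")
con.execute("INSERT INTO Person (LastName, Address) VALUES ('Rasmussen', 'Storgt 67')")

# UPDATE several columns in one row.
con.execute("UPDATE Person SET FirstName='Nina', Address='Stien 12', City='Stavanger' "
            "WHERE LastName='Rasmussen'")
nina = con.execute("SELECT * FROM Person WHERE LastName='Rasmussen'").fetchone()

# DELETE one row; the rest of the table is untouched.
con.execute("DELETE FROM Person WHERE LastName='Rasmussen'")
remaining = con.execute("SELECT LastName FROM Person").fetchall()
print(nina)       # ('Rasmussen', 'Nina', 'Stien 12', 'Stavanger')
print(remaining)  # [('Nilsen',)]
```

Note how the column-list form of INSERT leaves FirstName and City as NULL until the UPDATE fills them in.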
The ORDER BY keyword is used to sort the result.
"Orders" table
Company    OrderNumber
Sega       3412
ABC Shop   5678
W3Schools  6798
W3Schools  2312
Example
To display the company names in alphabetical order AND the OrderNumber in numerical order:
SELECT Company, OrderNumber FROM Orders
ORDER BY Company, OrderNumber
Result:
Company    OrderNumber
ABC Shop   5678
Sega       3412
W3Schools  2312
W3Schools  6798

Aggregate functions
Aggregate functions operate against a collection of values, but return a single value.
Note: If used among many other expressions in the item list of a SELECT statement, the SELECT must have a GROUP BY clause!
"Persons" table (used in most examples)
Name             Age
Hansen, Ola      34
Svendson, Tove   45
Pettersen, Kari  19
Aggregate functions in MS Access
Function       Description
AVG(column)    Returns the average value of a column
COUNT(column)  Returns the number of rows (without a NULL value) of a column
COUNT(*)       Returns the number of selected rows
FIRST(column)  Returns the value of the first record in a specified field
LAST(column)   Returns the value of the last record in a specified field
MAX(column)    Returns the highest value of a column
MIN(column)    Returns the lowest value of a column
STDEV(column)  Returns the sample standard deviation of a column
STDEVP(column) Returns the population standard deviation of a column
SUM(column)    Returns the total sum of a column
VAR(column)    Returns the sample variance of a column
VARP(column)   Returns the population variance of a column

Scalar functions
Scalar functions operate against a single value, and return a single value based on the input value.
Useful Scalar Functions in MS Access
Function               Description
UCASE(c)               Converts a field to upper case
LCASE(c)               Converts a field to lower case
MID(c,start[,end])     Extracts characters from a text field
LEN(c)                 Returns the length of a text field
INSTR(c,char)          Returns the numeric position of a named character within a text field
LEFT(c,number_of_char) Returns the requested left part of a text field
RIGHT(c,number_of_char) Returns the requested right part of a text field
ROUND(c,decimals)      Rounds a numeric field to the number of decimals specified
MOD(x,y)               Returns the remainder of a division operation

GROUP BY
Aggregate functions (like SUM) often need an added GROUP BY functionality.
GROUP BY... was added to SQL because aggregate functions (like SUM) return the aggregate of all column values every time they are called, and without the GROUP BY function it was impossible to find the sum for each individual group of column values.
The syntax for the GROUP BY function is:
SELECT column, SUM(column) FROM table GROUP BY column
GROUP BY Example
This "Sales" table:
Company    Amount
W3Schools  5500
IBM        4500
W3Schools  7100
And this SQL:
SELECT Company, SUM(Amount) FROM Sales GROUP BY Company
Returns this result:
Company    SUM(Amount)
W3Schools  12600
IBM        4500

HAVING...
HAVING... was added to SQL because the WHERE keyword could not be used against aggregate functions (like SUM), and without HAVING... it would be impossible to test for result conditions.
The syntax for the HAVING function is:
SELECT column, SUM(column) FROM table
GROUP BY column
HAVING SUM(column) condition value
This "Sales" table:
Company    Amount
W3Schools  5500
IBM        4500
W3Schools  7100

EMBEDDED SQL
Host variables are declared between the BEGIN DECLARE SECTION and END DECLARE SECTION markers:
EXEC SQL BEGIN DECLARE SECTION;
int STD_ID;
char STD_NAME[15];
char ADDRESS[20];
EXEC SQL END DECLARE SECTION;
We can note here that the variables are written inside the begin and end block of the SQL, but they are declared using C code; SQL code is not used to declare them. Why? Because they are host variables — variables of the C language — and hence we cannot use SQL syntax to declare them. The host language supports almost all the datatypes: int, char, long, float, double, pointer, array, string, structures etc.
When a host variable is used in a SQL query, it should be preceded by a colon ':' to indicate that it is a host variable. When the pre-compiler compiles the SQL code, it substitutes the value of the host variable and compiles:
EXEC SQL SELECT * FROM STUDENT WHERE STUDENT_ID = :STD_ID;
The following is from a simple embedded SQL program written in C. The full program illustrates many, but not all, of the embedded SQL techniques: it prompts the user for an order number, retrieves the customer number, salesperson, and status of the order, and displays the retrieved information on the screen.
int main() {
    char SalesPerson[10];  /* retrieved salesperson name */
    /* ... prompt for the order number and run the query ... */
query_error:
    printf("SQL error: %ld\n", sqlca->sqlcode);
    exit(1);
bad_number:
    printf("Invalid order number.\n");
    exit(1);
}
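The GROUP BY and HAVING queries shown above can be reproduced with SQLite from Python (a sketch; MS Access specifics such as FIRST() are not portable):

```python
import sqlite3

# Rebuild the "Sales" table and run the GROUP BY / HAVING queries from the notes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales (Company TEXT, Amount INT)")
con.executemany("INSERT INTO Sales VALUES (?,?)",
                [("W3Schools", 5500), ("IBM", 4500), ("W3Schools", 7100)])

# GROUP BY: one aggregate row per company.
totals = con.execute(
    "SELECT Company, SUM(Amount) FROM Sales GROUP BY Company ORDER BY Company").fetchall()
print(totals)  # [('IBM', 4500), ('W3Schools', 12600)]

# HAVING filters on the aggregate, which WHERE cannot do.
big = con.execute(
    "SELECT Company, SUM(Amount) FROM Sales GROUP BY Company "
    "HAVING SUM(Amount) > 10000").fetchall()
print(big)     # [('W3Schools', 12600)]
```

The HAVING condition is applied after grouping, so only the W3Schools group (5500 + 7100 = 12600) survives the > 10000 test.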
DYNAMIC SQL
The main disadvantage of embedded SQL is that it supports only static SQL. If we need to build up queries at run time, then we can use dynamic SQL: if the query changes according to user input, it is better to use dynamic SQL. For example, the query issued when the user enters only a student name differs from the one issued when the user enters both student name and address. With embedded SQL, this requirement cannot be implemented in the code. In such cases dynamic SQL helps build the query depending on the values entered by the user, without the user knowing which query is being executed. It can also be used when we do not know which SQL statement (INSERT, DELETE, UPDATE or SELECT) needs to be used, when the number of host variables is unknown, when the datatypes of host variables are unknown, or when direct reference to DB objects like tables, views or indexes is required.
This keeps the user's requirement simple and easy, but it may make the query lengthier and more complex: depending upon user inputs, the query may grow or shrink, making the code flexible enough to handle all the possibilities. In embedded SQL, the compiler knows the query in advance and the pre-compiler compiles the SQL code well before C compiles the code for execution; hence embedded SQL is faster in execution. In dynamic SQL, queries are created, compiled and executed only at run time, which makes dynamic SQL a little complex and time consuming.
Since the query needs to be prepared at run time, in addition to the structures discussed in embedded SQL, we have three more clauses in dynamic SQL. These are mainly used to build the query and execute it at runtime.
PREPARE
Since dynamic SQL builds a query at run time, as a first step we need to capture all the inputs from the user; they are stored in a string variable. Depending on the inputs received from the user, the string variable is appended with inputs and SQL keywords, and these SQL-like string statements are then converted into a SQL query. This is done using the PREPARE statement. For example, suppose sql_stmt is a character variable which holds the inputs from the users along with SQL commands. It cannot yet be considered a SQL query, as it is still a string value; it is converted into a proper SQL query by the PREPARE statement into sql_query, which is also a string variable, but one that holds the string as a SQL query.
EXECUTE
This statement is used to compile and execute the SQL statements prepared in the DB:
EXEC SQL EXECUTE sql_query;
EXECUTE IMMEDIATE
This statement prepares the SQL statement as well as executes it in the DB; it performs the tasks of PREPARE and EXECUTE in a single line:
EXEC SQL EXECUTE IMMEDIATE :sql_stmt;
This form will not have any SELECT queries or host variables, but it can be any other SQL statement like INSERT, DELETE, UPDATE, GRANT etc. When we use INSERT/DELETE/UPDATE in this type, we cannot use host variables; all the input values are hardcoded. Hence such SQL statements can be directly executed using EXECUTE IMMEDIATE rather than using PREPARE and then EXECUTE.
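The PREPARE/EXECUTE pattern has a close analogue in most database APIs: the query text is assembled as a string at run time and then prepared with placeholders for the values. A sketch with Python's sqlite3 (the STUDENT table and its rows are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STUDENT (STD_ID INT, STD_NAME TEXT, ADDRESS TEXT)")
con.execute("INSERT INTO STUDENT VALUES (1, 'Ashish', 'Chennai')")
con.execute("INSERT INTO STUDENT VALUES (2, 'Saurav', 'Delhi')")

def find_students(name=None, address=None):
    # Build the query string at run time from whichever inputs the user gave,
    # appending conditions and SQL keywords exactly as the PREPARE step describes.
    sql_stmt = "SELECT STD_ID, STD_NAME FROM STUDENT WHERE 1=1"
    params = []
    if name is not None:
        sql_stmt += " AND STD_NAME = ?"
        params.append(name)
    if address is not None:
        sql_stmt += " AND ADDRESS = ?"
        params.append(address)
    # execute() prepares and runs the finished statement (PREPARE + EXECUTE in one call).
    return con.execute(sql_stmt, params).fetchall()

print(find_students(name="Ashish"))                   # [(1, 'Ashish')]
print(find_students(name="Saurav", address="Delhi"))  # [(2, 'Saurav')]
```

The query literally grows or shrinks with the user's inputs, which is the flexibility the text attributes to dynamic SQL; the `?` placeholders keep the values out of the query string itself.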
Entity-Relationship Data Model
• Classical, popular conceptual data model.
• First introduced (mid 70's) as a (relatively minor) improvement to the relational model: pictorial diagrams are easier to read than relational database schemas.
• It then evolved as a popular model for the first conceptual representation of data structures in the process of database design.

ER Model: Entity and Entity Set
An Entity is generally a real-world object which has characteristics and holds relationships in a DBMS. For example, Student is an entity and Teacher is an entity; similarly, Class, Subject etc. are also entities. If Student is an entity, then the complete dataset of all the students is the entity set.
ER Diagram: Entity
An entity can be any object, place, person or class. In an ER diagram, an entity is represented using a rectangle. Consider the example of an organisation: Employee, Manager, Department, Product and many more can be taken as entities.

ER Model: Attributes
If Student is an entity, then the student's roll no., name, age, gender etc. will be its attributes. An ellipse is used to represent an attribute of an entity, connected to the entity.
An attribute can be of many types; here are the different types of attributes defined in the ER database model:
1. Simple attribute: attributes with values that are atomic and cannot be broken down further. For example, a student's age.
2. Composite attribute: a composite attribute is made up of more than one simple attribute. For example, a student's address will contain house no., street name, pincode etc.
3. Derived attribute: attributes which are not stored in the database management system, but are derived using other attributes. For example, the average age of students in a class.
4. Single-valued attribute: attributes holding a single value for a particular instance.
5. Multi-valued attribute: attributes that can have multiple values.
ER Diagram: Key Attribute
A key attribute represents the main characteristic of an entity. It is used to represent a primary key. To represent a key attribute, the attribute name inside the ellipse is underlined.
Composite Attribute for any Entity
A composite attribute is an attribute which itself has attributes.

ER Model: Relationships
When an entity is related to another entity, they are said to have a relationship. For example, a Class entity is related to the Student entity, because students study in classes; hence this is a relationship.
A rhombus is used to set up relationships between two or more entities; in the diagrams, the rhombus between the entities represents the relationship.
Depending upon the number of entities involved, a degree is assigned to the relationship. For example, if 2 entities are involved it is said to be a binary relationship; if 3 entities are involved it is said to be a ternary relationship, and so on.
Relationships between Entities – Weak and Strong
A weak entity is represented using double rectangular boxes. It is generally connected to another entity.

One to One Relationship
One student can enroll for only one course, and a course will also have only one student. This is not what you will usually see in real-world relationships.
One to Many Relationship
One student can opt for many courses, but a course can only have one student. Sounds weird! This is how it is in this example.
Many to One Relationship
It reflects the business rule that many entities can be associated with just one entity. For example, a student enrolls for only one course, but a course can have many students.
Many to Many Relationship
One student can enroll for more than one course, and a course can have more than one student enrolled in it.
St. Joseph's College of Engineering / St. Joseph's Institute of Technology
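A many-to-many relationship is typically realised in an RDBMS with a third (junction) table holding pairs of foreign keys; a sketch in SQLite, with illustrative table and column names:

```python
import sqlite3

# Student-Course is many-to-many, so an Enrolls table holds (student, course) pairs.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Course  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE Enrolls (
        student_id INTEGER REFERENCES Student(id),
        course_id  INTEGER REFERENCES Course(id),
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO Student VALUES (1, 'Ashish'), (2, 'Saurav');
    INSERT INTO Course  VALUES (10, 'DBMS'), (11, 'Networks');
    -- One student in many courses, and one course with many students.
    INSERT INTO Enrolls VALUES (1, 10), (1, 11), (2, 10);
""")
rows = con.execute("""
    SELECT s.name, c.title FROM Enrolls e
    JOIN Student s ON s.id = e.student_id
    JOIN Course  c ON c.id = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)  # [('Ashish', 'DBMS'), ('Ashish', 'Networks'), ('Saurav', 'DBMS')]
```

The composite primary key on Enrolls prevents duplicate enrolments, and the two foreign keys tie each pair back to its parent rows.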
P
P
covers both.
third.
Specialization
AP
AP
Specialization is opposite to Generalization. It is a top-down approach in which one higher level entity can be broken down into two
lower level entity. In specialization, a higher level entity may not have any lower-level entity sets, it's possible.
R
R
For example, in the diagram above, we have three related entities, Company, Product and Sector. To understand the relationship better or
O
O
to define rules around the model, we should relate two entities and then derive the third one.
A Company produces many Products / each product is produced by exactly one company. A Company operates in only one Sector / each sector has many companies operating in it.
Considering the above two rules or relationships, we see that although the complete relationship involves three entities, we are looking at two entities at a time.

Aggregation
Aggregation is a process in which the relationship between two entities is treated as a single entity. In the diagram above, the relationship between Center and Course, taken together, acts as a single entity, which is in a relationship with another entity, Visitor. In the real world, if a visitor or a student visits a coaching center, he/she will never enquire about the center only, or just about the course; rather, he/she will enquire about both.

The Enhanced ER Model
As the complexity of data increased in the late 1980s, it became more and more difficult to use the traditional ER Model for database modelling. Hence some improvements or enhancements were made to the existing ER Model to make it better able to handle complex applications.
As part of the Enhanced ER Model, along with other improvements, three new concepts were added to the existing ER Model:
1. Generalization
2. Specialization
3. Aggregation

Generalization
Generalization is a bottom-up approach in which two lower-level entities combine to form a higher-level entity. In generalization, the higher-level entity can also combine with other lower-level entities to make a further higher-level entity. It is much like a superclass and subclass system, but the only difference is the approach, which is bottom-up. Hence, entities are combined to form a more generalised entity; in other words, sub-classes are combined to form a super-class.

ER Model to Relational Model
The ER Model can be represented using ER Diagrams, which are a great way of designing and representing the database design in more of a flow-chart form. It is very convenient to design the database using the ER Model by creating an ER diagram and later converting it into a relational model to design your tables. Proper foreign key constraints must also be set for all the tables. Not all ER Model constraints and components can be directly transformed into the relational model, but an approximate schema can be derived.
Let us take a few examples of ER diagrams and convert them into relational model schemas, hence creating tables in an RDBMS.

Entity becomes Table
An entity in the ER Model is changed into a table; for every entity in the ER model, a table is created in the Relational Model. The attributes of the entity get converted into the columns of the table, and the primary key specified for the entity in the ER model becomes the primary key of the table in the relational model.
For example, for the below ER Diagram in the ER Model, a table with the name Student will be created in the relational model, which will have 4 columns: id, name, age, address, and id will be the primary key for this table.
Table: Student
id  name  age  address

Relationship becomes a Relationship Table
In an ER diagram, we use a diamond/rhombus to represent a relationship between two entities. In the Relational Model we create a relationship table for ER Model relationships too. In the ER diagram below, we have two entities, Teacher and Student, with a relationship between them.

Functional Dependency
A functional dependency is a relationship that exists between two attributes. It typically exists between the primary key and a non-key attribute within a table:
X → Y
The left side of the FD is known as the determinant; the right side is known as the dependent.
For example, assume we have an employee table with attributes Emp_Id, Emp_Name and Emp_Address. Here the Emp_Id attribute can uniquely identify the Emp_Name attribute of the employee table, because if we know the Emp_Id, we can tell the employee name associated with it. This functional dependency can be written as:
Emp_Id → Emp_Name

Types of Functional Dependency

Trivial functional dependency
A → B is a trivial functional dependency if B is a subset of A. Dependencies of the form A → A and B → B are also trivial.
Example: Consider a table with two columns, Employee_Id and Employee_Name.
{Employee_Id, Employee_Name} → Employee_Id is a trivial functional dependency, as Employee_Id is a subset of {Employee_Id, Employee_Name}. Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial dependencies.
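The X → Y definition can be checked mechanically: a dependency holds in a table when no two rows agree on X but disagree on Y. A minimal sketch in Python (the helper name `fd_holds` and the sample rows are illustrative, not part of any real system):

```python
# Check whether a functional dependency X -> Y holds over a list of rows.
# The dependency holds if no two rows agree on X but disagree on Y.
def fd_holds(rows, x, y):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in x)
        val = tuple(row[a] for a in y)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

employees = [
    {"Emp_Id": 1, "Emp_Name": "Akon", "Emp_Address": "Kerala"},
    {"Emp_Id": 2, "Emp_Name": "Bkon", "Emp_Address": "Gujarat"},
    {"Emp_Id": 3, "Emp_Name": "Akon", "Emp_Address": "Kerala"},
]

print(fd_holds(employees, ["Emp_Id"], ["Emp_Name"]))   # True: Emp_Id -> Emp_Name
print(fd_holds(employees, ["Emp_Name"], ["Emp_Id"]))   # False: two Akons, different ids
```

The same helper also confirms trivial dependencies, e.g. `fd_holds(employees, ["Emp_Id", "Emp_Name"], ["Emp_Id"])` is always true because the dependent is a subset of the determinant.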
approach of decomposing tables to eliminate data redundancy (repetition) and undesirable characteristics like Insertion, Update and Deletion anomalies. It is a multi-step process that puts data into tabular form, removing duplicated data from the relation tables.
Normalization is used for mainly two purposes:
Eliminating redundant (useless) data.
Ensuring data dependencies make sense, i.e. data is logically stored.

Normalization Rule
Normalization rules are divided into the following normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. BCNF
5. Fourth Normal Form
6. Fifth Normal Form

Problems Without Normalization
If a table is not properly normalized and has data redundancy, then it will not only eat up extra memory space but will also make it difficult to handle and update the database without facing data loss. Insertion, Updation and Deletion anomalies are very frequent if the database is not normalized. To understand these anomalies, let us take an example of a Student table.

rollno  name  branch  hod    office_tel
401     Akon  CSE     Mr. X  53337
402     Bkon  CSE     Mr. X  53337
403     Ckon  CSE     Mr. X  53337
404     Dkon  CSE     Mr. X  53337

In the table above, we have data of 4 Computer Sci. students. As we can see, data for the fields branch, hod (Head of Department) and office_tel is repeated for the students who are in the same branch in the college; this is Data Redundancy.

Insertion Anomaly
Suppose for a new admission, until and unless a student opts for a branch, data of the student cannot be inserted, or else we will have to set the branch information to NULL. Also, if we have to insert data of 100 students of the same branch, then the branch information will be repeated for all those 100 students. These scenarios are nothing but insertion anomalies.

Updation Anomaly
What if Mr. X leaves the college, or is no longer the HOD of the computer science department? In that case all the student records will have to be updated, and if by mistake we miss any record, it will lead to data inconsistency. This is an updation anomaly.

Deletion Anomaly
In our Student table, two different pieces of information are kept together: student information and branch information. Hence, at the end of the academic year, if student records are deleted, we will also lose the branch information. This is a deletion anomaly.

First Normal Form (1NF)
For a table to be in the First Normal Form, it should follow the following 4 rules:
1. It should only have single (atomic) valued attributes/columns.
2. Values stored in a column should be of the same domain.
3. All the columns in a table should have unique names.
4. And the order in which data is stored does not matter.

Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your database:
Rule 1: Single Valued Attributes
Each column of your table should be single valued, which means it should not contain multiple values. We will explain this with the help of an example later; let's see the other rules for now.
Rule 2: Attribute Domain should not change
This is more of a "common sense" rule. In each column the values stored must be of the same kind or type. For example, if you have a column dob to save the dates of birth of a set of people, then you cannot, or rather must not, save the 'names' of some of them in that column along with the 'dates of birth' of others. It should hold only the 'date of birth' for all the records/rows.
Rule 3: Unique names for Attributes/Columns
This rule expects that each column in a table should have a unique name. This is to avoid confusion at the time of retrieving data or performing any other operation on the stored data. If one or more columns have the same name, then the DBMS will be left confused.
Rule 4: Order doesn't matter
This rule says that the order in which you store the data in your table doesn't matter.

EXAMPLE
Create a table to store student data which will have the student's roll no., their name and the names of the subjects they have opted for. Here is the table, with some sample data added to it.

roll_no  name  subject
101      Akon  OS, CN
102      Bkon  C, C++
103      Ckon  Java

The table already satisfies 3 of the 4 rules: all our column names are unique, we have stored the data in the order we wanted to, and we have not inter-mixed different types of data in columns.
St. Joseph's College of Engineering/St. Joseph's Institute of Technology
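The 1NF fix for this example, breaking the multi-valued subject column into one atomic value per row, can be sketched as follows (plain Python dicts stand in for table rows; the code is an illustration, not a DBMS feature):

```python
# Flatten the multi-valued 'subject' column into one atomic value per row (1NF).
students = [
    {"roll_no": 101, "name": "Akon", "subject": "OS, CN"},
    {"roll_no": 102, "name": "Bkon", "subject": "C, C++"},
    {"roll_no": 103, "name": "Ckon", "subject": "Java"},
]

normalized = [
    {"roll_no": s["roll_no"], "name": s["name"], "subject": subj.strip()}
    for s in students
    for subj in s["subject"].split(",")
]

for row in normalized:
    print(row)
# 101/OS, 101/CN, 102/C, 102/C++, 103/Java: five rows, each atomic
```

Note that roll_no and name repeat for students with several subjects, exactly as the prose below describes: redundancy goes up, but every column value is now atomic.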
But out of the 3 different students in our table, 2 have opted for more than 1 subject, and we have stored the subject names in a single column. But as per the First Normal Form, each column must contain an atomic value.
The fix is very simple: all we have to do is break the values into atomic values. Here is our updated table, and it now satisfies the First Normal Form.

roll_no  name  subject
101      Akon  OS
101      Akon  CN
102      Bkon  C
102      Bkon  C++
103      Ckon  Java

By doing so, although a few values are getting repeated, the values for the subject column are now atomic for each record/row. Using the First Normal Form, data redundancy increases, as there will be many columns with the same data in multiple rows, but each row as a whole will be unique.

Second Normal Form (2NF)
For a table to be in the Second Normal Form:
1. It should be in the First Normal Form.
2. And, it should not have Partial Dependency.

Dependency
Let's take an example of a Student table with columns student_id, name, reg_no (registration number), branch and address (student's home address).
In this table, student_id is the primary key and will be unique for every row; hence we can use student_id to fetch any row of data from this table. Even for a case where student names are the same, if we know the student_id we can easily fetch the correct record.

student_id  name  reg_no  branch  address
10          Akon  07-WY   CSE     Kerala
11          Akon  08-WY   IT      Gujarat

Hence we can say that a primary key for a table is the column or group of columns (composite key) which can uniquely identify each record in the table.
I can ask for the branch of the student with student_id 10, and I can get it. Similarly, if I ask for the name of the student with student_id 10 or 11, I will get it. So all I need is student_id, and every other column depends on it or can be fetched using it. This is Dependency, and we also call it Functional Dependency.

Partial Dependency
Now that we know what dependency is, we are in a better state to understand what partial dependency is.
For a simple table like Student, a single column like student_id can uniquely identify all the records in the table. But this is not true all the time. So now let's extend our example to see if more than one column together can act as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields, and subject_id will be the primary key.

subject_id  subject_name
1           Java
2           C++
3           Php

Now we have a Student table with student information and another table Subject for storing subject information.
Let's create another table Score, to store the marks obtained by students in the respective subjects. We will also save the name of the teacher who teaches that subject along with the marks.

score_id  student_id  subject_id  marks  teacher
1         10          1           70     Java Teacher
2         10          2           75     C++ Teacher
3         11          1           80     Java Teacher

In the Score table we are saving the student_id to know whose marks these are, and the subject_id to know which subject the marks are for. Together, student_id + subject_id form a candidate key, which can be the primary key.
To get the marks of the student with student_id 10, can you get them from this table? No, because you don't know for which subject. And if I give you the subject_id, you would not know for which student. Hence we need student_id + subject_id to uniquely identify any row.
But where is the Partial Dependency?
Now if you look at the Score table, we have a column named teacher which is dependent only on the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on. As we just discussed, the primary key for this table is a composition of two columns, student_id and subject_id, but the teacher's name depends only on the subject, hence on the subject_id, and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key.

How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's name from the Score table. The simplest solution is to remove the teacher column from the Score table and add it to the Subject table. Hence, the Subject table will become:

subject_id  subject_name  teacher
1           Java          Java Teacher
2           C++           C++ Teacher
3           Php           Php Teacher
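The 2NF decomposition described here, moving teacher out of Score because it depends only on subject_id, can be sketched like this (sample rows from the notes; the dict-based table representation is illustrative):

```python
# Remove the partial dependency: 'teacher' depends only on subject_id,
# which is just part of the composite key (student_id, subject_id).
score = [
    {"score_id": 1, "student_id": 10, "subject_id": 1, "marks": 70, "teacher": "Java Teacher"},
    {"score_id": 2, "student_id": 10, "subject_id": 2, "marks": 75, "teacher": "C++ Teacher"},
    {"score_id": 3, "student_id": 11, "subject_id": 1, "marks": 80, "teacher": "Java Teacher"},
]

# The teacher column moves into the Subject table, keyed by subject_id:
subject_teacher = {}
for row in score:
    subject_teacher[row["subject_id"]] = row["teacher"]

# The Score table keeps only the columns that depend on the whole key:
score_2nf = [{k: v for k, v in row.items() if k != "teacher"} for row in score]

print(subject_teacher)   # {1: 'Java Teacher', 2: 'C++ Teacher'}
print(score_2nf[0])      # no 'teacher' column any more
```

Each teacher name is now stored once per subject rather than once per score row.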
And our Score table is now in the Second Normal Form, with no partial dependency.

score_id  student_id  subject_id  marks
1         10          1           70
2         10          2           75
3         11          1           80

Third Normal Form (3NF)
A table is said to be in the Third Normal Form when:
1. It is in the Second Normal Form.
2. And, it doesn't have Transitive Dependency.
So let's use the same example, where we have 3 tables: Student, Subject and Score.

Student Table
student_id  name  reg_no  branch  address
10          Akon  07-WY   CSE     Kerala
11          Akon  08-WY   IT      Gujarat
12          Bkon  09-WY   IT      Rajasthan

Subject Table
subject_id  subject_name  teacher
1           Java          Java Teacher
2           C++           C++ Teacher
3           Php           Php Teacher

Score Table
In the Score table, we need to store some more information: the exam name and the total marks. So let's add 2 more columns to the Score table.

score_id  student_id  subject_id  marks
1         10          1           70
2         10          2           75
3         11          1           80

Transitive Dependency
With exam_name and total_marks added to our Score table, it saves more data now. The primary key for the Score table is a composite key, which means it is made up of two attributes or columns: student_id + subject_id.
The new column exam_name depends on both student and subject. For example, a mechanical engineering student will have a Workshop exam but a computer science student won't. And for some subjects you have practical exams and for some you don't. So we can say that exam_name is dependent on both student_id and subject_id.
And what about our second new column, total_marks? Does it depend on our Score table's primary key?
Well, the column total_marks depends on exam_name, as the total score changes with the exam type. For example, practicals are of fewer marks while theory exams are of more marks. But exam_name is just another column in the Score table; it is not the primary key, or even a part of the primary key, and total_marks depends on it.
This is Transitive Dependency: a non-prime attribute depends on other non-prime attributes rather than on the prime attributes or the primary key.

How to remove Transitive Dependency
Again the solution is very simple: take the columns exam_name and total_marks out of the Score table, put them in an Exam table, and use the exam_id wherever required.

Score Table: in 3rd Normal Form
score_id  student_id  subject_id  marks  exam_id

The new Exam table
exam_id  exam_name   total_marks
1        Workshop    200
2        Mains       70
3        Practicals  30

Advantage of removing Transitive Dependency
The advantages of removing transitive dependency are:
The amount of data duplication is reduced.
Data integrity is achieved.

Boyce-Codd Normal Form (BCNF)
Boyce-Codd Normal Form is a higher version of the Third Normal Form. This form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, the following conditions must be satisfied:
R must be in 3rd Normal Form
and, for each functional dependency (X → Y), X should be a super key.
In simple words, it means that for a dependency A → B, A cannot be a non-prime attribute if B is a prime attribute.

Example
College enrolment table with columns student_id, subject and professor.

student_id  subject  professor
101         Java     P.Java
101         C++      P.Cpp
102         Java     P.Java2
103         C#       P.Chash
104         Java     P.Java

In the table above:
One student can enrol for multiple subjects. For example, the student with student_id 101 has opted for the subjects Java and C++.
For each subject, a professor is assigned to the student.
And there can be multiple professors teaching one subject, like Java.
What do you think should be the primary key?
Well, in the table above, student_id and subject together form the primary key, because using student_id and subject we can find all the columns of the table.
One more important point to note here is that one professor teaches only one subject, but one subject may have two different professors. Hence, there is a dependency between subject and professor here, where subject depends on the professor name.
This table satisfies the 1st Normal Form, because all the values are atomic, the column names are unique and all the values stored in a particular column are of the same domain. This table also satisfies the 2nd Normal Form, as there is no partial dependency. And there is no transitive dependency, hence the table also satisfies the 3rd Normal Form. But this table is not in Boyce-Codd Normal Form.

Why is this table not in BCNF?
In the table above, student_id and subject form the primary key, which means the subject column is a prime attribute. But there is one more dependency: professor → subject. And while subject is a prime attribute, professor is a non-prime attribute, which is not allowed by BCNF.

How to satisfy BCNF?
To make this relation (table) satisfy BCNF, we will decompose this table into two tables: a student table and a professor table. Below we have the structure for both tables.

Student Table
student_id  p_id
101         1
101         2

Professor Table
p_id  professor  subject
1     P.Java     Java
2     P.Cpp      C++

And now this relation satisfies Boyce-Codd Normal Form.

Fourth Normal Form (4NF)
A table is said to be in the Fourth Normal Form when:
1. It is in the Boyce-Codd Normal Form.
2. And, it doesn't have Multi-Valued Dependency.

Multi-valued Dependency
A table is said to have a multi-valued dependency if the following conditions are true:
1. For a dependency A → B, if for a single value of A multiple values of B exist, then the table may have a multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
3. And, for a relation R(A,B,C), if there is a multi-valued dependency between A and B, then B and C should be independent of each other.
If all these conditions are true for any relation (table), it is said to have a multi-valued dependency.

Example
Below we have a college enrolment table with columns s_id, course and hobby.

s_id  course   hobby
1     Science  Cricket
1     Maths    Hockey
2     C#       Cricket
2     Php      Hockey

From the table above, the student with s_id 1 has opted for two courses, Science and Maths, and has two hobbies, Cricket and Hockey.
You must be thinking what problem this can lead to, right? Well, the two records for the student with s_id 1 will give rise to two more records, as shown below, because for one student two hobbies exist; hence, along with both the courses, these hobbies should be specified.

s_id  course   hobby
1     Science  Cricket
1     Maths    Hockey
1     Science  Hockey
1     Maths    Cricket

And in the table above there is no relationship between the columns course and hobby; they are independent of each other. So there is a multi-valued dependency, which leads to unnecessary repetition of data and other anomalies as well.

How to satisfy the 4th Normal Form?
To make the above relation satisfy the fourth normal form, we can decompose the table into 2 tables.

Course Table
s_id  course
1     Science
1     Maths
2     C#
2     Php

Hobbies Table
s_id  hobby
1     Cricket
1     Hockey
2     Cricket
2     Hockey

Now this relation satisfies the fourth normal form.
A table can also have a functional dependency along with a multi-valued dependency. In that case, the functionally dependent columns are moved into a separate table, and the multi-valued dependent columns are moved to separate tables.
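The 4NF decomposition of the (s_id, course, hobby) table can be verified to be lossless with a small sketch: project onto the two independent facts, then re-join on s_id and compare with the original (Python sets stand in for relations; illustrative only):

```python
# 4NF decomposition: split the independent multi-valued facts
# (s_id, course) and (s_id, hobby), then re-join on s_id.
original = {(1, "Science", "Cricket"), (1, "Science", "Hockey"),
            (1, "Maths", "Cricket"), (1, "Maths", "Hockey")}

courses = {(s, c) for (s, c, h) in original}
hobbies = {(s, h) for (s, c, h) in original}

rejoined = {(s, c, h)
            for (s, c) in courses
            for (s2, h) in hobbies if s == s2}

print(rejoined == original)  # True: the decomposition is lossless
```

Because course and hobby are independent of each other, the join reproduces exactly the four combination rows and nothing else; the two small tables avoid storing those combinations explicitly.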
Fifth Normal Form (5NF)
A database is said to be in 5NF if and only if:
1. It's in 4NF.
2. If we can decompose the table further to eliminate redundancy and anomalies, then when we re-join the decomposed tables by means of candidate keys, we should not lose the original data, nor should any new record set arise. In simple words, joining two or more decomposed tables should neither lose records nor create new records.

What is Join Dependency?
If a table can be recreated by joining multiple tables, each of which has a subset of the attributes of the table, then the table has a Join Dependency. It is a generalization of Multivalued Dependency. Join Dependency can be related to 5NF, wherein a relation is in 5NF only if it is already in 4NF and it cannot be decomposed further.

Example
<Employee>
EmpName  EmpSkills        EmpJob (Assigned Work)
Tom      Networking       EJ001
Harry    Web Development  EJ002
Katie    Programming      EJ002

The above table can be decomposed into the following three tables; therefore it is not in 5NF:

<EmployeeSkills>
EmpName  EmpSkills
Tom      Networking
Harry    Web Development
Katie    Programming

<EmployeeJobs>
EmpName  EmpJob
Tom      EJ001
Harry    EJ002
Katie    EJ002

<JobSkills>
EmpSkills        EmpJob
Networking       EJ001
Web Development  EJ002
Programming      EJ002

Our join dependency: the above relations have a join dependency, so they are not in 5NF. That means that a join of the above three relations is equal to our original relation <Employee>.

FIFTH NORMAL FORM EXAMPLE
Consider an example of different subjects taught by different lecturers, and the lecturers taking classes for different semesters.
Note: Please consider that Semester 1 has Mathematics, Physics and Chemistry and Semester 2 has only Mathematics in its academic year.
In the above table, Rose takes both Mathematics and Physics classes for Semester 1, but she does not take a Physics class for Semester 2. In this case, a combination of all three fields is required to identify valid data. Imagine we want to add a new class, Semester 3, but do not know which subject will be taught and who will be taking it. We would simply be inserting a new entry with Class as Semester 3 and leaving Lecturer and Subject as NULL. As we discussed above, it is not good to have such entries. Moreover, all three columns together act as the primary key, so we cannot leave the other two columns blank!
Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF, and when the decomposed tables are joined using keys, they should yield the correct records. Here we can represent each lecturer's subject area and their classes in a better way. We can divide the above table into three: (SUBJECT, LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS).
Now each of the combinations is in three different tables. If we need to identify who is teaching which subject to which semester, we need to join the keys of each table and get the result.
For example, to find who teaches Physics to Semester 1, we would select Physics and Semester 1 from the third table above, join with the first table using Subject to filter out the lecturer names, and then join with the second table using Lecturer to get the correct lecturer name. That is, we joined the key columns of each table to get the correct data. Hence there is no lost or new data, satisfying the 5NF condition.
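The claim that joining the three projections reproduces <Employee> exactly, with no lost records and no spurious ones, can be checked with a short sketch (Python sets stand in for the relations; illustrative only):

```python
# Join-dependency check for the <Employee> example: the three binary
# projections, joined on their shared columns, rebuild the original.
employee = {("Tom", "Networking", "EJ001"),
            ("Harry", "Web Development", "EJ002"),
            ("Katie", "Programming", "EJ002")}

emp_skills = {(n, s) for (n, s, j) in employee}   # <EmployeeSkills>
emp_jobs   = {(n, j) for (n, s, j) in employee}   # <EmployeeJobs>
job_skills = {(s, j) for (n, s, j) in employee}   # <JobSkills>

rejoined = {(n, s, j)
            for (n, s) in emp_skills
            for (n2, j) in emp_jobs if n2 == n
            for (s2, j2) in job_skills if s2 == s and j2 == j}

print(rejoined == employee)  # True: no records lost, none invented
```

The third projection, <JobSkills>, is what filters out spurious combinations: without it, the join of the first two alone could pair a person's skill with a job that never required that skill.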
1. TRANSACTION CONCEPTS
What is a Transaction?
A set of logically related operations is known as a transaction. The main operations of a transaction are:
Read(A): the read operation Read(A) or R(A) reads the value of A from the database and stores it in a buffer in main memory.
Write(A): the write operation Write(A) or W(A) writes the value back to the database from the buffer.
Let us take a debit transaction from an account, which consists of the following operations:
1. R(A);
2. A = A - 1000;
3. W(A);
Assume A's value before the start of the transaction is 5000.
The first operation reads the value of A from the database and stores it in a buffer.
The second operation decreases its value by 1000, so the buffer will contain 4000.
The third operation writes the value from the buffer to the database, so A's final value will be 4000.
But it may also be possible that the transaction fails after executing some of its operations. The failure can be because of hardware, software or power, etc. For example, if the debit transaction discussed above fails after executing operation 2, the value of A will remain 5000 in the database, which is not acceptable by the bank.
To avoid this, the database has two important operations:
Commit: after all instructions of a transaction are successfully executed, the changes made by the transaction are made permanent in the database.
Rollback: if a transaction is not able to execute all operations successfully, all the changes made by the transaction are undone.
A database is a shared resource. It is used by many users and processes concurrently, for example in banking systems, railway and air reservation systems, stock market monitoring, supermarket inventory and checkouts, etc. Not managing concurrent access may create issues like:
Hardware failure and system crashes
Concurrent execution of the same transaction, deadlock, or slow performance

Transaction States
Partially Committed: a transaction goes into the partially committed state after the end of the transaction.
Committed State: when a transaction is in the committed state, it has already completed its execution successfully. Moreover, all of its changes are recorded in the database permanently.
Failed State: a transaction is considered failed when any one of the checks fails, or if the transaction is aborted while it is in the active state.
Terminated State: a transaction reaches the terminated state when it leaves the system and cannot be restarted.

Transition Diagram for a Database Transaction
Let's study a state transition diagram that highlights how a transaction moves between these various states.
1. Once a transaction starts execution, it becomes active. It can issue READ or WRITE operations.
2. Once the READ and WRITE operations complete, the transaction enters the partially committed state.
3. A check is then made that the changes made by the transaction can be recorded permanently. If this check is a success, the transaction commits and enters the committed state.
4. If the check fails, the transaction goes to the failed state.
5. If the transaction is aborted while it is in the active state, it goes to the failed state. The transaction should be rolled back to undo the effect of its write operations on the database.
6. The terminated state refers to the transaction leaving the system.
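The commit/rollback behaviour described above can be demonstrated with SQLite through Python's built-in sqlite3 module (the account data and the simulated failure are of course artificial):

```python
import sqlite3

# The debit example: A starts at 5000, the debit of 1000 fails before
# commit, and rollback restores the old value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 5000)")
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 1000 WHERE id = 1")
    raise RuntimeError("simulated failure before commit")
except RuntimeError:
    conn.rollback()  # undo the partial debit

balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(balance)  # 5000: the rollback undid the uncommitted write
```

Replacing the rollback with conn.commit() after a successful run would make the new balance of 4000 permanent, matching the Commit operation described above.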
Prepared By: Mrs. E. Ajitha (AP/CSE)
Atomicity
By this, we mean that either the entire transaction takes place at once or doesn't happen at all. There is no midway, i.e. transactions do not occur partially. Each transaction is considered as one unit and either runs to completion or is not executed at all. It involves the following two operations:
Abort: if a transaction aborts, changes made to the database are not visible.
Commit: if a transaction commits, changes made are visible.
Atomicity is also known as the 'all or nothing' rule.
Consider the following transaction T consisting of T1 and T2: a transfer of 100 from account X to account Y.
If the transaction fails after completion of T1 but before completion of T2 (say, after write(X) but before write(Y)), then the amount has been deducted from X but not added to Y. This results in an inconsistent database state. Therefore, the transaction must be executed in its entirety in order to ensure the correctness of the database state.

Consistency
This means that integrity constraints must be maintained, so that the database is consistent before and after the transaction. It refers to the correctness of a database. Referring to the example above:
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs in case T1 completes but T2 fails; as a result, T is incomplete.

Isolation
This property ensures that multiple transactions can occur concurrently without leading to inconsistency of the database state. Transactions occur independently, without interference. Changes occurring in a particular transaction will not be visible to any other transaction until that change has been written to memory or committed. This property ensures that the concurrent execution of transactions will result in a state that is equivalent to a state achieved if they were executed serially in some order.
Let X = 500, Y = 500. Consider two transactions T and T''. Suppose T has been executed till Read(Y) and then T'' starts. As a result, interleaving of operations takes place, due to which T'' reads the correct value of X but an incorrect value of Y, and the sum computed by
T'': (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of the transaction:
T: (X + Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a loss of 50 units. Hence, transactions must take place in isolation, and changes should be visible only after they have been made to the main memory.

Durability
This property ensures that once the transaction has completed execution, the updates and modifications to the database are stored on and written to disk, and they persist even if a system failure occurs. These updates become permanent and are stored in non-volatile memory. The effects of the transaction are thus never lost.

3. SCHEDULES
1. Serial Schedules
Schedules in which the transactions are executed non-interleaved, i.e. a serial schedule is one in which no transaction starts until a running transaction has ended, are called serial schedules.
Example: Consider the following schedule involving two transactions T1 and T2.

T1      T2
R(A)
W(A)
R(B)
W(B)
        R(A)
        R(B)

where R(A) denotes that a read operation is performed on some data item 'A'.
This is a serial schedule, since the transactions are performed serially in the order T1 → T2.

2. Complete Schedules
Schedules in which the last operation of each transaction is either abort or commit are called complete schedules.
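The serial-schedule definition can be checked mechanically: a schedule is serial exactly when each transaction's operations form one contiguous block. A sketch, with operations encoded as (transaction, action, item) tuples (the encoding is illustrative, not standard notation):

```python
# A schedule is serial if no transaction resumes after another one has run,
# i.e. each transaction's operations appear as one contiguous block.
def is_serial(schedule):
    finished = []      # transactions whose contiguous run has ended
    current = None
    for txn, _, _ in schedule:
        if txn != current:
            if txn in finished:
                return False   # txn resumed after another transaction ran
            if current is not None:
                finished.append(current)
            current = txn
    return True

s1 = [("T1", "R", "A"), ("T1", "W", "A"), ("T1", "R", "B"), ("T1", "W", "B"),
      ("T2", "R", "A"), ("T2", "R", "B")]                    # T1 then T2
s2 = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A")]  # interleaved

print(is_serial(s1))  # True
print(is_serial(s2))  # False
```

s1 is exactly the T1 → T2 schedule shown above; s2 interleaves the two transactions, so it is not serial.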
Example: Consider the following schedule involving three transactions T1, T2 and T3.

T1      T2      T3
R(A)
W(A)
        R(B)
commit
        W(B)
        commit
                R(A)
                abort

This is a complete schedule, since the last operation performed under every transaction is either "commit" or "abort".

3. Recoverable Schedules
Schedules in which transactions commit only after all transactions whose changes they read commit are called recoverable schedules. In other words, if some transaction Tj is reading a value updated or written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Example: Consider the following schedule involving two transactions T1 and T2.

T1      T2
R(A)
W(A)
        W(A)
        R(A)
commit
        commit

This is a recoverable schedule, since T1 commits before T2, which makes the value read by T2 correct.

4. Cascadeless Schedules
Also called "avoids cascading aborts/rollbacks" (ACA). Schedules in which transactions read values only after all transactions whose changes they are going to read have committed are called cascadeless schedules. This avoids a single transaction abort leading to a series of transaction rollbacks. A strategy to prevent cascading aborts is to disallow a transaction from reading uncommitted changes from another transaction in the same schedule. In other words, if some transaction Tj wants to read a value updated or written by some other transaction Ti, then Tj must read it after the commit of Ti.
Example: Consider the following schedule involving two transactions T1 and T2.

T1      T2
R(A)
W(A)
        W(A)
commit
        R(A)
        commit

This schedule is cascadeless, since the updated value of A is read by T2 only after the updating transaction, i.e. T1, commits.

5. Strict Schedules
A schedule is strict if, for any two transactions Ti and Tj, whenever a write operation of Ti precedes a conflicting operation of Tj (either read or write), the commit or abort event of Ti also precedes that conflicting operation of Tj. In other words, Tj can read or write a value updated or written by Ti only after Ti commits or aborts.
Example: Consider the following schedule involving two transactions T1 and T2.

T1      T2
R(A)
        R(A)
W(A)
commit
        W(A)
        R(A)
        commit

This is a strict schedule, since T2 reads and writes A, which is written by T1, only after the commit of T1.
Note: It can be seen that:
1. Cascadeless schedules are stricter than recoverable schedules, i.e. a subset of recoverable schedules.
2. Strict schedules are stricter than cascadeless schedules, i.e. a subset of cascadeless schedules.
3. Serial schedules satisfy the constraints of all recoverable, cascadeless and strict schedules and hence are a subset of strict schedules.

4. SERIALIZABILITY
When multiple transactions are running concurrently, there is a possibility that the database may be left in an inconsistent state. Serializability is a concept that helps us to check which schedules are serializable. A serializable schedule is one that always leaves the database in a consistent state.
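Conflict serializability, developed in the next section, rests on identifying conflicting operations. Its three conditions (different transactions, same data item, at least one write) can be sketched as a small predicate (the tuple encoding of operations is illustrative):

```python
# Two operations conflict if they come from different transactions, touch
# the same data item, and at least one of them is a write.
def conflicts(op1, op2):
    t1, action1, item1 = op1
    t2, action2, item2 = op2
    return t1 != t2 and item1 == item2 and "W" in (action1, action2)

print(conflicts(("T1", "W", "X"), ("T2", "R", "X")))  # True: write/read, same item
print(conflicts(("T1", "W", "X"), ("T2", "W", "Y")))  # False: different items
print(conflicts(("T1", "R", "X"), ("T2", "R", "X")))  # False: both are reads
print(conflicts(("T1", "W", "X"), ("T1", "R", "X")))  # False: same transaction
```

Only non-conflicting pairs, the cases where this predicate returns False, may be swapped when testing a schedule for conflict serializability.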
5 Prepared By: Mrs. E. Ajitha (AP/CSE) 6 Prepared By: Mrs. E. Ajitha (AP/CSE)
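The three properties above can be checked mechanically. The following is a minimal sketch (not part of the original notes; the function name `classify` and the schedule encoding are illustrative assumptions): a schedule is a list of `(txn, op, item)` steps, and we look at every read/write of data last written by a still-uncommitted transaction.

```python
def classify(schedule):
    """schedule: list of (txn, op, item); op is 'R', 'W' or 'C' (commit)."""
    recoverable = cascadeless = strict = True
    n = len(schedule)
    commit_pos = {t: i for i, (t, op, _) in enumerate(schedule) if op == 'C'}
    for i, (t, op, item) in enumerate(schedule):
        if op == 'C':
            continue
        # find the most recent write to this item by another transaction
        for j in range(i - 1, -1, -1):
            t2, op2, item2 = schedule[j]
            if op2 == 'W' and item2 == item and t2 != t:
                if commit_pos.get(t2, n) >= i:       # writer not yet committed
                    strict = False                   # touched dirty data
                    if op == 'R':
                        cascadeless = False          # dirty read
                        if commit_pos.get(t, n) < commit_pos.get(t2, n + 1):
                            recoverable = False      # reader commits first
                break
    if strict:
        return 'strict'
    return 'cascadeless' if cascadeless else ('recoverable' if recoverable else 'not recoverable')
```

For example, the recoverable schedule shown above classifies as 'recoverable', while the strict example classifies as 'strict'.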
1. Conflict Serializability
Conflict serializability checks whether a given schedule is conflict serializable or not.

What is Conflict Serializability?
A schedule is called conflict serializable if we can convert it into a serial schedule by swapping its non-conflicting operations.

Conflicting operations
Two operations are said to be in conflict if they satisfy all the following three conditions:
1. Both the operations belong to different transactions.
2. Both the operations work on the same data item.
3. At least one of the operations is a write operation.
Let's see some examples to understand this:
Example 1: Operation W(X) of transaction T1 and operation R(X) of transaction T2 are conflicting operations, because they satisfy all the three conditions mentioned above: they belong to different transactions, they work on the same data item X, and one of the operations is a write operation.
Example 2: Similarly, operations W(X) of T1 and W(X) of T2 are conflicting operations.
Example 3: Operations W(X) of T1 and W(Y) of T2 are non-conflicting operations because the two write operations are not working on the same data item, so they do not satisfy the second condition.
Example 4: Similarly, R(X) of T1 and R(X) of T2 are non-conflicting operations because neither of them is a write operation.
Example 5: Similarly, W(X) of T1 and R(X) of T1 are non-conflicting operations because both operations belong to the same transaction T1.

Conflict Equivalent Schedules
Two schedules are said to be conflict equivalent if one schedule can be converted into the other by swapping non-conflicting operations.

Conflict Serializable check
Let's check whether a schedule is conflict serializable or not. If a schedule is conflict equivalent to a serial schedule then it is called a conflict serializable schedule. Let's take a few examples of schedules.

Example of Conflict Serializability
Let's consider this schedule:

T1        T2
-----     ------
R(A)
          R(A)
          R(B)
          W(B)
R(B)
W(A)

Let's swap non-conflicting operations:
After swapping R(A) of T1 and R(A) of T2 we get:

T1        T2
-----     ------
          R(A)
R(A)
          R(B)
          W(B)
R(B)
W(A)

After swapping R(A) of T1 and R(B) of T2 we get:

T1        T2
-----     ------
          R(A)
          R(B)
R(A)
          W(B)
R(B)
W(A)

After swapping R(A) of T1 and W(B) of T2 we get:

T1        T2
-----     ------
          R(A)
          R(B)
          W(B)
R(A)
R(B)
W(A)

We finally got a serial schedule (T2 followed by T1) after swapping all the non-conflicting operations, so we can say that the given schedule is conflict serializable.
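The usual way to test conflict serializability is with a precedence graph rather than explicit swapping. As a sketch (an assumption, not the notes' own procedure): add an edge Ti -> Tj for every conflicting pair where Ti's operation comes first; the schedule is conflict serializable iff the graph has no cycle.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) with op in {'R', 'W'}."""
    edges = set()
    txns = {t for t, _, _ in schedule}
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            if ti != tj and x == y and 'W' in (op1, op2):
                edges.add((ti, tj))      # conflicting pair: Ti before Tj

    # cycle detection by depth-first search over the precedence graph
    def has_cycle(node, visiting, done):
        visiting.add(node)
        for a, b in edges:
            if a == node:
                if b in visiting or (b not in done and has_cycle(b, visiting, done)):
                    return True
        visiting.remove(node)
        done.add(node)
        return False

    done = set()
    return not any(has_cycle(t, set(), done) for t in txns if t not in done)

# The example schedule above: only the edge T2 -> T1 arises, so it is
# conflict serializable with serial order T2 followed by T1.
s = [('T1','R','A'), ('T2','R','A'), ('T2','R','B'), ('T2','W','B'),
     ('T1','R','B'), ('T1','W','A')]
```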
2. View Serializability
View Serializability is a process to find out whether a given schedule is view serializable or not.
To check whether a given schedule is view serializable, we need to check whether the given schedule is view equivalent to its serial schedule. Let's take an example to understand what that means.

Given Schedule:

T1        T2
-----     ------
R(X)
W(X)
          R(X)
          W(X)
R(Y)
W(Y)
          R(Y)
          W(Y)

Serial Schedule of the above given schedule:
As we know, in a serial schedule a transaction only starts when the currently running transaction has finished. So the serial schedule of the above given schedule would look like this:

T1        T2
-----     ------
R(X)
W(X)
R(Y)
W(Y)
          R(X)
          W(X)
          R(Y)
          W(Y)

If we can prove that the given schedule is view equivalent to its serial schedule, then the given schedule is called view serializable.

Testing for Serializability

5. CONCURRENCY CONTROL
With concurrency control, multiple transactions can be executed simultaneously. Concurrent execution may affect the transaction results, so it is highly important to maintain the order of execution of those transactions.

6. NEED FOR CONCURRENCY
Problems of concurrency control
Several problems can occur when concurrent transactions are executed in an uncontrolled manner. Following are the three problems in concurrency control.
1. Lost Update Problem
Example:
Suppose two transactions operate on three accounts.
Here,
o At time t2, transaction-X reads A's value.
o At time t3, transaction-Y reads A's value.
o At time t4, transaction-X writes A's value on the basis of the value seen at time t2.
o At time t5, transaction-Y writes A's value on the basis of the value seen at time t3.
o So at time t5, the update of transaction-X is lost because transaction-Y overwrites it without looking at its current value.
o Such a problem is known as the Lost Update Problem, as the update made by one transaction is lost here.

2. Dirty Read
o A dirty read occurs when one transaction updates an item of the database, and then the transaction fails for some reason. The updated database item is accessed by another transaction before it is changed back to its original value.
o A transaction T1 updates a record which is read by T2. If T1 aborts, then T2 now has values which have never formed part of the stable database.
Example:
o Transaction-X is computing the sum of all balances while transaction-Y is transferring an amount of 50 from Account-1 to Account-3.
o At time t2, transaction-Y writes A's value.
o At time t3, transaction-X reads A's value.
o Here, transaction-X produces the result of 550, which is incorrect. If we write this produced result to the database, the database will be in an inconsistent state, because the actual sum is 600.
o Here, transaction-X has seen an inconsistent state of the database.
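The lost update interleaving can be reproduced in a few lines. This is an illustrative sketch (the account balance 100 and the deposit/withdraw amounts are assumptions, not from the notes): both "transactions" read A, then each writes back a value computed from its stale read.

```python
account = {'A': 100}

# t2: transaction-X reads A
x_seen = account['A']
# t3: transaction-Y reads A
y_seen = account['A']
# t4: transaction-X writes A based on the value seen at t2 (deposit 10)
account['A'] = x_seen + 10
# t5: transaction-Y writes A based on the value seen at t3 (withdraw 20)
account['A'] = y_seen - 20

# X's deposit is lost: the final balance is 80, whereas any serial
# execution of the two transactions would produce 90.
```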
1. Shared lock:
o It is also known as a Read-only lock. With a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions, because when a transaction holds a shared lock it can't update the data item.
2. Exclusive lock:
o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: under this lock, multiple transactions cannot modify the same data simultaneously.

8. TWO PHASE LOCKING (2PL)
The two-phase locking protocol divides the execution phase of a transaction into three parts.
o In the first part, when the execution of the transaction starts, it seeks permission for the locks it requires.
o In the second part, the transaction acquires all the locks. The third phase starts as soon as the transaction releases its first lock.
o In the third phase, the transaction cannot demand any new locks. It only releases the acquired locks.
The following shows how unlocking and locking work with 2-PL.
Transaction T1:
o Growing phase: from step 1-3
o Shrinking phase: from step 5-7
Transaction T2:
o Growing phase: from step 2-6
o Shrinking phase: from step 8-9
o Lock point: at 6
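The two-phase rule can be enforced with a single flag. A minimal sketch (the class name `TwoPhaseTxn` is an illustrative assumption): once the first unlock happens, the transaction is in its shrinking phase and any further lock request is rejected.

```python
class TwoPhaseTxn:
    def __init__(self):
        self.locks = set()
        self.shrinking = False   # becomes True at the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first unlock")
        self.locks.add(item)     # growing phase

    def unlock(self, item):
        self.shrinking = True    # shrinking phase starts here
        self.locks.discard(item)

t = TwoPhaseTxn()
t.lock('A'); t.lock('B')         # growing phase
t.unlock('A')                    # lock point passed; shrinking phase
violated = False
try:
    t.lock('C')                  # illegal under 2PL
except RuntimeError:
    violated = True
```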
Types of Two Phase Locking (2PL)
1. Strict Two-phase locking (Strict-2PL)
o The first phase of Strict-2PL is the same as in 2PL: after acquiring all the locks, the transaction continues to execute normally.
o The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock after using it.
o Strict-2PL waits until the whole transaction commits, and then it releases all the locks at a time.
o Strict-2PL protocol does not have a shrinking phase of lock release.
o It does not have cascading aborts as 2PL does.
2. Rigorous Two-Phase Locking
Rigorous Two-Phase Locking Protocol avoids cascading rollbacks. This protocol requires that all the shared and exclusive locks be held until the transaction commits.

Timestamp Ordering Protocol
o The Timestamp Ordering Protocol is used to order the transactions based on their timestamps. The order of transactions is nothing but the ascending order of transaction creation.
o The priority of the older transaction is higher, which is why it executes first. To determine the timestamp of a transaction, this protocol uses system time or a logical counter.
o The lock-based protocol is used to manage the order between conflicting pairs among transactions at execution time, but timestamp-based protocols start working as soon as a transaction is created.
o Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered the system at time 007 and transaction T2 entered the system at time 009. T1 has the higher priority, so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write' operations on a data item.

Basic Timestamp ordering protocol works as follows:
1. Check the following condition whenever a transaction Ti issues a Read(X) operation:
o If W_TS(X) > TS(Ti) then the operation is rejected.
o If W_TS(X) <= TS(Ti) then the operation is executed.
o Timestamps of all the data items are updated.
2. Check the following condition whenever a transaction Ti issues a Write(X) operation:
o If TS(Ti) < R_TS(X) then the operation is rejected.
o If TS(Ti) < W_TS(X) then the operation is rejected and Ti is rolled back; otherwise the operation is executed.
Where,
TS(Ti) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the read time-stamp of data-item X.
W_TS(X) denotes the write time-stamp of data-item X.

Thomas Write Rule
Thomas Write Rule provides the guarantee of serializability order for the protocol. It improves the Basic Timestamp Ordering Algorithm.
The basic Thomas write rules are as follows:
o If TS(T) < R_TS(X) then transaction T is aborted and rolled back, and the operation is rejected.
o If TS(T) < W_TS(X) then don't execute the Write(X) operation of the transaction and continue processing.
o If neither condition 1 nor condition 2 occurs, then transaction T is allowed to execute the Write operation and W_TS(X) is set to TS(T).
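The checks above translate almost directly into code. A minimal sketch (an assumption, not the notes' own implementation; the `thomas` flag and return strings are illustrative), covering both basic timestamp ordering and the Thomas-write-rule case that silently ignores an obsolete write:

```python
R_TS, W_TS = {}, {}   # last read / write timestamps per data item

def read(ts, x):
    if W_TS.get(x, 0) > ts:
        return 'rejected'            # a younger transaction already wrote x
    R_TS[x] = max(R_TS.get(x, 0), ts)
    return 'executed'

def write(ts, x, thomas=False):
    if R_TS.get(x, 0) > ts:
        return 'rejected'            # a younger transaction already read x
    if W_TS.get(x, 0) > ts:
        # basic TO rejects; the Thomas write rule just skips the obsolete write
        return 'ignored' if thomas else 'rejected'
    W_TS[x] = ts
    return 'executed'
```

For example, after `read(5, 'X')` and `write(7, 'X')`, a late `write(6, 'X')` is rejected under basic timestamp ordering but merely ignored under the Thomas write rule.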
9. DEADLOCK
A deadlock is a condition wherein two or more tasks are waiting for each other in order to finish, but none of the tasks is willing to give up the resources that the other tasks need. In this situation no task ever gets finished, and all of them wait forever.

For example: in the student table, transaction T1 holds a lock on some rows and needs to update some rows in the grade table. Simultaneously, transaction T2 holds locks on some rows in the grade table and needs to update the rows in the student table held by transaction T1.
Now the main problem arises: transaction T1 is waiting for T2 to release its lock and, similarly, transaction T2 is waiting for T1 to release its lock. All activities come to a halt and remain at a standstill until the DBMS detects the deadlock and aborts one of the transactions.

Coffman conditions
Coffman stated four conditions for a deadlock occurrence. A deadlock may occur if all the following conditions hold true.
Mutual exclusion condition: There must be at least one resource that cannot be used by more than one process at a time.
Hold and wait condition: A process that is holding a resource can request additional resources that are being held by other processes in the system.
No preemption condition: A resource cannot be forcibly taken from a process. Only the process can release a resource that it is holding.
Circular wait condition: A set of processes wait for each other in circular form, each waiting for a resource held by the next.

Deadlock Avoidance
o When a database is stuck in a deadlock state, it is better to avoid the deadlock rather than aborting or restarting the transaction.
o The deadlock avoidance mechanism is used to detect any deadlock situation in advance. A method like the "wait-for graph" is used for detecting deadlock situations, but this method is suitable only for smaller databases. For larger databases, the deadlock prevention method can be used.

Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect whether the transaction is involved in a deadlock or not. The lock manager maintains a wait-for graph to detect deadlock cycles in the database.

Wait-for Graph
o This is a suitable method for deadlock detection. In this method, a graph is created based on the transactions and their locks. If the created graph has a cycle or closed loop, then there is a deadlock.
o The wait-for graph is maintained by the system for every transaction which is waiting for some data held by the others.
I. Wait-Die scheme
In this scheme, if a transaction requests a resource which is already held with a conflicting lock by another transaction, then the DBMS simply checks the timestamps of both transactions. It allows the older transaction to wait until the resource is available for execution.
Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any transaction T. If Tj holds a lock on some resource and Ti is requesting that resource, then the following actions are performed by the DBMS:
1. Check if TS(Ti) < TS(Tj) – if Ti is the older transaction and Tj has held some resource, then Ti is allowed to wait until the data item is available for execution. That means that if the older transaction is waiting for a resource which is locked by the younger transaction, the older transaction is allowed to wait for the resource until it is available.
2. Otherwise, if Ti is the younger transaction and the resource is held by the older transaction Tj, then Ti is killed and restarted later with a random delay but with the same timestamp.

II. Wound-Wait scheme
o In the wound-wait scheme, if an older transaction requests a resource which is held by a younger transaction, then the older transaction forces the younger one to abort and release the resource. After a small delay, the younger transaction is restarted but with the same timestamp.
o If an older transaction has held a resource which is requested by a younger transaction, then the younger transaction is asked to wait until the older one releases it.

Both of these algorithms take transaction age into consideration while determining the best possible way of resource allocation for deadlock avoidance. One of the famous deadlock avoidance algorithms is Banker's algorithm.
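The two schemes differ only in what happens when the requester is older. A small sketch (function names are illustrative assumptions; a smaller timestamp means an older transaction):

```python
def wait_die(requester_ts, holder_ts):
    # older requester waits; younger requester dies
    # (it restarts later, keeping its original timestamp)
    return 'wait' if requester_ts < holder_ts else 'die'

def wound_wait(requester_ts, holder_ts):
    # older requester wounds (aborts) the younger holder;
    # younger requester waits
    return 'wound holder' if requester_ts < holder_ts else 'wait'
```

With the timestamps from the earlier example (T1 at 007, T2 at 009): under wait-die an older T1 requesting from T2 waits, while under wound-wait it aborts T2.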
Improvement in performance via parallelism:
Multiple I/O requests can be handled in parallel if the data required is on separate disks.
A single I/O operation can be handled in parallel if the data required is distributed across multiple disks.

Benefits of RAID
Data loss can be very dangerous for an organization.
RAID technology prevents data loss due to disk failure.
RAID technology can be implemented in hardware or software.
Servers make use of RAID technology.

RAID Level 0: Striping and non-redundant
RAID level 0 divides data into block units and writes them across a number of disks. As data is placed across multiple disks it is also called "data striping".
The advantage of distributing data over disks is that if different I/O requests are pending for two different blocks of data, then there is a possibility that the requested blocks are on different disks.
There is no parity checking of data, so if the data in one drive gets corrupted then all the data would be lost; thus RAID 0 does not support data recovery. Spanning is another term used with RAID level 0, because the logical disk spans all the physical drives. RAID 0 implementation requires a minimum of 2 disks.
Advantages
I/O performance is greatly improved by spreading the I/O load across many channels and drives.
Best performance is achieved when data is striped across multiple controllers with only one drive per controller.
Disadvantages
It is not fault-tolerant; failure of one drive will result in all data in the array being lost.

RAID Level 1: Mirroring (or shadowing)
Also known as disk mirroring, this configuration consists of at least two drives that duplicate the storage of data.

RAID Level 2:
This configuration uses striping across disks, with some disks storing error checking and correcting (ECC) information. It has no advantage over RAID 3 and is no longer used.

RAID Level 3: Bit-Interleaved Parity
A single parity bit is enough for error correction, not just detection, since we know which disk has failed.
– When writing data, the corresponding parity bits must also be computed and written to a parity-bit disk.
– To recover the data of a damaged disk, compute the XOR of the bits from the other disks (including the parity-bit disk).
Since every I/O operation addresses all the drives at the same time, RAID 3 cannot overlap I/O. For this reason, RAID 3 is best for single-user systems with long-record applications.

St. Joseph's College of Engineering/St. Joseph's Institute of Technology
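The XOR-based recovery described for RAID 3 can be demonstrated in a few lines. A minimal sketch (the two-byte blocks are made-up example data): the parity block is the XOR of the data blocks, and any single lost block is rebuilt by XOR-ing the survivors together with the parity block.

```python
from functools import reduce

def parity(blocks):
    """XOR corresponding bytes of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b'\x01\x02', b'\x0f\x00', b'\xff\x10']   # three data disks
p = parity(data)                                 # the parity disk

# disk 1 fails: rebuild its block from the surviving disks plus parity
rebuilt = parity([data[0], data[2], p])
```

Note that XOR-ing all blocks including parity always yields zero, which is exactly why one missing block can be recovered.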
RAID Level 5:
RAID 5 uses striping as well as parity for redundancy. It is well suited for heavy read and low write operations.
Block-Interleaved Distributed Parity: partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.

RAID Level 6:
This technique is similar to RAID 5, but includes a second parity scheme that is distributed across the drives in the array. The use of additional parity allows the array to continue to function even if two disks fail simultaneously. However, this extra protection comes at a cost.
P+Q Redundancy scheme: similar to Level 5, but stores extra redundant information to guard against multiple disk failures.
- Better reliability than Level 5, at a higher cost; not used as widely.

Fixed-length records:
- Record access is simple.
Example pseudocode:

type account = record
    account_number char(10);
    branch_name char(22);
    balance numeric(8);
end

Total bytes: 40 for a record.

Two problems:
- It is difficult to delete a record from this structure.
- Some records will cross block boundaries, that is, part of the record will be stored in one block and part in another. Such a record would require two block accesses to read or write.

Alternatives for reusing the free space left by a deleted record i:
– move records i + 1, . . ., n to i, . . ., n – 1
– do not move records, but link all free records on a free list
– move the final record to the deleted record's place.

Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on.
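The free-list idea can be sketched with a plain Python list standing in for the file (an assumption for illustration; slots hold either a record dict or the index of the next free slot, and `free_head` plays the role of the file header):

```python
EMPTY = None
records = [{'acct': 'A-101'}, {'acct': 'A-215'}, {'acct': 'A-305'}]
free_head = EMPTY          # file header: index of the first deleted record

def delete(i):
    global free_head
    records[i] = free_head   # deleted slot stores the previous head
    free_head = i

def insert(rec):
    global free_head
    if free_head is EMPTY:
        records.append(rec)              # no free slot: grow the file
    else:
        slot, free_head = free_head, records[free_head]
        records[slot] = rec              # reuse the freed slot in place

delete(1)                    # slot 1 joins the free list
insert({'acct': 'A-999'})    # reuses slot 1; no other records move
```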
There is, in general, no space for records that grow longer.

Slotted Page Structure
The slotted page header contains:
– the number of record entries
– the end of free space in the block
– the location and size of each record

Pointer Method
Deletion – use pointer chains.
Insertion – locate the position where the record is to be inserted:
– if there is free space, insert there
– if there is no free space, insert the record in an overflow block
– in either case, the pointer chain must be updated

Indexing and Hashing
Basic Concepts
Indexing mechanisms are used to speed up access to desired data.
– E.g., an author catalog in a library.
Search Key – an attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form:

    search-key | pointer

Index files are typically much smaller than the original file.
Dense index – an index record appears for every search-key value in the file.

Sparse Index Files
A sparse index contains index records for only some search-key values.
To locate a record with search-key value K we:
– Find the index record with the largest search-key value that is less than or equal to K.
– Search the file sequentially, starting at the record to which the index record points.

Multilevel Index
If the primary index does not fit in memory, access becomes expensive.
To reduce the number of disk accesses to index records, treat the primary index kept on disk as a sequential file and construct a sparse index on it.

Index Update: Deletion
If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.
Single-level index deletion:
– Dense indices – deletion of the search-key is similar to file record deletion.
– Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion
Single-level index insertion:
– Perform a lookup using the search-key value appearing in the record to be inserted.
– Dense indices – if the search-key value does not appear in the index, insert it.
– Sparse indices – if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.

Primary and Secondary Indices
Secondary indices have to be dense.
Indices offer substantial benefits when searching for records.
Secondary index on the balance field of account.
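The sparse-index lookup procedure ("largest entry <= K, then scan the block") maps directly onto a binary search. A small sketch (the branch names are made-up example data; each index entry pairs the first key of a block with the block's sorted records):

```python
from bisect import bisect_right

# (first search-key of block, block of sorted records)
index = [('Brighton',   ['Brighton', 'Clearview']),
         ('Downtown',   ['Downtown', 'Mianus']),
         ('Perryridge', ['Perryridge', 'Redwood'])]

def lookup(key):
    keys = [k for k, _ in index]
    i = bisect_right(keys, key) - 1       # largest index entry <= key
    if i < 0:
        return None                       # key precedes every entry
    block = index[i][1]
    return key if key in block else None  # sequential scan within the block
```

For example, `lookup('Mianus')` lands on the 'Downtown' entry and then finds 'Mianus' by scanning that block.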
B+-Tree Node Structure
Typical node:
– Ki are the search-key values.
– Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search-keys in a node are ordered: K1 < K2 < K3 < . . . < Kn–1.

Properties of Leaf Nodes
For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki.
Pn points to the next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
– All the search-keys in the subtree to which P1 points are less than K1.

Since the inter-node connections are made by pointers, "logically" close blocks need not be "physically" close.
The B+-tree contains a relatively small number of levels, thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently.

Updates on B+-Trees: Insertion
– Find the leaf node in which the search-key value would appear.
– If the search-key value is already there in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
– If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
– If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node; otherwise, split the node.
Example: B+-Tree before and after insertion of "Clearview"
Updates on B+-Trees: Deletion
• The removal of the leaf node containing "Downtown" did not result in its parent having too few pointers, so the cascaded deletions stopped with the deleted leaf node's parent.
• Deletion of "Perryridge" from the result of the previous example.

B-Tree Index Files
– In a B-tree, non-leaf node pointers Bi are the bucket or file record pointers.
– Figure: B+-tree on the same data, for comparison.
Advantages of B-Tree indices:
– May use fewer tree nodes than a corresponding B+-tree.
Disadvantages of B-Tree indices:
– Insertion and deletion are more complicated than in B+-trees.
– Implementation is harder than for B+-trees.

B+-Tree File Organization
• The leaf nodes in a B+-tree file organization store records, instead of pointers.
• Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a non-leaf node.
• Leaf nodes are still required to be half full.
• Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

HASHING
• Hashing is an effective technique for calculating the direct location of a data record on the disk without using an index structure.
• Hashing uses hash functions with search keys as parameters to generate the address of a data record.

Hash Organization
Bucket
A hash file stores data in bucket format. A bucket is considered a unit of storage. A bucket typically stores one complete disk block, which in turn can store one or more records.
Hash Function
A hash function, h, is a mapping function that maps the set of all search-keys K to the addresses where the actual records are placed. It is a function from search keys to bucket addresses.
The worst hash function maps all search-key values to the same bucket.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
An ideal hash function is random, so each bucket will have the same number of records.

Types
• Static Hashing
• Dynamic Hashing

Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if a mod-4 hash function is used, then it generates only 4 distinct values; the output address is always the same for a given key. The number of buckets provided remains unchanged at all times.

Example of Hash File Organization
There are 10 buckets.
The hash function returns the sum of the binary representations of the characters modulo 10.
– E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3.

Hash Indices
• Hashing can be used not only for file organization, but also for index-structure creation.
• A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
• Hash indices are always secondary indices.

Operation
Insertion − When a record is to be entered using static hashing, the hash function h computes the bucket address for search key K, where the record will be stored: Bucket address = h(K).
Search − When a record needs to be retrieved, the same hash function is used to retrieve the address of the bucket where the data is stored.
Delete − This is simply a search followed by a deletion operation.

Handling of Bucket Overflows
Bucket overflow can occur because of:
– Insufficient buckets
– Skew in the distribution of records. This can occur when multiple records have the same search-key value.
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list. This scheme is called closed hashing.
– An alternative, called open hashing, does not use overflow buckets and is not suitable for database applications.

Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
– Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
– If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially.
– If the database shrinks, again space will be wasted.
These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.

Dynamic Hashing
• Good for a database that grows and shrinks in size.
• Allows the hash function to be modified dynamically.
• Extendable hashing – one form of dynamic hashing:
– The hash function generates values over a large range, typically b-bit integers with b = 32.
– At any time, only a prefix of the hash value is used to index into a table of bucket addresses.
– Let the length of the prefix be i bits, 0 <= i <= 32. The bucket address table size is 2^i. Initially i = 0.
– The value of i grows and shrinks as the size of the database grows and shrinks.
– Multiple entries in the bucket address table may point to the same bucket, so the actual number of buckets is <= 2^i.
– The number of buckets also changes dynamically due to coalescing and splitting of buckets.
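One way to reproduce the example values h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3 is to sum the alphabet positions of the letters (a = 1, ..., z = 26, ignoring case and spaces) modulo the number of buckets. This interpretation of "sum of the binary representations of the characters" is an assumption, but it matches all three values:

```python
def h(key, buckets=10):
    # sum of letter positions (a=1 ... z=26), ignoring case and spaces,
    # taken modulo the number of buckets
    return sum(ord(c) - ord('a') + 1
               for c in key.lower() if c.isalpha()) % buckets
```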
General Extendable Hash Structure
In this structure, i2 = i3 = i, whereas i1 = i – 1.

Insertion in Extendable Hash Structure
To split a bucket j when inserting a record with search-key value Kj:
If i > ij (more than one pointer to bucket j):
– allocate a new bucket z, and set ij = iz = ij + 1
– update the second half of the bucket address table entries originally pointing to j, to point to z
– remove each record in bucket j and reinsert it (in j or z)
– recompute the new bucket for Kj and insert the record in that bucket (further splitting is required if the bucket is still full)
If i = ij (only one pointer to bucket j):
– if i reaches some limit b, or too many splits have happened in this insertion, create an overflow bucket
– else:
– increment i and double the size of the bucket address table
– replace each entry in the table by two entries that point to the same bucket
– recompute the new bucket address table entry for Kj; now i > ij, so use the first case above.

Deletion in Extendable Hash Structure
To delete a key value:
– locate it in its bucket and remove it.
– The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).
– Coalescing of buckets can be done (a bucket can coalesce only with a "buddy" bucket having the same value of ij and the same ij – 1 prefix, if such a bucket is present).
– Decreasing the bucket address table size is also possible.
• Note: decreasing the bucket address table size is an expensive operation and should be done only if the number of buckets becomes much smaller than the size of the table.
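The insertion procedure above can be exercised with a compact in-memory sketch. This is an illustrative assumption, not the notes' exact structure: it uses the low-order bits of the hash as the prefix (the figures in textbooks typically use the high-order bits) and a bucket capacity of 2.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth          # local depth i_j
        self.items = []

class ExtendableHash:
    SIZE = 2                        # records per bucket

    def __init__(self):
        self.i = 0                  # global depth
        self.table = [Bucket(0)]    # bucket address table

    def _slot(self, key):
        return hash(key) & ((1 << self.i) - 1)   # low i bits as the prefix

    def lookup(self, key):
        return key in self.table[self._slot(key)].items

    def insert(self, key):
        b = self.table[self._slot(key)]
        if len(b.items) < self.SIZE:
            b.items.append(key)
            return
        if b.depth == self.i:       # only one pointer to b: double the table
            self.table = self.table + self.table
            self.i += 1
        # split bucket b: entries whose next prefix bit is 1 move to z
        z = Bucket(b.depth + 1)
        b.depth += 1
        mask = 1 << (b.depth - 1)
        for j, bb in enumerate(self.table):
            if bb is b and j & mask:
                self.table[j] = z
        old, b.items = b.items, []
        for k in old + [key]:       # redistribute, splitting again if needed
            self._reinsert(k)

    def _reinsert(self, key):
        b = self.table[self._slot(key)]
        if len(b.items) < self.SIZE:
            b.items.append(key)
        else:
            self.insert(key)

ht = ExtendableHash()
for k in [0, 1, 2, 3, 4]:           # small ints hash to themselves in Python
    ht.insert(k)
```

After these five insertions the global depth has grown to 2 and every key is still reachable through the bucket address table.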
Example
Initial hash structure, bucket size = 2.
To look up a search-key value Kj, use the first i high-order bits of h(Kj) as an index into the bucket address table, and follow the pointer to the appropriate bucket.

Updates in Extendable Hash Structure
To insert a record with search-key value Kj:
– follow the same procedure as look-up and locate the bucket, say j.
– If there is room in bucket j, insert the record in the bucket.
– Overflow buckets are used instead in some cases.
To delete a key value:
– locate it in its bucket and remove it.
– The bucket itself can be removed if it becomes empty.
– Coalescing of buckets can be done.
– Decreasing the bucket address table size is also possible.

Hash structure after insertion of one Brighton and two Downtown records.
Hash structure after insertion of the Mianus record.
Hash structure after insertion of three Perryridge records.

Benefits of extendable hashing:
– Hash performance does not degrade with growth of the file.
– Minimal space overhead.
Disadvantages of extendable hashing:
– There is an extra level of indirection to find the desired record.
– The bucket address table may itself become very big.

QUERY PROCESSING OVERVIEW
1. The scanning, parsing, and validating module produces an internal representation of the query.
2. The query optimizer module devises an execution plan, which is the execution strategy to retrieve the result of the query from the database files. A query typically has many possible execution strategies differing in performance, and the process of choosing a reasonably efficient one is known as query optimization. Query optimization is beyond the scope of this course. The code generator generates the code to execute the plan. The runtime database processor runs the generated code to produce the query result.
(C represents the result returned from the inner block.)
The query optimizer would then choose an execution plan for each block.
The inner block needs to be evaluated only once (uncorrelated nested query).
It is much harder to optimize the more complex correlated nested queries.
External Sorting
It refers to sorting algorithms that are suitable for large files of records on disk that do not fit entirely in main memory, such as most database files. External sorting is used for:
– ORDER BY.
– Sort-merge algorithms for JOIN and other operations (UNION, INTERSECTION).
– Duplicate elimination algorithms for the PROJECT operation (DISTINCT).
A typical external sorting algorithm uses a sort-merge strategy:
Sort phase: create small sorted sub-files (sorted sub-files are called runs).
Merge phase: merge the sorted runs. An N-way merge uses N memory buffers to buffer input runs, and one block to buffer output. Select the first record (in the sort order) among the input buffers, write it to the output buffer and delete it from its input buffer. If the output buffer is full, write it to disk. If an input buffer becomes empty, read the next block from the corresponding run.
E.g. 2-way sort-merge.
Basic Algorithms for Executing Relational Query Operations
An RDBMS must include one or more alternative algorithms that implement each relational algebra operation (SELECT, JOIN, …) and, in many cases, that implement each combination of these operations.
Each algorithm may apply only to particular storage structures and access paths (such as an index, …).
Only execution strategies that can be implemented by the RDBMS algorithms, and that apply to the particular query and particular database design, can be considered by the query optimization module.
Algorithms for Implementing the SELECT Operation
These algorithms depend on the file having specific access paths and may apply only to certain types of selection conditions. We will use the following examples of SELECT operations:
– (OP1): σ SSN='123456789' (EMPLOYEE)
– (OP2): σ DNUMBER>5 (DEPARTMENT)
S3: Using a primary index (or hash key) to retrieve a single record
– The condition is an equality comparison on a key attribute with a primary index (or hash key).
– This condition retrieves a single record (at most).
– Cost: primary index: bind/2 + 1 (hash key: 1 bucket if no collision).
S4: Using a primary index to retrieve multiple records
– The comparison condition is >, >=, <, or <= on a key field with a primary index.
– (OP2): σ DNUMBER>5 (DEPARTMENT)
– Use the index to find the record satisfying the corresponding equality condition (DNUMBER=5), then retrieve all subsequent records in the (ordered) file.
– For the condition (DNUMBER<5), retrieve all the preceding records.
– The method is used for range queries too (i.e. queries to retrieve records in a certain range).
– Cost: bind/2 + ?. The '?' could be known if the number of duplicates is known.
S5: Using a clustering index to retrieve multiple records
– The selection condition involves an equality comparison on a non-key attribute with a clustering index.
– σ DNO=5 (EMPLOYEE)
– Use the index to retrieve all the records satisfying the condition.
– Cost: log2 bind + ?. The '?' could be known if the number of duplicates is known.
S6: Using a secondary (B+-tree) index on an equality comparison
– Cost to retrieve: a key = height + 1; a non-key = height + 1 (extra level) + ?; a comparison = (height − 1) + ? + ?.
Many search methods can be used for a complex selection which involves a conjunctive condition: S7 through S9.
– Conjunctive condition: several simple conditions connected with the AND logical connective.
– (OP4): σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE).
S7: Conjunctive selection using an individual index
– If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records.
– Then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.
S8: Conjunctive selection using a composite index
– If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined fields.
– Example: if an index has been created on the composite key (ESSN, PNO) of the WORKS_ON file,
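The sort phase / merge phase strategy described under External Sorting above can be sketched as follows. Python lists stand in for disk blocks and runs, and the `run_size` and `fan_in` parameters are illustrative choices for the sketch, not values fixed by the notes.

```python
import heapq

def external_sort(records, run_size=3, fan_in=2):
    """Sort-merge external sort sketch (lists stand in for on-disk runs)."""
    # Sort phase: cut the file into run_size chunks and sort each "in memory";
    # the sorted sub-files are the runs.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    # Merge phase: repeatedly fan_in-way merge the runs until one remains.
    # heapq.merge plays the role of the N buffered input runs: it always
    # emits the smallest head record among the runs being merged.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

With `fan_in=2` this is the 2-way sort-merge the notes mention; a real implementation reads and writes fixed-size blocks instead of whole in-memory lists.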
we can use the index directly for (OP5): σ ESSN='123456789' AND PNO=10 (WORKS_ON).
S9: Conjunctive selection by intersection of record pointers
– If secondary indexes are available on more than one of the fields involved in simple conditions in the conjunctive condition, and if the indexes include record pointers (rather than block pointers), then each index can be used to retrieve the set of record pointers that satisfy the individual condition.
– The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition.
– If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.
Algorithms for Implementing the JOIN Operation
Join is a time-consuming operation. We will consider only the natural join operation.
– Two-way join: a join on two files.
– Multiway join: a join involving more than two files.
The following examples of the two-way JOIN operation (R ⋈A=B S) will be used:
– OP6: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
– OP7: DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
J1: Nested-loop join (brute force)
– For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
J2: Single-loop join (using an access structure to retrieve the matching records)
– If an index (or hash key) exists for one of the two join attributes (e.g. B of S), retrieve each record t in R, one at a time (single loop), and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
J3: Sort-merge join
– If the records of R and S are physically sorted (ordered) by the values of the join attributes A and B, respectively, we can implement the join in the most efficient way.
– Both files are scanned concurrently in order of the join attributes, matching the records that have the same values for A and B.
– If the files are not sorted, they may be sorted first by using external sorting.
– Pairs of file blocks are copied into memory buffers in order, and the records of each file are scanned only once each for matching with the other file, if A and B are key attributes.
– The method is slightly modified in the case where A and B are not key attributes.
J4: Hash join
– The records of files R and S are both hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys.
Partitioning phase:
– First, a single pass through the file with fewer records (say, R) hashes its records to the hash file buckets.
– Assumption: the smaller file fits entirely into the memory buckets after the first phase.
– (If the above assumption is not satisfied, the method becomes more complex, and a number of variations have been proposed to improve efficiency: partition-hash join and hybrid hash join.)
Probing phase:
– A single pass through the other file (S) then hashes each of its records to probe the appropriate bucket, and that record is combined with all matching records from R in that bucket.
Heuristic-Based Query Optimization
1. Break up SELECT operations with conjunctive conditions into a cascade of SELECT operations.
2. Using the commutativity of SELECT with other operations, move each SELECT operation as far down the query tree as is permitted by the attributes involved in the select condition.
3. Using the commutativity and associativity of binary operations, rearrange the leaf nodes of the tree.
4. Combine a CARTESIAN PRODUCT operation with a subsequent SELECT operation in the tree into a JOIN operation, if the condition represents a join condition.
5. Using the cascading of PROJECT and the commuting of PROJECT with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new PROJECT operations as needed.
6. Identify sub-trees that represent groups of operations that can be executed by a single algorithm.
Example
Query: "Find the last names of employees born after 1957 who work on a project named 'Aquarius'."
SQL:
SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31';
Cost Estimation in Query Optimization
The main aim of query optimization is to choose the most efficient way of implementing the relational algebra operations at the lowest possible cost.
The query optimizer should not depend solely on heuristic rules; it should also estimate the cost of executing the different strategies and find the strategy with the minimum cost estimate.
The cost functions used in query optimization are estimates, not exact cost functions.
The cost of an operation is heavily dependent on its selectivity, that is, the proportion of the input records that forms the output.
In general, different algorithms are suitable for low-selectivity or high-selectivity queries.
In order for the query optimizer to choose a suitable algorithm for an operation, an estimate of the cost of executing that algorithm must be provided.
The cost of an algorithm depends on the cardinality of its input.
To estimate the cost of different query execution strategies, the query tree is viewed as containing a series of basic operations which are linked in order to perform the query.
It is also important to know the expected cardinality of an operation's output, because this forms the input to the next operation.
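The brute-force nested-loop join (J1) and the partition/probe hash join (J4) described above can be sketched as follows. Relations are modelled as lists of dicts, and the attribute names in the usage example are illustrative, not a fixed schema.

```python
def nested_loop_join(R, S, a, b):
    # J1: for each record t in R (outer loop), scan every record s of S
    # (inner loop) and test the join condition t[a] == s[b].
    return [{**t, **s} for t in R for s in S if t[a] == s[b]]

def hash_join(R, S, a, b):
    # J4, partitioning phase: hash the file with fewer records (assumed to
    # be R here) into in-memory buckets on its join attribute.
    buckets = {}
    for t in R:
        buckets.setdefault(t[a], []).append(t)
    # Probing phase: a single pass through S probes the matching bucket and
    # combines each record with every matching record of R found there.
    return [{**t, **s} for s in S for t in buckets.get(s[b], [])]
```

For an OP6-style join EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT both functions return the same rows, but the hash join makes one pass over each file instead of |R|·|S| comparisons.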
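As a toy illustration of how an optimizer compares cost estimates, the S3 formula quoted earlier (bind/2 + 1 block accesses for an equality comparison on a key with a primary index) can be set against a plain linear scan. The linear-scan formula and the file statistics below are invented for this example, not taken from the notes.

```python
def linear_scan_cost(b):
    # Brute-force selection: read every one of the b data blocks.
    # (Assumed baseline; this formula is not quoted in the notes above.)
    return b

def primary_index_cost(b_ind):
    # S3 from the notes: bind/2 + 1 block accesses for an equality
    # comparison on a key attribute with a primary index.
    return b_ind / 2 + 1

# Invented statistics: a 2,000-block file with a 20-block primary index.
b, b_ind = 2000, 20
plans = {
    "linear scan": linear_scan_cost(b),
    "S3 primary index": primary_index_cost(b_ind),
}
best = min(plans, key=plans.get)  # the optimizer picks the cheapest estimate
```

Here the index plan is estimated at 11 block accesses against 2,000 for the scan, which is why the access cost to secondary storage dominates the comparison.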
Cost Components of Query Execution
The cost of executing the query includes the following components:
– Access cost to secondary storage.
– Storage cost.
– Computation cost.
– Memory usage cost.
– Communication cost.
Importance of Access Cost
Of the above five cost components, the most important is the secondary storage access cost.
The emphasis of the cost minimization depends on the size and type of the database application.
For example, in smaller databases the emphasis is on minimizing the computation cost, because most of the data in the files involved in the query can be completely stored in main memory.
For large databases, the main emphasis is on minimizing the access cost to the secondary storage device.
For distributed databases, the communication cost is minimized, because many sites are involved in the data transfer.