Chapter 5 Fundamentals of Database System
Chapter 5: Record Storage and Primary File Organization
5.1. Introduction
The collection of data that makes up a computerized database must be stored physically
on some computer storage medium
Computer storage media form a storage hierarchy that includes two main categories:
Primary storage: this includes storage media that can be operated on directly by the
computer central processing unit (CPU), such as the computer main memory, and smaller
and faster cache memory.
Primary storage usually provide fast access to data but is of limited storage
capacity.
Secondary storage: this category includes magnetic disks, optical disks, and tapes. Thee device
usually have a large capacity, cost less and provide slower access to data than do primary storage
devices.
Data in secondary storage cannot be processed directly by the CPU. It must first be
copied into primary storage
Chapter 5 Fundamentals of Database System
A file: is a sequence of records stored in binary format.
A disk drive is formatted into several blocks that can store records.
File is a collection of records related to each other.
File Organization Storage
There are different ways of storing data in the database. Storing data in files is one of them. A user
can store the data in files in an organized manner.
Files are organized logically as a sequence of records and reside permanently on disks.
Each file is divided into fixed-length storage units known as Blocks.
These blocks are the units of storage allocation as well as data transfer.
The default block size in the database is 4 to 8 kilobytes, many databases allow specifying
the size at the time of creating the database instance.
Chapter 5 Fundamentals of Database System
5.2 Operations on Files
Operations on database files can be broadly classified into two categories −
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval operations,
on the other hand, do not alter the data but retrieve them after optional conditional filtering. In
both types of operations, selection plays a significant role. Other than creation and deletion of a
file, there could be several operations, which can be done on files.
Open − A file can be opened in one of the two modes, read mode or write mode. In read
mode, the operating system does not allow anyone to alter data. In other words, data is
read only. Files opened in read mode can be shared among several entities. Write mode
allows data modification. Files opened in write mode can be read but cannot be shared.
Locate − every file has a file pointer, which tells the current position where the data is to
be read or written. This pointer can be adjusted accordingly. Using find (seek) operation,
it can be moved forward or backward.
Read − By default, when files are opened in read mode, the file pointer points to the
beginning of the file. There are options where the user can tell the operating system where
to locate the file pointer at the time of opening a file. The very next data to the file pointer
is read.
Write − User can select to open a file in write mode, which enables them to edit its
contents. It can be deletion, insertion, or modification. The file pointer can be located at
the time of opening or can be dynamically changed if the operating system allows to do
so.
Close − this is the most important operation from the operating system’s point of view.
When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o Releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process to locate the file pointer
to a desired record inside a file various based on whether the records are arranged sequentially or
clustered.
File Organization
Chapter 5 Fundamentals of Database System
o The File is a collection of records. Using the primary key, we can access the records. The
type and frequency of access can be determined by the type of file organization which was
used for a given set of records.
o File organization is a logical relationship among various records. This method defines how
file records are mapped onto disk blocks.
o File organization is used to describe the way in which the records are stored in terms of
blocks, and the blocks are placed on the storage medium.
o The first approach to map the database to the file is to use the several files and store only
one fixed length record in any given file. An alternative approach is to structure our files
so that we can contain multiple lengths for records.
o Files of fixed length records are easier to implement than the files of variable length
records.
o File organization contains various methods.
o These particular methods have pros and cons on the basis of access or selection.
o In the file organization, the programmer decides the best-suited file organization method
according to his requirement
Objective of file organization
o It contains an optimal selection of records, i.e., records can be selected as fast as possible.
o To perform insert, delete or update transaction on the records should be quick and easy.
o The duplicate records cannot be induced as a result of insert, update or delete.
o For the minimal cost of storage, records should be stored efficiently.
5.3 Files of Unordered Records (Heap Files)
Heap file organization
o It is the simplest and most basic type of organization. It works with data blocks. In heap
file organization, the records are inserted at the file's end. When the records are inserted, it
doesn't require the sorting and ordering of records.
o When the data block is full, the new record is stored in some other block. This new data
block need not to be the very next data block, but it can select any data block in the memory
to store new records. The heap file is also known as an unordered file.
Chapter 5 Fundamentals of Database System
o In the file, every record has a unique id, and every page in a file is of the same size. It is
the DBMS responsibility to store and manage the new records.
Insertion of a new record
Suppose we have five records R1, R3, R6, R4 and R5 in a heap and suppose we want to insert a
new record R2 in a heap. If the data block 3 is full then it will be inserted in any of the database
selected by the DBMS, let's say data block 1.
Chapter 5 Fundamentals of Database System
If we want to search, update or delete the data in heap file organization, then we need to traverse
the data from staring of the file till we get the requested record.
If the database is very large then searching, updating or deleting of record will be time-consuming
because there is no sorting or ordering of records. In the heap file organization, we need to check
all the data until we get the requested record.
Pros of Heap file organization
o It is a very good method of file organization for bulk insertion. If there is a large number
of data which needs to load into the database at a time, then this method is best suited.
o In case of a small database, fetching and retrieving of records is faster than the sequential
record.
Cons of Heap file organization
o This method is inefficient for the large database because it takes time to search or modify
the record.
Chapter 5 Fundamentals of Database System
o This method is inefficient for large databases.
5.4 Files of Ordered Records (Sorted Files)
Sequential File Organization
This method is the easiest method for file organization. In this method, files are stored sequentially.
Sorted File Method:
o In this method, the new record is always inserted at the file's end, and then it will sort the
sequence in ascending or descending order. Sorting of records is based on any primary key
or any other key.
o In the case of modification of any record, it will update the record and then sort the file,
and lastly, the updated record is placed in the right place.
Insertion of the new record:
Suppose there is a preexisting sorted sequence of four records R1, R3 and so on up to R6 and R7.
Suppose a new record R2 has to be inserted in the sequence, then it will be inserted at the end of
the file, and then it will sort the sequence.
Chapter 5 Fundamentals of Database System
Pros of sequential file organization
o It contains a fast and efficient method for the huge amount of data.
o In this method, files can be easily stored in cheaper storage mechanism like magnetic tapes.
o It is simple in design. It requires no much effort to store the data.
o This method is used when most of the records have to be accessed like grade calculation
of a student, generating the salary slip, etc.
o This method is used for report generation or statistical calculations.
Cons of sequential file organization
o It will waste time as we cannot jump on a particular record that is required but we have to
move sequentially which takes our time.
o Sorted file method takes more time and space for sorting the records.
5.5 Hashing Techniques
Hash File Organization
Chapter 5 Fundamentals of Database System
Hash File Organization uses the computation of hash function on some fields of the records. The
hash function's output determines the location of disk block where the records are to be placed.
When a record has to be received using the hash key columns, then the address is generated, and
the whole record is retrieved using that address. In the same way, when a new record has to be
inserted, then the address is generated using the hash key and record is directly inserted. The same
process is applied in the case of delete and update.
In this method, there is no effort for searching and sorting the entire file. In this method, each
record will be stored randomly in the memory.
Chapter 5 Fundamentals of Database System
5.6. Index Structure for Files
Indexing in DBMS
o Indexing is used to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a database
table quickly.
Clustered index and non-clustered index
A non-clustered index is a data structure that improves the speed of data retrieval from
tables. Unlike a clustered index.
A non-clustered index sorts and stores data separately from the data rows in the table. It
is a copy of selected columns of data from a table with the links to the associated table.
Similar to a clustered index, a non-clustered index uses the B-tree structure to organize its
data.
A table may have one or more non clustered indexes and each non-clustered index may
include one or more columns of the table.
Chapter 5 Fundamentals of Database System
SQL Server CREATE INDEX statement
To create a non-clustered index, you use the CREATE INDEX statement:
CREATE [NONCLUSTERED] INDEX index_name
ON table_name(column_list);
Code language: SQL (Structured Query Language) (sql)
In this syntax:
First, specify the name of the index after the CREATE NONCLUSTERED INDEX clause. Note that
the NONCLUSTERED keyword is optional.
Second, specify the table name on which you want to create the index and a list of columns of
that table as the index key columns.
SQL Server CREATE INDEX statement examples
We will use the sales.customers from the sample database for the demonstration.
The sales.customers table is a clustered table because it has a primary key customer_id.
A) Using the SQL Server CREATE INDEX statement to create a nonclustered index for one
column example
This statement finds customers who locate in Atwater:
SELECT
customer_id,
city
FROM
sales.customers
WHERE
Chapter 5 Fundamentals of Database System
city = 'Atwater';
Code language: SQL (Structured Query Language) (sql)
If you display the estimated execution plan, you will see that the query optimizer scans the
clustered index to find the row. This is because the sales.customers table does not have an index for
the city column.
To improve the speed of this query, you can create a new index named ix_customers_city for
the city column:
CREATE INDEX ix_customers_city
ON sales.customers(city);
Code language: SQL (Structured Query Language) (sql)
Now, if you display the estimated execution plan of the above query again, you will find that the
query optimizer uses the nonclustered index ix_customers_city:
B) Using SQL Server CREATE INDEX statement to create a nonclustered index for multiple
columns example
The following statement finds the customer whose last name is Berg and first name is Monika:
SELECT
customer_id,
first_name,
last_name
FROM
sales.customers
WHERE
last_name = 'Berg' AND
first_name = 'Monika';
Code language: SQL (Structured Query Language) (sql)
Chapter 5 Fundamentals of Database System
The query optimizer scans the clustered index to locate the customer.
To speed up the retrieval of data, you can create a nonclustered index that includes
both last_name and first_name columns:
CREATE INDEX ix_customers_name
ON sales.customers(last_name, first_name);
Code language: SQL (Structured Query Language) (sql)
Now, the query optimizer uses the index ix_customers_name to find the customer.
SELECT
customer_id,
first_name,
last_name
FROM
sales.customers
WHERE
last_name = 'Berg' AND
first_name = 'Monika';
Code language: SQL (Structured Query Language) (sql)
When you create a nonclustered index that consists of multiple columns, the order of the
columns in the index is very important. You should place the columns that you often use to query
data at the beginning of the column list.
For example, the following statement finds customers whose last name is Albert. Because
the last_name is the leftmost column in the index, the query optimizer can leverage the index and
uses the index seek method for searching:
SELECT
customer_id,
first_name,
last_name
FROM
sales.customers
WHERE
Chapter 5 Fundamentals of Database System
last_name = 'Albert';
Code language: SQL (Structured Query Language) (sql)
This statement finds customers whose first name is Adam. It also leverages
the ix_customer_name index. But it needs to scan the whole index for searching, which is slower
than index seek.
SELECT
customer_id,
first_name,
last_name
FROM
sales.customers
WHERE
first_name = 'Adam';
Code language: SQL (Structured Query Language) (sql)
Therefore, it is a good practice to place the columns that you often use to query data at the
beginning of the column list of the index.
In this tutorial, you have learned about the non-clustered indexes and how to use the SQL
Server CREATE INDEX statement to create non-clustered indexes for tables to improve the speed
of data retrieval.
Indexed sequential access method (ISAM)
ISAM method is an advanced sequential file organization. In this method, records are stored in the
file using the primary key. An index value is generated for each primary key and mapped with the
record. This index contains the address of the record in the file.
Chapter 5 Fundamentals of Database System
If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
o In this method, each record has the address of its data block, searching a record in a huge
database is quick and easy.
o This method supports range retrieval and partial retrieval of records. Since the index is
based on the primary key values, we can retrieve the data for the given range of value. In
the same way, the partial value can also be easily searched, i.e., the student name starting
with 'JA' can be easily searched.
Cons of ISAM
o This method requires extra space in the disk to store the index value.
o When the new records are inserted, then these files have to be reconstructed to maintain
the sequence.
o When the record is deleted, then the space used by it needs to be released. Otherwise, the
performance of the database will slow down.
Chapter 5 Fundamentals of Database System
5.7 Types of Single Level Ordered Index
Indexing is a data structure technique to efficiently retrieve records from the database files based
on some attributes on which the indexing has been done. Indexing in database systems is similar
to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
Primary Index − Primary index is defined on an ordered data file. The data file is ordered
on a key field. The key field is generally the primary key of the relation.
Secondary Index − Secondary index may be generated from a field which is a candidate
key and has a unique value in every record, or a non-key with duplicate values.
Clustering Index − Clustering index is defined on an ordered data file. The data file is
ordered on a non-key field.
Ordered Indexing is of two types −
Dense Index
Sparse Index
Dense Index
In dense index, there is an index record for every search key value in the database. This makes
searching faster but requires more space to store index records itself. Index records contain search
key value and a pointer to the actual record on the disk.
Sparse Index
In sparse index, index records are not created for every search key. An index record here contains
a search key and an actual pointer to the data on the disk. To search a record, we first proceed by
index record and reach at the actual location of the data. If the data we are looking for is not where
we directly reach by following the index, then the system starts sequential search until the desired
data is found.
Chapter 5 Fundamentals of Database System
5.8 Dynamic Multilevel indexes using B-Trees and B+ Trees
Because a single-level index is an ordered file, we can create a primary index to the index
itself; in this case, the original index file is called the first-level index and the index to the
index is called the second-level index.
We can repeat the process, creating a third, fourth, top level until all entries of the top level
fit in one disk block
A multi-level index can be created for any type of first
Level index (primary, secondary, clustering) as long as the first-level index consists of
more than one disk bloc
A multi-level index is a form of search tree; however, insertion and deletion of new index
entries is a severe problem because every level of the index is an ordered file
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-
tree data structures, which leave space in each tree node (disk block) to allow for new index
entries
These data structures are variations of search trees that allow efficient insertion and
deletion of new search values.
In B-Tree and B+-Tree data structures, each node corresponds to a disk block
Each node is kept between half-full and completely full
An insertion into a node that is not full is quite efficient; if a node is full the insertion causes
a split into two nodes
Splitting may propagate to other tree levels
A deletion is quite efficient if a node does not become less than half full
If a deletion causes a node to become less than half full, it must be merged with
neighboring nodes
Chapter 5 Fundamentals of Database System
5.9 Indexes on Multiple Indexes
Multicolumn Indexes
Multicolumn indexes (also known as composite indexes) are similar to standard indexes. They
both store a sorted “table” of pointers to the main table. Multicolumn indexes however can store
additional sorted pointers to other columns.
Syntax
CREATE INDEX [index name]
ON [Table name]([column1, column2, column3,...]);
What is a multicolumn index?
Multicolumn indexes are indexes that store data on up to 32 columns. When creating a
multicolumn index, the column order is very important. This is due to the structure that
multicolumn indexes possess. Multicolumn indexes are structured to have a hierarchical
structure. Take for example this table:
Chapter 5 Fundamentals of Database System
The index points back to the table and is sorted by year. Adding a second column to the index
looks like this:
Now the index has pointers to a secondary reference table that is sorted by make. Adding a third
column to the index causes the index to look like this:
In a three column index we can see that the main index stores pointers to both the original
table and the reference table on make, which in turn has pointers to the reference table on
model.
When the multicolumn index is accessed, the main portion of the index (the index on the
first column) is accessed first. Each entry in the main index has a reference to the row‘s
location in the main table.
The main index also has a pointer to the secondary index where the related make is
stored.
The secondary index in term has a pointer to the tertiary index.
Because of this pointer ordering, in order to access the secondary index, it has to be done
through the main index. This means that this multicolumn index can be used for queries
Chapter 5 Fundamentals of Database System
that filter by just year, year and make, or year, make, and model. However, the
multicolumn index cannot be used for queries just on the make or model of the car
because the pointers are inaccessible.
For example:
Create NonClustered Index IX_IndexName On TableName
(Column1 Asc, Column2 Asc, Column3 Asc)
Versus:
Create NonClustered Index IX_IndexName1 On TableName
(Column1 Asc)
Create NonClustered Index IX_IndexName2 On TableName
(Column2 Asc)
Create NonClustered Index IX_IndexName3 On TableName
(Column3 Asc)