Chapter 5: Storage & Indexing
Contents
Data on External (Permanent) Storage
File Organizations & Indexing
Tree Based Index Structures
    ISAM – Static Index structure
    B+ Trees – Dynamic Index structure
Hash Based Index Structures
    Static Hashing
    Extendible Hashing - Dynamic Hashing
    Linear Hashing – Dynamic Hashing
Data on External (Permanent) Storage
The File Manager component is responsible for tracking which pages belong to a file, tracking free pages, allocating new pages to a file as more records are inserted, tracking free space within the allocated pages of the file, and so on.
For processing (retrieving or updating values), data is fetched into main memory. A page is the minimum unit of data that is fetched into main memory for processing and eventually saved back to the hard disk. The Buffer Manager component is responsible for bringing the relevant pages into main memory and writing them back to the hard disk after processing.
During any database operation, the time spent on disk I/O is generally much higher than the time spent actually processing the data in main memory. So, the way data is organized on the hard disk is crucial for the overall performance of the database system.
Hard disks (random access devices) are used for permanent storage, as they are far more efficient than tapes (sequential access devices). The structure of a hard disk is as follows:
Figure: Structure of a hard disk
The disk arm moves the disk head to the required location on the magnetic platter, where the head reads/writes the data.
A block is the minimum unit of data that gets transferred to/from main memory; it is typically 4 KB or 8 KB in size. In DBMS terms, it corresponds to a page of data. A block contains sectors, which contain the actual data records.
Every record has a unique rid (record id). The rid identifies the location (address) of the record's storage area on the hard disk. So, retrieving a record from the hard disk involves determining the disk address, moving the disk head to that address, and finally transferring that block/page of data into main memory.
Data records that are closely related to each other are therefore better stored close to each other on the hard disk. Otherwise, multiple pages have to be read into main memory, which delays the overall operation. Also, the less disk head movement involved, the faster the overall I/O operation.
Database systems use RAID (Redundant Arrays of Independent Disks) instead of single hard disks, for the following two main reasons:
1. The same data is stored on multiple disks (redundant disks) in parallel, as high reliability is crucial. Even if one of the disks fails (bad sectors or any other failure), a copy is available in the same system.
2. Data is striped (divided) across multiple disks, to make retrieval operations as efficient as possible.
Tapes (sequential access devices) are used only for periodic backups, as they are cheaper.
File Organizations & Indexing
The key questions in organizing data in files are:
1. How are the data records organized in the files on the hard disk?
2. How are the pages of a file allocated or organized for storing the records?
3. How are the records organized within pages?
An index is a data structure used to access the data in a file efficiently. The two types of indexing are tree-based indexing and hash-based indexing, described in the following sections.
An index is based on a certain search key – a column or set of columns used for accessing the record(s). For any given table or file, multiple indexes can be defined.
The rid (record id) carries the information about where the actual data record resides on the hard disk.
Clustered Index – An index is referred to as a clustered index in either of the following cases: the leaf pages of the index contain the actual data records themselves, or the data records are stored in the same order as the data entries of the index. When a clustered index exists, the number of disk I/O operations needed is lower.
Unclustered Index – If the order of the data records does not match the order of the data entries, the index is called an unclustered index. With an unclustered index, more disk I/O operations may be needed.
Primary Index – If an index is built on the primary key attribute (or a superkey containing the primary key), it is referred to as a primary index. Generally, the primary index is also the clustered index, which helps retrieve data efficiently when searching by the primary key.
Unique Index – If an index is built on a candidate key or the primary key, it is referred to as a unique index. For a given search key value, only one record exists in the table.
Secondary Index – If an index is built on non-key attributes, it is referred to as a secondary index. In general, multiple records may match any given search key value – for example, the Age attribute in a Students table.
Database systems automatically create indexes on the primary key and candidate keys of each table. Any additional indexes needed, based on frequent search/access patterns, have to be defined explicitly by the DBA (using CREATE INDEX).
The combination of file organization and type of indexing used makes certain operations (or data access patterns) efficient, while others become very time consuming. The types of data operations considered for evaluating and comparing the various file organizations / indexing structures are:
1. Accessing/searching for a particular record based on a given column value – e.g., a student with a particular rollno (key type).
2. Accessing a set of records that match a particular column value – e.g., the set of students of age 21 (non-key type).
3. Accessing records in a particular range – e.g., the set of students whose roll numbers or ages fall within a given range.
4. Scanning the entire table, i.e., accessing all the records of the table.
5. Inserting new record(s) into the table.
6. Deleting existing record(s) from the table.
1. Heap (unordered or unsorted) file organization – Pages are allocated randomly. Records are placed in pages wherever space is available; no particular sort order is maintained. A heap file is implemented either by maintaining a pointer from each page to the next, or by keeping a directory of all pages in the file. The directory contains links to all pages; if a directory page fills up, another directory page is added, and a list of directory pages is maintained (see the sketch after this list).
2. Sorted file (sequential file) organization – All the records are maintained in some sorted order within the pages of the file.
3. Hashed file organization – Hashing the search key value gives the bucket into which the data record is placed. The same technique is used for locating that record later. This can be based on a static hashing technique or a dynamic hashing technique, as described in later sections.
4. Indexed file organization – A primary-key-based index structure, like a B+ tree, is used. The actual data records are placed in the leaf nodes of this index structure. The records are close to sequential order: instead of a complete sorted order of all records, there are bunches of sorted records. The leaf pages are kept in sorted order, either by linking each page to the next (as in a B+ tree) or by allocating them contiguously (as in ISAM). This also makes range-based queries very efficient.
5. Clustered file organization – Records from multiple tables that are frequently joined are stored on the same block. This reduces the number of data blocks read / disk I/O operations overall.
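As referenced in item 1 above, here is a minimal sketch of a directory-based heap file. The class names, page capacities, and free-space check are illustrative assumptions, not part of any particular DBMS:

```python
# Minimal sketch of a directory-based heap file (illustrative assumptions).
PAGE_CAPACITY = 4          # records per data page (arbitrary choice)
DIR_CAPACITY = 8           # entries per directory page (arbitrary choice)

class DataPage:
    def __init__(self):
        self.records = []
    def has_space(self):
        return len(self.records) < PAGE_CAPACITY

class DirectoryPage:
    """Holds links to data pages; directory pages themselves form a linked list."""
    def __init__(self):
        self.entries = []          # links to DataPage objects
        self.next = None           # next directory page, added when this one fills up

def insert_record(first_dir_page, record):
    """Place the record in any page with free space; no sort order is kept."""
    dir_page = first_dir_page
    while True:
        for page in dir_page.entries:
            if page.has_space():
                page.records.append(record)
                return
        if dir_page.next is None:          # no free space anywhere in the file
            break
        dir_page = dir_page.next
    # Allocate a new data page and register it in the directory.
    new_page = DataPage()
    new_page.records.append(record)
    if len(dir_page.entries) == DIR_CAPACITY:      # directory page full: add another
        dir_page.next = DirectoryPage()
        dir_page = dir_page.next
    dir_page.entries.append(new_page)
```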
Comparing the file organizations with respect to the above operations:
4. Indexed file organization
a. Inserting a record is much better than in a sorted file. The page into which the insertion should be made is identified using the tree-based index structure. If the page has sufficient room, the record is inserted there. Even if shifting of records is involved, it is scoped to that page of the file only (not the entire file). If the page is already full, that page of records is split into two pages and the change is propagated up the tree structure (B+ tree), or an overflow page is added (ISAM).
b. Deleting a record is similar and equally efficient. The appropriate leaf node is identified, and within that page the record is searched for and deleted.
c. Searching for a record is efficient, as it just involves traversing the tree from the root to the appropriate leaf level, where the record can be found.
d. Range-based searches involve finding the first record matching the criteria and then scanning sequentially from there until the range is exhausted. So this is also very efficient.
e. As this file organization is efficient for insertion, deletion, search on a single key, and range-based search, it is what database systems generally use. The primary key of the table is used for building the tree index structure.
5. Clustered file organization
a. Records from multiple tables that are frequently joined are stored on the same block, so a join reads fewer data blocks / needs fewer disk I/O operations overall.
Tree Based Index Structures
The tree structure is height balanced, i.e., the path length from the root node to any leaf node is the same. The non-leaf levels of the tree direct the search towards the correct leaf node. Non-leaf nodes contain a set of search keys and pointers to child nodes: if there are N search keys, there are N+1 child pointers.
The leaf level contains the data entries. In the above example, the data entries are the actual data records themselves (so this is an example of a clustered index). The leaf level is a doubly linked list, so once the correct leaf is reached, all the relevant data entries can be traversed sequentially from there. This is what helps range-based search queries (like people with age > 34 and age < 50).
The average number of children of a non-leaf node is referred to as the fan-out of the tree. If N is the fan-out and h is the height of the tree, then N^h is the number of leaf pages. In the real world, the fan-out is typically around 100 and the height of the tree around 4. With this, 100^4 = 100 million leaf pages are available for holding data entries (either the data records themselves or <SearchKey, rid or rid-list> pairs).
Inside a non-leaf node, binary search is used to find the relevant child pointer based on the search key. A binary search over 100 entries is not time consuming, and 3 or 4 such searches to reach the relevant data record among 100 million pages of data records is very efficient.
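A quick back-of-the-envelope check of the fan-out arithmetic above, using the illustrative values of fan-out 100 and height 4 from the text:

```python
# Back-of-the-envelope check of the fan-out arithmetic above.
import math

fan_out = 100   # illustrative: average number of children per non-leaf node
height = 4      # illustrative: number of levels traversed from root to leaves

leaf_pages = fan_out ** height
print(f"leaf pages: {leaf_pages:,}")                    # 100,000,000 (100 million)

# One lookup does a binary search in one node per level on the way down.
comparisons_per_node = math.ceil(math.log2(fan_out))    # about 7 comparisons
print(f"comparisons per lookup: about {height * comparisons_per_node}")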
To use a B+ tree index, the actual data records need not be in sorted order.
Here the data entries are <SearchKey, rid> pairs. In the example above, the order of the data entries matches the order of the actual data records. Such a clustered-index-based search is more efficient than an unclustered-index-based search, as it needs fewer page I/O operations.
Here again the data entries are <SearchKey, rid> pairs, but in the example above the order of the data records is not aligned with the data entries. A search based on this index needs more page I/Os, as the matching records are scattered over multiple pages; so it is less efficient than a clustered-index-based search.
ISAM – Static Index structure
All the leaf nodes (the primary pages above) are allocated sequentially on the hard disk. These leaf nodes contain either the data records themselves or <SearchKey, rid or rid-list> pairs. Initially, all the data records are placed in sequential order.
The non-leaf nodes direct the search towards the correct leaf node. All the non-leaf pages are fixed once they are created; this is a static tree structure.
This indexing structure is suitable when inserts/deletes are infrequent. During an insert, if there is no space left in the leaf page, an additional overflow page is added, though it is not a contiguous page/block on the hard disk; a link to the overflow page is maintained from the primary page.
In the figure, the leaf entries marked with * (like 10*) indicate the data records.
With the insertion of records with keys 23, 48, 41, 42… the tree becomes:
To accommodate some insertions, about 20% of each leaf page is kept free during the initial creation of the tree itself.
The fact that the non-leaf pages are static makes the locking mechanism simpler: there is no need to lock the non-leaf pages, as their content is fixed once created, and only the leaf nodes need locking for concurrency control. This is an advantage of the ISAM structure over B+ trees. ISAM is suitable for situations where insertions are rare once the ISAM index is created.
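A minimal sketch of ISAM-style leaf pages with overflow chains, assuming an in-memory list per page and an arbitrary page capacity (illustrative names, not any particular system's API):

```python
# Minimal sketch of ISAM leaf pages with overflow chains. Primary leaf pages
# are fixed at creation time; inserts that do not fit go into linked overflow pages.

class LeafPage:
    def __init__(self, capacity=4):
        self.entries = []        # sorted (key, rid) pairs in this page
        self.capacity = capacity
        self.overflow = None     # link to an overflow LeafPage, if any

    def insert(self, key, rid):
        if len(self.entries) < self.capacity:
            self.entries.append((key, rid))
            self.entries.sort()              # shifting is scoped to this page only
        else:
            if self.overflow is None:
                self.overflow = LeafPage(self.capacity)
            self.overflow.insert(key, rid)   # chain of overflow pages

    def search(self, key):
        # Search the primary page, then walk the overflow chain.
        page = self
        while page is not None:
            for k, rid in page.entries:
                if k == key:
                    return rid
            page = page.overflow
        return None
```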
B+ Trees – Dynamic Index structure
In B+ trees, modifications to the non-leaf nodes are allowed, while maintaining the height balance.
Leaf nodes contain either the data records themselves or <SearchKey, rid or rid-list> pairs. A doubly linked list connects the leaf nodes, to help traverse to the next node; this is what helps in range-based searches (like employees in the age group 25 to 35).
The non-leaf nodes direct the search towards the correct leaf node. A sample non-leaf node:
If N search keys are present, there are N+1 pointers to child nodes. In all non-leaf nodes, at least 50% occupancy is guaranteed; the only exception is the root node, whose occupancy can be lower than 50%. Every node (except the root) contains m entries, where d <= m <= 2d; d is referred to as the order of the B+ tree.
For concurrency control, non-leaf nodes also have to be locked, as they may need updates during insertions and deletions. This is an overhead of B+ trees compared to the ISAM tree structure.
Search
Searching for any key value begins at the root node.
At every non-leaf node, if the search key is lower than the very first key, the pointer to its left is followed to reach the correct child node. If the search key is greater than the last key in the node, the pointer to its right is followed. Otherwise, binary search is used to find the position where K_i <= K < K_(i+1), and the pointer between those two keys is followed to reach the correct child node.
Once the leaf node is reached, it is searched for the matching record or the rid entry (which points to the data record). As the leaf nodes form a doubly linked list, from here we can traverse to the next leaf node and continue; this is what helps range-based searches.
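A minimal sketch of B+ tree search and range search, assuming an illustrative Node structure with sorted keys, child pointers, and leaf-level sibling links:

```python
from bisect import bisect_right

# Minimal sketch of B+ tree search (illustrative structure, not a full
# implementation). A non-leaf node holds `keys` and `children` with
# len(children) == len(keys) + 1; a leaf holds sorted (key, rid) entries
# and a link to the next leaf.

class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []          # search keys, kept sorted
        self.children = []      # child nodes (non-leaf only)
        self.entries = []       # (key, rid) pairs (leaf only)
        self.next_leaf = None   # right sibling (leaf only)

def find_leaf(root, key):
    """Walk from the root to the leaf that should contain `key`."""
    node = root
    while not node.is_leaf:
        # Follow the pointer for the range K_i <= key < K_(i+1).
        i = bisect_right(node.keys, key)
        node = node.children[i]
    return node

def range_search(root, low, high):
    """Return all rids with low <= key <= high, using the leaf-level links."""
    results = []
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, rid in leaf.entries:
            if k > high:
                return results
            if k >= low:
                results.append(rid)
        leaf = leaf.next_leaf
    return results
```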
Insert
To find the appropriate position for the insertion at the leaf level, the same search mechanism is followed.
If space is available in the leaf node, the insert is made directly.
If no space is left in the leaf node, its contents are split into two leaf nodes and the doubly linked list is updated accordingly. Because of the newly introduced leaf node, the parent non-leaf node needs an update: a new search key and an additional pointer (to the new leaf node) are inserted at the appropriate position. This requires right-shifting the subsequent keys and pointers within that node to make room for the new key and pointer, but with a typical fan-out of around 100 keys/pointers this shifting is not expensive.
If there is no space left in the parent non-leaf node either, it too is split into two non-leaf nodes and a similar change is propagated to its parent level. This continues up to the root node; if the root is also already full, it is split into two halves and a new root is created.
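A minimal sketch of the leaf-split step on insert, assuming a sorted in-memory list of (key, rid) entries and an arbitrary capacity; non-leaf splits follow the same pattern one level up:

```python
# Minimal sketch of splitting a full B+ tree leaf on insert (illustrative only).
from bisect import insort

def insert_into_leaf(leaf_entries, entry, capacity):
    """Insert (key, rid) into a sorted leaf; split if it overflows.

    Returns (left, right, separator_key): right and separator_key are None
    when no split was needed; otherwise separator_key is the key to push
    into the parent node along with a pointer to `right`.
    """
    insort(leaf_entries, entry)                 # keep the leaf sorted
    if len(leaf_entries) <= capacity:
        return leaf_entries, None, None

    mid = len(leaf_entries) // 2
    left = leaf_entries[:mid]
    right = leaf_entries[mid:]
    separator_key = right[0][0]                 # copied up into the parent
    return left, right, separator_key

# Example: a leaf with capacity 4 overflows and splits.
left, right, sep = insert_into_leaf([(5, 'r5'), (12, 'r12'), (20, 'r20'),
                                     (27, 'r27')], (16, 'r16'), capacity=4)
print(left, right, sep)   # separator key 16 goes up to the parent node
```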
Delete
Delete is very similar to insert. First the tree is traversed from the root to the appropriate leaf, where the record is present; it is then deleted from the leaf node.
If this happens to be the last record in the node/page, that node/page can be deleted altogether and its parent updated accordingly: the corresponding pointer is removed from the parent, and all the search keys and pointers after it are shifted forward by one position. Generally, even if the occupancy of a node falls below 50%, it is not merged with its siblings, since insertions are expected to happen again in the future.
Hash Based Index Structures
Two examples of hash-based indexing:
1. Hashing on the age value (search key), using a simple hash function such as taking the last 2 bits of the binary representation of the age value. This divides the data into just 4 primary buckets. Here, the data entries are the data records themselves.
2. Hashing on the sal value (search key), using the same hash function (last 2 bits of the binary representation). Here, the data entries are <SearchKey, rid> pairs; the rid points to where the actual data record is located on the hard disk.
Once a primary bucket/page fills up, additional overflow pages can be added as needed.
Search involves applying the hash function, getting to the correct page, and searching for the record within that page or in one of its overflow pages. Within a bucket, the records may or may not be maintained in sorted order (both choices are feasible).
Insert involves getting to the correct page, as in search, and then inserting the record into that page. If the page is already full, an overflow page is introduced and the record is added there. If records are maintained in sorted order, insertion involves right-shifting all the subsequent records within that bucket/page and inserting the record at the correct position.
Delete involves getting to the correct page, as in search, finding the record within that page/bucket, and deleting it. If records are maintained in sorted order, all the subsequent records are shifted one position forward after the deletion.
In general, the hash function should be selected such that all possible records are distributed as uniformly as possible, avoiding overflow buckets as much as possible. If an inappropriate hash function is used, some buckets may overflow heavily while others remain nearly empty; this is referred to as skew.
For range-based searches, this hash indexing structure is not suitable at all. The hashes of two adjacent search key values need not map to adjacent pages/buckets, so range-based queries involve searching for each value in the range one by one. (Tree-based indexing suits range-based queries.)
Hash-based indexing is highly suitable for natural join operations, as they involve searching for matching records with a particular search key value, one value at a time. (Tree-based indexing needs a traversal from the root of the tree to a leaf for each search key value.)
Static Hashing
The primary buckets are fixed and allocated sequentially/contiguously on the hard disk, similar to the ISAM tree structure. For N primary buckets, a typically used hash function is of the form (a * searchKeyValue + b) mod N, where the constants a and b are tuned to distribute the possible records across the N buckets as uniformly as possible.
If a primary page/bucket fills up, overflow pages are added as needed. The more overflow pages are added, the less efficient the search, insert and delete operations become, as they also have to look for the record in the overflow pages. To overcome this, one option is to rehash the entire data into a larger number of primary buckets, but this rehashing takes time and could cause operational downtime in the database system.
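A minimal sketch of static hashing with overflow pages; the values of N, a, b and the page capacity are arbitrary illustrative choices:

```python
# Minimal sketch of static hashing with overflow pages (illustrative values).
N = 8                      # number of primary buckets, fixed at creation time
a, b = 31, 7               # constants of the hash function, tuned for uniformity
PAGE_CAPACITY = 4

def bucket_of(search_key: int) -> int:
    """Hash function of the form (a * key + b) mod N."""
    return (a * search_key + b) % N

# Each primary bucket is a list of pages; pages beyond the first are overflow pages.
buckets = [[[]] for _ in range(N)]

def insert(search_key, rid):
    pages = buckets[bucket_of(search_key)]
    if len(pages[-1]) == PAGE_CAPACITY:      # last page full: add an overflow page
        pages.append([])
    pages[-1].append((search_key, rid))

def search(search_key):
    """Scan the primary page and its overflow pages for matching entries."""
    pages = buckets[bucket_of(search_key)]
    return [rid for page in pages for (k, rid) in page if k == search_key]
```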
Extendible Hashing - Dynamic Hashing
If a bucket fills up and new insertions have to be made into it, that bucket is split into two, and the hash function is modified dynamically so that it points to the two buckets instead of the earlier single bucket.
This hashing technique uses a directory in between: the directory contains pointers to the pages/buckets. When a page fills up and needs splitting, the directory is doubled (when required) so that it can point to more buckets.
For example, suppose the hash function uses the 2 least significant bits of the binary representation of the key, and the current hash index is as follows:
The local depth of all current buckets is 2, signifying that the last 2 bits are used to place entries in them. The global depth at the directory level is also 2, signifying that the last 2 bits are used to index the directory.
If we now have to insert a record with key 20, the above hash index is dynamically changed to:
Note that, since bucket A was full, a new bucket was added. For these two buckets, the last 3 bits are now used: the original bucket is pointed to by the 000 directory entry, and the new bucket by the 100 entry. The local depth of these two buckets is therefore updated to 3, and the global depth is also updated to 3.
At this stage, if a new record needs to be inserted into bucket B and it is full, a new bucket can be added directly without updating the global depth, since the global depth is already 3 while bucket B's local depth is only 2. The new bucket is added, bucket B's contents are distributed between the two buckets, and the directory pointers are updated accordingly.
Search
Based on the global depth of the directory, that many least significant bits of the binary representation of the search key value are used to index the directory, which points to the appropriate page. The record is then searched for within that page. No overflow pages are involved, so this is efficient.
Insert
The appropriate bucket is located first. If space is available, the record is simply inserted. Otherwise, the bucket is split and the hash structure is adjusted dynamically (doubling the directory if needed) to accommodate the new bucket.
Delete
The appropriate bucket is located first. If the record happens to be the last one in the bucket, merging buckets and shrinking the directory can be considered. If insertions are expected again in the near future, the structure can be left as it is and only the record deleted.
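A minimal sketch of extendible hashing with directory doubling and bucket splits, assuming integer keys hashed by their low-order bits (as in the example above) and an arbitrary bucket capacity:

```python
# Minimal sketch of extendible hashing (illustrative; bucket capacity and the
# use of the key's low-order bits follow the example in the text).
BUCKET_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []                      # (key, rid) pairs

class ExtendibleHash:
    def __init__(self, global_depth=2):
        self.global_depth = global_depth
        # One bucket per directory entry to start with (local depth = global depth).
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _dir_index(self, key):
        # Use the global_depth least significant bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        bucket = self.directory[self._dir_index(key)]
        return [rid for k, rid in bucket.items if k == key]

    def insert(self, key, rid):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.items) < BUCKET_CAPACITY:
            bucket.items.append((key, rid))
            return
        self._split(bucket)
        self.insert(key, rid)                # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory: each new entry points where its twin did.
            self.directory += self.directory[:]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redirect the directory entries whose extra bit is 1 to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Redistribute the old bucket's entries between the two buckets.
        old_items, bucket.items = bucket.items, []
        for k, r in old_items:
            target = new_bucket if (k & high_bit) else bucket
            target.items.append((k, r))
```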
Linear Hashing – Dynamic Hashing
This technique uses a family of hash functions h0, h1, h2, …, such that each function hashes into double the number of buckets compared to its predecessor: if h0 divides the data into N buckets, then h1 divides it into 2N buckets, and so on. Such a family of hash functions can be achieved by taking h_i(key) = h(key) mod (2^i * N), where h is some base hash function.
Splitting proceeds in rounds: over the course of a round, every bucket that existed at the start of the round is split into two. A Next pointer is maintained to decide which bucket is to be split next. Whenever an overflow is encountered, the bucket at the Next pointer is split into two. When all the buckets of the round have been split, the current round is said to be completed.
Here, buckets are simply split in round-robin fashion: it is not necessarily the bucket that is currently overflowing that gets split, but the bucket pointed to by the Next pointer. The current insertion may be made into an overflow page, similar to static hashing. Eventually, in round-robin fashion, all the buckets get split and their entries redistributed.
Search
At any given point in time, the record/key being searched for may belong to a bucket that has already been split in the current round, or to a bucket that is yet to be split.
To search for a key, the current level's hash function is applied first. If the resulting bucket has not yet been split in this round, the record is searched for within that bucket. If it has been split, the record may be present either in this bucket or in the new bucket created from it; to determine which, the next level's hash function is applied, which leads to the correct bucket where the record should be present.
Insert
The bucket for the insertion is determined first, as in search. If space is available, the record is inserted directly. If not, the bucket pointed to by Next is split into two, to accommodate further insertions; the current record is added to an overflow page if the bucket being split is a different one.
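A minimal sketch of the linear hashing bucket-choice rule and round-robin split; N, the bucket capacity, and the split trigger (split on any overflow) are illustrative assumptions:

```python
# Minimal sketch of linear hashing: the bucket-choice rule and the round-robin
# split (N, BUCKET_CAPACITY and the integer keys are illustrative assumptions).
N = 4                                  # number of buckets at the start of round 0
BUCKET_CAPACITY = 4
level = 0                              # current round
next_to_split = 0                      # bucket that will be split next
buckets = [[] for _ in range(N)]       # each bucket holds keys (overflow kept with it)

def h(key, i):
    """Hash family: h_i(key) = key mod (2^i * N); each level doubles the range."""
    return key % (N * (2 ** i))

def bucket_for(key):
    b = h(key, level)                  # current round's hash function
    if b < next_to_split:              # this bucket was already split this round,
        b = h(key, level + 1)          # so re-hash with the next function
    return b

def insert(key):
    global level, next_to_split
    buckets[bucket_for(key)].append(key)
    if len(buckets[bucket_for(key)]) > BUCKET_CAPACITY:   # an overflow occurred:
        # Split the bucket at next_to_split (round-robin), not necessarily the
        # bucket that overflowed; the overflowed bucket keeps its extra entries.
        buckets.append([])                                # its new image bucket
        old, buckets[next_to_split] = buckets[next_to_split], []
        for k in old:                                     # redistribute entries
            buckets[h(k, level + 1)].append(k)
        next_to_split += 1
        if next_to_split == N * (2 ** level):             # round completed
            level += 1
            next_to_split = 0
```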
Delete
The bucket is determined first, as in the search case; the record is then located and deleted.
Extendible hashing involves an additional directory structure in between, which is not present in the linear hashing technique.
Bucket occupancy is generally higher in extendible hashing, as the bucket that is overflowing is always the one that gets split, unlike linear hashing, where buckets are split in round-robin fashion.
If the data distribution is very skewed, the linear hashing technique can end up with a lot of overflow buckets/chains and can split buckets that do not genuinely need splitting (buckets are simply split in round-robin fashion). In such a skewed case, linear hashing performs much worse than extendible hashing.
In extendible hashing, since the overflowing bucket is always the one that is split and no overflow buckets are involved at all, the performance is much better in the skewed case.