
Chapter 5 – Storage & Indexing

Contents
Data on External (Permanent) Storage
File Organizations & Indexing
Tree Based Index Structures
    ISAM – Static Index structure
    B+ Trees – Dynamic Index structure
Hash Based Index Structures
    Static Hashing
    Extendible Hashing – Dynamic Hashing
    Linear Hashing – Dynamic Hashing

Data on External (Permanent) Storage


Relational database systems store data as a collection of records in a table (aka relation). For each
table, a file is maintained on the hard disk (permanent storage device). Each file consists of multiple
pages, and each page contains multiple data records. The size of a page is around 4KB or 8KB, and is
generally configurable by the Database Administrator.

The File Manager component is responsible for tracking which pages belong to a file, tracking free
pages, allocating new pages to a file as more records are inserted, tracking free space within the
allocated pages of the file, and so on.

For processing the data (retrieving it or saving new values), it is fetched into main memory. One
page is the minimum unit of data that is fetched into main memory for processing and eventually
saved back onto the hard disk. The Buffer Manager component is responsible for bringing the
relevant pages into main memory and saving them back to the hard disk after processing.
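As a rough illustration of this interaction (a minimal sketch, not the interface of any particular DBMS; the page size, class name and method names below are assumptions), the buffer manager can be pictured as a pool of in-memory page frames keyed by page id:

```python
# Minimal sketch of a buffer manager over a table's file on disk.
# A page is read from disk only if it is not already cached in the pool;
# modified pages are written back when they are flushed.

PAGE_SIZE = 8 * 1024  # 8KB pages, a typical configurable value


class BufferManager:
    def __init__(self, disk_file):
        self.disk_file = disk_file   # file object opened in 'r+b' mode
        self.pool = {}               # page_id -> bytearray (in-memory frame)
        self.dirty = set()           # page ids modified in memory

    def get_page(self, page_id):
        """Return the page, reading it from disk only on a cache miss."""
        if page_id not in self.pool:
            self.disk_file.seek(page_id * PAGE_SIZE)
            self.pool[page_id] = bytearray(self.disk_file.read(PAGE_SIZE))
        return self.pool[page_id]

    def mark_dirty(self, page_id):
        self.dirty.add(page_id)

    def flush(self):
        """Write all modified pages back to the hard disk."""
        for page_id in self.dirty:
            self.disk_file.seek(page_id * PAGE_SIZE)
            self.disk_file.write(self.pool[page_id])
        self.dirty.clear()
```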

During any database operation, the time spent on disk I/O is generally much higher than the time
spent actually processing the data in main memory. So, the way data is organized on the hard disk is
crucial for the overall performance of the database system.

Hard disks (random access devices) are used for permanent storage, as they are far more efficient
than tapes (sequential access devices). The structure of a hard disk is as follows:

Structure of Hard disk

Platters are the magnetic disks on which data is stored.

The disk arm moves the disk head to the needed location on the magnetic disk, where the head
reads/writes the data.

The spindle rotates the platters.

A block is the minimum unit of data that gets transferred to/from main memory. It is typically 4KB or
8KB in size. In DBMS terms, it corresponds to a page of data. A block contains sectors, which contain
the actual data records.

Every record has a unique rid (record id). The rid identifies the location or address of the record's
storage area on the hard disk. So, retrieving a record from the hard disk involves figuring out the disk
address, moving the disk head to that address, and finally transferring that block/page of data into
main memory.

Therefore, data records that are closely related to each other are best stored close to each other on
the hard disk. Otherwise, multiple pages need to be read into main memory, which delays the overall
operation. Also, the less disk head movement involved, the faster the overall I/O operation.

Database systems use RAIDs (Redundant Arrays of Independent Disks), instead of single hard disks,
for the following two main reasons:

1. The same data is stored on multiple disks (redundant disks) in parallel, as high reliability is
crucial. Even if one of the disks fails (bad sectors or any other failure), a backup is available in
the same system.
2. Data is striped (divided) across multiple disks, to make retrieval operations as efficient
as possible.

Permanent storage with a RAID configuration can be referred to as stable storage.

Tapes (sequential access devices) are used only for periodic backups, as they are cheaper.

File Organizations & Indexing


File organization is about:

1. How the data records are organized in the files on the hard disk.
2. How the pages of a file are allocated and organized for storing the records.
3. How the records are organized within pages.

An index is a data structure used to help access the data in the file efficiently. The two types of
indexing are:

1. Tree based Indexing


2. Hash based Indexing.

An index is based on a certain search key – a column or set of columns used for accessing the
record(s). For any given table or file, multiple indexes can be defined.

Data entries of the Index structure could be any of the following:

1. Actual Data records themselves


2. <SearchKey, rid> pair
3. <SearchKey, rid-list> pair

The rid (record id) carries the information about where the actual data record is located on the hard disk.
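To make these three alternatives concrete, here is a small sketch; the record layout and the Rid fields (a page id plus a slot number) are illustrative assumptions, not a prescribed format:

```python
from typing import NamedTuple


class Rid(NamedTuple):
    page_id: int   # which page of the file holds the record
    slot: int      # position of the record within that page


# Alternative 1: the data entry is the actual data record itself
record = {"rollno": 53666, "name": "Asha", "age": 21}

# Alternative 2: <SearchKey, rid> pair - one entry per matching record
entry_pair = (21, Rid(page_id=4, slot=7))

# Alternative 3: <SearchKey, rid-list> pair - one entry per distinct key value
entry_rid_list = (21, [Rid(4, 7), Rid(9, 2), Rid(11, 0)])
```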

Clustered Index – An index is referred to as a clustered index in either of the following cases:

1. The index structure itself contains the actual data records.


2. The order of data records matches the order of data entries in the index.

When a clustered index exists, fewer disk I/O operations are needed.

Unclustered Index – If the order of data records does not match the order of data entries, the index
is called an unclustered index.

With an unclustered index, more disk I/O operations may be needed.

Primary Index – If an index is based on the primary key attribute (or a super key containing the
primary key), it is referred to as a primary index.

Generally, the primary index is the clustered index, which helps in efficient retrieval of data when
searching by the primary key.

Unique Index – If an index is based on a candidate key or the primary key, it is referred to as a unique
index. Basically, for a given search key value, only one record exists in the table.

Secondary Index – If an index is based on non-key attributes, it is referred to as a secondary index.
In general, duplicate/multiple records are expected for any given search key value. For example, the
Age attribute in a Students table.

Generally, Secondary Indexes are Unclustered Indexes.

Database systems automatically create indexes on the primary key and candidate keys of each table.
Any additionally needed indexes, based on frequent search/access patterns, have to be defined
explicitly by the DBA (CREATE INDEX).

SQL Support for Indexes -


CREATE INDEX IndAge ON Students (age);
The above creates a tree-based index, which is the default in the majority of RDBMS systems in the
market.
A few RDBMS systems, such as PostgreSQL, support creating hash-based indexes:
CREATE INDEX EmpHashIdx ON Employees USING hash(emp_name);
To drop an index:
DROP INDEX IndAge;

The combination of file organization and the type of indexing used makes certain operations (or data
access patterns) efficient, while others remain very time consuming.

The various types of data operations considered for evaluating/comparing the various file
organizations and indexing structures are:

1. Accessing/searching for a particular record based on a given column value; for example, a
student with a particular rollno (key type).
2. Accessing a set of records that match a particular column value; for example, the set of
students of age 21 (non-key type).
3. Accessing records in a particular range; for example, the set of students whose roll numbers
or ages fall within a given range.
4. Scanning through the entire table, i.e., accessing all the records of the table.
5. Inserting new record(s) into the table.
6. Deleting existing record(s) from the table.

The various types of file organizations are:

1. Heap (unordered or unsorted) file organization – Pages are allocated randomly. Records are
placed in pages wherever space is available; no particular sort order is maintained. A heap
file is implemented either by maintaining a pointer from each page to the next page, or by
having a directory of all pages in the file. The directory contains links to all pages. If a
directory page gets filled, another directory page is added; a list of directory pages is maintained.

2. Sorted (sequential) file organization – All the records are maintained in some sorted order in
the pages of the file.
3. Hashed file organization – Hashing the search key value gives the bucket into which the data
record is placed. The same technique is used for accessing that data record later. This can be
based on a static hashing technique or a dynamic hashing technique, as described in the
following sections.
4. Indexed file organization – A primary-key-based index structure is used, such as a B+ tree.
The leaf nodes of this index structure are where the actual data records are placed. The
records are close to sequential order: instead of a complete sorted order of all the records,
they form bunches of sorted records. All pages are in sorted order, either linked to the next
page (as in a B+ tree) or contiguous (as in ISAM). This also makes range-based queries very
efficient.

5. Clustered file organization – Records from multiple tables that are frequently joined are
stored in the same block. This reduces the number of data blocks read and the number of
disk I/O operations overall.

Comparison of the various file organizations -

1. Heap (unordered or unsorted) file organization


a. Inserting a record is easy, as it can be placed in any available free slot.
b. Deleting a record involves searching for the record first and then directly deleting it,
without worrying about the free space created.
c. Searching for a record (for retrieval, update, or deletion) requires scanning through
the entire file, as the record could be present anywhere. This is not efficient.
d. Searching with a range-based query also requires scanning through the entire file.
This is not efficient.
2. Sorted (sequential) file organization
a. Inserting a record is tedious and time consuming. The position where it needs to be
placed in the sorted order must be identified first, and all subsequent records must
be moved one step further to make room for the new record before it can be
inserted. This is not efficient at all.
b. Deleting a record also involves identifying the record first, with binary search, then
deleting it and moving all subsequent records one step back to fill the free space.
c. Searching for a record – binary search can be used to locate the record. For very
large tables, even binary search is not that efficient, as many pages need to be read
in to perform it.
d. Searching with a range-based query involves identifying the first record matching the
range and then sequentially checking all subsequent records within the range.
e. Good for use cases where inserts and deletes are infrequent and the workload
generally involves batch processing or retrieving a series of records in order.
3. Hashed file organization
a. Inserting a record involves finding the appropriate bucket/page from the hash
function result and then inserting into it – O(1) time complexity. Very efficient, but
heavily dependent on the quality of the chosen hash function.
b. Searching for a record with a particular search key value involves simply applying the
hash function and getting the appropriate bucket. If the index is clustered, that
bucket or page itself contains the data record; if not, the rid points to the data
record/page where it is located. Either way, it can be fetched very efficiently.
c. This file organization is suitable when natural joins are involved, as they require
heavy searching for matching records (to join with the other table's records).
d. Searching with a range-based query is not at all efficient. It involves searching with
each particular key value in the range, effectively repeating the process for every
value in the range. Sometimes it is better to scan through the complete file instead.
e. If range-based queries are very rare, this file organization is efficient in terms of
inserts, deletes, and searches based on a single key value.

4. Indexed file organization
a. Inserting a record is much better than in a sorted file. The page into which the
insertion should be made is identified using the tree-based index structure. If the
page has sufficient room, the record is inserted there. Even if shifting of records is
involved, it is limited to that one page of the file (not the entire file). If the page is
already full, the page of records is split into two pages and the change is propagated
up the tree structure (B+ tree), or an overflow page is added (ISAM).
b. Deleting a record is very similar and efficient. The appropriate leaf node is identified
and, within that page, the record is searched for and deleted.
c. Searching for a record is efficient, as it just involves traversing the tree from the root
to the appropriate leaf level, where the record can be found.
d. Range-based searches involve finding the first record matching the criteria and then
scanning sequentially from there until the range is exhausted. So this is also very
efficient.
e. As this file organization is efficient for insertion, deletion, searches on a single key,
and range-based searches, it is the one generally used by database systems. The
primary key of the table is used for building the tree index structure.
5. Clustered file organization
a. Records from multiple tables that are frequently joined are stored in the same block.
This reduces the number of data blocks read and the number of disk I/O operations
overall.

Tree Based Index Structures


Sample Tree based Index

The tree structure is height balanced, i.e., the path length from the root node to any leaf node is the same.

The non-leaf levels of the tree help direct the search towards the correct leaf node. Non-leaf
nodes contain a set of search keys and pointers to the child nodes. If there are N search keys,
there are N+1 pointers to child nodes.

The leaf level contains the data entries. In the above example, the data entries are the actual data
records themselves (so this is an example of a clustered index). The leaf level is a doubly linked list,
so once we reach the correct leaf, we can traverse all the relevant data entries sequentially from
there. This is what helps with range-based search queries (like people with age > 34 and age < 50).

The average number of children of the non-leaf nodes is referred to as the fan-out of the tree. If N is
the fan-out of the tree and h is the height of the tree, then N^h is the number of leaf pages. In the
real world, the fan-out is generally around 100 and the height of the tree around 4. With this,
100^4 = 100 million leaf pages are possible for holding data entries (either data records themselves or
<SearchKey, rid or rid-list> pairs).

Inside a non-leaf node, binary search is used to find the relevant child based on the search key.
A binary search over 100 entries is not time consuming, and 3 or 4 such searches to reach the
relevant data record among 100 million pages of data records is very efficient.
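The arithmetic behind these figures, as a quick sketch (the fan-out and height are the illustrative values quoted above):

```python
fan_out = 100   # average number of children per non-leaf node
height = 4      # non-leaf levels traversed from the root down to the leaves

leaf_pages = fan_out ** height
print(leaf_pages)   # 100000000, i.e. 100 million leaf pages

# Reaching any one of them costs only `height` node lookups,
# each a binary search over about 100 entries.
```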

To use a B+ tree index, the actual data records need not be in sorted order.

Sample Clustered Tree based Index

The data entries are <SearchKey, rid> pairs. Above, we can see that the order of the data entries
matches the order of the actual data records. A search based on this clustered index is more efficient
than one based on an unclustered index, as it needs fewer page I/O operations.

Sample Unclustered Tree based Index

The data entries are <SearchKey, rid> pairs. Above, we can see that the order of the data records is
not aligned with the data entries. So a search based on this index needs more page I/Os, as the
records are scattered over multiple pages, making it less efficient than a clustered-index-based search.

ISAM – Static Index structure

Every Tree node corresponds to a page.

All the leaf nodes, the primary pages above, are allocated sequentially on the hard disk. These leaf
nodes contain either the data records themselves or <SearchKey, rid or rid-list> pairs. Initially, all the
data records are placed in sequential order.

The non-leaf nodes help direct the search towards the correct leaf node. All the non-leaf pages are
fixed once they are created; this is a static tree structure.

This indexing structure is suitable when inserts/deletes are infrequent. During an insert, if there is no
space left in the leaf page, an additional overflow page is added, though it is not a contiguous
page/block on the hard disk. A link to the overflow page is maintained.

For example, the initial ISAM tree is:

The leaf entries marked with * (like 10*) indicate data records.

With the insertion of records with keys 23, 48, 41, 42…, the tree becomes:

To accommodate some insertions, 20% of each leaf page is kept free during the initial creation of the
tree itself.

The fact that the non-leaf pages are static makes the locking mechanism simpler. There is no need to
lock the non-leaf pages, as their content is fixed once created; only the leaf nodes need locking for
concurrency control. This is an advantage of the ISAM structure over B+ trees. ISAM is suitable for
situations where insertions are rare once the ISAM index is created.

B+ Trees – Dynamic Index structure


The ISAM tree structure is not suitable where data insertions are common/frequent. Database
systems therefore predominantly use the B+ tree structure for tree-based indexing.

In B+ trees, modifications to non-leaf nodes are allowed, while the height balance is maintained.

Leaf nodes either contain the data records themselves or contain <SearchKey, rid or rid-list> pairs.
The leaf nodes form a doubly linked list, to help traverse to the next node. This is what helps with
range-based searches (like employees in the age group 25 to 35).

The non-leaf nodes direct the search towards the leaf node. Sample non-leaf node:

If N search keys are present, then there are N+1 pointers to child nodes. Generally, 50% occupancy is
guaranteed in the nodes; the only exception is the root node, which can have occupancy lower than
50%. Every node (except the root node) contains m entries, where d <= m <= 2d. d is referred to as
the order of the B+ tree.

For concurrency control, the non-leaf nodes also need to be locked, as they may need updates during
insertions and deletions. This is an overhead in B+ trees compared to the ISAM tree structure.

Search
Searching for any key value begins at the root node.

At every non-leaf node, if the search key is lower than the very first key, the pointer to its left is
followed to get to the correct child node. If the search key is greater than the last key in the node,
the pointer to its right is followed. Otherwise, binary search is used to find the position with
Ki <= K < Ki+1, and the pointer between those two keys is followed to get to the correct child node.

Once the leaf node is reached, it is searched for the matching record or the rid entry (which points to
the data record). As the leaf nodes form a doubly linked list, from here we can traverse to the next
leaf node and continue. This is what helps with range-based searches.
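A minimal sketch of this search procedure, assuming simple in-memory node objects (keys and children for non-leaf nodes; entries and a next link for leaf nodes). A real system would read each node as a page through the buffer manager, and leaves would also carry a backward link; only the forward link is needed here:

```python
import bisect


class NonLeaf:
    def __init__(self, keys, children):
        self.keys = keys            # N sorted search keys
        self.children = children    # N + 1 child pointers


class Leaf:
    def __init__(self, entries, next_leaf=None):
        self.entries = entries      # sorted list of (search_key, rid) pairs
        self.next = next_leaf       # link to the next leaf, used by range scans


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while isinstance(node, NonLeaf):
        # binary search inside the node picks the child pointer to follow
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    return node


def range_search(root, low, high):
    """Return all (key, rid) entries with low <= key <= high."""
    leaf = find_leaf(root, low)
    results = []
    while leaf is not None:
        for k, rid in leaf.entries:
            if k > high:
                return results
            if k >= low:
                results.append((k, rid))
        leaf = leaf.next            # follow the leaf-level links
    return results
```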

Insert
To insert into the appropriate position at the leaf node level, the same search mechanism is
followed.

If space is available in the leaf node, the insert is made directly.

If no space is left in the leaf node, its contents are split into two leaf nodes and the doubly linked list
structure is updated accordingly. Because of the newly introduced leaf node, the parent non-leaf
node needs an update with an additional pointer: a new search key and an additional pointer
(pointing to the new leaf node) are introduced at the appropriate position. This requires right-shifting
the subsequent keys and pointers within that node to make room for the new key and pointer, but
with a typical fan-out of around 100 keys/pointers this shifting is not expensive.

If there is no space left in this parent non-leaf node either, then it too is split into two non-leaf nodes
and a similar change is propagated to its parent level. This continues up to the root node. If the root
node is also already full, it is likewise split into two halves and a new root is created.
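A sketch of the leaf-level part of this insert, reusing the hypothetical Leaf layout from the search sketch above; ORDER plays the role of d, and propagating the split into the parent non-leaf nodes is only indicated in the comments, not expanded:

```python
import bisect

ORDER = 2   # d: each node holds between d and 2d entries (root excepted)


def insert_into_leaf(leaf, key, rid):
    """Insert (key, rid) into a leaf; return (new_leaf, separator_key) if it split."""
    pos = bisect.bisect_left([k for k, _ in leaf.entries], key)
    leaf.entries.insert(pos, (key, rid))

    if len(leaf.entries) <= 2 * ORDER:
        return None, None                      # there was room - no split needed

    # No space: split the overflowing leaf into two and fix the leaf links.
    mid = len(leaf.entries) // 2
    new_leaf = Leaf(leaf.entries[mid:], leaf.next)
    leaf.entries = leaf.entries[:mid]
    leaf.next = new_leaf

    # The first key of the new leaf becomes the separator that the parent
    # non-leaf node must record, together with a pointer to new_leaf.
    # If the parent itself is full, it splits the same way, possibly all the
    # way up to the root (which then grows the tree by one level).
    return new_leaf, new_leaf.entries[0][0]
```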

Delete
Delete is very similar to insert. First the tree is traversed from the root to the appropriate leaf, where
the record is present. Then it is deleted from the leaf node.

If this happens to be the last record in the node/page, then that node/page can be deleted altogether
and its parent updated accordingly: the corresponding pointer is deleted from the parent, and all the
search keys and pointers after it are shifted forward by one step.

Generally, even if the occupancy of a node goes below 50%, it is not merged with its siblings, since
insertions are usually expected to happen again in the future.

Hash Based Index Structures


This indexing involves applying a hash function to the search key value, obtaining the bucket/page
(the primary page) in which that record is expected to be present, and then searching for the record
in that page or in its overflow pages.

The above diagram shows:

1. Hashing on the age value (search key), using a simple hash function such as the last 2 bits of
the binary representation of the age value. So, overall, the data is divided into only 4 primary
buckets. Here, the data entries contain the data records themselves.
2. Hashing on the sal value (search key). The same hash function (last 2 bits of the binary
representation) is used. Here, the data entries are <SearchKey, rid> pairs; the rid points to
where the actual data record is located on the hard disk.

Once the primary bucket/page gets filled up, additional overflow pages can be added as needed.

Search involves applying the hash function, getting the correct page, and searching for the record
within that page or in one of its overflow pages. Within the bucket, records may or may not be
maintained in sorted order (both choices are feasible).

Insert involves getting the correct page, as in Search, and then inserting the record into that page.
If that page is already full, an overflow page is introduced and the record is added there. If records
are maintained in sorted order, insertion involves right-shifting all the subsequent records within
that bucket/page and then inserting the record at the correct position.

Delete involves getting the correct page, as in Search, then finding the record within that page/bucket
and deleting it. If records are maintained in sorted order, all the subsequent records need to be
shifted one step back after the record is deleted.

In general, the hash function should be selected such that all possible records are distributed as
uniformly as possible, avoiding overflow buckets as far as possible. If an inappropriate hash function
is used, some buckets may overflow heavily while others remain lightly occupied. This is referred to
as skew.

For range-based searches, this hash indexing structure is not suitable at all. The buckets/pages for
two adjacent search key values need not be adjacent, so range-based queries involve searching for
each value in the range one by one. (Tree-based indexing suits range-based queries.)

Hash-based indexing is highly suitable for natural join operations, as they involve searching for
matching records for one particular search key value at a time. (Tree-based indexing would need a
traversal from the root of the tree to a leaf for each search key value.)

Static Hashing

The primary buckets are fixed and sequential/contiguous on the hard disk, similar to the ISAM tree
structure. For N primary buckets, a typically used hash function has the form
(a * searchKeyValue + b) mod N, where the constants a and b are tuned to distribute all the possible
records into the N buckets as uniformly as possible.
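A small sketch of this static scheme; the constants a and b, the number of buckets N, and the tiny page capacity are illustrative values, and each bucket is represented as a primary page followed by its overflow pages:

```python
N = 4                    # number of primary buckets, fixed at creation time
A, B = 7, 3              # constants of the hash function, tuned for uniform spread
PAGE_CAPACITY = 2        # entries per bucket page (tiny, just for illustration)


def h(search_key):
    return (A * search_key + B) % N


# each primary bucket is a list of pages; pages beyond the first are overflow pages
buckets = [[[]] for _ in range(N)]


def insert(search_key, rid):
    pages = buckets[h(search_key)]
    if len(pages[-1]) >= PAGE_CAPACITY:
        pages.append([])                 # last page full: add an overflow page
    pages[-1].append((search_key, rid))


def search(search_key):
    """Scan the primary page and any overflow pages of the hashed bucket."""
    return [e for page in buckets[h(search_key)] for e in page if e[0] == search_key]
```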

If a primary page/bucket fills up, overflow pages are added as needed. The more overflow pages that
get added, the lower the efficiency of search, insert, and delete operations, as they must look for the
record in the overflow pages as well. To overcome this situation, one option is to rehash the entire
data into a larger number of primary buckets. But rehashing takes time and can cause operational
downtime in the database system.

Extendible Hashing - Dynamic Hashing


This technique overcomes the problem with static hashing, where a lot of overflow pages may get
created, degrading the performance of all operations.

If a bucket gets filled up and new insertions have to be made, that bucket is split into two, and the
hash function is modified dynamically to point to the two buckets instead of the earlier single bucket.

This hashing technique uses a directory in between. The directory contains pointers to pages/buckets.
When a page gets filled up and needs splitting, the directory is doubled (when required) to point to
more buckets.

For example, say the hash function uses the last 2 least significant bits of the binary representation,
and the current hash index is as follows:

The local depth of all current buckets is 2, signifying that the last 2 bits are used. The global depth at
the directory level is also 2, signifying that the last 2 bits are used.

If we have to insert a record with key 20, the above hash index is dynamically changed to:

Note that, as bucket A was full, a new bucket was added. For these two buckets, the last 3 bits are
used: the initial bucket is now pointed to by the 000 entry, and the new bucket by the 100 entry. So
the local depth of these two buckets is updated to 3, and the global depth is also updated to 3.

At this stage, if a new record needs to be inserted into bucket B and it is full, a new bucket can be
added directly without updating the global depth, as the global depth is already 3 while B's local
depth is only 2. The new bucket is added, bucket B's contents are distributed between the two
buckets, and the directory pointers are updated accordingly.
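A compact sketch of this directory-doubling behaviour, using the last global-depth bits of the key as in the example above; the bucket capacity and the class/method names are assumptions for illustration:

```python
BUCKET_CAPACITY = 4


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 2
        # directory of 2**global_depth slots; initially each slot has its own bucket
        self.directory = [Bucket(2) for _ in range(4)]

    def _slot(self, key):
        return key & ((1 << self.global_depth) - 1)   # last global_depth bits

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Bucket full: split it, doubling the directory first if required.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory   # double the directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Redirect the directory slots whose extra bit selects the new bucket.
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        # Redistribute the old bucket's keys (plus the new key) between the two.
        old_keys, bucket.keys = bucket.keys + [key], []
        for k in old_keys:
            self.insert(k)

    def search(self, key):
        return key in self.directory[self._slot(key)].keys
```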

Search
Based on the global depth of the directory, that many least significant bits of the binary
representation of the search key value are used. This points to the appropriate page, and the record
is searched for within that page. No overflow pages are involved, so this is efficient.

Insert
The appropriate bucket is found first. If space is available, the record is simply inserted. Otherwise,
the hash structure is changed dynamically to accommodate the new bucket.

Delete
The appropriate bucket is found first. If the record happens to be the last one in the bucket,
shrinking of buckets and of the directory can be considered. If insertions may happen again in the
near future, the structure can also be left as it is and only the record deleted.

Linear Hashing – Dynamic Hashing


Linear hashing is another dynamic hashing technique, which adds/adjusts buckets gracefully to
accommodate more insertions. It does not use any directory structure in between (unlike
extendible hashing).

This technique uses a family of hash functions h0, h1, h2, …, such that each function hashes into
double the number of buckets compared to its predecessor. If h0 divides the data into N buckets,
then h1 divides the overall data into 2N buckets, and so on. Such a family of hash functions can be
achieved by:

hi(value) = h(value) mod (2^i * N)

Splitting proceeds in rounds. Overall, in each round, all the current buckets get split into two. A
pointer (Next) is maintained to decide which bucket is to be split next. Whenever an overflow is
encountered, the bucket at this pointer is split into two. When all the buckets have been split, the
current round is said to be complete.

Here, buckets are split simply in round-robin fashion. It is not necessarily the bucket that is currently
overflowing that gets split; the bucket pointed to by the Next pointer is the one that is split. The
current insertion may go into an overflow page, as in static hashing. Eventually, in round-robin
fashion, all the buckets get split and their entries redistributed.
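A sketch of this round-robin splitting, with the family hi(value) = h(value) mod (2^i * N); the bucket capacity, the split trigger (split on any overflow), and the helper names are assumptions for illustration:

```python
N = 4                    # number of buckets at the start of round 0
BUCKET_CAPACITY = 4


class LinearHash:
    def __init__(self):
        self.level = 0            # current round number
        self.next = 0             # index of the next bucket to be split
        self.buckets = [[] for _ in range(N)]   # overflow entries simply extend a bucket

    def _h(self, level, key):
        return key % (N * (2 ** level))          # family h_i(key) = key mod (2^i * N)

    def _bucket_of(self, key):
        i = self._h(self.level, key)
        if i < self.next:                        # this bucket was already split this round,
            i = self._h(self.level + 1, key)     # so the next level's hash decides
        return i

    def insert(self, key):
        bucket = self.buckets[self._bucket_of(key)]
        overflow = len(bucket) >= BUCKET_CAPACITY
        bucket.append(key)            # may land in an "overflow" position for now
        if overflow:
            self._split()             # split the bucket pointed to by Next (round robin)

    def _split(self):
        old = self.buckets[self.next]
        self.buckets.append([])       # image bucket at index next + N * 2**level
        stay, move = [], []
        for k in old:
            (stay if self._h(self.level + 1, k) == self.next else move).append(k)
        self.buckets[self.next] = stay
        self.buckets[-1] = move
        self.next += 1
        if self.next == N * (2 ** self.level):   # every bucket of this round was split
            self.level += 1
            self.next = 0

    def search(self, key):
        return key in self.buckets[self._bucket_of(key)]
```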

Example current state:

Adding a record with key 43:

Search
At any given point in time, the record/key being searched for may belong to a bucket that has already
been split in the current round, or to a bucket that is yet to be split.

To search for a key, the current level's hash function is applied first. If the resulting bucket has not
yet been split, the record is searched for within that bucket. If it has been split, the record may be
present either in this bucket or in the new bucket created from it; to determine which, the next
level's hash function is applied, which leads to the correct bucket where the record should be present.

Insert
First, the bucket is determined as in the Search case. If space is available, the record is inserted
directly. If not, the bucket pointed to by Next is split into two to accommodate more insertions, and
the present record is added to an overflow page if the bucket being split is a different one.

Delete
First, the bucket is determined as in the Search case. The record is located and deleted.

Some comparisons between Linear Hashing and Extendible Hashing


Both are dynamic hashing techniques in which more buckets are added as needed and the hash
function is dynamically updated to point to the correct buckets.

Extendible hashing involves an additional directory structure in between, which is not present in the
linear hashing technique.

Bucket occupancy is generally higher in extendible hashing, as the bucket that is overflowing is always
the one that is split, unlike linear hashing, where buckets are split in round-robin fashion.

If the data distribution is very skewed, linear hashing can end up with a lot of overflow
buckets/chains and can split buckets that do not genuinely need splitting (buckets are simply split in
round-robin fashion). In such a skewed case, linear hashing performs much worse than extendible
hashing.

In extendible hashing, as the bucket that is overflowing is always the one that is split and overflow
buckets are not involved at all, the performance is much better in the skewed case.

Ramakrishna N - MGIT
