Unit – 3
Hashing in DBMS
Hashing in DBMS is a technique to quickly locate a data record in a database
irrespective of the size of the database.
For larger databases containing thousands and millions of records, the indexing data
structure technique becomes very inefficient because searching a specific record
through indexing will consume more time.
This doesn’t align with the goals of DBMS, especially when performance and data
retrieval time are minimized.
In hashing we will do searching algorithm with time complexity of O(1).
Components of Hashing
1. Hash Table:
o A hash table is an array or data structure and its size is determined by
the total volume of data records present in the database.
o Each memory location in a hash table is called a ‘bucket‘ or hash
indices and stores a data record’s exact location and can be accessed
through a hash function.
2. Bucket:
o A bucket is a memory location (index) in the hash table that stores the
data record.
o These buckets generally store a disk block which further stores
multiple records. It is also known as the hash index.
3. Hash Function:
o A hash function is a mathematical equation or algorithm that takes one
data record’s primary key as input and computes the hash index as
output.
1|Page
Working of Hash Function
The hash function generates a hash index through the primary key of the data record.
Now, there are 2 possibilities:
1. The hash index generated isn’t already occupied by any other value. So, the
address of the data record will be stored here.
2. The hash index generated is already occupied by some other value. This is called
collision so to counter this, a collision resolution technique will be applied.
Types of Hashing in DBMS
There are two primary hashing techniques in DBMS.
1. Static Hashing
2. Dynamic Hashing
1. Static Hashing
o In static hashing, the hash function always generates the same bucket’s address.
o For example, if we have a data record for employee_id = 1065, the hash function is
mod-5 which is – H(x) % 5, where x = id. Then the operation will take place like this:
H(106)%5=1.
This indicates that the data record should be placed or searched in the 1st bucket (or
1st hash index) in the hash table.
2|Page
Static Hashing has the following Properties
Data Buckets: The number of buckets in memory remains constant. The size of the
hash table is decided initially and it may also implement chaining that will allow
handling some collision issues.
Hash function: It uses the simplest hash function to map the data records to its
appropriate bucket. It is generally modulo-hash function.
Efficient for known data size: It’s very efficient in terms when we know the data
size and its distribution in the database.
Operations of Static Hashing
Searching a record
o When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.
Insert a Record
o When a new record is inserted into the table, then we will generate an address
for a new record based on the hash key and record is stored in that location.
Delete a Record
o To delete a record, we will first fetch the record which is supposed to be
deleted. Then we will delete the records for that address in memory.
Update a Record
o To update a record, we will first search it using a hash function, and then the
data record is updated.
Drawback of Static Hashing
It is inefficient and inaccurate when the data size dynamically varies because we have
limited space and the hash function always generates the same value for every
specific input. When the data size fluctuates very often it’s not at all useful because
collision will keep happening and it will result in problems like – bucket skew,
insufficient buckets etc.
To resolve this problem of bucket overflow, techniques such as – chaining and open
addressing are used. Here’s a brief info on both:
3|Page
1. Chaining/Open Hashing/Closed Addressing
Method to handle collision in open hashing is called separate chaining. In separate chaining
if two or more key have same hash value then next element is stored by new link to previous
element.
For example: Suppose R3 is a new address which needs to be inserted into the table, the
hash function generates address as 110 for it. But this bucket is full to store the new data. In
this case, a new bucket is inserted at the end of 110 buckets and is linked to it.
2. Open Addressing/Closed Hashing
When a hash function generates an address at which data is already stored, then the next
bucket will be allocated to it. This mechanism is called as Linear Probing.
For example: suppose R3 is a new address which needs to be inserted, the hash function
generates address as 112 for R3. But the generated address is already full. So the system
searches next available data bucket, 113 and assigns R3 to it.
4|Page
2. Dynamic Hashing
o Dynamic hashing is also known as extendible hashing, used to handle database that
frequently changes data sets.
o This method offers us a way to add and remove data buckets on demand dynamically.
This way as the number of data records varies, the buckets will also grow and shrink
in size periodically whenever a change is made.
Properties of Dynamic Hashing
The buckets will vary in size dynamically periodically as changes are made offering
more flexibility in making any change.
Dynamic Hashing aids in improving overall performance by minimizing or
completely preventing collisions.
It has the following major components: Data bucket, Flexible hash function, and
directories
Directories are containers that store the pointer to buckets. If bucket overflow or
bucket skew-like problems happen to occur, then bucket splitting is done to maintain
efficient retrieval time of data records. Each directory will have a directory id.
Global Depth: It is defined as the number of bits in each directory id. The more the
number of records, the more bits are there.
Advantages of Dynamic Hashing
In dynamic hashing, performance will not get affected as the amount of data grows in
the system. To accommodate the data, size of memory will be increased.
Dynamic hashing improve the utilization of the memory.
This method is efficient to handle the dynamic database where size of data changes
frequently.
Disadvantages of Dynamic Hashing
A the amount of data changes, bucket size will also get changed. Bucket address
table will keep track of these addresses because data address changes as bucket size
5|Page
increases or decreases. Maintenance of the bucket address table gets difficult when
there is significant increase in data.
In dynamic hashing, bucket overflow can happen.
l
Load Factor
Let, n = number of elements to be added in hash table.
m = number of buckets
Load Factor = n/m
Load factor gives average entries in one bucket.
Limit of load factor = 0.75
If load factor crosses its limit then concept of REHASHING is used.
Rehashing
Rehashing means increasing the size of hash table and redistributing elements in it.
It is very costly operation because:
We have to make new hash table of bigger size.
We have to compute new hash values (indices) for each elements and insert in new
hash table.
6|Page