Searching and
Hashing Algorithms
Chung-Ming Chen
Department of Biomedical Engineering
Textbook:
1. C++ Programming: From Problem Analysis to Program Design, 8th Edition, D.S. Malik
2. Data Structures Using C++, 2nd Edition, D.S. Malik Page 1
Search Algorithms and Analysis
Associated with each item in a data set is a special member that
uniquely identifies the item in the data set. For example, if you have a
data set consisting of student records, then the student ID uniquely
identifies each student in a particular school. This unique member of
the item is called the key of the item.
The keys of the items in the data set are used in such operations as
searching, sorting, insertion, and deletion. For instance, when we
search the data set for a particular item, we compare the key of the
item for which we are searching with the keys of the items in the data
set.
In the analysis of an algorithm, the key comparisons refer to
comparing the key of the search item with the key of an item in the
list. Moreover, the number of key comparisons refers to the number
of times the key of the item (in algorithms such as searching and
sorting) is compared with the keys of the items in the list.
Spring 2024 Data Structure 2
class arrayListType
Spring 2024 Data Structure 3
Sequential Search: Array-based Lists
Spring 2024 Data Structure 4
Sequential Search Analysis
When analyzing a search algorithm, we count the number of key
comparisons because this number gives us the most useful
information. Furthermore, the criteria for counting the number of key
comparisons can be applied equally well to other search algorithms.
key comparisons
Spring 2024 Data Structure 5
Sequential Search Analysis
Suppose that L is list of length n. We want to determine the number of key
comparisons made by the sequential search when the L is searched for a
given item.
If the search item is not in the list, we then compare the search item with
every element in the list, making n comparisons. This is an unsuccessful
case.
Suppose that the search item is in the list. Then the number of key
comparisons depends on where in the list the search item is located.
– Best case: The search item is the first element of L. We make only one key
comparison.
– Worst case: The search item is the last element in the list. The algorithm
makes n comparisons.
– The best and worst cases are not likely to occur every time we apply the
sequential search on L, so it would be more helpful if we could determine the
average behavior of the algorithm.
Spring 2024 Data Structure 6
Sequential Search Analysis
To determine the average number of comparisons in the successful case
of the sequential search algorithm:
1. Consider all possible cases.
2. Find the number of comparisons for each case.
3. Add the number of comparisons and divide by the number of cases.
If the search item, called the target, is the first element in the list, one
comparison is required.
If the target is the second element in the list, two comparisons are
required.
Similarly, if the target is the kth element in the list, k comparisons are
required. We assume that the target can be any element in the list; that is,
all list elements are equally likely to be the target.
Suppose that there are n elements in the list. The following expression
gives the average number of comparisons:
Spring 2024 Data Structure 7
Binary Search for Ordered Lists
The binary search algorithm uses the divide-and-conquer technique
to search the list, which can be performed only on ordered lists.
First, the search item is compared with the middle element of the list.
If the search item is found, the search terminates. If the search item is
less than the middle element of the list, we restrict the search to the
first half of the list; otherwise, we search the second half of the list.
Spring 2024 Data Structure 8
Binary Search Algorithm
The maximum number
of iterations ?
Spring 2024 Data Structure 9
Insertion into an Ordered List
1. Use an algorithm similar to the binary search algorithm to find the
place where the item is to be inserted.
2. if the item is already in this list
output an appropriate message
else
use the function insertAt to insert the item in the list.
Spring 2024 Data Structure 10
Lower Bound on Comparison-Based Search Algorithms
Sequential and binary search algorithms search the list by comparing the
target element with the list elements. For this reason, these algorithms
are called comparison-based search algorithms.
Spring 2024 Data Structure 11
Hashing Algorithms
The complexity of hashing search algorithm is, on average, of order 1.
In hashing, the data is organized with the help of a table, called the
hash table, denoted by HT, and the hash table is stored in an array.
To determine whether a particular item with a key, say X, is in the table,
we apply a function h, called the hash function, to the key X; that is, we
compute h(X), read as h of X. The function h is typically an arithmetic
function and h(X) gives the address of the item in the hash table.
Suppose that the size of the hash table, HT, is m. Then 0 h(X) < m.
Thus, to determine whether the item with key X is in the table, we look
at the entry HT[h(X)] in the hash table.
Because the address of an item is computed with the help of a
function, it follows that the items are stored in no particular order.
Spring 2024 Data Structure 12
Organizing Data in the Hash Table
Two ways that data is organized with the help of the hash table:
– The data is stored within the hash table, that is, in an array.
– The data is stored in linked lists and the hash table is an array of
pointers to those linked lists.
Terminology
The hash table HT is, usually, divided into, say b buckets HT[0],
HT[1], ..., HT[b – 1]. Each bucket is capable of holding, say r items.
Thus, it follows that br = m, where m is the size of HT. Generally, r =
1 and so each bucket can hold one item.
The hash function h maps the key X onto an integer t, that is, h(X) = t,
such that 0 h(X) b – 1.
Spring 2024 Data Structure 13
Collisions
collision
collision
Spring 2024 Data Structure 14
Collisions
Two keys, X1 and X2, such that X1 X2, are called synonyms if
h(X1) = h(X2). Let X be a key and h(X) = t. If bucket t is full, we say
that an overflow occurs. Let X1 and X2 be two nonidentical keys. If
h(X1) = h(X2), we say that a collision occurs. If r = 1, that is, the
bucket size is 1, an overflow and a collision occur at the same time.
When choosing a hash function, the main objectives are to:
– Choose a hash function that is easy to compute.
– Minimize the number of collisions.
Spring 2024 Data Structure 15
Hash Functions: Some Examples
Mid-Square: In this method, the hash function, h, is computed by
squaring the identifier, and then using the appropriate number of bits
from the middle of the square to obtain the bucket address. Because
the middle bits of a square usually depend on all the characters, it is
expected that different keys will yield different hash addresses with
high probability, even if some of the characters are the same.
Folding: In folding, the key X is partitioned into parts such that all the
parts, except possibly the last parts, are of equal length. The parts
are then added, in some convenient way, to obtain the hash address.
Division (Modular arithmetic): In this method, the key X is
converted into an integer iX. This integer is then divided by the size of
the hash table, i.e., HTSize, to get the remainder, giving the address
of X in HT. That is, (in C++)
h(X) = iX % HTSize;
Spring 2024 Data Structure 16
Example: Division Method in C++
Suppose that each key is a string. The following C++ function uses
the division method to compute the address of the key.
Spring 2024 Data Structure 17
Collision Resolution
In reality, collisions are unavoidable because usually a hash function
always maps a larger domain onto a smaller range. Thus, in hashing,
we must include algorithms to handle collisions.
Collision resolution techniques are classified into two categories:
open addressing (also called closed hashing), and chaining (also
called open hashing).
In open addressing, the data is stored within the hash table.
In chaining, the data is organized in linked lists and the hash table is
an array of pointers to the linked lists.
Spring 2024 Data Structure 18
Open Addressing – Linear Probing
Suppose that an item with key X is to be inserted in HT. We use the
hash function to compute the index h(X) of this item in HT. Suppose
that h(X) = t. Then 0 h(X) HTSize – 1. If HT[t] is empty, we store
this item into this array slot. Suppose that HT[t] is already occupied
by another item; we have a collision. In linear probing, starting at
location t, we search the array sequentially to find the next available
array slot.
In linear probing, we assume that the array is circular so that if the
lower portion of the array is full, we can continue the search in the top
portion of the array. This can be easily accomplished by using the
mod operator. That is, starting at t, we check the array locations t,
(t + 1) % HTSize, (t + 2) % HTSize, . . ., (t + j) % HTSize. This is
called the probe sequence.
The next array slot is given by
(h(X) + j) % HTSize
where j is the jth probe.
Spring 2024 Data Structure 19
Linear Probing: Example
Spring 2024 Data Structure 20
Linear Probing in C++
Spring 2024 Data Structure 21
Linear Probing: Clustering Problem
From the definition of linear probing, we see that linear probing is
easy to implement. However, linear probing causes clustering; that
is, more and more new keys would likely be hashed to the array slots
that are already occupied. For example, consider the hash table of
size 20, as shown below:
Initially, all the array positions are available. Because all the array
positions are available, the probability of any position being probed is
(1/20). Suppose that after storing some of the items, the hash table is
as shown below:
In this figure, a cross indicates that this array slot is occupied. Slot 9
will be occupied next if, for the next key, the hash address is 6, 7, 8,
or 9. Thus, the probability that slot 9 will be occupied next is 4/20.
Similarly, in this hash table, the probability that array position 14 will
be occupied next is 5/20.
Spring 2024 Data Structure 22
Linear Probing: Clustering Problem
Consider the hash table below:
In this hash table, the probability that the array position 14 will be
occupied next is 9/20, whereas the probability that the array positions 15,
16, or 17 will be occupied next is 1/20. We see that items tend to cluster,
which would increase the search length. Linear probing, therefore,
causes clustering. This clustering is called primary clustering.
One way to improve linear probing is to skip array positions by a fixed
constant, say c, rather than 1. In this case, the hash address is as
follows:
(h(X) + i * c) % HTSize
If c = 2 and h(X) = 2k, that is, h(X) is even, only the even-numbered array
positions are visited. Similarly, if c = 2 and h(X) = 2k + 1, that is, h(X) is
odd, only the odd-numbered array positions are visited. To visit all the
array positions, the constant c must be relatively prime to HTSize.
Spring 2024 Data Structure 23
Random Probing
This method uses a random number generator to find the next available
slot. The ith slot in the probe sequence is
(h(X) + ri) % HTSize
where ri is the ith value in a random permutation of the numbers 1
to HTSize – 1. All insertions and searches use the same sequence of
random numbers.
Spring 2024 Data Structure 24
Rehashing and Quadratic Probing
Rehashing: In this method, if a collision occurs with the hash function
h, we use a series of hash functions, h1, h2, . . ., hs. That is, if the
collision occurs at h(X), the array slots hi(X), 1 i s are examined.
Quadratic Probing:
Spring 2024 Data Structure 25
Quadratic Probing Property
Spring 2024 Data Structure 26
Quadratic Probing: Fast Implementation
Spring 2024 Data Structure 27
Secondary Clustering
Both random and quadratic probings eliminate primary clustering.
However, if two nonidentical keys, say X1 and X2, are hashed to the
same home position, that is, h(X1) = h(X2), then the same probe
sequence is followed for both keys. The same probe sequence is
used for both keys because random probing and quadratic probing
are functions of the home positions, not the original key. It follows that
if the hash function causes a cluster at a particular home position, the
cluster remains under these probings. This is called secondary
clustering.
Spring 2024 Data Structure 28
Double Hashing
One way to solve secondary clustering is to use linear probing, with
the increment value a function of the key. This is called double
hashing. In double hashing, if a collision occurs at h(X), the probe
sequence is generated by using the rule:
(h(X) + i * g(X)) % HTSize
where g is the second hash function, and i = 0, 1, 2, 3, . . . .
If the size of the hash table is a prime p, then we can define g as
follows:
g(k) = 1+(k % (p – 2))
Spring 2024 Data Structure 29
Double Hashing: Example
Spring 2024 Data Structure 30
Deletion: Open Addressing
Suppose that an item, say R, is to be deleted from the hash table,
HT. Clearly, we first must find the index of R in HT. To find the index
of R, we apply the same criteria we applied to R when R was inserted
in HT. Let us further assume that after inserting R, another item, R’,
was inserted in HT, and the home position of R and R’ is the same.
The probe sequence of R is contained in the probe sequence of R’
because R’ was inserted in the hash table after R.
Suppose that we delete R simply by marking the array slot containing
R as empty. If this array position stays empty, then while searching
for R’ and following its probe sequence, the search terminates at this
empty array position. This gives the impression that R’ is not in the
table, which, of course, is incorrect. The item R cannot be deleted
simply by marking its position as empty from the hash table.
Spring 2024 Data Structure 31
Solution to Deletion Dilemma
Another solution is to use another array, say indexStatusList of int, of
the same size as the hash table as follows: Initialize each position of
indexStatusList to 0, indicating that the corresponding position in the
hash table is empty. When an item is added to the hash table at
position, say i, we set indexStatusList[i] to 1. When an item is deleted
from the hash table at position, say k, we set indexStatusList[k] to -1.
Therefore, each entry in the array indexStatusList is -1, 0, or 1.
Spring 2024 Data Structure 32
Chaining
In chaining, the hash table, HT, is an array of pointers. Therefore, for
each j, where 0 j HTSize – 1, HT[j] is a pointer to a linked list. The
size of the hash table, HTSize, is less than or equal to the number of
items.
Spring 2024 Data Structure 33
Chaining
ITEM INSERTION AND COLLISION: For each key X (in the item),
we first find h(X) = t, where 0 t HTSize – 1. The item with this key
is then inserted in the linked list (which might be empty) pointed to by
HT[t]. It then follows that for nonidentical keys X1 and X2, if h(X1) =
h(X2), the items with keys X1 and X2 are inserted in the same linked
list and so collision is handled quickly and effectively. (A new item can
be inserted at the beginning of the linked list because the data in a
linked list is in no particular order.)
SEARCH: Suppose that we want to determine whether an item R
with key X is in the hash table. As usual, first we calculate h(X).
Suppose h(X) = t. Then the linked list pointed to by HT[t] is searched
sequentially
Spring 2024 Data Structure 34
Chaining
DELETION: To delete an item, say R, from the hash table, first we
search the hash table to find where in a linked list R exists. We then
adjust the pointers at the appropriate locations and deallocate the
memory occupied by R.
OVERFLOW: Because data is stored in linked lists, overflow is no
longer a concern because memory space to store the data is
allocated dynamically. Furthermore, the size of the hash table no
longer needs to be greater than the number of items. If the size of the
hash table is less than the number of items, some of the linked lists
contain more than one item. However, with a good hash function, the
average length of a linked list is still small and so the search is
efficient.
Spring 2024 Data Structure 35
Chaining
ADVANTAGES OF CHAINING: From the construction of the hash
table using chaining, we see that item insertion and deletion are
straightforward. If the hash function is efficient, few keys are hashed
to the same home position. Thus, on average, a linked list is short,
which results in a shorter search length. If the item size is large, it
saves a considerable amount of space. For example, suppose there
are 1000 items and each item requires 10 words of storage. Further
suppose that each pointer requires one word of storage. We then
need 1000 words for the hash table, 10,000 words for the items, and
1000 words for the link in each node. A total of 12,000 words of
storage space, therefore, is required to implement chaining. On the
other hand, if we use quadratic probing, if the hash table size is twice
the number of items, we need 20,000 words of storage..
Spring 2024 Data Structure 36
Chaining
DISADVANTAGES OF CHAINING: If the item size is
small, a considerable amount of space is wasted. For
example, suppose there are 1000 items, each requiring 1
word of storage. Chaining then requires a total of 3000
words of storage. On the other hand, with quadratic
probing, if the hash table size is twice the number of
items, only 2000 words are required for the hash table.
Also, if the table size is three times the number of items,
then in quadratic probing the keys are reasonably spread
out. This results in fewer collisions and so the search is
fast.
Spring 2024 Data Structure 37
Hashing Analysis
Let
The parameter α is called the load factor.
The average number of comparisons for a successful search and an
unsuccessful search are given below:
Spring 2024 Data Structure 38