The document outlines Unit V of a data structures course, covering searching, sorting, and hashing techniques. It details searching algorithms such as linear and binary search, and multiple sorting methods including insertion, selection, and quick sort, together with their algorithms and complexities. It also covers hashing methods, emphasizing their significance in data structure operations.
Discusses the process of searching in lists, covering the linear and binary search algorithms with their properties and complexities.
Explores various sorting techniques, including internal and external sorting. Key algorithms discussed are Insertion, Selection, Bubble, Shell, Quick, Heap, and Merge Sort.
Introduces hashing, its applications, and collision resolution strategies, including open and closed hashing, and discusses rehashing and extendible hashing.
UNIT V: Searching, Sorting and Hashing
By
Mr. S. Selvaraj
Asst. Professor (SRG) / CSE
Kongu Engineering College
Perundurai, Erode, Tamilnadu, India
Thanks to and resources from: Mark Allen Weiss, Data Structures and Algorithm Analysis in C; and Sumitabha Das, Computer Fundamentals and C Programming, 1st Edition, McGraw Hill, 2018.
20CST32 – Data Structures
Searching
• Search is the process of finding a value in a list of values.
• In other words, searching is the process of locating a given value's position in a list of values.
Linear Search
• Linear search finds a given element in a list of elements with O(n) time complexity, where n is the total number of elements in the list.
• The search starts by comparing the search element with the first element in the list.
• If both match, the result is "element found"; otherwise, the search element is compared with the next element in the list.
• This is repeated until the search element has been compared with the last element in the list; if that last element also does not match, the result is "Element not found in the list".
• That means the search element is compared with the list element by element.
Linear Search - Algorithm
• Step 1 - Read the search element from the user.
• Step 2 - Compare the search element with the first element in the list.
• Step 3 - If both are matched, then display "Given element is found!!!" and terminate the function.
• Step 4 - If both are not matched, then compare the search element with the next element in the list.
• Step 5 - Repeat steps 3 and 4 until the search element has been compared with the last element in the list.
• Step 6 - If the last element in the list also doesn't match, then display "Element is not found!!!" and terminate the function.
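• A minimal C sketch of these steps (the array contents, function name, and searched value are illustrative):

#include <stdio.h>

/* Return the index of key in a[0..n-1], or -1 if it is absent. */
int linear_search(const int a[], int n, int key)
{
    for (int i = 0; i < n; i++)
        if (a[i] == key)
            return i;           /* element found */
    return -1;                  /* element not found in the list */
}

int main(void)
{
    int a[] = { 65, 20, 10, 55, 32, 12, 50, 99 };
    int pos = linear_search(a, 8, 12);
    if (pos >= 0)
        printf("Given element is found at index %d\n", pos);
    else
        printf("Element is not found!!!\n");
    return 0;
}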
Binary Search
• Binary search finds a given element in a list of elements with O(log n) time complexity, where n is the total number of elements in the list.
• The binary search algorithm can be used only with a sorted list of elements.
• That means binary search is used only with a list of elements that are already arranged in order.
• Binary search cannot be used for a list of elements arranged in random order.
• The search starts by comparing the search element with the middle element in the list.
Binary Search - Algorithm
• Step 1 - Read the search element from the user.
• Step 2 - Find the middle element in the sorted list.
• Step 3 - Compare the search element with the middle element in the sorted list.
• Step 4 - If both are matched, then display "Given element is found!!!" and terminate the function.
• Step 5 - If both are not matched, then check whether the search element is smaller or larger than the middle element.
• Step 6 - If the search element is smaller than the middle element, repeat steps 2, 3, 4 and 5 for the left sublist of the middle element.
• Step 7 - If the search element is larger than the middle element, repeat steps 2, 3, 4 and 5 for the right sublist of the middle element.
• Step 8 - Repeat the same process until the search element is found or the sublist contains only one element.
• Step 9 - If that element also doesn't match the search element, then display "Element is not found in the list!!!" and terminate the function.
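• A minimal iterative C sketch of these steps, assuming the array is already sorted (names are illustrative):

/* Return the index of key in the sorted array a[0..n-1], or -1 if absent. */
int binary_search(const int a[], int n, int key)
{
    int low = 0, high = n - 1;
    while (low <= high)
    {
        int mid = low + (high - low) / 2;   /* middle element */
        if (a[mid] == key)
            return mid;                     /* found */
        else if (key < a[mid])
            high = mid - 1;                 /* continue in the left sublist */
        else
            low = mid + 1;                  /* continue in the right sublist */
    }
    return -1;                              /* not found */
}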
Sorting
• The arrangement of data in a preferred order is called sorting in data structures.
• Sorted data can be searched through quickly and easily.
• The simplest example of sorting is a dictionary.
• Before the era of the Internet, when you wanted to look up a word in a dictionary, you would do so using its alphabetical order. This made it easy.
Types of Sorting
• When all data fits in main memory, the sorting is called internal sorting.
• When all the data that needs to be sorted cannot fit in main memory at one time, the sorting is called external sorting.
• External sorting is used for massive amounts of data.
• Merge Sort and its variations are typically used for external sorting.
• External storage such as a hard disk or CD is used to hold the data.
Insertion Sort
• Insertion sort is a simple sorting algorithm that works similar to the way you sort playing cards in your hands.
• The array is virtually split into a sorted and an unsorted part.
• Values from the unsorted part are picked and placed at the correct position in the sorted part.
• This is an in-place comparison-based sorting algorithm.
• Here, a sub-list is maintained which is always sorted. For example, the lower part of the array is maintained to be sorted.
• An element that is to be inserted into this sorted sub-list has to find its appropriate place and then be inserted there. Hence the name, insertion sort.
• The array is scanned sequentially and unsorted items are moved and inserted into the sorted sub-list (in the same array).
• This algorithm is not suitable for large data sets, as its average and worst case complexity are O(n²), where n is the number of items.
Insertion Sort - Algorithm
• To sort an array of size n in ascending order:
– 1: Iterate from arr[1] to arr[n-1] over the array.
– 2: Compare the current element (key) to its predecessor.
– 3: If the key element is smaller than its predecessor, compare it to the elements before it. Move the greater elements one position up to make space for the key element.
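• A minimal C sketch of the algorithm (names are illustrative):

/* Insertion sort of a[0..n-1] in ascending order. */
void insertion_sort(int a[], int n)
{
    for (int i = 1; i < n; i++)          /* a[0..i-1] is the sorted sub-list */
    {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key)     /* shift larger elements one position up */
        {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                  /* insert key at its place */
    }
}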
• Example (from the step-by-step figure): by this point 14 and 27 are in the sorted sub-list; next, 33 is compared with 10.
• This process goes on until all the unsorted values are covered by the sorted sub-list.
Selection Sort
• Selection sort is a simple sorting algorithm.
• This sorting algorithm is an in-place comparison-based algorithm in which the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end.
• Initially, the sorted part is empty and the unsorted part is the entire list.
• The smallest element is selected from the unsorted array and swapped with the leftmost unsorted element, and that element becomes part of the sorted array.
• This process continues, moving the unsorted array boundary one element to the right.
• This algorithm is not suitable for large data sets, as its average and worst case complexities are O(n²), where n is the number of items.
Selection Sort - Algorithm
• Step 1 − Set MIN to location 0.
• Step 2 − Search for the minimum element in the unsorted part of the list.
• Step 3 − Swap it with the value at location MIN.
• Step 4 − Increment MIN to point to the next element.
• Step 5 − Repeat until the list is sorted.
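• A minimal C sketch of these steps (names are illustrative):

/* Selection sort of a[0..n-1] in ascending order. */
void selection_sort(int a[], int n)
{
    for (int i = 0; i < n - 1; i++)          /* i is the MIN location */
    {
        int min = i;                         /* index of the smallest unsorted value */
        for (int j = i + 1; j < n; j++)
            if (a[j] < a[min])
                min = j;
        int tmp = a[i];                      /* swap it to the sorted boundary */
        a[i] = a[min];
        a[min] = tmp;
    }
}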
Bubble Sort
• Bubble sort is a simple sorting algorithm.
• This sorting algorithm is a comparison-based algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order.
• This algorithm is not suitable for large data sets, as its average and worst case complexity are O(n²), where n is the number of items.
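• A minimal C sketch (names are illustrative):

/* Bubble sort of a[0..n-1] in ascending order. */
void bubble_sort(int a[], int n)
{
    for (int pass = 0; pass < n - 1; pass++)
        for (int j = 0; j < n - 1 - pass; j++)
            if (a[j] > a[j + 1])             /* adjacent pair out of order */
            {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
}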
Shell Sort
• Shell sort, named after its inventor, Donald Shell, was one of the first algorithms to break the quadratic time barrier.
• Shell sort is a highly efficient sorting algorithm and is based on the insertion sort algorithm.
• This algorithm avoids the large shifts that insertion sort performs when a small value is far to the right and has to be moved far to the left.
• The algorithm first applies insertion sort to widely spaced elements to sort them, and then sorts progressively less widely spaced elements.
• This spacing is termed the interval (or gap).
• This algorithm is quite efficient for medium-sized data sets; its average- and worst-case complexity depend on the gap sequence, the best known worst case being about O(n log² n), where n is the number of items.
• The worst case space complexity is O(n) in total (O(1) auxiliary).
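• A minimal C sketch using the common gap sequence n/2, n/4, ..., 1 (which, for the 8-element example below, starts at an interval of 4):

/* Shell sort of a[0..n-1] in ascending order. */
void shell_sort(int a[], int n)
{
    for (int gap = n / 2; gap > 0; gap /= 2)
        /* gapped insertion sort over each virtual sub-list */
        for (int i = gap; i < n; i++)
        {
            int key = a[i];
            int j = i;
            while (j >= gap && a[j - gap] > key)
            {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = key;
        }
}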
Shell Sort - Example
• In this example, we take an interval of 4 on the array {35, 33, 42, 10, 14, 19, 27, 44}.
• Make a virtual sub-list of all values located at an interval of 4 positions.
• Here these sub-lists are {35, 14}, {33, 19}, {42, 27} and {10, 44}.
Shell Sort - Example (Contd.)
• We compare the values in each sub-list and swap them (if necessary) in the original array.
• After this step, the array becomes {14, 19, 27, 10, 35, 33, 42, 44}.
• Shell sort then repeats the process with smaller intervals, finally using insertion sort (interval 1) to sort the array.
Quick Sort
• Quick sort is a fast sorting algorithm used to sort a list of elements.
• The quick sort algorithm was invented by C. A. R. Hoare.
• The quick sort algorithm partitions the list of elements into two parts and then sorts each part recursively.
• That means it uses a divide-and-conquer strategy.
• In quick sort, the partition of the list is performed around an element called the pivot.
• Here the pivot element is one of the elements in the list.
• The list is divided into two partitions such that
– all elements to the left of the pivot are smaller than the pivot, and
– all elements to the right of the pivot are greater than or equal to the pivot.
• This algorithm is quite efficient for large data sets: its average-case complexity is O(n log n), while its worst-case complexity is O(n²).
Quick Sort – Algorithm (Pivot)
• Step 1 - Consider the first element of the list as the pivot (i.e., the element at the first position in the list).
• Step 2 - Define two variables i and j. Set i and j to the first and last elements of the list respectively.
• Step 3 - Increment i until list[i] > pivot, then stop.
• Step 4 - Decrement j until list[j] ≤ pivot, then stop.
• Step 5 - If i < j, then exchange list[i] and list[j].
• Step 6 - Repeat steps 3, 4 and 5 until i and j cross (i ≥ j).
• Step 7 - Exchange the pivot element with list[j].
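• A C sketch of quick sort following these steps, with the first element as the pivot (a simplified illustration; other pivot choices are common):

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Sort list[low..high] in place. */
void quick_sort(int list[], int low, int high)
{
    if (low >= high)
        return;
    int pivot = list[low];                    /* Step 1 */
    int i = low, j = high;                    /* Step 2 */
    while (i < j)
    {
        while (i < high && list[i] <= pivot)  /* Step 3 */
            i++;
        while (list[j] > pivot)               /* Step 4 */
            j--;
        if (i < j)                            /* Step 5 */
            swap(&list[i], &list[j]);
    }
    swap(&list[low], &list[j]);               /* Step 7: pivot to its final place */
    quick_sort(list, low, j - 1);             /* sort each part recursively */
    quick_sort(list, j + 1, high);
}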
Heap Sort
• Heap sort is one of the sorting algorithms used to arrange a list of elements in order.
• The heap sort algorithm uses the heap tree concept.
• In this sorting algorithm, we use
– a Max Heap to arrange the list elements in descending order, and
– a Min Heap to arrange the list elements in ascending order.
Heap Sort - Algorithm
• Step 1 - Construct a binary tree with the given list of elements.
• Step 2 - Transform the binary tree into a Min Heap.
• Step 3 - Delete the root element from the Min Heap using the heapify method.
• Step 4 - Put the deleted element into the sorted list.
• Step 5 - Repeat the same until the Min Heap becomes empty.
• Step 6 - Display the sorted list.
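• A C sketch of these steps using an array-based min heap and a separate output list; the O(n) output array keeps the sketch close to steps 3-5 (an in-place variant with a max heap is also common):

#include <stdlib.h>

/* Restore the min-heap property for the subtree rooted at i. */
static void sift_down(int heap[], int size, int i)
{
    for (;;)
    {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] < heap[smallest]) smallest = l;
        if (r < size && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i) break;
        int t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
        i = smallest;
    }
}

void heap_sort(int a[], int n)
{
    for (int i = n / 2 - 1; i >= 0; i--)    /* Steps 1-2: build the min heap */
        sift_down(a, n, i);

    int *sorted = malloc(n * sizeof(int));  /* the sorted list of Step 4 */
    int size = n;
    for (int k = 0; k < n; k++)             /* Steps 3-5: delete the root until empty */
    {
        sorted[k] = a[0];                   /* root = current minimum */
        a[0] = a[size - 1];                 /* move the last element to the root */
        size--;
        sift_down(a, size, 0);              /* heapify */
    }
    for (int k = 0; k < n; k++)             /* ascending result copied back */
        a[k] = sorted[k];
    free(sorted);
}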
Bucket Sort
• Bucket sort is a sorting algorithm that divides the unsorted array elements into several groups called buckets.
• Each bucket is then sorted using any suitable sorting algorithm or by recursively applying the same bucket sort.
• Finally, the sorted buckets are combined to form the final sorted array.
• A scatter-gather approach is used.
Scatter-Gather Approach
• The process of bucket sort can be understood as a scatter-gather approach.
• Here, elements are first scattered into buckets, then the elements in each bucket are sorted.
• Finally, the elements are gathered in order.
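• A small C sketch of the scatter-gather idea, assuming integer keys in the range 0..99, ten buckets selected by key/10, and insertion sort inside each bucket (the key range and bucket count are illustrative assumptions):

#define NBUCKETS 10
#define MAXPER   100          /* crude per-bucket capacity for the sketch */

void bucket_sort(int a[], int n)
{
    int bucket[NBUCKETS][MAXPER];
    int count[NBUCKETS] = { 0 };

    for (int i = 0; i < n; i++)              /* scatter into buckets */
    {
        int b = a[i] / 10;
        bucket[b][count[b]++] = a[i];
    }

    int k = 0;
    for (int b = 0; b < NBUCKETS; b++)       /* sort each bucket, then gather */
    {
        for (int i = 1; i < count[b]; i++)   /* insertion sort within the bucket */
        {
            int key = bucket[b][i], j = i - 1;
            while (j >= 0 && bucket[b][j] > key)
            {
                bucket[b][j + 1] = bucket[b][j];
                j--;
            }
            bucket[b][j + 1] = key;
        }
        for (int i = 0; i < count[b]; i++)
            a[k++] = bucket[b][i];
    }
}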
Merge Sort
• Merge sort is a sorting technique based on the divide-and-conquer technique.
• Its worst-case time complexity is O(n log n).
• It is one of the most respected algorithms.
• Merge sort first divides the array into equal halves and then combines them in a sorted manner.
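• A minimal top-down C sketch (tmp is a scratch array with at least hi+1 slots; names are illustrative):

/* Sort a[lo..hi] in ascending order using merge sort. */
void merge_sort(int a[], int tmp[], int lo, int hi)
{
    if (lo >= hi)
        return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);            /* sort the left half  */
    merge_sort(a, tmp, mid + 1, hi);        /* sort the right half */

    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi)             /* merge the two sorted halves */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    for (k = lo; k <= hi; k++)
        a[k] = tmp[k];
}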
Applications and Drawbacks of Merge Sort
• Applications:
– Merge sort is useful for sorting linked lists in O(n log n) time.
– The inversion count problem.
– Used in external sorting.
• Drawbacks:
– Slower than other sorting algorithms for small inputs.
– The merge sort algorithm requires additional memory space of O(n) for the temporary array.
– It goes through the whole process even if the array is already sorted.
Multiway Merge Sort
• The basic external sorting algorithm uses the merge routine from merge sort.
• Suppose we have four tapes, Ta1, Ta2, Tb1, Tb2, which are two input and two output tapes.
• Depending on the point in the algorithm, the a and b tapes are either input tapes or output tapes.
• Suppose the data is initially on Ta1. Suppose further that the internal memory can hold (and sort) m records at a time.
• A natural first step is to read m records at a time from the input tape, sort the records internally, and then write the sorted records alternately to Tb1 and Tb2.
• We will call each set of sorted records a run. When this is done, we rewind all the tapes.
Multiway Merge - Example
• If m = 3, then after the runs are constructed,
the tapes will contain the data indicated in the
following figure.
Multiway Merge – Example (Contd.)
• This algorithm will require ⌈log₂(n/m)⌉ passes, plus the initial run-constructing pass.
• For instance, if we have 10 million records of 128 bytes each, and four megabytes of internal memory, then the first pass will create 320 runs. We would then need nine more passes to complete the sort.
• Our example requires ⌈log₂(13/3)⌉ = 3 more passes, which are shown in the following figure.
Polyphase Merge
• A polyphase merge sort is an algorithm which decreases the number of runs at every iteration of the main loop by merging runs into larger runs.
• It is used for external sorting.
• In this type of sort, the tapes being merged, and the tape to which the merged subfiles are written, vary continuously throughout the sort.
• In this technique, the concept of a pass through the records is not as clear-cut as in the straight or the natural merge.
Polyphase Merge
• The k-way merging strategy requires the use of 2k tapes.
• This could be prohibitive for some applications.
• It is possible to get by with only k + 1 tapes.
• Suppose we have three tapes, T1, T2, T3, and an input file on T1 that will produce 34 runs.
• One option is to put 17 runs each on T2 and T3.
• We could then merge this result onto T1, obtaining one tape with 17 runs.
• The problem is that since all the runs are on one tape, we must now put some of these runs on T2 to perform another merge.
• The logical way to do this is to copy the first 8 runs from T1 onto T2 and then perform the merge.
• This has the effect of adding an extra half pass for every pass we do.
Polyphase Merge
• An alternative method is to split the original 34 runs unevenly.
• Suppose we put 21 runs on T2 and 13 runs on T3.
• We would then merge 13 runs onto T1 before T3 was empty.
• At this point we could rewind T1 and T3, and merge T1 (with 13 runs) and T2 (which has 8 runs) onto T3.
• We would then merge 8 runs until T2 was empty, which would leave 5 runs on T1 and 8 runs on T3.
• We could then merge T1 and T3, and so on.
Polyphase Merge
• The original distribution of runs makes a lot of difference.
• For example, if 22 runs are placed on T2 with 12 on T3, then after the first merge we obtain 12 runs on T1 and 10 on T2.
• After another merge, there are 10 runs on T1 and 2 runs on T3.
• At this point the going gets slow, because we can only merge two sets of runs before T3 is exhausted.
• Then T1 has 8 runs and T2 has 2 runs.
• Again we can only merge two sets of runs, obtaining T1 with 6 runs and T3 with 2 runs.
• After three more passes, T2 has 2 runs and the other tapes are empty.
• We copy one run to another tape and then we can finish the merge.
• It turns out that if the number of runs is a Fibonacci number Fn, then the best way to distribute them is to split them into the two Fibonacci numbers Fn-1 and Fn-2.
• Otherwise, it is necessary to pad the tape with dummy runs in order to bring the number of runs up to a Fibonacci number.
Hashing
• The implementation of hash tables is frequently called hashing.
• Hashing is a technique used for performing insertions, deletions and finds in constant average time.
• However, operations that require any ordering information among the elements are not supported efficiently.
• Thus, operations such as find_min, find_max, and the printing of the entire table in sorted order in linear time are not supported.
• Here, the central data structure is the hash table. We will
– see several methods of implementing the hash table,
– compare these methods analytically,
– show numerous applications of hashing, and
– compare hash tables with binary search trees.
Hashing
• The ideal hash table data structure is merely an array of some fixed size, containing the keys.
• Typically, a key is a string with an associated value (for instance, salary information).
• We will refer to the table size as H_SIZE, with the understanding that this is part of a hash data structure and not merely some variable floating around globally.
• The common convention is to have the table run from 0 to H_SIZE - 1.
• Each key is mapped into some number in the range 0 to H_SIZE - 1 and placed in the appropriate cell.
• The mapping is called a hash function, which ideally should be simple to compute and should ensure that any two distinct keys get different cells.
• Since there are a finite number of cells and a virtually inexhaustible supply of keys, this is clearly impossible, and thus we seek a hash function that distributes the keys evenly among the cells.
Example
• In this example, john hashes to 3, phil hashes to 4, dave hashes to 6, and mary hashes to 7.
Collision
• The only remaining problems are
– choosing a hash function,
– deciding what to do when two keys hash to the same value (this is known as a collision), and
– deciding on the table size.
Hash Function
• If the input keys are integers, then simply returning key mod H_SIZE is generally a reasonable strategy, unless key happens to have some undesirable properties.
• In that case, the choice of hash function needs to be carefully considered.
• For instance, if the table size is 10 and the keys all end in zero, then the standard hash function is obviously a bad choice.
• For reasons we shall see later, and to avoid situations like the one above, it is usually a good idea to ensure that the table size is prime.
• When the input keys are random integers, this function is not only very simple to compute but also distributes the keys evenly.
• Usually, the keys are strings; in this case, the hash function needs to be chosen carefully.
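• Two small C sketches: the key mod H_SIZE function for integer keys, and one common (illustrative, not the only) way to hash a string by accumulating its characters modulo the table size:

unsigned int hash_int(unsigned int key, unsigned int H_SIZE)
{
    return key % H_SIZE;                      /* H_SIZE should ideally be prime */
}

unsigned int hash_string(const char *key, unsigned int H_SIZE)
{
    unsigned int h = 0;
    while (*key != '\0')
        h = (h * 31 + (unsigned char)*key++) % H_SIZE;   /* 31 is one common multiplier */
    return h;
}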
Open Hashing (Separate Chaining)
• The first strategy, commonly known as either open hashing or separate chaining, is to keep a list of all elements that hash to the same value.
• For convenience, our lists have headers.
• If space is tight, it might be preferable to avoid their use.
Open Hashing –Find & Insert
• To perform a find, we use the hash function to determine which list
to traverse.
• We then traverse this list in the normal manner, returning the
position where the item is found.
• To perform an insert, we traverse down the appropriate list to
check whether the element is already in place.
– if duplicates are expected, an extra field is usually kept, and this field
would be incremented in the event of a match.
– If the element turns out to be new, it is inserted either at the front of
the list or at the end of the list, whichever is easiest.
• This is an issue most easily addressed while the code is being
written.
• Sometimes new elements are inserted at the front of the list, since
it is convenient and also because frequently it happens that
recently inserted elements are the most likely to be accessed in
the near future.
Open Hashing – Type Declarations

typedef struct list_node *node_ptr;   /* pointer to a list node */

struct list_node
{
    element_type element;
    node_ptr next;
};
typedef node_ptr LIST;
typedef node_ptr position;

/* LIST *the_lists will be an array of lists, allocated later */
/* The lists will use headers, allocated later */

struct hash_tbl
{
    unsigned int table_size;
    LIST *the_lists;
};
typedef struct hash_tbl *HASH_TABLE;
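• A sketch of find and insert for separate chaining, written against the declarations above (it assumes a hash( key, table_size ) function like the ones sketched earlier; error handling and the element comparison are simplified, and for string keys the comparison and copy would use strcmp/strcpy):

position
find( element_type key, HASH_TABLE H )
{
    position p;
    LIST L;

    L = H->the_lists[ hash( key, H->table_size ) ];
    p = L->next;                        /* lists have headers */
    while( p != NULL && p->element != key )
        p = p->next;
    return p;                           /* NULL if the key is not present */
}

void
insert( element_type key, HASH_TABLE H )
{
    position pos, new_cell;
    LIST L;

    pos = find( key, H );
    if( pos == NULL )                   /* the element turns out to be new */
    {
        new_cell = malloc( sizeof( struct list_node ) );
        L = H->the_lists[ hash( key, H->table_size ) ];
        new_cell->element = key;        /* probably needs strcpy for strings */
        new_cell->next = L->next;       /* insert at the front of the list */
        L->next = new_cell;
    }
}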
Open Hashing - Example
• We assume for this section that the keys are the first 10
perfect squares and that the hashing function is simply
hash(x) = x mod 10. (The table size is not prime, but is used
here for simplicity.)
Load Factor
• We define the load factor, ∆, of a hash table to be the ratio of the number of elements in the hash table to the table size.
• In the example above, ∆ = 1.0.
• The average length of a list is ∆.
• The effort required to perform a search is the constant time required to evaluate the hash function plus the time to traverse the list.
Load Factor
• In an unsuccessful search, the number of links to traverse is ∆ (excluding the final NULL link) on average.
• A successful search requires that about 1 + ∆/2 links be traversed, since there is a guarantee that one link must be traversed (since the search is successful), and we also expect to go halfway down a list to find our match.
• This analysis shows that the table size is not really important, but the load factor is.
• The general rule for open hashing is to make the table size about as large as the number of elements expected (in other words, let ∆ ≈ 1).
• It is also a good idea, as mentioned before, to keep the table size prime to ensure a good distribution.
Closed Hashing (Open Addressing)
• Open hashing has the disadvantage of requiring pointers.
• This tends to slow the algorithm down a bit because of the time required to allocate new cells, and it also essentially requires the implementation of a second data structure.
• Closed hashing, also known as open addressing, is an alternative to resolving collisions with linked lists.
• In a closed hashing system, if a collision occurs, alternate cells are tried until an empty cell is found.
• More formally, cells h0(x), h1(x), h2(x), ... are tried in succession, where hi(x) = (hash(x) + f(i)) mod H_SIZE, with f(0) = 0.
• The function f is the collision resolution strategy.
• Because all the data goes inside the table, a bigger table is needed for closed hashing than for open hashing.
• Generally, the load factor should be kept below ∆ = 0.5 for closed hashing.
Collision Resolution Strategies
• We now look at three common collision resolution strategies.
– Linear Probing
– Quadratic Probing
– Double Hashing
Linear Probing
• In linear probing, f is a linear function of i, typically f(i) = i.
• This amounts to trying cells sequentially (with wraparound) in search of an empty cell.
• The figure in the next slide shows the result of inserting the keys {89, 18, 49, 58, 69} into a closed table using the same hash function as before and the collision resolution strategy f(i) = i.
• The first collision occurs when 49 is inserted; it is put in the next available spot, namely spot 0, which is open.
• 58 collides with 18, 89, and then 49 before an empty cell is found three away. The collision for 69 is handled in a similar manner.
• As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large.
• Worse, even if the table is relatively empty, blocks of occupied cells start forming.
• This effect, known as primary clustering, means that any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster.
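• A small C sketch that reproduces this example: linear probing with f(i) = i into a size-10 table using hash(x) = x mod 10 (EMPTY marks an unused cell; this is a toy illustration, not a full hash table):

#include <stdio.h>
#define SIZE  10
#define EMPTY (-1)

int main(void)
{
    int table[SIZE], keys[] = { 89, 18, 49, 58, 69 };
    for (int i = 0; i < SIZE; i++)
        table[i] = EMPTY;

    for (int k = 0; k < 5; k++)
    {
        int x = keys[k], i = 0, pos;
        do {                                   /* try hash(x)+0, +1, +2, ... mod SIZE */
            pos = (x % SIZE + i) % SIZE;
            i++;
        } while (table[pos] != EMPTY);
        table[pos] = x;
    }
    for (int i = 0; i < SIZE; i++)             /* 49 lands in cell 0, 58 in 1, 69 in 2 */
        printf("cell %d: %d\n", i, table[i]);
    return 0;
}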
• Although we will not perform the calculations here, it can be shown that the expected number of probes using linear probing is roughly
– 1/2 · (1 + 1/(1 - ∆)²) for insertions and unsuccessful searches, and
– 1/2 · (1 + 1/(1 - ∆)) for successful searches.
Quadratic Probing
• Quadratic probing is a collision resolution method that eliminates the primary clustering problem of linear probing.
• Quadratic probing is what you would expect: the collision function is quadratic.
• The popular choice is f(i) = i².
• The figure shows the resulting closed table with this collision function on the same input used in the linear probing example.
• When 49 collides with 89, the next position attempted is one cell away. This cell is empty, so 49 is placed there.
• Next, 58 collides at position 8. Then the cell one away is tried, but another collision occurs. A vacant cell is found at the next cell tried, which is 2² = 4 away. 58 is thus placed in cell 2.
• The same thing happens for 69.
• For linear probing it is a bad idea to let the hash table get nearly full, because performance degrades.
• For quadratic probing, the situation is even more drastic: there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime.
• This is because at most half of the table can be used as alternate locations to resolve collisions.
• Indeed, it can be shown that if the table is half empty and the table size is prime, then we are always guaranteed to be able to insert a new element.
Closed Hashing – Insert Routine with Quadratic Probing

void
insert( element_type key, HASH_TABLE H )
{
    position pos;

    pos = find( key, H );
    if( H->the_cells[pos].info != legitimate )
    {   /* ok to insert here */
        H->the_cells[pos].info = legitimate;
        H->the_cells[pos].element = key;
        /* Probably need strcpy!! */
    }
}
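• A sketch of the companion find routine with quadratic probing, using the fact that each quadratic probe lands 2i - 1 cells beyond the previous one, so no multiplication is needed (the cell array, info field, and empty marker follow the conventions of the insert routine above; the element comparison is simplified):

position
find( element_type key, HASH_TABLE H )
{
    position current_pos;
    int collision_num = 0;

    current_pos = hash( key, H->table_size );
    while( H->the_cells[current_pos].info != empty &&
           H->the_cells[current_pos].element != key )
    {
        current_pos += 2 * ++collision_num - 1;   /* next quadratic probe */
        if( current_pos >= H->table_size )        /* wrap around the table */
            current_pos -= H->table_size;
    }
    return current_pos;      /* either the matching cell or the empty cell to use */
}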
Quadratic Probing
• Although quadratic probing eliminates primary clustering, elements that hash to the same position will probe the same alternate cells. This is known as secondary clustering.
• Secondary clustering is a slight theoretical blemish.
• Simulation results suggest that it generally causes less than an extra probe per search.
• The double hashing technique eliminates this, but does so at the cost of extra multiplications and divisions.
Double Hashing
• For double hashing, one popular choice is f(i) = i · h2(x).
• This formula says that we apply a second hash function to x and probe at a distance h2(x), 2·h2(x), ..., and so on.
• A poor choice of h2(x) would be disastrous.
• For instance, the obvious choice h2(x) = x mod 9 would not help if 99 were inserted into the input in the previous examples.
• Thus, the function must never evaluate to zero.
• It is also important to make sure all cells can be probed (this is not possible in the example below, because the table size is not prime).
• A function such as h2(x) = R - (x mod R), with R a prime smaller than H_SIZE, will work well.
• If we choose R = 7, then the figure shows the results of inserting the same keys as before.
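• A tiny C sketch of the probe sequence used in this example: hash(x) = x mod 10, h2(x) = 7 - (x mod 7), and the i-th probe at (hash(x) + i · h2(x)) mod 10:

/* Cell examined on the i-th probe for key x (i = 0 is the home position). */
int probe(int x, int i)
{
    int h2 = 7 - (x % 7);            /* second hash; never evaluates to zero */
    return (x % 10 + i * h2) % 10;
}
/* probe(49, 0) = 9 collides with 89; probe(49, 1) = 6, so 49 goes to cell 6. */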
Double Hashing - Example
• The first collision occurs when 49 is inserted. h2(49) = 7
- 0 = 7, so 49 is inserted in position 6.
• h2(58) = 7 - 2 = 5, so 58 is inserted at location 3.
• Finally, 69 collides and is inserted at a distance h2(69) =
7 - 6 = 1 away. 69 is inserted at location 0.
• If we tried to insert 60 in position 0, we would have a
collision. Since h2(60) = 7 - 4 = 3, we would then try
positions 3, 6, 9, and then 2 until an empty spot is
found.
• It is generally possible to find some bad case, but there
are not too many here.
Double Hashing - Example
• As we have said before, the size of our sample hash table is not
prime.
• We have done this for convenience in computing the hash function,
but it is worth seeing why it is important to make sure the table size
is prime when double hashing is used.
• If we attempt to insert 23 into the table, it would collide with 58.
Since h2(23) = 7 - 2 = 5, and the table size is 10, we essentially have
only one alternate location, and it is already taken. Thus, if the table
size is not prime, it is possible to run out of alternate locations
prematurely.
• However, if double hashing is correctly implemented, simulations
imply that the expected number of probes is almost the same as for
a random collision resolution strategy.
• This makes double hashing theoretically interesting.
• Quadratic probing, however, does not require the use of a second
hash function and is thus likely to be simpler and faster in practice.
Rehashing
• If the table gets too full,
– the running time for the operations will start taking too long, and
– inserts might fail for closed hashing with quadratic resolution.
• This can happen if there are too many deletions intermixed with insertions.
• A solution, then, is
– to build another table that is about twice as big (with an associated new hash function),
– scan down the entire original hash table,
– compute the new hash value for each (non-deleted) element, and
– insert it in the new table.
Rehashing - Example
• As an example, suppose the elements 13, 15, 24, and 6 are inserted into a closed hash table of size 7.
• The hash function is h(x) = x mod 7.
• Suppose linear probing is used to resolve collisions.
• The resulting hash table appears in the figure.
Rehashing - Example
• If 23 is inserted into the table, the resulting table will be over 70 percent full.
• Because the table is so full, a new table is created.
• The size of this table is 17, because this is the first prime which is twice as large as the old table size.
• The new hash function is then h(x) = x mod 17.
• The old table is scanned, and elements 6, 15, 23, 24, and 13 are inserted into the new table.
• The resulting table appears in the figure.
Rehashing - Example
• This entire operation is called rehashing.
• This is obviously a very expensive operation – the running time is O(n), since there are n elements to rehash and the table size is roughly 2n, but it is actually not all that bad, because it happens very infrequently.
• In particular, there must have been n/2 inserts prior to the last rehash, so it essentially adds a constant cost to each insertion. (This is why the new table is made twice as large as the old table.)
• If this data structure is part of the program, the effect is not noticeable.
• On the other hand, if the hashing is performed as part of an interactive system, then the unfortunate user whose insertion caused a rehash could see a slowdown.
Rehashing Implementation
• Rehashing can be implemented in several ways with quadratic probing.
– One alternative is to rehash as soon as the table is half full.
– The other extreme is to rehash only when an insertion fails.
– A third, middle-of-the-road strategy is to rehash when the table reaches a certain load factor.
• Since performance does degrade as the load factor increases, the third strategy, implemented with a good cutoff, could be best.
• Rehashing frees the programmer from worrying about the table size and is important because hash tables cannot be made arbitrarily large in complex programs.
• The exercises ask you to investigate the use of rehashing in conjunction with lazy deletion.
• Rehashing can be used in other data structures as well.
• For instance, if the queue data structure became full, we could declare a double-sized array and copy everything over, freeing the original.
Rehashing Implementation - Code

HASH_TABLE
rehash( HASH_TABLE H )
{
    unsigned int i, old_size;
    cell *old_cells;

    old_cells = H->the_cells;
    old_size = H->table_size;

    /* Get a new, empty table */
    H = initialize_table( 2*old_size );

    /* Scan through old table, reinserting into new */
    for( i = 0; i < old_size; i++ )
        if( old_cells[i].info == legitimate )
            insert( old_cells[i].element, H );

    free( old_cells );
    return H;
}
Extendible Hashing
• We now deal with the case where the amount of data is too large to fit in main memory.
• The main consideration then is the number of disk accesses required to retrieve data.
• As before, we assume that at any point we have n records to store; the value of n changes over time. Furthermore, at most m records fit in one disk block.
• We will use m = 4 in this section.
• If either open hashing or closed hashing is used, the major problem is that collisions could cause several blocks to be examined during a find, even for a well-distributed hash table.
• Furthermore, when the table gets too full, an extremely expensive rehashing step must be performed, which requires O(n) disk accesses.
• A clever alternative, known as extendible hashing, allows a find to be performed in two disk accesses. Insertions also require few disk accesses.
Extendible Hashing
• We recall from previous discussions that a B-tree has depth O(log_{m/2} n).
• As m increases, the depth of a B-tree decreases.
• We could in theory choose m to be so large that the depth of the B-tree would be 1.
• Then any find after the first would take one disk access, since, presumably, the root node could be stored in main memory.
• The problem with this strategy is that the branching factor is so high that it would take considerable processing to determine which leaf the data was in.
• If the time to perform this step could be reduced, then we would have a practical scheme.
• This is exactly the strategy used by extendible hashing.
Extendible Hashing - Example
• Let us suppose, for the moment, that our data consists of several six-bit integers.
• The figure shows an extendible hashing scheme for this data.
• The root of the "tree" contains four pointers determined by the leading two bits of the data.
• Each leaf has up to m = 4 elements.
• It happens that in each leaf the first two bits are identical; this is indicated by the number in parentheses.
• To be more formal, D will represent the number of bits used by the root, which is sometimes known as the directory.
• The number of entries in the directory is thus 2^D.
• dl is the number of leading bits that all the elements of some leaf l have in common.
• dl will depend on the particular leaf, and dl ≤ D.
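• A tiny C sketch of how the directory entry is selected for six-bit keys: the leading D bits of the key index the directory (D = 2 in the first figure):

/* Directory index for a six-bit key when the directory uses D leading bits. */
unsigned int dir_index(unsigned int key, unsigned int D)
{
    return key >> (6 - D);           /* top D bits of the six-bit key */
}
/* With D = 2, the key 100100 (decimal 36) gives index 10 (decimal 2). */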
Extendible Hashing - Example
• Suppose that we want to insert the key 100100.
• This would go into the third leaf, but as the third leaf is already full, there is no room.
• We thus split this leaf into two leaves, which are now determined by the first three bits.
• This requires increasing the directory size to 3.
• These changes are reflected in the figure.
Extendible Hashing - Example
• Notice that all of the leaves not involved in the split are now pointed to by two adjacent directory entries.
• Thus, although an entire directory is rewritten, none of the other leaves are actually accessed.
• If the key 000000 is now inserted, then the first leaf is split, generating two leaves with dl = 3.
• Since D = 3, the only change required in the directory is the updating of the 000 and 001 pointers. See the figure.
Extendible Hashing - Example
• This very simple strategy provides quick access times for insert and
find operations on large databases.
• There are a few important details we have not considered.
• First, it is possible that several directory splits will be required if the
elements in a leaf agree in more than D + 1 leading bits.
• For instance, starting at the original example, with D = 2, if 111010,
111011, and finally 111100 are inserted, the directory size must be
increased to 4 to distinguish between the five keys.
• This is an easy detail to take care of, but must not be forgotten.
• Second, there is the possibility of duplicate keys; if there are more
than m duplicates, then this algorithm does not work at all.
• In this case, some other arrangements need to be made.