Hash Tables
ESO207
Indian Institute of Technology, Kanpur
Introduction
• Consider a dynamic set that supports dictionary
operations, I NSERT, S EARCH and D ELETE.
• Previously, the L IST (and even S TACK and Q UEUE) data
structures can support the above operations, but not
efficiently.
• Hash table is an effective data structure for implementing
dictionaries.
• Searching for an element in a hash table can take as long
as in a linked list – Θ(n) in the worst case, but
• In practice, hashing performs extremely well. Under
reasonable assumptions, expected time to search for an
element in a hash table is O(1).
Introduction
• Consider a dynamic set that supports dictionary
operations, I NSERT, S EARCH and D ELETE.
• Previously, the L IST (and even S TACK and Q UEUE) data
structures can support the above operations, but not
efficiently.
• Hash table is an effective data structure for implementing
dictionaries.
• Searching for an element in a hash table can take as long
as in a linked list – Θ(n) in the worst case, but
• In practice, hashing performs extremely well. Under
reasonable assumptions, expected time to search for an
element in a hash table is O(1).
Introduction
• Consider a dynamic set that supports dictionary
operations, I NSERT, S EARCH and D ELETE.
• Previously, the L IST (and even S TACK and Q UEUE) data
structures can support the above operations, but not
efficiently.
• Hash table is an effective data structure for implementing
dictionaries.
• Searching for an element in a hash table can take as long
as in a linked list – Θ(n) in the worst case, but
• In practice, hashing performs extremely well. Under
reasonable assumptions, expected time to search for an
element in a hash table is O(1).
Introduction
• Consider a dynamic set that supports dictionary
operations, I NSERT, S EARCH and D ELETE.
• Previously, the L IST (and even S TACK and Q UEUE) data
structures can support the above operations, but not
efficiently.
• Hash table is an effective data structure for implementing
dictionaries.
• Searching for an element in a hash table can take as long
as in a linked list – Θ(n) in the worst case, but
• In practice, hashing performs extremely well. Under
reasonable assumptions, expected time to search for an
element in a hash table is O(1).
Introduction
• Consider a dynamic set that supports dictionary
operations, I NSERT, S EARCH and D ELETE.
• Previously, the L IST (and even S TACK and Q UEUE) data
structures can support the above operations, but not
efficiently.
• Hash table is an effective data structure for implementing
dictionaries.
• Searching for an element in a hash table can take as long
as in a linked list – Θ(n) in the worst case, but
• In practice, hashing performs extremely well. Under
reasonable assumptions, expected time to search for an
element in a hash table is O(1).
Introduction (contd.)
• A hash table generalizes an array’s ability to directly
address by an index into it in O(1) time.
• Notation: Let U be the universe of the possible values of
the keys.
• The dynamic set S at any time is a subset of U.
• For simplicity (although not general), assume that
U = {0, 1, . . . , m − 1} is a set of numbered keys.
• No two elements of the set have the same key.
Direct Addressing
Direct Address Tables
• Direct addressing works well when the universe is small.
• An array called the direct-address table is used and
denoted by T [0, . . . , m − 1].
• Each position or slot of the array corresponds to a key in
the universe U.
• If the dynamic set is S, then, for each k ∈ S, T [k ] (that is,
slot k ) points to the element of the set with key k .
• For each k ∈ U − S, T [k ] is NIL.
Operations: Direct Address Tables
D IRECT-A DDRESS -S EARCH(T , k )
1. return T [k ]
D IRECT-A DDRESS -I NSERT(T , x)
1. T [x.key ] = x
D IRECT-A DDRESS -D ELETE(T , x)
1. T [x.key ] = NIL
Direct Address Tables
• For some applications, direct address table can hold the
elements directly.
• That is, rather than storing the element’s key and satellite
data in an object external to the table, with a pointer from
the slot, we can store the object in the slot itself.
• For this, empty slots will need to be indicated by a special
key value.
Hash Tables
• Down-side of direct addressing:
1. If the universe U is large, storing a table T of size |U| may
be impractical or impossible.
2. Further, the set of keys K actually stored in the table T
may be very small. This would lead to wastage of space.
3. Hash tables typically work with space O(|K |), while still
being able to search in expected O(1) time. [For
Direct-Address, this was worst-case O(1) time].
Hash Tables
• Keep a table T consisting of m slots numbered
T [0, . . . , m − 1].
• A hash function h maps each key value of the universe
U to some slot in T , that is,
h : U → {0, 1, . . . , m − 1}
• We say that an element with key k hashes to slot h(k ).
• In general, m |U| (and that is where the benefit in the
storage requirement of hash tables arises).
• Hence, two keys from U may hash to the same value,
that is,
Collision : h(k1 ) = h(k2 ), k1 , k2 ∈ U, k1 6= k2 .
In such a case we say that the keys k1 and k2 collide
under the hash function h.
Collisions
• Collisions cannot be avoided, since |U| > m.
• One of the goals of the design of hash functions is to
make h appear “random”.
• The very term “to hash” evokes images of random mixing
and chopping.
• We will explain the notion of randomness of hash
functions later — note that a hash function h is
deterministic, that is, given an input key k , the output
h(k ) is the same value.
Collision resolution: by chaining
• Simplest form of resolving collisions is to keep a linked
list of the elements that hash to each bucket.
• Each slot j contains a pointer to the head of the list of all
the stored elements that hash to the slot j.
• If there are no such elements, slot j contains NIL.
Code for Insert, Search and Delete
C HAINED -H ASH -I NSERT(T , x)
1. Insert x at the head of the list T [h(x.key )]
C HAINED -H ASH -S EARCH(T , k )
1. Search for an element with key k in list T [h(k )]
C HAINED -H ASH -D ELETE(T , x)
1. Delete x from the list T [h(x.key )]
• Worst case time for insertion: O(1). Assumes element
being inserted is not already present.
• Otherwise, we can check for this assumption by
searching for an element whose key is x.key before
inserting.
Search, Delete
C HAINED -H ASH -I NSERT(T , x)
1. Insert x at the head of the list T [h(x.key )]
C HAINED -H ASH -S EARCH(T , k )
1. Search for an element with key k in list T [h(k )]
C HAINED -H ASH -D ELETE(T , x)
1. Delete x from the list T [h(x.key )]
• Worst case time for Search: length of the linked list.
• Worst case time for Delete: O(1). Note that deletion
takes (pointer to the element) x to be deleted.
• Since we have doubly-linked lists, deletion is fast.
Comment
Hash tables with open chaining is a very popular data
structure in the industry.
Analysis of Chaining: introduction
• Load factor: Given a hash table T with m slots that
stores n elements, the load factor α for T is n/m, that is,
the average number of elements stored in a chain.
• A well-designed hash table in practice attempts to keep α
close to 1.
• Worst-case performance. In the worst case, the input
may consist of n keys all of whom hash to the same slot.
This would give a chain length of n at this slot, and chain
lengths of 0 at all other slots. The worst-case time for
searching would be O(n).
• Hash tables are not used for their worst-case
performance.
Analysis: introduction
• Hash tables are immensely popular because of their
average-case performance.
• The average case performance of hashing depends on
how well the hash function h distributes the set of keys to
be stored among the m slots {0, 1, . . . , m − 1}, on the
average.
• Idealized assumption about hash functions for analysis:
• Simple uniform hashing. A key value is equally likely
to hash into any of the m slots, independently of where
any other element hashes to.
Analysis
• Corresponding to slot j ∈ {0, 1, . . . , m − 1}, denote the
length of the chain at slot j to be nj . Then,
n = n0 + n1 + . . . + nm−1
• The average chain length is the sum of all chain lengths
divided by the number of slots, so this is
Pm−1
k =0 nk n
E nj = = =α
m m
which is the load factor of the hash table.
• For the analysis, we will make an important assumption,
namely, that
Hash value h(k ) is computed in time O(1).
Analysis
• Expected time required to search for a key in a hash
table.
• Step 1: compute h(k ) in O(1) time. Then search through
the list at slot number h(k ).
• Length of this list is nh(k ) .
• Hence, an unsuccessful search will go over all the keys in
this list requiring time O(nh(k ) ).
Expected time for searching
• By uniform hashing assumption, h(k ) = j with equal
probability 1/m.
• the average time for search is
m−1
1 X
E nh(k ) = nj = α
m
j=0
as calculated before.
• The average case time complexity of search is
O(1 + α), where, the O(1) time is required for computing
the hash value h(k ).
Expectation Analysis: contd.
• One would expect a successful search to take slightly
less on average, since, the entire chain need not be
browsed, rather, the search may halt once the key is
found.
• A slightly detailed analysis shows that the average time
for a successful search is 1 + α/2 − α/(2m) = Θ(1 + α).
• All this analysis shows that if α = n/m is O(1), then, the
search operation requires on the average O(1) time.
• Using doubly linked lists for storing the chains, deletion
takes O(1) time.
• Insertion (without searching for duplicates) takes O(1)
time, hence, all the hash table operations can be
performed in O(1) time, on average, if α = O(1).
Good Hash Functions
• A good hash function should come close to realizing the
idealized assumption made by simple uniform hashing.
• Many hash functions assume that the universe of keys is
the set of natural numbers N = {0, 1, 2, . . .}.
• Under this assumption, either the keys themselves are
natural numbers, or they are transformed into natural
numbers (e.g., strings of characters may be viewed as
long numbers over a larger alphabet).
Good and bad hash functions
• The simplest hash function is of the form
h(k ) = k mod m
• This hash function is very efficient, although it does not
have good uniformity properties.
• It is often not recommended to use m = 2r for some
value of r , since, k mod 2r gives the low order r bits of k .
• Unless we know that the low order r bits of the input keys
are equally likely, we are better off designing a hash
function whose value depends on all the bits of the key.
Example hash function: key modulo prime
• Choosing m to be a prime p that is not too close to a
power of 2 is often a good choice.
• For example, suppose there are n = 2000 keys to be
hashed. Let p = 701, which would give a load factor of
2000/701 ≈ 3 and h(k ) = k mod 701.
Multiplicative method for creating hash functions
• The multiplication method for creating hash functions
works in two steps.
1. Choose a number A in the range 0 < A < 1 and multiply
the key k by A and take the fractional part of kA, that is,
take kA − bkAc.
2. This is also referred to as
kA mod 1 = kA − bkAc
3. Now multiply this by the number of slots m and return the
floor, that is,
h(k ) = bm(kA mod 1)c
Multiplication method
• Advantage: the value of m is not critical.
• An efficient way of implementing the multiplication
method:
1. Choose m = 2p , a power of 2.
2. Suppose the word size of the machine is w bits and a key
value fits into a word.
3. Choose A to be of the form s/2w so that A · 2w = s.
4. Since k and s are both w-bit integers, ks is a 2w-bit
integer that we write as
ks = r1 2w + r0
where, r1 is the high order w-bit of ks and r0 is the low
order w-bit.
Multiplication method
1. Then,
ks r0
kA = w
= r1 + w
2 2
2. Hence the fractional part of kA is r0 /2w (since,
0 < r0 ≤ 2w − 1).
3. So, kA mod 1 = r0 /2w .
4. Hence,
j r k
0
h(k ) = bm(kA mod 1)c = 2p w
2
5. Since, r0 is a w-bit number, multiplying r0 /2w by 2p gives
the top p bits of r0 .
• Why? write r0 as a w-bit binary number:
r0 = b1 + b2 · 2 + b2 · 22 . . . + bw 2w−1
• Then,
r0 · 2p 1 p p+1 w+p−1
= b1 · 2 + b2 · 2 + . . . + bw · 2
2w 2w
1
= w−p b1 + b2 · 2 + . . . + bw−p · 2w−p−1
2
+ bw−p+1 + bw−p+2 · 2 + . . . + bw · 2p−1
• Thus,
r0 · 2p
= bw−p+1 + bw−p+2 · 2 + . . . + bw · 2p−1
2w
which is the number corresponding to the top-p bits of r0 .
Multiplicative method
• Although this method works with any value of the
constant A, experiments show that it works better with
some values than others.
• Knuth suggests that
√
A ≈ ( 5 − 1)/2 = 0.6180339887 . . .
is likely to work reasonably well.