Hash Tables
1
Introduction
• A table has several fields (types of
information)
– A telephone book may have fields name,
address, phone number
– A user account table may have fields user id,
password, home folder
2
Tables: rows & columns of
information
• To find an entry in the table, you only need
know the contents of one of the fields (not
all of them).
• This field is the key
– In a telephone book, the key is usually name
– In a user account table, the key is usually user
id
• Ideally, the key is unique 3
Table ADT: basic operations
• insert: given a key and an entry, inserts the entry
into the table
• find: given a key, finds the entry associated with
the key
• remove: given a key, finds the entry associated
with the key, and removes it
4
Element: a key and its entry
For searching purposes, it is best to store the
key and the entry separately (even though the
key’s value may be inside the entry)
key entry
“Smith” “Smith”, “124 Hawkers Lane”, “9675846”
Element
“Yeo” “Yeo”, “1 Apple Crescent”, “0044 1970 622455”
5
Implementation 1:
Unsorted Sequential Array
• An array in which elements are
stored consecutively in any order key entry
0 14 <data>
• insert: add to back of array; O(1) 1 45 <data>
• find: search through the keys one at 2 22 <data>
a time, potentially all of the keys; 3 67 <data>
O(n) 4 17 <data>
• remove: find and remove node;
…
O(n)
and so on
6
Implementation 2:
Sorted Sequential Array
• An array in which elements are
stored consecutively, sorted by key key entry
0 15 <data>
• insert: add in sorted order; O(n) 1 17 <data>
• find: binary search; O(log n) 2 22 <data>
3 45 <data>
• remove: find, remove element;
O(log n) 4 67 <data>
…
and so on
7
Implementation 3:
Linked List (Unsorted or Sorted)
• Elements are again stored key entry
consecutively
14 <data>
• insert: add to front; O(1) 45 <data>
or O(n) for a sorted list
• find: search through potentially all
22 <data>
the keys, one at a time; O(n)
for both unsorted and sorted lists
67 <data>
• remove: find, remove using pointer
alterations; O(n)
17 <data>
and so on 8
Implementation 4:
Balanced Binary Tree
• A balanced tree, ordered by key
22 <data>
• insert: a standard insert; O(log
n) 14 <data> 45 <data>
• find: a standard find; O(log n)
• remove: a standard remove;
17 <data> 67 <data>
O(log n)
and so on
9
Implementation 5:
Hash Table
• An array in which elements are not key entry
stored consecutively - their place
of storage is calculated using the
key and a hash function 4 <key> <data>
10 <key> <data>
hash array
Key index
function
123 <key> <data>
Hash values à mappings of the keys as
the indexes in the hash table
10
Implementation 5:
Hash Table
• An array in which elements are not stored key entry
consecutively - their place of storage is
calculated using the key and a hash
function
4 <key> <data>
• insert: calculate place of storage,
insert entry; O(1) 10 <key> <data>
• find: calculate place of storage,
retrieve entry; O(1)
• remove: calculate place of storage,
set it to null; O(1)
123 <key> <data>
All are O(1)
11
Hash Tables
• The implementation of hash tables is called hashing.
• Hashing is a technique used for performing insertions,
deletions and finds in constant time for average case
(i.e. O(1))
– Worst-case times O(n)
• However, hash table is not efficient in operations that
require any ordering information among the elements,
such as findMin, findMax and printing the table in
sorted order.
12
General Idea
• The hash table structure is an array of some fixed size,
containing the items.
• The size of the array is TableSize.
• The items that are stored in the hash table are indexed
by values from 0 to TableSize – 1.
• Each key is mapped into some number in the range 0 to
TableSize – 1
– this number is called hash value.
• The mapping is called a hash function.
13
Example Hash
Table
0
1
Items
2
john 25000
3 john 25000
phil 31250 key Hash 4 phil 31250
Function
dave 27500 5
mary 28200 6 dave 27500
7 mary 28200
key 8
9
hash
value
14
Hash Function
• The hash function
– must be simple to be computed.
– must distribute the keys evenly among the cells.
• If we know which keys will occur in
advance we can write perfect hash
functions, but we don’t.
15
Hash Method
• The worst hash function maps all keys to
the same address
• The best hash function maps all keys to
distinct addresses
• Ideally, distribution of keys to addresses
is uniform and random
16
Hash function
Problems
• Keys may not be numeric.
• Number of possible keys is much larger than the
size of table.
• Different keys may map into the same location
– Collision issue
– If there are too many collisions, the performance of
the hash table is very low
17
Hash Function
• If the input keys are integers then function
(Key mod TableSize) is a general strategy.
• If the keys are strings, hash function needs
more process.
– First convert it into a numeric value.
18
Hashing methods
• Many hashing methods
– Truncation
– Division
– Mid-square
– Folding
– Digit analysis
– And so on
19
Truncation method
• Ignore part of the key and use the rest as the array
index (converting non-numeric parts)
• Example
If students have a 9-digit identification number,
take the last 3 digits as the table position
– e.g. 925371622 becomes 622
20
Division method
• Hash function f(k) = k % b (b is table size)
– Requires only a single division operation (quite
fast)
• Certain values of b should be avoided
– For example, if b=2p, then f(k) is just the p
lowest-order bits of k; the hash function does
not depend on all the bits
• It’s a good practice to set the table size b to
be a prime number
21
Mid-square method
• Middle of square method
• This method squares the key value, and then takes
out the number of bits from the middle of the
square
• The number of bits to be used to obtain the
address depends on the table size
– If r bits are used, the range of values is 0 to 2r-1
• This works well because most or all bits of the key
value contribute to the result
22
Mid-square method
• Example
– consider items whose keys are 4-digit numbers in base
10
– The goal is to hash these key values to a table of size
100
– This range is equivalent to two digits in base 10, that is,
the number of bits is 2
– For example, the input is the number 4567, squaring
gets an 8-digit number, 20857489
– The middle two digits of this result is 57
23
Folding method
• Partition the key into several parts of the same
length except for the last
• These parts are then added together to obtain
the hash address for the key
• Two ways of carrying out this addition
– Shift-folding
– Folding and reverse
24
Folding method
• Example
25
Digit analysis method
• All the keys in the table are known in advance
• The index is formed by extracting, and then
manipulating specific digits from the key
• For example, the key is 925371622, we select the
digits from 5 through 7 resulting 537
• The manipulation can then take many forms
– Reversing the digits (735)
– Performing a circular shift to the right (753)
– Performing a circular shift to the left (375)
– Swapping each pair of digits (357)
26
Hash Function 1
• Add up the ASCII values of all characters of the key.
int hash(const string &key, int tableSize)
{
int hasVal = 0;
for (int i = 0; i < key.length(); i++)
hashVal += key[i];
return hashVal % tableSize;
}
• Simple to implement and fast.
• However, if the table size is large, the function does not
distribute the keys well.
• e.g. Table size =10000, key length <= 8, the hash function can
assume values only between 0 and 1016
27
Hash Function 2
• Examine only the first 3 characters of the key.
int hash (const string &key, int tableSize)
{
return (key[0] + 27*key[1] + 27*27*key[2]) % tableSize;
}
• This function although easily computable, is also not
appropriate if the hash table is reasonably large.
28
Hash Function 3
KeySize -1
hash(key) = å Key
i =0
[ KeySize - i - 1] × 37 i
int hash (const string &key, int tableSize)
{
int hashVal = 0;
for (int i = 0; i < key.length(); i++)
hashVal = 37 * hashVal + key[i];
hashVal %= tableSize;
if (hashVal < 0) /* in case overflows occurs */
hashVal += tableSize;
return hashVal;
};
29
Hash Function 3: example
98 108 105 key[i]
key a l i
0 1 2 i
KeySize = 3;
hash(“ali”) = (105 * 1 + 108*37 + 98*372) % 10,007 = 8172
0
1
2
“ali” hash ……
function 8172
ali
……
10,006 (TableSize)
30
Collision Resolution
• If, when an element is inserted, it hashes to the
same address as an already inserted element, then
we have a collision and need to resolve it.
• There are several methods
– Separate chaining
– Open addressing
• Linear Probing
• Quadratic Probing
• Double Hashing
31
Collision Resolution: Separate
chaining
32
Separate Chaining
• The idea is to use a linked list of all elements that
hash to the same value.
– The array elements are pointers to the first nodes of the
linked lists.
– A new item is inserted into the front of the list.
• Advantages
– Better space utilization for large items.
– Simple collision handling: searching linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: deletion from the linked list.
33
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
0 0
1 81 1
2
4 64 4
5 25
6 36 16
7
9 49 9
34
Operations
• Initialization: all entries are set to NULL
• Find
– locate the cell using hash function.
– sequential search on the linked list in that cell.
• Insertion
– Locate the cell using hash function.
– If the item does not exist, insert it as the first item in the
list.
• Deletion
– Locate the cell using hash function.
– Delete the item from the linked list.
35
Hash Table Class for separate chaining
template <class HashedObj>
class HashTable
{
public:
HashTable(const HashedObj & notFound, int size=101 );
HashTable( const HashTable & rhs )
:ITEM_NOT_FOUND( rhs.ITEM_NOT_FOUND ),
theLists( rhs.theLists ) { }
const HashedObj & find( const HashedObj & x ) const;
void makeEmpty( );
void insert( const HashedObj & x );
void remove( const HashedObj & x );
const HashTable & operator=( const HashTable & rhs );
private:
vector<List<HashedObj> > theLists; // The array of Lists
const HashedObj ITEM_NOT_FOUND;
};
int hash( const string & key, int tableSize );
int hash( int key, int tableSize );
36
Insert routine
/**
* Insert item x into the hash table. If the item is
* already present, then do nothing.
*/
template <class HashedObj>
void HashTable<HashedObj>::insert(const HashedObj & x )
{
List<HashedObj> & whichList = theLists[ hash( x,
theLists.size( ) ) ];
HashedObj* p = whichList.find( x );
if( p == NULL )
whichList.insert( x, whichList.zeroth( ) );
}
37
Remove routine
/**
* Remove item x from the hash table.
*/
template <class HashedObj>
void HashTable<HashedObj>::remove( const HashedObj & x )
{
theLists[hash(x, theLists.size())].remove( x );
}
38
Find routine
/**
* Find item x in the hash table.
* Return the matching item or ITEM_NOT_FOUND if not found
*/
template <class HashedObj>
const HashedObj & HashTable<HashedObj>::find( const
HashedObj & x ) const
{
HashedObj * itr;
itr = theLists[ hash( x, theLists.size( ) ) ].find( x );
if(itr==NULL)
return ITEM_NOT_FOUND;
else
return *itr;
}
39
Collision Resolution: Open
Addressing
43
Open Addressing
• Separate chaining has the disadvantage of
using linked lists.
– Requires the implementation of a second data
structure.
• For open addressing hashing, all the data go
inside the table.
– Thus, a bigger table is needed.
– If a collision occurs, alternative cells are tried
until an empty cell is found.
44
Open Addressing
• Formally
– Cells h0(x), h1(x), h2(x),… are tried sequentially where
hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– Function f is the collision resolution strategy.
• There are three common collision resolution
strategies (f)
– Linear Probing
– Quadratic probing
– Double hashing
45
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. f is a linear function of i, typically f(i)= i.
• Example
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
• f(i) = i;
46
Figure 20.4 Linear probing hash table after each insertion
address = (x % 10 + i) % 10 (where i = 0..9)
47
Find and Delete
• The find algorithm follows the same probe
sequence as the insert algorithm.
– A find for 58 would involve 4 probes.
– A find for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking
items as deleted)
– Standard deletion (i.e. physically removing the
item) cannot be performed.
• When an item is deleted, the location must be marked in a
special way, so that the searches know that the spot used to
have something in it. 48
Clustering Problem
• As long as table is big enough, a free cell
can always be found, but the time to do so
can get quite large.
• Worse, even if the table is relatively empty,
blocks of occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will
require several attempts to resolve the
collision, and then it will add to the cluster.
49
Quadratic Probing
• Quadratic Probing eliminates primary clustering
problem of linear probing.
• Collision function is quadratic.
– The popular choice is f(i) = i2.
• If the hash function evaluates to h and a search in
cell h is inconclusive, we try cells h + 12, h+22, …
h + i2.
– i.e. It examines cells 1,4,9 and so on away from the
original probe.
• Remember that subsequent probe points are a
quadratic number of positions from the original
probe point.
54
Figure 20.6
A quadratic probing
hash table after each address = (x % 10 + i2) % 10
insertion (note that (where i = 0..9)
the table size was
poorly chosen
because it is not a
prime number).
55
Quadratic Probing
• Problem
– We may not be sure that we will probe all locations in
the table.
– If the hash table size is not prime this problem will be
much severe.
• However, there is a theorem stating that:
– If quadratic probing is used, and the table size is prime,
then a new element can always be inserted if the table is
at least half empty.
56
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically
analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known
as secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
58
Double Hashing
• A second hash function is used to solve the
collision.
– f(i) = i * hash2(x)
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
• The function hash2(x) must never evaluate to zero.
• A function such as hash2(x) = R – ( x mod R) with
R a prime smaller than TableSize will work well.
59
Hashing Applications
• Compilers use hash tables to implement the
symbol table (a data structure to keep track
of declared variables).
• Game programs use hash tables to keep
track of positions it has encountered
(transposition table)
• Online spelling checkers.
61
Summary
• Hash tables can be used to implement the insert
and find operations in constant average time.
• It is important to have a prime number TableSize
and a good choice of hash function.
• Collision resolution
– Separate chaining
– Open addressing
62