11_hashtable-1.ppt. Data structure algorithm
Hash Tables
Hash Tables
• We’ll discuss the hash table ADT which supports only
a subset of the operations allowed by binary search
trees.
• The implementation of hash tables is called hashing.
• Hashing is a technique used for performing insertions,
deletions and finds in constant average time (i.e. O(1))
– Worst-case times O(n)
• This data structure, however, is not efficient in
operations that require any ordering information among
the elements, such as findMin, findMax and printing the
entire table in sorted order.
General Idea
• The ideal hash table structure is merely an array of some fixed
size, containing the items.
• A stored item needs to have a data member, called the key, that will
be used in computing the index value for the item.
– The key could be an integer, a string, etc.
– e.g. a name or ID that is part of a larger employee structure.
• The size of the array is TableSize.
• The items that are stored in the hash table are indexed by values
from 0 to TableSize – 1.
• Each key is mapped into some number in the range 0 to
TableSize – 1; this number is called hash value.
• The mapping is called a hash function.
Example
[Figure: four items (mary 28200, dave 27500, phil 31250, john 25000).
Each item's key is fed to the hash function, which produces a hash
value used as an index into a 10-cell hash table (cells 0-9) where the
items are stored.]
Hash Function
• The hash function:
– must be simple to compute.
– must distribute the keys evenly among the cells.
• If we know which keys will occur in
advance we can write perfect hash
functions, but we don’t.
Hash function
Problems:
• Keys may not be numeric.
• The number of possible keys is much larger than the space available
in the table.
• Different keys may map to the same location.
– The hash function is not one-to-one => collision.
– If there are too many collisions, the performance of
the hash table will suffer dramatically.
Hash Functions
• If the input keys are integers then simply
Key mod TableSize is a general strategy.
– Unless key happens to have some undesirable
properties. (e.g. all keys end in 0 and we use
mod 10)
• If the keys are strings, the hash function needs
more care.
– First convert the string into a numeric value.
Some methods
• Truncation:
– e.g. map 123456789 into a table of 1000 addresses by
picking the last 3 digits of the key: H(IDNum) = IDNum % 1000.
• Folding:
– e.g. split into 123|456|789, add the parts, and take the mod.
• Key mod N:
– N is the size of the table; better if it is prime.
• Squaring:
– Square the key and then truncate.
• Radix conversion:
– e.g. treat 1 2 3 4 as a base-11 number; truncate if necessary.
Hash Function 1
• Add up the ASCII values of all characters of the key.
int hash(const string &key, int tableSize)
{
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal += key[i];
    return hashVal % tableSize;
}
• Simple to implement and fast.
• However, if the table size is large, the function does not
distribute the keys well.
• e.g. TableSize = 10,000 and key length <= 8: the hash function can
assume values only between 0 and 1016 (8 * 127).
Hash Function 2
• Examine only the first 3 characters of the key.
int hash (const string &key, int tableSize)
{
    return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;
}
• In theory, 26 * 26 * 26 = 17576 different combinations can be
generated. However, English is not random: only 2851 different
combinations actually occur.
• Thus this function, although easily computable, is also not
appropriate if the hash table is reasonably large.
Hash Function 3
int hash (const string &key, int tableSize)
{
    int hashVal = 0;
    for (int i = 0; i < key.length(); i++)
        hashVal = 37 * hashVal + key[i];
    hashVal %= tableSize;
    if (hashVal < 0)  /* in case overflow occurs */
        hashVal += tableSize;
    return hashVal;
}
The function computes the polynomial hash

hash(key) = ( Σ_{i=0}^{KeySize−1} key[KeySize − i − 1] · 37^i ) mod TableSize
Hash function for strings:
key = "ali", KeySize = 3
ASCII values: 'a' = 97, 'l' = 108, 'i' = 105
hash("ali") = (105 * 1 + 108 * 37 + 97 * 37²) % 10,007 = 6803
[Figure: the hash function maps the key "ali" to cell 6803 of a hash
table with TableSize = 10,007 (cells 0 to 10,006).]
Collision Resolution
• If, when an element is inserted, it hashes to the
same value as an already inserted element, then we
have a collision and need to resolve it.
• There are several methods for dealing with this:
– Separate chaining
– Open addressing
• Linear Probing
• Quadratic Probing
• Double Hashing
Separate Chaining
• The idea is to keep a list of all elements that hash
to the same value.
– The array elements are pointers to the first nodes of the
lists.
– A new item is inserted at the front of its list.
• Advantages:
– Better space utilization for large items.
– Simple collision handling: search the linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: delete from the linked list.
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
0: 0
1: 81 -> 1
4: 64 -> 4
5: 25
6: 36 -> 16
9: 49 -> 9
(cells 2, 3, 7, 8 are empty)
Operations
• Initialization: all entries are set to NULL
• Find:
– locate the cell using hash function.
– sequential search on the linked list in that cell.
• Insertion:
– Locate the cell using hash function.
– (If the item does not exist) insert it as the first item in
the list.
• Deletion:
– Locate the cell using hash function.
– Delete the item from the linked list.
Hash Table Class for separate chaining
template <class HashedObj>
class HashTable
{
  public:
    HashTable( const HashedObj & notFound, int size = 101 );
    HashTable( const HashTable & rhs )
      : ITEM_NOT_FOUND( rhs.ITEM_NOT_FOUND ),
        theLists( rhs.theLists ) { }

    const HashedObj & find( const HashedObj & x ) const;

    void makeEmpty( );
    void insert( const HashedObj & x );
    void remove( const HashedObj & x );

    const HashTable & operator=( const HashTable & rhs );

  private:
    vector<List<HashedObj> > theLists;   // The array of Lists
    const HashedObj ITEM_NOT_FOUND;
};

int hash( const string & key, int tableSize );
int hash( int key, int tableSize );
Insert routine
/**
* Insert item x into the hash table. If the item is
* already present, then do nothing.
*/
template <class HashedObj>
void HashTable<HashedObj>::insert( const HashedObj & x )
{
    List<HashedObj> & whichList = theLists[ hash( x, theLists.size( ) ) ];
    HashedObj * p = whichList.find( x );
    if( p == NULL )
        whichList.insert( x, whichList.zeroth( ) );
}
Remove routine
/**
* Remove item x from the hash table.
*/
template <class HashedObj>
void HashTable<HashedObj>::remove( const HashedObj & x )
{
theLists[hash(x, theLists.size())].remove( x );
}
Find routine
/**
* Find item x in the hash table.
* Return the matching item or ITEM_NOT_FOUND if not found
*/
template <class HashedObj>
const HashedObj & HashTable<HashedObj>::find( const HashedObj & x ) const
{
    HashedObj * itr = theLists[ hash( x, theLists.size( ) ) ].find( x );
    if( itr == NULL )
        return ITEM_NOT_FOUND;
    else
        return *itr;
}
Analysis of Separate Chaining
• Collisions are very likely.
– How likely, and what is the average length of the
lists?
• Load factor λ definition:
– Ratio of the number of elements (N) in a hash table
to the TableSize.
• i.e. λ = N/TableSize
– The average length of a list is also λ.
– For chaining, λ is not bounded by 1; it can be > 1.
Cost of searching
• Cost = constant time to evaluate the hash function
+ time to traverse the list.
• Unsuccessful search:
– We have to traverse the entire list, so we compare λ nodes on
average.
• Successful search:
– The list contains the one node that stores the searched item + 0 or more
other nodes.
– Expected # of other nodes = (N−1)/M, which is essentially λ, since
M (the TableSize) is presumed large.
– On average, we need to check half of the other nodes while
searching for a certain element.
– Thus average search cost = 1 + λ/2.
Summary
• The analysis shows us that the table size is
not really important, but the load factor is.
• TableSize should be as large as the number
of expected elements in the hash table.
– To keep load factor around 1.
• TableSize should be prime for even
distribution of keys to hash table cells.
Hashing: Open Addressing
Collision Resolution with Open Addressing
• Separate chaining has the disadvantage of
using linked lists.
– Requires the implementation of a second data
structure.
• In an open addressing hashing system, all
the data go inside the table.
– Thus, a bigger table is needed.
– Generally the load factor should be below 0.5.
• If a collision occurs, alternative cells are tried
until an empty cell is found.
Open Addressing
• More formally:
– Cells h0(x), h1(x), h2(x), …are tried in succession where
hi(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– The function f is the collision resolution strategy.
• There are three common collision resolution
strategies:
– Linear Probing
– Quadratic probing
– Double hashing
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. f is a linear function of i, typically f(i)= i.
• Example:
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
• f(i) = i;
Figure 20.4: Linear probing hash table after each insertion.
Find and Delete
• The find algorithm follows the same probe
sequence as the insert algorithm.
– A find for 58 would involve 4 probes.
– A find for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking
items as deleted)
– Standard deletion (i.e. physically removing the
item) cannot be performed.
• When an item is deleted, the location must be marked in a special way, so that
the searches know that the spot used to have something in it.
– e.g. remove 89 from hash table.
Clustering Problem
• As long as table is big enough, a free cell
can always be found, but the time to do so
can get quite large.
• Worse, even if the table is relatively empty,
blocks of occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will
require several attempts to resolve the
collision, and then it will add to the cluster.
Analysis of insertion
• The average number of cells examined in an insertion
using linear probing is roughly (1 + 1/(1 − λ)²) / 2.
• The proof is beyond the scope of this class.
• For a half-full table (λ = 0.5) we obtain 2.5 as the average
number of cells examined during an insertion.
• Primary clustering is a problem at high load
factors. For half-empty tables the effect is not
disastrous.
Analysis of Find
• An unsuccessful search costs the same as
insertion.
• The cost of a successful search of X is equal to the
cost of inserting X at the time X was inserted.
• For λ = 0.5 the average cost of insertion is 2.5.
The average cost of finding the newly inserted
item will be 2.5 no matter how many insertions
follow.
• Thus the average cost of a successful search is an
average of the insertion costs over all smaller load
factors.
Average cost of find
• The average number of cells examined in an
unsuccessful search using linear probing is
roughly (1 + 1/(1 − λ)²) / 2.
• The average number of cells examined in a
successful search is approximately
(1 + 1/(1 − λ)) / 2.
– Derived from: (1/λ) ∫₀^λ (1/2) (1 + 1/(1 − x)²) dx = (1/2) (1 + 1/(1 − λ))
Linear Probing – Analysis – Example
• What is the average number of probes for a successful
search and an unsuccessful search for this hash table?
– Hash Function: h(x) = x mod 11

index: 0   1   2   3    4    5    6   7   8    9    10
key:   9   –   2   13   25   24   –   –   30   20   10

Successful Search:
– 20: 9 -- 30: 8 -- 2: 2 -- 13: 2,3 -- 25: 3,4
– 24: 2,3,4,5 -- 10: 10 -- 9: 9,10,0
Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful Search:
– We assume that the hash function distributes the keys
uniformly.
– 0: 0,1 -- 1: 1 -- 2: 2,3,4,5,6 -- 3: 3,4,5,6
– 4: 4,5,6 -- 5: 5,6 -- 6: 6 -- 7: 7 -- 8: 8,9,10,0,1
– 9: 9,10,0,1 -- 10: 10,0,1
Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
Quadratic Probing
• Quadratic probing eliminates the primary clustering
problem of linear probing.
• The collision function is quadratic.
– The popular choice is f(i) = i².
• If the hash function evaluates to h and a search in
cell h is inconclusive, we try cells h + 1², h + 2², …,
h + i².
– i.e. it examines cells 1, 4, 9, and so on away from the
original probe.
• Remember that subsequent probe points are a
quadratic number of positions from the original
probe point.
Figure 20.6: A quadratic probing hash table after each insertion (note
that the table size was poorly chosen because it is not a prime number).
Quadratic Probing
• Problem:
– We may not be sure that we will probe all locations in
the table (i.e. there is no guarantee of finding an empty cell
if the table is more than half full.)
– If the hash table size is not prime, this problem will be
much more severe.
• However, there is a theorem stating that:
– If the table size is prime and the load factor is not larger
than 0.5, all probes will be to different locations and an
item can always be inserted.
Theorem
• If quadratic probing is used, and the table
size is prime, then a new element can
always be inserted if the table is at least half
empty.
Some considerations
• How efficient is calculating the quadratic
probes?
– Linear probing is easily implemented.
Quadratic probing appears to require * and %
operations.
– However, this is overcome by the following trick:
• Hᵢ = Hᵢ₋₁ + 2i − 1 (mod M)
Some Considerations
• What happens if the load factor gets too high?
– Dynamically expand the table as soon as the
load factor reaches 0.5; this is called
rehashing.
– Always double to a prime number.
– When expanding the hash table, reinsert every element
into the new table using the new hash function.
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically
analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known
as secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
Double Hashing
• A second hash function is used to drive the
collision resolution.
– f(i) = i * hash2(x)
• We apply a second hash function to x and probe at
a distance hash2(x), 2*hash2(x), … and so on.
• The function hash2(x) must never evaluate to zero.
– e.g. let hash2(x) = x mod 9 and try to insert 99 in the
previous example: 99 mod 9 = 0, so the probe distance is 0
and the insertion loops forever.
• A function such as hash2(x) = R − (x mod R), with
R a prime smaller than TableSize, will work well.
– e.g. try R = 7 for the previous example: hash2(x) = 7 − (x mod 7).
The relative efficiency of
four collision-resolution methods
Hashing Applications
• Compilers use hash tables to implement the
symbol table (a data structure to keep track
of declared variables).
• Game programs use hash tables to keep
track of positions they have encountered
(transposition tables).
• Online spelling checkers.
Largest Subset w/ Consecutive Numbers
Input: 1,3,8,14,4,10,2,11  Output: 1,2,3,4
Insert all numbers into a hash table (O(n)).
Traverse the array again with this strategy:
For the next number x:
  If x−1 is in the hash table (find: O(1)):
    no sequence starts at x ('cos x is part of another (longer) seq)
  Else:
    a seq is starting at x:
      append x to the seq, x++
      repeat till x+1 is not in the hash table (find: O(1))
    if seq.size is larger than the champion's, update the champion
Largest Subset w/ Consecutive Numbers
Input: 1,3,8,14,4,10,2,11  Output: 1,2,3,4
Complexity analyses of 3 solutions:
1) Naïve: for each element e, set a = e and search for a+1, then a++: O(n²)
2) Sorting: get 1,2,3,4,8,10,11,14. Start at 1, track until 4 (size = 4);
restart at 8 (size = 1); restart at 10, track until 11 (size = 2); restart
at 14 (size = 1). Note that the sorted array is traversed once:
O(n log n + n) = O(n log n)
3) Hashing: for each cluster of size k, we do
   k successful find()s by the starter (the +1 lookups)
   the other k−1 non-starter members find() their left element (the −1 lookup)
   Total cost is hence k + (k − 1) = 2k − 1
   [ C1 ][ C2 ][ C3 ][ C4 ][ C5 ] → Σᵢ (2kᵢ − 1) = 2 Σᵢ kᵢ − #clusters
   = 2n − #clusters = O(n)  // worst: 1 huge cluster → 2n − 1; best: n clusters (each of size 1) → 2n − n
Summary
• Hash tables can be used to implement the insert
and find operations in constant average time.
– Performance depends on the load factor, not on the number of items
in the table.
• It is important to have a prime TableSize and a
correct choice of load factor and hash function.
• For separate chaining the load factor should be
close to 1.
• For open addressing load factor should not exceed
0.5 unless this is completely unavoidable.
– Rehashing can be implemented to grow (or shrink) the
table.