Algorithms & data structures
Hash table
Damjan Strnad
2
Hash table
●
a hash table is a data structure that stores key-data
pairs (k,r), where k is the key and r is the associated
data of a table element
●
hash table allows direct access to data through key
values, therefore a natural implementation of a hash
table is by using an array:
– the element index in the array is calculated from the key
3
Hash table
●
if the array is large enough and the keys are integers,
each key maps to a unique array location and we can use
direct addressing (array location k belongs to key k):
– the set of active keys K is a subset of a set of all keys U
and determines the
U
locations with valid T
0 key data
0 6 NIL
pointers to stored 2 1
8 1
elements; other 2
K 1 NIL
locations have value 3 3
3
NIL 7 4 4
4
– such table is not yet 5 NIL
5
a hash table NIL
6
7
7
8
NIL
4
Hash table
●
when the number of possible keys is bigger than the array
size, we calculate the element address from the key value
using a hash function h : U {0,...,m-1}, which maps from a
set of keys U into the slots of a hash table T:
– h(k) is the address of element with key k in a hash table
U T
0
h(k1)
k1
K k2
k3 h(k2)=h(k3)
k4 k5 h(k5)
h(k4)
m-1
5
Hash table
●
the advantage of a hash table to a table with direct
addressing is smaller memory consumption (O(|K|) instead
O(|U|)
●
average access time is still O(1), but not for the worst case
●
the disadvantage of a hash table is that two keys can map
to the same slot, which is called a collision
●
the number of collisions can be reduced with a good hash
function that maps keys uniformly across the table
addresses
●
because collisions can occur for |U|>m in any case, we
must use one of techniques for collision resolving
6
Hash functions
●
a good hash function maps each of the keys with equal
probability into one of m slots in a hash table
●
we will assume that keys are natural numbers; when they
are not, we have to transform them into natural numbers:
– example: let the key be a string CLRS. The ASCII values for
individual letters are: C=67, L=76, R=82, S=83. There are
128 values for a 7-bit ASCII, therefore the string CLRS can
be uniquely transformed into a natural number as:
(67 · 1283) + (76 · 1282) + (82 · 1281) + (83 · 1280) = 141 764 947
●
two methods for construction of good hash functions:
– division method
– multiplication method
7
Division method
●
uses the following equation for a hash function:
h(k) = k mod m
●
example: hash table size is m=12, the key is k=100
– h(100) = 100 mod 12 = 4
– values 4,16,28,... (k=4+12i, i=0,1,2,...) map into the
same slot
●
using powers of 2 for m is not always good:
– operation k mod m, where m=2p, returns bottom p bits
of the key; if those are not uniformly distributed among
all possible keys (e.g., postal codes), it will cause poor
dispersion of h(k) and consequently many collisions
●
in practice a good value for m is a prime number that is
not very close to the power of 2
8
Multiplication method
●
uses the following equation for a hash function:
h(k) = ⌊m(kA mod 1)⌋
●
A is a constant from range (0,1), a recommended value is:
A=( √ 5−1)/2≈0,618034
●
(kA mod 1) means we only keep the fractional part of the
product
●
the method advantage is that the value of m is not critical,
the disadvantage is it is slow compared to division method
●
example: m=8, A=13/32, k=21: h(k )=⌊8⋅(21⋅13 mod 1)⌋=
32
=⌊8⋅(8,53125 mod 1)⌋=
=⌊8⋅0,53125⌋=
=⌊ 4,25⌋=
=4
9
Resolving collisions by chaining
●
in chaining, elements that hash to the same slot are
stored in a linked list at that slot
●
each slot contains a pointer to the head of the linked list; if
the slot is empty the pointer is NIL
U T
NIL
k1 k8 NIL
k1
K k8 NIL
k3 k2
k7 k2 k3 NIL
k6 k7
k4 k5 NIL
k6 NIL
NIL
NIL
k4 k5 NIL
10
Resolving collisions by chaining
●
example: a phone book
keys indices key-value pairs
11
Chained hash table operations
● insertion: INSERT-INTO-CHAINED-SLOT(T,x)
– inserts element x into the head of list T[h(key[x])]
– time complexity is O(1) if we assume that x is not yet in T;
if we need to verify that assumption, we must execute the
search
● searching: FIND-IN-CHAINED-SLOT(T,k)
– finds the element with key k in the list T[h(k)]
– worst case time complexity is proportional to the length of
linked list in slot h(k)
● removal: REMOVE-FROM-CHAINED-SLOT(T,x)
– removes element x from list T[h(key[x])]
– if the lists are doubly linked, the time complexity is O(1); if
they are singly linked, we must find the predecessor of x,
for which time complexity equals to that of searching
12
Chained hashing analysis
●
a load factor α for a hash table storing n elements in m
slots is defined as:
n
α=
m
●
α is the average number of elements stored in a chain
(list)
●
the worst case occurs when all elements are in the same
slot; in this case the time complexity of search is O(n), to
which we must add the time of computing the hash
function value
●
average time complexity depends on how well the hash
function h distributes the keys among m slots
13
Chained hashing analysis
●
suppose that each key is equally likely mapped to any of
m slots; this assumption is called simple uniform
hashing
●
theorem: With chained hashing both successful and
unsuccessful search has average time complexity Θ(1+α)
if we assume simple uniform hashing
●
if the hash table does not contain the element with key k,
the search is unsuccessful, otherwise it is successful
●
in expression Θ(1+α) the constant 1 is the time for
computing the hash function and α is the average search
time for a list of length α=n/m
14
Open addressing
●
open addressing is the second technique for resolving
collisions
●
all elements are stored in the same hash table; each slot
contains either a key or a NIL
●
searching is done by systematic inspection of slots until
the sought element is found or it is determined that the
element is not in the table
●
the advantage of open addressing is that we avoid
pointers; the saved memory can be used to enlarge the
hash table
●
hash table size m must be greater than the expected
number of elements n, therefore open addressing is only
used when the latter is known
15
Open addressing – insertion
●
insertion is performed by probing the hash table until we
find a free slot
●
the hash function is extended by using an additional
argument, which is the probing trial counter:
h : U × {0, 1, ... ,m-1} {0, ... ,m-1}
●
for each key k a probe sequence 〈h(k,0), h(k,1), ... ,
h(k,m-1)〉 is determined:
– h(k,0) is the first possible slot for key k, h(k,1) is the second
possible slot for key k, etc.
●
the probe sequence is a permutation of 〈0, 1, ... , m-1〉,
which means that it contains all m slots of a hash table
16
Open addressing – insertion
● procedure INSERT-INTO-HASH-TABLE(T,k) returns a
slot index into which key k is written or an overflow error if
there is no free slot
INSERT-INTO-HASH-TABLE(T,k)
i ← 0
repeat
j ← h(k,i)
if T[j] = NIL or T[j] = DELETED then % slot j is free
T[j] ← k % insert k into slot j
return j % return slot number
else % slot is occupied, try next one
i ← i + 1
until i = m
error „hash table overflow“
17
Open addressing – search
●
the procedure for finding key k searches the same
sequence of slots as the insertion procedure
●
search is unsuccessful if a free slot is encountered or if all
slots are full but none contains key k
●
worst case time complexity is O(n), where n is the number
of elements stored in the hash table
SEARCH-IN-HASH-TABLE(T,k)
i ← 0
repeat
j ← h(k,i)
if T[j] = k then % key k found in slot j
return j % return slot number
i ← i + 1
until T[j] = NIL or i = m
return NIL % key not found
18
Open addressing – removal
●
when removing an element from a hash table with open
addressing, we need to write a special value at the location of
the removed element; the special value signals that the table
location was not empty all of the time
●
this allows us to continue subsequent searches to the first truly
empty slot, stepping over once occupied locations (if we
stored NIL to the removed element location, all elements
behind it in the probe sequence would become inaccessible)
●
when inserting DELETE-IN-HASH-TABLE(T,k)
i ← 0
new elements repeat
the locations j ← h(k,i)
if T[j] = k then % key found
containing the T[j] ← DELETED % mark deleted
special value return
can be i ← i + 1
until T[j] = NIL or i = m
overwritten error „key not found“
19
Probe sequences
●
uniform hashing is a generalization of simple uniform
hashing in which every of m! possible permutations of
〈0, 1, ... , m-1〉 is selected as a probe sequence with equal
probability
●
uniform hashing is difficult to achieve in practice, so we
approximate it using methods that guarantee at least that
the probe sequence is a permutation of 〈0, 1, ... , m-1〉:
– linear probing
– quadratic probing
– double hashing
20
Linear probing
●
uses the following hash function:
h(k,i) = (h‘(k) + i) mod m; i=0, 1, ..., m-1
●
h‘(k) is an auxiliary hash function which determines the
first inspected slot T[h‘(k)]
●
the probe sequence is 〈T[h‘(k)], T[h‘(k)+1], ... , T[m-1],
T[0], T[1], ..., T[h‘(k)-1]〉
●
implementation of linear probing is simple, but the
problem is emergence of long clusters of occupied slots in
the hash table (i.e., primary clustering), which increases
the average search time
21
Quadratic probing
●
uses the following hash function:
h(k,i) = (h‘(k) + c1i + c2i2) mod m; i=0, 1, ..., m-1
● c1 and c2 are non-zero constants, which must be chosen
so that all hash table slots are addressed:
– for m=2n a good choice is c1=c2=0.5
– for arbitrary m and c1=c2=0.5 instead of „mod m“ we use
the first higher power of 2 and skip values h(k,i)≥m
●
the first probed slot is T[h‘(k)]; the index of next probed
slots changes according to a quadratic function of i
●
in practice quadratic probing performs better than linear
probing
22
Double hashing
●
uses the following hash function:
h(k,i) = (h1(k) + i·h2(k)) mod m; i=0, 1, ..., m-1
● h1 are h2 are auxiliary hash functions; the values of h2 should
be non-zero 0 and coprime to m
– for m=2n this is achieved if h2 only returns odd values
● the first probed slot is T[h1(k)]; addresses of subsequently
probed slots depend on h2(k)
●
example:
– h1(k)=k mod m, h2(k)=1 + (k mod (m-1)), k=123456, m=701
– h1(k) = 123456 mod 701 = 80, h2(k) = 1 + (123456 mod 700) =
257
– h(k,i) = [80 + i · 257] mod 701
– the probe sequence is 〈80, 337, 594, 150, 407, ...〉
23
Double hashing – example
●
insert key k=14 into the hash table of size m=13 below using
open addressing with double hashing; the hashing functions are
h1(k) = k mod 13 and h2(k) = 1 + (k mod 11):
0 – h1(14) = 14 mod 13 = 1
1 79
2
– h2(14) = 1 + (14 mod 11) = 1 + 3 = 4
3 68 – h(14,i) = [h1(14) + ih2(14)] mod 13 = [1 + 4i] mod 13
4
5 96
– the probe sequence for i = 0, 1, ..., m-1 is
6 〈1,5,9,0,4,8,12,3,7,11,2,6,10〉
7 – we try to store the key into slot 1, which is occupied
8 – next we try to store the key into slot 5, which is also
9 occupied
10 49 – in the third trial we successfully store the key into
11
slot 9 which is still free*
12
24
Double hashing – example
●
insert key k=14 into the hash table of size m=13 below using
open addressing with double hashing; the hashing functions are
h1(k) = k mod 13 and h2(k) = 1 + (k mod 11):
0 – h1(14) = 14 mod 13 = 1
1 79
2
– h2(14) = 1 + (14 mod 11) = 1 + 3 = 4
3 68 – h(14,i) = [h1(14) + ih2(14)] mod 13 = [1 + 4i] mod 13
4
5 96
– the probe sequence for i = 0, 1, ..., m-1 is
6 〈1,5,9,0,4,8,12,3,7,11,2,6,10〉
7 – we try to store the key into slot 1, which is occupied
8 – next we try to store the key into slot 5, which is also
9 14 occupied
10 49 – in the third trial we successfully store the key into
11
slot 9 which is still free
12
25
Double hashing – example
●
we have removed key 96 from a hash table and marked the
removed element location using a special symbol ×
●
we want to verify if key 14 is stored in the hash table:
0 – the probe sequence will be identical to the one
1 79 used at insertion, i.e. 〈1,5,9,0,4,8,12,3,7,11,2,6,10〉
2 – the first inspected slot is 1, where the key is not
3 68
found
4
5 ×
– the next inspected slot is 5, where the special
6 symbol is encountered, which means we must
7 continue the search
8 – the next inspected slot is 9, where the key is finally
9 14 found*
10 49
11
12
26
Double hashing – example
●
we have removed key 96 from a hash table and marked the
removed element location using a special symbol ×
●
we want to verify if key 14 is stored in the hash table:
0 – the probe sequence will be identical to the one
1 79 used at insertion, i.e. 〈1,5,9,0,4,8,12,3,7,11,2,6,10〉
2 – the first inspected slot is 1, where the key is not
3 68
found
4
5 ×
– the next inspected slot is 5, where the special
6 symbol is encountered, which means we must
7 continue the search
8 – the next inspected slot is 9, where the key is finally
9 14 found
10 49
11
12
27
Open addressing analysis
●
the average number of trials for an unsuccessful search is
at most 1/(1−α) if we assume uniform hashing and a load
factor α < 1
●
element insertion requires at most 1/(1−α) trials if uniform
hashing is assumed
●
if a load factor α < 1, uniform hashing and equal
probability of finding every key in a hash table are
assumed, then the average number of trials for a
successful search is at most:
1 1
⋅ln
α 1−α
28
Disadvantages and applications
●
disadvantages of a hash table:
– it cannot be a dynamic data structure owing to a fixed hash
function
– it cannot be adapted to a changed element distribution
– it does not allow efficient implementation of operations like
minimum or maximum element extraction, predecessor or
successor search, sorting, ...
●
applications:
– dictionaries and associative arrays
– symbol tables in compilers
– memory pages in operating systems