EL9343
Data Structure and Algorithm
Lecture 6: Hash Tables, Binary Search Tree
Instructor: Yong Liu
The Search Problem
} Find items with keys matching a given search key
} Given an array A containing n keys and a search key x,
find the index i such that x = A[i]
} As in the case of sorting, a key could be part of a large
record.
Special Case: Dictionaries
} Dictionary: Abstract Data Type (ADT) — maintain
a set of items, each with a key, subject to
} Insert(item): add item to set
} Delete(item): remove item from set
} Search(key): return item with key if it exists
Applications
} Keeping track of customer account information at
a bank
} Search through records to check balances and perform
transactions
} Search engine
} Looks for all documents containing a given word
} ...
Direct Addressing
} Assumptions:
} Key values are distinct
} Each key is drawn from a universe U = {0, 1, . . . , m - 1}
} Idea:
} Store the items in an array, indexed by keys
} Direct-address table representation:
} An array T[0 . . . m - 1]
} Each slot, or position, in T corresponds to a key in U
} For an element x with key k, a pointer to x (or x itself) will be placed in
location T[k]
} If there are no elements with key k in the set, T[k] is empty, represented
by NIL
Direct Addressing: Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] ← NIL
Running time for these operations: O(1)
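A minimal Python sketch of a direct-address table, mirroring the three operations above (the assumption that items carry a .key attribute is illustrative):

```python
# Direct-address table: one slot per possible key in U = {0, 1, ..., m-1}.
class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m       # T[k] holds the item with key k; None plays the role of NIL

    def search(self, k):          # DIRECT-ADDRESS-SEARCH: O(1)
        return self.T[k]

    def insert(self, x):          # DIRECT-ADDRESS-INSERT: O(1); x has a .key attribute
        self.T[x.key] = x

    def delete(self, x):          # DIRECT-ADDRESS-DELETE: O(1)
        self.T[x.key] = None
```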
Example
Example 1:
} 100 records with distinct integer keys ranging from 1 to 100,
} create an array A of 100 items, store item with key i in A[i]
Example 2:
} keys are nine-digit social security numbers
} create an array A of 10^9 items to store 100 items!
} number of items much smaller than key value range
Hash Tables
} When |K| is much smaller than |U|, a hash table
requires much less space than a direct-address
table
} Can reduce storage requirements to Θ(|K|)
} Can still get O(1) search time, but on the average case,
not the worst case
Hash Tables
} Idea
} Use a function h to compute the slot for each key
} Store the element in slot h(k)
} A hash function h transforms a key into an index in a hash
table T[0…m-1]
h : U → {0, 1, . . . , m - 1}
} We say that k hashes to slot h(k)
} Advantages
} Reduce the range of array indices handled: m instead of |U|
} Storage is also reduced
Hash Tables: Example
(Figure: a hash function h maps the actual keys K = {k1, ..., k5} from the universe U into slots 0 ... m-1 of table T; note that h(k2) = h(k5).)
Revisit Example 2
Suppose keys are 9-digit social security numbers
Possible hash function: h(ssn) = ssn mod 100 (last 2 digits of ssn)
h(103-224-511) = 11 = h(201-789-611) — Collisions!
(Figure: as before, both keys are mapped to the same slot of T.)
Collisions
} Two or more keys hash to the same slot!!
} For a given set K of keys
} If |K| ≤ m, collisions may or may not happen, depending
on the hash function
} If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
} Avoiding collisions completely is hard, even with
a good hash function
Handling Collisions
} Collisions can be resolved by:
} Chaining
} Open addressing
} Linear probing
} Quadratic probing
} Double hashing
} We will discuss chaining first, and ways to build
“good” hash functions.
Handling Collisions Using Chaining
} Idea
} Put all elements that hash to the same slot into a linked list
} Slot j contains a pointer to the head of the list of all
elements that hash to j
Collision with Chaining - Discussion
} Choosing the size of the table
} Small enough not to waste space
} Large enough such that lists remain short
} Typically 1/5 or 1/10 of the total number of elements
} How should we keep the lists: ordered or not?
} Not ordered!
} Insert is fast
} Can easily remove the most recently inserted elements
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
} Worst-case running time is O(1)
} Assumes that the element being inserted isn’t
already in the list
} It would take an additional search to check if it
was already inserted
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]
} Need to find the element to be deleted
} Worst-case running time is proportional to the length of the list,
since deletion requires searching the corresponding list first
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
} Running time is proportional to the length of the
list of elements in slot h(k)
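A compact Python sketch of the three chaining operations above (using Python lists as chains and a division-method hash are illustrative choices, not the slides' code):

```python
# Hashing with chaining: each slot of T holds a chain of (key, item) pairs.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]            # m empty chains

    def _h(self, k):
        return k % self.m                          # division-method hash (see later slides)

    def insert(self, k, item):                     # CHAINED-HASH-INSERT
        self.T[self._h(k)].insert(0, (k, item))    # at the head, as in the slides
                                                   # (O(1) with a real linked list)
    def search(self, k):                           # CHAINED-HASH-SEARCH: O(chain length)
        for key, item in self.T[self._h(k)]:
            if key == k:
                return item
        return None

    def delete(self, k):                           # CHAINED-HASH-DELETE: searches first
        chain = self.T[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                del chain[i]
                return
```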
Analysis of Hashing with Chaining:Worst Case
} How long does it take to search T for an element with a given key?
} Worst case:
} All n keys hash to the same slot, forming a single chain of length n
} Worst-case time to search is Θ(n), plus time to compute the hash function
Analysis of Hashing with Chaining:Average Case
} Average case
} depends on how well the hash function distributes
the n keys among the m slots
} Simple uniform hashing assumption:
} Any given element is equally likely to hash into
any of the m slots (i.e., probability of collision
Pr(h(x) = h(y)) is 1/m)
} Length of the list at slot j: nj, for j = 0, 1, . . . , m – 1
} Number of keys in the table: n = n0 + n1 + · · · + nm-1
} Average value of nj: E[nj] = α = n/m
Load Factor of a Hash Table
} Load factor of a hash table T:
α = n/m
} n = # of elements stored in the table
} m = # of slots in the table = # of linked lists
} α is the average number of elements stored in a chain
} α can be <, =, or > 1
Case 1: Unsuccessful Search (i.e., item not stored in the table)
Theorem
} An unsuccessful search in a hash table takes expected time Θ(1 + α )
under the assumption of simple uniform hashing
(i.e., probability of collision Pr(h(x)=h(y)), is 1/m)
Proof
} Searching unsuccessfully for any key k
} need to search to the end of the list T[h(k)]
} Expected length of the list:
} E[nh(k)] = α = n/m
} Expected number of elements examined in an unsuccessful search is α
} Total time required:
} Θ(1) (for computing the hash function) + α = Θ(1 + α)
Case 2: Successful Search
Theorem
} A successful search in a hash table takes expected time Θ(1 + α)
under the assumption of simple uniform hashing
Proof: let xi be the i-th element inserted into the hash table, and
define Xij to be the indicator random variable that elements i and j
hash to the same value; then the expected number of
elements examined in a successful search is:
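The final expression is missing from this copy of the slide; the standard derivation from the textbook (CLRS), using the variables defined above and E[Xij] = 1/m, is:

$$\mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\Bigl(1+\sum_{j=i+1}^{n}X_{ij}\Bigr)\right] \;=\; 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i) \;=\; 1+\frac{n-1}{2m} \;=\; 1+\frac{\alpha}{2}-\frac{\alpha}{2n} \;=\; \Theta(1+\alpha)$$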
Analysis of Search in Hash Table
} If m (# of slots) is proportional to n (# of
elements in the table):
} n = O(m)
} α = n/m = O(m)/m = O(1)
} Searching takes constant time on average
Hash Functions
} A hash function transforms a key into a table
address
} What makes a good hash function?
} Easy to compute
} Approximates a random function: for every input, every
output is equally likely (simple uniform hashing)
} In practice, it is very hard to satisfy the simple
uniform hashing property
} i.e., we don’t know in advance the probability
distribution that keys are drawn from
Good Approaches for Hash Functions
} Minimize the chance that closely related keys hash
to the same slot
} Strings such as pt and pts should hash to different slots
} Derive a hash value that is independent from any
patterns that may exist in the distribution of the keys
The Division Method
} Idea
} Map a key k into one of the m slots by taking
the remainder of k divided by m
h(k) = k mod m
} Advantage
} Fast, requires only one operation
} Disadvantage
} Certain values of m are bad, e.g.,
} power of 2
} non-prime numbers
The Division Method: Example
} If m = 2^p, then h(k) is just the p least significant bits of k
} p = 1, m = 2: h(k) ∈ {0, 1}, the least significant bit of k
} p = 2, m = 4: h(k) ∈ {0, 1, 2, 3}, the 2 least significant bits of k
} Choose m to be a prime, not close to a power of 2
(Figure: a table of sample keys k, with k mod 97 in column 2 and k mod 100 in column 3.)
The Multiplication Method
Idea
} Multiply key k by a constant A, where 0 < A < 1
} Extract the fractional part of kA
} Multiply the fractional part by m
} Take the floor of the result
h(k) = ⎣m (kA mod 1)⎦,
where kA mod 1 = kA - ⎣kA⎦ is the fractional part of kA
} Disadvantage: Slower than division method
} Advantage: Value of m is not critical, e.g., typically 2p
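A short Python sketch of both hash functions; the default constant A follows Knuth's suggestion cited in the textbook, and the specific values are otherwise illustrative:

```python
import math

def h_division(k, m):
    # Division method: one operation; m should be a prime not close to a power of 2.
    return k % m

def h_multiplication(k, m, A=(math.sqrt(5) - 1) / 2):
    # Multiplication method: take the fractional part of k*A, scale by m, floor it.
    # The value of m is not critical here (a power of 2 is typical).
    # A = (sqrt(5) - 1) / 2 ≈ 0.618 is Knuth's suggested constant.
    return math.floor(m * ((k * A) % 1))
```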
Universal Hashing
} In practice, keys are not randomly distributed
} For any fixed hash function, an adversary may construct
a key sequence so that the search time is Θ(n)
} Goal: hash functions that produce random table
indices irrespective of the keys
} Idea: select a hash function at random, from a
designed class of functions at the beginning of the
execution
Universal Hashing
(Figure: a hash function is drawn at random from a designed family of functions at the beginning of the execution.)
Definition of Universal Hash Functions
From the textbook (CLRS): a collection H of hash functions mapping a universe U
of keys into {0, 1, . . . , m - 1} is universal if, for each pair of distinct keys
k, l ∈ U, the number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m.
Equivalently, with h chosen at random from H, the chance of a collision between
distinct keys k and l is at most 1/m.
Universal Hashing: Main Result
} With universal hashing the chance of collision
between distinct keys k and l is no more than
the chance 1/m of a collision if locations h(k)
and h(l) were randomly and independently
chosen from the set {0, 1, …, m – 1}
Designing a Universal Class of Hash Functions
} Choose a prime number p large enough so that every
possible key k is in the range [0 ... p - 1]
Zp = {0, 1, …, p - 1} and Zp* = {1, …, p - 1}
} Define the following hash function, for a ∈ Zp* and b ∈ Zp:
ha,b(k) = ((ak + b) mod p) mod m
} The family of all such hash functions,
Hp,m = {ha,b : a ∈ Zp* and b ∈ Zp},
is universal
} a, b: chosen randomly at the beginning of execution
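A Python sketch of drawing one function from Hp,m; the particular prime p = 2^31 - 1 is an illustrative choice that works for any key below it:

```python
import random

def make_universal_hash(m, p=2**31 - 1):   # p is prime; must exceed every possible key
    a = random.randint(1, p - 1)           # a ∈ Zp*
    b = random.randint(0, p - 1)           # b ∈ Zp
    return lambda k: ((a * k + b) % p) % m

# Drawn once, at the beginning of execution:
h = make_universal_hash(m=6)
print(h(8))                                # some slot in {0, ..., 5}
```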
Universal Hashing Function: Example
E.g.: p = 17, m = 6
ha,b(k) = ((ak + b) mod p) mod m
h3,4(8) = ((3⋅8 + 4) mod 17) mod 6
= (28 mod 17) mod 6
= 11 mod 6
= 5
Universal Hashing Function: Advantages
} Universal hashing provides good results on
average performance, independently of the keys to
be stored
} Guarantees that no input will always elicit the
worst-case performance
} Poor performance occurs only when the random
choice returns an inefficient hash function – this
has small probability
Binary Search Tree Property
} Binary search tree:
} binary tree
} linked data structure: each node has
} key, satellite data
} left/right child, parent
} Property:
} If y is in the left subtree of x, then key[y] ≤ key[x]
} If y is in the right subtree of x, then key[y] ≥ key[x]
key[leftSubtree(x)] ≤ key[x] ≤ key[rightSubtree(x)]
(Example tree: root 5 with children 3 and 7; 3 has children 2 and 5; 7 has right child 9.)
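A minimal Python node sketch with the fields listed above (illustrative; None stands in for NIL in this and the following snippets):

```python
class Node:
    def __init__(self, key, data=None):
        self.key = key        # search key
        self.data = data      # satellite data
        self.left = None      # left child
        self.right = None     # right child
        self.p = None         # parent
```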
Traversing a Binary Search Tree
} Inorder tree walk:
} Root is printed between the values of its left and right
subtrees: left, root, right
} Keys are printed in sorted order
} Preorder tree walk:
} root printed first: root, left, right
} Postorder tree walk:
} root printed last: left, right, root
(On the example tree: root 5 with children 3 and 7; 3 has children 2 and 5; 7 has right child 9.)
Inorder: 2 3 5 5 7 9
Preorder: 5 3 2 5 7 9
Postorder: 2 5 3 9 7 5
Inorder tree walk
Alg: INORDER-TREE-WALK(x)
1. if x ≠ NIL
2. INORDER-TREE-WALK ( left [x] )
3. print key [x]
4. INORDER-TREE-WALK ( right [x] )
E.g., on the same example tree:
Output: 2 3 5 5 7 9
} Running time:
} Θ(n), where n is the size of the tree rooted at x
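The same walk in Python, using the Node sketch from earlier:

```python
def inorder_tree_walk(x):
    if x is not None:              # None plays the role of NIL
        inorder_tree_walk(x.left)
        print(x.key)
        inorder_tree_walk(x.right)
```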
Binary Search Trees
} Support many operations
} SEARCH, MINIMUM, MAXIMUM, PREDECESSOR,
SUCCESSOR, INSERT, DELETE
} Running time of basic operations on binary search
trees
} On average: Θ(log n)
} The expected height of a randomly built tree is O(log n)
} In the worst case: Θ(n)
} The tree is a linear chain of n nodes (very unbalanced)
Searching for a Key
} Given a pointer to the root of a tree and a key k:
} Return a pointer to a node with key k if one exists
} Otherwise return NIL
(Example tree: 5 at the root, with 3 and 7 below, and 2, 4, 9 at the leaves.)
} Idea
} Starting at the root: trace down a path by comparing k with the key
of the current node:
} If the keys are equal: we have found the key
} If k < key[x] search in the left subtree of x
} If k > key[x] search in the right subtree of x
Searching for a Key: Example
• Search for key 13: 15 → 6 → 7 → 13
(Tree: root 15 with children 6 and 18; 6 has children 3 and 7; 3 has children 2 and 4; 7 has right child 13; 13 has left child 9; 18 has children 17 and 20.)
Binary Search Trees
Alg: TREE-SEARCH(x, k)
1. if x = NIL or k = key[x]
2. then return x
3. if k < key[x]
4. then return TREE-SEARCH(left[x], k)
5. else return TREE-SEARCH(right[x], k)
Running time: O(h), h – the height of the tree
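An iterative Python version of TREE-SEARCH (a common variant of the recursive pseudocode above):

```python
def tree_search(x, k):
    while x is not None and k != x.key:
        x = x.left if k < x.key else x.right
    return x                   # the node with key k, or None if absent
```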
Binary Search Trees: Finding the Minimum
} Goal: find the minimum value in a BST
} Follow left child pointers from the root, until a NIL is encountered
Alg: TREE-MINIMUM(x)
while left[x] ≠ NIL
do x ← left[x]
return x
(In the example tree, minimum = 2)
} Running time: O(h), h – height of tree
Binary Search Trees: Finding the Maximum
} Goal: find the maximum value in a BST
} Follow right child pointers from the root, until a NIL is encountered
Alg: TREE-MAXIMUM(x)
while right[x] ≠ NIL
do x ← right[x]
return x
(In the example tree, maximum = 20)
} Running time: O(h), h – height of tree
Successor
} Def: successor (x) = y, such that key [y] is the
smallest key > key [x]
E.g.: successor(15) = 17, successor(13) = 15, successor(9) = 13
} Case 1: right(x) is non-empty
} successor(x) = the minimum in right(x)
} Case 2: right(x) is empty
} go up the tree until the current node is a left child:
successor(x) is the parent of the current node
} if you cannot go further (you reached the root): x is the largest element
Finding the Successor
Alg: TREE-SUCCESSOR(x)
1. if right[x] ≠ NIL
2. then return TREE-MINIMUM(right[x])
3. y ← p[x]
4. while y ≠ NIL and x = right[y]
5. do x ← y
6. y ← p[y]
7. return y
Running time: O(h), h – height of the tree
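The two procedures in Python, using the Node sketch from earlier:

```python
def tree_minimum(x):
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    if x.right is not None:                # case 1: minimum of the right subtree
        return tree_minimum(x.right)
    y = x.p                                # case 2: climb until x is a left child
    while y is not None and x is y.right:
        x, y = y, y.p
    return y                               # None if x was the largest element
```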
Predecessor
Def: predecessor (x ) = y, such that key [y] is the
biggest key < key [x]
E.g.: predecessor(15) = 13, predecessor(9) = 7, predecessor(7) = 6
Case 1: left(x) is non-empty
predecessor(x) = the maximum in left(x)
Case 2: left (x) is empty
} go up the tree until the current node is a right child:
predecessor (x ) is the parent of the current node
} if you cannot go further (and you reached the root): x is
the smallest element
Insertion
} Goal:
} Insert value v into a binary search tree
} Idea:
} If key[x] < v move to the right child of x, else
move to the left child of x
} When x is NIL, we have found the correct position
} Let y be the parent of x; if v < key[y], insert the new
node as y's left child, else insert it as y's right child
} Beginning at the root, go down the tree and maintain:
} Pointer x: traces the downward path (current node)
} Pointer y: parent of x ("trailing pointer")
(Example: insert value 13 into the tree rooted at 12.)
Insertion: Example
Insert 13: initially x = root[T], y = NIL
(Figure: four snapshots of the tree rooted at 12; x follows the path 12 → 18 → 15 with y trailing one node behind, until x = NIL and y = 15; the new node 13 becomes 15's left child.)
Tree Insertion
Alg: TREE-INSERT(T, z)
1. y ← NIL
2. x ← root[T]
3. while x ≠ NIL
4. do y ← x
5. if key[z] < key[x]
6. then x ← left[x]
7. else x ← right[x]
8. p[z] ← y
9. if y = NIL
10. then root[T] ← z // Tree T was empty
11. else if key[z] < key[y]
12. then left[y] ← z
13. else right[y] ← z
Running time: O(h)
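A Python sketch of TREE-INSERT; since there is no separate tree object T here, it returns the (possibly new) root:

```python
def tree_insert(root, z):
    y, x = None, root
    while x is not None:                 # walk down; y trails one node behind x
        y = x
        x = x.left if z.key < x.key else x.right
    z.p = y
    if y is None:
        return z                         # tree was empty: z becomes the root
    if z.key < y.key:
        y.left = z
    else:
        y.right = z
    return root
```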
Deletion
} Goal:
} Delete a given node z from a binary search tree
} Idea:
} Case 1: z has no children
} Delete z by making the parent of z point to NIL
(Figure: deleting the leaf z = 13 from the example tree; its parent, 12, now points to NIL.)
Deletion
} Case 2: z has one child
} Delete z by making the parent of z point to z’s child,
instead of to z
(Figure: deleting z = 16, which has one child; 16's parent, 15, now points to 16's child, 20.)
Deletion
} Case 3: z has two children
} z’s successor (y) is the minimum node in z’s right subtree
} y has either no children or one right child (but no left child)
} Delete y from the tree (via Case 1 or 2)
} Replace z’s key and satellite data with y’s.
(Figure: deleting z = 5, which has two children; its successor y = 6, the minimum of z's right subtree, replaces z's key and satellite data, and y itself is spliced out via Case 2.)
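A compact Python sketch of the three-case strategy above (the key-copy variant these slides describe, not the pointer-splicing version in some editions of the textbook); it returns the possibly-new root:

```python
def tree_delete(root, z):
    if z.left is not None and z.right is not None:   # case 3: two children
        y = tree_minimum(z.right)                    # z's successor
        z.key, z.data = y.key, y.data                # copy y's contents into z
        z = y                                        # now splice out y instead
    # Cases 1 and 2: z has at most one child.
    child = z.left if z.left is not None else z.right
    if child is not None:
        child.p = z.p
    if z.p is None:
        return child                                 # z was the root
    if z is z.p.left:
        z.p.left = child
    else:
        z.p.right = child
    return root
```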
Binary Search Trees: Summary
} Operations on binary search trees:
} SEARCH O(h)
} PREDECESSOR O(h)
} SUCCESSOR O(h)
} MINIMUM O(h)
} MAXIMUM O(h)
} INSERT O(h)
} DELETE O(h)
} These operations are fast if the height of the tree
is small
What's next...
} Binary Search Trees (Cont'd)
} Midterm Review