Huffman Codes
• Widely used technique for data compression
• Assume the data to be a sequence of characters
• Looking for an effective way of storing the data
• Binary character code
– Uniquely represents a character by a binary string
Fixed-Length Codes
E.g.: Data file containing 100,000 characters
Character               a   b   c   d   e   f
Frequency (thousands)  45  13  12  16   9   5
• 3 bits needed
• a = 000, b = 001, c = 010, d = 011, e = 100, f = 101
• Requires: 100,000 × 3 = 300,000 bits
Huffman Codes
• Idea:
– Use the frequencies of occurrence of characters to
build an optimal way of representing each character
Character               a   b   c   d   e   f
Frequency (thousands)  45  13  12  16   9   5
Variable-Length Codes
E.g.: Data file containing 100,000 characters
Character               a   b   c   d   e   f
Frequency (thousands)  45  13  12  16   9   5
• Assign short codewords to frequent characters and
long codewords to infrequent characters
• a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100
• (45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4) · 1,000
= 224,000 bits
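The two counts above can be checked with a short Python sketch (not from the slides; the dictionaries just transcribe the frequency table and codeword lengths):

```python
# Frequencies are in thousands, as in the table above.
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}

# Fixed-length code: every character costs 3 bits.
fixed_bits = sum(f * 3 for f in freq.values()) * 1000

# Variable-length code: codeword lengths from the code above.
var_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}
variable_bits = sum(freq[c] * var_len[c] for c in freq) * 1000

print(fixed_bits)     # 300000
print(variable_bits)  # 224000
```

The variable-length code saves roughly 25% on this file.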
Prefix Codes
• Prefix codes:
– Codes for which no codeword is also a prefix of some
other codeword
– Better name would be “prefix-free codes”
• We can achieve optimal data compression using
prefix codes
– We will restrict our attention to prefix codes
Encoding with Binary Character Codes
• Encoding
– Concatenate the codewords representing each
character in the file
• E.g.:
– a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100
– abc = 0 101 100 = 0101100
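Encoding is plain concatenation, as a minimal Python sketch shows (the `encode` helper and `code` table are illustrative names, not part of the slides):

```python
# Codeword table from the slide.
code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}

def encode(text, code):
    """Concatenate the codewords of the characters in text."""
    return ''.join(code[ch] for ch in text)

print(encode('abc', code))  # 0101100
```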
Decoding with Binary Character Codes
• Prefix codes simplify decoding
– No codeword is a prefix of another ⇒ the codeword
that begins an encoded file is unambiguous
• Approach
– Identify the initial codeword
– Translate it back to the original character
– Repeat the process on the remainder of the file
• E.g.:
– a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100
– 001011101 = 0 0 101 1101 = aabe
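The three-step approach above can be sketched in Python (the `decode` helper is an illustrative name, not from the slides): accumulate bits until they match a codeword, emit the character, and continue. The prefix property guarantees the first match is correct.

```python
# Codeword table from the slide, inverted for decoding.
code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}
table = {v: k for k, v in code.items()}

def decode(bits, table):
    """Greedy prefix-code decoding: emit as soon as a codeword matches."""
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in table:          # unambiguous, since the code is prefix-free
            out.append(table[buf])
            buf = ''
    return ''.join(out)

print(decode('001011101', table))  # aabe
```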
Prefix Code Representation
• Binary tree whose leaves are the given characters
• Binary codeword
– the path from the root to the character, where 0 means “go to the
left child” and 1 means “go to the right child”
• Length of the codeword
– Length of the path from root to the character leaf (depth of node)
[Figure: two code trees for the table above, each with root labeled 100 and internal nodes labeled with the sum of their leaf frequencies. Left: the fixed-length code (a = 000, …, f = 101); every character leaf is at depth 3. Right: the optimal variable-length code (a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100); a: 45 sits at depth 1, while the rarest characters f: 5 and e: 9 sit at depth 4.]
Optimal Codes
• An optimal code is always represented by a full
binary tree
– Every non-leaf has two children
– The fixed-length tree above is not full, so that code is not optimal; the variable-length tree is full
• How many bits are required to encode a file?
– Let C be the alphabet of characters
– Let f(c) be the frequency of character c
– Let dT(c) be the depth of c’s leaf in the tree T
corresponding to a prefix code
B(T) = Σ_{c ∈ C} f(c) · d_T(c)    (the cost of tree T)
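The cost B(T) can be computed directly from a tree, as in this Python sketch (the nested-pair tree encoding and the `depths` helper are illustrative choices, not from the slides; the tree is the optimal one shown earlier):

```python
def depths(node, d=0, out=None):
    """Map each character leaf to its depth d_T(c) in the tree."""
    out = {} if out is None else out
    if isinstance(node, str):        # leaf: a character
        out[node] = d
    else:                            # internal node: (left, right)
        depths(node[0], d + 1, out)
        depths(node[1], d + 1, out)
    return out

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
# Optimal tree from the slides: a at depth 1; c, b, d at 3; f, e at 4.
tree = ('a', (('c', 'b'), (('f', 'e'), 'd')))

dT = depths(tree)
cost = sum(freq[c] * dT[c] for c in freq)  # B(T), in thousands of bits
print(cost)  # 224
```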
Constructing a Huffman Code
• A greedy algorithm that constructs an optimal prefix code
called a Huffman code
• Assume that:
– C is a set of n characters
– Each character has a frequency f(c)
– The tree T is built in a bottom up manner
• Idea (initial queue, in frequency order: f: 5, e: 9, c: 12, b: 13, d: 16, a: 45):
– Start with a set of |C| leaves
– At each step, merge the two least frequent objects: the frequency of
the new node = sum of two frequencies
– Use a min-priority queue Q, keyed on f to identify the two least
frequent objects
Example
[Figure: the queue after each of the five merges]
– Start: f: 5, e: 9, c: 12, b: 13, d: 16, a: 45
– Merge f, e → 14: c: 12, b: 13, 14, d: 16, a: 45
– Merge c, b → 25: 14, d: 16, 25, a: 45
– Merge 14, d → 30: 25, 30, a: 45
– Merge 25, 30 → 55: a: 45, 55
– Merge a, 55 → 100: the final tree, giving a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100
Building a Huffman Code
Alg.: HUFFMAN(C)                          Running time: O(n lg n)
1. n ← |C|
2. Q ← C                                  ► O(n) to build the min-priority queue
3. for i ← 1 to n – 1                     ► n – 1 iterations, each O(lg n)
4.     do allocate a new node z
5.        left[z] ← x ← EXTRACT-MIN(Q)
6.        right[z] ← y ← EXTRACT-MIN(Q)
7.        f[z] ← f[x] + f[y]
8.        INSERT(Q, z)
9. return EXTRACT-MIN(Q)
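A runnable Python sketch of HUFFMAN(C), using `heapq` as the min-priority queue Q (the `huffman` name, the tuple-based tree, and the tie-breaking counter are implementation choices, not part of the pseudocode):

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman code from {char: frequency}; return {char: codeword}."""
    tick = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tick), c) for c, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):          # n - 1 merges
        fx, _, x = heapq.heappop(heap)      # two least frequent objects
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tick), (x, y)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                               # leaf: a character
            codes[node] = prefix or '0'
    walk(root, '')
    return codes

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
codes = huffman(freq)
```

Exact codewords depend on how ties are broken, but the codeword lengths (and hence the cost, 224) match the tree built in the example.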