Huffman Codes
Encoding
Huffman coding is a lossless ecoding algorithm that encodes an alphabet into
a prefix code, a mapping of codewords to characters so that no codeword
is a prefix of another. The encoded alphabet can be represented as a binary
tree with all letters on the leaves.
0
a
1
0
b
0
c
1
d
Optimal Code If P (x) is the probability of letter x and dT (x) is the
depth in the tree of the codeword, for an alphabet C, the expected encoding
length for n characters, which we want to minimize, is
B(T ) = n
p(x)dT (x)
(1)
xC
Algorithm
Build the tree bottom-up, merging 2 characters into a meta-character on
each step, then recurse on the 1-letter smaller alphabet with the metacharacter. When merging, always pick the 2 characters with the lowest
probability.
1
for x in C :
add x to heap Q by p ( x )
for i in [1 , | C | - 1]:
z = new internal tree node
left [ z ] = x = extract - min ( Q )
right [ z ] = y = extract - min ( Q )
p(z) = p(x) + p(y)
insert z into Q
11
return last element in Q as root
Huffman Codes
3
Proof
Claim For x, y  C with the lowest probabilities, there exists an optimal tree with these two letters at maximum depth.
Proof By contradiction, take b, c, x, y  C | p(b)  p(c), p(x)  p(y)
with no loss of generality. By the claim that x and y have the lowest probabilities, p(x)  p(b), p(y)  p(c). Now put b, c at the bottom of the tree so
that dT (b)  dT (x), dT (c)  dT (y).
T0
T
0
x
b
0
y
1
0
b
0
y
1
c
1
0
x
1
c
Swap the positions of x and b in the tree to obtain the new tree T 0 . The
cost change from T to T 0 is given by:
B(T 0 ) = B(T )  p(x)dT (x) + p(x)dT (b)  p(b)dT (b) + p(b)dT (x)
(2)
= B(T ) + p(x)(dT (b)  dT (x))  p(b)(dT (b)  dT (x))
(3)
= B(T )  (p(b)  p(x))(dT (b)  dT (x))
(4)
In the analysis above, we fixed that p(x)  p(b) and dT (x)  dT (b), so
therefore B(T 0 )  B(T ). By a similar exchange argument for y, this proves
that there exists some optimal tree with x and y at the deepest level.
Claim
Huffmans algorithm always produces an optimal tree.
Proof Use induction. The base case n = 1 is trivially correct, and by
the inductive hypothesis we assume correctness on n  1 characters.
Given a tree for n characters such that x, y are the lowest probability
letters, we know that they are at the bottom of the optimal solution. Now
Huffman Codes
replace x, y with z so that p(z) = p(x) + p(y), as in the algorithm so that
now |C 0 | = n  1.
Consider that any prefix tree T for C 0 has z as a leaf node (by definition).
Replace this leaf node with an internal node branching out to two new leaves
x and y. The cost change is:
B(T 0 ) = B(T )  p(z)dT (z) + p(x)(dT (z) + 1) + p(y)(dT (z) + 1)
= B(T ) + (p(x) + p(y))
(5)
(6)
By the inductive hypothesis, B(T ) is minimal. Because we picked x, y
so that their sum would be the smallest possible out of the alphabet, B(T 0 )
is optimal as well, which means Huffmans algorithm always produces a
minimal prefix code.