Huffman Coding
• Huffman coding is a lossless data compression algorithm.
• The idea is to assign variable-length codes to input characters, lengths
of the assigned codes are based on the frequencies of corresponding
characters.
• The character which occurs most frequently gets the smallest code.
• The character which occurs least frequently gets the largest code.
• Formally, Given a set of n characters from the alphabet A (each
character c ∈ A) and their corresponding frequency freq(c), find a binary
code for each character such that 𝑐∈𝐴 𝑓𝑟𝑒𝑞 𝑐 ∗ 𝑏𝑖𝑛𝑎𝑟𝑦𝑐𝑜𝑑𝑒(𝑐) is
minimum, where 𝑏𝑖𝑛𝑎𝑟𝑦𝑐𝑜𝑑𝑒(𝑐) represents the length of binary code
of character c.
Prefix code/Prefix-free code
• The variable-length codes assigned to input characters are Prefix free
Codes, means the codes (bit sequences) are assigned in such a way that
the code assigned to one character is not the prefix of code assigned to
any other character.
• This is how Huffman Coding makes sure that there is no ambiguity
when decoding the generated bitstream.
Steps for creating Huffman
Tree
1. Create a leaf node for each character of the text. Leaf node of a character
contains the occurring frequency of that character.
2. Arrange all the nodes in increasing order of their frequency value.
3. Considering the first two nodes having minimum frequency, Create a new
internal node. The frequency of this new node is the sum of frequency of
those two nodes. Make the first node as a left child and the other node as a
right child of the newly created node.
4. Keep repeating Step-02 and Step-03 until all the nodes form a single tree.
5. The tree finally obtained is the desired Huffman Tree. Leaves of the binary
tree will represent the bit strings. The nodes with larger frequency will be
nearer to the root and with lesser frequency will be towards the leaves.
Huffman Coding Rule
• Convention 1: If you assign weight ‘0’ to the left edges, then assign
weight ‘1’ to the right edges.
• Convention 2: If you assign weight ‘1’ to the left edges, then assign
weight ‘0’ to the right edges.
• Any of the above two conventions may be followed. But follow the
same convention at the time of decoding that is adopted at the time of
encoding.
• We will follow Convention 1.
Algorithm
HUFFMAN(C)
1. n ← |C|
2. Q ← C //Q is priority queue on the basis of frequency
3. for i ← 1 to n − 1
4. do allocate a new node z
5. left[z] ← x ←EXTRACT-MIN(Q)
6. right[z] ← y ←EXTRACT-MIN(Q)
7. f [z] ← f [x] + f [y]
8. INSERT(Q, z)
9. return EXTRACT-MIN(Q)
Time Complexity
O(N logN),where n is the number of unique characters.
Example 1
File f has one million characters. Only a, b, c, d, e, f chars come in the file.
Assume frequency of occurrence of a, b, c, d, e, f is 45%, 13%, 12%, 16%,
9%, 5% . Determine:
a) Huffman Code for each character
b) Calculate bits saved
c) Average code length
Frequency of occurrence of a, b, c, d, e, f is 45%, 13%, 12%, 16%,
9%, 5%
Character Probability Fixed length HuffmanBin |Binarycode|
code arycode
a 0.45 000 0 1
b 0.13 001 101 3
c 0.12 010 100 3
d 0.16 011 111 3
e 0.09 100 1101 4
f 0.05 101 1100 4
What about decoding?
1. We start from the root and do the following until a leaf is found.
2. If the current bit is 0, we move to the left node of the tree.
3. If the bit is 1, we move to right node of the tree.
4. If during the traversal, we encounter a leaf node, we print the
character of that particular leaf node and then again continue the
iteration of the encoded data starting from step 1.
Example 2
Characters a e i o u s t
Frequency 10 15 12 3 4 13 1
Huffman Code for each character is:
a = 111
e = 10
i = 00
o = 11001
u = 1101
s = 01
t = 11000
Try this!
A networking company uses a Huffman compression technique to encode
the message before transmitting over the network. Consider the following
characters with their frequency: Character Frequency
D 18
A 37
T 29
S 50
R 17
U 30
C 6
E 13
a) Show stepwise a tree that Huffman’s greedy algorithm could produce for
these characters.
b) Encode the message “DATASTRUCTURES” before transmitting on a
network.
c) If the codeword for character “C” is reversed, will there be any effect on
decoding the message at receiver. Justify your answer with example.