Huffman Coding-
Huffman Coding is a famous Greedy Algorithm.
It is used for the lossless compression of data.
It uses variable-length encoding, assigning codes of different lengths to different characters.
The code length of a character depends on how frequently it occurs in the given text.
The character which occurs most frequently gets the smallest code.
The character which occurs least frequently gets the largest code.
It is also known as Huffman Encoding.
Prefix Rule-
Huffman Coding implements a rule known as the prefix rule.
This prevents ambiguity while decoding.
It ensures that the code assigned to any character is not a prefix of the code
assigned to any other character.
Major Steps in Huffman Coding-
There are two major steps in Huffman Coding-
1. Building a Huffman Tree from the input characters.
2. Assigning code to the characters by traversing the Huffman Tree.
Huffman Tree-
The steps involved in the construction of Huffman Tree are as follows-
Step-01:
Create a leaf node for each character of the text.
The leaf node of a character stores the frequency of occurrence of that character.
Step-02:
Arrange all the nodes in increasing order of their frequency value.
Step-03:
Take the two nodes having minimum frequency and create a new internal node.
The frequency of this new node is the sum of the frequencies of those two nodes.
Make the first node the left child and the other node the right child of the newly
created node.
Step-04:
Keep repeating Step-02 and Step-03 until all the nodes form a single tree.
The tree finally obtained is the desired Huffman Tree.
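The construction above can be sketched with a binary min-heap (a minimal sketch; the tree is represented as nested (left, right) tuples, and an insertion counter breaks frequency ties so two subtrees are never compared directly):

```python
import heapq

def build_huffman_tree(freq):
    """Build a Huffman Tree from a {character: frequency} map.

    Leaves are characters; internal nodes are (left, right) tuples.
    A running counter breaks frequency ties deterministically.
    """
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)                       # Step-02: order nodes by frequency
    count = len(heap)
    while len(heap) > 1:                      # Step-04: repeat until one tree remains
        f1, _, left = heapq.heappop(heap)     # the two minimum-frequency nodes
        f2, _, right = heapq.heappop(heap)
        # Step-03: new internal node whose frequency is the sum of the two
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    return heap[0][2]
```

For example, `build_huffman_tree({'a': 1, 'b': 2, 'c': 4})` merges a and b first, then merges that node with c.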
Time Complexity-
The time complexity analysis of Huffman Coding is as follows-
extractMin( ) is called 2 x (n - 1) times if there are n nodes.
Since extractMin( ) calls minHeapify( ), each call takes O(log n) time.
Thus, the overall time complexity of Huffman Coding becomes O(n log n).
Here, n is the number of unique characters in the given text.
Important Formulas-
The following 2 formulas are important to solve the problems based on Huffman Coding-
Formula-01:
Average code length per character
= ∑ ( frequencyi x Code lengthi ) / ∑ ( frequencyi )
Formula-02:
Total number of bits in Huffman encoded message
= Total number of characters in the message x Average code length per character
= ∑ ( frequencyi x Code lengthi )
PRACTICE PROBLEM BASED ON HUFFMAN CODING-
Problem-
A file contains the following characters with the frequencies as shown. If Huffman Coding is
used for data compression, determine-
1. Huffman Code for each character
2. Average code length
3. Length of Huffman encoded message (in bits)
Characters Frequencies
a 10
e 15
i 12
o 3
u 4
s 13
t 1
Solution-
First let us construct the Huffman Tree.
Huffman Tree is constructed in the following steps-
Step-01:
Create a leaf node for each character: t(1), o(3), u(4), a(10), i(12), s(13), e(15).
Step-02:
Merge the two minimum-frequency nodes t(1) and o(3) into a new node of frequency 4.
Step-03:
Merge this node of frequency 4 with u(4) into a new node of frequency 8.
Step-04:
Merge the node of frequency 8 with a(10) into a new node of frequency 18.
Step-05:
Merge i(12) and s(13) into a new node of frequency 25.
Step-06:
Merge e(15) with the node of frequency 18 into a new node of frequency 33.
Step-07:
Merge the nodes of frequency 25 and 33 into the root node of frequency 58.
Now,
We assign weight to all the edges of the constructed Huffman Tree.
Let us assign weight ‘0’ to the left edges and weight ‘1’ to the right edges.
Rule
If you assign weight ‘0’ to the left edges, then assign weight ‘1’ to the right
edges.
If you assign weight ‘1’ to the left edges, then assign weight ‘0’ to the right
edges.
Any of the above two conventions may be followed.
But the same convention must be followed at the time of decoding as was
adopted at the time of encoding.
After assigning weights to all the edges, the modified Huffman Tree is obtained.
Now, let us answer each part of the given problem one by one-
1. Huffman Code For Characters-
To write the Huffman Code for any character, traverse the Huffman Tree from the
root node to the leaf node of that character.
Following this rule, the Huffman Code for each character is-
a = 111
e = 10
i = 00
o = 11001
u = 1101
s = 01
t = 11000
From here, we can observe-
Characters occurring less frequently in the text are assigned longer codes.
Characters occurring more frequently in the text are assigned shorter codes.
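The traversal rule can be sketched as follows (a minimal sketch; the tree below is the worked example's Huffman Tree written as nested (left, right) tuples, following the left = '0', right = '1' convention chosen above):

```python
def assign_codes(node, prefix=""):
    """Collect codes by traversing from the root: '0' on left edges, '1' on right."""
    if isinstance(node, str):                 # leaf node: a single character
        return {node: prefix}
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

# Huffman Tree from the practice problem, as nested (left, right) tuples
tree = (('i', 's'), ('e', ((('t', 'o'), 'u'), 'a')))
codes = assign_codes(tree)   # {'i': '00', 's': '01', 'e': '10', ...}
```

Running this reproduces the seven codes listed above, e.g. `codes['a']` is `'111'` and `codes['t']` is `'11000'`.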
2. Average Code Length-
Using formula-01, we have-
Average code length
= ∑ ( frequencyi x code lengthi ) / ∑ ( frequencyi )
= { (10 x 3) + (15 x 2) + (12 x 2) + (3 x 5) + (4 x 4) + (13 x 2) + (1 x 5) } / (10 + 15 + 12 + 3 + 4 + 13 + 1)
= 146 / 58
≈ 2.52
3. Length of Huffman Encoded Message-
Using formula-02, we have-
Total number of bits in Huffman encoded message
= Total number of characters in the message x Average code length per character
= 58 x 2.52
= 146.16
≈ 146 bits
(The exact count is ∑ ( frequencyi x Code lengthi ) = 146 bits; the fractional 146.16 appears only because the average was rounded to 2.52.)
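Both formulas can be checked numerically, a minimal sketch using the frequencies and code lengths derived above (the exact total from Formula-02 is 146 bits; multiplying by the rounded average 2.52 gives the fractional 146.16):

```python
freq = {'a': 10, 'e': 15, 'i': 12, 'o': 3, 'u': 4, 's': 13, 't': 1}
code_len = {'a': 3, 'e': 2, 'i': 2, 'o': 5, 'u': 4, 's': 2, 't': 5}

# Formula-02: total bits = sum of (frequency x code length) over all characters
total_bits = sum(freq[c] * code_len[c] for c in freq)    # 146

# Formula-01: average code length = total bits / total characters
avg_len = total_bits / sum(freq.values())                # 146 / 58 ≈ 2.517
```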
Huffman Coding Complexity
Encoding the n unique characters based on their frequencies takes O(n log n)
time overall. Extracting the minimum-frequency node from the priority queue
happens 2 x (n - 1) times, and each extraction costs O(log n); thus the overall
complexity is O(n log n).
Huffman Coding Applications
Huffman coding is used in conventional compression formats like GZIP,
BZIP2, PKZIP, etc.
It is also used for text and fax transmissions.
Practice Questions on Huffman Encoding
Huffman Encoding is an important topic from the GATE point of view, and
different types of questions are asked from this topic. Before
reading further, you should have a basic idea about Huffman
encoding.
These are the types of questions asked in GATE based on Huffman
Encoding.
Type 1. Conceptual questions based on Huffman Encoding –
Here are the few key points based on Huffman Encoding:
It is a lossless data compression technique that generates
variable-length codes for different symbols.
It is based on the greedy approach, which considers the
frequency/probability of symbols when generating codes.
It has a complexity of O(n log n), where n is the number of unique
characters.
The length of the code for a character is inversely related to the
frequency of its occurrence.
No code is a prefix of another code, due to which a sequence of
codes can be unambiguously decoded into characters.
Que – 1. Which of the following is true about Huffman Coding?
(A) Huffman coding may become lossy in some cases
(B) Huffman Codes may not be optimal lossless codes in some cases
(C) In Huffman coding, no code is prefix of any other code.
(D) All of the above
Solution: As discussed, Huffman encoding is a lossless compression
technique, and Huffman codes are optimal prefix codes. Therefore, options (A)
and (B) are false. Option (C) is true, as the prefix property is the basis of
decoding the message from the given code.
Type 2. To find number of bits for encoding a given message –
To solve this type of questions:
First, calculate the frequency of each character if it is not given.
Generate the Huffman Tree.
Calculate the number of bits using the frequency of each character and the
number of bits required to represent it.
Que – 2. How many bits may be required for encoding the message
‘mississippi’?
Solution: Following is the frequency table of characters in ‘mississippi’
in non-decreasing order of frequency:
m : 1, p : 2, s : 4, i : 4
The generated Huffman tree gives the code lengths m = 3, p = 3, s = 2, i = 1;
one valid code assignment is m = 111, p = 110, s = 10, i = 0.
Total number of bits
= freq(m) * code_length(m) + freq(p) * code_length(p) + freq(s) *
code_length(s) + freq(i) * code_length(i)
= 1*3 + 2*3 + 4*2 + 4*1 = 21
Also, average bits per character can be found as:
Total number of bits required / total number of characters = 21/11 =
1.909
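The three steps for this question type can be combined into one routine (a minimal sketch that tracks only code lengths, not the codes themselves; with equal frequencies for 'i' and 's', tie-breaking may swap their individual code lengths, but the total stays the same):

```python
import heapq
from collections import Counter

def huffman_encoded_length(message):
    """Total number of bits in the Huffman-encoded message.

    Each heap entry carries a {character: depth} map; merging two nodes
    pushes every character in them one level deeper in the tree.
    """
    freq = Counter(message)
    if len(freq) == 1:                  # degenerate case: lone symbol gets 1 bit
        return len(message)
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)          # two minimum-frequency nodes
        f2, _, d2 = heapq.heappop(heap)
        merged = {ch: d + 1 for ch, d in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (f1 + f2, nxt, merged))
        nxt += 1
    depths = heap[0][2]                 # depth of a leaf = its code length
    return sum(freq[ch] * depths[ch] for ch in freq)
```

`huffman_encoded_length('mississippi')` returns 21, matching the count above, and 21/11 ≈ 1.909 bits per character on average.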
Type 3. Decoding from code to message –
To solve this type of question:
Generate codes for each character using Huffman tree (if not given)
Using prefix matching, replace the codes with characters.
Que – 3. The characters a to h have the set of frequencies based on
the first 8 Fibonacci numbers as follows:
a : 1, b : 1, c : 2, d : 3, e : 5, f : 8, g : 13, h : 21
A Huffman code is used to represent the characters. What is the
sequence of characters corresponding to the following code?
110111100111010
(A) fdheg
(B) ecgdf
(C) dchfg
(D) fehdg
Solution: Using the frequencies given in the question, the Huffman tree
can be generated. Since the frequencies are Fibonacci numbers, the tree is a
skewed chain. Following are the codes (one valid assignment):
h = 0, g = 10, f = 110, e = 1110, d = 11110, c = 111110, b = 1111110, a = 1111111
Using prefix matching, given string can be decomposed as
110 11110 0 1110 10
f d h e g
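Prefix matching can be sketched as a greedy scan (a minimal sketch; the code table below is one valid assignment consistent with the decomposition shown, since the Fibonacci frequencies make the tree a skewed chain):

```python
def decode(bits, codes):
    """Greedily match prefixes of `bits` against the code table.

    Because the code is prefix-free, the first match is always correct.
    """
    inverse = {code: ch for ch, code in codes.items()}
    result, current = [], ""
    for b in bits:
        current += b
        if current in inverse:          # prefix rule: match is unambiguous
            result.append(inverse[current])
            current = ""
    return "".join(result)

# One valid code table for the Fibonacci frequencies of Que-3
codes = {'h': '0', 'g': '10', 'f': '110', 'e': '1110',
         'd': '11110', 'c': '111110', 'b': '1111110', 'a': '1111111'}
```

`decode('110111100111010', codes)` returns `'fdheg'`, matching option (A).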
Type 4. Number of bits saved using Huffman encoding –
Que – 4. A networking company uses a compression technique to
encode the message before transmitting over the network. Suppose
the message contains six characters with the frequencies 5, 9, 12, 13, 16 and 45.
Note that each character in the input message takes 1 byte.
If the compression technique used is Huffman Coding, how many bits
will be saved in the message?
(A) 224
(B) 800
(C) 576
(D) 324
Solution: Finding the number of bits without using Huffman coding:
Total number of characters = sum of frequencies = 100
Size of 1 character = 1 byte = 8 bits
Total number of bits = 8*100 = 800
Using Huffman Encoding, the total number of bits needed can be
calculated as:
5*4 + 9*4 + 12*3 + 13*3 + 16*3 + 45*1 = 224
Bits saved = 800-224 = 576.
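The arithmetic can be checked directly (a minimal sketch; the six frequencies and their Huffman code lengths are taken from the solution above):

```python
freqs = [5, 9, 12, 13, 16, 45]   # character frequencies from the solution
lens  = [4, 4, 3, 3, 3, 1]       # Huffman code lengths for those characters

uncompressed = 8 * sum(freqs)    # 1 byte = 8 bits per character
compressed = sum(f * l for f, l in zip(freqs, lens))
saved = uncompressed - compressed   # bits saved by Huffman coding
```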