C#ODE Studio
Huffman Coding: An Application of Binary Trees and Priority Queues
CS 102
C#ODE Studio Encoding and Compression of Data
Fax Machines ASCII Variations on ASCII
min number of bits needed cost of savings patterns modifications
CS 102
C#ODE Studio Purpose of Huffman Coding
Proposed by Dr. David A. Huffman in 1952
A Method for the Construction of Minimum Redundancy Codes
Applicable to many forms of data transmission
Our example: text files
CS 102
The Basic Algorithm
C#ODE Studio
Huffman coding is a form of statistical coding Not all characters occur with the same frequency! Yet all characters are allocated the same amount of space
1 char = 1 byte, be it
or
CS 102
The Basic Algorithm
C#ODE Studio
Any savings in tailoring codes to frequency of character? Code word lengths are no longer fixed like ASCII. Code word lengths vary and will be shorter for the more frequently used characters.
CS 102
C#ODE The (Real) Basic Algorithm
1. 2. 3. 4. 5.
Studio
Scan text to be compressed and tally occurrence of all characters. Sort or prioritize characters based on number of occurrences in text. Build Huffman code tree based on prioritized list. Perform a traversal of tree to determine all code words. Scan text again and create new file using the Huffman codes.
CS 102
Building a Tree
C#ODE Studio
Scan the original text
Consider the following short text:
Eerie eyes seen near lake.
Count up the occurrences of all characters in the text
CS 102
Building a Tree
C#ODE Studio
Scan the original text
Eerie eyes seen near lake. What characters are present?
E e r i space y s n a r l k .
CS 102
Building a Tree
C#ODE Studio
Scan the original text
Eerie eyes seen near lake.
What is the frequency of each character in the text?
Char Freq. E 1 e 8 r 2 i 1 space 4
Char Freq. y 1 s 2 n 2 a 2 l 1
CS 102
Char Freq. k 1 . 1
Building a Tree
C#ODE Studio
Prioritize characters
Create binary tree nodes with character and frequency of each character Place nodes in a priority queue
The lower the occurrence, the higher the priority in the queue
CS 102
Building a Tree
Uses binary tree nodes
C#ODE Studio
Prioritize characters
public class HuffNode { public char myChar; public int myFrequency; public HuffNode myLeft, myRight; }
priorityQueue myQueue;
CS 102
Building a Tree
C#ODE Studio
The queue after inserting all nodes
E
1
i
1
y
1
l
1
k
1
.
1
r
2
s
2
n
2
a
2
sp
4
e
8
Null Pointers are not shown
CS 102
Building a Tree
C#ODE Studio
While priority queue contains two or more nodes
Create new node Dequeue node and make Dequeue next node and subtree Frequency of new node frequency of left and Enqueue new node back
CS 102
it left subtree make it right equals sum of right children into queue
Building a Tree
C#ODE Studio
E 1
i 1
y 1
l 1
k 1
. 1
r 2
s 2
n 2
a 2
sp 4
e 8
CS 102
Building a Tree
C#ODE Studio
y 1
l 1
k 1
. 1
r 2
s 2
n 2
a 2
sp 4
e 8
2 E 1 i 1
CS 102
Building a Tree
C#ODE Studio
y 1
l 1
k 1
. 1
r 2
s 2
n 2
a 2 E 1
2
i 1
sp 4
e 8
CS 102
Building a Tree
C#ODE Studio
k 1
. 1
r 2
s 2
n 2
a 2
sp 4
e 8
E 1
i 1
2 y 1
l 1
CS 102
Building a Tree
C#ODE Studio
k 1
. 1
r 2
s 2
n 2
a 2
2 y 1
2 l 1
sp 4
e 8
E 1
i 1
CS 102
Building a Tree
C#ODE Studio
2 y 1
2 l 1
sp
2
E 1 i 1
2 k 1 . 1
CS 102
Building a Tree
C#ODE Studio
2 i 1
sp
2
E 1
4 y 1 l 1
k 1 . 1
CS 102
Building a Tree
C#ODE Studio
2
E 1 i 1
sp 4 . 1
e 8
y 1
l 1
4 r 2 s 2
k 1
CS 102
Building a Tree
C#ODE Studio
2 E 1 i 1
y 1
2
l 1 k 1
2
. 1
sp 4 r 2
4 s 2
CS 102
Building a Tree
C#ODE Studio
2 y 1
2 l 1 k 1
2 . 1
sp 4 r 2
4 s 2
e 8
E 1
i 1
4 n 2 a 2
CS 102
Building a Tree
C#ODE Studio
2 y 1
2 l 1 k 1
2 . 1
sp 4 r 2
4 s 2 n 2
4 a 2
e 8
E 1
i 1
CS 102
Building a Tree
C#ODE Studio
2 k 1 . 1
sp 4 r 2
4 s 2 n 2
4 a 2
4 2 E 1 i 1 2
y 1
l 1 CS 102
Building a Tree
C#ODE Studio
2 k 1 . 1
sp 4 r 2
4 s 2 n 2
4 a 2 2
4 2 y 1
e 8
E 1
i 1
l 1
CS 102
Building a Tree
C#ODE Studio
4 r 2 s 2 n 2
4 a 2 E 1 6 2 k 1 . 1 sp 4 2 i 1
4 2 y 1 l 1
e 8
CS 102
Building a Tree
C#ODE Studio
4 r 2 s 2 n 2
4 a 2
e sp 4 8
2 E 1 i 1
y 1
2 l 1 k 1
. 1
What is happening to the characters with a low number of occurrences?
CS 102
Building a Tree
C#ODE Studio
4 2 E 1 i 1 2 2 l 1
6 sp 4
e 8
y 1
k 1
. 1
8 4 r 2 CS 102 s 2 n 2 4 a 2
Building a Tree
C#ODE Studio
4 2 E 1 i 1 2 2 l 1
6 sp 4
e 8 4 r 2 s 2
8 4 n 2 a 2
y 1
k 1
. 1
CS 102
Building a Tree
C#ODE Studio
e 8
4
r 2 s 2 n 2
4 10 a 2 2 4 2 y 1 6
2 l 1
k 1 . 1
sp 4
E 1
i 1
CS 102
Building a Tree
C#ODE Studio
e 8
10 4 4 a 2 E 1 2 i 1 2 2 l 1 6 sp 4
4
r 2 s 2 n 2
y 1
k 1
. 1
CS 102
Building a Tree
C#ODE Studio
10 4 2 E 1 i 1 2 2 l 1 16 6 sp 4 e 8 4 r 2 s 2 n 2 8 4 a 2
y 1
k 1
. 1
CS 102
Building a Tree
C#ODE Studio
10 4 2 E 1 i 1 2 2 l 1 6 sp 4
16 e 8 4 r 2 s 2 n 2
8 4
a 2
y 1
k 1
. 1
CS 102
Building a Tree
26
C#ODE Studio
10
4 2 E 1 i 1 y 1 2 l 1 k 1 2 . 1
16 e 8 sp 4 r 2 4 s 2 n 2
8 4
a 2
CS 102
Building a Tree
C#ODE Studio
26 10 4 2 E 1 i 1 y 1 2 l 1 k 1 6 e 8 sp 4 . 1 r 2 4 s 2 n 2 16 8 4 a 2
After enqueueing this node there is only one node left in priority queue.
CS 102
Building a Tree
Dequeue the single node left in the queue. This tree contains the new code words for each character.
2 10 4 2 2 6
C#ODE Studio
26 16 e 8 sp 4 4 8 4
Frequency of root node should equal number of characters in text.
E i y l k . 1 1 1 1 1 1
r s n a 2 2 2 2
Eerie eyes seen near lake.
CS 102
26 characters
Encoding the File
Traverse Tree for Codes
Perform a traversal of the tree to obtain new code words Going left is a 0 going right is a 1 code word is only completed when a leaf node is reached
C#ODE Studio
26 10 16 e 8 sp 4 4 8 4
4 2 2 2
E i y l k . 1 1 1 1 1 1
r s n a 2 2 2 2
CS 102
Encoding the File
Traverse Tree for Codes
Char E i y l k . space e r s n a Code 0000 0001 0010 0011 0100 0101 011 10 1100 1101 1110 1111
C#ODE Studio
26 10 16 e 8 sp 4 4 8 4
4 2 2 2
E i y l k . 1 1 1 1 1 1
r s n a 2 2 2 2
CS 102
Encoding the File
Rescan text and encode file using new code words
Eerie eyes seen near lake.
C#ODE Studio
0000101100000110011 1000101011011010011 1110101111110001100 1111110100100101
Why is there no need for a separator character?
.
Char E i y l k . space e r s n a
CS 102
Code 0000 0001 0010 0011 0100 0101 011 10 1100 1101 1110 1111
Encoding the File
Results
C#ODE Studio
Have we made 0000101100000110011 things any 1000101011011010011 better? 1110101111110001100 73 bits to encode 1111110100100101 the text ASCII would take 8 * 26 = 208 bits If modified code used 4 bits per character are needed. Total bits 4 * 26 = 104. Savings not as great.
CS 102
Decoding the File
C#ODE Studio
How does receiver know what the codes are? Tree constructed for each text file.
Considers frequency for each file Big hit on compression, especially for smaller files
Tree predetermined
based on statistical analysis of text files or file types
Data transmission is bit based versus byte based
CS 102
Decoding the File
Once receiver has tree it scans incoming bit stream 0 go left 1 go right
26 10 4 6
C#ODE Studio
16 e 8 sp 4 4 8 4
101000110111101111 01111110000110101
E i y l k . 1 1 1 1 1 1
r s n a 2 2 2 2
CS 102
Summary
C#ODE Studio
Huffman coding is a technique used to compress files for transmission Uses statistical coding
more frequently used symbols have shorter code words
Works well for text and fax transmissions An application that uses several data structures
CS 102