Chapter 3
Multimedia Data Compression
1. Lossless and Lossy compression
2. Run Length Coding
3. Huffman coding
4. Dictionary-based coding (LZW)
3.1 Lossless and Lossy compression
• Compression: the process of coding that will effectively reduce the
total number of bits needed to represent certain information.
Fig 3.1 A general data compression scheme
• We call the output of the encoder codes or codewords.
• The intermediate medium could either be data storage or a
communication/computer network.
• If the compression and decompression processes induce no
information loss, the compression scheme is lossless; otherwise, it is
lossy.
compression ratio = B0 / B1
B0 – number of bits before compression
B1 – number of bits after compression
• In general, we would desire any codec (encoder/decoder scheme) to
have a compression ratio much larger than 1.0.
• The higher the compression ratio, the better the lossless compression
scheme, as long as it is computationally feasible.
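As a quick illustration of the formula above, the ratio can be computed directly from the two bit counts. This is a minimal sketch; the function name `compression_ratio` is an assumption made for illustration, and the sample figures (120 bits before, 75 bits after) are the ones used in Example 1 of Section 3.3.

```python
def compression_ratio(bits_before: int, bits_after: int) -> float:
    """Return B0 / B1, the number of bits before over the number after."""
    if bits_after <= 0:
        raise ValueError("bits_after must be positive")
    return bits_before / bits_after


# 120 bits uncompressed, 75 bits after coding
ratio = compression_ratio(120, 75)
print(f"compression ratio = {ratio:.2f}")
```

A ratio above 1.0 means the coded form is smaller than the input; a ratio below 1.0 means the "compressed" form is actually larger.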
3.2 Run Length Coding
Run-length coding is a simple and very widely used
compression technique.
In this method we replace runs of symbols with pairs of
(run-length, symbol).
Example:
Input symbols: 7,7,7,7,7,90,9,9,9,1,1,1
requires 12 bytes
Using RLC: 5,7,90,3,9,3,1 = 7 bytes
(the single 90 is written without a run length)
Compression ratio: 12/7 ≈ 1.71
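The scheme can be sketched in a few lines. Note one assumption that differs from the example above: this sketch always emits strict (run-length, symbol) pairs, so the single 90 becomes 1,90 and the output has 8 values rather than 7. Strict pairs keep the stream unambiguous for the decoder; the function names are chosen for illustration only.

```python
from itertools import groupby


def rle_encode(symbols):
    """Return a flat list [count, symbol, count, symbol, ...]."""
    encoded = []
    for symbol, run in groupby(symbols):
        encoded.extend([len(list(run)), symbol])
    return encoded


def rle_decode(encoded):
    """Invert rle_encode by expanding each (count, symbol) pair."""
    decoded = []
    for run_length, symbol in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([symbol] * run_length)
    return decoded


data = [7, 7, 7, 7, 7, 90, 9, 9, 9, 1, 1, 1]
encoded = rle_encode(data)
print(encoded)                      # [5, 7, 1, 90, 3, 9, 3, 1]
print(len(data) / len(encoded))     # 1.5
```

With strict pairs the ratio for this input is 12/8 = 1.5. Input without long runs can grow to twice its original size, which is why practical schemes add a way to mark literal (non-repeated) symbols.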
3.3 Huffman coding
• Huffman coding is an efficient method of compressing data without
losing information.
• Huffman coding provides an efficient, unambiguous code by analyzing
how frequently each symbol appears in a message.
• Symbols that appear more often will be encoded as a shorter-bit
string while symbols that aren't used as much will be encoded as
longer strings.
• There are two major parts in Huffman coding:
1) Build a Huffman Tree from input characters.
2) Traverse the Huffman Tree and assign codes to characters.
Algorithm
1. Initialization: put all symbols on the list sorted according to their
frequency counts.
2. Repeat until the list has only one symbol left.
a) From the list, pick two symbols with the lowest frequency counts. Form a
Huffman subtree that has these two symbols as child nodes and create a
parent node for them.
b) Assign the sum of the children’s frequency counts to the parent and insert
it into the list, such that the order is maintained.
c) Delete the children from the list.
3. Assign a codeword for each leaf based on the path from the root.
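The steps above can be sketched with a binary heap standing in for the sorted list. This is a minimal sketch, not a reference implementation: the function name `huffman_codes`, the tuple representation of subtrees, and the tie-breaking counter are all assumptions made for illustration. The sample counts are the occurrence counts from Example 2 later in this section. Because ties between equal frequencies can be broken in different ways, the individual codewords may differ from a hand-built tree while the total coded length stays the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each symbol in freqs (symbol -> count) to a bit string.

    Symbols must not be tuples, since tuples represent internal nodes here.
    """
    if not freqs:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # a lone symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # lowest frequency
        f2, _, right = heapq.heappop(heap)    # second lowest
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def assign(node, prefix):
        if isinstance(node, tuple):    # internal node: 0 = left, 1 = right
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix

    assign(heap[0][2], "")
    return codes


counts = {"S1": 30, "S2": 10, "S3": 20, "S4": 5, "S5": 10, "S6": 25}
codes = huffman_codes(counts)
total_bits = sum(counts[s] * len(codes[s]) for s in counts)
print(codes)
print(total_bits)   # total coded length in bits for all 100 symbols
```

Using a heap avoids re-sorting the list after every merge; each pop and push costs O(log n) for n symbols.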
Properties of Huffman coding
1. Unique Prefix Property: No Huffman code is a prefix of any other
Huffman code - precludes any ambiguity in decoding.
2. Optimality: minimum redundancy code - proved optimal for a given
data model (i.e., a given, accurate, probability distribution):
a) The two least frequent symbols will have the same length for their Huffman
codes, differing only at the last bit.
b) Symbols that occur more frequently will have shorter Huffman codes than
symbols that occur less frequently.
c) The average code length l̄ for an information source S is strictly less than
η + 1, where η is the entropy of S:
$\bar{l} < \eta + 1$
The definition of entropy (η) is aimed at identifying often-occurring symbols in the
data stream as good candidates for short codewords in the compressed bitstream.
Example 1
• Suppose the string below is to be sent over a network.
• Each character occupies 8 bits and the string has 15 characters in
total. Thus, 8 × 15 = 120 bits are required to send the string
uncompressed.
• Using the Huffman Coding technique, we can compress the string to a
smaller size.
• Huffman coding first creates a tree using the frequencies of the
characters and then generates a code for each character.
• Once the data is encoded, it has to be decoded. Decoding is done
using the same tree.
Huffman coding is done with the help of the following steps.
1. Calculate the frequency of each character in the string.
2. Sort the characters in increasing order of the frequency. These are
stored in a priority queue Q.
3. Make each unique character a leaf node.
4. Create an empty node z. Assign the minimum frequency to the left
child of z and the second minimum frequency to the right child of z.
Set the value of z to the sum of these two minimum frequencies.
5. Remove these two minimum frequencies from Q and add the sum
to the list of frequencies (* denotes the internal nodes in the
figure).
6. Insert node z into the tree.
7. Repeat steps 4 to 6 until a single node, the root, remains.
8. For each non-leaf node, assign 0 to the left edge and 1 to the right
edge.
• To send the string over a network, we have to send the tree as well
as the compressed code. The total size is given by the table below.
• Without encoding, the total size of the string was 120 bits. After
encoding the size is reduced to 32+15+28 = 75 bits.
Decoding the code
• For decoding the code, we can take the code and traverse through
the tree to find the character.
• Suppose 101 is to be decoded. We traverse from the root as in the
figure below.
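Tree-based decoding can be sketched as follows. As sample data the sketch uses the codeword table of Example 2 rather than the tree of Example 1, so the result of decoding 101 here applies to that table only. The nested-dictionary tree and the function names are assumptions made for illustration.

```python
def build_tree(codes):
    """Build a decoding tree (nested dicts keyed by '0'/'1') from a code table."""
    root = {}
    for symbol, word in codes.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol        # a leaf stores the symbol itself
    return root


def decode(bits, root):
    """Walk the tree bit by bit; emit a symbol at each leaf, then restart."""
    decoded = []
    node = root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):     # reached a leaf
            decoded.append(node)
            node = root
    if node is not root:
        raise ValueError("bitstream ends in the middle of a codeword")
    return decoded


# Codewords taken from the table in Example 2
codes = {"S1": "00", "S2": "101", "S3": "11",
         "S4": "1001", "S5": "1000", "S6": "01"}
tree = build_tree(codes)
print(decode("101", tree))         # ['S2']
print(decode("00100111", tree))    # ['S1', 'S4', 'S3']
```

No separators are needed between codewords: because no codeword is a prefix of another, the decoder always knows when a leaf has been reached.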
Example 2
Source   Number of     Codeword   Length of
symbol   occurrences   assigned   codeword
S1       30            00         2
S2       10            101        3
S3       20            11         2
S4       5             1001       4
S5       10            1000       4
S6       25            01         2
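Construction of the code by repeated merging of the two least probable
entries (probabilities in parentheses; at each merge the upper entry is
labeled 0 and the lower entry 1):
Step 0: S1 (0.30), S6 (0.25), S3 (0.20), S2 (0.10), S5 (0.10), S4 (0.05)
Step 1: S1 (0.30), S6 (0.25), S3 (0.20), S5,4 (0.15), S2 (0.10)
Step 2: S1 (0.30), S6 (0.25), S5,4,2 (0.25), S3 (0.20)
Step 3: S5,4,2,3 (0.45), S1 (0.30), S6 (0.25)
Step 4: S1,6 (0.55), S5,4,2,3 (0.45)
Step 5: S (1.0)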
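The bound on the average code length stated under the properties of Huffman coding can be checked numerically for this example. The sketch below computes the average code length and the entropy η from the occurrence counts and codewords in the table of Example 2; the variable names are chosen for illustration.

```python
from math import log2

# (number of occurrences, codeword) per source symbol, from Example 2
table = {
    "S1": (30, "00"),
    "S2": (10, "101"),
    "S3": (20, "11"),
    "S4": (5, "1001"),
    "S5": (10, "1000"),
    "S6": (25, "01"),
}

total = sum(n for n, _ in table.values())

# Average code length: sum of p_i * length_i
avg_len = sum(n / total * len(word) for n, word in table.values())

# Entropy: -sum of p_i * log2(p_i)
entropy = -sum((n / total) * log2(n / total) for n, _ in table.values())

print(f"average code length = {avg_len:.3f} bits/symbol")
print(f"entropy             = {entropy:.3f} bits/symbol")
print(entropy <= avg_len < entropy + 1)
```

For these counts the average code length is 2.4 bits per symbol against an entropy of about 2.366 bits per symbol, so the code is within 0.04 bits of the entropy.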
3.4 Dictionary-based coding (LZW)
• LZW (Lempel-Ziv-Welch) employs an adaptive, dictionary-based
compression technique. Unlike variable-length coding, in which the
lengths of the codewords differ, LZW uses fixed-length codewords
to represent variable-length strings of symbols/characters that
commonly occur together, such as words in English text.
• The LZW encoder and decoder build up the same dictionary
dynamically while receiving the data.
• LZW places longer and longer repeated entries into a dictionary, and
then emits the code for an element, rather than the string itself, if the
element has already been placed in the dictionary.
Algorithm:
Example
LZW Compression for String ABABBABCABABBA
• Let us start with a very simple dictionary (also referred to as a string
table), initially containing only three characters, with codes as follows:
A = 1, B = 2, C = 3.
• Now if the input string is ABABBABCABABBA, the LZW compression
algorithm works as follows:
• The output codes are 1 2 4 5 2 3 4 6 1. Instead of 14 characters, only 9
codes need to be sent. If we assume each character or code is
transmitted as a byte, that is quite a saving (the compression ratio
would be 14/9 = 1.56).
• LZW is an adaptive algorithm, in which the encoder and decoder
independently build their own string tables. Hence, there is no
overhead involving transmitting the string table.
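The encoder and decoder for the example above can be sketched as follows. This is a minimal sketch under stated assumptions: codes start at 1 and follow the order of the initial alphabet (A = 1, B = 2, C = 3, as implied by the output codes), every input character is in the alphabet, and the dictionary is allowed to grow without limit. The function names are chosen for illustration.

```python
def lzw_encode(text, alphabet):
    """Return the list of LZW codes for text, starting from a 1-based alphabet table."""
    table = {ch: i for i, ch in enumerate(alphabet, start=1)}
    next_code = len(table) + 1
    codes = []
    s = ""
    for ch in text:
        if s + ch in table:
            s += ch                        # keep extending the current match
        else:
            codes.append(table[s])         # emit code for the longest match
            table[s + ch] = next_code      # add the new, longer string
            next_code += 1
            s = ch
    if s:
        codes.append(table[s])
    return codes


def lzw_decode(codes, alphabet):
    """Rebuild the text, constructing the same table the encoder built."""
    table = {i: ch for i, ch in enumerate(alphabet, start=1)}
    next_code = len(table) + 1
    if not codes:
        return ""
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:            # code defined by the entry being built
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)


codes = lzw_encode("ABABBABCABABBA", "ABC")
print(codes)                       # [1, 2, 4, 5, 2, 3, 4, 6, 1]
print(lzw_decode(codes, "ABC"))    # ABABBABCABABBA
```

The decoder never receives the string table. It adds one entry per received code, always one step behind the encoder, which is why the special case `code == next_code` is needed.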