ELEC1010 Tutorial 8
Source Coding
Fixed Length Code
Huffman Code
Raymundo TANG TANG, Electronic and Computer Eng. Dept, HKUST
Coding
Converts information into a different form for storage, transmission,
or other types of processing.
Why coding?
Compression
Error Protection
Secrecy
Source coding
Ensures bits are used efficiently
Reduces the number of bits
Channel coding
Adds redundant bits for protection.
Help receiver to detect and possibly recover bit errors.
Example - Video storage in a Blu-ray Disc (BD)
A full HD frame size has: 1080 × 1920 pixels
There are 1080 ×1920 = 2,073,600 pixels per frame
With color: 3 color channels to represent one pixel. One byte per each
color channel.
Total number of bytes in each frame: 2,073,600×3 = 6,220,800 bytes/frame
For a 2-hour movie at 60 frames/second, there would be:
6,220,800 bytes/frame × 60 frames/s × 3,600 s/hr × 2 hr ≈ 2.44 TB/movie (~100 Blu-ray discs)
There is a huge number of bytes in a 2-hour movie. We need to reduce
the data size to save storage space.
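The arithmetic above can be checked in a few lines (a quick sketch; the 2.44 figure uses binary terabytes, 1 TB = 2^40 bytes):

```python
# Uncompressed size of a 2-hour full-HD movie at 60 frames/second.
pixels_per_frame = 1080 * 1920          # full HD resolution
bytes_per_frame = pixels_per_frame * 3  # 3 color channels, 1 byte each
total_bytes = bytes_per_frame * 60 * 3600 * 2  # 60 fps, 3600 s/hr, 2 hours

terabytes = total_bytes / 2**40         # binary terabytes
print(f"{terabytes:.2f} TB")            # -> 2.44 TB
print(f"~{total_bytes / (25 * 10**9):.0f} single-layer 25 GB Blu-ray discs")
```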
Source Coding (Data Compression)
Source Coding: the process of encoding information using a
small number of bits (compression).
Decompression: the process of reconstructing the original
signal, or something close to it, from the compressed version.
Compression:
Lossless compression
Lossy compression
Source Coding - Lossless Compression
Nothing about the original set of numbers or symbols is lost.
Reversible
The reverse process is decompression.
The exact set of original numbers can be recovered by
decompression process.
Used for signals and information that require accuracy.
e.g. financial records, scientific data
WinZip and WinRAR are lossless compression software.
Source Coding - Lossy Compression
Selectively throw away information about the sets of
numbers or symbols being compressed.
Loses features that would not be noticed or are not
important in the application.
Not reversible.
Cannot get back the exact original set of numbers from
the compressed version.
Reconstruct something close to the original signal from
compressed version.
Offers bigger compression ratio.
e.g. MP3 music, JPEG picture, DVD movies
Source Coding Basics : Symbol-to-bit assignment
The information is a sequence of symbols.
The set of symbols is the set {xi} in Shannon’s entropy
equation.
e.g. the 52 letters in a book: {a, b, c, d, …, z; A, B, C, …, Z}
The black and white dots in a fax page: {B, W}
Assign a sequence of bits (codeword) to represent each
symbol.
Source Coding -Fixed-Length Code (Uniform Code)
Every codeword has the same code length (the number of bits).
For N symbols, we need a minimum of M bits, such that the total
number of combinations satisfies 2^M ≥ N.
Examples:
# of Symbols Symbol Set Possible Codewords
2: {B, W} 0, 1
4: {a, e, q, z} 00, 01, 10, 11
Any problem with a uniform code?
What will happen if there are 5 symbols?
Examples:
# of Symbols Symbol Set Possible Codewords
5: {a, b, c, d, e} 000, 001, 010, 011, 100
Requires 3 bits per symbol, but 2^3 = 8 > 5, so three of the eight codewords go unused: not efficient.
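The minimum fixed-length codeword size for N symbols, M = ⌈log2 N⌉, can be sketched as:

```python
import math

def min_fixed_length(n_symbols: int) -> int:
    """Smallest M such that 2**M >= n_symbols."""
    return math.ceil(math.log2(n_symbols))

for n in (2, 4, 5, 26):
    print(n, "symbols ->", min_fixed_length(n), "bits")
```

For 5 symbols this gives 3 bits, even though only 5 of the 8 available codewords are used.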
Source Coding - Variable-Length Code
Assigns codewords of different lengths to different symbols.
Examples:
# of Symbols Symbol Set Possible Codewords
5: {a, e, r, o, z} 1, 0, 10, 11, 01
On average, fewer than 2 bits are needed to represent each symbol.
More efficient than a uniform code.
Is there any problem?
Given the sequence: 110110
The sequence can be decoded as: aaeaae, aazr, araae, ozr, ozae, oeoe, ……
The decoding is ambiguous: the code is not uniquely decodable.
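The ambiguity can be demonstrated by enumerating every possible parse of the bit string (a small sketch using the codebook above):

```python
code = {"a": "1", "e": "0", "r": "10", "o": "11", "z": "01"}

def all_decodings(bits: str) -> list[str]:
    """Return every symbol sequence whose codewords concatenate to `bits`."""
    if not bits:
        return [""]
    results = []
    for sym, cw in code.items():
        if bits.startswith(cw):  # this symbol could start the sequence
            results += [sym + rest for rest in all_decodings(bits[len(cw):])]
    return results

parses = all_decodings("110110")
print(len(parses), "different decodings, e.g.", parses[:5])
```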
Source Coding – Prefix Code
Typically a variable-length code.
No codeword is a prefix of any other codeword.
Uniquely decodable.
e.g. a=000, b=001, c=010, d=011, e=1 is a prefix code:
0011011000010 can only be decoded as bedac
Note: a=0, b=1, c=10, d=11 is not a prefix code (b=1 is a prefix of both c=10 and d=11).
How to assign a code to each symbol?
Code Tree
• Branch to the left for a “0” bit and to the right for a “1” bit.
• All codewords must be at the leaves. Prefix Code -> Uniquely Decodable
• Decode from the top of the tree.
[Code tree: the root’s “1” branch is the leaf e; its “0” branch splits twice more into the leaves a=000, b=001, c=010, d=011]
Is the code efficient? How to use the bits more efficiently?
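Decoding a prefix code needs no lookahead: accumulate bits until they match a codeword, emit the symbol, and reset (a sketch using the codebook from this slide):

```python
code = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "1"}
decode_table = {cw: sym for sym, cw in code.items()}  # codeword -> symbol

def decode(bits: str) -> str:
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_table:   # a complete codeword: emit and reset
            out.append(decode_table[buf])
            buf = ""
    assert buf == "", "bit string ended in the middle of a codeword"
    return "".join(out)

print(decode("0011011000010"))  # -> bedac
```

This works only because no codeword is a prefix of another; with the non-prefix code from the previous slide, the buffer would match too early.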
Source Coding – Huffman Code
Developed by David Albert Huffman.
Variable-length, prefix source coding technique.
Uniquely decodable.
Used for lossless data compression.
Objective: the most infrequent symbols use the most bits; the most frequent use the fewest.
Source Coding – Huffman Code
“Relative frequency”: the frequency of a symbol compared to the other
symbols. This relative frequency is the probability p(x) in Shannon’s
entropy formula! Here are the steps:
Step 1: find the relative frequency p(x) of each symbol
Step 2: combine the two most infrequent symbols into one new symbol
Step 3: assign 0 to the left (upper) branch and 1 to the right (lower) branch (this can
be arbitrary, e.g. we may instead follow the convention that the
group/symbol with the higher probability is assigned a ‘0’ and the one with the
lower probability a ‘1’, or vice versa)
Step 4: the relative frequency of the new symbol is the sum of the two it replaces
Step 5: repeat steps 2 to 4 until all symbols are merged into one tree
Step 6: convert the code tree into a prefix code
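The steps above can be sketched with Python’s heapq module (the min-heap always yields the two least-frequent nodes; the 0/1 branch assignment here is one arbitrary choice):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code from {symbol: relative frequency}."""
    order = count()  # tiebreaker so the heap never compares tree nodes
    heap = [(p, next(order), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # step 2: two least-frequent nodes
        p1, _, right = heapq.heappop(heap)
        # step 4: the merged node's frequency is the sum of its children
        heapq.heappush(heap, (p0 + p1, next(order), (left, right)))
    codes = {}
    def walk(node, prefix):                  # step 6: read codewords off the tree
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")      # step 3: 0 on one branch...
            walk(node[1], prefix + "1")      # ...1 on the other
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"      # single-symbol edge case
        return codes
    return walk(heap[0][2], "")

print(huffman_code({"A": 0.2, "B": 0.7, "C": 0.1}))
```

The individual bits may differ from a hand-drawn tree (the 0/1 assignment is arbitrary), but the codeword lengths, and therefore the average bits/symbol, are the same.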
Source Coding – Huffman Code Exercise 1
There are 100 students taking the exam. 20% of them get A, 70% of
them get B, 10% of them get C.
i. Perform Huffman coding on the symbols. Provide a codeword for
each symbol and calculate the average bits/symbol.
ii. Perform Huffman coding on symbol pairs. Provide a codeword for
each symbol pair and calculate the average bits/symbol.
answer
i. P(A)=0.2, P(B)=0.7, P(C)=0.1
Merge the two least-probable symbols, A (0.2) and C (0.1), into a group with probability 0.3; then merge this group with B (0.7).
Symbol Codeword
B 0
A 10
C 11
Avg. bit/sym = 0.2×2 + 0.7×1 + 0.1×2 = 1.3 bits/symbol
Source Coding – Huffman Code Exercise 1
ii. answer
Pair probabilities (assuming independent grades): P(BB)=0.49, P(BA)=0.14, P(AB)=0.14,
P(CB)=0.07, P(BC)=0.07, P(AA)=0.04, P(CA)=0.02, P(AC)=0.02, P(CC)=0.01
[Code tree: merge AC+CC→0.03, then CA+0.03→0.05, AA+0.05→0.09, CB+BC→0.14,
0.09+0.14→0.23, BA+AB→0.28, 0.23+0.28→0.51, and finally BB+0.51→1.0.
The resulting codewords are listed on the next slide.]
Source Coding – Huffman Code Exercise 1 answer
Symbol Probability Codeword
BB 0.49 0
BA 0.14 100
AB 0.14 101
CB 0.07 1100
BC 0.07 1101
AA 0.04 1110
CA 0.02 11110
AC 0.02 111110
CC 0.01 111111
Avg. bit/sym = (0.49×1 + 0.14×3 + 0.14×3 + 0.07×4 + 0.07×4 + 0.04×4 + 0.02×5 + 0.02×6 + 0.01×6)/2
= 2.33/2 = 1.165 bits/symbol
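The average can be double-checked numerically (probabilities and codewords from the table above; dividing by 2 converts bits per pair to bits per symbol):

```python
pairs = {  # pair: (probability, codeword)
    "BB": (0.49, "0"),      "BA": (0.14, "100"),    "AB": (0.14, "101"),
    "CB": (0.07, "1100"),   "BC": (0.07, "1101"),   "AA": (0.04, "1110"),
    "CA": (0.02, "11110"),  "AC": (0.02, "111110"), "CC": (0.01, "111111"),
}
bits_per_pair = sum(p * len(cw) for p, cw in pairs.values())
print(round(bits_per_pair / 2, 3))  # -> 1.165 bits/symbol
```

Coding pairs (1.165 bits/symbol) beats coding single symbols (1.3 bits/symbol) because longer blocks let the code match the probabilities more closely.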
Source Coding – Huffman Code Exercise 2
Assume that we encode the sequence 111101100100 by grouping three
binary bits into one symbol and encoding one symbol at a time using
Huffman coding. Assume further that successive bits are independent,
so the probability of a specific sequence of three binary bits equals
the product of the individual bit probabilities, as shown below:
P(000) = P(0)×P(0)×P(0); P(001) = P(0)×P(0)×P(1)
P(010) = P(0)×P(1)×P(0); P(011) = P(0)×P(1)×P(1)
P(100) = P(1)×P(0)×P(0); P(101) = P(1)×P(0)×P(1)
P(110) = P(1)×P(1)×P(0); P(111) = P(1)×P(1)×P(1)
1) Compute the probabilities of the 8 different symbols.
2) Draw the code tree and provide the codeword for each symbol.
3) Encode the sequence with the codewords and provide the
encoded sequence.
Source Coding – Huffman Code Exercise 2 answer
1) The sequence contains seven 1s and five 0s, so P(1) = 7/12 ≈ 0.6 and P(0) = 5/12 ≈ 0.4 (rounded to one decimal place).
P(000) = 0.4 × 0.4 × 0.4 = 0.064
P(001) = 0.4 × 0.4 × 0.6 = 0.096
P(010) = 0.4 × 0.6 × 0.4 = 0.096
P(011) = 0.4 × 0.6 × 0.6 = 0.144
P(100) = 0.6 × 0.4 × 0.4 = 0.096
P(101) = 0.6 × 0.4 × 0.6 = 0.144
P(110) = 0.6 × 0.6 × 0.4 = 0.144
P(111) = 0.6 × 0.6 × 0.6 = 0.216
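The eight symbol probabilities follow directly from the independence assumption (a sketch using the rounded values P(1)=0.6, P(0)=0.4):

```python
from itertools import product

p_bit = {"0": 0.4, "1": 0.6}  # rounded bit probabilities from part 1

probs = {}
for bits in product("01", repeat=3):      # all 8 three-bit symbols
    sym = "".join(bits)
    probs[sym] = p_bit[sym[0]] * p_bit[sym[1]] * p_bit[sym[2]]

for sym in sorted(probs):
    print(sym, round(probs[sym], 3))
```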
Source Coding – Huffman Code Exercise 2 answer
2) [Code tree: built by repeatedly merging the two least-probable symbols; the resulting codewords are listed on the next slide.]
Source Coding – Huffman Code Exercise 2 answer
2)…The codewords are:
Symbol Codeword
000 1111
001 1110
010 011
011 101
100 010
101 100
110 110
111 00
3) Original sequence: 111101100100
Split into 3-bit symbols: 111 | 101 | 100 | 100
Codewords: 111→00, 101→100, 100→010, 100→010
Encoded sequence: 00100010010
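Encoding is then a straight table lookup on each 3-bit group (a sketch using the codeword table above):

```python
codewords = {"000": "1111", "001": "1110", "010": "011", "011": "101",
             "100": "010", "101": "100", "110": "110", "111": "00"}

seq = "111101100100"
groups = [seq[i:i + 3] for i in range(0, len(seq), 3)]  # 111, 101, 100, 100
encoded = "".join(codewords[g] for g in groups)
print(encoded)  # -> 00100010010 (12 bits in, 11 bits out)
```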
Source Coding – Huffman Code
The Huffman solution is not necessarily unique, but every valid solution gives the same average bits/symbol.
Huffman coding is completely reversible (assuming no bit
errors over storage or transmission). It is lossless.
More frequently used symbols have shorter codewords.