Information Theory and Data Compression
Level 3
2022/2023
Course Syllabus
IT351: Information Theory and Data Compression
Prerequisite: Math (3) - Object Oriented Programming
Instructor: Dr. Omnia Elbarbary
Textbooks:
1. Information Theory, Pattern Recognition, and Neural Networks (English, free), David J. C.
MacKay, University of Cambridge, ISBN: 9780521642989. FREE, downloadable from the
course Gdrive.
2. A Student's Guide to Coding and Information Theory, Stefan M. Moser and Po-Ning Chen.
3. Introduction to Information Theory Based on Examples (in Japanese), Shinichi Oishi,
Kodansha (1993), ISBN 978-4-06-153803-0 (not required)
Course description:
This course, “Information Theory and Data Compression”, provides knowledge and basic skills as
follows. Transmitting information efficiently and accurately is one of the important technical
challenges in modern digital society. Information theory is rooted in mathematical formulation
and provides a theoretical solution to this problem. The ideas of information theory make it
possible to construct efficient codes for information communication and error correction by
using probability and statistical theorems. Information theory plays an important role in fields
such as image data compression, cryptology, network communication, and the evaluation of
information quantity.
Course objectives:
The main contents of this course are the mathematical formulation of source coding and channel
coding, efficient methods for constructing codes, and entropy as a measure of information
ambiguity and information volume.
Lecture Contents:
• Lecture 1: Course introduction, Intro. to Information Theory
• Lecture 2: Lossless Compression
• Lecture 3 – 4: Variable Length Coding and Shannon-Fano Coding
• Lecture 5 – 6: Data compression & Huffman Code (Prefix-free, Trees & Codes, Kraft Inequality,
Trees with Prob., Huffman Code)
• Lecture 7 – 8: Dictionary Compression
• Lecture 9: Entropy & Source Coding
• Lecture 10: Which lossless algorithm is best suited for a given application
• Lecture 11: Lossy Compression
• Lecture 12: Image Compression
Laboratory Exercises:
• Lab 0: tutorial using programming language (2 sessions)
• Lab 1 (2 sessions): Channel coding
• Lab 2 (3 sessions): Source coding - Data compression using Huffman Coding
• Lab 3 (4 sessions): Image transmission over AWGN channel
o Part 1: without source and channel coding
o Part 2: with/without compression
o Part 3: with/without ECC (Hamming (7,4))
Chapter 1
Basic Compression Concepts
Introduction
differences can result in statements with very different meanings. Consider
the sentences “Do not send money” and “Do now send money.” A similar
argument holds for computer files and for certain types of data such as bank
records.
If data of any kind are to be processed or “enhanced” later to yield more
information, it is important that the integrity be preserved. For example,
1.1.2 Lossy Compression
Lossy compression techniques involve some loss of information, and data that
have been compressed using lossy techniques generally cannot be recovered
or reconstructed exactly.
In return for accepting this distortion in the reconstruction, we can generally
obtain much higher compression ratios than is possible with lossless
compression.
In many applications, this lack of exact reconstruction is not a problem. For
example, when storing or transmitting speech, the exact value of each sample
of speech is not necessary. Depending on the quality required of the
reconstructed speech, varying amounts of loss of information about the value
of each sample can be tolerated. If the quality of the reconstructed speech is to
be similar to that heard on the telephone,
a significant loss of information can
be tolerated. However, if the reconstructed speech needs to be of the quality
heard on a compact disc, the amount of information loss that can be tolerated
is much lower.
Similarly, when viewing a reconstruction of a video sequence, the fact that the
reconstruction is different from the original is generally not important as long
as the differences do not result in annoying artifacts. Thus, video is generally
compressed using lossy compression.
Once we have developed a data compression scheme, we need to be able to
measure its performance. Because of the number of different areas of
application, different terms have been developed to describe and measure the
performance.
1.1.3 Measures of Performance
original data. Therefore, in order to determine the efficiency of a compression
algorithm, we have to have some way of quantifying the difference. The
difference between the original and the reconstruction is often called the
distortion. Lossy techniques are generally used for the compression of data
that originate as analog signals, such as speech and video. In compression of
speech and video, the final arbiter of quality is human. Because human
There is a saying attributed to Bobby Knight, the basketball coach at Texas
Tech University: “If the only tool you have is a hammer, you approach every
problem as if it were a nail.” Our intention in this book is to provide you with
a large number of tools that you can use to solve the particular data
compression problem. It should be remembered that data compression, if it is
a science at all, is an experimental science. The approach that works best for a
in the data, we can represent the sequence using fewer bits. If we plot these
data as shown in Figure 1.2, we see that the data seem to fall on a straight
line. A model for the data could therefore be a straight line given by the
equation
The residual sequence consists of only three numbers {−1, 0, 1}. If we assign
a code of 00 to −1, a code of 01 to 0, and a code of 10 to 1, we need to use 2
bits to represent each element of the residual sequence. Therefore, we can
obtain compression by transmitting or storing the parameters of the model and
the residual sequence. The encoding can be exact if the required compression
is to be lossless, or approximate if the compression can be lossy.
13
The type of structure or redundancy that existed in these data follows a simple
law. Once we recognize this law, we can make use of the structure to predict
the value of each element in the sequence and then encode the residual.
Structure of this type is only one of many types of structure. Consider the
following example.
The sequence does not seem to follow a simple law as in the previous case.
However, each value is close to the previous value. Suppose we send the first
value, then in place of subsequent values we send the difference between it
and the previous value. The sequence of transmitted values would be
Like the previous example, the number of distinct values has been reduced.
Fewer bits are required to represent each number and compression is
achieved. The decoder adds each received value to the previous decoded
value to obtain the reconstruction corresponding to the received value.
Techniques that use the past values of a sequence to predict the current value
and then encode the error in prediction, or residual, are called predictive
coding schemes. Assuming both encoder and decoder know the model being
used, we would still have to send the value of the first element of the sequence.
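As a small illustration of difference coding, the following Python sketch (the sample sequence is
made up) sends the first value and then only the differences to the previous value; the decoder
accumulates them and reconstructs the data exactly.

def encode_differences(values):
    residuals = [values[0]]                  # first element is sent as-is
    for prev, cur in zip(values, values[1:]):
        residuals.append(cur - prev)         # residual = prediction error
    return residuals

def decode_differences(residuals):
    values = [residuals[0]]
    for r in residuals[1:]:
        values.append(values[-1] + r)        # add residual to previous value
    return values

data = [27, 28, 29, 28, 26, 27, 29, 28]      # made-up slowly varying data
coded = encode_differences(data)
assert decode_differences(coded) == data
print(coded)                                 # residuals use few distinct values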
single bit to the symbol that occurs most often, and correspondingly longer
codewords to symbols that occur less often. If we substitute the codes for each
symbol, we will use 106 bits to encode the entire sequence. As there are 41
symbols in the sequence, this works out to approximately 2.58 bits per
symbol. This means we have obtained a compression ratio of 1.16:1. We will
study how to use statistical redundancy of this sort.
When dealing with text, along with statistical redundancy, we also see
redundancy in the form of words that repeat often. We can take advantage of
this form of redundancy by constructing a list of these words and then
representing them by their position in the list. This type of compression scheme
is called a dictionary compression scheme. We will study these schemes in the
following Chapters.
1.3 What is Compression Technology?
• source-code
• arbitrary files
• images
• video
• audio data
• speech
Obviously, these data are different in terms of data volume, data structure,
intended usage etc. For that reason, it is plausible to develop specific
compression technologies for different data and application types,
respectively. The idea of a general-purpose compression technique for all
thinkable application scenarios has to be entirely abandoned. Instead, we find
a large number of very different techniques with respect to target data types
and target application environments (e.g. data transmission in networks like
streaming, storage and long-term archiving applications, interactive
multimedia applications like gaming, etc.). Given this vast number of different
techniques, there are different ways to classify compression techniques:
• with respect to the type of data to be compressed
• with respect to the target application area
• with respect to the fundamental building blocks of the algorithms used
Terminology: when talking about compression, we often mean “lossy
compression”. A fundamental term
in the area is compression rate (or compression ratio) which denotes the
relation between the size of the original data before compression and the size
of the compressed data. The compression ratio therefore rates the effectiveness of a
compression system in terms of its data reduction capability. This must not be
confused with other measures of compressed data size like bits per pixel (bpp)
or bits per sample (bps).
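A small worked example (all sizes invented for illustration): a 256 x 256 grayscale image stored
at 8 bpp occupies 65,536 bytes. If the compressed file is 16,384 bytes, the compression ratio is
65,536 / 16,384 = 4:1, while the rate of the compressed representation is 16,384 * 8 / (256 * 256)
= 2 bpp.

width, height, bit_depth = 256, 256, 8            # assumed example image
original_bytes = width * height * bit_depth // 8  # 65536
compressed_bytes = 16384                          # assumed compressed size
ratio = original_bytes / compressed_bytes         # 4.0, i.e. 4:1
bpp = compressed_bytes * 8 / (width * height)     # 2.0 bits per pixel
print(f"compression ratio {ratio:.1f}:1, {bpp:.2f} bpp")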
from the aim of simply reducing the amount of data, standards like MPEG-4,
MPEG-7, and MPEG-21 offer additional functionalities.
During the last years, three important trends have contributed to the fact that
nowadays compression technology is more important than ever before – this
development has already changed the way we work with multimedia data like text,
speech, audio, images, and video, and will lead to new products and applications:
• The availability of highly effective methods for compressing various types
of data.
• The availability of fast and cheap hardware components to conduct
compression on single-chip systems, microprocessors, DSPs and VLSI
systems.
• Convergence of computer, communication, consumer electronics,
publishing, and entertainment industries.
• 1st century B.C.: stenography (shorthand)
• 19th century: Morse and Braille alphabets
• 1950s: compression technologies exploiting statistical redundancy are developed –
bit patterns with varying length are used to represent individual symbols according
to their relative frequency.
• adaptivity: do we expect data with highly varying properties, or can we pre-adapt
to specific properties (e.g., fingerprint compression)?
• synchronization issues: preferably, audio and video data should both be frame-based
• transcoding: can the data be converted into other data types easily (e.g., MJPEG → MPEG)?
Chapter 2
Introduction to Information Theory and Entropy Calculation
2.1 Overview
The treatment of data compression in this book is not very mathematical.
However, we do need some mathematical preliminaries to appreciate the
compression techniques we will discuss. Compression schemes can be
divided into two classes, lossy and lossless. Lossy compression schemes
involve the loss of some information, and data that have been compressed
using a lossy scheme generally cannot be recovered exactly. Lossless schemes
compress the data without loss of information, and the original data can be
recovered exactly from the compressed data.
In this chapter, some of the ideas in information theory that provide the
framework for the development of lossless data compression schemes are
briefly reviewed. We will also look at some ways to model the data that lead
to efficient coding schemes. We have assumed some knowledge of probability concepts.
2.2 A Brief Introduction to Information Theory
Although the idea of a quantitative measure of information has been around
for a while, the person who pulled everything together into what is now called
information theory was Claude Elwood Shannon [7], an electrical engineer at
Bell Labs. Shannon defined a quantity called self-information. Suppose we
have an event A, which is a set of outcomes of some random experiment. If
P(A) is the probability that the event A will occur, then the self-information
associated with A is given by i(A) = −log P(A) = log(1/P(A)).
Note that we have not specified the base of the log function. We will discuss
this in more detail later in the chapter. The use of the logarithm to obtain a
measure of information was not an arbitrary choice as we shall see later in this
chapter. But first let’s see if the use of a logarithm in this context makes sense
from an intuitive point of view. Recall that log(1) = 0, and −log(x) increases
as x decreases from one to zero. Therefore, if the
probability of an event is low, the amount of self-information associated with
it is high; if the probability of an event is high, the information associated
with it is low. Even if we ignore the mathematical definition of information
and simply use the definition we use in everyday language, this makes some
intuitive sense. The barking of a dog during a burglary is a high-probability
event and, therefore, does not contain too much information. However, if the
dog did not bark during a burglary, this is a low-probability event and
contains a lot of information. (Obviously, Sherlock Holmes understood
information theory!) Although this equivalence of the mathematical and
semantic definitions of information holds true most of the time, it does not
hold all the time. For example, a totally random string of letters will contain
more information (in the mathematical sense) than a well-thought-out treatise
on information theory.
Another property of this mathematical definition of information that makes
intuitive sense is that the information obtained from the occurrence of two
independent events is the sum of the information obtained from the occurrence of
the individual events.
The unit of information depends on the base of the log. If we use log base 2,
the unit is bits; if we use log base e, the unit is nats; and if we use log base 10,
the unit is hartleys. Note that to calculate the information in bits, we need to
take the logarithm base 2 of the probabilities. Because this probably does not
appear on your calculator, let’s review logarithms briefly. Recall that
log_b x = ln x / ln b, so in particular log2 x = ln x / ln 2 = log10 x / log10 2.
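As a quick numerical check of this change-of-base rule, the snippet below computes
self-information in bits using only natural logarithms (the probabilities are arbitrary examples).

import math

def self_information_bits(p):
    # i(A) = -log2 P(A), computed via the change-of-base rule
    return -math.log(p) / math.log(2)

for p in (0.5, 0.25, 0.125, 0.9):            # arbitrary example probabilities
    print(f"P(A) = {p:<5}  i(A) = {self_information_bits(p):.3f} bits")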
Lossless compression typically is a process with three stages:
• The adaptor: uses information extracted from the data (usually during
encoding) to adapt the model (more or less) continuously to the data.
information provided by such symbols as being the sum of information as
given by the single independent symbols. We define
the case in audio, image, and video data). To give an example, we consider an
image with 999 pixels and 10 different symbols. In case a) we have p(si) =
1/10, i = 1, . . . , 10, in case b) we have p(s1) = 91/100 and p(si) = 1/100 for i
= 2, . . . , 10.
Case a) corresponds to the “conventional” image with uniformly distributed
symbols, while case b) is the differential image consisting of yi, where s1 is
the zero symbol. In case a) we obtain H(S) = 3.32 bits/symbol, while in case b) the
entropy is only H(S) = 0.72 bits/symbol. This small example immediately makes clear
that differential coding is sensible. It is the basis of many compression schemes
employing prediction.
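The two entropy values quoted above can be reproduced with a few lines of Python
(the probability lists correspond to cases a) and b) exactly as given).

import math

def entropy(probs):
    # H(S) = -sum p(s) log2 p(s), skipping zero-probability symbols
    return -sum(p * math.log2(p) for p in probs if p > 0)

case_a = [1/10] * 10                     # uniformly distributed symbols
case_b = [91/100] + [1/100] * 9          # the "differential image" case
print(round(entropy(case_a), 2))         # 3.32 bits/symbol
print(round(entropy(case_b), 2))         # 0.72 bits/symbol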
We denote the length of the codeword for the symbol si as l(si) (in bits). The
average code length is defined as L(S) = Σ_{i=1..n} p(si) l(si). To minimize this
expression (which obviously is the aim of compression) we need to assign
short codewords to symbols with high probability of occurrence. It is
interesting to notice the close relation to the entropy H(S) = Σ_{i=1..n} p(si) I(si).
Consequently, it is evident that l(si) = I(si) = −log2 p(si) needs to be fulfilled
in order to attain the actual entropy value. However, this is possible only in
case −log2 p(si) is a natural number. Usually, this is not the case of course, so
the value is rounded up to the next integer (which is a hint to the sub-optimality
of the entire procedure!). Shannon-Fano coding (see below) provides exactly
the corresponding solution using the idea described before. The code
efficiency is defined as H(S)/L(S) – obviously, values close to 1 are desirable!
Example: S = {s1, s2, s3, s4} with p(s1) = 0.6, p(s2) = 0.3, p(s3) = 0.05 and
p(s4) = 0.05. H(S) = 1.4 bits/symbol. Choosing a fixed-length encoding
like e.g. 00, 01, 10, 11 as codewords gives an average code length of 2
bits/symbol. When computing l(si) we obtain the values 1, 2, 5, and 5 and arrive
at the following variable-length code: 0, 10, 110 and 111. s3 and s4 would
require a 5-bit codeword, but 3 bits are sufficient to generate a unique code.
The corresponding average code length is 1.5 bits/symbol.
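A short Python check of the numbers in this example (probabilities and codeword lengths as
given above):

import math

probs   = [0.6, 0.3, 0.05, 0.05]                  # p(s1)..p(s4)
lengths = [1, 2, 3, 3]                            # codewords 0, 10, 110, 111

H = -sum(p * math.log2(p) for p in probs)         # entropy, about 1.4 bits/symbol
L = sum(p * l for p, l in zip(probs, lengths))    # average code length, 1.5
print(f"H(S) = {H:.2f}, L(S) = {L:.2f}, efficiency = {H / L:.2f}")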
A prefix code (or prefix-free code) is a code system, typically a variable-
length code, with the “prefix property”: there is no valid code word in the
system that is a prefix (start) of any other valid code word in the set. E.g., a
code with code words 9, 59, 55 has the prefix property; a code consisting of 9,
5, 59, 55 does not, because 5 is a prefix of both 59 and 55. With a prefix code,
a receiver can identify each word without requiring a special marker between
words.
A way to generate a prefix code is to use a full binary tree to represent the
code. The leaves represent the symbols to be coded, while the path traversed
from the root to a leaf determines the bit sequence of the resulting codeword.
To actually determine a bit sequence, we need to define how to encode a
branching: for example, a branching to the left can be encoded by 0 and a
branching to the right by 1. The tree given in the figure leads to the following
codewords: A – 00, B – 01, C – 100, D – 101, E – 11. The tree (and code)
generation of the Shannon-Fano scheme works as follows (in the example, we assume the
following frequency counts: A – 15, B – 7, C – 6, D – 6, E – 5):
1. Sort the symbols by decreasing frequency.
2. Split the list into two parts whose total frequencies are as equal as possible.
3. Assign 0 as the next code bit to the first part and 1 to the second part.
4. Recursively apply steps 2 and 3 to each part until every part contains a single symbol.
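A minimal Python sketch of this recursive splitting, assuming the frequency counts above (the
function name and the way ties between split points are broken are illustrative choices, not a
reference implementation):

def shannon_fano(symbols):
    # symbols: list of (symbol, frequency) pairs, sorted by decreasing frequency
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total, running, split, best_diff = sum(f for _, f in symbols), 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)          # imbalance of this split point
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code                  # first group gets prefix 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code                  # second group gets prefix 1
    return codes

freqs = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
print(shannon_fano(freqs))   # A: 00, B: 01, C: 10, D: 110, E: 111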
bits/symbol can be achieved, however, at a higher cost in terms of computation
and memory requirements.
bit/symbol, since the smallest length for a codeword is of course 1 bit. The
entropy, of course, can be much smaller (recall the coding of differences).
2. In case of changing statistics of the source, one can easily obtain a data
expansion.
3. Since a fixed codebook is of poor quality in many cases, we have a two-stage
algorithm: build the statistics (the model), then generate the code.
prefix of codes assigned to a and b. If the compressed bit stream is 0001, the
decompressed output may be “cccd” or “ccb” or “acd” or “ab”.
There are mainly two major parts in Huffman Coding
1. Build a Huffman Tree from input characters.
2. Traverse the Huffman Tree and assign codes to characters.
Steps to build Huffman Tree
Input is an array of unique characters along with their frequency of occurrences;
output is the Huffman tree.
1. Create a leaf node for each unique character and build a min heap of all leaf
nodes (the frequency field is used to compare two nodes).
2. Extract the two nodes with the minimum frequency from the min heap.
3. Create a new internal node with a frequency equal to the sum of the
two nodes' frequencies. Make the first extracted node its left child
and the other extracted node its right child. Add this node to the
min heap.
4. Repeat steps #2 and #3 until the heap contains only one node. The
remaining node is the root node and the tree is complete.
character Frequency
a 5
b 9
c 12
d 13
e 16
f 45
Step 1: Build a min heap that contains 6 nodes where each node represents the
root of a tree with a single node.
Step 2: Extract the two minimum frequency nodes from the min heap. Add a new
internal node with frequency 5 + 9 = 14.
Now the min heap contains 5 nodes where 4 nodes are roots of trees with a single
element each, and one heap node is the root of a tree with 3 elements
character Frequency
c 12
d 13
Internal Node 14
e 16
f 45
Step 3: Extract two minimum frequency nodes from heap. Add a new internal
node with frequency 12 + 13 = 25
Now the min heap contains 4 nodes where 2 nodes are roots of trees with a single
element each, and two heap nodes are roots of trees with more than one node
character Frequency
Internal Node 14
e 16
Internal Node 25
f 45
Step 4: Extract two minimum frequency nodes. Add a new internal node with
frequency 14 + 16 = 30
Now min heap contains 3 nodes.
character Frequency
Internal Node 25
Internal Node 30
f 45
Step 5: Extract two minimum frequency nodes. Add a new internal node with
frequency 25 + 30 = 55
character Frequency
f 45
Internal Node 55
Step 6: Extract two minimum frequency nodes. Add a new internal node with
frequency 45 + 55 = 100
character Frequency
Internal Node 100
Since the heap contains only one node, the algorithm stops here.
character code-word
f 0
c 100
d 101
a 1100
b 1101
e 111
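The tree construction just described maps directly onto Python's heapq module; the sketch
below (an illustration, not the text's reference implementation) rebuilds the code table for the
frequencies a=5, b=9, c=12, d=13, e=16, f=45. Labelling the left branch with 0 is an assumed
convention; with it, the resulting codewords coincide with the table above.

import heapq
from itertools import count

def huffman_codes(freqs):
    tick = count()                                # tie-breaker so nodes never compare
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)         # two minimum-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):               # internal node
            walk(node[0], code + "0")             # left branch -> 0
            walk(node[1], code + "1")             # right branch -> 1
        else:
            codes[node] = code or "0"             # leaf: store the codeword
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))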
Arithmetic Coding
It’s an entropy-based algorithm, first proposed in a paper from 1987 (Witten,
Ian H., Radford M. Neal, and John G. Cleary. “Arithmetic coding for data
compression.” Communications of the ACM 30.6 (1987): 520-540).
One reason why AE algorithms reduce the number of bits is that AE encodes
the entire message using a single number between 0.0 and 1.0. Each symbol
in the message takes a sub-interval in the 0-1 interval, corresponding to its
probability.
To calculate the probability of each symbol, a frequency table should be given
as an input to the algorithm. This table maps each character to its frequency.
The more frequent the symbol, the fewer bits it is assigned. As a
result, the total number of bits representing the entire message is reduced.
This is in contrast to schemes such as Huffman coding, which must assign a
whole number of bits to each symbol in the message.
Frequency table
To encode (= compress) a message using AE, you need these inputs:
The message you want to encode.
The frequency table of all possible symbols in the message.
The frequency table is usually calculated by analyzing different messages and
concluding by the number of times each symbol is repeated.
For example: assume that the messages to be encoded consist of just 3
symbols a, b, and c. By analyzing some previous messages (e.g. babb, cbab,
and bb), we get the frequencies of the 3 symbols:
a=2
b=7
c=1
Based on the frequency table, how do you encode a message like abc using
AE? Before you start encoding, you need to calculate a probability table out
of the frequency table.
Probability table
Using the frequency table, we can calculate the probability of occurrence of
each symbol. The probability of each symbol is calculated as its frequency divided by
the sum of all frequencies: p(a) = 2/10 = 0.2, p(b) = 7/10 = 0.7, and p(c) = 1/10 = 0.1.
These probabilities are laid out cumulatively on a line from 0.0 to 1.0.
The first symbol starts from the same start point of the line (0.0) and the last
symbol ends at the endpoint of the line (1.0).
A symbol C covers a percentage of the range that corresponds to its
probability. For example, the symbol b covers 70% of the line because its
probability is 0.7.
Restricting the interval
AE works by restricting the line interval, which starts from 0.0 to 1.0, through
some stages equal to the number of symbols in the message. In this example,
there are just 3 symbols in the message, so there are just 3 stages.
In each stage, the line’s interval is restricted according to the sub-interval of
the current symbol.
After all symbols are processed, AE returns a single double value encoding
the entire message.
Now, it’s time to look at the message to be encoded, abc, and work on the first
symbol a to restrict the line interval. Based on the previous figure, the
symbol a covers the interval from 0.0 to 0.2. This interval becomes the line’s
interval. In other words, the line interval changes from 0.0:1.0 to 0.0:0.2 as in
the next figure. Note that symbol a starts from 0.0 and the last symbol, c, ends
at 0.2.
To calculate the interval of each symbol, we’ll use the already mentioned
formula: the sub-interval of a symbol C ends at S + P(C)*R, where S is its start
value and R is the length of the current line interval.
Because the line’s range has changed, R changes to 0.2-0.0=0.2. Let’s
calculate the start and end values for each symbol:
For the first symbol, a, it starts from 0.0 to 0.0+(0.2)*0.2=0.04. The start
value is 0.0 because it’s the first symbol. The range is (0.0:0.04).
For the second symbol b, it starts from 0.04 to 0.04+(0.7)*0.2=0.18. The
range is (0.04:0.18).
The third symbol c starts from 0.18 to 0.18+(0.1)*0.2=0.2. The range is
(0.18:0.2).
At the end of the first stage, here is the figure that shows the start and end
values of each symbol’s interval.
Note that each symbol takes a portion of the line equal to its probability. For
example, the probability of the symbol b is 0.7 and thus it takes 70% of the
line’s interval.
The next figure summarizes the progress of AE at this point. Let’s start the
second stage, which just repeats what we did in the first stage.
Because the line’s interval changed to (0.04:0.18), R also changes. Its new
value is 0.18-0.04=0.14. Let’s calculate the start and end values for each
symbol:
For the first symbol, a, it starts from 0.04 to 0.04+(0.2)*0.14=0.068. The
range is (0.04:0.068).
For the second symbol, b, it starts from 0.068 to 0.068+(0.7)*0.14=0.166. The
range is (0.068:0.166).
The third symbol, c, starts from 0.166 to 0.166+(0.1)*0.14=0.18. The range is
(0.166:0.18).
The next figure summarizes the intervals of the 3 symbols. This marks the end
of the second stage.
To summarize the progress made up to this point, the next figure connects the
2 completed stages. Let’s move to the third stage.
The third and last symbol in the message is c. According to the latest stage,
this symbol falls within the range starting from 0.166 and ending at 0.18. c’s
current interval will be the next interval of the line according to the next
figure. R for this new line is 0.18-0.166=0.014.
On that line, similarly to previous 2 stages, here are the intervals of the 3
symbols:
symbol a starts from 0.166 to 0.166+(0.2)*0.014=0.1688. The range is (0.166
:0.1688).
symbol b starts from 0.1688 to 0.1688+(0.7)*0.014=0.1786. The range is
(0.1688:0.1786).
Symbol c starts from 0.1786 to 0.1786+(0.1)*0.014=0.18. The range is (0.1786:0.18).
There are no more symbols in the message. The next figure summarizes the
calculations from the initial interval (0.0:1.0) to the last interval (0.166:0.18).
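A minimal Python sketch of the encoding stages just described, with the probabilities
p(a)=0.2, p(b)=0.7, p(c)=0.1 hard-coded from the example (function names are illustrative):

def cumulative(probs):
    # symbol -> (cumulative start, cumulative end) on the 0.0-1.0 line
    table, start = {}, 0.0
    for sym, p in probs.items():
        table[sym] = (start, start + p)
        start += p
    return table

def encode(message, probs):
    low, high = 0.0, 1.0
    cum = cumulative(probs)
    for sym in message:
        r = high - low                     # current interval length R
        c_lo, c_hi = cum[sym]
        low, high = low + c_lo * r, low + c_hi * r
    return low, high                       # any value inside encodes the message

low, high = encode("abc", {"a": 0.2, "b": 0.7, "c": 0.1})
print(low, high, (low + high) / 2)         # about 0.166, 0.18 and 0.173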
This marks the end of the encoding process. Now let’s discuss the decoding
process.
Decoding
The inputs for decoding are:
The single value that encodes the message.
The frequency table. It should be identical to the one used for encoding.
The number of symbols in the original message.
In this example, the single value is 0.173, the midpoint of the final interval (0.166:0.18).
The value 0.173 falls within the first interval (0.0:0.2). Thus, the first symbol
in the encoded message is a. As a result, the line’s interval is restricted to the
interval that starts from 0.0 to 0.2.
The value 0.173 falls within the second interval (0.04:0.18). Thus, the second
symbol in the encoded message is b. In the next figure, the interval will be
restricted to the interval 0.04 to 0.18.
The value 0.173 falls within the third interval (0.166:0.18). This is why the
next symbol in the message is c.
Remember that the number of symbols in the original message is 3. After
decoding 3 symbols, the decoding process is completed. The decoded
message is abc.
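The decoding walk-through can be sketched the same way; this minimal function (again
illustrative, taking the message length as the third decoder input) recovers abc from 0.173:

def decode(value, length, probs):
    # Rebuild the cumulative table exactly as in the encoder sketch above.
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (start, start + p)
        start += p
    low, high = 0.0, 1.0
    message = ""
    for _ in range(length):                            # one stage per symbol
        r = high - low
        for sym, (c_lo, c_hi) in cum.items():
            if low + c_lo * r <= value < low + c_hi * r:
                message += sym                         # value falls in this sub-interval
                low, high = low + c_lo * r, low + c_hi * r
                break
    return message

print(decode(0.173, 3, {"a": 0.2, "b": 0.7, "c": 0.1}))   # abc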
Let’s look at another example.
Example #2
In this example, I won’t detail the steps like before.
There are 4 symbols available to build messages – a, b, c, and d. The
frequencies of these symbols are:
a=2
b=3
c=1
d=4
Based on the frequency table, here are the probabilities of the symbols:
p(a)=2/10=0.2
p(b)=3/10=0.3
p(c)=1/10=0.1
p(d)=4/10=0.4
The next figure shows the cumulative probabilities of the symbols where the
interval of each symbol is equal to its probability.
The message to be encoded in this example is bdab. In the first stage, the line’s
interval is restricted to b’s interval, which starts from 0.2 and ends at 0.5.
In the second stage, the next symbol in the message is d. As in the next figure,
the line’s interval will be restricted to the interval of d, which starts from 0.38
to 0.5.
Because the next symbol in the message is a, the interval of the third stage is
restricted to start from 0.38 and end at 0.404.
At the end of the last stage, the interval is from 0.3848 to 0.392. To encode
the entire message, any value within the interval can be used.
The average value is used in this example: (0.3848+0.392)/2 = 0.3884. This marks
the end of the encoding process.
The decoding process starts given the following 3 inputs: the single value 0.3884,
the frequency table, and the number of symbols in the original message (4).
Here is how the decoding process works:
1. By checking the initial interval that starts from 0.0 to 1.0, the
value 0.3884 falls within the interval of symbol b. Thus, the first
symbol in the message is b.
2. In the next interval that starts from 0.2 to 0.5, the value 0.3884 falls
within d‘s interval that starts from 0.38 to 0.5. Thus, the second
symbol is d.
3. The next interval is restricted to d‘s interval where the
value 0.3884 falls within the interval of symbol a. Thus, the third
symbol is a.
4. In the next interval, the value 0.3884 falls within the interval of
symbol b and thus the fourth and last symbol in the message is b.
Example: Encoding the sequence wuvw leads to the interval [0.5822, 0.5850],
the length of which is 0.0028 = p(w)p(u)p(v)p(w). In binary representation the
interval borders are 0.1001010111000₂ and 0.1001010100001₂.
When computing the information content of the sequence, we get −log2 0.0028 = 8.48 bits,
which is close to 9.
Advantages:
1. Adaptivity can be easily realized – encoder and decoder need to carry the
3. The restriction to integer bits/symbol values is overcome. Arithmetic coding
is used in JBIG-1 as the only coding engine and in the more recent lossy
standards like JPEG2000 and H.264 as lossless compression stage.
entire strings are replaced by tokens (i.e. codewords). The basic algorithmic
elements are lookup tables, and the corresponding operations are very fast.
LZ77 – Ziv and Lempel, 1977. Sends pairs of pointer and character; the pointer is
fixed-size and indicates a substring in the previous N characters.
LZR – Rodeh et al., 1981. Sends pairs of pointer and character; the pointer is
variable-size and indicates a substring anywhere in the previous characters.
LZSS – Bell, 1986. A flag bit distinguishes sending of a pointer or a character.
LZB – Bell, 1987. Same as LZSS, except that the pointer is variable-size.
LZH – Bell, 1987. Same as LZSS, except that Huffman coding is used for pointers
on a second pass.
LZ78 – Ziv and Lempel, 1978. Sends pairs of pointer and character; the pointer
indicates a previously parsed substring.
The name originates from Lempel and Ziv (1977), the algorithm described
here is a variant, LZSS (Storer/Szymanski) as of 1982. A typical feature is the
so-called “sliding window” consisting of two parts:
the history buffer (the dictionary) contains a block of data recently encoded
(some 1000 symbols) – and the look-ahead buffer which contains strings to be
encoded (10-20 symbols).
Example:
Text out <– ——DATA COMPRESSION—— COMPRESSES DATA—— <– Text in.
“DATA COMPRESSION” is the content of the history buffer,
“COMPRESSES DATA” is the look-ahead buffer. The actual coding used
subsequently is the QIC-122 standard for tape drives, using a history buffer of
2048 bytes.
To start with, we encode each symbol in “DATA COMPRESSION” as
0<ASCII character value>. “COMPRESS” is a string of length 9 which has been
found in the history buffer 12 symbols backwards: 1<offset=12><length=9>.
Subsequently we encode “E” and “S” again as 0<ASCII character value> and
“DATA” as a string of length 4 which has been found in the history buffer 28
positions backwards: 1<offset=28><length=4>.
We notice that
• the algorithm requires a start-up phase where the data in the dictionary has
to be encoded explicitly, which means that it is hardly suited for
compressing lots of small files.
• The algorithm is asymmetric – searching the history buffer in an efficient
manner is crucial. Intelligent data structures are required.
• Since data leaves the history buffer after some time, long term redundancy
cannot be exploited.
The Algorithm
To encode the sequence in the look-ahead buffer, the encoder moves a search
pointer back through the search buffer until it encounters a match to the first
symbol in the look-ahead buffer. The distance of the pointer from the look-
ahead buffer is called the offset. The encoder then examines the symbols
following the symbol at the pointer location to see if they match consecutive
symbols in the look-ahead buffer. The number of consecutive symbols in the
search buffer that match consecutive symbols in the look-ahead buffer is called
the length of the match. The encoder encodes the longest match as a triple
consisting of the offset, the match length, and the symbol that follows the match
in the look-ahead buffer. The reason for the third
element in the triple is to take care of the situation where no match for the
symbol in the look-ahead buffer can be found in the search buffer. In this
case, the offset and match-length values are set to 0, and the third element of
the triple is the code for the symbol itself.
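A toy Python sketch of this triple-producing loop (the brute-force search and the buffer sizes
are simplifications for illustration, not an efficient implementation):

def lz77_encode(data, search_size=2048, lookahead_size=16):
    # Emits (offset, length, next_symbol) triples as described above.
    triples, i = [], 0
    while i < len(data):
        start = max(0, i - search_size)
        best_off, best_len = 0, 0
        for off in range(1, i - start + 1):             # search the history buffer
            length = 0
            while (length < lookahead_size - 1 and i + length < len(data) - 1
                   and data[i - off + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        next_sym = data[i + best_len]                   # symbol following the match
        triples.append((best_off, best_len, next_sym))
        i += best_len + 1
    return triples

print(lz77_encode("abracadabra abracadabra"))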
The decoding process is basically a table look-up procedure and can be
done by reversing the encoding procedure. We employ the same buffer of
n characters, and then use its first (N−n) spaces to hold the previously
decoded characters, where N is the size of the window (the sum of the sizes of the
look-ahead buffer and the search buffer) used in the encoding process. If we
break up each triple that we encounter back into its components (position
offset o, match length l, and the last symbol of the incoming stream c), we
extract the match string from the buffer according to o, and thus obtain the
original content. What has been shifted out of the buffer is sent to the output.
The process continues until all code words have been decoded, and the
decoding is then complete.
________________________________________________________
LZ78 Algorithm
The Concept
To avoid the problems that occurred with LZ77, Ziv and Lempel developed a
different form of dictionary-based compression. LZ78 abandons the concept
of a text window. In LZ77, the dictionary of phrases was defined by a fixed-
length window of previously seen text. Under LZ78, the dictionary is a
potentially unlimited collection of previously seen phrases. LZ78-based
schemes work by entering phrases
into a ‘dictionary’ and then, when a repeat occurrence of a phrase is found,
outputting a token consisting of the dictionary index and the single character that
follows that phrase. Unlike LZ77, there is no need to pass the phrase length as
a parameter because the decoder already has this information.
Just as files have certain symbols that occur with more regularity than others,
many files (especially, but not restricted to, text files) also have certain strings
that are repeated frequently. For example, take the string " the " (including
the spaces). Including the instances of the spaces, the string takes 5 × 8 = 40
bits to encode. However, if we were to append this entire string to the list of
characters, at position 256, then at every subsequent occurrence of " the " we
could send the code 256 instead of the index sequence 32, 116, 104, 101, 32.
Since 256 cannot be represented by the standard 8-bit byte, this would require 9
bits – nevertheless, it is still a great improvement on the standard 40-bit
ASCII encoding.
Unlike LZ77, LZ78 does not have a ready-made window full of text (the
search windows) to use as a dictionary. On the contrary, it has to create a
dictionary ‘on the fly’: it creates a new phrase each time a token is output, and
it adds that phrase to the dictionary. After the phrase is appended, it will be
available to the encoder at any time in the future – not just for the next few
thousand characters as with LZ77.
The Algorithm
This is precisely the idea behind the LZW compression algorithm. It starts
with a 4K "dictionary" of all the single characters with indexes from 0 to 255
(the standard ASCII set). It then starts to expand the dictionary as it
processes the text, using the entries 256 to 4095 to refer to substrings. As a
new string is parsed, a new code is generated. Strings for parsing are formed
as shown in the following pseudocode:
set w = NIL
loop
    read a character K
    if wK exists in the dictionary
        w = wK
    else
        output the code for w
        add wK to the string table
        w = K
endloop
The LZW algorithm is a very common compression technique. This algorithm
is typically used in GIF, optionally in PDF and TIFF, and in Unix’s ‘compress’
command, among other uses. It is lossless, meaning no data is lost when
compressing. The algorithm is simple to implement and has the potential for
very high throughput in hardware implementations. It is the algorithm of the
widely used Unix file compression utility compress and is used in the GIF
image format.
The idea relies on recurring patterns to save data space. LZW is the
foremost technique for general-purpose data compression due to its simplicity
and versatility. It is the basis of many PC utilities that claim to “double the
capacity of your hard drive”.
When encoding begins the code table contains only the first 256
entries, with the remainder of the table being blanks. Compression
is achieved by using codes 256 through 4095 to represent
sequences of bytes.
As the encoding continues, LZW identifies repeated sequences in
the data and adds them to the code table.
Decoding is achieved by taking each code from the compressed file
and translating it through the code table to find what character or
characters it represents.
Example: ASCII code. Typically, every character is stored with 8 binary
bits, allowing up to 256 unique symbols for the data. This algorithm tries to
extend the library to 9 to 12 bits per character. The new unique symbols are
made up of combinations of symbols that occurred previously in the string.
It does not always compress well, especially with short, diverse strings. But
is good for compressing redundant data, and does not have to save the new
dictionary with the data: this method can both compress and uncompress
data.
Implementation
The idea of the compression algorithm is the following: as the input data is
being processed, a dictionary keeps a correspondence between the longest
encountered words and a list of code values. The words are replaced by their
corresponding codes and so the input file is compressed. Therefore, the
efficiency of the algorithm increases as the number of long, repetitive words
in the data increases.
LZW ENCODING
* PSEUDOCODE
1  initialize the string table with all single characters
2  P = first input character
3  WHILE not end of input stream
4      C = next input character
5      IF P + C is in the string table
6          P = P + C
7      ELSE
8          output the code for P
9          add P + C to the string table
10         P = C
11 END WHILE
12 output the code for P
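A runnable Python version of the encoding pseudocode above (a sketch: the table is allowed to
grow without the usual 4096-entry limit, and codes are returned as plain integers):

def lzw_encode(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}   # single-character strings
    next_code = 256
    p = b""                                       # current phrase P
    output = []
    for byte in data:
        c = bytes([byte])                         # next character C
        if p + c in table:
            p = p + c                             # extend the phrase
        else:
            output.append(table[p])               # output the code for P
            table[p + c] = next_code              # add P + C to the table
            next_code += 1
            p = c
    if p:
        output.append(table[p])                   # output the code for the last phrase
    return output

print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))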
LZW Decompression
The LZW decompressor creates the same string table during
decompression. It starts with the first 256 table entries initialized to single
characters. The string table is updated for each character in the input
stream, except the first one. Decoding is achieved by reading codes and
translating them through the code table being built.
1  initialize the string table with all single characters
2  OLD = first input code
3  output translation of OLD
4  WHILE not end of input stream
5      NEW = next input code
6      IF NEW is not in the string table
7          S = translation of OLD
8          S = S + C
9      ELSE
10         S = translation of NEW
11     output S
12     C = first character of S
13     add OLD + C to the string table
14     OLD = NEW
15 END WHILE
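A matching decoder sketch. It rebuilds the same string table on the fly and handles the case
where a received code is not yet in the table; the round-trip check at the end reuses the
lzw_encode sketch given earlier.

def lzw_decode(codes: list) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    old = codes[0]
    result = bytearray(table[old])
    for new in codes[1:]:
        if new in table:
            s = table[new]                       # S = translation of NEW
        else:
            s = table[old] + table[old][:1]      # special case: S = translation of OLD + C
        result += s
        c = s[:1]                                # C = first character of S
        table[next_code] = table[old] + c        # add OLD + C to the table
        next_code += 1
        old = new
    return bytes(result)

codes = lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")  # from the encoder sketch above
print(lzw_decode(codes))                         # b'TOBEORNOTTOBEORTOBEORNOT'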
Critical issues:
• How to efficiently encode a variable number of strings with variable length?
• How long can we grow the dictionary? What are the strategies when the
dictionary gets too large (start from scratch, keep the most common entries)?
In their original LZ77 algorithm, Lempel and Ziv proposed that all strings be
encoded as a length and offset, even strings with no match. Storer and
Szymanski observed that individual unmatched symbols or matched strings of
one or two symbols take up more space to encode than they do to leave uncoded.
For example, encoding a string from a dictionary of 4096 symbols, and allowing
for matches of up to 15 symbols, will require 16 bits to store an offset and a
length. A single character is only 8 bits. Using this method, encoding a single
character would require 16 bits.
Storer and Szymanski proposed preceding each symbol with a coded/uncoded
flag. Using the above example, it now takes 9 bits for each uncoded symbol and
17 bits for each encoded string.
Storer and Szymanski also observed that if you're not encoding strings of length
0 to M, then M may be used as an offset to the length of the match. Applying
their observation, instead of using 4 bits to represent match lengths of 0 to 15,
the same 4 bits may be used to represent match lengths of M to (15 + M).
Based on the discussion above, encoding input requires the following steps:
Step 1. Initialize the dictionary to a known value.
Step 2. Read an uncoded string that is the length of the maximum allowable
match.
Step 3. Search for the longest matching string in the dictionary.
Step 4. If a match is found greater than or equal to the minimum allowable
match length:
A. Write the encoded flag, then the offset and length to the encoded output.
B. Otherwise, write the uncoded flag and the first uncoded symbol to the
encoded output.
Step 5. Shift a copy of the symbols written to the encoded output from the
unencoded string to the dictionary.
Step 6. Read a number of symbols from the uncoded input equal to the number
of symbols written in Step 4.
Step 7. Repeat from Step 3, until the entire input has been encoded.
Decoding Strings
Step 1. Initialize the dictionary to a known value.
Step 2. Read the coded/uncoded flag.
Step 3. If the flag indicates an encoded string:
A. Read the encoded length and offset, then copy the specified number of
symbols from the dictionary to the decoded output.
B. Otherwise, read the next character and write it to the decoded output.
Step 4. Shift a copy of the symbols written to the decoded output into the
dictionary.
Step 5. Repeat from Step 2, until the entire input has been decoded.
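A toy Python sketch of these encoding and decoding loops (flag + offset/length for matches,
flag + literal otherwise); the window size, length limits and the linear search are simplifications,
not the article's actual implementation:

MIN_MATCH, MAX_MATCH, WINDOW = 3, 18, 4096        # values discussed in the text

def lzss_encode(data: bytes):
    tokens, i = [], 0
    while i < len(data):
        start = max(0, i - WINDOW)
        best_off, best_len = 0, 0
        for j in range(start, i):                 # brute-force search of the window
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= MIN_MATCH:
            tokens.append((1, best_off, best_len))   # coded: flag = 1
            i += best_len
        else:
            tokens.append((0, data[i]))              # uncoded: flag = 0
            i += 1
    return tokens

def lzss_decode(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == 1:
            _, off, length = t
            for _ in range(length):               # copy; overlapping copies are fine
                out.append(out[-off])
        else:
            out.append(t[1])
    return bytes(out)

data = b"these theses are the theses these are"
assert lzss_decode(lzss_encode(data)) == data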
Implementation Issues
When I set out to implement LZSS, I had the following goals in mind:
The code should be easy to follow, easy to debug, and easy to modify.
Encoded strings should not be longer than two bytes (16 bits). 16 bit
values are easy to deal with and keep encoded strings small.
Avoid reading/writing individual bits.
My version 0.2 and later code uses a bitfile library, so reading and writing
individual bits isn't a big deal. All of my versions use codes that are integral
bytes, and each of my newer versions are derived from my earlier versions so
The Dictionary
The time to search the entire dictionary
Keeping the goal of a 16 bit encoded string in mind, and the above requirement
for a 12 bit offset, I'm left with 4 bits to encode the string length. With 4 bits, I
can encode lengths of 0 through 15. However, as Storer and Szymanski
observed, it only takes 1 byte to write out characters that match dictionary strings
0 or 1 character long, and 2 bytes to write out characters that match dictionary
strings 2 characters long. Savings of one or more bytes per string don't occur
unless the string being encoded is at least three characters long.
By not encoding strings that offer less than one byte of savings, the minimum
encoded length is three characters. Since the minimum encoded length is three, I
am able to add three to the 4 bit value stored in the length field, resulting in
encoded string lengths of 3 to 18. So 18 is the maximum string length encoded
by this implementation.
technique, but it allowed my version 0.1 implementation to use standard one byte
reads and writes instead of one bit reads and writes.
This technique also makes the EOF clear. By processing only bytes, there are no
spare bits, so the EOF of the encoded data is actually the EOF of the encoded
file. Don't worry about it if I lost you on the EOF discussion, just relax, because
there's no need to handle it specially.
Others have successfully improved the time required to find matching strings
using the following techniques:
experiment with others as time allows. The rest of this section documents some
of what I have tried so far.
2.4.1 Quantization
When changing the description of a signal from using a large alphabet to
using a smaller alphabet of symbols, we apply quantization. Obviously, when
reversing the process, a perfect reconstruction is no longer possible since two
or more symbols of the larger alphabet are mapped to a single symbol of the
small alphabet. Often, quantization errors are the only actual errors caused by
lossy compression (apart from rounding errors in transforms).
Example:
suppose we have stored a grayscale image with 8bit/pixel (bpp)
precision, which means that for each pixel, we may choose among 256
uniform quantization is applied, except for the quantization bin around zero,
which has twice the size of the other bins (symmetrically around zero).
In case the occurrence probabilities of the symbols in the original signal
are not uniformly distributed and the distribution is known, it makes sense to
apply finer quantization (i.e. smaller bins) in the areas of the histogram where
more data values are accumulated. In this case we talk about “non- uniform
quantization”.
Note that following this strategy we maximize PSNR (the error is minimized
in areas where many values are found), but we do not care at all about
perceptual issues.
If the distribution of the occurrence probabilities of the symbols is not known,
the signal can be analyzed and based on the results, the quantization bins can
be set adaptively – termed “adaptive quantization”.
Since especially in the case of transform coding in many cases the
probabilities of symbol occurrence are quite stable for typical signals,
quantization can be modeled in advance and does not have to be adapted to
signal characteristics.
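A small sketch of uniform scalar quantization of 8-bit samples (the step size and the sample
values are arbitrary choices for illustration); the reconstruction error it introduces is exactly the
quantization error discussed above.

def quantize(samples, q=16):
    return [s // q for s in samples]              # map to the smaller alphabet

def dequantize(indices, q=16):
    return [i * q + q // 2 for i in indices]      # reconstruct at the bin centres

samples = [0, 7, 20, 100, 101, 130, 255]          # made-up 8-bit values
indices = quantize(samples)
reconstructed = dequantize(indices)
print(indices)                                    # 100 and 101 map to the same index
print([r - s for s, r in zip(samples, reconstructed)])   # errors bounded by q/2 = 8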
2.4.2 DPCM
Differential pulse code modulation (DPCM) is the general principle
behind differential / predictive coding.
Instead of considering the simple difference between adjacent signal values,
better compressed).
The predictor coefficients can be fixed for classes of known signals (e.g.
exploiting specific patterns like ridges in fingerprint images) or can be
adapted for each single signal (in the latter case, these need to be stored). The
optimal values for these coefficients, e.g. those minimizing the entropy of yi, can be
determined, of course.
Furthermore, for each distinct part of a signal dedicated coefficients can be
used and optimized (“local adaptive prediction”). By analogy, these
coefficients need to be stored for each signal part impacting on compression
ratio (obviously, there is a trade-off !).
Whether DPCM is lossy or lossless is determined by the way yi is
compressed. If, e.g., Huffman coding is applied to yi, we obtain an
overall lossless procedure. If yi is subject to quantization, the overall scheme is
a lossy one.
Some schemes use several predict/quantize combinations which are applied to the
data depending on local signal statistics. Lossless JPEG is such a scheme (without
applying quantization of course).
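A minimal DPCM sketch with the simplest fixed predictor (the previous reconstructed sample)
and a quantizer inside the prediction loop, so encoder and decoder stay synchronized; dropping
the quantizer and entropy coding the exact residuals would give the lossless variant (signal
values and step size are made up):

def dpcm_encode(samples, q=4):
    prev_rec, indices = 0, []
    for s in samples:
        y = s - prev_rec                  # prediction residual
        idx = round(y / q)                # quantize the residual
        indices.append(idx)
        prev_rec = prev_rec + idx * q     # reconstruction used for the next prediction
    return indices

def dpcm_decode(indices, q=4):
    prev_rec, out = 0, []
    for idx in indices:
        prev_rec = prev_rec + idx * q
        out.append(prev_rec)
    return out

signal = [100, 102, 105, 105, 103, 100, 98]
print(dpcm_decode(dpcm_encode(signal)))   # close to the input, error at most q/2 = 2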
2.4.3 Transform Coding
Most standardized lossy media compression schemes are based on transform
coding techniques. Typically, transform based coding consists of three stages.
1. Transformation
2. Quantization: a variety of techniques (as discussed before) can be applied to
map the transform coefficients to a small alphabet, the size of the latter
being selected corresponding to the target bitrate. Most rate-control
procedures are applied in this stage.
3. Lossless encoding of coefficients: a variety of techniques (as discussed
before) is applied to encode the quantized transform coefficients close to
the entropy bound. Typically, a combination of run length (zero-runs!) and
Huffman coding (for the older standards like JPEG, MPEG-1,2,4) or
arithmetic coding (for the more recent standards like JPEG2000 and H.264) is
used.
The Transform
The aim of transformation is to change the data of a signal in a way that it is
better suited for subsequent compression. In most cases, so-called “integral-
transforms” are used, which are based (in their continuous form) on applying
integral operators. When applying transforms, the concept of projecting the
data onto orthogonal basis functions is used.
Motivation: Vectors in two-dimensional space can be represented by a set of
orthogonal basis-vectors (“orthogonal basis”, orthogonal requires the inner
product between vectors to be 0).
{(1, 0), (0, 1)} are orthogonal basis-vectors; α and β are called “coefficients”,
which indicate a weight for each basis-vector, so that the vector (x, y) can be
represented as (x, y) = α·(1, 0) + β·(0, 1). The property of orthogonality guarantees
that a minimal number of basis-vectors suffices (obviously, this will be important
for compression).
This concept can be transferred by analogy to represent functions or signals
where M is the number of pairs and x̄j is the mean of xj computed over all
pairs. A simple technique to apply compression is to replace one of the two
components in each pair by the corresponding mean x̄j. The MSE of this
compression approach is exactly σj², the variance. No matter which of the two
coordinates we replace by the mean, the variance in both cases is large and
therefore the error of this compression approach is large.
compression scheme to Y2 (by replacing its components by the mean Y2) since
the variance and consequently the error will be small.
Examples of popular transforms (desired properties are: decorrelation of
data, energy compaction, basis functions not dependent on data, fast
algorithms):
4. Wavelet and subband transform: uses basis functions with 2 parameters
instead of the single frequency parameter of the DFT and DCT: a sort of
scale and time / position.
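To make the three-stage pipeline concrete, here is a small numpy sketch using an orthonormal
DCT-II built directly from its definition; the block length, the quantization step and the sample
block are arbitrary choices for illustration, not part of any particular standard.

import numpy as np

def dct_matrix(n):
    # Rows are the orthonormal DCT-II basis functions.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

n, q = 8, 10.0
block = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=float)   # sample signal block

C = dct_matrix(n)
coeffs = C @ block                    # 1. transformation (energy compaction)
indices = np.round(coeffs / q)        # 2. quantization (the lossy step)
# 3. the indices would now be entropy coded (run-length / Huffman / arithmetic)
reconstructed = C.T @ (indices * q)   # dequantize and apply the inverse transform
print(np.round(coeffs, 1))
print(np.round(reconstructed, 1))     # close to the original block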
Entropy Coding
Most schemes use either a combination of Huffman and run-length coding
(mostly older standards like JPEG, MPEG-1,2,4) or arithmetic coding (more
recent schemes like JPEG2000 and H.264 / CABAC).
Like in the lossless case, we employ a dictionary and represent parts of the signal
by entries in the dictionary. I.e., we select a part of the signal of size n and
replace it by the entry in the dictionary which has the same size and is most similar
to it. Of course, here too the question arises of how to determine the measure of
similarity or quality. Instead of the signal part we only store the index of the
entry in the dictionary. Encoder and decoder need to be synchronized with
respect to the dictionary entries.
Important techniques of this class are vector quantization (applied directly in
the signal domain and not to transform coefficients) and fractal compression.
1. Toy Language A random variable X is distributed according to the point
probabilities in the following table:
x       a    b    c    d    e
PX(x)  1/3  1/4  1/6  1/6  1/12
2. Dice Throws Let X be a random variable distributed uniformly on the
set {1, 2, 3, 4, 5, 6}.
3. Maximum Equivocation Suppose X and Y are random variables ranging
over n different values.
4. Typical Sequences Suppose we have a bent coin with PX (1) = 1/3. We
flip this coin five times, producing a random sequence S = (X1, . . . , X5).
(a) Compute the entropy H(X) of each individual coin flip, and the
entropy H(S) of the sequence.
(b) Draw up a table of the surprisal values −log PS(s) and their probabilities.
5. High-Probability Sets (CT, Ex. 3.5) Suppose that X1, X2, . . . , Xn are
independent and identically distributed random variables with a common
entropy H(X). Let PS be the probability distribution of the sequence
S = (X1, . . . , Xn), and define the set C(τ ) as
(a) For a fixed τ , what’s the largest number of tuples such a set can
contain?
(b) Sketch a graph of the probability P(S ∈ C(τ)),