A Visual Introduction To Information Theory
1 Introduction
In the 1940s, Claude Shannon [1] showed that a precise mathematical definition of information arises
directly from the axioms of probability. Since then, information theory has served as a foundation
for the digital world we live in today by defining the limits of data compression and the reliable
transmission of data across noisy networks. In addition, the mathematical tools it provides have
found numerous applications in diverse areas such as statistics, machine learning, cryptography,
quantum computing, biology, and many others.
Here we introduce the foundations of information theory, with an emphasis on intuition. This
is meant to serve as a minimal introduction to key ideas. A more comprehensive treatment can be
found in the excellent textbooks [2, 3] as well as Shannon’s original formulation [1].
Main ideas Outside of the context of information theory, most of us already have a vague intuitive
notion of what “information” is, which is usually something close to “knowledge obtained about
something unknown.” One of the main advances of information theory was to formalize this intuition
into something mathematically precise. Shannon himself once said that “information can be treated
very much like a physical quantity such as mass or energy.”
Shannon was able to make the idea of information mathematically concrete by defining it in
terms of probability. If a sender is transmitting information about something to a receiver, that
information is describing something that would otherwise be unpredictable. In other words, from
the receiver’s perspective, it is random. Information theory shows that the randomness of the
message, as characterized by the probability distribution over potential messages, is in fact the only
thing that matters for figuring out how to transmit information. The actual content of the messages
is irrelevant.
This fact enables us to consider the fundamental concepts in information theory using a generic
source of random events: drawing colored marbles from an urn, with each marble representing a
possible message from the set of all messages. The mathematics developed on this example can be
directly applied to other application areas of information theory. For example, the way we describe
the randomness of a sequence of colored marbles is the same as how we would describe the randomness of
a string of letters in English text, or a sequence of pixels in images taken by the Hubble telescope.
The fundamental unit of information content is the bit. It is worth noting that the word “bit”
also describes a container that can hold information: a digit that can be either 0 or 1. The actual
information within the container must be defined with respect to some probability distribution, and
it could hold less than 1 bit of information in the same way that a 1 liter bottle can be filled with only
1/2 liter of water. Throughout this paper, we try to avoid this ambiguity by choosing examples
where each bit container is “full” with one bit of information. So it is safe to assume that “bit” here
refers to the units of information, unless explicitly stated otherwise.
There are two key problems in information theory that will come up throughout the sections
below. The first is source coding, which concerns data compression: Given a source of random
events, what are the limits on how succinctly we can record the outcome of a random sequence, on
average? This can be further subdivided into the cases where the original sequence can be perfectly
reconstructed upon decompression (lossless compression) or the case where some distortion from
the original sequence can be tolerated to achieve even smaller data size (lossy compression).
The second problem is channel coding, which concerns data transmission: Given a source
of random events, and an imperfect system for transmitting information (a noisy channel), how
can we encode a sequence of such events such that they will be robust to errors introduced by the
channel? Again, we can subdivide into perfect data transmission, and data transmission with errors.
1.1 Notation
(Figure 1: a) Independent random sampling of colored marbles, drawn at random with replacement, along with the probability of each outcome. c) The information content of different propositions: each proposition rules out some outcomes and leaves others possible, and the information gained depends on the probability that the proposition is true.)
Formally, we have a finite, discrete set of independent and identically distributed (IID) events
and a random variable that assigns a probability to each element of the set. The intuitions developed
on this simplified scenario will later be generalized to the non IID case (Sec. 2.8) and continuous
variables and probability density functions (Sec. 2.9).
More common outcomes provide us with less information, and rarer outcomes provide us with
more. To understand why, suppose someone has drawn two colored marbles from an urn, but we
don’t know what they’ve drawn. We’ll denote these two random events with X1 and X2 . X1 and X2
have a joint probability distribution pX1 ,X2 , which tells us the probability of each of the 16 possible
two-color sequences (Fig. 1b). Suppose we then learn some facts about what was drawn (but not
the exact outcome). For example, we might learn that neither marble is blue, or the first marble
isn’t green, or the two marbles are different colors (Fig. 1c).
Learning each fact allows us to rule out possibilities for the outcome of X1 , X2 . The less often the
fact is true, the more possible outcomes we can rule out. For example, knowing the two marbles are
the same color rules out 12 of the 16 possible outcomes, which correspond to 21/32 of the probability mass.
The more possibilities we can rule out, the more information we’ve gained. Thus, rarer outcomes
contain more information. The more surprised we are by the outcome, the more information it has
provided us.
Furthermore, we can calculate exactly how much information that outcome has provided based
on its probability. Each time we can rule out half of the probability mass, we have gained exactly 1
bit of information. In the case where we learn that neither marble is blue, we’re left with only 1/4 of
the probability mass, and we have gained 2 bits of information, since we have halved the probability
mass twice. The amount of information can be calculated from the probability of the outcome by:
\log_2 \frac{1}{p(x)}
The conventional choice of a base-2 logarithm means that the units will be bits (The 2 will often
be omitted in subsequent sections).
Since we can calculate the information provided by each outcome, and we know each outcome’s
probability, we can compute the probability-weighted average amount of information provided by a
random event, otherwise known as the entropy. Entropy is denoted H(X) and is defined mathe-
matically as:
H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}
Entropy can also be interpreted as how surprising the random event is on average. Events that
are more random will tend to be more surprising on average, and observing their outcome will yield
more information, and reduce our uncertainty more. On the flip side, higher entropy also means that
there is more uncertainty in the random event to begin with, so we must acquire more information
to be certain about its outcome compared to an event with lower entropy.
It might seem that, since lower probability outcomes contain more information, that random
events containing many extremely rare outcomes will have the highest entropy. In fact, the opposite
is true: random events with probability spread as evenly as possible over their possible outcomes
will have the highest entropy. Mathematically, this happens because the gains in information that
come from an event becoming increasingly rare are outweighed by the decrease in the probability
that the highly informative outcome will occur, due to the logarithm in the formula for information.
Section 2.3 will discuss maximum entropy distributions in more detail.
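As a quick numerical check of these ideas, the short Python sketch below (the helper function and example distributions are our own, not from the text) computes the entropy of a few distributions over four colors, including the uniform and skewed distributions that appear in Figure 2:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the probability-weighted average of log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Uniform distribution over four colors (Fig. 2, top): 2 bits.
print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0

# Skewed distribution (Fig. 2, bottom): 1.75 bits.
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75

# Any distribution that is less evenly spread has lower entropy.
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.36 bits
```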
2.2 Entropy and data compression
A random variable’s entropy is connected to many other interesting properties. For example, it
quantifies the limit of how much (lossless) compression of data can be achieved, a problem in information
theory known as source coding.
Figure 2: Entropy can be interpreted as the average length of the shortest encoding of a sequence
of random events, which here are repeated draws (with replacement) of colored marbles from an
urn. (Top) With equal probability of each color, the shortest binary recording assigns a two-digit
binary string code to each event. The entropy is the average number of bits per event of a typical
sequence: 2 bits. (Bottom) When some colors are more likely than others, the more probable ones
can be recorded as shorter binary strings to save space. This gives a lower entropy: 1.75 bits.
To demonstrate this, we return to the problem of drawing colored marbles from an urn. Now
we’re going to draw a sequence of marbles, and record the outcome of each draw using binary
codewords, such as 1, 10, etc. Our goal is to choose an encoding scheme that will (on average)
yield the shortest possible binary description of our sequence without introducing any ambiguity as
to what the outcomes of the draws were. This is a lossless compression problem, and the shortest
possible length is determined by the entropy of the random draw.
We want to pick an encoding scheme that matches outcomes (e.g. a green marble drawn) to
bit strings (e.g. 01), such that the length of the string for a series of random draws is as short as
possible. How short we can make this encoding, while still being able to correctly decode our binary
encoding into the original sequence, is closely related to the fundamental information content of the
sequence: sequences that require longer encodings (on average) contain more information. Thus, the
length of the optimal binary recording scheme is determined by the random variable’s entropy–the
average information present in a single outcome.
The amount of information and the optimal recording scheme both depend on the probabilities of
each outcome. For example, consider a random variable X on the outcome space X with associated
probabilities pX (Fig. 2, top):
X = {blue, gray, yellow, green}
pX = { 1/4, 1/4, 1/4, 1/4 }
The shortest lossless encoding scheme given these probabilities is a unique 2 digit binary code-
word for each outcome: 00, 01, 10, 11, because the average information in each random event (its
entropy) is 2 bits.
Alternatively, consider the case where the outcomes are not equally probable (Fig. 2, bottom),
but instead:

pX = { 1/2, 1/4, 1/8, 1/8 }

In this case, the shortest lossless encoding scheme assigns shorter codewords to the more probable
colors (for example 0, 10, 110, and 111), and the average number of bits per event again equals the
entropy: 1.75 bits.
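To make the connection between probabilities, codewords, and entropy concrete, here is a minimal sketch. The color probabilities follow the discussion of Figure 2 (blue 1/2, gray 1/4, yellow and green 1/8 each), and the particular prefix code 0, 10, 110, 111 is one valid choice picked for illustration:

```python
import math

# Outcome probabilities for the skewed urn (Fig. 2, bottom)
probs = {"blue": 1/2, "gray": 1/4, "yellow": 1/8, "green": 1/8}

# One possible prefix-free code matched to these probabilities
code = {"blue": "0", "gray": "10", "yellow": "110", "green": "111"}

entropy = sum(p * math.log2(1 / p) for p in probs.values())
avg_len = sum(probs[c] * len(code[c]) for c in probs)
print(f"entropy = {entropy:.2f} bits, average code length = {avg_len:.2f} bits")  # both 1.75

# Encoding a sequence is unambiguous because no codeword is a prefix of another.
seq = ["blue", "gray", "blue", "green"]
print("".join(code[c] for c in seq))  # 0100111
```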
2.3 Redundancy
The redundancy of a random event depends on how uncertain its outcome is, compared to how
uncertain a variable on the same probability space could be. The latter is given by the entropy of the maximum
entropy distribution on the set of possible outcomes X , denoted Hmax (X ). The maximum entropy distri-
bution is always the one in which the probability of all events is equal (Fig. 2, top). This makes
sense because the outcome is the most uncertain, or equivalently the optimal binary recording of
the events has the longest possible length.
By setting the probability of all events equal in a particular probability space, we find that
maximum entropy is equal to the log number of possible states:
Hmax = log2 |X |
Where |X | is the number of elements in the set X . The larger the number of elements in X , the
more uncertain we can potentially be about what value it will take.
The redundancy W is the difference between this maximum entropy and the actual entropy of
the random variable:

W = Hmax (X ) − H(X)
2.4 General data compression limits limits
In order to make the example in Figure 2 illustrate the concept of entropy, we’ve made two conve-
nient choices that won’t necessarily be true in general.
First, we’ve chosen the probabilities to be all of the form 1/2^k where k is a positive integer. This
makes the information in each event equal to an integer number of bits, which means we can fully
compress the sequence by encoding each event individually. If we hadn’t done this, the information
content of a single event might equal something like 3/4 bits, and encoding it in a one digit binary string
would waste 1/4 bit of space and yield a redundant binary representation. This technicality is why
the entropy is in general less than or equal to the expected length of the shortest binary encoding,
rather than always exactly equal. We’ve specifically picked the case where it is equal for illustrative
purposes. It is possible to get closer to this bound in the more general case by encoding sequences
of multiple events together into a single binary string, otherwise known as block encoding.
The second simplification we’ve made is that the random sequences of colors we’ve chosen are
made up of exactly the expected number of each color. For example, the bottom sequence is half
(8 of the 16 events) blue, a quarter (4 of the 16 events) gray, etc. In reality, different random sequences might yield
binary strings with very different lengths. For example, a sequence of all blue (which happens to be
the most probable sequence) would have a 16 bit binary encoding, while a string of all greens (one
of the least probable sequences) would yield a 48 bit binary encoding. The sequences shown in Figure
2 are called typical sequences: for a small tolerance ε, a length-N sequence is typical if its average
information content per event is within ε of the entropy,

H(X) - \epsilon \leq -\frac{1}{N} \log p(x_1, x_2, \ldots, x_N) \leq H(X) + \epsilon
Does the choice of ε matter? For the sequences in Figure 2, we’ve conveniently avoided this
question since the sequences shown have an information content exactly equal to N H(X). However,
as N → ∞, we would see essentially the same behavior, because all of the probability mass of
the distribution over sequences concentrates on the typical set defined with any positive value of
ε (Fig. 3). Furthermore, as N → ∞, all typical sequences will occur with approximately the
same probability: ≈ 2^{−N H(X)}. Since the total probability is 1, and each typical sequence occurs
with probability ≈ 2^{−N H(X)}, there must be ≈ 2^{N H(X)} typical sequences. This result is known as
the asymptotic equipartition property (AEP). It is a direct consequence of the Law of Large
Numbers, and it is used to prove many important theorems in information theory.
To summarize, the AEP says that with infinite sequence length, we can ignore the vanishingly
small probability that falls on non-typical sequences. As a result, we can focus on the typical set of
≈ 2^{N H(X)} sequences, which each occur with probability ≈ 2^{−N H(X)}.
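The concentration described by the AEP can also be seen in a small simulation (a sketch of our own; the sampling parameters are arbitrary): drawing length-N sequences from the skewed urn distribution and checking how the per-event information content clusters around H(X) = 1.75 bits as N grows.

```python
import math
import random

probs = [1/2, 1/4, 1/8, 1/8]                    # blue, gray, yellow, green
H = sum(p * math.log2(1 / p) for p in probs)    # 1.75 bits

def per_event_information(N, rng):
    """Draw an IID sequence of length N and return -(1/N) log2 p(sequence)."""
    draws = rng.choices(range(4), weights=probs, k=N)
    return sum(math.log2(1 / probs[i]) for i in draws) / N

rng = random.Random(0)
for N in (10, 100, 10_000):
    samples = [per_event_information(N, rng) for _ in range(1000)]
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    print(f"N={N:>6}: mean={mean:.3f} bits per event, spread={spread:.3f}")
# The mean stays near H = 1.75 bits and the spread shrinks as N grows,
# i.e. nearly every sampled sequence is typical for a small tolerance.
```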
A data compression scheme that encoded only sequences in the typical set and discarded all
other sequences would require log2 2^{N H(X)} = N H(X) bits to assign a unique binary string to each
element of the typical set. Since there is negligible probability mass outside the typical set, the
scheme would be lossless in the limit as N → ∞. Thus, lossless compression of a length N
sequence requires N H(X) bits. Using fewer bits would result in different typical sequences mapping
to the same binary string. Using more would waste bits on non-typical sequences that occur with
nearly zero probability.
Interestingly, the typical set does not include the most probable individual sequences. For example,
in Figure 3b, the sequence for each N that consists of N blues in a row is the most probable
sequence, yet it would be omitted from the typical set for a small value of ε. For any finite N you could
improve the compression algorithm described above by taking the binary string used for the least
probable typical sequence and instead using it for the sequence of all blue. This demonstrates an
important point about the typical set. It is useful for theoretical work, because the number of sequences
it contains can be counted. However, in practical situations (i.e. N < ∞), better compression can
be achieved by designing for the set of most probable sequences. As N → ∞, this becomes less and
less true, because the most probable sequences contain vanishingly small total probability mass.
(Figure 3: b) Probability concentrates onto “typical sequences”, each with probability ≈ 2^{−N H(X)}. Panels show the number of sequences and their probability-weighted counts for N = 1, 2, 3, 7, 13, for a source with entropy 1.75 bits.)
The relationships between entropy, redundancy, probability distributions, and typical sequences
are shown in Figure 4.
(Figure 4: the relationship between a distribution’s probability mass, its typical sequences, and its entropy and redundancy relative to the maximum entropy, in bits.)
Consider the scenario of drawing not colored marbles, but colored objects from an urn. We have
the same four possible colors, blue, green, yellow, and gray, and now there are also four possible
object shapes: ■, ▲, ⋆, and •. Our random draws will now be shape-color combinations, pairing one
of the shapes with one of the colors. X will still represent the object’s color, and now we’ll also use Y to represent its
shape. The random event of drawing a particular shape with a particular color is (X, Y), and its value
is governed by the joint probability distribution over X × Y, the outcome space of all shape/color
combinations.
Suppose that we do not observe which object was drawn, but only a black and white photograph
of the object in which all colors look the same. What can we say about the object’s color (X), having
only observed its shape Y? In other words, how much information does its shape convey about its
color? The answer to this question is determined by the mutual information between X and Y.
Mutual information, denoted I(X; Y), quantifies how well you can discriminate alternative pos-
sibilities of an unknown random event X based on observing some other random event Y. In other
words, it tells you how much of the information gained from observing Y = N will reduce your
uncertainty about the object’s color, X. The minimum possible value is I(X; Y) = 0, which means
that X and Y are independent–observing shape will not reduce your uncertainty about color at
all. The maximum possible value is the lesser of H(X) and H(Y), because a random event can
neither convey more information about another event than it itself contains, nor can it convey more
information about the other event than the other event contains. Mutual information is symmetric:
I(X; Y) = I(Y; X).
Another common way of quantifying the dependence between two variables is through correlation.
In some sense, mutual information can be thought of as a more general version of correlation, because
unlike correlation, which usually captures only linear relationships between the variables, mutual
information captures any sort of statistical dependency between them.
To compute mutual information, we need to know the joint probability distribution of the two
variables X and Y. That is, we need to know which shapes come in which colors. For a particular
draw, say a blue ⋆, we can calculate the amount of information that the color and shape convey about
one another using the point-wise mutual information:
\log \frac{p(x, y)}{p(x)p(y)} = \log \frac{p(x \mid y)}{p(x)} = \log p(x \mid y) - \log p(x) = \log \frac{1}{p(x)} - \log \frac{1}{p(x \mid y)}
The final expression can be interpreted as the surprise of the outcome x minus the surprise of x when
y is already known. If knowledge that Y = y makes X = x less surprising, then 1/p(x | y) will decrease,
reflecting the fact that y tells us something about the value of x. For example, if there are many
different colors of objects, but all ⋆’s are blue, then learning the object is a ⋆ will eliminate all
uncertainty about its color: p(blue | ⋆) = 1. This means that in this case all of the information
provided by the outcome X = blue is shared by Y = ⋆.
By taking a probability-weighted average of the point-wise mutual information over all possibil-
ities, we arrive at the mutual information, which represents the average amount of information
provided from the outcome of one random event that is relevant to another. Mathematically:
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \qquad (1)
Just as in Figure 1, when one bit of information meant we could rule out half of the probability
mass, for each bit of mutual information between X and Y, we can eliminate half of the probability
mass of one random event after observing the outcome of the other. This is most easily seen if
all outcomes are equally probable, because in this case eliminating half of the probability mass
corresponds to eliminating half of the outcomes.
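Equation (1) is easy to evaluate once the joint distribution is written out as a table. The sketch below is our own; the three example joint tables are hypothetical stand-ins chosen to mimic the three rows of Figure 5 rather than to reproduce them exactly.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table with colors as rows, shapes as columns."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)        # marginal over colors
    py = joint.sum(axis=0, keepdims=True)        # marginal over shapes
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Each shape comes in exactly one color: 2 bits of shared information.
perfect = np.eye(4) / 4
# Each shape narrows the color down to two equally likely options: 1 bit.
partial = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]]) / 8
# Shape and color are independent: 0 bits.
independent = np.full((4, 4), 1 / 16)

for name, joint in [("perfect", perfect), ("partial", partial), ("independent", independent)]:
    print(name, round(mutual_information(joint), 3))
```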
Figure 5 shows three different scenarios with different amounts of mutual information, and four
different ways of visualizing the relationships between the two variables. When I(X; Y) = 0 bits,
observing the object’s shape does not allow us to eliminate any possibilities of the object’s color. A
mutual information of 0 between two random variables is equivalent to those two random variables
being independent. Alternatively, when I(X; Y) = 1 bit, after observing the shape only, we can
eliminate two of the four possible colors. Finally, when I(X; Y) = 2 bits, we can eliminate three of the
four possible colors and know exactly what color it is. In this case, we have gained all 2 bits of
information that were present in X.
Since we can at most gain 2 bits from observing 1 of 4 possible shapes (i.e. Hmax (Y) = 2), if
there were more than 2 bits of information in X, we wouldn’t be able to determine color exactly.
For example, if there were 8 possible colors that were all equally probable, we could not uniquely
identify all possibilities based on 4 possible shapes.
Figure 5: Mutual information describes the relationship between two random variables. Here
those random variables are the shape and color of an object drawn at random. The joint distribu-
tion of shape and color determines the amount of mutual information. (Top row) 2 bits of mutual
information, (middle row) 1 bit of mutual information, (bottom) 0 bits of mutual information.
a) The joint distribution of shape and color, with uniform probability over all possible shape/color
combinations shown. b) Mapping view showing the colors, possible shape color combinations, pos-
sible shapes, and the possibilities for colors that can be inferred from shape alone. Line thickness
shows strength of the relationship. c) Compact view that omits the joint distribution and color
inference. d) More of the entropy of the two events is shared with greater mutual information.
2.6.1 Conditional entropy
Mutual information quantifies how much knowing the outcome of one random event reduces un-
certainty about a second random event, while conditional entropy quantifies how much uncertainty
remains about the second random event. Adding the two together gives the total uncertainty of the
second random event on its own.
Once again, we start with an example in the point-wise case, where the information gained by
learning the outcome of a random event X after already knowing the outcome of a second random
event Y = y is:
\log \frac{1}{p(x \mid y)}
Or, in the specific case of learning an object is green after knowing it is a •:
\log \frac{1}{p(\text{green} \mid \bullet)}
The surprise of finding a green • depends on how many different colors of • are in our set of
objects. In the same way we calculated entropy as a probability-weighted average of all possible
outcomes of a random event, we can similarly compute an average uncertainty of the object’s color,
given that the object is a • (Fig. 6a). This yields the point-wise conditional entropy:
H(X \mid \bullet) = \sum_{x \in \{green, blue, yellow, gray\}} p(x \mid \bullet) \log \frac{1}{p(x \mid \bullet)} \qquad (2)

Or, for a general conditioning outcome y:

H(X \mid y) = \sum_{x \in \mathcal{X}} p(x \mid y) \log \frac{1}{p(x \mid y)}
Looking at the joint distribution in the middle row of Figure 5a, knowing that the object is a
• leaves three possible colors: green, yellow, or blue. Since we’ve specified that these occur with
equal probability, the point-wise conditional entropy will be log2 3 ≈ 1.58 bits.
Alternatively, if we looked at the conditional entropy of color given that the object is a ⋆, the
only possible color would be green. This means the point-wise conditional entropy is 0–we know
the value of X exactly given Y.
Next, we might want to consider the average uncertainty of color, conditional not specifically on
•, but averaged over all possible shapes. To do so, we’ll sum equation (2) over all possible states in
Y (i.e. ⋆, ■, ▲, and •), weighting by the probability of each shape. The generic form of this is:
H(X \mid Y) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x \mid y) \log \frac{1}{p(x \mid y)}
With some slight algebraic manipulation, we get to the traditional equation for conditional
entropy:
H(X \mid Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log \frac{1}{p(x \mid y)}
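These sums are straightforward to compute from a joint probability table. The sketch below uses a hypothetical joint distribution of our own, constructed so that a • leaves three equally likely colors while every other shape determines its color exactly; it is not the exact distribution of Figure 5.

```python
import numpy as np

def conditional_entropy(joint):
    """H(X | Y) in bits from a joint table with colors as rows and shapes as columns."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)                       # marginal distribution over shapes
    total = 0.0
    for j, pyj in enumerate(p_y):
        if pyj == 0:
            continue
        p_x_given_y = joint[:, j] / pyj           # conditional column p(x | y = j)
        nz = p_x_given_y > 0
        total += pyj * np.sum(p_x_given_y[nz] * np.log2(1 / p_x_given_y[nz]))
    return float(total)

# Columns: star, square, triangle, bullet. Rows: green, blue, yellow, gray.
# The first three shapes each come in one color; the bullet comes in three.
joint = np.array([[3, 0, 0, 1],
                  [0, 3, 0, 1],
                  [0, 0, 3, 1],
                  [0, 0, 0, 0]]) / 12

# Point-wise conditional entropy of color given the bullet shape (last column):
p_bullet = joint[:, 3] / joint[:, 3].sum()
nz = p_bullet > 0
print(np.sum(p_bullet[nz] * np.log2(1 / p_bullet[nz])))   # log2(3) ≈ 1.585 bits

print(conditional_entropy(joint))                          # weighted average ≈ 0.396 bits
```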
Figure 6: Point-wise conditional entropy (Top) The joint and marginal distributions of shape
and color. (Bottom) The compact mapping view between shape and color. a) The conditional
entropy of color given that the shape is • is log2 3 ≈ 1.58 bits since there are three equally probable
possibilities (in magenta box on distribution view) b) The conditional entropy of shape given a blue
object is 1 bit since there are two equally probable shapes.
When H(X | Y) = 0, after observing the outcome of Y we know exactly what the outcome of X
is – there is no remaining uncertainty. Alternatively, if H(X | Y) > 0 then the outcome of X is not
fully known having observed Y, at least for some outcomes of Y.
A closely related quantity is the joint entropy, the total uncertainty of the two random events considered together:

H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}
For example, p(x, y) in the example shown in Figure 5 would represent the probability of a
particular shape-color combination.
In the case that X and Y are independent, observing the value of Y provides no information
about X. Thus, H(X, Y) = H(X) + H(Y). This can be proven algebraically by substituting
p(x, y) = p(x)p(y) into the joint entropy formula, as shown in Appendix B.1. (Independence can be
expressed equivalently as p(y) = p(y | x) or as p(x, y) = p(x)p(y).)
When X and Y are dependent, the joint entropy will be less than the sum of the individual
entropies, since there will be nonzero mutual information. For example, in the top row of Figure 5,
H(X) = 2, H(Y) = 2, but H(X, Y) = 2 also, since the 2 bits are mutual to both random variables.
Figure 7: The relationships between entropy, joint entropy, conditional entropy and
mutual information The width of each bar represents its size (in bits). Adapted from [2], p140
Mutual information can be written in three different ways in terms of the other entropies:

I(X; Y) = H(X) - H(X \mid Y)

I(X; Y) = H(Y) - H(Y \mid X)

I(X; Y) = H(X) + H(Y) - H(X, Y)
An example of how I(X; Y) can be manipulated into these other forms can be found in Appendix
B.2.
The first two can be interpreted as the average uncertainty in one random event after subtracting
the uncertainty that remains having observed a different random event. The third can be interpreted
as the total uncertainty of both random events in isolation minus the uncertainty of both
considered together.
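These identities can be checked numerically for any joint distribution. In the sketch below (our own helper code), a random joint table is drawn, the conditional entropies are computed directly from their definitions, and all three expressions are confirmed to give the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 4))
joint /= joint.sum()                     # a random joint distribution p(x, y)

def H(p):
    """Entropy in bits of a probability array."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

px = joint.sum(axis=1)                   # p(x)
py = joint.sum(axis=0)                   # p(y)

# Conditional entropies computed directly from the definition
H_X_given_Y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(joint.shape[1]))
H_Y_given_X = sum(px[i] * H(joint[i, :] / px[i]) for i in range(joint.shape[0]))

I1 = H(px) - H_X_given_Y                 # I(X;Y) = H(X) - H(X|Y)
I2 = H(py) - H_Y_given_X                 # I(X;Y) = H(Y) - H(Y|X)
I3 = H(px) + H(py) - H(joint)            # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(round(I1, 6), round(I2, 6), round(I3, 6))   # all three agree
```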
Entropy is additive for independent random variables, so the entropy of an IID sequence X = (X1 , X2 , . . . , XN ) will
be equal to N times the entropy of a single event X. Thus, H(X), the entropy per event, is called the entropy
rate of the process. This quantity is important because if we are trying to transmit information
through sequential uses of a communication channel, our channel will need to be able to keep up
with this rate (as discussed in greater depth in section 4.2.2).
Alternatively, suppose that this IID assumption does not hold. What is the entropy rate of
a stochastic process with an arbitrary joint probability distribution? There are two ways we might
do this, and under certain assumptions, they are equivalent ([3] Sec. 4.2).
Returning to the example of drawing marbles, suppose we are drawing from a magical urn that,
the first time we draw from it, will give all four colors with equal probability. After the first draw,
the magic kicks in, and the urn makes it more likely that we will again draw the same color as
our previous draw. For example, the conditional probability p_{X_{N+1} | X_N}(blue | blue) = 5/8, while
p_{X_{N+1} | X_N}(green | blue) = 1/8 (Fig. 8a). If we write out all the possible conditional probabilities,
we will get a matrix where each column represents the conditional distribution p_{X_{N+1} | X_N}. A typical
sequence from this distribution will tend to give repeats of the same color (Fig. 8b).
What can we say about uncertainties in this process? We are more uncertain about the color of
the first draw X1 (all colors with 1/4 probability) than we are about the second color after observing
the first (5/8 chance of the same color again). This means that H(X1 ) > H(X2 | X1 ) (Fig. 8c).
Knowing X1 reduces our uncertainty about the value of X2 , and similarly knowing X2 would reduce
our uncertainty about the value of X1 . This means that the joint entropy is less than the sum of
the individual entropies H(X1 , X2 ) < H(X1 ) + H(X2 ).
What about the entropy rate of the process? There are two ways we could define this. First,
we could simply say that the entropy rate is our average uncertainty over all draws. This can be
quantified by dividing the joint entropy by the number of draws to get an average of the information
in each draw:
\frac{1}{N} H(X_1 , X_2 , \ldots, X_N) \qquad (3)
Figure 8: Entropy rate of a stochastic process a) The stochastic process, which consists of
(top) an initial draw of a colored marble with each color having a probability of 1/4 and (bottom)
subsequent draws where the probability of repeating the same color as the previous draw is 5/8 and
the probability of each other color is 1/8. b) A typical sequence from this stochastic process where a
random variable Xk represents the color selected at each position. c) Entropy, conditional entropy,
and joint entropy of the first three draws. Knowledge of past outcomes reduces uncertainty of future
outcomes (or vice versa). d) The two ways of computing entropy rate: the average of the joint
entropy and the conditional entropy of the next draw given the previous. For a stationary stochastic
process, these converge to the same value as N → ∞.
Alternatively, we could say that the entropy rate is our uncertainty about the next draw, given
the previous ones. This is quantified with the conditional entropy:

H(X_{N+1} \mid X_N , X_{N-1} , \ldots, X_1) \qquad (4)
Under our example, (3) will gradually decrease as N increases, as our high uncertainty about
the first draw has a proportionally smaller and smaller effect compared to the subsequent, more
predictable draws.
For the second definition, since our model specifies that draw N + 1 only depends on the outcome
of draw N (a type of model called a Markov chain), (4) will simplify to:
H(X_{N+1} \mid X_N) \qquad (5)
Thus, it will have a large value for X1 and then be slightly lower and stay constant for all
subsequent draws.
As N → ∞, both measures will converge to the same value for the entropy rate because the high
uncertainty of the first draw is ultimately neglected for both, so the two definitions are equivalent in
the limit. This will be true whenever the stochastic process is stationary, which means that its joint
probability distribution is shift-invariant. Mathematically, a stationary process has the property:
p_{X_{N+1} \mid X_N , X_{N-1} , \ldots} = p_{X_{N+1+k} \mid X_{N+k} , X_{N-1+k} , \ldots}
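For the magical-urn example, both definitions of the entropy rate can be evaluated directly, since for this first-order Markov chain the chain rule gives H(X1 , . . . , XN ) = H(X1 ) + (N − 1) H(XN +1 | XN ). A small sketch (the probabilities 1/4, 5/8, and 1/8 come from Figure 8; the code itself is ours):

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

stay, switch = 5/8, 1/8
H_first = H([1/4] * 4)                      # uncertainty of the first draw: 2 bits
H_cond = H([stay, switch, switch, switch])  # uncertainty of each later draw, ~1.549 bits

# Definition 1: average joint entropy per draw, (1/N) H(X1, ..., XN).
for N in (1, 2, 5, 50, 1000):
    avg = (H_first + (N - 1) * H_cond) / N
    print(f"N={N:>4}: average entropy per draw = {avg:.3f} bits")

# Definition 2: conditional entropy of the next draw given the previous one.
print(f"conditional entropy rate = {H_cond:.3f} bits")
# Both converge to the same value as N grows, as expected for a stationary process.
```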
In the non-IID setting, redundancy is a function of higher order interactions between
variables rather than just of the distribution of each random variable individually. In the example
above, H(Xk ) = Hmax (X ) for every draw: all colors are equally likely in the absence of any infor-
mation about the outcomes of other draws. But the joint distribution in our example is not the
maximum entropy distribution, because the information present in the distribution of one random
variable is shared by others. This manifests in the fact that we can do much better than random
chance in guessing the next color in a sequence given knowledge of the previous ones. The maximum
entropy joint distribution would be independent and identically distributed random events, each of
which has equal probability over all of its states.
2.9 Probability densities and differential entropy
The examples above all considered probability mass functions over finite, discrete sets of events.
How do the formulae and interpretations generalize to cases other than this, such as probability
density functions defined over the infinite real line?
For probability densities that are defined over the continuous real line (e.g. the normal distri-
bution, the exponential distribution), we can define an analog to entropy called the differential
entropy, replacing the discrete sum with an integral:
H(X) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx \qquad (6)

The mutual information between continuous random variables is defined analogously, with densities in place of probability masses:

I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\, dx\, dy
The mutual information also avoids the problem of the logarithm of a quantity with units that affects the differential
entropy, because for mutual information, the log is taken of a ratio of densities, in which the
units cancel out. It is also possible to compute mutual information between discrete and continuous
random variables [4].
In summary, in spite of the difficulties associated with the differential entropy over probability
densities, mutual information retains its usefulness and a clear interpretation as the amount of
information one random variable provides about another.
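One of these difficulties is that, unlike discrete entropy, differential entropy can be negative and changes when the variable is rescaled. The sketch below (the uniform-density examples are our own) makes this visible by numerically integrating −p(x) log2 p(x):

```python
import numpy as np

def differential_entropy(pdf, lo, hi, n=100_000):
    """Midpoint-rule approximation of -∫ p(x) log2 p(x) dx over [lo, hi]."""
    dx = (hi - lo) / n
    x = np.linspace(lo, hi, n, endpoint=False) + dx / 2
    p = pdf(x)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(p[mask])) * dx)

# Uniform density on [0, 1/2]: p(x) = 2, so the differential entropy is -1 bit.
print(differential_entropy(lambda x: np.full_like(x, 2.0), 0.0, 0.5))

# Uniform density on [0, 2]: p(x) = 1/2, so the differential entropy is +1 bit.
print(differential_entropy(lambda x: np.full_like(x, 0.5), 0.0, 2.0))

# A negative value is perfectly legal here, and simply measuring x in different
# units changes the answer, unlike discrete entropy, which is always nonnegative.
```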
To quantify the fidelity of a lossy reconstruction, we define a distortion function, which assigns a nonnegative cost to reconstructing a source outcome x as x̂:

d : X × X̂ → R+
In the example shown in Figure 9, X and X̂ are identical (i.e. the set of the four possible
colors), but this may not be true in every situation. For example, we might consider the distortion
associated with storing an arbitrary real number as a 32-bit floating point number on a computer.
In this case, X would be the set of real numbers and X̂ would be the set of possible 32-bit floating
point numbers.
The source produces information at some rate, which we’ll call the “source rate”, whereas the
rate describes how much of that information we are capturing (over, for example, time, space, etc.). Higher
rates are good, because we’re not losing source information. For a discrete source, the maximum
achievable rate is the entropy rate of the source H(X).
A useful scalar descriptor of the distortion for a particular source and encoding scheme is the
average value of a distortion function applied to a source and some distorted version of that source.
For example, this could be the average number of differences (the distortion function) between a
binary string from a source and the string found by passing the original string through a lossy
encoder and then decoder functions (as shown in Fig. 9b):
(Figure 9, panel annotations: a) a source sequence with information content 32 bits is compressed to a 32 bit encoding and perfectly decoded; b) the same sequence is compressed to a 16 bit encoding and decoded with some distortion, which is compared against the best uninformed guessing; c) the distortion vs. rate plane, divided into possible and impossible regions.)
Figure 9: Lossy compression and rate distortion a) In lossless compression, a typical sequence
is mapped to a binary codeword with length equal to its information content and can be decom-
pressed without error. b) In lossy compression, a sequence is mapped to a sequence shorter than
its information content, and errors are present upon decompression. c) The black curve shows the
minimum number of bits needed to achieve a given average distortion.
D = \mathbb{E}\left[ d(X, \hat{X}) \right]
The limiting cases of complete information (where distortion may be zero) and zero information
(where nonzero distortion is impossible except in the trivial case of a constant source) suggest that
more information allows for lower distortion. Rate-distortion theory takes this intuition even further,
proving that if we want to achieve distortion below some threshold (i.e. D < Dmax ), we must have
at least R(D) bits of information about X. R(D) is called the rate-distortion function. This
means that if we’re already achieving the best average distortion for the amount of information we
have, I(X; X̂), we cannot do better unless we acquire more information. This is true regardless of the
chosen distortion function, provided that the distortion function d(x, x̂) is finite for every possible x
and x̂. This finding is proven in the asymptotic case of infinitely long block lengths.
This doesn’t necessarily mean that having N bits about a source will give us an X̂ that achieves
the N -bit best performance for a given distortion function. For example, if we know X perfectly, and
we applied an invertible operation like mapping each color to a new, distinct color, we would not
have lost any information. However, if our distortion function was the number of incorrect colors,
the average distortion would increase.
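As a toy illustration of the average distortion D = E[d(X, X̂)] (the one-bit encoder below is our own construction, not the scheme of Figure 9), consider compressing the four-color source down to a single bit and using a 0/1 distortion function that counts a reconstruction as wrong whenever the color differs:

```python
probs = {"blue": 1/2, "gray": 1/4, "yellow": 1/8, "green": 1/8}

# A 1-bit lossy encoder: keep only whether the marble was blue or not.
def encode(color):
    return 0 if color == "blue" else 1

# The decoder guesses the most probable color consistent with each codeword.
def decode(bit):
    return "blue" if bit == 0 else "gray"

def d(x, x_hat):
    return 0 if x == x_hat else 1          # "wrong color" distortion

# Average distortion D = E[d(X, X_hat)] under the source distribution.
D = sum(p * d(color, decode(encode(color))) for color, p in probs.items())
print(D)   # 0.25: yellow and green marbles are always reconstructed incorrectly
```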
An example rate-distortion function for a discrete source is shown in Figure 9c. The green
area represents possible combinations of rates and distortions. For sufficiently high values of average
distortion, no information about the source is necessary, so the curve intersects the horizontal axis.
For example, consider simply guessing the average outcome each time. For discrete sources, the curve
intersects the vertical axis at the value of the source’s entropy; at this point there is no remaining
uncertainty. In contrast, for a continuous source, the curve would asymptotically approach the
vertical axis, since an infinite number of bits would be needed to describe a continuous source
exactly (Sec. 2.9).
In general, R(D) will be a non-increasing convex function of D, which means that 1) achieving
lower and lower distortion will always require acquiring more information (assuming we were already
making good use of the information we had), and 2) there will be diminishing returns on lowering
distortion as we acquire more and more information.
3 Channels
Thus far we’ve considered the mathematical foundations of information theory in a general sense,
without specifying what gives rise to the statistical relationship between X and Y. Now we move to
a more specific scenario that models the transmission of information through a channel. Often,
channels model an actual physical medium for information transmission, like a band of frequencies
of radio waves or transistors on a computer.
Mathematically, a channel is some mapping from inputs in X to outputs in Y. Rather than using
the term “outcome” to describe the value taken by a random variable as in the previous section,
we now use the terms “input” and “output”. Which inputs map to which outputs is determined
by the conditional probability distribution pY|X . In the special case that pY|X can be described by
a deterministic function, then we have a noiseless channel; otherwise, we have the more general
case of a noisy channel.1
When the channel is written in matrix form, which we denote PY|X , each column is the conditional distribution p(y | X = x), which
describes the probability that the input x will map to each possible output in Y. This describes the
noise distribution of that input. Since each column is itself a probability distribution, it sums to 1.
Using the channel matrix, two quantities of interest can be computed. The output distribution
can be described by a vector of probabilities pY , and it can be computed from the input distribution
pX by multiplying it by the channel matrix (Fig. 10b):
pY = PY|X pX
The joint distribution PX,Y is also useful because it is needed in the computation of conditional
entropy and mutual information. It can be found by putting the input probability vector pX
along the diagonal of a matrix and multiplying by the channel matrix (Fig. 10c):

P_{X,Y} = P_{Y|X} \,\mathrm{diag}(p_X)
Unlike PY|X , for which each column is a probability distribution and sums to 1, all entries of
PX,Y together sum to 1 since the whole matrix is a joint probability distribution.
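These matrix operations take only a few lines of code. The sketch below uses a made-up 3-input, 3-output channel matrix (not the one shown in Figure 10) to compute the output and joint distributions:

```python
import numpy as np

# Channel matrix P_{Y|X}: each column is p(y | X = x) and sums to 1.
P = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.8, 0.2],
              [0.0, 0.1, 0.8]])

p_x = np.array([0.5, 0.25, 0.25])        # input distribution

p_y = P @ p_x                            # output distribution: p_Y = P_{Y|X} p_X
P_xy = P @ np.diag(p_x)                  # joint distribution: P_{X,Y} = P_{Y|X} diag(p_X)

print(p_y, p_y.sum())                    # a valid distribution (sums to 1)
print(P_xy.sum())                        # all entries of the joint sum to 1
```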
Figure 10: Channels a) The mapping view of a noisy channel. The channel is mathematically
represented by the conditional random variable Y | X b) Channels and inputs/outputs can also be
represented by matrices and vectors of probabilities, respectively. The vectors sum to one since they
represent probability distributions, and each column of the matrix sums to one since it represents
the conditional distribution Y | X = x. The matrix/vector representations can be used to compute
c) the output distribution and d) the joint distribution through matrix-vector and matrix-matrix
multiplications, respectively. d) The joint distribution is computed from the matrix-matrix product
of the channel matrix and a matrix with the input distribution along the diagonal.
Given a particular output, there is some remaining uncertainty about which input
gave rise to that output, H(X | y). Averaging over both inputs and outputs gives H(X | Y).
Channels that have less overlap at the output will tend to transmit more information.
Figure 11: Examples of noisy and noiseless channels. A random variable X with equal prob-
ability over all of its states is mapped to a random variable Y. These mappings can be either
deterministic or noisy and information-preserving or information-destroying. Bars to the right of
each mapping show entropy, conditional entropy, joint entropy, mutual information, and the max-
imum entropy over the state space of each random variable. Line thickness denotes magnitude of
p(y | x) a) A deterministic, information-preserving mapping with more outputs than inputs. Each
input maps uniquely to exactly one output. The outputs with no line connected have zero probabil-
ity. b) An information-preserving, noisy mapping. Each input maps into multiple outputs, but the
outputs mapped to by a particular input are all disjoint. c) A deterministic, information-destroying
mapping. Multiple inputs collide on the same output, meaning the mapping cannot be inverted. d)
A noisy, information-destroying mapping. Each input maps to multiple outputs, and the outputs
mapped to by each input are not disjoint.
Examples of noisy and non-noisy channels We now give specific examples of channels that
are either noisy or noiseless and either lose information about the inputs or preserve all of it. For
simplicity, here we assume the distribution over inputs is uniform. The ideal information transmission
system would lose no information and be able to recover X exactly given an output Y, which would
mean that I(X; Y) = H(X). One way this can occur is in a noiseless channel in which the channel
maps every input to a unique output (Fig. 11a). In this situation, H(X) = H(Y) = I(X; Y) and
H(Y | X) = H(X | Y) = 0, meaning that we could perfectly predict X given Y (or Y given X).
However, a channel does not need to be noiseless to completely preserve input information. All
input information can be recovered through a noisy channel, so long as each input maps to outputs
that don’t overlap with those of other inputs (Fig. 11b). In this scenario, H(Y | X) is no longer
equal to zero, because even knowing what the input was, we cannot predict exactly how the noise
of the channel manifests at the output. In contrast, H(X | Y) = 0 because given the output, we can
know exactly what the input was. H(Y) is greater than H(X) because it includes all of the entropy
H(X) along with some additional entropy caused by noise.
Alternatively, even if the channel imparts no noise, information can be lost. This happens if
multiple inputs map to the same output (Fig. 11c). It thus becomes impossible to tell with
certainty from the output Y which input was used, or equivalently, H(X | Y) > 0.
Finally, the most general situation, which will be encountered most in practice, is the one in
which there is both a loss of information (H(X | Y) > 0 or equivalently H(X) > I(X; Y)) and noise
present in the output (H(Y | X) > 0 or H(Y) > I(X; Y)) (Fig. 11d). Although Figure 11d
shows a scenario where H(Y) > H(X), this can also occur even if H(X) = H(Y): some of the input
entropy is replaced by noise.
To summarize: I(X; Y) is the average amount of information successfully transmitted through the
channel. H(Y | X) is the average noise at the output–uncertainty about what the output outcome
will be knowing the specific input used. H(X | Y) is uncertainty in which input was used knowing
the output outcome, which can arise from deterministic many-to-one mappings in the channel or
inputs whose noise overlaps at the output.
The different ways of decomposing mutual information presented in Section 2.7 have more specific
interpretations in the context of a noisy channel:

I(X; Y) = H(X) - H(X \mid Y)

can now be interpreted as the average input information, minus the uncertainty at the output
about which message was sent, which is determined by how much different messages overlap at the
channel’s output.

I(X; Y) = H(Y) - H(Y \mid X)

can be interpreted as the total uncertainty at the output (the sum of input uncertainty and
noise), minus the noise in the received messages.
If three random variables form a chain A → B → C, in which C depends on A only through B, then the data processing inequality states that

I(A; B) ≥ I(A; C)

and this statement can be generalized to more than three random variables.
Given a fixed noisy channel, two natural questions arise: 1) Which input distribution will maximize the information transmitted through the channel?
2) How much information will the channel be able to transmit if this optimal input distribution is
used? The answer to the latter question is called the channel capacity and is denoted by C.
The quantities can be found by solving an optimization problem:

C = \max_{p_X \in \mathcal{P}_X} I(X; Y) \qquad (9)

where PX is the set of probability distributions on the space X . In the case where X is discrete,
this will be the set of nonnegative vectors whose sum is 1 (formally known as the probability
simplex). The computational details of how to solve this optimization problem can be found in
Appendix A. Here, we focus on the intuition behind its objective function.
Mutual information can be decomposed as I(X; Y) = H(Y) − H(Y | X), and this decomposition
is useful in analyzing the competing goals of this optimization problem. Consider the noisy channel
shown in Figure 12a, which has two inputs that are noisy but map to distinct outputs, and one
input that is less noisy but has outputs that overlap more with the outputs of the other inputs.
If we were to not maximize mutual information, but instead only maximize the first term in
the decomposition, H(Y), probability would be placed on inputs in such a way that the output
probability was as uniformly distributed as possible. This is beneficial because it means more of
the channel’s outputs are utilized, which means different channel inputs have more space to avoid
overlapping on the same outputs. In the case of this channel, this would result in putting half the
probability mass on each of the two noisy but non-overlapping inputs, which would result in 1 bit
of information gained with each use of the channel (Fig. 12b).
Alternatively, if we were to optimize only the second term, minimizing H(Y | X),
probability would be placed on the inputs in such a way as to select for the least noisy
inputs. Since the middle input is less noisy than the other two, this would result in all of the
probability mass being placed on it, which would lead to no information being transmitted.
By optimizing the full objective, these goals are balanced: some probability is placed on the
least noisy middle input, while some probability is placed on the noisier inputs, which makes use of
the full output space. This leads to an information transmission of 1.13 bits, which is the channel
capacity C.
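The optimization in equation (9) can also be approximated numerically. The sketch below uses a hypothetical channel of the same flavor as Figure 12a (two noisy inputs with non-overlapping outputs and one cleaner input that overlaps with both; the figure's exact probabilities are not reproduced) and performs a crude brute-force search over the probability simplex. The Blahut-Arimoto algorithm is the standard, more efficient approach.

```python
import numpy as np
from itertools import product

def mutual_information(p_x, P):
    """I(X;Y) in bits given an input distribution p_x and a channel matrix P[y, x]."""
    joint = P * p_x                          # joint[y, x] = p(y | x) p(x)
    p_y = joint.sum(axis=1)
    outer = np.outer(p_y, p_x)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Hypothetical channel: inputs 0 and 2 are noisy but land on disjoint outputs,
# input 1 is less noisy but its outputs overlap with both of the others.
P = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.1, 0.0],
              [0.0, 0.8, 0.0],
              [0.0, 0.1, 0.5],
              [0.0, 0.0, 0.5]])

best_val, best_px = -1.0, None
steps = 50
for a, b in product(range(steps + 1), repeat=2):
    if a + b > steps:
        continue
    p_x = np.array([a, b, steps - a - b]) / steps
    I = mutual_information(p_x, P)
    if I > best_val:
        best_val, best_px = I, p_x

print(f"capacity ≈ {best_val:.3f} bits at input distribution {best_px}")
```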
4 Channel coding
In information transmission problems, there will usually be a source of random events S. We will
refer to the outcomes of S as messages. We don’t have any control over the messages or the channel,
but we do have the ability to design a function called an encoder, which will map certain messages
to certain channel inputs.
In this section we discuss the problem of channel coding, which addresses the question of how
we should design encoders to maximize information transmission, and minimize the probability that
we incorrectly infer the source message.
4.1 Encoders
An encoder is a deterministic function that maps messages from the source to the inputs of a
channel. We’ll make the assumption, unless otherwise stated, that every message is mapped to a
unique input, so no information is lost from S to X. A good encoder will maximize the mutual
information between X and Y, so its goal is to map S to X in such a way that the distribution of
Figure 12: Optimizing mutual information for a fixed channel gives both the optimal input
distribution and the channel capacity. a) Matrix representation of a noisy channel. b) Only maxi-
mizing output entropy disregards putting probability on the least noisy input. c) Only minimizing
the average input noise fails to utilize the full output space. d) The full objective function
balances these competing goals.
X will balance: 1) having probability mass on non-noisy inputs, and 2) spreading probability of the
channel outputs as uniformly as possible to avoid different inputs overlapping at the output.
Figure 13a shows a noisy channel which maps each input X = x to two possible outputs with
equal probability. Our goal is to transmit messages from a non-redundant source (Fig. 13b), in
which three colors occur with equal probability, through this noisy channel with minimal information
loss.
Figure 13c shows a bad way of doing this: using three states in X that have overlapping
outputs in Y, which leads to a loss of information that creates ambiguity as to what the original
source message was.
Alternatively, we could design an encoder which uses only a subset of the states in X that produce
disjoint outputs in Y (Fig. 13d). In this case, there is no loss of information.
This encoder is not unique, since we can randomly permute which state of S maps to which
state of X and achieve the same mutual information.
Figure 13: Optimal encoders for a noisy channel. a) a noisy channel that maps each input to
two possible outputs. b) A maximum entropy source of random messages. c) An encoder that maps
messages in S to inputs in X that produce overlapping outputs that are not uniquely decodable,
thus transmitting less information than the maximum possible. d) An encoder that transmits the
maximum amount of information by mapping messages to inputs that produce disjoint, and thus
decodable, outputs. e) Another encoder/decoder pair that transmits the maximum amount of
information, showing that the optimal encoder is not always unique.
Figure 14: Matching sources and channels for maximum information transmission. a) A
symmetric noisy channel, in which all inputs are equally noisy and equally overlapping (for a uniform
input distribution) and an asymmetric noisy channel, in which all inputs are equally noisy, but two
pairs of inputs overlap more at the output with each other than the other pair (for a uniform input
distribution). b) Symmetric and asymmetric sources with the same entropy. c) The symmetric
source can be encoded to transmit more information in the symmetric channel, because it is able
to choose inputs such that overlap equally at the output, whereas the asymmetric source is not.
For the asymmetric channel, this is reversed, and the more probable messages from the asymmetric
source can be encoded such that they are less overlapping at the output.
To make this concrete, consider two different noisy channels (Fig. 14a): The first channel (the “symmetric channel”) has inputs that are all equally noisy and all have
outputs that overlap equally. The second channel (the “asymmetric channel”) also has equally noisy
inputs, but their outputs are not equally overlapping with one another: it has two pairs of inputs
whose outputs overlap less with each other than do the outputs within each pair.
There are also two different sources with the same entropy (Fig. 14b). The first source (the
“symmetric source”) has three equally probable outcomes, while the second source (the “asymmetric
source”) has two more probable and two less probable outcomes. For channels and sources with a
small number of states such as the ones shown here, the optimal encoder can be found by exhaustively
searching all possible ways of mapping messages to channel inputs.
For the symmetric channel, the optimal encoder for the symmetric source is able to transmit more
information than the optimal encoder for the asymmetric source, but for the asymmetric channel,
the reverse is true (Fig. 14c). In the latter case, this occurs because the encoder is able to map
each of the source’s two most probable messages (green and gray) to distinct subsets of channel
inputs that overlap less at the output. This is not possible for the symmetric source, because all of
its messages are equally probable.
In contrast, for the symmetric channel, more information can be transmitted about the symmetric
source than the asymmetric source. This is because the outputs for all inputs are equally
overlapping, which forces the asymmetric source to map its highly probable messages to channel outputs
with more overlap.
Maximal information transmission Having seen that the channel capacity C can be measured
by finding the optimal input distribution, and that encoders for different sources that map one mes-
sage at a time to channel inputs can achieve different maximal amounts of information transmission,
it can be concluded that the amount of information transmitted is generally less than the channel capacity
under these circumstances. However, a central result in information theory, the Noisy channel
coding theorem, proves that it is in fact possible to transmit information up to the channel capac-
ity for any source/channel pair. The key to making this possible is not encoding one message at a
time, but rather many messages at once. This both changes the distribution of the source messages
and the noise per each channel input so that they are more uniform and can be more easily matched
by an encoder.
original message. 5) The compressed messages are decompressed back onto the original probability
space.
Steps 2-4 correspond to the random variables:
• S: The (compressed) source message
• X: The input to the channel (the encoded message)
• Y: The output of the channel (the received message)
• Ŝ: The estimate of the (compressed) message
Figure 15: Noisy channel coding: the big picture A redundant source passes into a compressor,
which yields a compressed (i.e. maximum entropy) random message S. S then passes into an encoder,
which adds redundancy to create robustness to passage over a noisy channel. The encoded message
X passes through the channel, yielding the received message Y, which possibly contains errors. Y
then goes into a decoder, yielding an estimate of S, Ŝ, and finally a decompressor to give an estimate
of the source. Adapted from [2] p146.
Figure 16: An extended noisy channel is formed by treating multiple uses of a single channel as
a channel itself. a) One use of the original channel. b) one use of the extended channel.
The larger state space of the source means that we’ll also need a channel with a larger state
space, which we can form by using an extended noisy channel. This is simply using the channel
N times, and considering a meta-channel whose inputs and outputs are all possible combinations
of the original channel’s inputs and outputs (Fig. 16).
As a result of this separation and the theorem justifying it, we can disregard the source coding
problem (i.e. the compressor and decompressor) and focus only on the channel coding problem
(encoder → noisy channel → decoder). Since any source of random events can be compressed
into a maximum entropy source (Sec. 2.4), we can simply assume this has already taken place.
R = \frac{\#\,\text{source bits}}{\#\,\text{transmitted bits}}
(Note: This rate is not exactly the same as the rate used in rate-distortion theory (Sec. 2.10).
Rate as used here is defined as a ratio of source bits to transmitted bits, whereas for rate-distortion
theory it describes the absolute amount of information about a particular source.)
Next, the redundant message passes through a noisy channel. The channel shown in Figure 17a
shows a binary symmetric channel in which 0s and 1s transmit correctly with probability 4/5 and
flip with probability 1/5.
Finally, a decoder interprets the received message and attempts to correct any errors and recon-
struct the original message.
A naive approach to this problem is through repetition coding (Fig. 17b). Here, the message
is repeated N times before being sent through the channel, and the decoder chooses the bit at each
position that occurred the greatest number of times in the repetition-coded message. The wrong
message can be received if more than half of the bits in a given position flip. To lower the probability
of this happening, we can increase the number of repetitions. This reduces the probability of error,
but in doing so lengthens the message needed for each event and thus lowers the rate at which
information can be transmitted. As we specify a lower and lower maximal acceptable probability of
message error, the rate at which information is sent approaches 0 (Fig. 17c).
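As a rough numerical illustration of this trade-off (a hypothetical Python sketch, not the paper's code), the following simulates repetition coding over a binary symmetric channel with flip probability 1/5 and majority-vote decoding. The error probability falls as the number of repetitions grows, while the rate 1/N shrinks toward 0:

import numpy as np

rng = np.random.default_rng(0)
flip_prob = 0.2          # channel flips each bit with probability 1/5
n_trials = 100_000

for n_reps in [1, 3, 5, 7, 9]:
    # Send a single source bit, repeated n_reps times, through the channel
    source_bits = rng.integers(0, 2, size=n_trials)
    transmitted = np.repeat(source_bits[:, None], n_reps, axis=1)
    flips = rng.random((n_trials, n_reps)) < flip_prob
    received = transmitted ^ flips.astype(int)
    # Majority-vote decoder: pick the bit value seen most often
    decoded = (received.sum(axis=1) > n_reps / 2).astype(int)
    p_error = np.mean(decoded != source_bits)
    print(f"N={n_reps}: rate={1/n_reps:.2f}, P(error)={p_error:.4f}")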
[Figure 17 panels: a) source (random events with no redundancy) → encoder (add redundancy for noise robustness) → noisy channel (transmit binary strings, with errors; transmitted vs. received probabilities 0.8/0.2) → decoder (correct channel errors and infer event); b) repetition coding; c) probability of error vs. rate, with more repetitions pushing the rate toward 0; d) block coding; e) probability of error vs. rate for constant-rate, variable block length encoders.]
Figure 17: The noisy channel coding theorem a) An overview of the problem setup. A source
produces non-redundant (i.e. maximum entropy) messages, here a colored marble which is trans-
formed into a two-digit binary encoding. An encoder adds redundancy to prepare a transmitted
message (e.g. shown here, repeating the message 3 times). The rate is defined as the ratio of source
bits to transmitted bits. The message passes through a noisy channel (e.g. shown here, one that flips each bit with probability 1/5). A decoder attempts to correct errors and recover the original message.
b) A repetition coding scheme (same as part a) repeats the source bits a fixed number of times
and c) can achieve arbitrarily small probability of error (for the noisy channel in (a)), but needs
to send information at a rate approaching 0 to do so. d) A block coding scheme can do better, by
encoding multiple events together. e) As the block length goes to ∞, such a scheme can achieve
communication with arbitrarily small probability of error at rates up to the channel capacity. At
rates above the channel capacity, infinite block lengths will have nonzero probability of error.
The noisy channel coding theorem proved that we can do better than this. To do so, we must
send multiple messages at once, also known as block coding. In the case of this binary channel,
this is done by taking multiple binary messages and passing them through an encoder that maps them to a single binary string. The encoder in Figure 17d is called a Hamming code, which is a good
choice for this problem, but not specifically needed for the noisy channel coding theorem. Similar
to the previous case, those bits are then transmitted and the decoder attempts to correct errors and
reconstruct the original message.
As the bottom panel of Figure 17d shows, by transmitting a sequence of random events as a
single outcome, we can increase the block length of our transmission system while leaving the rate
unchanged by keeping the ratio of source bits to transmitted bits constant.
The noisy channel coding theorem considers the performance of channels as the block length
goes to infinity, and it has two important implications (Fig. 17e):
1. Encoder/decoder pairs exist which can transmit messages with arbitrarily small probability of
error at nonzero rates up to the channel capacity
2. At rates above the channel capacity, no encoder/decoder pairs can transmit messages with
arbitrarily small error probability
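Stated a bit more formally (a paraphrase of the standard statement in the notation of this paper, not the theorem's original wording): for a channel with capacity C and block length N,
\[
R < C: \quad \text{for any } \epsilon > 0 \text{ there exist rate-}R \text{ encoder/decoder pairs with } P(\hat{S} \neq S) < \epsilon \text{ for large enough } N,
\]
\[
R > C: \quad P(\hat{S} \neq S) \text{ is bounded away from zero for every rate-}R \text{ encoder/decoder pair, no matter how large } N \text{ is.}
\]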
The theorem does not state how one can find the appropriate encoders/decoders to achieve this
performance, only that they exist. Specifically, it shows that the average performance of randomly
constructed encoders achieves this, which implies that at least one of the individual randomly con-
structed encoders can achieve it. Finding well-performing codes that can also be decoded efficiently is not straightforward based on this theorem alone, in part because extending the block length makes naive decoding algorithms exponentially more complex. It took several decades of research to discover codes
for binary channels that approach the performance the theorem implies.
have positive values. By looking at the distribution of point-wise conditional entropies for each
channel input, we can assess how heterogeneous the channel inputs are in terms of their noisiness
(Fig. 18). The absolute noise level always increases as we make extended noisy channels with longer block lengths, but the noise of each extended input becomes increasingly close to the block length N times the average noisiness of the non-extended channel, H(Y | X). This is a consequence of the Law of Large Numbers, as shown in Appendix B.3.
[Figure 18: a) An extended noisy channel (first and second channel use). b) Channel inputs become equally noisy as N → ∞: histograms of the per-input noise H(Y | x), in bits, over all extended channel inputs for N = 1 through 5.]
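The concentration behind Figure 18b can be reproduced numerically with a small sketch (hypothetical Python, using an example asymmetric binary channel, since a symmetric channel's inputs are already equally noisy). It enumerates every input of the N-use extended channel and shows that the spread of the per-use noise H(Y | x)/N shrinks around H(Y | X) as N grows:

import numpy as np
from itertools import product

# Illustrative asymmetric binary channel; columns hold p(y | x)
P = np.array([[0.9, 0.4],
              [0.1, 0.6]])

def noise_bits(col):
    """H(Y | x) in bits for a single channel input (one column of P)."""
    col = col[col > 0]
    return -np.sum(col * np.log2(col))

per_input = np.array([noise_bits(P[:, x]) for x in range(P.shape[1])])
avg_noise = per_input.mean()          # H(Y | X) under a uniform input distribution

for N in [1, 2, 4, 8]:
    # The noise of an extended input is the sum of the per-use noises;
    # dividing by N gives the per-use (normalized) noise of that input.
    normalized = [per_input[list(x)].sum() / N
                  for x in product(range(P.shape[1]), repeat=N)]
    print(f"N={N}: mean={np.mean(normalized):.3f}, "
          f"spread (std)={np.std(normalized):.3f}, H(Y|X)={avg_noise:.3f}")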
In addition to this uniformity in the noise level of different channel inputs, the overlap of different
channel inputs becomes uniform as block length increases. Mathematically, this means that H(x | Y)
tends toward the same value for all choices of x. Like the increasing uniformity of noise for each
channel input in the previous paragraph, this phenomenon is a consequence of the Law of Large
Numbers and can be shown mathematically with a proof similar to that in Appendix B.3. The only
difference is that unlike the noise of different inputs, the overlap of different inputs requires knowledge
of the joint distribution pX,Y . It thus depends on a particular choice of input distribution, since
pX,Y = pY|X pX . However, since we know that all sources will produce equally probable messages as
N → ∞, we can simply assume that pX is uniform. This means that the matrix PX,Y is equal to
PY|X multiplied by a scalar constant.
It is worth noting that extended noisy channels still have some inputs that are especially noisy or especially noiseless, as well as some that overlap more or less than others. For example, the extended noisy channel input that corresponds to using the noisiest input of the non-extended channel on every transmission may still be much noisier than the average extended channel input. Outliers like this don't go away as N → ∞; they simply cease to matter because they are outnumbered. However,
since we can never actually have an infinite block length in practice, in many situations information
transmission can do better than a fixed, random encoder, by shifting more probable messages onto
less noisy, less-overlapping inputs. This is possible because when N is not infinite, any redundant
source will not produce equally probable messages. How to do this in practice will be explored in
the next section.
4.3 Encoders

The goal of the encoder is to choose the mapping from (compressed) source messages to channel inputs that maximizes the information passing through the channel:
\[
f^* = \arg\max_{f} I(X; Y), \qquad X = f(S) \tag{11}
\]
Where f(·) is the encoder function that maps from the space of possible messages S to the space of channel inputs X.
When S and X are discrete spaces, this can be a challenging optimization problem, because it
is not amenable to gradient-based optimization. If we consider this problem in matrix form, pS is a
vector of probabilities of different messages, and Pf is a noiseless channel that maps pS to pX. Since
the channel is noiseless, each column has exactly one entry with probability 1, and all other entries
are 0, like the channels shown in Figure 11a,c.
A naive approach to solving this optimization problem is to use a brute force search, in which the
mutual information is calculated for every possible encoder matrix. However, this quickly becomes
computationally intractable, since the number of possible encoder matrices grows with the
factorial of the number of channel inputs.
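A minimal sketch of this brute-force search (hypothetical Python; the source distribution p_s and channel P below are illustrative, not from the paper) enumerates every permutation encoder for a three-message source and evaluates the mutual information each one achieves:

import numpy as np
from itertools import permutations

def mutual_information(p_x, P):
    """I(X; Y) in bits; columns of P hold p(y | x)."""
    p_xy = P * p_x                               # joint distribution, indexed [y, x]
    p_y = p_xy.sum(axis=1, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_y * p_x)[mask]))

p_s = np.array([0.5, 0.3, 0.2])              # illustrative compressed source
P = np.array([[0.90, 0.10, 0.30],            # illustrative channel p(y | x): input 0 is
              [0.05, 0.80, 0.30],            # nearly noiseless, input 2 is very noisy
              [0.05, 0.10, 0.40]])

best_I, best_perm = -1.0, None
for perm in permutations(range(len(p_s))):   # each permutation is one deterministic encoder
    p_x = p_s[list(perm)]                    # input distribution induced by that assignment
    I = mutual_information(p_x, P)
    if I > best_I:
        best_I, best_perm = I, perm
print(f"Best of {3*2*1} encoders: {best_perm}, I(X;Y) = {best_I:.3f} bits")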
Another possibility is to instead use stochastic encoders, which consist of replacing the deterministic encoder function f(·) with a conditional probability distribution p(x | s). In the matrix
view, this corresponds to matrices which can have an arbitrary number of nonzero elements along
each column, so long as they are all positive and each column sums to 1, like the matrices in Figure
11b,d.
Since every deterministic encoder represents a specific case of a larger class of stochastic encoders,
stochastic encoders should in theory be able to do as well or better. In practice they often do worse, though there are certain conditions under which they are superior [6, 7]. This remains an area of active
research.
The major advantage that stochastic encoders do have is that they are much more amenable to
optimization. Since each column of a stochastic encoder matrix (or any channel) is a conditional
probability distribution, it can be optimized with respect to its individual probabilities using the same
numerical optimization tools used to find the optimal input distribution for a fixed channel (Appendix
A).
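As a rough sketch of this idea (hypothetical Python, not the paper's implementation), the following optimizes a stochastic encoder matrix E, whose column s holds p(x | s), by gradient ascent on I(X; Y), with the gradient estimated by finite differences through a simple column renormalization. A more careful treatment would use an analytic gradient and the simplex projection described in Appendix A:

import numpy as np

def mutual_information(p_x, P):
    """I(X; Y) in bits; columns of P hold p(y | x)."""
    p_xy = P * p_x
    p_y = p_xy.sum(axis=1, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_y * p_x)[mask]))

def normalize_columns(E):
    """Clip negatives and renormalize so each column is a probability distribution."""
    E = np.maximum(E, 1e-12)
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
p_s = np.array([0.5, 0.3, 0.2])                      # illustrative compressed source
P = np.array([[0.90, 0.10, 0.30],                    # illustrative channel p(y | x)
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

E = normalize_columns(rng.random((3, 3)))            # stochastic encoder: column s holds p(x | s)
objective = lambda E: mutual_information(E @ p_s, P) # p_X = E p_S

eps, lr = 1e-5, 0.5
for step in range(300):
    base = objective(E)
    grad = np.zeros_like(E)
    for i in range(E.shape[0]):                      # finite-difference gradient through the
        for j in range(E.shape[1]):                  # column renormalization, for simplicity
            E_try = E.copy()
            E_try[i, j] += eps
            grad[i, j] = (objective(normalize_columns(E_try)) - base) / eps
    E = normalize_columns(E + lr * grad)
print(f"I(X;Y) after optimization: {objective(E):.3f} bits")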
Continuous encoders It is also possible to optimize encoders in the case of continuous sources/chan-
nels. That is, a source is represented by a probability density function, and an encoder is a continuous
function (if it is deterministic) or a conditional probability density function (if it is stochastic). In this
setting, the entropies become differential entropies, which have a less clear intuitive interpretation
and can in some cases take on infinite values (Section 2.9). However, the gradients of these entropies
are well defined and do not suffer from such mathematical difficulties, making gradient-based op-
timization possible [8]. Still, there remain many difficulties to optimizing mutual information in
both continuous and discrete settings, particularly in high-dimensional settings, and this remains an
active area of research [9].
4.4 Decoders
Without a proper encoder, we may incur an unnecessary loss of information in the noisy channel.
However, even with an optimal encoder, there remains the challenge of inferring the original
message S from the channel output Y. This is the job of the decoder, a function or conditional
probability distribution which forms an estimate of the original source message Ŝ. Encoders and
decoders must be designed together, and the noise characteristics of the channel must be considered
to achieve good performance.
The channel output will contain both source information I(X; Y) and noise added by the channel
H(Y | X). An optimal decoder will get rid of the latter, mapping all noisy versions of a particular
message p(y | x) back to a correct estimate of the original message s = ŝ. In practice, our decoder
may not be able to eliminate all of the noise H(Y | S) or preserve all the signal I(Ŝ; S): There may
be a trade-off between these competing goals.
Preserving the information contained in the received message Y is necessary, but not sufficient,
for estimating the correct message. To understand why it is necessary, consider the rate distortion
curve in Figure 9c. The probability that the estimated source message is not equal to the actual
source message is a valid distortion function, and the rate-distortion curve tells us that to improve
the best possible average performance on this distortion function, we will need more information.
To understand why it is not sufficient, consider that if we had a perfect decoder, we could always
permute all of its predictions such that it would be wrong every time. This would give the worst
possible distortion, but all the information would remain, because we could always perform the
inverse permutation.
As discussed in the previous section, using block encoders and an extended noisy channel can in-
crease the information throughput of a channel (up to the channel capacity). However, this strategy
comes at the cost of increasing the complexity of designing a corresponding decoder. Specifically,
as we increase the block length, the number of different messages we have to decode increases exponentially as |Y|^N. A naive decoder would simply be a lookup table that maps a received message Y to
an estimate of the source event Ŝ, and this lookup table becomes exponentially large with increasing
block length. The best encoder-decoder pairs are thus semi-random, giving them the advantages of
random encoder construction with the potential for cleverly designed decoding algorithms that can
do better than the exponential complexity of the naive approach. There is not a known formula for
designing such encoder-decoder pairs that works across different types of channels.
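To make the naive decoder concrete, the sketch below (hypothetical Python, not the paper's code) builds the full maximum-likelihood lookup table for a length-5 repetition code over the binary symmetric channel of Figure 17a. The table has 2^N entries, which is exactly what becomes intractable at the block lengths the theorem requires:

from itertools import product

flip_prob = 0.2                              # channel flips each bit with probability 1/5
N = 5                                        # block length (number of repetitions)
codewords = {0: (0,) * N, 1: (1,) * N}       # repetition code: message -> transmitted block

def likelihood(received, sent):
    """p(received | sent) for a memoryless binary symmetric channel."""
    flips = sum(r != s for r, s in zip(received, sent))
    return flip_prob ** flips * (1 - flip_prob) ** (N - flips)

# Naive decoder: one table entry for each of the 2^N possible received blocks
lookup = {r: max(codewords, key=lambda m: likelihood(r, codewords[m]))
          for r in product([0, 1], repeat=N)}

print(len(lookup), "table entries")          # 2^N grows exponentially with the block length
print(lookup[(1, 0, 1, 1, 0)])               # majority of 1s -> decoded as message 1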
5 Code availability
The code used to produce the figures, along with high-resolution versions of the figures, can be found at https://doi.org/10.5281/zenodo.
6 Acknowledgements
We thank Matt Thomson for alerting H.P. to the existence of information theory, and in particular the wonderful textbook [2] by David MacKay that made learning accessible and interesting, Chenling Antelope Xu for being an insightful and thorough member of a study group on that book, Tom Courtade for providing feedback and clarifying concepts, and Leyla Kabuli and Tiffany Chien for their extremely helpful feedback in the preparation of this manuscript.
A Numerically finding the optimal input distribution

The channel capacity of a fixed channel is found by searching for the input distribution that maximizes the mutual information:
\[
p_X^* = \arg\max_{p_X} I(X; Y) \tag{12}
\]
This optimization problem can be solved, and the optimal distribution/channel capacity computed, using numerical optimization techniques such as gradient ascent. Mutual information has the nice property that it is a concave function of the input probabilities, so optimizing the input probabilities directly using gradient ascent is guaranteed to eventually find a global maximum with a proper learning rate.
We solve this problem using projected gradient ascent, which consists of applying the following update:
\[
p_X^{k+1} = \mathrm{proj}\!\left( p_X^k + \lambda \nabla_{p_X} I(X; Y) \right)
\]
where the projection proj(·) is computed by applying, in order,
\begin{align*}
p_X &\leftarrow \max(p_X, 0) \\
p_X &\leftarrow p_X + \frac{1 - \sum_{i=1}^{N} p_i}{N}
\end{align*}
where N is the number of elements in p_X and p_i is the ith entry of p_X.
While this heuristic projection step appears to work in practice, we note that there may be better
and more theoretically motivated alternatives [10].
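A minimal sketch of this procedure in Python (hypothetical, not necessarily the implementation used for the figures) applies the updates above to a binary symmetric channel. It uses the analytic gradient of I(X; Y) with respect to the input probabilities: the relative entropy between each input's output distribution and the overall output distribution, minus a constant that does not affect the optimum:

import numpy as np

def capacity_projected_ascent(P, lr=0.1, n_steps=2000):
    """Estimate channel capacity by projected gradient ascent on the input distribution.

    P[y, x] = p(y | x); each column of P is a conditional distribution.
    """
    n = P.shape[1]
    p_x = np.full(n, 1.0 / n)                       # start from a uniform input distribution
    for _ in range(n_steps):
        p_y = P @ p_x
        ratio = P / np.maximum(p_y[:, None], 1e-15)
        # Gradient of I(X;Y) in bits w.r.t. p(x): D(p(y|x) || p(y)) - log2(e)
        grad = np.sum(P * np.log2(np.maximum(ratio, 1e-15)), axis=0) - np.log2(np.e)
        p_x = p_x + lr * grad                       # gradient step
        p_x = np.maximum(p_x, 0)                    # heuristic projection back onto the simplex
        p_x = p_x + (1 - p_x.sum()) / n
    p_y = P @ p_x
    ratio = P / np.maximum(p_y[:, None], 1e-15)
    I = np.sum(p_x * np.sum(P * np.log2(np.maximum(ratio, 1e-15)), axis=0))
    return p_x, I

# Binary symmetric channel with flip probability 1/5; capacity is 1 - H(0.2), about 0.278 bits
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
p_star, C = capacity_projected_ascent(P)
print(p_star, C)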
B Proofs
B.1 Proof that the joint entropy of two independent random variables equals the sum of the individual entropies
Assuming X and Y are independent, so that p(x, y) = p(x)p(y):
\begin{align*}
H(X, Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(x)\, p(y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \left( \log \frac{1}{p(x)} + \log \frac{1}{p(y)} \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(y)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} \sum_{y \in \mathcal{Y}} p(y) + \sum_{y \in \mathcal{Y}} p(y) \log \frac{1}{p(y)} \sum_{x \in \mathcal{X}} p(x)
\end{align*}
Taking advantage of the fact that the sum over any probability distribution is one, this reduces to:
\begin{align*}
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{y \in \mathcal{Y}} p(y) \log \frac{1}{p(y)} \\
&= H(X) + H(Y)
\end{align*}
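As a quick numerical sanity check of this identity (a hypothetical Python snippet, not part of the proof):

import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = np.array([0.5, 0.25, 0.25])
p_y = np.array([0.7, 0.3])
p_xy = np.outer(p_x, p_y)                 # independent joint distribution: p(x, y) = p(x)p(y)

print(entropy(p_xy.ravel()))              # H(X, Y)
print(entropy(p_x) + entropy(p_y))        # H(X) + H(Y): the two values agree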
B.3 Proof that extended channel inputs become equally noisy as N → ∞

The noise of a single extended channel input x decomposes across channel uses:
\begin{align*}
H(Y \mid x) &= H(Y \mid x^{(1)}, x^{(2)}, \ldots, x^{(N)}) \tag{16} \\
&= H(Y \mid x^{(1)}) + H(Y \mid x^{(2)}) + \ldots + H(Y \mid x^{(N)}) \tag{17}
\end{align*}
where x = (x^{(1)}, x^{(2)}, \ldots, x^{(N)}) represents the input used on each channel use. These inputs are independent random variables, since the input chosen on one channel use is independent of subsequent channel uses. Thus, we can take a probability-weighted average over the choice of channel input, yielding:
\begin{align*}
\sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x) &= \sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x^{(1)}) + \sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x^{(2)}) + \ldots \\
&= \sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) H(Y \mid x^{(1)}) \prod_{i=2}^{N} \sum_{x^{(i)} \in \mathcal{X}} p(x^{(i)}) \\
&\quad + \sum_{x^{(2)} \in \mathcal{X}} p(x^{(2)}) H(Y \mid x^{(2)}) \sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) \prod_{i=3}^{N} \sum_{x^{(i)} \in \mathcal{X}} p(x^{(i)}) + \ldots
\end{align*}
By using the fact that the sum of a probability distribution is equal to one, this reduces to:
\[
\sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) H(Y \mid x^{(1)}) + \sum_{x^{(2)} \in \mathcal{X}} p(x^{(2)}) H(Y \mid x^{(2)}) + \ldots
\]
By dividing by N and invoking the Law of Large Numbers, we can see that as N → ∞, the block length-normalized noise level of an average extended channel input becomes increasingly close to the average noise of the non-extended channel, H(Y | X).
References
[1] Claude E Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 7 1948.
[2] David J C MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 1st edition, 2003.
[3] Thomas M Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in
Telecommunications. Wiley, New York, USA, 2nd edition, 9 2005.
[4] Brian C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE,
9(2), 2014.
[5] Yury Polyanskiy, H. Vincent Poor, and Sergio Verdú. Channel coding rate in the finite block-
length regime. IEEE Transactions on Information Theory, 56(5):2307–2359, 2010.
[6] Lucas Theis and Eirikur Agustsson. On the advantages of stochastic encoders. arXiv, pages
1–8, 2 2021.
[7] Aaron B. Wagner and Johannes Ballé. Neural Networks Optimally Compress the Sawbridge.
Data Compression Conference Proceedings, 2021-March:143–152, 2021.
[8] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and
blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
[9] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker.
On variational bounds of mutual information. In 36th International Conference on Machine
Learning, ICML 2019, volume 2019-June, pages 9036–9049, 2019.
[10] Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An
efficient algorithm with a simple proof, and an application. arXiv, 2013.