A Visual Introduction To Information Theory
1 Introduction
In the 1940s, Claude Shannon [1] showed that a precise mathematical definition of information arises
directly from the axioms of probability. Since then, information theory has served as a foundation
for the digital world we live in today by defining the limits of data compression and the reliable
transmission of data across noisy networks. In addition, the mathematical tools it provides have
found numerous applications in diverse areas such as statistics, machine learning, cryptography,
quantum computing, biology, and many others.
Here we introduce the foundations of information theory, with an emphasis on intuition. This
is meant to serve as a minimal introduction to key ideas. A more comprehensive treatment can be
found in the excellent textbooks [2, 3] as well as Shannon’s original formulation [1].
Main ideas Outside of the context of information theory, most of us already have a vague intuitive
notion of what “information” is, which is usually something close to “knowledge obtained about
something unknown.” One of the main advances of information theory was to formalize this intuition
into something mathematically precise. Shannon himself once said that “information can be treated
very much like a physical quantity such as mass or energy.”
Shannon was able to make the idea of information mathematically concrete by defining it in
terms of probability. If a sender is transmitting information about something to a receiver, that
information is describing something that would otherwise be unpredictable. In other words, from
the receiver’s perspective, it is random. Information theory shows that the randomness of the
message, as characterized by the probability distribution over potential messages, is in fact the only
thing that matters for figuring out how to transmit information. The actual content of the messages
is irrelevant.
This fact enables us to consider the fundamental concepts in information theory using a generic
source of random events: drawing colored marbles from an urn, with each marble representing a
possible message from the set of all messages. The mathematics developed on this example can be
directly applied to other application areas of information theory. For example, the way we describe
the randomness of a sequence of colored marbles is the same as how we would describe the randomness of
a string of letters in English text, or a sequence of pixels in images taken by the Hubble telescope.
The fundamental unit of information content is the bit. It is worth noting that the word “bit”
also describes a container that can hold information: a digit that can be either 0 or 1. The actual
information within the container must be defined with respect to some probability distribution, and
it could hold less than 1 bit of information in the same way that a 1 liter bottle can be filled with only
1/2 liter of water. Throughout this paper, we try to avoid this ambiguity by choosing examples
where each bit container is “full” with one bit of information. So it is safe to assume that “bit” here
refers to the units of information, unless explicitly stated otherwise.
There are two key problems in information theory that will come up throughout the sections
below. The first is source coding, which concerns data compression: Given a source of random
events, what are the limits on how succinctly we can record the outcome of a random sequence, on
average? This can be further subdivided into the cases where the original sequence can be perfectly
reconstructed upon decompression (lossless compression) or the case where some distortion from
the original sequence can be tolerated to achieve even smaller data size (lossy compression).
The second problem is channel coding, which concerns data transmission: Given a source
of random events, and an imperfect system for transmitting information (a noisy channel), how
can we encode a sequence of such events such that they will be robust to errors introduced by the
channel? Again, we can subdivide into perfect data transmission, and data transmission with errors.
1.1 Notation
(Figure 1: a) Independent random sampling of colored marbles, drawn at random with replacement, along with the probability of each outcome. c) The information content of different propositions: each proposition rules out some outcomes and leaves others possible, and the information gained depends on the probability that the proposition is true.)
Formally, we have a finite, discrete set of independent and identically distributed (IID) events
and a random variable that assigns a probability to each element of the set. The intuitions developed
on this simplified scenario will later be generalized to the non IID case (Sec. 2.8) and continuous
variables and probability density functions (Sec. 2.9).
More common outcomes provide us with less information, and rarer outcomes provide us with
more. To understand why, suppose someone has drawn two colored marbles from an urn, but we
don’t know what they’ve drawn. We’ll denote these two random events with X1 and X2 . X1 and X2
have a joint probability distribution pX1 ,X2 , which tells us the probability of each of the 16 possible
two-color sequences (Fig. 1b). Suppose we then learn some facts about what was drawn (but not
the exact outcome). For example, we might learn that neither marble is blue, or the first marble
isn’t green, or the two marbles are different colors (Fig. 1c).
Learning each fact allows us to rule out possibilities for the outcome of X1 , X2 . The less often the
fact is true, the more possible outcomes we can rule out. For example, knowing the two marbles are
the same color rules out 12 of the 16 possible outcomes, which correspond to 21/32 of the probability mass.
The more possibilities we can rule out, the more information we’ve gained. Thus, rarer outcomes
contain more information. The more surprised we are by the outcome, the more information it has
provided us.
Furthermore, we can calculate exactly how much information that outcome has provided based
on its probability. Each time we can rule out half of the probability mass, we have gained exactly 1
bit of information. In the case where we learn that neither marble is blue, we’re left with only 1/4 of
the probability mass, and we have gained 2 bits of information, since we have halved the probability
mass twice. The amount of information can be calculated from the probability of the outcome by:
\log_2 \frac{1}{p(x)}
The conventional choice of a base-2 logarithm means that the units will be bits (The 2 will often
be omitted in subsequent sections).
Since we can calculate the information provided by each outcome, and we know each outcome’s
probability, we can compute the probability-weighted average amount of information provided by a
random event, otherwise known as the entropy. Entropy is denoted H(X) and is defined mathe-
matically as:
H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}
Entropy can also be interpreted as how surprising the random event is on average. Events that
are more random will tend to be more surprising on average, and observing their outcome will yield
more information, and reduce our uncertainty more. On the flip side, higher entropy also means that
there is more uncertainty in the random event to begin with, so we must acquire more information
to be certain about its outcome compared to an event with lower entropy.
It might seem that, since lower probability outcomes contain more information, that random
events containing many extremely rare outcomes will have the highest entropy. In fact, the opposite
is true: random events with probability spread as evenly as possible over their possible outcomes
will have the highest entropy. Mathematically, this happens because the gains in information that
come from an event becoming increasingly rare are outweighed by the decrease in the probability
that the highly informative outcome will occur, due to the logarithm in the formula for information.
Section 2.3 will discuss maximum entropy distributions in more detail.
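As a quick numerical check of these ideas, the short Python sketch below (the helper function and example distributions are our own, not from the text) computes the entropy of a few distributions over four colors, including the uniform and skewed distributions that appear in Figure 2:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the probability-weighted average of log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Uniform distribution over four colors (Fig. 2, top): 2 bits.
print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0

# Skewed distribution (Fig. 2, bottom): 1.75 bits.
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75

# Any distribution that is less evenly spread has lower entropy.
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.36 bits
```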
2.2 Entropy and data compression
A random variable’s entropy is connected to many other interesting properties. For example, it
quantifies the limit of how much (lossless) compression of data can be achieved, a problem in information
theory known as source coding.
Figure 2: Entropy can be interpreted as the average length of the shortest encoding of a sequence
of random events, which here are repeated draws (with replacement) of colored marbles from an
urn. (Top) With equal probability of each color, the shortest binary recording assigns a two-digit
binary string code to each event. The entropy is the average number of bits per event of a typical
sequence: 2 bits. (Bottom) When some colors are more likely than others, the more probable ones
can be recorded as shorter binary strings to save space. This gives a lower entropy: 1.75 bits.
To demonstrate this, we return to the problem of drawing colored marbles from an urn. Now
we’re going to draw a sequence of marbles, and record the outcome of each draw using binary
codewords, such as 1, 10, etc. Our goal is to choose an encoding scheme that will (on average)
yield the shortest possible binary description of our sequence without introducing any ambiguity as
to what the outcomes of the draws were. This is a lossless compression problem, and the shortest
possible length is determined by the entropy of the random draw.
We want to pick an encoding scheme that matches outcomes (e.g. a green marble drawn) to
bit strings (e.g. 01), such that the length of the string for a series of random draws is as short as
possible. How short we can make this encoding, while still being able to correctly decode our binary
encoding into the original sequence, is closely related to the fundamental information content of the
sequence: sequences that require longer encodings (on average) contain more information. Thus, the
length of the optimal binary recording scheme is determined by the random variable’s entropy–the
average information present in a single outcome.
The amount of information and the optimal recording scheme both depend on the probabilities of
each outcome. For example, consider a random variable X on the outcome space X with associated
probabilities pX (Fig. 2, top):
X = {blue, gray, yellow, green}
pX = { 1/4, 1/4, 1/4, 1/4 }
The shortest lossless encoding scheme given these probabilities is a unique 2 digit binary code-
word for each outcome: 00, 01, 10, 11, because the average information in each random event (its
entropy) is 2 bits.
Alternatively, consider the case where the outcomes are not equally probable (Fig. 2, bottom),
but instead:

pX = { 1/2, 1/4, 1/8, 1/8 }

In this case, the shortest lossless encoding scheme assigns shorter codewords to the more probable
colors (for example 0, 10, 110, and 111), and the average number of bits per event again equals the
entropy: 1.75 bits.
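To make the connection between probabilities, codewords, and entropy concrete, here is a minimal sketch. The color probabilities follow the discussion of Figure 2 (blue 1/2, gray 1/4, yellow and green 1/8 each), and the particular prefix code 0, 10, 110, 111 is one valid choice picked for illustration:

```python
import math

# Outcome probabilities for the skewed urn (Fig. 2, bottom)
probs = {"blue": 1/2, "gray": 1/4, "yellow": 1/8, "green": 1/8}

# One possible prefix-free code matched to these probabilities
code = {"blue": "0", "gray": "10", "yellow": "110", "green": "111"}

entropy = sum(p * math.log2(1 / p) for p in probs.values())
avg_len = sum(probs[c] * len(code[c]) for c in probs)
print(f"entropy = {entropy:.2f} bits, average code length = {avg_len:.2f} bits")  # both 1.75

# Encoding a sequence is unambiguous because no codeword is a prefix of another.
seq = ["blue", "gray", "blue", "green"]
print("".join(code[c] for c in seq))  # 0100111
```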
2.3 Redundancy
The redundancy of a random event depends on how uncertain its outcome is, compared to how
uncertain a variable on the same probability space could be. The latter is given by the entropy of the maximum
entropy distribution on the set of possible outcomes X , denoted Hmax (X ). The maximum entropy distri-
bution is always the one in which the probability of all events is equal (Fig. 2, top). This makes
sense because the outcome is the most uncertain, or equivalently the optimal binary recording of
the events has the longest possible length.
By setting the probability of all events equal in a particular probability space, we find that
maximum entropy is equal to the log number of possible states:
Hmax = log2 |X |
Where |X | is the number of elements in the set X . The larger the number of elements in X , the
more uncertain we can potentially be about what value it will take.
The redundancy W is the difference between this maximum entropy and the actual entropy of
the random variable:

W = Hmax (X ) − H(X)
2.4 General data compression limits limits
In order to make the example in Figure 2 illustrate the concept of entropy, we’ve made two conve-
nient choices that won’t necessarily be true in general.
First, we’ve chosen the probabilities to be all of the form 1/2^k where k is a positive integer. This
makes the information in each event equal to an integer number of bits, which means we can fully
compress the sequence by encoding each event individually. If we hadn’t done this, the information
content of a single event might equal something like 3/4 bits, and encoding it in a one digit binary string
would waste 1/4 bit of space and yield a redundant binary representation. This technicality is why
the entropy is in general less than or equal to the expected length of the shortest binary encoding,
rather than always exactly equal. We’ve specifically picked the case where it is equal for illustrative
purposes. It is possible to get closer to this bound in the more general case by encoding sequences
of multiple events together into a single binary string, otherwise known as block encoding.
The second simplification we’ve made is that the random sequences of colors we’ve chosen are
made up of exactly the expected number of each color. For example, the bottom sequence is half
(8 of the 16 events) blue, a quarter (4 of the 16 events) gray, etc. In reality, different random sequences might yield
binary strings with very different lengths. For example, a sequence of all blue (which happens to be
the most probable sequence) would have a 16 bit binary encoding, while a string of all greens (one
of the least probable sequences) would yield a 48 bit binary encoding. The sequences shown in Figure
2 are called typical sequences: for a small tolerance ε, a length-N sequence is typical if its average
information content per event is within ε of the entropy,

H(X) - \epsilon \leq -\frac{1}{N} \log p(x_1, x_2, \ldots, x_N) \leq H(X) + \epsilon
Does the choice of ε matter? For the sequences in Figure 2, we’ve conveniently avoided this
question since the sequences shown have an information content exactly equal to N H(X). However,
as N → ∞, we would see essentially the same behavior, because all of the probability mass of
the distribution over sequences concentrates on the typical set defined with any positive value of
ε (Fig. 3). Furthermore, as N → ∞, all typical sequences will occur with approximately the
same probability: ≈ 2^{−N H(X)}. Since the total probability is 1, and each typical sequence occurs
with probability ≈ 2^{−N H(X)}, there must be ≈ 2^{N H(X)} typical sequences. This result is known as
the asymptotic equipartition property (AEP). It is a direct consequence of the Law of Large
Numbers, and it is used to prove many important theorems in information theory.
To summarize, the AEP says that with infinite sequence length, we can ignore the vanishingly
small probability that falls on non-typical sequences. As a result, we can focus on the typical set of
≈ 2^{N H(X)} sequences, which each occur with probability ≈ 2^{−N H(X)}.
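The concentration described by the AEP can also be seen in a small simulation (a sketch of our own; the sampling parameters are arbitrary): drawing length-N sequences from the skewed urn distribution and checking how the per-event information content clusters around H(X) = 1.75 bits as N grows.

```python
import math
import random

probs = [1/2, 1/4, 1/8, 1/8]                    # blue, gray, yellow, green
H = sum(p * math.log2(1 / p) for p in probs)    # 1.75 bits

def per_event_information(N, rng):
    """Draw an IID sequence of length N and return -(1/N) log2 p(sequence)."""
    draws = rng.choices(range(4), weights=probs, k=N)
    return sum(math.log2(1 / probs[i]) for i in draws) / N

rng = random.Random(0)
for N in (10, 100, 10_000):
    samples = [per_event_information(N, rng) for _ in range(1000)]
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    print(f"N={N:>6}: mean={mean:.3f} bits per event, spread={spread:.3f}")
# The mean stays near H = 1.75 bits and the spread shrinks as N grows,
# i.e. nearly every sampled sequence is typical for a small tolerance.
```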
A data compression scheme that encoded only sequences in the typical set and discarded all
other sequences would require log2 2^{N H(X)} = N H(X) bits to assign a unique binary string to each
element of the typical set. Since there is negligible probability mass outside the typical set, the
scheme would be lossless in the limit as N → ∞. Thus, lossless compression of a length N
sequence requires N H(X) bits. Using fewer bits would result in different typical sequences mapping
to the same binary string. Using more would waste bits on non-typical sequences that occur with
nearly zero probability.
Interestingly, the typical set does not include the most probable individual sequences. For example,
in Figure 3b, the sequence for each N that consists of N blues in a row is the most probable
sequence, yet it would be omitted from the typical set for a small value of ε. For any finite N you could
improve the compression algorithm described above by taking the binary string used for the least
probable typical sequence and instead using it for the sequence of all blue. This demonstrates an
important point about the typical set. It is useful for theoretical work, because the number of sequences
it contains can be counted. However, in practical situations (i.e. N < ∞), better compression can
be achieved by designing for the set of most probable sequences. As N → ∞, this becomes less and
less true, because the most probable sequences contain vanishingly small total probability mass.
(Figure 3: b) Probability concentrates onto “typical sequences”, each with probability ≈ 2^{−N H(X)}. Panels show the number of sequences and their probability-weighted counts for N = 1, 2, 3, 7, 13, for a source with entropy 1.75 bits.)
The relationships between entropy, redundancy, probability distributions, and typical sequences
are shown in Figure 4.
(Figure 4: the relationship between a distribution’s probability mass, its typical sequences, and its entropy and redundancy relative to the maximum entropy, in bits.)
Consider the scenario of drawing not colored marbles, but colored objects from an urn. We have
the same four possible colors, blue, green, yellow, and gray, and now there are also four possible
object shapes: ■, ▲, ⋆, and •. Our random draws will now be shape-color combinations, pairing one
of the shapes with one of the colors. X will still represent the object’s color, and now we’ll also use Y to represent its
shape. The random event of drawing a particular shape with a particular color is (X, Y), and its value
is governed by the joint probability distribution over X × Y, the outcome space of all shape/color
combinations.
Suppose that we do not observe which object was drawn, but only a black and white photograph
of the object in which all colors look the same. What can we say about the object’s color (X), having
only observed its shape Y? In other words, how much information does its shape convey about its
color? The answer to this question is determined by the mutual information between X and Y.
Mutual information, denoted I(X; Y), quantifies how well you can discriminate alternative pos-
sibilities of an unknown random event X based on observing some other random event Y. In other
words, it tells you how much of the information gained from observing Y = N will reduce your
uncertainty about the object’s color, X. The minimum possible value is I(X; Y) = 0, which means
that X and Y are independent–observing shape will not reduce your uncertainty about color at
all. The maximum possible value is the lesser of H(X) and H(Y), because a random event can
neither convey more information about another event than it itself contains, nor can it convey more
information about the other event than the other event contains. Mutual information is symmetric:
I(X; Y) = I(Y; X).
Another common way of quantifying the dependence between two variables is through correlation.
In some sense, mutual information can be thought of as a more general version of correlation, because
unlike correlation, which usually captures only linear relationships between the variables, mutual
information captures any sort of statistical dependency between them.
To compute mutual information, we need to know the joint probability distribution of the two
variables X and Y. That is, we need to know which shapes come in which colors. For a particular
draw, say a blue ⋆, we can calculate the amount of information that the color and shape convey about
one another using the point-wise mutual information:
\log \frac{p(x, y)}{p(x)p(y)} = \log \frac{p(x \mid y)}{p(x)} = \log p(x \mid y) - \log p(x) = \log \frac{1}{p(x)} - \log \frac{1}{p(x \mid y)}
The final expression can be interpreted as the surprise of the outcome x minus the surprise of x when
y is already known. If knowledge that Y = y makes X = x less surprising, then 1/p(x | y) will decrease,
reflecting the fact that y tells us something about the value of x. For example, if there are many
different colors of objects, but all ⋆’s are blue, then learning the object is a ⋆ will eliminate all
uncertainty about its color: p(blue | ⋆) = 1. This means that in this case all of the information
provided by the outcome X = blue is shared by Y = ⋆.
By taking a probability-weighted average of the point-wise mutual information over all possibil-
ities, we arrive at the mutual information, which represents the average amount of information
provided from the outcome of one random event that is relevant to another. Mathematically:
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \qquad (1)
Just as in Figure 1, when one bit of information meant we could rule out half of the probability
mass, for each bit of mutual information between X and Y, we can eliminate half of the probability
mass of one random event after observing the outcome of the other. This is most easily seen if
all outcomes are equally probable, because in this case eliminating half of the probability mass
corresponds to eliminating half of the outcomes.
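Equation (1) is easy to evaluate once the joint distribution is written out as a table. The sketch below is our own; the three example joint tables are hypothetical stand-ins chosen to mimic the three rows of Figure 5 rather than to reproduce them exactly.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table with colors as rows, shapes as columns."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)        # marginal over colors
    py = joint.sum(axis=0, keepdims=True)        # marginal over shapes
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Each shape comes in exactly one color: 2 bits of shared information.
perfect = np.eye(4) / 4
# Each shape narrows the color down to two equally likely options: 1 bit.
partial = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]]) / 8
# Shape and color are independent: 0 bits.
independent = np.full((4, 4), 1 / 16)

for name, joint in [("perfect", perfect), ("partial", partial), ("independent", independent)]:
    print(name, round(mutual_information(joint), 3))
```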
Figure 5 shows three different scenarios with different amounts of mutual information, and four
different ways of visualizing the relationships between the two variables. When I(X; Y) = 0 bits,
observing the object’s shape does not allow us to eliminate any possibilities of the object’s color. A
mutual information of 0 between two random variables is equivalent to those two random variables
being independent. Alternatively, when I(X; Y) = 1 bit, after observing the shape only, we can
eliminate two of the four possible colors. Finally, when I(X; Y) = 2 bits, we can eliminate three of the
four possible colors and know exactly what color it is. In this case, we have gained all 2 bits of
information that were present in X.
Since we can at most gain 2 bits from observing 1 of 4 possible shapes (i.e. Hmax (Y) = 2), if
there were more than 2 bits of information in X, we wouldn’t be able to determine color exactly.
For example, if there were 8 possible colors that were all equally probable, we could not uniquely
identify all possibilities based on 4 possible shapes.
Figure 5: Mutual information describes the relationship between two random variables. Here
those random variables are the shape and color of an object drawn at random. The joint distribu-
tion of shape and color determines the amount of mutual information. (Top row) 2 bits of mutual
information, (middle row) 1 bit of mutual information, (bottom) 0 bits of mutual information.
a) The joint distribution of shape and color, with uniform probability over all possible shape/color
combinations shown. b) Mapping view showing the colors, possible shape color combinations, pos-
sible shapes, and the possibilities for colors that can be inferred from shape alone. Line thickness
shows strength of the relationship. c) Compact view that omits the joint distribution and color
inference. d) More of the entropy of the two events is shared with greater mutual information.
2.6.1 Conditional entropy
Mutual information quantifies how much knowing the outcome of one random event reduces un-
certainty about a second random event, while conditional entropy quantifies how much uncertainty
remains about the second random event. Adding the two together gives the total uncertainty of the
second random event on its own.
Once again, we start with an example in the point-wise case, where the information gained by
learning the outcome of a random event X after already knowing the outcome of a second random
event Y = y is:
\log \frac{1}{p(x \mid y)}
Or, in the specific case of learning an object is green after knowing it is a •:
\log \frac{1}{p(\text{green} \mid \bullet)}
The surprise of finding a green • depends on how many different colors of • are in our set of
objects. In the same way we calculated entropy as a probability-weighted average of all possible
outcomes of a random event, we can similarly compute an average uncertainty of the object’s color,
given that the object is a • (Fig. 6a). This yields the point-wise conditional entropy:
H(X \mid \bullet) = \sum_{x \in \{green, blue, yellow, gray\}} p(x \mid \bullet) \log \frac{1}{p(x \mid \bullet)} \qquad (2)

Or, for a general conditioning outcome y:

H(X \mid y) = \sum_{x \in \mathcal{X}} p(x \mid y) \log \frac{1}{p(x \mid y)}
Looking at the joint distribution in the middle row of Figure 5a, knowing that the object is a
• leaves three possible colors: green, yellow, or blue. Since we’ve specified that these occur with
equal probability, the point-wise conditional entropy will be log2 3 ≈ 1.58 bits.
Alternatively, if we looked at the conditional entropy of color given that the object is a ⋆, the
only possible color would be green. This means the point-wise conditional entropy is 0–we know
the value of X exactly given Y.
Next, we might want to consider the average uncertainty of color, conditional not specifically on
•, but averaged over all possible shapes. To do so, we’ll sum equation (2) over all possible states in
Y (i.e. ⋆, ■, ▲, and •), weighting by the probability of each shape. The generic form of this is:
H(X \mid Y) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x \mid y) \log \frac{1}{p(x \mid y)}
With some slight algebraic manipulation, we get to the traditional equation for conditional
entropy:
H(X \mid Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} p(x, y) \log \frac{1}{p(x \mid y)}
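These sums are straightforward to compute from a joint probability table. The sketch below uses a hypothetical joint distribution of our own, constructed so that a • leaves three equally likely colors while every other shape determines its color exactly; it is not the exact distribution of Figure 5.

```python
import numpy as np

def conditional_entropy(joint):
    """H(X | Y) in bits from a joint table with colors as rows and shapes as columns."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)                       # marginal distribution over shapes
    total = 0.0
    for j, pyj in enumerate(p_y):
        if pyj == 0:
            continue
        p_x_given_y = joint[:, j] / pyj           # conditional column p(x | y = j)
        nz = p_x_given_y > 0
        total += pyj * np.sum(p_x_given_y[nz] * np.log2(1 / p_x_given_y[nz]))
    return float(total)

# Columns: star, square, triangle, bullet. Rows: green, blue, yellow, gray.
# The first three shapes each come in one color; the bullet comes in three.
joint = np.array([[3, 0, 0, 1],
                  [0, 3, 0, 1],
                  [0, 0, 3, 1],
                  [0, 0, 0, 0]]) / 12

# Point-wise conditional entropy of color given the bullet shape (last column):
p_bullet = joint[:, 3] / joint[:, 3].sum()
nz = p_bullet > 0
print(np.sum(p_bullet[nz] * np.log2(1 / p_bullet[nz])))   # log2(3) ≈ 1.585 bits

print(conditional_entropy(joint))                          # weighted average ≈ 0.396 bits
```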
Figure 6: Point-wise conditional entropy (Top) The joint and marginal distributions of shape
and color. (Bottom) The compact mapping view between shape and color. a) The conditional
entropy of color given that the shape is • is log2 3 ≈ 1.58 bits since there are three equally probable
possibilities (in magenta box on distribution view) b) The conditional entropy of shape given a blue
object is 1 bit since there are two equally probable shapes.
When H(X | Y) = 0, after observing the outcome of Y we know exactly what the outcome of X
is – there is no remaining uncertainty. Alternatively, if H(X | Y) > 0 then the outcome of X is not
fully known having observed Y, at least for some outcomes of Y.
A closely related quantity is the joint entropy, the total uncertainty of the two random events considered together:

H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}
For example, p(x, y) in the example shown in Figure 5 would represent the probability of a
particular shape-color combination.
In the case that X and Y are independent, observing the value of Y provides no information
about X. Thus, H(X, Y) = H(X) + H(Y). This can be proven algebraically by substituting
p(x, y) = p(x)p(y) into the joint entropy formula, as shown in Appendix B.1. (Independence can be
expressed equivalently as p(y) = p(y | x) or as p(x, y) = p(x)p(y).)
When X and Y are dependent, the joint entropy will be less than the sum of the individual
entropies, since there will be nonzero mutual information. For example, in the top row of Figure 5,
H(X) = 2, H(Y) = 2, but H(X, Y) = 2 also, since the 2 bits are mutual to both random variables.
Figure 7: The relationships between entropy, joint entropy, conditional entropy and
mutual information The width of each bar represents its size (in bits). Adapted from [2], p140
Mutual information can be written in three different ways in terms of the other entropies:

I(X; Y) = H(X) - H(X \mid Y)

I(X; Y) = H(Y) - H(Y \mid X)

I(X; Y) = H(X) + H(Y) - H(X, Y)
An example of how I(X; Y) can be manipulated into these other forms can be found in Appendix
B.2.
The first two can be interpreted as the average uncertainty in one random event after subtracting
the uncertainty that remains having observed a different random event. The third can be interpreted
as the total uncertainty of both random events in isolation minus the uncertainty of both
considered together.
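These identities can be checked numerically for any joint distribution. In the sketch below (our own helper code), a random joint table is drawn, the conditional entropies are computed directly from their definitions, and all three expressions are confirmed to give the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 4))
joint /= joint.sum()                     # a random joint distribution p(x, y)

def H(p):
    """Entropy in bits of a probability array."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

px = joint.sum(axis=1)                   # p(x)
py = joint.sum(axis=0)                   # p(y)

# Conditional entropies computed directly from the definition
H_X_given_Y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(joint.shape[1]))
H_Y_given_X = sum(px[i] * H(joint[i, :] / px[i]) for i in range(joint.shape[0]))

I1 = H(px) - H_X_given_Y                 # I(X;Y) = H(X) - H(X|Y)
I2 = H(py) - H_Y_given_X                 # I(X;Y) = H(Y) - H(Y|X)
I3 = H(px) + H(py) - H(joint)            # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(round(I1, 6), round(I2, 6), round(I3, 6))   # all three agree
```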
Entropy is additive for independent random variables, so the entropy of an IID sequence X = (X1 , X2 , . . . , XN ) will
be equal to N times the entropy of a single event X. Thus, H(X), the entropy per event, is called the entropy
rate of the process. This quantity is important because if we are trying to transmit information
through sequential uses of a communication channel, our channel will need to be able to keep up
with this rate (as discussed in greater depth in section 4.2.2).
Alternatively, suppose that this IID assumption does not hold. What is the entropy rate of
a stochastic process with an arbitrary joint probability distribution? There are two ways we might
do this, and under certain assumptions, they are equivalent ([3] Sec. 4.2).
Returning to the example of drawing marbles, suppose we are drawing from a magical urn that,
the first time we draw from it, will give all four colors with equal probability. After the first draw,
the magic kicks in, and the urn makes it more likely that we will again draw the same color as
our previous draw. For example, the conditional probability p_{X_{N+1} | X_N}(blue | blue) = 5/8, while
p_{X_{N+1} | X_N}(green | blue) = 1/8 (Fig. 8a). If we write out all the possible conditional probabilities,
we will get a matrix where each column represents the conditional distribution p_{X_{N+1} | X_N}. A typical
sequence from this distribution will tend to give repeats of the same color (Fig. 8b).
What can we say about uncertainties in this process? We are more uncertain about the color of
the first draw X1 (all colors with 1/4 probability) than we are about the second color after observing
the first (5/8 chance of the same color again). This means that H(X1 ) > H(X2 | X1 ) (Fig. 8c).
Knowing X1 reduces our uncertainty about the value of X2 , and similarly knowing X2 would reduce
our uncertainty about the value of X1 . This means that the joint entropy is less than the sum of
the individual entropies H(X1 , X2 ) < H(X1 ) + H(X2 ).
What about the entropy rate of the process? There are two ways we could define this. First,
we could simply say that the entropy rate is our average uncertainty over all draws. This can be
quantified by dividing the joint entropy by the number of draws to get an average of the information
in each draw:
\frac{1}{N} H(X_1 , X_2 , \ldots, X_N) \qquad (3)
Figure 8: Entropy rate of a stochastic process a) The stochastic process, which consists of
(top) an initial draw of a colored marble with each color having a probability of 1/4 and (bottom)
subsequent draws where the probability of repeating the same color as the previous draw is 5/8 and
the probability of each other color is 1/8. b) A typical sequence from this stochastic process where a
random variable Xk represents the color selected at each position. c) Entropy, conditional entropy,
and joint entropy of the first three draws. Knowledge of past outcomes reduces uncertainty of future
outcomes (or vice versa). d) The two ways of computing entropy rate: the average of the joint
entropy and the conditional entropy of the next draw given the previous. For a stationary stochastic
process, these converge to the same value as N → ∞.
Alternatively, we could say that the entropy rate is our uncertainty about the next draw, given
the previous ones. This is quantified with the conditional entropy:

H(X_{N+1} \mid X_N , X_{N-1} , \ldots, X_1) \qquad (4)
Under our example, (3) will gradually decrease as N increases, as our high uncertainty about
the first draw has a proportionally smaller and smaller effect compared to the subsequent, more
predictable draws.
For the second definition, since our model specifies that draw N + 1 only depends on the outcome
of draw N (a type of model called a Markov chain), (4) will simplify to:
H(X_{N+1} \mid X_N) \qquad (5)
Thus, it will have a large value for X1 and then be slightly lower and stay constant for all
subsequent draws.
As N → ∞, both measures will converge to the same value for the entropy rate because the high
uncertainty of the first draw is ultimately neglected for both, so the two definitions are equivalent in
the limit. This will be true whenever the stochastic process is stationary, which means that its joint
probability distribution is shift-invariant. Mathematically, a stationary process has the property:
p_{X_{N+1} \mid X_N , X_{N-1} , \ldots} = p_{X_{N+1+k} \mid X_{N+k} , X_{N-1+k} , \ldots}
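For the magical-urn example, both definitions of the entropy rate can be evaluated directly, since for this first-order Markov chain the chain rule gives H(X1 , . . . , XN ) = H(X1 ) + (N − 1) H(XN +1 | XN ). A small sketch (the probabilities 1/4, 5/8, and 1/8 come from Figure 8; the code itself is ours):

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

stay, switch = 5/8, 1/8
H_first = H([1/4] * 4)                      # uncertainty of the first draw: 2 bits
H_cond = H([stay, switch, switch, switch])  # uncertainty of each later draw, ~1.549 bits

# Definition 1: average joint entropy per draw, (1/N) H(X1, ..., XN).
for N in (1, 2, 5, 50, 1000):
    avg = (H_first + (N - 1) * H_cond) / N
    print(f"N={N:>4}: average entropy per draw = {avg:.3f} bits")

# Definition 2: conditional entropy of the next draw given the previous one.
print(f"conditional entropy rate = {H_cond:.3f} bits")
# Both converge to the same value as N grows, as expected for a stationary process.
```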
In the non-IID setting, redundancy is a function of higher order interactions between
variables rather than just of the distribution of each random variable individually. In the example
above, H(Xk ) = Hmax (X ) for every draw: all colors are equally likely in the absence of any infor-
mation about the outcomes of other draws. But the joint distribution in our example is not the
maximum entropy distribution, because the information present in the distribution of one random
variable is shared by others. This manifests in the fact that we can do much better than random
chance in guessing the next color in a sequence given knowledge of the previous ones. The maximum
entropy joint distribution would be independent and identically distributed random events, each of
which has equal probability over all of its states.
2.9 Probability densities and differential entropy
The examples above all considered probability mass functions over finite, discrete sets of events.
How do the formulae and interpretations generalize to cases other than this, such as probability
density functions defined over the infinite real line?
For probability densities that are defined over the continuous real line (e.g. the normal distri-
bution, the exponential distribution), we can define an analog to entropy called the differential
entropy, replacing the discrete sum with an integral:
H(X) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx \qquad (6)

The mutual information between continuous random variables is defined analogously, with densities in place of probability masses:

I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\, dx\, dy
The mutual information also avoids the problem of the logarithm of a quantity with units that affects the differential
entropy, because for mutual information, the log is taken of a ratio of densities, in which the
units cancel out. It is also possible to compute mutual information between discrete and continuous
random variables [4].
In summary, in spite of the difficulties associated with the differential entropy over probability
densities, mutual information retains its usefulness and a clear interpretation as the amount of
information one random variable provides about another.
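One of these difficulties is that, unlike discrete entropy, differential entropy can be negative and changes when the variable is rescaled. The sketch below (the uniform-density examples are our own) makes this visible by numerically integrating −p(x) log2 p(x):

```python
import numpy as np

def differential_entropy(pdf, lo, hi, n=100_000):
    """Midpoint-rule approximation of -∫ p(x) log2 p(x) dx over [lo, hi]."""
    dx = (hi - lo) / n
    x = np.linspace(lo, hi, n, endpoint=False) + dx / 2
    p = pdf(x)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(p[mask])) * dx)

# Uniform density on [0, 1/2]: p(x) = 2, so the differential entropy is -1 bit.
print(differential_entropy(lambda x: np.full_like(x, 2.0), 0.0, 0.5))

# Uniform density on [0, 2]: p(x) = 1/2, so the differential entropy is +1 bit.
print(differential_entropy(lambda x: np.full_like(x, 0.5), 0.0, 2.0))

# A negative value is perfectly legal here, and simply measuring x in different
# units changes the answer, unlike discrete entropy, which is always nonnegative.
```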
To quantify the fidelity of a lossy reconstruction, we define a distortion function, which assigns a nonnegative cost to reconstructing a source outcome x as x̂:

d : X × X̂ → R+
In the example shown in Figure 9, X and X̂ are identical (i.e. the set of the four possible
colors), but this may not be true in every situation. For example, we might consider the distortion
associated with storing an arbitrary real number as a 32-bit floating point number on a computer.
In this case, X would be the set of real numbers and X̂ would be the set of possible 32-bit floating
point numbers.
The source produces information at some rate, which we’ll call the “source rate”, whereas the
rate describes how much of that information we are capturing (over, for example, time, space, etc.). Higher
rates are good, because we’re not losing source information. For a discrete source, the maximum
achievable rate is the entropy rate of the source H(X).
A useful scalar descriptor of the distortion for a particular source and encoding scheme is the
average value of a distortion function applied to a source and some distorted version of that source.
For example, this could be the average number of differences (the distortion function) between a
binary string from a source and the string found by passing the original string through a lossy
encoder and then decoder functions (as shown in Fig. 9b):
(Figure 9, panel annotations: a) a source sequence with information content 32 bits is compressed to a 32 bit encoding and perfectly decoded; b) the same sequence is compressed to a 16 bit encoding and decoded with some distortion, which is compared against the best uninformed guessing; c) the distortion vs. rate plane, divided into possible and impossible regions.)
Figure 9: Lossy compression and rate distortion a) In lossless compression, a typical sequence
is mapped to a binary codeword with length equal to its information content and can be decom-
pressed without error. b) In lossy compression, a sequence is mapped to a sequence shorter than
its information content, and errors are present upon decompression. c) The black curve shows the
minimum number of bits needed to achieve a given average distortion.
D = \mathbb{E}\left[ d(X, \hat{X}) \right]
The limiting cases of complete information (where distortion may be zero) and zero information
(where nonzero distortion is impossible except in the trivial case of a constant source) suggest that
more information allows for lower distortion. Rate-distortion theory takes this intuition even further,
proving that if we want to achieve distortion below some threshold (i.e. D < Dmax ), we must have
at least R(D) bits of information about X. R(D) is called the rate-distortion function. This
means that if we’re already achieving the best average distortion for the amount of information we
have, I(X; X̂), we cannot do better unless we acquire more information. This is true regardless of the
chosen distortion function, provided that the distortion function d(x, x̂) is finite for every possible x
and x̂. This finding is proven in the asymptotic case of infinitely long block lengths.
This doesn’t necessarily mean that having N bits about a source will give us an X̂ that achieves
the N -bit best performance for a given distortion function. For example, if we know X perfectly, and
we applied an invertible operation like mapping each color to a new, distinct color, we would not
have lost any information. However, if our distortion function was the number of incorrect colors,
the average distortion would increase.
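As a toy illustration of the average distortion D = E[d(X, X̂)] (the one-bit encoder below is our own construction, not the scheme of Figure 9), consider compressing the four-color source down to a single bit and using a 0/1 distortion function that counts a reconstruction as wrong whenever the color differs:

```python
probs = {"blue": 1/2, "gray": 1/4, "yellow": 1/8, "green": 1/8}

# A 1-bit lossy encoder: keep only whether the marble was blue or not.
def encode(color):
    return 0 if color == "blue" else 1

# The decoder guesses the most probable color consistent with each codeword.
def decode(bit):
    return "blue" if bit == 0 else "gray"

def d(x, x_hat):
    return 0 if x == x_hat else 1          # "wrong color" distortion

# Average distortion D = E[d(X, X_hat)] under the source distribution.
D = sum(p * d(color, decode(encode(color))) for color, p in probs.items())
print(D)   # 0.25: yellow and green marbles are always reconstructed incorrectly
```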
An example rate-distortion function for a discrete source is shown in Figure 9c. The green
area represents possible combinations of rates and distortions. For sufficiently high values of average
distortion, no information about the source is necessary, so the curve intersects the horizontal axis.
For example, consider simply guessing the average outcome each time. For discrete sources, the curve
intersects the vertical axis at the value of the source’s entropy; at this point there is no remaining
uncertainty. In contrast, for a continuous source, the curve would asymptotically approach the
vertical axis, since an infinite number of bits would be needed to describe a continuous source
exactly (Sec. 2.9).
In general, R(D) will be a non-increasing convex function of D, which means that 1) achieving
lower and lower distortion will always require acquiring more information (assuming we were already
making good use of the information we had), and 2) there will be diminishing returns on lowering
distortion as we acquire more and more information.
3 Channels
Thus far we’ve considered the mathematical foundations of information theory in a general sense,
without specifying what gives rise to the statistical relationship between X and Y. Now we move to
a more specific scenario that models the transmission of information through a channel. Often,
channels model an actual physical medium for information transmission, like a band of frequencies
of radio waves or transistors on a computer.
Mathematically, a channel is some mapping from inputs in X to outputs in Y. Rather than using
the term “outcome” to describe the value taken by a random variable as in the previous section,
we now use the terms “input” and “output”. Which inputs map to which outputs is determined
by the conditional probability distribution pY|X . In the special case that pY|X can be described by
a deterministic function, then we have a noiseless channel; otherwise, we have the more general
case of a noisy channel.1
When the channel is written in matrix form, which we denote PY|X , each column is the conditional distribution p(y | X = x), which
describes the probability that the input x will map to each possible output in Y. This describes the
noise distribution of that input. Since each column is itself a probability distribution, it sums to 1.
Using the channel matrix, two quantities of interest can be computed. The output distribution
can be described by a vector of probabilities pY , and it can be computed from the input distribution
pX by multiplying it by the channel matrix (Fig. 10b):
pY = PY|X pX
The joint distribution PX,Y is also useful because it is needed in the computation of conditional
entropy and mutual information. It can be found by putting the input probability vector pX
along the diagonal of a matrix and multiplying by the channel matrix (Fig. 10c):

P_{X,Y} = P_{Y|X} \,\mathrm{diag}(p_X)
Unlike PY|X , for which each column is a probability distribution and sums to 1, all entries of
PX,Y together sum to 1 since the whole matrix is a joint probability distribution.
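These matrix operations take only a few lines of code. The sketch below uses a made-up 3-input, 3-output channel matrix (not the one shown in Figure 10) to compute the output and joint distributions:

```python
import numpy as np

# Channel matrix P_{Y|X}: each column is p(y | X = x) and sums to 1.
P = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.8, 0.2],
              [0.0, 0.1, 0.8]])

p_x = np.array([0.5, 0.25, 0.25])        # input distribution

p_y = P @ p_x                            # output distribution: p_Y = P_{Y|X} p_X
P_xy = P @ np.diag(p_x)                  # joint distribution: P_{X,Y} = P_{Y|X} diag(p_X)

print(p_y, p_y.sum())                    # a valid distribution (sums to 1)
print(P_xy.sum())                        # all entries of the joint sum to 1
```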
Figure 10: Channels a) The mapping view of a noisy channel. The channel is mathematically
represented by the conditional random variable Y | X b) Channels and inputs/outputs can also be
represented by matrices and vectors of probabilities, respectively. The vectors sum to one since they
represent probability distributions, and each column of the matrix sums to one since it represents
the conditional distribution Y | X = x. The matrix/vector representations can be used to compute
c) the output distribution and d) the joint distribution through matrix-vector and matrix-matrix
multiplications, respectively. d) The joint distribution is computed from the matrix-matrix product
of the channel matrix and a matrix with the input distribution along the diagonal.
Given a particular output, there is some remaining uncertainty about which input
gave rise to that output, H(X | y). Averaging over both inputs and outputs gives H(X | Y).
Channels that have less overlap at the output will tend to transmit more information.
Figure 11: Examples of noisy and noiseless channels. A random variable X with equal prob-
ability over all of its states is mapped to a random variable Y. These mappings can be either
deterministic or noisy and information-preserving or information-destroying. Bars to the right of
each mapping show entropy, conditional entropy, joint entropy, mutual information, and the max-
imum entropy over the state space of each random variable. Line thickness denotes magnitude of
p(y | x) a) A deterministic, information-preserving mapping with more outputs than inputs. Each
input maps uniquely to exactly one output. The outputs with no line connected have zero probabil-
ity. b) An information-preserving, noisy mapping. Each input maps into multiple outputs, but the
outputs mapped to by a particular input are all disjoint. c) A deterministic, information-destroying
mapping. Multiple inputs collide on the same output, meaning the mapping cannot be inverted. d)
A noisy, information-destroying mapping. Each input maps to multiple outputs, and the outputs
mapped to by each input are not disjoint.
Examples of noisy and non-noisy channels We now give specific examples of channels that
are either noisy or noiseless and either lose information about the inputs or preserve all of it. For
simplicity, here we assume the distribution over inputs is uniform. The ideal information transmission
system would lose no information and be able to recover X exactly given an output Y, which would
mean that I(X; Y) = H(X). One way this can occur is in a noiseless channel in which the channel
maps every input to a unique output (Fig. 11a). In this situation, H(X) = H(Y) = I(X; Y) and
H(Y | X) = H(X | Y) = 0, meaning that we could perfectly predict X given Y (or Y given X).
However, a channel does not need to be noiseless to completely preserve input information. All
input information can be recovered through a noisy channel, so long as each input maps to outputs
that don’t overlap with those of other inputs (Fig. 11b). In this scenario, H(Y | X) is no longer
equal to zero, because even knowing what the input was, we cannot predict exactly how the noise
of the channel manifests at the output. In contrast, H(X | Y) = 0 because given the output, we can
know exactly what the input was. H(Y) is greater than H(X) because it includes all of the entropy
H(X) along with some additional entropy caused by noise.
Alternatively, even if the channel imparts no noise, information can be lost. This happens if
multiple inputs map to the same output (Fig. 11c). It thus becomes impossible to tell with
certainty from the output Y which input was used, or equivalently, H(X | Y) > 0.
Finally, the most general situation, which will be encountered most in practice, is the one in
which there is both a loss of information (H(X | Y) > 0 or equivalently H(X) > I(X; Y)) and noise
present in the output (H(Y | X) > 0 or H(Y) > I(X; Y)) (Fig. 11d). Although Figure 11d
shows a scenario where H(Y) > H(X), this can also occur even if H(X) = H(Y): some of the input
entropy is replaced by noise.
To summarize: I(X; Y) is the average amount of information successfully transmitted through the
channel. H(Y | X) is the average noise at the output–uncertainty about what the output outcome
will be knowing the specific input used. H(X | Y) is uncertainty in which input was used knowing
the output outcome, which can arise from deterministic many-to-one mappings in the channel or
inputs whose noise overlaps at the output.
The different ways of decomposing mutual information presented in Section 2.7 have more specific
interpretations in the context of a noisy channel:

I(X; Y) = H(X) - H(X \mid Y)

can now be interpreted as the average input information, minus the uncertainty at the output
about which message was sent, which is determined by how much different messages overlap at the
channel’s output.

I(X; Y) = H(Y) - H(Y \mid X)

can be interpreted as the total uncertainty at the output (the sum of input uncertainty and
noise), minus the noise in the received messages.
If three random variables form a chain A → B → C, in which C depends on A only through B, then the data processing inequality states that

I(A; B) ≥ I(A; C)

and this statement can be generalized to more than three random variables.
Given a fixed noisy channel, two natural questions arise: 1) Which input distribution will maximize the information transmitted through the channel?
2) How much information will the channel be able to transmit if this optimal input distribution is
used? The answer to the latter question is called the channel capacity and is denoted by C.
The quantities can be found by solving an optimization problem:

C = \max_{p_X \in \mathcal{P}_X} I(X; Y) \qquad (9)

where PX is the set of probability distributions on the space X . In the case where X is discrete,
this will be the set of nonnegative vectors whose sum is 1 (formally known as the probability
simplex). The computational details of how to solve this optimization problem can be found in
Appendix A. Here, we focus on the intuition behind its objective function.
Mutual information can be decomposed as I(X; Y) = H(Y) − H(Y | X), and this decomposition
is useful in analyzing the competing goals of this optimization problem. Consider the noisy channel
shown in Figure 12a, which has two inputs that are noisy but map to distinct outputs, and one
input that is less noisy but has outputs that overlap more with the outputs of the other inputs.
If we were to not maximize mutual information, but instead only maximize the first term in
the decomposition, H(Y), probability would be placed on inputs in such a way that the output
probability was as uniformly distributed as possible. This is beneficial because it means more of
the channel’s outputs are utilized, which means different channel inputs have more space to avoid
overlapping on the same outputs. In the case of this channel, this would result in putting half the
probability mass on each of the two noisy but non-overlapping inputs, which would result in 1 bit
of information gained with each use of the channel (Fig. 12b).
Alternatively, if we were to optimize only the second term, minimizing H(Y | X),
probability would be placed on the inputs in such a way as to select for the least noisy
inputs. Since the middle input is less noisy than the other two, this would result in all of the
probability mass being placed on it, which would lead to no information being transmitted.
By optimizing the full objective, these goals are balanced: some probability is placed on the
least noisy middle input, while some probability is placed on the noisier inputs, which makes use of
the full output space. This leads to an information transmission of 1.13 bits, which is the channel
capacity C.
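The optimization in equation (9) can also be approximated numerically. The sketch below uses a hypothetical channel of the same flavor as Figure 12a (two noisy inputs with non-overlapping outputs and one cleaner input that overlaps with both; the figure's exact probabilities are not reproduced) and performs a crude brute-force search over the probability simplex. The Blahut-Arimoto algorithm is the standard, more efficient approach.

```python
import numpy as np
from itertools import product

def mutual_information(p_x, P):
    """I(X;Y) in bits given an input distribution p_x and a channel matrix P[y, x]."""
    joint = P * p_x                          # joint[y, x] = p(y | x) p(x)
    p_y = joint.sum(axis=1)
    outer = np.outer(p_y, p_x)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Hypothetical channel: inputs 0 and 2 are noisy but land on disjoint outputs,
# input 1 is less noisy but its outputs overlap with both of the others.
P = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.1, 0.0],
              [0.0, 0.8, 0.0],
              [0.0, 0.1, 0.5],
              [0.0, 0.0, 0.5]])

best_val, best_px = -1.0, None
steps = 50
for a, b in product(range(steps + 1), repeat=2):
    if a + b > steps:
        continue
    p_x = np.array([a, b, steps - a - b]) / steps
    I = mutual_information(p_x, P)
    if I > best_val:
        best_val, best_px = I, p_x

print(f"capacity ≈ {best_val:.3f} bits at input distribution {best_px}")
```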
4 Channel coding
In information transmission problems, there will usually be a source of random events S. We will
refer to the outcomes of S as messages. We don’t have any control over the messages or the channel,
but we do have the ability to design a function called an encoder, which will map certain messages
to certain channel inputs.
In this section we discuss the problem of channel coding, which addresses the question of how
we should design encoders to maximize information transmission, and minimize the probability that
we incorrectly infer the source message.
4.1 Encoders
An encoder is a deterministic function that maps messages from the source to the inputs of a
channel. We’ll make the assumption, unless otherwise stated, that every message is mapped to a
unique input, so no information is lost from S to X. A good encoder will maximize the mutual
information between X and Y, so its goal is to map S to X in such a way that the distribution of
Figure 12: Optimizing mutual information for a fixed channel gives both the optimal input
distribution and the channel capacity. a) Matrix representation of a noisy channel. b) Only maxi-
mizing output entropy disregards putting probability on the least noisy input. c) Only minimizing
the average input noise fails to utilize the full output space. d) The full objective function
balances these competing goals.
X will balance: 1) having probability mass on non-noisy inputs, and 2) spreading probability of the
channel outputs as uniformly as possible to avoid different inputs overlapping at the output.
Figure 13a shows a noisy channel which maps each input X = x to two possible outputs with
equal probability. Our goal is to transmit messages from a non-redundant source (Fig. 13b), in
which three colors occur with equal probability, through this noisy channel with minimal information
loss.
Figure 13c shows a bad way of doing this: using three states in X that have overlapping
outputs in Y, which leads to a loss of information that creates ambiguity as to what the original
source message was.
Alternatively, we could design an encoder which uses only a subset of the states in X that produce
disjoint outputs in Y (Fig. 13d). In this case, there is no loss of information.
This encoder is not unique, since we can randomly permute which state of S maps to which
state of X and achieve the same mutual information.
Figure 13: Optimal encoders for a noisy channel. a) a noisy channel that maps each input to
two possible outputs. b) A maximum entropy source of random messages. c) An encoder that maps
messages in S to inputs in X that produce overlapping outputs that are not uniquely decodable,
thus transmitting less information than the maximum possible. d) An encoder that transmits the
maximum amount of information by mapping messages to inputs that produce disjoint, and thus
decodable, outputs. e) Another encoder/decoder pair that transmits the maximum amount of
information, showing that the optimal encoder is not always unique.
Figure 14: Matching sources and channels for maximum information transmission. a) A
symmetric noisy channel, in which all inputs are equally noisy and equally overlapping (for a uniform
input distribution) and an asymmetric noisy channel, in which all inputs are equally noisy, but two
pairs of inputs overlap more at the output with each other than the other pair (for a uniform input
distribution). b) Symmetric and asymmetric sources with the same entropy. c) The symmetric
source can be encoded to transmit more information in the symmetric channel, because it is able
to choose inputs such that overlap equally at the output, whereas the asymmetric source is not.
For the asymmetric channel, this is reversed, and the more probable messages from the asymmetric
source can be encoded such that they are less overlapping at the output.
To make this concrete, consider two different noisy channels (Fig. 14a): The first channel (the “symmetric channel”) has inputs that are all equally noisy and all have
outputs that overlap equally. The second channel (the “asymmetric channel”) also has equally noisy
inputs, but their outputs are not equally overlapping with one another: it has two pairs of inputs
whose outputs overlap less with each other than do the outputs within each pair.
There are also two different sources with the same entropy (Fig. 14b). The first source (the
“symmetric source”) has three equally probable outcomes, while the second source (the “asymmetric
source”) has two more probable and two less probable outcomes. For channels and sources with a
small number of states such as the ones shown here, the optimal encoder can be found by exhaustively
searching all possible ways of mapping messages to channel inputs.
For the symmetric channel, the optimal encoder for the symmetric source is able to transmit more
information than the optimal encoder for the asymmetric source, but for the asymmetric channel,
the reverse is true (Fig. 14c). In the latter case, this occurs because the encoder is able to map
each of the source’s two most probable messages (green and gray) to distinct subsets of channel
inputs that overlap less at the output. This is not possible for the symmetric source, because all of
its messages are equally probable.
In contrast, for the symmetric channel, more information can be transmitted about the symmetric
source than the asymmetric source. This is because the outputs for all inputs are equally
overlapping, which forces the asymmetric source to map its highly probable messages to channel outputs
with more overlap.
Maximal information transmission Having seen that the channel capacity C can be measured
by finding the optimal input distribution, and that encoders for different sources that map one mes-
sage at a time to channel inputs can achieve different maximal amounts of information transmission,
it can be concluded that the amount of information transmitted is generally less than the channel capacity
under these circumstances. However, a central result in information theory, the Noisy channel
coding theorem, proves that it is in fact possible to transmit information up to the channel capac-
ity for any source/channel pair. The key to making this possible is not encoding one message at a
time, but rather many messages at once. This both changes the distribution of the source messages
and the noise per each channel input so that they are more uniform and can be more easily matched
by an encoder.
original message. 5) The compressed messages are decompressed back onto the original probability
space.
Steps 2-4 correspond to the random variables:
• S: The (compressed) source message
• X: The input to the channel (the encoded message)
• Y: The output of the channel (the received message)
• Ŝ: The estimate of the (compressed) message
Figure 15: Noisy channel coding: the big picture A redundant source passes into a compressor,
which yields a compressed (i.e. maximum entropy) random message S. S then passes into an encoder,
which adds redundancy to create robustness to passage over a noisy channel. The encoded message
X passes through the channel, yielding the received message Y, which possibly contains errors. Y
then goes into a decoder, yielding an estimate of S, Ŝ, and finally a decompressor to give an estimate
of the source. Adapted from [2] p146.
Figure 16: An extended noisy channel is formed by treating multiple uses of a single channel as
a channel itself. a) One use of the original channel. b) one use of the extended channel.
The larger state space of the source means that we’ll also need a channel with a larger state
space, which we can form by using an extended noisy channel. This is simply using the channel
N times, and considering a meta-channel whose inputs and outputs are all possible combinations
of the original channel’s inputs and outputs (Fig. 16).
As a result of this separation and the theorem justifying it, we can disregard the source coding
problem (i.e. the compressor and decompressor) and focus only on the channel coding problem
(encoder → noisy channel → decoder). Since any source of random events can be compressed
into a maximum entropy source (Sec. 2.4), we can simply assume this has already taken place.
R = \frac{\#\,\text{source bits}}{\#\,\text{transmitted bits}}
(Note: This rate is not exactly the same as the rate used in rate-distortion theory (Sec. 2.10).
Rate as used here is defined as a ratio of source bits to transmitted bits, whereas for rate-distortion
theory it describes the absolute amount of information about a particular source.)
Next, the redundant message passes through a noisy channel. The channel shown in Figure 17a
shows a binary symmetric channel in which 0s and 1s transmit correctly with probability 4/5 and
flip with probability 1/5.
Finally, a decoder interprets the received message and attempts to correct any errors and recon-
struct the original message.
A naive approach to this problem is through repetition coding (Fig. 17b). Here, the message
is repeated N times before being sent through the channel, and the decoder chooses the bit at each
position that occurred the greatest number of times in the repetition-coded message. The wrong
message can be received if more than half of the bits in a given position flip. To lower the probability
of this happening, we can increase the number of repetitions. This reduces the probability of error,
but in doing so lengthens the message needed for each event and thus lowers the rate at which
information can be transmitted. As we specify a lower and lower maximal acceptable probability of
message error, the rate at which information is sent approaches 0 (Fig. 17c).
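As a rough numerical illustration of this trade-off (a hypothetical Python sketch, not the paper's code), the following simulates repetition coding over a binary symmetric channel with flip probability 1/5 and majority-vote decoding. The error probability falls as the number of repetitions grows, while the rate 1/N shrinks toward 0:

import numpy as np

rng = np.random.default_rng(0)
flip_prob = 0.2          # channel flips each bit with probability 1/5
n_trials = 100_000

for n_reps in [1, 3, 5, 7, 9]:
    # Send a single source bit, repeated n_reps times, through the channel
    source_bits = rng.integers(0, 2, size=n_trials)
    transmitted = np.repeat(source_bits[:, None], n_reps, axis=1)
    flips = rng.random((n_trials, n_reps)) < flip_prob
    received = transmitted ^ flips.astype(int)
    # Majority-vote decoder: pick the bit value seen most often
    decoded = (received.sum(axis=1) > n_reps / 2).astype(int)
    p_error = np.mean(decoded != source_bits)
    print(f"N={n_reps}: rate={1/n_reps:.2f}, P(error)={p_error:.4f}")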
[Figure 17 panels: a) source (random events with no redundancy) → encoder (add redundancy for noise robustness) → noisy channel (transmit binary strings, with errors; transmitted vs. received probabilities 0.8/0.2) → decoder (correct channel errors and infer event); b) repetition coding; c) probability of error vs. rate, with more repetitions pushing the rate toward 0; d) block coding; e) probability of error vs. rate for constant-rate, variable block length encoders.]
Figure 17: The noisy channel coding theorem a) An overview of the problem setup. A source
produces non-redundant (i.e. maximum entropy) messages, here a colored marble which is trans-
formed into a two-digit binary encoding. An encoder adds redundancy to prepare a transmitted
message (e.g. shown here, repeating the message 3 times). The rate is defined as the ratio of source
bits to transmitted bits. The message passes through a noisy channel (e.g. shown here, one that flips each bit with probability 1/5). A decoder attempts to correct errors and recover the original message.
b) A repetition coding scheme (same as part a) repeats the source bits a fixed number of times
and c) can achieve arbitrarily small probability of error (for the noisy channel in (a)), but needs
to send information at a rate approaching 0 to do so. d) A block coding scheme can do better, by
encoding multiple events together. e) As the block length goes to ∞, such a scheme can achieve
communication with arbitrarily small probability of error at rates up to the channel capacity. At
rates above the channel capacity, infinite block lengths will have nonzero probability of error.
The noisy channel coding theorem proved that we can do better than this. To do so, we must
send multiple messages at once, also known as block coding. In the case of this binary channel,
this is done by taking multiple binary messages and passing them through an encoder that maps them to a single binary string. The encoder in Figure 17d is called a Hamming code, which is a good
choice for this problem, but not specifically needed for the noisy channel coding theorem. Similar
to the previous case, those bits are then transmitted and the decoder attempts to correct errors and
reconstruct the original message.
As the bottom panel of Figure 17d shows, by transmitting a sequence of random events as a
single outcome, we can increase the block length of our transmission system while leaving the rate
unchanged by keeping the ratio of source bits to transmitted bits constant.
The noisy channel coding theorem considers the performance of channels as the block length
goes to infinity, and it has two important implications (Fig. 17e):
1. Encoder/decoder pairs exist which can transmit messages with arbitrarily small probability of
error at nonzero rates up to the channel capacity
2. At rates above the channel capacity, no encoder/decoder pairs can transmit messages with
arbitrarily small error probability
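Stated a bit more formally (a paraphrase of the standard statement in the notation of this paper, not the theorem's original wording): for a channel with capacity C and block length N,
\[
R < C: \quad \text{for any } \epsilon > 0 \text{ there exist rate-}R \text{ encoder/decoder pairs with } P(\hat{S} \neq S) < \epsilon \text{ for large enough } N,
\]
\[
R > C: \quad P(\hat{S} \neq S) \text{ is bounded away from zero for every rate-}R \text{ encoder/decoder pair, no matter how large } N \text{ is.}
\]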
The theorem does not state how one can find the appropriate encoders/decoders to achieve this
performance, only that they exist. Specifically, it shows that the average performance of randomly
constructed encoders achieves this, which implies that at least one of the individual randomly con-
structed encoders can achieve it. Finding well-performing codes that can also be decoded efficiently is not straightforward based on this theorem alone, in part because extending the block length makes naive decoding algorithms exponentially more complex. It took several decades of research to discover codes
for binary channels that approach the performance the theorem implies.
have positive values. By looking at the distribution of point-wise conditional entropies for each
channel input, we can assess how heterogeneous the channel inputs are in terms of their noisiness
(Fig. 18). The absolute noise level always increases as we make extended noisy channels with longer block lengths, but the noise of each extended input becomes increasingly close to the block length N times the average noisiness of the non-extended channel, H(Y | X). This is a consequence of the Law of Large Numbers, as shown in Appendix B.3.
[Figure 18: a) An extended noisy channel (first and second channel use). b) Channel inputs become equally noisy as N → ∞: histograms of the per-input noise H(Y | x), in bits, over all extended channel inputs for N = 1 through 5.]
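The concentration behind Figure 18b can be reproduced numerically with a small sketch (hypothetical Python, using an example asymmetric binary channel, since a symmetric channel's inputs are already equally noisy). It enumerates every input of the N-use extended channel and shows that the spread of the per-use noise H(Y | x)/N shrinks around H(Y | X) as N grows:

import numpy as np
from itertools import product

# Illustrative asymmetric binary channel; columns hold p(y | x)
P = np.array([[0.9, 0.4],
              [0.1, 0.6]])

def noise_bits(col):
    """H(Y | x) in bits for a single channel input (one column of P)."""
    col = col[col > 0]
    return -np.sum(col * np.log2(col))

per_input = np.array([noise_bits(P[:, x]) for x in range(P.shape[1])])
avg_noise = per_input.mean()          # H(Y | X) under a uniform input distribution

for N in [1, 2, 4, 8]:
    # The noise of an extended input is the sum of the per-use noises;
    # dividing by N gives the per-use (normalized) noise of that input.
    normalized = [per_input[list(x)].sum() / N
                  for x in product(range(P.shape[1]), repeat=N)]
    print(f"N={N}: mean={np.mean(normalized):.3f}, "
          f"spread (std)={np.std(normalized):.3f}, H(Y|X)={avg_noise:.3f}")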
In addition to this uniformity in the noise level of different channel inputs, the overlap of different
channel inputs becomes uniform as block length increases. Mathematically, this means that H(x | Y)
tends toward the same value for all choices of x. Like the increasing uniformity of noise for each
channel input in the previous paragraph, this phenomenon is a consequence of the Law of Large
Numbers and can be shown mathematically with a proof similar to that in Appendix B.3. The only
difference is that unlike the noise of different inputs, the overlap of different inputs requires knowledge
of the joint distribution pX,Y . It thus depends on a particular choice of input distribution, since
pX,Y = pY|X pX . However, since we know that all sources will produce equally probable messages as
N → ∞, we can simply assume that pX is uniform. This means that the matrix PX,Y is equal to
PY|X multiplied by a scalar constant.
It is worth noting that extended noisy channels still have some inputs that are especially noisy or especially noiseless, as well as some that overlap more or less than others. For example, the extended noisy channel input that corresponds to using the noisiest input of the non-extended channel on every transmission may still be much noisier than the average extended channel input. Outliers like this don't go away as N → ∞; they simply cease to matter because they are outnumbered. However,
since we can never actually have an infinite block length in practice, in many situations information
transmission can do better than a fixed, random encoder, by shifting more probable messages onto
less noisy, less-overlapping inputs. This is possible because when N is not infinite, any redundant
source will not produce equally probable messages. How to do this in practice will be explored in
the next section.
4.3 Encoders

The goal of the encoder is to choose the mapping from (compressed) source messages to channel inputs that maximizes the information passing through the channel:
\[
f^* = \arg\max_{f} I(X; Y), \qquad X = f(S) \tag{11}
\]
Where f(·) is the encoder function that maps from the space of possible messages S to the space of channel inputs X.
When S and X are discrete spaces, this can be a challenging optimization problem, because it
is not amenable to gradient-based optimization. If we consider this problem in matrix form, pS is a
vector of probabilities of different messages, and Pf is a noiseless channel that maps pS to pX. Since
the channel is noiseless, each column has exactly one entry with probability 1, and all other entries
are 0, like the channels shown in Figure 11a,c.
A naive approach to solving this optimization problem is to use a brute force search, in which the
mutual information is calculated for every possible encoder matrix. However, this quickly becomes
computationally intractable, since the number of possible encoder matrices grows with the
factorial of the number of channel inputs.
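A minimal sketch of this brute-force search (hypothetical Python; the source distribution p_s and channel P below are illustrative, not from the paper) enumerates every permutation encoder for a three-message source and evaluates the mutual information each one achieves:

import numpy as np
from itertools import permutations

def mutual_information(p_x, P):
    """I(X; Y) in bits; columns of P hold p(y | x)."""
    p_xy = P * p_x                               # joint distribution, indexed [y, x]
    p_y = p_xy.sum(axis=1, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_y * p_x)[mask]))

p_s = np.array([0.5, 0.3, 0.2])              # illustrative compressed source
P = np.array([[0.90, 0.10, 0.30],            # illustrative channel p(y | x): input 0 is
              [0.05, 0.80, 0.30],            # nearly noiseless, input 2 is very noisy
              [0.05, 0.10, 0.40]])

best_I, best_perm = -1.0, None
for perm in permutations(range(len(p_s))):   # each permutation is one deterministic encoder
    p_x = p_s[list(perm)]                    # input distribution induced by that assignment
    I = mutual_information(p_x, P)
    if I > best_I:
        best_I, best_perm = I, perm
print(f"Best of {3*2*1} encoders: {best_perm}, I(X;Y) = {best_I:.3f} bits")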
Another possibility is to instead use stochastic encoders, which consist of replacing the deterministic encoder function f(·) with a conditional probability distribution p(x | s). In the matrix
view, this corresponds to matrices which can have an arbitrary number of nonzero elements along
each column, so long as they are all positive and each column sums to 1, like the matrices in Figure
11b,d.
Since every deterministic encoder represents a specific case of a larger class of stochastic encoders,
stochastic encoders should in theory be able to do as well or better. In practice they often do worse, though there are certain conditions under which they are superior [6, 7]. This remains an area of active
research.
The major advantage that stochastic encoders do have is that they are much more amenable to
optimization. Since each column of a stochastic encoder matrix (or any channel) is a conditional
probability distribution, it can be optimized with respect to its individual probabilities using the same
numerical optimization tools used to find the optimal input distribution for a fixed channel (Appendix
A).
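As a rough sketch of this idea (hypothetical Python, not the paper's implementation), the following optimizes a stochastic encoder matrix E, whose column s holds p(x | s), by gradient ascent on I(X; Y), with the gradient estimated by finite differences through a simple column renormalization. A more careful treatment would use an analytic gradient and the simplex projection described in Appendix A:

import numpy as np

def mutual_information(p_x, P):
    """I(X; Y) in bits; columns of P hold p(y | x)."""
    p_xy = P * p_x
    p_y = p_xy.sum(axis=1, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_y * p_x)[mask]))

def normalize_columns(E):
    """Clip negatives and renormalize so each column is a probability distribution."""
    E = np.maximum(E, 1e-12)
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
p_s = np.array([0.5, 0.3, 0.2])                      # illustrative compressed source
P = np.array([[0.90, 0.10, 0.30],                    # illustrative channel p(y | x)
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

E = normalize_columns(rng.random((3, 3)))            # stochastic encoder: column s holds p(x | s)
objective = lambda E: mutual_information(E @ p_s, P) # p_X = E p_S

eps, lr = 1e-5, 0.5
for step in range(300):
    base = objective(E)
    grad = np.zeros_like(E)
    for i in range(E.shape[0]):                      # finite-difference gradient through the
        for j in range(E.shape[1]):                  # column renormalization, for simplicity
            E_try = E.copy()
            E_try[i, j] += eps
            grad[i, j] = (objective(normalize_columns(E_try)) - base) / eps
    E = normalize_columns(E + lr * grad)
print(f"I(X;Y) after optimization: {objective(E):.3f} bits")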
Continuous encoders It is also possible to optimize encoders in the case of continuous sources/chan-
nels. That is, a source is represented by a probability density function, and an encoder is a continuous
function (if it is deterministic) or a conditional probability density function (if it is stochastic). In this
setting, the entropies become differential entropies, which have a less clear intuitive interpretation
and can in some cases take on infinite values (Section 2.9). However, the gradients of these entropies
are well defined and do not suffer from such mathematical difficulties, making gradient-based op-
timization possible [8]. Still, there remain many difficulties to optimizing mutual information in
both continuous and discrete settings, particularly in high-dimensional settings, and this remains an
active area of research [9].
4.4 Decoders
Without a proper encoder, we may incur an unnecessary loss of information in the noisy channel.
However, even with an optimal encoder, there remains the challenge of inferring the original
message S from the channel output Y. This is the job of the decoder, a function or conditional
probability distribution which forms an estimate of the original source message Ŝ. Encoders and
decoders must be designed together, and the noise characteristics of the channel must be considered
to achieve good performance.
The channel output will contain both source information I(X; Y) and noise added by the channel
H(Y | X). An optimal decoder will get rid of the latter, mapping all noisy versions of a particular
message p(y | x) back to a correct estimate of the original message s = ŝ. In practice, our decoder
may not be able to eliminate all of the noise H(Y | S) or preserve all the signal I(Ŝ; S): There may
be a trade-off between these competing goals.
Preserving the information contained in the received message Y is necessary, but not sufficient,
for estimating the correct message. To understand why it is necessary, consider the rate distortion
curve in Figure 9c. The probability that the estimated source message is not equal to the actual
source message is a valid distortion function, and the rate-distortion curve tells us that to improve
the best possible average performance on this distortion function, we will need more information.
To understand why it is not sufficient, consider that if we had a perfect decoder, we could always
permute all of its predictions such that it would be wrong every time. This would give the worst
possible distortion, but all the information would remain, because we could always perform the
inverse permutation.
As discussed in the previous section, using block encoders and an extended noisy channel can in-
crease the information throughput of a channel (up to the channel capacity). However, this strategy
comes at the cost of increasing the complexity of designing a corresponding decoder. Specifically,
as we increase the block length, the number of different messages we have to decode increases exponentially as |Y|^N. A naive decoder would simply be a lookup table that maps a received message Y to
an estimate of the source event Ŝ, and this lookup table becomes exponentially large with increasing
block length. The best encoder-decoder pairs are thus semi-random, giving them the advantages of
random encoder construction with the potential for cleverly designed decoding algorithms that can
do better than the exponential complexity of the naive approach. There is not a known formula for
designing such encoder-decoder pairs that works across different types of channels.
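To make the naive decoder concrete, the sketch below (hypothetical Python, not the paper's code) builds the full maximum-likelihood lookup table for a length-5 repetition code over the binary symmetric channel of Figure 17a. The table has 2^N entries, which is exactly what becomes intractable at the block lengths the theorem requires:

from itertools import product

flip_prob = 0.2                              # channel flips each bit with probability 1/5
N = 5                                        # block length (number of repetitions)
codewords = {0: (0,) * N, 1: (1,) * N}       # repetition code: message -> transmitted block

def likelihood(received, sent):
    """p(received | sent) for a memoryless binary symmetric channel."""
    flips = sum(r != s for r, s in zip(received, sent))
    return flip_prob ** flips * (1 - flip_prob) ** (N - flips)

# Naive decoder: one table entry for each of the 2^N possible received blocks
lookup = {r: max(codewords, key=lambda m: likelihood(r, codewords[m]))
          for r in product([0, 1], repeat=N)}

print(len(lookup), "table entries")          # 2^N grows exponentially with the block length
print(lookup[(1, 0, 1, 1, 0)])               # majority of 1s -> decoded as message 1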
5 Code availability
The code used to produce the figures, along with high-resolution versions of the figures, can be found at https://doi.org/10.5281/zenodo.
6 Acknowledgements
We thank Matt Thomson for alerting H.P. to the existence of information theory, and in particular the wonderful textbook [2] by David MacKay that made learning accessible and interesting, Chenling Antelope Xu for being an insightful and thorough member of a study group on that book, Tom Courtade for providing feedback and clarifying concepts, and Leyla Kabuli and Tiffany Chien for their extremely helpful feedback in the preparation of this manuscript.
A Numerically finding the optimal input distribution

The channel capacity of a fixed channel is found by searching for the input distribution that maximizes the mutual information:
\[
p_X^* = \arg\max_{p_X} I(X; Y) \tag{12}
\]
This optimization problem can be solved, and the optimal distribution/channel capacity computed, using numerical optimization techniques such as gradient ascent. Mutual information has the nice property that it is a concave function of the input probabilities, so optimizing the input probabilities directly using gradient ascent is guaranteed to eventually find a global maximum with a proper learning rate.
We solve this problem using projected gradient ascent, which consists of applying the following update:
\[
p_X^{k+1} = \mathrm{proj}\!\left( p_X^k + \lambda \nabla_{p_X} I(X; Y) \right)
\]
where the projection proj(·) is computed by applying, in order,
\begin{align*}
p_X &\leftarrow \max(p_X, 0) \\
p_X &\leftarrow p_X + \frac{1 - \sum_{i=1}^{N} p_i}{N}
\end{align*}
where N is the number of elements in p_X and p_i is the ith entry of p_X.
While this heuristic projection step appears to work in practice, we note that there may be better
and more theoretically motivated alternatives [10].
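A minimal sketch of this procedure in Python (hypothetical, not necessarily the implementation used for the figures) applies the updates above to a binary symmetric channel. It uses the analytic gradient of I(X; Y) with respect to the input probabilities: the relative entropy between each input's output distribution and the overall output distribution, minus a constant that does not affect the optimum:

import numpy as np

def capacity_projected_ascent(P, lr=0.1, n_steps=2000):
    """Estimate channel capacity by projected gradient ascent on the input distribution.

    P[y, x] = p(y | x); each column of P is a conditional distribution.
    """
    n = P.shape[1]
    p_x = np.full(n, 1.0 / n)                       # start from a uniform input distribution
    for _ in range(n_steps):
        p_y = P @ p_x
        ratio = P / np.maximum(p_y[:, None], 1e-15)
        # Gradient of I(X;Y) in bits w.r.t. p(x): D(p(y|x) || p(y)) - log2(e)
        grad = np.sum(P * np.log2(np.maximum(ratio, 1e-15)), axis=0) - np.log2(np.e)
        p_x = p_x + lr * grad                       # gradient step
        p_x = np.maximum(p_x, 0)                    # heuristic projection back onto the simplex
        p_x = p_x + (1 - p_x.sum()) / n
    p_y = P @ p_x
    ratio = P / np.maximum(p_y[:, None], 1e-15)
    I = np.sum(p_x * np.sum(P * np.log2(np.maximum(ratio, 1e-15)), axis=0))
    return p_x, I

# Binary symmetric channel with flip probability 1/5; capacity is 1 - H(0.2), about 0.278 bits
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
p_star, C = capacity_projected_ascent(P)
print(p_star, C)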
B Proofs
B.1 Proof that the joint entropy of two independent random variables equals the sum of the individual entropies
Assuming X and Y are independent, so that p(x, y) = p(x)p(y):
\begin{align*}
H(X, Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(x)\, p(y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \left( \log \frac{1}{p(x)} + \log \frac{1}{p(y)} \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)\, p(y) \log \frac{1}{p(y)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} \sum_{y \in \mathcal{Y}} p(y) + \sum_{y \in \mathcal{Y}} p(y) \log \frac{1}{p(y)} \sum_{x \in \mathcal{X}} p(x)
\end{align*}
Taking advantage of the fact that the sum over any probability distribution is one, this reduces to:
\begin{align*}
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{y \in \mathcal{Y}} p(y) \log \frac{1}{p(y)} \\
&= H(X) + H(Y)
\end{align*}
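As a quick numerical sanity check of this identity (a hypothetical Python snippet, not part of the proof):

import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = np.array([0.5, 0.25, 0.25])
p_y = np.array([0.7, 0.3])
p_xy = np.outer(p_x, p_y)                 # independent joint distribution: p(x, y) = p(x)p(y)

print(entropy(p_xy.ravel()))              # H(X, Y)
print(entropy(p_x) + entropy(p_y))        # H(X) + H(Y): the two values agree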
B.3 Proof that extended channel inputs become equally noisy as N → ∞

The noise of a single extended channel input x decomposes across channel uses:
\begin{align*}
H(Y \mid x) &= H(Y \mid x^{(1)}, x^{(2)}, \ldots, x^{(N)}) \tag{16} \\
&= H(Y \mid x^{(1)}) + H(Y \mid x^{(2)}) + \ldots + H(Y \mid x^{(N)}) \tag{17}
\end{align*}
where x = (x^{(1)}, x^{(2)}, \ldots, x^{(N)}) represents the input used on each channel use. These inputs are independent random variables, since the input chosen on one channel use is independent of subsequent channel uses. Thus, we can take a probability-weighted average over the choice of channel input, yielding:
\begin{align*}
\sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x) &= \sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x^{(1)}) + \sum_{x \in \mathcal{X}^N} p(x) H(Y \mid x^{(2)}) + \ldots \\
&= \sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) H(Y \mid x^{(1)}) \prod_{i=2}^{N} \sum_{x^{(i)} \in \mathcal{X}} p(x^{(i)}) \\
&\quad + \sum_{x^{(2)} \in \mathcal{X}} p(x^{(2)}) H(Y \mid x^{(2)}) \sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) \prod_{i=3}^{N} \sum_{x^{(i)} \in \mathcal{X}} p(x^{(i)}) + \ldots
\end{align*}
By using the fact that the sum of a probability distribution is equal to one, this reduces to:
\[
\sum_{x^{(1)} \in \mathcal{X}} p(x^{(1)}) H(Y \mid x^{(1)}) + \sum_{x^{(2)} \in \mathcal{X}} p(x^{(2)}) H(Y \mid x^{(2)}) + \ldots
\]
By dividing by N and invoking the Law of Large Numbers, we can see that as N → ∞, the block length-normalized noise level of an average extended channel input becomes increasingly close to the average noise of the non-extended channel, H(Y | X).
References
[1] Claude E Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 7 1948.
[2] David J C MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 1st edition, 2003.
[3] Thomas M Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in
Telecommunications. Wiley, New York, USA, 2nd edition, 9 2005.
[4] Brian C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE,
9(2), 2014.
[5] Yury Polyanskiy, H. Vincent Poor, and Sergio Verdú. Channel coding rate in the finite block-
length regime. IEEE Transactions on Information Theory, 56(5):2307–2359, 2010.
[6] Lucas Theis and Eirikur Agustsson. On the advantages of stochastic encoders. arXiv, pages
1–8, 2 2021.
[7] Aaron B. Wagner and Johannes Ballé. Neural Networks Optimally Compress the Sawbridge.
Data Compression Conference Proceedings, 2021-March:143–152, 2021.
[8] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and
blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
[9] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker.
On variational bounds of mutual information. In 36th International Conference on Machine
Learning, ICML 2019, volume 2019-June, pages 9036–9049, 2019.
[10] Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An
efficient algorithm with a simple proof, and an application. arXiv, 2013.