Introduction to Information Theory
Changho Suh
December 4, 2019
About instructor
Welcome to EE623: Information Theory! My name is Changho Suh, an instructor of the
course. A brief introduction of myself: a long time ago, I was one of the students at KAIST,
like you. I spent six years at KAIST to obtain the Bachelor and Master degrees, both in Electrical
Engineering, in 2000 and 2002, respectively. I then left academia and joined Samsung
Electronics. At Samsung, I worked on the design of wireless communication systems like 4G-LTE
systems. After four and a half years, I left industry and joined UC Berkeley, where I
obtained the PhD degree in 2011. I then joined MIT as a postdoc for around one year.
After that I came back to KAIST. While at Berkeley and MIT, I worked on a field called
information theory, which I am going to cover in this course. This is one of the main reasons
that I am teaching this course.
Today’s lecture
In today's lecture, we will cover two basic items. The first is the logistics of this course: details
as to how the course is organized and will proceed. The second is a brief overview
of this course. In the second part, I am going to tell you a story of how information theory
was developed, as well as what the theory is.
Prerequisite
The basic prerequisite for this course is familiarity with the concepts of probability and random
processes. In terms of a course, this means that you should have taken the following course:
EE210, an undergraduate-level course on probability offered in Electrical Engineering.
Some of you from other departments might have taken a different yet equivalent
course (e.g., MAS250). This is also okay.
There must be a reason why the concept of probability is crucial for this course. Here
is why. The information theory that we will cover in this course was developed in the context of
communication, which I am going to discuss in greater detail in the second part of this
lecture. Communication is the field where one can see the relationship between information
theory and probability.
Now what is communication? Communication is the transfer of information from one end (called
the transmitter) to the other (called the receiver), over a physical medium (like the air) between the two
ends. The physical medium is called the channel. The channel is what relates
the concept of probability to communication. If you think about how the channel behaves,
then you can easily see why. First of all, the channel is a sort of system (in other words, a
function) which takes a transmitted signal as an input and produces a received signal as an output. Here
one big problem arises in the system. The problem is that it is not a deterministic function.
If the channel were a deterministic one-to-one mapping, then one could easily reconstruct the
input from the output. So there would be no problem in transferring information from the transmitter
to the receiver. However, the channel is not deterministic in reality. It is actually a random
function. In other words, there is a random entity (also known as noise in the field) added into
the system. In typical communication systems, the noise is additive: a received signal is the sum
of a transmitted signal and the noise. In general, we have no idea what the noise is. In mathematics
and statistics, there is terminology for such a random quantity: a random
variable or a random process, for which probability forms the basis. This is the very reason
that this course requires a solid understanding of probability. Some of you may have no idea about
random processes although you took EE210. Please don't worry. A random process is nothing but a
sequence of random variables. Whenever the concept of a random process comes up, I will provide
detailed explanations and/or materials which help you understand the required contents.
If you think that you lack this prerequisite, then come and consult with me so that I can help
you as much as possible. If you did not take this course but took relevant courses held
in other departments, then it would probably be okay.
There is another important course which helps in understanding this course: EE528, a
graduate-level course on random processes. It deals with probability in
a deeper manner and also covers many important concepts on random processes. So if you have
enough energy, passion and time, I strongly recommend taking this course simultaneously.
But note that it is not a prerequisite.
Course website
We have a course website on the KLMS system. You can simply log in with your portal ID. If
you only want to sit in on this course (or cannot register for the course for some reason), please let
me or one of the TAs know your email address. Upon request, we are willing to distribute course
materials to the email address that you send us.
Text
The course consists of three parts. The textbook that we will use for the first and second parts
is Cover and Thomas's book, the most renowned text in the field, titled "Elements of
Information Theory." I am going to call it CT for short. We will also use a celebrated paper
written by the father of information theory, Claude E. Shannon, titled
"A mathematical theory of communication". We will also use lecture notes by Prof. Robert G.
Gallager for the first few weeks. You can download them from the
course website. On top of these three, I am going to provide you with the lecture slides which I will
use during lectures, as well as course notes like the one that you are now reading. Perhaps these
materials will be posted the night before class.

There is no textbook for the last part. But don't worry. I will instead provide you with lecture
slides and course notes which contain every detail and thus are self-contained. Reading these
materials would suffice for you to understand all the contents covered in the last part.
Problem sets
There will be weekly or bi-weekly problem sets, so there will be seven to eight problem sets
in total. Solutions will usually be made available at the end of the due date. This means that, in
principle, we do not accept any late submission. We encourage you to cooperate with each other
in solving the problem sets. However, you should write up your own solutions by yourselves.
You are welcome to flag confusing topics in the problem sets; this will not lower your grade.
Exams
As usual, there will be two exams: a midterm and a final. Please see the syllabus for the schedule that
our institution assigns by default. Please let us know if you cannot make the scheduled time.
With convincing reasons, we can change the schedule or give you a chance to take the
exam in a different time slot that we will organize individually.
Two things to notice. First, for both exams, you are allowed to use one cheat sheet, A4-sized
and double-sided. So it is a kind of semi-closed-book exam. Second, for your convenience, we
may provide an instruction note for each exam (if you wish), which contains detailed guidelines
as to how to prepare for the exam. Such information includes: (1) how many problems are in the
exam; (2) what types of problems are dealt with in what contexts; (3) the best way to prepare
for such problems.
Course grade
Here is the rule for the course grade, which is perhaps what you are most interested in. The grade will be
decided based on four factors: problem sets (22%); midterm (32%); final (40%); and interaction
(6%). Here interaction means any type of interaction, so it includes attendance, in-class
participation, questions, and discussion.
Overview
Now let's move on to the second part. Here is the reading material for this part: Chapter 1 in
Gallager's notes, Shannon's paper, and Chapter 2 in CT. In this part, I will tell you how
information theory was developed and what the theory is. As I mentioned earlier, information
theory was developed in the context of communication. So first I am going to tell you how
information theory was developed in that context. Perhaps next time, I will present the
specific topics that we will learn throughout the course.
over wires such as copper lines. So it is the first communication system that has something to
do with electrical signals. Actually this is the main reason why communication systems
have been studied within the field of electrical engineering, where most of you are. Later
this technology was further developed. In the 1870s, Alexander Graham Bell invented a more
advanced version of such systems, called telephone.
Communication systems were further upgraded. The upgrade was based on another finding in
physics: not only can we send electrical signals over wires, but we can also send them in a
wireless environment through electromagnetic waves, simply called radio waves. This finding
inspired an Italian engineer at that time, named Guglielmo Marconi. He developed a wireless
version of telegraph, called wireless telegraphy. Later this technology was further developed,
which led to the invention of radio and TV.
Claude E. Shannon
What he felt from such observations is that such communication systems are really annoying.
He did not like the communication systems because such systems were designed in a pretty ad-hoc
manner (no systematic design principle!), in other words, in a completely different manner
depending on each specific application of interest. He actually believed that there must be
one simple & beautiful framework that can unify all such different communication systems. So
he tried to unify the ad-hoc approaches. As a specific effort, he raised the following three big
questions.
The second question is a natural follow-up which helps address the first question.
He thought that if unification is possible, then there may exist a common currency (like the dollar in
the context of economics) w.r.t. (with respect to) information. In communication systems,
we have a variety of different information sources, like text, voice, video, and images. The second
question that he asked is related to the existence of such a common currency.
Question 2: Is there a common currency of information that can represent all such different
information sources?
The last question is with respect to the task of interest: communication.
He spent around eight years addressing the above questions. Perhaps he had quite a difficult
time, as the long period of eight years indicates. But he made progress in the end. He
addressed all the big questions in a single stroke. In the process of doing this, he
developed a theory, which people later called "information theory".
Looking ahead
Next time, I will tell you how Shannon could do such things. Then, I am going to tell you what
specific topics we are going to cover in this course.
EE623 Information Theory September 4, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Lecture 2: Overview
Recap
Last time I told you a story of how Claude Shannon came up with information theory. The
story dates back to the early 20th century when a variety of different communication systems
were prevalent. Motivated by the fact that most communication systems at that time were
designed in an ad-hoc manner, Shannon intended to develop a single framework that can unify
all the different systems. To this end, he first came up with a common currency of information.
Using this, he then established a unified framework. Based on the framework, he characterized
the limit on how fast one can communicate. In the process of doing this, he was particularly
excited by the fact that given a channel determined by Nature, the limit does not change no
matter what the transmitter and receiver do.
Shannon theorized all of these in a single mathematical framework. He then called it "a
mathematical theory of communication", which people later called information theory or Shannon
theory. This was how information theory was invented.
Today’s lecture
Today we are going to go a little deeper. Specifically, I will tell you how Shannon did all of
this in detail. Based on this, I will then list the specific topics that we are going to cover in
this course.
Communication architecture
Recall the three terminologies that we introduced last time: the transmitter, the receiver, and something
that lies in between, which is the channel. Now consider the information
signals that one wishes to transmit, such as text, voice, and image pixels. Here is another
terminology which indicates such signals: the "information source". What Shannon thought in
the first place is that there must be something which processes the information source before
transmission. He abstracted this process with a black box that he called the "encoder". Of course
there must be something at the receiver which attempts to recover the information source from the
received signals that the channel yields. He abstracted the process at the receiver end with
another black box that he called the "decoder". This is the very first basic block diagram that
Shannon imagined for a communication architecture. See Fig. 1. From Shannon's viewpoint,
a communication system is nothing but a collection of an encoder and a decoder, so designing a
communication system amounts to designing such an encoder and decoder.
Figure 1: Shannon's basic communication architecture: the information source enters the encoder (at the transmitter), the encoded signal passes through the channel, and the decoder (at the receiver) recovers the source.
Representation of information source
With this simple picture in his mind, Shannon wished to unify all the different communication
systems that were prevalent in the early 20th century. Most engineers at that time attempted to
transmit information sources (which are of course different depending on applications of interest)
directly without any significant modification. Shannon thought that this is the very reason as
to why there were different communication systems.
He then thought that, for the purpose of unification, there must be at least something that
can represent different information sources. He believed that this common representation would lead
to a universal framework. It turns out Shannon's master's thesis work at MIT played a
significant role in finding a universal way of representing an information source. His master's thesis
was about Boolean algebra. What he showed specifically is that any logical relationship in circuit
systems can be represented with 0/1 logic, in other words, with a binary string.

Inspired by this, Shannon thought that the same thing may happen in the context of communication
systems, meaning that any type of information source can be represented with a binary
string. He showed that this is indeed the case. Specifically, what he showed is that the binary
string, whose components he called "bits", can represent the meaning of information.
For instance, suppose that the information source is an English text that comprises English letters.
Now how can we represent each English letter with a binary string? Here one key observation is that
there are only a finite number of candidates that each letter can take on. This number is the
size of the English alphabet, which is 26. From this, we see that $\lceil \log_2 26 \rceil = 5$ bits
suffice to represent each letter.
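As a quick numerical sanity check (my own illustration, not part of the original notes), the following Python snippet computes how many bits are needed to give every symbol of an alphabet a distinct fixed-length label.

```python
import math

def bits_needed(num_symbols: int) -> int:
    """Minimum number of bits for a fixed-length binary label per symbol."""
    return math.ceil(math.log2(num_symbols))

print(bits_needed(26))   # 5: five bits are enough to index the 26 English letters
print(bits_needed(128))  # 7: e.g., enough for a 128-symbol alphabet
```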
A two-stage architecture
This observation led Shannon to introduce bits as a common currency of information. Specifically
here is what he did. He first thought that we need a block which converts information source
into bits of our interest. Motivated by this, Shannon came up with a two-stage architecture in
which the encoder is split into two parts. Here the role of the first block is to convert information
source into bits. Shannon called this block the "source encoder", as it is part of the encoder
and is related to what the information source looks like. The role of the second block is to convert
bits into a signal that can actually be transmitted over a channel. Shannon called it “channel
encoder” because it is obviously concerning a channel. See the upper part in Fig. 2.
Similarly, receiver consists of two stages. But the way it is structured is opposite. In other
words, we first convert the received signals into the bits that we sent at the transmitter (channel
decoder). Next, we reconstruct the information source from the bits (source decoder). As you
can see, this block is nothing but an inverse function of source encoder. Obviously source encoder
should be one-to-one mapping; otherwise, there is no way to reconstruct the information source.
See the lower part in Fig. 2.
Actually there is a terminology which indicates the part spanning from the channel encoder, through the channel,
to the channel decoder. We call it the "digital interface". Here one thing to notice is that this digital
interface is universal in the sense that it has nothing to do with the type of information source,
because the input to the digital interface is always bits. So in that regard, it is indeed a
unified communication architecture.
Figure 2: The two-stage architecture: at the transmitter, a source encoder followed by a channel encoder; at the receiver, a channel decoder followed by a source decoder.
Keeping the two-stage architecture in his mind, Shannon tried to address the third question: is
there a limit to how fast one can communicate? To this end, Shannon first thought about the
goal of communication: transferring as much information as possible. Actually, if the information
source itself were bits, then we would only need to worry about the maximum number of bits that can be
transmitted over a channel. But notice in the two-stage architecture that we have another
block in front, namely the source encoder, which converts the information source into bits.
This naturally led him to worry about the efficiency of the first block. In order to maximize
the amount of information sent, we first need to minimize the number of bits that
represent the information source. This led Shannon to ask what the most efficient way
of representing the information source is.
To make this more explicit, Shannon split the question on the limit into two. The first is about
efficiency of the first block. What is the minimum number of bits that can represent information
source? The second question is about transmission capability. What is the maximum number
of bits that can be transmitted over a channel?
Shannon developed two theorems to answer the two questions. He first developed a theorem
that he called “source coding theorem” to identify what the minimum number of bits is. He also
developed another theorem that he called “channel coding theorem”, in which the maximum
transmission capability is characterized.
very simple to state.
Let S be a generic random variable which indicates an individual random variable that forms the
random process. In the source coding context, each random variable is called “symbol”. It turns
out the minimum number of bits is related to one important notion that is very well-known in
other fields such as physics and chemistry. That is “entropy”. The entropy is a kind of measure
of disorder or randomness. It turns out this measure plays a crucial role in establishing the source
coding theorem, formally stated below.
Theorem 0.1 (Source coding theorem in the i.i.d. case) The minimum number of bits
that can represent the source per symbol is the entropy of the random variable S, denoted by

$$H(S) := \sum_{s \in \mathcal{S}} p(s) \log_2 \frac{1}{p(s)}, \qquad (1)$$

where p(s) denotes the distribution² of S, and \mathcal{S} (read "calligraphic S") indicates the
range, the set of all possible values that S can take on.
range, which is the set of all possible values that S can take on.
Now the question is: how to design f (S) such that the number of bits that represents S, reflected
in the average length of codeword E[f (S)], is minimized? Note that there are four letters in
total. So it suffices to use two bits to represent each symbol. One naive way is to take the
following simple mapping rule: A → 00; C → 01; T → 10; G → 11. This way, we can achieve 2
bits/symbol. But the source coding theorem says that we can actually do better, as the claimed
limit reads:
$$H(S) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75. \qquad (2)$$
As the theorem promised, there is indeed a code that can achieve the limit. The code is based
on the following observations: (i) the letter A occurs more frequently than other letters; (ii) the
length of codeword is not necessarily fixed. This naturally leads to an idea: assigning a short-
length codeword to frequent letters while assigning a long-length codeword to less frequent
letters. Now the question is how to implement such an idea? For visualization purpose, let us
introduce a so-called “binary code tree” with which one can easily implement a mapping from
S to f (S).
²It is a probability mass function, simply called pmf, for the case where S is a discrete random variable.
Binary code tree: A binary tree is the one in which every internal node has only two branches.
Notice that a node which has no branch is called the leaf. See an example in Fig. 3. Now let me
relate the binary tree to a code. Suppose we assign a symbol to a leaf. We can then establish
the functional relationship by specifying the pattern of the corresponding codeword as follows:
the sequence of binary labels (associated with branches) from the root to that leaf. For example,
suppose we label 0 on an upper branch, 1 on a lower branch, and assign a symbol A to the top
leaf. Then, f (A) = 0, as we have only one branch, labeled 0, that links the root to the leaf.
Similarly f (C) = 10. Note that there are two branches (labeled 1 and 0 subsequently) which
connect the root to the leaf assigned to C.
Now how do we implement an optimal mapping rule (which achieves 1.75 bits/symbol) via a binary
code tree? As mentioned earlier, the idea is to assign short codewords to frequent letters.
Obviously we should assign the most frequent letter A to the top leaf, since it has the shortest
codeword length. Now what about the second most frequent letter C? One may want to
assign it to the internal node marked with a blue square in Fig. 3. But this is not valid. Why?
We would then run out of codewords: we would have only two leaves
although we need four in total. So we must grow another two branches from that
internal node. We can then assign C to the second top leaf, of codeword length 2. Similarly
the next most frequent letter T (or G) cannot be assigned to the follow-up internal node marked with a
red triangle in Fig. 3. So that node should also have another set of two branches. The remaining
letters T and G are then assigned to the two remaining leaves. With this mapping rule, one can
readily see that the expected codeword length is $\frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75$ bits/symbol, matching the entropy H(S).
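To make the numbers concrete, here is a short Python sketch (my own illustration, not from the notes) that evaluates the expected codeword length of the mapping read off the tree above (A → 0, C → 10, T → 110, G → 111) under the source distribution used in this example (A: 1/2, C: 1/4, T: 1/8, G: 1/8), and compares it with the entropy.

```python
import math

# Source distribution and the codewords read off the binary code tree above
pmf = {"A": 1/2, "C": 1/4, "T": 1/8, "G": 1/8}
code = {"A": "0", "C": "10", "T": "110", "G": "111"}

expected_length = sum(p * len(code[s]) for s, p in pmf.items())
entropy = sum(p * math.log2(1 / p) for p in pmf.values())

print(expected_length)  # 1.75 bits/symbol
print(entropy)          # 1.75 bits/symbol: the variable-length code meets the limit
```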
For the channel coding theorem, there are other key notions that we will need to study. One such important notion is "mutual information." We will deal with
the definitions of these notions later on.
Course outline
The two theorems are among the main topics that we are going to cover in this course. Specifically,
the course consists of three parts. In Part I, we will study basic concepts on key notions
in the field: (i) entropy (which played a fundamental role in establishing the source coding theorem);
(ii) mutual information (a crucial notion for the channel coding theorem); (iii) the Kullback-Leibler
(KL) divergence (another key notion which plays a similar role to mutual information
yet is known to be more powerful in a widening array of other disciplines such as statistics, mathematics,
and machine learning). In Part II, we will prove the two major theorems using the
notions of entropy and mutual information. It turns out information theory can serve to address
some important issues that arise in a wide variety of fields beyond communication, ranging
from data science and computational biology to, more recently, machine learning and deep learning.
In Part III, we will explore applications of information theory in machine learning and deep
learning. In particular, we will focus on the roles of the above key notions in the design of
supervised and unsupervised learning algorithms.
EE623 Information Theory September 9, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Lecture 3: Entropy
Recap
During the past lectures, I told you about a story of how Shannon came up with information
theory. Shannon’s effort was initiated by a strong motivation for unifying distinct communication
systems that were prevalent in the early 20th century. As an initial step towards unification,
he introduced bits as a means to represent possibly different information sources and then
proposed a unified two-stage architecture in which the first block converts an information source
into bits and the second block generates a signal that can actually be transmitted over a channel.
Within this framework, Shannon quantified the limit on the amount of information that one can
communicate. In the process of doing this, he established two fundamental theorems which in
turn laid the foundation of information theory. The first is the source coding theorem which
demonstrates that the minimum number of bits that can represent an information source is
the entropy of the source. The second is the channel coding theorem which characterizes the
maximum number of bits that can be transmitted over a channel.
Today’s lecture
Prior to proving the source and channel coding theorems, which we are going to cover in depth
in Part II, we will first study four key notions which form the basis of the theorems: (i) entropy;
(ii) mutual information; (iii) Kullback-Leibler (KL) divergence; (iv) cross entropy. It turns out
these notions play a crucial role in addressing important issues that arise in a widening array of
other disciplines such as statistics, physics, computational biology and machine learning. So it
is worthwhile to become familiar with the detailed properties of these notions for many purposes.
Today we will study the first notion in depth: entropy. Specifically, what we are going to
do is threefold. First we will review the definition of entropy that we already introduced
last time. I will then introduce a couple of intuitive interpretations of the meaning of entropy,
which may help us understand, with respect to the source coding theorem, why the limit on the number
of bits required to represent an information source must be the entropy. Next we will study
some key properties which are useful in a variety of contexts not limited to the source coding
theorem. Later in Part II you will see how the entropy comes up in the proof of the source
coding theorem. Please be patient until I show you that connection.
Definition of entropy
The entropy is defined with respect to (w.r.t.) a random quantity which has uncertainty. More
formally, it is defined for a random variable or a random process, depending on the dimension
of the random quantity. For simplicity, let us focus on a simple case in which the random
quantity is a random variable. Later we will cover a general case dealing with a random process.
More precisely, the entropy is defined w.r.t. a discrete¹ random variable. Let X be a discrete
random variable and p(x) be its probability mass function (pmf). Let \mathcal{X} (read "calligraphic X")
be its range: the set of values that X can take on. The entropy is defined as:

$$H(X) := \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)} \;\;\text{bits}. \qquad (1)$$

In this course, we will mostly use only one type of logarithmic function: log base 2. For
simplicity, we will drop the "2" throughout the course.

¹We say that a random variable is discrete if its range (the set of values that the random variable can take on) is finite or at most countably infinite. See Ch. 2.1 in BT for details.
Actually there is an alternative expression for the entropy formula that I strongly recommend
you remember. It is much simpler and thus easier to remember. In addition, it
greatly simplifies the proofs of some important properties that we are going to investigate
later on. Note that the entropy is a sort of weighted sum of $\log \frac{1}{p(x)}$. So it can be represented
as:

$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right], \qquad (2)$$
where the expectation is taken over the distribution of X: p(x).
Interpretation #1
Actually there are some well-known interpretations of the meaning of entropy which are
quite intuitive and can therefore give some insight into why the entropy is related to the limit
promised by the source coding theorem. Here we will investigate two of them. The first interpretation is that entropy measures the uncertainty (randomness) of a random variable.
Let me give you a concrete example where we can see this interpretation makes sense. Consider
two experiments: (i) tossing a fair coin; (ii) rolling a fair dice. One simple random variable that
one can naturally think of for the first experiment is a function that maps the head (or tail)
event to 0 (or 1). Since the coin is fair, we have:
$$X = \begin{cases} 0, & \text{w.p. } 1/2; \\ 1, & \text{w.p. } 1/2, \end{cases}$$
where “w.p” stands for “with probability”. On the other hand, a natural random variable in
the second experiment is a function that maps a dice result to the same number, which in turn
yields:
$$X = \begin{cases} 1, & \text{w.p. } 1/6; \\ 2, & \text{w.p. } 1/6; \\ 3, & \text{w.p. } 1/6; \\ 4, & \text{w.p. } 1/6; \\ 5, & \text{w.p. } 1/6; \\ 6, & \text{w.p. } 1/6. \end{cases}$$
Now which random variable is more uncertain, in other words, more random? Your intuition
points to the second one! No wonder. Actually the entropy supports your answer with explicit
numbers. Note that H(X) = 1 in the first experiment while H(X) = log 6 > 1 in the latter.
Entropy can indeed quantify such uncertainty.
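As a small illustration (my own sketch, not part of the original notes), the following Python snippet computes the entropy of the two random variables above and confirms that the fair die is "more random" than the fair coin.

```python
import math

def entropy(pmf):
    """Shannon entropy (in bits) of a pmf given as a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

print(entropy([1/2, 1/2]))  # 1.0 bit for the fair coin
print(entropy([1/6] * 6))   # log2(6) ~ 2.585 bits for the fair die: more uncertain
```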
Let me give you another example. Suppose now we have a bent coin. Then, it yields a different
head event probability, say p ≠ 1/2:

$$X = \begin{cases} 0, & \text{w.p. } p; \\ 1, & \text{w.p. } 1 - p. \end{cases} \qquad (3)$$
Now the question is: does this random variable have more uncertainty, compared to the fair coin
case? To see this clearly, consider an extreme case in which p ≪ 1 and thus:

$$H(X) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p} \approx 0. \qquad (4)$$

Here we used the fact that $\lim_{p \to 0} p \log \frac{1}{p} = 0$. Why? Remember L'Hôpital's rule that you
learned about in your calculus course. This implies that the bent-coin case is definitely
more certain. This makes sense. Why? A very small p (≪ 1) means that the tail event happens
almost all the time, and thus the result is well predictable.
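Reusing the same idea in code (again an illustrative sketch of my own), one can check numerically that the bent coin becomes nearly deterministic as p approaches 0.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), with 0*log(1/0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in [0.5, 0.1, 0.01, 0.001]:
    print(p, binary_entropy(p))  # the entropy shrinks toward 0 as p -> 0
```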
Interpretation #2
The second interpretation is related to a common way of removing uncertainty. What does
this mean? Let me give you an example. Suppose we meet a person for the first time. Then
the person is completely unknown to us. Here we can think of a common way to remove such
randomness: asking questions. With answers to the questions, we can then somehow remove
randomness w.r.t. him/her. With enough questions, we may have complete knowledge about
him/her. So from this, we see that the number of questions required to have complete knowledge
represents the amount of randomness: the more questions required, the more uncertain. This
naturally leads to the second interpretation: entropy captures the number of (yes-or-no) questions required, on average, to determine the value of X.
Sometimes the number of questions (on average) exactly matches H(X). Here is an example
where that happens. Suppose
$$X = \begin{cases} 1, & \text{w.p. } 1/2; \\ 2, & \text{w.p. } 1/4; \\ 3, & \text{w.p. } 1/8; \\ 4, & \text{w.p. } 1/8. \end{cases}$$
A straightforward calculation gives H(X) = 1.75. Now suppose that questions are of the yes-
or-no type. Then, the minimum number of questions (on average) required to determine the
value of X is H(X). Here is the optimal way of asking questions that achieves H(X). We first
ask whether or not X = 1. If yes, X = 1; otherwise, we ask if X = 2 or not. If yes, X = 2;
otherwise we ask if X = 3 or not. Let f (x) be the number of questions when X = x. Then, this
way yields:
$$\mathbb{E}[f(X)] = \frac{1}{2} f(1) + \frac{1}{4} f(2) + \frac{1}{8} f(3) + \frac{1}{8} f(4) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75.$$
Some of you may recognize that this is the same as the source code example that we examined
last time.
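The following short simulation (my own sketch, not from the notes) plays this question game many times and checks that the average number of yes-or-no questions is close to H(X) = 1.75.

```python
import random

pmf = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}
# Question strategy: ask "X = 1?", then "X = 2?", then "X = 3?".
questions_needed = {1: 1, 2: 2, 3: 3, 4: 3}

random.seed(0)
samples = random.choices(list(pmf), weights=list(pmf.values()), k=100_000)
avg = sum(questions_needed[x] for x in samples) / len(samples)
print(avg)  # close to 1.75 = H(X)
```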
Key properties
The entropy has several important properties. We can identify those properties by making some
important observations.
Recall the bent-coin example. See (3) for the distribution of the associated random variable
X. Consider the entropy calculated in (4). First we note that the entropy is a function of p,
which fully describes X. See Fig. 1, which plots H(X) as a function of p.
Figure 1: Entropy of the binary random variable X with Pr(X = 0) = p and Pr(X = 1) = 1 − p
From this, one can make two observations: (i) the minimum entropy is 0; (ii) the entropy is maximized when p = 1/2,
i.e., when X is uniformly distributed.
Consider another example in which X ∈ \mathcal{X} = \{1, 2, \ldots, M\} and is uniformly distributed:

$$X = \begin{cases} 1, & \text{w.p. } 1/M; \\ 2, & \text{w.p. } 1/M; \\ \;\vdots \\ M, & \text{w.p. } 1/M. \end{cases}$$
In this case,

$$H(X) = \sum_{x \in \mathcal{X}} \frac{1}{M} \log M = \log M = \log |\mathcal{X}|,$$

where |\mathcal{X}| indicates the cardinality (the size of the set) of \mathcal{X}. This leads to another observation:
the entropy of a uniformly distributed random variable X ∈ \mathcal{X} is \log |\mathcal{X}|.
The above three observations lead us to conjecture the following two: for X ∈ X ,
Property 1: H(X) ≥ 0.
Property 2: H(X) ≤ log |X |.
It turns out that these properties indeed hold. The first is easy to prove. Using the definition of
entropy and the fact that p(X) ≤ 1, we get: $H(X) = \mathbb{E}[\log \frac{1}{p(X)}] \ge \mathbb{E}[\log 1] = 0$. The proof
of the second property is also easy once we rely on one of the most popular inequalities, called
Jensen's inequality², formally stated below.

Jensen's inequality (for a concave³ function f): $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$.
Proof: The proof is immediate for a simple binary case, say X ∈ \mathcal{X} = \{x_1, x_2\}. By
setting p(x_1) = λ and p(x_2) = 1 − λ, we get: $\mathbb{E}[X] = \lambda x_1 + (1-\lambda)x_2$ and $\mathbb{E}[f(X)] = \lambda f(x_1) + (1-\lambda)f(x_2)$. Now the definition of concavity (see the associated footnote
for the definition, or Fig. 2) completes the proof. The generalization to an arbitrary
\mathcal{X} is not that hard. The idea is induction. Try this in Problem Set 1.
Figure 2: Jensen’s inequality for a concave function: E[f (X)] ≤ f (E[X])
Using the definition of entropy and the fact that log is concave, we get:

$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log\left(\sum_{x \in \mathcal{X}} p(x) \cdot \frac{1}{p(x)}\right) = \log |\mathcal{X}|.$$
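Here is a small numerical check (illustrative only, my own sketch) that the bound H(X) ≤ log |X| holds for randomly generated distributions and is met with equality by the uniform one.

```python
import math
import random

def entropy(pmf):
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

random.seed(1)
M = 8
for _ in range(5):
    weights = [random.random() for _ in range(M)]
    total = sum(weights)
    pmf = [w / total for w in weights]
    assert entropy(pmf) <= math.log2(M) + 1e-9  # H(X) <= log|X|

print(entropy([1 / M] * M), math.log2(M))  # the uniform pmf meets the bound: 3.0 3.0
```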
Joint entropy
Now let’s move on to the entropy defined w.r.t. multiple (say two) random variables. It is said to
be the joint entropy as it involves multiple quantities, and is defined as follows: for two discrete
random variables, X ∈ X and Y ∈ Y,
$$H(X, Y) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)}.$$
We see that the only distinction w.r.t. the single random variable case is that the joint
distribution p(x, y) comes into the picture. Similarly, an alternative expression is:
$$H(X, Y) = \mathbb{E}\left[\log \frac{1}{p(X, Y)}\right].$$
²Jensen's inequality is one of the most well-known inequalities widely used in mathematics, as well as one that underlies many of the basic results in information theory.

³We say that a function f is concave if for any (x_1, x_2) and λ ∈ [0, 1], λf(x_1) + (1−λ)f(x_2) ≤ f(λx_1 + (1−λ)x_2). In other words, for a concave function, the weighted sum of the function evaluated at two points is less than or equal to the function evaluated at the weighted sum of the two points. See Fig. 2.

Chain rule
There is a key property regarding the joint entropy, called the chain rule, which shows the
relationship between the multiple random variables. For the two random variable case, it reads:

$$H(X, Y) = H(X) + H(Y|X),$$

with the conditional entropy defined as $H(Y|X) := \mathbb{E}\left[\log \frac{1}{p(Y|X)}\right]$,
where the expectation is taken over p(x, y). The proof of the chain rule is straightforward:
$$\begin{aligned}
H(X, Y) &= \mathbb{E}\left[\log \frac{1}{p(X, Y)}\right] \\
&\overset{(a)}{=} \mathbb{E}\left[\log \frac{1}{p(X)\,p(Y|X)}\right] \\
&\overset{(b)}{=} \mathbb{E}_{X,Y}\left[\log \frac{1}{p(X)}\right] + \mathbb{E}\left[\log \frac{1}{p(Y|X)}\right] \\
&\overset{(c)}{=} \mathbb{E}_{X}\left[\log \frac{1}{p(X)}\right] + \mathbb{E}\left[\log \frac{1}{p(Y|X)}\right] \\
&= H(X) + H(Y|X).
\end{aligned}$$
Lastly let me leave one remark on conditional entropy. Another way of expressing conditional
entropy (which I strongly recommend you to remember) is:
$$\begin{aligned}
H(Y|X) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x)\, H(Y|X = x).
\end{aligned}$$
Here H(Y|X = x) is the entropy of Y when X = x. So the conditional entropy
can be interpreted as a weighted sum of H(Y|X = x), which helps you remember the
formula and also provides an easy way to calculate it.
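To see these formulas in action, here is a small Python sketch (using a made-up joint distribution of my own, not from the notes) that computes H(X), H(Y|X), and H(X, Y) from a joint pmf and verifies the chain rule H(X, Y) = H(X) + H(Y|X).

```python
import math

# A small made-up joint pmf p(x, y) over x in {0, 1}, y in {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}

H_xy = sum(v * math.log2(1 / v) for v in p_xy.values())
H_x = sum(v * math.log2(1 / v) for v in p_x.values())
# H(Y|X) = sum_x p(x) H(Y|X=x), with p(y|x) = p(x,y)/p(x)
H_y_given_x = sum(
    p_x[x] * sum((p_xy[(x, y)] / p_x[x]) * math.log2(p_x[x] / p_xy[(x, y)]) for y in (0, 1))
    for x in (0, 1)
)

print(H_xy)               # joint entropy H(X, Y)
print(H_x + H_y_given_x)  # H(X) + H(Y|X): equals H(X, Y) by the chain rule
```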
Look ahead
We are now ready to prove the source coding theorem, but not yet the channel coding theorem. As
I mentioned earlier, the proof of the channel coding theorem requires another
notion: mutual information. So next time we will study the details of mutual information, ranging
from its definition to several key properties which serve to prove the theorem and also turn out
to play crucial roles in other disciplines. We will also study another key notion: the KL
divergence. Actually it is quite instrumental in proving the source coding theorem. Moreover, it
has played a significant role in other fields, in particular statistics, as it can serve as a quantity
that measures a sort of distance between distributions, which is of strong interest to that field.
EE623 Information Theory September 11, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we learned about one key notion, entropy, which plays a central role in
establishing the source coding theorem that we will investigate in depth in Part II. Specifically
we started with the definition of entropy for the single random variable case and then extended
to a general case in which more random variables are involved. We then studied one important
rule which dictates the relationship across multiple random variables: the chain rule. For two
random variables, say X and Y, the chain rule says:

$$H(X, Y) = H(X) + H(Y|X), \qquad (1)$$

where H(Y|X) denotes the conditional entropy. Remember the definition of conditional entropy:

$$H(Y|X) := \sum_{x \in \mathcal{X}} p(x)\, H(Y|X = x). \qquad (2)$$
Today’s lecture
So today we will study the details of these two notions: mutual information and the KL divergence.
Specifically, what we are going to do is fourfold. First we will study the definition of
mutual information. We will then investigate some key properties of mutual information which
turn out to play crucial roles in establishing the channel coding theorem as well as in addressing
important issues that arise in other disciplines. Next we will learn about the KL divergence
and an interesting relationship it has with mutual information. Lastly we will discuss the connection of
mutual information to the channel coding theorem.
Observation
An interesting observation we made w.r.t. the chain rule (the Venn diagram interpretation in
Fig. 1) turns out to yield a natural definition for mutual information. The interpretation was:
the randomness of two random variables X and Y (reflected in the total area of two Venn
diagrams) is the sum of the randomness of one variable, say X, (reflected in the area of the blue
Venn diagram) and the uncertainty that remains about Y conditioned on X (reflected in the
crescent-shaped red area). By the chain rule, the crescent-shaped red area can be represented
as: H(Y |X).
Here we see an overlap between the blue and red areas. The area of the overlapped part depends
on how large H(Y|X) is: the larger the overlapped area, the smaller H(Y|X). A small H(Y|X)
somehow indicates a strong dependency between X and Y . Hence, one can say that: the larger
the overlapped area, the larger dependency between the two. In view of this, the overlapped
area captures the amount of information that X and Y share in common, sort of common
information.
Key properties
As entropy has important properties (described with the non-negativity bound H(X) ≥ 0 and
the cardinality bound H(X) ≤ log |X |), mutual information has similar key properties:
Property 1: I(X; Y ) = I(Y ; X);
Property 2: I(X; Y ) ≥ 0;
Property 3: I(X; Y) = 0 ⟺ X ⊥ Y (i.e., X and Y are independent).
The first property (named the symmetry property) is obvious from the picture. The proof is also
straightforward:
$$\begin{aligned}
I(X; Y) &:= H(Y) - H(Y|X) \\
&\overset{(a)}{=} H(Y) - (H(X, Y) - H(X)) \\
&\overset{(b)}{=} H(Y) + H(X) - (H(Y) + H(X|Y)) \\
&\overset{(c)}{=} I(Y; X),
\end{aligned}$$
where (a) and (b) follow from the chain rule (1); and (c) is due to the definition of mutual
information (3).
The second property is also very much intuitive, as the mutual information captures the size of
the overlapped area and therefore it must be non-negative. But the proof is not that
straightforward. It requires a bunch of steps as well as the usage of an important inequality that we
learned last time: Jensen’s inequality. See the next section for a detailed proof.
The third property also makes sense. Why? The mutual information being 0 means no corre-
lation between X and Y , implying independence between the two. But the proof is not quite
trivial either. See the section after the next one for a proof.
Proof of I(X; Y) = 0 ⟺ X ⊥ Y
To prove this, first recall one procedure that we had while proving the second property above:
$$\begin{aligned}
I(X; Y) &:= H(Y) - H(Y|X) \\
&= \mathbb{E}\left[-\log \frac{p(Y)}{p(Y|X)}\right] \\
&\ge -\log \mathbb{E}\left[\frac{p(Y)}{p(Y|X)}\right].
\end{aligned}$$
Remember that the last inequality is due to Jensen’s inequality. As you will figure out while
solving Problem 3 in PS1, the sufficient and necessary condition for the equality to hold in the
above is:
$$\frac{p(Y)}{p(Y|X)} = c \;\;\text{(a constant)}.$$
Please check this in PS1. The condition then implies that
p(y) = cp(y|x) ∀x ∈ X , y ∈ Y.
Using the axiom of probability distribution (the sum of the probabilities being 1), we get c = 1
and therefore:
p(y) = p(y|x) ∀x ∈ X , y ∈ Y.
Due to the definition of independence between two random variables, the above implies that X
and Y are independent. Hence, this completes the proof:

$$I(X; Y) = 0 \iff X \perp Y.$$
Interpretation on I(X; Y )
Let me say a few words about I(X; Y ). Using the chain rule and the definitions of entropy and
joint entropy, one can rewrite I(X; Y ) := H(Y ) − H(Y |X) as
$$\begin{aligned}
I(X; Y) &= H(Y) + H(X) - H(X, Y) \\
&= \mathbb{E}\left[\log \frac{1}{p(Y)}\right] + \mathbb{E}\left[\log \frac{1}{p(X)}\right] - \mathbb{E}\left[\log \frac{1}{p(X, Y)}\right] \qquad (6) \\
&= \mathbb{E}\left[\log \frac{p(X, Y)}{p(X)\,p(Y)}\right].
\end{aligned}$$
This leads to the following interesting observation:
p(X, Y ) close to p(X)p(Y ) =⇒ I(X; Y ) ≈ 0;
p(X, Y ) far from p(X)p(Y ) =⇒ I(X; Y ) far above 0.
This naturally enables us to interpret mutual information as a sort of distance measure that
captures how far the joint distribution p(X, Y) is from the product distribution p(X)p(Y). In
statistics, there is a very well-known divergence measure that reflects a sort of distance between
two distributions: the KL divergence. So it turns out mutual information can be represented as
a KL divergence. Before detailing such a representation, let us first investigate the definition of
the KL divergence.
Let Z ∈ \mathcal{Z} be a discrete random variable. Consider two probability distributions w.r.t. Z: p(z)
and q(z), where z ∈ \mathcal{Z}. The KL divergence between the two distributions is defined as:

$$\mathsf{KL}(p \| q) := \sum_{z \in \mathcal{Z}} p(z) \log \frac{p(z)}{q(z)} = \mathbb{E}_{p(Z)}\left[\log \frac{p(Z)}{q(Z)}\right]. \qquad (7)$$
Going back to mutual information, take Z := (X, Y), p(z) := p(x, y), and q(z) := p(x)p(y). Then

$$I(X; Y) = \mathbb{E}\left[\log \frac{p(X, Y)}{p(X)\,p(Y)}\right] \overset{(a)}{=} \mathbb{E}_{p(Z)}\left[\log \frac{p(Z)}{q(Z)}\right] \overset{(b)}{=} \mathsf{KL}\big(p(x, y) \,\|\, p(x)p(y)\big),$$

where (a) comes from my own definition Z := (X, Y) (note that p(x)p(y) is a valid probability
distribution. Why?); and (b) is because of the definition of the KL divergence.
The first property is similar to the one for mutual information but in a completely opposite manner.
Unlike mutual information, the KL divergence is not symmetric. Notice in the definition (7)
of the KL divergence that the expectation is taken only over the first probability distribution p.
This breaks the symmetry. The second and third properties are very similar to the ones
for mutual information. It turns out the proofs are also similar. Please check this in PS2.
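The following sketch (with made-up distributions of my own) illustrates two of these points numerically: the KL divergence is not symmetric, and the mutual information of a joint pmf equals the KL divergence between the joint distribution and the product of its marginals.

```python
import math

def kl(p, q):
    """KL(p || q) in bits, for two pmfs given as aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Asymmetry: KL(p||q) and KL(q||p) generally differ
p, q = [0.8, 0.2], [0.5, 0.5]
print(kl(p, q), kl(q, p))

# Mutual information as a KL divergence: I(X;Y) = KL(p(x,y) || p(x)p(y)) >= 0
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in p_xy.items() if yy == y) for y in (0, 1)}
joint = [p_xy[(x, y)] for x in (0, 1) for y in (0, 1)]
product = [p_x[x] * p_y[y] for x in (0, 1) for y in (0, 1)]
print(kl(joint, product))
```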
Figure 2: (a) a channel with input X and output Y; (b) the binary erasure channel (BEC) with inputs {0, 1}, outputs {0, e, 1}, and erasure probability p.
In the BEC, the output is the same as the input w.p. 1 − p; otherwise, it is an erasure e, no matter what x is. In terms of
the conditional distribution, it is described as:

$$p(y|x) = \begin{cases} 1 - p, & \text{for } (x, y) = (0, 0); \\ p, & \text{for } (x, y) = (0, e); \\ p, & \text{for } (x, y) = (1, e); \\ 1 - p, & \text{for } (x, y) = (1, 1); \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$
A pictorial description of the BEC is in Fig. 2(b). Here a value placed above each arrow indicates
the transition probability for a transition reflected by the arrow.
To see the connection, let us compute mutual information between the input and the output.
As you can see, it requires a computation of the entropy H(Y ) of a ternary random variable Y .
It turns out that H(Y) is a bit complicated to compute, while a simpler calculation comes from
an alternative expression:

$$I(X; Y) = H(X) - H(X|Y). \qquad (9)$$
Now to compute H(X), we need to know p(x). However, p(x) is not given here. So let
us make a simple assumption: X is uniformly distributed, i.e. (in other words), X ∼ Bern(1/2).
Here "∼" means "is distributed according to"; Bern denotes the distribution of a binary (or
Bernoulli²) random variable; the value inside Bern(·) indicates the probability that the variable
takes the value 1, simply called the Bernoulli parameter. Assuming X ∼ Bern(1/2), the entropy of X is
simply H(X) = 1 and the conditional entropy H(X|Y) is calculated as:
$$\begin{aligned}
H(X|Y) &\overset{(a)}{=} \Pr(Y = e) H(X|Y = e) + \Pr(Y \neq e) H(X|Y \neq e) \\
&\overset{(b)}{=} \Pr(Y = e) H(X|Y = e) \\
&\overset{(c)}{=} p,
\end{aligned}$$

where (a) is due to the definition of conditional entropy; (b) follows from the fact that Y ≠ e
completely determines X (no randomness) and therefore H(X|Y ≠ e) = 0; (c) follows from the
²The Bernoulli random variable is named after a Swiss mathematician of the 1600s: Jacob Bernoulli. He
employed such a simple binary random variable to discover one of the foundational laws in mathematics, called the
Law of Large Numbers (LLN). This is the reason that such a binary random variable is also called the Bernoulli
random variable. Later we will have a chance to investigate the LLN. Please be patient until we get to that point,
unless you are already familiar with the LLN.
fact that Y = e does not provide any information about X and hence X|Y = e has exactly the
same distribution as X, so H(X|Y = e) = H(X) = 1. Applying this to (9), we get:

$$I(X; Y) = H(X) - H(X|Y) = 1 - p.$$
Now this is exactly where we can see the connection between mutual information and the capacity
C: the maximum number of bits that can be transmitted over a channel. It turns out that
I(X; Y) = 1 − p is the capacity of the BEC. Remember that we assumed a particular distribution of X
in computing I(X; Y). It turns out that for a general channel characterized by an
arbitrary p(y|x), such a p(x) plays the role of an optimization variable, and the channel capacity
is characterized as:

$$C = \max_{p(x)} I(X; Y).$$
This is the very statement of the channel coding theorem. From this, we see that mutual
information is quite instrumental to describing the theorem. Later in Part II, we will prove the
theorem.
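As a numerical check (my own sketch, not part of the notes), the snippet below builds the joint distribution of the BEC with a uniform (Bern(1/2)) input and verifies that I(X; Y) = 1 − p.

```python
import math

def bec_mutual_information(p_erase: float) -> float:
    """I(X;Y) for the binary erasure channel with a uniform (Bern(1/2)) input."""
    p_x = {0: 0.5, 1: 0.5}
    p_xy = {}  # joint pmf p(x, y) = p(x) p(y|x), with outputs in {0, 'e', 1}
    for x in (0, 1):
        p_xy[(x, x)] = p_x[x] * (1 - p_erase)
        p_xy[(x, "e")] = p_x[x] * p_erase
    p_y = {}
    for (x, y), v in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + v
    return sum(
        v * math.log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items() if v > 0
    )

for p in [0.0, 0.1, 0.3, 0.5]:
    print(p, bec_mutual_information(p))  # prints 1 - p in each case
```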
Look ahead
Next time, we will embark on Part II. And we will start proving the source coding theorem.
EE623 Information Theory September 16, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
In the first two lectures, we studied Shannon's two-stage architecture in which the encoder is
split into two parts: (i) source encoder; (ii) channel encoder. The reason that he proposed such
architecture is that he wished to convert an information source of a possibly different type into
a common currency of information: bits. He then established two theorems which deal with
efficiency of the two blocks and therefore characterize the limits on the amount of information
that one can transmit over a channel: (i) source coding theorem; (ii) channel coding theorem.
We are now ready to embark on Part II, wherein the goal is to prove the two theorems. In the
next five lectures including today’s one, we are going to prove the source coding theorem.
Today’s lecture
The source coding theorem quantifies the minimum number of bits required to represent an
information source without any information loss. The information source can be a collection of
dots and lines (in the case of Morse code), an English text, speech signals, video signals or image
pixels. So it consists of multiple components. For example, a text consists of multiple English
letters. Speech signals (waveform) contain multiple points, each indicating the magnitude of
the signal at a specific time instant. Also, the source can be viewed as random from the perspective
of a receiver who does not know the signal in advance. This leads us to model the source as a
random process: a collection of random variables. Let {Xi} be such a process. Here Xi is called a
“symbol” in the source coding context. For simplicity, we will start with a simple case in which
Xi ’s are i.i.d. (independent and identically distributed). Let X be a generic random variable
which represents every individual instance Xi . In the i.i.d. case, the source coding theorem says:
Minimum number of bits required to represent the i.i.d. source {Xi } per symbol is H(X).
Today we will attempt to prove this theorem. Once we are done with the i.i.d. case, we will
extend to a general case in which the source respects a realistic non-i.i.d. distribution.
The simple case allows us to greatly simplify notation. First, it suffices to focus on only one
symbol, say X. The encoder is nothing but a function of X. Let's call that function C.
Please don't confuse it with the same notation that we used to indicate channel capacity.
The reason that we employ the notation C is that the output C(X) is called a "codeword".
Let ℓ(X) be the length of the codeword C(X). For example, consider X ∈ {a, b, c, d} in which
C(a) = 0, C(b) = 10, C(c) = 110, C(d) = 111. In this case, ℓ(a) = 1, ℓ(b) = 2, ℓ(c) = 3, ℓ(d) = 3.
Note that ℓ(X) is a function of the random variable X, hence it is also a random variable. So
we are interested in a representative quantity of such a varying quantity, which is the expected
codeword length:

$$\mathbb{E}[\ell(X)] = \sum_{x \in \mathcal{X}} p(x) \ell(x).$$
Optimization problem
The efficiency of the source encoder is well reflected by this expected codeword length. Hence, we
wish to minimize the expected codeword length. So the optimization problem of our interest is:

$$\min_{\ell(\cdot)} \; \sum_{x \in \mathcal{X}} p(x) \ell(x). \qquad (1)$$

There are many practical ways to estimate the distribution p(x) of an information source. So let
us assume that p(x) is given. In this case, the ℓ(x)'s are the only variables that we can optimize over.
Now are we done with the problem formulation? Of course not! We should definitely worry
about the constraints that the optimization variables ℓ(x) should satisfy. What are the constraints on
ℓ(x) then? The obvious constraints are: ℓ(x) ≥ 1 and ℓ(x) ∈ ℕ. Are we now done? No! If that
were all, the solution to this problem would be trivial: it would be 1. One could set ℓ(x) = 1 for
all x to obtain 1. But this is too good to be true. In fact, there is another constraint on ℓ(x),
which concerns the condition that a valid code should satisfy.
$$C(x) = C(x') \implies x = x'.$$
of input symbols. Why? Here is a concrete example where one can see this. Suppose that the
output sequence reads:

010

Then, what is the corresponding input sequence? One possible input would be simply "b"
(C(b) = 010). But there are also other patterns that yield the same output: "ca"
(C(c)C(a) = 010) and "ad" (C(a)C(d) = 010). We have multiple candidates that agree with
the same output. This is problematic because we cannot tell which input sequence was put in.
In other words, we cannot uniquely figure out what the input is.
Example
Let me give you an example where the unique decodability condition holds:
One may wonder how to check unique decodability. Here is how. Suppose the output sequence
reads:
First we read a binary string until we find a matching codeword or a codeword which includes the
string in part. In this example, the first read must be 10 because there is only one corresponding
codeword: C(a). So, the corresponding input is “a”. What about the next read? Here an
ambiguity arises in the next two bits: 11. We have two possible candidates: (i) a matching
codeword C(c) = 11; (ii) another codeword C(d) = 110 which includes the string 11 in part.
Here the "11" is either from "c" or from "d". It looks like this code is not uniquely decodable. But
it is actually uniquely decodable: we can tell which one is correct. How? By looking at
the future string! What does this mean? Suppose we see one more bit after "11", i.e., we read
110. There is still no way to identify which is the correct one! However, suppose we see two more bits after
"11", i.e., we read 1101. We can then tell which symbol was actually put in. That is, "d"! Why?
The other possibility, "cb" (C(c)C(b) = 1100), does not agree with 1101. So it is eliminated.
We repeat this. If one can uniquely decode the input sequence in this way, then the code
is said to be uniquely decodable. Actually one can readily verify that the above mapping (3)
ensures unique decodability.
Recall our goal: finding constraints on ℓ(x) in the optimization problem (1). So we now need to
worry about what the appropriate mathematical constraint on ℓ(x) is that respects the property
of unique decodability. How do we translate the unique decodability property into a mathematical
constraint in terms of the ℓ(x)'s? It turns out the translation is a bit difficult.
But there is good news: there is an indirect and much easier way to
identify the constraint. A subclass of uniquely decodable codes, called prefix-free codes, comes to
the rescue! The prefix-free code, which we will describe in greater detail soon, satisfies the following
two properties: (i) it yields exactly the same constraint as the one that uniquely
decodable codes satisfy (meaning that if a code is uniquely decodable, then it must
respect the constraint due to prefix-free codes as well); (ii) it offers a much easier way to identify the
constraint that a valid code should satisfy.
Here the first property means that the constraint due to prefix-free codes is a necessary and
sufficient condition for a valid uniquely-decodable code. So it suffices to consider prefix-free
codes when coming up with the constraint for a valid code. The proof of the first property
is not that simple, so you will be asked to prove it in PS2. But don't worry. There will be
many subproblems which will help you prove it without much difficulty, although the proof
itself is highly non-trivial.
To understand what the second property means, we should figure out what the prefix-free code
is in detail. So I will first explain what the code is and also leave a side remark on one big
advantage over non-prefix-free yet uniquely decodable codes.
Prefix-free codes
Recall the previous example (3), in which the code is uniquely decodable. Actually this is a
perfect example that poses an issue w.r.t. decoding complexity, thereby motivating prefix-free
codes. What is that complexity issue? The issue is that, as we saw in the prior example, decoding
the second input symbol requires looking at a future string, so decoding is not instantaneous.
Actually this issue can be aggravated further. In the worst case, we can think of the following
output sequence:

1100000000000000000000000000000000000001.

In this case, to decode even the first symbol, we have to look at many future bits.
The prefix-free code that I will define soon is one in which there is no such complexity issue.
Here is an example of such a code:
One key property of this code is that no codeword is a prefix of any other codeword. This is
why the code is called prefix-free. As you can readily figure out, the code in the
previous example (3) is not prefix-free although it is indeed uniquely decodable: there exists
some codeword that is a prefix of another codeword. This was the main reason that
decoding such a code requires looking at future strings; in the worst case, we need to look
at the entire string to decode even the first input symbol. On the other hand, a prefix-free code
like (4) has no codeword that is a prefix of any other codeword. This means there is
no ambiguity in decoding, and therefore we don't need to look at any future string to decode an
input symbol. So the code is instantaneously decodable. That's why such a code is also called instantaneous.
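A prefix-free check is easy to automate. Below is a small Python helper of my own (not from the notes); the two codeword sets are illustrative — the first matches the A/C/T/G code used earlier, and the second mimics the look-ahead issue discussed above.

```python
def is_prefix_free(codewords) -> bool:
    """Return True if no codeword is a prefix of another (distinct) codeword."""
    for c1 in codewords:
        for c2 in codewords:
            if c1 != c2 and c2.startswith(c1):
                return False
    return True

# Prefix-free (instantaneous): no codeword starts another codeword
print(is_prefix_free(["0", "10", "110", "111"]))  # True

# Not prefix-free: "11" is a prefix of "110", so decoding may need look-ahead
print(is_prefix_free(["10", "00", "11", "110"]))  # False
```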
Look ahead
Recall the optimization problem (1). As I mentioned earlier, the good news is that: (i) the
constraint on ℓ(x) that prefix-free codes should satisfy is the same as that due to uniquely
decodable codes; (ii) it is easy to identify the constraint due to prefix-free codes. So next time,
we will figure out the constraint that the prefix-free code property imposes. We will then
attack the optimization problem.
EE623 Information Theory September 18, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we tried to prove the source coding theorem for i.i.d. sources. As an initial effort, we
focused on a simple symbol-by-symbol encoder setting in which the code acts on each individual
symbol independently. We then set out the goal: designing a code C such that E[ℓ(X)] is
minimized. To achieve the goal, we formulated an optimization problem:

$$\min_{\ell(\cdot)} \; \sum_{x \in \mathcal{X}} p(x) \ell(x) \quad \text{subject to some constraint on } \ell(x).$$
Here the key to solving the problem is to come up with mathematical constraints on ℓ(x) that
a valid code (fully specified by the unique-decodability property) should respect.

At the end of the last lecture, I mentioned that it is a bit difficult to come up with such
constraints directly. So we planned to take a different approach, motivated by the following facts
(which I claimed but did not prove): (1) the constraints on ℓ(x) that prefix-free codes (a subclass of
uniquely-decodable codes) satisfy are equivalent to those due to uniquely-decodable codes; (2)
deriving mathematical constraints on ℓ(x) induced by the prefix-free code property is relatively
easier. We deferred the proof of the first property to PS2.
Today’s lecture
Today, we are going to derive the constraint due to the prefix-free code property, and will attack
the optimization problem accordingly.
Next we assign to each node the sequence of binary labels along the path from the root to
that node. So 0 is assigned to the top node and 1 to the bottom. We then check whether there
is a codeword which matches one of the two sequences 0 and 1. We see that the codeword C(a)
matches the sequence 0 assigned to the top node, so we assign "a" to the top node. We repeat
this for the bottom. Note that this node has no matching codeword. If a node has no matching
codeword but there still exist unassigned codewords, then we generate two additional branches
originating from that node, with a label of 0 on the top branch and a label of 1 on the bottom
(again, the labeling could equally well be the other way around). We now assign to the top node
the sequence of binary labels on the branches connecting the root to the node: "10". To the
bottom node, we assign "11". Next we check whether there is a codeword identical either to "10"
or to "11". We see that C(b) = 10, so we assign "b" to the top node. There is no matching
codeword for "11", so we split the bottom node into two, assigning "110" to the top and "111"
to the bottom. Finally we assign "c" to the top and "d" to the bottom. See Fig. 1 for a visual
representation.
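The same tree-growing procedure is easy to automate. Here is a minimal sketch (using the same hypothetical codebook as above, which matches the codewords walked through in this construction) that records, for each codeword, the path of labels from the root to its node; the assertion checks that every codeword indeed ends up at its own node.

```python
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def build_tree(code):
    """Return {binary label sequence: symbol} for the nodes that carry codewords."""
    inverse = {w: s for s, w in code.items()}
    nodes, frontier = {}, [""]                 # start from the root (empty label sequence)
    while frontier:
        path = frontier.pop()
        for bit in ("0", "1"):                 # top branch labeled 0, bottom labeled 1
            child = path + bit
            if child in inverse:
                nodes[child] = inverse[child]  # matching codeword: assign the symbol here
            elif any(w.startswith(child) for w in code.values()):
                frontier.append(child)         # codewords remain below: keep branching
    return nodes

tree = build_tree(code)
print(tree)                                    # {'0': 'a', '10': 'b', '110': 'c', '111': 'd'}
assert set(tree) == set(code.values())
```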
Observation #1
The first observation is regarding the location of nodes to which symbols are assigned. A tree
consists of two types of nodes. One is an ending node (terminal) which has no further branch.
We call that ending node a leaf. The second is a node from which another branch generates.
We call it an internal node. Keeping these in our mind, let’s take a look at the earlier binary
code tree illustrated in Fig. 1. Notice that all codewords are assigned to leaves only. In other
words, there is no codeword that is assigned to an internal node. Can you see why that is the
case? Actually it is obvious. In a prefix-free code, there is no codeword that is a prefix of any
other codeword. If a codeword were assigned to an internal node, then we would violate the
prefix-free property, because that codeword is necessarily a prefix of some other codeword which
lives in an associated leaf below it. So the first observation that one can make about a prefix-free
code is:

Observation #1: Every codeword is assigned to a leaf; no codeword is assigned to an internal node.
Observation #2
Now let’s move on to the second observation that serves to relate the prefix-free code property
to the mathematical constraint on `(x). That is,
Observation #2: Any codeword can be mapped to a subinterval in [0, 1].
What does this mean? Let me explain what it means by drawing another picture on top of the
binary code tree in Fig. 1. Here an interval [0, 1] is associated with the entire set of codewords.
See Fig. 2. We then draw a line crossing the root, and assign the average of the two ends of
the associated interval, (0+1)/2 = 0.5, to the point where the line intersects the [0, 1] interval.
Now we assign a codeword to the [0, 0.5] interval. We see that there is only one codeword above
the root level, so we map the codeword C(a) to the [0, 0.5] interval. Below the root level,
however, there are multiple codewords. We draw another line through the centered interior node
in the bottom level, and assign (0.5+1)/2 = 0.75 to the point where the line intersects the [0.5, 1]
interval. Here we do the same thing: we assign C(b) to the [0.5, 0.75] interval. We repeat
this until we map every codeword. See the resulting picture in Fig. 2. Here we see that every
codeword is mapped into a subinterval of [0, 1]:
C(a) ↔ [0, 0.5];
C(b) ↔ [0.5, 0.75];
C(c) ↔ [0.75, 0.875];
C(d) ↔ [0.875, 1].
Figure 2: Observation #2: Any codeword can be mapped to a subinterval in [0, 1].
This naturally leads to the following two facts: (1) the subinterval size is 2^{-ℓ(x)}; (2) there is no
overlap between subintervals. The second fact comes from the first observation: an interior
node cannot be a codeword. This leads to the following constraint, known as Kraft's inequality:

$$\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \qquad (2)$$
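Here is a minimal sketch of this interval picture, using the same hypothetical codebook as above (which reproduces the subintervals listed in Fig. 2): each codeword b₁b₂…b_ℓ is mapped to the interval starting at the binary fraction 0.b₁b₂…b_ℓ with length 2^{−ℓ}; the intervals do not overlap, and their total length is the Kraft sum.

```python
from fractions import Fraction

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # hypothetical prefix-free codebook

def interval(word):
    """Map a codeword to the subinterval [0.b1b2..., 0.b1b2... + 2^-len(word)) of [0, 1]."""
    lo = sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(word))
    return lo, lo + Fraction(1, 2 ** len(word))

intervals = sorted(interval(w) for w in code.values())
print(intervals)   # intervals [0,1/2), [1/2,3/4), [3/4,7/8), [7/8,1), printed as Fractions

# Fact (2): no overlap -- each interval starts no earlier than the previous one ends.
assert all(prev_hi <= lo for (_, prev_hi), (lo, _) in zip(intervals, intervals[1:]))

# Kraft's inequality (2): the total length of the subintervals is at most 1.
kraft_sum = sum(Fraction(1, 2 ** len(w)) for w in code.values())
print(kraft_sum)   # 1 for this codebook
assert kraft_sum <= 1
```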
Optimization problem
Using Kraft’s inequality, we can now formulate the optimization problem as:
$$\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\,\ell(x) \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1,\; \ell(x) \in \mathbb{N},\; \ell(x) \ge 1.$$
Here one can ignore the constraint ℓ(x) ≥ 1. Why? Because if ℓ(x) were 0 for some symbol, then
2^{-ℓ(x)} = 1 for that symbol alone, and Kraft's inequality $\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1$
would be violated (as long as there is more than one symbol). So the simplified problem reads:
$$\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\,\ell(x) \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1,\; \ell(x) \in \mathbb{N}. \qquad (3)$$
Approximate!
The integer constraint ℓ(x) ∈ ℕ makes problem (3) non-convex and hence extremely hard to solve
exactly. What did Shannon do in order to overcome this challenge? Here is what he did. Since the
problem is extremely hard, he just wanted to say something about the solution. He believed
that although the problem is hard, it may still be doable to come up with something close
to the exact solution. To this end, what he attempted to do is to approximate the solution. In
other words, he intended to come up with upper and lower bounds (on the solution) which are
close enough to each other.
A lower bound
First consider a lower bound. Notice that the optimization problem of our interest aims to
minimize the object function. So one can easily expect that with a larger (more relaxed) search
space for `(x), the solution would be smaller than or equal to the exact solution in the original
problem! This motivated Shannon to broaden the search space as it yields a lower bound. Which
constraint did Shannon want to remove, in an effort to broaden the search space then? Shannon
was smart. Of course, he chose the integer constraint, which forms a non-convex set
and hence makes the problem quite challenging. Here is the relaxed version of the problem that
Shannon formulated:
$$L := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\,\ell(x) \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \qquad (4)$$
Look ahead
So far we have formulated the optimization problem which aims at minimizing the expected
codeword length subject to Kraft’s inequality and an integer constraint on `(x). Relaxing the
integer constraint, we translated the non-convex optimization into a tractable convex optimiza-
tion problem. Next time we will derive a lower bound by solving such convex optimization. We
will then derive an upper bound, and based on the lower and upper bounds, we will attempt to
prove the source coding theorem.
EE623 Information Theory September 23, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we formulated an optimization problem which aims at minimizing the expected code-
word length subject to Kraft's inequality (which prefix-free codes imply) and the integer constraint
on `(x) (an intrinsic constraint due to the nature of the problem). We then realized that it is
a non-convex optimization problem which is intractable in general. In an effort to make some
progress, we employed an approach that Shannon took, which is to approximate: Deriving lower
and upper bounds as close as possible. At the end of the last lecture, we introduced a trick
which allows us to come up with a lower bound. The trick is to relax constraints, i.e., to broaden
the search space, which can only decrease the minimum and hence yields a lower bound. Specifically we removed the integer
constraint which was the cause for non-convexity, and hence could translate the problem into a
tractable convex optimization problem:
$$L := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\,\ell(x) \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \qquad (1)$$
Today’s lecture
While we derived a lower bound in class, CN6 did not contain details on the derivation of the
bound, although LS6 did. So here we will provide such derivation. We will then derive an upper
bound introducing another trick. Based on the lower and upper bounds, we will later complete
the proof of the source coding theorem.
A lower bound
Actually convex optimization problems (in our case, the objective function and the constraint
functions are convex) are very well studied, so there are many ways of solving them. One of them
is a method that you already learned about in a course on Calculus: the Lagrange multiplier
method. The way it works is very simple. To describe how it
works, we first need to define the Lagrange function. The Lagrange function is defined in terms
of the optimization variables `(x)’s and some other variable called Lagrange multiplier, say λ:
$$L(\ell(x), \lambda) = \sum_{x \in \mathcal{X}} p(x)\,\ell(x) + \lambda \left( \sum_{x \in \mathcal{X}} 2^{-\ell(x)} - 1 \right).$$
The number of Lagrange multipliers is the same as the number of constraints involved. So here
we have only one. Notice that the Lagrange function is the sum of two terms: (i) the objective
function; (ii) Lagrange multiplier times the left-hand-side of the inequality constraint in the
canonical form.¹

¹ The canonical form of inequality constraints is the one in which the right-hand side is 0 and the inequality
direction is ≤.
Now then how does the Lagrange multiplier method work? We take a derivative of the Lagrange
function w.r.t. optimization variables `(x)’s. We also take a derivative w.r.t. the Lagrange
multiplier λ. Setting these derivatives to zero, we get:
$$\frac{dL(\ell(x), \lambda)}{d\ell(x)} = 0; \qquad \frac{dL(\ell(x), \lambda)}{d\lambda} = 0.$$
It turns out solving these under the constraint of λ ≥ 0 leads to the solution. This is how it
works. But in this lecture, we are not going to use this method. Why? There are two reasons:
(i) this way is a bit complicated and messy; (ii) the method is not quite intuitive as to why it
should work2 .
Instead we are going to use an alternative approach which turns out to be much simpler and
more intuitive. But don't worry - in PS3, you will have a chance to use the Lagrange multiplier
method to solve the problem. Before describing the method in detail, let us first simplify the
optimization problem further. Here one thing to notice is that we do not need to consider the
case when the strict inequality holds: $\sum_{x \in \mathcal{X}} 2^{-\ell(x)} < 1$. Here is why. Suppose there exists an
optimal solution, say ℓ*(x), such that the strict inequality holds: $\sum_{x \in \mathcal{X}} 2^{-\ell^*(x)} < 1$. Then, one
can always come up with a better solution, say ℓ'(x), such that for some x₀:

$$\ell'(x_0) < \ell^*(x_0), \qquad \ell'(x) = \ell^*(x) \ \text{for } x \ne x_0.$$

Here we reduced ℓ*(x₀) a bit for one particular symbol x₀, in an effort to increase 2^{-ℓ(x₀)} so
that $\sum_{x \in \mathcal{X}} 2^{-\ell'(x)}$ reaches 1. Note that this is indeed a better solution, as it yields a smaller
objective value due to ℓ'(x₀) < ℓ*(x₀). This is obviously a contradiction! It implies that an
optimal solution occurs only when the equality constraint holds. Hence, it suffices to consider
the equality constraint. So the optimization can be simplified as:
$$L := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\,\ell(x) \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} = 1. \qquad (2)$$
Applying the change of variable q(x) := 2^{-ℓ(x)} (so that ℓ(x) = log(1/q(x))), the problem can be rewritten as:

$$L = \min_{q(x)} \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} \quad \text{subject to} \quad \sum_{x \in \mathcal{X}} q(x) = 1,\; q(x) \ge 0. \qquad (3)$$
² Actually there is a deep underlying theorem, the so-called strong duality theorem, which proves that this
approach leads to the optimal solution under convexity conditions. Since we will not deal with the proof in this
course, it is reasonable not to take an approach which relies on such a deep theorem. But if you are interested
in the theorem, please refer to the course notes for convex optimization (see CN12) which I uploaded on the course
website. These are notes which I wrote for that course; so if you don't understand, please consult with me.
Here one thing to notice is that the constraint q(x) ≥ 0 is newly introduced because q(x) = 2^{-ℓ(x)}
is always non-negative - remember that whenever we apply a "change of variable", we should keep
track of the intrinsic constraints on the new variables which did not appear in the original
optimization problem. Now observe that $\sum_{x \in \mathcal{X}} q(x) = 1$. What does this remind you of? A probability mass function! Recall
the axiom that the pmf should satisfy (if you don’t remember, please refer to Ch 1 and 2 in
BT).
Here the objective function looks like the one that you saw earlier. That is, entropy H(X)! Now
what I claim is that the solution to the optimization problem is H(X) and the minimizer q ∗ (x)
(that minimizes the objective function) is p(x). Here is the proof.
Let us subtract H(X) from the objective function. We then get:

$$\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[ \log \frac{p(X)}{q(X)} \right].$$
Now what does this remind you of? That is the Kullback-Leibler (KL) divergence that you
encountered in Lecture 4! Applying one key fact regarding the KL divergence that it is non-
negative (this is what you will prove in one of the subproblems in PS2), one can immediately
see that the objective function is minimized as L = H(X), and the minimizer q ∗ (x) = p(x).
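As a quick sanity check (not from the notes), the following sketch compares the relaxed objective Σ p(x) log(1/q(x)) against H(X) for a hypothetical pmf p and many randomly drawn q's; consistent with the non-negativity of the KL divergence, the objective never drops below H(X) and equals it at q = p.

```python
import math
import random

p = [0.5, 0.25, 0.125, 0.125]                          # a hypothetical source pmf
H = sum(pi * math.log2(1 / pi) for pi in p)            # entropy H(X) = 1.75 bits

def objective(q):
    """Relaxed objective of (3): sum_x p(x) log2(1/q(x))."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

# Random feasible q's (q > 0, sum q = 1): the objective never drops below H(X).
for _ in range(10000):
    raw = [random.uniform(0.01, 1.0) for _ in p]
    q = [r / sum(raw) for r in raw]
    assert objective(q) >= H - 1e-9

print(H, objective(p))   # both 1.75: the minimizer is q = p and the minimum is L = H(X)
```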
An upper bound
Now let us move on to an upper-bound issue. Remember the way to come up with a lower
bound. That is to broaden the search space. Then, what is a natural corresponding way to
come up with an upper bound? The answer is: the other way around! That is to reduce the
search space. Here we will take one very naive way of reducing the search space. That is to
make one particular choice for optimization variables `(x)’s.
Now the question is: Which particular choice do we want to make for ℓ(x)? Obviously a
random choice for ℓ(x) will lead to a possibly very loose upper bound. So we need to be very
careful about the choice. One can imagine that a good choice might be similar to the optimal
solution of the relaxed optimization problem (without the integer constraint). In this regard,
the minimizer ℓ*(x) for the relaxed optimization can shed some light on this. Remember the
minimizer in that case:

$$q^*(x) = p(x), \quad \text{i.e.,} \quad \ell^*(x) = \log \frac{1}{p(x)}.$$
If `∗ (x)’s were integers, then we are more than happy as we can obtain the exact solution to
the original optimization problem. In general, however, `∗ (x)’s are not necessarily integers. One
natural choice for ℓ(x) is then to take an integer which is as close to ℓ*(x) as possible. So one
can think of two options: (i) ⌊log(1/p(x))⌋; (ii) ⌈log(1/p(x))⌉. Which one do you want to take? Actually
the first choice is not valid. Why? Think about Kraft's inequality: rounding down makes
2^{-ℓ(x)} ≥ p(x), so the sum over x can exceed 1. Hence, the natural choice
is the second one. Under that choice, we then get:
$$L^* \le \sum_{x \in \mathcal{X}} p(x) \left\lceil \log \frac{1}{p(x)} \right\rceil \le \sum_{x \in \mathcal{X}} p(x) \left( \log \frac{1}{p(x)} + 1 \right) = H(X) + 1.$$

Combining this with the lower bound derived earlier, we obtain:

$$H(X) \le L^* \le H(X) + 1.$$
First take a look at the lower bound. Note that the lower bound is tight when the log(1/p(x))'s are
integers (i.e., the p(x)'s are integer powers of 2), since then ℓ*(x) can be chosen as log(1/p(x))
without violating the integer constraint. However, this is a very particular case, because in general
p(x) is not limited to that type. As for the upper bound H(X) + 1: one thing that we can say is
that if H(X) is large enough, the gap of 1 would be negligible. However, if H(X) is comparable
to (or much smaller than) 1, then the bounds are quite loose. For instance, consider a case in
which X is a binary r.v. with Pr(X = 0) = p where p ≪ 1. In this case, H(X) is close to 0;
hence, the bounds play almost no role.
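To see the two bounds in action, here is a minimal sketch (with a hypothetical pmf, not from the notes) that computes the choice ℓ(x) = ⌈log(1/p(x))⌉, checks Kraft's inequality, and verifies H(X) ≤ E[ℓ(X)] ≤ H(X) + 1.

```python
import math

p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # a hypothetical pmf (not dyadic)

H = sum(px * math.log2(1 / px) for px in p.values())
lengths = {x: math.ceil(math.log2(1 / px)) for x, px in p.items()}  # l(x) = ceil(log 1/p(x))

kraft = sum(2 ** -l for l in lengths.values())
avg_len = sum(p[x] * lengths[x] for x in p)

print(lengths)      # {'a': 2, 'b': 2, 'c': 3, 'd': 4}
print(kraft)        # 0.6875 <= 1: Kraft holds, so a prefix-free code with these lengths exists
print(H, avg_len)   # H(X) ~= 1.85 <= E[l(X)] = 2.4 <= H(X) + 1 ~= 2.85
```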
Since the codeword length per symbol is of our interest, what we care about is the above bound
divided by n, where Zₙ := (X₁, …, Xₙ) denotes a super symbol of size n and Lₙ* the corresponding
minimum expected codeword length:

$$\frac{H(Z_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(Z_n) + 1}{n}.$$
Remember that the length n of super-symbol is of our design choice and hence we can choose it
so that it is an arbitrary integer. In an extreme case, we can make it arbitrarily large. Actually
this is exactly the way that we achieve the limit. Note that the lower and upper bounds coincide
with H(Zₙ)/n as n tends to infinity.
We are now almost done with the proof. What remains is to calculate this matching quantity.
Using the chain rule (a generalized version of the chain rule - check in PS2), we get:

$$H(Z_n) = H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \overset{(a)}{=} \sum_{i=1}^{n} H(X_i) \overset{(b)}{=} nH(X),$$

where (a) follows from the independence of (X₁, X₂, …, Xₙ) and the fact that H(X₂|X₁) =
H(X₂) when X₁ and X₂ are independent (check in PS2); and (b) is due to the fact that
(X₁, X₂, …, Xₙ) are identically distributed.
Hence,

$$\lim_{n \to \infty} \frac{L_n^*}{n} = H(X).$$
This proves the source coding theorem for i.i.d. sources: the minimum number of bits that
represents an information source per symbol on average is indeed H(X).
Look ahead
So far we have proved the source coding theorem. We formulated an optimization problem and
then showed that the solution to the optimization problem is entropy. What this result implies
is that there exist optimal codes that can achieve the limit. Actually we never talked about
explicit sequence pattern of such optimal codes. In other words, we never discussed how to design
such codes. Next time, we will discuss this issue in depth.
EE623 Information Theory September 25, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
So far we have proved the source coding theorem for the i.i.d. source case. In an effort to gain
some insights, we first considered a simple yet restricted setting in which source encoder acts
on each individual symbol, i.e., the symbol-by-symbol encoder setting. We then formulated an
optimization problem which aims at minimizing the expected codeword length under the setting.
Realizing the difficulty of such problem, we attempted to approximate the solution via deriving
reasonably tight lower and upper bounds. Applying simple yet powerful lower- and upper-bounding
tricks, we came up with bounds which differ by 1.
We then applied these bounding techniques to a general setting in which source encoder can act
now on multiple symbols, the number of which can possibly be arbitrarily large. We ended up
with:
$$\frac{H(X_1, \ldots, X_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(X_1, \ldots, X_n) + 1}{n}$$
where L∗n indicates the minimum expected codeword length w.r.t. a super symbol Zn :=
(X₁, …, Xₙ). This led us to obtain, in the limit of n:

$$\frac{L_n^*}{n} \longrightarrow H(X).$$
Today’s lecture
Here what this result implies is that there exist optimal codes that can achieve the limit. Actually
we never talked about explicit sequence pattern of such optimal codes. In other words, we never
discussed how to design such codes. In this lecture, we will discuss this issue in detail.
quantities regarding the sequence. One is the fraction of symbol "a", i.e., the number
of a's divided by n; this is the empirical mean of occurrences of symbol "a". The second is the
fraction of symbol "b", the number of b's divided by n.
To see this clearly, let's compute the probability of observing Xⁿ. Since Xⁿ is i.i.d., p(Xⁿ) is
simply the product of the individual probabilities:

$$p(X^n) = \prod_{i=1}^{n} p(X_i).$$

Here each factor p(Xᵢ) is p or 1 − p depending on the value of Xᵢ. So the result would be of the
following form:

$$\log \frac{1}{p(X^n)} = (\#\text{ of a's}) \cdot \log \frac{1}{p} + (\#\text{ of b's}) \cdot \log \frac{1}{1-p}.$$

Dividing both sides by n:

$$\frac{1}{n} \log \frac{1}{p(X^n)} = \frac{\#\text{ of a's}}{n} \cdot \log \frac{1}{p} + \frac{\#\text{ of b's}}{n} \cdot \log \frac{1}{1-p}. \qquad (1)$$
Now what can we say about this in the limit of n? Your intuition says: {# of a's}/n → p as n → ∞!
One can naturally expect that when n → ∞, the fraction would be very close to Pr(X = a).
Actually this is indeed the case. Relying on a very well-known theorem, the so-called Weak Law
of Large Numbers (WLLN¹), one can prove this. Here is the proof. Let Yᵢ = 1(Xᵢ = a), where
1{·} is an indicator function that returns 1 when the event · is true, 0 otherwise. Consider:
$$S_n := \frac{Y_1 + Y_2 + \cdots + Y_n}{n}.$$
The quantity Sₙ indicates the fraction of a's, and obviously it forms a sequence in n. Now consider
the convergence of this sequence. Remember that in the course on Calculus you learned about the
convergence of sequences; those sequences are deterministic.
But here the story is a bit different. The reason is that Sn is a random variable (not deter-
ministic). So now we need to consider the convergence w.r.t. a random process. Actually there
are multiple types of convergence w.r.t. random processes. The type of convergence that needs
to be explored in our problem context is convergence in probability. What the WLLN that I
mentioned above says is that

$$S_n \longrightarrow p \quad \text{in probability.}$$

What this means in words is very simple: Sₙ converges to p with high probability. But in
mathematics, the meaning should be rigorously stated. What it means is that for any ε > 0,

$$\Pr\left( |S_n - p| \le \epsilon \right) \longrightarrow 1 \quad \text{as } n \to \infty.$$
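As an aside, the convergence-in-probability statement is easy to visualize with a quick Monte Carlo sketch (hypothetical p, ε, and sample sizes): the estimated Pr(|Sₙ − p| ≤ ε) climbs toward 1 as n grows.

```python
import random

p, eps, trials = 0.3, 0.05, 1000     # hypothetical values for illustration

def prob_within(n):
    """Estimate Pr(|S_n - p| <= eps) by Monte Carlo over `trials` blocks of length n."""
    hits = 0
    for _ in range(trials):
        s_n = sum(random.random() < p for _ in range(n)) / n   # empirical fraction of a's
        hits += abs(s_n - p) <= eps
    return hits / trials

for n in (10, 100, 1000, 5000):
    print(n, prob_within(n))         # the estimate approaches 1 as n grows (WLLN)
```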
Now applying the WLLN to {# of a's}/n and {# of b's}/n, we get: for any ε₁, ε₂ > 0,

$$\frac{\#\text{ of a's}}{n} \ \text{lies within}\ p \pm \epsilon_1, \qquad \frac{\#\text{ of b's}}{n} \ \text{lies within}\ (1-p) \pm \epsilon_2,$$

as n → ∞. Applying this to (1), we can then say that in the limit of n,

$$\frac{1}{n} \log \frac{1}{p(X^n)} = (p \pm \epsilon_1) \log \frac{1}{p} + (1 - p \pm \epsilon_2) \log \frac{1}{1-p} = H(X) \pm \epsilon.$$

Here the key observation is that for very large n, ε can be made arbitrarily close to 0 and thus
p(Xⁿ) is almost the same for all such Xⁿ. We call such sequences typical sequences. The set
that contains typical sequences is said to be a typical set, formally defined as below:

$$A_\epsilon^{(n)} := \{ x^n : 2^{-n(H(p)+\epsilon)} \le p(x^n) \le 2^{-n(H(p)-\epsilon)} \}.$$
Notice that p(Xⁿ) ≈ 2^{-nH(X)} for every typical sequence, i.e., the typical sequences are almost
uniformly distributed. It turns out this property plays a crucial role in designing optimal codes.
In the sequel, we will use this property to design such codes.
So one can consider a binary code tree in which the depth of the tree is roughly nH(X) and
codewords are assigned to leaves. Recall for a prefix-free code that codewords live only in leaves.
Now let’s check if we can map all the possible sequence patterns of X n into leaves. To this
end, we need to check two values: (1) total number of leaves; (2) total number of possible
input sequences. First of all, the total number of leaves is roughly 2nH(X) . Now what about
the second value? Obviously the total number of input sequence patterns is 2n because each
symbol can take on one of the two values (“a” and “b”) and we have n of them. But here one
thing that we have to keep in our mind is that in the limit of n, the sequence X n behaves in a
particular manner, more specifically, in a manner that p(xn ) ≈ 2−nH(X) ; hence, the number of
such sequences is not the maximum possible value of 2n . Then, now the question is: how many
sequences such that p(xn ) ≈ 2−nH(X) ? To see this, consider:
$$\sum_{x^n : p(x^n) \approx 2^{-nH(X)}} p(x^n) \approx \left| \{ x^n : p(x^n) \approx 2^{-nH(X)} \} \right| \times 2^{-nH(X)}.$$
Obviously, the aggregation of all the probabilities of such sequences cannot exceed 1; hence,

$$\left| \{ x^n : p(x^n) \approx 2^{-nH(X)} \} \right| \lesssim 2^{nH(X)}.$$

Note that the cardinality of this set does not exceed the total number of leaves ≈ 2^{nH(X)}. So
by using part of the leaves, we can map all of such sequences. This completes the design of the
source code. See Fig. 1. Finally let's check that this code indeed achieves the fundamental limit
of H(X). Notice that the length of every codeword is ≈ nH(X), so the expected codeword length
per symbol would be ≈ nH(X)/n = H(X).
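For concreteness, here is a small enumeration sketch (hypothetical parameters p, n, ε; brute-force enumeration is feasible only for small n) that counts the sequences satisfying the typical-set condition and compares the count with 2ⁿ and 2^{nH(X)}.

```python
import itertools
import math

p, n, eps = 0.2, 20, 0.12          # hypothetical Bern(p) source and block length
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # H(X) ~= 0.722 bits

def prob(seq):
    """p(x^n) for an i.i.d. Bern(p) sequence of 0/1 symbols (1 plays the role of 'a')."""
    k = sum(seq)
    return (p ** k) * ((1 - p) ** (n - k))

typical = [x for x in itertools.product((0, 1), repeat=n)
           if 2 ** (-n * (H + eps)) <= prob(x) <= 2 ** (-n * (H - eps))]

print(len(typical), 2 ** n, 2 ** (n * H))
# About 21,000 typical sequences here: a tiny fraction of the 2^20 ~= 1.05e6 possible
# sequences, and roughly comparable to 2^{nH(X)} ~= 2.2e4, so ~2^{nH(X)} leaves suffice.
```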
The same convergence, (1/n) log(1/p(Xⁿ)) → H(X) in probability, holds although X is an arbitrary
r.v., not limited to the binary one. Please check that this is indeed the case in PS3. Roughly
speaking, this implies that p(Xⁿ) ≈ 2^{-nH(X)} w.h.p. - the sequences we actually observe are
asymptotically equally probable. So using this and applying the code design rule based on
a binary code tree, we can easily construct an optimal source code - every input sequence is
mapped to a leaf in a binary code tree with depth ≈ nH(X) and hence, the expected codeword
length per symbol is H(X). The sequence pattern of a codeword is determined by which leaf
the codeword is assigned to.
Actually we can readily answer these questions by applying the bounding techniques that we
learned. Suppose we employ a super symbol-based source code of the super-symbol size n. Let
L∗n = E [`(X n )]. Applying exactly the same lower-&-upper bound techniques that we learned,
one can show that
$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(X_1, X_2, \ldots, X_n) + 1}{n}.$$
Here what we can say is that the expected codeword length per symbol is definitely related to
the following quantity:
$$\lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}.$$
Now the question is: Does the limit exist? If the limit exists, then we are done! The answer is:
Not always. Actually there are some very contrived examples where the limit does not exist -
one such example is on page 75 of CT. But the good news is: in many practically-relevant cases
of our interest, the limit does exist. So let’s consider only these cases where the limit exists.
Then, we can state the source coding theorem as follows:
$$\text{Minimum \# of bits that represent a general source per symbol} = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}.$$
Actually there is a terminology which indicates such limit. To understand the rationale behind
the terminology that you will see soon, let’s consider a plot in which x-axis and y-axis indicate
n and H(X1 , X2 , . . . , Xn ) respectively. See Fig. 2. What the above limit means is the slope. In
other words, it means the growth rate of the sequence uncertainty w.r.t. n. Hence, it is called
the entropy rate.
Look ahead
In the latter part of this lecture, we have proved the source coding theorem for general in-
formation source (not limited to the i.i.d. case): the minimum number of bits that represent
information source per symbol is entropy rate. Next time, we will explore how to construct
optimal codes that achieve such limit.
EE623 Information Theory September 30, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we studied how to construct an optimal code that can achieve H(X) promised by the
source coding theorem for the i.i.d source case. The construction was based on a super-symbol-
based approach in which the source encoder acts on multiple input symbols, say n symbols,
and the parameter n is of our design choice. The idea of the construction was to make the
super-symbol size n arbitrarily large. In the interested regime of large n, what we found with
the help of the WLLN is:
$$\frac{1}{n} \log \frac{1}{p(X^n)} \ \overset{\text{in prob.}}{\longrightarrow}\ H(X), \qquad (1)$$
i.e., (1/n) log(1/p(Xⁿ)) lies between H(X) − ε and H(X) + ε for any ε > 0, as n tends to infinity.
Inspired by the fact that the codeword-length solution for the lower bound in the optimization
problem that we formulated earlier is ℓ*(xⁿ) = log(1/p(xⁿ)), we focused on the quantity 1/p(Xⁿ).
From (1), we observed that in the limit of n, the quantity becomes:
$$\log \frac{1}{p(X^n)} \approx nH(X).$$
This then motivated us to consider a prefix-free code in which an input sequence xⁿ is mapped
to one of the leaves that reside in the level with the tree depth of ≈ nH(X). This ensures the
expected codeword length per symbol to be ≈ H(X), matching with the promised limit. We
also checked that the number of possible input sequences such that p(xn ) ≈ 2−nH(X) is less than
the total number 2nH(X) of leaves. This ensured a mapping for every such input sequence.
Today’s lecture
We have thus far considered only i.i.d. sources. However, in practice, many of the information
sources are far from i.i.d. Remember the English text example that we introduced last time. So
a natural question arises: What about for general non-i.i.d. sources? What is the corresponding
source coding theorem? How to design optimal codes? In this lecture, we will address these
questions.
Here what we can say is that the expected codeword length per symbol is definitely related to
the following quantity:
$$\lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}.$$
Now the question is: Does the limit exist? If the limit exists, then we are done! However, the
answer is: Not always. Actually there are some very contrived examples where the limit does
not exist - one such example is on page 75 of CT. But there is good news: in many
practically-relevant cases of our interest, the limit does exist. So let's consider only
these cases where the limit exists. We can then state the source coding theorem as follows:
$$\text{Minimum \# of bits that represent a general source per symbol} = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}.$$
Actually there is a terminology which indicates such limit. To understand the rationale behind
the terminology that you will see soon, let’s consider a plot in which x-axis and y-axis indicate
n and H(X1 , X2 , . . . , Xn ) respectively. See Fig. 1. What the above limit means is the slope. In
other words, it means the growth rate of the sequence uncertainty w.r.t. n. Hence, it is called
the entropy rate.
Stationary process
Now how to compute the entropy rate? It turns out in many practical information sources, the
entropy rate is easy to compute. One such practical example is the one in which information
source is a stationary process. Here is the definition of the stationary process. We say that a
random process is stationary if {Xi } is a statistical copy (e.g., having the same joint distribution)
of its shifted version {Xi+` } for any non-negative integer `. One concrete example is again an
English text. Notice that the statistics of 10-year-old text would be almost the same as that of
a nowaday’s text; for instance, the frequency of “the” in an old text would be roughly the same
as that of a current text.
As claimed earlier, the limit can be simplified a bit further when specialized to the stationary
process case. To see this, we consider:
$$H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \ldots, X_{n-1}) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) = \sum_{i=1}^{n} H(X_i \mid X^{i-1})$$
where the first equality is due to the chain rule and the last comes from the shorthand notation
that we introduced earlier: X i−1 := (X1 , . . . , Xi−1 ). Let ai = H(Xi |X i−1 ). Now we can see
two properties of the deterministic sequence {ai }. First it is non-negative. Second, it is non-
increasing, i.e., a_{i+1} ≤ a_i. To see this, consider:

$$a_{i+1} = H(X_{i+1} \mid X_1, \ldots, X_i) \le H(X_{i+1} \mid X_2, \ldots, X_i) = H(X_i \mid X_1, \ldots, X_{i-1}) = a_i,$$

where the inequality follows from the fact that conditioning reduces entropy and the second-last
equality is due to the stationarity of the process. These two properties imply that the
deterministic sequence {ai } has a limit. Why? Please think about the definition of the existence
of a limit that you learned, perhaps from the course on Calculus or somewhere else (a non-negative,
non-increasing sequence is bounded and monotone, hence convergent). Otherwise, look it up on Wikipedia.
Now let’s consider
$$\frac{H(X_1, \ldots, X_n)}{n} = \frac{1}{n} \sum_{i=1}^{n} a_i.$$
Notice that this quantity indicates the running average of the sequence {ai }. Keeping in your
mind that {ai } has a limit, what your intuition says is that the running average will converge to
the limit as well because almost all the components that form the running average will converge
to the limit. It turns out this is indeed the case. So the entropy rate of the stationary process
is:

$$\lim_{n \to \infty} \frac{H(X_1, \ldots, X_n)}{n} = \lim_{n \to \infty} H(X_n \mid X^{n-1}). \qquad (3)$$
The proof of this is not that simple, requiring some non-trivial tricks. So we will not deal with
the proof here. But if you are interested, I am happy to issue a problem for the proof in a later
homework, say PS4 or PS5. Please let me know your interest in class or via any communication
route. Recalling what we learned last time, what (3) suggests is that the optimal code assigns
the same codeword length to every (typical) sequence, and that length should read roughly nH(𝒳),
where H(𝒳) denotes the entropy rate. The sequence patterns will be determined by the binary
code tree of depth ≈ nH(𝒳), in which codewords are mapped to leaves.
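To make the sequence aᵢ = H(Xᵢ|X^{i−1}) concrete, here is a small sketch for a hypothetical stationary binary Markov source (transition matrix P and stationary distribution π chosen purely for illustration): it computes aᵢ via the chain rule and shows the non-increasing sequence settling at its limit, the entropy rate.

```python
import itertools
import math

# A hypothetical stationary binary Markov source: rows of P are transition pmfs,
# and pi solves pi P = pi, so the process started from pi is stationary.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]

def joint(seq):
    """p(x_1, ..., x_n) for the stationary Markov chain."""
    pr = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        pr *= P[a][b]
    return pr

def H(n):
    """Joint entropy H(X_1, ..., X_n) by brute-force enumeration (small n only)."""
    return -sum(joint(s) * math.log2(joint(s)) for s in itertools.product((0, 1), repeat=n))

def a_i(i):
    """a_i = H(X_i | X^{i-1}) = H(X^i) - H(X^{i-1}) by the chain rule."""
    return H(i) - H(i - 1) if i > 1 else H(1)

print([round(a_i(i), 4) for i in range(1, 7)])
# a_1 = H(X_1) ~= 0.8113, and a_i ~= 0.5721 = H(X_2 | X_1) for all i >= 2:
# the non-increasing sequence converges, and its limit is the entropy rate.
```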
In practice, however, we cannot take n arbitrarily large, for two reasons. One is that the
hardware that we will employ for the use of a code obviously cannot support an infinite size of
n. The other is that the amount of information source available is definitely limited, as pointed
out by a student in class. So one natural question that we can ask in this context is then: What
if n is finite? What are optimal codes for finite n? Actually this is exactly the question that
Shannon raised right after establishing the source coding theorem. This question was also shared
with a bunch of professors at MIT at the time, including Prof. Robert Fano. Prof. Fano in turn
shared the question with the students who took the information theory course that he taught at
that time.
As I mentioned earlier, for the finite n case, we need to solve a non-convex optimization problem
in order to give a concrete answer. I told you that this problem is extremely hard, and a closed-
form solution to the limit has been open thus far. So Prof. Fano never expected that any one
would come up with a solution to this very difficult problem.
But surprisingly, one of the students in the class, named David Huffman, came up with a very
simple and straightforward algorithm which leads to the optimal code. Huffman did not provide
the exact closed-form solution to the limit, but instead provided a very explicit design rule, i.e.,
an algorithm, which leads to the optimal code. This code is called the Huffman code. One
amazing fact is that he did this as a term project in the class, implying that you may also be
able to do something great right now!
While the Huffman code provides the optimal code in a practical scenario where n is finite, it
still has limitations in implementation. One such limitation is that the codes require knowledge
of source statistics. In other words, to design Huffman code, we need to know about the joint
distribution of a super symbol: p(x1 , x2 , . . . , xn ). But in practice, it may not be that easy to
obtain the statistical knowledge. So one follow-up question that arises in this more practical
scenario is: What if the source statistics are unknown? Are there any optimal codes that require
no prior knowledge of the source statistics?
A universal code
Two very smart people answered this question: Abraham Lempel and Jacob Ziv. They developed
a universal code that does not require any statistical knowledge of the information source. This
code is called the Lempel-Ziv code. Since this code can be applied to any type of information source
while still approaching the limit (entropy rate), many system designers were very interested in
implementing their code. It turns out the code was implemented in a variety of systems and
utilities, such as gzip, pkzip, and UNIX compress.
The idea of the Lempel-Ziv code is very simple. The idea is the one that many people in our
daily life are using. Let us explain the idea in one specific context: “closing email phrases”. As
a closing phrase, people usually use the following:
Actually if one intends to represent such an English phrase as it is, then we may need a bunch of
bits, whose number is definitely larger than the number of letters in the phrase. But the
Lempel-Ziv code says that we can do much better than this. The idea is based on a dictionary.
One noticeable property is that we use only a small number of closing phrases, because people
usually use very similar phrases. Inspired by this, one can imagine a dictionary that maps each
phrase to an index like 1, 2, 3. This dictionary forms the basis of the Lempel-Ziv code.
Roughly speaking, here is how the Lempel-Ziv code works. First of all, we make a dictionary
from some pilot sequence which is part of the entire sequence. We then share this dictionary
between the source encoder and decoder. The source encoder encodes only indices, and the decoder
decodes the indices using the shared dictionary. This way, we do not need to know the statistics
of the information source. Also it has been shown that this code can achieve the limit (entropy
rate) as the dictionary size increases.
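To give a flavor of the dictionary idea, here is a toy LZ78-style parser (one simple member of the Lempel-Ziv family, not the exact scheme sketched above; in particular it builds the dictionary on the fly rather than from a pilot sequence): repeated phrases are replaced by (dictionary index, new symbol) pairs.

```python
def lz78_parse(s):
    """Toy LZ78-style parsing: emit (index of longest known phrase, next symbol) pairs."""
    dictionary = {"": 0}     # index 0 = empty phrase
    out, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                           # keep extending a known phrase
        else:
            out.append((dictionary[phrase], ch))   # longest known prefix + one new symbol
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        out.append((dictionary[phrase], ""))
    return out

print(lz78_parse("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]:
# 12 symbols become 6 pairs; repeated phrases are referenced by dictionary index.
```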
In the interest of time, we will not cover more details on the Huffman code and the Lempel-Ziv
code. For those who are interested, I will upload supplementary material on the Huffman code.
Unfortunately, I don't have a course note for the Lempel-Ziv code, so if you are interested,
please look it up on Wikipedia.
Look ahead
So far we have proved the source coding theorem for general information source, and learned
about how to design optimal codes that achieve the limit. We also briefly overviewed some
practical codes such as the Huffman code and the Lempel-Ziv code. Next time, we will move
onto another core topic in Part II: the channel coding theorem.
EE623 Information Theory October 2, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
In the first two lectures, we learned about the two-stage communication architecture that Shan-
non developed. The role of the first stage that we call source coding is to convert an information
source (which is possibly of a different type) into a common currency of information, bits. The
role of the second stage that we call channel coding is to convert the bits into a signal so that
it can be reliably transmitted over a channel to the receiver. During the past five lectures,
we studied the source coding theorem which quantifies the fundamental limit on the number of
bits required to represent an information source without loss of meaning. In the upcoming five
lectures (including today’s one), we will study the channel coding theorem which quantifies the
fundamental limit on the number of bits that can be reliably transmitted over a channel.
Today’s lecture
Remember the verbal statement of the channel coding theorem that we made in Lecture 2. Like a
law in physics, there is a law in communication systems: Given a channel, the maximum number
of bits that can be transmitted over a channel is determined, regardless of any operations done
at the transmitter and the receiver. In other words, there is a fundamental limit on the number
of bits, under which communication is possible and above which communication is impossible,
no matter what we do. The limit is the so-called channel capacity. In today's
lecture, we will study what this statement means in a more explicit and precise manner. To this
end, we will study the following: (i) a specific problem setup that Shannon considered; (ii) a
mathematical model of channels; (iii) precise meaning of possible (or impossible) communication.
Once the statement becomes clearer, we will later prove the theorem.
Problem setup
The goal of the communication problem is to deliver an information of binary string type (bits)
to the receiver as much as possible. Let’s start by investigating the statistics of the binary string.
Fortunately, unlike the information source in the source coding context, one can make a very
simple assumption on the statistics of the binary string.
To see this, let’s recall the source coding setup that we considered in the past. Notice that
an information source is of arbitrary statistics depending on applications of our interest. So
the source code design was tailored to the statistics of the information source. On the other
hand, we have good news in the channel coding part. The good news is that the statistics
of the input binary string is not arbitrary - it follows a particular distribution under a reasonable
assumption. Here the reasonable assumption is that we use an optimal source code. Now then
can you see why the binary string is of a particular statistics?
To see this, let’s recall the source coding theorem that we proved. Let the binary string
(b1 , b2 , . . . , bm ) be an output of an optimal source encoder. Then, using the source coding
theorem, we get:
$$m = H(\text{information source}) = H(b_1, b_2, \ldots, b_m), \qquad (1)$$
where the second equality follows from the fact that a source encoder is a one-to-one mapping and
the fact that H(X) = H(f(X)) for a one-to-one function f (Why?). Now observe that
(b1 , . . . , bm ) = (0, 0, . . . , 0) −→ W = 1;
(b1 , . . . , bm ) = (0, 0, . . . , 1) −→ W = 2;
..
.
(b1 , . . . , bm ) = (1, 1, . . . , 1) −→ W = 2m .
This allows us to express the input with a single r.v. W, and the distribution of W is
uniform. Why uniform? Because H(b₁, …, bₘ) = m is possible only when the m bits are i.i.d.
∼ Bern(1/2), in which case each of the 2^m values of W is equally likely.
Things to design
The design of the digital communication system means the design of two things: (i) channel
encoder, say f ; (ii) channel decoder, say g. See Fig. 1. One crucial thing to consider in designing
the encoder and the decoder is the behavior of channel. As mentioned earlier, the channel is
an enemy against the communication system because it induces errors, making the system a
random function. So the encoder and the decoder should be designed so as to protect against
such errors. A natural way to protect against the errors is to add some redundancy. Notice
that in everyday conversation we say "(say that) again" when communication is not successful.
Inspired by this, we can think of a sequence of symbols instead of a single quantity for the channel encoder
output, say (X1 , X2 , . . . , Xn ). Here we use a shorthand notation X n := (X1 , X2 , . . . , Xn ) for
simplicity, and n is called code length. Note here that n has nothing to do with the super-symbol
size employed in the source coding context. Denote by Y n the output of the channel. The goal
of the decoder g is to infer W given the channel output Y n . Here as we can easily imagine,
there is a challenge in decoding W . The challenge is that Y n is not a deterministic function of
X n . Actually if the channel were deterministic, then the function g is nothing but an inverse
function of the concatenation of the two: f and channel. But the problem is that the channel is
not deterministic. So the design of g is not that simple. Actually the design of g requires a deep
understanding of how the channel behaves. So before getting into details as to how to design f
and g, let’s briefly talk about how the channel behaves as well as how to model it.
Channel modeling
Let Xi and Yi be input and output of the channel at time i, respectively. One typical way to
capture the randomness of the channel is via a conditional distribution: p(y|x). To give you an
idea, let us consider one concrete example: binary symmetric channel (BSC). The other example
(Fig. 1: block diagram of the channel encoder, the channel, and the channel decoder.)
that one can think of is: binary erasure channel (BEC). But this is the one that we studied in
the past (Lecture 4). So let’s consider only the BSC.
In the BSC, Yᵢ is a flipped version of Xᵢ with some probability, say p (called the crossover
probability): p(y|x) = p when y ≠ x. So the relation between Xᵢ and Yᵢ is captured by the
following:

$$Y_i = \begin{cases} X_i, & \text{w.p. } 1 - p; \\ X_i \oplus 1, & \text{w.p. } p. \end{cases}$$

An equivalent way to write this is

$$Y_i = X_i \oplus Z_i$$

where Zᵢ ∼ Bern(p). Usually we consider a so-called memoryless channel in which the noises
are independent across different time instants. So one natural assumption is that the Zᵢ's are i.i.d.
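A minimal simulation of this memoryless BSC model (with hypothetical parameters):

```python
import random

def bsc(x_bits, p):
    """Memoryless BSC: flip each input bit independently with probability p."""
    return [x ^ (random.random() < p) for x in x_bits]

random.seed(0)
x = [0, 1, 1, 0, 1, 0, 0, 1] * 1000
y = bsc(x, p=0.1)
flips = sum(xi != yi for xi, yi in zip(x, y))
print(flips / len(x))   # empirical crossover fraction, close to p = 0.1 (by the LLN)
```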
The first performance metric is the data rate: with the message W taking one of M values and
the channel used n times,

$$R = \frac{\log M}{n} \ \text{bits/channel use}.$$
The second performance metric is the one which captures the decoding quality. Since the channel
is random, we cannot always guarantee the decoded message (say Ŵ ) to be the same as the
original message W . So the decoding quality can be captured by the probability of an error
event:
$$P_e := \Pr\{W \ne \hat{W}\}.$$
Optimization problem
With these two performance metrics, Shannon came up with a very natural optimization problem
to be explained soon. He then came up with a concept of possible communication. Let us first
take a look at the optimization problem. One can easily expect that there should be tradeoff
relationship between R and Pe . The larger R, the larger Pe and vice versa. So one natural
optimization problem is: Given R and n:
$$P_e^* := \min_{f, g} P_e.$$
What Shannon realized is that unfortunately, this is again a non-convex optimization problem,
being very difficult to solve. So Shannon took a different approach, as he did for the source
coding theorem. This is, to approximate! In other words, he attempted to develop upper and
lower bounds on Pe∗ .
In the process of doing this, he discovered something very interesting and came up with a concept
of possible communication. What he found is that when R is below some value, an upper bound
on Pₑ* can be made arbitrarily close to 0 as n increases. This implies that in that case, the actual
Pₑ* can be made very small. He also found that when R is above some value, a lower bound
on Pₑ* cannot be made arbitrarily close to 0 no matter what we do. This implies that in that
case, the actual Pₑ* cannot be made very small.
From this observation, he then came up with the concept of possible communication. We say
that communication is possible if one can make Pe∗ arbitrarily close to 0 as n → ∞; otherwise
communication is said to be impossible. He also came up with a follow-up natural concept of
achievable rate. We say that data rate R is achievable if we can make Pe∗ → 0 as n → ∞ given
R; otherwise R is said to be not achievable.
Now from the next lecture on, we will try to prove this theorem. Specifically we will prove two
things:
$$R \le C \implies P_e \to 0; \qquad R > C \implies P_e \nrightarrow 0.$$
The first is called the achievability proof or the direct proof. The second is called the converse
proof. Notice that the contraposition of the second statement is:
$$P_e \to 0 \implies R \le C.$$
Looking ahead
Next time, we will prove the achievability for a simple example. Gaining insights from the
example, we will later prove the achievability for a general channel specified by an arbitrary
conditional distribution p(y|x).
EE623 Information Theory October 7, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we introduced the channel coding problem setup. Specifically we came up with a
precise mathematical statement of channel coding theorem which was only vaguely stated in
the beginning of this course. To this end, we first took into account two performance metrics:
(i) data rate R := (log M)/n (bits/channel use); (ii) probability of error Pₑ := Pr(W ≠ Ŵ). Here
the message W ∈ {1, 2, . . . , M} and n indicates the code length (the number of channel uses). In
an effort to understand the tradeoff relationship between the two performance metrics, we then
considered the following optimization problem: Given R and n:
$$P_e^* = \min_{f, g} P_e$$
where f and g indicate channel encoder and decoder respectively. Unfortunately, Shannon could
not solve this problem. Instead he looked into upper and lower bounds. In the process of doing
this, he made a very interesting observation: If R is less than a threshold, say C, then Pₑ can
be made arbitrarily close to 0 as n → ∞; otherwise, Pₑ ↛ 0 no matter what we do. This leads
to a natural concept of achievable rate: data rate R is said to be achievable if given R we can
make Pₑ → 0. This finally formed the channel coding theorem: R is achievable if and only if
R ≤ C, where C is the channel capacity.
The channel coding theorem requires the proof of two parts. The first one is: R ≤ C =⇒ Pe → 0.
To this end, we need to come up with an achievable scheme such that given R ≤ C, one can
make Pe → 0. Hence, the proof is so called the achievability proof. The second part to prove
is: R > C =⇒ Pₑ ↛ 0. Notice that the contraposition of this statement is: Pₑ → 0 =⇒ R ≤ C,
which is exactly the opposite direction of the statement in the first part. Hence, it is the so-called
converse proof.
Today’s lecture
In today’s lecture, we will attempt to prove the achievability. Specifically we will try to gain
some insights from a simple channel example: the binary erasure channel (BEC). It turns out
the BEC provides significant insights into proving the achievability for arbitrary channels. Once
we are equipped with enough insights, we will later attack a fairly general channel model setting.
In the BEC, each transmitted bit is either received correctly or erased:

$$Y_i = \begin{cases} X_i, & \text{w.p. } 1 - p; \\ e, & \text{w.p. } p, \end{cases}$$

where e stands for "erasure" (garbage) and p indicates the erasure probability. As mentioned
earlier, we consider a memoryless channel in which the noises are independent across different
time instances.
First of all, let’s guess what the channel capacity C is. Actually one can make a naive guess
based on the following observation: the channel is perfect w.p. 1 − p; erased w.p. p; and the
erasure (garbage) does not provide any information w.r.t. what we transmit. This naturally
leads to: C is at most 1 − p. Then, a question arises: Is this achievable? Let’s think about
a very easy case in which the transmitter knows all erasure patterns across all time instances
beforehand. Of course this is far from reality. But just for simplicity, let’s think about this
case for the time being. In this case, one can readily achieve 1 − p. The achievable scheme is:
Transmit a bit whenever the channel is perfect. Since the perfect-channel probability is 1 − p,
by the Law of Large Numbers (LLN), we can achieve the rate 1 − p as n tends to infinity.
Now let’s consider the realistic scenario in which we cannot predict the future events and thus
the transmitter has no idea of erasure patterns. In this case, we cannot apply the above naive
transmission scheme because each transmission of a bit is not guaranteed to be successful due
to the lack of the knowledge of erasure patterns. Perhaps one may imagine that it is impossible
to achieve 1 − p. But it turns out that very interestingly one can still achieve 1 − p even in this
realistic setup.
How to encode?
Here is a transmission scheme. First fix R arbitrarily close to 1 − p. Now given R, what is
M (the cardinality of the range of the message W)? Since R := (log M)/n, one needs to set
M = 2^{nR}. So the message W takes one of the values 1, 2, . . . , 2^{nR}.
Next, how to encode the message W ? In other words, what is a mapping rule between W and
X n ? Here we call X n codeword 1 . It turns out Shannon’s encoding rule, to be explained in
the sequel, enables us to achieve R = 1 − p. Shannon’s encoding is extremely simple - it looks
even dumb! (You'll see why soon.) The idea is to generate a random binary string for each value
W = w. In other words, for any W = w, every component of Xⁿ(w) is a binary r.v. with parameter
1/2, and the components are independent of each other, i.e., the Xᵢ(w)'s are i.i.d. ∼ Bern(1/2)
across all i. They are i.i.d. also across all w's. This looks like a really dumb scheme. But
surprisingly Shannon showed that this dumb scheme can achieve the rate of 1 − p. There is a
terminology for the collection (book) of the Xᵢ(w)'s: it is called the codebook.
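Here is a minimal sketch of this random codebook generation (with a hypothetical small (n, R), purely for illustration):

```python
import random

def random_codebook(n, R):
    """Draw X^n(w) i.i.d. Bern(1/2) for every message w in {1, ..., 2^{nR}}."""
    M = 2 ** int(n * R)                 # number of messages
    return {w: [random.randint(0, 1) for _ in range(n)] for w in range(1, M + 1)}

random.seed(1)
codebook = random_codebook(n=8, R=0.5)   # 2^{nR} = 16 messages, codewords of length 8
for w, xn in codebook.items():
    print(w, xn)
```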
How to decode?
Let’s move onto the decoder side. Obviously the decoder input is a received signal Y n (channel
output). Here one can make a reasonable assumption on the decoder: the codebook (the collec-
tion of Xi (w)’s) is known. This assumption is realistic because this information is the one that
suffices to share only at the beginning of communication. Once this information is shared, we
can use this all the time until communication is terminated.
Now the question is: How to decode W from Y n , assuming that the codebook is known? What
is an optimal way of decoding W ? Remember the second performance metric: the probability
of error Pₑ := Pr(W ≠ Ŵ). Since the channel is not deterministic (i.e., it is random), the best
thing we can do is to make the probability of error as small as possible.
So the optimal decoder can be defined as the one that minimizes the probability of error. An
alternative way to define the optimal decoder is: Optimal decoder is the one that maximizes the
success probability: Pr(W = Ŵ ). But here the received signal is given because it is an input to
the decoder. So the success probability should be a conditional probability:
Pr(W = Ŵ |Y n = y n ).
¹ The word "codeword" was used to indicate the output of the source encoder in prior lectures. It is a sort of
convention to use the same word also to indicate the output of the channel encoder.
Then, a more formal definition of the optimal decoder is:
$$\hat{W} = \arg\max_{w} \Pr(W = w \mid Y^n = y^n).$$
Actually there is a terminology which indicates such optimal decoder. Notice that Pr(W =
w|Y n = y n ) is the probability after an observation y n is made; and Pr(W = w) is so called the
a priori probability because it is the one known beforehand. Hence, Pr(W = w|Y n = y n ) is
called the a posteriori probability. Observe that the optimal decoder is the one that Maximizes
A Posteriori (MAP) probability. So it is called the MAP decoder. The MAP decoder acts as an
optimal decoder for many interesting problems not limited to this problem context. As long as a
problem of interest is an inference problem (infer X from Y when X and Y are probabilistically
related), then the optimal decoder is always the MAP decoder.
In fact, this MAP decoder can be simplified further in many cases including our case as a special
case. Here is how it is simplified. Using the definition of conditional probability, we get:
$$\Pr(W = w \mid Y^n = y^n) = \frac{\Pr(W = w, Y^n = y^n)}{\Pr(Y^n = y^n)} = \frac{\Pr(W = w)}{\Pr(Y^n = y^n)} \cdot \Pr(Y^n = y^n \mid W = w).$$
Here one thing to notice is that Pr(W = w) = 1/2^{nR} - it does not depend on w. Also
Pr(Yⁿ = yⁿ) is not a function of w. This implies that it suffices to consider only
Pr(Yⁿ = yⁿ | W = w) in figuring out when the above probability is maximized. Hence, we get:
$$\hat{W} = \arg\max_{w} \Pr(W = w \mid Y^n = y^n) = \arg\max_{w} \Pr(Y^n = y^n \mid W = w).$$
Also, given Xⁿ = xⁿ(w), Yⁿ is solely a function of the channel, meaning that given Xⁿ = xⁿ(w),
Yⁿ is independent of w. Hence,

$$\hat{W} = \arg\max_{w} \Pr(Y^n = y^n \mid X^n = x^n(w), W = w) = \arg\max_{w} \Pr(Y^n = y^n \mid X^n = x^n(w)).$$
Notice that Pr(Y n = y n |X n = xn (w)) is nothing but a conditional distribution, given by the
channel, which is easy to compute. Actually there is another name for the conditional distribu-
tion: likelihood. So the decoder is called the Maximum Likelihood (ML) decoder.
Suppose yⁿ = 0e00. Then, we can easily compute the likelihood P(yⁿ|xⁿ(w)). For instance,

$$P(y^n \mid x^n(1)) = (1-p)^3\, p,$$

because the channel is perfect three times (times 1, 3 and 4), while being erased at time 2. On
the other hand,

$$P(y^n \mid x^n(2)) = 0,$$

because the third bit of xⁿ(2) is 1, which does not match y₃ = 0; this event can never happen,
so the probability must be 0. In other words, the second message is incompatible
with the received signal y n . Then, what is the ML decoding rule? Here is how it works.
1. We eliminate all the messages which are incompatible with the received signal.
2. If there is only one message that survives, then we declare the survival is the correct message
that the transmitter sent.
However, this procedure is not sufficient to describe the ML decoding rule. The reason is that
we may have a different erasure pattern that confuses the rule. To see this clearly, consider the
following concrete example. Suppose that the received signal is now y n = (0ee0). Then, we see
that
$$P(y^n \mid x^n(1)) = (1-p)^2 p^2; \qquad P(y^n \mid x^n(2)) = (1-p)^2 p^2.$$
Now the two messages (1 and 2) are both compatible and their likelihood functions are equal. In
this case, the best we can do is to flip a coin, choosing one of the two at random. This completes
the ML decoding rule.
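A minimal sketch of this BEC ML rule, using a small hypothetical codebook (not the one used in class, which is not shown in this extract): keep the codewords that agree with yⁿ on every unerased position, then pick uniformly among the survivors.

```python
import random

def ml_decode_bec(yn, codebook):
    """BEC ML rule: keep messages compatible with y^n on unerased positions; break ties at random."""
    compatible = [w for w, xn in codebook.items()
                  if all(y == "e" or y == x for x, y in zip(xn, yn))]
    return random.choice(compatible)      # unique survivor, or a fair coin flip among survivors

codebook = {1: [0, 0, 0, 0], 2: [0, 1, 1, 0], 3: [1, 0, 1, 1]}   # hypothetical, n = 4
print(ml_decode_bec([0, "e", 0, 0], codebook))    # only message 1 is compatible -> 1
print(ml_decode_bec([0, "e", "e", 0], codebook))  # messages 1 and 2 survive -> random pick
```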
Look ahead
Next time, we will demonstrate that we can achieve the rate of 1 − p under the optimal ML
decoding rule.
EE623 Information Theory October 14, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we claimed that the capacity of the BEC with erasure probability p is

$$C_{\text{BEC}} = 1 - p.$$
Today’s lecture
Today we are going to complete this proof. We will then consider another channel example
which turns out to give further insights into the general channel setting. The channel is the one
that we saw earlier: the binary symmetric channel.
Probability of error
The probability of error is a function of codebook C. So here we are going to investigate
the average error probability EC [Pe (C)] taken over all possible random codebooks, and will
demonstrate that EC [Pe (C)] approaches 0 as n → ∞. This then implies the existence of an
optimal deterministic codebook, say C ∗ , such that Pe (C ∗ ) → 0 as n → ∞ (Why? See PS4 for
details), which completes achievability proof. Consider:
$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \overset{(a)}{=} \sum_{c} \Pr(\mathcal{C} = c)\Pr(\hat{W} \ne W \mid \mathcal{C} = c) \overset{(b)}{=} \Pr(\hat{W} \ne W) \overset{(c)}{=} \sum_{w=1}^{2^{nR}} \Pr(W = w)\Pr(\hat{W} \ne w \mid W = w) \qquad (1)$$
where (a) follows from the fact that Pr(C = c) indicates the probability that codebook is chosen
as a particular realization c; (b) and (c) are due to the total probability law (check Ch. 1 and 2
in BT if you do not get it).
Now consider Pr(Ŵ ≠ w | W = w). Notice that the [X₁(w), …, Xₙ(w)]'s are identically distributed
over all w's. This means that the way the w-th codeword is constructed is exactly the same
for all w's. This implies that Pr(Ŵ ≠ w | W = w) does not depend on the particular value of w,
meaning that

$$\Pr(\hat{W} \ne w \mid W = w) = \Pr(\hat{W} \ne 1 \mid W = 1) \quad \text{for all } w.$$

This together with (1) gives:

$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] = \Pr(\hat{W} \ne 1 \mid W = 1) = \Pr\left( \bigcup_{w=2}^{2^{nR}} \{\hat{W} = w\} \,\Big|\, W = 1 \right). \qquad (2)$$
Here the second equality is due to the fact that {Ŵ ≠ 1} implies Ŵ = w for some w ≠ 1. In
general, the probability of the union of multiple events is not that simple to compute. Actually
it is quite complicated, especially when it involves a large number of events. Even for
the three-event case (A, B, C), the probability formula is not that simple:

$$\Pr(A \cup B \cup C) = \Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr(A \cap C) - \Pr(B \cap C) + \Pr(A \cap B \cap C).$$

Even worse, the number of associated events here in (2) is 2^{nR} − 1, and this makes the
formula of that probability very complicated. So the calculation of that probability is not
simple. Of course, Shannon realized the difficulty of the calculation. To make some progress,
he took an indirect approach, as he did in the proof of the source coding theorem. He did not
attempt to compute the probability of error exactly. Instead, he intended to approximate it!
Specifically he tried to derive an upper bound, because at the end of the day what we want to show
is that the probability of error approaches 0. Note that if an upper bound goes to zero, then
the exact quantity also approaches 0. We have a very well-known upper bound w.r.t. the union
of multiple events. That is, the union bound: for events A and B,

$$\Pr(A \cup B) \le \Pr(A) + \Pr(B).$$

Applying the union bound to (2), we get:

$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \le \sum_{w=2}^{2^{nR}} \Pr(\hat{W} = w \mid W = 1) = (2^{nR} - 1)\Pr(\hat{W} = 2 \mid W = 1),$$
where the equality follows from the fact that the codebook employed is symmetric w.r.t message
indices. The event of Ŵ = 2 implies that message 2 is compatible; otherwise, Ŵ cannot be
chosen as 2. Using the fact that for two events A and B such that A implies B, Pr(A) ≤ Pr(B),
we get:

$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \le 2^{nR} \Pr(\text{message 2 is compatible} \mid W = 1).$$

Notice that the fact that message 2 is compatible implies that for every time i at which the transmitted
signal is not erased, Xᵢ(2) must be the same as Xᵢ(1); otherwise, message 2 cannot be compatible.
To see this clearly, let B = {i : Yᵢ ≠ e}. Then, what the above means is that we get:
$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \le 2^{nR} \Pr\left( \bigcap_{i \in B} \{X_i(1) = X_i(2)\} \,\Big|\, W = 1 \right) = 2^{nR} \left( \frac{1}{2} \right)^{|B|} \qquad (3)$$
where the equality is due to the fact that Pr(Xᵢ(1) = Xᵢ(2)) = 1/2 and the codebook is independent
across all i's.
Now here one key thing to notice is that due to the LLN, for sufficiently large n:
|B| ≈ n(1 − p).
This together with (3) yields:
$$\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \lesssim 2^{n(R - (1-p))}.$$
Hence, E_C[Pₑ(C)] can be made arbitrarily close to 0 as n → ∞, as long as R < 1 − p. This
completes the achievability proof.
But the way that we compute the likelihood function is different from that in the BEC case.
Let's see this difference through the following example. Suppose that (n, R) = (4, 1/2) and the
codebook is:
X n (1) = 0000;
X n (2) = 1110;
X n (3) = 1010;
X n (4) = 1111.
Suppose that the received signal y^n is (0100). Then, the likelihood functions are:
$$
\begin{aligned}
P(y^n|x^n(1)) &= (1-p)^3 p^1; \\
P(y^n|x^n(2)) &= (1-p)^2 p^2; \\
P(y^n|x^n(3)) &= (1-p)^1 p^3; \\
P(y^n|x^n(4)) &= (1-p)^1 p^3.
\end{aligned}
$$
Here one thing to notice is that unlike the BEC case, all the messages are compatible because
the likelihood functions are strictly positive. So we need to now compare all the functions to
choose the message that maximizes the function. Since we assume that p ∈ (0, 1/2) without loss
of generality, in the above example, the first message is the maximizer. Notice that it has the
minimum number of flips, and since the flipping probability is smaller than 1/2, the corresponding
likelihood function is maximized.
From this, one can see that the number of non-matching bits (number of flips) plays a crucial
role in the decision:
$$
d(x^n, y^n) := |\{i : x_i \neq y_i\}|.
$$
This is the so-called Hamming distance. Using this, the ML decoder can be rewritten as:
$$
\hat{W} = \arg\min_{w \in \{1,\dots,2^{nR}\}} d(x^n(w), y^n).
$$
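As a small illustration (a sketch, not from the notes), the ML decoder for the toy codebook above reduces to a minimum-Hamming-distance search; the variable names below are made up for the example.

import numpy as np

codebook = {
    1: np.array([0, 0, 0, 0]),
    2: np.array([1, 1, 1, 0]),
    3: np.array([1, 0, 1, 0]),
    4: np.array([1, 1, 1, 1]),
}
y = np.array([0, 1, 0, 0])   # received signal y^n = (0100)

def hamming(a, b):
    """d(x^n, y^n) = number of positions where the two sequences differ."""
    return int(np.sum(a != b))

# ML decoding (p < 1/2) = pick the codeword closest to y in Hamming distance
w_hat = min(codebook, key=lambda w: hamming(codebook[w], y))
print(w_hat)                                            # -> 1
print({w: hamming(c, y) for w, c in codebook.items()})  # distances 1, 2, 3, 3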
Now what remains to do is to show that one can make the probability of error arbitrarily close
to 0 as n → ∞, when using the ML decoder. However, here we will not employ the ML decoder
because of two reasons. The first is that the error probability analysis is a bit complicated.
The second one is more critical: the proof technique is not extensible to general cases in which
channels are arbitrary. Now then how can we prove the achievability without using the ML
decoder? Fortunately, there is a different yet suboptimal decoder under which the achievability
proof can be greatly simplified as well as which still achieves 1 − H(p). More importantly, the
suboptimal decoder that we will describe soon is extensible. So we are going to employ such
suboptimal decoder to prove the achievability.
A suboptimal decoder
The suboptimal decoder that we will employ is inspired by a key observation. Notice that the input to the
decoder is Y^n and the codebook is available at the decoder, i.e., X^n(w) is known for all w's. Now
observe that when message 1 is sent,
$$
Y^n \oplus X^n(1) = Z^n \sim \text{Bern}(p),
$$
whereas for a wrong codeword X^n(w), w ≠ 1, the key observation is that the statistics of the resulting
sequence Y^n ⊕ X^n(w) is Bern(1/2). Notice that the sum of any two independent Bernoulli r.v.s is
Bern(1/2), as long as at least one r.v. follows Bern(1/2) (why?). This motivates us to employ the
following decoding rule.
1. Compute Y^n ⊕ X^n(w) for every message w.
2. Eliminate messages such that the resulting sequence is not typical w.r.t. Bern(p). More
precisely, let
$$
A_\epsilon^{(n)} := \left\{ z^n : 2^{-n(H(p)+\epsilon)} \le p(z^n) \le 2^{-n(H(p)-\epsilon)} \right\}
$$
be the typical set w.r.t. Bern(p). Eliminate all w's such that Y^n ⊕ X^n(w) ∉ A_ε^{(n)}.
3. If there is only one message that survives, then declare the survival is the correct message.
Otherwise, declare an error.
Note that the error event is of two types: (i) multiple survivals; (ii) no survival.
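A minimal Python sketch of this decoding rule (not from the notes; the typicality test is written directly in terms of log p(z^n)):

import numpy as np

def H(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def is_typical(z, p, eps):
    # z^n is typical if 2^{-n(H(p)+eps)} <= p(z^n) <= 2^{-n(H(p)-eps)}
    n, k = len(z), int(np.sum(z))                          # k = number of ones
    log_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 p(z^n)
    return -n * (H(p) + eps) <= log_prob <= -n * (H(p) - eps)

def typicality_decode(y, codebook, p, eps=0.05):
    # keep the messages w such that y^n XOR x^n(w) looks like Bern(p) noise
    survivors = [w for w, x in codebook.items()
                 if is_typical(np.bitwise_xor(y, x), p, eps)]
    return survivors[0] if len(survivors) == 1 else None   # None = declare an error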
Probability of error
We are now ready to analyze the probability of error to complete achievability proof. For
notational simplicity, let P̄e := EC [Pe (C)]. Using exactly the same argument that we made in
the BEC case, we can get:
$$
\bar{P}_e := \mathbb{E}_C[P_e(C)] = \Pr(\hat{W} \neq 1 \mid W = 1).
$$
As mentioned earlier, the error event is of the two types: (i) multiple survivals; (ii) no survival.
The multiple-survival event implies that there exists w ≠ 1 such that Y^n ⊕ X^n(w) ∈ A_ε^{(n)}. The
no-survival event implies that even the Y^n ⊕ X^n(1) w.r.t. the correct message "1" is not a
typical sequence, meaning that Z^n ∉ A_ε^{(n)}. Hence, we get:
$$
\begin{aligned}
\bar{P}_e &= \Pr(\hat{W} \neq 1 \mid W = 1) \\
&\le \Pr\left( \bigcup_{w \neq 1} \{Y^n \oplus X^n(w) \in A_\epsilon^{(n)}\} \cup \{Z^n \notin A_\epsilon^{(n)}\} \;\Big|\; W = 1 \right) \\
&\overset{(a)}{\le} \sum_{w=2}^{2^{nR}} \Pr\left( Y^n \oplus X^n(w) \in A_\epsilon^{(n)} \mid W = 1 \right) + \Pr\left( Z^n \notin A_\epsilon^{(n)} \mid W = 1 \right) \\
&\overset{(b)}{=} (2^{nR}-1)\,\Pr\left( Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1 \right) + \Pr\left( Z^n \notin A_\epsilon^{(n)} \mid W = 1 \right) \\
&\overset{(c)}{\approx} (2^{nR}-1)\,\Pr\left( Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1 \right)
\end{aligned} \tag{4}
$$
where (a) follows from the union bound; (b) is by symmetry of the codebook; and (c) follows
from the fact that Z^n is a typical sequence w.h.p. due to the WLLN. Now observe that
$$
Y^n \oplus X^n(2) = X^n(1) \oplus X^n(2) \oplus Z^n \sim \text{Bern}\left(\tfrac{1}{2}\right).
$$
How to compute Pr(Y^n ⊕ X^n(2) ∈ A_ε^{(n)} | W = 1)? To compute this, we need to consider two
quantities: (i) the total number of possible sequences that Y^n ⊕ X^n(2) ∼ Bern(1/2) can take on;
(ii) the size of the typical set |A_ε^{(n)}|. Specifically, we have:
$$
\Pr\left( Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1 \right) = \frac{|A_\epsilon^{(n)}|}{\text{total number of Bern}(\tfrac{1}{2})\text{ sequences}}.
$$
Note that the total number of Bern(1/2) sequences is 2^n and |A_ε^{(n)}| ≤ 2^{n(H(p)+ε)} (why?). Hence,
$$
\Pr\left( Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1 \right) \le \frac{2^{n(H(p)+\epsilon)}}{2^n}.
$$
This together with (4) gives:
$$
\bar{P}_e \lesssim 2^{n(R-(1-H(p)-\epsilon))}.
$$
Since ε can be made arbitrarily close to 0, we can conclude that if R < 1 − H(p), then P̄_e → 0
as n → ∞. This completes the proof.
Look ahead
We have proved achievability for BSC. We will defer the converse proof for the time being. The
reason is that it turns out the converse proof technique can be applied to a very general problem
context directly. So next time, we will extend achievability proof to the general case in which
the channel is characterized by an arbitrary conditional distribution p(y|x). Later we will prove
converse for the general case.
EE623 Information Theory October 16, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
This condition is called memoryless. Notice that given the current channel input xi , the current
output yi is independent of the past input/output (xi−1 , y i−1 ) and any other things including
the message W . Here one key property that we need to keep in our mind is:
$$
p(y^n | x^n) = \prod_{i=1}^n p(y_i | x_i). \tag{1}
$$
This can be proved by using the memoryless property (check in PS5). It turns out this property
plays a crucial role in proving achievability as well as converse. This will be clearer later.
Recall the capacities of the two channels that we have studied. With the input distribution X ∼ Bern(1/2),
they coincide with the mutual information:
$$
C_{\text{BEC}} = 1 - p = I(X;Y); \qquad C_{\text{BSC}} = 1 - H(p) = I(X;Y).
$$
On the other hand, for an arbitrary input distribution p(x),
$$
C_{\text{BEC}} = 1 - p \ge I(X;Y); \qquad C_{\text{BSC}} = 1 - H(p) \ge I(X;Y).
$$
Check this in PS5. Hence, what one can guess on the capacity formula from this is:
$$
C = \max_{p(x)} I(X;Y). \tag{2}
$$
It turns out it is indeed the case. For the rest of this lecture, we will prove achievability: if
R < maxp(x) I(X; Y ), then the probability of error can be made arbitrarily close to 0 as n → ∞.
Encoder
The encoder that we will employ is a random code, meaning that Xi (w)’s are generated in an
i.i.d. fashion according to some input distribution p(x). We know in the BEC and BSC that
the input distribution is fixed as Bern(1/2). And one can verify that Bern(1/2) is the maximizer of
the above optimization problem (2) for those channels. This motivates us to choose p(x) as the maximizer of (2):
$$
p^*(x) := \arg\max_{p(x)} I(X;Y).
$$
It turns out this choice enables us to achieve the capacity guessed in (2).
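As a quick numerical check of the claim that Bern(1/2) maximizes I(X;Y) for the BSC (a sketch, with an arbitrary p):

import numpy as np

def H2(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def bsc_mutual_information(q, p):
    # X ~ Bern(q), Y = X xor Z with Z ~ Bern(p):  I(X;Y) = H(Y) - H(Y|X)
    py1 = q * (1 - p) + (1 - q) * p
    return H2(py1) - H2(p)

p = 0.1
qs = np.linspace(0, 1, 1001)
best = qs[np.argmax([bsc_mutual_information(q, p) for q in qs])]
print(best, 1 - H2(p))   # maximizer ~ 0.5, maximum value ~ capacity 1 - H(p)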
Decoder
As before, we also assume that the codebook is known at the decoder. We use the suboptimal
decoder that we employed in the BSC case. Remember that the suboptimal decoder is based
on a typical sequence and the fact that Y n ⊕ X n = Z n . But one significant distinction in the
general DMC case is that Y n is a kind of arbitrary random function of X n . So now what we
can do is to take a look at pairs of (Y n , X n (w)) and to check if we can see a particular behavior
of the pair associated with the true codeword, compared to the other pairs w.r.t. the wrong
codewords.
To illustrate this clearly, let’s consider the following situation. Suppose that X n (1) is actually
transmitted, i.e., the message 1 is the correct one. Then, the joint distribution of the correct
pair (x^n(1), y^n) would be:
$$
p(x^n(1), y^n) = p(x^n(1))\,p(y^n|x^n(1)) \overset{(a)}{=} \prod_{i=1}^n p(x_i(1))\,p(y_i|x_i(1))
$$
where (a) follows from the key property (1) together with the fact that the codeword symbols are drawn i.i.d. This implies that the pairs (x_i(1), y_i)'s are i.i.d
over i’s. Then, using the WLLN on the i.i.d. sequence of pairs, one can easily show that
$$
\frac{1}{n}\log\frac{1}{p(X^n(1), Y^n)} \longrightarrow H(X,Y) \quad \text{in prob.}
$$
Now consider a wrong pair (x^n(w), y^n) with w ≠ 1. In this case, x^n(w) and y^n are independent.
The reason is that y^n is associated only with (x^n(1), channel noise), which is independent of
x^n(w). Remember that we use a random code in which codewords are independent of each
other. Also, here one can verify that y^n is i.i.d. as well (check!). Again using the WLLN on x^n(w)
and y^n, one can get:
$$
\frac{1}{n}\log\frac{1}{p(X^n(w))\,p(Y^n)} \longrightarrow H(X) + H(Y) \quad \text{in prob.}
$$
These observations motivate the following decoding rule.
1. Eliminate all the messages w's such that (x^n(w), y^n) ∉ A_ε^{(n)}.
2. If there is only one message that survives, then declare that the survival is the correct
message.
Similar to the previous case, note that the error event is of two types: (i) multiple survivals (or
one wrong survival); (ii) no survival.
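A minimal sketch of this joint-typicality decoder (not from the notes), assuming the standard three-condition definition of the jointly typical set and a joint distribution given as a dictionary pxy of {(x, y): probability}; all names are illustrative.

import numpy as np

def entropy(dist):
    ps = np.array(list(dist.values()))
    return float(-np.sum(ps * np.log2(ps)))

def is_jointly_typical(x, y, pxy, eps):
    n = len(x)
    px, py = {}, {}
    for (a, b), pr in pxy.items():           # marginals from the joint distribution
        px[a] = px.get(a, 0.0) + pr
        py[b] = py.get(b, 0.0) + pr
    Hx, Hy, Hxy = entropy(px), entropy(py), entropy(pxy)
    lp_x = sum(np.log2(px[a]) for a in x)
    lp_y = sum(np.log2(py[b]) for b in y)
    lp_xy = sum(np.log2(pxy[(a, b)]) for a, b in zip(x, y))
    # (x^n, y^n) is jointly typical if all three empirical entropies are within eps
    return (abs(-lp_x / n - Hx) <= eps and abs(-lp_y / n - Hy) <= eps
            and abs(-lp_xy / n - Hxy) <= eps)

def decode(y, codebook, pxy, eps=0.05):
    survivors = [w for w, x in codebook.items() if is_jointly_typical(x, y, pxy, eps)]
    return survivors[0] if len(survivors) == 1 else None   # None = declare an error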
Probability of error
We are now ready to analyze the probability of error to complete the achievability proof. Using
exactly the same argument that we made in the BEC case, we can get:
$$
\bar{P}_e := \mathbb{E}_C[P_e(C)] = \Pr(\hat{W} \neq 1 \mid W = 1).
$$
Now as mentioned earlier, the error event is of the two types: (i) multiple survivals; (ii) no
survival. The multiple-survival event implies the existence of a wrong pair being jointly
typical, meaning that there exists w ≠ 1 such that (X^n(w), Y^n) ∈ A_ε^{(n)}. The no-survival
event implies that even the correct pair is not jointly typical, meaning that (X^n(1), Y^n) ∉ A_ε^{(n)}.
Hence, we get:
$$
\begin{aligned}
\bar{P}_e &= \Pr(\hat{W} \neq 1 \mid W = 1) \\
&\le \Pr\left( \bigcup_{w \neq 1} \{(X^n(w), Y^n) \in A_\epsilon^{(n)}\} \cup \{(X^n(1), Y^n) \notin A_\epsilon^{(n)}\} \;\Big|\; W = 1 \right) \\
&\overset{(a)}{\le} \sum_{w=2}^{2^{nR}} \Pr\left( (X^n(w), Y^n) \in A_\epsilon^{(n)} \mid W = 1 \right) + \Pr\left( (X^n(1), Y^n) \notin A_\epsilon^{(n)} \mid W = 1 \right) \\
&\overset{(b)}{\le} 2^{nR}\,\Pr\left( (X^n(2), Y^n) \in A_\epsilon^{(n)} \mid W = 1 \right) + \Pr\left( (X^n(1), Y^n) \notin A_\epsilon^{(n)} \mid W = 1 \right) \\
&\overset{(c)}{\approx} 2^{nR}\,\Pr\left( (X^n(2), Y^n) \in A_\epsilon^{(n)} \mid W = 1 \right)
\end{aligned}
$$
where (a) follows from the union bound; (b) is by symmetry of the codewords w.r.t. message
indices; and (c) follows from the fact that (X^n(1), Y^n) is jointly typical for sufficiently large n
w.h.p. due to the WLLN. Now observe that X^n(2) and Y^n are independent. So the total number of
pair patterns w.r.t. (X^n(2), Y^n) would be approximately:
$$
2^{nH(X)} \times 2^{nH(Y)} = 2^{n(H(X)+H(Y))}.
$$
On the other hand, the cardinality of the jointly typical pair set A_ε^{(n)} is:
$$
|A_\epsilon^{(n)}| \approx 2^{nH(X,Y)}.
$$
Hence,
$$
\bar{P}_e \lesssim 2^{nR} \cdot \frac{2^{nH(X,Y)}}{2^{n(H(X)+H(Y))}} = 2^{n(R-I(X;Y))}.
$$
So one can conclude that if R < I(X; Y ), then P̄e → 0 as n → ∞. Since we choose p(x) such that
I(X; Y ) is maximized, we see that maxp(x) I(X; Y ) is achievable, which completes the proof.
Look ahead
So far we have proved achievability for discrete memoryless channels. Next time, we will prove
converse to complete the proof of channel coding theorem.
EE623 Information Theory October 28, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Today’s lecture
Today we will prove the converse: if a rate R is achievable, i.e., there exists a sequence of codes with P_e → 0 as n → ∞, then R ≤ C = max_{p(x)} I(X;Y).
The proof relies on two important inequalities: (1) Fano’s inequality; (2) data processing in-
equality (DPI). Let us start by studying the two inequalities.
Fano’s inequality
Fano’s inequality is one of the well-known inequalities that plays a significant role in the context
of inference problems. Here the inference problem is the one of which the task is to infer input
from output given that the input and output are related in a probabilistic manner. Notice in the
inference problems that the input and output are probabilistically related, and so the only thing
that we can do w.r.t. the input from the output is to guess about the input. In a mathematical
language, the guess refers to inference. Actually the communication problem of our interest
can be viewed as an inference problem because the goal in the problem is to infer the message
W from the received signal Y n that is stochastically related to W . In this inference problem
context, Fano's inequality captures the relationship between the following performance metrics: the
probability of error P_e := Pr(Ŵ ≠ W) and the conditional entropy H(W|Ŵ).
Here what one can easily expect is that the smaller P_e, the smaller H(W|Ŵ). Fano's inequality
captures such a relation. Remember in the converse that we need to come up with a
lower bound on P_e, because at the end of the day we intend to show that P_e ↛ 0 (when R > C),
and the fact that a lower bound of P_e does not go to 0 ensures that the actual P_e does not converge
to 0. Here Fano's inequality plays a role to provide such a lower bound. Actually a lower bound
on P_e in terms of H(W|Ŵ) can be written as an upper bound on H(W|Ŵ) in terms of P_e. The
upper bound formula is well known and we will focus on such an upper-bound presentation:
$$
H(W|\hat{W}) \le 1 + P_e \cdot nR. \tag{1}
$$
Notice that this indeed captures the tradeoff relationship between Pe and H(W |Ŵ ): the smaller
Pe , the more contracted H(W |Ŵ ). The proof of this is very simple. Let
E = 1{Ŵ 6= W }.
By the definition of the probability of error and the fact that E[1{Ŵ ≠ W}] = Pr{Ŵ ≠ W},
E ∼ Bern(P_e). Starting with the fact that E is a function of (W, Ŵ), we have:
$$
\begin{aligned}
H(W|\hat{W}) &= H(W, E|\hat{W}) \\
&\overset{(a)}{=} H(E|\hat{W}) + H(W|\hat{W}, E) \\
&\overset{(b)}{\le} 1 + H(W|\hat{W}, E) \\
&\overset{(c)}{=} 1 + \Pr(E=0)H(W|\hat{W}, E=0) + \Pr(E=1)H(W|\hat{W}, E=1) \\
&\overset{(d)}{\le} 1 + P_e \cdot nR
\end{aligned}
$$
where (a) is due to a chain rule; (b) follows from the cardinality bound on entropy; (c) comes
from the definition of conditional entropy; and (d) follows from the cardinality bound on entropy.
Here what this condition means is that given the current state xi , the future state xi+1 and
the past states xi−1 ’s are independent with each other. There is a very well-known graphical
representation of such a Markov process. Suppose that (X1 , X2 , X3 ) forms a Markov process.
Then, we represent it as:
X1 − X2 − X3 .
The rationale behind this representation is based on the following observation: Given X2 , it
can be removed from the above graph as it is known; but this removal leads (X3 , X1 ) to be
disconnected, meaning that these have nothing to do with each other, in a statistical language,
being independent. Note that the graph looks like a chain. So it is also called a Markov chain.
In our problem context, there is a Markov chain. First of all, one can show that (W, X n , Y n )
forms a Markov chain:
W − X n − Y n.
The proof is straightforward. Given X^n, Y^n is a sole function of the noise induced by the
channel, which has nothing to do with the message W, meaning that Y^n and W are conditionally
independent given X^n.
The DPI is the one defined w.r.t. the Markov chain. Specifically it captures the relationship
between the following mutual information: I(W ; X n ), I(W ; Y n ), I(X n ; Y n ). Actually I(W ; X n )
captures somehow common information between W and X n , hence it can be interpreted as the
amount of information regarding W that one can infer from X n . So it can be viewed as inference
quality on W when the inference is based on X n . Similarly one can view I(W ; Y n ) as inference
quality on W when the inference is based on Y n . Remember the verbal statement of the DPI:
any processed data (say Y n ) of the original data (say X n ) cannot improve inference quality on
W , meaning that
I(W ; Y n ) ≤ I(W ; X n ).
The proof of this is very simple. Starting with a chain rule and non-negativity of mutual
information, we have:
$$
\begin{aligned}
I(W; Y^n) &\le I(W; Y^n, X^n) \\
&= I(W; X^n) + I(W; Y^n|X^n) \\
&\overset{(a)}{=} I(W; X^n)
\end{aligned} \tag{2}
$$
where (a) follows from the fact that W − X n − Y n . There is another DPI w.r.t. I(W ; Y n ) and
another mutual information I(X n ; Y n ). Note that X n is closer to Y n compared to W . So one
can guess:
I(W ; Y n ) ≤ I(X n ; Y n ).
It turns out this is indeed the case. The proof is also very simple.
$$
\begin{aligned}
I(W; Y^n) &\le I(W, X^n; Y^n) \\
&= I(X^n; Y^n) + I(W; Y^n|X^n) \\
&= I(X^n; Y^n).
\end{aligned} \tag{3}
$$
From (2) and (3), one can summarize that the mutual information between two ending terms
in the Markov chain does not exceed the mutual information between any two terms that lie
in-between the two ends.
Looking at our problem setting, there is another term: Ŵ. It turns out this together with the
above Markov chain (W − X^n − Y^n) forms a longer Markov chain:
$$
W - X^n - Y^n - \hat{W}.
$$
Applying the DPI to this longer chain, we get:
$$
I(W; \hat{W}) \le I(X^n; Y^n). \tag{7}
$$
Converse proof
We are now ready to prove the converse with the two inequalities that we learned. Starting with
the fact that nR = H(W ) (why?), we have:
$$
\begin{aligned}
nR &= H(W) \\
&= I(W; \hat{W}) + H(W|\hat{W}) \\
&\overset{(a)}{\le} I(W; \hat{W}) + 1 + P_e \cdot nR \\
&\overset{(b)}{=} I(W; \hat{W}) + n\epsilon_n \\
&\overset{(c)}{\le} I(X^n; Y^n) + n\epsilon_n \\
&= H(Y^n) - H(Y^n|X^n) + n\epsilon_n \\
&\overset{(d)}{=} H(Y^n) - \sum_{i=1}^n H(Y_i|X_i) + n\epsilon_n \\
&\overset{(e)}{\le} \sum_{i=1}^n \left[ H(Y_i) - H(Y_i|X_i) \right] + n\epsilon_n \\
&= \sum_{i=1}^n I(X_i; Y_i) + n\epsilon_n \\
&\overset{(f)}{\le} nC + n\epsilon_n
\end{aligned}
$$
where (a) follows from Fano's inequality (1); (b) comes from the definition of ε_n that we set as:
$$
\epsilon_n := \frac{1}{n}\left(1 + n P_e R\right);
$$
(c) is due to the DPI (7); (d) follows from the memoryless channel property ($p(Y^n|X^n) = \prod_{i=1}^n p(Y_i|X_i)$, and hence $H(Y^n|X^n) = \sum_{i=1}^n H(Y_i|X_i)$); (e) is because conditioning reduces entropy; and
(f) is due to the definition $C := \max_{p(x)} I(X;Y)$. Now dividing by n on both sides, we get:
$$
R \le C + \epsilon_n.
$$
If P_e → 0 as n → ∞, then ε_n = 1/n + P_e R → 0, and hence
$$
R \le C,
$$
which completes the converse proof.
Look ahead
So far we have proved the channel coding theorem for a fairly general channel: discrete mem-
oryless channel. But in the proof, we only showed the existence of an optimal code such that
under the code Pe can be made arbitrarily close to 0 as n → ∞. Next time, we will discuss
explicit deterministic codes which approach (or exactly achieve) the capacity. This together
with the proof of the channel coding theorem forms all the contents of Part II. So right after
the discussion on deterministic codes, we will embark on Part III.
EE623 Information Theory October 30, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we proved the converse for discrete memoryless channels to complete the
proof of the channel coding theorem:
$$
C = \max_{p(x)} I(X;Y).
$$
This completes Part II. Before moving onto Part III, let me tell you an interesting story behind
the development of the channel coding theorem, which later motivated some efforts on explicit
code constructions that we did not deal with in this course.
length tends to infinity, although it does not match the capacity exactly. Encouragingly (perhaps
especially to you guys), he developed such a code during his “PhD study” as the first piece of
work - meaning that you can also do that kind of great works in the near future!
Unfortunately, his work did not receive enough credits at that time. The main reason is that the
LDPC code was still of high implementation complexity considering the digital signal processing
(DSP) technology of the day.2 30 years later, however, the code was revived, receiving significant
attention. The reason is that the code became an efficient code – the DSP technology had been
evolved, finally enabling the code to be implemented. Later the code was more appreciated with
a more advanced DSP technology – it is now widely being employed in a variety of systems such
as LTE, WiFi, DMB3 and storage systems.
At the time when Gallager developed the LDPC code, he was not fully satisfied with his result
though. The reason was that his code is not guaranteed to exactly achieve the capacity even
in the limit of code length. This motivated one of his PhD students, Erdal Arikan, to work on
developing the capacity-achieving deterministic code. It turned out Arikan could develop such
a code, called Polar code. It is a first capacity-achieving deterministic code. Interestingly, he
could develop the code in 2007, 30+ years later than his PhD day when the motivation began.4
Due to its great performance and low-complexity nature, it is now being seriously considered
for implementation in a variety of systems.
These are the two major efforts. Due to the interest of time, we will not cover these in this
course. If you are interested in details, I can instead share some course notes w.r.t. the Polar
code which I drafted in earlier semesters. Please let me know your interest. Unfortunately, I
made no course note for the LDPC code.
algorithms that ensure fairness for disadvantaged groups relative to advantaged groups.
Machine learning
Machine learning is about an algorithm which is defined to be a set of instructions that a
computer system can execute. Formally speaking, machine learning is the study of algorithms
with which one can train a computer system so that the trained system can perform a specific
task of interest. Pictorially, it means the following; see Fig. 1.
Figure 1: Machine learning is the study of algorithms which provide a set of explicit instructions
to a computer system (machine) so that it can perform a specific task of interest. Let x be the
input which indicates information employed to perform a task. Let y be the output that denotes
a task result.
Here the entity that we are interested in building up is a computer system, which is definitely
a machine. Since it is a system (i.e., a function), it has the input and the output. The input,
usually denoted by x, indicates information which is employed to perform a task of interest.
The output, usually denoted by y, indicates a task result. For instance, if a task of interest is
legitimate-emails filtering against spam emails, then x could be multi-dimensional quantities5 :
(i) frequency of a keyword like dollar signs $$$; (ii) frequency of another keyword, say winner.
And y could be an email entity, e.g., y = +1 indicates a legitimate email while y = −1 denotes
a spam email. In machine learning, such y is called a label. Or if an interested task is cat-vs-dog
classification, then x could be image-pixel values and y is a binary value indicating whether the
fed image is a cat (say y = 1) or a dog (y = 0).
Machine learning is about designing algorithms, wherein the main role is to train (teach) the
computer system (machine) so that it can perform such a task well. In the process of designing
algorithms, we use something that is data.
5 In machine learning, such a quantity is called the feature. Usually this refers to a key component that well
describes characteristics of data.
Figure 2: Arthur Lee Samuel was an American pioneer in the field of artificial intelligence, known
mostly as the father of machine learning. He founded the field in the process of developing
computer checkers, which later formed the basis of AlphaGo.
Arthur Samuel is actually one of the pioneers in the broader field of Artificial Intelligence (AI)
which includes machine learning as a sub-field. The AI field is the study of creating intelligence
by machines, unlike the natural intelligence displayed by intelligent beings like humans and
animals. Since the ML field is about building up a human-like machine, it is definitely a sub-
field of AI.
In fact, Arthur Samuel found the ML field in the process of developing a human-like computer
player for a board game, called checkers; see the right figure in Fig. 1. He proposed many
algorithms and ideas while developing computer checkers. It turns out those algorithms could
form the basis of AlphaGo, a computer program for the board game Go which defeated one of
the 9-dan professional players, Lee Sedol with 4 wins out of 5 games in 2016.
Supervised learning
There are some methodologies which help us to achieve the goal of ML. One specific yet very
popular method is the one called:
Supervised Learning.
What supervised learning means is learning a function f (·) (indicating a functional of the ma-
chine) with the help of a supervisor. See Fig. 3.
What the supervisor means in this context is the one who provides input-output samples. Ob-
viously the input-output samples form the data employed for training the machine, usually
Figure 3: Supervised Learning: Design a computer system (machine) f (·) with the help of a
supervisor which offers input-output pair samples, called a training dataset {(x(i) , y (i) )}m
i=1 .
denoted by:
$$
\{(x^{(i)}, y^{(i)})\}_{i=1}^m \tag{1}
$$
where (x^{(i)}, y^{(i)}) indicates the ith input-output sample (also called a training sample or an example)
and m denotes the number of samples. Using this notation (1), one can say that supervised
learning is to:
$$
\text{estimate } f(\cdot) \text{ using the training dataset } \{(x^{(i)}, y^{(i)})\}_{i=1}^m. \tag{2}
$$
Optimization
A common way to estimate f (·) is through optimization. To understand what this means, let
us first explain what optimization is.
Optimization is to choose an optimization variable that minimizes (or maximizes) a certain
quantity of interest possibly given some constraints. There are two important quantities in the
definition. One is the optimization variable which affects the interested quantity and is subject
to our design. This is usually a multi-dimensional quantity. The second is the certain quantity
of interest that we wish to minimize (or maximize). This is called the objective function in the
field, and an one-dimensional scalar.
Objective function
Now what is the objective function in the supervised learning framework? To figure this out,
we need to know about the objective that supervised learning wishes to achieve. In view of the
goal (2), what we want is obviously:
$$
y^{(i)} \approx f(x^{(i)}), \quad i = 1, \dots, m.
$$
A natural question that arises is then: how to quantify the closeness (reflected in the "≈" notation)
between the two quantities y^{(i)} and f(x^{(i)})? One very common way that has been used in the
field is to employ a function, called a loss function, usually denoted by:
$$
\ell\left(y^{(i)}, f(x^{(i)})\right). \tag{3}
$$
One obvious property that the loss function ℓ(·,·) should have is that it should be small when
the two arguments are close, while being zero when the two are identical. Using such a loss
function (3), one can then formulate an optimization problem as:
$$
\min_{f(\cdot)} \sum_{i=1}^m \ell\left(y^{(i)}, f(x^{(i)})\right). \tag{4}
$$
Perceptron,
and was invented in 1957 by one of the pioneers in the AI field, named Frank Rosenblatt.
Interestingly, Frank Rosenblatt was a psychologist. He was interested in how the brains of intelligent
beings work, and his study on brains led him to come up with the perceptron, which later gave
significant insights into the neural networks that many of you have heard of.
Perceptron
Figure 4: Neurons are electrically excitable cells and are connected through synapses.
The above three properties led Frank Rosenblatt to propose the perceptron architecture, as
illustrated in Fig. 5. The linear operation inside the perceptron reads:
$$
w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T x. \tag{6}
$$
Activation functions
Taking the perceptron architecture in Fig. 5, one can then formulate the optimization problem (5)
as:
$$
\min_{w} \sum_{i=1}^m \ell\left(y^{(i)}, \mathbf{1}\{w^T x^{(i)} > \text{th}\}\right). \tag{9}
$$
This is an initial optimization problem that people came up with. However, people figured
out there is an issue in solving this optimization. The issue comes from the fact that the
objective function contains an indicator function, so it is not differentiable. It turns out non-
differentiability makes the problem difficult to solve. This will be clearer soon in a later lecture.
Please be patient until we get to the point. What can we do then? One typical way that people
have taken in the field is to approximate the activation function. There are many ways for
approximation. From below, we will investigate one of them.
That is the following function:
$$
f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{10}
$$
Notice that f_w(x) ≈ 0 when w^T x is very small; it then grows exponentially with an increase in
w^T x; later grows logarithmically; and finally saturates at 1 when w^T x is very large. Actually
the function (10) is a very popular one used in statistics, called the logistic⁶ function. There is
another name for the function, which is the sigmoid⁷ function.
There are two good things about the logistic function. First it is differentiable. The second is
that it can serve as the probability for the output in the binary classifier, e.g., Pr(y = 1) where
y denotes the ground-truth label in the binary classifier. So it is interpretable.
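A tiny sketch of the logistic activation and its probability interpretation (the weight and inputs below are arbitrary):

import numpy as np

def logistic(w, x):
    # f_w(x) = 1 / (1 + exp(-w^T x)), a value in (0, 1) interpretable as Pr(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([2.0, -1.0])
for x in ([-3.0, 0.0], [0.0, 0.0], [3.0, 0.0]):
    print(x, logistic(w, np.array(x)))   # ~0.0025, 0.5, ~0.9975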
Look ahead
Under the choice of the logistic activation, what is then a good choice for a loss function? It
turns out the entropy plays an important role in the design of an optimal loss function in some
sense. Next time we will investigate in what sense it is optimal. We will then figure out how
the entropy comes up in the design of the optimal loss function.
6
The word logistic comes from a Greek word which means a slow growth, like a logarithmic growth.
7
Sigmoid means resembling the lower-case Greek letter sigma, S-shaped.
EE623 Information Theory November 4, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we formulated an optimization problem for supervised learning based on the perceptron
architecture:
$$
\min_{w} \sum_{i=1}^m \ell\left(y^{(i)}, f_w(x^{(i)})\right). \tag{1}
$$
As an activation function, we considered the logistic function that is widely used in the field:
$$
f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{2}
$$
I then mentioned that a concept related to entropy appears in the design of the optimal loss
function.
Today’s lecture
Today we will prove this claim. Specifically what we are going to do are three folds. We will
first investigate what it means by the optimal loss function. We will next figure out how entropy
comes up in the design of the optimal loss function. Lastly we will also learn how to solve the
optimization problem.
Notice that the range of the output ŷ of the logistic regression is in between 0 and 1:
0 ≤ ŷ ≤ 1.
Hence, one can interpret this as a probability quantity. Now the optimality of a classifier can be
defined under the following assumption, inspired by such an interpretation:
$$
\hat{y} = \Pr(y = 1 | x). \tag{3}
$$
To understand what it means in detail, consider the likelihood of the ground-truth classifier:
$$
\Pr\left( \{y^{(i)}\}_{i=1}^m \,\middle|\, \{x^{(i)}\}_{i=1}^m \right). \tag{4}
$$
Notice that the classifier output ŷ is a function of weights w. Hence, we see that assuming (3),
the likelihood (4) is also a function of w.
We are now ready to define the optimal w. The optimal weight, say w∗ , is defined as the one
that maximizes the likelihood (4):
w∗ := arg max Pr {y (i) }m
i=1 |{x (i) m
}i=1 . (5)
w
Accordingly the optimal loss function, say `∗ (·, ·) is defined as the one that yields:
$$
\arg\min_{w} \sum_{i=1}^m \ell^*\left(y^{(i)}, \hat{y}^{(i)}\right) = \arg\max_{w} \Pr\left( \{y^{(i)}\}_{i=1}^m \,\middle|\, \{x^{(i)}\}_{i=1}^m \right). \tag{6}
$$
As I mentioned earlier, an entropy-related concept appears in ℓ*(·,·). Let us now figure this out.
Assume that the examples are independent across i:
$$
\{(x^{(i)}, y^{(i)})\}_{i=1}^m \ \text{are mutually independent.} \tag{7}
$$
Under this assumption, we can then rewrite the likelihood (4) as:
$$
\begin{aligned}
\Pr\left( \{y^{(i)}\}_{i=1}^m \,\middle|\, \{x^{(i)}\}_{i=1}^m \right)
&\overset{(a)}{=} \frac{\Pr\left( \{(x^{(i)}, y^{(i)})\}_{i=1}^m \right)}{\Pr\left( \{x^{(i)}\}_{i=1}^m \right)} \\
&\overset{(b)}{=} \frac{\prod_{i=1}^m P\left(x^{(i)}, y^{(i)}\right)}{\prod_{i=1}^m P\left(x^{(i)}\right)} \\
&\overset{(c)}{=} \prod_{i=1}^m P\left(y^{(i)} \mid x^{(i)}\right)
\end{aligned} \tag{8}
$$
where (a) and (c) are due to the definition of conditional probability; and (b) comes from
the independence assumption (7). Here P(x(i) , y (i) ) denotes the probability distribution of the
input-output pair of the system:
$$
P\left(x^{(i)}, y^{(i)}\right) := \Pr\left(X = x^{(i)}, Y = y^{(i)}\right)
$$
where X and Y indicate random variables of the input and the output, respectively.
Recall the probability-interpretation-related assumption (3) made with regard to ŷ:
ŷ = Pr(y = 1|x).
This implies that:
$$
y = 1: \; P(y|x) = \hat{y}; \qquad y = 0: \; P(y|x) = 1 - \hat{y}.
$$
This can be written compactly as:
$$
P(y|x) = \hat{y}^{\,y}(1-\hat{y})^{1-y}.
$$
Now using the notations of (x^{(i)}, y^{(i)}) and ŷ^{(i)}, we then get:
$$
P\left(y^{(i)} \mid x^{(i)}\right) = (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}.
$$
Plugging this into (8), the maximum likelihood weight w* in (5) can be rewritten as:
$$
\begin{aligned}
w^* &= \arg\max_{w} \prod_{i=1}^m (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}} \\
&\overset{(a)}{=} \arg\max_{w} \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right] \\
&\overset{(b)}{=} \arg\min_{w} \sum_{i=1}^m \left[ -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]
\end{aligned} \tag{11}
$$
where (a) comes from the fact that log(·) is a non-decreasing function and ∏_{i=1}^m (ŷ^{(i)})^{y^{(i)}}(1 − ŷ^{(i)})^{1−y^{(i)}} is positive; and (b) is due to changing the sign of the objective while replacing max with
min.
In fact, the term inside the summation in the last equality in (11) respects the formula of another
key notion in information theory: cross entropy. In particular, in the context of a loss function,
it is named the cross entropy loss:
$$
\ell_{\text{CE}}(y, \hat{y}) := -y\log\hat{y} - (1-y)\log(1-\hat{y}). \tag{12}
$$
Hence, the optimal loss function that yields the maximum likelihood solution is the cross entropy
loss:
$$
\ell^*(y, \hat{y}) = \ell_{\text{CE}}(y, \hat{y}).
$$
Let us relate this to random variables. Consider X ∼ Bern(p) and Y ∼ Bern(q), where Bern(p)
indicates a binary random variable with p = Pr(X = 1). For
such two random variables, the cross entropy is defined as:
$$
H(p, q) := -p\log q - (1-p)\log(1-q). \tag{13}
$$
Notice that the formula of (12) is exactly the same as the term inside summation in (11), except
for having different notations. Hence, it is called the cross entropy loss.
Then, you may now wonder why H(p, q) in (13) is called cross entropy. Of course, there is a
rationale. The rationale comes from the following fact that you were asked to prove in PS5:
$$
H(p, q) \ge H(p)
$$
where H(p) denotes the entropy of Bern(p). Also you were asked to verify that the equality
holds when p = q. So from this, one can interpret H(p, q) as an entropic-measure of discrepancy
across distributions. Hence, it is called cross entropy.
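A small numerical illustration of the cross entropy between Bern(p) and Bern(q) and of the fact H(p, q) ≥ H(p) (arbitrary p and q):

import numpy as np

def H(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def cross_entropy(p, q):
    return -p * np.log2(q) - (1 - p) * np.log2(1 - q)

p = 0.3
for q in (0.1, 0.3, 0.5, 0.9):
    print(q, cross_entropy(p, q), H(p))   # minimized, and equal to H(p), at q = p = 0.3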
It turns out the above optimization belongs to the class of convex optimization problems that we are
familiar with. In other words, the normalized objective
$$
J(w) := \frac{1}{m}\sum_{i=1}^m \ell_{\text{CE}}\left(y^{(i)}, \hat{y}^{(i)}\right)
$$
is convex in the optimization variable w. So for the rest of this lecture, we will
first prove the convexity of the objective function, and then discuss how to solve such convex
optimization.
Proof of convexity
One can readily show that convexity is preserved under addition (why? think about the definition
of convex functions). So it suffices to prove the following two:
$$
\text{(i)}\ -\log\frac{1}{1 + e^{-w^T x}} \ \text{is convex in } w; \qquad
\text{(ii)}\ -\log\frac{e^{-w^T x}}{1 + e^{-w^T x}} \ \text{is convex in } w.
$$
Since the second function in the above can be represented as the sum of a linear function and
the first function:
$$
-\log\frac{e^{-w^T x}}{1 + e^{-w^T x}} = w^T x - \log\frac{1}{1 + e^{-w^T x}},
$$
it suffices to prove the convexity of the first function.
Notice that the first function can be rewritten as:
$$
-\log\frac{1}{1 + e^{-w^T x}} = \log\left(1 + e^{-w^T x}\right). \tag{17}
$$
In fact, proving the convexity of the function (17) is a bit involved if one relies directly on the
definition of convex functions. It turns out there is another way to prove. That is based on
the computation of the second derivative of a function, called the Hessian. How to compute
the Hessian? What is the dimension of the Hessian? For a function f : Rd → R, the gradient
∇f (x) ∈ Rd and the Hessian ∇2 f (x) ∈ Rd×d . If you are not familiar, check from the vector
calculus course or from Wikipedia.
A well-known fact says that if the Hessian of a function is positive semi-definite (PSD)1 , then
the function is convex. We will not prove this here. But if you are interested, you will have a
chance to prove it in PS6. If this is too much, you may want to simply remember this fact only.
No need to prove, but the statement itself is very instrumental. Here we will use this fact to
prove the convexity of the function (17).
Taking a derivative of the RHS formula in (17) w.r.t. w, we get:
$$
\nabla_w \log\left(1 + e^{-w^T x}\right) = \frac{-x e^{-w^T x}}{1 + e^{-w^T x}}.
$$
This is due to the chain rule of derivatives and the facts that $\frac{d}{dz}\log z = \frac{1}{z}$, $\frac{d}{dz}e^z = e^z$ and $\frac{d}{dw}w^T x = x$.
Taking another derivative of the above, we obtain the Hessian as follows:
$$
\nabla_w^2 \log\left(1 + e^{-w^T x}\right) = \nabla_w\left( \frac{-x e^{-w^T x}}{1 + e^{-w^T x}} \right)
\overset{(a)}{=} \frac{x x^T e^{-w^T x}\left(1 + e^{-w^T x}\right) - x x^T e^{-w^T x} e^{-w^T x}}{\left(1 + e^{-w^T x}\right)^2}
= \frac{x x^T e^{-w^T x}}{\left(1 + e^{-w^T x}\right)^2} \succeq 0. \tag{18}
$$
Since the Hessian is positive semi-definite (note that xx^T ⪰ 0), the function (17) is convex, and so is J(w).
Hence, the optimal point w* is a stationary point:
$$
\nabla J(w^*) = 0. \tag{19}
$$
But there is an issue in deriving such an optimal point w* from the above. The issue is that
analytically finding such a point is not doable, because it turns out there is no closed-form solution
(check!). However, there is good news: several algorithms have been developed which allow us to
find such a point efficiently without knowledge of the closed-form solution. One prominent algorithm
that has been widely used in the field is: the gradient descent algorithm.
1 We say that a symmetric matrix, say Q = Q^T ∈ R^{d×d}, is positive semi-definite if v^T Q v ≥ 0, ∀v ∈ R^d, i.e.,
all the eigenvalues of Q are non-negative. It is simply denoted by Q ⪰ 0.
Here is how the algorithm works. It is an iterative algorithm. Suppose that at the t-th iteration,
we have an estimate of w∗ , say w(t) . We then compute the gradient of the function evaluated
at the estimate: ∇J(w(t) ). Next we update the estimate along a direction being opposite to the
direction of the gradient:
$$
w^{(t+1)} = w^{(t)} - \alpha \nabla J(w^{(t)}) \tag{20}
$$
where α > 0 indicates a stepsize (also called a learning rate). If you think about it, this update
rule makes sense. Suppose w(t) is placed right relative to the optimal point w∗, as illustrated in Fig. 2.
Then, we should move w(t) to the left so that it is closer to w∗ . The update rule actually does
this, as we subtract by α∇J(w(t) ). Notice that ∇J(w(t) ) points to the right direction given that
w(t) is placed right relative to w∗ . We repeat this procedure until it converges. It turns out: as
t → ∞, it actually converges:
w(t) −→ w∗ , (21)
as long as the stepsize is chosen properly, like the one delaying exponentially w.r.t. t. We will
not touch upon the proof of this convergence. Actually the proof is not that simple - even there
is a big field in statistics which intends to prove the convergence of a variety of algorithms (if it
is the case).
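A minimal sketch of gradient descent applied to the cross entropy objective J(w) for logistic regression; the toy data, stepsize and iteration count below are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 3
X = rng.normal(size=(m, d))
w_true = np.array([1.5, -2.0, 0.5])
y = (1 / (1 + np.exp(-X @ w_true)) > rng.random(m)).astype(float)  # labels in {0, 1}

def grad_J(w):
    y_hat = 1 / (1 + np.exp(-X @ w))
    return X.T @ (y_hat - y) / m       # gradient of the normalized cross entropy loss

w, alpha = np.zeros(d), 0.5
for t in range(2000):
    w -= alpha * grad_J(w)             # w^(t+1) = w^(t) - alpha * grad J(w^(t))
print(w)                               # roughly recovers the direction of w_true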
Look ahead
So far we have formulated an optimization problem for supervised learning, and found that cross
entropy serves as a key component in the design of the optimal loss function. We also learned
how to solve the problem via a famous algorithm, called gradient descent. As mentioned earlier,
there will be a programming implementation of this algorithm. Next time, we will study such
implementation details in the context of a simple classifier via Pytorch.
EE623 Information Theory November 6, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
We have thus far formulated an optimization problem for supervised learning:
$$
\min_{w} \sum_{i=1}^m \ell_{\text{CE}}\left(y^{(i)}, \frac{1}{1 + e^{-w^T x^{(i)}}}\right) \tag{1}
$$
where
$$
\ell_{\text{CE}}(y, \hat{y}) = -y\log\hat{y} - (1-y)\log(1-\hat{y}), \tag{2}
$$
$$
\hat{y} := \frac{1}{1 + e^{-w^T x}}. \tag{3}
$$
We proved that the cross entropy loss ℓ_CE(·,·) is the optimal loss function in the sense of maximizing
the likelihood. We also showed that the normalized version J(w) of the above objective function
is convex in w, and therefore the problem can be solved via one prominent algorithm that enables us to
efficiently find the unique stationary point: gradient descent.
$$
J(w) := \frac{1}{m}\sum_{i=1}^m \left[ -y^{(i)}\log\hat{y}^{(i)} - (1 - y^{(i)})\log(1 - \hat{y}^{(i)}) \right]. \tag{4}
$$
Today’s lecture
Today we will study how to implement the algorithm in the context of a simple classifier.
Specifically we will first investigate what that simple classifier setting of our focus is. We will
then study four implementation details w.r.t. such classifier: (i) dataset that we will use for
training and testing1 ; (ii) a deep neural network2 model & ReLU3 activation; (iii) Softmax: a
natural extension of the logistic activation for multiple (more than two) classes; (iv) Adam: an
advanced version of the gradient descent algorithm that is widely used in practice. Lastly we will
learn how to do programming for the classifier via one deep learning framework: Pytorch.
We will use the MNIST dataset, which consists of m = 60,000 training images and m_test = 10,000 testing images. Each image, say x^{(i)}, consists of 28 × 28 pixels, each indicating
a gray-scale level ranging from 0 (white) to 1 (black). It also comes with a corresponding label,
say y^{(i)}, that takes one of the 10 classes y^{(i)} ∈ {0, 1, . . . , 9}. See Fig. 2.
Figure 2: MNIST dataset: An input image is of 28-by-28 pixels, each indicating an intensity
from 0 (white) to 1 (black); each label with size 1 takes one of the 10 classes from 0 to 9.
very well-known figure in the AI field, known as the Godfather of deep learning) and his two
PhD students. What was astonishing at the time is that their DNN can achieve human-level
recognition performances, which were never attained earlier. Hence, this anchored the start
of deep learning revolution. It turns out digit classification of our interest here is one such
application where a linear classifier does not perform well. So we will employ a DNN yet a very
simple version with only two layers: one-hidden layer and output layer, illustrated in Fig. 3.
Figure 3: A two-layer fully-connected neural network where input size is 28 × 28 = 784, the
number of hidden neurons is 500 and the number of classes is 10. We employ ReLU activation
for the hidden layer, and softmax activation for the output layer; see Fig. 4 for details.
Each neuron in the hidden layer respects exactly the same procedure as that in the perceptron:
a linear operation followed by activation. For activation, the logistic function or its shifted
version, called the tanh function (spanning -1 and +1), were frequently employed in the early
days. However, many practitioners recently found that another function, named Rectified Linear
Unit (ReLU for short), is more powerful in enabling faster training while yielding a better or
equal performance. As mentioned earlier, the functional of ReLU is: ReLU(x) = max(0, x).
Hence, a common rule-of-thumb that have been applied in the deep learning field is to use ReLU
activation in all hidden layers. So we adopt the rule-of-thumb, as illustrated in Fig. 3.
For the output layer, we employ the softmax activation (see Fig. 4). Let the output-layer pre-activation be:
$$
z := [z_1, z_2, \dots, z_c]^T \in \mathbb{R}^c \tag{5}
$$
where c denotes the number of classes. The softmax function is then defined as:
$$
\hat{y}_j := [\text{softmax}(z)]_j = \frac{e^{z_j}}{\sum_{k=1}^c e^{z_k}}, \qquad j = 1, 2, \dots, c. \tag{6}
$$
Figure 4: Softmax activation for output layer. This is a natural extension of logistic activation
intended for the 2-class case.
To see that this is indeed an extension of the logistic activation, consider the case c = 2:
$$
[\text{softmax}(z)]_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)
$$
where σ(·) is the logistic function. Viewing z_1 − z_2 as the binary classifier output ŷ, this coincides
exactly with the logistic regression.
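A small sketch of the softmax activation (6), together with the c = 2 sanity check just mentioned (the max-shift is a standard numerical-stability trick and does not change the output):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z), np.sum(softmax(z)))   # probabilities over c classes, summing to 1

# c = 2 sanity check: [softmax(z)]_1 equals the logistic function of z1 - z2
z = np.array([1.3, -0.4])
print(softmax(z)[0], 1 / (1 + np.exp(-(z[0] - z[1]))))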
Here ŷ_j can be interpreted as the probability that the fed example belongs to class j. Hence,
like the binary classifier, one may want to assume:
$$
\hat{y}_j = \Pr(y = j \mid x), \qquad j = 1, \dots, c.
$$
As you may expect, under this assumption, one can verify that the optimal loss function (in the
sense of maximizing likelihood) is again the cross entropy loss:
$$
\ell^*(y, \hat{y}) = \ell_{\text{CE}}(y, \hat{y}) = -\sum_{j=1}^c y_j \log \hat{y}_j
$$
where y indicates a label of one-hot vector type. For instance, in the case of label = 2 with
c = 10, y takes:
$$
y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]^T
$$
where the entries are indexed from 0 to 9 and only the entry at index 2 equals 1.
The proof of this is almost the same as that in the binary classifier. So we will omit the proof
here. Instead you will have a chance to prove it in PS6.
Due to the above rationales, the softmax activation has been widely used for many classifiers in
the field. Hence, we will also use the conventional function in our digit classifier.
Adam optimizer
Let us discuss the specific algorithm that we will employ in our setting. As mentioned earlier, we
will use an advanced version of the gradient descent algorithm, called the Adam optimizer. To see how
the optimizer operates, let us first recall the vanilla gradient descent:
$$
w^{(t+1)} = w^{(t)} - \alpha \nabla J(w^{(t)})
$$
where w^{(t)} indicates the estimated weight in the tth iteration, and α denotes the learning rate.
Notice that the weight update relies only on the current gradient, reflected in ∇J(w(t) ). Hence,
in case ∇J(w(t) ) fluctuates too much over iterations, the weight update oscillates significantly,
thereby bringing about unstable training. To address this, people often use a variant algorithm
that exploits past gradients for the purpose of stabilization. That is, Adam optimizer.
Here is how Adam works. The weight update takes the following formula instead:
$$
w^{(t+1)} = w^{(t)} + \alpha\,\frac{m^{(t)}}{\sqrt{s^{(t)}} + \epsilon} \tag{9}
$$
where m^{(t)} indicates a weighted average of the current and past gradients:
$$
m^{(t)} = \frac{1}{1 - \beta_1^t}\left( \beta_1 m^{(t-1)} - (1 - \beta_1)\nabla J(w^{(t)}) \right). \tag{10}
$$
Here β_1 ∈ [0, 1] is a hyperparameter that captures the weight of past gradients, and hence it is
called the momentum. So the notation m stands for momentum. The factor 1/(1 − β_1^t) is applied in
front, in an effort to stabilize training in initial iterations (small t). Check the detailed rationale
behind this in PS6.
s^{(t)} is a normalization factor that makes the effect of ∇J(w^{(t)}) almost constant over t. In case
∇J(w^{(t)}) is too big or too small, we may have significantly different scalings in magnitude.
Similar to m^{(t)}, s^{(t)} is defined as a weighted average of the current and past values:
$$
s^{(t)} = \frac{1}{1 - \beta_2^t}\left( \beta_2 s^{(t-1)} - (1 - \beta_2)\left(\nabla J(w^{(t)})\right)^2 \right) \tag{11}
$$
where β_2 ∈ [0, 1] denotes another hyperparameter that captures the weight of past values, and
s stands for square.
Notice that the dimensions of w^{(t)}, m^{(t)} and s^{(t)} are the same. So all the operations that appear
in the above (including the division in (9) and the square in (11)) are component-wise. In (9), ε is a tiny
value introduced to avoid division by 0 in practice (usually 10^{-8}). Here (β_1, β_2) is exactly the
betas that you encountered in PS5.
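The dataset-loading commands that the following paragraph refers to did not survive in this version of the notes; a minimal sketch, assuming torchvision's standard datasets.MNIST API, might look like:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# download (if needed) MNIST into '../datasets' and load the images as tensors
train_dataset = datasets.MNIST(root='../datasets', train=True,
                               transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root='../datasets', train=False,
                              transform=transforms.ToTensor())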
where '../datasets' indicates the directory to which the MNIST dataset will be downloaded
('../' means the directory one level above the current directory you are working in), and
transforms.ToTensor() denotes conversion from raw images into tensor data that we will
play with.
During training, we often employ a part of the entire examples to compute a gradient of a loss
function. Such part is called the batch. Hence, the dataset is usually reorganized in the form of
batches. To this end, you can use the following:
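A minimal sketch, assuming the standard torch.utils.data.DataLoader API (batch_size below is an arbitrary choice):

import torch

batch_size = 100   # hyperparameter: number of examples per gradient computation
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, shuffle=False)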
In the MNIST dataset, the size of train_dataset should be m = 60,000, and the length of
train_loader should read ⌈m/batch_size⌉.
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)   # input layer -> hidden layer
        self.relu = nn.ReLU()                           # ReLU activation
        self.fc2 = nn.Linear(hidden_size, num_classes)  # hidden layer -> output layer

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        return self.fc2(x)
where device denotes cuda if available; cpu otherwise. See the pytorch tutorial for details.
Here nn.Linear(in, out) is a built-in class implemented in torch.nn. The main functional is
a linear operation where the associated matrix is of size out-by-in and a bias vector is of size
out. For instance, the command “x = self.fc1(x)” means the following:
x←W x + b
where W ∈ Rhidden size×input size and b ∈ Rhidden size . nn.ReLU() is another built-in class
which contains the following functional: max(0, input argument). For example, the command
“x = self.relu(x)” means: x ← max(0, x).
nn.CrossEntropyLoss()
To use this class for our classifier, you may want to consult with the following commands:
criterion = nn.CrossEntropyLoss()
loss = criterion(output,label)
where output denotes an output of my_model defined earlier, and label indicates the ground-truth
class of an associated image ∈ {0, 1, . . . , 9}. Here one important thing to notice is that
output is the result prior to softmax activation (nn.CrossEntropyLoss applies the softmax internally).
torch.optim.Adam()
This is how to use in our classifier:
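The command did not survive in this version of the notes; a minimal sketch, assuming the standard torch.optim.Adam API and the default hyperparameters mentioned below:

import torch

# my_model is the network defined earlier; lr and betas are the defaults below
optimizer = torch.optim.Adam(my_model.parameters(), lr=0.001, betas=(0.9, 0.999))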
Remember that Adam optimizer includes three important parameters: learning rate, b1 and
b2. While these are hyperparameters that we can choose as per our wish, a default choice is:
lr:=learning rate=0.001, betas:=(b1,b2)=(0.9,0.999).
Pytorch: Training
For training, we should first enable the “train mode” via:
my model.train()
optimizer.zero grad()
loss = criterion(output,label)
loss.backward()
optimizer.step()
Pytorch: Testing
For testing, we should transit to “test mode” via:
my model.eval()
One important distinction relative to training is that testing does not involve any gradient
computation. Hence, once we are under the “test mode” (through the above command), all the
operations should be programmed within the following command:
, predicted = torch.max(output.data, 1)
Look ahead
Now we are done with the first application that we would deal with in Part III. Next time, we
will move onto the second application of information theory: unsupervised learning.
EE623 Information Theory November 11, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
During the past three lectures, we have studied some basic and trending contents on super-
vised learning. The goal of supervised learning is to estimate a functional f (·) of an interested
computer system (machine) from input-output samples, as illustrated in Fig. 1.
Figure 1: Supervised learning: Learning the functional f (·) of an interested system from data
{(x(i) , y (i) )}m
i=1 .
Unsupervised learning
Now what is next? Actually we face one critical challenge that arises in supervised learning.
The challenge is that it is not that easy to collect labeled data in many realistic situations. In
general, gathering labeled data is very expensive, as it usually requires extensive human-labour-
based annotations. So people wish to do something without such labeled data. Then, a natural
question that arises is: What can we do only with {x(i) }m i=1 ?
This is where the concept of unsupervised learning kicks in. Unsupervised learning is a method-
ology for learning something about the data {x(i) }m
i=1 . You may then ask: What is something?
There are a few candidates for such something to learn in the field. One very popular candidate
is probability distribution. In fact, this is a sort of the most complex yet most fundamental
information. The probability distribution allows us to create realistic signals as we wish. The
unsupervised learning method for learning such fundamental entity is: generative models. This
is actually the most famous unsupervised learning method, which has received a particularly
significant attention in the field nowadays. So in this course, we are going to focus on this
method.
Figure 2: Ian Goodfellow, a young yet big figure in the modern AI field. He is best known as the
inventor of the Generative Adversarial Networks (GANs), which made a big wave in the history
of the AI field.
master student at Stanford, working with Prof. Andrew Ng. But he moved to the University of
Montreal for PhD study to join Joshua Bengio’s group. During his PhD, he could develop the
“GANs”. The GANs are shown to be extremely instrumental in a wide variety of applications,
even not limited to the AI field. Such applications include: image creation, human image
synthesis, image inpainting, coloring, super-resolution image synthesis, speech synthesis, style
transfer, robot navigation, to name a few. Since it works pretty well, as of Oct. 3 2019, the
state of California even passed a bill that would ban the use of GANs to make fake pornography
without the consent of the people depicted.
In view of information theory, the GAN is also an interesting framework. It turns out the GAN
is closely related to the KL divergence and mutual information that we have learned thus far.
Generative models
A generative model is defined as a mathematical model which allows us to generate fake data
which has a similar distribution as that of real data. See Fig. 3 for a pictorial representation.
Figure 3: A generative model is the one that generates fake data which resembles real data.
Here what resembling means in a mathematical language is that it has a similar distribution.
The model parameters are learned via real data so that the learned model outputs fake data
that resemble real data. Here an input signal (which should be placed in front of the model yet
not appeared in the figure) can be either an arbitrary random signal or a specifically synthesized
signal that forms the skeleton of fake data. The type of the input depends on applications of
interest - this will be detailed later on.
Let ŷ ∈ R^n be a fake output. Considering m examples, let {(x^{(i)}, ŷ^{(i)})}_{i=1}^m be such fake
input-output pairs and let {y^{(i)}}_{i=1}^m be m real data examples. See Fig. 4.
Goal
Let G(·) be the function of the generative model. Then, the goal of the generative model can be
stated as follows: Designing G(·) such that
$$
\{\hat{y}^{(i)}\}_{i=1}^m \approx \{y^{(i)}\}_{i=1}^m \quad \text{in distribution.}
$$
Here what does it mean by "in distribution"? To make it clear, we need to quantify closeness
between two distributions. One natural yet prominent approach employed in the statistics field
is to take the following two steps:
1. First, compute the empirical distributions of real and fake data, respectively:
$$
Q_Y, \; Q_{\hat{Y}}.
$$
2. Next, employ a well-known divergence measure in statistics which can serve to quantify
the closeness of two distributions. Let D(·,·) be one such divergence measure. Then, the
similarity between Q_Y and Q_{\hat{Y}} can be quantified as:
$$
D(Q_Y, Q_{\hat{Y}}).
$$
Taking the above natural approach, one can concretely state the goal as: Designing G(·) such
that
$$
D(Q_Y, Q_{\hat{Y}}) \ \text{is minimized.} \tag{1}
$$
As you may easily notice, there are some issues in solving the above problem (1). One issue is
that it is a function optimization problem, which we are not familiar with. As mentioned earlier,
one natural way to resolve this issue is to parameterize the function with a neural network:
$$
\min_{G(\cdot) \in \mathcal{N}} D(Q_Y, Q_{\hat{Y}})
$$
where N denotes a class of neural network functions. Another issue is that it is not clear how to choose the divergence measure D(·,·).
Look ahead
It turns out there are some ways to address the above issues. Very interestingly, one such way
leads to an optimization problem for GANs! So next time, we will study what that way is, and
then will take the way to derive an optimization problem for GANs.
EE623 Information Theory November 13, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we started investigating unsupervised learning. The goal of unsupervised learning is
to learn something about the data, which we newly denote by {y^{(i)}}_{i=1}^m instead of {x^{(i)}}_{i=1}^m.
Depending on target candidates for something to learn, there are a few unsupervised learning
methods. Among them, we explored one very prominent method which aims to learn the probability
distribution. That is, generative models. We then formulated an optimization problem for
generative models:
$$
\min_{G(\cdot) \in \mathcal{N}} D(Q_Y, Q_{\hat{Y}})
$$
where QY and QŶ indicate the empirical distributions (or the estimates of the true distributions)
for real and fake data, respectively; G(·) denotes the function of a generative model; D(·, ·) is
a divergence measure; and N is a class of neural network functions. We then mentioned two
major issues that arise in the problem: (i) the objective is a very complicated function of G(·);
(ii) it is not clear as to how to choose D(·, ·).
At the end of the last lecture, I claimed that there are some ways to address such issues, and very
interestingly, one such way leads to an optimization problem for a recently-developed powerful
generative model, named Generative Adversarial Networks (GANs).
Today’s lecture
Today we are going to explore details on the GANs. Specifically what we are going to do are
three-folded. First we will investigate what that way leading to the GANs is. We will then take
the way to derive an optimization problem for the GANs. Lastly we will demonstrate that the
GANs indeed have close connection to the KL divergence and mutual information.
Interpret D(QY , QŶ ) as a measure that can quantify the ability to discriminate.
In order to express such ability, he believed there must be something that plays a role of dis-
criminating. So he introduced an entity that can play such role, and named it:
Discriminator.
Goodfellow considered a simple binary-output discriminator which takes as an input, either real
data y or fake data ŷ. He then wanted to design D(·) such that D(·) well approximates the
probability that the input (·) is real data:
$$
D(\cdot) \approx \Pr\left( (\cdot) = \text{real} \right).
$$
Noticing that
$$
\Pr(y = \text{real}) = 1; \qquad \Pr(\hat{y} = \text{real}) = 0,
$$
Figure 1: Discriminator wishes to output D(·) such that D(y) is as large as possible while D(ŷ)
is as small as possible.
a well-designed discriminator should yield a large D(y) and a small D(ŷ), i.e.,
$$
D(y) \uparrow; \qquad 1 - D(\hat{y}) \uparrow.
$$
One naive way to capture the ability is simply adding the above two terms. But Goodfellow did
not take the naive way. Instead he took the following logarithmic summation:
$$
\log D(y) + \log\left(1 - D(\hat{y})\right). \tag{2}
$$
Actually I was wondering why he took this particular way, as his paper does not mention about
the rationale behind the choice. In NIPS 2016, he gave a tutorial on GANs, mentioning that
the problem formulation was inspired by a paper published in AISTATS 2010:
http://proceedings.mlr.press/v9/gutmann10a.html
Reading the paper, I realized that the logarithmic summation (2) comes from Eq. (3) in the
paper. Actually this is also related to the binary cross entropy. Think about why.
Making the particular choice, the ability to discriminate for m examples can be quantified as:
$$
\frac{1}{m}\sum_{i=1}^m \log D(y^{(i)}) + \log\left(1 - D(\hat{y}^{(i)})\right). \tag{3}
$$
A two-player game
Goodfellow then imagined a two-player game in which Player 1, Discriminator D(·), wishes to
maximize the quantified ability (3), while Player 2, Generator G(·), wants to minimize (3). See
Fig. 2.
Figure 2: A two-player game: Player 1 (Discriminator) wishes to maximize the ability (3) computed
over real data and the fake data produced by Player 2 (Generator), who wants to minimize it.
This leads to the following min-max problem:
$$
\min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m}\sum_{i=1}^m \log D(y^{(i)}) + \log\left(1 - D(\hat{y}^{(i)})\right). \tag{4}
$$
You may wonder why not max min (meaning that we first take “min” and then “max”). That
may be another way to go, but Goodfellow made the above choice. Actually there is a reason
why he took the way. This will be clearer soon. Here notice that the optimization is over
the two functions of D(·) and G(·), meaning that it is still a function optimization problem.
Luckily the year of 2014 (when the GAN paper was published) was after the starting point of
the deep learning revolution, the year of 2012. Also at that time, Goodfellow was a PhD student
of Yoshua Bengio, one of the deep learning heroes. So he was very much aware of the power
of neural networks: “Deep neural networks can well represent any arbitrary function.” This
motivated him to parameterize the two functions with neural networks, which in turn led to the
following optimization problem:
$$
\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m}\sum_{i=1}^m \log D(y^{(i)}) + \log\left(1 - D(\hat{y}^{(i)})\right), \tag{5}
$$
where N denotes a class of neural network functions. This is exactly the optimization problem
for GANs!
Remember what I mentioned earlier: the way leading to the GAN optimization is an indirect
way of solving the original optimization problem:
$$
\min_{G(\cdot) \in \mathcal{N}} D(Q_Y, Q_{\hat{Y}}). \tag{6}
$$
Then, a natural question that arises is: How are the two problems (5) and (6) related? It
turns out these are very much related. This is exactly where the choice of min max (instead of
max min) plays the role; the other choice cannot establish a connection - it was a smart choice. It
has been shown that assuming that deep neural networks can represent any arbitrary function,
the GAN optimization (5) can be translated into the original optimization form (6). We will
prove this below.
Let us start from the GAN optimization (5), dropping the neural-network constraint for the moment
(recall the assumption that deep neural networks can represent any arbitrary function):
$$
\min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m}\sum_{i=1}^m \log D(y^{(i)}) + \log\left(1 - D(\hat{y}^{(i)})\right). \tag{7}
$$
Notice that the objective is a function of D(·), and the two D(·) functions appear with
different arguments: one is y^{(i)}; the other is ŷ^{(i)}. So in the current
form (7), the inner (max) optimization problem is not quite tractable to solve. In an attempt
to make it tractable, let us express it in a different manner using the following notations.
Define a random vector Y which takes one of the m real examples with probability 1/m (uniform
distribution):
$$
Y \in \{y^{(1)}, \dots, y^{(m)}\} =: \mathcal{Y}; \qquad Q_Y(y^{(i)}) = \frac{1}{m}, \quad i = 1, 2, \dots, m,
$$
where Q_Y indicates the probability distribution of Y. Similarly define Ŷ for fake examples:
$$
\hat{Y} \in \{\hat{y}^{(1)}, \dots, \hat{y}^{(m)}\} =: \hat{\mathcal{Y}}; \qquad Q_{\hat{Y}}(\hat{y}^{(i)}) = \frac{1}{m}, \quad i = 1, 2, \dots, m,
$$
where Q_{\hat{Y}} indicates the probability distribution of Ŷ. Using these notations, one can then
rewrite the problem (7) as:
$$\min_{G(\cdot)} \ \max_{D(\cdot)} \ \sum_{i=1}^{m} Q_Y(y^{(i)}) \log D(y^{(i)}) + Q_{\hat{Y}}(\hat{y}^{(i)}) \log(1 - D(\hat{y}^{(i)})). \qquad (8)$$
In order to have a common argument z ranging over Y ∪ Ŷ, extend the two distributions as:
$$Q_Y(z) := 0 \ \text{ if } z \in \hat{\mathcal{Y}} \setminus \mathcal{Y}; \qquad (9)$$
$$Q_{\hat{Y}}(z) := 0 \ \text{ if } z \in \mathcal{Y} \setminus \hat{\mathcal{Y}}. \qquad (10)$$
Using the z notation, one can then rewrite the problem (8) as:
$$\min_{G(\cdot)} \ \max_{D(\cdot)} \ \sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log D(z) + Q_{\hat{Y}}(z) \log(1 - D(z)). \qquad (11)$$
We now see that the same argument z appears in the two D(·) functions. For each fixed z, the summand is concave in D(z), so the inner maximization can be solved by taking the derivative w.r.t. D(z) and setting it to zero. Hence, we get:
$$D^{*}(z) = \frac{Q_Y(z)}{Q_Y(z) + Q_{\hat{Y}}(z)}. \qquad (12)$$
Plugging this into (11), we obtain:
$$\min_{G(\cdot)} \ \sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{Q_Y(z) + Q_{\hat{Y}}(z)} + Q_{\hat{Y}}(z) \log \frac{Q_{\hat{Y}}(z)}{Q_Y(z) + Q_{\hat{Y}}(z)}. \qquad (13)$$
Connection to KL divergence
Let us massage the objective function in (13) to express it as:
$$\min_{G(\cdot)} \ \underbrace{\sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\hat{Y}}(z)}{2}} + Q_{\hat{Y}}(z) \log \frac{Q_{\hat{Y}}(z)}{\frac{Q_Y(z) + Q_{\hat{Y}}(z)}{2}}} \ - 2\log 2. \qquad (14)$$
Notice that the above underbraced term can be expressed with a well-known divergence measure that we are familiar with: the KL divergence. Hence, we get:
$$\min_{G(\cdot)} \ \sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z)+Q_{\hat{Y}}(z)}{2}} + Q_{\hat{Y}}(z) \log \frac{Q_{\hat{Y}}(z)}{\frac{Q_Y(z)+Q_{\hat{Y}}(z)}{2}} - 2\log 2 \qquad (15)$$
$$= \min_{G(\cdot)} \ \mathrm{KL}\!\left(Q_Y \,\Big\|\, \frac{Q_Y + Q_{\hat{Y}}}{2}\right) + \mathrm{KL}\!\left(Q_{\hat{Y}} \,\Big\|\, \frac{Q_Y + Q_{\hat{Y}}}{2}\right) - 2\log 2. \qquad (16)$$
Recall that mutual information can be written as an average of KL divergences:
$$I(T; Z) = \sum_{t} P_T(t)\, \mathrm{KL}\big(P_{Z|t} \,\|\, P_Z\big),$$
where P_T and P_Z denote the probability distributions of T and Z, respectively; and P_{Z|t} indicates the conditional distribution of Z given T = t.
Now suppose that T ∼ Bern(1/2) and we define Z as:
$$Z = \begin{cases} Y, & T = 0; \\ \hat{Y}, & T = 1. \end{cases}$$
Then, we get:
$$P_{Z|0} = Q_Y, \qquad P_{Z|1} = Q_{\hat{Y}}, \qquad P_Z = \frac{Q_Y + Q_{\hat{Y}}}{2},$$
where the last is due to the total probability law (why?). This together with (17) and (18) gives:
$$(16) = \min_{G(\cdot)} \ 2\, I(T; Z) - 2\log 2.$$
Look ahead
So far we have formulated an optimization problem for GANs and made an interesting connection
to the KL divergence and mutual information. Next time, we will study how to solve the GAN
optimization (5) and implement it via Pytorch.
EE623 Information Theory November 18, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time we investigated Goodfellow’s approach to deal with an optimization problem for
generative models, which in turn led to GANs. He started with quantifying the ability to
discriminate real against fake samples:
$$\frac{1}{m}\sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})) \qquad (1)$$
where y (i) and ŷ (i) := G(x(i) ) indicate real and fake samples (the outputs of Generator), respec-
tively; D(·) denotes the output of Discriminator; and m is the number of examples. He then
introduced two players: (i) Player 1, Discriminator, who wishes to maximize the ability; (ii)
Player 2, Generator, who wants to minimize it. This naturally led to the optimization problem
for GANs:
$$\min_{G(\cdot)\in\mathcal{N}} \ \max_{D(\cdot)\in\mathcal{N}} \ \frac{1}{m}\sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(G(x^{(i)}))) \qquad (2)$$
where N denotes a class of neural network functions. Lastly we demonstrated that the prob-
lem (2) can be stated in terms of the KL divergence or mutual information, thus making a
connection to information theory.
Now two questions arise: (i) How to solve the problem (2)?; (ii) How to do programming, say
via Pytorch?
Today’s lecture
Today we will answer these two questions. Specifically we are going to cover four things. First
we will investigate a practical method to solve the problem (2). We will then do one case study
for the purpose of exercising such method. In particular, we will study how to implement an
MNIST-like image generator. Next we will study one important implementation detail: Batch
Normalization, particularly useful for very deep neural networks. Lastly we will learn how to do
programming via Pytorch.
Parameterization
Solving the problem (2) starts with parameterizing the two functions G(·) and D(·) with neural
networks:
$$\min_{w} \ \max_{\theta} \ \underbrace{\frac{1}{m}\sum_{i=1}^{m} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)})))}_{=:J(w,\theta)} \qquad (3)$$
where w and θ indicate parameters for G(·) and D(·), respectively. Now the question of interest is: Is the parameterized problem (3) the one that we are familiar with? In other words, is J(w, θ) convex in w? Is J(w, θ) concave in θ? Unfortunately, it turns out that this is not the case. In general, the objective is highly non-convex in w and also highly non-concave in θ.
Then, what can we do? Actually there is nothing we can do beyond what we already know. We only know how to find a stationary point via a method like gradient descent. So one practical way to go is to simply look for a stationary point, say (w*, θ*), such that
$$\nabla_w J(w^*, \theta^*) = 0, \qquad \nabla_\theta J(w^*, \theta^*) = 0,$$
while keeping our fingers crossed that such a point yields near-optimal performance. Very luckily, it turns out that this is often the case in reality, especially when employing neural networks for parameterization. There have been huge efforts by many smart theorists to figure out why that is the case. However, a clear theoretical understanding is still missing despite their serious efforts.
A practical way is alternating gradient descent. Given (w^(t), θ^(t)), we first update Generator's weight as per:
$$w^{(t+1)} = w^{(t)} - \alpha_1 \nabla_w J(w^{(t)}, \theta^{(t)}),$$
where w^(t) and θ^(t) denote the weights of Generator and Discriminator at the t-th iteration, respectively; and α1 is the learning rate for Generator. Given (w^(t+1), θ^(t)), we next update Discriminator's weight as per:
$$\theta^{(t+1)} = \theta^{(t)} + \alpha_2 \nabla_\theta J(w^{(t+1)}, \theta^{(t)}),$$
where α2 is the learning rate for Discriminator. Here one important thing to notice is that the direction along which the Discriminator's weight is updated should be aligned with the gradient, reflected in the plus sign. Why? Lastly we repeat the above two updates until convergence.
In practice, we may wish to control the frequency of Discriminator weight updates relative to that of Generator. To this end, we often employ k : 1 alternating gradient descent, in which Discriminator is updated k times per single Generator update; a sketch is given below. You may wonder why we update Discriminator more frequently than Generator. Usually more updates in the inner optimization yield better performance in practice. Further, we often employ the Adam counterpart of the algorithm together with batches. Details are omitted, although we will apply such a practical version in the programming assignment PS7.
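To make this concrete, here is a minimal Pytorch-style sketch of the k : 1 alternating update. It is not the exact implementation used in PS7; the names J, w, theta, k, alpha1, alpha2 and num_iters are hypothetical placeholders for the objective, the two parameter tensors and the hyperparameters.

import torch

def k_to_1_alternating_gd(J, w, theta, k=2, alpha1=1e-3, alpha2=1e-3, num_iters=1000):
    # J(w, theta): a differentiable scalar objective; w, theta: tensors with requires_grad=True.
    for t in range(num_iters):
        # Generator step: move against the gradient (descent) w.r.t. w.
        grad_w, = torch.autograd.grad(J(w, theta), w)
        with torch.no_grad():
            w -= alpha1 * grad_w
        # Discriminator steps: move along the gradient (ascent) w.r.t. theta, k times.
        for _ in range(k):
            grad_theta, = torch.autograd.grad(J(w, theta), theta)
            with torch.no_grad():
                theta += alpha2 * grad_theta
    return w, theta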
where B indicates a batch of interest and m_B is the batch size (the number of examples in the batch). Notice that log D_θ(y^(i)) in the above does not depend on Generator's weight w. Hence, it suffices to minimize the following:
$$\min_{w} \ \underbrace{\frac{1}{m_B}\sum_{i \in B} \log\big(1 - D_\theta(G_w(x^{(i)}))\big)}_{\text{generator loss}}$$
where the underbraced term is called “generator loss”. However, in practice, instead of mini-
mizing the generator loss directly, people often rely on the following proxy:
$$\min_{w} \ \frac{1}{m_B}\sum_{i \in B} -\log D_\theta(G_w(x^{(i)})).$$
You may wonder why. There is a technically detailed rationale behind the use of such proxy for
the generator loss. Check this in PS7.
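For concreteness, here is a minimal sketch contrasting the two generator losses; the tensor d_fake is a hypothetical name for the Discriminator outputs Dθ(Gw(x^(i))) over a batch.

import torch

def generator_loss(d_fake, use_proxy=True):
    # d_fake: Discriminator outputs on fake samples, shape (m_B,), values in (0, 1).
    if use_proxy:
        # Proxy: minimize -(1/m_B) * sum_i log D(G(x^(i))).
        return -torch.log(d_fake).mean()
    # Original: minimize (1/m_B) * sum_i log(1 - D(G(x^(i)))).
    return torch.log(1.0 - d_fake).mean()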
Task
Now let us discuss one case study for implementation. The task that we will focus on is the one
related to the simple digit classifier that we exercised in Lecture 17. The task is to generate MNIST-like images, as illustrated in Fig. 1. Here we intend to train Generator with the MNIST dataset so that it outputs an MNIST-like fake image when fed with a random input signal.

Figure 1: Generator is fed with a random signal and trained with the MNIST dataset so as to output an MNIST-like fake image.
Figure 2: Generator: A 5-layer fully-connected neural network where the input size (the dimen-
sion of a latent signal) is 100; the numbers of hidden neurons are 128, 256, 512, 1024; and the
output size is 784 (= 28 × 28). We employ ReLU activation for every hidden layer, and logistic
activation for the output layer to ensure 0-to-1 output signals. We also use Batch Normalization
prior to ReLU at each hidden layer. See Fig. 3 for details.
One new ingredient in the above architecture is Batch Normalization.
Batch Normalization
Here is how it works; see Fig. 3. For illustrative purpose, focus on one particular hidden layer.
Figure 3: Batch Normalization (BN). First we do zero-centering and normalization with the
mean µB and the variance σB2 computed over the examples in an associated batch B. Next we
do customized scaling by introducing two new parameters that would also be learned during
training: γ ∈ Rn and β ∈ Rn .
Let z := [z_1, ..., z_n]^T be the output of the considered hidden layer prior to activation. Here n denotes the number of units (neurons) in the hidden layer.
The Batch Normalization (BN for short) consists of two steps. First we do zero-centering and
normalization using the mean and variance w.r.t. examples in an associated batch B:
$$\mu_B = \frac{1}{m_B}\sum_{i \in B} z^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B}\sum_{i \in B} \big(z^{(i)} - \mu_B\big)^2 \qquad (4)$$
where (·)^2 indicates a component-wise square, and hence σ_B^2 ∈ R^n. In other words, we generate the normalized output, say z_norm^(i), as:
$$z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (5)$$
where division and multiplication are all component-wise. Here ε is a tiny value introduced to avoid division by 0 (typically 10^{-5}).
Second, we do customized scaling as per:
$$\tilde{z}^{(i)} = \gamma\, z_{\text{norm}}^{(i)} + \beta \qquad (6)$$
where γ, β ∈ Rn indicate two new scaling parameters which are learnable via training. Notice
that again the operations in (6) are all component-wise.
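As a concrete illustration, here is a minimal sketch of the forward computation (4)-(6); the tensor shapes and the value of eps are assumptions, and in practice one would simply use nn.BatchNorm1d as in the code later in this section.

import torch

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    # z: (m_B, n) pre-activation outputs for a batch; gamma, beta: (n,) learnable parameters.
    mu_B = z.mean(dim=0)                              # (4): per-neuron mean over the batch
    var_B = z.var(dim=0, unbiased=False)              # (4): per-neuron variance over the batch
    z_norm = (z - mu_B) / torch.sqrt(var_B + eps)     # (5): zero-centering and normalization
    return gamma * z_norm + beta                      # (6): customized component-wise scaling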
The BN lets the model learn the optimal scale and mean of the inputs for each hidden layer. It
turns out this technique plays a significant role in stabilizing and speeding up training especially
for a very deep neural network. This has been verified experimentally by many practitioners,
although no clear theoretical justification has been provided thus far.
Figure 4: Discriminator: A 3-layer fully-connected neural network where the input size (the
dimension of a flattened vector of a real (or fake) image) is 784 (= 28 × 28); the numbers of
hidden neurons are 512, 256; and the output size is 1. We employ ReLU activation for every
hidden layer, and logistic activation for the output layer.
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.fc1 = nn.Linear(latent_dim, 128)   # first hidden layer: latent_dim (=100) -> 128
        self.bn1 = nn.BatchNorm1d(128)          # Batch Normalization prior to activation
        self.relu = nn.ReLU()
        ...

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        ...
where latent_dim is the dimension of the latent signal (which we set to 100) and nn.BatchNorm1d(128) indicates a class that does BN for 128 hidden neurons.
Since we have two models (Generator and Discriminator), we employ two optimizers accordingly:
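(The corresponding code block is omitted above. A minimal sketch of what such a setup might look like is given below, assuming the Generator class sketched earlier, a Discriminator class respecting Fig. 4, Adam for both optimizers, and hypothetical hyperparameter names lr, b1, b2.)

import torch

generator = Generator(latent_dim=100)
discriminator = Discriminator()

optimizer_G = torch.optim.Adam(generator.parameters(), lr=lr, betas=(b1, b2))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(b1, b2))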
where Discriminator is another neural network model which respects the structure in Fig. 4.
Think about how to do programming for the model.
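If you want to check your answer, one possible sketch respecting Fig. 4 (a 3-layer fully-connected network, ReLU hidden activations, logistic output) is the following; the organization via nn.Sequential is just one choice among many.

import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),  # logistic activation at the output
        )

    def forward(self, x):
        # x: a flattened 784-dimensional (28 x 28) real or fake image
        return self.model(x)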
$$x \in \mathbb{R}^{\text{latent\_dim}} \sim \mathcal{N}(0, I_{\text{latent\_dim}}).$$
Now introduce the ground-truth real-vs-fake indicator vector [1, 0]T (real=1, fake=0). Then, here
the term log Dθ (y (i) ) can be viewed as the minus binary cross entropy between the real/fake
indicator vector and its prediction counterpart [Dθ (y (i) ), 1 − Dθ (y (i) )]T :
On the other hand, another term log(1 − Dθ (ŷ (i) )) can be interpreted as the minus binary cross
entropy between the fake-vs-real indicator vector (fake=0, real=1) and its prediction counter-
part:
From this, we see that the cross entropy plays a role in computing the objective function. In Lecture 17, we used a built-in class: nn.CrossEntropyLoss(). However, here we will use another, slightly different built-in class: nn.BCELoss(). Here is how to use it in our setting:
CE_loss = nn.BCELoss()
loss = CE_loss(output, real_fake_indicator)
where output denotes an output of Discriminator defined earlier, and real_fake_indicator is the real/fake indicator vector (real=1, fake=0). Here one important thing to notice is that output is the result after logistic activation; and real_fake_indicator is a vector with the same dimension as output. The function in nn.BCELoss() automatically detects the number of examples in an associated batch, thus yielding a normalized version (through division by m_B).
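(The corresponding code block is not shown above. Here is a minimal sketch of how the generator loss might be computed with nn.BCELoss(), assuming the models and CE_loss defined earlier and a batch of fake images gen_imgs produced by the Generator.)

valid = torch.ones(gen_imgs.size(0), 1)           # all-1's target ("real")
g_loss = CE_loss(discriminator(gen_imgs), valid)  # proxy loss: -(1/m_B) * sum_i log D(G(x^(i)))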
where gen_imgs indicates fake images (corresponding to Gw(x^(i))'s) and valid denotes an all-1's vector whose dimension matches the Discriminator output on gen_imgs (why?).
Taking the minus sign in the objective, we obtain the equivalent minimization optimization:
$$\min_{\theta} \ \underbrace{\frac{1}{m_B}\sum_{i \in B} -\log D_\theta(y^{(i)}) - \log\big(1 - D_\theta(G_w(x^{(i)}))\big)}_{\text{discriminator loss}}$$
where the discriminator loss is defined as this minus version. Here is how to implement the discriminator loss:
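(Again, the original code block is not shown. A minimal sketch, assuming the same names as above plus a batch of real images real_imgs and an all-0's target fake, might read as follows.)

fake = torch.zeros(gen_imgs.size(0), 1)                      # all-0's target ("fake")
real_loss = CE_loss(discriminator(real_imgs), valid)         # -(1/m_B) * sum_i log D(y^(i))
fake_loss = CE_loss(discriminator(gen_imgs.detach()), fake)  # -(1/m_B) * sum_i log(1 - D(G(x^(i))))
d_loss = real_loss + fake_loss                               # discriminator loss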
where real_imgs indicates real images (corresponding to y^(i)'s) and fake denotes an all-0's vector with the same dimension as valid. Here we apply .detach() to gen_imgs, which was produced by Generator. This is needed to cut the gradient path to Generator, so that the discriminator loss updates only Discriminator's weights.
Look ahead
Now we are done with the second application that we wished to deal with in Part III. Next time, we will move on to the last application of information theory: fairness algorithms.
EE623 Information Theory November 20, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
During the past six lectures, we have explored two prominent methodologies that arise in machine
learning: (i) supervised learning; (ii) unsupervised learning. The goal of supervised learning is to
estimate a functional f (·) of an interested system from input-output example pairs, as illustrated
in Fig. 1(a).
Figure 1: (a) Supervised learning: Learning the functional f(·) of an interested system from input-output example pairs {(x^(i), y^(i))}_{i=1}^m; (b) Generative modeling (an unsupervised-learning methodology): Generating fake data that resemble real data, reflected in {x^(i)}_{i=1}^m.
Next application
Now what is next? As the last application of information theory, we will explore one recent trending topic that arises in modern machine learning: fairness machine learning. There are three reasons why I picked this topic as the last application.
The first reason is motivated by the recent trend in the machine learning field. As machine learning becomes prevalent in our daily lives, involving a widening array of applications such as medicine, finance, job hiring and criminal justice, one morally & legally motivated need in the design of machine learning algorithms is to ensure fairness for disadvantaged groups relative to advantaged ones. The fairness issue has received particular attention due to a learning algorithm used in US courts that yields unbalanced recidivism (criminal reoffending) scores across distinct races, e.g., it predicts higher scores for blacks than for whites; see Fig. 2.
Hence, I wish to touch upon such a trending & important topic in this course. I hope this may be of great interest to you guys.

Figure 2: A machine learning algorithm employed in US courts predicts higher recidivism scores for blacks than for whites. If you want to know more, see https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

The second reason is w.r.t. a connection to information theory. It
turns out that mutual information plays a key role in formulating an optimization problem for
fairness machine learning algorithms. The last reason is that the optimization problem is closely
related to the GAN optimization that we learned in the past lectures. You may see a perfectly coherent sequence of applications, from supervised learning and GANs to fairness machine learning.
In order to develop a fair classifier, we first need to understand what is meant by fairness. Actually, fairness is a terminology that arises in law, which deals with justice. So it has a long and rich history, and there are numerous fairness concepts prevalent in the field of law. Here we focus only on two major and prominent concepts of fairness, which have received particular attention in the modern machine learning field.
The first is disparate treatment (DT). This means an unequal treatment that occurs directly
because of some sensitive information (such as race, sex, and religion), often called sensitive at-
tributes in the literature. It is also called direct discrimination (“Gikjeop Chabyeol” in Korean),
since such attributes directly serve to yield discrimination.
The second is disparate impact (DI). This means an action that adversely affects one group
against another even with formally neutral rules wherein sensitive attributes are never used
in classification and therefore the DT does not occur. It is also called indirect discrimination
(“Ganjeop Chabyeol” in Korean), since a disparate action is made indirectly through biased
historical data.
A simple setting
For illustrative purpose, here we consider a simplified version of the predictor wherein only a few pieces of information are employed for prediction. See Fig. 3.

Figure 3: A simplified criminal reoffending predictor fed with objective data and sensitive data.
There are two types of data employed: (i) objective data; (ii) sensitive data (also called sensitive attributes). For objective data, which we denote by x, we consider only two features, say x1 and x2.
Let x1 be the number of prior criminal records. Let x2 be a criminal type, e.g., misdemeanour
or felony. For sensitive data, we employ a different notation, say z. Here we consider a simple
scalar and binary case in which z indicates a race type only among white (z = 0) and black
(z = 1). Let ŷ be the classifier output which aims to represent the ground-truth conditional
distribution p(y|x, z). Here y denotes the ground-truth label: y = 1 means reoffending within 2
years; y = 0 otherwise. This is a supervised learning setup, so we are given m example triplets: {(x^(i), z^(i), y^(i))}_{i=1}^m. (A casual terminology for prior criminal records is "priors.")
Now how to ensure the above? The solution is very simple: Not using the sensitive attribute z
at all in the prediction, as illustrated with a red-colored “x” mark in Fig. 3. Here an important
thing to notice is that the sensitive attribute is offered as part of training data although it is
not used as part of input. So z (i) ’s can be employed while designing an algorithm.
Two cases
In view of the mathematical definition (3), reducing disparate impact means maximizing the
mathematical quantity (3). Now how to design a classifier so as to maximize the DI then?
Depending on situations, the design methodology can be different. To see this, think about two
extreme cases.
The first is the one in which training data is already fair:
In this case, a natural solution is to simply rely on a conventional classifier that aims to maximize
prediction accuracy. Why? Because maximizing prediction accuracy would well respect training
data, which in turn yields large DI. The second is a non-trivial case in which training data is far
from being fair:
In this case, the conventional classifier would yield small DI. This is indeed a challenging scenario
where we need to take some non-trivial action for ensuring fairness.
In fact, the second scenario can often occur in reality, since there could be biased historical
records which form the basis of training data. For instance, the Supreme Court can make some
biased decisions for blacks against whites, and these are likely to be employed as training data.
See Fig. 4 for one such unfair scenario.

Figure 4: Data points visualization: A hollow (or black-colored-solid) circle indicates a data point of an individual with white (or black) race; the red (or blue) colored edge denotes the y = 1 reoffending (or y = 0 non-reoffending) label.

In Fig. 4, a hollow (or black-colored-solid) circle
indicates a data point of an individual with white (or black) race; and the red (or blue) colored
edge (ring) denotes the event that the interested individual reoffends (or non-reoffends) within
two years. This is an unfair situation. Notice that for positive examples (y = 1), there are more black-colored-solid circles than hollow ones, reflecting biased historical records favouring whites against blacks. Similarly for negative examples (y = 0), there are more hollow circles relative to solid ones.
where ℓ_CE(·, ·) indicates the binary cross entropy loss, and w denotes the weights (parameters) of the classifier. Here one natural approach to ensure large DI is to incorporate a DI-related constraint into the optimization (4). Maximizing DI is equivalent to minimizing 1 − DI (since 0 ≤ DI ≤ 1). So we can resort to a very well-known technique in the optimization field: regularization. That is, to add the two objectives with different weights.
Regularized optimization
Here is a regularized optimization:
$$\min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot (1 - \mathrm{DI}) \qquad (5)$$
where λ denotes a regularization factor that balances prediction accuracy against the DI-associated objective (minimizing 1 − DI). However, an issue arises in solving the regularized optimization (5). Recalling the definition of DI,
$$\mathrm{DI} := \min\left( \frac{\Pr(\tilde{Y}=1 \mid Z=0)}{\Pr(\tilde{Y}=1 \mid Z=1)}, \ \frac{\Pr(\tilde{Y}=1 \mid Z=1)}{\Pr(\tilde{Y}=1 \mid Z=0)} \right),$$
we see that DI is a complicated function of w. We have no idea as to how to express DI in terms
of w.
Another way
Since directly expressing DI as a function of w is not doable, one can rely on another way. The way that I will take is inspired by information theory, particularly by mutual information. Notice that DI = 1 means that the sensitive attribute Z is independent of the hard decision Ỹ of the prediction. Remember one key property of mutual information: the mutual information between two random variables being zero is a necessary and sufficient condition for the independence of the two. This motivates us to represent the constraint DI = 1 as:
I(Z; Ỹ ) = 0. (6)
This captures the complete independence between Z and Ỹ . Since the predictor output is Ŷ
(instead of Ỹ ), we consider another stronger condition that concerns Ŷ directly:
I(Z; Ŷ ) = 0. (7)
Notice that the condition (7) is indeed stronger than (6), i.e., (7) implies (6). This is because
$$I(Z; \tilde{Y}) \overset{(a)}{\leq} I(Z; \tilde{Y}, \hat{Y}) \overset{(b)}{=} I(Z; \hat{Y}) \qquad (8)$$
where (a) is due to the chain rule and non-negativity of mutual information; and (b) is because
Ỹ is a function of Ŷ : Ỹ := 1{Ŷ ≥ 0.5}. Notice that (7) together with (8) gives (6).
Now the question of interest is: How to express I(Z; Ŷ ) in terms of classifier parameters w? It
turns out interestingly there is a way to express it. The idea is to use a GAN trick that we
learned!
Look ahead
Next time we will review the GAN trick w.r.t. I(Z; Ŷ ) and then use the trick to explicitly
formulate an optimization.
EE623 Information Theory November 25, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we introduced the last application of information theory: A fair classifier. As an
example of a fair classifier, we considered a recidivism predictor wherein the task is to predict if
an interested individual with prior criminal records (priors for short) would reoffend within two
years; see Fig. 1 for illustration.
Figure 1: A simple recidivism predictor: Predicting a recidivism score ŷ from x = (x1 , x2 ). Here
x1 indicates the number of prior criminal records; x2 denotes a criminal type (misdemeanor or
felony); and z is a race type among white (z = 0) and black (z = 1).
In order to avoid disparate treatment (one famous fairness concept), we did not include the sensitive attribute as part of the input. To reflect another fairness constraint (disparate impact, DI for short), we introduced a regularization term into the original optimization dealing only with prediction accuracy, thus arriving at:
$$\min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}) \qquad (1)$$
where λ ≥ 0 is a regularization factor that balances prediction accuracy (reflected in the bi-
nary cross entropy terms) against the fairness constraint, reflected in I(Z; Ŷ ). Remember that
I(Z; Ŷ ) = 0 is a sufficient condition for DI = 1. At the end of the last lecture, I claimed that one
can express the mutual information I(Z; Ŷ ) of interest in terms of an optimization parameter
w, thereby enabling us to train the model parameterized by w. I then told you that the idea for
translation is to use a GAN trick that we learned in the past lectures.
Today’s lecture
Today we will support the claim. Specifically we are going to cover three things. First we will figure out what I meant by the GAN trick. Next we will use the GAN trick to formulate an optimization, yet in a simple binary sensitive attribute setting. Lastly we will extend to the general non-binary sensitive attribute case.
Recall the inner optimization in GAN:
$$\max_{D(\cdot)} \ \frac{1}{m}\sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)}))$$
where y (i) and ŷ (i) indicate the ith real and fake samples, respectively; and D(·) is a discriminator
output. Also remember that we could make an interesting connection with mutual information:
$$I(T; Z) = \max_{D(\cdot)} \ \frac{1}{m}\sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})) \qquad (2)$$
where T = 1{Discriminator’s input is real} and Z is a random variable that takes a real sample
if T = 1; a fake sample if T = 0. Please do not be confused with the different Z in (1).
Here what I meant by the GAN trick is the other way around (reverse), meaning that we start
with the mutual information of interest and then express it in terms of an optimization problem
similarly to (2). Now let us apply this trick to our problem setting. For illustrative purpose, let
us start with a simple binary sensitive attribute setting.
where (a) is due to the definition of KL divergence; (b) comes from the total probability law;
and (c) is due to the definition of entropy.
Observation
For the binary sensitive attribute case, we have:
$$I(Z; \hat{Y}) = \sum_{\hat{y}\in\hat{\mathcal{Y}},\, z\in\mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + H(Z) = \sum_{\hat{y}\in\hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 1) \log \underbrace{\frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y}}(\hat{y})}}_{=:D^{*}(\hat{y})} + \sum_{\hat{y}\in\hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 0) \log \underbrace{\frac{P_{\hat{Y},Z}(\hat{y}, 0)}{P_{\hat{Y}}(\hat{y})}}_{=1-D^{*}(\hat{y})} + H(Z). \qquad (3)$$
Notice that the first log-inside term, defined as D*(ŷ), has a close relationship with the second log-inside term in the above: the sum of the two is 1. This reminds us of the maximization formula in (2). So one may conjecture that I(Z; Ŷ) can be expressed in terms of an optimization problem as follows:
Claim 1:
$$I(Z; \hat{Y}) = \max_{D(\cdot)} \ \sum_{\hat{y}\in\hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 1) \log D(\hat{y}) + \sum_{\hat{y}\in\hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 0) \log(1 - D(\hat{y})) + H(Z). \qquad (4)$$
Proof: It turns out this is indeed the case. The proof is very simple. Taking the derivative of the objective in (4) w.r.t. D(ŷ) and setting it to zero, we get:
$$\frac{P_{\hat{Y},Z}(\hat{y}, 1)}{D^{\star}(\hat{y})} - \frac{P_{\hat{Y},Z}(\hat{y}, 0)}{1 - D^{\star}(\hat{y})} = 0,$$
where D^⋆(ŷ) is the optimal solution to the optimization (4). Hence, we get:
$$D^{\star}(\hat{y}) = \frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y},Z}(\hat{y}, 1) + P_{\hat{Y},Z}(\hat{y}, 0)} = \frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y}}(\hat{y})},$$
where the second equality is due to the total probability law. Since this is exactly the same as D*(ŷ) that we defined in (3), we complete the proof.
In practice, we do not know the underlying distributions, so we replace them with their empirical versions, each example pair receiving probability 1/m. This yields the approximation:
$$I(Z; \hat{Y}) \approx \max_{D(\cdot)} \ \frac{1}{m}\left[ \sum_{i:\, z^{(i)}=1} \log D(\hat{y}^{(i)}) + \sum_{i:\, z^{(i)}=0} \log\big(1 - D(\hat{y}^{(i)})\big) \right] + H(Z). \qquad (5)$$
Notice that this approximation is exactly the same as the inner optimization formula in the GAN (2).
Fairness-GAN optimization
Now let us go back to our original optimization setting:
$$\min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}).$$
Applying the approximation (5) into the above, one can consider:
$$\min_{w} \ \max_{\theta} \ \frac{1}{m}\left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda\left[ \sum_{i:\, z^{(i)}=1} \log D_\theta(\hat{y}^{(i)}) + \sum_{i:\, z^{(i)}=0} \log\big(1 - D_\theta(\hat{y}^{(i)})\big) \right] \right\} \qquad (6)$$
where D(·) is parameterized with θ. The sensitive-attribute entropy H(Z) in (5) is removed
here, since it is irrelevant of the optimization parameters (θ, w). Observe now that the objec-
tive function has an explicit relationship with the optimization parameters (θ, w). Hence, the
parameters are trainable via some practical algorithms such as gradient descent.
This is due to the total probability law. Now what does this remind you of? You should be reminded of the claim that appeared in the extra problem of the midterm!
Claim 2:
$$I(Z; \hat{Y}) = \max_{D(\hat{y},z):\ \sum_{z\in\mathcal{Z}} D(\hat{y},z)=1} \ \sum_{\hat{y}\in\hat{\mathcal{Y}},\, z\in\mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log D(\hat{y}, z) + H(Z). \qquad (8)$$
Proof: For those who do not remember the details and/or did not prove the claim, let me provide a proof here. Since we have multiple equality constraints (one per ŷ), we first define the Lagrange function:
$$L\big(D(\hat{y},z), \nu(\hat{y})\big) = \sum_{\hat{y}\in\hat{\mathcal{Y}},\, z\in\mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log D(\hat{y}, z) + \sum_{\hat{y}\in\hat{\mathcal{Y}}} \nu(\hat{y})\left( 1 - \sum_{z\in\mathcal{Z}} D(\hat{y}, z) \right)$$
where ν(ŷ)'s are Lagrange multipliers. Notice that there are |Ŷ| Lagrange multipliers. Since the objective function is concave in D(·, ·), we can solve the problem via the KKT conditions:
conditions:
dL(D(ŷ, z), ν(ŷ)) PŶ ,Z (ŷ, z)
= − ν ? (ŷ) = 0 ∀ŷ, z
dD(ŷ, z) D? (ŷ, z)
dL(D(ŷ, z), ν(ŷ)) X
=1− D? (ŷ, z) = 0 ∀ŷ.
dν(ŷ)
z∈Z
By choosing D^⋆(ŷ, z) = P_{Ŷ,Z}(ŷ, z)/P_Ŷ(ŷ) and ν^⋆(ŷ) = P_Ŷ(ŷ), we satisfy the KKT conditions. This implies that D^⋆(ŷ, z) is indeed the optimal solution. Since D^⋆(ŷ, z) is exactly the same as D*(ŷ, z) that we defined in (7), we complete the proof.
As before, replacing the joint distribution with its empirical version, each example pair (ŷ^(i), z^(i)) receives probability
$$Q_{\hat{Y},Z}(\hat{y}^{(i)}, z^{(i)}) = \frac{1}{m}. \qquad (9)$$
So we get:
$$I(Z; \hat{Y}) \approx \max_{D(\hat{y},z):\ \sum_{z\in\mathcal{Z}} D(\hat{y},z)=1} \ \frac{1}{m}\sum_{i=1}^{m} \log D(\hat{y}^{(i)}, z^{(i)}) + H(Z). \qquad (10)$$
Lastly by parameterizing D(·, ·) with θ and excluding H(Z) (irrelevant of (θ, w)), we obtain the
following optimization:
$$\min_{w} \ \max_{\theta:\ \sum_{z\in\mathcal{Z}} D_\theta(\hat{y},z)=1} \ \frac{1}{m}\left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_\theta(\hat{y}^{(i)}, z^{(i)}) \right\}. \qquad (11)$$
While it is similar to the GAN optimization, it comes with a couple of distinctions. First, the Generator is replaced with the Classifier. Hence, it includes an additional loss term that captures prediction accuracy, reflected in $\sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)})$. Also we now have multiple (|Z|) outputs in the Discriminator. Here each Dθ(ŷ^(i), z^(i)) can be interpreted as:
$$\Pr\big(\hat{y}^{(i)} \text{ belongs to the group } z^{(i)}\big).$$
Fairness-GAN architecture
Figure 2: The Fairness-GAN architecture: the Classifier output is fed into the Discriminator, whose |Z| outputs pass through a softmax layer.
In view of the final optimization formula (11), the Fairness-GAN architecture can be illustrated as in Fig. 2. Notice that we have |Z| outputs in the Discriminator. In an attempt to ensure the equality constraint in (11), we may want to use the softmax activation in the output layer.
Look ahead
We are now done with the optimization problem formulation for Fairness-GAN. Next time, we
will study how to solve the optimization (11), as well as how to implement it via Pytorch.
EE623 Information Theory December 2, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Recap
Last time, we formulated an optimization that respects two fairness constraints: disparate treat-
ment (DT) and disparate impact (DI). Given m example triplets {(x^(i), z^(i), y^(i))}_{i=1}^m:
$$\min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}) \qquad (1)$$
where ŷ (i) indicates the Classifier output, depending only on x(i) (not on the sensitive attribute
z (i) due to the DT constraint); and λ is a regularization factor that balances prediction accuracy
against the DI constraint, reflected in I(Z; Ŷ ). Using the GAN trick, we could approximate
I(Z; Ŷ ) in the form of optimization; for the binary sensitive attribute case, the approximation
reads:
$$I(Z; \hat{Y}) \approx \max_{D(\cdot)} \ \frac{1}{m}\left[ \sum_{i:\, z^{(i)}=1} \log D(\hat{y}^{(i)}) + \sum_{i:\, z^{(i)}=0} \log\big(1 - D(\hat{y}^{(i)})\big) \right] + H(Z).$$
Two questions that arise are: (i) how to solve the optimization (1)?; and (ii) how to implement
the algorithm via Pytorch?
Today’s lecture
Today we will address these two questions. Specifically, what we are going to do is four-fold. First we will investigate a practical algorithm that allows us to attack the optimization (1). We will then do a case study for the purpose of exercising the algorithm: recidivism prediction. In the process, we will put a special emphasis on one implementation detail: synthesizing an unfair dataset that we will use in our experiments. Lastly we will learn how to do programming via Pytorch.
Observation
Let us start by translating the optimization (1) into the one that is more programming-friendly:
$$\min_{w} \max_{\theta} \ \frac{1}{m}\left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda\left[ \sum_{i:\, z^{(i)}=1} \log D_\theta(\hat{y}^{(i)}) + \sum_{i:\, z^{(i)}=0} \log\big(1 - D_\theta(\hat{y}^{(i)})\big) \right] \right\}$$
$$\overset{(a)}{=} \min_{w} \max_{\theta} \ \frac{1}{m}\left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \left[ z^{(i)} \log D_\theta(\hat{y}^{(i)}) + (1 - z^{(i)}) \log\big(1 - D_\theta(\hat{y}^{(i)})\big) \right] \right\}$$
$$\overset{(b)}{=} \min_{w} \max_{\theta} \ \frac{1}{m}\left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}\big(y^{(i)}, \hat{y}^{(i)}\big) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(\hat{y}^{(i)})\big) \right\}$$
$$= \min_{w} \max_{\theta} \ \underbrace{\frac{1}{m}\sum_{i=1}^{m} \left\{ \ell_{\mathrm{CE}}\big(y^{(i)}, G_w(x^{(i)})\big) - \lambda\, \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(G_w(x^{(i)}))\big) \right\}}_{=:J(w,\theta)}$$
where (a) is due to z (i) ∈ {0, 1}; and (b) is due to the definition of the binary cross entropy
loss `CE (·, ·). Notice that J(w, θ) contains two cross entropy loss terms, each being a non-trivial
function of Gw (·) and/or Dθ (·). Hence, in general, J(w, θ) is highly non-convex in w and highly
non-concave in θ.
Again one can use the Adam optimizer, possibly together with the batch version of the algorithm. In order to restrict the range of λ to 0 ≤ λ ≤ 1, we apply a (1 − λ) factor to the prediction accuracy loss term.
Like the prior GAN setting, let us define two loss terms. One is “Generator (or Classifier) loss”:
$$\min_{w} \max_{\theta} \ \frac{1}{m}\sum_{i=1}^{m} \underbrace{\left\{ (1-\lambda)\, \ell_{\mathrm{CE}}\big(y^{(i)}, G_w(x^{(i)})\big) - \lambda\, \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(G_w(x^{(i)}))\big) \right\}}_{\text{generator (classifier) loss}}.$$
Given w, the Discriminator wishes to maximize:
$$\max_{\theta} \ -\frac{\lambda}{m}\sum_{i=1}^{m} \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(G_w(x^{(i)}))\big).$$
Performance metrics
Unlike to the prior two settings (supervised learning and GAN), here we need to introduce
another performance metric that captures the degree of fairness. To this end, let us first define
the hard-decision value of the prediction output w.r.t. an unseen (test) example:
where mtest denotes the number of test examples. This is an empirical version of the ground
truth Pr(Ytest = Ỹtest ).
Now how to define a fairness-related performance metric? Recall the mathematical definition of
DI:
$$\mathrm{DI} := \min\left( \frac{\Pr(\tilde{Y}=1 \mid Z=0)}{\Pr(\tilde{Y}=1 \mid Z=1)}, \ \frac{\Pr(\tilde{Y}=1 \mid Z=1)}{\Pr(\tilde{Y}=1 \mid Z=0)} \right). \qquad (4)$$
Here you may wonder how to compute the two probability quantities of interest: Pr(Ỹ = 1|Z = 0) and Pr(Ỹ = 1|Z = 1). Again we use their empirical versions, relying on the Law of Large Numbers:
$$\Pr(\tilde{Y}=1 \mid Z=0) = \frac{\Pr(\tilde{Y}=1, Z=0)}{\Pr(Z=0)} \approx \frac{\sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{\tilde{y}_{\text{test}}^{(i)}=1,\, z_{\text{test}}^{(i)}=0\}}{\sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{z_{\text{test}}^{(i)}=0\}}.$$
The above approximation is getting more and more accurate as mtest gets larger. Similarly we
can approximate Pr(Ỹ = 1|Z = 1). This way, we can evaluate the DI (4).
A case study
Now let us exercise what we have learned so far with a simple example. As a case study, we
consider the same simple setting that we introduced earlier: recidivism prediction, wherein the
task is to predict if an interested individual reoffends within 2 years, as illustrated in Fig. 1.
Figure 1: Predicting a recidivism score ŷ from x = (x1 , x2 ). Here x1 indicates the number of
prior criminal records; x2 denotes a criminal type: misdemeanor or felony; and z is a race type
among white (z = 0) and black (z = 1).
In fact, there is a real-world dataset that concerns the recidivism prediction, called the COMPAS
dataset. But this contains many attributes, so it is a bit complicated. Hence, we will take a
particular yet simple method to synthesize a much simpler unfair dataset.
Recall the visualization of an unfair data scenario that we investigated in Lecture 21, which will form the basis of our synthetic dataset (to be explained in the sequel); see Fig. 2.

Figure 2: Data points visualization: A hollow (or black-colored-solid) circle indicates a data point of an individual with white (or black) race; the red (or blue) colored edge denotes the y = 1 reoffending (or y = 0 non-reoffending) label.

In Fig. 2, a hollow (or black-colored-solid) circle indicates a data point of an individual with white (or black) race; and the red (or blue) colored edge (ring) denotes the event that the interested individual reoffends (or non-reoffends) within two years. This is indeed an unfair scenario: for y = 1, there are more black-colored-solid circles than hollow ones; similarly for y = 0, there are more hollow circles relative to solid ones.
To generate such an unfair dataset, we employ a simple method. See Fig. 3. We first generate m labels y^(i)'s so that they are i.i.d., each distributed according to Bern(1/2). For indices of positive examples (y^(i) = 1), we then generate i.i.d. x^(i)'s according to N((1, 1), 0.5^2 I), and i.i.d. z^(i)'s as per Bern(0.8), meaning that 80% are blacks (z = 1) and 20% are whites (z = 0) among the positive individuals. Notice that the generation of x^(i)'s is not quite realistic: the first and second components in x^(i) do not precisely capture the number of priors and a criminal type. You can view this generation as a sort of crude abstraction of such realistic data. On the other hand, for negative examples (y^(i) = 0), we generate i.i.d. (x^(i), z^(i))'s with different distributions: x^(i) ∼ N((−1, −1), 0.5^2 I) and z^(i) ∼ Bern(0.2), meaning that 20% are blacks (z = 1) and 80% are whites (z = 0). This way, z^(i) ∼ Bern(1/2). Why?
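A minimal numpy sketch of this synthesis procedure (with hypothetical function and variable names) might look as follows.

import numpy as np

def synthesize_unfair_data(m, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.5, size=m)          # labels: i.i.d. Bern(1/2)
    x = np.zeros((m, 2))
    z = np.zeros(m, dtype=int)
    pos, neg = (y == 1), (y == 0)
    # Positive examples: x ~ N((1,1), 0.5^2 I), z ~ Bern(0.8)
    x[pos] = rng.normal(loc=(1, 1), scale=0.5, size=(pos.sum(), 2))
    z[pos] = rng.binomial(1, 0.8, size=pos.sum())
    # Negative examples: x ~ N((-1,-1), 0.5^2 I), z ~ Bern(0.2)
    x[neg] = rng.normal(loc=(-1, -1), scale=0.5, size=(neg.sum(), 2))
    z[neg] = rng.binomial(1, 0.2, size=neg.sum())
    return x, z, y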
Model architecture
Figure 4: The Fairness-GAN architecture for the binary sensitive attribute case (Classifier followed by Discriminator).

Fig. 4 illustrates the architecture of Fairness-GAN. Since we focus on the binary sensitive attribute, we have a single output Dθ(ŷ) in the Discriminator. For the models of Classifier and Discriminator, we employ very simple single-layer neural networks with logistic activation in the output layer; see Fig. 5.
Figure 5: Classifier and Discriminator models: single-layer neural networks, each with logistic activation at the output.
where the first two arguments of (1,0.5) specify Bern(0.8); and the null space followed by train_size indicates a single dimension. Remember we generate i.i.d. Gaussian random variables for x^(i)'s. To this end, one can use one of the following two ways:
x = torch.normal(mean=(1,1), std=0.5, size=(train_size,2))
x = numpy.random.normal(loc=(1,1), scale=0.5, size=(train_size,2))
classifier = Classifier()
discriminator = Discriminator()
optimizer_G = torch.optim.Adam(classifier.parameters(), lr=lr_c, betas=(b1,b2))
optimizer_D = torch.optim.SGD(discriminator.parameters(), lr=lr_d)
where Classifier() and Discriminator() are neural network models which respect the struc-
ture in Fig. 5. Since you may be familiar with how to program the models, we omit details
here.
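(The original training-loop code is not shown here. Below is a minimal sketch of one full-batch training iteration consistent with the loss terms above; the names x_train, y_train, z_train, lamb, num_epochs and the two optimizers are assumptions, and CE_loss is assumed to be nn.BCELoss() as before.)

for epoch in range(num_epochs):
    # Classifier (generator) update
    optimizer_G.zero_grad()
    y_pred = classifier(x_train)
    g_loss = (1 - lamb) * CE_loss(y_pred, y_train) \
             - lamb * CE_loss(discriminator(y_pred), z_train)
    g_loss.backward()
    optimizer_G.step()

    # Discriminator update: detach y_pred so only Discriminator's weights are updated
    optimizer_D.zero_grad()
    d_loss = lamb * CE_loss(discriminator(y_pred.detach()), z_train)
    d_loss.backward()
    optimizer_D.step()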
where y_pred indicates the Classifier output; y_train denotes a label; and z_train is a binary sensitive attribute. Note that in this Fairness-GAN implementation we use the full batch.
Pytorch: Evaluation
Recall the DI performance:
$$\mathrm{DI} := \min\left( \frac{\Pr(\tilde{Y}=1 \mid Z=0)}{\Pr(\tilde{Y}=1 \mid Z=1)}, \ \frac{\Pr(\tilde{Y}=1 \mid Z=1)}{\Pr(\tilde{Y}=1 \mid Z=0)} \right).$$
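(The evaluation code is not shown above. A minimal sketch of the empirical DI computation, assuming numpy arrays y_tilde_test of hard decisions and z_test of sensitive attributes, is as follows.)

import numpy as np

def disparate_impact(y_tilde_test, z_test, eps=1e-8):
    # Empirical Pr(Y_tilde = 1 | Z = 0) and Pr(Y_tilde = 1 | Z = 1)
    p0 = np.mean(y_tilde_test[z_test == 0]) + eps
    p1 = np.mean(y_tilde_test[z_test == 1]) + eps
    return min(p0 / p1, p1 / p0)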
EE623 Information Theory December 4, 2019
KAIST, Fall 2019 Changho Suh (chsuh@kaist.ac.kr)
Today’s lecture
Today is the last lecture day for the course. So I would like to share something which may be
useful for your future careers. In particular, I wish to talk about two things w.r.t. how to do
research. The first is my personal advice which might be helpful for you to advance modern
technologies. The second is my own methodology w.r.t. how to efficiently read research papers.
Be strong at fundamentals
One key message that we can see from what we have learned is that fundamental concepts
(entropy, KL divergence, and mutual information in this context) play important roles. So in
view of this, my advice is: Be strong at fundamentals! But you may be confused about my advice because there are many fundamentals and what I mean by fundamentals is not clear enough. So you may wonder which fundamentals I recommend you to focus on. The
fundamentals that I would like to put a special emphasis on are w.r.t. modern technologies. To
explain what it means, let us first investigate fundamentals which played important roles in old
days.
Modern-day technologies are being driven by the 4th industrial revolution. As we all know,
the main theme of the 4th industrial revolution is: Artificial Intelligence (AI). Now what are
the fundamentals that form the basis of AI technologies? Remember that the fields of machine
learning and deep learning which provide key methodologies for achieving machine intelligence
are all based on optimization, which is a branch of mathematics. Hence, one major driving force
for AI technologies is: mathematics.
and strategy in their minds. The second is that reading papers efficiently is perhaps the most
difficult thing even for experienced students who are often exposed to research environments.
These are the reasons that I decided to talk about this topic.
Disclaimer: The contents that I will share soon are my own opinions.
Two strategies
There are two strategies depending on what you wish to do. The first is a passive approach
which can be applied to the case when you want to learn some basics and trends (not necessarily
to write a paper) in an interested field. The second is an active approach which I think is
applicable to the case when you wish to be an expert (or forced to write a paper) in a trending
and non-mature field.
Passive approach
What I mean by the passive approach is to rely on experts. In other words, it is to wait until
experts can teach in a very accessible manner, and then read well-written papers by such experts.
It would be the best if such papers are of survey type. Two natural questions that you can ask
then are: (i) how to figure out such experts?; (ii) how to read such well-written papers? Here
are my answers.
is not necessarily good unless it is insightful. I believe that a good proof comes with insights even if it is rather long.
The last guideline is to respect experts' notations, logical flows, mindsets and writing styles. Usually excellent writers spend lots of time choosing their notations, making a storyline and refining their writing. So theirs are well-cultivated best picks. You may also have your own. But if you find theirs better than yours, then please do not hesitate to replace yours with theirs. Or if you have none or cannot judge, simply follow theirs. It is worth copying.
Active approach
Now let us move on to the second approach: the active approach. As mentioned earlier, this is intended for those who wish to write papers in a relatively new (emerging) field. In other words, it is for those who cannot wait until the field becomes mature enough that experts start touring the world to deliver keynotes and/or tutorials. The spirit of the approach is based on trial-&-error. It is to read many recent papers and go back-and-forth until you figure out a big picture. As you may imagine, this is not that simple. It is in fact quite difficult. The keys to this approach are to: (i) identify relevant and good papers quickly; (ii) figure out the main ideas of the identified papers quickly. Now how to do this quickly? How to do back-and-forth with multiple papers?
throw it away.
Do back-and-forth
Iterate the above procedure while summarizing the main ideas of possibly all the short-listed
papers. There are two things that you should keep in mind during the iteration process. The first is: Do not spend much time reading "related works". Usually "related works" sections are heavy and are not along the main storyline of the paper. This is because authors are sort of forced to cite all the relevant (although not quite related) works to avoid negative reviews potentially from non-cited paper authors. So you do not need to investigate all the contents in "related works". Rather, it is mostly okay to simply ignore them. The second is w.r.t. the investigation of other references. If you really need to read another referenced paper in the process, then try to read it very quickly while keeping your goal in mind. Preferably do not be distracted by other references or spend too much time on them.
You can stop the iteration process once either: (a) you figure out a big picture; or (b) you find
a very well-written paper (which I called the “anchor paper”) that describes a whole picture
clearly. The way to read the “anchor paper” is the same as the one that I mentioned w.r.t. the
passive approach. This way, you will figure out a big picture.
2. Formulate a simple yet concrete problem that can address the challenge;
3. Try to solve the problem with conventional wisdom from first-principle thinking.
You may get stuck especially at the 3rd step above. If so, you may want to interact with your advisors, seniors and/or peers.