Math Notes for ECON2125 Students
Math Notes for ECON2125 Students
John Stachurski
March 4, 2015
Contents
1 Linear Algebra 1
1.1 Vectors Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Spans and Linear Subspaces . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.4 Bases and Dimension . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Linear Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Linear Maps from R N to R N . . . . . . . . . . . . . . . . . . . 16
1.2.3 Linear Maps Across Dimensions . . . . . . . . . . . . . . . . . 17
1.3 Matrices and Linear Equations . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Matrices as Maps . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.3 Square Matrices and Invertibility . . . . . . . . . . . . . . . . . 22
1.3.4 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.5 Solving Equations with Tall Matrices . . . . . . . . . . . . . . . 25
1.3.6 Solving Equations with Wide Matrices . . . . . . . . . . . . . . 26
1.4 Other Matrix Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.1 Types of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 27
i
CONTENTS ii
2 Probability 46
2.1 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.1 Sample Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.2 Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.1.3 Dependence and Independence . . . . . . . . . . . . . . . . . . 50
2.1.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2.1 Definition and Notation . . . . . . . . . . . . . . . . . . . . . . 54
2.2.2 Finite Random Variables . . . . . . . . . . . . . . . . . . . . . . 56
2.2.3 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.1 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.2 Densities and PMFs . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.3 The Quantile Function . . . . . . . . . . . . . . . . . . . . . . . 65
2.3.4 Expectations from Distributions . . . . . . . . . . . . . . . . . 66
2.3.5 Common Distributions . . . . . . . . . . . . . . . . . . . . . . . 68
2.4 Joint Distributions and Independence . . . . . . . . . . . . . . . . . . 69
CONTENTS iii
I Appendices 142
Linear Algebra
[roadmap]
1.1.1 Vectors
Let’s quickly review the basics. An important set for us will be, for arbitrary N ∈ N,
the set of all N-vectors, or vectors of length N. This set is denoted by R N , and a
typical element is of the form
x1
x2
x= .. where xn ∈ R for each n
.
xN
(As usual, R = R1 represents the set of all real numbers, which is, in essence, the
union of the rational and irrational numbers.) Here x has been written vertically, as a
column of numbers, but we could also write it horizontally, like so: x = ( x1 , . . . , x N ).
At this stage, we are viewing vectors just as sequences of numbers, so it makes no
difference whether they are written vertically or horizontally. It’s only when we
get to matrix multiplication that we have to concern ourselves with distinguishing
between column (vertical) and row (horizontal) vectors.
1
CHAPTER 1. LINEAR ALGEBRA 2
The vector of ones will be denoted 1, while the vector of zeros will be denoted 0:
1 0
.. ..
1 := . 0 := .
1 0
For elements of R N there are two fundamental algebraic operations: addition and
scalar multiplication. If x ∈ R N and y ∈ R N , then the vector sum is defined by
x1 y1 x1 + y1
x2 y2 x2 + y2
x + y :=: . + . := .
.
. . . .
.
xN yN xN + yN
If α ∈ R, then the scalar product of α and x is defined to be
αx1
αx2
αx := .
..
αx N
Thus, addition and scalar multiplication are defined in terms of ordinary addition
and multiplication in R, and computed element-by-element, by adding and multi-
plying respectively. Figures 1.1 and 1.1 show examples of vector addition and scalar
multiplication in the case N = 2. In the figure, vectors are represented as arrows,
starting at the origin and ending at the location in R2 defined by the vector.
Remark: In some instances, the notion of scalar multiplication includes multiplica-
tion of vectors by complex numbers. In what follows we will work almost entirely
with real scalars, and hence scalar multiplication means real scalar multiplication
unless otherwise stated.
We have defined addition and scalar multiplication of vectors, but not subtraction.
Subtraction is performed element by element, analogous to addition. The definition
can be given in terms of addition and scalar multiplication. x − y := x + (−1)y. An
illustration of this operation is given in figure 1.3. One way to remember this is to
draw a line from y to x, and then shift it to the origin.
The inner product of two vectors x and y in R N is denoted by x0 y, and defined as
the sum of the products of their elements:
N
x0 y := ∑ xn yn
n =1
CHAPTER 1. LINEAR ALGEBRA 3
x+y
y
−2x
x−y
Fact 1.1.1. For any α, β ∈ R and any x, y, z ∈ R N , the following statements are true:
1. x0 y = y0 x
These properties are easy to check from the definition. However, since this is our
first formal claim let’s step through one of the arguments carefully, just as an ex-
ercise: Consider the second equality, which is (αx)0 ( βy) = αβ(x0 y). The claim is
that the stated equality holds for any α, β ∈ R and any x, y ∈ R N . To verify a “for
any” claim we have to show that the stated property holds for arbitrary choices of
these objects. Thus we start by saying pick any α, β ∈ R and any x, y ∈ R N . By the
definitions of scalar multiplication and inner product respectively, we have
N
(αx)0 ( βy) = ∑ αxn βyn
n =1
Passing the scalars out of the sum, we see that this equals αβ ∑nN=1 xn yn , which,
using the definition of inner product again, is αβ(x0 y). This verifies the claim.
Perhaps this style of argument seemed like overkill for such a simple statement. But
getting into the habit of constructing careful arguments will help you a great deal
CHAPTER 1. LINEAR ALGEBRA 5
as we work through more complex material. One thing you’ll find in particular if
you’re still new to formal reasoning is that getting the first sentence of a proof right
(establishing exactly what it is you’re going to show) is critical. If you get this right
the rest should flow more easily.
The (Euclidean) norm of a vector x ∈ R N is defined as
!1/2
√ N
kxk := x0 x := ∑ xn2 (1.1)
n =1
and represents the length of the vector x. (In the arrow representation of vectors in
figures 1.1–1.3, the norm of the vector is equal to the length of the arrow.)
Fact 1.1.2. For any α ∈ R and any x, y ∈ R N , the following statements are true:
2. kαxk = |α|kxk
3. kx + yk ≤ kxk + kyk
4. |x0 y| ≤ kxkkyk
The first two properties you can verify yourself without difficulty. Proofs for the
second two are a bit harder. The third property is called the triangle inequality,
while the fourth is called the Cauchy-Schwarz inequality. The proof of the Cauchy-
Schwarz inequality is given as a solved exercise after we’ve built up some more
tools (see exercise 3.6.13 below). If you’re prepared to accept the Cauchy-Schwarz
inequality for now, then the triangle inequality follows, because, by the properties
of the inner product given in fact 1.1.1,
One of the most elementary ways to work with vectors is to combine them using
linear operations. Given K vectors x1 , . . . , xK in R N , a linear combination of these
vectors is a new vector of the form
K
y= ∑ α k x k = α1 x1 + · · · + α K x K
k =1
Fact 1.1.3. Inner products of linear combinations satisfy the following rule:
!0 !
K J K J
∑ αk xk ∑ β j yj = ∑ ∑ αk β j x0k y j
k =1 j =1 k =1 j =1
Given any nonempty X ⊂ R N , the set of all vectors that can be made by (finite) lin-
ear combinations of elements of X is called the span of X, and denoted by span( X ).
For example, the set of all linear combinations of X := {x1 , . . . , xK } is
( )
K
span( X ) := all vectors ∑ αk xk such that α := (α1, . . . , αK ) ∈ RK
k =1
Example 1.1.1. Let X = {1} = {(1, 1)} ⊂ R2 . The span of X is all vectors of the
form α1 = (α, α) with α ∈ R. This constitutes a line in the plane. Since we can take
α = 0, it follows that the origin 0 is in span( X ). In fact span( X ) is the unique line in
the plane that passes through both 0 and the vector 1 = (1, 1).
Example 1.1.2. Let x1 = (3, 4, 2) and let x2 = (3, −4, 0.4). This pair of vectors is
shown on the left-hand side of figure 1.4. The span of {x1 , x2 } is a plane in R3 that
passes through both of these vectors and the origin. The plane is shown on the
right-hand side of figure 1.4.
Let Y be any subset of R N , and let X := {x1 , . . . , xK }. If Y ⊂ span( X ), we say that the
vectors in X span the set Y, or that X is a spanning set for Y. This is a particularly
nice situation when Y is large but X is small, because it means that all the vectors in
the large set Y are “described” by the small number of vectors in X. We’ll have a lot
more to say about this idea.
CHAPTER 1. LINEAR ALGEBRA 7
x1 x1
0 0
x2 x2
0 0
0 0
Example 1.1.3. Consider the vectors {e1 , . . . , e N } ⊂ R N , where en has all zeros ex-
cept for a 1 as the n-th element:
1 0 0
0 1 0
e1 : = . , e2 : = . , · · · , e N : = .
.
. .
. ..
0 0 1
The case of R2 is illustrated in figure 1.5. The vectors e1 , . . . , e N are called the canon-
ical basis vectors of R N —we’ll see why later on. One reason is that {e1 , . . . , e N }
spans all of R N . To see this in the case of N = 2 (check general N yourself), observe
that for any y ∈ R2 , we have
0 1 0
y1 y1
y := = + = y1 + y2 = y1 e1 + y2 e2
y2 0 y1 0 1
e2 = (0, 1)
e1 = (1, 0)
Fact 1.1.4. Let X and Y be any two finite subsets of R N . If X ⊂ Y, then we have
span( X ) ⊂ span(Y ).
One of the key features of the span of a set X is that it is “closed” under the linear
operations of vector addition and scalar multiplication, in the sense that if we take
elements of the span and combine them using these operations, the resulting vectors
are still in the span.
To check that the span( X ) is closed under vector addition for arbitrary nonempty
X ⊂ R N , just observe that if y, z ∈ span( X ), then we can express them as finite
linear combinations of elements of X, like so:
K K0
y= ∑ αk xk and z= ∑ α0k x0k
k =1 k =1
Here each xk and x0k is an element of X, and each αk and α0k is a scalar. Adding y and
z gives
K K0
y+z = ∑ αk xk + ∑ α0k x0k
k =1 k =1
which is yet another finite linear combination of elements of X. Hence span( X )
is closed under vector addition as claimed. Another easy argument shows that
span( X ) is closed under scalar multiplication.
The notion of a set being closed under scalar multiplication and vector addition is
important enough to have its own name: A nonempty subset S of R N is called a
linear subspace (or just subspace) of R N if, for any x and y in S, and any α and β in
R, the linear combination αx + βy is also in S.
CHAPTER 1. LINEAR ALGEBRA 9
In R3 , lines and planes that pass through the origin are linear subspaces. You will
get more of a feel for this as we go along. Other linear subspaces of R3 are the
singleton set containing the zero element 0, and the set R3 itself.
Fact 1.1.5. Let S be a linear subspace of R N . The following statements are true:
2. If X ⊂ S, then span( X ) ⊂ S.
3. span(S) = S.
e1 = (1, 0)
x = (−2, 0)
x1 ∈ span{x2 , . . . , xK }
x1 = α2 x2 + · · · α K x K
1. If α1 x1 + · · · + αK xK = 0 then α1 = · · · = αK = 0.
1. X is linearly independent.
2. For each y ∈ span( X ) there exists exactly one set of scalars α1 , . . . , αK such that
K
y= ∑ αk xk (1.2)
k =1
Proof. Let X be linearly independent and pick any y ∈ span( X ). Since y is in the
span of X, we know that there exists at least one set of scalars such that (1.2) holds.
Suppose now that there are two. In particular, suppose that y = ∑kK=1 αk xk =
∑kK=1 β k xk . It follows from the second equality that ∑kK=1 (αk − β k )xk = 0. Using
fact 1.1.6, we conclude that αk = β k for all k. In other words, the representation is
unique.
CHAPTER 1. LINEAR ALGEBRA 12
On the other hand, suppose that for each y ∈ span( X ) there exists exactly one set of
scalars α1 , . . . , αK such that y = ∑kK=1 αk xk . Since 0 ∈ span( X ) must be true (why?),
this implies that there exists only one set of scalars such that 0 = ∑kK=1 αk xk . Since
α1 = · · · = αk = 0 has this property, we conclude that no nonzero scalars yield
0 = ∑kK=1 αk xk . In other words, X is linearly independent.
P := {( x1 , x2 , 0) ∈ R3 : x1 , x2 ∈ R} (1.3)
Put differently, if S is spanned by K vectors, then any subset of S with more than K
vectors will be linearly dependent. Many practical results lean on this theorem. The
proof is not particularly hard, but it is a little long. You can find it in most texts on
linear algebra (see, e.g., theorem 1.1 of [21]). Here’s an example of what we can do
with theorem 1.1.2.
1. span( X ) = R N .
2. X is linearly independent.
CHAPTER 1. LINEAR ALGEBRA 13
Example 1.1.6. The pair {e1 , e2 } is a basis for the set P defined in (1.3).
As stated above, the dimension of a linear subspace is the minimum number of vec-
tors needed to span it. To formalize this idea, we will use the following fundamental
result about bases:
The proof of part 1 can be found in any good text on linear algebra. See, for example,
theorem 1.1 of [21]. Part 2 follows from theorem 1.1.2 and is left as an exercise
(exercise 1.6.13).
If S is a linear subspace of R N , then common number identified in theorem 1.1.4 is
called the dimension of S, and written as dim(S). For example,
In R N the singleton subspace {0} is said to have zero dimension. A line {αx ∈ R N :
α ∈ R} through the origin is obviously one dimensional.
If we take a set of K vectors, then how large will its span be in terms of dimension?
The next lemma answers this question.
1. dim(span( X )) ≤ K.
Part 2 of lemma 1.1.1 is important in what follows, and also rather intuitive. It says
that the span of a set will be large when it’s elements are algebraically diverse.
Let’s finish this section with facts that can be deduced from the preceding results.
The first part of fact 1.1.8 has many useful implications, one of which is that the only
N-dimensional linear subspace of R N is R N .
[Roadmap]
CHAPTER 1. LINEAR ALGEBRA 15
1.2.1 Linearity
There are many different classes of functions and perhaps the most important of
these is the class of linear functions. In high school one learns about linear func-
tions as those whose graph is a straight line. We need a more formal (and accurate)
definition: A function T : RK → R N is called linear if
Remark: Mathematicians usually write linear functions with upper case letters and
omit the parenthesis around the argument where no confusion arises. It is a custom
of their tribe.
Example 1.2.1. The function T : R → R defined by Tx = 2x is linear because for
any α, β, x, y in R, we have
Thinking of linear functions as those whose graph is a straight line is incorrect. For
example, the function f : R → R defined by f ( x ) = 1 + 2x is nonlinear, because if
we take α = β = x = y = 1, then f (αx + βy) = f (2) = 5, while α f ( x ) + β f (y) =
3 + 3 = 6. This kind of function is actually called an affine function.
The range of a linear map is the span of the image of the canonical basis functions:
Lemma 1.2.1. If T : RK → R N is a linear map, then
Proof. Any x ∈ RK can be expressed in terms of the basis vectors as ∑kK=1 αk ek , for
some suitable choice of scalars. Hence rng( T ) is the set of all points of the form
" #
K K
Tx = T ∑ αk ek = ∑ αk Tek
k =1 k =1
as we vary α1 , . . . , αK over all scalar combinations. This coincides with the definition
of span(V ).
ker( T ) := {x ∈ RK : Tx = 0}
The proofs are straightforward. For example, suppose that and Tx = Ty for arbi-
trary x, y ∈ RK . Then x − y ∈ ker( T ). If ker( T ) = {0} then it must be that x = y.
Since x and y were arbitrary we conclude that T is one-to-one.
Theorem 1.2.1. If T is a linear function from R N to R N then all of the following are
equivalent:
1. T is a bijection.
2. T is onto.
3. T is one-to-one.
CHAPTER 1. LINEAR ALGEBRA 17
4. ker( T ) = {0}.
If any one of these equivalent conditions is true, then T is called nonsingular. (In
other words, a function from R N to itself is called a nonsingular function if it is a
linear bijection.) Otherwise T is called singular. The proof of theorem 1.2.1 is not
overly difficult given all the work we’ve already done. It is left as a (solved) exercise
(exercise 1.6.15).
If T is nonsingular, then, being a bijection, it must have an inverse function T −1
that is also a bijection (fact 4.2.1 on page 149). It turns out that this inverse function
inherits the linearity of T. In summary,
Fact 1.2.2. If T : R N → R N is nonsingular then so is T −1 .
Theorem 1.2.1 only applies to linear maps between spaces of the same dimension.
Here are fundamental results for the other two cases (smaller to bigger and bigger
to smaller).
Theorem 1.2.2. For a linear map T from RK → R N , the following statements are true:
Proof. Regarding part 1, let K < N and let T : RK → R N be linear. T cannot be onto,
for if T were onto then we would have rng( T ) = R N , in which case the vectors in
V = { Te1 , . . . , TeK } in lemma 1.2.1 would span R N , despite having only K < N
elements. This is impossible. (Why?)
The proof of part 2 is left as an exercise (exercise 1.6.16).
[Roadmap]
CHAPTER 1. LINEAR ALGEBRA 18
while
a11 ··· a1K b11 ··· b1K a11 + b11 ··· a1K + b1K
a
21 ··· a2K b21 ··· b2K a21 + b21 ··· a2K + b2K
.. .. .. + .. .. .. := .. .. ..
. . . . . . . . .
a N1 · · · a NK b N1 · · · b NK a N1 + b N1 · · · a NK + b NK
In the latter case, the matrices have to have the same number of rows and columns
in order for the definition to make sense.
Now let’s look at multiplication of matrices. If A and B are two matrices, then their
product C := AB is formed by taking as it’s i, j-th element the inner product of the
i-th row of A and the j-th column of B. That is,
K
cij = rowi (A)0 col j (B) = ∑ aik bkj
k =1
Since inner products are only defined for vectors of equal length, this requires that
the length of the rows of A is equal to the length of the columns of B. In other words,
if A is N × K and B is J × M, then we require K = J. The resulting matrix AB is
N × M. Here’s the rule to remember:
product of N × K and K × M is N × M
1. A(BC) = (AB)C
2. A(B + C) = AB + AC
CHAPTER 1. LINEAR ALGEBRA 20
3. (A + B)C = AC + BC
4. AαB = αAB
Here, we are using the word “conformable” to indicate dimensions are such that
the operation in question makes sense. For example, we’ll say “for two conformable
matrices A and B, the product AB satisfies xyz” if the dimensions of A and B are
such that the product is well defined; and similarly for addition, etc.
You can verify that, assuming conformability, we always have
AI = IA = A
For this reason the identity matrix is sometimes called the “multiplicative unit”.
The single most useful way to think of a matrix is as a mapping over vector space.
In particular, an N × K matrix A can be thought of as a map sending a vector x ∈ RK
into a new vector y = Ax in R N . Among the collection of all functions from RK to
R N , these functions defined by matrices have a special property: they are all linear.
To see that this is so, take a fixed N × K matrix A and consider the function T : RK →
R N defined by Tx = Ax. To see that T is a linear function, pick any x, y in RK , and
any scalars α and β. The rules of matrix arithmetic (see fact 1.3.1) tell us that
T (αx + βy) := A(αx + βy) = Aαx + Aβy = αAx + βAy =: αTx + βTy
1. T is linear.
In other words, the set of linear functions from RK to R N and the set of N × K
matrices are in one-to-one correspondence.
CHAPTER 1. LINEAR ALGEBRA 21
Proof. We’ve already proved one implication. Regarding the second, let T : RK →
R N be linear. We aim to construct a matrix A such that Tx = Ax for all x ∈ RK . As
usual, let ek be the k-th canonical basis vector in RK . Define an N × K matrix A by
colk (A) = Tek . Pick any x = ( x1 , . . . , xK ) ∈ RK . We can also write x = ∑kK=1 xk ek .
By linearity we have
K K
Tx = ∑ xk Tek = ∑ xk colk (A)
k =1 k =1
This is just Ax, and we are done.
(This is an important point. Please work through it using the definition of matrix
multiplication if you are not fully convinced that it’s true.) For obvious reasons, this
set is also called the column space of A. It is a linear subspace of R N . (Why?)
How large is the column space of a given matrix? To answer that question we have
to say what “large” means. In the context of linear algebra, size of subspaces is
usually measured by dimension. The dimension of span(A) is known as the rank
of A. That is,
rank(A) := dim(span(A))
A is said to have full column rank if rank(A) is equal to K, the number of its
columns. The reason we say “full” rank here is that, by definition, span(A) is
the span of K vectors. Hence, by part 1 of lemma 1.1.1 on page 14, we must have
dim(span(A)) ≤ K. In other words, the rank of A is less than or equal to K. A is
said to have full column rank when this maximum is achieved.
When is this maximum achieved? By part 2 of lemma 1.1.1, this while be the case
precisely when the columns of A are linearly independent. Let’s state this as a fact:
3. If Ax = 0, then x = 0.
The underlying problem driving much of this discussion is the need to solve systems
of equations of the form (1.4), which can be written more conveniently in matrix
notation as Ax = b. Ideally we want general conditions on A such that a solution
to this equation will always exist, plus a way to compute the solution.
There are a variety of scenarios depending on the properties of A. To narrow things
down, let’s concentrate for now on perhaps the most important case, where A is a
square N × N matrix. Let’s also aim for the best case: A set of conditions on A such
that, for every single b ∈ R N , there exists exactly one x ∈ R N such that Ax = b.
The best way to understand this problem is to recall the discussion in §1.2.1, where
we started to think about matrices as maps. Letting T be the linear map Tx = Ax,
the question we are asking here is when does each point in R N have one and only
one preimage under T. In other words, when is T a bijection?
Recall from §1.2.2 that linear bijections have a special name—they are called non-
singular functions. Moreover, theorem 1.2.1 on page 1.2.1 gives us all sorts of nice
equivalences for this property. For example, it’s enough to know that Tx = Ax is
one-to-one, or onto, or that the image of the set of canonical basis vectors is linearly
independent. The next theorem simply replicates these equivalences in the language
of matrices.
2. rank(A) = N.
3. span(A) = R N .
4. If Ax = Ay, then x = y.
5. If Ax = 0, then x = 0.
CHAPTER 1. LINEAR ALGEBRA 23
Obtaining a proof of theorem 1.3.2 is just a matter of going back to theorem 1.2.1
on page 16 and checking the implications for Tx = Ax. For example, with a bit
of algebra you should be able to convince yourself that linear independence of
{ Te1 , . . . , Te N } is equivalent to linear independence of the columns of A.
Following common usage, if any of the equivalent conditions in theorem 1.3.2 are
true we will call not just the map T but also the underlying matrix A nonsingu-
lar. If any one and hence all of these conditions fail, then A is called singular. A
nonsingular matrix sometimes also referred to as invertible.
1. There exists a square matrix B such that AB = BA = I, where where I is the identity
matrix. The matrix B is called the inverse of A, and written as A−1 .
x = A −1 b (1.6)
Proof. Let T be the linear map associated with A via Tx = Ax. Since A is nonsin-
gular, T is also, by definition, nonsingular, and hence, by fact 1.2.2 on page 17, has
a nonsingular inverse T −1 . Being nonsingular, T −1 is necessarily linear, and hence,
by theorem 1.3.1 on page 20, there exists a matrix B such that T −1 x = Bx for all x.
By the definition of the inverse we have
ABx = T ( T −1 (x)) = x = Ix
Since this holds for any x we have AB = I (see exercise 1.6.27). A similar argument
shows that BA = I.
Regarding the second claim, A−1 b is a solution to Ax = b, since AA−1 b = Ib = b.
Uniqueness follows from theorem 1.3.2 (and in particular the fact that nonsingular-
ity of A includes the implication that the map x 7→ Ax is one-to-one).
The next handy fact tells us that to to prove that B is an inverse of A, it suffices to
show either that B is a “left inverse” or that B is a “right inverse”. A short proof is
given in §1.3.4.
CHAPTER 1. LINEAR ALGEBRA 24
The next fact collects more useful results about inverse matrices.
1. (A−1 )−1 = A,
The relation (AB)−1 = B−1 A−1 is just a special case of the analogous rule for inver-
sion of bijections—see fact 4.2.1 on page 149.
1.3.4 Determinants
To each square matrix A, we can associate a unique number det(A) called the de-
terminant of A. The determinant is a bit fiddly to describe, but it turns out to give a
neat one-number summary of certain useful properties.
To begin, let S( N ) be the set of all bijections from {1, . . . , N } to itself, also called the
set of permutations on {1, . . . , N }. For π ∈ S( N ) we define the signature of π as
π (m) − π (n)
sgn(π ) := ∏ m−n
m<n
Solving for determinants of most larger matrices is a fiddly task, best left for com-
puters. However, the determinant has many neat properties that can be used for
proving various results.
CHAPTER 1. LINEAR ALGEBRA 25
1. det(I) = 1
4. det(αA) = α N det(A)
5. det(A−1 ) = (det(A))−1 ,
As an example of how these facts can be useful, let’s establish fact 1.3.3. Fix square
matrix A and suppose that a right inverse B exists, in the sense that AB = I. This
equality immediately implies that both A and B are nonsingular. Indeed, if we apply
det to both sides of AB = I and use the rules in fact 1.3.5 we get det(A) det(B) = 1.
It follows that both det(A) and det(B) are nonzero, and hence both matrices are
nonsingular.
The rest is just a mopping up operation. To show that B is the inverse of A, we just
need to check that, in addition to AB = I we also have BA = I. To obtain the latter
equality from the former, premultiply the former by B to get BAB = B, and then
postmultiply by B−1 to get BA = I. The proof for the left inverse case is similar.
Given our assumption that K < N, the scenario b ∈ span(A) is very rare. The
reason is that the dimension of span(A), which is precisely the rank of A, is less than
or equal to K (see §1.3.2) and hence strictly less than N. We known from fact 1.1.8 on
page 14 that such a space is not equal to R N , where b lives. In fact we can say more:
All K-dimensional subspaces of R N are “negligible,” and hence the “chance” of b
happening to lie in this subspace is likewise small. For example, consider the case
where N = 3 and K = 2. Then the column space of A forms at most a 2 dimensional
plane in R3 . Intuitively, this set has no volume because planes have no “thickness,”
and hence the chance of a randomly chosen b lying in this plane is essentially zero.2
As a result, the standard approach is to admit that an exact solution may not exist,
and instead focus on finding a x ∈ RK such that Ax is as close to b as possible. This
problem is taken up in §3.3.2, after we have developed tools sufficient to tackle it.
Now let’s turn to the case where A is N × K and K > N. In this setting the system of
equations Ax = b is said to be underdetermined, meaning that the number of equa-
tions (equal to N) is strictly smaller than the number of unknowns (the K elements
of x). Intuitively, in such a situation, we may not have enough information to pin
down a unique solution x. Indeed, by theorem 1.2.2 on page 17, the map x 7→ Ax
cannot be one-to-one in this setting. In fact the following is true.
Fact 1.3.6. If A is N × K, the equation Ax = b has a solution and K > N, then the
same equation has an infinity of solutions.
[Roadmap]
For a square N × N matrix A, the N elements of the form ann for n = 1, . . . , N are
called the principal diagonal:
a11 a12 · · · a1N
21 a22 · · · a2N
a
.. .. ..
. . .
a N1 a N2 · · · a NN
A square matrix A is called diagonal if all entries off the principal diagonal are zero.
For example, the identity matrix is diagonal. The following notation is often used to
define diagonal matrices:
d1 0 · · · 0
0 d2 · · · 0
diag(d1 , . . . , d N ) := . .. ..
.. . .
0 0 · · · dN
The k-th power of a square matrix A is written Ak and indicates A · · · A with k terms.
If it exists, the square root of A is written A1/2 and defined as the matrix such that
A1/2 A1/2 is A.
With diagonal matrices we can compute powers, roots, inverses and products very
easily:
1. CD = diag(c1 d1 , . . . , c N d N ).
You can check part 1 from the definition of matrix multiplication. The other parts
follow directly.
A square matrix is called lower triangular if every element strictly above the princi-
ple diagonal is zero. It is called upper triangular if every element strictly below the
principle diagonal is zero. For example, in
1 0 0 1 2 3
L := 2 5 0 and U := 0 5 6
3 6 1 0 0 1
the matrices L and U are lower and upper triangular respectively. The great advan-
tage of triangular matrices is that the associated linear equations are trivial to solve
using either forward or backward substitution. For example, with the system
1 0 0
x1 x1 b1
2 5 0 x2 = 2x1 + 5x2 = b2
3 6 1 x3 3x1 + 6x2 + x3 b3
the top equation involves only x1 , so we can solve for its value directly. Plugging
that value into the second equation, we can solve out for x2 and so on.
The transpose of N × K matrix A is a K × N matrix A0 such that coln (A0 ) = rown (A).
For example, given
10 40
1 3 5
A := 20 50 B := (1.8)
2 4 6
30 60
1 2
10 20 30
A0 = B0 := 3 4
40 50 60
5 6
1. (A0 )0 = A.
2. (AB)0 = B0 A0 .
3. (A + B)0 = A0 + B0 .
Fact 1.4.6. If A and B are N × N matrices and α and β are two scalars, then
The rank of a matrix can be difficult to determine. One case where it is easy is where
the matrix is idempotent. A square matrix A is called idempotent if AA = A.
For the 2 × 2 matrix in (1.7), one can use the rule for the 2 × 2 determinant in (1.7),
fact 1.4.8 and a little bit of algebra to show that its eigenvalues are given by the two
roots of the polynomial expression
λ2 − ( a + d)λ + ( ad − bc) = 0
More generally, given any N × N matrix A, it can be shown via the fundamental
theorem of algebra that there exist complex numbers λ1 , . . . , λ N , not necessarily dis-
tinct, such that
N
det(A − λI) = ∏ (λn − λ) (1.10)
n =1
CHAPTER 1. LINEAR ALGEBRA 31
1. det(A) = ∏nN=1 λn .
2. trace(A) = ∑nN=1 λn .
It follows immediately from item 1 of fact 1.4.9 that A is nonsingular if and only if
all its eigenvalues are nonzero.
An important concept in the field of dynamic systems is conjugacy. For example, let
f : A → A where A is any set. We are often interested in the evolution of sequences
defined recursively by such maps:
a t +1 = f ( a t ), a0 = a given point in A
f = τ ◦ g ◦ τ −1
This means that f ( a) = τ ( g(τ −1 ( a))) for all a ∈ A. Observe that, under this conju-
gacy,
f 2 = f ◦ f = τ ◦ g ◦ τ −1 ◦ τ ◦ g ◦ τ −1 = τ ◦ g 2 ◦ τ −1
CHAPTER 1. LINEAR ALGEBRA 32
P −1 P
x A Ax
In this expression we’ve used the symbol λn for the scalars, which is reminiscent of
our notation for eigenvalues. There is a reason for this:
CHAPTER 1. LINEAR ALGEBRA 33
Proof. If A = PDP−1 then AP = PD. Equating the n-th column on each side gives
Apn = λn pn , where pn := coln (P). Thus, to show that (pn , λn ) is an eigenpair, we
need only check that pn is not the zero vector. In fact this is immediate, because if it
was the zero vector then P would not be invertible. (Why?)
Conversely, if A has N linearly independent eigenvectors p1 , . . . , p N , then by form-
ing P via coln (P) = pn and taking D = diag(λ1 , . . . , λ N ) where λn is the eigenvalue
associated with pn , we can stack the individual vector equations Apn = λn pn into
the matrix form AP = PD. Using the invertibility implied by linear independence,
we then have A = PDP−1 . In particular, A is diagonalizable.
kAxk
kAk := max : x ∈ RK , x 6 = 0 (1.11)
kxk
Note that in this (standard) notation there are two different norms in play. The
left hand side is a matrix norm. The norm expressions on the right hand side are
ordinary Euclidean vector norms.
The matrix norm behaves very much like the Euclidean norm. For example,
Fact 1.4.11. For any N × K matrices A and B, the matrix norm satisfies
3. kA + Bk ≤ kAk + kBk.
Returning to the problem at hand, it’s clear from (1.11) that if kAk < 1, then any
nonzero point is “contracted” by A, in the sense of being pulled closer to the origin
(i.e., kAxk < kxk). In this sense its implications are similar to | a| < 1 in the scalar
case. In particular, we have the following parallel result:
We’ve spent a lot of time discussing linear maps, one class of which is the linear
real-valued maps. By theorem 1.3.1 we know that any linear map from R N to R
takes the form x 7→ x0 a for some vector a ∈ R N . The next level of complexity is
quadratic real-valued maps. To describe them, let A be N × N and symmetric, and
let x be N × 1. The quadratic function or quadratic form on R N associated with A
is the map Q defined by
N N
Q(x) := x0 Ax = ∑ ∑ aij xi x j
j =1 i =1
To give a simple illustration, let N = 2 and let A be the identity matrix I. In this
case,
Q(x) = kxk2 = x12 + x22
A 3D graph of this function is shown in figure 1.8.
One thing you’ll notice about this function is that its graph lies everywhere above
zero, or Q(x) ≥ 0. In fact we know that kxk2 is nonnegative and will be zero only
when x = 0. Hence the graph touches zero only at the point x = 0. Many other
choices of A yield a quadratic form with this property. Such A are said to be positive
definite. More generally, an N × N symmetric matrix A is called
12
10
8
6
4
2
0
3
2
1
3 0
2 1 1
0 1 2
2 3 3
0
2
4
6
8
10
12
3
2
1
3 0
2 1 1
0 1 2
2 3 3
40
20
0
20
3
40 2
1
0
3 2 1
1 0 1 2
2 3 3
If A fits none of these categories then A is called indefinite. Figure 1.9 shows the
graph of a negative definite quadratic function. Now the function is hill-shaped,
and 0 is the unique global maximum. Figure 1.10 shows an indefinite form.
The easiest case for detecting definiteness is when the matrix A is diagonal, since
From the right hand expression we see that a diagonal matrix is positive definite
if and only if all diagonal elements are positive. Analogous statements are true for
nonnegative, nonpositive and negative definite matrices. The next fact generalizes
this idea and is proved in §3.1.2.
Finally, here’s a necessary (but not sufficient) condition for each kind of definiteness.
Fact 1.4.15. If A is positive definite, then each element ann on the principal diagonal
is positive, and the same for nonnegative, nonpositive and negative.
To be written.
1.6 Exercises
Ex. 1.6.1. Given two vectors x and y, show that |kxk − kyk| ≤ kx − yk.3
Ex. 1.6.2. Use fact 1.1.2 on page 5 to show that if y ∈ R N is such that y0 x = 0 for
every x ∈ R N , then y = 0.
Ex. 1.6.4. Show that if S and S0 are two linear subspaces of R N , then S ∩ S0 is also a
linear subspace of R N .
Ex. 1.6.5. Show that every linear subspace of R N contains the origin 0.
Ex. 1.6.6. Show that the vectors (1, 1) and (−1, 2) are linearly independent.5
3 Hint: Use the triangle inequality.
4 Hint: There’s no need to go taking derivatives and setting them equal to zero. An easier proof
exists. If you’re stuck, consider the Cauchy-Schwarz inequality.
5 Hint: Look at the different definitions of linear independence. Choose the one that’s easiest to
Q := {( x1 , x2 , x3 ) ∈ R3 : x2 = x1 + x3 }
Q := {( x1 , x2 , x3 ) ∈ R3 : x2 = 1}
Ex. 1.6.10. Show that if T : R N → R N is a linear function and λ is any scalar, then
E := {x ∈ R N : Tx = λx} is a linear subspace of R N .
Ex. 1.6.13. Show that if S is a linear subspace of R N then every basis of S has the
same number of elements.
Ex. 1.6.16. Show that if T : RK → R N is linear and K > N, then T is not one-to-one.
Ex. 1.6.17. Prove the claim in fact 1.4.14 on page 38 that if A is positive definite, then
A is nonsingular. If you can, prove it without invoking positivity of its eigenvalues.
Ex. 1.6.19. Show that for any two conformable matrices A and B, we have (AB)−1 =
B − 1 A − 1 .6
Ex. 1.6.21. Show that if ei and e j are the i-th and j-th canonical basis vectors of R N
respectively, and A is an N × N matrix, then ei0 Ae j = aij , the i, j-th element of A.
Ex. 1.6.22. Prove fact 1.3.6 on page 26.
Ex. 1.6.23. Let
1 −1 1 2
A := , B :=
−1 1 2 1
Show that
1. A is nonnegative definite.
Ex. 1.6.25. Show that for any matrix A, the matrix A0 A is well-defined (i.e., multi-
plication is possible), square, and nonnegative definite.
Ex. 1.6.26. Show that if A and B are positive definite and A + B is well defined, then
it is also positive definite.
Ex. 1.6.27. Let A be N × K. Show that if Ax = 0 for all K × 1 vectors x, then A = 0
(i.e., every element of A is zero). Show as a corollary that if A and B are N × K and
Ax = Bx for all K × 1 vectors x, then A = B.
Ex. 1.6.28. Let I N be the N × N identity matrix.
Ex. 1.6.32. Show that A is nonsingular if and only if 0 is not an eigenvalue for A.
Ex. 1.6.33. Show that the only nonsingular idempotent matrix is the identity matrix.
1. Show that if x is any N × 1 vector, then Zx is a vector with all elements equal
to the mean of the elements of x.
Solution to Exercise 1.6.8. If a := (1, −1, 1), then Q is all x with a0 x = 0. This set is
a linear subspace of R3 , as shown in exercise 1.6.7.
Solution to Exercise 1.6.11. We are asked to verify the equivalences in fact 1.1.6 on
page 10 for the set X := {x1 , . . . , xK }. The right way to do this is to establish a cycle,
such as part 1 implies part 2 implies part 3 implies part 1. It is then clear that part i
implies part j for any i and j.
First let’s show that part 1 implies part 2, which is that if X0 is a proper subset of X,
then span( X0 ) is a proper subset of span( X ). To save fiddly notation let’s take X0 :=
{x2 , . . . , xK }. Suppose to the contrary that span( X0 ) = span( X ). Since x1 ∈ span( X )
CHAPTER 1. LINEAR ALGEBRA 42
we must then have x1 ∈ span( X0 ), from which we deduce the existence of scalars
α2 , . . . , αK such that 0 = −x1 + α2 x2 + · · · + αK xK . Since −1 6= 0, this contradicts
part 1.
The next claim is that part 2 implies part 3; that is, that so vector in X can be
written as a linear combination of the others. Suppose to the contrary that x1 =
α2 x2 + · · · + αK xK , say. Let y ∈ span( X ), so that y = β 1 x1 + · · · + β K xK . If we
use the preceding equality to substitute out x1 , we get y as a linear combination of
{x2 , . . . , xK } alone. In other words, any element of span( X ) is in the span of the
proper subset {x2 , . . . , xK }. Contradiction.
The final claim is that part 3 implies part 1; that is, that α1 = · · · = αK = 0 whenever
α1 x1 + · · · + αK xK = 0. Suppose to the contrary that there exist scalars with α1 x1 +
· · · + αK xK = 0 and yet αk 6= 0 for at least one k. It follows immediately that xk =
(1/αk ) ∑ j6=k α j x j . Contradiction.
Solution to Exercise 1.6.12. The aim is to prove fact 1.1.7 on page 11. Regarding
the first part, let’s take X as linearly independent and show that the subset X0 :=
{x1 , . . . , xK−1 } is linearly independent. (The argument for more general subsets is
similar.) Suppose to the contrary that X0 is linearly dependent. Then by the defini-
tion we can take α1 , . . . , αK −1 not all zero with ∑kK=−11 αk xk = 0. Letting αK = 0 we can
write this as ∑kK=1 αk xk = 0. Since not all coefficients are zero we have contradicted
linear independence of X.
Regarding the second claim, let X := {x1 , . . . , xK } be linearly independent and sup-
pose that x j = 0. Then by setting αk = 0 for k 6= j and α j = 1 we can form scalars
not all equal to zero with ∑kK=1 αk xk = 0.
Regarding the third claim, let X := {x1 , . . . , xK } ⊂ R N be linearly independent and
let xK +1 be any point in R N such that xK +1 ∈ / span( X ). The claim is that X ∪ {xK +1 }
is linearly independent. Suppose to the contrary that there exist α1 , . . . , αK , αK +1 not
all zero such that ∑kK=+11 αk xk = 0. There are two possibilities for αK +1 , both of which
lead to a contradiction: First, if αK +1 = 0, then, since α1 , . . . , αK , αK +1 are not all
zero, at least one of α1 , . . . , αK are nonzero, and, moreover, ∑kK=1 αk xk = ∑kK=+11 αk xk =
0. This contradicts our assumption of independence on X. On the other hand, if
αK +1 6= 0, then from ∑kK=+11 αk xk = 0 we can express xK +1 as a linear combination of
elements of X. This contradicts the hypothesis that xK +1 ∈ / span( X ).
Solution to Exercise 1.6.13. Let B1 and B2 be two bases of S, with K1 and K2 ele-
ments respectively. By definition, B2 is a linearly independent subset of S. More-
over, S is spanned by the set B1 , which has K1 elements. Applying theorem 1.1.2, we
CHAPTER 1. LINEAR ALGEBRA 43
see that B2 has at most K1 elements. That is, K2 ≤ K1 . Reversing the roles of B1 and
B2 gives K1 ≤ K2 .
Solution to Exercise 1.6.14. The aim is to prove fact 1.1.8 on page 14. Suppose that
S and T are K-dimensional linear subspaces of R N with S ⊂ T. We claim that
S = T. To see this, observe that by the definition of dimension, S is equal to span( B)
where B is a set of K linearly independent basis vectors {b1 , . . . , bK }. If S 6= T,
then there exists a vector x ∈ T such that x ∈ / span( B). In view of fact 1.1.6 on
page 10, the set {x, b1 , . . . , bK } is linearly independent. Moreover, since x ∈ T and
since B ⊂ S ⊂ T, we now have K + 1 linearly independent vectors inside T. At
the same time, being K-dimensional, we know that T is spanned by K vectors. This
contradicts theorem 1.1.2 on page 12.
Regarding part 2, suppose that S is an M-dimensional linear subspace of R N where
M < N and yet S = R N . Then we have a space S spanned by M < N vectors that at
the same time contains the N linearly independent canonical basis vectors. We are
led to another contradiction of theorem 1.1.2. Hence S = R N cannot hold.
From fact 1.2.1 we have ker( T ) = {0} iff T is one-to-one, so we can now state the
following equivalences
Solution to Exercise 1.6.16. Let T be as described in the exercise and let K > N.
Seeking a contradiction, suppose in addition that T is one-to-one. Let {αk }kK=1 be
such that ∑kK=1 αk Tek = 0. By linearity, T (∑kK=1 αk ek ) = 0, and since T is one-to-one
and T0 = 0, this in turn implies that ∑kK=1 αk ek = 0. Since the canonical basis vectors
are linearly independent, it must be that α1 = · · · = αK = 0. From this we conclude
that { Te1 , . . . , TeK } is linearly independent. Thus R N contains K linearly indepen-
dent vectors, despite the fact that N < K. This is impossible by theorem 1.1.2 on
page 12.
Solution to Exercise 1.6.17. Let A be positive definite and consider the following: If
A is singular, then there exists nonzero x with Ax = 0 (see theorem 1.3.2 on page 22).
But then x0 Ax = 0 for nonzero x. Contradiction.
Solution to Exercise 1.6.28. The solutions are as follows: (1) I N is full column rank
because its columns are the canonical basis vectors, which are independent. (2) By
definition, B is the inverse of A if BA = AB = I N . It follows immediately that I N
is the inverse of itself. (3) A sufficient condition is α > 0. If this holds, then given
x 6= 0, we have x0 αI N x = αkxk2 > 0.
CHAPTER 1. LINEAR ALGEBRA 45
Second, XX = I N , because
Probability
[Roadmap]
We begin with some foundations of probability theory. These involve a few set
theoretic operations—see Appendix 4.1 for a review.
Example 2.1.1. Let Ω := {1, . . . , 6} represent the six different faces of a dice. A
realization of uncertainty corresponds to a roll of the dice, with the outcome being
an integer ω in the set {1, . . . , 6}.
The specification of all possible outcomes Ω is one part of our model. The other
thing we need to do is to assign probabilities to outcomes. One idea is to start by
assigning appropriate probabilities to every ω in Ω. It turns out that this is not the
46
CHAPTER 2. PROBABILITY 47
right way forward. Instead, the standard approach is to directly assign probabilities
to subsets of Ω instead. In the language of probability theory, subsets of Ω are called
events. The set of events is usually denoted by F , and we follow this convention.1
Below we attach probabilities to events using notation P( B), where B is an event
(i.e., B ∈ F ). The symbol P( B) should be interpreted as representing “the probabil-
ity that event B occurs.” The way you should think about it is this:
To illustrate, consider again example 2.1.1. Let B be the event {1, 2}. The number
P( B) represents the probability that the face ω selected by the roll is either 1 or 2.
Let Ω be any sample space. Two events we always find in F are Ω itself and the
empty set ∅, the latter because the empty set is regarded as being a subset of every
set.2 In this context, Ω is called the certain event because it always occurs (regard-
less of which outcome ω is selected, ω ∈ Ω is true by definition). The empty set ∅
is called the impossible event.
Remark 2.1.1. Why not start the other way, assigning probability to individual points
ω ∈ Ω, and then working out the probability of events by looking at the probability
of the points contained in those events? In fact this approach fails in many cases.
For example, suppose we take Ω to be the interval (0, 1) and set up a model where
all numbers in (0, 1) are “equally likely.” In this scenario, it can be shown that the
probability of an event of the form ( a, b) is just the length of the interval, or b − a.
Moreover, the probability of an individual point x ∈ (0, 1) occuring should be less
than the probability of some point in the interval ( x − e, x + e) occuring. Hence the
probability of hitting x is less than 2e for any e we choose. No positive number is
less than 2e for any e. Hence the probability of x must be zero. As a result, we
can’t build probabilities of events from probabilities of individual elements, since
the probability of hitting any given point is zero. There is no way to build up a
proper probabilistic model with this information alone.
1 I’m skirting some technical details here. In many situations, we exclude some of the more “com-
plex” subsets of Ω from F because they are so messy that assigning probabilities to these sets causes
problems for the theory. See §2.1.4 for more discussion.
2 Why? Because in mathematics every sensible mathematical statement must be either true or
false, so ∅ ⊂ Ω must be true or false. To show the statement true, we need to show that every element
of ∅ is also in Ω. Since ∅ contains no elements we cannot test this statement directly. However it is
certain not false, since no counterexample can be given. As a result we regard it to be true.
CHAPTER 2. PROBABILITY 48
2.1.2 Probabilities
1. P(Ω) = 1, and
Example 2.1.2. Let Ω := {1, . . . , 6} represent the six different faces of a dice, as in
example 2.1.1. We define a function P over all A ∈ F by
#A
P( A ) : = where #A := number of elements in A (2.1)
6
For example, P{2, 4, 6} = 3/6 = 1/2. Let’s check the axioms that define a proba-
bility. It’s easy to see that 0 ≤ P( A) ≤ 1 for any A ∈ F , and that P(Ω) = 1. Re-
garding additivity, suppose that A and B are two disjoint subsets of {1, . . . , 6}. Then
#( A ∪ B) = #A + #B, since, by disjointness, the number of elements in the union is
just the number contributed by A plus the number contributed by B. Hence
#( A ∪ B ) #A + #B #A #B
P( A ∪ B ) = = = + = P( A ) + P( B )
6 6 6 6
CHAPTER 2. PROBABILITY 49
In the definition of P, additivity was defined for pairs of sets, but this is enough
to imply additivity over any disjoint finite union. In particular, if A1 , . . . , A J are
disjoint in the sense that Ai ∩ A j = ∅ whenever i 6= j, then
J
P ∪ jJ=1 A j = ∑ P( A j )
j =1
Example 2.1.3. Continuing example 2.1.2, if we roll the dice, the probability of get-
ting an even number is the probability of getting a 2 plus that of getting a 4 plus that
of getting a 6. Formally,
P( A) := 2− N (#A)
To see that this is indeed a probability on (Ω, F ) we need to check that 0 ≤ P( A) ≤ 1
for all A ⊂ Ω, that P(Ω) = 1, and that P is additive. Exercise 2.8.7 asks you to
confirm that P is additive. That P(Ω) = 1 follows from the fact that the number of
distinct binary sequences of length N is 2 N .
Now let’s go back to the general case, where (Ω, F , P) is an arbitrary probability
space. From the axioms above, we can derive a suprising number of properties.
Let’s list the key ones, starting with the next fact.
1. P( B \ A ) = P( B ) − P( A );
2. P( A ) ≤ P( B );
3. P( Ac ) := P(Ω \ A) = 1 − P( A); and
4. P(∅) = 0.
These claims are not hard to prove. Regarding the part 1, if A ⊂ B, then we have B =
( B \ A) ∪ A. (Sketch the Venn diagram.) Since B \ A and A are disjoint, additivity
of P now gives
P( B ) = P( B \ A ) + P( A ) (whenever A ⊂ B)
This equality implies parts 1–4 of fact 2.1.1. Rearranging gives part 1, while nonneg-
ativity of P gives part 2. Specializing to B = Ω gives part 3, and setting B = A gives
part 4.
The property that A ⊂ B implies P( A) ≤ P( B) is called monotonicity, and is fun-
damental. If A ⊂ B, then we know that B occurs whenever A occurs (because if
ω lands in A, then it also lands in B). Hence, the probability of B should be larger.
Many crucial ideas in probability boil down to this one point.
Fact 2.1.2. If A and B are any (not necessarily disjoint) events, then
P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B )
Example 2.1.5. Consider an experiment where we roll a dice twice. A suitable sam-
ple space is the set of pairs (i, j), where i and j are between 1 and 6. The first element
i represents the outcome of the first roll, while the second element j represents the
outcome of the second roll. Formally,
Ω := {(i, j) : i, j ∈ {1, . . . , 6}}
For our probability, let’s define P( E) := #E/36. (Here elements of E are pairs, so #E
is the number of pairs in E.) Now consider the events
A := {(i, j) ∈ Ω : i is even} and B := {(i, j) ∈ Ω : j is even}
In this case we have
A ∩ B = {(i, j) ∈ Ω : i and j are even}
We can establish (2.3), indicating independence of A and B under P. To check this
we need to be able to count the number of elements in A, B and A ∩ B. The basic
principle for counting ordered tuples is that the total number of possible tuples is the
product of the number of possibilities for each element. For example, the number of
distinct tuples
(i, j, k) where i ∈ I, j ∈ J and k ∈ K
is (#I ) × (#J ) × (#K ). Hence, the number of elements in A is 3 × 6 = 18, the number
of elements in B is 6 × 3 = 18, and the number of elements in A ∩ B is 3 × 3 = 9. As
a result,
P( A ∩ B) = 9/36 = 1/4 = (18/36) × (18/36) = P( A)P( B)
Thus, A and B are independent, as claimed.
A very useful result is the law of total probability, which says the following: Let
A ∈ F and let B1 , . . . , B M be a partition of Ω, so that Bm ∈ F for each m, Bj ∩ Bk = ∅
when j 6= k, and ∪m M B = Ω. If P( B ) > 0 for all m, then
=1 m m
M
P( A) = ∑ P( A | Bm ) · P( Bm )
m =1
The proof is straightforward, although you should check that the manipulations of
intersections and unions if you have not seen them before:
P( A) = P[ A ∩ (∪mM=1 Bm )] = P[∪mM=1 ( A ∩ Bm )]
M M
= ∑ P( A ∩ Bm ) = ∑ P( A | Bm ) · P( Bm )
m =1 m =1
CHAPTER 2. PROBABILITY 52
Example 2.1.6. Suppose we flip a coin to decide whether to take part in a poker
game. If we play the chance of losing money is 2/3. The overall chance of losing
money (LM) that evening is
As alluded to above, in the presentation of probability spaces I’ve swept some tech-
nical details under the carpet to make the presentation smooth. These details won’t
affect anything that follows, and this whole course can be completed successfully
without knowing anything about them. Hence you can skip this section on first
reading. Nevertheless, if you intend to keep going deeper into probability and
statistics, eventually you will have to work your way through them. So let’s note
them as points for future study.
First, it turns out that assigning probabilities to all subsets of an arbitrary set Ω
in a consistent way can be quite a difficult task. The reason is that if Ω is relatively
large—a continuum, say—then it contains an awful lot of subsets, and some of them
can be manipulated to exhibit very strange phenomena. (Look up the Banach-Tarski
paradox for a hint of what I mean.) Because of this we usually take our set of events
F to be a “well behaved” subset of the subsets of Ω, and only assign probabilities to
elements of F .
How to choose F ? As stated above, we don’t just choose freely because doing so
will make it hard to form a consistent theory. Instead, the usual method is to start
with a collection of sets that are reasonable and well behaved, and then permit ex-
tention to other sets than can be obtained from the original sets by some standard
set operations.
For example, let’s suppose that A ∈ F , so that P( A) is well defined, and represents
the “probability of event A.” Now, given that we can assign a probability to the
event A, it would be a bit unfortunate if we couldn’t assign a probability to the event
“not A”, which corresponds to Ac . So normally we require that if A ∈ F , then Ac ∈
F . When this is true, we say that F is “closed under the taking of complements”.
Also, let’s suppose that A and B are both in F , so we assign probabilities to these
events. In this case, it would be natural to think about the probability of the event
CHAPTER 2. PROBABILITY 53
A ∪ B = ( Ac ∩ Bc )c
[Roadmap]
CHAPTER 2. PROBABILITY 54
Thus random variables convert outcomes in sample space—which can be any kind
of object—into numerical outcomes. This is valuable because numerical outcomes are
easy to order, add, subtract, etc. In other words, random variables “report” the
outcome of an experiment in a format amenable to further analysis.3
Example 2.2.1. Consider the sample space
Ω := {(b1 , b2 , . . .) : bn ∈ {0, 1} for each n}
Ω is called the set of all infinite binary sequences. (This is an infinite version of
the sample space in example 2.1.4. Imagine a computer with an infinite amount of
memory.) Consider an experiment where we flip a coin until we get a “heads”. We
let 0 represent tails and 1 represent heads. The experiment of flipping until we get a
heads can be modeled with the random variable
x (ω ) = x (b1 , b2 , . . .) = min{n ∈ N : bn = 1}
The number of heads in the first 10 flips is given by the random variable
10
y(ω ) = y(b1 , b2 , . . .) = ∑ bn
n =1
As per the definition, x and y are well-defined functions from Ω into R.4
3 I’m skipping some technical details again. If Ω is a continuum, then, when identifying random
variables with the class of all functions from Ω to R, we typically exclude some particularly compli-
cated functions. The remaining “nice” functions are called the measurable functions. These are our
random variables. In this course we will never meet the nasty functions, and there’s no need to go
into further details. Those who want to know more should consult any text on measure theory (e.g.,
Williams, 1991).
4 Is this true? What if ω = ω is an infinite sequence containing only zeros? Then { n ∈ N : b =
0 n
1} = ∅. So what is x (ω0 ). The convention here is to set min ∅ = ∞. This is reasonable, but now x
is not a map into R, because it can take the value ∞. However, in most applications this event has
probability zero, and hence we can ignore it. For example, we can set x (ω0 ) = 0 without changing
anything significant. Now we’re back to a well-defined function from Ω to R.
CHAPTER 2. PROBABILITY 55
Before going further, let’s discuss a common notational convention with random
variables that we’ve adopted above and that will be used below. With a random
variable x, you will often see notation such as
We’ll follow this convention, but you should translated it backwards and forwards
in your mind to start off with. To give an example, consider the claim that, for any
random variable x,
P{ x ≤ a} ≤ P{ x ≤ b} whenever a ≤ b (2.4)
This is intuitively obvious. The mathematical argument goes as follows: Pick any
a, b ∈ R with a ≤ b. We have
{ x ≤ a} := {ω ∈ Ω : x (ω ) ≤ a} ⊂ {ω ∈ Ω : x (ω ) ≤ b} =: { x ≤ b}
The set of events and probability were defined as F := all subsets of Ω and P( A) :=
2− N (#A). Consider a random variable x on Ω that returns the first element of any
given sequence. That is,
x (ω ) = x (b1 , . . . , b N ) = b1
The probability that x = 1 is 1/2. Indeed,
Aside from trivial random variables that take one value no matter what happens
(and hence provide no information), the simplest kind of random variables are bi-
nary or Bernoulli random variables. A binary random variable is a random variable
that takes values in {0, 1}. While binary random variables are by themselves rather
limited, in fact they play a role analogous to basis vectors in R N . In particular, a
great variety of random variables can be constructed as linear combinations of bi-
nary random variables. Things like expectation of a random variable can then be
defined in terms of these primitive components. This section explores these ideas.
There is a generic way to create binary random variables, using indicator functions.
If Q is a statement, such as “on the planet Uranus, there exists a tribe of three-headed
monkeys,” then 1{ Q} is considered as equal to 1 when the statement Q is true, and
zero when the statement Q is false. Hence, 1{ Q} is a binary indicator of the truth of
the statement Q. Another common variation on the notation is, for arbitrary C ⊂ Ω,
(
1 if ω ∈ C
1C ( ω ) : = : 1 { ω ∈ C } : =
0 otherwise
Note that 1C is a binary random variable. In fact, when you think about it, any
binary random variable has the form
x ( ω ) = 1C ( ω ) : = : 1 { ω ∈ C } (2.5)
x ( ω ) = s1 A ( ω ) + t1 B ( ω ) (2.6)
is a finite random variable taking the value s when ω falls in A, t when ω falls in B,
and zero otherwise. Figure 2.1 shows a graph of x when Ω = R.
In fact any finite random variable can be expressed as the linear combinations of
binary random variables. Thus we take as our generic expression for a finite random
CHAPTER 2. PROBABILITY 57
A B Ω
1. x (ω ) = s j if and only if ω ∈ A j .
2. { x = s j } = A j .
3. P{ x = s j } = P( A j ).
Convince yourself of these results before continuing. (The second two statements
follow from the first.)
CHAPTER 2. PROBABILITY 58
2.2.3 Expectations
Our next task is to define expectations of random variables and state their basic
properties. The idea behind expectations is that they give a kind of weighted aver-
age value, defined as the sum of all possible values of of the variable, weighted by
their probabilities. Let’s start with the finite case, where things are clearest. Letting
x be as defined in (2.7), the expectation of x is
J
E [ x ] : = ∑ s j P( A j ) (2.8)
j =1
As required, the expectation is the sum of the different values that x may take,
weighted by their probabilities.
We can already make some important observations.
Fact 2.2.2. E [1 A ] = P( A) for any A ∈ F .
Fact 2.2.2 seems trivial but in fact it represents the fundamental link between prob-
abilities and expectations. To see why it holds, observe that 1 A (ω ) = 1 × 1 A (ω ) +
0 × 1 Ac (ω ). Applying (2.8), we get E [1 A ] = 1 × P( A) + 0 × P( Ac ) = P( A).
Fact 2.2.3. If α ∈ R, then E [α] = α.
The right way to understand this is to take α to be the constant random variable
α1Ω . From (2.8), the expectation of this constant is E [α] := E [α1Ω ] = αP(Ω) = α.
How about expectation for arbitrary random variables, such as those with infinite
range? The same mode of definition doesn’t work but this presents no major prob-
lem because any random variable can be well approximated by finite random vari-
ables. A finite approximation xn to an arbitrary random variable x is shown in
figure 2.2. This approximation can be improved without limit if we allow the finite
approximation to take a larger and larger number of distinct values.
Hence we can take a sequence { xn } of finite random variables converging to any
selected arbitrary random variable x. The expectation of x is then defined as
E [ x] := lim E [ xn ] (2.9)
n→∞
CHAPTER 2. PROBABILITY 59
x
xn
To complete the definition we need to make sure that this limit makes sense. There
are a couple of issues that arise here. Suppose first that x is nonnegative (i.e., x (ω ) ≥
0 for all ω ∈ Ω). It turns out that the sequence { xn } approximating x can then
be chosen so that E [ xn ] is monotone increasing, and therefore must converge to
something. While that something can be +∞, this causes no real problem as we just
say that E [ x ] = ∞.
If x is not nonnegative then some thought will convince you that we can write it as
the sum x = x + − x − , where x + (ω ) := max{ x (ω ), 0} and x − (ω ) := − min{ x (ω ), 0}.
Both x + and x − are nonnegative random variables. The expectation can then be de-
fined as E [ x ] = E [ x + ] − E [ x − ]. The only issue is that this expression might have
the form ∞ − ∞, which is not allowed. Hence for random variables that are not
necessarily nonnegative, we usually restrict attention to integrable random vari-
ables. An integrable random variable is a random variable x such that E [| x |] < ∞.
Since x + ≤ | x | and x − ≤ | x | this is enough to ensure that both E [ x + ] < ∞ and
E [ x− ] < ∞. Hence E [ x] = E [ x+ ] − E [ x− ] is well defined.
To be careful we should also check that the value in (2.9) does not depend on the
particular approximating sequence { xn } that we choose. This and other related
technical details can be found in a text such as Williams (1991). In that reference the
following facts are also established:
Fact 2.2.4. If x and y are integrable random variables and α and β are any constants,
CHAPTER 2. PROBABILITY 60
then
E [αx + βy] = αE [ x] + βE [y]
var[ x ] := E [( x − E [ x ])2 ]
Fact 2.2.6. If the k-th moment of x exists, then so does the j-th moment for all j ≤ k.
Fact 2.2.7. If x and y are random variables with finite second moment, then
q
| E [ xy] | ≤ E [ x2 ]E [y2 ] (2.10)
Fact 2.2.8. For any nonnegative random variable x any δ > 0, we have
E [x]
P{ x ≥ δ } ≤ (2.11)
δ
Fact 2.2.7 is called the Cauchy-Schwarz inequality for random variables, while
(2.11) is called Chebyshev’s inequality. A common variation of Chebyshev’s in-
equality is the bound
E [ x2 ]
P{| x| ≥ δ} ≤ 2 (2.12)
δ
Exercise 2.8.36 asks you to check (2.11) and (2.12).
CHAPTER 2. PROBABILITY 61
In concluding this section, let us agree that to avoid repetition, we will assume that
every random variable introduced below is integrable unless stated otherwise. Also,
only random variables with finite second moment have a well defined variance, but
in what follows we will often talk about the variance of a given random variable
without adding the caveat “assuming it exists.”
2.3 Distributions
Let x be a random variable on some probability space (Ω, F , P), and consider the
function F defined by
F ( s ) : = P{ x ≤ s } : = P{ ω ∈ Ω : x ( ω ) ≤ s } ( s ∈ R) (2.13)
It can be shown that, for any choice of random variable x, this function always has
the following properties:
For example, monotonicity is immediate from (2.4) on page 55. (The other properties
are a bit trickier to prove. See, e.g, Williams, 1991.)
Any function F : R → [0, 1] satisfying conditions 1–3 is called a cumulative dis-
tribution function or cdf on R. Thus, to each random variable, we can associate
a unique cdf on R. We say that F is the cdf of x, or, alternatively, that F is the
distribution of x, and write x ∼ F.
Example 2.3.1. The function F (s) = arctan(s)/π + 1/2 is a cdf—one variant of the
Cauchy distribution. A plot is given in figure 2.3.
CHAPTER 2. PROBABILITY 62
1.0
0.8
0.6
0.4
0.2
0.0
10 5 0 5 10
It’s worth nothing that the following is also true: For every cdf F, there exists a prob-
ability space (Ω, F , P) and a random variable x : Ω → R such that the distribution
of x is F. Exercise 2.8.16 gives some hints on how one construction works.
Fact 2.3.1. If x ∼ F, then P{ a < x ≤ b} = F (b) − F ( a) for any a ≤ b.
One often needs to obtain the cdf of the transform of a random variable. This is easy
in the monotone case. For example, if x ∼ F and y = exp( x ), then the cdf of y is
G (s) := F (ln(s)), because
Every random variable has a well-defined cdf via (2.13). However, as representa-
tions of distributions, cdfs have some disadvantages. For example, and plotting cdfs
is a poor way to convey information about probabilities. The amount of probability
mass in different regions is determined by the slope of the cdf. Research shows that
humans are poor at extracting quantitative information from slopes. They do much
better with heights, which leads us into our discussion of densities and probability
mass functions (pmfs). Densities and pmfs correspond to two different, mutually
exclusive cases. The density case arises when the increase of the cdf in question is
smooth, and contains no jumps. The pmf case arises when the increase consists of
jumps alone.
Let’s have a look at these two situations, starting with the second. The pure jump
case occurs when the cdf represents a discrete random variable. To understand this,
suppose that x takes values s1 , . . . , s J . Let p j := P{ x = s j }. We then have 0 ≤ p j ≤ 1
for each j, and ∑ jJ=1 p j = 1 (exercise 2.8.15). A finite collection of numbers p1 , . . . , p J
such that 0 ≤ p j ≤ 1 and p1 + · · · + p J = 1 is called a probability mass function
(pmf). The cdf corresponding to this random variable is
J
F (s) := ∑ 1{ s j ≤ s } p j (2.14)
j =1
Visually, F is a step function, with a jump up of size p j at point s j . Figure 2.4 gives
an example with J = 2. The cdf is right continuous but not continuous.
The other case of interest is the density case. A density is a nonnegative function
p that integrates to 1. For example, suppose that F is a smooth cdf, so that the
derivative F 0 exists. Let p := F 0 . By the fundamental theorem of calculus, we then
have Z s Z s
p(t)dt = F 0 (t)dt = F (s) − F (r )
r r
From this equality and the properties of cdfs, we can see that p is nonnegative and
R +∞
−∞ p ( s ) ds = 1. In other words, p is a density. Also, taking the limit as r → − ∞ we
CHAPTER 2. PROBABILITY 64
p2
p1
s1 s2
obtain Z s
F (s) = p(t)dt
−∞
which tells us that F can be recovered from p.
Not every random variable has a density. The exact necessary and sufficient condi-
tion for a density to exist is that F is “absolutely continuous.” This is a smoothness
condition, an important special case of which is differentiability.6 If F is absolutely
continuous and p : R → [0, ∞) satisfies
Z s
F (s) = p(t)dt for all s ∈ R
−∞
As discussed above, cdfs are useful because every random variable has one, but
pmfs and densities are nicer to work with, and visually more informative. For ex-
ample, consider figure 2.5, which shows the density corresponding to the Cauchy
cdf in figure 2.3. Information about probability mass is now conveyed by height
rather than slope, which is easier for us humans to digest.
6 Inelementary texts, random variables with densities are often called “continuous random vari-
ables.” This terminology is confusing because “continuous” here has nothing to do with the usual
definition of continuity of functions.
CHAPTER 2. PROBABILITY 65
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
10 5 0 5 10
Let F be any cdf on R. Suppose that F is strictly increasing, so that the inverse
function F −1 exists:
The inverse of the cdf is called the quantile function, and has many applications
in probability and statistics. For example, the quantile function associated with the
Cauchy cdf in example 2.3.1 is F −1 (q) = tan[π (q − 1/2)]. See figure 2.6.
Things are a bit more complicated when F is not strictly increasing, as the inverse
F −1 is not well defined. (If F is not strictly increasing, then there exists at least two
distinct points s and s0 such that F (s) = F (s0 ).) This problem is negotiated by setting
This expression is a bit more complicated, but in the case where F is strictly increas-
ing, it reduces to (2.15).
The value F −1 (1/2) is called the median of F. It gives an alternative measure of
central tendency (alternative to the mean).
The quantile function features in hypothesis testing, where it can be used to define
critical values. An abstract version of the problem is as follows: Let x ∼ F, where F is
strictly increasing, differentiable (so that a density exists and x puts no probability
mass on any one point) and symmetric. Given α ∈ (0, 1), we want to find the c
CHAPTER 2. PROBABILITY 66
15
10
10
15
0.0 0.2 0.4 0.6 0.8 1.0
Until now, we’ve been calculating expectations using the expectation operator E ,
which was defined from a given probability P in §2.2.3. One of the most useful
facts about distributions is that they encode all the information necessary to calcu-
late E [ x ]. For a full treatment of this topic you can consult a text such as Williams
(1991). Here we’ll stick to noting down the main facts. In all of what follows, h is an
arbitrary function from R to R.
CHAPTER 2. PROBABILITY 67
0.30
0.25
0.20
0.15
0.10
0.05
0.00
−c 0 c
It’s convenient to have a piece of notation that captures both of these cases. As a
result, if x ∼ F, then we will write
Z
E [h( x)] = h(s) F (ds)
The way you should understand this expression is that when F is differentiable with
R∞
derivative p = F 0 , then h(s) F (ds) is defined as −∞ h(s) p(s)ds. If, on the other
R
hand, F is the step function F (s) = ∑ jJ=1 1{s j ≤ s} p j corresponding to the discrete
random variable in fact 2.3.4, then h(s) F (ds) is defined as ∑ jJ=1 h(s j ) p j .
R
CHAPTER 2. PROBABILITY 68
Just for the record, for a given cdf F, the expression h(s) F (ds) has its own precise
R
(Pick an arbitrary A j and check that the left- and right-hand sides are equal when
ω ∈ A j .) This is a discrete random variable, which we can take the expectation of
using (2.8) on page 58. We get
J J
E [h( x)] = ∑ h(s j )P( A j ) = ∑ h(s j ) p j
j =1 j =1
Let’s list a few well-known distributions. First, given a < b, the uniform distribu-
tion on interval [ a, b] is the distribution associated with the density
1
p(s; a, b) := ( a ≤ s ≤ b)
b−a
(If s < a or s > b, then p(s; a, b) := 0.) The mean is
Z b
a+b
s p(s; a, b)ds =
a 2
1
2 −1/2 2 −2
p(s) := p(s; µ, σ) := (2πσ ) exp − (s − µ) σ
2
In other words, linear combinations of normals are normal. This is a fact, telling us
that linear models and normal distributions play very well together.7
The chi-squared distribution with k degrees of freedom is the distribution with
density
1
p(s; k ) := k/2 sk/2−1 e−s/2 ( s ≥ 0)
2 Γ(k/2)
where Γ is the Gamma function (details omitted). If x has a distribution described
by this density, then we write x ∼ χ2 (k ).
Student’s t-distribution with k degrees of freedom, or, more simply, the t-distribution
with k degrees of freedom, is the distribution on R with density
−(k+1)/2
Γ( k+2 1 ) s2
p(s; k ) := 1+
(kπ )1/2 Γ( 2k ) k
The F-distribution with parameters k1 , k2 is the distribution with the unlikely look-
ing density
q
(k1 s)k1 kk22 /[k1 s + kk21 +k2 ]
p(s; k1 , k2 ) := ( s ≥ 0)
sB(k1 /2, k2 /2)
where B is the Beta function (details omitted). The F-distribution arises in certain
hypothesis tests, some of which we will examine later.
[roadmap]
7 Weshould be a bit careful here—what if αn = 0 for all n? To save ourselves from embarrassment
we can declare a random variable concentrated on a point to be a normal random variable with “zero
variance”.
CHAPTER 2. PROBABILITY 70
This distribution tells us about the random properties of xn viewed as a single entity.
But we often want to know about the relationships between the variables x1 , . . . , x N ,
and outcomes for the group of variables as a whole. To quantify these things, we
define the joint distribution of x1 , . . . , x N to be
for all tn ∈ R, n = 1, . . . , N.
Typically, the joint distribution cannot be determined from the N marginal distri-
butions alone, since the marginals do not tell us about the interactions between the
different variables. Once special case where we can tell the joint from the marginals
is when there is no interaction. This is called independence, and we treat it in the
next section.
From joint densities we can construct conditional densities. The conditional density
of xk+1 , . . . , x N given x1 = s1 , . . . , xk = sk is defined by
p ( s1 , . . . , s N )
p ( s k +1 , . . . , s N | s 1 , . . . , s k ) : = (2.22)
p ( s1 , . . . , s k )
p ( s 1 , . . . , s N ) = p ( s k +1 , . . . , s N | s 1 , . . . , s k ) p ( s 1 , . . . , s k ) (2.23)
2.4.2 Independence
P{ x1 ≤ s1 , . . . , x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N } (2.24)
We use the abbreviation IID for collections of random variables that are both inde-
pendent and identically distributed.
Example 2.4.1. Consider a monkey throwing darts at a dartboard. Let x denote the
horizontal location of the dart relative to the center of the board, and let y denote
the vertical location. (For example, if x = −1 and y = 3, then the dart is 1cm to the
left of the center, and 3cm above.) At first pass, we might suppose that x and y are
independent and identically distributed.
We won’t prove the last fact in the general case, as this involves measure theory.
However, we can illustrate the idea by showing that E [ xy] = E [ x ]E [y] when x
and y are independent and defined by (2.40). In this case, it can be shown (details
omitted) that the random variables x and y are independent precisely when the
events A and B are independent. Now observe that
Fact 2.4.3. If random variables x1 , . . . , x N are independent, and each has density pn ,
then the joint density p exists, and is the product of the marginal densities:
N
p ( s1 , . . . , s N ) = ∏ pn (sn )
n =1
Here are some useful facts relating independence and certain common distributions.
IID
Fact 2.4.4. If x1 , . . . , xk ∼ N (0, 1), then
k
Q := ∑ xi2 ∼ χ2 (k)
i =1
1. Z ∼ N (0, 1),
2. Q ∼ χ2 (k), and
Q1 /k1
Q2 /k2
is distributed as F (k1 , k2 ).
2.4.3 Covariance
In particular, if α and β are real numbers and x and y are random variables, then
var[α] = 0,8 var[α + βx ] = β2 var[ x ], and
Given two random variables x and y with finite variances σx2 and σy2 respectively, the
correlation of x and y is defined as
cov[ x, y]
corr[ x, y] :=
σx σy
If corr[ x, y] = 0, we say that x and y are uncorrelated. For this to occur, it is nec-
essary and sufficient that cov[ x, y] = 0. Positive correlation means that corr[ x, y] is
positive, while negative correlation means that corr[ x, y] is negative. The first part
of the next fact follows immediately from fact 2.2.7, while the second is just algebra.
Fact 2.4.9. Given any two random variables x, y and positive constants α, β, we have
Note that the converse is not true: One can construct examples of dependent ran-
dom variables with zero covariance.
Let’s consider the problem of predicting the value of a random variable y given
knowledge of the value of a second random variable x (and also knowledge of
the underlying probability distributions, which makes this a problem in probability
8 Here var[α] should be understood as var[α1{ω ∈ Ω}], as was the case when we discussed
fact 2.2.3 on page 58.
CHAPTER 2. PROBABILITY 74
rather than statistics). Thus, we seek a function f such that f ( x ) is close to y on av-
erage. To measure the “average distance” between f ( x ) and y, we will use the mean
squared deviation between f ( x ) and y, which is
E [(y − f ( x))2 ]
As we will learn in chapter 3, the minimizer of the mean squared deviation over all
functions of x is obtained by choosing f ( x ) = E [y | x ], where the right-hand size is
the conditional expectation of y given x. However, the conditional expectation may
be nonlinear and complicated, so let’s now consider the simpler problem of finding
a good predictor of y within a small and well-behaved class of functions. The class
of functions we will consider is the set of “linear” functions
(While elementary courses refer to these functions as linear, in fact they are not linear
unless α = 0 (see §1.2.1). The class of functions L is more correctly known as the set
of affine functions.) Thus, we consider the problem
∂ψ(α, β) ∂ψ(α, β)
=0 and =0
∂α ∂β
We obtain (exercise 2.8.29) the minimizers
cov[ x, y]
β∗ := and α∗ := E [y] − β∗ E [ x ] (2.26)
var[ x ]
The best linear predictor is therefore
`∗ ( x ) := α∗ + β∗ x
If you’ve studied elementary linear least squares regression before, you will realize
that α∗ and β∗ are the “population” counterparts for the coefficient estimates in the
regression setting. We’ll talk more about the connections below.
CHAPTER 2. PROBABILITY 75
2.5 Asymptotics
[Roadmap]
Let { xn }∞ ∞
n=1 be a sequence of random variables. We say that { xn }n=1 converges to
random variable x in probability if
A full proof of the convergence result in example 2.5.1 can be found by looking at the
normal density and bounding tail probabilities. However, a much simpler proof can
also be obtained by exploiting the connection between convergence in probability
and convergence in mean squared error. The details are below.
Fact 2.5.1. Regarding convergence in probability, the following statements are true:
p p
1. If g : R → R is continuous at x and xn → x, then g( xn ) → g( x ).
p p p p
2. If xn → x and yn → y, then xn + yn → x + y and xn yn → xy.
p p p
3. If xn → x and αn → α, then xn + αn → x + α and xn αn → xα.9
E [( xn − x)2 ] → 0 as n → ∞ (2.27)
ms
In symbols, this convergence is represented by xn → x.
9 Here {αn } is a nonrandom scalar sequence.
CHAPTER 2. PROBABILITY 76
1.0
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
α−δ α α+δ α−δ α α+δ
(a) n = 10 (b) n = 20
Fact 2.5.2. Regarding convergence in mean square, the following statements are
true:
ms p
1. If xn → x, then xn → x.
ms
2. If α is constant, then xn → α if and only if E [ xn ] → α and var[ xn ] → 0.
Part 1 of fact 2.5.2 follows from Chebyshev’s inequality (page 60). Using monotonic-
ity of P and then applying (2.12) to xn − x, we obtain
E [( xn − x)2 ]
P{| xn − x| > δ} ≤ P{| xn − x| ≥ δ} ≤
δ2
Part 1 of fact 2.5.2 follows. Part 2 is implied by the equality
1.0 t with 1 df
t with 2 dfs
0.8 t with 10 dfs
t with 100 dfs
0.6 standard normal
0.4
0.2
0.0
10 5 0 5 10
Let { Fn }∞
n=1 be a sequence of cdfs, and let F be a cdf. We say that Fn converges
weakly to F if, for any s such that F is continuous at s, we have
Example 2.5.3. It is well-known that the cdf of the t-distribution with k degrees
of freedom converges to the standard normal cdf as k → ∞. This convergence is
illustrated in figure 2.9.
Sometimes densities are easier to work with than cdfs. In this connection, note that
pointwise convergence of densities implies weak convergence of the corresponding
distribution functions:
Let { xn }∞
n=1 and x be random variables, where xn ∼ Fn and x ∼ F. We say that
xn converges in distribution to x if Fn converges weakly to F. In symbols, this
d
convergence is represented by xn → x.
Fact 2.5.4. Regarding convergence in distribution, the following statements are true:
d d
1. If g : R → R is continuous at x and xn → x, then g( xn ) → g( x ).
CHAPTER 2. PROBABILITY 78
p d
2. If xn → x, then xn → x.
d p
3. If α is constant and xn → α, then xn → α.
d
Indeed, by Slutsky’s theorem (fact 2.5.5) we have xn yn → 0. Since the limit is con-
stant, 2.5.4 then tells us that convergence is in probability as well.
Two of the most important theorems in both probability and statistics are the law
of large numbers and the central limit theorem. In their simplest forms, these theo-
rems deal with averages of independent and identically distributed (IID) sequences.
The law of large numbers tells us that these averages converge in probability to the
mean of the distribution in question. The central limit theorem tells us that a simple
transform of the average converges to a normal distribution.
Let’s start with the law of large numbers, which relates to the sample mean
N
1
x̄ N :=
N ∑ xn
n =1
of a given sample x1 , . . . , x N
Theorem 2.5.1. Let { xn } be an IID sequence of random variables with common distribution
R
F. If the first moment |s| F (ds) is finite, then
Z
p
x̄ N → E [ xn ] = sF (ds) as N→∞ (2.28)
CHAPTER 2. PROBABILITY 79
To prove theorem 2.5.1, we can use fact 2.5.2 on page 76. In view of this fact, it
suffices to show that E [ x̄ N ] → sF (ds) and var[ x̄ N ] → 0 as N → ∞. These steps
R
are left as an exercise (exercise 2.8.39). When you do the exercise, note to yourself
exactly where independence bites.10
Example 2.5.4. To illustrate the law of large numbers, consider flipping a coin until
10 heads have occurred. The coin is not fair: The probability of heads is 0.4. Let x
be the number of tails observed in the process. It is known that such an x has the
negative binomial distribution, and, with a little bit of googling, we find that the
mean E [ x ] is 15. This means that if we simulate many observations of x and take
the sample mean, we should get a value close to 15. Code to do this is provided in
the next listing. Can you see how this program works?11
import numpy as np
from random import uniform
num_repetitions = 10000
outcomes = np.empty(num_repetitions)
for i in range(num_repetitions):
num_tails = 0
num_heads = 0
while num_heads < 10:
b = uniform(0, 1)
num_heads = num_heads + (b < 0.4)
num_tails = num_tails + (b >= 0.4)
outcomes[i] = num_tails
print(outcomes.mean())
At first glance, the law of large numbers (2.28) appears to only be a statement about
the sample mean, but actually it can be applied to functions of the random variable
10 The proof involves a bit of cheating, because it assumes that the variance of each xn is finite. This
second moment assumption is not necessary for the result, but it helps to simplify the proof.
11 Hint: If u is uniform on [0, 1] and q ∈ [0, 1], then P{ u ≤ q } = q. This fact is used to simulate the
coin flips. Also recall that the logical values TRUE and FALSE are treated as 1 and 0 respectively in
algebraic expressions.
CHAPTER 2. PROBABILITY 80
N
1
Z
p
N ∑ h(xn ) → E [h(xn )] = h(s) F (ds) (2.29)
n =1
The left hand side is the fraction of the sample that falls in the set B, and (2.30) tells
us that this fraction converges to the probability that xn ∈ B.
The central limit theorem is another classical result from probability theory. It is
arguably one of the most beautiful and important results in all of mathematics. Rel-
ative to the LLN, it requires an additional second moment condition.
Theorem 2.5.2. Assume the conditions of theorem 2.5.1. If, in addition, the second moment
R 2
s F (ds) is finite, then
√ d
N ( x̄ N − µ) → y ∼ N (0, σ2 ) as N→∞ (2.31)
Another common statement of the central limit theorem is as follows: If all the con-
ditions of theorem 2.5.2 are satisfied, then
√ x̄ N − µ d
z N := N → z ∼ N (0, 1) as N → ∞ (2.32)
σ
CHAPTER 2. PROBABILITY 81
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
4 3 2 1 0 1 2 3 4 5
Exercise 2.8.40 asks you to confirm this via theorem 2.5.2 and fact 2.5.4.
The central limit theorem tells us about the distribution of the sample mean when
N is large. Arguing informally, for N large we have
√
N ( x̄ N − µ) ≈ y ∼ N (0, σ2 )
σ2
y
∴ x̄ N ≈ √ + µ ∼ N µ,
N N
Here ≈ means that the distributions are approximately equal. We see that x̄ N is
approximately normal, with mean equal to µ := E [ x1 ] and variance converging to
zero at a rate proportional to 1/N.
The convergence in (2.32) is illustrated by listing 2, the output of which is given
in figure 2.10. The listing generates 5,000 observations of the random variable z N
defined in (2.32), where each xn is χ2 (5). (The mean of this distribution is 5, and the
variance is 2 × 5 = 10.) The observations of z N are stored in the vector outcomes,
and then histogrammed. At the end of the listing we superimpose the density of the
standard normal distribution over the histogram. As predicted, the fit is relatively
good.
Before finishing this section, we briefly note the following asymptotic result, which
is frequently used in conjunction with the central limit theorem:
CHAPTER 2. PROBABILITY 82
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
num_replications = 5000
outcomes = np.empty(num_replications)
N = 1000
k = 5 # Degrees of freedom
chi = st.chi2(k)
for i in range(num_replications):
xvec = chi.rvs(N)
outcomes[i] = np.sqrt(N / (2 * k)) * (xvec.mean() - k)
Theorem 2.5.3. Let {tn } be a sequence of random numbers and let θ be a constant. Sup-
√ d
pose that n(tn − θ ) → N (0, σ2 ) for some σ > 0. Suppose further that g : R → R is
differentiable at θ and g0 (θ ) 6= 0. Under these conditions we have
√ d
n{ g(tn ) − g(θ )} → N (0, g0 (θ )2 σ2 ) as n→∞ (2.33)
The technique illustrated in theorem 2.5.3 is referred to as the delta method. The
delta method is extremely useful, particularly when one seeks the asymptotic distri-
bution of certain kinds estimators. We will see its importance in some applications
later on. The proof of theorem 2.5.3 is based on a Taylor expansion of g around
the point θ, and can be found in almost any text on mathematical statistics. Exer-
cise 2.8.43 walks you throught the most important ideas.
Instead of giving a full proof here, we will cover some parts of the proof of the
following corollary: If the conditions of theorem 2.5.2 are satisfied and g : R → R is
CHAPTER 2. PROBABILITY 83
F ( s ) : = : F ( s1 , . . . , s K ) : = P{ x1 ≤ s1 , . . . , x K ≤ s K } : = : P{ x ≤ s } (2.35)
for each s in RK . (Here and in what follows, the statement x ≤ s means that xn ≤ sn
for n = 1, . . . , K.)
Just as some but not all distributions on R have a density representation (see §2.3.2),
some but not all distributions on RK can be represented by a density. We say that
f : RK → R is the density of random vector x := ( x1 , . . . , xK ) if
Z
f (s) ds = P{x ∈ B} (2.36)
B
for every subset B of RK .12 Most of the distributions we work with in this course
have density representations.
For random vectors, the definition of independence mirrors the scalar case. In par-
ticular, a collection of random vectors x1 , . . . , x N is called independent if, given any
s1 , . . . , s N , we have
P{ x1 ≤ s1 , . . . , x N ≤ s N } = P{ x1 ≤ s1 } × · · · × P{ x N ≤ s N }
We note the following multivariate version of fact 2.4.2:
Fact 2.6.1. If x and y are independent and g and f are any functions, then f (x) and
g(y) are also independent.
Fact 2.6.3. For any random vector x, the variance-covariance matrix var[x] is square,
symmetric and nonnegative definite.
Fact 2.6.4. For any random vector x, any constant conformable matrix A and any
constant conformable vector a, we have
Here, the fact that a + Ax has mean a + Aµ and variance-covariance matrix AΣA0 is
not surprising. What is important is that normality is preserved .
Fact 2.6.7. Normally distributed random variables are independent if and only if
they are uncorrelated. In particular, if both x and y are normally distributed and
cov[ x, y] = 0, then x and y are independent.
Fact 2.6.8. If x ∼ N (µ, Σ), then (x − µ)0 Σ−1 (x − µ) ∼ χ2 (k ), where k := length of x.
Fact 2.6.9. If x ∼ N (0, I) and A is a conformable idempotent and symmetric matrix
with rank(A) = j, then x0 Ax ∼ χ2 ( j). (In view of fact 1.4.7, when using this result it
is sufficient to show that trace(A) = j.)
13 If a = 0 then we can interpret a0 x as a “normal” random variable with zero variance.
CHAPTER 2. PROBABILITY 86
Fact 2.6.10. Let {Xn } be a sequence of random N × K matrices, let {xn } is a sequence
of random vectors in RK and let X and x be, respectively, a random matrix and vector
of the same dimensions. Then,
p p
1. xn → x if and only if kxn − xk → 0.
p p
2. Xn → X if and only if kXn − Xk → 0.
Here the first norm is the ordinary Euclidean norm and the second is the matrix
norm of §1.4.5.
Now let’s extend the notion of convergence in distribution to random vectors. The
definition is almost identical to the scalar case, with only the obvious modifications.
n=1 be a sequence of cdfs on R , and let F be a cdf on R . We say that Fn
Let { Fn }∞ K K
Fn (s) → F (s) as n → ∞
xKn xK
Put differently, convergence of the marginals does not necessarily imply conver-
gence of the joint distribution. (As you might have guessed, one setting where con-
vergence of the marginals implies convergence of the joint is when the elements of
the vectors are independent, and the joint is just the product of the marginals.)
The fact that elementwise convergence in distribution does not necessarily imply
convergence of the vectors is problematic, because vector convergence is harder
to work with than scalar convergence. Fortunately, we have the following results,
which provide a link from scalar to vector convergence:
d d
1. If a0 xn → a0 x for any a ∈ RK , then xn → x.
p p
2. If a0 xn → a0 x for any a ∈ RK , then xn → x.
The second of these results is quite straightforward to prove (exercise 2.8.45). The
first is more difficult (the standard argument uses characteristic functions). It is often
referred as the Cramer-Wold device.
We noted a variety of useful results pertaining to convergence in probability and
distribution for sums and products in the scalar case. Most of these carry over to the
vector case essentially unchanged. For example,
p 1 −1 p
1. If Xn → X and Xn and X are nonsingular, then X−
n → X .
p p
2. If Xn → X and Yn → Y, then
p p p
Xn + Yn → X + Y, Xn Yn → XY and Yn Xn → YX
CHAPTER 2. PROBABILITY 88
p
3. If Xn → X and An → A, then
p p p
Xn + An → X + A, Xn An → XA and An Xn → AX
In part 3 of fact 2.6.12, the matrices An and A are nonrandom. The convergence
An → A means that each element of An converges in the usual scalar sense to the
corresponding element of A:
Alternatively, we can stack the matrices into vectors and take the norms, as dis-
cussed above. Then we say that An → A if kAn − Ak → 0. The two definitions can
be shown to be equivalent.
Example 2.6.1. To see how fact 2.6.12 can be used, let’s establish convergence of the
quadratic form
p p
a0 Xn a → a0 Xa whenever a is a conformable constant vector and Xn → X (2.37)
This follows from two applications of fact 2.6.12. Applying fact 2.6.12 once we get
p
a0 Xn → a0 X. Applying it a second time yields the convergence in (2.37).
Fact 2.6.13 (Continuous mapping theorem, vector case). Let xn and x be random
vectors in RK , and let g : RK → R J be continuous at x. In this setting,
d d
1. If xn → x, then g(xn ) → g(x).
p p
2. If xn → x, then g(xn ) → g(x).
Another result used routinely in econometric theory is the vector version of Slut-
sky’s theorem:
Fact 2.6.14 (Slutsky’s theorem, vector case). Let xn and x be random vectors in RK ,
p d
let Yn be random matrices, and let C be a constant matrix. If Yn → C and xn → x,
then
d d
Yn xn → Cx and Yn + xn → C + x
whenever the matrices are conformable.
CHAPTER 2. PROBABILITY 89
The delta method from theorem 2.5.3 on page 81 extends to random vectors. For
example, let g : RK → R be differentiable at a vector θ ∈ RK , in the sense that the
gradiant vector
∂g(θ)
∂θ1
.
∇ g(θ) := .
.
∂g(θ)
∂θK
is well defined (i.e., the limit defining each of the partial derivatives exists). In this
context,
√ d
Fact 2.6.15. If {tn } is a sequence of random vectors in RK with n(tn − θ) →
N (0, Σ) for some θ ∈ RK and positive definite K × K matrix Σ, then
√ d
n{ g(tn ) − g(θ)} → N (0, ∇ g(θ)0 Σ∇ g(θ)) as n→∞ (2.38)
whenever ∇ g(θ)0 Σ∇ g(θ) is positive. This last assumption will be satisfied if, for
example, at least one of the partial derivatives in ∇ g(θ) is nonzero (why?).
With the above definitions of convergence in hand, we can move on to the next topic:
Vector LLN and CLT. The scalar LLN and CLT that we discussed in §2.5 extend to
the vector case in a natural way. For example, we have the following result:
Theorem 2.6.1. Let {xn } be an IID sequence of random vectors in RK with E [kxn k2 ] < ∞.
Let µ := E [xn ] and let Σ := var[xn ]. For this sequence we have
1 N
p √
∑ xn → µ
d
x̄ N := and N (x̄ N − µ) → N (0, Σ) (2.39)
N n =1
Figure 2.11 illustrates the LLN in two dimensions. The green dot is the point 0 =
(0, 0) in R2 . The black dots are IID observations x1 , . . . , x N of a random vector with
mean µ = 0. The red dot is the sample mean N1 ∑nN=1 xn . (Remember that we are
working with vectors here, so the summation and scalar multiplication in the sam-
ple mean x̄ N is done elementwise—in this case for two elements. In particular, the
sample mean is a linear combination of the observations x1 , . . . , x N .) By the vector
LLN, the red dot converges to the green dot.
The vector LLN in theorem 2.6.1 follows from the scalar LLN. To see this, let xn be as
in theorem 2.6.1, let a be any constant vector in RK and consider the scalar sequence
CHAPTER 2. PROBABILITY 90
{yn } defined by yn = a0 xn . The sequence {yn } inherets the IID property from {xn }.14
By the scalar LLN (theorem 2.5.1) we have
N
1 p
N ∑ yn → E [yn ] = E [a0 xn ] = a0 E [xn ] = a0 µ
n =1
But " #
N N N
1 1 1
N ∑ yn = N ∑ a0 xn = a0 N ∑ xn = a0 x̄ N
n =1 n =1 n =1
To be written
14 Functions of independent random variables are themselves independent (fact 2.4.2, page 71).
CHAPTER 2. PROBABILITY 91
2.8 Exercises
Ex. 2.8.3. Given sample space Ω := {1, 2, 3}, let A := {1}, B := {2} and C := {3}.
Let P( A) = P( B) = 1/3. Compute P(C ), P( A ∪ B), P( A ∩ B), P( Ac ), P( Ac ∪ Bc )
and P( A | B). Are A and C independent?
Ex. 2.8.4. A dice is designed so that the probability of getting face m is qm, where
m ∈ {1, . . . , 6} and q is a constant. Compute q.
Ex. 2.8.5. Let Ω be a nonempty finite set, and let ω0 be a fixed element of Ω. For
each A ⊂ Ω, define P( A) := 1{ω0 ∈ A}. Is P a probability on Ω? Why or why not?
Ex. 2.8.6. Let Ω be any sample space, and let P be a probability on the subsets F . Let
A ∈ F . Show that if P( A) = 0 or P( A) = 1, then A is independent of every other
event in F . Show that if A is independent of itself, then either P( A) = 0 or P( A) =
1. Show that if A and B are independent, then Ac and Bc are also independent.
Ex. 2.8.7. Let P and Ω be defined as in example 2.1.4. Show that P is additive, in
the sense that if A and B are disjoint events, then P( A ∪ B) = P( A) + P( B).
Ex. 2.8.8. Let P and Ω be defined as in example 2.1.4. Let A be the event that the
first switch is on, and let B be the event that the second switch is on. Show that A
and B are independent under P.
Ex. 2.8.9. Show that when Ω is finite, a random variable x on Ω can only take on a
finite set of values (i.e., has finite range).16
Ex. 2.8.10. Fact 2.2.4 on page 59 states that, among other things, if x is a random
variable, then E [αx ] = αE [ x ]. Show this for the case where x is a finite random
variable.
15 Hint: Sketching the Venn diagram, convince yourself that A = [( A ∪ B) \ B] ∪ ( A ∩ B). Finish
the proof using the definition of a probability and fact 2.1.1 (page 49).
16 Hint: Have a look at the definition of a function in §4.2.
CHAPTER 2. PROBABILITY 92
Ex. 2.8.11. Fact 2.2.4 on page 59 states that, among other things, if x and y are two
random variables, then E [ x + y] = E [ x ] + E [y]. Instead of giving the full proof, try
to show this for the binary random variables
Note that even this is a little fiddly—see the solution if you get lost.
Ex. 2.8.12. Recall F defined in (2.13). We claimed that lims→∞ F (s) = 1. Verify this
when x is the finite-valued random variable in (2.7).
Ex. 2.8.13. Recall F defined in (2.13). Suppose that x is the finite-valued random
variable in (2.7). Show that lims→−∞ Fx (s) = 0. If you can, show that F is right-
continuous.
Ex. 2.8.15. Let x be a discrete random variable taking values s1 , . . . , s J , and let p j :=
P{ x = s j }. Show that 0 ≤ p j ≤ 1 for each j, and ∑ jJ=1 p j = 1.
Ex. 2.8.16. This exercise describes the inverse transform method for generating
random variables with arbitrary distribution from uniform random variables. The
uniform cdf on [0, 1] is given by F (s) = 0 if s < 0, F (s) = s if 0 ≤ s ≤ 1, and F (s) = 1
if s > 1. Let G be another cdf on R. Suppose that G is strictly increasing, and let
G −1 be the inverse (quantile). Show that if u ∼ F, then G −1 (u) ∼ G.
Ex. 2.8.17. Let x ∼ F where F is the uniform cdf on [0, 1]. Give an expression for the
cdf G of the random variable y = x2 .
Ex. 2.8.19. Let y ∼ F, where F is a cdf. Show that F (s) = E [1{y ≤ s}] for any s.
Ex. 2.8.20. Confirm monotonicity of expectations (fact 2.2.5 on page 60) for the spe-
cial case where x and y are the random variables in (2.40).
Ex. 2.8.21. Prove fact 2.2.6 (existence of k-th moment implies existence of j-th mo-
ment for all j ≤ k).
CHAPTER 2. PROBABILITY 93
Ex. 2.8.22. Confirm the expression for variance of linear combinations in fact 2.4.8.
Ex. 2.8.23. Let x and y be scalar random variables. With reference to fact 2.4.9 on
page 73, is it true that corr[αx, βy] = corr[ x, y] for any constant scalars α and β? Why
or why not?
Ex. 2.8.24. Confirm the claim in fact 2.4.10: If x and y are independent, then cov[ x, y] =
corr[ x, y] = 0.
Ex. 2.8.25. Let x1 and x2 be random variables with densities p1 and p2 . Let q be their
joint density. Show that x1 and x2 are independent whenever q(s, s0 ) = p1 (s) p2 (s0 )
for every s, s0 ∈ R.
Ex. 2.8.26. Fact 2.4.2 tells us that if x and y are independent random variables and g
and f are any two functions, then f ( x ) and g(y) are independent. Prove this for the
case where f ( x ) = 2x and g(y) = 3y − 1.
Ex. 2.8.27. Let x and y be independent uniform random variables on [0, 1]. Let
z := max{ x, y}. Compute the cdf, density and mean of z.17 In addition, compute
the cdf of w := min{ x, y}.
Ex. 2.8.28. A discrete random variable x taking values in the set N0 := {0, 1, 2, . . .}
has pmf { p j } = { p0 , p1 , . . .}, if P{ x = j} = p j for all j ∈ N0 . (Here { p j } is a
sequence in [0, 1] with ∑∞ j=0 p j = 1.) Let u and v be indepedent random variables
taking values in N0 with pmfs { p j } and {q j } respectively, and let z := u + v. Obtain
an expression for the pmf of z in terms of { p j } and {q j }.
Ex. 2.8.30. Consider the setting of §2.4.4. Let α∗ , β∗ and `∗ be as defined there. Let
the prediction error u be defined as u := y − `∗ ( x ). Show that
1. E [`∗ ( x )] = E [y]
Ex. 2.8.32. Show that if x is a random vector with E [xx0 ] = I and A is a conformable
constant matrix, then E [x0 Ax] = trace(A).
Ex. 2.8.33. Let x be random and let a be constant. Show that if E [x] = µ and
var[x] = Σ, then E [a0 x] = a0 µ and var[a0 x] = a0 Σa.
Ex. 2.8.34. Let x be a random K × 1 vector. Show that E [xx0 ] is nonnegative definite.
Ex. 2.8.39. In this exercise, we complete the proof of the LLN on page 78. Let { xn }
be an IID sequence of random variables with common distribution F. Show that
E [ x̄ N ] → sF(ds) and var[ x̄ N ] → 0 as N → ∞.
R
Ex. 2.8.40. Confirm (2.32) via theorem 2.5.2 and fact 2.5.4.
p
2. Show that xn → 0 as n → ∞.
Ex. 2.8.42. Let x be any random variable with E [ x ] = µ and var[ x ] = σ2 < ∞. Show
that xn := x/n converges to zero in probability as n → ∞.
Ex. 2.8.43. This exercise covers some of the proof behind theorem 2.5.3 on page 81.
Suppose that {tn } is a sequence of random variables, θ is a constant, and
√ d
n(tn − θ ) → N (0, σ2 ) as n→∞
Ex. 2.8.44. Using fact 2.5.1 (page 75) as appropriate, prove the following part of
p p p
fact 2.6.12: If Xn → X and Yn → Y, then Xn Yn → XY whenever the matrices are
conformable.
p
Ex. 2.8.45. Confirm the following claim in fact 2.6.11: If a0 xn → a0 x for every a ∈ RK ,
p
then xn → x.
Ex. 2.8.47. Verify the first part of fact 2.6.10 on page 86. (Note that exercise 2.8.46 is
a warm up to this exercise.)
√ d
Ex. 2.8.48. Confirm the claim N (x̄ N − µ) → N (0, Σ) in theorem 2.6.1.
Ex. 2.8.49. Let {xn } be an IID sequence of random vectors in RK with E [xn ] = 0 and
var[xn ] = IK . Let
1 N
N n∑
x̄ N := xn and y N := N · kx̄ N k2
=1
What is the asymptotic distribution of {y N }?
CHAPTER 2. PROBABILITY 96
Solution to Exercise 2.8.1. If A, B and C are disjoint, then A ∪ B and C are also
disjoint, and A ∪ B ∪ C = ( A ∪ B) ∪ C. As a result, using additivity over pairs,
P( A ∪ B ) = P( A ) + P( B ) − P( A ∩ B )
we start by decomposing A into the union of two disjoint sets: A = [( A ∪ B) \ B] ∪
( A ∩ B). Using additivity of P, we then have
P( A) = P[( A ∪ B) \ B] + P( A ∩ B)
Since B ⊂ ( A ∪ B), we can apply part 1 of fact 2.1.1 (page 49) to obtain
P( A ) = P( A ∪ B ) − P( B ) + P( A ∩ B )
Rearranging this expression gives the result that we are seeking.
Solution to Exercise 2.8.4. When the dice is rolled one face must come up, so the
sum of the probabilities is one. More formally, letting Ω = {1, . . . , 6} be the sample
space, we have
6 6
P{1, . . . , 6} = P ∪6m=1 {m} = ∑ P{m} = ∑ qm = 1
m =1 m =1
2. 1{ω0 ∈ Ω} = 1
Solution to Exercise 2.8.7. The proof is almost identical to the proof of additivity in
example 2.1.3 (page 49).
Solution to Exercise 2.8.8. The proof of independence is essentially the same as the
proof of independence of A and B in example 2.1.5 (page 51).
Solution to Exercise 2.8.11. Consider the sum x + y. By this, we mean the random
variable ( x + y)(ω ) := x (ω ) + y(ω ). We claim that E [ x + y] = E [ x ] + E [y]. To see
that this is the case, note first that
( x + y)(ω ) = 1 A\ B (ω ) + 1B\ A (ω ) + 21 A∩ B (ω )
(To check this, just go through the different cases for ω, and verify that the right
hand side of this expression agrees with x (ω ) + y(ω ). Sketching a Venn diagram
will help.) Therefore, by the definition of expectation,
E [ x + y ] = P ( A \ B ) + P ( B \ A ) + 2P ( A ∩ B ) (2.41)
E [ x ] : = P( A ) = P( A \ B ) + P( A ∩ B )
Performing a similar calculation with y produces
E [ y ] : = P( B ) = P( B \ A ) + P( A ∩ B )
Adding these two produces the value on the right-hand side of (2.41), and we have
confirmed that E [ x + y] = E [ x ] + E [y].
CHAPTER 2. PROBABILITY 99
Solution to Exercise 2.8.12. We are assuming that x has finite range, and hence takes
only finitely many different values. Let m be the largest such value. For this m, we
have
lim Fx (s) ≥ Fx (m) = P{ω ∈ Ω : x (ω ) ≤ m} = P(Ω) = 1
s→∞
(The inequality is due to the fact that Fx is increasing.) On the other hand,
Solution to Exercise 2.8.14. Fix s ≥ 0. Using additivity over disjoint sets, we have
The claim F| x| (s) = 2F (s) − 1 now follows from the definition of symmetry.
(We are using the fact that the sets { x = s j } disjoint. Why is this always true? Look
carefully at the definition of a function given in §4.2.)
as required.
Solution to Exercise 2.8.27. As in the statement of the exercise, x and y are indepen-
dent uniform random variables on [0, 1], z := max{ x, y} and w := min{ x, y}. As
a first step to the proofs, you should convince yourself that if a, b and c are three
numbers, then
• {z ≤ s} = { x ≤ s} ∩ {y ≤ s}
• {w ≤ s} = { x ≤ s} ∪ {y ≤ s}
(For each, equality, show that if ω is in the right-hand side, then ω is in the left-hand
side, and vice versa.) Now, for s ∈ [0, 1], we have
Hence P{w ≤ s} = 2s − s2 .
Solution to Exercise 2.8.31. Using y = `∗ ( x ) + u and the results form exercise 2.8.30,
we have
var[`∗ ( x ) + u] = var[y]
= corr[ x, y]2 var[y] + (1 − corr[ x, y]2 ) var[y]
= var[`∗ ( x )] + var[u]
Solution to Exercise 2.8.35. First note that since x ∼ N (0, I N ) we have cov[ xi , x j ] =
0 for all i 6= j. Since uncorrelated normal random variables are independent, we
IID
then have x1 , . . . , x N ∼ N (0, 1). Since sums of squares of independent standard
normals are chi-squared, we have in particular that
k
∑ xn2 ∼ χ2 (k) (2.43)
n =1
3. x12 /x22 ∼ F (1, 1), because if Q1 ∼ χ2 (k1 ) and Q2 ∼ χ2 (k2 ) and Q1 and Q2 and
independent, then ( Q1 /k1 )/( Q2 /k2 ) ∼ F (k1 , k2 ).
4. x1 [2/( x22 + x32 )]1/2 ∼ t(2), because if Z ∼ N (0, 1) and Q ∼ χ2 (k ) and Z and Q
are independent, then Z (k/Q)1/2 ∼ t(k ).
Solution to Exercise 2.8.36. Pick any nonnegative random variable x and δ > 0. By
considering what happens at an arbitrary ω ∈ Ω, you should be able to convince
yourself that
x = 1{ x ≥ δ } x + 1{ x < δ } x ≥ 1{ x ≥ δ } δ
Using fact 2.2.5 (page 60), fact 2.2.2 (page 58) and rearranging gives (2.11). Regard-
ing (2.12), observe that
p p p
Solution to Exercise 2.8.44. Let Xn → X and Yn → Y. To prove that Xn Yn → XY,
we need to show that the i, j-th element of Xn Yn converges in probability to the i, j-th
element of XY. By hypothesis, we have
n p p
xik → xik and ynkj → ykj for all k
n n p
xik ykj → xik ykj for all k
and then
p
∑ xikn ynkj → ∑ xik ykj
k k
In other words, the i, j-th element of Xn Yn converges in probability to the i, j-th ele-
ment of XY.
p
Solution to Exercise 2.8.45. If a0 xn → a0 x for every a ∈ RK , then we know in par-
ticular that this convergence holds for the canonical basis vectors. Hence
p
e0k xn → e0k x for every k
p
∴ xnk → x k for every k (elementwise convergence)
p
∴ xn → x (vector convergence, by definition)
Solution to Exercise 2.8.46. From fact 2.5.1 on page 75, we know that if g : R → R
p
is continuous and {un } is a scalar sequence of random variables with un → u, then
p p p p
g(un ) → g(u). We also know that if un → u and vn → v, then un + vn → u + v. By
assumption, we have
p p
xn → 0 and yn → 0
p p
∴ xn2 → 02 = 0 and y2n → 02 = 0
p
∴ kxn k2 = xn2 + y2n → 0 + 0 = 0
p √
q
∴ k x n k = k x n k2 → 0 = 0
CHAPTER 2. PROBABILITY 105
A special case of this argument can be found in the solution to exercise 2.8.46. The
p
general case is similar: Suppose first that xkn → xk for all k. Combining the various
results about scalar convergence in probability in fact 2.5.1 (page 75), one can then
verify (details left to you) that
v
u K
p
kxn − xk := t ∑ ( xkn − xk )2 → 0
u
(n → ∞)
k =1
p
Regarding the converse, suppose now that kxn − xk → 0. Fix e > 0 and arbitrary k.
From the definition of the norm we see that | xkn − xk | ≤ kxn − xk is always true, and
hence
| xkn − xk | > e =⇒ kxn − xk > e
∴ {| xkn − xk | > e} ⊂ {kxn − xk > e}
∴ 0 ≤ P{| xkn − xk | > e} ≤ P{kxn − xk > e} → 0
The proof is done.
Letting g(s) := ksk2 and applying the continuous mapping theorem (fact 2.6.13 on
page 88), we obtain
√ K
y N = k N x̄ N k2 → kzk2 = ∑ z2k
d
k =1
d
From fact 2.4.4 on page 72 we conclude that y N → χ2 (K ).
Chapter 3
[roadmap]
3.1 Orthogonality
[roadmap]
k x1 + · · · + x K k2 = k x1 k2 + · · · + k x K k2
107
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 108
x z
Figure 3.1: x ⊥ z
S
x
Figure 3.2: x ⊥ S
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 109
While not every linearly independent set is orthogonal, a partial converse to fact 3.1.3
is given in §3.3.3.
A set of vectors O ⊂ R N is called an orthonormal set if it is an orthogonal set and
kuk = 1 for all u ∈ O. An orthonormal set that spans a linear subspace S of R N is
called an orthonormal basis for S. The standard example of an orthonormal basis
for all of R N is the canonical basis vectors e1 , . . . , e N .
One neat thing about orthonormal bases is the following: If O = {u1 , . . . , uK } is any
basis of S ⊂ R N , then we can write x in terms of the basis vectors as in x = ∑kK=1 αk uk
for suitable scalars α1 , . . . , αK . The value of these scalars is not always transparent,
but for an orthonormal basis we have αk = x0 uk for each k. That is,
Fact 3.1.5. Let Q and P be orthogonal N × N matrices. The following statements are
true:
2. QP is an orthogonal matrix.
This theorem is sometimes called the spectral decomposition theorem. One nice
application is a proof of fact 1.4.13 on page 37, which states among other things that
symmetric matrix A is positive definite if and only if its eigenvalues are all positive.
Exercise 3.6.5 and its solution step you through the arguments.
There is another kind of matrix decomposition using orthogonality that has many
important applications: the QR decomposition.
S⊥ := {x ∈ R N : x ⊥ S}
In other words, S⊥ is the set of all vectors that are orthogonal to S. Figure 3.3 gives
an example in R2 .
S⊥
S
This is easy enough to confirm: Looking back at the definition of linear subspaces,
we see that the following statement must be verified: Given x, y ∈ S⊥ and α, β ∈ R,
the vector that αx + βy is also in S⊥ . Clearly this is the case, because if z ∈ S, then
[Roadmap]
ŷ = argmin ky − zk (3.2)
z∈S
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 112
y − ŷ S
ŷ
Theorem 3.2.1 (Orthogonal Projection Theorem I). Let y ∈ R N and let S be any
nonempty linear subspace of R N . The following statements are true:
The vector ŷ in theorem 3.2.1 is called the orthogonal projection of y onto S. Al-
though we do not prove the theorem here, the intuition is easy to grasp from a
graphical presentation. Figure 3.4 illustrates. Looking at the figure, we can see that
the closest point ŷ to y within S is indeed the one and only point in S such that y − ŷ
is orthogonal to S.
[Add a proof?]
Example 3.2.1. Let y ∈ R N and let 1 ∈ R N be the vector of ones. Let S be the set of
constant vectors in R N , meaning that all elements are equal. Evidently S is the span
of {1}. The orthogonal projection of y onto S is ŷ := ȳ1, where ȳ is the scalar
N
1
ȳ :=
N ∑ yn
n =1
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 113
S
Py
y0
Py0
Theorem 3.2.2 (Orthogonal Projection Theorem II). Let S be any linear subspace, and
let P : R N → R N be the orthogonal projection onto S. The following statements are true:
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 114
2. Py ∈ S,
3. y − Py ⊥ S,
6. Py = y if and only if y ∈ S.
These results are not difficult to prove, given theorem 3.2.1. Linearity of P is left as
an exercise (exercise 3.6.11). Parts 2 and 3 follow directly from theorem 3.2.1. To see
part 4, observe that y can be decomposed as y = Py + y − Py. Part 4 now follows
from parts 2–3 and the Pythagorean law. (Why?) Part 5 follows from part 4. (Why?)
Part 6 follows from the definition of Py as the closest point to y in S.
Fact 3.2.1. If {u1 , . . . , uK } is an orthonormal basis for S, then, for each y ∈ R N ,
K
Py = ∑ (y0 uk )uk (3.3)
k =1
Fact 3.2.1 is a fundamental result. We can see it’s true immediately because the right
hand side of (3.3) clearly lies in S (being a linear combination of basis functions) and,
for any u j in the basis set
K
(y − Py)0 u j = y0 u j − ∑ (y0 uk )(u0k u j ) = y0 u j − y0 u j = 0
k =1
y
S⊥
S
û = My
ŷ = Py
There’s yet another way of stating the orthogonal projection theorem, which is also
informative. Recall the definition of orthogonal complements from §3.1.3. In the
orthogonal projection theorem, our interest was in projecting y onto S, but we have
just learned that S⊥ is itself a linear subspace, so we can also project y onto S⊥ . Just
as we used P to denote the function sending y into its projection onto S, so we’ll
use M to denote the function sending y into its projection onto S⊥ . The result we’ll
denote by û, so that û := My. Figure 3.6 illustrates. The figure suggests that we will
have y = ŷ + û, and indeed that is the case. The next theorem states this somewhat
more mathematically.
Theorem 3.2.3 (Orthogonal Projection Theorem III). Let S be a linear subspace of R N .
If P is the orthogonal projection onto S and M is the orthogonal projection onto S⊥ , then Py
and My are orthogonal, and
y = Py + My
If S1 and S2 are two subspaces of R N with S1 ⊂ S2 , then S2⊥ ⊂ S1⊥ . This means that
the result in fact 3.2.2 is reversed for M.
Fact 3.2.3. Let S1 and S2 be two subspaces of R N and let y ∈ R N . Let M1 and M2 be
the projections onto S1⊥ and S2⊥ respectively. If S1 ⊂ S2 , then,
M1 M2 y = M2 M1 y = M2 y
Fact 3.2.4. Py = 0 if and only if y ∈ S⊥ , and My = 0 if and only if y ∈ S.1
1 For example, if Py = 0, then y = Py + My = My. Hence M does not shift y. If an orthogonal
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 116
[Roadmap]
P = B ( B 0 B ) −1 B 0 (3.4)
This result is actually a generalization of fact 3.2.1 on page 114. While the expression
isn’t quite as neat, the advantage is that it applies to any basis, not just orthonormal
ones. We further explore the connection between the two results in §3.3.3.
Proof of theorem 3.3.1. Fix y ∈ R N . The claim is that the vector ŷ := B(B0 B)−1 B0 y is
the orthogonal projection of y onto S, where B is the matrix defined by colk (B) = bk .
To verify this, we need to show that
1. ŷ ∈ S, and
2. y − ŷ ⊥ S.
Part 1 is true because ŷ can be written as ŷ = Bx where x := (B0 B)−1 B0 y. The vector
Bx is a linear combination of the columns of B. Since these columns form a basis of
S they must lie in S. Hence ŷ ∈ S as claimed.
projection onto a subspace doesn’t shift a point, that’s because the point is already in that subspace
(see, e.g., theorem 3.2.2). In this case the subspace is S⊥ , and we conclude that y ∈ S⊥ .
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 117
Regarding part 2, from the assumption that B gives a basis for S, all points in S have
the form Bx for some x ∈ RK . Thus part 2 translates to the claim that
Theorem 3.3.1 and its proof implicitly assume that B0 B is nonsingular, but this is
justified because B is assumed to be full column rank (exercise 3.6.17).
The matrix P := B(B0 B)−1 B0 defined in (3.4) is called the projection matrix associ-
ated with the basis B (or the subspace S). It is common to also define the annihilator
associated with B as
M := I − P (3.5)
Here I is, as usual, the identity matrix (in this case N × N). In view of theorem 3.2.3,
the annihilator matrix projects any vector onto the orthogonal complement of S.
and we cannot in general find a β that satisfies all N equations. We aim instead
for the “best available” vector, which is the β ∈ RK making Xβ is as close to y
as possible (in the Euclidean sense). Using the orthogonal projection theorem, the
minimizer is easy to identify:
Theorem 3.3.2. Let X be an N × K matrix with full column rank and let y ∈ R N . The
minimization problem minβ∈RK ky − Xβk has a unique solution. The solution is
Proof. Let X and y be as in the statement of the theorem. Let β̂ be as in (3.6) and let
S := span(X). By the full column rank assumption, X forms a basis for S. In view of
theorem 3.3.1, the orthogonal projection of y onto S is
ŷ := X(X0 X)−1 X0 y = X β̂
The solution β̂ in theorem 3.3.2 is sometimes called the least squares solution to the
minimization problem described in that theorem. The intuition behind this notation
will be given later on.
Remark 3.3.1. What happens if we drop the assumption that the columns of X are
linearly independent? The set span(X) is still a linear subspace, and the orthogonal
projection theorem still gives us a closest point ŷ to y in span(X). Since ŷ ∈ span(X),
there still exists a vector β̂ such that ŷ = X β̂. The problems is that now there exists
an infinity of such vectors. Exercise 3.6.18 asks you to prove this.
P : = X ( X 0 X ) −1 X 0 and M := I − P (3.7)
The expression for projection onto a basis given in theorem 3.3.1 is extremely useful.
At the same time, it isn’t quite as neat as the expression for projection onto an or-
thonormal basis given in fact 3.2.1 (page 114). For this and other reasons, it’s nice to
know that any basis of a linear subspace can be converted to an orthonormal basis
in a straightforward way. The rest of this section gives details.
Theorem 3.3.3. Given any linearly independent subset {b1 , . . . , bK } of R N , there exists
an orthonormal set {u1 , . . . , uK } such that
The proof of theorem 3.3.3 provides a concrete algorithm for generating the or-
thonormal set {u1 , . . . , uK }. The first step is to construct orthogonal sets {v1 , . . . , vk }
with span identical to {b1 , . . . , bk } for each k. At the end one can just normalize each
vector to produce orthonormal sets with the same span.
The construction of {v1 , . . . , vK } uses the so called Gram Schmidt orthogonaliza-
tion procedure, which runs as follows: First, for each k = 1, . . . , K, let Bk be the
N × k matrix formed by columns b1 , . . . , bk . Let Pk := Bk (B0k Bk )−1 B0k be the associ-
ated orthogonal projection matrix and let P0 := 0. Define v1 , . . . , vK by
v k : = b k − P k −1 b k (3.8)
Looking at the definition of the annihilator in (3.5), the idea behind the algorithm is
clear—we are using the annihilator to project each successive bk into the orthogonal
complement of the span of the previous vectors b1 , . . . , bk−1 .
Exercises 3.6.22–3.6.24 step you through the process of verifying that {v1 , . . . , vk }
is orthogonal with span equal to that of {b1 , . . . , bk } for k = 1, . . . , K. As part of
the proof it is shown that no vk equals 0, and hence we can define uk := vk /kvk k.
The set {u1 , . . . , uk } is clearly orthonormal for each k, and its span is the same as
{v1 , . . . , vk }. (Why?) In other words, it has the properties stated in theorem 3.3.3.
x1 = (x10 u1 )u1
x2 = (x20 u1 )u1 + (x20 u2 )u2
x3 = (x30 u1 )u1 + (x30 u2 )u2 + (x30 u3 )u3
and so on. Sticking to the 3 × 3 case to simplify expressions, we can stack these
equations horizontally to get
0
| | | | | | (x1 u1 ) (x20 u1 ) (x30 u1 )
x1 x2 x3 = u1 u2 u3 0 (x20 u2 ) (x30 u2 )
| | | | | | 0 0 (x30 u3 )
3.4 Projections in L2
The main purpose of this section is to introduce conditional expectations and study
their properties. The definition of conditional expectations given in elementary
probability texts is often cumbersome to work with, and fails to provide the big
picture. In advanced texts, there are several different approaches to presenting con-
ditional expectations. The one I present here is less common than the plain vanila
treatment, but it is, to my mind, by far the most intuitive. As you might expect given
the location of this discussion, the presentation involves orthogonal projection.
Sometimes we want to predict the value of a random variable y using another vari-
able x. In this case we usually choose x such that y and x are expected to be close
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 121
The space L2 with its norm (3.9) is important partly because it shares so many prop-
erties with Euclidean space, and hence geometric intuition carries over from the
latter to L2 . We’ll see many examples of this below. Here’s a list of some of the
many ways that L2 is similar to Euclidean space
Fact 3.4.1. Let x and y be random variables in L2 and let α and β be any scalars. The
following statements are true:
1. kαx k = |α|k x k
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 122
3. k x + yk ≤ k x k + kyk.
Following on from fact 3.4.1, we can draw another parallel between L2 norm and
the Euclidean norm. As we saw in §1.1.1, the Euclidean norm is defined in terms of
the inner product on R N . If x and y are √
two vectors in R N , then the inner product is
x0 y, and the norm of vector x is kxk = x0 x. Similarly, for random variables x and
y, we define
h x, yi := inner product of x and y := E [ xy]
2 Actually, while kxk = 0 certainly implies that x = 0 in Euclidean space, it isn’t quite true that
k x k = 0 implies that x is the zero random variable (i.e., x (ω ) = 0 for all ω ∈ Ω). However, we can say
that if k x k = 0, then x = 0 with probability one. More formally, the set E := {ω ∈ Ω : | x (ω )| > 0}
satisfies P( E) = 0. In this sense, x differs from the zero random variable only in a trivial way. In
the applications we consider this caveat never causes problems. If you do continuous time stochastic
dynamics, however, you’ll find that there’s a lot of mathematical machinery built up to keep control
of this issues.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 123
1. h x, yi = hy, x i
3. h x, (αy + βz)i = αh x, yi + βh x, zi
Theorem 3.4.1 (Orthogonal Projection Theorem IV). Let y ∈ L2 and let S be any
nonempty linear subspace of L2 . The following statements are true:3
The vector ŷ in theorem 3.2.1 is called the orthogonal projection of y onto S. Hold-
ing S fixed, we can think of the operation
Fact 3.4.3. Let S be any linear subspace of L2 , and let P : L2 → L2 be the orthogonal
projection onto S. For any y ∈ L2 , we have
2. Py ∈ S
3. y − Py ⊥ S
5. k Pyk ≤ kyk
6. Py = y if and only if y ∈ S
Now let x ∈ L2 with E [ x ] = µ and var[ x ] = σx2 . Observe first that P1 x = µ1.
(We’ve written “µ1” instead of just “µ” to remind ourselves that we’re thinking of
a constant random variable, rather than just a constant.) To see this we need only
confirm that µ1 ∈ S1 , which is obvious, and that E [1( x − µ1)] = 0. The second
claim is also obvious, because E [1( x − µ1)] = E [ x − µ] = E [ x ] − µ.
Since P1 x = µ1, we have M1 x = x − µ1, or, written more conventionally, M1 x =
x − µ. In other words, M1 x is the “de-meaned” or “centered” version of x. The norm
of M1 x is q
k M1 x k = E [( x − µ)2 ] = σx
In other words, the norm of M1 x is the standard devation of x.
Next fix x, y ∈ L2 and consider projecting y onto S2 := span(1, x ). That is, S2 is
equal to the set of random variables
α + βx := α1 + βx for scalars α, β
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 126
3.4.3 Measurability
z = g ( x1 , . . . , x p )
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 127
Informally, what this means is that once the values of the random variables x1 , . . . , x p
have been realized, the variable z is completely determined (i.e., no longer random)
and it’s realized value can be calculated (assuming that we can calculate the func-
tional form g). You might like to imagine it like this: Uncertainty is realized, in the
sense that some ω is selected from the sample space Ω. Suppose that we don’t get
to view ω itself, but we do get to view certain random outcomes. For example, we
might get to observe the realized values x1 (ω ), . . . , x p (ω ). If z is G -measurable, we
can now calculate the realized value z(ω ) of z, even without knowning ω, because
we can compute z(ω ) = g( x1 (ω ), . . . , x p (ω )).4
As a matter of notation, if G = { x } and y is G -measurable, then we will also say that
y is x-measurable.
Example 3.4.3. Let x and z be two random variables. If z = 2x + 3, then z is x-
measurable. To see this formally, we can write z = g( x ) when g( x ) = 2x + 3. Less
formally, when x is realized, the value of z can be calculated.
Example 3.4.4. Let x1 , . . . , x N be random variables and let x̄ N be their sample mean.
If G = { x1 , . . . , x N }, then x̄ N := N −1 ∑nN=1 xn is clearly G -measurable.
Example 3.4.5. If x and y are independent and neither random variable is constant,
then y is not x-measurable. Indeed, if y was x-measurable, then we would have
y = g( x ) for some function g. This contradicts independence of x and y.
Example 3.4.6. Let x, y and z be three random variables with z = x + y. Suppose that
x and y are independent. Then z is not x-measurable. Intuitively, even if we know
the realized value of x, the realization of z cannot be computed until we know the
realized value of y. Formally, if z is x-measurable then z = g( x ) for some function
g. But then y = g( x ) − x, so y is x-measurable. This contradicts independence of x
and y.
Example 3.4.7. Let y = α, where α is a constant. This degenerate random variable
is G -measurable for any information set G , because y is already deterministic. For
p
example, if G = { x1 , . . . , x p }, then we can take y = g( x1 , . . . , x p ) = α + ∑i=1 0xi .
If x and y are known given the information in G , then a third random variable that
depends on only on x and y is likewise known given G . Hence G -measurability is
preserved under the taking of sums, products, etc. In particular,
4A technical note: In the definition of measurability above, where we speak of existence of the
function g, it is additional required that the function g is “Borel measurable.” For the purposes of
this course, we can regard non-Borel measurable functions as a mere theoretical curiosity. As such,
the distinction will be ignored. See Williams (1991) or any similar text for further details.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 128
Fact 3.4.4. Let α, β be any scalars, and let x and y be random variables. If x and y are
both G -measurable, then u := xy and v := αx + βy are also G -measurable.
Let G and H be two information sets with G ⊂ H. In this case, if random variable z is
G measurable, then it is also H-measurable. This follows from our intuitive definition
of measurability: If the value z is known once the variables in G are known, then it
is certainly known when the extra information provided by H is available. The next
example helps to clarify.
Example 3.4.8. Let x, y and z be three random variables, let G = { x }, and let
H = { x, y}. Suppose that z = 2x + 3, so that z is G -measurable. Then z is also
H-measurable. Informally, we can see that z is deterministic once the variables in H
are realized. Formally, we can write z = g( x, y), where g( x, y) = 2x + 3 + 0y. Hence
z is also H-measurable as claimed.
We started off this section by talking about how conditional expectations were going
to be projections onto linear subspaces, and we wanted to identify the subspaces.
Let’s clarify this now. Given G ⊂ L2 , define
From fact 3.4.5 we see that, in the sense of set inclusion, the linear subspace is in-
creasing with respect to the information set.
Now it’s time to define conditional expectations. Let G ⊂ L2 and y be some ran-
dom variable in L2 . The conditional expectation of y given G is written as E [y | G]
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 129
or E G [y], and defined as the closest G -measurable random variable to y.5 More
formally,
E [y | G] := argmin ky − zk (3.13)
z∈ L2 (G)
This definition makes a lot of sense. Our intuitive understanding of the conditional
expectation E [y | G] is that it is the best predictor of y given the information con-
tained in G . The definition in (3.13) says the same thing. It simultaneously restricts
E [y | G] to be G -measurable, so we can actually compute it once the variables in G
are realized, and selects E [y | G] as the closest such variable to y in terms of RMSE.
While the definition makes sense, it still leaves many open questions. For example,
there are many situations where minimizers don’t exist, or, if the do exist, there are
lots of them. So is our definition really a definition? Moreover, even assuming we
do have a proper definition, how do we actually go about computing conditional ex-
pectations in practical situations? And what properties do conditional expectations
have?
These look like tricky questions, but fortunately the orthogonal projection theorem
comes to the rescue.
Comparing (3.13) and (3.10), we see that y 7→ E [y | G] is exactly the orthogonal
projection function P in the special case where the subspace S is the G -measurable
functions L2 (G).
Okay, so E [y | G] is the orthogonal projection of y onto L2 (G). That’s kind of neat,
but what does it actually tell us? Well, it tells us quite a lot. For starters, theo-
rem 3.4.1 implies that E [y | G] is always well defined and unique. Second, it gives
us a useful characterization of E [y | G], because we now know that E [y | G] is the
unique point in L2 such that E [y | G] ∈ L2 (G) and y − E [y | G] ⊥ z for all z ∈ L2 (G).
Rewriting these conditions in a slightly different way, we can give an alternative
(and equivalent) definition of conditional expectation: E [y | G] ∈ L2 is the condi-
tional expectation of y given G if
1. E [y | G] is G -measurable, and
This definition seems a bit formidable, but it’s not too hard to use. Before giving an
application, let’s bow to common notation and define
E [y | x1 , . . . , x p ] := E [y | G]
Also, let’s note the following “obvious” fact:
Let’s check this using the second definition of conditional expectations given above.
To check that x + E [w] is indeed the conditional expectation of y given G = { x }, we
need to show that x + E [w] is x-measurable and that E [( x + E [w]) z] = E [yz] for
all x-measurable z. The first claim is clearly true, because x + E [w] is a deterministic
function of x. The second claim translates to the claim that
for any function g. Verifying this equality is left as an exercise (exercise 3.6.27)
The next example shows that when x and y are linked by a conditional density (re-
member: densities don’t always exist), then our definition of conditional expectation
reduces to the one seen in elementary probability texts. The proof of the claim in the
example is the topic of exercise 3.6.32.
Example 3.4.10. If x and y are random variables and p(y | x ) is the conditional den-
sity of y given x, then Z
E [y | x ] = tp(t | x )dt
There are some additional goodies we can harvest using the fact that conditional
expectation is an orthogonal projection.
Fact 3.4.9. Let x and y be random variables in L2 , let α and β be scalars, and let G
and H be subsets of L2 . The following properties hold.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 131
4. If y is G -measurable, then E [y | G] = y.
Checking of these facts is mainly left to the exercises. Most are fairly straightfor-
ward. For example, consider the claim that if y is G -measurable, then E [y | G] = y.
In other words, we are saying that if y ∈ L2 (G), then y is projected into itself. This
is immediate from the last statement in theorem 3.4.1.
The fact that if G ⊂ H, then E [E [y | H] | G] = E [y | G] is called the “tower” property
of conditional expectations (by mathematicians), or the law of iterated expectations
(by econometricians). The law follows from the property of orthogonal projections
given in fact 3.2.2 on page 114: Projecting onto the bigger subspace L2 (H) and from
there onto L2 (G) is the same as projecting directly onto the smaller subspace L2 (G).
Conditional expectations of random matrices are defined using the notion of condi-
tional expectations for scalar random variables. For example, given random matri-
ces X and Y, we set
Using the definitions, one can show that all of the results on conditional expectations
in fact 3.4.9 continue to hold in the current setting, replacing scalars with vectors and
matrices. We state necessary results for convenience:
Fact 3.4.10. Let X, Y and Z be random matrices, and let A and B be constant matrices.
Assuming conformability of matrix operations, the following results hold:
2. E [AX + BY | Z] = AE [X | Z] + BE [Y | Z].
• E [ g(X) | X] = g(X)
• E [ g ( X ) Y | X ] = g ( X )E [ Y | X ]
• E [Y g(X) | X] = E [Y | X] g(X)
(y − g( x ))2 = (y − f ( x ) + f ( x ) − g( x ))2
= (y − f ( x ))2 + 2(y − f ( x ))( f ( x ) − g( x )) + ( f ( x ) − g( x ))2
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 133
Let’s consider the expectation of the cross-product term. From the law of iterated
expectations (fact 3.4.9), we obtain
We can re-write the term inside the curly brackets on the right-hand side of (3.16) as
( f ( x ) − g( x ))E [(y − f ( x )) | x ]
(Which part of fact 3.4.9 are we using here?) Regarding the second term in this
product, we have (by which facts?) the result
E [y − f ( x ) | x ] = E [y | x ] − E [ f ( x ) | x ] = E [y | x ] − f ( x ) = E [y | x ] − E [y | x ] = 0
We conclude that the expectation in (3.16) is E [0] = 0. It then follows that
To be written.
3.6 Exercises
Ex. 3.6.1. Find two unit vectors (i.e., vectors with norm equal to one) that are or-
thogonal to (1, −2).
Ex. 3.6.2. Prove the Pythagorean law (fact 3.1.2 on page 107). See fact 1.1.3 if you
need a hint.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 134
Ex. 3.6.4. Let Q be an orthogonal matrix. Show that Q−1 = Q0 and det(Q) ∈
{−1, 1} both hold.
Ex. 3.6.5. Use theorem 3.1.1 (page 110) to prove the following part of fact 1.4.13
(page 37): A symmetric matrix A is positive definite if and only if its eigenvalues
are all positive.
2. Explain the connection between this equality and the Pythagorean Law.
Ex. 3.6.12. Let P be an N × N matrix that is both symmetric and idempotent. Let
S := rng(P). Show that P is precisely the orthogonal projection mapping onto the
linear space S. (In other words, for any given y ∈ R N , the vector Py is the closest
point in S to y.)
Ex. 3.6.13. In this exercise you are asked to prove the Cauchy-Schwarz inequality
|x0 y| ≤ kxkkyk from fact 1.1.2 on page page 5 via the orthogonal projection theorem.
Let y and x be nonzero vectors in R N (since if either equals zero then the inequality
is trivial), and let span(x) be all vectors of the form αx for α ∈ R.
x0 y
Py = x
x0 x
2. Using this expression and any relevant properties of orthgonal projections (see
theorem 3.2.2 on page 113), confirm the Cauchy-Schwarz inequality.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 135
Ex. 3.6.14. Consider theorem 3.3.2 on page 118. If N = K, what does β̂ reduce to?
Interpret.
Ex. 3.6.15. Let S := {( x1 , x2 , x3 ) ∈ R3 : x3 = 0}, and let y := 1 := (1, 1, 1). Using the
orthogonal projection theorem, find the closest point in S to y.
Ex. 3.6.16. Let P be the orthogonal projection described in theorem 3.2.2 (page 113).
Is it true that Px 6= Py whenever x 6= y? Why or why not?6
Ex. 3.6.17. Show that when N × K matrix B is full column rank, the matrix B0 B is
nonsingular.7
Ex. 3.6.18. Prove the claim at the end of remark 3.3.1 on page 118. In particular, let
X be N × K with linearly dependent columns, and let ŷ be the closest point to y in
span(X), existence of which follows from the orthogonal projection theorem. Prove
that there are infinitely many b such that ŷ = Xb.
Ex. 3.6.19. Show by direct computation that the projection matrix P and annihilator
M in fact 3.3.1 are both symmetric and idempotent.
Ex. 3.6.21. Verify fact 3.3.2 (i.e., MX = 0) directly using matrix algebra.
Ex. 3.6.22. Taking all notation as in §3.3.3 and adopting the assumptions of theo-
rem 3.3.3 on page 119, let Vk be the N × k matrix formed by columns {v1 , . . . , vk }
for each k = 1, . . . , K. Show that span(Vk ) ⊂ span(Bk ) for each k.
Ex. 3.6.23. Continuing on from exercise 3.6.22, show that {v1 , . . . , vK } is an orthog-
onal set.
Ex. 3.6.24. Following on from exercises 3.6.22 and 3.6.23, show that span(Vk ) =
span(Bk ) for each k. 8
6 Hint: Sketch the graph and think about it visually.
7 Hint: In view of fact 1.4.14, it suffices to show that B0 B is positive definite.
8 Hint: Use the results of both of these exercises and fact 1.1.8 on page 14.
CHAPTER 3. ORTHOGONALITY AND PROJECTIONS 136
Ex. 3.6.26. Prove the Cauchy-Schwarz inequality for random variables, which was
first statedpas fact 2.2.7 on page 60. That is, shown that for x, y ∈ L2 we have
|E [ xy]| ≤ E [ x2 ]E [y2 ]. Use the results on orthogonal projections in L2 as found in
§3.4.1. If you get stuck, follow the solution to exercise 3.6.13, adjusting from vectors
to L2 as required.
Ex. 3.6.27. Show that the equality in (3.14) holds when x and w are independent.
Ex. 3.6.29. Confirm the claim in fact 3.4.9 that if x is G -measurable, then E [ xy | G] =
xE [y | G].
Ex. 3.6.32. Prove the claim in example 3.4.10. (Warning: The proof is a little ad-
vanced and you should be comfortable manipulating double integrals.)
Suppose that all eigenvalues are positive. Take x to be nonzero. The vector y must
be nonzero (why?), and it follows from (3.17) that x0 Ax > 0. Hence A is positive
definite as claimed.
Conversely, suppose that A is positive definite. Fix n ≤ N and set x = Qen . Evi-
dently x is nonzero (why?). Hence x0 Ax > 0. Since Q0 is the inverse of Q, it follows
that
λn = e0n Den = (Q0 x)0 DQ0 x = x0 QDQ0 x = x0 Ax > 0
Since n was arbitrary, all eigenvalues are positive.
To verify this equality, we need to show that the right-hand side is the orthogonal
projection of αx + βy onto S. Going back to theorem 3.2.1, we need to show that (i)
αPx + βPy ∈ S and (ii) for any z ∈ S, we have
Solution to Exercise 3.6.13. Regarding part 1, the expression for Py given in exer-
cise 3.6.13 can also be written as x(x0 x)−1 x0 y. Since x is a basis for span(x), the
validity of this expression as the projection onto span(x) follows immediately from
theorem 3.3.1. Regarding part 2, recall that orthogonal projections contract norms,
so that, in particular, kPyk ≤ kyk must hold. Using our expression for Py from part
1 and rearranging gives the desired bound |x0 y| ≤ kxkkyk.
Solution to Exercise 3.6.14. If N = K, then, in view of the full column rank as-
sumption and theorem 1.3.3 on page 23, the matrix X is nonsingular. By fact 1.4.4 on
page 29, X0 is likewise nonsingular. Applying the usual rule for inverse of products
(fact 1.3.4 on page 24), we have
hy − x, e1 i = 0 and hy − x, e2 i = 0
b0 Ab = b0 B0 Bb = (Bb)0 Bb = kBbk2
By the properties of norms, this last term is zero only when Bb = 0. But this is not
true, because b 6= 0 and B is full column rank (see fact 1.1.6).
Solution to Exercise 3.6.22. To see that span(Vk ) ⊂ span(Bk ) for each k, observe
first that (3.8) can be rewritten as vk = bk − Bk−1 x for suitable choice of x. Hence
vk ∈ span(Bk ). Since spans increase as we add more elements, it follows that v j ∈
span(Bk ) for j ≤ k. Therefore span(Vk ) ⊂ span(Bk ).
Part 1 is immediate, because E [y] is constant (see example 3.4.7 on page 127). Re-
garding part 2, if g is any function, then by facts 2.4.1 and 2.4.2 (see page 71) we have
E [yg( x)] = E [y]E [ g( x)]. By linearity of expectations, E [y]E [ g( x)] = E [E [y] g( x)].
1. xE [y | G] is G -measurable, and
Solution to Exercise 3.6.31. By fact 3.4.9 (page 130), we know that if α is G -measurable,
then E [α | G] = α. Example 3.4.7 on page 127 tells us that α is indeed G -measurable.
The first claim is obvious. Regarding (3.18), let h be any such function. Using the
notation in (2.22) on page 70, we can write
Z
E [ g( x)h( x)] = E tp(t | x )dt h( x )
Z Z
= tp(t | s)dt h(s) p(s)ds
p(s, t)
Z Z
= t dt h(s) p(s)ds
p(s)
Z Z
= t h(s) p(s, t)dtds
This is equal to the right-hand side of (3.18), and the proof is done.
Part I
Appendices
142
Chapter 4
Appendix A: Analysis
4.1 Sets
In the course we often refer to the real numbers. This set is denoted by R, and we
understand it to contain “all of the numbers.” R can be visualized as the “continu-
ous” real line:
143
CHAPTER 4. APPENDIX A: ANALYSIS 144
A B
( a, b) := { x ∈ R : a < x < b}
[ a, b] := { x ∈ R : a ≤ x ≤ b}
We also use half open intervals such as [ a, b) := { x ∈ R : a ≤ x < b}, half lines such
as (−∞, b) = { x ∈ R : x < b}, and so on.
Let S be a set and let A and B be two subsets of S, as illustrated in figure 4.1. The
union of A and B is the set of elements of S that are in A or B or both:
A ∪ B := { x ∈ S : x ∈ A or x ∈ B}
Here and below, “or” is used in the mathematical sense. It means “and/or”. The
intersection of A and B is the set of all elements of S that are in both A and B:
A ∩ B := { x ∈ S : x ∈ A and x ∈ B}
A \ B := { x ∈ S : x ∈ A and x ∈
/ B}
Ac := S \ A :=: { x ∈ S : x ∈
/ A}
CHAPTER 4. APPENDIX A: ANALYSIS 145
Here x ∈
/ A means that x is not an element of A. Figure 4.2 illustrate these defini-
tions.
For example, since R consists of the irrationals I and the rationals Q, we have
Q ⊂ R, I ⊂ R, Q ∪ I = R, Qc = I, etc.
Also,
N := {1, 2, 3, . . .} ⊂ Q ⊂ R
The empty set is, unsurprisingly, the set containing no elements. It is denoted by ∅.
If the intersection of A and B equals ∅, then A and B are said to be disjoint.
The next fact lists some well known rules for set theoretic operations.
Fact 4.1.1. Let A and B be subsets of S. The following statements are true:
1. A ∪ B = B ∪ A and A ∩ B = B ∩ A.
3. A \ B = A ∩ Bc .
4. ( Ac )c = A.
4.2 Functions
There are two fundamental primitives in mathematics: sets and functions.2 A brief
discussion of sets is given in §4.1. Here we recall some basic definitions concerning
functions.
Given arbitrary sets A and B, a function or map f from A to B is a rule that associates
to each element a of A one and only one element of B. This element of B is usually
called the image of a under f , and written f ( a). If we write f : A → B, this means
that f is a function from A to B.
Example 4.2.1. Think of the hands on an old school clock. If we know it’s morning,
then each position of the two hands is associated with one and only one time. If we
don’t know it’s morning, one position of the hands is associated with two possible
times, in am and pm. The relationship is no longer functional.
Consider figure 4.3. Bottom left is not a function because the middle point on the
left-hand side is associated with two different points (images). Bottom right is not
a function because the top point on the left-hand side is not associated with any
image. From the definition, this is not allowed. Top right and top left are both
functions.
If f : A → B and g : B → C then the function h : A → C defined by h( a) = g( f ( a))
is called the composition of g and f , and written as g ◦ f . For example, if f : R → R
is defined by f ( x ) = e x :=: exp( x ) and g : R → R is defined by g( x ) = x2 , then
( g ◦ f )( x ) = g( f ( x )) = exp(2x )
{b ∈ B : f ( a) = b for some a ∈ A}
2 Actually functions can be represented as a special type of set containing ordered pairs, and hence
in pure mathematics cannot claim to be as foundational as sets. However, for our purposes, it will
be fine to think of a function as a primitive in its own right, defined as a certain kind of “rule.”
CHAPTER 4. APPENDIX A: ANALYSIS 147
A B A B
function function
A B A B
is called the range of f , and written as rng( f ). Thus b is in the range of f if it has at
least one preimage.
Figure 4.4 graphs a one-dimensional function f : [0, 1] → R. The red interval repre-
sents the range of f . Also shown is the preimage x of a point b ∈ rng( f ).
A vast multitude of mathematical problems come down to finding the x such that
f ( x ) = b for given function f and constant (or vector, or any other object) b. There
are two things that can go wrong here. One is that such an x might not be uniquely
defined; in other words, there are multiple preimages of b under f . The other po-
tential problem is lack of existence. This happens if b lies outside the range of f .
Figure 4.5 gives an example of failure of uniqueness. Both x1 and x2 solve f ( x ) = b.
We use some additional language to keep track of when these problems will occur.
A function f : A → B is called one-to-one if f ( a) = f ( a0 ) implies that a = a0 . For
example, top right in figure 4.3 is one-to-one while top left is not (because there exist
distinct points a and a0 with f ( a) = f ( a0 )). If f : A → B and f is one-to-one, then the
equation f ( x ) = b has at most one solution. (Why?)
A function f : A → B is called onto if rng( f ) = B; this is if every b ∈ B has at
lease one preimage. For example, top left in figure 4.3 is onto but top left is not. If
f : A → B and f is onto, then the equation f ( x ) = b has at least one solution. (Why?)
CHAPTER 4. APPENDIX A: ANALYSIS 148
0 x 1
0 x1 x2 1
A B
f
f −1
f
f −1
A B C
f g
f −1 g −1
Let { xn }∞
n=1 be a sequence of real numbers. (For each n = 1, 2, . . . we have a corre-
sponding xn ∈ R.) We say that xn converges to 0 if, given any neighborhood of 0,
the sequence points are eventually in that neighborhood. More formally (we won’t
use the formal definition, so feel free to skip this), given any e > 0, there exists an
N ∈ N such that | xn | < e whenever n ≥ N. Symbolically, xn → 0.
Now let {xn }∞ n=1 be a sequence of vectors in R . We say that xn converges to x ∈ R
N
f ( a∗ ) ≥ f ( a) for all a ∈ A
CHAPTER 4. APPENDIX A: ANALYSIS 151
It’s easy to see why this is the case. Let a ∈ A. Since a∗ is a maximizer of f , it must
be the case that f ( a) ≤ f ( a∗ ). Since m is monotone increasing, this implies that
m( f ( a)) ≤ m( f ( a∗ )). Given that a was chosen arbitrarily, we have now shown that
g( a∗ ) ≥ g( a) for all a ∈ A
[2] Bishop, C.M. (2006): Pattern Recognition and Machine Learning, Springer.
[3] Casella, G. and R. L. Berger (1990): Statistical Inference, Duxbury Press, CA.
[7] Davidson, R., and MacKinnon, J. G. (1993): Estimation and Inference in Economet-
rics, OUP.
[8] Davidson, R., and MacKinnon, J. G. (2009): Econometric Theory and Methods,
OUP.
[9] Devroye, Luc and Gabor Lugosi (2001): “Combinatorial Methods in Density
Estimation” Springer-Verlag, New York.
[12] Evans, G. and S. Honkapohja (2005): “An Interview with Thomas Sargent,”
Macroeconomic Dynamics, 9, 561–583.
[13] Freedman, D. A. (2009): Statistical Models: Theory and Practice, Cambridge UP.
153
BIBLIOGRAPHY 154
[19] Jones, G. L. (2004): ”On the Markov chain central limit theorem,” Probability
Surveys, 1, 5-1, 299–320.
[21] Marcus, M. and H. Minc (1965): Introduction to Linear Algebra. Dover Publica-
tions, N.Y.
[22] Olley, S. and A. Pakes (1996): “The Dynamics of Productivity in the Telecom-
munications Equipment Industry,” Econometrica, 64 (6), 1263-97.
[23] Negri, I and Y. Nishiyama (2010): “Review on Goodness of Fit Tests for Ergodic
Diffusion processes by Different Sampling Schemes,” Economic Notes, 39, 91–
106.
[25] Stachurski, J. (2009): Economic Dynamics: Theory and Computation, MIT Press.
[26] Van der Vaart, A. W. (2000): Asymptotic statistics, Cambridge university press.
[30] Wooldridge, J. M. (2010): Econometric analysis of cross section and panel data, MIT
press.
Index
Basis, 13 Eigenpair, 30
Bernoulli random variable, 56 Eigenvalue, 30
Bijection, 149 Eigenvector, 30
Binary random variable, 56 Empty set, 145
Event, 47
Cauchy-Schwarz inequality, 5, 60 Expectation, 58
cdf, 61 Expectation, vector, 84
Central limit theorem, 80
Chebyshev’s inequality, 60 F-distribution, 69
Chi-squared distribution, 69 Full column rank, 21
Column space, 21
Gaussian distribution, 68
Column vector, 18
Gradiant vector, 89
Complement, 144
Gram Schmidt orthogonalization, 119
Composition of functions, 146
Conditional density, 70 Idempotent, 29
Conditional expectation, 128 Identity matrix, 18
Conditional probability, 50 Image, 147
Convergence in distribution, 77, 86 Independence, of events, 50
Convergence in mean square, 75 Independence, of r.v.s, 71
Convergence in probability, 75, 86 Indicator function, 56
Covariance, 72 Infimum, 151
Cumulative distribution function, 61 Information set, 126
Inner product, 2
Delta method, 82
Inner product in L2 , 123
Density, 63
Integrable random variable, 59
Determinant, 24
Intersection, 144
Diagonal matrix, 27
Inverse matrix, 23
Diagonalizable matrix, 32
Inverse transform method, 92
Dimension, 13
Invertible matrix, 23
Disjoint sets, 145
Irrational numbers, 143
155
INDEX 156
Symmetric, 29
Symmetric cdf, 62
Symmetric matrix, 18
Trace, 29
Transpose, 28
Triangle inequality, 5
Underdetermined system, 26
Uniform distribution, 68
Union, 144
Upper triangular, 28