Mathematical Statistics: Probability & Distributions
Jon Warren
Contents
1 Probability Distributions
1.1 One random variable: discrete and continuous distributions
1.2 Expectation and variance
1.3 A reminder of some important distributions
1.4 Many random variables and joint distributions
1.5 Dependence and Independence
1.6 Covariance
1.7 Transformations of densities
3 Conditioning
3.1 Conditional distributions
3.2 Regression
3.3 The law of total probability
3.4 Priors and posteriors
3.5 Conditional expectation
3.6 The Tower property
Chapter 1
Probability Distributions
1.1 One random variable: discrete and continuous distributions
A random variable is a quantity measured in an experiment having random outcomes
whose value depends on the outcome of that experiment. More formally, if Ω is the sample
space that is used to model the experiment, then a random variable X is a function
X : Ω → R.
Then X(ω) is the value you observe for X if the outcome of the experiment corresponds
to the sample point ω. When modelling the experiment, probabilities are assigned
to events by a probability measure P on the sample space Ω: the probability of an event
A ⊂ Ω is denoted by P(A).
Given a probability measure P on the sample space Ω we can determine the probability
of observing different values for X. If B ⊆ R then the probability that X takes a
value in B is denoted by P(X ∈ B), which is shorthand for
\[ P(\{\omega \in \Omega : X(\omega) \in B\}). \]
For instance, P(X ≤ x) is shorthand for P({ω ∈ Ω : X(ω) ≤ x}),
which illustrates the need to distinguish the random variable X and the ordinary variable
x denoting some particular, but unspecified, value. When we let the subset B vary we
end up with a function B ↦ P(X ∈ B) which defines a probability measure (or simply
probability distribution) on R. It is the distribution of X.
It's important to appreciate the fundamental difference between the concept of a random
variable, which is a function on the sample space, and its distribution, which is a
function defined on subsets of R.
How do we describe probability measures (or distributions) on R? A very useful
analogy is to think about distributions of mass along R. We distinguish two types of
distribution. The mass could be concentrated at certain selected points, or it could be
“smeared” along the line. These two possibilities correspond to discrete and continuous
probability distributions which we will now formally define.
Definition 1. A random variable X has a discrete distribution if there exists a finite or
countably infinite set of values {x1, x2, . . . , xk, . . .} such that
\[ P(X \in \{x_1, x_2, \ldots, x_k, \ldots\}) = 1. \]
In this case there exist corresponding probabilities p1, p2, . . . summing to 1 such that
P(X = xi) = pi for all i.
If pi > 0 for all i then the set X = {x1 , x2 , . . .} is called the support of the distribution
and the function fX : X → [0, 1] defined by
fX (xi ) = pi for all i,
is called the probability mass function.
Knowing the support and probability mass function determines the distribution entirely
because for B ⊆ R,
\[ P(X \in B) = \sum_{x \in B \cap \mathcal{X}} f_X(x). \]
Example 1. Suppose a die is rolled twice and let X denote the score obtained on the first
roll, and Y that obtained on the second roll. What’s the distribution of X + Y ?
We model the experiment with the sample space
Ω = {(i, j) ∈ Z2 : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6},
and define X and Y via
X(ω) = i, and Y (ω) = j for ω = (i, j) ∈ Ω.
To compute P(X + Y = 6) we identify the event
{ω ∈ Ω : (X + Y )(ω) = 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
Thus, counting the sample points, since we assume each sample point has an equal
probability, P(X + Y = 6) = 5/36. For k = 2, 3, . . . , 12 similar considerations give P(X + Y = k),
and we conclude the distribution is discrete with support {2, 3, . . . , 12} and mass function
\[ f_{X+Y}(k) = \begin{cases} (k-1)/36 & \text{for } k = 2, 3, \ldots, 7, \\ (13-k)/36 & \text{for } k = 8, 9, \ldots, 12. \end{cases} \]
Notice how we can combine the random variables X and Y using algebraic operations such
as summation, using the fact they are functions. Notice too that this doesn't correspond
to “summing” the distributions: fX+Y ≠ fX + fY. In fact the distribution of X + Y is
computed from the distributions of each of X and Y using an operation called convolution.
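The counting argument of Example 1 can be checked mechanically. The following is a minimal Python sketch (the function names are illustrative and not part of these notes). It builds the mass function of X + Y in two ways: by enumerating the 36 equally likely sample points, and by convolving the mass function of a single fair die with itself. The convolution route is valid here because the two rolls are independent.

```python
from fractions import Fraction
from itertools import product


def pmf_of_sum_by_enumeration(sides=6):
    """Mass function of X + Y, found by counting equally likely sample points (i, j)."""
    pmf = {}
    weight = Fraction(1, sides * sides)
    for i, j in product(range(1, sides + 1), repeat=2):
        pmf[i + j] = pmf.get(i + j, Fraction(0)) + weight
    return pmf


def convolve(f, g):
    """Discrete convolution of two mass functions given as {value: probability} dicts."""
    out = {}
    for x, px in f.items():
        for y, py in g.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out


def closed_form(k):
    """The piecewise formula for the mass function of X + Y stated in Example 1."""
    return Fraction(k - 1, 36) if k <= 7 else Fraction(13 - k, 36)


die = {i: Fraction(1, 6) for i in range(1, 7)}
by_counting = pmf_of_sum_by_enumeration()
by_convolution = convolve(die, die)

for k in range(2, 13):
    print(k, by_counting[k], by_convolution[k], closed_form(k))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison with the formula is not blurred by rounding.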
Our second type of distribution is characterised by the fact the mass is “smeared out”
along the line, with the probability of observing any specific value given in advance always
being 0. So in order to describe the distribution we give the probability of seeing a value
falling in given intervals.
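Definition 2. A random variable X has a continuous distribution if there exists a function
fX : R → [0, ∞) so that
\[ P(a \le X \le b) = \int_a^b f_X(x)\,dx \quad \text{for all } a \le b. \]
Such a function fX is called a probability density function (p.d.f.) of X.
A probability density function always satisfies
\[ \int_{-\infty}^{\infty} f_X(x)\,dx = 1. \]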
Once again, knowing the p.d.f. determines the distribution entirely because for B ⊆ R,
\[ P(X \in B) = \int_B f_X(x)\,dx, \]
which in the case of B = [a, b] reduces to the interpretation of the definite integral as an
area under the graph of fX.
Suppose the density fX is continuous at x = a. Then for x close to a, fX(x) ≈ fX(a)
and hence for small h,
\[ \int_a^{a+h} f_X(x)\,dx \approx h f_X(a). \]
Thus
\[ f_X(a) \approx h^{-1} P(a \le X \le a + h), \]
which becomes an exact equality in the limit h ↓ 0. This explains the terminology
“density”: it is (the limit of) a probability divided by a length (see footnote 3). Notice that a density,
unlike a probability, can be greater than 1.
Example 2. Suppose (X, Y ) are the co-ordinates of the point chosen uniformly at random
from the square with side length L,
Ω = {(x, y) ∈ R2 : 0 ≤ x ≤ L, 0 ≤ y ≤ L}.
The random variable X is the function X(ω) = x for ω = (x, y) ∈ Ω. It has a continuous
distribution. If we consider 0 ≤ a ≤ b ≤ L, then P(a ≤ X ≤ b) is equal to the ratio of the area
of the rectangle {(x, y) : a ≤ x ≤ b, 0 ≤ y ≤ L} to the area of the whole sample space Ω,
and so is (b − a)/L. On the other hand, if a ≤ b ≤ 0 or L ≤ a ≤ b, then P(a ≤ X ≤ b) = 0,
so we see that if we take
\[ f_X(x) = \begin{cases} 1/L & \text{if } 0 \le x \le L, \\ 0 & \text{otherwise,} \end{cases} \]
then
\[ P(a \le X \le b) = \int_a^b f_X(x)\,dx \quad \text{for all } a \le b. \]
Notice that the value of the density at x = 0 and x = L could equally be defined to be 0,
and this formula would still hold. This illustrates that densities are not unique and can
be changed arbitrarily at a finite (or countable) set of values.
3
Densities in the physical world usually involve dividing mass by volume. But that’s because the mass
is distributed in the three dimensions of the physical universe, not along a line as here.
Example 3. This is a continuation of the previous example. Consider the random variable
max(X, Y). It also has a continuous distribution. If we consider 0 ≤ a ≤ b ≤ L, then
P(a ≤ max(X, Y) ≤ b) is proportional to the area of the set {(x, y) ∈ Ω : a ≤ max(x, y) ≤ b},
and so is (b^2 − a^2)/L^2. On the other hand, if a ≤ b ≤ 0 or L ≤ a ≤ b, then
P(a ≤ max(X, Y) ≤ b) = 0, so we see that we can take as density the function
\[ f_{\max(X,Y)}(x) = \begin{cases} 2x/L^2 & \text{if } 0 \le x \le L, \\ 0 & \text{otherwise.} \end{cases} \]
Just check that when we calculate a definite integral of this function we get the right
expression for P(a ≤ max(X, Y) ≤ b) (see footnote 4).
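The check suggested in Example 3 can also be done numerically. Below is a minimal Python sketch, with illustrative function names and arbitrarily chosen values L = 2, a = 0.5, b = 1.5. It compares three quantities: the formula (b^2 − a^2)/L^2, a midpoint-rule integral of the density 2x/L^2 over [a, b], and a seeded Monte Carlo estimate obtained by sampling points uniformly from the square.

```python
import random


def density_max(x, side):
    """Density of max(X, Y) for a point chosen uniformly from the square [0, side]^2."""
    return 2 * x / side**2 if 0 <= x <= side else 0.0


def prob_from_density(a, b, side, steps=10_000):
    """Midpoint-rule approximation of the integral of the density over [a, b]."""
    h = (b - a) / steps
    return sum(density_max(a + (i + 0.5) * h, side) for i in range(steps)) * h


def prob_by_simulation(a, b, side, n=200_000, seed=0):
    """Fraction of simulated points whose larger co-ordinate lies in [a, b]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        m = max(rng.uniform(0, side), rng.uniform(0, side))
        if a <= m <= b:
            hits += 1
    return hits / n


side, a, b = 2.0, 0.5, 1.5
exact = (b**2 - a**2) / side**2
integrated = prob_from_density(a, b, side)
simulated = prob_by_simulation(a, b, side)
print(exact, integrated, simulated)
```

The simulated value is only an estimate and fluctuates with the seed; the integral agrees with the formula up to rounding.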
4
And to derive the formula for the density in the first place, we could put a = 0 and differentiate with
respect to b to obtain fmax(X,Y)(b), from the formula for P(a ≤ max(X, Y) ≤ b) when b is between 0 and
L.
1.2 Expectation and variance
The expectation or expected value of a random variable X, denoted by E[X] is the mean
of its distribution: loosely speaking the average of the possible values of X weighted
according to their probability. Classically it is also the basis for the idea of a game being
fair: if you win a random amount X but have to pay a fixed amount in order to play, then
the game is considered fair if you pay exactly E[X]. It is also an analogue of the centre
of mass (or gravity): if a probability distribution is considered as a description of mass
distributed along a line, then its mean is the point along the line at which the distribution
would balance if pivoted there.
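In general we have the following definition.
Definition 3. If X has a discrete distribution with support X and mass function fX, then
\[ E[X] = \sum_{x \in \mathcal{X}} x f_X(x), \]
provided this sum is absolutely convergent. If X has a continuous distribution with density fX, then
\[ E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx, \]
provided this integral is absolutely convergent.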
A rather trivial, but nevertheless important, case of the definition of E[X] for a discrete
distribution is the following. Suppose that there is a non-random c ∈ R so that
P(X = c) = 1; then E[X] = c. This is because such an X has a discrete distribution with support
{c} and mass function fX(c) = 1.
The next two propositions state very important properties of expectations, which we
will use frequently. We won’t give proofs.
(1) Positivity. Suppose that P(Z ≥ 0) = 1; then E[Z] ≥ 0. If additionally P(Z > 0) > 0
then E[Z] > 0. Suppose that P(Z1 ≥ Z2) = 1; then E[Z1] ≥ E[Z2].
If X and Y are independent (see footnote 5) random variables whose expectations exist, then
E[XY] = E[X]E[Y].
A natural question is to find a quantity that helps describe the spread or dispersion of
a distribution. Compare the uniform distribution on the interval [−1/2, +1/2] with the
uniform distribution on the interval [−2, +2]. Both have mean µ = 0 but on average
the value of a random variable with the second distribution is further away from 0 than
the value of a random variable with the first distribution. This suggests measuring the
dispersion of the distribution of a random variable X with E[|X − µ|] where µ = E[X].
Notice that the modulus inside the expectation is essential: E[X − µ] = 0!
In fact a better choice to measure dispersion is E[(X − µ)^2], the average squared
distance that the random variable is from its mean. We will also consider its square
root. One way of seeing how mathematically natural this quantity is, is to consider its
generalization
\[ \sqrt{E[(X - Y)^2]} \]
where Y is another random variable. In some sense this measures how far apart the values
of X and Y tend to be. It is the natural analogue of the quantity
\[ \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
which measures the Euclidean distance between points (x1, x2, . . . , xn) and (y1, y2, . . . , yn)
in R^n.
5
Independence is a very important relationship which may hold between random variables. You will
have met it in ST115, and we will look at the concept again in more detail in a few lectures' time.
In the meantime recall that independent random variables arise when modelling quantities that do not
influence each other, such as in repeated rolls of a die.
Definition 4. We define the variance of a random variable X by
\[ \mathrm{var}(X) = E[(X - \mu)^2], \]
where µ = E[X], and the standard deviation of X to be
\[ \sqrt{E[(X - \mu)^2]}, \]
provided these expectations exist.
Proposition 3. Assuming the expectations all exist, then for any random variable X,
\[ \mathrm{var}(X) = E[X^2] - E[X]^2 \ge 0. \]
Moreover var(X) = 0 if and only if P(X = µ) = 1 where µ = E[X].
Proof.
\[ \mathrm{var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - E[X]^2, \]
by linearity and using E[µ^2] = µ^2 because µ^2 is a non-random constant. Since
P((X − µ)^2 ≥ 0) = 1 we have by positivity of expectation that E[(X − µ)^2] ≥ 0, with equality
only if P((X − µ)^2 = 0) = 1, that is, only if P(X = µ) = 1.
The next proposition will be used many times in the future; the calculation technique
of expanding the square of a sum within the expectation that we use in its proof is very
common too.
Proposition 4. Suppose that X1, X2, . . . , Xn are independent random variables. Then
\[ \mathrm{var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \mathrm{var}(X_i). \]
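The technique of expanding the square of a sum inside the expectation can be sketched for the case n = 2. This is only an illustration under the assumption that all the expectations exist, not necessarily the argument intended in these notes; the symbols µ1 = E[X1] and µ2 = E[X2] are introduced here for convenience.

```latex
\begin{align*}
\mathrm{var}(X_1 + X_2)
  &= E\big[\big((X_1 - \mu_1) + (X_2 - \mu_2)\big)^2\big] \\
  &= E[(X_1 - \mu_1)^2] + 2\,E[(X_1 - \mu_1)(X_2 - \mu_2)] + E[(X_2 - \mu_2)^2] \\
  &= \mathrm{var}(X_1) + 2\,E[X_1 - \mu_1]\,E[X_2 - \mu_2] + \mathrm{var}(X_2) \\
  &= \mathrm{var}(X_1) + \mathrm{var}(X_2).
\end{align*}
```

The second line uses linearity of expectation, the third uses E[XY] = E[X]E[Y] for the independent random variables X1 − µ1 and X2 − µ2, and the last uses E[X1 − µ1] = 0.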
1.3 A reminder of some important distributions
The Binomial distribution with parameters p ∈ [0, 1] and integer n ≥ 1. Suppose that
A1, A2, A3, . . . , An is a sequence of independent events with P(Ai) = p for each i. Let X
be the number of these events that occur. Then X has the Binomial distribution with
parameters n and p. This distribution has support {0, 1, 2, . . . , n} and mass function
\[ f_X(k) = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, 1, 2, \ldots, n. \]
Writing X as the sum of the indicator random variables of the events A1, A2, . . . , An, where
\[ \mathbf{1}_{A_i} = \begin{cases} 1 & \text{if } A_i \text{ occurs,} \\ 0 & \text{otherwise,} \end{cases} \]
we can find the mean and variance of the Binomial distribution using linearity of
expectation and Proposition 4. We find that E[X] = np and var(X) = np(1 − p).
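As a sanity check on the values E[X] = np and var(X) = np(1 − p), the mean and variance can be computed directly from the mass function. This is a minimal Python sketch with illustrative names and an arbitrary choice of n = 10 and p = 0.3; it sums over the support rather than using indicators.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Mass function of the Binomial distribution with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def mean_and_variance(n, p):
    """Mean and variance computed by summing over the support {0, 1, ..., n}."""
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    second_moment = sum(k * k * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean, second_moment - mean**2


n, p = 10, 0.3
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean, variance = mean_and_variance(n, p)
print(total)                       # the probabilities sum to 1 up to rounding
print(mean, n * p)                 # both close to 3
print(variance, n * p * (1 - p))   # both close to 2.1
```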
The Geometric distribution with parameter p ∈ (0, 1]. Suppose that (Ai ; i ≥ 1)
is an infinite sequence of independent events with P(Ai ) = p for each i. Let the random
variable X be defined by
X + 1 = min{k : Ak occurs}
Then X has a geometric distribution with support {0, 1, 2, . . .} and mass function
\[ f_X(k) = p(1 - p)^k \quad \text{for } k = 0, 1, 2, \ldots \]
By direct calculation we find that the mean and variance of the geometric distribution
are E[X] = (1 − p)/p and var(X) = (1 − p)/p^2.
The Exponential distribution with parameter α > 0. The continuous analogue of
the geometric distribution, which is a natural model for a waiting time, is the exponential
distribution with density
\[ f_X(x) = \begin{cases} \alpha e^{-\alpha x} & \text{for } x \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
By direct calculation we find that the mean and variance of the exponential distribution
are E[X] = 1/α and var(X) = 1/α2 .
The Gamma distribution. Suppose that X1 , X2 , . . . , Xn are independent random
variables each having the exponential distribution with parameter α. Then the sum
Sn = X1 + X2 + . . . + Xn has the Gamma distribution with parameters n (called shape) and
α (called rate), which has density
\[ f_{S_n}(x) = \begin{cases} \alpha^n \dfrac{x^{n-1}}{(n-1)!} e^{-\alpha x} & \text{for } x \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
Using linearity of expectation and Proposition 4, we can deduce the mean and variance of
Sn from the mean and variance of the exponential distribution and find that
E[Sn] = n/α and var(Sn) = n/α^2.
The Poisson Distribution. A random variable N has the Poisson distribution with
parameter λ if it has a discrete distribution supported on {0, 1, 2, . . .} with
\[ P(N = k) = \frac{\lambda^k}{k!} e^{-\lambda} \quad \text{for } k = 0, 1, 2, \ldots \]
By direct calculation we find that the mean and variance of the Poisson distribution are
E[N ] = var(N ) = λ. The Poisson distribution arises when modelling points in space or
instants of time occurring at random: for example the instants of time at which a
radioactive sample emits a particle. Suppose that X1, X2, . . . is a sequence of independent
random variables, each having the exponential distribution with parameter α. For a fixed
t ≥ 0 define a random variable N via
\[ N = \begin{cases} 0 & \text{if } X_1 > t, \\ \max\{k : X_1 + X_2 + \ldots + X_k \le t\} & \text{if } X_1 \le t. \end{cases} \]
Then N has the Poisson distribution with parameter λ = αt.
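The construction of N from exponential waiting times is easy to simulate. The sketch below is illustrative only: the function names, the seed, and the values α = 2 and t = 1.5 are arbitrary choices. It relies on the standard result that N is then Poisson with parameter αt, so the sample mean and sample variance should both be near αt = 3 and the proportion of zeros near e^{−3}.

```python
import random
from math import exp


def count_arrivals(alpha, t, rng):
    """N = max{k : X_1 + ... + X_k <= t}, with N = 0 if X_1 > t."""
    total, k = 0.0, 0
    while True:
        total += rng.expovariate(alpha)  # exponential waiting time with rate alpha
        if total > t:
            return k
        k += 1


def simulate_counts(alpha, t, trials=100_000, seed=1):
    """Independent copies of N, generated from a seeded random number generator."""
    rng = random.Random(seed)
    return [count_arrivals(alpha, t, rng) for _ in range(trials)]


alpha, t = 2.0, 1.5
samples = simulate_counts(alpha, t)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
freq_zero = samples.count(0) / len(samples)
print(sample_mean, sample_var, alpha * t)
print(freq_zero, exp(-alpha * t))
```

Agreement of the mean and variance is consistent with the Poisson claim but does not prove it; comparing the whole empirical mass function with λ^k e^{−λ}/k! would be a stronger check.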
1.4 Many random variables and joint distributions
Suppose that a pair of random variables X and Y are defined on the same sample space,
and so correspond to two quantities measured in the same experiment. We can ask how
the value observed for X relates to the value observed for Y , if it does at all. For example,
in tossing a coin twice, we may take the sample space {H, T }2 and consider the random
variables
X(HH) = 2 Y (HH) = 0,
X(HT ) = X(T H) = 1 Y (HT ) = Y (T H) = 1,
X(T T ) = 0 Y (T T ) = 2.
X and Y are different functions on the sample space representing the number of heads
tossed and the number of tails respectively. Both have the Binomial distribution with
parameters n = 2 and p = 1/2 (if the coin is unbiased). The notion of the joint distribution
of X and Y captures the relationship between X and Y.
In general the joint distribution of two random variables X and Y which are defined
on the same sample space is the mapping
\[ B \mapsto P((X, Y) \in B), \quad B \subseteq \mathbb{R}^2. \]
This is a probability distribution (or probability measure) on R^2. Similarly the joint
distribution of n random variables X1, X2, . . . , Xn is the mapping
\[ B \mapsto P((X_1, X_2, \ldots, X_n) \in B), \quad B \subseteq \mathbb{R}^n. \]
Definition 5. The joint distribution of random variables X and Y is discrete if there
exist two finite or countably infinite sets X = {x1 , x2 , . . .} and Y = {y1 , y2 , . . .} such that
P(X ∈ X and Y ∈ Y) = 1.
In this case we define their joint mass function to be
fXY (x, y) = P(X = x and Y = y) for each pair (x, y) ∈ X × Y.
This definition extends easily to the case of n random variables.
In the example of coin tossing given above, the joint mass function is defined on
{0, 1, 2} × {0, 1, 2} via
fXY (0, 2) = 1/4,
fXY (1, 1) = 1/2,
fXY (2, 0) = 1/4,
and fXY (x, y) = 0 in all other cases.
Definition 6. The joint distribution of random variables X and Y is continuous if there
exists a joint density function fXY : R^2 → [0, ∞) so that
\[ P(a \le X \le b \text{ and } c \le Y \le d) = \int_c^d \int_a^b f_{XY}(x, y)\,dx\,dy \quad \text{for all } a \le b \text{ and } c \le d. \]
Note that the definite integral appearing in this definition has the geometric interpre-
tation
Volume({(x, y, z) : a ≤ x ≤ b, c ≤ y ≤ d and 0 ≤ z ≤ fXY (x, y)}).
More generally if X and Y have a continuous joint distribution with density fXY then
\[ P((X, Y) \in B) = \iint_B f_{XY}(x, y)\,dx\,dy = \mathrm{Volume}(\{(x, y, z) : (x, y) \in B \text{ and } 0 \le z \le f_{XY}(x, y)\}). \]
Again this definition extends easily to the case of n random variables having a continuous
joint distribution.
We are already familiar with a pair of random variables having a continuous joint
distribution. If (X, Y) are the co-ordinates of a point chosen uniformly at random from
the square [0, L]^2 as in Example 2, then their joint distribution has the density
\[ f_{XY}(x, y) = \begin{cases} 1/L^2 & \text{if } 0 \le x \le L,\ 0 \le y \le L, \\ 0 & \text{otherwise.} \end{cases} \]
To see that this is the right density, note that for this choice of density,
\[ P((X, Y) \in B) = \mathrm{Volume}(\{(x, y, z) : (x, y) \in B \text{ and } 0 \le z \le 1/L^2\}) = \frac{1}{L^2}\,\mathrm{Area}(B \cap [0, L]^2), \]
which agrees with the notion that the point is chosen uniformly at random from the
square.
It is easy to recover the (marginal) distributions of each of X and Y from their joint
mass function or joint density.
If X and Y have a discrete joint distribution with mass function fXY defined on X × Y
then X has a discrete distribution with mass function
\[ f_X(x) = \sum_{y \in \mathcal{Y}} f_{XY}(x, y) \quad \text{for each } x \in \mathcal{X}. \]
This is because we have the equality between events
\[ \{X = x\} = \bigcup_{y \in \mathcal{Y}} \{X = x \text{ and } Y = y\}. \]
Analogously if X and Y have a continuous joint distribution with density function
fXY then X has a continuous distribution with density given by
\[ f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy. \]
Example 4. Suppose that random variables X and Y have a discrete joint distribution
with support {0, 1, 2, . . .}^2 and mass function
\[ f_{XY}(k, l) = p_1 p_2 (1 - p_1)^k (1 - p_2)^l. \]
Then
\[ P(Y > X) = \sum_{\{(k,l) : 0 \le k < l\}} f_{XY}(k, l) = \sum_{k=0}^{\infty} \sum_{l=k+1}^{\infty} p_1 p_2 (1 - p_1)^k (1 - p_2)^l = \sum_{k=0}^{\infty} p_1 (1 - p_1)^k (1 - p_2)^{k+1} = \frac{p_1 (1 - p_2)}{1 - (1 - p_1)(1 - p_2)}. \]
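The closed form for P(Y > X) in Example 4 can be compared with a truncated version of the double sum. The Python sketch below uses illustrative names, the arbitrary values p1 = 0.3 and p2 = 0.5, and a cutoff of 200 terms; the cutoff is an assumption chosen so that the neglected geometric tails are far below floating-point precision for these parameters.

```python
def joint_pmf(k, l, p1, p2):
    """Joint mass function of Example 4 at the point (k, l)."""
    return p1 * p2 * (1 - p1) ** k * (1 - p2) ** l


def prob_y_greater_than_x(p1, p2, cutoff=200):
    """Truncated double sum of the joint mass function over 0 <= k < l < cutoff."""
    return sum(
        joint_pmf(k, l, p1, p2)
        for k in range(cutoff)
        for l in range(k + 1, cutoff)
    )


def closed_form(p1, p2):
    """The expression for P(Y > X) obtained by summing the geometric series."""
    return p1 * (1 - p2) / (1 - (1 - p1) * (1 - p2))


p1, p2 = 0.3, 0.5
numeric = prob_y_greater_than_x(p1, p2)
formula = closed_form(p1, p2)
print(numeric, formula)
```

For parameters close to 0 the tails decay slowly and a larger cutoff would be needed.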
Example 5. Suppose that X and Y have a continuous joint distribution with density
(
k(y − x) if 0 ≤ x ≤ y ≤ 1,
fXY (x, y) =
0 otherwise,
where k ∈ R is some constant. Find k, compute P(Y ≥ 2X), and find the marginal
distribution of X.
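We can find k via the calculation
\[ 1 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = \int_0^1 \left( \int_0^y k(y - x)\,dx \right) dy = \frac{k}{2} \int_0^1 y^2\,dy = \frac{k}{6}, \]
so k = 6. The density of the marginal distribution of X is found by integrating out y: for 0 ≤ x ≤ 1,
\[ f_X(x) = \int_x^1 6(y - x)\,dy = 3(1 - x)^2. \]
So X has density
\[ f_X(x) = \begin{cases} 3(1 - x)^2 & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]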
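Example 5 also asks for P(Y ≥ 2X). One way to compute it, sketched here under the value k = 6 that the normalisation forces, is to integrate the joint density over the part of the triangle 0 ≤ x ≤ y ≤ 1 on which x ≤ y/2.

```latex
\begin{align*}
P(Y \ge 2X)
  &= \int_0^1 \left( \int_0^{y/2} 6(y - x)\,dx \right) dy
   = \int_0^1 6\left( \frac{y^2}{2} - \frac{y^2}{8} \right) dy \\
  &= \int_0^1 \frac{9}{4}\, y^2\,dy
   = \frac{3}{4}.
\end{align*}
```

As a quick plausibility check, the answer exceeds 1/2, which fits a density k(y − x) that puts more mass where y is much larger than x.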
Example 6. Let (X, Y, Z) be the co-ordinates of a point chosen uniformly at random from
the cube [0, 1]^3 ⊂ R^3. Let
\[ M_1 = \min(X, Y, Z) \quad \text{and} \quad M_2 = \max(X, Y, Z). \]
Consider for 0 ≤ m1 ≤ m2 ≤ 1,
\[ F(m_1, m_2) = P(M_1 \le m_1, M_2 \le m_2) = P(M_2 \le m_2) - P(M_2 \le m_2, M_1 > m_1) = m_2^3 - (m_2 - m_1)^3, \]
and hence, differentiating first with respect to m2 and then with respect to m1,
\[ f(m_1, m_2) = \frac{\partial^2}{\partial m_1 \partial m_2} F(m_1, m_2). \]
This gives a density of
\[ f(m_1, m_2) = \begin{cases} 6(m_2 - m_1) & \text{if } 0 \le m_1 \le m_2 \le 1, \\ 0 & \text{otherwise.} \end{cases} \]
Proposition 5. If X and Y each have discrete distributions with supports X and Y
respectively and joint mass function fXY, then the expected value of g(X, Y) is given by
\[ E[g(X, Y)] = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} g(x, y) f_{XY}(x, y), \]
provided this sum is absolutely convergent, meaning
\[ \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} |g(x, y)| f_{XY}(x, y) < \infty. \]
If X and Y have a continuous joint distribution with density function fXY then the
expected value of g(X, Y) is given by
\[ E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy, \]
provided this integral is absolutely convergent, meaning
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |g(x, y)| f_{XY}(x, y)\,dx\,dy < \infty. \]
1.5 Dependence and Independence
Whereas from knowledge of the joint distribution one may determine the marginal
distributions of each of X and Y, it is not true that the two marginal distributions determine
the joint distribution!
Suppose that X and Y have distributions specified by
P(X = 0) = P(X = 1) = 1/2,
P(Y = 0) = P(Y = 1) = 1/2.
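The joint distributions which are consistent with this information are described by
\[ P(X = 0, Y = 0) = P(X = 1, Y = 1) = \theta, \qquad P(X = 0, Y = 1) = P(X = 1, Y = 0) = \tfrac{1}{2} - \theta, \]
for some θ ∈ [0, 1/2].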
Proposition 6. The joint distribution of X and Y is determined by its distribution func-
tion.
You can think of this proposition as saying: suppose you know the values of
P(X ∈ B and Y ∈ B′) for all choices of B and B′ of the form (−∞, x] and (−∞, y]; then, in
theory, you can calculate from this information the probability P(X ∈ B and Y ∈ B′)
for a general choice of B and B′. In the light of this, the next proposition, which we also
state without proof, is not surprising.
Proposition 7. X and Y are independent if and only if their joint distribution function
satisfies
\[ F_{XY}(x, y) = F_X(x) F_Y(y) \quad \text{for all } x, y \in \mathbb{R}, \]
where FX and FY are the distribution functions of X and Y respectively.
In the case we assume the joint distribution is either discrete or continuous we can
give further characterizations of independence.
Proposition 8. If the joint distribution of X and Y is discrete then their being
independent is equivalent to the mass function satisfying
\[ f_{XY}(x, y) = f_X(x) f_Y(y) \]
for all x and y in the support of X and Y respectively.
Proposition 9. If the joint distribution of X and Y is continuous then their being
independent is equivalent to some version (see footnote 8) of the joint density satisfying
\[ f_{XY}(x, y) = f_X(x) f_Y(y) \quad \text{for all } x, y \in \mathbb{R}. \]
In the examples of joint distributions given at the beginning of this lecture, we saw
two instances of random variables X and Y which are independent: firstly, when the
discrete joint distribution was given by the mass function
\[ f_{XY}(k, l) = p_1 p_2 (1 - p_1)^k (1 - p_2)^l, \]
and secondly, when the continuous joint distribution was uniform on the
square [0, 1]^2.
Example 8. Here is an example of checking independence from a given joint density.
Suppose that X and Y have a joint distribution with density
\[ f_{XY}(x, y) = \begin{cases} e^{-(x+y)} & \text{if } x \ge 0 \text{ and } y \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
8
Recall the density of a distribution isn't unique because its value can be changed arbitrarily on a
“small” set of points. So by some version of the density we mean some choice of the density.
Then the density of the marginal distribution of X is computed as
\[ f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy = \begin{cases} \int_0^{\infty} e^{-(x+y)}\,dy = e^{-x} & \text{if } x \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
Similarly we find fY(y) = e^{−y} if y ≥ 0 and zero otherwise. Thus we have fXY(x, y) =
fX(x)fY(y) for all x, y ∈ R and X and Y are independent.
Independence of more than two random variables is defined as follows. A sequence
X1, X2, . . . , Xn is a sequence of independent random variables if the events
\[ \{X_1 \in B_1\}, \{X_2 \in B_2\}, \ldots, \{X_n \in B_n\} \]
are independent for all choices of subsets B1, B2, . . . , Bn ⊆ R. All the propositions we have stated above
for a pair of random variables have straightforward generalizations to n random variables.
We finish with a proof of the convolution formula for the density of a sum of indepen-
dent random variables with continuous distributions.
Proposition 10. Suppose X and Y are independent random variables with continuous
distributions having densities fX and fY respectively. Then the random variable X + Y
has a continuous distribution with density
\[ f_{X+Y}(x) = \int_{-\infty}^{\infty} f_X(x - y) f_Y(y)\,dy \quad \text{for } x \in \mathbb{R}. \]
Proof. The probability P(X + Y ≤ z) can be written as P((X, Y ) ∈ B) for the region B
given by
B = {(x, y) ∈ R2 : x + y ≤ z}.
Since X and Y are independent their joint density is fXY (x, y) = fX (x)fY (y), and hence
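\begin{align*}
P(X + Y \le z) &= \iint_B f_{XY}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z-y} f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{z} f_X(x - y) f_Y(y)\,dx\,dy = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f_X(x - y) f_Y(y)\,dy\,dx \\
&= \int_{-\infty}^{z} f_{X+Y}(x)\,dx.
\end{align*}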
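The convolution formula can be tried out on two independent exponential random variables with the same parameter α, whose sum should have the Gamma density with shape 2 and rate α, namely α^2 x e^{−αx}. The following Python sketch is illustrative only (names and the values α = 2, z = 1.3 are arbitrary). It approximates the convolution integral with a midpoint rule, restricting the range of integration to [0, z] because both densities vanish on the negative half-line.

```python
from math import exp


def exponential_density(x, alpha):
    """Density of the exponential distribution with parameter alpha."""
    return alpha * exp(-alpha * x) if x >= 0 else 0.0


def density_of_sum(z, f, g, steps=20_000):
    """Midpoint-rule approximation of the integral of f(z - y) g(y) over 0 <= y <= z.

    Valid only when f and g are densities of non-negative random variables,
    so that the integrand is zero outside [0, z].
    """
    if z <= 0:
        return 0.0
    h = z / steps
    return sum(f(z - (i + 0.5) * h) * g((i + 0.5) * h) for i in range(steps)) * h


def gamma2_density(x, alpha):
    """Gamma density with shape n = 2 and rate alpha."""
    return alpha**2 * x * exp(-alpha * x) if x >= 0 else 0.0


alpha, z = 2.0, 1.3
f = lambda x: exponential_density(x, alpha)
numeric = density_of_sum(z, f, f)
expected = gamma2_density(z, alpha)
print(numeric, expected)
```

For densities supported on the whole real line the integration range would have to be widened, with some care about how quickly the tails decay.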
1.6 Covariance
When random variables are independent, the variance of their sum is given by the sum
of the variances of each variable, as stated in Proposition 4. So what can we say about
the variance of a sum of random variables which are not independent? We need some
more information, which comes from the joint distribution, to calculate the variance of a sum
in general.
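Definition 8. The covariance of two random variables X and Y is defined to be
\[ \mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], \]
where µX = E[X] and µY = E[Y], provided these expectations exist.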
Also if a and b are constants and X and Y are random variables, then,
cov(aX, bY ) = ab cov(X, Y ).
Notice that if X1 = X2 and Y1 = Y2 then the first result stated in the proposition
agrees with our formula for var (X + Y ), and the proof is very similar: simply write the
covariance as an expectation and use the linearity of expectations.
What properties of the joint distribution are described by cov(X, Y )? We know that
if X and Y are independent, then using Fubini as in the proof of Proposition 4 gives
cov(X, Y ) = 0. But there exist pairs of random variables X and Y that are not indepen-
dent with cov(X, Y ) = 0 also. Consider the following example.
Let (X, Y) be the co-ordinates of a point chosen uniformly at random from the unit
disc {(x, y) ∈ R^2 : x^2 + y^2 ≤ 1}. The density of the joint distribution is just
\[ f_{XY}(x, y) = \begin{cases} 1/\pi & \text{if } x^2 + y^2 \le 1, \\ 0 & \text{otherwise.} \end{cases} \]
Because the disc is symmetric E[X] = E[Y ] = 0. If this isn’t clear immediately, then
consider the density of the marginal distribution of X: it is an even function. Now
consider
\[ E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f_{XY}(x, y)\,dx\,dy. \]
Breaking this integral up into the sum of four integrals, each over one of the four quadrants
of the plane, shows it to be 0 too. Thus cov(X, Y ) = 0, and yet X and Y are not
independent.
We can develop this example to gain some insight into non-zero covariances. Suppose
we define two new random variables U and V via
U = aX + bY and V = bX + aY,
where a and b are non-random constants. Let us consider the joint distribution of U and
V: the transformation (x, y) ↦ (ax + by, bx + ay) is a linear transformation that maps the
unit disc to an ellipse. Because the transformation is linear, (X, Y) having the uniform
distribution on the disc is transformed into (U, V) having a uniform distribution on the
ellipse. In fact U and V have joint density
\[ f_{UV}(u, v) = \begin{cases} \dfrac{1}{\pi |a^2 - b^2|} & \text{if } (a^2 + b^2)(u^2 + v^2) - 4abuv \le (a^2 - b^2)^2, \\ 0 & \text{otherwise.} \end{cases} \]
The exact form of this density does not matter to us because using linearity we can
compute easily that E[U ] = E[V ] = 0, and that,
\[ \mathrm{cov}(U, V) = E[UV] = E[(aX + bY)(bX + aY)] = ab\,E[X^2] + (a^2 + b^2)\,E[XY] + ab\,E[Y^2]. \]
Recall that E[XY] = 0, and we can also compute (see footnote 9) that E[X^2] = E[Y^2] = 1/4. Thus we
obtain
\[ \mathrm{cov}(U, V) = ab/2. \]
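The values cov(X, Y) = 0 and cov(U, V) = ab/2 can be checked by simulation. This is a rough Python sketch with illustrative names: points are drawn uniformly from the unit disc by rejection from the enclosing square, and sample covariances are computed for the arbitrary choice a = 2, b = 1, for which ab/2 = 1.

```python
import random


def sample_unit_disc(rng):
    """Rejection sampling: draw from the square [-1, 1]^2 until the point lies in the disc."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y


def sample_covariance(xs, ys):
    """Average of the products of deviations from the sample means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def simulate_covariances(a, b, n=200_000, seed=2):
    """Return estimates of cov(X, Y) and cov(U, V) for U = aX + bY, V = bX + aY."""
    rng = random.Random(seed)
    points = [sample_unit_disc(rng) for _ in range(n)]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    us = [a * x + b * y for x, y in points]
    vs = [b * x + a * y for x, y in points]
    return sample_covariance(xs, ys), sample_covariance(us, vs)


a, b = 2.0, 1.0
cov_xy, cov_uv = simulate_covariances(a, b)
print(cov_xy)             # close to 0
print(cov_uv, a * b / 2)  # both close to 1
```

The estimates carry Monte Carlo error of order 1/sqrt(n), so only approximate agreement should be expected.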
Figure 1.1 shows the ellipse for varying values of the parameters a and b. There are
two cases we will consider: case (a) that a > b > 0, illustrated on the top row of the
figure and case (b) a > −b > 0, illustrated on the bottom row. In case (a) the major axis
of the ellipse is the line {v = u} and the minor axis is the line {v = −u}. In case (b) the
major axis of the ellipse is the line {v = −u} and the minor axis is the line {v = u}. Note
that the covariance of U and V is positive in case (a) and negative in case (b). In case
(a) values of U and V that are either larger than average (i.e. positive) or smaller than
average (i.e. negative) have a tendency to occur together. This leads to U and V tending
to reinforce each other in the sum U + V and var(U + V) > var(U) + var(V). In case (b),
positive values of U tend to occur with negative values of V and vice versa. Thus U and V
have a tendency to (partially) cancel in the sum U + V and var(U + V) < var(U) + var(V).
How strong these effects are depends on the shape, but not the size, of the ellipse.
Scaling a and b by some constant does not alter the shape of the ellipse but only its
size. To obtain a quantity that depends only on the shape of the ellipse we consider the
covariance of U and V appropriately scaled using their variances.
9
We found the distribution of X in lecture 2
Figure 1.1: The random variable (U, V ) is uniformly distributed within an ellipse. This
is illustrated for different values of the parameters a and b. Top left: a = 1.8, b = 1.33.
Top middle: a = 2.0, b = 1.0, Top right: a = 2.2, b = 0.4. Btm left: a = 1.8, b = −1.33.
Btm middle: a = 2.0, b = −1.0, Btm right: a = 2.2, b = −0.4.
Definition 9. Suppose X and Y are random variables with non-zero variances, then we
define the correlation of X and Y to be
\[ \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}. \]
In our example,
\[ \mathrm{var}(U) = a^2 E[X^2] + 2ab\,E[XY] + b^2 E[Y^2] = \frac{a^2 + b^2}{4} = \mathrm{var}(V), \]
and so the correlation between U and V is 2ab/(a^2 + b^2). Notice that this value always
lies in the range [−1, +1]. In Figure 1.1, with the panels ordered as in the caption, the
correlation between U and V is positive on the top row, decreasing from left to right, and
negative on the bottom row, increasing from left to right.
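The correlation 2ab/(a^2 + b^2) can be evaluated for the parameter values listed in the caption of Figure 1.1. The Python sketch below uses an illustrative function name and assumes cov(U, V) = ab/2 together with var(U) = var(V) = (a^2 + b^2)/4, which follow from E[X^2] = E[Y^2] = 1/4 and E[XY] = 0.

```python
def correlation_uv(a, b):
    """Correlation of U = aX + bY and V = bX + aY for (X, Y) uniform on the unit disc."""
    covariance = a * b / 2
    variance = (a * a + b * b) / 4  # common value of var(U) and var(V)
    return covariance / variance


# Parameter values taken from the caption of Figure 1.1.
panels = {
    "top left": (1.8, 1.33),
    "top middle": (2.0, 1.0),
    "top right": (2.2, 0.4),
    "bottom left": (1.8, -1.33),
    "bottom middle": (2.0, -1.0),
    "bottom right": (2.2, -0.4),
}
correlations = {name: correlation_uv(a, b) for name, (a, b) in panels.items()}
for name, rho in correlations.items():
    print(name, round(rho, 3))
```

Flipping the sign of b flips the sign of the correlation, which matches the top and bottom rows of the figure being mirror images.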
In fact the correlation of any pair of random variables always lies in the range [−1, +1].
This follows from the next proposition. Moreover a value of the correlation close to +1
indicates that the mass of the joint distribution is concentrated near a line in the plane
having positive gradient, and a value of the correlation close to −1 indicates that the mass
of the joint distribution is concentrated near a line in the plane having negative gradient.
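Proposition 12 (Cauchy-Schwarz inequality). For any random variables X and Y for which
E[X^2] and E[Y^2] exist,
\[ (E[XY])^2 \le E[X^2]\,E[Y^2]. \]
Proof. Consider, for t ∈ R,
\[ 0 \le E[(X + tY)^2] = E[X^2] + 2t\,E[XY] + t^2 E[Y^2], \]
by positivity of expectation. We can assume E[Y^2] > 0, otherwise the claimed inequality
is trivially true because P(Y = 0) = 1. Take t = −E[XY]/E[Y^2], which minimizes the
quadratic expression. Then we obtain
\[ 0 \le E[X^2] - \frac{(E[XY])^2}{E[Y^2]}, \]
which rearranges to give the result. If we have equality, then, for this choice of t, we have
E[(X + tY)^2] = 0 which implies P(X + tY = 0) = 1.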
By applying the Cauchy-Schwarz inequality to the random variables X − µX and Y − µY
we obtain
\[ (\mathrm{cov}(X, Y))^2 \le \mathrm{var}(X)\,\mathrm{var}(Y), \]
from which it follows immediately that the correlation lies in the range [−1, +1].
1.7 Transformations of densities
Suppose X has a continuous distribution with density fX and g : R → R is some function
which we will assume is continuous, increasing and a bijection of R. We also assume
h = g^{−1} is differentiable. Let Y be the random variable Y = g(X). How do we calculate
the distribution of Y from fX and g?
For a < b we have (see footnote 10)
\[ P(a \le Y \le b) = P(a \le g(X) \le b) = P(h(a) \le X \le h(b)) = \int_{h(a)}^{h(b)} f_X(x)\,dx = \int_a^b f_X(h(y))\,h'(y)\,dy, \]
where we make the substitution y = g(x) in the integral. Since this holds for all a < b
we deduce that Y has a continuous distribution with density
\[ f_Y(y) = \frac{1}{g'(g^{-1}(y))}\, f_X(g^{-1}(y)) = h'(y)\, f_X(h(y)). \tag{1.1} \]
A very important special case is Y = aX + b where a > 0 and b ∈ R, and then we have
\[ f_Y(y) = \frac{1}{a}\, f_X\!\left( \frac{y - b}{a} \right). \]
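The special case Y = aX + b can be illustrated numerically. In the Python sketch below (illustrative names; X is taken to be exponential with parameter 1 and a = 2, b = 3 are arbitrary), the transformed density is built from the formula above and integrated over [b, y0]. The result is compared with P(X ≤ (y0 − b)/a) = 1 − e^{−(y0 − b)/a}, which is what P(Y ≤ y0) must equal.

```python
from math import exp


def exponential_density(x, alpha=1.0):
    """Density of the exponential distribution with parameter alpha."""
    return alpha * exp(-alpha * x) if x >= 0 else 0.0


def affine_transformed_density(f_x, a, b):
    """Density of Y = aX + b for a > 0, namely f_Y(y) = f_X((y - b)/a) / a."""
    if a <= 0:
        raise ValueError("this formula assumes a > 0")
    return lambda y: f_x((y - b) / a) / a


def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h


a, b, y0 = 2.0, 3.0, 7.0
f_y = affine_transformed_density(exponential_density, a, b)
prob_from_density = integrate(f_y, b, y0)   # P(Y <= y0), since Y >= b always
prob_direct = 1 - exp(-(y0 - b) / a)        # P(X <= (y0 - b)/a)
print(prob_from_density, prob_direct)
```

Omitting the factor 1/a would give a function that integrates to a rather than 1, which is a quick way to see why the factor is needed.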
We can understand the factor 1/g′(g^{−1}(y)) appearing in (1.1) intuitively by remembering that
fX and fY are densities. The function g stretches or contracts space (the real line) near
a point x according to whether g′(x) > 1 or g′(x) < 1. If g stretches space, then the same
amount of mass (probability) is spread more thinly, and so its density falls. If g contracts
space then the density rises.
Now let's consider how joint densities behave when we transform variables. There is
a change of variable formula for multidimensional integrals which involves the Jacobian
matrix. Let us recall (see footnote 11) the definition of this. Suppose g is a differentiable R^2-valued
function defined on R^2 (or some subset of R^2). Write
\[ g(x, y) = (g_1(x, y), g_2(x, y)), \]
where g1 and g2 are R-valued functions. Then the Jacobian matrix of g at a point
(x, y) ∈ R^2 is the matrix
\[ J_g(x, y) = \begin{pmatrix} \dfrac{\partial g_1}{\partial x} & \dfrac{\partial g_1}{\partial y} \\[2ex] \dfrac{\partial g_2}{\partial x} & \dfrac{\partial g_2}{\partial y} \end{pmatrix} \]
and the Jacobian of g at the point (x, y) is the determinant of this matrix, det(Jg(x, y)).
Now we can state our formula for the transformation of joint densities.
10
The second equality here depends on g being increasing and a bijection. For functions such as
g(x) = x^2 or g(x) = e^{−x} it is necessary to modify the argument, and the formula (1.1) does not hold in
general.
11
If you haven't met this already, then you will soon in ST208.
Proposition 13. Suppose X and Y have a joint distribution with density fXY. Suppose
that D ⊆ R^2 is such that P((X, Y) ∈ D) = 1 and g : D → D′ is a bijection between D
and D′ ⊆ R^2 with non-vanishing Jacobian. Then the random variables U and V defined
by (U, V) = g(X, Y) have a joint distribution with density
\[ f_{UV}(u, v) = \begin{cases} \dfrac{1}{|\det(J_g(g^{-1}(u, v)))|}\, f_{XY}(g^{-1}(u, v)) & \text{if } (u, v) \in D', \\ 0 & \text{otherwise.} \end{cases} \]
Notice the modulus of the determinant in the statement! There is an extension of this
result to more than two variables; it is entirely as you would expect.
Example 9. In the last lecture we considered the case that (X, Y) was uniformly distributed
in the disc $D = \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \le 1\}$ and we defined random variables U
and V by
\[ U = aX + bY \quad \text{and} \quad V = bX + aY, \]
where a and b are constants. We can put this into the framework of the proposition by
defining $g : \mathbb{R}^2 \to \mathbb{R}^2$ by
\[ g(x, y) = (ax + by, \; bx + ay). \]
The Jacobian matrix of g is
\[ J_g(x, y) = \begin{pmatrix} a & b \\ b & a \end{pmatrix}. \]
It doesn't depend on x and y! This is a really important point: the Jacobian matrix of a
linear transformation is a constant matrix (compare with the derivative of a linear function
of one variable), and actually it's the same matrix as describes the transformation.
Geometrically this means linear transformations stretch and contract space in the same
way everywhere [12].
Now the boundary of D is described by the equation $x^2 + y^2 = 1$. We need to find
the image of the disc under the transformation given by g. We do this first by finding
the inverse of g: if u = ax + by and v = bx + ay, then $x = (au - bv)/(a^2 - b^2)$ and
$y = (-bu + av)/(a^2 - b^2)$ (this needs $a^2 \ne b^2$, so that g is invertible). Substituting
into the equation of the circle we find that the boundary of $D' = g(D)$ is described by the equation
\[ \left(\frac{au - bv}{a^2 - b^2}\right)^2 + \left(\frac{-bu + av}{a^2 - b^2}\right)^2 = 1, \]
which simplifies to
\[ (a^2 + b^2)(u^2 + v^2) - 4abuv = (a^2 - b^2)^2. \]
[12] The determinant of a matrix tells you how the corresponding linear transformation scales area (or
in higher dimensions volume). But Proposition 13 also deals with transformations that are not linear.
In this case at a point (x, y) the Jacobian matrix is telling you what linear transformation you can
approximate g by close to the point (x, y). And then the determinant defining the Jacobian calculates
how this approximating linear transformation scales area.
This we recognise [13] as the equation of an ellipse. What is the area of this ellipse? Because
the Jacobian matrix of g is constant it must be given by
and hence
\[ \det(J_g(g^{-1}(u, v))) = \frac{1}{\det(J_{g^{-1}}(u, v))}. \]
We will see how this works in the next example.
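Example 10. Suppose X and Y are independent and each have the exponential distribution
with parameter α > 0. Let U = X + Y and V = X/(X + Y). We want to find the joint
distribution of U and V. So let $g : (0, \infty)^2 \to (0, \infty) \times (0, 1)$ be the function
\[ g(x, y) = \left(x + y, \; \frac{x}{x + y}\right). \]
Its inverse is $g^{-1}(u, v) = (uv, \; u(1 - v))$, whose Jacobian matrix is
\[ J_{g^{-1}}(u, v) = \begin{pmatrix} v & u \\ 1 - v & -u \end{pmatrix}, \]
which has determinant −u. Now the joint density of X and Y is $\alpha^2 e^{-\alpha(x+y)}$ on the set
$\{(x, y) \in \mathbb{R}^2 : x > 0, y > 0\}$ and zero otherwise. Consequently the proposition gives the
joint density of U and V as
\[ f_{UV}(u, v) = \begin{cases} \alpha^2 u e^{-\alpha u} & \text{if } u > 0 \text{ and } 0 < v < 1, \\ 0 & \text{otherwise.} \end{cases} \]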
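The Jacobian step in Example 10 is easy to check numerically. The sketch below is only an illustration: it takes the inverse map to be (u, v) ↦ (uv, u(1 − v)), which is what solving u = x + y, v = x/(x + y) gives, approximates its Jacobian determinant by central differences, and then applies Proposition 13. The function names are ours.

```python
import math

def g(x, y):
    """The map of Example 10: (x, y) -> (u, v) = (x + y, x / (x + y))."""
    return x + y, x / (x + y)

def g_inv(u, v):
    """Inverse map: (u, v) -> (x, y) = (u*v, u*(1 - v))."""
    return u * v, u * (1.0 - v)

def jacobian_det(func, p, q, h=1e-6):
    """Central-difference approximation to the Jacobian determinant of func at (p, q)."""
    f1p, f2p = func(p + h, q)
    f1m, f2m = func(p - h, q)
    g1p, g2p = func(p, q + h)
    g1m, g2m = func(p, q - h)
    d11, d21 = (f1p - f1m) / (2 * h), (f2p - f2m) / (2 * h)  # derivatives in first argument
    d12, d22 = (g1p - g1m) / (2 * h), (g2p - g2m) / (2 * h)  # derivatives in second argument
    return d11 * d22 - d12 * d21

def joint_density_uv(u, v, alpha):
    """Proposition 13 for Example 10, using 1/|det J_g| = |det J_{g^{-1}}|."""
    if u <= 0 or not (0 < v < 1):
        return 0.0
    x, y = g_inv(u, v)
    f_xy = alpha ** 2 * math.exp(-alpha * (x + y))
    return abs(jacobian_det(g_inv, u, v)) * f_xy
```

At any point with u > 0 and 0 < v < 1 the numerical determinant comes out as −u (up to rounding), and `joint_density_uv` agrees with α²u e^{−αu}. Since this density does not involve v, it also shows that V is uniform on (0, 1) and independent of U.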
Chapter 2
2.1 The standard Gaussian distribution
Definition 10. A random vector [1] $Z = (Z_1, Z_2, \dots, Z_n)$ has the standard Gaussian [2] distribution
in $\mathbb{R}^n$ if the joint distribution of $Z_1, Z_2, \dots, Z_n$ is continuous with density
\[ f(z_1, z_2, \dots, z_n) = \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n z_i^2\right) \quad \text{for } (z_1, z_2, \dots, z_n) \in \mathbb{R}^n. \]
This distribution is very special for it combines probability and geometry in a unique
way.
(a) The random variables $Z_1, Z_2, \dots, Z_n$ are independent because their joint density
factorizes:
\[ \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n z_i^2\right) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\left(-z_i^2/2\right). \]
Notice this also implies that each component $Z_i$ has the standard Gaussian distribution
on R.
(b) The joint density is a function of $\sum_{i=1}^n z_i^2$, which by Pythagoras is the squared
distance of the point $(z_1, z_2, \dots, z_n)$ to the origin. Consequently the joint density is
rotationally symmetric.
A linear transformation of $\mathbb{R}^n$ is a rotation only if it can be written [3] as $z \mapsto Oz$ where O
is an orthogonal n × n matrix [4]. An orthogonal matrix is a matrix satisfying
\[ O^T O = O O^T = I, \tag{2.1} \]
where I denotes the identity matrix. Another way of expressing this is to say the rows
(and the columns too) of O form an orthonormal basis for $\mathbb{R}^n$. Computing the determinant
of both sides of the above equation gives $1 = \det(O^T O) = \det(O^T)\det(O) = (\det(O))^2$.
So the determinant of an orthogonal matrix is equal to either 1 or −1.
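Both signs of the determinant really do occur. As a small illustration in the plane (the 2 × 2 matrices and helper names below are our own choice, not taken from the notes), a rotation has determinant 1 and a reflection has determinant −1, while both satisfy (2.1):

```python
import math

def rotation(theta):
    """Rotation of the plane through the angle theta."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def reflection(theta):
    """Reflection of the plane in the line through the origin at angle theta/2."""
    return [[math.cos(theta),  math.sin(theta)],
            [math.sin(theta), -math.cos(theta)]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def is_orthogonal(m, tol=1e-12):
    """Check O^T O = I entry by entry, up to rounding error."""
    prod = matmul(transpose(m), m)
    n = len(m)
    return all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))
```

For example `det2(rotation(0.7))` is 1 and `det2(reflection(0.7))` is −1 up to rounding, whereas a shear such as `[[1, 1], [0, 1]]` has determinant 1 but fails `is_orthogonal`: having determinant ±1 is necessary for orthogonality, not sufficient.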
The following proposition formalises the observation (b) made above.
Proposition 14. Suppose that Z has the standard Gaussian distribution in Rn and O
is a non-random orthogonal n × n matrix. Let W = OZ. Then W has the standard
Gaussian distribution in Rn too.
Proof. Because $|\det(O)| = 1$, and the map $z \mapsto Oz$ has inverse $w \mapsto O^T w$, Proposition
13 [5] implies that the distribution of the vector W has density
\[ f_W(w) = f_Z(O^T w) \]
[1] By a random vector we just mean a vector whose components are random variables. Our vector will
be a column vector, and so it can be written here as the transpose of a row vector.
[2] Gaussian, named after Carl Friedrich Gauss, one of the greatest mathematicians of all time. Also known
more prosaically as the normal distribution.
[3] z is an n-dimensional column vector in what follows.
[4] Not all orthogonal transformations are rotations; some are reflections.
[5] Actually this is the generalization of Proposition 13 to n variables. Remember too that in Example
9 we saw that the Jacobian of a linear transformation is the determinant of the matrix describing the
transformation, so in this case the determinant of O.
where $f_Z(z)$ denotes the density of Z. Now notice that if $z = (z_1, z_2, \dots, z_n)^T$ is an n-dimensional
column vector then $z^T z = \sum_{i=1}^n z_i^2$, and if $z = O^T w$ then [6] $z^T z = (O^T w)^T (O^T w) =
w^T O O^T w = w^T w$. Consequently the density of W is given by
\[ f_W(w) = \frac{1}{(2\pi)^{n/2}} \exp\left(-w^T w/2\right), \]
To deduce this from the proposition we just choose an orthogonal matrix O whose first
row [7] is $a^T$. Then the first component of W = OZ will be exactly $a^T Z$. Further suppose
that b is another non-random n-dimensional column vector of length one and that $a^T b = 0$,
then
\[ \text{the random variables } a^T Z = \sum_{i=1}^n a_i Z_i \text{ and } b^T Z = \sum_{i=1}^n b_i Z_i \text{ are independent.} \]
To deduce this, just make $a^T$ and $b^T$ the first and second rows of O respectively [8].
[6] I'm using properties of transpose here: $(AB)^T = B^T A^T$ and $(A^T)^T = A$.
[7] We can always do this, by results from linear algebra.
[8] Again we can always do this, by results from linear algebra: any collection of orthonormal vectors
can be extended to an orthonormal basis.
2.2 The general Gaussian distribution
2.2.1 Definition
Recall that if X has a general Gaussian distribution on the real line, then it can be
constructed from a random variable Z having the standard Gaussian distribution on R
by a combination of a scaling and a translation:
X = aZ + b
where a and b are real constants and we can take a > 0. In fact this is really the
definition of general Gaussian distribution on R: any distribution that can be obtained
by scaling and translating the standard Gaussian distribution. Notice also that we then
have E[X] = b and var(X) = $a^2$, so knowing the mean and variance of X, together with
knowing it has a Gaussian distribution, completely specifies its distribution. We want to
do something similar in higher dimensions.
There is something very subtle about this definition [9], which is that we don't assume
m = n. We could assume m = n; a priori that might lead to a smaller family of
distributions being called Gaussian, but in fact it wouldn’t: we would in fact define the
same family that way. The advantage of allowing m to be different from n comes later,
when we prove a theorem which describes the properties of the Gaussian distribution. We
pay a price: the theorem in the next lecture is harder with our definition.
We want to generalise the fact that mean and variance determine a Gaussian distri-
bution in one dimension. So we first need to calculate the higher dimensional analogues
of mean and variance.
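Suppose that X = AZ + b as in the definition. Then, for each 1 ≤ i ≤ n,
\[ E[X_i] = E\left[\sum_{j=1}^m A_{ij} Z_j + b_i\right] = b_i. \tag{2.2} \]
Next we can calculate, for 1 ≤ i ≤ n and 1 ≤ j ≤ n,
\[ \operatorname{cov}(X_i, X_j) = E\left[\left(\sum_{k=1}^m A_{ik} Z_k\right)\left(\sum_{l=1}^m A_{jl} Z_l\right)\right] = E\left[\sum_{k=1}^m \sum_{l=1}^m A_{ik} A_{jl} Z_k Z_l\right] = \sum_{k=1}^m \sum_{l=1}^m A_{ik} A_{jl} E[Z_k Z_l] = \sum_{k=1}^m A_{ik} A_{jk}, \tag{2.3} \]
where the last equality holds because $E[Z_k Z_l]$ equals 1 if k = l and 0 otherwise.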
And notice that this is exactly the ij-th entry of the matrix $AA^T$. We say that the
variance-covariance matrix of X is the n × n matrix Σ with entries $\Sigma_{ij} = \operatorname{cov}(X_i, X_j)$.
And so we have $\Sigma = AA^T$. Notice that Σ is a symmetric matrix, $\Sigma_{ij} = \Sigma_{ji}$, with its
diagonal elements given by the variances of $X_1, X_2, \dots, X_n$.
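The identity Σ = AAᵀ can be checked on a small case. In the sketch below the 2 × 3 matrix `A_EXAMPLE` and the vector `B_EXAMPLE` are made-up illustrative values (so n = 2 and m = 3), and the simulation uses Python's standard `random` module; none of this comes from the notes themselves.

```python
import random

A_EXAMPLE = [[1.0, 0.0, 2.0],
             [0.0, 1.0, 1.0]]   # hypothetical 2 x 3 matrix A
B_EXAMPLE = [1.0, -1.0]         # hypothetical mean vector b

def gram(a):
    """Return A A^T for a matrix A given as a list of rows."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in a] for row_i in a]

def sample_gaussian(a, b, rng):
    """One draw of X = A Z + b, where Z is standard Gaussian in R^m."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(a[0]))]
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(a, b)]

def empirical_cov(samples):
    """Sample variance-covariance matrix (divisor n - 1) of a list of vectors."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(dim)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
             for j in range(dim)] for i in range(dim)]
```

Here `gram(A_EXAMPLE)` is the matrix with rows (5, 2) and (2, 2), and the empirical covariance of a few thousand draws from `sample_gaussian` should be close to it, with the usual sampling error of order one over the square root of the number of draws.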
Given a random vector $X = (X_1, X_2, \dots, X_n)^T$ with mean vector b and variance-covariance
matrix Σ we can easily calculate the expected value and variance of a random
variable Y of the form
\[ Y = v^T X = \sum_{i=1}^n v_i X_i \]
This holds, of course, because we have just seen the quantity $v^T \Sigma v$ is the variance of a
random variable.
2.2.2 Characterization by mean and variance-covariance matrix
Theorem 1. The distribution of a random vector $X = (X_1, X_2, \dots, X_n)^T$ having a Gaussian
distribution [10] is characterized by its mean vector and variance-covariance matrix.
Moreover for any vector µ ∈ Rn and non-negative definite, symmetric n × n matrix Σ
there exists a random vector X having mean µ and variance-covariance matrix Σ.
For the proof we are going to use moment generating functions. If $X = (X_1, X_2, \dots, X_n)^T$
is a random vector then its moment generating function is the function $\varphi_X$ defined by
\[ \varphi_X(t_1, t_2, \dots, t_n) = E\left[\exp\left(\sum_{i=1}^n t_i X_i\right)\right] \tag{2.5} \]
whenever the expectation exists (and is finite). If X and Y are two n-dimensional random
vectors and their moment generating functions exist and are equal (in a neighbourhood
of the origin) then [11] X and Y have the same distribution.
Proof of theorem. Recall that if Z has the standard Gaussian distribution on R, then
\[ E\left[e^{tZ}\right] = e^{t^2/2}. \]
[10] If n ≥ 2 we often say a multivariate Gaussian distribution.
[11] This is a very deep result from analysis, even when n = 1.
This last expression can be written as
\[ \exp\left(t^T b + t^T \Sigma t/2\right), \]
where $t = (t_1, t_2, \dots, t_n)^T$ and $\Sigma = AA^T$. Thus the moment generating function, and
hence the distribution, of X is determined by b, its mean vector, and Σ, its variance-covariance
matrix.
For the second half of the theorem, suppose we are given an n-dimensional vector
b and a non-negative definite matrix Σ. We recall that the symmetric matrix Σ can
be written [12] as $O^T \Lambda O$ where O is an orthogonal matrix and Λ is diagonal. Since Σ is
supposed to be non-negative definite [13], the entries of Λ must all be non-negative, and we
define the n × n matrix A by $A = O^T \Lambda^{1/2}$ where $\Lambda^{1/2}$ denotes the diagonal matrix whose
entries are the square roots of the entries of Λ. Then let X = AZ + b where Z has the
standard Gaussian distribution on $\mathbb{R}^n$. By Definition 11, X has a Gaussian distribution
and by the calculation we made in the previous lecture, the mean vector of X is b and its
variance-covariance matrix is $AA^T = (O^T \Lambda^{1/2})(O^T \Lambda^{1/2})^T = O^T \Lambda O = \Sigma$.
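The construction A = Oᵀ Λ^{1/2} can be tried out on a 2 × 2 case. The following sketch is illustrative only: it starts from a made-up orthogonal matrix O (a rotation through π/6) and made-up eigenvalues 4 and 0, builds Σ = OᵀΛO, and checks that AAᵀ recovers Σ. The zero eigenvalue is chosen deliberately, to show that the construction also works when Σ is not invertible.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def sqrt_factor(o, eigenvalues):
    """Return A = O^T Lambda^{1/2}; requires all eigenvalues to be non-negative."""
    if any(lam < 0 for lam in eigenvalues):
        raise ValueError("Sigma must be non-negative definite")
    return matmul(transpose(o), diag([math.sqrt(lam) for lam in eigenvalues]))

# Hypothetical example: O is a rotation through pi/6, Lambda = diag(4, 0).
theta = math.pi / 6
O = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
LAMBDA = [4.0, 0.0]
SIGMA = matmul(matmul(transpose(O), diag(LAMBDA)), O)   # Sigma = O^T Lambda O
A = sqrt_factor(O, LAMBDA)
RECOVERED = matmul(A, transpose(A))                      # should equal Sigma
```

Note that A is not the only matrix with AAᵀ = Σ (for instance AQ works for any orthogonal Q), which is consistent with the theorem: the distribution of X depends on A only through Σ.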
[12] Take the diagonal entries of Λ to be the eigenvalues of Σ and the rows of O to be the corresponding
eigenvectors, normalized to have length 1. It is then easy to check that $O^T \Lambda O$ is a matrix with these
same eigenvalues and eigenvectors, and so must be equal to Σ.
[13] Recall from the last lecture that a symmetric matrix Σ is called non-negative definite if $v^T \Sigma v \ge 0$
for all vectors v; this is equivalent to saying all the eigenvalues of Σ are non-negative.
2.2.3 Properties
Proposition 15. Suppose that the random vector X = (X1 , X2 , . . . , Xn )T has a Gaussian
distribution with mean vector b and variance-covariance matrix Σ.
If Σ is invertible then the distribution of X is continuous with a density given by
\[ f_X(x_1, x_2, \dots, x_n) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}} \exp\left(-\frac{1}{2}(x - b)^T \Sigma^{-1} (x - b)\right) \]
where $x = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n$.
If Σ is not invertible then there exist one or more linear relationships between the
random variables X1 , X2 , . . . , Xn which hold with probability one. Consequently the dis-
tribution of X is not continuous.
Proof. Suppose Σ is invertible; then all its eigenvalues are strictly positive. If we write
$\Sigma = O^T \Lambda O$ with O orthogonal and Λ diagonal then the matrix Λ has strictly positive
diagonal entries and is invertible too. Moreover $\Sigma^{-1} = O^T \Lambda^{-1} O$. Define $A = O^T \Lambda^{1/2}$ and
suppose, as we may, that X = AZ + b where Z has the standard Gaussian distribution
on $\mathbb{R}^n$. Notice that the map $z \mapsto Az + b$ is a bijection on $\mathbb{R}^n$ because the matrix A is
invertible. In fact if x = Az + b then $z = A^{-1}(x - b)$. By the extension of Proposition 13
to n variables we have that the density of the distribution of X is given by
\[ f_X(x) = \frac{1}{|\det(A)|} f_Z(A^{-1}(x - b)) \]
If Σ is not invertible then there exists at least one vector $v \ne 0$ so that Σv, and hence
$v^T \Sigma v$, is zero. Then
\[ \operatorname{var}\left(\sum_{i=1}^n v_i X_i\right) = v^T \Sigma v = 0 \]
Theorem 2. Suppose that the random vector X = (X1 , X2 , . . . , Xn )T has a Gaussian
distribution.
Y = BX + c
(b) For any positive integer p ≤ n, and 1 ≤ i1 < i2 < . . . < ip ≤ n the vector
Y = (BA)Z + (Bb + c) = CZ + d,
and this representation shows that Y has a Gaussian distribution. Let $\tilde{A}$ be the p × m
matrix consisting of the rows $i_1, i_2, \dots, i_p$ of A, and let $\tilde{b} = (b_{i_1}, b_{i_2}, \dots, b_{i_p})^T$. Then
X̃ = ÃZ + b̃,
[14] It might be that Y is just a deterministic constant vector; that's OK, we include constants as
"degenerate" Gaussian distributions too.
2.3 Fisher’s Theorem and a statistical application
We proved on the first tutorial sheet that if $Z_1, Z_2, \dots, Z_n$ are independent random variables
each having the standard Gaussian distribution on R then the random variable
\[ Z_1^2 + Z_2^2 + \dots + Z_n^2 \]
has the Gamma distribution [15] with parameters 1/2 and n/2. This distribution is also
known as the $\chi^2$-distribution with n degrees of freedom.
If X and Y are independent random variables having continuous distributions with
densities $f_X$ and $f_Y$ and P(Y > 0) = 1, then a very similar argument to that used to
prove Proposition 10 shows that the random variable X/Y has a continuous distribution
with density
\[ f_{X/Y}(t) = \int_0^\infty y f_X(ty) f_Y(y)\,dy. \tag{2.6} \]
Calculating, using this formula and the densities of the Gaussian and Gamma distribu-
tions, the following proposition can be proved.
$(1 + t^2/n)^{-(n+1)/2}$, $t \in \mathbb{R}$.
The distribution appearing in this proposition is called the t-distribution with n de-
grees of freedom.
Then
(a) $\bar{X}$ has the Gaussian distribution on R with mean µ and variance $\sigma^2/n$.
Proof. A linear combination of independent random variables each having a Gaussian
distribution has a Gaussian distribution itself. Hence $\bar{X}$ has a Gaussian distribution and
we can calculate
\[ E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \mu \]
and
\[ \operatorname{var}(\bar{X}) = \operatorname{var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{var}(X_i) = \frac{\sigma^2}{n}. \]
This proves (a).
Let $X = (X_1, X_2, \dots, X_n)^T$; this random vector has a Gaussian distribution [16] with
covariance matrix $\Sigma = \sigma^2 I$ and mean vector $\mu \mathbf{1}$, where I denotes the n × n identity
matrix and $\mathbf{1} = (1, 1, \dots, 1)^T$ is the n-dimensional vector whose every component is equal
to 1.
The key idea of the proof is to transform the random vector X into a new random
vector Y with some nice properties. To this end let $O = (O_{ij})_{1 \le i, j \le n}$ be the orthogonal
matrix whose first row [17] is
\[ (n^{-1/2}, n^{-1/2}, \dots, n^{-1/2}) = \frac{1}{\sqrt{n}} \mathbf{1}^T. \]
Because O is an orthogonal matrix, if $i \ne 1$, then its i-th row $O_{i1}, O_{i2}, \dots, O_{in}$ is a vector
which is orthogonal to $\mathbf{1}$ and consequently
\[ \sum_{j=1}^n O_{ij} = 0. \]
[16] In fact each of $X_1, X_2, \dots, X_n$ can be written as a linear transformation of a standard Gaussian random
variable $Z_i$,
\[ X_i = \sigma Z_i + \mu, \]
and because $X_1, X_2, \dots, X_n$ are independent so are $Z_1, \dots, Z_n$. But this means the vector
$Z = (Z_1, Z_2, \dots, Z_n)^T$ has the standard Gaussian distribution on $\mathbb{R}^n$, and the vector X can be written as
a linear transformation of Z.
[17] Notice this vector has length one, and so there exists an orthogonal matrix with it as its first row.
Next, using the orthogonality of O, we have
\[ \sum_{i=1}^n Y_i^2 = Y^T Y = (OX)^T (OX) = X^T (O^T O) X = \sum_{i=1}^n X_i^2. \]
Consequently
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right) = \frac{1}{n-1}\left(\sum_{i=1}^n Y_i^2 - Y_1^2\right) = \frac{1}{n-1}\sum_{i=2}^n Y_i^2. \]
and the random variables $Y_2/\sigma, Y_3/\sigma, \dots, Y_n/\sigma$ are independent and each have the standard
Gaussian distribution on R. Thus (b) follows from the definition of the $\chi^2$ distribution.
Finally, for (d), we write
\[ \frac{\bar{X} - \mu}{\sqrt{S^2/n}} = \frac{(\bar{X} - \mu)/\sqrt{\sigma^2/n}}{\sqrt{S^2/\sigma^2}}. \]
Then, from (a), (b) and (c), we see that the preceding proposition applies to the
right-hand side.
However it is very important to understand how accurate this estimate is likely to be.
This can be done by considering the distribution of the random variable
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \]
and seeing how far away $\bar{X}$ tends to be from its mean µ. We know that the distribution
of $\bar{X} - \mu$ is Gaussian with mean zero and variance $\sigma^2/n$. So we see immediately how our
estimate will (probably) be more accurate for larger sample sizes n, and smaller $\sigma^2$, which
describes the inherent variability of the data. The problem is that $\sigma^2$ also needs to be
estimated from the data. The natural way to do this is with the sample variance
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2. \]
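As a quick illustration, the sample variance can be computed either directly from this definition or via the identity Σ(xᵢ − x̄)² = Σxᵢ² − n x̄² used in the proof of Fisher's theorem. The sketch below compares both with `statistics.variance` from the Python standard library, which also uses the divisor n − 1. The data values are invented for the example.

```python
import statistics

def sample_variance(xs):
    """s^2 = (1/(n-1)) * sum (x_i - xbar)^2, straight from the definition."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Same quantity via the identity sum (x_i - xbar)^2 = sum x_i^2 - n * xbar^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar * xbar) / (n - 1)

data = [2.0, 4.0, 4.0, 5.0, 7.0]   # made-up observations
s2_definition = sample_variance(data)
s2_shortcut = sample_variance_shortcut(data)
s2_library = statistics.variance(data)
```

All three agree up to rounding (here the mean is 4.4 and s² = 13.2/4 = 3.3). In floating-point arithmetic the shortcut form can lose accuracy when the mean is large compared with the spread, so the definition is the safer one to code.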
But since this is an estimate, we now need to understand how accurate this is likely to
be too! And so we are led to consider the distribution of the random variable
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \]
\[ E[S^2] = \sigma^2, \]
[18] You can view this as a special case of the mean of a gamma distribution, or just calculate
$E[Z_1^2 + Z_2^2 + \dots + Z_{n-1}^2]$ where the $Z_i$ each have the standard Gaussian distribution.
Chapter 3
Conditioning
3.1 Conditional distributions
Suppose A is an event with probability P(A) > 0 and X a random variable. Then the
conditional distribution of X given A is the probability distribution on R given by
Together with the fact that P(T ≤ t|T ≥ t) = 0 this shows that the conditional distribution
of T given T ≥ t has density
\[ f(u) = \begin{cases} \alpha e^{-\alpha(u - t)} & \text{if } u \ge t, \\ 0 & \text{otherwise.} \end{cases} \tag{3.2} \]
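The density (3.2) is that of t plus an exponential random variable with rate α, which is the "memoryless" property. Here is a small sketch, assuming (as the form of (3.2) indicates) that T is exponential with rate α; the function names are ours.

```python
import math

def survival(t, alpha):
    """P(T > t) for T exponential with rate alpha."""
    return math.exp(-alpha * t) if t > 0 else 1.0

def conditional_survival(s, t, alpha):
    """P(T > t + s | T >= t), computed from the definition P(A and B) / P(B)."""
    return survival(t + s, alpha) / survival(t, alpha)

def conditional_density(u, t, alpha):
    """The density (3.2) of T given T >= t."""
    return alpha * math.exp(-alpha * (u - t)) if u >= t else 0.0
```

For any t ≥ 0, `conditional_survival(s, t, alpha)` equals e^{−αs} = P(T > s): having already waited until time t tells us nothing about the remaining wait.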
distribution of X. For k = 0, 1, 2, ..., 10 we have,
Y = y?
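First we find the marginal distribution of Y. For any non-negative integer y we have
\[ f_Y(y) = \frac{1}{2}\sum_{x=0}^{y} \binom{y}{x} \left(\frac{1}{6}\right)^x \left(\frac{1}{3}\right)^{y-x} = \frac{1}{2}\left(\frac{1}{6} + \frac{1}{3}\right)^y = \left(\frac{1}{2}\right)^{y+1}. \]
So in fact we see Y has a geometric distribution. Now the conditional mass function of
X given Y = y is
\[ f_{X|Y}(x|y) = \frac{1}{2}\binom{y}{x}\left(\frac{1}{6}\right)^x \left(\frac{1}{3}\right)^{y-x} \Big/ \left(\frac{1}{2}\right)^{y+1} = \binom{y}{x}\left(\frac{1}{3}\right)^x \left(\frac{2}{3}\right)^{y-x} \quad \text{for } x = 0, 1, 2, \dots, y. \]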
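These two calculations can be verified exactly with rational arithmetic. The sketch below reads the joint mass function off the summand in the display for the marginal of Y; the function names are ours.

```python
from fractions import Fraction
from math import comb

def joint_mass(x, y):
    """Joint mass function: (1/2) * C(y, x) * (1/6)^x * (1/3)^(y - x) for 0 <= x <= y."""
    if not (0 <= x <= y):
        return Fraction(0)
    return Fraction(1, 2) * comb(y, x) * Fraction(1, 6) ** x * Fraction(1, 3) ** (y - x)

def marginal_y(y):
    """Mass function of Y: sum the joint mass function over x."""
    return sum(joint_mass(x, y) for x in range(y + 1))

def conditional_x_given_y(x, y):
    """Conditional mass function of X given Y = y."""
    return joint_mass(x, y) / marginal_y(y)
```

For every y the marginal is exactly (1/2)^{y+1}, and the conditional mass function is that of the binomial distribution with parameters y and 1/3; for instance `conditional_x_given_y(2, 5)` equals C(5, 2)(1/3)²(2/3)³ = 80/243.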
Now suppose that X and Y have a continuous joint distribution. Then if y is some
given value, the event Y = y has zero probability, and so it doesn’t make sense to condition
on it happening. But the preceding definition for mass functions still seems to suggest a
natural formula.
Definition 13. Let X and Y have a continuous joint distribution with density $f_{XY}$.
Denote the density of the (marginal) distribution of Y by $f_Y$. Then for each y ∈ R such
that $f_Y(y) > 0$ we define the conditional density of X given Y = y to be
\[ f_{X|Y}(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)} \quad \text{for } x \in \mathbb{R}. \]
We then define the conditional distribution of X given Y = y to be the distribution on R
with density $x \mapsto f_{X|Y}(x|y)$.
There is a problem with this definition which is that we know densities are not unique,
and their values can be changed on small sets. We tend to ignore this issue, but it means
that really, for any particular value y, such as y = 0, the conditional density of X given
Y = y isn’t meaningfully defined at all. In practice we choose versions of our densities
which are functions which are continuous wherever possible, and this mostly avoids the
problem.
Why is this definition reasonable? One justification is to consider conditioning on the event
y − ε ≤ Y ≤ y + ε. This makes sense in the usual way because we are conditioning on an
event of positive probability. Then let ε tend to 0 and see what happens. Suppose B ⊆ R
then we would like
and $f_Y(y) = 0$ otherwise. Thus we find that the conditional density of X given Y = y
satisfying −1 < y < 1 is
\[ f_{X|Y}(x|y) = \frac{1}{2\sqrt{1 - y^2}} \quad \text{for } -\sqrt{1 - y^2} \le x \le \sqrt{1 - y^2}. \]
P(X > 0|Y = y) is now computed (not that a calculation is really needed in this case)
from this conditional distribution:
\[ P(X > 0 | Y = y) = \int_0^\infty f_{X|Y}(x|y)\,dx = \frac{1}{2}. \]
It's important to reflect on this example and see that it is exactly as you would expect:
the random point (X, Y) is somewhere in the disc. If you are told that in fact $Y = y_0$ for
some given value of $y_0$, then the point (X, Y) must be somewhere on the line segment
\[ \{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \le 1\} \cap \{(x, y) \in \mathbb{R}^2 : y = y_0\} = \left\{(x, y) \in \mathbb{R}^2 : -\sqrt{1 - y_0^2} \le x \le \sqrt{1 - y_0^2}, \; y = y_0\right\}. \tag{3.6} \]
Moreover since the point is chosen uniformly at random it has no preference for one region
of the disc over another, so it has no preference as to where on this line segment it lies
either.
Example 15. Suppose the joint distribution of X and Y is given by
\[ f_{XY}(x, y) = \sqrt{\frac{y}{2\pi}}\, e^{-y(x^2/2 + 1)} \quad \text{for } x \in \mathbb{R} \text{ and } y > 0, \tag{3.7} \]
and $f_{XY}(x, y) = 0$ if y ≤ 0. Then, for a fixed y > 0, the conditional density of X given Y = y
must be
\[ C(y)\, e^{-yx^2/2} \quad \text{for } x \in \mathbb{R}, \tag{3.8} \]
where $C(y) = \sqrt{y/(2\pi)}\, e^{-y}/f_Y(y)$ is some constant depending on y. This density is
proportional to, and hence must be equal [4] to, the density of the Gaussian distribution with mean
zero and variance 1/y. Since that implies that the constant C(y) must then be $\sqrt{y/(2\pi)}$, it
follows that $f_Y(y) = e^{-y}$ for y > 0, that is, the marginal distribution of Y is the exponential
with rate 1. This is an argument really worth thinking about! First, we didn't need to find $f_Y$
to find the conditional distribution of X. Secondly, because we know the value of the normalizing
constant in the Gaussian density we didn't have to calculate the marginal distribution of Y by
doing an integration. Magic!
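If you don't trust the magic, you can do the integration numerically. The sketch below integrates the joint density (3.7) over x with a simple midpoint rule; the integration range (12 conditional standard deviations either side of zero) and the step count are arbitrary choices of ours.

```python
import math

def joint_density(x, y):
    """The joint density (3.7)."""
    if y <= 0:
        return 0.0
    return math.sqrt(y / (2 * math.pi)) * math.exp(-y * (x * x / 2 + 1))

def marginal_y(y, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation to the integral of (3.7) over x.

    The range is +/- half_width conditional standard deviations, where the
    conditional standard deviation of X given Y = y is 1/sqrt(y).
    """
    if y <= 0:
        return 0.0
    sd = 1.0 / math.sqrt(y)
    lo, hi = -half_width * sd, half_width * sd
    h = (hi - lo) / steps
    return h * sum(joint_density(lo + (k + 0.5) * h, y) for k in range(steps))
```

For each y > 0 the result agrees with e^{−y} to many decimal places, confirming that Y is exponential with rate 1.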
[3] Take care here: remember $f_{X|Y}$ is the density of a distribution when treated as a function of x with
y thought of as fixed.
[4] If two probability density functions are proportional they must be equal since the integral of both
(over R) is equal to one.
3.2 Regression
Proposition 17. Suppose X and Y have a joint distribution which is Gaussian, with
the means and variances of X and Y being $\mu_X$, $\mu_Y$, $\sigma_X^2 > 0$ and $\sigma_Y^2 > 0$. Suppose
the covariance [5] between X and Y is $\sigma_{XY}$ and that $\sigma_{XY}^2 < \sigma_X^2 \sigma_Y^2$. Then the conditional
distribution [6] of Y given X = x ∈ R is the Gaussian distribution with mean
\[ \mu_{Y|X=x} = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X) \]
and variance
\[ \sigma_{Y|X}^2 = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}. \]
Notice that the variance of the conditional distribution doesn't depend on x. The mean
of the conditional distribution does, but only as a linear function. The exact form of that
linear function is a very famous formula in statistics. Suppose we are interested in how
the value of X can be used to predict the value of Y; then if we observe that the value of
X is x, say, a sensible [7] prediction for Y is the conditional mean $\mu_{Y|X=x}$. So we are using a
linear function of X to predict Y. And the line described by that linear function is known
as the regression line. The variance $\sigma_{Y|X}^2$ is a measure of how accurate this prediction will
be; we see that
\[ \frac{\sigma_{Y|X}^2}{\sigma_Y^2} = 1 - \rho^2 \tag{3.9} \]
where ρ is the correlation between X and Y. So for a fixed variance of Y the correlation
controls how accurate the prediction is.
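The formulas of Proposition 17 are simple enough to put into a few lines of code. The sketch below is just a direct transcription; the function name and the numerical values in the usage note are made up for illustration.

```python
import math

def conditional_gaussian(mu_x, mu_y, var_x, var_y, cov_xy, x):
    """Mean and variance of Y given X = x for a jointly Gaussian pair (Proposition 17)."""
    if var_x <= 0 or var_y <= 0 or cov_xy ** 2 >= var_x * var_y:
        raise ValueError("need positive variances and correlation strictly between -1 and 1")
    mean = mu_y + cov_xy / var_x * (x - mu_x)   # point on the regression line
    variance = var_y - cov_xy ** 2 / var_x      # does not depend on x
    return mean, variance

def correlation(var_x, var_y, cov_xy):
    """Correlation rho between X and Y."""
    return cov_xy / math.sqrt(var_x * var_y)
```

For instance with means 1 and 2, variances 4 and 9 and covariance 3, the correlation is 0.5, the prediction for Y at x = 3 is 2 + (3/4)(3 − 1) = 3.5, and the conditional variance is 9 − 9/4 = 6.75 = 9(1 − 0.5²), in agreement with (3.9).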
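Proof. Under the conditions stated in the proposition the variance-covariance matrix Σ
of the vector $(X, Y)^T$ is invertible. We have
\[ \Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix} \quad \text{and} \quad \Sigma^{-1} = \frac{1}{\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2} \begin{pmatrix} \sigma_Y^2 & -\sigma_{XY} \\ -\sigma_{XY} & \sigma_X^2 \end{pmatrix}. \]
Denote the determinant $\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2$ by D. The joint density of X and Y is, according
to Proposition 15, proportional to
\[ \exp\left(-\frac{\sigma_Y^2 (x - \mu_X)^2 - 2\sigma_{XY}(x - \mu_X)(y - \mu_Y) + \sigma_X^2 (y - \mu_Y)^2}{2D}\right). \]
We want to treat x as fixed and consider this expression as a function of y. Notice that
$D/\sigma_X^2 = \sigma_{Y|X}^2$ and $\sigma_X^2 \mu_Y/D + \sigma_{XY}(x - \mu_X)/D = \mu_{Y|X}/\sigma_{Y|X}^2$, and checking the coefficients
of $y^2$ and y on both sides verifies that
\[ \sigma_X^2 (y - \mu_Y)^2/(2D) - \sigma_{XY}(x - \mu_X)(y - \mu_Y)/D = (y - \mu_{Y|X})^2/(2\sigma_{Y|X}^2) + \text{some stuff} \]
where "some stuff" doesn't have any dependency on y. Thus the joint density of X
and Y, treated as a function of y alone, is proportional to the density of the Gaussian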
[5] Notice this is the same as saying the correlation between X and Y is neither 1 nor −1.
[6] Note we are reversing the role of X and Y compared to the last lecture.
[7] Optimal in a sense we will see in a few lectures.
distribution on R with mean $\mu_{Y|X}$ and variance $\sigma_{Y|X}^2$. This proves the proposition by the
same argument as used in Example 15.
3.3 The law of total probability
Recall that the law of total probability from elementary probability states that if $B_1, B_2, \dots, B_n$
are a partition [8] and A is some further event then
\[ P(A) = \sum_{i=1}^n P(A|B_i) P(B_i). \tag{3.10} \]
This extends to the case where the partition is an infinite sequence of sets too. As a
consequence, if X and Y have discrete distributions then, for any $x \in \mathcal{X}$, the support of
the distribution of X,
\[ P(X = x) = \sum_{y \in \mathcal{Y}} P(X = x | Y = y) P(Y = y) \tag{3.11} \]
where $\mathcal{Y}$ is the support of the distribution of Y. What is the analogous statement for
continuous distributions? Suppose X and Y have a joint distribution which is continuous;
then
\[ P(a \le X \le b) = \int_{-\infty}^{\infty} P(a \le X \le b | Y = y) f_Y(y)\,dy, \tag{3.12} \]
where $f_Y$ is the density of the distribution of Y, and the conditional probability
P(a ≤ X ≤ b|Y = y) is defined to mean
\[ \int_a^b f_{X|Y}(x|y)\,dx, \]
where the conditional density $f_{X|Y}$ is defined as in Definition 13. Notice this conditional
density is only defined for y such that $f_Y(y) > 0$, but this doesn't matter in making
sense of equation (3.12)! It is very easy to check (3.12); just substitute the formula for
P(a ≤ X ≤ b|Y = y) into the right-hand side, further write the conditional density as the
ratio which defines it in Definition 13, and then we see that (3.12) is equivalent to
\[ P(a \le X \le b) = \int_{-\infty}^{\infty} \int_a^b f_{XY}(x, y)\,dx\,dy, \tag{3.13} \]
Then we have
\[ P\big((X, Y) \in B\big) = \int_{-\infty}^{\infty} P(X \in B_y | Y = y) f_Y(y)\,dy. \tag{3.15} \]
[8] This means these sets are disjoint and their union is the entire sample space.
The proof [9] is exactly the same as for (3.12), with the conditional probability being defined
as an integral of the conditional density.
Let's look at an example where we can solve an interesting problem by using conditioning.
We are going to use a generalization of (3.15) that involves conditioning on two
random variables. Suppose that B is a subset of $\mathbb{R}^2$, and that X, Y and Z are three
random variables whose joint distribution is continuous. Then
\[ P\big((X, Y) \in B | Z = z\big) = \int_{-\infty}^{\infty} P(X \in B_y | Y = y, Z = z) f_{Y|Z}(y|z)\,dy. \tag{3.16} \]
The proof is similar to before, noting that the conditional density of X given Y = y and
Z = z satisfies
\[ f_{X|YZ}(x|y, z) = \frac{f_{XYZ}(x, y, z)}{f_{YZ}(y, z)} = \frac{f_{XY|Z}(x, y|z)}{f_{Y|Z}(y|z)} \tag{3.17} \]
where the first equality is taken to be a definition, and the conditional joint distribution
of X and Y given Z = z is defined to have density
\[ f_{XY|Z}(x, y|z) = \frac{f_{XYZ}(x, y, z)}{f_Z(z)}. \tag{3.18} \]
Example 16. Consider the following game between two players. Player A goes first. They
observe a random number $X_1$ which is uniformly distributed in [0, 1]. They then choose
whether to "stick" and score $X_1$, or continue and observe a second uniformly distributed
random number $X_2$. If $X_1 + X_2 > 1$ they go "bust" and lose automatically. Otherwise
they score $X_1 + X_2$, and player B begins to play. This second player observes a random
number $Y_1$, again uniformly distributed in [0, 1]. If $Y_1$ is greater than the score of player
A then player B wins; if not then a second random number $Y_2$ with uniform distribution
is observed. If $Y_1 + Y_2$ is both greater than the score of player A, and less than 1, then
player B wins, otherwise player A has won. Assume all four random variables $X_1$, $X_2$,
$Y_1$ and $Y_2$ are independent. Find the optimal strategy for player A.
Let X denote the score [10] of player A; because we don't know the strategy of player
A, we don't know the distribution of X, but we do know that X must be independent of
$Y_1$ and $Y_2$, since it will be some function of $X_1$ and $X_2$. Let $A \subset \mathbb{R}^3$ be given by
\[ A = \{(x, y_1, y_2) \in [0, 1]^3 : y_1 + y_2 < x \text{ or } y_1 < x < 1 < y_1 + y_2\}. \]
This is exactly the set of values for the random variables X, $Y_1$ and $Y_2$ that correspond
to player A winning. We can't calculate $P((X, Y_1, Y_2) \in A)$ because we don't yet know
the distribution of X. But we can calculate the conditional probability that A wins given
their score X = x, which more formally is
\[ P\big((Y_1, Y_2) \in A_x | X = x\big) \]
[9] I am hiding what is really going on mathematically speaking. The deep mathematics is that we can
always calculate double integrals by integrating first over one variable then over the remaining variable,
but we have been assuming we could do that all through the module, so this isn't the right time to make
a fuss about it.
[10] If player A continues and then $X_1 + X_2 > 1$ we can still call this the score of player A.
where $A_x$ is the slice through A:
\[ A_x = \{(y_1, y_2) \in [0, 1]^2 : y_1 + y_2 < x \text{ or } y_1 < x < 1 < y_1 + y_2\} \quad \text{for } x \in [0, 1], \]
\[ P\big((X, Y_1, Y_2) \in A | X_1 = x_1\big) = \int_{-\infty}^{\infty} P\big((Y_1, Y_2) \in A_x | X = x\big) f_{X|X_1}(x|x_1)\,dx = \int_{x_1}^1 x^2 f_{X|X_1}(x|x_1)\,dx = \int_{x_1}^1 x^2\,dx = (1 - x_1^3)/3, \]
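The two winning probabilities appearing here can be compared numerically. The sketch below rests on two readings of the display above, both of which are our assumptions: that the integrand x² is the probability that A wins with score x (so sticking on X₁ = x₁ wins with probability x₁²), and that (1 − x₁³)/3 is the probability that A wins if they continue from X₁ = x₁. Under those assumptions the break-even point solves x₁² = (1 − x₁³)/3, which we locate by bisection.

```python
def win_if_stick(x1):
    """Assumed probability that A wins when sticking with score x1 (the area of A_x)."""
    return x1 ** 2

def win_if_continue(x1):
    """Probability that A wins when continuing from X1 = x1, from the display above."""
    return (1 - x1 ** 3) / 3

def break_even(tol=1e-12):
    """Bisection for the x1 in [0, 1] at which sticking and continuing are equally good."""
    lo, hi = 0.0, 1.0   # sticking is worse at 0 and better at 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if win_if_stick(mid) < win_if_continue(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Because `win_if_stick` is increasing and `win_if_continue` is decreasing on [0, 1], there is exactly one crossing point. It is the root of x³ + 3x² − 1 = 0 in [0, 1], roughly 0.532, which suggests sticking when X₁ is above this value and continuing when it is below.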
\[ \{(x, y) \in \mathbb{R}^2 : x \in \mathcal{X}\} \]
with probability one, and yet this set has zero area [13], so we cannot say the joint
distribution is a continuous distribution. Nevertheless we should be able to describe the joint
[11] Don't worry that it doesn't look exactly like (3.17); it's the same idea. Concentrate on convincing
yourself that it makes intuitive sense. The step that I'm skipping over is justifying that
$P((Y_1, Y_2) \in A_x | X = x, X_1 = x_1) = P((Y_1, Y_2) \in A_x | X = x)$.
[12] The probability that the value of $X_1$ gives equality here is zero, so we can be sloppy about whether
the inequality is strict or not.
[13] It is a finite or countably infinite collection of vertical lines in the plane.
distribution with a function [14] $f_{XY} : \mathcal{X} \times \mathbb{R} \to [0, \infty)$ such that for each $x \in \mathcal{X}$ and a ≤ b,
\[ P(X = x \text{ and } a \le Y \le b) = \int_a^b f_{XY}(x, y)\,dy. \tag{3.19} \]
In many situations it's natural to describe such a joint distribution using conditional
distributions. We can write the function $f_{XY}$ appearing in (3.19) in the form
\[ f_{XY}(x, y) = f_X(x) f_{Y|X}(y|x) \tag{3.20} \]
where $f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$ is the mass function of the distribution of X defined for
$x \in \mathcal{X}$, and $y \mapsto f_{Y|X}(y|x)$ is the density of the conditional distribution of Y given X = x.
If instead we wish to condition on Y = y, then we run into the problem of conditioning
on an event of zero probability again. However the sensible way to define the conditional
distribution of X given Y = y is to say that this conditional distribution is the discrete
distribution on $\mathcal{X}$ with mass function
\[ f_{X|Y}(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)} \quad \text{for } x \in \mathcal{X} \tag{3.21} \]
defined at y ∈ R for which the density of Y, $f_Y(y)$, is strictly positive. Here $f_{XY}$ denotes
the function appearing in (3.19). Why is this a good [15] definition of the conditional
distribution of X? Because it makes the law of total probability work: for each $x \in \mathcal{X}$,
\[ P(X = x) = \int_{-\infty}^{\infty} P(X = x | Y = y) f_Y(y)\,dy \tag{3.22} \]
Example 17. An electrical device contains a fragile component that fails at a random time
Y which has a continuous distribution. However the component has been manufactured
in one of two factories. If it has been produced in factory one then Y is exponentially
distributed with rate λ1 . If it has been produced in factory two then Y is exponentially
distributed with rate λ2 . Factory one produces a proportion p of these components and
factory two a proportion 1 − p. What is the distribution of the random time Y and if the
component fails at some time y > 0, what is the probability that it was manufactured in
factory one?
Let X be the random variable taking the value 1 if the component is manufactured in
factory one and the value 2 if it is manufactured in factory two. The question tells us the
mass function of X, and the conditional distributions of Y given X = 1 and given X = 2.
So using (3.20) we obtain the following description of the joint distribution of X and Y .
For 0 ≤ a ≤ b,
\[ P(X = 1 \text{ and } a \le Y \le b) = \int_a^b p\lambda_1 e^{-\lambda_1 y}\,dy \]
and
\[ P(X = 2 \text{ and } a \le Y \le b) = \int_a^b (1 - p)\lambda_2 e^{-\lambda_2 y}\,dy \]
[14] I'm going to avoid calling this a density function because it is not a density in the same sense as
the density of a continuous joint distribution. Nevertheless it is a sort of density, however with respect to
length rather than area.
[15] Other than the formula just looks right!
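Summing these two expressions shows the distribution of Y has density given by
\[ f_Y(y) = \begin{cases} p\lambda_1 e^{-\lambda_1 y} + (1 - p)\lambda_2 e^{-\lambda_2 y} & \text{for } y \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
Then (3.21) gives, for y > 0,
\[ P(X = 1 | Y = y) = \frac{p\lambda_1 e^{-\lambda_1 y}}{p\lambda_1 e^{-\lambda_1 y} + (1 - p)\lambda_2 e^{-\lambda_2 y}} \]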
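To get a feel for this formula, here is a short sketch that evaluates the mixture density of Y and the probability that the component came from factory one. The parameter values in the usage note are invented for illustration.

```python
import math

def mixture_density(y, p, lam1, lam2):
    """Density of the failure time Y: a mixture of two exponential densities."""
    if y < 0:
        return 0.0
    return p * lam1 * math.exp(-lam1 * y) + (1 - p) * lam2 * math.exp(-lam2 * y)

def prob_factory_one(y, p, lam1, lam2):
    """P(X = 1 | Y = y) for y >= 0, as in the displayed formula."""
    if y < 0:
        raise ValueError("failure times are non-negative")
    return p * lam1 * math.exp(-lam1 * y) / mixture_density(y, p, lam1, lam2)
```

With p = 0.3, λ₁ = 2 and λ₂ = 1, a failure at time 0 gives probability 0.6/1.3 ≈ 0.46 for factory one, and this probability decreases as y grows: components from factory one fail faster on average, so a long-lived component is more likely to have come from factory two.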
3.4 Priors and posteriors
We begin with an example using elementary probability (very relevant to real life right
now!) to illustrate the notion of prior and posterior probabilities, and the use of Bayes’
formula.
Example 18. Mass testing of the people of Liverpool (population approximately 500,000)
for Covid-19 began last week. According to Public Health England the test being used has
a sensitivity of 70% and a specificity of 99.68%. If one in fifty people living in Liverpool
are currently infected, and everyone is tested, how many positive test results do we expect
to obtain, and of those testing positive how many are expected to in fact be healthy?
This is the important question of false positives[16] in testing. Suppose we test a random
Liverpudlian and let D denote the event that they are infected; we can reasonably assert that
P(D) = 1/50. This is known as the prior probability of being infected, because this is
before we take account of the result of the test. Let T be the event that this person has a
positive test result. Then the Public Health England data states that
\[ P(T \mid D) = 0.7 \quad \text{and} \quad P(T^c \mid D^c) = 0.9968, \quad \text{so that} \quad P(T \mid D^c) = 0.0032. \]
The total number of positive test results we expect to obtain in Liverpool is therefore, by
the law of total probability,
\[ 500{,}000 \times \big( P(D)P(T \mid D) + P(D^c)P(T \mid D^c) \big) = 500{,}000 \times (0.014 + 0.003136) = 8{,}568. \]
Of these 7,000 were genuinely infected, and 1,568 were “false positives”. If our random
Liverpudlian tests positive, then the posterior (because it is after the test) probability of
being infected is
\[ P(D \mid T) = \frac{P(D)P(T \mid D)}{P(D)P(T \mid D) + P(D^c)P(T \mid D^c)} = \frac{7{,}000}{8{,}568} = 0.82. \]
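The arithmetic of Example 18 can be reproduced in a few lines. This is a sketch only: the function names are hypothetical, and the inputs are the figures quoted in the example (population 500,000, prior 1/50, sensitivity 70%, specificity 99.68%).

```python
def expected_counts(population, prior, sensitivity, specificity):
    """Expected numbers of true positives and false positives if everyone is tested."""
    true_positives = population * prior * sensitivity
    false_positives = population * (1 - prior) * (1 - specificity)
    return true_positives, false_positives

def posterior_infected(prior, sensitivity, specificity):
    """P(D | T) from Bayes' formula."""
    numerator = prior * sensitivity
    return numerator / (numerator + (1 - prior) * (1 - specificity))

true_pos, false_pos = expected_counts(500_000, 1 / 50, 0.70, 0.9968)
print(round(true_pos), round(false_pos), round(true_pos + false_pos))
print(round(posterior_infected(1 / 50, 0.70, 0.9968), 2))
```

Changing the prior in the call to `posterior_infected` shows how strongly the posterior depends on how common the infection is.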
The method illustrated above, combining prior probabilities with the evidence from
the test to arrive at posterior probabilities, is the basis of an entire system for doing
statistics: Bayesian Statistics. But to appreciate Bayesian Statistics fully, we have to
revisit the question of what probability actually means in the real world.
Classically, probabilities are assigned to events according to the characteristic fre-
quency with which the event tends to occur in repetitions of the experiment. For example
in many tosses of an unbiased coin, we expect to see half the tosses result in a head.
The weekend before the recent US election, CNN estimated Trump’s chances of winning
to be ten percent. But what did this mean? Did it mean they were planning to
rerun the general election many times[17] and observe that Trump won in approximately
ten percent of them? This is an example of using probability to quantify our beliefs
about how likely something is to happen. However the problem with this interpretation
of probability is that different individuals may have different beliefs, and then what? It
is because of this subjectivity that Bayesian Statistics can occasionally be controversial,
but it is also an enormously powerful method, and very widely used.
[16] However the argument that there are many false positives arising from the NHS tests used for
people with symptoms is without foundation. Those people arguing for this position either lack a basic
grounding in statistics, or have not made their argument in full knowledge of the facts, or are attempting
to deliberately deceive the public.
[17] Thankfully they didn’t mean this, although it does seem now that Trump wants the election rerun!
Example 19. The estimate that one in fifty people in Liverpool has Covid-19 comes from
a survey conducted by the Office for National Statistics. Suppose that an unknown
proportion of a population is infected and that in a random sample of n people k test[18]
positive. The naive estimate is that a proportion k/n of the population is infected. That’s
perfectly good, but we would like to understand how uncertain this estimate is.
The Bayesian approach to this is to treat the unknown proportion of a population
which is infected as a random variable, X, and assign a prior distribution to X which
reflects our uncertainty about its value before doing any sampling. In this case, if we know
nothing about X, we may decide[19] to model it as having a uniform distribution on [0, 1].
We next let Y be a random variable representing the number of people in our sample
who test positive. If the proportion of the population that is infected was known to be p
then we would model Y as having a Binomial distribution with parameters p and n (the
sample size, remember). So since we are now treating the proportion of the population
that is infected as a random variable X, we specify that the conditional distribution of
Y given X = p is Binomial with parameters p and n. So X has a continuous distribution
with density
\[ f_X(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]
The conditional distribution of Y given X = x has mass function
\[ f_{Y|X}(k|x) = \binom{n}{k} x^k (1 - x)^{n-k} \quad \text{for } k = 0, 1, 2, \ldots, n. \]
Figure 3.1: The density of the posterior distribution of the proportion of the population
who are infected, if 6 are infected in a sample size of 10 (blue), if 12 are infected in a
sample size of 20 (red), if 30 are infected in a sample size of 50 (green), if 60 are infected
in a sample size of 100 (brown).
Applying Bayes’ formula, the conditional distribution of X given Y = k has density
\[ f_{X|Y}(x|k) = \frac{f_X(x) \, f_{Y|X}(k|x)}{\int_0^1 f_X(u) \, f_{Y|X}(k|u) \, du} = \frac{(n+1)!}{k!(n-k)!} x^k (1 - x)^{n-k} \quad \text{for } 0 \le x \le 1. \]
This distribution is called a beta distribution with parameters k + 1 and n − k + 1. Notice that
the first equality is a version of Bayes’ formula in our setting: but be careful[21] interpreting
it as it mixes up mass functions and densities!
[21] Particularly because, as compared to (3.20) and (3.21), I have swapped which variable has a discrete
distribution and which has a continuous distribution; just to keep you on your toes!
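To get a feel for the posterior in Example 19, one can evaluate the beta density with parameters k + 1 and n − k + 1 directly. The sketch below uses only the standard library; the function names and the midpoint-rule integrator are illustrative choices made here, and the four (k, n) pairs are those quoted in the caption of Figure 3.1.

```python
import math

def posterior_density(x, k, n):
    """Density of X given Y = k under a uniform prior: beta with parameters k+1, n-k+1."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    const = math.factorial(n + 1) / (math.factorial(k) * math.factorial(n - k))
    return const * x**k * (1 - x) ** (n - k)

def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# The four cases in Figure 3.1 all have k/n = 0.6.
for k, n in [(6, 10), (12, 20), (30, 50), (60, 100)]:
    total = midpoint_integral(lambda x: posterior_density(x, k, n), 0.0, 1.0)
    print(n, round(total, 6), round(posterior_density(0.6, k, n), 3))
```

Each density integrates to one, and the height at x = 0.6 grows with n: larger samples give a posterior that is more tightly concentrated around k/n.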
3.5 Conditional expectation
Whenever X and Y are two random variables such that we can make sense of the con-
ditional distribution of X given Y = y we can also define the notion of the conditional
expectation of X given Y = y. This is simply the mean of the conditional distribution.
Definition 14. Suppose X and Y are two random variables, for which the conditional
distribution of X given Y = y is defined. Then the conditional expectation of X given
Y = y is defined by
\[ E[X \mid Y = y] = \sum_{x \in \mathcal{X}} x \, f_{X|Y}(x|y) \]
if the conditional distribution of X is discrete with support $\mathcal{X}$ and mass function fX|Y(·|y),
or by
\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x \, f_{X|Y}(x|y) \, dx \]
if the conditional distribution of X is continuous with density fX|Y(·|y). We also require
that the sum or integral be absolutely convergent in order for the conditional expectation
to be defined.
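Proposition 18. Suppose X and Y are two random variables for which the conditional
distribution of X given Y = y is defined, and let h : R² → R be a function. Then
\[ E[h(X, Y) \mid Y = y] = \sum_{x \in \mathcal{X}} h(x, y) \, f_{X|Y}(x|y) \]
if the conditional distribution of X is discrete with support $\mathcal{X}$ and mass function fX|Y(·|y),
or
\[ E[h(X, Y) \mid Y = y] = \int_{-\infty}^{\infty} h(x, y) \, f_{X|Y}(x|y) \, dx \]
if the conditional distribution of X is continuous with density fX|Y(·|y), provided the sum
or integral is absolutely convergent.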
This proposition can save a lot of work. Just imagine if you forgot about it and had
to use Definition 14 instead; you would first need to compute the conditional distribution
of h(X, Y ) given Y = y.
Example 20. Let X and Y be the coordinates of a random point uniformly distributed
in the disc {(x, y) ∈ R² : x² + y² ≤ 1}. As we saw in Example 14, the conditional
distribution of X given Y = y, where y ∈ (−1, 1), is the uniform distribution on the
interval $[-\sqrt{1 - y^2}, \sqrt{1 - y^2}]$. Consequently
\[ E[X \mid Y = y] = \int_{-\sqrt{1-y^2}}^{\sqrt{1-y^2}} \frac{x \, dx}{2\sqrt{1-y^2}} = 0, \]
and
\[ E[X^2 \mid Y = y] = \int_{-\sqrt{1-y^2}}^{\sqrt{1-y^2}} \frac{x^2 \, dx}{2\sqrt{1-y^2}} = \frac{1 - y^2}{3}. \]
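The two integrals in Example 20 can be checked numerically. Here is a small sketch using the midpoint rule; the function name and the number of steps are arbitrary choices made for illustration.

```python
import math

def conditional_moment(power, y, steps=10_000):
    """E[X**power | Y = y] for a uniform point in the unit disc, by the midpoint rule."""
    half_width = math.sqrt(1 - y * y)      # X given Y = y is uniform on [-half_width, half_width]
    h = 2 * half_width / steps
    density = 1 / (2 * half_width)
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += x**power * density * h
    return total

y = 0.5
print(conditional_moment(1, y), conditional_moment(2, y), (1 - y * y) / 3)
```

The first value is zero up to rounding error, and the second agrees with (1 − y²)/3.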
Proposition 18 can also be extended to cover more random variables. So for example,
if X, Y and Z had a continuous joint distribution we would have
\[ E[h(X, Y, Z) \mid Z = z] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y, z) \, f_{XY|Z}(x, y|z) \, dx \, dy, \tag{3.23} \]
where fXY|Z(x, y|z) is the density of the conditional joint distribution of X and Y given Z = z
which we defined previously at (3.18).
Just as linearity is a key property of expectations, so it is for conditional expectations
also. Except now it is more general because we can extend it so that it applies not just
to constant coefficients, but to any function of the random variable on which we are
conditioning. The proof of the following proposition is a simple application of the preceding
Proposition 18, together with properties of sums or integrals.
Proposition 19 (Linearity of conditional expectation). (a) If a : R → R is any
(non-random) function, then, for any random variable Y,
E[a(Y)|Y = y] = a(y).
Thinking about this further may help you appreciate why it is so natural to treat a
conditional expectation as a random variable. Imagine a game in which you win the total score
on the two dice. We know that your expected winnings are E[X + Y ] = E[X] + E[Y ] = 7.
The dice are rolled one by one, and after the first die has been rolled and you see the
result, you assess your expected winnings to be whatever you have scored on the first roll
plus E[Y ] = 7/2. But whatever you score on the first roll is the value of the random
variable X. So your expected winnings at this stage of the game are X + 7/2.
Similarly, in an alternative game, you win just whatever the score X on the first die
is. If the value of the combined score X + Y on the two dice is revealed to you, then you
would assess your expected winnings to be half of this, so (X + Y )/2.
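Both of the dice calculations above can be confirmed by brute-force enumeration of the 36 equally likely outcomes. This is a sketch with a hypothetical helper `conditional_expectation`; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # equally likely (x, y) pairs

def conditional_expectation(value, condition):
    """E[value(X, Y) | condition(X, Y)], averaging over the outcomes satisfying the condition."""
    matching = [(x, y) for x, y in outcomes if condition(x, y)]
    return Fraction(sum(value(x, y) for x, y in matching), len(matching))

# After the first die shows x, the expected total is x + 7/2.
for x in range(1, 7):
    print(x, conditional_expectation(lambda a, b: a + b, lambda a, b: a == x))

# Knowing only the total s, the expected score on the first die is s/2.
for s in range(2, 13):
    print(s, conditional_expectation(lambda a, b: a, lambda a, b: a + b == s))
```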
3.6 The Tower property
Proposition 20 (Tower property). For random variables X, Y and Z such that the
conditional expectations exist,
\[ E\big[E[X \mid Y]\big] = E[X], \]
and
\[ E\big[E[X \mid Y, Z] \mid Z\big] = E[X \mid Z]. \]
The tower property is actually a generalization of the law of total probability. So for
example, to deduce (3.12) from the preceding proposition we take h : R → R to be the
function defined by
\[ h(x) = \begin{cases} 1 & \text{if } a \le x \le b, \\ 0 & \text{otherwise,} \end{cases} \]
and then the equation E[E[h(X)|Y]] = E[h(X)] is equivalent[23] to
\[ E\big[P(a \le X \le b \mid Y)\big] = P(a \le X \le b), \tag{3.24} \]
and if Y has a continuous distribution then the left-hand side can be computed as an
integral.
Example 22. Consider the scores X and Y on a pair of dice as in Example 21 again. Then
we can easily verify that the tower property holds in these two cases:
\[ E\big[E[X + Y \mid X]\big] = E[X + 7/2] = 7 = E[X + Y], \]
and
\[ E\big[E[X \mid X + Y]\big] = E[(X + Y)/2] = 7/2 = E[X]. \]
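Example 23. Suppose that X and Y are a pair of random variables with the property
that
\[ E[Y \mid X] = aX + b \]
for some a, b ∈ R. Remember that if the joint distribution of X and Y is Gaussian then
this will be the case. Let us use the notation:
\[ E[X] = \mu_X, \quad E[Y] = \mu_Y, \quad \operatorname{var}(X) = \sigma_X^2, \quad \operatorname{var}(Y) = \sigma_Y^2 \quad \text{and} \quad \operatorname{cov}(X, Y) = \sigma_{XY}. \]
Now the tower property implies that
\[ \mu_Y = E[Y] = E\big[E[Y \mid X]\big] = E[aX + b] = a\mu_X + b, \]
and also that
\[ \sigma_{XY} + \mu_X \mu_Y = E[XY] = E\big[X \, E[Y \mid X]\big] = E[aX^2 + bX] = a(\sigma_X^2 + \mu_X^2) + b\mu_X. \]
With these equations, we can solve to find a and b:
\[ a = \frac{\sigma_{XY}}{\sigma_X^2} \quad \text{and} \quad b = \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2} \mu_X. \]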
Proposition 21. Consider a pair of random variables X and Y. The choice of random
variable Z which minimizes
\[ E[(Z - Y)^2] \]
amongst all random variables which can be written Z = f(X) for some function f, is
Z = E[Y|X].
Proof. Let Ŷ = E[Y|X] and consider any Z which can be written as a function of X.
Then
\[ E[(Z - Y)^2] = E[(Z - \hat{Y} + \hat{Y} - Y)^2] = E[(Z - \hat{Y})^2] + E[(\hat{Y} - Y)^2] + 2E[(Z - \hat{Y})(\hat{Y} - Y)]. \]
Now we will show that the last term is zero, and hence E[(Z − Y)²] is minimized by Z = Ŷ.
Using the facts that Z − Ŷ is some function of X, and E[(Ŷ − Y)|X] = Ŷ − E[Y|X] = 0,
we have
\[ E[(Z - \hat{Y})(\hat{Y} - Y)] = E\big[E[(Z - \hat{Y})(\hat{Y} - Y) \mid X]\big] = E\big[(Z - \hat{Y}) \, E[(\hat{Y} - Y) \mid X]\big] = 0. \]
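Proposition 21 can be seen at work in the dice setting. In the sketch below, which is only an illustration, X is the score on the first die and Y is taken to be the total of both dice (a choice of notation made here, not in the text), so that E[Y|X] = X + 7/2. We compare its mean squared error with that of a few other predictors of the form f(X).

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (first die, second die), equally likely

def mean_squared_error(predictor):
    """E[(f(X) - Y)^2] where X is the first die and Y is the total of both dice."""
    return Fraction(sum((predictor(x) - (x + d)) ** 2 for x, d in outcomes), len(outcomes))

best = mean_squared_error(lambda x: x + Fraction(7, 2))   # Z = E[Y | X]
others = {
    "constant 7": mean_squared_error(lambda x: 7),
    "2X": mean_squared_error(lambda x: 2 * x),
    "X + 3": mean_squared_error(lambda x: x + 3),
}
print("E[Y|X]:", best)
for name, error in others.items():
    print(name, error)
```

The error of E[Y|X] equals the variance of the second die, and every alternative tried here does strictly worse, as the proposition predicts.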
Proposition 22. Suppose that X and Y are independent random variables, then
Saying X and Y are independent given Z means that for any subsets B and B′ of R
and any value z such that the conditional probabilities are defined,
\[ P(X \in B \text{ and } Y \in B' \mid Z = z) = P(X \in B \mid Z = z) \, P(Y \in B' \mid Z = z). \]
Example 24. Consider again Example 19. X represents the proportion of the population
that is infected, assumed to have a uniform distribution. Y is the number of individuals
who are infected in a sample of size n, which we assume to have a conditional distribution
given X = x which is the Binomial distribution with parameters x and n. Then we saw
in Example 19 that the conditional distribution of X given Y = k is the beta distribution
with density
\[ f_{X|Y}(x|k) = \begin{cases} \dfrac{(n+1)!}{k!(n-k)!} x^k (1 - x)^{n-k} & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]
We said previously that the naive estimate of X given Y = k would be k/n, and it’s easy
to check that the function x ↦ fX|Y(x|k) attains its maximum at x = k/n. However this
is not the conditional expectation of X given Y = k. We have[24]
\[ \int_0^1 x \, f_{X|Y}(x|k) \, dx = \int_0^1 \frac{(n+1)!}{k!(n-k)!} x^{k+1} (1 - x)^{n-k} \, dx = \frac{k+1}{n+2}. \]
So
\[ E[X \mid Y] = \frac{Y + 1}{n + 2}. \]
This is not greatly different from the naive estimate, unless n is small.
Now let Z be the number of individuals who are infected in a second sample of size
m, and suppose that we want to predict Z from the value we observe for Y . We assume
that the conditional distribution of Z given X = x is the Binomial distribution with
parameters x and m, and consequently
\[ E[Z \mid X] = mX. \]
We further assume that Z and Y are independent given X. Using the tower property and
the preceding proposition gives
\[ E[Z \mid Y] = E\big[E[Z \mid X, Y] \mid Y\big] = E\big[E[Z \mid X] \mid Y\big] = E[mX \mid Y] = m \, \frac{Y + 1}{n + 2}. \]
[24] Notice the integral is another beta integral with one parameter changed.
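The posterior mean (k + 1)/(n + 2) and the prediction m(k + 1)/(n + 2) from Example 24 can be verified exactly. The sketch below relies on the standard beta integral, namely that the integral of x^a (1 − x)^b over [0, 1] equals a! b!/(a + b + 1)! for non-negative integers a and b; the function names and the sample values are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact integral of x**a * (1 - x)**b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def posterior_mean(k, n):
    """E[X | Y = k]: the normalising constant times a beta integral with k replaced by k + 1."""
    const = Fraction(factorial(n + 1), factorial(k) * factorial(n - k))
    return const * beta_integral(k + 1, n - k)

def predicted_positives(k, n, m):
    """E[Z | Y = k] for a second sample of size m."""
    return m * posterior_mean(k, n)

print(posterior_mean(6, 10), Fraction(6, 10))    # posterior mean versus naive estimate
print(predicted_positives(6, 10, 24))
```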
Chapter 4
4.1 Markov’s inequality and the weak law of Large
Numbers
Proposition 23 (Markov’s inequality). Let Z be a random variable whose expectation
exists and which satisfies P(Z ≥ 0) = 1. Then for any c > 0,
P(Z ≥ c) ≤ E[Z]/c.
Proof. Consider the random variable Y defined by
\[ Y = \begin{cases} c & \text{if } Z \ge c, \\ 0 & \text{otherwise.} \end{cases} \]
Y has a discrete distribution with two possible values, and we have P(Y = c) = P(Z ≥ c).
Consequently
E[Y ] = c.P(Y = c) = c.P(Z ≥ c).
Notice also it follows from the definition of Y , and the non-negativity of Z that P(Y ≤
Z) = 1. This implies, by positivity of expectation that
E[Y ] ≤ E[Z],
from which the result follows.
Effective use of Markov’s inequality depends on making a good choice of the random
variable Z. Here is a classic case.
Proposition 24 (Chebyshev’s inequality). Suppose X is a random variable whose
expected value and variance both exist. Then, for any ε > 0,
\[ P(|X - E[X]| \ge \epsilon) \le \frac{\operatorname{var}(X)}{\epsilon^2}. \]
Proof. Let Z = (X − E[X])². Then E[Z] = var(X), and P(Z ≥ 0) = 1. Take c = ε² in
Markov’s inequality to obtain
\[ P(|X - E[X]| \ge \epsilon) = P(Z \ge \epsilon^2) \le E[Z]/\epsilon^2 = \operatorname{var}(X)/\epsilon^2. \]
This result is important because it allows us to use the variance, which we know
describes the spread of the distribution about its mean, to obtain quantitative bounds on
the distribution.
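To see how sharp (or not) Chebyshev’s inequality is, one can compare the bound with an exact tail probability. The sketch below does this for the score on a fair die, which is an example chosen here purely for illustration; the function names are hypothetical.

```python
from fractions import Fraction

faces = range(1, 7)
prob = Fraction(1, 6)
mean = sum(prob * x for x in faces)                    # 7/2
variance = sum(prob * (x - mean) ** 2 for x in faces)  # 35/12

def tail_probability(eps):
    """Exact P(|X - E[X]| >= eps) for the score X on a fair die."""
    return sum(prob for x in faces if abs(x - mean) >= eps)

def chebyshev_bound(eps):
    """The bound var(X) / eps^2."""
    return variance / Fraction(eps) ** 2

for eps in (Fraction(3, 2), Fraction(5, 2)):
    print(eps, tail_probability(eps), chebyshev_bound(eps))
```

For ε = 3/2 the bound exceeds one and so says nothing; for ε = 5/2 it is informative but still well above the exact probability. Chebyshev’s inequality trades sharpness for generality.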
The following theorem describes the behaviour of the averages of n random quantities,
$\frac{1}{n}\sum_{k=1}^{n} X_k$, when n is large. The key to the proof is to combine Chebyshev’s inequality
with the fact that the variance of the average $\frac{1}{n}\sum_{k=1}^{n} X_k$ is decreasing with n.
Theorem 4 (Weak law of large numbers). Let X1, X2, X3, . . . be a sequence of independent
random variables having common mean µ and variance σ². Let $S_n = \sum_{k=1}^{n} X_k$. Then for
any ε > 0,
P(|Sn/n − µ| ≥ ε) → 0, as n tends to infinity.
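Proof. First, by linearity of expectation,
\[ E[S_n/n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \mu. \]
Secondly, since the Xi are independent,
\[ \operatorname{var}(S_n/n) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{var}(X_i) = \frac{\sigma^2}{n}, \]
where we also use var(aX) = a² var(X) for any non-random constant a. Then applying
Chebyshev’s inequality to Sn/n gives
\[ P(|S_n/n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}, \]
and the right-hand side tends to 0 as n tends to infinity.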
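The bound in the proof can be compared with what happens in simulation. The following sketch simulates means of fair coin tosses with a fixed seed; the function names, the number of trials and the choice of coin tosses are all illustrative assumptions.

```python
import random

def chebyshev_bound(variance, n, eps):
    """Upper bound sigma^2 / (n * eps^2) on P(|S_n/n - mu| >= eps)."""
    return variance / (n * eps**2)

def empirical_frequency(n, eps, p=0.5, trials=200, seed=0):
    """Fraction of simulated runs in which the mean of n Bernoulli(p) draws misses p by >= eps."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= eps:
            misses += 1
    return misses / trials

# For a fair coin the variance of a single toss is 1/4.
for n in (10, 100, 1000):
    print(n, empirical_frequency(n, eps=0.1), chebyshev_bound(0.25, n, eps=0.1))
```

Both columns shrink as n grows, and the simulated frequency typically falls well below the bound, which is a reminder that Chebyshev’s inequality is usually far from tight.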
Suppose that A1, A2, . . . , An, . . . is a sequence of independent events with P(An) = p
for all n ≥ 1. Define random variables Xn via
\[ X_n = \begin{cases} 1 & \text{if } A_n \text{ occurs,} \\ 0 & \text{otherwise.} \end{cases} \tag{4.1} \]
Since E[Xi] = p, the law of large numbers is, in this case, a result which is in accord[1]
with the frequentist interpretation of probability: in many repetitions of the activity with
the uncertain outcome[2], there is some characteristic frequency with which a
particular outcome tends to occur.
The assumption that the variance of Xi exists in the theorem is not necessary for the
conclusion to hold, at least if we assume the Xi are identically distributed, but the proof
is then more involved. The assumption that the mean of Xi exists is, however, essential;
otherwise the behaviour can be very different indeed.
Example 25. Consider a random path along the edges of an n × n chess board. The path
starts at the bottom left corner, must always move in a direction towards the top right
corner and finishes there. All such paths are equally likely. The figure shows such a path
on a 4 × 4 chess board. Suppose n = 100; use Chebyshev’s inequality to estimate the
probability that the path passes no more than 10 squares away from the centre of the
chess board.
Define random variables X1, X2, . . . , X2n by setting Xi = 1 if the ith edge of the path is
“upwards” and Xi = −1 if the ith edge of the path is “rightwards”. The joint distribution
of X1, X2, . . . , X2n is uniform on the set
\[ \Big\{ x \in \{-1, 1\}^{2n} : \sum_{i=1}^{2n} x_i = 0 \Big\}. \]
[1] This argument can seem a bit circular at first: we have proved something that justifies what we
assumed probability meant in the first place.
[2] Think of An as describing the outcome of the nth repetition of an activity like tossing a (biased)
coin.
Figure 4.1: A path on a 4 × 4 chess board
It follows[3] that
\[ P(X_i = 1) = \frac{1}{2} \quad \text{for each } i, \]
and[4]
\[ P(X_i = 1 \text{ and } X_j = 1) = \frac{n-1}{4n-2} \quad \text{for all } i \ne j. \]
From this we find that E[Xi] = 0 and E[Xi²] = 1 for each i, and that E[XiXj] = −1/(2n − 1) for i ≠ j.
Now assume n is even; then the path passes more than k squares from the centre of
the board if and only if
\[ \Big| \sum_{i=1}^{n} X_i \Big| \ge 2(k + 1). \]
Since this sum has mean zero, we compute the variance of $\sum_{i=1}^{n} X_i$ as
\[ \operatorname{var}\Big( \sum_{i=1}^{n} X_i \Big) = E\Big[ \Big( \sum_{i=1}^{n} X_i \Big)^2 \Big] = \sum_{1 \le i, j \le n} E[X_i X_j] = nE[X_1^2] + n(n-1)E[X_1 X_2] = \frac{n^2}{2n-1}. \]
Now Chebyshev’s inequality gives us that the probability of not passing within k squares
of the centre is at most
\[ \frac{n^2}{4(k+1)^2(2n-1)}. \]
This is approximately 0.104 when n = 100 and k = 10, so there is at least an 89% chance of
the path passing within 10 squares of the centre of the board.
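The quantities in Example 25 can be computed exactly, which shows how conservative the Chebyshev bound is here. The sketch below rests on one observation that is not spelled out in the text: if u of the first n edges are upwards then the sum X1 + · · · + Xn equals 2u − n, and there are C(n, u) C(n, n − u) such paths out of C(2n, n). The function names are illustrative.

```python
from fractions import Fraction
from math import comb

def chebyshev_bound(n, k):
    """The bound n^2 / (4 (k+1)^2 (2n-1)) from the example."""
    return Fraction(n * n, 4 * (k + 1) ** 2 * (2 * n - 1))

def second_moment(n):
    """Exact E[(X_1 + ... + X_n)^2] when all C(2n, n) paths are equally likely."""
    weighted = sum((2 * u - n) ** 2 * comb(n, u) * comb(n, n - u) for u in range(n + 1))
    return Fraction(weighted, comb(2 * n, n))

def exact_probability(n, k):
    """Exact P(|X_1 + ... + X_n| >= 2(k+1))."""
    far = sum(comb(n, u) * comb(n, n - u)
              for u in range(n + 1) if abs(2 * u - n) >= 2 * (k + 1))
    return Fraction(far, comb(2 * n, n))

print(second_moment(100), Fraction(100**2, 199))
print(float(chebyshev_bound(100, 10)), float(exact_probability(100, 10)))
```

The second moment matches n²/(2n − 1), and the exact probability turns out to be far smaller than the Chebyshev bound.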
4.2 Convergence in Distribution
The law of large numbers says something about the distribution of the random variable of
interest, Sn/n, when n is large. It tells us that the mass of the distribution is concentrated
near the point µ = E[Sn /n]. In other problems we may be interested in a sequence of
random variables having distributions which have some other limiting behaviour. This is
best illustrated through examples.
Example 26. Suppose that (Xn, n ≥ 1) is a sequence of independent random variables
with each Xn uniformly distributed on [0, 1]. Let $M_n = \min_{1 \le i \le n} X_i$. We can find the
distribution of Mn easily: for any 0 ≤ x ≤ 1,
\[ P(M_n > x) = P(X_1 > x, X_2 > x, \ldots, X_n > x) = (1 - x)^n. \]
So now consider nMn. The distribution of this random variable is described by the fact
that
\[ P(nM_n > x) = P(M_n > x/n) = (1 - x/n)^n, \]
provided that 0 ≤ x ≤ n. Letting n tend to infinity, and using a celebrated fact from
analysis, we obtain
\[ P(nM_n > x) \to e^{-x}, \tag{4.3} \]
for every x ≥ 0. Recall that if X is a random variable that has the exponential
distribution with rate 1, then
\[ P(X > x) = e^{-x}, \]
for all x ≥ 0. Comparing this with (4.3) we see that as n grows large the distribution of nMn
gets close to the distribution of X. In fact the density of the distribution of nMn is
\[ f_{nM_n}(x) = \begin{cases} (1 - x/n)^{n-1} & \text{if } 0 \le x \le n, \\ 0 & \text{otherwise,} \end{cases} \tag{4.4} \]
which converges to the density of the exponential distribution for each x ∈ R as illustrated
in Figure 4.2.
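The convergence in (4.3) is easy to watch numerically. A minimal sketch, in which the function name and the chosen value of x are arbitrary:

```python
import math

def survival_scaled_min(x, n):
    """P(n * M_n > x) for the minimum M_n of n independent Uniform[0, 1] variables."""
    if x < 0:
        return 1.0
    if x > n:
        return 0.0
    return (1 - x / n) ** n

x = 1.5
for n in (2, 8, 32, 128, 512):
    print(n, survival_scaled_min(x, n), math.exp(-x))
```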
In the preceding example the densities of the random variables of interest converged.
But the next example shows that it makes sense to talk about convergence in distribution
without densities existing at all.
Example 27. Suppose that (Xn, n ≥ 1) is a sequence of random variables with the
distribution of Xn being the discrete uniform distribution on {1, 2, 3, . . . , n}. Should we say Xn
converges in distribution as n tends to infinity? It is certainly the case that the mass functions
of Xn converge: for each positive integer k the mass function fXn(k) = 1/n provided
n ≥ k, so
fXn(k) → 0 as n → ∞.
But there is no random variable X whose distribution could be described by this limit.
Figure 4.2: The density of the random variable nMn for n = 2 (green), n = 4 (red), n = 8
(cyan), n = 16 (blue) and n = 32 (purple).
Consider instead the random variables Xn/n. These random variables have discrete
distributions, but the support of the distribution of Xn/n changes with n in such a way
that it is impossible to say a limit of the mass function exists. However if we look at the
probabilities
\[ P(X_n/n \le x) = \frac{\lfloor nx \rfloor}{n} \quad \text{for any } x \in [0, 1] \tag{4.5} \]
then we see (see Figure 4.3) that these converge to x as n tends to infinity, which is the
value at x of the distribution function of the uniform distribution on [0, 1].
[5] Notice that this is exactly the form of the statement made in the law of large numbers.
Figure 4.3: The cumulative distribution function of Xn/n for n = 2, 10 and 100.
The limiting distribution puts mass one at the single point c, so if[6] P(X = c) = 1, then
we have
P(Xn ≤ x) → P(X ≤ x) = 0 for all x < c, (4.9)
and
P(Xn ≤ x) → P(X ≤ x) = 1 for all x > c. (4.10)
But it is important to notice that we don’t have the same statement holding at x = c:
P(Xn ≤ c) = Φ(−1) ≠ 1 = P(X ≤ c), (4.11)
where, here, Φ denotes the cumulative distribution function of the standard Gaussian
distribution.
The following definition encompasses all three of the previous examples.
Definition 16. Let (Xn; n ≥ 1) be a sequence of random variables, and X some further
random variable. Then we say Xn converges in distribution to X as n tends to infinity,
written
\[ X_n \xrightarrow{\text{dist}} X \quad \text{as } n \to \infty, \]
if the distribution functions FXn of Xn, and FX of X, satisfy
\[ F_{X_n}(x) \to F_X(x) \quad \text{as } n \to \infty, \]
for every x ∈ R at which the function FX is continuous.
[6] It might seem strange to treat this as a probability distribution, since the random variable X isn’t “random”
at all, but it is perfectly ok to do this.
To prove this we need to verify that P(Xn < a) converges to P(X < a), which we do as
follows. By continuity, for any ε > 0 there exists an h > 0 so that
This shows that P(Xn < a) converges to FX (a) which, by continuity of the distribution
of X is equal to P(X < a).
We now turn to look at distributions which are discrete, and whose support is always
a subset of Z. In this case convergence in distribution is equivalent to convergence of
mass functions. Suppose (Xn , n ≥ 1) is a sequence of random variables, each of which has
a discrete distribution with support contained in Z. Let X be a further random variable
whose support is contained in Z. Then Xn converges in distribution to X if, and only if,
for all k ∈ Z,
P(Xn = k) → P(X = k) as n → ∞. (4.13)
Example 29. Fix a constant λ > 0. Suppose, for n > λ, that Xn has the Binomial
distribution with parameters λ/n and n. Then Xn converges in distribution to X where
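X has a Poisson distribution with parameter λ. To verify this we just compute, for
a fixed integer k ≥ 0,
\[ \lim_{n \to \infty} \binom{n}{k} \left( \frac{\lambda}{n} \right)^k \left( 1 - \frac{\lambda}{n} \right)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}. \tag{4.14} \]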
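The limit (4.14) can be observed by tabulating the Binomial mass function with parameters λ/n and n for increasing n. This is a sketch: the function names and the choice λ = 3 are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X with the Binomial distribution with parameters p and n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X with the Poisson distribution with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)

lam = 3.0   # illustrative choice of the constant
for n in (10, 100, 1000, 10000):
    print(n, [round(binomial_pmf(k, n, lam / n), 5) for k in range(4)])
print("limit", [round(poisson_pmf(k, lam), 5) for k in range(4)])
```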
4.3 The Central Limit Theorem
Stirling’s formula:
\[ \lim_{n \to \infty} \frac{n!}{\sqrt{2\pi} \, n^{n + \frac{1}{2}} e^{-n}} = 1. \]
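Stirling’s formula can be checked numerically. Since n! overflows floating point arithmetic very quickly, the sketch below works with logarithms, using `math.lgamma(n + 1)` for log n!. The function name is an illustrative choice.

```python
import math

def stirling_ratio(n):
    """n! divided by sqrt(2*pi) * n**(n + 1/2) * exp(-n), computed via logarithms."""
    log_factorial = math.lgamma(n + 1)
    log_stirling = 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n
    return math.exp(log_factorial - log_stirling)

for n in (1, 10, 100, 10_000):
    print(n, stirling_ratio(n))
```

The ratio decreases towards 1 as n grows, and is already within one percent of 1 at n = 10.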
Theorem 5. Fix p ∈ (0, 1). For n = 1, 2, 3, . . . let Xn have the Binomial distribution
with parameters n and p. Then for all −∞ ≤ z1 < z2 ≤ ∞,
\[ P\big( np + z_1\sqrt{np(1-p)} \le X_n \le np + z_2\sqrt{np(1-p)} \big) \to \int_{z_1}^{z_2} \phi(z) \, dz \quad \text{as } n \text{ tends to infinity,} \]
where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard Gaussian density.
In the language of the previous lecture we say that $(X_n - np)/\sqrt{np(1-p)}$ converges in
distribution to a random variable having the standard Gaussian distribution. Remember
that, for a Binomially distributed random variable, E[Xn] = np and var(Xn) = np(1 − p).
So the mean and variance of $(X_n - np)/\sqrt{np(1-p)}$ are 0 and 1, which agree with the
mean and variance of the limiting distribution.
[7] Don’t try to check this for yourself; it’s not particularly difficult but it’s quite messy!
This theorem suggests approximating suitable sums of Binomial probabilities with
probabilities calculated from the Gaussian density, but the theorem doesn’t tell us how
good the approximation will be. In fact these approximations are very good even if n is
only moderately large (such as n = 20), provided neither p nor 1 − p is too small.
Example 30. Suppose we toss a fair coin 10,000 times. Then the number of heads X has
the Binomial distribution with parameters n = 10,000 and p = 1/2. We know that the
most likely value of X is np = 5,000. The probability of getting exactly 5,000 heads is,
by (4.15), approximately 2φ(0)/√n = 0.008 (3 d.p.). The law of large numbers tells us
that we should expect to observe a value of X close to 5,000, but it gives no
information that allows us to make sense of “close” in a more quantitative way. Using
the Gaussian approximation to the Binomial distribution helps. Taking[8] z1 = −1.96,
z2 = 1.96 we have
\[ P(4902 \le X \le 5098) \approx \int_{-1.96}^{1.96} \phi(z) \, dz = 0.95. \]
We could use Chebyshev to give a bound on the probability of the same event; this would
give
\[ P(4902 \le X \le 5098) \ge 1 - \operatorname{var}(X)/99^2 = 0.745 \text{ (3 d.p.)}. \]
In this case the Gaussian approximation is accurate to about 0.01.
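Python’s arbitrary-precision integers make it possible to compute the Binomial probabilities of Example 30 exactly and set them beside the Gaussian approximation and the Chebyshev bound. This is a sketch; the function names are illustrative, and the Gaussian integral is evaluated with the error function `math.erf` rather than a table.

```python
import math
from math import comb

n = 10_000
total = 2**n                          # number of equally likely head/tail sequences

def exact_probability(low, high):
    """Exact P(low <= X <= high) for X Binomial with n tosses and p = 1/2."""
    return sum(comb(n, k) for k in range(low, high + 1)) / total

def gaussian_probability(z1, z2):
    """Integral of the standard Gaussian density over [z1, z2], via the error function."""
    return 0.5 * (math.erf(z2 / math.sqrt(2)) - math.erf(z1 / math.sqrt(2)))

exact = exact_probability(4902, 5098)
approx = gaussian_probability(-1.96, 1.96)
chebyshev = 1 - (n / 4) / 99**2       # var(X) = n p (1 - p) = n / 4
print(exact, approx, chebyshev)
print(exact_probability(5000, 5000), 2 / math.sqrt(2 * math.pi * n))
```

The last line compares the exact probability of exactly 5,000 heads with the approximation 2φ(0)/√n.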
Theorem 5 turns out to be a special case of a much, much more general result known
as the Central Limit Theorem. Before we get to this, we need to learn how to use moment
generating functions to investigate the convergence of distributions. Recall that if X is a
random variable, then its moment generating function MX(t) is defined by
\[ M_X(t) = E[e^{tX}] \]
for all t ∈ R such that the expectation exists. In fact, either the moment generating
function exists only at t = 0, in which case it is essentially useless, or there exists some
δ > 0 so that it exists for t ∈ (−δ, δ). Possibly it exists for all t ∈ R, but this isn’t necessary
for it to be useful.
We will make use of the following property of moment generating functions, which
explains their name.
Proposition 25. Suppose that X is a random variable with moment generating function
$M(t) = E[e^{tX}]$ existing for t ∈ (−δ, δ) for some δ > 0. Then
\[ M(t) = \sum_{n=0}^{\infty} \frac{m_n}{n!} t^n \]
where $m_n = E[X^n]$, the nth moment of X, exists for every n ≥ 1, and the power series
converges for all t ∈ (−δ, δ).
[8] These numbers come from a table of the Gaussian distribution function. See
https://www.mathsisfun.com/data/standard-normal-distribution-table.html
It is easy to see why this proposition should be true if one replaces etX by its Taylor
expansion, and then uses the linearity of expectations. However because it deals with an
infinite sum, justifying the use of linearity here requires some sophisticated analysis.
The key property of moment generating functions that makes them so useful, is that
they characterise distributions.
Theorem 6. Suppose that X and Y are two random variables whose moment generating
functions are defined and equal for all |t| < δ for some δ > 0. Then X and Y have the
same distribution.
Closely related to this uniqueness property, and of interest to us now, is that the
convergence of moment generating functions implies convergence of distributions.
Theorem 7. Suppose that (Xn , n ≥ 1) is a sequence of random variables such that the
moment generating function Mn (t) of Xn exists if |t| < δ for some δ > 0 not depending
on n. Suppose further that X is a random variable whose moment generating function
M (t) also exists for |t| < δ and that
Mn (t) → 1 as n → ∞. (4.17)
Now M(t) = 1 for all t ∈ R is the moment generating function of X satisfying P(X =
0) = 1, and so Xn converges in distribution to such an X. This is a result which is very
easy to check directly from Definition 16, and so here using moment generating functions
was overkill, but it is still interesting to see how the argument works.
Suppose on the other hand that Xn has the exponential distribution with rate
parameter 1/n. Then the moment generating function of Xn is
\[ M_n(t) = \frac{1}{1 - nt} \quad \text{for } t < 1/n. \tag{4.18} \]
So now, the limit is given by
\[ \lim_{n \to \infty} \frac{1}{1 - nt} = \begin{cases} 0 & \text{if } t \ne 0, \\ 1 & \text{if } t = 0. \end{cases} \tag{4.19} \]
But in fact this is not the limit of the functions Mn, which are only defined if t < 1/n.
Because 1/n → 0 as n → ∞, there is no choice of δ > 0 so as to satisfy the conditions in
the statement of Theorem 7. In fact this sequence of random variables Xn does not converge
in distribution at all.[10]
[9] This might worry you, since Mn(t) isn’t defined for every t ∈ R, but it’s okay because for each t,
Mn(t) is defined for all n > 1/t and so it still makes sense to talk about a limit; you can always throw
away the first few terms of the sequence of random variables.
Example 32. Suppose that Xn is uniformly distributed on the set {1, 2, . . . , n}. Then the
moment generating function of Xn is given by Mn(0) = 1 and
\[ M_n(t) = \frac{e^t}{n} \, \frac{e^{tn} - 1}{e^t - 1} \quad \text{for } t \in \mathbb{R} \setminus \{0\}. \tag{4.20} \]
The moment generating function of the random variable Xn/n is given by
\[ E[e^{tX_n/n}] = M_n(t/n) = \frac{e^{t/n}}{n} \, \frac{e^t - 1}{e^{t/n} - 1}. \tag{4.21} \]
Since (e^x − 1)/x → 1 as x → 0, we have, for a fixed t ≠ 0,
\[ \lim_{n \to \infty} M_n(t/n) = \frac{e^t - 1}{t}. \tag{4.22} \]
Since M(t) = (e^t − 1)/t for t ≠ 0 and M(0) = 1 is the moment generating function of a
random variable X with the uniform distribution on the interval [0, 1], this is in agreement
with Example 27.
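Both the closed form (4.21) and the limit (4.22) can be checked against a direct sum over the support of Xn. A sketch with illustrative function names:

```python
import math

def mgf_scaled_uniform(t, n):
    """E[exp(t * X_n / n)] for X_n uniform on {1, ..., n}, summed directly."""
    return sum(math.exp(t * k / n) for k in range(1, n + 1)) / n

def mgf_closed_form(t, n):
    """The right-hand side of (4.21), valid for t != 0."""
    return math.exp(t / n) / n * math.expm1(t) / math.expm1(t / n)

def mgf_uniform_interval(t):
    """Moment generating function of the uniform distribution on [0, 1]."""
    return 1.0 if t == 0 else math.expm1(t) / t

t = 1.0
for n in (2, 10, 100, 1000):
    print(n, mgf_scaled_uniform(t, n), mgf_closed_form(t, n), mgf_uniform_interval(t))
```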
Here is a useful result from first year analysis.
Lemma 1. Suppose (an, n ≥ 1) is a sequence of numbers satisfying an → 1 as n tends to
infinity, and
\[ \lim_{n \to \infty} n(a_n - 1) = b. \]
Then $a_n^n \to e^b$ as n tends to infinity.
Example 33. Suppose that Xn has the Binomial distribution with parameters 1/2 and n.
Then the moment generating function of Xn is given by
\[ M_n(t) = \left( \frac{1 + e^t}{2} \right)^n \quad \text{for } t \in \mathbb{R}. \tag{4.23} \]
Let’s consider the random variable $(2X_n - n)/\sqrt{n}$. Notice that the mean of this random
variable is zero, and its variance one. Its moment generating function is given by
\[ M_n(2t/\sqrt{n}) \, e^{-t\sqrt{n}} = e^{-t\sqrt{n}} \left( \frac{1 + e^{2t/\sqrt{n}}}{2} \right)^n = \left( \frac{e^{-t/\sqrt{n}} + e^{t/\sqrt{n}}}{2} \right)^n. \tag{4.24} \]
Since by Taylor expanding the exponentials we obtain
\[ \frac{e^{-t/\sqrt{n}} + e^{t/\sqrt{n}}}{2} = 1 + \frac{t^2}{2n} + \ldots, \]
Lemma 1 implies that
\[ M_n(2t/\sqrt{n}) \, e^{-t\sqrt{n}} \to e^{t^2/2} \quad \text{as } n \to \infty, \tag{4.25} \]
and we recognize the limit as the moment generating function of the standard Gaussian
distribution. This proves Theorem 5 in the case p = 1/2.
[10] Despite the fact their distribution functions satisfy $\lim_{n \to \infty} F_n(x) = 0$ for all x ∈ R. This is because
F(x) = 0 for all x ∈ R isn’t a distribution function corresponding to any probability distribution on R.
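The right-hand side of (4.24) is cosh(t/√n) raised to the power n, so the convergence (4.25) is easy to inspect numerically. A sketch with illustrative function names:

```python
import math

def mgf_standardised_binomial(t, n):
    """MGF of (2 X_n - n) / sqrt(n) for X_n Binomial(1/2, n): cosh(t / sqrt(n)) ** n."""
    return math.cosh(t / math.sqrt(n)) ** n

def mgf_standard_gaussian(t):
    """Moment generating function exp(t^2 / 2) of the standard Gaussian distribution."""
    return math.exp(t * t / 2)

t = 1.0
for n in (1, 10, 100, 10_000):
    print(n, mgf_standardised_binomial(t, n), mgf_standard_gaussian(t))
```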
Now we come to one of the deepest and most beautiful results in probability and
statistics. The proof is a straightforward generalization[11] of the argument we used in the
preceding example.
Theorem 8 (The Central Limit Theorem). Let X1, X2, X3, . . . be a sequence of independent
and identically distributed random variables having common mean µ and variance
σ². Let $S_n = \sum_{k=1}^{n} X_k$. Then, as n tends to infinity,
\[ \frac{S_n - n\mu}{\sqrt{n\sigma^2}} \]
converges in distribution to a random variable having the standard Gaussian distribution.
Proof. We give a proof under the additional assumption[12] that the moment generating
function of Xi exists. Let
\[ M(t) = E[\exp\{t(X_1 - \mu)/\sigma\}], \]
which we assume exists for t ∈ (−δ, δ) for some δ > 0. Now compute the moment
generating function of $S_n^* = \frac{S_n - n\mu}{\sqrt{n\sigma^2}}$, using independence, as follows:
\[ M_n(t) = E[e^{tS_n^*}] = E\Big[ \exp\Big\{ \frac{t}{\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu)/\sigma \Big\} \Big] = M(t/\sqrt{n})^n. \]
What is remarkable in this result is that the limiting distribution is always the same,
no matter what the distribution of the random variables Xi, subject to the assumption
of having a variance. This phenomenon is called[13] the universality of the Gaussian
distribution.
In Theorem 8, we assume the distribution of each Xi is the same and the random
variables are independent of each other. There are other results which relax these
assumptions and give Gaussian approximations for sums of random variables each having
different distributions, and which may be (weakly) dependent. This is the reason the
Gaussian distribution is such a good model for “random errors” in many situations: such
errors arise by combining many smaller random effects. For similar reasons the log-normal
distribution is an excellent model for the variability of such diverse quantities as stock
market prices and blood pressures.
You will see the central limit theorem used to prove results in statistics about the
distribution of estimators of unknown parameters in ST219. The Gaussian distribution
(because of the central limit theorem) also appears in physics, geometry and even number
theory.
[13] See https://terrytao.files.wordpress.com/2011/01/universality.pdf for more about this.
4.4 Convergence of random variables
We have previously given a definition of the statement “random variables X1, X2, . . . converge
in distribution to a random variable X”. This is standard terminology used by everyone,
but it is actually quite misleading. Remember that from the very beginning of the module
we stressed that a random variable and its distribution are very different mathematical
objects. When we write $X_n \xrightarrow{\text{dist}} X$, what we are really saying is that the distribution of
Xn becomes close, in a certain sense, to the distribution of X as n increases. But this
does not necessarily mean that the random variable Xn gets close to the random variable
X in any reasonable sense.
Example 34. Remember random variables are functions on some sample space Ω. Our
sample space could have just two sample points Ω = {H, T }, with each sample point
being equally likely. Then let
Xn (H) = 1 and Xn (T ) = 0 if n is even,
and
Xn (T ) = 1 and Xn (H) = 0 if n is odd.
It’s pretty clear that every Xn has the same distribution: P(Xn = 0) =
P(Xn = 1) = 1/2 for each n. So there is no doubt that this sequence of random variables
converges in distribution. But as functions, the sequence Xn is alternating between two
fixed functions, which are different, and in no sense getting close together as n increases.
This is reflected in what happens if we conduct the corresponding experiment: when we
observe the values of the sequence of random variables we either see 1, 0, 1, 0, 1, 0, 1, . . . if we
choose ω = T , or we see 0, 1, 0, 1, 0, . . . if we choose ω = H. In neither eventuality does the
sequence we observe converge!
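Example 34 is small enough to write out in full. The following sketch (the names OMEGA, PROB, X and distribution are chosen here for illustration) treats each Xn literally as a function on the two-point sample space, computes its distribution exactly, and lists the values observed along each sample point.

```python
from fractions import Fraction

OMEGA = ("H", "T")                               # the two sample points
PROB = {w: Fraction(1, 2) for w in OMEGA}        # each equally likely

def X(n: int, omega: str) -> int:
    """The random variable X_n of Example 34, as a function on the sample space."""
    if n % 2 == 0:
        return 1 if omega == "H" else 0
    return 1 if omega == "T" else 0

def distribution(n: int) -> dict:
    """Probability mass function of X_n, found by summing P over sample points."""
    pmf = {0: Fraction(0), 1: Fraction(0)}
    for w in OMEGA:
        pmf[X(n, w)] += PROB[w]
    return pmf

# Every X_n has the same distribution ...
print([distribution(n) == distribution(1) for n in range(1, 7)])
# ... but along either sample point the observed values alternate for ever.
for w in OMEGA:
    print(w, [X(n, w) for n in range(1, 9)])
```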
A slightly less artificial example of the same phenomenon would be to take X1 , X2 , . . .
to be independent with P(Xn = 0) = P(Xn = 1) = 1/2 for each n. Then, again, since the
distribution of Xn doesn’t vary with n at all, it is trivially true that $X_n \xrightarrow{\text{dist}} X$ where
P(X = 0) = P(X = 1) = 1/2. But if we observed values for the random variables
X1 , X2 , . . ., for example generated by repeatedly tossing a coin, we would see a “random”
sequence of 0s and 1s with no tendency to converge to any particular value.
An even more interesting example occurs in the setting of the Central Limit Theorem.
The random variables $S_n^*$ converge in distribution to a random variable having the
standard Gaussian distribution. However, if we generate observations of the sequence of
random variables $S_1^*, S_2^*, \ldots$ then the values obtained will oscillate, with the oscillations
slowly becoming larger as n increases. There is a beautiful theorem in advanced probability
theory called the Law of the Iterated Logarithm that describes this oscillating
behaviour very precisely.
We want a notion of convergence for random variables that relates to the values we
would observe if the experiment was conducted. Your first guess for a definition that
captures this idea might be to say Xn converges to X if Xn (ω) converges to X(ω) for
every ω ∈ Ω, the sample space. This is a good idea; it doesn’t quite work, but it can
be fixed. But fixing it turns out to be quite interesting! So we will start with something
easier, if slightly less intuitive.
Definition 17. Let X1 , X2 , . . . be a sequence of random variables, and X a further random
variable, all defined on the same sample space. Then we say Xn converges to X in
probability, written
$$X_n \xrightarrow{\text{prob}} X,$$
if for every $\varepsilon > 0$,
$$P(|X_n - X| > \varepsilon) \to 0, \quad \text{as } n \to \infty.$$
This definition should remind you of the statement of the law of large numbers, the
conclusion of which we can now write as
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{prob}} \mu$$
where µ = E[Xi ].
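To see the definition at work in the simplest case of the law of large numbers, the sketch below computes P(|Sn /n − p| > ε) exactly when Sn counts the successes in n independent trials with success probability p. The choice of Bernoulli trials, the values p = 0.5 and ε = 0.05, and the function names are assumptions made for this illustration only.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(S_n = k) for S_n ~ Binomial(n, p), computed on the log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def deviation_probability(n: int, p: float, eps: float) -> float:
    """Exact P(|S_n / n - p| > eps) for the number of successes S_n in n trials."""
    # The small tolerance guards against floating-point ties at the boundary.
    return sum(binomial_pmf(k, n, p) for k in range(n + 1)
               if abs(k - n * p) > n * eps + 1e-9)

for n in (10, 100, 1_000, 10_000):
    print(n, deviation_probability(n, 0.5, 0.05))
```

The printed probabilities shrink towards zero as n grows, which is what convergence in probability of the sample mean to µ = p asserts for this fixed ε.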
Example 35. Suppose X is some random variable, and $\epsilon_1, \epsilon_2, \ldots$ a sequence of random
variables each having the standard Gaussian distribution. Let
$$X_n = X + \frac{1}{n}\epsilon_n,$$
so we can think of Xn as a noisy measurement of X, with an error that tends to decrease
with n. We have $X_n \xrightarrow{\text{prob}} X$ because, for every threshold $\varepsilon > 0$,
$$P(|X_n - X| > \varepsilon) = P(|\epsilon_n| > n\varepsilon) = \frac{2}{\sqrt{2\pi}}\int_{n\varepsilon}^{\infty} e^{-z^2/2}\,dz \to 0.$$
Notice that we don’t have to evaluate the integral to see that it converges to 0: this is a
consequence simply of the lower limit $n\varepsilon$ tending to infinity and the fact that the integral
over the whole half-line is finite.
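Although the integral does not need to be evaluated, it is easy to do so numerically. A minimal sketch, using the standard identity P(|Z| > a) = erfc(a/√2) for a standard Gaussian Z; the threshold 0.1 and the function name are illustrative choices.

```python
import math

def tail_probability(n: int, eps: float) -> float:
    """P(|eps_n| > n * eps) for a standard Gaussian noise variable eps_n."""
    # For Z ~ N(0, 1): P(|Z| > a) = erfc(a / sqrt(2)).
    return math.erfc(n * eps / math.sqrt(2))

eps = 0.1
for n in (1, 10, 50, 100):
    print(n, tail_probability(n, eps))
```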
Now let’s go back to that first idea of convergence for random variables. We need to
consider the possibility that there might be sample points in the sample space for which
it is not true that Xn (ω) converges to X(ω). But providing the set of all these “bad”
sample points has probability zero, it shouldn’t stop us from saying Xn converges to X,
because when we conduct the experiment we will never choose[14] one of these bad ω.
Example 36. Suppose that X is some random variable. Define random variables
$$X_n = \lfloor nX \rfloor.$$
Then no matter what the value of X, we have $|X_n/n - X| \le 1/n$ and consequently
$X_n(\omega)/n \to X(\omega)$ for every sample point in the sample space on which X is defined.
Thus $X_n/n \xrightarrow{\text{a.s.}} X$.
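The bound |Xn /n − X| ≤ 1/n can be checked directly for any observed value of X. In the sketch below the value x = π − 3 is an arbitrary stand-in for an observation X(ω), and the function name is hypothetical.

```python
import math

def scaled_floor(x: float, n: int) -> float:
    """X_n / n where X_n = floor(n X), evaluated at an observed value x of X."""
    return math.floor(n * x) / n

x = math.pi - 3          # hypothetical observed value of X
for n in (1, 10, 100, 1000, 10_000):
    approx = scaled_floor(x, n)
    # Third column: whether the bound |X_n / n - X| <= 1 / n holds.
    print(n, approx, abs(approx - x) <= 1 / n)
```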
But in general the event {ω ∈ Ω : Xn (ω) → X(ω) as n → ∞} can be complicated,
and it can be difficult to compute its probability. A key tool is the following notion.
Definition 19. Let A1 , A2 , . . . be a sequence of events, all subsets of some sample space
Ω. Then the event An occurs infinitely often is defined to be the event
$$\{\omega \in \Omega : \omega \in A_n \text{ for infinitely many } n\} = \bigcap_{m=1}^{\infty}\bigcup_{n=m}^{\infty} A_n.$$
If $\sum_{n=1}^{\infty} P(A_n) < \infty$ then
where we have estimated the integral using
$$\int_{n}^{\infty} \frac{2\,dz}{\pi(1+z^2)} \ge \frac{1}{\pi}\int_{n}^{\infty} \frac{dz}{z^2} = \frac{1}{\pi n}.$$
Consequently by
the second Borel-Cantelli Lemma, $P(|X_n - X| > 1 \text{ infinitely often}) = 1$ and Xn cannot
be converging almost surely to X.
This is strange! What is happening? Convergence in probability of Xn to X tells us
that the error |Xn − X| will be below any given threshold with probability that tends to
one as n tends to infinity. But nevertheless, occasionally large errors occur, and although
these become increasingly rare as n increases they continue to occur indefinitely.
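Now suppose instead that
$$X_n = X + \frac{1}{n^2}\epsilon_n.$$
Then, fixing an arbitrary $\varepsilon > 0$, let $A_n$ be the event $\{|X_n - X| > \varepsilon\} = \{|\epsilon_n| > n^2\varepsilon\}$.
Then
$$\sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty}\int_{n^2\varepsilon}^{\infty} \frac{2\,dz}{\pi(1+z^2)} \le \frac{2}{\pi\varepsilon}\sum_{n=1}^{\infty}\frac{1}{n^2} < \infty.$$
Thus by the first Borel-Cantelli Lemma $P(|X_n - X| > \varepsilon \text{ infinitely often}) = 0$ and Xn
does converge almost surely to X.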
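The contrast between the two scalings can be seen numerically. The sketch below assumes, as the integrand 2/(π(1 + z²)) indicates, that the noise variables have the standard Cauchy distribution, for which P(|C| > a) = (2/π) arctan(1/a) when a > 0. It compares partial sums of the tail probabilities for noise scaled by 1/n and by 1/n², with threshold 1; the function names are hypothetical.

```python
import math

def cauchy_tail(a: float) -> float:
    """P(|C| > a) for a standard Cauchy variable C and a > 0."""
    # The integral of 2 / (pi (1 + z^2)) from a to infinity equals (2 / pi) * atan(1 / a).
    return 2 / math.pi * math.atan(1 / a)

def partial_sum(scale_power: int, eps: float, terms: int) -> float:
    """Sum over n = 1..terms of P(|eps_n| > n**scale_power * eps)."""
    return sum(cauchy_tail(n ** scale_power * eps) for n in range(1, terms + 1))

eps = 1.0
for terms in (10, 1_000, 100_000):
    # scale_power = 1 corresponds to noise eps_n / n, scale_power = 2 to noise eps_n / n^2.
    print(terms, partial_sum(1, eps, terms), partial_sum(2, eps, terms))
```

The first column of sums keeps growing (roughly like a multiple of log n), which is the situation of the second Borel-Cantelli Lemma, while the second stays below the bound 2/(πε) · π²/6 used with the first Borel-Cantelli Lemma.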
4.5 More on convergence of random variables
The relationship between the three notions of convergence we have discussed is given by the
following theorem; the reverse implications do not hold.
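Proof. We will give the proof of the first implication. So suppose that (Xn , n ≥ 1) is a
sequence of random variables converging almost surely to a limit X.
Fix some $\varepsilon > 0$, and consider the events $A_n = \{|X_n - X| > \varepsilon\}$ together with
$$E_m = \bigcup_{n=m}^{\infty} A_n \quad\text{and}\quad A = \bigcap_{m=1}^{\infty} E_m = \{|X_n - X| > \varepsilon \text{ infinitely often}\}.$$
If ω ∈ A then Xn (ω) does not converge to X(ω), so almost sure convergence gives P(A) = 0.
The events Em decrease as m increases,
and consequently[15]
$$\lim_{m\to\infty} P(E_m) = P\Big(\bigcap_{m=1}^{\infty} E_m\Big).$$
But the right-hand side is the probability of the event A and so zero. Finally we have
An ⊆ En for each n and so since
$$0 \le P(A_n) \le P(E_n),$$
we deduce that
$$\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = \lim_{n\to\infty} P(A_n) = 0.$$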
[15] This is a fundamental property of the probability measure P which follows from countable additivity;
you may have seen it in ST115. If not, don’t worry: using countable additivity of P is very important in
more advanced work, and you may do more of this in modules next year.
We have seen examples in the previous lecture which show that convergence in prob-
ability does not imply convergence almost surely, and neither does convergence in distri-
bution imply convergence in probability in general. The exception is the special case in
which the limit is a constant c. In this case convergence in probability and convergence
in distribution are equivalent and we usually write $X_n \xrightarrow{\text{prob}} c$.
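Example 38. This is a continuation of Example 36. Suppose the random variable X has
the exponential distribution so that $P(X > t) = e^{-t}$ for any t ≥ 0. Then $X_n = \lfloor nX \rfloor$ has
a geometric distribution, for we have, for any non-negative integer k,
$$P(X_n = k) = P\big(k/n \le X < (k+1)/n\big) = e^{-k/n} - e^{-(k+1)/n} = \big(e^{-1/n}\big)^{k}\big(1 - e^{-1/n}\big).$$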
We saw that Xn /n tends to X almost surely. From the preceding Theorem we can
conclude that if X1 , X2 , . . . , Xn , . . . are random variables having geometric distributions
with parameters $(1 - e^{-1/n})$ then $X_n/n \xrightarrow{\text{dist}} X$ where X has the exponential distribution
with rate parameter 1.
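This convergence in distribution can also be checked directly from the distribution functions. The sketch below assumes the geometric distribution is the one on {0, 1, 2, . . .} with P(Xn = k) = (1 − r)r^k and r = e^{−1/n}, and compares P(Xn /n ≤ x) with 1 − e^{−x} on a grid of points; the grid and the function names are illustrative choices.

```python
import math

def geometric_cdf_scaled(x: float, n: int) -> float:
    """P(X_n / n <= x) when P(X_n = k) = (1 - r) r**k, r = exp(-1/n), k = 0, 1, 2, ..."""
    if x < 0:
        return 0.0
    k_max = math.floor(n * x)                  # X_n / n <= x  iff  X_n <= floor(n x)
    return 1.0 - math.exp(-(k_max + 1) / n)    # equals 1 - r**(k_max + 1)

def exponential_cdf(x: float) -> float:
    """P(X <= x) for the exponential distribution with rate 1."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

for n in (1, 10, 100, 1000):
    # Largest discrepancy between the two distribution functions over the grid 0, 0.02, ..., 5.
    gap = max(abs(geometric_cdf_scaled(j / 50, n) - exponential_cdf(j / 50)) for j in range(251))
    print(n, gap)
```

The discrepancy is at most 1 − e^{−1/n} ≤ 1/n, so it tends to zero at every point x.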
The properties of limits for sums and products that you are familiar with for sequences
of real numbers hold too for convergence of random variables in either the almost sure
sense or in probability.
Proposition 27. Let (Xn , n ≥ 1) and (Yn , n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,
$$X_n \xrightarrow{\text{a.s.}} X, \quad\text{and}\quad Y_n \xrightarrow{\text{a.s.}} Y.$$
Then
$$X_n + Y_n \xrightarrow{\text{a.s.}} X + Y,$$
and
$$X_n Y_n \xrightarrow{\text{a.s.}} XY.$$
The analogous statements hold for convergence in probability.
Proposition 28. Let (Xn , n ≥ 1) and (Yn , n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,
$$X_n \xrightarrow{\text{dist}} X, \quad\text{and}\quad Y_n \xrightarrow{\text{dist}} Y.$$
In this next proposition, we don’t assume anything about the relationship between Xn
and Yn but instead consider the case that the limit of Yn is a constant.
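Proposition 29. Let (Xn , n ≥ 1) and (Yn , n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,
$$X_n \xrightarrow{\text{dist}} X, \quad\text{and}\quad Y_n \xrightarrow{\text{prob}} a,$$
where a is a constant. Then
$$X_n + Y_n \xrightarrow{\text{dist}} X + a,$$
and
$$X_n Y_n \xrightarrow{\text{dist}} aX,$$
and provided $a \neq 0$,
$$\frac{X_n}{Y_n} \xrightarrow{\text{dist}} \frac{X}{a}.$$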
Finally we consider the effect of composing with a continuous function.
Example 39. Fix p ∈ (0, 1) and let Xn be a random variable having the Binomial distribution
with parameters n and p. Let us show that, as n tends to infinity,
$$\frac{X_n - np}{\sqrt{X_n(n - X_n)/n}} \xrightarrow{\text{dist}} Z \qquad (4.26)$$
where Z is a random variable with the standard Gaussian distribution. Note that there
appears to be a problem with this statement; it is a ratio of random variables and the
denominator can be zero with positive probability. But the probability that the denominator
equals zero tends to zero as n tends to infinity. So you could define the ratio to take any
value you like in the eventuality that the denominator is zero, without affecting its convergence
in distribution. For this reason, we don’t bother about the possibility that the denominator
is zero.
First, the Central Limit Theorem[16] implies
$$\frac{X_n - np}{\sqrt{np(1-p)}} \xrightarrow{\text{dist}} Z \qquad (4.27)$$
where Z is a random variable with the standard Gaussian distribution. On the other hand
the law of large numbers tells us that
$$\frac{X_n}{n} \xrightarrow{\text{prob}} p.$$
[16] This is because we can write Xn as a sum of independent random variables $\sum_{i=1}^{n} \epsilon_i$ as in Section 1.3.
Using Proposition 30 we can deduce from this that
$$\sqrt{\frac{X_n(n - X_n)}{n^2 p(1-p)}} \xrightarrow{\text{prob}} 1. \qquad (4.28)$$
Finally we combine the two limits (4.27) and (4.28) using the third statement from
Proposition 29 to obtain (4.26).
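The limit (4.26) can be examined without simulation, since the distribution of the ratio is determined by the Binomial probabilities. The following sketch computes the probability that the ratio is at most z = 1 exactly and compares it with the standard Gaussian distribution function. The values p = 0.3 and z = 1, the convention that the ratio equals 0 when the denominator vanishes, and the function names are all assumptions made for this illustration.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X_n = k) for X_n ~ Binomial(n, p), computed on the log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def studentised_cdf(z: float, n: int, p: float) -> float:
    """P(T_n <= z), where T_n = (X_n - n p) / sqrt(X_n (n - X_n) / n)."""
    total = 0.0
    for k in range(n + 1):
        denom = math.sqrt(k * (n - k) / n)
        # Arbitrary convention (an assumption): T_n = 0 when the denominator is zero.
        t = (k - n * p) / denom if denom > 0 else 0.0
        if t <= z:
            total += binomial_pmf(k, n, p)
    return total

def standard_normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard Gaussian Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

p = 0.3
for n in (20, 200, 2000):
    print(n, studentised_cdf(1.0, n, p), standard_normal_cdf(1.0))
```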
This result is used in practical applications in Statistics. The context is the same
as Example 19; when estimating the proportion of a population infected with a disease,
we test a sample of size n and find k infected individuals. The natural estimate of the
proportion of the population that is infected is then k/n, but how accurate should we
judge this estimate to be? We can treat the number k as the observed value of a random
variable X having the Binomial distribution with parameters n and p, where p represents
the unknown proportion of the population which is infected. The convergence (4.26) is
used to construct an approximate confidence interval[17] for p, giving a range of values of
p that are “reasonable” having observed k infected individuals in the sample.
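One standard way of turning (4.26) into an interval, often called the Wald interval, is to treat the ratio as if it were exactly standard Gaussian and solve |k − np| ≤ z √(k(n − k)/n) for p. The sketch below follows that recipe; it is not necessarily the construction presented in ST219, and the function name, the default z = 1.96 (roughly 95% coverage), the clipping to [0, 1] and the data are all illustrative assumptions.

```python
import math

def wald_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for p from k infected individuals in a sample of n.

    Solving |k - n p| <= z * sqrt(k (n - k) / n) for p gives
    k/n - z * sqrt(k (n - k) / n^3) <= p <= k/n + z * sqrt(k (n - k) / n^3).
    """
    estimate = k / n
    half_width = z * math.sqrt(k * (n - k) / n**3)
    # Clipping to [0, 1] is a convenience added here, since p is a proportion.
    return max(0.0, estimate - half_width), min(1.0, estimate + half_width)

# Hypothetical data: 30 infected individuals found in a sample of 200.
print(wald_interval(30, 200))
```

When k = 0 or k = n the half-width is zero, reflecting the fact that the approximation behind (4.26) is poor in that case.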
[17] I don’t want to explain this now, but it’s something to look forward to in ST219.
4.6 The strong law of large numbers
We finish the module with one of the greatest results of probability theory.
Theorem 11 (The strong law of large numbers). Let (Xi , i ≥ 1) be a sequence of independent
and identically distributed random variables whose mean µ exists. Then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} \mu.$$
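Proof. We will prove the special case in which we make the strong extra assumption that
the moment generating function of Xi exists.
Let $M(t) = E[e^{tX_i}]$ which we assume exists for t ∈ (−δ, δ). By independence
$$E\Big[e^{t\sum_{i=1}^{n} X_i}\Big] = M(t)^{n}.$$
Suppose $\varepsilon > 0$. Then M(0) = 1, and by Proposition 25, $M'(0) = \mu$, and so for some
h > 0,
$$M(t) < 1 + (\mu + \varepsilon)t$$
for all 0 < t < h. Pick such a t. By Markov’s inequality
$$P\Big(\sum_{i=1}^{n} X_i \ge n(\mu + \varepsilon)\Big) = P\Big(e^{t\sum_{i=1}^{n} X_i} \ge e^{n(\mu+\varepsilon)t}\Big) \le M(t)^{n} e^{-n(\mu+\varepsilon)t} = e^{n\beta}$$
where $\beta = \log M(t) - (\mu + \varepsilon)t < \log(1 + (\mu + \varepsilon)t) - (\mu + \varepsilon)t \le 0$ since log(1 + x) ≤ x for
all x > −1. Since β < 0 we have
$$\sum_{n=1}^{\infty} P\Big(\sum_{i=1}^{n} X_i \ge n(\mu + \varepsilon)\Big) \le \sum_{n=1}^{\infty} e^{n\beta} < \infty,$$
so, by the first Borel-Cantelli Lemma,
$$P\Big(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu \ge \varepsilon \text{ for infinitely many } n\Big) = 0,$$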
The full proof, without making extra assumptions, is much harder.
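To make the exponential bound in the special-case proof concrete, here is a small sketch for one distribution chosen purely as an example: Xi taking the values 0 and 1 with probability p each way determined by a Bernoulli(p) law, so that M(t) = 1 − p + pe^t and µ = p. The values of p, ε and t and the function name are assumptions for the illustration; t only needs to be small enough that β is negative.

```python
import math

def beta(t: float, p: float, eps: float) -> float:
    """beta = log M(t) - (mu + eps) t for Bernoulli(p): M(t) = 1 - p + p e^t, mu = p."""
    return math.log(1 - p + p * math.exp(t)) - (p + eps) * t

p, eps, t = 0.5, 0.1, 0.1      # illustrative values; t must be small enough that beta < 0
b = beta(t, p, eps)
bound_sum = math.exp(b) / (1 - math.exp(b))   # geometric series: sum over n >= 1 of e^{n beta}
print(b, bound_sum)
for n in (100, 1_000, 10_000):
    # Upper bound e^{n beta} on P(X_1 + ... + X_n >= n (mu + eps)).
    print(n, math.exp(n * b))
```

Because β is negative the bounds decay geometrically in n, which is exactly what makes their sum finite.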
An immediate consequence of this result, and the fact that almost sure convergence implies
convergence in probability, is that we can improve on the statement of the weak law of
large numbers, Theorem 4, which we previously proved using Chebyshev’s inequality, an
argument which required the variance to be finite.
The hypotheses in the statement of the strong law are all needed for the conclusion to
hold. If the mean of the distribution of the random variables Xi does not exist then
$\frac{1}{n}\sum_{i=1}^{n} X_i$ need not converge almost surely at all. Taking the sequence (Xi ; i ≥ 1) to
be independent random variables each having the standard Cauchy distribution, we have
seen that
$$\frac{1}{n}\sum_{i=1}^{n} X_i \overset{\text{dist}}{=} X_1, \qquad (4.29)$$
and consequently $\frac{1}{n}\sum_{i=1}^{n} X_i$ cannot converge almost surely to any constant. In fact
$\frac{1}{n}\sum_{i=1}^{n} X_i$ doesn’t converge in either the sense of probability or almost surely to any
limit at all.
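A simulation makes the failure visible. The sketch below generates standard Cauchy variables by the inversion formula tan(π(U − 1/2)) with U uniform on [0, 1), and prints running averages alongside those of standard Gaussian variables for comparison. The seed, the sample size and the function names are arbitrary choices for this illustration.

```python
import math
import random

rng = random.Random(2024)   # fixed seed, chosen arbitrarily, for reproducibility

def draw_cauchy() -> float:
    """One standard Cauchy variable, by inverting the distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))

def draw_gauss() -> float:
    """One standard Gaussian variable, for comparison."""
    return rng.gauss(0.0, 1.0)

def running_means(draw, n: int) -> list[float]:
    """Running averages (1/m) * (X_1 + ... + X_m) for m = 1..n, with X_i = draw()."""
    total, means = 0.0, []
    for m in range(1, n + 1):
        total += draw()
        means.append(total / m)
    return means

n = 100_000
cauchy_means = running_means(draw_cauchy, n)
gauss_means = running_means(draw_gauss, n)
for m in (100, 1_000, 10_000, 100_000):
    print(m, cauchy_means[m - 1], gauss_means[m - 1])
```

The Gaussian averages settle down near 0, whereas the Cauchy averages keep being thrown around by occasional enormous observations, in line with (4.29).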
The next example shows that the assumption that the random variables Xi be iden-
tically distributed is also essential.
Example 40. Let (Xi , i ≥ 1) be a sequence of independent random variables satisfying
$$P(X_i = i^2 - 1) = \frac{1}{i^2} \quad\text{and}\quad P(X_i = -1) = 1 - \frac{1}{i^2}. \qquad (4.30)$$
Then E[Xi ] = 0 for each i, but we can show that
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} -1. \qquad (4.31)$$
It is important to note that the N that appears in the above expression can depend on
the sample point ω: that is to say N is a random variable. If $\omega \in E^c$, then for $n \ge N(\omega)$,
$$\frac{1}{n}\sum_{i=1}^{n} X_i(\omega) = \frac{1}{n}\sum_{i=1}^{N(\omega)} X_i(\omega) - \frac{n - N(\omega)}{n}, \qquad (4.35)$$
and so as n → ∞, we have $\frac{1}{n}\sum_{i=1}^{n} X_i(\omega) \to -1$. But we know that $P(E^c) = 1$, so this
proves (4.31).
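The behaviour in Example 40 is easy to simulate from the distribution (4.30). The sketch below draws one sample path, lists the (typically very few) indices at which Xi differs from −1, and prints the running average at several values of n. The seed, the path length and the function names are arbitrary choices for this illustration.

```python
import random

def sample_path(n: int, rng: random.Random) -> list[int]:
    """Simulate X_1..X_n with P(X_i = i^2 - 1) = 1/i^2 and P(X_i = -1) = 1 - 1/i^2."""
    return [i * i - 1 if rng.random() < 1 / (i * i) else -1 for i in range(1, n + 1)]

def running_mean_at(path: list[int], checkpoints: list[int]) -> dict[int, float]:
    """Return (1/n) * (X_1 + ... + X_n) at each requested n."""
    out, total, wanted = {}, 0, set(checkpoints)
    for i, x in enumerate(path, start=1):
        total += x
        if i in wanted:
            out[i] = total / i
    return out

rng = random.Random(7)   # arbitrary fixed seed
path = sample_path(1_000_000, rng)
jumps = [i for i, x in enumerate(path, start=1) if x != -1]
print("indices with X_i != -1:", jumps)
print(running_mean_at(path, [10, 1_000, 100_000, 1_000_000]))
```

Note that X1 = 0 with probability one, so the index 1 always appears in the list. Once the last such index has passed, every further term contributes −1 and the average drifts to −1, as in (4.35).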
Example 41. Suppose x is a real number satisfying 0 ≤ x < 1; it can be expressed in
decimal form
$$x = 0.x_1 x_2 x_3 \ldots$$
where $x_i \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$ is the ith digit in the expansion. More formally we
have
$$x = \sum_{i=1}^{\infty} x_i \cdot 10^{-i}.$$
This expansion is essentially unique[18]. Let us define for each k ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and n ≥ 1 the function $f_{n,k} : [0, 1) \to \mathbb{Z}$,
$$f_{n,k}(x) = |\{1 \le i \le n : x_i = k\}|,$$
in other words $f_{n,k}$ counts the number of times the digit k appears in the first n decimal
places of the expansion of x. Now let
$$E_k = \Big\{x \in [0, 1) : \lim_{n\to\infty} \frac{1}{n} f_{n,k}(x) \text{ exists and equals } 1/10\Big\}.$$
It’s quite tricky at first to find a choice of x ∈ [0, 1) that belongs to any of these sets. Any
number that has a decimal expansion that terminates definitely doesn’t belong to any Ek .
The cleverly chosen rational number
$$\frac{123{,}456{,}789}{9{,}999{,}999{,}999} = 0.012345678901234567890\ldots$$
belongs to Ek for every k because the decimal expansion keeps repeating the sequence
0123456789. What about irrational choices of x such as x = π − 3? Surprisingly it is not
even known whether each digit k appears infinitely often in the decimal expansion of π − 3,
and so we cannot say whether π − 3 belongs to Ek or not[19].
However there is a very simple way to find x which do belong to Ek . Suppose X is
a random number chosen uniformly at random from [0, 1), and let Xi be the ith digit in
its decimal expansion. Then[20] (Xi ; i ≥ 1) is a sequence of independent random variables
each uniformly distributed in the set {0, 1, 2, . . . , 9}. Now for any digit k of interest let
$$Y_i = \begin{cases} 1 & \text{if } X_i = k, \\ 0 & \text{otherwise.} \end{cases}$$
[18] To get uniqueness you just have to agree that you will write one tenth as 0.1 and not 0.099999999 . . .
and so on.
[19] If you want to read about more examples, look at https://en.wikipedia.org/wiki/Normal_number
[20] It’s messy to write out a formal proof of this, but not surprising: just think about, for example,
P(X1 = 1 and X2 = 9) to get the idea.
The sequence of random variables (Yi ; i ≥ 1) is a sequence of independent and identically
distributed random variables with E[Yi ] = 1/10. Applying the strong law of large numbers
to (Yi ; i ≥ 1) shows that the probability that X belongs to Ek is one.
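The two deterministic examples mentioned in Example 41 can be checked by computing digits with exact integer long division. In the sketch below the function names are hypothetical and 1/4 is used as an arbitrary example of a terminating decimal.

```python
def decimal_digits(numerator: int, denominator: int, n: int) -> list[int]:
    """First n decimal digits of numerator/denominator in [0, 1), by long division."""
    digits, remainder = [], numerator
    for _ in range(n):
        remainder *= 10
        digits.append(remainder // denominator)
        remainder %= denominator
    return digits

def digit_frequency(digits: list[int], k: int) -> float:
    """f_{n,k}(x) / n: the proportion of the first n digits that equal k."""
    return digits.count(k) / len(digits)

repeating = decimal_digits(123_456_789, 9_999_999_999, 1_000)
terminating = decimal_digits(1, 4, 1_000)          # 0.25000...
print([digit_frequency(repeating, k) for k in range(10)])
print([digit_frequency(terminating, k) for k in range(10)])
```

For the repeating decimal every digit has frequency exactly 1/10 whenever n is a multiple of 10, while for the terminating decimal the frequency of the digit 0 tends to 1 and that of every other digit to 0.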
Example 42. Imagine a simple gambling game, where you bet on repeated tosses of a
biased coin. On each toss, if you stake a certain amount x then you either lose your stake
with probability q or receive back 2x with probability p = 1 − q. Assume p > 1/2 > q so
the game is biased in your favour. You start with an initial fund of money W0 and make
repeated bets. What is a good strategy to follow for deciding how much to bet on each
toss?
Since the game is biased in your favour it might at first appear sensible to bet as much
as possible so as to maximize your expected winnings. But in fact this is a foolish idea,
since if you bet your entire fund each time, you will, sooner or later, lose everything and go
bust (you can’t bet using borrowed money).
Suppose you bet a fraction f of your current funds on each toss. Then your funds
after the nth toss Wn satisfy
$$W_n = \begin{cases} (1 + f)W_{n-1} & \text{if you win the } n\text{th toss,} \\ (1 - f)W_{n-1} & \text{if you lose the } n\text{th toss.} \end{cases} \qquad (4.36)$$
Notice that $E[W_n] = W_0\,(1 + (p - q)f)^{n}$ grows geometrically, and to maximize this growth
rate you might pick f close to 1 (although not equal to 1, bearing in mind our earlier
comment about going bankrupt).
But consider how the random variable log Wn behaves. From (4.36) we obtain
$$W_n = W_0\,(1 + f)^{X_n}(1 - f)^{n - X_n} \qquad (4.37)$$
where Xn is the number of the first n tosses that resulted in wins for you. Consequently,
since by the strong law of large numbers, Xn /n → p almost surely,
$$\frac{1}{n}\log W_n = \frac{1}{n}\log W_0 + \frac{X_n}{n}\log(1 + f) + \frac{n - X_n}{n}\log(1 - f) \xrightarrow{\text{a.s.}} p\log(1 + f) + q\log(1 - f), \qquad (4.38)$$
as n tends to infinity. This is bad news, because for f close to one this limit is negative,
and consequently, for such an f ,
$$W_n \xrightarrow{\text{a.s.}} 0. \qquad (4.39)$$
Even though your expected winnings are growing, in fact you are guaranteed to lose all
your money if you keep playing with this strategy!
We can now suggest a good strategy; pick f so as to maximize the long term growth
rate of Wn given by the limit in (4.38). Some easy calculus shows this maximum occurs
at f = p − q, and moreover the growth rate is positive there. Good news!
If p = 0.6 then the proportion to bet to maximize the growth rate would be 20%.
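The calculus can be double-checked numerically. The sketch below evaluates the limiting growth rate p log(1 + f) + q log(1 − f) from (4.38) on a grid of betting fractions for p = 0.6 and picks out the maximiser; the grid spacing and the function name are illustrative choices.

```python
import math

def growth_rate(f: float, p: float) -> float:
    """Limit of (1/n) log W_n from (4.38): p log(1 + f) + q log(1 - f), with q = 1 - p."""
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.6
grid = [i / 1000 for i in range(0, 1000)]            # candidate fractions 0, 0.001, ..., 0.999
best = max(grid, key=lambda f: growth_rate(f, p))
print(best, growth_rate(best, p))                    # maximiser on the grid and its growth rate
print(growth_rate(0.9, p))                           # an aggressive fraction, for comparison
```

The maximiser on the grid agrees with f = p − q, the growth rate there is positive, and the growth rate for the aggressive fraction 0.9 is negative.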
In a real life experiment, 61 people were given an initial $25 to play this game (knowing
the bias of the coin was 3:2 in favour of heads) for up to 300 coin tosses, or until they
had reached maximum winnings of $250. Betting 20% of their available funds at each
toss, one would expect nearly all the participants to reach the $250 maximum. In fact 17
participants went bust and only 13 reached the maximum. Average winnings were $90.
40 participants gambled on tails at some point in the experiment[21].
[21] The moral is most people are very poor gamblers. No wonder Las Vegas makes a lot of money.